Redis Sentinel High Availability Design Prompt
Design Redis Sentinel HA — quorum, automatic failover, and client discovery — for resilient primary/replica setups without Cluster.
- Target user: SREs building Redis HA with Sentinel
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE and Redis expert who designs Sentinel-based high availability.

I will provide:
- The current primary/replica layout
- Number and placement of Sentinels
- How clients connect today

Your job:
1. **Sentinel role**: Sentinels monitor the primary and its replicas, detect failure, elect a new primary, and reconfigure replicas. They do NOT proxy data.
2. **Quorum and count**: run an ODD number of Sentinels (>= 3) across separate failure domains. In `sentinel monitor <name> <ip> <port> <quorum>`, quorum is the number of Sentinels that must agree the primary is down. A majority of all Sentinels must also be reachable to authorize the failover.
3. **Failure detection**: `down-after-milliseconds` sets how long a node may be unresponsive before a Sentinel marks it subjectively down (SDOWN). Once quorum Sentinels report SDOWN, the primary is objectively down (ODOWN) and a failover starts.
4. **Failover controls**: `failover-timeout` bounds the failover and its retries; `parallel-syncs` limits how many replicas resync with the new primary at once (to avoid overwhelming it).
5. **Client discovery**: clients ask Sentinel `SENTINEL get-master-addr-by-name <name>` and subscribe to the `+switch-master` pub/sub channel to learn the new primary. Use a Sentinel-aware client library; never hardcode the primary IP.
6. **Auth**: set `requirepass`/`sentinel auth-pass` and, when using ACLs, `sentinel auth-user` consistently across nodes.
7. **Split-brain avoidance**: `min-replicas-to-write`/`min-replicas-max-lag` on the primary make it stop accepting writes if too few replicas are in sync.
8. **Validate**: `SENTINEL master <name>`, `SENTINEL replicas <name>`, `SENTINEL sentinels <name>`, and rehearse a failover in staging.

Mark DESTRUCTIVE: `SENTINEL FAILOVER <name>` in prod without a plan (forces a switch), an even Sentinel count such as 2 (no majority survives a partition or the loss of one Sentinel, so failover stalls), `FLUSHALL` on the primary, and `KEYS *`/`DEBUG` on prod.

---
Current layout: [DESCRIBE]
Sentinel count/placement: [DESCRIBE]
Client connection method: [DESCRIBE]
Why this prompt works
Sentinel HA fails in predictable ways: even Sentinel counts that can’t reach majority, aggressive timeouts that flap, and clients that hardcode the primary and never notice a failover. This prompt enforces an odd Sentinel count across failure domains, ties down-after/quorum/parallel-syncs to real behavior, and insists on Sentinel-aware client discovery — the three things that make automatic failover actually work.
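The even-count problem is mostly arithmetic: a failover needs agreement from `quorum` Sentinels that the primary is down, and authorization from a majority of all configured Sentinels. The sketch below illustrates this under simplifying assumptions (every reachable Sentinel sees the primary as down); the function names are hypothetical and not part of Redis.

```python
def majority(sentinel_count: int) -> int:
    """Votes needed to authorize a failover: more than half of all Sentinels."""
    return sentinel_count // 2 + 1


def tolerated_failures(sentinel_count: int) -> int:
    """How many Sentinels can be unreachable while a majority still exists."""
    return sentinel_count - majority(sentinel_count)


def can_fail_over(sentinel_count: int, quorum: int, reachable: int) -> bool:
    """True if the reachable Sentinels can both reach quorum and form a majority."""
    return reachable >= quorum and reachable >= majority(sentinel_count)


# Print count, majority, and tolerated Sentinel failures for a few sizes.
for n in (2, 3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
```

Four Sentinels tolerate one failure, the same as three, and two Sentinels tolerate none, which is why the count should be odd and at least 3.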
How to use it
- Describe failure domains — Sentinels must span racks/AZs to survive one failing.
- State the Sentinel count — it must be odd and at least 3.
- Explain how clients find the primary — this is where most outages hide.
- Rehearse in staging using `SENTINEL FAILOVER` before trusting prod.
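To make the client-discovery point concrete, here is a minimal sketch of the loop a Sentinel-aware client runs: ask each known Sentinel in turn and use the first primary address reported. The `query` callable is a hypothetical stand-in for sending `SENTINEL get-master-addr-by-name`, and the Sentinel addresses are invented for illustration; a real client library does this over the network.

```python
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]
# Hypothetical transport: given a Sentinel address and a master name, return
# the address reported by that Sentinel, or None if it does not know the name.
Query = Callable[[Address, str], Optional[Address]]


def discover_primary(sentinels: Iterable[Address], name: str, query: Query) -> Address:
    """Ask each Sentinel in turn and return the first primary address reported."""
    for sentinel in sentinels:
        try:
            reply = query(sentinel, name)
        except OSError:
            continue  # Sentinel unreachable: try the next one
        if reply is not None:
            return reply
    raise LookupError(f"no Sentinel returned an address for {name!r}")


# Stand-in transport for demonstration only (no network involved).
def fake_query(sentinel: Address, name: str) -> Optional[Address]:
    if sentinel[0] == "10.0.1.5":
        raise ConnectionRefusedError("sentinel down")
    return ("10.0.0.10", 6379)


SENTINELS = [("10.0.1.5", 26379), ("10.0.2.5", 26379), ("10.0.3.5", 26379)]
print(discover_primary(SENTINELS, "mymaster", fake_query))
```

A production client would typically also confirm the returned node's role before using it and repeat the lookup whenever the connection drops, which is what a Sentinel-aware library is expected to handle.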
Useful commands
# Query Sentinel state (port 26379)
redis-cli -p 26379 SENTINEL master mymaster
redis-cli -p 26379 SENTINEL replicas mymaster
redis-cli -p 26379 SENTINEL sentinels mymaster
# Client discovery: current primary address
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster
# Watch failover events
redis-cli -p 26379 PSUBSCRIBE '+switch-master' '+odown' '+sdown'
# Adjust monitoring at runtime
redis-cli -p 26379 SENTINEL set mymaster down-after-milliseconds 5000
redis-cli -p 26379 SENTINEL set mymaster parallel-syncs 1
# Rehearse a failover (staging only)
redis-cli -p 26379 SENTINEL FAILOVER mymaster
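A client watching `+switch-master` receives a payload of the form `<name> <old-ip> <old-port> <new-ip> <new-port>`. The following sketch parses that payload so a client can reconnect to the new primary; the `SwitchMaster` type, the function name, and the `10.0.0.11` address are illustrative assumptions.

```python
from typing import NamedTuple, Tuple


class SwitchMaster(NamedTuple):
    name: str
    old: Tuple[str, int]
    new: Tuple[str, int]


def parse_switch_master(payload: str) -> SwitchMaster:
    """Parse '<name> <old-ip> <old-port> <new-ip> <new-port>'."""
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"unexpected +switch-master payload: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    return SwitchMaster(name, (old_ip, int(old_port)), (new_ip, int(new_port)))


# Example payload with made-up addresses.
event = parse_switch_master("mymaster 10.0.0.10 6379 10.0.0.11 6379")
print(event.new)
```

Treat the event as a hint to reconnect rather than as the only source of truth: a client that missed the message should still recover by asking Sentinel again.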
Example config
# sentinel.conf (run 3 of these across separate AZs)
port 26379
sentinel monitor mymaster 10.0.0.10 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
sentinel auth-pass mymaster <STRONG_PASSWORD>
# On the primary and every replica (redis.conf), since any replica may be promoted: refuse writes if replicas fall behind
min-replicas-to-write 1
min-replicas-max-lag 10
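The effect of `min-replicas-to-write` and `min-replicas-max-lag` can be modeled as a small predicate. This is a simplified model rather than Redis source code: the function name is hypothetical, and each lag value stands for the seconds since that replica's last acknowledgement.

```python
from typing import Sequence


def accepts_writes(replica_lags_s: Sequence[float],
                   min_replicas_to_write: int,
                   min_replicas_max_lag: float) -> bool:
    """Model of the primary's check: enough replicas acknowledged recently."""
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True  # setting either option to 0 disables the check
    good = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


print(accepts_writes([1.0, 2.0], 1, 10))  # healthy replicas
print(accepts_writes([30.0], 1, 10))      # the only replica is lagging
print(accepts_writes([], 1, 10))          # primary isolated from all replicas
```

The last case is the one that matters for split-brain: an isolated old primary with no recently acknowledged replicas stops taking writes, which limits what is lost when it later rejoins as a replica.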
Common findings this catches
- Even Sentinel count → no majority on partition; failover stalls.
- Sentinels co-located with primary → HA dies with that domain.
- `down-after-milliseconds` too low → flapping failovers on brief blips.
- Clients hardcode primary IP → never follow the failover.
- `parallel-syncs` too high → new primary overwhelmed by resyncs.
- No `min-replicas-to-write` → primary keeps accepting writes during split-brain.
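Several of these findings are structural and can be checked mechanically. The sketch below is one possible lint pass over a described layout; the function, its parameters, and the finding messages are all hypothetical, and the timing checks (`down-after-milliseconds`, `parallel-syncs`) are left out because sensible thresholds depend on the workload.

```python
from typing import List


def lint_layout(sentinel_domains: List[str], primary_domain: str, quorum: int,
                min_replicas_to_write: int, clients_use_sentinel: bool) -> List[str]:
    """Return structural findings; sentinel_domains has one entry per Sentinel."""
    findings = []
    count = len(sentinel_domains)
    needed = count // 2 + 1  # majority of all Sentinels

    if count < 3 or count % 2 == 0:
        findings.append("Sentinel count should be odd and at least 3")

    # Sentinels that remain if the primary's failure domain is lost.
    survivors = sum(1 for d in sentinel_domains if d != primary_domain)
    if survivors < max(needed, quorum):
        findings.append("losing the primary's domain leaves too few Sentinels to fail over")

    if min_replicas_to_write < 1:
        findings.append("min-replicas-to-write is unset; primary accepts writes when isolated")

    if not clients_use_sentinel:
        findings.append("clients do not discover the primary through Sentinel")

    return findings


print(lint_layout(["az-a", "az-b", "az-c"], "az-a", 2, 1, True))
print(len(lint_layout(["az-a", "az-a"], "az-a", 2, 0, False)))
```

The first call describes three Sentinels spread over three AZs and yields no findings; the second describes two Sentinels sharing the primary's AZ and trips every check.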
When to escalate
- Sharding needs beyond a single primary — evaluate Redis Cluster.
- Cross-region failover — needs a broader DR design.
- Zero-data-loss failover requirements — async replication is insufficient alone.
Related prompts
- Redis Cluster Sharding Design Prompt: Design Redis Cluster sharding, covering 16384 hash slots, resharding, hash tags, and multi-key operation constraints across shards.
- Redis Connection Pool Tuning Prompt: Tune Redis client connection pools: pool sizing, timeouts, maxclients, TCP keepalive, and avoiding connection exhaustion and leaks.
- Redis Persistence RDB/AOF Config Prompt: Configure Redis durability (RDB snapshots vs AOF, appendfsync policy, and hybrid persistence), balancing data safety against latency.
- Redis Replication Setup Review Prompt: Review Redis primary/replica topology (replicaof, replica-read-only, sync health, and lag) for read scaling and failover readiness.