DevOps AI ToolKit

Redis Sentinel High Availability Design Prompt

Design Redis Sentinel HA — quorum, automatic failover, and client discovery — for resilient primary/replica setups without Cluster.

Target user: SREs building Redis HA with Sentinel
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior SRE and Redis expert who designs Sentinel-based high availability.

I will provide:
- The current primary/replica layout
- Number and placement of Sentinels
- How clients connect today

Your job:

1. **Sentinel role**: Sentinels monitor primary+replicas, detect failure, elect a new primary, and reconfigure replicas — they do NOT proxy data.
2. **Quorum and count**: run an ODD number of Sentinels (>= 3) across separate failure domains. `sentinel monitor <name> <ip> <port> <quorum>` — quorum is the votes needed to agree the primary is down. A majority of Sentinels must also be reachable to authorize failover.
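3. **Failure detection**: `down-after-milliseconds` sets how long a node may be unresponsive before a Sentinel marks it subjectively down (SDOWN). When at least `quorum` Sentinels report SDOWN, the primary is objectively down (ODOWN) and a failover is attempted.
4. **Failover controls**: `failover-timeout` bounds how long a failover attempt may run and how soon it may be retried; `parallel-syncs` limits how many replicas resync with the new primary at once (avoid overwhelming it).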
5. **Client discovery**: clients ask Sentinel `SENTINEL get-master-addr-by-name <name>` and subscribe to `+switch-master` pub/sub to learn the new primary. Use a Sentinel-aware client library — never hardcode the primary IP.
6. **Auth**: `sentinel auth-pass` must match the data nodes' `requirepass` (and the replicas' `masterauth`); with ACLs, also set `sentinel auth-user`. Keep these consistent across all nodes.
7. **Split-brain avoidance**: `min-replicas-to-write`/`min-replicas-max-lag` on the primary make it stop accepting writes if too few replicas are in sync.
8. **Validate**: `SENTINEL master <name>`, `SENTINEL replicas <name>`, `SENTINEL sentinels <name>`, and rehearse a failover in staging.

Mark DESTRUCTIVE: `SENTINEL FAILOVER <name>` in prod without a plan (forces a switch), a 2-Sentinel deployment (losing or partitioning either Sentinel leaves no majority, so failover cannot be authorized), `FLUSHALL` on the primary, and `KEYS *`/`DEBUG` on prod.

---

Current layout: [DESCRIBE]
Sentinel count/placement: [DESCRIBE]
Client connection method: [DESCRIBE]

Why this prompt works

Sentinel HA fails in predictable ways: even Sentinel counts that can split evenly and leave neither side a majority, aggressive timeouts that flap, and clients that hardcode the primary and never notice a failover. This prompt enforces an odd Sentinel count across failure domains, ties down-after/quorum/parallel-syncs to real behavior, and insists on Sentinel-aware client discovery: the three things that make automatic failover actually work.
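
The difference between quorum and majority is easy to get wrong, so here is a minimal sketch of the arithmetic. It assumes the usual Sentinel rule that ODOWN needs `quorum` agreeing Sentinels, while authorizing a failover needs votes from a majority of all known Sentinels (or `quorum` votes, if quorum is set higher than the majority). The function names are hypothetical and only model the counting, not Sentinel itself.

```python
def majority(total_sentinels: int) -> int:
    """Smallest number of Sentinels that is more than half of the total."""
    return total_sentinels // 2 + 1


def can_mark_odown(reporting_down: int, quorum: int) -> bool:
    """ODOWN is reached once at least `quorum` Sentinels report the primary down."""
    return reporting_down >= quorum


def can_authorize_failover(total_sentinels: int, reachable: int, quorum: int) -> bool:
    """A failover needs votes from a majority of all known Sentinels,
    and at least `quorum` votes when quorum exceeds that majority."""
    return reachable >= max(majority(total_sentinels), quorum)


# Hypothetical scenarios: (total Sentinels, reachable Sentinels, quorum)
for total, reachable, quorum in [(3, 2, 2), (2, 1, 1), (4, 2, 2)]:
    print(
        f"total={total} reachable={reachable} quorum={quorum} "
        f"odown={can_mark_odown(reachable, quorum)} "
        f"failover={can_authorize_failover(total, reachable, quorum)}"
    )
```

With 2 Sentinels and quorum 1, a single surviving Sentinel can declare ODOWN but cannot authorize the failover, which is why the count matters more than the quorum value.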

How to use it

  1. Describe failure domains — Sentinels must span racks/AZs to survive one failing.
  2. State the Sentinel count — it must be odd and at least 3.
  3. Explain how clients find the primary — this is where most outages hide.
  4. Rehearse in staging using SENTINEL FAILOVER before trusting prod.
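
Step 1 can be checked mechanically before any Sentinel is deployed. The sketch below assumes you can list each Sentinel with its failure domain (the names and AZ labels are hypothetical) and asks whether a majority survives the loss of the most heavily populated domain.

```python
from collections import Counter


def survives_single_domain_loss(sentinel_domains: dict[str, str]) -> bool:
    """Return True if a majority of Sentinels remains after any one
    failure domain (rack, AZ, ...) is lost.

    `sentinel_domains` maps a Sentinel name to its failure domain.
    """
    total = len(sentinel_domains)
    needed = total // 2 + 1  # majority of all known Sentinels
    per_domain = Counter(sentinel_domains.values())
    # Worst case: the domain hosting the most Sentinels fails.
    worst_loss = max(per_domain.values(), default=0)
    return total - worst_loss >= needed


# Hypothetical placements
spread = {"s1": "az-a", "s2": "az-b", "s3": "az-c"}
stacked = {"s1": "az-a", "s2": "az-a", "s3": "az-b"}
print(survives_single_domain_loss(spread))
print(survives_single_domain_loss(stacked))
```

Three Sentinels are only as good as their placement: two of them in one AZ behaves like a single point of failure for failover.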

Useful commands

# Query Sentinel state (port 26379)
redis-cli -p 26379 SENTINEL master mymaster
redis-cli -p 26379 SENTINEL replicas mymaster
redis-cli -p 26379 SENTINEL sentinels mymaster

# Client discovery: current primary address
redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

# Watch failover events
redis-cli -p 26379 PSUBSCRIBE '+switch-master' '+odown' '+sdown'

# Adjust monitoring at runtime (applies to this Sentinel only; repeat on each Sentinel)
redis-cli -p 26379 SENTINEL set mymaster down-after-milliseconds 5000
redis-cli -p 26379 SENTINEL set mymaster parallel-syncs 1

# Rehearse a failover (staging only)
redis-cli -p 26379 SENTINEL FAILOVER mymaster
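
If you consume the `+switch-master` events yourself rather than relying on a Sentinel-aware client library, the payload is a space-separated string of the form `<name> <old-ip> <old-port> <new-ip> <new-port>`. The parser below is a minimal sketch of that step only; the subscription itself is left out, and the class name, function name, and the 10.0.0.11 address are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchMaster:
    name: str
    old_addr: tuple[str, int]
    new_addr: tuple[str, int]


def parse_switch_master(payload: str) -> SwitchMaster:
    """Parse a +switch-master payload:
    '<name> <old-ip> <old-port> <new-ip> <new-port>'.
    """
    parts = payload.split()
    if len(parts) != 5:
        raise ValueError(f"unexpected +switch-master payload: {payload!r}")
    name, old_ip, old_port, new_ip, new_port = parts
    return SwitchMaster(name, (old_ip, int(old_port)), (new_ip, int(new_port)))


# Hypothetical event: mymaster moves from 10.0.0.10 to 10.0.0.11
event = parse_switch_master("mymaster 10.0.0.10 6379 10.0.0.11 6379")
print(event.name, event.new_addr)
```

Treat the event as a hint to reconnect, and still confirm the address with `SENTINEL get-master-addr-by-name` in case a message was missed while the subscriber was disconnected.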

Example config

# sentinel.conf (run 3 of these across separate AZs)
port 26379
sentinel monitor mymaster 10.0.0.10 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1
sentinel auth-pass mymaster <STRONG_PASSWORD>

# redis.conf on every data node (any replica can be promoted): refuse writes if replicas fall behind
min-replicas-to-write 1
min-replicas-max-lag 10
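
To reason about the two `min-replicas-*` settings together, the sketch below models the decision as I understand it: a replica counts as healthy when its last acknowledgement is at most `min-replicas-max-lag` seconds old, and writes are refused when fewer than `min-replicas-to-write` replicas are healthy. Setting either value to 0 disables the check. The function name and lag values are hypothetical; this is a model of the rule, not Redis code.

```python
def primary_accepts_writes(
    replica_lags_s: list[float],
    min_replicas_to_write: int,
    min_replicas_max_lag: int,
) -> bool:
    """Model of the min-replicas-to-write / min-replicas-max-lag check.

    `replica_lags_s` holds, per connected replica, the seconds since its
    last acknowledgement was received by the primary.
    """
    # Either setting at 0 disables the feature.
    if min_replicas_to_write == 0 or min_replicas_max_lag == 0:
        return True
    healthy = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag)
    return healthy >= min_replicas_to_write


# Values from the example config: min-replicas-to-write 1, min-replicas-max-lag 10
print(primary_accepts_writes([2.0, 30.0], 1, 10))   # one replica within 10 s
print(primary_accepts_writes([12.0, 30.0], 1, 10))  # no replica within 10 s
print(primary_accepts_writes([], 1, 10))            # isolated primary
```

An isolated old primary stops taking writes once its replicas have been silent for longer than the lag limit, which bounds how many writes can be lost when it later rejoins as a replica.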

Common findings this catches

  • Even Sentinel count → an even split leaves neither side a majority; failover stalls.
  • Sentinels co-located with primary → HA dies with that domain.
  • down-after too low → flapping failovers on brief blips.
  • Clients hardcode primary IP → never follow the failover.
  • parallel-syncs too high → new primary overwhelmed by resyncs.
  • No min-replicas-to-write → primary keeps accepting writes during split-brain.

When to escalate

  • Sharding needs beyond a single primary — evaluate Redis Cluster.
  • Cross-region failover — needs a broader DR design.
  • Zero-data-loss failover requirements — async replication is insufficient alone.
