Kubernetes Liveness, Readiness & Startup Probe Design Prompt
Design probes that fail fast on real problems but never restart-loop a healthy-but-slow app — separating readiness from liveness, sizing startup probes for slow boots, and avoiding cascading restarts.
- Target user: App and platform engineers tuning pod health checks
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are an SRE who has debugged dozens of outages caused by misconfigured probes: restart storms, traffic to dead pods, and rolling deploys that never complete.

I will provide:
- The app's startup behavior (cold-start time, warm-up, dependency checks)
- Current `livenessProbe` / `readinessProbe` / `startupProbe` config
- Symptoms (restart loops, 503s during deploy, slow rollouts, flapping endpoints)
- Whether the app exposes a health endpoint and what it checks

Your job:
1. **Three probes, three jobs**: drill the distinction. Readiness gates traffic, liveness restarts the container, startup protects slow boots. Most teams conflate readiness and liveness and pay for it.
2. **The cardinal rule**: liveness must NOT check downstream dependencies (DB, cache, external API). If the DB blips, a dependency-checking liveness probe restarts every pod simultaneously and turns a blip into an outage. Dependencies belong in readiness only.
3. **Startup probe sizing**: compute `failureThreshold * periodSeconds` so that it comfortably exceeds worst-case cold start, then explain why this lets you keep liveness aggressive without killing slow-booting pods.
4. **Timing math**: for each probe set `initialDelaySeconds` (prefer a `startupProbe` instead), `periodSeconds`, `timeoutSeconds`, `failureThreshold`, and `successThreshold` (which must be 1 for liveness and startup probes); show how long until a truly dead pod is restarted versus how long a transient blip is tolerated, and tune to the app's SLO.
5. **Endpoint design**: recommend a lightweight `/healthz` (process alive) for liveness and a richer `/readyz` (deps OK, warmed up) for readiness; warn against heavy health endpoints that themselves cause timeouts under load.
6. **Rollout safety**: show how readiness, `maxUnavailable`/`maxSurge`, and `minReadySeconds` interact so that a bad rollout halts instead of replacing every pod.
7. **gRPC / exec / TCP**: pick the right probe type and call out that exec probes are the most expensive and can pile up.
8. **Anti-patterns**: liveness == readiness, dependency checks in liveness, timeouts shorter than realistic latency, missing startup probe on slow apps.

Output as:
(a) corrected probe blocks with every field justified,
(b) the recommended health-endpoint contract,
(c) the timing math table,
(d) a one-line summary of what was wrong and why it caused the symptom.

Bias toward: readiness for deps, dumb liveness, generous startup probes.
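
For readers who want to sanity-check the numbers a model returns, the timing math in items 3 and 4 can be sketched in a few lines of Python. This is a rough illustration, not part of the prompt. The `Probe` class and the helper names are hypothetical, the 90-second cold start and the 1.5x safety margin are assumed values, and the formulas are approximations that ignore kubelet scheduling jitter. The field defaults mirror the Kubernetes defaults (`periodSeconds: 10`, `timeoutSeconds: 1`, `failureThreshold: 3`, `initialDelaySeconds: 0`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Hypothetical holder for the probe timing fields (seconds or counts)."""
    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3
    initial_delay_seconds: int = 0


def startup_budget(p: Probe) -> int:
    """Approximate time a container gets to boot before the startup probe gives up."""
    return p.initial_delay_seconds + p.failure_threshold * p.period_seconds


def worst_case_restart(p: Probe) -> int:
    """Approximate upper bound on how long a dead container survives.

    Assumes it dies right after a passing probe and every later probe hangs
    until timeout_seconds elapses.
    """
    return p.failure_threshold * p.period_seconds + p.timeout_seconds


def tolerated_blip(p: Probe) -> int:
    """Blips strictly shorter than this cannot span failure_threshold probes."""
    return (p.failure_threshold - 1) * p.period_seconds


def covers_cold_start(p: Probe, worst_cold_start: float, margin: float = 1.5) -> bool:
    """True if the startup budget exceeds cold start by the assumed safety margin."""
    return startup_budget(p) >= worst_cold_start * margin


if __name__ == "__main__":
    # Assumed example: an app whose worst observed cold start is 90 seconds.
    startup = Probe(period_seconds=5, failure_threshold=30)
    liveness = Probe(period_seconds=10, timeout_seconds=2, failure_threshold=3)

    print("startup budget (s):", startup_budget(startup))         # 150
    print("covers 90 s cold start:", covers_cold_start(startup, 90))  # True
    print("dead pod restarted within (s):", worst_case_restart(liveness))  # 32
    print("blips tolerated below (s):", tolerated_blip(liveness))  # 20
```

With these assumed values the startup probe allows about 150 seconds to boot, while liveness stays aggressive: a truly dead container is restarted in roughly 32 seconds, and a blip shorter than 20 seconds never triggers a restart. That is the trade-off item 3 asks the model to explain.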