AI for Kubernetes & Helm | Difficulty: Intermediate | Claude, ChatGPT

Kubernetes Liveness, Readiness & Startup Probe Design Prompt

Design probes that fail fast on real problems but never restart-loop a healthy-but-slow app — separating readiness from liveness, sizing startup probes for slow boots, and avoiding cascading restarts.

Target user
App and platform engineers tuning pod health checks
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are an SRE who has debugged dozens of outages caused by misconfigured probes — restart storms, traffic to dead pods, and rolling deploys that never complete.

I will provide:
- The app's startup behavior (cold-start time, warm-up, dependency checks)
- Current `livenessProbe` / `readinessProbe` / `startupProbe` config
- Symptoms (restart loops, 503s during deploy, slow rollouts, flapping endpoints)
- Whether the app exposes a health endpoint and what it checks

Your job:

1. **Three probes, three jobs**: drill the distinction. Readiness gates traffic by removing a failing pod from Service endpoints, liveness restarts the container, and startup holds off liveness and readiness checks until a slow boot completes. Most teams conflate readiness and liveness and pay for it.

2. **The cardinal rule** — liveness must NOT check downstream dependencies (DB, cache, external API). If the DB blips, a dependency-checking liveness probe restarts every pod simultaneously and turns a blip into an outage. Dependencies belong in readiness only.

3. **Startup probe sizing** — compute `failureThreshold * periodSeconds` to comfortably exceed worst-case cold start, then explain why this lets you keep liveness aggressive without killing slow-booting pods.

4. **Timing math**: for each probe set `initialDelaySeconds` (prefer a `startupProbe` instead), `periodSeconds`, `timeoutSeconds`, `failureThreshold`, and `successThreshold` (which must be 1 for liveness and startup probes); show how long until a truly dead pod is restarted versus how long a transient blip is tolerated, and tune both to the app's SLO.

5. **Endpoint design** — recommend a lightweight `/healthz` (process alive) for liveness and a richer `/readyz` (deps OK, warmed up) for readiness; warn against heavy health endpoints that themselves cause timeouts under load.

6. **Rollout safety** — show how readiness + `maxUnavailable`/`maxSurge` + `minReadySeconds` interact so a bad rollout halts instead of replacing every pod.

7. **gRPC / exec / TCP**: pick the right probe type, and call out that exec probes are the most expensive because each check spawns a process inside the container, so slow or hung checks can pile up.

8. **Anti-patterns** — liveness == readiness, dependency checks in liveness, timeouts shorter than realistic latency, missing startup probe on slow apps.
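
The sizing rules in steps 3 and 4 reduce to simple arithmetic. The following is a minimal Python sketch of that arithmetic, not a definitive model: the `Probe` class, the helper names, and the 90-second cold start are hypothetical values chosen for illustration, and only the field names mirror the Kubernetes probe spec. It treats the failure window as roughly `failureThreshold * periodSeconds`; real timing also depends on `timeoutSeconds` and on where in the period the failure begins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Timing fields of a probe (hypothetical helper, not a Kubernetes API)."""
    period_seconds: int
    failure_threshold: int
    timeout_seconds: int = 1
    initial_delay_seconds: int = 0


def failure_window(probe: Probe) -> int:
    """Approximate seconds of consecutive failures before the kubelet acts."""
    return probe.failure_threshold * probe.period_seconds


def startup_budget(probe: Probe) -> int:
    """Approximate seconds the app has to boot before the startup probe gives up."""
    return probe.initial_delay_seconds + failure_window(probe)


# Assumed worst-case cold start for an illustrative app, in seconds.
WORST_CASE_COLD_START = 90

startup = Probe(period_seconds=5, failure_threshold=30)
liveness = Probe(period_seconds=10, failure_threshold=3)

print(startup_budget(startup))   # 150: boot budget in seconds
print(failure_window(liveness))  # 30: rough time until a dead pod is restarted
# Check that the budget leaves 50% headroom over the assumed cold start.
print(startup_budget(startup) >= 1.5 * WORST_CASE_COLD_START)  # True
```

Because the startup probe absorbs the slow boot, the liveness window can stay short (about 30 seconds here) without risking a restart during a 90-second cold start.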

Output as: (a) corrected probe blocks with every field justified, (b) the recommended health-endpoint contract, (c) the timing math table, (d) a one-line summary of what was wrong and why it caused the symptom.

Bias toward: readiness for deps, dumb liveness, generous startup probes.
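
As an illustration of the "dumb liveness, readiness for deps" bias, here is a hedged Python sketch of one possible health-endpoint contract, using only the standard library. The `/healthz` and `/readyz` paths come from the prompt; the `DEPENDENCY_CHECKS` table, the check names, the response bodies, and port 8080 are assumptions made for this example, and the lambda checks stand in for real, cheap, time-bounded pings.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks; replace each with a cheap, time-bounded ping.
DEPENDENCY_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": lambda: True,
    "cache": lambda: True,
}


def healthz() -> Tuple[int, str]:
    """Liveness: the process can answer a request. No dependency calls."""
    return 200, "ok"


def _passes(check: Callable[[], bool]) -> bool:
    """Treat an exception in a check as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


def readyz(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Readiness: 200 only when every dependency check passes, else 503."""
    failed = sorted(name for name, check in checks.items() if not _passes(check))
    if failed:
        return 503, "not ready: " + ", ".join(failed)
    return 200, "ready"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            status, body = healthz()
        elif self.path == "/readyz":
            status, body = readyz(DEPENDENCY_CHECKS)
        else:
            status, body = 404, "not found"
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Run the health server; the port is an arbitrary choice for this sketch."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the server. A database outage then makes `/readyz` return 503, which takes the pod out of rotation, while `/healthz` keeps returning 200, so the liveness probe does not restart a process that is otherwise fine.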