
Kubernetes Pod Lifecycle & Graceful Shutdown Prompt

Design and debug pod lifecycle — preStop hooks, terminationGracePeriodSeconds, SIGTERM handling, connection draining, readiness probe behavior on shutdown.

Target user: Kubernetes engineers designing production workloads
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who has debugged "users see 502 during deploys" too many times. You know that graceful shutdown requires correctly handling SIGTERM, draining connections, and coordinating with the readiness probe and Service endpoints.

I will provide:
- The workload (HTTP server, gRPC, message consumer)
- Current pod spec (preStop, terminationGracePeriodSeconds, probes)
- The symptom (in-flight requests dropped, 502s during rollouts, slow shutdown)
- App's signal handling behavior

Your job:

1. **Understand the shutdown sequence**:
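   1. Pod marked for deletion (`deletionTimestamp` set); the **terminationGracePeriodSeconds** countdown starts now
   2. **Endpoints removed** from Service (kube-proxy updates iptables/IPVS), but this is ASYNC and runs in parallel with the steps below
   3. **preStop hook** runs (if defined); its runtime counts against the grace period
   4. **SIGTERM** sent to PID 1 of each container once preStop finishes
   5. App drains and exits within what is left of the grace period
   6. If not exited when the grace period expires → **SIGKILL**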
2. **The endpoint propagation race**:
   - Service endpoint removal is async; load balancers may still send traffic
   - **Apps that exit immediately on SIGTERM lose those in-flight requests**
   - Solution: preStop sleep (5-15s) gives kube-proxy time to update
3. **For HTTP servers**:
   - On SIGTERM: stop accepting NEW connections, drain existing
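   - A terminating pod is removed from endpoints regardless of its readiness probe; failing the readinessProbe on shutdown is only a slower backup path
   - If you do rely on the probe failing, preStop sleep must exceed readinessProbe failureThreshold × periodSeconds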
4. **For SIGTERM handling**:
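   - **PID 1 receives SIGTERM**, but a shell running as PID 1 does not forward it to its child processes, and PID 1 has no default signal action, so a process without a SIGTERM handler simply ignores it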
   - Use `exec` in script: `exec myapp` so myapp is PID 1
   - Or use tini / dumb-init as PID 1 to forward signals
5. **For terminationGracePeriodSeconds**:
   - Default 30s
   - Set to (drain time + buffer): for long-lived connections, may need 5+ min
   - kubectl delete with `--grace-period=0 --force` SKIPS this
6. **For preStop hook**:
   - Runs BEFORE SIGTERM
   - Can be `exec` (command) or `httpGet`
   - Common: `sleep 15` for endpoint propagation
   - Or: notify load balancer to drain
7. **For long-running workloads** (batch jobs, message consumers):
   - Save progress on shutdown
   - For at-least-once queues: ack only after work done
8. **For sidecar coordination**:
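   - Regular sidecar containers get SIGTERM at the same time as the main container, so a proxy sidecar can exit while the app is still draining
   - Native sidecars (alpha in 1.28, enabled by default from 1.29) reverse this: sidecars are stopped after the main containers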

Mark DESTRUCTIVE: `kubectl delete pod --grace-period=0 --force` (no graceful shutdown), removing preStop without verifying endpoint propagation, increasing terminationGracePeriodSeconds without bounded shutdown logic (a stuck pod then hangs for the full grace period).

---

Workload: [DESCRIBE]
Pod spec excerpt:
```yaml
[PASTE]
```
Symptom: [DESCRIBE — 502s, dropped requests, slow shutdown]
App signal handling: [DESCRIBE]

Why this prompt works

Pod shutdown is a series of races: endpoint propagation, SIGTERM handling, connection draining. Getting any one of them wrong drops user requests. This prompt walks through the sequence step by step, so each race is checked in order.

How to use it

  1. Verify SIGTERM reaches your app (PID 1, signal forwarding).
  2. Add preStop sleep for endpoint propagation.
  3. Set terminationGracePeriodSeconds to (drain + buffer).
  4. Test rolling restart under load.
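
Step 3 is simple arithmetic. The grace period starts when the pod is marked for deletion, so the preStop sleep and the app's own drain both have to fit inside it. A minimal sketch of that check follows; the function name `shutdown_slack` and the 5-second default buffer are illustrative assumptions, not Kubernetes settings.

```python
def shutdown_slack(grace_period_s: int, prestop_sleep_s: int,
                   drain_timeout_s: int, buffer_s: int = 5) -> int:
    """Return spare seconds left before SIGKILL (negative means too short)."""
    needed = prestop_sleep_s + drain_timeout_s + buffer_s
    return grace_period_s - needed


# Example values: 60s grace period, 15s preStop sleep, and an assumed
# 30s drain timeout inside the app.
print(shutdown_slack(60, 15, 30))  # prints 10
```

A negative result means the kubelet would send SIGKILL before the drain can finish, so either raise terminationGracePeriodSeconds or shorten the drain.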

Useful commands

# Test signal handling
kubectl exec <pod> -- kill -TERM 1        # send SIGTERM to PID 1 (needs a kill binary in the image)
kubectl logs -f <pod>                     # in another terminal: confirm the app logged the signal

# Force-test shutdown
time kubectl delete pod <pod>             # honors spec.terminationGracePeriodSeconds; shows how long shutdown takes

# Test with load
# Run wrk or hey while doing kubectl rollout restart
hey -c 50 -z 30s http://svc.example.com/ &
kubectl rollout restart deploy/web

# Check current settings
kubectl get pod <pod> -o yaml | yq '.spec.containers[].lifecycle, .spec.terminationGracePeriodSeconds'

# Endpoints during shutdown
kubectl get endpoints <svc> -w
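
If `hey` or `wrk` is not available, a small probe loop can stand in for the load test. The sketch below uses only the Python standard library; the function name `probe` and its parameters are assumptions made for illustration. It tallies status codes and connection errors, so it shows whether requests failed but does not measure latency or throughput.

```python
import http.client
import time
import urllib.error
import urllib.request
from collections import Counter


def probe(url: str, duration_s: float, interval_s: float = 0.1) -> Counter:
    """Request `url` repeatedly and tally status codes and error types."""
    results = Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                results[resp.status] += 1
        except urllib.error.HTTPError as err:
            results[err.code] += 1              # e.g. 502/503 from a proxy
        except (OSError, http.client.HTTPException) as err:
            results[type(err).__name__] += 1    # refused, reset, timed out
        time.sleep(interval_s)
    return results
```

Run it against the Service URL in one terminal while `kubectl rollout restart deploy/web` runs in another. Any key other than 200 in the returned tally points at requests lost during the rollout.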

Patterns

HTTP server with proper drain

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
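spec:
  selector:
    matchLabels:
      app: web                                  # required; must match the template labels
  template:
    metadata:
      labels:
        app: web
    spec: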
      terminationGracePeriodSeconds: 60         # drain + buffer
      containers:
      - name: app
        image: myapp:v1
        ports: [{ containerPort: 8080 }]
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - "sleep 15"                       # give endpoints time to update
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
          failureThreshold: 1

App on SIGTERM:

  1. Stop accepting new connections (close listener)
  2. Drain in-flight requests (with timeout)
  3. Exit cleanly
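
The three steps above can be sketched with Python's standard-library HTTP server. This is a minimal illustration, not a production server: the names `DrainingServer`, `start`, `drain`, and `main` and the port 8080 default are assumptions. The drain has no timeout of its own, so terminationGracePeriodSeconds is the only bound on it.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


class DrainingServer(ThreadingHTTPServer):
    # Non-daemon handler threads let server_close() wait for in-flight requests.
    daemon_threads = False


def start(port: int):
    """Start serving in a background thread; return (server, thread)."""
    server = DrainingServer(("0.0.0.0", port), Handler)
    thread = threading.Thread(target=server.serve_forever)
    thread.start()
    return server, thread


def drain(server, thread):
    server.shutdown()       # 1. stop the accept loop
    server.server_close()   # 2. close the listener, then join in-flight handlers
    thread.join()           # 3. serving thread is done; the process can exit


def main(port: int = 8080):
    stop = threading.Event()
    # The handler only sets a flag; the real work happens in the main thread.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    server, thread = start(port)
    stop.wait()             # block until SIGTERM arrives
    drain(server, thread)
```

Calling `main()` as the container entrypoint gives the behavior described above, provided the Python process actually receives SIGTERM (it is PID 1 or runs under a signal-forwarding init).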

Tini as PID 1 (signal forwarding)

FROM node:20-alpine
RUN apk add --no-cache tini
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["node", "server.js"]

Tini forwards SIGTERM to the node process and reaps zombies.
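
To make "signal forwarding" concrete, the sketch below shows roughly what an init process does with SIGTERM: it starts the real workload as a child and relays the signal instead of swallowing it. It illustrates the idea only and is not a replacement for tini (it does not reap orphaned zombies); `make_forwarder` and `run_and_forward` are hypothetical names.

```python
import signal
import subprocess


def make_forwarder(child: subprocess.Popen):
    """Build a signal handler that relays the received signal to `child`."""
    def forward(signum, frame):
        child.send_signal(signum)
    return forward


def run_and_forward(argv):
    """Run argv as a child, relay SIGTERM/SIGINT to it, return its exit status."""
    child = subprocess.Popen(argv)
    handler = make_forwarder(child)
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)   # must be called from the main thread
    return child.wait()
```

A shell running as PID 1 skips the relay step, which is why the child never sees SIGTERM and is killed when the grace period expires.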

Long-running batch worker (save progress)

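# Python example
import queue
import signal
import sys

shutdown = False

def handle_sigterm(signum, frame):
    # Only set a flag; the main loop finishes the current job first.
    global shutdown
    shutdown = True

signal.signal(signal.SIGTERM, handle_sigterm)

# Stand-in for a real queue client; replace with your broker's consumer.
jobs = queue.Queue()

def process(job):
    pass  # placeholder for the actual work

while not shutdown:
    try:
        # Use a timeout so an idle worker still re-checks the shutdown flag.
        job = jobs.get(timeout=1)
    except queue.Empty:
        continue
    process(job)
    # Ack every job, but only after its work is done (at-least-once delivery).
    jobs.task_done()

# Save any remaining state here, then exit cleanly.
sys.exit(0)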

With pod spec:

spec:
  terminationGracePeriodSeconds: 300        # 5 min for current job + buffer
  containers:
  - name: worker
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "echo 'preStop'; sleep 5"]

Java app with shutdown hook

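Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    server.shutdown();             // stop accepting new work (e.g. an io.grpc.Server)
    try {
        server.awaitTermination(30, TimeUnit.SECONDS);   // bounded drain
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();              // restore the interrupt flag
    }
}));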

Common findings this catches

  • 502s during rollout → no preStop sleep; endpoint propagation race.
  • App doesn’t exit on SIGTERM → shell as PID 1 without exec.
  • terminationGracePeriodSeconds too short → SIGKILL before drain completes.
  • Long drain blocks rollouts → bound your drain time.
  • Sidecar dies first, main fails → use native sidecars (1.28+) OR preStop on main with sleep.
  • Force-deleting a pod loses in-flight data → reserve `--grace-period=0 --force` for pods that are genuinely stuck.
  • App keeps accepting connections during shutdown → close listener on SIGTERM.
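
Some of these findings can be spotted mechanically before a rollout. The sketch below checks a pod spec passed as a plain dict (for example, parsed from `kubectl get pod -o json`). The function name `lint_pod_spec` and its heuristics are assumptions for illustration; it covers only three of the findings and will miss cases such as a shell script used as the image entrypoint.

```python
import re


def lint_pod_spec(spec: dict) -> list[str]:
    """Return shutdown-related findings for a pod spec given as a dict."""
    findings = []
    grace = spec.get("terminationGracePeriodSeconds", 30)  # Kubernetes default
    for c in spec.get("containers", []):
        name = c.get("name", "?")
        pre_stop = c.get("lifecycle", {}).get("preStop")
        if c.get("ports") and not pre_stop:
            findings.append(f"{name}: serves traffic but has no preStop hook")
        if pre_stop:
            text = " ".join(pre_stop.get("exec", {}).get("command", []))
            match = re.search(r"\bsleep\s+(\d+)", text)
            if match and int(match.group(1)) >= grace:
                findings.append(
                    f"{name}: preStop sleep uses up the whole grace period")
        command = c.get("command") or []
        # Heuristic: `sh -c "<cmd>"` without `exec` leaves the shell as PID 1.
        if command[:2] == ["/bin/sh", "-c"] and "exec " not in " ".join(command[2:]):
            findings.append(f"{name}: shell is PID 1 and does not exec the app")
    return findings
```

An empty list does not prove the shutdown path is correct. It only means none of these three patterns matched, so the load test during a rolling restart is still needed.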

When to escalate

  • Long-lived connection workloads (WebSocket, gRPC streaming): design a graceful close at the protocol level.
  • Stateful workloads needing coordinated shutdown: use a StatefulSet for ordered rollout and scale-down.
  • Custom controller dropping work: add logging and metrics around the shutdown phase to find where work is lost.
