Kubernetes Controller Leader Election Debug Prompt

Debug operators and controllers that flap leadership, run as split-brain, or stall after a leader loses its lease — covering lease durations, clock skew, and apiserver throttling.

Target user: platform engineers and operator developers running controllers in production
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who has debugged leader-election failures in controller-runtime and client-go controllers, and you understand Lease objects, renewal deadlines, and what happens when an apiserver round-trip is slow.

I will provide:
- The controller's leader-election config (leaseDuration, renewDeadline, retryPeriod, resourceLock type)
- Symptoms (frequent leader changes, two replicas acting at once, controller idle but pods healthy)
- Logs around `failed to renew lease` / `successfully acquired lease`, plus the output of `kubectl get lease -o yaml` for the controller's Lease object

Your job:

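1. **Confirm the lock backend** — identify whether it uses the `leases`, `endpoints`, or `configmaps` resource lock; recommend `leases` and explain why the older Endpoints and ConfigMap locks are deprecated (and removed from recent client-go releases) and noisier: every renewal rewrites an object that other components watch.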
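2. **Validate the timing triad** — check the invariant `leaseDuration > renewDeadline > retryPeriod` (client-go specifically requires `renewDeadline > retryPeriod × 1.2`, its jitter factor), and explain how a renewDeadline that leaves no room for a slow apiserver round-trip plus a retry causes constant lease loss.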
3. **Diagnose flapping** — correlate `failed to renew lease` with apiserver latency, client-go QPS throttling, network blips, or CPU starvation of the controller pod that delays renewal goroutines.
4. **Rule out split-brain** — explain that controller-runtime stops the manager on lost leadership and that the process should then exit; if two replicas reconcile simultaneously, check that `--leader-elect=true` is actually set on every replica, that all replicas use the same Lease name and namespace, and whether the lease `holderIdentity` is changing rapidly.
5. **Inspect the Lease object** — read `holderIdentity`, `acquireTime`, `renewTime`, `leaseDurationSeconds`, and `leaseTransitions` (the Lease spec's leader-transition counter) to reconstruct the timeline.
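6. **Recommend fixes** — propose durations tuned to the cluster's measured apiserver latency, adequate CPU requests for the controller, and the `leases` lock (`--leader-elect-resource-lock=leases`, or `LeaderElectionResourceLock` in controller-runtime's manager options); note when to widen leaseDuration for high-latency clusters.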
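
The timing checks in step 2 can be sketched as a small standalone script. This is a minimal illustration, not client-go's code: the names `ElectionTiming` and `check_timing` are hypothetical, the two ordering rules mirror client-go's config validation as far as I know it, and the latency rule at the end is a rule of thumb added here as an assumption.

```python
from dataclasses import dataclass
from typing import List

JITTER_FACTOR = 1.2  # value of client-go's leaderelection.JitterFactor


@dataclass(frozen=True)
class ElectionTiming:
    """Leader-election durations, all in seconds (hypothetical helper type)."""
    lease_duration: float
    renew_deadline: float
    retry_period: float


def check_timing(t: ElectionTiming, p99_latency: float = 0.0) -> List[str]:
    """Return a list of problems; an empty list means the config passes."""
    problems = []
    if min(t.lease_duration, t.renew_deadline, t.retry_period) <= 0:
        problems.append("all three durations must be positive")
    if t.lease_duration <= t.renew_deadline:
        problems.append("leaseDuration must be greater than renewDeadline")
    if t.renew_deadline <= JITTER_FACTOR * t.retry_period:
        problems.append("renewDeadline must be greater than retryPeriod * 1.2")
    # Assumption (rule of thumb, not a client-go rule): leave room for at
    # least two renew attempts at the observed p99 apiserver latency.
    if p99_latency > 0 and t.renew_deadline < 2 * (t.retry_period + p99_latency):
        problems.append("renewDeadline fits fewer than two renew attempts at p99 latency")
    return problems


if __name__ == "__main__":
    defaults = ElectionTiming(lease_duration=15, renew_deadline=10, retry_period=2)
    print(check_timing(defaults, p99_latency=1.0))
    print(check_timing(ElectionTiming(15, 2, 2)))
```

Passing a measured p99 latency makes the check cluster-specific; without it, only the ordering rules are applied.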

Output as: a root-cause statement, the corrected leader-election parameters, and a timeline reconstructed from the Lease and logs.
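
As a rough sketch of the timeline reconstruction, the script below summarizes the `spec` of a Lease as returned by `kubectl get lease -o json`. The function names and the sample values (including the holder identity) are made up for illustration. It compares the timestamps stored in the object against a supplied `now`, so its answer is sensitive to clock skew; client-go candidates instead time the lease from the moment they locally observed the record change.

```python
from datetime import datetime, timedelta, timezone

# Lease timestamps are serialized with microsecond precision and a trailing Z.
MICROTIME = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_microtime(value: str) -> datetime:
    """Parse a Lease timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(value, MICROTIME).replace(tzinfo=timezone.utc)


def summarize_lease(lease: dict, now: datetime) -> dict:
    """Summarize who holds the lease and whether it looks expired at `now`."""
    spec = lease["spec"]
    acquired = parse_microtime(spec["acquireTime"])
    renewed = parse_microtime(spec["renewTime"])
    expires = renewed + timedelta(seconds=spec["leaseDurationSeconds"])
    return {
        "holder": spec.get("holderIdentity", ""),
        "transitions": spec.get("leaseTransitions", 0),
        "held_for_s": (renewed - acquired).total_seconds(),
        "since_renew_s": (now - renewed).total_seconds(),
        "expired": now > expires,
    }


# Made-up sample data, shaped like the spec of a coordination.k8s.io/v1 Lease.
SAMPLE = {
    "spec": {
        "holderIdentity": "controller-0_example",
        "acquireTime": "2024-05-01T12:00:00.000000Z",
        "renewTime": "2024-05-01T12:04:10.000000Z",
        "leaseDurationSeconds": 15,
        "leaseTransitions": 7,
    }
}

if __name__ == "__main__":
    now = datetime(2024, 5, 1, 12, 4, 30, tzinfo=timezone.utc)
    print(summarize_lease(SAMPLE, now))
```

A high `transitions` count combined with a short `held_for_s` points toward flapping, while a large `since_renew_s` with an unchanged holder suggests a stalled or dead leader.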

Never widen leaseDuration so far that a genuinely dead leader holds the lease for minutes — that delays failover and reconciliation.
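
To make that trade-off concrete, here is a rough estimate of how long a standby waits after a leader dies without releasing the lease. It is a simplification under stated assumptions: the standby saw the final renewal promptly, and its next acquire attempt lands at most one jittered retry period after the lease expires. The function name `failover_estimate` is hypothetical.

```python
JITTER_FACTOR = 1.2  # value of client-go's leaderelection.JitterFactor


def failover_estimate(lease_duration: float, retry_period: float) -> float:
    """Approximate worst-case seconds before a standby takes over.

    Assumes the lease must fully expire, then the standby's next attempt
    arrives within one retry period stretched by the jitter factor.
    """
    return lease_duration + retry_period * (1 + JITTER_FACTOR)


if __name__ == "__main__":
    # Compare the common 15s setting with two much wider leases.
    for lease_duration in (15, 60, 300):
        estimate = failover_estimate(lease_duration, retry_period=2)
        print(f"leaseDuration={lease_duration}s -> about {estimate:.1f}s without a leader")
```

The estimate grows one-for-one with leaseDuration, which is why widening it to minutes translates directly into minutes of stalled reconciliation.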