Kubernetes Operator Reconcile Loop Debug Prompt
Debug operator reconciliation issues — finalizers stuck, status not updating, requeue storms, owner references, leader election.
- Target user: Kubernetes operator developers and SREs
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who has built and operated controllers with controller-runtime (Kubebuilder) and Operator SDK. You can diagnose stuck reconciles, finalizer issues, and status update conflicts.

I will provide:
- The operator and CR being managed
- The symptom (CR stuck, finalizer not cleared, status not updating, requeue storm)
- Operator pod logs

Your job:

1. **Reconcile loop basics**:
   - Watch CR + owned resources
   - Reconcile function: read CR state, compute desired, apply
   - Return: requeue interval, error
   - Idempotent: should be safe to call repeatedly

2. **For finalizer issues**:
   - Finalizer on CR prevents deletion until removed
   - Pattern: add finalizer when reconcile starts, remove when cleanup done
   - **Stuck finalizer** means cleanup logic isn't completing
   - Patch CR to remove finalizer (LAST RESORT) → the controller's cleanup is skipped

3. **For status update conflicts**:
   - Multiple writers to same CR cause conflicts
   - Use status subresource (separate from spec)
   - Patch status with optimistic concurrency

4. **For requeue storms**:
   - `RequeueAfter: 1*time.Second` in error path → CPU storm
   - Use backoff
   - Reconcile should converge; if it never does, suspect a stuck loop

5. **For owner references**:
   - Controllers create resources with `OwnerReferences` pointing to CR
   - Garbage collector deletes children when parent deleted
   - At most one owner reference may set `controller: true`; it marks the managing controller

6. **For leader election**:
   - Multi-replica operators use lease for leader
   - Only leader reconciles
   - Lease expiry causes brief gap during failover

7. **For watches**:
   - Operator watches CR and owned resources
   - Adding watches for unrelated resources increases load
   - Filter with predicates to reduce noise

8. **For "controller doesn't see changes"**:
   - Cache (informer) not warm
   - Wrong RBAC
   - Watching wrong resource

Mark DESTRUCTIVE: removing finalizer without cleanup (orphans backend resources), reconcile that delete-recreates instead of update (data loss), leader election misconfig causing dual-leader.

---

Operator: [DESCRIBE]
CR + state: [DESCRIBE]
Symptom: [DESCRIBE]
Operator logs:
```
[PASTE]
```
Why this prompt works
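Operators are powerful, but their failure modes are subtle. This prompt walks through them one by one: finalizers, status conflicts, requeue storms, owner references, leader election, and watches.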
How to use it
- Check the operator pod first: is it running, and does it hold the leader lease?
- Check the CR's finalizers when a delete is stuck.
- For requeue storms, check whether the reconcile logic converges.
- For watches, audit what’s monitored.
Useful commands
# Operator pod
kubectl get pods -n <operator-ns>
kubectl logs -n <operator-ns> deploy/<operator>
kubectl logs -n <operator-ns> deploy/<operator> --previous
# CR state
kubectl get <crd-plural> <name> -o yaml
kubectl describe <crd-plural> <name>
# Finalizers
kubectl get <crd-plural> <name> -o jsonpath='{.metadata.finalizers}'
# Manual finalizer removal (LAST RESORT)
kubectl patch <crd-plural> <name> -p '{"metadata":{"finalizers":null}}' --type=merge
# Leader election
kubectl get leases -n <operator-ns>
kubectl get lease <name> -n <operator-ns> -o yaml
# Events related to CR
kubectl get events --field-selector involvedObject.name=<cr-name>
# Owner references
kubectl get pod <pod> -o jsonpath='{.metadata.ownerReferences}'
# Reconcile metrics (controller-runtime exposes Prometheus)
kubectl port-forward -n <operator-ns> svc/<operator-metrics> 8080:8080
curl http://localhost:8080/metrics | grep controller_runtime
Patterns
Finalizer cleanup
// Assumed imports: "context",
// ctrl "sigs.k8s.io/controller-runtime",
// "sigs.k8s.io/controller-runtime/pkg/client",
// "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil".
// myv1 is the project's own API package.
const myFinalizer = "myapp.example.com/cleanup"

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cr := &myv1.MyResource{}
	if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
		// The CR is already gone: nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if cr.DeletionTimestamp.IsZero() {
		// Not being deleted: ensure the finalizer is present.
		if !controllerutil.ContainsFinalizer(cr, myFinalizer) {
			controllerutil.AddFinalizer(cr, myFinalizer)
			if err := r.Update(ctx, cr); err != nil {
				return ctrl.Result{}, err
			}
		}
	} else {
		// Being deleted: run cleanup, then release the finalizer.
		if controllerutil.ContainsFinalizer(cr, myFinalizer) {
			if err := r.cleanup(ctx, cr); err != nil {
				// Returning the error requeues through the workqueue's
				// rate limiter (real backoff) and surfaces the failure
				// in logs and metrics instead of hiding it.
				return ctrl.Result{}, err
			}
			controllerutil.RemoveFinalizer(cr, myFinalizer)
			if err := r.Update(ctx, cr); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	// Normal reconcile
	...
}
Status update with conditions
meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
	Type:    "Ready",
	Status:  metav1.ConditionTrue,
	Reason:  "ReconcileSucceeded",
	Message: "All resources synced",
})
cr.Status.ObservedGeneration = cr.Generation
if err := r.Status().Update(ctx, cr); err != nil {
	return ctrl.Result{}, err
}
Common findings this catches
- CR stuck Terminating → finalizer cleanup failing; check operator logs.
- Operator not reacting to CR changes → not leader; check lease.
- Reconcile CPU spike → tight requeue loop; add backoff.
- Owned resources not garbage collected → owner references missing or wrong.
- Status not updating → status subresource not enabled in CRD.
- Multiple operators reconciling same CR → coordinate; or split.
- Operator OOM → cache size; filter watches.
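Several of these findings (CPU spikes, self-triggered reconciles after a status write) come down to reacting to events that changed nothing the controller acts on. The sketch below shows the idea behind a generation-based update filter, similar in spirit to controller-runtime's `predicate.GenerationChangedPredicate`. The `objectMeta`, `updateEvent`, and `shouldReconcile` names are hypothetical, and the sketch assumes the CRD enables the status subresource so that status writes leave `metadata.generation` unchanged.

```go
package main

import "fmt"

// objectMeta holds the one field this filter looks at.
type objectMeta struct {
	Generation int64
}

// updateEvent pairs the object before and after a change.
type updateEvent struct {
	Old, New objectMeta
}

// shouldReconcile passes only updates that bumped metadata.generation,
// which the API server does for spec changes but not for status writes
// made through the status subresource.
func shouldReconcile(e updateEvent) bool {
	return e.New.Generation != e.Old.Generation
}

func main() {
	specChange := updateEvent{Old: objectMeta{Generation: 3}, New: objectMeta{Generation: 4}}
	statusOnly := updateEvent{Old: objectMeta{Generation: 4}, New: objectMeta{Generation: 4}}

	fmt.Println("spec change reconciles:", shouldReconcile(specChange))
	fmt.Println("status-only update reconciles:", shouldReconcile(statusOnly))
}
```

A filter like this also drops label and annotation changes, since those do not bump the generation, so it only fits controllers that do not act on metadata.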
When to escalate
- Major operator failure during incident → revert to previous version.
- Operator CRD migration → coordinate with users.
- Cluster-wide operator deployment → roll out in stages.
Related prompts
- Kubernetes Admission Webhook Debug Prompt: Diagnose admission webhook failures (timeouts, TLS cert errors, mutating/validating semantics, failure policy traps, cluster-wide outages from webhook misconfig).
- Kubernetes CRD Design & Versioning Prompt: Design Custom Resource Definitions (schema validation, versioning from v1alpha1 to v1, conversion webhooks, status subresource, printer columns).
- Kubernetes Events Analysis Prompt: Filter, aggregate, and decode Kubernetes events (FailedScheduling, BackOff, ProvisioningFailed) to diagnose cluster-wide issues from noisy event streams.