
Kubernetes Operator Reconcile Loop Debug Prompt

Debug operator reconciliation issues — finalizers stuck, status not updating, requeue storms, owner references, leader election.

Target user
Kubernetes operator developers and SREs
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who has built and operated controllers with controller-runtime (Kubebuilder) and Operator SDK. You can diagnose stuck reconciles, finalizer issues, and status update conflicts.

I will provide:
- The operator and CR being managed
- The symptom (CR stuck, finalizer not cleared, status not updating, requeue storm)
- Operator pod logs

Your job:

1. **Reconcile loop basics**:
   - Watch CR + owned resources
   - Reconcile function: read CR state, compute desired, apply
   - Return: requeue interval, error
   - Idempotent — should be safe to call repeatedly
2. **For finalizer issues**:
   - Finalizer on CR prevents deletion until removed
   - Pattern: add finalizer when reconcile starts, remove when cleanup done
   - **Stuck finalizer** means cleanup logic isn't completing
   - Patching the CR to remove the finalizer (LAST RESORT) lets deletion proceed but skips the controller's cleanup, so external resources may be orphaned
3. **For status update conflicts**:
   - Multiple writers to same CR cause conflicts
   - Use status subresource (separate from spec)
   - Patch status with optimistic concurrency
4. **For requeue storms**:
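   - A short fixed `RequeueAfter` (e.g. `1*time.Second`) in the error path retries at a constant rate → requeue storm across many CRs
   - Return the error instead so the work queue applies exponential backoff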
   - Reconcile should converge; if not, may be stuck loop
5. **For owner references**:
   - Controllers create resources with `OwnerReferences` pointing to CR
   - Garbage collector deletes children when parent deleted
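   - `controller: true` marks the single managing controller (at most one ref may set it); `blockOwnerDeletion: true` holds up foreground deletion of the owner until the child is gone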
6. **For leader election**:
   - Multi-replica operators use lease for leader
   - Only leader reconciles
   - Lease expiry causes brief gap during failover
7. **For watches**:
   - Operator watches CR and owned resources
   - Adding watches for unrelated resources increases load
   - Filter with predicates to reduce noise
8. **For "controller doesn't see changes"**:
   - Cache (informer) not warm
   - Wrong RBAC
   - Watching wrong resource

Mark DESTRUCTIVE: removing a finalizer without running cleanup (orphans backend resources), a reconcile that deletes and recreates resources instead of updating them (data loss), and leader election misconfiguration that lets two replicas act as leader at once.

---

Operator: [DESCRIBE]
CR + state: [DESCRIBE]
Symptom: [DESCRIBE]
Operator logs:
```
[PASTE]
```

Why this prompt works

Operators are powerful, but their failure modes are subtle. This prompt walks through each of them in turn.

How to use it

  1. Check the operator pod first: is it running, and does it hold the leader lease?
  2. Check the CR's finalizers when a delete is stuck.
  3. For requeue storms, check that the reconcile logic converges.
  4. For watches, audit which resources are being watched.

Useful commands

# Operator pod
kubectl get pods -n <operator-ns>
kubectl logs -n <operator-ns> deploy/<operator>
kubectl logs -n <operator-ns> deploy/<operator> --previous

# CR state
kubectl get <crd-plural> <name> -o yaml
kubectl describe <crd-plural> <name>

# Finalizers
kubectl get <crd-plural> <name> -o jsonpath='{.metadata.finalizers}'

# Manual finalizer removal (LAST RESORT)
kubectl patch <crd-plural> <name> -p '{"metadata":{"finalizers":null}}' --type=merge

# Leader election
kubectl get leases -n <operator-ns>
kubectl get lease <name> -n <operator-ns> -o yaml

# Events related to CR
kubectl get events --field-selector involvedObject.name=<cr-name>

# Owner references
kubectl get pod <pod> -o jsonpath='{.metadata.ownerReferences}'

# Reconcile metrics (controller-runtime exposes Prometheus)
kubectl port-forward -n <operator-ns> svc/<operator-metrics> 8080:8080
curl http://localhost:8080/metrics | grep controller_runtime

Patterns

Finalizer cleanup

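import (
    "context"

    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/util/controllerutil"

    myv1 "example.com/myapp/api/v1" // hypothetical import path for the MyResource types
)

const myFinalizer = "myapp.example.com/cleanup"

// Reconciler is assumed to embed client.Client, as Kubebuilder scaffolding does.
func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    cr := &myv1.MyResource{}
    if err := r.Get(ctx, req.NamespacedName, cr); err != nil {
        // The CR may already be gone; nothing to do in that case
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    if cr.DeletionTimestamp.IsZero() {
        // Not being deleted: ensure the finalizer is present
        if !controllerutil.ContainsFinalizer(cr, myFinalizer) {
            controllerutil.AddFinalizer(cr, myFinalizer)
            if err := r.Update(ctx, cr); err != nil {
                return ctrl.Result{}, err
            }
        }
    } else {
        // Being deleted: run cleanup, then release the finalizer
        if controllerutil.ContainsFinalizer(cr, myFinalizer) {
            if err := r.cleanup(ctx, cr); err != nil {
                // Return the error: the work queue requeues with exponential backoff,
                // and the failure shows up in logs and reconcile error metrics
                return ctrl.Result{}, err
            }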
            controllerutil.RemoveFinalizer(cr, myFinalizer)
            if err := r.Update(ctx, cr); err != nil {
                return ctrl.Result{}, err
            }
        }
        return ctrl.Result{}, nil
    }

    // Normal reconcile
    ...
}

Status update with conditions

meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
    Type:    "Ready",
    Status:  metav1.ConditionTrue,
    Reason:  "ReconcileSucceeded",
    Message: "All resources synced",
})
cr.Status.ObservedGeneration = cr.Generation
if err := r.Status().Update(ctx, cr); err != nil {
    return ctrl.Result{}, err
}

Common findings this catches

  • CR stuck Terminating → finalizer cleanup failing; check operator logs.
  • Operator not reacting to CR changes → not leader; check lease.
  • Reconcile CPU spike → tight requeue loop; add backoff.
  • Owned resources not garbage collected → owner references missing or wrong.
  • Status not updating → status subresource not enabled in CRD.
  • Multiple operators reconciling same CR → coordinate; or split.
  • Operator OOM → cache size; filter watches.
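
For the owner-reference finding above, the two checks worth doing on the output of `kubectl get ... -o jsonpath='{.metadata.ownerReferences}'` are whether any reference exists at all, and whether more than one claims `controller: true`. A minimal sketch of those checks follows; `ownerRef` is a simplified, hypothetical mirror of `metav1.OwnerReference` (the real type uses pointer fields and also carries `apiVersion` and `uid`), and `checkOwnerRefs` is an illustrative helper, not a library function.

```go
package main

import "fmt"

// ownerRef keeps only the OwnerReference fields needed for this check.
type ownerRef struct {
	Kind               string
	Name               string
	Controller         bool
	BlockOwnerDeletion bool
}

// controllerOf returns the reference flagged controller: true, if there is one.
func controllerOf(refs []ownerRef) (ownerRef, bool) {
	for _, r := range refs {
		if r.Controller {
			return r, true
		}
	}
	return ownerRef{}, false
}

// checkOwnerRefs lists the problems that commonly break garbage collection
// or controller adoption.
func checkOwnerRefs(refs []ownerRef) []string {
	var problems []string
	if len(refs) == 0 {
		problems = append(problems,
			"no ownerReferences: the object will not be garbage collected with the CR")
	}
	controllers := 0
	for _, r := range refs {
		if r.Controller {
			controllers++
		}
	}
	if controllers > 1 {
		problems = append(problems,
			fmt.Sprintf("%d references set controller: true; at most one is allowed", controllers))
	}
	return problems
}

func main() {
	refs := []ownerRef{
		{Kind: "MyResource", Name: "sample", Controller: true, BlockOwnerDeletion: true},
	}
	if owner, ok := controllerOf(refs); ok {
		fmt.Printf("managed by %s/%s\n", owner.Kind, owner.Name)
	}
	fmt.Println("problems:", checkOwnerRefs(refs))
	fmt.Println("problems for orphan:", checkOwnerRefs(nil))
}
```

Also keep in mind that an owner must live in the same namespace as its dependent (or be cluster-scoped); a reference that violates this is another reason children are not cleaned up as expected.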

When to escalate

  • Major operator failure during incident → revert to previous version.
  • Operator CRD migration → coordinate with users.
  • Cluster-wide operator deployment → roll out in stages.
