Skip to content
CloudOps
All prompts
AI for Kubernetes & Helm Difficulty: Advanced ClaudeChatGPT

Kubernetes Admission Webhook Debug Prompt

Diagnose admission webhook failures — timeout, TLS cert errors, mutating/validating semantics, failure policy traps, cluster-wide outages from webhook misconfig.

Target user
Kubernetes platform engineers managing admission webhooks
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who has debugged admission webhook failures that took down entire clusters (admission webhook is down → can't create pods → can't fix the webhook). You know the failurePolicy semantics, TLS cert rotation, and how to recover.

I will provide:
- The symptom (apiserver returns errors, all create/update blocked, specific operation rejected)
- The webhook involved (`kubectl get validatingwebhookconfigurations`, `mutatingwebhookconfigurations`)
- The webhook config (`kubectl get vwc/mwc <name> -o yaml`) — `failurePolicy`, `objectSelector`, `namespaceSelector`, `timeoutSeconds`, `caBundle`
- Webhook service health (`kubectl get svc -n <ns> <webhook-svc>`, pod logs)
- apiserver logs if accessible (`/var/log/kube-apiserver/`)

Your job:

1. **Triage the webhook**:
   - `kubectl get validatingwebhookconfigurations`
   - `kubectl get mutatingwebhookconfigurations`
   - For each: timeoutSeconds, failurePolicy, scope (clusterScope, namespaceSelector, objectSelector)
2. **For "all operations failing"**:
   - Webhook backend (the Service it points to) is down or unhealthy
   - With `failurePolicy: Fail` (default), apiserver refuses the operation
   - With `failurePolicy: Ignore`, apiserver proceeds without webhook decision
   - Emergency recovery: delete the webhook config OR scale up backend
3. **For specific operation rejected**:
   - apiserver returns the webhook's denial message
   - Check webhook pod logs for the reason
   - Use `--v=6` on kubectl to see the webhook response
4. **For TLS cert errors**:
   - `caBundle` in webhook config must validate the cert the webhook service presents
   - Cert-manager + kyverno/gatekeeper auto-rotate certs but operators must update caBundle
   - Symptom: `x509: certificate signed by unknown authority` in apiserver logs
   - Fix: update caBundle in the webhook config (most operators do this automatically via injection)
5. **For timeouts**:
   - `timeoutSeconds` (max 30s for webhook)
   - Webhook backend slow → apiserver waits, blocks operation
   - In high-traffic operations, slow webhook causes user-visible delay
6. **For self-locking** (webhook blocks itself):
   - Webhook config has objectSelector that includes the webhook's own pods
   - When webhook pod restarts → admission rejects → can't restart → cluster broken
   - Fix: exclude kube-system or the webhook's namespace from the webhook's scope
7. **For mutating vs validating**:
   - **Mutating** runs first; can modify the object
   - **Validating** runs after; can only accept/reject
   - Multiple webhooks: order isn't guaranteed within a phase (mutating webhooks may run in any order)
   - reinvocationPolicy `IfNeeded` may re-invoke mutating after others mutate
8. **For namespace/object selectors**:
   - `namespaceSelector` excludes/includes namespaces by labels
   - Important to exclude `kube-system` for cluster stability
   - `matchPolicy: Equivalent` matches different API groups for the same resource

Mark DESTRUCTIVE: deleting a webhook configuration during normal ops (workloads that depended on its mutation/validation are now uncontrolled), failurePolicy: Fail on webhooks selecting kube-system (cluster lockup), caBundle update timing (gap between old/new cert is an outage window).

---

Symptom: [DESCRIBE]
Webhook configs:
```yaml
[PASTE relevant webhook config]
```
Webhook backend status:
```
[PASTE kubectl get pods + logs for webhook backend]
```
Recent failed operations: [DESCRIBE]
apiserver logs (if accessible):
```
[PASTE relevant excerpts]
```

Why this prompt works

Webhook misconfig is the leading cause of “cluster suddenly can’t accept any changes.” The failurePolicy: Fail + broad scope + dead webhook = total outage. This prompt walks the model through finding and recovering from this class of failure.

How to use it

  1. For “everything broken”, list all webhook configs first.
  2. Identify which one has Fail policy and broad scope — that’s the suspect.
  3. For recovery, scale up backend OR delete the webhook config.
  4. For cert issues, check the issuer (cert-manager) AND the injection controller.

Useful commands

# Inventory
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Detail
kubectl get vwc <name> -o yaml
kubectl get mwc <name> -o yaml

# Check failurePolicy and scope
kubectl get vwc -o json | jq '.items[] | {name:.metadata.name, webhooks:[.webhooks[] | {name, failurePolicy, namespaceSelector:.namespaceSelector // null, scope:.rules[].scope}]}'

# Webhook backend health
kubectl get svc -A | grep webhook
kubectl get pods -A | grep webhook
kubectl logs -n <ns> <webhook-pod> --tail=200

# Verify TLS cert
kubectl get secret <webhook-cert-secret> -n <ns> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout | grep -E "Subject|Issuer|Not After"

# CA bundle in webhook config
kubectl get vwc <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -text -noout

# kubectl with verbose admission errors
kubectl apply -f bad.yaml --v=6 2>&1 | grep -i webhook

# Test webhook endpoint (from a debug pod)
kubectl run debug --rm -it --image=nicolaka/netshoot --restart=Never -- \
    curl -k -X POST https://<webhook-svc>.<ns>.svc:443/path -H "Content-Type: application/json" -d '{}'

# Emergency: delete a problematic webhook config
kubectl delete vwc <name>
kubectl delete mwc <name>

# apiserver logs (kubeadm-style)
sudo journalctl -u kubelet --since "30 min ago" | grep -i admission
sudo tail -100 /var/log/kube-apiserver/audit.log | grep -i webhook

Recovery patterns

Cluster locked: webhook backend dead + failurePolicy=Fail

# 1. Identify the problem webhook
kubectl get vwc -o json | jq '.items[] | select(.webhooks[].failurePolicy=="Fail") | .metadata.name'

# 2. Delete the config (workloads that depended on its validation now proceed)
kubectl delete vwc <name>

# 3. Fix the webhook backend
kubectl get pods -n <webhook-ns>
# (scale up, fix crash, etc.)

# 4. Re-create the webhook config
kubectl apply -f webhook-config.yaml

TLS cert expired

# Identify
kubectl get vwc <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -checkend 0 || echo "EXPIRED"

# Force cert-manager to renew
kubectl annotate certificate <cert-name> -n <ns> cert-manager.io/issue-temporary-certificate=true --overwrite

# If using a controller that injects caBundle (e.g., cert-manager CAInjector), check:
kubectl logs -n cert-manager deploy/cert-manager-cainjector --tail=100

# Manual fallback: update caBundle
NEW_CA=$(kubectl get secret <ca-secret> -n <ns> -o jsonpath='{.data.ca\.crt}')
kubectl patch vwc <name> --type='json' -p="[{'op':'replace','path':'/webhooks/0/clientConfig/caBundle','value':'$NEW_CA'}]"

Safe webhook config patterns

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: kyverno-policy-validating-webhook
webhooks:
- name: validate.kyverno.svc
  clientConfig:
    service:
      name: kyverno-svc
      namespace: kyverno
      path: /validate
    caBundle: <base64-ca>
  rules:
  - apiGroups: ["*"]
    apiVersions: ["*"]
    operations: ["CREATE", "UPDATE"]
    resources: ["*"]
    scope: "Namespaced"            # don't apply to cluster-scoped resources
  failurePolicy: Ignore             # don't block cluster on webhook downtime
  matchPolicy: Equivalent
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "kube-public", "kyverno"]   # exclude critical namespaces
  objectSelector:
    matchExpressions:
    - key: control-plane
      operator: DoesNotExist        # never match control-plane components
  timeoutSeconds: 5
  sideEffects: None
  admissionReviewVersions: ["v1"]

Common findings this catches

  • Internal error occurred: failed calling webhook → backend down or cert mismatch.
  • x509: certificate signed by unknown authority → caBundle stale; injection failed.
  • All pod creates failing in kube-system → webhook with failurePolicy: Fail selecting kube-system; cluster locked.
  • Timeouts on every API call → webhook backend slow; raise replicas or lower request rate.
  • Mutating webhook unexpectedly changing resources → reinvocationPolicy IfNeeded re-running it; review chain.
  • GitOps drift after every sync → mutating webhook modifies resources; bake into chart or use OPA exception.

When to escalate

  • Cluster locked and you can’t delete the webhook config → use --server with raw kubectl, or restart apiserver with admission webhook disabled flag.
  • Suspected webhook compromise → security incident; rotate any secrets the webhook might’ve read.
  • Cert-manager + injection failures in production → engage platform team; this is foundational.

Related prompts

Newsletter

Get weekly AI workflows for DevOps engineers

Practical prompts, automation ideas, and tool reviews for infrastructure engineers. One email per week. No spam.