Kubernetes Admission Webhook Debug Prompt
Diagnose admission webhook failures — timeout, TLS cert errors, mutating/validating semantics, failure policy traps, cluster-wide outages from webhook misconfig.
- Target user
- Kubernetes platform engineers managing admission webhooks
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who has debugged admission webhook failures that took down entire clusters (admission webhook is down → can't create pods → can't fix the webhook). You know the failurePolicy semantics, TLS cert rotation, and how to recover. I will provide: - The symptom (apiserver returns errors, all create/update blocked, specific operation rejected) - The webhook involved (`kubectl get validatingwebhookconfigurations`, `mutatingwebhookconfigurations`) - The webhook config (`kubectl get vwc/mwc <name> -o yaml`) — `failurePolicy`, `objectSelector`, `namespaceSelector`, `timeoutSeconds`, `caBundle` - Webhook service health (`kubectl get svc -n <ns> <webhook-svc>`, pod logs) - apiserver logs if accessible (`/var/log/kube-apiserver/`) Your job: 1. **Triage the webhook**: - `kubectl get validatingwebhookconfigurations` - `kubectl get mutatingwebhookconfigurations` - For each: timeoutSeconds, failurePolicy, scope (clusterScope, namespaceSelector, objectSelector) 2. **For "all operations failing"**: - Webhook backend (the Service it points to) is down or unhealthy - With `failurePolicy: Fail` (default), apiserver refuses the operation - With `failurePolicy: Ignore`, apiserver proceeds without webhook decision - Emergency recovery: delete the webhook config OR scale up backend 3. **For specific operation rejected**: - apiserver returns the webhook's denial message - Check webhook pod logs for the reason - Use `--v=6` on kubectl to see the webhook response 4. **For TLS cert errors**: - `caBundle` in webhook config must validate the cert the webhook service presents - Cert-manager + kyverno/gatekeeper auto-rotate certs but operators must update caBundle - Symptom: `x509: certificate signed by unknown authority` in apiserver logs - Fix: update caBundle in the webhook config (most operators do this automatically via injection) 5. **For timeouts**: - `timeoutSeconds` (max 30s for webhook) - Webhook backend slow → apiserver waits, blocks operation - In high-traffic operations, slow webhook causes user-visible delay 6. **For self-locking** (webhook blocks itself): - Webhook config has objectSelector that includes the webhook's own pods - When webhook pod restarts → admission rejects → can't restart → cluster broken - Fix: exclude kube-system or the webhook's namespace from the webhook's scope 7. **For mutating vs validating**: - **Mutating** runs first; can modify the object - **Validating** runs after; can only accept/reject - Multiple webhooks: order isn't guaranteed within a phase (mutating webhooks may run in any order) - reinvocationPolicy `IfNeeded` may re-invoke mutating after others mutate 8. **For namespace/object selectors**: - `namespaceSelector` excludes/includes namespaces by labels - Important to exclude `kube-system` for cluster stability - `matchPolicy: Equivalent` matches different API groups for the same resource Mark DESTRUCTIVE: deleting a webhook configuration during normal ops (workloads that depended on its mutation/validation are now uncontrolled), failurePolicy: Fail on webhooks selecting kube-system (cluster lockup), caBundle update timing (gap between old/new cert is an outage window). --- Symptom: [DESCRIBE] Webhook configs: ```yaml [PASTE relevant webhook config] ``` Webhook backend status: ``` [PASTE kubectl get pods + logs for webhook backend] ``` Recent failed operations: [DESCRIBE] apiserver logs (if accessible): ``` [PASTE relevant excerpts] ```
Why this prompt works
Webhook misconfig is the leading cause of “cluster suddenly can’t accept any changes.” The failurePolicy: Fail + broad scope + dead webhook = total outage. This prompt walks the model through finding and recovering from this class of failure.
How to use it
- For “everything broken”, list all webhook configs first.
- Identify which one has Fail policy and broad scope — that’s the suspect.
- For recovery, scale up backend OR delete the webhook config.
- For cert issues, check the issuer (cert-manager) AND the injection controller.
Useful commands
# Inventory
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
# Detail
kubectl get vwc <name> -o yaml
kubectl get mwc <name> -o yaml
# Check failurePolicy and scope
kubectl get vwc -o json | jq '.items[] | {name:.metadata.name, webhooks:[.webhooks[] | {name, failurePolicy, namespaceSelector:.namespaceSelector // null, scope:.rules[].scope}]}'
# Webhook backend health
kubectl get svc -A | grep webhook
kubectl get pods -A | grep webhook
kubectl logs -n <ns> <webhook-pod> --tail=200
# Verify TLS cert
kubectl get secret <webhook-cert-secret> -n <ns> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -text -noout | grep -E "Subject|Issuer|Not After"
# CA bundle in webhook config
kubectl get vwc <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -text -noout
# kubectl with verbose admission errors
kubectl apply -f bad.yaml --v=6 2>&1 | grep -i webhook
# Test webhook endpoint (from a debug pod)
kubectl run debug --rm -it --image=nicolaka/netshoot --restart=Never -- \
curl -k -X POST https://<webhook-svc>.<ns>.svc:443/path -H "Content-Type: application/json" -d '{}'
# Emergency: delete a problematic webhook config
kubectl delete vwc <name>
kubectl delete mwc <name>
# apiserver logs (kubeadm-style)
sudo journalctl -u kubelet --since "30 min ago" | grep -i admission
sudo tail -100 /var/log/kube-apiserver/audit.log | grep -i webhook
Recovery patterns
Cluster locked: webhook backend dead + failurePolicy=Fail
# 1. Identify the problem webhook
kubectl get vwc -o json | jq '.items[] | select(.webhooks[].failurePolicy=="Fail") | .metadata.name'
# 2. Delete the config (workloads that depended on its validation now proceed)
kubectl delete vwc <name>
# 3. Fix the webhook backend
kubectl get pods -n <webhook-ns>
# (scale up, fix crash, etc.)
# 4. Re-create the webhook config
kubectl apply -f webhook-config.yaml
TLS cert expired
# Identify
kubectl get vwc <name> -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | openssl x509 -checkend 0 || echo "EXPIRED"
# Force cert-manager to renew
kubectl annotate certificate <cert-name> -n <ns> cert-manager.io/issue-temporary-certificate=true --overwrite
# If using a controller that injects caBundle (e.g., cert-manager CAInjector), check:
kubectl logs -n cert-manager deploy/cert-manager-cainjector --tail=100
# Manual fallback: update caBundle
NEW_CA=$(kubectl get secret <ca-secret> -n <ns> -o jsonpath='{.data.ca\.crt}')
kubectl patch vwc <name> --type='json' -p="[{'op':'replace','path':'/webhooks/0/clientConfig/caBundle','value':'$NEW_CA'}]"
Safe webhook config patterns
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: kyverno-policy-validating-webhook
webhooks:
- name: validate.kyverno.svc
clientConfig:
service:
name: kyverno-svc
namespace: kyverno
path: /validate
caBundle: <base64-ca>
rules:
- apiGroups: ["*"]
apiVersions: ["*"]
operations: ["CREATE", "UPDATE"]
resources: ["*"]
scope: "Namespaced" # don't apply to cluster-scoped resources
failurePolicy: Ignore # don't block cluster on webhook downtime
matchPolicy: Equivalent
namespaceSelector:
matchExpressions:
- key: kubernetes.io/metadata.name
operator: NotIn
values: ["kube-system", "kube-public", "kyverno"] # exclude critical namespaces
objectSelector:
matchExpressions:
- key: control-plane
operator: DoesNotExist # never match control-plane components
timeoutSeconds: 5
sideEffects: None
admissionReviewVersions: ["v1"]
Common findings this catches
Internal error occurred: failed calling webhook→ backend down or cert mismatch.x509: certificate signed by unknown authority→ caBundle stale; injection failed.- All pod creates failing in kube-system → webhook with
failurePolicy: Failselecting kube-system; cluster locked. - Timeouts on every API call → webhook backend slow; raise replicas or lower request rate.
- Mutating webhook unexpectedly changing resources → reinvocationPolicy IfNeeded re-running it; review chain.
- GitOps drift after every sync → mutating webhook modifies resources; bake into chart or use OPA exception.
When to escalate
- Cluster locked and you can’t delete the webhook config → use
--serverwith raw kubectl, or restart apiserver with admission webhook disabled flag. - Suspected webhook compromise → security incident; rotate any secrets the webhook might’ve read.
- Cert-manager + injection failures in production → engage platform team; this is foundational.
Related prompts
-
Kubernetes RBAC Audit Prompt
Audit Kubernetes Role, ClusterRole, RoleBinding, and ClusterRoleBinding for excessive permissions, stale bindings, and dangerous wildcards.
-
Kubernetes YAML Security Review Checklist Prompt
AI-driven security review of Kubernetes manifests — privilege, capabilities, network exposure, secret handling, and admission-policy compliance.
-
Kyverno / Gatekeeper Policy Authoring Prompt
Author and debug policies for Kyverno or OPA Gatekeeper — validate, mutate, generate; audit vs enforce modes; exclusions; constraint templates.