cert-manager Issuer & Certificate Troubleshooting Prompt
Diagnose stuck cert-manager Certificates — pending challenges, failing ACME orders, DNS-01 propagation, and renewal loops — and produce a working Issuer config.
- Target user: Platform engineers debugging TLS certificate automation on Kubernetes
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has shipped automated TLS with cert-manager across hundreds of domains and untangled every flavor of stuck challenge.

I will provide:
- The `Certificate`, `Issuer`/`ClusterIssuer`, and `Ingress` (or Gateway) manifests
- Output of `kubectl describe` for the `certificate`, `certificaterequest`, `order`, and `challenge` resources
- cert-manager controller logs around the failure window
- Whether I am using HTTP-01 or DNS-01, and which DNS provider
- Symptoms (stuck Pending, "too many certificates", renewal not firing, wrong SAN)

Walk me through this in order:
1. **Trace the resource chain**: explain the `Certificate → CertificateRequest → Order → Challenge` cascade and tell me exactly which object to `describe` first based on my symptom. Most people stop at `Certificate`; teach me to read the leaf object.
2. **HTTP-01 failures**: verify that the solver Ingress/pod is reachable from the internet, that the `.well-known/acme-challenge` path routes correctly, and that there is no redirect or auth in front of it. Give the curl command to test the token endpoint.
3. **DNS-01 failures**: check that the TXT record actually propagated (`dig`), that the provider credentials/RBAC are correct, and how the propagation timeout interacts with ACME polling. Cover split-horizon DNS gotchas.
4. **Rate limits**: detect Let's Encrypt rate-limit errors, explain the weekly caps, and prescribe using the staging issuer plus existing-secret reuse to recover without burning quota.
5. **Renewal loops**: why a cert renews early or repeatedly; clock skew; `renewBefore` math; duplicate CertificateRequests.
6. **Chain / trust issues**: wrong intermediate chain, untrusted CA, `usages` mismatch (server vs client auth).
7. **The fix**: give me corrected manifests (Issuer + Certificate) with the right solver block, and the precise commands to force a clean re-issue without orphaning the old secret.
Output as: (a) root cause in one sentence, (b) the diagnostic command sequence I should have run, (c) corrected YAML, (d) a verification checklist proving the cert issued and Ingress serves it, (e) one preventive guardrail (alert or policy). Be explicit about which steps touch public-facing endpoints versus DNS.
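As a reference point for the `renewBefore` math in step 5, here is a minimal Python sketch of when a renewal would be expected to fire. It assumes cert-manager schedules renewal at `notAfter - renewBefore`, and that an unset `renewBefore` defaults to one third of the certificate's lifetime; check both against the documentation for your cert-manager version. The function name `renewal_time` and the sample dates are hypothetical, and raising an error for an oversized `renewBefore` is a choice made for this sketch, not a description of the controller's behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def renewal_time(not_before: datetime, not_after: datetime,
                 renew_before: Optional[timedelta] = None) -> datetime:
    """Estimate when renewal is attempted (hypothetical helper, not a cert-manager API)."""
    lifetime = not_after - not_before
    if renew_before is None:
        # Assumption: with renewBefore unset, renewal starts once two thirds
        # of the lifetime has elapsed, i.e. one third before expiry.
        renew_before = lifetime / 3
    if renew_before >= lifetime:
        # Sketch-only guard: a window this large would mean "renew immediately",
        # which is one way to end up in a renewal loop.
        raise ValueError("renewBefore must be shorter than the certificate lifetime")
    return not_after - renew_before


# Sample 90-day certificate (dates are illustrative).
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

print("default window: ", renewal_time(issued, expires))
print("renewBefore=15d:", renewal_time(issued, expires, timedelta(days=15)))
```

Comparing the computed time with the controller's clock is a quick way to tell a misconfigured `renewBefore` apart from clock skew: if the estimate is still in the future but renewals keep firing, the schedule itself is unlikely to be the cause.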