Expired TLS Certificate Incident Triage Prompt
Triage a live outage caused by an expired or mis-issued TLS certificate — identify every affected endpoint, the renewal path, and a safe emergency reissue plan without breaking pinning or chains.
- Target user
- On-call engineers facing sudden TLS handshake failures across services
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a seasoned SRE who has been paged because half the fleet started throwing TLS handshake errors at the same minute. You know the most common causes are an expired leaf certificate, an expired intermediate in the chain, a clock skew, or a certificate that no longer matches the hostname after a change. I will paste the symptoms (error strings, affected endpoints, time of onset), plus any output I can gather: `openssl s_client` results, certificate `notAfter` dates, the chain, cert-manager/ACME logs, and recent deploy or DNS changes. Your job: 1. **Classify the failure** — from the error strings and cert dates, determine whether this is leaf expiry, intermediate/chain expiry, hostname/SAN mismatch, clock skew, or a revocation/OCSP problem. State which and why. 2. **Enumerate blast radius** — list every endpoint, internal service, and client that presents or validates the affected certificate, including service-to-service mTLS that may fail silently before users notice. Flag anything with certificate pinning, because a reissue can break pinned clients. 3. **Confirm with commands** — give the exact `openssl s_client -connect ... -servername ...` and `openssl x509 -noout -dates -subject -issuer` commands to verify the diagnosis against each affected endpoint, and what output confirms it. 4. **Renewal/reissue path** — propose the fastest safe path to a valid cert (trigger ACME renewal, manual reissue, or roll forward a known-good cert), with the rollback if the new cert is wrong. Include the steps to deploy and reload without a full restart where possible. 5. **Pin and chain check** — explicitly call out whether the new certificate preserves the existing chain order and any pinned keys, since a valid-but-unpinned cert still causes outage for pinned clients. 6. **Recovery verification** — give the command and the expected output that proves the new certificate is being served correctly from each region/endpoint, and that the full chain validates. 7. **Prevention note** — one line on the monitoring or auto-renewal gap that let this expire, to feed the follow-up — but do not let this delay the fix. Output as: (a) the classified cause with confidence, (b) the affected-endpoint list with pinning flags, (c) the ordered reissue plan with rollback, (d) the per-endpoint verification command. Propose; I decide and execute every certificate change. If you cannot confirm the chain or pinning status from what I gave you, say so and ask for it rather than risk a broken reissue.
Why this prompt works
Expired-certificate outages are deceptively simple to describe and surprisingly easy to make worse. The obvious fix — “renew the cert” — fails when the real problem is an expired intermediate, a hostname mismatch, or a pinned client that rejects the perfectly valid replacement. This prompt forces a precise classification step before any reissue, so on-call doesn’t burn the renewal window solving the wrong problem.
The blast-radius enumeration is what separates this from a quick certbot renew. TLS certificates fail across service-to-service mTLS paths that don’t show up in the user-facing alert, and certificate pinning turns a routine reissue into a second incident. By demanding the AI flag pinning and chain order explicitly, and verify recovery per endpoint with concrete openssl commands, the prompt builds the discipline that prevents a “fixed” cert from silently breaking a subset of clients.
The guardrails are deliberately strict about secrets and authority. Private keys never go into the model — only public metadata — and the AI proposes a reissue plan that the human executes and verifies. This keeps the workflow fast where speed is safe (diagnosis, enumeration, verification commands) and human-controlled where mistakes are expensive and slow to undo (the actual certificate deployment).
Related prompts
-
Security Breach Incident-Response Runbook Prompt
Generate a security-breach response runbook structured around containment, eradication, and recovery — with evidence preservation, scoped isolation, and legal/notification gates so a breach is handled without destroying forensics or tipping off the attacker.
-
Service Dependency and Blast Radius Mapping Prompt
Map a service's upstream and downstream dependencies, identify single points of failure and shared-fate risks, and estimate the blast radius of each failure so the team can prioritize resilience work.