Surviving TLS Certificate Expiry Outages Without Making Them Worse
How to triage and fix a live TLS certificate expiry outage — classify the failure, map the blast radius including mTLS and pinning, and reissue safely with a verified chain.
- #incident-response
- #ai
- #tls
- #certificates
- #security
At 2:14 AM half the fleet starts throwing certificate has expired and the synthetic checks go red in unison. Certificate expiry outages have a cruel signature: they hit everything at once, at a wall-clock time nobody chose, and the “obvious” fix — renew the cert — is wrong often enough to extend the outage by hours. The replacement can be perfectly valid and still fail, because the real problem was an expired intermediate, a hostname mismatch, or a pinned client that rejects the new key.
This guide is about fixing TLS expiry incidents fast without the second outage that a careless reissue causes.
Classify the failure before you reissue
The error string and the certificate dates tell you which of five problems you actually have:
# what is the endpoint actually serving, and when does it expire?
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null \
| openssl x509 -noout -dates -subject -issuer
- Leaf expiry: the served certificate’s notAfter is in the past. The common case; renew the leaf.
- Intermediate/chain expiry: the leaf is valid but an intermediate CA in the chain expired. Renewing the leaf alone does nothing; you must fix the chain.
- Hostname/SAN mismatch: the cert is valid but no longer covers the hostname after a routing or domain change.
- Clock skew: the cert is fine; a host’s clock drifted and it thinks the cert is expired or not-yet-valid.
- Revocation/OCSP: the cert was revoked or OCSP stapling is failing.
Spend the ninety seconds to classify. A team that reissues the leaf on an expired-intermediate incident loses an hour confirming the new cert “didn’t work.”
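The one-liner above prints only the leaf’s dates, so it cannot tell leaf expiry from intermediate expiry on its own. A minimal sketch of a per-certificate check follows; the function name chain_expiry is hypothetical, and the result depends on the local clock, so run it from a host whose time you trust.

```shell
# Report expired/valid, notAfter, and subject for every certificate
# in PEM text on stdin (a bundle file or `s_client -showcerts` output).
chain_expiry() {
  local tmp f status
  tmp=$(mktemp -d) || return 1
  # Split the input into one file per certificate. s_client repeats the
  # leaf under "Server certificate", so stop there to avoid a duplicate.
  awk -v dir="$tmp" '
    /^Server certificate/         { exit }
    /-----BEGIN CERTIFICATE-----/ { n++; out = sprintf("%s/cert%02d.pem", dir, n) }
    out != ""                     { print > out }
    /-----END CERTIFICATE-----/   { close(out); out = "" }
  '
  for f in "$tmp"/cert*.pem; do
    [ -e "$f" ] || continue
    # -checkend 0 exits non-zero when notAfter is already in the past.
    if openssl x509 -in "$f" -noout -checkend 0 >/dev/null; then
      status=valid
    else
      status=EXPIRED
    fi
    printf '%s  %s  %s\n' "$status" \
      "$(openssl x509 -in "$f" -noout -enddate)" \
      "$(openssl x509 -in "$f" -noout -subject)"
  done
  rm -rf "$tmp"
}
```

Piping openssl s_client -connect api.example.com:443 -servername api.example.com -showcerts </dev/null 2>/dev/null into chain_expiry lists the leaf first, then each intermediate the server sends. An EXPIRED line below the first one points at the chain rather than the leaf.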
Map the blast radius — including the parts users can’t see
User-facing endpoints are the obvious victims, but TLS certificates also guard service-to-service mTLS paths that fail silently before any customer notices. Enumerate every place the affected certificate is presented or validated:
- public edge endpoints and load balancers
- internal service mesh / mTLS between services
- clients with certificate pinning, which will reject even a valid new cert if the key changed
That last one is the trap. A reissue that changes the key breaks every pinned client, turning a fixable outage into a worse one. Flag pinning before you reissue, not after the pinned mobile clients start failing.
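To flag a pin break before it ships, compare the public-key hash of the certificate being served with that of the replacement. The sketch below assumes clients pin the base64 SHA-256 hash of the SubjectPublicKeyInfo, which is a common format but not something the incident itself tells you; check what your clients actually pin. The function name spki_pin is hypothetical.

```shell
# Print the base64 SHA-256 hash of the public key (SPKI) of the
# PEM certificate on stdin. Assumes SPKI-hash pinning.
spki_pin() {
  openssl x509 -noout -pubkey \
    | openssl pkey -pubin -outform DER \
    | openssl dgst -sha256 -binary \
    | openssl base64
}
```

Run spki_pin on the old certificate and on the candidate replacement. Identical output means the key was reused and SPKI pins keep working; different output means pinned clients will reject the new certificate until their pins are rotated.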
Reissue safely, then verify the chain
The fastest safe path is usually triggering your ACME/cert-manager renewal and reloading without a full restart. Whatever the path, verify the full chain before you trust it:
# verify the served chain end to end; expect "Verify return code: 0 (ok)"
# (add -CAfile <root.pem> if the chain ends at a private CA)
openssl s_client -connect api.example.com:443 -servername api.example.com \
-showcerts </dev/null 2>/dev/null | grep 'Verify return code'
A valid leaf with a broken chain order is a classic “fixed it but it’s still down” outcome. Confirm the chain validates and, where pinning is in play, that the key matches what clients pinned — or that you’ve coordinated a pin rotation.
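Chain order can also be checked on the bundle file before it is deployed. The sketch below assumes a PEM bundle with the leaf first and each issuer after it, which is the conventional layout; the function name chain_in_order is hypothetical. It only compares issuer and subject names between neighbours, so it complements signature and expiry verification rather than replacing it.

```shell
# Succeed only if every certificate in the PEM bundle "$1" is directly
# followed by its issuer (leaf first, root or last intermediate at the end).
chain_in_order() {
  openssl crl2pkcs7 -nocrl -certfile "$1" \
    | openssl pkcs7 -print_certs -noout \
    | awk '
        /^subject=/ {
          subj = substr($0, 9)          # text after "subject="
          if (n > 0 && subj != prev_issuer) {
            bad = 1
            printf "cert %d is not the issuer of cert %d\n", n + 1, n
          }
          n++
        }
        /^issuer=/ { prev_issuer = substr($0, 8) }   # text after "issuer="
        END { if (bad || n == 0) exit 1 }
      '
}
```

For example, chain_in_order fullchain.pem && echo "order ok", where fullchain.pem stands in for whatever bundle your renewal tooling produced.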
A note on hygiene during the incident: never paste private keys into a chat assistant or a shared doc. Share only public metadata — dates, subjects, issuers, the chain. The expired TLS certificate incident triage prompt is built around exactly that boundary.
Let AI run the enumeration, you run the deploy
The blast-radius enumeration and the verification commands are mechanical and easy to skip when adrenaline is high — a good fit for an assistant. Keep the certificate deployment itself in human hands.
Prompt: “We’re getting certificate has expired on api.example.com. Here’s the openssl x509 -dates output (leaf expires next month, but the intermediate expired today) and our service list. Classify the failure, list every affected path including mTLS and pinning, and give me a reissue plan with chain verification and rollback.”
Response (abridged): “This is intermediate expiry, not leaf, so renewing the leaf alone won’t help. Affected: public edge plus the service mesh mTLS between checkout and payments (will fail silently). mobile-api uses key pinning; a key-changing reissue breaks it, so plan a coordinated pin rotation or reuse the pinned key. Fix the chain by deploying the current intermediate bundle, reload, then openssl verify per endpoint before declaring recovery.”
That’s a plan you execute and verify, not a change the model applies.
Confirm recovery per endpoint, then close the gap
Verify the new certificate is served correctly from each region and endpoint — not just the one you tested — because edge nodes and mesh sidecars pick up new certs at different times. Once recovery is confirmed, capture the one-line cause: an expiry that reached production means a monitoring or auto-renewal gap. Feed that to the follow-up, but don’t let it delay the fix.
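A per-endpoint sweep might look like the sketch below. The function names verify_ok and check_endpoints are hypothetical, and the endpoint list is something you supply: addresses of individual edge nodes or regional load balancers, each tested with the same SNI hostname. It relies on the default trust store, so a private CA would need -CAfile added to the s_client call.

```shell
# Succeed only if s_client output on stdin reports a clean verification.
verify_ok() {
  grep -q 'Verify return code: 0 (ok)'
}

# check_endpoints HOSTNAME ADDR:PORT [ADDR:PORT ...]
# Connects to each address with HOSTNAME as SNI and prints OK or FAIL.
check_endpoints() {
  local host=$1 rc=0 ep
  shift
  for ep in "$@"; do
    if openssl s_client -connect "$ep" -servername "$host" \
         </dev/null 2>/dev/null | verify_ok; then
      echo "OK    $ep"
    else
      echo "FAIL  $ep"
      rc=1
    fi
  done
  return $rc
}
```

For example, check_endpoints api.example.com 203.0.113.10:443 203.0.113.11:443, where the two addresses are placeholders for your own edge nodes. A non-zero exit status means at least one endpoint is still serving a chain that does not verify.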
Where this fits
Certificate outages are a recurring theme across incident response, and they overlap with security work whenever the cause is a revocation or mis-issuance — pair this with your security breach runbook. When the page fires, run the triage through your AI assistant from the incident response dashboard, keeping it to enumeration and verification while you own the deploy.
The discipline that prevents the second outage: classify the failure, map the silent mTLS and pinning paths, verify the full chain before you trust the new cert, and never let “renew it” stand in for “understand it.”