Surviving TLS Certificate Expiry Outages Without Making Them Worse

At 2:14 AM half the fleet starts throwing “certificate has expired” and the synthetic checks go red in unison. Certificate expiry outages have a cruel signature: they hit everything at once, at a wall-clock time nobody chose, and the “obvious” fix, renewing the cert, is wrong often enough to extend the outage by hours. The replacement can be perfectly valid and still fail, because the real problem was an expired intermediate, a hostname mismatch, or a pinned client that rejects the new key.

This guide is about fixing TLS expiry incidents fast without the second outage that a careless reissue causes.

Classify the failure before you reissue

The error string and the certificate dates tell you which of five problems you actually have:

# what is the endpoint actually serving, and when does it expire?
openssl s_client -connect api.example.com:443 -servername api.example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -dates -subject -issuer

Leaf expiry — the served certificate’s notAfter is in the past. The common case; renew the leaf.
Intermediate/chain expiry — the leaf is valid but an intermediate CA in the chain expired. Renewing the leaf alone does nothing; you must fix the chain.
Hostname/SAN mismatch — the cert is valid but no longer covers the hostname after a routing or domain change.
Clock skew — the cert is fine; a host’s clock drifted and it thinks the cert is expired or not-yet-valid.
Revocation/OCSP — the cert was revoked or OCSP stapling is failing.
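
The classification above can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: classify_tls_failure is a hypothetical name, and the message fragments it matches are assumptions based on typical OpenSSL verify output, so confirm them against the OpenSSL version you run.

```shell
# Hypothetical helper: map a verify depth and error message to one of the
# five failure classes. The matched message fragments are assumptions about
# typical OpenSSL wording and may differ between versions.
classify_tls_failure() {
  depth=$1
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  case $msg in
    *"not yet valid"*)     echo "clock-skew" ;;
    *"hostname mismatch"*) echo "hostname-san-mismatch" ;;
    *"revoked"*)           echo "revocation-ocsp" ;;
    *"has expired"*)
      # Depth 0 is the leaf; any deeper certificate is part of the chain.
      if [ "$depth" -eq 0 ]; then
        echo "leaf-expiry"
      else
        echo "intermediate-chain-expiry"
      fi
      ;;
    *)                     echo "unclassified" ;;
  esac
}

# Example: classify_tls_failure 1 "certificate has expired"
```

The depth and message would typically come from the depth=N and verify error lines that openssl s_client prints. An “expired” result at depth 0 for a certificate whose notAfter is still in the future points back at clock skew on the checking host, so compare against date -u before reissuing anything.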

Spend the ninety seconds to classify. A team that reissues the leaf on an expired-intermediate incident loses an hour confirming the new cert “didn’t work.”

Map the blast radius — including the parts users can’t see

User-facing endpoints are the obvious victims, but TLS certificates also guard service-to-service mTLS paths that fail silently before any customer notices. Enumerate every place the affected certificate is presented or validated:

public edge endpoints and load balancers
internal service mesh / mTLS between services
clients with certificate pinning, which will reject even a valid new cert if the key changed

That last one is the trap. A reissue that changes the key breaks every pinned client, turning a fixable outage into a worse one. Flag pinning before you reissue, not after the pinned mobile clients start failing.
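
One way to flag the pinning risk before a reissue is to compare the public-key hash of the candidate certificate with the values clients have pinned. The sketch below assumes clients pin the base64 SHA-256 hash of the public key (SPKI); clients that pin the whole certificate will reject any reissue regardless. The function names spki_pin and pin_is_known, and the file and variable names in the example, are hypothetical.

```shell
# Print the base64 SHA-256 hash of the public key of a PEM certificate
# read from stdin. Assumes SPKI-style pins.
spki_pin() {
  openssl x509 -noout -pubkey \
    | openssl pkey -pubin -outform DER \
    | openssl dgst -sha256 -binary \
    | openssl enc -base64
}

# Succeed if the candidate pin ($1) appears among the pinned values ($2...).
pin_is_known() {
  candidate=$1
  shift
  for pin in "$@"; do
    [ "$pin" = "$candidate" ] && return 0
  done
  return 1
}

# Example (new-cert.pem, PIN_A and PIN_B are placeholders):
#   new_pin=$(spki_pin < new-cert.pem)
#   pin_is_known "$new_pin" "$PIN_A" "$PIN_B" \
#     || echo "key changed: coordinate a pin rotation before deploying"
```

Only the certificate, which is public, goes through this check; the private key never needs to leave the host that holds it.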

Reissue safely, then verify the chain

The fastest safe path is usually triggering your ACME/cert-manager renewal and reloading without a full restart. Whatever the path, verify the full chain before you trust it:

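# verify the served chain validates end to end; s_client checks the leaf
# against the intermediates the server actually sent and reports the result
openssl s_client -connect api.example.com:443 -servername api.example.com \
  -verify_return_error </dev/null 2>&1 | grep -iE 'verify (error|return code)'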

A valid leaf with a broken chain order is a classic “fixed it but it’s still down” outcome. Confirm the chain validates and, where pinning is in play, that the key matches what clients pinned — or that you’ve coordinated a pin rotation.
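
Chain order can also be checked mechanically. The sketch below reads openssl s_client -showcerts output and confirms that each certificate's issuer matches the subject of the next certificate served. It assumes the “Certificate chain” section prints “N s:subject” and “i:issuer” lines, which is the usual layout but worth confirming on your OpenSSL version; check_chain_order is a hypothetical name.

```shell
# Read `openssl s_client -showcerts` output on stdin. Exit 0 if every
# certificate's issuer equals the subject of the next certificate,
# 1 if the order is broken, 2 if no certificates were found.
check_chain_order() {
  awk '
    /^ *[0-9]+ s:/ {
      sub(/^ *[0-9]+ s:/, "")
      # From the second certificate on, the subject must match the
      # issuer recorded for the previous certificate.
      if (n > 0 && $0 != issuer) {
        printf "chain break at cert %d: expected subject %s\n", n, issuer
        bad = 1
      }
      n++
      next
    }
    /^ +i:/ { sub(/^ +i:/, ""); issuer = $0 }
    END {
      if (n == 0) { print "no certificates found"; exit 2 }
      exit (bad ? 1 : 0)
    }
  '
}

# Example:
#   openssl s_client -connect api.example.com:443 -servername api.example.com \
#     -showcerts </dev/null 2>/dev/null | check_chain_order
```

This only checks ordering by name. It does not validate signatures or dates, so treat it as a complement to full chain verification rather than a replacement.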

A note on hygiene during the incident: never paste private keys into a chat assistant or a shared doc. Share only public metadata — dates, subjects, issuers, the chain. The expired TLS certificate incident triage prompt is built around exactly that boundary.

Let AI run the enumeration, you run the deploy

The blast-radius enumeration and the verification commands are mechanical and easy to skip when adrenaline is high — a good fit for an assistant. Keep the certificate deployment itself in human hands.

Prompt: “We’re getting certificate has expired on api.example.com. Here’s the openssl x509 -dates output (leaf expires next month, but the intermediate expired today) and our service list. Classify the failure, list every affected path including mTLS and pinning, and give me a reissue plan with chain verification and rollback.”

Response (abridged): “This is intermediate expiry, not leaf — renewing the leaf alone won’t help. Affected: public edge plus the service mesh mTLS between checkout and payments (will fail silently). mobile-api uses key pinning; a key-changing reissue breaks it, so plan a coordinated pin rotation or reuse the pinned key. Fix the chain by deploying the current intermediate bundle, reload, then openssl verify per endpoint before declaring recovery.”

That’s a plan you execute and verify, not a change the model applies.

Confirm recovery per endpoint, then close the gap

Verify the new certificate is served correctly from each region and endpoint — not just the one you tested — because edge nodes and mesh sidecars pick up new certs at different times. Once recovery is confirmed, capture the one-line cause: an expiry that reached production means a monitoring or auto-renewal gap. Feed that to the follow-up, but don’t let it delay the fix.
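
Per-endpoint confirmation can be scripted as a simple loop. This is a sketch under assumptions: the endpoint names in the example are placeholders for your own per-region hosts, probe_endpoint and check_endpoints are hypothetical names, and the probe relies on openssl s_client exiting non-zero when -verify_return_error is set and verification fails.

```shell
# Probe one host:port. Succeeds only if the TLS handshake completes and the
# served chain verifies against the local trust store.
probe_endpoint() {
  host=${1%%:*}
  openssl s_client -connect "$1" -servername "$host" \
    -verify_return_error </dev/null >/dev/null 2>&1
}

# Probe every endpoint given as an argument and print one line per endpoint.
# Succeeds only if all endpoints pass. Set PROBE to swap in another check.
check_endpoints() {
  failed=0
  for ep in "$@"; do
    if "${PROBE:-probe_endpoint}" "$ep"; then
      echo "ok   $ep"
    else
      echo "FAIL $ep"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example (placeholder hostnames):
#   check_endpoints api.example.com:443 api-eu.example.com:443
```

A probe run from a single machine only sees the edge node that machine is routed to, so run it from each region, or against region-specific hostnames, before declaring recovery.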

Where this fits

Certificate outages are a recurring theme across incident response, and they overlap with security work whenever the cause is a revocation or mis-issuance — pair this with your security breach runbook. When the page fires, run the triage through your AI assistant from the incident response dashboard, keeping it to enumeration and verification while you own the deploy.

The discipline that prevents the second outage: classify the failure, map the silent mTLS and pinning paths, verify the full chain before you trust the new cert, and never let “renew it” stand in for “understand it.”
