Certificate Lifecycle and Internal PKI: Ending the 3 AM Expiry Outage

Ask any veteran on-call engineer for their most embarrassing outage and a striking number will name the same cause: a certificate expired. Not a sophisticated attack, not a cascading failure — a date passed, and a critical service started rejecting connections. The fix is trivial. The reason it keeps happening is that certificate lifecycle is treated as a calendar reminder instead of an automated system.

Certificates are a security control: they prove identity and encrypt traffic. But they’re also a reliability landmine, because they have an expiry date and humans are bad at deadlines measured in years. The answer to both problems is the same — automate the entire lifecycle so no human is ever the thing standing between you and an expired cert.

The lifecycle, all of it

People think “certificate management” means “renew before it expires.” That’s one stage of five, and skipping the others is where things go wrong:

Issuance — generate a key, request a cert, get it signed by a CA.
Distribution — deliver the cert and key securely to where they’re used.
Renewal — replace the cert well before expiry, automatically.
Revocation — invalidate a cert immediately when a key is compromised.
Inventory — know every cert you have, where it lives, and when it expires.

The last one is the quiet killer. You cannot renew a cert you’ve forgotten exists. Most expiry outages are really inventory failures.
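One way to picture the inventory stage is as a plain list of records, one per certificate, that can be sorted by expiry and filtered for anything not under automation. The sketch below is illustrative only: the CertRecord fields, the hostnames, and the dates are all hypothetical, not a schema from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CertRecord:
    """One inventory entry. Field names are hypothetical, for illustration."""
    subject: str          # hostname or service identity on the cert
    location: str         # where the cert and key are deployed
    not_after: datetime   # expiry timestamp (UTC)
    automated: bool       # True if renewal is handled by automation


def at_risk(inventory, now, horizon_days):
    """Return non-automated certs expiring within horizon_days, soonest first."""
    risky = [
        rec for rec in inventory
        if not rec.automated
        and (rec.not_after - now).days < horizon_days
    ]
    return sorted(risky, key=lambda rec: rec.not_after)


# Made-up example data.
now = datetime(2026, 6, 1, tzinfo=timezone.utc)
inventory = [
    CertRecord("api.example.com", "k8s/production",
               datetime(2026, 8, 1, tzinfo=timezone.utc), True),
    CertRecord("legacy-vm.example.com", "vm-17:/etc/ssl",
               datetime(2026, 6, 10, tzinfo=timezone.utc), False),
    CertRecord("vendor-gw.example.com", "appliance",
               datetime(2027, 1, 15, tzinfo=timezone.utc), False),
]

for rec in at_risk(inventory, now, horizon_days=30):
    print(rec.subject, rec.not_after.date())
```

With this sample data only the legacy VM is reported: it is the one cert that is both outside automation and close to expiry.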

Public-facing: let ACME do the work

For anything internet-facing, there’s no excuse for manual certs anymore. ACME (the protocol behind Let’s Encrypt) automates issuance and renewal end to end. cert-manager in Kubernetes makes it declarative:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: production
spec:
  secretName: api-tls
  duration: 2160h      # 90 days
  renewBefore: 720h    # renew 30 days early
  dnsNames:
    - api.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

The renewBefore: 720h line is the whole game. Renewing 30 days early gives you a month of slack if renewal hits a snag (a DNS validation hiccup, a rate limit, a CA outage) instead of discovering the problem at the moment of expiry. Short cert lifetimes (90 days) plus generous early renewal are the modern standard precisely because they force the automation to be exercised constantly; a renewal path you run every 60 days can’t silently rot.
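The arithmetic behind those two fields can be sketched in a few lines. This is a simplified model of the timeline, not cert-manager’s implementation; the function name and the issuance date are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def renewal_window(not_before, duration_hours, renew_before_hours):
    """Return (renew_at, not_after, slack) for a cert.

    Simplified model: renewal is first attempted at not_after - renewBefore.
    """
    not_after = not_before + timedelta(hours=duration_hours)
    renew_at = not_after - timedelta(hours=renew_before_hours)
    return renew_at, not_after, not_after - renew_at


issued = datetime(2026, 1, 1, tzinfo=timezone.utc)  # hypothetical issuance date
renew_at, not_after, slack = renewal_window(issued, 2160, 720)

print("renew at:", renew_at.date())     # 60 days after issuance
print("expires: ", not_after.date())    # 90 days after issuance
print("slack:   ", slack.days, "days")  # time to notice and fix a failed renewal
```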

Internal services: stand up your own CA

Public CAs can’t issue for internal hostnames or service identities. For east-west traffic you run a private PKI. The mistake here is the long-lived self-signed cert copied around by hand — that’s how you end up with a CA key on someone’s laptop and no idea which services trust it.

Vault’s PKI engine gives you a real internal CA with automated issuance and short lifetimes:

# Define a role that mints short-lived internal certs
vault write pki_int/roles/internal-service \
  allowed_domains="svc.internal" \
  allow_subdomains=true \
  max_ttl="72h"

# A service requests its cert at startup
vault write pki_int/issue/internal-service \
  common_name="payments.svc.internal" ttl="72h"

Three-day internal certs sound aggressive until you realize the point: they rotate so often that rotation must be automated, and a path that runs that often gets fixed the first time it breaks. A cert that lives three days can’t be the thing that expires at 3 AM, because it gets replaced every day or two without anyone thinking about it.
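On the service side, the rotation decision can be as small as the check below. It is a sketch under an assumed policy, rotating once two-thirds of the lifetime has elapsed; that fraction, the function name, and the timestamps are illustrative choices, not Vault defaults.

```python
from datetime import datetime, timedelta, timezone


def should_rotate(not_before, not_after, now, fraction=2 / 3):
    """True once `fraction` of the cert's lifetime has elapsed.

    The two-thirds default is an assumed policy, not a Vault setting.
    """
    lifetime = not_after - not_before
    return now >= not_before + lifetime * fraction


issued = datetime(2026, 6, 1, tzinfo=timezone.utc)  # hypothetical
expires = issued + timedelta(hours=72)

print(should_rotate(issued, expires, issued + timedelta(hours=24)))  # False
print(should_rotate(issued, expires, issued + timedelta(hours=50)))  # True
```

A service would run a check like this on a timer and request a fresh cert when it returns True, leaving the remaining third of the lifetime as retry room.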

Structure the PKI properly: an offline root CA that signs an intermediate, and the intermediate does all the day-to-day issuing. If the intermediate is ever compromised, you revoke and reissue it from the root without rebuilding trust from scratch. Keep the root key offline — in an HSM or air-gapped — and use it a handful of times a year.

Revocation that actually works

Renewal is the easy half; revocation is where most PKIs are quietly broken. If a private key leaks, you need every client to stop trusting that cert immediately, but classic CRL and OCSP checking has real gaps in coverage and latency: a CRL is only as fresh as the client’s last download, and many clients soft-fail, accepting the cert when the OCSP responder can’t be reached. The pragmatic answer is to lean on short lifetimes as your primary revocation mechanism. A 72-hour cert is “revoked” within 72 hours no matter what, simply by not being reissued. For the urgent case, OCSP stapling plus short TTLs gets you close enough for most threat models.

# OCSP stapling so clients get fresh revocation status cheaply
ssl_stapling on;
ssl_stapling_verify on;
ssl_trusted_certificate /etc/ssl/chain.pem;
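To make “revoked within 72 hours” concrete: if expiry is the only thing that ends trust in a cert, the worst-case exposure after a key leak is simply the time left until notAfter. A small sketch, with made-up timestamps and a hypothetical function name, comparing a 72-hour cert with a one-year cert:

```python
from datetime import datetime, timedelta, timezone


def exposure_window(not_after, leaked_at):
    """Time a leaked key stays usable if expiry is the only revocation."""
    return max(not_after - leaked_at, timedelta(0))


issued = datetime(2026, 6, 1, tzinfo=timezone.utc)  # hypothetical
leak = issued + timedelta(hours=10)                 # key leaks 10 hours in

short_lived = exposure_window(issued + timedelta(hours=72), leak)
one_year = exposure_window(issued + timedelta(days=365), leak)

print(short_lived)  # 2 days, 14:00:00
print(one_year)     # 364 days, 14:00:00
```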

Build the inventory you’re missing

You can’t manage what you can’t see. Stand up a continuous scan of your real endpoints and surface expiry as a metric:

# Check expiry on a live endpoint (-servername sends SNI so the right cert is returned)
echo | openssl s_client -connect api.example.com:443 -servername api.example.com 2>/dev/null \
  | openssl x509 -noout -enddate
# notAfter=Sep 10 12:00:00 2026 GMT

Wrap that in an exporter so Prometheus tracks ssl_cert_not_after across every endpoint, then alert when any cert drops under 21 days remaining. That alert is your safety net for the one cert that slipped outside automation — the appliance, the legacy VM, the vendor integration nobody put in cert-manager.
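The core of such a check is small. The sketch below is not a real exporter; it only shows parsing a notAfter string in the format openssl prints and comparing it to the 21-day threshold. The function names are hypothetical, and fetch_not_after needs network access to the target host.

```python
import socket
import ssl
import time

ALERT_DAYS = 21  # threshold from the alert rule described above


def days_remaining(not_after, now=None):
    """Days until a notAfter string like 'Sep 10 12:00:00 2026 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expiry - now) / 86400


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the notAfter string of the cert a live endpoint presents."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_alert(not_after, now=None):
    """True when the cert has fewer than ALERT_DAYS days left."""
    return days_remaining(not_after, now) < ALERT_DAYS


# Offline example using the sample output above and a fixed "now".
now = ssl.cert_time_to_seconds("Sep 1 12:00:00 2026 GMT")
print(needs_alert("Sep 10 12:00:00 2026 GMT", now))  # True: 9 days left
```

Because fetch_not_after uses a verifying TLS context, a cert that has already expired or fails validation raises an error instead of returning a date; a real monitor would want to treat that as an alert in its own right.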

Make it a system, not a chore

The throughline of every certificate outage is the same: a human was load-bearing in the lifecycle. Remove them. Issuance is automated by ACME or Vault, renewal happens weeks early, internal certs live days not years, and an independent monitor catches whatever escapes. Reviewing PKI and issuer config changes through automated code review keeps an over-broad allowed_domains or a too-long TTL from slipping in, and the broader security hardening guides cover how certificate identity ties into mTLS and zero trust.
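What such a review might check can be sketched as a small lint over a role definition. Everything here is an assumption for illustration: the thresholds, the set of approved domains, and the function names are hypothetical; only the field names allowed_domains and max_ttl come from the Vault role shown earlier.

```python
import re

APPROVED_DOMAINS = {"svc.internal"}   # hypothetical allow-list
MAX_TTL_HOURS = 72                    # hypothetical policy ceiling

_UNITS = {"s": 1 / 3600, "m": 1 / 60, "h": 1, "d": 24}


def ttl_hours(ttl):
    """Parse a simple single-unit duration such as '72h' or '30d' into hours."""
    match = re.fullmatch(r"(\d+)([smhd])", ttl.strip())
    if not match:
        raise ValueError(f"unsupported TTL format: {ttl!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def lint_role(role):
    """Return a list of findings for a role given as a plain dict."""
    findings = []
    for domain in role.get("allowed_domains", "").split(","):
        domain = domain.strip()
        if domain and domain not in APPROVED_DOMAINS:
            findings.append(f"allowed_domains includes unapproved domain {domain!r}")
    if ttl_hours(role.get("max_ttl", "0h")) > MAX_TTL_HOURS:
        findings.append(f"max_ttl {role['max_ttl']!r} exceeds {MAX_TTL_HOURS}h")
    return findings


# A role matching the earlier example passes; an over-broad one does not.
print(lint_role({"allowed_domains": "svc.internal", "max_ttl": "72h"}))
print(lint_role({"allowed_domains": "internal", "max_ttl": "8760h"}))
```

The first call returns an empty list; the second returns one finding for the domain and one for the TTL.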

Pick your next cert expiry, the one you’re vaguely dreading, and put it on automated renewal this week. Then build the inventory monitor so the next one can never surprise you.

PKI and renewal configs are starting points. Validate issuance policies, TTLs, and trust chains in a staging environment before relying on them in production.
