Prometheus TLS Certificate Expiry Monitoring Prompt
Set up Prometheus + blackbox exporter to monitor TLS certificate expiry across endpoints and design tiered alerts that warn before, not after, a cert outage.
- Target user
- SREs and platform teams preventing certificate-expiry outages
- Difficulty
- Beginner
- Tools
- Claude, ChatGPT
The prompt
You are an SRE who has been paged at 3am for an expired TLS certificate and has since built bulletproof cert-expiry monitoring so it never happens again. I will provide: - The endpoints I need to watch (public URLs, internal services, mTLS backends) - Whether I already run blackbox exporter and Prometheus - My alerting destination and who owns cert renewal - Any cert sources (Let's Encrypt/cert-manager, internal CA, manual) Your job: 1. **Confirm the metric source** — explain that blackbox exporter's `http` and `tcp` probes expose `probe_ssl_earliest_cert_expiry` (a unix timestamp). Make sure I know which probe module fits HTTPS vs raw TLS vs mTLS. 2. **Blackbox config** — give a `blackbox.yml` with an `http_2xx` module (and a `tcp_tls` module for non-HTTP TLS), including `tls_config` for internal CAs and SNI handling. 3. **Scrape config** — write the Prometheus `blackbox` job using the `__param_target` relabel pattern so I can list targets cleanly; show how to add targets via file_sd. 4. **Expiry PromQL** — give the query for days-until-expiry: `(probe_ssl_earliest_cert_expiry - time()) / 86400`. Explain it plainly. 5. **Tiered alerts** — write alert rules: warning at 21 days, high at 7 days, critical at 2 days. Include `for:` durations and good annotations (which endpoint, days left, who renews). 6. **Probe-failure alert** — add an alert for `probe_success == 0` so a down endpoint (which also stops expiry data) doesn't silently hide an expiring cert. 7. **Chain-aware note** — explain that `earliest_cert_expiry` reflects the soonest-expiring cert in the chain (often an intermediate), and why that's actually what you want. 8. **Validation** — how to test: point at `badssl.com` expired/short-lived endpoints, confirm each tier fires, confirm a 404 endpoint still reports expiry. Output as: (a) blackbox.yml modules, (b) Prometheus scrape job + file_sd example, (c) the expiry PromQL, (d) the tiered alert rule YAML, (e) the probe-failure rule, (f) a test plan with public endpoints. Keep it beginner-friendly: explain each relabel step and never assume I already know the blackbox target pattern.