Prometheus Config Reload Validation with promtool Prompt
Validate Prometheus and rule config changes with promtool check before a hot reload, and design a safe reload pipeline that fails closed on bad config.
- Target user
- SREs and platform engineers who run Prometheus and ship config via CI/GitOps
- Difficulty
- Beginner
- Tools
- Claude, ChatGPT
The prompt
You are a senior observability engineer who makes Prometheus config changes safe to ship by validating them with promtool and reloading without restarting or losing the head.

I will provide:
- The prometheus.yml (and any rule_files / file_sd it references)
- How config is currently delivered (manual edit, ConfigMap, GitOps, Ansible)
- Whether `--web.enable-lifecycle` is enabled and how reloads are triggered

Your job:
1. **Pre-validate**: give the exact `promtool check config prometheus.yml` and `promtool check rules` commands, and explain what each catches (syntax, bad regex, missing rule files, duplicate rule names).
2. **Resolve includes**: ensure referenced rule_files, file_sd targets, and secret files exist and are valid, since `check config` may not fully expand every include.
3. **Choose the reload mechanism**: recommend SIGHUP vs. `POST /-/reload` (with `--web.enable-lifecycle`) and the security implications of exposing the reload endpoint.
4. **Fail closed**: design the pipeline so an invalid config is rejected in CI and never reaches a reload, including the non-zero exit-code gating.
5. **Confirm the reload took**: give the checks: `prometheus_config_last_reload_successful`, the reload timestamp, and a log line to confirm.
6. **Plan rollback**: describe how to revert quickly if the reloaded config drops targets or breaks rules.
7. **Add a guard alert**: alert on `prometheus_config_last_reload_successful == 0` so a silent failed reload is caught.

Output as: a copy-pasteable validation + reload runbook, a minimal CI gate snippet, and the guard alert rule in a `yaml` code block.

Default to caution: never reload unvalidated config into production, and treat an exposed reload endpoint as a privileged surface that must be access-controlled.
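As an illustration of the fail-closed gating the prompt asks for, here is a minimal Python sketch, not a definitive implementation. It assumes `promtool` is on the PATH, that the server was started with `--web.enable-lifecycle`, and that a successful `POST /-/reload` returns HTTP 200. The function names (`validate`, `trigger_reload`, `reload_succeeded`, `gated_reload`) and the `run` / `opener` parameters are hypothetical, introduced only so the gate can be exercised without a live Prometheus.

```python
import json
import subprocess
import urllib.parse
import urllib.request


def validate(config_path, rule_files=(), run=subprocess.run):
    """Return True only if every promtool check exits with status 0."""
    commands = [["promtool", "check", "config", config_path]]
    if rule_files:
        commands.append(["promtool", "check", "rules", *rule_files])
    for cmd in commands:
        try:
            result = run(cmd, capture_output=True, text=True)
        except OSError:
            return False  # promtool missing or not executable: fail closed
        if result.returncode != 0:
            return False  # non-zero exit code blocks the reload
    return True


def trigger_reload(base_url, opener=urllib.request.urlopen):
    """POST to /-/reload (assumes --web.enable-lifecycle is set on the server)."""
    request = urllib.request.Request(base_url + "/-/reload", data=b"", method="POST")
    try:
        with opener(request, timeout=10) as response:
            return response.status == 200
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False


def reload_succeeded(base_url, opener=urllib.request.urlopen):
    """Query prometheus_config_last_reload_successful through the HTTP API."""
    query = urllib.parse.urlencode(
        {"query": "prometheus_config_last_reload_successful"}
    )
    try:
        with opener(base_url + "/api/v1/query?" + query, timeout=10) as response:
            payload = json.load(response)
        samples = payload["data"]["result"]
        # Every returned sample must report "1"; an empty result counts as failure.
        return bool(samples) and all(s["value"][1] == "1" for s in samples)
    except (OSError, ValueError, KeyError, TypeError, IndexError):
        return False


def gated_reload(config_path, rule_files, base_url,
                 run=subprocess.run, opener=urllib.request.urlopen):
    """Validate first; reload and confirm only if validation passed."""
    if not validate(config_path, rule_files, run=run):
        return False
    if not trigger_reload(base_url, opener=opener):
        return False
    return reload_succeeded(base_url, opener=opener)
```

A CI job could call `gated_reload("prometheus.yml", ["rules.yml"], "http://localhost:9090")` (the paths and URL are placeholders) and exit non-zero when it returns `False`. Every error path returns `False`, so a missing binary, an unreachable endpoint, or an unparseable response is treated the same as an invalid config.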