Prometheus Troubleshooting Toolkit
Fix down targets, missing metrics, noisy or silent alerts, and slow PromQL — with alert-rule tooling and monitoring prompts.
Top Prometheus errors
Start with the most common production issues and troubleshooting paths.
alertmanager failed to join cluster
Fix Alertmanager 'failed to join cluster': open port 9094 TCP+UDP, set --cluster.advertise-address, and stop duplicate notifica…
Alert Stuck 'Pending' and Never Firing
Fix Prometheus alerts stuck in Pending or missing from /alerts: tune for and evaluation_interval, verify the expression returns…
binary expression must contain only scalar and instant vector types
Fix PromQL 'binary expression must contain only scalar and instant vector types' errors: wrap range vectors in rate(), use scal…
compaction failed
Fix Prometheus 'compaction failed' errors: remove corrupt blocks, free disk space, recover from unclean shutdowns, and restore…
Error loading config (--config.file=/etc/prometheus/prometheus.yml)
Fix Prometheus 'Error loading config' and HTTP 400 reload failures: validate YAML with promtool, enable web lifecycle, and reso…
duplicate sample for timestamp
Fix Prometheus 'duplicate sample for timestamp' errors: dedupe exporters exposing repeated series, add unique instance/job labe…
found multiple scrape configs with job name
Fix Prometheus 'found multiple scrape configs with job name' errors: locate colliding job_names across included files, dedupe s…
Empty query result
Fix Prometheus 'Empty query result' and 'No data' when a metric should exist: label typos, stale series, stopped targets, lookb…
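As a rough illustration of the "Empty query result" case above, the sketch below classifies a response shaped like the Prometheus HTTP API's instant-query reply (`/api/v1/query`). The function name `classify_query_response` and the sample payloads are hypothetical and exist only for this example; a real check would fetch the JSON from your own server first.

```python
from typing import Any, Dict


def classify_query_response(response: Dict[str, Any]) -> str:
    """Classify a parsed /api/v1/query response as 'error', 'empty', or 'data'."""
    if response.get("status") != "success":
        # The API reports failures with status "error" plus errorType/error fields.
        return "error"
    result = response.get("data", {}).get("result", [])
    # An empty result list is what the UI reports as "Empty query result".
    return "data" if result else "empty"


# Hypothetical payloads, shaped like instant-query responses.
EMPTY_RESPONSE = {"status": "success", "data": {"resultType": "vector", "result": []}}
DATA_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "node"}, "value": [1700000000, "1"]}
        ],
    },
}
ERROR_RESPONSE = {"status": "error", "errorType": "bad_data", "error": "parse error"}

if __name__ == "__main__":
    for name, payload in [
        ("empty", EMPTY_RESPONSE),
        ("data", DATA_RESPONSE),
        ("error", ERROR_RESPONSE),
    ]:
        print(name, "->", classify_query_response(payload))
```

An "empty" classification only says the selector matched nothing at the evaluation time; it does not say which of the listed causes applies.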
Best Prometheus prompts
Use these prompts to turn symptoms, logs, and config into a structured troubleshooting plan.
SLO Error Budget & Multi-Window Burn Rate Alerts
Design SLO-based alerts — error budgets, multi-burn-rate alerting, SLI selection, burn budget calculation.
Grafana Loki + Prometheus Correlation
Correlate metrics and logs in Grafana — exemplars from Prometheus to traces, derived fields from Loki, jump from spike to log line.
Prometheus Alert Rule Generator
Generate production-quality Prometheus alerting rules with sensible thresholds, labels, and runbook annotations.
Alertmanager Routing Tree Matcher Design Review
Design or review an Alertmanager routing tree — receivers, matchers, group_by, continue, and timers — so every alert reaches the right team exactly once without falling through to a catch-all.
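The burn budget calculation mentioned in the SLO prompt above comes down to a little arithmetic, sketched here under stated assumptions: a 30-day SLO period, and burn rate defined as the observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical, and the example windows and budget fractions are illustrative rather than recommendations.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / error_budget(slo)


def burn_rate_threshold(
    budget_fraction: float,
    window_hours: float,
    slo_period_hours: float = 30 * 24,  # assumption: 30-day SLO period
) -> float:
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * slo_period_hours / window_hours


if __name__ == "__main__":
    # Illustrative pairs: spend 2% of the budget in 1 hour, 5% in 6 hours.
    for fraction, hours in [(0.02, 1), (0.05, 6)]:
        threshold = burn_rate_threshold(fraction, hours)
        print(f"{fraction:.0%} of budget in {hours}h -> burn rate {threshold:.1f}")
    # Observed error ratio of 1.44% against a 99.9% SLO.
    print(f"observed burn rate: {burn_rate(0.0144, 0.999):.1f}")
```

A multi-window alert would typically compare `burn_rate` over a long and a short window against the same threshold, so the alert stops firing soon after the errors stop.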
Free Prometheus tools
Validate, troubleshoot, or analyze your configuration before production changes.
AI Alert Rule Generator
Turn a plain-English SLO into a ready-to-ship Prometheus alerting rule.
AI Incident Response Assistant
Paste an alert and metrics, get a structured investigation plan.
Prometheus runbook
Use a repeatable checklist for scrape, query, and alerting problems in production.
1. Check targets and their health (up, last scrape, error)
2. Validate scrape configs and relabeling
3. Review alert rules and their evaluation state
4. Test the PromQL directly in the expression browser
5. Inspect remote-write status and queue backpressure
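The first step of the checklist can be partly scripted. The sketch below is a minimal example, assuming you have already fetched and parsed the JSON from the Prometheus targets endpoint (`/api/v1/targets`); it lists every active target whose health is not `up`, along with its last scrape error. The function name `unhealthy_targets` and the sample payload are hypothetical.

```python
from typing import Any, Dict, List


def unhealthy_targets(targets_response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return job, instance, health, and last error for each target that is not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    problems = []
    for target in active:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                {
                    "job": labels.get("job", ""),
                    "instance": labels.get("instance", ""),
                    "health": target.get("health", "unknown"),
                    "last_error": target.get("lastError", ""),
                }
            )
    return problems


# Hypothetical payload, shaped like a /api/v1/targets response.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "node", "instance": "web-1:9100"},
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "node", "instance": "db-1:9100"},
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

if __name__ == "__main__":
    for problem in unhealthy_targets(SAMPLE):
        print(
            f"{problem['job']}/{problem['instance']}: "
            f"{problem['health']} ({problem['last_error']})"
        )
```

Working from parsed JSON rather than making the HTTP call inside the function keeps the check easy to test against saved responses.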