Prometheus target_limit & label_limit Guardrails Prompt
Configure target_limit, label_limit, label_name_length_limit, and label_value_length_limit to protect a Prometheus server from service-discovery explosions and abusive label sets in a multi-tenant environment.
- Target user: Platform engineer hardening a shared Prometheus against target sprawl and label abuse
- Difficulty: Beginner
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who hardens shared Prometheus servers so that one team's mistake cannot add ten thousand accidental scrape targets.

I will provide:
- The scrape_config(s) and which service discovery feeds them (kubernetes_sd, http_sd, file_sd, etc.)
- Roughly how many targets and labels are normal vs. the worst case I fear
- Whether this server is shared across teams
- Any past incident where SD or labels blew up

Your job:
1. **Explain each limit:** clearly distinguish `target_limit` (max unique targets per scrape job after target relabeling; if exceeded, every target in the job is marked failed without being scraped), `label_limit` (max labels per scraped sample after metric relabeling), `label_name_length_limit`, and `label_value_length_limit`, and state what happens when each is exceeded (the scrape is recorded as failed; nothing is silently truncated).
2. **Right-size from baseline:** derive each limit from observed normals plus headroom, using target counts from metrics such as `prometheus_target_scrape_pool_targets`, and explain why too-tight limits cause legitimate scaling events to fail an entire job.
3. **Choose enforcement scope:** recommend tight per-job limits for known-bounded jobs and looser limits for elastic ones, and explain why a global default plus per-job overrides is the maintainable pattern.
4. **Add detection:** write an alert on jobs approaching `target_limit` and on `up` dropping to zero correlated with a target-limit breach, so you can tell a limit failure from a network outage.
5. **Document the override path:** describe how a team safely requests a higher limit (config review) so the guardrail is a speed bump, not a wall.

Output as: (a) a one-line definition table of the four limits, (b) the corrected scrape_config YAML with the chosen values and comments showing the math, (c) one alerting expression for approaching/breaching the limit.

Do not set target_limit so tight that a normal autoscaling event trips it and takes out the whole job's metrics.
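
As a rough illustration of what output (b) might look like, the sketch below shows a global default with per-job overrides. It is not part of the prompt itself. The job names, namespace, file path, observed target counts, headroom factors, and limit values are all hypothetical, and the `global`-level limits assume a Prometheus release recent enough to accept them there.

```yaml
# prometheus.yml (sketch; every name and number below is hypothetical)
global:
  scrape_interval: 30s
  # Server-wide defaults; assumes a Prometheus version that accepts
  # these limits under `global`. A job that sets its own value overrides them.
  target_limit: 500
  label_limit: 40
  label_name_length_limit: 128
  label_value_length_limit: 512

scrape_configs:
  # Known-bounded job: assumed observed peak of 60 targets.
  # 60 * 2.5 headroom = 150
  - job_name: team-a-api
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [team-a]
    target_limit: 150

  # Elastic job: assumed observed peak of 400 targets during autoscaling.
  # 400 * 3 headroom = 1200, deliberately looser than the global default.
  - job_name: batch-workers
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/batch-*.json
    target_limit: 1200
    # Assumed observed maximum of 30 labels per sample, doubled.
    label_limit: 60
```

For output (c), one possible "approaching the limit" expression, assuming the server exposes the `prometheus_target_scrape_pool_targets` and `prometheus_target_scrape_pool_target_limit` gauges, is `prometheus_target_scrape_pool_targets / (prometheus_target_scrape_pool_target_limit > 0) > 0.8`. The `> 0` filter is there to skip jobs with no limit configured; the 0.8 threshold is an arbitrary starting point.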