Prometheus sample_limit Target Protection Prompt
Design per-target sample_limit guardrails that protect a Prometheus server from a single misbehaving exporter blowing up cardinality, without dropping legitimate metrics from healthy targets.
- Target user: SRE or platform team operating a shared Prometheus that must survive noisy multi-tenant exporters
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who runs shared Prometheus clusters and has watched one runaway exporter take down ingestion for an entire team.

I will provide:
- The scrape_config(s) in question (job_name, current sample_limit if any, scrape_interval)
- Approximate series counts per target (or how to estimate them)
- The blast radius I care about (single shared server, sharded fleet, agent + remote_write)
- My tolerance for dropping samples vs. dropping whole scrapes

Your job:
1. **Explain the mechanism**: clarify that `sample_limit` fails the *entire scrape* if the sample count after metric relabeling exceeds the limit (it does not silently truncate), how it differs from `label_limit`/`label_value_length_limit`, and how `scrape_samples_scraped` (count before metric relabeling) and `scrape_samples_post_metric_relabeling` (count after) expose actual counts.
2. **Size the limit**: derive a per-job limit from observed `scrape_samples_post_metric_relabeling` plus growth headroom, and explain why setting it too tight causes intermittent full-scrape failures during legitimate spikes.
3. **Layer the defenses**: combine `sample_limit` with `metric_relabel_configs` drops for known-noisy series so the limit is a backstop, not the primary cardinality control.
4. **Detect breaches**: write alerting rules on `scrape_samples_post_metric_relabeling` approaching the limit (the limit is enforced on the post-relabel count, so `scrape_samples_scraped` overstates usage once drops are in place) and on `up == 0` correlated with `scrape_samples_post_metric_relabeling` to distinguish a limit-triggered failure from a network failure.
5. **Operational rollout**: recommend a safe rollout (observe-only baseline first, then enforce) and a per-team override pattern using scrape-config templating.

Output as:
(a) the corrected scrape_config YAML with sample_limit/label_limit and comments showing the math,
(b) two PromQL alert expressions (approaching-limit and limit-exceeded),
(c) a one-paragraph rollout plan.

Do not set an aggressive sample_limit on a production job without first observing real series counts under peak load.
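Example: sizing arithmetic

Outside the prompt itself, the sizing step (step 2) can be sketched as a small calculation. This is a minimal illustration, not an official Prometheus formula: the function name `suggest_sample_limit`, the 50% default headroom, and the rounding to the next 1,000 are all assumptions chosen for the example. The input is assumed to be the peak `scrape_samples_post_metric_relabeling` value observed for one target of the job under real load.

```python
import math


def suggest_sample_limit(peak_samples: int, headroom: float = 0.5, round_to: int = 1000) -> int:
    """Suggest a sample_limit from an observed per-target peak.

    peak_samples: highest post-relabel sample count seen for any target in the job
                  (assumed to come from scrape_samples_post_metric_relabeling).
    headroom:     fractional growth allowance on top of the peak (0.5 = +50%);
                  an illustrative default, not a recommended value.
    round_to:     round the result up to a multiple of this, so the configured
                  limit is a readable number.
    """
    if peak_samples < 0 or headroom < 0 or round_to <= 0:
        raise ValueError("peak_samples and headroom must be >= 0, round_to must be > 0")
    raw = peak_samples * (1 + headroom)
    # Round up, never down: a limit below the padded peak would defeat the headroom.
    return math.ceil(raw / round_to) * round_to


if __name__ == "__main__":
    # Observed peak of 12,400 samples with +50% headroom:
    # 12400 * 1.5 = 18600, rounded up to the next 1000.
    print(suggest_sample_limit(12400))  # 19000
```

The resulting number would go into the job's `sample_limit` only after the observe-only baseline confirms that the peak is representative.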