
Prometheus sample_limit Target Protection Prompt

Design per-target sample_limit guardrails that protect a Prometheus server from a single misbehaving exporter blowing up cardinality, without dropping legitimate metrics from healthy targets.

Target user: SRE or platform team operating a shared Prometheus that must survive noisy multi-tenant exporters
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior observability engineer who runs shared Prometheus clusters and has watched one runaway exporter take down ingestion for an entire team.

I will provide:
- The scrape_config(s) in question (job_name, current sample_limit if any, scrape_interval)
- Approximate series counts per target (or how to estimate them)
- The blast radius I care about (single shared server, sharded fleet, agent + remote_write)
- My tolerance for dropping samples vs. dropping whole scrapes

Your job:

1. **Explain the mechanism**: clarify that `sample_limit` fails the *entire scrape* if the number of samples remaining after `metric_relabel_configs` exceeds the limit (it does not silently truncate), how it differs from `label_limit`/`label_value_length_limit` (which cap the number of labels per sample and the length of label values rather than the number of samples), and how `scrape_samples_scraped` (before metric relabeling) and `scrape_samples_post_metric_relabeling` (after it) expose actual counts.

2. **Size the limit**: derive a per-job limit from the observed peak of `scrape_samples_post_metric_relabeling` plus growth headroom, and explain why setting it too tightly causes intermittent full-scrape failures during legitimate spikes.

3. **Layer the defenses**: combine `sample_limit` with `metric_relabel_configs` drops for known-noisy series so the limit is a backstop, not the primary cardinality control.

4. **Detect breaches**: write alerting rules on `scrape_samples_post_metric_relabeling` approaching the limit (the limit applies after metric relabeling, so `scrape_samples_scraped` can overstate usage) and on `up == 0` correlated with the reported sample counts to distinguish a limit-triggered failure from a network failure.

5. **Operational rollout**: recommend a safe rollout (observe-only baseline first, then enforce) and a per-team override pattern using scrape-config templating.

Output as: (a) the corrected scrape_config YAML with sample_limit/label_limit and comments showing the math, (b) two PromQL alert expressions (approaching-limit and limit-exceeded), (c) a one-paragraph rollout plan.

Do not set an aggressive sample_limit on a production job without first observing real series counts under peak load.
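
Example output (illustrative)

The following is a minimal sketch of the kind of scrape_config the prompt asks for, not part of the prompt itself. The job name, target address, metric prefix, observed peak of 8,000 samples, 50% headroom, and the label limits are all hypothetical values chosen for illustration; substitute numbers observed on your own targets under peak load.

```yaml
# prometheus.yml (fragment). All names and numbers are hypothetical.
scrape_configs:
  - job_name: tenant-exporter            # hypothetical job
    scrape_interval: 30s
    static_configs:
      - targets: ["exporter.example.internal:9100"]   # hypothetical target

    # Primary cardinality control: drop known-noisy series before
    # the sample limit is evaluated.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "noisy_debug_.*"          # hypothetical metric prefix
        action: drop

    # Backstop: assumed peak of 8,000 samples per scrape after metric
    # relabeling, plus 50% headroom: 8,000 * 1.5 = 12,000.
    sample_limit: 12000
    label_limit: 40                      # assumed max labels per sample
    label_value_length_limit: 512        # assumed max label value length
```

A matching pair of alert rules might look like the sketch below. The threshold 12000 mirrors the hypothetical `sample_limit` above, and the 80% ratio and `for` durations are assumptions. The second rule is a heuristic that relies on a failed scrape still reporting its raw sample count while a network failure reports none; verify that behavior on your Prometheus version. Prometheus also exposes a server-level counter, `prometheus_target_scrapes_exceeded_sample_limit_total`, that can confirm limit-triggered failures are occurring.

```yaml
# rules.yml (fragment). Thresholds and durations are assumptions.
groups:
  - name: sample-limit-guardrails
    rules:
      # Approaching-limit: post-relabel samples above 80% of the limit.
      - alert: ScrapeApproachingSampleLimit
        expr: scrape_samples_post_metric_relabeling{job="tenant-exporter"} > 0.8 * 12000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} is above 80% of its sample_limit"

      # Limit-exceeded (heuristic): target is down, yet its scrape still
      # reported more raw samples than the limit allows.
      - alert: ScrapeExceededSampleLimit
        expr: |
          up{job="tenant-exporter"} == 0
          and on (job, instance)
          scrape_samples_scraped{job="tenant-exporter"} > 12000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is likely failing scrapes due to sample_limit"
```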
