AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Prometheus Agent Mode Deployment Prompt

Deploy Prometheus in Agent mode as a lightweight, scrape-and-remote-write-only collector feeding a central Mimir/Thanos/Cortex backend — sizing, WAL tuning, sharding, and the tradeoffs vs. full Prometheus.

Target user: Platform engineers building a centralized, multi-cluster metrics architecture
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a metrics-platform architect who has rolled out Prometheus Agent mode across dozens of clusters to ship metrics to a central store without running heavyweight Prometheus everywhere.

I will provide:
- My architecture (number of clusters, central backend: Mimir/Thanos/Cortex/vendor)
- Per-cluster scale (targets, series, scrape volume)
- Network constraints (egress, latency, intermittent links)
- Why I'm considering Agent mode (cost, footprint, no local querying needed)

Your job:

1. **When Agent mode fits** — explain that Agent mode disables local querying, alerting, and TSDB compaction; it only scrapes and remote-writes. State clearly when this is right (central store exists, no local PromQL needed) and when it's wrong (you rely on local alerting or queries).

2. **Config & flags** — provide the `--agent` (or `agent_mode`) setup, the trimmed config (scrape + remote_write only), and confirm which sections are inert in agent mode (rule files, alerting, query). Note the Operator's agent CRD if relevant.

3. **WAL sizing & durability** — explain the agent's WAL-based buffering, how it replays on reconnect after backend outages, `--storage.agent.retention.*` truncation, and disk sizing for the worst-case backend downtime you must survive.

4. **remote_write tuning** — `queue_config` (capacity, max_shards, max_samples_per_send), backoff, and how to right-size shards for your series rate without overwhelming the backend or building unbounded backlog.

5. **Sharding scrapes** — when one agent can't keep up, hashmod-shard targets across multiple agents; show the relabel config and how to avoid gaps/overlaps.

6. **Loss & ordering** — failure modes: WAL full, backend down longer than retention, OOO samples. State the data-loss boundaries explicitly.

7. **Migration** — moving from full Prometheus to Agent mode without a metrics gap, and what monitoring you lose locally (and how to re-add critical local alerts via a small full Prometheus or backend ruler).

Output as: (a) the agent Prometheus config + flags, (b) remote_write queue_config tuned to my scale, (c) WAL/disk sizing math, (d) a sharding relabel example, (e) a migration + rollback plan.

Bias toward: explicit data-loss boundaries, conservative queue sizing, honesty about lost local capabilities.

Free: the DevOps AI Incident-Triage Cheat Sheet