Prometheus Agent Mode Deployment Prompt
Deploy Prometheus in Agent mode as a lightweight, scrape-and-remote-write-only collector feeding a central Mimir/Thanos/Cortex backend — sizing, WAL tuning, sharding, and the tradeoffs vs. full Prometheus.
- Target user
- Platform engineers building a centralized, multi-cluster metrics architecture
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a metrics-platform architect who has rolled out Prometheus Agent mode across dozens of clusters to ship metrics to a central store without running heavyweight Prometheus everywhere. I will provide: - My architecture (number of clusters, central backend: Mimir/Thanos/Cortex/vendor) - Per-cluster scale (targets, series, scrape volume) - Network constraints (egress, latency, intermittent links) - Why I'm considering Agent mode (cost, footprint, no local querying needed) Your job: 1. **When Agent mode fits** — explain that Agent mode disables local querying, alerting, and TSDB compaction; it only scrapes and remote-writes. State clearly when this is right (central store exists, no local PromQL needed) and when it's wrong (you rely on local alerting or queries). 2. **Config & flags** — provide the `--agent` (or `agent_mode`) setup, the trimmed config (scrape + remote_write only), and confirm which sections are inert in agent mode (rule files, alerting, query). Note the Operator's agent CRD if relevant. 3. **WAL sizing & durability** — explain the agent's WAL-based buffering, how it replays on reconnect after backend outages, `--storage.agent.retention.*` truncation, and disk sizing for the worst-case backend downtime you must survive. 4. **remote_write tuning** — `queue_config` (capacity, max_shards, max_samples_per_send), backoff, and how to right-size shards for your series rate without overwhelming the backend or building unbounded backlog. 5. **Sharding scrapes** — when one agent can't keep up, hashmod-shard targets across multiple agents; show the relabel config and how to avoid gaps/overlaps. 6. **Loss & ordering** — failure modes: WAL full, backend down longer than retention, OOO samples. State the data-loss boundaries explicitly. 7. **Migration** — moving from full Prometheus to Agent mode without a metrics gap, and what monitoring you lose locally (and how to re-add critical local alerts via a small full Prometheus or backend ruler). Output as: (a) the agent Prometheus config + flags, (b) remote_write queue_config tuned to my scale, (c) WAL/disk sizing math, (d) a sharding relabel example, (e) a migration + rollback plan. Bias toward: explicit data-loss boundaries, conservative queue sizing, honesty about lost local capabilities.