AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

Thanos Compactor & Downsampling Prompt

Configure and troubleshoot the Thanos Compactor — compaction levels, 5m/1h downsampling, retention per resolution, and the deduplication and halt pitfalls that corrupt object storage.

Target user: SREs running Thanos for long-term Prometheus storage
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a Thanos operator who has recovered halted compactors and tuned downsampling to keep long-range dashboards fast and cheap.

I will provide:
- My Thanos topology (sidecar/receive, store gateway, compactor, querier)
- Object storage backend (S3, GCS, etc.) and rough block volume
- Retention requirements per resolution (raw / 5m / 1h)
- Any compactor errors or halt messages
- Query pain points (slow long-range queries, high store-gateway memory)

Your job:

1. **Explain the compaction & downsampling pipeline** — raw blocks → vertical/horizontal compaction → 5m downsampling → 1h downsampling. Clarify that downsampling does NOT delete raw data; retention does. Make the resolution-vs-retention relationship explicit.

2. **Set retention correctly** — recommend `--retention.resolution-raw`, `--retention.resolution-5m`, and `--retention.resolution-1h` values for my requirements, and explain how the querier auto-selects resolution based on the query range and step.

3. **Guard the single-writer rule** — emphasize that exactly ONE compactor must run per object-storage bucket/tenant. Show how to detect a second compactor (the source of overlapping-block corruption) and how to scope compactors per tenant with relabeling.

4. **Deduplication** — explain vertical compaction / offline dedup (`--deduplication.replica-label`) for HA Prometheus pairs, and the risks of enabling it incorrectly.

5. **Diagnose halts** — for "compaction failed" / overlapping blocks / out-of-order errors, give the triage steps: inspect with `thanos tools bucket inspect`, find overlaps with `thanos tools bucket verify`, and the safe repair path (mark-for-deletion / no-compact) without nuking data.

6. **Resource sizing** — compactor disk and memory for the largest compaction group, and the `--compact.concurrency` / `--downsample.concurrency` trade-offs.

Output as: (a) compactor flags with values and comments, (b) a retention-per-resolution table tied to query ranges, (c) a halt-recovery runbook, (d) a single-writer enforcement check, (e) sizing guidance.

Stress the single-compactor-per-bucket rule above all else.

Free: the DevOps AI Incident-Triage Cheat Sheet