
Thanos Architecture & Component Debug Prompt

Operate Thanos — Sidecar, Receive, Store Gateway, Compactor, Querier, Ruler; troubleshoot dedup, downsampling, S3 issues.

Target user
Platform engineers running Thanos for Prometheus long-term storage
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior platform engineer who has deployed and operated Thanos at scale — multi-cluster, S3-backed, downsampled, deduplicated.

I will provide:
- The Thanos topology
- Symptom (slow queries, gaps in long-term data, stuck Compactor, incorrect dedup)
- Component logs

Your job:

1. **Thanos components**:
   - **Sidecar** — alongside Prometheus; uploads blocks to S3; serves recent data
   - **Receive** — alternative; receives remote_write directly
   - **Store Gateway** — serves blocks from S3 to Querier
   - **Querier** — federates Sidecar/Receive + Store Gateway for global view
   - **Compactor** — downsamples and compacts in S3
   - **Ruler** — recording / alerting rules against Thanos
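2. **For dedup**:
   - Prometheus HA pair both uploading → the same series stored twice, in separate blocks that differ only in the replica external label
   - Querier with `--query.replica-label` dedups at query time
   - Compactor with `--deduplication.replica-label` dedups at compaction time (rewrites blocks in S3)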
3. **For downsampling**:
   - 5m and 1h resolutions
   - Faster queries over long ranges
   - Done by Compactor
4. **For gaps**:
   - Sidecar upload failure → blocks not in S3
   - S3 permissions / quota
   - Time skew
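5. **For Compactor**:
   - Heavy I/O / memory; traditionally a single instance per bucket
   - Newer: can be sharded so that each instance handles a disjoint set of blocks (for example, split by external labels); vertical compaction is a separate feature that merges overlapping blocks
   - Logs are essential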
6. **For Store Gateway**:
   - Index cache + chunk cache
   - Index header generation
   - Memcached / Redis for shared cache
7. **For Querier**:
   - Discovers Sidecar via DNS / static
   - Auto-downsampling for long ranges
   - Partial response (best-effort)
8. **For Receive** (alternative ingest):
   - Receives remote_write
   - Hashring for routing
   - More complex but no sidecar dependency

Mark DESTRUCTIVE: deleting blocks in S3 (data loss), Compactor with wrong dedup label (merges wrong replicas), removing Sidecar without backup ingest.

---

Topology: [DESCRIBE]
Symptom: [DESCRIBE]
Component logs: [PASTE]

Why this prompt works

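Thanos failures often surface far from their cause: a gap on a dashboard can start with a failed Sidecar upload, and a slow query with a cold Store Gateway cache. The prompt walks each component in turn so the model can tie the symptom to the component responsible.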

How to use it

  1. Map the topology to find the broken component.
  2. For gaps, check Sidecar uploads to S3.
  3. For slow queries, check the Store Gateway caches.
  4. For dedup problems, check that the replica label is consistent across Prometheus, Querier, and Compactor.
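
The steps above amount to a symptom-to-component lookup. A minimal Python sketch of that lookup follows; the symptom keys (`gaps`, `slow_queries`, `dedup_wrong`, `compactor_stuck`) and the wording of each check are illustrative assumptions, not anything defined by Thanos.

```python
# Hypothetical triage table: maps a symptom to the component to inspect first
# and the first thing to look at there. Keys and wording are illustrative.
FIRST_CHECKS = {
    "gaps": ("Sidecar", "thanos_shipper_upload_failures_total and S3 permissions"),
    "slow_queries": ("Store Gateway", "index and chunk cache hit rate"),
    "dedup_wrong": ("Querier", "replica label matches the Prometheus external label"),
    "compactor_stuck": ("Compactor", "thanos_compact_halted and the Compactor logs"),
}


def first_check(symptom: str) -> str:
    """Return a one-line starting point for the given symptom."""
    component, check = FIRST_CHECKS.get(
        symptom, ("Topology", "map every component before drilling in")
    )
    return f"{component}: {check}"


print(first_check("gaps"))
```

Unknown symptoms fall back to step 1 (map the topology) rather than guessing at a component.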

Useful commands

# Flag reference for a component (not a health check)
thanos query --help

# Sidecar
curl http://sidecar:10902/-/healthy
curl http://sidecar:10902/-/ready

# Sidecar uploads (metric names to query, not shell commands; one series per Prometheus)
thanos_shipper_uploads_total
thanos_shipper_upload_failures_total

# Store Gateway
thanos_blocks_meta_synced
thanos_blocks_meta_sync_failures_total
thanos_bucket_store_blocks_loaded

# Compactor
thanos_compact_iterations_total
thanos_compact_halted

# Querier
thanos_query_apis_query_total
thanos_query_concurrent_max

# S3
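aws s3 ls s3://thanos-bucket/ | grep -c PRE   # approximate block count (top-level prefixes, one per block)
aws s3 ls s3://thanos-bucket/ --recursive --summarize --human-readable | tail -n 2   # object count and total size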
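
The metrics listed above can also be read straight from a component's `/metrics` endpoint. The sketch below, in Python, parses Prometheus text-format output and flags two of the signals named above. `SAMPLE_METRICS` is made-up data standing in for a real scrape (it mixes Sidecar and Compactor lines for brevity), and the parser assumes samples carry no trailing timestamp.

```python
# Made-up sample standing in for the body of http://<component>:10902/metrics.
SAMPLE_METRICS = """\
# HELP thanos_shipper_upload_failures_total Total number of block upload failures
# TYPE thanos_shipper_upload_failures_total counter
thanos_shipper_upload_failures_total 3
thanos_shipper_uploads_total 120
thanos_compact_halted 0
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Parse 'name value' samples; labeled series keep their full name as the key."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def findings(samples: dict[str, float]) -> list[str]:
    """Turn raw metric values into the findings described in this article."""
    out = []
    if samples.get("thanos_shipper_upload_failures_total", 0) > 0:
        out.append("Sidecar upload failures: check S3 permissions, quota, and network")
    if samples.get("thanos_compact_halted", 0) == 1:
        out.append("Compactor halted: read its logs before restarting")
    return out


for finding in findings(parse_metrics(SAMPLE_METRICS)):
    print(finding)
```

In practice these checks usually live in alerting rules; the script form is only meant to show which values matter.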

Architecture pattern

Sidecar-based (per-Prometheus)

# Sidecar alongside Prometheus
- args:
  - sidecar
  - --tsdb.path=/prometheus
  - --prometheus.url=http://localhost:9090
  - --objstore.config-file=/etc/thanos/objstore.yaml
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902

Querier (global query)

- args:
  - query
  - --http-address=0.0.0.0:9090
  - --grpc-address=0.0.0.0:10901
  - --query.replica-label=prometheus_replica   # HA dedup
  # Recent Thanos releases deprecate --store in favor of --endpoint; check your version
  - --store=dnssrv+_grpc._tcp.thanos-sidecar.prometheus.svc
  - --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc
  - --store=dnssrv+_grpc._tcp.thanos-ruler.thanos.svc
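
To build intuition for `--query.replica-label=prometheus_replica`: series that differ only in that label are treated as one series. The Python sketch below is a deliberately naive model (drop the label, then union samples by timestamp). Thanos's real deduplication is more involved, so read this as intuition only. The label values (`cluster="eu1"`, replicas `a` and `b`) are made up.

```python
from collections import defaultdict

REPLICA_LABEL = "prometheus_replica"


def dedup(series, replica_label=REPLICA_LABEL):
    """Naive dedup. series is a list of (labels dict, [(timestamp, value), ...])."""
    merged = defaultdict(dict)
    for labels, samples in series:
        # Series identity is every label except the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            merged[key].setdefault(ts, value)  # first replica seen wins per timestamp
    return {key: sorted(points.items()) for key, points in merged.items()}


series = [
    ({"__name__": "up", "cluster": "eu1", "prometheus_replica": "a"}, [(0, 1.0), (30, 1.0)]),
    ({"__name__": "up", "cluster": "eu1", "prometheus_replica": "b"}, [(30, 1.0), (60, 1.0)]),
]

print(len(dedup(series)))                           # 1 series: replicas merged
print(len(dedup(series, replica_label="replica")))  # 2 series: label mismatch, nothing merged
```

The second call shows the "replica label mismatch" failure: if the flag names a label the Prometheus replicas do not set, every replica remains a separate series and queries return duplicates.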

Store Gateway

- args:
  - store
  - --data-dir=/data
  - --objstore.config-file=/etc/thanos/objstore.yaml
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --index-cache.config-file=/etc/thanos/index-cache.yaml

Compactor

- args:
  - compact
  - --data-dir=/data
  - --objstore.config-file=/etc/thanos/objstore.yaml
  - --wait
  - --retention.resolution-raw=30d
  - --retention.resolution-5m=180d
  - --retention.resolution-1h=730d
  - --deduplication.replica-label=prometheus_replica   # offline dedup rewrites blocks; for Prometheus HA replicas see also --deduplication.func=penalty
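
The three retention flags above determine which resolutions can still answer a query for data of a given age. A small Python sketch using the same values follows. It models retention only and ignores the delay before the Compactor first produces downsampled blocks, so very recent data will have fewer resolutions than it reports.

```python
from datetime import timedelta

# Retention values copied from the Compactor flags above.
RETENTION = {
    "raw": timedelta(days=30),
    "5m": timedelta(days=180),
    "1h": timedelta(days=730),
}


def retained_resolutions(age: timedelta) -> list[str]:
    """Resolutions whose retention has not yet expired for data of this age."""
    return [res for res, keep in RETENTION.items() if age <= keep]


print(retained_resolutions(timedelta(days=7)))    # ['raw', '5m', '1h']
print(retained_resolutions(timedelta(days=90)))   # ['5m', '1h']
print(retained_resolutions(timedelta(days=400)))  # ['1h']
print(retained_resolutions(timedelta(days=800)))  # []
```

With these settings, a query pinned to raw resolution returns nothing for data older than 30 days, which can look like a gap even though the 5m and 1h blocks are intact.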

Common findings this catches

  • Sidecar shipping failures → S3 permissions, network, or S3 quota.
  • Querier partial response → some Store Gateway / Sidecar down.
  • Compactor halted → permission or storage problem; manual cleanup may be needed.
  • Store Gateway slow → low cache hit rate; increase the index / chunk cache size.
  • HA dedup wrong → replica label mismatch.
  • Long-range queries time out → enable downsampling in the Compactor and auto-downsampling in the Querier.
  • Two Compactors on same bucket → race; pick one.

When to escalate

  • Major architecture redesign — strategic.
  • Compactor state corruption — recovery.
  • S3 lifecycle / cost optimization — finops.
