Thanos Architecture & Component Debug Prompt
Operate Thanos — Sidecar, Receive, Store Gateway, Compactor, Querier, Ruler; troubleshoot dedup, downsampling, S3 issues.
- Target user: Platform engineers running Thanos for Prometheus long-term storage
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has deployed and operated Thanos at scale: multi-cluster, S3-backed, downsampled, deduplicated.

I will provide:
- The Thanos topology
- Symptom (queries slow, gaps in long-term data, compactor stuck, dedup wrong)
- Component logs

Your job:

1. **Thanos components**:
   - **Sidecar**: runs alongside Prometheus; uploads blocks to S3; serves recent data
   - **Receive**: alternative to Sidecar; receives remote_write directly
   - **Store Gateway**: serves blocks from S3 to the Querier
   - **Querier**: federates Sidecar/Receive + Store Gateway for a global view
   - **Compactor**: downsamples and compacts blocks in S3
   - **Ruler**: evaluates recording / alerting rules against Thanos
2. **For dedup**:
   - In a Prometheus HA pair, each replica uploads its own blocks, so every series is stored twice
   - Querier with `--query.replica-label` dedups at query time
   - Compactor with `--deduplication.replica-label` dedups at compaction time
3. **For downsampling**:
   - 5m and 1h resolutions
   - Faster queries over long ranges
   - Done by the Compactor
4. **For gaps**:
   - Sidecar upload failure → blocks never reach S3
   - S3 permissions / quota
   - Clock skew
5. **For Compactor**:
   - Heavy I/O / memory; traditionally a single instance per bucket
   - Newer: compaction can be sharded across instances, provided no two instances handle the same blocks
   - Logs are essential
6. **For Store Gateway**:
   - Index cache + chunk cache
   - Index header generation
   - Memcached / Redis for shared cache
7. **For Querier**:
   - Discovers Sidecars via DNS / static configuration
   - Auto-downsampling for long ranges
   - Partial response (best-effort)
8. **For Receive** (alternative ingest):
   - Receives remote_write
   - Hashring for routing
   - More complex, but no Sidecar dependency

Mark DESTRUCTIVE: deleting blocks in S3 (data loss), running the Compactor with the wrong dedup label (merges the wrong replicas), removing the Sidecar without an alternative ingest path.

---

Topology: [DESCRIBE]
Symptom: [DESCRIBE]
Component logs: [PASTE]
Why this prompt works
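Thanos has several components, and a symptom seen in one (gaps in a long-range query) often originates in another (a Sidecar that failed to upload). This prompt walks through each component in turn.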
How to use it
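- Map the topology first to find the broken component.
- For gaps, check Sidecar uploads to S3.
- For slow queries, check the Store Gateway caches.
- For dedup problems, check that replica labels are consistent across Prometheus, Querier, and Compactor.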
Useful commands
# CLI reference for a component (lists its flags; this is not a health check)
thanos query --help
# Sidecar
curl http://sidecar:10902/-/healthy
curl http://sidecar:10902/-/ready
# Sidecar uploads (per Prometheus)
thanos_shipper_uploads_total
thanos_shipper_upload_failures_total
# Store Gateway
thanos_blocks_meta_synced
thanos_blocks_meta_sync_failures_total
thanos_bucket_store_blocks_loaded
# Compactor
thanos_compact_iterations_total
thanos_compact_halted
# Querier
thanos_query_apis_query_total
thanos_query_concurrent_max
# S3
aws s3 ls s3://thanos-bucket/ --recursive | wc -l # object count (each block holds several objects)
aws s3 ls s3://thanos-bucket/ --recursive --summarize --human-readable | tail -n 2 # total objects and size
Architecture pattern
Sidecar-based (per-Prometheus)
# Sidecar alongside Prometheus
- args:
    - sidecar
    - --tsdb.path=/prometheus
    - --prometheus.url=http://localhost:9090
    - --objstore.config-file=/etc/thanos/objstore.yaml
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
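The Sidecar, Store Gateway, and Compactor above all point at the same `objstore.yaml`. A minimal sketch of that file for S3 follows; the bucket name, endpoint, and region are placeholders, and it assumes credentials come from the environment or an IAM role rather than the file itself.

```yaml
# /etc/thanos/objstore.yaml
# Placeholder values: substitute your own bucket, endpoint, and region.
type: S3
config:
  bucket: thanos-bucket
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  # access_key / secret_key intentionally omitted in this sketch;
  # assumes credentials are supplied by the environment or an IAM role.
```

If uploads fail with permission errors, this file and the identity it resolves to are the first things to compare across components.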
Querier (global query)
- args:
    - query
    - --http-address=0.0.0.0:9090
    - --grpc-address=0.0.0.0:10901
    - --query.replica-label=prometheus_replica # HA dedup
    # Newer Thanos releases deprecate --store in favor of --endpoint; check your version.
    - --store=dnssrv+_grpc._tcp.thanos-sidecar.prometheus.svc
    - --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc
    - --store=dnssrv+_grpc._tcp.thanos-ruler.thanos.svc
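Query-time dedup with `--query.replica-label=prometheus_replica` only works if the two HA replicas carry identical external labels except for `prometheus_replica`. A minimal sketch of the matching Prometheus configuration, assuming a hypothetical `cluster` label and hypothetical replica names:

```yaml
# prometheus.yml on the first replica.
# The second replica is identical except for prometheus_replica: prometheus-1.
global:
  external_labels:
    cluster: eu-1                     # hypothetical; must match on both replicas
    prometheus_replica: prometheus-0  # the only label that differs per replica
```

If any other external label differs between the replicas, the Querier treats them as distinct series and returns both.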
Store Gateway
- args:
    - store
    - --data-dir=/data
    - --objstore.config-file=/etc/thanos/objstore.yaml
    - --grpc-address=0.0.0.0:10901
    - --http-address=0.0.0.0:10902
    - --index-cache.config-file=/etc/thanos/index-cache.yaml
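The `index-cache.yaml` referenced above might look like the following for a single Store Gateway with an in-memory cache. This is a sketch only: the sizes are illustrative assumptions, not recommendations, and a shared Memcached or Redis cache uses a different `type` and its own `config` keys.

```yaml
# /etc/thanos/index-cache.yaml
# Illustrative sizes; tune against observed cache hit rate and pod memory.
type: IN-MEMORY
config:
  max_size: 1GB        # total memory the index cache may use
  max_item_size: 128MB # largest single item; must not exceed max_size
```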
Compactor
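- args:
    - compact
    - --data-dir=/data
    - --objstore.config-file=/etc/thanos/objstore.yaml
    - --wait
    - --retention.resolution-raw=30d
    - --retention.resolution-5m=180d
    - --retention.resolution-1h=730d
    # Compaction-time dedup rewrites blocks. For Prometheus HA pairs, also check
    # --deduplication.func=penalty; the default merger only dedups identical samples.
    - --deduplication.replica-label=prometheus_replica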
Common findings this catches
- Sidecar shipping failures → S3 permissions, network, or S3 quota.
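- Querier partial response → one or more Store Gateways or Sidecars are down or unreachable.
- Compactor halted → permission or storage problem; manual cleanup may be needed.
- Store Gateway slow → low cache hit rate; increase the cache size.
- HA dedup wrong → replica label mismatch between Prometheus external labels and the Querier / Compactor flags.
- Long-range queries time out → confirm downsampled blocks exist and enable auto-downsampling on the Querier.
- Two Compactors on the same bucket → race; run only one against any given set of blocks.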
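Two of the findings above can be caught before a user notices them. The following is a sketch of a Prometheus rule file built on the `thanos_shipper_upload_failures_total` and `thanos_compact_halted` metrics listed under Useful commands; the group name, alert names, windows, and severities are assumptions to adapt.

```yaml
groups:
  - name: thanos-health  # hypothetical group name
    rules:
      # Fires when a Sidecar has failed at least one block upload recently.
      - alert: ThanosSidecarUploadFailures
        expr: increase(thanos_shipper_upload_failures_total[30m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Sidecar is failing to upload blocks to object storage"
      # Fires when the Compactor has halted and stopped compacting.
      - alert: ThanosCompactHalted
        expr: thanos_compact_halted == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Thanos Compactor has halted; compaction and downsampling are stopped"
```

These rules can be evaluated by Prometheus itself or by the Thanos Ruler, since both read the same rule file format.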
When to escalate
- Major architecture redesign — strategic.
- Compactor state corruption — recovery.
- S3 lifecycle / cost optimization — finops.
Related prompts
- Kubernetes Events Analysis Prompt: Filter, aggregate, and decode Kubernetes events (FailedScheduling, BackOff, ProvisioningFailed) to diagnose cluster-wide issues from noisy event streams.
- Prometheus Remote Write & Long-term Storage Prompt: Configure remote write to long-term storage (Thanos Receive, Cortex/Mimir, VictoriaMetrics) and troubleshoot queue/backlog/back-pressure.
- Prometheus Storage, Retention & TSDB Prompt: Configure Prometheus TSDB (retention, block size, compaction, WAL, disk sizing) and troubleshoot OOM / disk-full.