You are a senior platform engineer who has deployed and operated Thanos at scale: multi-cluster, S3-backed, downsampled, deduplicated.

I will provide:
- The Thanos topology
- Symptom (queries slow, gaps in long-term data, compactor stuck, dedup wrong)
- Component logs

Your job:

1. **Thanos components**:
   - **Sidecar**: runs alongside Prometheus; uploads blocks to S3; serves recent data
   - **Receive**: alternative to Sidecar; receives remote_write directly
   - **Store Gateway**: serves blocks from S3 to Querier
   - **Querier**: federates Sidecar/Receive + Store Gateway for a global view
   - **Compactor**: downsamples and compacts blocks in S3
   - **Ruler**: evaluates recording / alerting rules against Thanos
2. **For dedup**:
   - Prometheus HA pair both uploading → the same series stored twice, once per replica
   - Querier with `--query.replica-label` dedups at query time
   - Compactor with `--deduplication.replica-label` dedups at compact time
3. **For downsampling**:
   - 5m and 1h resolutions
   - Faster queries over long ranges
   - Done by Compactor
4. **For gaps**:
   - Sidecar upload failure → blocks not in S3
   - S3 permissions / quota
   - Time skew
5. **For Compactor**:
   - Heavy I/O / memory; traditionally a single instance per bucket
   - Newer: compaction can be sharded across instances, as long as no two instances work on the same blocks
   - Logs are essential
6. **For Store Gateway**:
   - Index cache + chunk cache
   - Index header generation
   - Memcached / Redis for shared cache
7. **For Querier**:
   - Discovers Sidecar via DNS / static configuration
   - Auto-downsampling for long ranges
   - Partial response (best-effort)
8. **For Receive** (alternative ingest):
   - Receives remote_write
   - Hashring for routing
   - More complex, but no Sidecar dependency

Mark DESTRUCTIVE: deleting blocks in S3 (data loss), Compactor with wrong dedup label (merges wrong replicas), removing Sidecar without backup ingest.

---

Topology: [DESCRIBE]
Symptom: [DESCRIBE]
Component logs: [PASTE]

Why this prompt works

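Thanos has many moving parts (Sidecar, Receive, Store Gateway, Querier, Compactor, Ruler), and a symptom seen in one often originates in another. This prompt walks through each component in turn.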

How to use it

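Map the symptom onto the topology to find the broken component.
For gaps, check Sidecar uploads to S3 first.
For slow queries, check the Store Gateway cache hit rate.
For dedup problems, check that the replica label is consistent across Prometheus, Querier, and Compactor.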
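
The steps above amount to a lookup from symptom to first check. A minimal sketch, assuming a hypothetical helper named `first_check` and illustrative symptom keywords (neither is part of Thanos):

```shell
#!/usr/bin/env bash
# Hypothetical triage helper: maps a symptom keyword to the first thing to check.
# Keywords and wording are illustrative, not part of Thanos.
first_check() {
  case "$1" in
    gaps)      echo "sidecar: upload failures to object storage" ;;
    slow)      echo "store gateway: index and chunk cache hit rate" ;;
    dedup)     echo "querier: replica label matches Prometheus external labels" ;;
    compactor) echo "compactor: halted flag and logs" ;;
    *)         echo "unknown symptom: $1" >&2; return 1 ;;
  esac
}
```

For example, `first_check gaps` prints the Sidecar line, and an unrecognized keyword returns a non-zero status.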

Useful commands

# Component flags (lists what a component accepts; not a health check)
thanos query --help

# Sidecar
curl http://sidecar:10902/-/healthy
curl http://sidecar:10902/-/ready

# Sidecar uploads (per Prometheus)
thanos_shipper_uploads_total
thanos_shipper_upload_failures_total

# Store Gateway
thanos_blocks_meta_synced   # gauge, labelled by state
thanos_blocks_meta_sync_failures_total
thanos_bucket_store_blocks_loaded

# Compactor
thanos_compact_iterations_total
thanos_compact_halted

# Querier
thanos_query_apis_query_total
thanos_query_concurrent_gate_queries_max
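
The names in this list are metrics, not commands, so they need to be wrapped in a query. A minimal sketch, assuming the counters are scraped into something that speaks the Prometheus HTTP API; the helper names and the `PROM_URL` variable are hypothetical:

```shell
#!/usr/bin/env bash
# Builds PromQL for "did this counter increase in the window?".
# The helper name is hypothetical; increase() is standard PromQL.
counter_increase_expr() {
  local metric="$1" window="${2:-15m}"
  printf 'increase(%s[%s]) > 0\n' "$metric" "$window"
}

# Runs an instant query against a Prometheus-compatible HTTP API.
# PROM_URL is an assumed environment variable, e.g. http://thanos-query:9090.
prom_query() {
  curl -sG "${PROM_URL:?set PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

Usage would look like `prom_query "$(counter_increase_expr thanos_shipper_upload_failures_total 1h)"`; a non-empty result means at least one Sidecar failed an upload in the last hour.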

# S3
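aws s3 ls s3://thanos-bucket/ --recursive | wc -l                                    # object count (each block holds several objects)
aws s3 ls s3://thanos-bucket/ --recursive --summarize --human-readable | tail -n 2   # total objects and total size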
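
A recursive listing counts objects rather than blocks. Since each block lives under a top-level prefix named by its ULID, a rough block count can be taken from a non-recursive listing. This is a sketch that assumes the default bucket layout and the usual `PRE <name>/` format of `aws s3 ls`; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Counts top-level prefixes that look like block ULIDs (26 Crockford base32
# characters) in the output of a non-recursive `aws s3 ls`, read from stdin.
count_block_prefixes() {
  # grep -c exits non-zero when the count is 0, so mask that status.
  LC_ALL=C grep -cE 'PRE [0-9A-HJKMNP-TV-Z]{26}/$' || true
}
```

Usage: `aws s3 ls s3://thanos-bucket/ | count_block_prefixes`. Non-block prefixes such as a debug directory are ignored by the pattern.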

Architecture pattern

Sidecar-based (per-Prometheus)

# Sidecar alongside Prometheus
- args:
  - sidecar
  - --tsdb.path=/prometheus
  - --prometheus.url=http://localhost:9090
  - --objstore.config-file=/etc/thanos/objstore.yaml
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
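
The `--objstore.config-file` flag above points at a small YAML file describing the bucket. A minimal sketch of generating one for S3; the function name and the bucket, endpoint, and region values are placeholders, and credentials are left out on the assumption that the pod obtains them from its environment or IAM role:

```shell
#!/usr/bin/env bash
# Writes a minimal S3 object-store config for Thanos to the given path.
# All values are caller-supplied placeholders; no credentials are written.
write_objstore_config() {
  local path="$1" bucket="$2" endpoint="$3" region="$4"
  cat > "$path" <<EOF
type: S3
config:
  bucket: "${bucket}"
  endpoint: "${endpoint}"
  region: "${region}"
EOF
}
```

Usage: `write_objstore_config /etc/thanos/objstore.yaml thanos-bucket s3.us-east-1.amazonaws.com us-east-1`. Sidecar, Store Gateway, and Compactor should all read the same bucket definition.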

Querier (global query)

- args:
  - query
  - --http-address=0.0.0.0:9090
  - --grpc-address=0.0.0.0:10901
  - --query.replica-label=prometheus_replica   # HA dedup
  # Newer Thanos releases deprecate --store in favor of --endpoint; check your version
  - --store=dnssrv+_grpc._tcp.thanos-sidecar.prometheus.svc
  - --store=dnssrv+_grpc._tcp.thanos-store.thanos.svc
  - --store=dnssrv+_grpc._tcp.thanos-ruler.thanos.svc
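
Query-time dedup only works if each Prometheus replica actually sets the label named in `--query.replica-label` as an external label. A rough check could look like the sketch below; it is a text match rather than a YAML parser, assumes `external_labels` is written in block style with one label per line, and uses a hypothetical function name:

```shell
#!/usr/bin/env bash
# Returns 0 if label $2 appears as a key inside the external_labels block of
# the Prometheus config file $1. Crude indentation-based scan, not a YAML parser.
has_external_label() {
  awk -v label="$2" '
    match($0, /^ *external_labels:/) {
      indent = RLENGTH - length("external_labels:"); in_block = 1; next
    }
    in_block {
      match($0, /^ */)
      # A non-blank line at the same or shallower indent ends the block.
      if (RLENGTH <= indent && $0 !~ /^ *$/) { in_block = 0; next }
      if ($1 == (label ":")) found = 1
    }
    END { exit found ? 0 : 1 }
  ' "$1"
}
```

Usage: `has_external_label /etc/prometheus/prometheus.yml prometheus_replica || echo "replica label missing"`.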

Store Gateway

- args:
  - store
  - --data-dir=/data
  - --objstore.config-file=/etc/thanos/objstore.yaml
  - --grpc-address=0.0.0.0:10901
  - --http-address=0.0.0.0:10902
  - --index-cache.config-file=/etc/thanos/index-cache.yaml
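
The file behind `--index-cache.config-file` selects the cache backend. A minimal sketch for the in-memory backend follows; the function name and the 1GB default are assumptions, and a shared Memcached or Redis backend would use a different `type` and fields:

```shell
#!/usr/bin/env bash
# Writes a minimal in-memory index cache config for the Store Gateway.
# max_size defaults to an arbitrary 1GB here; size it from observed hit rates.
write_index_cache_config() {
  local path="$1" max_size="${2:-1GB}"
  cat > "$path" <<EOF
type: IN-MEMORY
config:
  max_size: ${max_size}
EOF
}
```

Usage: `write_index_cache_config /etc/thanos/index-cache.yaml 2GB`.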

Compactor

- args:
  - compact
  - --data-dir=/data
  - --objstore.config-file=/etc/thanos/objstore.yaml
  - --wait
  - --retention.resolution-raw=30d
  - --retention.resolution-5m=180d
  - --retention.resolution-1h=730d
  - --deduplication.replica-label=prometheus_replica   # DESTRUCTIVE if wrong: must be the real replica label, or unrelated series get merged
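
The retention flags above interact with downsampling: if a resolution expires before the Compactor has produced the next one, the coarser data never exists. The sketch below encodes that as a sanity check. The thresholds (roughly 40 hours before 5m downsampling, roughly 10 days before 1h downsampling) are taken from the Thanos documentation as I understand it and should be verified for your version; the function name is hypothetical, and 0 is treated as "keep forever":

```shell
#!/usr/bin/env bash
# Warns when retention (whole days) is shorter than the age at which the
# Compactor is assumed to start downsampling. 0 means unlimited retention.
check_retention_days() {
  local raw="$1" five_m="$2" status=0
  if [ "$raw" -ne 0 ] && [ "$raw" -lt 2 ]; then
    echo "raw retention ${raw}d may expire blocks before 5m downsampling runs"
    status=1
  fi
  if [ "$five_m" -ne 0 ] && [ "$five_m" -lt 10 ]; then
    echo "5m retention ${five_m}d may expire blocks before 1h downsampling runs"
    status=1
  fi
  return "$status"
}
```

With the values shown above, `check_retention_days 30 180` prints nothing and returns 0.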

Common findings this catches

Sidecar shipping failures → S3 permissions, network, or S3 quota.
Querier partial response → a Store Gateway or Sidecar is down or unreachable.
Compactor halted → usually a permission or storage problem; manual cleanup may be needed.
Store Gateway slow → low cache hit rate; increase the cache size.
HA dedup wrong → replica label mismatch.
Long-range queries time out → enable downsampling.
Two Compactors on the same bucket → they race each other; run only one.
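
For the halted-Compactor case, the `thanos_compact_halted` gauge listed under the commands can be read straight from the component's metrics endpoint. A minimal sketch, assuming the metrics are served in Prometheus text format on the HTTP port; the function name and host name are hypothetical:

```shell
#!/usr/bin/env bash
# Reads Prometheus text-format metrics on stdin.
# Exits 0 if thanos_compact_halted is 1, non-zero otherwise.
compactor_halted() {
  awk '$1 == "thanos_compact_halted" && $2 + 0 == 1 { halted = 1 }
       END { exit halted ? 0 : 1 }'
}
```

Usage: `curl -s http://compactor:10902/metrics | compactor_halted && echo "compactor is halted, read its logs before restarting"`.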

When to escalate

Major architecture redesign: a strategic decision, not a debugging task.
Compactor state corruption: needs a deliberate recovery plan.
S3 lifecycle / cost optimization: hand off to FinOps.

Thanos Architecture & Component Debug Prompt