Loki Log Aggregation Design Prompt
Design Loki log aggregation — single-binary vs distributed, retention, label strategy, LogQL queries, multi-tenancy.
- Target user: Platform engineers running centralized logs
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has deployed Loki for log aggregation — single-binary at small scale, distributed at large, S3-backed retention.
I will provide:
- The deployment mode (single-binary, microservices, or simple scalable deployment, SSD)
- Log ingest rate
- Symptom (slow queries, ingest lag, retention issue, missing labels)
Your job:
1. **Loki architecture**:
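- **Distributor**: receives pushes, validates them, forwards to ingesters
- **Ingester**: buffers recent logs in memory; flushes chunks to object storage (S3 here)
- **Querier**: runs LogQL against ingesters and object storage
- **Query frontend**: splits queries and caches results
- **Compactor**: compacts index files and applies retention
- **Single-binary** mode runs all of these in one process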
2. **For label strategy**:
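- Labels are what Loki indexes; log content is not indexed
- Each unique label combination = a stream
- Too many streams = ingester memory pressure and many small chunks
- Too few labels = queries scan more data than they need
- Rule of thumb: 10-20 label names at most, each with a bounded set of values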
3. **For LogQL**:
- `{labels} |= "filter"` — like grep
- `{labels} |~ "regex"` — regex
- `{labels} | json` — parse JSON
- `rate(...[5m])` — log rate
4. **For retention**:
- Per-tenant retention
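- Compactor deletes expired data (retention must be enabled on it)
- S3 lifecycle rule as a backstop, set longer than Loki's retention so it never removes chunks Loki still indexes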
5. **For ingest rate**:
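- Size ingesters for the sustained ingest rate, with headroom for bursts
- On overflow Loki rejects pushes with HTTP 429; clients must retry with backoff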
6. **For label cardinality**:
- Avoid high-card labels (request ID)
- Keep such values in the log line and filter on them at query time
- Use `| logfmt`/`| json` to extract fields without indexing
7. **For multi-tenant**:
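- `X-Scope-OrgID` header identifies the tenant on every push and query (requires `auth_enabled: true`)
- Per-tenant limits (ingest rate, stream count, retention)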
8. **For Grafana dashboards**:
- Loki data source, queried with LogQL
- Loki + Prometheus correlation
- Derived fields to link log lines to traces
Mark DESTRUCTIVE: deleting old streams (loses logs), high-card labels causing ingester OOM, retention too low (compliance issue).
---
Deployment: [DESCRIBE]
Ingest rate: [DESCRIBE]
Symptom: [DESCRIBE]
Why this prompt works
Loki is increasingly deployed alongside Prometheus. This prompt walks through the main design decisions: deployment mode, label strategy, retention, and multi-tenancy.
How to use it
- Pick the deployment mode that matches your scale.
- Choose labels carefully; they define the index.
- Plan retention before data accumulates.
- Use Promtail or the OpenTelemetry Collector for collection.
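When choosing labels, it can help to estimate how many streams a label set could produce. The sketch below assumes the worst case, where every combination of label values occurs, so the stream count is the product of each label's distinct values. The label names and counts are hypothetical.

```python
from math import prod


def estimate_streams(label_cardinalities: dict[str, int]) -> int:
    """Worst-case stream count: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label set with bounded values.
bounded = {"namespace": 5, "app": 40, "level": 4}
# The same set with a per-request label added.
with_request_id = {**bounded, "request_id": 100_000}

print(estimate_streams(bounded))          # 800
print(estimate_streams(with_request_id))  # 80000000
```

Real deployments rarely hit the full product, but the estimate shows why a single unbounded label dominates everything else.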
Useful commands
# Status
curl http://loki:3100/ready
curl http://loki:3100/metrics
# Logcli
logcli labels # all labels
logcli labels job # values for label 'job'
logcli query '{job="myapp"} |= "error"'
# Promtail
sudo systemctl status promtail
sudo journalctl -u promtail -f
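The same kind of query can be issued against Loki's HTTP API directly. The following is a minimal sketch that only builds a request for `/loki/api/v1/query_range` with a tenant header; it does not send it. The host `loki:3100` comes from the commands above, while the tenant name `team-a` and the helper name `build_query_range` are assumptions for illustration.

```python
import time
import urllib.parse
import urllib.request


def build_query_range(base_url, logql, tenant, minutes=60, limit=100, now_ns=None):
    """Build (but do not send) a query_range request for the last `minutes`."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    params = {
        "query": logql,
        # start and end are Unix timestamps in nanoseconds
        "start": str(now_ns - minutes * 60 * 1_000_000_000),
        "end": str(now_ns),
        "limit": str(limit),
    }
    url = f"{base_url}/loki/api/v1/query_range?{urllib.parse.urlencode(params)}"
    # X-Scope-OrgID selects the tenant when auth_enabled is true
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant})


req = build_query_range("http://loki:3100", '{job="myapp"} |= "error"', tenant="team-a")
print(req.full_url)
print(req.header_items())
```

Passing the request to `urllib.request.urlopen(req)` would execute it; the response is JSON with the matching streams under `data.result`.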
LogQL examples
# Basic filter
{namespace="production",app="web"} |= "error"
# Regex
{job="systemd-journal"} |~ "failed|error"
# JSON parsing
{job="myapp"} | json | level="error"
# Logfmt
{job="myapp"} | logfmt | duration > 1s
# Metric query
rate({job="myapp"}[5m])
# Log lines per minute by level, dropping lines that failed JSON parsing
sum by (level)(count_over_time({job="myapp"} | json | __error__="" [1m]))
# Top sources
topk(10, sum by (instance)(rate({job="myapp"}[5m])))
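To make the pipeline semantics concrete, here is a rough Python imitation of `|= "error" | json | __error__="" | level="error"` applied to a handful of lines. It is a simplified model, not Loki's implementation: nested JSON objects are skipped here rather than flattened, and the sample lines are made up.

```python
import json


def json_stage(line: str) -> dict:
    """Imitate `| json`: extracted fields, or an __error__ label on failure."""
    try:
        parsed = json.loads(line)
    except json.JSONDecodeError:
        return {"__error__": "JSONParserErr"}
    if not isinstance(parsed, dict):
        return {"__error__": "JSONParserErr"}
    # Simplification: keep only scalar top-level fields.
    return {k: str(v) for k, v in parsed.items() if not isinstance(v, (dict, list))}


def run_pipeline(lines, contains, level):
    kept = []
    for line in lines:
        if contains not in line:                 # |= "error"
            continue
        labels = json_stage(line)                # | json
        if labels.get("__error__", "") != "":    # | __error__=""
            continue
        if labels.get("level") == level:         # | level="error"
            kept.append(line)
    return kept


sample = [
    '{"level":"error","msg":"db connection error"}',
    '{"level":"info","msg":"no error here"}',
    "panic: unexpected error",
]
print(run_pipeline(sample, contains="error", level="error"))
```

Only the first line survives: the second matches the line filter but has the wrong level, and the third is not valid JSON.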
Promtail config (Kubernetes pods)
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - cri: {}
      - json:
          expressions:
            level: level
            msg: msg
      - labels:
          level:
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
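The pipeline stages above can be pictured as three small transformations. The sketch below imitates them in Python for a single CRI-formatted line (`<timestamp> <stream> <flags> <content>`); it is an illustration of the idea, not Promtail's code, and it ignores partial-line handling. The function names and the sample line are hypothetical.

```python
import json


def cri_stage(raw: str) -> dict:
    """Split a CRI log line into its four space-separated parts."""
    timestamp, stream, flags, content = raw.split(" ", 3)
    return {"time": timestamp, "stream": stream, "flags": flags, "content": content}


def process(raw: str, labels: dict) -> tuple[dict, str]:
    entry = cri_stage(raw)                       # - cri: {}
    extracted = {}
    try:
        body = json.loads(entry["content"])      # - json: expressions
        if isinstance(body, dict):
            extracted = {k: body[k] for k in ("level", "msg") if k in body}
    except json.JSONDecodeError:
        pass                                     # non-JSON lines pass through unlabeled
    out = dict(labels)
    if "level" in extracted:                     # - labels: level
        out["level"] = str(extracted["level"])
    return out, entry["content"]


raw = '2024-01-01T00:00:00.000000000Z stdout F {"level":"error","msg":"boom"}'
print(process(raw, {"app": "web", "namespace": "production"}))
```

Only `level` becomes a label; `msg` is extracted but stays out of the index, which keeps the stream count bounded by the number of log levels.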
Loki config (distributed)
auth_enabled: true            # multi-tenant: every request needs X-Scope-OrgID

server:
  http_listen_port: 3100

distributor:
  ring:
    kvstore: { store: memberlist }

ingester:
  lifecycler:
    ring: { kvstore: { store: memberlist } }
  chunk_target_size: 1572864  # 1.5 MiB
  max_chunk_age: 1h

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    s3: s3://us-east-1/my-loki-bucket

limits_config:
  retention_period: 720h      # 30 days
  ingestion_rate_mb: 10       # per tenant, MB/s
  max_label_value_length: 256
  max_label_name_length: 1024
  max_streams_per_user: 10000

compactor:
  working_directory: /loki/compactor
  shared_store: s3            # Loki 2.x option; check the upgrade notes for your version, as 3.x uses delete_request_store instead
  retention_enabled: true
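A quick back-of-the-envelope check ties the `retention_period` and `ingestion_rate_mb` values above to an expected storage footprint. The sketch assumes a steady ingest rate and a compression ratio that you supply; the ratio of 8 used below is purely an assumption, and real ratios depend on the log content and chunk encoding.

```python
def sizing(ingest_mb_per_s, retention_hours, compression_ratio, ingestion_rate_mb):
    """Estimate retained storage and compare ingest rate with the tenant limit."""
    raw_gb = ingest_mb_per_s * 3600 * retention_hours / 1024
    return {
        "raw_gb": raw_gb,
        "stored_gb": raw_gb / compression_ratio,
        # True means pushes would be rate limited at this sustained rate
        "over_limit": ingest_mb_per_s > ingestion_rate_mb,
    }


# 4 MB/s sustained, 720h retention and a 10 MB/s limit as in the config above.
print(sizing(ingest_mb_per_s=4, retention_hours=720, compression_ratio=8, ingestion_rate_mb=10))
```

For these inputs the raw volume is 10125 GB over 30 days, or roughly 1266 GB stored under the assumed ratio.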
Common findings this catches
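- Stream count exploding → a high-cardinality label; find it and drop it from the label set.
- Ingester OOM → label cardinality or ingest rate too high.
- Queries slow → time range too large; narrow it and use query frontend splitting and caching.
- Logs delayed → ingest backpressure (rate limits or overloaded ingesters).
- Old logs vanished → retention period or an S3 lifecycle rule expired them.
- Missing logs from pods → Promtail config; check service discovery and relabel selectors.
- High S3 costs → tune retention; use lifecycle rules to move old objects to cold storage.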
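For the first finding, one way to locate the offending label is to count distinct values per label across a sample of streams. The sketch below assumes the label sets have already been fetched (for example from a series listing) into a list of dictionaries; the sample data and the threshold are hypothetical.

```python
from collections import defaultdict


def label_cardinality(streams: list[dict]) -> dict:
    """Count distinct values seen for each label name."""
    values = defaultdict(set)
    for labels in streams:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def high_cardinality(streams: list[dict], threshold: int) -> list:
    """Label names whose distinct-value count exceeds the threshold."""
    return sorted(n for n, c in label_cardinality(streams).items() if c > threshold)


# Hypothetical sample: request_id takes a new value on every stream.
streams = [
    {"app": "web", "namespace": "production", "request_id": f"req-{i}"}
    for i in range(50)
]
print(label_cardinality(streams))
print(high_cardinality(streams, threshold=10))
```

Here `request_id` is the only label flagged, which matches the advice to keep such identifiers in the log line rather than in the label set.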
When to escalate
- Scale issues at large volume: plan a move to a distributed deployment.
- Compliance or legal retention requirements: coordinate with the teams that own them.
- Multi-tenancy: a strategic, organization-wide decision.