
Loki Log Aggregation Design Prompt

Design Loki log aggregation — single-binary vs distributed, retention, label strategy, LogQL queries, multi-tenancy.

Target user
Platform engineers running centralized logs
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior platform engineer who has deployed Loki for log aggregation — single-binary at small scale, distributed at large, S3-backed retention.

I will provide:
- The deployment mode (single-binary/monolithic, simple scalable (SSD), or microservices)
- Log ingest rate
- Symptom (slow queries, ingest lag, retention issue, missing labels)

Your job:

1. **Loki architecture**:
   - **Distributor** — receives logs; validates
   - **Ingester** — buffers; flushes to S3
   - **Querier** — queries data
   - **Query frontend** — caches, splits queries
   - **Compactor** — compacts indexes
   - **Single-binary** combines all
2. **For label strategy**:
   - Labels = how logs are indexed
   - Each unique combo = a stream
   - Too many = ingester memory
   - Too few = slow queries
   - Rule: keep labels few and low-cardinality; recent Loki versions default to at most 15 label names per stream
3. **For LogQL**:
   - `{labels} |= "filter"` — like grep
   - `{labels} |~ "regex"` — regex
   - `{labels} | json` — parse JSON
   - `rate(...[5m])` — log rate
4. **For retention**:
   - Per-tenant retention
   - Compactor deletes old data
   - S3 lifecycle as backup
5. **For ingest rate**:
   - Match ingester capacity
   - Backpressure on overflow
6. **For label cardinality**:
   - Avoid high-card labels (request ID)
   - Keep such values in the log line itself and filter on them at query time
   - Use `| logfmt`/`| json` to extract fields without indexing
7. **For multi-tenant**:
   - X-Scope-OrgID header
   - Per-tenant limits
8. **For dashboard with Grafana**:
   - Loki data source (queried with LogQL)
   - Loki + Prometheus correlation
   - Derived fields

Mark DESTRUCTIVE: deleting old streams (loses logs), high-card labels causing ingester OOM, retention too low (compliance issue).

---

Deployment: [DESCRIBE]
Ingest rate: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

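Loki is increasingly deployed alongside Prometheus, and most Loki problems trace back to a handful of design decisions: deployment mode, label strategy, retention, and tenant limits. The prompt walks the model through each of these so its answer covers the whole design rather than a single symptom.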

How to use it

  1. Pick the deployment mode that matches your scale.
  2. Choose labels carefully; they define the index.
  3. Plan retention before data starts accumulating.
  4. Use Promtail or an OpenTelemetry Collector for log collection.

Useful commands

# Status
curl http://loki:3100/ready
curl http://loki:3100/metrics

# Logcli (reads the server address from LOKI_ADDR)
export LOKI_ADDR=http://loki:3100
logcli labels                              # all labels
logcli labels job                          # values for label 'job'
logcli query '{job="myapp"} |= "error"'

# Promtail
sudo systemctl status promtail
sudo journalctl -u promtail -f
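
The `/metrics` endpoint returns Prometheus text format, so a quick script can pull one number out of it. The sketch below sums a metric across all its label sets. It assumes your Loki version exposes `loki_ingester_memory_streams` (in-memory streams per tenant); the sample text is made up for illustration, and in practice you would pass in the body returned by `curl http://loki:3100/metrics`.

```python
def sum_metric(metrics_text, name):
    """Sum all samples of one metric in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The sample value is the last space-separated token on the line
        # (assumes no trailing timestamp, which is an assumption about the output).
        series, _, value = line.rpartition(" ")
        metric_name = series.split("{", 1)[0]
        if metric_name == name:
            total += float(value)
    return total


# Illustrative sample, not real output.
SAMPLE = """\
# TYPE loki_ingester_memory_streams gauge
loki_ingester_memory_streams{tenant="team-a"} 1200
loki_ingester_memory_streams{tenant="team-b"} 300
loki_ingester_memory_chunks 4500
"""

active_streams = sum_metric(SAMPLE, "loki_ingester_memory_streams")
print(active_streams)
```

Watching this number over time is one way to notice a stream count that is growing faster than expected.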

LogQL examples

# Basic filter
{namespace="production",app="web"} |= "error"

# Regex
{job="systemd-journal"} |~ "failed|error"

# JSON parsing
{job="myapp"} | json | level="error"

# Logfmt
{job="myapp"} | logfmt | duration > 1s

# Metric query
rate({job="myapp"}[5m])

# Count lines per minute by level (skipping lines that failed JSON parsing)
sum by (level)(count_over_time({job="myapp"} | json | __error__="" [1m]))

# Top sources
topk(10, sum by (instance)(rate({job="myapp"}[5m])))
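
The same queries can be sent to Loki's HTTP API outside Grafana. The sketch below only builds the request URL for the `/loki/api/v1/query_range` endpoint, so it runs without a Loki server. The helper name `build_query_range_url`, the host `http://loki:3100`, and the timestamps are illustrative assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_range_url(base_url, logql, start_ns, end_ns,
                          limit=100, direction="backward"):
    """Return a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": logql,
        "start": str(start_ns),  # Unix epoch in nanoseconds
        "end": str(end_ns),
        "limit": str(limit),     # maximum number of log lines returned
        "direction": direction,  # "backward" returns newest entries first
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


url = build_query_range_url(
    "http://loki:3100",
    '{job="myapp"} |= "error"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_300_000_000_000,
)
print(url)
```

With `auth_enabled: true`, the actual request would also need an `X-Scope-OrgID` header naming the tenant.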

Promtail config (Kubernetes pods)

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  pipeline_stages:
  - cri: {}
  - json:
      expressions:
        level: level
        msg: msg
  - labels:
      level:
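  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: app
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  # Tell Promtail which files to tail; without __path__ no pod logs are read.
  - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
    separator: /
    target_label: __path__
    replacement: /var/log/pods/*$1/*.log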
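
This config turns four fields into labels (`app`, `namespace`, `pod`, `level`), and every distinct combination becomes a stream. A rough worst-case estimate is the product of the distinct values per label. The sketch below uses made-up cardinalities and a hypothetical per-tenant limit of 10,000 streams; real counts are usually lower than the product because labels are correlated (a pod belongs to one app and one namespace).

```python
from math import prod


def estimate_streams(label_cardinality):
    """Worst-case stream count: product of distinct values per label."""
    return prod(label_cardinality.values())


# Hypothetical cardinalities; replace with numbers from your cluster.
labels = {"namespace": 5, "app": 20, "pod": 200, "level": 4}
MAX_STREAMS = 10_000  # example per-tenant limit, an assumption

worst_case = estimate_streams(labels)
without_pod = estimate_streams({k: v for k, v in labels.items() if k != "pod"})

print(f"all labels:  {worst_case} streams, over limit: {worst_case > MAX_STREAMS}")
print(f"without pod: {without_pod} streams, over limit: {without_pod > MAX_STREAMS}")
```

In this example the `pod` label dominates the estimate, which is why pod names and similar churning values are the first candidates to drop or keep in the log line instead.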

Loki config (distributed)

auth_enabled: true

server:
  http_listen_port: 3100

distributor:
  ring:
    kvstore: { store: memberlist }

ingester:
  lifecycler:
    ring: { kvstore: { store: memberlist } }
  chunk_target_size: 1572864
  max_chunk_age: 1h

schema_config:
  configs:
  - from: 2024-01-01
    store: tsdb
    object_store: s3
    schema: v13
    index:
      prefix: loki_index_
      period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    s3: s3://us-east-1/my-loki-bucket

limits_config:
  retention_period: 720h        # 30 days
  ingestion_rate_mb: 10
  max_label_value_length: 256
  max_label_name_length: 1024
  max_streams_per_user: 10000

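compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  # Loki 3.x removed `shared_store`; retention there requires delete_request_store.
  # On Loki 2.x, use `shared_store: s3` here instead.
  delete_request_store: s3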
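
With `auth_enabled: true`, every push and query must carry an `X-Scope-OrgID` header naming the tenant. The sketch below builds the headers and JSON body for `POST /loki/api/v1/push` without sending anything. The helper name `build_push_request`, the tenant `team-a`, and the label values are illustrative assumptions.

```python
import json
import time


def build_push_request(tenant_id, labels, lines, now_ns=None):
    """Build headers and JSON body for POST /loki/api/v1/push."""
    now_ns = time.time_ns() if now_ns is None else now_ns
    body = {
        "streams": [
            {
                "stream": labels,  # label set identifying the stream
                # Each entry is [timestamp in ns as a string, log line].
                "values": [[str(now_ns + i), line] for i, line in enumerate(lines)],
            }
        ]
    }
    headers = {
        "Content-Type": "application/json",
        "X-Scope-OrgID": tenant_id,  # selects the tenant in multi-tenant mode
    }
    return headers, json.dumps(body)


headers, body = build_push_request(
    "team-a",
    {"job": "myapp", "level": "error"},
    ["connection refused", "retrying in 5s"],
    now_ns=1_700_000_000_000_000_000,
)
print(headers)
print(body)
```

Sending the result with any HTTP client to `http://loki:3100/loki/api/v1/push` is a quick way to confirm that per-tenant limits apply to the tenant you expect.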

Common findings this catches

  • Stream count exploding → high-card label; review.
  • Ingester OOM → label cardinality OR rate too high.
  • Queries slow → time range too large; use query frontend cache.
  • Logs delayed → ingest backpressure.
  • Old logs vanished → retention; S3 lifecycle.
  • Missing logs from pods → check the Promtail scrape config: relabel rules and the `__path__` selector.
  • High S3 costs → tune retention; lifecycle to cold storage.
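
For the storage-cost finding, a back-of-the-envelope estimate helps before touching retention. The sketch below multiplies a sustained ingest rate by the retention window and divides by a compression ratio. The ratio of 8 is purely an assumption for illustration; measure your own from bucket size versus ingested bytes.

```python
def retained_bytes(ingest_mb_per_s, retention_hours, compression_ratio):
    """Approximate bytes held in object storage at steady state."""
    raw_bytes = ingest_mb_per_s * 1024 * 1024 * 3600 * retention_hours
    return raw_bytes / compression_ratio


TIB = 1024 ** 4

# Example inputs: 10 MB/s sustained, 720h (30 days), assumed 8x compression.
estimate = retained_bytes(ingest_mb_per_s=10, retention_hours=720, compression_ratio=8)
print(f"{estimate / TIB:.2f} TiB retained")
```

The estimate ignores index size and replication, so treat it as a lower bound when comparing against the S3 bill.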

When to escalate

  • Scale issues at large volume — distributed deployment.
  • Compliance / legal retention — coordinate.
  • Multi-tenancy — strategic.
