You are a senior platform engineer who has deployed Loki for log aggregation: single-binary at small scale, distributed at large scale, with S3-backed retention.

I will provide:
- The deployment mode (single-binary, microservices, or SSD, the simple scalable deployment)
- Log ingest rate
- Symptom (slow queries, ingest lag, retention issue, missing labels)

Your job:

1. **Loki architecture**:
   - **Distributor**: receives logs; validates them
   - **Ingester**: buffers logs in memory; flushes chunks to S3
   - **Querier**: runs queries against ingesters and object storage
   - **Query frontend**: caches results, splits queries
   - **Compactor**: compacts indexes and applies retention
   - **Single-binary** combines all components in one process
2. **For label strategy**:
   - Labels = how logs are indexed
   - Each unique combination of label values = a stream
   - Too many streams = ingester memory pressure
   - Too few labels = slow queries that scan more data
   - Rule: 10-20 labels max, each with a bounded set of values
3. **For LogQL**:
   - `{labels} |= "filter"`: substring match, like grep
   - `{labels} |~ "regex"`: regex match
   - `{labels} | json`: parse JSON
   - `rate(...[5m])`: log lines per second
4. **For retention**:
   - Per-tenant retention
   - Compactor deletes old data
   - S3 lifecycle as backup
5. **For ingest rate**:
   - Match ingester capacity
   - Backpressure on overflow
6. **For label cardinality**:
   - Avoid high-cardinality labels (request ID)
   - Put such values in the log line itself and filter at query time
   - Use `| logfmt`/`| json` to extract fields without indexing
7. **For multi-tenant**:
   - X-Scope-OrgID header
   - Per-tenant limits
8. **For dashboards with Grafana**:
   - Loki data source (queried with LogQL)
   - Loki + Prometheus correlation
   - Derived fields

Mark DESTRUCTIVE: deleting old streams (loses logs), high-cardinality labels causing ingester OOM, retention too low (compliance issue).

---

Deployment: [DESCRIBE]
Ingest rate: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

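Loki is increasingly common alongside Prometheus. This prompt walks through the main design decisions: architecture, label strategy, LogQL, retention, ingest limits, and multi-tenancy.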

How to use it

Pick the deployment mode that matches your scale.
Choose labels carefully; they define the index.
Plan retention up front.
Use Promtail or an OpenTelemetry collector for collection.
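
To make the label advice concrete, here is a minimal sketch of why one unbounded label dominates stream count. It assumes the worst case where every combination of label values occurs; the label names and counts are hypothetical, not measurements.

```python
from math import prod


def estimate_streams(label_values: dict[str, int]) -> int:
    """Upper bound on stream count: the product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label sets; the value counts are illustrative only.
bounded = {"namespace": 5, "app": 40, "level": 4}
with_request_id = {**bounded, "request_id": 100_000}

print(estimate_streams(bounded))          # 800
print(estimate_streams(with_request_id))  # 80000000
```

Real stream counts are usually lower than this bound because not every combination appears, but the multiplication shows why a request ID belongs in the log line rather than in a label.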

Useful commands

# Status
curl http://loki:3100/ready
curl http://loki:3100/metrics

# Logcli (reads the server address from LOKI_ADDR or --addr)
export LOKI_ADDR=http://loki:3100
logcli labels                              # all labels
logcli labels job                          # values for label 'job'
logcli query '{job="myapp"} |= "error"'

# Promtail
sudo systemctl status promtail
sudo journalctl -u promtail -f
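
The same query can be issued against Loki's HTTP API. The sketch below only builds the request for the `/loki/api/v1/query_range` endpoint and does not send it; the host `loki:3100` comes from the commands above, while the tenant name `team-a` and the helper name `build_query_range` are assumptions for illustration. The `X-Scope-OrgID` header matters only when `auth_enabled: true`.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request


def build_query_range(base_url, logql, start_ns, end_ns, tenant=None, limit=100):
    """Build (but do not send) a GET request for Loki's query_range endpoint."""
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    # Multi-tenant Loki identifies the tenant through this header.
    headers = {"X-Scope-OrgID": tenant} if tenant else {}
    return Request(f"{base_url}/loki/api/v1/query_range?{params}", headers=headers)


now_ns = time.time_ns()
one_hour_ns = 3600 * 10**9
req = build_query_range(
    "http://loki:3100",
    '{job="myapp"} |= "error"',
    start_ns=now_ns - one_hour_ns,  # timestamps as Unix epoch nanoseconds
    end_ns=now_ns,
    tenant="team-a",  # hypothetical tenant ID
)
print(req.full_url)
# urllib.request.urlopen(req) would send the request to a reachable Loki.
```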

LogQL examples

# Basic filter
{namespace="production",app="web"} |= "error"

# Regex
{job="systemd-journal"} |~ "failed|error"

# JSON parsing
{job="myapp"} | json | level="error"

# Logfmt
{job="myapp"} | logfmt | duration > 1s

# Metric query
rate({job="myapp"}[5m])

# Count log lines per minute by level (skips lines that failed JSON parsing)
sum by (level)(count_over_time({job="myapp"} | json | __error__="" [1m]))

# Top sources
topk(10, sum by (instance)(rate({job="myapp"}[5m])))
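
The difference between `rate` and a per-window count is easy to miss: `rate` is per second, so it is the count in the window divided by the window length in seconds. The sketch below is a simplified model of that arithmetic over a list of timestamps, not Loki's implementation; the timestamps are made up.

```python
def count_over_time(timestamps, end, window_s):
    """Number of entries with a timestamp in (end - window_s, end]."""
    return sum(1 for t in timestamps if end - window_s < t <= end)


def rate(timestamps, end, window_s):
    """Per-second rate: entries in the window divided by the window length."""
    return count_over_time(timestamps, end, window_s) / window_s


# Hypothetical entry timestamps, in seconds.
entries = [1, 5, 20, 30, 42, 59, 61, 90]

print(count_over_time(entries, end=60, window_s=60))  # 6
print(rate(entries, end=60, window_s=60))             # 0.1
```

When a dashboard panel should read "lines per minute", a count over a `[1m]` window gives that number directly, while `rate` needs multiplying by 60.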

Promtail config (Kubernetes pods)

scrape_configs:
- job_name: kubernetes-pods
  kubernetes_sd_configs:
  - role: pod
  pipeline_stages:
  - cri: {}
  - json:
      expressions:
        level: level
        msg: msg
  - labels:
      level:
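  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_label_app]
    target_label: app
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
  # Tell Promtail which files to tail (standard kubelet log layout);
  # without __path__ no pod logs are read.
  - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
    separator: /
    target_label: __path__
    replacement: /var/log/pods/*$1/*.log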
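
As a rough illustration of what the `json` and `labels` stages above do to one log line, the sketch below parses the line, extracts `level`, and promotes only that field to a label. It is a simplified model, not Promtail's code; the function name `apply_pipeline` and the sample values are hypothetical.

```python
import json


def apply_pipeline(line: str, base_labels: dict[str, str]) -> dict[str, str]:
    """Mimic the json + labels stages: extract `level`, promote it to a label."""
    labels = dict(base_labels)
    try:
        fields = json.loads(line)
    except json.JSONDecodeError:
        return labels  # non-JSON lines keep only the labels from relabeling
    if isinstance(fields, dict) and isinstance(fields.get("level"), str):
        labels["level"] = fields["level"]
    # `msg` is extracted in the config but never promoted, so it stays
    # in the log line and adds no streams.
    return labels


base = {"app": "web", "namespace": "production", "pod": "web-7d9f"}
print(apply_pipeline('{"level": "error", "msg": "upstream timeout"}', base))
print(apply_pipeline("plain text line", base))
```

Promoting `level` is cheap because it has a handful of values; promoting `msg` the same way would create a new stream for nearly every distinct message.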

Loki config (distributed)

auth_enabled: true

server:
  http_listen_port: 3100

distributor:
  ring:
    kvstore: { store: memberlist }

ingester:
  lifecycler:
    ring: { kvstore: { store: memberlist } }
  chunk_target_size: 1572864
  max_chunk_age: 1h

schema_config:
  configs:
  - from: 2024-01-01
    store: tsdb
    object_store: s3
    schema: v13
    index:
      prefix: loki_index_
      period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
  aws:
    s3: s3://us-east-1/my-loki-bucket

limits_config:
  retention_period: 720h        # 30 days
  ingestion_rate_mb: 10
  max_label_value_length: 256
  max_label_name_length: 1024
  max_streams_per_user: 10000

compactor:
  working_directory: /loki/compactor
  shared_store: s3              # Loki 2.x only; removed in Loki 3.0
  retention_enabled: true       # on Loki 3.x, also set delete_request_store: s3
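
For capacity planning it helps to turn these limits into a storage figure. The sketch below estimates stored volume for one tenant that ingests at the configured `ingestion_rate_mb` for the whole retention period. The compression ratio of 8 is an assumption for illustration (real ratios depend on the log content), and sustained ingest at the limit is an upper bound rather than a typical load.

```python
def estimate_storage_gib(ingest_mb_per_s, retention_hours, compression_ratio):
    """Rough stored volume in GiB for a constant ingest rate over the retention period."""
    raw_mb = ingest_mb_per_s * 3600 * retention_hours
    return raw_mb / compression_ratio / 1024


retention_hours = 720          # retention_period: 720h
print(retention_hours / 24)    # 30.0 days

# 10 MB/s matches ingestion_rate_mb above; the ratio of 8 is assumed.
print(round(estimate_storage_gib(10, retention_hours, compression_ratio=8)))  # 3164
```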

Common findings this catches

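Stream count exploding → a high-cardinality label; review labels and move the offender into the log line.
Ingester OOM → label cardinality or ingest rate too high.
Queries slow → time range too large or selector too broad; narrow both and use the query frontend cache.
Logs delayed → ingest backpressure; check rate limits and ingester capacity.
Old logs vanished → the retention period or an S3 lifecycle rule expired them.
Missing logs from pods → Promtail config; check the discovery selector and relabel rules.
High S3 costs → tune retention; add a lifecycle rule to move data to cold storage.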

When to escalate

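Scale issues at large volume: move to a distributed deployment.
Compliance or legal retention requirements: coordinate with compliance or legal before changing retention.
Multi-tenancy: a strategic decision about tenant boundaries and per-tenant limits, not a per-incident fix.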
