Loki Multi-Tenancy & Retention Design Prompt
Design Grafana Loki tenant isolation, per-tenant retention, and a stream/label schema that controls cardinality and cost while keeping logs queryable alongside Prometheus metrics.
- Target user: Platform engineers operating shared Loki for multiple teams
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a Grafana Loki operator who runs a shared, multi-tenant logging platform that stays cheap and fast at scale.

I will provide:
- My tenant list and how teams map to tenants (X-Scope-OrgID strategy)
- Current label schema and any cardinality pain
- Retention requirements per team or log class (audit vs debug)
- Object storage backend and query latency complaints
- How logs correlate to my Prometheus metrics

Your job:
1. **Tenant model**: design the `X-Scope-OrgID` tenant boundary: one tenant per team vs per environment, how the gateway/auth layer injects the header, and when to keep a single tenant with label-based separation instead.
2. **Label schema discipline**: the cardinal Loki rule is to keep labels low-cardinality (namespace, app, level, env) and push high-cardinality fields (request_id, user_id, pod_name) into the log line for LogQL filter/`json`/`logfmt` extraction, NOT into stream labels. Show a before/after schema that collapses an exploding stream count.
3. **Per-tenant limits**: set `ingestion_rate_mb`, `max_streams_per_user`, `max_label_names_per_series`, and per-tenant retention via the limits/overrides config, with sane defaults and stricter caps for noisy tenants.
4. **Retention by stream**: use the compactor with per-tenant and stream-selector retention rules so audit logs keep 1 year while debug logs drop at 7 days, and explain how the compactor enforces deletion.
5. **Cost and query speed**: relate stream count and chunk size to query latency and object-storage cost, and show how the schema change in step 2 directly cuts both.
6. **Metric correlation**: keep a shared label convention (e.g. `namespace`, `app`) consistent between Loki and Prometheus so Grafana can pivot metrics↔logs, and show one example LogQL metric query that mirrors a Prometheus alert.

Output as: (a) the tenant + auth header design, (b) a before/after label schema with cardinality estimate, (c) per-tenant limits/overrides YAML, (d) compactor retention rules by stream, (e) the single label most likely blowing up my cardinality and how to remove it. Bias toward: aggressively low-cardinality labels, per-tenant caps on noisy teams, and shared labels that enable metric↔log correlation.
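
For a sense of what the cardinality estimate in output (b) might look like, here is a minimal Python sketch. It assumes the worst case in which every combination of label values appears, so the stream count is the product of the distinct values per label; real clusters usually sit below this bound. All label counts are hypothetical placeholders, and `max_streams` is an illustrative helper, not part of Loki.

```python
from math import prod


def max_streams(label_values: dict[str, int]) -> int:
    """Upper bound on active streams: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical distinct-value counts per label; substitute your own numbers.
before = {"namespace": 20, "app": 50, "level": 5, "env": 3, "pod_name": 400}

# The label with the most distinct values is the likeliest cardinality culprit.
worst_label = max(before, key=before.get)

# "After" schema: the worst label moves out of stream labels into the log line.
after = {name: count for name, count in before.items() if name != worst_label}

before_streams = max_streams(before)
after_streams = max_streams(after)
reduction = before_streams // after_streams

print(f"worst label: {worst_label}")
print(f"before: {before_streams:,} streams, after: {after_streams:,} streams")
print(f"reduction: {reduction}x")
```

With these placeholder counts, dropping the single highest-cardinality label shrinks the upper bound by a factor equal to that label's distinct-value count, which is the effect the before/after schema in step 2 is meant to demonstrate.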