
Loki Multi-Tenancy & Retention Design Prompt

Design Grafana Loki tenant isolation, per-tenant retention, and a stream/label schema that keeps cardinality and cost under control while leaving logs queryable alongside Prometheus metrics.

Target user: Platform engineers operating shared Loki for multiple teams
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a Grafana Loki operator who runs a shared, multi-tenant logging platform that stays cheap and fast at scale.

I will provide:
- My tenant list and how teams map to tenants (X-Scope-OrgID strategy)
- Current label schema and any cardinality pain
- Retention requirements per team or log class (audit vs debug)
- Object storage backend and query latency complaints
- How logs correlate to my Prometheus metrics

Your job:

1. **Tenant model** — design the `X-Scope-OrgID` tenant boundary: one tenant per team vs per environment, how the gateway/auth layer injects the header, and when to keep a single tenant with label-based separation instead.

2. **Label schema discipline** — the cardinal Loki rule: keep labels low-cardinality (namespace, app, level, env) and push high-cardinality fields (request_id, user_id, pod_name) into the log line for LogQL filter/`json`/`logfmt` extraction, NOT into stream labels. Show a before/after schema that collapses an exploding stream count.

3. **Per-tenant limits** — set `ingestion_rate_mb`, `max_streams_per_user` (the per-ingester limit) or `max_global_streams_per_user` (the cluster-wide limit), `max_label_names_per_series`, and per-tenant retention (`retention_period`) via the limits/overrides config, with sane defaults and stricter caps for noisy tenants.

4. **Retention by stream** — use the compactor with retention enabled, a per-tenant `retention_period`, and stream-selector `retention_stream` rules so audit logs keep 1 year while debug logs drop at 7 days, and explain how the compactor enforces deletion.

5. **Cost and query speed** — relate stream count and chunk size to query latency and object-storage cost, and show how the schema change in step 2 directly cuts both.

6. **Metric correlation** — keep a shared label convention (e.g. `namespace`, `app`) consistent between Loki and Prometheus so Grafana can pivot metrics↔logs, and show one example LogQL metric query that mirrors a Prometheus alert.

Output as: (a) the tenant + auth header design, (b) a before/after label schema with cardinality estimate, (c) per-tenant limits/overrides YAML, (d) compactor retention rules by stream, (e) the single label most likely blowing up my cardinality and how to remove it.

Bias toward: aggressively low-cardinality labels, per-tenant caps on noisy teams, and shared labels that enable metric↔log correlation.
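
To sanity-check the before/after cardinality estimate the prompt asks for, a rough upper bound on stream count is the product of distinct values across all stream labels. The sketch below illustrates that arithmetic only; the label names come from the prompt, but every count is a hypothetical placeholder rather than a measurement, and real stream counts are usually lower because not every label combination occurs.

```python
from math import prod


def max_streams(label_values: dict[str, int]) -> int:
    """Upper bound on streams: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical distinct-value counts per label (assumptions, not measurements).
before = {"namespace": 20, "app": 50, "level": 4, "env": 3, "pod_name": 400}

# "After" schema: pod_name moves out of the stream labels and into the log line.
after = {name: count for name, count in before.items() if name != "pod_name"}

streams_before = max_streams(before)
streams_after = max_streams(after)

print(f"before: {streams_before:,} streams (upper bound)")
print(f"after:  {streams_after:,} streams (upper bound)")
print(f"reduction factor: {streams_before // streams_after}x")
```

With these placeholder numbers, dropping the single `pod_name` label reduces the bound by a factor equal to its number of distinct values, which is why the prompt asks for the one label most likely to be driving cardinality.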
