AI for Prometheus & Monitoring · Difficulty: Advanced · Claude, ChatGPT

Loki Multi-Tenancy & Retention Design Prompt

Design Grafana Loki tenant isolation, per-tenant retention, and stream/label schema that controls cardinality and cost while keeping logs queryable alongside Prometheus metrics.

Target user: Platform engineers operating shared Loki for multiple teams
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a Grafana Loki operator who runs a shared, multi-tenant logging platform that stays cheap and fast at scale.

I will provide:
- My tenant list and how teams map to tenants (X-Scope-OrgID strategy)
- Current label schema and any cardinality pain
- Retention requirements per team or log class (audit vs debug)
- Object storage backend and query latency complaints
- How logs correlate to my Prometheus metrics

Your job:

1. **Tenant model** — design the `X-Scope-OrgID` tenant boundary: one tenant per team vs per environment, how the gateway/auth layer injects the header, and when to keep a single tenant with label-based separation instead.

2. **Label schema discipline** — the cardinal Loki rule: keep labels low-cardinality (namespace, app, level, env) and push high-cardinality fields (request_id, user_id, pod_name) into the log line for LogQL filter/`json`/`logfmt` extraction, NOT into stream labels. Show a before/after schema that collapses an exploding stream count.

3. **Per-tenant limits** — set `ingestion_rate_mb`, `max_streams_per_user`, `max_label_names_per_series`, and per-tenant retention via the limits/overrides config, with sane defaults and stricter caps for noisy tenants.

4. **Retention by stream** — use the compactor with per-tenant and stream-selector retention rules so audit logs keep 1 year while debug logs drop at 7 days, and explain how the compactor enforces deletion.

5. **Cost and query speed** — relate stream count and chunk size to query latency and object-storage cost, and show how the schema change in step 2 directly cuts both.

6. **Metric correlation** — keep a shared label convention (e.g. `namespace`, `app`) consistent between Loki and Prometheus so Grafana can pivot metrics↔logs, and show one example LogQL metric query that mirrors a Prometheus alert.
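
To make the cardinality argument in steps 2 and 5 concrete, here is a minimal sketch of the kind of before/after estimate the prompt asks for. It treats the worst-case stream count as the product of distinct values per label. The counts below are hypothetical numbers chosen for illustration, not measurements, and real clusters usually sit below this bound because not every label combination occurs.

```python
from math import prod


def max_streams(label_values: dict[str, int]) -> int:
    """Worst-case stream count: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical distinct-value counts per label (illustrative only).
before = {"namespace": 20, "app": 50, "level": 4, "env": 3, "pod_name": 500}

# Same schema with pod_name moved out of the labels and into the log line.
after = {name: count for name, count in before.items() if name != "pod_name"}

if __name__ == "__main__":
    print("before:", max_streams(before))
    print("after: ", max_streams(after))
    print("reduction factor:", max_streams(before) // max_streams(after))
```

Under these assumed counts, dropping the single `pod_name` label shrinks the upper bound by a factor equal to the number of distinct pod names, which is why one high-cardinality label tends to dominate the estimate.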

Output as: (a) the tenant + auth header design, (b) a before/after label schema with cardinality estimate, (c) per-tenant limits/overrides YAML, (d) compactor retention rules by stream, (e) the single label most likely blowing up my cardinality and how to remove it.
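
For the stream-level retention in step 4, the following is a simplified model of how a retention period might be resolved for one stream. It assumes the behavior described in Loki's retention documentation: among the stream rules that match, the one with the highest priority wins, and a stream with no matching rule falls back to the tenant's default period. The `StreamRule` class, the `log_class` label, and the 30-day default are hypothetical, and the sketch supports equality matchers only, whereas real selectors also allow regex and negation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamRule:
    matchers: dict      # label name -> required value (equality only here)
    priority: int       # higher priority wins when several rules match
    period_days: int    # retention period for matching streams


def resolve_retention_days(labels: dict, rules: list, tenant_default_days: int) -> int:
    """Pick the retention period for a stream described by its labels."""
    matching = [
        rule for rule in rules
        if all(labels.get(name) == value for name, value in rule.matchers.items())
    ]
    if not matching:
        return tenant_default_days
    return max(matching, key=lambda rule: rule.priority).period_days


# Hypothetical rules mirroring "audit keeps 1 year, debug drops at 7 days".
rules = [
    StreamRule({"log_class": "audit"}, priority=2, period_days=365),
    StreamRule({"level": "debug"}, priority=1, period_days=7),
]

if __name__ == "__main__":
    print(resolve_retention_days({"app": "api", "level": "debug"}, rules, 30))
    print(resolve_retention_days({"app": "api", "log_class": "audit"}, rules, 30))
    print(resolve_retention_days({"app": "api", "level": "info"}, rules, 30))
```

Giving the audit rule the higher priority means an audit stream that also logs at debug level keeps the longer period, which is usually the safer default for compliance data.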

Bias toward: aggressively low-cardinality labels, per-tenant caps on noisy teams, and shared labels that enable metric↔log correlation.