Enforcing Tenant Labels in Multi-Tenant Prometheus and Mimir

The first time a multi-tenant Mimir cluster bit me, the bill arrived before the bug report did. Finance asked which team owned 40% of our active series, and I genuinely could not answer. Somewhere between a hastily written scrape job and a remote-write block someone copied off a Slack thread, the team label had quietly gone missing on a few thousand series. Not all of them. Just enough to make every cost report a guess and every alert route to the wrong on-call. That afternoon taught me something I now treat as gospel: in a multi-tenant metrics platform, a label is not metadata. It is identity, billing, and a security boundary all at once.

This post is about making that identity non-negotiable. I’ll walk through enforcing tenant and ownership labels at scrape time, at remote-write time, and at query time, with real relabel YAML you can adapt. And because I drafted half of these configs with an AI assistant, I’ll be honest about where that helped and where it absolutely needed a human reading every line before it shipped.

Why a Missing Tenant Label Quietly Breaks Everything

A single absent team or tenant label doesn’t throw an error. That’s the trap. The scrape succeeds, the series lands, the graph renders. The damage is downstream and invisible until it isn’t.

Cost attribution falls apart first. Mimir and Cortex meter cardinality and ingestion per tenant. If your chargeback model groups by a team label and 5% of series lack it, that 5% becomes an unattributable “other” bucket that finance will eventually ask you to explain.
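To put a number on that bucket before finance does, a recording rule along these lines can track the share of series with no team label. Treat it as a sketch rather than a production rule: the group and rule names are hypothetical, it assumes every series already carries a tenant label, and the selector touches every series, so on a large tenant it may be expensive or run into query limits.

```yaml
groups:
  - name: label-coverage                           # hypothetical group name
    interval: 5m
    rules:
      # Fraction of series per tenant that have no team label.
      # team="" matches series where the label is absent or empty.
      - record: tenant:series_missing_team:ratio   # hypothetical rule name
        expr: |
          count by (tenant) ({__name__=~".+", team=""})
            /
          count by (tenant) ({__name__=~".+"})
```

A tenant with no unlabeled series produces no sample at all for this rule, so an empty result is the healthy state.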

Alert routing breaks next. Alertmanager routes on labels. A CPUThrottling alert with no team label hits the catch-all receiver, which is usually a channel nobody watches at 3am.
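Here is roughly what that failure looks like in an Alertmanager routing tree. This is a minimal sketch: the receiver names are hypothetical and the receivers have no notification integrations configured.

```yaml
route:
  # Root route: any alert that matches no child route lands here.
  receiver: "catch-all"
  group_by: ["alertname", "team"]
  routes:
    # Only alerts that actually carry team="payments" reach the payments on-call.
    - matchers:
        - 'team = "payments"'
      receiver: "payments-oncall"

receivers:
  - name: "catch-all"          # hypothetical receiver, no integrations configured
  - name: "payments-oncall"    # hypothetical receiver, no integrations configured
```

An alert fired from a series with no team label matches none of the child routes, so it falls through to catch-all without any error.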

Access control is the scary one. In Mimir, tenancy is enforced by the X-Scope-OrgID header, but within a tenant, label-based authorization in Grafana or a query proxy assumes the labels are trustworthy. A series that slipped in without its expected tenant label is a series that can leak into the wrong dashboard.

Injecting Labels at Scrape Time with relabel_configs

The cheapest place to guarantee a label exists is where the data is born: the scrape. relabel_configs runs against the target’s metadata before the scrape, so you can stamp ownership onto every series a job produces.

scrape_configs:
  - job_name: "payments-api"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Default: stamp the owning team onto every series from this job.
      - target_label: "team"
        replacement: "payments"
      # Promote a pod annotation into a real label when present.
      - source_labels: ["__meta_kubernetes_pod_annotation_team"]
        target_label: "team"
        regex: "(.+)"
        replacement: "$1"
      # Carry the namespace through as the tenant dimension.
      - source_labels: ["__meta_kubernetes_namespace"]
        target_label: "tenant"

The static replacement rule sets a default; the annotation rule overrides it only when the annotation actually exists (the regex: "(.+)" guard means an empty value won’t blank out your default). This default-then-override pattern is what keeps a forgotten annotation from producing an unlabeled series.

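Pro Tip: Order matters in relabel_configs: rules run top to bottom. Put your safe default first and your guarded override second. Reverse them and the static default overwrites the annotation value on every target. Drop the regex: "(.+)" guard and a missing annotation yields an empty replacement, which removes the label you just set.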
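Two variations that look harmless and are not, sketched as relabel_configs fragments. They reuse the team annotation from the scrape job above, which is this post's own convention rather than any Kubernetes standard. Both pass config validation, which is what makes them easy to ship by accident.

```yaml
# Broken variant A: override first, default second.
# The unconditional rule runs last, so every target ends up with
# team="payments" regardless of its annotation.
relabel_configs:
  - source_labels: ["__meta_kubernetes_pod_annotation_team"]
    regex: "(.+)"
    target_label: "team"
    replacement: "$1"
  - target_label: "team"
    replacement: "payments"
---
# Broken variant B: correct order, but no regex guard.
# The default regex (.*) also matches an empty annotation, $1 expands to "",
# and an empty replacement removes the team label set by the first rule.
relabel_configs:
  - target_label: "team"
    replacement: "payments"
  - source_labels: ["__meta_kubernetes_pod_annotation_team"]
    target_label: "team"
    replacement: "$1"
```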

Validating and Dropping at Remote-Write Time

Scrape-time stamping covers metrics you scrape directly. But in a federated or agent-based setup, metrics arrive from many Prometheus instances you don’t fully control. write_relabel_configs is your last line of defense before data leaves for Mimir — it runs on every sample headed to the remote endpoint.

remote_write:
  - url: "https://mimir.internal/api/v1/push"
    headers:
      X-Scope-OrgID: "platform-prod"
    write_relabel_configs:
      # Backstop: if team somehow arrived empty, tag it for triage
      # instead of letting an unowned series through silently.
      - source_labels: ["team"]
        regex: "^$"
        target_label: "team"
        replacement: "unowned"
      # Drop internal debug series that should never reach the tenant.
      - source_labels: ["__name__"]
        regex: "debug_.*"
        action: "drop"
      # Keep only series carrying a tenant label; everything else
      # is a labeling bug and gets dropped at the door.
      - source_labels: ["tenant"]
        regex: ".+"
        action: "keep"

That replacement: "unowned" rule is deliberate. I’d rather surface an unowned bucket I can alert on and hunt down than silently drop billable data or, worse, let it pollute a real team’s namespace. Pair it with a simple alert on count(... {team="unowned"}) > 0 and labeling regressions announce themselves.
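One possible shape for that alert, as a Prometheus-style rule file. The group name, alert name, for duration, and the team: platform routing label are all assumptions for illustration; adjust them to your own conventions.

```yaml
groups:
  - name: label-hygiene                  # hypothetical group name
    rules:
      - alert: UnownedSeriesPresent      # hypothetical alert name
        # Fires per tenant as soon as any series carries the backstop value.
        expr: count by (tenant) ({team="unowned"}) > 0
        for: 15m
        labels:
          severity: warning
          team: platform                 # assumed owner of labeling regressions
        annotations:
          summary: "Series with team=unowned found in tenant {{ $labels.tenant }}"
```

Giving the alert its own team label matters here: an alert about missing ownership that itself routes to the catch-all receiver would defeat the point.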

The X-Scope-OrgID Header: Tenancy at the Boundary

Labels handle ownership within a tenant. Tenant isolation in Mimir and Cortex is keyed on the X-Scope-OrgID HTTP header sent with every write and read. Mimir keeps each org’s data apart: separate blocks in object storage, separate limits, and queries scoped to the tenant ID. A series pushed under X-Scope-OrgID: team-a is invisible to a query made under team-b. No label trick crosses that line. One caveat: Mimir trusts whatever value the header carries, so an authenticating gateway or reverse proxy has to set it. Don’t let clients choose their own.

remote_write:
  - url: "https://mimir.internal/api/v1/push"
    headers:
      X-Scope-OrgID: "team-a"

The mistake I see most: treating a tenant label and the X-Scope-OrgID header as interchangeable. They’re complementary layers. The header is the hard wall between organizations; the label is the soft, queryable dimension for cost and routing inside one. If you run a single shared tenant and rely purely on labels for separation, you have no real isolation — any user who can craft a PromQL query can read across “tenants.” Decide consciously which model you’re in. For the deeper trade-offs of running this at scale, I wrote up our setup in running Grafana Mimir at scale.
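If you land on the tenant-per-team model, one way to wire it up from a shared Prometheus is a remote_write entry per tenant, each with its own header and a keep filter. This is a sketch under a few assumptions: the tenant label values match the org IDs exactly (team-a, team-b), the team-b entry and the name fields are illustrative, and the URL is the same placeholder used earlier.

```yaml
remote_write:
  - name: "mimir-team-a"                 # illustrative queue name
    url: "https://mimir.internal/api/v1/push"
    headers:
      X-Scope-OrgID: "team-a"
    write_relabel_configs:
      # Only series labeled tenant="team-a" are sent under the team-a org.
      - source_labels: ["tenant"]
        regex: "team-a"
        action: "keep"
  - name: "mimir-team-b"                 # illustrative queue name
    url: "https://mimir.internal/api/v1/push"
    headers:
      X-Scope-OrgID: "team-b"
    write_relabel_configs:
      # Only series labeled tenant="team-b" are sent under the team-b org.
      - source_labels: ["tenant"]
        regex: "team-b"
        action: "keep"
```

Here the label decides which wall a series ends up behind, and the header is the wall. A series with a missing or unexpected tenant label matches neither keep rule and is sent nowhere, so pair this with a check that surfaces such series instead of losing them quietly.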

Enforcing Labels at Query Time

Even with clean writes, query-time enforcement prevents accidental cross-tenant reads inside a shared tenant. A label-enforcement proxy in front of the query API (prom-label-proxy is a common open-source choice; Grafana Enterprise Metrics offers label-based access control natively) rewrites incoming PromQL to inject a mandatory label matcher. A user asking for sum by (job) (http_requests_total) actually gets:

sum by (job) (http_requests_total{tenant="team-a"})

The injected {tenant="team-a"} matcher is forced server-side based on the authenticated identity, so a user cannot widen the query to see another team’s series no matter how they phrase it. This is the same defense-in-depth idea as parameterized SQL: never trust the client to scope its own reads. Consistent label names make this enforceable — tenant everywhere, never tenant here and org there, which is exactly the kind of drift I cover in metric naming standards.

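Pro Tip: Test your enforcement by deliberately trying to break out of it. Issue a query with {tenant="someone-else"}, and another with {tenant=~".+"}, against your proxy. Both should come back rejected, empty, or scoped to your own tenant. If either returns another team’s data, the proxy is passing the caller’s matcher through instead of overriding it. PromQL ANDs the matchers inside a selector, so a correctly injected tenant="team-a" next to a conflicting caller matcher can only produce an empty result, never someone else’s series.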

Where AI Helped, and Where I Made It Prove Itself

I’ll be candid: a good chunk of the relabel YAML above started as an AI draft. Treating the model like a fast, eager junior engineer is the right mental model. It produces a plausible relabel_configs block in seconds, and for boilerplate like Kubernetes SD label promotion, that draft is usually 90% right.

The remaining 10% is where you earn your salary. AI confidently wrote me a keep action where I needed drop, which would have silently inverted my filter and shipped exactly the series I meant to exclude. It also reached for labelmap in a spot where an explicit target_label was clearer and safer. Neither of those threw errors. They’d have just quietly done the wrong thing in production, which is the most expensive failure mode in observability.

So the rule on my team: AI can draft relabel configs, but every rule has to be explainable before it merges. If you can’t say in one sentence what a rule does and what happens when its source label is missing, it doesn’t ship. That’s also why I lean on a deterministic tool like the free Alert Rule Generator for the alerting layer — it gives reviewable, structured YAML rather than freeform guesses. If you want reusable starting points for prompting an assistant through this kind of config work, our prompt library and prompt packs have monitoring-focused templates. And if relabeling itself is new to you, the scrape config and relabeling deep dive covers the mechanics end to end.

Conclusion

Tenant and ownership labels are the load-bearing walls of a multi-tenant metrics platform. Stamp them at scrape time with relabel_configs, validate and backstop them at remote-write time with write_relabel_configs, enforce isolation at the boundary with X-Scope-OrgID, and force a mandatory matcher at query time. Let AI draft the YAML to move fast — then read every line like it’s going straight to prod, because it is. A label that exists on 99% of your series is a label you cannot trust. Make it 100%, and make the config prove it.