AI for Prometheus & Monitoring · By James Joyner IV · 9 min read

Prometheus Error Guide: 'OOMKilled' (exit 137) High Memory Crashes

Fix Prometheus OOMKilled (exit 137) and out-of-memory crashes: cut cardinality, drop labels, add recording rules, size memory limits, and shard before the pod dies again.

  • #prometheus-monitoring
  • #troubleshooting
  • #errors
  • #performance

Exact Error Message

OOMKilled is what Kubernetes reports when the kernel out-of-memory killer terminates the Prometheus container for exceeding its memory limit. On a bare host you see the Go runtime’s fatal error or a kernel OOM message instead. The common thread: Prometheus needed more RAM than it was allowed and was killed. When the kernel does the killing, the exit code is 137, which is 128 + 9 (SIGKILL).

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

Other forms of the same problem:

fatal error: runtime: out of memory
kernel: Out of memory: Killed process 21984 (prometheus) total-vm:38211004kB, anon-rss:31022144kB
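
The exit-code arithmetic can be checked with a few lines of Python. This is a minimal sketch, assuming a POSIX host (where signal 9 is SIGKILL) and the usual convention that codes above 128 mean "128 + signal number". The function name explain_exit_code is hypothetical.

```python
import signal


def explain_exit_code(code: int) -> str:
    """Describe an exit code using the 128 + signal-number convention."""
    if code <= 128:
        # By convention, codes up to 128 are ordinary exits, not signal deaths.
        return f"exit {code}: process exited on its own (no signal)"
    signum = code - 128
    try:
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"exit {code}: killed by signal {signum} ({name})"


print(explain_exit_code(137))
```

The code alone does not prove an OOM kill; it only says the process received SIGKILL. The Reason: OOMKilled field or the kernel log line is what identifies the OOM killer.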

What the Error Means

Prometheus holds the active head block — roughly the most recent two hours of every series — in memory, plus index structures, mmapped chunks, and per-query working sets. Memory scales with active series count (cardinality), scrape frequency, and concurrent query load.

OOMKilled means total resident memory crossed the container limit (or the host’s available RAM) and the kernel killed the process. Prometheus restarts, replays the WAL (which itself takes memory and time), and if the underlying cardinality has not changed, it OOMs again — a crash loop. This is almost always a cardinality or query problem, not a “needs more RAM” problem, though raising the limit buys time.

Common Causes

  1. High cardinality — too many active series, usually from a label with unbounded values (user IDs, request paths, container IDs).
  2. Expensive queries — a dashboard or recording rule loading millions of series into memory at once.
  3. Large head before compaction — a cardinality spike inflates the in-memory head for up to two hours.
  4. Scrape interval too short — halving the interval roughly doubles samples and memory pressure.
  5. Remote-write buffers — a slow or down remote endpoint causes the in-memory queue to back up.
  6. No memory headroom — the limit is set close to steady-state, leaving nothing for query spikes or WAL replay.
  7. Federation pulling everything — a /federate match that scrapes all series from many child Prometheis multiplies cardinality on the parent.
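
To make the scrape-interval point concrete, here is a small sketch of the ingest-rate arithmetic. The series count of 1,000,000 is an illustrative assumption, not a measurement, and samples_per_second is a hypothetical helper.

```python
def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape interval."""
    return active_series / scrape_interval_s


ACTIVE_SERIES = 1_000_000  # assumed figure for illustration only

for interval in (30, 20, 10):
    rate = samples_per_second(ACTIVE_SERIES, interval)
    print(f"interval={interval:>2}s  ->  {rate:,.0f} samples/s")
```

Going from 20s to 10s doubles the sample rate for the same series count. Series count still dominates head memory, so treat this as one input among several rather than a sizing formula.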

How to Reproduce the Error

Inject a high-cardinality metric and watch memory climb until the kill:

# An exporter that emits a unique label value per request explodes series count
for i in $(seq 1 200000); do
  echo "http_requests_total{path=\"/item/$i\"} 1"
done > /var/lib/node_exporter/textfile/explode.prom

Scrape that target, then run a query that fans out across every series:

count({__name__=~".+"})

With a tight container limit, the head growth plus the query working set pushes resident memory over the limit and the pod is OOMKilled within a scrape cycle or two.

Diagnostic Commands

Confirm it was an OOM kill and see the limit:

kubectl describe pod prometheus-0 -n monitoring | grep -A4 'Last State'
kubectl get pod prometheus-0 -n monitoring -o jsonpath='{.spec.containers[0].resources}'

Find the cardinality offenders via the TSDB status endpoint:

curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data | {numSeries: .headStats.numSeries,
                 topMetrics: .seriesCountByMetricName[0:10],
                 topLabels: .labelValueCountByLabelName[0:10]}'
{
  "numSeries": 4180233,
  "topMetrics": [
    {"name": "http_request_duration_seconds_bucket", "value": 982140},
    {"name": "container_network_tcp_usage_total", "value": 511002}
  ],
  "topLabels": [
    {"name": "path", "value": 318441},
    {"name": "id", "value": 204118}
  ]
}
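
If you save that response to a file or fetch it from a script, a short helper can rank the offenders by their share of all head series. This is a sketch that assumes the response shape shown above (data.headStats.numSeries and data.seriesCountByMetricName). The function name top_offenders is hypothetical, and the sample dict reuses the example numbers from this guide.

```python
def top_offenders(tsdb_status: dict, limit: int = 5) -> list[tuple[str, int, float]]:
    """Return (metric name, series count, share of head series) for the top metrics."""
    data = tsdb_status["data"]
    total = data["headStats"]["numSeries"]
    rows = []
    for entry in data["seriesCountByMetricName"][:limit]:
        share = entry["value"] / total if total else 0.0
        rows.append((entry["name"], entry["value"], share))
    return rows


# Sample payload built from the example output in this guide.
sample = {
    "data": {
        "headStats": {"numSeries": 4180233},
        "seriesCountByMetricName": [
            {"name": "http_request_duration_seconds_bucket", "value": 982140},
            {"name": "container_network_tcp_usage_total", "value": 511002},
        ],
    }
}

for name, count, share in top_offenders(sample):
    print(f"{name:<45} {count:>9} {share:6.1%}")
```

A metric that holds a double-digit share of all series on its own is usually the first candidate for a labeldrop or drop rule.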

Track head series and memory over time:

prometheus_tsdb_head_series
process_resident_memory_bytes
container_memory_working_set_bytes{pod="prometheus-0"}

Check runtime info and remote-write backlog:

curl -s http://localhost:9090/api/v1/status/runtimeinfo | jq '.data | {goroutineCount, GOMAXPROCS, storageRetention}'
prometheus_remote_storage_samples_pending

Find which metric is heaviest right now:

topk(10, count by (__name__)({__name__=~".+"}))

Step-by-Step Resolution

1. Stop the bleeding by raising the limit temporarily so Prometheus stays up while you fix cardinality. This is a band-aid, not the fix.

resources:
  requests: { memory: "8Gi" }
  limits:   { memory: "12Gi" }

2. Drop the high-cardinality label or metric at scrape time with metric_relabel_configs. This is the single most effective fix.

scrape_configs:
  - job_name: app
    static_configs: [{ targets: ["app:8080"] }]
    metric_relabel_configs:
      # Drop the unbounded "path" label entirely
      - action: labeldrop
        regex: path
      # Or drop an entire noisy metric
      - source_labels: [__name__]
        regex: container_network_tcp_usage_total
        action: drop
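
To see why removing one label has such a large effect, the sketch below simulates a labeldrop over a set of series and counts the distinct label sets before and after. It is an illustration of the idea, not Prometheus code. It assumes label-name regexes are fully anchored (hence re.fullmatch), and apply_labeldrop is a hypothetical helper.

```python
import re


def apply_labeldrop(series: list[dict[str, str]], regex: str) -> set[frozenset]:
    """Remove labels whose name fully matches regex; return the distinct label sets left."""
    pattern = re.compile(regex)
    remaining = set()
    for labels in series:
        kept = {k: v for k, v in labels.items() if not pattern.fullmatch(k)}
        remaining.add(frozenset(kept.items()))
    return remaining


# 1000 series that differ only in an unbounded "path" label.
series = [
    {"__name__": "http_requests_total", "path": f"/item/{i}"} for i in range(1000)
]

before = len({frozenset(s.items()) for s in series})
after = len(apply_labeldrop(series, "path"))
print(before, after)  # 1000 1
```

The collapse to a single label set is the point, but it also shows a caveat: the Prometheus documentation cautions that metrics must remain uniquely labeled after a labeldrop. Where one target exposes many series that differ only by the dropped label, removing the label in the application's instrumentation, or dropping the metric outright, is the safer route.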

3. Replace expensive ad-hoc queries with recording rules so the heavy aggregation runs once per interval instead of on every dashboard load.

groups:
  - name: precompute
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

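4. Reduce scrape frequency where you do not need high resolution. Shortening retention mostly saves disk; it is the in-memory head, driven by series count and sample rate, that causes the OOM.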

global:
  scrape_interval: 30s   # was 10s

5. Shard if cardinality is genuinely large. Split scrape targets across multiple Prometheus instances (e.g., by team or namespace) with hashmod relabeling, or adopt a horizontally scalable backend.
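
The idea behind hashmod sharding is that each target address hashes to a stable shard number, and every Prometheus instance keeps only the targets for its own number. The sketch below illustrates that assignment. It assumes an MD5-based hash reduced modulo the shard count; treat the exact hash as an assumption rather than a bit-for-bit copy of what Prometheus computes. The target names and shard_for are hypothetical.

```python
import hashlib


def shard_for(address: str, modulus: int) -> int:
    """Map a target address to a shard in [0, modulus) using a stable hash."""
    digest = hashlib.md5(address.encode("utf-8")).digest()
    # Use the low 8 bytes of the digest as an unsigned integer (an assumption).
    return int.from_bytes(digest[8:], "big") % modulus


SHARDS = 3
targets = [f"app-{i}:8080" for i in range(12)]  # hypothetical targets

assignment: dict[int, list[str]] = {shard: [] for shard in range(SHARDS)}
for target in targets:
    assignment[shard_for(target, SHARDS)].append(target)

for shard, members in assignment.items():
    print(shard, members)
```

Because the hash depends only on the address, a target stays on the same shard across restarts, and every target lands on exactly one shard. Changing the shard count reshuffles most assignments, so plan that change rather than doing it mid-incident.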

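6. Reload and verify that prometheus_tsdb_head_series and process_resident_memory_bytes settle below the limit. The HTTP reload endpoint only works when Prometheus runs with --web.enable-lifecycle; otherwise send the process a SIGHUP. A reload picks up scrape and rule changes, while a changed memory limit needs a pod restart.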

curl -s -X POST http://localhost:9090/-/reload

Prevention and Best Practices

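  • Alert on prometheus_tsdb_head_series growth and on working-set memory above 85% of the limit (for example container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.85) so you hear about it before the kill.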
  • Never use unbounded values (IDs, paths, emails, timestamps) as label values — see the label-limit-exceeded guide to enforce this at scrape time.
  • Set memory requests to steady state and limits with 30–50% headroom for query spikes and WAL replay.
  • Cap heavy dashboards with recording rules; avoid {__name__=~".+"}-style queries in production.
  • For federation, match only aggregated series (match[]={__name__=~"job:.*"}), never everything.
  • Watch remote-write prometheus_remote_storage_samples_pending; a stuck endpoint inflates memory.
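
The headroom and alert-threshold guidance above reduces to simple arithmetic. The sketch below assumes you have already measured steady-state memory yourself. The helper names and the 8 GiB figure are illustrative, chosen to match the 8Gi request and 12Gi limit used earlier in this guide.

```python
GIB = 1024 ** 3


def memory_limit_bytes(steady_state_bytes: int, headroom: float = 0.4) -> int:
    """Limit = steady state plus a headroom fraction (the guide suggests 0.3 to 0.5)."""
    return round(steady_state_bytes * (1 + headroom))


def should_alert(working_set_bytes: int, limit_bytes: int, threshold: float = 0.85) -> bool:
    """True when the working set exceeds the given fraction of the limit."""
    return working_set_bytes / limit_bytes > threshold


limit = memory_limit_bytes(8 * GIB, headroom=0.5)
print(limit // GIB)                      # 12
print(should_alert(11 * GIB, limit))     # True  (about 92% of the limit)
print(should_alert(10 * GIB, limit))     # False (about 83% of the limit)
```

With 50% headroom an 8 GiB steady state yields a 12 GiB limit, and the 0.85 threshold fires a little above 10 GiB, which leaves time to act before the kernel does.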

Frequently Asked Questions

Does raising the memory limit fix OOMKilled? It buys time but rarely fixes the root cause. If cardinality keeps growing, you will hit the new limit too. Use the bigger limit to stay up while you drop labels and add recording rules.

How do I know which metric is causing the OOM? Hit /api/v1/status/tsdb and read seriesCountByMetricName and labelValueCountByLabelName. The metric or label at the top of those lists is almost always the offender.

Why does Prometheus OOM right after it starts? On restart it replays the WAL to rebuild the head, which temporarily needs as much (or more) memory than steady state. If the limit barely covers steady state, replay tips it over and you get a crash loop. Add headroom or reduce cardinality.

Can a single bad query OOM the whole server? Yes. A query matching millions of series (e.g. {__name__=~".+"} over a long range) loads them all into memory at once. Set --query.max-samples and prefer recording rules for heavy aggregations.

Is exit code 137 always OOM? No. 137 only means the process received SIGKILL (128 + 9). In Kubernetes with Reason: OOMKilled it is the OOM killer. The same 137 can also come from a manual kill -9, or from a container that was force-killed because it did not exit within its termination grace period (for example after a failed liveness probe), so confirm with kubectl describe pod and the kernel log.
