Grafana Error Guide: Pod OOMKilled

Overview

In Kubernetes, when the Grafana container’s memory usage exceeds its resources.limits.memory, the kernel OOM killer terminates the process with SIGKILL, the container exits with code 137, and the kubelet restarts it. Repeated kills produce CrashLoopBackOff and intermittent unavailability. The usual triggers are an undersized limit, a heavy dashboard or query, the in-process image renderer, or a leaking plugin.

The literal errors you will see:

State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137

Last State:     Terminated
  Reason:       OOMKilled
Warning  BackOff  pod/grafana-0  Back-off restarting failed container

kernel: Memory cgroup out of memory: Killed process 12345 (grafana) total-vm:...kB

It surfaces as a Grafana pod that restarts under load, during large renders, or while running an expensive dashboard.
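
Exit code 137 follows the common convention of 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL). A minimal sketch of that arithmetic is below; the function name describe_exit_code is hypothetical, and the signal lookup assumes a Unix-like system. Note that 137 only shows the process received SIGKILL; it is the Reason: OOMKilled field that attributes the kill to the memory limit.

```python
import signal

def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"

print(describe_exit_code(137))
```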

Symptoms

kubectl describe pod shows Reason: OOMKilled, Exit Code: 137.
Pod flaps into CrashLoopBackOff; dashboards intermittently unavailable.
Memory usage climbs to the limit then the container restarts.
Kills coincide with rendering, alert storms, or a specific heavy dashboard.

kubectl -n monitoring describe pod grafana-0 | grep -A5 -i "last state\|state:"

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137

Common Root Causes

1. Memory limit too low

The limits.memory is set below what Grafana needs for its query/render load, so normal peaks kill it.

kubectl -n monitoring get pod grafana-0 -o jsonpath='{.spec.containers[0].resources}'; echo

{"limits":{"memory":"256Mi"},"requests":{"memory":"128Mi"}}
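
To compare these settings with observed usage, the quantity strings need converting to bytes. The sketch below parses the JSON printed by the command above; parse_memory and memory_settings are hypothetical helper names, and only the common binary and decimal suffixes are handled (other quantity forms are not covered).

```python
import json

# Common suffixes for Kubernetes memory quantities (subset, by assumption).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
          "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}

def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' to bytes."""
    # Check two-letter suffixes first so 'Mi' is never mistaken for 'M'.
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _UNITS[suffix])
    return int(float(quantity))  # no suffix: plain byte count

def memory_settings(resources_json: str) -> dict:
    """Extract memory requests/limits (in bytes) from a container's resources JSON."""
    resources = json.loads(resources_json)
    return {kind: parse_memory(resources[kind]["memory"])
            for kind in ("requests", "limits")
            if "memory" in resources.get(kind, {})}

sample = '{"limits":{"memory":"256Mi"},"requests":{"memory":"128Mi"}}'
print(memory_settings(sample))  # {'requests': 134217728, 'limits': 268435456}
```

If "limits" is absent from the result, the container has no memory limit set in its own spec.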

2. In-process image renderer in the Grafana container

Running the grafana-image-renderer plugin inside the Grafana container adds headless Chromium’s large, spiky memory usage to the same limit.

3. Heavy queries / huge result sets

A dashboard pulling millions of series/points (unbounded range, high-cardinality queries) buffers large results in memory.

4. Plugin or renderer memory leak

A leaking plugin or renderer grows memory steadily until the limit is hit, so kills recur at roughly regular intervals.
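
One rough way to judge whether kills are "regular" is to look at the spread of the gaps between them. The sketch below assumes you have collected several kill timestamps yourself (for example from monitoring or events, since the pod status only keeps the last termination); the function name and the sample times are hypothetical. A coefficient of variation near zero suggests steady growth to the limit, while a large value points at load-driven spikes.

```python
from datetime import datetime
from statistics import mean, pstdev

def kill_interval_stats(timestamps: list[str]) -> tuple[float, float]:
    """Return (mean gap in seconds, coefficient of variation) between kills."""
    # Accept a trailing 'Z' (UTC) on Python versions that do not parse it.
    times = sorted(datetime.fromisoformat(t.replace("Z", "+00:00")) for t in timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three kill timestamps")
    avg = mean(gaps)
    return avg, pstdev(gaps) / avg

# Hypothetical kill times, roughly 30 minutes apart.
kills = ["2024-01-01T10:00:00Z", "2024-01-01T10:30:00Z",
         "2024-01-01T11:01:00Z", "2024-01-01T11:30:00Z"]
avg, cv = kill_interval_stats(kills)
print(f"mean gap {avg:.0f}s, variation {cv:.2f}")
```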

5. Alerting fan-out / many concurrent renders

Alert evaluation and simultaneous image renders spike memory during incidents.

Diagnostic Workflow

Step 1: Confirm OOMKill and the limit

kubectl -n monitoring describe pod grafana-0 | grep -iE "OOMKilled|Exit Code|restart count"
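# A pod named grafana-0 usually belongs to a StatefulSet; if so, use statefulset/grafana instead of deploy/grafana here and below.
kubectl -n monitoring get deploy grafana -o jsonpath='{.spec.template.spec.containers[0].resources}'; echo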

Step 2: Watch memory approach the limit

kubectl -n monitoring top pod -l app=grafana --containers

POD        NAME       CPU(cores)   MEMORY(bytes)
grafana-0  grafana    120m         248Mi     # right at a 256Mi limit
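
The reading above is easier to act on as a fraction of the limit. The sketch below parses text in the shape of the sample output and divides by a limit you supply; memory_usage_ratio is a hypothetical helper, and it assumes the memory column is reported in Mi as in the sample.

```python
def memory_usage_ratio(top_output: str, container: str, limit_mib: int) -> float:
    """Return a container's memory usage as a fraction of its limit."""
    for line in top_output.splitlines():
        fields = line.split()
        # Expected columns: POD, NAME (container), CPU, MEMORY
        if len(fields) >= 4 and fields[1] == container and fields[3].endswith("Mi"):
            return int(fields[3][:-2]) / limit_mib
    raise ValueError(f"container {container!r} not found in output")

sample = """POD        NAME       CPU(cores)   MEMORY(bytes)
grafana-0  grafana    120m         248Mi
"""
ratio = memory_usage_ratio(sample, "grafana", limit_mib=256)
print(f"{ratio:.1%}")  # 96.9%
```

At roughly 97% of the limit, any modest spike is enough to trigger a kill.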

Step 3: Correlate kills with activity in logs

kubectl -n monitoring logs grafana-0 --previous | tail -40
kubectl -n monitoring logs grafana-0 --previous | grep -iE "render|query|alert" | tail

--previous reads the logs of the container instance that was killed rather than the one now running. Look for a heavy render or query just before the restart.
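
To see which subsystem was busiest just before the kill, the tail of that log can be tallied by its logger= field. This is a sketch only: logger_counts is a hypothetical helper, and the sample lines are illustrative, modelled on the logger=... msg=... shape Grafana uses in its logs.

```python
import re
from collections import Counter

LOGGER_RE = re.compile(r"\blogger=(\S+)")

def logger_counts(log_text: str, last_n: int = 40) -> Counter:
    """Count lines per logger= value in the last `last_n` lines of a log."""
    tail = log_text.splitlines()[-last_n:]
    return Counter(m.group(1) for line in tail if (m := LOGGER_RE.search(line)))

# Illustrative sample; real logs carry more fields per line.
sample = '''logger=rendering msg="Rendering"
logger=ngalert.notifier msg="Sending alert images"
logger=rendering msg="Rendering"
'''
print(logger_counts(sample).most_common(2))
```

In practice the text would come from the --previous log, for example by piping it in and passing sys.stdin.read() to the function.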

Step 4: Raise the limit (and request) sensibly

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    memory: "1Gi"     # give headroom above observed peak

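# -c restricts the change to the grafana container; without it, every container in the pod template is updated.
kubectl -n monitoring set resources deploy/grafana -c grafana \
  --limits=memory=1Gi --requests=memory=512Mi
kubectl -n monitoring rollout status deploy/grafana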
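
"Headroom above observed peak" can be turned into a number with a simple rule of thumb. The sketch below is one possible heuristic, not a recommendation from Grafana or Kubernetes: the 50% headroom, the 128Mi rounding step, and the 600Mi peak are all assumptions to replace with your own measurements.

```python
import math

def suggest_limit_mib(peak_mib: float, headroom: float = 0.5, step_mib: int = 128) -> int:
    """Add fractional headroom to the observed peak and round up to a step."""
    target = peak_mib * (1 + headroom)
    return math.ceil(target / step_mib) * step_mib

def format_quantity(mib: int) -> str:
    """Render a MiB count as a Kubernetes quantity string."""
    return f"{mib // 1024}Gi" if mib % 1024 == 0 else f"{mib}Mi"

peak = 600  # hypothetical observed peak in MiB
limit = suggest_limit_mib(peak)
print(format_quantity(limit))  # 1Gi
```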

Step 5: Move the renderer out of the Grafana pod

Run the standalone renderer as its own Deployment so its memory doesn’t count against Grafana’s limit, then point Grafana at it in the [rendering] section of grafana.ini:

[rendering]
server_url = http://grafana-renderer:8081/render
callback_url = http://grafana:3000/

Example Root Cause Analysis

A Grafana pod flaps every few minutes. kubectl describe pod shows:

Last State: Terminated
  Reason: OOMKilled
  Exit Code: 137
Restart Count: 14

kubectl top shows memory climbing to the 512Mi limit right when alerts fire. The --previous log:

logger=rendering msg="Rendering ..." 
logger=ngalert.notifier msg="Sending alert images"

The in-process image renderer is enabled, and during alert storms it launches multiple Chromium instances inside the Grafana container — each hundreds of MB — blowing the limit.

Fix: split the renderer into its own Deployment and give Grafana more headroom:

kubectl -n monitoring set resources deploy/grafana -c grafana --limits=memory=1Gi --requests=memory=512Mi
kubectl -n monitoring apply -f grafana-renderer-deploy.yaml

[rendering]
server_url = http://grafana-renderer:8081/render
callback_url = http://grafana:3000/

After the rollout, Grafana memory stays well under 1Gi during alert storms and the OOMKills stop. Root cause: Chromium’s spiky memory from the in-process renderer, counted against an undersized Grafana limit.

Prevention Best Practices

Set requests and limits from observed peaks with headroom (start ~512Mi–1Gi and adjust from kubectl top).
Run the image renderer as a separate Deployment so its memory is isolated from Grafana’s limit.
Bound expensive dashboards: shorter default ranges, maxDataPoints, recording rules for high-cardinality queries.
Alert on container memory vs. limit (e.g. 85%) to catch pressure before the kill.
Watch restart counts and --previous logs; a regular kill interval signals a leak, so identify which plugin or renderer version is responsible.

Quick Command Reference

# Confirm the OOMKill + limits
kubectl -n monitoring describe pod grafana-0 | grep -iE "OOMKilled|Exit Code|Restart Count"
kubectl -n monitoring get deploy grafana \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'; echo

# Live memory vs limit
kubectl -n monitoring top pod -l app=grafana --containers

# Killed container's logs
kubectl -n monitoring logs grafana-0 --previous | tail -40

# Raise memory
kubectl -n monitoring set resources deploy/grafana -c grafana \
  --limits=memory=1Gi --requests=memory=512Mi
kubectl -n monitoring rollout status deploy/grafana

Conclusion

An OOMKilled Grafana pod (exit 137) means the container exceeded its memory limit and the kernel killed it. Typical root causes:

A limits.memory set too low for the real query/render load.
The in-process image renderer adding Chromium’s spiky memory to Grafana’s limit.
Heavy queries buffering large result sets.
A plugin/renderer memory leak (regular kill interval).
Alerting and concurrent-render spikes during incidents.

Confirm the OOMKill and watch kubectl top against the limit first; if the renderer is the driver, isolating it into its own Deployment is usually the durable fix rather than just raising the number.
