Grafana Error Guide: Pod OOMKilled — High Memory in Kubernetes
Fix Grafana pod OOMKilled in Kubernetes — raise memory limits, find the memory hog (renderer, heavy queries, plugins), tune concurrency, and stop restart loops from exit code 137.
Overview
In Kubernetes, when the Grafana container’s memory usage exceeds its resources.limits.memory, the kernel’s cgroup OOM killer sends the process SIGKILL and the kubelet restarts the container, recording exit code 137 (128 + 9, where 9 is the SIGKILL signal number). Repeated kills produce CrashLoopBackOff and intermittent unavailability. The trigger is almost always memory pressure: an undersized limit, a heavy dashboard or query, the in-process image renderer, or a leaking plugin.
The literal errors you will see:
State: Terminated
Reason: OOMKilled
Exit Code: 137
Last State: Terminated
Reason: OOMKilled
Warning BackOff pod/grafana-0 Back-off restarting failed container
kernel: Memory cgroup out of memory: Killed process 12345 (grafana) total-vm:...kB
It surfaces as a Grafana pod that restarts under load, during large renders, or while running an expensive dashboard.
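The exit code in the output above can be decoded by hand: by shell convention, a status above 128 means the process was terminated by signal number (status - 128). A minimal sketch, where `decode_exit` is a hypothetical helper name and not part of kubectl or Grafana:

```shell
# Decode a container exit code. Values above 128 conventionally mean
# "terminated by signal (code - 128)".
decode_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l <number> prints the signal name without the SIG prefix
    echo "signal $sig (SIG$(kill -l "$sig"))"
  else
    echo "exit status $code (not a signal)"
  fi
}

decode_exit 137   # prints: signal 9 (SIGKILL)
```

SIGKILL alone does not prove an OOM kill; it is the combination with Reason: OOMKilled that identifies the memory limit as the cause.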
Symptoms
- kubectl describe pod shows Reason: OOMKilled, Exit Code: 137.
- Pod flaps into CrashLoopBackOff; dashboards intermittently unavailable.
- Memory usage climbs to the limit, then the container restarts.
- Kills coincide with rendering, alert storms, or a specific heavy dashboard.
kubectl -n monitoring describe pod grafana-0 | grep -A5 -i "last state\|state:"
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
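When the pod name is not known in advance, the same check can be run across a whole namespace. The sketch below is one possible way to do it with kubectl's jsonpath output; the function name `oomkilled_containers` and the default namespace `monitoring` are assumptions for illustration.

```shell
# Print one line per pod in the form "<pod>\t<container>=<reason>,<restarts> ..."
# and keep only pods where some container's last termination was OOMKilled.
# Assumption: the namespace defaults to "monitoring", as in this guide.
oomkilled_containers() {
  kubectl -n "${1:-monitoring}" get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{range .status.containerStatuses[*]}{.name}{"="}{.lastState.terminated.reason}{","}{.restartCount}{" "}{end}{"\n"}{end}' \
    | grep '=OOMKilled'
}
```

Calling `oomkilled_containers monitoring` against a live cluster lists only the affected pods. Containers that have never been terminated print an empty reason and are filtered out by the grep.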
Common Root Causes
1. Memory limit too low
The limits.memory is set below what Grafana needs for its query/render load, so normal peaks kill it.
kubectl -n monitoring get pod grafana-0 -o jsonpath='{.spec.containers[0].resources}'; echo
{"limits":{"memory":"256Mi"},"requests":{"memory":"128Mi"}}
2. In-process image renderer in the Grafana container
Running the grafana-image-renderer plugin inside the Grafana container adds headless Chromium’s large, spiky memory usage to the same limit.
3. Heavy queries / huge result sets
A dashboard pulling millions of series/points (unbounded range, high-cardinality queries) buffers large results in memory.
4. Plugin or renderer memory leak
A leaking plugin or renderer grows memory steadily until the limit is hit, so kills recur at roughly regular intervals.
5. Alerting fan-out / many concurrent renders
Alert evaluation and simultaneous image renders spike memory during incidents.
Diagnostic Workflow
Step 1: Confirm OOMKill and the limit
kubectl -n monitoring describe pod grafana-0 | grep -iE "OOMKilled|Exit Code|restart count"
kubectl -n monitoring get deploy grafana -o jsonpath='{.spec.template.spec.containers[0].resources}'; echo
Step 2: Watch memory approach the limit
kubectl -n monitoring top pod -l app=grafana --containers
POD         NAME      CPU    MEMORY
grafana-0   grafana   120m   248Mi   # right at a 256Mi limit
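To put a number on "right at the limit", the reading from kubectl top can be expressed as a percentage of the configured limit. This is a rough sketch that handles only whole-number Ki, Mi, and Gi quantities; `to_mib` and `pct_of_limit` are hypothetical helper names.

```shell
# Convert a Kubernetes memory quantity (integer Ki/Mi/Gi only) to MiB.
to_mib() {
  case $1 in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Integer percentage of the limit currently in use (rounded down).
pct_of_limit() {
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

pct_of_limit 248Mi 256Mi   # prints: 96
```

A container sitting at 96% of its limit has almost no room for a render or a large query result, which matches the kill pattern described above.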
Step 3: Correlate kills with activity in logs
kubectl -n monitoring logs grafana-0 --previous | tail -40
kubectl -n monitoring logs grafana-0 --previous | grep -iE "render|query|alert" | tail
--previous reads the killed container’s logs — look for a heavy render/query just before the restart.
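A quick way to see which subsystem was busiest just before the kill is to count log lines per logger. The sketch assumes Grafana's logfmt-style output with a `logger=` field, as in the log lines quoted in this guide; `count_loggers` is a hypothetical helper name.

```shell
# Read log lines on stdin and print "<count> <logger>" sorted by count,
# highest first. Only fields that start with "logger=" are counted.
count_loggers() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^logger=/) n[substr($i, 8)]++ }
       END { for (k in n) print n[k], k }' | sort -rn
}

# Example with two sample lines; real input would come from
# kubectl logs ... --previous.
printf 'logger=rendering msg="Rendering"\nlogger=rendering msg="Rendering"\nlogger=ngalert.notifier msg="Sending alert images"\n' | count_loggers
```

Against a cluster, the input would be `kubectl -n monitoring logs grafana-0 --previous | tail -200 | count_loggers`. A count dominated by rendering or alerting loggers points toward root causes 2 or 5.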
Step 4: Raise the limit (and request) sensibly
resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    memory: "1Gi"   # give headroom above observed peak
kubectl -n monitoring set resources deploy/grafana \
--limits=memory=1Gi --requests=memory=512Mi
kubectl -n monitoring rollout status deploy/grafana
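"Headroom above observed peak" can be turned into a concrete number with simple arithmetic. The sketch below uses an assumed 50% headroom and rounds up to a multiple of 128Mi; both values, the helper name `suggest_limit_mib`, and the 620Mi peak in the example are illustrative assumptions, not Grafana recommendations.

```shell
# Suggest a memory limit in MiB from an observed peak in MiB.
#   $1 = observed peak (MiB)
#   $2 = headroom percentage (assumed default: 50)
#   $3 = rounding step in MiB (assumed default: 128)
suggest_limit_mib() {
  peak=$1
  headroom_pct=${2:-50}
  step=${3:-128}
  raw=$(( peak * (100 + headroom_pct) / 100 ))
  # Round up to the next multiple of the step
  echo $(( (raw + step - 1) / step * step ))
}

suggest_limit_mib 620   # prints: 1024
```

A hypothetical 620Mi peak gives 930Mi with headroom, which rounds up to 1024Mi, the 1Gi value used in the commands above.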
Step 5: Move the renderer out of the Grafana pod
Run the standalone renderer as its own Deployment so its memory doesn’t count against Grafana’s limit:
[rendering]
server_url = http://grafana-renderer:8081/render
callback_url = http://grafana:3000/
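For the server_url above to resolve, the cluster needs a Service named grafana-renderer listening on port 8081 in front of the renderer pods. The following is a minimal sketch of such a manifest, written to the grafana-renderer-deploy.yaml file that the example below applies. The image tag, labels, namespace, and memory values are assumptions to adjust for your environment.

```shell
# Write a minimal Deployment + Service for the standalone renderer.
# Assumptions: namespace "monitoring", the grafana/grafana-image-renderer
# image, and the memory values below. Pin a specific image tag in practice.
cat > grafana-renderer-deploy.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana-renderer
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana-renderer
  template:
    metadata:
      labels:
        app: grafana-renderer
    spec:
      containers:
        - name: renderer
          image: grafana/grafana-image-renderer:latest
          ports:
            - containerPort: 8081
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-renderer
  namespace: monitoring
spec:
  selector:
    app: grafana-renderer
  ports:
    - port: 8081
      targetPort: 8081
EOF
```

Apply it with `kubectl -n monitoring apply -f grafana-renderer-deploy.yaml`. Because the renderer now has its own limit, a Chromium spike can at worst restart the renderer pod rather than Grafana itself.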
Example Root Cause Analysis
A Grafana pod flaps every few minutes. describe:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Restart Count: 14
kubectl top shows memory climbing to the 512Mi limit right when alerts fire. The --previous log:
logger=rendering msg="Rendering ..."
logger=ngalert.notifier msg="Sending alert images"
The in-process image renderer is enabled, and during alert storms it launches multiple Chromium instances inside the Grafana container — each hundreds of MB — blowing the limit.
Fix: split the renderer into its own Deployment and give Grafana more headroom:
kubectl -n monitoring set resources deploy/grafana --limits=memory=1Gi --requests=memory=512Mi
kubectl -n monitoring apply -f grafana-renderer-deploy.yaml
[rendering]
server_url = http://grafana-renderer:8081/render
callback_url = http://grafana:3000/
After the rollout, Grafana memory stays well under 1Gi during alert storms and the OOMKills stop. Root cause: Chromium’s spiky memory from the in-process renderer, counted against an undersized Grafana limit.
Prevention Best Practices
- Set requests and limits from observed peaks with headroom (start ~512Mi–1Gi and adjust from kubectl top).
- Run the image renderer as a separate Deployment so its memory is isolated from Grafana’s limit.
- Bound expensive dashboards: shorter default ranges, maxDataPoints, recording rules for high-cardinality queries.
- Alert on container memory vs. limit (e.g. 85%) to catch pressure before the kill.
- Watch restart counts and --previous logs; a regular kill interval signals a leak, so pin down the plugin/renderer version.
- See more Grafana guides and the sibling too-many-open-files guide.
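The memory-versus-limit alert from the list above could look like the following Prometheus rule. This is a sketch that assumes cAdvisor metrics (container_memory_working_set_bytes) and kube-state-metrics (kube_pod_container_resource_limits) are being scraped; the file name, alert name, namespace, and container label values are hypothetical.

```shell
# Write a Prometheus rule file that fires when the Grafana container's
# working set stays above 85% of its memory limit for 5 minutes.
# Assumptions: namespace "monitoring", container name "grafana".
cat > grafana-memory-rules.yaml <<'EOF'
groups:
  - name: grafana-memory
    rules:
      - alert: GrafanaMemoryNearLimit
        expr: |
          max by (namespace, pod, container) (
            container_memory_working_set_bytes{namespace="monitoring", container="grafana"}
          )
          /
          max by (namespace, pod, container) (
            kube_pod_container_resource_limits{namespace="monitoring", container="grafana", resource="memory"}
          )
          > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Grafana container memory above 85% of its limit"
EOF
```

Aggregating both sides with the same `max by (...)` labels keeps the division one-to-one even when a restart briefly leaves two series for the same container.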
Quick Command Reference
# Confirm the OOMKill + limits
kubectl -n monitoring describe pod grafana-0 | grep -iE "OOMKilled|Exit Code|Restart Count"
kubectl -n monitoring get deploy grafana \
-o jsonpath='{.spec.template.spec.containers[0].resources}'; echo
# Live memory vs limit
kubectl -n monitoring top pod -l app=grafana --containers
# Killed container's logs
kubectl -n monitoring logs grafana-0 --previous | tail -40
# Raise memory
kubectl -n monitoring set resources deploy/grafana \
--limits=memory=1Gi --requests=memory=512Mi
kubectl -n monitoring rollout status deploy/grafana
Conclusion
An OOMKilled Grafana pod (exit 137) means the container exceeded its memory limit and the kernel killed it. Typical root causes:
- A limits.memory set too low for the real query/render load.
- The in-process image renderer adding Chromium’s spiky memory to Grafana’s limit.
- Heavy queries buffering large result sets.
- A plugin/renderer memory leak (regular kill interval).
- Alerting and concurrent-render spikes during incidents.
Confirm the OOMKill and watch kubectl top against the limit first; if the renderer is the driver, isolating it into its own Deployment is usually the durable fix rather than just raising the number.