Kubernetes Error Guide: 'Job has reached the specified backoff limit'
Fix BackoffLimitExceeded in Kubernetes Jobs: a container keeps failing, exhausts backoffLimit retries, and the Job is marked Failed. Diagnose and fix.
Exact Error Message
A Job never completes. After several failed pod attempts it is marked Failed, and its status carries the BackoffLimitExceeded reason:
$ kubectl get job migrate
NAME COMPLETIONS DURATION AGE
migrate 0/1 3m12s 3m12s
$ kubectl describe job migrate
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 3m job-controller Created pod: migrate-2pn7c
Normal SuccessfulCreate 2m job-controller Created pod: migrate-9xk4d
Warning BackoffLimitExceeded 30s job-controller Job has reached the specified backoff limit
The headline is Job has reached the specified backoff limit with reason BackoffLimitExceeded. The Job retried its pod up to backoffLimit times (default 6), every attempt failed, and the Job gave up.
What the Error Means
A Job runs a pod until it succeeds (exit code 0). If the pod’s container exits non-zero, the Job controller creates a replacement pod, applying an exponential backoff between attempts (10s, 20s, 40s, and so on, capped at 6 minutes). The spec.backoffLimit field sets how many retries are allowed. Once the number of failures exceeds that limit, the controller stops retrying, marks the Job Failed with reason BackoffLimitExceeded, and records the event.
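The backoff schedule described above can be approximated with a few lines of shell. This is a simplified model, assuming a 10s base delay that doubles on each retry and is capped at 360s; the helper name backoff_delay is hypothetical, and real controller timing will vary with pod startup and run time.

```shell
#!/bin/sh
# Nominal wait (in seconds) before retry number N (1-based).
# Assumption: 10s base, doubling each retry, capped at 360s (6 minutes).
backoff_delay() {
  n=$1
  delay=10
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 360 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 360 ]; then
    delay=360
  fi
  echo "$delay"
}

# Print the nominal schedule for six retries and the total time spent waiting.
total=0
for retry in 1 2 3 4 5 6; do
  d=$(backoff_delay "$retry")
  total=$((total + d))
  printf 'retry %d: wait %ds\n' "$retry" "$d"
done
echo "total nominal wait: ${total}s"   # 630s for six retries
```

Under these assumptions a Job with the default limit spends roughly ten minutes just waiting between attempts, before counting the time each pod takes to start and fail.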
The key insight: BackoffLimitExceeded is a symptom, not the root cause. The real problem is why each pod failed — a bug, a missing config, a bad migration, a permission error. The backoff limit just bounds how many times Kubernetes tried before declaring defeat. To fix it you must read the logs of the failed pods, not just raise the limit. Note that restartPolicy interacts here: with restartPolicy: OnFailure the kubelet restarts the container in place; with Never the controller creates a fresh pod per attempt, so you see multiple pods.
Common Causes
- Application bug / non-zero exit — the job command genuinely fails (uncaught exception, assertion, bad SQL migration).
- Missing configuration or secret — the container references an env var, mounted secret, or file that is absent and exits early.
- Permission / network error — the job cannot reach a database, API, or bucket (auth failure, DNS, firewall).
- backoffLimit too low for transient errors — a flaky dependency fails a few times before succeeding, but the limit is too small to ride it out.
- Bad image or command — wrong entrypoint, missing binary, or command/args typo causing an immediate exit.
- OOMKilled each attempt — the container exceeds its memory limit and is killed (exit 137) on every try.
How to Reproduce the Error
Run a Job whose container always exits non-zero with a small backoff limit:
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox:1.36
          command: ["sh", "-c", "echo 'running migration'; exit 1"]
kubectl apply -f migrate-job.yaml
kubectl get pods -l job-name=migrate
kubectl describe job migrate | grep -A4 Events
NAME READY STATUS RESTARTS AGE
migrate-2pn7c 0/1 Error 0 90s
migrate-9xk4d 0/1 Error 0 50s
migrate-q4r8t 0/1 Error 0 10s
After three failed pods (backoffLimit: 2 means up to 3 attempts), the Job reports BackoffLimitExceeded.
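With several failed pods on hand, it can be convenient to extract just their names so each one's logs can be read in turn. The following is a minimal sketch: failed_pods is a hypothetical helper that filters the default `kubectl get pods` table on its STATUS column, and it assumes only the Error and OOMKilled statuses are of interest.

```shell
#!/bin/sh
# Read `kubectl get pods` table output on stdin and print the names of pods
# whose STATUS column (third field) is Error or OOMKilled.
# Assumption: default column layout NAME READY STATUS RESTARTS AGE.
failed_pods() {
  awk 'NR > 1 && ($3 == "Error" || $3 == "OOMKilled") { print $1 }'
}

# Usage against a live cluster (not executed here):
#   kubectl get pods -l job-name=migrate | failed_pods |
#     while read -r pod; do kubectl logs "$pod"; done
```

Filtering the table text keeps the sketch dependency-free; other failure statuses would need to be added to the awk condition.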
Diagnostic Commands
# Confirm the Job failed and read the reason
kubectl get job <JOB> -o jsonpath='{.status.conditions}'
# List every pod the Job created, including failed ones
kubectl get pods -l job-name=<JOB> -o wide
# The most important step: read the failed pod's logs
kubectl logs <FAILED-POD>
kubectl logs <FAILED-POD> --previous # if restartPolicy: OnFailure
# Why did the container exit? Look at exit code and state
kubectl describe pod <FAILED-POD> | grep -A6 'State\|Last State\|Exit Code'
# Current backoffLimit value
kubectl get job <JOB> -o jsonpath='{.spec.backoffLimit}'
The failed pod logs and the container Exit Code are the heart of the diagnosis — BackoffLimitExceeded itself tells you nothing about the cause.
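Since the exit code carries so much of the diagnosis, a small lookup can help when triaging many pods. This sketch covers only a handful of commonly seen codes; explain_exit_code is a hypothetical helper, and the fallback relies on the usual shell convention that codes above 128 mean "terminated by signal (code minus 128)".

```shell
#!/bin/sh
# Translate a container exit code into a short, human-readable hint.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "success" ;;
    1)   echo "generic application error" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Convention: 128 + N means the process died from signal N.
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific error"
      fi
      ;;
  esac
}

explain_exit_code 137
```

The hint only narrows the search; the pod logs remain the place to confirm what actually went wrong.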
Step-by-Step Resolution
1. Read the logs of a failed pod. This is the single most important step. The application’s own output usually names the failure:
kubectl logs <FAILED-POD>
# e.g. "FATAL: password authentication failed for user 'app'"
2. Check the exit code. kubectl describe pod shows the terminated state. Common codes: 1 (generic error), 137 (OOMKilled / SIGKILL), 127 (command not found), 139 (segfault):
kubectl describe pod <FAILED-POD> | grep -A6 'Last State'
3. Fix the root cause. Resolve whatever the logs reveal — add the missing secret, fix credentials, correct the migration, fix the entrypoint. If the exit code is 137, raise the memory limit (see OOMKilled).
4. Recreate the Job, do not just re-run. Job specs are largely immutable; delete and reapply with the fix:
kubectl delete job <JOB>
kubectl apply -f <fixed-job>.yaml
5. Only raise backoffLimit for genuinely transient failures. If the work is idempotent and the dependency is flaky, a higher limit gives the exponential backoff more time to ride out the outage, but never use it to paper over a deterministic bug:
spec:
  backoffLimit: 6
6. Verify completion. Confirm the Job reaches Complete:
kubectl get job <JOB>
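The delete, reapply, and verify steps above can be wrapped in one helper. This is a sketch under stated assumptions: recreate_job and the KUBECTL override variable are hypothetical names, and the 300s timeout is an arbitrary illustrative value. Note that `kubectl wait --for=condition=complete` blocks until the timeout if the Job fails again, so a non-zero return means "go back to the logs".

```shell
#!/bin/sh
# Allow the kubectl binary to be overridden (for example, with `echo` for a dry run).
KUBECTL="${KUBECTL:-kubectl}"

# Delete the old Job, apply the fixed manifest, then wait for completion.
# Arguments: <job-name> <manifest-file>
recreate_job() {
  job=$1
  manifest=$2
  "$KUBECTL" delete job "$job" --ignore-not-found &&
    "$KUBECTL" apply -f "$manifest" &&
    "$KUBECTL" wait --for=condition=complete "job/$job" --timeout=300s
}

# Usage (not executed here):
#   recreate_job migrate fixed-job.yaml
```

Setting KUBECTL=echo prints the three commands instead of running them, which is a cheap way to review what the helper would do.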
Prevention and Best Practices
- Make Jobs idempotent so safe retries are possible without side effects (e.g. re-running a migration is a no-op when already applied).
- Always inspect failed-pod logs before touching backoffLimit — raising it on a deterministic failure just wastes time and resources.
- Set realistic memory limits so batch jobs are not silently OOMKilled (exit 137) on every attempt.
- Use ttlSecondsAfterFinished to auto-clean finished Jobs while keeping recent failures around long enough to debug.
- Pair backoffLimit with activeDeadlineSeconds so a Job that retries forever still has a hard wall-clock cap.
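Several of these practices can be combined in one manifest. The sketch below writes a hypothetical hardened-job.yaml; the file name and every numeric value (3 retries, a 15-minute deadline, a one-day TTL, the memory figures) are illustrative assumptions to be tuned per workload, not recommendations.

```shell
#!/bin/sh
# Write an example Job manifest that pairs backoffLimit with a wall-clock
# deadline, a cleanup TTL, and explicit memory requests/limits.
cat > hardened-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate
spec:
  backoffLimit: 3                  # retries allowed before the Job fails
  activeDeadlineSeconds: 900       # hard cap on total Job run time
  ttlSecondsAfterFinished: 86400   # keep finished Jobs one day for debugging
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox:1.36
          command: ["sh", "-c", "echo 'running migration'"]
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "256Mi"
EOF
echo "wrote hardened-job.yaml"
```

The file can then be applied with kubectl apply -f hardened-job.yaml once the values have been adjusted.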
Related Errors
- Job DeadlineExceeded — a Job killed for running too long rather than failing too often.
- CrashLoopBackOff — the same restart-backoff pattern for long-running pods.
- OOMKilled — a frequent exit-137 cause of repeated Job failures.
Frequently Asked Questions
Does raising backoffLimit fix the error? Only if the failures are transient. For a deterministic bug, more retries just fail more times and still end in BackoffLimitExceeded. Fix the underlying cause shown in the pod logs first.
Why do I see multiple failed pods instead of restarts? With restartPolicy: Never, the Job controller creates a brand-new pod for each attempt, so failed pods accumulate. With restartPolicy: OnFailure, the kubelet restarts the container inside one pod, and you check --previous logs instead.
How many attempts does the default give me? The default backoffLimit is 6, which allows six retries, so you will typically see up to seven failed attempts before the Job is marked failed (just as backoffLimit: 2 yields three attempts). Exponential backoff applies between attempts and is capped at six minutes, so the total wall-clock time can be significant.
My pod shows exit code 137 every time — what is that? Exit 137 means the container was SIGKILLed, almost always OOMKilled for exceeding its memory limit. Raise the limit or reduce memory use; no number of retries will help an under-provisioned job.