Kubernetes Error Guide: 'Job has reached the specified backoff limit'

Exact Error Message

A Job never completes. After several failed pod attempts it is marked Failed, and its status carries the BackoffLimitExceeded reason:

$ kubectl get job migrate
NAME      COMPLETIONS   DURATION   AGE
migrate   0/1           3m12s      3m12s

$ kubectl describe job migrate
Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      3m     job-controller  Created pod: migrate-2pn7c
  Normal   SuccessfulCreate      2m     job-controller  Created pod: migrate-9xk4d
  Warning  BackoffLimitExceeded  30s    job-controller  Job has reached the specified backoff limit

The headline is Job has reached the specified backoff limit with reason BackoffLimitExceeded. The Job retried its pod up to backoffLimit times (default 6), every attempt failed, and the Job gave up.

What the Error Means

A Job runs a pod until it succeeds (exit code 0). If the pod’s container exits non-zero, the Job controller creates a replacement pod, applying an exponential backoff between attempts (10s, 20s, 40s, and so on, capped at 6 minutes). The spec.backoffLimit field caps the number of retries, so a Job gets at most backoffLimit + 1 failed attempts. Once the failure count exceeds that limit, the controller stops retrying, marks the Job Failed with reason BackoffLimitExceeded, and records the event.
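
To get a feel for how long a Job can spend backing off, the schedule described above can be sketched in shell. This is a rough model only: it assumes the nominal 10s doubling delay with a 360s cap, and the function names backoff_delay and total_backoff are made up for this illustration. Real timing also includes pod scheduling, image pulls, and the time each attempt runs before failing.

```shell
# Nominal delay in seconds before retry N (1-based): 10s, doubling, capped at 360s.
backoff_delay() {
  retry=$1
  delay=10
  i=1
  while [ "$i" -lt "$retry" ] && [ "$delay" -lt 360 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  # Apply the six-minute cap once doubling overshoots it.
  [ "$delay" -gt 360 ] && delay=360
  echo "$delay"
}

# Sum of the nominal delays across all retries allowed by a given backoffLimit.
total_backoff() {
  limit=$1
  total=0
  n=1
  while [ "$n" -le "$limit" ]; do
    total=$((total + $(backoff_delay "$n")))
    n=$((n + 1))
  done
  echo "$total"
}

for n in 1 2 3 4 5 6; do
  echo "retry $n: wait ~$(backoff_delay "$n")s"
done
echo "total nominal wait for backoffLimit=6: $(total_backoff 6)s"
```

Under these assumptions, the default backoffLimit of 6 adds up to 630 seconds of waiting between attempts, on top of the time the pods themselves take.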

The key insight: BackoffLimitExceeded is a symptom, not the root cause. The real problem is why each pod failed — a bug, a missing config, a bad migration, a permission error. The backoff limit just bounds how many times Kubernetes tried before declaring defeat. To fix it you must read the logs of the failed pods, not just raise the limit. Note that restartPolicy interacts here: with restartPolicy: OnFailure the kubelet restarts the container in place; with Never the controller creates a fresh pod per attempt, so you see multiple pods.

Common Causes

Application bug / non-zero exit — the job command genuinely fails (uncaught exception, assertion, bad SQL migration).
Missing configuration or secret — the container references an env var, mounted secret, or file that is absent and exits early.
Permission / network error — the job cannot reach a database, API, or bucket (auth failure, DNS, firewall).
backoffLimit too low for transient errors — a flaky dependency fails a few times before succeeding, but the limit is too small to ride it out.
Bad image or command — wrong entrypoint, missing binary, or command/args typo causing an immediate exit.
OOMKilled each attempt — the container exceeds its memory limit and is killed (exit 137) on every try.

How to Reproduce the Error

Run a Job whose container always exits non-zero with a small backoff limit:

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: busybox:1.36
          command: ["sh", "-c", "echo 'running migration'; exit 1"]

kubectl apply -f migrate-job.yaml
kubectl get pods -l job-name=migrate
kubectl describe job migrate | grep -A10 Events

NAME            READY   STATUS   RESTARTS   AGE
migrate-2pn7c   0/1     Error    0          90s
migrate-9xk4d   0/1     Error    0          50s
migrate-q4r8t   0/1     Error    0          10s

After three failed pods (backoffLimit: 2 means up to 3 attempts), the Job reports BackoffLimitExceeded.

Diagnostic Commands

# Confirm the Job failed and read the reason
kubectl get job <JOB> -o jsonpath='{.status.conditions}'

# List every pod the Job created, including failed ones
kubectl get pods -l job-name=<JOB> -o wide

# The most important step: read the failed pod's logs
kubectl logs <FAILED-POD>
kubectl logs <FAILED-POD> --previous   # if restartPolicy: OnFailure

# Why did the container exit? Look at exit code and state
kubectl describe pod <FAILED-POD> | grep -A6 'State\|Last State\|Exit Code'

# Current backoffLimit value
kubectl get job <JOB> -o jsonpath='{.spec.backoffLimit}'

The failed pod logs and the container Exit Code are the heart of the diagnosis — BackoffLimitExceeded itself tells you nothing about the cause.
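
When a Job has left several failed pods behind, reading them one at a time gets tedious. The following is a minimal sketch of a helper that prints the log tail of every pod in the Failed phase for a given Job. The function name failed_pod_logs and the default of 50 lines are illustrative choices, and the sketch assumes restartPolicy: Never, where each failed attempt leaves its own pod.

```shell
# Print the tail of the logs for every Failed pod that belongs to a Job.
# Usage: failed_pod_logs <job-name> [tail-lines]
failed_pod_logs() {
  job=$1
  lines=${2:-50}
  # -o name prints one "pod/<name>" per line, which kubectl logs accepts directly.
  kubectl get pods -l "job-name=$job" \
      --field-selector=status.phase=Failed -o name |
  while read -r pod; do
    echo "=== $pod ==="
    # Redirect stdin so nothing inside the loop can consume the pod list.
    kubectl logs "$pod" --tail="$lines" < /dev/null
  done
}
```

For example, failed_pod_logs migrate 20 would print the last 20 log lines of each failed migrate pod in the current namespace.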

Step-by-Step Resolution

1. Read the logs of a failed pod. This is the single most important step. The application’s own output usually names the failure:

kubectl logs <FAILED-POD>
# e.g. "FATAL: password authentication failed for user 'app'"

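2. Check the exit code. kubectl describe pod shows the terminated state, under State for a pod that failed once (restartPolicy: Never) or under Last State for a container that was restarted in place (restartPolicy: OnFailure). Common codes: 1 (generic error), 137 (OOMKilled / SIGKILL), 127 (command not found), 139 (segfault):

kubectl describe pod <FAILED-POD> | grep -A6 'State:'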

3. Fix the root cause. Resolve whatever the logs reveal — add the missing secret, fix credentials, correct the migration, fix the entrypoint. If the exit code is 137, raise the memory limit (see OOMKilled).

4. Recreate the Job, do not just re-run. Job specs are largely immutable; delete and reapply with the fix:

kubectl delete job <JOB>
kubectl apply -f <fixed-job>.yaml

5. Only raise backoffLimit for genuinely transient failures. If the work is idempotent and the dependency is flaky, a higher limit can help, because the controller's exponential backoff already spaces the retries out. Never use it to paper over a deterministic bug:

spec:
  backoffLimit: 6

6. Verify completion. Confirm the Job reaches Complete:

kubectl get job <JOB>
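
kubectl get job only shows a snapshot, so in a script or CI pipeline it can help to block until the Job reaches a terminal condition. The sketch below polls the Job's status conditions until it sees Complete or Failed, or gives up after a timeout. The function name wait_for_job, the default timeout, and the return codes are assumptions made for this example.

```shell
# Poll a Job until it reports Complete or Failed, or until the timeout expires.
# Usage: wait_for_job <job-name> [timeout-seconds] [poll-interval-seconds]
# Returns 0 on Complete, 1 on Failed, 2 on timeout.
wait_for_job() {
  job=$1
  timeout=${2:-300}
  interval=${3:-5}
  waited=0
  while [ "$waited" -le "$timeout" ]; do
    complete=$(kubectl get job "$job" \
      -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}')
    failed=$(kubectl get job "$job" \
      -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}')
    if [ "$complete" = "True" ]; then
      echo "Job $job completed"
      return 0
    fi
    if [ "$failed" = "True" ]; then
      echo "Job $job failed" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Timed out waiting for Job $job" >&2
  return 2
}
```

Unlike waiting on the Complete condition alone, this returns as soon as the Job is marked Failed, so a second BackoffLimitExceeded is reported without waiting out the full timeout.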

Prevention and Best Practices

Make Jobs idempotent so safe retries are possible without side effects (e.g. re-running a migration is a no-op when already applied).
Always inspect failed-pod logs before touching backoffLimit — raising it on a deterministic failure just wastes time and resources.
Set realistic memory limits so batch jobs are not silently OOMKilled (exit 137) on every attempt.
Use ttlSecondsAfterFinished to auto-clean finished Jobs while keeping recent failures around long enough to debug.
Pair backoffLimit with activeDeadlineSeconds so a Job whose retries drag on still has a hard wall-clock cap.
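
As an illustration of how these settings fit together, the following sketch writes a Job manifest that bounds retries, total run time, memory, and retention. Every value here (the 3 retries, 600 seconds, 24 hours, 256Mi, and the placeholder command) is an assumption chosen for the example and should be tuned to the actual workload.

```shell
# Write a Job manifest that bounds retries, wall-clock time, memory, and retention.
# The name, image, command, and numeric values are illustrative placeholders.
cat > hardened-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate
spec:
  backoffLimit: 3                # at most 3 retries after the first failure
  activeDeadlineSeconds: 600     # fail the Job after 10 minutes, whatever the retry count
  ttlSecondsAfterFinished: 86400 # delete the finished Job and its pods after 24 hours
  template:
    spec:
      restartPolicy: Never       # one pod per attempt, so each attempt keeps its own logs
      containers:
        - name: migrate
          image: busybox:1.36
          command: ["sh", "-c", "echo 'running migration'"]
          resources:
            limits:
              memory: 256Mi      # explicit limit; raise it if attempts exit with 137
EOF

# Then, against a cluster: kubectl apply -f hardened-job.yaml
```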

Related Errors

Job DeadlineExceeded: a Job killed for running too long rather than failing too often.
CrashLoopBackOff: the same restart-backoff pattern for long-running pods.
OOMKilled: a frequent exit-137 cause of repeated Job failures.

Frequently Asked Questions

Does raising backoffLimit fix the error? Only if the failures are transient. For a deterministic bug, more retries just fail more times and still end in BackoffLimitExceeded. Fix the underlying cause shown in the pod logs first.

Why do I see multiple failed pods instead of restarts? With restartPolicy: Never, the Job controller creates a brand-new pod for each attempt, so failed pods accumulate. With restartPolicy: OnFailure, the kubelet restarts the container inside one pod, and you check --previous logs instead.

How many attempts does the default give me? The default backoffLimit is 6, which allows six retries (up to seven failed attempts, counting the first) before the Job is marked failed, with exponential backoff between them capped at six minutes. The total wall-clock time can therefore be significant.

My pod shows exit code 137 every time — what is that? Exit 137 means the container was SIGKILLed, almost always OOMKilled for exceeding its memory limit. Raise the limit or reduce memory use; no number of retries will help an under-provisioned job.
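
Exit codes above 128 conventionally mean the process was terminated by a signal, with the signal number being the code minus 128 (137 is 128 + 9, SIGKILL). A small lookup like the one below can make exit codes easier to triage. The function name explain_exit_code and the wording of the descriptions are illustrative, and an application is free to use any of these codes for its own purposes.

```shell
# Map a container exit code to its conventional meaning.
# Codes above 128 are treated as 128 + signal number.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "success" ;;
    1)   echo "generic application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "SIGKILL (signal 9), often OOMKilled" ;;
    139) echo "SIGSEGV (signal 11), segmentation fault" ;;
    143) echo "SIGTERM (signal 15)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-defined error"
      fi ;;
  esac
}

# The exit code of a failed pod's first container can be read with, for example:
#   kubectl get pod <FAILED-POD> \
#     -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
```

For a 137, kubectl describe pod also shows the termination Reason, which reads OOMKilled when the memory limit was the cause.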
