Kubernetes Jobs and CronJobs: Patterns That Hold Up

Batch workloads are where Kubernetes lulls you into a false sense of ease. A Job is a few lines of YAML, a CronJob a few more, and the happy path works on the first try. Then production happens: a CronJob fires twice for one schedule and double-charges customers, finished Jobs pile up until kubectl get pods is unusable, a long task gets killed at exactly the wrong moment, or a missed schedule silently never runs and nobody notices for a week. None of these are exotic — they’re the default behavior if you don’t configure against them.

Here are the Job and CronJob patterns I now apply by reflex, each one earned from a batch workload that misbehaved.

Jobs: completions, parallelism, and retries

A Job runs pods until a specified number succeed. The two knobs that define its shape are completions (how many successful pods you need) and parallelism (how many run at once):

apiVersion: batch/v1
kind: Job
metadata:
  name: reindex
spec:
  completions: 12        # 12 successful runs needed
  parallelism: 4         # 4 at a time
  backoffLimit: 6        # allow up to 6 retries of failed pods before the Job is marked failed
  activeDeadlineSeconds: 3600   # hard wall-clock cap
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: acme/reindex:1.4

The settings that save you:

backoffLimit caps retries. The default is 6, and failed pods are recreated with exponential backoff (10s, 20s, 40s, and so on, capped at six minutes). A Job that always fails will retry six times, which is fine. But a Job that fails fast still exhausts the default budget in roughly ten minutes and gives up before a slower dependency recovers. Set it to match how your failures actually behave.
activeDeadlineSeconds is a hard ceiling on total runtime. A Job that hangs forever (waiting on a lock, a dead endpoint) will sit there indefinitely without this. Always set it.
restartPolicy: Never vs OnFailure. With Never, each failed attempt leaves a pod behind (useful for debugging, but they accumulate). With OnFailure, the kubelet restarts the failed container inside the same pod, so there is less clutter but also less evidence left over from earlier attempts. I prefer Never plus a sane backoffLimit so I can inspect failed attempts.
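To put rough numbers on the retry budget, here is a small sketch of the backoff arithmetic. It assumes the documented delay pattern (10 seconds, doubling per failure, capped at six minutes) and ignores scheduling and run time, so treat the total as a lower bound. The function name is mine, not part of any Kubernetes API.

```python
def backoff_delays(backoff_limit, base_seconds=10, cap_seconds=360):
    """Approximate wait before each retry: base doubles per failure, capped.

    Assumption: one delay per allowed retry, following the 10s/20s/40s
    pattern described in the Job documentation.
    """
    return [min(base_seconds * 2 ** attempt, cap_seconds)
            for attempt in range(backoff_limit)]


delays = backoff_delays(6)   # [10, 20, 40, 80, 160, 320]
total_wait = sum(delays)     # 630 seconds, about ten and a half minutes
```

If the dependency you are waiting on typically takes longer than that to recover, a higher backoffLimit (or a retry loop inside the task itself) is the knob to reach for.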

For embarrassingly parallel work where each item is independent, the Indexed completion mode gives each pod a stable index via JOB_COMPLETION_INDEX, so pod 3 always processes shard 3:

spec:
  completionMode: Indexed
  completions: 10
  parallelism: 10

That turns a Job into a clean fan-out across a fixed set of shards without a work queue.
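On the worker side, the mapping from index to shard can be as small as the sketch below. JOB_COMPLETION_INDEX is the real environment variable, with values from 0 to completions minus 1; the shard names and the helper function are hypothetical stand-ins for whatever your task partitions on.

```python
import os

# Hypothetical shard list: one entry per completion, in a stable order
# that every pod computes identically.
SHARDS = [f"customers-{i:02d}" for i in range(10)]


def shard_for_this_pod(shards, env=os.environ):
    """Return the shard assigned to this pod's completion index."""
    index = int(env["JOB_COMPLETION_INDEX"])
    if not 0 <= index < len(shards):
        raise ValueError(f"index {index} has no shard; check completions")
    return shards[index]


# Simulating the pod that received index 3:
example = shard_for_this_pod(SHARDS, {"JOB_COMPLETION_INDEX": "3"})
```

The explicit range check is there because a mismatch between completions and the shard count should fail loudly rather than skip or repeat work.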

The idempotency rule

This is the one that matters most, and no Job setting can substitute for it: your batch task must be safe to run more than once. Kubernetes Jobs guarantee at-least-once execution, not exactly-once. A pod can succeed, the node can die before the status is recorded, and the Job controller reruns it. A CronJob under load can fire overlapping runs. If “run twice” means “charge the customer twice” or “send the email twice,” that’s a bug in your task, not in Kubernetes.

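Make the work idempotent: write inside a transaction guarded by a unique key, record completed work in a ledger and let the database reject duplicates atomically (a separate check followed by an act can race with an overlapping run), or make the operation naturally repeatable. Design as if every Job pod might run twice, because eventually one will.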
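As one illustration of the unique-key approach, here is a minimal sketch using SQLite from the Python standard library. The table, the key format, and the charge_once function are all hypothetical; the point is only that the database, not the task, decides whether the work has already happened.

```python
import sqlite3


def charge_once(conn, idempotency_key, customer_id, amount_cents):
    """Record a charge at most once per key.

    Returns True if this call recorded the charge, False if an earlier
    run already did. The PRIMARY KEY constraint makes the check atomic.
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "INSERT INTO charges (idempotency_key, customer_id, amount_cents) "
                "VALUES (?, ?, ?)",
                (idempotency_key, customer_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE charges ("
    "idempotency_key TEXT PRIMARY KEY, "
    "customer_id TEXT NOT NULL, "
    "amount_cents INTEGER NOT NULL)"
)

# Simulate the same Job pod running twice for the same billing period.
first = charge_once(conn, "invoice-2024-06:cust-42", "cust-42", 1999)
second = charge_once(conn, "invoice-2024-06:cust-42", "cust-42", 1999)
rows = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Derive the key from the unit of work (customer plus billing period here), never from the pod name or start time, since those differ between the original run and the rerun.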

CronJobs: the four settings that prevent disasters

A CronJob creates a Job on a schedule. Four settings separate a reliable CronJob from a footgun:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  timeZone: "America/New_York"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: acme/report:2.1

concurrencyPolicy. The default is Allow, which lets a new run start even if the previous one is still going. For most jobs that’s a recipe for overlapping runs stepping on each other. Use Forbid to skip the new run if the old one is active, or Replace to kill the old and start fresh. Choosing this deliberately prevents the “fired twice, corrupted the report” class of incident.
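startingDeadlineSeconds. If the controller can’t start a scheduled run on time (control-plane hiccup, cluster busy), this bounds how late it may still start; a run that misses the window is skipped and counted as missed. Without it there is no deadline, so a run that was due hours ago can still start once things recover, long after its output was useful and possibly right on top of the next scheduled run. It also guards against the opposite silent failure: per the Kubernetes documentation, if more than 100 schedules are missed without this set, the controller stops starting Jobs for that CronJob and only logs an error. With the field set, only misses inside that window are counted.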
timeZone. Without this field, the schedule is interpreted in the time zone of the kube-controller-manager, which is usually UTC. That is how a “2am” report ends up running at the wrong local hour. Set timeZone explicitly so the schedule means what you think it means.
History limits. successfulJobsHistoryLimit and failedJobsHistoryLimit cap how many finished Jobs hang around. The defaults are 3 and 1, which is sane; raise them carelessly and a frequent CronJob buries your namespace in completed pods. Keep a few for debugging, no more.

Clean up finished Jobs automatically

Even with history limits on CronJobs, standalone Jobs linger forever by default. The ttlSecondsAfterFinished field tells the controller to delete a Job (and its pods) a set time after it finishes:

spec:
  ttlSecondsAfterFinished: 600   # garbage-collect 10 min after completion

Set this on every standalone Job. It’s the difference between a tidy namespace and one where kubectl get pods scrolls past a thousand Completed pods from last month.

Observe and alert on batch work

The quiet failure is the dangerous one — a CronJob that stops running tells you nothing unless you’re watching:

kubectl get cronjob nightly-report          # LAST SCHEDULE tells you if it's firing
kubectl get jobs --sort-by=.status.startTime
kubectl logs job/nightly-report-28461600    # logs from a specific run

Alert on a CronJob whose status.lastSuccessfulTime is older than its schedule interval (plus some slack for runtime), and on Jobs that hit their backoffLimit. A batch job that silently stops is invisible until the missing output is someone’s emergency.
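The staleness rule is simple enough to sketch in a few lines. This assumes you have already fetched the timestamp, for example with kubectl get cronjob nightly-report -o jsonpath='{.status.lastSuccessfulTime}'; the is_stale function and its grace period are my own illustration, not something Kubernetes provides.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_successful_time, interval, now, grace=timedelta(minutes=15)):
    """True if the last success is older than one interval plus a grace period.

    last_successful_time is the RFC 3339 string from the CronJob status
    (e.g. "2024-06-01T06:00:00Z"), or None if the CronJob never succeeded.
    """
    if not last_successful_time:
        return True
    # Older Python versions do not parse a trailing "Z", so normalize it.
    last = datetime.fromisoformat(last_successful_time.replace("Z", "+00:00"))
    return now - last > interval + grace


now = datetime(2024, 6, 2, 7, 0, tzinfo=timezone.utc)
daily = timedelta(days=1)

missed_last_night = is_stale("2024-06-01T06:00:00Z", daily, now)  # 25h old
ran_last_night = is_stale("2024-06-02T06:00:10Z", daily, now)     # about 1h old
```

Treating "never succeeded" as stale is deliberate: a brand-new CronJob that has not produced output yet is exactly the case you want someone to look at.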

Where AI helps

The batch failure modes are easy to miss in review because the YAML looks complete without the safety fields. I paste a Job or CronJob and ask the model to flag what’s missing — no activeDeadlineSeconds, concurrencyPolicy: Allow on a job that shouldn’t overlap, no TTL, a schedule with no timezone. It’s also good at sanity-checking a cron expression against the intent (“does 0 2 * * 1-5 mean what I said?”). Running batch manifests through our AI code review tool catches exactly these omissions before they ship, and it’ll flag tasks whose description suggests they aren’t idempotent.
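The purely mechanical part of that checklist can also be checked without a model. Below is a rough sketch of such a check over an already-parsed manifest (a plain dict, so no YAML library is assumed); the function name and warning strings are hypothetical, and it only covers the omissions named above.

```python
def missing_safety_fields(manifest):
    """Return warnings for a parsed Job or CronJob manifest (a plain dict)."""
    warnings = []
    spec = manifest.get("spec", {})

    if manifest.get("kind") == "CronJob":
        # An absent concurrencyPolicy means the default, Allow.
        if spec.get("concurrencyPolicy", "Allow") == "Allow":
            warnings.append("concurrencyPolicy is Allow (runs may overlap)")
        if "timeZone" not in spec:
            warnings.append("schedule has no timeZone")
        job_spec = spec.get("jobTemplate", {}).get("spec", {})
    else:
        job_spec = spec
        # History limits clean up CronJob runs; standalone Jobs need a TTL.
        if "ttlSecondsAfterFinished" not in job_spec:
            warnings.append("standalone Job has no ttlSecondsAfterFinished")

    if "activeDeadlineSeconds" not in job_spec:
        warnings.append("no activeDeadlineSeconds")
    return warnings


bare_cronjob = {
    "kind": "CronJob",
    "spec": {"schedule": "0 2 * * *", "jobTemplate": {"spec": {}}},
}
findings = missing_safety_fields(bare_cronjob)
```

A check like this cannot tell whether a task is idempotent or whether Allow is actually intended, which is where a reviewer, human or model, still earns their keep.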

Batch work on Kubernetes is simple to start and full of sharp edges at scale. Set the deadlines, pick the concurrency policy on purpose, make the task idempotent, and clean up after yourself — and your Jobs will hold up when it matters. For more, see our Kubernetes and Helm guides.

AI-assisted reviews are assistive, not authoritative. Always validate scheduling and concurrency behavior in a non-production namespace first.