Kubernetes Jobs and CronJobs Patterns That Hold Up
Batch work on Kubernetes looks trivial until a CronJob fires twice, piles up, or never cleans up. Here are the Job and CronJob patterns that survive production.
- #kubernetes
- #jobs
- #cronjob
- #batch
- #scheduling
- #reliability
Batch workloads are where Kubernetes lulls you into a false sense of ease. A Job is a few lines of YAML, a CronJob a few more, and the happy path works on the first try. Then production happens: a CronJob fires twice for one schedule and double-charges customers, finished Jobs pile up until kubectl get pods is unusable, a long task gets killed at exactly the wrong moment, or a missed schedule silently never runs and nobody notices for a week. None of these are exotic — they’re the default behavior if you don’t configure against them.
Here are the Job and CronJob patterns I now apply by reflex, each one earned from a batch workload that misbehaved.
Jobs: completions, parallelism, and retries
A Job runs pods until a specified number succeed. The two knobs that define its shape are completions (how many successful pods you need) and parallelism (how many run at once):
apiVersion: batch/v1
kind: Job
metadata:
  name: reindex
spec:
  completions: 12               # 12 successful pods needed
  parallelism: 4                # at most 4 pods at a time
  backoffLimit: 6               # mark the Job failed after 6 pod-failure retries
  activeDeadlineSeconds: 3600   # hard wall-clock cap
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: acme/reindex:1.4
The settings that save you:
- backoffLimit caps retries. The default is 6, with exponential backoff between attempts. A Job that always fails will retry six times, which is fine, but a Job that fails fast can burn through those retries within minutes and give up before a slow-to-recover dependency comes back. Set it to match how your failures actually behave.
- activeDeadlineSeconds is a hard ceiling on total runtime. A Job that hangs (waiting on a lock, a dead endpoint) will sit there indefinitely without it. Always set it.
- restartPolicy: Never vs OnFailure. With Never, each failed attempt leaves a pod behind (useful for debugging, but they accumulate). With OnFailure, the failed container restarts in place inside the same pod. I prefer Never plus a sane backoffLimit so I can inspect failed attempts.
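To pick a backoffLimit, it helps to estimate how long the retry budget lasts. The sketch below assumes the back-off schedule described in the Kubernetes documentation (10s, 20s, 40s, doubling up to a six-minute cap) and ignores pod scheduling and startup time, so treat the result as a lower bound. The function name `retry_window_seconds` is made up for illustration.

```python
def retry_window_seconds(backoff_limit, base=10, cap=360):
    """Sum the back-off delays that precede each of `backoff_limit` retries.

    Assumes delays start at `base` seconds and double each time,
    capped at `cap` seconds (assumption: 10s base, 6-minute cap).
    """
    total = 0
    delay = base
    for _ in range(backoff_limit):
        total += min(delay, cap)
        delay *= 2
    return total


if __name__ == "__main__":
    for limit in (2, 6, 8):
        print(f"backoffLimit={limit}: about {retry_window_seconds(limit)}s of back-off")
```

If the dependency you rely on typically takes longer to recover than that window, raise backoffLimit or add retries inside the task itself.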
For embarrassingly parallel work where each item is independent, the Indexed completion mode gives each pod a stable index via JOB_COMPLETION_INDEX, so pod 3 always processes shard 3:
spec:
  completionMode: Indexed
  completions: 10
  parallelism: 10
That turns a Job into a clean fan-out across a fixed set of shards without a work queue.
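On the worker side, the fan-out can be as small as reading the index and slicing the input. This is a minimal sketch, not a prescribed layout: the helper names and the round-robin slicing are illustrative choices, and only JOB_COMPLETION_INDEX comes from Kubernetes.

```python
import os


def shard_for_index(items, index, total):
    """Return the items that belong to shard `index` out of `total` shards.

    Round-robin slicing: shard 3 of 4 gets items 3, 7, 11, ...
    """
    if not 0 <= index < total:
        raise ValueError(f"index {index} outside 0..{total - 1}")
    return items[index::total]


def load_shard(items, total, environ=os.environ):
    """Pick this pod's shard using the index Kubernetes injects.

    JOB_COMPLETION_INDEX is set by the Job controller in Indexed mode.
    `total` must match spec.completions (passed in by the caller here).
    """
    index = int(environ["JOB_COMPLETION_INDEX"])
    return shard_for_index(items, index, total)
```

Because the index is stable across retries, a rerun of pod 3 processes the same shard again, which is one more reason the task itself has to be safe to repeat.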
The idempotency rule
This is the one that matters most, and no Job setting can substitute for it: your batch task must be safe to run more than once. Kubernetes Jobs guarantee at-least-once execution, not exactly-once. A pod can succeed, the node can die before the status is recorded, and the Job controller reruns it. A CronJob under load can fire overlapping runs. If “run twice” means “charge the customer twice” or “send the email twice,” that’s a bug in your task, not in Kubernetes.
Make the work idempotent: use a transaction with a unique key, check-then-act against a ledger, or make the operation naturally repeatable. Design as if every Job pod might run twice, because eventually one will.
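One way to sketch the "unique key against a ledger" approach is to let a database constraint decide whether the work has already happened. The example uses SQLite from the standard library; the table name, the key format, and `charge_once` are hypothetical, and a real task would use its own datastore.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Open (or create) a ledger whose primary key is the idempotency key."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger ("
        "idempotency_key TEXT PRIMARY KEY, "
        "amount_cents INTEGER NOT NULL)"
    )
    return conn


def charge_once(conn, key, amount_cents):
    """Record a charge exactly once per key.

    Returns True if this call recorded the charge, False if a previous
    run already did. The PRIMARY KEY constraint makes the check atomic.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ledger (idempotency_key, amount_cents) VALUES (?, ?)",
                (key, amount_cents),
            )
            # The real side effect belongs here, ideally in the same transaction.
    except sqlite3.IntegrityError:
        return False
    return True
```

A key built from the business identity of the work (for example customer plus billing period) makes a second Job pod a no-op rather than a second charge.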
CronJobs: the four settings that prevent disasters
A CronJob creates a Job on a schedule. Four fields separate a reliable CronJob from a footgun:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  timeZone: "America/New_York"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800
      backoffLimit: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: acme/report:2.1
- concurrencyPolicy. The default is Allow, which lets a new run start even if the previous one is still going. For most jobs that’s a recipe for overlapping runs stepping on each other. Use Forbid to skip the new run if the old one is active, or Replace to kill the old and start fresh. Choosing this deliberately prevents the “fired twice, corrupted the report” class of incident.
- startingDeadlineSeconds. If the controller can’t start a scheduled run on time (control-plane hiccup, cluster busy), this bounds how late it may still start. Without it, missed runs can fire late and together when things recover, a thundering herd of catch-up Jobs. It also prevents the opposite silent failure: if more than 100 schedules are missed without this set, the CronJob stops scheduling entirely and goes quiet.
- timeZone. CronJobs historically ran in the time zone of the controller manager, usually UTC, which is how a “2am” report ends up running at the wrong local hour. Set timeZone explicitly so the schedule means what you think it means.
- History limits. successfulJobsHistoryLimit and failedJobsHistoryLimit cap how many finished Jobs hang around. The defaults are 3 and 1; set them explicitly, because a frequent CronJob with generous limits buries your namespace in completed pods. Keep a few for debugging, no more.
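Most of these fields can be checked mechanically before a manifest ships. The sketch below works on a manifest that has already been parsed into a Python dict (how you load the YAML is up to you); `lint_cronjob` and its warning messages are invented for illustration and cover only a few of the fields discussed here.

```python
def lint_cronjob(manifest):
    """Return warnings for missing or risky CronJob safety fields."""
    spec = manifest.get("spec", {})
    job_spec = spec.get("jobTemplate", {}).get("spec", {})
    warnings = []

    # An absent concurrencyPolicy behaves like Allow.
    if spec.get("concurrencyPolicy", "Allow") == "Allow":
        warnings.append("concurrencyPolicy is Allow: runs may overlap")
    if "startingDeadlineSeconds" not in spec:
        warnings.append("startingDeadlineSeconds is not set")
    if "timeZone" not in spec:
        warnings.append("timeZone is not set")
    if "activeDeadlineSeconds" not in job_spec:
        warnings.append("jobTemplate has no activeDeadlineSeconds")
    return warnings


if __name__ == "__main__":
    bare = {"kind": "CronJob", "spec": {"schedule": "0 2 * * *"}}
    for warning in lint_cronjob(bare):
        print("WARN:", warning)
```

Allow is a legitimate choice for some jobs, so a check like this is better as a prompt for a deliberate decision than as a hard gate.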
Clean up finished Jobs automatically
Even with history limits on CronJobs, standalone Jobs linger forever by default. The ttlSecondsAfterFinished field tells the controller to delete a Job (and its pods) a set time after it finishes:
spec:
  ttlSecondsAfterFinished: 600   # garbage-collect 10 min after completion
Set this on every standalone Job. It’s the difference between a tidy namespace and one where kubectl get pods scrolls past a thousand Completed pods from last month.
Observe and alert on batch work
The quiet failure is the dangerous one — a CronJob that stops running tells you nothing unless you’re watching:
kubectl get cronjob nightly-report # LAST SCHEDULE tells you if it's firing
kubectl get jobs --sort-by=.status.startTime
kubectl logs job/nightly-report-28461600 # logs from a specific run
Alert on a CronJob whose last-successful-time is older than its interval, and on Jobs that hit their backoffLimit. A batch job that silently stops is invisible until the missing output is someone’s emergency.
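The staleness rule is simple enough to sketch. This assumes you have already fetched the CronJob's status.lastSuccessfulTime as an RFC 3339 string (for example with kubectl and a jsonpath query) and that you know the schedule's interval; the function names and the 15-minute grace period are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts):
    """Parse a Kubernetes timestamp such as '2024-06-03T06:00:12Z' as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def is_stale(last_successful, interval, now, grace=timedelta(minutes=15)):
    """True if the last success is older than one interval plus a grace period.

    A CronJob that has never succeeded (no timestamp) counts as stale.
    """
    if not last_successful:
        return True
    return now - parse_k8s_time(last_successful) > interval + grace


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_stale("2024-06-03T06:00:12Z", timedelta(days=1), now))
```

The grace period should cover the job's normal runtime, otherwise the alert fires every time a run is merely in progress.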
Where AI helps
The batch failure modes are easy to miss in review because the YAML looks complete without the safety fields. I paste a Job or CronJob and ask the model to flag what’s missing — no activeDeadlineSeconds, concurrencyPolicy: Allow on a job that shouldn’t overlap, no TTL, a schedule with no timezone. It’s also good at sanity-checking a cron expression against the intent (“does 0 2 * * 1-5 mean what I said?”). Running batch manifests through our AI code review tool catches exactly these omissions before they ship, and it’ll flag tasks whose description suggests they aren’t idempotent.
Batch work on Kubernetes is simple to start and full of sharp edges at scale. Set the deadlines, pick the concurrency policy on purpose, make the task idempotent, and clean up after yourself — and your Jobs will hold up when it matters. For more, see our Kubernetes and Helm guides.
AI-assisted reviews are assistive, not authoritative. Always validate scheduling and concurrency behavior in a non-production namespace first.