Queueing Batch and ML Jobs on Shared Clusters With Kueue
The default scheduler can't gang-schedule a multi-pod training job or enforce per-team quota. Kueue adds job queueing, all-or-nothing admission, and fair-share borrowing.
A research team shared a GPU cluster with two other groups, and the failure mode was always the same: someone submitted a 16-GPU training job, the scheduler placed 11 of its pods, ran out of GPUs, and left the other 5 Pending — while the 11 sat there holding GPUs and doing nothing, because the job needs all 16 to start. Meanwhile a flood of single-pod jobs from another team had eaten the quota that should have been shared fairly. The default Kubernetes scheduler has no concept of “admit this job only if all its pods fit” and no concept of per-team quota. It schedules pod by pod, first come first served, and batch workloads suffer for it.
Kueue is the project built for exactly this. It sits in front of Jobs, queues them, and admits them all-or-nothing against quota you define per team. The object model takes a minute to internalize, but it’s the right shape for the problem.
The object hierarchy
Kueue has four objects, and they nest:
- ResourceFlavor describes a kind of node (a GPU pool, a spot pool) via node labels and tolerations.
- ClusterQueue holds quota for those flavors and belongs to a cohort for borrowing.
- LocalQueue is the namespace-scoped entry point a team actually submits to.
- The Workload (created automatically from a Job) is what gets queued and admitted.
A team interacts with the LocalQueue; platform owns the ClusterQueues and ResourceFlavors. Here’s a GPU flavor and a per-team cluster queue:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    gpu-type: a100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-research
spec:
  cohort: shared-gpu
  namespaceSelector: {}  # match all namespaces; when omitted, no namespace is eligible
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-a100
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
              borrowingLimit: 8  # may borrow up to 8 idle GPUs from the cohort
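The nodeLabels on the flavor must exactly match the labels on the node pool. Kueue copies them into the node selector of the admitted pods, so a typo does not fail loudly: the Workload is admitted and counted against quota, but its pods stay Pending because no node matches. It is one of the most common Kueue setup mistakes.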
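The flavor above uses only nodeLabels. For a tainted pool such as spot nodes, a flavor can also carry tolerations that Kueue adds to the pods it admits. A minimal sketch, assuming a spot pool labeled and tainted with a hypothetical capacity-type key (neither the key nor the flavor name comes from the setup above):

```yaml
# Hypothetical spot-pool flavor; label and taint keys are placeholders.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100-spot            # assumed name
spec:
  nodeLabels:
    gpu-type: a100
    capacity-type: spot          # assumed node label on the spot pool
  tolerations:
    - key: "capacity-type"       # assumed taint key on the spot nodes
      operator: "Equal"
      value: "spot"
      effect: "NoSchedule"
```

Before wiring a flavor into a ClusterQueue, it is worth confirming the selector matches real nodes, for example with kubectl get nodes -l gpu-type=a100. An empty result means admitted pods would have nowhere to land.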
Gang scheduling: all-or-nothing admission
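This is the feature that solves the original problem. Kueue admits a Workload only when all of its pods fit within available quota. A 16-GPU job that can’t get 16 GPUs stays queued: it never gets partially placed, so it never strands GPUs holding a job that can’t start. The pods don’t even get created until the Workload is admitted, because the Job stays suspended until then. Admission is checked against quota rather than live node capacity, so the guarantee holds only while the quota you configure matches what the node pool can actually supply.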
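Where quota can drift from real capacity (nodes draining, a pool scaled down), Kueue's manager configuration has an optional waitForPodsReady setting that evicts and requeues a Workload whose pods do not all become Ready in time. A minimal sketch of that fragment, assuming the v1beta1 Configuration API; the 10m timeout is an illustrative value, not a recommendation:

```yaml
# Fragment of the Kueue controller-manager Configuration,
# not a standalone object to kubectl-apply.
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true     # track whether admitted Workloads get all pods Ready
  timeout: 10m     # assumed value; requeue if pods are not Ready within this window
```

This setting is cluster-wide, so treat it as a platform-team change and check the field names against the Kueue version you run.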
The team’s LocalQueue and the one label that opts a Job into all of this:
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: research
  namespace: team-research
spec:
  clusterQueue: team-research
---
apiVersion: batch/v1
kind: Job
metadata:
  name: train-llm
  namespace: team-research
  labels:
    kueue.x-k8s.io/queue-name: research  # without this label, Kueue ignores the Job entirely
spec:
  parallelism: 16
  completions: 16
  suspend: true  # Kueue un-suspends when it admits the Workload
  template:
    spec:
      containers:
        - name: train
          image: registry.internal/train:latest
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never
That queue-name label is load-bearing: a Job without it bypasses Kueue and quota entirely, so pair Kueue with a policy that rejects un-labeled batch Jobs, or teams will route around the queue (accidentally or not).
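One way to express that guard is a ValidatingAdmissionPolicy. The sketch below is an assumption about how such a policy could look, not part of Kueue: the policy name and the kueue-managed namespace label are hypothetical, and it needs a cluster that serves admissionregistration.k8s.io/v1 policies (Kubernetes 1.30 or later).

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-kueue-queue-name       # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["batch"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["jobs"]
  validations:
    # Reject any Job created without the Kueue queue label.
    - expression: >-
        has(object.metadata.labels) &&
        'kueue.x-k8s.io/queue-name' in object.metadata.labels
      message: "batch Jobs must set the kueue.x-k8s.io/queue-name label"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-kueue-queue-name
spec:
  policyName: require-kueue-queue-name
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        kueue-managed: "true"          # assumed label on team namespaces
```

Scoping the binding to labeled namespaces keeps system Jobs out of its reach. Jobs spawned by CronJobs are checked too, so their jobTemplate needs the label as well.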
Fair-share borrowing and its cost
The cohort is what makes a shared cluster feel fair instead of zero-sum. ClusterQueues in the same cohort can borrow each other’s idle quota up to their borrowingLimit, so when the research team is quiet, another team’s jobs can use those GPUs. The catch is reclaim: when the lending team needs its quota back, Kueue can preempt the borrowed workloads, provided the lending ClusterQueue sets preemption.reclaimWithinCohort (the default, Never, makes the lender wait instead). That’s the right behavior for fairness, but it means a borrowed job can be evicted mid-run, which is genuinely disruptive for a long training job.
# Watch admission and queue depth
kubectl get clusterqueue team-research -o wide
kubectl get workloads -n team-research
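Reclaim is opt-in per ClusterQueue through its preemption stanza. A sketch of the team-research queue with reclaim enabled, assuming the v1beta1 API; the chosen policy values are illustrative and should follow what the teams have agreed on:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-research
spec:
  cohort: shared-gpu
  namespaceSelector: {}                # match all namespaces
  preemption:
    reclaimWithinCohort: Any           # take back lent quota from any borrower in the cohort
    withinClusterQueue: LowerPriority  # inside this queue, preempt only lower-priority Workloads
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-a100
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
              borrowingLimit: 8
```

With reclaimWithinCohort set to LowerPriority instead of Any, only lower-priority borrowers are evicted, which softens the impact on long trainings at the cost of slower reclaim.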
Prompt: Three teams share a 48-GPU A100 cluster. Design Kueue ResourceFlavors, ClusterQueues in one cohort with per-team nominal quota and borrowing limits, and the LocalQueue each team submits to. Ensure 16-GPU training jobs gang-schedule all-or-nothing, and flag where cohort reclaim can evict a running borrowed job. Inspect-only commands — don’t apply.
Output (excerpt): One gpu-a100 flavor; three ClusterQueues in cohort shared-gpu with nominalQuota 16 each and borrowingLimit 8. Suspended Jobs labeled kueue.x-k8s.io/queue-name admit all-or-nothing so 16-GPU jobs never strand GPUs. Cohort reclaimWithinCohort preemption can evict a borrowed running job when the lender reclaims (DESTRUCTIVE for long trainings; recommend checkpointing). Verify with kubectl get clusterqueue -o wide and kueue_pending_workloads.
This is a good AI-assisted design task: the assistant knows the object hierarchy and the gang-scheduling and borrowing semantics, and it produces the manifests plus the inspection commands while I review the quota math against the teams’ real fair-share expectations. I keep it advisory because preemption is disruptive — the model designs, and I stage the rollout and confirm the un-labeled-Job guard is in place. Related autoscaling patterns are in the Kubernetes & Helm guides.
Wrapping up
Shared clusters running batch and ML work need two things the default scheduler can’t provide: gang scheduling so a multi-pod job never strands resources holding a job that can’t start, and per-team quota so one group can’t starve the others. Kueue delivers both through ResourceFlavors, ClusterQueues, cohorts, and LocalQueues — with the sharp edges being node-label typos, the mandatory queue-name label, and cohort reclaim that can preempt borrowed jobs mid-run. Let an AI assistant design the hierarchy and the inspection commands while you review the quota math and stage the rollout. More batch and autoscaling guides are in the Kubernetes & Helm guides, with reusable prompts in the prompt library.