Queueing Batch and ML Jobs on Shared Clusters With Kueue

A research team shared a GPU cluster with two other groups, and the failure mode was always the same: someone submitted a 16-GPU training job, the scheduler placed 11 of its pods, ran out of GPUs, and left the other 5 Pending. The 11 placed pods sat there holding GPUs and doing nothing, because the job needs all 16 to start. Meanwhile a flood of single-pod jobs from another team had eaten the capacity that should have been shared fairly. The default Kubernetes scheduler has no concept of “admit this job only if all its pods fit”, and a namespace ResourceQuota is only a hard cap: it rejects pods over the limit rather than queueing them, and it cannot lend idle capacity to another team. The scheduler works pod by pod, roughly first come first served within a priority level, and batch workloads suffer for it.

Kueue is the project built for exactly this. It sits in front of Jobs, queues them, and admits them all-or-nothing against quota you define per team. The object model takes a minute to internalize, but it’s the right shape for the problem.

The object hierarchy

Four Kueue objects matter here, and they nest:

ResourceFlavor describes a kind of node (a GPU pool, a spot pool) via node labels and tolerations.
ClusterQueue holds quota for those flavors and belongs to a cohort for borrowing.
LocalQueue is the namespace-scoped entry point a team actually submits to.
The Workload (created automatically from a Job) is what gets queued and admitted.

A team interacts with the LocalQueue; the platform team owns the ClusterQueues and ResourceFlavors. Here’s a GPU flavor and a per-team cluster queue:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100
spec:
  nodeLabels:
    gpu-type: a100
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-research
spec:
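  namespaceSelector: {}   # match all namespaces; the default (null) selects none, so nothing would be admitted
  cohort: shared-gpu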
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-a100
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
              borrowingLimit: 8     # may borrow up to 8 idle GPUs from the cohort

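The nodeLabels on the flavor must exactly match the labels on the node pool. A typo fails silently: Kueue still admits the Workload against quota and injects the flavor's labels as a node selector, and the pods then sit Pending because no node matches. It is one of the most common Kueue setup mistakes.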
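The flavor description earlier mentions tolerations, which the gpu-a100 example does not use. A minimal sketch of a second flavor for a spot pool follows. The flavor name, the capacity-type label, and the spot taint key are all hypothetical and would need to match whatever your node pool actually carries:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-a100-spot          # hypothetical flavor name
spec:
  nodeLabels:
    gpu-type: a100
    capacity-type: spot        # hypothetical label; must match the spot nodes exactly
  tolerations:                 # added to pods of workloads admitted to this flavor
    - key: "spot"              # hypothetical taint key on the spot nodes
      operator: "Exists"
      effect: "NoSchedule"
```

A ClusterQueue would then list this flavor alongside gpu-a100 under the same resource group, each with its own nominalQuota.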

Gang scheduling: all-or-nothing admission

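This is the feature that solves the original problem. Kueue admits a Workload only when all of its pods fit within available quota. A 16-GPU job that can’t get 16 GPUs of quota stays queued and suspended, so it never gets partially placed and never strands GPUs on a job that can’t start. The pods don’t even get created until the Workload is admitted. One caveat: admission is a quota decision, and the kube-scheduler still places the pods one by one afterwards, so the guarantee holds only while quota does not exceed the GPUs that physically exist.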
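Quota-based admission can still leave pods Pending if the quota numbers drift from real capacity (nodes down, GPUs used by workloads outside Kueue). Kueue has an optional waitForPodsReady setting aimed at that case. The fragment below is a sketch of the controller configuration as I understand the config.kueue.x-k8s.io/v1beta1 schema; the timeout value is an arbitrary assumption, and the field names should be checked against the Kueue version you run:

```yaml
# Fragment of the Kueue controller configuration (not a cluster object you apply on its own).
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true          # evict and requeue a workload whose pods do not all become ready
  timeout: 10m          # assumed value; how long to wait for all pods to be ready
  blockAdmission: true  # hold further admissions until the admitted workload's pods are ready
```

The trade-off is throughput: blocking admission serializes job start-up, which is usually acceptable for a handful of large training jobs and less so for many small ones.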

The team’s LocalQueue and the one label that opts a Job into all of this:

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: research
  namespace: team-research
spec:
  clusterQueue: team-research
---
apiVersion: batch/v1
kind: Job
metadata:
  name: train-llm
  namespace: team-research
  labels:
    kueue.x-k8s.io/queue-name: research   # without this label, Kueue ignores the Job entirely
spec:
  parallelism: 16
  completions: 16
  suspend: true   # Kueue un-suspends when it admits the Workload
  template:
    spec:
      containers:
        - name: train
          image: registry.internal/train:latest
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never

That queue-name label is load-bearing: by default, a Job without it bypasses Kueue and quota entirely. Pair Kueue with a policy that rejects un-labeled batch Jobs, or teams will route around the queue (accidentally or not).
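One way to build that guard is a Kubernetes ValidatingAdmissionPolicy, sketched below. This assumes a cluster where admissionregistration.k8s.io/v1 policies are available, and it scopes the rule with a hypothetical kueue-managed: "true" namespace label so system namespaces are not affected. The policy and binding names are also made up for illustration:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-kueue-queue-name        # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["batch"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["jobs"]
  validations:
    # Reject any Job that does not carry the Kueue queue label.
    - expression: "has(object.metadata.labels) && 'kueue.x-k8s.io/queue-name' in object.metadata.labels"
      message: "batch Jobs must set the kueue.x-k8s.io/queue-name label"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-kueue-queue-name        # hypothetical name
spec:
  policyName: require-kueue-queue-name
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        kueue-managed: "true"           # hypothetical label on team namespaces
```

Jobs created by CronJobs go through the same check, so their jobTemplate needs the label too. Kueue's own manageJobsWithoutQueueName configuration option is an alternative worth evaluating: rather than rejecting un-labeled Jobs, it has Kueue manage them.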

Fair-share borrowing and its cost

The cohort is what makes a shared cluster feel fair instead of zero-sum. ClusterQueues in the same cohort can borrow each other’s idle quota up to their borrowingLimit, so when the research team is quiet, another team’s jobs can use those GPUs. The catch is reclaim: when the lending team needs its quota back and its ClusterQueue enables preemption.reclaimWithinCohort, Kueue preempts the borrowing workloads. (With the default of Never, the lender’s jobs wait instead.) That’s the right behavior for fairness, but it means a borrowed job can be evicted mid-run, which is genuinely disruptive for a long training job.
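Reclaim is configured on the ClusterQueue that wants its quota back. A sketch of the team-research queue with a preemption stanza added; the policy values chosen here are one reasonable combination, not a recommendation from the original setup:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-research
spec:
  namespaceSelector: {}               # match all namespaces
  cohort: shared-gpu
  preemption:
    reclaimWithinCohort: Any          # Never | LowerPriority | Any: evict borrowers to recover nominal quota
    borrowWithinCohort:
      policy: Never                   # do not preempt other queues' workloads just to borrow
    withinClusterQueue: LowerPriority # inside this queue, only lower-priority workloads may be preempted
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: gpu-a100
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
              borrowingLimit: 8
```

With reclaimWithinCohort set to Any, any workload in another queue that is running on GPUs borrowed from team-research can be evicted when team-research submits work that fits its nominal quota. Checkpointing in the training job is what makes that eviction tolerable.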

# Watch admission and queue depth
kubectl get clusterqueue team-research -o wide
kubectl get workloads -n team-research

Prompt: Three teams share a 48-GPU A100 cluster. Design Kueue ResourceFlavors, ClusterQueues in one cohort with per-team nominal quota and borrowing limits, and the LocalQueue each team submits to. Ensure 16-GPU training jobs gang-schedule all-or-nothing, and flag where cohort reclaim can evict a running borrowed job. Inspect-only commands — don’t apply.

Output (excerpt): One gpu-a100 flavor; three ClusterQueues in cohort shared-gpu with nominalQuota 16 each and borrowingLimit 8. Suspended Jobs labeled kueue.x-k8s.io/queue-name admit all-or-nothing so 16-GPU jobs never strand GPUs. Cohort reclaimWithinCohort preemption can evict a borrowed running job when the lender reclaims — DESTRUCTIVE for long trainings; recommend checkpointing. Verify with kubectl get clusterqueue -o wide and kueue_pending_workloads.

This is a good AI-assisted design task: the assistant knows the object hierarchy and the gang-scheduling and borrowing semantics, and it produces the manifests plus the inspection commands while I review the quota math against the teams’ real fair-share expectations. I keep it advisory because preemption is disruptive — the model designs, and I stage the rollout and confirm the un-labeled-Job guard is in place. Related autoscaling patterns are in the Kubernetes & Helm guides.

Wrapping up

Shared clusters running batch and ML work need two things the default scheduler can’t provide: gang scheduling so a multi-pod job never strands resources holding a job that can’t start, and per-team quota so one group can’t starve the others. Kueue delivers both through ResourceFlavors, ClusterQueues, cohorts, and LocalQueues — with the sharp edges being node-label typos, the mandatory queue-name label, and cohort reclaim that can preempt borrowed jobs mid-run. Let an AI assistant design the hierarchy and the inspection commands while you review the quota math and stage the rollout. More batch and autoscaling guides are in the Kubernetes & Helm guides, with reusable prompts in the prompt library.
