GitLab CI Error Guide: 'waiting for pod running: timed out waiting for pod to start' Kubernetes Executor
Fix GitLab Kubernetes-executor pod timeouts: image pull secrets, unschedulable nodes, taints, namespace quotas, helper image pulls, and a too-low poll_timeout.
Exact Error Message
A job using the Kubernetes executor fails in the prepare stage with a timeout:
ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start
Two closely related variants point at why the pod never started:
ERROR: Job failed (system failure): prepare environment: image pull failed: failed to pull image "registry.example.com/ci/build:latest": rpc error: code = Unknown desc = failed to pull and unpack image ... 401 Unauthorized
ERROR: Job failed (system failure): prepare environment: waiting for pod running: pod "runner-xyz-project-42-concurrent-0" status is "Pending"
All three mean the same thing at the GitLab level — the build pod never reached Running before poll_timeout elapsed — but the Pending and image pull variants tell you which Kubernetes problem to chase.
What the Error Means
With the Kubernetes executor, GitLab Runner does not run your job on a fixed host. For each job it asks the cluster to create a build pod (build container + helper container + any services:), then polls the Kubernetes API until that pod reports Running. Only then does it stream your script: into the build container.
If the pod is still Pending or ContainerCreating when the runner’s poll_timeout expires, the runner gives up and reports timed out waiting for pod to start as a system failure. The job log is the GitLab side of the story; the real reason lives in the Kubernetes events for that pod — image pull errors, no schedulable node, a LimitRange rejection, or a quota denial. The runner only sees “still not Running,” so you must inspect the cluster to learn why.
This is fundamentally a scheduling and pull problem, not a script problem. Your .gitlab-ci.yml commands never ran.
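The poll-until-Running behaviour described above can be pictured with a small shell sketch. This is only an illustration of the idea, not the runner's actual implementation: the function name wait_for_running and its arguments are hypothetical, and the command that reports the pod phase is passed in by the caller.

```shell
#!/usr/bin/env bash
# Illustrative only: mimics "poll until Running or give up".
# wait_for_running TIMEOUT INTERVAL PHASE_CMD...
#   PHASE_CMD must print the pod phase (Pending, Running, ...) each time it runs.
wait_for_running() {
  local timeout=$1 interval=$2
  shift 2
  local elapsed=0 phase
  while [ "$elapsed" -lt "$timeout" ]; do
    phase=$("$@")
    if [ "$phase" = "Running" ]; then
      echo "pod is Running after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  # Same outcome the job log reports: the pod never reached Running in time.
  echo "timed out waiting for pod to start (last phase: ${phase:-unknown})"
  return 1
}
```

Against a real cluster it could be called as wait_for_running 180 3 kubectl get pod -n gitlab-ci "$POD" -o jsonpath='{.status.phase}', where POD is a placeholder for the build pod name. Note that the loop only ever learns the phase, which is why the job log cannot say why the pod is stuck.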
Common Causes
- Image cannot be pulled. Wrong image name/tag, a private registry with no imagePullSecrets, or a rate-limited public registry → pod stuck in ImagePullBackOff/ErrImagePull.
- The GitLab helper image fails to pull (air-gapped cluster without helper_image mirrored), so the pod never fully starts.
- No schedulable node. Cluster is at capacity, or every node has a taint the pod does not tolerate → Pending, with the event 0/N nodes are available.
- Resource requests exceed availability. cpu_request/memory_request in config.toml are larger than any node can satisfy → Insufficient cpu/memory.
- Namespace ResourceQuota or LimitRange rejection. The pod is denied admission or forced to invalid limits.
- CNI / networking not ready on a freshly added node, delaying ContainerCreating.
- Cluster autoscaler lag. A new node is being provisioned but takes longer than poll_timeout to join and schedule.
- poll_timeout too low for large images or slow pulls: the default may be too short for a cold node.
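As a rough triage aid, the causes above can be mapped from event text to a coarse label. The sketch below is an assumption-heavy helper, not a kubectl feature: the function name classify_pod_events and its labels are made up for this guide, and the matched substrings are typical event wording that can differ between Kubernetes versions.

```shell
# classify_pod_events: reads `kubectl describe pod` output (or any event text)
# on stdin and prints one coarse label. Labels are hypothetical names.
classify_pod_events() {
  local text
  text=$(cat)
  # Order matters: "Insufficient cpu" and taint messages also contain
  # "nodes are available", so the more specific patterns come first.
  case "$text" in
    *ImagePullBackOff*|*ErrImagePull*)            echo "image-pull" ;;
    *"exceeded quota"*)                           echo "quota" ;;
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "insufficient-resources" ;;
    *"untolerated taint"*|*"didn't tolerate"*)    echo "taint" ;;
    *"nodes are available"*)                      echo "unschedulable" ;;
    *)                                            echo "unknown" ;;
  esac
}
```

A possible use is kubectl describe pod -n gitlab-ci <build-pod> | classify_pod_events. An "unknown" result only means none of the assumed substrings matched, so the Events still need to be read by hand.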
How to Reproduce the Error
Point a job at a private image without supplying pull credentials, on a runner using the Kubernetes executor:
# .gitlab-ci.yml
build:
  tags: [k8s]
  image: registry.example.com/private/build:latest  # no pull secret configured
  script:
    - make build

# config.toml: Kubernetes executor, no image_pull_secrets, short timeout
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-ci"
    poll_timeout = 180
    image = "alpine:3.20"
The pod is created but stays Pending/ImagePullBackOff; after 180s the job fails with timed out waiting for pod to start.
Diagnostic Commands
Watch the build pod and, crucially, its events — that is where Kubernetes records the real reason:
# Find the build pod (named runner-<id>-project-<n>-concurrent-<m>)
kubectl get pods -n gitlab-ci
# The single most useful command — events explain Pending / ImagePull / quota
kubectl describe pod -n gitlab-ci runner-xyz-project-42-concurrent-0
# Recent events in the CI namespace, sorted by time
kubectl get events -n gitlab-ci --sort-by=.lastTimestamp | tail -30
# Per-container status (which container is stuck)
kubectl get pod -n gitlab-ci runner-xyz-project-42-concurrent-0 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'
# Node schedulability, taints, and free capacity
kubectl get nodes
kubectl describe node <node> | grep -A5 -E 'Taints|Allocatable|Allocated resources'
# Namespace quota / limit range that may be rejecting the pod
kubectl get resourcequota,limitrange -n gitlab-ci -o yaml
# Runner-side config and timeout
sudo grep -A20 '\[runners.kubernetes\]' /etc/gitlab-runner/config.toml
For deeper GitLab-side tracing, set variables: { CI_DEBUG_TRACE: "true" } in the job and inspect gitlab-runner --debug run if you control the runner host. The pod name embeds CI_PROJECT_ID and the concurrent slot, so kubectl get pods correlates directly to the failing job.
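Because the pod name embeds the project ID, matching pods to a job can be scripted. The following is a minimal sketch that assumes the naming pattern shown in this guide (runner-...-project-<id>-concurrent-<slot>); the function name pods_for_project is hypothetical.

```shell
# pods_for_project PROJECT_ID: reads pod names on stdin and prints only those
# whose name contains "-project-<PROJECT_ID>-concurrent-<slot>".
pods_for_project() {
  local project_id=$1
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -E -- "-project-${project_id}-concurrent-[0-9]+" || true
}
```

For example, kubectl get pods -n gitlab-ci -o name | pods_for_project 42 lists the build pods for project 42 without also matching project 420.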
Step-by-Step Resolution
- Run kubectl describe pod on the stuck pod and read the Events. The fix depends entirely on what they say.
- Image pull (ImagePullBackOff/401): create a registry secret and wire it into the runner:

    kubectl create secret docker-registry ci-pull \
      --docker-server=registry.example.com \
      --docker-username="$REG_USER" --docker-password="$REG_PASS" \
      -n gitlab-ci

    [runners.kubernetes]
      image_pull_secrets = ["ci-pull"]

- Helper image (air-gapped): mirror it and set helper_image explicitly in config.toml.
- Pending / 0/N nodes available: add capacity, or add a toleration / fix the node selector:

    [runners.kubernetes.node_tolerations]
      "ci-only=true" = "NoSchedule"

- Insufficient cpu/memory: lower cpu_request/memory_request, or scale the node pool so requests fit a node.
- Quota/LimitRange rejection: raise the namespace ResourceQuota, or set cpu_limit/memory_limit that comply with the LimitRange.
- Autoscaler lag / large images: raise poll_timeout so a cold node has time to join and pull:

    [runners.kubernetes]
      poll_timeout = 600  # seconds; default is often too short for cold pulls

- Restart the runner (sudo gitlab-runner restart) after editing config.toml, then retry the job.
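Before retrying a job after the image pull fix, it can help to confirm that the secret exists in the CI namespace and has the registry type. The helper below is a sketch under assumptions: check_pull_secret is a hypothetical name, and it only accepts the kubernetes.io/dockerconfigjson type that kubectl create secret docker-registry produces, so a legacy-format secret would be reported as a mismatch.

```shell
# check_pull_secret NAME NAMESPACE
# Succeeds if the secret exists and has the docker-registry secret type.
check_pull_secret() {
  local name=$1 namespace=$2 secret_type
  if ! secret_type=$(kubectl get secret "$name" -n "$namespace" -o jsonpath='{.type}' 2>/dev/null); then
    echo "secret ${name} not found in namespace ${namespace}"
    return 1
  fi
  if [ "$secret_type" != "kubernetes.io/dockerconfigjson" ]; then
    echo "secret ${name} has type ${secret_type}, expected kubernetes.io/dockerconfigjson"
    return 1
  fi
  echo "secret ${name} looks usable as an image pull secret"
}
```

With the names used in this guide, the call would be check_pull_secret ci-pull gitlab-ci. It does not verify that the stored credentials are valid, only that the runner has something of the right shape to reference.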
Prevention and Best Practices
- Pre-pull common images onto nodes (or run a registry pull-through cache) so cold pulls do not race poll_timeout.
- Always configure image_pull_secrets for private registries, and authenticate to public registries to dodge anonymous rate limits.
- Set realistic cpu_request/memory_request that fit your smallest node, and keep them under the namespace LimitRange ceiling.
- Reserve a node pool for CI with a taint and matching tolerations, so build pods schedule predictably and do not starve workloads.
- Tune poll_timeout to match your worst-case (autoscaler spin-up + largest image pull), not the happy path.
- Mirror the helper image in air-gapped clusters and pin its version in config.toml.
- For triage, the free incident assistant can turn a pod-timeout log plus kubectl describe output into a likely cause. More patterns live in the GitLab CI/CD guides.
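The worst-case sizing of poll_timeout mentioned above is simple arithmetic: autoscaler spin-up plus the largest image pull, plus some headroom. A minimal sketch follows, where the function name suggest_poll_timeout and the 25% default margin are assumptions chosen for illustration, not GitLab recommendations.

```shell
# suggest_poll_timeout AUTOSCALER_SECONDS PULL_SECONDS [MARGIN_PERCENT]
# Prints (autoscaler + pull) increased by the margin, in whole seconds.
suggest_poll_timeout() {
  local autoscaler_s=$1 pull_s=$2 margin_pct=${3:-25}
  echo $(( (autoscaler_s + pull_s) * (100 + margin_pct) / 100 ))
}
```

For instance, with a measured 240 s node spin-up and a 180 s cold pull, suggest_poll_timeout 240 180 prints 525, which could then be rounded up to a value such as 600 in config.toml. The inputs should come from your own cluster's measurements rather than guesses.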
Related Errors
- GitLab CI Error: prepare environment exit status 1 — the shell-executor prepare-stage failure.
- GitLab CI Error: stuck runners tag mismatch — job never assigned to a runner.
- GitLab CI Error: Cannot connect to the Docker daemon (dind) — Docker-in-Docker failures inside a running pod.
Frequently Asked Questions
Where is the real reason for the timeout?
Not in the GitLab job log, which only says “timed out.” Run kubectl describe pod <build-pod> -n <ci-namespace> and read the Events section. It will show ImagePullBackOff, 0/N nodes are available, Insufficient cpu, or a quota rejection.

How do I find the build pod for a failing job?
Its name is runner-<runner-id>-project-<CI_PROJECT_ID>-concurrent-<slot>. List pods in your CI namespace with kubectl get pods -n gitlab-ci and match the project ID from the job.

My image is private. What do I configure?
Create a docker-registry secret in the CI namespace and reference it via image_pull_secrets under [runners.kubernetes] in config.toml. Without it the pod sits in ImagePullBackOff until poll_timeout fires.

Should I just raise poll_timeout to fix every timeout?
Only when the cause is genuinely slow (cold autoscaled node, large image pull). If the pod is Pending due to taints, quotas, or a missing pull secret, a higher timeout just delays the same failure; fix the scheduling/pull issue instead.

Why does the helper container matter?
The pod includes a GitLab helper container alongside your build container. If the helper image cannot be pulled (common in air-gapped clusters), the pod never reaches Running even when your build image is fine. Mirror and pin helper_image.