AI for GitLab CI/CD · By James Joyner IV · 9 min read

GitLab CI Error Guide: 'waiting for pod running: timed out waiting for pod to start' Kubernetes Executor

Fix GitLab Kubernetes-executor pod timeouts: image pull secrets, unschedulable nodes, taints, namespace quotas, helper image pulls, and a too-low poll_timeout.

  • #gitlab-cicd
  • #troubleshooting
  • #errors
  • #kubernetes-executor

Exact Error Message

A job using the Kubernetes executor fails in the prepare stage with a timeout:

ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start

Two closely related variants point at why the pod never started:

ERROR: Job failed (system failure): prepare environment: image pull failed: failed to pull image "registry.example.com/ci/build:latest": rpc error: code = Unknown desc = failed to pull and unpack image ... 401 Unauthorized
ERROR: Job failed (system failure): prepare environment: waiting for pod running: pod "runner-xyz-project-42-concurrent-0" status is "Pending"

All three mean the same thing at the GitLab level — the build pod never reached Running before poll_timeout elapsed — but the Pending and image pull variants tell you which Kubernetes problem to chase.

What the Error Means

With the Kubernetes executor, GitLab Runner does not run your job on a fixed host. For each job it asks the cluster to create a build pod (build container + helper container + any services:), then polls the Kubernetes API until that pod reports Running. Only then does it stream your script: into the build container.

If the pod is still Pending or ContainerCreating when the runner’s poll_timeout expires, the runner gives up and reports timed out waiting for pod to start as a system failure. The job log is the GitLab side of the story; the real reason lives in the Kubernetes events for that pod — image pull errors, no schedulable node, a LimitRange rejection, or a quota denial. The runner only sees “still not Running,” so you must inspect the cluster to learn why.
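
The wait described above can be sketched as a small shell loop. This is only an illustration of the idea, not the runner's actual implementation: the function name wait_for_running and its interval argument are assumptions made for this example, and the real runner reacts to more pod states than this sketch does.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: poll a phase source until it prints "Running"
# or until the timeout budget (the role poll_timeout plays) is spent.
#
# $1    = timeout in seconds
# $2    = seconds to sleep between polls (hypothetical, must be >= 1)
# $3... = command that prints the pod phase, for example:
#   kubectl get pod -n gitlab-ci "$POD" -o jsonpath='{.status.phase}'
wait_for_running() {
  local timeout="$1"
  local interval="$2"
  shift 2
  local waited=0
  local phase=""
  while [ "$waited" -lt "$timeout" ]; do
    # A failing phase command is treated as "phase unknown", not as fatal.
    phase=$("$@" 2>/dev/null) || phase=""
    if [ "$phase" = "Running" ]; then
      echo "running after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out waiting for pod to start (last phase: ${phase:-unknown})"
  return 1
}
```

Note that the loop only ever learns the phase. Like the runner, it cannot tell you why the pod is still Pending, which is why the cluster events matter.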

This is fundamentally a scheduling and pull problem, not a script problem. Your .gitlab-ci.yml commands never ran.

Common Causes

  • Image cannot be pulled. Wrong image name/tag, a private registry with no imagePullSecrets, or a rate-limited public registry → pod stuck ImagePullBackOff/ErrImagePull.
  • The GitLab helper image fails to pull (air-gapped cluster without helper_image mirrored), so the pod never fully starts.
  • No schedulable node. Cluster is at capacity, or every node has a taint the pod does not tolerate → Pending, event 0/N nodes are available.
  • Resource requests exceed availability. cpu_request/memory_request in config.toml are larger than any node can satisfy → Insufficient cpu/memory.
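  • Namespace ResourceQuota or LimitRange rejection. The pod is denied admission or forced to invalid limits. An outright admission denial usually fails the job quickly with a "forbidden" error at pod creation instead of a timeout, so suspect it when the build pod never appears in kubectl get pods.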
  • CNI / networking not ready on a freshly added node, delaying ContainerCreating.
  • Cluster autoscaler lag. A new node is being provisioned but takes longer than poll_timeout to join and schedule.
  • poll_timeout too low for large images or slow pulls. The default of 180 seconds can be too short for a cold node.
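
As a rough triage aid, the causes above can be mapped from event text to a category. The sketch below is an assumption-laden example: the function name classify_pod_events is made up for this guide, and the grep patterns are based on common event wording that can differ between Kubernetes versions. Treat the result as a hint and still read the full events.

```shell
#!/usr/bin/env bash
# Illustrative sketch: read pod event text on stdin and print a likely
# cause category. Patterns are assumptions based on common event wording.
classify_pod_events() {
  local events
  events=$(cat)
  if printf '%s\n' "$events" | grep -qE 'ImagePullBackOff|ErrImagePull|Failed to pull image'; then
    echo "image-pull"
  elif printf '%s\n' "$events" | grep -qE 'untolerated taint|had taint'; then
    echo "taint"
  elif printf '%s\n' "$events" | grep -qE 'Insufficient (cpu|memory)'; then
    echo "insufficient-resources"
  elif printf '%s\n' "$events" | grep -qE 'exceeded quota|LimitRange'; then
    echo "quota-or-limitrange"
  elif printf '%s\n' "$events" | grep -qE '[0-9]+/[0-9]+ nodes are available'; then
    echo "unschedulable"
  else
    echo "unknown"
  fi
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe pod -n gitlab-ci <build-pod> | classify_pod_events
```

The checks run in a fixed order, so an event list that mentions both a taint and insufficient CPU reports only the first match.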

How to Reproduce the Error

Point a job at a private image without supplying pull credentials, on a runner using the Kubernetes executor:

# .gitlab-ci.yml
build:
  tags: [k8s]
  image: registry.example.com/private/build:latest   # no pull secret configured
  script:
    - make build

# config.toml: Kubernetes executor, no image_pull_secrets, default 180s timeout
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-ci"
    poll_timeout = 180
    image = "alpine:3.20"

The pod is created but stays Pending/ImagePullBackOff; after 180s the job fails with timed out waiting for pod to start.

Diagnostic Commands

Watch the build pod and, crucially, its events — that is where Kubernetes records the real reason:

# Find the build pod (named runner-<id>-project-<n>-concurrent-<m>)
kubectl get pods -n gitlab-ci

# The single most useful command — events explain Pending / ImagePull / quota
kubectl describe pod -n gitlab-ci runner-xyz-project-42-concurrent-0

# Recent events in the CI namespace, sorted by time
kubectl get events -n gitlab-ci --sort-by=.lastTimestamp | tail -30

# Per-container status (which container is stuck)
kubectl get pod -n gitlab-ci runner-xyz-project-42-concurrent-0 \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'

# Node schedulability, taints, and free capacity
kubectl get nodes
kubectl describe node <node> | grep -A5 -E 'Taints|Allocatable|Allocated resources'

# Namespace quota / limit range that may be rejecting the pod
kubectl get resourcequota,limitrange -n gitlab-ci -o yaml

# Runner-side config and timeout
sudo grep -A20 '\[runners.kubernetes\]' /etc/gitlab-runner/config.toml

For deeper GitLab-side tracing, inspect gitlab-runner --debug run if you control the runner host. Setting variables: { CI_DEBUG_TRACE: "true" } in the job helps less here, because it traces script execution and the script never starts. The pod name embeds CI_PROJECT_ID and the concurrent slot, so kubectl get pods correlates directly to the failing job.
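
To narrow a busy namespace down to one project's build pods, you can filter on the naming pattern described above. This is a minimal sketch: pods_for_project is a hypothetical helper name, and it assumes pod names follow the runner-<id>-project-<n>-concurrent-<m> pattern, possibly with extra characters after the slot number.

```shell
#!/usr/bin/env bash
# Illustrative sketch: read pod names (one per line) on stdin and print
# only those belonging to the given numeric project ID.
pods_for_project() {
  case "$1" in
    ''|*[!0-9]*)
      echo "usage: pods_for_project <numeric project id>" >&2
      return 2
      ;;
  esac
  # The trailing "-concurrent-" keeps project 42 from matching project 421.
  grep -E "^runner-.+-project-$1-concurrent-[0-9]+"
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pods -n gitlab-ci -o name | sed 's|^pod/||' | pods_for_project 42
```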

Step-by-Step Resolution

  1. Run kubectl describe pod on the stuck pod and read the Events. The fix depends entirely on what they say.

  2. Image pull (ImagePullBackOff/401): create a registry secret and wire it into the runner:

    kubectl create secret docker-registry ci-pull \
      --docker-server=registry.example.com \
      --docker-username="$REG_USER" --docker-password="$REG_PASS" \
      -n gitlab-ci

    # config.toml
    [runners.kubernetes]
      image_pull_secrets = ["ci-pull"]

  3. Helper image (air-gapped): mirror it and set helper_image explicitly in config.toml.

  4. Pending / 0/N nodes available: add capacity, or add a toleration / fix the node selector:

    [runners.kubernetes.node_tolerations]
      "ci-only=true" = "NoSchedule"

  5. Insufficient cpu/memory: lower cpu_request/memory_request, or scale the node pool so requests fit a node.

  6. Quota/LimitRange rejection: raise the namespace ResourceQuota, or set cpu_limit/memory_limit that comply with the LimitRange.

  7. Autoscaler lag / large images: raise poll_timeout so a cold node has time to join and pull:

    [runners.kubernetes]
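      poll_timeout = 600   # seconds; the 180s default is often too short for cold pulls

  8. Retry the job after editing config.toml. GitLab Runner normally reloads config.toml on its own within a few seconds; if the change does not take effect, restart it (sudo gitlab-runner restart) and retry.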
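
For step 7, one way to pick a poll_timeout value is to add up the worst-case delays and pad the total. The sketch below does only that arithmetic. The helper name suggest_poll_timeout and the 25 percent default margin are assumptions chosen for this example, not a GitLab recommendation; measure your own node join and image pull times.

```shell
#!/usr/bin/env bash
# Illustrative sketch: suggest a poll_timeout in seconds.
# $1 = worst-case seconds for an autoscaled node to join and become Ready
# $2 = worst-case seconds to pull the largest image on a cold node
# $3 = safety margin in percent (hypothetical default: 25)
suggest_poll_timeout() {
  local node_join="$1"
  local image_pull="$2"
  local margin_pct="${3:-25}"
  local base=$((node_join + image_pull))
  echo $((base + base * margin_pct / 100))
}

# Example: 300s node join + 120s image pull + 25% margin
#   suggest_poll_timeout 300 120    # prints 525
```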

Prevention and Best Practices

  • Pre-pull common images onto nodes (or run a registry pull-through cache) so cold pulls do not race poll_timeout.
  • Always configure image_pull_secrets for private registries, and authenticate to public registries to dodge anonymous rate limits.
  • Set realistic cpu_request/memory_request that fit your smallest node, and keep them under the namespace LimitRange ceiling.
  • Reserve a node pool for CI with a taint and matching tolerations, so build pods schedule predictably and do not starve other workloads.
  • Tune poll_timeout to match your worst-case (autoscaler spin-up + largest image pull), not the happy path.
  • Mirror the helper image in air-gapped clusters and pin its version in config.toml.
  • For triage, the free incident assistant can turn a pod-timeout log plus kubectl describe output into a likely cause. More patterns live in the GitLab CI/CD guides.
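
To sanity-check that a cpu_request can fit your smallest node, you can compare Kubernetes CPU quantities after converting them to millicores. This is a minimal sketch with hypothetical helper names (cpu_to_millicores, request_fits). It compares against a node's Allocatable value only, so it ignores CPU already requested by other pods on that node.

```shell
#!/usr/bin/env bash
# Illustrative sketch: convert a Kubernetes CPU quantity to millicores.
# Handles "500m", whole cores such as "2", and decimals such as "1.5".
cpu_to_millicores() {
  case "$1" in
    *m)  echo "${1%m}" ;;
    *.*) awk -v c="$1" 'BEGIN { printf "%d\n", c * 1000 + 0.5 }' ;;
    *)   echo $(( $1 * 1000 )) ;;
  esac
}

# Succeeds when request $1 is no larger than node allocatable CPU $2.
request_fits() {
  [ "$(cpu_to_millicores "$1")" -le "$(cpu_to_millicores "$2")" ]
}

# Example: a 4-core request does not fit a node with 3920m allocatable.
#   request_fits 4 3920m || echo "cpu_request too large for this node"
```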

Frequently Asked Questions

Where is the real reason for the timeout? Not in the GitLab job log — that only says “timed out.” Run kubectl describe pod <build-pod> -n <ci-namespace> and read the Events section. It will show ImagePullBackOff, 0/N nodes are available, Insufficient cpu, or a quota rejection.

How do I find the build pod for a failing job? Its name is runner-<runner-id>-project-<CI_PROJECT_ID>-concurrent-<slot>. List pods in your CI namespace with kubectl get pods -n gitlab-ci and match the project ID from the job.

My image is private — what do I configure? Create a docker-registry secret in the CI namespace and reference it via image_pull_secrets under [runners.kubernetes] in config.toml. Without it the pod sits in ImagePullBackOff until poll_timeout fires.
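
A quick way to confirm that an existing pull secret actually covers your registry is to decode it and look for the registry host. The sketch below is a crude text match, not a JSON parser, and secret_covers_registry is a hypothetical helper name. The secret name ci-pull and the host registry.example.com are the example values used earlier in this guide.

```shell
#!/usr/bin/env bash
# Illustrative sketch: read a decoded .dockerconfigjson on stdin and
# succeed if the given registry host appears as a quoted string.
# This is a plain text match, so treat a hit as a hint, not proof.
secret_covers_registry() {
  grep -qF "\"$1\""
}

# Usage against a live cluster (requires kubectl access and GNU base64):
#   kubectl get secret ci-pull -n gitlab-ci \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d \
#     | secret_covers_registry registry.example.com && echo "registry present"
```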

Should I just raise poll_timeout to fix every timeout? Only when the cause is genuinely slow (cold autoscaled node, large image pull). If the pod is Pending due to taints, quotas, or a missing pull secret, a higher timeout just delays the same failure — fix the scheduling/pull issue instead.

Why does the helper container matter? The pod includes a GitLab helper container alongside your build container. If the helper image cannot be pulled — common in air-gapped clusters — the pod never reaches Running even when your build image is fine. Mirror and pin helper_image.
