GitLab Runner Troubleshooting Prompt
Diagnose GitLab Runner failures — runner offline, executor errors, Docker-in-Docker issues, autoscaler problems, slow job pickup, and resource exhaustion.
- Target user: DevOps engineers operating GitLab Runners (self-hosted or SaaS)
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior DevOps engineer with deep experience operating GitLab Runners across executors (shell, Docker, docker-machine autoscaler, Kubernetes, instance autoscaler) in production at scale.

I will provide:
- The symptom (runner offline, "no runners assigned", `ERROR: Job failed (system failure)`, slow job pickup, DinD failures, OOM in jobs)
- Runner type and executor: `gitlab-runner --version` and the `[[runners]]` block from `/etc/gitlab-runner/config.toml`
- Recent runner logs: `journalctl -u gitlab-runner --since "1 hour ago" --no-pager -n 200`
- The failing job's metadata from GitLab UI (project, pipeline ID, job ID, runner tags)
- GitLab server version (`/help` page or admin area)
- For autoscaling: scaler config (docker-machine, instance autoscaler/fleeting plugin, Kubernetes executor `[runners.kubernetes]`)
- For DinD: the `.gitlab-ci.yml` services block + the docker version pinning

Your job:

1. **Classify the failure**:
   - **Runner offline / not registered**: agent process dead, registration token wrong, network to GitLab broken
   - **Runner online, jobs not picked up**: tag mismatch, project not enabled, protected-branch + non-protected runner, instance/group runner config
   - **Job starts, fails with "system failure"**: executor-level error (image pull, cache mount, network in container, DinD privileged mode)
   - **Job runs, slow throughput**: autoscaler underprovisioned, network to artifacts/registry slow, large clones
   - **Resource exhaustion**: host OOM, disk full from leftover caches/builds, autoscaler scale-out cap hit
   - **DinD-specific**: `--privileged` not set, TLS not configured (or wrong cert), nested-container performance
2. **Walk the runner pickup chain**:
   - Runner registered? `gitlab-runner list` shows the runners in `config.toml`; `gitlab-runner verify` checks that each one can still authenticate with GitLab (adding `--delete` removes the ones that fail, so treat it as destructive)
   - Tags on job vs tags on runner: strict tag matching required unless `run_untagged` is set
   - Protected branch + protected runner mismatch: protected jobs only run on protected runners
   - Project / Group / Instance scope: group runners only serve their group's projects
3. **For executor-specific issues**:
   - **Docker executor**: check `--privileged`, `volumes`, image pull, host resource limits
   - **Kubernetes executor**: pod stuck pending → namespace quota, PVC class, image pull secrets. Pod runs but job fails → check the `helper` container logs separately
   - **Shell executor**: shell user permissions, `PATH`, leftover state between jobs (no isolation)
   - **Docker Machine autoscaler (legacy docker-machine, deprecated 2024)**: machine driver, idle timeout, max machines
   - **Instance autoscaler (fleeting)**: provider plugin (AWS/GCP/Azure), IAM permissions, instance type quota
4. **For DinD failures**:
   - `Cannot connect to the Docker daemon` → `services: docker:dind` not started, or `DOCKER_HOST` wrong
   - Modern DinD requires either `DOCKER_TLS_CERTDIR: ""` (disable TLS) or proper cert directory mount
   - Docker `:latest` for DinD is unsafe: pin to a specific version matching your runner
5. **For autoscaler problems**:
   - Provider IAM not allowing instance create
   - Quota exhausted in the provider
   - Idle timeout too high (cost) or too low (constant churn)
   - Capacity not matching peak load
6. **Recommend the diagnostic next step** with the exact command, host, and expected output.
7. Mark every DESTRUCTIVE action: restarting `gitlab-runner` mid-job (kills jobs), `gitlab-runner verify --delete` (removes registrations), `docker system prune -a` (deletes images other jobs may need).

---

Symptom: [DESCRIBE]
Runner executor + version: [e.g., docker on Ubuntu, gitlab-runner 17.0.0]
GitLab server version: [e.g., 17.2.1 self-managed / GitLab.com]

`config.toml` (sanitized, tokens stripped):
```toml
[PASTE relevant [[runners]] block]
```

Runner logs:
```
[PASTE]
```

Failing job log (last 100 lines):
```
[PASTE]
```

.gitlab-ci.yml (relevant job + services):
```yaml
[PASTE]
```
Why this prompt works
GitLab Runner failures span the runner agent, the executor (Docker/K8s/shell), the autoscaler (if any), the host OS, and the network to GitLab. A “system failure” message tells you nothing about which layer broke. This prompt forces an executor-aware diagnosis.
How to use it
- Always include the executor type. Docker, Kubernetes, shell, instance-autoscaler each have entirely different failure modes.
- Strip tokens from `config.toml` before pasting: runner tokens are credentials.
- Provide both runner logs and the failing job log. They live in different places and tell different parts of the story.
- Mention runner scope (instance, group, project) — pickup issues are usually scope/tag mismatches.
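The token-stripping step above can be done mechanically rather than by hand. The following is a minimal sketch, not an official tool: the function name `sanitize_runner_config` is hypothetical, and it assumes secrets appear as `key = value` lines whose key contains `token`, `password`, `secret`, or `key`. Credentials embedded elsewhere (for example inside an `environment = [...]` array) are not caught, so still read the output before sharing it.

```shell
# Redact credential-looking values from a runner config read on stdin.
# Assumption: secrets live on `key = value` lines whose key name contains
# "token", "password", "secret", or "key"; extend the pattern if your
# config stores credentials under other key names.
sanitize_runner_config() {
  sed -E 's/^([[:space:]]*[A-Za-z0-9_]*([Tt]oken|[Pp]assword|[Ss]ecret|[Kk]ey)[A-Za-z0-9_]*[[:space:]]*=[[:space:]]*).*/\1"[REDACTED]"/'
}

# Usage:
#   sudo cat /etc/gitlab-runner/config.toml | sanitize_runner_config
```

The pattern deliberately over-matches (a timestamp field with `token` in its name is redacted too), on the view that losing a harmless value is cheaper than leaking a credential.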
Useful commands
# Runner side
gitlab-runner --version
gitlab-runner list # runners registered in config.toml
gitlab-runner verify # check each registration can still authenticate with GitLab
sudo systemctl status gitlab-runner
sudo journalctl -u gitlab-runner --since "1 hour ago" -n 200 --no-pager
sudo cat /etc/gitlab-runner/config.toml # sanitize before sharing
# List active processes managed by runner
ps -ef | grep gitlab-runner
ps -ef | grep docker-machine # legacy autoscaler
# For Docker executor — see what's been spawned
docker ps -a | grep runner-
docker images | grep gitlab/gitlab-runner-helper
# For Kubernetes executor — see runner pods
kubectl get pods -n <runner-ns> -l app=gitlab-runner
kubectl logs -n <runner-ns> <runner-pod>
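# Per-job pods (created and torn down by the K8s executor).
# The label key below may differ by runner version and pod_labels config;
# confirm it with: kubectl get pods -n <jobs-ns> --show-labels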
kubectl get pods -n <jobs-ns> -l "ci.gitlab.com/job-id"
kubectl describe pod -n <jobs-ns> <job-pod>
# Disk / image cleanup
docker system df
sudo du -shx /var/lib/docker/* | sort -h | tail
sudo find /home/gitlab-runner/builds -mtime +7 -type d # old build directories
# GitLab side (admin)
# Admin → CI/CD → Runners — check the runner's last contact, status, tags
# Admin → Jobs → look at the specific job's runner assignment
Differential cheatsheet
| Symptom | Likely cause | First check |
|---|---|---|
| Runner shows “offline” in GitLab UI | Agent process dead or network to GitLab broken | systemctl status gitlab-runner; egress to gitlab |
| Runner online, jobs not picked up | Tag mismatch OR scope mismatch | Job tags vs runner tags; protected/non-protected; runner’s project/group enablement |
| `ERROR: Job failed (system failure)` | Executor couldn’t start the job environment | Runner logs around job ID; image pull, network, privileged flag |
| `Cannot connect to the Docker daemon` in job | DinD service not started or `DOCKER_HOST` wrong | `services:` in `.gitlab-ci.yml`; `DOCKER_HOST=tcp://docker:2375` or `tcp://docker:2376` with TLS |
| Slow job startup (5+ min before first script line) | Image pull, autoscaler boot, large git clone | Runner pull cache; autoscaler IdleTime; GIT_DEPTH |
| Repeated “no space left on device” | Runner host disk full from caches/images | docker system df; clean policy needed |
| K8s executor pod stuck Pending | Namespace quota, PVC class, image pull secret | kubectl describe pod |
| Autoscaler not scaling up | Provider IAM, quota, capacity error | Runner log around scale event; provider audit logs |
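For the "no space left on device" row, a small check can run before jobs start failing. This is a rough sketch under stated assumptions: the function name `disk_usage_check`, the default path `/var/lib/docker`, and the 85% default threshold are all illustrative choices, not values from GitLab or Docker documentation.

```shell
# Print the used-space percentage of the filesystem holding a path and
# return non-zero when usage is at or above a threshold.
# Args: $1 = path (assumed default: /var/lib/docker)
#       $2 = threshold in percent (assumed default: 85)
disk_usage_check() {
  path=${1:-/var/lib/docker}
  limit=${2:-85}
  # df -P gives one POSIX-format line per filesystem; column 5 is "NN%".
  used=$(df -P "$path" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
  if [ -z "$used" ]; then
    echo "could not read usage for $path" >&2
    return 2
  fi
  echo "$path: ${used}% used (limit ${limit}%)"
  [ "$used" -lt "$limit" ]
}

# Usage:
#   disk_usage_check /var/lib/docker 85 || docker system df
```

Pairing the check with `docker system df` only reports where the space went; any cleanup remains a separate, deliberate step.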
Common findings this catches
- Runner registered to wrong project: token was a project token but should have been group/instance. Re-register with right scope.
- Job tag `gpu` mismatches runner tag `gpus`: strict equality. Edit one.
- Protected branch + non-protected runner: protected job won’t pick up. Mark runner protected (after auditing).
- Docker autoscaler still using legacy `docker-machine`: deprecated 2024; migrate to instance autoscaler with `fleeting` plugin.
- DinD without `DOCKER_TLS_CERTDIR: ""` or volume mount: silent connection failures. Pick one: TLS or no-TLS, configure both sides matching.
- K8s executor SA with broad RBAC: every CI job has cluster-admin. Scope to a namespace; provide an explicit Role.
- `docker:latest` as DinD service: pinning required for reproducibility and known-good behavior.
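The `gpu` vs `gpus` finding is easy to miss by eye. The sketch below encodes the matching rule as described in this document: a runner must carry every tag the job requests, and an untagged job needs `run_untagged` enabled. The function name `runner_can_pick_up` is hypothetical, and tags are assumed to be comma-separated with no surrounding spaces.

```shell
# Return 0 when a runner with the given tags could pick up the job.
# Args: $1 = job tags, comma-separated (may be empty)
#       $2 = runner tags, comma-separated
#       $3 = "true" if the runner has run_untagged enabled (default "false")
runner_can_pick_up() {
  job_tags=$1
  runner_tags=$2
  run_untagged=${3:-false}
  # An untagged job is only eligible when the runner accepts untagged jobs.
  if [ -z "$job_tags" ]; then
    [ "$run_untagged" = "true" ]
    return
  fi
  old_ifs=$IFS
  IFS=,
  for tag in $job_tags; do
    # Wrap in commas so "gpu" does not match inside "gpus".
    case ",$runner_tags," in
      *",$tag,"*) ;;
      *) IFS=$old_ifs; echo "missing runner tag: $tag" >&2; return 1 ;;
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage:
#   runner_can_pick_up "gpu" "gpus,docker" || echo "tag mismatch"
```

This covers tags only; protected status and runner scope still have to be checked separately.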
Hardened DinD service block (modern)
variables:
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"
services:
  - name: docker:26.1.4-dind
    alias: docker
Or, for a simpler setup (TLS off; only acceptable on a trusted runner):
variables:
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""
services:
  - name: docker:26.1.4-dind
    alias: docker
    command: ["--tls=false"]
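Both variants above depend on the client settings agreeing with each other: port 2376 goes with a non-empty `DOCKER_TLS_CERTDIR`, port 2375 with an empty one. A job could check that pairing up front in `before_script`. This is a minimal sketch; the function name `check_dind_env` is hypothetical, and it only inspects the two values passed in, not the daemon itself.

```shell
# Check that DOCKER_HOST and DOCKER_TLS_CERTDIR describe the same mode.
# Args: $1 = DOCKER_HOST value, $2 = DOCKER_TLS_CERTDIR value (may be empty)
# Returns 0 when consistent, 1 on a mismatch, 2 on an unrecognized port.
check_dind_env() {
  host=$1
  certdir=$2
  port=${host##*:}   # text after the last colon, e.g. "2376"
  case "$port" in
    2376)
      if [ -n "$certdir" ]; then echo "ok: TLS"; return 0; fi
      echo "mismatch: port 2376 expects TLS, but DOCKER_TLS_CERTDIR is empty"
      return 1 ;;
    2375)
      if [ -z "$certdir" ]; then echo "ok: no TLS"; return 0; fi
      echo "mismatch: port 2375 is plaintext, but DOCKER_TLS_CERTDIR is set"
      return 1 ;;
    *)
      echo "unknown: unexpected port '$port' in DOCKER_HOST"
      return 2 ;;
  esac
}

# Usage inside a job:
#   check_dind_env "$DOCKER_HOST" "$DOCKER_TLS_CERTDIR" || exit 1
```

A passing check does not prove the daemon is reachable; it only rules out the mixed TLS and no-TLS configuration.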
When to escalate
- Runner host hardware failure (disk errors in dmesg) — replace; don’t fight it.
- Suspected leaked runner token (sudden unknown jobs running) — rotate immediately, audit logs.
- GitLab.com SaaS runner issues — open a GitLab support ticket; you can’t fix shared runners yourself.
- Autoscaler provider API regression: coordinate with cloud provider and the GitLab `fleeting` plugin maintainers.
Related prompts
- GitLab CI/CD Variables Debugging Prompt: Diagnose why a GitLab CI/CD variable is missing, masked oddly, expanded wrong, or scoped to the wrong environment (protected, masked, file-type, inheritance, environment scope).
- GitLab CI/CD Debugging Prompt: Diagnose failing GitLab CI/CD pipelines from job logs, .gitlab-ci.yml, and runner configuration.
- GitLab CI/CD Pipeline Optimization Prompt: Speed up slow GitLab pipelines with DAG `needs:`, cache vs artifacts, parallel jobs, image pre-builds, dependency proxy, and shallow clones.