GitLab Runner Troubleshooting Prompt

Diagnose GitLab Runner failures — runner offline, executor errors, Docker-in-Docker issues, autoscaler problems, slow job pickup, and resource exhaustion.

Target user: DevOps engineers operating GitLab Runners (self-hosted or SaaS)
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior DevOps engineer with deep experience operating GitLab Runners across executors (shell, Docker, Kubernetes, legacy Docker Machine, and the fleeting-based Docker Autoscaler and Instance executors) in production at scale.

I will provide:
- The symptom (runner offline, "no runners assigned", `ERROR: Job failed (system failure)`, slow job pickup, DinD failures, OOM in jobs)
- Runner type and executor: `gitlab-runner --version` and the `[[runners]]` block from `/etc/gitlab-runner/config.toml`
- Recent runner logs: `journalctl -u gitlab-runner --since "1 hour ago" --no-pager -n 200`
- The failing job's metadata from GitLab UI (project, pipeline ID, job ID, runner tags)
- GitLab server version (`/help` page or admin area)
- For autoscaling: scaler config (docker-machine, instance autoscaler/fleeting plugin, Kubernetes executor `[runners.kubernetes]`)
- For DinD: the `.gitlab-ci.yml` services block + the docker version pinning

Your job:

1. **Classify the failure**:
   - **Runner offline / not registered**: agent process dead, runner token wrong or revoked, network to GitLab broken
   - **Runner online, jobs not picked up**: tag mismatch, project not enabled, protected runner + unprotected branch or tag, instance/group runner config
   - **Job starts, fails with "system failure"**: executor-level error such as image pull, cache mount, network in container, DinD privileged mode
   - **Job runs, slow throughput**: autoscaler underprovisioned, network to artifacts/registry slow, large clones
   - **Resource exhaustion**: host OOM, disk full from leftover caches/builds, autoscaler scale-out cap hit
   - **DinD-specific**: `privileged = true` not set on the runner, TLS not configured (or wrong cert), nested-container performance
2. **Walk the runner pickup chain**:
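   - Runner registered? `gitlab-runner list` shows the runners in `config.toml`; `gitlab-runner verify` checks that each one can still authenticate to GitLab. Leave off `--delete`, which removes registrations that fail the check
   - Tags on job vs tags on runner: the runner must carry every tag the job lists; untagged jobs are picked up only when `run_untagged` is set
   - Protected runner vs unprotected ref: a protected runner only picks up jobs from protected branches or tags, so jobs on other refs stay pending
   - Project / Group / Instance scope: group runners only serve their group's projects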
3. **For executor-specific issues**:
   - **Docker executor**: check `privileged` and `volumes` in `[runners.docker]`, image pull, host resource limits
   - **Kubernetes executor**: pod stuck pending → namespace quota, PVC class, image pull secrets. Pod runs but job fails → check the `helper` container logs separately
   - **Shell executor**: shell user permissions, `PATH`, leftover state between jobs (no isolation)
   - **Docker Machine executor (legacy `docker+machine`, deprecated 2024)**: machine driver, idle timeout, max machines
   - **Docker Autoscaler / Instance executor (fleeting)**: provider plugin (AWS/GCP/Azure), IAM permissions, instance type quota
4. **For DinD failures**:
   - `Cannot connect to the Docker daemon` → `services: docker:dind` not started, or `DOCKER_HOST` wrong
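   - Modern DinD images (Docker 19.03+) generate TLS certificates by default, so set either `DOCKER_TLS_CERTDIR: ""` (disable TLS) or mount the certificate directory so the job can read the client certs
   - `docker:latest` for DinD is unsafe: pin the `docker:dind` service to a specific version and keep the job's `docker` client image on the same version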
5. **For autoscaler problems**:
   - Provider IAM not allowing instance create
   - Quota exhausted in the provider
   - Idle timeout too high (cost) or too low (constant churn)
   - Capacity not matching peak load
6. **Recommend the diagnostic next step** with the exact command, host, and expected output.
7. Mark every DESTRUCTIVE action: restarting `gitlab-runner` mid-job (kills jobs), `gitlab-runner verify --delete` (removes registrations), `docker system prune -a` (deletes images other jobs may need).

---

Symptom: [DESCRIBE]
Runner executor + version: [e.g., docker on Ubuntu, gitlab-runner 17.0.0]
GitLab server version: [e.g., 17.2.1 self-managed / GitLab.com]
`config.toml` (sanitized — strip tokens):
```toml
[PASTE relevant [[runners]] block]
```
Runner logs:
```
[PASTE]
```
Failing job log (last 100 lines):
```
[PASTE]
```
.gitlab-ci.yml (relevant job + services):
```yaml
[PASTE]
```

Why this prompt works

GitLab Runner failures span the runner agent, the executor (Docker/K8s/shell), the autoscaler (if any), the host OS, and the network to GitLab. A “system failure” message tells you nothing about which layer broke. This prompt forces an executor-aware diagnosis.

How to use it

  1. Always include the executor type. Docker, Kubernetes, shell, instance-autoscaler each have entirely different failure modes.
  2. Strip tokens from config.toml before pasting — runner tokens are credentials.
  3. Provide both runner logs and the failing job log. They live in different places and tell different parts of the story.
  4. Mention runner scope (instance, group, project) — pickup issues are usually scope/tag mismatches.
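
For step 2, a small filter can do the redaction before anything reaches the clipboard. This is a minimal sketch, not a complete secret scrubber: `sanitize_config` is a hypothetical helper name, and it assumes secrets sit on one line as `key = "value"` under the keys `token`, `AccessKey`, or `SecretKey`. Values passed through `environment = [...]` or other keys are not covered, so review the output by eye.

```shell
# Redact credential values from a GitLab Runner config.toml read on stdin.
# Assumption: secrets appear as `key = "value"` on a single line under the
# keys listed in the pattern; extend the list to fit your own config.
sanitize_config() {
  sed -E 's/^([[:space:]]*(token|AccessKey|SecretKey)[[:space:]]*=[[:space:]]*)"[^"]*"/\1"[REDACTED]"/'
}
```

Typical use would be `sudo cat /etc/gitlab-runner/config.toml | sanitize_config`. Keys such as `token_obtained_at` are left alone because the pattern requires `=` right after the key name.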

Useful commands

# Runner side
gitlab-runner --version
gitlab-runner list                         # list runners registered in config.toml
gitlab-runner verify                       # check each registration against GitLab (do not add --delete)
sudo systemctl status gitlab-runner
sudo journalctl -u gitlab-runner --since "1 hour ago" -n 200 --no-pager
sudo cat /etc/gitlab-runner/config.toml    # sanitize before sharing

# List active processes managed by runner
ps -ef | grep gitlab-runner
ps -ef | grep docker-machine               # legacy autoscaler

# For Docker executor — see what's been spawned
docker ps -a | grep runner-
docker images | grep gitlab/gitlab-runner-helper

# For Kubernetes executor — see runner pods
kubectl get pods -n <runner-ns> -l app=gitlab-runner
kubectl logs -n <runner-ns> <runner-pod>
# Per-job pods (created and torn down by the K8s executor):
kubectl get pods -n <jobs-ns> --show-labels | grep '^runner-'   # job pod names start with runner-; label keys vary by runner version
kubectl describe pod -n <jobs-ns> <job-pod>

# Disk / image cleanup
docker system df
sudo du -shx /var/lib/docker/* | sort -h | tail
sudo find /home/gitlab-runner/builds -mtime +7 -type d   # build dirs untouched for 7+ days (shell executor)

# GitLab side (admin)
# Admin → CI/CD → Runners — check the runner's last contact, status, tags
# Admin → Jobs → look at the specific job's runner assignment
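
The disk checks above can be turned into a quick pass/fail signal. The sketch below is one way to do that, assuming POSIX `df -P` output (usage in column 5, mount point in column 6) and mount points without spaces. `mounts_over_threshold` is a hypothetical helper name and the 85% default is an arbitrary choice.

```shell
# Print mount points whose usage is at or above a threshold (default 85%).
# Reads `df -P` output on stdin, so it can be tried on sample text as well
# as on a live host.
mounts_over_threshold() {
  local limit="${1:-85}"
  awk -v limit="$limit" '
    NR > 1 {                      # skip the df header line
      use = $5
      sub(/%/, "", use)           # "91%" -> "91"
      if (use + 0 >= limit) print $6 " " $5
    }'
}
```

Usage: `df -P /var/lib/docker /home/gitlab-runner | mounts_over_threshold 85`. Empty output means every listed filesystem is below the limit.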

Differential cheatsheet

| Symptom | Likely cause | First check |
| --- | --- | --- |
| Runner shows “offline” in GitLab UI | Agent process dead or network to GitLab broken | `systemctl status gitlab-runner`; egress to GitLab |
| Runner online, jobs not picked up | Tag mismatch OR scope mismatch | Job tags vs runner tags; protected runner vs unprotected ref; runner’s project/group enablement |
| `ERROR: Job failed (system failure)` | Executor couldn’t start the job environment | Runner logs around job ID; image pull, network, privileged flag |
| `Cannot connect to the Docker daemon` in job | DinD service not started or `DOCKER_HOST` wrong | `services:` in `.gitlab-ci.yml`; `DOCKER_HOST=tcp://docker:2375`, or `tcp://docker:2376` with TLS |
| Slow job startup (5+ min before first script line) | Image pull, autoscaler boot, large git clone | Runner pull cache; autoscaler `IdleTime`; `GIT_DEPTH` |
| Repeated “no space left on device” | Runner host disk full from caches/images | `docker system df`; clean policy needed |
| K8s executor pod stuck Pending | Namespace quota, PVC class, image pull secret | `kubectl describe pod` |
| Autoscaler not scaling up | Provider IAM, quota, capacity error | Runner log around scale event; provider audit logs |
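
Several first checks in the table come down to reading the runner log around one job ID. A rough way to narrow the log is sketched below. It assumes the runner writes job context as a `job=<id>` field on each line, which is worth confirming against your own log format; `job_log_lines` is a hypothetical helper name.

```shell
# Keep only runner log lines that mention a given job ID.
# Assumption: job context is logged as a `job=<id>` field. The pattern
# requires whitespace or a line boundary on both sides, so job=1234 does
# not also match job=12345.
job_log_lines() {
  local job_id="$1"
  grep -E "(^|[[:space:]])job=${job_id}([[:space:]]|\$)"
}
```

Usage: `sudo journalctl -u gitlab-runner --since "1 hour ago" --no-pager | job_log_lines 1234`.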

Common findings this catches

  • Runner registered to wrong project: token was a project token but should have been group/instance. Re-register with the right scope.
  • Job tag `gpu` vs runner tag `gpus`: tags are compared as exact strings, so these do not match. Edit one.
  • Protected runner + unprotected branch: a protected runner skips jobs from unprotected refs, so the job stays pending. Protect the branch or tag, or unprotect the runner (after auditing what it can access).
  • Autoscaling still on the legacy Docker Machine executor: deprecated 2024; migrate to the Docker Autoscaler or Instance executor with a fleeting plugin.
  • DinD with mismatched TLS settings: the `dind` service uses TLS by default, so a job pointed at port 2375, or one without the shared certificate volume, cannot connect. Pick one mode, TLS or no-TLS, and configure both sides to match.
  • K8s executor SA with broad RBAC: every CI job has cluster-admin. Scope to a namespace; provide an explicit Role.
  • `docker:latest` as DinD service: pinning is required for reproducibility and known-good behavior.
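
The tag finding is easy to check mechanically. The sketch below treats pickup as a subset test: every tag on the job must also be on the runner. It is an illustration only. `runner_covers_tags` is a hypothetical helper, both lists are assumed to be comma-separated with no spaces, and the untagged-job case (which depends on `run_untagged`) is not modeled.

```shell
# Succeed (exit 0) when every job tag is also present on the runner.
# Arguments: comma-separated job tags, comma-separated runner tags.
# Comparison is an exact string match per tag.
runner_covers_tags() {
  local job_tags="$1" runner_tags="$2" tag
  local IFS=','
  for tag in $job_tags; do
    case ",${runner_tags}," in
      *",${tag},"*) ;;                                  # tag present on the runner
      *) echo "missing tag: ${tag}"; return 1 ;;
    esac
  done
  return 0
}
```

For example, `runner_covers_tags "gpu" "gpus,linux"` prints `missing tag: gpu` and returns 1, while `runner_covers_tags "docker,gpu" "docker,gpu,linux"` returns 0.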

Hardened DinD service block (modern)

variables:
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"

services:
  - name: docker:26.1.4-dind
    alias: docker
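
With the Docker executor, the TLS variant also depends on the runner side: the `[runners.docker]` section needs `privileged = true` and a `volumes` entry containing `"/certs/client"` so the job container can read the generated client certificates. The check below is a rough sketch of how to confirm that. `check_dind_tls_config` is a hypothetical helper, and because it greps line by line rather than parsing TOML, it cannot tell which `[[runners]]` block a setting belongs to.

```shell
# Report whether a config.toml contains the two Docker executor settings
# that TLS-enabled DinD depends on. Line-oriented grep, not a TOML parser.
check_dind_tls_config() {
  local file="$1" status=0
  if ! grep -Eq '^[[:space:]]*privileged[[:space:]]*=[[:space:]]*true' "$file"; then
    echo "missing: privileged = true"
    status=1
  fi
  if ! grep -Eq '^[[:space:]]*volumes[[:space:]]*=.*"/certs/client"' "$file"; then
    echo "missing: \"/certs/client\" in volumes"
    status=1
  fi
  return "$status"
}
```

Usage: `sudo bash -c "$(declare -f check_dind_tls_config); check_dind_tls_config /etc/gitlab-runner/config.toml"`, or copy the file somewhere readable first and run the function against the copy.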

Or, for a simpler setup (TLS off; acceptable only on a trusted runner):

variables:
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""

services:
  - name: docker:26.1.4-dind
    alias: docker
    command: ["--tls=false"]
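
Even with a correct service block, a job script can reach its first `docker` command before the daemon in the service container has finished starting, which shows up as the same `Cannot connect to the Docker daemon` error. One hedge is to poll before the real work begins. The sketch below is a generic retry helper, not something GitLab provides: `wait_until` is a hypothetical name and the attempt budget is yours to choose.

```shell
# Retry a command once per second until it succeeds or the attempt budget
# runs out. Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  local attempts="$1"; shift
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

In a job's `before_script`, this could be called as `wait_until 30 docker info`, which fails the job after roughly 30 seconds if the daemon never answers.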

When to escalate

  • Runner host hardware failure (disk errors in dmesg) — replace; don’t fight it.
  • Suspected leaked runner token (sudden unknown jobs running) — rotate immediately, audit logs.
  • GitLab.com SaaS runner issues — open a GitLab support ticket; you can’t fix shared runners yourself.
  • Autoscaler provider API regression — coordinate with cloud provider and the GitLab fleeting plugin maintainers.
