GitLab Runner Troubleshooting Prompt

You are a senior DevOps engineer with deep experience operating GitLab Runners across executors (shell, Docker, Docker Machine autoscaler, Kubernetes, instance autoscaler) in production at scale.

I will provide:
- The symptom (runner offline, "no runners assigned", `ERROR: Job failed (system failure)`, slow job pickup, DinD failures, OOM in jobs)
- Runner type and executor: `gitlab-runner --version` and the `[[runners]]` block from `/etc/gitlab-runner/config.toml`
- Recent runner logs: `journalctl -u gitlab-runner --since "1 hour ago" --no-pager -n 200`
- The failing job's metadata from GitLab UI (project, pipeline ID, job ID, runner tags)
- GitLab server version (`/help` page or admin area)
- For autoscaling: scaler config (docker-machine, instance autoscaler/fleeting plugin, Kubernetes executor `[runners.kubernetes]`)
- For DinD: the `.gitlab-ci.yml` `services` block and the pinned Docker version

Your job:

1. **Classify the failure**:
   - **Runner offline / not registered**: agent process dead, registration token wrong, network to GitLab broken
   - **Runner online, jobs not picked up**: tag mismatch, project not enabled, protected runner + unprotected branch, instance/group runner config
   - **Job starts, fails with "system failure"**: executor-level error such as image pull, cache mount, network in container, DinD privileged mode
   - **Job runs, slow throughput**: autoscaler underprovisioned, network to artifacts/registry slow, large clones
   - **Resource exhaustion**: host OOM, disk full from leftover caches/builds, autoscaler scale-out cap hit
   - **DinD-specific**: privileged mode not enabled (`privileged = true` under `[runners.docker]`), TLS not configured (or wrong cert), nested-container performance

2. **Walk the runner pickup chain**:
   - Runner registered? `gitlab-runner list` shows the runners in `config.toml`; `gitlab-runner verify` checks that each can still authenticate with GitLab (do not add `--delete` while diagnosing)
   - Tags on job vs tags on runner: every tag on the job must be present on the runner, and untagged jobs run only where `run_untagged` is enabled
   - Protected runner vs unprotected ref: a protected runner only picks up jobs from protected branches and tags
   - Project / Group / Instance scope: group runners only serve their group's projects

3. **For executor-specific issues**:
   - **Docker executor**: check `privileged`, `volumes`, image pull, host resource limits
   - **Kubernetes executor**: pod stuck pending → namespace quota, PVC class, image pull secrets. Pod runs but job fails → check the `helper` container logs separately
   - **Shell executor**: shell user permissions, `PATH`, leftover state between jobs (no isolation)
   - **Docker Machine autoscaler (legacy `docker+machine` executor, deprecated in 2024)**: machine driver, idle timeout, max machines
   - **Instance autoscaler (fleeting)**: provider plugin (AWS/GCP/Azure), IAM permissions, instance type quota

4. **For DinD failures**:
   - `Cannot connect to the Docker daemon` → `services: docker:dind` not started, or `DOCKER_HOST` wrong
   - Modern DinD requires either `DOCKER_TLS_CERTDIR: ""` (disable TLS) or proper cert directory mount
   - Docker `:latest` for DinD is unsafe: pin to a specific version matching your runner

5. **For autoscaler problems**:
   - Provider IAM not allowing instance create
   - Quota exhausted in the provider
   - Idle timeout too high (cost) or too low (constant churn)
   - Capacity not matching peak load

6. **Recommend the diagnostic next step** with the exact command, host, and expected output.

7. Mark every DESTRUCTIVE action: restarting `gitlab-runner` mid-job (kills jobs), `gitlab-runner verify --delete` (removes runners that fail verification from `config.toml`), `docker system prune -a` (deletes images other jobs may need).

---

Symptom: [DESCRIBE]
Runner executor + version: [e.g., docker on Ubuntu, gitlab-runner 17.0.0]
GitLab server version: [e.g., 17.2.1 self-managed / GitLab.com]

`config.toml` (sanitized: strip tokens):
```toml
[PASTE relevant [[runners]] block]
```

Runner logs:
```
[PASTE]
```

Failing job log (last 100 lines):
```
[PASTE]
```

.gitlab-ci.yml (relevant job + services):
```yaml
[PASTE]
```

Why this prompt works

GitLab Runner failures span the runner agent, the executor (Docker/K8s/shell), the autoscaler (if any), the host OS, and the network to GitLab. A “system failure” message tells you nothing about which layer broke. This prompt forces an executor-aware diagnosis.

How to use it

Always include the executor type. The Docker, Kubernetes, shell, and instance-autoscaler executors each have entirely different failure modes.
Strip tokens from config.toml before pasting — runner tokens are credentials.
Provide both runner logs and the failing job log. They live in different places and tell different parts of the story.
Mention runner scope (instance, group, project) — pickup issues are usually scope/tag mismatches.
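
Stripping tokens by hand is easy to get wrong, so a small script can help. The following is a minimal Python sketch, not an official tool: the function name `sanitize_config` and the list of secret-looking key names are assumptions made for illustration. It only handles simple `key = value` lines, so secrets embedded in arrays such as `environment = [...]` still need a manual check.

```python
import re

# Key names treated as secrets. This pattern is an assumption, not an
# official list; extend it to cover whatever your own config contains.
SENSITIVE_KEY = re.compile(r"token|secret|password|accesskey", re.IGNORECASE)

# Matches simple `key = value` lines. Section headers such as [[runners]]
# do not match because they start with a bracket.
ASSIGNMENT = re.compile(r"^(\s*)([A-Za-z0-9_-]+)(\s*=\s*)(.+)$")


def sanitize_config(text: str) -> str:
    """Return the config text with values of secret-looking keys replaced."""
    cleaned = []
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match and SENSITIVE_KEY.search(match.group(2)):
            indent, key, equals, _value = match.groups()
            line = f'{indent}{key}{equals}"[REDACTED]"'
        cleaned.append(line)
    return "\n".join(cleaned)


# Hypothetical sample; the token below is not a real credential.
sample = '''[[runners]]
  name = "docker-1"
  url = "https://gitlab.example.com"
  token = "glrt-not-a-real-token"
  executor = "docker"
'''
print(sanitize_config(sample))
```

The sketch deliberately over-redacts: any key whose name contains "token" is masked, even if its value is harmless, which is the safer failure mode when pasting config into a chat.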

Useful commands

# Runner side
gitlab-runner --version
sudo gitlab-runner list                    # runners configured in config.toml
sudo gitlab-runner verify                  # check each runner can still authenticate with GitLab
sudo systemctl status gitlab-runner
sudo journalctl -u gitlab-runner --since "1 hour ago" -n 200 --no-pager
sudo cat /etc/gitlab-runner/config.toml    # sanitize before sharing

# List active processes managed by runner
ps -ef | grep gitlab-runner
ps -ef | grep docker-machine               # legacy autoscaler

# For Docker executor — see what's been spawned
docker ps -a | grep runner-
docker images | grep gitlab-runner-helper  # matches both Docker Hub and registry.gitlab.com helper images

# For Kubernetes executor — see runner pods
kubectl get pods -n <runner-ns> -l app=gitlab-runner
kubectl logs -n <runner-ns> <runner-pod>
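# Per-job pods (created and torn down by the K8s executor); the selector below assumes that label exists, so adjust it to the labels your runner sets (for example via pod_labels):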
kubectl get pods -n <jobs-ns> -l "ci.gitlab.com/job-id"
kubectl describe pod -n <jobs-ns> <job-pod>

# Disk / image cleanup
docker system df
sudo du -shx /var/lib/docker/* | sort -h | tail
sudo find /home/gitlab-runner/builds -mtime +7 -type d   # build directories older than 7 days

# GitLab side (admin)
# Admin → CI/CD → Runners — check the runner's last contact, status, tags
# Admin → Jobs → look at the specific job's runner assignment

Differential cheatsheet

| Symptom | Likely cause | First check |
| --- | --- | --- |
| Runner shows “offline” in GitLab UI | Agent process dead or network to GitLab broken | `systemctl status gitlab-runner`; egress to GitLab |
| Runner online, jobs not picked up | Tag mismatch or scope mismatch | Job tags vs runner tags; protected runner vs unprotected ref; runner’s project/group enablement |
| `ERROR: Job failed (system failure)` | Executor couldn’t start the job environment | Runner logs around job ID; image pull, network, privileged flag |
| `Cannot connect to the Docker daemon` in job | DinD service not started or DOCKER_HOST wrong | `services:` in .gitlab-ci.yml; `DOCKER_HOST=tcp://docker:2375` or `tcp://docker:2376` with TLS |
| Slow job startup (5+ min before first script line) | Image pull, autoscaler boot, large `git clone` | Runner pull cache; autoscaler IdleTime; GIT_DEPTH |
| Repeated “no space left on device” | Runner host disk full from caches/images | `docker system df`; clean policy needed |
| K8s executor pod stuck Pending | Namespace quota, PVC class, image pull secret | `kubectl describe pod` |
| Autoscaler not scaling up | Provider IAM, quota, capacity error | Runner log around scale event; provider audit logs |
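
The rows that key off a literal log message can be turned into a quick first-pass lookup. This is a rough Python sketch under stated assumptions: the `CHEATSHEET` list and the `first_checks` function are hypothetical names, and the three entries are copied from the cheatsheet rows that quote an error string. It is a triage aid, not a diagnosis.

```python
# (substring to look for, likely cause, first check), taken from the
# cheatsheet rows that quote a literal error message. More specific
# messages come first so they are reported ahead of the generic one.
CHEATSHEET = [
    ("Cannot connect to the Docker daemon",
     "DinD service not started or DOCKER_HOST wrong",
     "services: in .gitlab-ci.yml; DOCKER_HOST port 2375 (no TLS) or 2376 (TLS)"),
    ("no space left on device",
     "Runner host disk full from caches/images",
     "docker system df"),
    ("system failure",
     "Executor couldn't start the job environment",
     "Runner logs around the job ID; image pull, network, privileged flag"),
]


def first_checks(log_text):
    """Return (likely cause, first check) pairs whose message appears in the log."""
    lowered = log_text.lower()
    return [
        (cause, check)
        for needle, cause, check in CHEATSHEET
        if needle.lower() in lowered
    ]


# Hypothetical log excerpt for illustration only.
log = "ERROR: Job failed (system failure): write /builds/x: no space left on device"
for cause, check in first_checks(log):
    print(f"{cause} -> {check}")
```

A log can match several rows at once, as in the example, where the generic "system failure" wraps a more specific disk-full message; the specific row is usually the one to act on first.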

Common findings this catches

Runner registered at the wrong scope: a project-level token was used where a group or instance runner was intended. Re-register at the right scope.
Job tag `gpu` vs runner tag `gpus`: tags are compared as exact strings, so the job never matches. Edit one.
Protected runner + unprotected branch: a protected runner only picks up jobs from protected branches and tags, so jobs on other refs stay pending. Protect the ref, or unprotect the runner (after auditing what it can reach).
Autoscaling still on the legacy Docker Machine executor (`docker+machine`): deprecated in 2024; migrate to the instance autoscaler with a fleeting plugin.
DinD without `DOCKER_TLS_CERTDIR: ""` or a certificate volume mount: connection failures with no obvious cause. Pick one, TLS or no TLS, and configure both sides to match.
K8s executor service account with broad RBAC: every CI job effectively has cluster-admin. Scope it to a namespace and provide an explicit Role.
`docker:latest` as the DinD service: pin a version for reproducibility and known-good behavior.
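
The tag and protection findings come down to a few matching rules, which can be sketched in Python. This is a simplified model of my understanding of GitLab's behavior, not its actual implementation: the `Runner` and `Job` classes and the `pickup_blockers` function are hypothetical, and scope (project/group/instance enablement) is left out. The rules assumed here are that every job tag must be present on the runner, that untagged jobs need a runner that runs untagged jobs, and that a protected runner only takes jobs from protected refs.

```python
from dataclasses import dataclass, field


@dataclass
class Runner:
    tags: set = field(default_factory=set)
    run_untagged: bool = False
    protected_only: bool = False    # runner is marked "protected" in GitLab


@dataclass
class Job:
    tags: set = field(default_factory=set)
    on_protected_ref: bool = False  # pipeline runs on a protected branch or tag


def pickup_blockers(runner: Runner, job: Job) -> list:
    """Return the reasons this runner would not pick up this job (empty if none)."""
    blockers = []
    if job.tags:
        # Tags are compared as exact strings: "gpu" does not match "gpus".
        missing = sorted(job.tags - runner.tags)
        if missing:
            blockers.append("runner lacks job tags: " + ", ".join(missing))
    elif not runner.run_untagged:
        blockers.append("job is untagged but runner does not run untagged jobs")
    if runner.protected_only and not job.on_protected_ref:
        blockers.append("runner is protected but job is not on a protected ref")
    return blockers


print(pickup_blockers(Runner(tags={"gpus"}), Job(tags={"gpu"})))
# ['runner lacks job tags: gpu']
```

Returning a list of reasons rather than a boolean keeps the output useful when a job is blocked for more than one reason at once, which is common after a runner is re-registered with different settings.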

Hardened DinD service block (modern)

variables:
  DOCKER_TLS_CERTDIR: "/certs"
  DOCKER_HOST: tcp://docker:2376
  DOCKER_TLS_VERIFY: 1
  DOCKER_CERT_PATH: "$DOCKER_TLS_CERTDIR/client"

services:
  - name: docker:26.1.4-dind
    alias: docker

Or, for a simpler setup with TLS off (acceptable only on a trusted runner):

variables:
  DOCKER_HOST: tcp://docker:2375
  DOCKER_TLS_CERTDIR: ""

services:
  - name: docker:26.1.4-dind
    alias: docker
    command: ["--tls=false"]
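
A common way to break either variant is to mix them, for example turning TLS off while leaving `DOCKER_HOST` on port 2376. The following Python sketch checks a job's `variables` mapping for that mismatch. The function name `dind_tls_mismatches` is hypothetical, the 2375/2376 port convention is the one used in the two blocks above, and treating an unset `DOCKER_TLS_CERTDIR` as "TLS on" is an assumption based on recent `docker:dind` images defaulting it to `/certs`.

```python
from urllib.parse import urlparse


def dind_tls_mismatches(variables):
    """Return inconsistencies between DOCKER_HOST and the TLS setting."""
    problems = []
    # Assumption: unset means TLS is on (image default of /certs);
    # an empty string turns TLS off.
    tls_enabled = variables.get("DOCKER_TLS_CERTDIR", "/certs") != ""
    host = variables.get("DOCKER_HOST")
    if host:
        parsed = urlparse(host)
        # Only TCP endpoints carry a port; skip unix sockets and the like.
        if parsed.scheme == "tcp":
            expected = 2376 if tls_enabled else 2375
            if parsed.port != expected:
                mode = "on" if tls_enabled else "off"
                problems.append(
                    f"TLS is {mode} but DOCKER_HOST uses port {parsed.port}; "
                    f"expected {expected}"
                )
    return problems


# Hypothetical misconfiguration: TLS disabled, but the TLS port is still set.
print(dind_tls_mismatches({
    "DOCKER_TLS_CERTDIR": "",
    "DOCKER_HOST": "tcp://docker:2376",
}))
```

The check only covers the client-side variables; the runner side (privileged mode and, for the TLS variant, a shared certificate volume) still has to be verified in `config.toml`.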

When to escalate

Runner host hardware failure (disk errors in dmesg) — replace; don’t fight it.
Suspected leaked runner token (sudden unknown jobs running) — rotate immediately, audit logs.
GitLab.com SaaS runner issues — open a GitLab support ticket; you can’t fix shared runners yourself.
Autoscaler provider API regression — coordinate with cloud provider and the GitLab fleeting plugin maintainers.
