GitLab CI Error Guide: 'This job is stuck because of a runner system failure'
Fix GitLab CI's 'job is stuck ... runner system failure': crashed runner processes, unhealthy executors, and lost runner-coordinator heartbeats.
Exact Error Message
A pipeline hangs in pending, then GitLab drops this notice and fails the job:
This job is stuck because of a runner system failure. Try again or contact your
administrator if the problem persists.
In the runner’s own logs you’ll often see the matching crash or heartbeat loss:
ERROR: Job failed (system failure): aborted: terminated
WARNING: Appending trace to coordinator... aborted status=canceled
ERROR: Submitting job to coordinator... failed code=500 ...
PANIC: runtime error: invalid memory address or nil pointer dereference
The job did not fail because of your code — the runner that picked it up died, stalled, or lost contact with GitLab mid-execution.
What the Error Means
GitLab dispatches a job to a runner, which then streams a trace and periodic updates back to the coordinator. “Stuck because of a runner system failure” means the runner accepted the job but then stopped reporting — it crashed, its executor became unhealthy, the host ran out of resources, or the network between runner and GitLab dropped. After a grace period with no heartbeat, GitLab marks the job as a system failure.
It differs from “no runners online” (no runner ever picked it up) and “stuck because of tag mismatch” (no eligible runner). Here a runner did take the job and then failed to follow through.
Common Causes
- The gitlab-runner process crashed or was OOM-killed on the host mid-job.
- Executor unhealthy: Docker daemon down, Kubernetes API unreachable, or shell host wedged.
- Host resource exhaustion (memory, file descriptors, disk) killing the runner or its children.
- Network partition between runner and GitLab so trace/heartbeat updates can’t be submitted.
- A runner panic/bug (rare) or a runner version incompatible with the GitLab version.
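As a rough triage aid, the causes above can be matched against a single runner or kernel log line. This is a minimal sketch: the function name classify_runner_log, the category labels, and the choice of patterns are illustrative assumptions, not part of GitLab Runner, and real logs may need more patterns.

```shell
# Map one log line to a likely cause category.
# Function name, labels and patterns are illustrative assumptions.
classify_runner_log() {
  case "$1" in
    *"Out of memory"*|*oom-kill*|*"Killed process"*)
      echo "oom" ;;
    *PANIC*|*"nil pointer dereference"*)
      echo "runner-panic" ;;
    *"Cannot connect to the Docker daemon"*)
      echo "executor" ;;
    *"Submitting job to coordinator"*|*"Appending trace to coordinator"*)
      echo "network-or-coordinator" ;;
    *"Job failed (system failure)"*)
      echo "system-failure-unspecified" ;;
    *)
      echo "unknown" ;;
  esac
}
```

The first matching pattern wins, so the more specific signals (OOM, panic) are listed before the generic system-failure line.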
How to Reproduce the Error
Hard to reproduce on purpose, but you can simulate the symptom by killing the runner while a job runs:
slow:
  script:
    - sleep 120   # while this runs, the runner process is stopped
If gitlab-runner is stopped or OOM-killed during the sleep, GitLab eventually reports:
This job is stuck because of a runner system failure.
Diagnostic Commands
Read-only health checks on the runner host:
# Is the runner process actually up and healthy?
systemctl status gitlab-runner
# Recent runner errors, panics, and coordinator submit failures
journalctl -u gitlab-runner --since "1 hour ago" | grep -iE 'panic|system failure|aborted|oom'
# Was the runner (or a child) OOM-killed by the kernel?
journalctl -k --since "1 hour ago" | grep -i 'killed process'
# Is the executor reachable? (Docker / Kubernetes)
docker info >/dev/null 2>&1 && echo docker-ok || echo docker-DOWN
kubectl version >/dev/null 2>&1 && echo k8s-ok || echo k8s-DOWN   # --short was removed in newer kubectl releases
# Resource headroom on the host
df -h / ; free -m
Active: failed (Result: oom-kill)
journal: Out of memory: Killed process 4123 (gitlab-runner)
docker-DOWN
An oom-kill on the runner process, or an executor reporting DOWN, confirms the runner died or lost its backend mid-job.
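To see which processes the kernel actually killed, the "Killed process" lines can be reduced to a PID and process name. The helper below is a small sketch that assumes the kernel message format shown in the sample output above; oom_victims is a hypothetical name.

```shell
# Read kernel log text on stdin and print "PID NAME" for every
# "Killed process <pid> (<name>)" line. Other lines are ignored.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example on the runner host (not executed here):
#   journalctl -k --since "1 hour ago" | oom_victims
```

If gitlab-runner appears in the output, the runner process itself was the victim. If only job processes appear, the runner may have survived and the job's container hit its own limit instead.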
Step-by-Step Resolution
1. Restart the runner and its executor backend
Bring the runner and the thing it executes on back to a healthy state:
sudo systemctl restart docker # if using the Docker executor
sudo systemctl restart gitlab-runner
gitlab-runner verify # checks runners are still registered & reachable
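gitlab-runner verify confirms that each registered runner is still valid on the GitLab side. If it reports runners that no longer exist there, gitlab-runner verify --delete removes those entries from config.toml.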
2. Fix the root resource problem
If the host was OOM-killed or out of disk, address that so it doesn’t recur:
[[runners]]
  executor = "docker"
  [runners.docker]
    memory = "2g"        # cap per-job memory so one job can't OOM the host
    memory_swap = "2g"
Lower concurrent if too many jobs share a small host, and clean disk (docker system prune during idle windows).
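One way to pick a starting value for concurrent is to divide the host's usable memory by the per-job cap. The sketch below assumes all figures are in MiB and that about 1024 MiB is held back for the OS, the runner, and the Docker daemon. That reserve and the name suggest_concurrent are assumptions for illustration, not GitLab guidance.

```shell
# Estimate a ceiling for `concurrent`:
#   (host memory - reserve) / per-job memory cap, never below 1.
# Arguments: host_mb per_job_mb [reserve_mb]  (reserve defaults to 1024, an assumed figure)
suggest_concurrent() (
  host_mb="$1"
  per_job_mb="$2"
  reserve_mb="${3:-1024}"
  usable=$(( host_mb - reserve_mb ))
  n=$(( usable / per_job_mb ))
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
)
```

For an 8 GiB host with the 2g cap shown above, suggest_concurrent 8192 2048 prints 3. Jobs that start service containers use more than one capped container, so treat the result as an upper bound.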
3. Repair an unhealthy executor
For Kubernetes, confirm the API and namespace are reachable from the runner; for shell/Docker, confirm the daemon is up. A runner pointed at a dead executor will accept jobs and then fail them as system failures.
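The executor checks can share one small wrapper that runs any probe command and prints an ok/DOWN verdict, in the same style as the docker-ok / docker-DOWN check in the diagnostics. This is a sketch: probe is a hypothetical helper, and the namespace in the usage comment is a placeholder.

```shell
# Run a command quietly and report "<name>-ok" or "<name>-DOWN".
# Usage: probe <name> <command> [args...]
probe() (
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "${name}-ok"
  else
    echo "${name}-DOWN"
  fi
)

# Examples on the runner host (not executed here; "ci-jobs" is a placeholder namespace):
#   probe docker docker info
#   probe k8s kubectl auth can-i create pods --namespace ci-jobs
```

A missing binary also counts as DOWN, which is usually the desired verdict for a runner that depends on it.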
4. Align runner and GitLab versions
A runner far older or newer than the GitLab instance can mis-handle the API. Keep the runner within GitLab’s supported version window and upgrade if logs show protocol/500 errors.
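A quick drift check compares the major.minor part of the two version strings. The sketch assumes the aim is matching major.minor, which may be stricter than the supported window for your GitLab release. version_alignment is a hypothetical helper, and both versions must be supplied by you (for example from gitlab-runner --version and the instance's Help page).

```shell
# Print "ok" when both versions share major.minor, otherwise "drift".
# Usage: version_alignment <runner_version> <gitlab_version>
version_alignment() (
  runner="${1#v}"   # tolerate a leading "v", e.g. v16.11.1
  gitlab="${2#v}"
  runner_mm=$(printf '%s' "$runner" | cut -d. -f1,2)
  gitlab_mm=$(printf '%s' "$gitlab" | cut -d. -f1,2)
  if [ "$runner_mm" = "$gitlab_mm" ]; then
    echo "ok"
  else
    echo "drift"
  fi
)
```

A drift result is a prompt to check the compatibility notes for your release. It does not prove that the version gap caused the failure.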
5. Retry the job
System-failure jobs are safe to retry once the runner is healthy. Use Retry in the UI, or add automatic retries for transient runner failures:
slow:
  retry:
    max: 2
    when: runner_system_failure
  script: ["./build.sh"]
when: runner_system_failure retries only this class of failure, not your real test failures.
Prevention and Best Practices
- Cap per-job memory (memory/memory_swap) and set a sane concurrent so one heavy job can’t OOM the runner host.
- Monitor the runner host’s memory, disk, and the gitlab-runner service; alert on oom-kill and service restarts.
- Add retry: { when: runner_system_failure } to long or critical jobs so transient runner crashes self-heal without manual retries.
- Keep the runner version within GitLab’s supported window and schedule docker system prune during idle periods.
- Pasting the runner journal and the stuck-job notice into the free incident assistant separates an OOM from a network or executor failure. More patterns live in the GitLab CI/CD guides.
Related Errors
- There has been a runner system failure / no runners online — no runner ever picked the job up, versus this case where one took it and then failed.
- ERROR: Job failed: exit code 137 (OOMKilled) — the job’s container was OOM-killed; here it’s the runner process or host that died.
- Job failed (system failure): aborted: terminated is the raw runner-side log line behind the user-facing “stuck” notice.
Frequently Asked Questions
Is this my pipeline’s fault?
Usually not. A runner system failure means the runner that accepted the job crashed, lost its executor, or lost contact with GitLab. Your script may never have run. Check the runner host’s health before suspecting your .gitlab-ci.yml.
Should I just retry the job?
Once the runner is healthy, yes — system-failure jobs are safe to retry. Add retry: { when: runner_system_failure } so transient runner crashes retry automatically without masking genuine test failures.
How do I know if the runner was OOM-killed?
Run journalctl -k | grep -i 'killed process' and systemctl status gitlab-runner on the host. An oom-kill result or a kernel “Out of memory: Killed process gitlab-runner” line confirms it. Cap per-job memory and lower concurrency to prevent recurrence.
How is this different from “no runners online”?
“No runners online” means no eligible runner ever picked the job up. This error means a runner did accept the job and then failed mid-execution (crash, dead executor, lost heartbeat). The fixes target runner-host health, not runner availability.