AI for GitLab CI/CD · By James Joyner IV · 9 min read

GitLab CI Error Guide: 'This job is stuck because of a runner system failure'

Fix GitLab CI's 'job is stuck ... runner system failure': crashed runner processes, unhealthy executors, and lost runner-coordinator heartbeats.

  • #gitlab-cicd
  • #troubleshooting
  • #errors
  • #runner-config

Exact Error Message

A job is picked up by a runner, stalls, and then GitLab shows this notice and fails the job:

This job is stuck because of a runner system failure. Try again or contact your
administrator if the problem persists.

In the runner’s own logs you’ll often see the matching crash or heartbeat loss:

ERROR: Job failed (system failure): aborted: terminated
WARNING: Appending trace to coordinator... aborted  status=canceled
ERROR: Submitting job to coordinator... failed   code=500 ...
PANIC: runtime error: invalid memory address or nil pointer dereference

The job did not fail because of your code — the runner that picked it up died, stalled, or lost contact with GitLab mid-execution.

What the Error Means

GitLab dispatches a job to a runner, which then streams a trace and periodic updates back to the coordinator. “Stuck because of a runner system failure” means the runner accepted the job but then stopped reporting — it crashed, its executor became unhealthy, the host ran out of resources, or the network between runner and GitLab dropped. After a grace period with no heartbeat, GitLab marks the job as a system failure.

It differs from “no runners online” (no runner ever picked it up) and “stuck because of tag mismatch” (no eligible runner). Here a runner did take the job and then failed to follow through.

Common Causes

  1. The gitlab-runner process crashed or was OOM-killed on the host mid-job.
  2. Executor unhealthy — Docker daemon down, Kubernetes API unreachable, or shell host wedged.
  3. Host resource exhaustion (memory, file descriptors, disk) killing the runner or its children.
  4. Network partition between runner and GitLab so trace/heartbeat updates can’t be submitted.
  5. A runner panic/bug (rare) or a runner version incompatible with the GitLab version.
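
These causes tend to leave different fingerprints in the logs. As a rough sketch, a small shell function can bucket a single log line by probable cause. The function name and category labels are made up for illustration, and the patterns are assumptions based on the sample lines in this guide plus Docker's usual "Cannot connect to the Docker daemon" message, so adjust them to what your runner actually logs.

```shell
# Sketch: map one log line to a probable cause category.
# Category names and patterns are illustrative assumptions, not an official taxonomy.
classify_runner_line() {
  local lower
  # Lowercase the line so matching is case-insensitive.
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$lower" in
    *"out of memory"*|*oom-kill*|*"killed process"*)
      echo "oom" ;;
    *panic:*)
      echo "runner-panic" ;;
    *"cannot connect to the docker daemon"*)
      echo "executor" ;;
    *"submitting job to coordinator"*failed*|*"appending trace to coordinator"*)
      echo "coordinator-link" ;;
    *)
      echo "unknown" ;;
  esac
}
```

To summarize an hour of runner logs, something like this should work: journalctl -u gitlab-runner --since "1 hour ago" | while IFS= read -r l; do classify_runner_line "$l"; done | sort | uniq -c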

How to Reproduce the Error

The failure rarely occurs on demand, but you can simulate the symptom by stopping the runner while a job runs:

slow:
  script:
    - sleep 120     # stop the runner process while this is running

If gitlab-runner is stopped or OOM-killed during the sleep, GitLab eventually reports:

This job is stuck because of a runner system failure.

Diagnostic Commands

Read-only health checks on the runner host:

# Is the runner process actually up and healthy?
systemctl status gitlab-runner

# Recent runner errors, panics, and coordinator submit failures
journalctl -u gitlab-runner --since "1 hour ago" | grep -iE 'panic|system failure|aborted|oom'

# Was the runner (or a child) OOM-killed by the kernel?
journalctl -k --since "1 hour ago" | grep -i 'killed process'

# Is the executor reachable? (Docker / Kubernetes)
docker info >/dev/null 2>&1 && echo docker-ok || echo docker-DOWN
kubectl version >/dev/null 2>&1 && echo k8s-ok || echo k8s-DOWN   # --short was removed in newer kubectl releases

# Resource headroom on the host
df -h / ; free -m

Example output from an unhealthy host:

Active: failed (Result: oom-kill)
journal: Out of memory: Killed process 4123 (gitlab-runner)
docker-DOWN

An oom-kill on the runner process, or an executor reporting DOWN, confirms the runner died or lost its backend mid-job.
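
Disk exhaustion is easy to miss when scanning df output by eye. The sketch below flags any filesystem at or above a usage threshold. It assumes POSIX-format input (df -P) and mount points without spaces; the function name and the default of 90 percent are arbitrary choices for illustration.

```shell
# Sketch: read `df -P` output on stdin and print mount points whose
# usage is at or above the given percentage (default 90).
disk_over_threshold() {
  local limit="${1:-90}"
  awk -v limit="$limit" '
    NR > 1 {                      # skip the header row
      use = $5                    # Capacity column, e.g. "95%"
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}
```

Usage on the runner host: df -P | disk_over_threshold 85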

Step-by-Step Resolution

1. Restart the runner and its executor backend

Bring the runner and the thing it executes on back to a healthy state:

sudo systemctl restart docker          # if using the Docker executor
sudo systemctl restart gitlab-runner
sudo gitlab-runner verify               # checks each registered runner can still authenticate with GitLab

gitlab-runner verify confirms the registration is intact. If it reports runners that no longer exist in GitLab, sudo gitlab-runner verify --delete removes them from the local config.

2. Fix the root resource problem

If the runner was OOM-killed or the host ran out of disk, address the cause so it doesn’t recur:

[[runners]]
  executor = "docker"
  [runners.docker]
    memory = "2g"          # cap per-job memory so one job can't OOM the host
    memory_swap = "2g"     # memory + swap total; equal to memory means the job gets no swap

Lower concurrent if too many jobs share a small host, and clean disk (docker system prune during idle windows).
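
One way to pick a value for concurrent is to work backwards from host memory and the per-job cap. The following is a back-of-the-envelope sketch only: the function name is hypothetical, and the 1024 MB default reserve for the OS, Docker, and the runner itself is an assumption you should tune for your host.

```shell
# Sketch: estimate how many memory-capped jobs fit on a host.
# Arguments: total host memory (MB), per-job cap (MB), optional reserve (MB).
max_concurrent() {
  local total_mb="$1" per_job_mb="$2" reserve_mb="${3:-1024}"
  local usable=$(( total_mb - reserve_mb ))
  local jobs=$(( usable / per_job_mb ))    # integer division rounds down
  # Never suggest fewer than 1; a result of 1 may still be too tight for the host.
  if [ "$jobs" -lt 1 ]; then jobs=1; fi
  echo "$jobs"
}
```

For example, with the 2g cap shown above: max_concurrent "$(free -m | awk '/^Mem:/ {print $2}')" 2048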

3. Repair an unhealthy executor

For Kubernetes, confirm the API and namespace are reachable from the runner; for Docker, confirm the daemon is up; for shell, confirm the host itself is responsive. A runner pointed at a dead executor will accept jobs and then fail them as system failures.
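
The executor checks can be wrapped in one small helper so every backend reports in the same ok/DOWN style. This is a minimal sketch; the probe name is hypothetical, and it only tells you whether the given command exited successfully, not why it failed.

```shell
# Sketch: run any command quietly and report "<label>-ok" or "<label>-DOWN".
probe() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "${label}-ok"
  else
    echo "${label}-DOWN"
  fi
}
```

Typical calls would be probe docker docker info for the Docker executor and probe k8s kubectl get --raw=/readyz for Kubernetes. To check the namespace as well, probe k8s-ns kubectl auth can-i create pods --namespace gitlab-runner should work, where gitlab-runner is a placeholder for whatever namespace your runner uses.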

4. Align runner and GitLab versions

A runner far older or newer than the GitLab instance can mis-handle the API. Keep the runner within GitLab’s supported version window and upgrade if logs show protocol/500 errors.
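
A quick drift check can compare the major.minor part of both versions. This sketch treats an exact major.minor match as "aligned", which is a simplification of GitLab's compatibility guidance rather than a statement of its policy; the function names are hypothetical.

```shell
# Sketch: reduce a version string such as "v16.5.3-ee" to "16.5".
major_minor() {
  printf '%s\n' "$1" | sed -E 's/^v?([0-9]+)\.([0-9]+).*/\1.\2/'
}

# Succeeds (exit 0) when both versions share the same major.minor.
versions_aligned() {
  [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}
```

For example, assuming gitlab-runner --version prints a "Version:" line and your GitLab instance is on 16.5.3: versions_aligned "$(gitlab-runner --version | awk '/^Version:/ {print $2}')" "16.5.3" || echo "version drift"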

5. Retry the job

System-failure jobs are safe to retry once the runner is healthy. Use Retry in the UI, or add automatic retries for transient runner failures:

slow:
  retry:
    max: 2
    when: runner_system_failure
  script: ["./build.sh"]

when: runner_system_failure retries only this class of failure, not your real test failures.
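
Retries can also be triggered from a script through the GitLab Jobs API, which exposes a retry endpoint per job. The helper below only builds the URL; its name, the example host, the IDs, and the GITLAB_TOKEN variable are placeholders for illustration.

```shell
# Sketch: build the Jobs API retry URL for a project and job ID.
# A trailing slash on the base URL is tolerated.
retry_job_url() {
  local base="${1%/}" project_id="$2" job_id="$3"
  printf '%s/api/v4/projects/%s/jobs/%s/retry\n' "$base" "$project_id" "$job_id"
}
```

With a token that has API access, the call would look like: curl --fail --request POST --header "PRIVATE-TOKEN: $GITLAB_TOKEN" "$(retry_job_url https://gitlab.example.com 42 1001)"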

Prevention and Best Practices

  • Cap per-job memory (memory/memory_swap) and set a sane concurrent so one heavy job can’t OOM the runner host.
  • Monitor the runner host’s memory, disk, and the gitlab-runner service; alert on oom-kill and service restarts.
  • Add retry: { max: 2, when: runner_system_failure } to long or critical jobs so transient runner crashes self-heal without manual retries.
  • Keep the runner version within GitLab’s supported window and schedule docker system prune during idle periods.
  • Pasting the runner journal and the stuck-job notice into the free incident assistant separates an OOM from a network or executor failure. More patterns live in the GitLab CI/CD guides.

Frequently Asked Questions

Is this my pipeline’s fault?

Usually not. A runner system failure means the runner that accepted the job crashed, lost its executor, or lost contact with GitLab. Your script may never have run. Check the runner host’s health before suspecting your .gitlab-ci.yml.

Should I just retry the job?

Once the runner is healthy, yes: system-failure jobs are safe to retry. Add retry: { max: 2, when: runner_system_failure } so transient runner crashes retry automatically without masking genuine test failures.

How do I know if the runner was OOM-killed?

Run journalctl -k | grep -i 'killed process' and systemctl status gitlab-runner on the host. A Result: oom-kill status or a kernel line such as “Out of memory: Killed process 4123 (gitlab-runner)” confirms it. Cap per-job memory and lower concurrent to prevent recurrence.

How is this different from “no runners online”?

“No runners online” means no eligible runner ever picked the job up. This error means a runner did accept the job and then failed mid-execution (crash, dead executor, lost heartbeat). The fixes target runner-host health, not runner availability.
