GitLab Pipeline Audit & Slow Job Hunt Prompt
Audit GitLab pipelines for stale jobs, queueing delays, and runner capacity issues, and find the slow jobs that dominate the critical path.
- Target user: DevOps engineers and platform leads investigating pipeline performance
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior DevOps engineer who has audited GitLab CI/CD at scale: finding pipelines stuck for hours, jobs queueing because runners are saturated, jobs that should have completed in 5 minutes taking 50.

I will provide:
- The scope: project-level audit / group-wide audit / a specific slow pipeline
- Recent timing data (pipeline durations, queue times)
- Runner inventory (count, executor type, capacity)
- The goal: find slow jobs / fix queueing / capacity plan

Your job:

1. **Identify the slowest jobs**:
   - Use the GitLab API to pull jobs across recent pipelines
   - Calculate p50/p95/p99 duration per job name
   - Look for outliers AND systematic slowness
2. **Distinguish duration from queue time**:
   - `duration`: wall-clock time the job ran
   - `queued_duration`: time the job waited for a runner before starting
   - High queued_duration = capacity issue, not job slowness
3. **For capacity analysis**:
   - Concurrent jobs running vs `concurrent` setting on runners
   - Time-of-day patterns: peak hours overloaded
   - Per-runner utilization
4. **For job-level slowness**:
   - Is the job's actual work slow (e.g., compile takes 20 min) or is it waiting on something (cache restore, image pull)?
   - Cache restore + push at job start/end can dominate
   - Image pull on a cold runner is significant on the first job
   - Artifact upload/download for large artifacts
5. **For pipeline-level slowness**:
   - Critical path: longest chain of dependent jobs
   - Stage-based pipelines have implicit ordering; DAG (`needs:`) can shorten
   - Single bottleneck job (e.g., e2e test 30 min) dominates total
6. **For queueing**:
   - Runner pool fully consumed → jobs wait
   - Solution: add runners OR reduce per-job duration OR change job-runner tag matching
   - Group runners vs project runners: scope matters
7. **For stuck / stale jobs**:
   - Jobs that ran but never reported finish (lost runner, network issue)
   - Default `timeout` per job (1 hour); past this, GitLab kills them
   - Manual jobs (`when: manual`) that nobody clicked
8. **For org-wide patterns**:
   - Which projects consume most runner time?
   - Which jobs are unnecessarily heavy?
   - Are caches effective?

Mark DESTRUCTIVE: cancelling running jobs without notice, reducing runner count without a capacity plan, removing caches "to test" (often dramatically slower).

---

Scope: [project / group / specific pipeline]
Recent timing data: [DESCRIBE]
Runner inventory: [count, executor, capacity]
Symptom: [DESCRIBE: slow / queued / stale / random]
Goal: [audit / fix / plan]
Why this prompt works
Pipeline performance is a top complaint and the slowest jobs aren’t always the obvious ones. Queue time vs duration is the key first split; many “slow pipelines” are capacity-constrained, not job-constrained.
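As a rough illustration of that first split, the sketch below labels each job name by whether its average queue time or its average run time dominates. The function name `classify_jobs`, the `name|duration|queued_duration` input shape, and the 0.5 ratio threshold are all assumptions made for this example, not GitLab conventions, and the sample numbers are made up.

```shell
# classify_jobs: read "name|duration|queued_duration" lines on stdin and label
# each job name by where its time goes.
# ASSUMPTION: a job counts as capacity-constrained when its average queue time
# exceeds half its average run time; tune the 0.5 ratio for your environment.
classify_jobs() {
  awk -F'|' '
    $2 != "null" && $3 != "null" { run[$1] += $2; queue[$1] += $3; n[$1]++ }
    END {
      for (name in n) {
        r = run[name] / n[name]      # average run time in seconds
        q = queue[name] / n[name]    # average queue time in seconds
        label = (q > 0.5 * r) ? "capacity-constrained" : "job-constrained"
        printf "%s\t%.0f\t%.0f\t%s\n", name, r, q, label
      }
    }' | sort
}

# Made-up sample data: "build" queues briefly, "e2e" waits longer than it runs.
printf '%s\n' 'build|300|20' 'build|320|10' 'e2e|600|900' | classify_jobs
```

Columns are job name, average run seconds, average queue seconds, and the label. A job flagged capacity-constrained points at runners, not at the job script.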
How to use it
- Pull timing data first: the pipeline equivalent of `kubectl get pods`.
- Distinguish queue time from run time; they need different fixes.
- Find the five slowest jobs and focus there.
- For org-wide audit, aggregate by project.
Useful commands
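# Pipeline durations (last N pipelines; the list endpoint omits duration fields, so fetch each pipeline by ID)
for PID in $(curl -s --header "PRIVATE-TOKEN: <t>" \
  "https://gitlab.example.com/api/v4/projects/<id>/pipelines?per_page=20" | jq -r '.[].id'); do
  curl -s --header "PRIVATE-TOKEN: <t>" \
    "https://gitlab.example.com/api/v4/projects/<id>/pipelines/$PID" | \
    jq -r '"\(.id) \(.duration)s \(.queued_duration)s \(.status) \(.ref)"'
done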
# Per-job stats for a pipeline (per_page=100 because the default page holds only 20 jobs)
curl -s --header "PRIVATE-TOKEN: <t>" \
  "https://gitlab.example.com/api/v4/projects/<id>/pipelines/<pid>/jobs?per_page=100" | \
  jq -r '.[] | select(.duration != null) | "\(.duration)s queue=\(.queued_duration)s \(.name) [\(.stage)]"' | sort -nr | head
# Average duration per job name across last N pipelines
PROJ_ID=42
for PID in $(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
"$GITLAB/api/v4/projects/$PROJ_ID/pipelines?per_page=50&status=success" | jq -r '.[].id'); do
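  # Skip jobs that never ran (null duration) so they do not drag the average down
  curl -s --header "PRIVATE-TOKEN: $TOKEN" \
    "$GITLAB/api/v4/projects/$PROJ_ID/pipelines/$PID/jobs?per_page=100" | \
    jq -r '.[] | select(.duration != null) | "\(.name)\t\(.duration)"'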
done | awk -F'\t' '{sum[$1]+=$2; count[$1]++} END {for (n in sum) print sum[n]/count[n], n}' | sort -n | tail
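# Runner status (admin: /runners/all lists every runner on the instance;
# contacted_at is returned only by the per-runner endpoint /runners/<id>)
curl -s --header "PRIVATE-TOKEN: <t>" \
  "https://gitlab.example.com/api/v4/runners/all" | jq '.[] | {id, description, paused, online, status, runner_type}'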
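# Find old/stuck pipelines (a UTC timestamp with a Z suffix avoids an unencoded "+" in the URL; date -d is GNU date)
curl -s --header "PRIVATE-TOKEN: <t>" \
  "https://gitlab.example.com/api/v4/projects/<id>/pipelines?status=running&updated_before=$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" | jq .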
# Cancel a stuck pipeline (carefully)
curl --request POST --header "PRIVATE-TOKEN: <t>" \
"https://gitlab.example.com/api/v4/projects/<id>/pipelines/<pid>/cancel"
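# Per-project usage (admin)
# NOTE: check that your GitLab version exposes shared_runners_minutes under
# statistics; where it does not, the first column prints "n/a".
curl -s --header "PRIVATE-TOKEN: <t>" \
  "https://gitlab.example.com/api/v4/groups/<id>/projects?include_subgroups=true&statistics=true&per_page=100" | \
  jq -r '.[] | "\(.statistics.shared_runners_minutes // "n/a")\t\(.path_with_namespace)"' | sort -nr | head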
Aggregation scripts
Find slowest jobs project-wide
#!/bin/bash
# Expects TOKEN and GITLAB in the environment; pass the project ID as the first argument.
PROJ_ID=$1
echo "Pulling last 30 successful pipelines..."
JOBS_FILE=$(mktemp)
trap 'rm -f "$JOBS_FILE"' EXIT   # clean up even if a request fails
for PID in $(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
  "$GITLAB/api/v4/projects/$PROJ_ID/pipelines?per_page=30&status=success" | jq -r '.[].id'); do
  # Skip jobs with no duration (skipped or never-started jobs report null)
  curl -s --header "PRIVATE-TOKEN: $TOKEN" \
    "$GITLAB/api/v4/projects/$PROJ_ID/pipelines/$PID/jobs?per_page=100" | \
    jq -r '.[] | select(.duration != null) | "\(.name)|\(.duration)|\(.queued_duration)"' >> "$JOBS_FILE"
done
echo "=== Top 10 slowest by p50 ==="
# Sort by job name, then duration, so awk sees each job's durations in ascending order.
# Splitting on "|" keeps job names that contain spaces intact.
LC_ALL=C sort -t'|' -k1,1 -k2,2n "$JOBS_FILE" | awk -F'|' '
  function emit() { if (c) print v[int((c + 1) / 2)] "\t" name }
  $1 != name { emit(); name = $1; c = 0 }
  { v[++c] = $2 }
  END { emit() }' | sort -n | tail
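The prompt asks for p50/p95/p99 per job name, while the script above reports only the median. A minimal sketch of the tail percentiles follows, assuming input lines in the same `name|duration|queued_duration` shape the script writes to `$JOBS_FILE` (capture them before the file is removed). The function name `job_percentiles` and the nearest-rank percentile method are choices made for this example; the sample durations are made up.

```shell
# job_percentiles: read "name|duration|queued_duration" lines on stdin and print
# p50, p95, p99 and the job name, tab-separated, one line per job name.
# ASSUMPTION: nearest-rank percentiles (no interpolation between samples).
job_percentiles() {
  LC_ALL=C sort -t'|' -k1,1 -k2,2n | awk -F'|' '
    function pct(p,   idx) {
      idx = int((p * c + 99) / 100)   # ceil(p * c / 100) for integer p
      if (idx < 1) idx = 1
      return v[idx]
    }
    function emit() {
      if (c) printf "%s\t%s\t%s\t%s\n", pct(50), pct(95), pct(99), name
    }
    $2 == "null" || $2 == "" { next }        # job never ran
    $1 != name { emit(); name = $1; c = 0 }  # new job name: flush the previous one
    { v[++c] = $2 }                          # durations arrive in ascending order
    END { emit() }'
}

# Made-up sample: ten runs of a job named "test" taking 10..100 seconds.
printf 'test|%s|5\n' 10 20 30 40 50 60 70 80 90 100 | job_percentiles
```

A wide gap between p50 and p95 suggests outliers such as cold caches or image pulls; a high p50 suggests the job is systematically slow.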
Detect queueing patterns
# Across all recent pipelines, find jobs with high queue time
for PID in $(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
"$GITLAB/api/v4/projects/$PROJ_ID/pipelines?per_page=20" | jq -r '.[].id'); do
curl -s --header "PRIVATE-TOKEN: $TOKEN" \
"$GITLAB/api/v4/projects/$PROJ_ID/pipelines/$PID/jobs" | \
jq -r '.[] | select(.queued_duration > 60) | "\(.queued_duration)s queue \(.duration)s run \(.name)"'
done | sort -nr | head
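To see whether queueing clusters at particular hours, queue times can be bucketed by the hour in each job's timestamp. The sketch below assumes lines of the form `created_at|queued_duration`, which a jq filter such as `"\(.created_at)|\(.queued_duration)"` over the jobs response could produce. The function name `queue_by_hour` is hypothetical, the hour is read directly from the ISO 8601 string (so it is in whatever timezone the API returned), and the sample timestamps are made up.

```shell
# queue_by_hour: read "created_at|queued_duration" lines on stdin and print the
# job count and average queue time for each hour of day.
# ASSUMPTION: created_at is ISO 8601 (e.g. 2024-01-15T09:03:12.000Z), so the
# hour sits at characters 12-13 of the string.
queue_by_hour() {
  awk -F'|' '
    $2 != "null" && $2 != "" { h = substr($1, 12, 2); sum[h] += $2; n[h]++ }
    END {
      for (h in n)
        printf "%s:00\t%d jobs\tavg queue %.0fs\n", h, n[h], sum[h] / n[h]
    }' | sort
}

# Made-up sample: two jobs queue for minutes at 09:xx, one starts quickly at 14:xx.
printf '%s\n' \
  '2024-01-15T09:03:12.000Z|240' \
  '2024-01-15T09:40:00.000Z|360' \
  '2024-01-15T14:10:00.000Z|10' | queue_by_hour
```

An hour with many jobs and a high average queue time is a candidate for autoscaling or for shifting scheduled pipelines elsewhere.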
Common findings this catches
- Single slow integration test dominating pipeline duration → split, parallelize, or move to async post-merge.
- All jobs queue at 9 AM → runner capacity inadequate for peak; add or autoscale.
- One project consumes 70% of runner-minutes → audit; possibly broken cache invalidation or excessive testing.
- Manual jobs sit pending for days → either remove the manual gate or assign owners.
- Image pull dominates startup of every job → pre-pull on runners; use dependency proxy.
- Cache restore takes 3 min for a 5-min job → cache too large; trim paths.
- Jobs often hit the 1-hour default timeout → split the job into smaller jobs or raise the timeout.
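For the runner-minutes finding in the list above, each project's share can be computed once job durations have been summed per project. This is a sketch under assumptions: the input is tab-separated `project<TAB>seconds` lines gathered however suits your setup (for example by summing job `duration` per project), the function name `runner_share` is hypothetical, and the sample figures are made up.

```shell
# runner_share: read "project<TAB>seconds" lines on stdin and print each
# project's percentage of total runner time, largest first.
# ASSUMPTION: seconds are job run times you have already collected per project.
runner_share() {
  awk -F'\t' '
    { secs[$1] += $2; total += $2 }
    END {
      if (total > 0)
        for (p in secs) printf "%.1f%%\t%s\n", 100 * secs[p] / total, p
    }' | sort -nr
}

# Made-up sample: one project uses most of the runner time.
printf 'group/api\t7000\ngroup/web\t2000\ngroup/docs\t1000\n' | runner_share
```

A single project far ahead of the rest is the place to start looking at cache hit rates and test scope.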
Capacity planning template
Concurrent jobs (peak observed) = N
Concurrent jobs (typical) = M
Average job duration = D minutes
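Pipelines per hour (peak) = P
Average jobs per pipeline = J
Arrival-based peak concurrency = P × J × (D / 60)
Required concurrent runner capacity = max(N, P × J × (D / 60)) × buffer
(Observed N understates demand when runners are already saturated, so cross-check it against the arrival rate.)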
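One way to sanity-check the template is Little's law: jobs in flight ≈ job arrival rate × average job duration. The sketch below works a single example under assumptions. Jobs per pipeline (J) is an extra input introduced here, the function name `required_capacity` is hypothetical, and the sample values (12 pipelines per hour, 8 jobs each, 6-minute jobs, 1.3 buffer) are made up.

```shell
# required_capacity P J D BUFFER
#   P      = pipelines per hour at peak
#   J      = average jobs per pipeline (ASSUMPTION: an input added for this sketch)
#   D      = average job duration in minutes
#   BUFFER = headroom multiplier, e.g. 1.3 (illustrative)
# Prints the number of concurrent job slots, rounded up.
required_capacity() {
  awk -v p="$1" -v j="$2" -v d="$3" -v b="$4" 'BEGIN {
    need = p * j * (d / 60) * b       # Little'"'"'s law estimate times headroom
    slots = int(need)
    if (slots < need) slots++         # round up to whole slots
    print slots
  }'
}

# 12 pipelines/h x 8 jobs x 0.1 h = 9.6 jobs in flight; x 1.3 buffer = 12.48
required_capacity 12 8 6 1.3   # prints 13
```

If the result is well above the peak concurrency you observed, the runners were probably saturated and the observed number was capped by capacity rather than by demand.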
When to escalate
- Org-wide queue issues — capacity planning meeting; budget for more runners.
- Specific project anti-pattern — engage owners; share findings.
- GitLab.com shared runner saturation — consider buying minutes or moving to self-hosted runners.
Related prompts
- GitLab CI/CD `needs:` DAG Optimization Prompt: Convert stage-based GitLab pipelines to DAG (`needs:`), find hidden ordering bugs, design clean fan-out/fan-in patterns, and avoid `needs:` traps.
- GitLab CI/CD Pipeline Optimization Prompt: Speed up slow GitLab pipelines with DAG (`needs:`), cache vs artifacts, parallel jobs, image pre-builds, dependency proxy, and shallow clones.
- GitLab Runner Troubleshooting Prompt: Diagnose GitLab Runner failures such as runner offline, executor errors, Docker-in-Docker issues, autoscaler problems, slow job pickup, and resource exhaustion.