GitLab Pipeline Audit & Slow Job Hunt Prompt

You are a senior DevOps engineer who has audited GitLab CI/CD at scale: finding pipelines stuck for hours, jobs queueing because runners are saturated, and jobs that should have completed in 5 minutes taking 50.

I will provide:
- The scope: project-level audit / group-wide audit / a specific slow pipeline
- Recent timing data (pipeline durations, queue times)
- Runner inventory (count, executor type, capacity)
- The goal: find slow jobs / fix queueing / capacity plan

Your job:

1. **Identify the slowest jobs**:
   - Use the GitLab API to pull jobs across recent pipelines
   - Calculate p50/p95/p99 duration per job name
   - Look for outliers AND systematic slowness
2. **Distinguish duration from queue time**:
   - `duration`: wall-clock time the job ran
   - `queued_duration`: time the job waited for a runner before starting
   - High `queued_duration` = capacity issue, not job slowness
3. **For capacity analysis**:
   - Concurrent jobs running vs the `concurrent` setting on runners
   - Time-of-day patterns: are peak hours overloaded?
   - Per-runner utilization
4. **For job-level slowness**:
   - Is the job's actual work slow (e.g., compile takes 20 min) or is it waiting on something (cache restore, image pull)?
   - Cache restore at job start and cache push at job end can dominate
   - Image pull on a cold runner is significant on the first job
   - Artifact upload/download for large artifacts
5. **For pipeline-level slowness**:
   - Critical path: longest chain of dependent jobs
   - Stage-based pipelines have implicit ordering; DAG (`needs:`) can shorten it
   - A single bottleneck job (e.g., a 30-minute e2e test) dominates the total
6. **For queueing**:
   - Runner pool fully consumed → jobs wait
   - Solution: add runners OR reduce per-job duration OR change job-runner tag matching
   - Group runners vs project runners: scope matters
7. **For stuck / stale jobs**:
   - Jobs that ran but never reported finish (lost runner, network issue)
   - Default `timeout` per job (1 hour): past this, GitLab kills them
   - Manual jobs (`when: manual`) that nobody clicked
8. **For org-wide patterns**:
   - Which projects consume the most runner time?
   - Which jobs are unnecessarily heavy?
   - Are caches effective?

Mark DESTRUCTIVE: cancelling running jobs without notice, reducing runner count without a capacity plan, removing caches "to test" (often dramatically slower).

---

Scope: [project / group / specific pipeline]
Recent timing data: [DESCRIBE]
Runner inventory: [count, executor, capacity]
Symptom: [DESCRIBE: slow / queued / stale / random]
Goal: [audit / fix / plan]

Why this prompt works

Pipeline performance is a top complaint and the slowest jobs aren’t always the obvious ones. Queue time vs duration is the key first split; many “slow pipelines” are capacity-constrained, not job-constrained.

How to use it

Pull timing data first: it is the pipeline equivalent of running `kubectl get pods` before debugging a cluster.
Distinguish queue time from run time — different fixes.
Find the five slowest jobs and focus there.
For org-wide audit, aggregate by project.
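
The queue-versus-run split in the steps above can be sketched as a small filter. This is a minimal sketch, not a GitLab feature: the function name `classify_jobs` and the "mean queue time exceeds mean run time" rule are illustrative assumptions, and the input is assumed to be `name|duration|queued_duration` lines such as a `jq` call over the jobs endpoint would produce.

```shell
#!/bin/bash
# classify_jobs (hypothetical helper): read "name|duration|queued_duration"
# lines on stdin and print "<label> run=<mean>s queue=<mean>s <name>" per job.
# The "queue > run" threshold is an illustrative assumption; tune it to taste.
classify_jobs() {
    awk -F'|' '
        # Skip jobs that never ran or never queued (jq prints null for those).
        $2 != "null" && $3 != "null" { run[$1] += $2; queue[$1] += $3; n[$1]++ }
        END {
            for (name in n) {
                r = run[name] / n[name]
                q = queue[name] / n[name]
                label = (q > r) ? "queue-bound" : "run-bound"
                printf "%s run=%.0fs queue=%.0fs %s\n", label, r, q, name
            }
        }' | sort
}
```

Jobs labelled `queue-bound` point at runner capacity or tag matching; jobs labelled `run-bound` point at the job's own work.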

Useful commands

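# Pipeline durations (last N pipelines for a project)
# The list endpoint omits duration and queued_duration, so fetch each pipeline by ID.
for PID in $(curl -s --header "PRIVATE-TOKEN: <t>" \
    "https://gitlab.example.com/api/v4/projects/<id>/pipelines?per_page=10" | jq -r '.[].id'); do
    curl -s --header "PRIVATE-TOKEN: <t>" \
        "https://gitlab.example.com/api/v4/projects/<id>/pipelines/$PID" | \
        jq -r '"\(.id) \(.duration)s \(.queued_duration)s \(.status) \(.ref)"'
done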

# Per-job stats for a pipeline, slowest first
# per_page=100 avoids the default page of 20 jobs; jobs that never ran show "nulls"
curl -s --header "PRIVATE-TOKEN: <t>" \
    "https://gitlab.example.com/api/v4/projects/<id>/pipelines/<pid>/jobs?per_page=100" | \
    jq -r '.[] | "\(.duration)s queue=\(.queued_duration)s \(.name) [\(.stage)]"' | sort -nr | head

# Average duration per job name across last N pipelines (slowest listed last)
PROJ_ID=42
for PID in $(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
    "$GITLAB/api/v4/projects/$PROJ_ID/pipelines?per_page=50&status=success" | jq -r '.[].id'); do
    # Skip jobs with a null duration (skipped or manual) so they do not drag the average down
    curl -s --header "PRIVATE-TOKEN: $TOKEN" \
        "$GITLAB/api/v4/projects/$PROJ_ID/pipelines/$PID/jobs?per_page=100" | \
        jq -r '.[] | select(.duration != null) | "\(.name)\t\(.duration)"'
done | awk -F'\t' '{sum[$1]+=$2; count[$1]++} END {for (n in sum) print sum[n]/count[n], n}' | sort -n | tail

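# Runner status (admin)
# /runners/all lists every runner on the instance; plain /runners lists only those available to the token's user
curl -s --header "PRIVATE-TOKEN: <t>" \
    "https://gitlab.example.com/api/v4/runners/all" | jq '.[] | {id, description, active, online, status, contacted_at}'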

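# Find old/stuck pipelines (GNU date)
# Use a UTC timestamp ending in Z: the "+00:00" offset from -Iseconds contains a "+",
# which is decoded as a space inside a URL query string.
curl -s --header "PRIVATE-TOKEN: <t>" \
    "https://gitlab.example.com/api/v4/projects/<id>/pipelines?status=running&updated_before=$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" | jq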

# Cancel a stuck pipeline (carefully)
curl --request POST --header "PRIVATE-TOKEN: <t>" \
    "https://gitlab.example.com/api/v4/projects/<id>/pipelines/<pid>/cancel"

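# Per-project usage (admin)
# Verify on your instance: not every GitLab version exposes shared_runners_minutes in the
# statistics object. Projects without the field print "n/a".
curl -s --header "PRIVATE-TOKEN: <t>" \
    "https://gitlab.example.com/api/v4/groups/<id>/projects?include_subgroups=true&statistics=true&per_page=100" | \
    jq -r '.[] | "\(.statistics.shared_runners_minutes // "n/a")\t\(.path_with_namespace)"' | sort -nr | head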

Aggregation scripts

Find slowest jobs project-wide

#!/bin/bash
# Usage: TOKEN=<t> GITLAB=https://gitlab.example.com ./script.sh <project-id>
PROJ_ID=$1
echo "Pulling last 30 successful pipelines..."
JOBS_FILE=$(mktemp)
trap 'rm -f "$JOBS_FILE"' EXIT    # clean up the temp file even if the script is interrupted
for PID in $(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
    "$GITLAB/api/v4/projects/$PROJ_ID/pipelines?per_page=30&status=success" | jq -r '.[].id'); do
    # Skip jobs that never ran (null duration); per_page=100 avoids the 20-job default page
    curl -s --header "PRIVATE-TOKEN: $TOKEN" \
        "$GITLAB/api/v4/projects/$PROJ_ID/pipelines/$PID/jobs?per_page=100" | \
        jq -r '.[] | select(.duration != null) | "\(.name)|\(.duration)|\(.queued_duration)"' >> "$JOBS_FILE"
done

echo "=== Top 10 slowest by p50 ==="
# Sort by name, then numerically by duration, so each job's durations arrive in order.
# Grouping on the "|" field keeps job names that contain spaces intact.
LC_ALL=C sort -t'|' -k1,1 -k2,2n "$JOBS_FILE" | \
    awk -F'|' '
        $1 != name { if (n) print d[int(n/2)], name; name = $1; n = 0 }
        { d[n++] = $2 }
        END { if (n) print d[int(n/2)], name }' | \
    sort -n | tail
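
The script above reports only the median. To cover the p50/p95/p99 figures the prompt asks for, the same `name|duration|queued_duration` lines can be fed through a percentile filter. This is a sketch under stated assumptions: `job_percentiles` is a hypothetical helper name, and it uses the nearest-rank definition of a percentile, which is one of several common conventions.

```shell
#!/bin/bash
# job_percentiles (hypothetical helper): read "name|duration|..." lines on
# stdin and print "<p50> <p95> <p99> <count> <name>" per job name.
# Percentiles use the nearest-rank method: rank = ceil(p/100 * n).
job_percentiles() {
    # Sort by name, then numerically by duration, so awk sees ordered groups.
    LC_ALL=C sort -t'|' -k1,1 -k2,2n | awk -F'|' '
        function rank(p, cnt,   r) { r = int((p * cnt + 99) / 100); return r < 1 ? 1 : r }
        function emit() {
            if (n) print d[rank(50, n)], d[rank(95, n)], d[rank(99, n)], n, name
        }
        $2 == "null" || $2 == "" { next }          # job never ran
        $1 != name { emit(); name = $1; n = 0 }     # new job name: flush the previous group
        { d[++n] = $2 }
        END { emit() }'
}
```

For example, `job_percentiles < "$JOBS_FILE" | sort -k2,2n | tail` would list the jobs with the worst p95. With fewer than about 20 samples per job, p95 and p99 collapse onto the maximum, so treat them as indicative only.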

Detect queueing patterns

# Across all recent pipelines, find jobs with high queue time (over 60 seconds)
# Assumes PROJ_ID, TOKEN, and GITLAB are already set, as in the script above
for PID in $(curl -s --header "PRIVATE-TOKEN: $TOKEN" \
    "$GITLAB/api/v4/projects/$PROJ_ID/pipelines?per_page=20" | jq -r '.[].id'); do
    curl -s --header "PRIVATE-TOKEN: $TOKEN" \
        "$GITLAB/api/v4/projects/$PROJ_ID/pipelines/$PID/jobs?per_page=100" | \
        jq -r '.[] | select(.queued_duration > 60) | "\(.queued_duration)s queue \(.duration)s run \(.name)"'
done | sort -nr | head
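
To see whether queueing clusters at particular hours, queue times can be bucketed by hour of day. The sketch below assumes input lines of the form `<timestamp>\t<queued_duration>`, where the timestamp is ISO 8601 as returned by the API (for example `2024-01-15T09:03:12.000Z`). The helper name `queue_by_hour` is hypothetical, and hours are reported in whatever zone the timestamps carry (UTC for a `Z` suffix).

```shell
#!/bin/bash
# queue_by_hour (hypothetical helper): read "<iso-timestamp>\t<queued_duration>"
# lines on stdin and print "<hour> jobs=<count> mean_queue=<seconds>s" per hour.
queue_by_hour() {
    awk -F'\t' '
        $2 != "null" && $2 != "" {
            hour = substr($1, 12, 2)    # characters 12-13 of YYYY-MM-DDTHH:MM:SS are HH
            sum[hour] += $2
            n[hour]++
        }
        END {
            for (h in n) printf "%s jobs=%d mean_queue=%.0fs\n", h, n[h], sum[h] / n[h]
        }' | sort
}
```

One way to feed it is to change the `jq` filter in the loop to `'.[] | "\(.started_at)\t\(.queued_duration)"'`. Using `started_at` is an approximation: it marks when the wait ended, not when it began.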

Common findings this catches

Single slow integration test dominating pipeline duration → split, parallelize, or move to async post-merge.
All jobs queue at 9 AM → runner capacity inadequate for peak; add or autoscale.
One project consumes 70% of runner-minutes → audit; possibly broken cache invalidation or excessive testing.
Manual jobs sit pending for days → either remove the manual gate or assign owners.
Image pull dominates startup of every job → pre-pull on runners; use dependency proxy.
Cache restore takes 3 min for a 5-min job → cache too large; trim paths.
Jobs often hit the 1-hour timeout → split the job into smaller jobs or raise the timeout.
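
The first finding is easier to spot with a rough critical-path estimate. For a purely stage-based pipeline (no `needs:`), run time is approximately the sum of the slowest job in each stage, ignoring queue time. The sketch below applies that approximation to `stage|name|duration` lines; `stage_critical_path` is a hypothetical helper name, and the estimate does not hold for DAG pipelines.

```shell
#!/bin/bash
# stage_critical_path (hypothetical helper): read "stage|name|duration" lines
# on stdin, print the slowest job per stage, then the sum of those maxima.
# Assumes a stage-based pipeline without needs:, and ignores queue time.
stage_critical_path() {
    awk -F'|' '
        $3 == "null" || $3 == "" { next }                   # job never ran
        !($1 in max) { order[++k] = $1; max[$1] = -1 }      # remember stages in input order
        $3 + 0 > max[$1] { max[$1] = $3 + 0; job[$1] = $2 }
        END {
            for (i = 1; i <= k; i++) {
                s = order[i]
                total += max[s]
                printf "%s: %s (%.0fs)\n", s, job[s], max[s]
            }
            printf "critical path: %.0fs\n", total
        }'
}
```

Input can come from the per-job endpoint with a filter such as `jq -r '.[] | "\(.stage)|\(.name)|\(.duration)"'`. Stages print in the order they first appear in the input, which may not match pipeline order; the total is unaffected.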

Capacity planning template

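Concurrent jobs (peak observed) = N
Concurrent jobs (typical) = M
Average job duration = D minutes
Pipelines per hour (peak) = P
Average jobs per pipeline = J
Estimated peak concurrency (arrival rate × duration) = P × J × (D / 60)
Required concurrent runner capacity = max(N, P × J × (D / 60)) × buffer

Note: N is capped by current capacity when runners are saturated, so the estimate from P, J, and D is the more reliable figure in that case.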
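
One way to sanity-check the template is Little's law: average concurrency equals arrival rate multiplied by time in the system. The sketch below is an illustration under assumptions, not a sizing tool: `required_capacity` is a hypothetical helper, and it takes an extra input that the template does not list, J (average jobs per pipeline), to turn pipelines per hour into jobs per hour.

```shell
#!/bin/bash
# required_capacity P J D BUFFER (hypothetical helper)
#   P      = pipelines per hour at peak
#   J      = average jobs per pipeline (assumed input, not in the template)
#   D      = average job duration in minutes
#   BUFFER = headroom multiplier, e.g. 1.3 for 30% spare capacity
# Prints the number of concurrent job slots, rounded up.
required_capacity() {
    awk -v p="$1" -v j="$2" -v d="$3" -v buffer="$4" 'BEGIN {
        need = p * j * (d / 60) * buffer            # jobs/hour x hours/job x headroom
        slots = int(need)
        if (slots < need) slots++                   # round up to whole job slots
        print slots
    }'
}
```

For example, `required_capacity 30 8 6 1.3` models 30 pipelines per hour of 8 jobs each at 6 minutes per job: 24 jobs in flight on average, 32 slots with 30% headroom. Averages hide bursts, so compare the result against the observed peak N before committing to it.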

When to escalate

Org-wide queue issues — capacity planning meeting; budget for more runners.
Specific project anti-pattern — engage owners; share findings.
GitLab.com shared runner saturation — consider buying minutes or moving to self-hosted runners.

Related prompts

GitLab CI/CD `needs:` DAG Optimization Prompt

GitLab CI/CD Pipeline Optimization Prompt

GitLab Runner Troubleshooting Prompt
