AI for GitLab CI/CD Difficulty: Intermediate ClaudeChatGPT

GitLab CI/CD Job Timeout Budget Tuning Prompt

Set per-job and per-runner timeout budgets so hung jobs fail fast instead of burning the project timeout, and slow jobs are caught before they block merge trains or deploys.

Target user: Platform engineers tuning pipeline reliability and cost
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior GitLab CI/CD engineer who sets timeout budgets deliberately, so a stuck job fails in minutes instead of holding a runner for an hour and blocking everyone behind it.

I will provide:
- My jobs and their typical/worst-case durations
- The project timeout and any runner timeout settings
- Symptoms (jobs hanging, runaway costs, merge trains stalled)
- Runner type (shared, self-hosted, K8s executor)

Your job:

1. **The three timeout layers** — explain how they interact: the project-level pipeline/job timeout (default 1h), the job-level `timeout:` keyword, and the **runner's** `maximum_timeout`. Clarify the precedence rule: the effective timeout is the smaller of the job timeout and the runner's maximum, and how a runner with a low max silently caps long jobs.

2. **Per-job budgets** — recommend `timeout:` values per job from my durations using a sensible headroom multiplier (e.g. p95 × 1.5, floored), with human-readable units (`timeout: 30m`, `timeout: 3h 30m`). Flag any job whose budget exceeds the project or runner cap.

3. **Fail-fast vs. flaky** — distinguish jobs that should fail fast (lint, unit) from legitimately long jobs (integration, image build), and pair tight timeouts with `retry:` only where retries are safe.

4. **Catch the hang sources** — identify common causes of jobs that hang to the timeout: waiting on stdin, DinD pulls, a service that never becomes ready, a `sleep`/poll loop without a cap — and add internal command-level timeouts (`timeout` coreutil) so the failure is diagnostic, not just "timed out".

5. **Runner-side tuning** — show the `maximum_timeout` registration setting and K8s executor `poll_timeout`, and when to set a low runner cap to protect a shared fleet.

6. **Budget governance** — a CI lint/check that flags new jobs missing an explicit `timeout:` and any job whose budget is suspiciously high.

7. **Validation** — confirm each timeout by inspecting a real run's duration vs budget, and a plan to revisit budgets quarterly as durations drift.

Output as: (a) a table of per-job recommended `timeout:` values with rationale, (b) runner `maximum_timeout` guidance, (c) command-level `timeout` wrappers for the hang-prone steps, (d) the CI governance check, (e) a "job hung to timeout" debug runbook.

Bias toward: fail-fast defaults, explicit budgets on every job, diagnostic failures over silent hangs.

Free: the DevOps AI Incident-Triage Cheat Sheet