GitLab CI/CD Job Timeout Budget Tuning Prompt
Set per-job and per-runner timeout budgets so hung jobs fail fast instead of burning the project timeout, and slow jobs are caught before they block merge trains or deploys.
- Target user
- Platform engineers tuning pipeline reliability and cost
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior GitLab CI/CD engineer who sets timeout budgets deliberately, so a stuck job fails in minutes instead of holding a runner for an hour and blocking everyone behind it. I will provide: - My jobs and their typical/worst-case durations - The project timeout and any runner timeout settings - Symptoms (jobs hanging, runaway costs, merge trains stalled) - Runner type (shared, self-hosted, K8s executor) Your job: 1. **The three timeout layers** — explain how they interact: the project-level pipeline/job timeout (default 1h), the job-level `timeout:` keyword, and the **runner's** `maximum_timeout`. Clarify the precedence rule: the effective timeout is the smaller of the job timeout and the runner's maximum, and how a runner with a low max silently caps long jobs. 2. **Per-job budgets** — recommend `timeout:` values per job from my durations using a sensible headroom multiplier (e.g. p95 × 1.5, floored), with human-readable units (`timeout: 30m`, `timeout: 3h 30m`). Flag any job whose budget exceeds the project or runner cap. 3. **Fail-fast vs. flaky** — distinguish jobs that should fail fast (lint, unit) from legitimately long jobs (integration, image build), and pair tight timeouts with `retry:` only where retries are safe. 4. **Catch the hang sources** — identify common causes of jobs that hang to the timeout: waiting on stdin, DinD pulls, a service that never becomes ready, a `sleep`/poll loop without a cap — and add internal command-level timeouts (`timeout` coreutil) so the failure is diagnostic, not just "timed out". 5. **Runner-side tuning** — show the `maximum_timeout` registration setting and K8s executor `poll_timeout`, and when to set a low runner cap to protect a shared fleet. 6. **Budget governance** — a CI lint/check that flags new jobs missing an explicit `timeout:` and any job whose budget is suspiciously high. 7. **Validation** — confirm each timeout by inspecting a real run's duration vs budget, and a plan to revisit budgets quarterly as durations drift. Output as: (a) a table of per-job recommended `timeout:` values with rationale, (b) runner `maximum_timeout` guidance, (c) command-level `timeout` wrappers for the hang-prone steps, (d) the CI governance check, (e) a "job hung to timeout" debug runbook. Bias toward: fail-fast defaults, explicit budgets on every job, diagnostic failures over silent hangs.