Using AI to Detect and Quarantine Flaky Tests in GitLab CI
Use AI to spot flaky tests from GitLab CI JUnit reports, cluster them apart from real failures, and auto-quarantine the offenders so your pipelines stay green.
- #gitlab
- #ci-cd
- #ai
- #testing
- #flaky-tests
I once lost the better part of a Friday to a single test. It failed on roughly one run in five, always with a timeout, never on my laptop. We didn’t fix it. We did something worse: we taught the whole team to click Retry. Within a month, “just hit retry” was institutional knowledge, and a green pipeline meant almost nothing. The flake had quietly trained us to ignore CI.
That’s the real cost of flaky tests. It isn’t the wasted minutes — it’s the erosion of trust. Once people stop believing red means broken, your test suite is decoration. This guide is about clawing that trust back in GitLab CI, using AI to do the part humans are bad at (spotting patterns across thousands of failure logs) while keeping humans firmly in charge of the part AI is bad at (deciding what’s actually safe to quarantine).
First, make failures legible to GitLab
You can’t analyze what you can’t see. The single most valuable change is emitting JUnit reports so GitLab parses results per-test instead of treating a job as one opaque pass/fail.
test:
  stage: test
  image: node:20
  script:
    - npm ci
    - npm test -- --reporter=junit --outputFile=report.xml
  artifacts:
    when: always
    reports:
      junit: report.xml
    paths:
      - report.xml
    expire_in: 30 days
The when: always matters — without it, a failed job won’t upload the report, and the failures are exactly the data you need. Once this lands, GitLab shows a per-test breakdown in the merge request widget and the pipeline’s Tests tab. That structured XML is also what you’ll hand to AI later.
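To make "structured XML" concrete, here is a minimal sketch of pulling per-test records out of a JUnit report in Node. The function name parseJUnit and the output shape are my own assumptions, and the regex approach is a shortcut that only suits tidy reporter output; a real XML parser is the safer choice for anything long-lived.

```javascript
// Sketch only: extract per-test results from a JUnit XML string.
// Regex parsing does not decode XML entities and will misread
// attribute values that contain ">".
function parseJUnit(xml) {
  const results = [];
  // Matches a self-closing <testcase .../> or one with child elements.
  const testcasePattern = /<testcase\b([^>]*?)(?:\/>|>([\s\S]*?)<\/testcase>)/g;
  for (const match of xml.matchAll(testcasePattern)) {
    const attrs = match[1];
    const body = match[2] || '';
    // Read one attribute value from the opening tag ('' when absent).
    const attr = (name) => {
      const found = new RegExp(`\\b${name}="([^"]*)"`).exec(attrs);
      return found ? found[1] : '';
    };
    let status = 'passed';
    if (/<(?:failure|error)\b/.test(body)) status = 'failed';
    else if (/<skipped\b/.test(body)) status = 'skipped';
    results.push({
      test: `${attr('classname')} :: ${attr('name')}`,
      status,
      seconds: Number(attr('time')) || 0,
    });
  }
  return results;
}

const sample = `<testsuite name="auth">
  <testcase classname="auth" name="logs in" time="0.12"/>
  <testcase classname="auth" name="refreshes token" time="5.00">
    <failure message="timeout">Error: timed out</failure>
  </testcase>
</testsuite>`;
console.log(parseJUnit(sample));
```

Records in this shape (test, status, seconds) are small enough to accumulate across many pipelines, which is what any later analysis needs.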
Stop blanket-retrying everything
The lazy fix is a global retry. Resist it. Retrying everything hides real regressions and burns runner minutes. Scope retries narrowly so they paper over infrastructure blips, not logic bugs.
test:
  stage: test
  script:
    - npm ci
    - npm test -- --reporter=junit --outputFile=report.xml
  retry:
    max: 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
      - script_failure
The script_failure condition retries on any non-zero exit from your script, which is the broadest case and the one most likely to mask a real regression. Use it deliberately, and only while you’re actively hunting flakes. The goal is a temporary bridge, not a permanent crutch. Every retry you add is a flake you’ve chosen to tolerate instead of diagnose.
Pro Tip: Track your retry rate as a metric. If the percentage of jobs that only pass on attempt 2+ is climbing, your suite is rotting — no dashboard will tell you that as bluntly as that one number.
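One way to compute that number, as a rough sketch: group job attempts by pipeline and job name, then count the jobs whose final attempt succeeded only after an earlier one. The record fields (pipelineId, name, status, createdAt) are illustrative assumptions about data you would export yourself, not a GitLab API contract, and the denominator here is every distinct job.

```javascript
// Assumed input: one record per job attempt, retried attempts included.
function retryRate(attempts) {
  const jobs = new Map(); // "pipelineId/jobName" -> attempts for that job
  for (const attempt of attempts) {
    const key = `${attempt.pipelineId}/${attempt.name}`;
    if (!jobs.has(key)) jobs.set(key, []);
    jobs.get(key).push(attempt);
  }

  let passedOnRetry = 0;
  for (const list of jobs.values()) {
    // Order attempts chronologically; same-format ISO 8601 UTC strings sort as text.
    list.sort((a, b) => a.createdAt.localeCompare(b.createdAt));
    const final = list[list.length - 1];
    if (list.length > 1 && final.status === 'success') passedOnRetry += 1;
  }
  // Share of all jobs that went green only on attempt 2 or later.
  return jobs.size === 0 ? 0 : passedOnRetry / jobs.size;
}

const attempts = [
  { pipelineId: 1, name: 'test', status: 'failed', createdAt: '2024-05-01T10:00:00Z' },
  { pipelineId: 1, name: 'test', status: 'success', createdAt: '2024-05-01T10:07:00Z' },
  { pipelineId: 2, name: 'test', status: 'success', createdAt: '2024-05-01T11:00:00Z' },
];
console.log(retryRate(attempts)); // 0.5
```

Plot the result per week rather than per pipeline; a single noisy day matters less than the trend.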
Let GitLab flag flakiness for you
GitLab gives you a first flakiness signal on its own once it has consistent JUnit history. A test that fails and then passes on a retry of the same commit shows up as a failed attempt next to a passing one, and the merge request test report notes how many times a failing test has also failed on the default branch in the last 14 days. You don’t configure a special keyword for this; it falls out of consistent reports:junit artifacts plus retries.
test:
  stage: test
  script:
    - npm ci
    - npm test -- --reporter=junit --outputFile=report.xml
  retry:
    max: 1
    when: script_failure
  artifacts:
    when: always
    reports:
      junit: report.xml
This native signal is your ground truth. It’s conservative, though — it only catches flakes that happen to fail-then-pass within the data it sees. The long tail of “fails once a day on a schedule” needs more analysis, which is where AI earns its keep.
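The fail-then-pass rule is simple enough to reproduce over your own exported history. Here is a sketch that flags any test with both outcomes on the same commit SHA. The input shape (test, sha, status) is an assumption about how you store results, and findSameCommitFlips is a hypothetical helper name.

```javascript
// Flag tests that both passed and failed on at least one identical commit.
function findSameCommitFlips(runs) {
  const byTest = new Map(); // test -> Map(sha -> Set of statuses seen)
  for (const { test, sha, status } of runs) {
    if (status !== 'passed' && status !== 'failed') continue; // ignore skipped
    if (!byTest.has(test)) byTest.set(test, new Map());
    const bySha = byTest.get(test);
    if (!bySha.has(sha)) bySha.set(sha, new Set());
    bySha.get(sha).add(status);
  }

  const suspects = [];
  for (const [test, bySha] of byTest) {
    // A commit counts as "flipped" when both outcomes were recorded for it.
    const flippedCommits = [...bySha.values()].filter((seen) => seen.size === 2).length;
    if (flippedCommits > 0) {
      suspects.push({ test, flippedCommits, commitsSeen: bySha.size });
    }
  }
  return suspects.sort((a, b) => b.flippedCommits - a.flippedCommits);
}

console.log(findSameCommitFlips([
  { test: 'checkout', sha: 'abc123', status: 'failed' },
  { test: 'checkout', sha: 'abc123', status: 'passed' },
  { test: 'login', sha: 'abc123', status: 'failed' },
]));
```

A test that only ever fails on a commit (like login above) is not flagged, which is the point: a consistent failure looks like a regression, not a flake.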
Run tests in parallel to surface order-dependence
Many flakes are actually hidden dependencies between tests — shared global state, an unclean database, a leaked port. Sharding across parallel jobs shuffles which tests run together and exposes order-dependent failures fast.
test:
  stage: test
  parallel: 5
  script:
    - npm ci
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL --reporter=junit --outputFile=report-$CI_NODE_INDEX.xml
  artifacts:
    when: always
    reports:
      junit: report-*.xml
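GitLab merges the glob of JUnit files into one report automatically. A test only runs in one shard per pipeline, so the tell is a test that passes under one shard layout and fails under another, for example after you change the parallel count and it lands next to different neighbors. That is a coupling bug, not a true flake, and it deserves a real fix rather than a quarantine.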
Feed failure history to AI to cluster flaky vs. real
Here’s the part that scales beyond human patience. You’ve got weeks of JUnit XML sitting in artifacts. Pull the failures into a compact dataset — test name, error message, stack signature, pass/fail history, timestamp — and ask an assistant to cluster them.
analyze-flakes:
  stage: analyze  # "analyze" must also be listed under stages:
  image: node:20
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - node scripts/collect-junit.js > failures.json
    - node scripts/ask-ai-to-cluster.js failures.json > flake-report.md
  artifacts:
    paths:
      - flake-report.md
    expire_in: 90 days
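The article leaves the two scripts to you. As one possible shape for the "stack signature" part of the collection step, here is a sketch that normalizes the first line of each error message so near-identical failures collapse into one group before anything reaches the model. The helper names signature and groupBySignature, and the { test, message } input shape, are assumptions for illustration.

```javascript
// Reduce an error message to a stable signature: first line only,
// lowercased, with hex addresses and numbers replaced by placeholders.
function signature(message) {
  return message
    .split('\n')[0]
    .toLowerCase()
    .replace(/0x[0-9a-f]+/g, '<hex>')
    .replace(/\d+/g, '<n>')
    .replace(/\s+/g, ' ')
    .trim();
}

// Group failures by signature, largest group first.
function groupBySignature(failures) {
  const groups = new Map();
  for (const { test, message } of failures) {
    const sig = signature(message);
    if (!groups.has(sig)) groups.set(sig, { signature: sig, count: 0, tests: new Set() });
    const group = groups.get(sig);
    group.count += 1;
    group.tests.add(test);
  }
  return [...groups.values()]
    .map((group) => ({ ...group, tests: [...group.tests].sort() }))
    .sort((a, b) => b.count - a.count);
}

console.log(groupBySignature([
  { test: 'orders', message: 'Error: connect ECONNRESET 10.0.3.17:54321\n    at Socket.onError' },
  { test: 'billing', message: 'Error: connect ECONNRESET 10.0.9.2:41000' },
  { test: 'cart', message: 'AssertionError: expected 3 to equal 4' },
]));
```

This deterministic pre-grouping does not replace the AI pass. It shrinks the input and gives the model counts to reason about instead of a thousand raw traces.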
The AI is genuinely good here: it reads a thousand stack traces and notices that fourteen “different” failures all share a connection reset near the same fixture, or that a cluster only fails between 00:00–01:00 UTC (a nightly cron stealing the runner). That’s pattern-spotting across noisy text — exactly its strength.
But treat it like a fast junior engineer, not an oracle. It will confidently mislabel a genuine regression as “probably flaky” because the error message looks familiar. Its output is a sorted to-do list and a hypothesis, never a verdict. A human reads flake-report.md and decides what actually gets quarantined. And the cardinal rule: never give the AI your CI secrets. It needs test names and stack traces, not $DEPLOY_TOKEN, database URLs, or signing keys. Sanitize the data you export, and run the analysis job with no access to protected variables. If you want a structured place to keep these clustering and triage prompts, a prompt workspace or a curated prompt pack beats pasting ad-hoc instructions into a chat window each week.
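For the sanitizing step, a sketch of a redaction pass you might run over every message before export. The three patterns are assumptions about what commonly leaks into logs (GitLab tokens with the glpat- prefix, credentials embedded in URLs, and secret-looking KEY=value pairs). They are not exhaustive, so treat this as a backstop behind an allow-list of exported fields, not a guarantee.

```javascript
// Ordered list of [pattern, replacement] pairs applied to each log string.
const REDACTIONS = [
  // GitLab access tokens carry a "glpat-" prefix.
  [/glpat-[A-Za-z0-9_-]{20,}/g, '[REDACTED_TOKEN]'],
  // Credentials embedded in URLs: scheme://user:password@host
  [/([a-z][a-z0-9+.-]*:\/\/)[^\s:@\/]+:[^\s@\/]+@/gi, '$1[REDACTED]@'],
  // KEY=value pairs whose key looks secret-like.
  [/\b([A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|KEY)[A-Z0-9_]*)=\S+/g, '$1=[REDACTED]'],
];

function redact(text) {
  return REDACTIONS.reduce(
    (current, [pattern, replacement]) => current.replace(pattern, replacement),
    text,
  );
}

console.log(redact('connect to postgres://app:hunter2@db.internal:5432/app failed'));
// connect to postgres://[REDACTED]@db.internal:5432/app failed
```

Run it on test names and stack traces alike, and spot-check the output by hand before the first scheduled run.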
Pro Tip: Give the AI the test’s pass/fail timeline, not just the latest failure. “Failed 6 of the last 40 runs, all on shared CI runners, never locally” is a far stronger flaky signal than one stack trace, and it stops the model from overfitting to a single scary-looking log line.
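A small sketch of turning raw history into that kind of one-line summary for the prompt. It assumes runs arrive oldest first with status and runner fields, which are my naming choices, and the 40-run window simply mirrors the example in the tip.

```javascript
// Build a compact, prompt-friendly summary of a test's recent history.
// Assumes `runs` is ordered oldest first.
function summarizeTimeline(test, runs, window = 40) {
  const recent = runs.slice(-window);
  const failed = recent.filter((run) => run.status === 'failed');
  const runners = [...new Set(failed.map((run) => run.runner))].sort();
  // One character per run: "F" for a failure, "." for anything else.
  const timeline = recent.map((run) => (run.status === 'failed' ? 'F' : '.')).join('');
  const runnerNote = runners.length ? ` (failing runners: ${runners.join(', ')})` : '';
  return `${test}: failed ${failed.length} of the last ${recent.length} runs${runnerNote}; timeline ${timeline}`;
}

console.log(summarizeTimeline('checkout retries payment', [
  { status: 'passed', runner: 'shared-1' },
  { status: 'failed', runner: 'shared-2' },
  { status: 'passed', runner: 'shared-1' },
  { status: 'failed', runner: 'shared-2' },
]));
// checkout retries payment: failed 2 of the last 4 runs (failing runners: shared-2); timeline .F.F
```

The timeline string is cheap in tokens and lets the model see whether failures are scattered or bunched together.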
Quarantine without deleting
Once a human confirms a test is flaky, move it somewhere it can’t block merges but stays visible. A dedicated quarantine stage with allow_failure: true does exactly that.
stages:
  - test
  - quarantine

test:
  stage: test
  script:
    - npm ci
    - npm test -- --grep-invert @quarantine --reporter=junit --outputFile=report.xml
  artifacts:
    when: always
    reports:
      junit: report.xml

quarantine-tests:
  stage: quarantine
  allow_failure: true
  script:
    - npm ci
    - npm test -- --grep @quarantine --reporter=junit --outputFile=quarantine.xml
  artifacts:
    when: always
    reports:
      junit: quarantine.xml
Tag the flaky test with @quarantine, and the main test job excludes it so the pipeline goes green on real correctness. The quarantine-tests job still runs it, so failures are reported but never block the merge. Flag names vary by test runner (Playwright spells them --grep and --grep-invert, Mocha uses --grep with --invert), so substitute your runner’s equivalent. Crucially, the test isn’t deleted: when someone fixes the underlying race, you delete the tag and it rejoins the gate. Pair this with an expiry, because a quarantine that never empties is just a graveyard. Open a ticket for every quarantined test and review the list each sprint.
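The expiry is easy to enforce mechanically. Below is a sketch that lists quarantined tests that are either older than a limit or missing a ticket. The registry shape (test, since, ticket) and the 14-day default, picked to match a two-week sprint, are assumptions of mine, not something GitLab provides.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return quarantine entries that need attention: too old, or no ticket.
// `since` is an ISO date string (parsed as UTC); `today` is a Date.
function overdueQuarantines(entries, today, maxDays = 14) {
  return entries
    .map((entry) => ({
      ...entry,
      ageDays: Math.floor((today - new Date(entry.since)) / MS_PER_DAY),
    }))
    .filter((entry) => entry.ageDays > maxDays || !entry.ticket)
    .sort((a, b) => b.ageDays - a.ageDays);
}

const registry = [
  { test: 'checkout retries payment', since: '2024-05-01', ticket: '#101' },
  { test: 'uploads large file', since: '2024-05-15', ticket: '#102' },
  { test: 'websocket reconnects', since: '2024-05-18', ticket: '' },
];
console.log(overdueQuarantines(registry, new Date('2024-05-20')));
```

Wire a check like this into a scheduled job and fail it when the list is non-empty, so the sprint review starts from a concrete list instead of memory.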
Close the loop with review
The output of all this — the AI’s clusters, the quarantine list, the proposed .gitlab-ci.yml edits — still needs a human gate before it merges. Run the proposed config and code changes through your normal MR review, and lean on an automated code review pass to catch the obvious mistakes (a grep-invert typo that silently skips half your suite) before a reviewer’s time is spent. The discipline is the same one that runs through everything in GitLab CI/CD work: AI proposes at machine speed, a human approves with judgment. If you’re refining the prompts you use for triage, the patterns over in the prompt library are a decent starting point.
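One cheap automated guard against the silently-skipped-suite mistake is to compare how many tests ran against the previous count on the default branch. A minimal sketch follows; the function name, the inputs, and the 10% threshold are all assumptions to tune for your own suite.

```javascript
// Compare executed test counts between a baseline and the current run.
// `maxDropRatio` is the largest acceptable shrinkage (0.1 = 10%).
function checkTestCount(previousCount, currentCount, maxDropRatio = 0.1) {
  if (previousCount <= 0) return { ok: true, drop: 0 }; // no baseline yet
  const drop = (previousCount - currentCount) / previousCount;
  return { ok: drop <= maxDropRatio, drop };
}

const result = checkTestCount(500, 250);
if (!result.ok) {
  console.log(`Test count dropped by ${Math.round(result.drop * 100)}%; check the filter flags.`);
}
```

The counts can come from the JUnit reports the pipeline already produces, so the guard needs no extra instrumentation.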
Wrapping up
Flaky tests don’t survive because they’re hard to fix — they survive because nobody can tell which of the five hundred red builds last month were the same bug. AI dissolves that needle-in-a-haystack problem: it clusters failures, ranks suspects, and hands you a short list in seconds. You keep the steering wheel. Emit JUnit reports, retry narrowly, let GitLab flag the obvious flakes, parallelize to expose coupling, and quarantine — visibly, temporarily, never silently — only what a human has confirmed. Do that, and red starts meaning broken again. That’s the whole point.