Using AI to Detect and Quarantine Flaky Tests in GitLab CI

I once lost the better part of a Friday to a single test. It failed on roughly one run in five, always with a timeout, never on my laptop. We didn’t fix it. We did something worse: we taught the whole team to click Retry. Within a month, “just hit retry” was institutional knowledge, and a green pipeline meant almost nothing. The flake had quietly trained us to ignore CI.

That’s the real cost of flaky tests. It isn’t the wasted minutes — it’s the erosion of trust. Once people stop believing red means broken, your test suite is decoration. This guide is about clawing that trust back in GitLab CI, using AI to do the part humans are bad at (spotting patterns across thousands of failure logs) while keeping humans firmly in charge of the part AI is bad at (deciding what’s actually safe to quarantine).

First, make failures legible to GitLab

You can’t analyze what you can’t see. The single most valuable change is emitting JUnit reports so GitLab parses results per-test instead of treating a job as one opaque pass/fail.

test:
  stage: test
  image: node:20
  script:
    - npm ci
    - npm test -- --reporter=junit --outputFile=report.xml
  artifacts:
    when: always
    reports:
      junit: report.xml
    paths:
      - report.xml
    expire_in: 30 days

The when: always matters: without it, a failed job won’t upload the report, and the failures are exactly the data you need. Once this lands, GitLab shows a per-test breakdown in the merge request widget and the pipeline’s Tests tab. That structured XML is also what you’ll hand to AI later. (The --reporter=junit and --outputFile flags shown here are Vitest’s; other runners spell their JUnit options differently.)
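
To make "per-test" concrete, here is a minimal sketch of pulling individual outcomes out of a JUnit file. The function name parseJUnit is hypothetical, and the regex-based parsing is a shortcut that assumes a simple, well-formed report; a real script would more likely use a proper XML parser.

```javascript
// Hypothetical helper: extract per-test outcomes from a JUnit XML string.
// Regex parsing is for illustration only and assumes well-formed,
// non-nested <testcase> elements.
function parseJUnit(xml) {
  const results = [];
  // Matches self-closing <testcase ... /> or <testcase ...>...</testcase>.
  const testcase = /<testcase\b([^>]*?)(?:\/>|>([\s\S]*?)<\/testcase>)/g;
  let match;
  while ((match = testcase.exec(xml)) !== null) {
    const attrs = match[1];
    const body = match[2] || "";
    const attr = (key) => {
      const found = new RegExp(`\\b${key}="([^"]*)"`).exec(attrs);
      return found ? found[1] : "";
    };
    let status = "passed";
    if (/<(failure|error)\b/.test(body)) status = "failed";
    else if (/<skipped\b/.test(body)) status = "skipped";
    results.push({ name: attr("name"), classname: attr("classname"), status });
  }
  return results;
}
```

Each returned record carries a name, a classname, and a status, which is roughly the granularity GitLab works with once the report is uploaded.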

Stop blanket-retrying everything

The lazy fix is a global retry. Resist it. Retrying everything hides real regressions and burns runner minutes. Scope retries narrowly so they paper over infrastructure blips, not logic bugs.

test:
  stage: test
  script:
    - npm ci
    - npm test -- --reporter=junit --outputFile=report.xml
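  retry:
    max: 2
    when:
      # Infrastructure blips: reasonable to keep long term.
      - runner_system_failure
      - stuck_or_timeout_failure
      # Broadest case: temporary, remove once the flake hunt is over.
      - script_failure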

when: script_failure retries on any non-zero exit from your script, which is the broadest case and the one most likely to mask a real regression. Use it deliberately and only while you’re actively hunting flakes, then drop it and keep just the two infrastructure conditions. The goal is a temporary bridge, not a permanent crutch. Every retry you add is a flake you’ve chosen to tolerate instead of diagnose.

Pro Tip: Track your retry rate as a metric. If the percentage of jobs that only pass on attempt 2+ is climbing, your suite is rotting — no dashboard will tell you that as bluntly as that one number.
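
The metric itself is simple arithmetic. A rough sketch, assuming you can export one record per job with its attempt count and final status (the field names attempts and finalStatus are invented for this example):

```javascript
// Hypothetical input shape: one record per job, e.g.
// { name: "test", attempts: 2, finalStatus: "success" }.
// How you collect these records (API, export, logs) is up to you.
function retryPassRate(jobs) {
  if (jobs.length === 0) return 0;
  // "Rescued" jobs are the ones that only went green after a retry.
  const rescued = jobs.filter(
    (job) => job.finalStatus === "success" && job.attempts > 1
  ).length;
  return rescued / jobs.length;
}
```

Chart the result per week; the trend matters more than any single value.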

Let GitLab flag flakiness for you

GitLab gives you a head start once it has enough JUnit history. In the merge request test summary, a failing test is annotated with how many times it has recently failed on the default branch, which is a strong hint that the failure has nothing to do with the change under review. A test that fails and then passes on the same commit after a retry is the clearest flake signature of all. You don’t configure a special keyword for any of this; it falls out of consistent reports:junit artifacts plus retries.

test:
  stage: test
  script:
    - npm ci
    - npm test -- --reporter=junit --outputFile=report.xml
  retry:
    max: 1
    when: script_failure
  artifacts:
    when: always
    reports:
      junit: report.xml

This native signal is your ground truth. It’s conservative, though — it only catches flakes that happen to fail-then-pass within the data it sees. The long tail of “fails once a day on a schedule” needs more analysis, which is where AI earns its keep.
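
The fail-then-pass-on-the-same-commit signal is also easy to recompute yourself from exported results, which helps when you want it over a longer window. A minimal sketch, assuming a flat list of executions with hypothetical test, sha, and status fields:

```javascript
// Hypothetical record shape: { test, sha, status } per test execution.
// A test is a flake candidate if it both passed and failed on one commit.
function flakeCandidates(executions) {
  const seen = new Map(); // "test@sha" -> Set of observed statuses
  for (const { test, sha, status } of executions) {
    const key = `${test}@${sha}`;
    if (!seen.has(key)) seen.set(key, new Set());
    seen.get(key).add(status);
  }
  const flaky = new Set();
  for (const [key, statuses] of seen) {
    if (statuses.has("passed") && statuses.has("failed")) {
      // Split on the last "@" so test names containing "@" survive.
      flaky.add(key.slice(0, key.lastIndexOf("@")));
    }
  }
  return [...flaky].sort();
}
```

A test that fails on one commit and passes on a different one is deliberately not counted here, since a code change could explain the difference.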

Run tests in parallel to surface order-dependence

Many flakes are actually hidden dependencies between tests: shared global state, an unclean database, a leaked port. Sharding across parallel jobs changes which tests run together in the same process, and that regrouping exposes order-dependent failures fast.

test:
  stage: test
  parallel: 5
  script:
    - npm ci
    - npm test -- --shard=$CI_NODE_INDEX/$CI_NODE_TOTAL --reporter=junit --outputFile=report-$CI_NODE_INDEX.xml
  artifacts:
    when: always
    reports:
      junit: report-*.xml

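GitLab collects every JUnit file matched by the glob into one report automatically. Each test runs in exactly one shard, so the signal to watch for is a test that passes in the unsharded run but fails once sharding changes its neighbors, or one that starts failing when you change the shard count. That’s a coupling bug, not a true flake, and it calls for a real fix, not a quarantine.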

Feed failure history to AI to cluster flaky vs. real

Here’s the part that scales beyond human patience. You’ve got weeks of JUnit XML sitting in artifacts. Pull the failures into a compact dataset — test name, error message, stack signature, pass/fail history, timestamp — and ask an assistant to cluster them.

analyze-flakes:
  stage: analyze  # assumes "analyze" is declared under stages:
  image: node:20
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - node scripts/collect-junit.js > failures.json
    - node scripts/ask-ai-to-cluster.js failures.json > flake-report.md
  artifacts:
    paths:
      - flake-report.md
    expire_in: 90 days

The AI is genuinely good here: it reads a thousand stack traces and notices that fourteen “different” failures all share a connection reset near the same fixture, or that a cluster only fails between 00:00–01:00 UTC (a nightly cron stealing the runner). That’s pattern-spotting across noisy text — exactly its strength.
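
A cheap pre-clustering pass before the AI sees anything can shrink the dataset and make its job easier. The sketch below is one way to do it, not the contents of the scripts named above: stackSignature and groupBySignature are hypothetical helpers, and the normalization rules are deliberately crude.

```javascript
// Hypothetical helper: reduce an error message to a coarse "signature" by
// stripping details that vary between runs (hex ids, numbers, whitespace).
function stackSignature(message) {
  return message
    .replace(/0x[0-9a-f]+/gi, "<hex>")
    .replace(/\d+/g, "<n>")
    .replace(/\s+/g, " ")
    .trim();
}

// Group failures ({ test, message }) that share a signature.
function groupBySignature(failures) {
  const groups = new Map();
  for (const failure of failures) {
    const signature = stackSignature(failure.message);
    if (!groups.has(signature)) groups.set(signature, []);
    groups.get(signature).push(failure.test);
  }
  // Largest clusters first, so the most common failure mode leads the list.
  return [...groups.entries()]
    .map(([signature, tests]) => ({ signature, tests }))
    .sort((a, b) => b.tests.length - a.tests.length);
}
```

Two connection resets against different IPs and with different timings collapse into one group here, which is the kind of "these are the same failure" hint worth handing to the model along with the raw text.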

But treat it like a fast junior engineer, not an oracle. It will confidently mislabel a genuine regression as “probably flaky” because the error message looks familiar. Its output is a sorted to-do list and a hypothesis, never a verdict. A human reads flake-report.md and decides what actually gets quarantined.

And the cardinal rule: never give the AI your CI secrets. It needs test names and stack traces, not $DEPLOY_TOKEN, database URLs, or signing keys. Sanitize the data you export, and run the analysis job with no access to protected variables.

If you want a structured place to keep these clustering and triage prompts, a prompt workspace or a curated prompt pack beats pasting ad-hoc instructions into a chat window each week.
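
Sanitizing the export can start as a plain redaction pass over the text before it leaves CI. The patterns below are illustrative assumptions, not a complete list; treat the sketch as a floor, and keep the job away from protected variables regardless.

```javascript
// Hypothetical redaction pass to run over log text before exporting it.
// The patterns are examples only; extend them for your own stack.
const REDACTIONS = [
  // Credentials embedded in URLs: scheme://user:password@host
  [/\b([a-z][a-z0-9+.-]*:\/\/)[^\s/:@]+:[^\s/@]+@/gi, "$1<redacted>@"],
  // GitLab personal access tokens carry the "glpat-" prefix by default.
  [/\bglpat-[A-Za-z0-9_-]+/g, "<redacted-token>"],
  // key=value or key: value pairs whose key looks sensitive.
  [
    /\b([A-Z0-9_]*(?:TOKEN|SECRET|PASSWORD|KEY)[A-Z0-9_]*)\s*[=:]\s*\S+/gi,
    "$1=<redacted>",
  ],
];

function sanitize(text) {
  return REDACTIONS.reduce(
    (out, [pattern, replacement]) => out.replace(pattern, replacement),
    text
  );
}
```

The last pattern is intentionally overzealous (it will also redact something like "monkey: banana"); for data headed to a third party, losing a harmless value is the cheaper mistake.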

Pro Tip: Give the AI the test’s pass/fail timeline, not just the latest failure. “Failed 6 of the last 40 runs, all on shared CI runners, never locally” is a far stronger flaky signal than one stack trace, and it stops the model from overfitting to a single scary-looking log line.
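
Turning raw history into that kind of one-line summary is a few lines of code. A sketch, assuming each run record has hypothetical status and runner fields and that runs are ordered oldest to newest:

```javascript
// Hypothetical record shape: { status: "passed" | "failed", runner: string },
// ordered oldest to newest. Field names are assumptions for this sketch.
function summarizeTimeline(test, runs, window = 40) {
  const recent = runs.slice(-window);
  const failures = recent.filter((run) => run.status === "failed");
  // Unique runners that saw a failure, sorted for stable output.
  const runners = [...new Set(failures.map((run) => run.runner))].sort();
  const where = runners.length > 0 ? `, on runners: ${runners.join(", ")}` : "";
  return `${test}: failed ${failures.length} of the last ${recent.length} runs${where}`;
}
```

Put the resulting line at the top of each test's entry in the prompt, above the stack trace, so the model reads the base rate before the scary log line.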

Quarantine without deleting

Once a human confirms a test is flaky, move it somewhere it can’t block merges but stays visible. A dedicated quarantine stage with allow_failure: true does exactly that.

stages:
  - test
  - quarantine

test:
  stage: test
  script:
    - npm test -- --grep-invert @quarantine --reporter=junit --outputFile=report.xml
  artifacts:
    when: always
    reports:
      junit: report.xml

quarantine-tests:
  stage: quarantine
  allow_failure: true
  script:
    - npm test -- --grep @quarantine --reporter=junit --outputFile=quarantine.xml
  artifacts:
    when: always
    reports:
      junit: quarantine.xml

Tag the flaky test with @quarantine, and the main test job excludes it so the pipeline goes green on real correctness. The quarantine-tests job still runs it: failures are reported but never block the merge. (Filter flags vary by runner: Playwright has --grep and --grep-invert, Mocha pairs --grep with --invert, so adapt the commands to yours.) Crucially, the test isn’t deleted, so when someone fixes the underlying race, you delete the tag and it rejoins the gate. Pair this with an expiry: a quarantine that never empties is just a graveyard. Open a ticket for every quarantined test and review the list each sprint.
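
The expiry can be enforced mechanically if you keep a small ledger of quarantined tests. A sketch under stated assumptions: the entry shape (test, ticket, quarantinedOn) is hypothetical, and the 14-day limit is an arbitrary default, not a recommendation from GitLab.

```javascript
// Hypothetical ledger entry: { test, ticket, quarantinedOn }, where
// quarantinedOn is an ISO date string such as "2024-02-01".
const DAY_MS = 24 * 60 * 60 * 1000;

function overdueQuarantines(entries, now = new Date(), maxDays = 14) {
  return entries
    .map((entry) => ({
      ...entry,
      // Whole days elapsed since the test was quarantined.
      ageDays: Math.floor((now - new Date(entry.quarantinedOn)) / DAY_MS),
    }))
    .filter((entry) => entry.ageDays > maxDays)
    .sort((a, b) => b.ageDays - a.ageDays); // oldest offenders first
}
```

Run it in the scheduled pipeline and post the result to the sprint review; an empty list is the goal.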

Close the loop with review

The output of all this — the AI’s clusters, the quarantine list, the proposed .gitlab-ci.yml edits — still needs a human gate before it merges. Run the proposed config and code changes through your normal MR review, and lean on an automated code review pass to catch the obvious mistakes (a grep-invert typo that silently skips half your suite) before a reviewer’s time is spent. The discipline is the same one that runs through everything in GitLab CI/CD work: AI proposes at machine speed, a human approves with judgment. If you’re refining the prompts you use for triage, the patterns over in the prompt library are a decent starting point.

Wrapping up

Flaky tests don’t survive because they’re hard to fix — they survive because nobody can tell which of the five hundred red builds last month were the same bug. AI dissolves that needle-in-a-haystack problem: it clusters failures, ranks suspects, and hands you a short list in seconds. You keep the steering wheel. Emit JUnit reports, retry narrowly, let GitLab flag the obvious flakes, parallelize to expose coupling, and quarantine — visibly, temporarily, never silently — only what a human has confirmed. Do that, and red starts meaning broken again. That’s the whole point.