Automating Feature Flag Cleanup With AI

I found a flag last month called enable-new-checkout-v2. It had been at 100% rollout for two years. Two years. The “v3” checkout had already shipped behind a different flag. Nobody could tell me whether the false branch still compiled, because nobody had run it since 2024. That single toggle wrapped four files, a feature-detection helper, and a config block that three other services read on boot.

This is the quiet tax of feature flags. They are wonderful for shipping safely, and they are toxic when they outlive their purpose. Every stale flag is a permanent if statement that doubles the number of code paths a reader has to reason about. Multiply that by a few hundred flags and you have a codebase that lies to you about what it actually does at runtime.

I have been using AI to chip away at this backlog, and it works, but only with guardrails. The model is a fast junior engineer: great at the tedious cross-referencing and the mechanical refactor, terrible at deciding what is safe to delete. The decision stays with a human. The model never touches prod and never holds a production credential. Here is the workflow I landed on.

What “stale” actually means

Before automating anything, define the target. A flag is a cleanup candidate when several things are true at once, not just one:

It has been at 0% or 100% rollout for a long time (I use 90 days as the floor).
Its last-evaluated timestamp is old, meaning live traffic stopped hitting the decision point.
No open ticket or kill-switch policy depends on it staying around.

That last point matters. Some flags are deliberately permanent operational kill switches. Those are not tech debt, they are insurance. Your tooling has to be able to tell the difference, usually with a naming convention or a tag in the flag provider.
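The three criteria and the kill-switch exception can be written down as a single predicate. This is a minimal sketch, not any provider's schema: the `FlagState` record, its field names, and the `ops-` key prefix are all hypothetical stand-ins for whatever your provider and naming convention actually give you.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class FlagState:
    """Hypothetical, provider-neutral snapshot of one flag."""
    key: str
    rollout_percent: int                   # 0-100 in the target environment
    rollout_changed_at: datetime           # when the rollout last changed
    last_evaluated_at: Optional[datetime]  # None if never evaluated
    tags: List[str] = field(default_factory=list)
    open_tickets: int = 0                  # open tickets that reference the flag


def is_kill_switch(flag: FlagState) -> bool:
    # Assumed convention: a "permanent" tag or an "ops-" key prefix.
    return "permanent" in flag.tags or flag.key.startswith("ops-")


def is_cleanup_candidate(flag: FlagState, now: datetime, floor_days: int = 90) -> bool:
    cutoff = now - timedelta(days=floor_days)
    # 1. Pinned at 0% or 100% for at least floor_days.
    decided = flag.rollout_percent in (0, 100) and flag.rollout_changed_at <= cutoff
    # 2. No evaluation since the cutoff (or never evaluated at all).
    quiet = flag.last_evaluated_at is None or flag.last_evaluated_at <= cutoff
    # 3. Nothing depends on it staying around.
    unblocked = flag.open_tickets == 0 and not is_kill_switch(flag)
    return decided and quiet and unblocked
```

All three conditions must hold together, which is the point: a flag that fails any one of them stays off the cleanup list.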

Querying the provider for rollout and last-evaluated state

Start with the source of truth, the flag provider’s API. Here is a script against LaunchDarkly that pulls each flag’s rollout state and the last time it was evaluated. I run this from a CI job using a read-only API token, never a write-capable one.

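import os
from datetime import datetime, timedelta, timezone

import requests

LD_TOKEN = os.environ["LD_READ_TOKEN"]  # read-only, scoped to one project
PROJECT = "default"
ENV = "production"
BASE = "https://app.launchdarkly.com/api/v2"

headers = {"Authorization": LD_TOKEN}

def get_flags():
    """Fetch every flag in the project with its config for ENV."""
    url = f"{BASE}/flags/{PROJECT}"
    params = {"summary": "false", "env": ENV}
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["items"]

def parse_ts(value):
    """Parse an ISO 8601 timestamp as UTC; return None when the field is empty."""
    if not value:
        return None
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)

def stale_candidates(flags, days=90):
    """Return (key, reason, tags) for flags that are on but not recently evaluated."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    out = []
    for f in flags:
        env = f["environments"][ENV]
        if "permanent" in f.get("tags", []):
            continue  # deliberate kill switch, not tech debt
        if not env.get("on"):
            continue
        # fallthrough served + no recent evaluation = likely dead
        last_eval = parse_ts(env.get("lastRequested"))  # ISO8601 or null
        if last_eval is None:
            out.append((f["key"], "on but never requested", f.get("tags")))
        elif last_eval < cutoff:
            reason = f"on, last requested {last_eval.date()}"
            out.append((f["key"], reason, f.get("tags")))
    return out

if __name__ == "__main__":
    # Guarded so other scripts can import get_flags() without printing the report.
    for key, reason, tags in stale_candidates(get_flags()):
        print(f"{key}\t{reason}\ttags={tags}")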

Other tools expose the same shape of data under different names. Unleash gives you lastSeenAt per environment. OpenFeature is a vendor-neutral SDK standard rather than a flag service, so what you get depends on the provider behind it, but most expose evaluation metrics you can scrape. The point is the same: rollout percentage plus an evaluation timestamp tells you whether a flag is decided and dead.
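If you pull from more than one provider, it helps to normalize the timestamp before any staleness logic runs. The sketch below assumes two record shapes: a LaunchDarkly-style record with `lastRequested` nested under the environment (as the script above reads it) and an Unleash-style record with a top-level `lastSeenAt`. The `last_evaluated` function and its `provider` argument are illustrative names, and you should check the field locations against your own API responses.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 string as an aware UTC datetime; None stays None."""
    if not value:
        return None
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def last_evaluated(provider: str, record: dict, env: str = "production") -> Optional[datetime]:
    """Return the last-evaluation time from a provider-specific flag record."""
    if provider == "launchdarkly":
        # Assumed shape: record["environments"][env]["lastRequested"]
        raw = record.get("environments", {}).get(env, {}).get("lastRequested")
    elif provider == "unleash":
        # Assumed shape: record["lastSeenAt"]
        raw = record.get("lastSeenAt")
    else:
        raise ValueError(f"unknown provider: {provider}")
    return parse_iso(raw)
```

With one function owning the field names, the 90-day comparison only has to be written once.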

Pro Tip: The “lastRequested” or “lastSeenAt” field is the single highest-signal indicator you have. A flag that hasn’t been evaluated in 90 days is either dead or guarding a code path nobody runs. Both are worth investigating, and both are safe to surface for human review.

Cross-referencing the dashboard against the code

The provider tells you a flag is decided. It does not tell you whether the flag key still appears in your source. You want both halves: a flag that is decided in the dashboard AND still referenced in code is a prime cleanup target. A flag in the dashboard with zero code references is an orphan you can archive outright.

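#!/usr/bin/env bash
# cross-ref.sh: match dashboard flags against code references
set -euo pipefail

python ld_stale.py | cut -f1 > /tmp/stale_keys.txt

while read -r key; do
  # One --include per extension: grep does not expand {a,b} braces, and the
  # quotes stop the shell from doing it. "|| true" keeps a no-match grep
  # (exit status 1) from aborting the script under "set -e -o pipefail".
  count=$( { grep -rI \
               --include='*.ts' --include='*.tsx' --include='*.js' \
               --include='*.go' --include='*.py' \
               -F "\"$key\"" ./src 2>/dev/null || true; } | wc -l)
  if [ "$count" -gt 0 ]; then
    echo "REVIEW  $key  ($count code references)"
  else
    echo "ORPHAN  $key  (archive in dashboard, no code refs)"
  fi
done < /tmp/stale_keys.txt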

This is exactly the kind of mechanical toil that AI and scripts are built to absorb. If you want a broader framework for finding this work, I wrote about identifying and eliminating toil with AI that covers how to spot these repetitive, automatable chores in the first place.
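The shell script only matches double-quoted keys, so a codebase that also writes flag keys in single quotes or backticks will under-count. A Python variant of the same cross-reference can cover all three. Treat this as a sketch: `count_references`, `classify`, and the extension list are illustrative names, and it counts occurrences rather than matching lines.

```python
import re
from pathlib import Path

EXTENSIONS = {".ts", ".tsx", ".js", ".go", ".py"}
QUOTE = "([\"'`])"  # double quote, single quote, or backtick


def count_references(key: str, root: str) -> int:
    """Count quoted occurrences of a flag key in source files under root."""
    # \1 requires the closing quote to match the opening one.
    pattern = re.compile(QUOTE + re.escape(key) + r"\1")
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in EXTENSIONS:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += len(pattern.findall(text))
    return total


def classify(keys, root: str) -> dict:
    """Label each key REVIEW (still referenced) or ORPHAN (no code references)."""
    return {k: ("REVIEW" if count_references(k, root) else "ORPHAN") for k in keys}
```

Because the closing quote must follow the key directly, a key like `enable-x` does not match inside `"enable-x-2"`. Keys built by string concatenation will still slip past any text search, which is one more reason the ORPHAN list gets a human look before anything is archived.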

Asking AI to remove the flag and collapse the dead branch

Once a human has confirmed a flag is decided and safe to remove, the actual refactor is where AI shines. Removing a flag is not just deleting a line. You have to collapse the conditional, keep the winning branch, delete the losing branch, and clean up any now-unused helpers and imports. Done by hand across many files, this is error-prone busywork.
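To make "collapse the conditional" concrete, here is a toy before-and-after. Everything in it is invented for illustration: `flags` is a plain dict standing in for a flag client, and `new_total` and `legacy_total` are hypothetical helpers, not code from a real checkout.

```python
def new_total(cart):
    """Pricing used by the TRUE branch: respects quantity."""
    return sum(item["price"] * item.get("qty", 1) for item in cart)


def legacy_total(cart):
    """Pricing used only by the FALSE branch. Deleted along with the flag."""
    return sum(item["price"] for item in cart)


# Before: the flag doubles the code paths a reader has to follow.
def checkout_total_before(cart, flags):
    if flags.get("enable-new-checkout-v2", False):
        return new_total(cart)
    return legacy_total(cart)


# After: the flag is at 100%, so only the TRUE branch survives. The flags
# argument goes away, and legacy_total has no remaining callers.
def checkout_total_after(cart):
    return new_total(cart)
```

The refactor is correct only if the "after" function behaves exactly like the "before" function with the flag on, which is easy to assert in a test before the old version is removed.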

The prompt I use with a coding assistant such as Cursor or Claude is deliberately narrow and scoped:

The flag "enable-new-checkout-v2" is fully rolled out (always true in prod).
Remove it. For every usage:
- Keep the branch that runs when the flag is TRUE.
- Delete the FALSE branch entirely, including any helpers only it called.
- Remove the flag import/initialization if it becomes unused.
- Do NOT change behavior in the TRUE path.
Show me a diff per file. Do not touch test fixtures yet — list them separately.

The model produces a diff. It is usually 90% right and 10% subtly wrong: it will occasionally keep a variable that is now dead, or miss that the false branch had a side effect the true branch assumed was already done. That 10% is exactly why this is a junior-engineer task, not an autonomous one. You read every line. A tool like our code review dashboard can help here, but a human still signs off.

Pro Tip: Scope the model to one flag at a time and ask for a per-file diff. A PR that removes one flag is reviewable. A PR that removes fifteen flags across forty files is a rubber-stamp waiting to happen, and rubber stamps are how dead-branch side effects reach production.

This is a code change, so treat it like one

Here is the part people skip in their enthusiasm to automate: removing a flag is a code change with real blast radius. It needs the same review, the same staging soak, and the same back-out path as any other change. A few non-negotiables I enforce:

The model opens a PR. It never auto-merges. A human approves.
The blast radius is one flag per PR, so a revert is a clean git revert.
The PR description states the back-out plan explicitly: re-add the flag, or revert the commit.
CI runs the full suite, because the deleted branch may have had coverage the kept branch did not.
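Two of these rules are mechanical enough to check in CI before a human ever opens the PR. The sketch below assumes you can hand it the PR's unified diff, its description, and the list of known flag keys. The function names and the "back-out plan" wording it searches for are hypothetical conventions, not features of any CI system.

```python
import re


def removed_flag_keys(diff_text, known_keys):
    """Return the known flag keys that appear on removed lines of a unified diff."""
    removed = set()
    for line in diff_text.splitlines():
        # Removed lines start with "-"; "---" is the old-file header.
        # (A removed line whose content begins with "--" is skipped too.)
        if not line.startswith("-") or line.startswith("---"):
            continue
        for key in known_keys:
            # Reject matches that are part of a longer identifier or key.
            if re.search(r"(?<![\w-])" + re.escape(key) + r"(?![\w-])", line):
                removed.add(key)
    return removed


def check_pr(diff_text, description, known_keys):
    """Return a list of problems; an empty list means the PR passes the guard."""
    problems = []
    keys = removed_flag_keys(diff_text, known_keys)
    if len(keys) != 1:
        problems.append(f"expected exactly one flag per PR, found {sorted(keys)}")
    if not re.search(r"back-?out plan", description, re.IGNORECASE):
        problems.append("PR description has no back-out plan")
    return problems
```

This guard only narrows what the reviewer has to worry about. It does not replace the approval, and it cannot tell whether the stated back-out plan would actually work.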

If you are wiring automated actions into a pipeline at all, build the approval step in deliberately. I covered the pattern in ChatOps approval gates for AI-suggested actions, and the same principle applies: the AI proposes, a named human disposes, and nothing destructive happens without a confirmation and a way back.

A CI check that flags new long-lived toggles

Cleanup is reactive. The better fix is to stop the debt from accumulating. I added a CI check that fails, or at least warns, when a flag has lived past its expected lifetime. Most providers record a creation date and let you attach tags or a maintainer, so you can enforce a TTL.

import datetime as dt
import sys

from ld_stale import get_flags  # the provider script above, saved as ld_stale.py

MAX_AGE_DAYS = 90

def created_on(flag):
    # LaunchDarkly returns creationDate as Unix epoch milliseconds, not a string.
    seconds = flag["creationDate"] / 1000
    return dt.datetime.fromtimestamp(seconds, tz=dt.timezone.utc).date()

def check(flags):
    today = dt.datetime.now(dt.timezone.utc).date()
    offenders = []
    for f in flags:
        age = (today - created_on(f)).days
        permanent = "permanent" in f.get("tags", [])
        if age > MAX_AGE_DAYS and not permanent:
            offenders.append((f["key"], age))
    return offenders

offenders = check(get_flags())
for key, age in offenders:
    print(f"::warning::Flag '{key}' is {age} days old. Retire it or tag 'permanent'.")

# Make it a hard failure once your team is ready:
# if offenders: sys.exit(1)

I start this as a warning so the team gets used to it, then flip it to a hard failure once the backlog is under control. The permanent tag is the escape hatch for legitimate kill switches, and requiring an explicit tag forces an intentional decision rather than silent drift.
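One way to make that flip without editing the script is to read the mode from the environment. This is only a sketch: `FLAG_TTL_STRICT` is a made-up variable name, and `exit_code` is a hypothetical helper you would call with the list of offending flags.

```python
import os


def exit_code(offenders, strict=None):
    """Return the process exit code for the TTL check.

    Warn-only mode always returns 0. Strict mode returns 1 when any
    flag has outlived its TTL.
    """
    if strict is None:
        # Assumed convention: FLAG_TTL_STRICT=1 turns warnings into failures.
        strict = os.environ.get("FLAG_TTL_STRICT", "0") == "1"
    return 1 if (offenders and strict) else 0
```

Ending the check with `sys.exit(exit_code(offenders))` lets the pipeline configuration, rather than a code change, decide when the team is ready for the hard failure.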

If you want ready-made prompts for this kind of refactor and review work, the prompt library and the curated prompt packs have scoped templates you can adapt instead of writing every instruction from scratch.

The cadence that actually works

I run the stale-flag report weekly and triage it in fifteen minutes. Anything clearly decided gets a one-flag cleanup PR from the assistant, which I review the same day. The CI TTL check keeps new debt from piling up. The whole loop is mostly automated, but the one step that changes what ships, the merge, has a human in front of it. More patterns like this live under automation.

Feature flags are a tool for moving fast safely. Let them rot and they become the opposite: a tax on every future change. AI makes the cleanup cheap enough to actually do, as long as you remember what it is. It is a fast, tireless junior engineer that should never hold prod credentials, never auto-merge, and never make the final call on what is safe to delete. Scope it tight, gate the destructive steps, keep a back-out path, and let it grind through the toil while you keep the judgment for yourself.