Debugging a Flaky Automation Script with AI Step by Step

A script that fails every time is easy. A script that fails one run in ten, with a different error each time, is a special kind of misery — and it’s usually a race condition, a timing assumption, or a resource that isn’t always ready. I worked through exactly this with an AI assistant recently, and it was a genuinely good debugging partner: fast at generating hypotheses and instrumentation, the way a sharp junior engineer is. But it confidently proposed wrong root causes too, and the actual fix only landed because a human ran the experiments and read the evidence before changing anything on prod.

Give the AI the failures, not just the code

Flaky bugs hide in the gap between code and runtime. So I don’t just paste the script — I paste several failure logs from different runs and ask: “These are intermittent failures of the same script. What hypotheses explain a one-in-ten failure rate with these varying errors?”

The AI read across the logs and proposed four hypotheses: a race between two background jobs, a network timeout too aggressive for cold connections, a temp file collision, and a dependency on a service that wasn’t always up yet. That ranked list of hypotheses is exactly what a fast reviewer gives you — breadth fast. The ranking was wrong (it put the race first; the real cause was the timeout), but the right answer was on the list.

Add instrumentation before changing logic

The mistake is to start “fixing” based on a hypothesis. With flaky bugs you instrument first, then let evidence pick the cause. I asked the AI to add timing and state logging without changing behavior.

log() { printf '[%s] %s\n' "$(date -u +%H:%M:%S.%3N)" "$*" >&2; }

log "starting fetch"
start=$(date +%s.%N)
curl -fsS --max-time 5 "$url" -o "$out"
end=$(date +%s.%N)
log "fetch took $(echo "$end - $start" | bc)s"

The timestamped logs immediately showed that on slow runs the curl was hitting its 5-second --max-time on a cold connection. The AI’s instrumentation was clean and it suggested the right things to measure — but it was my reading of the timing data, not its hypothesis ranking, that identified the timeout. Instrument, then read.

Pro Tip: for flaky bugs, ask the AI to add logging that records the timing and the order of operations, then run the script in a loop until it fails. for i in $(seq 50); do ./script.sh || break; done reproduces a one-in-ten bug fast, and the captured logs from the failing run are worth more than any amount of static analysis.

Reproduce it deterministically

A bug you can’t reproduce on demand isn’t fixed, it’s just quiet. I work with the AI to find a way to force the failure — adding artificial latency, constraining a resource, or running under load.

# Temporary fault injection to force the suspected timeout
import time, random
def slow_network_simulation():
    time.sleep(random.uniform(4.5, 6.0))  # straddle the 5s timeout

Injecting latency that straddles the timeout turned a one-in-ten failure into a reliable repro. Now I could test a fix and know it worked. The AI is good at suggesting fault-injection points; I remove every line of it before the script goes anywhere near prod, because injected faults are debugging scaffolding, not shippable code.

Fix the root cause, not the symptom

The AI’s first proposed fix was “increase the retry count.” That’s symptom-patching — it would mask the cold-connection timeout without addressing it. The real fix was a longer connect timeout plus a bounded retry with backoff for genuine transient failures.

import httpx

client = httpx.Client(
    timeout=httpx.Timeout(connect=10.0, read=15.0, write=10.0, pool=5.0),
)

Separating connect from read timeout is the correct fix for a cold-connection problem; a blanket retry just makes the script slower and still flaky. I pushed back on the AI’s first answer, and once I described the timing evidence, its revised fix was right. You have to drive it with the evidence — it doesn’t have the runtime data, you do. The retry-versus-fix distinction is covered in retry and backoff patterns.

Hunt the race condition properly

For the suspected race between background jobs, the AI helped reason about ordering, but the fix is a real synchronization primitive, not a sleep.

# Wrong: hope the other job finished
sleep 2

# Right: wait on the actual condition
until [[ -f "$ready_marker" ]]; do sleep 0.2; done

A sleep is a guess that’s too short under load and too slow otherwise — the AI loves to suggest it. Waiting on an actual readiness marker is deterministic. For shared-resource contention, the proper tool is locking, covered in file locking and signal handling.

Confirm the fix under the conditions that broke it

A flaky fix has to be proven under the same stress that exposed the bug. I run the fixed script through the fault-injection loop dozens of times and require a clean streak before I believe it.

fails=0
for i in $(seq 100); do ./script.sh || ((fails++)); done
echo "$fails failures in 100 runs"

Zero failures across a hundred stressed runs is the bar. The AI can generate this harness, but the decision that the bug is actually dead is mine — I’ve watched it declare victory after three runs of a one-in-ten bug. For wiring these failures into alerting so the next flake gets caught early, the monitoring alerts dashboard and incident response dashboard are where the signals should land.

Keep prod data and secrets out of the loop

The failure logs I pasted for analysis were scrubbed — real hostnames, tokens, and customer identifiers replaced with placeholders. The AI debugs the logic and timing from sanitized evidence; it never needs the real credential or the real customer record, and a debugging transcript is not where production data should live.

Conclusion

AI is a strong debugging partner for flaky automation: it generates hypotheses across logs fast, writes clean instrumentation, and suggests fault-injection points — all at junior-engineer speed. It also ranks the wrong cause first, reaches for symptom patches like blanket retries and sleep, and declares victory too early. Instrument before changing logic, reproduce deterministically, fix the root cause with the runtime evidence in hand, and prove the fix under stress. Strip every debugging scaffold and keep prod data out of the prompt before it runs on prod. The speed is the AI’s; the read of the evidence is yours. More debugging prompts live in the prompts library, the prompt packs, and the bash and Python automation category.