Using AI to Add Tests to a Crufty Python Automation Script

Every ops team has one: a 600-line Python script called something like sync_stuff.py that runs in cron, that nobody fully understands, and that everyone is terrified to touch. Mine pulled inventory from three APIs and wrote to a database. Zero tests. I used an AI assistant to put a test harness around it before refactoring, and the workflow worked surprisingly well — as long as I remembered the AI was a fast junior who’d never seen our actual data and shouldn’t be trusted with prod credentials.

Don’t refactor first — characterize first

The instinct is to ask the AI to “clean this up.” Resist it. With no tests, any refactor is a blind change. Instead I asked: “Write characterization tests that capture the current behavior of this script exactly as it is, including the ugly parts. Do not fix any bugs.” Characterization tests pin down what the code does today, so when you refactor, the tests scream if behavior changes.

The AI is good at this because it reads every branch. It produced tests for code paths I’d forgotten existed, including a fallback that silently swapped to a stale cache file. I had no idea that path existed. That alone justified the exercise.

Find the dependency seams

A crufty script is hard to test because it talks to the outside world directly — network calls and DB writes buried mid-function. I ask the AI: “Identify every external dependency in this script (network, filesystem, database, time, randomness) and tell me where I’d inject a fake.”

For my script it listed three requests.get calls, a psycopg2.connect, a datetime.now(), and a hardcoded path read. Those are the seams. The AI suggested wrapping each in a small function so tests can monkeypatch them.

def fetch_inventory(client, source_url):
    resp = client.get(source_url, timeout=10)
    resp.raise_for_status()
    return resp.json()

By passing client in rather than calling requests directly, the test can hand it a fake. The AI proposed this refactor, but I reviewed it carefully — its first version changed the timeout from the original 30 seconds to 10, a subtle behavior change smuggled into a “pure” refactor. Caught in review. This is the recurring lesson: the AI’s drafts drift, and you are the one who notices.

Use pytest fixtures and monkeypatch

Once the seams exist, the AI writes solid pytest scaffolding fast.

import pytest

@pytest.fixture
def fake_client():
    class FakeResponse:
        def __init__(self, payload):
            self._payload = payload
        def raise_for_status(self):
            pass
        def json(self):
            return self._payload

    class FakeClient:
        def __init__(self, payload):
            self.payload = payload
        def get(self, url, timeout=None):
            return FakeResponse(self.payload)

    return FakeClient

def test_fetch_inventory_returns_parsed_json(fake_client):
    client = fake_client({"items": [{"sku": "A1"}]})
    result = fetch_inventory(client, "http://example/inv")
    assert result == {"items": [{"sku": "A1"}]}

Pro Tip: ask the AI to generate tests from real (sanitized) sample payloads, not invented ones. Hand it a scrubbed JSON response with fake SKUs and never the real production export — AI-invented fixtures often miss the weird nulls and trailing fields that break parsing in production.

Freeze time and randomness explicitly

The datetime.now() call made one function untestable because output depended on the run time. The AI suggested injecting the clock.

def build_report(rows, now=None):
    now = now or datetime.datetime.utcnow()
    return {"generated_at": now.isoformat(), "count": len(rows)}

def test_build_report_uses_injected_time():
    fixed = datetime.datetime(2026, 6, 16, 12, 0, 0)
    report = build_report([1, 2, 3], now=fixed)
    assert report["generated_at"] == "2026-06-16T12:00:00"

This pattern — default-to-real, allow-injection — keeps production behavior identical while making tests deterministic. It pairs well with pytest and bats testing patterns.

Mock the database, don’t hit it

The riskiest part was the DB write. I never let the AI see real connection strings. I replaced them with __DB_DSN__ in the snippet I shared, and asked it to make the write path testable.

def upsert_items(conn, items):
    with conn.cursor() as cur:
        for item in items:
            cur.execute(
                "INSERT INTO inventory (sku, qty) VALUES (%s, %s) "
                "ON CONFLICT (sku) DO UPDATE SET qty = EXCLUDED.qty",
                (item["sku"], item["qty"]),
            )
    conn.commit()

In tests I pass a fake connection that records calls and asserts the SQL ran once per item. The AI wrote the fake, but I reviewed the actual SQL by hand — it had initially generated a plain INSERT with no ON CONFLICT, which would have broken idempotency on re-run. See writing idempotent automation scripts for why that matters.

Wire it into CI and watch coverage climb

With a handful of characterization tests green, I added a pytest step to CI and turned on coverage. The AI is genuinely helpful for filling gaps: “Here’s my coverage report showing lines 220–260 uncovered — write tests for that branch.” It targets the exact uncovered code.

I cap how much I trust it by always running the suite myself and spot-checking that a test actually fails when I break the code it covers. A test that passes no matter what is worse than no test, and AI occasionally writes those.

Now you can refactor safely

Only after the harness was green did I let the AI clean up the cruft — extracting functions, removing the dead stale-cache path, tightening error handling. Every change ran against the test suite, and the suite caught two regressions immediately. That’s the payoff: tests first, refactor second, AI accelerating both while a human reviews every diff.

For reusable prompts that drive this workflow, check the prompts library and the bash and Python automation category. The code review dashboard is a useful final gate before merging the refactor.

Conclusion

AI made a terrifying legacy script tractable by writing characterization tests faster than I could, exposing hidden code paths, and scaffolding clean pytest fixtures. But it also quietly changed a timeout, dropped an ON CONFLICT, and invented fixtures that didn’t match reality. Treat it as a fast junior engineer: brilliant for volume, in need of constant review, and never trusted with the real database. Tests first, then refactor — with a human reading every line.