AI-Assisted CSV and Spreadsheet Wrangling in Python for Ops

Half the “automation” requests I get aren’t glamorous infrastructure work — they’re somebody handing me a 12,000-row CSV exported from a billing console or a monitoring tool and asking, “can you tell me which accounts spiked last month?” The data is messy: inconsistent date formats, a stray BOM at the front of the file, numbers stored as strings with currency symbols, duplicate rows from a double-export. Cleaning and joining this stuff is tedious, well-trodden Python, and it’s a sweet spot for an AI assistant that has written the same pandas clean-up a thousand times.

The frame I keep: the AI is a fast junior analyst. It writes the transform; I’m the one who confirms the totals are real before anyone makes a decision from them.

Start with the standard library

Before reaching for pandas, remember that Python’s built-in csv module handles a lot, and it’s zero-dependency — important when the script has to run on a locked-down jump host.

import csv

with open("usage.csv", newline="", encoding="utf-8-sig") as fh:
    reader = csv.DictReader(fh)
    total = sum(float(row["cost"]) for row in reader if row["cost"])
print(f"Total: ${total:,.2f}")

The detail that bites everyone is encoding="utf-8-sig" — that -sig strips the byte-order mark Excel loves to prepend, the thing that makes your first column header come out as account_id and break every lookup. When an AI drafts CSV code, check the encoding argument; if it just says utf-8, ask it specifically about BOM handling. It’s the single most common silent CSV bug.

Let pandas do the heavy joins

When you need to join two exports — usage by account against a name lookup — pandas earns its dependency:

import pandas as pd

usage = pd.read_csv("usage.csv", encoding="utf-8-sig")
accounts = pd.read_csv("accounts.csv")

report = usage.merge(accounts, on="account_id", how="left")
top = (
    report.groupby("account_name")["cost"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top.to_string())

A prompt that gets you most of the way there:

I have two CSVs: usage.csv with columns account_id, date, cost and accounts.csv with account_id, account_name. Write pandas to produce the top 10 accounts by total cost. Use a left join and explain what happens to usage rows with no matching account.

That last clause matters. A left join keeps unmatched usage rows with a null name, an inner join silently drops them — and “silently dropped rows” is exactly how a report ends up understating a total by 8% and nobody notices. Make the AI tell you which join it chose and why.

The numbers-as-strings trap

The most insidious CSV problem is numeric columns that arrive as text — "$1,204.50" or "1.204,50" in European locale. Operations on them either error loudly (good) or, worse, concatenate as strings and give a plausible-looking wrong answer.

# Clean a currency column into a real float
report["cost"] = (
    report["cost"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)

When you review AI-generated cleaning, sanity-check the result with df["cost"].describe() — if the max is implausibly large or the count dropped, the parse went sideways. Trust the assistant to write the transform; trust your own eyes on the summary stats.

Pro Tip: After any AI-generated transform, assert your invariants in code, not in your head. assert len(report) == len(usage), "join changed row count" turns a silent data-loss bug into a loud crash. A report you can’t make assertions about is a report you can’t trust.

Reshaping for the audience

The cleaned data usually needs two output forms: a machine-readable CSV for the next pipeline and a human-readable summary for the Slack message. Ask for both.

# Machine: full detail back out
report.to_csv("report_clean.csv", index=False)

# Human: a small pivot
pivot = report.pivot_table(
    index="account_name", values="cost", aggfunc="sum"
).round(2)
print(pivot.sort_values("cost", ascending=False).head(5).to_markdown())

to_markdown() gives you a table you can paste straight into a ticket or chat. index=False on the CSV write keeps pandas from prepending a meaningless row-number column — a small thing the AI sometimes forgets, and the kind of detail you catch by reading the output rather than the code.

Scaling down when pandas is overkill

If the file streams in once and you only need a running tally, pandas loads the whole thing into memory for no reason. For a multi-gigabyte log-style CSV, ask the AI for a streaming version with the csv module and a collections.Counter:

import csv
from collections import defaultdict

totals = defaultdict(float)
with open("huge_usage.csv", newline="", encoding="utf-8-sig") as fh:
    for row in csv.DictReader(fh):
        totals[row["account_id"]] += float(row["cost"] or 0)

Constant memory, one pass, no dependency. The judgment call — whole-file pandas versus streaming stdlib — is yours; the AI will happily write either, so tell it the file size and let it pick, then confirm the choice fits the box it’ll run on.

Keep the live data out of the prompt

Here’s the part people get casual about: these CSVs often contain customer account names, internal IDs, sometimes billing amounts that are genuinely sensitive. Do not paste the real export into a chat to get help cleaning it. Paste the header row and three fabricated sample rows that match the shape. The AI needs the structure, not your customers’ data. Same rule as secrets: the model gets the schema, never the payload.

Where to go from here

CSV wrangling is the on-ramp to richer reporting, and the same draft-then-verify loop applies all the way up. I keep the transforms I trust in a prompt workspace and reuse them across the recurring reports. The prompt patterns are in our prompt library, with deeper bundles in the prompt packs, and adjacent data-munging techniques live in the Bash and Python automation category. For drafting I lean on Claude or ChatGPT, pasting only sanitized samples.

The bottom line

A spreadsheet transform is dangerous precisely because a wrong answer looks exactly like a right one — there’s no stack trace for “your join dropped 800 rows.” So let the AI write the pandas fast, but you own the verification: check the encoding, name the join, assert the row counts, eyeball the summary stats, and never feed the model the real data. Fast junior analyst, human who signs off on the numbers. That’s the deal.

AI-Assisted CSV and Spreadsheet Wrangling in Python for Ops Reports