Online Schema Changes With gh-ost and AI
A blocking ALTER on a big table is a self-inflicted outage. Here's how I use AI to plan a safe gh-ost migration and verify the cutover before it touches prod.
- #mysql
- #ai
- #gh-ost
- #migrations
- #schema
The first time someone runs ALTER TABLE orders ADD COLUMN ... on a 200-million-row table in production, they learn — usually during an incident — that a default ALTER can lock the table or block writes for the entire duration of the rebuild. On a big table that’s minutes to hours of downtime. Online schema change tools like gh-ost and pt-online-schema-change exist precisely to avoid this: they build a shadow copy, backfill it in throttled chunks, keep it in sync via binlog, and swap it in with a near-instant rename.
The tools are reliable but they have a lot of flags, and the failure modes — replica lag, foreign keys, triggers, disk space — are easy to overlook. I use AI to draft the migration plan and the exact gh-ost command, sanity-check it against my table’s specifics, and walk through the cutover and rollback steps. Then I dry-run it on a replica. The model plans; I verify on real infrastructure before the production run.
Understand why the naive ALTER is dangerous
Some ALTER operations in modern MySQL are genuinely online (ALGORITHM=INSTANT for adding a column at the end, in many cases). Many are not. Before reaching for a tool, I check whether the change can be instant:
ALTER TABLE orders
ADD COLUMN fulfillment_region VARCHAR(32) NULL,
ALGORITHM=INSTANT, LOCK=NONE;
If MySQL accepts that, you’re done in milliseconds. If it errors because the operation can’t be instant (adding an index, changing a column type, adding a column mid-table on older versions), that’s your signal to use gh-ost. I ask AI up front:
For MySQL 8.0, can this specific ALTER use ALGORITHM=INSTANT or INPLACE without blocking, or does it require a copy? Here’s the statement and the table version. If it requires a copy, say so explicitly.
It’s usually right, but I confirm with the documentation for my exact version, because the list of instant-eligible operations has changed across releases.
Have AI draft the gh-ost command
gh-ost reads the binlog rather than using triggers, which makes it lighter on the primary. The command has real subtlety around throttling and where it connects. I describe my topology and ask for the command:
Draft a
gh-ostcommand to add an index onorders(customer_id, status). The table has 180M rows, we have one replica, and writes peak at 4,000/min. Connect gh-ost to the replica for binlog reading to reduce primary load. Include throttling on replica lag and a cut-over that requires my manual approval. Explain each flag.
A solid draft:
gh-ost \
--host=replica.db.internal \
--user=ghost --password=... \
--database=shop \
--table=orders \
--alter="ADD INDEX idx_cust_status (customer_id, status)" \
--max-load=Threads_running=40 \
--critical-load=Threads_running=80 \
--max-lag-millis=1500 \
--throttle-control-replicas=replica.db.internal \
--chunk-size=1000 \
--cut-over=atomic \
--postpone-cut-over-flag-file=/tmp/ghost.postpone \
--exact-rowcount \
--verbose \
--execute
The --postpone-cut-over-flag-file is the flag I care most about: with that file present, gh-ost does all the copying and waits, holding off the final rename until I delete the file. That means the disruptive part happens when I’m watching, not at 3 a.m. --max-lag-millis and --throttle-control-replicas make it back off automatically when the replica falls behind, which protects your read traffic.
Check the gotchas AI must be told about
gh-ost has hard requirements the model won’t know unless you tell it:
- No foreign keys.
gh-ostdoesn’t support tables with foreign keys (in or out). Ifordershas FKs, the migration will refuse. I check withSHOW CREATE TABLEand tell the model, so it can recommendpt-online-schema-changeinstead, which handles FKs (with caveats). - No triggers.
gh-ostwon’t run on a table that already has triggers.pt-oscuses triggers itself, so the two tools have opposite constraints. - Disk space. A shadow copy needs roughly the size of the table plus the new index free on disk. I confirm
df -hhas room before starting. AI can’t see your disk. - A unique key.
gh-ostneeds a usable unique or primary key to chunk on.
I paste the SHOW CREATE TABLE and ask the model to flag every one of these against the actual schema. It catches them reliably when it has the real DDL. The MySQL category has more on choosing between gh-ost and pt-osc.
Dry-run, then cut over deliberately
The single most important step: run it without --execute first. gh-ost validates the whole plan and reports what it would do, including row count and estimated duration, without changing anything. Only after a clean dry-run on a replica do I run the real thing against production, with the postpone flag set.
When the copy finishes and gh-ost is waiting at the postpone flag, I verify the shadow table looks right and then trigger the atomic cut-over:
rm /tmp/ghost.postpone
The rename is near-instant. If anything looks wrong before that, I let gh-ost keep waiting or kill it — nothing has been swapped yet, so there’s no rollback to perform. That’s the beauty of the postpone flag: the point of no return is a deliberate action I take, not something that happens on a timer.
Keep the plan reproducible
Schema migrations on big tables come up often enough that I templatized the planning prompt — topology, row count, write rate, and the gotcha checklist all go in every time. There’s a starter set at /prompts/. And I always have AI write out the rollback story explicitly before I start, even though gh-ost’s pre-cutover state is its own rollback, because the discipline of stating it catches plans that don’t actually have a safe abort point.
The pattern holds: AI drafts a genuinely good gh-ost command and flags the schema-specific traps faster than I’d assemble the flags by hand. What it can’t do is see your disk space, know your foreign keys without the DDL, or run the dry-run that proves the plan is valid. So it plans, I verify on a replica, and the production cut-over happens on my schedule with the postpone flag holding the line. That turns a feared, outage-prone operation into a routine, observable one.
Download the Free 500-Prompt DevOps AI Toolkit
500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.
- 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
- Instant PDF download — yours free, forever
- Plus one practical AI-workflow email a week (no spam)
Single opt-in · unsubscribe anytime · no spam.