Skip to content
CloudOps
Newsletter
All guides
AI for MySQL By James Joyner IV · · 11 min read

Online Schema Changes With gh-ost and AI

A blocking ALTER on a big table is a self-inflicted outage. Here's how I use AI to plan a safe gh-ost migration and verify the cutover before it touches prod.

  • #mysql
  • #ai
  • #gh-ost
  • #migrations
  • #schema

The first time someone runs ALTER TABLE orders ADD COLUMN ... on a 200-million-row table in production, they learn — usually during an incident — that a default ALTER can lock the table or block writes for the entire duration of the rebuild. On a big table that’s minutes to hours of downtime. Online schema change tools like gh-ost and pt-online-schema-change exist precisely to avoid this: they build a shadow copy, backfill it in throttled chunks, keep it in sync via binlog, and swap it in with a near-instant rename.

The tools are reliable but they have a lot of flags, and the failure modes — replica lag, foreign keys, triggers, disk space — are easy to overlook. I use AI to draft the migration plan and the exact gh-ost command, sanity-check it against my table’s specifics, and walk through the cutover and rollback steps. Then I dry-run it on a replica. The model plans; I verify on real infrastructure before the production run.

Understand why the naive ALTER is dangerous

Some ALTER operations in modern MySQL are genuinely online (ALGORITHM=INSTANT for adding a column at the end, in many cases). Many are not. Before reaching for a tool, I check whether the change can be instant:

ALTER TABLE orders
  ADD COLUMN fulfillment_region VARCHAR(32) NULL,
  ALGORITHM=INSTANT, LOCK=NONE;

If MySQL accepts that, you’re done in milliseconds. If it errors because the operation can’t be instant (adding an index, changing a column type, adding a column mid-table on older versions), that’s your signal to use gh-ost. I ask AI up front:

For MySQL 8.0, can this specific ALTER use ALGORITHM=INSTANT or INPLACE without blocking, or does it require a copy? Here’s the statement and the table version. If it requires a copy, say so explicitly.

It’s usually right, but I confirm with the documentation for my exact version, because the list of instant-eligible operations has changed across releases.

Have AI draft the gh-ost command

gh-ost reads the binlog rather than using triggers, which makes it lighter on the primary. The command has real subtlety around throttling and where it connects. I describe my topology and ask for the command:

Draft a gh-ost command to add an index on orders(customer_id, status). The table has 180M rows, we have one replica, and writes peak at 4,000/min. Connect gh-ost to the replica for binlog reading to reduce primary load. Include throttling on replica lag and a cut-over that requires my manual approval. Explain each flag.

A solid draft:

gh-ost \
  --host=replica.db.internal \
  --user=ghost --password=... \
  --database=shop \
  --table=orders \
  --alter="ADD INDEX idx_cust_status (customer_id, status)" \
  --max-load=Threads_running=40 \
  --critical-load=Threads_running=80 \
  --max-lag-millis=1500 \
  --throttle-control-replicas=replica.db.internal \
  --chunk-size=1000 \
  --cut-over=atomic \
  --postpone-cut-over-flag-file=/tmp/ghost.postpone \
  --exact-rowcount \
  --verbose \
  --execute

The --postpone-cut-over-flag-file is the flag I care most about: with that file present, gh-ost does all the copying and waits, holding off the final rename until I delete the file. That means the disruptive part happens when I’m watching, not at 3 a.m. --max-lag-millis and --throttle-control-replicas make it back off automatically when the replica falls behind, which protects your read traffic.

Check the gotchas AI must be told about

gh-ost has hard requirements the model won’t know unless you tell it:

  • No foreign keys. gh-ost doesn’t support tables with foreign keys (in or out). If orders has FKs, the migration will refuse. I check with SHOW CREATE TABLE and tell the model, so it can recommend pt-online-schema-change instead, which handles FKs (with caveats).
  • No triggers. gh-ost won’t run on a table that already has triggers. pt-osc uses triggers itself, so the two tools have opposite constraints.
  • Disk space. A shadow copy needs roughly the size of the table plus the new index free on disk. I confirm df -h has room before starting. AI can’t see your disk.
  • A unique key. gh-ost needs a usable unique or primary key to chunk on.

I paste the SHOW CREATE TABLE and ask the model to flag every one of these against the actual schema. It catches them reliably when it has the real DDL. The MySQL category has more on choosing between gh-ost and pt-osc.

Dry-run, then cut over deliberately

The single most important step: run it without --execute first. gh-ost validates the whole plan and reports what it would do, including row count and estimated duration, without changing anything. Only after a clean dry-run on a replica do I run the real thing against production, with the postpone flag set.

When the copy finishes and gh-ost is waiting at the postpone flag, I verify the shadow table looks right and then trigger the atomic cut-over:

rm /tmp/ghost.postpone

The rename is near-instant. If anything looks wrong before that, I let gh-ost keep waiting or kill it — nothing has been swapped yet, so there’s no rollback to perform. That’s the beauty of the postpone flag: the point of no return is a deliberate action I take, not something that happens on a timer.

Keep the plan reproducible

Schema migrations on big tables come up often enough that I templatized the planning prompt — topology, row count, write rate, and the gotcha checklist all go in every time. There’s a starter set at /prompts/. And I always have AI write out the rollback story explicitly before I start, even though gh-ost’s pre-cutover state is its own rollback, because the discipline of stating it catches plans that don’t actually have a safe abort point.

The pattern holds: AI drafts a genuinely good gh-ost command and flags the schema-specific traps faster than I’d assemble the flags by hand. What it can’t do is see your disk space, know your foreign keys without the DDL, or run the dry-run that proves the plan is valid. So it plans, I verify on a replica, and the production cut-over happens on my schedule with the postpone flag holding the line. That turns a feared, outage-prone operation into a routine, observable one.

Free download · 368-page PDF

Download the Free 500-Prompt DevOps AI Toolkit

500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.

  • 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
  • Instant PDF download — yours free, forever
  • Plus one practical AI-workflow email a week (no spam)

Single opt-in · unsubscribe anytime · no spam.