Automating Database Schema Migrations Safely With AI

The first time I let an AI write a database migration, it handed me a single line that would have taken a 40-million-row table offline for nine minutes during peak traffic. The SQL was correct. It was also a loaded gun. That migration never ran in production, because the AI was never anywhere near production. It produced a draft, I read it, a clone rejected it, and a human approved the rewrite. That sequence is the whole job.

I treat AI for schema changes the way I treat a sharp junior engineer: fast, tireless, occasionally brilliant, and absolutely not allowed to touch prod credentials or click the button. The model drafts. CI gates. A human owns the decision. Everything below is how I wire that up so the speed is real and the blast radius is small.

The migration that locks prod is the one that “looked fine”

The dangerous migrations rarely look dangerous. Ask any model for “add a NOT NULL column with a default to the orders table” and on older Postgres or most MySQL configurations you get:

ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending';

On a small table, instant. On a large one, depending on engine and version, this can rewrite the whole table while holding an ACCESS EXCLUSIVE lock, blocking every read and write until it finishes. The AI didn’t lie to you. It answered the question you asked, not the question you meant: “change this schema without taking the application down.”

So I never ask for the change. I ask for the change plus its lock profile. My standing prompt to Claude or ChatGPT for any DDL is:

Generate this migration for PostgreSQL 15. For every statement, state the lock it acquires, whether it rewrites the table, and the expected duration on a table with 40M rows. If any statement holds ACCESS EXCLUSIVE for more than a few milliseconds, rewrite it as a non-blocking sequence.

That last sentence turns a one-liner into the safe multi-step version, which for Postgres looks like this:

-- Step 1: add the column, nullable, no default rewrite
ALTER TABLE orders ADD COLUMN status text;

-- Step 2: set the default for *future* rows only (cheap, metadata-only)
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';

-- Step 3: backfill in batches, outside a single giant transaction
UPDATE orders SET status = 'pending'
WHERE status IS NULL AND id BETWEEN 1 AND 100000;
-- ...repeat in batches via the app or a loop...

-- Step 4: only after backfill completes, enforce the constraint
ALTER TABLE orders
  ADD CONSTRAINT orders_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

NOT VALID followed by VALIDATE is the trick: it adds the constraint without scanning the whole table under a heavy lock, then validates with only a SHARE UPDATE EXCLUSIVE lock that lets writes continue. The AI knows this pattern. It just won’t volunteer it unless you make lock impact part of the spec.

Expand and contract is the only pattern that survives a rollback

If a migration and the code that depends on it must deploy atomically, you have already lost. The schema and the app deploy at different times, replicas lag, and rollbacks happen. The pattern that absorbs all of that is expand/contract (also called parallel change).

You never rename or drop in one shot. You expand the schema to support old and new, migrate the data and the reads, then contract in a later deploy once nothing references the old shape.

Say I’m splitting a full_name column into first_name and last_name.

Expand (deploy 1, additive, safe to roll back):

ALTER TABLE users ADD COLUMN first_name text;
ALTER TABLE users ADD COLUMN last_name text;
-- backfill in batches, app dual-writes both old and new columns

Migrate (deploy 2): application reads from the new columns, still writes both.

Contract (deploy 3, only after the old column is provably unused):

ALTER TABLE users DROP COLUMN full_name;

Each step is independently reversible because no step both adds the new world and destroys the old one. This is the database cousin of writing idempotent automation scripts: every stage can run twice, and any stage can be abandoned without stranding the system in a broken half-state.

Pro Tip: Make the AI tag every migration as additive, backfill, or destructive. I gate on that tag in CI — anything destructive requires a named human approver and a confirmed backup snapshot before it can merge. Additive changes flow fast; destructive ones slow down on purpose.

Indexes are where “it’s just one line” goes to die

Building an index on a busy table with a plain CREATE INDEX takes an ACCESS EXCLUSIVE-adjacent lock that blocks writes for the entire build. On a large table that is an outage. Postgres gives you the escape hatch:

-- Builds the index without blocking writes. Cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

CONCURRENTLY is also the one place the transaction-wrapping rule I describe below breaks: you cannot run it inside a transaction, and if it fails it leaves an INVALID index behind that you must drop and rebuild. AI tools constantly forget both facts. So my index prompt is explicit:

Use CREATE INDEX CONCURRENTLY. Mark this migration as non-transactional for the migration tool. Include the cleanup statement to drop the index if the build fails, and include the matching DROP INDEX CONCURRENTLY for the down migration.

Which gets me a paired down migration the model would otherwise skip:

DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status;

Wire the gates into the migration tool, not into good intentions

The safety has to live in tooling, because intentions don’t survive a Friday afternoon. Pick your migration framework and let it own ordering, versioning, and the up/down pair. With Flyway I keep undo scripts beside every versioned migration:

sql/
  V7__add_status_to_orders.sql
  U7__add_status_to_orders.sql   # the explicit undo

# flyway.conf
flyway.url=jdbc:postgresql://localhost:5432/app_clone
flyway.locations=filesystem:sql
flyway.cleanDisabled=true          # never let an automated run wipe a schema
flyway.outOfOrder=false
flyway.validateMigrationNaming=true

If you live in Python, Alembic generates the up/down pair as code, which is the structure you want the AI filling in:

def upgrade() -> None:
    op.add_column("orders", sa.Column("status", sa.Text(), nullable=True))
    op.alter_column("orders", "status", server_default="pending")

def downgrade() -> None:
    op.drop_column("orders", "status")

The downgrade() function is not optional and is not boilerplate. When I ask the AI to write a migration, the explicit instruction is: write the down migration first, and make it exactly reverse the up. A down path you wrote after the fact, under pressure, at 2 a.m., is not a down path. Knowing when to actually pull that lever is its own discipline, which I get into in knowing when to roll back your automation.

Dry-run on a clone before anything touches prod

No migration earns a production run until it has run cleanly against a restored copy of production data, not an empty schema. A migration that’s instant on an empty dev database can lock for minutes on real row counts. The clone is where lock duration becomes a real number instead of an AI’s guess.

A CI step that does this, end to end:

# .github/workflows/migrate.yml
jobs:
  migration-dry-run:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: ci
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4

      - name: Restore prod-shaped snapshot (sanitized, no PII)
        run: pg_restore -d "$CLONE_URL" ./fixtures/prod_snapshot.dump

      - name: Run migrations against the clone
        run: flyway migrate

      - name: Verify the down path actually reverses
        run: flyway undo

      - name: Re-apply to prove forward path is repeatable
        run: flyway migrate

      - name: Lint for blocking operations
        run: |
          # squawk flags ACCESS EXCLUSIVE locks, table rewrites, etc.
          npx squawk sql/V*.sql

Squawk is the deterministic backstop. The AI can claim a statement is non-blocking; squawk checks the actual rule set and fails the build if it isn’t. I trust the linter over the language model every time. The dry-run is non-negotiable for the same reason I argue for simulation before any automated action: you find out the cost before the customer does.

Pro Tip: Restore a sanitized snapshot, not raw prod. Strip PII in the dump pipeline so your CI clone has production’s row counts and data distribution without production’s liability. The lock behavior depends on volume and cardinality, not on the actual values.

Transactions and backups: the floor you never go below

Two rules I don’t let AI or anyone talk me out of.

Wrap every migration in a transaction so a failure rolls the whole thing back instead of leaving the schema half-changed. Most tools do this per file when the statements allow it — the exceptions are CREATE INDEX CONCURRENTLY and a few others, which is exactly why you flag those as non-transactional explicitly rather than discovering it mid-deploy.

BEGIN;
ALTER TABLE orders ADD COLUMN status text;
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
COMMIT;

Take a backup or snapshot immediately before any destructive DDL. A DROP COLUMN or DROP TABLE has no undo that brings the data back — the down migration recreates the column, empty. The only real back-out path for destructive changes is the snapshot. So the snapshot is a gate, enforced in the pipeline, not a checkbox in a runbook.

This is also the bright line on credentials. The AI drafts SQL, the linter inspects it, CI runs it against a clone with throwaway creds. The human approves the merge, and the production apply runs through your deploy system with credentials the model has never seen. The model is a junior engineer with a keyboard and no keys to the building. If a generated migration ever does cause trouble in prod, that’s an incident — handle it through your incident-response workflow, not by improvising fixes against the primary.

Make the review repeatable

The review itself benefits from a standard prompt rather than ad-hoc questions. I keep a saved migration-review prompt — refine it in the prompt workspace, pull battle-tested versions from prompt packs, and route the diff through automated code review before a human signs off. The checklist the AI runs every time: lock type per statement, table rewrite yes/no, presence and correctness of the down migration, destructive-change tag, and whether a backup gate applies.

Conclusion

AI made my migrations faster, not riskier, because I never let the speed reach production unmediated. The model drafts the expand/contract sequence, the concurrent index, the reversible down path. The linter and the clone catch what the model misses. The transaction wrap and the pre-DDL backup are the floor. And a human reads the diff and owns the merge. Treat the model as a quick junior engineer, keep the prod credentials in human hands, and you get the throughput without ever betting the database on a confident paragraph of SQL. For more in this vein, the automation category collects the rest of how I think about gating fast tools.