Skip to content
CloudOps
Newsletter
All guides
AI for Postgres By James Joyner IV · · 12 min read

Zero-Downtime, Lock-Aware Postgres Schema Migrations With AI

Use AI to review Postgres migrations for dangerous locks and draft safe multi-step rollouts — NOT NULL, new columns, type changes — verified on a replica first.

  • #postgres
  • #ai
  • #migrations
  • #schema

The outage post-mortem fit in one sentence: someone ran ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending' on a 400M-row table during business hours, it took an ACCESS EXCLUSIVE lock, and every read and write to the busiest table in the system queued behind it for several minutes. The migration was correct. The lock behavior was catastrophic. This is the defining hazard of Postgres schema changes: the SQL that works fine on a small table can lock your largest one solid. AI is genuinely useful as a migration reviewer because it knows which DDL takes which locks and can rewrite a dangerous one-liner into a safe multi-step rollout. But it can’t see your table sizes or your traffic. You give it the context; it drafts; you verify on a replica.

This is the review process I run before any migration touches a table that matters.

Know which locks hurt

The mental model: some DDL takes a brief lock that’s invisible under normal load, and some takes ACCESS EXCLUSIVE, which blocks everything — reads included — for the duration. The dangerous operations on large tables are: adding a column with a volatile default (older Postgres), rewriting a column type, adding a NOT NULL constraint via table scan, and creating indexes non-concurrently. The whole game is converting these into steps that each take only a brief lock.

Get AI to triage the migration

I never let a migration near prod without an AI lock review first. The prompt has to include the table size and Postgres version, because the answer depends on both.

Review this Postgres 16 migration for lock behavior. For each statement tell me the lock level it acquires, whether it requires a full table rewrite, and roughly how long it will hold the lock on a 400M-row table under write load. Rewrite anything that takes ACCESS EXCLUSIVE for more than a moment into a safe multi-step version. Output the rewritten migration with a comment on each step explaining the lock.

The model is reliable here because lock levels are deterministic and documented. What it gives back is a plan; what I owe it is verification.

Adding a NOT NULL column safely

The naive version scans and rewrites the whole table under an exclusive lock. The safe version splits it into add-nullable, backfill-in-batches, then validate.

-- Step 1: add the column nullable (fast, metadata-only on PG 11+)
ALTER TABLE orders ADD COLUMN status text;

-- Step 2: backfill in batches so no single statement locks long
-- (run in a loop from the app or a script, committing each batch)
UPDATE orders SET status = 'pending'
WHERE status IS NULL AND id BETWEEN 1 AND 100000;

-- Step 3: add a NOT VALID check constraint (brief lock, no scan)
ALTER TABLE orders ADD CONSTRAINT orders_status_nn
  CHECK (status IS NOT NULL) NOT VALID;

-- Step 4: validate it separately (scans, but only SHARE UPDATE EXCLUSIVE)
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_nn;

The NOT VALID then VALIDATE two-step is the key trick: it lets you enforce the constraint without an ACCESS EXCLUSIVE table scan. I ask AI to confirm the lock level of each step, then prove it on the replica.

Set a lock timeout so a slow lock can’t cascade

Even a “safe” migration can stall if it waits behind a long transaction to acquire its lock — and while it waits, it blocks everything behind it. A short lock timeout turns a potential cascade into a quick, retryable failure.

SET lock_timeout = '3s';
SET statement_timeout = '30s';
-- now run the DDL; if it can't get the lock fast, it aborts cleanly
ALTER TABLE orders ADD COLUMN status text;

This is non-negotiable on a busy table. I’d rather a migration fail and retry than have it sit in the lock queue blocking traffic. AI consistently reminds you to set this; I treat it as a checklist item.

Renaming and type changes

Renaming a column is instant, but it breaks any code still using the old name — so the safe pattern is expand/contract: add the new column, dual-write from the app, backfill, switch reads, then drop the old column in a later release. For type changes, the same idea: add a new column of the new type, backfill, swap. AI is good at laying out the expand/contract sequence across multiple deploys, which is the part people get wrong by trying to do it in one shot.

-- expand: new column, no rewrite of the old one
ALTER TABLE orders ADD COLUMN total_cents bigint;
-- backfill in batches, app dual-writes both columns
-- contract (later release): drop the old column
ALTER TABLE orders DROP COLUMN total;

Always dry-run against a clone

The verification step that catches what AI can’t: run the migration on a replica or a recent restore with production-scale data, and time each step while watching pg_locks from another session.

# restore a recent backup into a throwaway instance, then time the migration
pg_restore -d migration_test latest.dump
psql migration_test -c '\timing on' -f migration.sql

If a step that AI promised was “metadata-only” takes ninety seconds on the real data volume, you find out here, not in prod. This single dry-run has saved me more times than any review.

The division of labor

AI reads the lock levels, rewrites unsafe DDL into safe multi-step rollouts, and lays out expand/contract sequences across releases — fast and accurately, because lock semantics are deterministic. What it cannot do is know your table sizes, your traffic windows, or how long a step actually takes on your data. That’s the dry-run on a replica, and it stays a human decision. The broader Postgres operational set is in the Postgres guides, and I keep my migration-review prompts in the prompt library so every change gets the same lock triage before it ships.

Free download · 368-page PDF

Download the Free 500-Prompt DevOps AI Toolkit

500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.

  • 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
  • Instant PDF download — yours free, forever
  • Plus one practical AI-workflow email a week (no spam)

Single opt-in · unsubscribe anytime · no spam.