MySQL Backup and Point-in-Time Recovery With AI

The worst time to learn how your recovery process works is at 2 a.m. after someone ran an UPDATE without a WHERE clause on the production users table. The full backup from last night gets you back to midnight, but you need the state from 1:47 a.m., right before the bad statement — and that’s where the binary log comes in, replaying the transactions between the backup and the disaster, stopping just short of it. Point-in-time recovery (PITR) is the skill that separates “we lost a day” from “we lost four minutes,” and almost nobody practices it until they need it.

I plan and rehearse PITR with AI in the loop. It’s good at constructing the mysqlbinlog replay command with the right stop position, explaining what each step does, and catching the ordering mistakes that ruin a restore. What it cannot do is verify that your backup actually restores or that the recovered data is correct — that’s a thing you do on a separate instance, ideally before any disaster, on a schedule.

Have a backup that supports PITR at all

PITR only works if two things are true: you have a consistent full backup with a known binlog coordinate, and you have every binlog from that coordinate forward. The full backup, with mysqldump:

mysqldump --single-transaction --source-data=2 \
  --all-databases --routines --triggers --events \
  > /backups/full-$(date +%F).sql

--single-transaction gives a consistent snapshot without locking InnoDB, and --source-data=2 writes the binlog file and position into the dump as a comment — that’s the anchor point for replay. For large databases I’d use xtrabackup instead, but the principle is identical: capture the binlog coordinate at backup time. And of course log_bin must be on:

[mysqld]
log_bin            = mysql-bin
binlog_format      = ROW
binlog_expire_logs_seconds = 1209600   # keep 14 days

If binlogs expire before your backup retention, you have a gap where PITR is impossible. I make sure the binlog retention covers at least as long as the full-backup retention.

Plan the recovery with AI

When disaster hits, the sequence is: restore the full backup to a clean instance, then replay binlogs from the backup’s coordinate up to just before the bad statement. I describe the situation and ask:

A bad UPDATE ran on production at approximately 01:47 today. My last full mysqldump is from 00:00 and recorded binlog position mysql-bin.000312:194. Walk me through restoring to a scratch instance and replaying binlogs to stop just before the bad statement. Give exact mysqlbinlog commands and explain how to find the precise stop position. Emphasize anything that would make this irreversible.

A good plan:

Restore the full dump to a fresh instance (not production). Then identify the exact event before the bad UPDATE — use mysqlbinlog to inspect events around 01:47 and find the GTID or position of the offending statement. Replay from the backup coordinate up to but not including it. Critically: do all of this on a scratch instance and validate the data before touching production, because replaying past the bad statement reintroduces the damage.

The “stop just before” precision is everything, and the model is good at reminding you to find the exact event rather than guessing a timestamp.

Execute the replay carefully

First restore the full backup to scratch, then inspect the binlog to pinpoint the bad event:

# Restore full backup to a scratch instance
mysql -h scratch.db.internal < /backups/full-2026-06-20.sql

# Inspect events around the incident to find the exact stop point
mysqlbinlog --start-datetime="2026-06-20 01:45:00" \
            --stop-datetime="2026-06-20 01:50:00" \
            mysql-bin.000312 | less

Once I’ve found the position right before the destructive statement (say --stop-position=49210), I replay up to it:

mysqlbinlog mysql-bin.000312 \
  --start-position=194 \
  --stop-position=49210 \
  | mysql -h scratch.db.internal

Stopping by position is more precise than by datetime, because multiple statements can share a second. I confirm the bad row is now in its correct pre-incident state on the scratch instance, diff a few known records, and only then plan how to bring the corrected data back into production — often by exporting just the affected rows rather than swapping the whole instance. The MySQL category goes deeper on binlog inspection.

Rehearse before you need it

The non-negotiable part: I restore a backup to a throwaway instance on a regular schedule, run the PITR replay, and verify the result, when there’s no emergency. A backup that’s never been restored is a hope, not a recovery plan. I’ve found genuine problems this way — a dump that excluded a schema, a binlog gap from a misconfigured expiry, a restore that took six hours when the SLA assumed one. None of those surface until you actually run the restore.

I have AI help build the rehearsal checklist and the validation queries that prove the restored data is complete:

Generate a restore-rehearsal checklist: restore the latest full backup to a scratch host, replay one day of binlogs, and validate. Include specific validation queries (row counts per major table, latest timestamps, a checksum) that would reveal a partial or corrupt restore.

Then I actually run it, because the model’s checklist is only as good as the execution I give it.

Guardrails

Never replay against production. Always recover on a scratch instance first and validate before reintroducing data. AI should reinforce this every time; if a suggestion replays straight into prod, stop.
Stop by position, confirm the event. A timestamp can include or exclude the wrong statement. Inspect the binlog and stop on the exact position before the damage.
Validate completeness, not just success. A restore that “succeeds” but silently dropped a schema is worse than a failure. Run the row-count and checksum queries.
Test retention overlap. Confirm binlog retention always covers backup retention, or PITR has a hole.

I keep the PITR-planning and rehearsal prompts saved with the “scratch instance only” and “stop by exact position” guardrails baked in — there’s a starter at /prompts/. AI builds the replay command and the validation checklist quickly and catches the irreversibility traps. But the thing that actually saves you at 2 a.m. is having rehearsed the restore on real data beforehand, because that’s the only proof your backup is a backup at all.

Have a backup that supports PITR at all

Plan the recovery with AI

Execute the replay carefully

Rehearse before you need it

Guardrails

Download the Free 500-Prompt DevOps AI Toolkit