MySQL Replication Setup and Lag Debugging With AI

Replication is one of those things that runs invisibly for months and then breaks at the worst possible moment — a replica falls hours behind during a batch job, or stops dead on a duplicate-key error, and suddenly your read traffic is serving stale data or your failover target is useless. The status output that tells you what’s wrong, SHOW REPLICA STATUS, has about fifty fields, and knowing which three actually matter for your problem is the whole skill.

I set up GTID-based replication by hand and I debug it by hand, but I’ve started pasting the replica status and relevant error log lines into an AI assistant to get a fast, structured read of where the pipeline is stuck. It’s good at saying “your SQL thread is blocked on this error, your IO thread is fine, here’s the GTID gap.” I confirm everything against the actual coordinates before running any recovery command, because a wrong RESET REPLICA can cost you the replica entirely.

Set up GTID replication cleanly

GTID (Global Transaction Identifier) replication is far easier to reason about than file-and-position because each transaction has a globally unique ID. On both servers in my.cnf:

[mysqld]
server_id            = 1          # unique per server
log_bin              = mysql-bin
binlog_format        = ROW
gtid_mode            = ON
enforce_gtid_consistency = ON
log_replica_updates  = ON

On the replica, point it at the source and start with auto-positioning so GTID figures out where to begin:

CHANGE REPLICATION SOURCE TO
  SOURCE_HOST='10.0.0.5',
  SOURCE_USER='repl',
  SOURCE_PASSWORD='...',
  SOURCE_AUTO_POSITION=1;

START REPLICA;
SHOW REPLICA STATUS\G

(On MariaDB and older MySQL the keywords are MASTER/SLAVE, e.g. CHANGE MASTER TO ... MASTER_USE_GTID=slave_pos.) The two threads to watch are Replica_IO_Running and Replica_SQL_Running; both should say Yes.

Read the status with help

When something’s wrong, the relevant fields are scattered through fifty lines. I copy the whole SHOW REPLICA STATUS\G output and ask:

Here is SHOW REPLICA STATUS from a replica that’s behind. Tell me whether the IO thread or the SQL thread is the bottleneck, what Seconds_Behind_Source and Retrieved_Gtid_Set vs Executed_Gtid_Set imply about where the lag is, and whether there’s an error blocking replication. Quote the exact field values you’re reasoning from.

A useful answer distinguishes the two common cases:

Your Replica_IO_Running is Yes and Retrieved_Gtid_Set is far ahead of Executed_Gtid_Set, so the IO thread has pulled the events but the SQL thread is replaying them slowly — this is apply lag, not network lag. Seconds_Behind_Source is 1,847 and climbing. There’s no Last_Error, so nothing is broken; the SQL thread simply can’t keep up, likely because replication is single-threaded for this workload. Enabling parallel replication is the fix.

That distinction — fetch lag versus apply lag — determines the entire fix, and it’s exactly what the model is good at extracting.

Fix apply lag with parallel replication

If the SQL thread is the bottleneck, modern MySQL can apply transactions in parallel:

STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA SQL_THREAD;

replica_preserve_commit_order = ON keeps transactions committing in source order, which you want for consistency. I ask AI to confirm the worker count makes sense for the box’s core count, then I watch Seconds_Behind_Source actually trend down before declaring victory. If it doesn’t drop, the lag is from something else — long-running individual transactions, or disk I/O on the replica — and I keep digging.

When replication stops on an error

The nastier case is a stopped SQL thread with a Last_Error. The dangerous reflex is to skip the offending transaction. Sometimes that’s right; often it silently diverges the replica from the source. I paste the error and ask AI to explain what skipping would actually do:

The error is a duplicate-key violation applying a GTID. Before you skip it, understand that skipping means this transaction never applies on the replica, so the replica’s data will permanently differ from the source for that row. Only skip if you’ve confirmed the row already has the correct value (e.g. from a prior partial apply). Otherwise the right fix is to repair the data and let it apply.

To inspect the actual transaction before deciding, read it out of the binlog on the source:

mysqlbinlog --include-gtids='3E11FA47-...:1000' \
  --base64-output=DECODE-ROWS -vv mysql-bin.000042

Seeing the real statement tells you whether skipping is safe. If it is, on GTID you skip by injecting an empty transaction with that GTID; the AI can show the exact syntax, but I verify the GTID matches the error before running anything. The MySQL category covers binlog inspection in more detail.

Guardrails I never skip

A few hard rules around replication recovery:

Never run RESET REPLICA or RESET MASTER on a hunch. They wipe replication state and GTID history, and the recovery from getting that wrong is a full re-clone. If AI suggests one, I make it justify exactly what state will be lost.
Verify GTID coordinates by eye. Before skipping a transaction, the GTID in my command must match the one in the error. AI has fat-fingered these in suggestions, and an empty transaction on the wrong GTID corrupts the set.
Confirm Seconds_Behind_Source actually recovers. A fix isn’t a fix until the lag is back to near zero and stays there through a traffic cycle.
Check data consistency after a skip. pt-table-checksum is how I confirm the replica still matches the source after any manual intervention.

I keep my replication-debugging prompts saved with the “quote the exact field” and “explain what this destructive command loses” instructions baked in; the shared set is at /prompts/. AI is excellent at turning the fifty-field status dump into a clear statement of which thread is stuck and why, and at spelling out the consequences of recovery commands before you run them. But the coordinates, the GTIDs, and the consistency checks are all things you verify yourself, because in replication the cost of acting on a confident wrong answer is a destroyed replica.

Set up GTID replication cleanly

Read the status with help

Fix apply lag with parallel replication

When replication stops on an error

Guardrails I never skip

Download the Free 500-Prompt DevOps AI Toolkit