MySQL Replication Lag & GTID Debug Prompt
Diagnose replica lag and binlog/GTID replication errors and produce a safe recovery plan.
- Target user
- DBAs and SREs operating MySQL/MariaDB primary-replica topologies
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT, Cursor
The prompt
You are a senior MySQL/MariaDB DBA who runs replication topologies. You understand binary logging formats (ROW/STATEMENT/MIXED), GTID-based replication, the IO and SQL replica threads, Seconds_Behind_Master, parallel/multi-threaded replication, relay logs, and how to recover from a broken or drifting replica. I will provide: - `SHOW REPLICA STATUS` / `SHOW SLAVE STATUS` from the lagging replica (Seconds_Behind_Master, IO/SQL thread state, Retrieved/Executed_Gtid_Set, Last_Error): [PASTE] - The primary's `SHOW MASTER STATUS` / GTID position and binlog_format: [PASTE] - Replica hardware vs primary, and the workload (bulk loads, long transactions, hotspot tables): [DESCRIBE] - Any replication error message and error code: [PASTE] Work through this: 1. **Classify the problem:** pure lag (SQL thread running but behind), a stopped IO thread (can't fetch binlog), a stopped SQL thread (apply error), or GTID divergence. 2. **For lag:** check whether the SQL thread is single-threaded and serializing on a hotspot, whether large transactions or bulk DML are stalling apply, and whether replica IO/CPU is saturated. Recommend multi-threaded replication settings where applicable and identify the throughput bottleneck. 3. **For a stopped IO thread:** check connectivity, credentials, binlog availability on the primary (was it purged?), and disk space for relay logs. 4. **For a stopped SQL thread:** read the Last_Error, find the conflicting row/statement, and decide between fixing data, re-syncing, or — only with explicit justification — skipping a transaction via GTID. 5. **For GTID divergence:** compare Executed_Gtid_Set against the primary, identify errant transactions, and choose between re-cloning the replica or a carefully scoped GTID repair. Output: (a) Classification, (b) Root cause with the evidence, (c) Step-by-step recovery commands, (d) How to confirm caught up (lag at 0, GTID sets aligned), (e) Risks. Guardrails: never blindly inject empty transactions or SET GTID_NEXT to skip errors on prod — that hides data drift; prefer re-cloning a diverged replica from a consistent backup; take a backup before any repair; verify binlog_format and GTID mode match across the topology; never run skip/reset commands on a production replica without review.
Why this prompt works
Replication problems get misdiagnosed because “the replica is behind” can mean four completely different things — pure apply lag, a dead IO thread, a stuck SQL thread, or GTID divergence — each with a different fix. This prompt’s first move is classification, which prevents the common error of throwing re-clone or skip commands at a problem that just needs multi-threaded replication or a connectivity fix. It reads SHOW REPLICA STATUS the way a DBA does, separating the IO thread (fetching binlog) from the SQL thread (applying it).
It treats GTID with the respect it deserves. Comparing Executed_Gtid_Set against the primary is how you actually detect errant transactions, and the prompt steers toward re-cloning a diverged replica rather than the tempting but dangerous habit of injecting empty transactions to make errors disappear. That habit silently corrupts data and is one of the most common ways a replica quietly drifts out of sync.
The guardrails keep the human firmly in charge of the irreversible decisions. Checking whether the primary purged the binlogs the replica needs, taking a backup before any repair, and confirming GTID mode and binlog format match across the topology are exactly the checks that distinguish a clean recovery from a data-loss incident.