MySQL Semi-Synchronous Replication Tuning Prompt
Configure and tune MySQL semi-synchronous replication for the right durability-versus-latency balance without stalling writes.
- Target user: DBAs and SREs operating MySQL replication topologies for HA
- Difficulty: Advanced
- Tools: Claude, ChatGPT, Cursor
The prompt
You are a senior MySQL DBA who runs replication topologies for high availability. You understand semi-synchronous replication end to end: the plugins (rpl_semi_sync_source/rpl_semi_sync_replica on MySQL 8.0.26 and later, or the older rpl_semi_sync_master/rpl_semi_sync_slave names), rpl_semi_sync_master_enabled and rpl_semi_sync_slave_enabled, rpl_semi_sync_master_wait_for_slave_count, rpl_semi_sync_master_timeout, the AFTER_SYNC vs AFTER_COMMIT wait point (rpl_semi_sync_master_wait_point), automatic fallback to asynchronous replication on timeout, and the durability-versus-write-latency trade-off this all controls. Use whichever naming the pasted output shows; with the newer plugins the variables are the rpl_semi_sync_source_* and rpl_semi_sync_replica_* equivalents.

I will provide:

- Current replication config from the source: `SHOW VARIABLES LIKE 'rpl_semi_sync%'` and `SHOW STATUS LIKE 'Rpl_semi_sync%'`: [PASTE]
- The topology (number of replicas, network RTT between source and replicas, and which replicas count toward acknowledgment): [DESCRIBE]
- The durability requirement and the acceptable write-latency budget for commits: [DESCRIBE]
- Observed symptoms, if any (commit stalls, frequent fallback to async, or lag): [PASTE/DESCRIBE]

Work through this:

1. **Establish the durability goal.** Decide whether you need AFTER_SYNC (the source waits for a replica to acknowledge that the transaction has reached its relay log before committing to the storage engine, giving lossless failover while semi-sync stays active) or AFTER_COMMIT (the source commits to the storage engine first and then waits for the acknowledgment before returning to the client, so other sessions can read data that a failover might lose). Tie rpl_semi_sync_master_wait_for_slave_count to how many replica acknowledgments the failover plan actually requires.
2. **Set the timeout deliberately.** rpl_semi_sync_master_timeout (in milliseconds) controls how long the source waits for an ack before falling back to async. Explain that a short timeout protects write latency but silently drops to async (no durability guarantee) under network hiccups, while a long timeout preserves durability but can stall commits. Recommend a value justified by the measured RTT and latency budget.
3. **Verify the plugins and counters.** Confirm both source and replica plugins are loaded and enabled, then read Rpl_semi_sync_master_status, Rpl_semi_sync_master_yes_tx, Rpl_semi_sync_master_no_tx, and Rpl_semi_sync_master_clients to see whether the source is actually running semi-sync or has already fallen back to async.
4. **Reconcile durability with latency.** State the trade-off explicitly: more required acks and AFTER_SYNC raise durability but add commit latency proportional to RTT; fewer acks and AFTER_COMMIT lower latency but weaken the failover guarantee. Recommend the specific settings for this workload.
5. **Define monitoring and alerting.** Alert on fallback to async (Rpl_semi_sync_master_status flips to OFF), on Rpl_semi_sync_master_no_tx climbing, and on replica acknowledgment latency (for example Rpl_semi_sync_master_tx_avg_wait_time), so a silent degradation to async durability is caught.

Output:

(a) Recommended config with exact variable values
(b) AFTER_SYNC-vs-AFTER_COMMIT and wait_for_slave_count decision with rationale
(c) Timeout justification tied to RTT
(d) The durability-vs-latency trade-off stated plainly
(e) Monitoring and alerting plan

Guardrails:

- Validate every config change on a replica or staging topology first, and back up before changing replication settings.
- Remember that semi-sync still falls back to asynchronous on timeout, so it reduces but does not eliminate the risk of data loss on failover.
- Never change wait_point or timeout in production without confirming that the failover tooling and durability expectations match.
Why this prompt works
Semi-synchronous replication is often switched on as a checkbox for “more durable replication” without anyone tuning the three knobs that actually decide its behavior: how many replicas must acknowledge, when they acknowledge, and how long the source waits before giving up. This prompt forces the model to start from the durability goal and the latency budget, then map each setting back to that goal, rather than copying a config from a blog post. That ordering is what produces a configuration the failover plan can actually rely on.
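As a rough illustration of those three knobs, the sketch below renders them as `SET GLOBAL` statements. `SemiSyncPlan` and `render_statements` are hypothetical names invented for this example, the values are placeholders rather than recommendations, and the variable names assume the older master/slave plugin (with the 8.0.26+ plugin they would be the `rpl_semi_sync_source_*` equivalents).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SemiSyncPlan:
    """Hypothetical container for the three knobs discussed above."""
    wait_for_slave_count: int  # how many replicas must acknowledge
    wait_point: str            # when they acknowledge: AFTER_SYNC or AFTER_COMMIT
    timeout_ms: int            # how long the source waits before falling back to async


def render_statements(plan: SemiSyncPlan) -> List[str]:
    """Turn a plan into SET GLOBAL statements, rejecting obviously invalid values."""
    if plan.wait_point not in ("AFTER_SYNC", "AFTER_COMMIT"):
        raise ValueError(f"unknown wait point: {plan.wait_point!r}")
    if plan.wait_for_slave_count < 1:
        raise ValueError("at least one acknowledging replica is required")
    if plan.timeout_ms < 0:
        raise ValueError("timeout must not be negative")
    return [
        f"SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = {plan.wait_for_slave_count};",
        f"SET GLOBAL rpl_semi_sync_master_wait_point = '{plan.wait_point}';",
        f"SET GLOBAL rpl_semi_sync_master_timeout = {plan.timeout_ms};",
    ]


if __name__ == "__main__":
    # Placeholder values only; derive real ones from the durability goal and RTT.
    example = SemiSyncPlan(wait_for_slave_count=1, wait_point="AFTER_SYNC", timeout_ms=1000)
    for statement in render_statements(example):
        print(statement)
```

`SET GLOBAL` changes do not survive a server restart, so a real rollout would also persist the chosen values in the server configuration; that step is left out of the sketch.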
The heart of the prompt is the AFTER_SYNC versus AFTER_COMMIT distinction, because it is the difference between lossless and lossy failover. With AFTER_SYNC the source waits for a replica to acknowledge that the transaction is in its relay log before committing to the storage engine, so any transaction that a client or another session has seen committed already exists on a replica. With AFTER_COMMIT the source commits first and waits for the acknowledgment afterward, which leaves a window where other sessions can read the transaction before any replica has it; if the source crashes in that window, the data they read is missing after failover, the so-called phantom read. By making the model name the wait point and justify it, the prompt keeps that trade-off explicit instead of buried in a default.
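The ordering difference can be made concrete with a toy model. The step names below are simplified labels invented for this sketch, not server internals; the only thing the code checks is whether the storage-engine commit happens before or after the wait for a replica acknowledgment.

```python
# Simplified commit sequence on the source for each wait point.
# Step names are illustrative labels, not MySQL identifiers.
COMMIT_ORDER = {
    "AFTER_SYNC": [
        "write_and_sync_binlog",
        "wait_for_replica_ack",
        "commit_to_storage_engine",
        "return_ok_to_client",
    ],
    "AFTER_COMMIT": [
        "write_and_sync_binlog",
        "commit_to_storage_engine",
        "wait_for_replica_ack",
        "return_ok_to_client",
    ],
}


def visible_before_ack(wait_point: str) -> bool:
    """True if other sessions can read the transaction before any replica has acknowledged it."""
    steps = COMMIT_ORDER[wait_point]
    return steps.index("commit_to_storage_engine") < steps.index("wait_for_replica_ack")


if __name__ == "__main__":
    for wait_point in COMMIT_ORDER:
        print(wait_point, "visible before ack:", visible_before_ack(wait_point))
```

In this model a source crash during `wait_for_replica_ack` is harmless to readers under AFTER_SYNC, because nothing has been committed to the engine yet, while under AFTER_COMMIT the same crash can discard data that other sessions have already read.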
The operational sting is fallback. Semi-sync silently drops to asynchronous when the timeout expires, and a topology that everyone believes is durable may have been running async since the last network blip. That is why the prompt insists on reading the status counters and alerting on the fallback transition. Durability you cannot observe is durability you do not have, and this prompt makes the engineer instrument for it.
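A minimal sketch of that instrumentation, assuming the `SHOW STATUS LIKE 'Rpl_semi_sync%'` output has already been collected into a dictionary of strings by whatever monitoring agent is in use. The function name `semi_sync_alerts`, its parameters, and the alert wording are hypothetical; the status variable names are the older master/slave ones used in the prompt.

```python
from typing import Dict, List


def semi_sync_alerts(
    status: Dict[str, str],
    previous_no_tx: int = 0,
    required_acks: int = 1,
) -> List[str]:
    """Return human-readable alerts for one sample of the source's semi-sync status counters."""
    alerts: List[str] = []

    # OFF means the source is currently replicating asynchronously.
    if status.get("Rpl_semi_sync_master_status", "OFF").upper() != "ON":
        alerts.append("semi-sync is OFF: source has fallen back to async")

    # Fewer connected semi-sync replicas than the failover plan requires.
    clients = int(status.get("Rpl_semi_sync_master_clients", "0"))
    if clients < required_acks:
        alerts.append(
            f"only {clients} semi-sync replica(s) connected, {required_acks} required"
        )

    # no_tx counts commits that were not acknowledged by a replica.
    no_tx = int(status.get("Rpl_semi_sync_master_no_tx", "0"))
    if no_tx > previous_no_tx:
        alerts.append(
            f"{no_tx - previous_no_tx} commit(s) went unacknowledged since the last sample"
        )

    return alerts


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    sample = {
        "Rpl_semi_sync_master_status": "OFF",
        "Rpl_semi_sync_master_clients": "0",
        "Rpl_semi_sync_master_no_tx": "12",
    }
    for alert in semi_sync_alerts(sample, previous_no_tx=10):
        print(alert)
```

Comparing `no_tx` against the previous sample rather than against zero matters because the counter is cumulative; a topology that fell back once last month should not alert forever.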