Tuning MySQL Semi-Synchronous Replication With AI

I’ve been bitten by asynchronous replication exactly once in a way I’ll never forget: a source crashed, we failed over to a replica, and the last few seconds of committed transactions were simply gone because they’d never made it across the wire. With pure async, the source commits and moves on without waiting for any replica to receive the change. That’s fast, and for plenty of workloads it’s the right call. But when those last few transactions are customer payments, “probably replicated” isn’t good enough. Semi-synchronous replication closes that gap, and lately I draft the configuration with AI and then spend my real effort testing failover on staging. AI gets the syntax right; staging tells me whether the durability is real.

Async Versus Semi-Sync

Under asynchronous replication, a commit on the source returns to the client the moment it’s flushed locally. The source streams the change to each replica afterward and never waits for confirmation that it arrived. If the source dies, anything a replica hasn’t yet received is lost. Semi-synchronous replication changes one thing: before the source acknowledges the commit to the client, it waits for at least one replica to confirm it has received the transaction’s events into its relay log. You’re not waiting for the replica to apply the change, only to durably receive it, which keeps the latency cost modest. As long as semi-sync stays active, every acknowledged commit exists in at least two places.

The trade-off is plain: every commit now pays a network round-trip to a replica. On a low-latency LAN that’s sub-millisecond; across availability zones it adds up. You’re buying durability with latency, and the whole tuning exercise is about choosing how much latency you’ll pay and what happens when a replica can’t keep up.
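
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a single connection committing serially, with each commit costing a local flush plus one full round trip to a replica. The function name and the latency figures are made up for illustration, not measured on any real network.

```shell
#!/usr/bin/env bash
# Rough ceiling on serial commits per second for ONE connection, assuming
# each commit costs a local flush plus one full round trip to a replica.
# Both arguments are in milliseconds; the values below are illustrative guesses.
max_commits_per_sec() {
  local local_ms="$1" rtt_ms="$2"
  awk -v l="$local_ms" -v r="$rtt_ms" 'BEGIN { printf "%d\n", 1000 / (l + r) }'
}

max_commits_per_sec 1 0.5   # same-rack guess: prints 666
max_commits_per_sec 1 2     # cross-AZ guess:  prints 333
```

This is a per-connection ceiling only. Concurrent sessions wait for their acknowledgements in parallel, so aggregate throughput usually suffers far less than the single-connection number suggests.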

Enabling the Plugins

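MySQL 8.0 ships semi-sync as two plugins, one for the source side and one for the replica side. As of MySQL 8.0.26 the terminology moved to source/replica, and the plugin variables use rpl_semi_sync_source_* and rpl_semi_sync_replica_*. Earlier 8.0 releases only have the older rpl_semi_sync_master_* and rpl_semi_sync_slave_* plugins, so the statements below assume 8.0.26 or later.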

-- On the source
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = ON;
SET GLOBAL rpl_semi_sync_source_timeout = 1000;          -- milliseconds
SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 1;
SET GLOBAL rpl_semi_sync_source_wait_point = 'AFTER_SYNC';

-- On each replica
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = ON;

After enabling on a replica you restart its receiver thread so it reconnects under the semi-sync protocol:

STOP REPLICA IO_THREAD;
START REPLICA IO_THREAD;

To make this survive a restart, the variables go in an option file that my.cnf includes rather than only being set at runtime. This is the source side:

sudo tee -a /etc/mysql/mysql.conf.d/semisync.cnf >/dev/null <<'CONF'
[mysqld]
plugin-load-add        = semisync_source.so
rpl_semi_sync_source_enabled = 1
rpl_semi_sync_source_timeout = 1000
rpl_semi_sync_source_wait_for_replica_count = 1
rpl_semi_sync_source_wait_point = AFTER_SYNC
CONF
sudo systemctl restart mysql
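
The file above covers only the source. A replica-side counterpart might look like the sketch below. The function name is hypothetical and the target path in the usage comment simply mirrors the source example, so adjust both to your own layout.

```shell
#!/usr/bin/env bash
# Print a replica-side option-file fragment that loads the replica plugin
# and enables semi-sync at startup. Nothing is written to disk here.
replica_semisync_conf() {
  cat <<'CONF'
[mysqld]
plugin-load-add = semisync_replica.so
rpl_semi_sync_replica_enabled = 1
CONF
}

# Usage on a replica (path is an assumption mirroring the source example):
#   replica_semisync_conf | sudo tee /etc/mysql/mysql.conf.d/semisync.cnf >/dev/null
#   sudo systemctl restart mysql
```

If a replica may be promoted during failover, it is worth deciding up front whether it should also carry the source-side settings, so the new source comes up with semi-sync ready rather than quietly running async.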

AFTER_SYNC Versus AFTER_COMMIT

The single most important setting here is rpl_semi_sync_source_wait_point, and it has two values that look similar but behave very differently for crash safety.

With AFTER_SYNC (the default and the one I use), the source writes the transaction to its binary log, waits for a replica to acknowledge receipt, and only then commits in the storage engine and returns to the client. The consequence is that no client ever sees a transaction as committed until a replica already has it. If the source crashes after the acknowledgement but before its own commit completes, the transaction is still safe on the replica, and after failover there is no data that some client saw on the old source and that has since vanished.

With AFTER_COMMIT, the source commits in the engine first, then waits for replica acknowledgement before returning to the client. The window is subtle but real: another session can read the committed row before the replica confirms, so if the source crashes in that window, a value that was visible on the old source is now gone. AFTER_SYNC is the safer choice for almost everyone, and I only deviate with a specific reason.

This is exactly the kind of nuance I ask AI to lay out before I commit to a config.

You are a senior MySQL 8.0 DBA. Explain the crash-safety difference between rpl_semi_sync_source_wait_point AFTER_SYNC and AFTER_COMMIT for a payments workload, and recommend a wait point plus a starting timeout. Keep it specific to MySQL 8.0 source/replica semantics.

For a payments workload, use AFTER_SYNC. The source waits for a replica to acknowledge receipt of the binlog events before the storage-engine commit becomes visible to clients, so no committed-and-read transaction can be lost if the source crashes before replication catches up. AFTER_COMMIT exposes a window where a client can read a transaction that the replica hasn’t yet acknowledged, risking phantom reads after failover. Start with rpl_semi_sync_source_timeout = 1000 ms and wait_for_replica_count = 1, then test under real failover before tightening. Larger timeouts increase durability but block commits longer when a replica stalls.

That matches my own reasoning, but I treat it as a draft to validate, not a config to paste into production.

Wait Count, Timeout, and the Fallback to Async

Two more settings shape behavior under stress. rpl_semi_sync_source_wait_for_replica_count is how many replicas must acknowledge before the source proceeds. Setting it to 2 across three replicas tolerates one slow replica while still guaranteeing that two replicas have received every acknowledged transaction, at the cost of more latency because each commit now waits for the second-fastest acknowledgement. rpl_semi_sync_source_timeout is the safety valve: if the required number of acknowledgements doesn’t arrive within that window, the source automatically falls back to asynchronous replication so your application doesn’t hang indefinitely behind a dead replica.
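
One sanity check I find useful before raising the wait count is comparing it with how many semi-sync replicas are actually connected. The sketch below is a hypothetical helper, not part of MySQL: you pass it the connected-replica count (the clients status counter) and the configured wait count, and it reports how many replicas can stall before acknowledgements fall short.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: compare the configured wait count with the
# number of connected semi-sync replicas. Returns non-zero when there are
# fewer connected replicas than required acknowledgements.
ack_headroom() {
  local clients="$1" wait_count="$2"
  if [ "$clients" -lt "$wait_count" ]; then
    echo "short: only $clients of $wait_count required replicas connected"
    return 1
  fi
  echo "ok: headroom of $((clients - wait_count)) replica(s)"
}

ack_headroom 3 2          # three replicas, two acks required
ack_headroom 1 2 || true  # not enough replicas to satisfy the wait count
```

Zero headroom is not wrong, but it means a single slow replica is enough to push commits toward the timeout.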

That automatic fallback is a feature and a trap. It keeps you available when a replica dies, but it also means you can silently lose your durability guarantee during a network blip. So monitoring is not optional. The status variable that tells you the truth is Rpl_semi_sync_source_status. The status variables follow the plugin you loaded: with semisync_source.so they carry the source prefix, and only the older semisync_master.so plugin exposes them as Rpl_semi_sync_master_*.

SHOW STATUS LIKE 'Rpl_semi_sync_source_status';      -- ON = semi-sync active, OFF = fell back to async
SHOW STATUS LIKE 'Rpl_semi_sync_source_no_tx';       -- commits that went without semi-sync ack
SHOW STATUS LIKE 'Rpl_semi_sync_source_yes_tx';      -- commits that got an ack
SHOW STATUS LIKE 'Rpl_semi_sync_source_clients';     -- connected semi-sync replicas

I alert hard on Rpl_semi_sync_source_status = OFF, because that single flag is the difference between “we’re durable” and “we silently aren’t.” A rising no_tx counter means commits are going through without an acknowledgement, which usually means timeouts are firing and a replica is struggling.
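
Here is a minimal sketch of how that alert could be wired up from a shell check. The function name and the exit-code convention are my own assumptions. It reads tab-separated name/value rows on stdin, which is the shape `mysql -N -B` produces, and it accepts both the source and master spellings because the prefix depends on which plugin library is loaded.

```shell
#!/usr/bin/env bash
# Read "name<TAB>value" status rows on stdin and exit 2 unless the semi-sync
# status flag is ON. Also reports the count of commits that went out without
# an acknowledgement (no_tx), treating a missing counter as 0.
semisync_check() {
  awk '
    $1 ~ /^Rpl_semi_sync_(source|master)_status$/ { status = $2 }
    $1 ~ /^Rpl_semi_sync_(source|master)_no_tx$/  { no_tx = $2 }
    END {
      n = no_tx + 0
      if (status == "ON") { print "OK semi-sync ON no_tx=" n; exit 0 }
      print "CRITICAL semi-sync status=" (status == "" ? "unknown" : status) " no_tx=" n
      exit 2
    }'
}

# Demo with canned input; in practice the rows would come from the server.
printf 'Rpl_semi_sync_source_status\tOFF\nRpl_semi_sync_source_no_tx\t7\n' \
  | semisync_check || echo "would page here"
```

Against a live server the input would come from something like `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_%'" | semisync_check`, run from whatever scheduler or monitoring agent you already trust.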

Testing Failover on Staging

Configuration is the easy part. The reason I don’t trust any semi-sync setup until staging confirms it is that the entire value proposition only shows up during failure. So I deliberately break things.

# On staging: enable semi-sync, generate write load, then sever the replica mid-stream
mysql -h stg-source -u dba -p -e "SHOW STATUS LIKE 'Rpl_semi_sync_source_status';"   # expect ON

# On the source host: block replica acks with a firewall rule and watch the source fall back after the timeout.
# -I puts the rule ahead of any existing ACCEPT rules; run the status queries from a host outside the blocked range.
sudo iptables -I INPUT 1 -s 10.20.0.0/24 -p tcp --dport 3306 -j DROP
sleep 2
mysql -h stg-source -u dba -p -e "SHOW STATUS LIKE 'Rpl_semi_sync_source_status';"   # expect OFF

# Restore the replica and confirm semi-sync re-engages (it may take longer if the replica has to reconnect)
sudo iptables -D INPUT -s 10.20.0.0/24 -p tcp --dport 3306 -j DROP
sleep 3
mysql -h stg-source -u dba -p -e "SHOW STATUS LIKE 'Rpl_semi_sync_source_status';"   # expect ON

Then I kill the source process while writes are in flight, promote the replica, and verify that every transaction the application believed it committed is present. That last check is the whole point of semi-sync, and the only way to know it works is to watch it work under a real failover, not to read that the plugin loaded.
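
That final verification can be as simple as a set difference. The sketch below assumes the load generator records every ID it saw acknowledged in one file, one per line, and that you dump the matching IDs from the promoted replica into another. The file layout and the function name are hypothetical.

```shell
#!/usr/bin/env bash
# List IDs the application recorded as committed that are absent from the
# promoted replica. Both inputs are plain text files with one ID per line.
# An empty result means no acknowledged transaction was lost.
missing_commits() {
  local app_ids="$1" db_ids="$2" tmp
  tmp=$(mktemp) || return 1
  LC_ALL=C sort -u "$db_ids" > "$tmp"
  # comm -23 keeps lines that appear only in the first (application) input.
  LC_ALL=C sort -u "$app_ids" | LC_ALL=C comm -23 - "$tmp"
  rm -f "$tmp"
}

# Usage (file names are placeholders):
#   missing_commits app_acked_ids.txt replica_ids.txt
```

The check is deliberately one-directional. With AFTER_SYNC the replica can legitimately hold a transaction the client never got an answer for, so extra rows on the replica are not a failure, while any ID printed here is.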

Where AI Fits

AI is excellent at the parts that are about precision and recall: the right plugin names, the wait-point semantics, the status variables to watch, a sensible starting timeout. It saves me from typos and from misremembering which naming scheme a given MySQL version uses, although I still check the names against the running server. What it cannot do is feel the latency on your network or tell you whether your application survives a real source crash. So the division of labor is firm: AI drafts the config and explains the trade-offs, staging proves the durability under failover.

For more on running MySQL in production, the MySQL category has the rest of my playbooks, and the exact prompts I use for replication and failover work live in my AI prompt collection. Get the wait point right, monitor the fallback flag, and never ship a semi-sync change you haven’t watched fail over.