Running a MySQL Group Replication Cluster With AI

I’ve run both Galera and MySQL Group Replication in production, and the thing nobody tells you when you adopt synchronous clustering is that the failure modes are completely different from classic asynchronous replication. A normal replica falls behind and you debug a pipeline. A clustered node, by contrast, can get expelled, refuse writes because the cluster lost quorum, or throttle the whole group to a crawl because one slow member triggered flow control. The status you need to read isn’t SHOW REPLICA STATUS anymore — it’s performance_schema.replication_group_members and a pile of group-level counters that most DBAs only look at when something is already on fire.

I use AI the way I’d use a sharp junior who has read every line of the documentation but has never touched my cluster: it interprets the status dump fast and structurally, and I verify every conclusion against the actual nodes — on a replica or a staging cluster — before I touch the primary. Let me walk through how I think about it.

Single-Primary vs Multi-Primary

The first decision is mode. Group Replication defaults to single-primary: one node accepts writes, the rest are read-only (super_read_only=ON), and on failure the group elects a new primary automatically. This is what I run almost everywhere because it sidesteps the hardest class of problems — write conflicts between nodes.

Multi-primary lets every node accept writes. It sounds great until two nodes commit conflicting changes to the same row in the same instant. Group Replication resolves this with certification: each transaction is checked against the others’ write-sets, and the loser of a conflict gets a rollback with a deadlock-style error. Galera behaves the same way. If your application isn’t written to retry those, multi-primary will hand you intermittent failures that look like phantom deadlocks.

-- Which mode am I actually running?
SELECT @@group_replication_single_primary_mode;

-- Who is the current primary in single-primary mode?
SELECT MEMBER_HOST, MEMBER_ROLE, MEMBER_STATE
FROM performance_schema.replication_group_members;

Reading replication_group_members

This view is the heartbeat of the cluster. Every node should appear, every node should be ONLINE, and exactly one should be PRIMARY in single-primary mode.

SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;

The states that matter:

ONLINE — healthy, participating, applying transactions.
RECOVERING — joined the group and is catching up via distributed recovery. Normal briefly; alarming if it sticks.
UNREACHABLE — the local node can’t talk to it. This is the one that precedes quorum loss.
ERROR — the node hit an applier or recovery error and dropped out.

When a member goes UNREACHABLE, the clock starts. Group Replication needs a majority of the configured membership to be reachable to keep accepting writes. Lose the majority and the surviving minority blocks all writes to protect consistency — it will not split-brain. That’s the feature working, even though it feels like an outage.

Quorum and the Partition You Didn’t Plan For

A three-node cluster tolerates one failure. Lose two and the remaining node has no quorum: it sits there ONLINE but refuses writes, and replication_group_members shows the dead nodes as UNREACHABLE indefinitely. This is the moment people panic and do something destructive.

If you’ve genuinely lost the other nodes and confirmed they aren’t coming back, you force a new membership on the survivor:

-- Survivor's own address:port, only after confirming the others are truly gone
SET GLOBAL group_replication_force_members = '10.0.0.11:33061';
-- Clear it again immediately afterward
SET GLOBAL group_replication_force_members = '';

I treat force_members like a fire axe behind glass. Run it while a “dead” node is actually just partitioned and you create the split-brain the system was designed to prevent. I always check ICMP and the group-communication port from the survivor before considering it.

# Confirm the peers are really unreachable, not just the MySQL port
ping -c 3 10.0.0.12
nc -vz 10.0.0.12 33061

Flow Control: When the Slowest Node Sets the Pace

Synchronous-ish replication means the group can only commit as fast as its members can certify and apply. When a node’s apply queue backs up, Group Replication engages flow control and throttles the writers so the laggard can catch up. The symptom is brutal and counterintuitive: your fast primary suddenly commits slowly, and nothing on the primary itself looks wrong.

-- Look for a member with a growing apply/certify queue
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE        AS certify_queue,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE AS apply_queue
FROM performance_schema.replication_group_member_stats;

A queue that keeps climbing on one member is your throttle source. The fix is almost always that node, not the primary — slow disk, an undersized instance, a long-running query holding things up, or a missing index forcing full scans on the applier.

Using AI to Interpret Cluster Status

Here’s where AI earns its place. The status across these views is a lot to hold in your head under pressure. I dump the members view and the member stats, and ask for a structured read. The prompt I keep handy:

You are a senior MySQL DBA. Below is output from replication_group_members and replication_group_member_stats for a 3-node single-primary Group Replication cluster. Tell me: (1) do we have quorum, (2) which member if any is causing flow control, (3) the single safest next action. Do not suggest force_members unless quorum is already lost. Output: node-03 (10.0.0.13) UNREACHABLE for ~40s, member count 3.

A representative response:

Quorum status: HEALTHY. 2 of 3 members ONLINE, 1 UNREACHABLE — you retain a majority and writes continue. Flow control: node-02 shows an apply queue of several thousand and climbing; it is the throttle source, not node-03. Safest next action: investigate node-02’s applier (check for a blocking query and disk I/O), and monitor node-03 for return to RECOVERING. Do NOT force membership — quorum is intact and node-03 may rejoin on its own.

That’s exactly the read I want: it separates the unreachable node from the slow node, which are two different problems people constantly conflate. But I verify before acting — I open node-02 myself, confirm the queue, and check processlist for the blocking statement. The AI is right often enough to save minutes and wrong often enough that I never let it drive.

Recovering an Out-of-Sync Node

When a node drops to ERROR or gets expelled, the cleanest recovery is usually to rejoin it and let distributed recovery resync from a donor. First clear the applier error, then restart Group Replication:

-- On the failed node, after fixing the underlying applier error
STOP GROUP_REPLICATION;
RESET REPLICA ALL FOR CHANNEL 'group_replication_recovery';
START GROUP_REPLICATION;

-- Watch it move from RECOVERING to ONLINE
SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;

If the node has diverged so far that incremental recovery can’t proceed — typically a GTID set that the group can no longer supply because the binlogs were purged — you reclone it from a healthy donor and let it rejoin clean. I’d rather spend the time on a fresh clone than chase a node whose GTID history no longer lines up with the group’s.

Where I Land

Clustering trades the simplicity of asynchronous replication for strong consistency, and it pays for that with failure modes — quorum loss, flow control, certification conflicts — that demand you read group-level status fluently. AI is genuinely good at parsing those status dumps into a clear story, and that’s worth a lot at 3 a.m. But synchronous clusters punish wrong actions immediately and sometimes irreversibly, so every AI conclusion gets checked against the live nodes before I commit to anything.

For more on running MySQL well under pressure, browse the rest of the MySQL guides, and grab the cluster-status interpretation prompts from the prompt library to keep your incident playbook sharp.