Setting Up and Debugging Postgres Replication With AI

Replication is the feature you set up once and then never think about until it silently breaks, at which point you discover your “high availability” replica has been hours behind for a week. I’ve seen a replica fall so far behind that failover would have lost a day of data, and the only sign was a metric nobody had alerted on. Replication has a lot of moving parts — WAL, slots, sender and receiver processes, conflict handling on the standby — and the failure messages are terse. AI is a strong partner here: it explains slot and lag semantics clearly and drafts the config, but it can’t read your live pg_stat_replication. You hand it the catalog snapshot; it tells you what’s stuck.

This is how I stand up replication and how I debug it when it stalls.

Streaming vs. logical — pick deliberately

Streaming (physical) replication ships the WAL byte-for-byte; the replica is an exact binary copy, read-only, all databases at once. Logical replication decodes the WAL into row changes and replays them, so you can replicate selected tables, across major versions, into a writable target. Streaming is for HA and read replicas; logical is for selective copies, upgrades, and zero-downtime version migrations. I describe my goal to AI and have it argue both sides before I commit — picking the wrong one means rebuilding later.

Stand up streaming replication

The primary needs WAL configured for streaming and a replication slot so it retains WAL the standby hasn’t consumed yet.

-- on the primary
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
SELECT pg_create_physical_replication_slot('standby1');
-- wal_level and max_wal_senders only take effect after a restart

# build the standby from a base backup that registers the slot
pg_basebackup -h primary.internal -U replicator \
  -D /var/lib/postgresql/16/main \
  -S standby1 -R -P --wal-method=stream

The -R flag writes the connection config and standby.signal for you. I ask AI to review the pg_basebackup flags against my goal — the slot reference and -R are the parts people forget, and missing the slot means the primary can recycle WAL the standby still needs.
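
Before starting the standby, I like a quick check that -R and -S actually left their marks in the data directory. This is a minimal sketch, assuming PostgreSQL 12 or later, where -R creates standby.signal and appends primary_conninfo and primary_slot_name to postgresql.auto.conf. check_standby_dir is a hypothetical helper name, not part of any Postgres tooling.

```shell
# Hypothetical helper: sanity-check a data directory produced by pg_basebackup -R -S.
# Assumes the PostgreSQL 12+ layout (standby.signal + postgresql.auto.conf).
# Usage: check_standby_dir <datadir> <slot_name>
check_standby_dir() (
  datadir=$1
  slot=$2
  conf="$datadir/postgresql.auto.conf"
  status=0

  # standby.signal is what makes the server start as a standby
  if [ ! -f "$datadir/standby.signal" ]; then
    echo "missing standby.signal (was -R passed?)"
    status=1
  fi

  # -R writes the connection string to the primary
  if ! grep -q "^primary_conninfo" "$conf" 2>/dev/null; then
    echo "primary_conninfo not set"
    status=1
  fi

  # -S together with -R records the slot the standby will stream through
  if ! grep -q "^primary_slot_name *= *'$slot'" "$conf" 2>/dev/null; then
    echo "primary_slot_name is not '$slot' (was -S passed?)"
    status=1
  fi

  [ "$status" -eq 0 ] && echo "standby config looks complete"
  exit "$status"   # exits only the function's subshell
)
```

Run it as check_standby_dir /var/lib/postgresql/16/main standby1; a non-zero exit means one of the pieces is missing.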

Watch lag like you mean it

The single most important habit: monitor lag from the primary’s view, in bytes, not just time. Time lag looks fine on an idle system and hides a slot that’s silently stuck.

-- on the primary
SELECT
  application_name,
  client_addr,
  state,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS sent_lag_bytes,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
  write_lag, flush_lag, replay_lag AS replay_lag_time
FROM pg_stat_replication;

If replay lag in bytes is growing while sent lag stays flat, the standby is receiving WAL but can’t apply it fast enough, often because the replica is I/O bound. I paste this to AI and ask it to distinguish network lag (sent_lsn falling behind the primary’s current LSN) from apply lag (sent_lsn keeping up while replay_lsn falls behind), because the fixes are completely different.
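
It helps to know what the byte numbers are made of. pg_wal_lsn_diff is plain subtraction on 64-bit WAL positions: an LSN prints as two hex halves, the high 32 bits before the slash and the low 32 bits after. The sketch below redoes that arithmetic in shell and applies the send-versus-apply distinction to one pasted row. lsn_to_bytes and classify_lag are hypothetical helper names, and the 16 MiB default threshold (one default-sized WAL segment) is an arbitrary assumption, not a recommendation.

```shell
# Convert an LSN such as 16/B374D848 to an absolute byte position.
# Assumes the shell does 64-bit arithmetic (bash and dash on 64-bit systems do).
lsn_to_bytes() (
  hi=${1%%/*}   # hex digits before the slash: high 32 bits
  lo=${1##*/}   # hex digits after the slash: low 32 bits
  echo $(( (0x$hi << 32) + 0x$lo ))
)

# Classify lag from three LSNs: the primary's current LSN, sent_lsn, replay_lsn.
# Optional 4th argument: threshold in bytes (assumed default: 16 MiB).
classify_lag() (
  current=$(lsn_to_bytes "$1")
  sent=$(lsn_to_bytes "$2")
  replay=$(lsn_to_bytes "$3")
  threshold=${4:-16777216}

  send_lag=$(( current - sent ))    # WAL generated but not yet sent
  apply_lag=$(( sent - replay ))    # WAL sent but not yet replayed

  if [ "$send_lag" -gt "$threshold" ]; then
    echo "send lag: $send_lag bytes not yet sent"
  elif [ "$apply_lag" -gt "$threshold" ]; then
    echo "apply lag: $apply_lag bytes sent but not replayed"
  else
    echo "ok"
  fi
)

# Example:
#   classify_lag 0/5000000 0/5000000 0/1000000
#   prints "apply lag: 67108864 bytes sent but not replayed"
```

The gap between sent_lsn and replay_lsn also covers the write and flush stages on the standby, so treat "apply lag" here as a first cut and look at write_lsn and flush_lsn if you need to split it further.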

Debug a stuck or disconnected replica

When a standby disconnects, the first question is whether the slot is retaining WAL (filling the primary’s disk) and the second is why the standby can’t connect.

-- on the primary: is a slot stuck and hoarding WAL?
SELECT slot_name, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

# on the standby: what is the receiver complaining about?
grep -iE "replication|wal receiver|streaming" /var/log/postgresql/postgresql-16-main.log | tail -40

The prompt I send along with those two outputs:

Here’s pg_replication_slots from the primary showing a slot with active = false retaining 60GB of WAL, plus the standby’s log showing “requested WAL segment has already been removed.” Explain the failure mode, tell me whether the standby can recover by reconnecting or must be rebuilt, and give me the safest sequence to recover without filling the primary’s disk.

The model reliably recognizes the “WAL already removed” failure (the standby fell so far behind that the primary recycled WAL it still needed) and walks you through the rebuild. An inactive slot hoarding WAL is its own emergency: it can fill the primary’s disk and take the primary down, so you sometimes have to drop the slot to save the primary even though that means rebuilding the standby.
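
Since an inactive slot is the thing that takes the primary down, it is worth turning the slot query into a check that can fail loudly. This is a rough sketch of that idea: flag_stuck_slots is a hypothetical helper, the 10 GiB default limit is an assumption you should size against your own disk, and the input format is what psql emits in unaligned, tuples-only mode with a pipe separator (booleans print as t or f).

```shell
# Hypothetical helper: flag inactive slots retaining more than <limit> bytes of WAL.
# Reads lines of the form slot_name|active|retained_bytes on stdin, for example from:
#   psql -AtF'|' -c "SELECT slot_name, active,
#                           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#                    FROM pg_replication_slots"
# Exits 1 if any slot is flagged, 0 otherwise, so it can drive an alert.
flag_stuck_slots() (
  limit=${1:-10737418240}   # assumed default: 10 GiB
  awk -F'|' -v limit="$limit" '
    # inactive slot ("f") holding back more WAL than the limit
    $2 == "f" && ($3 + 0) > (limit + 0) {
      printf "STUCK %s retained_bytes=%s\n", $1, $3
      found = 1
    }
    END { exit (found ? 1 : 0) }
  '
)
```

The check only reports. Dropping the slot stays a human decision, because it commits you to rebuilding the standby.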

Logical replication essentials

For logical, you create a publication on the source and a subscription on the target, and you watch the subscription worker for conflicts.

-- source
CREATE PUBLICATION orders_pub FOR TABLE orders, customers;
-- target
CREATE SUBSCRIPTION orders_sub
  CONNECTION 'host=source.internal dbname=proddb user=replicator'
  PUBLICATION orders_pub;

-- target: is the subscription healthy?
SELECT subname, received_lsn, latest_end_lsn,
       latest_end_time FROM pg_stat_subscription;

Logical replication’s classic failure is a conflict — a row the subscriber can’t apply, usually a primary-key collision or a missing replica identity — which stalls the whole subscription. AI is good at reading the worker error and pointing at the offending table, but you verify by checking pg_stat_subscription actually advances after the fix.
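
Checking that the subscription "actually advances" means comparing two LSN snapshots numerically. Comparing them as strings gives wrong answers once the hex halves change length or the high half rolls over. A small sketch, with lsn_advanced as a hypothetical helper name; it assumes 64-bit shell arithmetic and that you sample received_lsn from the subscription's main apply worker.

```shell
# Hypothetical helper: succeed if the second LSN is strictly ahead of the first.
# Usage: lsn_advanced <lsn_before> <lsn_after>
lsn_advanced() (
  before_hi=${1%%/*}; before_lo=${1##*/}
  after_hi=${2%%/*};  after_lo=${2##*/}
  # An LSN is a 64-bit position: high 32 bits / low 32 bits, both in hex
  before=$(( (0x$before_hi << 32) + 0x$before_lo ))
  after=$(( (0x$after_hi << 32) + 0x$after_lo ))
  [ "$after" -gt "$before" ]
)

# Example, sampling a few seconds apart on the target
# (relid IS NULL is assumed here to select the main apply worker row):
#   q="SELECT received_lsn FROM pg_stat_subscription
#      WHERE subname = 'orders_sub' AND relid IS NULL"
#   a=$(psql -At -c "$q"); sleep 10; b=$(psql -At -c "$q")
#   lsn_advanced "$a" "$b" && echo "advancing" || echo "not advancing"
```

A source with no writes will also look like it is not advancing, so make a known change on a published table before reading too much into a flat result.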

The standing rule

Set up replication with AI’s help, then alert on lag in bytes and on inactive slots — those two alerts catch nearly every silent failure. When something stalls, the catalog (pg_stat_replication, pg_replication_slots, pg_stat_subscription) tells you the truth; AI translates the terse state into a recovery plan in minutes. It does not get to drop your slots or rebuild your standby on its own — those are decisions with data-loss consequences. More operational Postgres material is in the Postgres guides, and my replication-triage prompts live in the prompt library.