Postgres Replication Lag Debugging Prompt
Diagnose streaming or logical replication lag from pg_stat_replication and pg_replication_slots — find where the bytes are stuck (send, write, flush, replay) and fix the cause without losing WAL or risking the primary.
- Target user
- DBAs and SREs running Postgres replicas or logical subscribers
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT, Cursor
The prompt
You are a senior PostgreSQL DBA who owns replication topology. You localize lag to a specific stage of the pipeline, fix it, and protect both the replicas' freshness and the primary's disk from runaway WAL retention. I will paste: - From the primary: pg_stat_replication (application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_state): [PG_STAT_REPLICATION] - Replication slot status from pg_replication_slots (slot_name, active, restart_lsn, confirmed_flush_lsn, wal_status, safe_wal_size): [PG_REPLICATION_SLOTS] - On a replica: pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp(), and whether recovery conflicts appear in the log: [REPLICA STATUS / LOG] - Topology (physical streaming vs logical, sync vs async), and disk/network notes: [TOPOLOGY] Work through: 1. **Localize the lag** — compare sent/write/flush/replay LSNs to pinpoint the stuck stage: network/send lag, replica I/O at write/flush, or replay lag (single-threaded recovery falling behind, often from recovery conflicts or a long query on the replica with hot_standby_feedback). 2. **Explain the cause** — e.g. network saturation, slow replica disk, a heavy read query blocking replay (max_standby_streaming_delay), logical replication stuck on a large transaction or a slow/blocked apply worker, or an inactive slot. 3. **Check WAL retention risk** — if a slot is inactive or far behind, the primary keeps WAL and can fill its disk. Flag wal_status (extended/lost) and recommend action before the primary is endangered. 4. **Recommend fixes** — ranked, with trade-offs: tune hot_standby_feedback vs bloat trade-off, raise max_standby_streaming_delay vs query-cancel trade-off, address replica disk/network, for logical replication split large transactions or add subscriptions, and when (carefully) to drop an abandoned slot. Output format: (a) which stage is lagging and by how much, (b) ranked causes with evidence, (c) exact commands/settings to apply, (d) the WAL-retention safety check. Guardrails: dropping a replication slot loses its retained WAL and can break that replica — confirm the slot is truly abandoned first. Test setting changes on a staging pair. Never delete WAL files manually on the primary to free space; fix the slot/replica instead.
Why this prompt works
Replication lag is a pipeline problem, and the most common mistake is treating it as one number. The send, write, flush, and replay LSNs each tell a different story — a replica that receives WAL promptly but replays it slowly has a completely different cause (often a long read query or single-threaded recovery) than one starved by the network. By insisting the model compare those LSNs first, the diagnosis lands on the right stage.
The WAL-retention check is the safety-critical piece that engineers under pressure forget. An inactive or far-behind slot quietly pins WAL on the primary until its disk fills and the whole cluster goes down. Surfacing wal_status and safe_wal_size early means the fix protects the primary, not just the lagging replica.
The ranked fixes expose the real trade-offs — hot_standby_feedback versus primary bloat, standby delay versus query cancellations — so the operator chooses with eyes open. The guardrails forbid the two genuinely dangerous shortcuts (manually deleting WAL, dropping a live slot), keeping recovery in human hands.
Related prompts
-
Postgres postgresql.conf Workload Tuning Prompt
Get a reviewed postgresql.conf tuning plan for your specific hardware and workload — memory, WAL/checkpoint, planner, and autovacuum settings explained one by one, with how to verify each took effect.
-
Postgres pgbouncer Pool Sizing & Connection Tuning Prompt
Size pgbouncer pools and pick a pooling mode for your app's connection behavior — so you stop exhausting max_connections, cut connection overhead, and avoid the subtle bugs transaction pooling introduces.