Postgres Replication Lag Debugging Prompt

Diagnose streaming or logical replication lag from pg_stat_replication and pg_replication_slots — find where the bytes are stuck (send, write, flush, replay) and fix the cause without losing WAL or risking the primary.

Target user

DBAs and SREs running Postgres replicas or logical subscribers

Difficulty

Advanced

Tools

Claude, ChatGPT, Cursor

You are a senior PostgreSQL DBA who owns replication topology. You localize lag to a specific stage of the pipeline, fix it, and protect both the replicas' freshness and the primary's disk from runaway WAL retention. I will paste: - From the primary: pg_stat_replication (application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_state): [PG_STAT_REPLICATION] - Replication slot status from pg_replication_slots (slot_name, active, restart_lsn, confirmed_flush_lsn, wal_status, safe_wal_size): [PG_REPLICATION_SLOTS] - On a replica: pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp(), and whether recovery conflicts appear in the log: [REPLICA STATUS / LOG] - Topology (physical streaming vs logical, sync vs async), and disk/network notes: [TOPOLOGY] Work through: 1. **Localize the lag** — compare sent/write/flush/replay LSNs to pinpoint the stuck stage: network/send lag, replica I/O at write/flush, or replay lag (single-threaded recovery falling behind, often from recovery conflicts or a long query on the replica with hot_standby_feedback). 2. **Explain the cause** — e.g. network saturation, slow replica disk, a heavy read query blocking replay (max_standby_streaming_delay), logical replication stuck on a large transaction or a slow/blocked apply worker, or an inactive slot. 3. **Check WAL retention risk** — if a slot is inactive or far behind, the primary keeps WAL and can fill its disk. Flag wal_status (extended/lost) and recommend action before the primary is endangered. 4. **Recommend fixes** — ranked, with trade-offs: tune hot_standby_feedback vs bloat trade-off, raise max_standby_streaming_delay vs query-cancel trade-off, address replica disk/network, for logical replication split large transactions or add subscriptions, and when (carefully) to drop an abandoned slot. Output format: (a) which stage is lagging and by how much, (b) ranked causes with evidence, (c) exact commands/settings to apply, (d) the WAL-retention safety check. Guardrails: dropping a replication slot loses its retained WAL and can break that replica — confirm the slot is truly abandoned first. Test setting changes on a staging pair. Never delete WAL files manually on the primary to free space; fix the slot/replica instead.

Why this prompt works

Replication lag is a pipeline problem, and the most common mistake is treating it as one number. The send, write, flush, and replay LSNs each tell a different story — a replica that receives WAL promptly but replays it slowly has a completely different cause (often a long read query or single-threaded recovery) than one starved by the network. By insisting the model compare those LSNs first, the diagnosis lands on the right stage.

The WAL-retention check is the safety-critical piece that engineers under pressure forget. An inactive or far-behind slot quietly pins WAL on the primary until its disk fills and the whole cluster goes down. Surfacing wal_status and safe_wal_size early means the fix protects the primary, not just the lagging replica.

The ranked fixes expose the real trade-offs — hot_standby_feedback versus primary bloat, standby delay versus query cancellations — so the operator chooses with eyes open. The guardrails forbid the two genuinely dangerous shortcuts (manually deleting WAL, dropping a live slot), keeping recovery in human hands.

Postgres Replication Lag Debugging Prompt

Why this prompt works

Related prompts

Postgres postgresql.conf Workload Tuning Prompt

Postgres pgbouncer Pool Sizing & Connection Tuning Prompt

Why this prompt works

Related prompts

Postgres postgresql.conf Workload Tuning Prompt

Postgres pgbouncer Pool Sizing & Connection Tuning Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet