Skip to content
CloudOps
Newsletter
All prompts
AI for Postgres Difficulty: Advanced ClaudeChatGPTCursor

Postgres Replication Lag Debugging Prompt

Diagnose streaming or logical replication lag from pg_stat_replication and pg_replication_slots — find where the bytes are stuck (send, write, flush, replay) and fix the cause without losing WAL or risking the primary.

Target user
DBAs and SREs running Postgres replicas or logical subscribers
Difficulty
Advanced
Tools
Claude, ChatGPT, Cursor

The prompt

You are a senior PostgreSQL DBA who owns replication topology. You localize lag to a
specific stage of the pipeline, fix it, and protect both the replicas' freshness and the
primary's disk from runaway WAL retention.

I will paste:
- From the primary: pg_stat_replication (application_name, state, sent_lsn, write_lsn,
  flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_state): [PG_STAT_REPLICATION]
- Replication slot status from pg_replication_slots (slot_name, active, restart_lsn,
  confirmed_flush_lsn, wal_status, safe_wal_size): [PG_REPLICATION_SLOTS]
- On a replica: pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(),
  pg_last_xact_replay_timestamp(), and whether recovery conflicts appear in the log:
  [REPLICA STATUS / LOG]
- Topology (physical streaming vs logical, sync vs async), and disk/network notes:
  [TOPOLOGY]

Work through:

1. **Localize the lag** — compare sent/write/flush/replay LSNs to pinpoint the stuck stage:
   network/send lag, replica I/O at write/flush, or replay lag (single-threaded recovery
   falling behind, often from recovery conflicts or a long query on the replica with
   hot_standby_feedback).

2. **Explain the cause** — e.g. network saturation, slow replica disk, a heavy read query
   blocking replay (max_standby_streaming_delay), logical replication stuck on a large
   transaction or a slow/blocked apply worker, or an inactive slot.

3. **Check WAL retention risk** — if a slot is inactive or far behind, the primary keeps
   WAL and can fill its disk. Flag wal_status (extended/lost) and recommend action before
   the primary is endangered.

4. **Recommend fixes** — ranked, with trade-offs: tune hot_standby_feedback vs bloat
   trade-off, raise max_standby_streaming_delay vs query-cancel trade-off, address replica
   disk/network, for logical replication split large transactions or add subscriptions, and
   when (carefully) to drop an abandoned slot.

Output format: (a) which stage is lagging and by how much, (b) ranked causes with evidence,
(c) exact commands/settings to apply, (d) the WAL-retention safety check.

Guardrails: dropping a replication slot loses its retained WAL and can break that replica —
confirm the slot is truly abandoned first. Test setting changes on a staging pair. Never
delete WAL files manually on the primary to free space; fix the slot/replica instead.

Why this prompt works

Replication lag is a pipeline problem, and the most common mistake is treating it as one number. The send, write, flush, and replay LSNs each tell a different story — a replica that receives WAL promptly but replays it slowly has a completely different cause (often a long read query or single-threaded recovery) than one starved by the network. By insisting the model compare those LSNs first, the diagnosis lands on the right stage.

The WAL-retention check is the safety-critical piece that engineers under pressure forget. An inactive or far-behind slot quietly pins WAL on the primary until its disk fills and the whole cluster goes down. Surfacing wal_status and safe_wal_size early means the fix protects the primary, not just the lagging replica.

The ranked fixes expose the real trade-offs — hot_standby_feedback versus primary bloat, standby delay versus query cancellations — so the operator chooses with eyes open. The guardrails forbid the two genuinely dangerous shortcuts (manually deleting WAL, dropping a live slot), keeping recovery in human hands.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week