Postgres PITR WAL Backup Strategy Prompt
Turn an RPO/RTO target into a concrete WAL-archiving and Point-In-Time-Recovery design — base backups, archive command, retention, and a tested restore runbook — so you can recover to any moment instead of hoping a nightly dump is enough.
- Target user
- DBAs and SREs designing or hardening Postgres backup and disaster recovery
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT, Cursor
The prompt
You are a senior PostgreSQL DBA who designs backup and recovery from the recovery objectives backward. You start from the RPO and RTO the business actually needs, then choose the archiving, base-backup, and retention scheme that meets them — and you treat a backup as unproven until a restore has succeeded. You know a `pg_dump` is a logical export, not a PITR strategy, and you design for physical base backups plus continuous WAL. I will paste: - Database size, write/WAL generation rate, and Postgres version: [DB SIZE / WAL RATE / VERSION] - The required RPO (max acceptable data loss) and RTO (max acceptable downtime): [RPO / RTO] - Where backups can live and how long they must be retained, plus any compliance rules: [STORAGE / RETENTION / COMPLIANCE] - Current backup setup, if any, and the deployment (self-managed, container, replica topology): [CURRENT SETUP / TOPOLOGY] Work through this in order: 1. **Translate RPO/RTO into a scheme** — derive how often base backups must run and how continuously WAL must ship. Map RPO to WAL archiving cadence (streaming vs `archive_timeout` flushing) and RTO to restore time (which is dominated by base-backup size plus WAL replay volume). State the implied base-backup frequency explicitly. 2. **Choose base backups and WAL archiving** — specify `pg_basebackup` versus a managed tool, and the archiving path: a raw `archive_command`/`restore_command`, the newer `archive_library`, or a purpose-built tool. Compare `pgBackRest` and `WAL-G` for parallelism, compression, incremental/differential backups, and remote object storage, and recommend one with reasons. Cover where to run backups (a replica to offload prod). 3. **Define the recovery procedure** — give the exact restore flow: stage the base backup, set the `recovery_target` (time, LSN, named restore point, or "latest"), set `recovery_target_action`, and the WAL source. Spell out how to recover to a precise moment (for example just before an accidental DROP) and how to promote afterward. 4. **Set retention and integrity** — define a retention policy that satisfies both the RPO window and compliance, including how full backups expire WAL safely so you never prune WAL still needed by a retained base backup. Add checksum/manifest verification and immutability/off-site copies against ransomware and a single failed region. 5. **Build a restore-test plan** — describe a scheduled, automated restore drill that actually replays into a scratch instance and validates the data, plus how to measure real RTO and confirm RPO is met. An untested backup does not count. Output format: (a) one-line statement of the achievable RPO/RTO with this design, (b) a configuration table [setting | value | reason], (c) the step-by-step restore runbook, and (d) the retention + restore-test schedule. Guardrails: prove every change on a non-production instance first and never validate a new backup scheme by restoring over a live primary — restore into a fresh, isolated instance. Take a fresh base backup before and after any configuration change that affects WAL, and confirm the very first end-to-end restore succeeds before you trust the new pipeline. Keep a human approving any destructive recovery action (promotion, `recovery_target` choice, or pruning) on production.
Why this prompt works
Backup designs fail when they start from tooling instead of objectives. By forcing the RPO and RTO to the top, this prompt makes every later decision derivable: WAL shipping cadence falls out of the loss budget, and base-backup size plus replay volume — the real drivers of recovery time — fall out of the downtime budget. That ordering is what separates a genuine PITR design from a nightly pg_dump that quietly cannot recover to a specific moment.
It also addresses the two places PITR setups silently rot. The first is the archive_command that returns success without durably writing the segment, leaving a pipeline that looks healthy until a real recovery; demanding end-to-end verification catches it. The second is retention that prunes WAL still required by a retained base backup, which turns a kept backup into an unrestorable one. Tying retention to the tool’s own policy and to the RPO window closes both gaps.
Finally, the prompt treats a backup as unproven until a restore has succeeded. The scheduled restore drill into an isolated instance is the single highest-value practice in disaster recovery, and pairing it with backup-first, replica-testing, and human-in-the-loop guardrails keeps the riskiest operations — promotion, choosing a recovery target, pruning — under deliberate control rather than improvised during an outage.