Skip to content
DevOps AI ToolKit
Newsletter
All prompts
AI for Postgres Difficulty: Advanced ClaudeChatGPTCursor

Postgres PITR WAL Backup Strategy Prompt

Turn an RPO/RTO target into a concrete WAL-archiving and Point-In-Time-Recovery design — base backups, archive command, retention, and a tested restore runbook — so you can recover to any moment instead of hoping a nightly dump is enough.

Target user
DBAs and SREs designing or hardening Postgres backup and disaster recovery
Difficulty
Advanced
Tools
Claude, ChatGPT, Cursor

The prompt

You are a senior PostgreSQL DBA who designs backup and recovery from the recovery
objectives backward. You start from the RPO and RTO the business actually needs, then
choose the archiving, base-backup, and retention scheme that meets them — and you treat a
backup as unproven until a restore has succeeded. You know a `pg_dump` is a logical export,
not a PITR strategy, and you design for physical base backups plus continuous WAL.

I will paste:
- Database size, write/WAL generation rate, and Postgres version: [DB SIZE / WAL RATE / VERSION]
- The required RPO (max acceptable data loss) and RTO (max acceptable downtime): [RPO / RTO]
- Where backups can live and how long they must be retained, plus any compliance rules: [STORAGE / RETENTION / COMPLIANCE]
- Current backup setup, if any, and the deployment (self-managed, container, replica topology): [CURRENT SETUP / TOPOLOGY]

Work through this in order:

1. **Translate RPO/RTO into a scheme** — derive how often base backups must run and how
   continuously WAL must ship. Map RPO to WAL archiving cadence (streaming vs
   `archive_timeout` flushing) and RTO to restore time (which is dominated by base-backup
   size plus WAL replay volume). State the implied base-backup frequency explicitly.

2. **Choose base backups and WAL archiving** — specify `pg_basebackup` versus a managed
   tool, and the archiving path: a raw `archive_command`/`restore_command`, the newer
   `archive_library`, or a purpose-built tool. Compare `pgBackRest` and `WAL-G` for
   parallelism, compression, incremental/differential backups, and remote object storage,
   and recommend one with reasons. Cover where to run backups (a replica to offload prod).

3. **Define the recovery procedure** — give the exact restore flow: stage the base backup,
   set the `recovery_target` (time, LSN, named restore point, or "latest"), set
   `recovery_target_action`, and the WAL source. Spell out how to recover to a precise
   moment (for example just before an accidental DROP) and how to promote afterward.

4. **Set retention and integrity** — define a retention policy that satisfies both the RPO
   window and compliance, including how full backups expire WAL safely so you never prune
   WAL still needed by a retained base backup. Add checksum/manifest verification and
   immutability/off-site copies against ransomware and a single failed region.

5. **Build a restore-test plan** — describe a scheduled, automated restore drill that
   actually replays into a scratch instance and validates the data, plus how to measure
   real RTO and confirm RPO is met. An untested backup does not count.

Output format: (a) one-line statement of the achievable RPO/RTO with this design, (b) a
configuration table [setting | value | reason], (c) the step-by-step restore runbook, and
(d) the retention + restore-test schedule.

Guardrails: prove every change on a non-production instance first and never validate a new
backup scheme by restoring over a live primary — restore into a fresh, isolated instance.
Take a fresh base backup before and after any configuration change that affects WAL, and
confirm the very first end-to-end restore succeeds before you trust the new pipeline. Keep
a human approving any destructive recovery action (promotion, `recovery_target` choice, or
pruning) on production.

Why this prompt works

Backup designs fail when they start from tooling instead of objectives. By forcing the RPO and RTO to the top, this prompt makes every later decision derivable: WAL shipping cadence falls out of the loss budget, and base-backup size plus replay volume — the real drivers of recovery time — fall out of the downtime budget. That ordering is what separates a genuine PITR design from a nightly pg_dump that quietly cannot recover to a specific moment.

It also addresses the two places PITR setups silently rot. The first is the archive_command that returns success without durably writing the segment, leaving a pipeline that looks healthy until a real recovery; demanding end-to-end verification catches it. The second is retention that prunes WAL still required by a retained base backup, which turns a kept backup into an unrestorable one. Tying retention to the tool’s own policy and to the RPO window closes both gaps.

Finally, the prompt treats a backup as unproven until a restore has succeeded. The scheduled restore drill into an isolated instance is the single highest-value practice in disaster recovery, and pairing it with backup-first, replica-testing, and human-in-the-loop guardrails keeps the riskiest operations — promotion, choosing a recovery target, pruning — under deliberate control rather than improvised during an outage.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 2,104 DevOps AI prompts
  • One practical workflow email per week