Postgres Point-in-Time Recovery and WAL Archiving With AI

A nightly pg_dump is not a backup strategy. It is a way to lose up to 24 hours of data and find out at the worst possible moment. The grown-up answer for any database you care about is point-in-time recovery: a base backup plus a continuous stream of write-ahead log segments that lets you restore to any moment, including the second before someone ran the unfiltered DELETE. This guide covers how PITR actually works, how to wire up WAL archiving, and how to use AI to draft the recovery runbook without trusting it blindly, because a recovery procedure you have never executed is just a hopeful comment in a wiki.

How PITR works

Postgres records every change to a write-ahead log before applying it to the data files. WAL is what makes the database crash-safe. PITR exploits the same mechanism: if you have a consistent snapshot of the data directory taken at time T, plus every WAL segment generated since T, you can replay that log forward to any point you choose. Restore the base backup, feed Postgres the WAL stream, tell it where to stop, and it reconstructs the exact state at that instant.

Two pieces make this work. The base backup is a physical copy of the data directory taken with pg_basebackup while the server runs. WAL archiving is the continuous offloading of completed 16 MB WAL segments to durable storage as they fill. Your recovery floor, the oldest point you can restore to, is your oldest base backup. Your ceiling is the last archived WAL segment. The gap between “now” and your last safely archived segment is your real recovery point objective, and it is usually larger than people think.

Turning on WAL archiving

Archiving lives in postgresql.conf. You need wal_level high enough to support archiving and an archive_command that copies each finished segment somewhere durable.

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
archive_timeout = 300

The archive_command runs once per completed segment. %p is the path to the segment, %f is its filename. The test ! -f guard refuses to overwrite an existing archived segment, which is a safety property, not a nicety: an archive_command that returns success without durably storing the segment will cause Postgres to recycle that WAL, and the segment is gone forever. The command must return zero only when the data is genuinely safe.

That cp to a local mount is a teaching example. In production you archive to object storage with checksums and retention, which is why nobody hand-rolls archive_command for real anymore. More on that below. The archive_timeout = 300 forces a segment switch every five minutes even on a quiet database, so your worst-case data loss on an idle system is bounded at five minutes rather than “whenever the next segment happens to fill.”

Taking a base backup

With archiving running, take a base backup that pairs with the WAL stream.

pg_basebackup \
  --pgdata=/mnt/backups/base_$(date +%Y%m%d) \
  --format=tar --gzip \
  --wal-method=stream \
  --checkpoint=fast \
  --progress --verbose

--wal-method=stream opens a second connection to capture WAL generated during the backup itself, so the backup is self-consistent even before you touch the archive. --checkpoint=fast triggers an immediate checkpoint instead of waiting, which shortens the backup window at the cost of a brief I/O spike. Run this on a schedule, and crucially, monitor that it finishes. A base backup job that has been silently failing for three weeks is the classic way to discover your recovery floor is a month old.

Restoring to a recovery target

Recovery is where the planning pays off. You restore the base backup into a fresh data directory, then tell Postgres how far to replay. Recovery settings moved into postgresql.conf in version 12, and the presence of a recovery.signal file is what puts the server into recovery mode.

# postgresql.conf on the restore target
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2026-06-23 14:25:00+00'
recovery_target_action = 'promote'

# Restore the base backup, then create the signal file
tar -xzf /mnt/backups/base_20260623/base.tar.gz -C /var/lib/postgresql/restore/
touch /var/lib/postgresql/restore/recovery.signal
pg_ctl -D /var/lib/postgresql/restore/ start

Postgres restores the base backup, then calls restore_command to pull WAL segments and replay them until it reaches recovery_target_time, at which point recovery_target_action = 'promote' brings the database up as a normal read-write server. You can target a time, a named restore point, an LSN, or a transaction ID. Time is the most common because incidents are described in clock time. The subtlety that bites people: the target must fall within the WAL range you have archived, and the base backup must predate the target. Get the ordering wrong and recovery either undershoots or refuses to start.

RPO, RTO, and the numbers that matter

Two objectives drive every backup decision. Recovery point objective is how much data you can afford to lose, measured in time. Recovery time objective is how long you can afford to be down. WAL archiving with a five-minute archive_timeout gives you an RPO of roughly five minutes on an idle system and much tighter on a busy one. RTO is dominated by two things: how long it takes to retrieve the base backup, and how much WAL has to replay on top of it. A weekly base backup means replaying up to seven days of WAL, which can take hours. Daily or twice-daily base backups shrink the replay window and your RTO with it.

I had an assistant estimate the recovery window for a given setup and it was a useful sanity check, with a caveat.

Prompt: Given daily base backups and continuous WAL archiving, what is my worst-case data loss and roughly how long will recovery take for a 200 GB database?

Output (excerpt): “Worst-case data loss is bounded by your archive_timeout plus the time for the final segment to reach durable storage. Recovery time is base-backup retrieval (200 GB over your network link) plus WAL replay for up to ~24 hours of accumulated segments. Measure both; do not assume.”

That is the right shape of answer, and notice it refused to invent a number. The actual figures depend entirely on your storage throughput and write volume, so the AI correctly handed the measurement back to you. Use it to structure the runbook and surface the variables; do the timing with a real restore. The Postgres PITR and WAL backup strategy prompt is built to drive exactly this conversation, walking through RPO/RTO targets and turning them into concrete backup cadence.

Test the restore, or you do not have a backup

This is the part everyone skips and the only part that matters. A backup you have never restored is a hypothesis. Schedule an automated restore drill: spin up a throwaway instance, restore the latest base backup, replay WAL to a target, and run a checksum or row-count assertion against known data.

#!/usr/bin/env bash
set -euo pipefail
TARGET_DIR=/var/lib/postgresql/drill
rm -rf "$TARGET_DIR"; mkdir -p "$TARGET_DIR"
tar -xzf "$(ls -t /mnt/backups/base_*/base.tar.gz | head -1)" -C "$TARGET_DIR"
# ...write recovery settings + recovery.signal, start, wait for promotion...
psql -h /tmp -p 5433 -d appdb -c "SELECT count(*) FROM orders;" \
  | grep -q -E '[0-9]+' && echo "RESTORE DRILL PASSED"

Run that on a timer and alert when it fails. The day you actually need PITR, you want muscle memory and a procedure proven on last night’s data, not a first attempt under pressure.

Use a real tool: pgBackRest or WAL-G

Hand-rolled archive_command with cp works for learning and falls over in production: no parallelism, no compression, no retention management, no integrity verification, no incremental backups. pgBackRest and WAL-G solve all of that. They handle parallel compressed uploads to S3-compatible storage, incremental and differential backups, automatic retention expiration, encryption, and end-to-end checksums, with a single pgbackrest restore --type=time --target=... doing the work of the entire manual procedure above. Understand the primitives first so you can reason about what the tool is doing, then let the tool do it.

AI is excellent for drafting the config, the recovery runbook, and the drill harness, and for catching the ordering mistakes that make recovery fail. It cannot tell you your restore works. Only a tested restore does that, so test it on a schedule and treat a failed drill as a page.