Automation Backfill and Replay Job Design Prompt
Design a controlled backfill or event-replay job to reprocess a window of historical data or missed events after a bug fix or outage, with throttling, checkpointing, and dedupe so the catch-up run does not overwhelm downstream systems or double-apply side effects.
- Target user: Platform engineers running large-scale catch-up and reprocessing jobs
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior data/automation engineer who has run a backfill that re-sent six weeks of notifications to customers because replay wasn't idempotent.

I will provide:
- What needs reprocessing (the window, the records/events, and why)
- The processing path and its side effects (writes, emails, downstream triggers)
- Live traffic volume the same path is serving right now
- Available checkpoint/state storage and any ordering requirements

Your job:
1. **Scope and selection**: define the exact record/event set and window, how it is enumerated, and how to confirm the count before any processing starts.
2. **Side-effect isolation**: classify each side effect as safe-to-replay, must-suppress (e.g. customer notifications), or needs-dedupe, and specify how each is handled during backfill vs. live.
3. **Throttling and isolation**: set a rate cap and concurrency limit that protects live traffic, ideally on a separate worker pool/queue so the backfill cannot starve real-time processing.
4. **Checkpointing and resume**: design durable progress checkpoints so the job is resumable and idempotent on restart, never reprocessing committed ranges.
5. **Dedupe against live**: ensure the backfill and live consumers cannot both process the same item, using a shared dedupe store or a clean cutoff boundary.
6. **Dry-run and sampling**: define a dry-run mode that reports what would change, plus a small canary batch validated before the full run.
7. **Monitoring and abort**: specify progress/error metrics, a stall alert, and a clean kill switch that stops the run without leaving partial state.

Output as: a run plan (scope, rate, concurrency), a side-effect handling table, a checkpoint schema, and a go/no-go + abort checklist.

Require a dry-run and a canary batch with explicit human sign-off before the full backfill, and document how to identify and reverse any side effect that was incorrectly applied.
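
Example of the expected output shape (illustrative, not part of the prompt text). The checkpointing, throttling, dedupe, dry-run, and kill-switch steps above can be sketched roughly as follows. This is a minimal single-process sketch under stated assumptions: record ids are integers enumerated in ascending order, the checkpoint is a local JSON file, and `seen` is an in-memory set standing in for a shared, durable dedupe store. `Checkpoint`, `run_backfill`, and every parameter name are hypothetical and chosen for illustration only.

```python
import json
import os
import tempfile
import time
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, Set


@dataclass
class Checkpoint:
    """Hypothetical checkpoint schema: one record per backfill job."""
    job_id: str
    last_committed_id: int = -1  # highest record id already handled
    processed: int = 0           # records applied (or that would be, in dry-run)
    skipped_dupes: int = 0       # records already present in the dedupe store


def load_checkpoint(path: str, job_id: str) -> Checkpoint:
    if os.path.exists(path):
        with open(path) as f:
            return Checkpoint(**json.load(f))
    return Checkpoint(job_id=job_id)


def save_checkpoint(path: str, cp: Checkpoint) -> None:
    # Write to a temp file, then atomically rename, so a crash cannot
    # leave a half-written checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(cp), f)
    os.replace(tmp, path)


def run_backfill(
    record_ids: Iterable[int],           # assumed sorted ascending
    apply: Callable[[int], None],        # the side-effecting step
    seen: Set[int],                      # stand-in for a shared dedupe store
    path: str,
    job_id: str,
    max_per_sec: float = 50.0,           # rate cap protecting live traffic
    batch_size: int = 100,               # records between checkpoint writes
    should_abort: Callable[[], bool] = lambda: False,  # kill switch
    dry_run: bool = False,
) -> Checkpoint:
    cp = load_checkpoint(path, job_id)
    min_interval = 1.0 / max_per_sec
    since_commit = 0
    for rid in record_ids:
        if should_abort():
            break
        if rid <= cp.last_committed_id:
            continue  # already committed by an earlier run
        if rid in seen:
            cp.skipped_dupes += 1  # live path (or an earlier run) handled it
        else:
            if not dry_run:
                apply(rid)
                seen.add(rid)
            cp.processed += 1
            time.sleep(min_interval)  # crude throttle: at most max_per_sec
        cp.last_committed_id = rid
        since_commit += 1
        if since_commit >= batch_size and not dry_run:
            save_checkpoint(path, cp)
            since_commit = 0
    if not dry_run:
        save_checkpoint(path, cp)  # final commit, also reached on abort
    return cp
```

In this sketch a crash between two checkpoint writes can replay up to `batch_size` records on restart, so it relies on the dedupe store being durable to avoid double-applying them. A dry run returns the counts that would result without calling `apply` or writing the checkpoint.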