
Automation Backfill and Replay Job Design Prompt

Design a controlled backfill or event-replay job to reprocess a window of historical data or missed events after a bug fix or outage, with throttling, checkpointing, and dedupe so the catch-up run does not overwhelm downstreams or double-apply side effects.

Target user: Platform engineers running large-scale catch-up and reprocessing jobs
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior data/automation engineer who has run a backfill that re-sent six weeks of notifications to customers because replay wasn't idempotent.

I will provide:
- What needs reprocessing (the window, the records/events, and why)
- The processing path and its side effects (writes, emails, downstream triggers)
- The live traffic volume that the same path is currently serving
- Available checkpoint/state storage and any ordering requirements

Your job:

1. **Scope and selection** — define the exact record/event set and window, how it is enumerated, and how to confirm the count before any processing starts.
2. **Side-effect isolation** — classify each side effect as safe-to-replay, must-suppress (e.g. customer notifications), or needs-dedupe, and specify how each is handled during backfill vs. live.
3. **Throttling and isolation** — set a rate cap and concurrency limit that protects live traffic, ideally on a separate worker pool/queue so the backfill cannot starve real-time processing.
4. **Checkpointing and resume** — design durable progress checkpoints so the job is resumable and idempotent on restart, never reprocessing committed ranges.
5. **Dedupe against live** — ensure the backfill and live consumers cannot both process the same item, using a shared dedupe store or a clean cutoff boundary.
6. **Dry-run and sampling** — define a dry-run mode that reports what would change, plus a small canary batch validated before the full run.
7. **Monitoring and abort** — specify progress/error metrics, a stall alert, and a clean kill switch that stops the run without leaving partial state.

Output as: a run plan (scope, rate, concurrency), a side-effect handling table, a checkpoint schema, and a go/no-go + abort checklist.

Require a dry-run and a canary batch with explicit human sign-off before the full backfill, and document how to identify and reverse any side effect that was incorrectly applied.
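
To illustrate the kind of design this prompt is asking for, here is a minimal Python sketch that combines a resumable checkpoint, idempotency-key dedupe, notification suppression, a rate cap, a dry-run mode, and a kill switch checked at batch boundaries. Everything in it is an assumption made for illustration: the names `BackfillState` and `run_backfill`, the record fields `id`, `idempotency_key`, and `effect`, and the in-memory set and list that stand in for durable checkpoint and dedupe storage. It is not a production implementation and is not output from either tool.

```python
import time
from dataclasses import dataclass, field


@dataclass
class BackfillState:
    """In-memory stand-in for durable checkpoint and shared dedupe storage (hypothetical)."""
    last_committed_id: int = 0                    # highest id in a fully committed batch
    seen_keys: set = field(default_factory=set)   # idempotency keys already applied by live or backfill
    applied: list = field(default_factory=list)   # ids whose side effect was replayed


def run_backfill(records, state, *, max_per_second=100.0, batch_size=2,
                 dry_run=True, should_abort=lambda: False, sleep=time.sleep):
    """Replay records in id order, resuming after state.last_committed_id.

    'write' effects are replayed, 'notify' effects are suppressed, and any
    record whose idempotency key is already in the dedupe store is skipped.
    """
    report = {"would_apply": 0, "applied": 0, "suppressed": 0,
              "duplicates": 0, "aborted": False}
    pending = sorted(
        (r for r in records if r["id"] > state.last_committed_id),
        key=lambda r: r["id"],
    )
    interval = 1.0 / max_per_second  # simple fixed-delay rate cap
    for start in range(0, len(pending), batch_size):
        if should_abort():
            # Kill switch is checked only between batches, so no batch is left half-done.
            report["aborted"] = True
            break
        batch = pending[start:start + batch_size]
        for rec in batch:
            if rec["idempotency_key"] in state.seen_keys:
                report["duplicates"] += 1
            elif rec["effect"] == "notify":
                report["suppressed"] += 1
            elif dry_run:
                report["would_apply"] += 1
            else:
                state.applied.append(rec["id"])
                state.seen_keys.add(rec["idempotency_key"])
                report["applied"] += 1
            sleep(interval)
        if not dry_run:
            # Advance the checkpoint only after the whole batch is done.
            state.last_committed_id = batch[-1]["id"]
    return report


records = [
    {"id": 1, "idempotency_key": "a", "effect": "write"},
    {"id": 2, "idempotency_key": "b", "effect": "notify"},
    {"id": 3, "idempotency_key": "c", "effect": "write"},  # already handled by live traffic
    {"id": 4, "idempotency_key": "d", "effect": "write"},
]
state = BackfillState(seen_keys={"c"})
no_sleep = lambda _seconds: None  # skip real delays in this demo

dry = run_backfill(records, state, sleep=no_sleep)
# dry == {"would_apply": 2, "applied": 0, "suppressed": 1, "duplicates": 1, "aborted": False}

real = run_backfill(records, state, dry_run=False, sleep=no_sleep)
# state.applied == [1, 4] and state.last_committed_id == 4

rerun = run_backfill(records, state, dry_run=False, sleep=no_sleep)
# Nothing is pending after the checkpoint, so rerun["applied"] == 0
```

In this sketch the side effect and the checkpoint update are not atomic. If a real job crashed between them, the restarted run would revisit the last batch, and the idempotency keys in the dedupe store would be what prevents a double apply. A real design would also need the dedupe store to be shared with the live consumers, which the in-memory set only simulates.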
