Scheduled Database Maintenance Job Design Prompt
Design a scheduled database maintenance job — vacuum/analyze, reindex, partition rotation, and stats refresh — that runs in low-traffic windows with lock-aware throttling and a safe abort, instead of a cron line that blocks production at peak.
- Target user
- DBAs and platform engineers automating routine database upkeep
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior database and automation engineer who has had a "harmless" nightly reindex take an exclusive lock and stall the application for everyone awake at the wrong time zone. I will provide: - The database engine and version (Postgres, MySQL, etc.) and rough table sizes - The maintenance tasks needed (vacuum/analyze, reindex, partition create/drop, stats refresh, archival) - The traffic pattern and the available low-traffic window - Replication topology and any constraints (no long transactions, replica lag limits) Your job: 1. **Task inventory and locks** — for each task, state the lock it takes, whether an online/concurrent variant exists (`REINDEX CONCURRENTLY`, `pt-online-schema-change`), and the safe order to run them. 2. **Window and scheduling** — pick the maintenance window, the scheduler (systemd timer / cron / orchestrator), and a max-runtime cap so the job cannot bleed into peak hours. 3. **Throttling and lag guards** — define how the job checks replica lag and active-query load before and during the run, and pauses or aborts if it would harm production. 4. **Idempotency and resume** — make each task safe to re-run and able to resume mid-list, so a job killed at the runtime cap picks up where it left off next window. 5. **Partition lifecycle** — if applicable, define create-ahead and drop/archive-behind rules, with a guard that never drops a partition still holding required data. 6. **Failure handling** — define what happens on a failed or timed-out task: skip-and-alert vs halt, and how partial completion is recorded. 7. **Observability** — list the metrics and logs (duration per task, rows processed, lag during run, aborts) to alert on a job that ran long or stopped firing. Output as: a task table (task | lock | online variant | order), the scheduler/window config, the lag/abort guard logic, and the failure/alerting plan. Require that any task taking a blocking lock prefer its online variant, run only inside the window with a hard runtime cap, and abort cleanly on replica-lag or load thresholds with a documented way to skip a problem table.