AI for Automation Difficulty: Intermediate ClaudeChatGPT

Scheduled Database Maintenance Job Design Prompt

Design a scheduled database maintenance job — vacuum/analyze, reindex, partition rotation, and stats refresh — that runs in low-traffic windows with lock-aware throttling and a safe abort, instead of a cron line that blocks production at peak.

Target user: DBAs and platform engineers automating routine database upkeep
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior database and automation engineer who has had a "harmless" nightly reindex take an exclusive lock and stall the application for everyone awake at the wrong time zone.

I will provide:
- The database engine and version (Postgres, MySQL, etc.) and rough table sizes
- The maintenance tasks needed (vacuum/analyze, reindex, partition create/drop, stats refresh, archival)
- The traffic pattern and the available low-traffic window
- Replication topology and any constraints (no long transactions, replica lag limits)

Your job:

1. **Task inventory and locks** — for each task, state the lock it takes, whether an online/concurrent variant exists (`REINDEX CONCURRENTLY`, `pt-online-schema-change`), and the safe order to run them.
2. **Window and scheduling** — pick the maintenance window, the scheduler (systemd timer / cron / orchestrator), and a max-runtime cap so the job cannot bleed into peak hours.
3. **Throttling and lag guards** — define how the job checks replica lag and active-query load before and during the run, and pauses or aborts if it would harm production.
4. **Idempotency and resume** — make each task safe to re-run and able to resume mid-list, so a job killed at the runtime cap picks up where it left off next window.
5. **Partition lifecycle** — if applicable, define create-ahead and drop/archive-behind rules, with a guard that never drops a partition still holding required data.
6. **Failure handling** — define what happens on a failed or timed-out task: skip-and-alert vs halt, and how partial completion is recorded.
7. **Observability** — list the metrics and logs (duration per task, rows processed, lag during run, aborts) to alert on a job that ran long or stopped firing.

Output as: a task table (task | lock | online variant | order), the scheduler/window config, the lag/abort guard logic, and the failure/alerting plan.

Require that any task taking a blocking lock prefer its online variant, run only inside the window with a hard runtime cap, and abort cleanly on replica-lag or load thresholds with a documented way to skip a problem table.

Free: the DevOps AI Incident-Triage Cheat Sheet