Prometheus Pushgateway for Batch Jobs Prompt

Instrument short-lived and batch/cron jobs with the Pushgateway correctly — grouping keys, the right metrics to push, lifecycle cleanup, and alerts that catch a job that never ran.

Target user: Engineers monitoring cron, CI, and ephemeral batch workloads
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are an SRE who treats the Pushgateway as a narrow tool for service-level batch metrics — not a general metrics cache — and has cleaned up the mess of stale groups it leaves behind.

I will provide:
- The batch/cron job, its runtime, and how often it runs
- The Pushgateway URL and any existing push code
- What I want to know (did it run? did it succeed? how long? how many records?)
- Whether jobs run per-instance or as a singleton

Your job:

1. **Decide if Pushgateway is even right** — confirm this is a service-level batch job (not a long-lived service that should be scraped, and not per-request metrics). State when to use the Pushgateway versus scraping versus the textfile collector.

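2. **Grouping key design** — choose the `/metrics/job/<job>/<label>/<value>` path so concurrent runs don't overwrite each other, but stale groups don't accumulate. Explain a replacing push (HTTP `PUT`, `push_to_gateway` in the Python client), which replaces every metric in the group, versus an adding push (HTTP `POST`, `pushadd_to_gateway`), which replaces only metrics with the same name, and say which to use.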

3. **The metric set** — always push `my_job_last_success_timestamp_seconds`, `my_job_duration_seconds`, and a records/rows gauge. Show the client code (curl from bash and the Python client) and where the `job` label comes from.

4. **Lifecycle/cleanup** — `DELETE` the group when a one-shot job's data is no longer relevant, or keep the last-success metric for "did it run" alerting; explain the tradeoff and what `--persistence.file` changes (whether pushed groups survive a Pushgateway restart).

5. **Alerting** — page when `time() - my_job_last_success_timestamp_seconds > expected_interval` (job missed a run) and when a failure gauge is set. Note that `up` for the Pushgateway tells you nothing about the jobs themselves.

6. **Anti-patterns** — using one global grouping key for all instances, pushing per-request metrics, never deleting groups, and relying on Pushgateway uptime as job health.
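
To make items 2 and 4 concrete, here is a minimal Python sketch of building a grouping-key URL and the three requests that act on a group. It uses only the standard library and never sends anything. The host `pushgateway.example`, the job name `nightly_export`, and the helper names `group_url` and `group_request` are hypothetical, chosen for illustration. The sketch rejects empty segments and segments containing `/`, because the Pushgateway expects a separate base64 form for those, which is not implemented here.

```python
from urllib.parse import quote
from urllib.request import Request


def group_url(gateway: str, job: str, grouping: dict) -> str:
    """Build <gateway>/metrics/job/<job>/<label>/<value>... for one group."""
    segments = [job]
    for label, value in grouping.items():
        segments += [label, value]
    for seg in segments:
        # Assumption: keep the sketch simple and refuse values that would
        # need the Pushgateway's base64 path form.
        if not seg or "/" in seg:
            raise ValueError(f"unsupported path segment: {seg!r}")
    path = "/".join(quote(seg, safe="") for seg in segments)
    return f"{gateway.rstrip('/')}/metrics/job/{path}"


def group_request(method: str, url: str, body: str = "") -> Request:
    """Prepare (but do not send) a request against one metric group.

    PUT    replaces every metric in the group.
    POST   replaces only metrics whose names appear in the body.
    DELETE removes the whole group.
    """
    if method not in {"PUT", "POST", "DELETE"}:
        raise ValueError(f"unsupported method: {method}")
    data = None if method == "DELETE" else body.encode("utf-8")
    return Request(url, data=data, method=method)
```

A caller would pass the prepared request to `urllib.request.urlopen(req, timeout=10)`. Keying the group on something bounded, such as an instance or shard name, keeps concurrent workers separate without creating a new group on every run.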

Output: (a) the recommended grouping-key scheme, (b) push client code (curl + Python), (c) the metric set with HELP/TYPE, (d) cleanup logic, (e) the missed-run and failure alert rules.
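
As an illustration of output (c), the following sketch renders the metric set as text exposition format with HELP/TYPE lines, ready to use as a push body. The function name `render_metrics` and the metric name suffix `_records_processed` are assumptions made for this example; only `my_job_last_success_timestamp_seconds` and `my_job_duration_seconds` come from the prompt. It is meant to be called after a successful run, so a failed run leaves the previous last-success value untouched when pushed with `POST`.

```python
def render_metrics(prefix: str, success_ts: float, duration_s: float, records: int) -> str:
    """Render the batch-job metric set in Prometheus text exposition format."""
    metrics = [
        ("last_success_timestamp_seconds",
         "Unix time of the last successful run.", success_ts),
        ("duration_seconds",
         "Wall-clock duration of the last run in seconds.", duration_s),
        # Hypothetical name for the records/rows gauge.
        ("records_processed",
         "Records handled by the last run.", records),
    ]
    lines = []
    for suffix, help_text, value in metrics:
        name = f"{prefix}_{suffix}"
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The body must end with a newline or the push is rejected.
    return "\n".join(lines) + "\n"
```

All three are gauges because each push overwrites the previous value; the Pushgateway does not accumulate counters across runs.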

Bias toward: a last-success timestamp on every job, per-run grouping keys, and deleting stale groups rather than letting them rot.
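
The missed-run condition from item 5 can be sketched in Python to make the comparison explicit. This mirrors `time() - my_job_last_success_timestamp_seconds > expected_interval`; it is not an alert rule. The function name `missed_run` and the `grace_s` parameter are assumptions added for illustration. A job that has never pushed has no last-success value at all, which the sketch treats as missed; in an alert rule that case needs its own check, because the comparison returns nothing when the series does not exist.

```python
import time
from typing import Optional


def missed_run(
    last_success_ts: Optional[float],
    expected_interval_s: float,
    grace_s: float = 0.0,
    now: Optional[float] = None,
) -> bool:
    """Return True when the job is overdue for a successful run."""
    if last_success_ts is None:
        # No last-success metric was ever pushed.
        return True
    current = time.time() if now is None else now
    # Same comparison as the alert expression, plus an optional grace period.
    return current - last_success_ts > expected_interval_s + grace_s
```

For an hourly job, `missed_run(ts, 3600, grace_s=600)` tolerates a run that finishes up to ten minutes late before flagging it.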
