Instrumenting Python Scripts with prometheus_client

The most dangerous automation is the kind that fails quietly. A backup script that exits 0 but copied nothing. A sync job that’s been throwing the same caught exception for three weeks. A daemon that’s alive but stopped processing. Logs help, but nobody reads logs until something is already on fire. What you want is a metric you can graph and alert on.

The prometheus_client library makes this surprisingly easy to bolt onto Python automation. Here’s how I instrument both long-running daemons and short-lived batch jobs, and the gotcha that trips people up with the latter.

The four metric types you actually use

Prometheus has four core metric types, and for ops scripts you’ll mostly use three:

Counter: a value that only goes up, such as tasks processed, errors seen, or bytes transferred. You alert on its rate.
Gauge: a value that goes up and down, such as queue depth, items pending, current temperature, or a “last success timestamp.”
Histogram: a distribution, such as request durations or batch sizes. Gives you a count and a sum for free, plus buckets from which Prometheus estimates quantiles at query time.
Summary: like a histogram, but meant for quantiles computed client-side. The Python client's Summary only exposes a count and a sum, so reach for histograms first.

Get those right and most of your instrumentation falls out naturally.
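
To make the three everyday types concrete, here is a minimal sketch that updates one of each and reads the values back. The metric names are hypothetical, and the private CollectorRegistry is only there so the demo does not touch the process-wide default registry.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# A private registry keeps this demo separate from the process-wide default.
registry = CollectorRegistry()

# Hypothetical metric names, chosen only for this illustration.
processed = Counter("demo_tasks_processed_total", "Tasks processed", registry=registry)
pending = Gauge("demo_tasks_pending", "Tasks waiting to be processed", registry=registry)
batch_seconds = Histogram(
    "demo_batch_seconds", "Time spent per batch",
    buckets=(0.1, 1.0, 10.0), registry=registry,
)

processed.inc()       # counters only go up
processed.inc(4)
pending.set(12)       # gauges can be set...
pending.dec(2)        # ...and moved in either direction
for seconds in (0.05, 0.4, 3.0):
    batch_seconds.observe(seconds)   # each observation lands in every bucket it fits


def sample(name, labels=None):
    """Read one current sample value straight from the registry."""
    return registry.get_sample_value(name, labels)


print(sample("demo_tasks_processed_total"))
print(sample("demo_tasks_pending"))
print(sample("demo_batch_seconds_count"))
print(sample("demo_batch_seconds_bucket", {"le": "1.0"}))
```

Histogram buckets are cumulative, so the le="1.0" bucket counts every observation at or below one second, not just those between 0.1 and 1.0.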

Instrumenting a long-running daemon

For a service that stays up, you expose an HTTP endpoint and let Prometheus scrape it. The library spins up the server for you:

from prometheus_client import start_http_server, Counter, Gauge, Histogram
import time

CYCLES = Counter("reconciler_cycles_total", "Reconciliation cycles run")
ERRORS = Counter("reconciler_errors_total", "Reconciliation errors", ["kind"])
QUEUE_DEPTH = Gauge("reconciler_queue_depth", "Items pending in queue")
CYCLE_TIME = Histogram("reconciler_cycle_seconds", "Time per cycle")

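def do_one_cycle():
    # Placeholder for the real work; return how many items are still pending.
    return 0

def main():
    start_http_server(9101)   # exposes /metrics on :9101
    while True:
        with CYCLE_TIME.time():           # times the block, records to histogram
            try:
                pending = do_one_cycle()
                QUEUE_DEPTH.set(pending)
                CYCLES.inc()
            except TimeoutError:
                ERRORS.labels(kind="timeout").inc()
            except Exception:             # count anything else and keep the loop alive
                ERRORS.labels(kind="unknown").inc()
        time.sleep(15)

if __name__ == "__main__":
    main()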

A few things worth highlighting. CYCLE_TIME.time() as a context manager times the block and records it — no manual time.perf_counter() math. The ERRORS counter has a kind label so you can distinguish timeouts from other failures in your alerting. And QUEUE_DEPTH as a gauge lets you see backlog building before it becomes an incident.

Then your Prometheus scrape config points at host:9101. Done.
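
One way to check this kind of instrumentation without starting the HTTP server is to pull the loop body into a function and feed it fake cycles. The sketch below assumes a private registry and a hypothetical run_cycle helper that takes the work function as an argument; neither is part of the daemon above, they just make the metrics easy to inspect.

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

# Private registry so the check does not collide with the daemon's real metrics.
registry = CollectorRegistry()
CYCLES = Counter("reconciler_cycles_total", "Reconciliation cycles run", registry=registry)
ERRORS = Counter("reconciler_errors_total", "Reconciliation errors", ["kind"], registry=registry)
QUEUE_DEPTH = Gauge("reconciler_queue_depth", "Items pending in queue", registry=registry)
CYCLE_TIME = Histogram("reconciler_cycle_seconds", "Time per cycle", registry=registry)


def run_cycle(do_one_cycle):
    """Hypothetical helper: one pass of the daemon loop, minus the sleep."""
    with CYCLE_TIME.time():
        try:
            pending = do_one_cycle()
            QUEUE_DEPTH.set(pending)
            CYCLES.inc()
        except TimeoutError:
            ERRORS.labels(kind="timeout").inc()
        except Exception:
            ERRORS.labels(kind="unknown").inc()


# Fake cycles standing in for the real work.
def healthy_cycle():
    return 7


def timed_out_cycle():
    raise TimeoutError("upstream timed out")


def broken_cycle():
    raise ValueError("unexpected payload")


for fake in (healthy_cycle, timed_out_cycle, broken_cycle):
    run_cycle(fake)

print(registry.get_sample_value("reconciler_cycles_total"))
print(registry.get_sample_value("reconciler_errors_total", {"kind": "timeout"}))
print(registry.get_sample_value("reconciler_errors_total", {"kind": "unknown"}))
```

Because the exceptions are caught inside the timed block, failed cycles still land in the histogram, which is usually what you want when a timeout is exactly the slow case you care about.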

The batch-job problem: scrapes need something to scrape

Here’s the gotcha. Prometheus works by scraping — it pulls metrics from a running endpoint on an interval. A cron job that runs for 4 seconds and exits gives Prometheus nothing to scrape. By the time the scrape happens, the process is gone.

The answer is the Pushgateway: short-lived jobs push their metrics to it, and Prometheus scrapes the Pushgateway.

import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def run_backup():
    registry = CollectorRegistry()
    success = Gauge("backup_last_success_timestamp", "Last successful backup (unix)",
                    registry=registry)
    duration = Gauge("backup_duration_seconds", "Backup duration", registry=registry)

    start = time.time()
    do_the_backup()                       # raises on failure
    duration.set(time.time() - start)
    success.set_to_current_time()         # records "we succeeded just now"

    push_to_gateway("pushgateway:9091", job="nightly_backup", registry=registry)

Note the dedicated CollectorRegistry() — for push jobs you build a fresh registry so you control exactly which metrics get pushed under that job name. The single most useful metric here is backup_last_success_timestamp. You don’t alert on “did it fail” — failure means nothing got pushed. You alert on staleness:

# Alert if no successful backup in 26 hours
time() - backup_last_success_timestamp > 26 * 3600

That catches both “the job failed” and “the job never ran because cron was broken” — the second of which is the failure mode that pure error counters completely miss.
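
To see what the backup job would send, and the comparison the staleness rule makes, here is a small dry run that needs no Pushgateway. It renders the registry with generate_latest instead of pushing it; the is_stale helper and MAX_AGE_SECONDS are hypothetical names that only mirror the alert expression in Python, since the real check runs inside Prometheus.

```python
import time

from prometheus_client import CollectorRegistry, Gauge, generate_latest

registry = CollectorRegistry()
success = Gauge("backup_last_success_timestamp", "Last successful backup (unix)",
                registry=registry)
success.set_to_current_time()   # what the job does after a successful backup

# The text exposition format for this registry: what a scrape or push would carry.
payload = generate_latest(registry).decode("utf-8")
print(payload)

MAX_AGE_SECONDS = 26 * 3600     # same threshold as the alert expression


def is_stale(last_success, now, max_age=MAX_AGE_SECONDS):
    """Python mirror of: time() - backup_last_success_timestamp > 26 * 3600."""
    return now - last_success > max_age


last = registry.get_sample_value("backup_last_success_timestamp")
print(is_stale(last, time.time()))              # a backup that just finished
print(is_stale(last, last + 27 * 3600))         # the same value seen 27 hours later
```

The 26-hour threshold for a nightly job leaves a couple of hours of slack, so a run that starts late or takes longer than usual does not page anyone.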

Naming and labels: the rules that save you later

Suffix counters with _total. requests_total, not requests. It’s the convention and tooling expects it.
Put units in the name. _seconds, _bytes. Future-you will thank present-you.
Keep label cardinality low. Never put a user ID, request ID, or raw timestamp in a label. Each unique label combination is a new time series; unbounded labels will blow up your Prometheus. Labels are for bounded dimensions: status, kind, region.
Don’t label on error message text. Use a small enum of error kinds, as in the daemon example.

High-cardinality labels are the single fastest way to turn helpful instrumentation into a Prometheus outage.
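
As a rough sketch of the "small enum of error kinds" rule, the snippet below maps exceptions to a fixed set of label values before touching the counter. The metric name, the classify_error helper, and the choice of kinds are all assumptions for illustration; pick kinds that match the failures your own job actually sees.

```python
from prometheus_client import CollectorRegistry, Counter

registry = CollectorRegistry()
# Hypothetical metric for a sync job; "kind" is the only label and it is bounded.
SYNC_ERRORS = Counter("sync_errors_total", "Sync errors by kind", ["kind"],
                      registry=registry)


def classify_error(exc):
    """Collapse any exception into one of a few fixed label values."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "connection"
    if isinstance(exc, PermissionError):
        return "permission"
    return "unknown"


def record_error(exc):
    SYNC_ERRORS.labels(kind=classify_error(exc)).inc()


def series_count(metric_name):
    """Count how many time series currently exist for one sample name."""
    return sum(
        1
        for family in registry.collect()
        for s in family.samples
        if s.name == metric_name
    )


# A thousand distinct error messages still produce a single "timeout" series.
for i in range(1000):
    record_error(TimeoutError(f"request {i} timed out after {i} ms"))
record_error(ConnectionResetError("peer reset the connection"))
record_error(ValueError("malformed row"))

print(series_count("sync_errors_total"))
```

Had the message text gone into the label instead, the same loop would have created a thousand separate series, one per unique message.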

What to actually alert on

Metrics are only useful if they drive action. The high-value alerts for automation are:

Staleness — time() - last_success_timestamp exceeding the expected interval. Catches silent non-runs.
Error rate — rate(errors_total[5m]) > 0 sustained. Catches the “succeeds while failing” case.
Backlog growth — a gauge like queue_depth trending up. Catches “alive but falling behind.”

Skip alerting on raw cycle counts or durations unless you have a real SLO — they’re better as dashboards than pages.

Instrumentation is a small upfront tax that converts “I hope the job ran” into “I’ll get paged if it didn’t.” For a backup or sync job that runs unattended every night, that’s the whole difference between trustworthy automation and a time bomb.


Metric and alert thresholds are examples. Tune intervals and cardinality to your own scrape config and Prometheus capacity before relying on them.
