Prometheus Pushgateway: When to Use It and When Not To

The Prometheus Pushgateway is the component people reach for the moment Prometheus’s pull model feels inconvenient — and that instinct is almost always wrong. I’ve cleaned up enough Pushgateway messes to have a firm rule: it exists for exactly one job, and using it for anything else creates problems that look like monitoring bugs for months. Let me draw the line clearly.

What the Pushgateway is actually for

The Pushgateway exists for one thing: short-lived batch jobs that exit before Prometheus can scrape them. A nightly backup, a cron-driven ETL, a CI deployment step. These processes run, finish, and disappear — there’s no /metrics endpoint left for Prometheus to pull. So the job pushes its final metrics to the Pushgateway, which holds them, and Prometheus scrapes the Pushgateway instead.

That’s it. That’s the use case.

# At the end of a backup job
echo "backup_last_success_timestamp_seconds $(date +%s)" \
  | curl --data-binary @- \
    http://pushgateway:9091/metrics/job/nightly_backup/instance/db01

What it is emphatically NOT for

Here’s where teams go off the rails. The Pushgateway is not:

A way to monitor long-running services behind a firewall. (Use a remote-write setup or a Collector instead.)
A general “push because pull is annoying” gateway.
A buffer or proxy for high-frequency metrics.
A place to push per-request or event-style data.

If your process is still running and could expose /metrics, you don’t want the Pushgateway. Full stop.

Trap 1: stale metrics live forever

This is the big one. The Pushgateway has no concept of staleness. Once you push a metric, it sits there and Prometheus scrapes it on every cycle — forever — until something explicitly deletes it. So if your batch job pushes a job_duration_seconds and then stops running for a week, your dashboards happily show last Tuesday’s value as if it were current.

A normal exporter going away produces an absent series and up == 0 for its target. The Pushgateway produces a frozen series that looks perfectly healthy. That’s a uniquely nasty failure mode: your monitoring lies confidently.

The mitigation is to delete the group when the job’s lifecycle ends, or push a fresh timestamp every run and alert on its age:

# Backup hasn't succeeded in over 25 hours
time() - backup_last_success_timestamp_seconds > 25 * 3600

Alert on freshness, never on the raw value, precisely because the raw value can be a ghost.
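
The delete half of that mitigation can be sketched in shell. This is a minimal sketch, assuming the Pushgateway is reachable at the same http://pushgateway:9091 address used in the earlier push example; the helper names pgw_group_url and pgw_delete_group and the PGW_URL variable are made up for illustration. It relies on the Pushgateway's DELETE method, which removes every metric in the group named by the URL.

```shell
#!/bin/sh
# Hypothetical helpers: the names and the default address are assumptions.
PGW_URL="${PGW_URL:-http://pushgateway:9091}"

# Print the group URL for a job/instance pair.
pgw_group_url() {
  printf '%s/metrics/job/%s/instance/%s\n' "$PGW_URL" "$1" "$2"
}

# Delete every metric in the group so Prometheus stops scraping old values.
pgw_delete_group() {
  curl --fail --silent --show-error -X DELETE "$(pgw_group_url "$1" "$2")"
}

# Example: call this when the job is retired, not after every run.
# pgw_delete_group nightly_backup db01
```

Keep in mind that once the group is deleted, the freshness expression above returns no data instead of firing, so deletion fits jobs that are being decommissioned rather than jobs you still expect to run.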

Trap 2: the instance label disappears

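When you push to the Pushgateway, the metrics carry the labels of the push path (job, instance, plus any grouping labels), not the labels of the originating host. By default Prometheus does not honor those pushed labels: honor_labels defaults to false, so on a conflict the scrape target’s own job and instance win and the pushed ones are renamed to exported_job and exported_instance. To keep the labels you pushed, set honor_labels: true on the scrape config: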

scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true    # required, or pushed job/instance get renamed to exported_*
    static_configs:
      - targets: ['pushgateway:9091']

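Forget honor_labels: true and every pushed metric comes back with job="pushgateway" and instance="pushgateway:9091", while the labels you actually pushed are shunted into exported_job and exported_instance. Queries, dashboards, and alerts that filter on job or instance then lump all your distinct jobs under the Pushgateway target. People debug this for hours.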

Trap 3: it’s a single point of failure

The Pushgateway is intentionally not clustered and not highly available. It’s a single process holding pushed state in memory, with optional persistence to disk via the --persistence.file flag. If it restarts, in-memory state is gone unless you enabled persistence. So don’t put it in a critical path, don’t push to it at high rates, and don’t expect HA from it. It’s a parking lot for batch results, not a pipeline.

The grouping key gotcha

You push to a group, identified by the job and any /label/value segments in the URL. Pushing again with the same grouping key overwrites that group: PUT replaces every metric in it, while POST (what curl --data-binary sends by default) replaces only metrics with the same names. A different key creates a new group. Inconsistent grouping keys across runs leave orphaned groups accumulating forever:

# Good: stable grouping key, replaces cleanly each run
.../metrics/job/etl/instance/region-eu

# Bad: timestamp in the key, leaks a new group every run
.../metrics/job/etl/instance/run-1718150400

Keep the grouping key stable and finite, or you’ve reinvented a cardinality leak.
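
One way to enforce that is a small guard in the push wrapper. The sketch below is only a heuristic, assuming the same http://pushgateway:9091 address as before; looks_stable, push_group, and PGW_URL are hypothetical names, and the "eight or more consecutive digits" threshold is my own assumption for spotting epoch timestamps and run IDs. It pushes with PUT so each run replaces the whole group.

```shell
#!/bin/sh
# Hypothetical helpers: the names, default address, and digit threshold are assumptions.
PGW_URL="${PGW_URL:-http://pushgateway:9091}"

# Return 0 if the value looks stable, 1 if it contains eight or more
# consecutive digits (a rough signal for timestamps or run IDs).
looks_stable() {
  case "$1" in
    *[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Read metrics on stdin and PUT them to the job/instance group,
# refusing instance values that look like they change every run.
push_group() {
  job="$1"
  instance="$2"
  if ! looks_stable "$instance"; then
    echo "refusing per-run grouping value: $instance" >&2
    return 1
  fi
  curl --fail --silent --show-error -X PUT --data-binary @- \
    "${PGW_URL}/metrics/job/${job}/instance/${instance}"
}

# Example: the run's timestamp goes in the sample value, not in the URL.
# echo "etl_last_run_timestamp_seconds $(date +%s)" | push_group etl region-eu
```

The point of the design is that anything that varies per run lives in the metric body, where the next push overwrites it, instead of in the key, where it creates a new group.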

Better alternatives for the common temptations

Most “I need the Pushgateway” situations are actually one of these:

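Service behind a firewall / NAT → run an OpenTelemetry Collector or Prometheus in agent mode at the edge and use remote-write. Scraping stays pull-based next to the service, so you keep up and staleness handling; only the forwarding hop is push.
Host-local facts (cert expiry, backup age) → have the job write a .prom file into the directory watched by the node_exporter textfile collector; Prometheus scrapes it along with the rest of node_exporter’s metrics. No extra moving part, and the series go stale when the host or exporter goes away.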
Serverless / FaaS → push to a managed remote-write endpoint or accumulate via the Collector; the Pushgateway’s lack of HA makes it a poor fit here.

The textfile collector in particular replaces the Pushgateway for a huge fraction of cases people misuse it for.
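
For the textfile route, the job's side of the contract is just writing a file. Here is a minimal sketch, assuming node_exporter was started with --collector.textfile.directory pointing at the directory you pass in; write_textfile_metric is a hypothetical helper name and the path in the example is an assumption. It writes to a temporary file and renames it, so a scrape never reads a half-written file.

```shell
#!/bin/sh
# Hypothetical helper: the function name and directory layout are assumptions.
# The directory must match node_exporter's --collector.textfile.directory flag.

# Write one sample to <dir>/<name>.prom atomically: create a temp file in the
# same directory, then rename it over the final name.
write_textfile_metric() {
  dir="$1"
  name="$2"
  value="$3"
  tmp="${dir}/${name}.prom.$$"
  printf '%s %s\n' "$name" "$value" > "$tmp" && mv "$tmp" "${dir}/${name}.prom"
}

# Example: record the last successful backup time at the end of the job.
# write_textfile_metric /var/lib/node_exporter/textfile \
#   backup_last_success_timestamp_seconds "$(date +%s)"
```

The temporary name ends in the process ID rather than .prom, so the collector ignores it until the rename completes. The same freshness alert on the timestamp applies here as with the Pushgateway.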

A clean checklist

Before you deploy a Pushgateway, confirm:

The producer is a short-lived job that genuinely can’t be scraped.
You alert on metric freshness, not raw values.
honor_labels: true is set on the scrape config.
Grouping keys are stable and bounded.
You delete groups when a job’s lifecycle ends.
It’s not in any latency-critical or HA-critical path.

Get those right and the Pushgateway is a small, boring, useful tool. Get them wrong and it’s a source of phantom green dashboards that survive long after the thing they measured stopped working. And make sure the freshness alerts themselves are routed somewhere a human will see them, so a stale batch job actually gets noticed.

Pushgateway behavior depends on version and flags. Validate persistence and labeling against your own deployment before relying on it.