Blackbox and Synthetic Monitoring With Prometheus

Your internal metrics can swear everything is healthy while a user in another region gets a TLS error on every request. Whitebox metrics measure the system from the inside; they can’t see a broken load balancer, an expired certificate, a DNS failure, or a route that’s black-holing traffic. For that you need to probe from the outside, like a user would. That’s blackbox and synthetic monitoring. After years of catching outages this way, here’s how I run it.

Whitebox vs blackbox, and why you need both

Whitebox monitoring instruments the system itself: request rates, error counts, queue depth from inside the process. Rich, detailed, but blind to anything between the user and your code.
Blackbox monitoring probes the system from outside: “can I actually load this URL right now, and how long did it take?” It sees the whole path — DNS, TLS, load balancer, network — exactly as a user does.

You need both. Whitebox tells you why something is slow once you know it’s slow. Blackbox tells you that it’s broken from a user’s perspective, including failures your internal metrics literally cannot observe.

The blackbox exporter

Prometheus’s blackbox_exporter is the standard tool. It runs as a service and, when asked, probes a target over HTTP, HTTPS, TCP, ICMP, or DNS, then returns metrics about the result. The clever part is the scrape config: Prometheus passes the target URL as a parameter, so one exporter probes many endpoints.

# blackbox.yml — module definitions
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
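      valid_status_codes: [200]  # only 200 passes; omit to accept any 2xx (the default)
      method: GET
      fail_if_ssl: false         # the default; use fail_if_not_ssl: true to require TLS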

# prometheus.yml — point scrapes through the exporter
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://devopsaitoolkit.com
          - https://api.example.com/health
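    relabel_configs:
      # copy the listed target URL into the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # keep the real URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # send the actual scrape to the exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115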

That relabel dance is the idiom: rewrite the scrape so Prometheus hits the exporter’s /probe endpoint, passes the real target as the target query parameter (set through the __param_target label), and keeps the real URL as the instance label so your metrics are readable.
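
The same pattern extends to the other probers mentioned above. Here is a minimal sketch of extra modules for blackbox.yml. The module names (tcp_connect, icmp_ping, dns_lookup) are arbitrary labels chosen for this example, and example.com is a placeholder query name:

```yaml
# blackbox.yml: additional modules (names are illustrative)
modules:
  tcp_connect:
    prober: tcp                 # passes if a TCP connection can be opened
    timeout: 5s
  icmp_ping:
    prober: icmp                # the exporter may need raw-socket privileges for this
    timeout: 5s
  dns_lookup:
    prober: dns
    timeout: 5s
    dns:
      query_name: example.com   # placeholder: the name to resolve
      query_type: A
```

Each module needs its own scrape job with params.module set to match. For the dns prober, the scrape target is the DNS server being queried, not the name being resolved.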

The metrics you get, and the alerts that matter

Each probe returns a handful of useful series:

probe_success — 1 if the probe passed, 0 if it failed. Your core uptime signal.
probe_duration_seconds — how long the whole probe took. Your external latency.
probe_http_status_code — what status came back.
probe_ssl_earliest_cert_expiry — when the TLS cert expires. Worth its weight in gold.

The three alerts I always set:

- alert: EndpointDown
  expr: probe_success == 0
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "{{ $labels.instance }} failing external probes"

- alert: CertExpiringSoon
  expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 14
  labels: { severity: ticket }
  annotations:
    summary: "TLS cert for {{ $labels.instance }} expires in under 14 days"

- alert: SlowEndpoint
  expr: probe_duration_seconds > 2
  for: 5m
  labels: { severity: ticket }
  annotations:
    summary: "{{ $labels.instance }} external probes taking over 2s"

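That certificate-expiry alert alone has saved me more weekend incidents than almost anything else. The metric is a Unix timestamp, so subtracting time() and dividing by 86400 gives the days remaining. Expired certs are a self-inflicted outage that’s completely preventable with one query.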
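
The rules above are list items, so they have to sit inside a rule group before Prometheus will load them. A sketch of the wrapper, assuming a file called probe-alerts.yml and a group called blackbox-probes (both names are placeholders):

```yaml
# probe-alerts.yml (hypothetical file name)
groups:
  - name: blackbox-probes          # arbitrary group name
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "{{ $labels.instance }} failing external probes"
      # CertExpiringSoon and SlowEndpoint go here at the same indentation
---
# prometheus.yml: tell Prometheus where the rule file lives
rule_files:
  - probe-alerts.yml
```

Running promtool check rules probe-alerts.yml before reloading catches indentation and expression mistakes early.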

Probe from where your users are

A probe from inside your own datacenter tests very little of the user’s path: of course your service is reachable from right next to it. The value of blackbox monitoring multiplies when you probe from where users actually are: other regions, other cloud providers, the public internet. Run blackbox exporters in a few geographic locations and label the probe with its origin.

# append to the relabel_configs of the job that scrapes through the us-east exporter
relabel_configs:
  - target_label: probe_region
    replacement: us-east
Now probe_success{instance="...", probe_region="eu-west"} == 0 while us-east is fine tells you it’s a regional problem, not a total outage — a distinction your internal metrics can’t make.
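
With the probe_region label in place, the regional-versus-global distinction can be written as alert rules. This is a sketch, not a drop-in. It assumes every region probes the same instance targets, and the alert names are placeholders:

```yaml
- alert: EndpointDownEverywhere
  # no region has a passing probe for this target
  expr: max by (instance) (probe_success) == 0
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "{{ $labels.instance }} failing probes from every region"

- alert: EndpointDownInRegion
  # this region fails while at least one other region still succeeds
  expr: |
    probe_success == 0
      and on (instance)
    max by (instance) (probe_success) == 1
  for: 5m
  labels: { severity: ticket }
  annotations:
    summary: "{{ $labels.instance }} failing from {{ $labels.probe_region }} while other regions pass"
```

Splitting the two lets a total outage page someone while a single-region failure opens a ticket. Whether that severity split is right depends on how your traffic is distributed.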

Synthetic transactions for critical flows

A 200 on the homepage doesn’t prove checkout works. For the flows that actually make money, go beyond a simple GET to a synthetic transaction: a scripted multi-step journey (log in, add to cart, check out) run on a schedule, exporting success and duration as metrics. The blackbox exporter’s HTTP prober issues one request per probe (it can follow redirects and match on the response body), so it only covers single-endpoint checks; multi-step journeys need a scripted client or a headless browser tool that emits Prometheus metrics.

The principle is the same: continuously prove the user-critical path works, from the outside, before a real user discovers it doesn’t.

Don’t let probes lie to you

A few traps I’ve been bitten by:

Probe the real thing, not a /health that always returns 200. A health endpoint that doesn’t touch the database will happily report healthy while the database is down.
Set sane timeouts. Too short and you alert on normal latency; too long and a hung endpoint takes forever to register as down.
Watch the probes themselves. If your blackbox exporter dies, probe_success goes absent, not zero. Alert on up{job="blackbox-http"} == 0 too.
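
These traps translate fairly directly into configuration. The sketch below rests on assumptions: the module name http_deep is hypothetical, and the "status":"ok" body marker stands in for whatever your endpoint really returns once it has touched its dependencies.

```yaml
# blackbox.yml: a module that checks content, not just the status code
modules:
  http_deep:                       # hypothetical module name
    prober: http
    timeout: 5s                    # keep this below the job's scrape_timeout
    http:
      method: GET
      fail_if_not_ssl: true        # fail if the response was not served over TLS
      fail_if_body_not_matches_regexp:
        - '"status":\s*"ok"'       # assumed marker proving the dependency answered
---
# alert rule: catch a dead exporter, since probe_success simply disappears
- alert: BlackboxExporterDown
  expr: up{job="blackbox-http"} == 0
  for: 2m
  labels: { severity: page }
  annotations:
    summary: "blackbox exporter scrape failing; probe results are missing"
```

When the exporter is healthy but the endpoint is not, up stays at 1 and probe_success drops to 0, so the two alerts cover different failures.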

Where AI helps

The blackbox scrape config — especially that relabel idiom — is notoriously easy to get subtly wrong, and the error mode is “no data” with no explanation. I describe the endpoints and what “healthy” means, and let AI generate the module definitions and the relabel block. It also drafts the cert-expiry and regional-outage alerts cleanly.

You still have to verify that the probes actually fire against your endpoints, but it removes the relabel-config guesswork. We keep monitoring prompts for synthetic monitoring, and the Alert Rule Generator will produce probe-based alerts with sensible “for” windows and severities.

The bottom line

Whitebox metrics tell you how your system feels from the inside; blackbox probes tell you how it feels to a user, including the failures your internal view is blind to. Probe the real user paths, probe from where users are, alert on probe failures, slow probes, and expiring certs — and you’ll find out about outages from your monitoring instead of from your customers. That’s the whole job.

Generated probe configs and alerts are assistive, not authoritative. Always verify probes fire against real endpoints and test config changes in staging before production.