Prometheus Error Guide: 'context deadline exceeded' Alertmanager Notifications Failing

Overview

Alertmanager receives firing alerts from Prometheus, groups and routes them, and dispatches notifications to receivers (email, Slack, PagerDuty, generic webhooks). When a receiver’s integration fails, Alertmanager logs the error and retries with backoff; if every attempt fails, the notification is dropped for that group cycle. The most common failures are context deadline exceeded (the receiver did not respond within the timeout), SMTP errors (auth, TLS, relay refusal), and webhook errors (connection refused, non-2xx response).

You will see these in the Alertmanager log:

ts=2026-06-23T14:16:40.221Z caller=notify.go:732 level=warn component=dispatcher receiver=team-pager integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=3 err="Post \"https://hooks.example.com/alert\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

The SMTP variant looks like:

err="*email.Email: establish connection to server: dial tcp 10.0.3.4:587: connect: connection refused"

This is a delivery-path problem: Prometheus evaluated the alert correctly and Alertmanager received it, but the person or system at the end was never notified. Because failed attempts surface only in Alertmanager’s own logs and metrics, alerts can be lost without any obvious sign unless you watch both.

Symptoms

Alerts visible as firing in Prometheus and in the Alertmanager UI, but no email/Slack/page arrives.
Notify attempt failed, will retry later log lines for a specific receiver/integration.
alertmanager_notifications_failed_total increasing for an integration.
A config reload silently failing, so routing/receivers run stale.

rate(alertmanager_notifications_failed_total[5m]) > 0

{integration="email", instance="alertmanager:9093"}  0.20

Common Root Causes

1. Webhook/receiver timeout (context deadline exceeded)

The receiver endpoint is slow or unreachable and exceeds the HTTP timeout. Probe it directly:

curl -s -o /dev/null -w 'code=%{http_code} time=%{time_total}s\n' --max-time 10 -X POST \
  -H 'Content-Type: application/json' -d '{"test":true}' https://hooks.example.com/alert

code=000 time=10.001s

code=000 means curl received no HTTP response, and a time of 10s means the probe hit its --max-time limit while waiting. An endpoint that never answers is exactly what produces context deadline exceeded in Alertmanager.

2. SMTP authentication or TLS failure (email)

The mail receiver can’t authenticate or negotiate TLS with the relay. Test the SMTP path:

curl -v --url 'smtp://smtp.example.com:587' --mail-from alerts@example.com \
  --mail-rcpt oncall@example.com --user 'alerts@example.com:<APP_PASS>' \
  --ssl-reqd -T <(printf 'Subject: test\n\nbody\n') 2>&1 | grep -Ei 'auth|tls|535|530|220|250'

< 535 5.7.8 Username and Password not accepted

A 535 auth rejection (often a wrong/rotated app password) means every email notification fails at send time.

3. SMTP relay refusing the connection or sender

The relay refuses the connection (firewall, wrong port) or rejects the From address.

nc -vz smtp.example.com 587

nc: connect to smtp.example.com port 587 (tcp) failed: Connection refused

A refused connection on the SMTP port means Alertmanager can never establish the session.
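
The SMTP reply codes from causes 2 and 3 can be triaged mechanically. The following is a minimal sketch, not part of Alertmanager or curl: first_smtp_error and classify_smtp_reply are hypothetical helper names, and the code-to-cause mapping is a rough guide based on common relay behaviour that individual mail servers may not follow exactly.

```shell
#!/bin/sh
# first_smtp_error: read `curl -v` output on stdin and print the first
# 4xx/5xx SMTP reply code (server replies are the lines prefixed with "< ").
first_smtp_error() {
  sed -n 's/^< \([45][0-9][0-9]\)[ -].*/\1/p' | head -n 1
}

# classify_smtp_reply CODE: map an SMTP reply code to a likely cause.
# Hypothetical helper; the groupings are illustrative.
classify_smtp_reply() {
  case "$1" in
    220|235|250) echo "ok" ;;
    530)         echo "auth-or-starttls-required" ;;
    535)         echo "bad-credentials" ;;
    550|553|554) echo "sender-or-relay-rejected" ;;
    4[0-9][0-9]) echo "temporary-failure" ;;
    *)           echo "unknown" ;;
  esac
}
```

To use it, pipe the verbose output of the SMTP curl test into first_smtp_error and pass the result to classify_smtp_reply. An empty result combined with a refused nc check points at the network path, not at authentication.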

4. Webhook returns a non-2xx status

The endpoint is reachable but returns 4xx/5xx, which Alertmanager treats as a failed notification.

curl -s -o /dev/null -w '%{http_code}\n' -X POST \
  -H 'Content-Type: application/json' -d @sample-alert.json https://hooks.example.com/alert

A 401/403 means the webhook auth (token/header) is wrong; a 5xx means the receiver app is failing.
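
The status codes seen in the webhook probes (causes 1 and 4) fall into a few diagnoses. A small sketch of that mapping as a shell function follows; classify_probe is a hypothetical helper name, and the categories mirror this guide, not any Alertmanager API.

```shell
#!/bin/sh
# classify_probe CODE: map a curl %{http_code} value to a likely diagnosis.
# curl reports 000 when it received no HTTP response at all.
classify_probe() {
  case "$1" in
    000)         echo "no-response: timeout, DNS failure, or connection refused" ;;
    2[0-9][0-9]) echo "ok: receiver accepted the request" ;;
    401|403)     echo "auth: webhook token or header rejected" ;;
    4[0-9][0-9]) echo "client-error: check the URL and payload" ;;
    5[0-9][0-9]) echo "server-error: receiver application is failing" ;;
    *)           echo "unknown: unexpected status $1" ;;
  esac
}
```

Feed it the output of a probe that prints only %{http_code}, for example classify_probe "$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 -X POST <WEBHOOK_URL>)".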

5. Routing tree never reaches the intended receiver

The notification “fails” because the alert is routed to the wrong (or default) receiver, or stops at an earlier matching route that does not set continue: true. Test the route:

amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical team=payments

team-default

If a team=payments severity=critical alert resolves to team-default instead of payments-pager, the routing labels/matchers are wrong.
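
The same check can be scripted so that a routing regression fails loudly. This is a sketch under stated assumptions: assert_route and the CONFIG_FILE variable are hypothetical names, and the comparison assumes amtool prints only the resolved receiver name(s), as in the output above.

```shell
#!/bin/sh
# assert_route EXPECTED LABEL=VALUE...: fail if the labels resolve to a
# receiver other than EXPECTED. CONFIG_FILE defaults to this guide's path.
assert_route() {
  expected=$1
  shift
  actual=$(amtool config routes test \
    --config.file="${CONFIG_FILE:-/etc/alertmanager/alertmanager.yml}" "$@")
  if [ "$actual" = "$expected" ]; then
    echo "ok: $* -> $actual"
  else
    echo "MISMATCH: $* -> $actual (expected $expected)" >&2
    return 1
  fi
}
```

For example, assert_route payments-pager severity=critical team=payments returns non-zero while the alert still resolves to team-default. Recent amtool releases also appear to provide a --verify.receivers flag on config routes test for the same purpose; check amtool config routes test --help for your version before relying on it.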

6. A failed config reload leaving stale receivers

An edit with a bad receiver/template fails to load; Alertmanager keeps the old config and new receivers never take effect.

amtool check-config /etc/alertmanager/alertmanager.yml
curl -s http://localhost:9093/api/v2/status | jq '.config.original' | head

Checking '/etc/alertmanager/alertmanager.yml'  FAILED: undefined template "slack.title"

A FAILED check means the file on disk will not load. Alertmanager keeps running the last valid config, so the intended notification path is not active.

Diagnostic Workflow

Step 1: Confirm the alert reached Alertmanager and which receiver it targets

amtool alert query --alertmanager.url=http://localhost:9093 alertname=<NAME>

If the alert is listed, Prometheus did its job; the problem is downstream in routing/notification.

Step 2: Identify the failing integration from logs and metrics

journalctl -u alertmanager --no-pager | grep -i 'Notify attempt failed' | tail -10

rate(alertmanager_notifications_failed_total[5m])

This names the receiver and integration (email/webhook/slack) that is failing.
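
When the log is noisy, counting failures per receiver and integration makes the worst offender obvious. A minimal sketch, assuming the log fields appear in the order shown in the sample line in the Overview (receiver= immediately followed by integration=); summarize_notify_failures is a hypothetical helper name.

```shell
#!/bin/sh
# summarize_notify_failures: read Alertmanager log lines on stdin and count
# "Notify attempt failed" entries per receiver/integration pair, most
# frequent first.
summarize_notify_failures() {
  grep 'Notify attempt failed' |
    sed -n 's/.*receiver=\([^ ]*\) integration=\([^ ]*\).*/\1 \2/p' |
    sort | uniq -c | sort -rn
}
```

Typical use is journalctl -u alertmanager --no-pager | summarize_notify_failures. If your Alertmanager version logs these fields in a different order or format, the sed expression needs adjusting.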

Step 3: Reproduce the integration by hand

For webhooks, curl the endpoint; for email, test SMTP with curl --url smtp://... or nc. Match the timeout and auth Alertmanager uses.

Step 4: Verify routing resolves to the intended receiver

amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  <label1>=<value1> <label2>=<value2>

Confirm the alert’s labels resolve to the receiver you expect.

Step 5: Validate the config and that it actually loaded

amtool check-config /etc/alertmanager/alertmanager.yml
curl -s http://localhost:9093/-/reload -X POST
journalctl -u alertmanager --no-pager | grep -Ei 'reload|configuration file' | tail -3

A clean check plus a successful reload confirms the running config matches the file.
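
Log messages vary between versions, so the reload result is more reliably read from Alertmanager’s own metrics. The sketch below checks the alertmanager_config_last_reload_successful gauge; reload_succeeded is a hypothetical helper name, and the /metrics URL in the usage note assumes the same localhost:9093 address used elsewhere in this guide.

```shell
#!/bin/sh
# reload_succeeded: read Prometheus-format metrics text on stdin and succeed
# only if alertmanager_config_last_reload_successful is 1.
# "# HELP" and "# TYPE" lines are skipped because their first field is "#".
reload_succeeded() {
  value=$(awk '$1 == "alertmanager_config_last_reload_successful" { print $2 }')
  [ "$value" = "1" ]
}
```

Usage: curl -s http://localhost:9093/metrics | reload_succeeded && echo "running config matches the file". A value of 0 means the last reload was rejected and the previous config is still in effect.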

Example Root Cause Analysis

On-call reports they stopped getting pages overnight, yet Prometheus shows the KubeNodeNotReady alert firing and present in the Alertmanager UI.

Checking metrics and logs:

rate(alertmanager_notifications_failed_total{integration="webhook"}[15m])

{receiver="team-pager"}  0.30

journalctl -u alertmanager --no-pager | grep 'Notify attempt failed' | tail -3

err="Post \"https://events.pagerduty.com/v2/enqueue\": context deadline exceeded"

Every page to the PagerDuty webhook times out. Probing the endpoint from the Alertmanager host:

curl -s -o /dev/null -w 'code=%{http_code} time=%{time_total}s\n' --max-time 10 \
  https://events.pagerduty.com/v2/enqueue

code=000 time=10.002s

The host has no route to the internet on egress port 443 — a firewall change overnight blocked outbound HTTPS, so the webhook never connected and Alertmanager logged context deadline exceeded on every retry.

The fix restores egress to the PagerDuty endpoint (allowlist the destination on the firewall), after which the probe gets an HTTP response again and pending pages flush:

curl -s -o /dev/null -w '%{http_code}\n' https://events.pagerduty.com/v2/enqueue

(A 400 from a bare GET means the endpoint is now reachable; real notifications with a valid payload return 202.) Notifications resume and alertmanager_notifications_failed_total flattens.

Prevention Best Practices

Monitor Alertmanager itself: alert on rate(alertmanager_notifications_failed_total[5m]) > 0 and run a synthetic “watchdog” alert that always fires, so a broken notification path is detected by its absence.
Validate config in CI with amtool check-config and confirm the post-deploy reload succeeded; a silently stale config is a common cause of “no pages.”
Test routing with amtool config routes test for representative label sets whenever you change the route tree.
Keep receiver credentials (SMTP app passwords, webhook tokens) in a secret store and alert on auth failures; rotated secrets are a frequent overnight break.
Ensure the firewall allows Alertmanager egress to external receivers (PagerDuty, Slack, SMTP relay) and monitor that path, since context deadline exceeded is often a network/egress problem rather than a fault in the receiver.
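
The self-monitoring practice above can be sketched as a Prometheus rule file. Everything named here is illustrative and not taken from an existing setup: the write_alertmanager_rules helper, the group and alert names, the 10m hold time, and the severity labels are all assumptions to adapt.

```shell
#!/bin/sh
# write_alertmanager_rules PATH: write an example Prometheus rule file with
# a notification-failure alert and an always-firing watchdog alert.
# Names, durations, and labels are placeholders.
write_alertmanager_rules() {
  cat > "$1" <<'EOF'
groups:
  - name: alertmanager-self-monitoring
    rules:
      - alert: AlertmanagerNotificationsFailing
        expr: rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Alertmanager is failing to deliver {{ $labels.integration }} notifications'
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: 'Always-firing alert; its absence means the notification path is broken'
EOF
}
```

After writing the file, promtool check rules <PATH> validates it. Route the Watchdog alert to an external dead-man's-switch receiver, because an alert about failing notifications that travels over the same broken path will not arrive either.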

Quick Command Reference

# Is the alert in Alertmanager and which receiver?
amtool alert query --alertmanager.url=http://localhost:9093 alertname=<NAME>

# Failing integration from logs
journalctl -u alertmanager --no-pager | grep -i 'Notify attempt failed' | tail -10

# Probe a webhook receiver
curl -s -o /dev/null -w 'code=%{http_code} time=%{time_total}s\n' --max-time 10 -X POST \
  -H 'Content-Type: application/json' -d '{"test":true}' <WEBHOOK_URL>

# Test SMTP reachability/auth
nc -vz <SMTP_HOST> 587
curl -v --url 'smtp://<SMTP_HOST>:587' --user '<USER>:<PASS>' --ssl-reqd 2>&1 | grep -Ei '535|220|250|tls'

# Routing and config validity
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical team=payments
amtool check-config /etc/alertmanager/alertmanager.yml

# Notification failure rate by integration
rate(alertmanager_notifications_failed_total[5m])

Conclusion

Failing notifications mean the alert was evaluated and routed, but delivery to the receiver broke. Diagnose in order:

1. Confirm the alert is in Alertmanager and note its target receiver.
2. Identify the failing integration from logs and alertmanager_notifications_failed_total.
3. Reproduce the receiver by hand: curl the webhook or test SMTP.
4. Verify routing resolves to the intended receiver with amtool.
5. Validate the config and confirm the reload actually applied.

context deadline exceeded usually points at network/egress or a slow receiver; SMTP errors at auth/relay; webhook non-2xx at auth or the receiver app. Testing the integration directly usually makes the cause clear.
