Prometheus Error Guide: 'context deadline exceeded' Alertmanager Notifications Failing
Fix Alertmanager notification failures: SMTP errors, webhook timeouts, 'context deadline exceeded', and silent drops. Diagnose receivers, routing, and config reloads.
Overview
Alertmanager receives firing alerts from Prometheus, groups and routes them, and dispatches notifications to receivers (email, Slack, PagerDuty, generic webhooks). When a receiver’s integration fails, Alertmanager logs the error and retries with backoff; if every attempt fails, the notification is dropped for that group cycle. The most common failures are context deadline exceeded (the receiver did not respond within the timeout), SMTP errors (auth, TLS, relay refusal), and webhook errors (connection refused, non-2xx response).
You will see these in the Alertmanager log:
ts=2026-06-23T14:16:40.221Z caller=notify.go:732 level=warn component=dispatcher receiver=team-pager integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=3 err="Post \"https://hooks.example.com/alert\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
The SMTP variant looks like:
err="*email.Email: establish connection to server: dial tcp 10.0.3.4:587: connect: connection refused"
Both are delivery-path problems: Prometheus correctly evaluated the alert and Alertmanager received it, but the human or system at the end never got notified. Because failed attempts show up only in the Alertmanager log and in Alertmanager’s own metrics, alerts can be lost without any obvious failure unless you watch them.
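To see at a glance which receiver is failing, the warn lines can be tallied per receiver and integration. The following is a minimal sketch, not an official tool: the function name summarize_notify_failures is hypothetical, and it assumes the logfmt field order of the sample line above (receiver=... directly followed by integration=...).

```shell
# Hypothetical helper: count failed notify attempts per receiver/integration.
# Assumes the logfmt field order shown in the sample log line
# (receiver=... immediately followed by integration=...).
summarize_notify_failures() {
  grep -F 'Notify attempt failed' \
    | sed -nE 's/.* receiver=([^ ]+) integration=([^ ]+) .*/\1 \2/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2, $3, $1 }'
}

# Usage (not run here):
#   journalctl -u alertmanager --no-pager | summarize_notify_failures
```

Each output line has the form "receiver integration count", with the most frequently failing pair first.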
Symptoms
- Alerts visible as firing in Prometheus and in the Alertmanager UI, but no email/Slack/page arrives.
- Notify attempt failed, will retry later log lines for a specific receiver/integration.
- alertmanager_notifications_failed_total increasing for an integration.
- A config reload silently failing, so routing/receivers run stale.
rate(alertmanager_notifications_failed_total[5m]) > 0
{integration="email", instance="alertmanager:9093"} 0.20
Common Root Causes
1. Webhook/receiver timeout (context deadline exceeded)
The receiver endpoint is slow or unreachable and exceeds the HTTP timeout. Probe it directly:
curl -s -o /dev/null --max-time 10 -w 'code=%{http_code} time=%{time_total}s\n' -X POST \
-H 'Content-Type: application/json' -d '{"test":true}' https://hooks.example.com/alert
code=000 time=10.001s
code=000 with a 10s time means curl hit its 10-second limit (--max-time 10) without ever receiving a response, which is exactly the condition that produces context deadline exceeded in Alertmanager.
2. SMTP authentication or TLS failure (email)
The mail receiver can’t authenticate or negotiate TLS with the relay. Test the SMTP path:
curl -v --url 'smtp://smtp.example.com:587' --mail-from alerts@example.com \
--mail-rcpt oncall@example.com --user 'alerts@example.com:<APP_PASS>' \
--ssl-reqd -T <(printf 'Subject: test\n\nbody\n') 2>&1 | grep -Ei 'auth|tls|535|530|220|250'
< 535 5.7.8 Username and Password not accepted
A 535 auth rejection (often a wrong/rotated app password) means every email notification fails at send time.
3. SMTP relay refusing the connection or sender
The relay refuses the connection (firewall, wrong port) or rejects the From address.
nc -vz smtp.example.com 587
nc: connect to smtp.example.com port 587 (tcp) failed: Connection refused
A refused connection on the SMTP port means Alertmanager can never establish the session.
4. Webhook returns a non-2xx status
The endpoint is reachable but returns 4xx/5xx, which Alertmanager treats as a failed notification.
curl -s -o /dev/null -w '%{http_code}\n' -X POST \
-H 'Content-Type: application/json' -d @sample-alert.json https://hooks.example.com/alert
401
A 401/403 means the webhook auth (token/header) is wrong; a 5xx means the receiver app is failing.
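The probe results from causes 1 and 4 can be mapped to a likely failure class with a small helper. This is a sketch only: the function name classify_probe and the category labels are made up for illustration, and the mapping simply restates the interpretation given in the text.

```shell
# Hypothetical helper: map curl's %{http_code} to a likely failure class.
# 000 is what curl reports when no HTTP response was received at all.
classify_probe() {
  case "$1" in
    000)     echo "no-response" ;;     # timeout, refused connection, DNS or TLS failure
    2??)     echo "ok" ;;
    401|403) echo "auth" ;;            # wrong or missing webhook token/header
    4??)     echo "client-error" ;;    # other rejected request, e.g. bad payload
    5??)     echo "receiver-error" ;;  # receiver application is failing
    *)       echo "unknown" ;;
  esac
}

# Usage (not run here):
#   classify_probe "$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' -X POST <WEBHOOK_URL>)"
```

The 401|403 branch is listed before the generic 4?? branch so that auth failures are reported separately from other client errors.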
5. Routing tree never reaches the intended receiver
The notification “fails” because the alert is routed to the wrong (or default) receiver, or matched a continue: false branch early. Test the route:
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
severity=critical team=payments
team-default
If a team=payments severity=critical alert resolves to team-default instead of payments-pager, the routing labels/matchers are wrong.
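The same check can be turned into a pass/fail assertion, for example in CI. The sketch below wraps the amtool command shown above; the function name check_route and the ALERTMANAGER_CONFIG variable are hypothetical, and it assumes the route resolves to a single receiver printed on one line, as in the output above.

```shell
# Hypothetical wrapper: fail when a label set does not route to the expected receiver.
# Assumes amtool prints exactly the resolved receiver name, as in the example output.
check_route() {  # check_route <expected-receiver> <label=value>...
  local expected="$1"
  shift
  local config="${ALERTMANAGER_CONFIG:-/etc/alertmanager/alertmanager.yml}"
  local actual
  actual="$(amtool config routes test --config.file="$config" "$@")" || return 2
  if [ "$actual" != "$expected" ]; then
    echo "route mismatch: $* -> $actual (expected $expected)" >&2
    return 1
  fi
}

# Usage (not run here):
#   check_route payments-pager severity=critical team=payments
```

A return code of 1 signals a routing mismatch, while 2 means amtool itself failed (for instance, because the config file could not be parsed).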
6. A failed config reload leaving stale receivers
An edit with a bad receiver/template fails to load; Alertmanager keeps the old config and new receivers never take effect.
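amtool check-config /etc/alertmanager/alertmanager.yml
Checking '/etc/alertmanager/alertmanager.yml' FAILED: undefined template "slack.title"
curl -s http://localhost:9093/api/v2/status | jq '.config.original' | head
A FAILED check means Alertmanager rejected the edited file on reload and is still running the previous config, so the intended notification path isn’t active. The second command prints the config that is actually loaded, for comparison with the file on disk.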
Diagnostic Workflow
Step 1: Confirm the alert reached Alertmanager and which receiver it targets
amtool alert query --alertmanager.url=http://localhost:9093 alertname=<NAME>
If the alert is listed, Prometheus did its job; the problem is downstream in routing/notification.
Step 2: Identify the failing integration from logs and metrics
journalctl -u alertmanager --no-pager | grep -i 'Notify attempt failed' | tail -10
rate(alertmanager_notifications_failed_total[5m])
This names the receiver and integration (email/webhook/slack) that is failing.
Step 3: Reproduce the integration by hand
For webhooks, curl the endpoint; for email, test SMTP with curl --url smtp://... or nc. Match the timeout and auth Alertmanager uses.
Step 4: Verify routing resolves to the intended receiver
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
<label1>=<value1> <label2>=<value2>
Confirm the alert’s labels resolve to the receiver you expect.
Step 5: Validate the config and that it actually loaded
amtool check-config /etc/alertmanager/alertmanager.yml
curl -s http://localhost:9093/-/reload -X POST
journalctl -u alertmanager --no-pager | grep -i 'reload' | tail -3
A clean check plus a successful reload confirms the running config matches the file.
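Grepping the log for "reload" depends on the log wording. Alertmanager also exports the gauge alertmanager_config_last_reload_successful (1 after a successful load, 0 after a failed one), which can be checked directly. The following is a minimal sketch; the function name reload_ok is hypothetical, and it reads Prometheus text-format metrics on stdin.

```shell
# Hypothetical helper: succeed only if the last config reload was successful.
# Reads the /metrics output on stdin; a missing gauge counts as failure.
reload_ok() {
  awk '$1 == "alertmanager_config_last_reload_successful" { v = $2 }
       END { exit (v == 1 ? 0 : 1) }'
}

# Usage (not run here):
#   curl -s http://localhost:9093/metrics | reload_ok && echo "config loaded" || echo "config stale"
```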
Example Root Cause Analysis
On-call reports they stopped getting pages overnight, yet Prometheus shows the KubeNodeNotReady alert firing and present in the Alertmanager UI.
Checking metrics and logs:
rate(alertmanager_notifications_failed_total{integration="webhook"}[15m])
{receiver="team-pager"} 0.30
journalctl -u alertmanager --no-pager | grep 'Notify attempt failed' | tail -3
err="Post \"https://events.pagerduty.com/v2/enqueue\": context deadline exceeded"
Every page to the PagerDuty webhook times out. Probing the endpoint from the Alertmanager host:
curl -s -o /dev/null --max-time 10 -w 'code=%{http_code} time=%{time_total}s\n' \
https://events.pagerduty.com/v2/enqueue
code=000 time=10.002s
The host has no route to the internet on egress port 443 — a firewall change overnight blocked outbound HTTPS, so the webhook never connected and Alertmanager logged context deadline exceeded on every retry.
The fix restores egress to the PagerDuty endpoint (allowlist the destination on the firewall), after which the probe gets an HTTP response again and pending pages flush:
curl -s -o /dev/null -w '%{http_code}\n' https://events.pagerduty.com/v2/enqueue
400
(A 400 from a bare GET means the endpoint is now reachable; real notifications with a valid payload return 202.) Notifications resume and alertmanager_notifications_failed_total flattens.
Prevention Best Practices
- Monitor Alertmanager itself: alert on rate(alertmanager_notifications_failed_total[5m]) > 0 and run a synthetic “watchdog” alert that always fires, so a broken notification path is detected by its absence.
- Validate config in CI with amtool check-config and confirm the post-deploy reload succeeded; a silently stale config is a common cause of “no pages.”
- Test routing with amtool config routes test for representative label sets whenever you change the route tree.
- Keep receiver credentials (SMTP app passwords, webhook tokens) in a secret store and alert on auth failures; rotated secrets are a frequent overnight break.
- Ensure Alertmanager egress to external receivers (PagerDuty, Slack, SMTP relay) is allowed through the firewall and monitored, since context deadline exceeded is often a network/egress problem, not the receiver.
- The free incident assistant can classify a notification failure as timeout vs auth vs routing and point at the integration to fix; more alerting guidance is under Prometheus and monitoring.
Quick Command Reference
# Is the alert in Alertmanager and which receiver?
amtool alert query --alertmanager.url=http://localhost:9093 alertname=<NAME>
# Failing integration from logs
journalctl -u alertmanager --no-pager | grep -i 'Notify attempt failed' | tail -10
# Probe a webhook receiver
curl -s -o /dev/null --max-time 10 -w 'code=%{http_code} time=%{time_total}s\n' -X POST \
-H 'Content-Type: application/json' -d '{"test":true}' <WEBHOOK_URL>
# Test SMTP reachability/auth
nc -vz <SMTP_HOST> 587
curl -v --url 'smtp://<SMTP_HOST>:587' --user '<USER>:<PASS>' --ssl-reqd 2>&1 | grep -Ei '535|220|250|tls'
# Routing and config validity
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml severity=critical team=payments
amtool check-config /etc/alertmanager/alertmanager.yml
# Notification failure rate by integration
rate(alertmanager_notifications_failed_total[5m])
Conclusion
Failing notifications mean the alert was evaluated and routed, but delivery to the receiver broke. Diagnose in order:
- Confirm the alert is in Alertmanager and note its target receiver.
- Identify the failing integration from logs and alertmanager_notifications_failed_total.
- Reproduce the receiver by hand: curl the webhook or test SMTP.
- Verify routing resolves to the intended receiver with amtool.
- Validate the config and confirm the reload actually applied.
context deadline exceeded usually points at network/egress or a slow receiver; SMTP errors at auth/relay; webhook non-2xx at auth or the receiver app. Test the integration directly and the cause is unambiguous.