
Grafana Notification Policies & Contact Points Design Prompt

Design Grafana Alerting notification policy trees and contact points — label-based routing, nested policies, mute timings, and grouping — so the right team gets paged through the right channel.

Target user: Teams using Grafana-managed alerting (not standalone Alertmanager) for routing
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior platform engineer who has designed Grafana Alerting notification policy trees for multi-team orgs and knows how the nested matcher model routes (and silently mis-routes) alerts.

I will provide:
- My teams/services and how they should be paged (Slack, PagerDuty, email, webhook)
- The labels available on my alert rules (team, severity, env, service)
- Current pain points (everything goes to one channel, severity ignored, noisy grouping)
- Whether I provision via UI, Terraform, or file provisioning

Your job:

1. **Explain the policy tree model**: every alert enters at the root (default) policy, then Grafana checks the nested policies in order against the alert's labels. The first matching sibling handles the alert and evaluation stops there, unless that policy has `continue` enabled, in which case the following siblings are evaluated too. An alert that matches no nested policy is handled by the parent. Make sure I understand "first match wins unless continue" before we design.

2. **Contact points first**: define one contact point per real destination (team-payments-pagerduty, team-payments-slack, etc.). Show the integration settings and how to template the message title/body with `{{ }}`.

3. **Routing tree**: design nested policies: root catch-all → per-team by `team` label → per-severity by `severity` label. Provide the matcher for each node and which contact point it targets.

4. **Grouping**: set `group_by`, `group_wait`, `group_interval`, `repeat_interval` per node; explain why critical alerts get a short `repeat_interval` and info alerts a long one.

5. **Mute timings**: define maintenance-window and off-hours-low-severity mute timings; attach them to the right policy nodes; explain the difference between a mute timing and a silence.

6. **Severity escalation**: show how SEV1 routes to PagerDuty while SEV3 routes to Slack only, using `continue` so a SEV1 also posts to the team channel.

7. **Provisioning as code**: translate the design into Terraform (`grafana_notification_policy`, `grafana_contact_point`) or YAML file provisioning, whichever I use; warn that provisioned policies are read-only in the UI by default.

8. **Validation**: give me 4 synthetic alerts (different team/severity/env) and trace exactly which contact point(s) each reaches and why.

Output as: (a) contact point definitions, (b) the full policy tree (visual + matchers), (c) grouping/timing settings per node, (d) mute timing defs, (e) Terraform or YAML provisioning, (f) the 4 routing trace examples.

Bias toward an explicit, testable tree over a clever flat config; call out any unreachable policy nodes.
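
Checking the model's answer (not part of the prompt): the routing traces requested in step 8 can be reproduced by hand with a small simulation. The sketch below is a simplified model of the "first match wins unless continue" rule, not Grafana's implementation. It assumes equality-only matchers (Grafana also supports `!=`, `=~`, and `!~`) and that a parent's contact point is used only when none of its nested policies match. The `Policy` class, the `route` function, and the contact point names `default-email` and `team-search-slack` are hypothetical, introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Policy:
    """One node of a notification policy tree (simplified, hypothetical model)."""
    contact_point: str
    matchers: dict[str, str] = field(default_factory=dict)  # label -> required value
    continue_matching: bool = False  # keep evaluating later siblings after a match
    children: list[Policy] = field(default_factory=list)


def route(policy: Policy, labels: dict[str, str]) -> list[str]:
    """Return the contact points an alert with these labels would reach."""
    # A node is skipped entirely if any of its matchers fails.
    if any(labels.get(name) != value for name, value in policy.matchers.items()):
        return []
    reached: list[str] = []
    for child in policy.children:
        matched = route(child, labels)
        reached.extend(matched)
        # First matching sibling wins unless it asks to continue.
        if matched and not child.continue_matching:
            break
    # With no matching child, the node's own contact point handles the alert.
    return reached or [policy.contact_point]


# Example tree: root catch-all, then per-team, then per-severity.
root = Policy("default-email", children=[
    Policy("team-payments-slack", {"team": "payments"}, children=[
        # continue_matching lets a SEV1 page PagerDuty and still reach Slack.
        Policy("team-payments-pagerduty", {"severity": "SEV1"}, continue_matching=True),
        Policy("team-payments-slack", {"team": "payments"}),
    ]),
    Policy("team-search-slack", {"team": "search"}),
])

alerts = [
    {"team": "payments", "severity": "SEV1", "env": "prod"},
    {"team": "payments", "severity": "SEV3", "env": "prod"},
    {"team": "search", "severity": "SEV1", "env": "staging"},
    {"team": "data", "severity": "SEV1", "env": "prod"},
]

for alert in alerts:
    print(alert, "->", route(root, alert))
```

In this model the payments SEV1 alert reaches both `team-payments-pagerduty` and `team-payments-slack`, the payments SEV3 alert reaches Slack only, and the alert from the unrouted `data` team falls back to `default-email`. Note that `continue` only moves on to later siblings: without the second nested policy under payments, the SEV1 alert would reach PagerDuty alone, which is the kind of silent mis-route the prompt asks the model to call out.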
