AI for Prometheus & Monitoring

Alertmanager Inhibition & Silence Strategy Prompt

Design inhibition rules and silences that suppress downstream noise — when a node dies, don't also page for every pod on it — without ever muting the alert that actually matters.

Target user: On-call leads tuning Alertmanager to cut cascade noise
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an alerting architect who has untangled pager storms where one root cause fired 200 alerts. You wield inhibition like a scalpel: suppress the symptoms, never the cause.

I will provide:
- The alert taxonomy (names, severities, labels)
- Real incident examples where one failure cascaded into many pages
- Current Alertmanager `inhibit_rules` (if any) and routing tree
- The label set shared between cause and effect alerts (e.g. instance, cluster, node)

Your job:

1. **Map cause → effect**: for each cascade, identify the source alert (NodeDown, ClusterUnreachable, DatabaseDown) and the dependent alerts it should mute (PodNotReady, TargetDown, HighLatency on that node).

2. **Write inhibit_rules**: `source_matchers`, `target_matchers`, and the critical `equal:` labels that scope suppression to the SAME entity. Explain how a missing `equal` over-suppresses globally, and how an `equal` label that is absent or formatted differently on one side silently turns the rule into a no-op.

3. **Severity inhibition**: suppress `warning` for a service when its `critical` is already firing; show the rule and spell out the ordering implications: the `critical` must already be firing for the `warning` to be muted, and an alert that matches both sides of a rule does not inhibit itself.

4. **Maintenance silences**: `amtool silence add` patterns for deploys/maintenance windows, each with an expiry (`--duration` or `--end`), a `--comment`, and an `--author`; and how to script silences from CI for planned changes.

5. **Guard against over-suppression**: what must NEVER be inhibited (the Watchdog, paging SEV1s), and how to detect a silence that's hiding a real outage (a meta-alert on long-lived/over-broad silences).

6. **Validation**: replay a past incident's alerts against a non-production Alertmanager (for example with `amtool alert add`) to confirm the inhibition produces exactly one actionable page.

Output: (a) the `inhibit_rules` YAML with comments per rule, (b) a cause→effect suppression matrix, (c) amtool silence scripts for maintenance, (d) a meta-alert for dangerous silences, (e) a replay/test procedure.

Bias toward: scoping every rule with `equal`, suppressing symptoms only, and treating a too-broad silence as an incident in itself.
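
For orientation, output (a) from this prompt might resemble the sketch below. It is illustrative only, not a drop-in config: the alert names are taken from the examples in the prompt, and the `cluster`, `node`, `service`, and `severity` labels are assumptions that have to match whatever labels your alerts actually carry.

```yaml
# Sketch only: alert and label names are assumptions, adapt to your taxonomy.
inhibit_rules:
  # A node is down: mute the per-pod and per-target symptoms on that node.
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - alertname =~ "PodNotReady|TargetDown|HighLatency"
    # Source and target must carry identical cluster and node values,
    # so one dead node never mutes alerts coming from healthy nodes.
    equal: [cluster, node]

  # A critical is already firing: mute the warning for the same service.
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    # Scope to the same alert on the same service in the same cluster.
    equal: [alertname, cluster, service]
```

One detail worth checking in any generated rule: Alertmanager treats a missing label and an empty label as the same value, so if every label listed in `equal` is absent from both the source and the target, the rule still applies. A rule scoped on a label that none of your alerts carry is therefore effectively unscoped. Neither rule above targets the Watchdog, provided it does not carry `severity: warning` or one of the listed alert names.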
