
Slack Notification Deduplication & Burst Suppression Prompt

Design dedup and burst suppression for high-volume Slack alert pipelines — fingerprinting, sliding windows, exponential cooldown, and dependency-aware suppression.

Target user: SREs taming high-volume monitoring firehoses in Slack
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior SRE who has tuned alert pipelines that turned 3,000 alerts/day into 80 alerts/day with no missed real incidents, using disciplined dedup and burst suppression.

I will provide:
- Alert sources (Alertmanager / Datadog / Sentry / custom)
- Current alert volume + breakdown by source
- Slack channels currently flooded
- Examples of duplicate / burst patterns
- SLO for alert delivery (you can't lose real alerts!)

Your job:

1. **Dedup vs suppress vs silence** — different mechanisms for different problems:
   - **Dedup**: the same alert fired more than once; deliver only the first
   - **Burst suppress**: many similar alerts in a short window; deliver a single summary
   - **Dependency suppress**: a child alert fires because its parent is down; suppress the children
   - **Silence**: a known issue or planned maintenance; suppress for a fixed window

2. **Fingerprinting** — what makes two alerts "the same":
   - Alertname + service + env + relevant labels
   - For burst: alertname + service (broader than dedup)
   - For dependency: source vs derived (e.g. http_5xx is derived from upstream_db_down)

3. **Dedup window**:
   - 5 min default for most alerts
   - 30 min for slow-burn SLO alerts
   - Per-rule override

4. **Burst suppress design**:
   - **Trigger**: more than N distinct fingerprints sharing the same burst key within a window of W minutes
   - **Action**: deliver one summary message: "<service> <alertname> firing on M instances in last W min; top 5 hosts: …" (M is the number of instances seen so far)
   - **Continued bursts**: suppress further alerts with the same key until the burst ends and the cooldown expires
   - **End signal**: Q minutes with no new alerts for the key

5. **Exponential cooldown** for noisy patterns:
   - 1st burst: send a summary
   - 2nd burst of the same key within 1h: send a more detailed summary and lengthen the suppression window (for example, double it)
   - 3rd burst within 1h: page on-call to investigate why this pattern is so noisy

6. **Dependency-aware suppression**:
   - Build a dependency graph (service A depends on service B). When B is down, B's alerts are primary and A's `connection refused` alerts are secondary
   - During the parent-down window, summarize secondary alerts as "N services impacted by B" rather than firing each one individually
   - When the parent resolves, re-evaluate the children before letting any further alerts fire

7. **What you must NEVER suppress**:
   - First firing of a new alert type
   - SEV1 alerts in production
   - Customer-impacting symptoms
   - Anything during a freeze / change-block window (these should auto-page louder, not quieter)

8. **Architecture**:
   - **Translator service** between alert source and Slack
   - **Dedup store** — Redis with TTLs; key = fingerprint, value = first-seen + last-seen + count
   - **Burst detector** — sliding-window counter per key
   - **Dependency graph** — periodically refreshed from service catalog
   - **Audit log** — every suppression decision is logged for review

9. **Periodic review**:
   - Weekly: review suppression decisions; were any incidents masked?
   - Monthly: tune fingerprint rules based on false-negative cases (real incidents that were suppressed)
   - Quarterly: prune dead alerts that never fire (they add maintenance cost without providing signal)

10. **Metrics**:
   - Inbound alert volume
   - Outbound message volume
   - Suppression rate by reason (dedup / burst / dependency)
   - False-suppression cases (incident masked) — should be 0
   - p95 time-to-first-message after a real incident starts

Output as: (a) fingerprint scheme per source, (b) dedup window policy, (c) burst suppress logic, (d) dependency graph model, (e) never-suppress rule set, (f) translator service architecture, (g) audit log + review process, (h) metrics dashboard.

Bias toward: aggressive suppression of duplicates, conservative suppression of singletons, an audit trail for every suppression decision, and treating false suppression as a SEV1 bug.
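
When reviewing the fingerprint scheme and dedup window policy the model returns, it can help to have a small reference to compare against. The following is a minimal Python sketch, not a definitive implementation. It assumes alerts arrive as flat label dictionaries, uses an in-memory dictionary as a stand-in for the Redis dedup store, and measures the window from the first delivery. The label sets, the `SLOBurnRateSlow` rule name, and all function and class names are hypothetical.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical label sets: dedup uses a narrow key, burst a broader one.
DEDUP_LABELS = ("alertname", "service", "env", "instance")
BURST_LABELS = ("alertname", "service")

DEFAULT_WINDOW_S = 5 * 60  # 5 min default for most alerts
# Per-rule overrides keyed by alertname (the rule name is illustrative).
WINDOW_OVERRIDES_S = {"SLOBurnRateSlow": 30 * 60}


def fingerprint(labels: dict[str, str], keys: tuple[str, ...]) -> str:
    """Stable hash over the selected labels; missing labels hash as empty."""
    canonical = "|".join(f"{k}={labels.get(k, '')}" for k in keys)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


@dataclass
class Entry:
    first_seen: float
    last_seen: float
    count: int


class DedupStore:
    """In-memory stand-in for a Redis store with per-key TTLs."""

    def __init__(self) -> None:
        self._entries: dict[str, Entry] = {}

    def should_deliver(self, labels: dict[str, str], now: float) -> bool:
        fp = fingerprint(labels, DEDUP_LABELS)
        window = WINDOW_OVERRIDES_S.get(labels.get("alertname", ""), DEFAULT_WINDOW_S)
        entry = self._entries.get(fp)
        if entry is not None and now - entry.first_seen < window:
            # Duplicate inside the window: record it, do not deliver.
            entry.last_seen = now
            entry.count += 1
            return False
        # First sighting, or the previous window has expired.
        self._entries[fp] = Entry(first_seen=now, last_seen=now, count=1)
        return True
```

Two alerts that differ only in `instance` get different dedup fingerprints but the same burst fingerprint, which is the "broader than dedup" relationship described in the prompt. In a real translator service the never-suppress rules would need to be checked before `should_deliver` is consulted.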

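
The burst suppress logic in the prompt (trigger, single summary, continued suppression, end signal) can be sketched as a small sliding-window state machine. This is one possible reading under stated assumptions rather than a reference design: the burst key is a plain string such as "service alertname", timestamps are seconds supplied by the caller, and the class name, method names, and return values (`deliver`, `summary`, `suppress`) are all hypothetical.

```python
from collections import Counter, deque


class BurstDetector:
    """Sliding-window burst detector, one window per burst key."""

    def __init__(self, threshold: int, window_s: float, quiet_s: float) -> None:
        self.threshold = threshold  # trigger when distinct instances exceed this
        self.window_s = window_s    # sliding window length in seconds
        self.quiet_s = quiet_s      # silence needed to end a burst
        self._events: dict[str, deque] = {}  # key -> deque of (timestamp, instance)
        self._in_burst: set[str] = set()

    def observe(self, key: str, instance: str, now: float) -> str:
        """Return 'deliver', 'summary', or 'suppress' for one incoming alert."""
        events = self._events.setdefault(key, deque())
        # End signal: the key stayed quiet long enough, so reset its state.
        if key in self._in_burst and events and now - events[-1][0] >= self.quiet_s:
            self._in_burst.discard(key)
            events.clear()
        events.append((now, instance))
        # Drop events that have slid out of the window.
        while now - events[0][0] > self.window_s:
            events.popleft()
        if key in self._in_burst:
            return "suppress"
        if len({inst for _, inst in events}) > self.threshold:
            self._in_burst.add(key)
            return "summary"
        return "deliver"

    def summary(self, key: str) -> str:
        """Build a one-line summary for the instances currently in the window."""
        counts = Counter(inst for _, inst in self._events.get(key, ()))
        top = ", ".join(host for host, _ in counts.most_common(5))
        minutes = self.window_s / 60
        return f"{key} firing on {len(counts)} instances in last {minutes:g} min; top hosts: {top}"
```

A caller would post `summary(key)` to Slack when `observe` returns `summary` and log every `suppress` result to the audit log. Exponential cooldown is not modeled here; it could be layered on by growing `quiet_s` per key each time a burst repeats.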