Slack Notification Deduplication & Burst Suppression Prompt
Design dedup and burst suppression for high-volume Slack alert pipelines — fingerprinting, sliding windows, exponential cooldown, and dependency-aware suppression.
- Target user: SREs taming high-volume monitoring firehoses in Slack
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior SRE who has tuned alert pipelines that turned 3,000 alerts/day into 80 alerts/day with no missed real incidents, using disciplined dedup and burst suppression.

I will provide:
- Alert sources (Alertmanager / Datadog / Sentry / custom)
- Current alert volume + breakdown by source
- Slack channels currently flooded
- Examples of duplicate / burst patterns
- SLO for alert delivery (you can't lose real alerts!)

Your job:

1. **Dedup vs suppress vs silence**: different mechanisms for different problems:
   - **Dedup**: the same alert fired twice; deliver only the first
   - **Burst suppress**: many similar alerts in a short window; deliver a single summary
   - **Dependency suppress**: a child alert fires because its parent is down; suppress the children
   - **Silence**: known issue or planned maintenance; suppress for a fixed window
2. **Fingerprinting**: what makes two alerts "the same":
   - For dedup: alertname + service + env + relevant labels
   - For burst: alertname + service (broader than dedup)
   - For dependency: source vs derived (e.g. http_5xx is derived from upstream_db_down)
3. **Dedup window**:
   - 5 min default for most alerts
   - 30 min for slow-burn SLO alerts
   - Per-rule override
4. **Burst suppress design**:
   - **Trigger**: more than N distinct fingerprints sharing the same burst key within a window of W minutes
   - **Action**: deliver one summary message: "<service> <alertname> firing on N instances in last W min (top 5 hosts: …)"
   - **Continued bursts**: suppress further alerts with the same key until the burst ends, plus a cooldown
   - **End signal**: M min without new alerts for the key
5. **Exponential cooldown** for noisy patterns:
   - 1st burst: send a summary
   - 2nd burst within 1h: send a fuller summary and lengthen the suppression window (e.g. double it)
   - 3rd burst: page on-call to investigate why this pattern is so noisy
6. **Dependency-aware suppression**:
   - Build a dependency graph (service A depends on service B); when B is down, B's alerts are primary and A's `connect refused` alerts are secondary
   - During the parent-down window, summarize secondary alerts as "N services impacted by B" rather than firing each one individually
   - When the parent resolves, re-evaluate the children before letting any of them fire again
7. **What you must NEVER suppress**:
   - First firing of a new alert type
   - SEV1 alerts in production
   - Customer-impacting symptoms
   - Anything during a freeze / change-block window (these should auto-page louder, not quieter)
8. **Architecture**:
   - **Translator service** between alert source and Slack
   - **Dedup store**: Redis with TTLs; key = fingerprint, value = first-seen + last-seen + count
   - **Burst detector**: sliding-window counter per key
   - **Dependency graph**: periodically refreshed from the service catalog
   - **Audit log**: every suppression decision is logged for review
9. **Periodic review**:
   - Weekly: review suppression decisions; were any incidents masked?
   - Monthly: tune fingerprint rules based on false-negative cases (real incidents that were suppressed)
   - Quarterly: prune dead alerts that never fire (they add maintenance cost and deliver no signal)
10. **Metrics**:
    - Inbound alert volume
    - Outbound message volume
    - Suppression rate by reason (dedup / burst / dependency)
    - False-suppression cases (incident masked); should be 0
    - p95 time-to-first-message after a real incident starts

Output as: (a) fingerprint scheme per source, (b) dedup window policy, (c) burst suppress logic, (d) dependency graph model, (e) never-suppress rule set, (f) translator service architecture, (g) audit log + review process, (h) metrics dashboard.

Bias toward: aggressive suppression of duplicates, conservative suppression of singletons, every suppression decision is auditable, false suppression is a SEV1 bug.
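
For orientation when reviewing the model's answer to parts (b), (c), and (e), the dedup window, burst detector, never-suppress rule, and audit log described in the prompt can be sketched roughly as below. This is a minimal illustration, not a reference implementation: the `Translator` class, the `dedup_fingerprint` and `burst_key` helpers, the label names (`alertname`, `service`, `env`, `instance`, `severity`), and all thresholds are hypothetical, and plain in-memory dictionaries stand in for the Redis dedup store.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


def dedup_fingerprint(alert: dict) -> tuple:
    # Narrow key: two alerts are "the same" only if all of these match.
    # The label names are assumptions; real sources use their own labels.
    return (alert["alertname"], alert["service"], alert["env"],
            alert.get("instance", ""))


def burst_key(alert: dict) -> tuple:
    # Broader key: groups every instance of one alert on one service.
    return (alert["alertname"], alert["service"])


@dataclass
class Translator:
    dedup_window_s: float = 300.0  # 5 min default dedup window
    burst_threshold: int = 3       # N: distinct fingerprints tolerated
    burst_window_s: float = 60.0   # W: sliding-window length
    quiet_s: float = 120.0         # M: quiet time that ends a burst
    first_seen: dict = field(default_factory=dict)
    recent: dict = field(default_factory=lambda: defaultdict(deque))
    in_burst: set = field(default_factory=set)
    audit: list = field(default_factory=list)

    def decide(self, alert: dict, now: float) -> str:
        fp, key = dedup_fingerprint(alert), burst_key(alert)
        decision = self._decide(alert, fp, key, now)
        # Every decision, including plain deliveries, is auditable.
        self.audit.append((now, fp, decision))
        return decision

    def _decide(self, alert: dict, fp: tuple, key: tuple, now: float) -> str:
        # Never-suppress rule (only the SEV1-in-production case is shown).
        if alert.get("severity") == "sev1" and alert.get("env") == "prod":
            return "deliver"

        # Dedup: drop repeats of one fingerprint inside the dedup window.
        first = self.first_seen.get(fp)
        if first is not None and now - first < self.dedup_window_s:
            return "suppress:dedup"
        self.first_seen[fp] = now

        # End signal: the burst is over once the key has been quiet long enough.
        window = self.recent[key]
        if key in self.in_burst and window and now - window[-1][0] >= self.quiet_s:
            self.in_burst.discard(key)

        # Sliding window: record this alert, evict entries older than W.
        window.append((now, fp))
        while now - window[0][0] > self.burst_window_s:
            window.popleft()

        if key in self.in_burst:
            return "suppress:burst"
        if len({f for _, f in window}) > self.burst_threshold:
            self.in_burst.add(key)
            return "summary"  # caller posts one summary message to Slack
        return "deliver"
```

Calling `decide(alert, now)` for each inbound alert returns one of `"deliver"`, `"summary"`, `"suppress:dedup"`, or `"suppress:burst"`. Passing `now` explicitly instead of reading the clock keeps the logic easy to replay against historical alert logs, which helps with the weekly review of suppression decisions. Exponential cooldown and dependency-aware suppression are left out of this sketch.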