Slack Service Mesh Issue Notifications Prompt
Route Istio / Linkerd / Consul service mesh alerts to Slack — traffic policy violations, mTLS failures, latency spikes, retry storms, and circuit breaker activations — with ownership routing.
- Target user: Platform engineers running service mesh in production
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has run Istio + Linkerd in production and tuned the alert pipeline so that service mesh signals reach the right service owner in Slack instead of disappearing into platform team noise.

I will provide:
- Mesh in use (Istio / Linkerd / Consul / Open Service Mesh / Cilium Service Mesh)
- Cluster topology
- Service ownership map
- Pain points (mesh alerts overwhelm platform team; service owners don't see them)

Your job:

1. **Mesh signal categories**:
   - **Traffic policy violations** — denied requests via AuthorizationPolicy
   - **mTLS failures** — handshake failures, expired certs
   - **Latency** — p50 / p99 per-service-pair latency spikes
   - **Error rate** — non-2xx response rate change
   - **Retry storms** — retry rate > N per second per route
   - **Circuit breaker** — open state, half-open transitions
   - **Outlier ejection** — instance ejected from pool
   - **Connection pool exhaustion** — service-to-service connection limits hit
   - **Config drift** — desired vs applied policy

2. **Ownership mapping** — the killer feature:
   - Mesh signals are per-route (source-service → destination-service)
   - **Source-service owns** — retry policies, timeouts
   - **Destination-service owns** — error rate, latency
   - **Platform owns** — cert lifecycle, mesh upgrades, ingress
   - Route alerts to the right owner; don't dump everything on platform team

3. **Routing rules**:
   - **Latency spike (destination)** → destination team channel
   - **Retry storm (source)** → source team channel + suggest "increase timeout?" or "back-off?"
   - **mTLS handshake failure** → platform channel (likely cert issue)
   - **Authz denial spike** → both teams (boundary issue)
   - **Circuit breaker open** → destination team + heads-up to source

4. **Alert message anatomy**:
   - Source service → destination service
   - Affected route + endpoint
   - Metric value + threshold + window
   - Dashboard link with the right pre-filtered view
   - Runbook link
   - Recent deploys for either side (for change-correlation)
   - Owner ping

5. **Latency alerts** — multi-signal:
   - p50 jump: usually capacity / contention
   - p99 jump (p50 stable): tail latency; investigate slow code paths
   - p99 + p50 both jumping: stuck dependency or saturation
   - Differentiate by metric pattern

6. **Retry storm detection**:
   - Spike in retry-count metric
   - Cross-reference with destination error rate (was it slow? unavailable?)
   - Suggest action: reduce retries OR fix destination

7. **mTLS certificate health**:
   - Per-pod cert age + expiration
   - Alert if any pod's cert > 80% of TTL (rotation hasn't happened)
   - Alert if handshake-failure rate > baseline
   - Cross-reference with cert-manager / mesh control plane

8. **Outlier ejection events**:
   - Bot posts each ejection with: which instance, from which pool, by which policy
   - Cross-reference with pod restarts / crashes
   - If ejection rate climbs, escalate

9. **Config drift detection**:
   - Compare desired mesh config (in Git) to applied (in cluster)
   - On drift: post to platform channel + identify who made the manual change

10. **Anti-patterns to avoid**:
    - Routing all mesh alerts to platform (overload + no service ownership)
    - Alerting on every retry (too noisy)
    - mTLS alerts without action (just "rotate now" instead of "expires in N days")
    - Missing the change-correlation (deploys, config changes)
    - Single threshold for latency (different services have wildly different normals)

11. **Tuning**:
    - Per-service baselines (anomaly detection, not absolute threshold)
    - Time-of-day awareness (different normals at peak vs off-peak)
    - Weekly review of FP rate; tune

Output as: (a) signal category taxonomy, (b) ownership mapping rules, (c) routing matrix, (d) Block Kit JSON for one latency alert, (e) retry storm correlation logic, (f) mTLS certificate health monitoring, (g) config drift detection, (h) tuning process.

Bias toward: route alerts to actual owners, change-correlation built in, per-service baselines over absolutes, action-oriented messages.
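
As a reference point when reviewing the model's output (not part of the prompt itself), the routing rules in step 3 can be sketched as a small lookup. This is a minimal illustration, not a definitive implementation: the channel names, the `OWNERS` map, the signal identifiers, and the `MeshAlert` type are all hypothetical, and falling back to the platform channel for unowned services is an assumption.

```python
from dataclasses import dataclass

PLATFORM_CHANNEL = "#platform-mesh"  # hypothetical channel name

# Hypothetical ownership map: service name -> owning team's Slack channel.
OWNERS = {
    "checkout": "#team-checkout",
    "payments": "#team-payments",
}


@dataclass(frozen=True)
class MeshAlert:
    signal: str       # hypothetical identifiers, e.g. "latency_spike"
    source: str       # calling service
    destination: str  # called service


def route(alert, owners=OWNERS):
    """Return the Slack channels that should receive this alert."""
    # Assumption: services with no known owner fall back to the platform channel.
    src = owners.get(alert.source, PLATFORM_CHANNEL)
    dst = owners.get(alert.destination, PLATFORM_CHANNEL)
    rules = {
        "latency_spike": [dst],                        # destination owns latency
        "retry_storm": [src],                          # source owns retry policy
        "mtls_handshake_failure": [PLATFORM_CHANNEL],  # likely a cert issue
        "authz_denial_spike": [src, dst],              # boundary issue: both teams
        "circuit_breaker_open": [dst, src],            # destination + heads-up to source
    }
    channels = rules.get(alert.signal, [PLATFORM_CHANNEL])
    # Remove duplicates while keeping order (both sides may map to one channel).
    return list(dict.fromkeys(channels))
```

For example, `route(MeshAlert("retry_storm", "checkout", "payments"))` yields only the source team's channel, which matches the ownership split in step 2.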
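
The latency patterns in step 5 can likewise be illustrated with a short classifier. This is a sketch under stated assumptions: the function name, the baseline inputs, and the `jump_ratio` of 1.5 are hypothetical choices for illustration, and a real pipeline would derive per-service baselines as described in step 11.

```python
def classify_latency(p50_now, p50_base, p99_now, p99_base, jump_ratio=1.5):
    """Map a p50/p99 pattern to the likely cause listed in the prompt.

    A percentile counts as "jumping" when it exceeds its baseline by more
    than jump_ratio (an assumed, illustrative threshold).
    """
    p50_jump = p50_now > p50_base * jump_ratio
    p99_jump = p99_now > p99_base * jump_ratio

    if p50_jump and p99_jump:
        return "stuck dependency or saturation"
    if p99_jump:
        # p50 is stable, so only the tail has degraded.
        return "tail latency: investigate slow code paths"
    if p50_jump:
        return "capacity / contention"
    return "within baseline"
```

Because the comparison is against a per-service baseline and not a fixed number, the same function applies to services with very different normal latencies.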
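
The certificate check in step 7 (alert when a cert has used more than 80% of its TTL) reduces to simple date arithmetic. The sketch below assumes the certificate's validity window is already available as two timezone-aware datetimes; how those are obtained from the mesh control plane is left out, and both function names are hypothetical.

```python
from datetime import datetime, timezone


def cert_lifetime_used(not_before, not_after, now):
    """Return the fraction of the certificate's validity window already elapsed."""
    total = (not_after - not_before).total_seconds()
    if total <= 0:
        raise ValueError("not_after must be later than not_before")
    return (now - not_before).total_seconds() / total


def rotation_overdue(not_before, not_after, now, threshold=0.8):
    """True when the cert has used more than `threshold` of its TTL."""
    return cert_lifetime_used(not_before, not_after, now) > threshold


# Usage with a hypothetical 10-day certificate:
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = datetime(2024, 1, 11, tzinfo=timezone.utc)
checked = datetime(2024, 1, 10, tzinfo=timezone.utc)
overdue = rotation_overdue(issued, expires, checked)  # 9 of 10 days elapsed
```

Reporting the remaining time alongside the flag keeps the alert action-oriented ("expires in N days"), in line with the anti-patterns in step 10.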