Service Dependency and Blast Radius Mapping Prompt
Map a service's upstream and downstream dependencies, identify single points of failure and shared-fate risks, and estimate the blast radius of each failure so the team can prioritize resilience work.
- Target user: SREs and architects assessing failure-domain risk
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an architect who can look at a system and immediately see how one component's failure ripples outward, and where to put the firebreaks.

I will provide:
- The service and its dependencies (databases, caches, queues, third-party APIs, internal services)
- Call patterns (sync vs async), criticality of each dependency, and any redundancy
- Recent incidents where a dependency caused or amplified an outage

Produce a dependency and blast-radius analysis:
1. **Build the dependency map**: list upstream callers and downstream dependencies. For each downstream, classify it as hard (request fails without it) or soft (degraded but functional), and sync or async. Note redundancy and failover behavior.
2. **Identify single points of failure**: components with no redundancy whose failure takes down the service. Rank by likelihood and impact.
3. **Spot shared-fate risks**: dependencies shared across many services (a common database, auth service, DNS, a single AZ/region, a third-party provider) where one failure causes correlated, wide outages. These are often underestimated.
4. **Estimate blast radius per failure**: for each critical dependency, describe what fails, which user journeys break, how far it propagates (and whether retries/timeouts make it worse via cascading or retry storms), and the expected severity.
5. **Evaluate isolation**: assess existing bulkheads: timeouts, circuit breakers, fallbacks, caching, graceful degradation, cell/shard isolation. Flag where their absence turns a small failure into a big one.
6. **Recommend firebreaks**: prioritized resilience improvements (add a circuit breaker, set aggressive timeouts, add a fallback, remove a hard dependency, regionalize). Rank by blast-radius reduction per unit of effort.
7. **Validate**: propose a GameDay or fault-injection test to confirm the riskiest blast-radius assumptions are accurate.

Output: the dependency table, the SPOF and shared-fate lists, a blast-radius assessment per critical dependency, and the ranked firebreak recommendations. Be explicit about cascading-failure and retry-storm risks — they are the usual reason a small failure becomes an outage.
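
Outside the prompt itself, it can help to sanity-check the model's answer against a quick mechanical pass over the same dependency data. The following is a minimal sketch, not a definitive method: the service names, the `EDGES` tuple layout, and all function names are hypothetical and chosen only for illustration. It treats a failure as propagating only along hard dependencies, flags non-redundant hard dependencies as single points of failure, flags dependencies with several callers as shared-fate candidates, and computes the worst-case retry multiplication when every layer retries independently.

```python
from collections import deque

# Hypothetical edges: (caller, dependency, kind, redundant).
# kind is "hard" (caller fails without it) or "soft" (caller degrades).
EDGES = [
    ("web", "checkout", "hard", True),
    ("web", "recommendations", "soft", True),
    ("checkout", "payments-api", "hard", False),
    ("checkout", "orders-db", "hard", False),
    ("checkout", "auth", "hard", True),
    ("recommendations", "auth", "hard", True),
    ("recommendations", "cache", "soft", True),
]


def blast_radius(edges, failed):
    """Services that fail outright when `failed` is down.

    Walks from the failed dependency up to its callers, following only
    hard edges. Soft callers degrade but are not counted as failed.
    """
    hard_callers = {}
    for caller, dep, kind, _redundant in edges:
        if kind == "hard":
            hard_callers.setdefault(dep, []).append(caller)

    impacted = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for caller in hard_callers.get(node, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


def single_points_of_failure(edges):
    """Hard dependencies that have no redundancy."""
    return sorted(
        {dep for _caller, dep, kind, redundant in edges
         if kind == "hard" and not redundant}
    )


def shared_fate(edges, min_callers=2):
    """Dependencies used by at least `min_callers` distinct callers."""
    callers = {}
    for caller, dep, _kind, _redundant in edges:
        callers.setdefault(dep, set()).add(caller)
    return sorted(dep for dep, cs in callers.items() if len(cs) >= min_callers)


def worst_case_retry_amplification(attempts_per_layer, layers):
    """Upper bound on calls reaching the lowest layer per user request
    when each layer independently makes up to `attempts_per_layer` tries."""
    return attempts_per_layer ** layers


if __name__ == "__main__":
    for dep in ("orders-db", "auth", "cache"):
        print(dep, "->", sorted(blast_radius(EDGES, dep)))
    print("SPOFs:", single_points_of_failure(EDGES))
    print("Shared fate:", shared_fate(EDGES))
    print("Retry amplification:", worst_case_retry_amplification(3, 3))
```

In this toy graph, losing `auth` takes down more services than losing `orders-db` even though `auth` is redundant, which is the kind of shared-fate risk step 3 asks about. The retry figure is only an upper bound under the stated assumption; retry budgets, backoff, or circuit breakers would lower it.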