Blast-Radius Mapping: Knowing What Breaks Before It Does

Halfway through an incident, someone always asks the question that decides how bad your night gets: “If we restart the auth service, what else breaks?” In a healthy organization, you can answer in seconds. In most, the answer is a nervous shrug and a dawning realization that nobody has the whole map in their head.

Blast-radius mapping is the work of building that map before the incident, so that during one you’re reading a diagram instead of discovering your architecture by causing a second outage.

What blast radius actually means

Blast radius is the set of things affected when a given component degrades or fails. It runs in two directions, and you need both:

Downstream (what I take down) — if service X fails, who depends on X and therefore also fails or degrades?
Upstream (what takes me down) — what does X depend on, such that their failure becomes mine?

The downstream direction tells you the impact of an outage. The upstream direction tells you where to look when you’re the symptom but not the cause. During triage you walk upstream to find the root; during impact assessment you walk downstream to size the damage.
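
Both walks are the same graph traversal run in opposite directions. Here is a minimal Python sketch, assuming the map is stored as a plain dictionary of direct dependencies. The service names echo the examples in this article, but the edges (including web calling auth directly) are made up for illustration.

```python
from collections import deque

# Hypothetical edges: service -> the services it calls directly.
DEPENDS_ON = {
    "web": ["checkout", "recommendations", "auth"],
    "checkout": ["payments", "inventory", "auth"],
    "recommendations": ["catalog"],
    "auth": ["postgres", "redis"],
}


def reachable(graph, start):
    """Return every node reachable from start, excluding start itself."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Flip edge direction: dependency -> the services that call it."""
    inverted = {}
    for service, deps in graph.items():
        for dep in deps:
            inverted.setdefault(dep, []).append(service)
    return inverted


def upstream(service):
    """What can take this service down: its transitive dependencies."""
    return reachable(DEPENDS_ON, service)


def downstream(service):
    """What this service takes down: its transitive dependents."""
    return reachable(invert(DEPENDS_ON), service)
```

With this sample data, downstream("auth") returns web and checkout, and upstream("checkout") includes postgres and redis even though checkout never calls them directly.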

Build the dependency inventory first

You can’t map blast radius without a dependency inventory. Start coarse — a spreadsheet beats nothing — with one row per service:

Service	Depends on	Depended on by	Failure mode if dependency down	Hard or soft dependency
checkout	payments, inventory, auth	web	fails closed	hard
recommendations	catalog	web	degrades, hides widget	soft
auth	postgres, redis	~everything	fails closed	hard

The most important column is the last one. A hard dependency means you fail when it fails. A soft dependency means you degrade gracefully: you hide a widget, serve stale data, or skip a non-critical call. Knowing which is which tells you whether a dependency failure is an outage you have to announce or a degraded feature you can note and move on from.

Most catastrophic blast radii come from a hard dependency nobody realized was hard. The classic case is the “optional” feature-flag service that, when it hangs, blocks every request because the client was written without a timeout. Finding those is half the value of the exercise.
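
To make the feature-flag example concrete, here is a rough Python sketch of a caller that treats a dependency as soft: it waits a bounded time and then uses a default. The helper call_soft and the stand-in slow_flag_lookup are hypothetical names, and a thread pool is only one way to impose a deadline.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Small shared pool for this sketch. Note that a hung call still occupies
# a worker thread until it returns; the deadline only frees the caller.
_pool = ThreadPoolExecutor(max_workers=4)


def call_soft(fn, timeout_s, fallback):
    """Run fn with a deadline; return fallback if it is slow or raises."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # dependency too slow: degrade instead of hanging
    except Exception:
        return fallback  # dependency errored: degrade instead of failing


def slow_flag_lookup():
    """Stand-in for a feature-flag service that has stopped responding."""
    time.sleep(0.3)
    return True


# The request proceeds with the default after 50 ms instead of blocking.
new_checkout_enabled = call_soft(slow_flag_lookup, timeout_s=0.05, fallback=False)
```

In a real client the better fix is usually a timeout on the network call itself, since that also releases the underlying connection. The point of the sketch is only that the caller decides up front what "dependency unavailable" means.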

Tier your dependencies by criticality

Once you have the inventory, tier it. A simple three-tier model:

Tier 0 — shared infrastructure everything needs: auth, DNS, the primary database, service mesh. A Tier 0 failure is a company-wide event.
Tier 1 — core product paths: checkout, login, the main API.
Tier 2 — degradable features: recommendations, analytics, non-critical async jobs.

This tiering directly drives severity. A Tier 0 outage is a SEV1 almost by definition. A Tier 2 degradation might be a SEV3. You’ve pre-computed part of your severity assessment, which is exactly the kind of decision you don’t want to be making from scratch at 3am.
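
The pre-computation can be as small as a lookup table. The sketch below assumes a hypothetical TIER mapping maintained alongside the inventory. The Tier 0 to SEV1 and Tier 2 to SEV3 entries follow the text above, while Tier 1 to SEV2 is an assumed middle value, not something this article prescribes.

```python
# Hypothetical tier assignments taken from the three-tier model above.
TIER = {
    "auth": 0,
    "checkout": 1,
    "recommendations": 2,
}

# Starting point only: the incident commander can still adjust it.
# The Tier 1 -> SEV2 entry is an assumption for illustration.
DEFAULT_SEVERITY = {0: "SEV1", 1: "SEV2", 2: "SEV3"}


def starting_severity(service):
    """Return a default severity for an outage of the given service."""
    tier = TIER.get(service)
    if tier is None:
        # An untiered service is itself a finding: the map has a gap.
        return "UNTIERED"
    return DEFAULT_SEVERITY[tier]
```

Returning a distinct value for unknown services, rather than guessing a severity, keeps gaps in the map visible.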

Find the single points of failure

With the map drawn, look for the components that show up in everyone’s upstream list. Those are your single points of failure, and they’re where you invest in resilience:

Add timeouts and circuit breakers so a slow Tier 0 dependency degrades callers instead of hanging them.
Convert hard dependencies to soft where the product allows it.
Add bulkheads so one failing dependency can’t exhaust a shared connection pool.
Cache aggressively at the boundary so a brief outage is invisible.
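
As an illustration of the first item, here is a deliberately small circuit breaker in Python. It is a sketch, not a production implementation: the class name, thresholds, and fallback behavior are assumptions, it is not thread-safe, and mature libraries and service meshes offer more complete versions.

```python
import time


class CircuitBreaker:
    """After max_failures consecutive errors, skip the dependency for a while."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback  # open: fail fast without calling fn
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately, which is what turns a slow dependency into a degraded feature rather than a pile of hung requests.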

The goal isn’t to eliminate every SPOF — you can’t. It’s to know them, so when one fails you reach for the map instead of being surprised.
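
Finding the components that show up in everyone's upstream list is easy to automate once the inventory is machine-readable. The following sketch assumes the same kind of hypothetical dependency dictionary used for illustration in this article. The 75% threshold is arbitrary and worth tuning for your own system.

```python
# Hypothetical edges: service -> the services it calls directly.
DEPENDS_ON = {
    "web": ["checkout", "recommendations", "auth"],
    "checkout": ["payments", "inventory", "auth"],
    "recommendations": ["catalog"],
    "auth": ["postgres", "redis"],
}


def transitive_deps(graph, service, seen=None):
    """Collect everything the service depends on, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in graph.get(service, []):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(graph, dep, seen)
    return seen


def spof_candidates(graph, min_share=0.75):
    """Rank components by how many services have them upstream."""
    services = list(graph)  # only services with declared dependencies
    counts = {}
    for service in services:
        for dep in transitive_deps(graph, service) - {service}:
            counts[dep] = counts.get(dep, 0) + 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(dep, n) for dep, n in ranked if n / len(services) >= min_share]
```

On this sample data the candidates are postgres and redis, each upstream of three of the four services. That is a useful reminder that the widest blast radius often belongs to a dependency of the service you were worried about, not the service itself.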

Using AI to build and stress-test the map

This is reasoning over architecture from text you already have, which language models tend to handle well, and it requires no production access. Two strong uses:

Drafting the map from your configs. Paste service definitions, a service-mesh config, or your IaC and ask:

“Here are my service definitions and mesh config. Build a dependency table: for each service list what it depends on and what depends on it, and guess hard vs soft based on whether there’s a timeout/fallback configured. Flag any dependency with no timeout as a hard-dependency risk.”

Stress-testing blast radius. Give it the map and ask the failure questions before reality does:

“Given this dependency map, walk through what happens if [auth] returns errors for 5 minutes. List every affected service, whether it fails or degrades, the customer-visible impact, and the order I should investigate in if I’m seeing this symptom but don’t know the cause.”

A model can often spot the transitive failure you’d miss, such as the Tier 2 service whose failure backs up a queue that eventually starves a Tier 1 service. We keep dependency-mapping prompts for this kind of pre-incident analysis.
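
A model's walkthrough is easier to trust when you can cross-check it against a mechanical propagation over the same map. The Python sketch below uses a hypothetical edge list and two simplifying assumptions: a hard dependency on a failed service fails the caller, and every other affected edge only degrades the caller.

```python
# Hypothetical map: (caller, dependency) -> "hard" or "soft".
EDGES = {
    ("checkout", "auth"): "hard",
    ("checkout", "payments"): "hard",
    ("web", "checkout"): "hard",
    ("web", "recommendations"): "soft",
    ("recommendations", "catalog"): "hard",
    ("auth", "postgres"): "hard",
}


def simulate(edges, failed):
    """Return {service: "fails" | "degrades"} for everything affected."""
    state = {failed: "fails"}
    changed = True
    while changed:  # repeat until no service's state changes
        changed = False
        for (caller, dep), kind in edges.items():
            if dep not in state:
                continue
            hard_hit = kind == "hard" and state[dep] == "fails"
            new = "fails" if hard_hit else "degrades"
            # States only escalate: unaffected -> degrades -> fails.
            if caller not in state or (state[caller] == "degrades" and new == "fails"):
                state[caller] = new
                changed = True
    return state
```

With this data, a catalog failure takes recommendations down but only degrades web, while a postgres failure cascades through auth and checkout to a full web outage. The simulation sees only edges that are in the map, so it will not find something like the queue-backup case unless that coupling has been recorded as a dependency.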

Keep the map alive

The fatal flaw of dependency maps is that they rot. The architecture changes weekly; a diagram drawn once is wrong within a month. A few ways to keep it honest:

Generate it from real signals where you can — distributed traces and service-mesh telemetry show the actual call graph, not the one you remember.
Make “update the dependency entry” part of the definition of done for new services.
Pull the map out during every postmortem and check it against what actually happened. Incidents are the best dependency-discovery tool you have.
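
Checking the map against real signals can start as a simple diff. This sketch assumes you can export caller and callee pairs from your tracing or mesh telemetry. The tuple format, the sample data, and the flags service are all hypothetical.

```python
# Hypothetical (caller, callee) pairs flattened from trace spans.
observed_calls = [
    ("web", "checkout"),
    ("checkout", "auth"),
    ("checkout", "flags"),
    ("web", "checkout"),
]

# Hypothetical declared inventory: service -> documented dependencies.
declared = {
    "web": {"checkout", "recommendations"},
    "checkout": {"payments", "inventory", "auth"},
}


def missing(a, b):
    """For each service in a, list dependencies present in a but absent from b."""
    out = {}
    for service, deps in a.items():
        extra = sorted(deps - b.get(service, set()))
        if extra:
            out[service] = extra
    return out


def diff_map(declared, observed_calls):
    """Compare the documented map with the call graph seen in telemetry."""
    observed = {}
    for caller, callee in observed_calls:
        observed.setdefault(caller, set()).add(callee)
    undeclared = missing(observed, declared)  # real calls the map omits
    unobserved = missing(declared, observed)  # documented edges never seen
    return undeclared, unobserved


undeclared, unobserved = diff_map(declared, observed_calls)
```

Here undeclared flags that checkout calls flags without the map knowing about it, which is exactly the kind of hidden dependency worth classifying as hard or soft. Entries in unobserved are weaker evidence, since a short trace sample may simply have missed a rarely used path.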

Why this pays off

The teams that recover fast aren’t the ones with the cleverest engineers. They’re the ones who, when the pager fires, already know the shape of the system. Blast-radius mapping turns “let’s find out what this affects” into “we already know, here’s the plan.”

If you want help turning configs and traces into a reviewed dependency map and failure walkthrough, that’s part of what the AI Incident Response Assistant is built to do.

Generated dependency maps are assistive, not authoritative. Validate every hard/soft classification against real failure behavior before relying on it during an incident.