Scoping Incidents Faster With AI Blast-Radius Mapping

The page said auth-service was throwing 5xx. The first question in the channel wasn’t “why” — it was “wait, what does that take down?” And nobody had a clean answer. Someone thought checkout used auth synchronously; someone else thought there was a cache in front of it; a third person wasn’t sure if the mobile API path was affected at all. We spent eight minutes assembling a picture of who was hurting before we could even decide how big a response this needed. Scoping is the step between detection and action, and when it’s fuzzy, everything downstream waits.

Blast-radius mapping — what fails when X fails, what’s merely downstream, what’s insulated — is reasoning over a dependency graph. If you can hand a model that graph, it can produce the tiered impact map far faster than five people reconstructing it from memory. The catch, as always, is keeping the model on mapping and off diagnosing.

Fuzzy scope makes you over- or under-react

Bad scoping fails in both directions. Under-scope and you treat a platform-wide outage like a single-service blip, paging too few people and missing victims two hops downstream. Over-scope and you declare half the company down when fallbacks are quietly holding, mobilizing a war room for what a circuit breaker already contained. Either way you’ve sized the response wrong, and resizing it mid-incident costs time. Getting scope right early is the same leverage point the whole MTTR funnel keeps returning to: structure beats heroics.

A dependency graph plus a failure mode is exactly the input an LLM can reason over. Ask it to sort dependents into direct, indirect, and insulated tiers and you get a scope map in seconds — one you then confirm rather than assemble.
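
A deterministic pass over the same graph gives you a baseline to compare the model's map against. What follows is a minimal sketch, not anyone's production tooling: it assumes each edge is annotated with a call mode and a fallback flag, and the `Edge` fields, the `mode` labels, and the example edges are illustrative assumptions rather than a standard schema.

```python
from collections import deque
from dataclasses import dataclass

# Lower number = more severe. A service keeps the most severe tier found for it.
SEVERITY = {"direct": 0, "indirect": 1, "insulated": 2}


@dataclass(frozen=True)
class Edge:
    caller: str             # service making the call
    callee: str             # service being called
    mode: str               # "sync", "async", or "cached" (assumed labels)
    fallback: bool = False  # True if the caller has a working fallback


def tier_dependents(edges, failing):
    """Walk callers of `failing` and sort them into direct/indirect/insulated."""
    callers_of = {}
    for edge in edges:
        callers_of.setdefault(edge.callee, []).append(edge)

    tiers = {}
    queue = deque([(failing, None)])  # (service, tier already assigned to it)
    while queue:
        service, upstream_tier = queue.popleft()
        for edge in callers_of.get(service, []):
            if edge.caller == failing:
                continue
            if upstream_tier == "insulated" or edge.mode != "sync" or edge.fallback:
                tier = "insulated"
            elif upstream_tier is None:
                tier = "direct"    # synchronous caller of the failing service
            else:
                tier = "indirect"  # synchronous caller of an affected service
            known = tiers.get(edge.caller)
            if known is None or SEVERITY[tier] < SEVERITY[known]:
                tiers[edge.caller] = tier
                queue.append((edge.caller, tier))
    return tiers


EDGES = [
    Edge("checkout", "auth-service", "sync"),
    Edge("admin-console", "auth-service", "sync"),
    Edge("order-history", "checkout", "sync"),
    Edge("product-catalog", "auth-service", "cached"),
]
tiers = tier_dependents(EDGES, "auth-service")
```

A service with no path to the failing component never appears in the result, and every "insulated" entry is only as good as the edge annotation behind it, so the output is still a set of predictions to confirm.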

Ask for tiers, not a verdict

The framing keeps the model on impact, never cause.

You are scoping an incident, not diagnosing it. The failing component is auth-service (returning 5xx). Here is the dependency graph. Sort dependents into three tiers: direct (synchronous callers that fail directly), indirect (downstream via timeouts/retries/backpressure — mark “likely, unconfirmed”), and insulated (async, cached, or with a working fallback — state the assumption that makes each safe). For each, predict the user-facing symptom and a read-only check to confirm which tier it’s really in. Do not name a root cause. End with a one-line scope statement the IC can paste.
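
If you would rather fill that prompt from data than retype it mid-incident, a small template function is enough. This is a sketch under assumptions: `build_scoping_prompt` and the `(caller, callee, note)` tuple shape are hypothetical names for illustration, and the wording is the prompt above with only the component, symptom, and graph swapped in.

```python
def build_scoping_prompt(failing, symptom, edges):
    """Render the scoping prompt for one failing component.

    edges: iterable of (caller, callee, note) tuples. The tuple shape is an
    assumption for this sketch, not a standard graph format.
    """
    graph = "\n".join(
        f"- {caller} -> {callee} ({note})" for caller, callee, note in edges
    )
    return "\n".join([
        "You are scoping an incident, not diagnosing it.",
        f"The failing component is {failing} ({symptom}).",
        "Here is the dependency graph (caller -> callee):",
        graph,
        "Sort dependents into three tiers: direct (synchronous callers that fail "
        "directly), indirect (downstream via timeouts/retries/backpressure; mark "
        '"likely, unconfirmed"), and insulated (async, cached, or with a working '
        "fallback; state the assumption that makes each safe).",
        "For each, predict the user-facing symptom and a read-only check to "
        "confirm which tier it's really in.",
        "Do not name a root cause.",
        "End with a one-line scope statement the IC can paste.",
    ])


prompt = build_scoping_prompt(
    "auth-service",
    "returning 5xx",
    [
        ("checkout", "auth-service", "sync"),
        ("product-catalog", "auth-service", "cached, 5 min TTL"),
    ],
)
```

Keeping the "do not name a root cause" line in the template means nobody has to remember it under pressure.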

The output is a map the IC can act on:

Direct: checkout (login fails → can’t pay), admin-console (locked out)
Indirect (likely, unconfirmed): order-history (calls checkout; will time out), notifications (queue backpressure)
Insulated (assumption): product-catalog (auth cached 5 min; confirm cache TTL), mobile-read API (anonymous path, no auth)
Scope: auth failure; checkout + admin confirmed in scope, order-history likely; catalog and mobile-read being ruled out pending checks.
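
The scope line at the end follows a fixed shape, so it can also be assembled from the tiers rather than written by hand. A rough sketch, assuming you already hold the three tiers as lists; `scope_statement` is a hypothetical helper and the phrasing simply mirrors the example above.

```python
def scope_statement(failure, direct, indirect, insulated):
    """Build a one-line scope statement that keeps confidence labels visible."""
    in_scope = []
    if direct:
        in_scope.append(" + ".join(direct) + " confirmed in scope")
    if indirect:
        in_scope.append(", ".join(indirect) + " likely")

    clauses = [failure]
    if in_scope:
        clauses.append(", ".join(in_scope))
    if insulated:
        # Insulated services are "being ruled out", never "safe", until checked.
        clauses.append(" and ".join(insulated) + " being ruled out pending checks")
    return "Scope: " + "; ".join(clauses) + "."


line = scope_statement(
    "auth failure",
    direct=["checkout", "admin"],
    indirect=["order-history"],
    insulated=["catalog", "mobile-read"],
)
```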

Eight minutes of assembly becomes a ten-second read plus a few confirming checks.

Confirm the tier with read-only checks

The map is a set of predictions. Confirm the boundaries that matter:

# Is the "insulated" catalog actually serving from cache, or hitting auth?
kubectl exec -n catalog deploy/catalog -- \
  sh -c 'curl -s localhost:9090/metrics | grep auth_cache_hit_ratio'

# Is order-history actually timing out (indirect) or fine?
# -G with --data-urlencode stops curl from globbing the {} and [] in the PromQL.
curl -sG "http://prom:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{service="order-history",code=~"5.."}[1m]))' \
  | jq -r '.data.result[].value[1]'

In the auth incident, the catalog check came back clean — cache hit ratio at 99%, genuinely insulated — which let the IC shrink the public-facing scope and avoid a status-page post that would have alarmed users who weren’t actually affected.
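
Reading the result of a check like the order-history query deserves the same care as running it, because an empty answer is not the same as a clean one. A minimal sketch of that interpretation step, assuming the standard Prometheus instant-query JSON shape; the 0.5 errors-per-second threshold and both function names are illustrative assumptions, so tune them to your own traffic.

```python
import json


def error_rate_from_response(body):
    """Extract the first sample value from a Prometheus instant-query response.

    Returns None when no series matched, which is different from a rate of 0.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    results = payload["data"]["result"]
    if not results:
        return None
    # Each sample is [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][1])


def classify(rate, threshold=0.5):
    """Turn a 5xx rate (errors/second) into a tier confirmation."""
    if rate is None:
        return "no data: check the label selector before ruling out"
    return "affected" if rate > threshold else "not affected"


sample = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1700000000.0,"3.2"]}]}}'
)
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'

verdict = classify(error_rate_from_response(sample))
no_data = classify(error_rate_from_response(empty))
```

Treating "no series matched" as its own outcome matters here: a mistyped service label would otherwise look exactly like a healthy service.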

Map impact, but verify before you rule out

The discipline: a blast-radius map is a set of predictions, and it can mislead in two opposite directions. It can over-scope by traversing the graph to a worst case that fallbacks prevent, which is why every predicted impact carries a confidence label. And it can under-scope by trusting the documented graph when an undocumented caller or a shared cache puts a service the model marked “insulated” inside the blast radius.

Rules I hold to:

Treat “insulated” as “confirm before ruling out,” not “safe.” The most dangerous miss is a real victim hiding in the safe tier because the graph was stale.
Keep confidence labels visible in the channel. “Likely, unconfirmed” is honest and prevents the indirect tier from being announced as fact.
Re-map when a check surprises you. A service that’s affected when the graph said it shouldn’t be means the graph is wrong — regenerate, don’t patch.
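
Those three rules can be expressed as a small reconciliation step between what the map predicted and what the checks found. This is a sketch under assumptions, not a prescribed tool: `reconcile`, its label strings, and the `search` service in the example are all hypothetical.

```python
def reconcile(predicted, observed):
    """Compare predicted tiers with check results.

    predicted: service -> "direct" | "indirect" | "insulated"
    observed:  service -> True if a read-only check showed impact
               (only services that were actually checked appear here)
    Returns (report, needs_remap).
    """
    report = {}
    needs_remap = False
    for service, tier in predicted.items():
        if service not in observed:
            # Unchecked insulated services are not "safe" yet.
            label = "confirm before ruling out" if tier == "insulated" else "unconfirmed"
            report[service] = f"{tier} ({label})"
        elif observed[service] and tier == "insulated":
            report[service] = "affected but mapped insulated: graph is stale"
            needs_remap = True
        elif observed[service]:
            report[service] = f"{tier} (confirmed)"
        elif tier == "insulated":
            report[service] = "insulated (confirmed)"
        else:
            report[service] = f"{tier} predicted, check came back clean"

    # A victim that is not on the map at all also means the graph is wrong.
    for service, affected in observed.items():
        if affected and service not in predicted:
            report[service] = "affected but not on the map: graph is stale"
            needs_remap = True
    return report, needs_remap


predicted = {
    "checkout": "direct",
    "order-history": "indirect",
    "product-catalog": "insulated",
    "mobile-read": "insulated",
}
observed = {"checkout": True, "product-catalog": False, "search": True}
report, needs_remap = reconcile(predicted, observed)
```

When `needs_remap` comes back true, the move is to regenerate the map from a corrected graph rather than hand-edit one entry.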

You can try this on the free incident assistant — paste a dependency list and a failing component and ask for the tiered map, then notice how having “insulated” separated from “direct” changes how confidently you can state scope. The prompt library has a hardened blast-radius prompt with the confidence-label and confirm-check framing built in.

You can’t fix what you haven’t scoped, and the minutes spent guessing who’s affected are pure MTTR you can recover. AI turns your dependency graph into a tiered impact map in seconds — and as long as every tier is a prediction you confirm rather than a verdict you trust, you get fast, honest scope instead of a confident worst case.
