Dependency Mapping: A Service Catalog for Incident Response
When a service goes down at 3am, the first question is 'what else does this take with it?' A dependency map answers it before you have to guess.
The worst incidents I’ve worked weren’t the ones where a service broke. They were the ones where a service broke and nobody in the room knew what else depended on it. The fix would have been obvious if anyone could answer “what does this service feed, and what does it call?” Instead we spent the first thirty minutes drawing the architecture on a whiteboard from collective memory, during the outage, while customers waited.
A dependency map turns that thirty-minute archaeology dig into a lookup. It’s one of the highest-return pieces of incident-response infrastructure you can build, and most teams never build it because it feels like documentation rather than tooling. It’s tooling.
What a dependency map is for
During an incident, a dependency map answers three questions fast:
- Blast radius: “Service X is down — what else fails or degrades because of it?” (its dependents)
- Suspect list: “Service X is misbehaving — what does it depend on that might be the real cause?” (its dependencies)
- Ownership: “Who do I page for the thing two hops upstream?”
Without the map, you answer these by waking people up and asking. With it, you answer them in seconds and page exactly the right person.
The minimum useful model
You don’t need a perfect graph of every call. You need a useful one. For each service, capture:
- Name and owning team — and how to page them.
- Upstream dependencies — what it calls to do its job, and whether each is hard (it fails without it) or soft (it degrades gracefully).
- Downstream dependents — what calls it (often discoverable by inverting everyone else’s upstream list).
- Criticality tier — is this on a critical user path, or peripheral?
The hard-versus-soft distinction is the most valuable field. A hard dependency means a failure propagates; a soft one means it’s contained behind a fallback or timeout. During an incident, that single attribute tells you whether the blast radius is going to spread or stop.
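As a rough illustration, the fields above fit in a small record, and the dependents list can be derived rather than typed by hand. This is a minimal sketch, not a prescribed schema: the `Service` class, the `page` string format, and `invert_dependents` are names made up for this example, and `checkout-web` having a hard dependency on `payments-api` is assumed for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    owner: str   # owning team
    page: str    # how to page them (hypothetical format)
    tier: str    # e.g. "critical" or "peripheral"
    hard_deps: list[str] = field(default_factory=list)  # fails without these
    soft_deps: list[str] = field(default_factory=list)  # degrades gracefully


def invert_dependents(catalog: dict[str, Service]) -> dict[str, set[str]]:
    """Derive 'what calls me' by inverting every service's upstream lists."""
    dependents: dict[str, set[str]] = {}
    for svc in catalog.values():
        for dep in svc.hard_deps + svc.soft_deps:
            dependents.setdefault(dep, set()).add(svc.name)
    return dependents


# Illustrative entries only; the edges here are assumptions for the example.
catalog = {
    "payments-api": Service(
        "payments-api", "payments-team", "pager:payments", "critical",
        hard_deps=["postgres-primary", "fraud-service"],
        soft_deps=["receipts-service"],
    ),
    "checkout-web": Service(
        "checkout-web", "web-team", "pager:web", "critical",
        hard_deps=["payments-api"],
    ),
}
dependents = invert_dependents(catalog)
```

Keeping hard and soft dependencies as separate lists means the most valuable field is never optional: you can't add an edge without deciding which kind it is.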
A service catalog entry template
Keep entries boringly consistent so they’re scannable at 3am:
Service: payments-api
Owner / page: payments-team / [escalation link]
Tier: critical (checkout path)
Hard upstream deps: postgres-primary, fraud-service
Soft upstream deps: receipts-service (degrades to async), analytics-sink (fire-and-forget)
Known dependents: checkout-web, mobile-api, subscription-billing
Failure mode notes: returns 503 if fraud-service is unreachable; queues writes if postgres-primary is down (up to 5 min)
That “failure mode notes” line is gold during an incident — it tells you what symptom to expect when a dependency fails, which short-circuits diagnosis.
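Consistency is easier to keep if something checks it. Here is one possible lint, assuming entries are stored as dictionaries keyed by the template's fields. The key names in `REQUIRED_FIELDS` and the `missing_fields` helper are hypothetical, chosen for this sketch rather than taken from any particular catalog tool.

```python
# Hypothetical key names mirroring the template fields above.
REQUIRED_FIELDS = (
    "service",
    "owner",
    "tier",
    "hard_upstream_deps",
    "soft_upstream_deps",
    "known_dependents",
    "failure_mode_notes",
)


def missing_fields(entry: dict) -> list[str]:
    """Return the required fields that are absent or blank in a catalog entry.

    An empty list is a valid value (a leaf service has no dependents),
    so only a missing key, None, or an empty string counts as a problem.
    """
    return [
        name for name in REQUIRED_FIELDS
        if name not in entry or entry[name] is None or entry[name] == ""
    ]


entry = {
    "service": "payments-api",
    "owner": "payments-team",
    "tier": "critical",
    "hard_upstream_deps": ["postgres-primary", "fraud-service"],
    "soft_upstream_deps": ["receipts-service", "analytics-sink"],
    "known_dependents": ["checkout-web", "mobile-api", "subscription-billing"],
    "failure_mode_notes": "",  # left blank on purpose to show the check firing
}
problems = missing_fields(entry)
```

Run as a check on every catalog change, something like this would stop an entry from landing without its failure mode notes.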
How to build it without boiling the ocean
Don’t try to map everything at once. Two practical approaches, best combined:
- Top-down, by criticality. Start with your most critical user journeys (checkout, login, the core write path) and map only the services on those paths. The critical paths are where incidents hurt most and where the map pays off first, so a small first pass already covers the part of the system you most need mapped.
- Bottom-up, from real signals. Service meshes, distributed tracing, and API gateways already know who calls whom. Mine traces, mesh telemetry, or even network flow logs to discover real dependencies — including the ones nobody remembered to document. Observed dependencies beat documented ones, because they reflect what’s actually happening.
The combination gives you a map that’s both prioritized and accurate: criticality tells you what to map first, telemetry tells you the truth about the edges.
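To make the bottom-up approach concrete, here is a sketch of deriving edges from trace data. It assumes spans have already been exported as plain dictionaries with `span_id`, `parent_id`, and `service` keys. Real tracing backends use their own field names and export formats, so treat this shape as a stand-in.

```python
from collections import Counter


def edges_from_spans(spans: list[dict]) -> Counter:
    """Count observed caller -> callee edges from parent/child span pairs."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges: Counter = Counter()
    for span in spans:
        caller = service_of.get(span.get("parent_id"))
        callee = span["service"]
        # Skip root spans (no known parent) and calls within one service.
        if caller is not None and caller != callee:
            edges[(caller, callee)] += 1
    return edges


# Hypothetical spans from a single checkout request.
spans = [
    {"span_id": "1", "parent_id": None, "service": "checkout-web"},
    {"span_id": "2", "parent_id": "1", "service": "payments-api"},
    {"span_id": "3", "parent_id": "2", "service": "payments-api"},
    {"span_id": "4", "parent_id": "3", "service": "fraud-service"},
]
observed = edges_from_spans(spans)
```

The counts are worth keeping: an edge seen once a week is a different kind of dependency than one on every request. Note that traces show that a call happens, not whether it is hard or soft; that still has to come from the owning team or from a failure test.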
Keep it from going stale
A dependency map is only trusted if it’s current, and architecture drifts constantly. Stale maps are worse than none, because people trust them and get burned.
- Derive, don’t hand-maintain, where possible. If your mesh or tracing can regenerate the edge list, schedule it. A nightly job that diffs observed dependencies against the documented catalog catches drift automatically.
- Make updating it part of the change. Adding a new dependency should update the catalog in the same change, the way you’d update a runbook. Tie it to your service-creation or deploy process.
- Validate it in gamedays. When you run a chaos experiment that kills a dependency, check whether the actual blast radius matched the map. Mismatches are bugs in your map; fix them.
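The nightly diff can be as small as two set differences. This sketch assumes both the documented catalog and the observed telemetry have been reduced to sets of `(caller, callee)` pairs; `diff_edges` and the `legacy-ledger` service are hypothetical, used only to show both kinds of drift.

```python
Edge = tuple[str, str]  # (caller, callee)


def diff_edges(documented: set[Edge], observed: set[Edge]) -> dict[str, list[Edge]]:
    """Compare the documented catalog against edges seen in telemetry."""
    return {
        # Seen in telemetry but missing from the catalog: the map is incomplete.
        "undocumented": sorted(observed - documented),
        # In the catalog but not seen: possibly stale, possibly just rarely called.
        "unobserved": sorted(documented - observed),
    }


documented = {
    ("payments-api", "postgres-primary"),
    ("payments-api", "fraud-service"),
    ("payments-api", "legacy-ledger"),   # hypothetical edge nobody removed
}
observed = {
    ("payments-api", "postgres-primary"),
    ("payments-api", "fraud-service"),
    ("payments-api", "analytics-sink"),  # real call nobody documented
}
drift = diff_edges(documented, observed)
```

The two lists deserve different handling. An undocumented edge is almost always a real gap. An unobserved one needs a human look before deletion, since a dependency used only in a monthly batch job won't show up in one night of telemetry.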
Using it live
When the page fires, the map drives the first moves:
- Look up the failing service, read its dependents — that’s your blast radius and your comms scope. The comms lead now knows exactly which customer journeys to mention.
- Read its hard upstream deps: that’s your initial suspect list for “what’s the actual cause,” since a failure often originates in something the service depends on rather than in the service itself.
- The criticality tiers tell you severity fast: a critical-path service with many dependents is a different incident than a leaf node nobody depends on.
This is also where blast-radius thinking and dependency mapping meet: the map is the raw data that makes blast-radius estimation a lookup instead of a guess.
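Here is roughly what that lookup could look like, as a sketch. It walks hard edges only, on the reasoning that soft dependencies are contained behind a fallback. The `HARD_DEPS` table is illustrative: treating `checkout-web`, `mobile-api`, and `subscription-billing` as hard dependents of `payments-api` is an assumption made for the example.

```python
from collections import deque

# service -> the services it hard-depends on (illustrative data).
HARD_DEPS = {
    "checkout-web": ["payments-api"],
    "mobile-api": ["payments-api"],
    "subscription-billing": ["payments-api"],
    "payments-api": ["postgres-primary", "fraud-service"],
}


def blast_radius(failed: str, hard_deps: dict[str, list[str]]) -> set[str]:
    """Every service that transitively hard-depends on the failed one."""
    # Invert the upstream lists to get direct hard dependents.
    dependents: dict[str, set[str]] = {}
    for svc, deps in hard_deps.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(svc)

    # Breadth-first walk outward from the failed service.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, ()):
            if svc != failed and svc not in affected:
                affected.add(svc)
                queue.append(svc)
    return affected


radius = blast_radius("fraud-service", HARD_DEPS)
suspects = HARD_DEPS.get("payments-api", [])  # suspect list: direct hard upstream deps
```

With this data, a `fraud-service` outage reaches `payments-api` and, through it, all three of its dependents, which is exactly the list the comms lead needs.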
The payoff compounds
A dependency map isn’t just an incident tool. It informs gameday design (kill the dependencies that propagate), capacity planning (the most-depended-on services need the most headroom), and architecture review (a service with twenty hard dependents is a single point of failure you should know about). But the incident value alone justifies it: every minute you don’t spend whiteboarding the architecture mid-outage is a minute spent fixing it.
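The architecture-review use is a simple count over the same data. A minimal sketch, assuming the map is available as a service-to-hard-dependencies table; the function names and the threshold are arbitrary choices for illustration, and the edges are the same assumed ones used for `payments-api` earlier.

```python
from collections import Counter


def hard_dependent_counts(hard_deps: dict[str, list[str]]) -> Counter:
    """Count how many services directly hard-depend on each service."""
    counts: Counter = Counter()
    for deps in hard_deps.values():
        for dep in set(deps):  # ignore duplicate entries within one list
            counts[dep] += 1
    return counts


def single_points_of_failure(hard_deps: dict[str, list[str]], threshold: int) -> list[str]:
    """Services with at least `threshold` direct hard dependents, most-depended-on first."""
    counts = hard_dependent_counts(hard_deps)
    flagged = [name for name, n in counts.items() if n >= threshold]
    return sorted(flagged, key=lambda name: (-counts[name], name))


# Illustrative data; the edges are assumptions for the example.
hard_deps = {
    "checkout-web": ["payments-api"],
    "mobile-api": ["payments-api"],
    "subscription-billing": ["payments-api"],
    "payments-api": ["postgres-primary", "fraud-service"],
}
flagged = single_points_of_failure(hard_deps, threshold=3)
```

The same counts are a reasonable first input for capacity planning, since the services at the top of the list are the ones whose headroom matters most.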
We keep dependency-mapping and service-catalog templates in our incident-response toolkit — start with your critical paths, derive the edges from real telemetry, and keep the map honest with a nightly diff. The first time it turns a thirty-minute “what depends on this?” scramble into a five-second lookup, it has paid for itself.
Dependency models are guidance, not ground truth. Always validate a map against live telemetry and real failure behavior before relying on it during an incident.