Observability for Incidents: The Signals You Need Before 3am
Dashboards built for demos are useless during an outage. Here's how to instrument for the questions you'll actually ask at 3am, not the ones that look good.
- #incident-response
- #observability
- #metrics
- #tracing
- #logging
- #sre
There’s a particular kind of despair that hits at 3am when you open the dashboard during an outage and realize it can’t answer your question. It has forty beautiful panels, none of which tell you which customers are failing or what changed. It was built to look impressive in a demo, not to be useful when something’s on fire. Observability built for show is worse than useless during an incident, because it gives you the false comfort of “we have monitoring” while you fly blind.
Good incident observability isn’t about more data. It’s about instrumenting for the specific questions you’ll ask under pressure. Let me walk through what those questions are and how to be ready for them.
The questions you actually ask at 3am
Every incident, you ask roughly the same sequence:
- Is it real, and how bad? Are users actually affected, and how many?
- What’s the symptom? Errors? Latency? Where, exactly?
- What changed? Deploys, config, infra events, traffic shifts.
- Where in the stack? Which service, which dependency, which layer?
- Is my fix working? Did the thing I just did actually help?
Your observability stack exists to answer these five, fast. If a signal doesn’t help answer one of them during an incident, it’s dashboard decoration. Build for the questions, not for the demo.
The three signals, used the incident way
The classic three — metrics, logs, traces — each have a specific job during an incident:
Metrics answer “is it real and how bad.” Lead with the symptom metrics tied to your SLOs: error rate, latency percentiles (p50/p95/p99 — averages lie), and throughput, sliced by the dimensions that matter (endpoint, region, customer tier). Metrics are cheap to query and give you the shape of the problem in seconds.
Traces answer “where in the stack.” When a request is slow or failing, a distributed trace shows you which service or dependency ate the time. This is the signal that collapses “it’s slow somewhere” into “it’s the payments call timing out.” If you’ve done dependency mapping, traces are how you confirm which edge of the map is actually broken.
Logs answer “what exactly happened.” Once metrics and traces point you at a service, structured logs give you the specific error, the stack trace, the offending request. Logs are for the close-up, not the overview — don’t start an incident by grepping logs, you’ll drown.
The flow is metrics → traces → logs: zoom from “how bad” to “where” to “what.” Teams that start at the bottom of the funnel, in the logs, tend to burn the first ten minutes of the incident before they know where to look.
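To make the “where in the stack” step concrete, here is a minimal sketch of what a tracing UI does when it highlights the span that ate the time. The `Span` shape, the service names, and the timings are all hypothetical, and the self-time calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    start_ms: float
    end_ms: float


def self_times(spans):
    """Map span_id -> self time: the span's duration minus its direct children's durations."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + (s.end_ms - s.start_ms)
    return {
        s.span_id: max(0.0, (s.end_ms - s.start_ms) - child_total.get(s.span_id, 0.0))
        for s in spans
    }


def slowest_span(spans):
    """Return the span with the largest self time, i.e. where the time was actually spent."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


# Hypothetical slow checkout request (all names and numbers are made up).
trace = [
    Span("a", None, "api-gateway", "GET /checkout", 0, 2100),
    Span("b", "a", "checkout", "create_order", 10, 2090),
    Span("c", "b", "inventory", "reserve", 20, 60),
    Span("d", "b", "payments", "charge", 70, 2070),
]

culprit = slowest_span(trace)
print(f"{culprit.service}.{culprit.name} self time: {self_times(trace)[culprit.span_id]:.0f} ms")
```

Ranking by self time rather than total duration matters: the gateway span is the longest in the trace, but almost all of its time is spent waiting on the payments call underneath it.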
Build incident dashboards, not vanity dashboards
Separate your dashboards by purpose. The dashboard you stare at during an incident should be ruthlessly focused:
- The top row is symptoms — error rate, p99 latency, success rate, for your critical paths. This answers “is it real.”
- The second row is the obvious suspects — saturation (CPU, memory, connection pools), dependency health, queue depth.
- A change overlay — deploy markers and config-change annotations on the graphs, so a spike visually lines up with what caused it. This single feature answers question 3 (“what changed”) faster than anything else.
- Slice-ability — the ability to break any panel down by region, endpoint, or customer tier without writing a query, because “30% of EU users” is the scope statement you need for customer comms.
Everything else — the capacity-planning graphs, the business metrics, the weekly trends — belongs on other dashboards. The incident dashboard is for incidents.
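If your graphing tool can't draw a change overlay yet, the same correlation can be approximated by hand. The sketch below assumes you can export deploys and config changes as timestamped records; the record shape, the change descriptions, and the 30-minute window are illustrative choices, not a standard.

```python
from datetime import datetime, timedelta, timezone


def changes_before(spike_start, changes, window=timedelta(minutes=30)):
    """Return changes that landed in the `window` before the spike, most recent first."""
    candidates = [
        c for c in changes
        if timedelta(0) <= spike_start - c["at"] <= window
    ]
    return sorted(candidates, key=lambda c: c["at"], reverse=True)


# Hypothetical change log; in practice this would come from your CI/CD and config systems.
utc = timezone.utc
changes = [
    {"at": datetime(2024, 5, 1, 23, 40, tzinfo=utc), "what": "deploy search v212"},
    {"at": datetime(2024, 5, 2, 1, 58, tzinfo=utc), "what": "deploy checkout v87"},
    {"at": datetime(2024, 5, 2, 2, 5, tzinfo=utc), "what": "config: payments timeout 5s -> 1s"},
    {"at": datetime(2024, 5, 2, 2, 20, tzinfo=utc), "what": "deploy api-gateway v301"},
]
spike_start = datetime(2024, 5, 2, 2, 9, tzinfo=utc)

for change in changes_before(spike_start, changes):
    minutes = int((spike_start - change["at"]).total_seconds() // 60)
    print(f"{minutes:>3} min before spike: {change['what']}")
```

Changes that landed after the spike started are excluded on purpose, since they can't have caused it. Proximity in time is a lead to investigate, not proof of cause.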
The signals teams most often lack
When I audit observability for incident-readiness, the same gaps recur:
- No deploy/change annotations on graphs. You can see the spike but not what caused it, so you waste time correlating by hand.
- Averages instead of percentiles. A healthy-looking average latency hides the p99 that’s hurting the slowest 1% of requests.
- No per-customer or per-tier slicing. You know “errors are up” but not whose errors, so you can’t scope impact or comms.
- Unstructured logs. If you can’t filter logs by request ID, trace ID, or error type, they’re a haystack at the worst possible time.
- No high-cardinality dimensions. The ability to ask “show me errors for customer X on endpoint Y in region Z” is what turns a 40-minute investigation into a 4-minute one.
Fixing these is higher-leverage than adding any new tool.
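Two of these gaps, averages and missing per-tier slicing, are easy to see on a toy dataset. The sketch below uses made-up request records and a simple nearest-rank percentile; real metrics backends usually estimate percentiles from histograms, so treat this as an illustration of the idea rather than how your monitoring system computes them.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by(requests, dimension):
    """Group latencies by one dimension (e.g. tier, region) and return the p99 per group."""
    groups = defaultdict(list)
    for r in requests:
        groups[r[dimension]].append(r["latency_ms"])
    return {key: percentile(vals, 99) for key, vals in groups.items()}


# Hypothetical traffic: 100 requests, 3 of them very slow, all on one customer tier.
requests = (
    [{"tier": "free", "latency_ms": 80}] * 60
    + [{"tier": "enterprise", "latency_ms": 80}] * 37
    + [{"tier": "enterprise", "latency_ms": 4000}] * 3
)
latencies = [r["latency_ms"] for r in requests]

print("mean:", mean(latencies))
print("p50: ", percentile(latencies, 50))
print("p99: ", percentile(latencies, 99))
print("p99 by tier:", p99_by(requests, "tier"))
```

In this made-up data the mean sits just under 200 ms and the median at 80 ms, while the p99 is 4000 ms, and slicing by tier shows the slow requests are confined to one tier. That last fact is the scope statement the unsliced numbers can't give you.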
Where AI helps
Incident observability is mostly reasoning over signals and config — good AI territory, no production access needed.
Auditing readiness. Paste your dashboard config and alert rules and ask:
“Here are my dashboards and alert rules. For each of these incident questions — is it real / what’s the symptom / what changed / where in the stack / is my fix working — tell me whether I can answer it quickly with what I have, and what’s missing. Flag any panel using averages where percentiles would be better, and any critical path with no symptom-based signal.”
Correlating during the incident. This is where it shines — paste the metric values, recent deploys, and a trace or log slice and ask it to correlate:
“Latency spike started 02:09 UTC. Here are the p99 values, the deploy log for the last 6 hours, and a representative slow trace. What changed closest to the spike, which span in the trace is eating the time, and what’s the mechanism connecting them?”
The model is good at lining up a timeline across signals and noticing the change three layers down you’d find twenty minutes later. Give it your real metric and service names so it doesn’t invent them. We keep observability and correlation prompts for exactly this.
One guardrail, same as always: AI reads and reasons over the signals; it never runs the query or the fix. And scrub customer data and secrets out of any logs or traces before they go into a prompt.
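As a starting point for the scrubbing step, here is a minimal redaction sketch. The patterns are illustrative assumptions and deliberately narrow (email addresses, bearer tokens, and a few `key=value` secrets); they will not catch everything, so extend them for the identifiers your own logs contain and review the output before pasting it anywhere.

```python
import re

# Illustrative patterns only; real scrubbing needs patterns matched to your own data.
PATTERNS = [
    # Email addresses.
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<email>"),
    # Bearer tokens, e.g. in a logged Authorization header.
    (re.compile(r"(?i)\bbearer\s+\S+"), "Bearer <redacted>"),
    # key=value or key: value pairs whose key name suggests a secret.
    (re.compile(r"(?i)\b(api[_-]?key|password|secret|token)\b(\s*[:=]\s*)(\S+)"), r"\1\2<redacted>"),
]


def scrub(text):
    """Apply every redaction pattern in order and return the scrubbed text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical log lines.
lines = [
    'user=jane@example.com password=hunter2 msg="charge failed"',
    "Authorization: Bearer eyJhbGciOi.abc123",
]
for line in lines:
    print(scrub(line))
```

Redacting by pattern is a safety net, not a guarantee. Keeping secrets and raw customer data out of logs at the source is the more reliable fix.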
Instrument before you need it
The brutal truth about incident observability: you can’t add it during the incident. The signal you wish you had at 3am has to exist before the pager goes off — the deploy annotations, the percentile metrics, the high-cardinality dimensions, the request IDs in your logs. The best time to find a gap is during a postmortem (“we couldn’t tell who was affected”) and the action item is “add the slice we needed.”
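Request IDs in logs are among the cheapest of these to add ahead of time. Below is a minimal sketch using Python's standard `logging` module that emits one JSON object per line; the field names (`request_id`, `trace_id`, `error_type`, `customer_tier`) and the `build_logger` helper are assumptions for illustration, not a required schema.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be filtered by field."""

    # Hypothetical context fields; pick the ones you filter by during incidents.
    FIELDS = ("request_id", "trace_id", "error_type", "customer_tier")

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the log record.
        for field in self.FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout", sys.stderr)
log.error(
    "charge failed",
    extra={"request_id": "req-123", "trace_id": "trace-abc", "error_type": "Timeout"},
)
```

With every line carrying the same IDs as the trace, the jump from a slow span to the exact log entries for that request becomes a filter instead of a grep.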
Build for the five questions, lead with symptoms, zoom metrics → traces → logs, and put your changes on the graphs. Do that and the 3am dashboard answers you instead of mocking you.
If you want help auditing your observability for incident-readiness and correlating signals during an active incident, that’s part of what the AI Incident Response Assistant is built to do.
Generated audits and correlations are assistive, not authoritative. Always verify findings against your real systems before acting on them.