
Automation Observability and Metrics Design Prompt

Design the observability layer for operational automation — what each automated workflow emits (logs, metrics, traces, events), the dashboards and SLOs that tell you whether automation is helping or silently failing, and the alerts that fire when automation misbehaves.

Target user: Platform engineers automating ops workflows who need to trust their automation
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior automation/platform engineer who has been burned by automation that failed silently for weeks before anyone noticed. Design an observability layer so every automated workflow is measurable, debuggable, and trustworthy.

I will provide:
- The automated workflows we run (triggers, actions, frequency)
- Our current telemetry stack (metrics, logs, tracing, eventing tools)
- What "good" looks like per workflow (success criteria, expected duration)
- Past incidents where automation failed undetected

Your job:

1. **Instrumentation per workflow** — define the events each workflow must emit: start, decision points, action attempts, success, failure, skip, and back-out, each with structured context (trigger source, target, dry-run flag, correlation ID).
2. **Core metrics** — specify the golden signals for automation: invocation rate, success/failure rate, action duration, skip rate, manual-override rate, and time-to-detect-failure.
3. **SLOs and burn alerts** — set per-workflow SLOs (e.g. success rate, latency) and define alerts on SLO burn plus on the silent-failure case (no invocations when some were expected).
4. **Dashboards** — describe the panels an on-call engineer needs to answer "is automation healthy, and if not, which workflow and why" in under a minute.
5. **Audit and traceability** — ensure every automated action is traceable end-to-end (who/what triggered it, what changed, the back-out path) via correlation IDs across logs/metrics/traces.
6. **Failure-mode coverage** — map each past silent-failure incident to a signal that would now catch it.

Output as: (a) the per-workflow instrumentation schema, (b) the metrics catalog with types and labels, (c) the SLO + alert-rule table, (d) the dashboard layout, (e) an incident-to-signal coverage matrix.

Default to over-instrumenting failure detection: treat any automated action you cannot observe, trace, or alert on as unsafe to run unattended. Every workflow must have an explicit back-out path that is itself logged and, where blast radius warrants, gated by approval.
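
To calibrate what a useful answer to output (a) might look like, here is a minimal Python sketch of a per-workflow event record. It is an illustration under assumptions, not part of the prompt: the class name `WorkflowEvent`, the field names, the phase strings, and the `cert-rotation` workflow in the usage note are all hypothetical. Only the lifecycle phases and the context fields (trigger source, target, dry-run flag, correlation ID) come from item 1 above.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Lifecycle phases listed in item 1 of the prompt (string spellings are assumed).
PHASES = {
    "start", "decision", "action_attempt",
    "success", "failure", "skip", "back_out",
}


def new_correlation_id() -> str:
    """Generate one ID per workflow run; reuse it on every event of that run."""
    return uuid.uuid4().hex


@dataclass
class WorkflowEvent:
    workflow: str        # hypothetical workflow name, e.g. "cert-rotation"
    phase: str           # one of PHASES
    trigger_source: str  # what started the run (cron, webhook, operator, ...)
    target: str          # the resource the workflow acts on
    dry_run: bool        # True when no real change is made
    correlation_id: str  # ties logs, metrics, and traces of one run together
    detail: str = ""     # free-form context, e.g. the reason for a skip
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject unknown phases early so dashboards never see stray values.
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase!r}")

    def to_json(self) -> str:
        """Serialize as one structured log line."""
        return json.dumps(asdict(self), sort_keys=True)
```

A run would create one ID with `new_correlation_id()` and attach it to every event it emits, for example `WorkflowEvent("cert-rotation", "start", "cron", "lb-01", True, cid).to_json()`. Validating the phase at construction time keeps the event vocabulary closed, which makes skip and back-out rates straightforward to count later.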

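The two alert conditions in item 3, SLO burn and the silent-failure case, can be sketched in a few lines of Python. This is a hedged illustration rather than a recommended rule set: the function names, the default `burn_threshold` of 2.0, and the per-window counts passed in are assumptions, and a real deployment would express the same logic in its alerting tool's rule language.

```python
from typing import List


def burn_rate(failures: int, total: int, slo_target: float) -> float:
    """Observed failure rate divided by the error budget (1 - SLO target).

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO allows; higher values mean it is being spent faster.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        # No invocations means no measurable burn. This is exactly why the
        # silent-failure check below has to exist as a separate signal.
        return 0.0
    return (failures / total) / (1.0 - slo_target)


def evaluate(
    workflow: str,
    invocations: int,
    failures: int,
    expected_min: int,
    slo_target: float,
    burn_threshold: float = 2.0,  # assumed threshold, tune per workflow
) -> List[str]:
    """Return alert messages for one workflow over one evaluation window."""
    alerts = []
    if invocations < expected_min:
        alerts.append(
            f"{workflow}: silent failure suspected "
            f"({invocations} runs, expected at least {expected_min})"
        )
    rate = burn_rate(failures, invocations, slo_target)
    if rate >= burn_threshold:
        alerts.append(f"{workflow}: SLO burn rate {rate:.1f}x budget")
    return alerts
```

Note that a workflow with zero invocations has a burn rate of 0 and would look healthy on a success-rate panel alone; only the expected-invocation check catches it. For instance, `evaluate("cert-rotation", 0, 0, 1, 0.99)` yields a single silent-failure alert and no burn alert.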