CloudWatch Logs Insights and Alarm Design Prompt
Write precise CloudWatch Logs Insights queries to find the signal in noisy logs, then design alarms that page on real problems without flapping.
- Target user: SRE and observability engineers using CloudWatch on AWS
- Difficulty: Intermediate
- Tools: Claude, ChatGPT, Cursor
The prompt
You are a senior observability engineer. You write Logs Insights queries that parse and aggregate to isolate signal, and you design alarms around a symptom users feel, not around a raw count that flaps.

I will provide:
- Sample log lines and their format (JSON, plain, ALB/VPC flow): [LOG_SAMPLES]
- What I am trying to find or alert on: [GOAL]
- The metric or pattern that indicates a real problem, and current alarm pain if any (flapping, missed incidents): [SIGNAL_AND_PAIN]

Do the following, numbered:
1. Write a Logs Insights query using the right commands for the goal: `parse`/`fields` to extract, `filter` to narrow, `stats ... by bin()` to aggregate over time, `sort` and `limit`. Explain each clause. If the logs are JSON, use the `@`/dot field access correctly.
2. Iterate toward the signal: show a first query to characterize the data (e.g. counts by status/error type), then a refined query that isolates the specific condition in [GOAL], such as error rate per minute or p99 latency from a parsed duration field.
3. Design the alarm around a user-facing symptom: choose the metric (a metric filter from logs, or an existing metric), the statistic, the period, and the threshold. Set `EvaluationPeriods`/`DatapointsToAlarm` so transient blips don't page, and decide `TreatMissingData` deliberately.
4. Avoid flapping and false silence: prefer a rate or percentage over a raw count when traffic varies, use multiple datapoints, and consider a composite alarm so a single noisy signal doesn't page alone.

Output as:
(a) the characterize query,
(b) the refined signal query with each clause explained,
(c) the alarm definition (metric, statistic, period, threshold, evaluation/datapoints, missing-data handling) with the rationale,
(d) one tuning note for after it runs in production.

Validate the query against the real log group before wiring an alarm to it. Never set an alarm to a raw absolute count on traffic-dependent data, and never wire a new alarm straight to a paging action without first running it in a non-paging or dashboard-only mode to confirm it fires correctly.
Why this prompt works
CloudWatch logs are firehoses, and the difference between a useful Logs Insights query and a useless one is precision in the `parse`, `filter`, and `stats ... by bin()` pipeline. Engineers often write one giant query and get either an unreadable dump or a number with no time context. This prompt builds the query in two passes (first characterize the data, then isolate the exact signal), which mirrors how an experienced operator narrows a search and yields a query you can read clause by clause.
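The two passes can be sketched as a pair of query strings. This is a minimal illustration, not output from the prompt: it assumes JSON log events with hypothetical fields `status` and `duration_ms`, and the helper `clauses` is invented here only to print each clause on its own line.

```python
# Both queries assume JSON log events with hypothetical fields
# `status` (HTTP status code) and `duration_ms` (request latency).

# Pass 1: characterize. Which status codes occur, and how often?
CHARACTERIZE_QUERY = """\
fields @timestamp, status
| stats count(*) as requests by status
| sort requests desc
| limit 20"""

# Pass 2: isolate the signal. p99 latency and request volume per minute,
# restricted to events that actually carry a duration.
REFINED_QUERY = """\
filter ispresent(duration_ms)
| stats pct(duration_ms, 99) as p99_ms, count(*) as requests by bin(1m)"""


def clauses(query: str) -> list[str]:
    """Split a query on its pipes so each clause can be read on its own.

    Naive on purpose: a query whose regex or string literal contains
    a pipe character would need a real parser.
    """
    return [part.strip() for part in query.split("|")]


for clause in clauses(REFINED_QUERY):
    print(clause)
```

The first query answers "what is in here?" cheaply; the second keeps the time dimension through `bin(1m)`, so the result is a series rather than a single number.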
Alarm design is where good intentions become pager fatigue. The most common failure is alarming on a raw absolute count: a threshold of “100 errors” can be routine noise at peak traffic, where it pages falsely, and yet never be reached at 3am when volume is low, even while most requests are failing. By steering toward rates and percentages, multiple datapoints, and deliberate `TreatMissingData` handling, the prompt produces alarms that track a symptom users actually feel rather than an arbitrary count that flaps with traffic.
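A small worked example may make the count-versus-rate argument concrete. The traffic figures, metric names, namespace, and thresholds below are all invented for illustration; `fires` is a simplified stand-in for CloudWatch's M-of-N evaluation, not a reimplementation of it. The dictionary at the end shows one way the same idea could be passed to boto3's `put_metric_alarm` using metric math.

```python
def error_rate_pct(errors: int, requests: int) -> float:
    """Errors as a percentage of requests; 0.0 when there is no traffic."""
    return 100.0 * errors / requests if requests else 0.0


def fires(datapoints: list[float], threshold: float, m: int, n: int) -> bool:
    """True when at least m of the last n datapoints exceed the threshold."""
    recent = datapoints[-n:]
    return sum(1 for value in recent if value > threshold) >= m


# A raw threshold of "more than 100 errors" misjudges both cases:
peak_rate = error_rate_pct(errors=150, requests=50_000)  # 0.3 %, yet it pages
quiet_rate = error_rate_pct(errors=80, requests=200)     # 40 %, yet it is silent

# With 3-of-5 evaluation, a single spike does not fire; a sustained breach does.
blip = [0.3, 0.2, 9.0, 0.4, 0.3]
sustained = [0.3, 6.0, 7.5, 0.4, 8.0]
print(fires(blip, 5.0, m=3, n=5), fires(sustained, 5.0, m=3, n=5))

# Keyword arguments for cloudwatch.put_metric_alarm(**ALARM_KWARGS).
# Namespace, metric names, and alarm name are hypothetical.
ALARM_KWARGS = {
    "AlarmName": "checkout-5xx-rate",
    "Metrics": [
        {"Id": "errors", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "MyApp", "MetricName": "Errors5xx"},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "requests", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "MyApp", "MetricName": "Requests"},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "rate", "Expression": "100 * errors / requests",
         "Label": "5xx rate (%)", "ReturnData": True},
    ],
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 5.0,
    "EvaluationPeriods": 5,      # look at the last five one-minute periods
    "DatapointsToAlarm": 3,      # and require three of them to breach
    "TreatMissingData": "notBreaching",  # a deliberate choice, not a default
}
```

`notBreaching` is only one defensible setting: a service that should never go quiet might prefer `breaching`, so that an absence of data is itself treated as a symptom.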
The cost and rollout guardrails reflect two real CloudWatch traps. Logs Insights bills by bytes scanned, so an unbounded query over a large log group is both slow and expensive, which is why queries should stick to a tight time range and filter early. And a fresh alarm wired straight to paging can either storm on-call or, with the wrong missing-data setting, stay silent during a real outage. Running it dashboard-only first and confirming it fires on a known-bad window keeps the engineer in control before it ever pages anyone.
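Both guardrails can be enforced in a few lines before anything reaches AWS. The sketch below only builds keyword arguments for boto3's `logs.start_query` and `cloudwatch.put_metric_alarm`; it makes no API calls. The one-hour cap, the log group name, the alarm name, and the SNS ARN are assumptions chosen for illustration.

```python
MAX_WINDOW_SECONDS = 3600  # hypothetical cap; pick one that fits your log volume


def bounded_query_args(log_group: str, query: str, end_time: int,
                       window_seconds: int = 900) -> dict:
    """Build arguments for logs.start_query with a capped time range.

    end_time is in epoch seconds. A wider window scans more data,
    so oversized windows are rejected rather than silently run.
    """
    if not 0 < window_seconds <= MAX_WINDOW_SECONDS:
        raise ValueError(f"window must be 1..{MAX_WINDOW_SECONDS} seconds")
    return {
        "logGroupName": log_group,
        "startTime": end_time - window_seconds,
        "endTime": end_time,
        "queryString": query,
    }


def dashboard_only(alarm_args: dict) -> dict:
    """Copy alarm arguments with actions disabled.

    The alarm still evaluates and changes state, so it can be watched
    on a dashboard, but no action such as an SNS page is triggered.
    """
    return {**alarm_args, "ActionsEnabled": False}


query_args = bounded_query_args(
    "/aws/lambda/checkout",  # hypothetical log group
    "filter status >= 500 | stats count(*) as errors by bin(1m)",
    end_time=1_700_000_000,
)
trial_alarm = dashboard_only({
    "AlarmName": "checkout-5xx-rate",  # hypothetical
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],  # hypothetical
})
```

Once the trial alarm has been seen to enter ALARM on a known-bad window, re-creating it with `ActionsEnabled` set to true is the only change needed to start paging.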