CloudWatch Logs Insights and Alarm Design Prompt

Write precise CloudWatch Logs Insights queries to find the signal in noisy logs, then design alarms that page on real problems without flapping.

Target user: SRE and observability engineers using CloudWatch on AWS
Difficulty: Intermediate
Tools: Claude, ChatGPT, Cursor

The prompt

You are a senior observability engineer. You write Logs Insights queries that parse and aggregate to isolate signal, and you design alarms around a symptom users feel — not around a raw count that flaps.

I will provide:
- Sample log lines and their format (JSON, plain, ALB/VPC flow): [LOG_SAMPLES]
- What I am trying to find or alert on: [GOAL]
- The metric or pattern that indicates a real problem, and current alarm pain if any (flapping, missed incidents): [SIGNAL_AND_PAIN]

Do the following, numbered:

1. Write a Logs Insights query using the right commands for the goal: `parse`/`fields` to extract, `filter` to narrow, `stats ... by bin()` to aggregate over time, `sort` and `limit`. Explain each clause. If the logs are JSON, use dot notation for nested fields and keep the `@` prefix for system fields such as `@timestamp` and `@message`.

2. Iterate toward the signal: show a first query to characterize the data (e.g. counts by status/error type), then a refined query that isolates the specific condition in [GOAL], such as error rate per minute or p99 latency from a parsed duration field.

3. Design the alarm around a user-facing symptom: choose the metric (a metric filter from logs, or an existing metric), the statistic, the period, and the threshold. Set `EvaluationPeriods`/`DatapointsToAlarm` so transient blips don't page, and decide `TreatMissingData` deliberately.

4. Avoid flapping and false silence: prefer a rate or percentage over a raw count when traffic varies, use multiple datapoints, and consider a composite alarm so a single noisy signal doesn't page alone.

Output as: (a) the characterize query, (b) the refined signal query with each clause explained, (c) the alarm definition (metric, statistic, period, threshold, evaluation/datapoints, missing-data handling) with the rationale, (d) one tuning note for after it runs in production. Validate the query against the real log group before wiring an alarm to it. Never set an alarm to a raw absolute count on traffic-dependent data, and never wire a new alarm straight to a paging action without first running it in a non-paging or dashboard-only mode to confirm it fires correctly.

Why this prompt works

CloudWatch logs are firehoses, and the difference between a useful Logs Insights query and a useless one is precision in the parse, filter, and stats ... by bin() pipeline. Engineers often write one giant query and get either an unreadable dump or a number with no time context. This prompt builds the query in two passes — first characterize the data, then isolate the exact signal — which mirrors how an experienced operator actually narrows a search and produces a query you can read clause by clause.
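As a rough illustration of the two-pass approach, the sketch below holds a characterize query and a refined query as Python strings. The field names `status` and `duration_ms` are hypothetical and assume JSON logs where those keys are discovered automatically; plain-text logs would need a `parse` clause first. The `clauses` helper is only an illustrative convenience for reading a query one pipe stage at a time.

```python
# Pass 1: characterize the data. Which status values exist, and how often?
# `status` is a hypothetical JSON field; substitute whatever your logs emit.
CHARACTERIZE_QUERY = """
fields @timestamp, status
| filter ispresent(status)
| stats count(*) as requests by status
| sort requests desc
| limit 20
"""

# Pass 2: isolate the signal. Here, p99 latency per minute from a
# hypothetical numeric `duration_ms` field, with request volume alongside.
REFINED_QUERY = """
fields @timestamp, duration_ms
| filter ispresent(duration_ms)
| stats count(*) as requests, pct(duration_ms, 99) as p99_ms by bin(1m)
"""


def clauses(query: str) -> list[str]:
    """Split a Logs Insights query into its pipe-separated clauses."""
    return [part.strip() for part in query.split("|") if part.strip()]


if __name__ == "__main__":
    for label, query in (("characterize", CHARACTERIZE_QUERY),
                         ("refined", REFINED_QUERY)):
        print(label)
        for clause in clauses(query):
            print("  |", clause)
```

The first query answers "what is in here?" with one row per status; the second keeps the time dimension through `bin(1m)`, so the result can be charted or compared against an alarm threshold.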

Alarm design is where good intentions become pager fatigue. The most common failure is alarming on a raw absolute count: a threshold of “100 errors” pages on routine background noise at peak traffic, yet stays silent at 3am when volume is so low that even a near-total failure never reaches 100. By steering toward rates and percentages, multiple datapoints, and deliberate TreatMissingData handling, the prompt produces alarms that track a symptom users actually feel rather than an arbitrary count that flaps with traffic.
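The difference between a raw count and a rate, and the effect of requiring several breaching datapoints, can be seen in a small simulation. This is a simplified model, not CloudWatch's actual evaluation logic (which also accounts for missing data); the traffic numbers, the alarm name, and the 5% threshold are invented for illustration. The keys in `ALARM_SETTINGS` mirror parameter names of the CloudWatch `PutMetricAlarm` API, but the metric definition itself is deliberately left out.

```python
# Partial alarm settings. The metric definition is intentionally omitted,
# and the name and threshold are hypothetical.
ALARM_SETTINGS = {
    "AlarmName": "api-5xx-rate-high",
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 5.0,               # percent of requests failing
    "EvaluationPeriods": 5,         # N: look at the last 5 datapoints
    "DatapointsToAlarm": 3,         # M: at least 3 of them must breach
    "TreatMissingData": "notBreaching",
    "ActionsEnabled": False,        # observe first, page later
}


def error_rate_percent(errors: int, requests: int) -> float:
    """Error rate as a percentage; zero traffic is treated as 0%."""
    return 0.0 if requests == 0 else 100.0 * errors / requests


def in_alarm(values: list[float], threshold: float,
             evaluation_periods: int, datapoints_to_alarm: int) -> bool:
    """Simplified M-out-of-N check over the most recent N datapoints."""
    window = values[-evaluation_periods:]
    breaching = sum(1 for v in window if v > threshold)
    return breaching >= datapoints_to_alarm


if __name__ == "__main__":
    # Peak traffic: 120 errors trips a raw count of 100, but is only 0.6%.
    print(error_rate_percent(120, 20_000))
    # 3am: 45 errors never reaches 100, yet 75% of requests are failing.
    print(error_rate_percent(45, 60))

    t = ALARM_SETTINGS["Threshold"]
    n = ALARM_SETTINGS["EvaluationPeriods"]
    m = ALARM_SETTINGS["DatapointsToAlarm"]
    print(in_alarm([1.0, 9.0, 1.0, 1.0, 1.0], t, n, m))  # single blip
    print(in_alarm([1.0, 9.0, 8.0, 1.0, 7.0], t, n, m))  # sustained
```

A single one-minute spike leaves the alarm quiet, while three breaching datapoints out of five trip it. That is the trade the M-out-of-N settings make: a few minutes of added detection delay in exchange for fewer false pages.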

The rollout guardrail, and the habit of scoping queries tightly, reflect two real CloudWatch traps. Logs Insights bills by bytes scanned, so an unbounded query over a large log group is both slow and expensive, which is why it pays to keep the time range tight and filter early. And a fresh alarm wired straight to paging can either storm on-call or, with the wrong missing-data setting, stay silent during a real outage. Running it dashboard-only first and confirming it fires on a known-bad window keeps the engineer in control before it ever pages anyone.
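To make the bytes-scanned point concrete, here is a minimal sketch of two helpers: one that produces a bounded time window in epoch seconds (the form the Logs Insights `StartQuery` API takes for `startTime` and `endTime`), and one that turns a bytes-scanned figure (reported in the `statistics` of a query result) into an approximate cost. The price per GB is an assumption for illustration only; check current pricing for your region.

```python
from datetime import datetime, timedelta, timezone

# Assumed price, for illustration only. Verify against current AWS pricing.
ASSUMED_USD_PER_GB_SCANNED = 0.005


def bounded_window(end: datetime, minutes: int) -> tuple[int, int]:
    """Return (start, end) as epoch seconds for a window ending at `end`."""
    start = end - timedelta(minutes=minutes)
    return int(start.timestamp()), int(end.timestamp())


def estimate_scan_cost_usd(bytes_scanned: float,
                           usd_per_gb: float = ASSUMED_USD_PER_GB_SCANNED) -> float:
    """Approximate query cost from the bytes it scanned."""
    return bytes_scanned / 1024 ** 3 * usd_per_gb


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    start_s, end_s = bounded_window(now, minutes=30)
    print(start_s, end_s)

    # Compare a 30-minute scan of 2 GB with a 30-day scan of 3 TB.
    print(estimate_scan_cost_usd(2 * 1024 ** 3))
    print(estimate_scan_cost_usd(3 * 1024 ** 4))
```

The same query text can cost orders of magnitude more purely because of the window it runs over, so narrowing the time range is usually the first lever to pull.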
