Writing CloudWatch Logs Insights Queries With AI

Mid-incident, with traffic dropping and a Slack channel full of people watching, I opened CloudWatch Logs Insights to count 5xx errors by endpoint and completely blanked on the syntax. Is it parse or extract? Does stats go before or after filter? The Logs Insights query language is one of those tools I use just often enough to know it exists and not often enough to keep it in muscle memory — which means I’m fumbling with the docs at the exact moment I need answers fast. This is the single best place I’ve found for AI to earn its keep: turning “I want the count of 5xx by URL path in the last 15 minutes” into a correct query in one shot.

The framing is lighter here than for IAM or networking, because a read-only query can’t break production. But the discipline still matters: AI drafts the query, I read it to make sure it’s asking what I think it’s asking, and I sanity-check the result against reality before I make a decision based on it. A confidently wrong query during an incident is worse than no query.

Describe the question in plain English, with the log shape

The model writes a much better query if you tell it what the log lines actually look like. Grab one sample line first:

aws logs tail /aws/lambda/checkout-api --since 5m --format short | head -1

Then prompt with the question and the sample:

I’m in CloudWatch Logs Insights querying an API access log. A sample line looks like:

{"ts":"2026-06-21T09:14:02Z","method":"POST","path":"/v1/checkout","status":502,"latency_ms":31,"requestId":"abc-123"}

Write a Logs Insights query that, over the last 15 minutes, returns the count of requests with status >= 500 grouped by path, sorted by count descending. Use the JSON field names directly since these are structured logs.

Because the logs are JSON, Logs Insights auto-discovers the fields and the query is clean:

fields @timestamp, path, status
| filter status >= 500
| stats count(*) as errors by path
| sort errors desc
| limit 20

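That stats ... by grouping is the exact piece whose syntax I always forget, and in my experience the model gets it right reliably. The limit is a habit worth keeping: it caps the number of rows returned, so a high-cardinality grouping stays readable. It does not reduce the data scanned; the time range is what controls that.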
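If you would rather stay in the terminal than switch to the console mid-incident, the same query can be run through the AWS CLI. This is a minimal sketch, not the only way to do it: the function name run_insights_query is hypothetical, while start-query and get-query-results are the real CLI subcommands. Note that start-query takes its time range in epoch seconds.

```shell
# Sketch: run a Logs Insights query from the terminal.
# run_insights_query is a hypothetical helper name, not part of the AWS CLI.
run_insights_query() {
  log_group="$1"        # e.g. /aws/lambda/checkout-api
  query="$2"            # the Logs Insights query string
  minutes="${3:-15}"    # look-back window, defaults to 15 minutes
  end=$(date +%s)
  start=$((end - minutes * 60))

  # start-query only returns a queryId; results are fetched separately.
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start" \
    --end-time "$end" \
    --query-string "$query" \
    --query 'queryId' --output text) || return 1

  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;
      *) break ;;
    esac
  done

  # Print the final response (status, statistics, and result rows) as JSON.
  aws logs get-query-results --query-id "$query_id" --output json
}
```

Usage would look like: run_insights_query /aws/lambda/checkout-api 'filter status >= 500 | stats count(*) as errors by path | sort errors desc | limit 20'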

Unstructured logs: let AI write the parse

The harder case is text logs with no JSON structure, where you need parse with a glob or regex — and that’s precisely where my memory fails hardest. Give the model a raw sample:

These are plain-text app logs, not JSON. Sample:

2026-06-21 09:14:02 ERROR [checkout] user=4821 took=1290ms failed to reach payments: timeout

Write a Logs Insights query that extracts the latency in ms and the user id, filters to ERROR lines mentioning “timeout”, and shows the average and max latency. Use parse with a glob pattern, and explain the pattern.

It produces the parse line and, crucially, explains the glob so I can verify it matches my real format:

fields @message
| filter @message like /ERROR/ and @message like /timeout/
| parse @message "took=*ms" as latency
| parse @message "user=* " as userId
| stats avg(latency) as avgMs, max(latency) as maxMs, count(*) as hits

I read the parse "took=*ms" pattern against my sample line to confirm the wildcard lands where I expect. If my log format varies (some lines say took= and some say duration=), the glob extracts nothing from the non-matching lines. They stay in the result set, so count(*) still includes them, but they contribute no value to avg and max. That’s a real footgun, and it’s why reading the pattern matters before trusting the numbers.
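One way to make that reading step concrete is to test the pattern locally against a few sample lines before running the query. The sketch below uses grep with the regex 'took=.*ms' as a rough stand-in for the glob "took=*ms"; that equivalence is an assumption for a quick check, not a statement of how Logs Insights matches internally. The second sample line is hypothetical and exists only to show format drift.

```shell
# Hypothetical samples; in practice, paste a few real lines from `aws logs tail`.
samples='2026-06-21 09:14:02 ERROR [checkout] user=4821 took=1290ms failed to reach payments: timeout
2026-06-21 09:14:09 ERROR [checkout] user=77 duration=950ms failed to reach payments: timeout'

# Count the lines the pattern would match, and the total number of lines.
matched=$(printf '%s\n' "$samples" | grep -c 'took=.*ms')
total=$(printf '%s\n' "$samples" | grep -c '')
echo "latency extracted from $matched of $total sample lines"

# Print the lines the pattern misses so the format drift is visible.
printf '%s\n' "$samples" | grep -v 'took=.*ms'
```

With these two samples the summary reads "latency extracted from 1 of 2 sample lines", and the duration= line is printed as the miss.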

Always sanity-check the result against reality

Here’s the discipline that keeps a wrong query from misleading an incident. After running it, I cross-check the magnitude against something I already know. If the query says 3 errors in the last 15 minutes but the alarm that paged me is based on a 5xx rate that should be in the hundreds, the query is wrong, probably because of a filter that’s too narrow or a parse that isn’t matching. One cheap cross-check:

# Rough independent count to compare against the query result.
# --start-time is in epoch milliseconds, hence the appended 000.
# date -d is GNU date; on macOS/BSD use: date -v-15M +%s
aws logs filter-log-events \
  --log-group-name /aws/lambda/checkout-api \
  --filter-pattern '{ $.status >= 500 }' \
  --start-time $(date -d '15 minutes ago' +%s)000 \
  --query 'length(events)'

If that number is in the same ballpark as the query’s total, I trust the breakdown. If it’s wildly off, I go back to the model with “the count seems too low, here’s the raw filter-log-events count, what in the query is over-filtering?” — and it’s good at spotting its own too-narrow filter or a parse that didn’t match.
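"Same ballpark" can be pinned down so the comparison is not a judgment call at 3 a.m. The sketch below treats two counts as agreeing when the larger is at most a fixed multiple of the smaller. The helper name same_ballpark and the default factor of 3 are both assumptions chosen for illustration; pick a tolerance that suits your traffic.

```shell
# Succeeds when two counts are within a given factor of each other.
# same_ballpark is a hypothetical helper; the default factor of 3 is arbitrary.
same_ballpark() {
  a="$1"; b="$2"; factor="${3:-3}"
  # Order the pair so that hi >= lo.
  if [ "$a" -ge "$b" ]; then hi="$a"; lo="$b"; else hi="$b"; lo="$a"; fi
  # Two zeros agree; a zero against a non-zero count never does.
  [ "$hi" -le $((lo * factor)) ]
}

# Query total of 3 versus an independent count of 240: clearly not the same ballpark.
same_ballpark 3 240 || echo "counts disagree: re-check the filter and parse steps"
```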

Build a small library while you’re at it

The best long-term move is to have AI generate the queries you reach for repeatedly and save them. A handful that earn their place: error rate by endpoint, p99 latency by route, top requestIds by latency for tracing a slow path, and a “find every log line for this requestId across services” query. Ask the model to write each with comments, paste them into Logs Insights’ saved queries, and you’ve turned a fumble-under-pressure task into a one-click operation. AI also handles the annoying ones well — multi-stats, time-bucketed bin() series for graphing, and dedup for noisy logs — that I’d otherwise never remember the syntax for.
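Saved queries can also be registered from the CLI with put-query-definition, which makes the library easy to keep in version control. This is a sketch under a few assumptions: the query names and the "incident/" prefix are hypothetical, the field names come from the JSON sample earlier in this article, and the helper functions are illustrative rather than anything the CLI provides.

```shell
# Sketch: register a small library of saved Logs Insights queries.
# Names and the "incident/" prefix are hypothetical; adjust to taste.
LOG_GROUP="/aws/lambda/checkout-api"

save_query() {
  # $1 = saved query name, $2 = Logs Insights query string
  aws logs put-query-definition \
    --name "$1" \
    --log-group-names "$LOG_GROUP" \
    --query-string "$2"
}

save_all_queries() {
  # 5xx errors by endpoint
  save_query "incident/5xx-by-path" \
    'filter status >= 500 | stats count(*) as errors by path | sort errors desc | limit 20'
  # p99 latency by route
  save_query "incident/p99-latency-by-path" \
    'stats pct(latency_ms, 99) as p99 by path | sort p99 desc | limit 20'
  # Slowest individual requests, for tracing a slow path
  save_query "incident/slowest-requests" \
    'fields @timestamp, requestId, path, latency_ms | sort latency_ms desc | limit 20'
  # Time-bucketed 5xx series for graphing
  save_query "incident/5xx-per-minute" \
    'filter status >= 500 | stats count(*) as errors by bin(1m)'
}
```

Calling save_all_queries once, with credentials that allow logs:PutQueryDefinition, should create the four entries; running it again without passing a query definition ID is likely to create duplicates rather than update them, so treat it as a one-time bootstrap.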

The line that holds

A Logs Insights query is read-only, so the stakes are lower than touching IAM or a NACL — but a confidently wrong query during an incident sends you chasing ghosts. So AI drafts from a real log sample, I read the filter/parse/stats to confirm it asks what I mean, and I cross-check the result’s magnitude against an independent count before I act on it. That keeps the speed without surrendering judgment.

The same “draft fast, verify against reality” pattern shows up across the AWS guides, and the saved Logs Insights queries I keep are in the prompts collection.

