Writing Azure Monitor KQL Queries With AI Without Shipping Garbage Dashboards
KQL is powerful and the schema is huge. Here's how AI drafts Azure Monitor queries fast while you verify the columns, joins, and time grain so your alerts are actually correct.
- #azure
- #ai
- #kql
- #azure-monitor
- #observability
I once shipped an alert that never fired during a real outage. The KQL looked right in the query editor — it returned rows when I tested it against the last hour. But it used summarize count() by bin(TimeGenerated, 5m) against a table that only logged events on success, so when the service went fully down and stopped logging entirely, the count went to zero and my threshold of “greater than 100 errors” was never crossed. The query was syntactically perfect and semantically useless. KQL will happily let you do that.
KQL is a great query language and Azure Monitor’s schema is enormous — dozens of tables, each with its own columns and quirks. That combination is exactly where AI helps and exactly where it bites you. AI writes valid KQL fast, which collapses the time from “what do I want to know” to “here’s a query.” But valid is not correct, and a query that runs cleanly can still measure the wrong thing. The job is to let AI draft and to verify the schema, the time grain, and the failure mode yourself.
Let AI draft against a named table, not into the void
The single biggest improvement to AI-generated KQL is telling it the exact table and showing it the columns. Don’t ask for “a query to find errors” — pull the schema first, then constrain the request:
# Tables in a workspace, then the columns of the one you want
az monitor log-analytics workspace table list \
--workspace-name "$WS" --resource-group "$RG" --query "[].name" -o table
Or just run getschema in the query editor to dump column names and types. Then prompt with that schema in hand:
Prompt: “Here are the columns of the
AppRequeststable: TimeGenerated, Name, Url, ResultCode, DurationMs, Success, OperationId, AppRoleName. Write a KQL query that shows the p95 and p99 of DurationMs per Url over the last 6 hours, in 15-minute bins, only for requests where Success == false. Usepercentile()and explain the time grain choice.”
A good answer:
AppRequests
| where TimeGenerated > ago(6h)
| where Success == false
| summarize p95 = percentile(DurationMs, 95),
p99 = percentile(DurationMs, 99),
failures = count()
by Url, bin(TimeGenerated, 15m)
| order by TimeGenerated desc
Because you gave it the real columns, it can’t hallucinate ResponseTime when the field is DurationMs. Constraining the schema up front is what turns AI KQL from a guessing game into a draft you can trust enough to verify.
Verify the three things AI gets wrong
AI-written KQL fails in predictable places. Check these every time:
- Column names and table version. The classic trap is the old
requeststable versus the newerAppRequests, orcustomDimensionsvsProperties. Rungetschemaand confirm. A query against a column that doesn’t exist errors loudly, which is fine — the dangerous case is a column that exists but means something different. - The time grain.
bin(TimeGenerated, 1h)smooths a spike into nothing;bin(..., 1m)can be too noisy for an alert. AI picks a default; you decide based on what you’re measuring. - The absence-of-data failure mode. My opening story. Ask explicitly: “If the service stops logging entirely, does this query return zero rows or no rows, and how does that interact with an alert threshold?” A
count()-based alert needs a companion that fires on no data, which Azure Monitor supports as a distinct setting.
Prompt: “Review this KQL for an alert. The service may stop emitting logs entirely during a hard outage. Will my threshold still detect that case, or do I need a separate no-data alert? Rewrite it if there’s a silent-failure gap.”
That review pass is the one that would have saved my dead alert. AI reasons about the failure mode well when you ask it to — it just won’t volunteer it.
Joins are where it gets subtle
Correlating across tables is KQL’s real power and AI’s real risk. A join on the wrong key, or the wrong join kind (inner vs leftouter), silently drops or duplicates rows. The common, correct pattern is correlating requests with the exceptions thrown during them:
AppRequests
| where TimeGenerated > ago(1h) and Success == false
| join kind=leftouter (
AppExceptions
| where TimeGenerated > ago(1h)
| project OperationId, ExceptionType, OuterMessage
) on OperationId
| project TimeGenerated, Url, ResultCode, ExceptionType, OuterMessage
When AI hands you a join, make it defend the choice:
Prompt: “Explain why this query uses
kind=leftouterand joins on OperationId. What rows would I lose withkind=inner? Could the join produce duplicate rows, and if so why?”
leftouter keeps failed requests that had no recorded exception (real cases — timeouts, dependency failures), which inner would drop. That’s the kind of correctness detail you must own, but AI explains it accurately once challenged. The pattern of decode-and-explain is the same one I lean on for reading Azure RBAC and infrastructure — let the model translate dense output, then verify the claim.
Build a reusable library, not one-off queries
The highest-leverage move is to stop writing throwaway KQL. Have AI generate a small set of parameterized queries you keep in a repo — error rate by service, p99 latency, dependency failures, no-data detection — each with a comment explaining the time grain and the failure mode. Then alerts and dashboards reference known-good queries instead of whatever someone typed at 2 a.m.
Prompt: “Generate five reusable Azure Monitor KQL queries for a web service using the
AppRequestsandAppExceptionstables: (1) error rate %, (2) p99 latency, (3) top failing URLs, (4) exception trend, (5) a no-data sentinel. Add a comment on each explaining its time grain and what it does NOT catch.”
The “what it does not catch” comment is the part that keeps the library honest. Every query has a blind spot; documenting it is how the next person avoids my mistake.
The discipline
AI drafts the KQL; you own the schema, the grain, and the failure mode. The query language is too good at running invalid-but-syntactically-clean logic to skip the verify step — a query that returns rows in the editor is not a query that’s correct in production. Pull the real schema, constrain the prompt to it, challenge every join and threshold, and write down each query’s blind spot. Do that and AI turns KQL from an afternoon of documentation-spelunking into a fast, correct draft.
The KQL prompts I actually use are in the prompts library. Azure Monitor rewards good queries and punishes plausible-looking bad ones, and that’s the exact gap where a careful human plus an AI draft beats either alone.
Download the Free 500-Prompt DevOps AI Toolkit
500 battle-tested, copy-paste AI prompts engineered by a senior systems engineer — every one with fill-in placeholders and safety/back-out notes. Drop your email and it's yours.
- 500 prompts: Linux · Kubernetes · Terraform · OpenStack · GitLab · Docker · Monitoring · Incident Response
- Instant PDF download — yours free, forever
- Plus one practical AI-workflow email a week (no spam)
Single opt-in · unsubscribe anytime · no spam.