Writing Cloud Monitoring MQL and Log Explorer Queries With AI
MQL and the Log Explorer query language are powerful and genuinely hard to write from memory. Here's how I use AI to draft GCP monitoring and logging queries that actually run.
- #gcp
- #ai
- #monitoring
- #mql
- #logging
At 2am during an incident, I needed the 99th-percentile latency of a Cloud Run service, grouped by revision and computed over five-minute windows. I know that’s expressible in MQL. I did not, at 2am, remember the exact align, group_by, and percentile syntax to write it. I burned ten minutes on documentation while the incident burned. That’s the recurring problem with GCP’s query languages: Monitoring Query Language and the Log Explorer filter syntax are both expressive enough to answer almost anything and arcane enough that almost nobody writes them fluently from memory. So I stopped trying to. I describe the question in English, let AI draft the query, and then verify that it runs and returns what I expect.
MQL: describe the metric, get the query
MQL’s pipeline model (fetch, then align, then group_by with a reducer) is logical once you know it and opaque until you do. The model knows the grammar cold. I give it the metric type and the shape of answer I want:
Prompt: “Write a GCP Monitoring Query Language (MQL) query for the metric run.googleapis.com/request_latencies on resource type cloud_run_revision. I want the 99th percentile latency, grouped by the revision_name label, aligned to 5-minute windows over the last hour. Explain each pipeline stage in one line.”
fetch cloud_run_revision
| metric 'run.googleapis.com/request_latencies'
| align delta(5m)
| group_by [resource.revision_name], [value_p99: percentile(value.request_latencies, 99)]
| every 5m
I always verify it against the real metric explorer before trusting a number from it, because the model can pick a plausible-but-wrong metric type or a reducer that double-counts. The grammar it gets right; the semantics — is this metric a distribution, a gauge, a delta? — I confirm. A latency metric is a distribution, so percentile is valid; if the model had reached for mean on a counter, I’d catch it by sanity-checking the magnitude.
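Once a query like the one above has been verified, it can be worth templating so the same shape is reusable for other distribution metrics. The following is a minimal Python sketch, not part of any Google library: the function name build_percentile_mql and its parameters are hypothetical, and it assumes the value column is named after the last segment of the metric type, as in the query above. The comments give a one-line reading of each pipeline stage.

```python
def build_percentile_mql(resource_type, metric_type, group_label,
                         window="5m", percentile=99):
    """Render a percentile query with the same shape as the verified one above."""
    # Assumption: the value column is the last path segment of the metric type,
    # e.g. request_latencies for run.googleapis.com/request_latencies.
    value_column = metric_type.rsplit("/", 1)[-1]
    stages = [
        # fetch: pick the monitored resource type to read from
        f"fetch {resource_type}",
        # metric: pick the metric type recorded on that resource
        f"| metric '{metric_type}'",
        # align: fold raw points into fixed windows
        f"| align delta({window})",
        # group_by: merge series that share the label and reduce to a percentile
        f"| group_by [resource.{group_label}], "
        f"[value_p{percentile}: percentile(value.{value_column}, {percentile})]",
        # every: emit one output point per window
        f"| every {window}",
    ]
    return "\n".join(stages)


if __name__ == "__main__":
    print(build_percentile_mql(
        "cloud_run_revision",
        "run.googleapis.com/request_latencies",
        "revision_name",
    ))
```

The template only fixes the grammar. Whether the metric really is a distribution still has to be checked in the metric explorer before the output is trusted.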
Build an alert policy from the query
Once the MQL is right, the model drafts the full alert policy, which is otherwise a wall of nested JSON nobody enjoys writing:
Prompt: “Turn that MQL query into a GCP alerting policy condition. Fire when p99 latency exceeds 2000ms for 5 minutes. Output the JSON for gcloud alpha monitoring policies create --policy-from-file. Include a clear display name and documentation field, no notification channels (I’ll attach those).”
gcloud alpha monitoring policies create --policy-from-file=p99-latency-alert.json
I review the threshold and the duration by hand — those are judgment calls about my SLO, not syntax the model can know. Leaving notification channels out of the AI’s scope is deliberate: I attach those myself so a generated policy can never page the wrong on-call.
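That review can be partly mechanized before the file ever reaches gcloud. The sketch below is a hypothetical pre-flight check in Python, under stated assumptions: it assumes the generated file follows the AlertPolicy JSON shape with a conditionMonitoringQueryLanguage block (worth confirming against the API reference), and the function name, example policy, and query placeholder are all illustrative. It flags any notification channels and surfaces the duration and query so the judgment calls are made by a person.

```python
import json


def review_policy(policy):
    """Split a generated policy into hard problems and items to confirm by hand."""
    problems, confirm = [], []
    if policy.get("notificationChannels"):
        problems.append("notificationChannels is set; attach channels manually")
    if not policy.get("displayName"):
        problems.append("displayName is missing")
    if not policy.get("documentation", {}).get("content"):
        problems.append("documentation.content is missing")
    conditions = policy.get("conditions", [])
    if not conditions:
        problems.append("no conditions defined")
    for condition in conditions:
        # Assumed field name for MQL-based conditions; verify against the API docs.
        mql = condition.get("conditionMonitoringQueryLanguage", {})
        confirm.append(f"duration: {mql.get('duration')}")
        confirm.append(f"query: {mql.get('query')}")
    return {"problems": problems, "confirm": confirm}


# Hypothetical example of what the model might return.
EXAMPLE_POLICY = {
    "displayName": "checkout-api p99 latency above 2000ms",
    "documentation": {"content": "p99 latency exceeded the SLO threshold.",
                      "mimeType": "text/markdown"},
    "combiner": "OR",
    "conditions": [{
        "displayName": "p99 latency > 2000ms for 5 minutes",
        "conditionMonitoringQueryLanguage": {
            "query": "<reviewed MQL query with its threshold>",
            "duration": "300s",
        },
    }],
}

if __name__ == "__main__":
    print(json.dumps(review_policy(EXAMPLE_POLICY), indent=2))
```

An empty problems list means only that the file is safe to inspect further. The threshold and duration in the confirm list are still SLO decisions.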
Log Explorer: the filter language is the other half
The Logging query language has its own syntax, and combining resource type, severity, and a jsonPayload field filter trips people up. I describe the hunt:
Prompt: “Write a GCP Log Explorer query that finds all logs from Cloud Run service checkout-api with severity ERROR or higher, where jsonPayload.message contains ‘timeout’, in the last 6 hours. Then give me the equivalent gcloud logging read command with the right --freshness flag.”
gcloud logging read \
'resource.type="cloud_run_revision"
AND resource.labels.service_name="checkout-api"
AND severity>=ERROR
AND jsonPayload.message:"timeout"' \
--freshness=6h --limit=100 --format=json
The : operator for substring vs = for exact match is a constant source of empty result sets, and AI picks the right one when I tell it “contains.” Small thing, but it’s the difference between zero rows and the rows I needed.
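To make the operator choice explicit rather than something to remember, the filter can be assembled by a small helper. This is a minimal Python sketch with a hypothetical function name and parameters. It hard-codes the cloud_run_revision resource type from the command above and refuses values containing double quotes instead of guessing at escaping rules.

```python
def build_log_filter(service_name, min_severity, field, text, contains=True):
    """Build a Log Explorer filter; ':' matches substrings, '=' matches exactly."""
    for value in (service_name, text):
        if '"' in value:
            # Escaping is deliberately out of scope for this sketch.
            raise ValueError("double quotes are not supported in this sketch")
    operator = ":" if contains else "="
    clauses = [
        'resource.type="cloud_run_revision"',
        f'resource.labels.service_name="{service_name}"',
        f"severity>={min_severity}",
        f'{field}{operator}"{text}"',
    ]
    return "\nAND ".join(clauses)


if __name__ == "__main__":
    # Substring match, as in the gcloud command above.
    print(build_log_filter("checkout-api", "ERROR", "jsonPayload.message", "timeout"))
```

Passing contains=False swaps in the exact-match operator, which is the variant that tends to return zero rows when the message carries any extra text.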
Let AI read the results, not just write the query
The query is half the job; reading 100 JSON log lines is the other half. I hand the output straight back:
Prompt: “Here are 100 Cloud Run error logs (JSON). Cluster them by root cause: group similar jsonPayload.message values, ignoring request IDs and timestamps. Give me a table of cause, count, and one example, sorted by count. Tell me which cluster started most recently.”
That “started most recently” question is the gold during an incident: it points at what changed. The model is genuinely good at fuzzy-clustering messages that differ only in IDs, which is miserable to do by eye.
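A deterministic cross-check is useful here, because the model’s counts are worth verifying like any other output. The sketch below is a rough Python approximation of the same clustering, not what the model does internally: it masks UUIDs, long hex strings, and numbers, then groups on what remains. The regex patterns and function names are illustrative assumptions, and the timestamp comparison assumes every entry uses the same UTC timestamp format so that strings sort chronologically.

```python
import re
from collections import defaultdict

# Patterns for values that differ per request; applied in order.
ID_PATTERNS = [
    re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I),
    re.compile(r"\b[0-9a-f]{16,}\b", re.I),  # long hex identifiers
    re.compile(r"\d+"),                      # any remaining numbers
]


def normalize(message):
    """Replace per-request identifiers with a placeholder."""
    for pattern in ID_PATTERNS:
        message = pattern.sub("<id>", message)
    return message


def cluster_logs(entries):
    """Group entries by normalized message; return rows sorted by count, descending."""
    clusters = defaultdict(lambda: {"count": 0, "example": None, "first_seen": None})
    for entry in entries:
        message = entry.get("jsonPayload", {}).get("message", "")
        timestamp = entry.get("timestamp", "")
        cluster = clusters[normalize(message)]
        cluster["count"] += 1
        if cluster["example"] is None:
            cluster["example"] = message
        if cluster["first_seen"] is None or timestamp < cluster["first_seen"]:
            cluster["first_seen"] = timestamp
    rows = [{"cause": cause, **info} for cause, info in clusters.items()]
    rows.sort(key=lambda row: row["count"], reverse=True)
    return rows


def newest_cluster(rows):
    """Return the cluster whose first occurrence is the most recent."""
    return max(rows, key=lambda row: row["first_seen"])
```

Feeding it the parsed output of gcloud logging read --format=json gives a table to compare against the model’s answer. Where the two disagree on which cluster started last, the raw logs settle it.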
Log-based metrics and dashboards
To make a recurring error trackable, I have AI draft a log-based metric so I can alert on its rate later:
gcloud logging metrics create checkout_timeouts \
--description="Count of checkout-api timeout errors" \
--log-filter='resource.type="cloud_run_revision"
AND resource.labels.service_name="checkout-api"
AND jsonPayload.message:"timeout"'
Prompt: “Given that log-based metric, write the MQL to chart its rate per minute, and a one-line description of what a healthy baseline looks like so I know what threshold to alert on.”
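The baseline description is easier to judge with numbers from a quiet period next to it. As a rough Python sketch, assuming the per-minute counts for a known-healthy window have already been exported by hand: the function name is hypothetical and the headroom multiplier is an arbitrary starting point, not a recommendation.

```python
import statistics


def summarize_baseline(per_minute_counts, headroom=2.0):
    """Summarize a healthy window and propose a threshold to review by hand."""
    median = statistics.median(per_minute_counts)
    peak = max(per_minute_counts)
    return {
        "median": median,
        "peak": peak,
        # Arbitrary heuristic: the healthy peak times a headroom factor.
        "candidate_threshold": peak * headroom,
    }


if __name__ == "__main__":
    # Hypothetical per-minute timeout counts from a healthy period.
    print(summarize_baseline([0, 1, 0, 2, 1, 0, 3]))
```

The candidate threshold is only a prompt for the decision. What counts as healthy for the service remains a judgment call.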
Where the human stays in the loop
The pattern across MQL, log filters, alert policies, and log-based metrics is the same: the syntax is hard and the model knows it, so I delegate the grammar entirely. But which metric, which threshold, which severity, and what counts as healthy — those are judgment, and I own them. I verify every generated query returns sane numbers against the real backend before I build an alert on it, because a query that runs cleanly can still measure the wrong thing.
The reusable versions of these prompts are in my prompts collection, and the rest of the GCP with AI series covers the services these queries tend to watch. You don’t need to memorize MQL. You need to verify it.