Turning Plain-English SLO Requirements Into PromQL With AI

Every SLO starts life as a sentence in a planning doc: “99.9% of checkout requests should succeed within 300ms over a rolling 30 days.” That sentence reads as perfectly clear to a product manager and is almost useless to Prometheus, which wants a numerator, a denominator, and a window. The translation from English to correct PromQL is where most SLO efforts stall, and it’s exactly the kind of mechanical-but-fiddly work AI handles well. I’ve built a repeatable flow for it. The flow leans on the model for speed and on me for the judgment calls it keeps getting wrong.

Pin down the four pieces before you prompt

A good SLI query needs four things stated explicitly, and the English sentence usually leaves two of them implied. Before I ask the model for anything, I force myself to name:

What counts as a “good” event — HTTP 2xx/3xx? Under a latency bound? Both?
What counts as the total — all requests? excluding health checks? excluding 4xx client errors?
The window — rolling 30 days, calendar month, or per-evaluation?
The target — 99.9% means a 0.1% error budget.

The “excluding 4xx” decision alone changes the query meaningfully, and it’s a business call, not a technical one. AI will happily pick a default and never mention the ambiguity, which is the first thing I make it stop doing.
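
One way to keep the four pieces honest is to write them down as data before any prompting happens. The sketch below is a minimal illustration in Python, not part of any SLO tooling; the SLOSpec class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOSpec:
    """The four decisions to settle before any PromQL is written (hypothetical type)."""
    good_event: str    # what counts as "good"
    total_events: str  # what counts toward the total
    window_days: int   # rolling window length in days
    target: float      # e.g. 0.999 for 99.9%

    @property
    def error_budget(self) -> float:
        """Fraction of total events that may be bad."""
        return 1.0 - self.target

    def allowed_bad_events(self, total_count: int) -> float:
        """Bad events the budget permits for a given number of total events."""
        return total_count * self.error_budget


checkout = SLOSpec(
    good_event="HTTP 2xx and latency <= 300ms",
    total_events="checkout requests excluding 4xx",
    window_days=30,
    target=0.999,
)
print(round(checkout.error_budget, 6))                # 0.001
print(round(checkout.allowed_bad_events(1_000_000)))  # 1000
```

If any field is hard to fill in, that is usually the ambiguity to take back to the product owner.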

The prompt that surfaces ambiguity

I want an SLI for “99.9% of checkout requests succeed within 300ms over rolling 30 days.” Before writing PromQL, list every ambiguous decision in that statement and the options for each. Don’t write the query yet.

A strong model comes back with: do we count latency failures as errors or only status-code failures? Are 4xx errors our fault? Is “checkout” one route or several? Forcing this list before the query is the highest-leverage move in the whole flow, because it turns the model into a requirements interrogator instead of a guesser. I take that list to the product owner, get real answers, then ask for the query.

Pro Tip: Always make the model enumerate ambiguities before it writes a single line of PromQL. The query it writes after you’ve resolved them is dramatically more likely to be the one you actually meant — and the ambiguity list itself becomes great documentation for the SLO.

The SLI query, recorded for reuse

With decisions made — say, a “good” event is HTTP 2xx under 300ms, the total excludes 4xx, scoped to the checkout route — the query falls out cleanly. I have the model draft it and I record both numerator and denominator so the SLO, dashboard, and alerts all share one source of truth:

groups:
  - name: checkout-slo.rules
    rules:
      # Denominator: all checkout requests except 4xx client errors.
      - record: "slo:checkout_requests_total:rate5m"
        expr: |
          sum(rate(http_requests_total{route="/checkout", code!~"4.."}[5m]))
      # Numerator: 2xx checkout requests that finished within 300ms.
      # Assumes the histogram carries a code label and exposes an le="0.3" bucket.
      - record: "slo:checkout_requests_good:rate5m"
        expr: |
          sum(rate(http_request_duration_seconds_bucket{route="/checkout", le="0.3", code=~"2.."}[5m]))

The latency-bound piece is where AI tends to overcomplicate. Counting “2xx AND under 300ms” as good requires care with the histogram buckets, and the model’s first attempt is often wrong. I verify the numerator can never exceed the denominator by running both in the expression browser — a numerator larger than the denominator means the logic is inverted somewhere, and that check catches it instantly.
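
The numerator-versus-denominator check can also be run against exported samples instead of by eye. This is a rough Python sketch under the assumption that both recorded series were sampled at the same timestamps; the function name and the sample values are made up for illustration.

```python
from typing import Sequence


def find_violations(good: Sequence[float], total: Sequence[float],
                    tolerance: float = 1e-9) -> list[int]:
    """Return the indexes where the good rate exceeds the total rate."""
    if len(good) != len(total):
        raise ValueError("series must be sampled at the same timestamps")
    return [i for i, (g, t) in enumerate(zip(good, total)) if g > t + tolerance]


# Hypothetical samples of the two recorded series (requests per second).
good_rate = [41.8, 39.9, 44.0, 40.2]
total_rate = [42.0, 40.1, 43.5, 40.2]
print(find_violations(good_rate, total_rate))  # [2]
```

Any non-empty result points at a sample where the selectors disagree, for example a numerator that counts requests the denominator excludes.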

From SLI to error budget

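The error budget is the friendly part. With a 99.9% target, the budget is 0.1% of total events over the window. I start from the fraction of good events over that window; one minus that fraction, divided by the 0.1% budget, is the share of budget consumed: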

# Fraction of good events over the trailing 30 days
sum_over_time(slo:checkout_requests_good:rate5m[30d])
  /
sum_over_time(slo:checkout_requests_total:rate5m[30d])

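I let AI draft this but I sanity-check the window mechanics myself. sum_over_time over a recorded rate adds up one sample per rule evaluation, so each request is counted in several overlapping 5-minute windows and any gap in rule evaluation silently drops traffic; the ratio holds up only because both series come from the same rule group and are sampled at the same instants. The model also sometimes reaches for rate() of a rate(), which is nonsense. Anything the model can’t explain plainly, I don’t ship.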
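
To turn the good-event fraction into budget consumption, the arithmetic is a single division. The helper below is a small Python sketch; the function name and the 0.9996 input are illustrative assumptions, not values from a real system.

```python
def budget_consumed(good_fraction: float, target: float) -> float:
    """Share of the error budget used so far; 1.0 means fully spent."""
    budget = 1.0 - target
    if budget <= 0:
        raise ValueError("target must be below 1.0")
    return (1.0 - good_fraction) / budget


# Hypothetical reading: 99.96% of events were good over the trailing 30 days.
used = budget_consumed(good_fraction=0.9996, target=0.999)
print(f"{used:.0%} of the error budget used")  # 40% of the error budget used
```

A value above 1.0 means the SLO is already violated for the window.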

Then the burn-rate alerts

The point of an SLO is alerting on budget burn, not on raw errors. I ask the model to generate multi-window burn-rate alerts from the recorded SLI, then I review the windows and thresholds against the Google SRE workbook values rather than trusting whatever the model invents:

- alert: "CheckoutErrorBudgetFastBurn"
  expr: |
    (1 - (slo:checkout_requests_good:rate5m / slo:checkout_requests_total:rate5m)) > (14.4 * 0.001)
    and
    (1 - (slo:checkout_requests_good:rate1h / slo:checkout_requests_total:rate1h)) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: page
  annotations:
    runbook_url: "https://runbooks.internal/checkout-slo"

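The 14.4 burn-rate multiplier and the dual-window structure are established patterns: 14.4 is the rate that spends 2% of a 30-day budget in one hour. Note that the alert assumes slo:checkout_requests_good:rate1h and slo:checkout_requests_total:rate1h are recorded the same way as the 5m rules, with a [1h] range. I make the model justify each number against the budget math, and the free Alert Rule Generator gives a clean scaffold with the annotations already in place. Deeper treatment lives in multi-window burn-rate alerts for SLOs that work.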
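
The budget math behind the multiplier is short enough to check by hand. This Python sketch assumes the common convention that the fast-burn page fires when 2% of a 30-day budget is spent within one hour; the function names are hypothetical.

```python
def burn_rate(budget_share: float, alert_window_hours: float,
              slo_window_hours: float = 30 * 24) -> float:
    """Multiplier that spends budget_share of the budget within alert_window_hours."""
    return budget_share * slo_window_hours / alert_window_hours


def error_ratio_threshold(rate: float, target: float) -> float:
    """Error ratio that corresponds to a given burn rate for this target."""
    return rate * (1.0 - target)


fast = burn_rate(budget_share=0.02, alert_window_hours=1)
print(round(fast, 1))                                # 14.4
print(round(error_ratio_threshold(fast, 0.999), 4))  # 0.0144
print(round(30 * 24 / fast, 1))                      # 50.0 hours until the budget is gone
```

The 0.0144 threshold is the same quantity the alert expression writes as 14.4 * 0.001.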

Watch the window-mechanics trap

The single most common place AI gets SLO PromQL wrong is the long-window aggregation, and it’s worth a dedicated check. The naive instinct — the model’s and often the human’s — is to wrap a rate() in another rate() or to avg_over_time a ratio, both of which produce numbers that look reasonable and are statistically meaningless. A ratio of two rates is not the same as the rate of a ratio, and averaging a percentage over thirty days weights every sample equally regardless of traffic, so a quiet weekend with two errors out of ten requests tanks your reported availability.

The correct approach sums the underlying counts over the window, then takes the ratio of the sums:

# Correct: ratio of summed counts over the window
sum(increase(slo_requests_good_total[30d]))
  /
sum(increase(slo_requests_total[30d]))

# Wrong: averaging a ratio weights low-traffic periods equally
avg_over_time((rate(slo_requests_good_total[5m]) / rate(slo_requests_total[5m]))[30d:5m])

I always ask the model to explain why the count-based version is correct and the ratio-average is not. If it can articulate that traffic-weighting argument, I trust the query; if it just asserts the form is right, I push back. This is the difference between a fast junior engineer who understands the math and one who’s pattern-matching — and the explanation is how I tell them apart before the SLO number drives a roadmap decision.
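
The traffic-weighting argument is easy to see with two made-up periods: a busy one with 9,990 good requests out of 10,000, and a quiet one with two errors out of ten. The Python sketch below uses hypothetical counts and function names.

```python
# (good, total) event counts per period; values are illustrative only.
periods = [(9990, 10000), (8, 10)]


def ratio_of_sums(counts):
    """Traffic-weighted availability: sum the counts, then divide."""
    good = sum(g for g, _ in counts)
    total = sum(t for _, t in counts)
    return good / total


def average_of_ratios(counts):
    """Unweighted availability: every period counts equally."""
    return sum(g / t for g, t in counts) / len(counts)


print(f"{ratio_of_sums(periods):.4f}")      # 0.9988
print(f"{average_of_ratios(periods):.4f}")  # 0.8995
```

Ten requests out of 10,010 move the averaged figure by almost ten percentage points, while the count-based figure barely changes.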

Keep a human between the prose and production

The throughline is that AI accelerates the mechanical translation but cannot make the judgment calls — what counts as good, whether 4xx is your fault, how aggressive the burn-rate page should be. Those are human decisions, and the model’s value is forcing them into the open early instead of burying them in a default. I keep my SLO-translation prompts saved in a prompt workspace and reuse them across Claude and ChatGPT so each new SLO starts from the same disciplined flow.

It’s worth naming the failure mode this guards against. An SLO that’s mistranslated doesn’t announce itself — it just quietly reports a number that’s too rosy or too harsh, and teams make decisions on it for months. A 99.9% availability SLO that secretly counts health-check traffic in its denominator will look healthier than reality, because health checks never fail; teams relax, and real user errors hide in the dilution. The ambiguity-enumeration step exists precisely to catch this before the SLI ships, because the model surfaces the “do we include health checks?” question that a human writing PromQL straight from the doc would never think to ask. The judgment stays human, but the prompting drags the judgment into daylight while it can still be made correctly — which is the entire value of treating the model as a fast junior engineer who asks good questions rather than an oracle that hands you an answer.
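
The dilution effect is simple arithmetic. In this Python sketch the request counts are invented for illustration: 1,000 user requests with 10 failures, plus 9,000 health checks that always succeed.

```python
def availability(good: int, total: int) -> float:
    """Fraction of events that were good."""
    return good / total


# Hypothetical traffic mix.
user_good, user_total = 990, 1000
health_checks = 9000  # never fail

real = availability(user_good, user_total)
diluted = availability(user_good + health_checks, user_total + health_checks)
print(f"{real:.3f}")     # 0.990
print(f"{diluted:.3f}")  # 0.999
```

Users experience 99.0%, but the SLI reports 99.9% and appears to meet the target.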

Conclusion

The gap between an SLO sentence and correct PromQL is full of quiet decisions, and AI’s real gift is dragging them into daylight before they calcify into a wrong query. Make the model enumerate ambiguities first, resolve them with the people who own the business outcome, verify the numerator-denominator relationship, and review every burn-rate threshold by hand. Do that and your SLOs will mean what the doc says they mean. More in the SLOs and error budgets guide and the monitoring category.