AI-Assisted Log-Based Alert Rule Generation

At 03:14 one morning a single line scrolled past in our logs and nobody saw it:

level=error msg="payment capture failed: upstream timeout" tenant=acme order=88213 attempt=3

It repeated forty-one times over the next nineteen minutes. No alert fired. We found it the next morning in a customer email, not in PagerDuty. The signal was right there in Loki the whole time, structured and searchable, and we had simply never written a rule to count it. That outage was not a tooling gap. It was a backlog gap: writing good alert rules is tedious, and tedious work slides.

This is exactly the kind of work AI is good at accelerating. Not deciding what matters, but drafting the boilerplate around a pattern you already recognize. I treat the model as a fast junior engineer: it can write a credible LogQL query and a rule skeleton in seconds, but it does not get to ship anything to production paging on its own. Every generated rule goes through review, testing, and a documented back-out path, the same as any other code.

Start From the Log Line, Not the Dashboard

Most alerting advice starts with metrics. But the richest early-warning signals often live in logs first, before anyone has thought to emit a counter. The payment-capture failure above had no metric. It had a log line with a stable shape.

So the workflow begins with a real example. I find the recurring pattern in Loki and write a LogQL query that counts it:

sum by (tenant) (
  count_over_time(
    {app="checkout", level="error"}
      |= "payment capture failed"
      | logfmt
      [5m]
  )
)

This gives me a per-tenant count of capture failures over a five-minute window (a count, not a per-second rate, which is what makes a plain integer threshold readable). The | logfmt parser pulls tenant, order, and attempt out of the structured line so I can group and threshold on them. Getting this query right is the human part of the job. I know which tenants are noisy, which errors are transient retries, and which mean money is actively being lost.
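To make the query's behavior concrete, here is a minimal Python sketch of roughly what the parser, the line filter, and the per-tenant count do. It is an approximation for reasoning about the numbers, not how Loki implements them: the parse_logfmt and failures_per_tenant names are hypothetical, the regex is a simplified logfmt reader, and the level check reads the parsed field rather than a stream label.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Simplified logfmt reader: key=value or key="quoted value" pairs.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line: str) -> dict[str, str]:
    """Return the key/value pairs found in one logfmt line."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        quoted = len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"'
        fields[key] = raw[1:-1] if quoted else raw
    return fields


def failures_per_tenant(entries, now, window=timedelta(minutes=5)):
    """Count matching error lines per tenant inside the trailing window.

    entries is an iterable of (timestamp, line) pairs.
    """
    counts = Counter()
    for ts, line in entries:
        if not (now - window < ts <= now):
            continue  # outside the trailing window
        if "payment capture failed" not in line:
            continue  # mirrors the |= line filter
        fields = parse_logfmt(line)
        if fields.get("level") == "error":
            counts[fields.get("tenant", "")] += 1
    return dict(counts)
```

Feeding it a handful of recorded lines is a quick way to sanity-check what a threshold means for one tenant before touching the real query.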

Pro Tip: Once the log format stabilizes, match on a structured field rather than a free-text substring. A |= "payment capture failed" match is fine for exploration, but a filter such as | logfmt | event="capture_failed" (which assumes the service emits an event key; the sample line above does not have one yet) survives log-message rewording that would otherwise silently break your alert.

Hand the Model the Pattern, Ask for a Draft

Once I have a query that returns the right numbers, I bring in the model to draft the actual rule. I paste the sample log line, the working LogQL, and the constraints. The prompt matters: vague prompts produce vague rules. I keep a reusable version of this in my prompt workspace so the team drafts rules the same way.

A prompt that works looks roughly like this:

Here is a recurring error log line and a LogQL query that counts it per tenant. Draft a Loki alerting rule. Requirements: fire only on sustained failure, not a single blip. Include a for: duration, severity and team labels, a summary and description annotation that interpolates the tenant, and a runbook_url annotation pointing to a placeholder I will fill in. Suggest a starting threshold and explain the reasoning so I can adjust it.

The model returns something I can actually read and critique. I am not asking it to be right. I am asking it to be a fast first draft so I spend my time judging thresholds instead of typing YAML.
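One way to keep that prompt reusable is to assemble it from parts, so the requirements list stays identical across the team and only the log line and query change. The sketch below is an assumption about how such a helper might look; build_rule_prompt and REQUIREMENTS are hypothetical names, and the wording is lifted from the prompt above.

```python
# Requirements taken from the prompt text; edit this list, not each prompt.
REQUIREMENTS = [
    "Fire only on sustained failure, not a single blip.",
    "Include a for: duration.",
    "Include severity and team labels.",
    "Include summary and description annotations that interpolate the tenant.",
    "Include a runbook_url annotation pointing to a placeholder I will fill in.",
    "Suggest a starting threshold and explain the reasoning so I can adjust it.",
]


def build_rule_prompt(log_line: str, logql: str, requirements=REQUIREMENTS) -> str:
    """Assemble the drafting prompt from a sample line and a working query."""
    bullets = "\n".join(f"- {item}" for item in requirements)
    return (
        "Here is a recurring error log line and a LogQL query that counts it "
        "per tenant. Draft a Loki alerting rule.\n\n"
        f"Sample log line:\n{log_line.strip()}\n\n"
        f"Working LogQL:\n{logql.strip()}\n\n"
        f"Requirements:\n{bullets}\n"
    )
```

The function only builds text. Sending it to a model, and reviewing what comes back, stays a separate and deliberate step.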

What a Good Generated Rule Looks Like

Here is the kind of rule that comes back, after I have edited it:

groups:
  - name: checkout-payment-alerts
    rules:
      - alert: PaymentCaptureFailing
        expr: |
          sum by (tenant) (
            count_over_time(
              {app="checkout", level="error"}
                |= "payment capture failed"
                | logfmt
                [5m]
            )
          ) > 5
        for: 10m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payment captures failing for tenant {{ $labels.tenant }}"
          description: >
            More than 5 capture failures in 5m for tenant
            {{ $labels.tenant }}, sustained for 10 minutes. Revenue impact
            likely. Check upstream processor status before failing over.
          runbook_url: "https://runbooks.internal/payments/capture-failing"

Two things I always verify by hand. First, the for: 10m clause: AI tends to draft alerts that fire instantly, which is how you get paged at 03:14 for a transient retry that self-heals at 03:15. A sustained-failure window is the difference between a useful page and a trained-to-ignore page. Second, the runbook_url. A page with no runbook is a page that wakes someone up with no plan. I make the runbook link a required annotation and reject rules that ship without one.
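Both hand checks can be backed by a small automated one. The sketch below assumes the rule has already been loaded into a plain dict by whatever YAML parser the pipeline uses; review_findings is a hypothetical name, the five-minute minimum is an illustrative default rather than a recommendation, and the duration parser handles only single-unit values such as 10m.

```python
import re

_DURATION = re.compile(r"^(\d+)([smhd])$")
_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def duration_seconds(value: str) -> int:
    """Parse a single-unit duration like '10m'. Compound forms are not handled."""
    match = _DURATION.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _SECONDS[match.group(2)]


def review_findings(rule: dict, min_for_seconds: int = 300) -> list[str]:
    """Return reasons to reject a drafted rule; an empty list means it passes."""
    findings = []
    hold = rule.get("for")
    if hold is None:
        findings.append("missing for: clause, so the alert can fire on one evaluation")
    elif duration_seconds(str(hold)) < min_for_seconds:
        findings.append(f"for: {hold} is shorter than the {min_for_seconds}s minimum")
    annotations = rule.get("annotations") or {}
    if not str(annotations.get("runbook_url", "")).strip():
        findings.append("missing runbook_url annotation")
    labels = rule.get("labels") or {}
    for name in ("severity", "team"):
        if name not in labels:
            findings.append(f"missing {name} label")
    return findings
```

A check like this does not replace reading the rule. It only guarantees the two omissions above cannot slip through unnoticed.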

The threshold of 5 is a starting guess the model suggested. It is almost certainly wrong for my traffic. That is fine, because the next step is to test it against reality, not to trust it.

Test Rules Before They Ship

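Generated alert rules are code, and code gets tested. For Prometheus-style rules, promtool runs unit tests that assert a rule fires (or stays quiet) given a synthetic time series. One caveat: promtool parses PromQL only, so it cannot load the LogQL expression above as written. The test below therefore runs against a PromQL twin of the rule, in which a plain errors series stands in for the per-tenant count and the threshold, for:, labels, and annotations are identical. I write the test alongside the rule:

# capture_failing_test.yaml
# Assumes payment_alerts.yaml holds the PromQL twin of the rule:
#   expr: sum by (tenant) (errors) > 5
rule_files:
  - payment_alerts.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'errors{tenant="acme"}'
        # Crosses the threshold at 3m, then holds at 9 through 15m.
        values: '0 2 4 7 8 9+0x10'
    alert_rule_test:
      # Pending from 3m, so for: 10m is first satisfied at 13m.
      - eval_time: 13m
        alertname: PaymentCaptureFailing
        exp_alerts:
          - exp_labels:
              severity: critical
              team: payments
              tenant: acme
            # promtool compares the full annotation set, so list all three.
            exp_annotations:
              summary: "Payment captures failing for tenant acme"
              description: >
                More than 5 capture failures in 5m for tenant
                acme, sustained for 10 minutes. Revenue impact
                likely. Check upstream processor status before failing over.
              runbook_url: "https://runbooks.internal/payments/capture-failing"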

Then:

promtool test rules capture_failing_test.yaml

This catches the embarrassing failures before they reach anyone: a rule that never fires because the label dropped out of the sum by, a for: so long the alert resolves before it triggers, an annotation template that references a label that does not exist. I have caught the model hallucinating a {{ $labels.service }} that my query never produces. The test turns red, and nobody gets paged for a broken rule. For Loki rules I run the equivalent check with cortextool rules check plus a replay against recorded log windows in staging.
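The hallucinated-label case is cheap to catch even before the unit test runs. The sketch below is a rough static check, assuming the rule is already a dict and that the expression's output labels come from its by (...) clauses; unknown_label_refs is a hypothetical name. It deliberately treats only the grouping labels as available, since I would not rely on rule-level labels or alertname being visible through $labels in annotation templates.

```python
import re

_LABEL_REF = re.compile(r"\$labels\.(\w+)")
_GROUP_BY = re.compile(r"\bby\s*\(([^)]*)\)")


def unknown_label_refs(rule: dict) -> set[str]:
    """Return labels referenced in annotations that the query does not group by.

    Conservative on purpose: expressions using without (...) or no
    aggregation at all need a human look instead.
    """
    available = set()
    for group in _GROUP_BY.findall(rule.get("expr", "")):
        available |= {name.strip() for name in group.split(",") if name.strip()}
    referenced = set()
    for text in (rule.get("annotations") or {}).values():
        referenced |= set(_LABEL_REF.findall(str(text)))
    return referenced - available
```

An empty set means every template reference is backed by a grouping label. Anything else is worth a reviewer's attention before the rule gets near a test run.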

Pro Tip: Keep a small corpus of recorded incident windows and quiet windows as fixtures. Every new generated rule runs against both: it must fire on the real incident and stay silent on the boring Tuesday. This is your regression suite for alerting, and it is the single best defense against AI-drafted noise.
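A fixture replay can be as simple as sliding the rule's window over per-minute counts. The sketch below is a simplified model, not the ruler's actual evaluation: would_fire is a hypothetical name, it assumes one evaluation per minute, and the fixtures in the usage note are synthetic stand-ins rather than recorded data.

```python
def would_fire(counts_per_minute, threshold=5, window=5, hold=10):
    """Return True if the rule would reach firing state on this fixture.

    counts_per_minute: matching log lines seen in each one-minute bucket.
    threshold: fire when the trailing-window sum is strictly greater.
    window: trailing window length in minutes (the [5m] range).
    hold: minutes the condition must stay true (the for: clause).
    """
    pending_since = None
    for minute in range(len(counts_per_minute)):
        start = max(0, minute - window + 1)
        total = sum(counts_per_minute[start:minute + 1])
        if total > threshold:
            if pending_since is None:
                pending_since = minute
            if minute - pending_since >= hold:
                return True
        else:
            pending_since = None  # condition broke, so the for: timer resets
    return False
```

As a usage example, a synthetic incident shaped like the one above (41 failures spread over 19 minutes, such as three minutes with 3 failures and sixteen with 2) returns True, while a single burst of 10 failures followed by silence returns False because the for: window is never satisfied.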

The Review Gate Is Not Optional

The model never has my production credentials and never touches the alerting stack directly. Its output lands in a branch as a proposed change, nothing more. From there the rule goes through the same gate as any other code: a pull request, a human reviewer, and CI that runs promtool test rules on every push. If the tests fail, the PR cannot merge. If a reviewer cannot name what the rule protects and who it pages, the PR cannot merge.

This is the same approval-gate philosophy I apply to any AI-suggested action. I wrote about the general pattern in ChatOps approval gates for AI-suggested actions, and the principle holds here: AI proposes, a human disposes. The reviewer is not rubber-stamping YAML. They are the person who owns the decision that this pattern deserves to wake someone up. Our code review workflow treats alert-rule PRs exactly like application-code PRs, because they are.

Ship Quiet First, Keep a Back-Out Path

Even a reviewed, tested rule does not go straight to production paging. New rules ship in a quiet mode first: routed to a low-priority channel or a staging Alertmanager that notifies no one, for a soak period of a few days. We watch how often it would have fired. A rule that would have paged thirty times in a weekend is a noisy rule, and we tune the threshold or the for: window before it ever reaches the on-call rotation. This is the same dry-run instinct I describe in dry-run and simulation before automated actions: see the blast radius before you arm the trigger.
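Counting how often a rule would have paged comes down to counting transitions into the firing state during the soak. The sketch below assumes the soak produces an ordered list of booleans, one per evaluation; firing_episodes and soak_verdict are hypothetical names, and the one-page-per-day budget is an illustrative default that each team would set for itself.

```python
def firing_episodes(states):
    """Count transitions from not-firing to firing in an ordered state list."""
    episodes = 0
    previous = False
    for firing in states:
        if firing and not previous:
            episodes += 1
        previous = firing
    return episodes


def soak_verdict(states, soak_days, max_pages_per_day=1.0):
    """Return ('promote' or 'tune', episode_count) for a soak period."""
    episodes = firing_episodes(states)
    verdict = "promote" if episodes / soak_days <= max_pages_per_day else "tune"
    return verdict, episodes
```

The verdict is only a prompt for the conversation. A rule that fires rarely but on the wrong thing still needs tuning, and that judgment stays with the reviewer.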

And because the rule is code in git, the back-out path is trivial and well-understood: revert the commit. If a freshly promoted rule turns out to be a 3 a.m. nuisance generator, anyone on call can git revert it and redeploy, no archaeology required. That revert-ability is what makes me comfortable moving fast on the drafting step. The cost of a bad rule is bounded because undoing it is one PR.

Where AI Actually Earns Its Keep

The honest accounting: AI did not decide that payment-capture failures matter. A human did, after an incident. AI did not pick the threshold or the soak period. Humans did, from data. What AI did was collapse the hour of fiddly YAML-and-LogQL authoring into a few minutes of editing a solid draft, which meant the rule got written that week instead of sitting in the backlog until the next outage.

That is the right division of labor. The model is a fast junior engineer that drafts, suggests, and explains. The senior judgment, the production credentials, and the final decision stay with people. If you want to push this further, a curated prompt pack of alert-authoring prompts keeps the whole team drafting rules to the same standard, and the broader set of automation patterns on this site all share the same backbone: AI accelerates, gates contain, and a human owns the call.

The log line that should have paged someone now does. It took an AI to draft the rule and a human to trust it.