AWS WAF Rules and Rate Limiting With AI: From Managed Groups to Clean Custom Rules

The worst WAF incident I’ve worked started as a “the API is down” page that wasn’t an outage at all: AWS WAF was returning 403s to a partner’s legitimate batch integration because the managed core rule set flagged a base64 payload as a possible injection. Nothing was broken; the WAF was doing exactly what we told it to. That’s the recurring shape of WAF work: the rules that stop attackers also catch the edges of normal traffic, and the difference between a useful WAF and a support-ticket factory is how carefully you tune the corners. AI helps here because WAF reasoning is mostly pattern analysis over sampled requests: reading which rule fired, against which traffic, and whether the match is a real threat or a quirk of your own clients. A model does that triage faster than I can read JSON-formatted sampled requests by hand. What it can’t do is know that the “suspicious” user agent is our own monitoring probe, so I verify every rule change against the traffic it would actually block.

The line I hold: AI analyzes the sampled requests and proposes rule and rate-limit changes with the false-positive risk spelled out. I decide what goes to BLOCK, because the model doesn’t know our partners, our probes, or which “anomaly” is a feature we shipped last week.

Start managed rules in count mode

The single best habit in WAF: never deploy a managed rule group straight to BLOCK. Add it in count mode first, let it observe real traffic, and read what it would have blocked before it does. The AWS Managed Rules core rule set and the known-bad-inputs set are the usual starting pair, but both have rules that fire on legitimate enterprise traffic — large request bodies, base64 content, unusual but valid headers.

aws wafv2 get-web-acl \
  --name app-prod-acl --scope REGIONAL \
  --id 1234abcd-... \
  --query 'WebACL.Rules[].{Name:Name,Priority:Priority,Action:Action,OverrideAction:OverrideAction}'

When you add a managed group with OverrideAction: Count, every rule inside it logs a match without blocking. You then watch the per-rule metrics and labels and decide which individual rules to keep counting and which to let block; those per-rule decisions are expressed as RuleActionOverrides on the group. That granularity matters: you almost never want to disable a whole managed group, you want to neutralize the one rule inside it that hates your traffic.
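A per-rule override might look roughly like the rule entry below. This is a sketch, not the author's configuration: it assumes the core rule set with only CrossSiteScripting_BODY held in Count, and the rule name, priority, and metric name are placeholders.

```json
{
  "Name": "aws-common-rule-set",
  "Priority": 20,
  "Statement": {
    "ManagedRuleGroupStatement": {
      "VendorName": "AWS",
      "Name": "AWSManagedRulesCommonRuleSet",
      "RuleActionOverrides": [
        { "Name": "CrossSiteScripting_BODY", "ActionToUse": { "Count": {} } }
      ]
    }
  },
  "OverrideAction": { "None": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "awsCommonRuleSet"
  }
}
```

With OverrideAction set to None the group's other rules keep their own actions, while the one overridden rule only counts and labels its matches.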

Read the sampled requests, not just the counts

A count going up tells you something fired; it doesn’t tell you whether it should have. WAF logs sampled requests with the matching rule, and that’s where the real signal lives. Pull a sample for the rule that’s firing:

aws wafv2 get-sampled-requests \
  --web-acl-arn "arn:aws:wafv2:us-east-1:123456789012:regional/webacl/app-prod-acl/1234abcd" \
  --rule-metric-name "CrossSiteScripting_BODY" \
  --scope REGIONAL \
  --time-window StartTime=$(date -u -d '1 hour ago' +%s),EndTime=$(date -u +%s) \
  --max-items 100 \
  --query 'SampledRequests[].{uri:Request.URI,method:Request.Method,country:Request.Country,action:Action}'

Now you have the actual URIs, methods, and source countries the rule matched. If they’re all hitting /api/v2/import from your partner’s IP range, that’s not XSS — that’s a base64 payload tripping the body inspection. That’s the moment to write a scoped exception rather than disable the protection globally.
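To see at a glance whether matches cluster on one endpoint and one network, the sampled requests can be grouped by URI and source /24. The following is a minimal sketch: summarize_samples is a hypothetical helper, not part of the AWS CLI, and it assumes tab-separated "URI, client IP" lines with IPv4 addresses only.

```shell
# Hypothetical helper: group sampled requests by URI and source /24.
# Input (stdin): one "<uri><TAB><ipv4>" line per sampled request.
# Output: "<count> <uri> <a.b.c.0/24>", most frequent first.
summarize_samples() {
  awk -F'\t' 'NF == 2 {
      split($2, octet, ".")
      print $1 "\t" octet[1] "." octet[2] "." octet[3] ".0/24"
    }' |
    sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ print $1, $2, $3 }'
}

# Possible usage (needs AWS credentials; reuse the ARN, metric name, and
# time window from the get-sampled-requests command above):
#   aws wafv2 get-sampled-requests ... \
#     --query 'SampledRequests[].[Request.URI,Request.ClientIP]' \
#     --output text | summarize_samples
```

One dominant row for /api/v2/import from a single /24, next to a scatter of single-hit rows elsewhere, is the two-population pattern worth treating separately.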

Let AI triage the false positives

This is the work AI is best at: take a batch of sampled blocked requests and separate the genuine attacks from the false positives, with reasoning you can check.

Prompt: The managed CrossSiteScripting_BODY rule is in count mode and matched 340 requests in the last hour. Sampled requests show: most are POST to /api/v2/import from one /24 (our partner’s egress), bodies are base64-encoded payloads; a handful are POST to /login with <script> strings in the username field from scattered residential IPs. Should I let this rule block?

Reply: You’ve got two distinct populations and they need different handling, so don’t make one decision for both. The /api/v2/import traffic is almost certainly a false positive: base64 bodies routinely contain byte sequences that the XSS body inspector reads as script-like, and a single partner /24 hammering one import endpoint is integration traffic, not an attack. Disabling the rule globally to fix that would also wave through the second population, which looks real: <script> in a login username field from scattered residential IPs is a textbook credential-form probe, and you want that blocked. The clean fix is a scoped exception, not a disable: add a rule above the managed group that matches the import endpoint from the partner CIDR and allows it, or use a label-match rule so the managed group still inspects everything else. Then move the managed rule to BLOCK for the rest of the traffic so the login probes get stopped. Before you ship the allow, verify that partner CIDR is actually theirs and not spoofable at your edge. If the regional WAF sits behind CloudFront, the client address it can see comes from the X-Forwarded-For header, which a client can forge, so match on a header secret too.

The model split the traffic correctly and caught the IP-spoofing caveat, which is the part I’d have been tempted to skip. I confirmed the CIDR with the partner and added a header check before allowing anything.
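The label-match variant of that scoped exception might look roughly like the rule below. It is a sketch under several assumptions: the managed CrossSiteScripting_BODY rule is overridden to Count so it only attaches its label, the label key shown is the one I understand the core rule set to use for that rule (confirm it against the labels in your own sampled requests), and the IP set ARN, rule name, priority, and metric name are placeholders. A header-secret condition could be added as a third statement inside the inner AndStatement.

```json
{
  "Name": "block-xss-body-except-partner-import",
  "Priority": 30,
  "Statement": {
    "AndStatement": {
      "Statements": [
        {
          "LabelMatchStatement": {
            "Scope": "LABEL",
            "Key": "awswaf:managed:aws:core-rule-set:CrossSiteScripting_Body"
          }
        },
        {
          "NotStatement": {
            "Statement": {
              "AndStatement": {
                "Statements": [
                  {
                    "ByteMatchStatement": {
                      "SearchString": "/api/v2/import",
                      "FieldToMatch": { "UriPath": {} },
                      "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
                      "PositionalConstraint": "EXACTLY"
                    }
                  },
                  {
                    "IPSetReferenceStatement": {
                      "ARN": "arn:aws:wafv2:us-east-1:123456789012:regional/ipset/partner-egress/EXAMPLE-ID"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "Action": { "Block": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "blockXssBodyExceptPartnerImport"
  }
}
```

The rule has to be evaluated after the managed group (a higher Priority number) so the label already exists when it runs. Unlike an Allow rule placed above the group, it leaves every other rule, including rate limits, applying to the partner traffic.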

Rate-based rules: the math people get wrong

Rate-based rules count requests from an aggregation key over a sliding window and block when the count crosses a limit. The two settings that decide everything are the Limit and the EvaluationWindowSec. The classic mistake is reasoning about the limit as “requests per second” — it isn’t. The limit is the total count over the window, and the window can be 60, 120, 300, or 600 seconds. A limit of 2000 over a 300-second window is a very different gate than 2000 over 60 seconds.
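The difference is easier to feel with the division written out. The helpers below are an illustrative sketch with hypothetical names; the arithmetic is idealized and ignores how WAF samples and enforces the limit in practice.

```shell
# Average requests/second one aggregation key can sustain before reaching the limit.
sustained_rps() {      # usage: sustained_rps <limit> <window_seconds>
  awk -v limit="$1" -v window="$2" 'BEGIN { printf "%.1f\n", limit / window }'
}

# Seconds a constant burst needs to reach the limit
# (only meaningful when the result is shorter than the window).
seconds_to_limit() {   # usage: seconds_to_limit <limit> <requests_per_second>
  awk -v limit="$1" -v rps="$2" 'BEGIN { printf "%.0f\n", limit / rps }'
}

sustained_rps 2000 300      # prints 6.7
sustained_rps 2000 60       # prints 33.3
seconds_to_limit 2000 100   # prints 20
```

The same Limit of 2000 tolerates roughly five times the sustained rate when the window shrinks from 300 to 60 seconds, so the two settings have to be chosen together.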

{
  "Name": "rate-limit-by-ip",
  "Priority": 10,
  "Statement": {
    "RateBasedStatement": {
      "Limit": 2000,
      "EvaluationWindowSec": 300,
      "AggregateKeyType": "IP",
      "ScopeDownStatement": {
        "ByteMatchStatement": {
          "SearchString": "/api/",
          "FieldToMatch": { "UriPath": {} },
          "TextTransformations": [{ "Priority": 0, "Type": "NONE" }],
          "PositionalConstraint": "STARTS_WITH"
        }
      }
    }
  },
  "Action": { "Block": {} },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "rateLimitByIp"
  }
}

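Two details worth internalizing. The ScopeDownStatement narrows the rate rule to just /api/ paths. Without it, you’re rate-limiting static asset fetches and CDN warmups together with API calls, and a single page load that pulls 40 assets eats the budget. And AggregateKeyType: IP aggregates by the immediate source IP. Behind a NAT gateway or a corporate proxy, that’s everyone behind it sharing one counter, which is how you accidentally throttle an entire office. If a regional web ACL sits behind CloudFront or another proxy, the immediate source IP is the proxy’s, so set AggregateKeyType to FORWARDED_IP and aggregate on the forwarded header instead so each real client gets its own bucket. A shared NAT is a different problem: those clients really do share one address, so a forwarded header won’t separate them, and the options are a higher limit for that range or a custom aggregation key such as an API-key header.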
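For the proxy case, the statement changes only in how it keys the counter. A sketch of the RateBasedStatement, assuming the client address arrives in X-Forwarded-For; the header name and fallback choice are assumptions to adjust for your edge:

```json
{
  "RateBasedStatement": {
    "Limit": 2000,
    "EvaluationWindowSec": 300,
    "AggregateKeyType": "FORWARDED_IP",
    "ForwardedIPConfig": {
      "HeaderName": "X-Forwarded-For",
      "FallbackBehavior": "NO_MATCH"
    }
  }
}
```

FallbackBehavior decides what happens when the header is missing or malformed. NO_MATCH treats such requests as not matching this rule, which avoids blocking them by accident but also means they need to be covered some other way.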

Tune the rate limit from real traffic, not a guess

Before you set a limit, look at what your legitimate heavy users actually do. Query your access logs for the per-IP request rate over a five-minute window and find the p99 of your real traffic — your limit should sit comfortably above that so normal heavy users never trip it, but below the volume an abusive scraper generates. Set it too low and you block real customers during a sale; set it too high and it never engages. AI is good at reading the request-rate distribution and proposing a limit with the trade-off stated, but the number you ship should clear your busiest legitimate client with margin. I usually start the rate rule in count mode too, watch it for a few days against real peaks, and only then flip it to block.
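As a rough way to get that p99, the sketch below buckets requests per IP into fixed windows and takes a nearest-rank percentile of the counts. The helper name and the input format (one "epoch_seconds client_ip" line per request, already extracted from your access logs) are assumptions. Fixed buckets only approximate the sliding window WAF uses, so treat the result as a floor and add margin.

```shell
# Hypothetical helper: percentile of per-IP request counts per fixed window.
# Input (stdin): "<epoch_seconds> <client_ip>" per request.
# usage: per_ip_window_percentile [percentile=99] [window_seconds=300]
per_ip_window_percentile() {
  local pct="${1:-99}" window="${2:-300}"
  # Count requests per (ip, window bucket), then emit just the counts.
  awk -v w="$window" '{ count[$2 "," int($1 / w)]++ }
       END { for (k in count) print count[k] }' |
    sort -n |
    # Nearest-rank percentile: rank = ceil(pct/100 * N), for integer pct.
    awk -v p="$pct" '{ v[NR] = $1 }
         END {
           if (NR == 0) exit 1
           r = int((p * NR + 99) / 100)
           if (r < 1) r = 1
           print v[r]
         }'
}
```

Running it with 99 and 300 against a week that includes a known peak gives the count a limit over a 300-second window has to clear; comparing it with the p50 shows how far the heavy users sit from typical ones.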

The WAF that earns its keep is tuned, not just turned on: managed groups in count mode first, individual rules disabled only when sampled requests prove they’re wrong, scoped exceptions instead of global disables, and rate limits derived from your own traffic distribution with the right aggregation key. AI makes the sampled-request triage and the rate math fast, and it’s genuinely good at separating real probes from the base64 false positives that fill the logs. But every move toward BLOCK decides whether a real user gets a 403, so verify each one against the traffic it would catch before you ship it.
