Tuning Pod Resource Requests From Real Metrics With AI

Nobody actually knows what to set for resources.requests. We copy a number from a neighboring service, round it up because round numbers feel safe, and ship it. The result is a cluster where requests bear no relation to usage: some pods reserve 2Gi and use 200Mi, others request 100m CPU and get throttled into the ground under load. The bill goes up and the latency gets worse at the same time.

The fix is boring and data-driven: look at actual usage over a representative window and set requests near the real demand. The reason nobody does it is that correlating usage data across dozens of workloads is tedious. That’s the AI-shaped part. I let a copilot crunch the metrics into recommendations, then I verify every number before it goes near the cluster, because a confident wrong limit causes OOMKills.

Get real numbers, not vibes

The first input is actual usage. The cheap version is kubectl top, which gives you a snapshot:

kubectl top pods -n prod --containers

A snapshot lies, though — it misses the peaks. If you have Prometheus, pull a percentile over a real window. The query that matters for memory is the max working set, and for CPU it’s a high percentile of the rate:

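quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="prod", container!=""}[5m])[7d:5m])

max_over_time(
  container_memory_working_set_bytes{namespace="prod", container!=""}[7d])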

I export a table of containers with their current requests, p50, p95, and max over a week, and hand that to the model:

Here’s a week of usage per container with current requests and limits. Recommend new requests at roughly p95 and limits with headroom over the max. Flag anything where the current request is more than 3x the p95 — that’s wasted reservation.
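Before trusting the model’s flags, it helps to compute the same table columns yourself so there is something to compare against. The following is a minimal sketch, assuming the usage samples are already exported as plain numbers; the function names, the nearest-rank percentile method, and the sample values are illustrative and not taken from any particular tool.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(request, samples, waste_factor=3.0):
    """Build one table row and flag a request that exceeds waste_factor x p95."""
    p95 = percentile(samples, 95)
    return {
        "request": request,
        "p50": percentile(samples, 50),
        "p95": p95,
        "max": max(samples),
        "overprovisioned": request > waste_factor * p95,
    }

# Hypothetical memory samples in MiB for a container that requests 2048 MiB.
cache_worker = summarize(2048, [180, 190, 200, 210, 220, 225, 230, 235, 238, 240])
print(cache_worker)
```

If the model’s “wasted reservation” list disagrees with the `overprovisioned` column, that is a cue to re-check the export before blaming either side.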

Let the model do the arithmetic, you keep the judgment

The model is great at the mechanical part: spotting that cache-worker requests 2Gi but never exceeds 240Mi, or that api requests 100m CPU but sits pinned at its limit during business hours. A typical recommendation comes back as a diff:

# api container
resources:
  requests:
    cpu: "250m"      # was 100m, throttled at peak
    memory: "512Mi"  # was 512Mi, fine
  limits:
    cpu: "1"         # headroom for bursts
    memory: "768Mi"  # p95 was 480Mi, +60% headroom

What the model does not know is the shape of your traffic. A batch job that runs briefly once a day can look idle at the 7-day p50 and even at p95, and the model will happily recommend slashing it, right up until the daily run gets OOMKilled. So I always prompt for the assumptions:

For each recommendation, state what traffic pattern you assumed. Call out any workload where a weekly average might hide a spike.

That forces the model to reveal where its reasoning is thin, which is exactly where I apply human judgment.
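The batch-job trap is easy to reproduce with made-up numbers. This sketch builds a synthetic week of 5-minute samples with a one-hour daily spike and shows that a nearest-rank p95 lands on the idle value. The series, the helper names, and the “max more than 2x p95” heuristic are all assumptions for illustration.

```python
import math

SAMPLES_PER_DAY = 24 * 12  # one sample every 5 minutes

def percentile(samples, q):
    """Nearest-rank percentile over a list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def synthetic_week(idle_mib, peak_mib, spike_samples_per_day):
    """Made-up series: one contiguous spike per day, idle the rest of the time."""
    day = [peak_mib] * spike_samples_per_day
    day += [idle_mib] * (SAMPLES_PER_DAY - spike_samples_per_day)
    return day * 7

def hides_spike(samples, ratio=2.0):
    """Assumed heuristic: the max sits far above p95, so p95 alone is misleading."""
    return max(samples) > ratio * percentile(samples, 95)

# A job that is busy for 12 samples (one hour) per day.
week = synthetic_week(idle_mib=150, peak_mib=1800, spike_samples_per_day=12)
print(percentile(week, 95), max(week), hides_spike(week))
```

The spike covers 84 of 2016 samples, a little over 4% of the week, so it sits entirely above the 95th percentile. Any workload that trips a check like this deserves a closer look at the max before its numbers are cut.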

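Pro Tip: Don’t let the model set a memory limit equal to the memory request unless you mean it. That leaves no room above the request, and it only makes sense when you are deliberately building a Guaranteed QoS pod, which also needs the CPU request to equal the CPU limit. And never let it set a memory limit below the observed max. That’s not “saving money,” that’s scheduling an OOMKill. Ask it explicitly to keep the memory limit above peak observed usage.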
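Those rules are mechanical enough to check in a few lines before a recommendation gets any further. A minimal sketch, assuming all values are already in MiB; the function name and the 20% minimum headroom are assumptions, not a standard.

```python
def check_memory_recommendation(request_mib, limit_mib, observed_max_mib, min_headroom=0.2):
    """Return a list of problems with a proposed memory request/limit; empty means it passes."""
    problems = []
    if limit_mib < request_mib:
        problems.append("limit is below request (the API server rejects this)")
    if limit_mib <= observed_max_mib:
        problems.append("limit does not exceed observed max: expect OOMKills")
    elif limit_mib < observed_max_mib * (1 + min_headroom):
        problems.append("limit has less than the minimum headroom over observed max")
    if limit_mib == request_mib:
        problems.append("limit equals request: confirm Guaranteed QoS is intended")
    return problems

# Hypothetical observed max of 600 MiB against the 512Mi/768Mi proposal for api.
print(check_memory_recommendation(512, 768, 600))
# A proposal that violates both rules from the tip.
print(check_memory_recommendation(256, 256, 300))
```

An empty list only means the numbers are internally consistent; it says nothing about whether the observation window caught the real peak.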

CPU limits are a trap worth discussing

CPU limits cause throttling, and throttling causes latency that looks like a bug. Many shops deliberately set CPU requests but no CPU limit so pods can burst into spare capacity. The model has read both schools of thought, so I make it argue the trade-off for my specific case:

This is a latency-sensitive HTTP service. Should it have a CPU limit at all? Give me the case for and against, given these usage numbers.

I don’t want a confident one-liner here; I want the trade-off laid out so I can decide. The AI is a research assistant, not the decision-maker.
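It helps to bring a throttling number to that argument. cAdvisor exposes the counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total, and the ratio of their increases over a window is the share of scheduler periods in which the container hit its CPU limit. A small sketch of that arithmetic, with hypothetical counter readings and a dict layout invented for the example:

```python
def throttle_ratio(before, after):
    """Share of CFS periods that were throttled between two counter samples."""
    periods = after["periods"] - before["periods"]
    throttled = after["throttled"] - before["throttled"]
    if periods <= 0:
        # No elapsed periods, or the counters reset: nothing meaningful to report.
        return 0.0
    return throttled / periods

# Hypothetical counter readings taken an hour apart.
before = {"periods": 120_000, "throttled": 3_000}
after = {"periods": 156_000, "throttled": 12_000}
ratio = throttle_ratio(before, after)
print(f"throttled in {ratio:.0%} of periods")
```

What ratio counts as too high is a judgment call for the service in question, which is exactly the part to argue through with the model rather than hard-code.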

Verify against the scheduler’s reality

A recommendation that doesn’t fit the cluster is useless. After the model proposes new requests, I check that the totals still schedule:

kubectl describe nodes | grep -A5 "Allocated resources"

If summing the new requests blows past node allocatable, pods will go Pending. I hand the node capacity back to the model and ask it to sanity-check that its own recommendations fit — it’s good at catching its own arithmetic when you give it the constraint.
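The same sum is worth doing locally rather than relying only on the model’s self-check. The sketch below parses the common Kubernetes quantity suffixes and compares total requests with total allocatable. The workload and node figures are hypothetical, the parser covers only the suffixes shown, and a passing total is necessary but not sufficient, because the scheduler places whole pods on individual nodes.

```python
from decimal import Decimal

# Binary suffixes come first so "Mi" is matched before "M".
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}

def parse_cpu(quantity):
    """CPU quantity to millicores: handles '250m' and core counts like '1' or '0.5'."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)

def parse_memory(quantity):
    """Memory quantity to bytes, for the suffixes listed in _SUFFIXES or a bare number."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))

def fits(workloads, nodes):
    """Compare summed requests (times replicas) with summed node allocatable."""
    cpu_req = sum(parse_cpu(w["cpu"]) * w.get("replicas", 1) for w in workloads)
    mem_req = sum(parse_memory(w["memory"]) * w.get("replicas", 1) for w in workloads)
    cpu_cap = sum(parse_cpu(n["cpu"]) for n in nodes)
    mem_cap = sum(parse_memory(n["memory"]) for n in nodes)
    return {
        "cpu_ok": cpu_req <= cpu_cap,
        "memory_ok": mem_req <= mem_cap,
        "cpu_millicores": (cpu_req, cpu_cap),
        "memory_bytes": (mem_req, mem_cap),
    }

# Hypothetical proposal and node allocatable values.
workloads = [
    {"name": "api", "cpu": "250m", "memory": "512Mi", "replicas": 6},
    {"name": "cache-worker", "cpu": "100m", "memory": "256Mi", "replicas": 4},
]
nodes = [{"cpu": "1900m", "memory": "7Gi"}, {"cpu": "1900m", "memory": "7Gi"}]
result = fits(workloads, nodes)
print(result)
```

The allocatable figures should come from each node’s status, not from its raw capacity, since system reservations are already subtracted there.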

Stage it, watch it, then trust it

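Here’s the human-in-the-loop part. The model produces a patch; I apply it to one workload in staging or to a canary, then watch for the two failure signals over a real cycle. A container OOMKill does not always leave an event behind, so I also read the last termination reason from pod status:

kubectl get events -n prod | grep -iE 'oomkill|evicted'
kubectl get pods -n prod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled
kubectl top pods -n prod --containers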

Only after a workload survives a full traffic cycle without OOMKills or throttling-driven latency do I roll the change wider. The AI never applies these patches itself and never gets cluster credentials — it reads metrics tables and emits YAML diffs, and a human pushes each change behind a canary. A wrong number from a chat window is free; a wrong number applied fleet-wide is an incident.

If you’d rather automate the right-sizing entirely, the Vertical Pod Autoscaler does this in-cluster, and the guide “Right-sizing pods: requests, limits, and autoscaling” covers the fundamentals the AI is reasoning over.

QoS class is the consequence nobody mentions

Resource requests and limits don’t just affect scheduling. They also decide the pod’s Quality of Service class, which decides what gets killed first when a node runs out of memory. This is the part teams discover during an incident instead of during tuning. There are three classes: Guaranteed (every container has CPU and memory limits, with requests equal to them), Burstable (at least one request or limit set, but short of Guaranteed), and BestEffort (nothing set). Under memory pressure, the kubelet first evicts pods whose usage exceeds their requests, which in practice means BestEffort pods, then Burstable pods that are over their request, with Guaranteed pods going last.

When the model proposes new numbers, I make it tell me the resulting QoS class and whether that’s what I want:

For each recommendation, state the resulting QoS class. Flag any latency-critical service that would end up BestEffort or low-priority Burstable — those should be Guaranteed or at least have requests that keep them safe under pressure.

To confirm the class a running pod actually ended up with:

kubectl get pod payments-7d9f -n prod -o jsonpath='{.status.qosClass}'

A “cost optimization” that strips requests off your most important service quietly demotes it to first-to-die. The model knows the QoS rules cold, so asking it to surface the class turns an invisible side effect into a deliberate choice.
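The class the model reports can also be double-checked offline from the YAML it emits. This is a simplified sketch of the classification rules, assuming each container is a dict with optional requests and limits. It compares quantities as written, so "1" and "1000m" would count as different; the function name and input shape are made up for the example.

```python
def qos_class(containers):
    """Classify a pod from its containers' requests/limits (CPU and memory only)."""
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # Kubernetes defaults a missing request to the limit when a limit is set.
        requests = {**limits, **container.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

# The api recommendation from earlier: limits above requests.
api = [{"requests": {"cpu": "250m", "memory": "512Mi"},
        "limits": {"cpu": "1", "memory": "768Mi"}}]
# Memory pinned but no CPU limit: still not Guaranteed.
no_cpu_limit = [{"requests": {"cpu": "250m", "memory": "512Mi"},
                 "limits": {"memory": "512Mi"}}]
print(qos_class(api), qos_class(no_cpu_limit), qos_class([{}]))
```

The value in .status.qosClass on the live pod remains the authority; a helper like this only catches surprises before the patch is applied.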

Don’t tune in a vacuum — account for autoscaling

If the workload sits behind a Horizontal Pod Autoscaler, the request you set isn’t just a reservation — it’s the denominator for the HPA’s CPU-target math. The HPA scales on utilization relative to the request, so halving a CPU request to “save money” makes every pod look twice as busy and the HPA scales out aggressively, often costing more than you saved. I always tell the model about the autoscaler:

This deployment has an HPA targeting 70% CPU. If I change the CPU request, how does that change the autoscaler’s behavior? Recommend a request that keeps the HPA’s scaling sane at current traffic.
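The model’s answer can be checked against the HPA’s documented scaling rule, desiredReplicas = ceil(currentReplicas × currentMetric / desiredMetric), where CPU utilization is usage divided by request. A minimal sketch with made-up traffic numbers; the function and parameter names are mine, and the 0.1 tolerance mirrors the controller’s default, which a cluster may override.

```python
import math

def desired_replicas(current_replicas, avg_usage_m, request_m, target_percent, tolerance=0.1):
    """Replica count the HPA would ask for, given average per-pod CPU usage in millicores."""
    utilization = 100 * avg_usage_m / request_m  # percent of the request in use
    ratio = utilization / target_percent
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)

# Hypothetical: 10 replicas, each averaging 150m of CPU, HPA target 70%.
before = desired_replicas(10, avg_usage_m=150, request_m=250, target_percent=70)
after = desired_replicas(10, avg_usage_m=150, request_m=125, target_percent=70)
print(before, after)
```

With these numbers the HPA asks for 9 replicas at a 250m request and 18 once the request is halved. Reserved CPU comes out the same either way (9 × 250m equals 18 × 125m), but every extra replica still carries its own memory request and per-pod overhead, which is where the saving can turn into a cost.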

This is a genuine systems interaction the model reasons through well, and it’s the kind of thing that turns a “simple” request tweak into a surprise scale-up. Tuning requests, limits, QoS, and the HPA target together — rather than one at a time — is what separates real right-sizing from a number that looks good in a spreadsheet.

Conclusion

Resource requests are guessed because doing it properly means correlating usage data nobody has time to correlate. AI removes that friction: it turns a week of metrics into concrete, headroom-aware recommendations and explains its assumptions when you ask. But it can’t see your traffic shape and can’t be trusted to keep every memory limit above peak, so you verify every number, check it against node capacity, and roll it out behind a canary. That’s how you cut the bill and the latency at the same time without scheduling an outage.

The monitoring alerts dashboard ties usage trends to the alerts that fire when you get it wrong, and the wider Kubernetes and Helm guides cover the autoscaling pieces that complement manual tuning.