Right-Sizing Pods: Resource Requests, Limits, and Autoscaling That Works
Bad requests and limits cause both OOMKills and wasted spend. Here's how to set them correctly and wire up HPA and VPA, with AI to reason about real usage data.
- #kubernetes
- #autoscaling
- #hpa
- #vpa
- #resources
- #ai
Most cluster waste and a good chunk of cluster instability come from the same place: resource requests and limits that someone guessed once and never revisited. Set them too low and pods get OOMKilled and evicted. Set them too high and you’re paying for nodes full of reserved-but-idle capacity. Get them right and autoscaling actually works.
Here’s how I think about requests, limits, and the HPA/VPA pairing — and how AI helps turn usage data into numbers.
Requests and limits are two different jobs
This is the foundational confusion, so let’s nail it:
- Requests are what the scheduler uses to place the pod. A pod requesting 500m CPU will only land on a node with 500m free. Requests reserve capacity.
- Limits are what the kubelet enforces at runtime. Exceed a CPU limit and you get throttled. Exceed a memory limit and you get OOMKilled: hard, no grace.
They are not the same number and shouldn’t be set as if they were.
The rules I actually use
After years of tuning these, my defaults:
Memory: set request = limit. Memory isn’t compressible: there’s no “throttle,” only kill. Setting request equal to limit means the scheduler reserves exactly what the pod is allowed to use, so the pod never depends on memory the node hasn’t set aside for it. That avoids the nasty surprise of a pod that scheduled fine but gets evicted or OOMKilled under node memory pressure because it was running above its request.
CPU: set a request, be careful with limits. CPU is compressible — over-using just throttles. Always set a CPU request (the scheduler needs it). Be cautious with CPU limits: a tight limit throttles latency-sensitive services even when the node has spare CPU. Many shops set CPU requests and skip CPU limits deliberately.
resources:
  requests:
    cpu: 250m
    memory: 256Mi
  limits:
    memory: 256Mi   # = request: never OOM-surprised
    # no cpu limit: let it burst when the node is idle
Base the numbers on data, not vibes
The right request is roughly your steady-state usage plus headroom. Get the real numbers:
kubectl top pods -n payments
# or for history, query Prometheus:
# container_memory_working_set_bytes, rate(container_cpu_usage_seconds_total[5m])
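Look at the p95 over a representative window, not the instantaneous value. A pod that idles at 80Mi but spikes to 220Mi during cache warm-up needs a request near 220Mi, plus headroom. Size it for the idle value and, with request = limit, it gets OOMKilled during warm-up; with a higher limit, it runs above its request and becomes an early eviction candidate when the node runs short on memory.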
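If the history lives in Prometheus, the percentiles can be computed there rather than by eye. The rule file below is a sketch, not a drop-in. The group and record names are made up for illustration, the payments namespace is borrowed from the kubectl example above, and the 7d window assumes your Prometheus retains at least a week of data. It relies on the cAdvisor metrics named above and PromQL's quantile_over_time.

```yaml
# Prometheus recording rules for right-sizing inputs.
# Group and record names are hypothetical; adjust namespace and window.
groups:
  - name: rightsizing-sketch
    interval: 5m
    rules:
      # p95 of working-set memory per container over the last 7 days.
      - record: rightsizing:container_memory_working_set_bytes:p95_7d
        expr: |
          quantile_over_time(
            0.95,
            container_memory_working_set_bytes{namespace="payments", container!=""}[7d]
          )
      # p90 of CPU usage (in cores) per container over the last 7 days.
      # The subquery samples the 5m rate every 5 minutes across the window.
      - record: rightsizing:container_cpu_usage_cores:p90_7d
        expr: |
          quantile_over_time(
            0.90,
            rate(container_cpu_usage_seconds_total{namespace="payments", container!=""}[5m])[7d:5m]
          )
```

Each expr also works as an ad hoc query if you'd rather not record it. For memory, it is worth comparing the p95 against max_over_time on the same window, since a short spike above p95 is exactly what an OOMKill looks like.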
Where AI helps
Turning a week of usage metrics into sane requests is arithmetic plus judgment about headroom — a great AI task. Paste your kubectl top output or a Prometheus export and ask:
“Here’s a week of CPU and memory usage for these pods. Recommend requests and limits using p95 plus 20% headroom for memory and p90 for CPU requests. Flag any pod where current limits are below observed peak.”
That last clause catches the OOMKill waiting to happen. Keep a few of these Kubernetes resource-tuning prompts around. The model does the math; you sanity-check the headroom.
HPA: scale out on load
The Horizontal Pod Autoscaler adds and removes replicas based on a metric. The most common setup:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
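The catch that’s easy to miss: HPA’s CPU target is a percentage of the request, not of the node or the limit. If your request is wrong, your HPA scales at the wrong time. A pod using 200m against a 100m request reports 200% utilization, so a request set too low makes the HPA scale out aggressively for no reason, while a request set too high keeps utilization low and delays scale-out. Requests and HPA are coupled, so tune the request first.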
For traffic-driven services, scale on a custom metric like requests-per-second rather than CPU; it’s a much truer signal of load.
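As a rough sketch of what that looks like, here is the same HPA scaling on a per-pod request rate instead of CPU. It assumes a custom metrics adapter (Prometheus Adapter, for example) is installed and serving the metric through the custom metrics API. The metric name http_requests_per_second and the target of 100 are placeholders, not values from a real service.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          # Hypothetical metric name; it must be exposed by your metrics adapter.
          name: http_requests_per_second
        target:
          type: AverageValue
          # Assumed target: about 100 requests/second per pod, averaged across pods.
          averageValue: "100"
```

Because the target is an absolute per-pod value rather than a percentage of the request, this HPA no longer shifts when someone retunes CPU requests. Pick the target from a load test of what one replica can actually serve.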
VPA: right-size automatically
The Vertical Pod Autoscaler adjusts requests and limits based on observed usage. Run it in recommendation mode first (updateMode: "Off"). It tells you what it would set without touching anything:
kubectl describe vpa api-vpa | grep -A10 "Recommendation"
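The command above presumes a VPA object named api-vpa already exists. A minimal sketch of one in recommendation-only mode follows. It assumes the VPA components and CRDs are installed in the cluster (they are an add-on, not part of core Kubernetes) and that the target is the same api Deployment used in the HPA example.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api          # assumed target; matches the HPA example's Deployment
  updatePolicy:
    # "Off" computes recommendations only: no evictions, no request changes.
    updateMode: "Off"
```

Give it some real traffic to observe before reading the recommendation; it is only as good as the usage history behind it.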
Critically: don’t run VPA in Auto mode on the same workload as an HPA scaling on CPU/memory. They fight — VPA changes the request, which moves HPA’s target, which changes replica count, which changes per-pod load. Use VPA for requests, HPA for replicas, and keep them on different signals.
Don’t forget the other levers
- PodDisruptionBudgets keep node scale-down and maintenance drains from evicting too many replicas at once. Set a minAvailable for anything that serves traffic.
- Cluster Autoscaler / Karpenter adds nodes when pods can’t schedule. HPA adds pods; if there’s nowhere to put them, you also need node autoscaling.
- LimitRange and ResourceQuota per namespace stop one team from requesting the whole cluster.
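For reference, the PDB and namespace guardrails might look roughly like this. Every name, label, and number here is illustrative: the app: api label, the payments namespace, and the quota sizes are assumptions to replace with your own.

```yaml
# Keep at least 2 api pods running through voluntary disruptions (drains, node scale-down).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api
  namespace: payments
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api            # assumed pod label
---
# Defaults for containers that ship without requests/limits, plus a per-container ceiling.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 256Mi
      default:
        memory: 256Mi     # default limit = default request, per the memory rule above
      max:
        memory: 2Gi
---
# Cap what the whole namespace can request.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payments-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.memory: 40Gi
```

With the HPA's minReplicas at 3, a minAvailable of 2 still lets a drain evict one pod at a time. Setting minAvailable equal to the replica count would block drains entirely.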
A right-sizing pass that pays for itself
The workflow I run quarterly:
- Pull a week of usage per workload.
- Set memory request = limit at p95 + headroom.
- Set CPU requests at p90; drop reflexive CPU limits on latency-sensitive services.
- Recheck HPA targets now that requests are correct.
- Run VPA in recommendation mode as a second opinion.
- Add PDBs for serving workloads.
Before the manifest changes ship, run them through the Code Review tool — it catches the dangerous pattern of a memory limit raised without raising the request, and HPAs pointed at a Deployment whose requests are zero.
Right-sizing isn’t a one-time project; usage drifts as features ship. But a quarterly pass driven by real data — with AI doing the arithmetic and you owning the headroom calls — turns resource management from guesswork into a routine that cuts spend and stops the 3am OOMKill pages.
AI sizing recommendations are assistive. Always validate against your own usage data and load-test before applying limits to production.