Right-Sizing Pods Automatically With the Vertical Pod Autoscaler

Almost every cluster I’ve audited has the same problem: resource requests that someone picked on day one and nobody has touched since. Half the pods are over-provisioned and burning money; the other half are under-requested and getting OOM-killed or throttled. The Vertical Pod Autoscaler (VPA) fixes this by watching actual usage and recommending — or applying — the right requests. Used well, it’s one of the highest-ROI tools in the cluster.

Here’s how I run it, including the part most people get wrong.

Vertical vs. horizontal

The Horizontal Pod Autoscaler changes how many pods you run. The VPA changes how big each pod is: it sets CPU and memory requests and, by default, scales limits proportionally so the original request-to-limit ratio is preserved. They solve different problems:

HPA: handle more load by adding replicas.
VPA: stop guessing at requests by sizing pods to real consumption.

They can conflict if you point both at the same resource dimension, which I’ll come back to.

Installing the VPA

The VPA isn’t bundled with Kubernetes; you install it as three components — recommender, updater, and admission controller — typically from the autoscaler repo or a Helm chart:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
kubectl get pods -n kube-system | grep vpa

The recommender needs metrics-server running to see usage. If you don’t have it, install it first; without it the recommender produces no recommendations.

Start in recommendation-only mode

This is the single most important piece of advice in this article: run VPA in Off mode first. It computes recommendations without touching your pods. You get the data with zero risk.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"

After it’s gathered a day or two of usage, read the recommendation:

kubectl describe vpa api-vpa -n web

You’ll see a Recommendation block with target, lowerBound, and upperBound for CPU and memory, plus uncappedTarget, which is the target before any minAllowed or maxAllowed bounds are applied. The target is what VPA would set requests to. For a lot of teams, just reading these numbers and updating their manifests by hand is the entire win; no auto-apply is needed.
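For orientation, the same data sits under status.recommendation when you fetch the object as YAML (kubectl get vpa api-vpa -n web -o yaml). The sketch below shows the shape only: the container name and every number are made-up illustrative values, not output from a real cluster.

```yaml
# Illustrative shape of a VPA status block; all values are hypothetical.
status:
  recommendation:
    containerRecommendations:
    - containerName: api      # assumed container name
      lowerBound:             # minimum recommended; running below it risks instability
        cpu: 120m
        memory: 300Mi
      target:                 # what VPA would set as the request
        cpu: 250m
        memory: 420Mi
      uncappedTarget:         # target before minAllowed/maxAllowed are applied
        cpu: 250m
        memory: 420Mi
      upperBound:             # maximum recommended; anything above is likely wasted
        cpu: 800m
        memory: 900Mi
```

If target and uncappedTarget differ, one of your bounds is clipping the recommendation, which is worth a look before you trust the number.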

The update modes

Off — recommend only. Safe. Start here.
Initial — set requests when a pod is first created, never after. Good middle ground.
Auto / Recreate — actively evict and recreate pods to apply new requests. Powerful, but it restarts your pods, which is the gotcha everyone hits.

That eviction behavior is why you don’t flip straight to Auto on a stateful or latency-sensitive workload. The VPA will happily kill a pod to right-size it. Pair Auto mode with a PodDisruptionBudget so it can’t take down all your replicas at once.
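The VPA updater removes pods through the eviction API, so a PodDisruptionBudget limits how many it can take down at once. A minimal sketch follows, assuming the api Deployment runs at least three replicas and labels its pods app: api; the PDB name, label, and replica count are assumptions, not details from the manifests above.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb          # hypothetical name
  namespace: web
spec:
  minAvailable: 2        # assumes 3+ replicas, so at least one pod can still be evicted
  selector:
    matchLabels:
      app: api           # assumed label; must match the Deployment's pod template labels
```

Setting minAvailable equal to the replica count would block every eviction, so VPA could never apply a change. Leave room for at least one pod to be disrupted.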

Setting bounds you can trust

Let VPA know the floor and ceiling so it can’t recommend something absurd:

  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
      controlledResources: ["cpu", "memory"]

maxAllowed is your guardrail against a memory-leaking container convincing VPA to request enormous amounts of RAM. Always set it.
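The bounds fragment above belongs under spec, alongside targetRef and updatePolicy. As a sketch, here is the earlier recommendation-only manifest with the bounds merged in. Nothing in it is new; the values are the same ones used in this article, and you should tune them to your workload.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"          # still recommendation-only
  resourcePolicy:
    containerPolicies:
    - containerName: '*'       # applies to every container without its own policy
      minAllowed:              # floor for recommendations
        cpu: 50m
        memory: 64Mi
      maxAllowed:              # ceiling for recommendations
        cpu: "2"
        memory: 2Gi
      controlledResources: ["cpu", "memory"]
```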

The HPA conflict you must avoid

Do not run VPA and HPA on the same metric. If HPA scales on CPU and VPA is also adjusting CPU requests, they create a feedback loop — VPA raises requests, which changes the CPU utilization percentage HPA reads, which changes replica count, which changes per-pod load. The supported pattern is:

HPA scales on CPU (or a custom metric).
VPA manages memory only (controlledResources: ["memory"]).

That keeps them in separate lanes. Mixing them on the same dimension is the most common way VPA setups go sideways.
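A sketch of the split, assuming the same api Deployment in the web namespace. The HPA name, replica range, and 70% utilization target are illustrative assumptions; pick numbers that fit your service.

```yaml
# HPA owns CPU: replica count follows average CPU utilization.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa                  # hypothetical name
  namespace: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3                 # assumed range
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # assumed target
---
# VPA owns memory only, so it never changes the CPU request the HPA's percentage is based on.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"            # change only after reviewing recommendations
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources: ["memory"]
      maxAllowed:
        memory: 2Gi
```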

Reading the recommendation honestly

A few interpretation notes from experience:

VPA’s recommendation is based on observed history. If you set it up during a quiet week, it’ll under-recommend for your real peak. Let it observe a full traffic cycle — including your busiest day — before trusting the numbers.
The upperBound accounts for usage spikes; the target is the steady-state pick. For requests, I use target; for limits, I look at upperBound.
VPA right-sizes requests, not concurrency. If your app is single-threaded, it can’t use more than one core, so a larger CPU request won’t make it faster. Measure, don’t assume.
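Applied by hand, the target-for-requests, upperBound-for-limits habit looks roughly like the fragment below. It is a sketch of the relevant part of a Deployment's pod template, not a complete manifest. The container name and all numbers are hypothetical stand-ins for whatever your own VPA reports.

```yaml
# Fragment of the api Deployment; values are hypothetical stand-ins for VPA output.
spec:
  template:
    spec:
      containers:
      - name: api              # assumed container name
        resources:
          requests:
            cpu: 250m          # taken from target.cpu
            memory: 420Mi      # taken from target.memory
          limits:
            cpu: 800m          # informed by upperBound.cpu
            memory: 900Mi      # informed by upperBound.memory
```

Whether to set a CPU limit at all is a separate judgment call. The memory limit is the one that decides whether a spike ends in an OOM kill.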

What this buys you

On real clusters I’ve seen VPA recommendations cut a namespace’s memory reservations by 30–40% simply because the original requests were copy-pasted guesses. That’s bin-packing headroom you get back — more pods per node, fewer nodes, lower bill. And the under-provisioned services stop getting OOM-killed because someone finally measured them.

Before you switch any VPA to an active update mode, have someone review the bounds and confirm a PDB is in place. Our AI code review catches missing maxAllowed ceilings and VPA/HPA overlap on the same resource.

VPA turns resource sizing from folklore into measurement. Pair it with the right-sizing fundamentals and you stop paying for headroom you never use.

Resource recommendations reflect observed history, not future peaks. Validate VPA output against your full traffic cycle before applying it in production.
