13 min read · May 8, 2026

Kubernetes VPA: Right-Sizing Containers Without Manual Tuning

Kubernetes VPA (Vertical Pod Autoscaler) automatically adjusts container resource requests based on actual usage. But using it wrong breaks HPA, causes unnecessary pod restarts, and gives you false confidence in your resource sizing. Here is how to use it safely in production.

Ajeet Yadav
Platform & Cloud Engineer

Most teams set resource requests once — usually by guessing, copying from another team's manifest, or inheriting defaults from an older Helm chart — and never revisit them. The result is predictable: either the pod is over-provisioned (500m CPU request for a service that peaks at 80m), which wastes node capacity and inflates your cloud bill, or it is under-provisioned (64Mi memory for a service that needs 300Mi at load), which produces OOMKilled pods, CPU throttling, and on-call alerts at 3am.

Vertical Pod Autoscaler (VPA) is Kubernetes' answer to this problem. It watches actual container usage over time, computes statistically-derived recommendations, and optionally applies those recommendations by updating resource requests automatically. It eliminates the guesswork for stateless workloads where the right numbers are genuinely hard to know in advance.

But VPA has sharp edges. Used without understanding its update modes, it will evict pods at inconvenient times. Combined naively with HPA, it creates an autoscaling conflict that destabilizes both. And in Off mode — the only truly safe starting point — it gives you numbers you still have to apply manually. Getting value from VPA requires understanding those trade-offs before enabling anything beyond a read-only advisory role.


VPA Components

VPA is not a single controller. It is three separate components that must all be running for full functionality.

The Recommender watches pod resource usage via the metrics API (backed by metrics-server or a compatible provider like Prometheus Adapter) and computes recommendations using a percentile histogram model. It stores recommendations in the VerticalPodAutoscaler status. This component runs continuously and is stateless — it rebuilds its model from metrics history on restart. VPA requires metrics-server to be installed in the cluster; the Recommender will silently produce no recommendations if metrics are unavailable.
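
A quick way to confirm the Recommender has data to work with is to query the metrics API directly (this assumes metrics-server is already serving the namespace you care about):

bash
# Returns per-pod CPU and memory usage when the metrics API is healthy.
# An error like "Metrics API not available" means the Recommender has nothing to model.
kubectl top pods -n production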

The Admission Controller is a mutating webhook that intercepts pod creation requests. When a pod is created that matches a VPA object, the Admission Controller mutates the pod spec to apply the current recommendation before the pod is scheduled. This is how VPA sets the initial resource requests — not by patching the Deployment, but by modifying the Pod at admission time.

The Updater monitors running pods and compares their current resource requests against the Recommender's latest recommendations. When the recommendation differs significantly from the pod's current requests, the Updater evicts the pod so that the Admission Controller can apply the updated recommendation on the replacement pod. Eviction is throttled by the Updater's flags, such as --eviction-tolerance (default 0.5), which caps the fraction of a workload's replicas that can be evicted at once.

If the Admission Controller is absent, VPA can still generate recommendations (visible in Off and Initial modes), but it cannot apply them automatically. The Recommender and Updater operate independently. For a read-only advisory workflow, you only strictly need the Recommender running.


Installing VPA

The canonical installation method is cloning the kubernetes/autoscaler repository and running its install script. In practice, the Helm chart from cowboysysop is more maintainable for production clusters because it supports values overrides, image pinning, and standard Helm lifecycle management.

bash
helm repo add cowboysysop https://cowboysysop.github.io/charts/
helm install vpa cowboysysop/vertical-pod-autoscaler \
  --namespace kube-system \
  --set admissionController.replicaCount=2

Run two replicas of the Admission Controller. It is a mutating webhook: if the single replica goes down during a pod creation event, pods are either admitted without VPA's mutation or rejected outright, depending on how the webhook's failurePolicy is configured. Two replicas give you a redundant path during rollouts and node disruptions.
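
You can check how the webhook is registered in your cluster. The configuration name vpa-webhook-config is what the upstream Admission Controller registers by default; verify it if your install differs:

bash
kubectl get mutatingwebhookconfiguration vpa-webhook-config \
  -o jsonpath='{.webhooks[0].failurePolicy}'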

Verify the three components are running:

bash
kubectl get pods -n kube-system | grep vpa

Expected output shows three running pods: vpa-admission-controller-*, vpa-recommender-*, and vpa-updater-*. If the Admission Controller is missing, VPA will still produce recommendations but will not mutate pods on creation.

VPA requires metrics-server to already be installed. On EKS, metrics-server is not installed by default; add it via the metrics-server Helm chart or the EKS add-on before deploying VPA.
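
As a sketch, metrics-server can be installed from its upstream Helm chart (the EKS add-on is an equivalent alternative):

bash
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server \
  --namespace kube-system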


VPA Modes

The updateMode field is the most consequential configuration decision in a VPA object. It controls whether VPA passively generates recommendations or actively mutates pods.

Off

In Off mode, the Recommender generates recommendations and stores them in the VPA status, but nothing is applied. The Admission Controller does not mutate pods. The Updater does not evict anything. VPA is purely advisory.

This is the correct starting mode for any workload. Run it for at least a few days — preferably through a full traffic cycle that covers peak and off-peak load — before considering any active mode. Read the recommendations with:

bash
kubectl describe vpa payments-api-vpa -n production

Initial

In Initial mode, the Admission Controller applies the current recommendation when a pod is created (or recreated), but the Updater does not evict running pods. A pod that was created before VPA was set to Initial mode continues running with its original requests until it is recreated naturally — by a Deployment rollout, a node eviction, or manual deletion.

This is the safest active mode. Resource recommendations are applied opportunistically during the normal pod lifecycle without VPA forcing any disruption. Use this mode for most stateless workloads after reviewing Off-mode recommendations.
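
Because Initial mode only takes effect when a pod is recreated, you can trigger recreation on your own schedule instead of waiting for the next deploy, for example with a controlled rollout restart:

bash
# Recreates pods through the normal rolling-update path; the Admission Controller
# applies the current VPA recommendation to each replacement pod.
kubectl rollout restart deployment/payments-api -n production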

Recreate

In Recreate mode, the Updater actively evicts running pods when the current resource recommendations differ significantly from the pod's current requests. The evicted pod is replaced by the Deployment controller, and the Admission Controller applies the new recommendation to the replacement pod.

This causes pod restarts outside of normal Deployment rollouts. For a 2-replica Deployment protected by a PodDisruptionBudget with minAvailable: 1, a VPA eviction can briefly leave a single replica serving traffic. Use Recreate mode only for long-running workloads where resource drift is significant, the service tolerates restarts, and enough replicas exist to absorb an eviction.
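
The Updater evicts through the eviction API, so a PodDisruptionBudget bounds how much disruption it can cause. A minimal sketch, assuming the Deployment's pods carry an app: payments-api label:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: production
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payments-api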

Auto

Auto mode is currently identical to Recreate for most clusters. The intent is to use in-place pod resource resize (KEP-1287) when it becomes stable, which would allow VPA to resize containers without eviction. As of Kubernetes 1.30, KEP-1287 is still in alpha and is not suitable for production use. Do not choose Auto expecting different behavior from Recreate today.


A Complete VPA Resource

Here is a VPA object for a payments API service. Start in Off mode, review the recommendations after a week of traffic, and then decide whether to move to Initial.

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
      - containerName: "payments-api"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 2
          memory: 4Gi
        controlledResources: ["cpu", "memory"]

The minAllowed and maxAllowed bounds are guardrails. VPA will never recommend below minAllowed (preventing under-provisioning) or above maxAllowed (preventing recommendations that exceed node capacity). Set maxAllowed below the allocatable resources of your largest node — if VPA recommends 12Gi memory but your nodes only offer 8Gi allocatable, the pod goes Pending and never starts.
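
To pick a sensible maxAllowed, check what your nodes actually report as allocatable; one way to read it:

bash
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory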

The controlledResources field scopes what VPA manages. Omitting it defaults to both CPU and memory. Setting it to ["memory"] only would let VPA tune memory while leaving CPU requests unchanged — useful when you are confident in your CPU sizing but uncertain about memory.
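
For instance, a policy that hands memory to VPA while leaving CPU requests untouched would look roughly like this:

yaml
resourcePolicy:
  containerPolicies:
    - containerName: "payments-api"
      controlledResources: ["memory"]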


Reading VPA Recommendations

After running in Off mode for a few days, read the recommendation:

bash
kubectl describe vpa payments-api-vpa -n production

The relevant output looks like this:

Recommendation:
  Container Recommendations:
    Container Name: payments-api
    Lower Bound:    cpu: 100m, memory: 128Mi
    Target:         cpu: 250m, memory: 512Mi
    Uncapped Target: cpu: 250m, memory: 512Mi
    Upper Bound:    cpu: 1, memory: 2Gi

Target is VPA's recommended value for resources.requests. Apply this number. Lower Bound and Upper Bound represent the confidence interval — VPA is saying the true steady-state request is likely somewhere between these values, with Target as the point estimate. When Target and Uncapped Target differ, it means VPA wanted to recommend something outside your minAllowed/maxAllowed bounds; the capping is visible here.

For a workload in Off mode, apply the Target values manually by updating the Deployment's resources.requests. For a workload in Initial mode, VPA applies Target automatically on next pod creation. Do not blindly apply Upper Bound — it leads to the same over-provisioning problem you were trying to fix.
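
If you want the Target programmatically, for a script or a dashboard, it is exposed in the VPA status (path shown for a single-container pod):

bash
kubectl get vpa payments-api-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'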


VPA + HPA Compatibility: The Critical Constraint

VPA and HPA cannot both manage the same resource dimension simultaneously. Running both on CPU without understanding this causes an autoscaling feedback loop.

Here is the failure mode in detail. Suppose HPA is scaling on CPU utilization (target: 50%) and VPA is in Recreate mode managing CPU requests. VPA increases the CPU request from 200m to 400m. The CPU utilization percentage drops — not because actual CPU usage changed, but because the denominator (the request) doubled. HPA sees low utilization and scales down replicas. With fewer replicas, each pod handles more traffic, actual CPU usage rises, HPA scales back up. VPA observes higher CPU usage per pod and raises the request again. The cycle continues, with HPA and VPA perpetually chasing each other.

There are three safe combination patterns:

HPA on CPU/memory + VPA in Off mode. VPA acts as an advisor only. Review its recommendations periodically and apply them manually to the Deployment. HPA controls replica count without interference. This is the lowest-risk approach and works for teams that want observability before automation.

HPA on custom metrics + VPA in Auto/Recreate mode. If HPA scales on a metric that VPA cannot influence — queue depth, RPS via KEDA, or a Prometheus metric representing business-level load — there is no feedback loop. VPA owns resource sizing; HPA owns replica count based on a metric with no denominator dependency on resource requests. This pattern works well in practice and is the recommended path for teams that want full automation.

VPA only, no HPA. For stateful workloads, single-replica services, or batch jobs where horizontal scaling is not appropriate, VPA alone is the right choice. The Updater evicts pods when recommendations drift significantly, and the Admission Controller applies updated requests on recreation.
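
As a sketch of the second pattern, here is an HPA scaling on an external metric. The metric name is hypothetical and assumes an external metrics provider such as Prometheus Adapter or KEDA is already serving it:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: payments_queue_depth  # hypothetical metric, served by your metrics adapter
        target:
          type: AverageValue
          averageValue: "100"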

Never run HPA on CPU or memory and VPA on those same resources in Initial, Recreate, or Auto mode simultaneously. The coupling will produce unpredictable scaling behavior that is difficult to diagnose.


Goldilocks: Cluster-Wide Right-Sizing Without Risk

Goldilocks, maintained by Fairwinds, is the safest way to get VPA recommendations across an entire cluster. It creates VPA objects in Off mode for every Deployment in labeled namespaces and exposes a dashboard showing current requests, VPA recommendations, and actual usage side by side. You get the observability of VPA without enabling any mutations.

Install it:

bash
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks \
  --create-namespace

Label the namespaces where you want recommendations generated:

bash
kubectl label namespace production goldilocks.fairwinds.com/enabled=true
kubectl label namespace staging goldilocks.fairwinds.com/enabled=true

Goldilocks creates a VPA object in Off mode for each Deployment in those namespaces and polls their recommendations. Access the dashboard via kubectl port-forward or expose it through an Ingress.
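
For example, to reach the dashboard locally (the service name and port below are what the Fairwinds chart creates by default; verify with kubectl get svc -n goldilocks if yours differ):

bash
kubectl port-forward -n goldilocks svc/goldilocks-dashboard 8080:80
# Then open http://localhost:8080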

The dashboard shows three values per container: Current (what the Deployment manifest says), Recommended (VPA Target), and Actual (observed p50/p95 usage). For workloads where Current and Recommended differ by more than 50%, update the Deployment manually first. Do not jump directly to enabling VPA mutations on workloads with large discrepancies; taking a pod from a 2Gi request down to 200Mi through eviction and recreation is a dramatic change to make unattended in production.

Use Goldilocks for a full week before setting any VPA object to Initial or higher. The recommendations improve as the histogram accumulates more data — the first few hours of data produce rough estimates; after eight days, the model is considered fully trained.


Production Adoption Pattern

Adopting VPA across a cluster works best as a phased rollout rather than a single cluster-wide change.

Week 1: Deploy Goldilocks and label all namespaces. Let VPA Recommenders run in Off mode and accumulate data. Do not touch any Deployment manifests yet — just observe.

Week 2: Review the Goldilocks dashboard. For every workload where the current CPU or memory request differs from VPA's Target by more than 50%, update the Deployment manifest manually. This is the safest form of right-sizing: human-reviewed, applied during a controlled rollout, with no VPA mutations involved.

Week 3 onward: For high-priority stateless workloads where you are confident in the data, switch the VPA object to Initial mode. Pods will receive correctly-sized requests on next recreation. Monitor pod startup behavior — if OOMKills appear after switching to Initial, the minAllowed bounds need adjustment.

Avoid Recreate or Auto for anything with fewer than two replicas. A single-replica service evicted by the VPA Updater goes from one running pod to zero while Kubernetes schedules and starts the replacement. That gap — typically 10–60 seconds — is an outage. If you need VPA's active update behavior for a critical single-replica service, schedule maintenance windows around it.
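
A quick way to find the workloads this warning applies to, i.e. Deployments running fewer than two replicas:

bash
kubectl get deploy --all-namespaces \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,REPLICAS:.spec.replicas \
  --no-headers | awk '$3 < 2'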


Controlling Which Containers VPA Manages

VPA manages containers individually. For pods with multiple containers — the main application plus a sidecar — it is almost always wrong to let VPA tune every container indiscriminately.

yaml
resourcePolicy:
  containerPolicies:
    - containerName: "payments-api"
      mode: Auto
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi
    - containerName: "fluentbit-sidecar"
      mode: "Off"

Support for per-container mode in containerPolicies was introduced in VPA v0.14. Earlier versions required all containers in the pod to share the same update behavior. If your cluster is running an older VPA version, upgrade before attempting per-container mode overrides.

Set sidecars to Off (get recommendations as a reference) or give them tight minAllowed/maxAllowed bounds that prevent VPA from adjusting them meaningfully. A Fluent Bit sidecar has well-known resource characteristics; letting VPA experiment with its requests introduces noise without benefit.


In-Place Resource Resize: KEP-1287

Kubernetes 1.27 introduced alpha support for in-place pod resource resize (KEP-1287). The intent is to allow VPA in Auto mode to resize containers without eviction: the kubelet updates the container's cgroup limits in place through the container runtime while the process keeps running. For the application, the resize is nearly invisible; there is no pod recreation, no scheduler involvement, no loss of pod IP.

As of Kubernetes 1.30, KEP-1287 remains in alpha and is gated behind the InPlacePodVerticalScaling feature gate. The implementation has known stability issues with certain container runtimes and is not enabled by default on any managed Kubernetes offering (EKS, GKE, AKS). Do not enable this feature gate in production clusters expecting VPA to use in-place resize. Design your VPA adoption around the eviction-based model — when KEP-1287 graduates to stable, the behavior of Auto mode will improve automatically.


Frequently Asked Questions

Does VPA work with Deployments, StatefulSets, and DaemonSets?

VPA works with Deployments and StatefulSets via the targetRef field. DaemonSets are technically supported as a target, but the Updater eviction behavior is problematic — DaemonSet pods are not replicated, so evicting one removes it from a node entirely until the DaemonSet controller reschedules it. For DaemonSets, use VPA in Off mode to get recommendations, then apply them manually to the DaemonSet manifest. Never use Recreate or Auto for a DaemonSet running a critical node-level agent like Datadog or Fluent Bit.

What happens if VPA recommends a value that exceeds the node's capacity?

The pod goes Pending. The scheduler cannot find a node with sufficient allocatable CPU or memory to satisfy the request, so the pod waits indefinitely (or until a new node is provisioned). This is why maxAllowed is not optional — always set it below the allocatable capacity of your largest node type. On EKS with m5.2xlarge nodes (8 vCPU, 32Gi RAM, ~28Gi allocatable after system overhead), a reasonable maxAllowed is cpu: 6 and memory: 24Gi, leaving headroom for DaemonSet overhead.

How long does VPA take to generate useful recommendations?

Initial recommendations appear within minutes to hours depending on traffic volume. The histogram model is considered fully trained at eight days of data — it has seen enough variance in request patterns to produce stable percentile estimates. Recommendations generated in the first 24 hours should be treated as rough estimates and reviewed before applying. After a week of production traffic, the Target value is reliable.

Can VPA be combined with Karpenter for node right-sizing?

Yes, and the combination works well. When VPA increases a pod's CPU or memory request and the current nodes have insufficient allocatable capacity, the pod becomes unschedulable. Karpenter detects the pending pod, evaluates its resource requirements, and provisions a node of the appropriate instance type. The net effect is that VPA right-sizes the pod, and Karpenter right-sizes the node — both adjust together rather than you having to manage either manually. Ensure the Karpenter NodePool includes instance types that can satisfy the maxAllowed values in your VPA policies.

What about VPA and Guaranteed QoS class?

Guaranteed QoS requires requests == limits for all resources. By default, VPA's controlledValues is RequestsAndLimits: it updates requests and scales limits proportionally, so a container whose requests and limits started out equal stays equal and keeps Guaranteed QoS. If you set controlledValues: RequestsOnly, VPA updates only requests; if your Deployment also sets limits, they diverge from the updated requests and the pod drops from Guaranteed to Burstable. For pods where Guaranteed QoS is a hard requirement (latency-sensitive, colocated with noisy neighbors), leave controlledValues at its default or set it explicitly to RequestsAndLimits.
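
In a containerPolicies entry this is spelled out explicitly (a sketch; the field sits alongside minAllowed and maxAllowed):

yaml
resourcePolicy:
  containerPolicies:
    - containerName: "payments-api"
      controlledValues: RequestsAndLimits  # the default; keeps limits proportional to requests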


For how resource requests affect scheduling decisions, QoS class assignment, and the exact mechanics of OOMKill vs. CPU throttling, read Kubernetes Resource Management: Requests, Limits, QoS, and LimitRanges. That post covers the foundation that makes VPA recommendations meaningful.

For HPA v2 behavior — stabilization windows, scale-down delays, and how to configure custom metrics scaling that is safe to combine with VPA — see Kubernetes HPA and VPA: Horizontal and Vertical Pod Autoscaling, which covers the full HPA v2 API and the safe combination patterns in more depth.

For event-driven scaling with scale-to-zero using KEDA — which provides the custom metrics that allow HPA and VPA to coexist without conflict — see KEDA: Event-Driven Autoscaling for Kubernetes.

For node-level autoscaling that responds to the changed pod resource requirements VPA generates — including how Karpenter selects instance types based on pending pod requests — see the Kubernetes Cluster Autoscaler and Karpenter post.

For the FinOps view of right-sizing — how VPA adoption affects your cluster cost profile, what the typical savings look like at scale, and how to measure the impact — see Kubernetes Cost Optimization: A FinOps Guide.


Running VPA in production and unsure if your recommendations are safe to apply? Talk to us at Coding Protocols — we help platform teams adopt VPA without HPA conflicts or unnecessary pod restarts.

Related Topics

Kubernetes
VPA
Autoscaling
Resource Management
Platform Engineering
FinOps
