Kubernetes HPA and VPA: Horizontal and Vertical Pod Autoscaling
HPA scales the number of pod replicas based on CPU, memory, or custom metrics from Prometheus. VPA adjusts pod resource requests and limits based on observed usage. HPA handles demand spikes — more replicas for more traffic. VPA handles resource right-sizing — correct requests so pods land on appropriately-sized nodes and QoS classes are accurate. This covers HPA v2 with CPU and custom metrics, scale-to-zero with KEDA, VPA installation and update modes, and the critical constraint: HPA and VPA cannot both manage the same resource dimension simultaneously.

Kubernetes has two pod-level autoscalers:
HPA (Horizontal Pod Autoscaler) adds or removes pod replicas. When CPU usage rises, HPA scales from 3 replicas to 6. When it drops, it scales back down. This is the primary autoscaler for stateless services with variable traffic.
VPA (Vertical Pod Autoscaler) adjusts the resource requests and limits on existing pods. When a pod consistently uses only 80m CPU against a 500m request, VPA recommends (or automatically applies) a 100m request. This keeps resource efficiency high and QoS class assignments accurate.
The two autoscalers solve different problems. HPA handles demand elasticity; VPA handles resource right-sizing. They can coexist — but with a constraint: if HPA is managing CPU or memory, VPA must not also manage those same dimensions.
HPA v2: CPU and Memory Scaling
The HPA v2 API (autoscaling/v2) replaced v1 and supports multiple metrics simultaneously. The metrics server must be installed for CPU and memory metrics:
# Install metrics-server if not already present
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify
kubectl top pods -n payments

Basic HPA scaling on CPU:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api

  minReplicas: 3
  maxReplicas: 20

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # Target: 60% of the CPU request across all pods

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately (no stabilization)
      policies:
        - type: Percent
          value: 100                    # Can double the replica count per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 minutes before scaling down (prevents flapping)
      policies:
        - type: Pods
          value: 1                      # Remove at most 1 pod per minute (conservative scale-down)
          periodSeconds: 60

The HPA controller polls metrics every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period). It uses the formula:
desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))
For CPU utilization: if 3 pods are using 120% of their CPU request and the target is 60%, HPA computes ceil(3 × (120/60)) = ceil(6) = 6 replicas.
CPU utilization is measured as a percentage of the pod's CPU request, not of the node's CPU. A pod with no CPU request set has no meaningful utilization metric — HPA won't work correctly without resource requests set on the target pods.
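Since utilization is computed against the request, the target Deployment must declare CPU requests. A minimal sketch of the relevant container spec (the payments-api name and the values are illustrative, matching the examples in this article):

spec:
  template:
    spec:
      containers:
        - name: payments-api
          resources:
            requests:
              cpu: 250m      # with a 60% target, HPA scales up when average usage exceeds 150m
              memory: 256Mi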
Multi-Metric HPA: CPU + Memory + Custom
HPA v2 supports multiple metrics. It calculates the desired replica count for each metric independently and uses the maximum across all metrics:
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi   # Target: average 512Mi memory per pod (not a utilization %)

  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # Custom metric from the Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "1000"             # Target: 1000 req/s per pod

The maximum desired replica count across all three metrics determines the actual target. If CPU says 4 replicas, memory says 6 replicas, and request rate says 8 replicas, HPA scales to 8.
Prometheus Custom Metrics with the Prometheus Adapter
For HTTP request rate and other application-level metrics, install the Prometheus Adapter, which exposes Prometheus metrics to the Kubernetes custom metrics API:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc:9090

Configure a metric rule that maps a PromQL query to a Kubernetes custom metric:
# prometheus-adapter values
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

This exposes http_requests_per_second as a pods-scoped custom metric. Verify:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/payments/pods/*/http_requests_per_second"

HPA Behavior: Preventing Flapping
The default HPA behavior scales up aggressively and scales down conservatively. The behavior field gives explicit control:
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60    # Don't scale up again within 60s of the last scale-up
    policies:
      - type: Percent
        value: 50                     # Scale up by at most 50% per minute
        periodSeconds: 60
      - type: Pods
        value: 4                      # Or add at most 4 pods per minute
        periodSeconds: 60
    selectPolicy: Max                 # Use whichever policy allows scaling more aggressively
  scaleDown:
    stabilizationWindowSeconds: 600   # 10-minute stabilization window before scale-down
    policies:
      - type: Percent
        value: 10                     # Remove at most 10% of pods per minute
        periodSeconds: 60
    selectPolicy: Min                 # Use the most conservative policy

selectPolicy: Max picks the policy that results in the most scaling (faster scale-up). selectPolicy: Min picks the policy that results in the least scaling (slower scale-down). The asymmetry — aggressive scale-up, conservative scale-down — is the standard pattern for production services to avoid performance degradation during traffic spikes.
VPA: Vertical Pod Autoscaler
VPA observes historical resource usage and recommends or automatically updates resource requests. Install VPA (requires the VPA CRDs and three components: recommender, updater, admission controller):
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Verify
kubectl get pods -n kube-system | grep vpa
# vpa-admission-controller-<hash>   Running
# vpa-recommender-<hash>            Running
# vpa-updater-<hash>                Running

Or with Helm:
helm repo add cowboysysop https://cowboysysop.github.io/charts/
helm install vpa cowboysysop/vertical-pod-autoscaler --namespace kube-system

VPA UpdateMode: Off (Recommend Only)
Start with Off mode — VPA generates recommendations but doesn't touch pods:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"   # Recommend only — observe for 1-2 weeks before enabling Auto
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
        controlledResources: ["cpu", "memory"]

Read recommendations:
kubectl get vpa payments-api-vpa -n payments -o yaml | grep -A 30 recommendation
# containerRecommendations:
# - containerName: payments-api
#   lowerBound:
#     cpu: 50m
#     memory: 128Mi
#   target:            # ← Use these values as your resource requests
#     cpu: 250m
#     memory: 384Mi
#   upperBound:
#     cpu: 1
#     memory: 768Mi
#   uncappedTarget:
#     cpu: 250m
#     memory: 384Mi

The target value is the recommended request. lowerBound and upperBound are confidence intervals. uncappedTarget is the recommendation VPA would make if no minAllowed/maxAllowed bounds were specified — useful for checking whether your bounds are constraining the recommendation.
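To pull only the target values without reading the full status, a jsonpath query works; this sketch assumes the single-container payments-api-vpa object defined above:

# Extract the target recommendation for the first container
kubectl get vpa payments-api-vpa -n payments \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'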
VPA UpdateMode: Auto
Auto mode applies recommendations in two ways: at pod creation (via the admission controller webhook, like Initial mode) and by evicting running pods so they restart with updated resource requests. This causes pod restarts for running workloads — acceptable for Deployments, problematic for single-replica pods or StatefulSets with no PodDisruptionBudget:
updatePolicy:
  updateMode: "Auto"   # Evict and restart pods to apply new resource requests

Use PodDisruptionBudgets to prevent VPA from evicting too many pods simultaneously:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: payments
spec:
  minAvailable: 2   # VPA's updater won't evict a pod if doing so would leave fewer than 2 available
  selector:
    matchLabels:
      app: payments-api

VPA UpdateMode: Initial
Initial sets resource requests when pods are first created (via the admission controller webhook) but doesn't evict running pods:
updatePolicy:
  updateMode: "Initial"   # Apply recommendations to new pods only; don't evict running pods

This is a useful middle ground: new pods start with good resource settings, but running pods aren't disrupted.
HPA + VPA: Using Both Simultaneously
HPA and VPA can coexist, but not on the same resource dimension:
| Configuration | Safe? | Reason |
|---|---|---|
| HPA on CPU, VPA on CPU in Auto/Initial mode | No | VPA changes the CPU request that HPA's utilization percentage is based on — both fight over the same dimension |
| HPA on CPU, VPA on CPU in Off mode | Yes | VPA in Off mode only generates recommendations, never modifies pods |
| HPA on CPU, VPA on memory only | Yes | Different dimensions, no conflict |
| HPA on custom metrics (req/s), VPA on CPU + memory | Yes | HPA uses application metrics; VPA manages resource requests |
| HPA on CPU + memory, VPA disabled | Yes | Standard pattern |
VPA Off mode (recommendations only, no automatic changes) is the safe default. Switch to Auto only after validating recommendations for 1+ weeks on non-critical workloads. Auto mode causes pod evictions and should be enabled gradually. For the recommended combination, use HPA on HTTP request rate (custom metric from Prometheus) and VPA in Off mode for right-sizing guidance. HPA handles demand; VPA's recommendations guide right-sizing.
When using both, exclude the dimensions HPA manages from VPA's controlledResources:
resourcePolicy:
  containerPolicies:
    - containerName: payments-api
      controlledResources: ["memory"]   # VPA manages memory only; HPA manages CPU-based scaling

KEDA: Scale-to-Zero and Event-Driven Scaling
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to support scale-to-zero and event sources beyond Prometheus: SQS queue depth, Kafka consumer lag, Redis list length, Cron schedules.
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Scale an SQS-based worker to zero when the queue is empty:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaledobject
  namespace: payments
spec:
  scaleTargetRef:
    name: sqs-worker
  minReplicaCount: 0     # Scale to zero when the queue is empty
  maxReplicaCount: 50
  cooldownPeriod: 300    # Seconds to wait before scaling to zero after the queue drains

  triggers:
    - type: aws-sqs-queue
      # Use identityOwner: operator to use the KEDA operator's IAM role (via IRSA or Pod Identity).
      # When identityOwner: operator is set, authenticationRef is not used — remove it.
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/012345678901/payments-jobs
        queueLength: "5"         # Target: 5 messages per replica
        awsRegion: us-east-1
        identityOwner: operator  # KEDA operator's Pod Identity / IRSA role handles auth

KEDA creates and manages an HPA under the hood. ScaledObject is the user-facing API; KEDA translates it to an HPA with the appropriate custom metric source.
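The same ScaledObject pattern extends to KEDA's other scalers. As a sketch, a cron trigger that pre-scales the worker for a known busy window (the schedule and replica count here are assumptions, not from this article):

triggers:
  - type: cron
    metadata:
      timezone: Etc/UTC
      start: 0 6 * * *        # scale up at 06:00 UTC
      end: 0 8 * * *          # release the floor at 08:00 UTC
      desiredReplicas: "10"   # hold at least 10 replicas during the window

When a ScaledObject defines multiple triggers, KEDA scales to the maximum replica count any trigger demands, the same way HPA combines multiple metrics.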
Operational Considerations
PodDisruptionBudget: Protecting Against Voluntary Disruptions
When a node is drained, or when VPA's updater or Karpenter's consolidation evicts pods, nothing prevents all replicas of a Deployment from being evicted simultaneously unless a PodDisruptionBudget is in place:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: payments
spec:
  minAvailable: 2        # At least 2 pods must remain available during disruptions
  # OR:
  # maxUnavailable: 1    # At most 1 pod unavailable at a time
  selector:
    matchLabels:
      app: payments-api

PodDisruptionBudgets apply to voluntary evictions: node drains, VPA's updater, and Karpenter consolidation all go through the eviction API and respect minAvailable. HPA scale-down is not an eviction (HPA lowers the Deployment's replica count directly), so use the HPA's minReplicas, not a PDB, to enforce a replica floor.
Infrastructure Scaling: HPA and Karpenter
In 2026, horizontal scaling is often bottlenecked by node availability. While HPA creates new pods, Karpenter (the modern alternative to Cluster Autoscaler) ensures that those pods have nodes to run on:
- HPA Scale-Up: Traffic spikes, HPA increases replicas from 5 to 15.
- Pending Pods: 5 pods find space on existing nodes; 5 pods become Pending because the cluster is full.
- Karpenter Detection: Within seconds, Karpenter identifies the pending pods and their resource requirements.
- Just-in-Time Provisioning: Karpenter calls the AWS EC2 API to provision a new node (or an optimal mix of Spot/On-Demand instances) tailored exactly to those pods.
Karpenter's "just-in-time" model reduces node provisioning time from several minutes to under 60 seconds, making your HPA scaling significantly more responsive.
Checking HPA Status and Scaling Events
# Current HPA state
kubectl get hpa payments-api-hpa -n payments
# NAME               REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# payments-api-hpa   Deployment/payments-api   45%/60%   3         20        5          2d

# Detailed HPA status including conditions
kubectl describe hpa payments-api-hpa -n payments
# Conditions:
#   AbleToScale      True    ReadyForNewScale
#   ScalingActive    True    ValidMetricFound
#   ScalingLimited   False   DesiredWithinRange

# Scaling events
kubectl get events -n payments --field-selector reason=SuccessfulRescale --sort-by=.lastTimestamp

ScalingLimited = True means HPA wants to scale beyond maxReplicas (or below minReplicas) but can't. If you see this condition frequently, raise maxReplicas.
Frequently Asked Questions
Why isn't my HPA scaling even though CPU is high?
The most common causes:
- No resource requests set: HPA measures utilization as a percentage of the request. Pods with no CPU request have undefined utilization — kubectl describe hpa shows <unknown> for the current metric value. A quick check is shown after this list.
- Metrics server not running: kubectl top pods fails. Install the metrics server.
- Target is already at maxReplicas: The ScalingLimited condition will be True.
- Stabilization window: HPA scaled recently and is still inside the stabilization window. Check kubectl describe hpa for the last scale event time.
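For the first cause, a quick way to confirm whether requests are set on the target (a sketch using this article's Deployment name):

# Empty output, or output with no cpu request, means the HPA's
# Utilization metric for this Deployment will read <unknown>.
kubectl get deployment payments-api -n payments \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'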
Should I set CPU limits if I'm using HPA?
HPA scales on CPU utilization (percentage of request), so limits don't factor into the scaling math. If you omit CPU limits, pods can burst into spare node capacity; measured utilization can exceed 100% of the request, and HPA scales up in response. If you set CPU limits equal to requests (as required for Guaranteed QoS), throttling at the limit inflates latency before HPA has time to react. The standard pattern: set requests at steady-state usage, set limits at 2-4x requests, and let HPA add replicas before pods hit their limits; a sketch follows below. See Kubernetes Resource Management: Requests, Limits, QoS, and LimitRanges.
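A sketch of that pattern with illustrative numbers:

resources:
  requests:
    cpu: 250m     # steady-state usage: the baseline HPA utilization is measured against
  limits:
    cpu: "1"      # ~4x the request: burst headroom before throttling kicks in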
What's the difference between HPA and Karpenter node autoscaling?
HPA scales pods (replicas). Karpenter scales nodes. They work together: HPA adds pods, and if the pods can't be scheduled because the cluster doesn't have enough node capacity, Karpenter provisions new nodes. HPA operates in seconds; Karpenter provisions nodes in 30-60 seconds. For bursty workloads, use HPA to handle within-node capacity and Karpenter to provision new nodes for larger spikes. See Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter.
For resource requests that HPA uses as its utilization baseline, see Kubernetes Resource Management: Requests, Limits, QoS, and LimitRanges. For KEDA's event-driven scaling with SQS, Kafka, and other sources beyond what standard HPA supports, see KEDA: Event-Driven Autoscaling for Kubernetes. For the dedicated VPA deep-dive — modes, HPA conflict, Goldilocks right-sizing, and production adoption — see Kubernetes VPA: Right-Sizing Containers Without Manual Tuning.
Setting up HPA with Prometheus custom metrics on EKS, or debugging HPA that won't scale? Talk to us at Coding Protocols — we help platform teams implement autoscaling that responds to the right signals without the flapping and over-provisioning that come from misconfigured CPU-only scaling.


