Platform Engineering
13 min read · May 8, 2026

Kubernetes HPA and VPA: Horizontal and Vertical Pod Autoscaling

HPA scales the number of pod replicas based on CPU, memory, or custom metrics from Prometheus. VPA adjusts pod resource requests and limits based on observed usage. HPA handles demand spikes — more replicas for more traffic. VPA handles resource right-sizing — correct requests so pods land on appropriately-sized nodes and QoS classes are accurate. This covers HPA v2 with CPU and custom metrics, scale-to-zero with KEDA, VPA installation and update modes, and the critical constraint: HPA and VPA cannot both manage the same resource dimension simultaneously.

Coding Protocols Team

Kubernetes has two pod-level autoscalers:

HPA (Horizontal Pod Autoscaler) adds or removes pod replicas. When CPU usage rises, HPA scales from 3 replicas to 6. When it drops, it scales back down. This is the primary autoscaler for stateless services with variable traffic.

VPA (Vertical Pod Autoscaler) adjusts the resource requests and limits on existing pods. When a pod consistently uses only 80m CPU against a 500m request, VPA recommends (or automatically applies) a 100m request. This keeps resource efficiency high and QoS class assignments accurate.

The two autoscalers solve different problems. HPA handles demand elasticity; VPA handles resource right-sizing. They can coexist — but with a constraint: if HPA is managing CPU or memory, VPA must not also manage those same dimensions.


HPA v2: CPU and Memory Scaling

The HPA v2 API (autoscaling/v2) replaced v1 and supports multiple metrics simultaneously. The metrics server must be installed for CPU and memory metrics:

bash
# Install metrics-server if not already present
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top pods -n payments

Basic HPA scaling on CPU:

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa
  namespace: payments
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api

  minReplicas: 3
  maxReplicas: 20

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60    # Target: 60% of the CPU request across all pods

  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # Scale up immediately (no stabilization)
      policies:
        - type: Percent
          value: 100    # Can double replica count per period
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300    # Wait 5 minutes before scaling down (prevent flapping)
      policies:
        - type: Pods
          value: 1      # Remove at most 1 pod per minute (conservative scale-down)
          periodSeconds: 60

The HPA controller polls metrics every 15 seconds (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-sync-period). It uses the formula:

desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))

For CPU utilization: if 3 pods are using 120% of their CPU request and the target is 60%, HPA computes ceil(3 × (120/60)) = ceil(6) = 6 replicas.
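
The same arithmetic can be sketched in shell, using the example's numbers as hypothetical inputs (ceiling division done with integers):

```shell
# Hypothetical inputs: 3 replicas averaging 120% of their CPU request, target 60%
current_replicas=3
current_util=120
target_util=60

# desiredReplicas = ceil(currentReplicas * currentUtil / targetUtil)
desired=$(( (current_replicas * current_util + target_util - 1) / target_util ))
echo "$desired"    # 6
```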

CPU utilization is measured as a percentage of the pod's CPU request, not of the node's CPU. A pod with no CPU request set has no meaningful utilization metric — HPA won't work correctly without resource requests set on the target pods.
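
Concretely, the target Deployment's pod template must set a CPU request for the utilization math to be meaningful. A minimal sketch (image and values hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: example.com/payments-api:1.0    # hypothetical image
          resources:
            requests:
              cpu: 250m       # HPA's 60% target means ~150m average per pod
              memory: 256Mi
```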


Multi-Metric HPA: CPU + Memory + Custom

HPA v2 supports multiple metrics. It calculates the desired replica count for each metric independently and uses the maximum across all metrics:

yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue
        averageValue: 512Mi    # Target: average 512Mi memory per pod (not utilization %)

  - type: Pods
    pods:
      metric:
        name: http_requests_per_second    # Custom metric from Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "1000"    # Target: 1000 req/s per pod

The maximum desired replica count across all three metrics determines the actual target. If CPU says 4 replicas, memory says 6 replicas, and the request-rate metric says 8 replicas, HPA scales to 8.
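
As a sketch of that selection, with the hypothetical per-metric results above:

```shell
# Hypothetical desired replica counts computed independently per metric
cpu_desired=4
memory_desired=6
request_rate_desired=8

# HPA uses the maximum across all metrics
max=$cpu_desired
for d in $memory_desired $request_rate_desired; do
  if [ "$d" -gt "$max" ]; then
    max=$d
  fi
done
echo "$max"    # 8
```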


Prometheus Custom Metrics with the Prometheus Adapter

For HTTP request rate and other application-level metrics, install the Prometheus Adapter, which exposes Prometheus metrics to the Kubernetes custom metrics API:

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://prometheus-operated.monitoring.svc:9090

Configure a metric rule that maps a PromQL query to a Kubernetes custom metric:

yaml
# prometheus-adapter values
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

This exposes http_requests_per_second as a pods-scoped custom metric. Verify:

bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2/namespaces/payments/pods/*/http_requests_per_second"

HPA Behavior: Preventing Flapping

The default HPA behavior scales up aggressively and scales down conservatively. The behavior field gives explicit control:

yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60     # Don't scale up again within 60s of last scale-up
    policies:
      - type: Percent
        value: 50       # Scale up by at most 50% per minute
        periodSeconds: 60
      - type: Pods
        value: 4        # Or add at most 4 pods per minute
        periodSeconds: 60
    selectPolicy: Max   # Use whichever policy allows scaling more aggressively

  scaleDown:
    stabilizationWindowSeconds: 600    # 10-minute stabilization window before scale-down
    policies:
      - type: Percent
        value: 10       # Remove at most 10% of pods per minute
        periodSeconds: 60
    selectPolicy: Min   # Use the most conservative policy

selectPolicy: Max picks the policy that results in the most scaling (faster scale-up). selectPolicy: Min picks the policy that results in the least scaling (slower scale-down). The asymmetry — aggressive scale-up, conservative scale-down — is the standard pattern for production services to avoid performance degradation during traffic spikes.


VPA: Vertical Pod Autoscaler

VPA observes historical resource usage and recommends or automatically updates resource requests. Install VPA (requires the VPA CRDs and three components: recommender, updater, admission controller):

bash
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Verify
kubectl get pods -n kube-system | grep vpa
# vpa-admission-controller-<hash>    Running
# vpa-recommender-<hash>             Running
# vpa-updater-<hash>                 Running

Or with Helm:

bash
helm repo add cowboysysop https://cowboysysop.github.io/charts/
helm install vpa cowboysysop/vertical-pod-autoscaler --namespace kube-system

VPA UpdateMode: Off (Recommend Only)

Start with Off mode — VPA generates recommendations but doesn't touch pods:

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa
  namespace: payments
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-api
  updatePolicy:
    updateMode: "Off"    # Recommend only — observe for 1-2 weeks before enabling Auto
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
        controlledResources: ["cpu", "memory"]

Read recommendations:

bash
kubectl get vpa payments-api-vpa -n payments -o yaml | grep -A 30 recommendation
# containerRecommendations:
# - containerName: payments-api
#   lowerBound:
#     cpu: 50m
#     memory: 128Mi
#   target:             ← Use these values as your resource requests
#     cpu: 250m
#     memory: 384Mi
#   upperBound:
#     cpu: 1
#     memory: 768Mi
#   uncappedTarget:
#     cpu: 250m
#     memory: 384Mi

The target value is the recommended request. lowerBound and upperBound are confidence intervals. uncappedTarget is the recommendation VPA would make if no minAllowed/maxAllowed bounds were specified — useful for checking whether your bounds are constraining the recommendation.
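
Feeding the target back into the Deployment is then a manual or GitOps change. A sketch using the recommendation values above:

```yaml
# Pod template excerpt: requests set from the VPA target recommendation
containers:
  - name: payments-api
    resources:
      requests:
        cpu: 250m        # from recommendation.target.cpu
        memory: 384Mi    # from recommendation.target.memory
      limits:
        memory: 768Mi    # e.g. the upperBound, if you choose to cap memory
```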

VPA UpdateMode: Auto

Auto mode applies recommendations in two ways: at pod creation (via the admission controller webhook, like Initial mode) and by evicting running pods so they restart with updated resource requests. This causes pod restarts for running workloads — acceptable for Deployments, problematic for single-replica pods or StatefulSets with no PodDisruptionBudget:

yaml
updatePolicy:
  updateMode: "Auto"    # Evict and restart pods to apply new resource requests

Use PodDisruptionBudgets to prevent VPA from evicting too many pods simultaneously:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: payments
spec:
  minAvailable: 2    # Eviction is blocked if it would leave fewer than 2 pods available
  selector:
    matchLabels:
      app: payments-api

VPA UpdateMode: Initial

Initial sets resource requests when pods are first created (via the admission controller webhook) but doesn't evict running pods:

yaml
updatePolicy:
  updateMode: "Initial"    # Apply recommendations to new pods only; don't evict running pods

This is a useful middle ground: new pods start with good resource settings, but running pods aren't disrupted.


HPA + VPA: Using Both Simultaneously

HPA and VPA can coexist, but not on the same resource dimension:

| Configuration | Safe? | Reason |
|---|---|---|
| HPA on CPU, VPA on CPU in Auto/Initial mode | No | VPA changes the CPU request that HPA's utilization percentage is based on; both fight over the same dimension |
| HPA on CPU, VPA on CPU in Off mode | Yes | VPA in Off mode only generates recommendations, never modifies pods |
| HPA on CPU, VPA on memory only | Yes | Different dimensions, no conflict |
| HPA on custom metrics (req/s), VPA on CPU + memory | Yes | HPA uses application metrics; VPA manages resource requests |
| HPA on CPU + memory, VPA disabled | Yes | Standard pattern |

VPA Off mode (recommendations only, no automatic changes) is the safe default. Switch to Auto only after validating recommendations for a week or more on non-critical workloads; Auto causes pod evictions and should be enabled gradually. The recommended combination: HPA on HTTP request rate (custom metric from Prometheus) plus VPA in Off mode. HPA handles demand; VPA's recommendations guide right-sizing.

When using both, exclude the dimensions HPA manages from VPA's controlledResources:

yaml
resourcePolicy:
  containerPolicies:
    - containerName: payments-api
      controlledResources: ["memory"]    # VPA manages memory only; HPA manages CPU-based scaling

KEDA: Scale-to-Zero and Event-Driven Scaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA to support scale-to-zero and event sources beyond Prometheus: SQS queue depth, Kafka consumer lag, Redis list length, Cron schedules.

bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

Scale an SQS-based worker to zero when the queue is empty:

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-worker-scaledobject
  namespace: payments
spec:
  scaleTargetRef:
    name: sqs-worker
  minReplicaCount: 0    # Scale to zero when queue is empty
  maxReplicaCount: 50
  cooldownPeriod: 300   # Seconds to wait before scaling to zero after queue drains

  triggers:
    - type: aws-sqs-queue
      # identityOwner: operator uses the KEDA operator's IAM role (via IRSA or Pod Identity);
      # with it set, authenticationRef is not used
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/012345678901/payments-jobs
        queueLength: "5"      # Target: 5 messages per replica
        awsRegion: us-east-1
        identityOwner: operator    # KEDA operator's Pod Identity / IRSA role handles auth

KEDA creates and manages an HPA under the hood. ScaledObject is the user-facing API; KEDA translates it to an HPA with the appropriate custom metric source.


Operational Considerations

PodDisruptionBudget: Protecting Against Eviction

When a node is drained, or when VPA's updater or Karpenter consolidation reschedules pods, the pods are evicted through the eviction API. Without a PodDisruptionBudget, all replicas of a Deployment can be evicted simultaneously:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
  namespace: payments
spec:
  minAvailable: 2    # At least 2 pods must remain available during disruptions
  # OR:
  # maxUnavailable: 1    # At most 1 pod unavailable at a time
  selector:
    matchLabels:
      app: payments-api

PodDisruptionBudgets gate voluntary evictions: node drains, VPA's updater, and Karpenter consolidation all respect them. HPA scale-down is different: it lowers the Deployment's replica count directly rather than using the eviction API, so a PDB does not stop it. Use minReplicas and the scaleDown behavior to bound HPA itself.

Infrastructure Scaling: HPA and Karpenter

In 2026, horizontal scaling is often bottlenecked by node availability. While HPA creates new pods, Karpenter (the modern alternative to Cluster Autoscaler) ensures that those pods have nodes to run on:

  1. HPA Scale-Up: Traffic spikes, HPA increases replicas from 5 to 15.
  2. Pending Pods: 5 pods find space on existing nodes; 5 pods become Pending because the cluster is full.
  3. Karpenter Detection: Within seconds, Karpenter identifies the pending pods and their resource requirements.
  4. Just-in-Time Provisioning: Karpenter calls the AWS EC2 API to provision a new node (or an optimal mix of Spot/On-Demand instances) tailored exactly to those pods.

Karpenter's "just-in-time" model reduces node provisioning time from several minutes to under 60 seconds, making your HPA scaling significantly more responsive.
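
For context, a minimal NodePool sketch. This assumes Karpenter's karpenter.sh/v1 API on AWS and an EC2NodeClass named default that already exists; names and limits are hypothetical:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]    # Allow both; Karpenter picks the cheapest fit
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"    # Cap total provisioned CPU across this pool
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```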


Checking HPA Status and Scaling Events

bash
# Current HPA state
kubectl get hpa payments-api-hpa -n payments
# NAME               REFERENCE                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
# payments-api-hpa   Deployment/payments-api   45%/60%   3         20        5          2d

# Detailed HPA status including conditions
kubectl describe hpa payments-api-hpa -n payments
# Conditions:
#   AbleToScale     True    ReadyForNewScale
#   ScalingActive   True    ValidMetricFound
#   ScalingLimited  False   DesiredWithinRange

# Scaling events
kubectl get events -n payments --field-selector reason=SuccessfulRescale --sort-by=.lastTimestamp

ScalingLimited = True means HPA wants to scale beyond maxReplicas (or below minReplicas) but can't. If you see this condition frequently, raise maxReplicas.


Frequently Asked Questions

Why isn't my HPA scaling even though CPU is high?

The most common causes:

  1. No resource requests set: HPA measures utilization as a percentage of the request. Pods with no CPU request have undefined utilization — kubectl describe hpa shows <unknown> for the current metric value.
  2. Metrics server not running: kubectl top pods fails. Install the metrics server.
  3. Target is already at maxReplicas: The ScalingLimited condition will be True.
  4. Stabilization window: HPA waited for scale-down and is now in the stabilization window. Check kubectl describe hpa for the last scale event time.

Should I set CPU limits if I'm using HPA?

HPA scales on CPU utilization (percentage of request). If you omit CPU limits, a pod can use spare node capacity freely, which looks like lower utilization — HPA won't scale up as aggressively. If you set CPU limits equal to requests (Guaranteed QoS), throttling at the limit will inflate latency. The standard pattern: set requests conservatively (steady-state usage), set limits at 2-4x requests, and let HPA add replicas before pods hit their limits. See Kubernetes Resource Management: Requests, Limits, QoS, and LimitRanges.
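
That pattern as a container resources sketch (values hypothetical):

```yaml
resources:
  requests:
    cpu: 250m        # steady-state usage; HPA's utilization baseline
    memory: 256Mi
  limits:
    cpu: "1"         # ~4x the request: burst headroom before HPA adds replicas
    memory: 256Mi    # memory limit = request (memory is not compressible)
```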

What's the difference between HPA and Karpenter node autoscaling?

HPA scales pods (replicas). Karpenter scales nodes. They work together: HPA adds pods, and if the pods can't be scheduled because the cluster doesn't have enough node capacity, Karpenter provisions new nodes. HPA operates in seconds; Karpenter provisions nodes in 30-60 seconds. For bursty workloads, use HPA to handle within-node capacity and Karpenter to provision new nodes for larger spikes. See Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter.


For resource requests that HPA uses as its utilization baseline, see Kubernetes Resource Management: Requests, Limits, QoS, and LimitRanges. For KEDA's event-driven scaling with SQS, Kafka, and other sources beyond what standard HPA supports, see KEDA: Event-Driven Autoscaling for Kubernetes. For the dedicated VPA deep-dive — modes, HPA conflict, Goldilocks right-sizing, and production adoption — see Kubernetes VPA: Right-Sizing Containers Without Manual Tuning.

Setting up HPA with Prometheus custom metrics on EKS, or debugging HPA that won't scale? Talk to us at Coding Protocols — we help platform teams implement autoscaling that responds to the right signals without the flapping and over-provisioning that come from misconfigured CPU-only scaling.

Related Topics

Kubernetes
HPA
VPA
Autoscaling
KEDA
Platform Engineering
EKS
Performance
