Kubernetes
14 min read · May 8, 2026

Kubernetes HPA Beyond CPU: Scaling on Custom and External Metrics

CPU-based autoscaling works until it doesn't. Queue depth, request latency, active connections, business metrics — these are better signals for most workloads. Here's how to configure HPA with custom and external metrics, and when to reach for KEDA instead.

Coding Protocols Team
Platform Engineering

The Kubernetes Horizontal Pod Autoscaler ships with CPU and memory scaling built in. For stateless services with a linear relationship between CPU utilisation and load, that's sufficient. For everything else — queue consumers, API gateways, batch processors, services where latency matters more than CPU — you need custom metrics.

This post covers the full HPA picture: how custom and external metric scaling works, how to wire Prometheus metrics into HPA, the common scaling signal choices and their trade-offs, and when KEDA is the better tool.


How HPA Metrics Work

HPA operates on a control loop. Every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period), it queries the metrics API, computes the desired replica count, and adjusts if the current count differs.

The formula for resource metrics (CPU/memory):

desiredReplicas = ceil(currentReplicas × (currentMetricValue / desiredMetricValue))

If you have 3 replicas at 80% CPU and your target is 50%, HPA computes ceil(3 × 80/50) = ceil(4.8) = 5 and scales to 5.
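
HPA also applies a tolerance, 0.1 by default (cluster-wide via --horizontal-pod-autoscaler-tolerance): if the ratio of current to target metric value falls within 10% of 1.0, no scaling action is taken, which avoids constant small adjustments.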

For custom metrics the same formula applies, but the metric value comes from the custom metrics API (custom.metrics.k8s.io) or external metrics API (external.metrics.k8s.io) rather than the resource metrics API (metrics.k8s.io).

The metric source types available in an HPA spec:

yaml
metrics:
  - type: Resource          # CPU/memory — from metrics-server
  - type: Pods              # Per-pod custom metric — averaged across pods
  - type: Object            # Metric on a specific Kubernetes object
  - type: External          # Metric from outside Kubernetes (SQS, Pub/Sub, etc.)
  - type: ContainerResource # Per-container resource metric (beta)

Prometheus Adapter: Custom Metrics from Prometheus

The most common setup: your application exposes a /metrics endpoint, Prometheus scrapes it, and you want HPA to scale on those metrics.

The Prometheus Adapter (prometheus-community/prometheus-adapter) translates Prometheus queries into the custom.metrics.k8s.io API that HPA consumes.
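
If you run the Prometheus Operator (kube-prometheus-stack), the scrape is usually declared with a ServiceMonitor. A minimal sketch, assuming your Service carries the label app: api and names its metrics port http; adjust selectors and labels to your setup:

yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: production
  labels:
    release: kube-prometheus-stack   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api                       # assumed Service label
  endpoints:
    - port: http                     # assumed port name exposing /metrics
      path: /metrics
      interval: 30s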

Install Prometheus Adapter

bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --set prometheus.url=http://kube-prometheus-stack-prometheus.monitoring.svc.cluster.local \
  --set prometheus.port=9090
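
The adapter plugs into the Kubernetes API aggregation layer so that HPA can query it. The Helm chart registers an APIService roughly like the following; this is a sketch, and the exact versions registered depend on the chart version:

yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta2.custom.metrics.k8s.io
spec:
  group: custom.metrics.k8s.io
  version: v1beta2
  service:
    name: prometheus-adapter      # the adapter's Service in the monitoring namespace
    namespace: monitoring
  groupPriorityMinimum: 100
  versionPriority: 100
  insecureSkipTLSVerify: true     # skip TLS verification between aggregator and adapter; configure TLS for production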

Configure a Custom Metric

Add a rule to the adapter's ConfigMap that defines how a Prometheus query maps to a metric name:

yaml
# values.yaml for prometheus-adapter
rules:
  custom:
    - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "^(.*)_total$"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

This exposes http_requests_per_second as a per-pod metric in the custom metrics API.

Verify the metric is available:

bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta2" | jq '.resources[].name'

kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta2/namespaces/production/pods/*/http_requests_per_second" \
  | jq .

HPA Using the Custom Metric

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"   # Scale when avg req/s per pod exceeds 100
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60             # Scale down at most 10% per minute
    scaleUp:
      stabilizationWindowSeconds: 0     # Scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15             # Can double replicas every 15 seconds

The behavior block is important and often omitted in examples. Without it, HPA uses defaults that can thrash — scaling down too aggressively on a brief traffic dip, then scaling back up immediately. The stabilisation window prevents flapping.


Scaling on Queue Depth

Queue depth is the ideal autoscaling signal for queue consumers: one replica per N messages in the queue. When the queue is empty, scale to zero (if your workload supports it). When messages accumulate, scale up proportionally.

SQS Queue Depth (External Metric)

For AWS SQS queues, the metric comes from CloudWatch — it's external to Kubernetes. Use the External metric type with an adapter that exposes CloudWatch metrics (e.g., k8s-cloudwatch-adapter or KEDA's SQS scaler).

With the CloudWatch adapter:

yaml
metrics:
  - type: External
    external:
      metric:
        name: sqs_messages_visible
        selector:
          matchLabels:
            queue_name: my-worker-queue
      target:
        type: AverageValue
        averageValue: "30"   # One replica per 30 messages in queue
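
With an AverageValue target, HPA effectively divides the total metric by the target: 900 visible messages at a target of 30 gives ceil(900 / 30) = 30 replicas, clamped to the minReplicas/maxReplicas bounds.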

Prometheus Queue Depth (Custom Metric)

If your queue metrics are already in Prometheus (RabbitMQ, Redis Streams, Kafka via JMX exporter):

yaml
metrics:
  - type: Object
    object:
      metric:
        name: rabbitmq_queue_messages_ready
      describedObject:
        apiVersion: v1
        kind: Service
        name: rabbitmq
      target:
        type: Value
        value: "100"   # Total queue depth target, not per-pod average

Object metrics are not averaged across pods — they're a single value from a specific Kubernetes object. Use Object for queue depth (the total queue depth is what matters, not per-pod average), and Pods for per-pod metrics like request rate.
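
For the Object metric above to appear in the custom metrics API, the adapter needs a rule that associates the series with the rabbitmq Service. A sketch, assuming the exporter's series carry namespace and service labels; the label names depend on your exporter and relabeling:

yaml
rules:
  custom:
    - seriesQuery: 'rabbitmq_queue_messages_ready{namespace!="",service!=""}'
      resources:
        overrides:
          namespace:
            resource: namespace
          service:
            resource: service
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'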


Scaling on Request Latency

CPU is a lagging indicator for latency-sensitive services. By the time CPU is high enough to trigger scaling, latency has already degraded. P95/P99 latency as a scaling signal is more responsive.

With Prometheus adapter and Istio or application-level latency metrics:

yaml
# Prometheus adapter rule
- seriesQuery: 'http_request_duration_seconds_bucket{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace:
        resource: namespace
      pod:
        resource: pod
  name:
    as: "http_request_p95_latency"
  metricsQuery: >
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket{<<.LabelMatchers>>}[2m]))
      by (le, pod, namespace)
    )
yaml
# HPA using latency metric
metrics:
  - type: Pods
    pods:
      metric:
        name: http_request_p95_latency
      target:
        type: AverageValue
        averageValue: "200m"   # 200 milliseconds P95 target

Latency-based scaling requires careful tuning. If your target is P95 < 200ms and traffic spikes cause a brief latency increase, HPA may scale up before the issue resolves itself. Add a stabilizationWindowSeconds on scale-up to avoid over-reacting to transient spikes.
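
A sketch of that scale-up damping; the right window length depends on how long your transient latency spikes typically last:

yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60   # Only react if latency stays elevated for a minute
    policies:
      - type: Percent
        value: 50
        periodSeconds: 60            # Add at most 50% more pods per minute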


Choosing the Right Scaling Signal

Signal | Best For | Caution
------ | -------- | -------
CPU | CPU-bound computation, batch processing | Lags for I/O-bound or latency-sensitive services
Memory | Memory-bound processing (caching, ML inference) | Memory leaks cause runaway scaling
Request rate (RPS) | HTTP services with uniform request cost | Doesn't account for request complexity variance
P95/P99 latency | Latency-SLO-driven services | Can oscillate; needs stabilisation window
Queue depth | Queue consumers, async workers | Needs scale-to-zero for empty queues (KEDA)
Active connections | Websocket servers, connection-pooled services | Connection draining during scale-down is complex
Custom business metric | E-commerce (cart count), gaming (active sessions) | Requires metric export pipeline

The most reliable approach for most HTTP services: request rate per pod with a target based on your service's benchmarked capacity. If your service handles 200 RPS per pod before latency degrades, set target to 150 RPS per pod (25% headroom). This is more predictive than CPU and simpler than latency.
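
Using the http_requests_per_second metric defined earlier, that recommendation looks like this; the 150 figure is illustrative, so substitute your own benchmarked number:

yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "150"   # 200 RPS/pod benchmarked capacity with 25% headroom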


Scaling Behaviour Tuning

HPA v2 (autoscaling/v2, GA in Kubernetes 1.23) gives you fine-grained control over scaling velocity:

yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # No delay on scale-up
    selectPolicy: Max                  # Use the policy that scales up the most
    policies:
      - type: Pods
        value: 4
        periodSeconds: 60              # Add at most 4 pods per minute
      - type: Percent
        value: 100
        periodSeconds: 60              # Or double the pods per minute — whichever is larger
  scaleDown:
    stabilizationWindowSeconds: 300    # Wait 5 minutes before scaling down
    selectPolicy: Min                  # Use the policy that scales down the least (conservative)
    policies:
      - type: Pods
        value: 2
        periodSeconds: 60              # Remove at most 2 pods per minute
      - type: Percent
        value: 10
        periodSeconds: 60              # Or 10% per minute — whichever is smaller

Scale-up: fast and aggressive. Traffic spikes are bad for users; adding pods quickly is correct.

Scale-down: slow and conservative. Removing pods too quickly after a traffic spike causes thrashing. The stabilizationWindowSeconds on scale-down looks at the last N seconds of metric values and uses the highest as the current value — preventing scale-down when metrics have been high recently.


When to Use KEDA Instead

KEDA (Kubernetes Event-Driven Autoscaler) extends HPA with a library of 50+ pre-built scalers for external systems. Use KEDA when:

You need scale-to-zero. Standard HPA has a minimum of 1 replica. KEDA can scale to 0 replicas and back — essential for queue consumers that should consume no resources when the queue is empty.

You're scaling on external systems. KEDA has native scalers for SQS, Kafka, RabbitMQ, Azure Service Bus, Google Pub/Sub, Redis, PostgreSQL, MySQL, Prometheus, and more — without requiring a custom adapter.

You want simpler configuration. KEDA's ScaledObject is easier to read than an HPA with a Prometheus adapter rule:

yaml
# KEDA ScaledObject for SQS scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: worker-deployment
  minReplicaCount: 0        # Scale to zero when queue empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789/my-queue
        queueLength: "30"    # One replica per 30 messages
        awsRegion: us-east-1
      authenticationRef:
        name: keda-trigger-auth-aws

KEDA creates and manages the underlying HPA automatically. The HPA still does the actual scaling; KEDA is a higher-level abstraction over it.
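
The authenticationRef above points at a TriggerAuthentication object. One common shape, assuming IRSA (IAM Roles for Service Accounts) on EKS; adapt the provider to however your cluster authenticates to AWS:

yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-aws
  namespace: production
spec:
  podIdentity:
    provider: aws   # Use the worker pod's IAM role via IRSA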

See KEDA: Event-Driven Autoscaling for Kubernetes for the full KEDA setup guide.

When to stay with standard HPA: Prometheus-based metrics with a single trigger, no scale-to-zero requirement, teams already familiar with HPA. Adding KEDA for a single Prometheus metric adds operational overhead (another controller) without proportional benefit.


Debugging HPA

When autoscaling isn't working as expected:

bash
# Check HPA status and recent events
kubectl describe hpa api-hpa -n production

# Check if the metric is visible to HPA
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta2/namespaces/production/pods/*/http_requests_per_second"

# Check HPA controller logs (label varies by distribution; kubeadm uses component=)
kubectl logs -n kube-system -l component=kube-controller-manager --tail=50 | grep -i horizontal

Common issues:

unable to fetch metrics from custom metrics API — The Prometheus adapter isn't running or its rules don't match the metric series. Check the adapter logs: kubectl logs -n monitoring deployment/prometheus-adapter.

HPA was unable to compute the replica count — The metric returned no data (possibly because no pods matched the label selector in the Prometheus query). Verify the query returns results in Prometheus directly.

DesiredReplicas not changing despite high metric value — Check minReplicas/maxReplicas bounds. Also check if the scaleDown.stabilizationWindowSeconds is holding a higher value from a recent spike.

Scale-up not happening fast enough — Check behavior.scaleUp policies. The default allows 4 pods or 100% (whichever is larger) per 15 seconds. If you have 2 pods and need 20, the default takes a few cycles to get there. Increase maxReplicas and loosen the scale-up policy if you need faster response.


Frequently Asked Questions

Can I use multiple metrics in a single HPA?

Yes. HPA with multiple metrics takes the maximum desired replica count across all metric evaluations. If CPU says scale to 5 and request rate says scale to 8, HPA scales to 8. This is conservative in the right direction — it prevents under-scaling when different signals diverge.
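
A sketch combining a CPU target with the request-rate metric from earlier:

yaml
metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "150"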

What happens when Prometheus is down?

If the custom metrics API returns no data, HPA enters a degraded state and stops scaling in either direction. It does not scale down (which would be dangerous) or scale up. Existing replicas continue running. This fail-safe behaviour is correct for most workloads.

Should I use VPA alongside HPA?

Not on the same metric. VPA (Vertical Pod Autoscaler) adjusts resource requests; HPA scales replica count. They can conflict if both are responding to CPU — HPA scales out while VPA increases requests, potentially triggering node pressure. The standard recommendation: use VPA for right-sizing resource requests (in Off or Initial mode to set values without live adjustment), HPA for replica scaling on a non-CPU metric.
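
A minimal VPA in recommendation-only mode, assuming a Deployment named api; it records suggested resource requests without evicting pods:

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # Recommendations only; read them from the VPA status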

How do I set the right target value?

Load test your service to find its saturation point: the RPS (or queue depth, or latency) at which performance degrades. Set the HPA target at 60–70% of that value. This gives headroom for traffic to spike before scaling triggers, and enough time for new pods to start before the existing pods are overwhelmed.


For event-driven autoscaling with scale-to-zero, see KEDA: Event-Driven Autoscaling for Kubernetes. For node-level autoscaling, see How to Install Karpenter on EKS. For HPA v2 behavior tuning and ContainerResource metrics, see Kubernetes HPA v2: Behavior Tuning and ContainerResource Metrics.

Tuning autoscaling for a latency-sensitive production service? Talk to us at Coding Protocols — we help platform teams build scaling configurations that hold up under real traffic patterns.

Related Topics

Kubernetes
HPA
Autoscaling
KEDA
Prometheus
Custom Metrics
Platform Engineering
Performance
