Why Your HPA Isn't Scaling — Fixing It with Custom Metrics (KEDA + Prometheus)
HPA on CPU alone fails silently for most real workloads. I'll show you exactly why it breaks, how to diagnose it, and how KEDA + Prometheus custom metrics fix what native HPA can't.

The Kubernetes HPA is one of those things that looks deceptively simple and behaves deceptively correctly — right up until the moment your pods aren't scaling and your on-call engineer is staring at a flat replica count while the queue depth climbs.
I've debugged this in enough clusters to say with confidence: the problem is almost never the HPA itself. It's the assumption that CPU is the right signal for your workload.
This post walks through why native HPA fails for common production patterns, how to diagnose exactly what's happening, and how to replace the broken signal with KEDA + Prometheus custom metrics.
Why CPU-based HPA fails silently
HPA targeting CPU utilization works on a simple formula:
desired replicas = ceil(current replicas × (current metric / target metric))
When your app is CPU-bound — compute-heavy batch processing, image resizing, cryptographic work — this is fine. But the majority of production workloads are I/O-bound: they're waiting on database queries, downstream APIs, or message queues. An I/O-bound pod under heavy load has low CPU utilization because it's mostly blocked, not computing.
Result: HPA sees CPU at 15%, decides scaling isn't needed, and your p99 latency climbs to 8 seconds while pods sit idle.
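To see the failure mode in numbers (a hypothetical but typical case): 4 replicas, a 70% CPU target, and I/O-bound pods at 15% CPU while the queue grows. Plugging into the formula above:

```text
desired replicas = ceil(4 × (15 / 70)) = ceil(0.86) = 1
```

Far from scaling up, the HPA concludes it could safely run on a single replica.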
This is the fundamental mismatch. CPU measures what the pod is doing computationally. What you actually care about is how much work is waiting to be done.
Diagnosing why your HPA isn't scaling
Before you replace anything, understand exactly what's happening.
Step 1 — Read the HPA status
```bash
kubectl describe hpa my-app -n production
```

Look at the Conditions section. You'll see something like:

```text
Type            Status  Reason
AbleToScale     True    ReadyForNewScale
ScalingActive   False   FailedGetScale
ScalingLimited  True    TooManyReplicas
```
- `ScalingActive: False` / `FailedGetScale` — the HPA can't read the metric. The metrics pipeline is broken.
- `ScalingActive: False` / `ValidMetricFound: False` — no metrics are being returned for the target.
- `ScalingLimited: True` / `TooManyReplicas` — you've hit `maxReplicas`. The HPA wants to scale but can't.
- `ScalingActive: True` but replicas aren't moving — the current metric value is below the threshold.
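The events for the HPA object often spell out the same failure in plain language; worth checking alongside the conditions (this assumes the HPA is named my-app):

```bash
# Events emitted for the HPA object itself
kubectl get events -n production \
  --field-selector involvedObject.kind=HorizontalPodAutoscaler,involvedObject.name=my-app \
  --sort-by=.lastTimestamp
```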
Step 2 — Check what value the HPA actually sees
```bash
kubectl get hpa my-app -n production -o yaml
```

Look at `status.currentMetrics`. This is the live value the HPA is computing against. If it shows 0 or is absent, the metrics pipeline isn't delivering values.
Step 3 — Check the metrics API
```bash
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods" | jq .
```

If this 404s, metrics-server isn't installed or isn't running:
```bash
kubectl get pods -n kube-system | grep metrics-server
```

For custom/external metrics (the `custom.metrics.k8s.io` API):
```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'
```

If that API doesn't exist at all, the custom metrics adapter (Prometheus Adapter or KEDA) isn't installed.
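KEDA registers its scalers under the external metrics API rather than the custom one, so it's worth probing that group too:

```bash
# KEDA-provided metrics show up under external.metrics.k8s.io
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq '.resources[].name'
```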
Step 4 — Check for scale-down stabilization
HPA has a built-in stabilization window — by default, 5 minutes before scaling down and 0 seconds before scaling up. If you're seeing replicas stuck high after a traffic spike, this window is almost always the reason.
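If the defaults don't match your traffic pattern, you can override them via the HPA's `behavior` block (autoscaling/v2). A minimal sketch; the values are illustrative, not recommendations:

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 120   # default is 300
    policies:
      - type: Percent
        value: 50          # remove at most 50% of replicas...
        periodSeconds: 60  # ...per minute
  scaleUp:
    stabilizationWindowSeconds: 0     # default: react immediately
```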
Inspect what's currently configured with:

```bash
kubectl get hpa my-app -n production -o jsonpath='{.spec.behavior}'
```

The three workload patterns where CPU fails
Pattern 1: Queue consumers
A Celery worker, SQS consumer, or Kafka consumer is doing I/O-bound work. When 10,000 messages pile up, the CPU of each worker is at 20% — most of the time is spent waiting for the broker. HPA does nothing.
The right signal: queue depth. Scale when messages > N per replica.
Pattern 2: High-concurrency APIs
An API server handling 500 concurrent requests may have low CPU because it's async and mostly awaiting responses from downstream services.
The right signal: requests per second or active HTTP connections from your load balancer or ingress.
Pattern 3: Latency-sensitive services
Your p99 is at 800ms and you want to scale out before it breaches your 1s SLA. CPU at the time: 30%. HPA won't act.
The right signal: p99 latency from Prometheus. Scale when it exceeds your SLO threshold.
Enter KEDA
KEDA (Kubernetes Event-Driven Autoscaling) is a graduated CNCF project that replaces HPA for these patterns. It adds a ScaledObject CRD that:
- Reads from Prometheus, Kafka, SQS, RabbitMQ, Redis, and 50+ other sources
- Scales to zero when there's no work (and back up on the first event)
- Integrates with the existing HPA controller — it doesn't replace it, it drives it
Install it:
```bash
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
```

KEDA + Prometheus: scaling on queue depth
Assume you have a worker that processes jobs and exposes a Prometheus metric job_queue_depth scraped by your in-cluster Prometheus.
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: my-worker
  minReplicaCount: 1
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: job_queue_depth
        query: sum(job_queue_depth{namespace="production"})
        threshold: "10"  # scale up when queue depth > 10 per replica
```

With this config, KEDA polls Prometheus every 15 seconds. When sum(job_queue_depth) exceeds 10 × current_replicas, it scales up. When depth drops, it scales back down. When depth hits 0, it scales to minReplicaCount (or zero if you set minReplicaCount: 0).
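As a sanity check on the math, with this trigger and a hypothetical backlog of 180 messages:

```text
desired replicas = ceil(180 / 10) = 18   (capped by maxReplicaCount: 20)
```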
KEDA + Prometheus: scaling on RPS
Your app exposes http_requests_total via Prometheus. You want one replica per 100 RPS:
```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: http_rps
      query: |
        sum(rate(http_requests_total{
          namespace="production",
          job="my-api"
        }[2m]))
      threshold: "100"
```

KEDA evaluates the query result, divides by the threshold, and sets the desired replica count. At 350 RPS → 4 replicas. At 50 RPS → 1 replica (or 0 if scaled to zero).
One gotcha: the query must return a scalar or a single time series. If your PromQL returns multiple series (e.g., one per pod), sum it first. KEDA expects a single value to divide by the threshold.
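As an illustration, the un-aggregated form of the RPS query returns one series per pod, which the scaler can't use; wrapping it in sum() collapses it to a single value:

```promql
# One series per pod/instance: ambiguous for KEDA
rate(http_requests_total{namespace="production", job="my-api"}[2m])

# A single series: what the trigger needs
sum(rate(http_requests_total{namespace="production", job="my-api"}[2m]))
```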
KEDA + Prometheus: scaling on p99 latency
```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: api_p99_latency
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{
            namespace="production",
            job="my-api"
          }[5m])) by (le)
        ) * 1000
      threshold: "500"  # scale up when p99 > 500ms
```

This is the pattern I reach for with latency-sensitive APIs. When p99 climbs above 500ms, scale out. When it recovers, scale back.
One subtlety: latency metrics lag behind load by the histogram window (5m here). For very spiky traffic, shrink the window to 2m or 1m — at the cost of noisier scaling decisions.
Combining CPU and custom metrics
You don't have to choose. KEDA supports multiple triggers and uses the highest desired replica count across all of them:
```yaml
triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: job_queue_depth
      query: sum(job_queue_depth{namespace="production"})
      threshold: "10"
```

Now your deployment scales if either CPU exceeds 70% OR queue depth exceeds 10 per replica. Whichever demands more replicas wins.
Scale to zero — and back up
This is KEDA's most operationally interesting feature. Set minReplicaCount: 0 and your deployment scales to zero pods when there's no work:
```yaml
minReplicaCount: 0
maxReplicaCount: 10
```

KEDA maintains a small controller loop that watches the trigger even when the deployment is at zero. When the Prometheus query returns a value above the threshold, it scales from 0 to 1, then the HPA takes over from there.
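The 0-to-1 step has its own knob in recent KEDA versions: an activationThreshold on the trigger, separate from the scaling threshold, controls when the deployment is woken from zero at all. A sketch:

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      query: sum(job_queue_depth{namespace="production"})
      threshold: "10"            # drives 1..N scaling
      activationThreshold: "1"   # wakes the deployment from 0 as soon as any work appears
```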
When to use it: batch workloads, dev/staging namespaces, background workers that only run during business hours. Not suitable for latency-sensitive APIs where cold-start time matters.
Debugging KEDA specifically
If your ScaledObject isn't behaving:
```bash
# Check ScaledObject status
kubectl describe scaledobject worker-scaler -n production

# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100

# Check what value KEDA is reading from your trigger
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/production/s0-prometheus-job_queue_depth" \
  | jq .
```

The external metrics API endpoint follows the pattern s0-prometheus-<metricName>. If it returns 0 when you expect a non-zero value, the Prometheus query itself is returning nothing — test it directly in Prometheus UI.
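You can also run the exact trigger query against Prometheus' HTTP API instead of the UI; a sketch, assuming the same service address used in the ScaledObject:

```bash
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &

# Run the same PromQL the trigger uses and inspect the raw result
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(job_queue_depth{namespace="production"})' | jq '.data.result'
```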
Common mistakes:
- `serverAddress` is wrong — verify it from a pod in the cluster (one approach is sketched after this list)
- PromQL returns multiple series — KEDA expects one; use `sum()` or add label filters
- Prometheus scrape target is down — check Status → Targets in Prometheus UI
- KEDA can't reach Prometheus — check NetworkPolicy; KEDA pods need egress to your Prometheus service
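One way to verify the serverAddress from inside the cluster is a throwaway curl pod (the image and namespace here are assumptions):

```bash
# Hit the Prometheus health endpoint at the address KEDA will use
kubectl run curl-test -n production --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -s http://prometheus.monitoring.svc.cluster.local:9090/-/healthy
```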
The migration path
If you already have HPA deployed and want to migrate to KEDA:
- Delete the existing HPA: `kubectl delete hpa my-app -n production`
- Create the `ScaledObject` — KEDA will create a new HPA under the hood
- Verify with `kubectl get hpa -n production` — you'll see a KEDA-managed HPA appear
- Watch `kubectl describe scaledobject` for the first few scaling events
Don't keep both. KEDA creates its own HPA for the same target, and two HPAs fighting over the same deployment causes unpredictable behavior.
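A quick way to confirm only one HPA targets the deployment (the columns are just jsonpath into the HPA spec):

```bash
# Every HPA in the namespace and the workload it targets
kubectl get hpa -n production \
  -o custom-columns=NAME:.metadata.name,TARGET:.spec.scaleTargetRef.name
```

The KEDA-managed one is named keda-hpa-<ScaledObject name>, so anything else pointing at the same target should go.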
What to monitor after switching
Once KEDA is live, set up these alerts:
```yaml
# ScaledObject hitting maxReplicas — you may need to raise the limit
- alert: KEDAMaxReplicasReached
  expr: |
    keda_scaler_metrics_value / keda_scaler_metrics_value
    * on(scaledObject) group_left()
    (keda_scaled_object_max_replicas - keda_scaled_object_current_replicas == 0)
  for: 5m

# Scaler error rate — KEDA can't read its trigger
- alert: KEDAScalerError
  expr: rate(keda_scaler_errors_total[5m]) > 0
  for: 2m
```

Also keep an eye on how long your Prometheus query takes relative to KEDA's polling interval. If the query is expensive (wide time range, many series), it adds latency to every scale decision. Keep KEDA's Prometheus queries lean.
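One way to keep an expensive expression like the p99 query lean is to precompute it as a Prometheus recording rule and point the trigger at the precomputed series; a sketch (the rule name is an assumption):

```yaml
groups:
  - name: autoscaling-rules
    rules:
      - record: job:http_request_duration_seconds:p99_ms
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{namespace="production", job="my-api"}[5m])) by (le)
          ) * 1000
```

The trigger's query then becomes a single series lookup: job:http_request_duration_seconds:p99_ms.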
Wrapping up
The mental model shift is this: HPA answers "how hard is my pod working?" Custom metrics answer "how much work is waiting?" For most real workloads, the second question is the right one.
If you want to generate the YAML for your next ScaledObject or HPA without writing it from scratch, the Autoscaling Config Builder handles both native HPA and KEDA ScaledObject output.
If you're also using KEDA for event-driven autoscaling with SQS or Kafka, the same ScaledObject pattern applies — just swap the trigger type.


