Kubernetes
13 min read · April 25, 2026

Why Your HPA Isn't Scaling — Fixing It with Custom Metrics (KEDA + Prometheus)

HPA on CPU alone fails silently for most real workloads. I'll show you exactly why it breaks, how to diagnose it, and how KEDA + Prometheus custom metrics fix what native HPA can't.

Ajeet Yadav
Platform & Cloud Engineer

The Kubernetes HPA is one of those things that looks deceptively simple and behaves deceptively correctly — right up until the moment your pods aren't scaling and your on-call engineer is staring at a flat replica count while the queue depth climbs.

I've debugged this in enough clusters to say with confidence: the problem is almost never the HPA itself. It's the assumption that CPU is the right signal for your workload.

This post walks through why native HPA fails for common production patterns, how to diagnose exactly what's happening, and how to replace the broken signal with KEDA + Prometheus custom metrics.


Why CPU-based HPA fails silently

HPA targeting CPU utilization works on a simple formula:

desired replicas = ceil(current replicas × (current metric / target metric))
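
For example, 4 replicas averaging 80% CPU against a 50% target gives ceil(4 × 80 / 50) = ceil(6.4) = 7 replicas.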

When your app is CPU-bound — compute-heavy batch processing, image resizing, cryptographic work — this is fine. But the majority of production workloads are I/O-bound: they're waiting on database queries, downstream APIs, or message queues. An I/O-bound pod under heavy load has low CPU utilization because it's mostly blocked, not computing.

Result: HPA sees CPU at 15%, decides scaling isn't needed, and your p99 latency climbs to 8 seconds while pods sit idle.

This is the fundamental mismatch. CPU measures what the pod is doing computationally. What you actually care about is how much work is waiting to be done.
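
For reference, this is roughly the CPU-only configuration being critiqued (a minimal sketch using the autoscaling/v2 API; the names are placeholders):

yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70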


Diagnosing why your HPA isn't scaling

Before you replace anything, understand exactly what's happening.

Step 1 — Read the HPA status

bash
kubectl describe hpa my-app -n production

Look at the Conditions section. You'll see one of these:

AbleToScale    True    ReadyForNewScale
ScalingActive  False   FailedGetResourceMetric
ScalingLimited True    TooManyReplicas
  • ScalingActive: False / FailedGetResourceMetric (or FailedGetPodsMetric / FailedGetExternalMetric) — the HPA can't read the metric. The metrics pipeline is broken.
  • AbleToScale: False / FailedGetScale — the HPA can't fetch the scale subresource of the target; check the name and apiVersion in scaleTargetRef.
  • ScalingLimited: True / TooManyReplicas — you've hit maxReplicas. The HPA wants to scale but can't.
  • ScalingActive: True / ValidMetricFound but replicas aren't moving — metrics are flowing; the current value is simply below the threshold.

Step 2 — Check what value the HPA actually sees

bash
kubectl get hpa my-app -n production -o yaml

Look at status.currentMetrics. This is the live value the HPA is computing against. If it shows 0 or is absent, the metrics pipeline isn't delivering values.
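
A quick way to pull just that field (assuming jq is available):

bash
kubectl get hpa my-app -n production -o json | jq '.status.currentMetrics'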

Step 3 — Check the metrics API

bash
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/production/pods" | jq .

If this 404s, metrics-server isn't installed or isn't running:

bash
kubectl get pods -n kube-system | grep metrics-server

For custom/external metrics (the custom.metrics.k8s.io API):

bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'

If that API doesn't exist at all, the custom metrics adapter (Prometheus Adapter or KEDA) isn't installed.
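
KEDA serves its scalers through the external metrics API rather than the custom metrics one, so check that endpoint too:

bash
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq '.resources[].name'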

Step 4 — Check for scale-down stabilization

HPA has a built-in cooldown window — by default, 5 minutes before scaling down, 0 seconds before scaling up. If you're seeing replicas stuck high after a traffic spike, it's the stabilization window:

bash
kubectl get hpa my-app -n production -o jsonpath='{.spec.behavior}'
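
If the defaults don't suit your workload, tune them explicitly with a behavior stanza; a minimal sketch:

yaml
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 120   # default is 300
      policies:
        - type: Percent
          value: 50           # remove at most 50% of replicas
          periodSeconds: 60   # per minute
    scaleUp:
      stabilizationWindowSeconds: 0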

The three workload patterns where CPU fails

Pattern 1: Queue consumers

A Celery worker, SQS consumer, or Kafka consumer is doing I/O-bound work. When 10,000 messages pile up, the CPU of each worker is at 20% — most of the time is spent waiting for the broker. HPA does nothing.

The right signal: queue depth. Scale when messages > N per replica.

Pattern 2: High-concurrency APIs

An API server handling 500 concurrent requests may have low CPU because it's async and mostly awaiting responses from downstream services. The right signal is either request rate (RPS) or active connection count.

The right signal: requests per second or active HTTP connections from your load balancer or ingress.

Pattern 3: Latency-sensitive services

Your p99 is at 800ms and you want to scale out before it hits 1s SLA. CPU at the time: 30%. HPA won't act.

The right signal: p99 latency from Prometheus. Scale when it exceeds your SLO threshold.


Enter KEDA

KEDA (Kubernetes Event-Driven Autoscaling) is a graduated CNCF project built for exactly these patterns. It adds a ScaledObject CRD that:

  • Reads from Prometheus, Kafka, SQS, RabbitMQ, Redis, and 50+ other sources
  • Scales to zero when there's no work (and back up on the first event)
  • Integrates with the existing HPA controller — it doesn't replace it, it drives it

Install it:

bash
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
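
Then confirm the operator pods are running and the external metrics APIService is registered:

bash
kubectl get pods -n keda
kubectl get apiservice v1beta1.external.metrics.k8s.io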

KEDA + Prometheus: scaling on queue depth

Assume you have a worker that processes jobs and exposes a Prometheus metric job_queue_depth scraped by your in-cluster Prometheus.

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: my-worker
  minReplicaCount: 1
  maxReplicaCount: 20
  pollingInterval: 15
  cooldownPeriod: 60
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        metricName: job_queue_depth
        query: sum(job_queue_depth{namespace="production"})
        threshold: "10"   # scale up when queue depth > 10 per replica

With this config, KEDA polls Prometheus every 15 seconds. When sum(job_queue_depth) exceeds 10 × current_replicas, it scales up. When depth drops, it scales back down. When depth hits 0, it scales to minReplicaCount (or zero if you set minReplicaCount: 0).
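
Before wiring this up, it's worth confirming the query returns a single value. One way is to hit the Prometheus HTTP API directly (adjust the address to your environment):

bash
curl -sG "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query" \
  --data-urlencode 'query=sum(job_queue_depth{namespace="production"})' \
  | jq '.data.result'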


KEDA + Prometheus: scaling on RPS

Your app exposes http_requests_total via Prometheus. You want one replica per 100 RPS:

yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: http_rps
      query: |
        sum(rate(http_requests_total{
          namespace="production",
          job="my-api"
        }[2m]))
      threshold: "100"

KEDA evaluates the query result, divides by the threshold, and sets the desired replica count. At 350 RPS → 4 replicas. At 50 RPS → 1 replica (or 0 if scaled to zero).

One gotcha: the query must return a scalar or a single time series. If your PromQL returns multiple series (e.g., one per pod), sum it first. KEDA expects a single value to divide by the threshold.


KEDA + Prometheus: scaling on p99 latency

yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: api_p99_latency
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket{
            namespace="production",
            job="my-api"
          }[5m])) by (le)
        ) * 1000
      threshold: "500"   # scale up when p99 > 500ms

This is the pattern I reach for with latency-sensitive APIs. When p99 climbs above 500ms, scale out. When it recovers, scale back.

One subtlety: latency metrics lag behind load by the histogram window (5m here). For very spiky traffic, shrink the window to 2m or 1m — at the cost of noisier scaling decisions.


Combining CPU and custom metrics

You don't have to choose. KEDA supports multiple triggers and uses the highest desired replica count across all of them:

yaml
triggers:
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: job_queue_depth
      query: sum(job_queue_depth{namespace="production"})
      threshold: "10"

Now your deployment scales if either CPU exceeds 70% OR queue depth exceeds 10 per replica. Whichever demands more replicas wins.


Scale to zero — and back up

This is KEDA's most operationally interesting feature. Set minReplicaCount: 0 and your deployment scales to zero pods when there's no work:

yaml
minReplicaCount: 0
maxReplicaCount: 10

KEDA maintains a small controller loop that watches the trigger even when the deployment is at zero. When the Prometheus query returns a value above the threshold, it scales from 0 to 1, then the HPA takes over from there.

When to use it: batch workloads, dev/staging namespaces, background workers that only run during business hours. Not suitable for latency-sensitive APIs where cold-start time matters.
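
A minimal sketch of a scale-to-zero ScaledObject (assuming a recent KEDA version; the workload name is a placeholder, and activationThreshold controls the 0 → 1 transition):

yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-worker-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: batch-worker
  minReplicaCount: 0
  maxReplicaCount: 10
  cooldownPeriod: 300              # idle this long before dropping to zero
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        query: sum(job_queue_depth{namespace="production"})
        threshold: "10"
        activationThreshold: "1"   # wake from zero as soon as any work appears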


Debugging KEDA specifically

If your ScaledObject isn't behaving:

bash
# Check ScaledObject status
kubectl describe scaledobject worker-scaler -n production

# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100

# Check what value KEDA is reading from your trigger
kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/production/s0-prometheus-job_queue_depth" \
  | jq .

The external metrics API endpoint follows the pattern s0-prometheus-<metricName>. If it returns 0 when you expect a non-zero value, the Prometheus query itself is returning nothing — test it directly in Prometheus UI.

Common mistakes:

  • serverAddress is wrong — verify it from a pod inside the cluster, as shown below
  • PromQL returns multiple series — KEDA expects one; use sum() or add label filters
  • Prometheus scrape target is down — check Status → Targets in Prometheus UI
  • KEDA can't reach Prometheus — check NetworkPolicy; KEDA pods need egress to your Prometheus service
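
A quick way to test that address from inside the cluster (a throwaway pod; adjust the namespace and URL to your setup):

bash
kubectl run prom-check --rm -it --restart=Never -n keda \
  --image=curlimages/curl --command -- \
  curl -s -o /dev/null -w "%{http_code}\n" \
  http://prometheus.monitoring.svc.cluster.local:9090/-/ready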

The migration path

If you already have HPA deployed and want to migrate to KEDA:

  1. Delete the existing HPA: kubectl delete hpa my-app -n production
  2. Create the ScaledObject — KEDA will create a new HPA under the hood
  3. Verify with kubectl get hpa -n production — you'll see a KEDA-managed HPA appear
  4. Watch kubectl describe scaledobject for the first few scaling events

Don't keep both. KEDA creates its own HPA for the same target, and two HPAs fighting over the same deployment causes unpredictable behavior.


What to monitor after switching

Once KEDA is live, set up these alerts:

yaml
# ScaledObject hitting maxReplicas — you may need to raise the limit
# (assumes kube-state-metrics is installed and exposing HPA metrics)
- alert: KEDAMaxReplicasReached
  expr: |
    kube_horizontalpodautoscaler_status_current_replicas{namespace="production"}
      >= kube_horizontalpodautoscaler_spec_max_replicas{namespace="production"}
  for: 5m

# Scaler error rate — KEDA can't read its trigger
- alert: KEDAScalerError
  expr: rate(keda_scaler_errors_total[5m]) > 0
  for: 2m

Also watch how long your Prometheus query takes relative to KEDA's polling interval. If the query is expensive (wide time range, many series), it adds latency to every scaling decision. Keep KEDA's Prometheus queries lean.


Wrapping up

The mental model shift is this: HPA answers "how hard is my pod working?" Custom metrics answer "how much work is waiting?" For most real workloads, the second question is the right one.

If you want to generate the YAML for your next ScaledObject or HPA without writing it from scratch, the Autoscaling Config Builder handles both native HPA and KEDA ScaledObject output.

If you're also using KEDA for event-driven autoscaling with SQS or Kafka, the same ScaledObject pattern applies — just swap the trigger type.

Related Topics

Kubernetes
HPA
KEDA
Prometheus
Autoscaling
Platform Engineering
Observability
