14 min read · May 8, 2026

Kubernetes Resource Requests and Limits: The Complete Production Guide

Resource requests and limits control how Kubernetes schedules pods and what happens when nodes run out of capacity. Set them wrong and you get OOMKilled containers, CPU throttling that adds latency, or nodes that over-commit and evict your most important workloads. Here's how to get them right.

Coding Protocols Team
Platform Engineering

Resource requests and limits are two numbers on every container spec that directly control your cluster's stability, cost, and performance. Most teams either skip them entirely (causing over-commit and eviction cascades) or set them conservatively high (wasting 60–70% of provisioned compute).

Getting them right requires understanding what each number actually does — which is different from what most documentation implies.


Requests vs Limits: What They Actually Do

Requests

A resource request is what Kubernetes uses for scheduling. The scheduler finds a node with enough unallocated CPU and memory to satisfy all the requests of a pod. If no node has sufficient unallocated capacity, the pod stays Pending.

Requests do not cap usage. A container with a cpu: "500m" request can use 2 CPUs if the node has spare capacity. The request is a scheduling reservation and a QoS floor, not a usage ceiling.
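A minimal sketch of a requests-only spec (the container name and image are illustrative): the scheduler reserves 500m CPU and 256Mi of memory for this container, but at runtime nothing stops it from using more when the node has headroom.

yaml
containers:
  - name: app                # illustrative name
    image: myapp:latest
    resources:
      requests:              # used for scheduling and QoS only
        cpu: "500m"
        memory: "256Mi"
      # no limits: usage is not capped at runtime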

Limits

A resource limit is what the kernel enforces at runtime. CPU limits are enforced via CFS (Completely Fair Scheduler) bandwidth control — the container's CPU usage is throttled if it exceeds the limit. Memory limits are enforced via cgroups — if a container exceeds its memory limit, the kernel OOMKills it.

The key difference:

  • Exceeding CPU limit → throttled (slower, but still running)
  • Exceeding memory limit → OOMKilled (container killed and restarted)

QoS Classes

Kubernetes assigns each pod one of three Quality of Service classes based on its resource configuration. QoS class determines eviction priority when a node is under memory pressure.

Guaranteed

Every container in the pod has equal requests and limits set for both CPU and memory:

yaml
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Guaranteed pods are the last to be evicted. Use for production workloads that must not be interrupted.

Burstable

At least one container has a request or limit set, but the pod doesn't qualify as Guaranteed (requests ≠ limits, or some values are missing):

yaml
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

Burstable pods can use resources beyond their request when capacity is available. They're evicted after BestEffort pods but before Guaranteed pods.

BestEffort

No requests or limits set on any container:

yaml
# No resources block — BestEffort
containers:
  - name: app
    image: myapp:latest

BestEffort pods are the first evicted under pressure. Never use this in production — if the node is under pressure, your pod is gone.


CPU: Throttling Is the Silent Performance Killer

CPU limits are frequently misunderstood. The common mental model: "set CPU limit to 1 CPU, container uses at most 1 CPU." The reality is more subtle.

CPU limits use CFS bandwidth control. A container with cpu: "1" gets 100ms of CPU time per 100ms scheduling period. A multi-threaded process can burn through that quota in the first 40ms of a period (a burst), after which it's throttled for the remaining 60ms, even if other CPUs on the node are idle.

This throttling is invisible in most metrics. kubectl top pods shows CPU usage, not CPU throttling. A container that shows 500m CPU usage can be heavily throttled if it bursts periodically.

How to check CPU throttling:

promql
# Fraction of CFS periods in which the container was throttled (cAdvisor metrics scraped via the kubelet)
rate(container_cpu_cfs_throttled_periods_total{container="my-app"}[5m]) /
rate(container_cpu_cfs_periods_total{container="my-app"}[5m])

A throttling rate above 25% indicates your CPU limit is too low for your actual usage pattern, and your application is slower than it needs to be.
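If you want this check running continuously rather than ad hoc, the same ratio can drive an alert. A sketch of a PrometheusRule, assuming the cluster runs the Prometheus Operator and scrapes the cAdvisor metrics used above; the name, namespace, threshold, and labels are illustrative.

yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-throttling              # illustrative
  namespace: monitoring
spec:
  groups:
    - name: cpu-throttling
      rules:
        - alert: HighCPUThrottling
          expr: |
            sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m]))
              /
            sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total{container!=""}[5m])) > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is throttled in more than 25% of CPU periods"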

The Case Against CPU Limits

A significant portion of the Kubernetes community runs without CPU limits entirely, relying only on CPU requests for scheduling. The argument:

  • CPU is a compressible resource — throttling degrades performance but doesn't crash the pod
  • On a well-utilised cluster, idle CPU is wasted; limits prevent pods from using spare capacity
  • Throttling latency is often worse than the risk of a noisy-neighbour CPU hog

This works if your cluster has good bin-packing (nodes are well-utilised most of the time) and you monitor for pods whose CPU usage runs far above their requests. It doesn't work on clusters where a single runaway process can starve critical workloads.

Practical guidance: always set CPU requests. Treat CPU limits as optional for application pods, and set them for critical system pods where a runaway process must be contained.
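A sketch of that guidance applied to an ordinary application container (values are illustrative): memory request equal to memory limit, CPU request set, CPU limit deliberately omitted.

yaml
resources:
  requests:
    cpu: "250m"          # scheduling weight; the container may burst above this
    memory: "512Mi"
  limits:
    memory: "512Mi"      # equal to the request, so this pod never over-commits memory
    # no cpu limit: spare node CPU can be used without CFS throttling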

Memory: Always Set Both Requests and Limits

Memory is incompressible — the kernel cannot reclaim memory from a container without killing it. For memory:

  • Always set requests for accurate scheduling
  • Always set limits to prevent one container from consuming all node memory and triggering evictions across the node
  • Set requests == limits for Guaranteed QoS on critical workloads

If a container's memory usage grows over time (a memory leak), the limit ensures the process is eventually OOMKilled rather than silently consuming node memory until the node becomes unstable.


OOMKilled: Diagnosis and Fix

When a container exceeds its memory limit, the kernel kills it. Kubernetes reports this as OOMKilled:

bash
kubectl describe pod <pod> -n <namespace>
# Look for: Last State: Terminated, Reason: OOMKilled

Diagnosis steps:

  1. Check the memory limit:

    bash
    kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources.limits.memory}'
  2. Check actual memory usage over time (if you have Prometheus):

    promql
    max_over_time(container_memory_working_set_bytes{container="my-app"}[1h])

  3. Determine if the OOMKill was from a genuine memory leak, a legitimate spike (large request, batch job), or an under-sized limit.

Fixes by cause:

  • Limit too low for normal usage → increase the memory limit
  • Legitimate spike (batch processing) → increase the limit or split the work into smaller batches
  • Memory leak → fix the application; the limit is correctly catching it
  • JVM not respecting container limits → add -XX:MaxRAMPercentage=75 to the JVM flags

JVM note: Java applications frequently OOMKill because older or non-container-aware JVMs size the default heap from the host's RAM (25% of it), not the container's limit. On a node with 64GB RAM, such a JVM in a 2GB-limit container defaults to a 16GB maximum heap and exhausts the container limit as soon as the heap grows past it. Always set explicit heap limits for JVM workloads:

yaml
env:
  - name: JAVA_OPTS
    value: "-XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=50.0"

-XX:MaxRAMPercentage=75 sets the heap to 75% of the container's memory limit (not the node's RAM), leaving 25% for non-heap JVM overhead.


Setting the Right Values

Step 1: Measure Actual Usage

Run the application under realistic load and observe actual resource usage:

bash
# Current usage
kubectl top pods -n production

# Historical (if you have Prometheus)
# P99 CPU usage over the last 7 days
quantile_over_time(0.99, rate(container_cpu_usage_seconds_total{container="my-app"}[5m])[7d:5m])

# P99 memory usage
quantile_over_time(0.99, container_memory_working_set_bytes{container="my-app"}[7d:5m])

Step 2: Set Requests at the P50–P75 of Normal Usage

Requests should reflect typical usage, not peak. Over-sizing requests wastes allocatable capacity and makes scheduling harder.

Step 3: Set Memory Limits at P99 + 20–30% Buffer

Enough buffer to absorb normal spikes without OOMKill; tight enough to catch genuine leaks.

Step 4: Set CPU Requests Accurately; Decide on CPU Limits Deliberately

If you set CPU limits, set them at P99 of CPU usage or higher. Below P99 causes throttling during normal load spikes.
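Putting the four steps together for a hypothetical service whose measured usage is roughly 150m CPU / 300Mi memory at P75 and 300m / 400Mi at P99 (numbers invented for illustration):

yaml
resources:
  requests:
    cpu: "150m"          # ~P75 of measured CPU usage
    memory: "300Mi"      # ~P75 of measured memory usage
  limits:
    memory: "512Mi"      # ~P99 (400Mi) plus a ~25% buffer
    # CPU limit omitted; if you set one, keep it at or above the 300m P99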


LimitRange: Namespace-Level Defaults and Constraints

LimitRange objects set default requests/limits for containers in a namespace and enforce minimum/maximum values. This prevents BestEffort pods by ensuring every container gets resources even if the developer doesn't specify them.

yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:              # Applied when no limit is specified
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:       # Applied when no request is specified
        cpu: "100m"
        memory: "128Mi"
      max:                  # Hard ceiling — pods exceeding this are rejected
        cpu: "4"
        memory: "8Gi"
      min:                  # Hard floor — pods below this are rejected
        cpu: "50m"
        memory: "64Mi"

With this LimitRange, a pod with no resource spec gets 100m/128Mi requests and 500m/512Mi limits automatically — Burstable QoS by default.

Combine with a Kyverno policy that requires explicit resource settings for production workloads:

yaml
# Kyverno: reject pods without explicit resource requests in production
validate:
  message: "CPU and memory requests are required."
  pattern:
    spec:
      containers:
        - resources:
            requests:
              memory: "?*"
              cpu: "?*"

ResourceQuota: Namespace-Level Budget

ResourceQuota caps the total resources a namespace can consume:

yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"

This prevents a single namespace from consuming all cluster capacity. When the quota is reached, new pods are rejected with a clear error. Size quotas based on team allocation, not guesswork — measure actual usage first.


Vertical Pod Autoscaler (VPA) for Right-Sizing

VPA observes actual resource usage and recommends (or automatically applies) adjusted requests and limits. It's useful for right-sizing workloads where you don't know the correct values upfront.

Run VPA in Off mode to get recommendations without automatic changes:

yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"    # Recommend only, don't change

Check recommendations:

bash
kubectl describe vpa api-vpa -n production
# Look for: Recommendation section with Lower Bound, Target, Upper Bound

VPA's Target recommendation is a reasonable starting point for requests. Apply it, observe for a week, adjust if needed.

Warning: do not run VPA in Auto mode alongside HPA on the same resource metric. VPA adjusting requests causes HPA to recalculate desired replicas, causing thrashing. Use VPA in Off or Initial mode for production, and only apply recommendations manually after review.
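If you do let VPA apply recommendations (for example in Initial mode), it's worth bounding what it may set. A sketch of the resourcePolicy section added to the VerticalPodAutoscaler above; the bounds are illustrative.

yaml
spec:
  updatePolicy:
    updateMode: "Initial"          # apply recommendations only when pods are (re)created
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]
        minAllowed:
          cpu: "50m"
          memory: "64Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"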


Frequently Asked Questions

What happens when a node runs out of memory?

The kubelet begins evicting pods in order: BestEffort first, then Burstable (starting with those furthest over their requests), then Guaranteed last. Evicted pods are rescheduled on other nodes. If all nodes are under pressure, pods stay Pending until capacity frees up or new nodes are provisioned.

Should I set the same values for requests and limits?

For memory on production workloads: yes, set them equal for Guaranteed QoS. For CPU: equal values mean the container is always throttled if it tries to burst — often not what you want. A common pattern: equal memory requests and limits, CPU limit set 2–4x the CPU request.

Why does kubectl top show lower memory than my limit but the container still OOMKills?

kubectl top shows container_memory_working_set_bytes, which excludes file-backed pages that can be reclaimed. The OOMKill triggers on the cgroup's total usage (container_memory_usage_bytes), which includes those pages. Your container may be using more memory than kubectl top shows. Use container_memory_rss in Prometheus for a better picture of the application's own (anonymous) memory usage.

Is there a rule of thumb for setting initial values?

For a typical Go or Node.js service with no prior data: start with 100m CPU request, 128Mi memory request, 512Mi memory limit. For JVM services: 250m CPU request, 512Mi memory request, 1Gi memory limit. Observe actual usage for one week, then right-size based on measured P95 values. These are starting points, not targets.
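Those starting points as a manifest fragment, for a hypothetical Go or Node.js service with no usage history yet:

yaml
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    memory: "512Mi"
    # revisit after a week of real traffic; these are placeholders, not targets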


For autoscaling on top of resource configuration, see Kubernetes HPA Beyond CPU. For enforcing resource requirements cluster-wide, see Kubernetes RBAC in Practice (Kyverno patterns) and Supply Chain Security Tools for Kubernetes. For Goldilocks and other VPA-based tooling that right-sizes workloads across namespaces, see Kubernetes Cost Optimization: FinOps Patterns for EKS at Scale.

Trying to right-size resource configuration across a large cluster? Talk to us at Coding Protocols — we help platform teams implement resource governance that reduces waste without sacrificing reliability.

Related Topics

Kubernetes
Resource Management
CPU
Memory
QoS
Platform Engineering
Performance
Best Practices
