Pod Topology Spread Constraints for High Availability
Use pod topology spread constraints to distribute workloads evenly across availability zones and nodes. Covers maxSkew, whenUnsatisfiable, topologyKey, and how to combine spread constraints with node affinity for zone-aware HA deployments.
Before you begin
- kubectl installed and configured
- A multi-node Kubernetes cluster with nodes in multiple zones (EKS or GKE recommended)
- Basic familiarity with Kubernetes Deployments and node labels
A 6-replica Deployment with no placement controls can end up with all 6 pods on a single node or in a single availability zone. When that zone loses network connectivity, your entire application has zero replicas. The Kubernetes scheduler isn't trying to hurt you: by default it scores nodes mainly on resource fit and spreads replicas only on a best-effort basis, so nothing guarantees redundancy.
Pod topology spread constraints give the scheduler explicit distribution rules. You define the maximum allowed imbalance (maxSkew) between topology domains (nodes, zones, regions), and the scheduler ensures new pods honor that constraint.
What You'll Build
A 6-replica Deployment spread evenly across 3 availability zones with maxSkew: 1, so no zone ever has more than one extra pod compared to the least-loaded zone. Then a second constraint to also spread across individual nodes within each zone.
Step 1: Verify Zone Labels on Your Nodes
Cloud-managed clusters (EKS, GKE, AKS) add zone labels automatically. Verify yours:
kubectl get nodes -L topology.kubernetes.io/zone
# NAME        STATUS   ROLES    AGE   VERSION   ZONE
# node-1a-1   Ready    <none>   5d    v1.29.0   us-east-1a
# node-1a-2   Ready    <none>   5d    v1.29.0   us-east-1a
# node-1b-1   Ready    <none>   5d    v1.29.0   us-east-1b
# node-1b-2   Ready    <none>   5d    v1.29.0   us-east-1b
# node-1c-1   Ready    <none>   5d    v1.29.0   us-east-1c
# node-1c-2   Ready    <none>   5d    v1.29.0   us-east-1c
If you're using kind for local testing, add the labels manually:
kubectl label nodes kind-worker topology.kubernetes.io/zone=zone-a
kubectl label nodes kind-worker2 topology.kubernetes.io/zone=zone-b
kubectl label nodes kind-worker3 topology.kubernetes.io/zone=zone-c
Step 2: Deploy Without Spread Constraints
First, see what the scheduler does without guidance:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-unconstrained
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-unconstrained
  template:
    metadata:
      labels:
        app: web-unconstrained
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
EOF
# Column 7 of the wide output is the node name
kubectl get pods -l app=web-unconstrained -o wide --no-headers | awk '{print $7}' | sort | uniq -c
# The distribution may look like:
#   4 node-1a-1
#   2 node-1a-2
# All 6 pods in zone us-east-1a, none in 1b or 1c
The scheduler filled the emptiest nodes first. In a freshly deployed cluster, that often means one zone. A zone-1a failure takes out all 6 replicas.
Clean up:
kubectl delete deployment web-unconstrained
Step 3: Add Zone-Level Topology Spread Constraints
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: nginx
        image: nginx:1.25
EOF
Step 4: Understand Each Field
maxSkew: 1
The maximum permitted difference between the number of matching pods in the candidate domain (the zone being evaluated for the incoming pod) and the global minimum (the fewest matching pods in any eligible domain). The scheduler evaluates each candidate domain individually — it is not a simple global max-minus-min formula.
With maxSkew: 1 and 3 zones, candidate distributions evaluate as follows:
- [2, 2, 2]: global min = 2; all zones: 2 − 2 = 0 ≤ 1 ✓ (optimal)
- [3, 2, 1]: global min = 1; zone A: 3 − 1 = 2 > 1 ✗ (violates maxSkew: 1)
- [2, 2, 2] → scheduling a 7th pod results in [3, 2, 2]: global min = 2; zone A: 3 − 2 = 1 ≤ 1 ✓ (still valid)
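The arithmetic in the list above can be checked with a few lines of shell. This is only a sketch of the skew rule, not the scheduler's code, and skew_ok is a hypothetical helper name introduced for illustration.

```shell
# skew_ok MAX_SKEW COUNT [COUNT...]
# Succeeds (exit 0) when every domain's pod count is within MAX_SKEW of the
# least-loaded domain; fails (exit 1) otherwise.
skew_ok() {
  local max_skew="$1" min count
  shift
  min="$1"
  # Find the global minimum across all domains
  for count in "$@"; do
    if [ "$count" -lt "$min" ]; then min="$count"; fi
  done
  # Compare each domain against that minimum
  for count in "$@"; do
    if [ $((count - min)) -gt "$max_skew" ]; then return 1; fi
  done
  return 0
}

skew_ok 1 2 2 2 && echo "[2, 2, 2] satisfies maxSkew 1"
skew_ok 1 3 2 1 || echo "[3, 2, 1] violates maxSkew 1"
skew_ok 1 3 2 2 && echo "[3, 2, 2] satisfies maxSkew 1"
```

The helper checks a finished distribution. The scheduler itself applies the same comparison to one candidate domain at a time, before each pod is placed.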
topologyKey: topology.kubernetes.io/zone
The node label that defines topology domains. The scheduler groups nodes by the value of this label:
- us-east-1a = one domain
- us-east-1b = one domain
- us-east-1c = one domain
Any node label works as topologyKey. kubernetes.io/hostname gives per-node granularity.
whenUnsatisfiable: DoNotSchedule
Hard constraint. The pod stays Pending if placing it would violate maxSkew. Use this when you need a guarantee, and you're confident the cluster always has enough nodes in each zone.
whenUnsatisfiable: ScheduleAnyway
Soft constraint. The scheduler prefers the constraint but will violate it if necessary. The scheduler assigns a penalty score to placements that increase skew and avoids them when possible — but won't block the pod.
labelSelector
The spread constraint counts only pods that match this selector. It must match the pod's own labels; otherwise the constraint counts zero pods in every domain and spread is never enforced. The simplest way to get this right is to reuse the labels from spec.selector.matchLabels, which the pod template is already required to carry.
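To make the matching rule concrete, here is a minimal sketch of how a matchLabels selector relates to a pod's labels: every selector pair must be present on the pod, and extra pod labels are ignored. The function name selector_matches and the comma-separated text format are assumptions made for this example, and the pod-template-hash value is a placeholder.

```shell
# selector_matches SELECTOR POD_LABELS
# Both arguments are comma-separated key=value lists (a hypothetical text
# format chosen for this sketch). Succeeds when every selector pair is
# present among the pod's labels.
selector_matches() {
  local selector="$1" labels=",$2," pair
  local IFS=,
  for pair in $selector; do
    case "$labels" in
      *",$pair,"*) ;;   # this pair is on the pod, keep checking
      *) return 1 ;;    # pair missing: the pod is not counted
    esac
  done
  return 0
}

selector_matches "app=web" "app=web,pod-template-hash=abc123" && echo "pod is counted"
selector_matches "app=website" "app=web,pod-template-hash=abc123" || echo "pod is not counted"
```

A typo like the one in the second call is enough to make every domain look empty to the constraint.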
Step 5: Verify the Distribution
kubectl get pods -l app=web -o wide
# (abridged; the ZONE column is added here for readability, kubectl does not print it)
# NAME        READY   STATUS    NODE        ZONE
# web-abc-1   1/1     Running   node-1a-1   us-east-1a
# web-abc-2   1/1     Running   node-1a-2   us-east-1a
# web-abc-3   1/1     Running   node-1b-1   us-east-1b
# web-abc-4   1/1     Running   node-1b-2   us-east-1b
# web-abc-5   1/1     Running   node-1c-1   us-east-1c
# web-abc-6   1/1     Running   node-1c-2   us-east-1c
Exactly 2 pods per zone, a skew of 0. The spread constraint worked.
Count pods per zone to verify:
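# Pods don't carry a zone, so look up the zone label of each pod's node
for node in $(kubectl get pods -l app=web -o jsonpath='{.items[*].spec.nodeName}'); do
  kubectl get node "$node" -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}'
done | sort | uniq -c
#   2 us-east-1a
#   2 us-east-1b
#   2 us-east-1c
Step 6: Add Node-Level Spread Within Each Zone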
Two constraints can run simultaneously. The first distributes across zones; the second distributes across individual nodes within each zone:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: nginx
        image: nginx:1.25
EOF
The first constraint is hard (DoNotSchedule): zone balance is non-negotiable. The second is soft (ScheduleAnyway): node balance within a zone is preferred but won't block scheduling if a zone has fewer nodes.
Both constraints are evaluated simultaneously. The scheduler must satisfy both when possible. When they conflict, the hard constraint takes precedence — zone balance is preserved even if it means violating the node-level skew.
Step 7: minDomains for Cluster-Aware Scheduling
Without minDomains, skew is calculated only against the topology domains that currently exist, meaning zones that contain at least one eligible node. If the cluster temporarily has nodes in just one zone (for example, before an autoscaler has provisioned the others), that zone is the only domain: its own count is the global minimum, and all 6 pods pile into zone-a with a skew of 0.
minDomains fixes this by declaring how many eligible topology domains you expect. While fewer exist, the scheduler limits how many pods each existing domain may take:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        minDomains: 3
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: nginx
        image: nginx:1.25
EOF
With minDomains: 3, if fewer than 3 eligible zones exist in the cluster, the global minimum is treated as 0. Each existing zone can then hold at most maxSkew matching pods (one, in this example), and the remaining pods stay Pending until a third zone is available. This prevents pods from piling into a single zone during cluster scale-up or zone recovery.
Note: eligible topology domains are determined by nodes carrying the topologyKey label — a zone counts as eligible even if it has no matching pods (its pod count is simply 0). The purpose of minDomains is to guard against scenarios where fewer zones exist than your architecture requires.
minDomains was introduced as an alpha field in Kubernetes v1.24 behind the MinDomainsInPodTopologySpread feature gate and graduated to stable in v1.30. On clusters older than v1.30, confirm that the gate is enabled before relying on the field.
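The effect of minDomains on the global minimum can be sketched in shell. This is an illustrative model of the rule described in this step, not the scheduler's implementation; global_min and can_place are hypothetical helper names, and an unset minDomains is modelled as 1.

```shell
# global_min MIN_DOMAINS COUNT [COUNT...]
# Prints the global minimum used in the skew check: 0 when fewer eligible
# domains exist than MIN_DOMAINS, otherwise the smallest per-domain count.
global_min() {
  local min_domains="$1" min count
  shift
  if [ "$#" -lt "$min_domains" ]; then
    echo 0
    return
  fi
  min="$1"
  for count in "$@"; do
    if [ "$count" -lt "$min" ]; then min="$count"; fi
  done
  echo "$min"
}

# can_place MAX_SKEW MIN_DOMAINS CANDIDATE_COUNT COUNT [COUNT...]
# Succeeds when one more pod fits in the candidate domain. The trailing
# COUNT list holds the current count of every eligible domain.
can_place() {
  local max_skew="$1" min_domains="$2" candidate="$3"
  shift 3
  [ $((candidate + 1 - $(global_min "$min_domains" "$@"))) -le "$max_skew" ]
}

# Only zone-a exists and it already runs 1 matching pod.
can_place 1 1 1 1 && echo "minDomains unset: a 2nd pod fits in the only zone"
can_place 1 3 1 1 || echo "minDomains 3: the 2nd pod stays Pending"
```

With a single zone and no minDomains, the zone's own count is the minimum, so it never looks skewed. Forcing the minimum to 0 is what stops the pile-up.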
Step 8: Combining with Node Affinity
Node affinity and topology spread constraints compose: affinity filters which nodes are eligible, and the spread constraint distributes among the eligible nodes.
Deploy pods only on role=web nodes, spread across zones:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-affinity
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-affinity
  template:
    metadata:
      labels:
        app: web-affinity
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: role
                operator: In
                values:
                - web
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-affinity
      containers:
      - name: nginx
        image: nginx:1.25
EOF
With nodeAffinityPolicy: Honor (the default), only zones containing at least one role=web node form eligible topology domains for the spread calculation. Pods are counted within those eligible domains. If zone-b has no role=web nodes, zone-b is not an eligible domain: pods spread only across zone-a and zone-c, and the constraint doesn't try to balance against non-web zones.
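Before relying on this, it helps to see which zones actually contain role=web nodes. The sketch below assumes the default kubectl get nodes column layout, where a single -L column is printed sixth. eligible_zones is a hypothetical helper name, and the sample rows are made up to match the node names used earlier.

```shell
# eligible_zones: reads `kubectl get nodes ... -L topology.kubernetes.io/zone
# --no-headers` output on stdin and prints each distinct zone once.
# Assumes the column order NAME STATUS ROLES AGE VERSION ZONE.
eligible_zones() {
  awk '$6 != "" { print $6 }' | sort -u
}

# Against a live cluster (not run here):
#   kubectl get nodes -l role=web -L topology.kubernetes.io/zone --no-headers | eligible_zones

# Offline demonstration with sample rows in the same layout:
printf '%s\n' \
  'node-1a-1 Ready <none> 5d v1.29.0 us-east-1a' \
  'node-1c-1 Ready <none> 5d v1.29.0 us-east-1c' \
  'node-1c-2 Ready <none> 5d v1.29.0 us-east-1c' | eligible_zones
```

If the list is shorter than you expect, the spread constraint balances across fewer domains than your architecture assumes.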
Step 9: Simulate a Zone Failure
Cordon all nodes in one zone to simulate it going offline:
kubectl cordon node-1c-1 node-1c-2
# Cordoning only blocks new pods; restart the Deployment so every replica is rescheduled
kubectl rollout restart deployment/web
What actually happens with DoNotSchedule: cordoned nodes keep the topology.kubernetes.io/zone label, so zone-c remains an eligible domain for skew calculation; it simply has no schedulable nodes. As its pod count drops toward 0, the scheduler can't place replacements there, and zones a and b may exceed that count by at most maxSkew, so some new pods stay Pending.
To observe pods rescheduling into a [3, 3, 0] distribution when a zone is lost, use whenUnsatisfiable: ScheduleAnyway on the zone constraint (and remove minDomains, which is only valid with DoNotSchedule). With ScheduleAnyway, the scheduler accepts the increased skew rather than blocking placement:
kubectl get pods -l app=web -o wide
# With ScheduleAnyway: all 6 pods distribute across zone-a and zone-b
# [3, 3, 0]: the soft constraint accepted the imbalance
# With DoNotSchedule (hard): some pods stay Pending, because zone-a and zone-b
# may exceed the empty zone-c by at most maxSkew
Restore zone-c:
kubectl uncordon node-1c-1 node-1c-2
kubectl rollout restart deployment/web
kubectl get pods -l app=web -o wide
# [2, 2, 2]: back to balanced
When you uncordon zone-c, the spread constraint isn't automatically enforced for running pods, only for newly scheduled ones. The rollout restart reschedules all pods, allowing the constraint to produce the optimal distribution.
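The Pending behaviour of the hard constraint during the simulated outage can be reproduced offline. The sketch below models the rule as this step describes it: a cordoned zone still counts toward the global minimum but can't receive pods. placeable_zones and its ZONE:COUNT[:cordoned] argument format are assumptions made for this example.

```shell
# placeable_zones MAX_SKEW ZONE:COUNT[:cordoned]...
# Prints the zones that could accept one more matching pod under a hard
# (DoNotSchedule) constraint. Cordoned zones still count toward the global
# minimum but are never offered as a target.
placeable_zones() {
  local max_skew="$1" entry zone count rest min=""
  shift
  # Pass 1: global minimum over every zone, cordoned or not
  for entry in "$@"; do
    rest="${entry#*:}"; count="${rest%%:*}"
    if [ -z "$min" ] || [ "$count" -lt "$min" ]; then min="$count"; fi
  done
  # Pass 2: test each schedulable zone as the candidate
  for entry in "$@"; do
    zone="${entry%%:*}"
    rest="${entry#*:}"; count="${rest%%:*}"
    case "$entry" in *:cordoned) continue ;; esac
    if [ $((count + 1 - min)) -le "$max_skew" ]; then echo "$zone"; fi
  done
}

# Prints zone-a and zone-b: the first two pods fit
placeable_zones 1 zone-a:0 zone-b:0 zone-c:0:cordoned
# Prints nothing: a 2nd pod in either zone would exceed empty zone-c by 2
placeable_zones 1 zone-a:1 zone-b:1 zone-c:0:cordoned
```

An empty result corresponds to the pod staying Pending.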
Common Mistakes to Avoid
labelSelector doesn't match the pod's own labels: the most common mistake. The spread constraint counts pods matching the selector to determine domain load. If the selector is wrong or omitted, all domains appear to have zero pods, and the constraint has no effect. Make sure topologySpreadConstraints[].labelSelector selects the labels in the pod template; reusing spec.selector.matchLabels is the simplest way.
whenUnsatisfiable: DoNotSchedule without enough zones — if you require spreading across 3 zones but only 2 zones have schedulable nodes (e.g., during maintenance), new pods stay Pending indefinitely. Use ScheduleAnyway for the secondary constraint, or use minDomains carefully.
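topologyKey that doesn't match any node label: if your nodes don't have topology.kubernetes.io/zone set (common on self-managed clusters), the scheduler skips them when evaluating the constraint. With DoNotSchedule, pods cannot be placed on unlabeled nodes and stay Pending if no node carries the label; with ScheduleAnyway, the constraint simply has nothing to balance. Verify the node labels before adding constraints.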
Per-node maxSkew: 1 on an uneven cluster: take 7 pods across 3 nodes that have room for 4, 2, and 1 pods. The scheduler reaches [2, 2, 1], and then the smallest node is full, so the global minimum is stuck at 1. A third pod on the first node would give 3 − 1 = 2 > 1, so the last 2 pods stay Pending even though that node has free capacity. Some distributions are simply unsatisfiable with maxSkew: 1. Consider maxSkew: 2 for clusters with significantly unequal node capacity.
Forgetting that existing pods don't re-balance automatically: topology spread constraints apply only at scheduling time. Running pods are never moved when the cluster changes, for example when a zone recovers or nodes are added. Run kubectl rollout restart deployment/<name> to reschedule all pods under the current constraints.
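The uneven-capacity case above is easy to explore with a small simulation. This is a deliberately simplified greedy model (one pod at a time, least-loaded node first, capacity expressed as a pod count), not the real scheduler's scoring. simulate_spread is a hypothetical helper name.

```shell
# simulate_spread PODS MAX_SKEW CAPACITY [CAPACITY...]
# Greedily places PODS pods one at a time on the least-loaded node that has
# free capacity and passes the maxSkew check, then prints the final per-node
# counts and the number of pods left Pending.
simulate_spread() {
  local pods="$1" max_skew="$2"
  shift 2
  echo "$@" | awk -v pods="$pods" -v max_skew="$max_skew" '
    {
      n = NF
      for (i = 1; i <= n; i++) { cap[i] = $i; count[i] = 0 }
      pending = 0
      for (p = 1; p <= pods; p++) {
        # Global minimum over all nodes, full or not
        min = count[1]
        for (i = 2; i <= n; i++) if (count[i] < min) min = count[i]
        best = 0
        for (i = 1; i <= n; i++) {
          if (count[i] >= cap[i]) continue             # node is full
          if (count[i] + 1 - min > max_skew) continue  # would break maxSkew
          if (best == 0 || count[i] < count[best]) best = i
        }
        if (best == 0) pending++; else count[best]++
      }
      out = ""
      for (i = 1; i <= n; i++) out = out (i > 1 ? " " : "") count[i]
      print "placed: [" out "] pending: " pending
    }'
}

simulate_spread 7 1 4 2 1   # placed: [2 2 1] pending: 2
simulate_spread 7 2 4 2 1   # placed: [3 2 1] pending: 1
```

Raising maxSkew to 2 lets one more pod through in this model. It does not remove the underlying limit set by the smallest node.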
Cleanup
kubectl delete deployment web web-affinity
kubectl uncordon node-1c-1 node-1c-2 # if you ran the failure simulation
What's Next
- Node Affinity, Taints & Tolerations — the scheduling building block that composes with topology spread constraints
- Descheduler — a controller that evicts and reschedules pods to rebalance the cluster when spread constraints drift over time (e.g., after node additions)
Official References
- Pod Topology Spread Constraints: complete reference for all fields, interaction with node affinity, and cluster-level defaults
- Well-Known Labels, Annotations and Taints: canonical label names like topology.kubernetes.io/zone and kubernetes.io/hostname
- Kubernetes Scheduler Framework: how the scheduler evaluates topology spread as a Filter and Score plugin