Pod Topology Spread Constraints for High Availability

Advanced · 40 min to complete · 14 min read

Use pod topology spread constraints to distribute workloads evenly across availability zones and nodes. Covers maxSkew, whenUnsatisfiable, topologyKey, and how to combine spread constraints with node affinity for zone-aware HA deployments.

Before you begin

  • kubectl installed and configured
  • A multi-node Kubernetes cluster with nodes in multiple zones (EKS or GKE recommended)
  • Basic familiarity with Kubernetes Deployments and node labels

A 6-replica Deployment with no placement controls can end up with most or all of its pods on a single node or in a single availability zone. When that zone loses network connectivity, your application can drop to zero replicas. The Kubernetes scheduler isn't trying to hurt you: its built-in spreading is only a soft, best-effort preference, so it makes no redundancy guarantee.

Pod topology spread constraints give the scheduler explicit distribution rules. You define the maximum allowed imbalance (maxSkew) between topology domains (nodes, zones, regions), and the scheduler ensures new pods honor that constraint.

What You'll Build

A 6-replica Deployment spread evenly across 3 availability zones with maxSkew: 1, so no zone ever has more than one extra pod compared to the least-loaded zone. Then a second constraint to also spread across individual nodes within each zone.

Step 1: Verify Zone Labels on Your Nodes

Cloud-managed clusters (EKS, GKE, AKS) add zone labels automatically. Verify yours:

bash
kubectl get nodes -L topology.kubernetes.io/zone
# NAME          STATUS   ROLES    AGE   VERSION   ZONE
# node-1a-1     Ready    <none>   5d    v1.29.0   us-east-1a
# node-1a-2     Ready    <none>   5d    v1.29.0   us-east-1a
# node-1b-1     Ready    <none>   5d    v1.29.0   us-east-1b
# node-1b-2     Ready    <none>   5d    v1.29.0   us-east-1b
# node-1c-1     Ready    <none>   5d    v1.29.0   us-east-1c
# node-1c-2     Ready    <none>   5d    v1.29.0   us-east-1c

If you're using kind for local testing, add the labels manually:

bash
kubectl label nodes kind-worker  topology.kubernetes.io/zone=zone-a
kubectl label nodes kind-worker2 topology.kubernetes.io/zone=zone-b
kubectl label nodes kind-worker3 topology.kubernetes.io/zone=zone-c

Step 2: Deploy Without Spread Constraints

First, see what the scheduler does without guidance:

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-unconstrained
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-unconstrained
  template:
    metadata:
      labels:
        app: web-unconstrained
    spec:
      containers:
        - name: nginx
          image: nginx:1.25
EOF

bash
# Column 7 of -o wide is NODE; NR>1 skips the header row
kubectl get pods -l app=web-unconstrained -o wide | awk 'NR>1{print $7}' | sort | uniq -c
# The distribution may look like:
#   4 node-1a-1
#   2 node-1a-2
# All 6 pods in zone us-east-1a, none in 1b or 1c

Without a hard constraint, nothing stops the scheduler from placing every pod in one zone. In a distribution like this one, an outage in us-east-1a takes out all 6 replicas.

Clean up:

bash
kubectl delete deployment web-unconstrained

Step 3: Add Zone-Level Topology Spread Constraints

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: nginx
          image: nginx:1.25
EOF

Step 4: Understand Each Field

maxSkew: 1

The maximum permitted difference between the number of matching pods in the candidate domain (the zone being evaluated for the incoming pod) and the global minimum (the fewest matching pods in any eligible domain). The scheduler evaluates each candidate domain individually — it is not a simple global max-minus-min formula.

With maxSkew: 1 and 3 zones, candidate distributions evaluate as follows:

  • [2, 2, 2] — global min = 2; all zones: 2 − 2 = 0 ≤ 1 ✓ (optimal)
  • [3, 2, 1] — global min = 1; zone A: 3 − 1 = 2 > 1 ✗ (violates maxSkew: 1)
  • [2, 2, 2] → scheduling a 7th pod results in [3, 2, 2] — global min = 2; zone A: 3 − 2 = 1 ≤ 1 ✓ (still valid)
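
The per-domain check above can be sketched as a small shell helper. This is illustrative arithmetic only, not the scheduler's own code; the function name skew_after_placing and its argument order are assumptions made for this example.

```shell
# skew_after_placing CANDIDATE_INDEX COUNT...
# COUNT... are the current matching-pod counts per domain (1-based index).
# Prints the skew the candidate domain would have after receiving one more pod:
#   (candidate count + 1) - (global minimum of the current counts)
skew_after_placing() {
  candidate=$1; shift
  printf '%s\n' "$@" | awk -v c="$candidate" '
    { n[NR] = $1; if (NR == 1 || $1 < min) min = $1 }
    END { print n[c] + 1 - min }'
}

skew_after_placing 1 2 2 2   # zones hold [2, 2, 2]; a 7th pod in zone 1 gives skew 1 (allowed)
skew_after_placing 1 2 2 1   # zones hold [2, 2, 1]; a 6th pod in zone 1 gives skew 2 (rejected)
skew_after_placing 3 2 2 1   # same cluster; placing in zone 3 gives skew 1 (allowed)
```

A placement is acceptable under a hard constraint when the printed value is less than or equal to maxSkew.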

topologyKey: topology.kubernetes.io/zone

The node label that defines topology domains. The scheduler groups nodes by the value of this label:

  • us-east-1a = one domain
  • us-east-1b = one domain
  • us-east-1c = one domain

Any node label works as topologyKey. kubernetes.io/hostname gives per-node granularity.
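
To see which domains a given topologyKey produces, you can group nodes by the label value. The sketch below assumes node names and zone values arrive as tab-separated lines on stdin; nodes_per_domain is a hypothetical helper name, and the commented kubectl line shows one way to feed it from a live cluster.

```shell
# nodes_per_domain: reads "NODE<TAB>ZONE" lines on stdin and prints one line
# per topology domain with its node count. Nodes whose zone value is empty
# are grouped under "<unlabeled>".
nodes_per_domain() {
  awk -F'\t' '
    { zone = ($2 == "" ? "<unlabeled>" : $2); count[zone]++ }
    END { for (z in count) print count[z], z }' | sort -k2
}

# Against a live cluster (not run here):
# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}{end}' | nodes_per_domain
```

Any node that shows up under "<unlabeled>" does not belong to a domain for this topologyKey.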

whenUnsatisfiable: DoNotSchedule

Hard constraint. The pod stays Pending if placing it would violate maxSkew. Use this when you need a guarantee, and you're confident the cluster always has enough nodes in each zone.

whenUnsatisfiable: ScheduleAnyway

Soft constraint. The scheduler prefers the constraint but will violate it if necessary. The scheduler assigns a penalty score to placements that increase skew and avoids them when possible — but won't block the pod.

labelSelector

The spread constraint counts only pods that match this selector. It must select the labels the pods actually carry (spec.template.metadata.labels); otherwise the constraint counts zero pods in every domain and spread is never enforced. Reusing the matchLabels from spec.selector is the simplest way to guarantee a match.
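
A quick sanity check is to ask the API server how many pods the selector really matches. The helper below is a minimal sketch; count_selected_pods is a hypothetical name, and it assumes the pods live in the current namespace.

```shell
# count_selected_pods SELECTOR
# Prints how many pods in the current namespace match SELECTOR.
# A result of 0 means a spread constraint using this selector counts nothing.
count_selected_pods() {
  kubectl get pods -l "$1" --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Against a live cluster (not run here):
# count_selected_pods app=web
```

For the Deployment in this tutorial, the count should equal the replica count once all pods are scheduled or Pending.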

Step 5: Verify the Distribution

bash
kubectl get pods -l app=web -o wide
# Abridged output. -o wide prints the NODE column but not ZONE;
# the zone is added here for readability.
# NAME           READY   STATUS    NODE        ZONE
# web-abc-1      1/1     Running   node-1a-1   us-east-1a
# web-abc-2      1/1     Running   node-1a-2   us-east-1a
# web-abc-3      1/1     Running   node-1b-1   us-east-1b
# web-abc-4      1/1     Running   node-1b-2   us-east-1b
# web-abc-5      1/1     Running   node-1c-1   us-east-1c
# web-abc-6      1/1     Running   node-1c-2   us-east-1c

Exactly 2 pods per zone — skew of 0. The spread constraint worked.

Count pods per zone to verify:

bash
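# Pod output has no zone column, so map each pod's node to its zone label first
kubectl get pods -l app=web -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
  | xargs -I{} kubectl get node {} -o jsonpath='{.metadata.labels.topology\.kubernetes\.io/zone}{"\n"}' \
  | sort | uniq -c
#   2 us-east-1a
#   2 us-east-1b
#   2 us-east-1c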
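
The per-zone counts printed by sort | uniq -c can be reduced to a single number. The helper below is a sketch with a hypothetical name, observed_skew; it reads "COUNT ZONE" lines and prints the largest count minus the smallest. One caveat: a zone with zero pods never appears in uniq -c output, so the result understates the skew if a whole zone is empty.

```shell
# observed_skew: reads "COUNT ZONE" lines (as printed by `sort | uniq -c`)
# and prints max(COUNT) - min(COUNT) over the zones that appear in the input.
observed_skew() {
  awk '
    { if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1 }
    END { print max - min }'
}

printf '2 us-east-1a\n2 us-east-1b\n2 us-east-1c\n' | observed_skew   # balanced: prints 0
printf '3 us-east-1a\n1 us-east-1b\n2 us-east-1c\n' | observed_skew   # unbalanced: prints 2
```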

Step 6: Add Node-Level Spread Within Each Zone

Two constraints can run simultaneously. The first distributes across zones; the second distributes across individual nodes, which also evens out the nodes inside each zone:

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: nginx
          image: nginx:1.25
EOF

The first constraint is hard (DoNotSchedule) — zone balance is non-negotiable. The second is soft (ScheduleAnyway) — node balance within a zone is preferred but won't block scheduling if a zone has fewer nodes.

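Both constraints are evaluated for every pod. The hard constraint filters out nodes whose zone would violate maxSkew; the soft constraint then ranks the remaining nodes by how well they preserve node-level balance. When the two conflict, the hard constraint takes precedence: zone balance is preserved even if it means violating the node-level skew.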

Step 7: minDomains for Cluster-Aware Scheduling

Without minDomains, skew is calculated only across the topology domains that currently exist, meaning zones that have at least one eligible node. If your cluster temporarily has nodes only in zone-a (for example, because autoscaled node groups in the other zones are scaled to zero), zone-a is the only domain. Every placement there has a skew of 0, so all 6 pods pile into zone-a.

minDomains fixes this by declaring how many eligible domains you expect. While fewer exist, the scheduler treats the global minimum as 0, so maxSkew caps how many pods each existing domain can take. The field is only valid together with whenUnsatisfiable: DoNotSchedule:

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          minDomains: 3
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: nginx
          image: nginx:1.25
EOF

With minDomains: 3, if fewer than 3 eligible zones exist in the cluster, the global minimum is treated as 0. With maxSkew: 1, each existing zone can then hold at most 1 matching pod, and the remaining pods stay Pending until more zones become available. This prevents pods from piling into a single zone during cluster scale-up or zone recovery.

Note: eligible topology domains are determined by nodes carrying the topologyKey label — a zone counts as eligible even if it has no matching pods (its pod count is simply 0). The purpose of minDomains is to guard against scenarios where fewer zones exist than your architecture requires.
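
The effect of minDomains on the global minimum can be sketched in a few lines of shell. This models only the rule described here, under the assumption that each argument after the first is the matching-pod count of one eligible domain; global_min is a hypothetical helper name.

```shell
# global_min MIN_DOMAINS COUNT...
# COUNT... are matching-pod counts, one per eligible domain.
# Prints 0 when fewer than MIN_DOMAINS domains exist, otherwise the smallest count.
global_min() {
  min_domains=$1; shift
  if [ "$#" -lt "$min_domains" ]; then
    echo 0
    return
  fi
  printf '%s\n' "$@" | sort -n | head -n 1
}

global_min 1 4       # one zone, no minimum required: prints 4, so a 5th pod has skew 1
global_min 3 1       # one zone but 3 required: prints 0, so a 2nd pod has skew 2 and stays Pending
global_min 3 2 3 2   # three zones present: prints 2, the usual calculation applies
```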

minDomains was introduced behind the MinDomainsInPodTopologySpread feature gate: alpha in v1.24, beta in v1.25, enabled by default from v1.27, and GA in v1.30. On v1.24 to v1.26 you must enable the gate explicitly; on v1.27 and later the field works without extra configuration.

Step 8: Combining with Node Affinity

Node affinity and topology spread constraints compose: affinity filters which nodes are eligible, and the spread constraint distributes among the eligible nodes.

Deploy pods only on role=web nodes, spread across zones:

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-affinity
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-affinity
  template:
    metadata:
      labels:
        app: web-affinity
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: role
                    operator: In
                    values:
                      - web
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-affinity
      containers:
        - name: nginx
          image: nginx:1.25
EOF

With nodeAffinityPolicy: Honor (the default), only zones containing at least one role=web node form eligible topology domains for the spread calculation. Pods are counted within those eligible domains. If zone-b has no role=web nodes, zone-b is not an eligible domain — pods spread only across zone-a and zone-c, and the constraint doesn't try to balance against non-web zones.
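
The Deployment above only lands on nodes labelled role=web, so if no node carries that label yet, every pod stays Pending. A minimal sketch for labelling nodes follows; label_web_nodes is a hypothetical helper, and the node names in the commented call come from the sample output in Step 1, so substitute your own.

```shell
# label_web_nodes NODE...
# Adds (or overwrites) the role=web label on each named node.
label_web_nodes() {
  for node in "$@"; do
    kubectl label node "$node" role=web --overwrite
  done
}

# Against a live cluster (not run here), one web node per zone:
# label_web_nodes node-1a-1 node-1b-1 node-1c-1
# kubectl get nodes -L role,topology.kubernetes.io/zone
```

Labelling at least one node in every zone keeps all three zones eligible for the spread calculation.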

Step 9: Simulate a Zone Failure

Cordon all nodes in one zone to simulate it going offline:

bash
kubectl cordon node-1c-1 node-1c-2

# Cordoning does not evict running pods. The restart replaces every pod,
# so the scheduler has to place all of them again.
kubectl rollout restart deployment/web

What actually happens with DoNotSchedule: cordoned nodes keep the topology.kubernetes.io/zone label, and with the default nodeTaintsPolicy: Ignore the unschedulable taint does not remove them from the calculation, so zone-c remains an eligible domain for skew calculation. It simply has no schedulable nodes. As the old zone-c pods are replaced, zone-c's count falls while zones a and b cannot exceed it by more than maxSkew, so new pods start to stay Pending and the rollout stalls.
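
To see which pods are stuck and what the scheduler reported, something like the following can help. The two function names are assumptions made for this sketch; the kubectl flags themselves are standard.

```shell
# pending_pods [SELECTOR]
# Lists pods matching SELECTOR (default app=web) that are still Pending.
pending_pods() {
  kubectl get pods -l "${1:-app=web}" --field-selector=status.phase=Pending -o name
}

# scheduling_failures
# Lists FailedScheduling events in the current namespace.
scheduling_failures() {
  kubectl get events --field-selector reason=FailedScheduling
}

# Against a live cluster (not run here):
# pending_pods
# scheduling_failures
```

For a pod blocked by a hard spread constraint, the event message typically mentions that nodes didn't match pod topology spread constraints.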

To observe pods rescheduling into a [3, 3, 0] distribution when a zone is lost, use whenUnsatisfiable: ScheduleAnyway on the zone constraint. With ScheduleAnyway, the scheduler accepts the increased skew rather than blocking placement:

bash
kubectl get pods -l app=web -o wide
# With ScheduleAnyway: all 6 pods distribute across zone-a and zone-b
# [3, 3, 0]: the soft constraint accepted the imbalance
# With DoNotSchedule (hard): zone-a and zone-b may hold at most maxSkew more
# pods than zone-c, so the remaining new pods stay Pending
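
One way to switch the zone constraint between hard and soft without editing the manifest is a JSON patch. This is a sketch under two assumptions: the zone constraint is the first entry in topologySpreadConstraints (index 0), and minDomains is not set on it, since minDomains is only accepted together with DoNotSchedule. The helper name set_zone_policy is hypothetical.

```shell
# set_zone_policy DEPLOYMENT POLICY
# POLICY is DoNotSchedule or ScheduleAnyway.
# Replaces whenUnsatisfiable on the first spread constraint of the pod template.
# Changing the pod template triggers a new rollout.
set_zone_policy() {
  kubectl patch deployment "$1" --type=json \
    -p "[{\"op\":\"replace\",\"path\":\"/spec/template/spec/topologySpreadConstraints/0/whenUnsatisfiable\",\"value\":\"$2\"}]"
}

# Against a live cluster (not run here):
# set_zone_policy web ScheduleAnyway
# set_zone_policy web DoNotSchedule   # switch back after the experiment
```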

Restore zone-c:

bash
kubectl uncordon node-1c-1 node-1c-2
kubectl rollout restart deployment/web

kubectl get pods -l app=web -o wide
# [2, 2, 2] — back to balanced

When you uncordon zone-c, the spread constraint isn't automatically enforced for running pods — only for newly scheduled ones. The rollout restart reschedules all pods, allowing the constraint to produce the optimal distribution.

Common Mistakes to Avoid

labelSelector doesn't match the pod's own labels. This is the most common mistake. The spread constraint counts pods matching the selector to determine domain load. If the selector matches none of the Deployment's pods, every domain appears to hold zero pods and the constraint has no effect. The selector in topologySpreadConstraints[].labelSelector must select the labels in spec.template.metadata.labels; reusing spec.selector.matchLabels is the simplest way to guarantee that.

whenUnsatisfiable: DoNotSchedule without enough zones — if you require spreading across 3 zones but only 2 zones have schedulable nodes (e.g., during maintenance), new pods stay Pending indefinitely. Use ScheduleAnyway for the secondary constraint, or use minDomains carefully.

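topologyKey that doesn't match any node label. If your nodes don't have topology.kubernetes.io/zone set (common on self-managed clusters), the scheduler skips them when it builds topology domains. With DoNotSchedule, pods cannot be placed on unlabeled nodes and stay Pending; with ScheduleAnyway, the constraint has nothing to balance and silently does nothing. Verify the node labels before adding constraints.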

Per-node maxSkew: 1 on a cluster with uneven node capacity. The global minimum includes nodes that are already full. With 3 nodes that can hold 4, 2, and 1 pods, the smallest node fills up at 1 pod and pins the global minimum at 1, so no node may hold more than 2 matching pods. The best achievable distribution is [2, 2, 1], 5 pods in total, and a 6th pod stays Pending under DoNotSchedule even though the first node still has room. Consider maxSkew: 2 or ScheduleAnyway for clusters with significantly unequal node sizes.
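
The capacity limit can be checked with a small simulation. The sketch below is a simplified model, not the scheduler: it assumes each node is its own domain, that capacity is a plain pod count, and that pods are placed one at a time on the least-loaded node that still satisfies maxSkew. The name max_placeable is hypothetical.

```shell
# max_placeable MAX_SKEW CAPACITY...
# Prints how many pods can be placed before every node is either full or
# would exceed MAX_SKEW relative to the least-loaded node.
max_placeable() {
  max_skew=$1; shift
  printf '%s\n' "$@" | awk -v s="$max_skew" '
    { cap[NR] = $1; cnt[NR] = 0 }
    END {
      placed = 0
      while (1) {
        # global minimum over all nodes, including full ones
        min = cnt[1]
        for (i = 2; i <= NR; i++) if (cnt[i] < min) min = cnt[i]
        # pick the least-loaded node that has room and keeps skew <= s
        best = 0
        for (i = 1; i <= NR; i++)
          if (cnt[i] < cap[i] && cnt[i] + 1 - min <= s && (best == 0 || cnt[i] < cnt[best]))
            best = i
        if (best == 0) break
        cnt[best]++; placed++
      }
      print placed
    }'
}

max_placeable 1 4 2 1   # capacities 4, 2, 1 with maxSkew 1: prints 5
max_placeable 2 4 2 1   # same nodes with maxSkew 2: prints 6
```

Raising maxSkew by one lets the largest node take one more pod, which is why a looser skew helps on clusters with unequal node sizes.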

Forgetting that existing pods don't re-balance automatically — topology spread constraints apply at scheduling time. If you add constraints to a running Deployment, existing pods don't move. Run kubectl rollout restart deployment/<name> to reschedule all pods under the new constraints.

Cleanup

bash
kubectl delete deployment web web-affinity
kubectl uncordon node-1c-1 node-1c-2   # if you ran the failure simulation

What's Next

  • Node Affinity, Taints & Tolerations — the scheduling building block that composes with topology spread constraints
  • Descheduler — a controller that evicts and reschedules pods to rebalance the cluster when spread constraints drift over time (e.g., after node additions)
