Kubernetes Scheduling: Taints, Tolerations, Affinity, and Priority Classes
Kubernetes scheduling determines which node runs each pod. The default scheduler considers resource requests, node conditions, and spreading requirements — but the tools that shape where workloads land are taints and tolerations, node affinity, pod affinity/anti-affinity, topology spread constraints, and PriorityClasses. This article covers the production scheduling patterns: GPU node isolation with taints, database anti-affinity across AZs, topology spread for even distribution, and PriorityClasses that protect critical workloads from eviction.

The Kubernetes scheduler assigns pods to nodes. When you run kubectl apply on a Deployment, the scheduler considers every node in the cluster, filters out nodes that don't satisfy the pod's constraints (resource requests, node selectors, taints), and scores the remaining nodes (preferring less-loaded nodes and nodes in different AZs from existing replicas). The pod lands on the highest-scoring node.
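To see which node won for a given pod and what the scheduler reported, query the pod and its events (the pod name below is a placeholder):

```bash
# Which node did the scheduler pick?
kubectl get pod payments-api-7d4b9-x2k8p -o wide

# The Scheduled event (or a FailedScheduling reason) for that pod
kubectl get events --field-selector involvedObject.name=payments-api-7d4b9-x2k8p
```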
The default scheduler is good enough for most workloads — but as a cluster grows, you need explicit control: GPU workloads must go to GPU nodes, database pods must not co-locate, critical system pods must survive node pressure. This is where the scheduling API comes in.
Taints and Tolerations: Node Isolation
A taint marks a node as unsuitable for general workloads. A toleration on a pod declares that the pod can tolerate a specific taint. Pods without the matching toleration are not scheduled to tainted nodes.
```bash
# Taint a node — prevents pods without the toleration from landing here
kubectl taint nodes node-gpu-01 workload=gpu:NoSchedule

# Taint effects:
# NoSchedule: don't schedule new pods (existing pods stay)
# PreferNoSchedule: prefer not to schedule (soft version)
# NoExecute: evict existing pods that don't tolerate the taint
```

A pod that needs to run on GPU nodes:
```yaml
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: gpu
      effect: NoSchedule
  nodeSelector:
    workload: gpu  # Also requires the node label — toleration alone allows, doesn't require
```

The toleration says "I can run on GPU nodes." The nodeSelector says "I must run on GPU nodes." Without the nodeSelector, a pod with the toleration could land on any node — GPU nodes and general nodes alike.
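For that nodeSelector to ever match, the GPU node must also carry the label; the taint and the label are separate node attributes. Continuing with the node name from the taint command above:

```bash
# The taint keeps general workloads off; the label is what nodeSelector matches
kubectl label nodes node-gpu-01 workload=gpu
```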
System Taint Tolerations
Kubernetes adds taints automatically in some conditions:
| Taint | Condition | Effect |
|---|---|---|
| node.kubernetes.io/not-ready | Node not ready | NoExecute |
| node.kubernetes.io/unreachable | Node unreachable | NoExecute |
| node.kubernetes.io/memory-pressure | Node memory pressure | NoSchedule |
| node.kubernetes.io/disk-pressure | Node disk pressure | NoSchedule |
| node.kubernetes.io/unschedulable | Node cordoned | NoSchedule |
DaemonSets automatically get tolerations for not-ready and unreachable — so they keep running during node issues. For critical infrastructure pods (Velero, cert-manager, logging agents), add explicit tolerations:
```yaml
tolerations:
  - key: node.kubernetes.io/memory-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300  # Wait 5 minutes before evicting on node failure
```

tolerationSeconds on NoExecute gives pods a grace period before they're evicted when the taint appears — useful for stateful workloads that should survive brief node issues.
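Pods that declare no tolerations at all still typically get the not-ready and unreachable pair injected with a 300-second tolerationSeconds by the DefaultTolerationSeconds admission plugin. To see what a pod actually carries (pod and namespace names are placeholders):

```bash
# Tolerations on the pod as admitted, including injected defaults
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'
```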
Node Affinity: Placement Requirements
Node affinity is a more expressive replacement for nodeSelector. It supports In, NotIn, Exists, DoesNotExist, Gt, Lt operators and distinguishes between hard requirements and soft preferences.
```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: pod MUST land on a node matching this rule
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Only schedule in us-east-1 AZs with SSD storage
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b", "us-east-1c"]
              - key: storage-type
                operator: In
                values: ["ssd"]

      # Soft preference: prefer nodes with at least 32GB RAM
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80  # Higher weight = stronger preference (1-100)
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m6i.2xlarge", "m6i.4xlarge", "m6a.2xlarge"]
```

requiredDuringSchedulingIgnoredDuringExecution is a hard constraint — if no node matches, the pod stays Pending. IgnoredDuringExecution means the constraint is only applied at scheduling time; if the node later loses the label, existing pods are not evicted. A requiredDuringSchedulingRequiredDuringExecution variant (which would also evict running pods when node labels change) has long been proposed upstream but is not implemented; check the Kubernetes documentation for your cluster version.
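The topology.kubernetes.io/zone label is applied automatically by the cloud provider; a custom key like storage-type from the example above is yours to manage (the node name here is a placeholder):

```bash
# Label the node so the affinity expression can match it
kubectl label nodes node-ssd-01 storage-type=ssd

# Verify the labels a node carries
kubectl get node node-ssd-01 --show-labels
```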
Pod Affinity and Anti-Affinity
Pod affinity schedules pods near other pods with matching labels. Pod anti-affinity schedules pods away from other pods — the critical tool for high availability.
Anti-Affinity for HA Database Replicas
```yaml
spec:
  affinity:
    podAntiAffinity:
      # Hard: never schedule two postgres replicas on the same node
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: postgres
              role: replica
          topologyKey: kubernetes.io/hostname  # Spread across nodes

      # Soft: prefer different AZs for replicas
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: postgres
            topologyKey: topology.kubernetes.io/zone  # Spread across AZs
```

topologyKey is the node label that defines the "domain" for anti-affinity. kubernetes.io/hostname means "no two matching pods on the same node." topology.kubernetes.io/zone means "no two matching pods in the same AZ."
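After a rollout, it's worth confirming the spread actually happened; the NODE column should show a distinct node per replica:

```bash
kubectl get pods -l app=postgres,role=replica -o wide
```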
Pod Affinity: Co-Location for Latency
```yaml
# Schedule the payments-cache pod near the payments-api pod (same node = no network hop)
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payments-api
            topologyKey: kubernetes.io/hostname
```

Use preferredDuring (not requiredDuring) for co-location — hard co-location makes pods unschedulable if the target is on a full node.
Topology Spread Constraints
Topology spread constraints provide even distribution across failure domains without the complexity of per-pod anti-affinity rules. Available since Kubernetes 1.16 (beta and enabled by default in 1.18, GA in 1.19).
```yaml
spec:
  topologySpreadConstraints:
    # Spread pods evenly across AZs — max 1 pod skew between AZs
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule  # Hard: block scheduling if spread would exceed maxSkew
      labelSelector:
        matchLabels:
          app: payments-api

    # Also spread evenly across nodes within each AZ
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway  # Soft: try to spread but don't block scheduling
      labelSelector:
        matchLabels:
          app: payments-api
```

maxSkew: 1 with whenUnsatisfiable: DoNotSchedule means: if scheduling this pod would leave any zone with more than 1 extra pod compared to the least-loaded zone, don't schedule it. ScheduleAnyway is the soft version — the scheduler still scores placements to minimize skew but never blocks scheduling.
Topology spread constraints are generally simpler and more predictable than pod anti-affinity for even distribution. Use anti-affinity for the specific "never co-locate these two pods" requirement; use topology spread for "distribute evenly across N domains."
PriorityClass: Controlling Eviction and Preemption
PriorityClasses assign a numeric priority to pods. During node resource pressure (memory or disk running low), the kubelet factors priority into which pods it evicts first. During scheduling, higher-priority pending pods can preempt lower-priority running pods.
```yaml
# Define PriorityClasses for your cluster
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000  # Highest — reserved for infrastructure components
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # Can preempt lower-priority pods to get scheduled
description: "Critical platform infrastructure: CoreDNS, kube-proxy, CNI"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Production services — SLA-backed workloads"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-default
value: 10000
globalDefault: true  # Applied to pods with no priorityClassName
preemptionPolicy: PreemptLowerPriority
description: "Standard production workloads"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false
preemptionPolicy: Never  # Never preempts other pods — only scheduled on free capacity
description: "Batch jobs, preemptable analytics workloads"
```

Assign it in the pod spec:

```yaml
spec:
  priorityClassName: production-high
```

Built-in System Priority Classes
Kubernetes ships with two reserved PriorityClasses:
| Name | Value | Purpose |
|---|---|---|
| system-cluster-critical | 2000000000 | Cluster-level critical pods (CoreDNS) |
| system-node-critical | 2000001000 | Node-level critical pods (kube-proxy, kubelet static pods) |
Never assign system-cluster-critical or system-node-critical to application workloads — reserve them for cluster infrastructure.
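To list every PriorityClass in the cluster, including the two built-in system classes:

```bash
kubectl get priorityclasses
```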
Scheduling Profiles and the Default Scheduler
In clusters with heterogeneous workload types (batch jobs + latency-sensitive APIs + GPU training), consider multiple scheduling profiles — different scheduler configurations that pods select via spec.schedulerName:
```yaml
# kube-scheduler ConfigMap (simplified)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
          - name: PodTopologySpread

  - schedulerName: batch-scheduler
    plugins:
      score:
        disabled:
          - name: PodTopologySpread  # Don't spread batch jobs — pack them for efficiency
        enabled:
          - name: NodeResourcesFit
```

Pods request a specific profile with spec.schedulerName: batch-scheduler. The goal of the batch profile is to pack pods tightly (bin-packing) rather than spread them, reducing fragmentation for jobs that run to completion quickly; the scoring override that achieves this is sketched below.
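Disabling PodTopologySpread alone doesn't produce bin-packing, because NodeResourcesFit's default scoring prefers the least-allocated node; switching its scoring strategy to MostAllocated is what actually packs pods. A minimal sketch, assuming the v1 KubeSchedulerConfiguration API (the resource weights are illustrative):

```yaml
profiles:
  - schedulerName: batch-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated  # Score nodes higher the fuller they already are
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```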
PodSchedulingReadiness (K8s 1.32 GA): The spec.schedulingGates field lets you hold pods in a SchedulingGated state until external conditions are met. Used by job queuing systems like Kueue and Volcano to implement fair queuing without overloading the scheduler.
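A minimal sketch of a gated pod; the gate name and image are placeholders, and the pod stays SchedulingGated until a controller removes the entry from spec.schedulingGates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job-0
spec:
  schedulingGates:
    - name: example.com/queue-admission  # Removed by the queue controller when capacity is available
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest
```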
Frequently Asked Questions
What's the difference between nodeSelector and node affinity?
nodeSelector is the simpler, older API: it matches pods to nodes by exact label key-value pairs. Node affinity supports operators (In, NotIn, Exists, Gt, Lt), multiple match expressions with AND/OR logic, and the soft preferredDuring variant. For new code, use node affinity — it's strictly more expressive. nodeSelector is not deprecated but node affinity subsumes it.
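A side-by-side sketch of the same placement intent in both APIs (the extra nvme value is only there to show a multi-value match):

```yaml
# nodeSelector: exact key/value equality only
spec:
  nodeSelector:
    storage-type: ssd

# Node affinity: operators, multiple values, hard and soft variants
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: storage-type
                operator: In
                values: ["ssd", "nvme"]
```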
How does Karpenter interact with taints and tolerations?
Karpenter respects pod tolerations when selecting instance types. If a pod has a workload=gpu:NoSchedule toleration, Karpenter can provision a GPU instance (with that taint) to satisfy the pod. Configure the NodePool to allow GPU instances and Karpenter will provision them only when a pod with the matching toleration is pending. See Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter for NodePool configuration.
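A minimal sketch of a tainted GPU NodePool, assuming the Karpenter v1 API; the instance types and EC2NodeClass name are placeholders:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      taints:
        - key: workload
          value: gpu
          effect: NoSchedule  # Only pods tolerating workload=gpu are scheduled here
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
```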
Why is my pod Pending despite having a toleration?
A toleration allows a pod to land on a tainted node, but doesn't require it. A pod with a toleration and no nodeSelector will go to the least-loaded available node — which may not be the tainted one. If the pod is Pending despite tolerations:
```bash
# See why the pod isn't scheduled
kubectl describe pod <pod-name> -n <namespace>
# Look at the "Events:" section — "0/5 nodes are available: ..."

# Check node taints (a cordoned node shows node.kubernetes.io/unschedulable here)
kubectl get node -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check the pod's affinity/tolerations
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 20 affinity
```

Common causes: a hard affinity rule that matches zero nodes, resource requests that no node can satisfy, or a DoNotSchedule topology spread constraint that can't be satisfied with the current pod distribution.
For Karpenter's NodePool that uses tolerations to match pods to provisioned node types, see Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter. For Pod Disruption Budgets that control how the scheduler evicts pods during node consolidation, see Kubernetes StatefulSets: Running Stateful Workloads in Production.
Designing a scheduling strategy for a multi-team EKS cluster with mixed workload types, or debugging pods that are stuck in Pending? Talk to us at Coding Protocols — we help platform teams implement scheduling policies that isolate workloads without creating scheduling deadlocks.


