Kubernetes Scheduling: Taints, Tolerations, Affinity, and Priority Classes
Kubernetes scheduling determines which node runs each pod. The default scheduler considers resource requests, node conditions, and spreading requirements — but the tools that shape where workloads land are taints and tolerations, node affinity, pod affinity/anti-affinity, topology spread constraints, and PriorityClasses. This article covers the production scheduling patterns: GPU node isolation with taints, database anti-affinity across AZs, topology spread for even distribution, and PriorityClasses that protect critical workloads from eviction.

The Kubernetes scheduler assigns pods to nodes. When you run kubectl apply on a Deployment, the scheduler considers every node in the cluster, filters out nodes that don't satisfy the pod's constraints (resource requests, node selectors, taints), and scores the remaining nodes (preferring less-loaded nodes and nodes in different AZs from existing replicas). The pod lands on the highest-scoring node.
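To see which node won for a given pod and what the scheduler reported, query the pod and its events (the pod name below is a placeholder):

```bash
# Which node did the scheduler pick?
kubectl get pod payments-api-7d4b9-x2k8p -o wide

# The Scheduled event (or a FailedScheduling reason) for that pod
kubectl get events --field-selector involvedObject.name=payments-api-7d4b9-x2k8p
```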
The default scheduler is good enough for most workloads — but as a cluster grows, you need explicit control: GPU workloads must go to GPU nodes, database pods must not co-locate, critical system pods must survive node pressure. This is where the scheduling API comes in.
Taints and Tolerations: Node Isolation
A taint marks a node as unsuitable for general workloads. A toleration on a pod declares that the pod can tolerate a specific taint. Pods without the matching toleration are not scheduled to tainted nodes.
```bash
# Taint a node — prevents pods without the toleration from landing here
kubectl taint nodes node-gpu-01 workload=gpu:NoSchedule

# Taint effects:
# NoSchedule: don't schedule new pods (existing pods stay)
# PreferNoSchedule: prefer not to schedule (soft version)
# NoExecute: evict existing pods that don't tolerate the taint
```

A pod that needs to run on GPU nodes:
```yaml
spec:
  tolerations:
    - key: workload
      operator: Equal
      value: gpu
      effect: NoSchedule
  nodeSelector:
    workload: gpu  # Also requires the node label — toleration alone allows, doesn't require
```

The toleration says "I can run on GPU nodes." The nodeSelector says "I must run on GPU nodes." Without the nodeSelector, a pod with the toleration could land on any node — GPU nodes and general nodes alike.
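For that nodeSelector to ever match, the GPU node must also carry the label; the taint and the label are separate node attributes. Continuing with the node name from the taint command above:

```bash
# The taint keeps general workloads off; the label is what nodeSelector matches
kubectl label nodes node-gpu-01 workload=gpu
```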
System Taint Tolerations
Kubernetes adds taints automatically in some conditions:
| Taint | Condition | Effect |
|---|---|---|
| node.kubernetes.io/not-ready | Node not ready | NoExecute |
| node.kubernetes.io/unreachable | Node unreachable | NoExecute |
| node.kubernetes.io/memory-pressure | Node memory pressure | NoSchedule |
| node.kubernetes.io/disk-pressure | Node disk pressure | NoSchedule |
| node.kubernetes.io/unschedulable | Node cordoned | NoSchedule |
DaemonSets automatically get tolerations for not-ready and unreachable — so they keep running during node issues. For critical infrastructure pods (Velero, cert-manager, logging agents), add explicit tolerations:
```yaml
tolerations:
  - key: node.kubernetes.io/memory-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300  # Wait 5 minutes before evicting on node failure
```

tolerationSeconds on NoExecute gives pods a grace period before they're evicted when the taint appears — useful for stateful workloads that should survive brief node issues.
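Pods that declare no tolerations at all still typically get the not-ready and unreachable pair injected with a 300-second tolerationSeconds by the DefaultTolerationSeconds admission plugin. To see what a pod actually carries (pod and namespace names are placeholders):

```bash
# Tolerations on the pod as admitted, including injected defaults
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'
```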
Node Affinity: Placement Requirements
Node affinity is a more expressive replacement for nodeSelector. It supports In, NotIn, Exists, DoesNotExist, Gt, Lt operators and distinguishes between hard requirements and soft preferences.
```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: pod MUST land on a node matching this rule
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Only schedule in us-east-1 AZs with SSD storage
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-east-1a", "us-east-1b", "us-east-1c"]
              - key: storage-type
                operator: In
                values: ["ssd"]

      # Soft preference: prefer nodes with at least 32GB RAM
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80  # Higher weight = stronger preference (1-100)
          preference:
            matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values: ["m6i.2xlarge", "m6i.4xlarge", "m6a.2xlarge"]
```

requiredDuringSchedulingIgnoredDuringExecution is a hard constraint — if no node matches, the pod stays Pending. IgnoredDuringExecution means the constraint is only applied at scheduling time; if the node later loses the label, existing pods are not evicted. A requiredDuringSchedulingRequiredDuringExecution variant (which would also evict running pods when node labels change) has long been proposed upstream but is not implemented; check the Kubernetes documentation for your cluster version.
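The topology.kubernetes.io/zone label is applied automatically by the cloud provider; a custom key like storage-type from the example above is yours to manage (the node name here is a placeholder):

```bash
# Label the node so the affinity expression can match it
kubectl label nodes node-ssd-01 storage-type=ssd

# Verify the labels a node carries
kubectl get node node-ssd-01 --show-labels
```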
Pod Affinity and Anti-Affinity
Pod affinity schedules pods near other pods with matching labels. Pod anti-affinity schedules pods away from other pods — the critical tool for high availability.
Anti-Affinity for HA Database Replicas
```yaml
spec:
  affinity:
    podAntiAffinity:
      # Hard: never schedule two postgres replicas on the same node
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: postgres
              role: replica
          topologyKey: kubernetes.io/hostname  # Spread across nodes

      # Soft: prefer different AZs for replicas
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: postgres
            topologyKey: topology.kubernetes.io/zone  # Spread across AZs
```

topologyKey is the node label that defines the "domain" for anti-affinity. kubernetes.io/hostname means "no two matching pods on the same node." topology.kubernetes.io/zone means "no two matching pods in the same AZ."
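After a rollout, it's worth confirming the spread actually happened; the NODE column should show a distinct node per replica:

```bash
kubectl get pods -l app=postgres,role=replica -o wide
```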
Pod Affinity: Co-Location for Latency
```yaml
# Schedule the payments-cache pod near the payments-api pod (same node = no network hop)
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: payments-api
            topologyKey: kubernetes.io/hostname
```

Use preferredDuring (not requiredDuring) for co-location — hard co-location makes pods unschedulable if the target is on a full node.
Topology Spread Constraints
Topology spread constraints provide even distribution across failure domains without the complexity of per-pod anti-affinity rules. Available since Kubernetes 1.16 (beta and enabled by default in 1.18, GA in 1.19).
```yaml
spec:
  topologySpreadConstraints:
    # Spread pods evenly across AZs — max 1 pod skew between AZs
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule  # Hard: block scheduling if spread would exceed maxSkew
      labelSelector:
        matchLabels:
          app: payments-api

    # Also spread evenly across nodes within each AZ
    - maxSkew: 2
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway  # Soft: try to spread but don't block scheduling
      labelSelector:
        matchLabels:
          app: payments-api
```

maxSkew: 1 with whenUnsatisfiable: DoNotSchedule means: if scheduling this pod would leave any zone with more than 1 extra pod compared to the least-loaded zone, don't schedule it. ScheduleAnyway is the soft version — the scheduler still scores placements to minimize skew but never blocks scheduling.
Topology spread constraints are generally simpler and more predictable than pod anti-affinity for even distribution. Use anti-affinity for the specific "never co-locate these two pods" requirement; use topology spread for "distribute evenly across N domains."
PriorityClass: Controlling Eviction and Preemption
PriorityClasses assign a numeric priority to pods. During node resource pressure (memory or disk running low), the kubelet factors priority into which pods it evicts first. During scheduling, higher-priority pending pods can preempt lower-priority running pods.
```yaml
# Define PriorityClasses for your cluster
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000  # Highest — reserved for infrastructure components
globalDefault: false
preemptionPolicy: PreemptLowerPriority  # Can preempt lower-priority pods to get scheduled
description: "Critical platform infrastructure: CoreDNS, kube-proxy, CNI"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Production services — SLA-backed workloads"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-default
value: 10000
globalDefault: true  # Applied to pods with no priorityClassName
preemptionPolicy: PreemptLowerPriority
description: "Standard production workloads"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 1000
globalDefault: false
preemptionPolicy: Never  # Never preempts other pods — only scheduled on free capacity
description: "Batch jobs, preemptable analytics workloads"
```

Assign it in the pod spec:

```yaml
spec:
  priorityClassName: production-high
```

Built-in System Priority Classes
Kubernetes ships with two reserved PriorityClasses:
| Name | Value | Purpose |
|---|---|---|
| system-cluster-critical | 2000000000 | Cluster-level critical pods (CoreDNS) |
| system-node-critical | 2000001000 | Node-level critical pods (kube-proxy, kubelet static pods) |
Never assign system-cluster-critical or system-node-critical to application workloads — reserve them for cluster infrastructure.
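To list every PriorityClass in the cluster, including the two built-in system classes:

```bash
kubectl get priorityclasses
```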
Scheduling Profiles and the Default Scheduler
In clusters with heterogeneous workload types (batch jobs + latency-sensitive APIs + GPU training), consider multiple scheduling profiles — different scheduler configurations that pods select via spec.schedulerName:
```yaml
# kube-scheduler ConfigMap (simplified)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesFit
          - name: PodTopologySpread

  - schedulerName: batch-scheduler
    plugins:
      score:
        disabled:
          - name: PodTopologySpread  # Don't spread batch jobs — pack them for efficiency
        enabled:
          - name: NodeResourcesFit
```

Pods request a specific profile with spec.schedulerName: batch-scheduler. The goal of the batch profile is to pack pods tightly (bin-packing) rather than spread them, reducing fragmentation for jobs that run to completion quickly; the scoring override that achieves this is sketched below.
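Disabling PodTopologySpread alone doesn't produce bin-packing, because NodeResourcesFit's default scoring prefers the least-allocated node; switching its scoring strategy to MostAllocated is what actually packs pods. A minimal sketch, assuming the v1 KubeSchedulerConfiguration API (the resource weights are illustrative):

```yaml
profiles:
  - schedulerName: batch-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated  # Score nodes higher the fuller they already are
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```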
PodSchedulingReadiness (K8s 1.32 GA): The spec.schedulingGates field lets you hold pods in a SchedulingGated state until external conditions are met. Used by job queuing systems like Kueue and Volcano to implement fair queuing without overloading the scheduler.
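A minimal sketch of a gated pod; the gate name and image are placeholders, and the pod stays SchedulingGated until a controller removes the entry from spec.schedulingGates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job-0
spec:
  schedulingGates:
    - name: example.com/queue-admission  # Removed by the queue controller when capacity is available
  containers:
    - name: trainer
      image: registry.example.com/trainer:latest
```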
Frequently Asked Questions
What's the difference between nodeSelector and node affinity?
nodeSelector is the simpler, older API: it matches pods to nodes by exact label key-value pairs. Node affinity supports operators (In, NotIn, Exists, Gt, Lt), multiple match expressions with AND/OR logic, and the soft preferredDuring variant. For new code, use node affinity — it's strictly more expressive. nodeSelector is not deprecated but node affinity subsumes it.
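A side-by-side sketch of the same placement intent in both APIs (the extra nvme value is only there to show a multi-value match):

```yaml
# nodeSelector: exact key/value equality only
spec:
  nodeSelector:
    storage-type: ssd

# Node affinity: operators, multiple values, hard and soft variants
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: storage-type
                operator: In
                values: ["ssd", "nvme"]
```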
How does Karpenter interact with taints and tolerations?
Karpenter respects pod tolerations when selecting instance types. If a pod has a workload=gpu:NoSchedule toleration, Karpenter can provision a GPU instance (with that taint) to satisfy the pod. Configure the NodePool to allow GPU instances and Karpenter will provision them only when a pod with the matching toleration is pending. See Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter for NodePool configuration.
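A minimal sketch of a tainted GPU NodePool, assuming the Karpenter v1 API; the instance types and EC2NodeClass name are placeholders:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      taints:
        - key: workload
          value: gpu
          effect: NoSchedule  # Only pods tolerating workload=gpu are scheduled here
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
```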
Why is my pod Pending despite having a toleration?
A toleration allows a pod to land on a tainted node, but doesn't require it. A pod with a toleration and no nodeSelector will go to the least-loaded available node — which may not be the tainted one. If the pod is Pending despite tolerations:
```bash
# See why the pod isn't scheduled
kubectl describe pod <pod-name> -n <namespace>
# Look at the "Events:" section — "0/5 nodes are available: ..."

# Check node taints (a cordoned node shows node.kubernetes.io/unschedulable here)
kubectl get node -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check the pod's affinity/tolerations
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 20 affinity
```

Common causes: a hard affinity rule that matches zero nodes, resource requests that no node can satisfy, or a DoNotSchedule topology spread constraint that can't be satisfied with the current pod distribution.
For Karpenter's NodePool that uses tolerations to match pods to provisioned node types, see Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter. For Pod Disruption Budgets that control how the scheduler evicts pods during node consolidation, see Kubernetes StatefulSets: Running Stateful Workloads in Production.
Designing a scheduling strategy for a multi-team EKS cluster with mixed workload types, or debugging pods that are stuck in Pending? Talk to us at Coding Protocols — we help platform teams implement scheduling policies that isolate workloads without creating scheduling deadlocks.


