Node Affinity, Taints & Tolerations in Production
Pin workloads to the right nodes and keep undesirable pods away. This tutorial covers node labels, requiredDuring vs preferredDuring affinity, taint effects (NoSchedule, NoExecute, PreferNoSchedule), and how to combine both for GPU node pools.
Before you begin
- kubectl installed and configured
- Access to a running Kubernetes cluster (EKS, GKE, or kind)
- Basic familiarity with Kubernetes Deployments
Without placement controls, Kubernetes schedules pods wherever capacity exists. In practice that means CPU-hungry batch jobs land on the same nodes as latency-sensitive APIs, spot instances host stateful databases, and your monitoring agents miss the control plane entirely.
Node affinity and taints are the two tools that fix this. Affinity attracts pods to specific nodes. Taints repel pods from nodes unless they explicitly tolerate the taint. Together they let you segment workloads precisely without hard-coding node IPs anywhere.
What You'll Build
Two scenarios against the same 3-node cluster:
- A web Deployment pinned to nodes labeled role=web using hard (required) node affinity.
- A monitoring DaemonSet that tolerates the node-role.kubernetes.io/control-plane taint so it can run on every node, including control plane nodes.
As a bonus, Step 8 shows the combined pattern used for reserved GPU node pools.
Step 1: Label Your Nodes
Node labels are the foundation of affinity rules. Add a role label to simulate a segmented cluster:
kubectl label nodes node1 role=web
kubectl label nodes node2 role=web
kubectl label nodes node3 role=batch

# Verify
kubectl get nodes --show-labels | grep role
# node1 Ready <none> ... kubernetes.io/os=linux,role=web,...
# node2 Ready <none> ... kubernetes.io/os=linux,role=web,...
# node3 Ready <none> ... kubernetes.io/os=linux,role=batch,...
Replace node1, node2, node3 with your actual node names from kubectl get nodes.
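To check the role label without grepping through the full label list, a pair of small wrappers may help. This is only a sketch: the function names nodes_by_role and nodes_with_role are made up for this example, while -L (label column) and -l (label selector) are standard kubectl get flags.

```shell
# Hypothetical helpers (the names are illustrative, not part of kubectl).

# Print all nodes with the value of the "role" label as an extra column.
nodes_by_role() {
  kubectl get nodes -L role
}

# Print only the nodes whose "role" label equals the first argument.
nodes_with_role() {
  kubectl get nodes -l "role=$1"
}
```

For example, nodes_with_role web should list node1 and node2 in the cluster labeled above.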
Step 2: Required Affinity (Hard Rule)
requiredDuringSchedulingIgnoredDuringExecution is the hard constraint. The scheduler will not place the pod on a node that doesn't satisfy it. If no matching node exists, the pod stays Pending.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: role
                operator: In
                values:
                - web
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
EOF
The IgnoredDuringExecution part is worth noting. It means that if the node's role=web label is removed after the pod is already running, the pod is not evicted. Kubernetes only enforces the constraint at scheduling time, not continuously.
Step 3: Verify Placement
kubectl get pods -o wide
# NAME READY STATUS NODE ...
# web-7d4f9b-abc 1/1 Running node1 ...
# web-7d4f9b-def 1/1 Running node2 ...
# web-7d4f9b-ghi 1/1 Running node1 ...
# web-7d4f9b-jkl 1/1 Running node2 ...
All four pods land on node1 or node2, never node3. Try adding a fifth replica and confirm it still lands on a role=web node.
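The fifth-replica check suggested above could be scripted roughly like this. It is a sketch under assumptions: check_web_placement is a hypothetical function name, and it assumes the web Deployment from Step 2 lives in the current namespace.

```shell
# Hypothetical helper: scale the "web" Deployment, then print each pod
# together with the node it was assigned to.
check_web_placement() {
  replicas="${1:-5}"   # default to five replicas when no argument is given
  kubectl scale deployment web --replicas="$replicas" &&
    kubectl get pods -l app=web \
      -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
}
```

Pods that have not been scheduled yet show <none> in the NODE column, so rerun the get command after a few seconds if the new replica has no node.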
Step 4: Preferred Affinity (Soft Rule)
preferredDuringSchedulingIgnoredDuringExecution is the soft constraint. The scheduler tries to honor it but will schedule the pod elsewhere if no preferred node is available. This is useful when hard constraints would leave pods Pending during node outages or scale-up delays.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-soft
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-soft
  template:
    metadata:
      labels:
        app: web-soft
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: role
                operator: In
                values:
                - web
      containers:
      - name: nginx
        image: nginx:1.25
EOF
weight ranges from 1 to 100. Higher weight means a stronger preference. The scheduler scores candidate nodes and picks the highest scorer. A node labeled role=web gets 80 added to its node-affinity score, which the scheduler combines with its other scoring factors. If node3 has significantly more free capacity, the scheduler may still pick it: the preference is advisory, not binding.
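Weights matter most when several preferences compete, because the weights of all matching preferred terms are summed per node. The sketch below prints a Pod manifest with two preferred terms: a node matching both collects 100 from node affinity, a node matching only role=web collects 80. The function name, the pod name web-weighted, and the disktype=ssd label are assumptions for this example; no node in this tutorial carries a disktype label.

```shell
# Hypothetical helper: print a Pod manifest with two weighted preferences.
# "disktype=ssd" is an assumed label used only for illustration.
weighted_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: web-weighted
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80        # strong preference for web nodes
        preference:
          matchExpressions:
          - key: role
            operator: In
            values:
            - web
      - weight: 20        # weaker preference for SSD-backed nodes
        preference:
          matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
  containers:
  - name: nginx
    image: nginx:1.25
EOF
}
```

Piping the output into kubectl apply -f - (weighted_pod_manifest | kubectl apply -f -) would create the pod; delete it afterwards with kubectl delete pod web-weighted.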
Step 5: Taints — Marking Nodes as Off-Limits
A taint marks a node so that pods are repelled from it unless they declare a matching toleration. Add a taint to node3:
kubectl taint nodes node3 dedicated=batch:NoSchedule
The format is key=value:effect. The three effects:
- NoSchedule: new pods are not scheduled here. Existing pods are unaffected.
- PreferNoSchedule: the scheduler prefers to avoid this node, but the decision is scoring-based. It may still schedule pods here even when other nodes are available, depending on each node's total score.
- NoExecute: new pods are not scheduled AND existing pods without a matching toleration are evicted.
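To compare taint effects across the whole cluster at a glance, a helper along these lines may be useful. It is a minimal sketch: list_node_taints is a hypothetical name, and the column headers are arbitrary; the custom-columns output format itself is a standard kubectl feature.

```shell
# Hypothetical helper: one row per node with its taint keys, values, and effects.
# The expression is quoted so the shell does not try to expand "[*]".
list_node_taints() {
  kubectl get nodes \
    -o 'custom-columns=NODE:.metadata.name,KEY:.spec.taints[*].key,VALUE:.spec.taints[*].value,EFFECT:.spec.taints[*].effect'
}
```

Nodes without taints typically show <none> in the taint columns.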
Verify the taint was applied:
kubectl describe node node3 | grep -A5 Taints
# Taints: dedicated=batch:NoSchedule
Now deploy something and confirm it never lands on node3:
kubectl run test-pod --image=nginx --restart=Never
kubectl get pod test-pod -o wide
# NODE is node1 or node2, never node3
Step 6: Tolerations - Opting Into a Tainted Node
A toleration doesn't force a pod onto a tainted node; it only allows it. The scheduler may place the pod there, but it may just as well pick any other eligible node. To actually force the pod onto node3, combine a toleration with required affinity pointing to node3.
Here's a batch job that tolerates the dedicated=batch:NoSchedule taint:
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "batch"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: role
                operator: In
                values:
                - batch
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo 'batch work done' && sleep 10"]
      restartPolicy: Never
EOF
kubectl get pod -l job-name=batch-job -o wide
# NODE is node3, the only node with role=batch AND the matching taint
Step 7: Tolerating the Control Plane
On clusters provisioned with kubeadm (self-managed clusters), control plane nodes are automatically tainted with node-role.kubernetes.io/control-plane:NoSchedule. Worker nodes don't get this taint. As a result, DaemonSet-based monitoring agents such as Prometheus node-exporter skip the control plane nodes unless they carry a matching toleration. Managed services (EKS, GKE, AKS) run the control plane on infrastructure you don't have access to, so this scenario applies to self-managed clusters only.
Add the toleration to your DaemonSet to fix this:
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-monitor
spec:
  selector:
    matchLabels:
      app: node-monitor
  template:
    metadata:
      labels:
        app: node-monitor
    spec:
      tolerations:
      - key: "node-role.kubernetes.io/control-plane"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: monitor
        image: busybox
        command: ["sh", "-c", "while true; do echo 'monitoring'; sleep 60; done"]
EOF
operator: Exists matches any taint with that key, regardless of value, but the effect field still applies (or omit it to match all effects). This is the correct form for built-in Kubernetes taints because the value may vary by provider. As a special case, leaving key empty with operator: Exists creates a wildcard toleration that matches every taint on the node.
kubectl get pods -l app=node-monitor -o wide
# Now includes control plane node(s)
Step 8: Production Pattern - Reserved GPU Node Pool
The most common production use of combined affinity + taints is reserving GPU nodes exclusively for GPU workloads. Here's the pattern:
Label and taint the GPU nodes once (usually done at node bootstrap or via a node group label):
kubectl label nodes gpu-node1 accelerator=nvidia
kubectl taint nodes gpu-node1 dedicated=gpu:NoSchedule
The GPU workload sets both the affinity (to land on GPU nodes) and the toleration (to bypass the taint):
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia
      containers:
      - name: inference
        image: your-ml-image:latest
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
CPU workloads have no toleration and no GPU affinity, so they land on regular nodes automatically. GPU nodes stay exclusive to GPU workloads without any extra configuration on the CPU side.
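One way to confirm the reservation works is to look at where the ml-inference pods run and how many GPUs each labeled node advertises. The helper below is a sketch, not a definitive check: check_gpu_pool is a hypothetical name, and it assumes the NVIDIA device plugin is installed so that nodes report the nvidia.com/gpu resource.

```shell
# Hypothetical helper: show pod placement for ml-inference, then the
# allocatable GPU count on every node labeled accelerator=nvidia.
# The backslash escapes the dot inside the resource name "nvidia.com/gpu".
check_gpu_pool() {
  kubectl get pods -l app=ml-inference -o wide &&
    kubectl get nodes -l accelerator=nvidia \
      -o 'custom-columns=NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

If the GPUS column shows no value, the node is labeled but is not advertising GPUs, and the pods requesting nvidia.com/gpu will stay Pending.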
Common Mistakes to Avoid
Using nodeSelector instead of nodeAffinity: nodeSelector is the older, simpler form. It only supports key=value equality. nodeAffinity supports the In, NotIn, Exists, DoesNotExist, Gt, and Lt operators and can express OR conditions across multiple nodeSelectorTerms entries (the expressions inside a single matchExpressions list are ANDed). Use nodeAffinity for anything non-trivial.
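In nodeAffinity, the entries under nodeSelectorTerms are ORed, while the expressions inside one matchExpressions list are ANDed. The sketch below prints a Pod manifest that shows both: a node qualifies if it satisfies either term, and the first term requires both of its expressions to hold. The function name, the pod name affinity-or-demo, and the disktype label are assumptions made for illustration.

```shell
# Hypothetical helper: print a Pod manifest with two ORed nodeSelectorTerms.
or_terms_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: affinity-or-demo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # Term 1: role=web AND a disktype label exists (both must match)
        - matchExpressions:
          - key: role
            operator: In
            values:
            - web
          - key: disktype
            operator: Exists
        # Term 2 (ORed with term 1): role=batch
        - matchExpressions:
          - key: role
            operator: In
            values:
            - batch
  containers:
  - name: nginx
    image: nginx:1.25
EOF
}
```

In the tutorial cluster only term 2 could match, and node3's taint would still block the pod unless it also carries the toleration from Step 6.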
Required affinity with no matching nodes: if you set requiredDuringSchedulingIgnoredDuringExecution and delete the label from all matching nodes, all new pods for that Deployment stay Pending indefinitely. Pods that are already running are unaffected, which can hide the problem until the next rollout or scale-up. Set up alerting for Pending pods older than 5 minutes.
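Before wiring up real alerting, a manual check for unschedulable pods can look roughly like this. The function names are hypothetical, and this sketch does not implement the 5-minute threshold; it only lists what is Pending right now and the scheduler events that explain why.

```shell
# Hypothetical helpers for spotting unschedulable pods.

# List every Pending pod in the cluster.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# Show scheduler failure events; their messages name the unmet constraint.
scheduling_failures() {
  kubectl get events --all-namespaces --field-selector=reason=FailedScheduling
}
```

When affinity is the cause, the FailedScheduling message typically reports how many nodes were considered and that they didn't match the pod's node affinity or selector.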
Confusing NoExecute with NoSchedule: NoSchedule is additive and non-disruptive (existing pods stay, new pods are barred). NoExecute is disruptive: pods without a matching toleration are evicted immediately, pods with a matching toleration and no tolerationSeconds are never evicted, and pods with a matching toleration and a tolerationSeconds: N value are evicted N seconds after the taint is applied to the node (not N seconds after the pod started). Use NoExecute intentionally, e.g. when draining a node.
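A toleration with tolerationSeconds can be sketched as follows. The helper prints a Pod manifest that tolerates a dedicated=batch:NoExecute taint for 300 seconds; the function name, the pod name grace-demo, and the 300-second value are assumptions chosen for illustration, and no step in this tutorial applies a NoExecute taint.

```shell
# Hypothetical helper: print a Pod manifest with a time-limited NoExecute toleration.
noexecute_grace_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: grace-demo
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "batch"
    effect: "NoExecute"
    tolerationSeconds: 300   # evicted 300s after the taint is added to the node
  containers:
  - name: worker
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
EOF
}
```

Dropping the tolerationSeconds line turns this into a toleration that keeps the pod on the node for as long as the taint is present.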
Toleration without matching affinity: adding a toleration lets a pod schedule on a tainted node but does not steer it there. If your goal is "run only on GPU nodes", you need both the toleration AND required affinity pointing to the GPU node label. The toleration alone allows it; the affinity enforces it.
Cleanup
kubectl delete deployment web web-soft
kubectl delete job batch-job
kubectl delete daemonset node-monitor
kubectl delete pod test-pod
kubectl taint nodes node3 dedicated=batch:NoSchedule- # trailing dash removes the taint
kubectl label nodes node1 node2 role-
kubectl label nodes node3 role-
What's Next
- Pod Topology Spread Constraints — distribute replicas evenly across zones, builds on affinity concepts from this tutorial
- Node Pools and Managed Node Groups — how EKS applies labels and taints at the node group level automatically
Official References
- Assigning Pods to Nodes: complete reference for nodeSelector, nodeAffinity, and inter-pod affinity
- Taints and Tolerations: taint effects, toleration operators, and built-in taints added by the node lifecycle controller
- Well-Known Labels, Annotations and Taints: all the standard labels like topology.kubernetes.io/zone that cloud providers add to nodes