Kubernetes

Node Affinity, Taints & Tolerations in Production

Intermediate | 40 min to complete | 13 min read

Pin workloads to the right nodes and keep undesirable pods away. This tutorial covers node labels, requiredDuring vs preferredDuring affinity, taint effects (NoSchedule, NoExecute, PreferNoSchedule), and how to combine both for GPU node pools.

Before you begin

  • kubectl installed and configured
  • Access to a running Kubernetes cluster (EKS, GKE, or kind)
  • Basic familiarity with Kubernetes Deployments

Without placement controls, Kubernetes schedules pods wherever capacity exists. In practice that means CPU-hungry batch jobs land on the same nodes as latency-sensitive APIs, spot instances host stateful databases, and your monitoring agents miss the control plane entirely.

Node affinity and taints are the two tools that fix this. Affinity attracts pods to specific nodes. Taints repel pods from nodes unless they explicitly tolerate the taint. Together they let you segment workloads precisely without hard-coding node IPs anywhere.

What You'll Build

Two scenarios against the same 3-node cluster:

  1. A web Deployment pinned to nodes labeled role=web using hard (required) node affinity.
  2. A monitoring DaemonSet that tolerates the node-role.kubernetes.io/control-plane taint so it can run on every node including control plane nodes.

As a bonus, Step 8 shows the combined pattern used for reserved GPU node pools.

Step 1: Label Your Nodes

Node labels are the foundation of affinity rules. Add a role label to simulate a segmented cluster:

bash
kubectl label nodes node1 role=web
kubectl label nodes node2 role=web
kubectl label nodes node3 role=batch

# Verify
kubectl get nodes --show-labels | grep role
# node1   Ready   <none>   ...   kubernetes.io/os=linux,role=web,...
# node2   Ready   <none>   ...   kubernetes.io/os=linux,role=web,...
# node3   Ready   <none>   ...   kubernetes.io/os=linux,role=batch,...

Replace node1, node2, node3 with your actual node names from kubectl get nodes.
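
If the grep output is hard to scan, a label column is an alternative. This is a small sketch using kubectl's -L (--label-columns) and -l (--selector) flags; it assumes the role label applied in this step:

```bash
# Show the value of the role label as its own column
kubectl get nodes -L role

# List only the nodes that carry role=web
kubectl get nodes -l role=web
```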

Step 2: Required Affinity (Hard Rule)

requiredDuringSchedulingIgnoredDuringExecution is the hard constraint. The scheduler will not place the pod on a node that doesn't satisfy it. If no matching node exists, the pod stays Pending.

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: role
                    operator: In
                    values:
                      - web
      containers:
        - name: nginx
          image: nginx:1.25
          ports:
            - containerPort: 80
EOF

The IgnoredDuringExecution part is worth noting. It means if the node's role=web label is removed after the pod is already running, the pod is not evicted. Kubernetes only enforces the constraint at scheduling time, not continuously.
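
One way to see this for yourself, assuming the web Deployment above is running and node1 hosts at least one of its pods: remove the label, check that the pods are still there, then restore the label so the rest of the tutorial works as written.

```bash
# Remove the role label from node1 (a trailing dash deletes a label)
kubectl label nodes node1 role-

# Pods already on node1 keep running; only new scheduling decisions see the change
kubectl get pods -l app=web -o wide

# Restore the label before continuing
kubectl label nodes node1 role=web
```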

Step 3: Verify Placement

bash
kubectl get pods -o wide
# NAME                   READY   STATUS    NODE    ...
# web-7d4f9b-abc         1/1     Running   node1   ...
# web-7d4f9b-def         1/1     Running   node2   ...
# web-7d4f9b-ghi         1/1     Running   node1   ...
# web-7d4f9b-jkl         1/1     Running   node2   ...

All four pods land on node1 or node2 — never node3. Try adding a fifth replica and confirm it still lands on a role=web node.
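
A quick way to run that check, assuming the Deployment is still named web. The custom-columns output trims the listing down to pod name and node:

```bash
# Scale from four replicas to five
kubectl scale deployment web --replicas=5

# Print only the pod name and the node it was scheduled on
kubectl get pods -l app=web -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
```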

Step 4: Preferred Affinity (Soft Rule)

preferredDuringSchedulingIgnoredDuringExecution is the soft constraint. The scheduler tries to honor it but will schedule the pod elsewhere if no preferred node is available. This is useful when hard constraints would leave pods Pending during node outages or scale-up delays.

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-soft
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-soft
  template:
    metadata:
      labels:
        app: web-soft
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: role
                    operator: In
                    values:
                      - web
      containers:
        - name: nginx
          image: nginx:1.25
EOF

weight ranges from 1 to 100. Higher weight means a stronger preference. For each candidate node, the scheduler adds up the weights of the preferred terms the node satisfies, combines that sum with its other scoring functions, and picks the highest scorer. Here a node labeled role=web earns the term's weight of 80. If node3 has significantly more free capacity, the scheduler may still pick it: the preference is advisory, not binding.
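
To make the weight arithmetic concrete, here is a toy model in plain bash. It is a sketch of the summing step only, not the scheduler's implementation: the real scheduler also normalizes this value and mixes in scores from other plugins. The affinity_score function and the disk=ssd label are hypothetical names invented for this illustration.

```bash
#!/usr/bin/env bash
# Toy model: sum the weights of the preferred terms a node satisfies.
# Usage: affinity_score "<node labels>" "<weight> <key=value>" ...
affinity_score() {
  local node_labels=" $1 "   # pad with spaces so only whole labels match
  shift
  local total=0 term weight label
  for term in "$@"; do
    weight=${term%% *}       # text before the first space
    label=${term#* }         # text after the first space
    case "$node_labels" in
      *" $label "*) total=$((total + weight)) ;;
    esac
  done
  echo "$total"
}

# Two hypothetical preferences: role=web (weight 80) and disk=ssd (weight 20)
affinity_score "role=web disk=ssd" "80 role=web" "20 disk=ssd"   # prints 100
affinity_score "role=web"          "80 role=web" "20 disk=ssd"   # prints 80
affinity_score "role=batch"        "80 role=web" "20 disk=ssd"   # prints 0
```

A node that satisfies neither term scores 0 on affinity but can still win on the scheduler's other criteria, which is why a soft rule never leaves a pod Pending.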

Step 5: Taints — Marking Nodes as Off-Limits

A taint marks a node so that pods are repelled from it unless they declare a matching toleration. Add a taint to node3:

bash
kubectl taint nodes node3 dedicated=batch:NoSchedule

The format is key=value:effect. The three effects:

  • NoSchedule — new pods are not scheduled here. Existing pods are unaffected.
  • PreferNoSchedule — the scheduler prefers to avoid this node, but the decision is scoring-based. It may still schedule pods here even when other nodes are available, depending on each node's total score.
  • NoExecute — new pods are not scheduled AND existing pods without a matching toleration are evicted.

Verify the taint was applied:

bash
kubectl describe node node3 | grep -A5 Taints
# Taints:  dedicated=batch:NoSchedule
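
To check every node at once rather than one at a time, a custom-columns query can list the taint keys per node. This is a sketch; the column names NAME and TAINTS are arbitrary labels chosen here:

```bash
# One row per node, with the keys of any taints it carries
kubectl get nodes -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```

The single quotes keep the shell from treating [*] as a glob pattern.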

Now deploy something and confirm it never lands on node3:

bash
kubectl run test-pod --image=nginx --restart=Never
kubectl get pod test-pod -o wide
# NODE is node1 or node2, never node3

Step 6: Tolerations — Opting Into a Tainted Node

A toleration doesn't force a pod onto a tainted node; it only allows it. The scheduler can place the pod there, but it can equally pick any untainted node that fits. To pin the pod to node3, combine the toleration with required affinity that selects node3.

Here's a batch job that tolerates the dedicated=batch:NoSchedule taint and requires a node labeled role=batch:

bash
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  template:
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "batch"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: role
                    operator: In
                    values:
                      - batch
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo 'batch work done' && sleep 10"]
      restartPolicy: Never
EOF

Check where the Job's pod landed:

bash
kubectl get pod -l job-name=batch-job -o wide
# NODE is node3: the only node with role=batch AND the matching taint

Step 7: Tolerating the Control Plane

On clusters provisioned with kubeadm (self-managed clusters), control plane nodes are automatically tainted with node-role.kubernetes.io/control-plane:NoSchedule. Worker nodes don't get this taint. As a result, a monitoring agent deployed as a DaemonSet without a matching toleration, such as a node exporter, skips the control plane nodes. Managed services (EKS, GKE, AKS) run the control plane on infrastructure you don't have access to, so this scenario applies to self-managed clusters only.

Add the toleration to your DaemonSet to fix this:

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-monitor
spec:
  selector:
    matchLabels:
      app: node-monitor
  template:
    metadata:
      labels:
        app: node-monitor
    spec:
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: monitor
          image: busybox
          command: ["sh", "-c", "while true; do echo 'monitoring'; sleep 60; done"]
EOF

operator: Exists matches any taint with that key, regardless of value. The effect field still applies; omit it to match all effects. This form is a good fit for built-in Kubernetes taints, where you care about the key rather than the value. As a special case, leaving key empty with operator: Exists creates a wildcard toleration that matches every taint on the node.

bash
kubectl get pods -l app=node-monitor -o wide
# Now includes control plane node(s)
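
The matching rules from this step can be written down as a small bash function. This is a simplified model for building intuition, not Kubernetes source code; the tolerates function and its argument order are hypothetical, chosen only for this sketch.

```bash
#!/usr/bin/env bash
# Simplified model of taint/toleration matching.
# Args: TOL_KEY TOL_OPERATOR TOL_VALUE TOL_EFFECT  TAINT_KEY TAINT_VALUE TAINT_EFFECT
# Returns 0 if the toleration matches the taint, 1 otherwise.
tolerates() {
  local tkey=$1 top=$2 tval=$3 teff=$4 key=$5 val=$6 eff=$7
  # An empty toleration effect matches every taint effect.
  [ -z "$teff" ] || [ "$teff" = "$eff" ] || return 1
  case "$top" in
    Exists)
      # Empty key plus Exists is the wildcard; otherwise the keys must match.
      [ -z "$tkey" ] || [ "$tkey" = "$key" ] ;;
    Equal)
      # Equal needs both key and value to match.
      [ "$tkey" = "$key" ] && [ "$tval" = "$val" ] ;;
    *)
      return 1 ;;
  esac
}

# The DaemonSet toleration against the control plane taint
tolerates "node-role.kubernetes.io/control-plane" Exists "" NoSchedule \
          "node-role.kubernetes.io/control-plane" "" NoSchedule && echo "match"

# Same key but a different value under Equal
tolerates "dedicated" Equal "gpu" NoSchedule \
          "dedicated" "batch" NoSchedule || echo "no match"

# Wildcard: empty key, Exists, empty effect
tolerates "" Exists "" "" "dedicated" "batch" NoExecute && echo "wildcard match"
```

A pod is only allowed onto a node when every taint on that node is matched by at least one of its tolerations, so in practice this check runs once per taint.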

Step 8: Production Pattern — Reserved GPU Node Pool

A common production use of combined affinity and taints is reserving GPU nodes exclusively for GPU workloads. Here's the pattern:

Label and taint the GPU nodes once (usually done at node bootstrap or via a node group label):

bash
kubectl label nodes gpu-node1 accelerator=nvidia
kubectl taint nodes gpu-node1 dedicated=gpu:NoSchedule

GPU workload — sets both the affinity (to land on GPU nodes) and the toleration (to bypass the taint):

bash
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ml-inference
  template:
    metadata:
      labels:
        app: ml-inference
    spec:
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: accelerator
                    operator: In
                    values:
                      - nvidia
      containers:
        - name: inference
          image: your-ml-image:latest
          resources:
            limits:
              nvidia.com/gpu: 1
EOF

CPU workloads have no toleration and no GPU affinity — they land on regular nodes automatically. GPU nodes stay exclusive to GPU workloads without any extra configuration on the CPU side.
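
To confirm the pool really is exclusive, you can list what runs on the GPU node. This sketch assumes the node is named gpu-node1, as in the commands above:

```bash
# Where did the GPU workload land?
kubectl get pods -l app=ml-inference -o wide

# Everything currently scheduled on the GPU node, across all namespaces
kubectl get pods -A -o wide --field-selector spec.nodeName=gpu-node1
```

Expect to see system DaemonSet pods there as well if they carry broad tolerations; the taint only keeps out pods that don't tolerate it.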

Common Mistakes to Avoid

Using nodeSelector instead of nodeAffinity: nodeSelector is the older, simpler form. It only supports key=value equality. nodeAffinity supports the In, NotIn, Exists, DoesNotExist, Gt, and Lt operators and can express OR conditions: entries under nodeSelectorTerms are ORed, while the expressions inside a single matchExpressions list are ANDed. Use nodeAffinity for anything non-trivial.
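
As an illustration of what nodeSelector cannot express, the sketch below combines AND and OR in one rule. The pod name or-affinity-demo is hypothetical; the role and accelerator labels are the ones used earlier in this tutorial.

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: or-affinity-demo   # hypothetical name for this sketch
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          # Term 1: role=web AND the node runs Linux
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - web
              - key: kubernetes.io/os
                operator: In
                values:
                  - linux
          # Term 2 (ORed with term 1): any node that has an accelerator label
          - matchExpressions:
              - key: accelerator
                operator: Exists
  containers:
    - name: nginx
      image: nginx:1.25
EOF

# Remove the demo pod when you are done
kubectl delete pod or-affinity-demo
```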

Required affinity with no matching nodes: if you set requiredDuringSchedulingIgnoredDuringExecution and delete the label from all matching nodes, all new pods for that Deployment stay Pending indefinitely. Set up alerting for Pending pods older than 5 minutes.
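
Until alerting is in place, two manual checks can surface the problem. This is a sketch using kubectl field selectors; it looks across all namespaces:

```bash
# Pending pods in every namespace
kubectl get pods -A --field-selector=status.phase=Pending

# Scheduler events that explain why a pod could not be placed
kubectl get events -A --field-selector reason=FailedScheduling
```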

Confusing NoExecute with NoSchedule: NoSchedule is non-disruptive, so existing pods stay and only new pods are barred. NoExecute is disruptive: pods without a matching toleration are evicted immediately. Pods with a matching toleration and no tolerationSeconds are never evicted. Pods with a matching toleration and a tolerationSeconds: N value are evicted N seconds after the taint is applied to the node (not N seconds after the pod started). Use NoExecute intentionally, e.g. when draining a node.
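
The sketch below shows where tolerationSeconds goes in a pod spec. The pod name grace-period-demo and the 300-second value are assumptions picked for illustration; the taint key and value reuse dedicated=batch from earlier steps.

```bash
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: grace-period-demo   # hypothetical name for this sketch
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoExecute"
      tolerationSeconds: 300   # stay for up to 300s after the taint appears
  containers:
    - name: app
      image: nginx:1.25
EOF

# Remove the demo pod when you are done
kubectl delete pod grace-period-demo
```

If a dedicated=batch:NoExecute taint were added to the node running this pod, the pod would be evicted roughly 300 seconds later, while pods without the toleration would be evicted right away.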

Toleration without matching affinity: adding a toleration lets a pod schedule on a tainted node but doesn't prefer it. If your goal is "run only on GPU nodes", you need both the toleration AND required affinity pointing to the GPU node label. The toleration alone allows it; the affinity enforces it.

Cleanup

bash
kubectl delete deployment web web-soft
kubectl delete pod test-pod
kubectl delete job batch-job
kubectl delete daemonset node-monitor
kubectl taint nodes node3 dedicated=batch:NoSchedule-   # trailing dash removes the taint
kubectl label nodes node1 node2 role-
kubectl label nodes node3 role-
