13 min read · May 10, 2026

NVIDIA GPU Operator: Running GPU Workloads on Kubernetes

Running GPU workloads on Kubernetes without the GPU Operator means manually installing NVIDIA drivers, the container runtime, device plugin, and monitoring components on every GPU node. The GPU Operator automates all of this. But it also adds complexity — this post covers what the operator manages, how to configure it for different GPU sharing models, and the production failure modes.

Ajeet Yadav
Platform & Cloud Engineer

You provision a GPU node on your cluster. It joins, it's Ready, and then nothing works. Your ML job sits in Pending with Insufficient nvidia.com/gpu. The GPU is physically there — you can log into the node and run nvidia-smi — but Kubernetes has no idea it exists. Before a container can use that GPU, four separate components need to be in place: the NVIDIA driver loaded as a kernel module, the NVIDIA Container Runtime configured as a containerd or CRI-O hook that intercepts GPU device mounts, the Device Plugin that advertises nvidia.com/gpu as a schedulable resource, and optionally DCGM Exporter for metrics and the MIG Manager for A100/H100 GPU partitioning.
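
As a quick sketch of what that failure looks like (pod and node names here are illustrative):

bash
# The job is stuck Pending even though the node physically has a GPU
kubectl get pod train-job-0
# NAME          READY   STATUS    RESTARTS   AGE
# train-job-0   0/1     Pending   0          12m

# The node advertises no nvidia.com/gpu resource, so the scheduler has nowhere to place it
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
# (no output)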

Without automation, that's a node-by-node installation process. Worse, it breaks every time the kernel upgrades — the driver module is tied to the exact kernel version, and a routine OS update can silently disable your GPU nodes until someone manually reinstalls. The GPU Operator exists to solve this. It manages the full stack as Kubernetes DaemonSets and CRDs, declaratively, with lifecycle management built in.

That's the pitch. The reality is that the Operator also adds complexity — another set of DaemonSets to monitor, CRDs to understand, and failure modes that aren't obvious until you hit them in production. This post is a technical walkthrough of what the GPU Operator actually manages, how to configure it for different GPU sharing models, and what breaks in production.

What the GPU Operator manages

The GPU Operator is a meta-operator: it doesn't manage GPUs directly; it orchestrates a set of sub-components, each of which is a DaemonSet or CRD controller. Here's what it manages:

NVIDIA Driver DaemonSet — Installs the NVIDIA GPU driver as a container on each GPU node. The driver container uses a pre-compiled driver image for major distros (Ubuntu 22.04, RHEL 9, Amazon Linux 2023) or compiles the driver from source if no pre-built image matches your kernel. The key advantage over host-level installation: the driver container can be updated by rolling the DaemonSet, no host reinstallation required. The driver image is tagged by driver version and kernel version (nvcr.io/nvidia/driver:535.104.12-ubuntu22.04). On kernel upgrades, the Operator automatically rolls the driver DaemonSet to the matching image.

NVIDIA Container Runtime — Configures containerd or CRI-O to use nvidia-container-runtime for GPU-requesting containers. The runtime intercepts container startup and mounts the correct /dev/nvidia* device files and NVIDIA libraries into the container namespace. This is what makes nvidia-smi work inside a container — without it, the container sees no GPU devices. The Container Toolkit DaemonSet handles this configuration automatically by writing to /etc/containerd/config.toml or /etc/crio/crio.conf.d/ depending on the runtime it detects.

Device Plugin — Runs as a DaemonSet and communicates with kubelet via the Device Plugin API (stable since Kubernetes 1.26). It enumerates the physical GPUs on each node and registers them as nvidia.com/gpu extended resources. This is what makes kubectl describe node show nvidia.com/gpu: 8 in the Allocatable section. Pods request GPUs via resources.limits: nvidia.com/gpu: 1 — and the Device Plugin handles the actual device assignment, ensuring no two pods receive the same physical GPU.

DCGM Exporter — Data Center GPU Manager exposes GPU telemetry as Prometheus metrics. This includes GPU utilization, memory usage, power draw, temperature, NVLink throughput, and ECC error counts. Without DCGM, you're flying blind — you won't know if a GPU is idle, throttling, or silently accumulating ECC errors that indicate hardware failure.

GPU Feature Discovery (GFD) — Integrates with Node Feature Discovery to label nodes with GPU capabilities detected at runtime: GPU model (nvidia.com/gpu.product: A100-SXM4-80GB), CUDA driver version, CUDA compute capability, MIG support. These labels are what you use in pod affinity rules to schedule workloads on specific GPU hardware.

MIG Manager — Manages Multi-Instance GPU partitioning on A100, H100, and H200 GPUs. It watches a node label (nvidia.com/mig.config) and reconfigures the physical GPU into the specified MIG profiles. Reconfiguration requires draining running GPU processes first — more on the operational implications of this below.

Installing the GPU Operator

The GPU Operator is installed via Helm. The chart is published to NVIDIA's NGC registry.

bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --version v24.3.0 \
  --set driver.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true \
  --set gfd.enabled=true

This installs all components. On fresh nodes with no pre-installed driver, the driver DaemonSet will install the driver on each GPU node. Expect 3–5 minutes for driver installation to complete on each node — it's building kernel modules.
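
You can watch the rollout while the drivers build — a sketch, using the DaemonSet name the Operator creates (the pod suffix is illustrative):

bash
# Block until the driver DaemonSet is ready on every GPU node
kubectl -n gpu-operator rollout status ds/nvidia-driver-daemonset --timeout=10m

# Tail the kernel module build on one node's driver pod
kubectl -n gpu-operator logs nvidia-driver-daemonset-x7k2p -f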

For EKS with GPU-optimized AMIs (amazon-linux-2-gpu or the Bottlerocket GPU variant), the driver is pre-installed and tied to the AMI. Installing the GPU Operator's driver on top of the AMI driver causes conflicts. Disable driver installation:

bash
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true

The toolkit.enabled=true flag keeps the Container Runtime configuration active (containerd still needs to be told to use nvidia-container-runtime), but skips driver installation. This is the correct configuration for EKS GPU-optimized AMIs.

Verify the installation:

bash
kubectl get pods -n gpu-operator

You should see the operator pod plus DaemonSet pods in Running state on your GPU nodes: gpu-operator, nvidia-driver-daemonset, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset, nvidia-dcgm-exporter, gpu-feature-discovery. If any pod is in CrashLoopBackOff, check the logs — the most common cause is a kernel version mismatch with the pre-compiled driver image.

Verify the GPU resource is available on the node:

bash
kubectl describe node <gpu-node-name> | grep nvidia

You should see something like nvidia.com/gpu: 8 in both Capacity and Allocatable.

Requesting GPUs in pods

Once the Device Plugin is running, pods request GPUs through the standard extended resource mechanism:

yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-burn-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu-burn
      image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

One important behavior: GPU resources are exclusive by default. A pod requesting nvidia.com/gpu: 1 gets exclusive access to one entire physical GPU — no other pod will be scheduled to that GPU while it's held. This is the right model for training workloads where a job needs the full GPU memory and compute. For inference workloads, this is wasteful — a small inference model may use 8GB of a 40GB GPU, leaving 32GB idle.

Also critical: for Device Plugin resources, requests without matching limits are silently ignored. You must set limits.nvidia.com/gpu, not just requests.nvidia.com/gpu. The scheduler only considers limits when scheduling extended resources.

GPU sharing: Time-Slicing

Time-slicing advertises a physical GPU as multiple logical GPUs. Each pod receives a "GPU", and the GPU's scheduler time-multiplexes their compute work on the physical device. Unlike MIG, this is entirely software-level — there is no physical partitioning of VRAM. Every time-slice shares the full physical GPU memory space.

Configure time-slicing with a ConfigMap that the Device Plugin reads:

yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4

The replicas: 4 value means each physical GPU is advertised as 4 logical GPUs. On a 2-GPU node, you'd see nvidia.com/gpu: 8 in node Allocatable.

Apply the configuration:

bash
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set devicePlugin.config.name=time-slicing-config \
  --set devicePlugin.config.default=any

After applying, the Device Plugin DaemonSet restarts and the node's GPU capacity updates. Pods requesting nvidia.com/gpu: 1 receive a time-slice of the physical GPU.
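
To confirm the change took effect — a sketch, with an illustrative node name and assuming the Operator's usual pod labels:

bash
# A 2-GPU node with replicas: 4 now advertises 8 logical GPUs
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
#  nvidia.com/gpu:  8

# The Device Plugin pods should have restarted to pick up the ConfigMap
kubectl -n gpu-operator get pods -l app=nvidia-device-plugin-daemonset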

The fundamental limitation of time-slicing: no VRAM isolation. All pods sharing a physical GPU share the same memory address space from CUDA's perspective. If one pod allocates 30GB of VRAM on a 40GB GPU, the remaining pods get effectively 10GB. If one pod's GPU process exhausts that shared memory, the resulting failures can destabilize other pods' GPU processes on the same physical GPU. Time-slicing is appropriate for inference workloads where models are small (under 5GB) and the bottleneck is compute throughput, not memory. For multi-tenant environments where you need strict isolation, use MIG.

GPU sharing: Multi-Instance GPU (MIG)

MIG is available on A100, H100, and H200 GPUs. It physically partitions the GPU silicon into isolated instances, each with dedicated compute engines, its own L2 cache slice, and a dedicated VRAM partition. The isolation is hardware-enforced — one MIG instance cannot see or interfere with another's memory, even via GPU-to-GPU communication bugs.

MIG instance profiles define the partition size. On an A100 80GB:

  • 1g.10gb — 1/7 of compute, 10GB VRAM (you can fit 7 of these per GPU)
  • 2g.20gb — 2/7 of compute, 20GB VRAM (3 per GPU, with 1/7 of compute left unused)
  • 3g.40gb — 3/7 of compute, 40GB VRAM (2 per GPU)
  • 7g.80gb — Full GPU, all compute, 80GB VRAM (1 per GPU)

The MIG Manager watches the nvidia.com/mig.config node label and applies the matching partition layout. Set the label to trigger reconfiguration:

bash
kubectl label node gpu-node-1 nvidia.com/mig.config=all-1g.10gb

This instructs the MIG Manager to partition every GPU on gpu-node-1 into seven 1g.10gb instances. The MIG Manager will wait for all active GPU processes on the node to complete before applying the partition change — if you have a long-running training job, this can take hours.
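
You can follow the reconfiguration through the state label the MIG Manager writes back to the node (assuming the standard nvidia.com/mig.config.state label):

bash
# pending while draining and repartitioning, then success (or failed)
kubectl get node gpu-node-1 \
  -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'

# Once applied, the node advertises MIG resources instead of whole GPUs
kubectl describe node gpu-node-1 | grep nvidia.com/mig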

Configure the MIG strategy at the cluster level via the GPU Operator:

bash
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --reuse-values \
  --set migManager.enabled=true \
  --set mig.strategy=mixed

With strategy: mixed, each MIG profile type appears as a distinct resource name in the scheduler. Pods request specific profiles:

yaml
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1

With strategy: single, all MIG instances on a node use the same profile and pods continue using the generic nvidia.com/gpu: 1 resource name. Single strategy is simpler operationally but less flexible — you can't mix 1g.10gb and 3g.40gb instances on the same node.

MIG is the right choice when you're serving multiple inference models with different memory requirements, need strong tenant isolation, and are on A100/H100 hardware. The operational overhead is real — MIG reconfiguration requires draining nodes, and the partition layout is inflexible once set.

DCGM Exporter and GPU Monitoring

DCGM exposes around 100 GPU metrics. The ones that matter in production:

promql
# Per-pod GPU utilization — low values on expensive GPU nodes mean money burning
DCGM_FI_DEV_GPU_UTIL{gpu="0", pod="training-job-xxx"}

# Memory pressure — spikes here precede OOM kills
DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)

# Temperature — A100 thermal throttle threshold is around 83°C
DCGM_FI_DEV_GPU_TEMP > 83

# ECC correctable errors — non-zero over time indicates degrading DRAM
increase(DCGM_FI_DEV_ECC_SBE_AGG_TOTAL[1h]) > 0

# NVLink bandwidth — saturation here means your multi-GPU training job is network-bound
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL

Correctable ECC errors warrant a warning alert and node review. Uncorrectable ECC errors require immediate node cordon and investigation — they indicate permanent GPU memory cell failures.

Sample alert for thermal throttling:

yaml
- alert: GPUThermalThrottling
  expr: DCGM_FI_DEV_GPU_TEMP > 83
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "GPU {{ $labels.gpu }} on node {{ $labels.Hostname }} is throttling"
    description: "GPU temperature {{ $value }}°C exceeds safe threshold. Check cooling and workload density."

The DCGM Exporter runs as a DaemonSet on GPU nodes. If it crashes, it doesn't affect GPU scheduling — your workloads keep running — but you lose observability. Monitor the DCGM Exporter pod restart count separately from your workload pods.
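
A quick check for silent exporter restarts — a sketch, assuming the Operator's usual pod label:

bash
# Non-zero restart counts mean gaps in your GPU metric history
kubectl -n gpu-operator get pods -l app=nvidia-dcgm-exporter \
  -o custom-columns=POD:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount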

Node labeling and GPU-aware scheduling

GPU Feature Discovery runs on each GPU node after the driver is installed and writes labels to the node object. These labels are how you express scheduling affinity for specific GPU hardware without hardcoding node names:

yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - "A100-SXM4-80GB"
                - "A100-PCIE-80GB"

This ensures a training job requiring NVLink (only available on SXM4 form factor) doesn't accidentally land on a PCIe A100 that lacks it.

Key GFD labels and what they represent:

  • nvidia.com/gpu.present: "true" — node has a GPU (the baseline label)
  • nvidia.com/gpu.product: "A100-SXM4-80GB" — exact GPU model string from NVML
  • nvidia.com/cuda.driver.major: "535" — installed CUDA driver major version
  • nvidia.com/cuda.driver.minor: "104" — CUDA driver minor version
  • nvidia.com/mig.capable: "true" — GPU supports MIG (A100/H100/H200)
  • nvidia.com/gpu.memory: "81920" — GPU VRAM in MiB

The CUDA driver version labels are particularly useful for workloads that require a minimum CUDA version. A container using CUDA 12.4 features needs a driver version that supports at least CUDA 12.4 runtime. Use nvidia.com/cuda.driver.major to ensure workloads only land on nodes with a compatible driver.
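
kubectl can surface these labels as columns, which makes for a quick hardware inventory of the cluster:

bash
# List GPU nodes with model, VRAM, and driver major version side by side
kubectl get nodes -l nvidia.com/gpu.present=true \
  -L nvidia.com/gpu.product -L nvidia.com/gpu.memory -L nvidia.com/cuda.driver.major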

EKS GPU nodes and the GPU Operator

EKS GPU node configuration has more nuance than the general case because AWS manages the driver in the AMI. Your options:

GPU Operator with driver.enabled=false — Use the Operator to manage the device plugin, DCGM Exporter, container toolkit, and GFD, but let the AMI handle driver installation. This is my recommended approach for EKS. You get the full Operator feature set (DCGM, GFD, MIG management) without fighting with the AMI-managed driver.

Device plugin only — Skip the GPU Operator entirely and run the NVIDIA k8s-device-plugin DaemonSet on its own (the approach AWS documents for EKS). It registers nvidia.com/gpu resources and that's it. Simpler: no CRDs, no operator complexity. But you lose DCGM metrics and MIG management. Acceptable for simple single-tenant clusters; not acceptable for production ML platforms.

For Karpenter-managed GPU nodes, define a NodePool that targets GPU instance types and taints them so only GPU workloads land there:

yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-nodes
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-class
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge", "p3.8xlarge", "g5.xlarge"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 64
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s

The NoSchedule taint on nvidia.com/gpu is essential. GPU nodes are expensive — a p4d.24xlarge with 8 A100s costs over $30/hour on-demand. If you don't taint them, CPU workloads will opportunistically schedule there when the cluster is under pressure, burning GPU capacity for workloads that don't need it.

Tolerations for GPU pods

All GPU pods must tolerate the node taint to be scheduled on GPU nodes:

yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

This is easy to forget, especially in ML frameworks that generate pod specs programmatically. If your training job stays Pending on a cluster with available GPU nodes, check for missing tolerations before debugging anything else. It's the most common cause.
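
The scheduler's events make this diagnosis fast — a sketch with an illustrative pod name and abridged output:

bash
kubectl describe pod training-job-0 | grep -A4 Events
# Warning  FailedScheduling  ...  0/12 nodes are available:
#   4 node(s) had untolerated taint {nvidia.com/gpu: true}, ...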

Production failure modes

Driver installation failure — The GPU Operator driver pod will fail to start if the pre-compiled driver image doesn't match your kernel version. This happens when you update the OS without updating the GPU Operator, or when you use a custom kernel build. Symptoms: nvidia-driver-daemonset pod is in CrashLoopBackOff, logs show No precompiled package found for kernel. Solutions: pin your kernel version in the node AMI/image, update the GPU Operator to a version that ships a driver image for your kernel, or enable source compilation (the Operator supports this via driver.usePrecompiled=false, but compilation takes 10–15 minutes per node on cold start).
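
Debugging starts by comparing the node's running kernel against what the driver pod expected — a sketch (pod name illustrative):

bash
# The kernel the node is actually running
kubectl get node gpu-node-1 -o jsonpath='{.status.nodeInfo.kernelVersion}'

# The kernel the driver container failed to find a package for
kubectl -n gpu-operator logs nvidia-driver-daemonset-x7k2p | tail -20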

DCGM Exporter crash — DCGM Exporter can crash if DCGM loses connection to the GPU driver, typically after a driver restart or node-level GPU reset. This doesn't affect scheduling or workload execution, but it breaks your observability. Set up an alert on DCGM Exporter pod restarts separately from your workload health checks. The Operator will restart it automatically, but there's a gap in metric collection during the restart window.

MIG Manager reconfiguration delay — When you change the nvidia.com/mig.config label on a node, the MIG Manager waits for all active GPU processes to complete before reconfiguring. It will not forcibly evict running pods. On a node running a 12-hour training job, the reconfiguration doesn't happen for 12 hours. If you need to change MIG profiles urgently, you have to manually cordon the node and evict GPU pods. Plan MIG profile changes during maintenance windows or build them into your initial node provisioning.
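
If you do need the change now, the manual path is cordon and drain, accepting that running GPU jobs get evicted:

bash
# Stop new pods landing on the node, then evict the running workloads
kubectl cordon gpu-node-1
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data

# With the GPUs idle, the MIG Manager applies the pending profile; then reopen the node
kubectl uncordon gpu-node-1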

Time-slicing memory OOM — With time-slicing enabled, there is no per-pod VRAM limit enforcement. If a pod allocates more VRAM than expected — a model that's larger than anticipated, or a batch size increase — it can exhaust the GPU memory shared by all time-slice holders. The GPU driver will kill GPU processes to reclaim memory, but "which process gets killed" is not deterministic. You may see other pods' GPU jobs crash with CUDA error: out of memory without any increase in their own memory usage. Set conservative resources.limits.memory (CPU memory) and use resource quotas at the namespace level to limit how many time-slices a team can claim. But accept that time-slicing has fundamentally weaker isolation guarantees than MIG.
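
A namespace-level ResourceQuota is a blunt but effective cap on how many time-slices a team can hold — a sketch, with an illustrative namespace and limit:

bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-slices
  namespace: team-inference
spec:
  hard:
    requests.nvidia.com/gpu: "8"  # at most 8 time-slices claimed across the namespace
EOF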

GPU Operator and cert-manager conflict — GPU Operator v24.x requires cert-manager for webhook TLS. If your cluster runs cert-manager and the GPU Operator installs its own cert-manager instance, you'll have two cert-manager deployments conflicting over webhook registrations. Install the GPU Operator with --set cert-manager.enabled=false and ensure your existing cert-manager version is 1.13+ (required for GPU Operator v24.x webhook compatibility).

Frequently Asked Questions

Does the GPU Operator support AMD GPUs?

No. The GPU Operator is NVIDIA-specific. For AMD ROCm workloads, use the AMD GPU device plugin (k8s-device-plugin from the ROCm repository), which registers amd.com/gpu resources. As of mid-2026, there is no AMD equivalent of the full GPU Operator — you get a device plugin and that's it. No equivalent of DCGM for AMD GPUs at the Kubernetes operator level; you'd integrate ROCm SMI metrics separately.

How does the GPU Operator interact with containerd vs. CRI-O?

The Container Toolkit DaemonSet detects the container runtime at startup by checking which socket exists (/run/containerd/containerd.sock vs. /var/run/crio/crio.sock) and updates the appropriate configuration. For containerd, it adds a runtime class entry to /etc/containerd/config.toml pointing to nvidia-container-runtime. For CRI-O, it writes a drop-in config to /etc/crio/crio.conf.d/99-nvidia.conf. After configuration, it signals the runtime to reload (containerd: systemctl reload containerd; CRI-O: systemctl reload crio). If you're using containerd with a custom config.toml that already has runtime class entries, verify the Operator doesn't overwrite them — use --set toolkit.env[0].name=CONTAINERD_CONFIG,toolkit.env[0].value=/etc/containerd/config.toml to point the Toolkit to your actual config path if it differs from the default.
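
To verify what the Toolkit actually wrote, inspect the rendered containerd configuration on the node itself (via SSH or a node debug pod):

bash
# The nvidia runtime should appear as a registered containerd runtime handler
containerd config dump | grep -B2 -A3 nvidia

# And check whether it was made the default runtime
grep default_runtime_name /etc/containerd/config.toml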

Can I run both time-slicing and MIG on the same cluster?

Yes — configure per-node. Label some nodes with nvidia.com/mig.config=all-1g.10gb for MIG partitioning and configure time-slicing only on non-MIG nodes via the Device Plugin ConfigMap. The Device Plugin config supports per-node-label selectors, so you can define different sharing policies for different node groups. You cannot enable both time-slicing and MIG on the same physical GPU simultaneously — MIG disables time-slicing at the hardware level. On MIG-enabled GPUs, each MIG instance can be time-sliced, but that's configured at the CUDA application level, not through the GPU Operator.
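
Per-node selection works through a node label the Device Plugin watches — a sketch, assuming a named entry time-sliced exists in the Device Plugin ConfigMap:

bash
# Point inference nodes at a named config entry for time-slicing
kubectl label node inference-node-1 nvidia.com/device-plugin.config=time-sliced

# MIG nodes get their partition layout via the MIG Manager label instead
kubectl label node mig-node-1 nvidia.com/mig.config=all-1g.10gb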

What's the minimum GPU Operator version for Kubernetes 1.29+?

GPU Operator v24.x is the minimum supported version for Kubernetes 1.29 and later. The v23.x series doesn't support the CRD validation changes introduced in Kubernetes 1.29. Always cross-reference the NVIDIA GPU Operator version compatibility matrix before upgrading either the Operator or Kubernetes — they're versioned independently, and a mismatch between them is a common source of installation failures.


For Karpenter NodePool configuration that provisions GPU nodes on demand and scales to zero when no GPU workloads are running, see Kubernetes Cluster Autoscaler vs. Karpenter — in particular the consolidationPolicy: WhenEmpty and consolidateAfter settings that control how quickly idle GPU nodes are terminated.

For resource requests and limits — and specifically why GPU pods must always set limits.nvidia.com/gpu rather than just requests — see Kubernetes Resource Management: Requests, Limits, and QoS. Device Plugin resources behave differently from CPU and memory: a request without a matching limit is silently ignored by the scheduler, so pods that specify only requests.nvidia.com/gpu will never be scheduled.

For EKS upgrade procedures that affect GPU AMIs and require coordinating driver versions with Kubernetes versions, see EKS Upgrades: Zero-Downtime Strategy. GPU node upgrades are more involved than standard node upgrades because the driver version is baked into the AMI — a Kubernetes minor version upgrade often requires a new GPU AMI with a different driver, which means draining and replacing all GPU nodes.

For FinOps strategies including GPU spot instances and GPU utilization monitoring to reduce idle GPU spend, see Kubernetes Cost Optimization and FinOps. GPU utilization below 60% on a p4d.24xlarge is money on fire — the DCGM metrics covered above are the foundation of any GPU cost optimization effort.


Setting up GPU infrastructure for ML workloads on Kubernetes and unsure whether to use the GPU Operator or the bare device plugin? Talk to us at Coding Protocols — we help platform teams configure GPU node pools, sharing strategies, and monitoring for production ML workloads.

Related Topics

Kubernetes
NVIDIA
GPU
GPU Operator
Machine Learning
EKS
Platform Engineering
