15 min read · May 6, 2026

Kubernetes Cluster Upgrades: Zero-Downtime Strategy for Production

Kubernetes minor versions are supported for roughly 14 months. Falling behind on upgrades means running unsupported versions with unpatched CVEs and no access to new API features. Here's how to upgrade production clusters without downtime — and the checklist that prevents the common failure modes.

Ajeet Yadav
Platform & Cloud Engineer

Kubernetes releases a new minor version every four months. Standard support covers three minor versions — the current release plus two prior. Fall more than two versions behind and you're running unsupported Kubernetes with unpatched security vulnerabilities, no upstream bug fixes, and an increasingly wide delta to close when you finally do upgrade.

The good news: Kubernetes upgrades on managed platforms (EKS, GKE, AKS) are well-understood operations with clear failure modes. The bad news: the failure modes are predictable enough that organisations keep hitting them anyway — deprecated APIs, add-on version drift, nodes that don't drain cleanly.

This post covers the upgrade process end to end: pre-upgrade preparation, control plane upgrade, node upgrade strategies, add-on management, and the rollback plan you need before starting.


The Upgrade Model

Before diving into process, understand what gets upgraded and in what order:

Control Plane (API server, etcd, scheduler, controller manager)
    ↓
Node Groups (kubelet, kube-proxy)
    ↓
Add-ons (CoreDNS, kube-proxy, VPC CNI, CSI drivers)
    ↓
In-cluster tools (Argo CD, Prometheus, cert-manager, Kyverno)

Rule: Never upgrade nodes to a version higher than the control plane. The Kubernetes version skew policy allows kubelet (nodes) to be up to three minor versions behind the kube-apiserver (e.g., a 1.32 control plane supports nodes running 1.32, 1.31, 1.30, or 1.29). This three-version skew applies to Kubernetes 1.28+; on 1.27 and earlier the allowed skew was two minor versions. Upgrade the control plane first, then nodes.

Rule: Add-ons must be compatible with the new Kubernetes version. Many add-ons are sensitive to API group changes — kube-proxy and CoreDNS are bundled, but AWS VPC CNI, EBS CSI driver, cert-manager, and Kyverno are not.
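
A quick way to see your current skew before planning (plain kubectl output, nothing cluster-specific assumed):

bash
# Control plane version
kubectl version | grep -i server

# Kubelet version on every node; all must be within the allowed skew of the API server
kubectl get nodes -o custom-columns='NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion'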


Pre-Upgrade Checklist

Run this checklist at least one sprint before the planned upgrade date.

1. Check for Deprecated API Usage

Kubernetes deprecates and removes API versions on a defined schedule. Using a removed API version after upgrade causes workloads to fail to deploy. Check for deprecated APIs in use:

bash
# Install pluto (deprecated API checker)
brew install FairwindsOps/tap/pluto

# Scan running resources in cluster
pluto detect-all-in-cluster --target-versions k8s=v1.32

# Scan Helm releases
pluto detect-helm --target-versions k8s=v1.32

# Scan local manifests
pluto detect-files -d ./manifests --target-versions k8s=v1.32

Common API removals to watch:

  • networking.k8s.io/v1beta1 Ingress (removed in 1.22)
  • policy/v1beta1 PodDisruptionBudget (removed in 1.25)
  • policy/v1beta1 PodSecurityPolicy (removed in 1.25)
  • autoscaling/v2beta2 HorizontalPodAutoscaler (removed in 1.26)
  • batch/v1beta1 CronJob (removed in 1.25)

Fix any deprecated API usage before upgrading. For Helm charts, this means upgrading the chart to a version that uses the current API version.
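
To vet a chart before upgrading it, render it locally and pipe the output through pluto (the release name and chart path here are placeholders):

bash
# Render the chart and scan the rendered manifests for removed APIs
helm template my-release ./charts/my-app | pluto detect - --target-versions k8s=v1.32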

2. Verify PodDisruptionBudgets

Node drain during upgrade will evict pods. If a PDB blocks eviction (minAvailable can't be satisfied), the drain hangs:

bash
# Check all PDBs
kubectl get pdb -A

# Identify PDBs that would block drain
kubectl get pdb -A -o json | jq '.items[] |
  select(.status.disruptionsAllowed == 0) |
  {name: .metadata.name, namespace: .metadata.namespace, reason: "0 disruptions allowed"}'

A PDB with minAvailable: 100% or maxUnavailable: 0 on a single-replica Deployment blocks all evictions. Before upgrading, either scale the Deployment to 2+ replicas or temporarily relax the PDB. Set a reminder to restore it after upgrade.
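
Both fixes are one-liners; the resource names below are hypothetical:

bash
# Scale the single-replica Deployment so the PDB can be satisfied
kubectl scale deployment my-app -n production --replicas=2

# Or temporarily lower the PDB threshold (assuming it uses minAvailable); restore it after the upgrade
kubectl patch pdb my-app-pdb -n production --type merge -p '{"spec":{"minAvailable":1}}'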

3. Verify Add-on Version Compatibility

For each add-on, check compatibility with the target Kubernetes version:

bash
# EKS: check current add-on versions
aws eks list-addons --cluster-name my-cluster --region us-east-1

# Get the latest compatible version for each add-on for the target K8s version
aws eks describe-addon-versions \
  --kubernetes-version 1.32 \
  --addon-name aws-ebs-csi-driver \
  --query 'addons[0].addonVersions[0].addonVersion' \
  --output text

Key add-ons to verify:

  • aws-vpc-cni — must be >= minimum version for target K8s
  • aws-ebs-csi-driver — CSI API changes between K8s versions
  • coredns — bundled, but may have configuration changes
  • kube-proxy — bundled, pinned to K8s version
  • cert-manager — check cert-manager compatibility matrix
  • kyverno — check Kyverno release notes for K8s version support
  • metrics-server — check API server compatibility

4. Take a Backup

Before any upgrade, take a Velero backup:

bash
velero backup create pre-upgrade-$(date +%Y%m%d) \
  --include-namespaces "*" \
  --include-cluster-scoped-resources=true \
  --wait

On EKS, etcd is managed by AWS and backed up automatically; there is no customer-configurable etcd backup, which is why the Velero backup is your recoverable copy of cluster state.

5. Check Node Health

bash
# Nodes not in Ready state will cause issues during upgrade
# (match " Ready " with surrounding spaces so NotReady nodes aren't filtered out too)
kubectl get nodes --no-headers | grep -v ' Ready '

# Check for nodes with resource pressure
kubectl describe nodes | grep -A 5 "Conditions:"

Resolve any unhealthy nodes before upgrading.

6. Upgrade in Non-Production First

Run the upgrade in staging/dev before production. The upgrade process itself should be tested even if you've run it before — add-on compatibility, PDB behaviour, and deprecated API removal all manifest during the actual upgrade, not during planning.


Control Plane Upgrade

EKS

bash
# Upgrade control plane (only one minor version at a time)
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.32 \
  --region us-east-1

# Monitor upgrade progress
aws eks describe-update \
  --name my-cluster \
  --update-id <update-id> \
  --region us-east-1

# Or watch via kubectl (API server briefly unavailable during upgrade)
watch kubectl get nodes

EKS control plane upgrade takes 15–30 minutes. The API server is briefly unavailable — typically 30–60 seconds — as it rolls over. Existing workloads continue running; only new API calls fail during this window. Plan upgrades outside peak traffic windows.
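
If you're scripting the upgrade, the CLI waiter blocks until the control plane reports ACTIVE again:

bash
# Returns once the cluster is back to ACTIVE
aws eks wait cluster-active --name my-cluster --region us-east-1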

After control plane upgrade, update EKS managed add-ons:

bash
# Check the current version first: aws eks describe-addon-versions --addon-name aws-vpc-cni
# Update each managed add-on to the latest compatible version
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name aws-vpc-cni \
  --addon-version $(aws eks describe-addon-versions --addon-name aws-vpc-cni --kubernetes-version 1.32 --query 'addons[0].addonVersions[0].addonVersion' --output text) \
  --resolve-conflicts OVERWRITE \
  --region us-east-1

# Repeat for coredns, kube-proxy, aws-ebs-csi-driver

GKE

bash
# Upgrade control plane
gcloud container clusters upgrade my-cluster \
  --master \
  --cluster-version 1.32 \
  --region us-central1

# If on a release channel, trigger upgrade manually
gcloud container clusters upgrade my-cluster \
  --master \
  --region us-central1

On GKE, if you use release channels (Regular, Stable, Rapid), Google schedules control plane upgrades automatically within your maintenance window. For production clusters, configure a maintenance window and exclusion windows:

bash
gcloud container clusters update my-cluster \
  --maintenance-window-start "2026-05-09T02:00:00Z" \
  --maintenance-window-end "2026-05-09T06:00:00Z" \
  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU" \
  --region us-central1

AKS

bash
# Upgrade control plane only first
az aks upgrade \
  --resource-group my-rg \
  --name my-cluster \
  --kubernetes-version 1.32.0 \
  --control-plane-only

# After verifying control plane, upgrade node pools

AKS also supports auto-upgrade channels:

bash
az aks update \
  --resource-group my-rg \
  --name my-cluster \
  --auto-upgrade-channel stable

Node Upgrade Strategies

This is where the most risk lives. Nodes hold your workloads. The goal is to replace nodes running the old kubelet version with new nodes without dropping requests.

Surge Upgrade (Managed Node Groups)

The default on managed node groups (EKS, GKE, AKS): the platform provisions new nodes, drains old nodes, and removes them.

EKS managed node group upgrade:

bash
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name workers \
  --kubernetes-version 1.32 \
  --region us-east-1

EKS uses a surge upgrade strategy by default: it provisions a batch of new nodes (sized by the node group's update config, default 1), drains and terminates an equal number of old nodes, then repeats. Configure the batch size:

hcl
# Terraform EKS module
update_config = {
  max_unavailable_percentage = 33   # Up to 33% of nodes unavailable during upgrade
}

Higher surge = faster upgrade, more temporary overcapacity cost. Lower surge = slower upgrade, less disruption to workloads.

Before node group upgrade, ensure:

  • HPA minimum replicas ≥ 2 for all critical Deployments
  • PDBs are set on all Deployments that can't tolerate a single pod being evicted (a minimal example follows)
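
A minimal PDB of that shape, applied inline (the app name and namespace are placeholders):

bash
# Allow at most one replica of the Deployment to be evicted at a time
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
  namespace: production
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
EOF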

Blue-Green Node Upgrade (Self-Managed Nodes)

For self-managed node groups or when you want maximum control:

  1. Create a new node group with the target Kubernetes version
  2. Cordon all nodes in the old node group (prevent new scheduling)
  3. Drain nodes one at a time
  4. Verify pods are running on new nodes
  5. Terminate old nodes
bash
# Cordon all old nodes
kubectl get nodes -l node-group=workers-v1.31 -o name | \
  xargs -I {} kubectl cordon {}

# Drain one node at a time
kubectl drain <node-name> \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Verify
kubectl get pods -A -o wide | grep <node-name>
# Should show no pods (DaemonSet pods are excluded by --ignore-daemonsets)

--delete-emptydir-data is required for pods using emptyDir volumes. These are ephemeral by definition — the pod knows this data is not durable. Don't omit this flag to "protect" data in emptyDir; the pod contract is that emptyDir is lost on pod termination.

Karpenter Node Replacement

With Karpenter, node upgrades work differently. You update the EC2NodeClass AMI configuration and Karpenter gradually replaces nodes via its disruption mechanism:

yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - alias: al2023@latest      # Karpenter tracks the latest AL2023 AMI
    # Or pin a specific AMI version:
    # - alias: al2023@v20261201

To force node replacement:

bash
# Remove do-not-disrupt protection so Karpenter's drift mechanism can replace the node
kubectl annotate node <node-name> karpenter.sh/do-not-disrupt-

# Or let Karpenter's drift detection handle it:
# When EC2NodeClass AMI changes, Karpenter marks affected nodes as "drifted"
# and gradually replaces them according to disruption budgets

Karpenter's drift detection makes node upgrades largely automatic — when you update the EC2NodeClass AMI family or alias, Karpenter identifies nodes running the old AMI as "drifted" and gradually replaces them, respecting disruption budgets and PDBs. This is a separate mechanism from the consolidation policy.
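
To cap how quickly drift replacement churns nodes, set a disruption budget on the NodePool (the NodePool name default here is an assumption):

bash
# Let Karpenter disrupt at most 10% of this NodePool's nodes at once
kubectl patch nodepool default --type merge \
  -p '{"spec":{"disruption":{"budgets":[{"nodes":"10%"}]}}}'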


Add-on Upgrades After Node Upgrade

After nodes are on the new Kubernetes version, upgrade in-cluster tools:

cert-manager

bash
# Check current version
kubectl get deployment -n cert-manager cert-manager -o jsonpath='{.spec.template.spec.containers[0].image}'

# Upgrade via Helm
helm repo update
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.17.x \
  --reuse-values

Kyverno

bash
helm upgrade kyverno kyverno/kyverno \
  --namespace kyverno \
  --version 3.x.x \
  --reuse-values

Check Kyverno's migration guides between major versions — ClusterPolicy API versions and webhook configurations sometimes change in major releases.

Prometheus / kube-prometheus-stack

bash
helm upgrade kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --version <target-version> \
  --reuse-values

The kube-prometheus-stack upgrade path has historically been sensitive to CRD changes. Check the chart's migration guide for any manual CRD upgrade steps.
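
Helm does not upgrade CRDs on helm upgrade, so the chart's upgrade notes typically have you apply the prometheus-operator CRDs by hand first. A sketch, with placeholder version and file (take the exact URLs from the upgrade guide):

bash
# Apply updated prometheus-operator CRDs before running helm upgrade
# vX.Y.Z and the CRD file are placeholders; the chart's upgrade notes list the full set
kubectl apply --server-side -f \
  "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/vX.Y.Z/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml"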


Rollback Plan

Before starting any upgrade, document the rollback plan. You will not have time to reason through it if something goes wrong at 2am.

Control Plane Rollback

You cannot downgrade a Kubernetes control plane version. If the control plane upgrade fails:

  • EKS: AWS Support can restore from their internal snapshots (contact AWS Support immediately)
  • GKE: Google manages control plane — contact Cloud Support
  • AKS: Azure manages control plane — contact Azure Support

This is why the backup before upgrade is non-negotiable. If you have a Velero backup and the cluster is irrecoverably broken, you can provision a new cluster on the old version and restore.
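
The restore side, once a replacement cluster exists and Velero is installed in it (the backup name follows the naming pattern from the checklist):

bash
# Restore the pre-upgrade backup into the replacement cluster
velero restore create --from-backup pre-upgrade-20260506 --wait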

Node Rollback

Node upgrades are reversible. If the new node group has issues:

  1. Stop the upgrade
  2. Keep the old node group running
  3. Scale the old node group back up
  4. Remove the cordon from old nodes
  5. Drain and terminate new nodes

For EKS managed node groups, the old AMI is still available — you can create a new node group with the previous Kubernetes version and shift pods back while you investigate.
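
A sketch of standing up that fallback node group (subnet IDs and the node role ARN are placeholders):

bash
# Create a fallback node group pinned to the previous Kubernetes version
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name workers-rollback \
  --kubernetes-version 1.31 \
  --subnets subnet-aaa111 subnet-bbb222 \
  --node-role arn:aws:iam::123456789012:role/eks-node-role \
  --region us-east-1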

Application Rollback

For application-level breakages (deprecated API usage, changed behaviour), roll back via GitOps:

bash
# Argo CD: sync to previous revision
argocd app history my-app
argocd app rollback my-app <revision-number>

# Or kubectl: if using Deployments directly
kubectl rollout undo deployment/my-app -n production

Post-Upgrade Verification

After the upgrade completes, verify before declaring success:

bash
# All nodes on new version
kubectl get nodes -o wide

# All system pods running
kubectl get pods -n kube-system

# All application pods running
kubectl get pods -A | grep -v Running | grep -v Completed

# HPA functioning
kubectl get hpa -A

# Ingress controller responding
curl -I https://my-service.example.com/healthz

# Check for any admission webhook failures in recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

Let the upgraded cluster run through at least one business day before closing the upgrade incident. Admission webhook failures, API deprecation errors in logs, and Ingress controller regressions sometimes surface only under real traffic.


Upgrade Frequency Best Practices

Running at N-1 (one version behind latest) is a reasonable steady-state for production clusters on EKS and AKS, where extended support is available. Running at N-2 (two versions behind) should be the maximum acceptable — it means one upgrade separates you from end-of-standard-support.

Build a quarterly upgrade cadence:

  • Q1: Upgrade non-production clusters to latest minor version
  • Q2: Upgrade production clusters
  • Q3: Upgrade non-production to next minor version
  • Q4: Upgrade production again

This keeps production at most two minor versions behind latest, gives you 3+ months of non-production validation before production upgrades, and prevents the "haven't upgraded in 18 months and now we're 4 versions behind" scenario that requires multi-step upgrades to close.


Frequently Asked Questions

Can I skip minor versions when upgrading?

No. Kubernetes (and all managed services) require upgrading one minor version at a time. To go from 1.28 to 1.32, you must upgrade 1.28 → 1.29 → 1.30 → 1.31 → 1.32. Each hop is a full upgrade cycle. This is why staying current (upgrading every 2-3 versions rather than waiting) significantly reduces the total upgrade burden.
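
On EKS, the hop sequence looks roughly like this; a simplified sketch, since in practice you'd upgrade node groups and add-ons and validate between hops:

bash
# Walk the control plane from 1.28 to 1.32 one minor version at a time
for v in 1.29 1.30 1.31 1.32; do
  aws eks update-cluster-version --name my-cluster --kubernetes-version "$v" --region us-east-1
  aws eks wait cluster-active --name my-cluster --region us-east-1
  # ...upgrade node groups and add-ons here before the next hop
done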

How long does the API server outage last during control plane upgrade?

On EKS, typically 30–60 seconds. On GKE regional clusters, the control plane upgrade is HA and doesn't cause API server downtime — the upgrade is rolling across the control plane replicas. AKS is similar to EKS (brief unavailability). Existing workloads continue running during the API server outage; only new API calls (deployments, scaling) fail.

What if node drain hangs?

A hung drain is almost always a PDB that prevents eviction. To identify the blocking PDB:

bash
kubectl get events -n <namespace> | grep "disruption budget"

Options:

  1. Scale up the Deployment so the PDB can be satisfied
  2. Temporarily delete the PDB (kubectl delete pdb <name> -n <namespace>) — restore it after drain completes
  3. Force-evict (only as a last resort — it bypasses the PDB): kubectl delete pod <pod> -n <namespace> --grace-period=0 --force

How do I handle StatefulSets during node drain?

StatefulSets with ordered pod management (podManagementPolicy: OrderedReady) drain correctly — the StatefulSet controller ensures orderly shutdown. Ensure your StatefulSet's PDB allows at least one pod to be disrupted; maxUnavailable: 1 (equivalent to minAvailable of replicas minus one) is the standard pattern.

Should I upgrade Kubernetes before or after upgrading Helm charts?

Kubernetes first. Once the control plane is upgraded, Helm chart upgrades can happen at any pace. The risk is upgrading Helm charts (e.g., cert-manager, Kyverno) while the old Kubernetes version is running — chart updates sometimes add features that require the new K8s version and fail on the old one.


For EKS-specific upgrade guidance — addon version management, Karpenter node replacement, Terraform module upgrades, and the EKS extended support cost model — see EKS Cluster Upgrades: Zero-Downtime Strategy for Production. For GitOps-managed upgrades and sync policies, see GitOps with Argo CD: Production Setup Guide. For Karpenter node management, see How to Install Karpenter on EKS.

Planning a major Kubernetes version upgrade? Talk to us at Coding Protocols — we've run these upgrades on production clusters handling millions of requests and can help you avoid the common failure modes.

Related Topics

Kubernetes
Upgrades
Zero Downtime
EKS
GKE
AKS
Platform Engineering
Reliability
SRE
