Karpenter IAM Deadlock: How We Broke Our EKS Cluster with a Terraform Apply
A deep dive into how a routine terraform apply with an explicit depends_on caused a cluster-wide ImagePullBackOff for all Karpenter nodes by silently destroying and recreating IAM policy attachments.

Infrastructure as Code (IaC) tools like Terraform are designed to give us predictability, safety, and repeatable deployments on AWS. But sometimes the very features built to protect us, like explicit module dependencies, can become the trigger for a catastrophic Kubernetes cluster outage. This is the incident report of how a seemingly innocent Terraform configuration turned a routine update into a cluster-wide deadlock, leaving our Karpenter-provisioned EKS nodes bricked mid-flight with every pod stuck in ImagePullBackOff.
What Happened
On 22 March 2026, we ran a routine terraform apply on our EKS cluster. Within minutes, every pod running on Karpenter-provisioned nodes entered ImagePullBackOff. The nodes were alive and healthy—but they could no longer pull container images from ECR.
The root cause: Terraform silently destroyed and recreated our Karpenter node IAM policy attachments, leaving the node role without any policies for a window of time. Karpenter nodes lost ECR access, EKS worker permissions, and CNI policy — effectively bricking them mid-flight.
Our Setup
We use Karpenter to dynamically provision EC2 nodes for workloads. The infrastructure is split into two Terraform modules:
- module.eks_cluster — EKS control plane, managed node groups, addons
- module.karpenter — Karpenter Helm release, IAM roles, SQS queue, NodePools
The Karpenter module had an explicit depends_on:
module "karpenter" {
source = "./modules/karpenter"
...
depends_on = [module.eks_cluster]
}This seemed reasonable — Karpenter needs the cluster to exist before it can be installed. But it created a hidden trap.
The Trap: How depends_on Caused the Deadlock
When you use depends_on on a module, Terraform treats the entire dependency module as a prerequisite. This has a subtle side effect with -target:
If you run terraform apply -target=module.karpenter, Terraform also pulls in all pending changes from module.eks_cluster because of the depends_on relationship. The targeted graph expands to cover the entire dependency module, and resources inside it can be marked for replacement even though you never intended to touch them.
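You can see this scope expansion before it bites by rendering the targeted plan as JSON and listing every address Terraform intends to act on. A minimal sketch, assuming jq is installed (the module names match our layout):

```sh
# Save the targeted plan, then inspect it in machine-readable form.
terraform plan -target=module.karpenter -out=karpenter.tfplan

# List every resource with a pending action. If module.eks_cluster
# addresses show up here, depends_on has widened the target's scope.
terraform show -json karpenter.tfplan \
  | jq -r '.resource_changes[]
      | select(.change.actions != ["no-op"])
      | "\(.change.actions | join("+"))\t\(.address)"'
```

Any delete+create line against an aws_iam_role_policy_attachment is exactly the replacement cascade described below.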
On 22 March, both modules had pending changes:
- module.eks_cluster had minor addon version updates
- module.karpenter had NodePool changes
We ran the apply targeting karpenter. Terraform saw the depends_on, pulled in the eks_cluster changes, and determined that some aws_iam_role_policy_attachment.node[*] resources needed to be replaced (destroy → create) due to dependency graph changes.
During the replace cycle:
- Terraform destroyed the IAM policy attachments
- Karpenter nodes immediately lost ECR, EKS, and CNI permissions
- Every pod on those nodes hit ImagePullBackOff
- Terraform recreated the attachments — but the damage was done
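It helps to remember how small these resources are. Each aws_iam_role_policy_attachment maps one-to-one to an attach/detach API call, so "replace" literally means detach, then re-attach. A sketch of the shape involved (the local variable is hypothetical):

```hcl
# One attachment resource per managed policy on the Karpenter node role.
# Destroying an instance of this resource issues iam:DetachRolePolicy;
# a replace cycle therefore opens a real window with no policy attached.
resource "aws_iam_role_policy_attachment" "node" {
  count      = length(local.node_policy_arns) # hypothetical list of policy ARNs
  role       = aws_iam_role.karpenter_node.name
  policy_arn = local.node_policy_arns[count.index]
}
```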
The Blast Radius
- All Karpenter-provisioned nodes: ImagePullBackOff on every pod
- aws-ebs-csi-driver pods stuck (EBS volumes couldn't be mounted)
- kube-proxy pods stuck (networking degraded on affected nodes)
- New pods could not be scheduled on recovered nodes until system pods were recycled
- The vpc-cni addon entered UPDATING state and got stuck (the CNI relies on the very IAM role that was detached to communicate its status back to the EKS control plane)
How We Recovered
Step 1: Manually reattach IAM policies
The node role still existed — it just had no policies. We reattached them directly via AWS CLI:
```sh
AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
```

Step 2: Recycle stuck system pods
Nodes recovered IAM access, but system pods were already stuck. Force-recycling them picked up the restored permissions:
```sh
kubectl delete pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
kubectl delete pods -n kube-system -l k8s-app=kube-proxy
```

Step 3: Wait for vpc-cni to unstick
The vpc-cni addon was stuck in UPDATING. We waited for the EKS control plane to resolve it automatically (~10 minutes). If it hadn't, the fallback would have been to trigger a manual addon update from the AWS Console or CLI.
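Checking and, if needed, forcing that update is also scriptable. A sketch, assuming the cluster is named homeneedspro-secure:

```sh
# Check the addon's current status (ACTIVE, UPDATING, DEGRADED, ...).
aws eks describe-addon \
  --cluster-name homeneedspro-secure \
  --addon-name vpc-cni \
  --query 'addon.status'

# If it stays stuck, kick off a fresh update of the addon.
aws eks update-addon \
  --cluster-name homeneedspro-secure \
  --addon-name vpc-cni \
  --resolve-conflicts OVERWRITE
```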
Root Cause Summary
| Factor | Detail |
|---|---|
| Trigger | terraform apply -target=module.karpenter with pending eks_cluster changes |
| Mechanism | depends_on = [module.eks_cluster] causes -target to pull in all eks_cluster changes |
| Failure | IAM policy attachments marked for replacement → destroy window left node role policy-less |
| Impact | All Karpenter nodes lost ECR/EKS/CNI access → ImagePullBackOff cluster-wide |
What We Fixed
Removed the explicit depends_on from the Karpenter module:
module "karpenter" {
source = "./modules/karpenter"
cluster_name = module.eks_cluster.cluster_name # implicit dependency is enough
...
# depends_on removed — implicit dependency through cluster_name is sufficient
}The implicit data dependency (cluster_name = module.eks_cluster.cluster_name) ensures Karpenter still waits for the cluster — but without the broad module-level dependency that caused -target to misbehave.
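The same pattern scales if the Karpenter module needs more than the cluster name. Every output reference is its own fine-grained edge in the graph, so ordering comes for free without depends_on. A sketch, where the extra output names are assumptions about our module interface:

```hcl
module "karpenter" {
  source = "./modules/karpenter"

  # Each reference is an implicit dependency edge: Terraform orders
  # module.karpenter after the specific resources that produce these
  # outputs, not after everything inside module.eks_cluster.
  cluster_name      = module.eks_cluster.cluster_name
  cluster_endpoint  = module.eks_cluster.cluster_endpoint  # hypothetical output
  oidc_provider_arn = module.eks_cluster.oidc_provider_arn # hypothetical output
}
```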
Prevention Checklist
Before any terraform apply on the EKS or Karpenter modules:
1. Always scan the plan for replacements

```sh
terraform plan | grep "must be replaced"
```

If any aws_iam_role_policy_attachment resources appear: stop. Do not apply. (A stricter, machine-readable version of this check is sketched after the list.)
2. Apply modules in order, separately
```sh
# Step 1 — cluster changes first
terraform apply -target=module.eks_cluster

# Step 2 — karpenter separately, after cluster is stable
terraform apply -target=module.karpenter
```

3. Never apply both modules in a single run when both have changes
4. After any Karpenter node IAM change, verify immediately
```sh
kubectl get pods -A | grep -v Running | grep -v Completed
```
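The machine-readable check referenced in item 1: grepping human-oriented plan text is brittle across Terraform versions, so a CI guard can inspect the plan JSON instead. A sketch, assuming jq:

```sh
#!/usr/bin/env sh
# Fail fast if any IAM policy attachment is slated for destroy/replace.
terraform plan -out=tfplan
replaced=$(terraform show -json tfplan \
  | jq -r '.resource_changes[]
      | select(.type == "aws_iam_role_policy_attachment")
      | select(.change.actions | contains(["delete"]))
      | .address')
if [ -n "$replaced" ]; then
  echo "IAM attachments pending destroy/replace:" >&2
  echo "$replaced" >&2
  exit 1
fi
```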
Lessons Learned
- depends_on on modules is dangerous with -target — it silently pulls in unrelated changes and can cause replacement cascades. Use implicit data dependencies instead.
- IAM detachment is instantaneous; the blast is immediate — there is no grace period when Terraform destroys an IAM attachment. Nodes lose permissions the moment the destroy completes.
- Always grep "must be replaced" before applying — a replacement of any IAM resource on a node role should be treated as a production incident risk, not a routine change.
- Recovery is fast if you know the cause — manually reattaching policies via CLI took under 2 minutes. The hard part was diagnosing the root cause.
Quick Reference: Emergency Recovery
If you ever see a sudden, cluster-wide ImagePullBackOff across all Karpenter nodes immediately after a terraform apply, check the IAM node role policies first.
```sh
# 1. Check if node role has policies
AWS_PROFILE=homeneedspro aws iam list-attached-role-policies \
  --role-name homeneedspro-secure-karpenter-node

# 2. If missing — reattach
AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

# 3. Recycle stuck pods
kubectl delete pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
kubectl delete pods -n kube-system -l k8s-app=kube-proxy
```

Conclusion
Outages like this are the tax we pay for operating complex, declarative infrastructure at scale. Terraform is incredibly powerful, but its graph resolution logic—especially around depends_on and -target—is unforgiving. By moving to implicit data dependencies and establishing a strict review protocol for must be replaced warnings, we've closed this gap.
The best systems aren't the ones that never fail; they're the ones that fail, teach us a lesson, and are engineered to ensure it never happens the same way twice.


