Karpenter IAM Deadlock: How We Broke Our EKS Cluster with a Terraform Apply
A deep dive into how a routine terraform apply with an explicit depends_on caused a cluster-wide ImagePullBackOff for all Karpenter nodes by silently destroying and recreating IAM policy attachments.

Infrastructure as Code (IaC) tools like Terraform are designed to give us predictability, safety, and repeatable deployments on AWS. But sometimes the very features built to protect us, like explicit module dependencies, can become the trigger for a catastrophic Kubernetes cluster outage. This is the incident report of how a seemingly innocent Terraform configuration turned a routine update into a cluster-wide deadlock, leaving our Karpenter-provisioned EKS nodes bricked mid-flight with every pod stuck in ImagePullBackOff.
What Happened
On 22 March 2026, we ran a routine terraform apply on our EKS cluster. Within minutes, every pod running on Karpenter-provisioned nodes entered ImagePullBackOff. The nodes were alive and healthy—but they could no longer pull container images from ECR.
The root cause: Terraform silently destroyed and recreated our Karpenter node IAM policy attachments, leaving the node role without any policies for a window of time. Karpenter nodes lost ECR access, EKS worker permissions, and CNI policy — effectively bricking them mid-flight.
Our Setup
We use Karpenter to dynamically provision EC2 nodes for workloads. The infrastructure is split into two Terraform modules:
- module.eks_cluster — EKS control plane, managed node groups, addons
- module.karpenter — Karpenter Helm release, IAM roles, SQS queue, NodePools
The Karpenter module had an explicit depends_on:
module "karpenter" {
source = "./modules/karpenter"
...
depends_on = [module.eks_cluster]
}This seemed reasonable — Karpenter needs the cluster to exist before it can be installed. But it created a hidden trap.
The Trap: How depends_on Caused the Deadlock
When you use depends_on on a module, Terraform treats the entire dependency module as a prerequisite. This has a subtle side effect with -target:
If you run terraform apply -target=module.karpenter, Terraform also pulls in all pending changes from module.eks_cluster because of the depends_on relationship. The targeted graph expands to cover the entire dependency module, and resources inside it can be marked for replacement even though you never intended to touch them.
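You can see this scope expansion before it bites by rendering the targeted plan as JSON and listing every address Terraform intends to act on. A minimal sketch, assuming jq is installed (the module names match our layout):

```sh
# Save the targeted plan, then inspect it in machine-readable form.
terraform plan -target=module.karpenter -out=karpenter.tfplan

# List every resource with a pending action. If module.eks_cluster
# addresses show up here, depends_on has widened the target's scope.
terraform show -json karpenter.tfplan \
  | jq -r '.resource_changes[]
      | select(.change.actions != ["no-op"])
      | "\(.change.actions | join("+"))\t\(.address)"'
```

Any delete+create line against an aws_iam_role_policy_attachment is exactly the replacement cascade described below.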
On 22 March, both modules had pending changes:
- module.eks_cluster had minor addon version updates
- module.karpenter had NodePool changes
We ran the apply targeting karpenter. Terraform saw the depends_on, pulled in the eks_cluster changes, and determined that some aws_iam_role_policy_attachment.node[*] resources needed to be replaced (destroy → create) due to dependency graph changes.
During the replace cycle:
- Terraform destroyed the IAM policy attachments
- Karpenter nodes immediately lost ECR, EKS, and CNI permissions
- Every pod on those nodes hit ImagePullBackOff
- Terraform recreated the attachments — but the damage was done
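It helps to remember how small these resources are. Each aws_iam_role_policy_attachment maps one-to-one to an attach/detach API call, so "replace" literally means detach, then re-attach. A sketch of the shape involved (the local variable is hypothetical):

```hcl
# One attachment resource per managed policy on the Karpenter node role.
# Destroying an instance of this resource issues iam:DetachRolePolicy;
# a replace cycle therefore opens a real window with no policy attached.
resource "aws_iam_role_policy_attachment" "node" {
  count      = length(local.node_policy_arns) # hypothetical list of policy ARNs
  role       = aws_iam_role.karpenter_node.name
  policy_arn = local.node_policy_arns[count.index]
}
```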
The Blast Radius
- All Karpenter-provisioned nodes: ImagePullBackOff on every pod
- aws-ebs-csi-driver pods stuck (EBS volumes couldn't be mounted)
- kube-proxy pods stuck (networking degraded on affected nodes)
- New pods could not be scheduled on recovered nodes until system pods were recycled
- The vpc-cni addon entered UPDATING state and got stuck (the CNI relies on the very IAM role that was detached to communicate its status back to the EKS control plane)
How We Recovered
Step 1: Manually reattach IAM policies
The node role still existed — it just had no policies. We reattached them directly via AWS CLI:
```sh
AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
```

Step 2: Recycle stuck system pods
Nodes recovered IAM access, but system pods were already stuck. Force-recycling them picked up the restored permissions:
```sh
kubectl delete pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
kubectl delete pods -n kube-system -l k8s-app=kube-proxy
```

Step 3: Wait for vpc-cni to unstick
The vpc-cni addon was stuck in UPDATING. We waited for the EKS control plane to resolve it automatically (~10 minutes). If it hadn't, the fallback would have been to trigger a manual addon update from the AWS Console or CLI.
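Checking and, if needed, forcing that update is also scriptable. A sketch, assuming the cluster is named homeneedspro-secure:

```sh
# Check the addon's current status (ACTIVE, UPDATING, DEGRADED, ...).
aws eks describe-addon \
  --cluster-name homeneedspro-secure \
  --addon-name vpc-cni \
  --query 'addon.status'

# If it stays stuck, kick off a fresh update of the addon.
aws eks update-addon \
  --cluster-name homeneedspro-secure \
  --addon-name vpc-cni \
  --resolve-conflicts OVERWRITE
```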
Root Cause Summary
| Factor | Detail |
|---|---|
| Trigger | terraform apply -target=module.karpenter with pending eks_cluster changes |
| Mechanism | depends_on = [module.eks_cluster] causes -target to pull in all eks_cluster changes |
| Failure | IAM policy attachments marked for replacement → destroy window left node role policy-less |
| Impact | All Karpenter nodes lost ECR/EKS/CNI access → ImagePullBackOff cluster-wide |
What We Fixed
Removed the explicit depends_on from the Karpenter module:
module "karpenter" {
source = "./modules/karpenter"
cluster_name = module.eks_cluster.cluster_name # implicit dependency is enough
...
# depends_on removed — implicit dependency through cluster_name is sufficient
}The implicit data dependency (cluster_name = module.eks_cluster.cluster_name) ensures Karpenter still waits for the cluster — but without the broad module-level dependency that caused -target to misbehave.
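The same pattern scales if the Karpenter module needs more than the cluster name. Every output reference is its own fine-grained edge in the graph, so ordering comes for free without depends_on. A sketch, where the extra output names are assumptions about our module interface:

```hcl
module "karpenter" {
  source = "./modules/karpenter"

  # Each reference is an implicit dependency edge: Terraform orders
  # module.karpenter after the specific resources that produce these
  # outputs, not after everything inside module.eks_cluster.
  cluster_name      = module.eks_cluster.cluster_name
  cluster_endpoint  = module.eks_cluster.cluster_endpoint  # hypothetical output
  oidc_provider_arn = module.eks_cluster.oidc_provider_arn # hypothetical output
}
```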
Prevention Checklist
Before any terraform apply on the EKS or Karpenter modules:
1. Always scan the plan for replacements

```sh
terraform plan | grep "must be replaced"
```

If any aws_iam_role_policy_attachment resources appear: stop. Do not apply. (A stricter, machine-readable version of this check is sketched after the list.)
2. Apply modules in order, separately
```sh
# Step 1 — cluster changes first
terraform apply -target=module.eks_cluster

# Step 2 — karpenter separately, after cluster is stable
terraform apply -target=module.karpenter
```

3. Never apply both modules in a single run when both have changes
4. After any Karpenter node IAM change, verify immediately
```sh
kubectl get pods -A | grep -v Running | grep -v Completed
```
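The machine-readable check referenced in item 1: grepping human-oriented plan text is brittle across Terraform versions, so a CI guard can inspect the plan JSON instead. A sketch, assuming jq:

```sh
#!/usr/bin/env sh
# Fail fast if any IAM policy attachment is slated for destroy/replace.
terraform plan -out=tfplan
replaced=$(terraform show -json tfplan \
  | jq -r '.resource_changes[]
      | select(.type == "aws_iam_role_policy_attachment")
      | select(.change.actions | contains(["delete"]))
      | .address')
if [ -n "$replaced" ]; then
  echo "IAM attachments pending destroy/replace:" >&2
  echo "$replaced" >&2
  exit 1
fi
```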
Lessons Learned
- depends_on on modules is dangerous with -target — it silently pulls in unrelated changes and can cause replacement cascades. Use implicit data dependencies instead.
- IAM detachment is instantaneous; the blast is immediate — there is no grace period when Terraform destroys an IAM attachment. Nodes lose permissions the moment the destroy completes.
- Always grep "must be replaced" before applying — a replacement of any IAM resource on a node role should be treated as a production incident risk, not a routine change.
- Recovery is fast if you know the cause — manually reattaching policies via CLI took under 2 minutes. The hard part was diagnosing the root cause.
Quick Reference: Emergency Recovery
If you ever see a sudden, cluster-wide ImagePullBackOff across all Karpenter nodes immediately after a terraform apply, check the IAM node role policies first.
```sh
# 1. Check if node role has policies
AWS_PROFILE=homeneedspro aws iam list-attached-role-policies \
  --role-name homeneedspro-secure-karpenter-node

# 2. If missing — reattach
AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryPullOnly

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy

AWS_PROFILE=homeneedspro aws iam attach-role-policy \
  --role-name homeneedspro-secure-karpenter-node \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy

# 3. Recycle stuck pods
kubectl delete pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver
kubectl delete pods -n kube-system -l k8s-app=kube-proxy
```

Conclusion
Outages like this are the tax we pay for operating complex, declarative infrastructure at scale. Terraform is incredibly powerful, but its graph resolution logic—especially around depends_on and -target—is unforgiving. By moving to implicit data dependencies and establishing a strict review protocol for must be replaced warnings, we've closed this gap.
The best systems aren't the ones that never fail; they're the ones that fail, teach us a lesson, and are engineered to ensure it never happens the same way twice.


