Kubernetes Debugging: A Systematic Guide to Diagnosing Pod and Node Failures

CrashLoopBackOff, Pending pods, OOMKilled, ImagePullBackOff — Kubernetes failure modes are consistent and diagnosable with the right sequence of commands. Here's the decision tree and command reference for every common failure pattern.

Coding Protocols Team · Platform Engineering · May 9, 2026
Kubernetes debugging is pattern recognition. Once you've seen enough CrashLoopBackOff pods, you know to start with logs from the previous container run (-p flag) before the current one. Once you've seen enough Pending pods, you know to check events first, then node resources, then affinity rules.

This post is a systematic guide to the most common Kubernetes failure modes — not a list of commands, but a decision tree for each failure pattern so you can move from symptom to root cause efficiently.


The Debugging Stack

Before specific failure modes, understand the diagnostic hierarchy:

kubectl get <resource>          → What's the current state?
kubectl describe <resource>     → What events explain the state?
kubectl logs <pod>              → What did the application say?
kubectl exec <pod> -- command   → What can the application see from inside?

Always work top-down through this stack. describe gives you the events that led to the current state — this is almost always more useful than the state alone. Logs tell you what the application itself saw. exec lets you verify networking, DNS, and filesystem from the pod's perspective.
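
Concretely, a first pass down the stack on a hypothetical crashing pod might look like this (the pod, namespace, and service names are placeholders, and the exec step assumes the image actually ships the tool being run):

bash
# Hypothetical names; a first pass down the diagnostic stack
kubectl get pod checkout-5f7d9 -n shop                    # state: CrashLoopBackOff
kubectl describe pod checkout-5f7d9 -n shop | tail -20    # events explaining the state
kubectl logs checkout-5f7d9 -n shop -p                    # what the crashed run printed
kubectl exec checkout-5f7d9 -n shop -- nslookup postgres  # the pod's view of DNS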


CrashLoopBackOff

Symptom: Pod shows CrashLoopBackOff in the STATUS column. The container starts, crashes, Kubernetes restarts it with exponential backoff (10s, 20s, 40s, 80s...).

Diagnostic sequence:

bash
# Step 1: Get the exit code of the last crash
kubectl describe pod <pod-name> -n <namespace>
# Look at: "Last State: Terminated" section — shows ExitCode and Reason

# Step 2: Get logs from the PREVIOUS container run (the one that crashed)
kubectl logs <pod-name> -n <namespace> -p
# -p = previous container. If you omit -p, you get logs from the restarted container
# which may show nothing if it crashes immediately

# Step 3: Get logs from the current run to see the startup sequence
kubectl logs <pod-name> -n <namespace> --tail=100

Exit code interpretation:

| Exit Code | Meaning | Common Cause |
|-----------|---------|--------------|
| 0 | Clean exit | Application finished normally — not a crash, but wrong for a long-running daemon |
| 1 | Generic error | Application bug or misconfiguration |
| 137 | SIGKILL (128+9) | OOMKilled or manually killed |
| 139 | SIGSEGV (128+11) | Segfault in the application or a native library |
| 143 | SIGTERM (128+15) | Graceful shutdown — expected during rolling updates, node drains, and pod deletion; only a problem if unexpected |
| 255 | Generic failure | Startup script failure, missing file |

Exit code 137 (OOMKilled): Check memory limits and actual memory usage:

bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "OOMKilled\|Limits\|Requests"
kubectl top pod <pod-name> -n <namespace>

If the pod is being OOMKilled, either increase the memory limit or investigate a memory leak. For JVM applications, set -XX:MaxRAMPercentage=75.0 so the heap tops out at ~75% of the container's memory limit (the JVM defaults to 25%), leaving headroom for off-heap memory.
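
One way to apply this without rebuilding the image is JAVA_TOOL_OPTIONS, which the JVM reads at startup. A minimal sketch, assuming a hypothetical Deployment named api:

bash
# Hypothetical Deployment name; JAVA_TOOL_OPTIONS is picked up by the JVM at startup
kubectl set env deployment/api -n <namespace> \
  JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"

# After the rollout, verify the variable landed
kubectl exec <new-pod> -n <namespace> -- sh -c 'echo $JAVA_TOOL_OPTIONS'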

Exit code 1 with empty logs: The container starts but produces no log output before crashing. Common causes: missing environment variable, bad command-line argument, missing config file:

bash
# Get the actual command the container is running
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Args\|Command"

# Check if required environment variables are set
kubectl exec <pod-name> -n <namespace> -- env | sort

# Check if required config files exist
kubectl exec <pod-name> -n <namespace> -- ls /etc/config/

# If the pod crashes too fast to exec into:
# Override the command to keep the container alive for inspection
kubectl debug <pod-name> -n <namespace> --copy-to=debug-pod \
  --set-image=app=<image> -- sleep infinity

Pending Pod

Symptom: Pod stuck in Pending state. Not scheduled to any node.

Diagnostic sequence:

bash
# Step 1: Check events — this almost always tells you why
kubectl describe pod <pod-name> -n <namespace>
# Look at the Events section at the bottom

# Step 2: If events are sparse, check scheduler events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i "failed\|error\|warning"

Common Pending causes:

Insufficient Resources

Events:
  Warning  FailedScheduling  0/5 nodes are available: 
    2 Insufficient cpu, 3 node(s) had taint that the pod didn't tolerate.
bash
# Check node capacity vs allocatable
kubectl describe nodes | grep -A 6 "Capacity:\|Allocatable:"

# Check current allocation across all nodes
kubectl get nodes -o custom-columns=\
"NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory"

# See what's consuming resources on nodes
kubectl describe nodes | grep -A 10 "Allocated resources:"

If nodes are fully allocated but your pod can't schedule, either:

  • Add more nodes (or raise the Karpenter/Cluster Autoscaler minimum capacity)
  • Reduce the pod's resource requests (a sketch follows below)
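
Reducing requests is a one-liner if the workload genuinely over-requests. A sketch with a hypothetical Deployment name and values; check kubectl top pod against the current requests before shrinking them:

bash
# Hypothetical name and values; verify actual usage with `kubectl top pod` first
kubectl set resources deployment/api -n <namespace> \
  --requests=cpu=250m,memory=256Mi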

No Matching Nodes (Affinity/Selector)

Events:
  Warning  FailedScheduling  0/5 nodes are available:
    5 node(s) didn't match Pod's node affinity/selector
bash
# What does the pod need?
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Node-Selectors\|Affinity"

# What labels do nodes have?
kubectl get nodes --show-labels

# Does any node match?
kubectl get nodes -l <key>=<value>

Taints Not Tolerated

bash
# What taints are on nodes?
kubectl describe nodes | grep -A 3 "Taints:"

# Does the pod tolerate the taint?
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Tolerations:"

PVC Not Bound

A pod requiring a PVC that doesn't exist or isn't bound stays Pending:

bash
kubectl get pvc -n <namespace>
# If STATUS is Pending, not Bound:
kubectl describe pvc <pvc-name> -n <namespace>
# Events will show: ProvisioningFailed, WaitForFirstConsumer, etc.

WaitForFirstConsumer is normal for StorageClasses with volumeBindingMode: WaitForFirstConsumer — the PVC binds when the pod is assigned to a node. If the pod is also Pending, there's a chicken-and-egg: the pod can't schedule because the PVC isn't bound, and the PVC won't bind until the pod is scheduled. This resolves if the pod can be scheduled — if not, look at the scheduling failure reason first.
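
To see which binding mode a StorageClass uses, query it directly:

bash
# Show each StorageClass's binding mode and provisioner
kubectl get storageclass -o custom-columns=\
"NAME:.metadata.name,MODE:.volumeBindingMode,PROVISIONER:.provisioner"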


ImagePullBackOff / ErrImagePull

Symptom: Pod shows ImagePullBackOff or ErrImagePull. Container image can't be pulled.

bash
kubectl describe pod <pod-name> -n <namespace>
# Events will show: Failed to pull image "..." — gives the specific error

Common causes:

Image Doesn't Exist

Failed to pull image "myrepo/myapp:latest": 
  Error response from daemon: manifest for myrepo/myapp:latest not found
bash
# Verify the image name and tag
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Common mistakes: wrong tag, private repo without auth, tag not pushed yet

Private Registry — Missing Credentials

Failed to pull image "gcr.io/my-project/myapp:v1.0": 
  Error response from daemon: pull access denied
bash
# Check if imagePullSecrets is set on the pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Image Pull Secrets"

# Does the secret exist in the namespace?
kubectl get secret -n <namespace> | grep regcred

# Check the service account's imagePullSecrets
kubectl describe serviceaccount default -n <namespace>

Create a registry credential secret if missing:

bash
kubectl create secret docker-registry regcred \
  --docker-server=<registry-host> \
  --docker-username=<username> \
  --docker-password=<token> \
  --namespace=<namespace>

# Patch the default service account to use it automatically
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'

Rate Limiting (Docker Hub)

Docker Hub imposes pull limits on unauthenticated requests. If you're pulling public images from Docker Hub without authentication, you may hit rate limits:

Error response from daemon: toomanyrequests: 
  You have reached your pull rate limit.

Solution: authenticate to Docker Hub even for public images, or mirror images to your own registry (ECR, GCR, ACR).


OOMKilled

Symptom: Pod repeatedly restarts with exit code 137, kubectl describe shows OOMKilled: true.

This is distinct from the memory limit being set too low — sometimes it's a genuine memory leak or memory growth under load.

bash
# Confirm OOMKill
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State:"
# Shows: Reason: OOMKilled, ExitCode: 137

# Current memory usage
kubectl top pod <pod-name> -n <namespace>

# Memory limit
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

Diagnosis:

  1. Memory usage near the limit at steady state → Limit is too low. Increase it (a sketch follows this list).
  2. Memory grows over time → Memory leak. Profile the application.
  3. Memory spikes under load → Set the limit higher than the P99 spike level, or use HPA to spread load across more replicas before the per-pod memory reaches the limit.
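
For case 1, raising the limit is a one-line change. A sketch with a hypothetical Deployment name and sizes; base the new values on what kubectl top actually reports:

bash
# Hypothetical name and sizes; leave headroom above observed steady-state usage
kubectl set resources deployment/api -n <namespace> \
  --limits=memory=1Gi --requests=memory=512Mi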

For JVM applications running on Java 11+:

bash
# Confirm the JVM sees the container memory limit (not host memory)
kubectl exec <pod-name> -n <namespace> -- java -XX:+PrintFlagsFinal -version | grep MaxHeap
# With -XX:MaxRAMPercentage=75.0 set, this should be ~75% of the container
# memory limit (the JVM default is 25%), never derived from host memory

If the JVM is using more memory than the limit (heap + off-heap + metaspace), the container gets OOMKilled even when heap usage is within the -Xmx setting. Set -XX:MaxRAMPercentage=75.0 and ensure the container memory limit is significantly larger than the desired heap size (leave room for off-heap and GC overhead).


Node NotReady

Symptom: One or more nodes show NotReady in kubectl get nodes.

bash
# Get the reason
kubectl describe node <node-name>
# Look at: "Conditions:" section — MemoryPressure, DiskPressure, PIDPressure, Ready

# Common condition messages:
# Ready: False — kubelet stopped reporting
# MemoryPressure: True — node is low on memory
# DiskPressure: True — node is low on disk space

Kubelet Not Running

bash
# If you have access to the node (EC2 SSM, node shell)
systemctl status kubelet
journalctl -u kubelet -n 50

Disk Pressure

bash
# On the node (via SSM or SSH):
df -h
# High usage in /var/lib/docker or /var/lib/containerd is common

# Clear unused images
crictl rmi --prune

# Or from kubectl if the node is still partially functional:
kubectl describe node <node-name> | grep -A 5 "DiskPressure"

Set eviction thresholds in kubelet config to evict pods before the node hits disk pressure — once a node has DiskPressure, it stops accepting new pods.
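
The thresholds live in the kubelet's KubeletConfiguration. A hedged sketch of the relevant keys, assuming the common /var/lib/kubelet/config.yaml path (it varies by distro); merge them into the existing file rather than appending:

bash
# On the node; the config path varies by distro/AMI
grep -A 5 "evictionHard" /var/lib/kubelet/config.yaml

# Relevant KubeletConfiguration keys (merge into the existing YAML):
#   evictionHard:
#     memory.available: "500Mi"
#     nodefs.available: "10%"
#     imagefs.available: "15%"
# Then restart the kubelet to pick up the change:
systemctl restart kubelet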

Memory Pressure

Node memory pressure triggers pod eviction (BestEffort pods first, then Burstable, then Guaranteed). If nodes are consistently under memory pressure, either right-size workloads or add memory to nodes.

bash
kubectl describe node <node-name> | grep -A 3 "MemoryPressure\|memory"

Service Not Reachable

Symptom: Pods can't reach a Service by its DNS name, or external traffic can't reach the Service.

bash
# Step 1: Does the Service exist?
kubectl get svc -n <namespace>

# Step 2: Does it have endpoints?
kubectl get endpoints <service-name> -n <namespace>
# If ENDPOINTS shows "<none>", the selector doesn't match any pods

# Step 3: Do the pods match the Service selector?
kubectl describe svc <service-name> -n <namespace> | grep -A 5 "Selector:"
kubectl get pods -n <namespace> --show-labels | grep <key>=<value>

No Endpoints

The Service selector doesn't match pod labels:

bash
# What the Service selects:
kubectl get svc my-service -n production -o jsonpath='{.spec.selector}'
# Output: {"app":"api","version":"v2"}

# What pods have:
kubectl get pods -n production -l app=api --show-labels
# If version label is missing or different, no match

Fix: either update the pod labels to match the Service selector, or update the Service selector to match the pods.
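
Either side can be patched in place. A sketch reusing the hypothetical names from above; note that relabeling the pod template triggers a rollout, and a Deployment's own spec.selector is immutable:

bash
# Option A: drop the stale selector key so the Service matches the pods
# (with --type=merge, a null value deletes the key)
kubectl patch svc my-service -n production \
  --type=merge -p '{"spec":{"selector":{"version":null}}}'

# Option B: add the missing label to the pod template (rolls the pods)
kubectl patch deployment my-api -n production \
  -p '{"spec":{"template":{"metadata":{"labels":{"version":"v2"}}}}}'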

DNS Not Resolving

bash
# Test DNS from a pod in the same namespace
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- \
  nslookup <service-name>

# Test cross-namespace DNS (requires FQDN)
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- \
  nslookup <service-name>.<target-namespace>.svc.cluster.local

# If DNS fails entirely, check CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
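
If CoreDNS itself is unhealthy, a rollout restart often clears stale state. The Deployment is typically named coredns, but confirm in your cluster:

bash
# Deployment name is typically "coredns"; confirm with: kubectl get deploy -n kube-system
kubectl rollout restart deployment/coredns -n kube-system
kubectl rollout status deployment/coredns -n kube-system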

NetworkPolicy Blocking Traffic

If NetworkPolicy default-deny is in place, the Service may exist and have endpoints, but traffic is blocked:

bash
# Check if there's a default-deny policy in the target namespace
kubectl get networkpolicy -n <namespace>

# Check events for NetworkPolicy denials (if Cilium with Hubble):
hubble observe --namespace <namespace> --verdict DROPPED

# Or use netshoot to test connectivity:
kubectl run -it --rm debug --image=nicolaka/netshoot -n <source-namespace> -- \
  curl http://<service-name>.<target-namespace>.svc.cluster.local:<port>
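
If a default-deny policy is the culprit, the fix is an explicit allow for the traffic path. A minimal sketch with hypothetical labels and port; a podSelector under from only matches pods in the same namespace, so cross-namespace traffic additionally needs a namespaceSelector:

bash
# Hypothetical labels/port: allow app=frontend to reach app=api on TCP 8080
kubectl apply -n <target-namespace> -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF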

Deployment Stuck (Rollout Not Progressing)

Symptom: kubectl rollout status deployment/<name> hangs or reports unavailable replicas.

bash
# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>

# Check replica state
kubectl describe deployment <name> -n <namespace>

# Check ReplicaSet events
kubectl describe rs -n <namespace> | grep -A 10 "Events:"

# Get the specific pod that's failing
kubectl get pods -n <namespace> -l app=<label> | grep -v Running
kubectl describe pod <failing-pod> -n <namespace>

Common causes of stuck rollouts:

New pods failing to start → Follow CrashLoopBackOff or Pending diagnosis above.

Readiness probe failing → New pods start but don't become Ready:

bash
kubectl describe pod <new-pod> -n <namespace> | grep -A 10 "Readiness:"
# Shows: probe failures, timeout counts

# What does the readiness endpoint return?
kubectl exec <new-pod> -n <namespace> -- curl -sv http://localhost:8080/health

PDB blocking rollout → A PDB prevents the old ReplicaSet from scaling down:

bash
kubectl describe pdb -n <namespace>
# DisruptionsAllowed: 0 means nothing can be evicted
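
During an incident you can loosen the PDB temporarily to unblock the rollout. A hedged sketch with a hypothetical PDB name; record the original value and restore it afterwards:

bash
# Hypothetical PDB name; note the current spec before changing it
kubectl get pdb api-pdb -n <namespace> -o yaml | grep -A 2 "spec:"
kubectl patch pdb api-pdb -n <namespace> \
  --type=merge -p '{"spec":{"minAvailable":0}}'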

Insufficient resources for the new pod → Same as Pending diagnosis — the new pod can't schedule because there's no capacity with the new replica running alongside the old.


Ephemeral Debug Containers

kubectl debug (stable since Kubernetes 1.25) provides a clean way to debug running pods without modifying them:

bash
# Attach a debug container to a running pod
kubectl debug <pod-name> -n <namespace> -it \
  --image=nicolaka/netshoot \
  --target=<container-name>

# Shares the pod's network namespace and, via --target, the container's process
# namespace (its filesystem is reachable through /proc/<pid>/root), without
# requiring the production image to carry debug tools
bash
# Copy a pod with a different entrypoint (for crash-at-startup debugging)
kubectl debug <pod-name> -n <namespace> -it \
  --copy-to=debug-pod \
  --set-image=app=<image> \
  -- /bin/sh

For pods using a distroless or scratch image that has no shell, the --target form (an ephemeral container sharing the target's namespaces) is the only way to inspect the container at runtime without modifying the image.


Quick Reference: Failure → First Command

| Symptom | First Command |
|---------|---------------|
| CrashLoopBackOff | kubectl logs <pod> -p (previous run logs) |
| Pending | kubectl describe pod <pod> → Events section |
| ImagePullBackOff | kubectl describe pod <pod> → Events section |
| OOMKilled | kubectl describe pod <pod> → Last State section |
| Node NotReady | kubectl describe node <node> → Conditions section |
| Service not reachable | kubectl get endpoints <svc> → is it empty? |
| Deployment stuck | kubectl rollout status deployment/<name> |
| DNS not resolving | kubectl run debug --image=nicolaka/netshoot -- nslookup <name> |
| NetworkPolicy blocked | hubble observe --verdict DROPPED or test with netshoot |

Frequently Asked Questions

How do I debug a pod that crashes immediately?

The container exits before you can exec into it. Use kubectl debug:

bash
kubectl debug <pod-name> -n <namespace> --copy-to=debug-pod \
  --set-image=app=<same-image> \
  -- sleep infinity

This creates a copy of the pod with sleep infinity as the entrypoint, keeping the container alive for inspection. Then exec into it and manually run the original entrypoint to reproduce the crash.
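
From there, exec into the copy and replay the startup by hand. The entrypoint path and flags below are placeholders; recover the real ones from kubectl describe pod (Command and Args):

bash
kubectl exec -it debug-pod -n <namespace> -- sh

# Inside the container: placeholder entrypoint; use your image's real Command/Args
/app/entrypoint.sh --config /etc/config/app.yaml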

How do I check what's happening with a DaemonSet pod on a specific node?

bash
# List pods on the node, then pick the DaemonSet's pod
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name> | grep <daemonset-name>
kubectl logs -n kube-system <daemonset-pod-name>

What does ContainerStatusUnknown mean?

The kubelet lost contact with the container runtime for that pod. It usually resolves itself as the runtime recovers. If persistent, check the containerd/docker daemon on the node: systemctl status containerd.

How do I find which pod is consuming the most CPU or memory?

bash
# Across the entire cluster
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20

# Within a namespace
kubectl top pods -n production --sort-by=cpu

Is there a way to get logs from a pod that no longer exists?

Standard kubectl logs only works for running pods or recently terminated pods (while the pod object still exists). For historical logs, you need a log aggregation system (Loki, CloudWatch, Elasticsearch). This is why centralised logging is essential for production — kubectl logs is not a log management solution.
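
If Loki is the aggregator, logcli can retrieve logs for pods that no longer exist. A sketch assuming a reachable Loki endpoint and hypothetical label values:

bash
# Hypothetical address and labels; LOKI_ADDR must point at your Loki gateway
export LOKI_ADDR=http://loki.monitoring:3100
logcli query --since=24h '{namespace="production", pod="api-7d4b9c-xk2lp"}'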


For systematic observability to complement debugging, see Kubernetes Observability: Prometheus, Grafana, and OpenTelemetry. For probe configuration that prevents crashes from being silent, see Kubernetes Probes: Liveness, Readiness, and Startup. For advanced debugging techniques covering network forensics, node inspection, audit logs, and ephemeral containers, see Kubernetes Debugging: Systematic Troubleshooting for Production Incidents.

Dealing with a production incident in Kubernetes? Talk to us at Coding Protocols — we help platform teams build debugging runbooks and observability stacks that make incidents shorter.
