Kubernetes Debugging: A Systematic Guide to Diagnosing Pod and Node Failures

CrashLoopBackOff, Pending pods, OOMKilled, ImagePullBackOff — Kubernetes failure modes are consistent and diagnosable with the right sequence of commands. Here's the decision tree and command reference for every common failure pattern.

Coding Protocols Team · Platform Engineering · May 9, 2026
Kubernetes debugging is pattern recognition. Once you've seen enough CrashLoopBackOff pods, you know to start with logs from the previous container run (-p flag) before the current one. Once you've seen enough Pending pods, you know to check events first, then node resources, then affinity rules.

This post is a systematic guide to the most common Kubernetes failure modes — not a list of commands, but a decision tree for each failure pattern so you can move from symptom to root cause efficiently.


The Debugging Stack

Before specific failure modes, understand the diagnostic hierarchy:

kubectl get <resource>          → What's the current state?
kubectl describe <resource>     → What events explain the state?
kubectl logs <pod>              → What did the application say?
kubectl exec <pod> -- command   → What can the application see from inside?

Always work top-down through this stack. describe gives you the events that led to the current state — this is almost always more useful than the state alone. Logs tell you what the application itself saw. exec lets you verify networking, DNS, and filesystem from the pod's perspective.
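
Concretely, a first pass down the stack on a hypothetical crashing pod might look like this (the pod, namespace, and service names are placeholders, and the exec step assumes the image actually ships the tool being run):

bash
# Hypothetical names; a first pass down the diagnostic stack
kubectl get pod checkout-5f7d9 -n shop                    # state: CrashLoopBackOff
kubectl describe pod checkout-5f7d9 -n shop | tail -20    # events explaining the state
kubectl logs checkout-5f7d9 -n shop -p                    # what the crashed run printed
kubectl exec checkout-5f7d9 -n shop -- nslookup postgres  # the pod's view of DNS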


CrashLoopBackOff

Symptom: Pod shows CrashLoopBackOff in the STATUS column. The container starts, crashes, Kubernetes restarts it with exponential backoff (10s, 20s, 40s, 80s...).

Diagnostic sequence:

bash
# Step 1: Get the exit code of the last crash
kubectl describe pod <pod-name> -n <namespace>
# Look at: "Last State: Terminated" section — shows ExitCode and Reason

# Step 2: Get logs from the PREVIOUS container run (the one that crashed)
kubectl logs <pod-name> -n <namespace> -p
# -p = previous container. If you omit -p, you get logs from the restarted container
# which may show nothing if it crashes immediately

# Step 3: Get logs from the current run to see the startup sequence
kubectl logs <pod-name> -n <namespace> --tail=100

Exit code interpretation:

| Exit Code | Meaning | Common Cause |
|-----------|---------|--------------|
| 0 | Clean exit | Application finished normally — not a crash, but wrong for a long-running daemon |
| 1 | Generic error | Application bug or misconfiguration |
| 137 | SIGKILL (128+9) | OOMKilled or manually killed |
| 139 | SIGSEGV (128+11) | Segfault in the application or a native library |
| 143 | SIGTERM (128+15) | Graceful shutdown — expected during rolling updates, node drains, and pod deletion; only a problem if unexpected |
| 255 | Generic failure | Startup script failure, missing file |

Exit code 137 (OOMKilled): Check memory limits and actual memory usage:

bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "OOMKilled\|Limits\|Requests"
kubectl top pod <pod-name> -n <namespace>

If the pod is being OOMKilled, either increase the memory limit or investigate a memory leak. For JVM applications, set -XX:MaxRAMPercentage=75.0 so the heap tops out at ~75% of the container's memory limit (the JVM defaults to 25%), leaving headroom for off-heap memory.
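
One way to apply this without rebuilding the image is JAVA_TOOL_OPTIONS, which the JVM reads at startup. A minimal sketch, assuming a hypothetical Deployment named api:

bash
# Hypothetical Deployment name; JAVA_TOOL_OPTIONS is picked up by the JVM at startup
kubectl set env deployment/api -n <namespace> \
  JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"

# After the rollout, verify the variable landed
kubectl exec <new-pod> -n <namespace> -- sh -c 'echo $JAVA_TOOL_OPTIONS'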

Exit code 1 with empty logs: The container starts but produces no log output before crashing. Common causes: missing environment variable, bad command-line argument, missing config file:

bash
# Get the actual command the container is running
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 "Args\|Command"

# Check if required environment variables are set
kubectl exec <pod-name> -n <namespace> -- env | sort

# Check if required config files exist
kubectl exec <pod-name> -n <namespace> -- ls /etc/config/

# If the pod crashes too fast to exec into:
# Override the command to keep the container alive for inspection
kubectl debug <pod-name> -n <namespace> --copy-to=debug-pod \
  --set-image=app=<image> -- sleep infinity

Pending Pod

Symptom: Pod stuck in Pending state. Not scheduled to any node.

Diagnostic sequence:

bash
# Step 1: Check events — this almost always tells you why
kubectl describe pod <pod-name> -n <namespace>
# Look at the Events section at the bottom

# Step 2: If events are sparse, check scheduler events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | grep -i "failed\|error\|warning"

Common Pending causes:

Insufficient Resources

Events:
  Warning  FailedScheduling  0/5 nodes are available: 
    2 Insufficient cpu, 3 node(s) had taint that the pod didn't tolerate.
bash
# Check node capacity vs allocatable
kubectl describe nodes | grep -A 6 "Capacity:\|Allocatable:"

# Check current allocation across all nodes
kubectl get nodes -o custom-columns=\
"NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory"

# See what's consuming resources on nodes
kubectl describe nodes | grep -A 10 "Allocated resources:"

If nodes are fully allocated but your pod can't schedule, either:

  • Add more nodes (or raise the Karpenter/Cluster Autoscaler minimum capacity)
  • Reduce the pod's resource requests (a sketch follows below)
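
Reducing requests is a one-liner if the workload genuinely over-requests. A sketch with a hypothetical Deployment name and values; check kubectl top pod against the current requests before shrinking them:

bash
# Hypothetical name and values; verify actual usage with `kubectl top pod` first
kubectl set resources deployment/api -n <namespace> \
  --requests=cpu=250m,memory=256Mi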

No Matching Nodes (Affinity/Selector)

Events:
  Warning  FailedScheduling  0/5 nodes are available:
    5 node(s) didn't match Pod's node affinity/selector
bash
# What does the pod need?
kubectl describe pod <pod-name> -n <namespace> | grep -A 20 "Node-Selectors\|Affinity"

# What labels do nodes have?
kubectl get nodes --show-labels

# Does any node match?
kubectl get nodes -l <key>=<value>

Taints Not Tolerated

bash
# What taints are on nodes?
kubectl describe nodes | grep -A 3 "Taints:"

# Does the pod tolerate the taint?
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Tolerations:"

PVC Not Bound

A pod requiring a PVC that doesn't exist or isn't bound stays Pending:

bash
kubectl get pvc -n <namespace>
# If STATUS is Pending, not Bound:
kubectl describe pvc <pvc-name> -n <namespace>
# Events will show: ProvisioningFailed, WaitForFirstConsumer, etc.

WaitForFirstConsumer is normal for StorageClasses with volumeBindingMode: WaitForFirstConsumer — the PVC binds when the pod is assigned to a node. If the pod is also Pending, there's a chicken-and-egg: the pod can't schedule because the PVC isn't bound, and the PVC won't bind until the pod is scheduled. This resolves if the pod can be scheduled — if not, look at the scheduling failure reason first.
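
To see which binding mode a StorageClass uses, query it directly:

bash
# Show each StorageClass's binding mode and provisioner
kubectl get storageclass -o custom-columns=\
"NAME:.metadata.name,MODE:.volumeBindingMode,PROVISIONER:.provisioner"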


ImagePullBackOff / ErrImagePull

Symptom: Pod shows ImagePullBackOff or ErrImagePull. Container image can't be pulled.

bash
kubectl describe pod <pod-name> -n <namespace>
# Events will show: Failed to pull image "..." — gives the specific error

Common causes:

Image Doesn't Exist

Failed to pull image "myrepo/myapp:latest": 
  Error response from daemon: manifest for myrepo/myapp:latest not found
bash
# Verify the image name and tag
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'

# Common mistakes: wrong tag, private repo without auth, tag not pushed yet

Private Registry — Missing Credentials

Failed to pull image "gcr.io/my-project/myapp:v1.0": 
  Error response from daemon: pull access denied
bash
# Check if imagePullSecrets is set on the pod
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Image Pull Secrets"

# Does the secret exist in the namespace?
kubectl get secret -n <namespace> | grep regcred

# Check the service account's imagePullSecrets
kubectl describe serviceaccount default -n <namespace>

Create a registry credential secret if missing:

bash
kubectl create secret docker-registry regcred \
  --docker-server=<registry-host> \
  --docker-username=<username> \
  --docker-password=<token> \
  --namespace=<namespace>

# Patch the default service account to use it automatically
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'

Rate Limiting (Docker Hub)

Docker Hub imposes pull limits on unauthenticated requests. If you're pulling public images from Docker Hub without authentication, you may hit rate limits:

Error response from daemon: toomanyrequests: 
  You have reached your pull rate limit.

Solution: authenticate to Docker Hub even for public images, or mirror images to your own registry (ECR, GCR, ACR).


OOMKilled

Symptom: Pod repeatedly restarts with exit code 137, kubectl describe shows OOMKilled: true.

This is distinct from the memory limit being set too low — sometimes it's a genuine memory leak or memory growth under load.

bash
# Confirm OOMKill
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State:"
# Shows: Reason: OOMKilled, ExitCode: 137

# Current memory usage
kubectl top pod <pod-name> -n <namespace>

# Memory limit
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.limits.memory}'

Diagnosis:

  1. Memory usage near the limit at steady state → Limit is too low. Increase it (a sketch follows this list).
  2. Memory grows over time → Memory leak. Profile the application.
  3. Memory spikes under load → Set the limit higher than the P99 spike level, or use HPA to spread load across more replicas before the per-pod memory reaches the limit.
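
For case 1, raising the limit is a one-line change. A sketch with a hypothetical Deployment name and sizes; base the new values on what kubectl top actually reports:

bash
# Hypothetical name and sizes; leave headroom above observed steady-state usage
kubectl set resources deployment/api -n <namespace> \
  --limits=memory=1Gi --requests=memory=512Mi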

For JVM applications running on Java 11+:

bash
# Confirm the JVM sees the container memory limit (not host memory)
kubectl exec <pod-name> -n <namespace> -- java -XX:+PrintFlagsFinal -version | grep MaxHeap
# With -XX:MaxRAMPercentage=75.0 set, this should be ~75% of the container
# memory limit (the JVM default is 25%), never derived from host memory

If the JVM is using more memory than the limit (heap + off-heap + metaspace), the container gets OOMKilled even when heap usage is within the -Xmx setting. Set -XX:MaxRAMPercentage=75.0 and ensure the container memory limit is significantly larger than the desired heap size (leave room for off-heap and GC overhead).


Node NotReady

Symptom: One or more nodes show NotReady in kubectl get nodes.

bash
# Get the reason
kubectl describe node <node-name>
# Look at: "Conditions:" section — MemoryPressure, DiskPressure, PIDPressure, Ready

# Common condition messages:
# Ready: False — kubelet stopped reporting
# MemoryPressure: True — node is low on memory
# DiskPressure: True — node is low on disk space

Kubelet Not Running

bash
# If you have access to the node (EC2 SSM, node shell)
systemctl status kubelet
journalctl -u kubelet -n 50

Disk Pressure

bash
# On the node (via SSM or SSH):
df -h
# High usage in /var/lib/docker or /var/lib/containerd is common

# Clear unused images
crictl rmi --prune

# Or from kubectl if the node is still partially functional:
kubectl describe node <node-name> | grep -A 5 "DiskPressure"

Set eviction thresholds in kubelet config to evict pods before the node hits disk pressure — once a node has DiskPressure, it stops accepting new pods.
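
The thresholds live in the kubelet's KubeletConfiguration. A hedged sketch of the relevant keys, assuming the common /var/lib/kubelet/config.yaml path (it varies by distro); merge them into the existing file rather than appending:

bash
# On the node; the config path varies by distro/AMI
grep -A 5 "evictionHard" /var/lib/kubelet/config.yaml

# Relevant KubeletConfiguration keys (merge into the existing YAML):
#   evictionHard:
#     memory.available: "500Mi"
#     nodefs.available: "10%"
#     imagefs.available: "15%"
# Then restart the kubelet to pick up the change:
systemctl restart kubelet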

Memory Pressure

Node memory pressure triggers pod eviction (BestEffort pods first, then Burstable, then Guaranteed). If nodes are consistently under memory pressure, either right-size workloads or add memory to nodes.

bash
kubectl describe node <node-name> | grep -A 3 "MemoryPressure\|memory"

Service Not Reachable

Symptom: Pods can't reach a Service by its DNS name, or external traffic can't reach the Service.

bash
# Step 1: Does the Service exist?
kubectl get svc -n <namespace>

# Step 2: Does it have endpoints?
kubectl get endpoints <service-name> -n <namespace>
# If ENDPOINTS shows "<none>", the selector doesn't match any pods

# Step 3: Do the pods match the Service selector?
kubectl describe svc <service-name> -n <namespace> | grep -A 5 "Selector:"
kubectl get pods -n <namespace> --show-labels | grep <key>=<value>

No Endpoints

The Service selector doesn't match pod labels:

bash
# What the Service selects:
kubectl get svc my-service -n production -o jsonpath='{.spec.selector}'
# Output: {"app":"api","version":"v2"}

# What pods have:
kubectl get pods -n production -l app=api --show-labels
# If version label is missing or different, no match

Fix: either update the pod labels to match the Service selector, or update the Service selector to match the pods.
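
Either side can be patched in place. A sketch reusing the hypothetical names from above; note that relabeling the pod template triggers a rollout, and a Deployment's own spec.selector is immutable:

bash
# Option A: drop the stale selector key so the Service matches the pods
# (with --type=merge, a null value deletes the key)
kubectl patch svc my-service -n production \
  --type=merge -p '{"spec":{"selector":{"version":null}}}'

# Option B: add the missing label to the pod template (rolls the pods)
kubectl patch deployment my-api -n production \
  -p '{"spec":{"template":{"metadata":{"labels":{"version":"v2"}}}}}'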

DNS Not Resolving

bash
# Test DNS from a pod in the same namespace
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- \
  nslookup <service-name>

# Test cross-namespace DNS (requires FQDN)
kubectl run -it --rm debug --image=nicolaka/netshoot -n <namespace> -- \
  nslookup <service-name>.<target-namespace>.svc.cluster.local

# If DNS fails entirely, check CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
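
If CoreDNS itself is unhealthy, a rollout restart often clears stale state. The Deployment is typically named coredns, but confirm in your cluster:

bash
# Deployment name is typically "coredns"; confirm with: kubectl get deploy -n kube-system
kubectl rollout restart deployment/coredns -n kube-system
kubectl rollout status deployment/coredns -n kube-system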

NetworkPolicy Blocking Traffic

If NetworkPolicy default-deny is in place, the Service may exist and have endpoints, but traffic is blocked:

bash
# Check if there's a default-deny policy in the target namespace
kubectl get networkpolicy -n <namespace>

# Check events for NetworkPolicy denials (if Cilium with Hubble):
hubble observe --namespace <namespace> --verdict DROPPED

# Or use netshoot to test connectivity:
kubectl run -it --rm debug --image=nicolaka/netshoot -n <source-namespace> -- \
  curl http://<service-name>.<target-namespace>.svc.cluster.local:<port>
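
If a default-deny policy is the culprit, the fix is an explicit allow for the traffic path. A minimal sketch with hypothetical labels and port; a podSelector under from only matches pods in the same namespace, so cross-namespace traffic additionally needs a namespaceSelector:

bash
# Hypothetical labels/port: allow app=frontend to reach app=api on TCP 8080
kubectl apply -n <target-namespace> -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
EOF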

Deployment Stuck (Rollout Not Progressing)

Symptom: kubectl rollout status deployment/<name> hangs or reports unavailable replicas.

bash
# Check rollout status
kubectl rollout status deployment/<name> -n <namespace>

# Check replica state
kubectl describe deployment <name> -n <namespace>

# Check ReplicaSet events
kubectl describe rs -n <namespace> | grep -A 10 "Events:"

# Get the specific pod that's failing
kubectl get pods -n <namespace> -l app=<label> | grep -v Running
kubectl describe pod <failing-pod> -n <namespace>

Common causes of stuck rollouts:

New pods failing to start → Follow CrashLoopBackOff or Pending diagnosis above.

Readiness probe failing → New pods start but don't become Ready:

bash
kubectl describe pod <new-pod> -n <namespace> | grep -A 10 "Readiness:"
# Shows: probe failures, timeout counts

# What does the readiness endpoint return?
kubectl exec <new-pod> -n <namespace> -- curl -sv http://localhost:8080/health

PDB blocking rollout → A PDB prevents the old ReplicaSet from scaling down:

bash
kubectl describe pdb -n <namespace>
# DisruptionsAllowed: 0 means nothing can be evicted
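
During an incident you can loosen the PDB temporarily to unblock the rollout. A hedged sketch with a hypothetical PDB name; record the original value and restore it afterwards:

bash
# Hypothetical PDB name; note the current spec before changing it
kubectl get pdb api-pdb -n <namespace> -o yaml | grep -A 2 "spec:"
kubectl patch pdb api-pdb -n <namespace> \
  --type=merge -p '{"spec":{"minAvailable":0}}'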

Insufficient resources for the new pod → Same as Pending diagnosis — the new pod can't schedule because there's no capacity with the new replica running alongside the old.


Ephemeral Debug Containers

kubectl debug (stable since Kubernetes 1.25) provides a clean way to debug running pods without modifying them:

bash
# Attach a debug container to a running pod
kubectl debug <pod-name> -n <namespace> -it \
  --image=nicolaka/netshoot \
  --target=<container-name>

# Shares the pod's network namespace and, via --target, the container's process
# namespace (its filesystem is reachable through /proc/<pid>/root), without
# requiring the production image to carry debug tools
bash
# Copy a pod with a different entrypoint (for crash-at-startup debugging)
kubectl debug <pod-name> -n <namespace> -it \
  --copy-to=debug-pod \
  --set-image=app=<image> \
  -- /bin/sh

For pods using a distroless or scratch image that has no shell, the --target form (an ephemeral container sharing the target's namespaces) is the only way to inspect the container at runtime without modifying the image.


Quick Reference: Failure → First Command

| Symptom | First Command |
|---------|---------------|
| CrashLoopBackOff | kubectl logs <pod> -p (previous run logs) |
| Pending | kubectl describe pod <pod> → Events section |
| ImagePullBackOff | kubectl describe pod <pod> → Events section |
| OOMKilled | kubectl describe pod <pod> → Last State section |
| Node NotReady | kubectl describe node <node> → Conditions section |
| Service not reachable | kubectl get endpoints <svc> → is it empty? |
| Deployment stuck | kubectl rollout status deployment/<name> |
| DNS not resolving | kubectl run debug --image=nicolaka/netshoot -- nslookup <name> |
| NetworkPolicy blocked | hubble observe --verdict DROPPED or test with netshoot |

Frequently Asked Questions

How do I debug a pod that crashes immediately?

The container exits before you can exec into it. Use kubectl debug:

bash
kubectl debug <pod-name> -n <namespace> --copy-to=debug-pod \
  --set-image=app=<same-image> \
  -- sleep infinity

This creates a copy of the pod with sleep infinity as the entrypoint, keeping the container alive for inspection. Then exec into it and manually run the original entrypoint to reproduce the crash.
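
From there, exec into the copy and replay the startup by hand. The entrypoint path and flags below are placeholders; recover the real ones from kubectl describe pod (Command and Args):

bash
kubectl exec -it debug-pod -n <namespace> -- sh

# Inside the container: placeholder entrypoint; use your image's real Command/Args
/app/entrypoint.sh --config /etc/config/app.yaml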

How do I check what's happening with a DaemonSet pod on a specific node?

bash
# List pods on the node, then pick the DaemonSet's pod
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name> | grep <daemonset-name>
kubectl logs -n kube-system <daemonset-pod-name>

What does ContainerStatusUnknown mean?

The kubelet lost contact with the container runtime for that pod. It usually resolves itself as the runtime recovers. If persistent, check the containerd/docker daemon on the node: systemctl status containerd.

How do I find which pod is consuming the most CPU or memory?

bash
# Across the entire cluster
kubectl top pods -A --sort-by=cpu | head -20
kubectl top pods -A --sort-by=memory | head -20

# Within a namespace
kubectl top pods -n production --sort-by=cpu

Is there a way to get logs from a pod that no longer exists?

Standard kubectl logs only works for running pods or recently terminated pods (while the pod object still exists). For historical logs, you need a log aggregation system (Loki, CloudWatch, Elasticsearch). This is why centralised logging is essential for production — kubectl logs is not a log management solution.
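
If Loki is the aggregator, logcli can retrieve logs for pods that no longer exist. A sketch assuming a reachable Loki endpoint and hypothetical label values:

bash
# Hypothetical address and labels; LOKI_ADDR must point at your Loki gateway
export LOKI_ADDR=http://loki.monitoring:3100
logcli query --since=24h '{namespace="production", pod="api-7d4b9c-xk2lp"}'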


For systematic observability to complement debugging, see Kubernetes Observability: Prometheus, Grafana, and OpenTelemetry. For probe configuration that prevents crashes from being silent, see Kubernetes Probes: Liveness, Readiness, and Startup. For advanced debugging techniques covering network forensics, node inspection, audit logs, and ephemeral containers, see Kubernetes Debugging: Systematic Troubleshooting for Production Incidents.

Dealing with a production incident in Kubernetes? Talk to us at Coding Protocols — we help platform teams build debugging runbooks and observability stacks that make incidents shorter.
