Kubernetes Debugging: Systematic Troubleshooting for Production Incidents
Production Kubernetes incidents follow repeatable patterns: CrashLoopBackOff from configuration errors, OOMKilled from under-resourced containers, ImagePullBackOff from registry auth, Pending pods from scheduling constraints, and service connectivity failures from DNS or NetworkPolicy. This is the diagnostic workflow that gets from symptom to root cause without guessing.

Kubernetes hides failure detail behind status fields. CrashLoopBackOff could mean anything: the application panics on startup, the config file has a syntax error, a required secret doesn't exist, or the container can't bind to the port. The status tells you a container exited; it doesn't tell you why.
Systematic debugging moves from the symptom (pod status) down through layers (pod events → container logs → resource constraints → configuration → network) until you find the root cause. This is the workflow.
Layer 1: Start with Pod Status
# Get all non-running pods in the namespace
kubectl get pods -n payments | grep -v Running | grep -v Completed
# See events and conditions for a specific pod
kubectl describe pod payments-api-7d8f9b-xyz -n payments

The Modern Way: kubectl events
The newer kubectl events command supersedes the legacy kubectl get events for incident triage. It prints a chronological view of cluster events, making it easier to identify the first failure in a cascading incident:
# Get events for a specific pod in a human-readable timeline
kubectl events --for pod/payments-api-7d8f9b-xyz -n payments

# Watch for warning events across the entire namespace
kubectl events -w -n payments --types=Warning

# Filter by a specific reason (e.g., scheduling failures) — the legacy
# command still handles field selectors
kubectl get events -n payments --field-selector reason=FailedScheduling

kubectl events is better suited than kubectl describe for reconstructing the sequence of events over time during a production outage.
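To pull the "first failure" out of a long timeline, you can filter on the TYPE column. A minimal sketch, using a saved sample in place of live kubectl events output (the column layout and sample rows are assumptions for illustration):

```shell
# Saved sample standing in for `kubectl events` output; real output
# prints oldest events first, which is what makes this filter useful.
sample=$(cat <<'EOF'
LAST SEEN   TYPE      REASON      OBJECT                        MESSAGE
5m          Normal    Scheduled   Pod/payments-api-7d8f9b-xyz   Successfully assigned
4m          Warning   Failed      Pod/payments-api-7d8f9b-xyz   Failed to pull image
3m          Warning   BackOff     Pod/payments-api-7d8f9b-xyz   Back-off pulling image
EOF
)

# Print the first Warning row, i.e. the earliest failure in the timeline
echo "$sample" | awk '$2 == "Warning" { print; exit }'
```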
The describe output has everything you need to start: events (what Kubernetes tried to do), status conditions (what's wrong with the pod), and resource state.
Key events to look for:
| Event | Meaning |
|---|---|
| Failed to pull image | Registry auth, network, or image doesn't exist |
| OOMKilled | Container exceeded memory limit |
| Back-off restarting failed container | Container exiting repeatedly (CrashLoopBackOff) |
| 0/3 nodes are available: insufficient cpu | Scheduling failure — no node has capacity |
| unbound immediate PersistentVolumeClaims | PVC pending — StorageClass issue or no available PV |
| Readiness probe failed | Application not ready on the expected port/path |
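The table above can be folded into a small triage helper for scripting. A sketch (the match patterns and cause strings are illustrative, not an official mapping):

```shell
# Map a pod event message to a likely root-cause bucket (illustrative).
classify_event() {
  case "$1" in
    *"Failed to pull image"*)                 echo "registry: auth, network, or missing image" ;;
    *OOMKilled*)                              echo "memory: container exceeded its limit" ;;
    *"Back-off restarting failed container"*) echo "crashloop: check previous logs and exit code" ;;
    *"nodes are available"*)                  echo "scheduling: no node fits (capacity/taints/affinity)" ;;
    *"PersistentVolumeClaims"*)               echo "storage: PVC unbound" ;;
    *"Readiness probe failed"*)               echo "probe: app not ready on expected port/path" ;;
    *)                                        echo "unclassified: read kubectl describe" ;;
  esac
}

classify_event "0/3 nodes are available: insufficient cpu"
# → scheduling: no node fits (capacity/taints/affinity)
```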
CrashLoopBackOff
The container started and exited. The exit code tells you why:
# Get the exit code from the last termination
kubectl get pod payments-api-7d8f9b-xyz -n payments -o json | \
  jq '.status.containerStatuses[0].lastState.terminated | {exitCode: .exitCode, reason: .reason, message: .message}'

# Common exit codes:
# 0 — successful exit (shouldn't crash loop — check your restart policy or entrypoint)
# 1 — application error (check logs)
# 137 = 128 + 9 — killed by SIGKILL (OOMKilled or manual kill)
# 143 = 128 + 15 — killed by SIGTERM (preStop or shutdown)
# 1 with "exec format error" — wrong architecture (amd64 image on arm64 node)

# Get the logs from the previous (crashed) container
kubectl logs payments-api-7d8f9b-xyz -n payments --previous
# If the container starts and immediately crashes, the log might be empty
# Try an init container to check the filesystem/config

Common CrashLoopBackOff causes:
# 1. Config/env var missing — look for "key not found" or panic at startup
kubectl logs -n payments payments-api-7d8f9b-xyz --previous | tail -20

# 2. Secret doesn't exist
kubectl get secret payments-db-credentials -n payments
# Error: not found → ExternalSecret not synced, or wrong name

# 3. Port conflict — container can't bind 8080
kubectl logs -n payments payments-api-7d8f9b-xyz --previous | grep "bind\|listen\|address in use"

# 4. Init container failed (blocks main container)
kubectl describe pod payments-api-7d8f9b-xyz -n payments | grep -A 10 "Init Containers:"
kubectl logs payments-api-7d8f9b-xyz -n payments -c init-migrate

OOMKilled
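Exit code 137 is the OOMKill signature: exit codes of 128 plus a signal number mean the container was terminated by that signal. A pure-bash decoder sketch (covers only the common conventions):

```shell
# Decode a container exit code (sketch). Codes >= 128 mean
# "terminated by signal (code - 128)": 137 → SIGKILL (OOM or manual kill),
# 143 → SIGTERM (graceful shutdown).
decode_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -ge 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error (exit $code)"
  fi
}

decode_exit_code 137   # → killed by signal 9
decode_exit_code 143   # → killed by signal 15
```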
# Check the OOMKill reason
kubectl get pod payments-api-7d8f9b-xyz -n payments -o json | \
  jq '.status.containerStatuses[0].lastState.terminated.reason'
# "OOMKilled"

# Check the current limits
kubectl get pod payments-api-7d8f9b-xyz -n payments -o json | \
  jq '.spec.containers[0].resources'

# Check live memory usage (requires metrics-server)
kubectl top pod payments-api-7d8f9b-xyz -n payments
kubectl top pods -n payments --sort-by=memory

# Get VPA recommendation (if VPA is installed in Off mode)
kubectl get verticalpodautoscaler payments-api -n payments -o json | \
  jq '.status.recommendation.containerRecommendations[0].target'

Fix: Increase the memory limit, or find the memory leak. Check whether the application has unbounded in-memory caching or leaks memory under load.
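kubectl top prints usage in Mi while limits are often set in Gi, so scripted comparisons need a unit conversion. A sketch handling only the common binary suffixes (the full Kubernetes quantity grammar also allows decimal suffixes like K/M/G):

```shell
# Convert a Kubernetes memory quantity to bytes (sketch: Ki/Mi/Gi and
# plain byte counts only).
to_bytes() {
  q=$1
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *)   echo "$q" ;;
  esac
}

to_bytes 512Mi   # → 536870912
to_bytes 1Gi     # → 1073741824
```

With both sides in bytes, a usage/limit ratio check (e.g., alert above 90%) becomes a one-line comparison.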
Pending Pods
Pending means Kubernetes can't schedule the pod. The describe output shows the reason:
kubectl describe pod payments-api-7d8f9b-xyz -n payments | grep -A 20 Events:
# "0/3 nodes are available: 3 Insufficient memory" → increase node size or scale up nodes
# "0/3 nodes are available: 3 node(s) had untolerated taint" → add toleration to pod
# "0/3 nodes are available: 3 node(s) didn't match Pod's node affinity" → relax affinity rules
# "0/3 nodes are available: 1 Insufficient cpu, 2 node(s) didn't match node selector" → mixed reasons
# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# For Karpenter: check why it's not provisioning
kubectl get events -A --field-selector reason=ProvisioningFailed

# Check if ResourceQuota is blocking
kubectl describe resourcequota -n payments
# If "Used" approaches "Hard", the namespace is at quota

ImagePullBackOff
kubectl describe pod payments-api-7d8f9b-xyz -n payments | grep -A 5 "Failed to pull"
# Error types:
# "not found" → image tag doesn't exist in registry
# "unauthorized" → imagePullSecret missing or invalid
# "net/http: TLS handshake timeout" → network issue, VPC DNS/NAT1# Check imagePullSecrets
2kubectl get pod payments-api-7d8f9b-xyz -n payments -o json | \
3 jq '.spec.imagePullSecrets'
4
5# Verify the secret exists and is valid
6kubectl get secret ecr-registry -n payments
7kubectl get secret ecr-registry -n payments -o json | \
8 jq '.data[".dockerconfigjson"]' | base64 -d | jq .
9
10# For ECR: check if token is expired (ECR tokens expire after 12 hours)
11# Use the ECR credential helper or an automated renewal solutionService Connectivity Failures
# Can't reach a service? Start by verifying DNS resolution
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

# Inside the debug pod:
nslookup payments-api.payments.svc.cluster.local
# NXDOMAIN → service doesn't exist, namespace wrong, or DNS broken
# Correct address → DNS works, check connectivity next

curl -v http://payments-api.payments.svc.cluster.local:8080/health

# Check the service endpoints are populated
kubectl get endpoints payments-api -n payments
# Empty endpoints → no pods matching the service selector, or pods not Ready

# Verify pod labels match the service selector
kubectl get pod -n payments --show-labels
kubectl get service payments-api -n payments -o json | jq '.spec.selector'

# NetworkPolicy blocking traffic? Check with Hubble (if using Cilium)
hubble observe --namespace payments --verdict DROPPED

# Or test without policy by running a debug pod in the same namespace
kubectl run --rm -it debug-pod \
  --namespace payments \
  --image=nicolaka/netshoot \
  -- curl http://payments-api:8080/health

Ephemeral Containers for Live Debugging
Ephemeral containers let you attach a debug container to a running pod without modifying it — useful when the application container is a distroless/scratch image with no shell:
# Add an ephemeral container to a running pod
kubectl debug -it payments-api-7d8f9b-xyz \
  --image=nicolaka/netshoot \
  --target=payments-api \
  --namespace=payments \
  -- bash

# Inside the ephemeral container:
# - You share the pod's network namespace (can call localhost:8080)
# - You share the process namespace (with shareProcessNamespace: true)
# - The target container's filesystem is at /proc/<PID>/root/

# Check what the app has open
ls -la /proc/1/fd/

# Check the process environment
cat /proc/1/environ | tr '\0' '\n'

# Trace syscalls (network calls, file operations)
strace -p 1 -e trace=network,file

Debugging Node Issues
# Launch a privileged pod on a specific node (--profile=sysadmin requires Kubernetes 1.30+)
kubectl debug node/ip-10-0-1-100.ec2.internal \
  --image=ubuntu \
  --profile=sysadmin \
  -- bash

# Inside, the host filesystem is mounted at /host
nsenter --target 1 --mount --uts --ipc --net --pid -- bash
# Now you're in the node's namespaces with full access

# Check kubelet logs
journalctl -u kubelet -n 100

# Check the container runtime
crictl ps
crictl logs <container-id>

Network Debugging Toolkit
# nicolaka/netshoot — the Swiss Army knife for network debugging
# Includes: curl, dig, nslookup, tcpdump, iperf3, traceroute, nmap, ss, netstat

# TCP connectivity test with timeout
timeout 5 bash -c "cat < /dev/null > /dev/tcp/payments-db.payments.svc/5432"
echo $?  # 0 = connected, 1 = refused, 124 = timeout

# Check DNS resolution timing
dig payments-api.payments.svc.cluster.local @10.96.0.10 +stats | grep "Query time"

# Packet capture (the debug container shares the pod's network namespace)
kubectl debug -it payments-api-7d8f9b-xyz \
  --image=nicolaka/netshoot \
  --target=payments-api \
  --namespace=payments \
  -- tcpdump -i any -n port 8080 -w /tmp/capture.pcap

# Copy the capture off the pod
kubectl cp payments/payments-api-7d8f9b-xyz:/tmp/capture.pcap ./capture.pcap

API Server and etcd Audit Events
# Check Kubernetes audit logs for permission errors (EKS stores them in CloudWatch)
# Filter for 403 responses to find permission-denied errors
aws logs filter-log-events \
  --log-group-name /aws/eks/my-cluster/cluster \
  --filter-pattern '{ $.responseStatus.code = 403 }' \
  --start-time $(date -d "1 hour ago" +%s000) \
  --output json | jq '.events[].message | fromjson | {verb: .verb, resource: .objectRef.resource, user: .user.username, reason: .responseStatus.reason}'

Frequently Asked Questions
How do I debug a pod that crashes before I can exec into it?
Two approaches: (1) Override the entrypoint to sleep instead of running the app, so you can exec in and investigate manually: kubectl run debug-payments --image=your-image --command -- sleep 3600. (2) Use an init container that copies debug tools or validates configuration before the main container starts. For distroless images, ephemeral containers are the right tool — kubectl debug attaches to the running pod even if it crashes immediately.
Why is kubectl exec failing with "Unable to use a TTY"?
The container doesn't have a shell (/bin/sh or /bin/bash). For distroless or scratch images, there's no shell to exec into. Use ephemeral containers (kubectl debug) with an image that has a shell. This is the correct approach for production-hardened container images — the lack of a shell is a security feature.
For a complete debugging decision tree and symptom-to-command reference covering every common failure mode (CrashLoopBackOff, Pending, OOMKilled, ImagePullBackOff, node NotReady), see Kubernetes Debugging: A Systematic Guide to Diagnosing Pod and Node Failures. For Hubble's network flow visibility that makes NetworkPolicy debugging deterministic, see Cilium Advanced Networking and Observability. For setting up Prometheus and Grafana dashboards that catch problems before pods start crashing, see Prometheus Operator Deep Dive.
Debugging a production incident that you can't reproduce locally? Talk to us at Coding Protocols — we help platform teams build the observability and debugging toolchain that makes incidents shorter and post-mortems more actionable.


