Seven Kubernetes Mistakes I Keep Seeing in Production

I've audited a lot of Kubernetes clusters — internal ones, client ones, clusters that were running fine until they suddenly weren't. The mistakes are almost always the same seven. Not because engineers are careless. Because Kubernetes lets you skip these things and everything still appears to work — right up until it doesn't.

This is not a beginner's list. Every item here has caused a real production incident.

No Resource Requests or Limits

Kubernetes won't stop you from deploying a pod with no resource spec. The scheduler will place it somewhere, the container will start, and everything looks fine. Then traffic ramps up, or another team deploys something noisy, and suddenly your pods are getting OOMKilled or throttled to nothing.

The scheduler uses requests to decide where to place a pod. Without them, the scheduler is flying blind — it places pods without reserving anything, and nodes end up overcommitted. Limits are what prevent a single bad pod from taking down the whole node.

The fix is straightforward but requires discipline:

yaml

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi

Start conservative, then tune. Use kubectl top pods for a point-in-time reading, and your metrics stack to see actual usage over a week. The worst thing you can do is set limits too tight and spend your time debugging CPU throttling that looks exactly like a slow application.

One nuance the docs don't emphasise enough: CPU throttling happens at the limit, not the request. A pod hitting its CPU limit doesn't get killed — it just gets starved. Memory is different: hit the limit and the OOM killer comes for you. Know which resource you're dealing with before assuming the problem is the app.

Missing or Broken Health Probes

A container that's running is not the same as a container that's healthy. Kubernetes doesn't know the difference unless you tell it.

Liveness probes answer: is this process still functioning, or is it stuck in a deadlock? If the check fails, Kubernetes restarts the container.

Readiness probes answer: is this container ready to receive traffic right now? If the check fails, the pod is removed from the Service's endpoints: it gets no traffic, and users see no errors.

Startup probes exist for slow-starting applications that would otherwise fail their liveness check before they're even ready.

The mistake I see most often isn't skipping probes entirely — it's configuring them wrong. Liveness probes that hit a heavyweight endpoint, causing cascades of restarts under load. Readiness probes with failureThreshold: 1, meaning a single blip pulls the pod from rotation. Or probes with no initialDelaySeconds, so the container gets killed before it finishes booting.

A sensible baseline for an HTTP service:

yaml

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

/healthz should do the minimum — return 200 if the process is alive. /ready can be more thorough: check database connectivity, warm caches, whatever your app needs before it can serve real traffic. Keep liveness cheap; keep readiness accurate.
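For the slow-starting case, a startup probe is usually cleaner than inflating initialDelaySeconds. A minimal sketch, assuming the same /healthz endpoint and port as the baseline; the 30 x 10s budget is an illustrative number, not a recommendation:

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 * 10s = 300s for the app to boot
```

Liveness and readiness checks are held off until the startup probe succeeds once, so the liveness settings can stay tight without killing a container that is still booting.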

Treating `kubectl logs` as a Logging Strategy

kubectl logs is a debugging tool, not a logging system. It reads from the container's stdout/stderr on that specific node, right now. When the container restarts, you get the new container's logs (--previous reaches back exactly one restart). When the node goes away, you get nothing.

In production you need logs in a place that outlives any individual pod or node. The standard approach is a DaemonSet running a log forwarder — Fluent Bit is my preference over Fluentd; it's lighter, faster, and the config is simpler — that ships everything to a centralised store.
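As a rough sketch of what the forwarder's pipeline could look like, here is a Fluent Bit configuration in its YAML format, shipping container logs to Loki. The Loki host, port, and label are assumptions for illustration, and the exact keys are worth checking against the Fluent Bit version you run:

```yaml
service:
  flush: 5
  log_level: info

pipeline:
  inputs:
    # Tail every container log file on the node this DaemonSet pod runs on.
    - name: tail
      path: /var/log/containers/*.log
      multiline.parser: docker, cri
      tag: kube.*

  filters:
    # Enrich each record with pod, namespace, and label metadata.
    - name: kubernetes
      match: kube.*
      merge_log: on        # parse JSON log lines into structured fields

  outputs:
    - name: loki
      match: kube.*
      host: loki.monitoring.svc   # hypothetical in-cluster Loki address
      port: 3100
      labels: job=fluent-bit
```

The DaemonSet itself needs /var/log mounted from the host and a ServiceAccount that can read pod metadata; the official Helm chart sets both up.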

The bigger unlock is moving beyond plain text logs. If your applications emit structured JSON, you can filter, aggregate, and alert on log fields rather than grepping strings. Combined with OpenTelemetry, you get a single pipeline for logs, metrics, and traces — all correlated by trace ID. When something breaks, you can jump from a slow trace directly to the relevant log lines rather than guessing which pod was involved.

The minimum setup that's actually useful in production:

Fluent Bit DaemonSet forwarding to CloudWatch Logs, Loki, or Elasticsearch
Structured JSON logging in your applications
Prometheus scraping cluster and application metrics
Alerts on error rate and latency, not just on pod restarts
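For the last item, an error-rate alert in Prometheus rule-file format might look like the sketch below. The metric name http_requests_total and its status label are assumptions about how your application is instrumented, and the 5% threshold is a placeholder:

```yaml
groups:
  - name: app-alerts
    rules:
      - alert: HighErrorRate
        # Fraction of requests answered with a 5xx over the last 5 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 10m            # must hold for 10 minutes before firing
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests are failing"
```

Alerting on the ratio rather than on pod restarts means the alert tracks what users actually experience.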

kubectl logs stays useful for quick checks. It just shouldn't be your only option when things go wrong at 2am.

Using the Same Manifests in Dev and Prod

I understand the appeal. One set of manifests, fewer things to maintain. But dev and prod have fundamentally different requirements and pretending otherwise creates real problems.

In dev you want:

Low resource requests so the cluster is cheap
A single replica (fast iteration, no quorum concerns)
Relaxed network policies
Debug tooling available

In prod you want:

Accurately sized resource requests (based on real profiling)
Multiple replicas with a PodDisruptionBudget
Tight network policies, RBAC scoped to least privilege
No debug tooling, no shell access
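The PodDisruptionBudget is the piece of the prod list that most often gets skipped. A minimal sketch, assuming a three-replica Deployment whose pods carry the hypothetical label app: my-app:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app
spec:
  minAvailable: 2          # node drains may evict at most one of three pods at a time
  selector:
    matchLabels:
      app: my-app
```

Without it, a routine node drain is allowed to evict every replica at once.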

Kustomize solves this cleanly. A base/ directory holds everything shared. An overlays/production/ directory patches replicas, resources, and anything environment-specific. No templating engine, no extra dependencies — it ships with kubectl.

base/
  deployment.yaml
  service.yaml
  kustomization.yaml
overlays/
  dev/
    kustomization.yaml   # patches: 1 replica, small resources
  production/
    kustomization.yaml   # patches: 3 replicas, HPA, PDB
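As an illustration, overlays/production/kustomization.yaml could look something like this. The Deployment and container name my-app and the resource figures are placeholders, not values from a real cluster:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - ../../base             # everything shared lives in the base

replicas:
  - name: my-app           # hypothetical Deployment name from base/deployment.yaml
    count: 3

patches:
  # Strategic merge patch: only the fields listed here are overridden.
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: my-app
      spec:
        template:
          spec:
            containers:
              - name: my-app
                resources:
                  requests:
                    cpu: 500m
                    memory: 512Mi
                  limits:
                    memory: 1Gi
```

kubectl apply -k overlays/production renders and applies it; kubectl kustomize overlays/production prints the result without applying, which is handy in code review.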

For sensitive config that differs by environment — database URLs, API keys — use External Secrets Operator to pull from AWS Secrets Manager or Vault. Don't put secrets in ConfigMaps. Don't commit them to git.
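A sketch of what that looks like with External Secrets Operator. The store name, secret names, and remote key are hypothetical, and the apiVersion depends on the operator release you run (recent releases also serve external-secrets.io/v1):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: my-app
spec:
  refreshInterval: 1h              # re-sync from the backing store hourly
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager      # hypothetical store, configured separately
  target:
    name: db-credentials           # the Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/my-app/database-url   # hypothetical entry in AWS Secrets Manager
```

Only this reference is committed to git; the value itself never leaves the secrets backend except as the in-cluster Secret.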

Abandoning Resources Without Cleaning Up

Kubernetes doesn't have a garbage collector for things you deployed and forgot. That LoadBalancer Service from the demo three weeks ago is still running. The PersistentVolumeClaim from the deleted StatefulSet is still sitting there, costing you storage. The Namespace you created to test something now has twelve resources in it that nobody can explain.

This compounds over time. The cluster becomes harder to reason about, costs climb quietly, and stale Services can still receive traffic.

Three habits that prevent this:

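Label everything. team:, app:, environment:, owner:, at minimum. Then kubectl get all -l team=platform -A actually tells you something (bearing in mind that get all only covers a core subset of resource types, so query ConfigMaps, Secrets, PVCs, and Ingresses explicitly when you need them).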

Use Namespaces as isolation boundaries, not just naming conventions. Give teams their own Namespace. Use ResourceQuotas to cap what they can consume. When a project ends, deleting the Namespace cleans up everything namespaced inside it.
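A per-team quota might look like the sketch below. The Namespace name and every number are placeholders to be sized against real usage:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-platform         # hypothetical team Namespace
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "10"
    services.loadbalancers: "2"    # caps the forgotten-LoadBalancer problem
```

Once a quota covers CPU or memory, pods that omit the matching requests or limits are rejected in that Namespace, which also enforces the first item on this list.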

Automate lifecycle policies with Kyverno. You can write a cleanup policy that auto-deletes resources in non-production Namespaces after 72 hours if they carry a specific label. Ideally the default runs the other way: opt in to keep things alive rather than opt in to clean them up.

yaml

# Kyverno ClusterCleanupPolicy: delete any Deployment in the 'dev' namespace
# with label ttl=72h once it is more than 72h old
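Fleshed out, that could look roughly like the following. Treat it as a sketch: it assumes a Kyverno release that serves cleanup policies under kyverno.io/v2 (older releases use v2beta1), the policy name is made up, and the cleanup controller needs RBAC permission to delete Deployments:

```yaml
apiVersion: kyverno.io/v2
kind: ClusterCleanupPolicy
metadata:
  name: cleanup-expired-dev-deployments   # hypothetical name
spec:
  schedule: "0 * * * *"                   # evaluate once an hour
  match:
    any:
      - resources:
          kinds:
            - Deployment
          namespaces:
            - dev
          selector:
            matchLabels:
              ttl: 72h
  conditions:
    all:
      # Delete only when the resource is older than 72 hours.
      - key: "{{ time_since('', '{{ target.metadata.creationTimestamp }}', '') }}"
        operator: GreaterThan
        value: 72h0m0s
```

Run it against a scratch Namespace first; a cleanup policy with a selector that is too broad deletes exactly what it matches.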

The cloud bill is usually the thing that motivates people to actually do this.

Jumping Straight to Istio

Service meshes are genuinely useful. mTLS between services, traffic shifting for canary deployments, detailed per-route telemetry — these are real features that solve real problems. But they are not the right starting point.

I've watched teams spend two weeks deploying Istio onto a cluster that was serving five internal services. They debugged sidecar injection issues, sorted out certificate rotation, figured out why PeerAuthentication broke their health checks, and ultimately had a system that was harder to operate than what they started with.

Learn Kubernetes networking in order:

ClusterIP Services — how pods find each other, how DNS works (<service>.<namespace>.svc.cluster.local)
Ingress — how external traffic enters the cluster, how TLS termination works
NetworkPolicies — how to restrict pod-to-pod communication
Gateway API — the modern replacement for Ingress, worth learning before reaching for a mesh
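For the NetworkPolicies step, the pattern worth learning first is default-deny plus explicit allows. A minimal sketch, with the Namespace, labels, and port all invented for illustration:

```yaml
# 1. Deny all ingress to every pod in the Namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: shop
spec:
  podSelector: {}          # empty selector = all pods in the Namespace
  policyTypes:
    - Ingress
---
# 2. Allow only frontend pods to reach the api pods on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: shop
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

These only take effect if the cluster's CNI plugin enforces NetworkPolicy; on one that doesn't, they are accepted and silently ignored.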

Only reach for a service mesh when you have a concrete requirement it solves (mutual TLS enforcement, traffic splitting, or per-route observability) at a scale that makes the operational cost worth it. For most clusters, cert-manager for TLS and a good Ingress controller get you 90% of the way there with a fraction of the complexity.

Broad RBAC and Running as Root

Kubernetes is not secure by default. It's secure by configuration. Leave the defaults in place and you'll have pods running as root, containers that can write to their own filesystems, and ServiceAccounts with more permissions than they need.

The specific things that should be non-negotiable in production:

Never use cluster-admin for application workloads. Create a Role scoped to the Namespace and the exact resources the application needs. If a pod only reads ConfigMaps, its ServiceAccount should only be able to read ConfigMaps — in that Namespace.
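The ConfigMap-reader case, sketched out. All names and the Namespace are hypothetical:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: config-reader
  namespace: my-app
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: my-app
rules:
  - apiGroups: [""]                  # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]  # read-only
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: config-reader-binding
  namespace: my-app
subjects:
  - kind: ServiceAccount
    name: config-reader
    namespace: my-app
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

To check the result, kubectl auth can-i list secrets -n my-app --as=system:serviceaccount:my-app:config-reader should answer no.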

Don't run containers as root. Add a securityContext:

yaml

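# Container-level: set under spec.containers[].securityContext
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]          # also required by the restricted Pod Security profile
  seccompProfile:
    type: RuntimeDefault   # also required by the restricted Pod Security profile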

Pin image tags. image: nginx:latest means you don't know what's deployed. A routine node restart can pull a different image version than what you tested. Use digests (nginx@sha256:abc123...) or at minimum exact version tags.

Enforce these with Pod Security Admission (built into Kubernetes, stable since 1.25) at the restricted profile for production Namespaces. PSA runs at admission time, so non-compliant pods are rejected before they're scheduled, not after they're already running.
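PSA is switched on per Namespace with labels. A sketch for a Namespace assumed to be called production:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that violate the restricted profile.
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    # Also surface violations as client warnings and audit annotations.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

On an existing Namespace, starting with only the warn and audit labels shows what would break before enforce is added.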

Kyverno is useful here too: you can write policies that enforce specific labels, reject latest tags cluster-wide, or require specific securityContext fields — and you can run them in audit mode first to understand the blast radius before enforcing.
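Rejecting latest tags, for example, might look like this sketch, modelled on the pattern used in Kyverno's public policy library. It starts in audit mode; newer Kyverno releases prefer setting the failure action per rule, so check the field placement against your version:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Audit   # report only; switch to Enforce once the reports are clean
  background: true                 # also scan resources that already exist
  rules:
    - name: require-pinned-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must not use the 'latest' tag."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

This pattern catches an explicit :latest only. An image with no tag at all also resolves to latest and needs a second rule requiring a tag.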

The Pattern Behind All Seven

Every one of these mistakes has the same root cause: Kubernetes allows you to skip the guardrails, and things still appear to work in the short term.

Missing requests and limits don't bite until the node is under pressure. Missing probes don't matter until a container hangs. Stale resources don't show up on a dashboard unless you look for them. Broad RBAC doesn't cause an incident until it does.

The discipline is building these things in from the start — not retrofitting them after the first outage. A cluster that's set up right is genuinely easier to operate than one that isn't. The configuration is more verbose upfront, but you're paying that cost once rather than paying a larger cost in incidents repeatedly.

If you're auditing an existing cluster and want to know where to start, resource requests and RBAC are almost always the quickest wins. Everything else can be phased in.
