Kubernetes
13 min read · May 9, 2026

Kubernetes Liveness, Readiness, and Startup Probes: Getting Them Right

Wrong probe configuration is one of the most common causes of Kubernetes production incidents — cascading restarts, traffic to initialising pods, deployments that never complete. Here's how each probe works, what to check for, and the specific misconfiguration patterns that cause the most damage.

Coding Protocols Team
Platform Engineering

Kubernetes uses three probe types to determine whether a container is alive, ready to serve traffic, and past its startup phase. They look similar — each is a health check that runs against your container — but they have completely different consequences when they fail, and conflating them is a reliable way to create production incidents.

This guide covers what each probe does, the correct configuration patterns, and the specific mistakes that cause the most damage.


The Three Probes

Liveness Probe

Question: Is this container still alive, or should Kubernetes restart it?

A liveness probe failure triggers a container restart. It's designed for detecting deadlocks, infinite loops, and other conditions where the process is running but irreparably stuck and won't recover without a restart.

Key principle: a liveness probe should only fail if the container is in a state it cannot self-recover from. If the failure is transient (a brief high-load spike, a slow dependency), the liveness probe should not fail — the restart will make things worse, not better.

Readiness Probe

Question: Is this container ready to receive traffic?

A readiness probe failure removes the pod from the Service endpoints without restarting it. Traffic stops going to the pod; the pod stays running. When the readiness probe passes again, traffic resumes.

Readiness probes handle transient unavailability: a pod warming up its cache, temporarily overloaded, waiting for a dependency to recover. The pod is alive and will recover — it just shouldn't take new traffic right now.

Startup Probe

Question: Has this container finished its startup sequence?

A startup probe delays liveness and readiness probe evaluation until it passes. It's designed for slow-starting applications — JVM services, applications that run database migrations on startup, containers that preload large datasets.

Without a startup probe, a slow-starting application often fails liveness checks during startup, gets restarted by Kubernetes, fails liveness again on the restart, and enters a CrashLoopBackOff that looks like a crash when it's actually a health check misconfiguration.


Probe Mechanics

Every probe type (HTTP GET, TCP socket, exec command, gRPC) shares the same timing parameters:

yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10    # Wait before first probe
  periodSeconds: 10          # How often to probe
  timeoutSeconds: 5          # Timeout per probe
  failureThreshold: 3        # Consecutive failures before action
  successThreshold: 1        # Consecutive successes to restore (liveness must be 1)

initialDelaySeconds: time after container start before the first probe. Superseded by startup probes for slow-starting apps — prefer a startup probe over a large initialDelaySeconds on liveness/readiness.

failureThreshold × periodSeconds: the time before action is taken. With periodSeconds: 10 and failureThreshold: 3, Kubernetes waits 30 seconds of consecutive failures before restarting (liveness) or removing from endpoints (readiness). Size this to your acceptable recovery window.

timeoutSeconds: if your health endpoint takes longer than this, the probe counts as failed. Set this generously enough that a slow probe under load doesn't trigger false restarts, but tight enough to catch a genuinely hung endpoint.


Probe Types

HTTP GET

The most common for HTTP services:

yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      - name: Accept
        value: application/json

The probe passes if the response status is 200–399. Use a dedicated /healthz endpoint that checks only what the liveness probe should check — not a full application health check that queries dependencies.

TCP Socket

For non-HTTP services (databases, message brokers, or gRPC services on clusters without native gRPC probe support):

yaml
livenessProbe:
  tcpSocket:
    port: 5432

Passes if the TCP connection succeeds. Does not verify application state — only that the port is listening.

Exec Command

Runs a command inside the container; passes if exit code is 0:

yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - redis-cli ping | grep -q PONG

Use sparingly — exec probes spawn a subprocess for every probe invocation, which adds overhead at scale. Prefer HTTP or TCP probes where possible.

gRPC

Native gRPC health check support (GA in Kubernetes 1.27):

yaml
livenessProbe:
  grpc:
    port: 50051
    service: ""    # Empty string checks the overall server health

Requires the application to implement the gRPC health checking protocol.


Correct Probe Design

Liveness: Check Only Internal State

A liveness probe should check only what's inside the container. It should not check external dependencies (databases, downstream services, message queues).

Why: if your database is down, all pods fail their liveness probes simultaneously. Kubernetes restarts all pods. The restarts don't fix the database. You now have a database outage plus a pod restart storm. The readiness probe handles dependency unavailability; the liveness probe handles internal stuck states.

go
// Good liveness endpoint — checks only internal state
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    // Check internal state: is the goroutine pool alive? is the event loop responsive?
    if !app.IsInternallyHealthy() {
        http.Error(w, "unhealthy", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

// Bad liveness endpoint — checks external dependency
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {  // Database check in liveness = restart storm risk
        http.Error(w, "db unreachable", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

Readiness: Check Dependencies

The readiness probe is the right place to check dependency health:

go
http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
    if err := db.Ping(); err != nil {
        http.Error(w, "db unreachable", http.StatusServiceUnavailable)
        return
    }
    if !cache.IsWarmed() {
        http.Error(w, "cache warming", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

When the database is unavailable, pods fail readiness, leave the endpoint pool, and stop taking new requests. No restarts. When the database recovers, pods pass readiness and rejoin. This is the correct failure behaviour.

Startup: Give Slow Apps Room to Start

yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30     # 30 × 10s = 5 minutes maximum startup time
  periodSeconds: 10

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3      # Tight once running — 3 failures = restart

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  failureThreshold: 3

The startup probe runs first. If it hasn't passed within failureThreshold × periodSeconds (5 minutes here), the container is restarted. Once it passes, liveness and readiness probes take over. The liveness probe can now be aggressive (failureThreshold: 3) without causing false restarts during startup.

This is better than initialDelaySeconds: 300 on the liveness probe because the startup probe adapts to actual startup time rather than waiting the full 5 minutes every time.


The Misconfiguration Patterns That Break Production

Pattern 1: Liveness Checks External Dependencies

The restart storm scenario described above. This is the single most common liveness probe misconfiguration.

Symptom: database outage → all pods restart → pods can't reconnect during restart → pods restart again → CrashLoopBackOff across the fleet.

Fix: remove external dependency checks from liveness probes. Move them to readiness.

Pattern 2: No Startup Probe for Slow-Starting Apps

Symptom: JVM service, app with migrations, or model-loading ML service starts CrashLoopBackOff every deployment. Logs show the liveness check failing at the same time every restart — before the app finishes initialising.

Fix: add a startup probe with failureThreshold large enough to cover the worst-case startup time.
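A sketch of such a probe, assuming a worst-case startup of roughly four minutes (values illustrative; size them to your own measurements):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 36    # 36 x 10s = 6 minutes: worst-case startup plus headroom
```

A generous failureThreshold here costs nothing when startup is fast, because liveness takes over as soon as the startup probe passes.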

Pattern 3: Readiness Probe Never Passes During Rolling Deployment

Symptom: rolling deployment hangs. kubectl rollout status shows Waiting for deployment "api" rollout to finish: 1 out of 5 new replicas have been updated. New pods stay 0/1 Running indefinitely.

Cause: the readiness probe checks something that isn't satisfied in the new pod — a missing environment variable, a new dependency that isn't reachable, a migration that hasn't completed, or the new code version has a bug that causes the readiness endpoint to 500.

Diagnosis:

bash
kubectl describe pod <new-pod-name> -n production
# Look for: Readiness probe failed: ...

kubectl logs <new-pod-name> -n production
# Look for startup errors or missing config

kubectl exec -it <new-pod-name> -n production -- \
  curl -v http://localhost:8080/readyz
# Hit the readiness endpoint directly

Fix: depends on the root cause. The deployment rollout has a progressDeadlineSeconds (default 600s) after which it marks the deployment as failed. This is a safety net — a hung deployment doesn't permanently block the cluster.
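progressDeadlineSeconds lives on the Deployment spec. A minimal fragment (the deployment name and the 300-second value are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  progressDeadlineSeconds: 300   # mark the rollout failed after 5 minutes without progress
```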

Pattern 4: Readiness Probe Too Sensitive

Symptom: pods intermittently drop out of endpoints under load, causing traffic spikes that overload the remaining pods, which then also fail readiness — a cascading failure.

Cause: readiness probe failureThreshold is too low (1 or 2) and/or the probe checks a dependency with variable latency. A single slow probe response removes the pod from endpoints.

Fix: increase failureThreshold to 3–5 and timeoutSeconds to something that accommodates your dependency's tail latency. The readiness probe should only fail if the pod is genuinely unable to serve traffic, not because of a single slow health check.
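A less twitchy readiness configuration might look like this (values illustrative; size them to your own tail latency):

```yaml
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 4      # above the dependency's observed p99 probe latency
  failureThreshold: 4    # 4 x 5s = 20s of sustained failure before removal
```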

Pattern 5: Identical Liveness and Readiness Endpoints

Using the same /health endpoint for both liveness and readiness conflates their concerns. If the endpoint checks a database (as it should for readiness), a database outage triggers liveness failures and restarts.

Use separate endpoints:

  • /healthz — liveness, internal state only, always fast
  • /readyz — readiness, includes dependency checks

Pattern 6: Missing Probe Entirely

Pods without probes are considered always ready and always alive from Kubernetes' perspective. During deployment, new pods immediately receive traffic before they're ready. During failures, stuck pods continue to receive traffic.

For any service that receives traffic, both liveness and readiness probes are required. For batch jobs, only liveness is typically relevant.


Probe Configuration Reference

Good starting values for an HTTP service with moderate startup time:

yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 12      # 12 × 5s = 60s startup window
  periodSeconds: 5

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3       # 30 seconds of consecutive failures before restart
  successThreshold: 1

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3       # 15 seconds before removing from endpoints
  successThreshold: 2       # 2 consecutive passes before re-adding to endpoints

successThreshold: 2 on readiness prevents a pod from flapping in and out of endpoints — it must pass twice before receiving traffic again after a failure.


Debugging Probe Failures

bash
# Recent probe failures in events
kubectl describe pod <pod> -n <namespace> | grep -A5 "Liveness\|Readiness\|Startup"

# Exec into the pod and hit the endpoint directly
kubectl exec -it <pod> -n <namespace> -- \
  wget -qO- http://localhost:8080/healthz

# Check probe configuration on the running pod
kubectl get pod <pod> -n <namespace> -o json | \
  jq '.spec.containers[].livenessProbe, .spec.containers[].readinessProbe, .spec.containers[].startupProbe'

# Watch pod restarts in real time
kubectl get pods -n <namespace> -w

Probe failures appear in kubectl describe pod under Events with the specific error: HTTP status code returned, connection refused, or exec exit code.


Frequently Asked Questions

Should I use the same port for probes as the application?

If your application has a separate admin/management port, use it for probes — it's not in the critical path for production traffic and is less likely to be overloaded when the main service is under stress. If your application has only one port, that's fine — just ensure the probe endpoints are fast and cheap.
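An illustrative fragment (port 9090 is an assumed management port, not a Kubernetes default):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 9090        # separate admin/management port
readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
```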

Can probes cause a pod to be killed during a deployment rollout?

Yes — if the new pod's liveness probe fails during a rolling deployment, the pod is restarted. If it fails repeatedly, the pod enters CrashLoopBackOff and the rollout stalls. This is the correct safety behaviour — a broken deployment shouldn't complete. Use kubectl rollout undo to roll back.

How do I handle a probe that needs authentication?

For HTTP probes, you can add headers:

yaml
httpGet:
  path: /healthz
  port: 8080
  httpHeaders:
    - name: Authorization
      value: Bearer internal-probe-token

Better design: make health endpoints unauthenticated. Health endpoints expose no sensitive data and are accessed by infrastructure, not users. Requiring authentication on health endpoints adds complexity for no security benefit.

What's the right periodSeconds value?

For readiness: 5–10 seconds. Faster recovery when a pod becomes ready after being removed from endpoints. For liveness: 10–15 seconds. Slow enough to not add unnecessary API server load, fast enough to catch a stuck container within a minute.


For related reliability patterns, see Kubernetes StatefulSets: Running Stateful Workloads in Production and Kubernetes HPA Beyond CPU: Scaling on Custom and External Metrics.

Debugging probe-related production incidents on your cluster? Talk to us at Coding Protocols — we help platform teams diagnose and resolve reliability issues before they become outages.

Related Topics

Kubernetes
Probes
Liveness
Readiness
Startup
Platform Engineering
Reliability
Best Practices
