Kubernetes StatefulSets: Running Stateful Workloads in Production
StatefulSets give pods stable network identities (pod-0, pod-1) and stable storage (PersistentVolumeClaims that survive pod rescheduling). This makes them the right primitive for databases, Kafka brokers, ZooKeeper ensembles, and any workload where identity and data persistence matter. This article covers StatefulSet mechanics, volumeClaimTemplates for per-pod storage, headless Services for DNS-based peer discovery, ordered vs parallel pod management, PodDisruptionBudgets for controlled maintenance, and the operational patterns for running PostgreSQL and Kafka in Kubernetes.

A Deployment treats pods as interchangeable. When a node fails, Kubernetes reschedules the pods elsewhere and assigns them new names, new IPs, and new ephemeral storage. For stateless services this is fine — any replica can handle any request. For databases and distributed systems, interchangeability breaks things: a Kafka broker needs to know it's broker-2, not just some broker; a PostgreSQL replica needs to reconnect to its primary using a stable address; a ZooKeeper node needs to bring up its journal from the same disk it used before.
StatefulSet provides three guarantees that Deployment doesn't:
- Stable pod identity: pods are named <statefulset-name>-0, <statefulset-name>-1, etc. and keep these names across rescheduling.
- Stable network identity: each pod gets a stable DNS name <pod-name>.<headless-service>.<namespace>.svc.cluster.local.
- Stable storage: each pod gets its own PersistentVolumeClaim that follows it across rescheduling — it's not shared with other pods.
Basic StatefulSet
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: databases
spec:
  serviceName: postgres-headless  # Must reference a headless Service (clusterIP: None)
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  podManagementPolicy: OrderedReady  # Default: bring up pods one at a time in order

  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16.4
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data

  # volumeClaimTemplates creates a separate PVC for each pod
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3  # EKS: use gp3 for general purpose; io2 for high-IOPS databases
        resources:
          requests:
            storage: 100Gi
```

When this StatefulSet is created, Kubernetes provisions three PVCs: data-postgres-0, data-postgres-1, data-postgres-2. Each pod mounts only its own PVC. If postgres-1 is deleted and rescheduled to a different node, it mounts data-postgres-1 again — not a new blank volume.
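The PVC naming rule (<claim template name>-<statefulset name>-<ordinal>) is worth internalizing, since expansion and recovery procedures reference PVCs by name. A minimal sketch (the pvc_name helper is illustrative, not a kubectl feature):

```shell
# PVC names created by volumeClaimTemplates follow the pattern
# <claim template name>-<statefulset name>-<ordinal>.
pvc_name() {
  claim=$1; sts=$2; ordinal=$3
  echo "${claim}-${sts}-${ordinal}"
}

# The three PVCs for the postgres StatefulSet above:
for i in 0 1 2; do
  pvc_name data postgres "$i"
done
```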
Headless Service: Stable DNS Names
StatefulSets require a headless Service (clusterIP: None) that provides stable DNS records for each pod. Unlike a regular Service (which load-balances across pods via a single ClusterIP), a headless Service creates one DNS A record per pod:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: databases
spec:
  clusterIP: None  # Headless: no VIP, direct pod DNS records
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432
```

DNS records created:
- postgres-0.postgres-headless.databases.svc.cluster.local → postgres-0 pod IP
- postgres-1.postgres-headless.databases.svc.cluster.local → postgres-1 pod IP
- postgres-2.postgres-headless.databases.svc.cluster.local → postgres-2 pod IP
And the Service itself resolves to all pod A records:
postgres-headless.databases.svc.cluster.local → all pod IPs (DNS client receives all records and chooses one — no server-side load balancing)
Applications that need to talk to a specific replica (like a PostgreSQL client connecting to the primary at postgres-0) use the per-pod DNS name. Applications that don't care which pod they hit use the Service name.
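Because the DNS pattern is fixed, clients can derive every peer address from the StatefulSet name, headless Service name, namespace, and replica count alone. A sketch using the names from this example (the pod_dns helper is just for illustration):

```shell
# Derive the stable per-pod DNS name for one StatefulSet pod:
# <statefulset>-<ordinal>.<headless service>.<namespace>.svc.cluster.local
pod_dns() {
  # $1=statefulset $2=ordinal $3=headless service $4=namespace
  echo "$1-$2.$3.$4.svc.cluster.local"
}

# All three peer addresses for the postgres example:
for i in 0 1 2; do
  pod_dns postgres "$i" postgres-headless databases
done
```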
For external access, create a regular (non-headless) Service for the primary:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-primary
  namespace: databases
spec:
  selector:
    app: postgres
    role: primary  # Label the primary pod with this; your replication controller manages it
  ports:
    - port: 5432
```

Pod Management Policy
podManagementPolicy controls how pods start and stop:
OrderedReady (default): Pods start in order (0 before 1, 1 before 2) and each must be Running and Ready before the next starts. Scale-down reverses order (highest-numbered pod first). This is correct for distributed systems that require a quorum (ZooKeeper, etcd) where starting in a specific sequence prevents split-brain.
Parallel: All pods start simultaneously. Faster startup, but applications must handle all replicas initializing at once. Correct for applications that don't have bootstrap sequencing requirements.
```yaml
spec:
  podManagementPolicy: Parallel  # For Kafka brokers, Redis Cluster nodes that bootstrap independently
```

Update Strategy: Rolling Updates
StatefulSet rolling updates proceed from highest-ordinal to lowest (reverse of startup order), ensuring the primary/leader (usually ordinal 0) is updated last:
```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2  # Only update pods with ordinal >= 2 (canary: test update on replicas before primary)
```

partition enables canary updates: set partition: 2 to update only postgres-2, verify it works, then set partition: 1 (updates postgres-2 and postgres-1), then partition: 0 (all pods). This is the standard pattern for rolling out PostgreSQL minor version upgrades safely.
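The partition rule itself is simple: a pod is touched by the rolling update only if its ordinal is greater than or equal to the partition value. A quick sketch of that check (is_updated is a made-up helper name):

```shell
# A pod is updated iff its ordinal >= the partition value.
is_updated() {
  ordinal=$1; partition=$2
  [ "$ordinal" -ge "$partition" ]
}

# partition: 2 in a 3-replica StatefulSet touches only ordinal 2;
# lowering the partition step by step widens the rollout.
for i in 0 1 2; do
  if is_updated "$i" 2; then
    echo "postgres-$i: updated"
  else
    echo "postgres-$i: untouched"
  fi
done
```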
PodDisruptionBudget: Controlling Maintenance Impact
A PodDisruptionBudget limits how many pods can be voluntarily disrupted (by node drains, rolling updates, VPA evictions) simultaneously. Without a PDB, a node drain can take down multiple database replicas at once:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: databases
spec:
  minAvailable: 2  # At least 2 postgres pods must be available during disruptions
  selector:
    matchLabels:
      app: postgres
```

For a 3-replica cluster, minAvailable: 2 means at most 1 pod can be disrupted at a time. A node drain will evict postgres-2, wait for it to restart on another node, then proceed to the next pod.
maxUnavailable is the alternative — specify as a number or percentage:
```yaml
spec:
  maxUnavailable: 1  # At most 1 pod unavailable at any time
```

PDBs protect against voluntary disruptions (node drains, rolling updates) but not involuntary ones (node failures). For a production database that requires quorum, set minAvailable to the quorum size: floor(replicas / 2) + 1 (2 for a 3-node cluster, 3 for a 5-node cluster).
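The quorum arithmetic is easy to get off by one; a small sketch computing minAvailable for quorum-based systems (the quorum helper is just for illustration):

```shell
# Quorum for n replicas is floor(n/2) + 1. Shell integer
# arithmetic truncates, so $((n / 2)) gives the floor directly.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

for n in 3 5 7; do
  echo "replicas=$n -> minAvailable=$(quorum "$n")"
done
```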
Running PostgreSQL: Primary-Replica Pattern
The standard approach for PostgreSQL on Kubernetes in 2026 is to use a Kubernetes operator that handles replication management, failover, and backup:
- CloudNativePG: a CNCF-hosted PostgreSQL operator. Manages streaming replication, automated failover, and PITR backup to S3.
- Zalando Postgres Operator: creates a primary + replicas with Patroni for HA.
A bare StatefulSet gets you stable storage and identity, but not replication setup, primary election, or failover. Operators handle all of this:
```yaml
# CloudNativePG Cluster resource
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: payments-db
  namespace: databases
spec:
  instances: 3  # 1 primary + 2 replicas

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"

  storage:
    size: 100Gi
    storageClass: gp3

  backup:
    barmanObjectStore:
      destinationPath: s3://your-bucket/payments-db
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"
```

CloudNativePG manages the instance pods and their PVCs directly (rather than through a StatefulSet) and exposes three Services: <cluster>-rw (read-write, points to primary), <cluster>-ro (read-only, round-robins replicas), <cluster>-r (all instances). Applications connect to payments-db-rw.databases.svc:5432 — the service automatically follows the primary after failover.
Running Kafka: Broker Identity and Partition Leadership
Kafka brokers are the canonical StatefulSet use case: each broker has a fixed ID, owns specific topic partitions, and replicas rely on stable broker addresses for leader election and log replication.
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: kafka
spec:
  serviceName: kafka-headless
  replicas: 3
  podManagementPolicy: Parallel  # Brokers can start simultaneously
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka  # Must match spec.selector
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.8.0
          env:
            - name: POD_NAME  # Needed for the $(POD_NAME) substitution below
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name  # WARNING: resolves to full name "kafka-0", not integer "0" — Kafka will fail; see KRaft note below
            - name: KAFKA_ZOOKEEPER_CONNECT
              value: "zookeeper-headless.kafka:2181"
            - name: KAFKA_ADVERTISED_LISTENERS
              value: "PLAINTEXT://$(POD_NAME).kafka-headless.kafka.svc.cluster.local:9092"
          volumeMounts:
            - name: data
              mountPath: /var/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 500Gi
```

Note on KAFKA_BROKER_ID: Using metadata.name gives kafka-0, kafka-1, kafka-2 — strings, not integers. Kafka requires an integer broker ID. Use an init container or the KAFKA_CFG_NODE_ID pattern with KRaft mode (ZooKeeper-less Kafka) instead:
```yaml
# KRaft mode — ZooKeeper-free Kafka (preferred for Kafka 3.3+)
# Extract the integer node ID from the hostname in an init container:
# HOSTNAME=kafka-0 → ID=0 via: echo ${HOSTNAME##*-}
# (This fragment assumes a Bitnami-style image, which reads KAFKA_CFG_* variables;
# the "config" volume must be defined, e.g. as an emptyDir, in spec.volumes.)
initContainers:
  - name: set-node-id
    image: busybox
    command: ["/bin/sh", "-c", "echo ${HOSTNAME##*-} > /mnt/config/node-id"]
    volumeMounts:
      - name: config
        mountPath: /mnt/config
containers:
  - name: kafka
    # env.value does not support shell command substitution — use command/args instead.
    # $(cat file) in env.value performs Kubernetes variable substitution, NOT shell execution:
    # Kafka would receive the literal string "$(cat /mnt/config/node-id)" as its node ID.
    env:
      - name: KAFKA_CFG_NODE_ID
        value: ""  # Set dynamically at startup via command/args below
      - name: KAFKA_CFG_PROCESS_ROLES
        value: "broker,controller"
      - name: KAFKA_CFG_CONTROLLER_QUORUM_VOTERS
        value: "0@kafka-0.kafka-headless.kafka.svc.cluster.local:9093,1@kafka-1.kafka-headless.kafka.svc.cluster.local:9093,2@kafka-2.kafka-headless.kafka.svc.cluster.local:9093"
    volumeMounts:
      - name: config  # The kafka container must also mount the node-id file
        mountPath: /mnt/config
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Read the integer node ID written by the init container and export it
        export KAFKA_CFG_NODE_ID=$(cat /mnt/config/node-id)
        exec /opt/bitnami/scripts/kafka/run.sh
```

In production, use the Strimzi Kafka Operator, which handles broker ID assignment, KRaft configuration, topic management, and rolling upgrades — the same role CloudNativePG plays for PostgreSQL.
Expanding PVC Storage
A limitation of volumeClaimTemplates: you cannot change the template's storage size to resize existing PVCs — Kubernetes treats the template as immutable. To expand PVCs for a running StatefulSet:
```shell
# 1. Patch each PVC (storage class must have allowVolumeExpansion: true)
kubectl patch pvc data-postgres-0 -n databases -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
kubectl patch pvc data-postgres-1 -n databases -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
kubectl patch pvc data-postgres-2 -n databases -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# 2. Wait for PVC status.capacity to reflect the new size
# For in-use volumes, the filesystem resize happens on pod restart — the PVC patch alone is not sufficient
kubectl get pvc -n databases -w  # Wait for Resizing → FileSystemResizePending → Bound

# 3. Restart pods to trigger filesystem resize (if the volume is in use)
kubectl rollout restart statefulset/postgres -n databases

# 4. Delete and recreate the StatefulSet without deleting PVCs (--cascade=orphan)
# Only needed if you want the volumeClaimTemplate to reflect the new size for future pods
kubectl delete statefulset postgres --cascade=orphan -n databases
# Re-apply with updated volumeClaimTemplate storage size
kubectl apply -f postgres-statefulset.yaml
```

Deleting with --cascade=orphan removes the StatefulSet object but leaves the pods and PVCs running. The new StatefulSet adopts the existing pods.
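For StatefulSets with many replicas, step 1 is easier to script than to type. A sketch that builds the same patch command for any ordinal, printed rather than executed (pvc_patch_cmd is a made-up helper name):

```shell
# Build (but do not run) the PVC patch command for one ordinal.
pvc_patch_cmd() {
  # $1=claim template $2=statefulset $3=ordinal $4=namespace $5=new size
  printf '%s' "kubectl patch pvc $1-$2-$3 -n $4 -p '{\"spec\":{\"resources\":{\"requests\":{\"storage\":\"$5\"}}}}'"
}

# Print the commands for a 3-replica postgres StatefulSet:
for i in 0 1 2; do
  pvc_patch_cmd data postgres "$i" databases 200Gi
  echo
done
```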
Frequently Asked Questions
When should I use a StatefulSet vs a Deployment?
Use a StatefulSet when any of the following are true: (1) each pod needs its own persistent storage, (2) pods need to know their own identity (broker ID, replica role), (3) applications use DNS-based peer discovery with stable hostnames, (4) startup or shutdown order matters. Use a Deployment for everything else — it's simpler to manage, supports more scheduling patterns, and works with standard HPA.
Can I delete a single StatefulSet pod to force it to a different node?
Yes. Deleting a StatefulSet pod triggers recreation with the same name and the same PVC. If you want to force migration to a different node, cordon the current node or add a nodeSelector or pod anti-affinity constraint to the pod template beforehand, then delete the pod. The new pod will start on a different node (if the constraints require it) and mount the same PVC. Note: if the PVC uses a ReadWriteOnce access mode, the volume must detach from the old node before the new pod can mount it. On healthy nodes, this happens after pod termination. On unhealthy nodes (node failure, network partition), detach may not occur until the node object is deleted or a force-detach timeout elapses — terminating the pod is not sufficient. This is the "stuck mounting" scenario; force-deleting the pod or the node object may be required, but do so only after confirming the old node is truly offline (to prevent a split-brain write scenario).
What's the difference between a headless Service and a regular Service for StatefulSets?
A regular Service (clusterIP: non-None) creates a VIP that load-balances requests across pods — all pods look like one endpoint. A headless Service (clusterIP: None) creates individual DNS records for each pod and the Service itself resolves to the full list of pod IPs. StatefulSets require the headless Service for stable per-pod DNS names. For client access, you typically create both: the headless Service for stable identity (used internally by pods for peer discovery) and a regular Service for application clients that need to reach a specific role (primary, read-only replica).
For PV/PVC fundamentals — access modes, reclaim policies, StorageClass binding modes, and common data-loss mistakes — see Kubernetes Persistent Volumes: A Production Guide. For PodDisruptionBudgets that also protect StatefulSets during Karpenter node consolidation, see Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter. For Velero backup that integrates with CSI volume snapshots to back up StatefulSet PVCs, see Velero: Kubernetes Backup and Disaster Recovery.
Running PostgreSQL, Kafka, or Redis as StatefulSets on EKS, or migrating from EC2-hosted databases to Kubernetes? Talk to us at Coding Protocols — we help platform teams implement stateful workload patterns that survive node failures and cluster maintenance without data loss.


