Platform Engineering
14 min read · May 3, 2026

Kubernetes StatefulSets: Running Stateful Workloads in Production

StatefulSets give pods stable network identities (pod-0, pod-1) and stable storage (PersistentVolumeClaims that survive pod rescheduling). This makes them the right primitive for databases, Kafka brokers, ZooKeeper ensembles, and any workload where identity and data persistence matter. This article covers StatefulSet mechanics, volumeClaimTemplates for per-pod storage, headless Services for DNS-based peer discovery, ordered vs parallel pod management, PodDisruptionBudgets for controlled maintenance, and the operational patterns for running PostgreSQL and Kafka in Kubernetes.

Coding Protocols Team

A Deployment treats pods as interchangeable. When a node fails, Kubernetes reschedules the pods elsewhere and assigns them new names, new IPs, and new ephemeral storage. For stateless services this is fine — any replica can handle any request. For databases and distributed systems, interchangeability breaks things: a Kafka broker needs to know it's broker-2, not just some broker; a PostgreSQL replica needs to reconnect to its primary using a stable address; a ZooKeeper node needs to bring up its journal from the same disk it used before.

StatefulSet provides three guarantees that Deployment doesn't:

  1. Stable pod identity: pods are named <statefulset-name>-0, <statefulset-name>-1, etc. and keep these names across rescheduling.
  2. Stable network identity: each pod gets a stable DNS name <pod-name>.<headless-service>.<namespace>.svc.cluster.local.
  3. Stable storage: each pod gets its own PersistentVolumeClaim that follows it across rescheduling — it's not shared with other pods.

Basic StatefulSet

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: databases
spec:
  serviceName: postgres-headless    # Must reference a headless Service (clusterIP: None)
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  podManagementPolicy: OrderedReady    # Default: bring up pods one at a time in order

  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16.4
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data

  # volumeClaimTemplates creates a separate PVC for each pod
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3    # EKS: use gp3 for general purpose; io2 for high-IOPS databases
        resources:
          requests:
            storage: 100Gi

When this StatefulSet is created, Kubernetes provisions three PVCs: data-postgres-0, data-postgres-1, data-postgres-2. Each pod mounts only its own PVC. If postgres-1 is deleted and rescheduled to a different node, it mounts data-postgres-1 again — not a new blank volume.
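
A quick way to confirm the per-pod claims exist (names follow the StatefulSet above; the output shape is illustrative):

bash
# Each PVC is named <volumeClaimTemplate-name>-<pod-name>
kubectl get pvc -n databases
# Expected (roughly):
# NAME              STATUS   VOLUME    CAPACITY   ACCESS MODES   STORAGECLASS
# data-postgres-0   Bound    pvc-...   100Gi      RWO            gp3
# data-postgres-1   Bound    pvc-...   100Gi      RWO            gp3
# data-postgres-2   Bound    pvc-...   100Gi      RWO            gp3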


Headless Service: Stable DNS Names

StatefulSets require a headless Service (clusterIP: None) that provides stable DNS records for each pod. Unlike a regular Service (which load-balances across pods via a single ClusterIP), a headless Service creates one DNS A record per pod:

yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: databases
spec:
  clusterIP: None    # Headless: no VIP, direct pod DNS records
  selector:
    app: postgres
  ports:
    - port: 5432
      targetPort: 5432

DNS records created:

  • postgres-0.postgres-headless.databases.svc.cluster.local → postgres-0 pod IP
  • postgres-1.postgres-headless.databases.svc.cluster.local → postgres-1 pod IP
  • postgres-2.postgres-headless.databases.svc.cluster.local → postgres-2 pod IP

And the Service itself resolves to all pod A records:

  • postgres-headless.databases.svc.cluster.local → all pod IPs (DNS client receives all records and chooses one — no server-side load balancing)

Applications that need to talk to a specific replica (like a PostgreSQL client connecting to the primary at postgres-0) use the per-pod DNS name. Applications that don't care which pod they hit use the Service name.
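
To see the difference from inside the cluster, resolve both names from a throwaway pod. A quick check; the busybox image and pod names here are assumptions:

bash
# Per-pod record resolves to exactly one IP
kubectl run -it --rm dns-check-1 --image=busybox:1.36 --restart=Never -n databases -- \
  nslookup postgres-0.postgres-headless.databases.svc.cluster.local

# The headless Service name resolves to all pod IPs
kubectl run -it --rm dns-check-2 --image=busybox:1.36 --restart=Never -n databases -- \
  nslookup postgres-headless.databases.svc.cluster.local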

For external access, create a regular (non-headless) Service for the primary:

yaml
apiVersion: v1
kind: Service
metadata:
  name: postgres-primary
  namespace: databases
spec:
  selector:
    app: postgres
    role: primary    # Label the primary pod with this; your replication controller manages it
  ports:
    - port: 5432

Pod Management Policy

podManagementPolicy controls how pods start and stop:

OrderedReady (default): Pods start in order (0 before 1, 1 before 2) and each must be Running and Ready before the next starts. Scale-down reverses order (highest-numbered pod first). This is correct for distributed systems that require a quorum (ZooKeeper, etcd) where starting in a specific sequence prevents split-brain.

Parallel: All pods start simultaneously. Faster startup, but applications must handle all replicas initializing at once. Correct for applications that don't have bootstrap sequencing requirements.

yaml
spec:
  podManagementPolicy: Parallel    # For Kafka brokers, Redis Cluster nodes that bootstrap independently

Update Strategy: Rolling Updates

StatefulSet rolling updates proceed from highest-ordinal to lowest (reverse of startup order), ensuring the primary/leader (usually ordinal 0) is updated last:

yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2    # Only update pods with ordinal >= 2 (canary: test update on replicas before primary)

partition enables canary updates: set partition: 2 to update only postgres-2, verify it works, then set partition: 1 (updates postgres-2 and postgres-1), then partition: 0 (all pods). This is the standard pattern for rolling out PostgreSQL minor version upgrades safely.
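
One way to walk the partition down with kubectl, using the resource names from the earlier example (the target image tag is illustrative):

bash
# Start conservative: only pods with ordinal >= 2 receive the new template
kubectl patch statefulset postgres -n databases --type merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset/postgres postgres=postgres:16.5 -n databases

# Verify postgres-2, then widen the rollout step by step
kubectl patch statefulset postgres -n databases --type merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'
kubectl patch statefulset postgres -n databases --type merge \
  -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'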


PodDisruptionBudget: Controlling Maintenance Impact

A PodDisruptionBudget limits how many pods can be voluntarily disrupted (by node drains, rolling updates, VPA evictions) simultaneously. Without a PDB, a node drain can take down multiple database replicas at once:

yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb
  namespace: databases
spec:
  minAvailable: 2    # At least 2 postgres pods must be available during disruptions
  selector:
    matchLabels:
      app: postgres

For a 3-replica cluster, minAvailable: 2 means at most 1 pod can be disrupted at a time. A node drain will evict postgres-2, wait for it to restart on another node, then proceed to the next pod.
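
You can watch the budget enforce this during maintenance (the node name below is a placeholder):

bash
# ALLOWED DISRUPTIONS shows how many pods the PDB currently permits evicting
kubectl get pdb postgres-pdb -n databases

# The drain evicts one postgres pod at a time; further evictions block until the PDB is satisfied again
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data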

maxUnavailable is the alternative — specify as a number or percentage:

yaml
spec:
  maxUnavailable: 1    # At most 1 pod unavailable at any time

PDBs protect against voluntary disruptions (node drains, rolling updates) but not involuntary ones (node failures). For a production database that requires quorum, minAvailable should be set to the quorum size: floor(replicas / 2) + 1 (2 for a 3-replica cluster, 3 for a 5-replica cluster).


Running PostgreSQL: Primary-Replica Pattern

The standard approach for PostgreSQL on Kubernetes in 2026 is to use a Kubernetes operator that handles replication management, failover, and backup:

  • CloudNativePG: a CNCF-hosted PostgreSQL operator. Manages streaming replication, automated failover, and PITR backup to S3.
  • Zalando Postgres Operator: creates a primary + replicas with Patroni for HA.

A bare StatefulSet gets you stable storage and identity, but not replication setup, primary election, or failover. Operators handle all of this:

yaml
# CloudNativePG Cluster resource
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: payments-db
  namespace: databases
spec:
  instances: 3    # 1 primary + 2 replicas

  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "256MB"

  storage:
    size: 100Gi
    storageClass: gp3

  backup:
    barmanObjectStore:
      destinationPath: s3://your-bucket/payments-db
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
    retentionPolicy: "30d"

CloudNativePG creates a StatefulSet internally and exposes three Services: <cluster>-rw (read-write, points to primary), <cluster>-ro (read-only, round-robins replicas), <cluster>-r (all instances). Applications connect to payments-db-rw.databases.svc:5432 — the service automatically follows the primary after failover.
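
An application Deployment then only needs the -rw Service name in its connection settings. A minimal env fragment as a sketch; the Secret name and key layout are assumptions:

yaml
env:
  - name: DATABASE_HOST
    value: "payments-db-rw.databases.svc.cluster.local"    # Follows the primary across failovers
  - name: DATABASE_PORT
    value: "5432"
  - name: DATABASE_PASSWORD
    valueFrom:
      secretKeyRef:
        name: payments-db-app    # Credentials Secret generated by the operator; name assumed here
        key: password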


Running Kafka: Broker Identity and Partition Leadership

Kafka brokers are the canonical StatefulSet use case: each broker has a fixed ID, owns specific topic partitions, and replicas rely on stable broker addresses for leader election and log replication.

yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
  namespace: kafka
spec:
  serviceName: kafka-headless
  replicas: 3
  podManagementPolicy: Parallel    # Brokers can start simultaneously
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka    # Must match spec.selector
    spec:
      containers:
        - name: kafka
          image: apache/kafka:3.8.0
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name    # Needed for the $(POD_NAME) substitution below
            - name: KAFKA_BROKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name    # WARNING: resolves to full name "kafka-0", not integer "0" — Kafka will fail; see KRaft note below
            - name: KAFKA_ZOOKEEPER_CONNECT
              value: "zookeeper-headless.kafka:2181"
            - name: KAFKA_ADVERTISED_LISTENERS
              value: "PLAINTEXT://$(POD_NAME).kafka-headless.kafka.svc.cluster.local:9092"
          volumeMounts:
            - name: data
              mountPath: /var/kafka/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3
        resources:
          requests:
            storage: 500Gi

Note on KAFKA_BROKER_ID: Using metadata.name gives kafka-0, kafka-1, kafka-2 — strings, not integers. Kafka requires an integer broker ID. Use an init container or the KAFKA_CFG_NODE_ID pattern with KRaft mode (ZooKeeper-less Kafka) instead:

yaml
# KRaft mode — ZooKeeper-free Kafka (preferred for Kafka 3.3+)
# Extract the integer node ID from the hostname in an init container:
# HOSTNAME=kafka-0 → ID=0 via: echo ${HOSTNAME##*-}
# This fragment assumes the bitnami/kafka image, whose entrypoint maps KAFKA_CFG_* env vars to server.properties
initContainers:
  - name: set-node-id
    image: busybox
    command: ["/bin/sh", "-c", "echo ${HOSTNAME##*-} > /mnt/config/node-id"]
    volumeMounts:
      - name: config
        mountPath: /mnt/config
containers:
  - name: kafka
    # env.value does not support shell command substitution — use command/args instead.
    # $(cat file) in env.value performs Kubernetes variable substitution, NOT shell execution:
    # Kafka would receive the literal string "$(cat /mnt/config/node-id)" as its node ID.
    env:
      - name: KAFKA_CFG_NODE_ID
        value: ""    # Set dynamically at startup via command/args below
      - name: KAFKA_CFG_PROCESS_ROLES
        value: "broker,controller"
      - name: KAFKA_CFG_CONTROLLER_QUORUM_VOTERS
        value: "0@kafka-0.kafka-headless.kafka.svc.cluster.local:9093,1@kafka-1.kafka-headless.kafka.svc.cluster.local:9093,2@kafka-2.kafka-headless.kafka.svc.cluster.local:9093"
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Read the integer node ID written by the init container and export it
        export KAFKA_CFG_NODE_ID=$(cat /mnt/config/node-id)
        exec /opt/bitnami/scripts/kafka/run.sh
    volumeMounts:
      - name: config
        mountPath: /mnt/config    # Same volume the init container wrote to
volumes:
  - name: config
    emptyDir: {}    # Shared between the init container and the kafka container

In production, use the Strimzi Kafka Operator which handles broker ID assignment, KRaft configuration, topic management, and rolling upgrades — the same role CloudNativePG plays for PostgreSQL.
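
For illustration, a minimal Strimzi Kafka resource might look like the sketch below (kafka.strimzi.io/v1beta2 API in its ZooKeeper-based form; newer Strimzi releases run KRaft via KafkaNodePool resources instead, so treat the exact fields as assumptions and verify against the Strimzi documentation):

yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: events
  namespace: kafka
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
    storage:
      type: persistent-claim
      size: 500Gi
      class: gp3
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}    # Manages KafkaTopic custom resources
    userOperator: {}     # Manages KafkaUser custom resources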


Expanding PVC Storage

A limitation of volumeClaimTemplates: you cannot change the template's storage size to resize existing PVCs — Kubernetes treats the template as immutable. To expand PVCs for a running StatefulSet:

bash
# 1. Patch each PVC (storage class must have allowVolumeExpansion: true)
kubectl patch pvc data-postgres-0 -n databases -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
kubectl patch pvc data-postgres-1 -n databases -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'
kubectl patch pvc data-postgres-2 -n databases -p '{"spec":{"resources":{"requests":{"storage":"200Gi"}}}}'

# 2. Wait for PVC status.capacity to reflect the new size
# For in-use volumes, the filesystem resize happens on pod restart — the PVC patch alone is not sufficient
kubectl get pvc -n databases -w    # Wait for Resizing → FileSystemResizePending → Bound

# 3. Restart pods to trigger filesystem resize (if the volume is in use)
kubectl rollout restart statefulset/postgres -n databases

# 4. Delete and recreate the StatefulSet without deleting PVCs (--cascade=orphan)
# Only needed if you want the volumeClaimTemplate to reflect the new size for future pods
kubectl delete statefulset postgres --cascade=orphan -n databases
# Re-apply with updated volumeClaimTemplate storage size
kubectl apply -f postgres-statefulset.yaml

Deleting with --cascade=orphan removes the StatefulSet object but leaves the pods and PVCs running. The re-created StatefulSet adopts the existing pods.
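
Before step 1, it's worth confirming the StorageClass actually allows expansion:

bash
# Should print "true"; if empty, PVC expansion requests will be rejected
kubectl get storageclass gp3 -o jsonpath='{.allowVolumeExpansion}'

# Enable it if needed (the field is mutable on an existing StorageClass)
kubectl patch storageclass gp3 -p '{"allowVolumeExpansion": true}'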


Frequently Asked Questions

When should I use a StatefulSet vs a Deployment?

Use a StatefulSet when any of the following are true: (1) each pod needs its own persistent storage, (2) pods need to know their own identity (broker ID, replica role), (3) applications use DNS-based peer discovery with stable hostnames, (4) startup or shutdown order matters. Use a Deployment for everything else — it's simpler to manage, and its rolling updates can surge above the desired replica count, which StatefulSet updates cannot.

Can I delete a single StatefulSet pod to force it to a different node?

Yes. Deleting a StatefulSet pod triggers recreation with the same name and the same PVC. To force migration to a different node, add a nodeSelector or pod anti-affinity rule to the pod template (or cordon the current node) before deleting the pod; the replacement starts on another node and mounts the same PVC. Note: if the PVC uses the ReadWriteOnce access mode, the volume must detach from the old node before the new pod can mount it. On healthy nodes, this happens after pod termination. On unhealthy nodes (node failure, network partition), detach may not occur until the node object is deleted or a force-detach timeout elapses — terminating the pod is not sufficient. This is the "stuck mounting" scenario; force-deleting the pod or the node object may be required, but do so only after confirming the old node is truly offline (to prevent a split-brain write scenario).
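
A typical sequence for moving one replica off a node (the node name is a placeholder):

bash
# Prevent the replacement pod from landing back on the same node
kubectl cordon ip-10-0-1-23.ec2.internal

# Delete the pod; the StatefulSet recreates postgres-1 with the same name and PVC
kubectl delete pod postgres-1 -n databases
kubectl get pods -n databases -w    # Watch it reschedule and re-mount data-postgres-1

# Re-enable scheduling on the node afterwards
kubectl uncordon ip-10-0-1-23.ec2.internal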

What's the difference between a headless Service and a regular Service for StatefulSets?

A regular Service (clusterIP: non-None) creates a VIP that load-balances requests across pods — all pods look like one endpoint. A headless Service (clusterIP: None) creates individual DNS records for each pod and the Service itself resolves to the full list of pod IPs. StatefulSets require the headless Service for stable per-pod DNS names. For client access, you typically create both: the headless Service for stable identity (used internally by pods for peer discovery) and a regular Service for application clients that need to reach a specific role (primary, read-only replica).


For PV/PVC fundamentals — access modes, reclaim policies, StorageClass binding modes, and common data-loss mistakes — see Kubernetes Persistent Volumes: A Production Guide. For PodDisruptionBudgets that also protect StatefulSets during Karpenter node consolidation, see Kubernetes Node Autoscaling: Cluster Autoscaler vs Karpenter. For Velero backup that integrates with CSI volume snapshots to back up StatefulSet PVCs, see Velero: Kubernetes Backup and Disaster Recovery.

Running PostgreSQL, Kafka, or Redis as StatefulSets on EKS, or migrating from EC2-hosted databases to Kubernetes? Talk to us at Coding Protocols — we help platform teams implement stateful workload patterns that survive node failures and cluster maintenance without data loss.

Related Topics

Kubernetes
StatefulSets
Databases
Kafka
PostgreSQL
Storage
Platform Engineering
EKS
