Databases in Kubernetes: Smart Move or Unnecessary Risk?
Running databases in Kubernetes feels cloud-native and unified. But stateful workloads play by different rules. Here's an honest look at the real cost of that architectural choice.

Would you put your production database inside Kubernetes today?
Be honest.
I've seen teams move everything into the cluster — frontend, backend, and database. It feels modern. Unified. Cloud-native. But there's a gap between what "cloud-native" promises and what it costs to operate in practice. And with databases, that gap is wider than most teams expect.
This post is my honest take, as a platform engineer who has been on both sides of this decision.
Why Teams Want This in the First Place
The argument for running databases in Kubernetes is not irrational. It comes from a few genuinely compelling places:
Unified operational model. One control plane. One observability stack. One deployment pipeline. If your team already lives in Kubernetes, the appeal of managing everything there is real. You get to avoid learning a second operational model.
Infrastructure as Code consistency. With operators and Helm charts, your database configuration lives in Git alongside everything else. Reviewing a database config change becomes a pull request, not a Confluence document.
Cost optics. Running a db.r6g.large RDS instance in a non-production environment at $0.24/hr adds up. If you already have cluster capacity sitting idle, the temptation to colocate is understandable.
Portability. On-premise, cloud, hybrid — Kubernetes abstracts the underlying infrastructure. For organizations with multi-cloud ambitions or air-gapped environments, running databases in K8s fits the story.
None of these reasons are wrong. But they are all about you, not about what the database needs.
The Core Problem: Databases Are Not Stateless
Kubernetes was architected around a specific assumption: workloads are disposable and replaceable. Pods die. Nodes are drained. Deployments roll forward. The whole system is designed to treat any single process as expendable.
Databases are built on the opposite assumption. They are stateful, durable, and sequential by design. Every write has to be acknowledged. The write-ahead log has to be flushed to durable storage before a commit returns. The data must survive anything — a crash, a restart, a hardware failure.
This is not just a philosophical difference. It translates directly into engineering constraints that Kubernetes does not solve for you.
I/O Sensitivity
A database engine's performance lives and dies on I/O latency and throughput. PostgreSQL, MySQL, and MongoDB all have tight expectations around storage behavior.
In Kubernetes, your storage is a PersistentVolumeClaim backed by a StorageClass. The actual behavior of that storage depends on:
- The cloud provider's block storage driver (EBS, GCP Persistent Disk, Azure Disk)
- The volumeBindingMode and reclaimPolicy on the StorageClass
- Whether you are using ReadWriteOnce or ReadWriteMany
- The iopsPerGB provisioning, which on EBS depends on your volume size
By default, most provisioners will give you gp3 or equivalent general-purpose storage. For lightweight workloads this is fine. For a PostgreSQL database under write load, the behavior becomes unpredictable — especially if your PVC ends up on the same physical disk as other noisy-neighbor workloads on the same node.
In contrast, RDS io1 or io2 gives you predictable, dedicated IOPS with a clear SLA. You configure it once. It does not drift.
Failure Intolerance
When a Pod crashes in a typical Kubernetes deployment, the scheduler simply starts a new one. The new Pod is identical to the old. No state, no problem.
When your database Pod crashes, the scheduler will try to restart it. But the database must:
- Replay the write-ahead log (WAL) from the last checkpoint
- Verify storage integrity
- Re-establish replication if you are running a replica setup
This process is not instantaneous. Depending on your checkpoint interval and write volume, crash recovery can take minutes. During that window, your application is either down or hitting errors.
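The recovery window is bounded by how much WAL can accumulate between checkpoints, and that is tunable. A hedged sketch, assuming CloudNativePG (the cluster name and sizes are illustrative, not recommendations):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db                # hypothetical name
spec:
  instances: 3
  storage:
    size: 100Gi
  postgresql:
    parameters:
      # More frequent checkpoints mean less WAL to replay after a crash,
      # at the cost of more checkpoint I/O during normal operation.
      checkpoint_timeout: "5min"
      max_wal_size: "2GB"
```

This is the classic trade: tightening checkpoint_timeout and max_wal_size shortens crash recovery but increases steady-state write amplification. Measure before you tune.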
And that is if everything goes right. The failure modes that actually hurt are more subtle:
- A Pod gets killed mid-write during a node eviction
- The PVC detaches from a node that was cordoned for maintenance
- A storage driver bug corrupts the filesystem
None of these are hypothetical. They happen. And with a database, the blast radius is your data.
The Three Hard Problems
1. Storage Performance Is Opaque
When you provision a managed database like RDS, you choose an instance class and storage type. You get documented, predictable performance characteristics. IOPS, throughput, latency — they are explicit.
In Kubernetes, the StorageClass abstraction hides the underlying storage details. You can specify a storageClassName: io2 and provision IOPS explicitly, but most teams don't. They use the default StorageClass and assume it will behave well under load.
This creates a pattern I call performance opacity: the database works fine in staging (low load, small data), breaks unpredictably in production (sustained writes, concurrent readers). By the time you add fsync latency monitoring and block device metrics to your observability stack, you have built a significant chunk of what managed database services provide out of the box.
2. Backup Strategy Is More Complex Than It Looks
Every database backup strategy has three components: capture, store, and restore. Managed services handle all three. In Kubernetes, you own all three.
Capture: You need a sidecar or CronJob that runs a consistent backup — pg_dump, pg_basebackup, or volume snapshots via VolumeSnapshot. Consistency matters: a naive file-level copy taken while transactions are in flight can be unusable, which is exactly what tools like pg_dump and pg_basebackup are designed to prevent.
Store: Where does the backup go? S3 is the obvious answer, but you need IAM roles, Kubernetes ServiceAccount annotations, IRSA (or Workload Identity on GCP), and a rotation policy. None of this is hard, but each step is a decision point where something can go wrong.
Restore: This is the part nobody practices. How long does a restore take? Have you timed it? Does your runbook cover restoring a single table versus a full database? Does the team on call know the exact commands?
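The capture and store steps above can be sketched as a single CronJob. Everything here is a placeholder — the bucket, the Secret, the ServiceAccount — and the image is assumed to ship both pg_dump and the AWS CLI:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pg-backup                  # hypothetical name
spec:
  schedule: "0 2 * * *"            # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: backup-sa     # assumed to carry the IRSA annotation
          restartPolicy: Never
          containers:
          - name: backup
            image: example.com/pg-backup:16 # assumed: contains pg_dump and the AWS CLI
            command: ["/bin/sh", "-c"]
            # pg_dump produces a transactionally consistent dump; note that
            # without pipefail-style handling, a pg_dump failure mid-pipe can
            # still upload a truncated file and exit zero. Guard against it.
            args:
            - |
              pg_dump "$DATABASE_URL" | gzip | \
                aws s3 cp - s3://example-backups/backup-$(date +%F).sql.gz
            envFrom:
            - secretRef:
                name: db-credentials        # hypothetical Secret with DATABASE_URL
```

A green CronJob only proves the Job exited zero. It proves nothing about whether the object in S3 can actually be restored, which is why the restore step below is the one that matters.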
I have seen teams with backup CronJobs running on schedule, showing green in monitoring, that have never successfully tested a restore. The backup was running. The data on S3 was corrupt. They found out during an incident.
With RDS, automated backups are on by default. Point-in-time recovery to any second in the retention window is a console click. It is boring. But when your on-call engineer is trying to recover data at 2 AM, boring is exactly what you want.
3. Failover Lives in Documentation, Not Practice
Kubernetes provides StatefulSets for ordered, stable Pod management. Database operators like CloudNativePG, Percona Operator, or Zalando Postgres Operator layer on top of StatefulSets to handle primary/replica management, automatic failover, and connection pooling.
These operators are genuinely good engineering. CloudNativePG in particular has become a serious option for teams running PostgreSQL at scale. But "available as a Kubernetes operator" is not the same as "your team can operate it."
A failover drill means:
- Deliberately killing your primary Pod
- Verifying the replica is elected within your target RTO
- Confirming your application reconnects (or your PgBouncer/connection pooler handles it)
- Checking that the old primary does not re-join as a rogue primary (split-brain)
- Verifying your monitoring fired the right alerts
Most teams have never run this drill. They know failover exists because they read the operator docs. But the real RTO in an unplanned failure is longer than the documented SLA, because human reaction time is part of the equation.
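The "application reconnects" item in the drill is what a pooler buys you. A minimal sketch, assuming CloudNativePG's Pooler resource (names are placeholders):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: app-db-pooler        # hypothetical name
spec:
  cluster:
    name: app-db             # must match your Cluster resource
  instances: 2
  type: rw                   # tracks the current primary across failovers
  pgbouncer:
    poolMode: transaction    # clients stay connected to PgBouncer, not to the primary
```

The design point: application connections terminate at PgBouncer, so a primary change becomes a backend swap rather than a flood of client-side connection errors. The drill should verify that this actually happens within your RTO, not assume it.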
Node Failures Create Unexpected Pressure
Here is a scenario I have seen multiple times:
A node is cordoned for maintenance. The scheduler starts moving Pods to other nodes. Your database Pod gets evicted. The PVC's ReadWriteOnce access mode means only one node can mount it at a time. The new Pod is scheduled on a different node, but the volume is still attached to the old node. The volume detachment takes 30–90 seconds depending on the cloud provider's disk detach API.
During that window, your database is down. Not degraded — down.
With a StatefulSet and a good operator, this gets handled. But the handling requires that your cluster's CSI driver is functioning correctly, your operator is watching for the failure, and your application's connection pool is set to retry with backoff rather than hard-failing.
Each of these is a dependency. Dependencies fail at inconvenient times.
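One guardrail worth having is a PodDisruptionBudget, so a routine drain cannot evict the database without a deliberate switchover first. This is a sketch of the underlying mechanism (mature operators such as CloudNativePG manage their own PDBs; the label is hypothetical):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: database-pdb
spec:
  maxUnavailable: 0      # voluntary evictions (drains) are refused outright
  selector:
    matchLabels:
      app: postgres      # hypothetical label on the database Pods
```

With this in place, kubectl drain blocks on the database Pod instead of evicting it, forcing a human (or the operator) to switch over first. Note the limit: PDBs only govern voluntary disruptions. A node crash ignores them entirely.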
When Databases in Kubernetes Actually Work
I want to be clear: this is not a blanket "never run databases in Kubernetes" argument. Teams do it successfully. But success is not accidental — it requires specific capabilities.
You need a mature operator. Don't run a database in Kubernetes without a purpose-built operator. StatefulSet alone is insufficient. Use CloudNativePG for PostgreSQL, Percona Operator for MySQL/MongoDB, or Vitess for horizontally scaled MySQL. These operators encode operational knowledge that would take your team months to build from scratch.
You need a proper StorageClass. Use io2 (or your cloud provider's equivalent high-performance block storage) with explicitly provisioned IOPS. Define a StorageClass specifically for your database tier. Do not share it with general-purpose workloads.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: database-io2
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "3000"
  encrypted: "true"
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Note reclaimPolicy: Retain — if a PVC is accidentally deleted, the data survives. The default Delete will destroy your volume.
You need tested backup and restore. Not a CronJob that shows green. Tested means you have run a full restore into a staging environment in the last 30 days and documented the time it took.
You need a failover drill schedule. Once a quarter minimum. Kill the primary. Measure real RTO. Find the gaps. Fix them before an incident does.
You need clear ownership. Kubernetes database operations sit at the intersection of platform engineering and database administration. If your team says "the platform team handles databases," someone needs to actually be the DBA. Distributed ownership is how things fall through the cracks.
You need dedicated nodes. Run your database Pods on dedicated node groups using nodeSelector or nodeAffinity, ideally with Taints so only database Pods are scheduled there. This prevents noisy-neighbor I/O problems and makes capacity planning predictable.
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload-type
          operator: In
          values:
          - database
tolerations:
- key: "workload-type"
  operator: "Equal"
  value: "database"
  effect: "NoSchedule"
```

The Managed Services Case (Boring Is a Compliment)
RDS, Cloud SQL, Azure Database for PostgreSQL — these are the boring options. They are also the options that let your engineers sleep through the night.
Here is what a managed database service buys you:
| Capability | Kubernetes + Operator | Managed Service |
|---|---|---|
| Automated backups | You configure it | On by default |
| Point-in-time restore | Operator-dependent | Seconds via console |
| Multi-AZ failover | You configure it | Single checkbox |
| Metrics & monitoring | CloudWatch/Prometheus setup | Built-in, with native alerts |
| Storage auto-scaling | Manual PVC resize | Automatic |
| Security patching | Your responsibility | Managed by provider |
| Upgrade path | Operator-guided | In-place with minimal downtime |
The criticisms of managed services are real: they cost more, they lock you into a cloud provider, and they limit certain advanced configurations. But for most teams, the operational burden difference is significant.
If your team has fewer than 5 engineers managing infrastructure, managed services are almost certainly the right call. The engineering hours you would spend building the operational discipline for databases in Kubernetes have a very high opportunity cost.
Kubernetes Doesn't Remove Complexity — It Centralizes It
This is the part of the conversation that often gets missed.
When teams move databases into Kubernetes, they aren't eliminating the operational complexity of running a database. They are trading the complexity of managing two systems (Kubernetes + a managed service) for the complexity of running all of it in Kubernetes.
That trade can work in your favor if:
- Your team has deep Kubernetes expertise
- The operational model genuinely benefits from unification
- You have the tooling and processes to support stateful workloads
It works against you if:
- You're doing it to simplify the bill
- You're doing it because it feels "more cloud-native"
- You don't have the capacity to build and maintain the operational practices
Kubernetes is exceptional at stateless workloads: web servers, API services, workers, proxies. These fit the Pod model perfectly. A database is architecturally different, and the system makes no guarantees about data safety beyond what your StorageClass, operator, and backup strategy collectively provide.
The question is not "Can we run it in K8s?" — you can. The question is: are we solving a real problem, or are we chasing architectural elegance?
A Decision Framework
Use this to make the call for your team:
Run your database in Kubernetes if:
- You have a dedicated platform engineering team with stateful workload experience
- Your environment requires on-premise or air-gapped deployment
- You need advanced configurations that managed services don't support
- You have adopted a battle-tested operator (CloudNativePG, Percona, Vitess)
- You have a documented, tested backup and failover runbook
- You are willing to invest in dedicated node groups and storage tuning
Use a managed service if:
- Your team is small and needs to move fast
- You are running in a single cloud provider environment
- Data durability and uptime are primary concerns
- Your database setup is standard (PostgreSQL, MySQL, standard configs)
- You value operational simplicity over infrastructure unification
- You are still building your Kubernetes expertise
For non-production environments specifically:
- Use managed services with start/stop automation — but be aware of the provider's limits on stopped instances (RDS, for example, automatically restarts a stopped instance after seven days)
- Or use lightweight in-cluster databases with clearly marked "this data is disposable" policies
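The disposable-data policy for in-cluster databases can be made explicit in the manifest itself. A sketch (the image tag and label convention are assumptions, not an established standard):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dev-postgres
  labels:
    data-policy: disposable      # hypothetical convention: nobody should expect this data to survive
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dev-postgres
  template:
    metadata:
      labels:
        app: dev-postgres
        data-policy: disposable
    spec:
      containers:
      - name: postgres
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          value: dev-only        # acceptable only because this is non-production
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        emptyDir: {}             # data is lost with the Pod: disposable by construction
```

Using emptyDir rather than a PVC is the point, not a shortcut: the manifest cannot silently accumulate data anyone will later want back.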
What Experienced Teams Actually Do
In practice, the teams I have seen operate most effectively tend to:
- Stateless in Kubernetes, stateful in managed services — This is the dominant pattern at scale. EKS/GKE/AKS for the application tier, RDS or Aurora for the database tier.
- Kubernetes for non-critical databases — Dev and staging environments use in-cluster databases. Production uses managed. The operational burden difference is acceptable because losing dev data is acceptable.
- Kubernetes for distributed/specialized databases — Teams using Cassandra, Kafka (which is effectively a stateful log store), or Redis as a cache (where data loss is tolerable) find the Kubernetes operational model more workable because the durability requirements are different.
- Full Kubernetes with operator investment — Larger platform engineering teams with dedicated SREs who have explicitly chosen to own this complexity. It works, but it requires ongoing investment.
Conclusion
Running databases in Kubernetes is an engineering choice, not an architecture upgrade. Done with the right tooling, processes, and team capability, it works. Done as a cost shortcut or a philosophical preference for "cloud-native," it creates operational debt that surfaces at the worst possible time.
The checklist is simple to state and hard to execute:
- Mature operator
- Properly configured storage
- Automated and tested backups
- Regular failover drills
- Clear ownership
- Dedicated node capacity
If you can honestly check all of those boxes, running databases in Kubernetes is a legitimate choice. If you cannot, managed services are not a compromise — they are the right engineering answer.
In infrastructure, boring is rarely a consolation prize. It is often the goal.
Running databases in Kubernetes and want a second opinion on your setup? Talk to us at Coding Protocols. We help platform engineering teams make architectural decisions they won't regret at 2 AM.


