Velero: Kubernetes Backup and Disaster Recovery
Velero backs up Kubernetes resources and persistent volume data to object storage. A backup captures every Deployment, ConfigMap, Secret, PVC, and custom resource in the selected namespaces — plus the actual volume data via Kopia file-level backup or CSI volume snapshots. This guide covers Velero installation on EKS with S3 and IRSA, backup schedules, restoring to a different namespace or cluster, database-consistent backups using pre/post hooks, and the cross-region DR pattern for multi-cluster recovery.

A Kubernetes cluster is not inherently durable. etcd contains all cluster state, but etcd backups don't help you if the problem is at the application layer: a misconfigured Helm upgrade that deleted all PVCs, a namespace accidentally deleted, a developer running kubectl delete deploy --all -n production. Velero solves a different problem than etcd backup — it gives you application-layer recovery.
Velero backs up Kubernetes object manifests (everything kubectl get returns) plus the data in PersistentVolumes. Restore means recreating those objects and restoring volume data into new PVCs, in the same cluster or a different one. This is what enables cross-region DR: back up in us-east-1, restore into a pre-provisioned cluster in us-west-2.
Architecture
Velero runs as a Deployment in the cluster. When a backup runs:
- Velero calls the Kubernetes API to list all resources in the target namespace(s)
- It serializes each resource as JSON and uploads to object storage (S3)
- For PersistentVolumes, it either:
  - Takes a CSI volume snapshot (fast, crash-consistent, storage-native)
  - Or uses the Kopia integration to stream file-level data to object storage (slower, but works across storage backends and clusters)
Restore inverts this: Velero downloads the JSON objects, recreates them through the Kubernetes API (objects that already exist are skipped, not overwritten), then either restores from the volume snapshot or replays the Kopia backup stream into a new PVC.
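Concretely, a completed backup lands in the bucket roughly like this (a sketch of Velero's conventional layout — exact filenames and prefixes depend on your configuration):

```text
velero-backups-production/
├── backups/
│   └── payments-backup-manual/
│       ├── velero-backup.json              # Backup metadata
│       ├── payments-backup-manual.tar.gz   # Serialized resource manifests
│       └── payments-backup-manual-logs.gz  # Backup log
└── kopia/                                  # File-level volume data (when Kopia is used)
```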
Installation on EKS
IAM Setup
Velero needs S3 access for backup storage and EC2 permissions for volume snapshots:
```shell
# Create the S3 backup bucket
aws s3 mb s3://velero-backups-production --region us-east-1

# Block public access
aws s3api put-public-access-block \
  --bucket velero-backups-production \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
```

IAM policy for the Velero controller:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject", "s3:DeleteObject", "s3:PutObject",
        "s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::velero-backups-production/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::velero-backups-production"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes", "ec2:DescribeSnapshots",
        "ec2:CreateSnapshot", "ec2:DeleteSnapshot", "ec2:CopySnapshot",
        "ec2:CreateTags", "ec2:DescribeTags",
        "ec2:DescribeAvailabilityZones"
      ],
      "Resource": "*"
    }
  ]
}
```

Attach this policy to an IAM role and create a Pod Identity (or IRSA) association for Velero's ServiceAccount:
```shell
aws eks create-pod-identity-association \
  --cluster-name production \
  --namespace velero \
  --service-account velero-server \
  --role-arn arn:aws:iam::${AWS_ACCOUNT_ID}:role/VeleroController
```

Helm Installation
```shell
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

# Check https://github.com/vmware-tanzu/velero/releases for latest version
# Verify chart version with: helm search repo vmware-tanzu/velero
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values velero-values.yaml
```

```yaml
# velero-values.yaml
configuration:
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero-backups-production
      config:
        region: us-east-1

  # volumeSnapshotLocation is only used for legacy (non-CSI) volume snapshots.
  # With EnableCSI + VolumeSnapshotClass, Velero uses the CSI API and ignores this — omit for CSI-only setups.

  # EnableCSI is required for Velero < 1.14; in 1.14+ CSI support is GA and on by default
  features: EnableCSI

  # Use Kopia for file-level volume backup (alternative to CSI snapshots)
  defaultVolumesToFsBackup: false  # Set true to enable Kopia backup for all PVCs by default

# AWS provider plugin — version must match your Velero server minor version
# e.g., Velero 1.14.x → velero-plugin-for-aws:v1.10.x
# Check: https://github.com/vmware-tanzu/velero-plugin-for-aws/releases
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.0
    volumeMounts:
      - mountPath: /target
        name: plugins
```

Verify:

```shell
velero backup-location get
# NAME      PROVIDER   BUCKET/PREFIX               PHASE
# default   aws        velero-backups-production   Available
```

Creating Backups
On-Demand Backup
```shell
# Back up a single namespace
velero backup create payments-backup-manual \
  --include-namespaces payments \
  --wait

# Back up the entire cluster (all namespaces)
velero backup create cluster-backup-$(date +%Y%m%d) \
  --exclude-namespaces kube-system,kube-public,kube-node-lease \
  --wait

# Check backup status
velero backup describe payments-backup-manual --details
velero backup logs payments-backup-manual
```

Scheduled Backups
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-payments-backup
  namespace: velero
spec:
  # Standard cron format: daily at 2 AM UTC
  schedule: "0 2 * * *"
  template:
    includedNamespaces:
      - payments
      - orders
    # Snapshot PVCs via CSI (if EBS CSI driver with VolumeSnapshotClass is configured)
    snapshotVolumes: true
    # TTL: how long to keep the backup
    ttl: 720h  # 30 days
    # Include cluster-scoped resources owned by objects in the namespaces
    includeClusterResources: true
    storageLocation: default
    volumeSnapshotLocations:
      - default
    # Hooks: run commands before/after snapshot (see database consistency section)
    hooks: {}
```

```shell
# Apply the schedule
kubectl apply -f schedule.yaml
# View scheduled backups
velero schedule get
velero backup get  # Lists all backups including scheduled ones
```

Volume Data: Kopia vs CSI Snapshots
CSI Volume Snapshots (recommended for EBS)
CSI snapshots are storage-native: EBS creates an incremental snapshot directly on AWS without data leaving the storage layer. This is fast and storage-efficient. Requires:
- The AWS EBS CSI driver installed with snapshot support
- The external-snapshotter controller
- A VolumeSnapshotClass that maps to the EBS CSI driver
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-aws-vsc
  labels:
    velero.io/csi-volumesnapshot-class: "true"  # Velero discovers this automatically
driver: ebs.csi.aws.com
deletionPolicy: Delete  # Delete EBS snapshots when Velero's backup TTL expires
```

Kopia File-Level Backup
Kopia is Velero's built-in file-level backup engine. It mounts the PVC into a sidecar, reads the filesystem, and streams a deduplicated backup to S3. Use Kopia when:
- The storage driver doesn't support CSI snapshots
- You need to restore to a different cloud/storage type
- You need cross-cluster or cross-region restore where the target cluster can't access the source EBS snapshots
Enable Kopia backup per PVC via annotation (requires Velero ≥ 1.10, which replaced Restic with Kopia as the file-system backup engine):
```yaml
# On the Pod or Deployment — tells Velero to use Kopia for this pod's volumes
metadata:
  annotations:
    backup.velero.io/backup-volumes: data,config  # Volume names from the pod spec
```

Or enable cluster-wide in velero-values.yaml (defaultVolumesToFsBackup: true).
Kopia is slower and uses more S3 storage than CSI snapshots for large volumes, but it's portable and doesn't depend on storage provider snapshot APIs.
Database-Consistent Backups with Hooks
Filesystem snapshots of running databases are not guaranteed to be consistent — the database may have dirty pages in memory. Use Velero hooks to flush before the snapshot:
```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: payments-db-consistent
  namespace: velero
spec:
  includedNamespaces:
    - payments
  hooks:
    resources:
      - name: postgres-flush
        includedNamespaces:
          - payments
        labelSelector:
          matchLabels:
            app: postgres
        pre:
          - exec:
              container: postgres
              command:
                - /bin/sh
                - -c
                - "psql -U postgres -c 'CHECKPOINT;'"  # Flush WAL to disk
              onError: Fail  # Abort backup if the hook fails
              timeout: 30s
        post:
          - exec:
              container: postgres
              command:
                - /bin/sh
                - -c
                - "echo 'Backup complete'"
              timeout: 10s
```

The pre hook runs inside the postgres container before Velero takes the volume snapshot. CHECKPOINT forces PostgreSQL to flush all dirty pages from memory to disk, making the subsequent filesystem snapshot crash-consistent. MySQL is trickier: FLUSH TABLES WITH READ LOCK is session-scoped, so the lock is released the moment the hook's shell exits. The pre hook command must keep the session (and lock) open for the duration of the snapshot window, or you can freeze the filesystem instead.
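A filesystem freeze avoids the session problem entirely: fsfreeze blocks all writes to the mount until the post hook thaws it. A sketch of such a hook pair, assuming a mysql container with its data volume at /var/lib/mysql and enough privilege to run fsfreeze (both are assumptions to adjust for your pod):

```yaml
# Hook fragment for a Backup/Schedule spec — container name and mount path are illustrative
pre:
  - exec:
      container: mysql
      command: ["/sbin/fsfreeze", "--freeze", "/var/lib/mysql"]  # Block writes before the snapshot
      onError: Fail
      timeout: 30s
post:
  - exec:
      container: mysql
      command: ["/sbin/fsfreeze", "--unfreeze", "/var/lib/mysql"]  # Thaw once the snapshot is taken
      timeout: 30s
```

A frozen filesystem stalls every write the database attempts, so keep the freeze window as short as possible.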
Restoring
```shell
# List available backups
velero backup get

# Restore a backup to the same namespace
velero restore create --from-backup payments-backup-manual

# Restore to a different namespace (for testing/DR drill)
velero restore create payments-restore-test \
  --from-backup payments-backup-manual \
  --namespace-mappings payments:payments-restored

# Restore only specific resource types (do not combine with --exclude-resources — use one filter type)
velero restore create \
  --from-backup payments-backup-manual \
  --include-resources deployments,services,configmaps

# Monitor restore progress
velero restore describe payments-restore-test --details
```

Cross-Cluster Restore
To restore into a different cluster:
- Configure Velero in the target cluster pointing at the same S3 bucket (read-only is sufficient for restore)
- Run velero backup get — it will list backups visible in the configured bucket
- Run velero restore create --from-backup <name>
The target cluster needs compatible StorageClasses: either the same names, or a storage-class mapping configured through Velero's change-storage-class restore item action. EBS snapshots are regional — for cross-region DR, configure Velero to replicate snapshots or use Kopia (which stores data in S3 directly, accessible from any region).
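When StorageClass names differ between clusters, Velero's documented change-storage-class plugin configuration remaps them at restore time through a labeled ConfigMap in the velero namespace. A sketch (the gp2/gp3 mapping is illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: change-storage-class-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""                       # Marks this ConfigMap as plugin configuration
    velero.io/change-storage-class: RestoreItemAction # Targets the storage-class remap action
data:
  gp2: gp3  # PVCs backed up with StorageClass gp2 are restored with gp3
```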
Cross-Region Disaster Recovery Pattern
```shell
# In us-east-1: configure S3 replication to us-west-2
aws s3api put-bucket-replication \
  --bucket velero-backups-production \
  --replication-configuration file://replication.json
# replication.json: replicate all objects to velero-backups-dr (us-west-2)
```

In the DR cluster (us-west-2):
```yaml
# velero-values.yaml for DR cluster
configuration:
  backupStorageLocation:
    - name: primary-backups
      provider: aws
      bucket: velero-backups-dr  # Replicated from us-east-1
      config:
        region: us-west-2
      accessMode: ReadOnly  # DR cluster only reads; primary writes
```

With S3 replication, backups from us-east-1 appear in the DR cluster within minutes. Restore from the replicated bucket to the DR cluster restores the application without needing access to the source EBS snapshots (Kopia data lives in S3, which replicated cleanly).
Frequently Asked Questions
Does Velero back up etcd?
No. Velero backs up Kubernetes resources by calling the API server — not etcd directly. This means it backs up the canonical desired state of your applications (Deployments, Services, ConfigMaps, Secrets, CRDs) but not Kubernetes internals (lease objects, endpointslices, events). For full cluster recovery including control plane state, combine Velero (application recovery) with etcd snapshots (control plane recovery). On EKS, AWS manages etcd, so Velero alone is sufficient for application-layer DR.
How does Velero handle PVCs that are mounted by running pods during backup?
For CSI snapshots, Velero takes the snapshot while the volume is mounted. Drivers that support snapshots, such as the EBS CSI driver, create crash-consistent snapshots — safe for stateless workloads but potentially inconsistent for databases (the EFS CSI driver does not support volume snapshots, so use Kopia for EFS-backed PVCs). Use pre-hooks to quiesce the database before the snapshot (see the database consistency section above).
For Kopia backups, Velero creates a sidecar that mounts the PVC (read-only if possible) and streams data to S3. If the PVC is RWO and already mounted by the pod, the sidecar shares the volume. Data consistency depends on the filesystem — open files and in-flight writes may not be captured correctly without pre-hooks.
Can I restore just one Deployment, not the whole namespace?
Yes. Use --include-resources:
```shell
velero restore create \
  --from-backup my-backup \
  --include-namespaces payments \
  --include-resources deployments \
  --selector app=payments-api  # Label selector to filter within the resource type
```

This restores only Deployment objects in the payments namespace that match app=payments-api. Note that referenced ConfigMaps, Secrets, and Services are not automatically included — you need to add them to --include-resources explicitly or do a full namespace restore.
Additional Backup Patterns
Dual-Track Database Backup: pg_dump Alongside Velero
PVC snapshots are crash-consistent but not always application-consistent for databases. A belt-and-suspenders approach runs pg_dump separately so you have a logical dump that can be restored to any cluster, independent of EBS snapshot availability:
```yaml
# CronJob that runs pg_dump to S3, separate from Velero's PVC snapshot
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-dump
  namespace: production
spec:
  schedule: "0 1 * * *"  # 1 AM — before Velero backup at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: pg-dump
              # Image is hypothetical: it must ship both pg_dump and the AWS CLI
              # (stock postgres:16 lacks the CLI), and credentials such as
              # PGPASSWORD should come from a Secret
              image: postgres-with-awscli:16
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -o pipefail
                  pg_dump -h postgres -U postgres payments_db | gzip | \
                  aws s3 cp - s3://my-velero-backups/pg-dumps/payments_$(date +%Y%m%d).sql.gz \
                    --sse aws:kms
          restartPolicy: OnFailure
          serviceAccountName: pg-backup-sa  # Has S3 put access
```

This gives you two independent recovery paths: Velero restores the full cluster state (PVCs + manifests) for disaster recovery, while the pg_dump provides a portable logical backup for point-in-time data recovery or schema migrations.
Excluding Events from Scheduled Backups
Kubernetes events are transient, high-volume, and useless in a restore scenario. Excluding them significantly reduces backup size and S3 storage costs:
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM UTC daily
  template:
    includedNamespaces:
      - production
      - monitoring
      - cert-manager
    excludedResources:
      - events  # Don't back up events (transient, large volume)
      - events.events.k8s.io
    snapshotVolumes: true
    storageLocation: default
    ttl: 720h0m0s  # Retain backups for 30 days
```

Restoring to a Different Namespace
The --namespace-mappings flag restores a namespace under a new name — useful for DR drills (restore production → production-restored to validate without affecting live traffic) or cross-environment migrations:
```shell
# Restore production backup into production-restore for validation
velero restore create --from-backup payments-backup-manual \
  --include-namespaces production \
  --namespace-mappings production:production-restore \
  --restore-volumes true

# Verify key resources were restored
kubectl get deployments -n production-restore
kubectl get services -n production-restore
kubectl get configmaps -n production-restore

# Clean up after validation
kubectl delete namespace production-restore
```

For a companion guide covering scheduling patterns, Kopia vs restic configuration, and cross-region DR runbooks, see Kubernetes Disaster Recovery: Backup and Restore with Velero.
For storage provisioning with EBS and EFS that underpins the PVCs Velero backs up, see Kubernetes Storage: EBS and EFS CSI Drivers on EKS. For Argo CD GitOps workflows where Velero schedules are managed as Git-tracked CRDs alongside application manifests, see Argo CD: GitOps Continuous Delivery for Kubernetes.
Setting up Velero backup schedules for a production EKS cluster or designing a cross-region DR runbook? Talk to us at Coding Protocols — we help platform teams implement backup strategies that hold up under the pressure of an actual incident.


