Kubernetes Disaster Recovery: Backup and Restore with Velero
etcd snapshots protect control plane state. They don't protect your PVCs, your namespace configuration, or your deployed workloads. Velero does. Here's how to set up Velero for real disaster recovery — scheduled backups, cross-cluster restore, and the failure modes to test before you need them.

Kubernetes doesn't back itself up. etcd snapshots preserve the cluster state — the Deployments, Services, ConfigMaps, and Secrets that the API server knows about — but they're tied to the specific etcd version and cluster topology. You can't use an etcd snapshot to migrate to a new cluster or restore into a managed Kubernetes service where you have no etcd access at all (EKS, GKE, AKS).
Velero fills this gap. It backs up Kubernetes API resources and optionally snapshots persistent volumes, stores everything in object storage (S3, GCS, Azure Blob), and can restore into any compatible Kubernetes cluster. This post covers the full Velero setup: installation, scheduled backups, PVC snapshot configuration, restore procedures, and the tests you should run before you need them.
What Velero Backs Up
Velero creates backups by calling the Kubernetes API — not by reading etcd directly. A backup is a collection of the JSON representations of Kubernetes objects (grouped by namespace, plus any included cluster-scoped resources), stored as a tarball in object storage.
For persistent volumes, Velero has two approaches:
Volume Snapshots (CSI): Velero calls the CSI VolumeSnapshot API, which triggers the cloud provider's snapshot mechanism (EBS snapshot, GCE PD snapshot, Azure Disk snapshot). Fast, point-in-time consistent, uses the native snapshot infrastructure of your storage provider. Only works for CSI drivers that support snapshots.
File-level backup (Restic/Kopia): Velero's node agent backs up the actual file contents of the volume by reading the files directly from the node where the pod runs. Slower and more resource-intensive than snapshots, but works on any volume type including NFS and EFS. Kopia was introduced in Velero 1.10 as an alternative uploader to Restic, and became the default in Velero 1.12.
For most production clusters on major cloud providers (EKS with EBS/EFS, GKE, AKS), CSI snapshots are the right approach. Use file-level backup for non-snapshotable storage or as a fallback.
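When mixing the two approaches, individual volumes can be opted into file-level backup via a pod annotation while CSI snapshots remain the default. A sketch, assuming `defaultVolumesToFsBackup: false` — the Deployment, image, and volume names here are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-nfs-app # illustrative name
spec:
  selector:
    matchLabels:
      app: legacy-nfs-app
  template:
    metadata:
      labels:
        app: legacy-nfs-app
      annotations:
        # Comma-separated list of volume names (from the pod spec below)
        # to back up with Kopia instead of CSI snapshots
        backup.velero.io/backup-volumes: "shared-data"
    spec:
      containers:
        - name: app
          image: example/app:latest # illustrative image
          volumeMounts:
            - name: shared-data
              mountPath: /data
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: shared-data-pvc
```

The annotation goes on the pod template, not on the Deployment object itself, because Velero reads it from the running pods at backup time.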
Installation
Prerequisites
You need:
- An S3-compatible bucket (or GCS/Azure Blob) for backup storage
- IAM permissions for Velero to read/write to that bucket
- CSI snapshot support in your cluster (the snapshot.storage.k8s.io CRDs and the CSI snapshotter controller)
IAM Setup (AWS)
Create an S3 bucket and IAM policy:
```shell
BUCKET=my-velero-backups
REGION=us-east-1

# For regions other than us-east-1, add:
#   --create-bucket-configuration LocationConstraint=$REGION
# (the API rejects an explicit LocationConstraint for us-east-1)
aws s3api create-bucket \
  --bucket $BUCKET \
  --region $REGION

# Enable versioning and encryption
aws s3api put-bucket-versioning \
  --bucket $BUCKET \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption \
  --bucket $BUCKET \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}
    }]
  }'
```

IAM policy for the Velero service account:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:DescribeTags",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-velero-backups"
    }
  ]
}
```

Attach this policy to the IAM role used by the Velero Kubernetes service account via IRSA (EKS Pod Identity is also supported).
Install via Helm
```shell
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

helm upgrade --install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set configuration.backupStorageLocation[0].name=default \
  --set configuration.backupStorageLocation[0].provider=aws \
  --set configuration.backupStorageLocation[0].bucket=my-velero-backups \
  --set configuration.backupStorageLocation[0].config.region=us-east-1 \
  --set configuration.volumeSnapshotLocation[0].name=default \
  --set configuration.volumeSnapshotLocation[0].provider=aws \
  --set configuration.volumeSnapshotLocation[0].config.region=us-east-1 \
  --set serviceAccount.server.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::123456789:role/velero-role \
  --set initContainers[0].name=velero-plugin-for-aws \
  --set initContainers[0].image=velero/velero-plugin-for-aws:v1.10.0 \
  --set initContainers[0].volumeMounts[0].mountPath=/target \
  --set initContainers[0].volumeMounts[0].name=plugins
```

Check the backup storage location is available:
```shell
kubectl get backupstoragelocation -n velero
# NAME      PHASE       LAST VALIDATED   AGE   DEFAULT
# default   Available   12s              30s   true
```

If the phase is not Available, Velero cannot reach the S3 bucket — check IAM permissions and bucket configuration.
CSI Volume Snapshots
For CSI snapshots to work, the cluster needs the snapshot controller and the VolumeSnapshot CRDs installed:
```shell
# Install snapshot CRDs
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml

# Install snapshot controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
```

On EKS, confirm the AWS EBS CSI driver is installed — CSI snapshot support requires the EBS CSI driver add-on (not the in-tree EBS provisioner):
```shell
kubectl get csidriver ebs.csi.aws.com
```

Create a VolumeSnapshotClass for EBS:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-vsc
  labels:
    velero.io/csi-volumesnapshot-class: "true" # Velero uses this label to discover the class
driver: ebs.csi.aws.com
deletionPolicy: Delete # Or Retain if you want snapshots to outlive VolumeSnapshot objects
```

The velero.io/csi-volumesnapshot-class: "true" label tells Velero which VolumeSnapshotClass to use when snapshotting volumes backed by the ebs.csi.aws.com driver.
Enable CSI snapshot support in Velero's Helm values:
```yaml
# values.yaml
features: EnableCSI # No-op on Velero 1.14+ (CSI support is built into core); required on earlier versions
configuration:
  defaultVolumeSnapshotLocations: "aws:default"
```

Taking a Backup
Manual Backup
```shell
# Back up a specific namespace
velero backup create my-app-backup \
  --include-namespaces production \
  --wait

# Check backup status
velero backup describe my-app-backup

# Check backup logs if something went wrong
velero backup logs my-app-backup
```

The backup creates:
- A tarball of all Kubernetes API objects in the namespace (stored in S3)
- CSI VolumeSnapshots for any PVCs with CSI-backed StorageClasses
- Metadata linking API objects to their volume snapshots
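To verify what actually went into a backup, ask Velero for the per-resource detail, or pull the backup contents down and inspect them directly. A sketch — the backup name is the one created above, and the downloaded filename follows the `<name>-data.tar.gz` pattern recent Velero versions use:

```shell
# Per-resource and per-volume detail for a backup
velero backup describe my-app-backup --details

# Download the backup contents locally (writes my-app-backup-data.tar.gz
# into the current directory) and list the stored JSON objects
velero backup download my-app-backup
tar -tzf my-app-backup-data.tar.gz | head
```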
Scheduled Backups
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2am UTC daily
  template:
    includedNamespaces:
      - production
      - staging
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h # 30 days retention
    snapshotVolumes: true
    defaultVolumesToFsBackup: false # Use CSI snapshots, not file-level backup
    labelSelector:
      matchLabels:
        backup: enabled # Optional: only back up labelled resources
```

```shell
kubectl apply -f schedule.yaml

# Check schedules
velero schedule get

# NAME                      STATUS    CREATED                         SCHEDULE    BACKUP TTL   LAST BACKUP   SELECTOR
# daily-production-backup   Enabled   2026-05-09 10:00:00 +0000 UTC   0 2 * * *   720h0m0s     1m ago        <none>
```

Backup Hooks
Pre/post backup hooks let you quiesce a database before snapshot:
```shell
# Annotation on the pod (not the Deployment)
kubectl annotate pod -n production -l app=postgres \
  pre.hook.backup.velero.io/command='["/bin/bash", "-c", "psql -U postgres -c CHECKPOINT"]' \
  pre.hook.backup.velero.io/timeout=60s \
  post.hook.backup.velero.io/command='["/bin/bash", "-c", "echo backup complete"]'
```

For stateful applications, hooks ensure you capture a consistent snapshot — a raw volume snapshot of a database mid-write may be corrupt.
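Hooks can also be declared on the Backup or Schedule spec rather than as pod annotations, which keeps them versioned alongside the rest of your Velero configuration. A sketch using Velero's spec.hooks.resources API; the namespace, label selector, and command are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: postgres-backup
  namespace: velero
spec:
  includedNamespaces:
    - production
  hooks:
    resources:
      - name: quiesce-postgres
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            app: postgres
        pre:
          - exec:
              container: postgres
              command: ["/bin/bash", "-c", "psql -U postgres -c CHECKPOINT"]
              timeout: 60s
              onError: Fail # Fail the backup rather than snapshot an unquiesced volume
```

With onError: Fail, a hook failure fails the backup, which is usually what you want for databases: a missing backup is detectable, a silently inconsistent one is not.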
Restoring from Backup
Full Namespace Restore
```shell
# List available backups
velero backup get

# Restore a namespace
velero restore create my-app-restore \
  --from-backup my-app-backup \
  --include-namespaces production \
  --wait

# Check restore status
velero restore describe my-app-restore
velero restore logs my-app-restore
```

A full restore recreates all API objects exactly as they were at backup time. PVCs are recreated, and Velero restores the volume data from the CSI snapshot.
Partial Restore: Resources Only (No Volumes)
Sometimes you want to restore the Kubernetes configuration (Deployments, Services, ConfigMaps) without restoring PVC data — for example, after recreating a namespace from a GitOps repo but needing to recover specific ConfigMaps:
```shell
velero restore create config-only-restore \
  --from-backup my-app-backup \
  --include-namespaces production \
  --include-resources configmaps,secrets,deployments,services \
  --restore-volumes=false
```

Cross-Cluster Restore
This is where Velero's value is clearest. To restore into a different cluster:
- Deploy Velero in the target cluster, pointing at the same S3 bucket as the source cluster
- The target Velero picks up the backup storage location and sees existing backups
- Restore as normal
```shell
# In the target cluster — same bucket, read-only access is sufficient for restore
helm upgrade --install velero vmware-tanzu/velero \
  --set configuration.backupStorageLocation[0].bucket=my-velero-backups \
  # ... rest of config same as source cluster

# List backups from source cluster
velero backup get

# Restore into new cluster
velero restore create cross-cluster-restore \
  --from-backup my-app-backup \
  --include-namespaces production
```

Important: CSI snapshots are cloud-provider-specific. An EBS snapshot can only be restored into an EKS cluster in the same AWS account and region. For cross-region or cross-account DR, replicate the snapshots out-of-band — for example with AWS Backup, Data Lifecycle Manager, or the EBS snapshot copy API.
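As one example of out-of-band replication, the EBS copy-snapshot API is run from the destination region. The regions and snapshot ID below are illustrative:

```shell
# Run against the destination (DR) region; copies an EBS snapshot
# from the source region into it
aws ec2 copy-snapshot \
  --region us-west-2 \
  --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --description "DR copy of Velero volume snapshot"
```

Note that a copied snapshot gets a new snapshot ID, so restoring from it in the DR region means recreating the PV/PVC rather than relying on Velero's recorded snapshot metadata.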
What Velero Does Not Back Up
Cluster-scoped resources by default: Velero's default behavior backs up namespace-scoped resources. Cluster-scoped resources (StorageClasses, PersistentVolumes, ClusterRoles, ClusterRoleBindings, CRDs) require explicit inclusion:
```shell
velero backup create full-cluster-backup \
  --include-cluster-resources=true \
  --include-namespaces "*"
```

Be cautious restoring cluster-scoped resources (especially CRDs) into a cluster where they already exist — you may overwrite the version installed by your Helm charts.
Container images: Velero backs up the Kubernetes objects that reference images, not the images themselves. If your private container registry becomes unavailable, Velero can't restore pods that can't pull images. Maintain a separate registry backup or replication strategy.
Secrets backed by external systems: If your secrets are injected from Vault or managed by External Secrets Operator, the Kubernetes Secret objects may be empty shells. Velero backs up the empty shells — the actual secret material is in Vault/SSM and needs its own backup.
Running state: Velero is crash-consistent at best. In-flight database transactions, in-memory state, and network connections are not captured. For stateful workloads, backup hooks (pre-quiesce) are essential for consistency.
Backup Strategy
Retention Tiers
A typical tiered retention schedule:
```yaml
# Hourly backups for the last 24 hours
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-backup
  namespace: velero
spec:
  schedule: "0 * * * *"
  template:
    includedNamespaces: ["production"]
    ttl: 24h
    snapshotVolumes: false # API objects only for hourly — no volume snapshots
---
# Daily backups for 30 days
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces: ["production"]
    ttl: 720h
    snapshotVolumes: true
---
# Weekly backups for 90 days
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"
  template:
    includedNamespaces: ["production"]
    ttl: 2160h
    snapshotVolumes: true
```

Volume snapshots are expensive (EBS snapshot storage isn't free). Avoid hourly volume snapshots for high-churn databases — hourly API object backups with daily or weekly volume snapshots is more cost-effective.
Multi-Region Replication
The Velero backup (tarball in S3) is a single point of failure if the S3 region is unavailable. For production DR:
```shell
# Enable S3 cross-region replication on the bucket
aws s3api put-bucket-replication \
  --bucket my-velero-backups \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Destination": {
        "Bucket": "arn:aws:s3:::my-velero-backups-replica",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
```

EBS snapshots are regional. For cross-region recovery, use AWS Backup or the EBS snapshot copy API to replicate snapshots to your DR region.
Testing Your DR Procedure
Untested backups are not backups — they're a hope. Run DR tests quarterly (or more frequently for critical services):
Test 1: API Object Restore
```shell
# Simulate namespace deletion
kubectl delete namespace staging

# Restore from backup
velero restore create staging-dr-test \
  --from-backup daily-staging-backup-<latest> \
  --include-namespaces staging \
  --wait

# Verify pods are running
kubectl get pods -n staging

# Verify services are accessible
kubectl get svc -n staging
```

Expected outcome: All Deployments restored, pods in Running state, Services have their expected configurations.
Test 2: Volume Data Restore
```shell
# Deploy a test workload with PVC, write known data
kubectl apply -f test-stateful-app.yaml
kubectl exec -n test stateful-pod-0 -- sh -c "echo 'test data' > /data/test.txt"

# Take a backup
velero backup create volume-test-backup --include-namespaces test --wait

# Delete the namespace
kubectl delete namespace test

# Restore
velero restore create volume-test-restore \
  --from-backup volume-test-backup \
  --wait

# Verify the data
kubectl exec -n test stateful-pod-0 -- cat /data/test.txt
# Expected: test data
```

Test 3: Cross-Cluster Restore
This is the hardest test and the one most teams skip. Run it at least annually.
- Provision a new cluster (or a dev cluster with the same CSI drivers)
- Install Velero pointing at the production backup bucket
- Restore a production backup
- Verify the application starts and serves traffic
- Document the time-to-recovery
The actual RTO (recovery time objective) you document is the number your SLA commitments should be based on — not an estimate.
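A low-effort way to capture that number during each test is to wrap the restore in time; the backup name below is illustrative:

```shell
# Measure wall-clock restore time during a DR test.
# --wait blocks until the restore reaches a terminal phase,
# so the elapsed time approximates the API-object restore portion of RTO.
time velero restore create rto-test-$(date +%s) \
  --from-backup my-app-backup \
  --wait
```

Remember this measures the Velero restore itself; for EBS-backed volumes, add the lazy-load window before the application's IO returns to normal.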
Monitoring Velero
Velero exposes Prometheus metrics at :8085/metrics. Key metrics to alert on:
```yaml
# Prometheus alert rules
groups:
  - name: velero
    rules:
      - alert: VeleroBackupFailure
        expr: increase(velero_backup_failure_total[1h]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Velero backup failed"
          description: "{{ $labels.schedule }} backup has failed"

      - alert: VeleroBackupMissing
        expr: time() - velero_backup_last_successful_timestamp > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No successful Velero backup in 24 hours"

      - alert: VeleroBackupStorageNotAvailable
        expr: velero_backup_storage_location_phase{phase!="Available"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Velero backup storage location unavailable"
```

The velero_backup_storage_location_phase metric going non-Available is the most critical — it means Velero cannot write new backups or validate existing ones.
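For Prometheus to scrape those metrics in an Operator-managed setup, a ServiceMonitor selects the Velero Service (the Helm chart can also generate one via metrics.serviceMonitor.enabled=true). A hand-written sketch — verify the label and port name against your install, as they depend on chart version:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: velero
  namespace: velero
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: velero # must match the Velero Service's labels
  endpoints:
    - port: http-monitoring # the chart's metrics port name; confirm in your cluster
      interval: 30s
```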
Common Issues
backup storage location not ready — Velero can't reach S3. Check IAM permissions (is the IRSA role attached?), bucket policy, and VPC endpoint if your cluster uses private networking.
CSI snapshots not being created — Check the VolumeSnapshotClass has velero.io/csi-volumesnapshot-class: "true" label. Check the CSI snapshotter controller is running in kube-system. Check the PVC uses a CSI StorageClass (not the in-tree provisioner).
restore partially failed — Some resources may already exist in the target namespace (from GitOps). Use --existing-resource-policy=update to overwrite existing resources, or --existing-resource-policy=none (default) to skip them:
```shell
velero restore create my-restore \
  --from-backup my-backup \
  --existing-resource-policy=update
```

Restore creates PVCs but pods stay Pending — The PVC is bound but the pod can't mount the volume. Usually a node affinity issue — the EBS volume is in a different AZ than the nodes the pod can schedule on. Inspect kubectl describe pod <pod> for mount errors.
Large backups timing out — Increase the Velero server's timeouts: --fs-backup-timeout for file-level (Kopia) backups, and --item-operation-timeout for asynchronous operations such as CSI snapshot data movement. For very large clusters, consider backing up namespaces in separate schedules rather than one all-namespace backup.
Frequently Asked Questions
Is Velero a replacement for etcd snapshots?
No — they serve different purposes. etcd snapshots are fast, low-overhead, and tied to the cluster topology. Velero backups are portable, namespace-granular, and work on managed clusters where you can't access etcd. Use both: etcd snapshots for in-place cluster recovery, Velero for namespace-level restore, cross-cluster migration, and managed Kubernetes DR.
How do I back up cluster-wide configuration (CRDs, ClusterRoles)?
```shell
velero backup create cluster-config-backup \
  --include-cluster-resources=true \
  --exclude-namespaces kube-system,velero \
  --include-namespaces "*"
```

Be careful restoring CRDs into an existing cluster — the versions must be compatible. Restoring CRDs is usually better handled by re-running your GitOps bootstrap (Argo CD app-of-apps or Flux) than by Velero.
What's the recovery time with Velero?
API objects restore quickly — a namespace with 50 Deployments and 20 Services typically restores in under 2 minutes. PVC data restore time depends on snapshot size and the CSI driver's restore speed. EBS snapshots are lazy-loaded — the volume is available immediately but reads from unrestored blocks go to S3, which is slower. For databases, expect IO to be degraded for the first 30-60 minutes after restore until the snapshot is fully materialised.
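If that lazy-load penalty is unacceptable for your RTO, EBS fast snapshot restore pre-warms a snapshot in specific availability zones for an hourly fee. The snapshot ID and AZ below are illustrative:

```shell
# Pre-warm an EBS snapshot so volumes created from it deliver
# full performance immediately, with no S3 lazy loading
aws ec2 enable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-0123456789abcdef0
```

This is most useful on the snapshots feeding your DR runbook; enabling it on every scheduled snapshot gets expensive quickly.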
Should I use Restic or Kopia?
Kopia became the default uploader in Velero 1.12 and is the recommended file-level backup engine. It's faster and more efficient than Restic, especially for large files. If you're on an older Velero version using Restic, plan to migrate to Kopia. For new installs, Kopia is the default — you don't need to configure it explicitly.
For persistent volume configuration, see Kubernetes Persistent Volumes: A Production Guide. For GitOps-driven cluster bootstrap (which complements Velero's configuration backup), see GitOps with Argo CD: Production Setup Guide.
Setting up DR for a production Kubernetes cluster? Talk to us at Coding Protocols — we help platform teams build backup and recovery procedures that work under real incident conditions.


