Kubernetes Disaster Recovery: Backup and Restore with Velero
etcd snapshots protect control plane state. They don't protect your PVCs, your namespace configuration, or your deployed workloads. Velero does. Here's how to set up Velero for real disaster recovery — scheduled backups, cross-cluster restore, and the failure modes to test before you need them.

Kubernetes doesn't back itself up. etcd snapshots preserve the cluster state — the Deployments, Services, ConfigMaps, and Secrets that the API server knows about — but they're tied to the specific etcd version and cluster topology. You can't use an etcd snapshot to migrate to a new cluster or restore into a managed Kubernetes service where you have no etcd access at all (EKS, GKE, AKS).
Velero fills this gap. It backs up Kubernetes API resources and optionally snapshots persistent volumes, stores everything in object storage (S3, GCS, Azure Blob), and can restore into any compatible Kubernetes cluster. This post covers the full Velero setup: installation, scheduled backups, PVC snapshot configuration, restore procedures, and the tests you should run before you need them.
What Velero Backs Up
Velero creates backups by calling the Kubernetes API — not by reading etcd directly. A backup is a collection of the JSON representations of Kubernetes objects (grouped by namespace, plus any included cluster-scoped resources), stored as a tarball in object storage.
For persistent volumes, Velero has two approaches:
Volume Snapshots (CSI): Velero calls the CSI VolumeSnapshot API, which triggers the cloud provider's snapshot mechanism (EBS snapshot, GCE PD snapshot, Azure Disk snapshot). Fast, point-in-time consistent, uses the native snapshot infrastructure of your storage provider. Only works for CSI drivers that support snapshots.
File-level backup (Restic/Kopia): Velero's node agent backs up the actual file contents of the volume by reading the files directly from the node where the pod runs. Slower and more resource-intensive than snapshots, but works on any volume type including NFS and EFS. Kopia was introduced in Velero 1.10 as an alternative uploader to Restic, and became the default in Velero 1.12.
For most production clusters on major cloud providers (EKS with EBS/EFS, GKE, AKS), CSI snapshots are the right approach. Use file-level backup for non-snapshotable storage or as a fallback.
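When mixing the two approaches, individual volumes can be opted into file-level backup via a pod annotation while CSI snapshots remain the default. A sketch, assuming `defaultVolumesToFsBackup: false` — the Deployment, image, and volume names here are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-nfs-app # illustrative name
spec:
  selector:
    matchLabels:
      app: legacy-nfs-app
  template:
    metadata:
      labels:
        app: legacy-nfs-app
      annotations:
        # Comma-separated list of volume names (from the pod spec below)
        # to back up with Kopia instead of CSI snapshots
        backup.velero.io/backup-volumes: "shared-data"
    spec:
      containers:
        - name: app
          image: example/app:latest # illustrative image
          volumeMounts:
            - name: shared-data
              mountPath: /data
      volumes:
        - name: shared-data
          persistentVolumeClaim:
            claimName: shared-data-pvc
```

The annotation goes on the pod template, not on the Deployment object itself, because Velero reads it from the running pods at backup time.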
Installation
Prerequisites
You need:
- An S3-compatible bucket (or GCS/Azure Blob) for backup storage
- IAM permissions for Velero to read/write to that bucket
- CSI snapshot support in your cluster (the snapshot.storage.k8s.io CRDs and the CSI snapshotter controller)
IAM Setup (AWS)
Create an S3 bucket and IAM policy:
```shell
BUCKET=my-velero-backups
REGION=us-east-1

# For regions other than us-east-1, add:
#   --create-bucket-configuration LocationConstraint=$REGION
# (the API rejects an explicit LocationConstraint for us-east-1)
aws s3api create-bucket \
  --bucket $BUCKET \
  --region $REGION

# Enable versioning and encryption
aws s3api put-bucket-versioning \
  --bucket $BUCKET \
  --versioning-configuration Status=Enabled

aws s3api put-bucket-encryption \
  --bucket $BUCKET \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}
    }]
  }'
```

IAM policy for the Velero service account:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:DescribeTags",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": "arn:aws:s3:::my-velero-backups/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-velero-backups"
    }
  ]
}
```

Attach this policy to the IAM role used by the Velero Kubernetes service account via IRSA (EKS Pod Identity is also supported).
Install via Helm
```shell
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

helm upgrade --install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --set configuration.backupStorageLocation[0].name=default \
  --set configuration.backupStorageLocation[0].provider=aws \
  --set configuration.backupStorageLocation[0].bucket=my-velero-backups \
  --set configuration.backupStorageLocation[0].config.region=us-east-1 \
  --set configuration.volumeSnapshotLocation[0].name=default \
  --set configuration.volumeSnapshotLocation[0].provider=aws \
  --set configuration.volumeSnapshotLocation[0].config.region=us-east-1 \
  --set serviceAccount.server.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::123456789:role/velero-role \
  --set initContainers[0].name=velero-plugin-for-aws \
  --set initContainers[0].image=velero/velero-plugin-for-aws:v1.10.0 \
  --set initContainers[0].volumeMounts[0].mountPath=/target \
  --set initContainers[0].volumeMounts[0].name=plugins
```

Check the backup storage location is available:
```shell
kubectl get backupstoragelocation -n velero
# NAME      PHASE       LAST VALIDATED   AGE   DEFAULT
# default   Available   12s              30s   true
```

If the phase is not Available, Velero cannot reach the S3 bucket — check IAM permissions and bucket configuration.
CSI Volume Snapshots
For CSI snapshots to work, the cluster needs the snapshot controller and the VolumeSnapshot CRDs installed:
```shell
# Install snapshot CRDs
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml

# Install snapshot controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/deploy/kubernetes/snapshot-controller/rbac-snapshot-controller.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/main/deploy/kubernetes/snapshot-controller/setup-snapshot-controller.yaml
```

On EKS, confirm the AWS EBS CSI driver is installed — CSI snapshot support requires the EBS CSI driver add-on (not the in-tree EBS provisioner):
```shell
kubectl get csidriver ebs.csi.aws.com
```

Create a VolumeSnapshotClass for EBS:
```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-vsc
  labels:
    velero.io/csi-volumesnapshot-class: "true" # Velero uses this label to discover the class
driver: ebs.csi.aws.com
deletionPolicy: Delete # Or Retain if you want snapshots to outlive VolumeSnapshot objects
```

The velero.io/csi-volumesnapshot-class: "true" label tells Velero which VolumeSnapshotClass to use when snapshotting volumes backed by the ebs.csi.aws.com driver.
Enable CSI snapshot support in Velero's Helm values:
```yaml
# values.yaml
features: EnableCSI # No-op on Velero 1.14+ (CSI support is built into core); required on earlier versions
configuration:
  defaultVolumeSnapshotLocations: "aws:default"
```

Taking a Backup
Manual Backup
```shell
# Back up a specific namespace
velero backup create my-app-backup \
  --include-namespaces production \
  --wait

# Check backup status
velero backup describe my-app-backup

# Check backup logs if something went wrong
velero backup logs my-app-backup
```

The backup creates:
- A tarball of all Kubernetes API objects in the namespace (stored in S3)
- CSI VolumeSnapshots for any PVCs with CSI-backed StorageClasses
- Metadata linking API objects to their volume snapshots
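To verify what actually went into a backup, ask Velero for the per-resource detail, or pull the backup contents down and inspect them directly. A sketch — the backup name is the one created above, and the downloaded filename follows the `<name>-data.tar.gz` pattern recent Velero versions use:

```shell
# Per-resource and per-volume detail for a backup
velero backup describe my-app-backup --details

# Download the backup contents locally (writes my-app-backup-data.tar.gz
# into the current directory) and list the stored JSON objects
velero backup download my-app-backup
tar -tzf my-app-backup-data.tar.gz | head
```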
Scheduled Backups
```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-production-backup
  namespace: velero
spec:
  schedule: "0 2 * * *" # 2am UTC daily
  template:
    includedNamespaces:
      - production
      - staging
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h # 30 days retention
    snapshotVolumes: true
    defaultVolumesToFsBackup: false # Use CSI snapshots, not file-level backup
    labelSelector:
      matchLabels:
        backup: enabled # Optional: only back up labelled resources
```

```shell
kubectl apply -f schedule.yaml

# Check schedules
velero schedule get

# NAME                      STATUS    CREATED                         SCHEDULE    BACKUP TTL   LAST BACKUP   SELECTOR
# daily-production-backup   Enabled   2026-05-09 10:00:00 +0000 UTC   0 2 * * *   720h0m0s     1m ago        <none>
```

Backup Hooks
Pre/post backup hooks let you quiesce a database before snapshot:
```shell
# Annotation on the pod (not the Deployment)
kubectl annotate pod -n production -l app=postgres \
  pre.hook.backup.velero.io/command='["/bin/bash", "-c", "psql -U postgres -c CHECKPOINT"]' \
  pre.hook.backup.velero.io/timeout=60s \
  post.hook.backup.velero.io/command='["/bin/bash", "-c", "echo backup complete"]'
```

For stateful applications, hooks ensure you capture a consistent snapshot — a raw volume snapshot of a database mid-write may be corrupt.
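Hooks can also be declared on the Backup or Schedule spec rather than as pod annotations, which keeps them versioned alongside the rest of your Velero configuration. A sketch using Velero's spec.hooks.resources API; the namespace, label selector, and command are illustrative:

```yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: postgres-backup
  namespace: velero
spec:
  includedNamespaces:
    - production
  hooks:
    resources:
      - name: quiesce-postgres
        includedNamespaces:
          - production
        labelSelector:
          matchLabels:
            app: postgres
        pre:
          - exec:
              container: postgres
              command: ["/bin/bash", "-c", "psql -U postgres -c CHECKPOINT"]
              timeout: 60s
              onError: Fail # Fail the backup rather than snapshot an unquiesced volume
```

With onError: Fail, a hook failure fails the backup, which is usually what you want for databases: a missing backup is detectable, a silently inconsistent one is not.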
Restoring from Backup
Full Namespace Restore
```shell
# List available backups
velero backup get

# Restore a namespace
velero restore create my-app-restore \
  --from-backup my-app-backup \
  --include-namespaces production \
  --wait

# Check restore status
velero restore describe my-app-restore
velero restore logs my-app-restore
```

A full restore recreates all API objects exactly as they were at backup time. PVCs are recreated, and Velero restores the volume data from the CSI snapshot.
Partial Restore: Resources Only (No Volumes)
Sometimes you want to restore the Kubernetes configuration (Deployments, Services, ConfigMaps) without restoring PVC data — for example, after recreating a namespace from a GitOps repo but needing to recover specific ConfigMaps:
```shell
velero restore create config-only-restore \
  --from-backup my-app-backup \
  --include-namespaces production \
  --include-resources configmaps,secrets,deployments,services \
  --restore-volumes=false
```

Cross-Cluster Restore
This is where Velero's value is clearest. To restore into a different cluster:
- Deploy Velero in the target cluster, pointing at the same S3 bucket as the source cluster
- The target Velero picks up the backup storage location and sees existing backups
- Restore as normal
```shell
# In the target cluster — same bucket, read-only access is sufficient for restore
helm upgrade --install velero vmware-tanzu/velero \
  --set configuration.backupStorageLocation[0].bucket=my-velero-backups \
  # ... rest of config same as source cluster

# List backups from source cluster
velero backup get

# Restore into new cluster
velero restore create cross-cluster-restore \
  --from-backup my-app-backup \
  --include-namespaces production
```

Important: CSI snapshots are cloud-provider-specific. An EBS snapshot can only be restored into an EKS cluster in the same AWS account and region. For cross-region or cross-account DR, replicate the snapshots out-of-band — for example with AWS Backup, Data Lifecycle Manager, or the EBS snapshot copy API.
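As one example of out-of-band replication, the EBS copy-snapshot API is run from the destination region. The regions and snapshot ID below are illustrative:

```shell
# Run against the destination (DR) region; copies an EBS snapshot
# from the source region into it
aws ec2 copy-snapshot \
  --region us-west-2 \
  --source-region us-east-1 \
  --source-snapshot-id snap-0123456789abcdef0 \
  --description "DR copy of Velero volume snapshot"
```

Note that a copied snapshot gets a new snapshot ID, so restoring from it in the DR region means recreating the PV/PVC rather than relying on Velero's recorded snapshot metadata.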
What Velero Does Not Back Up
Cluster-scoped resources by default: Velero's default behavior backs up namespace-scoped resources. Cluster-scoped resources (StorageClasses, PersistentVolumes, ClusterRoles, ClusterRoleBindings, CRDs) require explicit inclusion:
```shell
velero backup create full-cluster-backup \
  --include-cluster-resources=true \
  --include-namespaces "*"
```

Be cautious restoring cluster-scoped resources (especially CRDs) into a cluster where they already exist — you may overwrite the version installed by your Helm charts.
Container images: Velero backs up the Kubernetes objects that reference images, not the images themselves. If your private container registry becomes unavailable, Velero can't restore pods that can't pull images. Maintain a separate registry backup or replication strategy.
Secrets backed by external systems: If your secrets are injected from Vault or managed by External Secrets Operator, the Kubernetes Secret objects may be empty shells. Velero backs up the empty shells — the actual secret material is in Vault/SSM and needs its own backup.
Running state: Velero is crash-consistent at best. In-flight database transactions, in-memory state, and network connections are not captured. For stateful workloads, backup hooks (pre-quiesce) are essential for consistency.
Backup Strategy
Retention Tiers
A typical tiered retention schedule:
```yaml
# Hourly backups for the last 24 hours
---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-backup
  namespace: velero
spec:
  schedule: "0 * * * *"
  template:
    includedNamespaces: ["production"]
    ttl: 24h
    snapshotVolumes: false # API objects only for hourly — no volume snapshots
---
# Daily backups for 30 days
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:
    includedNamespaces: ["production"]
    ttl: 720h
    snapshotVolumes: true
---
# Weekly backups for 90 days
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"
  template:
    includedNamespaces: ["production"]
    ttl: 2160h
    snapshotVolumes: true
```

Volume snapshots are expensive (EBS snapshot storage isn't free). Avoid hourly volume snapshots for high-churn databases — hourly API object backups with daily or weekly volume snapshots is more cost-effective.
Multi-Region Replication
The Velero backup (tarball in S3) is a single point of failure if the S3 region is unavailable. For production DR:
```shell
# Enable S3 cross-region replication on the bucket
aws s3api put-bucket-replication \
  --bucket my-velero-backups \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/s3-replication-role",
    "Rules": [{
      "Status": "Enabled",
      "Destination": {
        "Bucket": "arn:aws:s3:::my-velero-backups-replica",
        "StorageClass": "STANDARD_IA"
      }
    }]
  }'
```

EBS snapshots are regional. For cross-region recovery, use AWS Backup or the EBS snapshot copy API to replicate snapshots to your DR region.
Testing Your DR Procedure
Untested backups are not backups — they're a hope. Run DR tests quarterly (or more frequently for critical services):
Test 1: API Object Restore
```shell
# Simulate namespace deletion
kubectl delete namespace staging

# Restore from backup
velero restore create staging-dr-test \
  --from-backup daily-staging-backup-<latest> \
  --include-namespaces staging \
  --wait

# Verify pods are running
kubectl get pods -n staging

# Verify services are accessible
kubectl get svc -n staging
```

Expected outcome: All Deployments restored, pods in Running state, Services have their expected configurations.
Test 2: Volume Data Restore
```shell
# Deploy a test workload with PVC, write known data
kubectl apply -f test-stateful-app.yaml
kubectl exec -n test stateful-pod-0 -- sh -c "echo 'test data' > /data/test.txt"

# Take a backup
velero backup create volume-test-backup --include-namespaces test --wait

# Delete the namespace
kubectl delete namespace test

# Restore
velero restore create volume-test-restore \
  --from-backup volume-test-backup \
  --wait

# Verify the data
kubectl exec -n test stateful-pod-0 -- cat /data/test.txt
# Expected: test data
```

Test 3: Cross-Cluster Restore
This is the hardest test and the one most teams skip. Run it at least annually.
- Provision a new cluster (or a dev cluster with the same CSI drivers)
- Install Velero pointing at the production backup bucket
- Restore a production backup
- Verify the application starts and serves traffic
- Document the time-to-recovery
The actual RTO (recovery time objective) you document is the number your SLA commitments should be based on — not an estimate.
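A low-effort way to capture that number during each test is to wrap the restore in time; the backup name below is illustrative:

```shell
# Measure wall-clock restore time during a DR test.
# --wait blocks until the restore reaches a terminal phase,
# so the elapsed time approximates the API-object restore portion of RTO.
time velero restore create rto-test-$(date +%s) \
  --from-backup my-app-backup \
  --wait
```

Remember this measures the Velero restore itself; for EBS-backed volumes, add the lazy-load window before the application's IO returns to normal.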
Monitoring Velero
Velero exposes Prometheus metrics at :8085/metrics. Key metrics to alert on:
```yaml
# Prometheus alert rules
groups:
  - name: velero
    rules:
      - alert: VeleroBackupFailure
        expr: increase(velero_backup_failure_total[1h]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Velero backup failed"
          description: "{{ $labels.schedule }} backup has failed"

      - alert: VeleroBackupMissing
        expr: time() - velero_backup_last_successful_timestamp > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No successful Velero backup in 24 hours"

      - alert: VeleroBackupStorageNotAvailable
        expr: velero_backup_storage_location_phase{phase!="Available"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Velero backup storage location unavailable"
```

The velero_backup_storage_location_phase metric going non-Available is the most critical — it means Velero cannot write new backups or validate existing ones.
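For Prometheus to scrape those metrics in an Operator-managed setup, a ServiceMonitor selects the Velero Service (the Helm chart can also generate one via metrics.serviceMonitor.enabled=true). A hand-written sketch — verify the label and port name against your install, as they depend on chart version:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: velero
  namespace: velero
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: velero # must match the Velero Service's labels
  endpoints:
    - port: http-monitoring # the chart's metrics port name; confirm in your cluster
      interval: 30s
```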
Common Issues
backup storage location not ready — Velero can't reach S3. Check IAM permissions (is the IRSA role attached?), bucket policy, and VPC endpoint if your cluster uses private networking.
CSI snapshots not being created — Check the VolumeSnapshotClass has velero.io/csi-volumesnapshot-class: "true" label. Check the CSI snapshotter controller is running in kube-system. Check the PVC uses a CSI StorageClass (not the in-tree provisioner).
restore partially failed — Some resources may already exist in the target namespace (from GitOps). Use --existing-resource-policy=update to overwrite existing resources, or --existing-resource-policy=none (default) to skip them:
```shell
velero restore create my-restore \
  --from-backup my-backup \
  --existing-resource-policy=update
```

Restore creates PVCs but pods stay Pending — The PVC is bound but the pod can't mount the volume. Usually a node affinity issue — the EBS volume is in a different AZ than the nodes the pod can schedule on. Inspect kubectl describe pod <pod> for mount errors.
Large backups timing out — Increase the Velero server's timeouts: --fs-backup-timeout for file-level (Kopia) backups, and --item-operation-timeout for asynchronous operations such as CSI snapshot data movement. For very large clusters, consider backing up namespaces in separate schedules rather than one all-namespace backup.
Frequently Asked Questions
Is Velero a replacement for etcd snapshots?
No — they serve different purposes. etcd snapshots are fast, low-overhead, and tied to the cluster topology. Velero backups are portable, namespace-granular, and work on managed clusters where you can't access etcd. Use both: etcd snapshots for in-place cluster recovery, Velero for namespace-level restore, cross-cluster migration, and managed Kubernetes DR.
How do I back up cluster-wide configuration (CRDs, ClusterRoles)?
```shell
velero backup create cluster-config-backup \
  --include-cluster-resources=true \
  --exclude-namespaces kube-system,velero \
  --include-namespaces "*"
```

Be careful restoring CRDs into an existing cluster — the versions must be compatible. Restoring CRDs is usually better handled by re-running your GitOps bootstrap (Argo CD app-of-apps or Flux) than by Velero.
What's the recovery time with Velero?
API objects restore quickly — a namespace with 50 Deployments and 20 Services typically restores in under 2 minutes. PVC data restore time depends on snapshot size and the CSI driver's restore speed. EBS snapshots are lazy-loaded — the volume is available immediately but reads from unrestored blocks go to S3, which is slower. For databases, expect IO to be degraded for the first 30-60 minutes after restore until the snapshot is fully materialised.
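If that lazy-load penalty is unacceptable for your RTO, EBS fast snapshot restore pre-warms a snapshot in specific availability zones for an hourly fee. The snapshot ID and AZ below are illustrative:

```shell
# Pre-warm an EBS snapshot so volumes created from it deliver
# full performance immediately, with no S3 lazy loading
aws ec2 enable-fast-snapshot-restores \
  --availability-zones us-east-1a \
  --source-snapshot-ids snap-0123456789abcdef0
```

This is most useful on the snapshots feeding your DR runbook; enabling it on every scheduled snapshot gets expensive quickly.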
Should I use Restic or Kopia?
Kopia became the default uploader in Velero 1.12 and is the recommended file-level backup engine. It's faster and more efficient than Restic, especially for large files. If you're on an older Velero version using Restic, plan to migrate to Kopia. For new installs, Kopia is the default — you don't need to configure it explicitly.
For persistent volume configuration, see Kubernetes Persistent Volumes: A Production Guide. For GitOps-driven cluster bootstrap (which complements Velero's configuration backup), see GitOps with Argo CD: Production Setup Guide.
Setting up DR for a production Kubernetes cluster? Talk to us at Coding Protocols — we help platform teams build backup and recovery procedures that work under real incident conditions.


