Backup & Restore (cluster)

Backing up a Kubernetes cluster means preserving its state—all the configurations, deployments, services, and metadata that define what’s running. Unlike application data (which lives in persistent volumes), cluster state lives in etcd, the cluster’s distributed key-value store. Regular backups protect against data loss, enable disaster recovery, and allow you to restore clusters to a known good state.

Think of a cluster backup as a blueprint of your entire cluster. If something catastrophic happens, you can use the backup to rebuild the cluster exactly as it was, with all your applications and configurations intact.

What Gets Backed Up?

Kubernetes cluster backups focus on etcd, which stores:

  • API Objects - All resources (Pods, Deployments, Services, ConfigMaps, Secrets, etc.)
  • Cluster Configuration - RBAC policies, network policies, resource quotas
  • Metadata - Labels, annotations, resource versions
  • State Information - Current replica counts, pod statuses, event history

What doesn’t get backed up:

  • Application Data - Data in PersistentVolumes (backed up separately at the storage layer)
  • Container Images - Stored in image registries, not in etcd
  • Node Configuration - OS-level configuration on nodes
  • Add-on Data - Some add-ons store data outside etcd

graph TB
    A[Cluster Backup] --> B[etcd Backup]
    A --> C[Application Data Backup]
    B --> D[API Objects]
    B --> E[Configurations]
    B --> F[Metadata]
    C --> G[PersistentVolumes]
    C --> H[Database Data]
    C --> I[File Storage]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f3e5f5
    style D fill:#e8f5e9
    style E fill:#e8f5e9
    style F fill:#e8f5e9

Why Back Up?

Regular backups are essential for:

  • Disaster Recovery - Recover from complete cluster failures
  • Accidental Deletion - Restore resources that were deleted by mistake
  • Migration - Move clusters to new infrastructure or locations
  • Rollback - Return to a previous cluster state after problematic changes
  • Compliance - Meet regulatory requirements for data retention
  • Testing - Create test environments from production snapshots

Backup Strategies

Manual etcd Backup

For single-node etcd (development or small clusters):

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
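
To confirm the snapshot is usable, inspect it afterwards. A minimal check (newer etcd releases expose the same subcommand through etcdutl):

# Verify the snapshot's hash, revision, and size
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table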

Automated Backup Scripts

Schedule regular backups using cron jobs or systemd timers that run etcd snapshot commands automatically.
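
A minimal sketch of such a script, assuming the certificate paths from the example above and an illustrative local /backup directory that is shipped off the node afterwards:

#!/usr/bin/env bash
# /usr/local/bin/etcd-backup.sh - run from cron, e.g. "0 2 * * *"
set -euo pipefail

SNAPSHOT="/backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"

ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Keep only the last 7 local snapshots; older copies should already live off-cluster
ls -1t /backup/etcd-snapshot-*.db | tail -n +8 | xargs -r rm --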

Backup Tools

Tools like Velero provide comprehensive backup solutions that:

  • Schedule automatic backups
  • Back up entire namespaces or selected resources
  • Handle volume snapshots
  • Support backup to cloud storage
  • Enable cross-cluster restores
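
For example, a sketch of one-off and scheduled backups with the Velero CLI (the names and schedule are illustrative, and assume Velero is already installed with a backup storage location configured):

# One-off backup of a single namespace
velero backup create my-app-backup --include-namespaces my-app

# Daily backup of the whole cluster at 02:00, retained for 30 days
velero schedule create daily-cluster-backup \
  --schedule="0 2 * * *" \
  --ttl 720h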

Backup Process

The backup process depends on your etcd setup:

sequenceDiagram
    participant Admin
    participant etcd
    participant BackupStorage
    Admin->>etcd: Request snapshot
    etcd->>etcd: Create snapshot
    etcd->>Admin: Return snapshot file
    Admin->>BackupStorage: Upload snapshot
    BackupStorage->>BackupStorage: Store with timestamp
    Note over BackupStorage: Retention policy<br/>applies here

Single etcd Node

For standalone etcd (common in development):

  1. Stop API server temporarily (prevents writes during backup)
  2. Create etcd snapshot
  3. Resume API server
  4. Store snapshot securely

etcd Cluster

For multi-node etcd (production HA setups):

  1. Create snapshot from any etcd member (snapshots are consistent)
  2. No need to stop API server (etcd handles consistency)
  3. Store snapshot securely
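
For example, checking member health first and then snapshotting a single member (the member endpoint is illustrative; certificate paths match the earlier example):

# Confirm all members are healthy before taking the snapshot
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Snapshot one member; the snapshot contains the full keyspace
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://10.0.0.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key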

External etcd

For externally managed etcd:

  1. Use etcd’s native backup tools
  2. Follow etcd provider’s backup procedures
  3. Coordinate with etcd administrators

Restore Process

Restoring from backup requires:

  1. Stop the Cluster - Stop all control plane components
  2. Restore etcd - Restore etcd data from snapshot
  3. Restart Control Plane - Start API server and other components
  4. Verify Cluster - Check that all resources are restored correctly
  5. Restore Worker Nodes - Rejoin nodes if needed (usually not required)

graph TD
    A[Backup Available] --> B[Stop Control Plane]
    B --> C[Stop etcd]
    C --> D[Restore etcd Data]
    D --> E[Start etcd]
    E --> F[Start Control Plane]
    F --> G[Verify Cluster Health]
    G --> H[Rejoin Nodes if needed]
    H --> I[Restore Complete]
    style A fill:#e1f5ff
    style B fill:#ffe1e1
    style D fill:#fff4e1
    style F fill:#fff4e1
    style G fill:#e8f5e9
    style I fill:#e8f5e9
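
A sketch of the stop/restore/restart steps on a kubeadm-style control plane, where the control plane runs as static pods (paths and the restored data directory name are illustrative; newer etcd releases provide the same restore subcommand via etcdutl):

# 1. Stop the control plane by moving the static pod manifests aside
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# 2. Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored

# 3. Point etcd at the restored directory (edit the hostPath in etcd.yaml),
#    then move the manifests back to restart etcd and the API server
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/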

Backup Frequency

How often to back up depends on:

  • Recovery Point Objective (RPO) - How much data loss is acceptable?
  • Change Rate - How frequently does cluster state change?
  • Criticality - How important is the cluster?
  • Storage Costs - More frequent backups use more storage

Common strategies:

  • Production clusters - Daily backups, retain for 30-90 days
  • Development clusters - Weekly backups, retain for 7-14 days
  • Critical production - Multiple backups per day, longer retention
  • Before major changes - Manual backups before upgrades or migrations

Backup Storage

Store backups securely and redundantly:

  • Off-cluster Storage - Never store backups only on cluster nodes
  • Multiple Locations - Use multiple storage backends (local + cloud)
  • Encryption - Encrypt backups containing sensitive data (Secrets)
  • Access Control - Limit who can access backups
  • Versioning - Keep multiple backup versions for point-in-time recovery
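
A minimal sketch of encrypting a snapshot and copying it off-cluster (the passphrase file and bucket name are illustrative):

# Encrypt the snapshot before it leaves the node
gpg --batch --symmetric --cipher-algo AES256 \
  --passphrase-file /root/backup-passphrase \
  --output /backup/etcd-snapshot.db.gpg /backup/etcd-snapshot.db

# Copy the encrypted snapshot to off-cluster object storage
aws s3 cp /backup/etcd-snapshot.db.gpg \
  s3://example-cluster-backups/etcd/etcd-snapshot-$(date +%Y%m%d).db.gpg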

Testing Restores

Regularly test restore procedures:

  • Practice Restores - Restore to a test cluster regularly
  • Document Procedures - Write down restore steps
  • Time Restores - Measure recovery time against your Recovery Time Objective (RTO)
  • Verify Integrity - Ensure restored resources are correct
  • Update Procedures - Refine restore process based on testing
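
After a practice restore, a few quick checks like the following help confirm the cluster came back as expected (compare the output against what was recorded at backup time):

# Basic health checks on the restored cluster
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get deployments --all-namespaces

# Rough resource count to compare against the pre-backup inventory
kubectl get all --all-namespaces --no-headers | wc -l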

Disaster Recovery Scenarios

Complete Cluster Failure

  1. Provision new infrastructure
  2. Install Kubernetes (kubeadm or managed service)
  3. Restore etcd from backup
  4. Rejoin or recreate worker nodes
  5. Verify all applications are running

Partial Failure (Control Plane Only)

  1. Rebuild control plane nodes
  2. Restore etcd to new control plane
  3. Worker nodes reconnect automatically (usually)
  4. Verify cluster health

Data Corruption

  1. Identify last known good backup
  2. Restore etcd from that backup
  3. Accept data loss since corruption point
  4. Reconcile any changes made after backup

Application Data Backups

Remember that etcd backups don’t include application data in PersistentVolumes. You need separate backup strategies for:

  • Database Backups - Use database-native backup tools
  • File Storage - Back up object storage or file systems
  • Volume Snapshots - Use CSI snapshot capabilities
  • Application-Level Backups - Application-specific backup solutions
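
For example, a sketch of a CSI volume snapshot, assuming the external-snapshotter CRDs are installed and a VolumeSnapshotClass named csi-snapclass exists (the PVC and namespace names are illustrative):

# Create a CSI snapshot of an existing PVC named "data-pvc"
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-pvc-snapshot
  namespace: my-app
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-pvc
EOF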

See the Storage Backup & Restore section for data backup strategies.

Best Practices

  1. Automate Backups - Use cron jobs, Velero, or other automation
  2. Test Regularly - Practice restores in non-production environments
  3. Monitor Backup Success - Alert on backup failures
  4. Version Backups - Keep multiple backup versions
  5. Document Procedures - Write clear backup and restore runbooks
  6. Secure Storage - Encrypt backups and control access
  7. Backup Before Changes - Always back up before upgrades or major changes
  8. Include Metadata - Tag backups with cluster name, date, version
  9. Verify Backups - Periodically verify backup integrity
  10. Plan for Scale - Backup strategies should work as clusters grow
