Backup & Restore (cluster)
Backing up a Kubernetes cluster means preserving its state—all the configurations, deployments, services, and metadata that define what’s running. Unlike application data (which lives in persistent volumes), cluster state lives in etcd, the cluster’s distributed key-value store. Regular backups protect against data loss, enable disaster recovery, and allow you to restore clusters to a known good state.
Think of cluster backups like saving a blueprint of your entire cluster. If something catastrophic happens, you can use the backup to rebuild the cluster exactly as it was, with all your applications and configurations intact.
What Gets Backed Up?
Kubernetes cluster backups focus on etcd, which stores:
- API Objects - All resources (Pods, Deployments, Services, ConfigMaps, Secrets, etc.)
- Cluster Configuration - RBAC policies, network policies, resource quotas
- Metadata - Labels, annotations, resource versions
- State Information - Current replica counts, pod statuses, recent events (events expire after a short TTL, one hour by default)
What doesn’t get backed up:
- Application Data - Data in PersistentVolumes (backed up separately at the storage layer)
- Container Images - Stored in image registries, not in etcd
- Node Configuration - OS-level configuration on nodes
- Add-on Data - Some add-ons store data outside etcd
Why Back Up?
Regular backups are essential for:
- Disaster Recovery - Recover from complete cluster failures
- Accidental Deletion - Restore resources that were deleted by mistake
- Migration - Move clusters to new infrastructure or locations
- Rollback - Return to a previous cluster state after problematic changes
- Compliance - Meet regulatory requirements for data retention
- Testing - Create test environments from production snapshots
Backup Strategies
Manual etcd Backup
For single-node etcd (development or small clusters):
# Snapshot etcd to a local file, authenticating with the etcd client certificates
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
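A snapshot is only useful if it can be read back. The following is a minimal sketch of a post-snapshot sanity check, assuming the snapshot path used in the command above. The function names are hypothetical helpers. On etcd 3.5 and later the status subcommand is provided by `etcdutl`; older releases expose it as `etcdctl snapshot status`.

```shell
# Hypothetical helper: cheap check that the snapshot file exists and is non-empty.
snapshot_is_plausible() {
  [ -f "$1" ] && [ -s "$1" ]
}

# Print hash, revision, key count and size of a snapshot if the file looks sane.
# etcdutl ships with etcd 3.5+; on older releases use `etcdctl snapshot status`.
inspect_snapshot() {
  snapshot_is_plausible "$1" || { echo "missing or empty snapshot: $1" >&2; return 1; }
  etcdutl snapshot status "$1" --write-out=table
}

# Example: inspect_snapshot /backup/etcd-snapshot.db
```

Running the check right after each snapshot catches truncated files before they are needed for a restore.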
Automated Backup Scripts
Schedule regular backups using cron jobs or systemd timers that run etcd snapshot commands automatically.
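As an illustration, a cron-driven backup could look like the sketch below. It reuses the endpoint and certificate paths from the manual command; `BACKUP_DIR`, `CLUSTER`, the function names, and the script path in the cron line are assumptions for this example, not fixed conventions.

```shell
#!/bin/sh
# Sketch of a cron-driven etcd backup. BACKUP_DIR and CLUSTER are assumed names.
BACKUP_DIR="${BACKUP_DIR:-/backup}"
CLUSTER="${CLUSTER:-my-cluster}"

# Build a sortable file name such as etcd-my-cluster-20240101T020000Z.db
snapshot_name() {
  printf 'etcd-%s-%s.db\n' "$1" "$2"
}

# Take one snapshot, named after the cluster and the current UTC time.
take_snapshot() {
  target="$BACKUP_DIR/$(snapshot_name "$CLUSTER" "$(date -u +%Y%m%dT%H%M%SZ)")"
  ETCDCTL_API=3 etcdctl snapshot save "$target" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# In the installed script, call take_snapshot as the last line, then schedule it,
# for example in /etc/cron.d/etcd-backup (daily at 02:00, hypothetical script path):
#   0 2 * * * root /usr/local/bin/etcd-backup.sh
```

Compact UTC timestamps keep file names sortable, which makes retention and "latest before time X" lookups simple.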
Backup Tools
Tools like Velero back up resources through the Kubernetes API rather than by snapshotting etcd directly. They can:
- Schedule automatic backups
- Back up entire namespaces or selected resources
- Handle volume snapshots
- Support backup to cloud storage
- Enable cross-cluster restores
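The capabilities listed above map onto a handful of Velero CLI commands. The sketch below assumes Velero is already installed with a backup storage location configured; the backup, schedule, and namespace names are placeholders, and the `run` wrapper is a hypothetical dry-run helper that prints each command unless `DRY_RUN=0` is set.

```shell
# Dry-run wrapper: prints commands unless DRY_RUN=0, so the sketch is safe to paste.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "+ $*"; else "$@"; fi
}

# One-off backup of a single namespace ($1 = backup name, $2 = namespace).
backup_namespace() {
  run velero backup create "$1" --include-namespaces "$2"
}

# Nightly schedule at 02:00, keeping each backup for 30 days (720h).
nightly_schedule() {
  run velero schedule create nightly --schedule="0 2 * * *" --ttl 720h0m0s
}

# Restore from a named backup; this also works from another cluster that
# points at the same backup storage location.
restore_backup() {
  run velero restore create --from-backup "$1"
}
```

For example, `backup_namespace web-backup web` prints the command it would run; setting `DRY_RUN=0` executes it.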
Backup Process
The backup process depends on your etcd setup:
Single etcd Node
For standalone etcd (common in development):
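- Optionally stop the API server to quiesce writes (not strictly required: etcd takes a consistent point-in-time snapshot from a live member)
- Create etcd snapshot
- Restart the API server if you stopped it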
- Store snapshot securely
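If you do pause the API server on a kubeadm-style control plane, one common approach is to move its static pod manifest out of the directory the kubelet watches and move it back afterwards. This is a sketch under that assumption; `MANIFEST_DIR` and `HOLD_DIR` are assumed locations and the function names are hypothetical.

```shell
# Assumed kubeadm layout: the kubelet runs every manifest found in MANIFEST_DIR
# and stops the pod when its manifest disappears.
MANIFEST_DIR="${MANIFEST_DIR:-/etc/kubernetes/manifests}"
HOLD_DIR="${HOLD_DIR:-/etc/kubernetes/manifests-paused}"

# Park the API server manifest so the kubelet stops the pod.
pause_apiserver() {
  mkdir -p "$HOLD_DIR" && mv "$MANIFEST_DIR/kube-apiserver.yaml" "$HOLD_DIR/"
}

# Put the manifest back so the kubelet starts the API server again.
resume_apiserver() {
  mv "$HOLD_DIR/kube-apiserver.yaml" "$MANIFEST_DIR/"
}

# Typical order: pause_apiserver; take the etcd snapshot; resume_apiserver
```

The kubelet needs a few seconds to notice the change in either direction, so allow for that before and after the snapshot.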
etcd Cluster
For multi-node etcd (production HA setups):
- Create snapshot from any etcd member (snapshots are consistent)
- No need to stop API server (etcd handles consistency)
- Store snapshot securely
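In a multi-member cluster, `etcdctl snapshot save` has to be pointed at exactly one endpoint, even though any healthy member will do. A small sketch, assuming a comma-separated endpoint list and the kubeadm certificate paths used earlier; both function names are hypothetical.

```shell
# Hypothetical helper: pick the first entry from a comma-separated endpoint list.
first_endpoint() {
  printf '%s\n' "${1%%,*}"
}

# Snapshot from one member of an HA cluster ($1 = endpoint list, $2 = target file).
# The certificate paths are kubeadm defaults and are assumptions for other installs.
snapshot_from_one_member() {
  ETCDCTL_API=3 etcdctl snapshot save "$2" \
    --endpoints="$(first_endpoint "$1")" \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# Example:
#   snapshot_from_one_member "https://10.0.0.1:2379,https://10.0.0.2:2379" /backup/etcd-snapshot.db
```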
External etcd
For externally managed etcd:
- Use etcd’s native backup tools
- Follow etcd provider’s backup procedures
- Coordinate with etcd administrators
Restore Process
Restoring from a backup follows these steps, in order:
- Stop the Cluster - Stop all control plane components
- Restore etcd - Restore etcd data from snapshot
- Restart Control Plane - Start API server and other components
- Verify Cluster - Check that all resources are restored correctly
- Restore Worker Nodes - Rejoin nodes if needed (usually not required)
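The restore steps above can be sketched for a single control plane node as follows. This assumes a kubeadm-style layout (static pod manifests in `/etc/kubernetes/manifests`, etcd data in `/var/lib/etcd`) and etcd 3.5 or later, where `etcdutl snapshot restore` is available. The variable names and the `run` dry-run wrapper are assumptions made for this example. Multi-member clusters additionally need the per-member `--name`, `--initial-cluster`, and `--initial-advertise-peer-urls` flags on restore, which this sketch does not cover.

```shell
# Dry-run by default: prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
MANIFESTS="${MANIFESTS:-/etc/kubernetes/manifests}"
PARKED="${PARKED:-/etc/kubernetes/manifests-parked}"
DATA_DIR="${DATA_DIR:-/var/lib/etcd}"

run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "+ $*"; else "$@"; fi
}

restore_etcd() {
  snapshot="$1"
  # 1. Stop the control plane by parking the static pod manifests.
  run mkdir -p "$PARKED"
  for m in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    run mv "$MANIFESTS/$m.yaml" "$PARKED/"
  done
  # A real run should wait here until the containers have exited
  # (for example by checking `crictl ps`) before touching the data directory.

  # 2. Restore into a fresh directory and keep the old data for inspection.
  run etcdutl snapshot restore "$snapshot" --data-dir "$DATA_DIR.restored"
  run mv "$DATA_DIR" "$DATA_DIR.old"
  run mv "$DATA_DIR.restored" "$DATA_DIR"

  # 3. Restart the control plane, etcd first.
  for m in etcd kube-apiserver kube-controller-manager kube-scheduler; do
    run mv "$PARKED/$m.yaml" "$MANIFESTS/"
  done

  # 4. Verify that the API server answers and nodes are listed.
  run kubectl get nodes
}

# Example (prints the plan): restore_etcd /backup/etcd-snapshot.db
```

Reviewing the printed plan before setting `DRY_RUN=0` is a cheap safeguard, since a restore overwrites the live cluster state.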
Backup Frequency
How often to back up depends on:
- Recovery Point Objective (RPO) - How much data loss is acceptable?
- Change Rate - How frequently does cluster state change?
- Criticality - How important is the cluster?
- Storage Costs - More frequent backups use more storage
Common strategies:
- Production clusters - Daily backups, retain for 30-90 days
- Development clusters - Weekly backups, retain for 7-14 days
- Critical production - Multiple backups per day, longer retention
- Before major changes - Manual backups before upgrades or migrations
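Retention is usually enforced by a small pruning step that runs after each backup. The sketch below assumes snapshots are plain files named `etcd-*.db` in one directory; the function name and the cron lines in the comments are illustrative.

```shell
# Delete snapshots last modified more than $2 days ago from directory $1.
# The etcd-*.db naming pattern is an assumption of this sketch.
prune_snapshots() {
  dir="$1"
  days="$2"
  find "$dir" -type f -name 'etcd-*.db' -mtime +"$days" -exec rm -f {} +
}

# Example schedules (cron syntax):
#   0 2 * * *    daily backup at 02:00, then: prune_snapshots /backup 30
#   0 */6 * * *  every six hours for critical clusters, with a longer retention
```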
Backup Storage
Store backups securely and redundantly:
- Off-cluster Storage - Never store backups only on cluster nodes
- Multiple Locations - Use multiple storage backends (local + cloud)
- Encryption - Encrypt backups containing sensitive data (Secrets)
- Access Control - Limit who can access backups
- Versioning - Keep multiple backup versions for point-in-time recovery
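Because etcd snapshots contain Secrets, a typical pipeline encrypts each snapshot and records a checksum before copying it off the cluster. The following sketch uses `openssl` and `sha256sum`, which are assumptions about the tooling on the backup host; the function names and the passphrase file are hypothetical, and the upload step is left to whatever storage client you use.

```shell
# Encrypt $1 to $1.enc with AES-256, reading the passphrase from file $2.
# Requires OpenSSL 1.1.1 or later for -pbkdf2.
encrypt_snapshot() {
  openssl enc -aes-256-cbc -pbkdf2 -salt -in "$1" -out "$1.enc" -pass "file:$2"
}

# Write <file>.sha256 next to the file, using a relative name so the
# checksum stays valid after the pair is copied elsewhere.
write_checksum() {
  ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# Succeeds only if the file still matches its recorded checksum.
verify_checksum() {
  ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1 )
}

# Typical order: encrypt_snapshot, write_checksum on the .enc file, upload both,
# and run verify_checksum after download and before any restore.
```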
Testing Restores
Regularly test restore procedures:
- Practice Restores - Restore to a test cluster regularly
- Document Procedures - Write down restore steps
- Time Restores - Measure recovery time to meet RTO requirements
- Verify Integrity - Ensure restored resources are correct
- Update Procedures - Refine restore process based on testing
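Timing a practice restore can be as simple as wrapping the restore command and comparing the result with the RTO. A minimal sketch; both function names and the restore script in the example are hypothetical.

```shell
# Run a command and print how many whole seconds it took.
# The command's own output and exit status are discarded here.
elapsed_seconds() {
  start="$(date +%s)"
  "$@" >/dev/null 2>&1
  end="$(date +%s)"
  echo $((end - start))
}

# Succeeds when the measured time ($1) is within the RTO ($2), both in seconds.
meets_rto() {
  [ "$1" -le "$2" ]
}

# Example with a hypothetical restore script and a 15 minute RTO:
#   t="$(elapsed_seconds ./restore-to-test-cluster.sh)"
#   meets_rto "$t" 900 || echo "restore took ${t}s, RTO exceeded"
```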
Disaster Recovery Scenarios
Complete Cluster Failure
- Provision new infrastructure
- Install Kubernetes (for example with kubeadm; managed services generally do not expose etcd for a direct restore)
- Restore etcd from backup
- Rejoin or recreate worker nodes
- Verify all applications are running
Partial Failure (Control Plane Only)
- Rebuild control plane nodes
- Restore etcd to new control plane
- Worker nodes reconnect automatically (usually)
- Verify cluster health
Data Corruption
- Identify last known good backup
- Restore etcd from that backup
- Accept the loss of all changes made after that backup was taken
- Reapply any later changes you still need (for example from Git or CI pipelines)
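Finding the last known good backup is easier when snapshot file names carry a sortable timestamp. The sketch below assumes names that end in `-<UTC timestamp>.db`, such as `etcd-prod-20240102T000000Z.db`, and picks the newest snapshot taken before a given cutoff. The naming scheme and the function name are assumptions of this example.

```shell
# Print the newest snapshot in directory $1 whose timestamp is earlier than $2.
# Relies on compact UTC timestamps (YYYYMMDDTHHMMSSZ), which sort as plain strings.
last_good_snapshot() {
  dir="$1"
  cutoff="$2"
  best=""
  for f in "$dir"/etcd-*.db; do
    [ -e "$f" ] || continue
    ts="${f##*-}"      # text after the last hyphen, e.g. 20240102T000000Z.db
    ts="${ts%.db}"     # strip the extension
    # The glob expands in ascending order, so the last match is the newest.
    if expr "$ts" \< "$cutoff" >/dev/null; then best="$f"; fi
  done
  [ -n "$best" ] && printf '%s\n' "$best"
}

# Example: corruption first observed on 3 January 2024 at 00:00 UTC
#   last_good_snapshot /backup 20240103T000000Z
```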
Application Data Backups
Remember that etcd backups don’t include application data in PersistentVolumes. You need separate backup strategies for:
- Database Backups - Use database-native backup tools
- File Storage - Back up object storage or file systems
- Volume Snapshots - Use CSI snapshot capabilities
- Application-Level Backups - Application-specific backup solutions
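For the CSI option, a volume snapshot is requested by creating a `VolumeSnapshot` object that references a PersistentVolumeClaim. The sketch below prints such a manifest so it can be piped to `kubectl`; it assumes the cluster has the snapshot CRDs and a CSI driver with snapshot support installed, and all three names are placeholders.

```shell
# Print a VolumeSnapshot manifest.
# $1 = snapshot name, $2 = PVC to snapshot, $3 = VolumeSnapshotClass name.
volume_snapshot_manifest() {
  cat <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: $1
spec:
  volumeSnapshotClassName: $3
  source:
    persistentVolumeClaimName: $2
EOF
}

# Example:
#   volume_snapshot_manifest data-snap data-pvc csi-snapclass | kubectl apply -f -
```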
See the Storage Backup & Restore section for data backup strategies.
Best Practices
- Automate Backups - Use cron jobs, Velero, or other automation
- Test Regularly - Practice restores in non-production environments
- Monitor Backup Success - Alert on backup failures
- Version Backups - Keep multiple backup versions
- Document Procedures - Write clear backup and restore runbooks
- Secure Storage - Encrypt backups and control access
- Back Up Before Changes - Always back up before upgrades or major changes
- Include Metadata - Tag backups with cluster name, date, version
- Verify Backups - Periodically verify backup integrity
- Plan for Scale - Backup strategies should work as clusters grow
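Monitoring backup success can start with a freshness check: alert when the newest snapshot is older than expected. A minimal sketch, assuming snapshots named `etcd-*.db` in a single directory and a `find` that supports `-mmin` (GNU and BSD do); the function name and the alert command are illustrative.

```shell
# Succeeds if at least one snapshot in directory $1 was modified
# within the last $2 minutes.
backup_is_fresh() {
  [ -n "$(find "$1" -type f -name 'etcd-*.db' -mmin -"$2" 2>/dev/null | head -n 1)" ]
}

# Example check for daily backups with one hour of slack (1500 minutes):
#   backup_is_fresh /backup 1500 || logger -p user.err "etcd backup is stale"
```

A check like this catches silent failures, such as a cron job that stopped running, that per-run error handling cannot see.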
See Also
- Storage Backup & Restore - Backing up application data in PersistentVolumes
- Kubeadm - Backup procedures for kubeadm clusters
- High Availability - Backup considerations for HA etcd clusters
- etcd Basics - Understanding etcd architecture