Backup & Restore (cluster)

Backing up a Kubernetes cluster means preserving its state—all the configurations, deployments, services, and metadata that define what’s running. Unlike application data (which lives in persistent volumes), cluster state lives in etcd, the cluster’s distributed key-value store. Regular backups protect against data loss, enable disaster recovery, and allow you to restore clusters to a known good state.

Think of a cluster backup as a blueprint of your entire cluster. If something catastrophic happens, you can use the backup to rebuild the cluster exactly as it was, with all your applications and configurations intact.

What Gets Backed Up?

Kubernetes cluster backups focus on etcd, which stores:

  • API Objects - All resources (Pods, Deployments, Services, ConfigMaps, Secrets, etc.)
  • Cluster Configuration - RBAC policies, network policies, resource quotas
  • Metadata - Labels, annotations, resource versions
  • State Information - Current replica counts, pod statuses, event history

What doesn’t get backed up:

  • Application Data - Data in PersistentVolumes (backed up separately at the storage layer)
  • Container Images - Stored in image registries, not in etcd
  • Node Configuration - OS-level configuration on nodes
  • Add-on Data - Some add-ons store data outside etcd

graph TB
    A[Cluster Backup] --> B[etcd Backup]
    A --> C[Application Data Backup]
    B --> D[API Objects]
    B --> E[Configurations]
    B --> F[Metadata]
    C --> G[PersistentVolumes]
    C --> H[Database Data]
    C --> I[File Storage]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f3e5f5
    style D fill:#e8f5e9
    style E fill:#e8f5e9
    style F fill:#e8f5e9

Why Back Up?

Regular backups are essential for:

  • Disaster Recovery - Recover from complete cluster failures
  • Accidental Deletion - Restore resources that were deleted by mistake
  • Migration - Move clusters to new infrastructure or locations
  • Rollback - Return to a previous cluster state after problematic changes
  • Compliance - Meet regulatory requirements for data retention
  • Testing - Create test environments from production snapshots

Backup Strategies

Manual etcd Backup

For single-node etcd (development or small clusters):

# Snapshot etcd
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
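
To confirm the snapshot is usable, inspect it afterwards. A minimal check (newer etcd releases expose the same subcommand through etcdutl):

# Verify the snapshot's hash, revision, and size
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table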

Automated Backup Scripts

Schedule regular backups using cron jobs or systemd timers that run etcd snapshot commands automatically.
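
A minimal sketch of such a script, assuming the certificate paths from the example above and an illustrative local /backup directory that is shipped off the node afterwards:

#!/usr/bin/env bash
# /usr/local/bin/etcd-backup.sh - run from cron, e.g. "0 2 * * *"
set -euo pipefail

SNAPSHOT="/backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"

ETCDCTL_API=3 etcdctl snapshot save "$SNAPSHOT" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Keep only the last 7 local snapshots; older copies should already live off-cluster
ls -1t /backup/etcd-snapshot-*.db | tail -n +8 | xargs -r rm --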

Backup Tools

Tools like Velero provide comprehensive backup solutions that:

  • Schedule automatic backups
  • Back up entire namespaces or selected resources
  • Handle volume snapshots
  • Support backup to cloud storage
  • Enable cross-cluster restores
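
For example, a sketch of one-off and scheduled backups with the Velero CLI (the names and schedule are illustrative, and assume Velero is already installed with a backup storage location configured):

# One-off backup of a single namespace
velero backup create my-app-backup --include-namespaces my-app

# Daily backup of the whole cluster at 02:00, retained for 30 days
velero schedule create daily-cluster-backup \
  --schedule="0 2 * * *" \
  --ttl 720h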

Backup Process

The backup process depends on your etcd setup:

sequenceDiagram
    participant Admin
    participant etcd
    participant BackupStorage
    Admin->>etcd: Request snapshot
    etcd->>etcd: Create snapshot
    etcd->>Admin: Return snapshot file
    Admin->>BackupStorage: Upload snapshot
    BackupStorage->>BackupStorage: Store with timestamp
    Note over BackupStorage: Retention policy<br/>applies here

Single etcd Node

For standalone etcd (common in development):

  1. Stop API server temporarily (prevents writes during backup)
  2. Create etcd snapshot
  3. Resume API server
  4. Store snapshot securely

etcd Cluster

For multi-node etcd (production HA setups):

  1. Create snapshot from any etcd member (snapshots are consistent)
  2. No need to stop API server (etcd handles consistency)
  3. Store snapshot securely
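
For example, checking member health first and then snapshotting a single member (the member endpoint is illustrative; certificate paths match the earlier example):

# Confirm all members are healthy before taking the snapshot
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Snapshot one member; the snapshot contains the full keyspace
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://10.0.0.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key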

External etcd

For externally managed etcd:

  1. Use etcd’s native backup tools
  2. Follow etcd provider’s backup procedures
  3. Coordinate with etcd administrators

Restore Process

Restoring from backup requires:

  1. Stop the Cluster - Stop all control plane components
  2. Restore etcd - Restore etcd data from snapshot
  3. Restart Control Plane - Start API server and other components
  4. Verify Cluster - Check that all resources are restored correctly
  5. Restore Worker Nodes - Rejoin nodes if needed (usually not required)

graph TD
    A[Backup Available] --> B[Stop Control Plane]
    B --> C[Stop etcd]
    C --> D[Restore etcd Data]
    D --> E[Start etcd]
    E --> F[Start Control Plane]
    F --> G[Verify Cluster Health]
    G --> H[Rejoin Nodes if needed]
    H --> I[Restore Complete]
    style A fill:#e1f5ff
    style B fill:#ffe1e1
    style D fill:#fff4e1
    style F fill:#fff4e1
    style G fill:#e8f5e9
    style I fill:#e8f5e9
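
A sketch of the stop/restore/restart steps on a kubeadm-style control plane, where the control plane runs as static pods (paths and the restored data directory name are illustrative; newer etcd releases provide the same restore subcommand via etcdutl):

# 1. Stop the control plane by moving the static pod manifests aside
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
mv /etc/kubernetes/manifests/etcd.yaml /tmp/

# 2. Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored

# 3. Point etcd at the restored directory (edit the hostPath in etcd.yaml),
#    then move the manifests back to restart etcd and the API server
mv /tmp/etcd.yaml /etc/kubernetes/manifests/
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/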

Backup Frequency

How often to back up depends on:

  • Recovery Point Objective (RPO) - How much data loss is acceptable?
  • Change Rate - How frequently does cluster state change?
  • Criticality - How important is the cluster?
  • Storage Costs - More frequent backups use more storage

Common strategies:

  • Production clusters - Daily backups, retain for 30-90 days
  • Development clusters - Weekly backups, retain for 7-14 days
  • Critical production - Multiple backups per day, longer retention
  • Before major changes - Manual backups before upgrades or migrations

Backup Storage

Store backups securely and redundantly:

  • Off-cluster Storage - Never store backups only on cluster nodes
  • Multiple Locations - Use multiple storage backends (local + cloud)
  • Encryption - Encrypt backups containing sensitive data (Secrets)
  • Access Control - Limit who can access backups
  • Versioning - Keep multiple backup versions for point-in-time recovery
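
A minimal sketch of encrypting a snapshot and copying it off-cluster (the passphrase file and bucket name are illustrative):

# Encrypt the snapshot before it leaves the node
gpg --batch --symmetric --cipher-algo AES256 \
  --passphrase-file /root/backup-passphrase \
  --output /backup/etcd-snapshot.db.gpg /backup/etcd-snapshot.db

# Copy the encrypted snapshot to off-cluster object storage
aws s3 cp /backup/etcd-snapshot.db.gpg \
  s3://example-cluster-backups/etcd/etcd-snapshot-$(date +%Y%m%d).db.gpg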

Testing Restores

Regularly test restore procedures:

  • Practice Restores - Restore to a test cluster regularly
  • Document Procedures - Write down restore steps
  • Time Restores - Measure recovery time against your Recovery Time Objective (RTO)
  • Verify Integrity - Ensure restored resources are correct
  • Update Procedures - Refine restore process based on testing
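
After a practice restore, a few quick checks like the following help confirm the cluster came back as expected (compare the output against what was recorded at backup time):

# Basic health checks on the restored cluster
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get deployments --all-namespaces

# Rough resource count to compare against the pre-backup inventory
kubectl get all --all-namespaces --no-headers | wc -l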

Disaster Recovery Scenarios

Complete Cluster Failure

  1. Provision new infrastructure
  2. Install Kubernetes (kubeadm or managed service)
  3. Restore etcd from backup
  4. Rejoin or recreate worker nodes
  5. Verify all applications are running

Partial Failure (Control Plane Only)

  1. Rebuild control plane nodes
  2. Restore etcd to new control plane
  3. Worker nodes reconnect automatically (usually)
  4. Verify cluster health

Data Corruption

  1. Identify last known good backup
  2. Restore etcd from that backup
  3. Accept data loss since corruption point
  4. Reconcile any changes made after backup

Application Data Backups

Remember that etcd backups don’t include application data in PersistentVolumes. You need separate backup strategies for:

  • Database Backups - Use database-native backup tools
  • File Storage - Back up object storage or file systems
  • Volume Snapshots - Use CSI snapshot capabilities
  • Application-Level Backups - Application-specific backup solutions
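
For example, a sketch of a CSI volume snapshot, assuming the external-snapshotter CRDs are installed and a VolumeSnapshotClass named csi-snapclass exists (the PVC and namespace names are illustrative):

# Create a CSI snapshot of an existing PVC named "data-pvc"
kubectl apply -f - <<'EOF'
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: data-pvc-snapshot
  namespace: my-app
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: data-pvc
EOF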

See the Storage Backup & Restore section for data backup strategies.

Best Practices

  1. Automate Backups - Use cron jobs, Velero, or other automation
  2. Test Regularly - Practice restores in non-production environments
  3. Monitor Backup Success - Alert on backup failures
  4. Version Backups - Keep multiple backup versions
  5. Document Procedures - Write clear backup and restore runbooks
  6. Secure Storage - Encrypt backups and control access
  7. Backup Before Changes - Always back up before upgrades or major changes
  8. Include Metadata - Tag backups with cluster name, date, version
  9. Verify Backups - Periodically verify backup integrity
  10. Plan for Scale - Backup strategies should work as clusters grow
