High Availability
High availability (HA) ensures your Kubernetes cluster continues operating even when individual components fail. In a single-node control plane, if that node fails, the entire cluster becomes unusable. High availability distributes control plane components across multiple nodes, so the failure of one node doesn’t bring down the cluster.
Think of high availability like having multiple engines on an airplane. If one engine fails, the others keep the plane flying. Similarly, with multiple control plane nodes, if one fails, the others continue serving the cluster.
What Is High Availability?
High availability in Kubernetes means:
- Multiple Control Plane Nodes - Run API server, controller manager, and scheduler on multiple nodes
- Clustered etcd - Run etcd as a distributed cluster (typically 3 or 5 nodes)
- Load Balanced API Traffic - Distribute API server requests across control plane nodes
- Node Redundancy - Run worker nodes across multiple availability zones
- Automatic Failover - Components automatically use healthy nodes when others fail
Control Plane High Availability
The control plane consists of several components, each with different HA requirements:
API Server
The API server is stateless and can run multiple instances. A load balancer distributes requests across all API server instances. If one API server fails, requests automatically go to the others.
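As a rough illustration of how that failover works, each API server instance exposes health endpoints that load balancers can probe. This is a sketch only; the hostname is a placeholder and the port assumes a default kubeadm installation.

```bash
# Probe a single API server instance directly (default port 6443).
# /readyz returns 200 only when this instance is ready to receive traffic;
# /livez reports whether the process itself is healthy.
curl -k "https://control-plane-1.example.com:6443/readyz?verbose"
curl -k "https://control-plane-1.example.com:6443/livez"
```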
etcd
etcd is stateful and requires clustering for HA. etcd uses a consensus algorithm (Raft) that requires a quorum (majority) of nodes to operate:
- 3-node etcd - Can tolerate 1 node failure (needs 2 of 3 for quorum)
- 5-node etcd - Can tolerate 2 node failures (needs 3 of 5 for quorum)
- 7-node etcd - Can tolerate 3 node failures (needs 4 of 7 for quorum)
More nodes provide better fault tolerance but increase complexity and write latency, since every write must be replicated to a quorum of members before it is committed. Most clusters use 3-node etcd.
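To see quorum status in practice, you can query the members directly with etcdctl. This is a minimal sketch assuming a kubeadm-managed etcd; the endpoint IPs are placeholders and the certificate paths are kubeadm defaults that may differ in your cluster.

```bash
export ETCDCTL_API=3
ENDPOINTS=https://10.0.0.11:2379,https://10.0.0.12:2379,https://10.0.0.13:2379
PKI=/etc/kubernetes/pki/etcd

# Is each member reachable and healthy?
etcdctl --endpoints="$ENDPOINTS" \
  --cacert="$PKI/ca.crt" --cert="$PKI/server.crt" --key="$PKI/server.key" \
  endpoint health

# Which member is the current Raft leader, and is any member lagging?
etcdctl --endpoints="$ENDPOINTS" \
  --cacert="$PKI/ca.crt" --cert="$PKI/server.crt" --key="$PKI/server.key" \
  endpoint status --write-out=table
```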
Controller Manager and Scheduler
These components use leader election—only one instance is active at a time, but multiple instances run for redundancy. If the active instance fails, another instance takes over automatically.
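In recent Kubernetes versions the election is coordinated through Lease objects in the kube-system namespace, so you can check which node currently holds each lock. This is a sketch; output details vary by version.

```bash
# The HOLDER column shows which control plane node runs the active instance.
kubectl -n kube-system get lease kube-controller-manager kube-scheduler

# Or extract just the current scheduler leader's identity.
kubectl -n kube-system get lease kube-scheduler \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'
```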
etcd Topologies
How etcd is deployed affects availability:
Stacked etcd Topology
etcd runs on the same nodes as control plane components. This is simpler but couples etcd availability with control plane availability.
Advantages:
- Simpler setup (fewer nodes)
- Lower resource requirements
- Easier to manage
Disadvantages:
- etcd and the API server share fate (if a node fails, both are affected)
- More complex recovery (need to restore both)
External etcd Topology
etcd runs on separate nodes from control plane components. This provides better isolation and is recommended for production.
Advantages:
- Better isolation (etcd and control plane failures are independent)
- Can scale etcd separately
- More resilient to failures
Disadvantages:
- More nodes to manage
- Higher resource requirements
- More complex setup
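For the external topology, the control plane is pointed at the separate etcd cluster through the kubeadm configuration. The sketch below assumes the kubeadm v1beta3 config API (the version varies by kubeadm release); endpoints, hostnames, and certificate paths are placeholders.

```bash
# Minimal kubeadm ClusterConfiguration for an external etcd cluster.
cat > kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: stable
controlPlaneEndpoint: "lb.example.com:6443"   # stable load balancer VIP
etcd:
  external:
    endpoints:
      - https://etcd-1.example.com:2379
      - https://etcd-2.example.com:2379
      - https://etcd-3.example.com:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
EOF

# Initialize the first control plane node with this configuration.
kubeadm init --config kubeadm-config.yaml --upload-certs
```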
Load Balancing Control Plane Traffic
All API clients (kubelets, kube-proxy, controllers, and users) need to connect to the API server. In an HA setup, a load balancer distributes this traffic across the control plane nodes.
The load balancer must:
- Health check API servers
- Distribute traffic evenly
- Handle API server failures gracefully
- Provide a stable endpoint (VIP) that doesn’t change
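One common way to meet these requirements is an HAProxy instance (or a redundant pair with keepalived managing the VIP) running in TCP mode in front of the API servers. This is a sketch only; backend addresses are placeholders and the health-check details depend on your environment.

```bash
# HAProxy fragment: TCP pass-through to every API server, health-checked
# against the API server's /readyz endpoint. Addresses are placeholders.
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
frontend kube-apiserver
    bind *:6443
    mode tcp
    default_backend kube-apiservers

backend kube-apiservers
    mode tcp
    balance roundrobin
    option httpchk GET /readyz
    http-check expect status 200
    server cp1 10.0.0.11:6443 check check-ssl verify none
    server cp2 10.0.0.12:6443 check check-ssl verify none
    server cp3 10.0.0.13:6443 check check-ssl verify none
EOF
systemctl restart haproxy
```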
Failure Scenarios
High availability protects against various failure scenarios:
Single Control Plane Node Failure
- API server: Load balancer routes to other API servers (no impact)
- Controller manager/scheduler: Another instance takes over via leader election (brief pause)
- etcd (stacked): Cluster continues with remaining etcd nodes (if quorum maintained)
etcd Node Failure
- 3-node etcd: Cluster continues with 2 nodes (quorum maintained)
- 5-node etcd: Cluster continues with 4 nodes (quorum maintained)
- If quorum is lost: etcd stops accepting writes and the cluster is effectively down
Load Balancer Failure
- A single load balancer is itself a single point of failure
- Mitigate with redundant load balancers or DNS-based failover
Availability Zone Failure
- Distribute control plane nodes across zones
- Distribute worker nodes across zones
- Use Pod Disruption Budgets to maintain application availability
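As an illustration of the last point, a PodDisruptionBudget limits how many replicas of an application can be taken down by voluntary disruptions such as node drains. The Deployment name and labels below are hypothetical.

```bash
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # always keep at least 2 matching pods running
  selector:
    matchLabels:
      app: web           # hypothetical application label
EOF
```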
HA Setup with kubeadm
kubeadm supports HA cluster setup:
- Initialize first control plane node - Creates certificates and initial configuration
- Copy certificates - Share certificates to other control plane nodes
- Join additional control plane nodes - Use kubeadm join with the --control-plane flag (see the command sketch after this list)
- Configure load balancer - Set up the load balancer pointing to all API server instances
- Update kubeconfig - Point to load balancer VIP instead of single node
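At the command level, the steps above look roughly like the following; the VIP, token, hash, and certificate key are placeholders for the values kubeadm prints during initialization.

```bash
# 1. Initialize the first control plane node behind the load balancer VIP
#    and upload the control plane certificates as a temporary, encrypted Secret.
kubeadm init --control-plane-endpoint "lb.example.com:6443" --upload-certs

# 2. On each additional control plane node, run the join command that
#    kubeadm init printed (values below are placeholders).
kubeadm join lb.example.com:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane \
  --certificate-key <certificate-key>
```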
Best Practices
- Use 3 or 5 etcd nodes - Odd member counts avoid tied votes and maximize fault tolerance per node
- Distribute across zones - Place nodes in different availability zones
- Monitor etcd health - Watch etcd cluster health and quorum status
- Test failure scenarios - Regularly test node failures to verify HA works
- Document procedures - Document how to add/remove control plane nodes
- Use external etcd for production - Better isolation than stacked topology
- Configure proper load balancing - Use health checks and proper algorithms
- Plan for upgrades - Upgrade HA clusters one node at a time
- Monitor leader election - Ensure controller manager and scheduler have leaders
- Backup etcd regularly - Even with HA, backups are essential
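A minimal backup sketch, assuming a kubeadm-managed etcd with default certificate paths; the endpoint and output path are placeholders:

```bash
# Take a point-in-time snapshot from one etcd member.
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable.
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db --write-out=table
```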
Topics
- etcd Topologies - Detailed guide to etcd deployment topologies
- Control Plane LB - Load balancing strategies for control plane
See Also
- High Availability Overview - HA concepts and principles
- etcd Basics - Understanding etcd architecture
- Kubeadm - Setting up HA clusters with kubeadm
- Backup & Restore - Backup considerations for HA clusters
- Pod Disruption Budgets - Maintaining application availability during disruptions