High Availability Overview
High availability (HA) ensures that Kubernetes clusters continue operating even when individual components fail. Understanding HA is crucial for production deployments because failures are inevitable: nodes fail, networks partition, disks get corrupted. HA design ensures your cluster survives these failures and continues serving workloads.
Think of high availability like redundancy in critical systems. Just as airplanes have multiple engines (so one failure doesn’t crash the plane), HA Kubernetes clusters have multiple control plane nodes, multiple etcd nodes, and workloads distributed across multiple worker nodes. If one component fails, others take over.
What is High Availability?
High availability means the system continues operating despite component failures. For Kubernetes, this means:
- Control plane redundancy - Multiple API servers, schedulers, controller managers
- etcd clustering - Multiple etcd nodes with replication
- Node distribution - Workloads spread across multiple nodes
- Load balancing - Traffic distributed across multiple instances
- Automatic recovery - Failed components are replaced automatically
Why HA Matters
Production clusters need HA because:
- Component failures - Hardware fails, software crashes
- Network issues - Network partitions, connectivity problems
- Maintenance - Need to update/reboot nodes without downtime
- Disaster recovery - Survive data center failures
- Service level agreements - Meet uptime requirements
Without HA, a single failure can take down your entire cluster.
Control Plane HA
The control plane is the “brain” of Kubernetes. Making it highly available is critical:
API Server HA
Multiple API servers behind a load balancer:
- Load balancer - Distributes requests across API servers
- Stateless - API servers are stateless, any can handle any request
- Health checks - Load balancer routes away from unhealthy servers
- Automatic failover - If one fails, others continue serving
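With kubeadm, this is usually wired up by pointing controlPlaneEndpoint at the load balancer, so every kubelet and client talks to the balanced address rather than an individual API server. A minimal sketch, assuming a hypothetical load balancer at lb.example.com:6443 (the apiVersion varies with your kubeadm release):

```yaml
# kubeadm-config.yaml, used with: kubeadm init --config kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0
# All clients and kubelets connect here instead of a single API server's IP,
# so any healthy API server behind the balancer can take the request.
controlPlaneEndpoint: "lb.example.com:6443"
```

Additional control plane nodes are then joined with kubeadm join --control-plane against the same endpoint.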
Scheduler HA
Multiple schedulers with leader election:
- Leader election - Only one scheduler is active at a time
- Automatic failover - If leader fails, another becomes leader
- No duplicate scheduling - Prevents conflicts
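Leader election is on by default for kube-scheduler, but it can be stated explicitly in the component configuration file passed via --config. A minimal sketch (the apiVersion depends on your Kubernetes release; the lease name and namespace shown are the defaults):

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: true              # only the lease holder schedules; the others stand by
  resourceNamespace: kube-system
  resourceName: kube-scheduler   # name of the Lease object used for election
```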
Controller Manager HA
Multiple controller managers with leader election:
- Leader election - Only one controller manager is active
- Automatic failover - If leader fails, another takes over
- Consistent state - Prevents duplicate actions
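kube-controller-manager uses the same Lease-based election, enabled with its --leader-elect flag. An abbreviated sketch of the static Pod manifest kubeadm writes to /etc/kubernetes/manifests/kube-controller-manager.yaml; real manifests carry many more flags, volume mounts, and probes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: registry.k8s.io/kube-controller-manager:v1.29.0
    command:
    - kube-controller-manager
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --leader-elect=true   # only the lease holder runs controllers; standbys take over on failure
```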
etcd Clustering
etcd stores all cluster state. Clustering etcd is essential for HA:
etcd Cluster
Typically 3 or 5 etcd nodes:
- 3 nodes - Can survive 1 node failure
- 5 nodes - Can survive 2 node failures
- Odd numbers - Recommended, because quorum is a majority vote and an even member count adds no extra failure tolerance
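As a quick check: quorum is a strict majority, floor(n/2) + 1 votes. A 3-node cluster needs 2 votes and tolerates 1 failure; a 5-node cluster needs 3 and tolerates 2. A 4-node cluster also needs 3 votes, so it still tolerates only 1 failure while adding another member that can fail, which is why even sizes are avoided.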
Raft Consensus
etcd uses Raft consensus:
- Leader - One node handles writes
- Followers - Replicate from leader
- Quorum - Majority must agree for writes
- Automatic leader election - A new leader is elected if the current leader fails
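Each member is started knowing the full initial membership, and Raft handles replication and elections from there. A minimal sketch of the bootstrap configuration for one member of a three-node cluster, using etcd's YAML config file; the names, IPs, and paths are placeholders:

```yaml
# /etc/etcd/etcd.conf.yaml for member etcd-1, started with: etcd --config-file /etc/etcd/etcd.conf.yaml
name: etcd-1
data-dir: /var/lib/etcd
listen-peer-urls: https://10.0.0.1:2380
initial-advertise-peer-urls: https://10.0.0.1:2380
listen-client-urls: https://10.0.0.1:2379,https://127.0.0.1:2379
advertise-client-urls: https://10.0.0.1:2379
# Every member lists the full initial cluster so Raft can form a quorum on first start.
initial-cluster: etcd-1=https://10.0.0.1:2380,etcd-2=https://10.0.0.2:2380,etcd-3=https://10.0.0.3:2380
initial-cluster-state: new
initial-cluster-token: etcd-cluster-1
```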
etcd Placement
For best HA:
- Separate nodes - Run etcd on dedicated nodes
- Separate zones - Distribute across availability zones
- Network isolation - Protect etcd network
- Regular backups - Back up etcd on a schedule, as sketched below
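One common way to automate backups is a CronJob that runs etcdctl snapshot save on a control plane node. A rough sketch, assuming a kubeadm layout where etcd listens on 127.0.0.1:2379 and its certificates live under /etc/kubernetes/pki/etcd; the image tag, schedule, and paths are assumptions to adapt, and production setups usually timestamp each snapshot and copy it off the node:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"                      # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true                    # reach etcd on the node's loopback address
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            effect: NoSchedule
          containers:
          - name: backup
            image: registry.k8s.io/etcd:3.5.9-0   # this image ships the etcdctl binary
            command:
            - etcdctl
            - --endpoints=https://127.0.0.1:2379
            - --cacert=/etc/kubernetes/pki/etcd/ca.crt
            - --cert=/etc/kubernetes/pki/etcd/server.crt
            - --key=/etc/kubernetes/pki/etcd/server.key
            - snapshot
            - save
            - /backup/etcd-snapshot.db         # overwritten each run in this sketch
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup
            hostPath:
              path: /var/backups/etcd
              type: DirectoryOrCreate
          restartPolicy: OnFailure
```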
Node Distribution
Distribute workloads across multiple nodes:
Multiple Worker Nodes
- Node redundancy - Multiple nodes run workloads
- Automatic rescheduling - Pods rescheduled if node fails
- Load distribution - Workloads spread across nodes
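The basic building block is a workload controller with more than one replica: if a node dies, the Deployment controller recreates the lost pods and the scheduler places them on the surviving nodes. A minimal sketch (names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3          # at least two, so a single node failure never drops you to zero
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25
        ports:
        - containerPort: 80
```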
Availability Zones
Distribute nodes across zones:
- Zone redundancy - Nodes in multiple zones
- Zone-aware scheduling - Spread pods across zones
- Survive zone failures - Cluster survives zone outages
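Zone-aware spreading can be expressed with topology spread constraints, assuming your nodes carry the standard topology.kubernetes.io/zone label (cloud providers set it automatically). A sketch:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-zonal
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-zonal
  template:
    metadata:
      labels:
        app: web-zonal
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                                # zones may differ by at most one pod
        topologyKey: topology.kubernetes.io/zone  # standard zone label on nodes
        whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but still schedule if a zone is down
        labelSelector:
          matchLabels:
            app: web-zonal
      containers:
      - name: web
        image: nginx:1.25
```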
Pod Disruption Budgets
Control pod evictions during maintenance:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```
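minAvailable can also be given as a percentage, or maxUnavailable can be used instead; either way, voluntary disruptions such as a node drain are refused whenever honoring them would violate the budget.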
Load Balancing
Distribute traffic across multiple instances:
Service Load Balancing
Kubernetes Services provide load balancing:
- ClusterIP - Internal load balancing
- NodePort - Expose on node ports
- LoadBalancer - Cloud load balancer integration
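For example, a Service of type LoadBalancer provisions an external load balancer through the cloud provider and spreads traffic across every ready pod matching its selector (names are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer   # ClusterIP and NodePort balance the same way, minus the external LB
  selector:
    app: web           # traffic is distributed across all ready pods with this label
  ports:
  - port: 80
    targetPort: 80
```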
Ingress Load Balancing
Ingress controllers provide HTTP load balancing:
- Multiple replicas - Ingress controller replicas
- Traffic distribution - Distribute across pods
- Health checks - Route away from unhealthy pods
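An Ingress then routes HTTP traffic to a Service, which balances it across the backing pods; the Ingress controller itself should also run several replicas. A minimal sketch, assuming an installed NGINX ingress controller and a hypothetical hostname:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
spec:
  ingressClassName: nginx        # assumes an NGINX ingress controller is installed
  rules:
  - host: app.example.com        # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web            # the Service from the previous example
            port:
              number: 80
```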
Failure Scenarios
HA design handles various failures:
Single Node Failure
- Worker node - Pods rescheduled to other nodes
- Control plane node - Remaining control plane nodes keep serving the API
- etcd node - Cluster continues if quorum maintained
Network Partition
- Split-brain prevention - etcd quorum prevents split-brain
- Partition handling - Majority partition continues
- Automatic recovery - Rejoins when network restored
Data Center Failure
- Multi-zone deployment - Survive zone failures
- Multi-region - Survive region failures (advanced)
- Disaster recovery - Backup and restore procedures
HA Architecture Patterns
Single Zone HA
- Multiple control plane nodes
- Multiple worker nodes
- etcd cluster
- Survives node failures
Multi-Zone HA
- Control plane across zones
- Worker nodes across zones
- etcd across zones
- Survives zone failures
HA Best Practices
Control Plane
- Multiple API servers - At least 3
- Load balancer - Distribute API server traffic
- Leader election - For scheduler and controller manager
- Health monitoring - Monitor all components
etcd
- Cluster size - 3 or 5 nodes
- Separate nodes - Dedicated etcd nodes
- Regular backups - Back up etcd regularly
- Monitor health - Monitor etcd cluster health
Workloads
- Multiple replicas - Run multiple pod replicas
- Pod Disruption Budgets - Control evictions
- Anti-affinity - Spread pods across nodes
- Health checks - Liveness and readiness probes
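A pod template can combine several of these practices: required anti-affinity keeps replicas on different nodes, and probes let the control plane and Services react to unhealthy pods. A sketch in which the image, port, and /healthz path are assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname   # no two replicas on the same node
            labelSelector:
              matchLabels:
                app: api
      containers:
      - name: api
        image: registry.example.com/api:1.0       # placeholder image
        readinessProbe:                           # gate Service traffic until the pod is ready
          httpGet:
            path: /healthz
            port: 8080
        livenessProbe:                            # restart the container if it stops responding
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
```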
Networking
- Load balancing - Use Services and Ingress
- Health checks - Route away from unhealthy pods
- Multiple endpoints - Keep more than one ready endpoint behind each Service so traffic can fail over
Monitoring HA
Monitor HA health:
- Control plane status - All components healthy
- etcd health - Cluster quorum maintained
- Node status - Sufficient healthy nodes
- Pod distribution - Workloads properly distributed
- Service availability - Services responding
Key Takeaways
- High availability ensures clusters survive component failures
- Control plane HA requires multiple API servers, schedulers, controller managers
- etcd clustering (3 or 5 nodes) provides state storage HA
- Node distribution spreads workloads across multiple nodes
- Load balancing distributes traffic across instances
- HA handles node failures, network partitions, and zone failures
- Follow HA best practices for production clusters
- Monitor HA health continuously
See Also
- Kubernetes Architecture - How components fit together
- Control Plane Components - Control plane details
- etcd Basics - etcd clustering
- High Availability - Detailed HA configuration
- etcd Topologies - etcd HA patterns
- Control Plane LB - Load balancing control plane