Control Plane
The Kubernetes control plane is the brain of your cluster: it stores cluster state, schedules pods, and serves API requests. When control plane components fail, existing workloads usually keep running, but you lose the ability to schedule, scale, or change anything in the cluster. This guide covers diagnosing and troubleshooting control plane issues.
Control Plane Components
The control plane consists of several critical components:
API Server (kube-apiserver)
- Handles all API requests
- Validates and processes requests
- Communicates with etcd
- Exposes Kubernetes API
etcd
- Distributed key-value store
- Stores all cluster state
- Provides consistency guarantees
- Critical for cluster operation
Scheduler (kube-scheduler)
- Assigns pods to nodes
- Considers resource requirements
- Evaluates constraints and preferences
- Watches for unscheduled pods
Controller Manager (kube-controller-manager)
- Runs controllers (ReplicaSet, Deployment, etc.)
- Reconciles desired vs actual state
- Handles scaling, updates, etc.
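On kubeadm-provisioned clusters these components typically run as static pods in the kube-system namespace. A quick way to confirm they are all present and running, assuming that layout (managed services such as GKE or EKS hide the control plane from you):
# List control plane pods (kubeadm-style clusters)
kubectl get pods -n kube-system -o wide | grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'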
Common Control Plane Issues
API Server Down
Symptoms:
- kubectl commands fail with "connection refused" errors
- Dashboard inaccessible
- Cluster appears unresponsive
Diagnosis:
# Check API server status
kubectl get --raw /healthz
# Check API server pods
kubectl get pods -n kube-system | grep kube-apiserver
# Check API server logs
kubectl logs -n kube-system -l component=kube-apiserver
# Check API server events
kubectl get events -n kube-system | grep kube-apiserver
Common Causes:
- API server pod crashed
- etcd connectivity issues
- Resource exhaustion
- Configuration errors
- Certificate issues
Solutions:
- Check API server logs
- Verify etcd connectivity
- Check resource limits
- Verify certificates (see the expiry check after this list)
- Restart API server pod
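For the certificate item above, a quick expiry check is often enough to rule certificates in or out. The commands below assume a kubeadm cluster with the default certificate layout:
# Check control plane certificate expiry (recent kubeadm versions)
kubeadm certs check-expiration
# Or inspect the API server certificate directly
openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt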
etcd Problems
Symptoms:
- API server errors
- Cluster state inconsistencies
- Slow API responses
- etcd pod failures
Diagnosis:
# Check etcd pods
kubectl get pods -n kube-system | grep etcd
# Check etcd logs
kubectl logs -n kube-system -l component=etcd
# Check etcd status (if access is available; on kubeadm clusters, client cert flags are usually also required, see etcd Health below)
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl endpoint health
# Check etcd events
kubectl get events -n kube-system | grep etcd
Common Causes:
- etcd pod failures
- Disk space issues
- Network problems
- Corrupted data
- Resource exhaustion
Solutions:
- Check etcd logs
- Verify disk space (see the check after this list)
- Check network connectivity
- Backup and restore etcd
- Scale etcd cluster
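For the disk space item above, checking the etcd data directory directly from the node is often the fastest route. The path below assumes the kubeadm default of /var/lib/etcd:
# Check free space on the filesystem holding the etcd data directory (run on the etcd node)
df -h /var/lib/etcd
# Check the size of the etcd database itself
du -sh /var/lib/etcd/member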
Scheduler Issues
Symptoms:
- Pods stuck in Pending
- Scheduling errors
- No pods being scheduled
Diagnosis:
# Check scheduler pod
kubectl get pods -n kube-system | grep kube-scheduler
# Check scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler
# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending
# Check scheduler events
kubectl get events -n kube-system | grep scheduler
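To see why specific pods are stuck in Pending, the scheduler records FailedScheduling events on the pods themselves. A quick way to surface them (the pod name and namespace below are placeholders):
# Show scheduling failures across the cluster
kubectl get events --all-namespaces --field-selector reason=FailedScheduling
# Inspect the scheduling events for one pending pod
kubectl describe pod <pending-pod-name> -n <namespace>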
Common Causes:
- Scheduler pod crashed
- Configuration errors
- Resource constraints
- Node unavailability
Solutions:
- Check scheduler logs
- Verify scheduler configuration
- Check node availability
- Restart scheduler pod
Controller Manager Issues
Symptoms:
- Resources not being created/updated
- Scaling not working
- Deployment rollouts stuck
- State not being reconciled
Diagnosis:
# Check controller manager pod
kubectl get pods -n kube-system | grep kube-controller-manager
# Check controller manager logs
kubectl logs -n kube-system -l component=kube-controller-manager
# Check controller manager events
kubectl get events -n kube-system | grep controller-manager
Common Causes:
- Controller manager pod crashed
- API server connectivity issues
- Configuration errors
- Resource constraints
Solutions:
- Check controller manager logs
- Verify API server connectivity
- Check configuration
- Restart controller manager pod
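One quick way to confirm whether the controller manager is actually reconciling is to compare a Deployment's generation with its observedGeneration; if observedGeneration lags behind after a change, the deployment controller is not processing updates. The deployment name and namespace below are placeholders:
# Compare desired vs observed generation for a deployment
kubectl get deployment <deployment-name> -n <namespace> \
  -o jsonpath='{.metadata.generation}{" "}{.status.observedGeneration}{"\n"}'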
Diagnostic Commands
Health Checks
API Server Health
# Basic health check
kubectl get --raw /healthz
# Readiness check
kubectl get --raw /readyz
# Liveness check
kubectl get --raw /livez
# Verbose health check
kubectl get --raw /healthz?verbose
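The verbose endpoints list each individual check, which makes it possible to narrow down a failing readiness check. The exclude query parameter (supported on /readyz and /livez) lets you confirm whether a single check, such as etcd, is the culprit:
# List each readiness check individually
kubectl get --raw '/readyz?verbose'
# Re-run readiness while excluding the etcd check
kubectl get --raw '/readyz?verbose&exclude=etcd'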
etcd Health
# Check etcd health (if access available)
kubectl exec -n kube-system etcd-<pod-name> -- \
etcdctl endpoint health
# Check etcd member list
kubectl exec -n kube-system etcd-<pod-name> -- \
etcdctl member list
# Check etcd status
kubectl exec -n kube-system etcd-<pod-name> -- \
etcdctl endpoint status
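On kubeadm clusters the etcd pod requires client certificates, so the bare etcdctl calls above will often fail with a TLS error. A fuller invocation, assuming the default kubeadm certificate paths inside the etcd pod:
# etcd health check with client certificates (kubeadm default paths)
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health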
Component Status
# Check component status (deprecated since Kubernetes v1.19; results may be unreliable)
kubectl get componentstatuses
# Check component status details
kubectl get componentstatuses -o yaml
Cluster Info
# Cluster information
kubectl cluster-info
# Cluster info dump
kubectl cluster-info dump
# Check API versions
kubectl api-versions
# Check API resources
kubectl api-resources
Log Locations and Access
API Server Logs
# Get API server logs
kubectl logs -n kube-system -l component=kube-apiserver
# Follow API server logs
kubectl logs -f -n kube-system -l component=kube-apiserver
# Get logs from specific pod
kubectl logs -n kube-system kube-apiserver-<pod-name>
# Get previous instance logs
kubectl logs -n kube-system -l component=kube-apiserver --previous
etcd Logs
# Get etcd logs
kubectl logs -n kube-system -l component=etcd
# Follow etcd logs
kubectl logs -f -n kube-system -l component=etcd
# Get logs from specific pod
kubectl logs -n kube-system etcd-<pod-name>
Scheduler Logs
# Get scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler
# Follow scheduler logs
kubectl logs -f -n kube-system -l component=kube-scheduler
Controller Manager Logs
# Get controller manager logs
kubectl logs -n kube-system -l component=kube-controller-manager
# Follow controller manager logs
kubectl logs -f -n kube-system -l component=kube-controller-manager
Direct Log Access (Node Access Required)
If you have node access and the control plane components run as systemd services (rather than as static pods, which is the kubeadm default):
# API server logs (systemd)
journalctl -u kube-apiserver -n 100
# etcd logs (systemd)
journalctl -u etcd -n 100
# Scheduler logs (systemd)
journalctl -u kube-scheduler -n 100
# Controller manager logs (systemd)
journalctl -u kube-controller-manager -n 100
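On kubeadm clusters the control plane components run as static pods rather than systemd services, so the journalctl units above will not exist; their logs are available through the container runtime instead. A sketch using crictl from the node:
# Find the API server container via the container runtime
crictl ps --name kube-apiserver
# Tail its logs (container ID comes from the previous command)
crictl logs --tail 100 <container-id>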
Recovery Procedures
API Server Recovery
- Check API server status:
kubectl get pods -n kube-system | grep kube-apiserver
- Check logs:
kubectl logs -n kube-system -l component=kube-apiserver --tail=100
- Restart the API server (for static pods, see the note after these steps):
kubectl delete pod -n kube-system -l component=kube-apiserver
- Verify recovery:
kubectl get --raw /healthz
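Note that on kubeadm clusters the API server is a static pod: deleting it with kubectl only removes the mirror pod object, and the kubelet immediately recreates it without restarting the underlying container. To force a real restart, temporarily move the static pod manifest out of the manifests directory (paths assume the kubeadm default; the same approach applies to the scheduler and controller manager):
# Move the static pod manifest away so the kubelet stops the API server
mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
# Give the kubelet time to notice the change and stop the container
sleep 20
# Move the manifest back; the kubelet starts a fresh API server container
mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
This also works when the API server itself is down and kubectl is unavailable.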
etcd Recovery
- Check etcd status:
kubectl get pods -n kube-system | grep etcd
- Check etcd health:
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl endpoint health
- Back up etcd (see the certificate-aware example after these steps):
kubectl exec -n kube-system etcd-<pod-name> -- \
etcdctl snapshot save /backup/etcd-snapshot.db
- Restore etcd (if needed). A restore writes a fresh data directory and should be run with the etcd member stopped (for example directly on the host), not via kubectl exec into a running pod:
# Restore to a new data directory (path is an example), then point the etcd manifest at it
etcdctl snapshot restore /backup/etcd-snapshot.db --data-dir /var/lib/etcd-restored
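As with the health checks, the backup command above usually needs client certificates on a kubeadm cluster. A fuller backup example, assuming the default kubeadm paths (the /backup path is carried over from the steps above and must map to storage that survives the pod):
# Take an etcd snapshot with client certificates (kubeadm default paths)
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot.db
# Verify the snapshot
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl snapshot status /backup/etcd-snapshot.db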
Scheduler Recovery
- Check scheduler status:
kubectl get pods -n kube-system | grep kube-scheduler
- Check logs:
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
- Restart scheduler:
kubectl delete pod -n kube-system -l component=kube-scheduler
High Availability Considerations
API Server HA
- Run multiple API server replicas
- Use load balancer for API server endpoints
- Ensure etcd is accessible to all API servers
etcd HA
- Run odd number of etcd members (3, 5, 7)
- Distribute etcd across availability zones
- Regular backups
- Monitor etcd health
Scheduler HA
- Run multiple scheduler replicas (leader election)
- Ensure API server connectivity
Controller Manager HA
- Run multiple controller manager replicas (leader election)
- Ensure API server connectivity
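To verify an HA setup from inside the cluster, you can check how many API server endpoints are registered and, on recent versions, which replicas currently hold the scheduler and controller manager leader leases:
# List the registered API server endpoints
kubectl get endpoints kubernetes -n default -o yaml
# Check which replicas hold the leader election leases
kubectl get lease -n kube-system kube-scheduler kube-controller-manager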
Best Practices
- Monitor control plane components - Set up monitoring and alerting
- Regular backups - Back up etcd regularly
- Resource limits - Set appropriate resource limits for control plane pods
- Health checks - Regularly check component health
- Log aggregation - Collect and analyze control plane logs
- Documentation - Document recovery procedures
- Testing - Test recovery procedures regularly
Troubleshooting Checklist
- Check API server health (/healthz, /readyz)
- Verify control plane pods are running
- Check component logs for errors
- Verify etcd health and connectivity
- Check resource usage on control plane nodes
- Verify network connectivity between components
- Check certificates and authentication
- Review recent configuration changes
- Check for resource exhaustion
- Verify high availability setup (if applicable)
See Also
- Kubelet & CRI - Troubleshooting kubelet
- Clusters & Nodes - General node troubleshooting
- High Availability - HA setup and considerations