Control Plane

The Kubernetes control plane is the brain of your cluster: it manages cluster state, schedules pods, and handles API requests. When control plane components fail, the effects can ripple across the entire cluster. This guide covers diagnosing and troubleshooting control plane issues.

Control Plane Components

The control plane consists of several critical components:

graph TB
  A[Control Plane] --> B[API Server]
  A --> C[etcd]
  A --> D[Scheduler]
  A --> E[Controller Manager]
  B --> B1[API Requests]
  B --> B2[Authentication]
  B --> B3[Authorization]
  C --> C1[Cluster State]
  C --> C2[Configuration]
  D --> D1[Pod Scheduling]
  E --> E1[Controllers]
  E --> E2[Reconciliation]
  style A fill:#e1f5ff
  style B fill:#e8f5e9
  style C fill:#fff4e1
  style D fill:#f3e5f5
  style E fill:#ffe1e1

API Server (kube-apiserver)

  • Handles all API requests
  • Validates and processes requests
  • Communicates with etcd
  • Exposes Kubernetes API
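
Every kubectl command ends up as one or more HTTP requests against this component. A quick way to see that in action (purely illustrative) is to raise the client verbosity so the request URLs are printed:

# Show the HTTP requests kubectl sends to the API server
kubectl get pods -v=6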

etcd

  • Distributed key-value store
  • Stores all cluster state
  • Provides consistency guarantees
  • Critical for cluster operation
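
Everything the API server persists lands under the /registry prefix in etcd. As a hedged illustration, assuming a kubeadm-style etcd static pod and the default kubeadm certificate paths, you can list a few of those keys directly:

# List a handful of the keys the API server stores in etcd (kubeadm certificate paths assumed)
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  get /registry --prefix --keys-only --limit=5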

Scheduler (kube-scheduler)

  • Assigns pods to nodes
  • Considers resource requirements
  • Evaluates constraints and preferences
  • Watches for unscheduled pods
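
The scheduler records its decision as spec.nodeName on the pod and emits a Scheduled (or FailedScheduling) event. A quick way to see both, with the pod name as a placeholder:

# Which node did the scheduler pick?
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}{"\n"}'

# Scheduling decisions and failures show up as events on the pod
kubectl get events --field-selector involvedObject.name=<pod-name>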

Controller Manager (kube-controller-manager)

  • Runs controllers (ReplicaSet, Deployment, etc.)
  • Reconciles desired vs actual state
  • Handles scaling, updates, etc.
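
Reconciliation is easiest to see by changing desired state and watching the controllers close the gap; a small illustration (the deployment name is a placeholder):

# Change desired state...
kubectl scale deployment <deployment-name> --replicas=5

# ...and watch the Deployment and ReplicaSet controllers drive actual state toward it
kubectl get replicaset -w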

Common Control Plane Issues

API Server Down

Symptoms:

  • kubectl commands fail
  • “connection refused” errors
  • Dashboard inaccessible
  • Cluster appears unresponsive

Diagnosis:

# Check API server status
kubectl get --raw /healthz

# Check API server pods
kubectl get pods -n kube-system | grep kube-apiserver

# Check API server logs
kubectl logs -n kube-system -l component=kube-apiserver

# Check API server events
kubectl get events -n kube-system | grep kube-apiserver

Common Causes:

  • API server pod crashed
  • etcd connectivity issues
  • Resource exhaustion
  • Configuration errors
  • Certificate issues

Solutions:

  • Check API server logs
  • Verify etcd connectivity
  • Check resource limits
  • Verify certificates
  • Restart API server pod
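
Certificate expiry is a frequent cause on kubeadm-managed clusters; a hedged sketch of checking it, assuming kubeadm default paths and access to a control plane node:

# Check control plane certificate expiry with kubeadm
kubeadm certs check-expiration

# Or inspect the API server certificate directly
openssl x509 -noout -dates -in /etc/kubernetes/pki/apiserver.crt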

etcd Problems

Symptoms:

  • API server errors
  • Cluster state inconsistencies
  • Slow API responses
  • etcd pod failures

Diagnosis:

# Check etcd pods
kubectl get pods -n kube-system | grep etcd

# Check etcd logs
kubectl logs -n kube-system -l component=etcd

# Check etcd status (if access available)
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl endpoint health

# Check etcd events
kubectl get events -n kube-system | grep etcd

Common Causes:

  • etcd pod failures
  • Disk space issues
  • Network problems
  • Corrupted data
  • Resource exhaustion

Solutions:

  • Check etcd logs
  • Verify disk space
  • Check network connectivity
  • Backup and restore etcd
  • Scale etcd cluster
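
A full disk or an oversized etcd database is a common root cause; a quick check, assuming node access and the default kubeadm data directory /var/lib/etcd:

# Free space on the etcd data volume (run on the control plane node)
df -h /var/lib/etcd

# Size of the etcd database file
du -sh /var/lib/etcd/member/snap/db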

Scheduler Issues

Symptoms:

  • Pods stuck in Pending
  • Scheduling errors
  • No pods being scheduled

Diagnosis:

# Check scheduler pod
kubectl get pods -n kube-system | grep kube-scheduler

# Check scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler

# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Check scheduler events
kubectl get events -n kube-system | grep scheduler

Common Causes:

  • Scheduler pod crashed
  • Configuration errors
  • Resource constraints
  • Node unavailability

Solutions:

  • Check scheduler logs
  • Verify scheduler configuration
  • Check node availability
  • Restart scheduler pod
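
When pods sit in Pending, the scheduler usually explains why in the pod's events; two ways to surface those messages (the pod name is a placeholder):

# FailedScheduling events explain why a pod cannot be placed
kubectl describe pod <pod-name>

# Or pull scheduling failures across the whole cluster
kubectl get events --all-namespaces --field-selector reason=FailedScheduling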

Controller Manager Issues

Symptoms:

  • Resources not being created/updated
  • Scaling not working
  • Deployment rollouts stuck
  • State not being reconciled

Diagnosis:

# Check controller manager pod
kubectl get pods -n kube-system | grep kube-controller-manager

# Check controller manager logs
kubectl logs -n kube-system -l component=kube-controller-manager

# Check controller manager events
kubectl get events -n kube-system | grep controller-manager

Common Causes:

  • Controller manager pod crashed
  • API server connectivity issues
  • Configuration errors
  • Resource constraints

Solutions:

  • Check controller manager logs
  • Verify API server connectivity
  • Check configuration
  • Restart controller manager pod
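
In HA setups only one controller manager instance is active at a time, so it is worth confirming which replica currently holds the leader lease (recent Kubernetes versions record this in a Lease object in kube-system):

# The holderIdentity field names the active controller manager instance
kubectl get lease -n kube-system kube-controller-manager \
  -o jsonpath='{.spec.holderIdentity}{"\n"}'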

Diagnostic Commands

Health Checks

API Server Health

# Basic health check
kubectl get --raw /healthz

# Readiness check
kubectl get --raw /readyz

# Liveness check
kubectl get --raw /livez

# Verbose health check
kubectl get --raw '/healthz?verbose'

etcd Health

# Check etcd health (if access available)
kubectl exec -n kube-system etcd-<pod-name> -- \
  etcdctl endpoint health

# Check etcd member list
kubectl exec -n kube-system etcd-<pod-name> -- \
  etcdctl member list

# Check etcd status
kubectl exec -n kube-system etcd-<pod-name> -- \
  etcdctl endpoint status
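
On kubeadm-managed clusters etcd serves clients over TLS only, so the bare etcdctl commands above typically fail with connection errors; a hedged variant that passes the default kubeadm certificate paths:

# etcd health check with the TLS flags kubeadm-managed etcd usually requires
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# The same flags apply to member list and endpoint status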

Component Status

# Check component status (deprecated since Kubernetes v1.19; prefer /healthz, /readyz, /livez)
kubectl get componentstatuses

# Check component status details
kubectl get componentstatuses -o yaml

Cluster Info

# Cluster information
kubectl cluster-info

# Cluster info dump
kubectl cluster-info dump

# Check API versions
kubectl api-versions

# Check API resources
kubectl api-resources

Log Locations and Access

API Server Logs

# Get API server logs
kubectl logs -n kube-system -l component=kube-apiserver

# Follow API server logs
kubectl logs -f -n kube-system -l component=kube-apiserver

# Get logs from specific pod
kubectl logs -n kube-system kube-apiserver-<pod-name>

# Get previous instance logs
kubectl logs -n kube-system -l component=kube-apiserver --previous

etcd Logs

# Get etcd logs
kubectl logs -n kube-system -l component=etcd

# Follow etcd logs
kubectl logs -f -n kube-system -l component=etcd

# Get logs from specific pod
kubectl logs -n kube-system etcd-<pod-name>

Scheduler Logs

# Get scheduler logs
kubectl logs -n kube-system -l component=kube-scheduler

# Follow scheduler logs
kubectl logs -f -n kube-system -l component=kube-scheduler

Controller Manager Logs

# Get controller manager logs
kubectl logs -n kube-system -l component=kube-controller-manager

# Follow controller manager logs
kubectl logs -f -n kube-system -l component=kube-controller-manager

Direct Log Access (Node Access Required)

If you have node access:

# API server logs (systemd)
journalctl -u kube-apiserver -n 100

# etcd logs (systemd)
journalctl -u etcd -n 100

# Scheduler logs (systemd)
journalctl -u kube-scheduler -n 100

# Controller manager logs (systemd)
journalctl -u kube-controller-manager -n 100
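
The journalctl units above exist only when the components are run directly by systemd. On kubeadm clusters the control plane runs as static pods, so the logs live with the container runtime instead; a sketch assuming containerd with crictl installed on the node:

# Find the control plane container via the container runtime
crictl ps --name kube-apiserver

# Tail its log directly, which works even while the API server is down
crictl logs --tail=100 <container-id>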

Recovery Procedures

API Server Recovery

  1. Check API server status:
kubectl get pods -n kube-system | grep kube-apiserver
  2. Check logs:
kubectl logs -n kube-system -l component=kube-apiserver --tail=100
  3. Restart the API server (on kubeadm clusters it is a static pod, so if deleting the mirror Pod does not help, move its manifest out of /etc/kubernetes/manifests and back on the node):
kubectl delete pod -n kube-system -l component=kube-apiserver
  4. Verify recovery:
kubectl get --raw /healthz

etcd Recovery

  1. Check etcd status:
kubectl get pods -n kube-system | grep etcd
  2. Check etcd health:
kubectl exec -n kube-system etcd-<pod-name> -- etcdctl endpoint health
  3. Back up etcd (write the snapshot to a host-mounted path so it survives a pod restart):
kubectl exec -n kube-system etcd-<pod-name> -- \
  etcdctl snapshot save /backup/etcd-snapshot.db
  4. Restore etcd (if needed). A restore cannot be run against a live member; etcdctl snapshot restore writes the snapshot into a new data directory, so it is performed on the node with etcd stopped, as in the sketch below.
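
A hedged sketch of the usual single-member restore flow on a kubeadm control plane node (paths assume kubeadm defaults and that etcdctl is available on the node):

# 1. Stop the etcd static pod by moving its manifest out of the watched directory
mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml

# 2. Restore the snapshot into a fresh data directory
etcdctl snapshot restore /backup/etcd-snapshot.db --data-dir=/var/lib/etcd-restored

# 3. Point the manifest's hostPath volume at the restored directory (or swap the
#    directories), then move the manifest back to restart etcd
mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml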

Scheduler Recovery

  1. Check scheduler status:
kubectl get pods -n kube-system | grep kube-scheduler
  2. Check logs:
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
  3. Restart scheduler:
kubectl delete pod -n kube-system -l component=kube-scheduler

High Availability Considerations

API Server HA

  • Run multiple API server replicas
  • Use load balancer for API server endpoints
  • Ensure etcd is accessible to all API servers

etcd HA

  • Run an odd number of etcd members (3, 5, or 7)
  • Distribute etcd across availability zones
  • Regular backups
  • Monitor etcd health

Scheduler HA

  • Run multiple scheduler replicas (leader election)
  • Ensure API server connectivity

Controller Manager HA

  • Run multiple controller manager replicas (leader election)
  • Ensure API server connectivity
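
A quick way to confirm the HA layout is to list each control plane pod together with the node it runs on:

# One line per control plane pod, with the node it is scheduled to
kubectl get pods -n kube-system -o wide | \
  grep -E 'kube-apiserver|kube-controller-manager|kube-scheduler|etcd'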

Best Practices

  1. Monitor control plane components - Set up monitoring and alerting

  2. Regular backups - Backup etcd regularly

  3. Resource limits - Set appropriate resource limits for control plane pods

  4. Health checks - Regularly check component health

  5. Log aggregation - Collect and analyze control plane logs

  6. Documentation - Document recovery procedures

  7. Testing - Test recovery procedures regularly

Troubleshooting Checklist

  • Check API server health (/healthz, /readyz)
  • Verify control plane pods are running
  • Check component logs for errors
  • Verify etcd health and connectivity
  • Check resource usage on control plane nodes
  • Verify network connectivity between components
  • Check certificates and authentication
  • Review recent configuration changes
  • Check for resource exhaustion
  • Verify high availability setup (if applicable)

See Also