Clusters & Nodes
Troubleshooting cluster and node issues is critical for maintaining a healthy Kubernetes environment. This guide covers common node problems, diagnostic workflows, and essential commands for investigating cluster and node-level issues.
Overview
Cluster and node issues can affect many workloads at once, so they call for systematic investigation.
Common Node Issues
Node Not Ready
A node in the NotReady state cannot accept new pods, and the pods already running on it may not function correctly.
Symptoms:
- Node status shows NotReady
- Pods cannot be scheduled to the node
- Existing pods on the node may be unhealthy
Common Causes:
- Kubelet not running or failing
- Network connectivity issues
- Container runtime problems
- Resource exhaustion (CPU, memory, disk)
- System issues (kernel, drivers)
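If any of these causes is suspected, a quick first pass is to list only the nodes that are not Ready. The one-liner below is a minimal sketch using kubectl's JSONPath support; the output layout is just one convenient option.
# List nodes whose Ready condition is not "True" (sketch; output layout is one of several options)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
  | awk '$2 != "True"'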
Resource Exhaustion
Nodes can run out of resources, affecting pod scheduling and execution.
Types:
- CPU exhaustion - All CPU cores fully utilized
- Memory pressure - Insufficient memory
- Disk pressure - Disk space full
- PID exhaustion - Too many processes
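To gauge how close a node is to exhaustion, compare what pods have already requested against the node's allocatable resources. The command below is a simple sketch that relies on the Allocated resources section printed by kubectl describe node; the grep window size is only a convenience.
# Show summed pod requests/limits versus allocatable resources
kubectl describe node <node-name> | grep -A 10 "Allocated resources"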
Network Problems
Network issues can prevent:
- Pod-to-pod communication
- Node-to-node communication
- External connectivity
- Service discovery
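A quick way to exercise pod networking and service discovery from inside the cluster is to run a throwaway pod and resolve the kubernetes.default Service. This is a minimal sketch; it assumes the busybox image can be pulled onto the node.
# Run a short-lived pod and test in-cluster DNS (assumes busybox is pullable)
kubectl run net-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default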
Diagnostic Workflow
Step 1: Check Node Status
# List all nodes and their status
kubectl get nodes
# Get detailed node information
kubectl describe node <node-name>
# Check node conditions
kubectl get node <node-name> -o yaml | grep -A 5 conditions
Step 2: Check Node Conditions
Node conditions indicate node health:
# Check node conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions}'
Common conditions:
- Ready - Node is healthy and ready to accept pods
- MemoryPressure - Node has insufficient memory
- DiskPressure - Node has insufficient disk space
- PIDPressure - Node has too many processes
- NetworkUnavailable - Node network is not configured correctly
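The raw JSONPath dump above returns JSON objects; for a more readable per-condition view of a single node, the loop below prints each condition as type=status. This is only a formatting convenience, not a different data source.
# Print each condition as type=status for easier scanning
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'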
Step 3: Check Node Events
# Get events for node
kubectl get events --field-selector involvedObject.name=<node-name>
# Get recent events
kubectl describe node <node-name> | grep -A 10 Events
Step 4: Check Resource Usage
# Check node resource usage
kubectl top node <node-name>
# Check node capacity and allocatable
kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"
Step 5: Check Pods on Node
# List pods on node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
# Check for problematic pods
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name>,status.phase!=Running
Essential Node Commands
Node Information
# Get node details
kubectl describe node <node-name>
# Get node YAML
kubectl get node <node-name> -o yaml
# Get node labels
kubectl get node <node-name> --show-labels
# Get node annotations
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'
Resource Information
# Node resource usage
kubectl top node <node-name>
# Node capacity
kubectl get node <node-name> -o jsonpath='{.status.capacity}'
# Allocatable resources
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
Pod Information
# Pods on node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# Non-running pods on node
kubectl get pods --all-namespaces \
--field-selector spec.nodeName=<node-name>,status.phase!=Running
Common Node Problems
Node Not Ready
Diagnosis:
# Check node status
kubectl get node <node-name>
# Check node conditions
kubectl describe node <node-name>
# Check kubelet status (requires node access)
ssh <node> systemctl status kubelet
# Check kubelet logs
ssh <node> journalctl -u kubelet -n 50
Solutions:
- Restart kubelet: systemctl restart kubelet
- Check network connectivity
- Verify container runtime is running
- Check for resource exhaustion
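After applying a fix, the node should transition back to Ready within a minute or two. The commands below sketch how to verify recovery; they assume containerd is the container runtime, so substitute your runtime's systemd unit if it differs.
# Watch the node until it returns to Ready
kubectl get node <node-name> -w
# Verify the container runtime is healthy (assumes containerd; adjust the unit name if needed)
ssh <node> systemctl status containerd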
Memory Pressure
Symptoms:
- Node condition shows MemoryPressure=True
- Pods evicted due to memory pressure
- OOM (Out of Memory) events
Diagnosis:
# Check memory usage
kubectl top node <node-name>
# Check node conditions
kubectl describe node <node-name> | grep MemoryPressure
# Check for evicted pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> \
| grep Evicted
Solutions:
- Free up memory (remove unnecessary pods)
- Increase node memory
- Set appropriate resource limits on pods
- Use resource quotas
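One way to put memory requests and limits in place without hand-editing manifests is kubectl set resources. The deployment name and values below are placeholders, shown only to illustrate the syntax.
# Apply memory requests/limits to an existing deployment (name and values are placeholders)
kubectl set resources deployment <deployment-name> \
  --requests=memory=256Mi --limits=memory=512Mi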
Disk Pressure
Symptoms:
- Node condition shows DiskPressure=True
- Image pull failures
- Pod creation failures
Diagnosis:
# Check disk usage (requires node access)
ssh <node> df -h
# Check container log usage (quote the glob so it expands on the node, not locally)
ssh <node> 'du -sh /var/log/pods/*'
# Check node conditions
kubectl describe node <node-name> | grep DiskPressure
Solutions:
- Free up disk space
- Clean up unused images: docker image prune or crictl rmi --prune (see the sketch below)
- Clean up old logs
- Increase node disk size
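On the node itself, the commands below sketch a typical cleanup pass. They assume a containerd-based node with crictl installed and systemd journald; adjust the tools and paths to your environment.
# Remove unused container images (assumes crictl is installed on the node)
ssh <node> crictl rmi --prune
# Trim the systemd journal to a fixed size (assumes systemd journald)
ssh <node> journalctl --vacuum-size=500M
# Re-check disk usage afterwards
ssh <node> df -h /var/lib /var/log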
CPU Exhaustion
Symptoms:
- High CPU usage on node
- Slow pod scheduling
- Pod performance degradation
Diagnosis:
# Check CPU usage
kubectl top node <node-name>
# Check CPU requests/limits
kubectl describe node <node-name> | grep -A 10 "Non-terminated Pods"
Solutions:
- Identify high CPU pods
- Set appropriate CPU limits
- Scale out to more nodes
- Upgrade node CPU capacity
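To identify the high-CPU pods, kubectl top can sort pods by CPU cluster-wide; this requires metrics-server, and the head count below is arbitrary.
# Show the most CPU-hungry pods cluster-wide (requires metrics-server)
kubectl top pods --all-namespaces --sort-by=cpu | head -n 15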
Network Issues
Symptoms:
- Pods cannot communicate
- Service discovery failures
- Node network unreachable
Diagnosis:
# Check node network status
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")]}'
# Test pod-to-pod connectivity
kubectl run test-pod --image=busybox --rm -it --restart=Never -- ping -c 3 <target-pod-ip>
# Check CNI plugin
kubectl get pods -n kube-system | grep cni
Solutions:
- Check CNI plugin status
- Verify network policies
- Check firewall rules
- Verify node network configuration
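The checks below sketch the first two items. The CNI pod name patterns are only examples, since they vary by plugin (Calico, Flannel, Cilium, Weave, and so on).
# Confirm the CNI plugin pods are running (pod name patterns vary by plugin)
kubectl get pods -n kube-system -o wide | grep -E 'calico|flannel|cilium|weave'
# List network policies that could be blocking traffic
kubectl get networkpolicies --all-namespaces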
Node Maintenance
Cordon Node
Prevent new pods from being scheduled:
# Cordon node
kubectl cordon <node-name>
# Check node status (SchedulingDisabled)
kubectl get node <node-name>
Drain Node
Safely evict pods from node:
# Drain node
kubectl drain <node-name>
# Drain with ignore daemonsets
kubectl drain <node-name> --ignore-daemonsets
# Drain with delete local data
kubectl drain <node-name> --delete-emptydir-data
Uncordon Node
Allow scheduling after maintenance:
# Uncordon node
kubectl uncordon <node-name>
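Putting the three commands together, a typical maintenance pass looks like the sketch below; the drain flags match the examples above, and the maintenance step itself is whatever work the node needs.
# Stop new scheduling, evict workloads, perform maintenance, then re-enable scheduling
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# ... perform maintenance on the node (reboot, upgrade, etc.) ...
kubectl uncordon <node-name>
kubectl get node <node-name>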
Troubleshooting Workflow
Complete Node Troubleshooting
# Step 1: Check node status
kubectl get nodes
kubectl describe node <node-name>
# Step 2: Check node conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions}'
# Step 3: Check resource usage
kubectl top node <node-name>
# Step 4: Check node events
kubectl get events --field-selector involvedObject.name=<node-name>
# Step 5: Check pods on node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
# Step 6: Check kubelet (if node access available)
ssh <node> systemctl status kubelet
ssh <node> journalctl -u kubelet -n 100
Topics
- Control Plane - Troubleshooting control plane components
- Kubelet & CRI - Troubleshooting kubelet and container runtime
See Also
- Troubleshooting - General troubleshooting guide
- Networking - Network troubleshooting
- Debugging Toolkit - Debugging tools