Clusters & Nodes

Troubleshooting cluster and node issues is critical for maintaining a healthy Kubernetes environment. This guide covers common node problems, diagnostic workflows, and essential commands for investigating cluster and node-level issues.

Overview

Cluster and node issues can affect multiple workloads and require systematic investigation:

graph TB
    A[Cluster/Node Issue] --> B{Node Status?}
    B -->|NotReady| C[Node Not Ready]
    B -->|Ready| D{Resource Issues?}
    C --> C1[Kubelet Problems]
    C --> C2[Network Issues]
    C --> C3[Container Runtime]
    D --> D1[CPU Exhaustion]
    D --> D2[Memory Pressure]
    D --> D3[Disk Pressure]
    D --> D4[PID Exhaustion]
    C1 --> E[Diagnostics]
    C2 --> E
    C3 --> E
    D1 --> E
    D2 --> E
    D3 --> E
    D4 --> E

    style A fill:#e1f5ff
    style C fill:#ffe1e1
    style D fill:#fff4e1
    style E fill:#e8f5e9

Common Node Issues

Node Not Ready

A node in the NotReady state will not receive new pods, and pods already running on it may not function correctly.

Symptoms:

  • Node status shows NotReady
  • Pods cannot be scheduled to the node
  • Existing pods on node may be unhealthy

Common Causes:

  • Kubelet not running or failing
  • Network connectivity issues
  • Container runtime problems
  • Resource exhaustion (CPU, memory, disk)
  • System issues (kernel, drivers)

Resource Exhaustion

Nodes can run out of resources, affecting pod scheduling and execution.

Types:

  • CPU exhaustion - All CPU cores fully utilized
  • Memory pressure - Insufficient memory
  • Disk pressure - Disk space full
  • PID exhaustion - Too many processes
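
To see at a glance which of these pressure conditions is currently active across the cluster, the relevant node conditions can be pulled with a JSONPath query. The command below is a sketch; it prints one tab-separated line per node with the MemoryPressure, DiskPressure, and PIDPressure statuses.

# Show pressure conditions for every node (name, memory, disk, pid)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'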

Network Problems

Network issues can prevent:

  • Pod-to-pod communication
  • Node-to-node communication
  • External connectivity
  • Service discovery
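
A quick way to exercise service discovery from inside the cluster is a throwaway pod that resolves an in-cluster name. This is a sketch using a temporary busybox pod; the pod name dns-test is illustrative.

# Resolve the in-cluster API service name to verify DNS-based service discovery
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default

# Resolve a specific service (replace <service> and <namespace>)
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local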

Diagnostic Workflow

Step 1: Check Node Status

# List all nodes and their status
kubectl get nodes

# Get detailed node information
kubectl describe node <node-name>

# Check node conditions
kubectl get node <node-name> -o yaml | grep -A 5 conditions

Step 2: Check Node Conditions

Node conditions indicate node health:

# Check node conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions}'

Common conditions:

  • Ready - Node is healthy and ready to accept pods
  • MemoryPressure - Node has insufficient memory
  • DiskPressure - Node has insufficient disk space
  • PIDPressure - Node has too many processes
  • NetworkUnavailable - Node network is not configured correctly
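
For a more readable view of a single node, the same conditions can be printed one per line with type, status, and reason (the formatting below is only a sketch):

# Print type, status, and reason for each condition of a node
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'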

Step 3: Check Node Events

# Get events for node
kubectl get events --field-selector involvedObject.name=<node-name>

# Get recent events
kubectl describe node <node-name> | grep -A 10 Events

Step 4: Check Resource Usage

# Check node resource usage
kubectl top node <node-name>

# Check node capacity and allocatable
kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"

Step 5: Check Pods on Node

# List pods on node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Check for problematic pods
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name>,status.phase!=Running

Essential Node Commands

Node Information

# Get node details
kubectl describe node <node-name>

# Get node YAML
kubectl get node <node-name> -o yaml

# Get node labels
kubectl get node <node-name> --show-labels

# Get node annotations
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'

Resource Information

# Node resource usage
kubectl top node <node-name>

# Node capacity
kubectl get node <node-name> -o jsonpath='{.status.capacity}'

# Allocatable resources
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

Pod Information

# Pods on node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# Non-running pods on node
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name>,status.phase!=Running

Common Node Problems

Node Not Ready

Diagnosis:

# Check node status
kubectl get node <node-name>

# Check node conditions
kubectl describe node <node-name>

# Check kubelet status (requires node access)
ssh <node> systemctl status kubelet

# Check kubelet logs
ssh <node> journalctl -u kubelet -n 50

Solutions:

  • Restart kubelet: systemctl restart kubelet
  • Check network connectivity
  • Verify container runtime is running
  • Check for resource exhaustion
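
When the kubelet or the container runtime is the culprit, restarting both from the node and watching the node return to Ready is often enough. A sketch, assuming systemd and containerd (substitute your runtime's unit name if it differs):

# Restart the container runtime and kubelet (assumes systemd + containerd)
ssh <node> sudo systemctl restart containerd
ssh <node> sudo systemctl restart kubelet

# Watch the node transition back to Ready
kubectl get node <node-name> -w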

Memory Pressure

Symptoms:

  • Node condition shows MemoryPressure=True
  • Pods evicted due to memory pressure
  • OOM (Out of Memory) events

Diagnosis:

# Check memory usage
kubectl top node <node-name>

# Check node conditions
kubectl describe node <node-name> | grep MemoryPressure

# Check for evicted pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> \
  | grep Evicted

Solutions:

  • Free up memory (remove unnecessary pods)
  • Increase node memory
  • Set appropriate resource limits on pods
  • Use resource quotas
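
For the resource-limit recommendation, requests and limits can be added to an existing workload without editing manifests by hand. A sketch against a hypothetical deployment named web:

# Add memory requests/limits to a deployment (the name "web" is illustrative)
kubectl set resources deployment web --requests=memory=256Mi --limits=memory=512Mi

# Verify the resulting container resources
kubectl get deployment web -o jsonpath='{.spec.template.spec.containers[0].resources}'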

Disk Pressure

Symptoms:

  • Node condition shows DiskPressure=True
  • Image pull failures
  • Pod creation failures

Diagnosis:

# Check disk usage (requires node access)
ssh <node> df -h

# Check container log usage
ssh <node> 'sudo du -sh /var/log/pods/*'

# Check node conditions
kubectl describe node <node-name> | grep DiskPressure

Solutions:

  • Free up disk space
  • Clean up unused images: docker image prune or crictl rmi --prune
  • Clean up old logs
  • Increase node disk size
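
On nodes running containerd, unused images and old journal entries are the usual space consumers. A sketch of the clean-up, assuming crictl and systemd-journald are available on the node (the vacuum size is illustrative):

# Inspect image filesystem usage as seen by the container runtime
ssh <node> sudo crictl imagefsinfo

# Remove images not referenced by any container
ssh <node> sudo crictl rmi --prune

# Trim the systemd journal to a fixed size
ssh <node> sudo journalctl --vacuum-size=500M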

CPU Exhaustion

Symptoms:

  • High CPU usage on node
  • Slow pod scheduling
  • Pod performance degradation

Diagnosis:

# Check CPU usage
kubectl top node <node-name>

# Check CPU requests/limits
kubectl describe node <node-name> | grep -A 10 "Non-terminated Pods"

Solutions:

  • Identify high CPU pods
  • Set appropriate CPU limits
  • Scale out to more nodes
  • Upgrade node CPU capacity
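
To identify the high-CPU pods mentioned above, the metrics API can be sorted directly (this needs metrics-server, the same dependency as kubectl top node):

# Show the heaviest CPU consumers cluster-wide
kubectl top pods --all-namespaces --sort-by=cpu | head -20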

Network Issues

Symptoms:

  • Pods cannot communicate
  • Service discovery failures
  • Node network unreachable

Diagnosis:

# Check node network status
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")]}'

# Test pod-to-pod connectivity
kubectl run test-pod --image=busybox --rm -it --restart=Never -- ping -c 3 <target-pod-ip>

# Check CNI plugin
kubectl get pods -n kube-system -o wide | grep -Ei 'cni|calico|flannel|weave|cilium'

Solutions:

  • Check CNI plugin status
  • Verify network policies
  • Check firewall rules
  • Verify node network configuration
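
Two of these checks can be done entirely from kubectl: reviewing NetworkPolicies that might be blocking traffic and confirming the CNI DaemonSet is fully rolled out. A sketch (the DaemonSet name calico-node is illustrative and depends on the installed CNI):

# List NetworkPolicies that could be restricting traffic
kubectl get networkpolicies --all-namespaces

# Confirm the CNI DaemonSet is healthy on every node
kubectl rollout status daemonset/calico-node -n kube-system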

Node Maintenance

Cordon Node

Prevent new pods from being scheduled:

# Cordon node
kubectl cordon <node-name>

# Check node status (SchedulingDisabled)
kubectl get node <node-name>

Drain Node

Safely evict pods from node:

# Drain node (fails if DaemonSet-managed pods are present)
kubectl drain <node-name>

# Drain, ignoring DaemonSet-managed pods
kubectl drain <node-name> --ignore-daemonsets

# Drain, also deleting pods that use emptyDir volumes
kubectl drain <node-name> --delete-emptydir-data

Uncordon Node

Allow scheduling after maintenance:

# Uncordon node
kubectl uncordon <node-name>
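
Putting the three commands together, a typical maintenance cycle looks like the sketch below (the reboot is only an example of maintenance work):

# 1. Stop new scheduling and evict existing pods
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 2. Perform the maintenance (example: reboot the node)
ssh <node> sudo reboot

# 3. Re-enable scheduling and confirm the node returns to Ready
kubectl uncordon <node-name>
kubectl get node <node-name>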

Troubleshooting Workflow

Complete Node Troubleshooting

# Step 1: Check node status
kubectl get nodes
kubectl describe node <node-name>

# Step 2: Check node conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions}'

# Step 3: Check resource usage
kubectl top node <node-name>

# Step 4: Check node events
kubectl get events --field-selector involvedObject.name=<node-name>

# Step 5: Check pods on node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Step 6: Check kubelet (if node access available)
ssh <node> systemctl status kubelet
ssh <node> journalctl -u kubelet -n 100
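
If deeper offline analysis is needed, the cluster state (node descriptions, pod descriptions, and system component logs) can be captured in one pass; the output directory below is illustrative:

# Dump cluster state for offline analysis
kubectl cluster-info dump --all-namespaces --output-directory=/tmp/cluster-dump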
