Clusters & Nodes

Troubleshooting cluster and node issues is critical for maintaining a healthy Kubernetes environment. This guide covers common node problems, diagnostic workflows, and essential commands for investigating cluster and node-level issues.

Overview

Cluster and node issues can affect multiple workloads and require systematic investigation:

graph TB
    A[Cluster/Node Issue] --> B{Node Status?}
    B -->|NotReady| C[Node Not Ready]
    B -->|Ready| D{Resource Issues?}
    C --> C1[Kubelet Problems]
    C --> C2[Network Issues]
    C --> C3[Container Runtime]
    D --> D1[CPU Exhaustion]
    D --> D2[Memory Pressure]
    D --> D3[Disk Pressure]
    D --> D4[PID Exhaustion]
    C1 --> E[Diagnostics]
    C2 --> E
    C3 --> E
    D1 --> E
    D2 --> E
    D3 --> E
    D4 --> E

    style A fill:#e1f5ff
    style C fill:#ffe1e1
    style D fill:#fff4e1
    style E fill:#e8f5e9

Common Node Issues

Node Not Ready

A node in the NotReady state will not receive new pods, and pods already running on it may not function correctly.

Symptoms:

  • Node status shows NotReady
  • Pods cannot be scheduled to the node
  • Existing pods on node may be unhealthy

Common Causes:

  • Kubelet not running or failing
  • Network connectivity issues
  • Container runtime problems
  • Resource exhaustion (CPU, memory, disk)
  • System issues (kernel, drivers)

Resource Exhaustion

Nodes can run out of resources, affecting pod scheduling and execution.

Types:

  • CPU exhaustion - All CPU cores fully utilized
  • Memory pressure - Insufficient memory
  • Disk pressure - Disk space full
  • PID exhaustion - Too many processes
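
To see at a glance which of these pressure conditions is currently active across the cluster, the relevant node conditions can be pulled with a JSONPath query. The command below is a sketch; it prints one tab-separated line per node with the MemoryPressure, DiskPressure, and PIDPressure statuses.

# Show pressure conditions for every node (name, memory, disk, pid)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="MemoryPressure")].status}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\t"}{.status.conditions[?(@.type=="PIDPressure")].status}{"\n"}{end}'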

Network Problems

Network issues can prevent:

  • Pod-to-pod communication
  • Node-to-node communication
  • External connectivity
  • Service discovery
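
A quick way to exercise service discovery from inside the cluster is a throwaway pod that resolves an in-cluster name. This is a sketch using a temporary busybox pod; the pod name dns-test is illustrative.

# Resolve the in-cluster API service name to verify DNS-based service discovery
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default

# Resolve a specific service (replace <service> and <namespace>)
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup <service>.<namespace>.svc.cluster.local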

Diagnostic Workflow

Step 1: Check Node Status

# List all nodes and their status
kubectl get nodes

# Get detailed node information
kubectl describe node <node-name>

# Check node conditions
kubectl get node <node-name> -o yaml | grep -A 5 conditions

Step 2: Check Node Conditions

Node conditions indicate node health:

# Check node conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions}'

Common conditions:

  • Ready - Node is healthy and ready to accept pods
  • MemoryPressure - Node has insufficient memory
  • DiskPressure - Node has insufficient disk space
  • PIDPressure - Node has too many processes
  • NetworkUnavailable - Node network is not configured correctly
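
For a more readable view of a single node, the same conditions can be printed one per line with type, status, and reason (the formatting below is only a sketch):

# Print type, status, and reason for each condition of a node
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'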

Step 3: Check Node Events

# Get events for node
kubectl get events --field-selector involvedObject.name=<node-name>

# Get recent events
kubectl describe node <node-name> | grep -A 10 Events

Step 4: Check Resource Usage

# Check node resource usage
kubectl top node <node-name>

# Check node capacity and allocatable
kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"

Step 5: Check Pods on Node

# List pods on node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Check for problematic pods
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name>,status.phase!=Running

Essential Node Commands

Node Information

# Get node details
kubectl describe node <node-name>

# Get node YAML
kubectl get node <node-name> -o yaml

# Get node labels
kubectl get node <node-name> --show-labels

# Get node annotations
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'

Resource Information

# Node resource usage
kubectl top node <node-name>

# Node capacity
kubectl get node <node-name> -o jsonpath='{.status.capacity}'

# Allocatable resources
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'

Pod Information

# Pods on node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# Non-running pods on node
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=<node-name>,status.phase!=Running

Common Node Problems

Node Not Ready

Diagnosis:

# Check node status
kubectl get node <node-name>

# Check node conditions
kubectl describe node <node-name>

# Check kubelet status (requires node access)
ssh <node> systemctl status kubelet

# Check kubelet logs
ssh <node> journalctl -u kubelet -n 50

Solutions:

  • Restart kubelet: systemctl restart kubelet
  • Check network connectivity
  • Verify container runtime is running
  • Check for resource exhaustion
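
When the kubelet or the container runtime is the culprit, restarting both from the node and watching the node return to Ready is often enough. A sketch, assuming systemd and containerd (substitute your runtime's unit name if it differs):

# Restart the container runtime and kubelet (assumes systemd + containerd)
ssh <node> sudo systemctl restart containerd
ssh <node> sudo systemctl restart kubelet

# Watch the node transition back to Ready
kubectl get node <node-name> -w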

Memory Pressure

Symptoms:

  • Node condition shows MemoryPressure=True
  • Pods evicted due to memory pressure
  • OOM (Out of Memory) events

Diagnosis:

# Check memory usage
kubectl top node <node-name>

# Check node conditions
kubectl describe node <node-name> | grep MemoryPressure

# Check for evicted pods
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> \
  | grep Evicted

Solutions:

  • Free up memory (remove unnecessary pods)
  • Increase node memory
  • Set appropriate resource limits on pods
  • Use resource quotas
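
For the resource-limit recommendation, requests and limits can be added to an existing workload without editing manifests by hand. A sketch against a hypothetical deployment named web:

# Add memory requests/limits to a deployment (the name "web" is illustrative)
kubectl set resources deployment web --requests=memory=256Mi --limits=memory=512Mi

# Verify the resulting container resources
kubectl get deployment web -o jsonpath='{.spec.template.spec.containers[0].resources}'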

Disk Pressure

Symptoms:

  • Node condition shows DiskPressure=True
  • Image pull failures
  • Pod creation failures

Diagnosis:

# Check disk usage (requires node access)
ssh <node> df -h

# Check container log usage
ssh <node> 'sudo du -sh /var/log/pods/*'

# Check node conditions
kubectl describe node <node-name> | grep DiskPressure

Solutions:

  • Free up disk space
  • Clean up unused images: docker image prune or crictl rmi --prune
  • Clean up old logs
  • Increase node disk size
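
On nodes running containerd, unused images and old journal entries are the usual space consumers. A sketch of the clean-up, assuming crictl and systemd-journald are available on the node (the vacuum size is illustrative):

# Inspect image filesystem usage as seen by the container runtime
ssh <node> sudo crictl imagefsinfo

# Remove images not referenced by any container
ssh <node> sudo crictl rmi --prune

# Trim the systemd journal to a fixed size
ssh <node> sudo journalctl --vacuum-size=500M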

CPU Exhaustion

Symptoms:

  • High CPU usage on node
  • Slow pod scheduling
  • Pod performance degradation

Diagnosis:

# Check CPU usage
kubectl top node <node-name>

# Check CPU requests/limits
kubectl describe node <node-name> | grep -A 10 "Non-terminated Pods"

Solutions:

  • Identify high CPU pods
  • Set appropriate CPU limits
  • Scale out to more nodes
  • Upgrade node CPU capacity
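
To identify the high-CPU pods mentioned above, the metrics API can be sorted directly (this needs metrics-server, the same dependency as kubectl top node):

# Show the heaviest CPU consumers cluster-wide
kubectl top pods --all-namespaces --sort-by=cpu | head -20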

Network Issues

Symptoms:

  • Pods cannot communicate
  • Service discovery failures
  • Node network unreachable

Diagnosis:

# Check node network status
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="NetworkUnavailable")]}'

# Test pod-to-pod connectivity
kubectl run test-pod --image=busybox --rm -it --restart=Never -- ping -c 3 <target-pod-ip>

# Check CNI plugin
kubectl get pods -n kube-system -o wide | grep -Ei 'cni|calico|flannel|weave|cilium'

Solutions:

  • Check CNI plugin status
  • Verify network policies
  • Check firewall rules
  • Verify node network configuration
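
Two of these checks can be done entirely from kubectl: reviewing NetworkPolicies that might be blocking traffic and confirming the CNI DaemonSet is fully rolled out. A sketch (the DaemonSet name calico-node is illustrative and depends on the installed CNI):

# List NetworkPolicies that could be restricting traffic
kubectl get networkpolicies --all-namespaces

# Confirm the CNI DaemonSet is healthy on every node
kubectl rollout status daemonset/calico-node -n kube-system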

Node Maintenance

Cordon Node

Prevent new pods from being scheduled:

# Cordon node
kubectl cordon <node-name>

# Check node status (SchedulingDisabled)
kubectl get node <node-name>

Drain Node

Safely evict pods from node:

# Drain node (fails if DaemonSet-managed pods are present)
kubectl drain <node-name>

# Drain, ignoring DaemonSet-managed pods
kubectl drain <node-name> --ignore-daemonsets

# Drain, also deleting pods that use emptyDir volumes
kubectl drain <node-name> --delete-emptydir-data

Uncordon Node

Allow scheduling after maintenance:

# Uncordon node
kubectl uncordon <node-name>
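
Putting the three commands together, a typical maintenance cycle looks like the sketch below (the reboot is only an example of maintenance work):

# 1. Stop new scheduling and evict existing pods
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 2. Perform the maintenance (example: reboot the node)
ssh <node> sudo reboot

# 3. Re-enable scheduling and confirm the node returns to Ready
kubectl uncordon <node-name>
kubectl get node <node-name>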

Troubleshooting Workflow

Complete Node Troubleshooting

# Step 1: Check node status
kubectl get nodes
kubectl describe node <node-name>

# Step 2: Check node conditions
kubectl get node <node-name> -o jsonpath='{.status.conditions}'

# Step 3: Check resource usage
kubectl top node <node-name>

# Step 4: Check node events
kubectl get events --field-selector involvedObject.name=<node-name>

# Step 5: Check pods on node
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>

# Step 6: Check kubelet (if node access available)
ssh <node> systemctl status kubelet
ssh <node> journalctl -u kubelet -n 100
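
If deeper offline analysis is needed, the cluster state (node descriptions, pod descriptions, and system component logs) can be captured in one pass; the output directory below is illustrative:

# Dump cluster state for offline analysis
kubectl cluster-info dump --all-namespaces --output-directory=/tmp/cluster-dump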
