Troubleshooting

Troubleshooting Kubernetes issues requires a systematic approach. This section provides methodologies, tools, and techniques to diagnose and resolve problems in your clusters, from pod failures to control plane issues.

Troubleshooting Methodology

A structured approach to troubleshooting helps you resolve issues efficiently:

graph TB
    A[Issue Reported] --> B[Reproduce Issue]
    B --> C{Can Reproduce?}
    C -->|No| D[Gather More Info]
    D --> B
    C -->|Yes| E[Identify Scope]
    E --> F{Scope}
    F -->|Pod| G[Pod Troubleshooting]
    F -->|Service| H[Service Troubleshooting]
    F -->|Node| I[Node Troubleshooting]
    F -->|Cluster| J[Cluster Troubleshooting]
    G --> K[Check Pod Status]
    H --> L[Check Service Endpoints]
    I --> M[Check Node Status]
    J --> N[Check Control Plane]
    K --> O[Check Logs]
    L --> O
    M --> O
    N --> O
    O --> P[Check Events]
    P --> Q[Check Resources]
    Q --> R{Issue Resolved?}
    R -->|No| S[Deep Dive]
    R -->|Yes| T[Document Solution]
    S --> O
    style A fill:#e1f5ff
    style E fill:#e8f5e9
    style O fill:#fff4e1
    style T fill:#e8f5e9

Common Problem Categories

Pod Issues

  • Pods not starting
  • Crash loops
  • Image pull errors
  • Resource constraints
  • Configuration errors

Service & Networking Issues

  • Service not accessible
  • DNS resolution failures
  • NetworkPolicy blocking traffic
  • Load balancer problems

Storage Issues

  • PersistentVolume mounting failures
  • StorageClass not provisioning
  • PVC pending indefinitely
  • Data loss or corruption

Control Plane Issues

  • API server unavailable
  • etcd problems
  • Scheduler not working
  • Controller manager failures

Node Issues

  • Node not ready
  • Kubelet problems
  • Container runtime issues
  • Resource exhaustion
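Most of the categories above first show up as unhealthy pod states, so a quick tally of non-Running pods is a useful starting point. A minimal sketch, where the here-doc is sample data standing in for live `kubectl get pods -A --no-headers` output (the column layout — namespace, name, ready, status — is an assumption about your kubectl version):

```shell
# Hypothetical triage sketch: tally pods by non-Running state.
# Replace the here-doc with: kubectl get pods -A --no-headers
sample=$(cat <<'EOF'
default      web-1        0/1   CrashLoopBackOff   12   30m
default      web-2        1/1   Running            0    30m
kube-system  coredns-xyz  0/1   ImagePullBackOff   3    10m
EOF
)
# Column 4 is the pod status; count each failure state.
echo "$sample" | awk '$4 != "Running" {count[$4]++} END {for (s in count) print count[s], s}'
```

A spike in one state (many `ImagePullBackOff`, say) points straight at one of the categories above and narrows the rest of the investigation.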

Essential kubectl Commands

Check Resource Status

# Check pod status
kubectl get pods -A

# Check pod details
kubectl describe pod <pod-name> -n <namespace>

# Check node status
kubectl get nodes

# Check node details
kubectl describe node <node-name>

# Check service endpoints
kubectl get endpoints <service-name>

# Check events
kubectl get events --sort-by='.lastTimestamp'

View Logs

# Pod logs
kubectl logs <pod-name> -n <namespace>

# Previous container instance
kubectl logs <pod-name> --previous

# All pods with label
kubectl logs -l app=my-app

# Follow logs
kubectl logs -f <pod-name>

# Container logs in multi-container pod
kubectl logs <pod-name> -c <container-name>

Debugging Commands

# Execute into pod
kubectl exec -it <pod-name> -- /bin/sh

# Port forward
kubectl port-forward <pod-name> 8080:80

# Attach an ephemeral debug container to a running pod
kubectl debug <pod-name> -it --image=busybox

# Check resource usage (requires metrics-server)
kubectl top pods
kubectl top nodes

Inspect Configuration

# Get resource YAML
kubectl get pod <pod-name> -o yaml

# Check API resources
kubectl api-resources

# Check API versions
kubectl api-versions

# Check cluster info
kubectl cluster-info

# Check config
kubectl config view

Debugging Workflow

1. Identify the Issue

# Start with high-level status
kubectl get all -A

# Check for pods that aren't Running (also matches Succeeded/completed pods)
kubectl get pods -A --field-selector=status.phase!=Running

# Check events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

2. Narrow Down the Scope

Determine if it’s:

  • A specific pod
  • A service
  • A node
  • The entire cluster
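One quick way to separate a node-level problem from an application-level one is to check whether failing pods cluster on a single node. A sketch using a here-doc in place of live `kubectl get pods -A -o wide --no-headers` output (the node name appearing in column 8 is an assumption about the `-o wide` layout):

```shell
# Count non-Running pods per node; a skew toward one node suggests
# a node problem rather than an application problem.
pods=$(cat <<'EOF'
default  web-1  0/1  CrashLoopBackOff  3  10m  10.0.0.4  node-a
default  web-2  1/1  Running           0  10m  10.0.0.5  node-b
default  web-3  0/1  Error             1  10m  10.0.0.6  node-a
EOF
)
echo "$pods" | awk '$4 != "Running" {print $8}' | sort | uniq -c
```

If every failure lands on the same node, jump to node troubleshooting; if failures span nodes but share a label, it is more likely a pod or service issue.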

3. Gather Information

# Describe the resource
kubectl describe <resource-type> <resource-name>

# Check logs
kubectl logs <pod-name>

# Check configuration
kubectl get <resource-type> <resource-name> -o yaml

4. Analyze Events

Events provide chronological information about what happened:

# Watch events in real-time
kubectl get events -w

# Filter events by reason
kubectl get events --field-selector reason=Failed

# Events for specific resource
kubectl describe pod <pod-name> | grep -A 10 Events

5. Check Logs

# Application logs
kubectl logs <pod-name>

# Previous instance (for crash loops)
kubectl logs <pod-name> --previous

# System component logs (if on node)
journalctl -u kubelet

6. Verify Configuration

# Check if YAML is valid
kubectl apply --dry-run=client -f manifest.yaml

# Validate resource
kubectl get <resource> -o yaml | kubectl apply --dry-run=server -f -

# Check resource limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"

Troubleshooting Decision Tree

graph TD
    A[Issue Detected] --> B{Resource Type?}
    B -->|Pod| C[Pod Issues]
    B -->|Service| D[Service Issues]
    B -->|Node| E[Node Issues]
    B -->|Cluster| F[Cluster Issues]
    C --> C1{Pod Status?}
    C1 -->|Pending| C2[Check Scheduling]
    C1 -->|CrashLoopBackOff| C3[Check Logs]
    C1 -->|ImagePullBackOff| C4[Check Image]
    C1 -->|Error| C3
    D --> D1{Service Accessible?}
    D1 -->|No| D2[Check Endpoints]
    D1 -->|Yes| D3[Check Selector]
    E --> E1{Node Status?}
    E1 -->|NotReady| E2[Check Kubelet]
    E1 -->|Ready| E3[Check Resources]
    F --> F1{API Server?}
    F1 -->|Down| F2[Check Control Plane]
    F1 -->|Up| F3[Check etcd]
    style A fill:#e1f5ff
    style C3 fill:#fff4e1
    style D2 fill:#fff4e1
    style E2 fill:#fff4e1
    style F2 fill:#fff4e1

Common Issues and Solutions

Pod Stuck in Pending

# Check why pod can't be scheduled
kubectl describe pod <pod-name> | grep -A 5 Events

# Common causes:
# - Insufficient resources
# - No nodes match affinity rules
# - PVC pending
# - Taints without tolerations

CrashLoopBackOff

# Check logs from previous instance
kubectl logs <pod-name> --previous

# Common causes:
# - Application errors
# - Missing configuration
# - Resource limits
# - Health check failures

ImagePullBackOff

# Check image pull secrets
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}'

# Common causes:
# - Wrong image name/tag
# - Private registry without credentials
# - Network issues
# - Registry unavailable

Service Not Accessible

# Check service endpoints
kubectl get endpoints <service-name>

# Check service selector
kubectl get service <service-name> -o yaml

# Verify pods match selector
kubectl get pods -l <selector>
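A Service only gets endpoints when its selector matches pod labels exactly, so the commands above amount to a label comparison. A minimal sketch of that check with hard-coded sample values (on a real cluster you would pull the selector and labels from the `-o yaml` output above; the label strings here are hypothetical):

```shell
# Does the Service selector appear among the pod's labels?
svc_selector="app=my-app"            # from the Service spec (sample)
pod_labels="app=my-app,tier=web"     # from the pod metadata (sample)

case ",$pod_labels," in
  *",$svc_selector,"*) echo "selector matches pod labels" ;;
  *)                   echo "selector does NOT match pod labels" ;;
esac
```

A single-character mismatch (e.g. `my-app` vs `myapp`) is enough to leave the endpoints list empty, which is why `kubectl get endpoints` returning nothing is the classic symptom.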

Node Not Ready

# Check kubelet status (run these on the affected node)
systemctl status kubelet

# Check kubelet logs
journalctl -u kubelet -n 50

# Common causes:
# - Kubelet not running
# - Network issues
# - Disk pressure
# - Memory pressure
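The disk- and memory-pressure causes above are reported as node status conditions, which can be checked without reading through `kubectl describe`. A sketch over sample data; on a live cluster, `kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type} {.status}{"\n"}{end}'` should produce similar `type status` lines:

```shell
# Sample node conditions; replace with live jsonpath output.
conditions=$(cat <<'EOF'
MemoryPressure False
DiskPressure True
PIDPressure False
Ready False
EOF
)
# Flag any pressure condition that is True, and a Ready that is not.
echo "$conditions" | awk '
  $1 != "Ready" && $2 == "True" {print "WARNING:", $1, "is True"}
  $1 == "Ready" && $2 != "True" {print "WARNING: node is not Ready"}'
```

A `True` pressure condition tells you which resource to free up; a not-`True` `Ready` with no pressure flags usually points back at the kubelet or the container runtime.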

Debugging Tools

kubectl debug

Create ephemeral containers for debugging:

# Debug running pod
kubectl debug <pod-name> -it --image=busybox

# Create a copy of the pod with an added debug container
kubectl debug <pod-name> -it --image=busybox --copy-to=debug-pod

# Debug node
kubectl debug node/<node-name> -it --image=busybox

kubectl exec

Execute commands in running containers:

# Basic exec
kubectl exec -it <pod-name> -- /bin/sh

# Run specific command
kubectl exec <pod-name> -- env

# Exec in specific container
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh

Port Forwarding

Access services locally:

# Forward pod port
kubectl port-forward <pod-name> 8080:80

# Forward service port
kubectl port-forward svc/<service-name> 8080:80

# Forward in background
kubectl port-forward <pod-name> 8080:80 &
