Troubleshooting
Troubleshooting Kubernetes issues requires a systematic approach. This section provides methodologies, tools, and techniques to diagnose and resolve problems in your clusters, from pod failures to control plane issues.
Troubleshooting Methodology
A structured approach to troubleshooting helps you resolve issues efficiently:
graph TB
A[Issue Reported] --> B[Reproduce Issue]
B --> C{Can Reproduce?}
C -->|No| D[Gather More Info]
D --> B
C -->|Yes| E[Identify Scope]
E --> F{Scope}
F -->|Pod| G[Pod Troubleshooting]
F -->|Service| H[Service Troubleshooting]
F -->|Node| I[Node Troubleshooting]
F -->|Cluster| J[Cluster Troubleshooting]
G --> K[Check Pod Status]
H --> L[Check Service Endpoints]
I --> M[Check Node Status]
J --> N[Check Control Plane]
K --> O[Check Logs]
L --> O
M --> O
N --> O
O --> P[Check Events]
P --> Q[Check Resources]
Q --> R{Issue Resolved?}
R -->|No| S[Deep Dive]
R -->|Yes| T[Document Solution]
S --> O
style A fill:#e1f5ff
style E fill:#e8f5e9
style O fill:#fff4e1
style T fill:#e8f5e9
Common Problem Categories
Pod Issues
- Pods not starting
- Crash loops
- Image pull errors
- Resource constraints
- Configuration errors
Service & Networking Issues
- Service not accessible
- DNS resolution failures
- NetworkPolicy blocking traffic
- Load balancer problems
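Before digging into an individual service, it is worth confirming whether a NetworkPolicy is filtering traffic and whether kube-proxy is healthy on every node. A minimal sketch; the k8s-app=kube-proxy label is typical but may differ on your distribution:
# NetworkPolicies that could be filtering traffic
kubectl get networkpolicies -A
# Is kube-proxy running on every node?
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide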
Storage Issues
- PersistentVolume mounting failures
- StorageClass not provisioning
- PVC pending indefinitely
- Data loss or corruption
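A quick first pass for storage problems is to look at claim binding status and confirm a provisioner exists. A minimal sketch, with <pvc-name> and <namespace> as placeholders:
# List claims and their binding status
kubectl get pvc -A
# See why a claim is stuck (check the Events section)
kubectl describe pvc <pvc-name> -n <namespace>
# Confirm a StorageClass (ideally a default) exists
kubectl get storageclass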
Control Plane Issues
- API server unavailable
- etcd problems
- Scheduler not working
- Controller manager failures
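When the control plane is suspect, start by checking whether its components are up. On kubeadm-style clusters they run as static pods in kube-system; on managed clusters you may only be able to query the API server's health endpoints. A minimal sketch:
# Control plane pods (kubeadm-style clusters)
kubectl get pods -n kube-system
# API server readiness and liveness checks
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'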
Node Issues
- Node not ready
- Kubelet problems
- Container runtime issues
- Resource exhaustion
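Resource exhaustion usually surfaces as pressure conditions on the node. A quick check, assuming metrics-server is installed for kubectl top:
# Current CPU/memory usage per node (requires metrics-server)
kubectl top nodes
# Pressure conditions reported by the kubelet
kubectl describe node <node-name> | grep -i pressure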
Essential kubectl Commands
Check Resource Status
# Check pod status
kubectl get pods -A
# Check pod details
kubectl describe pod <pod-name> -n <namespace>
# Check node status
kubectl get nodes
# Check node details
kubectl describe node <node-name>
# Check service endpoints
kubectl get endpoints <service-name>
# Check events
kubectl get events --sort-by='.lastTimestamp'
View Logs
# Pod logs
kubectl logs <pod-name> -n <namespace>
# Previous container instance
kubectl logs <pod-name> --previous
# All pods with label
kubectl logs -l app=my-app
# Follow logs
kubectl logs -f <pod-name>
# Container logs in multi-container pod
kubectl logs <pod-name> -c <container-name>
Debugging Commands
# Execute into pod
kubectl exec -it <pod-name> -- /bin/sh
# Port forward
kubectl port-forward <pod-name> 8080:80
# Debug with an ephemeral container
kubectl debug <pod-name> -it --image=busybox
# Check resource usage (requires metrics-server)
kubectl top pods
kubectl top nodes
Inspect Configuration
# Get resource YAML
kubectl get pod <pod-name> -o yaml
# Check API resources
kubectl api-resources
# Check API versions
kubectl api-versions
# Check cluster info
kubectl cluster-info
# Check config
kubectl config view
Debugging Workflow
1. Identify the Issue
# Start with high-level status
kubectl get all -A
# Check for pods that are not Running (also matches completed Jobs)
kubectl get pods -A --field-selector=status.phase!=Running
# Check events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
2. Narrow Down the Scope
Determine whether the issue affects:
- A specific pod
- A service
- A node
- The entire cluster
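A few quick comparisons help establish the scope, for example whether failures are spread across namespaces or concentrated on one node. A sketch, with <node-name> as a placeholder:
# Non-running pods across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running
# Do the failures share a node? (check the NODE column)
kubectl get pods -A -o wide | grep -v Running
# Everything scheduled to a suspect node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>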
3. Gather Information
# Describe the resource
kubectl describe <resource-type> <resource-name>
# Check logs
kubectl logs <pod-name>
# Check configuration
kubectl get <resource-type> <resource-name> -o yaml
4. Analyze Events
Events provide chronological information about what happened:
# Watch events in real-time
kubectl get events -w
# Filter events by reason
kubectl get events --field-selector reason=Failed
# Events for specific resource
kubectl describe pod <pod-name> | grep -A 10 Events
5. Check Logs
# Application logs
kubectl logs <pod-name>
# Previous instance (for crash loops)
kubectl logs <pod-name> --previous
# System component logs (run on the affected node)
journalctl -u kubelet
6. Verify Configuration
# Check if YAML is valid
kubectl apply --dry-run=client -f manifest.yaml
# Validate resource
kubectl get <resource> -o yaml | kubectl apply --dry-run=server -f -
# Check resource limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"
Troubleshooting Decision Tree
graph TD
A[Issue Detected] --> B{Resource Type?}
B -->|Pod| C[Pod Issues]
B -->|Service| D[Service Issues]
B -->|Node| E[Node Issues]
B -->|Cluster| F[Cluster Issues]
C --> C1{Pod Status?}
C1 -->|Pending| C2[Check Scheduling]
C1 -->|CrashLoopBackOff| C3[Check Logs]
C1 -->|ImagePullBackOff| C4[Check Image]
C1 -->|Error| C3
D --> D1{Service Accessible?}
D1 -->|No| D2[Check Endpoints]
D1 -->|Yes| D3[Check Selector]
E --> E1{Node Status?}
E1 -->|NotReady| E2[Check Kubelet]
E1 -->|Ready| E3[Check Resources]
F --> F1{API Server?}
F1 -->|Down| F2[Check Control Plane]
F1 -->|Up| F3[Check etcd]
style A fill:#e1f5ff
style C3 fill:#fff4e1
style D2 fill:#fff4e1
style E2 fill:#fff4e1
style F2 fill:#fff4e1
Common Issues and Solutions
Pod Stuck in Pending
# Check why pod can't be scheduled
kubectl describe pod <pod-name> | grep -A 5 Events
# Common causes:
# - Insufficient resources
# - No nodes match affinity rules
# - PVC pending
# - Taints without tolerations
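To confirm whether scheduling is blocked by resources or taints, compare the pod's requests against what nodes can actually offer. A minimal sketch:
# Allocatable capacity and current requests per node
kubectl describe nodes | grep -A 8 "Allocated resources"
# Taints that the pod may need tolerations for
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'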
CrashLoopBackOff
# Check logs from previous instance
kubectl logs <pod-name> --previous
# Common causes:
# - Application errors
# - Missing configuration
# - Resource limits
# - Health check failures
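The container's last termination state usually tells you whether it exited with an application error or was OOM-killed. A sketch; the jsonpath assumes the first container in the pod:
# Exit code and reason of the last terminated instance
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# Restart count at a glance
kubectl get pod <pod-name>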
ImagePullBackOff
# Check image pull secrets
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}'
# Common causes:
# - Wrong image name/tag
# - Private registry without credentials
# - Network issues
# - Registry unavailable
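The exact pull error is recorded in the pod's events, and it is worth confirming that any referenced pull secret actually exists in the namespace. A sketch, with <secret-name> as a placeholder:
# Exact error from the kubelet's pull attempt
kubectl describe pod <pod-name> | grep -A 10 Events
# Does the referenced pull secret exist?
kubectl get secret <secret-name> -n <namespace>
# Image reference actually used by the pod
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'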
Service Not Accessible
# Check service endpoints
kubectl get endpoints <service-name>
# Check service selector
kubectl get service <service-name> -o yaml
# Verify pods match selector
kubectl get pods -l <selector>
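If endpoints exist but the service is still unreachable, test DNS resolution from inside the cluster. A sketch using a throwaway pod; busybox:1.36 is just an assumption, any image with nslookup works:
# Resolve the service name from a temporary pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup <service-name>.<namespace>.svc.cluster.local
# Is CoreDNS itself healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns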
Node Not Ready
# Check kubelet status (run on the affected node)
systemctl status kubelet
# Check kubelet logs
journalctl -u kubelet -n 50
# Common causes:
# - Kubelet not running
# - Network issues
# - Disk pressure
# - Memory pressure
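The node's conditions show exactly which check is failing (Ready, MemoryPressure, DiskPressure, PIDPressure). A quick way to read them:
# Condition summary for one node
kubectl describe node <node-name> | grep -A 10 Conditions
# Machine-readable condition list
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'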
Debugging Tools
kubectl debug
Attach ephemeral debug containers, create a debuggable copy of a pod, or open a shell on a node:
# Debug running pod
kubectl debug <pod-name> -it --image=busybox
# Copy pod with debug container
kubectl debug <pod-name> -it --image=busybox --copy-to=debug-pod
# Debug node
kubectl debug node/<node-name> -it --image=busybox
kubectl exec
Execute commands in running containers:
# Basic exec
kubectl exec -it <pod-name> -- /bin/sh
# Run specific command
kubectl exec <pod-name> -- env
# Exec in specific container
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh
Port Forwarding
Access services locally:
# Forward pod port
kubectl port-forward <pod-name> 8080:80
# Forward service port
kubectl port-forward svc/<service-name> 8080:80
# Forward in background
kubectl port-forward <pod-name> 8080:80 &
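Once a forward is running, verify it from another terminal; this assumes the pod serves HTTP on the forwarded port:
# Confirm the forwarded port responds locally
curl -v http://localhost:8080/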
Topics
- Clusters & Nodes - Troubleshooting cluster and node issues
- Networking - Troubleshooting network connectivity
See Also
- Debugging Toolkit - Tools for debugging Kubernetes
- Events - Understanding Kubernetes events
- kubectl debug - Debugging with ephemeral containers