Troubleshooting
Troubleshooting Kubernetes issues requires a systematic approach. This section provides methodologies, tools, and techniques to diagnose and resolve problems in your clusters, from pod failures to control plane issues.
Troubleshooting Methodology
A structured approach to troubleshooting helps you resolve issues efficiently:
graph TB
A[Issue Reported] --> B[Reproduce Issue]
B --> C{Can Reproduce?}
C -->|No| D[Gather More Info]
D --> B
C -->|Yes| E[Identify Scope]
E --> F{Scope}
F -->|Pod| G[Pod Troubleshooting]
F -->|Service| H[Service Troubleshooting]
F -->|Node| I[Node Troubleshooting]
F -->|Cluster| J[Cluster Troubleshooting]
G --> K[Check Pod Status]
H --> L[Check Service Endpoints]
I --> M[Check Node Status]
J --> N[Check Control Plane]
K --> O[Check Logs]
L --> O
M --> O
N --> O
O --> P[Check Events]
P --> Q[Check Resources]
Q --> R{Issue Resolved?}
R -->|No| S[Deep Dive]
R -->|Yes| T[Document Solution]
S --> O
style A fill:#e1f5ff
style E fill:#e8f5e9
style O fill:#fff4e1
style T fill:#e8f5e9
Common Problem Categories
Pod Issues
- Pods not starting
- Crash loops
- Image pull errors
- Resource constraints
- Configuration errors
Service & Networking Issues
- Service not accessible
- DNS resolution failures
- NetworkPolicy blocking traffic
- Load balancer problems
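Before digging into an individual service, it is worth confirming whether a NetworkPolicy is filtering traffic and whether kube-proxy is healthy on every node. A minimal sketch; the k8s-app=kube-proxy label is typical but may differ on your distribution:
# NetworkPolicies that could be filtering traffic
kubectl get networkpolicies -A
# Is kube-proxy running on every node?
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide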
Storage Issues
- PersistentVolume mounting failures
- StorageClass not provisioning
- PVC pending indefinitely
- Data loss or corruption
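A quick first pass for storage problems is to look at claim binding status and confirm a provisioner exists. A minimal sketch, with <pvc-name> and <namespace> as placeholders:
# List claims and their binding status
kubectl get pvc -A
# See why a claim is stuck (check the Events section)
kubectl describe pvc <pvc-name> -n <namespace>
# Confirm a StorageClass (ideally a default) exists
kubectl get storageclass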
Control Plane Issues
- API server unavailable
- etcd problems
- Scheduler not working
- Controller manager failures
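When the control plane is suspect, start by checking whether its components are up. On kubeadm-style clusters they run as static pods in kube-system; on managed clusters you may only be able to query the API server's health endpoints. A minimal sketch:
# Control plane pods (kubeadm-style clusters)
kubectl get pods -n kube-system
# API server readiness and liveness checks
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'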
Node Issues
- Node not ready
- Kubelet problems
- Container runtime issues
- Resource exhaustion
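Resource exhaustion usually surfaces as pressure conditions on the node. A quick check, assuming metrics-server is installed for kubectl top:
# Current CPU/memory usage per node (requires metrics-server)
kubectl top nodes
# Pressure conditions reported by the kubelet
kubectl describe node <node-name> | grep -i pressure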
Essential kubectl Commands
Check Resource Status
# Check pod status
kubectl get pods -A
# Check pod details
kubectl describe pod <pod-name> -n <namespace>
# Check node status
kubectl get nodes
# Check node details
kubectl describe node <node-name>
# Check service endpoints
kubectl get endpoints <service-name>
# Check events
kubectl get events --sort-by='.lastTimestamp'
View Logs
# Pod logs
kubectl logs <pod-name> -n <namespace>
# Previous container instance
kubectl logs <pod-name> --previous
# All pods with label
kubectl logs -l app=my-app
# Follow logs
kubectl logs -f <pod-name>
# Container logs in multi-container pod
kubectl logs <pod-name> -c <container-name>
Debugging Commands
# Execute into pod
kubectl exec -it <pod-name> -- /bin/sh
# Port forward
kubectl port-forward <pod-name> 8080:80
# Debug with an ephemeral container
kubectl debug <pod-name> -it --image=busybox
# Check resource usage (requires metrics-server)
kubectl top pods
kubectl top nodes
Inspect Configuration
# Get resource YAML
kubectl get pod <pod-name> -o yaml
# Check API resources
kubectl api-resources
# Check API versions
kubectl api-versions
# Check cluster info
kubectl cluster-info
# Check config
kubectl config view
Debugging Workflow
1. Identify the Issue
# Start with high-level status
kubectl get all -A
# Check for pods that are not Running (also matches completed Jobs)
kubectl get pods -A --field-selector=status.phase!=Running
# Check events
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
2. Narrow Down the Scope
Determine whether the issue affects:
- A specific pod
- A service
- A node
- The entire cluster
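A few quick comparisons help establish the scope, for example whether failures are spread across namespaces or concentrated on one node. A sketch, with <node-name> as a placeholder:
# Non-running pods across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running
# Do the failures share a node? (check the NODE column)
kubectl get pods -A -o wide | grep -v Running
# Everything scheduled to a suspect node
kubectl get pods -A -o wide --field-selector spec.nodeName=<node-name>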
3. Gather Information
# Describe the resource
kubectl describe <resource-type> <resource-name>
# Check logs
kubectl logs <pod-name>
# Check configuration
kubectl get <resource-type> <resource-name> -o yaml
4. Analyze Events
Events provide chronological information about what happened:
# Watch events in real-time
kubectl get events -w
# Filter events by reason
kubectl get events --field-selector reason=Failed
# Events for specific resource
kubectl describe pod <pod-name> | grep -A 10 Events
5. Check Logs
# Application logs
kubectl logs <pod-name>
# Previous instance (for crash loops)
kubectl logs <pod-name> --previous
# System component logs (run on the affected node)
journalctl -u kubelet
6. Verify Configuration
# Check if YAML is valid
kubectl apply --dry-run=client -f manifest.yaml
# Validate resource
kubectl get <resource> -o yaml | kubectl apply --dry-run=server -f -
# Check resource limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"
Troubleshooting Decision Tree
graph TD
A[Issue Detected] --> B{Resource Type?}
B -->|Pod| C[Pod Issues]
B -->|Service| D[Service Issues]
B -->|Node| E[Node Issues]
B -->|Cluster| F[Cluster Issues]
C --> C1{Pod Status?}
C1 -->|Pending| C2[Check Scheduling]
C1 -->|CrashLoopBackOff| C3[Check Logs]
C1 -->|ImagePullBackOff| C4[Check Image]
C1 -->|Error| C3
D --> D1{Service Accessible?}
D1 -->|No| D2[Check Endpoints]
D1 -->|Yes| D3[Check Selector]
E --> E1{Node Status?}
E1 -->|NotReady| E2[Check Kubelet]
E1 -->|Ready| E3[Check Resources]
F --> F1{API Server?}
F1 -->|Down| F2[Check Control Plane]
F1 -->|Up| F3[Check etcd]
style A fill:#e1f5ff
style C3 fill:#fff4e1
style D2 fill:#fff4e1
style E2 fill:#fff4e1
style F2 fill:#fff4e1
Common Issues and Solutions
Pod Stuck in Pending
# Check why pod can't be scheduled
kubectl describe pod <pod-name> | grep -A 5 Events
# Common causes:
# - Insufficient resources
# - No nodes match affinity rules
# - PVC pending
# - Taints without tolerations
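To confirm whether scheduling is blocked by resources or taints, compare the pod's requests against what nodes can actually offer. A minimal sketch:
# Allocatable capacity and current requests per node
kubectl describe nodes | grep -A 8 "Allocated resources"
# Taints that the pod may need tolerations for
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'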
CrashLoopBackOff
# Check logs from previous instance
kubectl logs <pod-name> --previous
# Common causes:
# - Application errors
# - Missing configuration
# - Resource limits
# - Health check failures
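The container's last termination state usually tells you whether it exited with an application error or was OOM-killed. A sketch; the jsonpath assumes the first container in the pod:
# Exit code and reason of the last terminated instance
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# Restart count at a glance
kubectl get pod <pod-name>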
ImagePullBackOff
# Check image pull secrets
kubectl get pod <pod-name> -o jsonpath='{.spec.imagePullSecrets}'
# Common causes:
# - Wrong image name/tag
# - Private registry without credentials
# - Network issues
# - Registry unavailable
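The exact pull error is recorded in the pod's events, and it is worth confirming that any referenced pull secret actually exists in the namespace. A sketch, with <secret-name> as a placeholder:
# Exact error from the kubelet's pull attempt
kubectl describe pod <pod-name> | grep -A 10 Events
# Does the referenced pull secret exist?
kubectl get secret <secret-name> -n <namespace>
# Image reference actually used by the pod
kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].image}'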
Service Not Accessible
# Check service endpoints
kubectl get endpoints <service-name>
# Check service selector
kubectl get service <service-name> -o yaml
# Verify pods match selector
kubectl get pods -l <selector>
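If endpoints exist but the service is still unreachable, test DNS resolution from inside the cluster. A sketch using a throwaway pod; busybox:1.36 is just an assumption, any image with nslookup works:
# Resolve the service name from a temporary pod
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup <service-name>.<namespace>.svc.cluster.local
# Is CoreDNS itself healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns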
Node Not Ready
# Check kubelet status (run on the affected node)
systemctl status kubelet
# Check kubelet logs
journalctl -u kubelet -n 50
# Common causes:
# - Kubelet not running
# - Network issues
# - Disk pressure
# - Memory pressure
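The node's conditions show exactly which check is failing (Ready, MemoryPressure, DiskPressure, PIDPressure). A quick way to read them:
# Condition summary for one node
kubectl describe node <node-name> | grep -A 10 Conditions
# Machine-readable condition list
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{"\n"}{end}'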
Debugging Tools
kubectl debug
Attach ephemeral debug containers, create a debuggable copy of a pod, or open a shell on a node:
# Debug running pod
kubectl debug <pod-name> -it --image=busybox
# Copy pod with debug container
kubectl debug <pod-name> -it --image=busybox --copy-to=debug-pod
# Debug node
kubectl debug node/<node-name> -it --image=busybox
kubectl exec
Execute commands in running containers:
# Basic exec
kubectl exec -it <pod-name> -- /bin/sh
# Run specific command
kubectl exec <pod-name> -- env
# Exec in specific container
kubectl exec -it <pod-name> -c <container-name> -- /bin/sh
Port Forwarding
Access services locally:
# Forward pod port
kubectl port-forward <pod-name> 8080:80
# Forward service port
kubectl port-forward svc/<service-name> 8080:80
# Forward in background
kubectl port-forward <pod-name> 8080:80 &
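Once a forward is running, verify it from another terminal; this assumes the pod serves HTTP on the forwarded port:
# Confirm the forwarded port responds locally
curl -v http://localhost:8080/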
Topics
- Clusters & Nodes - Troubleshooting cluster and node issues
- Networking - Troubleshooting network connectivity
See Also
- Debugging Toolkit - Tools for debugging Kubernetes
- Events - Understanding Kubernetes events
- kubectl debug - Debugging with ephemeral containers