EKS Troubleshooting

Troubleshooting EKS issues requires understanding cluster components, networking, authentication, and AWS service integration. This guide covers common issues, debugging techniques, and resolution strategies for EKS clusters.

Troubleshooting Approach

Systematic troubleshooting process:

graph TB
    A[Identify Issue] --> B{Gather Information}
    B --> C[Check Cluster Status]
    B --> D[Check Node Status]
    B --> E[Check Pod Status]
    B --> F[Check Logs]
    C --> G{Issue Type?}
    D --> G
    E --> G
    F --> G
    G -->|Cluster| H[Cluster Issues]
    G -->|Node| I[Node Issues]
    G -->|Pod| J[Pod Issues]
    G -->|Network| K[Network Issues]
    G -->|Auth| L[Auth Issues]
    H --> M[Resolve]
    I --> M
    J --> M
    K --> M
    L --> M
    style A fill:#e1f5ff
    style M fill:#e8f5e9

Common Cluster Issues

Cluster Not Accessible

Symptoms:

  • kubectl commands fail
  • “Unable to connect to the server” errors
  • API server timeout

Diagnosis:

# Check cluster status
aws eks describe-cluster --name my-cluster --region us-west-2

# Check cluster endpoint
aws eks describe-cluster --name my-cluster --query "cluster.endpoint"

# Test connectivity
curl -k https://<cluster-endpoint>/healthz

# Check kubeconfig
kubectl config view
kubectl config get-contexts

Common Causes:

  • Cluster endpoint access restricted (private endpoint)
  • Network connectivity issues
  • kubeconfig not configured
  • IAM authentication issues

Solutions:

# Update kubeconfig
aws eks update-kubeconfig --name my-cluster --region us-west-2

# Check endpoint access
aws eks describe-cluster --name my-cluster --query "cluster.resourcesVpcConfig.endpointPublicAccess"
aws eks describe-cluster --name my-cluster --query "cluster.resourcesVpcConfig.endpointPrivateAccess"

# Enable public endpoint if needed
aws eks update-cluster-config \
  --name my-cluster \
  --resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true

Cluster Control Plane Issues

Symptoms:

  • Cluster status shows “UPDATING” or “FAILED”
  • Control plane components unhealthy
  • API server errors

Diagnosis:

# Check cluster status
aws eks describe-cluster --name my-cluster --query "cluster.status"

# Check cluster health
aws eks describe-cluster --name my-cluster --query "cluster.health"

# Check which control plane log types are enabled
aws eks describe-cluster --name my-cluster --query "cluster.logging"

Common Causes:

  • Control plane version incompatibility
  • IAM role issues
  • VPC configuration problems
  • Service quota limits

Solutions:

# Check IAM role
aws eks describe-cluster --name my-cluster --query "cluster.roleArn"

# Verify IAM role permissions
aws iam get-role --role-name eks-service-role

# Check service quotas
aws service-quotas get-service-quota \
  --service-code eks \
  --quota-code L-1194A341
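
If logging is disabled, there is nothing to review. A minimal sketch for enabling all five control plane log types, which are delivered to CloudWatch Logs under the standard /aws/eks/my-cluster/cluster log group:

# Enable all control plane log types
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'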

Node Connectivity Problems

Nodes Not Joining Cluster

Symptoms:

  • Nodes created but not appearing in kubectl get nodes
  • Nodes show “NotReady” status
  • Pods can’t be scheduled

Diagnosis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

# Check node logs (SSH to node)
journalctl -u kubelet
journalctl -u containerd

Common Causes:

  • IAM role not configured correctly
  • Security group rules blocking traffic
  • Bootstrap script issues
  • aws-auth ConfigMap missing node role

Solutions:

# Verify node IAM role
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --query "nodegroup.nodeRole"

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids <node-security-group-id>

# Update aws-auth ConfigMap
kubectl edit configmap aws-auth -n kube-system

# Add node role mapping
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/eks-node-role \
  --username system:node:{{EC2PrivateDNSName}} \
  --group system:bootstrappers \
  --group system:nodes
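
Nodes also need to reach the API server through the EKS-managed cluster security group; if custom security groups are in play, confirm which one the cluster expects:

# Find the cluster security group nodes must be able to use
aws eks describe-cluster \
  --name my-cluster \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId"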

Node NotReady Status

Symptoms:

  • Nodes show “NotReady” status
  • Pods can’t be scheduled on nodes
  • Node conditions show problems

Diagnosis:

# Describe node for details
kubectl describe node <node-name>

# Check node conditions
kubectl get node <node-name> -o json | jq '.status.conditions'

# Check kubelet status
kubectl get node <node-name> -o json | jq '.status.nodeInfo'

# Check node events
kubectl get events --field-selector involvedObject.name=<node-name>

Common Causes:

  • kubelet not running
  • Network connectivity issues
  • Resource pressure (memory, disk)
  • Container runtime issues

Solutions:

# Restart kubelet (SSH to node)
sudo systemctl restart kubelet

# Check kubelet logs
sudo journalctl -u kubelet -f

# Check disk space
df -h

# Check memory
free -h

# Check container runtime
sudo systemctl status containerd
sudo systemctl restart containerd

Node Resource Pressure

Symptoms:

  • Nodes show “MemoryPressure” or “DiskPressure”
  • Pods being evicted
  • Node tainted

Diagnosis:

# Check node conditions
kubectl describe node <node-name> | grep -A 5 Conditions

# Check resource usage
kubectl top node <node-name>

# Check pod resource usage
kubectl top pods --all-namespaces

# Check disk usage
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type=="DiskPressure")'

Solutions:

# List images cached on the node (the kubelet garbage-collects unused images under disk pressure)
kubectl get node <node-name> -o json | jq '.status.images'

# Clean up completed pods
kubectl delete pods --all-namespaces --field-selector=status.phase==Succeeded

# Increase node disk size (volume size can't be changed in place;
# see the replacement node group sketch below)

# Add more nodes
eksctl scale nodegroup \
  --cluster my-cluster \
  --name general-workers \
  --nodes 5
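
A node group's root volume size can't be changed in place. A common pattern is to create a replacement node group with a larger volume, then drain and delete the old one. A sketch; the node group name is hypothetical:

# Create a replacement node group with a 100 GiB root volume
eksctl create nodegroup \
  --cluster my-cluster \
  --name general-workers-v2 \
  --node-volume-size 100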

Networking Troubleshooting

Pod Networking Issues

Symptoms:

  • Pods can’t communicate with each other
  • Pods can’t reach external services
  • DNS resolution failing

Diagnosis:

# Check pod network
kubectl get pods -o wide

# Test pod connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check VPC CNI (aws-node) pods
kubectl get pods -n kube-system -l k8s-app=aws-node

# Check VPC CNI logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100

# Check ENI allocation
kubectl get pods -n kube-system -l k8s-app=aws-node -o json | jq '.items[].status'

Common Causes:

  • VPC CNI not running
  • Insufficient IP addresses
  • Security group rules
  • Route table issues

Solutions:

# Restart VPC CNI (the DaemonSet is named aws-node)
kubectl rollout restart daemonset aws-node -n kube-system

# Check IP address availability
aws ec2 describe-network-interfaces \
  --filters "Name=subnet-id,Values=<subnet-id>" \
  --query "NetworkInterfaces[*].PrivateIpAddress"

# Verify security group rules
aws ec2 describe-security-groups \
  --group-ids <security-group-id>

# Check route tables
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=<subnet-id>"

Load Balancer Issues

Symptoms:

  • LoadBalancer service stuck in “Pending”
  • Load balancer not accessible
  • Health checks failing

Diagnosis:

# Check service status
kubectl get svc <service-name>
kubectl describe svc <service-name>

# Check AWS Load Balancer Controller
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller

# Check controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=100

# Check load balancer in AWS
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?contains(LoadBalancerName, '<service-name>')]"

Common Causes:

  • AWS Load Balancer Controller not installed
  • IAM permissions missing
  • Subnet tags incorrect
  • Security group rules

Solutions:

# Verify controller is running
kubectl get deployment aws-load-balancer-controller -n kube-system

# Check IAM policies attached to the controller role
aws iam list-attached-role-policies \
  --role-name aws-load-balancer-controller-role

# Verify subnet tags
aws ec2 describe-subnets \
  --subnet-ids <subnet-id> \
  --query "Subnets[*].Tags"

# Tag subnets if needed (use kubernetes.io/role/internal-elb for private subnets)
aws ec2 create-tags \
  --resources <subnet-id> \
  --tags Key=kubernetes.io/role/elb,Value=1
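
If the controller is missing entirely, the Helm chart is the usual install path. A minimal sketch, assuming the IRSA service account aws-load-balancer-controller already exists in kube-system:

# Install the AWS Load Balancer Controller via Helm
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller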

DNS Resolution Issues

Symptoms:

  • Pods can’t resolve service names
  • External DNS not working
  • CoreDNS errors

Diagnosis:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default.svc.cluster.local

# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

Solutions:

# Restart CoreDNS
kubectl delete pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS service
kubectl get svc kube-dns -n kube-system

# Verify CoreDNS endpoints
kubectl get endpoints kube-dns -n kube-system

# Update CoreDNS configuration if needed
kubectl edit configmap coredns -n kube-system
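
On busy clusters, DNS failures are often load-related rather than configuration-related; scaling CoreDNS out is a quick mitigation (the replica count here is illustrative):

# Scale CoreDNS if queries are dropped under load
kubectl scale deployment coredns -n kube-system --replicas=4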

Authentication and Authorization Issues

kubectl Access Denied

Symptoms:

  • kubectl commands return “Forbidden” or “Unauthorized”
  • Can’t access cluster resources

Diagnosis:

# Check current user
aws sts get-caller-identity

# Test cluster access
kubectl auth can-i get pods

# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

# Check RBAC
kubectl get rolebindings,clusterrolebindings --all-namespaces

Common Causes:

  • IAM user/role not in aws-auth ConfigMap
  • RBAC permissions missing
  • Cluster endpoint access restricted

Solutions:

# Add IAM user to cluster
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:user/john \
  --username john \
  --group system:masters

# Add IAM role to cluster
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/eks-admin \
  --username eks-admin \
  --group system:masters

# Create RBAC role
kubectl create role developer \
  --resource=pods,services \
  --verb=get,list,create,update,delete

# Bind role to user
kubectl create rolebinding developer-binding \
  --role=developer \
  --user=john
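
After adding mappings, verify what was actually written before re-testing access:

# List all IAM identity mappings in the cluster
eksctl get iamidentitymapping --cluster my-cluster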

IRSA Not Working

Symptoms:

  • Pods can’t assume IAM roles
  • AWS SDK calls fail
  • “Access Denied” errors in pods

Diagnosis:

# Check service account
kubectl get serviceaccount <sa-name> -o yaml

# Check OIDC provider
aws iam list-open-id-connect-providers

# Check IAM role trust policy
aws iam get-role --role-name <role-name> --query "Role.AssumeRolePolicyDocument"

# Test in pod (newer kubectl removed --serviceaccount from kubectl run; use an override;
# the aws-cli image's entrypoint is already "aws")
kubectl run -it --rm test --image=amazon/aws-cli:latest \
  --overrides='{"apiVersion": "v1", "spec": {"serviceAccountName": "<sa-name>"}}' \
  -- sts get-caller-identity

Common Causes:

  • OIDC provider not created
  • Service account annotation incorrect
  • IAM role trust policy wrong
  • Pod not using service account

Solutions:

# Create OIDC provider
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster \
  --approve

# Update service account annotation
kubectl annotate serviceaccount <sa-name> \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/<role-name>

# Verify trust policy
aws iam get-role --role-name <role-name> --query "Role.AssumeRolePolicyDocument"

# Restart pods to pick up new credentials
kubectl rollout restart deployment <deployment-name>
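
Alternatively, eksctl can create the IAM role, its trust policy, and the annotated service account in one step, which avoids most of the failure modes above. A sketch; the policy ARN below is only an example:

# Create the role, trust policy, and service account in one step
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace default \
  --name <sa-name> \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve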

Storage Problems

Volume Not Attaching

Symptoms:

  • Pod stuck in “Pending”
  • PVC not bound
  • Volume attachment errors

Diagnosis:

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check EBS CSI driver
kubectl get pods -n kube-system -l app=ebs-csi-controller

# Check EBS CSI logs
kubectl logs -n kube-system -l app=ebs-csi-controller --tail=100

Common Causes:

  • EBS CSI driver not installed
  • IAM permissions missing
  • Volume in different AZ than node
  • Storage class misconfiguration

Solutions:

# Verify EBS CSI driver
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver

# Check IAM role
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --query "addon.serviceAccountRoleArn"

# Verify storage class
kubectl get storageclass
kubectl describe storageclass <sc-name>

# Check volume in AWS
aws ec2 describe-volumes \
  --filters "Name=tag:kubernetes.io/created-for/pvc/name,Values=<pvc-name>"

EFS Mount Issues

Symptoms:

  • EFS mounts timing out
  • Pods can’t access EFS
  • Permission denied errors

Diagnosis:

# Check EFS CSI driver
kubectl get pods -n kube-system -l app=efs-csi-controller

# Check EFS CSI logs
kubectl logs -n kube-system -l app=efs-csi-controller --tail=100

# Check mount targets
aws efs describe-mount-targets \
  --file-system-id <efs-id>

# Test connectivity to a mount target from a pod (NFS uses TCP 2049; a real mount
# requires a privileged pod with amazon-efs-utils installed)
kubectl run -it --rm test --image=nicolaka/netshoot --restart=Never -- nc -zv <efs-id>.efs.<region>.amazonaws.com 2049

Common Causes:

  • Mount targets not in all subnets
  • Security group rules blocking NFS
  • EFS CSI driver not installed
  • Network connectivity issues

Solutions:

# Create mount targets in all subnets
aws efs create-mount-target \
  --file-system-id <efs-id> \
  --subnet-id <subnet-id> \
  --security-groups <security-group-id>

# Verify security group rules
aws ec2 describe-security-groups \
  --group-ids <security-group-id> \
  --query "SecurityGroups[*].IpPermissions[?FromPort==\`2049\`]"

# Add NFS rule if missing
aws ec2 authorize-security-group-ingress \
  --group-id <security-group-id> \
  --protocol tcp \
  --port 2049 \
  --cidr 10.0.0.0/16

Performance Issues

Slow Pod Startup

Symptoms:

  • Pods take a long time to start
  • Image pull delays
  • Container startup slow

Diagnosis:

# Check pod events
kubectl describe pod <pod-name>

# Check image pull times
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'

# Check node resources
kubectl top node <node-name>

# Check container runtime
kubectl get node <node-name> -o json | jq '.status.nodeInfo.containerRuntimeVersion'

Solutions:

# Use image pull secrets for private registries
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass>
# Then reference the secret from the pod spec via imagePullSecrets

# Pre-pull images on nodes (see the DaemonSet sketch below)
# Use node affinity to schedule on nodes that already have the images

# Optimize container images
# Use smaller base images
# Multi-stage builds
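
One way to pre-pull, as referenced above, is a DaemonSet whose init container pulls the image on every node and then idles on a pause container. A sketch with placeholder names:

# Pre-pull an image on every node via a DaemonSet
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: image-prepuller
  template:
    metadata:
      labels:
        name: image-prepuller
    spec:
      initContainers:
      - name: prepull
        image: <your-image>            # image to cache on each node
        command: ["sh", "-c", "true"]  # assumes the image ships a shell
      containers:
      - name: pause                    # keeps the pod resident so the image stays cached
        image: registry.k8s.io/pause:3.9
EOF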

High Resource Usage

Symptoms:

  • Nodes running out of resources
  • Pods being evicted
  • Performance degradation

Diagnosis:

# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Check resource requests and limits
kubectl get pods -o json | jq '.items[] | {name: .metadata.name, requests: .spec.containers[].resources.requests, limits: .spec.containers[].resources.limits}'

# Check node capacity
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

Solutions:

# Right-size resource requests
kubectl set resources deployment <deployment-name> \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=512Mi

# Scale horizontally
kubectl scale deployment <deployment-name> --replicas=5

# Add more nodes
eksctl scale nodegroup \
  --cluster my-cluster \
  --name general-workers \
  --nodes 10

Cost Optimization Issues

Unexpected Costs

Symptoms:

  • Higher than expected AWS bills
  • Unused resources
  • Inefficient resource usage

Diagnosis:

# Check cluster resources
kubectl get nodes
kubectl get pods --all-namespaces

# Check EBS volumes
aws ec2 describe-volumes \
  --filters "Name=tag:kubernetes.io/cluster/<cluster-name>,Values=owned"

# Check load balancers
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?contains(LoadBalancerName, '<cluster-name>')]"

# Use Cost Explorer
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

Solutions:

# Delete unused resources
kubectl delete pvc <unused-pvc>
kubectl delete svc <unused-service>

# Use spot instances
eksctl create nodegroup \
  --cluster my-cluster \
  --name spot-workers \
  --instance-types t3.medium,t3.large \
  --spot \
  --nodes 0 \
  --nodes-min 0 \
  --nodes-max 10

# Right-size instances
# Review and adjust instance types based on usage

# Enable cluster autoscaling to scale down during low-usage periods
# (see the Helm sketch below)
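
Cluster Autoscaler, referenced above, is commonly installed from its Helm chart. A minimal sketch, assuming the autoscaler's IAM permissions are already granted via IRSA and node groups carry the standard auto-discovery tags:

# Install Cluster Autoscaler with auto-discovery
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-west-2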

Debugging Tools and Techniques

Useful Commands

# Get comprehensive cluster info
kubectl cluster-info dump

# Get all resources
kubectl get all --all-namespaces

# Describe resource for details
kubectl describe <resource-type> <resource-name>

# Get events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Check API resources
kubectl api-resources

# Check API versions
kubectl api-versions

Log Collection

# Collect logs from every pod in a namespace (kubectl logs has no --all-namespaces flag)
for pod in $(kubectl get pods -n <namespace> -o name); do
  kubectl logs -n <namespace> "$pod" --all-containers >> all-logs.txt
done

# Collect node logs (SSH to node)
sudo journalctl -u kubelet > kubelet.log
sudo journalctl -u containerd > containerd.log

# Check which control plane log types are enabled (logs ship to CloudWatch Logs)
aws eks describe-cluster --name my-cluster --query "cluster.logging"

# Fetch control plane log streams from CloudWatch
aws logs describe-log-streams --log-group-name /aws/eks/my-cluster/cluster

Network Debugging

# Test connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never

# Inside debug pod:
# Test DNS
nslookup kubernetes.default.svc.cluster.local

# Test connectivity
curl http://<service-name>.<namespace>.svc.cluster.local

# Check routes
ip route

# Check network interfaces
ip addr

AWS Support and Resources

AWS Support

  • AWS Support Center - Create support cases
  • AWS Forums - Community support
  • AWS Documentation - Official documentation
  • AWS Premium Support - Enhanced support options

Getting Help

  1. Check Logs - Always start with logs
  2. Review Documentation - Check official docs
  3. Search Forums - Look for similar issues
  4. Create Support Case - For AWS-specific issues
  5. Community Forums - Ask in Kubernetes/EKS communities

Best Practices for Troubleshooting

  1. Start with Logs - Check pod, node, and cluster logs first

  2. Use Describe Commands - kubectl describe provides detailed information

  3. Check Events - Kubernetes events show what’s happening

  4. Verify Prerequisites - Ensure IAM, networking, and resources are correct

  5. Test Incrementally - Test changes one at a time

  6. Document Issues - Keep notes on issues and resolutions

  7. Use Debug Pods - Create debug pods for testing

  8. Check AWS Console - Verify AWS resources are created correctly

  9. Review Best Practices - Follow EKS best practices guides

  10. Stay Updated - Keep cluster and add-ons updated
