EKS Troubleshooting
Troubleshooting EKS issues requires understanding cluster components, networking, authentication, and AWS service integration. This guide covers common issues, debugging techniques, and resolution strategies for EKS clusters.
Troubleshooting Approach
Systematic troubleshooting process:
- Identify the symptoms (cluster, node, pod, or network level)
- Gather diagnostics with kubectl, the AWS CLI, and logs
- Narrow down the likely causes
- Apply a fix, then verify and document the resolution
Common Cluster Issues
Cluster Not Accessible
Symptoms:
- kubectl commands fail
- “Unable to connect to the server” errors
- API server timeout
Diagnosis:
# Check cluster status
aws eks describe-cluster --name my-cluster --region us-west-2
# Check cluster endpoint
aws eks describe-cluster --name my-cluster --query "cluster.endpoint"
# Test connectivity
curl -k https://<cluster-endpoint>/healthz
# Check kubeconfig
kubectl config view
kubectl config get-contexts
Common Causes:
- Cluster endpoint access restricted (private endpoint)
- Network connectivity issues
- kubeconfig not configured
- IAM authentication issues
Solutions:
# Update kubeconfig
aws eks update-kubeconfig --name my-cluster --region us-west-2
# Check endpoint access
aws eks describe-cluster --name my-cluster --query "cluster.resourcesVpcConfig.endpointPublicAccess"
aws eks describe-cluster --name my-cluster --query "cluster.resourcesVpcConfig.endpointPrivateAccess"
# Enable public endpoint if needed
aws eks update-cluster-config \
--name my-cluster \
--resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true
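If the public endpoint must stay enabled, it can be restricted to known source ranges instead of being open to the internet. A minimal sketch, assuming a placeholder CIDR that you would replace with your own office or VPN range:
# Limit public API access to specific CIDRs (placeholder range)
aws eks update-cluster-config \
--name my-cluster \
--resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true,publicAccessCidrs="203.0.113.0/24"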
Cluster Control Plane Issues
Symptoms:
- Cluster status shows “UPDATING” or “FAILED”
- Control plane components unhealthy
- API server errors
Diagnosis:
# Check cluster status
aws eks describe-cluster --name my-cluster --query "cluster.status"
# Check cluster health
aws eks describe-cluster --name my-cluster --query "cluster.health"
# Review cluster logs
aws eks describe-cluster --name my-cluster --query "cluster.logging"
Common Causes:
- Control plane version incompatibility
- IAM role issues
- VPC configuration problems
- Service quota limits
Solutions:
# Check IAM role
aws eks describe-cluster --name my-cluster --query "cluster.roleArn"
# Verify IAM role permissions
aws iam get-role --role-name eks-service-role
# Check service quotas
aws service-quotas get-service-quota \
--service-code eks \
--quota-code L-1194A341
Node Connectivity Problems
Nodes Not Joining Cluster
Symptoms:
- Nodes created but not appearing in kubectl get nodes
- Nodes show “NotReady” status
- Pods can’t be scheduled
Diagnosis:
# Check node status
kubectl get nodes
kubectl describe node <node-name>
# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml
# Check node logs (SSH to node)
journalctl -u kubelet
journalctl -u containerd
Common Causes:
- IAM role not configured correctly
- Security group rules blocking traffic
- Bootstrap script issues
- aws-auth ConfigMap missing node role
Solutions:
# Verify node IAM role
aws eks describe-nodegroup \
--cluster-name my-cluster \
--nodegroup-name general-workers \
--query "nodegroup.nodeRole"
# Check security group rules
aws ec2 describe-security-groups \
--group-ids <node-security-group-id>
# Update aws-auth ConfigMap
kubectl edit configmap aws-auth -n kube-system
# Add node role mapping
eksctl create iamidentitymapping \
--cluster my-cluster \
--arn arn:aws:iam::123456789012:role/eks-node-role \
--username system:node:{{EC2PrivateDNSName}} \
--group system:bootstrappers \
--group system:nodes
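After updating the mapping, it helps to confirm that the node role actually appears in aws-auth and then watch for the nodes to register; a quick check using the same cluster name as above:
# Confirm the identity mappings eksctl knows about
eksctl get iamidentitymapping --cluster my-cluster
# Watch for nodes to register and become Ready
kubectl get nodes --watch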
Node NotReady Status
Symptoms:
- Nodes show “NotReady” status
- Pods can’t be scheduled on nodes
- Node conditions show problems
Diagnosis:
# Describe node for details
kubectl describe node <node-name>
# Check node conditions
kubectl get node <node-name> -o json | jq '.status.conditions'
# Check kubelet status
kubectl get node <node-name> -o json | jq '.status.nodeInfo'
# Check node events
kubectl get events --field-selector involvedObject.name=<node-name>
Common Causes:
- kubelet not running
- Network connectivity issues
- Resource pressure (memory, disk)
- Container runtime issues
Solutions:
# Restart kubelet (SSH to node)
sudo systemctl restart kubelet
# Check kubelet logs
sudo journalctl -u kubelet -f
# Check disk space
df -h
# Check memory
free -h
# Check container runtime
sudo systemctl status containerd
sudo systemctl restart containerd
Node Resource Pressure
Symptoms:
- Nodes show “MemoryPressure” or “DiskPressure”
- Pods being evicted
- Node tainted
Diagnosis:
# Check node conditions
kubectl describe node <node-name> | grep -A 5 Conditions
# Check resource usage
kubectl top node <node-name>
# Check pod resource usage
kubectl top pods --all-namespaces
# Check disk usage
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type=="DiskPressure")'
Solutions:
# List images cached on the node (unused images can be pruned on the node with crictl)
kubectl get node <node-name> -o json | jq '.status.images'
# Clean up terminated pods
kubectl delete pods --all-namespaces --field-selector status.phase=Succeeded
# Increase node disk size
# Update node group with larger volume size
# Add more nodes
eksctl scale nodegroup \
--cluster my-cluster \
--name general-workers \
--nodes 5
Networking Troubleshooting
Pod Networking Issues
Symptoms:
- Pods can’t communicate with each other
- Pods can’t reach external services
- DNS resolution failing
Diagnosis:
# Check pod network
kubectl get pods -o wide
# Test pod connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default
# Check VPC CNI pods
kubectl get pods -n kube-system -l k8s-app=aws-node
# Check VPC CNI logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100
# Check aws-node pod status details
kubectl get pods -n kube-system -l k8s-app=aws-node -o json | jq '.items[].status'
Common Causes:
- VPC CNI not running
- Insufficient IP addresses
- Security group rules
- Route table issues
Solutions:
# Restart VPC CNI
kubectl delete pods -n kube-system -l k8s-app=aws-node
# Check IP address availability
aws ec2 describe-network-interfaces \
--filters "Name=subnet-id,Values=<subnet-id>" \
--query "NetworkInterfaces[*].PrivateIpAddress"
# Verify security group rules
aws ec2 describe-security-groups \
--group-ids <security-group-id>
# Check route tables
aws ec2 describe-route-tables \
--filters "Name=association.subnet-id,Values=<subnet-id>"
Load Balancer Issues
Symptoms:
- LoadBalancer service stuck in “Pending”
- Load balancer not accessible
- Health checks failing
Diagnosis:
# Check service status
kubectl get svc <service-name>
kubectl describe svc <service-name>
# Check AWS Load Balancer Controller
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
# Check controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=100
# Check load balancer in AWS
aws elbv2 describe-load-balancers \
--query "LoadBalancers[?contains(LoadBalancerName, '<service-name>')]"
Common Causes:
- AWS Load Balancer Controller not installed
- IAM permissions missing
- Subnet tags incorrect
- Security group rules
Solutions:
# Verify controller is running
kubectl get deployment aws-load-balancer-controller -n kube-system
# Check IAM permissions
aws iam get-role-policy \
--role-name aws-load-balancer-controller-role \
--policy-name <policy-name>
# Verify subnet tags
aws ec2 describe-subnets \
--subnet-ids <subnet-id> \
--query "Subnets[*].Tags"
# Tag subnets if needed
aws ec2 create-tags \
--resources <subnet-id> \
--tags Key=kubernetes.io/role/elb,Value=1
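The kubernetes.io/role/elb tag applies to public subnets; private subnets used for internal load balancers need a different tag. A sketch with a placeholder subnet ID:
# Tag private subnets for internal load balancers
aws ec2 create-tags \
--resources <private-subnet-id> \
--tags Key=kubernetes.io/role/internal-elb,Value=1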
DNS Resolution Issues
Symptoms:
- Pods can’t resolve service names
- External DNS not working
- CoreDNS errors
Diagnosis:
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100
# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default.svc.cluster.local
# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml
Solutions:
# Restart CoreDNS
kubectl delete pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS service
kubectl get svc kube-dns -n kube-system
# Verify CoreDNS endpoints
kubectl get endpoints kube-dns -n kube-system
# Update CoreDNS configuration if needed
kubectl edit configmap coredns -n kube-system
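If resolution through the kube-dns service keeps failing, querying a CoreDNS pod IP directly helps distinguish a CoreDNS problem from a kube-proxy or service-networking problem; a sketch using a placeholder pod IP taken from the first command:
# Get CoreDNS pod IPs
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
# Query a CoreDNS pod directly, bypassing the kube-dns ClusterIP
kubectl run -it --rm dnstest --image=busybox --restart=Never -- nslookup kubernetes.default.svc.cluster.local <coredns-pod-ip>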
Authentication and Authorization Issues
kubectl Access Denied
Symptoms:
- kubectl commands return “Forbidden” or “Unauthorized”
- Can’t access cluster resources
Diagnosis:
# Check current user
aws sts get-caller-identity
# Test cluster access
kubectl auth can-i get pods
# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml
# Check RBAC
kubectl get rolebindings,clusterrolebindings --all-namespaces
Common Causes:
- IAM user/role not in aws-auth ConfigMap
- RBAC permissions missing
- Cluster endpoint access restricted
Solutions:
# Add IAM user to cluster
eksctl create iamidentitymapping \
--cluster my-cluster \
--arn arn:aws:iam::123456789012:user/john \
--username john \
--group system:masters
# Add IAM role to cluster
eksctl create iamidentitymapping \
--cluster my-cluster \
--arn arn:aws:iam::123456789012:role/eks-admin \
--username eks-admin \
--group system:masters
# Create RBAC role
kubectl create role developer \
--resource=pods,services \
--verb=get,list,create,update,delete
# Bind role to user
kubectl create rolebinding developer-binding \
--role=developer \
--user=john
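After mapping the identity and creating the binding, impersonation is a quick way to confirm the permissions resolve the way you expect; a sketch using the example username from above:
# Verify what the mapped user can do
kubectl auth can-i get pods --as john
kubectl auth can-i delete services --as john
kubectl auth can-i --list --as john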
IRSA Not Working
Symptoms:
- Pods can’t assume IAM roles
- AWS SDK calls fail
- “Access Denied” errors in pods
Diagnosis:
# Check service account
kubectl get serviceaccount <sa-name> -o yaml
# Check OIDC provider
aws iam list-open-id-connect-providers
# Check IAM role trust policy
aws iam get-role --role-name <role-name> --query "Role.AssumeRolePolicyDocument"
# Test in pod
kubectl run -it --rm test --image=amazon/aws-cli:latest \
--overrides='{"apiVersion": "v1", "spec": {"serviceAccountName": "<sa-name>"}}' \
--command -- aws sts get-caller-identity
Common Causes:
- OIDC provider not created
- Service account annotation incorrect
- IAM role trust policy wrong
- Pod not using service account
Solutions:
# Create OIDC provider
eksctl utils associate-iam-oidc-provider \
--cluster my-cluster \
--approve
# Update service account annotation
kubectl annotate serviceaccount <sa-name> \
eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/<role-name>
# Verify trust policy
aws iam get-role --role-name <role-name> --query "Role.AssumeRolePolicyDocument"
# Restart pods to pick up new credentials
kubectl rollout restart deployment <deployment-name>
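If the role and annotated service account don't exist yet, eksctl can create both in one step and wire up the trust policy; a minimal sketch assuming a hypothetical service account name and an example AWS managed policy:
# Create an IAM role and annotated service account together (hypothetical name, example policy)
eksctl create iamserviceaccount \
--cluster my-cluster \
--namespace default \
--name app-sa \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--approve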
Storage Problems
Volume Not Attaching
Symptoms:
- Pod stuck in “Pending”
- PVC not bound
- Volume attachment errors
Diagnosis:
# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>
# Check PV status
kubectl get pv
kubectl describe pv <pv-name>
# Check EBS CSI driver
kubectl get pods -n kube-system -l app=ebs-csi-controller
# Check EBS CSI logs
kubectl logs -n kube-system -l app=ebs-csi-controller --tail=100
Common Causes:
- EBS CSI driver not installed
- IAM permissions missing
- Volume in different AZ than node
- Storage class misconfiguration
Solutions:
# Verify EBS CSI driver
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name aws-ebs-csi-driver
# Check IAM role
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name aws-ebs-csi-driver \
--query "addon.serviceAccountRoleArn"
# Verify storage class
kubectl get storageclass
kubectl describe storageclass <sc-name>
# Check volume in AWS
aws ec2 describe-volumes \
--filters "Name=tag:kubernetes.io/created-for/pvc/name,Values=<pvc-name>"
EFS Mount Issues
Symptoms:
- EFS mounts timing out
- Pods can’t access EFS
- Permission denied errors
Diagnosis:
# Check EFS CSI driver
kubectl get pods -n kube-system -l app=efs-csi-controller
# Check EFS CSI logs
kubectl logs -n kube-system -l app=efs-csi-controller --tail=100
# Check mount targets
aws efs describe-mount-targets \
--file-system-id <efs-id>
# Test NFS reachability to a mount target from a pod (mounting directly requires amazon-efs-utils and a privileged pod)
kubectl run -it --rm test --image=nicolaka/netshoot --restart=Never -- nc -zv <mount-target-ip> 2049
Common Causes:
- Mount targets not in all subnets
- Security group rules blocking NFS
- EFS CSI driver not installed
- Network connectivity issues
Solutions:
# Create mount targets in all subnets
aws efs create-mount-target \
--file-system-id <efs-id> \
--subnet-id <subnet-id> \
--security-groups <security-group-id>
# Verify security group rules
aws ec2 describe-security-groups \
--group-ids <security-group-id> \
--query "SecurityGroups[*].IpPermissions[?FromPort==\`2049\`]"
# Add NFS rule if missing
aws ec2 authorize-security-group-ingress \
--group-id <security-group-id> \
--protocol tcp \
--port 2049 \
--cidr 10.0.0.0/16
Performance Issues
Slow Pod Startup
Symptoms:
- Pods take long time to start
- Image pull delays
- Container startup slow
Diagnosis:
# Check pod events
kubectl describe pod <pod-name>
# Check image pull times
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
# Check node resources
kubectl top node <node-name>
# Check container runtime
kubectl get node <node-name> -o json | jq '.status.nodeInfo.containerRuntimeVersion'
Solutions:
# Use image pull secrets for private registries
kubectl create secret docker-registry regcred \
--docker-server=<registry> \
--docker-username=<user> \
--docker-password=<pass>
# Pre-pull images on nodes
# Use node affinity to schedule on nodes with images
# Optimize container images
# Use smaller base images
# Multi-stage builds
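Once the regcred secret above exists, pods only use it if it is referenced; attaching it to the service account avoids editing every pod spec. A sketch assuming workloads run under the default service account:
# Reference the pull secret from the service account (assumes the default service account)
kubectl patch serviceaccount default \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'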
High Resource Usage
Symptoms:
- Nodes running out of resources
- Pods being evicted
- Performance degradation
Diagnosis:
# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Check resource requests and limits
kubectl get pods -o json | jq '.items[] | {name: .metadata.name, requests: .spec.containers[].resources.requests, limits: .spec.containers[].resources.limits}'
# Check node capacity
kubectl describe node <node-name> | grep -A 10 "Allocated resources"
Solutions:
# Right-size resource requests
kubectl set resources deployment <deployment-name> \
--requests=cpu=100m,memory=128Mi \
--limits=cpu=500m,memory=512Mi
# Scale horizontally
kubectl scale deployment <deployment-name> --replicas=5
# Add more nodes
eksctl scale nodegroup \
--cluster my-cluster \
--name general-workers \
--nodes 10
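Instead of scaling replicas by hand, a Horizontal Pod Autoscaler can do it based on CPU; a sketch that assumes the Metrics Server is installed (kubectl top above already depends on it):
# Autoscale between 2 and 10 replicas at 70% average CPU
kubectl autoscale deployment <deployment-name> --cpu-percent=70 --min=2 --max=10
# Confirm the HPA is tracking metrics
kubectl get hpa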
Cost Optimization Issues
Unexpected Costs
Symptoms:
- Higher than expected AWS bills
- Unused resources
- Inefficient resource usage
Diagnosis:
# Check cluster resources
kubectl get nodes
kubectl get pods --all-namespaces
# Check EBS volumes
aws ec2 describe-volumes \
--filters "Name=tag:kubernetes.io/cluster/<cluster-name>,Values=owned"
# Check load balancers
aws elbv2 describe-load-balancers \
--query "LoadBalancers[?contains(LoadBalancerName, '<cluster-name>')]"
# Use Cost Explorer
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-01-31 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
Solutions:
# Delete unused resources
kubectl delete pvc <unused-pvc>
kubectl delete svc <unused-service>
# Use spot instances
eksctl create nodegroup \
--cluster my-cluster \
--name spot-workers \
--instance-types t3.medium,t3.large \
--capacity-type SPOT \
--nodes 0 \
--nodes-min 0 \
--nodes-max 10
# Right-size instances
# Review and adjust instance types based on usage
# Enable cluster autoscaling
# Scale down during low usage periods
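Orphaned EBS volumes from deleted PVCs are a common source of silent cost; volumes in the available state are attached to nothing and are worth reviewing for deletion:
# Find unattached EBS volumes left behind by deleted PVCs
aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query "Volumes[*].{ID:VolumeId,SizeGiB:Size,Created:CreateTime}"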
Debugging Tools and Techniques
Useful Commands
# Get comprehensive cluster info
kubectl cluster-info dump
# Get all resources
kubectl get all --all-namespaces
# Describe resource for details
kubectl describe <resource-type> <resource-name>
# Get events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
# Check API resources
kubectl api-resources
# Check API versions
kubectl api-versions
Log Collection
# Dump cluster state and pod logs to a directory
kubectl cluster-info dump --all-namespaces --output-directory=cluster-dump
# Collect node logs (SSH to node)
sudo journalctl -u kubelet > kubelet.log
sudo journalctl -u containerd > containerd.log
# Check which control plane log types are enabled
aws eks describe-cluster --name my-cluster --query "cluster.logging"
Network Debugging
# Test connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never
# Inside debug pod:
# Test DNS
nslookup kubernetes.default.svc.cluster.local
# Test connectivity
curl http://<service-name>.<namespace>.svc.cluster.local
# Check routes
ip route
# Check network interfaces
ip addr
AWS Support and Resources
AWS Support
- AWS Support Center - Create support cases
- AWS Forums - Community support
- AWS Documentation - Official documentation
- AWS Premium Support - Enhanced support options
Useful Resources
- EKS User Guide - https://docs.aws.amazon.com/eks/
- EKS Best Practices - https://aws.github.io/aws-eks-best-practices/
- Kubernetes Documentation - https://kubernetes.io/docs/
- EKS GitHub - https://github.com/aws/containers-roadmap
Getting Help
- Check Logs - Always start with logs
- Review Documentation - Check official docs
- Search Forums - Look for similar issues
- Create Support Case - For AWS-specific issues
- Community Forums - Ask in Kubernetes/EKS communities
Best Practices for Troubleshooting
- Start with Logs - Check pod, node, and cluster logs first
- Use Describe Commands - kubectl describe provides detailed information
- Check Events - Kubernetes events show what’s happening
- Verify Prerequisites - Ensure IAM, networking, and resources are correct
- Test Incrementally - Test changes one at a time
- Document Issues - Keep notes on issues and resolutions
- Use Debug Pods - Create debug pods for testing
- Check AWS Console - Verify AWS resources are created correctly
- Review Best Practices - Follow EKS best practices guides
- Stay Updated - Keep cluster and add-ons updated
See Also
- Cluster Setup - Initial configuration
- Networking - Network troubleshooting
- Security - Security issues
- Node Management - Node issues