EKS Troubleshooting

Troubleshooting EKS issues requires understanding cluster components, networking, authentication, and AWS service integration. This guide covers common issues, debugging techniques, and resolution strategies for EKS clusters.

Troubleshooting Approach

Systematic troubleshooting process:

graph TB
    A[Identify Issue] --> B{Gather Information}
    B --> C[Check Cluster Status]
    B --> D[Check Node Status]
    B --> E[Check Pod Status]
    B --> F[Check Logs]
    C --> G{Issue Type?}
    D --> G
    E --> G
    F --> G
    G -->|Cluster| H[Cluster Issues]
    G -->|Node| I[Node Issues]
    G -->|Pod| J[Pod Issues]
    G -->|Network| K[Network Issues]
    G -->|Auth| L[Auth Issues]
    H --> M[Resolve]
    I --> M
    J --> M
    K --> M
    L --> M
    style A fill:#e1f5ff
    style M fill:#e8f5e9

Common Cluster Issues

Cluster Not Accessible

Symptoms:

  • kubectl commands fail
  • “Unable to connect to the server” errors
  • API server timeout

Diagnosis:

# Check cluster status
aws eks describe-cluster --name my-cluster --region us-west-2

# Check cluster endpoint
aws eks describe-cluster --name my-cluster --query "cluster.endpoint"

# Test connectivity
curl -k https://<cluster-endpoint>/healthz

# Check kubeconfig
kubectl config view
kubectl config get-contexts

Common Causes:

  • Cluster endpoint access restricted (private endpoint)
  • Network connectivity issues
  • kubeconfig not configured
  • IAM authentication issues

Solutions:

# Update kubeconfig
aws eks update-kubeconfig --name my-cluster --region us-west-2

# Check endpoint access
aws eks describe-cluster --name my-cluster --query "cluster.resourcesVpcConfig.endpointPublicAccess"
aws eks describe-cluster --name my-cluster --query "cluster.resourcesVpcConfig.endpointPrivateAccess"

# Enable public endpoint if needed
aws eks update-cluster-config \
  --name my-cluster \
  --resources-vpc-config endpointPublicAccess=true,endpointPrivateAccess=true

Cluster Control Plane Issues

Symptoms:

  • Cluster status shows “UPDATING” or “FAILED”
  • Control plane components unhealthy
  • API server errors

Diagnosis:

# Check cluster status
aws eks describe-cluster --name my-cluster --query "cluster.status"

# Check cluster health
aws eks describe-cluster --name my-cluster --query "cluster.health"

# Check which control plane log types are enabled
aws eks describe-cluster --name my-cluster --query "cluster.logging"

Common Causes:

  • Control plane version incompatibility
  • IAM role issues
  • VPC configuration problems
  • Service quota limits

Solutions:

# Check IAM role
aws eks describe-cluster --name my-cluster --query "cluster.roleArn"

# Verify IAM role permissions
aws iam get-role --role-name eks-service-role

# Check service quotas
aws service-quotas get-service-quota \
  --service-code eks \
  --quota-code L-1194A341
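
If logging is disabled, there is nothing to review. A minimal sketch for enabling all five control plane log types, which are delivered to CloudWatch Logs under the standard /aws/eks/my-cluster/cluster log group:

# Enable all control plane log types
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'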

Node Connectivity Problems

Nodes Not Joining Cluster

Symptoms:

  • Nodes created but not appearing in kubectl get nodes
  • Nodes show “NotReady” status
  • Pods can’t be scheduled

Diagnosis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

# Check node logs (SSH to node)
journalctl -u kubelet
journalctl -u containerd

Common Causes:

  • IAM role not configured correctly
  • Security group rules blocking traffic
  • Bootstrap script issues
  • aws-auth ConfigMap missing node role

Solutions:

# Verify node IAM role
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --query "nodegroup.nodeRole"

# Check security group rules
aws ec2 describe-security-groups \
  --group-ids <node-security-group-id>

# Update aws-auth ConfigMap
kubectl edit configmap aws-auth -n kube-system

# Add node role mapping
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/eks-node-role \
  --username system:node:{{EC2PrivateDNSName}} \
  --group system:bootstrappers \
  --group system:nodes
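
Nodes also need to reach the API server through the EKS-managed cluster security group; if custom security groups are in play, confirm which one the cluster expects:

# Find the cluster security group nodes must be able to use
aws eks describe-cluster \
  --name my-cluster \
  --query "cluster.resourcesVpcConfig.clusterSecurityGroupId"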

Node NotReady Status

Symptoms:

  • Nodes show “NotReady” status
  • Pods can’t be scheduled on nodes
  • Node conditions show problems

Diagnosis:

# Describe node for details
kubectl describe node <node-name>

# Check node conditions
kubectl get node <node-name> -o json | jq '.status.conditions'

# Check kubelet status
kubectl get node <node-name> -o json | jq '.status.nodeInfo'

# Check node events
kubectl get events --field-selector involvedObject.name=<node-name>

Common Causes:

  • kubelet not running
  • Network connectivity issues
  • Resource pressure (memory, disk)
  • Container runtime issues

Solutions:

# Restart kubelet (SSH to node)
sudo systemctl restart kubelet

# Check kubelet logs
sudo journalctl -u kubelet -f

# Check disk space
df -h

# Check memory
free -h

# Check container runtime
sudo systemctl status containerd
sudo systemctl restart containerd

Node Resource Pressure

Symptoms:

  • Nodes show “MemoryPressure” or “DiskPressure”
  • Pods being evicted
  • Node tainted

Diagnosis:

# Check node conditions
kubectl describe node <node-name> | grep -A 5 Conditions

# Check resource usage
kubectl top node <node-name>

# Check pod resource usage
kubectl top pods --all-namespaces

# Check disk usage
kubectl get node <node-name> -o json | jq '.status.conditions[] | select(.type=="DiskPressure")'

Solutions:

# List images cached on the node (the kubelet garbage-collects unused images under disk pressure)
kubectl get node <node-name> -o json | jq '.status.images'

# Clean up completed pods
kubectl delete pods --all-namespaces --field-selector=status.phase==Succeeded

# Increase node disk size (volume size can't be changed in place;
# see the replacement node group sketch below)

# Add more nodes
eksctl scale nodegroup \
  --cluster my-cluster \
  --name general-workers \
  --nodes 5
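
A node group's root volume size can't be changed in place. A common pattern is to create a replacement node group with a larger volume, then drain and delete the old one. A sketch; the node group name is hypothetical:

# Create a replacement node group with a 100 GiB root volume
eksctl create nodegroup \
  --cluster my-cluster \
  --name general-workers-v2 \
  --node-volume-size 100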

Networking Troubleshooting

Pod Networking Issues

Symptoms:

  • Pods can’t communicate with each other
  • Pods can’t reach external services
  • DNS resolution failing

Diagnosis:

# Check pod network
kubectl get pods -o wide

# Test pod connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check VPC CNI (aws-node) pods
kubectl get pods -n kube-system -l k8s-app=aws-node

# Check VPC CNI logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100

# Check ENI allocation
kubectl get pods -n kube-system -l k8s-app=aws-node -o json | jq '.items[].status'

Common Causes:

  • VPC CNI not running
  • Insufficient IP addresses
  • Security group rules
  • Route table issues

Solutions:

# Restart VPC CNI (the DaemonSet is named aws-node)
kubectl rollout restart daemonset aws-node -n kube-system

# Check IP address availability
aws ec2 describe-network-interfaces \
  --filters "Name=subnet-id,Values=<subnet-id>" \
  --query "NetworkInterfaces[*].PrivateIpAddress"

# Verify security group rules
aws ec2 describe-security-groups \
  --group-ids <security-group-id>

# Check route tables
aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=<subnet-id>"

Load Balancer Issues

Symptoms:

  • LoadBalancer service stuck in “Pending”
  • Load balancer not accessible
  • Health checks failing

Diagnosis:

# Check service status
kubectl get svc <service-name>
kubectl describe svc <service-name>

# Check AWS Load Balancer Controller
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller

# Check controller logs
kubectl logs -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller --tail=100

# Check load balancer in AWS
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?contains(LoadBalancerName, '<service-name>')]"

Common Causes:

  • AWS Load Balancer Controller not installed
  • IAM permissions missing
  • Subnet tags incorrect
  • Security group rules

Solutions:

# Verify controller is running
kubectl get deployment aws-load-balancer-controller -n kube-system

# Check IAM policies attached to the controller role
aws iam list-attached-role-policies \
  --role-name aws-load-balancer-controller-role

# Verify subnet tags
aws ec2 describe-subnets \
  --subnet-ids <subnet-id> \
  --query "Subnets[*].Tags"

# Tag subnets if needed (use kubernetes.io/role/internal-elb for private subnets)
aws ec2 create-tags \
  --resources <subnet-id> \
  --tags Key=kubernetes.io/role/elb,Value=1
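
If the controller is missing entirely, the Helm chart is the usual install path. A minimal sketch, assuming the IRSA service account aws-load-balancer-controller already exists in kube-system:

# Install the AWS Load Balancer Controller via Helm
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller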

DNS Resolution Issues

Symptoms:

  • Pods can’t resolve service names
  • External DNS not working
  • CoreDNS errors

Diagnosis:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

# Test DNS resolution
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default.svc.cluster.local

# Check CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

Solutions:

# Restart CoreDNS
kubectl delete pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS service
kubectl get svc kube-dns -n kube-system

# Verify CoreDNS endpoints
kubectl get endpoints kube-dns -n kube-system

# Update CoreDNS configuration if needed
kubectl edit configmap coredns -n kube-system
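
On busy clusters, DNS failures are often load-related rather than configuration-related; scaling CoreDNS out is a quick mitigation (the replica count here is illustrative):

# Scale CoreDNS if queries are dropped under load
kubectl scale deployment coredns -n kube-system --replicas=4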

Authentication and Authorization Issues

kubectl Access Denied

Symptoms:

  • kubectl commands return “Forbidden” or “Unauthorized”
  • Can’t access cluster resources

Diagnosis:

# Check current user
aws sts get-caller-identity

# Test cluster access
kubectl auth can-i get pods

# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

# Check RBAC
kubectl get rolebindings,clusterrolebindings --all-namespaces

Common Causes:

  • IAM user/role not in aws-auth ConfigMap
  • RBAC permissions missing
  • Cluster endpoint access restricted

Solutions:

# Add IAM user to cluster
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:user/john \
  --username john \
  --group system:masters

# Add IAM role to cluster
eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn arn:aws:iam::123456789012:role/eks-admin \
  --username eks-admin \
  --group system:masters

# Create RBAC role
kubectl create role developer \
  --resource=pods,services \
  --verb=get,list,create,update,delete

# Bind role to user
kubectl create rolebinding developer-binding \
  --role=developer \
  --user=john
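
After adding mappings, verify what was actually written before re-testing access:

# List all IAM identity mappings in the cluster
eksctl get iamidentitymapping --cluster my-cluster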

IRSA Not Working

Symptoms:

  • Pods can’t assume IAM roles
  • AWS SDK calls fail
  • “Access Denied” errors in pods

Diagnosis:

# Check service account
kubectl get serviceaccount <sa-name> -o yaml

# Check OIDC provider
aws iam list-open-id-connect-providers

# Check IAM role trust policy
aws iam get-role --role-name <role-name> --query "Role.AssumeRolePolicyDocument"

# Test in pod (newer kubectl removed --serviceaccount from kubectl run; use an override;
# the aws-cli image's entrypoint is already "aws")
kubectl run -it --rm test --image=amazon/aws-cli:latest \
  --overrides='{"apiVersion": "v1", "spec": {"serviceAccountName": "<sa-name>"}}' \
  -- sts get-caller-identity

Common Causes:

  • OIDC provider not created
  • Service account annotation incorrect
  • IAM role trust policy wrong
  • Pod not using service account

Solutions:

# Create OIDC provider
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster \
  --approve

# Update service account annotation
kubectl annotate serviceaccount <sa-name> \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/<role-name>

# Verify trust policy
aws iam get-role --role-name <role-name> --query "Role.AssumeRolePolicyDocument"

# Restart pods to pick up new credentials
kubectl rollout restart deployment <deployment-name>
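
Alternatively, eksctl can create the IAM role, its trust policy, and the annotated service account in one step, which avoids most of the failure modes above. A sketch; the policy ARN below is only an example:

# Create the role, trust policy, and service account in one step
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace default \
  --name <sa-name> \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve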

Storage Problems

Volume Not Attaching

Symptoms:

  • Pod stuck in “Pending”
  • PVC not bound
  • Volume attachment errors

Diagnosis:

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check EBS CSI driver
kubectl get pods -n kube-system -l app=ebs-csi-controller

# Check EBS CSI logs
kubectl logs -n kube-system -l app=ebs-csi-controller --tail=100

Common Causes:

  • EBS CSI driver not installed
  • IAM permissions missing
  • Volume in different AZ than node
  • Storage class misconfiguration

Solutions:

# Verify EBS CSI driver
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver

# Check IAM role
aws eks describe-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --query "addon.serviceAccountRoleArn"

# Verify storage class
kubectl get storageclass
kubectl describe storageclass <sc-name>

# Check volume in AWS
aws ec2 describe-volumes \
  --filters "Name=tag:kubernetes.io/created-for/pvc/name,Values=<pvc-name>"

EFS Mount Issues

Symptoms:

  • EFS mounts timing out
  • Pods can’t access EFS
  • Permission denied errors

Diagnosis:

# Check EFS CSI driver
kubectl get pods -n kube-system -l app=efs-csi-controller

# Check EFS CSI logs
kubectl logs -n kube-system -l app=efs-csi-controller --tail=100

# Check mount targets
aws efs describe-mount-targets \
  --file-system-id <efs-id>

# Test connectivity to a mount target from a pod (NFS uses TCP 2049; a real mount
# requires a privileged pod with amazon-efs-utils installed)
kubectl run -it --rm test --image=nicolaka/netshoot --restart=Never -- nc -zv <efs-id>.efs.<region>.amazonaws.com 2049

Common Causes:

  • Mount targets not in all subnets
  • Security group rules blocking NFS
  • EFS CSI driver not installed
  • Network connectivity issues

Solutions:

# Create mount targets in all subnets
aws efs create-mount-target \
  --file-system-id <efs-id> \
  --subnet-id <subnet-id> \
  --security-groups <security-group-id>

# Verify security group rules
aws ec2 describe-security-groups \
  --group-ids <security-group-id> \
  --query "SecurityGroups[*].IpPermissions[?FromPort==\`2049\`]"

# Add NFS rule if missing
aws ec2 authorize-security-group-ingress \
  --group-id <security-group-id> \
  --protocol tcp \
  --port 2049 \
  --cidr 10.0.0.0/16

Performance Issues

Slow Pod Startup

Symptoms:

  • Pods take a long time to start
  • Image pull delays
  • Container startup slow

Diagnosis:

# Check pod events
kubectl describe pod <pod-name>

# Check image pull times
kubectl get events --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'

# Check node resources
kubectl top node <node-name>

# Check container runtime
kubectl get node <node-name> -o json | jq '.status.nodeInfo.containerRuntimeVersion'

Solutions:

# Use image pull secrets for private registries
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<pass>
# Then reference the secret from the pod spec via imagePullSecrets

# Pre-pull images on nodes (see the DaemonSet sketch below)
# Use node affinity to schedule on nodes that already have the images

# Optimize container images
# Use smaller base images
# Multi-stage builds
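
One way to pre-pull, as referenced above, is a DaemonSet whose init container pulls the image on every node and then idles on a pause container. A sketch with placeholder names:

# Pre-pull an image on every node via a DaemonSet
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: image-prepuller
  template:
    metadata:
      labels:
        name: image-prepuller
    spec:
      initContainers:
      - name: prepull
        image: <your-image>            # image to cache on each node
        command: ["sh", "-c", "true"]  # assumes the image ships a shell
      containers:
      - name: pause                    # keeps the pod resident so the image stays cached
        image: registry.k8s.io/pause:3.9
EOF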

High Resource Usage

Symptoms:

  • Nodes running out of resources
  • Pods being evicted
  • Performance degradation

Diagnosis:

# Check resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Check resource requests and limits
kubectl get pods -o json | jq '.items[] | {name: .metadata.name, requests: .spec.containers[].resources.requests, limits: .spec.containers[].resources.limits}'

# Check node capacity
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

Solutions:

# Right-size resource requests
kubectl set resources deployment <deployment-name> \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=512Mi

# Scale horizontally
kubectl scale deployment <deployment-name> --replicas=5

# Add more nodes
eksctl scale nodegroup \
  --cluster my-cluster \
  --name general-workers \
  --nodes 10

Cost Optimization Issues

Unexpected Costs

Symptoms:

  • Higher than expected AWS bills
  • Unused resources
  • Inefficient resource usage

Diagnosis:

# Check cluster resources
kubectl get nodes
kubectl get pods --all-namespaces

# Check EBS volumes
aws ec2 describe-volumes \
  --filters "Name=tag:kubernetes.io/cluster/<cluster-name>,Values=owned"

# Check load balancers
aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?contains(LoadBalancerName, '<cluster-name>')]"

# Use Cost Explorer
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

Solutions:

# Delete unused resources
kubectl delete pvc <unused-pvc>
kubectl delete svc <unused-service>

# Use spot instances
eksctl create nodegroup \
  --cluster my-cluster \
  --name spot-workers \
  --instance-types t3.medium,t3.large \
  --spot \
  --nodes 0 \
  --nodes-min 0 \
  --nodes-max 10

# Right-size instances
# Review and adjust instance types based on usage

# Enable cluster autoscaling to scale down during low-usage periods
# (see the Helm sketch below)
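
Cluster Autoscaler, referenced above, is commonly installed from its Helm chart. A minimal sketch, assuming the autoscaler's IAM permissions are already granted via IRSA and node groups carry the standard auto-discovery tags:

# Install Cluster Autoscaler with auto-discovery
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-west-2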

Debugging Tools and Techniques

Useful Commands

# Get comprehensive cluster info
kubectl cluster-info dump

# Get all resources
kubectl get all --all-namespaces

# Describe resource for details
kubectl describe <resource-type> <resource-name>

# Get events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Check API resources
kubectl api-resources

# Check API versions
kubectl api-versions

Log Collection

# Collect logs from every pod in a namespace (kubectl logs has no --all-namespaces flag)
for pod in $(kubectl get pods -n <namespace> -o name); do
  kubectl logs -n <namespace> "$pod" --all-containers >> all-logs.txt
done

# Collect node logs (SSH to node)
sudo journalctl -u kubelet > kubelet.log
sudo journalctl -u containerd > containerd.log

# Check which control plane log types are enabled (logs ship to CloudWatch Logs)
aws eks describe-cluster --name my-cluster --query "cluster.logging"

# Fetch control plane log streams from CloudWatch
aws logs describe-log-streams --log-group-name /aws/eks/my-cluster/cluster

Network Debugging

# Test connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never

# Inside debug pod:
# Test DNS
nslookup kubernetes.default.svc.cluster.local

# Test connectivity
curl http://<service-name>.<namespace>.svc.cluster.local

# Check routes
ip route

# Check network interfaces
ip addr

AWS Support and Resources

AWS Support

  • AWS Support Center - Create support cases
  • AWS Forums - Community support
  • AWS Documentation - Official documentation
  • AWS Premium Support - Enhanced support options

Getting Help

  1. Check Logs - Always start with logs
  2. Review Documentation - Check official docs
  3. Search Forums - Look for similar issues
  4. Create Support Case - For AWS-specific issues
  5. Community Forums - Ask in Kubernetes/EKS communities

Best Practices for Troubleshooting

  1. Start with Logs - Check pod, node, and cluster logs first

  2. Use Describe Commands - kubectl describe provides detailed information

  3. Check Events - Kubernetes events show what’s happening

  4. Verify Prerequisites - Ensure IAM, networking, and resources are correct

  5. Test Incrementally - Test changes one at a time

  6. Document Issues - Keep notes on issues and resolutions

  7. Use Debug Pods - Create debug pods for testing

  8. Check AWS Console - Verify AWS resources are created correctly

  9. Review Best Practices - Follow EKS best practices guides

  10. Stay Updated - Keep cluster and add-ons updated
