AKS Troubleshooting

Troubleshooting AKS issues requires understanding cluster components, networking, authentication, and Azure service integration. This guide covers common issues, debugging techniques, and resolution strategies for AKS clusters.

Troubleshooting Approach

Systematic troubleshooting process:

graph TB
  A[Identify Issue] --> B{Gather Information}
  B --> C[Check Cluster Status]
  B --> D[Check Node Status]
  B --> E[Check Pod Status]
  B --> F[Check Logs]
  C --> G{Issue Type?}
  D --> G
  E --> G
  F --> G
  G -->|Cluster| H[Cluster Issues]
  G -->|Node| I[Node Issues]
  G -->|Pod| J[Pod Issues]
  G -->|Network| K[Network Issues]
  G -->|Auth| L[Auth Issues]
  H --> M[Resolve]
  I --> M
  J --> M
  K --> M
  L --> M
  style A fill:#e1f5ff
  style M fill:#e8f5e9

Common Cluster Issues

Cluster Not Accessible

Symptoms:

  • kubectl commands fail
  • “Unable to connect to the server” errors
  • API server timeout

Diagnosis:

# Check cluster status
az aks show --resource-group myResourceGroup --name myAKSCluster

# Check cluster endpoint
az aks show --resource-group myResourceGroup --name myAKSCluster --query "fqdn"

# Test connectivity
curl -k https://<cluster-endpoint>/healthz

# Check kubeconfig
kubectl config view
kubectl config get-contexts

Common Causes:

  • Cluster endpoint access restricted (private cluster)
  • Network connectivity issues
  • kubeconfig not configured
  • Azure AD authentication issues

Solutions:

# Update kubeconfig
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster

# Check endpoint access
az aks show --resource-group myResourceGroup --name myAKSCluster --query "apiServerAccessProfile"

# For private clusters, use VPN or authorized IP ranges
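
If the API server is public but restricted by authorized IP ranges, one option is to add your current public IP to the allowed list. A minimal sketch, assuming ifconfig.me to discover the IP; note that this command replaces the existing range list, so include any ranges already configured:

# Add your current public IP to the API server authorized ranges (replaces the existing list)
MY_IP=$(curl -s https://ifconfig.me)
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --api-server-authorized-ip-ranges "$MY_IP/32"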

Cluster Control Plane Issues

Symptoms:

  • Cluster status shows “Updating” or “Failed”
  • Control plane components unhealthy
  • API server errors

Diagnosis:

# Check cluster status
az aks show --resource-group myResourceGroup --name myAKSCluster --query "provisioningState"

# Check cluster health
az aks show --resource-group myResourceGroup --name myAKSCluster --query "powerState"

# Review cluster logs
az monitor activity-log list \
  --resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
  --max-events 50

Common Causes:

  • Cluster version incompatibility
  • Service principal issues
  • Virtual Network configuration problems
  • Subscription quota limits

Solutions:

# Check service principal
az aks show --resource-group myResourceGroup --name myAKSCluster --query "servicePrincipalProfile"

# Verify service principal permissions
az role assignment list --assignee <service-principal-id>

# Check subscription quotas
az vm list-usage --location eastus --output table
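
To narrow the quota check to the VM family used by your node pools, the usage output can be filtered; the family name below is only an example:

# Show quota usage for a specific VM family
az vm list-usage --location eastus \
  --query "[?contains(name.value, 'standardDSv3Family')].{name:name.value, current:currentValue, limit:limit}" \
  --output table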

Node Connectivity Problems

Nodes Not Joining Cluster

Symptoms:

  • Nodes created but not appearing in kubectl get nodes
  • Nodes show “NotReady” status
  • Pods can’t be scheduled

Diagnosis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

# Check node pool status
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name <node-pool-name>

Common Causes:

  • Service principal not configured correctly
  • Network Security Group rules blocking traffic
  • Virtual Network configuration issues
  • Node pool configuration problems

Solutions:

# Verify node pool configuration
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name default

# Check Network Security Group rules
az network nsg rule list \
  --resource-group myResourceGroup \
  --nsg-name <nsg-name>

# Check service principal permissions
az role assignment list --assignee <service-principal-id>
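
Because the entire node subnet shares the same NSG and route table, outbound connectivity can be tested from a pod on a healthy node; if this fails, newly created nodes in the same subnet will also fail to register. A sketch using the netshoot image (any image with curl and nslookup works):

# Test the subnet's outbound path to container registries and the API server
kubectl run -it --rm nettest --image=nicolaka/netshoot --restart=Never -- \
  sh -c 'nslookup mcr.microsoft.com; curl -skI https://mcr.microsoft.com/v2/ | head -n 1'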

Node NotReady Status

Symptoms:

  • Nodes show “NotReady” status
  • Pods can’t be scheduled on nodes
  • Node conditions show problems

Diagnosis:

# Describe node for details
kubectl describe node <node-name>

# Check node conditions
kubectl get node <node-name> -o json | jq '.status.conditions'

# Check node events
kubectl get events --field-selector involvedObject.name=<node-name>

Common Causes:

  • kubelet not running
  • Network connectivity issues
  • Resource pressure (memory, disk)
  • Container runtime issues

Solutions:

# Node auto-repair is enabled automatically on AKS and has no CLI flag;
# if a node stays NotReady, reimage the node pool with the latest node image
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name <node-pool-name> \
  --node-image-only

# Check node logs via Azure Portal
# Go to Virtual Machine Scale Set → Instances → Serial console
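
If the node image looks healthy, kubelet and the container runtime can be checked from a debugging shell on the node. A minimal sketch using kubectl debug; the ubuntu image is only an example, any image with a shell works:

# Open a debugging shell on the node (the host filesystem is mounted at /host)
kubectl debug node/<node-name> -it --image=ubuntu

# Inside the debug pod, switch to the host filesystem and inspect kubelet
chroot /host
systemctl status kubelet
journalctl -u kubelet --since "1 hour ago" | tail -n 100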

Networking Troubleshooting

Pod Networking Issues

Symptoms:

  • Pods can’t communicate with each other
  • Pods can’t reach external services
  • DNS resolution failing

Diagnosis:

# Check pod network
kubectl get pods -o wide

# Test pod connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

Common Causes:

  • CoreDNS not running
  • Insufficient IP addresses (Azure CNI)
  • Network Security Group rules
  • Network policy blocking traffic

Solutions:

# Restart CoreDNS
kubectl delete pods -n kube-system -l k8s-app=kube-dns

# Check subnet IP allocation
az network vnet subnet show \
  --resource-group myResourceGroup \
  --vnet-name myVNet \
  --name mySubnet \
  --query "addressPrefix"

# Verify Network Security Group rules
az network nsg rule list \
  --resource-group myResourceGroup \
  --nsg-name <nsg-name>
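
A network policy can also silently drop pod or DNS traffic. Listing policies and querying the kube-dns service IP directly helps isolate which layer is failing; a minimal sketch:

# List network policies that could be blocking traffic
kubectl get networkpolicy --all-namespaces

# Find the kube-dns service IP, then query it directly from a test pod
kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'
kubectl run -it --rm dnstest --image=busybox --restart=Never -- nslookup kubernetes.default <kube-dns-ip>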

Load Balancer Issues

Symptoms:

  • LoadBalancer service external IP stuck in “Pending”
  • Load balancer not accessible
  • Health checks failing

Diagnosis:

# Check service status
kubectl get svc <service-name>
kubectl describe svc <service-name>

# Check load balancer in Azure
az network lb list --resource-group <node-resource-group>

# Check backend pools
az network lb address-pool list \
  --resource-group <node-resource-group> \
  --lb-name <lb-name>

Common Causes:

  • Network Security Group rules blocking health checks
  • Subnet configuration incorrect
  • Subscription quota limits
  • Health probe configuration

Solutions:

# Check Network Security Group rules
az network nsg rule list \
  --resource-group <node-resource-group> \
  --nsg-name <nsg-name> \
  --query "[?direction=='Inbound' && access=='Allow']"

# Verify subnet configuration
az aks show --resource-group myResourceGroup --name myAKSCluster --query "agentPoolProfiles[].vnetSubnetId"

# Check subscription quotas
az vm list-usage --location eastus --output table
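
The AKS-managed load balancer’s probes and frontends can also be inspected directly. The load balancer in the node resource group is usually named kubernetes, but confirm with az network lb list first:

# Inspect health probes on the AKS-managed load balancer
az network lb probe list \
  --resource-group <node-resource-group> \
  --lb-name kubernetes \
  --output table

# Check frontend IP configurations
az network lb frontend-ip list \
  --resource-group <node-resource-group> \
  --lb-name kubernetes \
  --output table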

Authentication and Authorization Issues

kubectl Access Denied

Symptoms:

  • kubectl commands return “Forbidden” or “Unauthorized”
  • Can’t access cluster resources

Diagnosis:

# Check current user
az account show

# Test cluster access
kubectl auth can-i get pods

# Check Azure AD permissions (if enabled)
az aks show --resource-group myResourceGroup --name myAKSCluster --query "aadProfile"

Common Causes:

  • Azure AD user/group permissions missing
  • RBAC permissions missing
  • Cluster endpoint access restricted
  • Authentication issues

Solutions:

# Grant Azure AD permissions
az role assignment create \
  --assignee <user-or-group-id> \
  --role "Azure Kubernetes Service Cluster User Role" \
  --scope /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerService/managedClusters/myAKSCluster

# Create RBAC role
kubectl create role developer \
  --resource=pods,services \
  --verb=get,list,create,update,delete

# Bind role to user
kubectl create rolebinding developer-binding \
  --role=developer \
  --user=<user-email>
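
To confirm what the current identity is actually allowed to do, and to look up the object ID needed for the role assignment above, the following can help; the group name is a placeholder:

# List everything the current identity can do in a namespace
kubectl auth can-i --list --namespace default

# Look up the object ID of an Azure AD group to use as --assignee
az ad group show --group "<group-name>" --query id --output tsv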

Workload Identity Not Working

Symptoms:

  • Pods can’t assume Azure identities
  • Azure SDK calls fail
  • “Access Denied” errors in pods

Diagnosis:

# Check service account
kubectl get serviceaccount <sa-name> -o yaml

# Check Workload Identity configuration
az aks show --resource-group myResourceGroup --name myAKSCluster --query "oidcIssuerProfile"

# Check federated credential
az identity federated-credential show \
  --identity-name my-app-identity \
  --resource-group myResourceGroup \
  --name my-federated-credential

# Test in a pod (kubectl run's --serviceaccount flag was removed in kubectl 1.24+; use --overrides instead)
kubectl run -it --rm wi-test --image=mcr.microsoft.com/azure-cli:latest --restart=Never \
  --labels="azure.workload.identity/use=true" \
  --overrides='{"spec":{"serviceAccountName":"<sa-name>"}}' -- env   # look for the injected AZURE_* variables

Common Causes:

  • Workload Identity not enabled on the cluster
  • Service account annotation incorrect
  • Federated credential subject or issuer does not match
  • Pod not using the service account, or missing the azure.workload.identity/use: "true" label

Solutions:

# Enable Workload Identity
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-oidc-issuer \
  --enable-workload-identity

# Update service account annotation
kubectl annotate serviceaccount <sa-name> \
  azure.workload.identity/client-id=<identity-client-id>

# Verify federated credential
az identity federated-credential show \
  --identity-name my-app-identity \
  --resource-group myResourceGroup \
  --name my-federated-credential

# Restart pods to pick up new credentials
kubectl rollout restart deployment <deployment-name>
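
Workload Identity also requires the pod itself to opt in with the azure.workload.identity/use label; without it the webhook injects nothing. A minimal sketch of a pod that uses the annotated service account (names are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: wi-demo
  labels:
    azure.workload.identity/use: "true"   # required for credential injection
spec:
  serviceAccountName: <sa-name>           # the annotated service account
  containers:
    - name: app
      image: mcr.microsoft.com/azure-cli:latest
      command: ["sleep", "3600"]
EOF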

Storage Problems

Volume Not Attaching

Symptoms:

  • Pod stuck in “Pending”
  • PVC not bound
  • Volume attachment errors

Diagnosis:

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check Azure Disk in Azure
az disk list --resource-group <node-resource-group> --query "[?contains(name, '<pvc-name>')]"

Common Causes:

  • Azure Disk CSI driver not running
  • Service principal permissions missing
  • Volume in different zone than node
  • Storage class misconfiguration

Solutions:

# Check Azure Disk CSI driver
kubectl get pods -n kube-system -l app=csi-azuredisk-controller

# Verify service principal permissions
az role assignment list --assignee <service-principal-id> --query "[?roleDefinitionName=='Contributor']"

# Check storage class
kubectl get storageclass
kubectl describe storageclass <sc-name>
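
If the failure is a zone mismatch, a storage class with volumeBindingMode: WaitForFirstConsumer delays disk provisioning until the pod is scheduled, so the disk is created in the node’s zone. A sketch; the class name and SKU are examples:

# Check pending attach operations
kubectl get volumeattachment

# Zone-aware storage class: the disk is provisioned where the pod lands
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-csi-wffc
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
EOF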

Debugging Tools and Techniques

Useful Commands

# Get comprehensive cluster info
kubectl cluster-info dump

# Get all resources
kubectl get all --all-namespaces

# Describe resource for details
kubectl describe <resource-type> <resource-name>

# Get events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Check API resources
kubectl api-resources

Log Collection

# Dump cluster state, including pod logs from all namespaces
kubectl cluster-info dump --all-namespaces --output-directory=./cluster-dump

# Collect node logs via Azure Portal
# Go to Virtual Machine Scale Set → Instances → Serial console

# Review cluster management operations in the activity log
az monitor activity-log list \
  --resource-id /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
  --max-events 100
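
Control plane component logs (kube-apiserver, kube-scheduler, and so on) are only emitted if diagnostic settings route them to a destination such as a Log Analytics workspace. A sketch, with the resource IDs left as placeholders:

# Send AKS control plane logs to a Log Analytics workspace
az monitor diagnostic-settings create \
  --name aks-control-plane-logs \
  --resource /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
  --workspace <log-analytics-workspace-resource-id> \
  --logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-scheduler","enabled":true}]'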

Network Debugging

# Test connectivity
kubectl run -it --rm debug --image=nicolaka/netshoot --restart=Never

# Inside debug pod:
# Test DNS
nslookup kubernetes.default.svc.cluster.local

# Test connectivity
curl http://<service-name>.<namespace>.svc.cluster.local

# Check routes
ip route

# Check network interfaces
ip addr

Azure Support and Resources

Azure Support

  • Azure Support Center - Create support cases
  • Azure Forums - Community support
  • Azure Documentation - Official documentation
  • Azure Premium Support - Enhanced support options

Best Practices for Troubleshooting

  1. Start with Logs - Check pod, node, and cluster logs first

  2. Use Describe Commands - kubectl describe provides detailed information

  3. Check Events - Kubernetes events show what’s happening

  4. Verify Prerequisites - Ensure service principal, networking, and resources are correct

  5. Test Incrementally - Test changes one at a time

  6. Document Issues - Keep notes on issues and resolutions

  7. Use Debug Pods - Create debug pods for testing

  8. Check Azure Portal - Verify Azure resources are created correctly

  9. Review Best Practices - Follow AKS best practices guides

  10. Stay Updated - Keep cluster and add-ons updated

See Also