AKS Node Management
Node management on AKS covers provisioning, configuring, and maintaining the worker nodes that run your applications. AKS manages nodes through node pools backed by Virtual Machine Scale Sets. Understanding node pool management helps you optimize both cost and performance.
Node Pool Overview
Node pools are groups of nodes with the same configuration. AKS clusters can have multiple node pools for different workload requirements.
Creating Node Pools
Using Azure CLI
Basic Node Pool:
# Create node pool
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name general-pool \
--node-count 3 \
--node-vm-size Standard_DS2_v2
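After creation, you can verify the pool and its settings:
# Verify the new pool
az aks nodepool list \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--output table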
Advanced Node Pool:
# Create a user-mode pool with autoscaling, a taint, and a label (mode User runs application workloads; mode System is reserved for system pods)
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name compute-pool \
--node-count 3 \
--node-vm-size Standard_DS4_v2 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10 \
--node-taints compute=true:NoSchedule \
--labels workload-type=compute \
--mode User \
--os-type Linux
Node Pool Configuration
Virtual Machine Sizes:
Choose based on workload requirements:
| VM Series | Use Case | Example Sizes |
|---|---|---|
| Standard_B | Burstable | Standard_B2s, Standard_B4ms |
| Standard_D / Standard_DS | General purpose | Standard_D2s_v3, Standard_DS2_v2 |
| Standard_E | Memory optimized | Standard_E2s_v3, Standard_E4s_v3 |
| Standard_F | Compute optimized | Standard_F2s_v2, Standard_F4s_v2 |
Scaling Configuration:
# Enable auto-scaling
az aks nodepool update \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name general-pool \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10
Labels and Taints:
# Create node pool with labels and taints
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name compute-pool \
--node-count 3 \
--node-vm-size Standard_DS4_v2 \
--node-taints compute=true:NoSchedule \
--labels workload-type=compute zone=eastus-1
Use Cases:
- Labels - Select nodes for pod scheduling via nodeSelector or node affinity
- Taints - Prevent pods from scheduling unless they carry a matching toleration (see the pod sketch below)
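As a minimal sketch of how the two combine, the hypothetical pod below uses a nodeSelector for the workload-type=compute label and a toleration for the compute=true:NoSchedule taint created above (the pod name and image are placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: compute-workload            # hypothetical name
spec:
  nodeSelector:
    workload-type: compute          # matches the pool's label
  tolerations:
  - key: compute                    # matches the pool's taint
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: nginx:1.25               # placeholder image
The nodeSelector steers the pod to the pool; the toleration merely permits it to land there, so both are needed for dedicated pools.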
Updating Node Pools
Update Node Pool Version:
# Upgrade the node pool to a specific Kubernetes version
az aks nodepool upgrade \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name general-pool \
--kubernetes-version 1.28.0
Update Configuration:
# Update autoscaler limits on a pool where the autoscaler is already enabled
az aks nodepool update \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name general-pool \
--update-cluster-autoscaler \
--min-count 2 \
--max-count 15
Update Process:
- New nodes are created with the updated configuration
- Old nodes are cordoned (no new pods scheduled)
- Old nodes are drained (pods rescheduled onto the new nodes)
- Old nodes are terminated
- The result is a zero-downtime update, provided workloads run multiple replicas and define PodDisruptionBudgets
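To watch an update roll through a pool, you can poll the pool's provisioning state and the nodes themselves (AKS labels every node with its pool name under the agentpool key):
# Check the pool's provisioning state (e.g. Upgrading, Succeeded)
az aks nodepool show \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name general-pool \
--query provisioningState -o tsv
# Watch nodes and their Kubernetes versions during the rollout
kubectl get nodes -l agentpool=general-pool -o wide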
Virtual Machine Sizes and Sizing
Choosing VM Sizes
General Purpose (Standard_DS, Standard_D):
- Web applications
- Microservices
- Development environments
- Most common workloads
Compute Optimized (Standard_F):
- CPU-intensive workloads
- Batch processing
- High-performance computing
- Scientific computing
Memory Optimized (Standard_E):
- In-memory databases
- Caching systems
- Analytics workloads
- Memory-intensive applications
Burstable (Standard_B):
- Development/testing
- Low baseline, burstable performance
- Cost-effective for variable workloads
Right-Sizing Nodes
Considerations:
- Pod density requirements
- Resource requests and limits
- Per-node pod limits (configurable with --max-pods when the pool is created)
- Cost optimization
- Performance requirements
Example Calculation:
Required CPU: 100 pods × 0.5 vCPU = 50 vCPU
Required Memory: 100 pods × 2 GiB = 200 GiB
Options:
- 10 × Standard_DS2_v2 (2 vCPU, 7 GiB) = 20 vCPU, 70 GiB (insufficient)
- 10 × Standard_DS3_v2 (4 vCPU, 14 GiB) = 40 vCPU, 140 GiB (insufficient)
- 10 × Standard_DS4_v2 (8 vCPU, 28 GiB) = 80 vCPU, 280 GiB (sufficient)
Remember that allocatable capacity is lower than the raw VM size, because AKS reserves CPU and memory for the kubelet and system daemons.
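To see this on a live cluster, compare each node's raw capacity with what is actually allocatable to pods; a quick check with kubectl's custom columns:
# Compare raw capacity with the resources actually allocatable to pods
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU_CAPACITY:.status.capacity.cpu,\
CPU_ALLOCATABLE:.status.allocatable.cpu,\
MEMORY_ALLOCATABLE:.status.allocatable.memory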
Windows Node Pools
AKS supports Windows Server containers with Windows node pools:
Creating Windows Node Pool
# Create Windows node pool (Windows pool names are limited to 6 characters)
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name win22 \
--node-count 3 \
--node-vm-size Standard_DS2_v2 \
--os-type Windows \
--os-sku Windows2022
Windows Node Pool Features:
- Full Windows Server container support
- Mixed workloads: Linux and Windows pools in the same cluster (the system pool is always Linux)
- Choice of Windows Server 2019 or 2022 via --os-sku
- Requires a cluster configured with Windows admin credentials and Azure CNI networking
Windows Container Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: windows-app
  template:
    metadata:
      labels:
        app: windows-app
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: app
        image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022
        ports:
        - containerPort: 80
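Before deploying, it's worth confirming the Windows nodes are registered and Ready:
# List only the Windows nodes in the cluster
kubectl get nodes -l kubernetes.io/os=windows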
Node Lifecycle Management
Auto-Upgrade
Automatic upgrades keep nodes patched without manual intervention. The upgrade channel is configured at the cluster level:
# Enable automatic patch-version upgrades for the cluster
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--auto-upgrade-channel patch
Auto-Upgrade Channels:
- none - No automatic upgrades
- patch - Automatic patch upgrades within the current minor version
- stable - Latest stable minor version
- rapid - Latest available version
- node-image - Node image upgrades only
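Upgrades can also be confined to planned maintenance windows. A minimal sketch using the cluster's default maintenance configuration (the weekday and hour are arbitrary examples; check your CLI version supports the az aks maintenanceconfiguration group):
# Allow planned maintenance only on Saturdays starting at 01:00
az aks maintenanceconfiguration add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name default \
--weekday Saturday \
--start-hour 1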
Node Health Checks
AKS automatically monitors node health:
# Check node status
kubectl get nodes
# Describe node for details
kubectl describe node aks-general-pool-12345678-vmss000000
# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
Node Conditions:
- Ready - Node is healthy and ready for pods
- MemoryPressure - Node has memory pressure
- DiskPressure - Node has disk pressure
- PIDPressure - Node has process ID pressure
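Building on the jq query above, this one-liner narrows the output to nodes whose Ready condition is not True, which is handy in scripts:
# List nodes that are not Ready
kubectl get nodes -o json | jq -r '
  .items[]
  | select(.status.conditions[] | select(.type == "Ready" and .status != "True"))
  | .metadata.name'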
Node Replacement
Manual Node Replacement:
# Cordon node (prevent new pods)
kubectl cordon aks-general-pool-12345678-vmss000000
# Drain node (evict pods)
kubectl drain aks-general-pool-12345678-vmss000000 \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=300
# Delete the node object (then remove or reimage the underlying scale set instance so a fresh node is provisioned)
kubectl delete node aks-general-pool-12345678-vmss000000
Automatic Node Replacement:
- Auto-upgrade handles node replacement
- Virtual Machine Scale Sets handle unhealthy instances
- Automatic repair for failed nodes
Spot VMs and Cost Optimization
Spot VMs
Use Spot VMs for cost savings (up to 90% discount):
# Create node pool with Spot VMs (-1 caps the max price at the current on-demand rate)
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name spot-pool \
--node-count 3 \
--node-vm-size Standard_DS2_v2 \
--priority Spot \
--eviction-policy Delete \
--spot-max-price -1 \
--enable-cluster-autoscaler \
--min-count 0 \
--max-count 10
AKS automatically applies the kubernetes.azure.com/scalesetpriority=spot label and a matching NoSchedule taint to Spot pools, so you don't set them yourself; only pods that tolerate the taint are scheduled onto these nodes.
Spot VM Best Practices:
- Use for fault-tolerant workloads
- Set appropriate tolerations on workloads (see the deployment sketch after this list)
- Use auto-scaling for availability
- Combine with on-demand for reliability
- Handle interruptions gracefully
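A minimal sketch of a fault-tolerant deployment that opts in to the Spot pool: it tolerates the automatically applied Spot taint and prefers, but does not require, Spot nodes, so pods can fall back to on-demand capacity during evictions. The name and image are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker                # hypothetical workload name
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values: ["spot"]
      containers:
      - name: worker
        image: myregistry.azurecr.io/batch-worker:latest   # placeholder image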
Cost Optimization Strategies
- Right-Size VM Sizes - Match VM size to workload requirements
- Use Spot VMs - For fault-tolerant workloads
- Use Azure Reserved Instances - For predictable, steady-state workloads
- Auto-Scaling - Scale down during periods of low usage
- Use Burstable VMs - For development and testing
- Monitor Costs - Use Azure Cost Management
Node Taints and Tolerations
Taints
Prevent pods from scheduling on nodes:
# Create node pool with taint
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name gpu-pool \
--node-count 2 \
--node-vm-size Standard_NC6s_v3 \
--node-taints nvidia.com/gpu=true:NoSchedule
Tolerations
Tolerations allow pods to schedule on tainted nodes. Note that a toleration only permits scheduling there; pair it with a nodeSelector or node affinity if the pod must run on those nodes:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest
Node Pool Organization
Multiple Node Pools
Organize workloads with multiple node pools:
# General purpose pool
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name general-pool \
--node-count 3 \
--node-vm-size Standard_DS2_v2 \
--labels workload-type=general
# Compute pool
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name compute-pool \
--node-count 2 \
--node-vm-size Standard_DS4_v2 \
--labels workload-type=compute \
--node-taints compute=true:NoSchedule
# Windows pool (Windows pool names are limited to 6 characters)
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name winpol \
--node-count 2 \
--node-vm-size Standard_DS2_v2 \
--os-type Windows \
--labels workload-type=windows
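To confirm the pools and their labels once the nodes register, kubectl can print label columns (the agentpool label is added by AKS automatically):
# Show each node's pool membership and workload-type label
kubectl get nodes -L agentpool -L workload-type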
Best Practices
- Use Multiple Node Pools - Separate workloads by requirements
- Right-Size VM Sizes - Match VM size to workload requirements
- Use Spot VMs - For fault-tolerant, cost-sensitive workloads
- Set Appropriate Scaling Limits - Prevent cost overruns
- Use Labels and Taints - Organize workloads and steer scheduling
- Monitor Node Health - Set up alerts for node issues
- Enable Auto-Upgrade - Automatic security patches
- Tag Resources - For cost allocation and resource management
- Test Updates - Test node pool updates in non-production first
- Use Windows Node Pools - For Windows container workloads
Common Issues
Nodes Not Joining Cluster
Problem: Nodes created but not joining cluster
Solutions:
- Check service principal permissions
- Verify Network Security Group rules
- Check Virtual Network configuration
- Review Azure Activity Log
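One way to dig deeper is to inspect the underlying scale set instances in the managed node resource group (the VMSS name below is a placeholder; list the group to find it):
# Find the managed node resource group
NODE_RG=$(az aks show \
--resource-group myResourceGroup \
--name myAKSCluster \
--query nodeResourceGroup -o tsv)
# List scale sets and their instances' provisioning state
az vmss list --resource-group "$NODE_RG" -o table
az vmss list-instances --resource-group "$NODE_RG" \
--name <vmss-name> -o table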
Node Pool Update Fails
Problem: Node pool update stuck or fails
Solutions:
- Check node pool status
- Review update logs
- Verify VM size availability
- Check subscription quotas
- Review Azure Activity Log
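Quota exhaustion is a frequent culprit; you can check regional vCPU usage against your limits (eastus is just an example region):
# Show vCPU quota usage for a region
az vm list-usage --location eastus -o table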
High Node Churn
Problem: Nodes frequently being replaced
Solutions:
- Check node health conditions
- Review resource pressure
- Verify VM stability
- Check for Spot VM interruptions
- Review auto-upgrade configuration
See Also
- Cluster Setup - Initial node pool creation
- Autoscaling - Automatic node scaling
- Troubleshooting - Node issues