AKS Node Management

Node management on AKS involves provisioning, configuring, and maintaining the worker nodes that run your applications. AKS organizes nodes into node pools, each backed by a Virtual Machine Scale Set. Understanding node pool management helps you optimize both cost and performance.

Node Pool Overview

Node pools are groups of nodes with the same configuration. AKS clusters can have multiple node pools for different workload requirements.

graph TB
    AKS[AKS Cluster] --> NP1[Node Pool 1<br/>Linux, General Purpose]
    AKS --> NP2[Node Pool 2<br/>Linux, Compute Optimized]
    AKS --> NP3[Node Pool 3<br/>Windows, General Purpose]
    NP1 --> VMSS1[Virtual Machine Scale Set]
    NP2 --> VMSS2[Virtual Machine Scale Set]
    NP3 --> VMSS3[Virtual Machine Scale Set]
    VMSS1 --> VMs1[Virtual Machines]
    VMSS2 --> VMs2[Virtual Machines]
    VMSS3 --> VMs3[Virtual Machines]
    style AKS fill:#e1f5ff
    style NP1 fill:#fff4e1
    style NP2 fill:#e8f5e9
    style NP3 fill:#f3e5f5

Creating Node Pools

Using Azure CLI

Basic Node Pool:

# Create node pool (names must be lowercase alphanumeric, up to 12 characters for Linux pools)
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --node-count 3 \
  --node-vm-size Standard_DS2_v2
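
Once provisioning completes, you can confirm the pool and its nodes; AKS labels each node with the name of its pool:

# Inspect the pool
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool

# List only this pool's nodes via the AKS-applied label
kubectl get nodes -l kubernetes.azure.com/agentpool=generalpool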

Advanced Node Pool:

# Create node pool with custom configuration
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name computepool \
  --node-count 3 \
  --node-vm-size Standard_DS4_v2 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10 \
  --node-taints compute=true:NoSchedule \
  --node-labels workload-type=compute \
  --mode User \
  --os-type Linux

Node Pool Configuration

Virtual Machine Sizes:

Choose based on workload requirements:

| VM Series | Use Case | Example Sizes |
|-----------|----------|---------------|
| Standard_D/DS | General purpose | Standard_DS2_v2, Standard_D4s_v3 |
| Standard_B | Burstable | Standard_B2s, Standard_B4ms |
| Standard_F | Compute optimized | Standard_F2s_v2, Standard_F4s_v2 |
| Standard_E | Memory optimized | Standard_E2s_v3, Standard_E4s_v3 |

Scaling Configuration:

# Enable auto-scaling
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10
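
To scale a pool to a fixed count manually, disable the autoscaler first, since manual scaling conflicts with a pool the autoscaler owns:

# Disable the autoscaler, then set a fixed node count
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --disable-cluster-autoscaler

az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --node-count 5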

Labels and Taints:

# Create node pool with labels and taints
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name computepool \
  --node-count 3 \
  --node-vm-size Standard_DS4_v2 \
  --node-taints compute=true:NoSchedule \
  --node-labels workload-type=compute,zone=eastus-1

Use Cases:

  • Labels - Node selection for pod scheduling
  • Taints - Prevent pods from scheduling (unless they have matching tolerations)
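
For example, a pod aimed at the compute pool created above needs a nodeSelector for the label and a toleration for the taint (the image name is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: compute-workload
spec:
  nodeSelector:
    workload-type: compute     # matches --node-labels
  tolerations:
  - key: compute               # matches --node-taints compute=true:NoSchedule
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: compute-app:latest  # placeholder image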

Updating Node Pools

Update Node Pool Version:

# Upgrade node pool to a specific Kubernetes version
# (must not be newer than the control plane version)
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --kubernetes-version 1.28.0

Update Configuration:

# Update scaling, labels, or taints
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 15

Update Process:

graph LR
    A[Start Update] --> B[Create New Nodes]
    B --> C[Cordon Old Nodes]
    C --> D[Drain Old Nodes]
    D --> E[Terminate Old Nodes]
    E --> F[Update Complete]
    style A fill:#e1f5ff
    style F fill:#e8f5e9
  • New nodes created with updated configuration
  • Old nodes cordoned (no new pods)
  • Old nodes drained (pods moved to new nodes)
  • Old nodes terminated
  • Near-zero downtime when workloads run multiple replicas with PodDisruptionBudgets
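
You can watch an upgrade from both sides:

# Pool provisioning state (e.g. Upgrading, Succeeded)
az aks nodepool show \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --query provisioningState --output tsv

# Node versions change as old nodes are replaced
kubectl get nodes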

Virtual Machine Sizes and Sizing

Choosing VM Sizes

General Purpose (Standard_DS, Standard_D):

  • Web applications
  • Microservices
  • Development environments
  • Most common workloads

Compute Optimized (Standard_F):

  • CPU-intensive workloads
  • Batch processing
  • High-performance computing
  • Scientific computing

Memory Optimized (Standard_E):

  • In-memory databases
  • Caching systems
  • Analytics workloads
  • Memory-intensive applications

Burstable (Standard_B):

  • Development/testing
  • Low baseline, burstable performance
  • Cost-effective for variable workloads

Right-Sizing Nodes

Considerations:

  • Pod density requirements
  • Resource requests and limits
  • VM size pod limits
  • Cost optimization
  • Performance requirements

Example Calculation:

Required CPU: 100 pods × 0.5 CPU = 50 CPU
Required Memory: 100 pods × 2 Gi = 200 Gi

Options:
- 10 × Standard_DS2_v2 (2 vCPU, 7 Gi) = 20 vCPU, 70 Gi (insufficient)
- 10 × Standard_DS3_v2 (4 vCPU, 14 Gi) = 40 vCPU, 140 Gi (insufficient)
- 10 × Standard_DS4_v2 (8 vCPU, 28 Gi) = 80 vCPU, 280 Gi (sufficient)
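
The per-pod figures in this calculation correspond to resource requests like the following (a minimal sketch; names are placeholders). Note that AKS reserves part of each node's CPU and memory for the kubelet and system daemons, so usable capacity is somewhat below the raw VM numbers:

apiVersion: v1
kind: Pod
metadata:
  name: sized-workload
spec:
  containers:
  - name: app
    image: my-app:latest    # placeholder image
    resources:
      requests:
        cpu: 500m           # 0.5 CPU per pod, as in the calculation
        memory: 2Gi         # 2 Gi per pod
      limits:
        cpu: "1"
        memory: 2Gi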

Windows Node Pools

AKS supports Windows Server containers with Windows node pools:

Creating Windows Node Pool

# Create Windows node pool (Windows pool names are limited to 6 characters)
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name npwin \
  --node-count 3 \
  --node-vm-size Standard_DS2_v2 \
  --os-type Windows \
  --os-sku Windows2022

Windows Node Pool Features:

  • Full Windows Server container support
  • Windows-specific optimizations
  • Mixed workloads (Linux + Windows)
  • Windows-specific VM sizes

Windows Container Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: windows-app
  template:
    metadata:
      labels:
        app: windows-app
    spec:
      nodeSelector:
        kubernetes.io/os: windows
      containers:
      - name: app
        image: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022
        ports:
        - containerPort: 80

Node Lifecycle Management

Auto-Upgrade

Auto-upgrade keeps nodes current with security patches. The upgrade channel is configured at the cluster level, not per node pool:

# Enable auto-upgrade for the cluster
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --auto-upgrade-channel patch

Auto-Upgrade Channels:

  • none - No automatic upgrades
  • patch - Automatic patch upgrades within the current minor version
  • stable - Latest supported patch of minor version N-1, where N is the newest supported minor
  • rapid - Latest supported patch of the newest supported minor version
  • node-image - Node image upgrades only
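
Upgrades triggered by these channels can be confined to a planned maintenance window. A minimal sketch of the basic weekly window (AKS also supports dedicated schedules, such as aksManagedAutoUpgradeSchedule, for the auto-upgrade channel):

# Allow planned maintenance only on Mondays starting at 01:00
az aks maintenanceconfiguration add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name default \
  --weekday Monday \
  --start-hour 1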

Node Health Checks

AKS automatically monitors node health:

# Check node status
kubectl get nodes

# Describe node for details
kubectl describe node aks-generalpool-12345678-vmss000000

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

Node Conditions:

  • Ready - Node is healthy and ready for pods
  • MemoryPressure - Node has memory pressure
  • DiskPressure - Node has disk pressure
  • PIDPressure - Node has process ID pressure
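
A quick way to scan the Ready condition across every node:

# Print each node name with its Ready condition status
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'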

Node Replacement

Manual Node Replacement:

# Cordon node (prevent new pods)
kubectl cordon aks-generalpool-12345678-vmss000000

# Drain node (evict pods)
kubectl drain aks-generalpool-12345678-vmss000000 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Delete node (the scale set replaces it)
kubectl delete node aks-generalpool-12345678-vmss000000
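
kubectl drain honors PodDisruptionBudgets, so workloads that must stay available through a drain should define one. A minimal sketch for a hypothetical app labeled app=my-app:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2            # keep at least 2 pods running during evictions
  selector:
    matchLabels:
      app: my-app            # placeholder label for your workload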

Automatic Node Replacement:

  • Auto-upgrade handles node replacement
  • Virtual Machine Scale Sets handle unhealthy instances
  • Automatic repair for failed nodes

Spot VMs and Cost Optimization

Spot VMs

Spot VMs run on spare Azure capacity at steep discounts (up to 90% versus pay-as-you-go prices) in exchange for possible eviction:

# Create node pool with Spot VMs
# --spot-max-price -1 caps the price at the current on-demand rate;
# AKS applies the scalesetpriority label and taint to Spot pools automatically
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name spotpool \
  --node-count 3 \
  --node-vm-size Standard_DS2_v2 \
  --priority Spot \
  --eviction-policy Delete \
  --spot-max-price -1 \
  --enable-cluster-autoscaler \
  --min-count 0 \
  --max-count 10 \
  --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule \
  --node-labels kubernetes.azure.com/scalesetpriority=spot

Spot VM Best Practices:

  • Use for fault-tolerant workloads
  • Set appropriate taints and tolerations
  • Use auto-scaling for availability
  • Combine with on-demand for reliability
  • Handle interruptions gracefully
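
Putting the first two practices together: a deployment opts in to the spot pool by tolerating the taint set above, and can prefer (without requiring) spot nodes via the matching label. A sketch with placeholder app names:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      tolerations:
      - key: kubernetes.azure.com/scalesetpriority
        operator: Equal
        value: spot
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values:
                - spot
      containers:
      - name: worker
        image: batch-worker:latest   # placeholder image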

Cost Optimization Strategies

  1. Right-Size VM Sizes - Match VM size to workload requirements

  2. Use Spot VMs - For fault-tolerant workloads

  3. Use Azure Reserved Instances - For predictable workloads

  4. Auto-Scaling - Scale down during low usage

  5. Use Burstable VMs - For development/testing

  6. Monitor Costs - Use Azure Cost Management

Node Taints and Tolerations

Taints

Prevent pods from scheduling on nodes:

# Create node pool with taint
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name gpupool \
  --node-count 2 \
  --node-vm-size Standard_NC6s_v3 \
  --node-taints nvidia.com/gpu=true:NoSchedule

Tolerations

Tolerations allow pods to schedule on tainted nodes. A toleration permits placement but does not force it; pair it with a nodeSelector or node affinity to actually target the pool:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest

Node Pool Organization

Multiple Node Pools

Organize workloads with multiple node pools:

# General purpose pool
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name generalpool \
  --node-count 3 \
  --node-vm-size Standard_DS2_v2 \
  --node-labels workload-type=general

# Compute pool
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name computepool \
  --node-count 2 \
  --node-vm-size Standard_DS4_v2 \
  --node-labels workload-type=compute \
  --node-taints compute=true:NoSchedule

# Windows pool (name limited to 6 characters)
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name npwin \
  --node-count 2 \
  --node-vm-size Standard_DS2_v2 \
  --os-type Windows \
  --node-labels workload-type=windows
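
List the pools to confirm the layout:

# Show all node pools in the cluster
az aks nodepool list \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --output table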

Best Practices

  1. Use Multiple Node Pools - Separate workloads by requirements

  2. Right-Size VM Sizes - Match VM size to workload requirements

  3. Use Spot VMs - For fault-tolerant, cost-sensitive workloads

  4. Set Appropriate Scaling Limits - Prevent cost overruns

  5. Use Labels and Taints - Organize workloads and node selection

  6. Monitor Node Health - Set up alerts for node issues

  7. Enable Auto-Upgrade - Automatic security patches

  8. Tag Resources - For cost allocation and resource management

  9. Test Updates - Test node pool updates in non-production first

  10. Use Windows Node Pools - For Windows container workloads

Common Issues

Nodes Not Joining Cluster

Problem: Nodes created but not joining cluster

Solutions:

  • Check the cluster identity's permissions (managed identity or service principal)
  • Verify Network Security Group rules
  • Check Virtual Network configuration
  • Review Azure Activity Log

Node Pool Update Fails

Problem: Node pool update stuck or fails

Solutions:

  • Check node pool status
  • Review update logs
  • Verify VM size availability
  • Check subscription quotas
  • Review Azure Activity Log

High Node Churn

Problem: Nodes frequently being replaced

Solutions:

  • Check node health conditions
  • Review resource pressure
  • Verify VM stability
  • Check for Spot VM interruptions
  • Review auto-upgrade configuration

See Also