EKS Node Management

Node management on EKS involves provisioning, configuring, and maintaining worker nodes that run your applications. EKS offers two approaches: managed node groups (AWS handles lifecycle) and self-managed node groups (you handle lifecycle). Understanding both options helps you choose the right approach for your workload requirements.

Node Group Types

EKS supports two types of node groups:

graph TB
  subgraph managed[Managed Node Groups]
    MNG[Managed Node Group] --> ASG1[Auto Scaling Group<br/>AWS Managed]
    MNG --> LT1[Launch Template<br/>AWS Managed]
    MNG --> UPDATE1[Automatic Updates]
    MNG --> REPLACE1[Automatic Replacement]
  end
  subgraph self_managed[Self-Managed Node Groups]
    SMNG[Self-Managed Node Group] --> ASG2[Auto Scaling Group<br/>You Manage]
    SMNG --> LT2[Launch Template<br/>You Manage]
    SMNG --> UPDATE2[Manual Updates]
    SMNG --> REPLACE2[Manual Replacement]
  end
  style MNG fill:#e1f5ff
  style SMNG fill:#fff4e1

Managed Node Groups

AWS manages the node lifecycle:

Benefits:

  • Orchestrated node updates and patching (AWS performs the rolling replacement)
  • Automatic node replacement for unhealthy nodes
  • Integrated with EKS for optimal configuration
  • Less operational overhead
  • One-command AMI version upgrades

Limitations:

  • Limited customization compared to self-managed nodes
  • Custom AMIs require a launch template, and you then own the bootstrap configuration
  • Less control over update timing and rollout behavior
  • The underlying Auto Scaling Group should not be modified directly

Self-Managed Node Groups

You manage the node lifecycle:

Benefits:

  • Full control over node configuration
  • Custom AMIs and launch templates
  • Control over update timing
  • More flexibility for special requirements

Limitations:

  • You handle updates and patching
  • You handle node replacement
  • More operational overhead
  • Requires more Kubernetes expertise

Managed Node Groups

Creating Managed Node Groups

Using eksctl:

# Simple node group
eksctl create nodegroup \
  --cluster my-cluster \
  --name general-workers \
  --node-type t3.large \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 10

Using AWS CLI:

# Create node group
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --node-role arn:aws:iam::123456789012:role/eks-node-role \
  --subnets subnet-12345678 subnet-87654321 \
  --instance-types t3.large \
  --scaling-config minSize=1,maxSize=10,desiredSize=3 \
  --disk-size 50

Using eksctl Config File:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-west-2

managedNodeGroups:
  - name: general-workers
    instanceType: t3.large
    minSize: 1
    maxSize: 10
    desiredCapacity: 3
    volumeSize: 50
    labels:
      role: general
      environment: production
    taints:
      - key: dedicated
        value: general
        effect: NoSchedule
    tags:
      Environment: Production
      Team: Platform

Node Group Configuration

Instance Types:

Choose based on workload requirements:

Instance Family         Use Case                 Example Types
General Purpose         Most workloads           t3, t4g, m5, m6i
Compute Optimized       CPU-intensive            c5, c6i
Memory Optimized        Memory-intensive         r5, r6i
Storage Optimized       High I/O                 i3, i3en
Accelerated Computing   GPU workloads            p3, p4, g4dn
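
To compare candidate types, the EC2 API reports vCPU and memory directly. A quick sketch using the AWS CLI (the instance types listed are just examples):

# Compare vCPU count and memory for candidate instance types
aws ec2 describe-instance-types \
  --instance-types t3.large m5.xlarge c5.xlarge r5.xlarge \
  --query 'InstanceTypes[].[InstanceType, VCpuInfo.DefaultVCpus, MemoryInfo.SizeInMiB]' \
  --output table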

Scaling Configuration:

# Update scaling configuration
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --scaling-config minSize=2,maxSize=20,desiredSize=5

Labels and Taints:

# Node group with labels and taints
managedNodeGroups:
  - name: compute-workers
    instanceType: c5.xlarge
    labels:
      workload-type: compute
      zone: us-west-2a
    taints:
      - key: compute
        value: "true"
        effect: NoSchedule

Use Cases:

  • Labels - Node selection for pod scheduling (via nodeSelector or affinity)
  • Taints - Prevent pods from scheduling unless they carry a matching toleration (see the pod example below)
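
A minimal pod that targets the compute-workers group above: the nodeSelector matches the node label and the toleration matches the taint. A sketch (the pod name and image are illustrative):

# Pod that lands only on compute-workers nodes
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: compute-job
spec:
  nodeSelector:
    workload-type: compute
  tolerations:
  - key: compute
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
EOF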

Updating Managed Node Groups

Update Node Group Version:

# Update to latest AMI version
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name general-workers

Update Configuration:

# Update scaling, labels, or taints
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --scaling-config minSize=2,maxSize=15,desiredSize=5 \
  --labels addOrUpdateLabels={Environment=Prod}

Update Process:

graph LR
  A[Start Update] --> B[Create New Nodes]
  B --> C[Cordon Old Nodes]
  C --> D[Drain Old Nodes]
  D --> E[Terminate Old Nodes]
  E --> F[Update Complete]
  style A fill:#e1f5ff
  style F fill:#e8f5e9

  • New nodes created with updated configuration
  • Old nodes cordoned (no new pods)
  • Old nodes drained (pods moved to new nodes)
  • Old nodes terminated
  • Zero-downtime update (progress can be monitored as shown below)
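
You can watch an in-flight update through the EKS update APIs. A sketch (the update ID is illustrative and comes from the list-updates output):

# List updates for the node group
aws eks list-updates \
  --name my-cluster \
  --nodegroup-name general-workers

# Inspect a specific update
aws eks describe-update \
  --name my-cluster \
  --nodegroup-name general-workers \
  --update-id 11111111-2222-3333-4444-555555555555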

Self-Managed Node Groups

Creating Self-Managed Node Groups

Step 1: Create Launch Template

# Get EKS-optimized AMI
AMI_ID=$(aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.28/amazon-linux-2/recommended/image_id \
  --query 'Parameter.Value' \
  --output text)

# Create launch template
aws ec2 create-launch-template \
  --launch-template-name eks-node-template \
  --launch-template-data "{
    \"ImageId\": \"$AMI_ID\",
    \"InstanceType\": \"t3.large\",
    \"IamInstanceProfile\": {
      \"Arn\": \"arn:aws:iam::123456789012:instance-profile/eks-node-instance-profile\"
    },
    \"UserData\": \"$(base64 -w 0 userdata.sh)\",
    \"TagSpecifications\": [{
      \"ResourceType\": \"instance\",
      \"Tags\": [
        {\"Key\": \"Name\", \"Value\": \"eks-node\"},
        {\"Key\": \"kubernetes.io/cluster/my-cluster\", \"Value\": \"owned\"}
      ]
    }]
  }"

Step 2: Create User Data Script

#!/bin/bash
# userdata.sh

# Join the cluster (the EKS-optimized AMI ships with kubelet and containerd
# preinstalled; bootstrap.sh configures and starts them)
/etc/eks/bootstrap.sh my-cluster

# Override the kubelet configuration written by bootstrap.sh
cat > /etc/kubernetes/kubelet/kubelet-config.json <<EOF
{
  "kind": "KubeletConfiguration",
  "apiVersion": "kubelet.config.k8s.io/v1beta1",
  "authentication": {
    "anonymous": {
      "enabled": false
    },
    "webhook": {
      "enabled": true
    },
    "x509": {
      "clientCAFile": "/etc/kubernetes/pki/ca.crt"
    }
  },
  "authorization": {
    "mode": "Webhook"
  },
  "clusterDomain": "cluster.local",
  "clusterDNS": ["10.100.0.10"],
  "containerRuntimeEndpoint": "unix:///run/containerd/containerd.sock",
  "cgroupDriver": "systemd",
  "cgroupRoot": "/",
  "featureGates": {
    "RotateKubeletServerCertificate": true
  },
  "healthzBindAddress": "127.0.0.1",
  "healthzPort": 10248,
  "httpCheckFrequency": "20s",
  "imageMinimumGCAge": "2m",
  "imageGCHighThresholdPercent": 85,
  "imageGCLowThresholdPercent": 80,
  "iptablesDropBit": 15,
  "iptablesMasqueradeBit": 15,
  "kubeAPIBurst": 10,
  "kubeAPIQPS": 5,
  "makeIPTablesUtilChains": true,
  "maxOpenFiles": 1000000,
  "maxPods": 110,
  "podPidsLimit": -1,
  "registryBurst": 10,
  "registryPullQPS": 5,
  "rotateCertificates": true,
  "runtimeRequestTimeout": "2m",
  "serializeImagePulls": true,
  "serverTLSBootstrap": true,
  "streamingConnectionIdleTimeout": "4h",
  "syncFrequency": "1m",
  "volumeStatsAggPeriod": "1m"
}
EOF

# Restart kubelet so the overridden configuration takes effect
systemctl restart kubelet

Step 3: Create Auto Scaling Group

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name eks-nodes \
  --launch-template LaunchTemplateName=eks-node-template,Version='$Latest' \
  --min-size 1 \
  --max-size 10 \
  --desired-capacity 3 \
  --vpc-zone-identifier "subnet-12345678,subnet-87654321" \
  --tags \
    Key=Name,Value=eks-node,PropagateAtLaunch=true \
    Key=kubernetes.io/cluster/my-cluster,Value=owned,PropagateAtLaunch=true

Updating Self-Managed Node Groups

Manual Rolling Update:

# 1. Create new launch template with updated AMI
aws ec2 create-launch-template-version \
  --launch-template-name eks-node-template \
  --source-version 1 \
  --launch-template-data "{
    \"ImageId\": \"NEW_AMI_ID\"
  }"

# 2. Update Auto Scaling Group
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name eks-nodes \
  --launch-template LaunchTemplateName=eks-node-template,Version=2 \
  --desired-capacity 3

# 3. Gradually replace nodes
# Increase desired capacity to create new nodes
# Wait for new nodes to be ready
# Decrease desired capacity to remove old nodes
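
Alternatively, an Auto Scaling group instance refresh can automate the gradual replacement. A sketch with illustrative preferences; note that an instance refresh terminates instances without draining them, so cordon and drain nodes (or run a termination handler) before relying on it in production:

# Roll every instance onto the new launch template version
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name eks-nodes \
  --preferences MinHealthyPercentage=90,InstanceWarmup=300

# Watch the refresh progress
aws autoscaling describe-instance-refreshes \
  --auto-scaling-group-name eks-nodes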

Instance Types and Sizing

Choosing Instance Types

General Purpose (t3, m5, m6i):

  • Web applications
  • Microservices
  • Development environments
  • Most common workloads

Compute Optimized (c5, c6i):

  • CPU-intensive workloads
  • Batch processing
  • High-performance computing
  • Scientific computing

Memory Optimized (r5, r6i):

  • In-memory databases
  • Caching systems
  • Analytics workloads
  • Memory-intensive applications

Storage Optimized (i3, i3en):

  • NoSQL databases
  • Data warehousing
  • High I/O workloads
  • Big data processing

Right-Sizing Nodes

Considerations:

  • Pod density requirements
  • Resource requests and limits
  • Instance type IP limits (for VPC CNI)
  • Cost optimization
  • Performance requirements

Example Calculation:

Required CPU: 100 pods × 0.5 CPU = 50 CPU
Required Memory: 100 pods × 2 Gi = 200 Gi

Options:
- 10 × t3.large (2 vCPU, 8 Gi) = 20 vCPU, 80 Gi (insufficient)
- 10 × t3.xlarge (4 vCPU, 16 Gi) = 40 vCPU, 160 Gi (insufficient)
- 10 × m5.xlarge (4 vCPU, 16 Gi) = 40 vCPU, 160 Gi (insufficient)
- 10 × m5.2xlarge (8 vCPU, 32 Gi) = 80 vCPU, 320 Gi (sufficient)
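
Two caveats sharpen this estimate: a node's full capacity is not allocatable (kubelet, system daemons, and kube-reserved overhead take a slice), and with default VPC CNI settings the schedulable pod count per node is capped at ENIs × (IPv4 addresses per ENI − 1) + 2. A sketch that pulls the formula's inputs from the EC2 API (the instance type is illustrative):

# Max pods formula inputs for the VPC CNI:
# ENIs x (IPv4 addresses per ENI - 1) + 2
aws ec2 describe-instance-types \
  --instance-types m5.2xlarge \
  --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces, Ipv4AddressesPerInterface]' \
  --output text
# m5.2xlarge: 4 ENIs x (15 - 1) + 2 = 58 pods per node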

Auto Scaling Groups Integration

Managed Node Groups

Auto Scaling Groups are automatically created and managed:

graph TB
  MNG[Managed Node Group] --> ASG[Auto Scaling Group]
  ASG --> LT[Launch Template]
  ASG --> EC2[EC2 Instances]
  SCALER[Cluster Autoscaler] --> ASG
  ASG --> SCALE_UP[Scale Up]
  ASG --> SCALE_DOWN[Scale Down]
  style MNG fill:#e1f5ff
  style ASG fill:#fff4e1
  style SCALER fill:#e8f5e9

Scaling Triggers:

  • Cluster Autoscaler detects unschedulable pods
  • Increases desired capacity
  • ASG launches new instances
  • Nodes join cluster
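
For this loop to work, the Cluster Autoscaler must be able to discover the Auto Scaling Group, which it does through two well-known tags. Managed node groups apply them automatically; for a self-managed group, a sketch:

# Tag the ASG for Cluster Autoscaler auto-discovery
aws autoscaling create-or-update-tags \
  --tags \
    ResourceId=eks-nodes,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=false \
    ResourceId=eks-nodes,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=false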

Self-Managed Node Groups

You manage the Auto Scaling Group:

# Configure scaling policies
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name eks-nodes \
  --policy-name scale-up \
  --scaling-adjustment 2 \
  --adjustment-type ChangeInCapacity \
  --cooldown 300

Node Lifecycle Management

Node Health Checks

EKS automatically monitors node health:

# Check node status
kubectl get nodes

# Describe node for details
kubectl describe node ip-10-0-1-123.ec2.internal

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

Node Conditions:

  • Ready - Node is healthy and ready for pods
  • MemoryPressure - Node has memory pressure
  • DiskPressure - Node has disk pressure
  • PIDPressure - Node has process ID pressure

Node Replacement

Managed Node Groups:

  • Automatic replacement for unhealthy nodes
  • Configurable via node group settings

Self-Managed Node Groups:

  • Manual replacement required
  • Use Auto Scaling Group health checks
  • Implement custom replacement logic (see the sketch below)
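
One simple pattern for custom replacement: drain the node, then mark the instance unhealthy so the Auto Scaling Group replaces it. A sketch (the node name and instance ID are illustrative):

# Drain the node first, then let the ASG replace the instance
kubectl drain ip-10-0-1-123.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data
aws autoscaling set-instance-health \
  --instance-id i-0123456789abcdef0 \
  --health-status Unhealthy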

Node Draining

Safely remove nodes from cluster:

# Cordon node (prevent new pods)
kubectl cordon ip-10-0-1-123.ec2.internal

# Drain node (evict pods)
kubectl drain ip-10-0-1-123.ec2.internal \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Delete node
kubectl delete node ip-10-0-1-123.ec2.internal

Spot Instances and Cost Optimization

Spot Instances

Use spot instances for cost savings (up to 90% discount):

managedNodeGroups:
  - name: spot-workers
    instanceTypes:
      - t3.medium
      - t3.large
      - t3.xlarge
    spot: true
    minSize: 0
    maxSize: 20
    desiredCapacity: 5
    labels:
      capacity-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule

Spot Instance Best Practices:

  • Use for fault-tolerant workloads
  • Use multiple instance types
  • Set appropriate interruption handling (see the sketch after this list)
  • Use spot for non-critical workloads
  • Combine with on-demand for reliability
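
For self-managed Spot nodes, interruption handling typically means running the AWS Node Termination Handler, which cordons and drains a node when EC2 signals a reclaim (managed Spot node groups handle interruptions natively). A sketch using its public Helm chart:

# Install the AWS Node Termination Handler
helm repo add eks https://aws.github.io/eks-charts
helm install aws-node-termination-handler \
  eks/aws-node-termination-handler \
  --namespace kube-system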

Cost Optimization Strategies

  1. Right-Size Instances - Match instance size to workload requirements

  2. Use Spot Instances - For fault-tolerant workloads

  3. Reserved Instances - For predictable workloads

  4. Auto Scaling - Scale down during low usage

  5. Instance Scheduling - Stop non-production clusters during off-hours

  6. Graviton Instances - Use ARM-based instances for compatible workloads

  7. Monitor Costs - Use Cost Explorer and cost allocation tags (see the sketch below)
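
As an example of item 7, once tags are activated as cost allocation tags, Cost Explorer can be queried from the CLI. A sketch (the dates and tag key are illustrative):

# Monthly unblended cost grouped by the Team tag
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=TAG,Key=Team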

Node Taints and Tolerations

Taints

Prevent pods from scheduling on nodes:

# Node group with taint
managedNodeGroups:
  - name: gpu-workers
    instanceType: p3.2xlarge
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule

Tolerations

Allow pods to schedule on tainted nodes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest

Node Group Tagging and Organization

Tags

Organize and track node groups:

# Tag node group
aws eks tag-resource \
  --resource-arn arn:aws:eks:us-west-2:123456789012:nodegroup/my-cluster/general-workers \
  --tags Environment=Production,Team=Platform,CostCenter=Engineering

Recommended Tags:

  • Environment (dev, staging, prod)
  • Team/Department
  • Cost Center
  • Application
  • Managed By

Multiple Node Groups

Organize workloads with multiple node groups:

managedNodeGroups:
  - name: general-workers
    instanceType: t3.large
    labels:
      workload-type: general
  
  - name: compute-workers
    instanceType: c5.xlarge
    labels:
      workload-type: compute
    taints:
      - key: compute
        value: "true"
        effect: NoSchedule
  
  - name: memory-workers
    instanceType: r5.xlarge
    labels:
      workload-type: memory
    taints:
      - key: memory
        value: "true"
        effect: NoSchedule

Best Practices

  1. Use Managed Node Groups - Unless you need custom configurations

  2. Right-Size Instances - Match instance size to workload requirements

  3. Use Spot Instances - For fault-tolerant, cost-sensitive workloads

  4. Set Appropriate Scaling Limits - Prevent cost overruns

  5. Use Labels and Taints - Organize workloads and node selection

  6. Monitor Node Health - Set up alerts for node issues

  7. Plan for Updates - Schedule node group updates during maintenance windows

  8. Tag Everything - For cost allocation and resource management

  9. Test Updates - Test node group updates in non-production first

  10. Use Multiple Node Groups - Separate workloads by requirements

Common Issues

Nodes Not Joining Cluster

Problem: Nodes created but not joining cluster

Solutions:

  • Check IAM role permissions
  • Verify security group rules
  • Check bootstrap script
  • Review CloudWatch logs
  • Verify the aws-auth ConfigMap (see the check below)
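
The aws-auth check is usually the first stop: a node can only join if its IAM role is mapped so the kubelet can authenticate. Managed node groups maintain this entry for you; for self-managed nodes you add it yourself. A sketch of the check (the role ARN is illustrative):

# Inspect the aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

# The node IAM role must appear under mapRoles, e.g.:
#   mapRoles: |
#     - rolearn: arn:aws:iam::123456789012:role/eks-node-role
#       username: system:node:{{EC2PrivateDNSName}}
#       groups:
#         - system:bootstrappers
#         - system:nodes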

Node Group Update Fails

Problem: Node group update stuck or fails

Solutions:

  • Check node group status and health (see the sketch below)
  • Review update logs
  • Verify AMI availability
  • Check instance capacity
  • Review Auto Scaling Group
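
The node group's health field usually surfaces the underlying error directly. A sketch:

# Surface node group health issues (codes, messages, affected resources)
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --query 'nodegroup.health.issues'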

High Node Churn

Problem: Nodes frequently being replaced

Solutions:

  • Check node health conditions
  • Review resource pressure
  • Verify instance stability
  • Check for spot interruptions
  • Review Auto Scaling Group configuration

See Also