EKS Node Management
Node management on EKS involves provisioning, configuring, and maintaining worker nodes that run your applications. EKS offers two approaches: managed node groups (AWS handles lifecycle) and self-managed node groups (you handle lifecycle). Understanding both options helps you choose the right approach for your workload requirements.
Node Group Types
EKS supports two types of node groups:
Managed Node Groups
AWS manages the node lifecycle:
Benefits:
- Automatic node updates and patching
- Automatic node replacement for unhealthy nodes
- Integrated with EKS for optimal configuration
- Less operational overhead
- Automatic AMI updates
Limitations:
- Fewer customization options than self-managed nodes
- Custom AMIs and advanced settings require a custom launch template
- Less control over update timing
- Some settings (for example, the node IAM role and subnets) cannot be changed after creation
Self-Managed Node Groups
You manage the node lifecycle:
Benefits:
- Full control over node configuration
- Custom AMIs and launch templates
- Control over update timing
- More flexibility for special requirements
Limitations:
- You handle updates and patching
- You handle node replacement
- More operational overhead
- Requires more Kubernetes expertise
Managed Node Groups
Creating Managed Node Groups
Using eksctl:
# Simple node group
eksctl create nodegroup \
--cluster my-cluster \
--name general-workers \
--node-type t3.large \
--nodes 3 \
--nodes-min 1 \
--nodes-max 10
Using AWS CLI:
# Create node group
aws eks create-nodegroup \
--cluster-name my-cluster \
--nodegroup-name general-workers \
--node-role arn:aws:iam::123456789012:role/eks-node-role \
--subnets subnet-12345678 subnet-87654321 \
--instance-types t3.large \
--scaling-config minSize=1,maxSize=10,desiredSize=3 \
--disk-size 50
Using eksctl Config File:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
managedNodeGroups:
  - name: general-workers
    instanceType: t3.large
    minSize: 1
    maxSize: 10
    desiredCapacity: 3
    volumeSize: 50
    labels:
      role: general
      environment: production
    taints:
      - key: dedicated
        value: general
        effect: NoSchedule
    tags:
      Environment: Production
      Team: Platform
Node Group Configuration
Instance Types:
Choose based on workload requirements:
| Instance Family | Use Case | Example Types |
|---|---|---|
| General Purpose | Most workloads | t3, t4g, m5, m6i |
| Compute Optimized | CPU-intensive | c5, c6i |
| Memory Optimized | Memory-intensive | r5, r6i |
| Storage Optimized | High I/O | i3, i3en |
| Accelerated Computing | GPU workloads | p3, p4, g4dn |
Scaling Configuration:
# Update scaling configuration
aws eks update-nodegroup-config \
--cluster-name my-cluster \
--nodegroup-name general-workers \
--scaling-config minSize=2,maxSize=20,desiredSize=5
Labels and Taints:
# Node group with labels and taints
managedNodeGroups:
  - name: compute-workers
    instanceType: c5.xlarge
    labels:
      workload-type: compute
      zone: us-west-2a
    taints:
      - key: compute
        value: "true"
        effect: NoSchedule
Use Cases:
- Labels - Node selection for pod scheduling
- Taints - Prevent pods from scheduling (unless they have matching tolerations)
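As a sketch, a pod can target the labeled, tainted node group above by combining a nodeSelector with a matching toleration (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: compute-workload
spec:
  nodeSelector:
    workload-type: compute    # matches the node group label above
  tolerations:                # required because the node group is tainted
    - key: compute
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: app
      image: compute-app:latest   # placeholder image
```

The label steers the pod onto the node group; the toleration merely permits it. Without the nodeSelector, the pod could still land on untainted nodes.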
Updating Managed Node Groups
Update Node Group Version:
# Update to latest AMI version
aws eks update-nodegroup-version \
--cluster-name my-cluster \
--nodegroup-name general-workers
Update Configuration:
# Update scaling, labels, or taints
aws eks update-nodegroup-config \
--cluster-name my-cluster \
--nodegroup-name general-workers \
--scaling-config minSize=2,maxSize=15,desiredSize=5 \
--labels addOrUpdateLabels={Environment=Prod}
Update Process:
- New nodes created with the updated configuration
- Old nodes cordoned (no new pods scheduled)
- Old nodes drained (pods evicted and rescheduled onto new nodes)
- Old nodes terminated

The rolling replacement is designed to be zero-downtime, provided your workloads tolerate eviction; PodDisruptionBudgets are respected while draining.
Self-Managed Node Groups
Creating Self-Managed Node Groups
Step 1: Create Launch Template
# Get EKS-optimized AMI
AMI_ID=$(aws ssm get-parameter \
--name /aws/service/eks/optimized-ami/1.28/amazon-linux-2/recommended/image_id \
--query 'Parameter.Value' \
--output text)
# Create launch template
aws ec2 create-launch-template \
--launch-template-name eks-node-template \
--launch-template-data "{
\"ImageId\": \"$AMI_ID\",
\"InstanceType\": \"t3.large\",
\"IamInstanceProfile\": {
\"Arn\": \"arn:aws:iam::123456789012:instance-profile/eks-node-instance-profile\"
},
\"UserData\": \"$(base64 -w 0 userdata.sh)\",
\"TagSpecifications\": [{
\"ResourceType\": \"instance\",
\"Tags\": [
{\"Key\": \"Name\", \"Value\": \"eks-node\"},
{\"Key\": \"kubernetes.io/cluster/my-cluster\", \"Value\": \"owned\"}
]
}]
}"
Step 2: Create User Data Script
#!/bin/bash
# userdata.sh
# The EKS-optimized AMI ships with kubelet and containerd pre-installed;
# bootstrap.sh configures them and joins the node to the cluster
/etc/eks/bootstrap.sh my-cluster
# Override the kubelet configuration written by bootstrap.sh
cat > /etc/kubernetes/kubelet/kubelet-config.json <<EOF
{
"kind": "KubeletConfiguration",
"apiVersion": "kubelet.config.k8s.io/v1beta1",
"authentication": {
"anonymous": {
"enabled": false
},
"webhook": {
"enabled": true
},
"x509": {
"clientCAFile": "/etc/kubernetes/pki/ca.crt"
}
},
"authorization": {
"mode": "Webhook"
},
"clusterDomain": "cluster.local",
"clusterDNS": ["10.100.0.10"],
"containerRuntimeEndpoint": "unix:///run/containerd/containerd.sock",
"cgroupDriver": "systemd",
"cgroupRoot": "/",
"featureGates": {
"RotateKubeletServerCertificate": true
},
"healthzBindAddress": "127.0.0.1",
"healthzPort": 10248,
"httpCheckFrequency": "20s",
"imageMinimumGCAge": "2m",
"imageGCHighThresholdPercent": 85,
"imageGCLowThresholdPercent": 80,
"iptablesDropBit": 15,
"iptablesMasqueradeBit": 15,
"kubeAPIBurst": 10,
"kubeAPIQPS": 5,
"makeIPTablesUtilChains": true,
"maxOpenFiles": 1000000,
"maxPods": 110,
"podPidsLimit": -1,
"registryBurst": 10,
"registryPullQPS": 5,
"rotateCertificates": true,
"runtimeRequestTimeout": "2m",
"serializeImagePulls": true,
"serverTLSBootstrap": true,
"streamingConnectionIdleTimeout": "4h",
"syncFrequency": "1m",
"volumeStatsAggPeriod": "1m"
}
EOF
# Restart kubelet so the overridden configuration takes effect
systemctl restart kubelet
Step 3: Create Auto Scaling Group
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name eks-nodes \
--launch-template LaunchTemplateName=eks-node-template,Version='$Latest' \
--min-size 1 \
--max-size 10 \
--desired-capacity 3 \
--vpc-zone-identifier "subnet-12345678,subnet-87654321" \
--tags \
Key=Name,Value=eks-node,PropagateAtLaunch=true \
Key=kubernetes.io/cluster/my-cluster,Value=owned,PropagateAtLaunch=true
Updating Self-Managed Node Groups
Manual Rolling Update:
# 1. Create new launch template with updated AMI
aws ec2 create-launch-template-version \
--launch-template-name eks-node-template \
--source-version 1 \
--launch-template-data "{
\"ImageId\": \"NEW_AMI_ID\"
}"
# 2. Update Auto Scaling Group
aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name eks-nodes \
--launch-template LaunchTemplateName=eks-node-template,Version=2 \
--desired-capacity 3
# 3. Gradually replace nodes
# Increase desired capacity to create new nodes
# Wait for new nodes to be ready
# Decrease desired capacity to remove old nodes
Instance Types and Sizing
Choosing Instance Types
General Purpose (t3, m5, m6i):
- Web applications
- Microservices
- Development environments
- Most common workloads
Compute Optimized (c5, c6i):
- CPU-intensive workloads
- Batch processing
- High-performance computing
- Scientific computing
Memory Optimized (r5, r6i):
- In-memory databases
- Caching systems
- Analytics workloads
- Memory-intensive applications
Storage Optimized (i3, i3en):
- NoSQL databases
- Data warehousing
- High I/O workloads
- Big data processing
Right-Sizing Nodes
Considerations:
- Pod density requirements
- Resource requests and limits
- Instance type IP limits (for VPC CNI)
- Cost optimization
- Performance requirements
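The VPC CNI IP limit can be sketched with the formula the EKS-optimized AMI uses to derive its default max-pods value. The ENI and IP-per-ENI figures below are for t3.large and are illustrative; check the limits for the instance type you actually use:

```shell
# Sketch of the EKS max-pods formula for the VPC CNI:
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# One IP per ENI is the ENI's primary address and is not assignable to pods;
# the +2 accounts for host-network pods such as kube-proxy and aws-node.
enis=3            # t3.large supports up to 3 ENIs (assumed here)
ips_per_eni=12    # 12 IPv4 addresses per ENI for t3.large (assumed here)

max_pods=$(( enis * (ips_per_eni - 1) + 2 ))
echo "max pods: $max_pods"
```

For t3.large this works out to 35, which matches the default max-pods the EKS AMI configures for that type.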
Example Calculation:
Required CPU: 100 pods × 0.5 CPU = 50 CPU
Required Memory: 100 pods × 2 Gi = 200 Gi
Options:
- 10 × t3.large (2 vCPU, 8 Gi) = 20 vCPU, 80 Gi (insufficient)
- 10 × t3.xlarge (4 vCPU, 16 Gi) = 40 vCPU, 160 Gi (insufficient)
- 10 × m5.xlarge (4 vCPU, 16 Gi) = 40 vCPU, 160 Gi (insufficient)
- 10 × m5.2xlarge (8 vCPU, 32 Gi) = 80 vCPU, 320 Gi (sufficient)
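The arithmetic above can be sketched in shell. This is illustrative only: it ignores kubelet and system reserves (allocatable capacity is less than instance capacity), which is one reason to provision more nodes than the raw numbers suggest:

```shell
# Per-pod requests from the example: 0.5 CPU (500m) and 2 Gi
pods=100
cpu_per_pod_m=500
mem_per_pod_gi=2

need_cpu_m=$(( pods * cpu_per_pod_m ))    # 50000m = 50 vCPU
need_mem_gi=$(( pods * mem_per_pod_gi ))  # 200 Gi

# m5.2xlarge: 8 vCPU (8000m), 32 Gi
node_cpu_m=8000
node_mem_gi=32

# Ceiling division on each axis, then take whichever axis needs more nodes
nodes_cpu=$(( (need_cpu_m + node_cpu_m - 1) / node_cpu_m ))
nodes_mem=$(( (need_mem_gi + node_mem_gi - 1) / node_mem_gi ))
nodes=$(( nodes_cpu > nodes_mem ? nodes_cpu : nodes_mem ))
echo "minimum m5.2xlarge nodes: $nodes"
```

The raw minimum comes out to 7 nodes; the 10-node option in the example leaves headroom for system reserves, daemonsets, and node failure.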
Auto Scaling Groups Integration
Managed Node Groups
Auto Scaling Groups are automatically created and managed:
Scaling Triggers:
- Cluster Autoscaler detects unschedulable pods
- Increases desired capacity
- ASG launches new instances
- Nodes join cluster
Self-Managed Node Groups
You manage the Auto Scaling Group:
# Configure scaling policies
aws autoscaling put-scaling-policy \
--auto-scaling-group-name eks-nodes \
--policy-name scale-up \
--scaling-adjustment 2 \
--adjustment-type ChangeInCapacity \
--cooldown 300
Node Lifecycle Management
Node Health Checks
EKS automatically monitors node health:
# Check node status
kubectl get nodes
# Describe node for details
kubectl describe node ip-10-0-1-123.ec2.internal
# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
Node Conditions:
- Ready - Node is healthy and ready for pods
- MemoryPressure - Node has memory pressure
- DiskPressure - Node has disk pressure
- PIDPressure - Node has process ID pressure
Node Replacement
Managed Node Groups:
- Automatic replacement for unhealthy nodes
- Configurable via node group settings
Self-Managed Node Groups:
- Manual replacement required
- Use Auto Scaling Group health checks
- Implement custom replacement logic
Node Draining
Safely remove nodes from cluster:
# Cordon node (prevent new pods)
kubectl cordon ip-10-0-1-123.ec2.internal
# Drain node (evict pods)
kubectl drain ip-10-0-1-123.ec2.internal \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=300
# Delete node
kubectl delete node ip-10-0-1-123.ec2.internal
Spot Instances and Cost Optimization
Spot Instances
Use spot instances for cost savings (up to 90% discount):
managedNodeGroups:
  - name: spot-workers
    instanceTypes:
      - t3.medium
      - t3.large
      - t3.xlarge
    spot: true
    minSize: 0
    maxSize: 20
    desiredCapacity: 5
    labels:
      capacity-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
Spot Instance Best Practices:
- Use for fault-tolerant workloads
- Use multiple instance types
- Set appropriate interruption handling
- Use spot for non-critical workloads
- Combine with on-demand for reliability
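Combining Spot with On-Demand can be sketched in an eksctl config; the node group names, instance types, and sizes below are illustrative:

```yaml
managedNodeGroups:
  # Baseline On-Demand capacity: critical workloads land here by default
  - name: base-on-demand
    instanceType: m5.large
    minSize: 2
    maxSize: 4
    desiredCapacity: 2
  # Spot overflow pool: only pods tolerating the taint schedule here
  - name: overflow-spot
    instanceTypes: [m5.large, m5a.large, m4.large]  # diversify to reduce interruptions
    spot: true
    minSize: 0
    maxSize: 20
    desiredCapacity: 0
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
```

The taint keeps critical workloads off Spot capacity unless they opt in with a toleration, while the Cluster Autoscaler can grow the Spot pool for burst traffic.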
Cost Optimization Strategies
Right-Size Instances - Match instance size to workload requirements
Use Spot Instances - For fault-tolerant workloads
Reserved Instances - For predictable workloads
Auto Scaling - Scale down during low usage
Instance Scheduling - Stop non-production clusters during off-hours
Graviton Instances - Use ARM-based instances for compatible workloads
Monitor Costs - Use Cost Explorer and tags
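As a rough illustration of the Spot strategy, the blended cost of a mixed fleet can be sketched as below. The prices are made-up placeholders, not real AWS pricing:

```shell
# Hypothetical hourly prices in cents: On-Demand at 10, Spot at a 70% discount
od_cents=10
spot_cents=3

# Fleet of 10 nodes: 3 On-Demand baseline + 7 Spot overflow
mixed=$(( 3 * od_cents + 7 * spot_cents ))   # blended cents/hour
all_od=$(( 10 * od_cents ))                  # all-On-Demand cents/hour

savings_pct=$(( 100 * (all_od - mixed) / all_od ))
echo "blended saving: ${savings_pct}%"
```

With these placeholder numbers the mixed fleet costs about half as much as an all-On-Demand fleet, while the On-Demand baseline preserves capacity during Spot interruptions.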
Node Taints and Tolerations
Taints
Prevent pods from scheduling on nodes:
# Node group with taint
managedNodeGroups:
  - name: gpu-workers
    instanceType: p3.2xlarge
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
Tolerations
Allow pods to schedule on tainted nodes:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: app
      image: gpu-app:latest
Node Group Tagging and Organization
Tags
Organize and track node groups:
# Tag node group
aws eks tag-resource \
--resource-arn arn:aws:eks:us-west-2:123456789012:nodegroup/my-cluster/general-workers \
--tags Environment=Production,Team=Platform,CostCenter=Engineering
Recommended Tags:
- Environment (dev, staging, prod)
- Team/Department
- Cost Center
- Application
- Managed By
Multiple Node Groups
Organize workloads with multiple node groups:
managedNodeGroups:
  - name: general-workers
    instanceType: t3.large
    labels:
      workload-type: general
  - name: compute-workers
    instanceType: c5.xlarge
    labels:
      workload-type: compute
    taints:
      - key: compute
        value: "true"
        effect: NoSchedule
  - name: memory-workers
    instanceType: r5.xlarge
    labels:
      workload-type: memory
    taints:
      - key: memory
        value: "true"
        effect: NoSchedule
Best Practices
Use Managed Node Groups - Unless you need custom configurations
Right-Size Instances - Match instance size to workload requirements
Use Spot Instances - For fault-tolerant, cost-sensitive workloads
Set Appropriate Scaling Limits - Prevent cost overruns
Use Labels and Taints - Organize workloads and node selection
Monitor Node Health - Set up alerts for node issues
Plan for Updates - Schedule node group updates during maintenance windows
Tag Everything - For cost allocation and resource management
Test Updates - Test node group updates in non-production first
Use Multiple Node Groups - Separate workloads by requirements
Common Issues
Nodes Not Joining Cluster
Problem: Nodes created but not joining cluster
Solutions:
- Check IAM role permissions
- Verify security group rules
- Check bootstrap script
- Review CloudWatch logs
- Verify aws-auth ConfigMap
Node Group Update Fails
Problem: Node group update stuck or fails
Solutions:
- Check node group status
- Review update logs
- Verify AMI availability
- Check instance capacity
- Review Auto Scaling Group
High Node Churn
Problem: Nodes frequently being replaced
Solutions:
- Check node health conditions
- Review resource pressure
- Verify instance stability
- Check for spot interruptions
- Review Auto Scaling Group configuration
See Also
- Cluster Setup - Initial node group creation
- Autoscaling - Automatic node scaling
- Troubleshooting - Node issues