GKE Node Management
Node management on GKE involves provisioning, configuring, and maintaining worker nodes that run your applications. In Standard mode, you manage node pools and nodes. In Autopilot mode, Google Cloud manages nodes automatically. Understanding node pool management helps you optimize costs and performance.
Node Pool Overview
Node pools are groups of nodes within a cluster that share the same configuration (machine type, disk, labels, and taints), and each pool is scaled and upgraded as a unit. In Standard mode you create and manage node pools yourself; in Autopilot mode GKE provisions and manages them for you.
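To see the node pools that already exist in a cluster (including the default pool created with the cluster), you can list and describe them. The cluster name, zone, and pool name below are the placeholders used throughout this page:
# List node pools in a cluster
gcloud container node-pools list --cluster my-cluster --zone us-central1-a
# Show the full configuration of one pool
gcloud container node-pools describe default-pool --cluster my-cluster --zone us-central1-a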
Creating Node Pools
Using gcloud CLI
Basic Node Pool:
# Create node pool
gcloud container node-pools create general-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 3 \
--machine-type n1-standard-2
Advanced Node Pool:
# Create node pool with custom configuration
gcloud container node-pools create compute-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 3 \
--machine-type n1-standard-4 \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 10 \
--enable-autorepair \
--enable-autoupgrade \
--preemptible \
--disk-type pd-ssd \
--disk-size 50 \
--node-labels workload-type=compute \
--node-taints compute=true:NoSchedule
Node Pool Configuration
Machine Types:
Choose based on workload requirements:
| Machine Family | Use Case | Example Types |
|---|---|---|
| General Purpose | Most workloads | n1-standard, n2-standard, e2-standard |
| Compute Optimized / High CPU | CPU-intensive | c2-standard, n1-highcpu, n2-highcpu |
| Memory Optimized / High Memory | Memory-intensive | m1-megamem, n1-highmem, n2-highmem |
| Shared Core | Development | e2-micro, e2-small |
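Available machine types (and their vCPU/memory shapes) vary by zone. As a quick check before creating a pool, you can list them with Compute Engine; the zone below is simply the one used in the examples above:
# List machine types available in a zone (filtered to one family)
gcloud compute machine-types list --zones us-central1-a --filter="name~n1-standard"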
Scaling Configuration:
# Enable autoscaling on an existing node pool
gcloud container clusters update my-cluster \
--zone us-central1-a \
--node-pool general-pool \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 10
Labels and Taints:
# Create node pool with labels and taints
gcloud container node-pools create compute-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 3 \
--machine-type n1-standard-4 \
--node-labels workload-type=compute,zone=us-central1-a \
--node-taints compute=true:NoSchedule
Use Cases:
- Labels - Node selection for pod scheduling (via nodeSelector or node affinity)
- Taints - Prevent pods from scheduling unless they have matching tolerations (see the pod sketch below)
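Pods target a labeled, tainted pool by combining a nodeSelector (for the label) with a toleration (for the taint). A minimal sketch for the compute-pool created above; the image name is a placeholder:
apiVersion: v1
kind: Pod
metadata:
  name: compute-workload
spec:
  nodeSelector:
    workload-type: compute        # matches the --node-labels value
  tolerations:
  - key: compute                  # matches the --node-taints key
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: compute-app:latest     # placeholder image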
Updating Node Pools
Update Node Pool Version:
# Upgrade the node pool to the cluster's node version
gcloud container clusters upgrade my-cluster \
--zone us-central1-a \
--node-pool compute-pool
Update Configuration:
# Update labels or taints on an existing node pool
gcloud container node-pools update compute-pool \
--cluster my-cluster \
--zone us-central1-a \
--node-labels workload-type=compute \
--node-taints compute=true:NoSchedule
# Update autoscaling limits
gcloud container clusters update my-cluster \
--zone us-central1-a \
--node-pool compute-pool \
--enable-autoscaling \
--min-nodes 2 \
--max-nodes 15
Update Process:
- New nodes created with updated configuration
- Old nodes cordoned (no new pods)
- Old nodes drained (pods moved to new nodes)
- Old nodes terminated
- Disruption is minimized when workloads run multiple replicas and define PodDisruptionBudgets (see the example below)
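Whether pods move without user-visible disruption during an upgrade depends on the workload: run multiple replicas and define a PodDisruptionBudget so the drain step never evicts too many pods at once. A minimal sketch (the app label is a placeholder):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2                 # keep at least 2 pods running during node drains
  selector:
    matchLabels:
      app: web                    # placeholder label selecting the protected pods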
Machine Types and Sizing
Choosing Machine Types
General Purpose (n1-standard, n2-standard, e2-standard):
- Web applications
- Microservices
- Development environments
- Most common workloads
Compute Optimized / High CPU (c2-standard, n1-highcpu, n2-highcpu):
- CPU-intensive workloads
- Batch processing
- High-performance computing
- Scientific computing
Memory Optimized / High Memory (m1-megamem, n1-highmem, n2-highmem):
- In-memory databases
- Caching systems
- Analytics workloads
- Memory-intensive applications
Right-Sizing Nodes
Considerations:
- Pod density requirements
- Resource requests and limits
- Machine type pod limits
- Cost optimization
- Performance requirements
Example Calculation:
Required CPU: 100 pods × 0.5 CPU = 50 CPU
Required Memory: 100 pods × 2 Gi = 200 Gi
Options:
- 10 × n1-standard-2 (2 vCPU, 7.5 Gi) = 20 vCPU, 75 Gi (insufficient)
- 10 × n1-standard-4 (4 vCPU, 15 Gi) = 40 vCPU, 150 Gi (insufficient)
- 10 × n1-standard-8 (8 vCPU, 30 Gi) = 80 vCPU, 300 Gi (sufficient)
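Note that a node's allocatable capacity is lower than the machine's raw size, because GKE reserves CPU and memory for the kubelet and system daemons, so leave headroom beyond the raw totals above. You can check the actual allocatable values per node:
# Show allocatable CPU and memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory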
Node Lifecycle Management
Auto-Repair
Automatic node repair for unhealthy nodes:
# Enable auto-repair
gcloud container node-pools update general-pool \
--cluster my-cluster \
--zone us-central1-a \
--enable-autorepair
Auto-Repair Process:
- Monitors node health
- Automatically repairs unhealthy nodes
- Replaces nodes that can’t be repaired
- Reduces manual intervention
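To confirm whether auto-repair (and auto-upgrade) is enabled on a pool, inspect its management settings:
# Show the pool's management settings (autoRepair, autoUpgrade)
gcloud container node-pools describe general-pool \
--cluster my-cluster \
--zone us-central1-a \
--format="yaml(management)"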
Auto-Upgrade
Automatic node upgrades for security patches:
# Enable auto-upgrade
gcloud container node-pools update general-pool \
--cluster my-cluster \
--zone us-central1-a \
--enable-autoupgrade
Auto-Upgrade Process:
- Automatically upgrades nodes to latest patch version
- Stays on same minor version as cluster
- Performs rolling upgrades
- Reduces security vulnerabilities
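Auto-upgrades run during the cluster's maintenance window if one is configured, so setting a window lets you steer disruptive work to off-peak hours. A sketch using a simple daily window (the time is UTC):
# Set a daily maintenance window starting at 03:00 UTC
gcloud container clusters update my-cluster \
--zone us-central1-a \
--maintenance-window 03:00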
Node Health Checks
GKE automatically monitors node health:
# Check node status
kubectl get nodes
# Describe node for details
kubectl describe node gke-my-cluster-default-pool-xxx-yyy
# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
Node Conditions:
- Ready - Node is healthy and ready to accept pods
- MemoryPressure - Node is under memory pressure
- DiskPressure - Node is under disk pressure
- PIDPressure - Node is under process ID pressure
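A quick way to find nodes that are not Ready (for example, nodes stuck under resource pressure) is to filter the conditions with jq:
# List nodes whose Ready condition is not True
kubectl get nodes -o json | \
  jq -r '.items[] | select(any(.status.conditions[]; .type == "Ready" and .status != "True")) | .metadata.name'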
Node Replacement
Manual Node Replacement:
# Cordon node (prevent new pods)
kubectl cordon gke-my-cluster-default-pool-xxx-yyy
# Drain node (evict pods)
kubectl drain gke-my-cluster-default-pool-xxx-yyy \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=300
# Delete the Kubernetes node object (the underlying Compute Engine VM is still
# managed by the node pool's instance group; resize the pool to remove capacity permanently)
kubectl delete node gke-my-cluster-default-pool-xxx-yyy
Automatic Node Replacement:
- Auto-repair handles node replacement
- Auto-upgrade performs rolling replacements
- Managed Instance Group handles unhealthy instances
Preemptible VMs and Cost Optimization
Preemptible VMs
Preemptible VMs cost significantly less than standard VMs (up to 80% discount), but they can be reclaimed at any time and run for at most 24 hours, so use them for workloads that tolerate interruption:
# Create node pool with preemptible VMs
gcloud container node-pools create spot-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 3 \
--machine-type n1-standard-2 \
--preemptible \
--enable-autoscaling \
--min-nodes 0 \
--max-nodes 10 \
--node-labels preemptible=true \
--node-taints preemptible=true:NoSchedule
Preemptible VM Best Practices:
- Use for fault-tolerant workloads
- Set appropriate taints and tolerations (see the deployment sketch after this list)
- Use auto-scaling for availability
- Combine with on-demand for reliability
- Handle interruptions gracefully
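A minimal deployment sketch that tolerates the preemptible taint and prefers, but does not require, preemptible nodes, so replicas can fall back to on-demand capacity when preemptible nodes are reclaimed; the image name and replica count are placeholders:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 4
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      tolerations:
      - key: preemptible            # matches the --node-taints key above
        operator: Equal
        value: "true"
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: preemptible    # matches the --node-labels key above
                operator: In
                values: ["true"]
      containers:
      - name: worker
        image: batch-worker:latest  # placeholder image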
Cost Optimization Strategies
- Right-Size Machine Types - Match machine size to workload requirements
- Use Preemptible VMs - For fault-tolerant workloads
- Use Committed Use Discounts - For predictable workloads
- Auto-Scaling - Scale down during low usage
- Sustained Use Discounts - Automatic discounts for long-running VMs
- E2 Machine Types - Cost-effective for many workloads
- Monitor Costs - Use Cloud Billing and Cost Management
Node Taints and Tolerations
Taints
Prevent pods from scheduling on nodes:
# Create node pool with taint
gcloud container node-pools create gpu-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 2 \
--machine-type n1-standard-4 \
--accelerator type=nvidia-tesla-k80,count=1 \
--node-taints nvidia.com/gpu=true:NoSchedule
Tolerations
Allow pods to schedule on tainted nodes:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest
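In practice a GPU pod should also request the GPU resource: requesting nvidia.com/gpu in the container's limits keeps the pod off nodes without GPUs, and GKE adds the matching toleration automatically for pods that request GPUs. A minimal sketch (the image name is a placeholder):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  containers:
  - name: app
    image: gpu-app:latest           # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1           # request one GPU; schedules the pod onto a GPU node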
Node Pool Organization
Multiple Node Pools
Organize workloads with multiple node pools:
# General purpose pool
gcloud container node-pools create general-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 3 \
--machine-type n1-standard-2 \
--node-labels workload-type=general
# Compute pool
gcloud container node-pools create compute-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 2 \
--machine-type n1-standard-4 \
--node-labels workload-type=compute \
--node-taints compute=true:NoSchedule
# Memory pool
gcloud container node-pools create memory-pool \
--cluster my-cluster \
--zone us-central1-a \
--num-nodes 2 \
--machine-type n1-highmem-4 \
--node-labels workload-type=memory \
--node-taints memory=true:NoSchedule
Best Practices
- Use Auto-Repair and Auto-Upgrade - Automatic node maintenance
- Right-Size Machine Types - Match machine size to workload requirements
- Use Preemptible VMs - For fault-tolerant, cost-sensitive workloads
- Set Appropriate Scaling Limits - Prevent cost overruns
- Use Labels and Taints - Organize workloads and control node selection
- Monitor Node Health - Set up alerts for node issues
- Plan for Updates - Schedule node pool updates during maintenance windows
- Label Resources - Apply resource labels for cost allocation and management
- Test Updates - Test node pool updates in non-production first
- Use Multiple Node Pools - Separate workloads by requirements
Common Issues
Nodes Not Joining Cluster
Problem: Nodes created but not joining cluster
Solutions:
- Check service account permissions
- Verify firewall rules
- Check VPC network configuration
- Review Cloud Logging for errors
Node Pool Update Fails
Problem: Node pool update stuck or fails
Solutions:
- Check node pool status
- Review update logs
- Verify machine type availability
- Check instance capacity
- Review Cloud Logging for errors
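Node pool changes show up as cluster operations, which is usually the fastest place to see why an update is stuck. A sketch of checking them (the operation ID comes from the list output):
# List recent cluster operations (upgrades, repairs, updates)
gcloud container operations list --zone us-central1-a
# Inspect one operation's status and error details
gcloud container operations describe OPERATION_ID --zone us-central1-a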
High Node Churn
Problem: Nodes frequently being replaced
Solutions:
- Check node health conditions
- Review resource pressure
- Verify instance stability
- Check for preemptible interruptions
- Review auto-repair configuration
See Also
- Cluster Setup - Initial node pool creation
- Autoscaling - Automatic node scaling
- Troubleshooting - Node issues