GKE Node Management

Node management on GKE involves provisioning, configuring, and maintaining worker nodes that run your applications. In Standard mode, you manage node pools and nodes. In Autopilot mode, Google Cloud manages nodes automatically. Understanding node pool management helps you optimize costs and performance.

Node Pool Overview

Node pools are groups of nodes with the same configuration. In Standard mode, you create and manage node pools. In Autopilot mode, nodes are automatically managed.

graph TB
  subgraph standard[Standard Mode]
    NP[Node Pool] --> MIG[Managed Instance Group]
    MIG --> VMs[Compute Engine VMs]
    NP --> CONFIG[You Manage Configuration]
  end
  subgraph autopilot[Autopilot Mode]
    AUTO[Autopilot] --> AUTO_NODES[Auto-Managed Nodes]
    AUTO --> AUTO_CONFIG[GCP Manages Configuration]
  end
  style NP fill:#e1f5ff
  style AUTO fill:#fff4e1
  style VMs fill:#e8f5e9

Creating Node Pools

Using gcloud CLI

Basic Node Pool:

# Create node pool
gcloud container node-pools create general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-2

Advanced Node Pool:

# Create node pool with custom configuration
gcloud container node-pools create compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade \
  --preemptible \
  --disk-type pd-ssd \
  --disk-size 50 \
  --node-labels workload-type=compute \
  --node-taints compute=true:NoSchedule

Node Pool Configuration

Machine Types:

Choose based on workload requirements:

Machine Family                 | Use Case         | Example Types
General Purpose                | Most workloads   | e2-standard, n1-standard, n2-standard
Compute Optimized / High-CPU   | CPU-intensive    | c2-standard, n1-highcpu, n2-highcpu
Memory Optimized / High-Memory | Memory-intensive | m1/m2 series, n1-highmem, n2-highmem
Shared Core                    | Development      | e2-micro, e2-small

Scaling Configuration:

# Enable auto-scaling
gcloud container node-pools update general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10

Labels and Taints:

# Create node pool with labels and taints
gcloud container node-pools create compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --node-labels workload-type=compute,zone=us-central1-a \
  --node-taints compute=true:NoSchedule

Use Cases:

  • Labels - Select nodes for pod scheduling via nodeSelector or node affinity
  • Taints - Keep pods off a pool unless they carry a matching toleration (see the example below)
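
For example, a pod can be placed onto the compute-pool created above by combining a nodeSelector on the workload-type label with a toleration for the compute taint. A minimal sketch (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: compute-workload
spec:
  # Only consider nodes labeled workload-type=compute
  nodeSelector:
    workload-type: compute
  # Tolerate the compute=true:NoSchedule taint on the pool
  tolerations:
  - key: compute
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: compute-app:latest   # placeholder image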

Updating Node Pools

Update Node Pool Version:

# Upgrade the node pool to the cluster control plane's version
gcloud container node-pools upgrade compute-pool \
  --cluster my-cluster \
  --zone us-central1-a

Update Configuration:

# Update scaling, labels, or taints
gcloud container node-pools update compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autoscaling \
  --min-nodes 2 \
  --max-nodes 15

Update Process:

graph LR
  A[Start Update] --> B[Create New Nodes]
  B --> C[Cordon Old Nodes]
  C --> D[Drain Old Nodes]
  D --> E[Terminate Old Nodes]
  E --> F[Update Complete]
  style A fill:#e1f5ff
  style F fill:#e8f5e9

  • New nodes are created with the updated configuration
  • Old nodes are cordoned (no new pods scheduled on them)
  • Old nodes are drained (pods are rescheduled onto new nodes)
  • Old nodes are terminated
  • Disruption is minimal; use PodDisruptionBudgets and the surge settings below to control it
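
How aggressively GKE replaces nodes during an upgrade can be tuned with surge settings. A minimal sketch, assuming the compute-pool from earlier (values are illustrative):

# Create at most one extra node at a time and keep all existing nodes schedulable
gcloud container node-pools update compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0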

Machine Types and Sizing

Choosing Machine Types

General Purpose (n1-standard, n2-standard, e2-standard):

  • Web applications
  • Microservices
  • Development environments
  • Most common workloads

Compute Optimized / High-CPU (c2-standard, n1-highcpu, n2-highcpu):

  • CPU-intensive workloads
  • Batch processing
  • High-performance computing
  • Scientific computing

Memory Optimized / High-Memory (m1/m2 series, n1-highmem, n2-highmem):

  • In-memory databases
  • Caching systems
  • Analytics workloads
  • Memory-intensive applications

Right-Sizing Nodes

Considerations:

  • Pod density requirements
  • Resource requests and limits
  • Machine type pod limits
  • Cost optimization
  • Performance requirements

Example Calculation:

Required CPU: 100 pods × 0.5 CPU = 50 CPU
Required Memory: 100 pods × 2 Gi = 200 Gi

Options:
- 10 × n1-standard-2 (2 vCPU, 7.5 Gi) = 20 vCPU, 75 Gi (insufficient)
- 10 × n1-standard-4 (4 vCPU, 15 Gi) = 40 vCPU, 150 Gi (insufficient)
- 10 × n1-standard-8 (8 vCPU, 30 Gi) = 80 vCPU, 300 Gi (sufficient)
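
Keep in mind that a node's allocatable capacity is lower than the machine type's raw vCPU and memory, because GKE reserves resources for the operating system, kubelet, and system pods, so size with some headroom. To check the actual allocatable capacity of your nodes (a quick check, assuming kubectl is pointed at the cluster):

# Show allocatable CPU and memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory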

Node Lifecycle Management

Auto-Repair

Automatic node repair for unhealthy nodes:

# Enable auto-repair
gcloud container node-pools update general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autorepair

Auto-Repair Process:

  • Monitors node health
  • Automatically repairs unhealthy nodes
  • Replaces nodes that can’t be repaired
  • Reduces manual intervention (verify the setting with the command below)
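
To verify the management settings on a pool (a quick check; output layout may vary by gcloud version):

# Show auto-repair and auto-upgrade settings for the pool
gcloud container node-pools describe general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --format="yaml(management)"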

Auto-Upgrade

Automatic node upgrades for security patches:

# Enable auto-upgrade
gcloud container node-pools update general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autoupgrade

Auto-Upgrade Process:

  • Automatically upgrades nodes to latest patch version
  • Stays on same minor version as cluster
  • Performs rolling upgrades
  • Reduces security vulnerabilities; upgrades respect the cluster maintenance window (see below)
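
If a maintenance window is configured on the cluster, auto-upgrades run within it. A minimal sketch of setting a recurring weekend window (times are illustrative, in UTC):

# Configure a recurring maintenance window (cluster-level setting)
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --maintenance-window-start "2024-01-06T01:00:00Z" \
  --maintenance-window-end "2024-01-06T05:00:00Z" \
  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU"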

Node Health Checks

GKE automatically monitors node health:

# Check node status
kubectl get nodes

# Describe node for details
kubectl describe node gke-my-cluster-default-pool-xxx-yyy

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

Node Conditions:

  • Ready - Node is healthy and ready for pods
  • MemoryPressure - Node has memory pressure
  • DiskPressure - Node has disk pressure
  • PIDPressure - Node is running low on process IDs (a filter for unhealthy nodes is shown below)
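
For example, to list nodes whose Ready condition is not True (a small jq filter in the same style as above):

# List nodes that are not reporting Ready
kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name'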

Node Replacement

Manual Node Replacement:

# Cordon node (prevent new pods)
kubectl cordon gke-my-cluster-default-pool-xxx-yyy

# Drain node (evict pods)
kubectl drain gke-my-cluster-default-pool-xxx-yyy \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Delete node
kubectl delete node gke-my-cluster-default-pool-xxx-yyy

Automatic Node Replacement:

  • Auto-repair handles node replacement
  • Auto-upgrade performs rolling replacements
  • Managed Instance Group handles unhealthy instances

Preemptible VMs and Cost Optimization

Preemptible VMs

Use preemptible VMs for cost savings (up to roughly 80% off on-demand prices). On newer GKE versions, Spot VMs (the --spot flag) supersede preemptible VMs and are generally recommended for new node pools; the preemptible example below works the same way:

# Create node pool with preemptible VMs
gcloud container node-pools create preemptible-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-2 \
  --preemptible \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 10 \
  --node-labels preemptible=true \
  --node-taints preemptible=true:NoSchedule

Preemptible VM Best Practices:

  • Use for fault-tolerant workloads
  • Set appropriate taints and tolerations
  • Use auto-scaling for availability
  • Combine with on-demand for reliability
  • Handle interruptions gracefully (an example combining these practices follows)
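
Putting several of these practices together, a fault-tolerant Deployment might tolerate the preemptible taint, prefer (but not require) preemptible nodes so it can fall back to on-demand capacity, and use a PodDisruptionBudget to limit how many replicas are evicted at once. A minimal sketch (names and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 6
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      # Allow scheduling onto the tainted preemptible pool
      tolerations:
      - key: preemptible
        operator: Equal
        value: "true"
        effect: NoSchedule
      # Prefer preemptible nodes, but fall back to on-demand nodes when needed
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: preemptible
                operator: In
                values: ["true"]
      containers:
      - name: worker
        image: batch-worker:latest   # placeholder image
---
# Limit how many replicas can be evicted at the same time
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: batch-worker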

Cost Optimization Strategies

  1. Right-Size Machine Types - Match machine size to workload requirements

  2. Use Preemptible VMs - For fault-tolerant workloads

  3. Use Committed Use Discounts - For predictable workloads

  4. Auto-Scaling - Scale down during low usage

  5. Sustained Use Discounts - Automatic discounts for long-running VMs

  6. E2 Machine Types - Cost-effective for many workloads

  7. Monitor Costs - Use Cloud Billing and Cost Management

Node Taints and Tolerations

Taints

Prevent pods from scheduling on nodes:

# Create node pool with taint
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 2 \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --node-taints nvidia.com/gpu=true:NoSchedule

Tolerations

Allow pods to schedule on tainted nodes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest
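
A toleration only permits scheduling onto the tainted pool; it does not force it. To actually land on the GPU pool, you would typically also add a nodeSelector on the built-in node-pool label and, for real GPU access, request the nvidia.com/gpu resource (GPU nodes additionally need NVIDIA drivers installed, which recent GKE versions can handle automatically). A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  # Pin the pod to the gpu-pool created above
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-pool
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU on the node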

Node Pool Organization

Multiple Node Pools

Organize workloads with multiple node pools:

# General purpose pool
gcloud container node-pools create general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-2 \
  --node-labels workload-type=general

# Compute pool
gcloud container node-pools create compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 2 \
  --machine-type n1-standard-4 \
  --node-labels workload-type=compute \
  --node-taints compute=true:NoSchedule

# Memory pool
gcloud container node-pools create memory-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 2 \
  --machine-type n1-highmem-4 \
  --node-labels workload-type=memory \
  --node-taints memory=true:NoSchedule
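
To confirm which pool each node belongs to, and where pods were scheduled, you can use the built-in node-pool label (a quick check):

# Show each node's pool via the GKE node-pool label
kubectl get nodes -L cloud.google.com/gke-nodepool

# Show which node each pod landed on
kubectl get pods -o wide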

Best Practices

  1. Use Auto-Repair and Auto-Upgrade - Automatic node maintenance

  2. Right-Size Machine Types - Match machine size to workload requirements

  3. Use Preemptible VMs - For fault-tolerant, cost-sensitive workloads

  4. Set Appropriate Scaling Limits - Prevent cost overruns

  5. Use Labels and Taints - Organize workloads and node selection

  6. Monitor Node Health - Set up alerts for node issues

  7. Plan for Updates - Schedule node pool updates during maintenance windows

  8. Tag Resources - For cost allocation and resource management

  9. Test Updates - Test node pool updates in non-production first

  10. Use Multiple Node Pools - Separate workloads by requirements

Common Issues

Nodes Not Joining Cluster

Problem: Nodes created but not joining cluster

Solutions:

  • Check service account permissions
  • Verify firewall rules
  • Check VPC network configuration
  • Review Cloud Logging for errors

Node Pool Update Fails

Problem: Node pool update stuck or fails

Solutions:

  • Check node pool status and recent operations (see the commands below)
  • Review update logs
  • Verify machine type availability
  • Check instance capacity
  • Review Cloud Logging for errors
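
A starting point for diagnosis (output fields may vary by gcloud version):

# Check the node pool's current status
gcloud container node-pools describe compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --format="value(status)"

# List recent cluster operations (upgrades, repairs) and their states
gcloud container operations list \
  --zone us-central1-a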

High Node Churn

Problem: Nodes frequently being replaced

Solutions:

  • Check node health conditions
  • Review resource pressure
  • Verify instance stability
  • Check for preemptible interruptions
  • Review auto-repair configuration

See Also