GKE Node Management

Node management on GKE involves provisioning, configuring, and maintaining worker nodes that run your applications. In Standard mode, you manage node pools and nodes. In Autopilot mode, Google Cloud manages nodes automatically. Understanding node pool management helps you optimize costs and performance.

Node Pool Overview

Node pools are groups of nodes with the same configuration. In Standard mode, you create and manage node pools. In Autopilot mode, nodes are automatically managed.

graph TB
  subgraph standard[Standard Mode]
    NP[Node Pool] --> MIG[Managed Instance Group]
    MIG --> VMs[Compute Engine VMs]
    NP --> CONFIG[You Manage Configuration]
  end
  subgraph autopilot[Autopilot Mode]
    AUTO[Autopilot] --> AUTO_NODES[Auto-Managed Nodes]
    AUTO --> AUTO_CONFIG[GCP Manages Configuration]
  end
  style NP fill:#e1f5ff
  style AUTO fill:#fff4e1
  style VMs fill:#e8f5e9

Creating Node Pools

Using gcloud CLI

Basic Node Pool:

# Create node pool
gcloud container node-pools create general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-2

Advanced Node Pool:

# Create node pool with custom configuration
gcloud container node-pools create compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade \
  --preemptible \
  --disk-type pd-ssd \
  --disk-size 50 \
  --node-labels workload-type=compute \
  --node-taints compute=true:NoSchedule

Node Pool Configuration

Machine Types:

Choose based on workload requirements:

Machine Family                 | Use Case         | Example Types
General Purpose                | Most workloads   | e2-standard, n1-standard, n2-standard
Compute Optimized / High-CPU   | CPU-intensive    | c2-standard, n1-highcpu, n2-highcpu
Memory Optimized / High-Memory | Memory-intensive | m1/m2 series, n1-highmem, n2-highmem
Shared Core                    | Development      | e2-micro, e2-small

Scaling Configuration:

# Enable auto-scaling
gcloud container node-pools update general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10

Labels and Taints:

# Create node pool with labels and taints
gcloud container node-pools create compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --node-labels workload-type=compute,zone=us-central1-a \
  --node-taints compute=true:NoSchedule

Use Cases:

  • Labels - Select nodes for pod scheduling via nodeSelector or node affinity
  • Taints - Keep pods off a pool unless they carry a matching toleration (see the example below)
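
For example, a pod can be placed onto the compute-pool created above by combining a nodeSelector on the workload-type label with a toleration for the compute taint. A minimal sketch (the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: compute-workload
spec:
  # Only consider nodes labeled workload-type=compute
  nodeSelector:
    workload-type: compute
  # Tolerate the compute=true:NoSchedule taint on the pool
  tolerations:
  - key: compute
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: compute-app:latest   # placeholder image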

Updating Node Pools

Update Node Pool Version:

# Upgrade the node pool to the cluster control plane's version
gcloud container node-pools upgrade compute-pool \
  --cluster my-cluster \
  --zone us-central1-a

Update Configuration:

# Update scaling, labels, or taints
gcloud container node-pools update compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autoscaling \
  --min-nodes 2 \
  --max-nodes 15

Update Process:

graph LR
  A[Start Update] --> B[Create New Nodes]
  B --> C[Cordon Old Nodes]
  C --> D[Drain Old Nodes]
  D --> E[Terminate Old Nodes]
  E --> F[Update Complete]
  style A fill:#e1f5ff
  style F fill:#e8f5e9

  • New nodes are created with the updated configuration
  • Old nodes are cordoned (no new pods scheduled on them)
  • Old nodes are drained (pods are rescheduled onto new nodes)
  • Old nodes are terminated
  • Disruption is minimal; use PodDisruptionBudgets and the surge settings below to control it
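
How aggressively GKE replaces nodes during an upgrade can be tuned with surge settings. A minimal sketch, assuming the compute-pool from earlier (values are illustrative):

# Create at most one extra node at a time and keep all existing nodes schedulable
gcloud container node-pools update compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0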

Machine Types and Sizing

Choosing Machine Types

General Purpose (n1-standard, n2-standard, e2-standard):

  • Web applications
  • Microservices
  • Development environments
  • Most common workloads

Compute Optimized / High-CPU (c2-standard, n1-highcpu, n2-highcpu):

  • CPU-intensive workloads
  • Batch processing
  • High-performance computing
  • Scientific computing

Memory Optimized / High-Memory (m1/m2 series, n1-highmem, n2-highmem):

  • In-memory databases
  • Caching systems
  • Analytics workloads
  • Memory-intensive applications

Right-Sizing Nodes

Considerations:

  • Pod density requirements
  • Resource requests and limits
  • Machine type pod limits
  • Cost optimization
  • Performance requirements

Example Calculation:

Required CPU: 100 pods × 0.5 CPU = 50 CPU
Required Memory: 100 pods × 2 Gi = 200 Gi

Options:
- 10 × n1-standard-2 (2 vCPU, 7.5 Gi) = 20 vCPU, 75 Gi (insufficient)
- 10 × n1-standard-4 (4 vCPU, 15 Gi) = 40 vCPU, 150 Gi (insufficient)
- 10 × n1-standard-8 (8 vCPU, 30 Gi) = 80 vCPU, 300 Gi (sufficient)
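
Keep in mind that a node's allocatable capacity is lower than the machine type's raw vCPU and memory, because GKE reserves resources for the operating system, kubelet, and system pods, so size with some headroom. To check the actual allocatable capacity of your nodes (a quick check, assuming kubectl is pointed at the cluster):

# Show allocatable CPU and memory per node
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory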

Node Lifecycle Management

Auto-Repair

Automatic node repair for unhealthy nodes:

# Enable auto-repair
gcloud container node-pools update general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autorepair

Auto-Repair Process:

  • Monitors node health
  • Automatically repairs unhealthy nodes
  • Replaces nodes that can’t be repaired
  • Reduces manual intervention (verify the setting with the command below)
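
To verify the management settings on a pool (a quick check; output layout may vary by gcloud version):

# Show auto-repair and auto-upgrade settings for the pool
gcloud container node-pools describe general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --format="yaml(management)"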

Auto-Upgrade

Automatic node upgrades for security patches:

# Enable auto-upgrade
gcloud container node-pools update general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autoupgrade

Auto-Upgrade Process:

  • Automatically upgrades nodes to latest patch version
  • Stays on same minor version as cluster
  • Performs rolling upgrades
  • Reduces security vulnerabilities; upgrades respect the cluster maintenance window (see below)
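
If a maintenance window is configured on the cluster, auto-upgrades run within it. A minimal sketch of setting a recurring weekend window (times are illustrative, in UTC):

# Configure a recurring maintenance window (cluster-level setting)
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --maintenance-window-start "2024-01-06T01:00:00Z" \
  --maintenance-window-end "2024-01-06T05:00:00Z" \
  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU"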

Node Health Checks

GKE automatically monitors node health:

# Check node status
kubectl get nodes

# Describe node for details
kubectl describe node gke-my-cluster-default-pool-xxx-yyy

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

Node Conditions:

  • Ready - Node is healthy and ready for pods
  • MemoryPressure - Node has memory pressure
  • DiskPressure - Node has disk pressure
  • PIDPressure - Node is running low on process IDs (a filter for unhealthy nodes is shown below)
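
For example, to list nodes whose Ready condition is not True (a small jq filter in the same style as above):

# List nodes that are not reporting Ready
kubectl get nodes -o json | jq -r '.items[] | select(.status.conditions[] | select(.type=="Ready" and .status!="True")) | .metadata.name'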

Node Replacement

Manual Node Replacement:

# Cordon node (prevent new pods)
kubectl cordon gke-my-cluster-default-pool-xxx-yyy

# Drain node (evict pods)
kubectl drain gke-my-cluster-default-pool-xxx-yyy \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300

# Delete node
kubectl delete node gke-my-cluster-default-pool-xxx-yyy

Automatic Node Replacement:

  • Auto-repair handles node replacement
  • Auto-upgrade performs rolling replacements
  • Managed Instance Group handles unhealthy instances

Preemptible VMs and Cost Optimization

Preemptible VMs

Use preemptible VMs for cost savings (up to roughly 80% off on-demand prices). On newer GKE versions, Spot VMs (the --spot flag) supersede preemptible VMs and are generally recommended for new node pools; the preemptible example below works the same way:

# Create node pool with preemptible VMs
gcloud container node-pools create preemptible-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-2 \
  --preemptible \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 10 \
  --node-labels preemptible=true \
  --node-taints preemptible=true:NoSchedule

Preemptible VM Best Practices:

  • Use for fault-tolerant workloads
  • Set appropriate taints and tolerations
  • Use auto-scaling for availability
  • Combine with on-demand for reliability
  • Handle interruptions gracefully (an example combining these practices follows)
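
Putting several of these practices together, a fault-tolerant Deployment might tolerate the preemptible taint, prefer (but not require) preemptible nodes so it can fall back to on-demand capacity, and use a PodDisruptionBudget to limit how many replicas are evicted at once. A minimal sketch (names and image are placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 6
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      # Allow scheduling onto the tainted preemptible pool
      tolerations:
      - key: preemptible
        operator: Equal
        value: "true"
        effect: NoSchedule
      # Prefer preemptible nodes, but fall back to on-demand nodes when needed
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: preemptible
                operator: In
                values: ["true"]
      containers:
      - name: worker
        image: batch-worker:latest   # placeholder image
---
# Limit how many replicas can be evicted at the same time
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: batch-worker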

Cost Optimization Strategies

  1. Right-Size Machine Types - Match machine size to workload requirements

  2. Use Preemptible VMs - For fault-tolerant workloads

  3. Use Committed Use Discounts - For predictable workloads

  4. Auto-Scaling - Scale down during low usage

  5. Sustained Use Discounts - Automatic discounts for long-running VMs

  6. E2 Machine Types - Cost-effective for many workloads

  7. Monitor Costs - Use Cloud Billing and Cost Management

Node Taints and Tolerations

Taints

Prevent pods from scheduling on nodes:

# Create node pool with taint
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 2 \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --node-taints nvidia.com/gpu=true:NoSchedule

Tolerations

Allow pods to schedule on tainted nodes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest
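
A toleration only permits scheduling onto the tainted pool; it does not force it. To actually land on the GPU pool, you would typically also add a nodeSelector on the built-in node-pool label and, for real GPU access, request the nvidia.com/gpu resource (GPU nodes additionally need NVIDIA drivers installed, which recent GKE versions can handle automatically). A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  # Pin the pod to the gpu-pool created above
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-pool
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
  containers:
  - name: app
    image: gpu-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU on the node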

Node Pool Organization

Multiple Node Pools

Organize workloads with multiple node pools:

# General purpose pool
gcloud container node-pools create general-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-2 \
  --node-labels workload-type=general

# Compute pool
gcloud container node-pools create compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 2 \
  --machine-type n1-standard-4 \
  --node-labels workload-type=compute \
  --node-taints compute=true:NoSchedule

# Memory pool
gcloud container node-pools create memory-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --num-nodes 2 \
  --machine-type n1-highmem-4 \
  --node-labels workload-type=memory \
  --node-taints memory=true:NoSchedule
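
To confirm which pool each node belongs to, and where pods were scheduled, you can use the built-in node-pool label (a quick check):

# Show each node's pool via the GKE node-pool label
kubectl get nodes -L cloud.google.com/gke-nodepool

# Show which node each pod landed on
kubectl get pods -o wide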

Best Practices

  1. Use Auto-Repair and Auto-Upgrade - Automatic node maintenance

  2. Right-Size Machine Types - Match machine size to workload requirements

  3. Use Preemptible VMs - For fault-tolerant, cost-sensitive workloads

  4. Set Appropriate Scaling Limits - Prevent cost overruns

  5. Use Labels and Taints - Organize workloads and node selection

  6. Monitor Node Health - Set up alerts for node issues

  7. Plan for Updates - Schedule node pool updates during maintenance windows

  8. Tag Resources - For cost allocation and resource management

  9. Test Updates - Test node pool updates in non-production first

  10. Use Multiple Node Pools - Separate workloads by requirements

Common Issues

Nodes Not Joining Cluster

Problem: Nodes created but not joining cluster

Solutions:

  • Check service account permissions
  • Verify firewall rules
  • Check VPC network configuration
  • Review Cloud Logging for errors

Node Pool Update Fails

Problem: Node pool update stuck or fails

Solutions:

  • Check node pool status and recent operations (see the commands below)
  • Review update logs
  • Verify machine type availability
  • Check instance capacity
  • Review Cloud Logging for errors
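
A starting point for diagnosis (output fields may vary by gcloud version):

# Check the node pool's current status
gcloud container node-pools describe compute-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --format="value(status)"

# List recent cluster operations (upgrades, repairs) and their states
gcloud container operations list \
  --zone us-central1-a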

High Node Churn

Problem: Nodes frequently being replaced

Solutions:

  • Check node health conditions
  • Review resource pressure
  • Verify instance stability
  • Check for preemptible interruptions
  • Review auto-repair configuration

See Also