GKE Troubleshooting

Troubleshooting GKE issues requires understanding cluster components, networking, authentication, and Google Cloud service integration. This guide covers common issues, debugging techniques, and resolution strategies for GKE clusters.

Troubleshooting Approach

Systematic troubleshooting process:

graph TB
  A[Identify Issue] --> B{Gather Information}
  B --> C[Check Cluster Status]
  B --> D[Check Node Status]
  B --> E[Check Pod Status]
  B --> F[Check Logs]
  C --> G{Issue Type?}
  D --> G
  E --> G
  F --> G
  G -->|Cluster| H[Cluster Issues]
  G -->|Node| I[Node Issues]
  G -->|Pod| J[Pod Issues]
  G -->|Network| K[Network Issues]
  G -->|Auth| L[Auth Issues]
  H --> M[Resolve]
  I --> M
  J --> M
  K --> M
  L --> M
  style A fill:#e1f5ff
  style M fill:#e8f5e9

Common Cluster Issues

Cluster Not Accessible

Symptoms:

  • kubectl commands fail
  • “Unable to connect to the server” errors
  • API server timeout

Diagnosis:

# Check cluster status
gcloud container clusters describe my-cluster --zone us-central1-a

# Check cluster endpoint
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(endpoint)"

# Test connectivity
curl -k https://<cluster-endpoint>/healthz

# Check kubeconfig
kubectl config view
kubectl config get-contexts

Common Causes:

  • Cluster endpoint access restricted (private endpoint)
  • Network connectivity issues
  • kubeconfig not configured
  • IAM authentication issues

Solutions:

# Update kubeconfig
gcloud container clusters get-credentials my-cluster --zone us-central1-a

# Check endpoint access
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(privateClusterConfig)"

# For private clusters, use authorized networks or VPN
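For a cluster with master authorized networks enabled, adding your workstation's IP to the allowed list is a common fix. A minimal sketch, assuming the cluster name and zone from above; the use of ifconfig.me to discover your public IP is an assumption:

```shell
# Sketch: add the current workstation's public IP to master authorized networks
# (this flag replaces the existing list, so merge in any existing CIDRs first)
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --enable-master-authorized-networks \
  --master-authorized-networks "$(curl -s https://ifconfig.me)/32"
```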

Cluster Control Plane Issues

Symptoms:

  • Cluster status shows “RECONCILING” or “ERROR”
  • Control plane components unhealthy
  • API server errors

Diagnosis:

# Check cluster status
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(status)"

# Check cluster health
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(conditions)"

# Review cluster logs
gcloud logging read "resource.type=gke_cluster AND resource.labels.cluster_name=my-cluster" --limit 50

Common Causes:

  • Cluster version incompatibility
  • Service account issues
  • VPC configuration problems
  • Service quota limits

Solutions:

# Check service account
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(nodeConfig.serviceAccount)"

# Verify service account permissions
gcloud projects get-iam-policy PROJECT_ID

# Check service quotas
gcloud compute project-info describe --project PROJECT_ID

Node Connectivity Problems

Nodes Not Joining Cluster

Symptoms:

  • Nodes created but not appearing in kubectl get nodes
  • Nodes show “NotReady” status
  • Pods can’t be scheduled

Diagnosis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Check node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'

# Check node logs
gcloud logging read "resource.type=gce_instance AND resource.labels.instance_name=<node-name>" --limit 50

Common Causes:

  • Service account not configured correctly
  • Firewall rules blocking traffic
  • VPC network configuration issues
  • Node pool configuration problems

Solutions:

# Verify node pool configuration
gcloud container node-pools describe default-pool --cluster my-cluster --zone us-central1-a

# Check firewall rules
gcloud compute firewall-rules list --filter="targetTags:gke-my-cluster-node"

# Check service account permissions
gcloud projects get-iam-policy PROJECT_ID --flatten="bindings[].members" --filter="bindings.members:serviceAccount:*"
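If the node service account turns out to be missing permissions, granting the baseline logging and monitoring roles often resolves join failures. A hedged sketch; the service account name is a placeholder:

```shell
# Sketch: grant the minimum roles a GKE node service account typically needs
# (NODE_SA and PROJECT_ID are placeholders for your values)
NODE_SA="my-node-sa@PROJECT_ID.iam.gserviceaccount.com"
for role in roles/logging.logWriter roles/monitoring.metricWriter roles/monitoring.viewer; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:${NODE_SA}" \
    --role="$role"
done
```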

Node NotReady Status

Symptoms:

  • Nodes show “NotReady” status
  • Pods can’t be scheduled on nodes
  • Node conditions show problems

Diagnosis:

# Describe node for details
kubectl describe node <node-name>

# Check node conditions
kubectl get node <node-name> -o json | jq '.status.conditions'

# Check node events
kubectl get events --field-selector involvedObject.name=<node-name>

Common Causes:

  • kubelet not running
  • Network connectivity issues
  • Resource pressure (memory, disk)
  • Container runtime issues

Solutions:

# Check node logs
gcloud compute instances get-serial-port-output <node-name> --zone us-central1-a

# Enable auto-repair
gcloud container node-pools update default-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autorepair
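When auto-repair does not kick in, manually recycling the node is a common workaround. A sketch, assuming the node's VM belongs to the managed instance group that created it (the group recreates the deleted VM automatically):

```shell
# Sketch: safely evict workloads, then delete the backing VM so the
# managed instance group replaces it with a fresh node
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
gcloud compute instances delete <node-name> --zone us-central1-a
```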

Networking Troubleshooting

Pod Networking Issues

Symptoms:

  • Pods can’t communicate with each other
  • Pods can’t reach external services
  • DNS resolution failing

Diagnosis:

# Check pod network
kubectl get pods -o wide

# Test pod connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=100

Common Causes:

  • CoreDNS not running
  • Insufficient IP addresses
  • Firewall rules
  • Network policy blocking traffic

Solutions:

# Restart CoreDNS
kubectl delete pods -n kube-system -l k8s-app=kube-dns

# Check secondary IP range
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(ipAllocationPolicy)"

# Verify firewall rules
gcloud compute firewall-rules list --filter="network:my-vpc"
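If traffic still fails after the DNS and firewall checks, a NetworkPolicy may be dropping it. A quick sketch to list and inspect policies:

```shell
# Sketch: look for NetworkPolicies that could be blocking pod traffic
kubectl get networkpolicy --all-namespaces
kubectl describe networkpolicy <policy-name> -n <namespace>
```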

Load Balancer Issues

Symptoms:

  • LoadBalancer service stuck in “Pending”
  • Load balancer not accessible
  • Health checks failing

Diagnosis:

# Check service status
kubectl get svc <service-name>
kubectl describe svc <service-name>

# Check load balancer in GCP
gcloud compute forwarding-rules list --filter="name:<service-name>"

# Check backend services
gcloud compute backend-services list
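Backend health usually explains a load balancer that exists but does not serve traffic. A sketch, assuming you identified the backend service name from the command above (use --region instead of --global for regional backend services):

```shell
# Sketch: show per-instance health for the load balancer's backends
gcloud compute backend-services get-health <backend-service-name> --global
```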

Common Causes:

  • Firewall rules blocking health checks
  • Subnet configuration incorrect
  • Service quota limits
  • Health check configuration

Solutions:

# Check firewall rules
gcloud compute firewall-rules list --filter="name:default-allow-health-check"

# Verify subnet configuration
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(subnetwork)"

# Check service quotas
gcloud compute project-info describe --project PROJECT_ID
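If health checks are being blocked, the Google health-check probe ranges must be allowed through the VPC firewall. A hedged sketch; the rule name, network, and ports are assumptions, while 130.211.0.0/22 and 35.191.0.0/16 are the documented probe source ranges:

```shell
# Sketch: allow Google Cloud health-check probes to reach the nodes
# (adjust the allowed ports to match your service's node ports)
gcloud compute firewall-rules create allow-gke-health-checks \
  --network my-vpc \
  --source-ranges 130.211.0.0/22,35.191.0.0/16 \
  --allow tcp:80,tcp:443
```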

Authentication and Authorization Issues

kubectl Access Denied

Symptoms:

  • kubectl commands return “Forbidden” or “Unauthorized”
  • Can’t access cluster resources

Diagnosis:

# Check current user
gcloud auth list

# Test cluster access
kubectl auth can-i get pods

# Check IAM permissions
gcloud projects get-iam-policy PROJECT_ID --flatten="bindings[].members" --filter="bindings.members:user:$(gcloud config get-value account)"

Common Causes:

  • IAM user/role permissions missing
  • RBAC permissions missing
  • Cluster endpoint access restricted
  • Authentication issues

Solutions:

# Grant IAM permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:$(gcloud config get-value account)" \
  --role="roles/container.developer"

# Create RBAC role
kubectl create role developer \
  --resource=pods,services \
  --verb=get,list,create,update,delete

# Bind role to user
kubectl create rolebinding developer-binding \
  --role=developer \
  --user=$(gcloud config get-value account)
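After granting IAM and RBAC permissions, it is worth confirming the result before retrying the failing command. A sketch using kubectl's impersonation flag:

```shell
# Sketch: verify the binding took effect by checking access as the user
kubectl auth can-i list pods --as="$(gcloud config get-value account)"
```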

Workload Identity Not Working

Symptoms:

  • Pods can’t assume GCP service accounts
  • Google Cloud SDK calls fail
  • “Access Denied” errors in pods

Diagnosis:

# Check service account
kubectl get serviceaccount <sa-name> -o yaml

# Check the cluster's Workload Identity pool
gcloud container clusters describe my-cluster --zone us-central1-a --format="value(workloadIdentityConfig.workloadPool)"

# Check IAM policy binding
gcloud iam service-accounts get-iam-policy <gcp-sa>@PROJECT_ID.iam.gserviceaccount.com

# Test in pod
kubectl run -it --rm test --image=gcr.io/google.com/cloudsdktool/google-cloud-cli:latest --overrides='{"spec":{"serviceAccountName":"<sa-name>"}}' -- gcloud auth list

Common Causes:

  • Workload Identity not enabled on cluster
  • Service account annotation incorrect
  • IAM policy binding wrong
  • Pod not using service account

Solutions:

# Enable Workload Identity on cluster
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --workload-pool PROJECT_ID.svc.id.goog

# Enable Workload Identity on node pool
gcloud container node-pools update default-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --workload-metadata=GKE_METADATA

# Update service account annotation
kubectl annotate serviceaccount <sa-name> \
  iam.gke.io/gcp-service-account=<gcp-sa>@PROJECT_ID.iam.gserviceaccount.com

# Restart pods to pick up new credentials
kubectl rollout restart deployment <deployment-name>
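One fix the causes list implies but the commands above do not show is the workloadIdentityUser binding that lets the Kubernetes service account impersonate the GCP service account. A sketch with placeholder names:

```shell
# Sketch: allow the KSA <namespace>/<sa-name> to impersonate the GCP SA
gcloud iam service-accounts add-iam-policy-binding \
  <gcp-sa>@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[<namespace>/<sa-name>]"
```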

Storage Problems

Volume Not Attaching

Symptoms:

  • Pod stuck in “Pending”
  • PVC not bound
  • Volume attachment errors

Diagnosis:

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

# Check PV status
kubectl get pv
kubectl describe pv <pv-name>

# Check Persistent Disk in GCP
gcloud compute disks list --filter="name:<pv-name>"

Common Causes:

  • PD CSI driver not running
  • Service account permissions missing
  • Volume in different zone than node
  • Storage class misconfiguration

Solutions:

# Check the Persistent Disk CSI driver pods
kubectl get pods -n kube-system -l k8s-app=gcp-compute-persistent-disk-csi-driver

# Verify service account permissions
gcloud projects get-iam-policy PROJECT_ID --flatten="bindings[].members" --filter="bindings.members:serviceAccount:*pd-csi*"

# Check storage class
kubectl get storageclass
kubectl describe storageclass <sc-name>
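Because a zonal Persistent Disk can only attach to nodes in its own zone, comparing zones is a quick sanity check for attachment failures. A sketch:

```shell
# Sketch: compare the disk's zone with the zones of the cluster's nodes
gcloud compute disks list --filter="name~<pv-name>" --format="value(name,zone)"
kubectl get nodes --label-columns=topology.kubernetes.io/zone
```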

Debugging Tools and Techniques

Useful Commands

# Get comprehensive cluster info
kubectl cluster-info dump

# Get all resources
kubectl get all --all-namespaces

# Describe resource for details
kubectl describe <resource-type> <resource-name>

# Get events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Check API resources
kubectl api-resources
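For live debugging, kubectl debug can attach an ephemeral container to a running pod or open a shell on a node without SSH. A sketch, assuming a reasonably recent kubectl; the busybox image is an arbitrary choice:

```shell
# Sketch: attach an ephemeral debug container to a running pod
kubectl debug -it <pod-name> --image=busybox --target=<container-name>

# Sketch: debug a node via a privileged pod that mounts the host filesystem
kubectl debug node/<node-name> -it --image=busybox
```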

Log Collection

# Collect logs and resource state from all namespaces (kubectl logs cannot
# dump cluster-wide; cluster-info dump includes pod logs)
kubectl cluster-info dump --all-namespaces --output-directory=cluster-dump

# Collect node logs
gcloud compute instances get-serial-port-output <node-name> --zone us-central1-a

# Collect GKE control plane logs
gcloud logging read "resource.type=gke_cluster AND resource.labels.cluster_name=my-cluster" --limit 100

Google Cloud Support and Resources

GCP Support

  • Google Cloud Support Center - Create support cases
  • Google Cloud Forums - Community support
  • Google Cloud Documentation - Official documentation
  • Google Cloud Premium Support - Enhanced support options

Best Practices for Troubleshooting

  1. Start with Logs - Check pod, node, and cluster logs first

  2. Use Describe Commands - kubectl describe provides detailed information

  3. Check Events - Kubernetes events show what’s happening

  4. Verify Prerequisites - Ensure IAM, networking, and resources are correct

  5. Test Incrementally - Test changes one at a time

  6. Document Issues - Keep notes on issues and resolutions

  7. Use Debug Pods - Create debug pods for testing

  8. Check Google Cloud Console - Verify GCP resources are created correctly

  9. Review Best Practices - Follow GKE best practices guides

  10. Stay Updated - Keep cluster and add-ons updated
