GKE Observability
Observability on GKE involves monitoring, logging, and tracing to understand cluster and application behavior. GKE integrates deeply with Google Cloud Operations (formerly Stackdriver) for comprehensive observability, and supports popular open-source tools like Prometheus and Grafana.
Observability Overview
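Observability consists of three pillars:
- Metrics - Numeric measurements over time (Cloud Monitoring, Prometheus)
- Logs - Timestamped records of events (Cloud Logging)
- Traces - Request flows across services (Cloud Trace)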
Google Cloud Operations
Cloud Operations provides automatic collection and visualization of metrics and logs from GKE clusters and applications.
Architecture
Enabling Cloud Operations
Using gcloud CLI:
# Enable monitoring
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --monitoring=SYSTEM,WORKLOAD
# Enable logging
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --logging=SYSTEM,WORKLOAD
Monitoring Types:
- SYSTEM - System metrics (nodes, pods)
- WORKLOAD - Application metrics
- API_SERVER - API server metrics
- CONTROLLER_MANAGER - Controller manager metrics
- SCHEDULER - Scheduler metrics
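As a rough sketch of how these component names map onto the command line, the Python helper below validates a component list and assembles the arguments for one `gcloud container clusters update` call. The `build_update_command` name is hypothetical, and the allowed values are an assumed subset spelled the way gcloud expects them (for example `API_SERVER`), not an exhaustive list.

```python
import shlex

# Assumed subset of accepted components for each flag (not exhaustive).
ALLOWED = {
    "monitoring": {"SYSTEM", "WORKLOAD", "API_SERVER", "CONTROLLER_MANAGER", "SCHEDULER"},
    "logging": {"SYSTEM", "WORKLOAD"},
}


def build_update_command(cluster, zone, kind, components):
    """Return the argv list for one `gcloud container clusters update` call.

    `kind` is "monitoring" or "logging"; `components` is a list of names.
    """
    if kind not in ALLOWED:
        raise ValueError(f"unknown flag kind: {kind}")
    unknown = [name for name in components if name not in ALLOWED[kind]]
    if unknown:
        raise ValueError(f"unknown {kind} components: {unknown}")
    return [
        "gcloud", "container", "clusters", "update", cluster,
        "--zone", zone,
        f"--{kind}=" + ",".join(components),
    ]


if __name__ == "__main__":
    argv = build_update_command("my-cluster", "us-central1-a", "monitoring", ["SYSTEM", "API_SERVER"])
    # Print a copy-pasteable shell command rather than running it.
    print(shlex.join(argv))
```

Validating names before shelling out catches misspellings such as `APISERVER` locally instead of waiting for the CLI to reject them.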
Cloud Operations Metrics
Cloud Operations automatically collects:
Cluster Metrics:
- CPU utilization
- Memory utilization
- Network I/O
- Storage I/O
Node Metrics:
- Node CPU/memory
- Pod count per node
- Container count per node
Pod Metrics:
- Pod CPU/memory
- Network I/O
- Storage I/O
- Restart count
Namespace Metrics:
- Resource usage per namespace
- Pod count per namespace
Viewing Cloud Operations
Access Cloud Operations dashboards:
- Go to Google Cloud Console → Monitoring → Dashboards
- Select GKE cluster dashboard
- View metrics and logs
Available Views:
- Cluster performance
- Node performance
- Pod performance
- Namespace performance
- Workload performance
Cloud Monitoring
Cloud Monitoring provides metrics and dashboards for GKE clusters.
GKE Metrics
GKE automatically exposes metrics:
- Cluster Metrics - Cluster-level resource usage
- Node Metrics - Node-level metrics
- Pod Metrics - Pod-level metrics
- Container Metrics - Container-level metrics
- Workload Metrics - Deployment/StatefulSet metrics
Custom Metrics
Expose custom application metrics:
Prometheus Metrics Endpoint:
from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])

# Instrument code: the decorator records how long each call takes
@request_duration.labels(method='GET', endpoint='/api/users').time()
def handle_request():
    request_count.labels(method='GET', endpoint='/api/users').inc()
    # Handle request

# Serve the metrics over HTTP on port 8080 (the port the scraper is pointed at)
start_http_server(8080)
Kubernetes Metrics:
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
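To make the scrape target concrete, here is a dependency-free sketch of the text a `/metrics` endpoint returns for a labelled counter. It is not how `prometheus_client` is implemented; the `render_counter` helper is hypothetical and only covers label values that need no escaping.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Doubled braces emit literal { and } around the label set.
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    samples = {(("method", "GET"), ("endpoint", "/api/users")): 3}
    print(render_counter("http_requests_total", "Total HTTP requests", samples), end="")
```

A scraper that honors the annotations above would fetch this text from port 8080 at `/metrics` and store one time series per distinct label set.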
Alerting
Create alerting policies:
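# Create alerting policy
# The metric in --condition-filter is an example; substitute the metric you want to alert on
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="High CPU Usage" \
  --combiner=OR \
  --condition-display-name="CPU usage > 80%" \
  --condition-filter='resource.type = "k8s_container" AND metric.type = "kubernetes.io/container/cpu/limit_utilization"' \
  --if="> 0.8" \
  --duration=300s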
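An alerting policy can also be written as a JSON document and supplied with `gcloud alpha monitoring policies create --policy-from-file`. The sketch below builds such a document in Python. The field names are intended to follow the Cloud Monitoring AlertPolicy resource, while the metric filter, the `build_cpu_policy` helper, and the channel name in the usage example are assumptions chosen for illustration.

```python
import json


def build_cpu_policy(channel_name, threshold=0.8, duration_s=300):
    """Return an AlertPolicy-shaped dict for a container CPU threshold alert.

    `channel_name` is the full notification channel resource name.
    """
    return {
        "displayName": "High CPU Usage",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"CPU usage > {threshold:.0%}",
                "conditionThreshold": {
                    # Assumed example metric: CPU usage relative to the container limit.
                    "filter": (
                        'resource.type = "k8s_container" AND '
                        'metric.type = "kubernetes.io/container/cpu/limit_utilization"'
                    ),
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": threshold,
                    "duration": f"{duration_s}s",
                },
            }
        ],
        "notificationChannels": [channel_name],
    }


if __name__ == "__main__":
    policy = build_cpu_policy("projects/my-project/notificationChannels/CHANNEL_ID")
    print(json.dumps(policy, indent=2))
```

Keeping the policy in a file makes it reviewable and versionable alongside the rest of the cluster configuration.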
Cloud Logging
Cloud Logging provides log aggregation and analysis for GKE clusters.
Log Types
GKE generates various log types:
- Container Logs - Application container logs
- Node Logs - Node-level logs
- Cluster Logs - Cluster-level logs
- Audit Logs - API server audit logs
Viewing Logs
Access logs via:
- Google Cloud Console → Logging → Logs Explorer
- Filter by cluster, namespace, or pod
- View and search logs
Log Filters:
# Filter logs by cluster
resource.type="gke_cluster"
resource.labels.cluster_name="my-cluster"
# Filter logs by namespace
resource.type="k8s_container"
resource.labels.namespace_name="default"
# Filter logs by pod
resource.type="k8s_container"
resource.labels.pod_name="my-pod"
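When filters like these are generated by scripts, a small builder keeps the quoting consistent. The following is a minimal sketch for container logs; the `build_log_filter` helper is hypothetical, and it joins clauses with an explicit `AND`.

```python
def build_log_filter(cluster=None, namespace=None, pod=None):
    """Build a Logs Explorer filter for container logs from optional labels."""
    clauses = ['resource.type="k8s_container"']
    labels = {"cluster_name": cluster, "namespace_name": namespace, "pod_name": pod}
    for label, value in labels.items():
        if value is not None:
            # Escape backslashes and double quotes so the value stays one string literal.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            clauses.append(f'resource.labels.{label}="{escaped}"')
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_log_filter(namespace="default", pod="my-pod"))
```

The resulting string can be pasted into Logs Explorer or passed as the filter argument to `gcloud logging read`.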
Structured Logging
Use structured logging for better parsing:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "severity": "INFO",
  "service": "user-service",
  "trace": "1-5f2b3c4d-abc123",
  "message": "User created",
  "user_id": "12345",
  "duration_ms": 45
}
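One way to produce entries of this shape is a custom formatter for Python's standard `logging` module that writes one JSON object per line to stdout, where the cluster's logging agent can pick it up. This is a sketch under assumptions: the `JsonFormatter` class, the `fields` convention for extra data, and the hard-coded service name are all illustrative.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "severity": record.levelname,
            "service": "user-service",  # hypothetical service name
            "message": record.getMessage(),
        }
        # Extra fields passed as extra={"fields": {...}} are merged into the entry.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User created", extra={"fields": {"user_id": "12345", "duration_ms": 45}})
```

Python's level names (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`) are also valid Cloud Logging severities, so `levelname` is used for `severity` unchanged.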
Cloud Trace
Cloud Trace provides distributed tracing for applications running on GKE.
Instrumenting Applications
Node.js Example:
// Start the trace agent before requiring other modules so it can instrument them
const tracer = require('@google-cloud/trace-agent').start();

const express = require('express');
const app = express();

app.get('/api/users', async (req, res) => {
  // Child of the root span the agent creates for the incoming request
  const span = tracer.createChildSpan({name: 'database-query'});
  try {
    // Database query (db is assumed to be an initialized database client)
    const users = await db.query('SELECT * FROM users');
    span.endSpan();
    res.json(users);
  } catch (error) {
    span.addLabel('error', error.message);
    span.endSpan();
    res.status(500).json({error: error.message});
  }
});
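Java Example (OpenTelemetry API):
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import java.util.List;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class UserController {
    // Tracer from the globally registered OpenTelemetry SDK. Spans reach Cloud Trace
    // only if a Cloud Trace exporter is configured separately (assumed here).
    private final Tracer tracer = GlobalOpenTelemetry.getTracer("user-service");
    // User and UserRepository are application types assumed to exist.
    private final UserRepository userRepository;

    public UserController(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        Span span = tracer.spanBuilder("database-query").startSpan();
        try {
            return userRepository.findAll();
        } catch (RuntimeException e) {
            // Mark the span as failed and attach the exception before rethrowing
            span.setStatus(StatusCode.ERROR, e.getMessage());
            span.recordException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}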
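Distributed tracing depends on each service passing the trace identifier along with the request. As a small illustration of that idea (not of either library's internals), the Python sketch below parses a W3C `traceparent` header and derives a `logging.googleapis.com/trace` value, the structured-log field Cloud Logging uses to associate an entry with a trace. The project ID `my-project` and both helper names are hypothetical.

```python
import re

# version - 32 hex trace id - 16 hex parent span id - 2 hex flags
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Parse a W3C traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    _version, trace_id, span_id, flags = match.groups()
    # All-zero identifiers are defined as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def log_trace_field(project_id, trace_id):
    """Return the value for the logging.googleapis.com/trace log field."""
    return f"projects/{project_id}/traces/{trace_id}"


if __name__ == "__main__":
    ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
    print(log_trace_field("my-project", ctx["trace_id"]))
```

Writing this value into each structured log entry lets the trace view and the matching log lines be opened side by side.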
Prometheus and Grafana
Prometheus is a popular open-source monitoring toolkit. Grafana provides visualization and dashboards for Prometheus metrics.
Installing Prometheus
Using Helm:
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Grafana Dashboards
Access Grafana:
# Get Grafana admin password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Best Practices
- Enable Cloud Operations - Automatic metrics and logs collection
- Use Structured Logging - JSON format for better parsing
- Implement Distributed Tracing - Cloud Trace or Jaeger for request flows
- Set Up Alerts - Proactive monitoring and alerting
- Monitor Costs - Track resource usage and costs
- Retention Policies - Configure appropriate log and metric retention
- Dashboard Organization - Create dashboards for different audiences
- Test Alerting - Verify alerts work correctly
- Document Runbooks - Procedures for common issues
- Use Prometheus for Custom Metrics - Application-specific metrics
Common Issues
Metrics Not Appearing
Problem: Metrics not showing in Cloud Monitoring
Solutions:
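- Verify Cloud Operations is enabled on the cluster
- Check that the node service account can write metrics (for example, the roles/monitoring.metricWriter role)
- Verify the metrics agent pods (gke-metrics-agent in the kube-system namespace) are running
- Review Cloud Logging for errors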
High Log Volume
Problem: Too many logs, high costs
Solutions:
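- Implement log filtering (for example, exclusion filters on the log sink to drop low-value entries before ingestion)
- Reduce log verbosity (for example, raise the application log level from DEBUG to INFO)
- Use log sampling for high-volume, repetitive messages
- Configure log retention to match how long the logs are actually needed
- Aggregate repetitive events into a single summary entry instead of logging each one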
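Log sampling can be done inside the application before anything is written. The sketch below is one possible approach using Python's standard `logging` module: a filter that passes every record at WARNING or above and one in every `rate` records below that. The `SamplingFilter` class and the chosen rate are illustrative assumptions.

```python
import logging


class SamplingFilter(logging.Filter):
    """Keep all WARNING+ records and one in `rate` lower-severity records."""

    def __init__(self, rate):
        super().__init__()
        if rate < 1:
            raise ValueError("rate must be at least 1")
        self.rate = rate
        self._seen = 0

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        self._seen += 1
        # Keep the 1st, (rate+1)th, (2*rate+1)th, ... low-severity record.
        return (self._seen - 1) % self.rate == 0


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("sampled")
    logger.addFilter(SamplingFilter(rate=10))
    for i in range(25):
        logger.info("request handled %d", i)  # only some of these are emitted
    logger.warning("disk almost full")  # always emitted
```

Counter-based sampling is deterministic and easy to test; random sampling is a common alternative when several threads or processes log at once.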
See Also
- Cluster Setup - Initial observability setup
- Add-ons - Installing observability tools
- Troubleshooting - Observability issues