AKS Observability
Observability on AKS involves monitoring, logging, and tracing to understand cluster and application behavior. AKS integrates deeply with Azure Monitor, Log Analytics, and Application Insights for comprehensive observability, and supports popular open-source tools like Prometheus and Grafana.
Observability Overview
Observability consists of three pillars:
- Metrics - Numeric measurements of system and application behavior over time
- Logs - Timestamped records of discrete events
- Traces - End-to-end records of individual requests as they flow through services
Azure Monitor for Containers
Azure Monitor for Containers provides automatic collection and visualization of metrics and logs from AKS clusters and applications.
Architecture
When monitoring is enabled, Container Insights deploys a containerized agent as a DaemonSet on every node. The agent collects performance metrics and container logs from the node and forwards them to the configured Log Analytics workspace, where Azure Monitor surfaces them in the Insights views and in log queries.
Enabling Azure Monitor
Using Azure CLI:
# Create Log Analytics workspace
az monitor log-analytics workspace create \
--resource-group myResourceGroup \
--workspace-name myWorkspace
# Get workspace resource ID
WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group myResourceGroup \
--workspace-name myWorkspace \
--query id -o tsv)
# Enable Azure Monitor on cluster
az aks enable-addons \
--resource-group myResourceGroup \
--name myAKSCluster \
--addons monitoring \
--workspace-resource-id $WORKSPACE_ID
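Data can take several minutes to appear after enabling the add-on. A quick sanity check is to confirm the add-on shows as enabled on the cluster; the addon profile key used below (omsagent) is the key current AKS versions use for the monitoring add-on, so treat this as a sketch:
# Confirm the monitoring add-on is enabled (addon profile key assumed to be "omsagent")
az aks show \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --query "addonProfiles.omsagent.enabled"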
Using Azure Portal:
- Go to the AKS cluster → Monitoring → Insights
- Select Configure monitoring
- Choose a Log Analytics workspace
- Select Configure to enable Container Insights
Container Insights Metrics
Azure Monitor automatically collects:
Cluster Metrics:
- CPU utilization
- Memory utilization
- Network I/O
- Storage I/O
Node Metrics:
- Node CPU/memory
- Pod count per node
- Container count per node
Pod Metrics:
- Pod CPU/memory
- Network I/O
- Storage I/O
- Restart count
Namespace Metrics:
- Resource usage per namespace
- Pod count per namespace
Viewing Container Insights
Access Container Insights dashboard:
- Go to Azure Portal → AKS cluster → Insights
- View metrics and logs
- Create custom queries
Available Views:
- Cluster performance
- Node performance
- Pod performance
- Namespace performance
- Workload performance
Log Analytics
Log Analytics provides log aggregation and analysis for AKS clusters.
Log Types
AKS generates various log types:
- Container Logs - stdout and stderr from application containers
- Node Logs - Node-level logs such as kubelet and OS syslog entries
- Cluster Logs - Control plane component logs (kube-apiserver, kube-scheduler, kube-controller-manager), emitted as resource logs
- Audit Logs - API server audit logs; like other control plane logs, these require a diagnostic setting (see the sketch after this list)
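Control plane and audit logs are not collected by Container Insights; they are emitted as resource logs and must be routed to the workspace with a diagnostic setting. A minimal CLI sketch, assuming the workspace resource ID from earlier and the standard AKS log categories kube-apiserver and kube-audit:
# Route API server and audit logs to the Log Analytics workspace
AKS_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster --query id -o tsv)
az monitor diagnostic-settings create \
  --name aks-control-plane-logs \
  --resource $AKS_ID \
  --workspace $WORKSPACE_ID \
  --logs '[{"category":"kube-apiserver","enabled":true},{"category":"kube-audit","enabled":true}]'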
Viewing Logs
Access logs via:
- Azure Portal → Log Analytics workspace → Logs
- Filter by cluster, namespace, or pod
- View and search logs
Log Queries:
// Filter by cluster (KubePodInventory records the cluster name)
KubePodInventory
| where ClusterName == "myAKSCluster"
// Filter container logs by namespace
ContainerLogV2
| where PodNamespace == "default"
// Filter container logs by pod
ContainerLogV2
| where PodName == "my-pod"
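Queries can also be run from the command line. The sketch below assumes the workspace created earlier; az monitor log-analytics query expects the workspace customer ID (a GUID), not the resource ID:
# Get the workspace customer ID (GUID)
WORKSPACE_GUID=$(az monitor log-analytics workspace show \
  --resource-group myResourceGroup \
  --workspace-name myWorkspace \
  --query customerId -o tsv)
# Query recent container logs for the default namespace
az monitor log-analytics query \
  --workspace $WORKSPACE_GUID \
  --analytics-query "ContainerLogV2 | where PodNamespace == 'default' | take 20"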
Structured Logging
Use structured logging for better parsing:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "INFO",
"service": "user-service",
"trace": "1-5f2b3c4d-abc123",
"message": "User created",
"user_id": "12345",
"duration_ms": 45
}
Application Insights
Application Insights provides distributed tracing for applications running on AKS.
Instrumenting Applications
Node.js Example:
const appInsights = require('applicationinsights');
// Start Application Insights (use the connection string from the Azure portal)
appInsights.setup('<connection-string>')
  .setAutoDependencyCorrelation(true)
  .setAutoCollectRequests(true)
  .setAutoCollectPerformance(true)
  .setAutoCollectExceptions(true)
  .start();
const express = require('express');
const app = express();
app.get('/api/users', async (req, res) => {
  const client = appInsights.defaultClient;
  const startTime = Date.now();
  try {
    // Database query (db is an application-specific client, shown for illustration)
    const users = await db.query('SELECT * FROM users');
    // Record the call as a dependency so it appears in the application map
    client.trackDependency({
      name: 'database-query',
      dependencyTypeName: 'SQL',
      data: 'SELECT * FROM users',
      duration: Date.now() - startTime,
      resultCode: 0,
      success: true
    });
    res.json(users);
  } catch (error) {
    client.trackException({ exception: error });
    res.status(500).json({ error: error.message });
  }
});
app.listen(3000);
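On AKS the connection string is normally injected as an environment variable rather than hard-coded; the Node.js SDK also reads APPLICATIONINSIGHTS_CONNECTION_STRING automatically when setup() is called without arguments. A sketch of storing it in a Secret (the secret name is arbitrary) that a Deployment can expose as an environment variable:
# Store the Application Insights connection string in a Kubernetes Secret
kubectl create secret generic appinsights \
  --from-literal=APPLICATIONINSIGHTS_CONNECTION_STRING='<connection-string>'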
Java Example:
import com.microsoft.applicationinsights.TelemetryClient;
import com.microsoft.applicationinsights.telemetry.Duration;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import java.util.List;

@RestController
public class UserController {

    private final TelemetryClient telemetryClient = new TelemetryClient();
    private final UserRepository userRepository; // injected Spring Data repository

    public UserController(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        long startTime = System.currentTimeMillis();
        try {
            List<User> users = userRepository.findAll();
            // Record the database call as a dependency in Application Insights
            telemetryClient.trackDependency(
                "database-query",
                "SELECT * FROM users",
                new Duration(System.currentTimeMillis() - startTime),
                true
            );
            return users;
        } catch (Exception e) {
            telemetryClient.trackException(e);
            throw e;
        }
    }
}
Prometheus and Grafana
Prometheus is a popular open-source monitoring toolkit. Grafana provides visualization and dashboards for Prometheus metrics.
Installing Prometheus
Using Helm:
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
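Once the chart is installed, check that the Prometheus, Alertmanager, and Grafana pods start; pod names are prefixed with the release name (prometheus in the command above):
# All pods in the monitoring namespace should reach Running state
kubectl get pods -n monitoring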
Grafana Dashboards
Access Grafana:
# Get Grafana admin password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
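Grafana is then available at http://localhost:3000 (user admin, with the password retrieved above). The Prometheus UI can be reached the same way; the service name below is what kube-prometheus-stack creates for a release named prometheus, so adjust it if your release name differs:
# Port forward to the Prometheus server UI
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090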
Metrics Collection and Dashboards
Custom Metrics
Expose custom application metrics:
Prometheus Metrics Endpoint:
from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])

# Instrument code
@request_duration.labels(method='GET', endpoint='/api/users').time()
def handle_request():
    request_count.labels(method='GET', endpoint='/api/users').inc()
    # Handle the request here

# Expose the /metrics endpoint for Prometheus (port matches the prometheus.io/port annotation below)
start_http_server(8080)
Prometheus scrape annotations on a Service:
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: my-app    # must match the labels on the application pods
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
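Note that the prometheus.io/* annotations are only honored if Prometheus is configured with an annotation-based scrape job; the kube-prometheus-stack chart discovers targets through ServiceMonitor resources by default, so this Service may additionally need a ServiceMonitor or an extra scrape config. The metrics endpoint itself can be verified directly; a quick sketch assuming the Service above:
# Port forward to the application's metrics port and confirm Prometheus-format output
kubectl port-forward svc/my-app 8080:8080 &
sleep 2
curl http://localhost:8080/metrics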
Alerting
Create alerting rules:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: default
spec:
  groups:
  - name: my-app
    interval: 30s
    rules:
    - alert: HighCPUUsage
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected"
        description: "Container CPU usage has been above 0.8 cores for 5 minutes"
Azure Monitor Alerts:
# Create alert rule
az monitor metrics alert create \
--name high-cpu-usage \
--resource-group myResourceGroup \
--scopes /subscriptions/.../resourceGroups/.../providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
--condition "avg Percentage CPU > 80" \
--window-size 5m \
--evaluation-frequency 1m
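The alert rule above only fires; to notify someone it also needs an action group. A minimal sketch creating an email action group (the group name, short name, and address are placeholders) and attaching it via the alert's --action parameter:
# Create an action group that sends email notifications
az monitor action-group create \
  --resource-group myResourceGroup \
  --name aks-alerts \
  --short-name aksalerts \
  --action email oncall oncall@example.com

# Add "--action aks-alerts" to the "az monitor metrics alert create" command above
# so the alert notifies this group when it fires.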
Best Practices
- Enable Azure Monitor - Automatic metrics and logs collection
- Use Structured Logging - JSON format for better parsing
- Implement Distributed Tracing - Application Insights or Jaeger for request flows
- Set Up Alerts - Proactive monitoring and alerting
- Monitor Costs - Track resource usage and costs
- Retention Policies - Configure appropriate log and metric retention
- Dashboard Organization - Create dashboards for different audiences
- Test Alerting - Verify alerts work correctly
- Document Runbooks - Procedures for common issues
- Use Prometheus for Custom Metrics - Application-specific metrics
Common Issues
Metrics Not Appearing
Problem: Metrics not showing in Azure Monitor
Solutions:
- Verify Azure Monitor add-on is enabled
- Check Log Analytics workspace configuration
- Verify the Container Insights agent is running (see the commands after this list)
- Review Azure Activity Log
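If the agent is the problem, its DaemonSet will show missing or unready pods. The commands below assume the current agent DaemonSet name (ama-logs, formerly omsagent) in kube-system:
# DESIRED and READY should match the number of nodes
kubectl get daemonset ama-logs -n kube-system

# Look for scheduling or image-pull problems on the agent pods
kubectl describe daemonset ama-logs -n kube-system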
High Log Volume
Problem: Too many logs, high costs
Solutions:
- Implement log filtering (see the ConfigMap sketch after this list)
- Reduce log verbosity
- Use log sampling
- Configure log retention
- Use log aggregation efficiently
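Container Insights log collection is tuned through the container-azm-ms-agentconfig ConfigMap in kube-system. The sketch below excludes a noisy namespace from stdout/stderr collection; the key names and TOML layout follow the agent's documented settings schema, but verify against the current schema before applying, since it changes between agent versions:
# Exclude the kube-system namespace from stdout/stderr log collection
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system"]
EOF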
See Also
- Cluster Setup - Initial observability setup
- Add-ons - Installing observability tools
- Troubleshooting - Observability issues