EKS Observability
Observability on EKS involves monitoring, logging, and tracing to understand cluster and application behavior. EKS integrates with AWS services like CloudWatch, and supports popular open-source tools like Prometheus and Grafana for comprehensive observability.
Observability Overview
Observability consists of three pillars:
- Metrics: Quantitative measurements over time (CPU, memory, request rate)
- Logs: Event records with timestamps (application logs, audit logs)
- Traces: Request flows through distributed systems (end-to-end request tracking)
CloudWatch Container Insights
CloudWatch Container Insights provides automatic collection and visualization of metrics and logs from containerized applications and microservices.
Enabling Container Insights
Using AWS Console:
- Go to EKS → Clusters → Your Cluster → Observability
- Enable Container Insights
- Select the control plane log types to send to CloudWatch Logs
Using AWS CLI:
Note: this command enables control plane logging to CloudWatch Logs; Container Insights itself is enabled by installing the CloudWatch agent and Fluent Bit in the cluster (see below).
# Enable control plane logging
aws eks update-cluster-config \
  --name my-cluster \
  --region us-west-2 \
  --logging '{
    "clusterLogging": [
      {
        "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
        "enabled": true
      }
    ]
  }'
Using eksctl:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
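For an existing cluster, eksctl can apply the same logging settings without a config file:
# Enable all control plane log types on an existing cluster
eksctl utils update-cluster-logging \
  --cluster my-cluster \
  --region us-west-2 \
  --enable-types all \
  --approve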
Installing Fluent Bit
Fluent Bit is the log forwarder for Container Insights:
# Create namespace
kubectl create namespace amazon-cloudwatch
# Create ConfigMap with cluster metadata (the manifest below expects the name fluent-bit-cluster-info)
kubectl create configmap fluent-bit-cluster-info \
  --from-literal=cluster.name=my-cluster \
  --from-literal=logs.region=us-west-2 \
  -n amazon-cloudwatch
# Install Fluent Bit
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml
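Before expecting data in CloudWatch, confirm the DaemonSet is healthy:
# One Fluent Bit pod should be running per node
kubectl get daemonset fluent-bit -n amazon-cloudwatch
kubectl get pods -n amazon-cloudwatch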
Container Insights Metrics
Container Insights automatically collects:
Cluster Metrics:
- CPU utilization
- Memory utilization
- Network I/O
- Storage I/O
Node Metrics:
- Node CPU/memory
- Pod count per node
- Container count per node
Pod Metrics:
- Pod CPU/memory
- Network I/O
- Storage I/O
- Restart count
Namespace Metrics:
- Resource usage per namespace
- Pod count per namespace
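These metrics are published to the ContainerInsights CloudWatch namespace; one way to confirm they are flowing (cluster name as configured above):
# List Container Insights metrics reported for the cluster
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster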
Viewing Container Insights
Access Container Insights dashboard:
- Go to CloudWatch → Container Insights
- Select your cluster
- View metrics and logs
Available Views:
- Cluster performance
- Node performance
- Pod performance
- Namespace performance
- Logs explorer
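Container Insights also stores raw performance log events in the /aws/containerinsights/<cluster-name>/performance log group, which CloudWatch Logs Insights can query. A sketch that surfaces the most CPU-hungry pods:
# Run against /aws/containerinsights/my-cluster/performance
filter Type = "Pod"
| stats avg(pod_cpu_utilization) as avg_cpu by PodName
| sort avg_cpu desc
| limit 10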
Prometheus and Grafana
Prometheus is a popular open-source monitoring and alerting toolkit. Grafana provides visualization and dashboards for Prometheus metrics.
Installing Prometheus
Using Helm:
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
Components Installed:
- Prometheus Server
- Grafana
- Alertmanager
- Node Exporter
- kube-state-metrics
- Prometheus Operator
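A quick sanity check that the stack is up (release name prometheus, as installed above):
# All pods should reach Running/Ready
kubectl get pods -n monitoring
# The operator should report the Prometheus and Alertmanager instances
kubectl get prometheus,alertmanager -n monitoring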
Prometheus Configuration
ServiceMonitor for Custom Metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
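A ServiceMonitor selects a Service, not pods, so a Service labeled app: my-app with a port named metrics must exist; a minimal sketch is below. Note also that with kube-prometheus-stack defaults, Prometheus only discovers ServiceMonitors carrying the Helm release label (here release: prometheus) unless prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues is set to false.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
  - name: metrics        # must match the ServiceMonitor's port name
    port: 8080
    targetPort: 8080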
PrometheusRule for Alerts:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: default
spec:
  groups:
  - name: my-app
    interval: 30s
    rules:
    - alert: HighCPUUsage
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected"
        description: "CPU usage is above 80% for 5 minutes"
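As written, the alert fires once per container time series, including pause/infra containers. A commonly used variant aggregates per pod and drops empty container labels:
# Aggregate per pod, excluding empty container labels
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
) > 0.8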
Grafana Dashboards
Access Grafana:
# Get Grafana admin password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Import Dashboards:
Popular dashboards (grafana.com IDs in parentheses):
- Kubernetes Cluster Monitoring (3119)
- Node Exporter Full (1860)
- Kubernetes Deployment Statefulset Daemonset metrics (8588)
Custom Dashboard:
{
  "dashboard": {
    "title": "My App Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}}"
          }
        ]
      }
    ]
  }
}
AWS X-Ray for Distributed Tracing
X-Ray helps analyze and debug distributed applications by providing request tracing across services.
Installing X-Ray Daemon
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xray-daemon
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: xray-daemon
  template:
    metadata:
      labels:
        app: xray-daemon
    spec:
      containers:
      - name: xray-daemon
        image: amazon/aws-xray-daemon:latest
        ports:
        - containerPort: 2000
          protocol: UDP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
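Instrumented pods reach the daemon over UDP port 2000. A common pattern (a sketch; names are assumptions) is to expose the DaemonSet with a Service and point the SDK at it via the standard AWS_XRAY_DAEMON_ADDRESS environment variable:
apiVersion: v1
kind: Service
metadata:
  name: xray-daemon
  namespace: kube-system
spec:
  selector:
    app: xray-daemon
  ports:
  - port: 2000
    protocol: UDP
---
# In the application pod spec, point the X-Ray SDK at the Service:
# env:
# - name: AWS_XRAY_DAEMON_ADDRESS
#   value: xray-daemon.kube-system:2000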
Instrumenting Applications
Node.js Example:
// The express middleware ships in the full aws-xray-sdk package
const AWSXRay = require('aws-xray-sdk');
const express = require('express');
const app = express();

// Open a segment for every incoming request
app.use(AWSXRay.express.openSegment('my-app'));

app.get('/api/users', async (req, res) => {
  const segment = AWSXRay.getSegment();
  // Wrap the database call in a subsegment for per-dependency timing
  const subsegment = segment.addNewSubsegment('database-query');
  try {
    const users = await db.query('SELECT * FROM users'); // db: your database client
    subsegment.close();
    res.json(users);
  } catch (error) {
    subsegment.addError(error);
    subsegment.close();
    res.status(500).json({ error: error.message });
  }
});

// Close the segment after all routes
app.use(AWSXRay.express.closeSegment());
app.listen(3000);
Java Example:
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

@RestController
public class UserController {

    private final UserRepository userRepository; // injected by Spring

    public UserController(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        Subsegment subsegment = AWSXRay.beginSubsegment("database-query");
        try {
            return userRepository.findAll();
        } catch (Exception e) {
            subsegment.addException(e);
            throw e;
        } finally {
            // Always end the subsegment, even on failure
            AWSXRay.endSubsegment();
        }
    }
}
Logging with Fluent Bit and Fluentd
Fluent Bit Configuration
Fluent Bit is lightweight and efficient for log forwarding:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             1
        Log_Level         info
        Daemon            off
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            us-west-2
        log_group_name    /aws/eks/my-cluster
        log_stream_prefix fluent-bit-
        auto_create_group On
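Once pods start shipping logs, the configured log group should appear:
# Confirm Fluent Bit created the log group
aws logs describe-log-groups --log-group-name-prefix /aws/eks/my-cluster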
Application Logging Best Practices
Structured Logging:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "service": "user-service",
  "trace_id": "1-5f2b3c4d-abc123",
  "message": "User created",
  "user_id": "12345",
  "duration_ms": 45
}
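One way to emit records in this shape (a minimal sketch using only the Python standard library; the service name and fields match the example above):
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON for downstream parsing."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "user-service",
            "message": record.getMessage(),
        }
        # Merge structured context passed via the `extra` argument
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("User created", extra={"context": {"user_id": "12345", "duration_ms": 45}})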
Log Levels:
- ERROR - Errors that need attention
- WARN - Warnings that might need attention
- INFO - Informational messages
- DEBUG - Debug information (disable in production)
Log Rotation:
Configure log rotation so container logs do not fill node disks:
apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
data:
  logrotate.conf: |
    /var/log/containers/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
        create 0644 root root
    }
Metrics Collection and Dashboards
Custom Metrics
Expose custom application metrics:
Prometheus Metrics Endpoint:
from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])

# Instrument code: .time() records the call's duration in the histogram
@request_duration.labels(method='GET', endpoint='/api/users').time()
def handle_request():
    request_count.labels(method='GET', endpoint='/api/users').inc()
    # ... handle the request ...

# Expose metrics for Prometheus to scrape on :8080/metrics
start_http_server(8080)
Prometheus scrape annotations (honored only by scrape configs that look for them; the Prometheus Operator used by kube-prometheus-stack relies on ServiceMonitors instead):
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: my-app
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
Alerting Strategies
Prometheus Alertmanager:
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
      - match:
          severity: critical
        receiver: 'slack-critical'
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'slack-critical'
      slack_configs:
      - channel: '#alerts-critical'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
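Alertmanager configuration errors are easy to introduce; amtool, which ships with Alertmanager, can validate the file before it is mounted:
# Validate routing and receiver configuration
amtool check-config alertmanager.yml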
CloudWatch Alarms:
# Create a CloudWatch alarm on Container Insights node CPU
# (EKS does not publish CPUUtilization itself; Container Insights
# publishes node_cpu_utilization in the ContainerInsights namespace)
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-usage \
  --alarm-description "Alert when node CPU usage is high" \
  --metric-name node_cpu_utilization \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:alerts
Cost Monitoring and Optimization
CloudWatch Cost Insights
Monitor EKS costs:
# Get cost and usage report
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
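To isolate EKS charges, add a SERVICE filter. The exact service name in Cost Explorer can vary, so it is safest to discover it first (a sketch):
# Find the exact service name as Cost Explorer reports it
aws ce get-dimension-values \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --dimension SERVICE \
  --search-string "Kubernetes"
# Then filter the cost report by that value
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --filter '{"Dimensions":{"Key":"SERVICE","Values":["<service-name-from-above>"]}}'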
Resource Tagging
Tag resources for cost allocation:
# Tag cluster
aws eks tag-resource \
  --resource-arn arn:aws:eks:us-west-2:123456789012:cluster/my-cluster \
  --tags Environment=Production,Team=Platform,CostCenter=Engineering
# Add Kubernetes labels to a node group (note: these are node labels, not AWS
# cost-allocation tags; to tag the node group resource itself, use
# aws eks tag-resource with the node group's ARN)
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --labels addOrUpdateLabels={Environment=Production,Team=Platform}
Cost Optimization Metrics
Monitor these metrics (a PromQL sketch follows the list):
- Cluster Utilization - CPU and memory usage
- Node Utilization - Per-node resource usage
- Pod Density - Pods per node
- Idle Resources - Unused capacity
- Spot Instance Usage - Spot vs on-demand ratio
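With kube-prometheus-stack installed, several of these can be derived from kube-state-metrics series. A sketch comparing requested CPU to allocatable capacity (idle capacity is the remainder):
# Fraction of allocatable CPU currently requested by pods
sum(kube_pod_container_resource_requests{resource="cpu"})
/
sum(kube_node_status_allocatable{resource="cpu"})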
Performance Analysis Tools
kubectl top
Quick resource usage overview:
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods
# Pod resource usage by namespace
kubectl top pods --namespace=default
cAdvisor
Container resource usage:
# Access cAdvisor on node
kubectl proxy --port=8001
# View node metrics
curl http://localhost:8001/api/v1/nodes/<node-name>/proxy/metrics/cadvisor
kube-state-metrics
Kubernetes object metrics:
# Install kube-state-metrics (the standard example also requires the ClusterRole and binding)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/service-account.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/cluster-role.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/cluster-role-binding.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/deployment.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/service.yaml
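The standard manifests deploy into kube-system; a quick way to confirm metrics are served (run the curl in a second shell while the port-forward is active):
kubectl get deployment kube-state-metrics -n kube-system
# Expose the metrics port locally, then inspect it from another shell
kubectl port-forward -n kube-system svc/kube-state-metrics 8080:8080
curl -s http://localhost:8080/metrics | head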
Best Practices
- Enable Container Insights - automatic metrics and logs collection
- Use Prometheus for Custom Metrics - application-specific metrics
- Implement Distributed Tracing - X-Ray or Jaeger for request flows
- Structured Logging - JSON format for better parsing
- Set Up Alerts - proactive monitoring and alerting
- Monitor Costs - track resource usage and costs
- Retention Policies - configure appropriate log and metric retention
- Dashboard Organization - create dashboards for different audiences
- Test Alerting - verify alerts work correctly
- Document Runbooks - procedures for common issues
Common Issues
Metrics Not Appearing
Problem: Metrics not showing in CloudWatch or Prometheus
Solutions:
- Verify Fluent Bit is running
- Check IAM permissions
- Verify service account configuration
- Check network connectivity
- Review pod logs
High Log Volume
Problem: Too many logs, high costs
Solutions:
- Implement log filtering
- Reduce log verbosity
- Use log sampling
- Configure log retention
- Use log aggregation efficiently
Prometheus Storage Full
Problem: Prometheus running out of storage
Solutions:
- Increase PVC size
- Reduce retention period
- Configure data compression
- Use remote storage (Thanos, Cortex)
- Implement metric downsampling
See Also
- Cluster Setup - Initial observability setup
- Add-ons - Installing observability tools
- Troubleshooting - Observability issues