EKS Observability
Observability on EKS involves monitoring, logging, and tracing to understand cluster and application behavior. EKS integrates with AWS services like CloudWatch, and supports popular open-source tools like Prometheus and Grafana for comprehensive observability.
Observability Overview
Observability consists of three pillars:
- Metrics: Quantitative measurements over time (CPU, memory, request rate)
- Logs: Event records with timestamps (application logs, audit logs)
- Traces: Request flows through distributed systems (end-to-end request tracking)
CloudWatch Container Insights
CloudWatch Container Insights provides automatic collection and visualization of metrics and logs from containerized applications and microservices.
Enabling Container Insights
Using AWS Console:
- Go to EKS → Clusters → Your Cluster → Observability
- Enable Container Insights
- Select the control plane log types to send to CloudWatch Logs
Using AWS CLI:
Note: this command enables control plane logging to CloudWatch Logs; Container Insights itself is enabled by installing the CloudWatch agent and Fluent Bit in the cluster (see below).
# Enable control plane logging
aws eks update-cluster-config \
  --name my-cluster \
  --region us-west-2 \
  --logging '{
    "clusterLogging": [
      {
        "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
        "enabled": true
      }
    ]
  }'
Using eksctl:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
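For an existing cluster, eksctl can apply the same logging settings without a config file:
# Enable all control plane log types on an existing cluster
eksctl utils update-cluster-logging \
  --cluster my-cluster \
  --region us-west-2 \
  --enable-types all \
  --approve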
Installing Fluent Bit
Fluent Bit is the log forwarder for Container Insights:
# Create namespace
kubectl create namespace amazon-cloudwatch
# Create ConfigMap with cluster metadata (the manifest below expects the name fluent-bit-cluster-info)
kubectl create configmap fluent-bit-cluster-info \
  --from-literal=cluster.name=my-cluster \
  --from-literal=logs.region=us-west-2 \
  -n amazon-cloudwatch
# Install Fluent Bit
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml
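Before expecting data in CloudWatch, confirm the DaemonSet is healthy:
# One Fluent Bit pod should be running per node
kubectl get daemonset fluent-bit -n amazon-cloudwatch
kubectl get pods -n amazon-cloudwatch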
Container Insights Metrics
Container Insights automatically collects:
Cluster Metrics:
- CPU utilization
- Memory utilization
- Network I/O
- Storage I/O
Node Metrics:
- Node CPU/memory
- Pod count per node
- Container count per node
Pod Metrics:
- Pod CPU/memory
- Network I/O
- Storage I/O
- Restart count
Namespace Metrics:
- Resource usage per namespace
- Pod count per namespace
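These metrics are published to the ContainerInsights CloudWatch namespace; one way to confirm they are flowing (cluster name as configured above):
# List Container Insights metrics reported for the cluster
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster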
Viewing Container Insights
Access Container Insights dashboard:
- Go to CloudWatch → Container Insights
- Select your cluster
- View metrics and logs
Available Views:
- Cluster performance
- Node performance
- Pod performance
- Namespace performance
- Logs explorer
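Container Insights also stores raw performance log events in the /aws/containerinsights/<cluster-name>/performance log group, which CloudWatch Logs Insights can query. A sketch that surfaces the most CPU-hungry pods:
# Run against /aws/containerinsights/my-cluster/performance
filter Type = "Pod"
| stats avg(pod_cpu_utilization) as avg_cpu by PodName
| sort avg_cpu desc
| limit 10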
Prometheus and Grafana
Prometheus is a popular open-source monitoring and alerting toolkit. Grafana provides visualization and dashboards for Prometheus metrics.
Installing Prometheus
Using Helm:
# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
Components Installed:
- Prometheus Server
- Grafana
- Alertmanager
- Node Exporter
- kube-state-metrics
- Prometheus Operator
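A quick sanity check that the stack is up (release name prometheus, as installed above):
# All pods should reach Running/Ready
kubectl get pods -n monitoring
# The operator should report the Prometheus and Alertmanager instances
kubectl get prometheus,alertmanager -n monitoring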
Prometheus Configuration
ServiceMonitor for Custom Metrics:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
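A ServiceMonitor selects a Service, not pods, so a Service labeled app: my-app with a port named metrics must exist; a minimal sketch is below. Note also that with kube-prometheus-stack defaults, Prometheus only discovers ServiceMonitors carrying the Helm release label (here release: prometheus) unless prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues is set to false.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app          # matched by the ServiceMonitor's selector
spec:
  selector:
    app: my-app
  ports:
  - name: metrics        # must match the ServiceMonitor's port name
    port: 8080
    targetPort: 8080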
PrometheusRule for Alerts:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: default
spec:
  groups:
  - name: my-app
    interval: 30s
    rules:
    - alert: HighCPUUsage
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected"
        description: "CPU usage is above 80% for 5 minutes"
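As written, the alert fires once per container time series, including pause/infra containers. A commonly used variant aggregates per pod and drops empty container labels:
# Aggregate per pod, excluding empty container labels
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
) > 0.8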
Grafana Dashboards
Access Grafana:
# Get Grafana admin password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d
# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Import Dashboards:
Popular dashboards (grafana.com IDs in parentheses):
- Kubernetes Cluster Monitoring (3119)
- Node Exporter Full (1860)
- Kubernetes Deployment Statefulset Daemonset metrics (8588)
Custom Dashboard:
{
  "dashboard": {
    "title": "My App Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}}"
          }
        ]
      }
    ]
  }
}
AWS X-Ray for Distributed Tracing
X-Ray helps analyze and debug distributed applications by providing request tracing across services.
Installing X-Ray Daemon
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xray-daemon
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: xray-daemon
  template:
    metadata:
      labels:
        app: xray-daemon
    spec:
      containers:
      - name: xray-daemon
        image: amazon/aws-xray-daemon:latest
        ports:
        - containerPort: 2000
          protocol: UDP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
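Instrumented pods reach the daemon over UDP port 2000. A common pattern (a sketch; names are assumptions) is to expose the DaemonSet with a Service and point the SDK at it via the standard AWS_XRAY_DAEMON_ADDRESS environment variable:
apiVersion: v1
kind: Service
metadata:
  name: xray-daemon
  namespace: kube-system
spec:
  selector:
    app: xray-daemon
  ports:
  - port: 2000
    protocol: UDP
---
# In the application pod spec, point the X-Ray SDK at the Service:
# env:
# - name: AWS_XRAY_DAEMON_ADDRESS
#   value: xray-daemon.kube-system:2000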
Instrumenting Applications
Node.js Example:
// The express middleware ships in the full aws-xray-sdk package
const AWSXRay = require('aws-xray-sdk');
const express = require('express');
const app = express();

// Open a segment for every incoming request
app.use(AWSXRay.express.openSegment('my-app'));

app.get('/api/users', async (req, res) => {
  const segment = AWSXRay.getSegment();
  // Wrap the database call in a subsegment for per-dependency timing
  const subsegment = segment.addNewSubsegment('database-query');
  try {
    const users = await db.query('SELECT * FROM users'); // db: your database client
    subsegment.close();
    res.json(users);
  } catch (error) {
    subsegment.addError(error);
    subsegment.close();
    res.status(500).json({ error: error.message });
  }
});

// Close the segment after all routes
app.use(AWSXRay.express.closeSegment());
app.listen(3000);
Java Example:
import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

@RestController
public class UserController {

    private final UserRepository userRepository; // injected by Spring

    public UserController(UserRepository userRepository) {
        this.userRepository = userRepository;
    }

    @GetMapping("/api/users")
    public List<User> getUsers() {
        Subsegment subsegment = AWSXRay.beginSubsegment("database-query");
        try {
            return userRepository.findAll();
        } catch (Exception e) {
            subsegment.addException(e);
            throw e;
        } finally {
            // Always end the subsegment, even on failure
            AWSXRay.endSubsegment();
        }
    }
}
Logging with Fluent Bit and Fluentd
Fluent Bit Configuration
Fluent Bit is lightweight and efficient for log forwarding:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush             1
        Log_Level         info
        Daemon            off
        Parsers_File      parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On

    [OUTPUT]
        Name              cloudwatch_logs
        Match             kube.*
        region            us-west-2
        log_group_name    /aws/eks/my-cluster
        log_stream_prefix fluent-bit-
        auto_create_group On
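Once pods start shipping logs, the configured log group should appear:
# Confirm Fluent Bit created the log group
aws logs describe-log-groups --log-group-name-prefix /aws/eks/my-cluster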
Application Logging Best Practices
Structured Logging:
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "service": "user-service",
  "trace_id": "1-5f2b3c4d-abc123",
  "message": "User created",
  "user_id": "12345",
  "duration_ms": 45
}
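One way to emit records in this shape (a minimal sketch using only the Python standard library; the service name and fields match the example above):
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON for downstream parsing."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "user-service",
            "message": record.getMessage(),
        }
        # Merge structured context passed via the `extra` argument
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("User created", extra={"context": {"user_id": "12345", "duration_ms": 45}})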
Log Levels:
- ERROR - Errors that need attention
- WARN - Warnings that might need attention
- INFO - Informational messages
- DEBUG - Debug information (disable in production)
Log Rotation:
Configure log rotation so container logs do not fill node disks:
apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
data:
  logrotate.conf: |
    /var/log/containers/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
        create 0644 root root
    }
Metrics Collection and Dashboards
Custom Metrics
Expose custom application metrics:
Prometheus Metrics Endpoint:
from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])

# Instrument code: .time() records the call's duration in the histogram
@request_duration.labels(method='GET', endpoint='/api/users').time()
def handle_request():
    request_count.labels(method='GET', endpoint='/api/users').inc()
    # ... handle the request ...

# Expose metrics for Prometheus to scrape on :8080/metrics
start_http_server(8080)
Prometheus scrape annotations (honored only by scrape configs that look for them; the Prometheus Operator used by kube-prometheus-stack relies on ServiceMonitors instead):
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: my-app
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
Alerting Strategies
Prometheus Alertmanager:
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
      - match:
          severity: critical
        receiver: 'slack-critical'
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    - name: 'slack-critical'
      slack_configs:
      - channel: '#alerts-critical'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
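Alertmanager configuration errors are easy to introduce; amtool, which ships with Alertmanager, can validate the file before it is mounted:
# Validate routing and receiver configuration
amtool check-config alertmanager.yml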
CloudWatch Alarms:
# Create a CloudWatch alarm on Container Insights node CPU
# (EKS does not publish CPUUtilization itself; Container Insights
# publishes node_cpu_utilization in the ContainerInsights namespace)
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu-usage \
  --alarm-description "Alert when node CPU usage is high" \
  --metric-name node_cpu_utilization \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:alerts
Cost Monitoring and Optimization
CloudWatch Cost Insights
Monitor EKS costs:
# Get cost and usage report
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
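To isolate EKS charges, add a SERVICE filter. The exact service name in Cost Explorer can vary, so it is safest to discover it first (a sketch):
# Find the exact service name as Cost Explorer reports it
aws ce get-dimension-values \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --dimension SERVICE \
  --search-string "Kubernetes"
# Then filter the cost report by that value
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --filter '{"Dimensions":{"Key":"SERVICE","Values":["<service-name-from-above>"]}}'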
Resource Tagging
Tag resources for cost allocation:
# Tag cluster
aws eks tag-resource \
  --resource-arn arn:aws:eks:us-west-2:123456789012:cluster/my-cluster \
  --tags Environment=Production,Team=Platform,CostCenter=Engineering
# Add Kubernetes labels to a node group (note: these are node labels, not AWS
# cost-allocation tags; to tag the node group resource itself, use
# aws eks tag-resource with the node group's ARN)
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --labels addOrUpdateLabels={Environment=Production,Team=Platform}
Cost Optimization Metrics
Monitor these metrics (a PromQL sketch follows the list):
- Cluster Utilization - CPU and memory usage
- Node Utilization - Per-node resource usage
- Pod Density - Pods per node
- Idle Resources - Unused capacity
- Spot Instance Usage - Spot vs on-demand ratio
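With kube-prometheus-stack installed, several of these can be derived from kube-state-metrics series. A sketch comparing requested CPU to allocatable capacity (idle capacity is the remainder):
# Fraction of allocatable CPU currently requested by pods
sum(kube_pod_container_resource_requests{resource="cpu"})
/
sum(kube_node_status_allocatable{resource="cpu"})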
Performance Analysis Tools
kubectl top
Quick resource usage overview:
# Node resource usage
kubectl top nodes
# Pod resource usage
kubectl top pods
# Pod resource usage by namespace
kubectl top pods --namespace=default
cAdvisor
Container resource usage:
# Access cAdvisor on node
kubectl proxy --port=8001
# View node metrics
curl http://localhost:8001/api/v1/nodes/<node-name>/proxy/metrics/cadvisor
kube-state-metrics
Kubernetes object metrics:
# Install kube-state-metrics (the standard example also requires the ClusterRole and binding)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/service-account.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/cluster-role.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/cluster-role-binding.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/deployment.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/service.yaml
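The standard manifests deploy into kube-system; a quick way to confirm metrics are served (run the curl in a second shell while the port-forward is active):
kubectl get deployment kube-state-metrics -n kube-system
# Expose the metrics port locally, then inspect it from another shell
kubectl port-forward -n kube-system svc/kube-state-metrics 8080:8080
curl -s http://localhost:8080/metrics | head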
Best Practices
- Enable Container Insights - automatic metrics and logs collection
- Use Prometheus for Custom Metrics - application-specific metrics
- Implement Distributed Tracing - X-Ray or Jaeger for request flows
- Structured Logging - JSON format for better parsing
- Set Up Alerts - proactive monitoring and alerting
- Monitor Costs - track resource usage and costs
- Retention Policies - configure appropriate log and metric retention
- Dashboard Organization - create dashboards for different audiences
- Test Alerting - verify alerts work correctly
- Document Runbooks - procedures for common issues
Common Issues
Metrics Not Appearing
Problem: Metrics not showing in CloudWatch or Prometheus
Solutions:
- Verify Fluent Bit is running
- Check IAM permissions
- Verify service account configuration
- Check network connectivity
- Review pod logs
High Log Volume
Problem: Too many logs, high costs
Solutions:
- Implement log filtering
- Reduce log verbosity
- Use log sampling
- Configure log retention
- Use log aggregation efficiently
Prometheus Storage Full
Problem: Prometheus running out of storage
Solutions:
- Increase PVC size
- Reduce retention period
- Configure data compression
- Use remote storage (Thanos, Cortex)
- Implement metric downsampling
See Also
- Cluster Setup - Initial observability setup
- Add-ons - Installing observability tools
- Troubleshooting - Observability issues