EKS Observability

Observability on EKS involves monitoring, logging, and tracing to understand cluster and application behavior. EKS integrates with AWS services like CloudWatch, and supports popular open-source tools like Prometheus and Grafana for comprehensive observability.

Observability Overview

Observability consists of three pillars:

graph TB
  subgraph observability[Observability]
    METRICS[Metrics] --> DASHBOARDS[Dashboards]
    LOGS[Logs] --> AGGREGATION[Log Aggregation]
    TRACES[Traces] --> DISTRIBUTED[Distributed Tracing]
  end
  subgraph tools[Tools]
    DASHBOARDS --> CW[CloudWatch]
    DASHBOARDS --> GRAFANA[Grafana]
    AGGREGATION --> CLOUDWATCH_LOGS[CloudWatch Logs]
    AGGREGATION --> ELASTIC[Elasticsearch]
    DISTRIBUTED --> XRAY[X-Ray]
    DISTRIBUTED --> JAEGER[Jaeger]
  end
  style METRICS fill:#e1f5ff
  style LOGS fill:#fff4e1
  style TRACES fill:#e8f5e9

Metrics: Quantitative measurements over time (CPU, memory, request rate)

Logs: Event records with timestamps (application logs, audit logs)

Traces: Request flows through distributed systems (end-to-end request tracking)

CloudWatch Container Insights

CloudWatch Container Insights provides automatic collection and visualization of metrics and logs from containerized applications and microservices.

Architecture

graph TB
  PODS[Pods] --> FLUENTBIT[Fluent Bit]
  NODES[Nodes] --> FLUENTBIT
  FLUENTBIT --> CLOUDWATCH[CloudWatch Logs]
  FLUENTBIT --> METRICS[CloudWatch Metrics]
  CLOUDWATCH --> DASHBOARD[Container Insights<br/>Dashboard]
  METRICS --> DASHBOARD
  style PODS fill:#e1f5ff
  style FLUENTBIT fill:#fff4e1
  style DASHBOARD fill:#e8f5e9

Enabling Container Insights

Using AWS Console:

  1. Go to EKS → Clusters → Your Cluster → Observability
  2. Enable control plane logging and choose the log types to send to CloudWatch Logs
  3. Install the Amazon CloudWatch Observability add-on (or the CloudWatch agent and Fluent Bit setup below) to turn on Container Insights

Using AWS CLI:

# Enable control plane logging (logs are delivered to CloudWatch Logs)
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{
    "clusterLogging": [
      {
        "types": ["api", "audit", "authenticator", "controllerManager", "scheduler"],
        "enabled": true
      }
    ]
  }'

Using eksctl:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-west-2

cloudWatch:
  clusterLogging:
    enableTypes: ["*"]

Installing Fluent Bit

Fluent Bit is the log forwarder for Container Insights:

# Create namespace
kubectl create namespace amazon-cloudwatch

# Create the ConfigMap read by the Fluent Bit DaemonSet
kubectl create configmap fluent-bit-cluster-info \
  --from-literal=cluster.name=my-cluster \
  --from-literal=http.server=On \
  --from-literal=http.port=2020 \
  --from-literal=read.head=Off \
  --from-literal=read.tail=On \
  --from-literal=logs.region=us-west-2 \
  -n amazon-cloudwatch

# Install Fluent Bit
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml
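
After applying the manifest, a quick sanity check that the collector is running on every node:

# Verify the Fluent Bit DaemonSet and its pods
kubectl get daemonset,pods -n amazon-cloudwatch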

Container Insights Metrics

Container Insights automatically collects:

Cluster Metrics:

  • CPU utilization
  • Memory utilization
  • Network I/O
  • Storage I/O

Node Metrics:

  • Node CPU/memory
  • Pod count per node
  • Container count per node

Pod Metrics:

  • Pod CPU/memory
  • Network I/O
  • Storage I/O
  • Restart count

Namespace Metrics:

  • Resource usage per namespace
  • Pod count per namespace
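
Collected metrics are published to the ContainerInsights CloudWatch namespace, keyed by dimensions such as ClusterName. A quick way to confirm data is arriving (using the cluster name from above):

# List Container Insights metrics for the cluster
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster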

Viewing Container Insights

Access Container Insights dashboard:

  1. Go to CloudWatch → Container Insights
  2. Select your cluster
  3. View metrics and logs

Available Views:

  • Cluster performance
  • Node performance
  • Pod performance
  • Namespace performance
  • Logs explorer
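
Container Insights also stores application and performance logs in CloudWatch Logs, where they can be queried with Logs Insights. A hedged example against the application log group (field names assume the default Fluent Bit Kubernetes filter metadata):

fields @timestamp, kubernetes.pod_name, log
| filter kubernetes.namespace_name = "default"
| sort @timestamp desc
| limit 20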

Prometheus and Grafana

Prometheus is a popular open-source monitoring and alerting toolkit. Grafana provides visualization and dashboards for Prometheus metrics.

Architecture

graph TB
  PODS[Pods] --> PROMETHEUS[Prometheus<br/>Scraper]
  NODES[Nodes] --> PROMETHEUS
  KUBELET[kubelet] --> PROMETHEUS
  PROMETHEUS --> METRICS_DB[Prometheus<br/>Time Series DB]
  METRICS_DB --> GRAFANA[Grafana<br/>Dashboards]
  PROMETHEUS --> ALERTMANAGER[Alertmanager]
  ALERTMANAGER --> NOTIFICATIONS[Notifications]
  style PODS fill:#e1f5ff
  style PROMETHEUS fill:#fff4e1
  style GRAFANA fill:#e8f5e9

Installing Prometheus

Using Helm:

# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

Components Installed:

  • Prometheus Server
  • Grafana
  • Alertmanager
  • Node Exporter
  • kube-state-metrics
  • Prometheus Operator
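
To confirm the stack is healthy and reach the Prometheus UI (the Service name below assumes the release name prometheus used in the install command):

# Check that all monitoring components are running
kubectl get pods -n monitoring

# Port forward to the Prometheus UI at http://localhost:9090
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090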

Prometheus Configuration

ServiceMonitor for Custom Metrics:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  namespace: default
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
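
The ServiceMonitor selects Services by label and scrapes the named port, so the application's Service must carry a matching label and a port named metrics. A minimal sketch (names are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app            # matched by the ServiceMonitor selector
spec:
  selector:
    app: my-app            # pods backing the Service
  ports:
  - name: metrics          # port name referenced by the ServiceMonitor
    port: 8080
    targetPort: 8080

By default, kube-prometheus-stack only discovers ServiceMonitors labeled with its Helm release (for example release: prometheus) unless serviceMonitorSelectorNilUsesHelmValues is set to false in the chart values.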

PrometheusRule for Alerts:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: default
spec:
  groups:
  - name: my-app
    interval: 30s
    rules:
    - alert: HighCPUUsage
      expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage detected"
        description: "CPU usage is above 80% for 5 minutes"

Grafana Dashboards

Access Grafana (the default admin username is admin):

# Get Grafana admin password
kubectl get secret prometheus-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d

# Port forward to access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80

Import Dashboards:

Popular dashboards:

  • Kubernetes Cluster Monitoring (3119)
  • Node Exporter Full (1860)
  • Kubernetes Deployment Statefulset Daemonset metrics (8588)

Custom Dashboard:

{
  "dashboard": {
    "title": "My App Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}}"
          }
        ]
      }
    ]
  }
}
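
Dashboards can also be imported programmatically through Grafana's HTTP API. A hedged sketch against the port-forward above (the JSON file is the payload shown, which already wraps the panels in a dashboard key):

# Import the dashboard JSON via the Grafana API
curl -X POST http://localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -u admin:<admin-password> \
  -d @my-dashboard.json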

AWS X-Ray for Distributed Tracing

X-Ray helps analyze and debug distributed applications by providing request tracing across services.

Architecture

graph LR
  APP1[Service 1] --> XRAY[X-Ray SDK]
  APP2[Service 2] --> XRAY
  APP3[Service 3] --> XRAY
  XRAY --> XRAY_DAEMON[X-Ray Daemon]
  XRAY_DAEMON --> XRAY_SERVICE[X-Ray Service]
  XRAY_SERVICE --> CONSOLE[X-Ray Console]
  style APP1 fill:#e1f5ff
  style XRAY fill:#fff4e1
  style CONSOLE fill:#e8f5e9

Installing X-Ray Daemon

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xray-daemon
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: xray-daemon
  template:
    metadata:
      labels:
        app: xray-daemon
    spec:
      containers:
      - name: xray-daemon
        image: amazon/aws-xray-daemon:latest
        ports:
        - containerPort: 2000
          protocol: UDP
        resources:
          limits:
            cpu: 100m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 256Mi
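
Application pods need to reach the daemon on UDP port 2000. A common pattern is a Service in front of the DaemonSet plus the AWS_XRAY_DAEMON_ADDRESS environment variable in application pods. A hedged sketch (it assumes the daemon is configured to bind on 0.0.0.0:2000 rather than only on localhost):

apiVersion: v1
kind: Service
metadata:
  name: xray-service
  namespace: kube-system
spec:
  selector:
    app: xray-daemon
  ports:
  - name: xray-udp
    port: 2000
    protocol: UDP
    targetPort: 2000

Application containers then point the X-Ray SDK at the daemon, for example with AWS_XRAY_DAEMON_ADDRESS=xray-service.kube-system:2000.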

Instrumenting Applications

Node.js Example:

const AWSXRay = require('aws-xray-sdk'); // the full SDK includes the Express middleware
const express = require('express');

const app = express();

// Enable X-Ray
app.use(AWSXRay.express.openSegment('my-app'));

app.get('/api/users', async (req, res) => {
  const segment = AWSXRay.getSegment();
  const subsegment = segment.addNewSubsegment('database-query');
  
  try {
    // Database query
    const users = await db.query('SELECT * FROM users');
    subsegment.close();
    res.json(users);
  } catch (error) {
    subsegment.addError(error);
    subsegment.close();
    res.status(500).json({ error: error.message });
  }
});

app.use(AWSXRay.express.closeSegment());

Java Example:

import com.amazonaws.xray.AWSXRay;
import com.amazonaws.xray.entities.Subsegment;

@RestController
public class UserController {

    @GetMapping("/api/users")
    public List<User> getUsers() {
        Subsegment subsegment = AWSXRay.beginSubsegment("database-query");
        try {
            return userRepository.findAll();
        } catch (Exception e) {
            subsegment.addException(e);
            throw e;
        } finally {
            // Always end the subsegment, on success and on failure
            AWSXRay.endSubsegment();
        }
    }
}

Logging with Fluent Bit and Fluentd

Fluent Bit Configuration

Fluent Bit is lightweight and efficient for log forwarding:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: kube-system
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
    
    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On
    
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude On
    
    [OUTPUT]
        Name  cloudwatch_logs
        Match kube.*
        region us-west-2
        log_group_name /aws/eks/my-cluster
        log_stream_prefix fluent-bit-
        auto_create_group On
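
Fluent Bit also needs IAM permissions to create log groups and put log events. On EKS this is commonly granted through IAM Roles for Service Accounts (IRSA); a hedged sketch using the managed CloudWatchAgentServerPolicy (the service account name and namespace must match the ones used by your Fluent Bit DaemonSet):

# Bind an IAM role with CloudWatch permissions to the Fluent Bit service account
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name fluent-bit \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --approve \
  --override-existing-serviceaccounts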

Application Logging Best Practices

Structured Logging:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "service": "user-service",
  "trace_id": "1-5f2b3c4d-abc123",
  "message": "User created",
  "user_id": "12345",
  "duration_ms": 45
}
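
Most logging libraries can emit this shape directly. A minimal sketch using only the Python standard library (field names mirror the example above; substitute your framework's JSON formatter if it provides one):

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # Render each log record as a single-line JSON document
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "user-service",
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logging's `extra` argument
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User created", extra={"fields": {"user_id": "12345", "duration_ms": 45}})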

Log Levels:

  • ERROR - Errors that need attention
  • WARN - Warnings that might need attention
  • INFO - Informational messages
  • DEBUG - Debug information (disable in production)

Log Rotation:

Configure log rotation so container logs do not fill the node disk (the kubelet also rotates container logs on its own; this ConfigMap only takes effect if a node-level logrotate process mounts and uses it):

apiVersion: v1
kind: ConfigMap
metadata:
  name: logrotate-config
data:
  logrotate.conf: |
    /var/log/containers/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
        create 0644 root root
    }

Metrics Collection and Dashboards

Custom Metrics

Expose custom application metrics:

Prometheus Metrics Endpoint:

from prometheus_client import Counter, Histogram, start_http_server

# Define metrics
request_count = Counter('http_requests_total', 'Total HTTP requests', ['method', 'endpoint'])
request_duration = Histogram('http_request_duration_seconds', 'HTTP request duration', ['method', 'endpoint'])

# Instrument code
@request_duration.labels(method='GET', endpoint='/api/users').time()
def handle_request():
    request_count.labels(method='GET', endpoint='/api/users').inc()
    # Handle request

# Expose the /metrics endpoint for Prometheus to scrape
start_http_server(8080)

Annotation-Based Scraping (note that kube-prometheus-stack discovers targets through ServiceMonitors by default, so these annotations only take effect with a scrape config that honors them):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080

Alerting Strategies

Prometheus Alertmanager:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    global:
      slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'slack-notifications'
      routes:
      - match:
          severity: critical
        receiver: 'slack-critical'
    
    receivers:
    - name: 'slack-notifications'
      slack_configs:
      - channel: '#alerts'
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    
    - name: 'slack-critical'
      slack_configs:
      - channel: '#alerts-critical'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
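
With the kube-prometheus-stack install above, Alertmanager is managed by the Prometheus Operator, so this configuration is usually supplied through the chart's alertmanager.config value (or an AlertmanagerConfig resource) rather than a hand-written ConfigMap. A hedged sketch, assuming the config above is placed under alertmanager.config in values-alertmanager.yaml:

# Apply the Alertmanager configuration via Helm values
helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f values-alertmanager.yaml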

CloudWatch Alarms:

# Create a CloudWatch alarm on a Container Insights metric
aws cloudwatch put-metric-alarm \
  --alarm-name high-node-cpu-usage \
  --alarm-description "Alert when node CPU usage is high" \
  --metric-name node_cpu_utilization \
  --namespace ContainerInsights \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 300 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:alerts

Cost Monitoring and Optimization

Cost Explorer

Monitor EKS costs with the Cost Explorer API:

# Get cost and usage report
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-01-31 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

Resource Tagging

Tag resources for cost allocation:

# Tag cluster
aws eks tag-resource \
  --resource-arn arn:aws:eks:us-west-2:123456789012:cluster/my-cluster \
  --tags Environment=Production,Team=Platform,CostCenter=Engineering

# Add Kubernetes labels to the node group's nodes (scheduling metadata, not cost-allocation tags)
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name general-workers \
  --labels 'addOrUpdateLabels={Environment=Production,Team=Platform}'

Cost Optimization Metrics

Monitor these metrics:

  • Cluster Utilization - CPU and memory usage
  • Node Utilization - Per-node resource usage
  • Pod Density - Pods per node
  • Idle Resources - Unused capacity
  • Spot Instance Usage - Spot vs on-demand ratio

Performance Analysis Tools

kubectl top

Quick resource usage overview:

# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods

# Pod resource usage by namespace
kubectl top pods --namespace=default
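
kubectl top reads from the Kubernetes Metrics Server, which is not installed on EKS by default. A minimal install sketch using the upstream release manifest:

# Install the Metrics Server (required for kubectl top)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml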

cAdvisor

Container resource usage:

# Access cAdvisor on node
kubectl proxy --port=8001

# View node metrics
curl http://localhost:8001/api/v1/nodes/<node-name>/proxy/metrics/cadvisor

kube-state-metrics

Kubernetes object metrics:

# Install kube-state-metrics (already included if you installed kube-prometheus-stack above)
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/service-account.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/cluster-role.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/cluster-role-binding.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/deployment.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/service.yaml

Best Practices

  1. Enable Container Insights - Automatic metrics and logs collection

  2. Use Prometheus for Custom Metrics - Application-specific metrics

  3. Implement Distributed Tracing - X-Ray or Jaeger for request flows

  4. Structured Logging - JSON format for better parsing

  5. Set Up Alerts - Proactive monitoring and alerting

  6. Monitor Costs - Track resource usage and costs

  7. Retention Policies - Configure appropriate log and metric retention

  8. Dashboard Organization - Create dashboards for different audiences

  9. Test Alerting - Verify alerts work correctly

  10. Document Runbooks - Procedures for common issues

Common Issues

Metrics Not Appearing

Problem: Metrics not showing in CloudWatch or Prometheus

Solutions:

  • Verify Fluent Bit is running (see the commands after this list)
  • Check IAM permissions
  • Verify service account configuration
  • Check network connectivity
  • Review pod logs
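
For example, a quick check of the Container Insights log pipeline set up earlier (the DaemonSet name assumes the AWS sample manifest):

# Check the Fluent Bit DaemonSet and inspect recent logs for delivery errors
kubectl get daemonset,pods -n amazon-cloudwatch
kubectl logs -n amazon-cloudwatch daemonset/fluent-bit --tail=50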

High Log Volume

Problem: Too many logs, high costs

Solutions:

  • Implement log filtering
  • Reduce log verbosity
  • Use log sampling
  • Configure log retention
  • Use log aggregation efficiently

Prometheus Storage Full

Problem: Prometheus running out of storage

Solutions:

  • Increase PVC size
  • Reduce retention period
  • Configure data compression
  • Use remote storage (Thanos, Cortex)
  • Implement metric downsampling

See Also