Tracing (OpenTelemetry)

OpenTelemetry is a unified standard for observability that provides APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces). It’s become the de facto standard for distributed tracing in modern applications and Kubernetes environments.

What is OpenTelemetry?

OpenTelemetry (OTel) is an open-source observability framework that:

  • Unifies observability - Single standard for traces, metrics, and logs
  • Vendor-neutral - Works with any observability backend
  • Language support - SDKs for 10+ programming languages
  • Automatic instrumentation - Reduces manual coding effort
  • Cloud-native - Built for distributed systems like Kubernetes

graph TB
    A[Application] --> B[OpenTelemetry SDK]
    B --> C[Automatic Instrumentation]
    B --> D[Manual Instrumentation]
    C --> E[Traces]
    C --> F[Metrics]
    C --> G[Logs]
    D --> E
    D --> F
    D --> G
    E --> H[OTel Collector]
    F --> H
    G --> H
    H --> I[Exporters]
    I --> J[Prometheus]
    I --> K[Jaeger]
    I --> L[Zipkin]
    I --> M[Backends]
    style A fill:#e1f5ff
    style B fill:#e8f5e9
    style H fill:#fff4e1
    style I fill:#f3e5f5

Three Observability Signals

OpenTelemetry handles three types of telemetry data:

Traces

Traces show the path of a request through distributed services (a minimal sketch follows the list):

  • Spans - Individual operations within a trace
  • Trace context - Propagated across service boundaries
  • Timing - Duration of each operation
  • Relationships - Parent-child span relationships
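
A minimal Python sketch of parent-child spans, assuming a tracer configured as in the examples later on this page (the span names and order_id parameter are illustrative):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_request(order_id):
    # The outer span is the parent; spans started inside it become its
    # children, and all of them share the same trace ID.
    with tracer.start_as_current_span("handle_request") as parent:
        parent.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("query_database"):
            pass  # placeholder for a database call
        with tracer.start_as_current_span("call_payment_service"):
            pass  # placeholder for a downstream call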

Metrics

Metrics are numerical measurements over time (see the sketch after this list):

  • Counters - Incrementing values (e.g., request count)
  • Gauges - Point-in-time values (e.g., CPU usage)
  • Histograms - Distribution of measurements
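
A sketch of the three instrument types using the Python metrics API (instrument names and attributes are illustrative; without an SDK meter provider configured, these calls are no-ops):

from opentelemetry import metrics
from opentelemetry.metrics import Observation

meter = metrics.get_meter(__name__)

# Counter: monotonically increasing value
request_counter = meter.create_counter("http.requests", description="Total requests")
request_counter.add(1, {"route": "/orders"})

# Histogram: distribution of measurements
latency = meter.create_histogram("http.request.duration", unit="ms")
latency.record(42.5, {"route": "/orders"})

# Gauge: point-in-time value, reported through a callback
def read_cpu(_options):
    return [Observation(0.37)]

meter.create_observable_gauge("system.cpu.utilization", callbacks=[read_cpu])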

Logs

Logs are structured event records (a correlation example follows the list):

  • Structured format - JSON or key-value pairs
  • Correlation - Linked to traces via trace IDs
  • Context - Rich contextual information
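
For example, the active span's IDs can be attached to a structured log record so the log line can later be joined with its trace (a minimal sketch; the field names are illustrative):

import json
import logging

from opentelemetry import trace

logger = logging.getLogger(__name__)

def log_with_trace_context(message):
    # Format the current span's IDs as W3C-style hex strings
    ctx = trace.get_current_span().get_span_context()
    logger.info(json.dumps({
        "message": message,
        "trace_id": format(ctx.trace_id, "032x"),
        "span_id": format(ctx.span_id, "016x"),
    }))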

OpenTelemetry Architecture

graph TB
    A[Applications] --> B[OTel SDK]
    B --> C[Instrumentation]
    C --> D[OTel Collector]
    D --> E[Receivers]
    E --> F[Processors]
    F --> G[Exporters]
    G --> H[Prometheus]
    G --> I[Jaeger]
    G --> J[Loki]
    G --> K[Other Backends]
    style B fill:#e1f5ff
    style D fill:#e8f5e9
    style E fill:#fff4e1
    style F fill:#f3e5f5
    style G fill:#ffe1e1

Components

  1. SDK - Language-specific library for instrumentation
  2. Collector - Receives, processes, and exports telemetry data
  3. Receivers - Accept data from SDKs or other sources
  4. Processors - Transform, filter, or batch data
  5. Exporters - Send data to observability backends

Automatic vs Manual Instrumentation

Automatic Instrumentation

Zero-code instrumentation for popular frameworks:

# Example: app with a Collector sidecar; an auto-instrumentation agent or SDK
# in the app container reads the standard OTEL_* environment variables
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:latest
        env:
        - name: OTEL_SERVICE_NAME
          value: my-app
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://localhost:4317"
      - name: otel-collector
        image: otel/opentelemetry-collector:latest

Automatic instrumentation covers common libraries out of the box (a Flask sketch follows this list):

  • HTTP frameworks (Express, Django, Flask, etc.)
  • Database drivers (PostgreSQL, MySQL, MongoDB, etc.)
  • Message queues (Kafka, RabbitMQ, etc.)
  • gRPC and REST APIs
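
As one example, assuming the opentelemetry-instrumentation-flask package is installed, a Flask app can be instrumented once at startup without changing its handlers:

from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)

# Every incoming request now gets a server span automatically
FlaskInstrumentor().instrument_app(app)

@app.route("/orders")
def list_orders():
    return {"orders": []}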

Manual Instrumentation

Explicit instrumentation for custom code:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Setup
tracer_provider = TracerProvider()
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)

# Instrumentation
def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # Business logic
        span.set_attribute("order.status", "completed")

OpenTelemetry Collector

The Collector is a vendor-neutral agent that processes telemetry data:

Deployment in Kubernetes

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector:latest
        volumeMounts:
        - name: config
          mountPath: /etc/otelcol
      volumes:
      - name: config
        configMap:
          name: otel-collector-config

Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Instrumentation Examples

Go Application

package main

import (
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
    // Create an exporter that sends spans to the Jaeger collector endpoint
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        log.Fatalf("failed to create Jaeger exporter: %v", err)
    }

    // Batch spans before export and register the provider globally
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
}

Python Application

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Setup
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Configure exporter
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Use
with tracer.start_as_current_span("operation") as span:
    span.set_attribute("key", "value")
    # Your code

Java Application

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(
        SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(
                OtlpGrpcSpanExporter.builder()
                    .setEndpoint("http://otel-collector:4317")
                    .build())
                .build())
            .build())
    .build();

Kubernetes Deployment

Sidecar Pattern

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-otel
spec:
  selector:
    matchLabels:
      app: app-with-otel
  template:
    metadata:
      labels:
        app: app-with-otel
    spec:
      containers:
      - name: app
        image: my-app:latest
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://localhost:4317"
      - name: otel-collector
        image: otel/opentelemetry-collector:latest
        ports:
        - containerPort: 4317
          name: otlp-grpc
        - containerPort: 4318
          name: otlp-http

DaemonSet Pattern

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
      - name: otel-collector
        image: otel/opentelemetry-collector:latest
        ports:
        - containerPort: 4317
          name: otlp-grpc
        - containerPort: 4318
          name: otlp-http
        volumeMounts:
        - name: config
          mountPath: /etc/otelcol
      volumes:
      - name: config
        configMap:
          name: otel-collector-config

Exporters and Backends

OpenTelemetry can export to many backends:

Prometheus (Metrics)

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

Jaeger (Traces)

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

Loki (Logs)

exporters:
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

OTLP (Generic)

exporters:
  otlp:
    endpoint: backend.example.com:4317
    tls:
      cert_file: /etc/certs/client.crt
      key_file: /etc/certs/client.key

Trace Context Propagation

OpenTelemetry automatically propagates trace context across service boundaries:

sequenceDiagram
    participant Client
    participant ServiceA
    participant ServiceB
    participant ServiceC
    Client->>ServiceA: HTTP Request (with trace context)
    ServiceA->>ServiceA: Start Span A
    ServiceA->>ServiceB: gRPC Call (propagate trace context)
    ServiceB->>ServiceB: Start Span B (child of A)
    ServiceB->>ServiceC: HTTP Request (propagate trace context)
    ServiceC->>ServiceC: Start Span C (child of B)
    ServiceC-->>ServiceB: Response
    ServiceB-->>ServiceA: Response
    ServiceA-->>Client: Response (complete trace)
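
Where automatic instrumentation is not available, context can be propagated by hand with the propagation API. A minimal sketch, assuming the requests library for the outgoing call (the function names and headers handling are illustrative):

import requests

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

def call_downstream(url):
    # Client side: inject the active trace context (traceparent/tracestate)
    # into the outgoing HTTP headers
    headers = {}
    inject(headers)
    return requests.get(url, headers=headers)

def handle_incoming(request_headers):
    # Server side: extract the caller's context and continue the same trace
    parent_ctx = extract(request_headers)
    with tracer.start_as_current_span("handle_incoming", context=parent_ctx):
        pass  # business logic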

Best Practices

  1. Start with automatic instrumentation - Cover common frameworks with zero code changes first, then add manual spans for business logic

  2. Use the Collector - Deploy the Collector as a sidecar or DaemonSet instead of exporting directly from every application

  3. Sample appropriately - Configure sampling to control data volume and costs

  4. Set resource attributes - Add service name, version, and environment info (see the sketch after this list)

  5. Correlate signals - Link logs and metrics to traces via trace IDs

  6. Instrument at the edge - Include API gateways and load balancers so traces begin at the entry point

  7. Monitor the Collector - Ensure Collector is healthy and not dropping data

  8. Use semantic conventions - Follow OpenTelemetry semantic conventions for consistency
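
For point 4, resource attributes are usually attached once when the tracer provider is created. A minimal sketch using the Python SDK (the attribute values are placeholders):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# service.name, service.version, and deployment.environment are standard
# semantic-convention keys
resource = Resource.create({
    "service.name": "my-app",
    "service.version": "1.4.2",
    "deployment.environment": "production",
})

trace.set_tracer_provider(TracerProvider(resource=resource))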

Sampling

Control data volume with sampling:

processors:
  probabilistic_sampler:
    sampling_percentage: 10  # Sample 10% of traces

Or in code:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 50% of traces and attach the sampler to the provider
sampler = TraceIdRatioBased(0.5)
trace.set_tracer_provider(TracerProvider(sampler=sampler))

Troubleshooting

No Traces Appearing

# Check Collector logs
kubectl logs -n monitoring -l app=otel-collector

# Verify Collector is receiving data
kubectl port-forward -n monitoring svc/otel-collector 8888:8888
# Then inspect http://localhost:8888/metrics for receiver and exporter counters

# Check application instrumentation
kubectl logs <app-pod> | grep -i otel

High Cardinality

Too many unique span names or attribute values cause high cardinality and can hurt performance (see the sketch after this list):

  • Use span attributes wisely
  • Avoid putting unique IDs in attribute names
  • Consider sampling for high-volume traces
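
As a rule of thumb, keep identifiers in attribute values, not in span names or attribute keys. A sketch of the distinction:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    # Avoid: a unique span name per order explodes cardinality
    #   tracer.start_as_current_span(f"process_order_{order_id}")
    # Prefer: a stable span name, with the ID carried as an attribute value
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)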

Missing Context

Ensure trace context is propagated:

  • Check HTTP headers (traceparent, tracestate)
  • Verify gRPC metadata propagation
  • Test across service boundaries

See Also