Datadog

Datadog is a cloud-native monitoring and security platform that provides comprehensive observability for applications and infrastructure. It offers log collection, metrics monitoring, distributed tracing, and security monitoring, all integrated into a single platform.

What is Datadog?

Datadog provides:

Log Management - Centralized log collection and analysis
APM - Application Performance Monitoring with distributed tracing
Infrastructure Monitoring - Metrics for hosts, containers, and services
Real User Monitoring - Frontend performance monitoring
Security Monitoring - Security analytics and threat detection
Dashboards - Customizable visualizations

graph TB
    A[Kubernetes Cluster] --> B[Datadog Agent]
    B --> C[Datadog Platform]
    D[Application Logs] --> B
    E[System Logs] --> B
    F[Container Metrics] --> B
    G[APM Traces] --> B
    C --> H[Log Management]
    C --> I[Metrics]
    C --> J[APM]
    C --> K[Dashboards]
    style A fill:#e1f5ff
    style B fill:#e8f5e9
    style C fill:#fff4e1
    style H fill:#f3e5f5

Datadog Agent

The Datadog Agent is a lightweight daemon that runs on each node to collect logs, metrics, and traces:

Log collection - Collects logs from containers and applications
Metrics collection - System, container, and application metrics
APM - Distributed tracing
Autodiscovery - Automatically discovers services and configurations
Kubernetes integration - Native Kubernetes support

Installation

Using Helm (Recommended)

# Add Datadog Helm repository
helm repo add datadog https://helm.datadoghq.com
helm repo update

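# Create the namespace first; the secret cannot be created in a namespace that does not exist
kubectl create namespace datadog

# Create secret with API key
kubectl create secret generic datadog-secret \
  --from-literal api-key=<YOUR_API_KEY> \
  --namespace datadog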

# Install Datadog Agent
helm install datadog-agent datadog/datadog \
  --namespace datadog \
  --create-namespace \
  --set datadog.apiKeyExistingSecret=datadog-secret \
  --set datadog.logs.enabled=true \
  --set datadog.logs.containerCollectAll=true \
  --set datadog.apm.enabled=true \
  --set clusterAgent.enabled=true

Manual Deployment

Datadog Agent DaemonSet

apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
  namespace: datadog
type: Opaque
stringData:
  api-key: <YOUR_DATADOG_API_KEY>
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent
  namespace: datadog
spec:
  selector:
    matchLabels:
      app: datadog-agent
  template:
    metadata:
      labels:
        app: datadog-agent
    spec:
      serviceAccountName: datadog-agent
      containers:
      - image: gcr.io/datadoghq/agent:7
        name: datadog-agent
        env:
        - name: DD_API_KEY
          valueFrom:
            secretKeyRef:
              name: datadog-secret
              key: api-key
        - name: DD_SITE
          value: datadoghq.com
        - name: DD_LOGS_ENABLED
          value: "true"
        - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
          value: "true"
        - name: DD_CONTAINER_EXCLUDE
          value: "name:datadog-agent"
        - name: DD_APM_ENABLED
          value: "true"
        - name: DD_COLLECT_KUBERNETES_EVENTS
          value: "true"
        - name: DD_LEADER_ELECTION
          value: "true"
        - name: KUBERNETES
          value: "true"
        - name: DD_KUBERNETES_KUBELET_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
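        volumeMounts:
        # Docker runtime only. On containerd or CRI-O nodes there is no docker.sock;
        # mount the runtime's socket and /var/log/pods instead.
        - name: dockersocket
          mountPath: /var/run/docker.sock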
        - name: procdir
          mountPath: /host/proc
          readOnly: true
        - name: cgroups
          mountPath: /host/sys/fs/cgroup
          readOnly: true
        - name: pointerdir
          mountPath: /opt/datadog-agent/run
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: dockersocket
        hostPath:
          path: /var/run/docker.sock
      - name: procdir
        hostPath:
          path: /proc
      - name: cgroups
        hostPath:
          path: /sys/fs/cgroup
      - name: pointerdir
        emptyDir: {}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: datadog-agent
  namespace: datadog
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
- apiGroups: [""]
  resources:
  - services
  - events
  - endpoints
  - pods
  - nodes
  - componentstatuses
  verbs: ["get", "list", "watch"]
- apiGroups: ["quota.openshift.io"]
  resources:
  - clusterresourcequotas
  verbs: ["get", "list"]
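- apiGroups: ["apps"]
  resources:
  - deployments
  verbs: ["get", "list", "watch"]
# Kubelet endpoints queried by the Agent on each node
- apiGroups: [""]
  resources:
  - nodes/metrics
  - nodes/spec
  - nodes/proxy
  - nodes/stats
  verbs: ["get"]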
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
- kind: ServiceAccount
  name: datadog-agent
  namespace: datadog

Log Collection Configuration

Container Log Collection

The Datadog Agent automatically collects logs from all containers when enabled:

env:
- name: DD_LOGS_ENABLED
  value: "true"
- name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
  value: "true"

Selective Log Collection

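To collect logs only from specific containers, leave DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL unset or set it to "false", then annotate the pods you want. The annotation key has the form ad.datadoghq.com/<container_name>.logs, where <container_name> must match the container's name in the pod spec: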

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    ad.datadoghq.com/my-app.logs: '[{"source": "myapp", "service": "my-service"}]'
spec:
  containers:
  - name: my-app
    image: my-app:latest
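
Hand-written JSON inside a YAML annotation is easy to get wrong (unbalanced quotes, missing brackets). The following is a minimal Python sketch that generates the annotation key and value for a container. The helper name logs_annotation is hypothetical and exists only for illustration; the key format is the one used in the manifest above.

```python
import json


def logs_annotation(container_name, source, service):
    """Return the (key, value) pair for an Autodiscovery logs annotation.

    Hypothetical helper for illustration. The key follows the
    ad.datadoghq.com/<container_name>.logs form used in the manifest.
    """
    key = f"ad.datadoghq.com/{container_name}.logs"
    # The value is a JSON list holding one log configuration per entry.
    value = json.dumps([{"source": source, "service": service}])
    return key, value


if __name__ == "__main__":
    key, value = logs_annotation("my-app", "myapp", "my-service")
    # Print a line that can be pasted under metadata.annotations.
    print(f"{key}: '{value}'")
```

Generating the value with json.dumps guarantees that what lands in the annotation is valid JSON, which is the most common reason an annotation is silently ignored.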

Log Processing Rules

Configure log processing in the agent:

env:
- name: DD_LOGS_CONFIG_PROCESSING_RULES
  value: |
    [{
      "type": "multi_line",
      "name": "log_start_with_date",
      "pattern": "\\d{4}-\\d{2}-\\d{2}"
    }]
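
To see what a multi_line rule does, the sketch below applies the same date pattern to a few raw lines and merges continuation lines (such as a traceback) into the entry that precedes them. It only imitates the effect of the rule and is not the Agent's implementation; the function name group_multiline and the sample lines are made up for illustration.

```python
import re

# Same pattern as the rule above: a new entry starts with a YYYY-MM-DD date.
START_OF_ENTRY = re.compile(r"\d{4}-\d{2}-\d{2}")


def group_multiline(lines):
    """Merge lines that do not start with a date into the previous entry."""
    entries = []
    for line in lines:
        if START_OF_ENTRY.match(line) or not entries:
            entries.append(line)            # line opens a new log entry
        else:
            entries[-1] += "\n" + line      # line continues the previous entry
    return entries


if __name__ == "__main__":
    raw = [
        "2024-01-15 10:00:00 ERROR Unhandled exception",
        "Traceback (most recent call last):",
        '  File "app.py", line 10, in <module>',
        "2024-01-15 10:00:01 INFO Recovered",
    ]
    for entry in group_multiline(raw):
        print(repr(entry))
```

Without such a rule, each traceback line would arrive as a separate log event.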

Service Tags

Tags set in DD_TAGS are space-separated key:value pairs. The Agent attaches them to the logs, metrics, and traces it sends:

env:
- name: DD_TAGS
  value: "env:production service:my-app team:backend"

Kubernetes Metadata

The Agent can enrich collected data with Kubernetes metadata tags. The update frequency is given in seconds:

env:
- name: DD_KUBERNETES_COLLECT_METADATA_TAGS
  value: "true"
- name: DD_KUBERNETES_METADATA_TAG_UPDATE_FREQ
  value: "60"

Autodiscovery

Autodiscovery configures log collection per container, either from pod annotations or from configuration files loaded by the Agent:

Annotation-Based Configuration

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    ad.datadoghq.com/my-app.logs: |
      [
        {
          "source": "python",
          "service": "my-service",
          "log_processing_rules": [
            {
              "type": "multi_line",
              "name": "log_start_with_date",
              "pattern": "\\d{4}-\\d{2}-\\d{2}"
            }
          ]
        }
      ]
spec:
  containers:
  - name: my-app
    image: my-app:latest

ConfigMap-Based Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-logs-config
  namespace: datadog
data:
  my-service.yaml: |
    ad_identifiers:
      - my-app
    logs:
      - type: file
        path: /var/log/app.log
        source: python
        service: my-service
---
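# In the agent DaemonSet: mount the ConfigMap so the Agent loads it at startup.
# Files mounted under /conf.d are copied into /etc/datadog-agent/conf.d.
volumeMounts:
- name: logs-config
  mountPath: /conf.d
volumes:
- name: logs-config
  configMap:
    name: datadog-logs-config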

Log Queries

Basic Log Search

In Datadog Log Explorer:

Search: service:my-service
Filter by time range
Add facets for filtering

Advanced Queries

service:my-service status:error
source:nginx @http.status_code:>=400
env:production @http.status_code:[400 TO 499]
kube_namespace:production error
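
The same query syntax can be used programmatically. The sketch below builds, but does not send, a request to the Logs Search API using only the Python standard library. It assumes the datadoghq.com site and placeholder keys; the function name build_search_request is hypothetical.

```python
import json
import urllib.request

# Logs Search endpoint for the datadoghq.com site; other sites use a different host.
SEARCH_URL = "https://api.datadoghq.com/api/v2/logs/events/search"


def build_search_request(query, api_key, app_key, time_from="now-15m", limit=25):
    """Build (but do not send) a log search request for `query`."""
    body = {
        "filter": {"query": query, "from": time_from, "to": "now"},
        "sort": "-timestamp",               # newest events first
        "page": {"limit": limit},
    }
    return urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_search_request("service:my-service status:error", "<API_KEY>", "<APP_KEY>")
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # With real keys, urllib.request.urlopen(req) would run the search.
```

Keeping request construction separate from sending makes the query easy to inspect before any credentials are involved.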

Create facets for commonly filtered fields:

service
source
status
env
kube_namespace
pod_name

APM Integration

Enable APM for distributed tracing:

env:
- name: DD_APM_ENABLED
  value: "true"
- name: DD_APM_NON_LOCAL_TRAFFIC
  value: "true"

Expose the APM port. Setting hostPort lets application pods reach the Agent on the node's IP:

ports:
- containerPort: 8126
  hostPort: 8126
  name: apm
  protocol: TCP

Application Instrumentation

For Python applications:

# Enable automatic instrumentation before importing the libraries to be traced
from ddtrace import patch_all
patch_all()

# Your application code

For Node.js applications:

const tracer = require('dd-trace').init({
  service: 'my-service',
  env: 'production'
});

Dashboards

Creating Dashboards

Go to Dashboards > New Dashboard
Add widgets:
- Timeseries - Metrics over time
- Log Stream - Log events
- Heatmap - Distribution visualization
- Query Value - Single value
- Top List - Ranked list

Log-Based Widgets

Log Volume:

Widget: Timeseries
Query: *
Group by: service

Error Rate:

Widget: Query Value
Query: status:error
Aggregation: Count

Alerts and Monitors

Log-Based Monitors

Go to Monitors > New Monitor
Select Logs
Configure:
- Query: status:error
- Alert conditions: Count > threshold
- Notification channels
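
The same monitor can be described as data, for example to keep monitor definitions in version control. The sketch below only builds a request body and does not call Datadog. It assumes the "log alert" monitor type and the logs(...) query form of the Monitors API; the function name, the default window, and the @slack-alerts handle are placeholders.

```python
import json


def log_monitor_payload(query, threshold, window="5m", notify="@slack-alerts"):
    """Return a request body describing a log count monitor.

    Assumptions: "log alert" type and logs(...).rollup("count") query form;
    `notify` is a placeholder notification handle.
    """
    monitor_query = (
        f'logs("{query}").index("*").rollup("count").last("{window}") > {threshold}'
    )
    return {
        "name": f"High log count for {query}",
        "type": "log alert",
        "query": monitor_query,
        "message": f"Log count for `{query}` exceeded {threshold}. {notify}",
        "options": {"thresholds": {"critical": threshold}},
    }


if __name__ == "__main__":
    print(json.dumps(log_monitor_payload("status:error", 100), indent=2))
```

The threshold appears both in the query and in options.thresholds; the two values should stay in sync, which is why the sketch derives both from a single argument.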

Alert Conditions

Threshold - Alert when count exceeds value
Anomaly - Alert on anomalies
Forecast - Alert based on predictions

Best Practices

1. Resource Management

Set appropriate resource limits:

resources:
  requests:
    memory: "256Mi"
    cpu: "200m"
  limits:
    memory: "512Mi"
    cpu: "500m"

2. Log Sampling

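The Agent's processing rules filter logs by pattern (exclude_at_match, include_at_match) rather than by sample rate. To cut volume at the Agent, drop verbose lines with exclude_at_match. For percentage-based sampling, configure an exclusion filter with a sample rate on the log index in Datadog:

env:
- name: DD_LOGS_CONFIG_PROCESSING_RULES
  value: |
    [{
      "type": "exclude_at_match",
      "name": "exclude_debug_logs",
      "pattern": "DEBUG"
    }]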
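
Volume can also be reduced inside the application, before logs ever reach the Agent. The following is a minimal sketch using Python's standard logging module. The class name EveryNthFilter is hypothetical, and keeping every tenth low-severity record is only one possible policy.

```python
import logging


class EveryNthFilter(logging.Filter):
    """Keep 1 in every `n` records below WARNING; always keep WARNING and above."""

    def __init__(self, n):
        super().__init__()
        self.n = n
        self.seen = 0

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                     # never drop warnings or errors
        self.seen += 1
        return self.seen % self.n == 0      # keep every n-th low-severity record


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    logger = logging.getLogger("my-service")
    logger.addFilter(EveryNthFilter(10))    # roughly a 0.1 sample rate
    for i in range(30):
        logger.debug("verbose message %d", i)
    logger.error("errors are always kept")
```

A counter-based filter is deterministic, which makes it easy to test; random sampling would spread the kept records less predictably but avoids aliasing with periodic log patterns.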

3. Service Tags

Use consistent tagging:

env: environment (production, staging, dev)
service: service name
version: application version
team: team name
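
Consistency is easier to enforce when the tag string is generated rather than typed. The sketch below builds a space-separated value suitable for DD_TAGS and refuses to produce one if a required key is missing. The function name build_dd_tags and the choice of required keys (taken from the list above) are assumptions for illustration.

```python
REQUIRED_TAG_KEYS = ("env", "service", "version", "team")


def build_dd_tags(tags):
    """Return a space-separated key:value string suitable for DD_TAGS.

    Raises ValueError if any required key is missing or empty.
    """
    missing = [key for key in REQUIRED_TAG_KEYS if not tags.get(key)]
    if missing:
        raise ValueError(f"missing required tags: {', '.join(missing)}")
    return " ".join(f"{key}:{value}" for key, value in tags.items())


if __name__ == "__main__":
    print(build_dd_tags({
        "env": "production",
        "service": "my-app",
        "version": "1.2.3",
        "team": "backend",
    }))
```

A check like this can run in CI against deployment manifests so that untagged services are caught before they ship.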

4. Log Parsing

Configure proper log parsing:

Use source auto-detection when possible
Add custom parsing rules for structured logs
Parse JSON logs automatically
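
JSON parsing works best when the application emits one JSON object per line. The following is a minimal sketch of a JSON formatter built on Python's standard logging module. The class name JsonFormatter and the exact field set are illustrative choices; the status and message keys are used here on the assumption that they map to Datadog's standard log attributes.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "status": record.levelname.lower(),   # e.g. "error", "info"
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback inside the same JSON object (one log event).
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("my-service")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info("user %s logged in", "alice")
```

Because the traceback is embedded in the JSON object, exceptions arrive as a single event and no multi_line rule is needed for this service.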

5. Cost Optimization

Use log sampling for verbose logs
Filter unnecessary logs at collection
Set appropriate log retention
Use log exclusion filters

6. Security

Store API key in secrets
Use RBAC for agent permissions
Encrypt agent communication (TLS)
Follow least privilege principle

7. Monitoring the Agent

Monitor Datadog Agent health:

Agent status in Datadog UI
Agent metrics in infrastructure monitoring
Alert on agent failures

Troubleshooting

Check Agent Status

# Check agent pods
kubectl get pods -n datadog

# Check agent logs
kubectl logs -n datadog -l app=datadog-agent

# Show agent status (check the Logs Agent section for connectivity and sent log counts)
kubectl exec -n datadog <agent-pod> -- agent status
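
When a cluster runs many agent pods, scanning the kubectl output by eye becomes tedious. The sketch below takes the parsed output of kubectl get pods -n datadog -o json and lists pods that are not fully ready. The function name unready_pods and the pod names in the sample are hypothetical.

```python
import json


def unready_pods(pod_list):
    """Return names of pods that are not Running with all containers ready.

    `pod_list` is the parsed JSON from `kubectl get pods -n datadog -o json`.
    """
    unready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            unready.append(pod["metadata"]["name"])
    return unready


if __name__ == "__main__":
    # Hand-written sample shaped like the kubectl JSON output (names are made up).
    sample = json.loads("""
    {"items": [
      {"metadata": {"name": "datadog-agent-abc12"},
       "status": {"phase": "Running", "containerStatuses": [{"ready": true}]}},
      {"metadata": {"name": "datadog-agent-def34"},
       "status": {"phase": "Running", "containerStatuses": [{"ready": false}]}}
    ]}
    """)
    print(unready_pods(sample))
```

In practice the input would come from piping the kubectl command into a script that reads standard input with json.load.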

Verify Log Collection

Check agent configuration
Verify logs are being sent to Datadog
Check log source and service tags
Verify autodiscovery is working

Common Issues

No logs appearing:

Verify DD_LOGS_ENABLED=true
Check that DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL is "true", or that the pods carry log annotations
Verify API key is correct
Check network connectivity
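
The API key and network path can be checked together from any machine with outbound HTTPS access. The sketch below builds, but does not send, a request to Datadog's key validation endpoint. It assumes the datadoghq.com site; the function name build_validate_request is hypothetical.

```python
import urllib.request

# Validation endpoint for the datadoghq.com site; other sites use a different host.
VALIDATE_URL = "https://api.datadoghq.com/api/v1/validate"


def build_validate_request(api_key, url=VALIDATE_URL):
    """Build (but do not send) a request that checks whether an API key is accepted."""
    return urllib.request.Request(url, headers={"DD-API-KEY": api_key}, method="GET")


if __name__ == "__main__":
    req = build_validate_request("<YOUR_API_KEY>")
    print(req.get_method(), req.full_url)
    # With a real key, urllib.request.urlopen(req) sends the check.
    # A rejected key surfaces as urllib.error.HTTPError; a timeout or
    # connection error points to a network problem instead.
```

Running the check from inside an agent pod's network (for example from a debug container on the same node) separates key problems from egress or proxy problems.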

High agent resource usage:

Reduce log sampling rate
Exclude unnecessary containers
Adjust resource limits
Filter logs at collection