Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit, widely used for monitoring Kubernetes clusters. It collects metrics from configured targets at regular intervals, stores them efficiently, and allows you to query them using PromQL (Prometheus Query Language).
What is Prometheus?
Prometheus is a time-series database designed for monitoring:
- Pull model - Prometheus scrapes metrics from targets (pods, services, nodes)
- Time-series database - Stores metrics as time-series data
- PromQL - Powerful query language for analyzing metrics
- Multi-dimensional - Metrics identified by metric name and key-value pairs (labels)
- Service discovery - Automatically discovers monitoring targets in Kubernetes
Core Concepts
Metrics
Metrics are measurements over time. Each metric has:
- Name - Identifies what is being measured (e.g., http_requests_total)
- Labels - Key-value pairs that add dimensions (e.g., method="GET", status="200")
- Value - The actual measurement (e.g., 1234)
Example metric:
http_requests_total{method="GET", status="200", path="/api"} 1234
Time-Series
A time-series is a sequence of data points over time:
- X-axis: Time (timestamp)
- Y-axis: Metric value
Scraping
Prometheus collects metrics by “scraping” HTTP endpoints that expose metrics in Prometheus format:
GET http://target:port/metrics
Targets
Targets are endpoints that Prometheus scrapes. In Kubernetes, targets are discovered via:
- ServiceMonitor - Scrapes services based on labels
- PodMonitor - Scrapes pods based on labels
- Service discovery - Automatically discovers services and pods
Prometheus Architecture in Kubernetes
Components
- Prometheus Server - Core component that scrapes and stores metrics
- Prometheus Operator - Manages Prometheus deployments via Kubernetes CRDs
- ServiceMonitor - CRD that defines which services to scrape
- PodMonitor - CRD that defines which pods to scrape
- AlertManager - Handles alerts sent by Prometheus
Installation
Using Helm (Recommended)
# Add Prometheus Community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (includes Prometheus, Grafana, AlertManager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
Using Prometheus Operator
# Install Prometheus Operator
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
# Install Prometheus
kubectl apply -f prometheus.yaml
Basic Prometheus Deployment
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus'
      volumes:
        - name: config
          configMap:
            name: prometheus-config
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP
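The Deployment above mounts a ConfigMap named prometheus-config that is not shown. A minimal sketch of that ConfigMap, with an illustrative self-scrape job:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 30s
    scrape_configs:
      # Prometheus scraping its own /metrics endpoint
      - job_name: prometheus
        static_configs:
          - targets: ['localhost:9090']
```

In practice, Kubernetes service-discovery scrape_configs (or the Operator's ServiceMonitor/PodMonitor resources) replace the static target list.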
ServiceMonitor
ServiceMonitor is a Custom Resource that tells Prometheus which Services to scrape. Note that by default its selector only matches Services in the ServiceMonitor's own namespace; add spec.namespaceSelector to match Services elsewhere:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
This ServiceMonitor:
- Finds Services with the label app: my-app
- Scrapes the port named metrics
- Uses the /metrics path
- Scrapes every 30 seconds
PodMonitor
PodMonitor tells Prometheus which pods to scrape directly:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pods-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Exposing Metrics from Applications
Applications need to expose metrics in Prometheus format on an HTTP endpoint:
Go Example
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal counts HTTP requests, partitioned by method and status.
var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "status"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal)
}

func main() {
	// Expose all registered metrics in the Prometheus text format.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:latest
          ports:
            - name: http
              containerPort: 8080
            - name: metrics
              containerPort: 9090
---
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
  selector:
    app: my-app
PromQL Basics
PromQL (Prometheus Query Language) is used to query and aggregate time-series data.
Basic Queries
# Get current value
http_requests_total
# Filter by label
http_requests_total{method="GET"}
# Multiple label filters
http_requests_total{method="GET", status="200"}
# Rate over time (requests per second)
rate(http_requests_total[5m])
# Increase over time
increase(http_requests_total[1h])
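The arithmetic behind rate() can be sketched in Go. This is a simplification: the real rate() also extrapolates to the window boundaries and corrects for counter resets, which this sketch omits:

```go
package main

import "fmt"

// sample is one counter observation: a Unix timestamp (seconds) and a value.
type sample struct {
	ts    float64
	value float64
}

// simpleRate mimics the core arithmetic of PromQL's rate():
// (last - first) / elapsed seconds over the window.
func simpleRate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	first, last := window[0], window[len(window)-1]
	return (last.value - first.value) / (last.ts - first.ts)
}

func main() {
	// A counter scraped every 60s over a 5m window, growing by 120
	// per scrape: roughly 2 requests per second.
	window := []sample{{0, 0}, {60, 120}, {120, 240}, {180, 360}, {240, 480}, {300, 600}}
	fmt.Println(simpleRate(window)) // 2
}
```

increase() over the same window is just this per-second rate multiplied by the window length.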
Common Functions
# Calculate rate (per second)
rate(http_requests_total[5m])
# Calculate increase
increase(http_requests_total[1h])
# Average
avg(http_requests_total)
# Sum
sum(http_requests_total)
# Maximum
max(http_requests_total)
# Express a 0/1 series (e.g. up) as a percentage
(up == bool 1) * 100
Aggregation
# Sum by label
sum by (status) (http_requests_total)
# Average by label
avg by (instance) (cpu_usage)
# Group without label
sum without (instance) (http_requests_total)
Common Use Cases
# CPU usage percentage (node-level, via node-exporter)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# Availability
avg(up) * 100
# Pod restart count
increase(kube_pod_container_status_restarts_total[1h])
AlertManager Integration
AlertManager handles alerts from Prometheus:
# PrometheusRule example
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts
  namespace: monitoring
spec:
  groups:
    - name: example
      rules:
        - alert: HighCPUUsage
          expr: |
            100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU usage detected"
            description: "CPU usage is above 80% for 5 minutes"
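Alert routing itself lives in the Alertmanager configuration (with the Operator, typically supplied via an alertmanager.yaml Secret or an AlertmanagerConfig resource). A minimal sketch, assuming a hypothetical webhook receiver; newer Alertmanager versions prefer matchers over match:

```yaml
route:
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h
  receiver: default
  routes:
    - match:
        severity: warning
      receiver: team-webhook   # illustrative receiver name
receivers:
  - name: default
  - name: team-webhook
    webhook_configs:
      - url: http://alert-handler.monitoring.svc:8080/alerts  # hypothetical endpoint
```

Alerts whose severity label is warning (such as HighCPUUsage above) would be routed to the webhook; everything else falls through to the default receiver.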
Accessing Prometheus
Port Forwarding
kubectl port-forward -n monitoring svc/prometheus 9090:9090
Then access: http://localhost:9090
Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: prometheus
namespace: monitoring
spec:
rules:
- host: prometheus.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: prometheus
port:
number: 9090
Common Metrics Sources
kube-state-metrics
Exposes Kubernetes object metrics:
# Install kube-state-metrics (the standard manifests also include the
# required ServiceAccount, RBAC, and Service, not just the Deployment)
kubectl apply -k https://github.com/kubernetes/kube-state-metrics
node-exporter
Exposes hardware and OS metrics from nodes:
# Install node-exporter
kubectl apply -f node-exporter-daemonset.yaml
cAdvisor
Built into kubelet, exposes container metrics:
- Accessible via kubelet's /metrics/cadvisor endpoint
- Provides container CPU, memory, network, and filesystem metrics
Best Practices
- Use labels wisely - Labels add dimensions but also increase cardinality
- Set appropriate scrape intervals - Balance between freshness and resource usage (default: 1m)
- Use ServiceMonitor/PodMonitor - Leverage Prometheus Operator for declarative configuration
- Monitor Prometheus itself - Ensure Prometheus is healthy and not overloaded
- Retention policy - Configure data retention based on your needs (default: 15 days)
- Resource limits - Set appropriate resource requests and limits for Prometheus
- High availability - Run multiple Prometheus instances for production clusters
- Secure metrics endpoints - Use authentication/authorization for production
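With the Prometheus Operator, several of these practices (high availability, retention, resource limits) are set declaratively on the Prometheus custom resource. A sketch with illustrative values:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2                  # HA pair
  retention: 30d               # keep data for 30 days (default: 15d)
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      memory: 4Gi
  serviceAccountName: prometheus
  serviceMonitorSelector: {}   # empty selector matches all ServiceMonitors
```

The Operator renders this into StatefulSets and the corresponding --storage.tsdb.retention.time flags, so retention and sizing are managed in one place.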
Troubleshooting
Prometheus Not Scraping Targets
# Check Prometheus targets page
# Go to Status > Targets in Prometheus UI
# Check ServiceMonitor/PodMonitor
kubectl get servicemonitor -A
kubectl get podmonitor -A
# Check Prometheus logs
kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus
High Cardinality
Too many unique label combinations can cause performance issues:
# Check cardinality
count({__name__=~".+"})
# Identify high-cardinality metrics
topk(10, count by (__name__)({__name__=~".+"}))
Storage Issues
# Check Prometheus storage
kubectl exec -n monitoring prometheus-0 -- du -sh /prometheus
# Configure retention
# Add to Prometheus args: --storage.tsdb.retention.time=30d
See Also
- Grafana - Visualizing Prometheus metrics
- Metrics Server - Kubernetes resource metrics
- OpenTelemetry - Unified observability