HPA v2beta2: Custom Metrics & External Metrics API

Introduction
When Kubernetes 1.12 shipped in September 2018, it brought Horizontal Pod Autoscaler v2beta2 with a critical advancement: mature support for custom and external metrics. While earlier alpha and beta versions of the v2 API had introduced the concepts, v2beta2 made them production-viable, with solid support for both the Custom Metrics API and the External Metrics API.
What made this release significant wasn’t just the API stability—it was the ecosystem that had grown around it. Prometheus adapters were maturing, teams were instrumenting applications with custom metrics, and the patterns for scaling on business signals (QPS, queue depth, revenue per second) were becoming standard.
This was the release where autoscaling moved from “scale on CPU” to “scale on anything that matters to your business.”
Why this mattered in 2018
- CPU wasn’t enough: modern applications (APIs, workers, microservices) needed to scale on signals that predicted load, not just reflected it. Request rate, queue depth, and business KPIs were more accurate scaling signals.
- Prometheus adoption: Prometheus had become the de facto metrics standard. Teams wanted to scale on Prometheus metrics without building custom tooling.
- Cloud service integration: scaling based on cloud service metrics (SQS queue depth, CloudWatch alarms) enabled event-driven architectures.
- Multi-metric policies: combining CPU, memory, custom metrics, and external metrics in a single HPA policy enabled sophisticated scaling strategies.
Custom Metrics API Maturity
HPA v2beta2 stabilized support for the Custom Metrics API, an aggregated Kubernetes API that exposes application and infrastructure metrics alongside the standard resource Metrics API:
- API stability: the custom.metrics.k8s.io/v1beta1 API is now beta-stable, meaning no breaking changes are expected before GA.
- Prometheus adapter patterns: standard patterns emerged for exposing Prometheus metrics via the Custom Metrics API (e.g., prometheus-adapter).
- Metric naming: conventions for metric names (e.g., http_requests_per_second, queue_messages) became standardized.
- Aggregation: adapters aggregate metrics across pods (average, sum) to provide pod-level or object-level metrics.
Example Custom Metric:
# Prometheus metric: http_requests_per_second
# Exposed via Custom Metrics API as:
# custom.metrics.k8s.io/v1beta1
# namespaces/default/pods/*/http_requests_per_second
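To confirm the adapter is actually serving the metric, you can query the aggregated API directly. A minimal sketch, assuming the default namespace and the metric name above:
# List per-pod values of http_requests_per_second via the Custom Metrics API
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second"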
External Metrics API
HPA v2beta2 supports the External Metrics API, enabling scaling on metrics from outside the cluster:
- Cloud service metrics: scale on AWS SQS queue depth, GCP Pub/Sub message count, Azure Queue length.
- External systems: scale on metrics from external monitoring systems (Datadog, New Relic) via adapters.
- Business metrics: scale on business KPIs (revenue per second, active users) exposed via external systems.
Example External Metric:
# AWS SQS queue depth
# Exposed via External Metrics API as:
# external.metrics.k8s.io/v1beta1
# namespaces/default/sqs_queue_depth
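As with custom metrics, the external metric can be checked through the aggregated API before an HPA depends on it. A minimal sketch, assuming the metric name above is exposed by your external metrics adapter:
# Read sqs_queue_depth via the External Metrics API
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/sqs_queue_depth"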
Multi-Metric HPA Policies
HPA v2beta2 allows combining multiple metrics in a single scaling policy:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # Resource metric (CPU)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Custom metric (requests per second)
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  # External metric (queue depth)
  - type: External
    external:
      metric:
        name: sqs_queue_depth
      target:
        type: Value
        value: "1000"
Scaling behavior: HPA computes a desired replica count for each metric independently and scales to the highest one. If CPU is at 80% (above the 70% target) while QPS is 50 per pod (below the 100 target), the CPU metric dominates and HPA scales up.
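A quick worked example of that decision, using the standard HPA formula desiredReplicas = ceil(currentReplicas × currentValue / targetValue) and assuming 10 current replicas:
# CPU:  ceil(10 * 80 / 70)  = 12 replicas
# QPS:  ceil(10 * 50 / 100) = 5 replicas
# HPA takes the maximum across all metrics and scales to 12 replicas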
Behavior Fields: Scale-Up and Scale-Down Policies
The v2beta2 API also gained behavior fields (added later, in Kubernetes 1.18) to control scaling speed and prevent thrashing:
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0   # Scale up immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max               # Use the most aggressive policy
  scaleDown:
    stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
    selectPolicy: Min               # Use the most conservative policy
Key features:
- Stabilization windows: wait before scaling to avoid reacting to brief spikes/dips.
- Multiple policies: define different scaling rates (percent-based, pod count-based).
- Policy selection: choose most aggressive (scale-up) or conservative (scale-down) policy.
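A hedged walk-through of what the behavior block above permits, assuming the workload currently runs 10 replicas:
# Scale-up: Percent 100 allows adding 10 pods (100% of 10) per 15s; Pods allows adding 4.
#           selectPolicy: Max picks the larger, so up to 10 pods can be added per 15s period.
# Scale-down: after the 300s stabilization window, at most 50% of current replicas
#             (5 pods at 10 replicas) can be removed per 60s period.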
Real-World Patterns
Scaling Web APIs on Request Rate
Pattern: Scale on HTTP requests per second instead of CPU to scale proactively before latency spikes.
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"   # Scale when average QPS > 100 per pod
Why it works: Request rate increases before CPU, giving you time to scale before users experience latency.
Scaling Workers on Queue Depth
Pattern: Scale worker pods based on message queue depth (SQS, RabbitMQ, Kafka).
metrics:
- type: External
  external:
    metric:
      name: sqs_queue_depth
    target:
      type: AverageValue
      averageValue: "10"   # Scale when queue has > 10 messages per pod
Why it works: Queue depth directly indicates work backlog. Scaling on queue depth ensures workers are ready when messages arrive.
Combining CPU and Custom Metrics
Pattern: Use CPU as a safety net, custom metrics as primary signal.
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: "100"
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 90   # Safety limit: don't let CPU exceed 90%
Why it works: Custom metric (QPS) provides proactive scaling; CPU provides safety limit to prevent overloading pods.
Prometheus Adapter Patterns
The Prometheus adapter became the standard way to expose Prometheus metrics via Custom Metrics API:
- Deploy Prometheus adapter: install an adapter that queries Prometheus and exposes metrics via the Custom Metrics API.
- Configure metric discovery: the adapter discovers Prometheus metrics matching patterns (e.g., http_requests_total).
- Expose as Custom Metrics: the adapter exposes metrics as custom.metrics.k8s.io/v1beta1 resources.
- HPA consumes metrics: HPA queries the Custom Metrics API and scales based on metric values.
Example adapter configuration:
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
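Before pointing an HPA at the resulting metric, it is worth running the query the adapter will generate directly against Prometheus. A rough sketch of how the templated metricsQuery expands (the label values here are illustrative, not taken from a real deployment):
# Roughly what the adapter queries for pods of the api-server Deployment:
sum(rate(http_requests_total{namespace="default",pod=~"api-server-.*"}[2m])) by (pod)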
A practical rollout pattern
- Start with CPU-based HPA: establish baseline autoscaling behavior with CPU metrics before adding custom metrics.
- Instrument applications: add Prometheus metrics (request rate, queue depth) to applications.
- Deploy metrics adapter: install the Prometheus adapter and verify custom metrics appear in kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 (see the commands sketched after this list).
- Add one custom metric: start with one custom metric per HPA (e.g., QPS) and validate behavior.
- Gradually add more metrics: add additional custom/external metrics once you’re confident in the first.
- Tune behavior policies: adjust scale-up/down policies based on observed scaling patterns.
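A minimal sketch of the verification commands referenced in the steps above, assuming the api-hpa example from earlier:
# Confirm the Custom Metrics API is registered and serving metrics
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

# Watch what the HPA sees and how it reacts
kubectl describe hpa api-hpa
kubectl get hpa api-hpa --watch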
Recommended Architecture (2018)
- Metrics Collection: Prometheus scrapes metrics from applications and infrastructure.
- Metrics Adapter: Prometheus adapter exposes Prometheus metrics via Custom Metrics API.
- External Metrics Adapter: Cloud provider adapters (AWS, GCP, Azure) expose external metrics via External Metrics API.
- HPA Definitions: Use the autoscaling/v2beta2 API with multiple metrics and behavior policies.
- Observability: Monitor HPA scaling events and metric values to validate autoscaling behavior.
Caveats & Tuning
- Metric lag: Custom metrics from Prometheus may have 30-60 second lag. For fast-scaling workloads, consider reducing scrape intervals.
- Metric accuracy: Ensure Prometheus metrics are accurate (no double-counting, correct labels) before using for autoscaling.
- Multi-metric conflicts: When using multiple metrics, HPA scales to satisfy the highest requirement. Ensure metrics are aligned (don’t mix conflicting signals).
- Adapter performance: Metrics adapters query Prometheus on every HPA evaluation. High-frequency evaluations can overload Prometheus. Tune adapter caching and evaluation intervals.
- Beta API stability: v2beta2 is beta—APIs may change before GA. Test thoroughly and have rollback plans.
Common failure modes (learned the hard way)
- “HPA doesn’t scale on custom metric”: the metrics adapter isn’t exposing the metric correctly. Verify with kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1.
- “Metric values seem wrong”: the Prometheus query in the adapter configuration may be incorrect. Test Prometheus queries directly before using them in the adapter.
- “HPA scales too aggressively”: Multiple metrics may conflict (e.g., CPU and QPS scaling in opposite directions). Review metric targets and ensure alignment.
- “Scaling is too slow”: Metric lag or long stabilization windows delay scaling. Reduce scrape intervals or decrease stabilization windows for scale-up.
- “Adapter overloads Prometheus”: High-frequency HPA evaluations cause too many Prometheus queries. Increase adapter caching or reduce evaluation frequency.
Conclusion
HPA v2beta2 in Kubernetes 1.12 marked the maturity of custom metrics autoscaling. With beta-stable Custom Metrics and External Metrics APIs, teams could scale on application metrics, queue depth, and cloud service metrics, not just CPU. Combined with behavior policies and multi-metric support, HPA v2beta2 enabled sophisticated autoscaling strategies that aligned with business needs. While still beta, v2beta2 was stable enough for production use and laid the groundwork for the eventual GA of HPA v2 (autoscaling/v2 in Kubernetes 1.23).