HPA v2 GA: Production-Ready Autoscaling Patterns

Introduction
By August 2020, Horizontal Pod Autoscaler v2 had been in beta for over two years, powering production workloads that scaled on metrics beyond CPU. With Kubernetes 1.19, HPA v2 finally reached General Availability (GA), signaling that custom metrics autoscaling was ready for enterprise production.
What made HPA v2 GA significant wasn’t just API stability—it was the ecosystem maturity around it. Prometheus adapters were battle-tested, metrics pipelines were standardized, and teams had learned the operational patterns that made HPA reliable at scale. The GA milestone validated that autoscaling on custom metrics had moved from “experimental” to “production-ready.”
Why this mattered in 2020
- Custom metrics became standard: scaling on QPS, queue depth, or business KPIs was no longer a niche requirement—it was expected for modern applications.
- Cost optimization pressure: with cloud costs under scrutiny, right-sizing HPA min/max replicas and choosing accurate scaling signals became critical.
- Observability maturity: Prometheus, Grafana, and metrics adapters were standard infrastructure, making custom metrics accessible to every team.
- Multi-dimensional scaling: combining HPA with Cluster Autoscaler and VPA required understanding how they interact in production.
What Changed from Beta to GA
HPA v2beta2 → v2 GA brought several important changes:
- API stability: the autoscaling/v2 API is now stable, meaning no breaking changes in future Kubernetes versions.
- Behavior fields stable: scaleUp and scaleDown policies with stabilization windows are now GA features.
- External metrics maturity: scaling on cloud service metrics (SQS queue depth, CloudWatch metrics) is production-ready.
- Multi-metric support: HPAs can now reliably combine CPU, memory, custom, and external metrics in a single policy.
Production Metrics Pipeline
A reliable HPA setup requires a complete metrics pipeline:
- Metrics Collection: Prometheus scrapes metrics from applications and infrastructure.
- Metrics Adapter: Prometheus adapter (or custom adapter) exposes metrics via Kubernetes Metrics API.
- Metrics Server: Provides resource metrics (CPU, memory) from node and pod usage.
- HPA Controller: Queries Metrics API and makes scaling decisions based on target values.
# Example: HPA scaling on custom metric (requests per second)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Min
Combining HPA with Cluster Autoscaler
HPA and Cluster Autoscaler work together to create a complete autoscaling solution:
- HPA scales pods: When metrics exceed thresholds, HPA increases replica count.
- Cluster Autoscaler scales nodes: When pods are pending due to insufficient resources, CA adds nodes.
- Coordination: HPA’s maxReplicas should either fit within existing cluster capacity, or Cluster Autoscaler must be able to add nodes to accommodate HPA’s decisions.
Best practice: Set HPA maxReplicas based on expected peak load, and configure Cluster Autoscaler with appropriate node group limits to handle that capacity.
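Concretely, the two sets of limits have to agree. Here is a minimal sketch of the Cluster Autoscaler side, assuming AWS with static node-group discovery; the node-group name, bounds, and image tag are illustrative placeholders, not a recommendation:

# Illustrative cluster-autoscaler container spec (the flags are real CA flags;
# the node-group name and sizing are placeholders).
containers:
- name: cluster-autoscaler
  image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.19.0
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --nodes=3:20:api-node-group       # min:max:node-group-name
  - --scale-down-unneeded-time=10m    # conservative, mirrors HPA scale-down

If the HPA above can reach 50 replicas, the node group’s maximum of 20 nodes has to be sized so those pods actually fit; otherwise scale-ups stall in Pending.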
Metrics Selection: Choosing the Right Signal
Not all metrics are created equal for autoscaling:
- CPU utilization: Good baseline, but can lag behind actual load. Works well for CPU-bound workloads.
- Request rate (QPS): Excellent for web APIs—scales proactively before CPU spikes. Requires application instrumentation.
- Queue depth: Perfect for workers processing queues. Scales based on backlog, not current processing rate.
- Latency (p95/p99): Can indicate when to scale, but be careful—high latency might mean other issues, not just a need for more replicas.
- Business metrics: Revenue per second, active users—powerful but requires careful correlation with resource needs.
Rule of thumb: Choose metrics that predict load, not just reflect it. QPS often scales before CPU, giving you headroom.
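For the queue-depth case, scaling typically runs through an External metric exposed by an adapter (for example prometheus-adapter or a cloud-metrics adapter). A hedged sketch; the metric name sqs_messages_visible and the queue selector are assumptions about what your adapter exposes:

# Sketch: scale queue workers on backlog, not on CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 30
  metrics:
  - type: External
    external:
      metric:
        name: sqs_messages_visible   # hypothetical adapter-exposed metric
        selector:
          matchLabels:
            queue: orders            # hypothetical label
      target:
        type: AverageValue
        averageValue: "30"           # aim for ~30 backlogged messages per worker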
Behavior Policies: Avoiding Thrashing
HPA v2’s behavior fields prevent the classic autoscaling problem: scaling up and down rapidly (thrashing).
- Stabilization windows: Wait before scaling down to avoid reacting to brief spikes.
- Scale-up policies: Allow aggressive scale-up (100% increase) to handle traffic bursts quickly.
- Scale-down policies: Be conservative (50% decrease) to avoid over-scaling down during temporary dips.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # Scale up immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
A practical rollout pattern
- Start with CPU-based HPA: Even if you plan to use custom metrics, begin with CPU to establish baseline scaling behavior.
- Instrument applications: Add Prometheus metrics (request rate, queue depth) to your applications before enabling custom metric scaling.
- Deploy metrics adapter: Install Prometheus adapter and verify custom metrics appear in kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 (a sample adapter rule follows this list).
- Add custom metrics gradually: Start with one custom metric per HPA, validate behavior, then add more.
- Tune behavior policies: Adjust stabilization windows and scale policies based on observed scaling patterns.
- Monitor scaling events: Watch HPA events and metrics to catch thrashing, lag, or incorrect scaling decisions.
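For step 3, the piece that makes http_requests_per_second (used in the HPA above) appear in the custom metrics API is an adapter rule. A sketch in the kubernetes-sigs/prometheus-adapter config format, assuming the application exports an http_requests_total counter:

# Sketch: prometheus-adapter rule turning a counter into a per-second rate;
# the series name http_requests_total is an assumption about your instrumentation.
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"            # exposed as http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'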
Recommended Architecture (2020)
- Metrics Stack: Prometheus for collection, Prometheus adapter for Metrics API, metrics-server for resource metrics.
- HPA Definitions: Use the autoscaling/v2 API with behavior policies, multiple metrics, and appropriate min/max replicas.
- Cluster Autoscaler: Configure CA to work with HPA’s scaling decisions, ensuring nodes are available when HPA scales up.
- Observability: Monitor HPA scaling events, pending pods, and node utilization to validate autoscaling behavior.
Caveats & Tuning
- Metric lag: Custom metrics from Prometheus may have 30-60 second lag. For fast-scaling workloads, consider reducing scrape intervals or using resource metrics.
- Min replicas too low: Setting minReplicas: 1 for high-traffic services can cause cold starts and latency spikes. Set it based on baseline load.
- Max replicas too high: An unbounded maxReplicas can cause cost overruns. Set it based on expected peak capacity and budget constraints.
- Multiple metrics conflicts: With multiple metrics, HPA computes a desired replica count for each metric and uses the largest. Ensure your metrics are aligned (e.g., don’t mix CPU and QPS if they pull in different directions).
- Resource requests accuracy: HPA’s Utilization targets are computed relative to each container’s resource requests. If requests are wrong, HPA scaling will be wrong too (see the sketch below).
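For example, with the 70% CPU target in the HPA above, a pod requesting 500m CPU is considered over-target once its average usage passes 350m. A minimal Deployment fragment showing where those requests live (name and image are hypothetical):

# Sketch: the HPA's Utilization math is relative to these requests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: example.com/api:1.0   # hypothetical image
        resources:
          requests:
            cpu: 500m                # 70% target = scale-up above 350m average
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 512Mi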
Common failure modes (learned the hard way)
- “HPA scales up but pods stay pending”: Cluster Autoscaler isn’t configured or can’t add nodes fast enough. Check CA logs and node group capacity.
- “Thrashing: constant scale up/down”: Behavior policies are too aggressive or stabilization windows too short. Increase scaleDown.stabilizationWindowSeconds.
- “HPA doesn’t scale on custom metric”: The metrics adapter isn’t exposing the metric correctly. Verify with kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1.
- “Scaling is too slow”: Metric lag or long stabilization windows delay scaling. Reduce scrape intervals or decrease stabilization windows for scale-up.
- “HPA scales but service is still slow”: Scaling signal (e.g., CPU) may lag behind actual load. Switch to request rate or queue depth metrics.
Conclusion
HPA v2 GA in Kubernetes 1.19 marked the maturity of custom metrics autoscaling. With stable APIs, behavior policies, and a mature metrics ecosystem, teams could confidently deploy production autoscaling that scaled on business metrics, not just CPU. Combined with Cluster Autoscaler and VPA, HPA v2 GA enabled true multi-dimensional autoscaling that optimized both performance and cost in production Kubernetes clusters.