Autoscaling Best Practices: From Theory to Production

Introduction
By 2022, autoscaling had moved from “experimental feature” to “standard practice” in Kubernetes. HPA v2 was GA, VPA was mature, the Cluster Autoscaler was battle-tested, and KEDA was an established CNCF project. But with mainstream adoption came a new problem: teams were learning the same hard lessons over and over.
This post distills the patterns that actually work in production—the ones that prevent “autoscaling incidents” and turn autoscaling from a cost center into a reliability and efficiency tool. These aren’t theoretical best practices; they’re the patterns that separate teams with smooth autoscaling from teams fighting constant fires.
Why this mattered in 2022
- Autoscaling was everywhere: most production clusters used some form of autoscaling, but results varied wildly. Some teams saved 40% on costs; others had constant incidents.
- Tool maturity: with stable tools, the difference between success and failure was operational knowledge, not tool capabilities.
- Cost pressure: cloud bills were under scrutiny, making efficient autoscaling a business requirement, not just a technical nice-to-have.
- Incident patterns: common failure modes (thrashing, pending pods, OOM kills) were well-documented but still happening because teams didn’t know the patterns.
Right-Sizing Resource Requests
The foundation of good autoscaling: autoscalers make decisions based on resource requests. If requests are wrong, autoscaling will be wrong too.
- Set requests based on P95 usage, not averages: averages hide spikes. Use P95/P99 usage over a week to set requests, ensuring headroom for normal variation.
- Don’t over-provision “just to be safe”: over-provisioned requests lead to wasted capacity and poor bin-packing. Use VPA recommendations or historical data to set accurate requests.
- Separate requests from limits: requests drive scheduling and autoscaling decisions; limits are a safety ceiling. Set requests close to real (P95) usage and set limits comfortably above them.
- Review requests quarterly: workload patterns change. Regular reviews prevent drift from optimal resource allocation.
Example: If your app uses 200m CPU on average but spikes to 500m during traffic bursts, set requests to 300-400m (P95), not 200m (average) or 1000m (“just to be safe”).
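To make this concrete, here is a minimal container resources sketch following that guidance; the values are illustrative, not a recommendation for any particular workload:

resources:
  requests:
    cpu: 350m        # roughly P95 of observed usage, with headroom over the 200m average
    memory: 256Mi    # illustrative; size from observed P95 memory
  limits:
    cpu: "1"         # safety ceiling, set more generously than the request
    memory: 512Mi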
PodDisruptionBudgets: Preventing Scale-Down Disasters
PDBs are your safety net when autoscalers remove capacity:
- Set PDBs for all production workloads: prevent Cluster Autoscaler and VPA from evicting too many pods simultaneously.
- Use minAvailable for critical services: minAvailable: 2 ensures at least 2 pods stay running during scale-down, preventing service interruption.
- Use maxUnavailable for less critical workloads: maxUnavailable: 1 allows one pod to be evicted at a time, enabling more aggressive consolidation.
- Don’t set PDBs too strict: minAvailable: 100% prevents all evictions, blocking Cluster Autoscaler from removing nodes. This defeats the purpose of autoscaling.
Common mistake: Setting minAvailable: 100% “to be safe” prevents Cluster Autoscaler from working, leading to idle nodes and higher costs.
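A minimal PDB for a critical service might look like the following sketch (the name and labels are placeholders; adjust the selector to match your Deployment's pods):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb            # hypothetical name
spec:
  minAvailable: 2               # keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: checkout             # must match the Deployment's pod labels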
Metrics Selection: Choosing Signals That Predict Load
Not all metrics are equal for autoscaling. Choose metrics that predict load, not just reflect it:
- Request rate (QPS): excellent for web APIs. Scales proactively before CPU spikes, giving headroom for traffic increases.
- Queue depth: perfect for workers. Scales based on backlog, ensuring workers are ready when messages arrive.
- CPU utilization: good baseline but can lag. Works well for CPU-bound workloads but may scale too late for request-driven services.
- Latency (p95/p99): can indicate when to scale, but be careful—high latency might mean other issues (database, network), not just need for more replicas.
- Business metrics: revenue per second, active users—powerful but requires correlation with resource needs.
Rule of thumb: If your metric lags behind actual load, you’ll scale too late. Choose metrics that lead load, not follow it.
Example: For a web API, scaling on QPS (requests per second) is better than CPU because QPS increases before CPU, giving you time to scale before latency spikes.
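As a sketch of what request-rate scaling can look like: an autoscaling/v2 HPA driven by a per-pod metric, assuming a metrics adapter already exposes a metric named http_requests_per_second through the custom metrics API (the metric name, target, and bounds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # assumed metric exposed via the custom metrics API
      target:
        type: AverageValue
        averageValue: "150"              # illustrative target: ~150 req/s per pod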
Stabilization Windows: Tuning Scale-Up and Scale-Down
HPA’s behavior fields prevent thrashing by controlling how quickly autoscaling reacts:
- Scale-up: be aggressive: allow fast scale-up (100% increase, 0-60s stabilization) to handle traffic bursts quickly. Users notice slow scale-up; they rarely notice a few minutes of extra capacity.
- Scale-down: be conservative: use longer stabilization windows (300-600s) and smaller decreases (50%) to avoid scaling down during temporary dips.
- Tune based on workload patterns: batch jobs can scale down faster; user-facing APIs should scale down slower to handle traffic spikes.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0   # Scale up immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
Common mistake: Setting identical scale-up and scale-down policies causes thrashing—scaling up on a spike, then immediately scaling down when the spike ends.
Observability: What to Watch
Autoscaling is only as good as your observability. Monitor these key signals:
- Pending pods: if pods stay pending, Cluster Autoscaler isn’t keeping up or can’t add nodes. This is a critical signal.
- HPA scaling events: watch kubectl get events for HPA scaling decisions. Look for rapid scale-up/down cycles (thrashing).
- Node utilization: track average node CPU/memory utilization. Low utilization (<50%) suggests over-provisioning; high utilization (>80%) suggests risk of pending pods.
- Scaling lag: measure time from metric threshold breach to new pods ready. Long lag (>5 minutes) indicates slow scaling.
- Cost metrics: track cluster cost per request/transaction. Autoscaling should reduce this over time.
Dashboard example: Create a Grafana dashboard with:
- Pending pod count over time
- HPA replica count vs. target metric
- Node utilization heatmap
- Scaling events timeline
- Cost per request trend
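To turn the pending-pod signal into an alert, one option is a Prometheus alerting rule. A sketch, assuming the Prometheus Operator and kube-state-metrics are installed (names and thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-alerts          # hypothetical name
spec:
  groups:
  - name: autoscaling
    rules:
    - alert: PodsPendingTooLong
      expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
      for: 10m                      # pending this long usually means the cluster can't add capacity
      labels:
        severity: warning
      annotations:
        summary: "Pods have been Pending for more than 10 minutes"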
Min/Max Replica Bounds
Setting appropriate min/max replicas is critical:
- Min replicas: set based on baseline load, not zero. minReplicas: 1 for high-traffic services causes cold starts and latency spikes. Use baseline load + 20% headroom.
- Max replicas: set based on expected peak capacity and budget. An overly high maxReplicas can cause cost overruns during traffic spikes or metric misconfigurations.
- Review bounds quarterly: as traffic patterns change, adjust min/max replicas to match new baselines and peaks.
Example: If your API handles 1000 req/s baseline and 5000 req/s peak, and each pod handles 200 req/s, set minReplicas: 6 (1000/200 + 20%) and maxReplicas: 30 (5000/200 + 20% headroom).
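Applied to that example, the bounds in the HPA spec would look like this (numbers taken from the illustration above):

spec:
  minReplicas: 6    # 1000 req/s baseline / 200 req/s per pod = 5, plus ~20% headroom
  maxReplicas: 30   # 5000 req/s peak / 200 req/s per pod = 25, plus ~20% headroom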
Common Failure Modes and How to Debug
“Pods stay pending forever”
Symptoms: Pods in Pending state, HPA scaled up but new pods can’t schedule.
Causes:
- Cluster Autoscaler not configured or can’t add nodes (node group limits, IAM permissions)
- Resource requests too large for available instance types
- Node selectors/taints preventing scheduling
Debug:
kubectl describe pod <pending-pod> # Check events for scheduling failures
kubectl logs -n kube-system -l app=cluster-autoscaler # Check CA logs
kubectl get nodes # Check available capacity
Fix: Configure Cluster Autoscaler, reduce resource requests, or adjust node selectors.
“Thrashing: constant scale up/down”
Symptoms: HPA rapidly scales up, then down, then up again in cycles.
Causes:
- Metric lag causing delayed reactions
- Stabilization windows too short
- Metric noise (brief spikes causing scale-up, then immediate scale-down)
Debug:
kubectl get hpa <name> -w # Watch HPA decisions
kubectl describe hpa <name> # Check current metrics and targets
Fix: Increase scaleDown.stabilizationWindowSeconds, reduce metric scrape intervals, or filter metric noise.
“HPA doesn’t scale on custom metric”
Symptoms: HPA shows “unknown” or doesn’t scale despite metric values exceeding thresholds.
Causes:
- Metrics adapter not exposing metric correctly
- Metric name mismatch between HPA and adapter
- Metrics API not configured
Debug:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 # List available metrics
kubectl describe hpa <name> # Check HPA status and events
Fix: Verify metrics adapter configuration, check metric names match, ensure Metrics API is working.
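If the metric simply isn't exposed, the adapter's rule configuration is usually the culprit. As a sketch, assuming the Prometheus adapter and an application counter named http_requests_total, a rule like this derives a per-second rate the HPA can consume:

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"            # exposed as http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'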
“Scaling is too slow”
Symptoms: Traffic spikes cause latency before HPA scales up.
Causes:
- Long stabilization windows
- Metric lag (30-60s is common)
- Slow pod startup time
Fix: Reduce scaleUp.stabilizationWindowSeconds, decrease metric scrape intervals, optimize pod startup time (smaller images, readiness probes).
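Pod startup time is partly a probe-tuning problem: new replicas only receive traffic once they pass readiness. A sketch of a responsive readiness probe, with an assumed /healthz endpoint and illustrative timings:

readinessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080              # assumed container port
  initialDelaySeconds: 2    # avoid long fixed delays that slow effective scale-up
  periodSeconds: 5
  failureThreshold: 3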
A practical rollout pattern
- Start conservative: begin with wide min/max bounds, long stabilization windows, and CPU-based metrics; a starter manifest is sketched after this list. Validate behavior before optimizing.
- Instrument early: add Prometheus metrics (QPS, queue depth) to applications before enabling custom metric scaling.
- Monitor aggressively: watch scaling events, pending pods, and node utilization for the first week after enabling autoscaling.
- Tune gradually: reduce stabilization windows, add custom metrics, and tighten bounds based on observed behavior.
- Document decisions: record why you chose specific metrics, bounds, and policies. This helps when debugging issues later.
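A conservative starting point along these lines might look like the following sketch (names, bounds, and targets are placeholders to be tuned for your workload):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 4                      # wide bounds to start; tighten once behavior is understood
  maxReplicas: 40
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60        # simple CPU target before moving to custom metrics
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600 # long window while validating; shorten later if safe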
Recommended Architecture
- Metrics pipeline: Prometheus for collection, the Prometheus adapter for the custom metrics API, metrics-server for resource metrics.
- Autoscalers: HPA for replicas, VPA (in Off mode) for resource recommendations (a sketch follows this list), Cluster Autoscaler for nodes.
- Observability: Grafana dashboards for scaling events, pending pods, and node utilization. Alerts for pending pods and thrashing.
- Safety rails: PDBs for all production workloads, appropriate min/max replica bounds, and resource request policies.
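For the recommendation-only VPA mentioned above, a minimal sketch (assuming the VPA CRDs and recommender are installed; the target name is a placeholder):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa                 # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"               # recommendation-only: no automatic evictions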
Conclusion
By 2022, autoscaling best practices had crystallized from years of production experience. The difference between successful and problematic autoscaling wasn’t the tools—it was understanding right-sizing, metrics selection, stabilization windows, and observability. Teams that followed these patterns achieved reliable, cost-effective autoscaling. Teams that didn’t learned the hard way through incidents and surprise bills. The patterns in this post are the ones that work in production, not just in theory.