Autoscaling Best Practices: From Theory to Production

Introduction
By 2022, autoscaling had moved from “experimental feature” to “standard practice” in Kubernetes. HPA v2 was GA, VPA was mature, the Cluster Autoscaler was battle-tested, and KEDA was an established CNCF project. But with mainstream adoption came a new problem: teams were learning the same hard lessons over and over.
This post distills the patterns that actually work in production—the ones that prevent “autoscaling incidents” and turn autoscaling from a cost center into a reliability and efficiency tool. These aren’t theoretical best practices; they’re the patterns that separate teams with smooth autoscaling from teams fighting constant fires.
Why this mattered in 2022
- Autoscaling was everywhere: most production clusters used some form of autoscaling, but results varied wildly. Some teams saved 40% on costs; others had constant incidents.
- Tool maturity: with stable tools, the difference between success and failure was operational knowledge, not tool capabilities.
- Cost pressure: cloud bills were under scrutiny, making efficient autoscaling a business requirement, not just a technical nice-to-have.
- Incident patterns: common failure modes (thrashing, pending pods, OOM kills) were well-documented but still happening because teams didn’t know the patterns.
Right-Sizing Resource Requests
The foundation of good autoscaling: autoscalers make decisions based on resource requests. If requests are wrong, autoscaling will be wrong too.
- Set requests based on P95 usage, not averages: averages hide spikes. Use P95/P99 usage over a week to set requests, ensuring headroom for normal variation.
- Don’t over-provision “just to be safe”: over-provisioned requests lead to wasted capacity and poor bin-packing. Use VPA recommendations or historical data to set accurate requests.
- Separate requests from limits: requests drive scheduling and autoscaling decisions; limits are a safety ceiling. Set requests close to real (P95) usage and set limits comfortably above them.
- Review requests quarterly: workload patterns change. Regular reviews prevent drift from optimal resource allocation.
Example: If your app uses 200m CPU on average but spikes to 500m during traffic bursts, set requests to 300-400m (P95), not 200m (average) or 1000m (“just to be safe”).
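To make this concrete, here is a minimal container resources sketch following that guidance; the values are illustrative, not a recommendation for any particular workload:

resources:
  requests:
    cpu: 350m        # roughly P95 of observed usage, with headroom over the 200m average
    memory: 256Mi    # illustrative; size from observed P95 memory
  limits:
    cpu: "1"         # safety ceiling, set more generously than the request
    memory: 512Mi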
PodDisruptionBudgets: Preventing Scale-Down Disasters
PDBs are your safety net when autoscalers remove capacity:
- Set PDBs for all production workloads: prevent Cluster Autoscaler and VPA from evicting too many pods simultaneously.
- Use minAvailable for critical services: minAvailable: 2 ensures at least 2 pods stay running during scale-down, preventing service interruption.
- Use maxUnavailable for less critical workloads: maxUnavailable: 1 allows one pod to be evicted at a time, enabling more aggressive consolidation.
- Don’t set PDBs too strict: minAvailable: 100% prevents all evictions, blocking Cluster Autoscaler from removing nodes. This defeats the purpose of autoscaling.
Common mistake: Setting minAvailable: 100% “to be safe” prevents Cluster Autoscaler from working, leading to idle nodes and higher costs.
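A minimal PDB for a critical service might look like the following sketch (the name and labels are placeholders; adjust the selector to match your Deployment's pods):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-pdb            # hypothetical name
spec:
  minAvailable: 2               # keep at least 2 pods running during voluntary disruptions
  selector:
    matchLabels:
      app: checkout             # must match the Deployment's pod labels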
Metrics Selection: Choosing Signals That Predict Load
Not all metrics are equal for autoscaling. Choose metrics that predict load, not just reflect it:
- Request rate (QPS): excellent for web APIs. Scales proactively before CPU spikes, giving headroom for traffic increases.
- Queue depth: perfect for workers. Scales based on backlog, ensuring workers are ready when messages arrive.
- CPU utilization: good baseline but can lag. Works well for CPU-bound workloads but may scale too late for request-driven services.
- Latency (p95/p99): can indicate when to scale, but be careful—high latency might mean other issues (database, network), not just need for more replicas.
- Business metrics: revenue per second, active users—powerful but requires correlation with resource needs.
Rule of thumb: If your metric lags behind actual load, you’ll scale too late. Choose metrics that lead load, not follow it.
Example: For a web API, scaling on QPS (requests per second) is better than CPU because QPS increases before CPU, giving you time to scale before latency spikes.
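As a sketch of what request-rate scaling can look like: an autoscaling/v2 HPA driven by a per-pod metric, assuming a metrics adapter already exposes a metric named http_requests_per_second through the custom metrics API (the metric name, target, and bounds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # assumed metric exposed via the custom metrics API
      target:
        type: AverageValue
        averageValue: "150"              # illustrative target: ~150 req/s per pod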
Stabilization Windows: Tuning Scale-Up and Scale-Down
HPA’s behavior fields prevent thrashing by controlling how quickly autoscaling reacts:
- Scale-up: be aggressive: allow fast scale-up (100% increase, 0-60s stabilization) to handle traffic bursts quickly. Users notice slow scale-up; they rarely notice a few minutes of extra capacity.
- Scale-down: be conservative: use longer stabilization windows (300-600s) and smaller decreases (50%) to avoid scaling down during temporary dips.
- Tune based on workload patterns: batch jobs can scale down faster; user-facing APIs should scale down slower to handle traffic spikes.
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0   # Scale up immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
Common mistake: Setting identical scale-up and scale-down policies causes thrashing—scaling up on a spike, then immediately scaling down when the spike ends.
Observability: What to Watch
Autoscaling is only as good as your observability. Monitor these key signals:
- Pending pods: if pods stay pending, Cluster Autoscaler isn’t keeping up or can’t add nodes. This is a critical signal.
- HPA scaling events: watch kubectl get events for HPA scaling decisions. Look for rapid scale-up/down cycles (thrashing).
- Node utilization: track average node CPU/memory utilization. Low utilization (<50%) suggests over-provisioning; high utilization (>80%) suggests risk of pending pods.
- Scaling lag: measure time from metric threshold breach to new pods ready. Long lag (>5 minutes) indicates slow scaling.
- Cost metrics: track cluster cost per request/transaction. Autoscaling should reduce this over time.
Dashboard example: Create a Grafana dashboard with:
- Pending pod count over time
- HPA replica count vs. target metric
- Node utilization heatmap
- Scaling events timeline
- Cost per request trend
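To turn the pending-pod signal into an alert, one option is a Prometheus alerting rule. A sketch, assuming the Prometheus Operator and kube-state-metrics are installed (names and thresholds are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-alerts          # hypothetical name
spec:
  groups:
  - name: autoscaling
    rules:
    - alert: PodsPendingTooLong
      expr: sum(kube_pod_status_phase{phase="Pending"}) > 0
      for: 10m                      # pending this long usually means the cluster can't add capacity
      labels:
        severity: warning
      annotations:
        summary: "Pods have been Pending for more than 10 minutes"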
Min/Max Replica Bounds
Setting appropriate min/max replicas is critical:
- Min replicas: set based on baseline load, not zero. minReplicas: 1 for high-traffic services causes cold starts and latency spikes. Use baseline load + 20% headroom.
- Max replicas: set based on expected peak capacity and budget. An overly high maxReplicas can cause cost overruns during traffic spikes or metric misconfigurations.
- Review bounds quarterly: as traffic patterns change, adjust min/max replicas to match new baselines and peaks.
Example: If your API handles 1000 req/s baseline and 5000 req/s peak, and each pod handles 200 req/s, set minReplicas: 6 (1000/200 + 20%) and maxReplicas: 30 (5000/200 + 20% headroom).
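Applied to that example, the bounds in the HPA spec would look like this (numbers taken from the illustration above):

spec:
  minReplicas: 6    # 1000 req/s baseline / 200 req/s per pod = 5, plus ~20% headroom
  maxReplicas: 30   # 5000 req/s peak / 200 req/s per pod = 25, plus ~20% headroom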
Common Failure Modes and How to Debug
“Pods stay pending forever”
Symptoms: Pods in Pending state, HPA scaled up but new pods can’t schedule.
Causes:
- Cluster Autoscaler not configured or can’t add nodes (node group limits, IAM permissions)
- Resource requests too large for available instance types
- Node selectors/taints preventing scheduling
Debug:
kubectl describe pod <pending-pod> # Check events for scheduling failures
kubectl logs -n kube-system -l app=cluster-autoscaler # Check CA logs
kubectl get nodes # Check available capacity
Fix: Configure Cluster Autoscaler, reduce resource requests, or adjust node selectors.
“Thrashing: constant scale up/down”
Symptoms: HPA rapidly scales up, then down, then up again in cycles.
Causes:
- Metric lag causing delayed reactions
- Stabilization windows too short
- Metric noise (brief spikes causing scale-up, then immediate scale-down)
Debug:
kubectl get hpa <name> -w # Watch HPA decisions
kubectl describe hpa <name> # Check current metrics and targets
Fix: Increase scaleDown.stabilizationWindowSeconds, reduce metric scrape intervals, or filter metric noise.
“HPA doesn’t scale on custom metric”
Symptoms: HPA shows “unknown” or doesn’t scale despite metric values exceeding thresholds.
Causes:
- Metrics adapter not exposing metric correctly
- Metric name mismatch between HPA and adapter
- Metrics API not configured
Debug:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 # List available metrics
kubectl describe hpa <name> # Check HPA status and events
Fix: Verify metrics adapter configuration, check metric names match, ensure Metrics API is working.
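If the metric simply isn't exposed, the adapter's rule configuration is usually the culprit. As a sketch, assuming the Prometheus adapter and an application counter named http_requests_total, a rule like this derives a per-second rate the HPA can consume:

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"            # exposed as http_requests_per_second
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'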
“Scaling is too slow”
Symptoms: Traffic spikes cause latency before HPA scales up.
Causes:
- Long stabilization windows
- Metric lag (30-60s is common)
- Slow pod startup time
Fix: Reduce scaleUp.stabilizationWindowSeconds, decrease metric scrape intervals, optimize pod startup time (smaller images, readiness probes).
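Pod startup time is partly a probe-tuning problem: new replicas only receive traffic once they pass readiness. A sketch of a responsive readiness probe, with an assumed /healthz endpoint and illustrative timings:

readinessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080              # assumed container port
  initialDelaySeconds: 2    # avoid long fixed delays that slow effective scale-up
  periodSeconds: 5
  failureThreshold: 3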
A practical rollout pattern
- Start conservative: begin with wide min/max bounds, long stabilization windows, and CPU-based metrics; a starter manifest is sketched after this list. Validate behavior before optimizing.
- Instrument early: add Prometheus metrics (QPS, queue depth) to applications before enabling custom metric scaling.
- Monitor aggressively: watch scaling events, pending pods, and node utilization for the first week after enabling autoscaling.
- Tune gradually: reduce stabilization windows, add custom metrics, and tighten bounds based on observed behavior.
- Document decisions: record why you chose specific metrics, bounds, and policies. This helps when debugging issues later.
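A conservative starting point along these lines might look like the following sketch (names, bounds, and targets are placeholders to be tuned for your workload):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api                       # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 4                      # wide bounds to start; tighten once behavior is understood
  maxReplicas: 40
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60        # simple CPU target before moving to custom metrics
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600 # long window while validating; shorten later if safe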
Recommended Architecture
- Metrics pipeline: Prometheus for collection, the Prometheus adapter for the custom metrics API, metrics-server for resource metrics.
- Autoscalers: HPA for replicas, VPA (in Off mode) for resource recommendations (a sketch follows this list), Cluster Autoscaler for nodes.
- Observability: Grafana dashboards for scaling events, pending pods, and node utilization. Alerts for pending pods and thrashing.
- Safety rails: PDBs for all production workloads, appropriate min/max replica bounds, and resource request policies.
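For the recommendation-only VPA mentioned above, a minimal sketch (assuming the VPA CRDs and recommender are installed; the target name is a placeholder):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa                 # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"               # recommendation-only: no automatic evictions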
Conclusion
By 2022, autoscaling best practices had crystallized from years of production experience. The difference between successful and problematic autoscaling wasn’t the tools—it was understanding right-sizing, metrics selection, stabilization windows, and observability. Teams that followed these patterns achieved reliable, cost-effective autoscaling. Teams that didn’t learned the hard way through incidents and surprise bills. The patterns in this post are the ones that work in production, not just in theory.