Kubernetes Autoscaling in 2016: Cluster Autoscaler & HPA v2

K8s Guru

Introduction

In 2016, keeping a Kubernetes cluster “just right” usually meant watching dashboards, reacting to alerts, and manually adding nodes right when you could least afford the latency. Two additions started to change that story: Cluster Autoscaler, first released for GKE and AWS, and Horizontal Pod Autoscaler v2, which opened the door to metrics beyond CPU.

What made this combination feel different (even in its early days) was that it connected two separate feedback loops: Pods scaling for workload demand and nodes scaling for schedulable capacity. When it worked, it turned “we’re out of room” from a page into a controlled, observable response.

Why this mattered in 2016

  • Cloud costs became a line item: leaving capacity “just in case” was expensive, but running hot meant incident risk.
  • Workloads diversified: batch jobs, APIs, and background consumers don’t scale the same way, and CPU-only autoscaling wasn’t enough.
  • SRE reality: autoscaling was less about chasing peak throughput and more about smoothing the 2am edges (deploy spikes, traffic bursts, noisy neighbors).

Cluster Autoscaler Basics

  • Node Group Awareness: Watches cloud provider autoscaling groups (GCE MIGs, AWS ASGs) to add/remove nodes based on pending pods.
  • Scale-Up Triggers: When a pod cannot be scheduled due to insufficient resources, Cluster Autoscaler simulates scheduling onto a hypothetical new node from each node group and grows a group that would fit the pod.
  • Scale-Down Logic: Periodically checks under-utilized nodes (below configurable thresholds) and drains them if pods can move elsewhere.
  • Graceful Drains: Evicts pods while respecting PodDisruptionBudgets (PDBs), so critical workloads stay up during node removal.
  • Implementation: Runs as a Deployment inside the cluster with cloud credentials to call provider APIs.
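
As a rough, hedged sketch of that deployment model (not the canonical manifest), this is roughly what running Cluster Autoscaler on AWS looked like. The image tag, namespace, region, and ASG name (k8s-main) are placeholders, and a real deployment also needs cloud credentials (an instance IAM role or mounted keys):

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: cluster-autoscaler
      namespace: kube-system
    spec:
      replicas: 1
      template:
        metadata:
          labels:
            app: cluster-autoscaler
        spec:
          containers:
          - name: cluster-autoscaler
            # Placeholder tag: pick the cluster-autoscaler release matching your Kubernetes version.
            image: gcr.io/google_containers/cluster-autoscaler:v0.4.0
            command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            # min:max:ASG-name, as in the flag example later in this post; "k8s-main" is hypothetical.
            - --nodes=1:5:k8s-main
            env:
            # Region for the provider API calls; the value is illustrative.
            - name: AWS_REGION
              value: us-east-1
            resources:
              requests:
                cpu: 100m
                memory: 300Mi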

Horizontal Pod Autoscaler v2

  • Custom Metrics API: Moves beyond CPU utilization to allow metrics like QPS, queue length, or business KPIs.
  • Multiple Metrics: A single HPA can list several metrics; the controller computes a desired replica count for each and scales to the highest.
  • Aggregation: Uses metrics adapters (e.g., Prometheus adapter) to surface custom metrics via the Metrics API.
  • Behavior Tweaks: Controller-level upscale/downscale delays help avoid thrashing; later releases added configurable stabilization windows.
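
To make the shape of the new API concrete, here is a minimal sketch of an autoscaling/v2alpha1 HPA that combines a CPU target with a per-pod custom metric. The Deployment name (api) and the metric name (requests_per_second) are illustrative, and the custom metric only resolves once an adapter is actually serving it:

    apiVersion: autoscaling/v2alpha1
    kind: HorizontalPodAutoscaler
    metadata:
      name: api
    spec:
      scaleTargetRef:
        apiVersion: extensions/v1beta1
        kind: Deployment
        name: api                        # hypothetical workload to scale
      minReplicas: 2
      maxReplicas: 10
      metrics:
      # Classic CPU target: keep average utilization (relative to requests) near 70%.
      - type: Resource
        resource:
          name: cpu
          targetAverageUtilization: 70
      # Per-pod custom metric served by a metrics adapter; the name is illustrative.
      - type: Pods
        pods:
          metricName: requests_per_second
          targetAverageValue: "100"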

A practical rollout pattern

  1. Start with boring signals: CPU-based HPA is still a great baseline while your custom metrics pipeline matures.
  2. Lock in resource requests: autoscalers make decisions off requests/limits; “best effort everywhere” makes results unpredictable (a minimal requests/limits sketch follows this list).
  3. Add safety rails: PodDisruptionBudgets and sane maxUnavailable settings prevent scale-down from becoming self-inflicted downtime.
  4. Observe before trusting: watch scheduling failures, pending time, and node utilization for a week before tightening thresholds.
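
For step 2, the concrete requirement is simply that every container declares requests (and usually limits): HPA utilization targets are computed against requests, and Cluster Autoscaler bin-packs on them. A minimal sketch with illustrative values:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: api                      # hypothetical workload
    spec:
      replicas: 3
      template:
        metadata:
          labels:
            app: api
        spec:
          containers:
          - name: api
            image: example.com/api:1.0   # placeholder image
            resources:
              requests:                  # what HPA percentages and CA packing are based on
                cpu: 250m
                memory: 256Mi
              limits:
                cpu: "1"
                memory: 512Mi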

Setting it up in 2016

  1. Metrics Pipeline: Heapster remained the default metrics collector; Prometheus adapters were emerging for custom metrics.
  2. Autoscaler Deployment: Deploy Cluster Autoscaler with provider-specific flags (e.g., --nodes=1:5:k8s-main for AWS).
  3. HPA Definitions: Define HPAs using autoscaling/v2alpha1, specifying target averages (CPU or custom metric).
  4. PDBs: Configure PodDisruptionBudgets to guide safe node drains.
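
For the PDB step (and the safety rails above), a minimal policy/v1beta1 PodDisruptionBudget that keeps at least two replicas of the hypothetical api workload running during node drains:

    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: api-pdb
    spec:
      minAvailable: 2            # never voluntarily evict below two ready replicas
      selector:
        matchLabels:
          app: api               # matches the hypothetical Deployment above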

Caveats & Tuning

  • Node Provisioning Time: Scale-ups depend on cloud VM launch speed; consider overprovisioning buffers for latency-sensitive workloads (a placeholder-Deployment sketch follows this list).
  • DaemonSets: Cluster Autoscaler accounts for DaemonSet pods when simulating capacity; ensure DaemonSets request resources accurately.
  • Scale-Down Delay: The defaults can be aggressive; tune --scale-down-delay and --scale-down-utilization-threshold to avoid churn.
  • Custom Metrics Maturity: Adapters were still experimental; fall back to CPU-based scaling where reliability matters.
  • Stateful Workloads: Pair with StatefulSets and PDBs to avoid data loss when scaling down nodes.
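
On the provisioning-time caveat, one 2016-era workaround was a small “headroom” Deployment of pause containers whose only job is to hold resource requests: scaling it down frees capacity for real pods immediately while Cluster Autoscaler replaces the nodes in the background. This is a hedged sketch, not an official recipe; the replica count and sizes are illustrative:

    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: capacity-headroom      # hypothetical placeholder workload
    spec:
      replicas: 2
      template:
        metadata:
          labels:
            app: capacity-headroom
        spec:
          containers:
          - name: pause
            image: gcr.io/google_containers/pause-amd64:3.0
            resources:
              requests:            # each replica reserves roughly one "slot" of headroom
                cpu: 500m
                memory: 512Mi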

Common failure modes (learned the hard way)

  • “Pending forever” during scale-up: unschedulable pods that can’t fit anywhere (too large requests, node selector/taints mismatch) won’t be fixed by adding more of the wrong nodes.
  • Thrash after deploys: a big rollout can spike requests briefly; without stabilization windows you can scale up and immediately scale down.
  • PDB deadlocks: overly strict PDBs can block drain/eviction, so scale-down never happens and the cluster quietly drifts bigger.

Conclusion

By late 2016, Kubernetes autoscaling had evolved from a CPU-only experiment into a system that resized both Pods and nodes. Cluster Autoscaler plus HPA v2 laid the foundation for true pay-as-you-go clusters, eventually joined by the Vertical Pod Autoscaler and predictive scaling techniques in later releases.