The Autoscaling Trinity: HPA, VPA, and Cluster Autoscaler Together

Introduction

By 2021, Kubernetes had three mature autoscalers: Horizontal Pod Autoscaler (HPA) for scaling replicas, Vertical Pod Autoscaler (VPA) for right-sizing resources, and Cluster Autoscaler (CA) for adding nodes. Each solved a different problem, but using them together—the “autoscaling trinity”—promised true pay-as-you-go clusters that optimized both performance and cost.

The challenge wasn’t that these tools didn’t work individually; it was that they could conflict when used together. HPA scales replicas based on utilization of the resource requests that VPA might change underneath it. VPA evicts pods that Cluster Autoscaler just provisioned nodes for. Cluster Autoscaler adds nodes for the pending pods that HPA keeps creating.

What made 2021 the right time to tackle this was maturity: HPA v2 was GA, VPA had stable APIs, and teams had enough production experience to document the coordination patterns that actually worked.

Why this mattered in 2021

  • Cost optimization became critical: cloud bills were growing, and manual resource tuning didn’t scale. Combining all three autoscalers could reduce waste by 30-50%.
  • Workload diversity: microservices, batch jobs, and stateful workloads each needed different autoscaling strategies. No single autoscaler fit all.
  • Platform engineering: platform teams needed to provide autoscaling as a service, not just tools. That meant making HPA, VPA, and CA work together reliably.
  • Multi-dimensional optimization: optimizing replicas, resources, and nodes separately left money on the table. True optimization required coordinating all three.

When to Use Each Autoscaler

Understanding which autoscaler solves which problem is the first step to coordination:

  • HPA (Horizontal Pod Autoscaler): Scales the number of pod replicas based on metrics (CPU, memory, custom metrics). Use when workload demand varies and you need more/fewer pods to handle load.
  • VPA (Vertical Pod Autoscaler): Adjusts pod resource requests and limits based on historical usage. Use when pods are over/under-provisioned and you want to right-size without changing replica count.
  • Cluster Autoscaler (CA): Adds/removes cluster nodes based on pending pods and node utilization. Use when you need more cluster capacity or want to reduce idle nodes.

Key insight: They solve orthogonal problems. HPA answers “how many pods?”, VPA answers “how much per pod?”, and CA answers “how many nodes?”

Coordination Strategies

The main challenge is that HPA and VPA can conflict when used on the same pods:

Strategy 1: Separation of Concerns

Use each autoscaler for different workloads or different resource dimensions:

  • HPA for replicas, VPA for resources: Run VPA in Off mode to get recommendations, apply them manually, then let HPA handle scaling (see the sketch after this list). This avoids conflicts but means VPA recommendations are only applied by hand.
  • HPA for custom metrics, VPA for CPU/memory: Use HPA to scale on business metrics (QPS, queue depth) and VPA to optimize CPU/memory requests. They operate on different signals, reducing conflicts.
  • Different workloads: Use HPA on stateless services, VPA on batch jobs, and CA for both. This is the safest approach but limits optimization.
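
For the first split, the HPA half can be as simple as a CPU-utilization target measured against the stable requests you applied from VPA recommendations. A minimal sketch, assuming a Deployment named my-app (the name, bounds, and 70% target are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2          # illustrative bounds; derive from real traffic patterns
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # add replicas when average CPU exceeds 70% of requests

Because the utilization target is a percentage of each pod’s CPU request, it only behaves predictably once those requests are stable, which is exactly why VPA stays in recommendation mode in this split.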

Strategy 2: VPA in Recommendation Mode

Use VPA to inform decisions without automatic updates:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # quoted so YAML doesn't parse it as a boolean; recommendations only, no automatic updates
  resourcePolicy:
    containerPolicies:
    - containerName: app
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 4Gi

Then periodically review VPA recommendations and update resource requests manually, allowing HPA to scale based on stable requests.
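
A sketch of applying a recommendation back to the workload, assuming the VPA above suggested roughly 500m CPU and 512Mi memory for the app container (illustrative values; in practice you would read them from the VPA object, for example with kubectl describe vpa my-app-vpa):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: registry.example.com/my-app:1.0  # illustrative image
        resources:
          requests:
            cpu: 500m       # taken from the VPA target recommendation (illustrative)
            memory: 512Mi
          limits:
            cpu: "1"
            memory: 1Gi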

Strategy 3: Staged Rollout

Gradually enable autoscalers with careful monitoring:

  1. Phase 1: Enable Cluster Autoscaler to handle node capacity.
  2. Phase 2: Enable HPA with CPU-based scaling to establish baseline behavior.
  3. Phase 3: Add VPA in Off mode to audit resource requests.
  4. Phase 4: Manually apply VPA recommendations and validate HPA behavior.
  5. Phase 5: Consider VPA Initial mode for new deployments only.
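
For Phase 5, Initial mode applies recommendations only when pods are created, so VPA never evicts running pods out from under HPA. A minimal sketch, reusing the my-app Deployment from Strategy 2:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-initial
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Initial"  # set requests at pod creation only; never evict running pods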

Avoiding Conflicts

Several patterns help prevent autoscaler conflicts:

  • PodDisruptionBudgets (PDBs): Configure PDBs to prevent VPA and CA from evicting too many pods simultaneously. This is critical for high-availability workloads.
  • Resource request stability: Once VPA recommendations are applied, avoid manual changes. Let HPA scale based on stable resource requests.
  • Min/max bounds: Set VPA minAllowed/maxAllowed and HPA minReplicas/maxReplicas to prevent extreme autoscaling decisions.
  • Update policies: Use VPA Initial mode instead of Auto to avoid constant pod evictions that conflict with HPA scaling.

Cost Optimization Architecture

Combining all three autoscalers enables comprehensive cost optimization:

  1. VPA right-sizes resources: Reduces over-provisioning by setting accurate CPU/memory requests.
  2. HPA scales replicas efficiently: With accurate resource requests from VPA, HPA makes better scaling decisions, reducing unnecessary replicas.
  3. Cluster Autoscaler optimizes nodes: With right-sized pods and efficient replica counts, CA can better bin-pack pods and reduce idle nodes.

Example stack:

  • VPA in Off mode provides resource recommendations
  • HPA scales on custom metrics (QPS, queue depth) with CPU fallback
  • Cluster Autoscaler adds/removes nodes based on pending pods
  • Metrics pipeline: Prometheus → metrics adapter → HPA/VPA
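
The adapter layer is where the custom metric the HPA in the next section consumes comes from. If the adapter is prometheus-adapter (an assumption; other adapters are configured differently), a rule that turns a Prometheus counter into an http_requests_per_second pod metric might look roughly like this, with series and label names being illustrative:

# Sketch of a prometheus-adapter rule (assumes prometheus-adapter; names are illustrative)
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^(.*)_total$"
    as: "${1}_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'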

Real-World Architecture Example

# VPA: Recommendation mode for resource optimization
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # quoted so YAML doesn't parse it as a boolean
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: 200m
        memory: 256Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi

---
# HPA: Scales replicas based on metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300

---
# PDB: Prevents excessive evictions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server

A practical rollout pattern

  1. Start with Cluster Autoscaler: Enable CA first to handle node capacity. This is the safest autoscaler and has the least conflict potential (a flag-level sketch follows this list).
  2. Add HPA with CPU metrics: Enable HPA on a few non-critical workloads to establish scaling behavior and validate metrics pipeline.
  3. Audit with VPA: Deploy VPA in Off mode to understand current resource request accuracy. Review recommendations weekly.
  4. Apply VPA recommendations manually: Update resource requests based on VPA recommendations, then validate HPA behavior doesn’t change unexpectedly.
  5. Gradually enable custom metrics: Add custom metric scaling to HPA once resource requests are stable.
  6. Monitor and tune: Watch for conflicts, thrashing, or unexpected behavior. Adjust min/max bounds and policies as needed.
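
As a reference for step 1, most Cluster Autoscaler tuning happens through container flags rather than a CRD. An illustrative excerpt of those args, where the provider, node-group name, and timings are assumptions and the surrounding Deployment, service account, and RBAC are omitted:

# Excerpt from the cluster-autoscaler container spec (not a complete Deployment)
command:
- ./cluster-autoscaler
- --cloud-provider=aws                 # assumption: AWS; substitute your provider
- --nodes=2:20:my-node-group           # min:max:node-group-name (illustrative)
- --expander=least-waste               # prefer node groups that leave the least idle capacity
- --balance-similar-node-groups=true   # spread scale-ups across equivalent node groups
- --scale-down-delay-after-add=10m     # wait after a scale-up before considering scale-down
- --scale-down-unneeded-time=10m       # a node must be underutilized this long before removal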

Caveats & Tuning

  • VPA and HPA conflicts: Running both on the same pods requires careful coordination. Prefer separation of concerns or VPA in Off mode.
  • Pod eviction overhead: VPA Auto mode evicts pods, causing brief interruptions. Not suitable for latency-sensitive workloads without PDBs.
  • Metrics lag: Custom metrics may have 30-60 second lag, affecting HPA decisions. Consider reducing scrape intervals for fast-scaling workloads.
  • Stateful workloads: VPA eviction can disrupt StatefulSets. Use VPA Off or Initial mode, or ensure proper PDBs and backup strategies.
  • Cost vs. performance: Aggressive autoscaling reduces costs but may increase latency during scale-up. Balance based on workload requirements.

Common failure modes (learned the hard way)

  • “HPA and VPA fighting”: VPA changes resource requests while HPA is scaling, causing unpredictable behavior. Use VPA in Off mode or separate workloads.
  • “Constant pod churn”: VPA Auto mode evicts pods too frequently. Switch to Initial mode or increase update intervals.
  • “Nodes added but pods still pending”: Cluster Autoscaler can’t keep up with HPA scaling. Check CA configuration, node group limits, and provisioning speed.
  • “VPA recommendations ignored”: If VPA is in Off mode, recommendations must be applied manually. Automate this with a controller or CI/CD pipeline.
  • “PDB blocking all evictions”: Overly strict PDBs prevent VPA and CA from working. Set minAvailable based on actual availability requirements, not “just to be safe.”

Conclusion

By 2021, orchestrating HPA, VPA, and Cluster Autoscaler together was both possible and valuable. While coordination required careful planning and monitoring, the combination enabled true multi-dimensional autoscaling that optimized replicas, resources, and nodes simultaneously. The key was understanding each autoscaler’s role, choosing coordination strategies that avoided conflicts, and gradually rolling out with proper observability. When done right, the autoscaling trinity delivered both performance and cost optimization that no single autoscaler could achieve alone.