Cluster API Managed Topologies: Declarative Cluster Lifecycle at Scale

Introduction

By mid-2022, Cluster API managed topologies had become the recommended pattern for declarative cluster lifecycle management. Introduced with the v1beta1 API, managed topologies let teams drive cluster creation, upgrades, scaling, and configuration through simple spec updates, treating clusters as truly declarative resources.

This mattered because managed topologies tamed the operational complexity of managing dozens or hundreds of clusters. Instead of imperative, per-cluster operations (hand-editing MachineDeployments to scale, driving upgrades object by object), teams could declare desired state and let Cluster API handle the orchestration. Combined with ClusterClass templates, managed topologies enabled consistent, repeatable cluster operations at scale.

Historical note: Managed topologies were introduced in Cluster API v1beta1 (2021) but became the production-recommended pattern in 2022 as teams adopted them for large-scale deployments. This post focuses on production patterns and best practices that emerged from real-world usage.

Managed Topologies Deep Dive

What Are Managed Topologies?

Managed topologies use the topology field in Cluster resources to enable declarative cluster lifecycle management. The topology field defines:

  • Class: The ClusterClass the cluster is built from.
  • Kubernetes Version: The desired Kubernetes version for the cluster.
  • Control Plane Configuration: Control plane replicas and metadata.
  • Worker Configuration: Machine deployment classes, names, replicas, and metadata.
  • Variables: Values consumed by ClusterClass patches to customize the generated objects.

Upgrades and scaling are then expressed by editing the version and replica fields; the topology controller sequences the resulting rollout.

Basic Managed Topology Example

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
spec:
  topology:
    class: standard-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 3
      metadata:
        labels:
          environment: production
          tier: control-plane
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 10
        name: worker-pool-1
        metadata:
          labels:
            pool: default
            tier: worker

ClusterClass Composition Patterns

Pattern 1: Environment-Specific ClusterClasses

Create separate ClusterClasses for different environments:

# Development ClusterClass
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: dev-cluster-class
spec:
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSClusterTemplate
      name: dev-aws-template
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: dev-controlplane-template
    # Machine template used for control plane nodes (name is a placeholder)
    machineInfrastructure:
      ref:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: dev-controlplane-machine-template
  workers:
    machineDeployments:
    - class: default-worker
      template:
        spec:
          # Both bootstrap and infrastructure refs are required for a worker class;
          # the bootstrap template name is a placeholder
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: dev-worker-bootstrap-template
          infrastructure:
            ref:
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
              kind: AWSMachineTemplate
              name: dev-worker-template

---
# Production ClusterClass
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: prod-cluster-class
spec:
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSClusterTemplate
      name: prod-aws-template
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: prod-controlplane-template
    machineInfrastructure:
      ref:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: prod-controlplane-machine-template
  workers:
    machineDeployments:
    - class: default-worker
      template:
        spec:
          # Both bootstrap and infrastructure refs are required for a worker class;
          # the bootstrap template names are placeholders
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: prod-worker-bootstrap-template
          infrastructure:
            ref:
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
              kind: AWSMachineTemplate
              name: prod-worker-template
    - class: gpu-worker
      template:
        spec:
          bootstrap:
            ref:
              apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
              kind: KubeadmConfigTemplate
              name: prod-gpu-worker-bootstrap-template
          infrastructure:
            ref:
              apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
              kind: AWSMachineTemplate
              name: prod-gpu-worker-template

Pattern 2: Composable ClusterClasses

Build ClusterClasses from reusable components:

# Base infrastructure template
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSClusterTemplate
metadata:
  name: base-aws-template
spec:
  template:
    spec:
      region: us-west-2
      networkSpec:
        vpc:
          cidrBlock: "10.0.0.0/16"

---
# Base control plane template
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlaneTemplate
metadata:
  name: base-controlplane-template
spec:
  template:
    spec:
      machineTemplate:
        infrastructureRef:
          apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
          kind: AWSMachineTemplate
          name: controlplane-machine-template
      kubeadmConfigSpec:
        clusterConfiguration:
          apiServer:
            extraArgs:
              audit-log-maxage: "30"

---
# Composed ClusterClass
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: composed-cluster-class
spec:
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSClusterTemplate
      name: base-aws-template
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: base-controlplane-template
    machineInfrastructure:
      ref:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: controlplane-machine-template
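
ClusterClass variables and patches are what make a composed class reusable across many clusters: the class declares typed variables, patches inject their values into the referenced templates, and each Cluster supplies values in spec.topology.variables. A minimal sketch (the variable name, patch path, and defaults are illustrative):

apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: composed-cluster-class
spec:
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSClusterTemplate
      name: base-aws-template
  controlPlane:
    ref:
      apiVersion: controlplane.cluster.x-k8s.io/v1beta1
      kind: KubeadmControlPlaneTemplate
      name: base-controlplane-template
  # Typed variables that clusters can set
  variables:
  - name: region
    required: true
    schema:
      openAPIV3Schema:
        type: string
        default: us-west-2
  # Patches apply variable values to the underlying templates
  patches:
  - name: region
    definitions:
    - selector:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSClusterTemplate
        matchResources:
          infrastructureCluster: true
      jsonPatches:
      - op: replace
        path: /spec/template/spec/region
        valueFrom:
          variable: region

---
# A Cluster selects the class and supplies values for its variables
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: eu-cluster
spec:
  topology:
    class: composed-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 3
    variables:
    - name: region
      value: eu-west-1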

Pattern 3: Multi-Region ClusterClasses

Create region-specific ClusterClasses:

# US West ClusterClass
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: us-west-cluster-class
spec:
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSClusterTemplate
      name: us-west-aws-template
  # controlPlane and workers omitted for brevity

---
# EU West ClusterClass
apiVersion: cluster.x-k8s.io/v1beta1
kind: ClusterClass
metadata:
  name: eu-west-cluster-class
spec:
  infrastructure:
    ref:
      apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
      kind: AWSClusterTemplate
      name: eu-west-aws-template
  # controlPlane and workers omitted for brevity

Topology-Based Upgrades

Declarative Upgrades

Upgrading clusters with managed topologies is a simple spec update:

# Upgrade cluster to v1.25.0
kubectl patch cluster production-cluster --type merge -p '{
  "spec": {
    "topology": {
      "version": "v1.25.0"
    }
  }
}'

Upgrade Strategies

Rolling Upgrade (Default)

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
spec:
  topology:
    class: standard-cluster-class
    version: v1.25.0
    controlPlane:
      replicas: 3
      # Rolling upgrade by default

Deferring Part of an Upgrade

The topology spec itself has no pause or upgradeAfter field. To hold back part of an upgrade, the Cluster can be paused entirely (spec.paused: true), or, in newer Cluster API releases, individual MachineDeployment topologies can be deferred with the topology.cluster.x-k8s.io/defer-upgrade annotation so the control plane upgrades first and the worker pool follows only once the annotation is removed:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
spec:
  topology:
    class: standard-cluster-class
    version: v1.25.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 10
        name: worker-pool-1
        metadata:
          annotations:
            # This pool stays on the previous version until the annotation is removed
            topology.cluster.x-k8s.io/defer-upgrade: ""

Upgrade Workflow

  1. Control Plane Upgrade: Cluster API upgrades control plane nodes first.
  2. Health Validation: Validates control plane health after upgrade.
  3. Worker Upgrade: Upgrades worker nodes after control plane.
  4. Final Validation: Validates cluster health after complete upgrade.

Monitoring Upgrades

# Watch upgrade progress
kubectl get cluster production-cluster -w

# Check control plane status (the KubeadmControlPlane name is generated, so select by label)
kubectl get kubeadmcontrolplane -l cluster.x-k8s.io/cluster-name=production-cluster

# Check machine status
kubectl get machines -l cluster.x-k8s.io/cluster-name=production-cluster

Topology-Based Scaling

Scaling Control Plane

# Scale control plane from 3 to 5 replicas
kubectl patch cluster production-cluster --type merge -p '{
  "spec": {
    "topology": {
      "controlPlane": {
        "replicas": 5
      }
    }
  }
}'

Scaling Worker Pools

# Scale a worker pool
# (a merge patch replaces the entire machineDeployments list, so include every pool)
kubectl patch cluster production-cluster --type merge -p '{
  "spec": {
    "topology": {
      "workers": {
        "machineDeployments": [{
          "class": "default-worker",
          "replicas": 20,
          "name": "worker-pool-1"
        }]
      }
    }
  }
}'

Adding New Worker Pools

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: production-cluster
spec:
  topology:
    class: standard-cluster-class
    version: v1.24.0
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 10
        name: worker-pool-1
      - class: gpu-worker
        replicas: 5
        name: gpu-pool-1  # New pool

Environment-Specific Cluster Templates

Development Environment

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: dev-cluster
spec:
  topology:
    class: dev-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 1  # Single control plane for dev
      metadata:
        labels:
          environment: development
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 2  # Minimal workers for dev
        name: dev-workers
        metadata:
          labels:
            environment: development

Staging Environment

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: staging-cluster
spec:
  topology:
    class: staging-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 3  # HA control plane for staging
      metadata:
        labels:
          environment: staging
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 5  # Moderate workers for staging
        name: staging-workers
        metadata:
          labels:
            environment: staging

Production Environment

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-cluster
spec:
  topology:
    class: prod-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 3  # HA control plane for prod
      metadata:
        labels:
          environment: production
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 20  # Scale workers for prod
        name: prod-workers
        metadata:
          labels:
            environment: production
            pool: default
      - class: gpu-worker
        replicas: 5  # GPU workers for ML workloads
        name: gpu-workers
        metadata:
          labels:
            environment: production
            pool: gpu

GitOps Integration

ArgoCD Integration

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-clusters
spec:
  project: default
  source:
    repoURL: https://github.com/org/cluster-definitions
    path: clusters/production
    targetRevision: main
  destination:
    server: https://management-cluster:6443
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
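
For a fleet, an ArgoCD ApplicationSet can stamp out one Application per cluster from a Git directory generator. This is a sketch that assumes one directory per cluster under clusters/production/ (a slight variation on the flat layout shown below); the repo URL and destination are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: production-clusters
spec:
  generators:
  - git:
      repoURL: https://github.com/org/cluster-definitions
      revision: main
      directories:
      - path: clusters/production/*
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/org/cluster-definitions
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://management-cluster:6443
      syncPolicy:
        automated:
          prune: true
          selfHeal: true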

FluxCD Integration

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: production-clusters
spec:
  sourceRef:
    kind: GitRepository
    name: cluster-definitions
  path: ./clusters/production
  interval: 5m
  prune: true
  wait: true
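
The Kustomization above references a GitRepository source by name; a minimal source definition to pair with it (URL and sync interval are illustrative) looks like:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: cluster-definitions
spec:
  interval: 5m
  url: https://github.com/org/cluster-definitions
  ref:
    branch: main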

Git Repository Structure

cluster-definitions/
├── clusterclasses/
│   ├── dev-clusterclass.yaml
│   ├── staging-clusterclass.yaml
│   └── prod-clusterclass.yaml
├── clusters/
│   ├── development/
│   │   └── dev-cluster.yaml
│   ├── staging/
│   │   └── staging-cluster.yaml
│   └── production/
│       ├── prod-us-west.yaml
│       ├── prod-us-east.yaml
│       └── prod-eu-west.yaml
└── templates/
    ├── aws-templates/
    └── controlplane-templates/

Large-Scale Fleet Management

Managing 100+ Clusters

With managed topologies and a shared ClusterClass, a fleet of 100+ clusters reduces to a set of short, declarative definitions:

# Regional production clusters
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-west-2
spec:
  topology:
    class: prod-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 20
        name: worker-pool-1

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-east-1
spec:
  topology:
    class: prod-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 20
        name: worker-pool-1

---
# ... 98 more clusters with same pattern
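
To avoid hand-maintaining ~100 near-identical manifests, the repeated pattern can be generated from a shared base, for example with a small Kustomize overlay per region. This is a sketch; the overlay layout, base name, and patch values are illustrative and not part of the repository structure shown earlier:

# clusters/production/us-east-1/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../base                 # contains a generic production Cluster named "prod"
nameSuffix: -us-east-1    # yields a Cluster named "prod-us-east-1"
patches:
- target:
    kind: Cluster
  patch: |-
    - op: replace
      path: /spec/topology/workers/machineDeployments/0/replicas
      value: 20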

Bulk Operations

# Upgrade all production clusters
# (assumes the Cluster objects themselves carry an environment=production label)
for cluster in $(kubectl get clusters -l environment=production -o name); do
  kubectl patch $cluster --type merge -p '{
    "spec": {
      "topology": {
        "version": "v1.25.0"
      }
    }
  }'
done

# Scale all worker pools
# (the merge patch replaces the whole machineDeployments list, so include every pool)
for cluster in $(kubectl get clusters -l environment=production -o name); do
  kubectl patch $cluster --type merge -p '{
    "spec": {
      "topology": {
        "workers": {
          "machineDeployments": [{
            "class": "default-worker",
            "name": "worker-pool-1",
            "replicas": 25
          }]
        }
      }
    }
  }'
done

Best Practices

ClusterClass Design

  1. Reusability: Design ClusterClasses to serve multiple clusters and use cases.
  2. Composition: Build ClusterClasses from reusable infrastructure, control plane, and bootstrap templates.
  3. Versioning: Version ClusterClasses so clusters can be moved to a new revision deliberately (see the sketch after this list).
  4. Documentation: Document each ClusterClass's purpose, variables, and intended consumers.
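
One way to apply the versioning practice, assuming revisions are published as separately named ClusterClasses (names here are illustrative): moving a cluster to the new revision is just a change to spec.topology.class, which Cluster API treats as a rebase onto the new class, provided the new class defines the worker classes the cluster uses.

# Rebase a cluster onto a new ClusterClass revision
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: prod-us-west-2
spec:
  topology:
    class: prod-cluster-class-v2   # was: prod-cluster-class
    version: v1.24.0
    controlPlane:
      replicas: 3
    workers:
      machineDeployments:
      - class: default-worker
        replicas: 20
        name: worker-pool-1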

Managed Topology Usage

  1. Version Pinning: Pin exact Kubernetes versions in the topology rather than relying on defaults.
  2. Gradual Upgrades: Upgrade clusters incrementally.
  3. Testing: Test topology changes in non-production first.
  4. Monitoring: Monitor cluster health during changes.

Fleet Management

  1. Standardization: Use ClusterClasses to enforce standards.
  2. Automation: Automate cluster operations through GitOps.
  3. Monitoring: Monitor fleet health and configuration drift.
  4. Documentation: Document cluster definitions and patterns.

Practical Considerations

Management Cluster Requirements

  • High Availability: HA management cluster for production fleets.
  • Resources: Adequate resources for Cluster API controllers.
  • Network Access: Access to cloud provider APIs and workload clusters.
  • Backup: Regular backups of management cluster state.

Cluster Lifecycle Management

  • Creation: Use managed topologies for consistent cluster creation.
  • Upgrades: Leverage topology-based upgrades for controlled rollouts.
  • Scaling: Use topology-based scaling for capacity management.
  • Deletion: Plan for cluster decommissioning and resource cleanup.

Observability

  • Cluster Health: Monitor cluster status and conditions.
  • Upgrade Progress: Track upgrade progress and failures.
  • Configuration Drift: Detect and alert on configuration drift.
  • Resource Usage: Monitor resource usage across clusters.

Caveats & Lessons Learned

Common Pitfalls

  • ClusterClass Not Found: Ensure the referenced ClusterClass exists in the same namespace before creating clusters.
  • Topology Validation: Validate topology changes before applying them, for example with a server-side dry run so admission webhooks can reject invalid settings.
  • Upgrade Failures: Monitor upgrade progress and have a rollback plan for workloads; Kubernetes version downgrades via the topology are not supported.
  • Scaling Limits: Be aware of cloud provider quotas and instance limits.

Best Practices Learned

  1. Start Simple: Begin with basic ClusterClasses and expand.
  2. Test Thoroughly: Test topology changes in non-production.
  3. Monitor Closely: Monitor cluster health during operations.
  4. Document Patterns: Document successful patterns for reuse.

Conclusion

Cluster API managed topologies in 2022 became the production-recommended pattern for declarative cluster lifecycle management. The combination of ClusterClass templates and managed topologies enabled teams to manage large fleets of clusters with consistency, repeatability, and automation.

The declarative model of managed topologies—where cluster lifecycle is managed through simple spec updates—represented a fundamental shift from imperative cluster management. Teams could now treat clusters like any other Kubernetes resource: declaratively, with version control, and through GitOps workflows.

For organizations managing dozens or hundreds of clusters, managed topologies provided the foundation for scalable, consistent, and maintainable cluster operations. The patterns and practices that emerged in 2022—ClusterClass composition, topology-based upgrades, and GitOps integration—would become standard approaches as Cluster API continued to mature.

Managed topologies weren’t just a feature; they were the operational model that made Cluster API viable for enterprise-scale deployments. By mid-2022, managed topologies had proven that declarative cluster management at scale was not just possible, but practical and powerful.