Cluster API Runtime Extension: Building Custom Infrastructure Providers

Introduction

In mid-2023, the Cluster API Runtime Extension framework and RuntimeSDK democratized infrastructure provider development, enabling teams to build custom providers for any infrastructure platform. Previously, creating a Cluster API provider required deep knowledge of Cluster API internals and significant development effort. Runtime Extension simplified this process, making it practical for organizations to extend Cluster API to their specific infrastructure needs.

This mattered because not every organization uses AWS, Azure, or GCP. Many teams operate on-premises infrastructure, specialized cloud platforms, edge environments, or hybrid setups that don’t have official Cluster API providers. Runtime Extension enabled these teams to build their own providers using a standardized framework, bringing the benefits of Cluster API to any infrastructure.

Historical note: Runtime Extension was introduced in Cluster API to simplify provider development. The RuntimeSDK provides a framework for building providers without needing to understand all Cluster API internals, making provider development accessible to more teams.

The Problem Runtime Extension Solved

Before Runtime Extension

Building a Cluster API provider required:

  • Deep Cluster API Knowledge: Understanding Cluster API controllers, reconciliation loops, and resource lifecycle.
  • Complex Implementation: Implementing provider-specific logic alongside Cluster API integration.
  • Maintenance Burden: Keeping up with Cluster API changes and provider API updates.
  • Limited Reusability: Provider code tightly coupled to Cluster API internals.

After Runtime Extension

Runtime Extension provides:

  • Standardized Framework: Common patterns and interfaces for provider development.
  • Simplified Implementation: Focus on infrastructure-specific logic, not Cluster API internals.
  • Reduced Maintenance: Runtime Extension handles Cluster API integration.
  • Reusable Components: Shared components and utilities for provider development.

Runtime Extension Architecture

Core Components

Runtime Extension consists of:

  1. Runtime Extension Framework: Core framework for building providers.
  2. RuntimeSDK: Software development kit with utilities and helpers.
  3. Provider Templates: Starter templates for common provider patterns.
  4. Testing Framework: Tools for testing provider implementations.

Provider Architecture

Custom Provider
├── Infrastructure Controller
│   ├── Cluster Reconciliation
│   ├── Machine Reconciliation
│   └── Infrastructure Resource Management
├── Bootstrap Provider (optional)
│   └── Machine Bootstrap Logic
└── Control Plane Provider (optional)
    └── Control Plane Management
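
In controller-runtime terms, the infrastructure controller is an ordinary reconciler registered with a manager. The following is a minimal wiring sketch; the module path and the reconciler type (defined in the steps below) are illustrative:

package main

import (
    "os"

    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"

    clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
    infrav1 "example.com/cluster-api-provider-my-infrastructure/api/v1beta1" // hypothetical module path
)

func main() {
    // Register the provider's own types and the core Cluster API types
    scheme := runtime.NewScheme()
    _ = clusterv1.AddToScheme(scheme)
    _ = infrav1.AddToScheme(scheme)

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
    if err != nil {
        os.Exit(1)
    }

    // Wire the infrastructure controller: reconcile on MyInfrastructureCluster events
    if err := ctrl.NewControllerManagedBy(mgr).
        For(&infrav1.MyInfrastructureCluster{}).
        Complete(&MyInfrastructureClusterReconciler{
            Client: mgr.GetClient(),
            Scheme: mgr.GetScheme(),
        }); err != nil {
        os.Exit(1)
    }

    // Block until the process receives a termination signal
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}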

Building a Custom Provider

Step 1: Initialize Provider Project

# Scaffold the provider project with kubebuilder (a common approach;
# the module path and type names below are illustrative)
kubebuilder init --domain cluster.x-k8s.io \
  --repo example.com/cluster-api-provider-my-infrastructure
kubebuilder create api --group infrastructure --version v1beta1 \
  --kind MyInfrastructureCluster

Step 2: Define Infrastructure Resources

// Infrastructure cluster resource
type MyInfrastructureCluster struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec              MyInfrastructureClusterSpec   `json:"spec,omitempty"`
    Status            MyInfrastructureClusterStatus `json:"status,omitempty"`
}

type MyInfrastructureClusterSpec struct {
    Region               string                `json:"region"`
    NetworkCIDR          string                `json:"networkCIDR"`
    ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}

type MyInfrastructureClusterStatus struct {
    Ready      bool                 `json:"ready"`
    Conditions clusterv1.Conditions `json:"conditions,omitempty"`
}

Step 3: Implement Reconciliation Logic

func (r *MyInfrastructureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Fetch the infrastructure cluster this controller owns
    infraCluster := &infrav1.MyInfrastructureCluster{}
    if err := r.Get(ctx, req.NamespacedName, infraCluster); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Fetch the owning Cluster via its owner reference
    // (util is sigs.k8s.io/cluster-api/util)
    cluster, err := util.GetOwnerCluster(ctx, r.Client, infraCluster.ObjectMeta)
    if err != nil {
        return ctrl.Result{}, err
    }
    if cluster == nil {
        // Owner reference not set yet; the Cluster controller will set it
        return ctrl.Result{}, nil
    }

    // Reconcile infrastructure
    if err := r.reconcileInfrastructure(ctx, cluster, infraCluster); err != nil {
        return ctrl.Result{}, err
    }

    // Update status
    infraCluster.Status.Ready = true
    if err := r.Status().Update(ctx, infraCluster); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}
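
The sketch above covers only the create/update path. Production providers also need deletion handling: a finalizer keeps the object around until external infrastructure is torn down. A minimal sketch, assuming a hypothetical reconcileDelete helper and an illustrative finalizer name, using controller-runtime's controllerutil:

import "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

const infraClusterFinalizer = "myinfrastructurecluster.infrastructure.cluster.x-k8s.io" // illustrative

// Inside Reconcile, before the normal reconciliation path:
if !infraCluster.DeletionTimestamp.IsZero() {
    // Object is being deleted: release external infrastructure first
    if err := r.reconcileDelete(ctx, infraCluster); err != nil { // hypothetical helper
        return ctrl.Result{}, err
    }
    controllerutil.RemoveFinalizer(infraCluster, infraClusterFinalizer)
    return ctrl.Result{}, r.Update(ctx, infraCluster)
}

// Object is active: make sure the finalizer is present
if !controllerutil.ContainsFinalizer(infraCluster, infraClusterFinalizer) {
    controllerutil.AddFinalizer(infraCluster, infraClusterFinalizer)
    if err := r.Update(ctx, infraCluster); err != nil {
        return ctrl.Result{}, err
    }
}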

Step 4: Implement Infrastructure Logic

func (r *MyInfrastructureClusterReconciler) reconcileInfrastructure(
    ctx context.Context,
    cluster *clusterv1.Cluster,
    infraCluster *infrav1.MyInfrastructureCluster,
) error {
    // Create or update infrastructure resources.
    // createNetwork and createLoadBalancer are provider-specific helpers.

    // Example: Create network
    network, err := r.createNetwork(ctx, infraCluster)
    if err != nil {
        return err
    }

    // Example: Create load balancer
    lb, err := r.createLoadBalancer(ctx, infraCluster, network)
    if err != nil {
        return err
    }

    // Record the control plane endpoint; the caller must persist this
    // spec change (for example with the patch helper shown below)
    infraCluster.Spec.ControlPlaneEndpoint = clusterv1.APIEndpoint{
        Host: lb.Host,
        Port: lb.Port,
    }

    return nil
}

RuntimeSDK Features

Resource Management

RuntimeSDK provides utilities for managing infrastructure resources:

import "sigs.k8s.io/cluster-api/util"

// Look up the owning Cluster from a Machine's metadata
// (machine is the Machine object currently being reconciled)
cluster, err := util.GetClusterFromMetadata(ctx, r.Client, machine.ObjectMeta)
if err != nil {
    return err
}

// Get infrastructure cluster
infraCluster := &infrav1.MyInfrastructureCluster{}
infraClusterKey := client.ObjectKey{
    Namespace: cluster.Spec.InfrastructureRef.Namespace,
    Name:      cluster.Spec.InfrastructureRef.Name,
}
if err := r.Get(ctx, infraClusterKey, infraCluster); err != nil {
    return err
}

Condition Management

RuntimeSDK simplifies condition management:

import "sigs.k8s.io/cluster-api/util/conditions"

// Set condition
conditions.MarkTrue(infraCluster, infrav1.InfrastructureReadyCondition)

// Check condition
if conditions.IsTrue(infraCluster, infrav1.InfrastructureReadyCondition) {
    // Infrastructure is ready
}

Patch Utilities

RuntimeSDK provides patch utilities for updating resources:

import "sigs.k8s.io/cluster-api/util/patch"

// Create patch helper
patchHelper, err := patch.NewHelper(infraCluster, r.Client)
if err != nil {
    return err
}

// Update resource
infraCluster.Spec.Region = "us-west-2"

// Apply patch
if err := patchHelper.Patch(ctx, infraCluster); err != nil {
    return err
}
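
In practice, providers create the helper at the start of Reconcile and call Patch once in a deferred function, so spec and status changes made anywhere in the reconcile loop are persisted together.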

Provider Development Best Practices

1. Naming Conventions

Follow Cluster API naming conventions:

// Derive infrastructure resource names from the owning Cluster API objects
// so resources are easy to trace back to their cluster
func generateInfrastructureName(clusterName string) string {
    return fmt.Sprintf("%s-infrastructure", clusterName)
}

// Use consistent naming patterns
func generateMachineName(clusterName, machineName string) string {
    return fmt.Sprintf("%s-%s", clusterName, machineName)
}

2. Tagging and Labeling

Implement robust identification mechanisms:

// Tag infrastructure resources so they can be traced back to their cluster
tags := map[string]string{
    "kubernetes.io/cluster/" + clusterName: "owned",
    "sigs.k8s.io/cluster-api-provider":     "my-infrastructure",
    "sigs.k8s.io/cluster-api-managed":      "true",
}

// Label Kubernetes resources with the standard cluster-name label
// (clusterv1.ClusterLabelName is a deprecated alias for the same key,
// so only one of the two should be set)
labels := map[string]string{
    clusterv1.ClusterNameLabel: clusterName,
}

3. Error Handling

Implement comprehensive error handling:

func (r *MyInfrastructureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // ... reconciliation logic that produces err ...

    if err != nil {
        // Record the failure on the resource's conditions.
        // InfrastructureCreationFailedReason is a provider-defined reason constant.
        conditions.MarkFalse(
            infraCluster,
            infrav1.InfrastructureReadyCondition,
            infrav1.InfrastructureCreationFailedReason,
            clusterv1.ConditionSeverityError,
            err.Error(),
        )

        // Requeue with a delay instead of returning the error, to avoid hot-looping
        return ctrl.Result{RequeueAfter: time.Minute * 5}, nil
    }

    return ctrl.Result{}, nil
}

4. Testing

Use Cluster API testing frameworks:

import (
    "path/filepath"
    "testing"

    "k8s.io/apimachinery/pkg/runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/envtest"
)

func TestMyInfrastructureClusterReconciler(t *testing.T) {
    // Start a test API server with the provider CRDs installed
    testEnv := &envtest.Environment{
        CRDDirectoryPaths: []string{
            filepath.Join("..", "config", "crd", "bases"),
        },
    }

    cfg, err := testEnv.Start()
    if err != nil {
        t.Fatal(err)
    }
    defer testEnv.Stop()

    // Build a scheme and client for the test API server
    scheme := runtime.NewScheme()
    _ = infrav1.AddToScheme(scheme)
    testClient, err := client.New(cfg, client.Options{Scheme: scheme})
    if err != nil {
        t.Fatal(err)
    }

    // Create reconciler under test
    r := &MyInfrastructureClusterReconciler{
        Client: testClient,
        Scheme: scheme,
    }
    _ = r // silence unused until tests are added

    // Run tests
    // ...
}

Development Tools

Tilt Integration

Runtime Extension supports Tilt for rapid development:

# Tiltfile
# docker_build, k8s_yaml, and k8s_resource are Tilt built-ins; no load() is required

# Build provider image
docker_build('my-infrastructure-provider', '.')

# Deploy provider
k8s_yaml('config/manager/manager.yaml')
k8s_resource('my-infrastructure-provider', port_forwards='9443:9443')
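
With this Tiltfile in place, running tilt up rebuilds the image and redeploys the provider on every source change, giving a fast edit-compile-test loop against a local management cluster.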

Development Workflow

  1. Local Development: Use Tilt for rapid iteration.
  2. Unit Testing: Write unit tests for provider logic.
  3. Integration Testing: Test with real infrastructure.
  4. E2E Testing: Use Cluster API E2E framework.

Security Guidelines

Credential Management

// Read credentials from a Kubernetes secret referenced by the infrastructure cluster.
// CredentialsRef is a provider-defined spec field (not part of the spec shown earlier).
secret := &corev1.Secret{}
secretKey := client.ObjectKey{
    Namespace: infraCluster.Namespace,
    Name:      infraCluster.Spec.CredentialsRef.Name,
}
if err := r.Get(ctx, secretKey, secret); err != nil {
    return err
}

// Use credentials securely
credentials := secret.Data["credentials"]

Least Privilege

  • IAM Roles: Use least-privilege IAM roles.
  • Service Accounts: Use dedicated service accounts with narrowly scoped RBAC (see the sketch after this list).
  • Secret Rotation: Implement credential rotation.
  • Audit Logging: Log all infrastructure operations.
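
In a kubebuilder-based provider, the controller's own permissions are typically declared as RBAC markers from which the ClusterRole manifests are generated. A sketch scoped to the illustrative types used throughout this article:

// Grant only the verbs the reconciler needs, and only on the provider's
// own types plus the core Cluster API resources it reads.
//+kubebuilder:rbac:groups=infrastructure.cluster.x-k8s.io,resources=myinfrastructureclusters,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=infrastructure.cluster.x-k8s.io,resources=myinfrastructureclusters/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=cluster.x-k8s.io,resources=clusters,verbs=get;list;watch

func (r *MyInfrastructureClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&infrav1.MyInfrastructureCluster{}).
        Complete(r)
}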

Rate Limiting

// Rate-limit cloud API calls (uses golang.org/x/time/rate).
// Allow bursts of up to 10 calls, refilling one token per second.
var rateLimiter = rate.NewLimiter(rate.Every(time.Second), 10)

func (r *MyInfrastructureClusterReconciler) createResource(ctx context.Context, resource *Resource) error {
    // Block until a token is available or the context is cancelled
    if err := rateLimiter.Wait(ctx); err != nil {
        return err
    }

    // Make the cloud API call (infrastructureClient and Resource are provider-specific)
    return r.infrastructureClient.CreateResource(ctx, resource)
}

Case Studies

Case Study 1: On-Premises Provider

A team built a provider for their on-premises infrastructure:

type OnPremisesClusterSpec struct {
    Datacenter           string                `json:"datacenter"`
    NetworkVLAN          string                `json:"networkVLAN"`
    StoragePool          string                `json:"storagePool"`
    ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}

Benefits:

  • Standardized Operations: Consistent cluster management across on-premises environments.
  • GitOps Integration: Infrastructure as code for on-premises environments.
  • Multi-Cluster Management: Manage multiple on-premises clusters.

Case Study 2: Edge Provider

A team built a provider for edge deployments:

type EdgeClusterSpec struct {
    EdgeLocation         string                `json:"edgeLocation"`
    Connectivity         string                `json:"connectivity"` // "online" | "offline"
    LocalStorage         bool                  `json:"localStorage"`
    ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}

Benefits:

  • Edge Consistency: Standardized edge cluster management.
  • Offline Support: Support for offline edge deployments.
  • Centralized Management: Manage edge clusters from a central management cluster.

Practical Considerations

Provider Maintenance

  • Version Compatibility: Keep provider compatible with Cluster API versions.
  • Testing: Comprehensive testing before releases.
  • Documentation: Maintain up-to-date documentation.
  • Community: Engage with Cluster API community.

Integration Patterns

  • Infrastructure APIs: Integrate with infrastructure provider APIs.
  • Authentication: Implement secure authentication.
  • Error Handling: Handle infrastructure errors gracefully.
  • Retry Logic: Implement retry logic for transient failures (see the sketch after this list).
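
A common pattern is to classify infrastructure errors and requeue transient ones with a delay rather than surfacing them; a minimal sketch, assuming a hypothetical IsTransient classifier from the provider's client library:

import (
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

// handleInfraError maps infrastructure errors to reconcile results: transient
// failures are requeued after a delay, permanent ones are returned so
// controller-runtime logs them and retries with exponential backoff.
func handleInfraError(err error) (ctrl.Result, error) {
    if err == nil {
        return ctrl.Result{}, nil
    }
    if infraclient.IsTransient(err) { // hypothetical error classifier
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
    return ctrl.Result{}, err
}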

Caveats & Limitations

  • Learning Curve: Runtime Extension still requires Cluster API knowledge.
  • Provider Complexity: Complex infrastructure requires complex providers.
  • Maintenance: Providers require ongoing maintenance.
  • Testing: Testing providers requires infrastructure access.

Common Challenges

  • API Changes: Cluster API changes may require provider updates.
  • Infrastructure Limits: Infrastructure provider limits and quotas.
  • Network Connectivity: Reliable network connectivity between the management cluster and the target infrastructure.
  • Credential Management: Secure credential management.

Conclusion

Cluster API Runtime Extension in 2023 democratized provider development, making it practical for teams to build custom infrastructure providers. The Runtime Extension framework and RuntimeSDK simplified provider development, enabling organizations to extend Cluster API to their specific infrastructure needs.

The ability to build custom providers opened new possibilities: on-premises deployments, edge environments, specialized cloud platforms, and hybrid setups could all benefit from Cluster API’s declarative cluster management model. Runtime Extension made Cluster API truly infrastructure-agnostic.

For teams operating on infrastructure without official Cluster API providers, Runtime Extension provided a path to standardized, declarative cluster management. The framework, SDK, and best practices that emerged in 2023 would enable a new generation of infrastructure providers, expanding Cluster API’s reach beyond the major cloud providers.

Runtime Extension wasn’t just a development tool; it was an enabler of infrastructure innovation, allowing teams to bring Cluster API’s benefits to any infrastructure platform. By mid-2023, Runtime Extension had proven that custom provider development was not just possible, but practical and powerful.