Cluster API Runtime Extension: Building Custom Infrastructure Providers

Introduction
In mid-2023, the Cluster API Runtime Extension framework and RuntimeSDK democratized infrastructure provider development, enabling teams to build custom providers for any infrastructure platform. Previously, creating a Cluster API provider required deep knowledge of Cluster API internals and significant development effort. Runtime Extension simplified this process, making it practical for organizations to extend Cluster API to their specific infrastructure needs.
This mattered because not every organization uses AWS, Azure, or GCP. Many teams operate on-premises infrastructure, specialized cloud platforms, edge environments, or hybrid setups that don’t have official Cluster API providers. Runtime Extension enabled these teams to build their own providers using a standardized framework, bringing the benefits of Cluster API to any infrastructure.
Historical note: Runtime Extension was introduced in Cluster API to simplify provider development. The RuntimeSDK provides a framework for building providers without needing to understand all Cluster API internals, making provider development accessible to more teams.
The Problem Runtime Extension Solved
Before Runtime Extension
Building a Cluster API provider required:
- Deep Cluster API Knowledge: Understanding Cluster API controllers, reconciliation loops, and resource lifecycle.
- Complex Implementation: Implementing provider-specific logic alongside Cluster API integration.
- Maintenance Burden: Keeping up with Cluster API changes and provider API updates.
- Limited Reusability: Provider code tightly coupled to Cluster API internals.
After Runtime Extension
Runtime Extension provides:
- Standardized Framework: Common patterns and interfaces for provider development.
- Simplified Implementation: Focus on infrastructure-specific logic, not Cluster API internals.
- Reduced Maintenance: Runtime Extension handles Cluster API integration.
- Reusable Components: Shared components and utilities for provider development.
Runtime Extension Architecture
Core Components
Runtime Extension consists of:
- Runtime Extension Framework: Core framework for building providers.
- RuntimeSDK: Software development kit with utilities and helpers.
- Provider Templates: Starter templates for common provider patterns.
- Testing Framework: Tools for testing provider implementations.
Provider Architecture
Custom Provider
├── Infrastructure Controller
│   ├── Cluster Reconciliation
│   ├── Machine Reconciliation
│   └── Infrastructure Resource Management
├── Bootstrap Provider (optional)
│   └── Machine Bootstrap Logic
└── Control Plane Provider (optional)
    └── Control Plane Management
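To make this layout concrete, here is a hedged sketch of how such a provider's controllers are typically wired into a controller-runtime manager. The module path and the Client/Scheme fields on MyInfrastructureClusterReconciler are assumptions for this example, not a prescribed structure.
// main.go (sketch): wire the provider's controllers into a controller-runtime manager
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"

	infrav1 "example.com/my-infrastructure-provider/api/v1beta1" // hypothetical module path
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		os.Exit(1)
	}

	// Make the provider's API types known to the manager's scheme
	_ = infrav1.AddToScheme(mgr.GetScheme())

	// Infrastructure cluster controller (the Infrastructure Controller box above)
	if err := (&MyInfrastructureClusterReconciler{
		Client: mgr.GetClient(),
		Scheme: mgr.GetScheme(),
	}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
Bootstrap and control plane controllers, if the provider implements them, are registered with the same manager in the same way.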
Building a Custom Provider
Step 1: Initialize Provider Project
# Scaffold the provider project; a common approach is kubebuilder
# (the module path below is just an example)
kubebuilder init \
  --domain cluster.x-k8s.io \
  --repo github.com/example/cluster-api-provider-my-infrastructure

# Add the infrastructure API types that Cluster API will consume
kubebuilder create api --group infrastructure --version v1beta1 --kind MyInfrastructureCluster
Step 2: Define Infrastructure Resources
// Infrastructure cluster resource
type MyInfrastructureCluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MyInfrastructureClusterSpec   `json:"spec,omitempty"`
	Status MyInfrastructureClusterStatus `json:"status,omitempty"`
}

type MyInfrastructureClusterSpec struct {
	Region               string                `json:"region"`
	NetworkCIDR          string                `json:"networkCIDR"`
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}

type MyInfrastructureClusterStatus struct {
	Ready      bool                 `json:"ready"`
	Conditions clusterv1.Conditions `json:"conditions,omitempty"`
}
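For controller-runtime to work with these types, the API group also needs the usual scheme registration. Below is a minimal sketch of the groupversion_info.go that a kubebuilder-scaffolded project generates; the group name is an assumption for this example, and the List type and DeepCopy methods are produced by controller-gen.
// groupversion_info.go (sketch): register the example API group with a scheme
package v1beta1

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/scheme"
)

var (
	// GroupVersion identifies the example infrastructure API group
	GroupVersion = schema.GroupVersion{Group: "infrastructure.cluster.x-k8s.io", Version: "v1beta1"}

	// SchemeBuilder collects the types registered below
	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

	// AddToScheme adds the types in this group-version to a scheme
	AddToScheme = SchemeBuilder.AddToScheme
)

func init() {
	// MyInfrastructureClusterList is generated alongside the cluster type
	SchemeBuilder.Register(&MyInfrastructureCluster{}, &MyInfrastructureClusterList{})
}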
Step 3: Implement Reconciliation Logic
func (r *MyInfrastructureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Get the infrastructure cluster that triggered this request
	infraCluster := &infrav1.MyInfrastructureCluster{}
	if err := r.Get(ctx, req.NamespacedName, infraCluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Get the owning Cluster API cluster
	cluster, err := util.GetOwnerCluster(ctx, r.Client, infraCluster.ObjectMeta)
	if err != nil {
		return ctrl.Result{}, err
	}
	if cluster == nil {
		// Owner reference not set yet; wait for the Cluster controller
		return ctrl.Result{}, nil
	}

	// Reconcile infrastructure
	if err := r.reconcileInfrastructure(ctx, cluster, infraCluster); err != nil {
		return ctrl.Result{}, err
	}

	// Update status
	infraCluster.Status.Ready = true
	if err := r.Status().Update(ctx, infraCluster); err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}
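The reconciler also has to tell controller-runtime which objects to watch. A minimal sketch, assuming the reconciler type above:
// SetupWithManager registers the reconciler and watches the provider's
// infrastructure cluster objects
func (r *MyInfrastructureClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&infrav1.MyInfrastructureCluster{}).
		Complete(r)
}
Production providers usually also watch the owning Cluster objects (Cluster API ships mapping helpers for this in sigs.k8s.io/cluster-api/util) so that changes to the Cluster trigger reconciliation; that is omitted here for brevity.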
Step 4: Implement Infrastructure Logic
func (r *MyInfrastructureClusterReconciler) reconcileInfrastructure(
	ctx context.Context,
	cluster *clusterv1.Cluster,
	infraCluster *infrav1.MyInfrastructureCluster,
) error {
	// Create or update infrastructure resources.
	// This is provider-specific logic.

	// Example: Create network
	network, err := r.createNetwork(ctx, infraCluster)
	if err != nil {
		return err
	}

	// Example: Create load balancer
	lb, err := r.createLoadBalancer(ctx, infraCluster, network)
	if err != nil {
		return err
	}

	// Update control plane endpoint
	infraCluster.Spec.ControlPlaneEndpoint = clusterv1.APIEndpoint{
		Host: lb.Host,
		Port: lb.Port,
	}

	return nil
}
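The createNetwork and createLoadBalancer helpers are where the provider calls its infrastructure API; they are entirely provider-specific. As a hedged illustration only, createLoadBalancer might look roughly like the sketch below, where infrastructureClient, the network type, and the loadBalancer type all stand in for whatever SDK the platform provides. The key point is idempotency: reconciliation runs repeatedly, so the helper looks up existing resources before creating new ones.
// loadBalancer is a hypothetical view of the provisioned API server load balancer
type loadBalancer struct {
	Host string
	Port int32
}

func (r *MyInfrastructureClusterReconciler) createLoadBalancer(
	ctx context.Context,
	infraCluster *infrav1.MyInfrastructureCluster,
	net *network, // hypothetical type returned by createNetwork
) (*loadBalancer, error) {
	name := fmt.Sprintf("%s-apiserver", infraCluster.Name)

	// Reuse an existing load balancer if one was created on a previous pass
	if existing, err := r.infrastructureClient.GetLoadBalancer(ctx, name); err == nil && existing != nil {
		return &loadBalancer{Host: existing.Host, Port: existing.Port}, nil
	}

	// Otherwise create it (hypothetical SDK call)
	created, err := r.infrastructureClient.CreateLoadBalancer(ctx, name, net.ID)
	if err != nil {
		return nil, err
	}
	return &loadBalancer{Host: created.Host, Port: created.Port}, nil
}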
RuntimeSDK Features
Resource Management
RuntimeSDK provides utilities for managing infrastructure resources:
import "sigs.k8s.io/cluster-api/util"
// Get cluster from context
cluster, err := util.GetClusterFromMetadata(ctx, r.Client, machine.ObjectMeta)
if err != nil {
return err
}
// Get infrastructure cluster
infraCluster := &infrav1.MyInfrastructureCluster{}
infraClusterKey := client.ObjectKey{
Namespace: cluster.Spec.InfrastructureRef.Namespace,
Name: cluster.Spec.InfrastructureRef.Name,
}
if err := r.Get(ctx, infraClusterKey, infraCluster); err != nil {
return err
}
Condition Management
RuntimeSDK simplifies condition management:
import "sigs.k8s.io/cluster-api/util/conditions"
// Set condition
conditions.MarkTrue(infraCluster, infrav1.InfrastructureReadyCondition)
// Check condition
if conditions.IsTrue(infraCluster, infrav1.InfrastructureReadyCondition) {
// Infrastructure is ready
}
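The same package can also roll provider-specific conditions up into the object's overall Ready condition, which is what Cluster API tooling and kubectl output typically surface. A brief sketch using the condition type from the examples above:
// Summarize provider conditions into the object's Ready condition
conditions.SetSummary(infraCluster,
	conditions.WithConditions(infrav1.InfrastructureReadyCondition),
)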
Patch Utilities
RuntimeSDK provides patch utilities for updating resources:
import "sigs.k8s.io/cluster-api/util/patch"
// Create patch helper
patchHelper, err := patch.NewHelper(infraCluster, r.Client)
if err != nil {
return err
}
// Update resource
infraCluster.Spec.Region = "us-west-2"
// Apply patch
if err := patchHelper.Patch(ctx, infraCluster); err != nil {
return err
}
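In practice, most providers create the patch helper at the top of Reconcile and defer the patch, so that spec and status changes are persisted even on early returns. A minimal sketch of that pattern:
func (r *MyInfrastructureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ ctrl.Result, retErr error) {
	infraCluster := &infrav1.MyInfrastructureCluster{}
	if err := r.Get(ctx, req.NamespacedName, infraCluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Snapshot the object before mutating it
	patchHelper, err := patch.NewHelper(infraCluster, r.Client)
	if err != nil {
		return ctrl.Result{}, err
	}

	// Persist spec and status changes no matter how the function returns
	defer func() {
		if err := patchHelper.Patch(ctx, infraCluster); err != nil && retErr == nil {
			retErr = err
		}
	}()

	// ... reconcile and mutate infraCluster here ...
	return ctrl.Result{}, nil
}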
Provider Development Best Practices
1. Naming Conventions
Follow Cluster API naming conventions:
// Derive infrastructure resource names from the owning cluster
func generateInfrastructureName(clusterName string) string {
	return fmt.Sprintf("%s-infrastructure", clusterName)
}

// Use consistent naming patterns
func generateMachineName(clusterName, machineName string) string {
	return fmt.Sprintf("%s-%s", clusterName, machineName)
}
2. Tagging and Labeling
Implement robust identification mechanisms:
// Tag infrastructure resources (example provider-specific tag keys)
tags := map[string]string{
	"kubernetes.io/cluster/" + clusterName: "owned",
	"sigs.k8s.io/cluster-api-provider":     "my-infrastructure",
	"sigs.k8s.io/cluster-api-managed":      "true",
}

// Label Kubernetes resources with the standard cluster-name label
labels := map[string]string{
	clusterv1.ClusterNameLabel: clusterName,
}
3. Error Handling
Implement comprehensive error handling:
func (r *MyInfrastructureClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... reconciliation logic ...
	if err != nil {
		// Record error in conditions
		conditions.MarkFalse(
			infraCluster,
			infrav1.InfrastructureReadyCondition,
			infrav1.InfrastructureCreationFailedReason,
			clusterv1.ConditionSeverityError,
			err.Error(),
		)

		// Return with requeue
		return ctrl.Result{RequeueAfter: time.Minute * 5}, nil
	}

	return ctrl.Result{}, nil
}
4. Testing
Use Cluster API testing frameworks:
import (
	"path/filepath"
	"testing"

	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

func TestMyInfrastructureClusterReconciler(t *testing.T) {
	// Start a local API server with the provider's CRDs installed
	testEnv := &envtest.Environment{
		CRDDirectoryPaths: []string{
			filepath.Join("..", "config", "crd", "bases"),
		},
	}
	cfg, err := testEnv.Start()
	if err != nil {
		t.Fatal(err)
	}
	defer testEnv.Stop()

	// Register the provider's API types (infrav1 is the provider API package)
	testScheme := runtime.NewScheme()
	_ = infrav1.AddToScheme(testScheme)

	// Build a client against the test API server
	testClient, err := client.New(cfg, client.Options{Scheme: testScheme})
	if err != nil {
		t.Fatal(err)
	}

	// Create reconciler
	r := &MyInfrastructureClusterReconciler{
		Client: testClient,
		Scheme: testScheme,
	}

	// Run tests against r; for end-to-end coverage see the Cluster API
	// e2e framework in sigs.k8s.io/cluster-api/test/framework
	// ...
	_ = r
}
Development Tools
Tilt Integration
Runtime Extension supports Tilt for rapid development:
# Tiltfile
# docker_build, k8s_yaml, and k8s_resource are Tilt built-ins, so no extension load is required

# Build provider image
docker_build('my-infrastructure-provider', '.')

# Deploy provider
k8s_yaml('config/manager/manager.yaml')
k8s_resource('my-infrastructure-provider', port_forwards='9443:9443')
Development Workflow
- Local Development: Use Tilt for rapid iteration.
- Unit Testing: Write unit tests for provider logic.
- Integration Testing: Test with real infrastructure.
- E2E Testing: Use Cluster API E2E framework.
Security Guidelines
Credential Management
// Use Kubernetes secrets for credentials
secret := &corev1.Secret{}
secretKey := client.ObjectKey{
	Namespace: infraCluster.Namespace,
	Name:      infraCluster.Spec.CredentialsRef.Name,
}
if err := r.Get(ctx, secretKey, secret); err != nil {
	return err
}

// Use credentials securely (do not log or write them to status)
credentials := secret.Data["credentials"]
Least Privilege
- IAM Roles: Use least-privilege IAM roles.
- Service Accounts: Use dedicated service accounts.
- Secret Rotation: Implement credential rotation.
- Audit Logging: Log all infrastructure operations.
Rate Limiting
// Implement rate limiting for cloud API calls (golang.org/x/time/rate)
var rateLimiter = rate.NewLimiter(rate.Every(time.Second), 10)

func (r *MyInfrastructureClusterReconciler) createResource(ctx context.Context, resource Resource) error {
	// Block until the limiter allows another call (or the context is cancelled)
	if err := rateLimiter.Wait(ctx); err != nil {
		return err
	}

	// Make the API call (Resource and infrastructureClient are provider-specific)
	return r.infrastructureClient.CreateResource(ctx, resource)
}
Case Studies
Case Study 1: On-Premises Provider
A team built a provider for their on-premises infrastructure:
type OnPremisesClusterSpec struct {
	Datacenter           string                `json:"datacenter"`
	NetworkVLAN          string                `json:"networkVLAN"`
	StoragePool          string                `json:"storagePool"`
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}
Benefits:
- Standardized Operations: Consistent cluster management across on-premises environments.
- GitOps Integration: On-premises infrastructure managed as code.
- Multi-Cluster Management: Manage multiple on-premises clusters.
Case Study 2: Edge Provider
A team built a provider for edge deployments:
type EdgeClusterSpec struct {
	EdgeLocation         string                `json:"edgeLocation"`
	Connectivity         string                `json:"connectivity"` // "online" | "offline"
	LocalStorage         bool                  `json:"localStorage"`
	ControlPlaneEndpoint clusterv1.APIEndpoint `json:"controlPlaneEndpoint,omitempty"`
}
Benefits:
- Edge Consistency: Standardized edge cluster management.
- Offline Support: Support for offline edge deployments.
- Centralized Management: Manage edge clusters from a central management cluster.
Practical Considerations
Provider Maintenance
- Version Compatibility: Keep provider compatible with Cluster API versions.
- Testing: Comprehensive testing before releases.
- Documentation: Maintain up-to-date documentation.
- Community: Engage with Cluster API community.
Integration Patterns
- Infrastructure APIs: Integrate with infrastructure provider APIs.
- Authentication: Implement secure authentication.
- Error Handling: Handle infrastructure errors gracefully.
- Retry Logic: Implement retry logic for transient failures (see the sketch below).
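For transient infrastructure errors (timeouts, rate limits), a common controller-runtime pattern is to requeue rather than surface a hard failure. A minimal sketch, assuming a provider-specific isTransient check:
// Retry transient infrastructure errors by requeueing the request
if err := r.reconcileInfrastructure(ctx, cluster, infraCluster); err != nil {
	if isTransient(err) { // hypothetical provider-specific classification
		// Back off and try again later instead of returning a hard error
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, err
}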
Caveats & Limitations
- Learning Curve: Runtime Extension still requires Cluster API knowledge.
- Provider Complexity: Complex infrastructure requires complex providers.
- Maintenance: Providers require ongoing maintenance.
- Testing: Testing providers requires infrastructure access.
Common Challenges
- API Changes: Cluster API changes may require provider updates.
- Infrastructure Limits: Infrastructure provider limits and quotas.
- Network Connectivity: Reliable connectivity between the management cluster and the target infrastructure.
- Credential Management: Secure credential management.
Conclusion
Cluster API Runtime Extension in 2023 democratized provider development, making it practical for teams to build custom infrastructure providers. The Runtime Extension framework and RuntimeSDK simplified provider development, enabling organizations to extend Cluster API to their specific infrastructure needs.
The ability to build custom providers opened new possibilities: on-premises deployments, edge environments, specialized cloud platforms, and hybrid setups could all benefit from Cluster API’s declarative cluster management model. Runtime Extension made Cluster API truly infrastructure-agnostic.
For teams operating on infrastructure without official Cluster API providers, Runtime Extension provided a path to standardized, declarative cluster management. The framework, SDK, and best practices that emerged in 2023 would enable a new generation of infrastructure providers, expanding Cluster API’s reach beyond the major cloud providers.
Runtime Extension wasn’t just a development tool; it was an enabler of infrastructure innovation, allowing teams to bring Cluster API’s benefits to any infrastructure platform. By mid-2023, Runtime Extension had proven that custom provider development was not just possible, but practical and powerful.