Resilience Patterns

Resilience patterns help your applications handle failures gracefully, maintain availability during disruptions, and recover automatically. Kubernetes provides several mechanisms to make your workloads more resilient: health probes, pod disruption budgets, automatic pod replacement, and graceful shutdown handling.

What Is Resilience?

Resilience in Kubernetes means your applications can:

  • Detect and respond to failures automatically
  • Maintain service availability during disruptions
  • Recover from failures without manual intervention
  • Handle planned maintenance with minimal impact

graph TB
    A[Resilience Mechanisms] --> B[Health Probes]
    A --> C[Pod Disruption Budgets]
    A --> D[Self-Healing]
    A --> E[Graceful Shutdown]
    B --> F[Detect Pod Health]
    C --> G[Control Disruptions]
    D --> H[Replace Failed Pods]
    E --> I[Clean Termination]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#fff4e1
    style D fill:#fff4e1
    style E fill:#fff4e1

Health Probes

Health probes let Kubernetes know if your application is healthy and ready to serve traffic. Kubernetes uses probes to make decisions about pod lifecycle and traffic routing.

Types of Probes

Liveness Probe - Determines whether the container is still healthy. If it fails, the kubelet kills the container and restarts it according to the pod's restart policy.

Readiness Probe - Determines if the container is ready to receive traffic. If it fails, Kubernetes removes the pod from Service endpoints.

Startup Probe - Determines if the container has started. Used for slow-starting containers to give them more time before liveness/readiness probes begin.
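For a slow-starting service, a startup probe sketch might look like this (the /healthz path and port 8080 are illustrative assumptions):

```yaml
# Hypothetical startup probe: allows up to 300 seconds
# (failureThreshold x periodSeconds) before giving up.
startupProbe:
  httpGet:
    path: /healthz   # assumed health endpoint
    port: 8080       # assumed container port
  failureThreshold: 30
  periodSeconds: 10
```

Until the startup probe succeeds, liveness and readiness probes are held back, so slow initialization does not trigger spurious restarts.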

graph TD
    A[Pod Created] --> B[Startup Probe]
    B --> C{Started?}
    C -->|No| D[Wait]
    D --> B
    C -->|Yes| E[Liveness Probe Begins]
    E --> F{Healthy?}
    F -->|No| G[Restart Container]
    F -->|Yes| H[Readiness Probe Begins]
    H --> I{Ready?}
    I -->|No| J[Remove from Service]
    I -->|Yes| K[Pod Ready]
    G --> E
    style A fill:#e1f5ff
    style K fill:#e8f5e9
    style G fill:#ffe1e1
    style J fill:#fff4e1

Pod Disruption Budgets

Pod Disruption Budgets (PDBs) ensure that a minimum number of pods remain available during voluntary disruptions like cluster maintenance, node drains, or pod evictions.

graph TB
    A[Voluntary Disruption] --> B{PDB Configured?}
    B -->|No| C[All Pods Can Be Disrupted]
    B -->|Yes| D[Check PDB Rules]
    D --> E{Min Available Met?}
    E -->|Yes| F[Allow Disruption]
    E -->|No| G[Block Disruption]
    F --> H[Pods Terminated]
    G --> I[Wait for Capacity]
    style A fill:#e1f5ff
    style F fill:#e8f5e9
    style G fill:#ffe1e1
    style H fill:#fff4e1

PDBs protect against voluntary disruptions such as:

  • Node drains during maintenance or upgrades
  • Cluster autoscaler scale-down
  • Manual pod evictions via the Eviction API

Note that PDBs do not apply to involuntary disruptions (node crashes, out-of-memory kills) or to Deployment rolling updates, which are governed by the Deployment's own update strategy (maxUnavailable/maxSurge).

Self-Healing

Kubernetes restarts unhealthy containers and replaces pods lost to node failures:

graph TD
    A[Pod Running] --> B[Failure Detected]
    B --> C{Liveness Probe Fails}
    C -->|Yes| D[Container Killed]
    D --> E[Container Restarted in Place]
    E --> F[Health Checks Pass]
    F --> G[Pod Ready]
    H[Node Failure] --> I[Pods Marked as Failed]
    I --> J[Workload Controller Detects]
    J --> K[New Pods Created on Other Nodes]
    style A fill:#e1f5ff
    style D fill:#ffe1e1
    style K fill:#fff4e1
    style G fill:#e8f5e9
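Self-healing relies on a workload controller noticing that pods are missing and recreating them. A minimal sketch of a Deployment whose controller maintains three replicas (the names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # illustrative name
spec:
  replicas: 3               # the controller recreates pods to keep this count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.0   # illustrative image
```

With a single replica, self-healing still replaces a failed pod, but the service is unavailable while the replacement starts; multiple replicas avoid that gap.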

Graceful Shutdown

Applications should handle termination gracefully:

sequenceDiagram
    participant Controller
    participant Kubelet
    participant Pod
    participant App
    Controller->>Kubelet: Terminate Pod
    Kubelet->>Pod: PreStop Hook Executed
    Pod->>App: Drain Connections
    Kubelet->>Pod: SIGTERM Signal
    App->>App: Finish Processing
    App->>Kubelet: Exit
    Kubelet->>Controller: Pod Terminated

During graceful shutdown:

  1. PreStop hook executes (if configured)
  2. Container receives the SIGTERM signal
  3. Application stops accepting new requests
  4. Application completes in-flight requests
  5. Application exits cleanly

If the application doesn’t terminate within the termination grace period (default: 30 seconds), Kubernetes sends SIGKILL.
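If a clean shutdown needs more than the default, the grace period can be raised in the pod spec. A sketch (the 60-second value and image are illustrative):

```yaml
# Pod spec fragment; terminationGracePeriodSeconds covers the
# preStop hook plus SIGTERM handling combined.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: my-app:1.0   # illustrative image
```

Time spent in a preStop hook counts against the grace period, so size it to cover both the hook and the application's own drain time.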

Resilience Mechanisms Working Together

graph TB
    A[Application Running] --> B[Health Probes Monitor]
    B --> C{Unhealthy?}
    C -->|Yes| D[Self-Healing: Replace Pod]
    C -->|No| E[Continue Running]
    F[Maintenance Needed] --> G[Pod Disruption Budget]
    G --> H{Min Available?}
    H -->|Yes| I[Allow Disruption]
    H -->|No| J[Block Disruption]
    I --> K[Graceful Shutdown]
    K --> L[Pod Terminated]
    L --> M[New Pod Started]
    style A fill:#e1f5ff
    style D fill:#fff4e1
    style G fill:#e8f5e9
    style K fill:#f3e5f5

Common Resilience Patterns

Pattern 1: Health Checks with Automatic Recovery

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Pattern 2: Pod Disruption Budget for High Availability

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
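A PDB can alternatively be expressed with maxUnavailable, which is often easier to keep correct as the replica count changes (the names and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb-alt    # illustrative name
spec:
  maxUnavailable: 1       # at most one pod may be evicted at a time
  selector:
    matchLabels:
      app: my-app         # must match the workload's pod labels
```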

Pattern 3: Graceful Shutdown with PreStop Hook

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

A short sleep in the preStop hook delays SIGTERM delivery, giving Service endpoints and load balancers time to stop routing new traffic to the pod before the application shuts down.

When to Use Resilience Patterns

Use resilience patterns when:

  • Production workloads - Critical applications that must stay available
  • High availability requirements - Applications that can’t tolerate downtime
  • Variable traffic - Applications with unpredictable load patterns
  • Maintenance windows - Need to handle planned maintenance gracefully
  • Multi-zone deployments - Applications distributed across availability zones

Consider simpler approaches when:

  • Development/testing - Non-critical environments
  • Batch jobs - One-time tasks that don’t need continuous availability
  • Simple applications - Basic apps that can tolerate brief outages

Best Practices

  1. Configure health probes - Always define liveness and readiness probes

  2. Set appropriate probe timing - Balance responsiveness with stability

  3. Use startup probes for slow starters - Give applications time to initialize

  4. Define Pod Disruption Budgets - Protect critical workloads during maintenance

  5. Implement graceful shutdown - Handle SIGTERM and complete in-flight requests

  6. Test failure scenarios - Verify resilience patterns work as expected

  7. Monitor probe failures - Track health probe failures to identify issues

  8. Set resource requests - Ensure pods have adequate resources to be healthy

  9. Use multiple replicas - Enable self-healing with multiple pod instances

  10. Document expectations - Clearly document what constitutes healthy/ready state

See Also

  • Deployments - Workloads that benefit from resilience patterns
  • Services - Traffic routing based on readiness probes
  • Autoscaling - Scaling based on health and demand