Resilience Patterns

Resilience patterns help your applications handle failures gracefully, maintain availability during disruptions, and recover automatically. Kubernetes provides several mechanisms to make your workloads more resilient: health probes, pod disruption budgets, automatic pod replacement, and graceful shutdown handling.

What Is Resilience?

Resilience in Kubernetes means your applications can:

  • Detect and respond to failures automatically
  • Maintain service availability during disruptions
  • Recover from failures without manual intervention
  • Handle planned maintenance with minimal impact

graph TB
    A[Resilience Mechanisms] --> B[Health Probes]
    A --> C[Pod Disruption Budgets]
    A --> D[Self-Healing]
    A --> E[Graceful Shutdown]
    B --> F[Detect Pod Health]
    C --> G[Control Disruptions]
    D --> H[Replace Failed Pods]
    E --> I[Clean Termination]
    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#fff4e1
    style D fill:#fff4e1
    style E fill:#fff4e1

Health Probes

Health probes let Kubernetes know if your application is healthy and ready to serve traffic. Kubernetes uses probes to make decisions about pod lifecycle and traffic routing.

Types of Probes

Liveness Probe - Determines whether the container is still healthy. If it fails, the kubelet kills the container and restarts it according to the pod's restart policy.

Readiness Probe - Determines if the container is ready to receive traffic. If it fails, Kubernetes removes the pod from Service endpoints.

Startup Probe - Determines if the container has started. Used for slow-starting containers to give them more time before liveness/readiness probes begin.
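For a slow-starting service, a startup probe sketch might look like this (the /healthz path and port 8080 are illustrative assumptions):

```yaml
# Hypothetical startup probe: allows up to 300 seconds
# (failureThreshold x periodSeconds) before giving up.
startupProbe:
  httpGet:
    path: /healthz   # assumed health endpoint
    port: 8080       # assumed container port
  failureThreshold: 30
  periodSeconds: 10
```

Until the startup probe succeeds, liveness and readiness probes are held back, so slow initialization does not trigger spurious restarts.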

graph TD
    A[Pod Created] --> B[Startup Probe]
    B --> C{Started?}
    C -->|No| D[Wait]
    D --> B
    C -->|Yes| E[Liveness Probe Begins]
    E --> F{Healthy?}
    F -->|No| G[Restart Container]
    F -->|Yes| H[Readiness Probe Begins]
    H --> I{Ready?}
    I -->|No| J[Remove from Service]
    I -->|Yes| K[Pod Ready]
    G --> E
    style A fill:#e1f5ff
    style K fill:#e8f5e9
    style G fill:#ffe1e1
    style J fill:#fff4e1

Pod Disruption Budgets

Pod Disruption Budgets (PDBs) ensure that a minimum number of pods remain available during voluntary disruptions like cluster maintenance, node drains, or pod evictions.

graph TB
    A[Voluntary Disruption] --> B{PDB Configured?}
    B -->|No| C[All Pods Can Be Disrupted]
    B -->|Yes| D[Check PDB Rules]
    D --> E{Min Available Met?}
    E -->|Yes| F[Allow Disruption]
    E -->|No| G[Block Disruption]
    F --> H[Pods Terminated]
    G --> I[Wait for Capacity]
    style A fill:#e1f5ff
    style F fill:#e8f5e9
    style G fill:#ffe1e1
    style H fill:#fff4e1

PDBs protect against voluntary disruptions such as:

  • Node drains during maintenance or upgrades
  • Cluster autoscaler scale-down
  • Manual pod evictions via the Eviction API

Note that PDBs do not apply to involuntary disruptions (node crashes, out-of-memory kills) or to Deployment rolling updates, which are governed by the Deployment's own update strategy (maxUnavailable/maxSurge).

Self-Healing

Kubernetes restarts unhealthy containers and replaces pods lost to node failures:

graph TD
    A[Pod Running] --> B[Failure Detected]
    B --> C{Liveness Probe Fails}
    C -->|Yes| D[Container Killed]
    D --> E[Container Restarted in Place]
    E --> F[Health Checks Pass]
    F --> G[Pod Ready]
    H[Node Failure] --> I[Pods Marked as Failed]
    I --> J[Workload Controller Detects]
    J --> K[New Pods Created on Other Nodes]
    style A fill:#e1f5ff
    style D fill:#ffe1e1
    style K fill:#fff4e1
    style G fill:#e8f5e9
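Self-healing relies on a workload controller noticing that pods are missing and recreating them. A minimal sketch of a Deployment whose controller maintains three replicas (the names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # illustrative name
spec:
  replicas: 3               # the controller recreates pods to keep this count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: my-app:1.0   # illustrative image
```

With a single replica, self-healing still replaces a failed pod, but the service is unavailable while the replacement starts; multiple replicas avoid that gap.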

Graceful Shutdown

Applications should handle termination gracefully:

sequenceDiagram
    participant Controller
    participant Kubelet
    participant Pod
    participant App
    Controller->>Kubelet: Terminate Pod
    Kubelet->>Pod: PreStop Hook Executed
    Pod->>App: Drain Connections
    Kubelet->>Pod: SIGTERM Signal
    App->>App: Finish Processing
    App->>Kubelet: Exit
    Kubelet->>Controller: Pod Terminated

During graceful shutdown:

  1. PreStop hook executes (if configured)
  2. Container receives the SIGTERM signal
  3. Application stops accepting new requests
  4. Application completes in-flight requests
  5. Application exits cleanly

If the application doesn’t terminate within the termination grace period (default: 30 seconds), Kubernetes sends SIGKILL.
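If a clean shutdown needs more than the default, the grace period can be raised in the pod spec. A sketch (the 60-second value and image are illustrative):

```yaml
# Pod spec fragment; terminationGracePeriodSeconds covers the
# preStop hook plus SIGTERM handling combined.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: app
      image: my-app:1.0   # illustrative image
```

Time spent in a preStop hook counts against the grace period, so size it to cover both the hook and the application's own drain time.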

Resilience Mechanisms Working Together

graph TB
    A[Application Running] --> B[Health Probes Monitor]
    B --> C{Unhealthy?}
    C -->|Yes| D[Self-Healing: Replace Pod]
    C -->|No| E[Continue Running]
    F[Maintenance Needed] --> G[Pod Disruption Budget]
    G --> H{Min Available?}
    H -->|Yes| I[Allow Disruption]
    H -->|No| J[Block Disruption]
    I --> K[Graceful Shutdown]
    K --> L[Pod Terminated]
    L --> M[New Pod Started]
    style A fill:#e1f5ff
    style D fill:#fff4e1
    style G fill:#e8f5e9
    style K fill:#f3e5f5

Common Resilience Patterns

Pattern 1: Health Checks with Automatic Recovery

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Pattern 2: Pod Disruption Budget for High Availability

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
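A PDB can alternatively be expressed with maxUnavailable, which is often easier to keep correct as the replica count changes (the names and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb-alt    # illustrative name
spec:
  maxUnavailable: 1       # at most one pod may be evicted at a time
  selector:
    matchLabels:
      app: my-app         # must match the workload's pod labels
```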

Pattern 3: Graceful Shutdown with PreStop Hook

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]

A short sleep in the preStop hook delays SIGTERM delivery, giving Service endpoints and load balancers time to stop routing new traffic to the pod before the application shuts down.

When to Use Resilience Patterns

Use resilience patterns when:

  • Production workloads - Critical applications that must stay available
  • High availability requirements - Applications that can’t tolerate downtime
  • Variable traffic - Applications with unpredictable load patterns
  • Maintenance windows - Need to handle planned maintenance gracefully
  • Multi-zone deployments - Applications distributed across availability zones

Consider simpler approaches when:

  • Development/testing - Non-critical environments
  • Batch jobs - One-time tasks that don’t need continuous availability
  • Simple applications - Basic apps that can tolerate brief outages

Best Practices

  1. Configure health probes - Always define liveness and readiness probes

  2. Set appropriate probe timing - Balance responsiveness with stability

  3. Use startup probes for slow starters - Give applications time to initialize

  4. Define Pod Disruption Budgets - Protect critical workloads during maintenance

  5. Implement graceful shutdown - Handle SIGTERM and complete in-flight requests

  6. Test failure scenarios - Verify resilience patterns work as expected

  7. Monitor probe failures - Track health probe failures to identify issues

  8. Set resource requests - Ensure pods have adequate resources to be healthy

  9. Use multiple replicas - Enable self-healing with multiple pod instances

  10. Document expectations - Clearly document what constitutes healthy/ready state

See Also

  • Deployments - Workloads that benefit from resilience patterns
  • Services - Traffic routing based on readiness probes
  • Autoscaling - Scaling based on health and demand