Resilience Patterns
Resilience patterns help your applications handle failures gracefully, maintain availability during disruptions, and recover automatically. Kubernetes provides several mechanisms to make your workloads more resilient: health probes, pod disruption budgets, automatic pod replacement, and graceful shutdown handling.
What Is Resilience?
Resilience in Kubernetes means your applications can:
- Detect and respond to failures automatically
- Maintain service availability during disruptions
- Recover from failures without manual intervention
- Handle planned maintenance with minimal impact
Health Probes
Health probes let Kubernetes know if your application is healthy and ready to serve traffic. Kubernetes uses probes to make decisions about pod lifecycle and traffic routing.
Types of Probes
Liveness Probe - Determines if the container is still functioning. If it fails, the kubelet kills the container, which is then restarted according to the pod's restart policy.
Readiness Probe - Determines if the container is ready to receive traffic. If it fails, Kubernetes removes the pod from Service endpoints until the probe succeeds again; the container is not restarted.
Startup Probe - Determines if the application in the container has started. Used for slow-starting containers: liveness and readiness probes are held off until the startup probe succeeds.
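As an illustration of how a startup probe buys time for a slow-starting container, here is a minimal container-level sketch. The path, port, and thresholds are assumptions chosen for the example, not recommended values.

```yaml
# Container-level fragment (goes under spec.containers[] in a pod template).
# Path, port, and timings are illustrative assumptions.
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # allows up to 30 x 10s = 300s for startup
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3    # takes over only after the startup probe succeeds
```

With this shape, the liveness probe can keep a tight failure threshold for steady-state operation, because the long startup allowance lives in the startup probe instead of in a large initialDelaySeconds.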
Pod Disruption Budgets
Pod Disruption Budgets (PDBs) ensure that a minimum number of pods remain available during voluntary disruptions like cluster maintenance, node drains, or pod evictions.
PDBs protect against:
- Node maintenance
- Node upgrades
- Cluster autoscaling (scale-down)
- Manual pod evictions
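PDBs only limit evictions made through the Eviction API. They do not limit Deployment rolling updates or direct pod deletions; availability during a rollout is controlled by the Deployment's own update strategy (maxUnavailable and maxSurge). They also do not prevent involuntary disruptions such as node failures.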
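A budget can be expressed either as a floor on available pods (minAvailable) or as a ceiling on unavailable pods (maxUnavailable); only one of the two may be set in a single PDB. The sketch below shows the maxUnavailable form. The resource name and the app label are assumptions for the example.

```yaml
# Illustrative PDB: at most one matching pod may be evicted at a time.
# The name and label values are assumptions for this example.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb-max-unavailable
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
```

The maxUnavailable form tends to track replica changes more naturally, since it does not need to be updated when the workload is scaled up or down. It is an alternative to a minAvailable budget for the same pods, not something to apply alongside one.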
Self-Healing
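Kubernetes automatically replaces failed or unhealthy pods. The kubelet restarts containers that crash or fail their liveness probe, and controllers such as Deployments and ReplicaSets create replacement pods when a pod is deleted or its node is lost. Pods created directly, without a controller, are not recreated.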
Graceful Shutdown
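Applications should handle termination gracefully so that in-flight work is not cut off when a pod is stopped.
During graceful shutdown:
- Pod is marked as terminating, and Kubernetes starts removing it from Service endpoints
- PreStop hook executes (if configured)
- Container receives SIGTERM signal once the PreStop hook completes
- Application stops accepting new requests
- Application completes in-flight requests
- Application exits cleanly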
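If the application doesn’t terminate within the termination grace period (terminationGracePeriodSeconds, default: 30 seconds), Kubernetes sends SIGKILL. The grace period starts when termination begins, so time spent in the PreStop hook counts against it.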
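Because the PreStop hook and the application's own shutdown share one grace period, the two settings are usually sized together. The fragment below is a sketch of that arithmetic. The container name, image, and durations are hypothetical, and the exec hook assumes the image ships /bin/sh.

```yaml
# Pod spec fragment; container name, image, and durations are illustrative.
spec:
  terminationGracePeriodSeconds: 45   # total budget for preStop + shutdown
  containers:
    - name: my-app
      image: registry.example.com/my-app:1.0   # hypothetical image
      lifecycle:
        preStop:
          exec:
            # Uses 15s of the 45s budget, leaving about 30s after SIGTERM
            command: ["/bin/sh", "-c", "sleep 15"]
```

If the hook's sleep were as long as the grace period itself, the application would have almost no time left to drain requests after SIGTERM arrives.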
Resilience Mechanisms Working Together
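One way the mechanisms described above can combine is sketched below: a Deployment with several replicas, probes, and a PreStop hook, paired with a PDB that selects the same pods. This is an illustrative assembly under assumed names (my-app, the image, the /health and /ready paths, port 8080), not a reference configuration.

```yaml
# Illustrative only: names, image, paths, and timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3                     # multiple replicas enable self-healing
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0   # hypothetical image
          ports:
            - containerPort: 8080
          livenessProbe:          # restart the container if it stops responding
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 10
          readinessProbe:         # route traffic only to ready pods
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
          lifecycle:
            preStop:              # delay SIGTERM while endpoints update
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2                 # keep 2 of the 3 replicas during evictions
  selector:
    matchLabels:
      app: my-app
```

In this arrangement the readiness probe governs traffic, the liveness probe governs restarts, the Deployment keeps the replica count, and the PDB limits how many pods a node drain can evict at once.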
Common Resilience Patterns
Pattern 1: Health Checks with Automatic Recovery
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
Pattern 2: Pod Disruption Budget for High Availability
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
Pattern 3: Graceful Shutdown with PreStop Hook
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
When to Use Resilience Patterns
Use resilience patterns when:
✅ Production workloads - Critical applications that must stay available
✅ High availability requirements - Applications that can’t tolerate downtime
✅ Variable traffic - Applications with unpredictable load patterns
✅ Maintenance windows - Need to handle planned maintenance gracefully
✅ Multi-zone deployments - Applications distributed across availability zones
Consider simpler approaches when:
❌ Development/testing - Non-critical environments
❌ Batch jobs - One-time tasks that don’t need continuous availability
❌ Simple applications - Basic apps that can tolerate brief outages
Best Practices
- Configure health probes - Always define liveness and readiness probes
- Set appropriate probe timing - Balance responsiveness with stability
- Use startup probes for slow starters - Give applications time to initialize
- Define Pod Disruption Budgets - Protect critical workloads during maintenance
- Implement graceful shutdown - Handle SIGTERM and complete in-flight requests
- Test failure scenarios - Verify resilience patterns work as expected
- Monitor probe failures - Track health probe failures to identify issues
- Set resource requests - Ensure pods have adequate resources to be healthy
- Use multiple replicas - Enable self-healing with multiple pod instances
- Document expectations - Clearly document what constitutes healthy/ready state
Topics
- Probes - Health checks: liveness, readiness, and startup probes
- Pod Disruption Budgets - Controlling pod disruptions during maintenance
See Also
- Deployments - Workloads that benefit from resilience patterns
- Services - Traffic routing based on readiness probes
- Autoscaling - Scaling based on health and demand