GuideDevOps
Lesson 11 of 14

Advanced Resilience Patterns

Part of the Chaos Engineering tutorial series.

Overview

While chaos engineering tests resilience, knowing how to build resilient systems is equally important. Advanced resilience patterns help systems withstand failure gracefully.

Pattern 1: Circuit Breaker

Purpose: Prevent cascading failures by stopping requests to failing services

The Problem It Solves

Without Circuit Breaker:
  Service A calls Service B repeatedly
  Service B is down
  Service A keeps trying, wasting resources
  Requests pile up, eventually Service A crashes too (cascading failure)

With Circuit Breaker:
  Service B is down
  First few calls fail
  Circuit breaker opens (STOP making calls)
  Requests immediately fail fast (users can try again)
  When Service B recovers, circuit breaker closes

States

CLOSED (Normal Operation)
  │
  ├─ Requests succeed? → Stay CLOSED
  │
  └─ Error threshold exceeded? → Go to OPEN

OPEN (Failing, Stop Trying)
  │
  ├─ Timeout elapsed? → Go to HALF-OPEN
  │
  └─ Continue failing immediately (fail-fast)

HALF-OPEN (Testing if service recovered)
  │
  ├─ Test requests succeed? → Go to CLOSED
  │
  └─ Test requests fail? → Go to OPEN

Implementation Example

from datetime import datetime, timedelta
 
class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.state = 'CLOSED'
        self.failure_count = 0
        self.last_failure_time = None
    
    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if self._should_attempt_reset():
                self.state = 'HALF_OPEN'
            else:
                raise Exception('Circuit breaker is OPEN')
        
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            raise
    
    def _on_success(self):
        self.failure_count = 0
        self.state = 'CLOSED'
    
    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'
    
    def _should_attempt_reset(self):
        return (datetime.now() - self.last_failure_time) > timedelta(seconds=self.timeout)
 
# Usage
import requests

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def call_external_service():
    try:
        return breaker.call(requests.get, 'https://api.example.com/data')
    except Exception:
        return {'error': 'Service unavailable', 'status': 503}

Chaos Test for Circuit Breaker

# Test that circuit breaker activates when service fails
Hypothesis: "When downstream service errors exceed threshold,
            circuit breaker activates and prevents cascading failure"
 
Test Steps:
  1. Make 10 requests to downstream service (should succeed)
  2. Shut down downstream service
  3. Make 10 requests to downstream service (should fail, circuit breaker opens)
  4. Verify no further requests are made for 60 seconds (fail-fast)
  5. Restart downstream service
  6. Circuit breaker eventually closes after timeout
  7. Requests resume succeeding
 
Expected: The system degrades gracefully instead of failing in a cascade
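
The same steps can be simulated in-process. Below is a condensed copy of the CircuitBreaker from the implementation section (inlined so the snippet runs standalone), driven against a hypothetical FlakyService stand-in; the short threshold and timeout just keep the demo fast:

```python
from datetime import datetime, timedelta
import time

class CircuitBreaker:
    # Condensed copy of the implementation above, inlined so this runs standalone
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.state = 'CLOSED'
        self.failure_count = 0
        self.last_failure_time = None

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if (datetime.now() - self.last_failure_time) > timedelta(seconds=self.timeout):
                self.state = 'HALF_OPEN'
            else:
                raise Exception('Circuit breaker is OPEN')
        try:
            result = func(*args, **kwargs)
            self.failure_count = 0
            self.state = 'CLOSED'
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = datetime.now()
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
            raise

class FlakyService:
    # Hypothetical downstream service we can take up and down
    def __init__(self):
        self.up = True
    def call(self):
        if not self.up:
            raise ConnectionError('service down')
        return 'ok'

breaker = CircuitBreaker(failure_threshold=3, timeout=1)
svc = FlakyService()

assert breaker.call(svc.call) == 'ok'   # step 1: healthy requests succeed

svc.up = False                          # step 2: "shut down" the service
for _ in range(3):                      # step 3: failures open the breaker
    try:
        breaker.call(svc.call)
    except Exception:
        pass
assert breaker.state == 'OPEN'          # step 4: further calls now fail fast

svc.up = True                           # step 5: service recovers
time.sleep(1.1)                         # step 6: wait out the breaker timeout
assert breaker.call(svc.call) == 'ok'   # step 7: requests succeed again
assert breaker.state == 'CLOSED'
```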

Pattern 2: Bulkhead (Thread Pool Isolation)

Purpose: Isolate failures to prevent resource exhaustion from spreading

The Problem

Without Bulkheads (Shared Thread Pool):
  10 threads total, all shared
  Payment service uses 8 threads, gets slow
  Analytics service needs threads but all 8 are stuck waiting
  Analytics fails, entire system grinds to halt

With Bulkheads (Separate Thread Pools):
  Payment service: 5 threads
  Analytics service: 3 threads
  Notifications service: 2 threads
  If payment service gets slow, analytics and notifications still work

Implementation

import java.util.concurrent.*;
 
class BulkheadExecutor {
    private final ExecutorService paymentThreadPool;
    private final ExecutorService analyticsThreadPool;
    private final ExecutorService notificationsThreadPool;
    
    public BulkheadExecutor() {
        // Separate thread pools with defined sizes
        paymentThreadPool = Executors.newFixedThreadPool(5);
        analyticsThreadPool = Executors.newFixedThreadPool(3);
        // Bounded queue so submit() can actually reject when saturated
        // (newFixedThreadPool uses an unbounded queue and never rejects)
        notificationsThreadPool = new ThreadPoolExecutor(
            2, 2, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(10));
    }
    
    public Future<PaymentResult> processPayment(Payment payment) {
        // If the payment pool is busy, requests queue without touching other pools
        return paymentThreadPool.submit(() -> paymentService.process(payment));
    }
    
    public Future<AnalyticsEvent> trackEvent(Event event) {
        // An analytics slowdown cannot starve payment processing of threads
        return analyticsThreadPool.submit(() -> analyticsService.track(event));
    }
    
    // If the notifications pool and its queue are full, fail fast
    public Future<NotificationResult> sendNotification(Notification notif) {
        try {
            return notificationsThreadPool.submit(() -> notificationService.send(notif));
        } catch (RejectedExecutionException e) {
            // Pool saturated - return an immediate failure instead of queueing
            CompletableFuture<NotificationResult> future = new CompletableFuture<>();
            future.completeExceptionally(
                new Exception("Notification service overloaded")
            );
            return future;
        }
    }
}
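
For comparison, the same isolation can be sketched in Python with `concurrent.futures`; the work functions here are hypothetical stand-ins for the real service calls:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real service calls
def process_payment(order_id):
    return f'paid:{order_id}'

def track_event(name):
    return f'tracked:{name}'

# Separate pools: a slowdown in one service cannot starve the other of threads
payment_pool = ThreadPoolExecutor(max_workers=5)
analytics_pool = ThreadPoolExecutor(max_workers=3)

payment_future = payment_pool.submit(process_payment, 42)
analytics_future = analytics_pool.submit(track_event, 'page_view')

print(payment_future.result())    # → paid:42
print(analytics_future.result())  # → tracked:page_view
```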

Kubernetes Pod Disruption Budgets (PDB)

PDBs act as bulkheads at the infrastructure level, limiting how many pods of a service can be disrupted at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-service-pdb
spec:
  minAvailable: 2  # Always keep at least 2 pods running
  selector:
    matchLabels:
      app: payment-service
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: analytics-service-pdb
spec:
  maxUnavailable: 1  # Allow only 1 pod to be disrupted at a time
  selector:
    matchLabels:
      app: analytics-service

Pattern 3: Retry with Exponential Backoff and Jitter

Purpose: Handle transient failures automatically without overwhelming the system

The Problem

Without Exponential Backoff:
  Request fails
  Immediately retry
  Immediately retry
  Immediately retry (thundering herd - all clients retry simultaneously)
  System still slow, more failures

With Exponential Backoff + Jitter:
  Request fails (first retry: wait 1s with random jitter)
  Still fails (second retry: wait 2s with random jitter)
  Still fails (third retry: wait 4s with random jitter)
  Clients retry at different times (jitter spreads load)
  System recovers gradually

Implementation

import random
import time
 
def retry_with_backoff(func, max_retries=3, base_delay=1):
    """
    Retry a function with exponential backoff and jitter
    """
    retries = 0
    
    while retries < max_retries:
        try:
            return func()
        except Exception:
            if retries >= max_retries - 1:
                raise  # Give up after max retries
            
            # Delay doubles each attempt: base_delay * 2^retries, plus random jitter
            delay = base_delay * (2 ** retries) + random.uniform(0, 0.1)
            
            print(f"Attempt {retries + 1} failed. Retrying in {delay:.2f}s...")
            time.sleep(delay)
            retries += 1
 
# Usage
def call_database():
    # This might fail temporarily
    return db.execute_query()
 
result = retry_with_backoff(call_database, max_retries=3, base_delay=1)
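
To see the retry path actually recover, here is a quick check with a hypothetical flaky call that fails twice and then succeeds (the backoff helper is inlined so the snippet runs standalone, with a tiny base_delay so the demo finishes fast):

```python
import random
import time

def retry_with_backoff(func, max_retries=3, base_delay=1):
    # Condensed copy of the helper above, inlined so this runs standalone
    retries = 0
    while retries < max_retries:
        try:
            return func()
        except Exception:
            if retries >= max_retries - 1:
                raise
            delay = base_delay * (2 ** retries) + random.uniform(0, 0.1)
            time.sleep(delay)
            retries += 1

attempts = {'count': 0}

def flaky_call():
    # Hypothetical transient failure: errors twice, then succeeds
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('transient failure')
    return 'success'

result = retry_with_backoff(flaky_call, max_retries=5, base_delay=0.01)
print(result, attempts['count'])  # → success 3
```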

Decorator Pattern

import functools
import random
import time
 
def retry_decorator(max_retries=3, base_delay=1, exceptions=(Exception,)):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while retries < max_retries:
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if retries >= max_retries - 1:
                        raise
                    
                    delay = base_delay * (2 ** retries) + random.uniform(0, 0.1)
                    time.sleep(delay)
                    retries += 1
        return wrapper
    return decorator
 
# Usage
@retry_decorator(max_retries=3, base_delay=1)
def send_email(email):
    # This might fail transiently
    return email_service.send(email)

Pattern 4: Timeout

Purpose: Prevent hung requests from consuming resources indefinitely

The Problem

Without Timeout:
  Request sent to slow service
  Request takes 30 seconds (connection hung)
  Client waits forever
  Thread occupied
  More requests... more hung threads
  Thread pool exhausted
  Entire system becomes unresponsive

With Timeout:
  Request sent to slow service
  After 5 seconds with no response, timeout
  Request fails fast, resource released
  Client can retry or show error

Implementation

import requests
 
def call_with_timeout(url, timeout=5):
    try:
        # Requests library supports timeout
        response = requests.get(url, timeout=timeout)
        return response.json()
    except requests.Timeout:
        print(f"Request timed out after {timeout}s")
        raise
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        raise
 
# Different timeout for different operations
# Connect timeout: 2s (if connection takes >2s, fail)
# Read timeout: 5s (if no data for >5s, fail)
response = requests.get(
    'https://api.example.com/data',
    timeout=(2, 5)  # (connect_timeout, read_timeout)
)
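
For calls that don't accept a timeout parameter natively, one option is to run the work in a worker thread and bound how long the caller waits. This is a sketch, not a drop-in for every case: the hung work itself keeps running in the background, only the caller is freed.

```python
import concurrent.futures
import time

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_deadline(func, timeout, *args, **kwargs):
    # Bound how long the *caller* waits; the worker thread is not killed
    # and runs to completion in the background
    future = _pool.submit(func, *args, **kwargs)
    return future.result(timeout=timeout)

def slow_operation():
    time.sleep(3)  # simulates a hung downstream call
    return 'done'

try:
    call_with_deadline(slow_operation, timeout=0.5)
except concurrent.futures.TimeoutError:
    print('gave up after 0.5s')  # caller freed; the thread is still finishing
```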

Kubernetes Timeout Configuration

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend-service
spec:
  hosts:
  - backend-service
  http:
  - match:
    - uri:
        prefix: /api/
    route:
    - destination:
        host: backend-service
        port:
          number: 8080
    timeout: 5s  # Maximum time for request
    
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-service
spec:
  host: backend-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
        maxRequestsPerConnection: 2

Pattern 5: Graceful Degradation

Purpose: Provide reduced functionality instead of complete failure

The Problem

Without Graceful Degradation:
  Cache service (Redis) fails
  Application throws error
  Users see \"500 Internal Server Error\"
  Users angry, leave your site

With Graceful Degradation:
  Cache service (Redis) fails
  Application detects circuit breaker OPEN
  Queries database directly (slower but works)
  Users see same page, slightly slower (2s instead of 200ms)
  Users don't notice, system recovers
  Once cache recovers, performance returns to normal

Implementation

import logging

logger = logging.getLogger(__name__)

class UserService:
    def __init__(self, cache, db, cache_backup):
        self.cache = cache
        self.db = db
        self.cache_backup = cache_backup
    
    def get_user(self, user_id):
        # Try cache first (fast path); a cache failure is not fatal
        try:
            user = self.cache.get(f'user:{user_id}')
            if user:
                return user
        except Exception as cache_error:
            logger.warning(f'Cache read failed: {cache_error}')
        
        try:
            # Cache miss (or cache down): get from the database
            user = self.db.query_user(user_id)
        except Exception as db_error:
            # Last resort: return stale data from a backup cache
            logger.error(f'Database query failed: {db_error}')
            stale_user = self.cache_backup.get(f'user:{user_id}')
            if stale_user:
                # Mark as stale so the client knows this might be old
                stale_user['_stale'] = True
                return stale_user
            # No stale data available, must fail
            raise Exception('User service unavailable')
        
        # Try to cache for next time; a failure here is not critical
        try:
            self.cache.set(f'user:{user_id}', user, ttl=3600)
        except Exception as cache_error:
            logger.warning(f'Cache set failed: {cache_error}')
        
        return user
 
# Usage (cache, db and cache_backup are the real clients)
service = UserService(cache, db, cache_backup)
user = service.get_user(123)
print(user)  # Works even if the cache or database temporarily fails

Pattern 6: Rate Limiting / Throttling

Purpose: Prevent overwhelming dependent systems during failure

Token Bucket Algorithm

import time
from threading import Lock
 
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity  # Max tokens
        self.tokens = capacity     # Current tokens
        self.refill_rate = refill_rate  # Tokens per second
        self.last_refill = time.time()
        self.lock = Lock()
    
    def allow_request(self):
        """
        Check if request is allowed (has token available)
        """
        with self.lock:
            self._refill()
            
            if self.tokens >= 1:  # require a whole token; fractional refill doesn't count
                self.tokens -= 1
                return True
            return False
    
    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now
 
# Usage: 100 requests per second, burst to 200
limiter = TokenBucket(capacity=200, refill_rate=100)
 
def api_endpoint():
    if not limiter.allow_request():
        return {'error': 'Rate limit exceeded'}, 429
    
    return process_request()
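
The burst-then-refill behaviour is easy to check directly. Below is a condensed copy of TokenBucket (inlined so the snippet runs standalone) with a small capacity so the demo is quick:

```python
import time
from threading import Lock

class TokenBucket:
    # Condensed copy of the class above, inlined so this runs standalone
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate
        self.last_refill = time.time()
        self.lock = Lock()

    def allow_request(self):
        with self.lock:
            now = time.time()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

bucket = TokenBucket(capacity=3, refill_rate=1)  # 1 token/s, burst of 3

# The burst is admitted, then the bucket is empty
print([bucket.allow_request() for _ in range(4)])  # → [True, True, True, False]

time.sleep(1.2)                # ~1 token refills
print(bucket.allow_request())  # → True
```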

Kubernetes Rate Limiting

The Istio configuration below combines fault injection (useful for chaos testing) with connection-pool limits, which enforce the actual throttle:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-service
spec:
  hosts:
  - api-service
  http:
  - fault:
      delay:
        percentage:
          value: 0.1       # Delay 0.1% of requests
        fixedDelay: 5s     # By 5 seconds (for testing)
      abort:
        percentage:
          value: 0.001     # Abort 0.001% of requests
        grpcStatus: UNAVAILABLE  # Return UNAVAILABLE
    route:
    - destination:
        host: api-service
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service
spec:
  host: api-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
      http:
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
        maxRequestsPerConnection: 1

Testing These Patterns with Chaos Engineering

Experiment 1: Verify Circuit Breaker

# Make requests that work
for i in {1..10}; do curl http://service/health; done
 
# Kill service
kubectl delete pod -n production service-pod-xyz
 
# Make requests (should see circuit breaker activate)
for i in {1..20}; do curl http://service/api; done
 
# Restore service
kubectl get pods -n production  # Verify pod restarted
 
# Requests should eventually succeed
for i in {1..10}; do curl http://service/api; done

Experiment 2: Verify Timeout

# Add 10-second latency to all requests
gremlin attack latency --latency 10000 --length 60
 
# Verify requests timeout at 5 seconds instead of hanging
time curl http://service/api  # Should fail after ~5s, not hang

Key Takeaways

  1. Circuit Breaker: Stop cascading by failing fast
  2. Bulkhead: Isolate failures with resource separation
  3. Retry + Backoff: Handle transient failures automatically
  4. Timeout: Prevent resource exhaustion from hung requests
  5. Graceful Degradation: Provide reduced functionality instead of complete failure
  6. Rate Limiting: Protect against being overwhelmed