Netflix's Chaos Engineering Innovation
Netflix pioneered Chaos Engineering in the cloud through a suite of tools known as the Simian Army. These tools automate failure injection and compliance checks to verify that systems stay resilient.
Chaos Monkey
Overview
Chaos Monkey is the most famous member of the Simian Army. It randomly terminates instances (virtual machines) in production to ensure the system handles server failures gracefully.
How It Works
- Runs on a Schedule: Typically runs during business hours (e.g., 9am-3pm)
- Random Selection: Randomly picks instances across regions/availability zones
- Termination: Kills the selected instances without warning
- Observation: The system should hold its steady state and recover automatically
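The four steps above can be sketched as a small simulation. This is a minimal sketch, not Netflix's implementation: `Instance`, `in_window`, and `run_once` are hypothetical names, and "termination" is modeled by removing an entry from an in-memory list rather than calling a cloud API.

```python
import random
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Instance:
    instance_id: str
    zone: str


def in_window(now: datetime, start: time = time(9), end: time = time(15)) -> bool:
    """Return True on weekdays between start (inclusive) and end (exclusive)."""
    return now.weekday() < 5 and start <= now.time() < end


def run_once(fleet, now, rng, leashed=True):
    """Pick one random instance; remove it from the fleet unless leashed.

    Returns the chosen instance, or None when outside the schedule window
    or when the fleet is empty.
    """
    if not in_window(now) or not fleet:
        return None
    victim = rng.choice(fleet)
    if not leashed:
        fleet.remove(victim)  # stands in for a real terminate call (assumption)
    return victim


fleet = [Instance(f"i-{n:04d}", zone) for n, zone in enumerate(["a", "b", "c", "a"])]
rng = random.Random(42)  # seeded so the run is repeatable
victim = run_once(fleet, datetime(2024, 3, 4, 10, 0), rng, leashed=False)  # a Monday
print(victim, len(fleet))
```

Running with `leashed=True` first is a cheap way to see which instance would have been chosen without touching the fleet.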
Why Kill Instances Randomly?
- Prevents Complacency: Teams can't assume instances will always be there
- Forces Resilience: Systems must tolerate abrupt instance loss, not just graceful shutdown
- Tests Load Balancers: Ensures load balancers detect failures and route around them
- Tests Auto-Scaling: Verifies new instances spin up and rejoin the load balancer
- Tests Health Checks: Confirms health checks work properly
Configuration Example
# Chaos Monkey Configuration (illustrative)
monkey:
  enabled: true
  termination_schedule: "0 9 * * *"  # 9am daily (minute hour day-of-month month day-of-week)
  regions:
    - us-east-1
    - us-west-2
  frequency:
    mean_time_between_kills: 1  # Kill an instance roughly every 1 day
  leashed: false  # true = dry-run, false = actual termination
  exceptions:
    - tag: "do_not_kill"
    - name: "*-prod-critical-*"
Expected System Behavior
✅ Good: Instance is killed → New instance starts → Traffic reroutes → No user impact
❌ Bad: Instance is killed → Traffic fails → Users see errors → Manual intervention needed
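Both outcomes can be reproduced with a toy model. The sketch below assumes a hypothetical `Pool` class that stands in for an auto-scaling group behind a load balancer; it is only meant to show why redundancy decides which path a kill takes.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy model of an auto-scaling group behind a load balancer."""
    desired: int
    healthy: set = field(default_factory=set)
    launched: int = 0

    def reconcile(self):
        """Auto-scaling step: launch instances until desired capacity is met."""
        while len(self.healthy) < self.desired:
            self.healthy.add(f"i-{self.launched}")
            self.launched += 1

    def kill(self, instance_id):
        """Abrupt termination: the instance drops out of the healthy set."""
        self.healthy.discard(instance_id)

    def route(self, request_no):
        """Round-robin over healthy instances; None models a user-facing error."""
        if not self.healthy:
            return None
        targets = sorted(self.healthy)
        return targets[request_no % len(targets)]


# Good path: three replicas, one is killed, survivors keep serving.
pool = Pool(desired=3)
pool.reconcile()
pool.kill("i-0")
served = [pool.route(n) for n in range(4)]
pool.reconcile()  # a replacement instance joins
print(served, sorted(pool.healthy))

# Bad path: a single replica leaves nothing to route to until it is replaced.
fragile = Pool(desired=1)
fragile.reconcile()
fragile.kill("i-0")
print(fragile.route(0))
```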
The Simian Army
Netflix expanded beyond Chaos Monkey with additional "monkeys" targeting different failure modes:
Chaos Gorilla
Targets: Full availability zone failures
What It Does: Terminates all instances in an entire availability zone
Why It Matters: Tests multi-AZ failover, data replication across zones, and DNS failover
Risk Level: High blast radius—typically run less frequently
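Before running a zone-level experiment, it can help to estimate whether the remaining zones could carry the load. The following is a rough sketch under stated assumptions: `zones_at_risk` is a hypothetical helper, and capacity is reduced to a simple instance count.

```python
def zones_at_risk(instances_per_zone, required):
    """Return the zones whose loss would leave fewer than `required` instances."""
    total = sum(instances_per_zone.values())
    return sorted(
        zone
        for zone, count in instances_per_zone.items()
        if total - count < required
    )


# Hypothetical layout: 10 instances spread unevenly over three zones.
layout = {"us-east-1a": 4, "us-east-1b": 4, "us-east-1c": 2}
print(zones_at_risk(layout, required=7))
print(zones_at_risk(layout, required=6))
```

An empty result suggests the service has enough headroom to lose any single zone; a non-empty one names the zones to fix first.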
chaos_gorilla:
  enabled: true
  kill_probability: 0.5  # Only 50% chance to actually run when triggered
  frequency: "monthly"
Chaos Kong
Targets: Entire region failures
What It Does: Simulates an entire AWS region becoming unavailable
Why It Matters: Tests global failover, multi-region data consistency, and disaster recovery
Risk Level: Very high—typically used for special testing events
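A region-level exercise is largely a question of where the failed region's traffic goes. As a simplified sketch, assume traffic shifts to the surviving regions in proportion to the load they already carry; `fail_over` and the request-per-second figures are hypothetical.

```python
def fail_over(load_rps, capacity_rps, failed):
    """Shift the failed region's load to survivors, proportional to their load.

    Returns {region: (new_load_rps, within_capacity)} for each survivor.
    """
    survivors = {r: load for r, load in load_rps.items() if r != failed}
    base = sum(survivors.values())
    if base == 0:
        raise ValueError("no surviving region carries traffic to scale from")
    shifted = load_rps[failed]
    result = {}
    for region, load in survivors.items():
        new_load = load + shifted * load / base
        result[region] = (new_load, new_load <= capacity_rps[region])
    return result


load = {"us-east-1": 6000, "us-west-2": 3000, "eu-west-1": 1000}
capacity = {"us-east-1": 9000, "us-west-2": 8000, "eu-west-1": 2000}
outcome = fail_over(load, capacity, failed="us-east-1")
print(outcome)
```

Any region flagged `False` would need extra capacity, or a different routing policy, before the exercise is safe to run.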
Latency Monkey
Targets: Network latency issues
What It Does: Injects artificial latency (delays) into inter-service communication
Why It Matters: Tests timeout handling, circuit breakers, and graceful degradation
Common Injected Latencies:
- 100-500ms: Client-perceptible slowdown
- 1-5s: Service timeout scenarios
- 10-30s: Hard timeout scenarios
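The effect of these latency bands on a client can be simulated without real network calls. This sketch assumes a 50 ms baseline service time, a 1000 ms client timeout, and a hypothetical `CircuitBreaker` that opens after three consecutive timeouts; none of these values come from Netflix's tooling.

```python
def call_with_budget(service_ms, injected_ms, timeout_ms):
    """Return (succeeded, elapsed_ms) for one simulated remote call."""
    total = service_ms + injected_ms
    if total <= timeout_ms:
        return True, total
    return False, timeout_ms  # the caller gives up at the timeout


class CircuitBreaker:
    """Opens after `threshold` consecutive failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.threshold

    def record(self, succeeded):
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1


breaker = CircuitBreaker(threshold=3)
outcomes = []
for injected_ms in [100, 500, 1000, 5000, 10000, 30000]:
    if breaker.is_open:
        # Graceful degradation: skip the call instead of waiting for a timeout.
        outcomes.append((injected_ms, "short-circuited"))
        continue
    ok, _elapsed = call_with_budget(service_ms=50, injected_ms=injected_ms, timeout_ms=1000)
    breaker.record(ok)
    outcomes.append((injected_ms, "ok" if ok else "timed out"))
print(outcomes)
```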
latency_monkey:
  enabled: true
  rpc_latency_ms: 500  # Add 500ms to remote calls
  correlation_id_pattern: ".*latency.*"  # Only apply to requests matching pattern
Conformity Monkey
Targets: Configuration drift
What It Does: Verifies instances comply with expected configuration standards and terminates non-compliant instances
Why It Matters: Forces proper configuration management and prevents snowflake servers
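A conformity check amounts to running a set of rules over each instance's metadata. The rules, field names, and image ID below are hypothetical examples chosen for illustration, not the rules Netflix used.

```python
APPROVED_IMAGES = {"ami-base-v7"}  # hypothetical image name

# Each rule maps a name to a predicate that returns True when the instance complies.
RULES = {
    "in_auto_scaling_group": lambda inst: inst.get("asg") is not None,
    "has_owner_tag": lambda inst: "owner" in inst.get("tags", {}),
    "approved_image": lambda inst: inst.get("image") in APPROVED_IMAGES,
}


def violations(instance):
    """Return the names of all rules the instance fails, sorted."""
    return sorted(name for name, check in RULES.items() if not check(instance))


fleet = [
    {"id": "i-1", "asg": "web", "tags": {"owner": "team-a"}, "image": "ami-base-v7"},
    {"id": "i-2", "asg": None, "tags": {}, "image": "ami-handmade"},
]
non_compliant = {inst["id"]: violations(inst) for inst in fleet if violations(inst)}
print(non_compliant)
```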
Security Monkey
Targets: AWS security configuration issues
What It Does: Scans AWS accounts for security misconfigurations
Why It Matters: Identifies security group issues, public buckets, etc.
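One common misconfiguration is a sensitive port open to the whole internet. The sketch below scans a simplified, hypothetical representation of security groups; the dictionary shape and `SENSITIVE_PORTS` are assumptions, not the AWS API format.

```python
SENSITIVE_PORTS = {22, 3389, 5432}  # SSH, RDP, PostgreSQL (illustrative choice)


def open_to_world(security_groups):
    """Return (group_name, port) pairs where a sensitive port allows 0.0.0.0/0."""
    findings = []
    for group in security_groups:
        for rule in group["ingress"]:
            if rule["cidr"] == "0.0.0.0/0" and rule["port"] in SENSITIVE_PORTS:
                findings.append((group["name"], rule["port"]))
    return findings


groups = [
    {"name": "web", "ingress": [{"port": 443, "cidr": "0.0.0.0/0"}]},
    {"name": "db", "ingress": [
        {"port": 5432, "cidr": "0.0.0.0/0"},
        {"port": 5432, "cidr": "10.0.0.0/8"},
    ]},
    {"name": "bastion", "ingress": [{"port": 22, "cidr": "0.0.0.0/0"}]},
]
print(open_to_world(groups))
```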
Janitor Monkey
Targets: Unused resources
What It Does: Cleans up unused resources (dangling security groups, unused load balancers, unattached volumes)
Why It Matters: Reduces costs and prevents configuration clutter
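Cleanup usually applies a retention period so that recently detached resources are not deleted by mistake. As a rough sketch, assuming a hypothetical volume record with `attached_to` and `detached_at` fields and a seven-day retention:

```python
from datetime import datetime, timedelta


def cleanup_candidates(volumes, now, retention=timedelta(days=7)):
    """Return IDs of volumes that have been unattached for longer than `retention`."""
    return [
        v["id"]
        for v in volumes
        if v["attached_to"] is None and now - v["detached_at"] > retention
    ]


volumes = [
    {"id": "vol-1", "attached_to": "i-1", "detached_at": None},
    {"id": "vol-2", "attached_to": None, "detached_at": datetime(2024, 3, 1)},
    {"id": "vol-3", "attached_to": None, "detached_at": datetime(2024, 3, 18)},
]
print(cleanup_candidates(volumes, now=datetime(2024, 3, 20)))
```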
The Simian Army Architecture
How They Work Together
           ┌─────────────────────────────────┐
           │ Chaos Monkey (Foundation)       │
           │ - Random instance kills         │
           └────────────────┬────────────────┘
                            │
     ┌──────────────┬───────┴───────┬──────────────┐
     │              │               │              │
     ▼              ▼               ▼              ▼
┌──────────┐ ┌──────────────┐ ┌───────────┐ ┌─────────────┐
│Gorilla   │ │Latency Monkey│ │Conformity │ │Security     │
│(AZ fail) │ │(network lag) │ │(config)   │ │(compliance) │
└──────────┘ └──────────────┘ └───────────┘ └─────────────┘
Running order: Conformity → Janitor → Chaos Monkey → Latency → Gorilla → Kong
Scheduling Considerations
Monday-Friday (Business Hours):
- 9am: Conformity Monkey runs
- 10am: Chaos Monkey runs (random)
- 11am: Latency Monkey runs (targeted)
- 3pm: Security Monkey audit
Weekly (Once):
- Friday 8pm: Chaos Gorilla runs (full AZ)
Monthly (Once):
- First Sunday of month: Chaos Kong runs (full region)
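The calendar above can be expressed as a small lookup. This sketch uses a hypothetical `jobs_for` function; because the schedule gives no time of day for Chaos Kong, its hour is left as `None` rather than guessed.

```python
from datetime import date


def jobs_for(day: date):
    """Return (hour, monkey) pairs scheduled for one calendar day."""
    jobs = []
    if day.weekday() < 5:  # Monday to Friday
        jobs += [
            (9, "Conformity Monkey"),
            (10, "Chaos Monkey"),
            (11, "Latency Monkey"),
            (15, "Security Monkey"),
        ]
    if day.weekday() == 4:  # Friday
        jobs.append((20, "Chaos Gorilla"))
    if day.weekday() == 6 and day.day <= 7:  # first Sunday of the month
        jobs.append((None, "Chaos Kong"))
    return jobs


print(jobs_for(date(2024, 3, 8)))   # a Friday
print(jobs_for(date(2024, 3, 3)))   # first Sunday of March 2024
```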
Implementing Chaos Monkey
Prerequisites
- Auto-scaling Groups: Terminated instances must be replaced automatically
- Load Balancers: Traffic must reroute to healthy instances
- Health Checks: System must detect failures automatically
- Monitoring: You need to observe what happens
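These prerequisites work well as a preflight gate that runs before any termination is allowed. A minimal sketch, assuming each service is described by a hypothetical dictionary of boolean flags:

```python
PREREQUISITES = ("auto_scaling_group", "load_balancer", "health_check", "monitoring")


def missing_prerequisites(service):
    """Return the prerequisites the service description does not satisfy."""
    return [name for name in PREREQUISITES if not service.get(name)]


def ready_for_chaos(service):
    """True only when every prerequisite is present."""
    return not missing_prerequisites(service)


checkout = {"auto_scaling_group": True, "load_balancer": True, "health_check": False}
print(missing_prerequisites(checkout), ready_for_chaos(checkout))
```

Reporting the missing items, rather than a bare yes or no, tells the owning team what to fix before opting in.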
Basic Setup
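# NOTE: The image name and CHAOS_MONKEY_* variables below are illustrative placeholders.
# The open-source Chaos Monkey release is deployed alongside Spinnaker and configured with a TOML file.
# 1. Install Chaos Monkey
docker pull netflix/chaosmonkey:latest
# 2. Configure (via environment variables)
export CHAOS_MONKEY_ENABLED=true
export CHAOS_MONKEY_LEASHED=false # Actually kill instances
export CHAOS_MONKEY_REGIONS=us-east-1,us-west-2
export CHAOS_MONKEY_SCHEDULE="0 9 * * MON-FRI" # 9am weekdays
# 3. Run (pass every variable set above into the container)
docker run \
  -e CHAOS_MONKEY_ENABLED="$CHAOS_MONKEY_ENABLED" \
  -e CHAOS_MONKEY_LEASHED="$CHAOS_MONKEY_LEASHED" \
  -e CHAOS_MONKEY_REGIONS="$CHAOS_MONKEY_REGIONS" \
  -e CHAOS_MONKEY_SCHEDULE="$CHAOS_MONKEY_SCHEDULE" \
  netflix/chaosmonkey:latest
When to Use Chaos Monkey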
✅ Good Use Cases
- Testing auto-scaling behavior
- Verifying load balancer health checks
- Confirming services tolerate abrupt instance termination
- Testing service discovery mechanisms
❌ Avoid With
- Custom hardware with long startup times
- Non-redundant systems (no auto-scaling)
- Stateful services without replication
- Peak traffic periods
Modern Alternatives
While Chaos Monkey is powerful, newer tools offer additional features:
| Tool | Focus | Status | Platform |
|---|---|---|---|
| Chaos Monkey | Instance termination | Legacy | AWS-focused |
| Gremlin | Broad fault injection | Current | Multi-cloud |
| Litmus | Kubernetes fault injection | Current | Kubernetes |
| Chaos Mesh | Kubernetes fault injection | Current | Kubernetes |
Key Takeaways
- Chaos Monkey established the practice: Random instance termination at Netflix changed the industry
- The Simian Army expanded the concept: Different tools target different failure types
- Not just for Netflix: These principles apply to any auto-scaling, multi-instance system
- Evolution continues: Modern tools like Gremlin and Litmus build on these foundations