Chaos Monkey & Simian Army - Chaos Engineering

Netflix's Chaos Engineering Innovation

Netflix pioneered Chaos Engineering in the cloud through a suite of tools known as the Simian Army. These tools automate the injection of various types of failures to ensure system resilience.

Chaos Monkey

Overview

Chaos Monkey is the most famous member of the Simian Army. It randomly terminates instances (virtual machines) in production to ensure the system handles server failures gracefully.

How It Works

Runs on a Schedule: Typically runs during business hours (e.g., 9am-3pm)
Random Selection: Randomly picks instances across regions/availability zones
Termination: Kills the selected instances without warning
Observation: System should maintain steady-state and recover automatically
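The four steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not Netflix's implementation: the function names `in_window` and `run_once`, the 9am-3pm weekday window, and the injected `terminate` callback are all assumptions made for the example.

```python
import random
from datetime import datetime, time


def in_window(now, start=time(9, 0), end=time(15, 0)):
    """True on weekdays between start (inclusive) and end (exclusive)."""
    return now.weekday() < 5 and start <= now.time() < end


def run_once(instances, now, rng, leashed=True, terminate=print):
    """Pick one instance at random; terminate it unless leashed or out of hours.

    Returns the selected instance, or None when nothing was selected.
    """
    if not instances or not in_window(now):
        return None
    victim = rng.choice(instances)
    if not leashed:
        terminate(victim)  # hypothetical hook; a real tool would call the cloud API
    return victim


if __name__ == "__main__":
    rng = random.Random()
    print(run_once(["i-a", "i-b", "i-c"], datetime.now(), rng, leashed=True))
```

Passing the clock, the random source, and the termination hook as arguments keeps the sketch testable without touching real infrastructure.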

Why Kill Instances Randomly?

Prevents Complacency: Teams can't assume instances will always be there
Forces Resilience: Systems must tolerate abrupt instance loss, not just graceful shutdown
Tests Load Balancers: Ensures load balancers detect failures and route around them
Tests Auto-Scaling: Verifies new instances spin up and rejoin the load balancer
Tests Health Checks: Confirms health checks work properly

Configuration Example

# Chaos Monkey Configuration (illustrative schema)
monkey:
  enabled: true
  termination_schedule: "0 9 * * *"  # 9am daily (minute hour day month weekday)
  regions:
    - us-east-1
    - us-west-2
  frequency:
    mean_time_between_kills: 1  # In days: kill an instance roughly once per day
  
  leashed: false  # true = dry-run, false = actual termination
  
  exceptions:
    - tag: "do_not_kill"
    - name: "*-prod-critical-*"
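To make the `exceptions` and `mean_time_between_kills` settings concrete, here is a minimal Python sketch of how a tool might interpret them. The helper names `is_exempt` and `daily_kill_probability`, and the dictionary shape used for an instance, are hypothetical. The probability model (an independent draw each day) is one simple assumption that yields the configured mean, not necessarily what Chaos Monkey does internally.

```python
from fnmatch import fnmatchcase


def is_exempt(instance, exempt_tags=("do_not_kill",), exempt_names=("*-prod-critical-*",)):
    """True if the instance has an exempt tag or its name matches an exempt glob."""
    if any(tag in instance.get("tags", ()) for tag in exempt_tags):
        return True
    return any(fnmatchcase(instance["name"], pattern) for pattern in exempt_names)


def daily_kill_probability(mean_days_between_kills):
    """Per-day kill probability that averages one kill every N days."""
    if mean_days_between_kills < 1:
        raise ValueError("mean_days_between_kills must be at least 1")
    return 1.0 / mean_days_between_kills


if __name__ == "__main__":
    print(is_exempt({"name": "db-prod-critical-01", "tags": []}))
    print(daily_kill_probability(4))
```

With a mean of 4 days, each day carries a 25% chance of a kill, so some weeks see several terminations and others none.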

Expected System Behavior

✅ Good: Instance is killed → Traffic reroutes to healthy instances → Replacement instance starts → No user impact

❌ Bad: Instance is killed → Traffic fails → Users see errors → Manual intervention needed
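One way to tell the good outcome from the bad one automatically is to compare an error-rate metric sampled during the experiment against its pre-experiment baseline. The sketch below assumes error rates are fractions between 0 and 1. The function name `steady_state_held` and the default tolerance of one percentage point are illustrative choices.

```python
def steady_state_held(baseline_error_rate, observed_error_rates, tolerance=0.01):
    """True if every sample during the experiment stays within tolerance of baseline."""
    limit = baseline_error_rate + tolerance
    return all(rate <= limit for rate in observed_error_rates)


if __name__ == "__main__":
    # Samples taken while an instance was being terminated (made-up numbers).
    print(steady_state_held(0.002, [0.002, 0.004, 0.003]))
    print(steady_state_held(0.002, [0.002, 0.25, 0.01]))
```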

The Simian Army

Netflix expanded beyond Chaos Monkey with additional "monkeys" targeting different failure modes:

Chaos Gorilla

Targets: Full availability zone failures

What It Does: Terminates all instances in an entire availability zone

Why It Matters: Tests multi-AZ failover, data replication across zones, and DNS failover

Risk Level: High blast radius—typically run less frequently

chaos_gorilla:
  enabled: true
  kill_probability: 0.5  # Only 50% chance to actually run when triggered
  frequency: "monthly"
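The `kill_probability` gate can be sketched as follows. This is an assumed interpretation of the setting rather than documented Chaos Gorilla behavior, and `pick_zone_to_fail` is a hypothetical name. The guard that refuses to fail the only remaining zone is an extra safety check added for the example.

```python
import random


def pick_zone_to_fail(instances_by_zone, kill_probability, rng):
    """Return (zone, instances) to take down, or None if this trigger is skipped."""
    if len(instances_by_zone) < 2:
        return None  # never fail the only zone the service runs in
    if rng.random() >= kill_probability:
        return None  # the probability gate decided not to run this time
    zone = rng.choice(sorted(instances_by_zone))
    return zone, list(instances_by_zone[zone])


if __name__ == "__main__":
    zones = {"us-east-1a": ["i-1", "i-2"], "us-east-1b": ["i-3"]}
    print(pick_zone_to_fail(zones, 0.5, random.Random()))
```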

Chaos Kong

Targets: Entire region failures

What It Does: Simulates an entire AWS region becoming unavailable

Why It Matters: Tests global failover, multi-region data consistency, and disaster recovery

Risk Level: Very high—typically used for special testing events

Latency Monkey

Targets: Network latency issues

What It Does: Injects artificial latency (delays) into inter-service communication

Why It Matters: Tests timeout handling, circuit breakers, and graceful degradation

Common Injected Latencies:

100-500ms: Client-perceptible slowdown
1-5s: Service timeout scenarios
10-30s: Hard timeout scenarios

latency_monkey:
  enabled: true
  rpc_latency_ms: 500  # Add 500ms to remote calls
  correlation_id_pattern: ".*latency.*"  # Only apply to requests matching pattern
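The two settings above amount to "delay a call if its correlation ID matches a pattern". A minimal Python sketch of that idea wraps a remote call in a function that sleeps first. The wrapper name `with_injected_latency`, the convention that the correlation ID is the first argument, and the injectable `sleep` parameter are assumptions for illustration.

```python
import re
import time


def with_injected_latency(call, delay_ms, pattern, sleep=time.sleep):
    """Wrap call(correlation_id, ...) so that matching requests are delayed first."""
    matcher = re.compile(pattern)

    def wrapped(correlation_id, *args, **kwargs):
        if matcher.fullmatch(correlation_id):
            sleep(delay_ms / 1000.0)  # convert milliseconds to seconds
        return call(correlation_id, *args, **kwargs)

    return wrapped


if __name__ == "__main__":
    def fetch_profile(correlation_id, user_id):
        return {"user": user_id}

    slow_fetch = with_injected_latency(fetch_profile, 500, ".*latency.*")
    print(slow_fetch("req-latency-42", 7))  # delayed by about 500ms
    print(slow_fetch("req-normal-7", 7))    # not delayed
```

Restricting the delay to matching correlation IDs keeps the blast radius small: only requests deliberately tagged for the experiment are slowed down.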

Conformity Monkey

Targets: Configuration drift

What It Does: Verifies instances comply with expected configuration standards and terminates non-compliant instances

Why It Matters: Forces proper configuration management and prevents snowflake servers
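A conformity check is essentially a set of named rules applied to each instance. The sketch below shows that shape in Python. The rule names, the instance fields (`asg`, `tags`, `image`), and the approved image ID are all invented for the example and do not come from Conformity Monkey itself.

```python
def conformity_violations(instance, rules):
    """Return the names of the rules that the instance fails, in rule order."""
    return [name for name, check in rules.items() if not check(instance)]


# Hypothetical example rules; a real deployment would define its own standards.
RULES = {
    "in_auto_scaling_group": lambda i: bool(i.get("asg")),
    "has_owner_tag": lambda i: "owner" in i.get("tags", {}),
    "uses_approved_image": lambda i: i.get("image") in {"ami-base-2024"},
}


if __name__ == "__main__":
    snowflake = {"asg": None, "tags": {}, "image": "ami-handmade"}
    print(conformity_violations(snowflake, RULES))
```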

Security Monkey

Targets: AWS security configuration issues

What It Does: Scans AWS accounts for security misconfigurations

Why It Matters: Identifies security group issues, public buckets, etc.
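As an example of the kind of check involved, the sketch below flags security groups that expose sensitive ports to the whole internet. The data shape (a dictionary with an `ingress` list of `cidr`, `from_port`, `to_port` entries) and the choice of ports 22 and 3389 are assumptions for illustration, not Security Monkey's actual rule format.

```python
def open_to_world(security_group, sensitive_ports=(22, 3389)):
    """Return the sensitive ports that the group exposes to 0.0.0.0/0, sorted."""
    exposed = set()
    for rule in security_group["ingress"]:
        if rule["cidr"] != "0.0.0.0/0":
            continue  # only rules open to every address are of interest
        for port in sensitive_ports:
            if rule["from_port"] <= port <= rule["to_port"]:
                exposed.add(port)
    return sorted(exposed)


if __name__ == "__main__":
    group = {"ingress": [
        {"cidr": "0.0.0.0/0", "from_port": 22, "to_port": 22},
        {"cidr": "10.0.0.0/8", "from_port": 3389, "to_port": 3389},
    ]}
    print(open_to_world(group))
```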

Janitor Monkey

Targets: Unused resources

What It Does: Cleans up unused resources (dangling security groups, unused load balancers, unattached volumes)

Why It Matters: Reduces costs and prevents configuration clutter
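The selection step for one resource type, unattached volumes, might look like the following sketch. The field names (`attached_to`, `last_used`) and the 30-day idle threshold are hypothetical. A cautious implementation would likely notify the owner and wait before deleting anything.

```python
from datetime import datetime, timedelta


def cleanup_candidates(volumes, now, min_idle_days=30):
    """Return IDs of volumes that are unattached and have been idle long enough."""
    cutoff = now - timedelta(days=min_idle_days)
    return [
        v["id"]
        for v in volumes
        if v["attached_to"] is None and v["last_used"] <= cutoff
    ]


if __name__ == "__main__":
    volumes = [
        {"id": "vol-1", "attached_to": None, "last_used": datetime(2024, 1, 1)},
        {"id": "vol-2", "attached_to": "i-9", "last_used": datetime(2024, 1, 1)},
    ]
    print(cleanup_candidates(volumes, datetime(2024, 6, 1)))
```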

The Simian Army Architecture

How They Work Together

┌─────────────────────────────────┐
│   Chaos Monkey (Foundation)     │
│   - Random instance kills       │
└──────────────┬──────────────────┘
               │
      ┌────────┴────────┬─────────────┬──────────────┐
      │                 │             │              │
      ▼                 ▼             ▼              ▼
┌──────────┐  ┌──────────────┐  ┌───────────┐  ┌─────────────┐
│Gorilla   │  │Latency Monkey│  │Conformity │  │Security     │
│(AZ fail) │  │(network lag) │  │(config)   │  │(compliance) │
└──────────┘  └──────────────┘  └───────────┘  └─────────────┘

Running order: Conformity → Janitor → Chaos Monkey → Latency → Gorilla → Kong

Scheduling Considerations

Monday-Friday (Business Hours):
  9am  - Conformity Monkey runs
  10am - Chaos Monkey runs (random)
  11am - Latency Monkey runs (targeted)
  3pm  - Security Monkey audit

Monthly (Once):
  First Friday 8pm - Chaos Gorilla runs (full AZ)
  First Sunday of month - Chaos Kong runs (full region)

Implementing Chaos Monkey

Prerequisites

Auto-scaling Groups: Replacement instances must launch automatically when one is killed
Load Balancers: Traffic must reroute to healthy instances
Health Checks: System must detect failures automatically
Monitoring: You need to observe what happens
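These prerequisites can be turned into a simple preflight check that runs before any termination is allowed. The sketch below assumes a service is described by a plain dictionary. The keys (`min_instances`, `load_balancer`, `health_check_path`, `dashboards`) and the function name `missing_prerequisites` are made up for the example.

```python
def missing_prerequisites(service):
    """Return the prerequisites that the service description does not satisfy."""
    checks = {
        "auto_scaling": service.get("min_instances", 0) >= 2,
        "load_balancer": bool(service.get("load_balancer")),
        "health_check": bool(service.get("health_check_path")),
        "monitoring": bool(service.get("dashboards")),
    }
    return [name for name, ok in checks.items() if not ok]


if __name__ == "__main__":
    service = {"min_instances": 3, "load_balancer": "api-lb"}
    gaps = missing_prerequisites(service)
    if gaps:
        print("Not ready for Chaos Monkey, missing:", ", ".join(gaps))
```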

Basic Setup

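# NOTE: The image name and CHAOS_MONKEY_* variables below are illustrative.
# The open-source Chaos Monkey is a Go binary that integrates with Spinnaker
# and reads its settings from a chaosmonkey.toml file.

# 1. Install Chaos Monkey
docker pull netflix/chaosmonkey:latest

# 2. Configure (via environment variables)
export CHAOS_MONKEY_ENABLED=true
export CHAOS_MONKEY_LEASHED=false  # false = actually kill instances
export CHAOS_MONKEY_REGIONS=us-east-1,us-west-2
export CHAOS_MONKEY_SCHEDULE="0 9 * * MON-FRI"  # 9am weekdays

# 3. Run, passing every exported variable through to the container
docker run -e CHAOS_MONKEY_ENABLED \
           -e CHAOS_MONKEY_LEASHED \
           -e CHAOS_MONKEY_REGIONS \
           -e CHAOS_MONKEY_SCHEDULE \
           netflix/chaosmonkey:latest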

When to Use Chaos Monkey

✅ Good Use Cases

Testing auto-scaling behavior
Verifying load balancer health checks
Verifying services survive abrupt instance termination
Testing service discovery mechanisms

❌ Avoid With

Custom hardware with long startup times
Non-redundant systems (no auto-scaling)
Stateful services without replication
Peak traffic periods

Modern Alternatives

While Chaos Monkey is powerful, newer tools offer additional features:

Tool	Focus	Status	Platform
Chaos Monkey	Instance termination	Legacy	AWS-focused
Gremlin	Comprehensive failures	Current	Multi-cloud
Litmus	Kubernetes-native experiments	Current	Kubernetes
Chaos Mesh	Kubernetes-native experiments	Current	Kubernetes

Key Takeaways

Chaos Monkey established the practice: Random instance termination at Netflix changed the industry
The Simian Army expanded the concept: Different tools target different failure types
Not just for Netflix: These principles apply to any auto-scaling, multi-instance system
Evolution continues: Modern tools like Gremlin and Litmus build on these foundations