
The Five Principles of Chaos Engineering

The Chaos Engineering community has established five core principles to guide chaos experiments. These principles, published as the Principles of Chaos Engineering (principlesofchaos.org), help keep experiments safe, meaningful, and productive.

1. Build a Hypothesis Around Steady-State Behavior

Principle: Focus on the measurable output of a system rather than internal attributes. A system's steady-state behavior defines its normal operating conditions.

What This Means

Before introducing any chaos, establish what "normal" looks like:

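Response time is consistently under 200ms
Error rate stays below 0.1%
CPU usage is between 30% and 60% (an internal signal that supports, but does not replace, the output metrics above)
User requests succeed within that error budget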

Example

Hypothesis: "Our payment service maintains a success rate of at least 99.95%, and will continue to do so while the fault is injected"
Measurable Output: "Payment transaction success rate"
Normal Baseline: "At least 99.95% over a 5-minute rolling window"
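
The baseline above can be turned into an automated check. The following is a minimal sketch, assuming payment outcomes arrive as (timestamp, succeeded) observations and that the threshold is inclusive; the class name SteadyStateCheck and its methods are hypothetical, not part of any chaos tool.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SteadyStateCheck:
    """Tracks request outcomes in a rolling time window (hypothetical helper)."""
    threshold: float = 0.9995       # 99.95% success rate from the example
    window_seconds: float = 300.0   # 5-minute rolling window
    _events: deque = field(default_factory=deque, init=False, repr=False)

    def record(self, timestamp: float, succeeded: bool) -> None:
        """Add one observation and evict those that left the window."""
        self._events.append((timestamp, succeeded))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def success_rate(self) -> float:
        if not self._events:
            return 1.0  # assumption: an empty window is treated as healthy
        ok = sum(1 for _, succeeded in self._events if succeeded)
        return ok / len(self._events)

    def holds(self) -> bool:
        """True while the steady-state hypothesis is still satisfied."""
        return self.success_rate() >= self.threshold
```

Evaluating holds() both before and during an experiment gives a comparison between the control state and the experimental state, which is what the hypothesis is meant to test.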

2. Vary Real-World Events

Principle: Chaos experiments should reflect the types of failures that occur in real production environments. These are not artificial or unrealistic scenarios—they're events that actually happen.

Real-World Failures to Consider

Infrastructure Failures: Server crashes, disk full, network partition
Application Issues: Memory leaks, resource exhaustion, thread pool depletion
Dependency Failures: Database unavailability, API timeouts, external service degradation
Configuration Problems: Incorrect settings, version mismatches, certificate expiration
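
As an illustration of one failure from this list, the sketch below simulates a network partition that cuts a service off from its cache. It assumes an in-memory stand-in for the cache and a dictionary standing in for the database; PartitionableCache and read_profile are hypothetical names chosen for the example.

```python
class PartitionableCache:
    """In-memory stand-in for a cache client (hypothetical)."""

    def __init__(self):
        self._data = {}
        self.partitioned = False  # the experiment flips this flag

    def get(self, key):
        if self.partitioned:
            # Simulates the cache layer being unreachable over the network.
            raise ConnectionError("cache unreachable")
        return self._data.get(key)

    def set(self, key, value):
        if self.partitioned:
            raise ConnectionError("cache unreachable")
        self._data[key] = value


def read_profile(user_id, cache, database):
    """Read-through lookup that falls back to the database if the cache is cut off."""
    try:
        cached = cache.get(user_id)
    except ConnectionError:
        # Fallback path that the experiment is meant to exercise.
        return database[user_id]
    if cached is not None:
        return cached
    value = database[user_id]
    cache.set(user_id, value)  # populate the cache for later reads
    return value
```

Setting cache.partitioned = True during a run shows whether callers still get an answer, which is the behavior a real partition would test.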

Example

Instead of: "Let's create a completely random failure"
Do this: "Let's simulate a network partition that isolates our cache layer. This happens when rack switches fail, which is a real operational risk."

3. Run Experiments in Production

Principle: Running experiments in production is not mandatory, but the best learning comes from testing there, with real traffic, real data volumes, and real system interactions.

Why Production?

Real Complexity: Production has interactions test environments cannot replicate
Real Traffic: User behavior patterns differ from synthetic load
Real Data: Volume and characteristics match actual usage
Confidence: A system that withstands chaos in production gives the team evidence, rather than assumptions, about its resilience

Safety Considerations

Start with non-critical systems
Use gradual rollout (1% → 5% → 10% of traffic)
Have rollback procedures ready
Keep the team on standby for every run; begin in off-peak hours and move to business hours once earlier runs have passed
Inform stakeholders beforehand
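
A gradual rollout needs a stable way to decide which requests are in the experiment. The sketch below is one possible approach, assuming each request carries a string identifier; in_experiment is a hypothetical helper. Hashing the identifier makes the choice deterministic, so a request selected at 1% stays selected at 5% and 10%.

```python
import hashlib


def in_experiment(request_id: str, percent: float) -> bool:
    """Return True if this request falls inside the experiment's traffic share.

    `percent` is the share of traffic to include, from 0 to 100.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets, giving 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Because the bucket depends only on the identifier, raising the percentage widens the affected group without reshuffling who is in it, which keeps results comparable between stages.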

Example Progression

Week 1: Test in staging environment
Week 2: Test in production, non-business hours, 1% of traffic
Week 3: Test in production, non-business hours, 5% of traffic
Week 4: Test in production, business hours, 5% of traffic

4. Automate Experiments and Keep Them Running

Principle: Manual experiments are one-time learning events. Automated, continuous chaos testing is how you build system resilience over time.

Benefits of Automation

Continuous Learning: Each run re-checks the hypothesis as the system changes
Behavioral Verification: The system's actual response is your test assertion
Scalability: Test hundreds of scenarios without manual effort
Regression Prevention: Catch changes that reintroduce a weakness an earlier experiment exposed

Automation Tiers

Level 1: Manual experiments on-demand
Level 2: Scheduled daily/weekly experiments
Level 3: Experiments triggered by deployments
Level 4: Continuous chaos in production

Example

Cron job: "Every Monday at 2am, inject 500ms latency for 5 minutes"
Deployment hook: "After every production deployment, run chaos tests"
Continuous: "Randomly fail 5% of requests during high-traffic windows"
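
The cron example can also be expressed inside the application. The following is a rough sketch, assuming the schedule is evaluated in the server's local time; latency_to_inject and with_scheduled_latency are hypothetical helpers, and the clock and sleep parameters exist so the behavior can be tested without waiting.

```python
import time
from datetime import datetime, timedelta

LATENCY_SECONDS = 0.5          # 500ms, from the cron example
WINDOW = timedelta(minutes=5)  # how long the experiment lasts


def latency_to_inject(now: datetime) -> float:
    """Return the delay for `now`: 0.5s inside the Monday 02:00 window, else 0."""
    window_start = now.replace(hour=2, minute=0, second=0, microsecond=0)
    # datetime.weekday() returns 0 for Monday.
    in_window = now.weekday() == 0 and window_start <= now < window_start + WINDOW
    return LATENCY_SECONDS if in_window else 0.0


def with_scheduled_latency(handler, clock=datetime.now, sleep=time.sleep):
    """Wrap `handler` so calls made during the window are delayed first."""
    def wrapped(*args, **kwargs):
        delay = latency_to_inject(clock())
        if delay > 0:
            sleep(delay)
        return handler(*args, **kwargs)
    return wrapped
```

Keeping the schedule in a pure function makes the experiment itself easy to unit test, separate from the handler it wraps.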

5. Minimize Blast Radius

Principle: Design experiments to limit their scope and impact. Start small, verify safety, then expand.

Blast Radius Strategies

Geographic Isolation: Test in one region first
Percentage-Based: Start with 1% of traffic/resources
Service Isolation: Test non-critical services before core services
Time Windows: Run during off-peak hours initially
Automatic Rollback: Have circuit breakers that stop the experiment if damage is detected
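
The automatic rollback strategy can be sketched as a guard around the experiment. This is a minimal illustration, assuming the caller supplies three callables (inject, rollback, and read_error_rate) and that rollback is safe to call even if injection only partly succeeded; run_with_abort is a hypothetical name, and the 0.1% default limit is borrowed from the steady-state example earlier in this document.

```python
import time


def run_with_abort(inject, rollback, read_error_rate,
                   max_error_rate=0.001, checks=5, interval=1.0,
                   wait=time.sleep):
    """Run an experiment and stop early if the error rate exceeds the limit.

    Returns "completed" or "aborted". rollback() runs in every case,
    including when inject() or a metric read raises an exception.
    """
    try:
        inject()
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"  # damage detected: stop before it spreads
            wait(interval)
        return "completed"
    finally:
        rollback()
```

Placing rollback() in the finally clause means the fault is removed on every exit path, so an aborted or crashed experiment does not leave the system degraded.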

Example Progression

Experiment 1: "Kill 1 instance in a 10-instance cluster (10% blast)"
Experiment 2: "Kill 3 instances in a 10-instance cluster (30% blast)"
Experiment 3: "Kill 1 instance in each of 3 availability zones (30% blast)"
Experiment 4: "Kill all instances in one availability zone (100% of that zone)"
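
The percentages in this progression follow from simple arithmetic, sketched below. The zone layout is an assumption made for illustration, since the example does not say how the 10 instances are spread across availability zones; blast_radius is a hypothetical helper.

```python
def blast_radius(killed: int, total: int) -> float:
    """Percentage of instances that an experiment takes down."""
    if total <= 0 or not 0 <= killed <= total:
        raise ValueError("need total > 0 and 0 <= killed <= total")
    return 100 * killed / total


# Assumed layout of the 10-instance cluster across three availability zones.
zones = {"az-a": 4, "az-b": 3, "az-c": 3}
cluster_size = sum(zones.values())

progression = [
    ("1 instance", 1),
    ("3 instances", 3),
    ("1 instance per zone", len(zones)),
    ("all of az-a", zones["az-a"]),
]
radii = [(label, blast_radius(killed, cluster_size))
         for label, killed in progression]
```

Under this assumed layout the last experiment removes 100% of one zone but 40% of the cluster, so it is worth stating which denominator a blast figure refers to.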

Anti-Patterns to Avoid

❌ Running Chaos Without a Hypothesis

Why it fails: You can't distinguish between expected and unexpected failures

❌ Ignoring Blast Radius

Why it fails: Chaos experiments can cause the very outages you're trying to prevent

❌ Not Automating

Why it fails: One-off experiments don't catch regressions or provide continuous learning

❌ Testing Only Infrastructure

Why it fails: Application-level failures go untested, and application retries or caching can mask infrastructure problems until they fail together

❌ Not Communicating Results

Why it fails: Findings stay with the team that ran the experiment, so other teams cannot act on them

Key Takeaways

Hypothesis First: Know what you're testing before you inject chaos
Reality-Based: Test scenarios that actually occur in production
Production Ready: The best tests run in production with real traffic
Automated: Manual testing is a start; automation is the goal
Controlled: Minimize blast radius and have rollback plans