GuideDevOps
Lesson 3 of 14

Principles of Chaos

Part of the Chaos Engineering tutorial series.

The Five Principles of Chaos Engineering

The Chaos Engineering community has established five core principles that should guide all chaos experiments. These principles, first articulated by engineers at Netflix and published as the Principles of Chaos Engineering (principlesofchaos.org), help keep experiments safe, meaningful, and productive.

1. Build a Hypothesis Around Steady-State Behavior

Principle: Focus on the measurable output of a system rather than internal attributes. A system's steady-state behavior defines its normal operating conditions.

What This Means

Before introducing any chaos, establish what "normal" looks like:

  • Response time is consistently under 200ms
  • Error rate stays below 0.1%
  • CPU usage stays between 30% and 60% (a supporting signal; user-facing outputs come first)
  • User requests succeed at the expected rate

Example

Hypothesis: "Our payment service maintains at least a 99.95% success rate under normal conditions, and will keep doing so during the experiment"
Measurable Output: "Payment transaction success rate"
Normal Baseline: "At least 99.95% over a 5-minute rolling window"
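
The baseline above can be turned into an automated check. The following is a minimal sketch, not a production monitor: the class name RollingSuccessRate and the function steady_state_holds are hypothetical, timestamps are plain seconds, and it assumes that an empty window counts as healthy and that a rate of exactly 99.95% passes.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RollingSuccessRate:
    """Tracks transaction outcomes inside a sliding time window (hypothetical helper)."""

    window_seconds: float = 300.0  # the 5-minute window from the example
    _events: deque = field(default_factory=deque)  # (timestamp, succeeded) pairs

    def record(self, timestamp: float, succeeded: bool) -> None:
        """Store one transaction outcome and drop anything that aged out."""
        self._events.append((timestamp, succeeded))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Remove outcomes that fall outside the rolling window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()

    def rate(self, now: float) -> float:
        """Return the success rate over the window ending at `now`."""
        self._evict(now)
        if not self._events:
            return 1.0  # assumption: no traffic is treated as healthy
        succeeded = sum(1 for _, ok in self._events if ok)
        return succeeded / len(self._events)


def steady_state_holds(tracker: RollingSuccessRate, now: float,
                       threshold: float = 0.9995) -> bool:
    """True while the measured rate is at or above the baseline."""
    return tracker.rate(now) >= threshold
```

Running the same check before, during, and after fault injection is what lets you say whether the hypothesis was confirmed or refuted.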

2. Vary Real-World Events

Principle: Chaos experiments should reflect the types of failures that occur in real production environments. These are not artificial or unrealistic scenarios—they're events that actually happen.

Real-World Failures to Consider

  • Infrastructure Failures: Server crashes, disk full, network partition
  • Application Issues: Memory leaks, resource exhaustion, thread pool depletion
  • Dependency Failures: Database unavailability, API timeouts, external service degradation
  • Configuration Problems: Incorrect settings, version mismatches, certificate expiration

Example

Instead of: "Let's create a completely random failure"
Do this: "Let's simulate a network partition that isolates our cache layer"
         "This happens when rack switches fail—a real operational risk"
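
To make the cache-partition scenario concrete, here is a small in-process simulation. It is only a sketch under stated assumptions: PartitionableCache, CacheUnreachable, and read_through are hypothetical names, the "database" is a plain dict, and a real experiment would cut traffic at the network layer rather than flip a flag.

```python
class CacheUnreachable(ConnectionError):
    """Raised by the simulated cache while it is partitioned."""


class PartitionableCache:
    """In-memory stand-in for a cache layer that can be 'isolated' on demand."""

    def __init__(self) -> None:
        self._data = {}
        self.partitioned = False  # flip to True to simulate the partition

    def get(self, key):
        if self.partitioned:
            raise CacheUnreachable("cache layer is isolated")
        return self._data.get(key)

    def set(self, key, value) -> None:
        if self.partitioned:
            raise CacheUnreachable("cache layer is isolated")
        self._data[key] = value


def read_through(key, cache: PartitionableCache, database: dict):
    """Read via the cache, falling back to the database if the cache is unreachable.

    Returns (value, source) so the experiment can observe which path served the read.
    """
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached, "cache"
        value = database[key]
        cache.set(key, value)  # populate the cache on a miss
        return value, "database"
    except ConnectionError:
        # Degraded mode: the cache is gone, so serve straight from the database.
        return database[key], "database (cache unreachable)"
```

The experiment then becomes: set partitioned to True, keep issuing reads, and confirm that the service still answers, only more slowly.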

3. Run Experiments in Production

Principle: While not mandatory, the best learning comes from testing in production with real traffic, real data volumes, and real system interactions.

Why Production?

  • Real Complexity: Production has interactions test environments cannot replicate
  • Real Traffic: User behavior patterns differ from synthetic load
  • Real Data: Volume and characteristics match actual usage
  • Confidence: Passing production chaos tests builds team confidence

Safety Considerations

  • Start with non-critical systems
  • Use gradual rollout (1% → 5% → 10% of traffic)
  • Have rollback procedures ready
  • Keep the team on standby for every run; start in low-traffic windows, then move to business hours once results are stable
  • Inform stakeholders beforehand

Example Progression

Week 1: Test in staging environment
Week 2: Test in production, non-business hours, 1% of traffic
Week 3: Test in production, non-business hours, 5% of traffic
Week 4: Test in production, business hours, 5% of traffic
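
One common way to expose only a fixed share of traffic to an experiment is to hash a stable request or user identifier into a bucket. The sketch below assumes a hypothetical in_experiment function and string identifiers; it is not tied to any particular chaos tool.

```python
import hashlib


def in_experiment(request_id: str, percent: float) -> bool:
    """Decide whether a request falls inside the experiment's traffic share.

    The identifier is hashed into one of 10,000 buckets, so the same request_id
    always gets the same answer, and raising `percent` only ever adds requests.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # 1% -> buckets 0..99, 5% -> buckets 0..499
```

Because the buckets are nested, every request affected at 1% is also affected at 5%, which keeps week-to-week results comparable as the share grows.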

4. Automate Experiments and Keep Them Running

Principle: Manual experiments are one-time learning events. Automated, continuous chaos testing is how you build system resilience over time.

Benefits of Automation

  • Continuous Learning: Repeated runs show how resilience changes as the system evolves
  • Behavioral Verification: The system's actual response is your test assertion
  • Scalability: Test hundreds of scenarios without manual effort
  • Regression Prevention: Catch when fixes break under chaos conditions

Automation Tiers

Level 1: Manual experiments on demand
Level 2: Scheduled daily/weekly experiments
Level 3: Experiments triggered by deployments
Level 4: Continuous chaos in production

Example

Cron job: "Every Monday at 2am, inject 500ms latency for 5 minutes"
Deployment hook: "After every production deployment, run chaos tests"
Continuous: "Randomly fail 5% of requests during high-traffic windows"
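
The latency and request-failure injections above could be expressed as a small decorator that a scheduler or deployment hook switches on. This is an illustrative sketch: the chaos decorator and InjectedFault exception are hypothetical names, and the rng and sleep parameters exist only so the behavior can be tested deterministically.

```python
import functools
import random
import time


class InjectedFault(RuntimeError):
    """Marks a failure that was injected on purpose, not caused by the system."""


def chaos(failure_rate: float = 0.0, latency_seconds: float = 0.0,
          rng: random.Random = None, sleep=time.sleep):
    """Wrap a function so that calls are delayed and/or randomly failed."""
    if rng is None:
        rng = random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                sleep(latency_seconds)  # e.g. 0.5 for the 500ms example
            if rng.random() < failure_rate:  # e.g. 0.05 for 5% of requests
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper

    return decorator
```

A usage example would be decorating a request handler with chaos(failure_rate=0.05) in a chaos-enabled build. Raising a dedicated exception type makes injected failures easy to separate from genuine ones in logs and dashboards.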

5. Minimize Blast Radius

Principle: Design experiments to limit their scope and impact. Start small, verify safety, then expand.

Blast Radius Strategies

  1. Geographic Isolation: Test in one region first
  2. Percentage-Based: Start with 1% of traffic/resources
  3. Service Isolation: Test non-critical services before core services
  4. Time Windows: Run during off-peak hours initially
  5. Automatic Rollback: Have circuit breakers that stop the experiment if damage is detected
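
The automatic rollback strategy can be sketched as a guard loop that watches a health metric and stops the experiment as soon as it crosses a limit. The names run_guarded_experiment and ExperimentResult are hypothetical, the three callables stand in for whatever your tooling provides, and the default limit of 0.1% simply reuses the error-rate baseline from earlier in this lesson.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    completed: bool          # False if the guard aborted the run
    checks_run: int          # how many health checks were performed
    worst_error_rate: float  # highest error rate seen during the run


def run_guarded_experiment(inject: Callable[[], None],
                           rollback: Callable[[], None],
                           observe_error_rate: Callable[[], float],
                           max_error_rate: float = 0.001,
                           checks: int = 10) -> ExperimentResult:
    """Inject a fault, poll the error rate, and abort early if it gets too high.

    Rollback always runs, whether the experiment completes, aborts, or raises.
    """
    worst = 0.0
    try:
        inject()
        for check in range(1, checks + 1):
            rate = observe_error_rate()
            worst = max(worst, rate)
            if rate > max_error_rate:
                # Damage detected: stop immediately instead of finishing the run.
                return ExperimentResult(False, check, worst)
        return ExperimentResult(True, checks, worst)
    finally:
        rollback()
```

A real guard would usually wait between checks; the pause is left out here to keep the sketch short.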

Example Progression

Experiment 1: "Kill 1 instance in a 10-instance cluster (10% blast)"
Experiment 2: "Kill 3 instances in a 10-instance cluster (30% blast)"
Experiment 3: "Kill 1 instance in each of 3 availability zones (30% blast)"
Experiment 4: "Kill all instances in one availability zone (100% of that zone)"
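
The percentages in this progression can be computed rather than estimated. The sketch below assumes a hypothetical 10-instance cluster spread 4/3/3 across three zones (the lesson does not specify the layout) and reports the blast radius both cluster-wide and per zone; blast_radius is an illustrative name.

```python
from collections import Counter


def blast_radius(cluster: dict, killed: set) -> dict:
    """Return the share of instances affected, cluster-wide and per zone.

    `cluster` maps instance id -> availability zone; `killed` is a set of ids.
    """
    unknown = killed - set(cluster)
    if unknown:
        raise ValueError(f"unknown instances: {sorted(unknown)}")
    zone_totals = Counter(cluster.values())
    zone_killed = Counter(cluster[instance] for instance in killed)
    return {
        "cluster_pct": 100 * len(killed) / len(cluster),
        "zone_pct": {zone: 100 * zone_killed[zone] / total
                     for zone, total in zone_totals.items()},
    }


# Hypothetical layout: 10 instances, 4 in zone a, 3 in zone b, 3 in zone c.
CLUSTER = {f"i-{n}": zone for n, zone in enumerate("aaaabbbccc")}
```

With this layout, killing every instance in zone b is 100% of that zone but 30% of the cluster, which is why Experiment 4 is a larger step than its cluster-wide percentage alone suggests.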

Anti-Patterns to Avoid

❌ Running Chaos Without a Hypothesis

Why it fails: You can't distinguish between expected and unexpected failures

❌ Ignoring Blast Radius

Why it fails: Chaos experiments can cause the very outages you're trying to prevent

❌ Not Automating

Why it fails: One-off experiments don't catch regressions or provide continuous learning

❌ Testing Only Infrastructure

Why it fails: Application-level behavior such as retries and caching can mask infrastructure problems, and application failures (memory leaks, thread pool depletion) never get exercised

❌ Not Communicating Results

Why it fails: Lessons stay with the team that ran the experiment, so other teams cannot act on them

Key Takeaways

  1. Hypothesis First: Know what you're testing before you inject chaos
  2. Reality-Based: Test scenarios that actually occur in production
  3. Production Ready: The best tests run in production with real traffic
  4. Automated: Manual testing is a start; automation is the goal
  5. Controlled: Minimize blast radius and have rollback plans