Designing Chaos Experiments - Chaos Engineering

Overview

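A chaos experiment is a structured test that injects a controlled failure into a system to verify that the system stays resilient. Do not break things at random; follow the scientific method: state a prediction, limit the scope, run the test, and compare the outcome with the prediction.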

The Experiment Lifecycle

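Hypothesis: State a prediction that can be proven wrong, for example: "If I kill the primary database pod, the service will automatically fail over without users seeing an error."
Blast Radius: Define which parts of the system the experiment is allowed to affect, and keep that scope as small as the hypothesis permits.
Execution: Run the experiment in a controlled environment (staging, or production with monitoring in place).
Analysis: Compare the observed behavior with the hypothesis. Did the system behave as predicted?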
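
The four steps above can be captured in a small data structure so that every experiment states its hypothesis and scope before anything is broken. The following is a minimal sketch, not the API of any chaos tool: the `ChaosExperiment` class, its field names, and the `inject`, `rollback`, and `verify` callables are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChaosExperiment:
    """One experiment, mirroring the lifecycle steps (hypothetical structure)."""
    hypothesis: str                # the prediction to test
    blast_radius: List[str]        # components the experiment may affect
    environment: str               # e.g. "staging" or "production"
    inject: Callable[[], None]     # starts the fault
    rollback: Callable[[], None]   # stops the fault
    verify: Callable[[], bool]     # returns True if the hypothesis held

    def run(self) -> bool:
        """Inject the fault, check the hypothesis, and always roll back."""
        self.inject()
        try:
            return self.verify()
        finally:
            self.rollback()


# Usage with stand-in callables; a real experiment would call a chaos tool here.
events: List[str] = []
experiment = ChaosExperiment(
    hypothesis="Killing the primary database pod triggers failover "
               "without users seeing an error",
    blast_radius=["primary-db-pod"],
    environment="staging",
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    verify=lambda: True,
)
verified = experiment.run()
```

Placing the rollback in a `finally` clause means the fault is removed even if the verification step raises, which keeps a failed check from widening the blast radius.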

Example: Network Latency Injection

Inject 500 ms of latency into every request sent to the "auth-service".

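# Illustrative command only: a tool such as Gremlin or Litmus can inject latency,
# but the exact subcommands and flags differ by tool and version, so check its documentation.
gremlin attack run latency --percentage 100 --delay 500 --target auth-service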
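
To make the two parameters concrete, the sketch below applies the same idea inside a single process: a wrapper delays a given percentage of calls by a fixed number of milliseconds. This is an assumption-laden illustration, not how Gremlin or Litmus work (those tools act on the network or the container rather than on application code), and `with_latency` and its parameters are hypothetical names.

```python
import random
import time
from typing import Any, Callable


def with_latency(
    call: Callable[..., Any],
    delay_ms: float,
    percentage: float,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Callable[..., Any]:
    """Wrap `call` so that `percentage` percent of invocations wait `delay_ms` first."""
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        # rng() is in [0, 1), so percentage=100 delays every call and 0 delays none.
        if rng() * 100 < percentage:
            sleep(delay_ms / 1000)
        return call(*args, **kwargs)
    return wrapped


# Usage: record the requested delays instead of really sleeping.
recorded_delays = []
slow_auth = with_latency(lambda: "200 OK", delay_ms=500, percentage=100,
                         sleep=recorded_delays.append)
response = slow_auth()
```

Passing `sleep` and `rng` as arguments keeps the wrapper deterministic under test; in normal use the defaults apply a real delay.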

Expected Result: The system should remain stable. Monitoring alerts should report the increased latency, and no requests should fail.

Experiment Status: COMPLETED
Hypothesis Verified: True
Auth service latency increased, but 200 OK rate remained at 100%.
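
The analysis step for this example can be sketched as a comparison of request samples taken before and during the fault. The code below is a minimal sketch under stated assumptions: the `Sample` type, the `analyze` function, and the choice of median latency as the comparison metric are hypothetical, and the numbers are made up for illustration rather than taken from a real run.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Sample:
    status: int        # HTTP status code of one request
    latency_ms: float  # observed response time in milliseconds


def success_rate(samples: List[Sample]) -> float:
    """Fraction of requests that returned 200 OK."""
    return sum(1 for s in samples if s.status == 200) / len(samples)


def analyze(baseline: List[Sample], during: List[Sample]) -> Dict[str, bool]:
    """Compare requests before and during the fault against the hypothesis."""
    latency_increased = (
        median(s.latency_ms for s in during)
        > median(s.latency_ms for s in baseline)
    )
    no_failures = success_rate(during) == 1.0
    return {
        # If latency did not rise, the fault never took effect.
        "fault_observed": latency_increased,
        "hypothesis_verified": latency_increased and no_failures,
    }


# Made-up numbers for illustration only, not measurements from a real run.
baseline = [Sample(200, 40.0), Sample(200, 55.0), Sample(200, 45.0)]
during = [Sample(200, 540.0), Sample(200, 560.0), Sample(200, 548.0)]
report = analyze(baseline, during)
```

Checking that latency actually rose guards against a false pass: if the injection silently failed, a 100% success rate says nothing about resilience.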