Overview
A chaos experiment is a structured test that introduces controlled failure to a system to verify its resilience. Do not just break things randomly; follow the scientific method.
The Experiment Lifecycle
- Hypothesis: "If I kill the primary database pod, the service will automatically fail over without users seeing an error."
- Blast Radius: Define which parts of the system the experiment is allowed to affect, and keep that scope as small as possible.
- Execution: Run the experiment in a controlled environment (staging, or production with monitoring in place).
- Analysis: Verify the hypothesis. Did the system survive as expected?
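The lifecycle above can be sketched as a small data structure. This is only an illustration: `ChaosExperiment` and all of its fields are hypothetical names, not part of Gremlin, Litmus, or any other tool, and the inject, rollback, and probe steps are stubbed with lambdas where a real experiment would call the chaos tool and the monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ChaosExperiment:
    """Hypothetical container for one experiment (not a real tool's API)."""
    hypothesis: str                               # what we expect to stay true
    blast_radius: List[str]                       # components allowed to be affected
    inject: Callable[[], None]                    # starts the failure
    rollback: Callable[[], None]                  # removes the failure
    probe: Callable[[], Dict[str, float]]         # collects metrics during the failure
    verify: Callable[[Dict[str, float]], bool]    # compares metrics to the hypothesis

    def run(self) -> bool:
        try:
            self.inject()
            metrics = self.probe()
        finally:
            # Roll back even if injection or probing raised an exception.
            self.rollback()
        return self.verify(metrics)


# Usage with stubbed steps; the metric value is made up for illustration.
events: List[str] = []
experiment = ChaosExperiment(
    hypothesis="Killing the primary database pod causes no user-visible errors",
    blast_radius=["primary-db-pod"],
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    probe=lambda: {"error_rate": 0.0},
    verify=lambda m: m["error_rate"] == 0.0,
)
verified = experiment.run()
print("Hypothesis Verified:", verified)
```

Placing the rollback in a `finally` block is one way to keep the blast radius bounded in time: the failure is removed even when the analysis step itself crashes.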
Example: Network Latency Injection
Inject 500 ms of latency into every request to the "auth-service".
# Illustrative command; the exact syntax depends on the tool (e.g. Gremlin or Litmus)
gremlin attack run latency --percentage 100 --delay 500 --target auth-service
Expected Result: The system should remain stable, and monitoring alerts should report increased latency without any failed requests.
Experiment Status: COMPLETED
Hypothesis Verified: True
Auth service latency increased, but 200 OK rate remained at 100%.
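The analysis step for this example could be automated along the following lines. This is a minimal sketch under stated assumptions: `Sample` and `evaluate` are hypothetical names, the request samples are invented for illustration, and a real run would pull them from the monitoring system. The check requires both a 100% rate of 200 OK responses and a measurable latency increase, so that an injection that silently failed to apply is not mistaken for a passing experiment.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Sample:
    """One observed request (hypothetical record shape)."""
    status: int         # HTTP status code
    latency_ms: float   # end-to-end latency in milliseconds


def evaluate(baseline: List[Sample], during: List[Sample],
             min_ok_rate: float = 1.0) -> Dict[str, object]:
    """Compare requests made during the experiment against a baseline."""
    ok_rate = sum(1 for s in during if s.status == 200) / len(during)
    added_ms = (median(s.latency_ms for s in during)
                - median(s.latency_ms for s in baseline))
    return {
        "ok_rate": ok_rate,
        "added_median_latency_ms": added_ms,
        # Latency must have risen (the fault took effect) and no request failed.
        "hypothesis_verified": ok_rate >= min_ok_rate and added_ms > 0,
    }


# Made-up samples: three baseline requests, three during the latency injection.
baseline = [Sample(200, 40.0), Sample(200, 50.0), Sample(200, 60.0)]
during = [Sample(200, 540.0), Sample(200, 550.0), Sample(200, 560.0)]
result = evaluate(baseline, during)
print(result)
```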