## Why Measurement Matters
Without measurement, you cannot:
- Prove chaos engineering provides value
- Justify continued investment and team time
- Make informed decisions about scaling
- Identify which experiments are most valuable
- Track improvement over time
## Key Metrics to Track
### Tier 1: Reliability Metrics (Core)
#### 1. System Availability
Definition: Percentage of time the system is operational
Formula: `Uptime / Total Time * 100`
Example:
Month 1 (before chaos): 99.0% = ~7.2 hours downtime
Month 6 (after chaos): 99.9% = ~43 minutes downtime
Month 12 (after chaos): 99.95% = ~22 minutes downtime
Target: Should increase by 0.5-1 percentage points within 12 months
Implementation:
```python
# Calculate availability from a log of downtime incidents
monitored_period = 30 * 24 * 60  # 30 days in minutes

downtime_incidents = [
    {'start': 14532, 'duration': 45},  # start minute and duration in minutes
    {'start': 18921, 'duration': 12},
    {'start': 22145, 'duration': 8},
]

total_downtime = sum(i['duration'] for i in downtime_incidents)
availability = ((monitored_period - total_downtime) / monitored_period) * 100
print(f"Availability: {availability:.3f}%")
```

#### 2. Mean Time Between Failures (MTBF)
Definition: Average time between system failures
Formula: `Total Uptime / Number of Failures`
Example:
Before: 700 hours / 10 failures = 70 hours MTBF
After: 6900 hours / 5 failures = 1380 hours MTBF
Improvement: 20x longer between failures
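The MTBF arithmetic above can be reproduced in a few lines (a sketch using the example figures):

```python
# MTBF = total uptime / number of failures (figures from the example above)
mtbf_before = 700 / 10    # 70 hours between failures
mtbf_after = 6900 / 5     # 1380 hours between failures
print(f"MTBF improved {mtbf_after / mtbf_before:.0f}x")  # prints "MTBF improved 20x"
```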
Tracking:
```
Failure Log:
- Time: 2024-01-05 14:32:00
  Duration: 45 minutes
  Cause: Database connection pool exhausted
- Time: 2024-01-07 09:15:00
  Duration: 12 minutes
  Cause: Cache service unavailable
- Time: 2024-01-12 18:45:00
  Duration: 8 minutes
  Cause: Network latency spike
```

#### 3. Mean Time To Recovery (MTTR)
Definition: Average time from failure detection to full recovery
Before:
- Detection: 10 min (alert ignored)
- Diagnosis: 15 min (root cause unclear)
- Response: 10 min (unclear who does what)
- Fix: 20 min (apply fix)
- Total: 55 minutes average

After:
- Detection: 2 min (team knows the pattern)
- Diagnosis: 3 min (practiced during chaos drills)
- Response: automatic (failover works)
- Fix: 5 min (permanent fix)
- Total: 10 minutes average
Target: Reduce MTTR by 50-80%
Calculation:
```python
import statistics

# Incident durations (detection to full recovery), in minutes
incidents = [
    {'name': 'Database failure 1', 'minutes': 55},
    {'name': 'Cache timeout', 'minutes': 43},
    {'name': 'Network partition', 'minutes': 38},
    {'name': 'Memory leak', 'minutes': 67},
    {'name': 'Disk full', 'minutes': 51},
]
mttr = statistics.mean([i['minutes'] for i in incidents])
print(f"MTTR: {mttr:.0f} minutes")

# Track the trend against the baseline
mttr_before = 55
mttr_after = 10
improvement = ((mttr_before - mttr_after) / mttr_before) * 100
print(f"MTTR improved by {improvement:.0f}%")
```

### Tier 2: Operational Metrics
#### 1. Issues Found by Chaos vs Production
Metric: "Chaos-Found" vs "Production-Found" issues
Ideal: nearly everything is found by chaos experiments, almost nothing in production
Example:
Month 1: 5 issues found by chaos, 2 by production = 71% found by chaos
Month 6: 20 issues found by chaos, 1 by production = 95% found by chaos
Month 12: 40 issues found by chaos, 0 by production = 100% found by chaos
Target: >95% of issues caught before reaching customers
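The monthly percentages above follow directly from the counts; a quick sketch with the Month 6 numbers:

```python
# Share of issues caught by chaos experiments rather than in production
chaos_found = 20
production_found = 1
pct = chaos_found / (chaos_found + production_found) * 100
print(f"{pct:.0f}% found by chaos")  # prints "95% found by chaos"
```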
Tracking:
```
Issues Matrix:
Chaos-Found (Good):
- Circuit breaker threshold too aggressive (fixed)
- Retry logic causes thundering herd (fixed)
- Timeout too short for database (fixed)
- PVC attachment fails silently (fixed)
Production-Found (Bad):
- DNS cache stale after zone change (1 incident, 30 min)
```

#### 2. Chaos Tests Created
Metric: Volume of experiments and coverage
Example:
Sprint 1: 2 experiments (basic)
Sprint 2: 5 experiments (moderate coverage)
Sprint 3: 12 experiments (good coverage)
Sprint 4: 20 experiments (comprehensive)
Target: 1 experiment per critical system + 5+ combinations
#### 3. Remediation Time
Metric: Time from discovering an issue via chaos to a permanent fix
Definition: Issue found → Root cause identified → Fix deployed
- Good: < 1 week
- Excellent: < 3 days
- Outstanding: Same day
Example:
Issue: "Circuit breaker threshold too low"
Found: Wednesday morning
Fixed: Wednesday afternoon
Remediation Time: 4 hours
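One way to make these bands actionable is a small grading helper (hypothetical; only the thresholds come from the targets above):

```python
def grade_remediation(hours: float) -> str:
    """Grade time-to-fix against the remediation targets above (hypothetical helper)."""
    if hours <= 24:
        return "Outstanding (same day)"
    if hours <= 72:
        return "Excellent (< 3 days)"
    if hours <= 168:
        return "Good (< 1 week)"
    return "Needs attention"

print(grade_remediation(4))  # prints "Outstanding (same day)"
```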
### Tier 3: Business Metrics
#### 1. Revenue Impact of Downtime
Formula: `Downtime Hours × Revenue Per Hour`
Example (E-commerce):
Before: 5 outages/year × 1 hour avg = 5 hours downtime
5 hours × $100k/hour = $500k lost revenue
After: 1 outage/year × 0.3 hours avg = 0.3 hours downtime
0.3 hours × $100k/hour = $30k lost revenue
Savings: $470k/year
Example (SaaS):
Customer churn due to outages:
Before: 2% additional churn after major outage × $1M MRR = $20k lost MRR
After: Only 0.2% additional churn = $2k lost MRR
Savings: $18k MRR or $216k/year
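The churn arithmetic above, spelled out (numbers from the SaaS example):

```python
# Churn-driven MRR loss before and after, annualized
mrr = 1_000_000
lost_before = mrr * 0.02    # $20k MRR at 2% additional churn
lost_after = mrr * 0.002    # $2k MRR at 0.2%
annual_savings = (lost_before - lost_after) * 12
print(f"Savings: ${annual_savings:,.0f}/year")  # prints "Savings: $216,000/year"
```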
Calculation Tool:
```python
class DowntimeROI:
    def __init__(self, revenue_per_hour, base_annual_outages, base_outage_duration):
        self.revenue_per_hour = revenue_per_hour
        self.base_annual_outages = base_annual_outages
        self.base_outage_duration = base_outage_duration

    def calculate_before(self):
        """Annual downtime cost at the baseline outage rate."""
        total_hours = self.base_annual_outages * self.base_outage_duration
        return total_hours * self.revenue_per_hour

    def calculate_after(self, improved_outages, improved_duration):
        """Annual downtime cost after reliability improvements."""
        total_hours = improved_outages * improved_duration
        return total_hours * self.revenue_per_hour

    def roi_savings(self, before_cost, after_cost):
        return before_cost - after_cost

# Example usage
roi = DowntimeROI(
    revenue_per_hour=100_000,
    base_annual_outages=5,
    base_outage_duration=1.0,  # hours
)
before = roi.calculate_before()  # $500,000
after = roi.calculate_after(improved_outages=1, improved_duration=0.3)  # $30,000
savings = roi.roi_savings(before, after)  # $470,000
print(f"Annual downtime cost before: ${before:,.0f}")
print(f"Annual downtime cost after: ${after:,.0f}")
print(f"Annual savings: ${savings:,.0f}")
```

#### 2. Customer Satisfaction
Metric: Impact on NPS, CSAT, or customer sentiment
Before:
- NPS: 35 (acceptable)
- CSAT: 72%
- Complaints about "frequent downtime": 15% of feedback
After (12 months):
- NPS: 52 (good improvement)
- CSAT: 87%
- Complaints about "frequent downtime": 2% of feedback
Improvement: +17 NPS points, +15 percentage points CSAT
#### 3. Engineer Productivity
Metric: How much time engineers spend firefighting vs building
Before:
- Firefighting/on-call: 40% of time
- Feature development: 35% of time
- Technical debt/improvement: 25% of time
After:
- Firefighting/on-call: 10% of time
- Feature development: 55% of time
- Technical debt/improvement: 35% of time
Productivity gain: 20 percentage points × engineering team cost
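To put a dollar figure on that shift, multiply the reclaimed share by the fully loaded team cost. A sketch; the team size and per-engineer cost here are assumptions, not figures from the text:

```python
# Value of engineering time shifted from firefighting to feature work
team_size = 10                # assumption
cost_per_engineer = 180_000   # assumption: fully loaded annual cost
gain = 0.20                   # 20 percentage points, from the before/after split
value = team_size * cost_per_engineer * gain
print(f"${value:,.0f}/year of engineering time redirected")  # prints "$360,000/year ..."
```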
## Calculating ROI
### Investment Required
Year 1 Costs:
- Tool (Gremlin): $60,000/year
- Dedicated engineering time: $120,000/year (0.5 FTE)
- Training and resources: $20,000
- Infrastructure for testing: $10,000
- Total Year 1 Investment: $210,000
### Expected Return (Conservative)
Cost of a Major Outage:
- Direct downtime: $500,000 (5 hours @ $100k/hr)
- Infrastructure recovery: $50,000
- Customer support overtime: $10,000
- Potential churn: $200,000
- Total per major outage: $760,000

With Chaos Engineering:
- Prevent 4 of 5 annual major outages
- Reduce the remaining outage's severity by 70% (avoiding 0.7 × its cost)

Return = ($760,000 × 4) + ($760,000 × 0.7)
Return = $3,040,000 + $532,000
Return = $3,572,000/year

### ROI Calculation
ROI = (Total Return - Investment) / Investment × 100
ROI = ($3,572,000 - $210,000) / $210,000 × 100
ROI = $3,362,000 / $210,000 × 100
ROI ≈ 1,601%

Payback Period:
$210,000 / ($3,572,000 / 365) ≈ 21 days
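The same arithmetic in code (assuming the 70% severity reduction counts as 0.7 × the outage cost avoided):

```python
# ROI and payback period from the conservative estimates above
investment = 210_000
outage_cost = 760_000
annual_return = outage_cost * 4 + outage_cost * 0.7  # prevented outages + severity reduction
roi_pct = (annual_return - investment) / investment * 100
payback_days = investment / (annual_return / 365)
print(f"ROI: {roi_pct:.0f}%")               # prints "ROI: 1601%"
print(f"Payback: {payback_days:.0f} days")  # prints "Payback: 21 days"
```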
## Creating a Metrics Dashboard
### Prometheus Queries
```promql
# Availability (0-1, multiply by 100 for percentage);
# parentheses matter: subtract errors before dividing
(rate(requests_total[1h]) - rate(requests_errors_total[1h])) / rate(requests_total[1h])

# p95 recovery time (a proxy for MTTR trends)
histogram_quantile(0.95, rate(recovery_time_seconds_bucket[1d]))

# Incident frequency
increase(incidents_total[24h])

# Issues found by chaos vs production
increase(issues_found_by_chaos_total[7d])
# vs
increase(issues_found_by_production_total[7d])

# P99 latency trend over time
histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))
```

### Sample Dashboard
```
┌──────────────────────────────────────────────┐
│ Chaos Engineering Impact Dashboard           │
├──────────────────────────────────────────────┤
│ Availability       99.95%   ↑ from 99.0%     │
│ MTTR               10 min   ↓ from 55 min    │
│ Annual Downtime    22 min   ↓ from 7.2 hours │
│ Issues Prevented   95 this year              │
├──────────────────────────────────────────────┤
│ Incidents/Month (Jan-Oct): ~8/month before   │
│ the chaos program started, trending toward   │
│ zero afterward                               │
└──────────────────────────────────────────────┘
```
## Executive Reporting
### Monthly Report Template
```
CHAOS ENGINEERING - MONTHLY IMPACT REPORT
Period: [Month/Year]

RELIABILITY METRICS
✓ System Availability: 99.95% (Target: 99.9%)
✓ Average MTTR: 8 minutes (Target: <15 min)
✓ Incidents This Month: 1 (Target: <2)
✓ No incidents caused by issues chaos testing had already surfaced

EXPERIMENT ACTIVITY
✓ Experiments Run: 15
  - 3 discovered issues (all fixed before reaching customers)
  - 0 caused production impact
  - Average blast radius: 2% of traffic
✓ Issues Found: 3
  - Timeout configuration too aggressive
  - Retry logic needed jitter
  - PVC attachment timeout too short

BUSINESS IMPACT
✓ Downtime Cost This Month: $8,000 (1 minor incident)
✓ Cost Without Chaos Engineering: $450,000 (estimated)
✓ Cost Avoided: $442,000
✓ Investment This Month: $17,500
✓ ROI This Month: 25x

CUSTOMER SATISFACTION
✓ Complaints about "downtime": Down 50% YoY
✓ NPS Score: +52 (up from +35 at start)
✓ CSAT: 87% (up from 72%)

NEXT MONTH FOCUS
- Implement graceful degradation for cache layer
- Test multi-region failover
- Automate chaos tests in CI/CD pipeline
```
## Key Takeaways
1. **Measure Multiple Dimensions**: Reliability + Operations + Business
2. **Track Trends**: Month-over-month and year-over-year changes matter most
3. **Communicate Value**: Translate metrics to business language ($$)
4. **Conservative Estimates**: Better to under-promise and over-deliver
5. **Continuous Improvement**: Use metrics to guide investment decisions
---