Lesson 13 of 15

SRE Tools & Monitoring

Part of the Site Reliability Engineering tutorial series.

The SRE Observability Stack

The Three Pillars

┌─ Observability ────────────────────┐
│                                    │
├─ Metrics (Quantitative)            │
│  What we measure numerically       │
│  Prometheus, Datadog, etc.         │
│                                    │
├─ Logs (Discrete Events)            │
│  What happened at specific times   │
│  ELK, Datadog, Splunk, etc.        │
│                                    │
├─ Traces (Request Paths)            │
│  How requests flow through system  │
│  Jaeger, Datadog, New Relic, etc.  │
└────────────────────────────────────┘

Metrics: The Foundation

What is a Metric?

A metric is a numerical measurement of system behavior at a point in time.

Examples:
- CPU usage: 65%
- Memory usage: 4.2 GB
- Requests per second: 1,250
- Error rate: 0.5%
- Response latency: 150ms (p99)
- Database connections: 380/500

A time series records how these values change over time.
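Application code can expose metrics like these directly. A minimal sketch using the official `prometheus_client` Python library, covering the three core metric types (the metric and handler names are illustrative):

```python
# Sketch of app-side instrumentation with prometheus_client;
# metric names mirror the examples above.
from prometheus_client import Counter, Gauge, Histogram, generate_latest

# Counter: monotonically increasing (requests served, errors)
REQUESTS = Counter('http_requests', 'Total HTTP requests', ['status'])

# Gauge: a value that can go up or down (open DB connections)
DB_CONNECTIONS = Gauge('db_connections', 'Open database connections')

# Histogram: distribution of observations (request latency)
LATENCY = Histogram('http_duration_seconds', 'Request latency')

def handle_request():
    with LATENCY.time():                  # records duration into buckets
        DB_CONNECTIONS.set(380)
        REQUESTS.labels(status='200').inc()

handle_request()
# Prometheus scrapes this text exposition format from /metrics
print(generate_latest().decode()[:200])
```

In a real service you would call `start_http_server(8000)` once at startup so Prometheus can scrape the `/metrics` endpoint.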

Prometheus: The De Facto Standard

Prometheus is the most widely used metrics collection system:

# prometheus.yml - Configure Prometheus
 
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: pod
    metrics_path: '/metrics'

Example Prometheus Queries

# CPU usage for API service (last 5 minutes)
rate(process_cpu_seconds_total{job="api"}[5m]) * 100
 
# Current memory usage in GB
node_memory_MemAvailable_bytes / 1e9
 
# Error rate (errors per second)
rate(http_requests_total{status=~"5.."}[5m])
 
# Latency - 99th percentile
histogram_quantile(0.99, rate(http_duration_seconds_bucket[5m]))
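These queries can also be run programmatically against Prometheus's HTTP API (`GET /api/v1/query`). A small sketch that builds the request URL; the server address is an assumed placeholder:

```python
# Sketch: running a PromQL query via Prometheus's HTTP API.
# /api/v1/query and its `query` parameter are part of the documented API;
# the server address below is hypothetical.
from urllib.parse import urlencode

PROMETHEUS = 'http://prometheus:9090'  # placeholder server address

def query_url(promql: str) -> str:
    """Build an instant-query URL for the given PromQL expression."""
    return f'{PROMETHEUS}/api/v1/query?' + urlencode({'query': promql})

url = query_url('rate(http_requests_total{status=~"5.."}[5m])')
print(url)
# Fetching this URL returns JSON shaped like:
# {"status": "success", "data": {"resultType": "vector", "result": [...]}}
```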

Graphing Metrics with Grafana

Grafana visualizes Prometheus data:

# Grafana Dashboard JSON (simplified)
 
dashboard:
  title: "API Service Health"
  panels:
    - type: graph
      title: "CPU Usage"
      targets:
        - expr: "rate(process_cpu_seconds_total{job='api'}[5m]) * 100"
      threshold: 80  # Red line at 80%
      
    - type: graph
      title: "Error Rate"
      targets:
        - expr: "rate(http_requests_total{status=~'5..', job='api'}[5m])"
      alertThreshold: 0.01  # Alert at 1%

Alerting: Notification System

Alert Rules

Alerting rules define when to notify teams:

# prometheus/alerts.yml
 
groups:
  - name: api_alerts
    rules:
      - alert: HighCPU
        expr: rate(process_cpu_seconds_total{job="api"}[5m]) * 100 > 80
        for: 5m  # Alert after sustained 5 minutes
        annotations:
          summary: "API CPU usage > 80%"
          
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5..", job="api"}[5m]) > 0.01
        for: 1m
        annotations:
          summary: "API error rate > 1%"
          
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 30s  # Alert after just 30 seconds
        annotations:
          summary: "API service is down"

Alert Routing with Alertmanager

Route alerts to the right people:

# alertmanager.yml
 
route:
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  receiver: 'slack-general'  # Default receiver
  
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
      
    - match:
        severity: high
      receiver: 'slack-oncall'
      continue: true
      
    - match:
        severity: warning
      receiver: 'slack-general'
 
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
  
  - name: 'slack-oncall'
    slack_configs:
      - channel: '#oncall'

Logs: Event Details

ELK Stack (Elasticsearch, Logstash, Kibana)

One of the most widely used stacks for log aggregation:

# Filebeat configuration (log shipper)
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: api
      env: prod
 
output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "api-prod-%{+yyyy.MM.dd}"

Searching Logs with Kibana

# Find all 500 errors in the last hour
status:500 AND @timestamp:[now-1h TO now]

# Find slow requests
response_time_ms > 1000 AND service:api

# Find errors from specific deployment
deployment:v2.3.1 AND level:error
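Fielded queries like these work best when the application emits structured JSON logs rather than free-form text. A minimal sketch using only the standard library; field names match the example queries above, and the values are illustrative:

```python
# Sketch: emitting structured JSON logs so fields like `service`,
# `status`, and `response_time_ms` are searchable in Kibana.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            'level': record.levelname.lower(),
            'message': record.getMessage(),
        }
        # Merge in any structured fields attached to the record
        entry.update(getattr(record, 'fields', {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger('api')
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info('request completed',
         extra={'fields': {'service': 'api', 'status': 500,
                           'response_time_ms': 1200}})
```

Shippers like Filebeat can then forward each line as a pre-parsed document, so no grok/regex parsing is needed in Logstash.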

Tracing: Request Paths

Distributed Tracing with Jaeger

Traces show how a request flows through your system:

User Request
    ↓
[API Gateway] → 5ms
    ↓
[Auth Service] → 10ms
    ↓
[User Service] → 25ms
    ↓
[Database Query] → 150ms
    ↓
Total: 190ms

If a request is slow, the trace shows exactly where the time went (here, the database query dominates)
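The mechanics behind a trace like this can be sketched in plain Python: each unit of work opens a span that records its name, parent, and duration. Real libraries (Jaeger clients, OpenTelemetry) add IDs and context propagation across service boundaries; the service names below reproduce the diagram above:

```python
# Minimal, self-contained sketch of nested spans, as recorded by
# tracing libraries. Durations here are simulated, not real RPCs.
import time
from contextlib import contextmanager

SPANS = []   # collected spans: (name, parent, duration_ms)
_stack = []  # current span ancestry

@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        SPANS.append((name, parent, (time.perf_counter() - start) * 1000))

# Reproduce the request path from the diagram
with span('api-gateway'):
    with span('auth-service'):
        pass
    with span('user-service'):
        with span('database-query'):
            time.sleep(0.01)  # simulated slow query

for name, parent, ms in SPANS:
    print(f'{name} (parent={parent}): {ms:.1f}ms')
```

Because every span knows its parent, a UI like Jaeger's can render the waterfall view and make the 150ms database span stand out immediately.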

Exporting Traces to Jaeger

Spans are typically shipped to Jaeger through the OpenTelemetry Collector (the jaeger exporter below is available in the collector's contrib builds):

# otel-collector configuration (exports spans to Jaeger)
 
receivers:
  otlp:
    protocols:
      grpc:
 
processors:
  batch:
    send_batch_size: 512
    timeout: 5s
 
exporters:
  jaeger:
    endpoint: "jaeger-collector:14250"
    tls:
      insecure: true
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]

Incident Management Platforms

PagerDuty

# PagerDuty integration
- Receives alerts from Prometheus
- Notifies on-call engineer
- Escalates if no response
- Tracks incident details
- Schedules on-call rotations

Example workflow:

1. Alert fires in Prometheus
2. Alertmanager sends to PagerDuty
3. PagerDuty triggers alert to on-call engineer
4. Engineer acknowledges within 5 minutes
5. If no ack, escalates to manager

Other Options

- Opsgenie: Similar to PagerDuty
- VictorOps (now Splunk On-Call): Incident workflow automation
- xMatters: Enterprise incident management
- Custom webhook: Direct integration

Health Checking

Built-in Health Checks

Every service should expose a /health endpoint:

# Python Flask example
 
@app.route('/health')
def health_check():
    checks = {
        'database': check_database_connection(),
        'cache': check_cache_connection(),
        'disk': check_disk_space(),
        'memory': check_memory_usage(),
    }
    
    # Each check returns True/False; any failure marks the service unhealthy
    overall_status = 'healthy' if all(checks.values()) else 'unhealthy'
    
    return {
        'status': overall_status,
        'checks': checks,
        'version': '2.3.1',
        'timestamp': datetime.now().isoformat()
    }

Kubernetes Probes

# Kubernetes health check configuration
 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      containers:
      - name: api
        image: api:v2.3.1
        
        # Startup probe (initialization phase)
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 30
          periodSeconds: 10
        
        # Liveness probe (is it alive?)
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
          
        # Readiness probe (can take traffic?)
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 5

Recommended Stack for Different Scales

Small Team (< 10 engineers)

Metrics: Prometheus (open source)
Dashboards: Grafana (open source)
Logs: Basic app logging to file + grep
Tracing: Optional
Alerting: Simple email/Slack
On-call: Basic rotation, no dedicated tool

Total cost: ~$500/month (mostly cloud infrastructure)

Growing Company (10-100 engineers)

Metrics: Prometheus or Datadog
Dashboards: Grafana or built-in (Datadog)
Logs: ELK Stack or Datadog
Tracing: Jaeger or Datadog
Alerting: Alertmanager + PagerDuty
On-call: PagerDuty or Opsgenie

Total cost: $5-20k/month

Enterprise (100+ engineers)

Metrics: Datadog or Prometheus + Thanos
Dashboards: Grafana or Datadog
Logs: Splunk or Datadog
Tracing: Datadog or proprietary
Alerting: Enterprise-grade system
On-call: VictorOps or xMatters

Total cost: $50k-500k+/month
Plus: Custom integrations and support

Essential SRE Tools Checklist

Monitoring & Metrics:
  ☐ Time-series database (Prometheus, etc.)
  ☐ Dashboarding tool (Grafana, etc.)
  ☐ Infrastructure monitoring
  ☐ Application performance monitoring (APM)
 
Logging:
  ☐ Log aggregation (ELK, Datadog, etc.)
  ☐ Log parsing and search
  ☐ Retention policies
  ☐ Access controls
 
Alerting:
  ☐ Alert rules engine
  ☐ Alert routing/deduplication
  ☐ On-call escalation
  ☐ Notification channels (email, SMS, Slack, etc.)
 
Incident Management:
  ☐ On-call scheduling
  ☐ Incident tracking
  ☐ War room collaboration
  ☐ Post-incident analysis
 
Infrastructure:
  ☐ Infrastructure as Code
  ☐ Deployment automation
  ☐ Configuration management
  ☐ Secrets management
 
Observability:
  ☐ Distributed tracing
  ☐ APM tools
  ☐ Real User Monitoring (RUM)
  ☐ Synthetic monitoring

Open Source vs Managed Services

Open Source Benefits

✅ Full control
✅ No vendor lock-in
✅ Custom modifications
✅ Lower long-term cost
❌ Must maintain
❌ Requires expertise
❌ Higher initial setup

Managed Services Benefits

✅ No maintenance
✅ Included support
✅ Auto-scaling
✅ SLA guarantees
❌ Vendor lock-in
❌ Higher ongoing cost
❌ Less customization

Hybrid Approach (Recommended)

Use open source for:
- Metrics (Prometheus)
- Dashboards (Grafana)
- Tracing (Jaeger)

Use managed for:
- Log aggregation (Datadog)
- PagerDuty integration
- Custom analysis needs

Benefits:
- Open source flexibility where needed
- Managed convenience for non-differentiating services
- Balance cost and capability

Key Takeaways

✓ Observability = Metrics + Logs + Traces
✓ Prometheus is the metrics standard
✓ Grafana for dashboards
✓ ELK or Datadog for logs
✓ Distributed tracing matters for microservices
✓ Alerting must be smart (not noisy)
✓ Health checks enable automation
✓ Start open source, graduate to managed
✓ Invest gradually as you scale
✓ Right tools enable effective SRE