G
GuideDevOps
Lesson 11 of 15

Automated Remediation

Part of the Site Reliability Engineering tutorial series.

Overview

Automated remediation (or "self-healing") moves beyond reactive manual fixes. By detecting failure patterns and executing predefined runbooks, you keep services healthy without manual intervention.

Levels of Automation

  • L1 (Reactive): Automated alerts notify the on-call engineer.
  • L2 (Guided): An automated system provides the engineer with a "repair button."
  • L3 (Proactive): The system automatically detects and fixes the issue without human intervention.

Example: Restarting a Failing Deployment (Kubernetes)

If a pod is stuck in a CrashLoopBackOff due to memory leaks, an automation script can trigger:

# Automated fix: Restart the deployment if error rate > 50%
if [ $(kubectl get pods | grep Error | wc -l) -gt 5 ]; then
  kubectl rollout restart deployment/my-app
fi

Result: The deployment is refreshed, and the pods enter a running state.

deployment.apps/my-app restarted
Status: All pods back to Running