Overview
Automated remediation (or "self-healing") moves beyond reactive manual fixes. By detecting failure patterns and executing predefined runbooks, you keep services healthy without manual intervention.
Levels of Automation
- L1 (Reactive): Automated alerts notify the on-call engineer.
- L2 (Guided): An automated system provides the engineer with a "repair button."
- L3 (Proactive): The system automatically detects and fixes the issue without human intervention.
Example: Restarting a Failing Deployment (Kubernetes)
If a pod is stuck in a CrashLoopBackOff due to memory leaks, an automation script can trigger:
# Automated fix: Restart the deployment if error rate > 50%
if [ $(kubectl get pods | grep Error | wc -l) -gt 5 ]; then
kubectl rollout restart deployment/my-app
fiResult: The deployment is refreshed, and the pods enter a running state.
deployment.apps/my-app restarted
Status: All pods back to Running