In production DevOps, things will go wrong: network timeouts, permission errors, or missing files. Your scripts must handle these gracefully and log what happened for troubleshooting.
1. Exception Handling
Use try...except blocks to catch the errors you expect, so the script reports a clear message instead of stopping with an unhandled traceback.
Basic Try-Except
Action:
try:
    # Attempt to read a configuration file that may be missing
    with open("config.yaml", "r") as f:
        config = f.read()
except FileNotFoundError:
    print("Error: The configuration file was not found.")
except Exception as e:
    # Fallback for anything unexpected, such as a permission error
    print(f"An unexpected error occurred: {e}")
Result:
Error: The configuration file was not found.
Catching Specific API Errors
Action:
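import requests

try:
    # A timeout keeps the call from hanging on an unresponsive server
    response = requests.get("https://api.github.com/invalid-url", timeout=10)
    response.raise_for_status()  # Raise HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as err:
    print(f"HTTP error occurred: {err}")
except requests.exceptions.RequestException as err:
    # Connection failures, timeouts, and other request-level errors
    print(f"Other error occurred: {err}")
Result:
HTTP error occurred: 404 Client Error: Not Found for url: https://api.github.com/invalid-url
2. Logging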
The logging module is the standard way to record events. Unlike print, logs can be categorized by severity and sent to files or external systems.
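As a minimal sketch of the "sent to files" point, the snippet below attaches a file handler to a named logger so that only WARNING and above are written to disk. The logger name deploy, the file name deploy.log, and the temporary directory are illustrative assumptions chosen so the sketch runs anywhere; a real script would use a fixed log path.

```python
import logging
import os
import tempfile

# Hypothetical log location (assumption): a throwaway directory for the demo.
log_path = os.path.join(tempfile.mkdtemp(), "deploy.log")

logger = logging.getLogger("deploy")  # "deploy" is an illustrative name
logger.setLevel(logging.DEBUG)        # accept everything; handlers filter

file_handler = logging.FileHandler(log_path)
file_handler.setLevel(logging.WARNING)  # only WARNING and above reach the file
file_handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
)
logger.addHandler(file_handler)

logger.info("Starting the deployment script...")     # filtered out of the file
logger.warning("Disk space is low (15% remaining)")  # written to the file
```

Setting the level on the handler rather than only on the logger lets you add a second, more verbose handler later without changing what the file receives.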
Basic Configuration
Action:
import logging

# Configure logging to show time and severity
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

logging.info("Starting the deployment script...")
logging.warning("Disk space is low (15% remaining)")
logging.error("Failed to connect to the database.")
Result:
2026-04-10 14:00:00,123 - INFO - Starting the deployment script...
2026-04-10 14:00:00,125 - WARNING - Disk space is low (15% remaining)
2026-04-10 14:00:00,127 - ERROR - Failed to connect to the database.
3. Advanced: Retry Logic
DevOps scripts often interact with unstable networks. Adding retries makes your automation "self-healing."
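One common refinement, sketched here ahead of the step-by-step example, is to wrap the retry loop in a reusable helper that waits longer after each failure (exponential backoff) and re-raises once the attempts run out. The helper name retry, its parameters, and flaky_task are hypothetical names introduced for illustration; they are not from any library.

```python
import logging
import time


def retry(task, max_retries=3, base_delay=1.0,
          retry_on=(TimeoutError, ConnectionError)):
    """Call task() until it succeeds or max_retries attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except retry_on as exc:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # base, 2x base, 4x base...
            logging.warning("Attempt %d failed (%s); retrying in %.2fs",
                            attempt, exc, delay)
            time.sleep(delay)


# Hypothetical task (assumption): fails twice, then succeeds.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("Temporary Network Timeout")
    return "Success!"


result = retry(flaky_task, base_delay=0.01)
print(result)
```

Retrying only on the exception types listed in retry_on means a genuine bug, such as a TypeError, fails immediately instead of being retried.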
Action:
import random
import time

def unreliable_task():
    if random.random() < 0.7:  # 70% chance of failure
        raise TimeoutError("Temporary Network Timeout")
    return "Success!"

max_retries = 3
for attempt in range(max_retries):
    try:
        print(f"Attempt {attempt + 1}...")
        result = unreliable_task()
        print(result)
        break
    except TimeoutError as e:
        print(f"Failed: {e}")
        if attempt < max_retries - 1:
            time.sleep(1)  # Wait before retrying
else:
    # Runs only if the loop never hit break, i.e. every attempt failed
    print(f"Giving up after {max_retries} attempts.")
Result (Example):
Attempt 1...
Failed: Temporary Network Timeout
Attempt 2...
Failed: Temporary Network Timeout
Attempt 3...
Success!
Summary
- Never use a bare except: block; catch specific exceptions.
- Use logging instead of print for production scripts.
- Use log levels: DEBUG (noisy), INFO (normal), WARNING (caution), ERROR (failure).
- Implement retries for network-dependent tasks.
- Log to stderr for errors and stdout for normal output.
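The last point takes a little setup, because the logging module's default stream handler writes every level to stderr. A minimal sketch of one way to split the streams, assuming a hypothetical logger named pipeline and a small custom filter class (BelowErrorFilter is an illustrative name, not part of the standard library):

```python
import logging
import sys


class BelowErrorFilter(logging.Filter):
    """Pass only records less severe than ERROR."""

    def filter(self, record):
        return record.levelno < logging.ERROR


logger = logging.getLogger("pipeline")  # illustrative name (assumption)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger

formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

# INFO and WARNING go to stdout.
stdout_handler = logging.StreamHandler(sys.stdout)
stdout_handler.addFilter(BelowErrorFilter())
stdout_handler.setFormatter(formatter)

# ERROR and CRITICAL go to stderr.
stderr_handler = logging.StreamHandler(sys.stderr)
stderr_handler.setLevel(logging.ERROR)
stderr_handler.setFormatter(formatter)

logger.addHandler(stdout_handler)
logger.addHandler(stderr_handler)

logger.info("Deployment finished.")                # written to stdout
logger.error("Failed to connect to the database.")  # written to stderr
```

With this split, a shell redirect such as 2> errors.log captures only the failures while normal progress output stays on the console.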