Why Culture Matters in SRE
SRE success depends on culture as much as tools:
Good Tools + Bad Culture = Failure
- Team ignores the tools
- Blame-oriented incident response
- People don't trust automation
- Engineers leave
Good Tools + Good Culture = Success
- Team embraces tools and practices
- Blameless, learning-focused
- People trust systems
- Engineers stay and innovate
Organizational Models for SRE
Model 1: Centralized SRE Team
Company Structure:
├── Product Teams (Own feature development)
├── SRE Team
│ ├── SRE Lead
│ ├── Senior SRE
│ └── SRE Engineers
└── Infrastructure Team (Own hardware/cloud)
Characteristics:
- Dedicated SRE team
- Supports all product teams
- Single point of authority on reliability
- Cost-effective for a single service
When to use: Small companies, single large application
Model 2: Embedded SRE Teams
Company Structure:
├── Product Team A
│ ├── Feature engineers
│ └── Embedded SRE (1-2 people)
├── Product Team B
│ ├── Feature engineers
│ └── Embedded SRE (1-2 people)
└── Platform SRE Team (shared infrastructure)
Characteristics:
- SRE embedded in each product team
- A platform team also covers shared services
- Closer collaboration
- Expensive but scalable
When to use: Larger companies with multiple distinct services
Model 3: DevOps-First with SRE on Demand
Company Structure:
├── Product Teams (Own feature + operations)
│ ├── Feature engineers
│ └── DevOps engineer
└── SRE Center of Excellence
    └── Available to consult on reliability
Characteristics:
- Product teams own everything
- SRE available as consultants
- Lower ops overhead
- Often used at startups/growth stage
When to use: Cloud-native organizations with mature deployment
Hiring SREs
What Makes a Good SRE?
Core Skills:
✅ Software engineering ability (must write code)
✅ Systems thinking (understand complex systems)
✅ Operations experience (know what fails)
✅ Problem-solving (debug under pressure)
✅ Communication (work across teams)
Experience Path (common):
1. Software engineer (2-3 years)
2. Operations/DevOps engineer (2-3 years)
3. SRE engineer (career)
OR
1. SRE at another company (if hiring externally)
Hiring Mistakes to Avoid
❌ "Pure operations person without coding skills"
- Won't be able to write automation
- SRE is an engineering discipline, not pure operations
❌ "Theoretical computer scientist with no ops experience"
- Won't understand operational constraints
- Will be surprised by real-world failures
❌ "Hiring for seniority only"
- Burns out your senior engineers
- Need junior SREs with mentoring
- Career progression matters
✅ "Mix of experience levels"
- Ratio: 1 senior : 2-3 mid-level : 1-2 junior
- Allows mentoring and knowledge sharing
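The ratio above can be checked against a planned headcount. This is a minimal sketch, not a hiring rule: the function name `mix_within_guideline` is hypothetical, and it assumes the ratio is read as "per senior, 2-3 mid-level and 1-2 junior".

```python
def mix_within_guideline(senior: int, mid: int, junior: int) -> bool:
    """Return True if the team roughly matches 1 senior : 2-3 mid : 1-2 junior."""
    if senior <= 0:
        # Assumption: the guideline needs at least one senior to do the mentoring.
        return False
    mid_ok = 2 * senior <= mid <= 3 * senior
    junior_ok = senior <= junior <= 2 * senior
    return mid_ok and junior_ok


print(mix_within_guideline(senior=2, mid=5, junior=3))  # True
print(mix_within_guideline(senior=4, mid=2, junior=0))  # False
```

A team that fails the check is not necessarily unhealthy; the result is only a prompt to look at who is available to mentor whom.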
Interview Process
Good SRE interviews assess:
1. Coding ability
- Design a monitoring system
- Write deployment automation
- Optimize database query
2. Operations knowledge
- "System is slow, walk me through investigation"
- "We need 99.99% uptime, how?"
- "Deployment failed, what now?"
3. Communication
- "Explain a complex system to a non-technical lead"
- "You disagree with product on a deadline. How do you handle it?"
4. Problem-solving under pressure
- "You're on-call, alert fires, describe response"
5. Culture fit
- "Tell me about an incident you learned from"
- "How do you handle blame in your current org?"
Training and Development
On-Boarding Path
Week 1: Orientation
- Meet team
- Understand services
- Set up development environment
- Get laptop
Week 2-4: Knowledge
- Pair programming with experienced SRE
- Read documentation
- Attend design reviews
- Review recent incidents
Month 2-3: Hands-On
- Part of on-call (with mentor)
- Implement small reliability improvements
- Participate in incident response
- Present learnings
Month 4+: Autonomy
- Independent on-call
- Lead reliability projects
- Mentor new teammates
- Expand responsibilities
Continuous Learning
Reading List for SREs:
- "Site Reliability Engineering" (Google SRE Book)
- "The Phoenix Project" (DevOps culture)
- "Release It!" (production patterns)
- "Designing Data-Intensive Applications" (systems design)
Conferences:
- SREcon (regional and annual)
- DevOps Enterprise Summit
- Cloud Native Computing Foundation conferences
Certifications (optional):
- CKA (Certified Kubernetes Administrator)
- AWS Certified SysOps Administrator
- Various vendor-specific certs
Building Blameless Culture
Foundation: Psychological Safety
Psychological safety is a prerequisite for blameless culture
Definition: People feel safe to:
- Take intelligent risks
- Admit mistakes
- Ask for help
- Disagree with authority
Without it:
- People hide problems
- Blame others to protect themselves
- Innovation stops
- Quiet resignations
With it:
- Problems surface quickly
- Team learns together
- Innovation happens
- People stay
Creating Psychological Safety
Leadership behaviors that build safety:
✅ DO:
- Admit when you don't know something
- Ask "What happened?" not "Who caused this?"
- Thank people for surfacing bugs in low-pressure scenarios
- Celebrate learning from incidents
- Don't shoot the messenger
❌ DON'T:
- Use incidents to create performance evaluations
- Blame individuals in meetings
- Punish people for honest mistakes
- Make heroes of people who "save the day"
- Hide failures from the team
Explicit Blameless Commitments
# Team Agreement on Blameless Culture
Our commitment:
- No one will be punished for an incident they contributed to
- We focus on improving systems, not blaming individuals
- Mistakes are learning opportunities
- We discuss failures openly and honestly
Postmortem principles:
- We document what happened, not who caused it
- We ask "What can we improve?" not "Who messed up?"
- Action items focus on system improvements
- Participants feel safe sharing honestly
If we break these principles:
- Any team member can call it out immediately
- We discuss how to do better
- We repair relationships
- We continuously improve our culture
Scaling SRE Teams
5 Engineers → 10 Engineers
Additions needed:
- Hire 2-3 more SREs
- SRE lead now spends 50% on management
- Define clear ownership of services
- Start formal training program
10 Engineers → 20 Engineers
Additions needed:
- Split into sub-teams (by service or domain)
- Each sub-team has a lead
- Create SRE manager role
- Define career progression
- Codify practices and standards
20+ Engineers
Additions needed:
- SRE director or VP
- Separate teams for different domains:
  - Production reliability
  - Infrastructure/platform
  - Observability/monitoring
  - Chaos/resilience
- Hiring manager for each team
- Cross-team coordination meetings
Compensation and Career Growth
Fair SRE Compensation
SRE careers should be as lucrative as software engineering:
Industry ranges (varies by location and company):
- Junior SRE: $100-150k + bonus
- Mid-level SRE: $150-220k + bonus
- Senior SRE: $220-300k + bonus
- SRE Manager: $250-350k + bonus
- SRE Director: $300-400k + bonus
On-call compensation:
- Base: Included in salary
- Stipend: $500-2000/month while on-call
- Holiday/weekend on-call: 1.5x - 2x base rate
- After-hours incident: Comp time or OT pay
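The on-call figures above combine into a rough monthly estimate. This is a sketch under stated assumptions: the function name `monthly_on_call_pay`, the hourly rate, and the idea that holiday/weekend hours are paid hourly at the multiplier are illustrative, not a description of any particular company's policy.

```python
def monthly_on_call_pay(stipend: float, hourly_rate: float,
                        premium_hours: float, multiplier: float = 1.5) -> float:
    """Estimate on-call pay for one month.

    stipend        flat monthly on-call stipend (the text suggests $500-2000)
    hourly_rate    assumed hourly base rate for the engineer
    premium_hours  holiday/weekend on-call hours paid at a premium
    multiplier     premium factor (the text suggests 1.5x to 2x)
    """
    if not 1.5 <= multiplier <= 2.0:
        raise ValueError("multiplier is expected to be between 1.5 and 2.0")
    return stipend + premium_hours * hourly_rate * multiplier


# $1000 stipend plus 6 weekend hours at an assumed $80/hour, paid at 1.5x
print(monthly_on_call_pay(stipend=1000, hourly_rate=80, premium_hours=6))  # 1720.0
```

Comp time for after-hours incidents is left out on purpose, since it is paid in time off rather than money.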
Career Progression
Paths for SRE growth:
Staff/Principal Track (Technical):
- Senior SRE → Staff SRE → Principal SRE
- Focus on architecture and technical leadership
- Authority without management responsibility
Management Track:
- Senior SRE → SRE Lead → SRE Manager → Director
- Responsible for team, hiring, development
Both tracks should be equally valued and compensated
Skills Development Ladder
Junior SRE:
- Writing automation scripts
- Responding to incidents (guided)
- Basic monitoring and alerting
- Learning SRE culture and practices
Mid-level SRE:
- Designing reliable systems
- Independent incident response
- Mentoring junior SREs
- Driving reliability improvements
Senior SRE:
- Architectural decisions
- Cross-team collaboration
- Mentoring and hiring
- Organizational reliability strategy
Measuring Team Health
Team Metrics Dashboard
SRE Team Health (Monthly)
On-call Metrics:
Incidents per person: 2.4 avg
Night disruptions: 1.2 per person
MTTR (mean time to recovery): 15 min
On-call satisfaction: 8/10 avg
Capability Metrics:
Runbook coverage: 95% of services
Monitoring coverage: 92% of services
Automation coverage: 80% of services
Team Health:
Turnover (annual): 5% (healthy)
Training hours: 40 hours per person/year
Promotion rate: 20% per year
New hire success rate: 90%
Project Progress:
Toil reduction: 10% this quarter
New services taken on: 2
Major outages: 0
SLO achievement: 99.95% (target met)
Employee Satisfaction Survey
Annual SRE Survey
Questions:
"I feel safe admitting mistakes": 4.2/5 ✅
"I have good work-life balance": 3.8/5 ⚠️
"I'm growing technically": 4.5/5 ✅
"I understand my career path": 3.5/5 ⚠️
"I'd recommend this company": 4.3/5 ✅
"On-call is sustainable": 3.2/5 ⚠️
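The flags on the scores above follow a simple threshold. The sketch below reproduces them in Python; the 4.0 cutoff is an assumption that happens to fit the scores shown (the source does not state one), and `flag` and `needs_attention` are hypothetical names.

```python
WARN_BELOW = 4.0  # assumed cutoff; the survey itself does not state one

scores = {
    "I feel safe admitting mistakes": 4.2,
    "I have good work-life balance": 3.8,
    "I'm growing technically": 4.5,
    "I understand my career path": 3.5,
    "I'd recommend this company": 4.3,
    "On-call is sustainable": 3.2,
}


def flag(score: float, warn_below: float = WARN_BELOW) -> str:
    """Return "OK" for healthy scores and "WARN" for scores under the cutoff."""
    return "OK" if score >= warn_below else "WARN"


# Questions that fall below the cutoff, in survey order
needs_attention = [q for q, s in scores.items() if s < WARN_BELOW]

for question, score in scores.items():
    print(f"{flag(score):4} {score:.1f}/5  {question}")
```

Tracking `needs_attention` from one survey to the next shows whether follow-up actions are moving the low scores.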
Insights:
- Team feels psychologically safe (good)
- Work-life balance needs improvement (hiring?)
- On-call workload too high (need more people?)
- Career progression unclear (need mentoring?)
Building Community
Internal
Regular practices:
- Weekly SRE sync (15 min, quick updates)
- Monthly SRE brown bags (learning and presentations)
- Quarterly SRE retreat (planning and culture)
- Incident retrospectives (continuous learning)
- Mentoring pairs (senior + junior)
External
Representation:
- SRE members speak at conferences
- Write blog posts about practices
- Contribute to open source
- Organize local SRE meetup
Benefits:
- Recruits see your culture
- Learn from community
- Share knowledge
- Improve your brand
Common Team Culture Mistakes
❌ "Heroic on-call"
Problem: Celebrate people working 24/7
Fix: Celebrate good architecture instead
❌ "Blame individuals"
Problem: Use incidents to punish
Fix: Explicit blameless commitment
❌ "Ignore on-call load"
Problem: On-call engineers burn out
Fix: Track and respond to health metrics
❌ "No career growth"
Problem: Senior engineers leave
Fix: Clear paths forward
❌ "Only hire senior people"
Problem: Expensive and no future pipeline
Fix: Mix of experience levels
✅ "Psychological safety"
✅ "Fair compensation"
✅ "Clear growth paths"
✅ "Reasonable on-call"
✅ "Continuous learning"
Key Takeaways
✓ Culture is as important as tools
✓ Psychological safety enables learning
✓ Mix of experience levels in team
✓ Clear career progression
✓ Fair compensation that matches software engineering
✓ Explicit blameless commitments
✓ Regular training and development
✓ Track team health metrics
✓ Scale teams thoughtfully
✓ Build community internally and externally