From the DevOps Enterprise Summit 2015, this presentation covers hard-won lessons of transitioning an engineering organization to DevOps. See video at https://www.youtube.com/watch?v=6tREbJl8e_Y.
Lessons:
1. Reorganize around Ownership
2. Lose the Ticket Culture
3. Replace Approvals with Code
4. Enforce a Service Mentality
5. Charge for Usage
6. Prioritize Quality
7. Start Investing in Testing
8. Actively Manage Technical Debt
9. Share On-Call Duties
10. Make Post-Mortems Truly Blameless
DevOps is no longer just for Internet unicorns any more. Today many large enterprises are transitioning from the slow and siloed traditional IT approach to modern DevOps practices, and getting substantial improvements in agility, velocity, scalability, and efficiency. But this transition is not without its challenges and pitfalls, and those of us who have led this journey have the scar tissue to prove it.
A successful transition to DevOps practices ultimately involves changes to organization, to culture, and to architecture. Organizationally, we want to create multi-skilled teams with end-to-end ownership and shared on-call responsibilities. Culturally, we want to prioritize solving problems and improving the product over closing tickets. Architecturally, we want to move to an infrastructure with independently testable and deployable components.
The ten practical lessons outlined in this session synthesize the speaker’s experiences leading teams at eBay, Google, and KIXEYE, as well as from his former consulting practice.
2. 1. Reorganize Teams
Around Ownership
• End-to-end Ownership
o Small, cross-functional team owns application / service from design to
deployment to retirement
o Team has inside it all skill sets needed to do the job
o Depends on other teams for supporting services
o Able to move very rapidly and independently
• “You build it, you run it”
o The same team that builds the software operates the software
o No separate maintenance or sustaining engineering team
3. 1. Reorganize Teams
Around Ownership
• E.g., KIXEYE and MySQL
o Development team wrote the SQL, issued all the queries
o DBA / Ops team responsible for performance and uptime
o Splitting ownership between teams was counterproductive and disruptive
• Alternative strategies
o Centrally-maintained persistence service
OR
o Customer manages its own persistence
4. 2. Lose the
Ticket Culture
Ticket Culture Ownership Culture
Do what is asked for Do what is needed
One-way communication Two-way collaboration
Goal is to close the ticket Goal is product success
Reactive approach Proactive approach
Reinforces silos Reinforces collaboration
Prioritizes process Prioritizes results
5. 3. Replace Approvals
With Code
• Reduce or eliminate approval bodies
o E.g., eBay Architecture Review Board
o (-) Too late
o (-) Too slow
o (-) Too disengaged from details
• Package expertise in code
o Smart, experienced people build their knowledge into code
o Teams with specialized skills (databases, security, compliance, etc.) provide a
service, library, or tool
6. 3. Replace Approvals
With Code
• E.g., Security at Google
o Provide secure foundations by maintaining lower-level libraries and services
o Provide self-service penetration tests, vulnerability assessments, etc.
7. The easiest way to “enforce” a
standard practice is with
working code.
8. 4. Enforce a
Service Mentality
• Vendor-Customer Discipline
o Service team is a vendor; the products are its customers
o Service is useful only to the extent it provides value to its customers
• Customer can choose to use service or not (!)
o Customer team is responsible for deciding what is best for their use case
o Use the right tool for the right job
• Provides powerful incentives
o Service must be *strictly better* than the alternatives of build, buy, borrow
9. 5. Charge for
Usage
• Charge customers for *usage* of the service
o Aligns economic incentives of customer and provider
o Motivates both sides to optimize efficiency
• Free usage leads to waste
o No incentive to control usage or find more efficient alternatives
• E.g., App Engine usage at Google
o Charging particularly egregious internal customer led to 10x reduction in usage
10. 6. Prioritize
Quality
• Quality, Performance, and Reliability are “Priority-0
features”
o “Stop the line” if there is a degradation
o Equally important to users as product features or engaging user experience
• Developers write tests and code together
o Continuous testing of features, performance, load
o Confidence to make risky changes
• “Slow down to speed up”
o Catch bugs earlier, fail faster
11. 6. Prioritize
Quality
• E.g., Development Process at Google
o Code reviews before submission
o Automated tests for everything
o Single searchable source code repository
Internal Open Source Model
o Not “here is a bug report”
o Instead “here is the bug; here is the code fix; here is the test that verifies the fix”
12. 7. Start Investing
in Testing
• Write functional tests around a component
o If you can only write a few tests, they should be meaningful ones
o End-to-end tests exercise more meaningful customer-visible capabilities than unit
tests
• Fail any build that breaks a test
• Keep ratcheting up the tests
o For every new feature, add tests for that feature
o For every new bug, add a test that reproduces the bug and verifies the fix
13. 8. Actively Manage
Technical Debt
• Maintain sustainable and well-understood level of debt
o Denominated in engineering effort to fix
o Plan for how and when you will pay it off
o Track feature work vs. accrued debt over time
• “Don’t have time to do it right” ?
o WRONG – Don’t have time to do it twice (!)
o The more constrained you are on time and resources, the more important it is to
do it solidly the first time
16. 9. Share
On-call Duties
• All members of the team rotate on-call responsibilities
o Strongest motivator to build in solid monitoring and diagnosis capabilities
o Best way to learn the real-world behavior of the system
o Best way to develop empathy for customers and other team members
• Train via on-call “apprenticeship”
o 1. Apprentice starts as secondary on-call, experienced engineer is primary
o 2. Apprentice is primary, experienced engineer is secondary
o 3. Apprentice graduates
17. 10. Make Post-Mortems
Truly Blameless
• Overcoming blame culture takes work
o Institutional memory of blame is long
o E.g., Initial post-mortems at KIXEYE elicited tons of fear
• Constantly reinforce learning over blame
o When you say “blameless”, you have to really mean it (!)
o Don’t ask “what did you do?”, ask “what did you learn?”
18. 10. Make Post-Mortems
Truly Blameless
• Open and Honest Discussion
o Document exactly what happened
o What went right
o What went wrong
• Focus on Learning and Improvement
o How should we change process, technology, documentation, etc.
o How could we have automated the problems away?
o How could we have diagnosed more quickly?
• Take fear and personalization out of it
Engineers will compete to take personal responsibility (!)
“Finally we can fix that broken system”
19. Top Five
Takeaways
• 1. Reorganize Teams Around Ownership
• 2. Replace Approvals With Code
• 3. Prioritize Quality
• 4. Actively Manage Technical Debt
• 5. Make Post-Mortems Truly Blameless
20. What I Could
Use Help With
• Encouraging leaders to lose the blame culture
• Measuring productivity in a principled way
• Overcoming resistance to taking the pager