Facebook, Netflix, Flickr, Etsy, LinkedIn, Esurance, Instagram and Salesforce.com; you know their names. As a consumer, you’ve probably used services provided by many of them. These are some of the “born on the web” companies of the last couple of decades that have helped pioneer new, web-based business models - and in the process become dominant players in their markets, or created new markets altogether. Call them the “Cool Kids”.
What you may not know, however, is that these companies are also strong adopters of a DevOps approach when it comes to software development and delivery. In this presentation we take a look at these companies to discern patterns related to how they have applied DevOps in the areas of Culture, Organization, Practices, Automation and Measurements.
Even if your company bears no resemblance at all to the Cool Kids, you can take away some important learnings from them as you look to apply DevOps to your own software initiatives.
This presentation is a result of a joint project executed by IBM strategists Bill Holtshouser and Carl Zetie, both of the Rational division in IBM Software Group, during the first half of 2014.
2. Please note…
IBM’s statements regarding its plans, directions, and intent are subject to change
or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality.
Information about potential future products may not be incorporated into any
contract. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results
similar to those stated here.
3. Introduction
• This session is based on an examination of a series of “born on the
web” companies to see what common patterns and other learnings can
be derived from their DevOps journeys, with the goal of extracting
guidance for IBM’s clients
• We used only publicly available information such as published
conference presentations, company blogs, videos, news stories and
white papers
• Important: Everything here is strictly our opinion; none of the
companies mentioned reviewed or endorsed these opinions in any way!
4. Key Takeaways
• “Born on the Web” startups like Etsy, Netflix and others have been
leaders in applying a DevOps approach to SW development and delivery
– but they are essentially built from the ground up to do so
• These companies display numerous common DevOps-related traits in
the areas of Culture, Organization, Practices, Automation and
Measurements
• Although your enterprise won’t be able to replicate all aspects of these
“cool kid” companies and how they have applied DevOps (nor should
you even try), there are some important learnings from them that
can inform your own DevOps approach
7. Believe it or not, Dev and Ops weren’t always separate
“Back in the dawn of the computer
age, there was no distinction between
dev and ops. If you developed, you
operated. You mounted the tapes, you
flipped the switches on the front panel,
you rebooted when things crashed, and
possibly even replaced the burned out
vacuum tubes. And you got to wear a
geeky white lab coat…”
“Dev and ops started to separate in the
‘60s, when programmers dumped boxes of
punch cards into readers and “computer
operators” scurried around mounting tapes
in response to IBM JCL. The operators also
pulled printouts from line printers and put
them in labeled cubbyholes, where you got
your output filed under your last name.”
– John Allspaw, Etsy
9. Sidebar: Continuous Delivery is more than just “fast
Continuous Integration”
Continuous Delivery
• Websites, SaaS offerings
• Multiple pushes to production per day
• Highly decoupled, independent feature sets
• Single image/single stream
• New practices and patterns
Continuous Integration
• Traditional applications, appliances, mobile apps, Web APIs
• Delivery to production every few days to weeks
• Coordinated releases, multiple version streams
• Established Agile practices
Continuous Engineering
• Complex embedded systems
• Complex product release and update cycles
• Management of variants and versions
• Engineering practices
10. Five essential elements of “Cool Kids” DevOps
success
• Culture
• Organization
• Practices
• Automation
• Measurement
11. Cool Kids and Culture - key learnings
• Trust leads to an acceptance of “reasonable” risk
– Organization, tools, automation, instrumentation can all reduce risk
• Risk = PROBABILITY of Error x COST of Error
– Not all risks are created equal; zero risk is unattainable
– Cost depends on Time to Fix
• Learning from mistakes > blame
– …but there is still Karma: repeated mistakes may lead to loss of privilege
At Etsy, employees have a high degree of creative freedom and, when things go wrong,
accountability without blame. “We actually trust people,” CTO Chad Dickerson says. He
calls the approach a “radical decentralization of authority.” – Inc. Magazine, 12/13
• ALL exhibit a high degree of delegation
– …which leads to velocity
• In order to delegate, the Cool Kids trust… but verify
– E.g. via instrumentation, measurement
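The risk formula on this slide (Risk = Probability of Error x Cost of Error, with cost depending on time to fix) can be made concrete with a small worked example. This is only a sketch: the probabilities, the cost per minute, the fix times and the linear cost model are all made-up assumptions for illustration, not figures from any of the companies discussed.

```python
def cost_of_error(cost_per_minute: float, minutes_to_fix: float) -> float:
    """Cost of an error, assuming (for illustration) it grows linearly with time to fix."""
    return cost_per_minute * minutes_to_fix


def expected_risk(probability_of_error: float, cost: float) -> float:
    """Risk = probability of error x cost of error."""
    return probability_of_error * cost


# Hypothetical numbers, for illustration only.
COST_PER_MINUTE = 1_000.0

# Large, infrequent release: errors are less likely per release but slow to undo.
big_release = expected_risk(0.05, cost_of_error(COST_PER_MINUTE, 240))

# Small change with fast rollback: errors are more likely here, but fixed in minutes.
small_change = expected_risk(0.10, cost_of_error(COST_PER_MINUTE, 5))

print(f"big release:  {big_release:,.0f}")
print(f"small change: {small_change:,.0f}")
```

Under these assumed numbers, the small change carries far less risk even though it is twice as likely to fail, because a short time to fix shrinks the cost term.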
12. Re-defining the attitude towards “failure”
• Netflix allows failure to happen continuously and wants its
software to be able to deal with it; in fact, it takes steps to
encourage errors (Simian Army)
• In reality, Netflix looks at “failure” as simply another STEP in
the software development process
http://techblog.netflix.com/2011/07/netflix-simian-army.html
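The idea of deliberately encouraging errors can be sketched in a few lines. This is a toy illustration, not Netflix's Simian Army tooling: the `ServicePool` class, the instance IDs and the `min_healthy` threshold are all hypothetical, and the "termination" happens only in memory.

```python
import random


class ServicePool:
    """Toy in-memory pool of instances behind a hypothetical service."""

    def __init__(self, instance_ids, min_healthy):
        self.healthy = set(instance_ids)
        self.min_healthy = min_healthy

    def terminate(self, instance_id):
        self.healthy.discard(instance_id)

    def is_serving(self):
        # The service survives as long as enough instances remain.
        return len(self.healthy) >= self.min_healthy


def inject_failure(pool, rng):
    """Terminate one randomly chosen healthy instance and return its id."""
    victim = rng.choice(sorted(pool.healthy))
    pool.terminate(victim)
    return victim


rng = random.Random(42)  # seeded so the drill is repeatable
pool = ServicePool(["i-1", "i-2", "i-3", "i-4"], min_healthy=2)
victim = inject_failure(pool, rng)
print(f"terminated {victim}; still serving: {pool.is_serving()}")
```

The point of such a drill is the check at the end: if losing one instance takes the service down, that is a design problem worth finding before a real failure finds it.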
13. Cool Kids and Culture – more learnings
• Adopt an “Ops First” design mentality
– Don’t build what you can’t manage
• Recognize the importance of build
– They don’t just give the build system to the “worst programmer”
or newest hire, but establish a focused role
14. Bottom line: a culture of trust is required
Rapid delivery requires low risk
• Small feature sets
• Independent services
• Progressive exposure
• Rapid feedback
• Reliable rollback
• High delegation & trust
Risk = Probability of error x Cost of error
15. Adrian Cockcroft of Netflix on Culture
“Culture is very hard to create or modify but easy to destroy.
This is because everyone has to buy into it for it to be effective,
and then every manager has to hire only people who are
compatible with the culture, and also get rid of people who turn
out not to fit in, even if they are doing good work.
So the short answer is: start a new company from scratch
with the culture you want, and pay a lot of attention to who
you hire. I don't think it is possible to do a culture shift if
there are more than a roomful of people involved.
Even with a roadmap and a guide, you probably won't be able
to follow this path if you are in a large established company.
Your existing culture won't let you.”
http://perfcap.blogspot.com/2012/03/ops-devops-and-noops-at-netflix.html
16. Organization follows Culture
Traditional Culture
• “My priority is to deliver code… fast.”
• “My priority is to keep the site up and running.”
DevOps Culture
• “We’re all on the same team! Want some pizza?”
17. Organization follows Culture
• Conway’s Law (you build what you are) applies
– …also applies to how you’re organized
• Feature teams, not platform teams
– Small teams: “two pizza” rule
• Organize for an “end-to-end” responsibility for delivery
– Positive approach to fixing mistakes – learning, not “blame and shame”
• Many common patterns are seen in QA…
– Shared responsibility across a team, everybody does QA, or co-located QA
– Small Quality Engineering CoE team provides common tools/practices
– But NOT a separate/antagonistic QA org (“clean up your own mess”)
• Small DevOps “toolsmith” teams
– A.K.A. Systems Release Engineering
– Provide common tools & processes for automation, logging, monitoring…
– There to help, NOT to do it for you
• Finally - no “throwing it over the wall”…
19. Practices that “make perfect” for the Cool Kids
Practices
• “Light” planning and specs
– Etsy high level planning done in 60 day chunks and two
week periods; specs kept very light – no more than what is
required
• Cut the cord with traditional release process
– Developers coordinate and drive the release of their own
code without need for a centralized release cycle
– Netflix goes farther than most: “NoOps”
• Speed, speed, speed
– It’s all about rapid deployment; some deploy updates to their site 25x
per day
• Progressive rollout of new features, “dark” releases
– Concept of “config flags”, new features there but not yet enabled, then
launched with simple switch in the code
• They talk about it…a LOT
– Lots of internal and external forums / blogs among the Cool Kids
– Example: Etsy “Code as Craft” site www.codeascraft.com
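A “config flag” of the kind described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: the `CONFIG_FLAGS` store, the `new_checkout` flag and the `checkout_page` function are hypothetical names, not taken from any company's code.

```python
# Hypothetical flag store; in practice this would live in configuration, not in code.
CONFIG_FLAGS = {
    "new_checkout": False,  # code is deployed ("dark") but not yet enabled
}


def is_enabled(flag_name: str) -> bool:
    """Unknown flags default to off so unfinished features stay hidden."""
    return CONFIG_FLAGS.get(flag_name, False)


def checkout_page() -> str:
    if is_enabled("new_checkout"):
        return "new checkout flow"
    return "existing checkout flow"


before = checkout_page()
CONFIG_FLAGS["new_checkout"] = True  # the "launch" is a flag change, not a new deployment
after = checkout_page()
print(before, "->", after)
```

Because the new code path ships while switched off, deployment and launch become separate decisions, and turning the flag back off serves as a quick rollback.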
20. Practices: a single image simplifies things
• Most of these companies manage a single production
image that they completely control
– They don’t have to worry about shipping releases to
customers who might or might not install those releases
• …therefore there are no branches in their version
control – everything is checked into the trunk
21. Practices: “Continuous Delivery Heresy”
(Yes, you can do too much testing)
• Testing everything on every check-in is good…but it
isn’t the endgame
– LinkedIn has only a few thousand unit tests
• Testing in a non-production environment can reach a
point of diminishing returns
– Ever-growing lists of unit tests, often testing very obscure
scenarios, often overlapping and redundant
– Limited by your ability to predict real world scenarios
• LinkedIn practice: get to production environment as
soon as practical
– Progressive rollout minimizes the risk when deploying to
production…
22. Practices: Progressive Rollout
• Progressive rollout of new features, “dark” releases:
– Deploy to one server with all features disabled to ensure no
performance or resource regressions (also known as “canarying”)
– Turn on features for a small population, and measure (“smoke test”)
– Turn it on for up to 1% of users, and measure
– Progressively roll out to all servers, continuing to measure
– Config Flags (also known as feature flags or gatekeepers [LinkedIn])
control which users see which features
• In order to successfully do Progressive Rollout, you’ll need
two more of our five essential elements:
– Automation, both to progressively roll out and to roll back if a
problem is discovered
– Measurement (tied to Instrumentation), in order to be able to rapidly
measure the impact
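The rollout steps on this slide (expose a small population, measure, widen, and roll back on trouble) can be sketched as below. This is a hedged illustration, not any company's gatekeeper implementation: the `STAGES` percentages, the hash-based bucketing, the error-rate threshold and the sample measurements are all assumptions chosen for the example.

```python
import hashlib

# Hypothetical exposure stages, as a percentage of users.
STAGES = [0, 1, 10, 50, 100]


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def sees_feature(user_id: str, feature: str, percent: int) -> bool:
    """A user keeps the feature as exposure grows, because their bucket never changes."""
    return bucket(user_id, feature) < percent


def next_percent(current: int, error_rate: float, max_error_rate: float) -> int:
    """Advance one stage if the measurement looks healthy, else roll back to 0."""
    if error_rate > max_error_rate:
        return 0
    later = [s for s in STAGES if s > current]
    return later[0] if later else current


percent = 0
for observed_error_rate in [0.001, 0.002, 0.001]:  # made-up measurements
    percent = next_percent(percent, observed_error_rate, max_error_rate=0.01)
rolled_back = next_percent(percent, 0.05, max_error_rate=0.01)
print("exposure after three healthy measurements:", percent)
print("exposure after one bad measurement:", rolled_back)
```

Both of the elements the slide calls for show up here: `next_percent` is the automation (roll forward or roll back), and it cannot decide anything without a measured error rate.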
24. Practices: Fire When Ready!
• These companies tend to avoid “release-defining
features” that can hold up the entire release
• Cool Kids pattern: release features when they are
ready - the release train waits for nobody
– Also known as date-based releases - the date of release is
fixed, but the features in that release are flexible
• For this to work, you must respect forward and
backwards compatibility of API (service) interfaces
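One common way to respect forward and backward compatibility of a service interface is for readers to tolerate both missing and unknown fields. The sketch below assumes a hypothetical order message; the field names (`order_id`, `currency`, `gift_wrap`) and the default value are invented for illustration.

```python
def parse_order(payload: dict) -> dict:
    """Read an order message sent by any version of a hypothetical service.

    Backward compatible: a field added later gets a default when an older
    sender omits it. Forward compatible: unknown fields from a newer sender
    are ignored rather than rejected.
    """
    return {
        "order_id": payload["order_id"],             # required since the first version
        "currency": payload.get("currency", "USD"),  # hypothetical later addition
    }


old_message = {"order_id": 7}                                         # older sender
new_message = {"order_id": 8, "currency": "EUR", "gift_wrap": True}   # newer sender

print(parse_order(old_message))
print(parse_order(new_message))
```

With readers written this way, services can ship on their own schedule, since neither side has to wait for the other to upgrade first.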
25. Cool Kids and Automation
• In general, the Cool Kids automate as much as
possible
– Etsy has invested a lot in automated unit / functional
testing, dev tooling and monitoring, use of dashboards
– Netflix has a heavy degree of automation across the
board
• Automate even the infrastructure, but keep it simple
– LinkedIn, Flickr and Netflix generally build up their
infrastructure from just a single OS image
– From here, configure individual servers using automated
scripts driven by tool of choice (e.g. IBM UrbanCode)
– Also commonly seen was use of “Phoenix” servers (vs.
“Snowflakes”), which can be re-built at any time then
“burned to the ground” if needed
• … but only automate what can be measured
26. Think you don’t need to keep an eye on automation?
http://windowsitpro.com/windows-7/aggressive-configmgr-based-windows-7-deployment-takes-down-emory-university
“During TechEd 2014, the Emory University IT department prepared and deployed
Windows 7 upgrades to the campuses computers. If you've worked with ConfigMgr
at all, you know that there are checks-and-balances that can be employed to ensure
that only specifically targeted systems will receive an OS upgrade. In Emory
University's case, the check-and-balance method failed and instead of delivering
the upgrade to applicable computers, delivered Windows 7 to ALL computers
including laptops, desktops, and even servers.
I'll stop for a second to let you take that in.
Yes, even servers.
By the time it was realized what exactly had happened, the Windows 7 sequence
had repartitioned, reformatted, and installed Windows 7. Emory IT powered off the
ConfigMgr server, hoping to stop the deployment before it was too late, but – it was
too late. Even the ConfigMgr server had been repartitioned and reformatted…”
– Windows IT Pro, May 19, 2014
27. Finally: Instrument and Measure
• LinkedIn: “Measurement is better than prediction”
• Provide a common framework to make it easy for developers to
choose what to log simply by tagging or registering it
– “Push” from services works better than “pull” or polling
– In many cases, developers need do no more than push key/value pairs
to a logging system
– LinkedIn collects 500K+ metrics per minute at an average of 400
metrics per service
• Instrument user behaviors to improve the user experience
– Esurance: “we mined the data to figure out what people were doing
most often, make those tasks the most prominent and make them
addressable in as few clicks as possible”
• Metrics dashboards also display deployment activity
– So if there’s a problem, you can easily tie the start time of the issue to
the preceding pushes
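Two ideas from this slide, services pushing key/value pairs to a logging system and dashboards that tie an issue back to the preceding pushes, can be sketched together. This is a toy in-memory model under stated assumptions: the `MetricsLog` class, the metric names, the timestamps and the look-back window are hypothetical and do not describe LinkedIn's tooling.

```python
from collections import defaultdict


class MetricsLog:
    """Toy stand-in for a logging system that services push to."""

    def __init__(self):
        self.metrics = defaultdict(list)  # name -> [(timestamp, value)]
        self.deployments = []             # [(timestamp, description)]

    def push(self, timestamp: float, **pairs: float) -> None:
        """A service pushes key/value pairs; nothing polls the service."""
        for name, value in pairs.items():
            self.metrics[name].append((timestamp, value))

    def record_deployment(self, timestamp: float, description: str) -> None:
        self.deployments.append((timestamp, description))

    def pushes_before(self, issue_start: float, window: float):
        """Deployments in the window leading up to an issue, newest first."""
        hits = [d for d in self.deployments
                if issue_start - window <= d[0] <= issue_start]
        return sorted(hits, reverse=True)


log = MetricsLog()
log.record_deployment(100.0, "search: ranking tweak")
log.record_deployment(400.0, "checkout: new tax rules")
log.push(410.0, checkout_errors=2, checkout_latency_ms=180)
log.push(470.0, checkout_errors=57, checkout_latency_ms=950)

# Errors spiked at t=470; which deployments happened in the 120 seconds before?
suspects = log.pushes_before(issue_start=470.0, window=120.0)
print(suspects)
```

Keeping deployment events in the same timeline as the metrics is what makes the lookup trivial: the question "what changed just before this started?" becomes a simple range query.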
28. Monitoring at LinkedIn
• LinkedIn developed and then open
sourced tools for monitoring and
graphing data being pushed to its logs…
inGraph, inFormed
29. So…what are the Cool Kids DevOps takeaways?
Culture
• Cultural change takes time – take reasonable steps
– Team-building, cross-training, improved communication
– Maybe include your Ops team in requirements / feature
reviews and planning (e.g. via IBM RRC, RTC)
Organization
• Don’t turn your organization upside down
– Experiment on a few smaller, low-risk projects
– Maybe create a DevOps “center of excellence”
– Tear down walls between teams
Practices
• “Continuous Integration” is a good starting point
– Push all builds to the last stage before release
– Eat your own dog food (get employees involved to test)
– Try progressive rollout or dark release of features
30. So…what are the Cool Kids DevOps takeaways?
Automation
• Start by automating a few areas that you can easily see
and track the results from
– E.g. Test / build pipeline, possibly using UrbanCode Deploy
Measurement
• First, assess your current process and consider the
changes you want to make – then consider how to
measure them
– Instrument and measure anything you intend to automate
• But above all, be honest
– Assess your own DevOps maturity and aspirations – where are you
and where do you want to be?
31. IBM can help: DevOps Adoption Framework delivers
measurable outcomes
Enable lean adoption of DevOps capabilities
[Figure: DevOps Adoption Framework. Labels recoverable from the diagram:]
• Adoption Model: self-assessments, adoption paths, adoption services
• Solutions: practices, tooling, services
• Maturity progression: Process-based (process-heavy, manual, siloed; inefficient), Product-based (agile, automated, collaborative; leaner), Optimizing (more predictable, more transparent, more continuous; leaner and smarter)
• DevOps lifecycle with continuous feedback: Steer, Develop/Test, Deploy, Operate
• Capabilities: Continuous Business Planning, Collaborative Development, Continuous Testing, Continuous Release and Deployment, Continuous Monitoring, Continuous Customer Feedback & Optimization
• Community: stories, enablement, feedback, knowledge sharing
• Where and how to get lean; expertise and technologies
32. Where to start: DevOps Adoption Roadmap
Assess desired outcome and supporting practices to drive strategy and rollout
Step 1: Business Goal Determination (What am I trying to achieve?)
• Think through business-level drivers for improvement
• Define measurable goals for your organizational investment
• Look across silos and include key Dev and Ops stakeholders
Step 2: Current Practice Assessment (Where am I currently?)
• What do you measure and currently achieve
• What don’t you measure, but should to improve
• What practices are difficult, incubating, well-scaled
• How do your team members agree with these findings
Step 3: Objective & Prioritized Capabilities (What are my priorities?)
• Start where you are today and where your improvement goals
• Consider changes to People, Practices and Technology
• Prioritize change using goals, complexities and dependencies
Step 4: Strategy/Roadmap (What new practices should help me grow?)
• Understand your appetite for cross-functional change
• Target improvements with the biggest bang for the buck
• Roadmap and agree on an actionable plan
• Use measurable milestones that include early wins
33. Connect with me on Twitter at @BillHoltshouser or LinkedIn at
www.linkedin.com/pub/bill-holtshouser/4/815/66a/