Deck used for my talk at the 2016 Spring User Conference in Toronto. The deck was followed by a walkthrough of a Jenkins workflow that deployed to Cloud Foundry based on JMeter test results.
13. 700 deployments / YEAR
10+ deployments / DAY
50–60 deployments / DAY
Every 11.6 SECONDS
Unicorns: Deliver value at the speed of business
14. DevOps @ Target
presented at Velocity, DOES and more …
http://apmblog.dynatrace.com/2016/07/07/measure-frequent-successful-software-releases/
“We increased from monthly to
80 deployments per week
… only 10 incidents per month …
… over 96% successful! ….”
15. “We Deliver High Quality Software,
Faster and Automated using New Stack”
“Shift-Left Performance
to Reduce Lead Time”,
Adam Auerbach, Sr. Dir DevOps
https://github.com/capitalone/Hygieia & https://www.spreaker.com/user/pureperformance
“… deploy some of our most critical production workloads on
the AWS platform …”, Rob Alexander, CIO
18. Richard Dominguez
Developer in Operations
Prep Sportswear
“In 2013 business demanded to go
from monthly to daily deployments”
“80% failed!”
19. Understanding Code Complexity
• 4 Million Lines of Monolithic Code
• Partially coded and commented in Russian
From Monolith to Microservice
• Initial devs no longer with company
• What to extract without breaking it?
Shift Left Quality & Performance
• No automated testing in the pipeline
• Bad builds just made it into production
Cross Application Impacts
• Shared Infrastructure between Apps
• No consolidated monitoring strategy
20. Scaling an Online Sports Club Search Service
[Chart: Users and Response Time over time, 20xx–2016+]
1) 2-Man Project
2) Limited Success
3) Start Expansion
4) Performance Slows Growth
5) Potential Decline?
21. Early 2015: Monolith Under Pressure
Can't scale vertically endlessly!
April: 0.52s response time
May: 2.68s response time, 94.09% CPU bound
22. From Monolith to Services in a Hybrid-Cloud
Move Front End
to Cloud
Scale Backend
in Containers!
26. 26.7s Load Time
5kB Payload
33! Service Calls
99kB - 3kB for each call!
171! Total SQL Count
Architecture Violation
Direct access to DB from frontend service
Single search query end-to-end
27. It's not about blindly pushing more bad code on new stacks through an automated pipeline
28. Understanding Code Complexity
• Existing 10-year-old code & 3rd party
• Skills: Not everyone is a perf expert or born architect
From Monolith to Microservice
• Service usage in the End-to-End Scenarios?
• Will it scale? Or is it just a new monolith?
Understand Deployment Complexity
• When moving to Cloud/Virtual: Costs, Latency …
• Old & new patterns, e.g: N+1 Query, Data
Understand Your End Users
• What they like and what they DON'T like!
• It's the priority list & input for other teams, e.g.: testing
29. The fixed end-to-end use case
“Re-architect” vs. “Migrate” to “New Stack”
2.5s (vs 26.7)
5kB Payload
1! (vs 33!) Service Call
5kB (vs 99kB) Payload!
3! (vs 171!) Total SQL Count
32. Build & Deliver Apps like the Unicorns! With a Metrics-Driven Pipeline!
Dev & Test: Check In Better Code
Test / CI: Stop Bad Builds Early
Performance: Production Ready Checks! Validate Monitoring
Ops/Biz: Provide Usage and Resource Feedback for Next Sprints
33. Scenario: Monolithic App with 2 Key Features
Use Case Tests and Monitors, with Service & App Metrics plus Ops Metrics per build:

Build | Use Case      | Status | # API Calls | # SQL | Payload | CPU   | # ServInst | Usage | RT
17    | testNewsAlert | OK     | 1           | 5     | 2kb     | 70ms  | -          | -     | -
17    | testSearch    | OK     | 1           | 35    | 5kb     | 120ms | -          | -     | -
25    | testNewsAlert | OK     | 1           | 4     | 1kb     | 60ms  | 1          | 0.5%  | 7.2s
25    | testSearch    | OK     | 34          | 171   | 104kb   | 550ms | 1          | 63%   | 5.2s
26    | testNewsAlert | OK     | 1           | 4     | 1kb     | 60ms  | 1          | 0.6%  | 3.2s
26    | testSearch    | OK     | 2           | 3     | 10kb    | 150ms | 5          | 75%   | 2.5s
35    | testNewsAlert | -      | -           | -     | -       | -     | -          | -     | -
35    | testSearch    | OK     | 2           | 3     | 10kb    | 150ms | 8          | 80%   | 2.0s

Re-architecture into "Services" + Performance Fixes
Continuous Innovation and Optimization
38. Performance Metrics in Your Pipeline
1. Performance validation:
   1. Response time corridor
   2. # of SQL calls (N+1!)
   3. # of service calls (N+1!)
   4. # of exceptions (frameworks)
2. Architectural validation:
   1. Making sure services are calling the right thing and not skipping layers
39. Continuous Innovation and Optimization (powered by Dynatrace)
Dev: Only Check In Quality Code
Test: Stop Bad Features/Builds Early
Arch: Optimize Scalability, Architecture and Performance
DevOps: Validate Production Readiness and Monitoring
Ops: Monitor ALL of Your Apps
Biz/Ops: Ensure Happy End Users and a Healthy Environment
Biz/App: Innovate through User Analytics and Ops Feedback
41. UEM: Conversion to User Experience
New Deployment + Mkt Push
Increase # of unhappy users!
Decline in Conversion Rate
Overall increase of Users!
Spikes in FRUSTRATED Users!
I like to get my clichés out of the way at the beginning: DEVOPS! UNICORNS! OHhhhMyyyyy!!!!
Several companies changed the way they develop and deploy software over the years. Here are some examples (numbers from 2011–2014):
Cars: from 2 deployments to 700 per year
Flickr: 10+ per day
Etsy: lets every new employee make a code change on their first day of employment and push it through the pipeline into production. THAT'S the right approach to the required culture change.
Amazon: every 11.6s
Remember: these are very small changes, which is also a key goal of continuous delivery. The smaller the change, the easier it is to deploy, the less risk it carries, the easier it is to test, and the easier it is to take out in case it has a problem.
At Dynatrace we also went through a major transformation over the last few years.
Unfortunately not every story is a good story. But the bad stories are often not told, even though we can learn even more from them. Prep Sportswear failed 80% of their deployments after speeding up deployments.
They had a monolithic app that couldn't scale endlessly. Its popularity caused them to think about re-architecting and allowing developers to make faster changes to their code. They were moving towards a service approach: separating the frontend logic from the backend (search service). The idea was to host these services in the public cloud (frontend) and in a dynamic virtual environment (backend) to be able to scale better globally.
On go-live day with the new architecture, everything looked good at 7 AM, when not many people were online yet!
By noon, when the real traffic started to come in, the picture was completely different. User experience across the globe was bad: response time jumped from 2.5s to 26.7s and the bounce rate tripled from 20% to 60%.
The backend service itself was well tested. The problem was that they never looked at what happens under load end-to-end. It turned out that the frontend had direct access to the database to execute the initial query when somebody ran a search. The returned list of search-result IDs was then iterated over in a loop, and for every element a "micro" service call was made to the backend. That resulted in 33! service invocations for this particular use case, where the search result returned 33 items. Lots of wasted traffic and resources, as these key architectural metrics show us.
They fixed the problem by understanding the end-to-end use cases and then defining backend service APIs that provided the data the frontend really needed. This reduced roundtrips, eliminated the architectural regression, and improved performance and scalability.
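The regression and its fix can be sketched as follows. This is an illustrative simulation, not the actual application code: the backend is faked with a call counter so the difference in service invocations is measurable, and all names are made up.

```python
# Illustrative sketch of the N+1 service-call anti-pattern and its fix.
# The "backend" here is simulated with a call counter; all names are made up.

service_calls = {"count": 0}

def fetch_club(club_id):
    # Simulated per-item "micro" service call from frontend to backend.
    service_calls["count"] += 1
    return {"id": club_id}

def fetch_clubs_batch(club_ids):
    # Simulated batched backend API: one roundtrip returns all items.
    service_calls["count"] += 1
    return [{"id": i} for i in club_ids]

def search_n_plus_one(result_ids):
    # Anti-pattern: loop over search-result IDs, one service call per ID.
    return [fetch_club(i) for i in result_ids]

def search_batched(result_ids):
    # Fix: a backend API that serves the whole use case in one call.
    return fetch_clubs_batch(result_ids)

ids = list(range(33))          # the search returned 33 items
search_n_plus_one(ids)
print(service_calls["count"])  # 33 service invocations

service_calls["count"] = 0
search_batched(ids)
print(service_calls["count"])  # 1 service invocation
```

The point is not the HTTP plumbing but the call count: the fix moves the loop behind the service boundary, so the frontend makes one call per use case instead of one per result item.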
Lessons Learned!
The goal is to measure everything!
If we do all that, we can build a beautiful pipeline where quality metrics are enforced along the way!
This story is also covered here: https://www.infoq.com/articles/Diagnose-Microservice-Performance-Anti-Patterns
If we monitor these key metrics in dev and in ops, we can make much better decisions about which builds to deploy.
We immediately detect bad changes and fix them, and we stop builds from making it into production when these metrics tell us that something is wrong.
We can also take features out that nobody uses if we have usage insights for our services. In this case we monitor the % of visitors using a certain feature. If a feature is never used, even after we spent time improving its performance, it is time to take it out. This removes code that nobody needs and therefore reduces technical debt: less code to maintain, fewer tests to maintain, fewer bugs in the system!
How? Leverage your existing functional, unit, or integration tests. Instrument the code you are testing and extract key metrics that you can track from build to build. Then baseline these metrics.
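A minimal sketch of that idea, with a simulated instrumentation hook; in a real setup these counts would come from an APM agent or the test harness, and all names here are illustrative.

```python
# Sketch: extract key metrics while running an existing test, so they can
# be tracked from build to build. The instrumentation is simulated.

import json

class MetricsRecorder:
    """Counts SQL and service calls made while a test runs."""
    def __init__(self):
        self.metrics = {"sql_calls": 0, "service_calls": 0}

    def on_sql(self):
        self.metrics["sql_calls"] += 1

    def on_service_call(self):
        self.metrics["service_calls"] += 1

def run_test_search(recorder):
    # Stand-in for an existing functional test of the search use case:
    # the real test exercises the code, the instrumentation counts calls.
    recorder.on_service_call()
    for _ in range(3):
        recorder.on_sql()

recorder = MetricsRecorder()
run_test_search(recorder)

# Persist per-build metrics so they can be baselined and compared later.
build_record = {"build": 26, "testSearch": recorder.metrics}
print(json.dumps(build_record))
```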
Check out blogs on Problem Pattern Detection and Key Performance Metrics
http://apmblog.dynatrace.com/2016/06/23/automatic-problem-detection-with-dynatrace/
http://apmblog.dynatrace.com/2016/02/23/top-tomcat-performance-problems-database-micro-services-and-frameworks/
https://www.infoq.com/articles/Diagnosing-Common-Java-Database-Performance-Hotspots
If one of these metrics spikes, you have detected a regression that should fail the build.
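Such a gate could look like the sketch below; the 20% tolerance is an illustrative assumption, and the sample numbers reuse the testSearch regression from slide 33.

```python
# Sketch of a build gate: compare a build's key metrics against the
# previous baseline and report spikes. The tolerance is illustrative.

def check_regression(baseline, current, tolerance=0.2):
    """Return the list of metrics that regressed beyond the tolerance."""
    failures = []
    for metric, base in baseline.items():
        cur = current.get(metric, 0)
        if cur > base * (1 + tolerance):
            failures.append(f"{metric}: {base} -> {cur}")
    return failures

# testSearch metrics: good build vs the regressed build
baseline = {"service_calls": 1, "sql_calls": 35, "payload_kb": 5}
current = {"service_calls": 34, "sql_calls": 171, "payload_kb": 104}

failures = check_regression(baseline, current)
if failures:
    print("FAIL build:", failures)  # this build would be stopped in CI
```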
And this all is ...
Understand user behavior depending on who they are and what they are doing.
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Does the behavior change if they have a less optimal user experience?
Seems like users that have a frustrating experience are more likely to click on Support