Presentation given at CMG Boston - April 20th 2017
#1: How to explain DevOps Transformation?
#2: How Dynatrace transformed from 6-month waterfall releases to 1-hour code deploys
#3: The role of Monitoring in DevOps / CI/CD
#4: Using Dynatrace for your DevOps Transformation
DevOps Transformation at Dynatrace and with Dynatrace
1. DevOps Transformation at Dynatrace and with Dynatrace
CMG Boston, April 20th 2017
Andreas Grabner: @grabnerandi, andreas.grabner@dynatrace.com
Podcast: https://www.spreaker.com/user/pureperformance
Dynatrace Trial: http://bit.ly/dtsaastrial
2. confidential
How I explain DevOps Transformation!
or: From Waterfall to Continuous Innovation through DevOps Automation and Culture
3. 24 “Features in a Box”: Ship the whole box!
Photo-Bombed!
Very late feedback
Frustration!
Quality Control!
Back to Customer
6. 2011: APM about to be disrupted!
• Migrate from On-Prem to VM, Cloud, Containers and PaaS
• Architectures include micro-services, on-demand scaling, self-healing
• “Cloud Natives” demand SaaS-based solutions
• Digital Transformers demand Analytics for Biz, Dev, Ops & Sec
• Many new players on the market
7. Challenges to master!
• Bridging the gap between “New Stack” and “Enterprise Stack”
• Deploying the same way our customers do: Continuously!
• Not disrupting current operations and slower-moving customers
• Aligning 300+ engineers across 3 different geos
Solution: Innovation through Incubation!
11. Developer will never do that!
Operator’s job
12. Shift-Left Quality
• Quality/Performance matters in Dev/Staging as well!
• Make Dev/CSA/PM dependent on quality in trunk!
• DevOps = start thinking like Ops before commit
Shift-Right Metrics
• Enable Devs to define quality metrics
• Make Devs the primary consumers of their metrics
13. How we increased Sprint Quality
Sprint Reviews Done on “dynaSprint”
• Daily builds get deployed on “dynaDay”, sprint builds on “dynaSprint”
• If you can only show it “on your dev machine”, it's NOT DONE!
Deploy Sprint Builds into our internal Production Environment
• We monitor Website, Support, Licensing, Community ... with Dynatrace
• If we break our own back office software, we ALL feel the pain right away
14. How we Prioritized Features
Which Features to Optimize? Which Features to “Phase Out”?
Allows Reducing Technical and Business Debt
15. Monitoring as Pipeline & Platform Feature
[Diagram labels. Roles: Dev, Perf/Test, Ops, Biz. Directions: “Faster Innovation with Quality Gates”, “Faster Acting on Feedback”. Stages: Unit Perf, Cont. Perf, New Deploy, New Capability, CI, CD. Feedback actions: Remove/Promote, Triage/Optimize, Update Tests, Innovate/Design. Outcomes: $$$ Lower Costs, Happy Users.]
16. Role of Dynatrace DevOps Team
Acting as Engineers & Product Managers
[Diagram labels: Dynatrace Managed/SaaS, Orchestration Layer, Dynatrace, Pipeline Visualization, Deployment Timeline, Log Overview using Dynatrace Log API, JIRA Integrations]
18. Learnings when scaling DevOps Pipelines
Service Team A, Service Team B, Service Team X: Improve “Efficiency”
Cloud Ops: Ensure “Operational Service”
PM/Biz: Improve “Business”
20. Dynatrace Transformation by the numbers
• 26 Releases / Year
• 170 Deployments / Day
• More Quality: 31000 Unit & Int Tests / hour, 60h UI Tests per Build
• More Agile: ~200 Code commits / day, 340 Stories per sprint
• 93% Production bugs found by Dev
• More Stability: 450 Global EC2 Instances, 99.998% Global Availability
22. Dev: Shift-Left - Architectural Regression Decisions
= Functional Result (passed/failed)
+ Web Performance Metrics (# of Images, # of JavaScript, Page Load Time, ...)
+ App Performance Metrics (# of SQL, # of Logs, # of API Calls, # of Exceptions ...)
Fail the build early!
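The idea on this slide can be sketched as a small quality gate. This is a minimal illustration, not the Dynatrace integration: the `BuildResult` schema, the metric names and the 10% tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BuildResult:
    """Test outcome plus metrics captured for one build (hypothetical schema)."""
    passed: bool
    metrics: dict[str, float]  # e.g. {"sql_calls": 12, "page_load_ms": 850}


def gate_violations(baseline: BuildResult, candidate: BuildResult,
                    tolerance: float = 0.10) -> list[str]:
    """Return the reasons to fail the build; an empty list means promote."""
    violations = []
    if not candidate.passed:
        violations.append("functional test failed")
    for name, old in baseline.metrics.items():
        new = candidate.metrics.get(name)
        if new is None:
            continue  # metric not captured for this build
        if new > old * (1 + tolerance):
            violations.append(f"{name} regressed: {old:g} -> {new:g}")
    return violations


# Hypothetical numbers: the test still passes, but SQL calls quadrupled.
previous = BuildResult(True, {"sql_calls": 12, "js_files": 8, "page_load_ms": 850})
current = BuildResult(True, {"sql_calls": 48, "js_files": 8, "page_load_ms": 870})

problems = gate_violations(previous, current)
for p in problems:
    print("FAIL:", p)
```

A functionally green build is still rejected here because an architectural metric regressed, which is the point of combining the functional result with web and app metrics.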
24. Perf/Test Use Case: Scalability Decisions
• Warm Up Phase: Low load for a couple of minutes
• 1x Regular Load: Validating scaling behavior. Understanding resource requirements
• Peak Load (2x Regular Load Simulation): Twice the load requires more than twice the resources. Services start failing
40. Scaling DevOps in a Cloud Native World with Dynatrace
Service Team A, Service Team B, Service Team X: Improve “Performance Signature”
• Continuous Performance, Shift-Left, Failure, Usage Feedback
Cloud Ops: Ensure “Operational Service”
• Monitoring as a Service, Capacity Planning, Risk/Cost Control
PM/Biz: Improve “Business Signature”
• Usage, Behavior, Costs, Innovate, A/B Testing, …
43. #1: Going from 6 Months to 1 Month On Premise Updates
• Challenge: Monolith download too big for our customers
• Impact: Update process was error-prone and “All or Nothing”
• Solution: Componentize, Automate Rollout/Rollback Capability, A/B Rollout Model
Increased velocity uncovered bottlenecks!
44. #2: Education on Frequent Updates
• Challenge: Release education used to happen 60-90 days after the release
• Impact: Upgrade to latest version happened very late
• Solution: Education integrated into Continuous Delivery: Dev Blogs, YouTube Videos...
Increased velocity uncovered bottlenecks!
45. #3: Availability of Development / Test Environments
• Challenge: Supporting many different tech stacks makes them hard to maintain
• Impact: Long-running support tickets and long feature development
• Solution: Infrastructure as Code gives “On Demand” access to these environments
Increased velocity uncovered bottlenecks!
Editor's notes
Most screenshots are taken from Dynatrace – get your own SaaS trial through http://bit.ly/dtsaastrial
More resources on our DevOps Transformation:
DevOps Webinar with Bernd Greifeneder (CTO): https://info.dynatrace.com/apm_dtm_ops_17q3_wc_from_enterprise_tocloud_native_na_registration.html
DevOps Webinar with Anita Engleder (DevOps Manager): https://info.dynatrace.com/17q3_wc_from_agile_to_cloudy_devops_na_registration.html
My analogy for Waterfall:
Putting many features into a single release
Ship it to some other entity who does quality control
Final product comes back very late -> hard to remember which features / photos we created. Often we realize it's not what we wanted
This is the new way of delivering software: Continuously – with small batch updates
I use the analogy of how my girlfriend takes pictures:
One at a time
Quality Control and Optimization is in her own hands thanks to software that is “part of the delivery chain” (photo app)
She also controls what to push into production -> post it on Instagram / Facebook
She wants to make her users (friends & family) happy – she is hoping for LIKES!
If she gets dislikes she can remove an image
If she gets comments she can take another picture and deploy it within seconds -> that is Continuous User Driven Innovation
Our own transformation, plus what we hear from customers and the market, tells us:
EVERYONE WANTS to CHANGE – but the biggest challenge is Org / Culture not Technology
Some aspects on how we tackled DevOps Transformation
We understood that embedding Monitoring into the whole pipeline is the only way to achieve faster innovation as well as reacting faster to feedback.
But monitoring is not only focused on Operations to “Keep the Lights On”. There are many Feedback Loops within each phase that allow Dev, Test, Ops and Biz to make their own independent decisions based on monitoring data
Our DevOps Team (initially 7 people, now only 3) is responsible for “The Delivery Pipeline and the DevOps Tool Chain”
Their Customers: The different Dev Teams that want to push features through the pipeline into production
Key Lessons Learned: Raise the awareness of quality and the impact of each individual developer on the bottom line -> which is quality in production
“Eat our own dogfood” aka “Drink our own Champagne” -> we install sprint builds into our internal systems
Visualize Build and Pipeline Quality via UFOs -> https://www.dynatrace.com/solutions/devops/ufo/get/
Make Devs Look into production as well
We also learned a lot when scaling from one dev pipeline to many dev pipelines. That happened when we onboarded more teams to the new development model. We saw that Ops was often the first point where different deployments from different teams came together. Understanding all the dependencies was therefore critical, because it helps you understand the risk of deploying a new version of a component!
Providing good monitoring for the Cloud Ops Teams was essential to ensure “Operational Services”
Monitoring as a Service
Capacity Planning
Risk/Cost Control
For the Service / App Teams it was essential to think about how to Improve “Efficiency” of their deliverables. We also talked about “Improving their Performance Signature”
Continuous Performance
Shift-Left
Failure
Usage Feedback
Product Management and Business, on the other hand, need data and the capability to improve the business
Usage
Behavior
Costs
Innovate
A/B Testing
We learned that we need to have self-service in our pipeline: intuitive Dashboards, Chat Ops and Voice Ops that allow developers to proactively react to feedback from the pipeline
More success numbers of our Dynatrace transformation
Dynatrace provides the data to make better decisions in every phase of the pipeline. Let's have a closer look at how Dynatrace helps each stakeholder
Even if a deployment seems good because all features work and response time is the same as before, it is NOT GOOD if your resource consumption goes up like this: you are now paying a lot of money for that extra compute power
Dynatrace can look at key resource, performance, scalability and architectural metrics and trend them from build to build. If Dynatrace detects a regression it can notify the build pipeline (Jenkins, Bamboo, TFS, …) that the current code change should not be promoted to the next phase
Screenshot from Dynatrace AppMon
When you run different types of load tests with different load levels to figure out how the application scales, Dynatrace immediately shows you whether your application scales, how many resources you really need to sustain a certain load, and which components/layers/tiers/services are your scalability bottleneck
When running scalability tests you want to find out how your system scales, what its resource consumption is and when it potentially breaks. Here is how Dynatrace shows you what is happening once you crank up the load
#1: Warm Up Phase: getting an overview how the system behaves under low load condition
#2: Heating up to 1x Regular Load: system scales up! Performance is still good!
#3: Testing with 2x Load: System scales up but not linearly -> need more than twice the resources for twice the load! First service instances start failing!
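The "more than twice the resources for twice the load" observation can be expressed as a simple ratio check. The sketch below uses made-up measurements (requests per second and CPU cores are assumptions, not figures from the screenshots) and only illustrates the arithmetic.

```python
def scaling_factor(load_ratio: float, resource_ratio: float) -> float:
    """Resources needed per unit of extra load; 1.0 means linear scaling."""
    return resource_ratio / load_ratio


# Hypothetical measurements: (requests per second, CPU cores in use).
regular = (500, 8)
peak = (1000, 22)

load_ratio = peak[0] / regular[0]        # 2.0x the load
resource_ratio = peak[1] / regular[1]    # 2.75x the resources
factor = scaling_factor(load_ratio, resource_ratio)

if factor > 1.0:
    print(f"Non-linear: {load_ratio:.1f}x load needs {resource_ratio:.2f}x resources")
```

A factor above 1.0 means each additional unit of load costs more than the previous one, which is the signal to look for the bottleneck before going to production.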
Application and Service Teams most often focus only on their own isolated service. When the service gets deployed into production, or into a production-like staging or test environment, it is the first time they see how the chosen architecture really plays out and where the end-to-end performance and scalability hotspots are. It's also a great way to learn about the real dependencies on the real implementations of other services, as most of the time services are tested in complete isolation in lower-level environments.
In this example it is easy to see that the Credit Card Verification Service is the clear performance hotspot when the Booking Service gets invoked. Tweaking end-to-end performance should therefore start there if possible.
Another lesson learned is the dependency from the Backend Service to the Configuration Service. It seems that each call the Booking Service makes to the DotNetBackend Service causes an average of 1.9 calls to the Configuration Service. While this is not a performance problem at the moment, it is important to know for scalability as well as for production deployments. Knowing how loosely or tightly certain services are coupled, how much data is sent back and forth and what the call ratio is allows capacity planning teams to do a better job when deploying into production!
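As a rough illustration of why the call ratio matters for capacity planning: the 1.9 ratio comes from the example above, while the traffic volumes and the function name below are hypothetical.

```python
def downstream_load(upstream_calls_per_min: float, call_ratio: float) -> float:
    """Expected calls per minute on a dependency, given its calls-per-call ratio."""
    return upstream_calls_per_min * call_ratio


# 1.9 is the observed ratio from the notes; the traffic numbers are assumptions.
CONFIG_CALLS_PER_BACKEND_CALL = 1.9

for backend_calls in (1_000, 2_000, 10_000):
    config_calls = downstream_load(backend_calls, CONFIG_CALLS_PER_BACKEND_CALL)
    print(f"{backend_calls} backend calls/min -> {config_calls:.0f} config calls/min")
```

The Configuration Service has to be sized for almost double the backend traffic, even though it never shows up as a hotspot in a single transaction.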
Continuous Performance Testing or Continuous Performance Validation is a good Pipeline Phase to have before deploying into a Production Environment. It is an environment running under continuous load. New builds of individual services or complete applications get deployed on a regular basis. The question is whether a new version of a service, application or component shows any degradation in performance, scalability or resource consumption. If so, it should not be promoted to the next phase before closer examination
Dynatrace automatically understands applications but more importantly services. Dynatrace also integrates with testing tools so that traffic on certain services can be associated to certain test scenarios you run in your continuous performance environment. Based on this information it is possible to see any regressions between builds or different loads. In the example above it is easy to spot that the build from Nov 17 shows a significant performance regression. Instead of allowing this build into production it is better to look into the differences between Build Nov 16 and Build Nov 17
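A build-to-build comparison of this kind can be sketched as follows. The response-time samples are invented, the "Nov 15" build is added only to give the comparison a baseline, and the 20% tolerance is an assumption; this is not how Dynatrace computes regressions.

```python
from statistics import median


def regressed_builds(response_ms: dict[str, list[float]],
                     tolerance: float = 0.20) -> list[str]:
    """Flag builds whose median response time exceeds the previous build's
    median by more than `tolerance` (builds are given in deploy order)."""
    flagged = []
    previous = None
    for build, samples in response_ms.items():
        current = median(samples)
        if previous is not None and current > previous * (1 + tolerance):
            flagged.append(build)
        previous = current
    return flagged


# Hypothetical response times (ms) for one test scenario per build.
samples = {
    "Nov 15": [210, 205, 220, 215],
    "Nov 16": [212, 208, 218, 214],
    "Nov 17": [390, 410, 405, 380],
}
flagged = regressed_builds(samples)
print(flagged)
```

This version compares each build with its immediate predecessor; a real gate would more likely compare against the last build that was actually promoted, so that a bad build does not become the new baseline.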
Dynatrace not only has the high level performance metrics to understand the “Performance Signature” of an application or a service of a certain build or under a certain load pattern. It also has the method level information for developers to see how code execution actually differs between two builds or two configurations. This makes it easy to pinpoint the exact issue and then fix or revert changes to get back to an acceptable performance level
After a deployment it is important to watch out for changed resource consumption behavior. In this case we had a deployment at 12:50. Immediately afterwards we see a jump in CPU consumption. Dynatrace automatically detects that as a problem.
Furthermore it tells us which services or processes consume these resources, allowing you to make better decisions on what to do next: add more resources because this is an intentional change, or roll back because this is a problem!
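The before/after comparison around a deployment can be illustrated with a very simple check. The CPU samples and the 1.5x threshold below are assumptions; only the 12:50 deployment time is taken from the example, and real anomaly detection is considerably more sophisticated than a mean comparison.

```python
from statistics import mean


def cpu_jump(samples: list[tuple[str, float]], deployed_at: str,
             threshold: float = 1.5) -> bool:
    """True if mean CPU after the deployment exceeds `threshold` times the
    mean before it. Timestamps are same-day "HH:MM" strings, so plain string
    comparison orders them correctly."""
    before = [cpu for ts, cpu in samples if ts < deployed_at]
    after = [cpu for ts, cpu in samples if ts >= deployed_at]
    if not before or not after:
        return False  # not enough data on one side of the deployment
    return mean(after) > threshold * mean(before)


# Hypothetical CPU utilization (%) around the 12:50 deployment.
cpu = [("12:30", 22.0), ("12:40", 24.0), ("12:50", 61.0), ("13:00", 65.0)]
print(cpu_jump(cpu, deployed_at="12:50"))
```

Whether a detected jump means "add resources" or "roll back" still depends on whether the change was intentional, which is the decision described above.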
After a deployment we see an issue with network connectivity and CPU utilization – impacting our end users
Dynatrace not only detects that issue but shows us the complete problem evolution path which allows us to then see which change actually caused that issue to happen and how to remediate it!
The next slides show a scenario that happened in our organization. This dashboard is used by our marketing and business teams to see how well frequented our website is (total numbers in top chart), how user experience plays out (top chart with green/yellow/red) and how many people sign up for our free trial offering (conversion rate)
On May 1st a new release was pushed and a marketing campaign started that promoted these features and tried to get people to sign up
Seems everything was working as expected
Day 2 started well, but we also saw that slower website performance (due to the heavy load) was impacting our end user experience and also the conversion rate
The Dev Team provided a hotfix to make the sign-up faster
#1: It got deployed around noon
#2: The fix had a negative impact as it broke the whole website due to a JavaScript problem on certain browsers
#3: The problem was immediately visible to both business (drop in conversion) and dev (they looked at the reported JavaScript problems and user experience)
Due to the fast feedback from Production the Dev Team immediately fixed that regression – bringing the system back to where they wanted it to be in the first place
Instead of you looking at these dashboards and figuring out what is going on, our Dynatrace Artificial Intelligence can do all of this work for you.
Dynatrace automatically detects a negative Impact on your end users – also telling you whether it is a global problem, specific geo region or a specific user type (by browser, os, …). It also tells you the business impact (e.g: conversion rate goes down) and the root cause (JavaScript Error)
Last but not least: as Dynatrace sees every single user and every single click, we can do some user behavior analytics.
Does the behavior change if they have a less optimal user experience?
Seems like users that have a frustrating experience are more likely to click on Support
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
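The kind of question asked here (does behavior change with user experience?) boils down to comparing click rates between groups of sessions. The sketch below uses invented session data and category names; it is not based on the PureLytics heatmap code.

```python
from collections import defaultdict


def click_rate_by_experience(sessions: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of sessions per experience category that clicked on Support."""
    totals = defaultdict(int)
    clicks = defaultdict(int)
    for experience, clicked_support in sessions:
        totals[experience] += 1
        clicks[experience] += clicked_support  # True counts as 1
    return {exp: clicks[exp] / totals[exp] for exp in totals}


# Hypothetical sessions: (experience category, clicked on Support?).
sessions = [
    ("satisfied", False), ("satisfied", False), ("satisfied", True),
    ("satisfied", False), ("frustrated", True), ("frustrated", True),
    ("frustrated", False), ("frustrated", True),
]
rates = click_rate_by_experience(sessions)
print(rates)
```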
When scaling DevOps / CI/CD in your Enterprise it is important that you monitor and understand the dependencies between all the different services and applications that are deployed and updated at a much higher frequency than before. You need to react to changes that impact your end users or your infrastructure faster than ever in order to minimize the impact on your business.
Dynatrace not only monitors your Cloud Native and Enterprise Stack Infrastructure as well as Services, Applications and End Users; its AI and automation capabilities also allow you to become more efficient, reduce risk and improve your overall performance and end user satisfaction.