
Starting Your DevOps Journey – Practical Tips for Ops

To watch, please see:

https://info.dynatrace.com/apm_wc_getting_started_with_devops_na_registration.html

Starting Your DevOps Journey: Practical Tips for Ops

In this webinar, Andreas Grabner, Chief DevOps Activist at Dynatrace, shares practical tips that all IT groups from Dev to Ops can use to start their DevOps journey quickly. With experience from hundreds of DevOps deployments, Andi provides insights it would take your team months or years to learn firsthand.

- Learn how everyone on your Ops team can use APM to better understand and monitor SLAs, Performance and End User Impact of their applications.

- Foster better collaboration between Ops and architects by extending basic system monitoring to monolith and microservices architectures.

- Shift-left your testing and QA by working with metrics that you and the architects agreed on up front, resulting in early relevant feedback and faster code deployments.

- Hear why changing the cultural mindset from “fear of change” to “Continuous Innovation and Optimization” is critical for success.

Andi is joined by guest speaker Brian Chandler, Systems Engineer at Raymond James, who shares commonly used Ops dashboards that increase collaboration across IT teams and proactively break down silos!


Starting Your DevOps Journey – Practical Tips for Ops

  1. 1. Starting your DevOps Journey Practical Tips for Ops http://dynatrace.com/trial Brian Chandler Systems Engineer @ Raymond James @Channer531 Andreas Grabner Chief DevOps Activist @ Dynatrace @grabnerandi
  2. 2. Promise of DevOps: Faster & Efficient Innovation Smaller Apps, Micro-Services More Deployments App-, Service- & End-User Feedback Loops Happy Users Lower Costs
  3. 3. Proof: DevOps Adopters Are … 200x 2,555x more frequent deployments faster lead times than their peers More Agile 3x 24x lower change failure rate faster Mean Time to Recover More Reliable More Successful 2x 50% More likely to exceed market expectations Higher market cap growth over 3 years Source: Puppet Labs 2015 State Of DevOps Report: https://puppet.com/resources/white-paper/2016-state-of-devops-report
  4. 4. Dynatrace Transformation by the numbers 23x 170 More releases Deployments / Day 31000 60h Unit+Int Tests / hour UI Tests per Build More Quality ~200 340 Code commits / day Stories per sprint More Agile 93% Production bugs found by Dev More Stability 450 99.998% Global EC2 Instances Global Availability Webinar @ https://info.dynatrace.com/17q3_wc_from_agile_to_cloudy_devops_na_registration.html
  5. 5. YET: „DevOps Adoption is only 2%“ Gene Kim, Nov 2016
  6. 6. Interesting Ops Learnings from Adopters New Technology Stack New Architectural Patterns End User Focused New Deployment Models
  7. 7. DevOps Requirements and Engagement Options for Ops Feedback through High Quality App & User Data Ops as a Service: “Self-Service for Application Teams” Bridge the Gap between Enterprise Stack and New Stack Shift-Left: (No)Ops as “Part of Application Delivery” Requirements / Engagement Options
  8. 8. Basic App Monitoring1 App Dependencies2 End User Monitoring3 How to monitor mobile vs desktop vs tablet vs service endpoints? How much network bandwidth is required per app, service and feature? Where to start optimizing bandwidth: CDNs, Caching, Compression? Are our applications up and running? What load patterns do we have per application? What is the resource consumption per application? What are the dependencies between apps, services, DB and infra? How to monitor „non custom app“ tiers? Where are the dependency bottlenecks? Where is the weakest link? Closing the Ops to Dev Feedback Loop: One Step at a Time! “Soft-Launch” Support4 Virtualization Monitoring5 How to automatically monitor virtual and container instances? What to monitor when deploying into public or private clouds? How to deploy and monitor multiple versions of the same app / service? What and how to baseline? Do we have a better or worse version of an app/service/feature? Ops: Need answers to these questions! Closing the gap to AppBizDev Ready for “Cloud Native” How to alert on real problems and not architectural patterns? How to consolidate monitoring between Cloud Native and Enterprise? Who is using our apps? Geo? Device? Which features are used? What's the behavior? Where to start optimizing? App Flow? Page Size? Conversion Rates? Bounce Rates? Where are the performance / resource hotspots? When and where do applications break? Do we have bad dependencies through code or config? How does the system really behave in production? What to learn for future architectures? What are the usage patterns for A/B or Green/Blue? Difference between different versions and features? Does the architecture work in these dynamic environments? Does scale up/down work as expected? Provide „Monitoring as a Service“ for Cloud Native Application Teams6 Today
  9. 9. Questions to Answer! Are our applications up & running? What are the real load patterns? What is the resource consumption? Where to start optimizing?
  10. 10. Are our Apps Up, Running & Accessible? Availability dropped to 0%
  11. 11. Early Warning SLA Monitoring! Quality of Connectivity & DNS Quality of Content Delivery 3rd Party Impact Delivery by Geo
  12. 12. Client Center Daily Traffic Pattern
  13. 13. Client Center sees a peak of about 3,800 Req/min against its API. Client Center Daily Traffic Pattern
  14. 14. Client Center sees a peak of about 3,800 Req/min against its API. 60 unique calls/functions that make up the Client Center API Client Center Daily Traffic Pattern
  15. 15. ~20% of that traffic is ClientCenter/API/Holdings Client Center Daily Traffic Pattern
  16. 16. ~20% of that traffic is ClientCenter/API/Holdings ~20% of that traffic is ClientCenter/API/ClientDetails Client Center Daily Traffic Pattern
  17. 17. ~20% of that traffic is ClientCenter/API/Holdings ~20% of that traffic is ClientCenter/API/ClientDetails ~20% of that traffic is ClientCenter/API/RecentSearch Client Center Daily Traffic Pattern
  18. 18. Typical Peak Hour If you’re not careful, it could look like this… Rhythmic peaks and valleys suggest “lock-step” scripts (all virtual users start and end at the same time.) PRD usage is much more “fluid”: a steady stream and balance across transaction usage. The total sum of traffic load was met; however, the correct ratio of key transactions was not met. Leveraging PRD data to tune QA Load Tests
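The ratio check this slide describes can be sketched in a few lines: compare each transaction's share of total traffic in production against its share in the load test, and flag any drift. The transaction names and request counts below are hypothetical illustrations, not data from the webinar.

```python
# Compare a load test's transaction mix against production ratios.

def mix_ratios(counts):
    """Return each transaction's share of total traffic."""
    total = sum(counts.values())
    return {tx: n / total for tx, n in counts.items()}

# Hypothetical request counts per API function (requests/min at peak).
production = {"Holdings": 760, "ClientDetails": 740, "RecentSearch": 750, "Other": 1550}
load_test  = {"Holdings": 2000, "ClientDetails": 300, "RecentSearch": 300, "Other": 1200}

prod_mix = mix_ratios(production)
test_mix = mix_ratios(load_test)

# Flag transactions whose share deviates by more than 5 percentage points.
for tx in prod_mix:
    drift = abs(prod_mix[tx] - test_mix.get(tx, 0.0))
    if drift > 0.05:
        print(f"{tx}: prod {prod_mix[tx]:.0%} vs test {test_mix[tx]:.0%}")
```

Even when the total load matches, a mix like the one above would flag every key transaction, which is exactly the "correct ratio not met" situation the slide warns about.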
  19. 19. Normal Production Distribution Failed Load Test Distribution Black: Overall application load and peak volume Percentile breakdown of fast, warning, slow txs VS. Performance Differences Before and After Release
  20. 20. Occurrences of slow AccountList Transactions from load testing. Distribution of “yellow” transactions for that time. AccountList makes up most of these transactions. Normal distribution of “expected” slow transactions for this API function. Distribution generated from load test. New code would greatly increase the occurrences of slow transactions in production! What is making up all that yellow?
  21. 21. Detection Load Distribution and Deployment Hotspots Overall Load Distribution by SLA Very Slow, Slow, Med, Fast Tip: Logarithmic Y-Axis Finding #1: Response Time Spikes at certain times not related to load! Finding #2a: Server #1 was put back in rotation HERE Finding #2b: Server #2 saw fewer errors once #1 was up Finding #3: Server #3 only gets load at certain times! Validate Load Balancing Tip: Load per Server!
  22. 22. Detection Load Distribution and Deployment Hotspots Requests by App Server: Tip: Percentage Bar Chart Thread Usage: Tip: Pool Size + Actual Use Same for Web Server Transfer Rate Identify “heavy hitters” Resource Utilization Tip: CPU, Memory, I/O …
  23. 23. Detecting Resource Regression Hotspots Time of Deployment Other Resources: Bytes Transferred, Disk I/O, # of Log Messages, # of Open Connections, # of Calls …
  24. 24. Detecting Error Hotspots under Load
  25. 25. Automatic Hotspot Detection under Load My Favorite: Layer Breakdown Chart With increasing load: Which LAYER doesn’t SCALE?
  26. 26. Automatic Availability Root Cause Detection Web Performance Optimization Automated → List of root cause explanations for SLA violations
  27. 27. Automatic Baselining per Business Transaction Response Time Baselines based on 50th & 90th Percentile Smart Alerting based on Significant Measurement Violation Direct link to Layer Breakdown and Method Hotspot!
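The percentile baselining this slide describes can be sketched simply: derive 50th and 90th percentile response-time baselines from history, and alert only on a significant violation rather than every blip. This is a minimal illustration, not the product's actual algorithm; the sample data and the 1.5x threshold are assumptions.

```python
# Sketch: response-time baselines from the 50th and 90th percentile.
from statistics import quantiles

def baseline(samples):
    """Return (p50, p90) response-time baselines from historical samples."""
    qs = quantiles(samples, n=10)  # deciles: qs[4] ~ p50, qs[8] ~ p90
    return qs[4], qs[8]

def violates(value_ms, p50, p90, factor=1.5):
    """Alert only on a significant violation, not every measurement above p90."""
    return value_ms > p90 * factor

history = [120, 130, 125, 140, 135, 150, 128, 132, 145, 138]  # ms, hypothetical
p50, p90 = baseline(history)
print(violates(400, p50, p90))  # a 400 ms response is far beyond the p90 baseline
```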
  28. 28. Automatic Anomaly and Root Cause Detection Automatic Anomaly Detection Automatic Root Cause Information Automatic Impact Details
  29. 29. Summary: Capabilities to Get Answers Through Synthetic Monitoring: Are our applications up & running? Availability, Response Time, CDN, Geo, … Content Size and Content Validation Through Endpoint Monitoring: What are the real load patterns? Bucket by Response Time (Fast, Medium, Slow, Very Slow ...) Bucket by Status Code (HTTP 2xx, 3xx, 4xx, 5xx, ...) Through System Monitoring: What is the resource consumption? CPU, Memory, Network and I/O Through Basic Application Monitoring: Where to start optimizing? Top Exceptions & Log Messages; # Thread (Idle, Busy) Memory by Heap Space, Garbage Collection Activity Execution Hotspots by Component
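The bucketing the summary mentions (by response time and by HTTP status family) is straightforward to sketch. The response-time thresholds below are illustrative assumptions, not product defaults.

```python
# Sketch: bucket requests by response-time class and HTTP status family.
from collections import Counter

def rt_bucket(ms):
    """Classify a response time into Fast / Medium / Slow / Very Slow."""
    if ms < 200:
        return "fast"
    if ms < 500:
        return "medium"
    if ms < 2000:
        return "slow"
    return "very slow"

def status_bucket(code):
    """Group an HTTP status code into its family: 2xx, 3xx, 4xx, 5xx."""
    return f"{code // 100}xx"

# Hypothetical (response_ms, status_code) pairs.
requests = [(120, 200), (450, 200), (2300, 500), (90, 404)]

rt_counts = Counter(rt_bucket(ms) for ms, _ in requests)
status_counts = Counter(status_bucket(code) for _, code in requests)
print(rt_counts, status_counts)
```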
  30. 30. Which services do we actually host? What is the health state of every component? What are the dependencies? What impacts the interconnected system health? Questions to Answer!
  31. 31. Agent-Based Monitoring & Tracing: Bridging Enterprise and New Stack From Mobile Via Middleware To Mainframe And Services To SQL / NoSQL To SQL / NoSQL To SQL / NoSQL To External Services
  32. 32. Analyzing Inter Tier Impact #1: Load Spike Direct correlation with # of SQL queries -> OK! #2: Same Load Spike Direct correlation with # of Exceptions -> OK! #3: Starting with Load Spike Time spent in JDBC (blue) stays very high -> NOT OK! #4: Problem Solved Issue on Oracle Server caused all SQL to be slow
  33. 33. Health State and Impact of Database! DB-Related Blogs from Sonja: https://www.dynatrace.com/blog/author/sonja-chevre/
  34. 34. Proper Connection Pool Sizing! Do we have enough DB CONNECTIONS per pool?
  35. 35. Detecting Database Impact on Message Processing #1: Cluster Failover Event #2: System Struggled but managed load #3: DB Index Job with MAJOR impact on End Users
  36. 36. @ Dynatrace: Service Tier Monitoring #1: Overall Tier Health #2: Cassandra Health #3: Queue Sizes #4: Error States
  37. 37. What’s lurking under the water of the iceberg?
  38. 38. What is the cause of all performance problems?
  39. 39. Red wave of death appears on dashboard. Conference Bridge/Crisis Center call with lots of “Smart Guy Correlation”. Application recovers. Triaging w/o anomaly detection on app dependencies
  40. 40. App1 Web AppSvc MB EntSvc DB App2 Web AppSvc MB EntSvc DB DB EntSvc MB App3 Web AppSvc App4 Web AppSvc MB EntSvc DB App5 Web AppSvc MB EntSvc DB DCRUM – True enterprise monitoring
  41. 41. App1 Web AppSvc MB EntSvc DB App2 Web AppSvc MB EntSvc DB DB EntSvc MB App3 Web AppSvc App4 Web AppSvc MB EntSvc DB App5 Web AppSvc MB EntSvc DB DCRUM – True enterprise monitoring
  42. 42. DCRUM – True enterprise monitoring
  43. 43. App1 App2 App5 App4 App3 Web Web Web Svc1 Web Web DB1 EntSvc2 DB2 ENTSvc1 MB Svc2 Svc4 Svc3 DCRUM – True enterprise monitoring
  44. 44. DB1 EntSvc2 DB2 ENTSvc1 MB Svc2 Svc4 Svc3 DCRUM – True enterprise monitoring
  45. 45. DB1 EntSvc2 DB2 ENTSvc1 MB Svc2 Svc4 Svc3 DCRUM – True enterprise monitoring Successful application dependency monitoring will allow you to take a “bottom-up” approach to monitoring your enterprise.
  46. 46. “Bottom-up” Service View Client Group 1, Servers A-D Client Group 2, Servers E-H Client Group 3, Servers I-L Client Group 4, Servers M-Q Client Group 5, Servers R-S Different apps and services exercise enterprise services and databases in varying ways! Lack of load from these peers against this service Poor performing node in this client group
  47. 47. Link to the appropriate heat map Alert sent based on deviation from the calculated baseline Baseline alerting granularity down to the operation level, not just the Software Service Delivering this data as actionable alerts
  48. 48. Usage and application behavior vary day-to-day. A rolling average of services is not good enough One week application usage trend Monday Tuesday Wednesday Thursday Friday The need for seasonal baselining
  49. 49. To achieve deeper statistical capabilities, we use a combination of the PureLytics stream and the DCRUM REST interface to pour data into analysis tools. This allows us to reach back several weeks for a single minute of a given day (e.g. Monday at 10:03am compared to the last 5 Mondays at 10:03am) to calculate our baselines, for every unique operation in our enterprise (25k+ recorded). That is a great deal of data! Dynatrace performance metrics streaming
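The seasonal baseline described above can be sketched as a lookup keyed by weekday and minute-of-day, averaged over the previous weeks. The PureLytics / DCRUM data source is replaced here by an in-memory dict, and the sample values and 100 ms tolerance are made-up assumptions for illustration.

```python
# Sketch: seasonal baseline - compare the current value for a given
# weekday+minute against the same minute over the previous five weeks.
from statistics import mean

history = {  # (weekday, "HH:MM") -> response times in ms, last 5 weeks
    ("Mon", "10:03"): [210, 205, 220, 215, 212],
}

def seasonal_baseline(weekday, minute):
    """Average of the same minute-of-week across the stored weeks."""
    return mean(history[(weekday, minute)])

def deviates(value_ms, weekday, minute, tolerance_ms=100):
    """True if the current value exceeds the seasonal baseline by the tolerance."""
    return value_ms - seasonal_baseline(weekday, minute) > tolerance_ms

print(deviates(380, "Mon", "10:03"))  # 380 ms vs a ~212 ms Monday-10:03 baseline
```

A plain rolling average would blend Monday's 10:03 peak with quiet weekend minutes; keying the baseline by minute-of-week is what makes the comparison seasonal.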
  50. 50. By reaching that far back at granular 1-minute intervals, you can be very confident in the validity of your baseline values. A 50ms-150ms deviation may not seem like a huge deal, but in the world of app dependency monitoring, it truly is! Graphical View of deep seasonal baselining
  51. 51. Service 1 needs to call Service 2 multiple times. If service 2 slows down, it has an enormous impact on all upstream services. 150ms shift in service 2 causes Service 1 to shift from 200ms-2s Service 1 Service 2 Upstream impact of dependencies
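The fan-out arithmetic on this slide is worth making explicit: if Service 1 makes N serial calls to Service 2 per request, a per-call slowdown is multiplied N times upstream. The numbers below are chosen to mirror the slide's example (a 150 ms per-call shift turning ~200 ms into ~2 s); the call count and timings are assumptions.

```python
# Sketch: upstream impact of a downstream slowdown with serial fan-out.

def upstream_time(own_ms, calls, downstream_ms):
    """Total response time of a service making serial downstream calls."""
    return own_ms + calls * downstream_ms

# Service 1 spends 80 ms of its own time and calls Service 2 twelve times.
baseline = upstream_time(own_ms=80, calls=12, downstream_ms=10)   # 200 ms
degraded = upstream_time(own_ms=80, calls=12, downstream_ms=160)  # 2000 ms

print(baseline, degraded)  # a 150 ms per-call shift => 200 ms becomes 2 s
```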
  52. 52. Automatic Full Stack Monitoring #1: All your Technologies #2: All Key Metrics #3: Physical, Virtual, Containers or Cloud
  53. 53. Smartscape: Real Time Service-Oriented CMDB #1: Understand WHO talks with WHOM? #2: Where are tiers deployed? #3: WHO might be impacted by a failure?
  54. 54. Automatic Service Flow Tracing #1: Understanding Flow #2: Dependencies between Service #3: Service Clustering
  55. 55. Automatic Architectural Pattern Detection #1: Action initiated by the SPA (Single Page App) #2: SPA was making 3 AJAX Calls in total! #3: One of the calls makes 13! Backend REST Calls to external system on 13 asynchronous threads
  56. 56. Automatic Problem Pattern Detection #1: Select Top Common Problem Patterns #2: Explore which transactions have this and other problems
  57. 57. Automating Anomaly Detection #1: All Root Cause Information „encapsulated“ into a single Problem #2: “Time-Lapse” of Problem Evolution #3: All relevant Events: Infra, Logging, App, Service, End User …
  58. 58. Automatic Integration with ChatOps
  59. 59. Summary: Capabilities to get answers Through Automatic Dependency Detection Which services are hosted by which processes? Where do these processes run? Through Component Monitoring Key metrics from Oracle, SQL, DB2, MySql, Postgres Throughput on your Message Broker / Bus, Firewalls / Proxies Through End-to-End Tracing Which services depend on each other for end-to-end use cases? Where are our bottlenecks? How to optimize deployment and architecture? Through Anomaly Detection Which tiers are acting out of the norm after an update or under certain load? Who is impacted when one tier has an issue? Where to look for the real root cause when a service goes down?
  60. 60. Promise of DevOps: Faster & Efficient Innovation Smaller Apps, Micro-Services More Deployments App-, Service- & End-User Feedback Loops Happy Users Lower Costs
  61. 61. Basic App Monitoring1 App Dependencies2 End User Monitoring3 How to monitor mobile vs desktop vs tablet vs service endpoints? How much network bandwidth is required per app, service and feature? Where to start optimizing bandwidth: CDNs, Caching, Compression? Are our applications up and running? What load patterns do we have per application? What is the resource consumption per application? What are the dependencies between apps, services, DB and infra? How to monitor „non custom app“ tiers? Where are the dependency bottlenecks? Where is the weakest link? DevOps Monitoring Maturity: What we covered today “Soft-Launch” Support4 Virtualization Monitoring5 How to automatically monitor virtual and container instances? What to monitor when deploying into public or private clouds? How to deploy and monitor multiple versions of the same app / service? What and how to baseline? Do we have a better or worse version of an app/service/feature? Ops: Need answers to these questions! Closing the gap to AppBizDev Ready for “Cloud Native” How to alert on real problems and not architectural patterns? How to consolidate monitoring between Cloud Native and Enterprise? Who is using our apps? Geo? Device? Which features are used? What's the behavior? Where to start optimizing? App Flow? Page Size? Conversion Rates? Bounce Rates? Where are the performance / resource hotspots? When and where do applications break? Do we have bad dependencies through code or config? How does the system really behave in production? What to learn for future architectures? What are the usage patterns for A/B or Green/Blue? Difference between different versions and features? Does the architecture work in these dynamic environments? Does scale up/down work as expected? Provide „Monitoring as a Service“ for Cloud Native Application Teams6
  62. 62. We have the experience. • One of the largest health care insurance providers in the nation – to DevOps in two weeks • One of the largest furniture retailers in the United States – to DevOps in two weeks
  63. 63. We have a proven approach: The DevOps Xcelerator • Outline your digital performance management (DPM) strategy • Build on what you already have • Implement DPM to support DevOps • Validate your success DPM Vision & Strategy Discovery & Planning Implementation Validate Success Identify DPM goals that guide your implementation strategy in alignment with business objectives. Ask the right questions. Collect the information. Assemble required resources. Create your implementation plan. Follow the Dynatrace Expert Services (DXS) implementation framework to successfully execute your implementation plan. Track, measure and report progress towards your DPM goals so that your digital performance investments add increasing value to the business.
  64. 64. Q & A Brian Chandler Systems Engineer @ Raymond James @Channer531 Andreas Grabner Chief DevOps Activist @ Dynatrace @grabnerandi Action Items for you! Try Dynatrace SaaS: http://bit.ly/dtsaastrial Try Dynatrace AppMon On Premise: http://bit.ly/dtpersonal Listen to our Podcast: http://bit.ly/pureperf Read more on our blog: http://blog.dynatrace.com
