Engineering Netflix Global Operations in the Cloud

Delivered at re:Invent 2015.

Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever-increasing complexity. Simultaneously improving service availability and accelerating the rate of change seems impossible on the surface. At Netflix, operations engineering is both a technical and an organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault injection, regional traffic management, crisis response, best-practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth, how they integrate, and how they create competitive advantages.



  1. 1. Engineering Netflix Global Operations in the Cloud. Josh Evans, Director of Operations Engineering. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. Internet
  3. 3. Our Journey • Two Operational Challenges • Operational Excellence • Operations Engineering
  4. 4. Our Journey • Two Operational Challenges • Operational Excellence • Operations Engineering
  5. 5. Product Innovation winning moments of truth
  6. 6. Continuous Innovation ● Every facet of the product ● 1,400 A/B tests in the last year, and accelerating
  7. 7. Challenge #1: Accelerate Innovation and Rate of Change
  8. 8. Scale & Complexity
  9. 9. 100,000s of requests per second 1000s of Global Starts per Second
  10. 10. Approaching Global Reach • October: Spain, Portugal, Italy • Early 2016: Korea, Taiwan, Singapore, Hong Kong • 65m members → 100m • ~60 countries → 200
  11. 11. Multi-Zone, Multi-Region: US-West, US-East, EU-West
  12. 12. The Bigger Picture [Diagram: Internet, Netflix CDN (Open Connect), Cloud Control Plane, Service Partners]
  13. 13. Challenge #2: Sustain & Improve Quality in the face of ever-growing scale & complexity
  14. 14. Our Journey • Two Operational Challenges • Operational Excellence • Operations Engineering
  15. 15. Operational Excellence: Quality, Velocity
  16. 16. Quality vs. Velocity [Chart: Availability (nines, 0 to 6) vs. Rate of Change (1 to 1000)] Downtime per year: 99.9999% = 31.5 seconds; 99.999% = 5.26 minutes; 99.99% = 52.56 minutes; 99.9% = 8.76 hours; 99% = 3.65 days; 90% = 36.5 days
  17. 17. The Zero Sum Game [Chart: Availability (nines) vs. Rate of Change, with the same downtime scale]
  18. 18. The Zero Sum Game [Chart: Availability (nines) vs. Rate of Change, with the same downtime scale]
  19. 19. Shifting the Curve [Chart: Availability (nines) vs. Rate of Change]
  20. 20. Operational Excellence is the continuous improvement of the management, design, and function of operational environments to achieve greater quality, velocity, and competitive advantage.
  21. 21. Our Journey • Two Operational Challenges • Operational Excellence • Operations Engineering
  22. 22. You build it, you run it… globally • Build It: design, code, build, bake, test, deploy • Run It: operate, configure, monitor, respond
  23. 23. Undifferentiated Heavy Lifting
  24. 24. Operations Engineering is the application of software engineering practices and principles to achieve and sustain operational excellence. • automation • modular components • tools & services • best practices
  25. 25. Our Journey – Operations Engineering • Engineering Tools • Insight & Real-time Analytics • Performance & Reliability • Leverage
  26. 26. Our Journey • Engineering Tools • Insight & Real-time Analytics • Performance & Reliability • Leverage
  27. 27. Our Artisanal Past • Data Center ● Delayed provisioning ● Hand-crafted servers ● Variations and complexity • Delivery ● Late-night, manual deployments ● Repeated mistakes ● Painful delays to production fixes
  28. 28. Engineering Tools • productivity • velocity • quality
  29. 29. • cloud management • delivery engine • automation platform
  30. 30. Global Cloud Management
  31. 31. Delivery Pipelines
  32. 32. Automated Global Delivery
  33. 33. The Paved Road • Stash • Gradle • Ubuntu • Jenkins • Spinnaker
  34. 34. Our Journey • Engineering Tools • Insight & Real-time Analytics • Performance & Reliability • Leverage
  35. 35. Insight & Real-Time Analytics
  36. 36. OODA loop
  37. 37. An outage may not be life or death but…
  38. 38. Anomaly Detection • DES (double exponential smoothing) on time series data • Predict the future based on history • Favor recent history • Threshold-based alerts • 6-8 minute delay before an alert
  39. 39. Finer Granularity, Shorter Time Windows
  40. 40. Ensemble Learning
  41. 41. Voting: Median Absolute Deviation, IQR, Least Squares, HDI
  42. 42. observe, orient, decide, act Alert! From 6-8 minutes to < 1 minute
  43. 43. observe, orient…
  44. 44. …decide, act
  45. 45. How do we take humans out of the equation?
  46. 46. Outlier Detection & Remediation
  47. 47. Kepler • Unsupervised machine learning • Density-based clustering algorithm • Actions: email, page, OOS, detach, terminate
  48. 48. An ounce of prevention…
  49. 49. Canary Release Process [Diagram: Customers → Load Balancer → Old Version (v1.0): 100 servers, 95% of traffic; New Version (v1.1): 5 servers, 5% of traffic; Metrics]
  50. 50. Canary Release Process [Diagram: Customers → Load Balancer → Old Version (v1.0): 0 servers; New Version (v1.1): 100 servers, 100% of traffic; Metrics]
  51. 51. Automatic Canary Analysis • Define: metrics, a threshold • Every n minutes: ● Classify metrics ● Compute score ● Make a decision
  52. 52. Thinking Globally • Systematic observation of facets & permutations • Unsupervised monitoring & decision-making • Automated tuning & recovery • Alerts with analysis
  53. 53. Our Journey • Engineering Tools • Insight & Real-time Analytics • Performance & Reliability • Leverage
  54. 54. Performance & Reliability
  55. 55. Internet Zuul API NCC P Playback History Playback Sessions MAP
  56. 56. Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.
  57. 57. Imagine a monkey loose in your data center… [Diagram: Edge Cluster, Cluster A, Cluster B, Cluster C, Cluster D]
  58. 58. A State of Xen – Chaos Monkey & Cassandra • Xen hypervisor vulnerability – 9/25/14 • 218 out of 2700+ Cassandra nodes rebooted • 22 did not reboot successfully • Automation handled the rest
  59. 59. Fault-Injection Testing (FIT) • Simulate service failures • Override by device or account • % of member traffic [Diagram: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C, FIT]
  60. 60. Fault-Injection Testing (FIT) • Simulate service failures • Override by device or account • % of member traffic [Diagram: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C, FIT]
  61. 61. Global Traffic Management [Diagram: US-West, US-East, EU-West, AZ1]
  62. 62. The Internet DNS-based Routing Zuul Proxy Back Channel ###, ###, ###
  63. 63. Production Ready? • Alerting and Monitoring • Apache & Tomcat Hardening • Automated Canary Analysis • Autoscaling • Chaos Participation • Consistent Naming • ELB Configuration • Healthcheck Configured • Red-Black Pipeline • Squeeze Testing • Timeout & Fallback Tuning • Workload Reliability
  64. 64. Our Journey • Engineering Tools • Insight & Real-time Analytics • Performance & Reliability • Leverage
  65. 65. Operational Tools as a Product ● A federation of tools ● Common UI elements ● Deep linking
  66. 66. Canary Analysis Conformity Integration Tests Citrus Chaos Static Unit Tests Deep Integration Modular Components Functional Testing
  67. 67. Production Ready – Automation & Integration. RTA auto-tuning: • Alerts • Apache/Tomcat • Auto-scaling • Hystrix fallbacks. RTA decision support: • ACA • Citrus • Flow. Conformity checks: • Consistent names • ELBs • Health check • Red/black deployment. Delivery integration: • ACA • Citrus • FIT
  68. 68. Internet Our Journey Ends
  69. 69. https://netflix.github.io/
  70. 70. Talk; Speaker; When?; Where?
      • Engineering Netflix Global Operations in the Cloud; Josh Evans; Wed @ 11am; Palazzo N
      • Efficient Innovation: High-Velocity Cost Management at Netflix; Andrew Park; Wed @ 2:45pm; Palazzo C
      • Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second; Peter Bakas; Wed @ 2:45pm; San Polo 3501B
      • A Day in the Life of a Netflix Engineer Using 37% of the Internet; Dave Hahn; Wed @ 4:15pm; Venetian H
      • Availability: The New Kind of Innovator’s Dilemma; Coburn Watson; Wed @ 4:15pm; Marcello 4501B
      • Real-Time Analytics In Service of Self-Healing Ecosystems; Roy Rapoport, Chris Sanden; Wed @ 4:15pm; Lido 3001B
      • Running Spark and Presto on the Netflix Big Data Platform; Daniel Weeks; Thu @ 11am; Palazzo F
      • Splitting the Check on Compliance and Security: Keeping Developers and Auditors Happy in the Cloud; Jason Chan; Thu @ 11am; Marcello 4501B
  71. 71. Thank you! Josh Evans jevans@netflix.com @josh_evans_nflx
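
A note on the availability chart (slides 16 to 18): the downtime labels follow directly from the number of nines. The sketch below is one way to reproduce that arithmetic, assuming a 365-day year; the function name `downtime_seconds_per_year` is illustrative and does not come from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap, 365-day year


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year, in seconds, for an availability of N nines.

    N nines means an unavailability of 10**-N, e.g. 3 nines = 99.9% available.
    """
    if nines < 0:
        raise ValueError("nines must be non-negative")
    return SECONDS_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in range(1, 7):
        seconds = downtime_seconds_per_year(n)
        print(f"{n} nines: {seconds:,.1f} s = {seconds / 86400:.4f} days")
```

Each additional nine divides the yearly downtime budget by ten, which is why the chart treats availability and rate of change as competing for the same small budget.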

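
A note on the ensemble voting slides (40 and 41): the slides name Median Absolute Deviation, IQR, Least Squares, and HDI as detectors that vote on whether a data point is anomalous. The sketch below is a minimal illustration of that idea, not Netflix's implementation. It covers three of the four detectors (HDI is omitted), and every function name, threshold, and the two-of-three majority rule is an assumption chosen for the example.

```python
from statistics import median, quantiles


def mad_flags(history, value, threshold=3.5):
    """Flag value if its modified z-score (based on the MAD) exceeds threshold."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        return value != med
    return 0.6745 * abs(value - med) / mad > threshold


def iqr_flags(history, value, k=1.5):
    """Flag value if it falls outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(history, n=4)
    spread = q3 - q1
    return value < q1 - k * spread or value > q3 + k * spread


def least_squares_flags(history, value, k=3.0):
    """Fit a line to history, extrapolate one step ahead, and flag value if it
    misses the prediction by more than k times the RMS residual of the fit."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in enumerate(history)]
    rms = (sum(r * r for r in residuals) / n) ** 0.5
    predicted = intercept + slope * n
    if rms == 0:
        return value != predicted
    return abs(value - predicted) > k * rms


def is_anomaly(history, value, min_votes=2):
    """Majority vote across the detectors (min_votes is an assumed setting)."""
    votes = [
        mad_flags(history, value),
        iqr_flags(history, value),
        least_squares_flags(history, value),
    ]
    return sum(votes) >= min_votes
```

For example, with `history = [100, 102, 98, 101, 99, 100, 103, 97, 101, 100]`, a new value of 150 is flagged by all three detectors, while 101 is flagged by none. Requiring agreement between detectors is one way to reduce false alerts from any single method.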
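
A note on Automatic Canary Analysis (slide 51): the slide describes defining metrics and a threshold, then repeatedly classifying metrics, computing a score, and making a decision. The sketch below shows one plausible shape for a single iteration of that loop. The classification rule (canary mean within a relative tolerance of the baseline mean), the 10% tolerance, the 75-point threshold, and all names are assumptions for illustration; the talk does not specify them.

```python
from statistics import mean


def classify_metric(baseline, canary, tolerance=0.10):
    """Return "pass" if the canary mean is within `tolerance` (relative) of the
    baseline mean, else "fail". Any deviation counts, in either direction."""
    base, cand = mean(baseline), mean(canary)
    return "pass" if abs(cand - base) <= abs(base) * tolerance else "fail"


def canary_score(baseline_metrics, canary_metrics, tolerance=0.10):
    """Classify every metric and return (score, per-metric results).

    The score is the percentage of metrics classified as "pass".
    """
    results = {
        name: classify_metric(baseline_metrics[name], canary_metrics[name], tolerance)
        for name in baseline_metrics
    }
    passed = sum(1 for outcome in results.values() if outcome == "pass")
    return 100.0 * passed / len(results), results


def decide(score, threshold=75.0):
    """Continue the rollout if the score meets the threshold, else roll back."""
    return "continue" if score >= threshold else "rollback"


if __name__ == "__main__":
    # Hypothetical samples collected from the old (baseline) and new (canary) servers.
    baseline = {"error_rate": [0.010, 0.012, 0.011], "latency_ms": [120, 118, 122], "cpu": [40, 42, 41]}
    canary = {"error_rate": [0.011, 0.011, 0.012], "latency_ms": [150, 155, 149], "cpu": [41, 43, 42]}
    score, results = canary_score(baseline, canary)
    print(results, round(score, 1), decide(score))
```

In a real pipeline this would run every n minutes against fresh metrics, and the decision would gate the shift from the 5% canary stage to the full rollout shown on slides 49 and 50.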