Netflix Development Patterns for Scale, Performance & Availability (DMG206) | AWS re:Invent 2013

This session explains how Netflix uses the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows rapid, independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance additions to (and subtractions from) the customer experience while aggressively scraping off barnacle features that add complexity for little value.

1. Netflix Development Patterns for Rapid Iteration, Scale, Performance, & Availability. Neil Hunt, Netflix. November 13, 2013. © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
2. Are You Designing Systems That Are:
   • Web-scale
   • Global
   • Highly available
   • Consumer-facing
   • Cloud native
3. Cloud Native
   • Service-oriented architecture
   • Redundancy
   • Statelessness
   • NoSQL
   • Eventual consistency
4. Assumptions — [Quadrant chart: vertical axis Scale (small → large), horizontal axis Speed (slowly changing → rapid change).]
   • Large scale, slowly changing: Telcos
   • Large scale, rapid change: Web-scale
   • Small scale, slowly changing: Enterprise IT
   • Small scale, rapid change: Startups
   • Axis notes: at large scale, hardware will fail; at a rapid rate of change, software will fail. Web-scale systems must assume everything is broken; small, slowly changing shops can assume everything works.
5. Netflix Cloud Goals: Availability, Scale, Performance
6. Performance
   • Reduce session start by 1 s → save 1 human lifetime per day! Win more moments of truth.
   • Suggest choices 1% better → 500k hours/day of additional value delivered.
7. Scale
   • 50% y/y traffic growth
   • 50 countries, 3 continents
   • Tens of thousands of instances at peak
   • 4 AWS regions, 12 datacenters
   • ~$0.001 per start
8. Availability
   • Aspire to 4 nines (99.99% of starts successful)
   • Per quarter: downtime < 3 minutes (peak time); successful starts: 9.999B; failures: 1M → frustration, calls, lost business. (1M failures against ~10B starts is exactly the 0.01% that four nines permits.)
9. Availabilities Compound — with N service dependencies each 99.99% available, system availability is 99.99%^N:
   • N = 2 → .9998
   • N = 10 → .999
   • N = 100 → .99
   • N = 1000 → .9
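
A quick check of the arithmetic behind the table (added here; the slide shows only the results): a serial chain of N dependencies is up only when every one of them is up, so availabilities multiply:

    A_{\text{system}} = a^{N}, \qquad 0.9999^{1000} = e^{1000 \ln 0.9999} \approx e^{-0.1} \approx 0.905
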
10. Availabilities Compound — to achieve 99.99% availability with 1,000 components requires either:
   • 99.9999% availability for each dependency (otherwise component failure leads to system failure), or
   • isolation for independence, so that component failure leads to degradation rather than system failure.
11. Availability, Scale, Performance Are Not Enough!
12. Rapid Iteration — Rate of Change
   • Running tests
   • Rolling out tests — engineering the winning test experience for scale
   • Adding features
   • Scaling up
   • Removing features, simplifying, minimizing
13. Testing — up to 1,000 changes per day!
14. Rate of Change
   • Change leads to bugs: new features, new configurations, new types of inputs, scaling up
   • Availability is in tension with rate of change
15. Availability / Rate of Change Tradeoff — [Chart: availability (99% to 99.999%) vs. rate of change (1 to 1,000 changes per day), with a curve marking the frontier of availability/change.]
16. Availability / Rate of Change Tradeoff — [The same frontier chart, repeated.]
17. Shifting the Curve… — [Chart: the same axes, with the frontier shifted outward.]
18. Shifting the Curve
   • Must break the chained dependencies that compound into cascading system failure
   • Subsystem isolation: failure in one component should never result in cascading system failure
19. Isolating Subsystems: redundant systems with timeout & failover
   • Covers: failure of an instance, failure of the network
   • Latency Monkey to test
   • [Diagram: a dependent system calls its dependence with a timeout and fails over to a redundant copy.]
20. Isolating Subsystems: redundant systems with timeout & failover, with tiered timeouts
   • Covers: failure of an instance, failure of the network
   • Latency Monkey to test
   • [Diagram: the higher-tier system uses a longer timeout than the dependent system's short timeout, so the inner failover completes first.]
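
A minimal Java sketch of the timeout-and-failover pattern from these two slides (the class name, the `replicas` list, and the timeout plumbing are illustrative assumptions, not Netflix code):

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class FailoverClient {
        private final ExecutorService pool = Executors.newCachedThreadPool();

        // Try each redundant replica in turn, giving each a short time budget.
        // A slow dependency is treated exactly like a dead one.
        public String fetch(List<Callable<String>> replicas, long timeoutMs) {
            Exception last = null;
            for (Callable<String> replica : replicas) {
                Future<String> f = pool.submit(replica);
                try {
                    return f.get(timeoutMs, TimeUnit.MILLISECONDS); // bound the wait
                } catch (Exception e) {                             // timeout or failure
                    f.cancel(true);                                 // abandon the slow call
                    last = e;                                       // fail over to the next replica
                }
            }
            throw new RuntimeException("all replicas failed or timed out", last);
        }
    }

Per the second slide, the tier above this one should use a timeout longer than `timeoutMs`, so the failover here completes before the caller gives up.
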
21. Isolating Subsystems: timeout with fallback default response
   • Covers: network failure, software bugs
   • [Diagram: on timeout, the dependent system returns a default response, e.g. { status=mem, plan=4, device=true }.]
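
A sketch of the fallback-default idea using the slide's example response; `SubscriberService`, the 50 ms budget, and the `Subscriber` fields are assumptions for illustration. (Netflix's open-source Hystrix library packages this same timeout-plus-fallback pattern.)

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    public class SubscriberClient {
        // Mirrors the slide's default response: { status=mem, plan=4, device=true }
        record Subscriber(String status, int plan, boolean device) {}

        interface SubscriberService { Subscriber lookup(long id) throws Exception; }

        private static final Subscriber DEFAULT = new Subscriber("mem", 4, true);
        private final ExecutorService pool = Executors.newCachedThreadPool();
        private final SubscriberService service;

        public SubscriberClient(SubscriberService service) { this.service = service; }

        // On timeout, network failure, or a bug in the dependency, degrade to a
        // conservative default instead of failing the whole request.
        public Subscriber getSubscriber(long id) {
            Future<Subscriber> f = pool.submit(() -> service.lookup(id));
            try {
                return f.get(50, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                f.cancel(true);
                return DEFAULT;
            }
        }
    }
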
22. Isolating Subsystems: canary push
   • Covers: network failure, software bugs
   • [Diagram: new code runs first on a single canary instance behind the usual timeout, alongside the stable fleet.]
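
A sketch of canary routing (the 1% fraction and the string-based request type are assumptions): a small slice of live traffic exercises the new build while the stable fleet serves the rest, and the push proceeds only if the canary's error and latency metrics match the baseline.

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.function.UnaryOperator;

    public class CanaryRouter {
        private static final double CANARY_FRACTION = 0.01; // assumed slice of traffic

        private final UnaryOperator<String> stableFleet;  // instances on the current build
        private final UnaryOperator<String> canary;       // single instance on the new build

        public CanaryRouter(UnaryOperator<String> stableFleet, UnaryOperator<String> canary) {
            this.stableFleet = stableFleet;
            this.canary = canary;
        }

        public String handle(String request) {
            // Route a small random sample of requests to the canary instance.
            boolean toCanary = ThreadLocalRandom.current().nextDouble() < CANARY_FRACTION;
            return (toCanary ? canary : stableFleet).apply(request);
        }
    }
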
23. Isolating Subsystems: red/black deployment
   • Covers: software bugs
   • [Diagram: bad code pushed as dependence V2.3; the dependent system fails back to the old code, dependence V2.2.]
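
A sketch of the red/black traffic switch (in production the flip happens at the load-balancer level; the `UnaryOperator` stand-ins are illustrative): the previous version keeps running unchanged, so failing back is a pointer swap rather than a redeploy.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.UnaryOperator;

    public class RedBlackSwitch {
        private final AtomicReference<UnaryOperator<String>> live;
        private volatile UnaryOperator<String> previous;

        public RedBlackSwitch(UnaryOperator<String> initial) {
            this.live = new AtomicReference<>(initial);
            this.previous = initial;
        }

        // Put the new version (e.g. V2.3) in service; V2.2 stays warm but idle.
        public void deploy(UnaryOperator<String> next) {
            previous = live.getAndSet(next);
        }

        // Bad code pushed? Fail back to the old version instantly.
        public void rollback() {
            live.set(previous);
        }

        public String handle(String request) {
            return live.get().apply(request);
        }
    }
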
24. Isolating Subsystems: standby blue system
   • Independent implementation, simplified logic
   • [Diagram: the dependent system fails over from dependence V2.3 to a static reference implementation.]
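
A sketch of the standby-blue idea (the `Recommender` interface and names are hypothetical): the fallback is an independently written, deliberately simple implementation, e.g. static precomputed results, so a bug in the primary's logic cannot also be present in the standby.

    import java.util.Arrays;
    import java.util.List;

    interface Recommender { List<String> topPicks(long userId); }

    // Independent, simplified standby: one precomputed, unpersonalized list,
    // with no shared code and no downstream dependencies of its own.
    class StaticReferenceRecommender implements Recommender {
        private static final List<String> DEFAULTS =
                Arrays.asList("popular-title-1", "popular-title-2", "popular-title-3");
        public List<String> topPicks(long userId) { return DEFAULTS; }
    }

    class GuardedRecommender implements Recommender {
        private final Recommender primary;
        private final Recommender standby = new StaticReferenceRecommender();

        GuardedRecommender(Recommender primary) { this.primary = primary; }

        public List<String> topPicks(long userId) {
            try {
                return primary.topPicks(userId);   // full personalization path
            } catch (RuntimeException e) {
                return standby.topPicks(userId);   // fail to the static version
            }
        }
    }
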
25. Isolating Subsystems: zone isolation
   • Covers: infrastructure failure (e.g. power outage)
   • Chaos Gorilla to test
   • [Diagram: a load balancer splits traffic between Zone A and Zone B, each with its own dependent system and dependence.]
26. Isolating Subsystems: region isolation
   • Covers: infrastructure software bugs (e.g. a load balancer failure)
   • Chaos Kong to test
   • [Diagram: DNS fronts Region E and Region W; each region has its own load balancer, zones A and B, dependent systems, and dependences.]
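
A sketch of zone evacuation (zone names and the health-flagging mechanism are assumptions; region failover works the same way one level up, with DNS playing the role of the load balancer): when Chaos Gorilla takes out a zone, its share of traffic shifts to the surviving zones.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.Set;
    import java.util.function.UnaryOperator;

    public class ZoneAwareRouter {
        private final Map<String, UnaryOperator<String>> zones = new LinkedHashMap<>();
        private final Set<String> unhealthy = new HashSet<>();
        private final Random rnd = new Random();

        public void addZone(String name, UnaryOperator<String> backend) {
            zones.put(name, backend);
        }

        // Health checks (or Chaos Gorilla) flip a zone out of rotation.
        public void markUnhealthy(String name) { unhealthy.add(name); }
        public void markHealthy(String name)   { unhealthy.remove(name); }

        public String handle(String request) {
            List<String> live = new ArrayList<>(zones.keySet());
            live.removeAll(unhealthy);
            if (live.isEmpty()) throw new IllegalStateException("no healthy zone");
            // Surviving zones absorb the evacuated zone's share of traffic.
            String pick = live.get(rnd.nextInt(live.size()));
            return zones.get(pick).apply(request);
        }
    }
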
27. Isolating Subsystems — summary of dependency failure modes and isolating techniques:
   • Instance failure, network failure → redundant systems with failover and timeout; timeout with default response
   • Network failure, software bug → canary push; red/black deployment; blue systems
   • Infrastructure failure → zone isolation
   • Cross-zone software bugs → region isolation
28. Trying Harder Won't Cut It
   • Trying harder gets a linear return on an exponential problem.
   • You need to be great at execution AND have the right architecture.
   • What architectural features are you using to ensure availability, scale, performance, and a rapid rate of change?
29. Please give us your feedback on this presentation (DMG206). As a thank you, we will select prize winners daily from completed surveys!
