
Refactoring Organizations - A Netflix Study (QCon NYC 2017)



Is your service architecture and engineering velocity constrained by organizational concerns? Does it seem impossible to give priority to key initiatives regardless of intent? Are engineers switching tasks so often that they are just treading water? Are critical projects endlessly backlogged? Has staffing up pushed the limits of your team structure? Navigating challenges like these can be daunting, and solutions are fraught with uncertainty. How do you know what, where, and when to change? And whatever the answer is today, it will almost certainly vary over time. Effective organizations evolve, at key inflection points, to support critical business and technical goals. Not only is there a strong relationship between organizations and the software they produce (Conway’s Law), but many organizational solutions can be derived from analogs in the technical realm. In other words, we can treat organizational improvement as a refactoring exercise. Over the last 20 years Netflix engineering has proven time and again an ability to adapt and grow, resulting in undisputed dominance over the global internet TV market. In this talk we’ll use Netflix as a case study to illustrate how specific strategies, framed as technical analogs, have been employed to maximize engineering agility, velocity, and impact. These powerful yet simple strategies and solutions provide a useful blueprint for organizational success.




  1. Josh Evans, (Former Netflix) Engineering Leader at Large. June 26, 2017. Refactoring Organizations: A Netflix Study
  2. 2009
  3. Devices …
  4. Queue Reader UX
  5. DVD Foundation
  6. [Diagram labels: Customer Device; Netflix Data Center; NCCP; Electronic Delivery; LoadBalancer; Netflix App; Security; Activation; Playback; Platform (NRDP); UI; XML/RPC; Ticket-based security; Custom responses; DB; HTTP/S; DVD Legacy; DVD Legacy]
  7. Netflix API
  8. Let 1000 flowers bloom!
  9. Netflix API [diagram labels: Netflix Data Center; API; LoadBalancer; REST API; JSON schema; HTTP response codes; OAuth; Content Metadata; Application; HTTP/S]
  10. Proposal [diagram labels: Customer Device; Netflix Data Center; API; LB; Netflix App; Security; Activation; Playback; Platform (NRDP); UI; Content Metadata; NCCP; ED; LB]
  11. Assessment. Pros: Separation of concerns; Bandwidth; Increased innovation. Cons: API reliability; Heterogeneous architecture; Lack of domain knowledge
  12. Neil Hunt, CPO
  13. What was the reaction?
  14. Concern; Anger
  15. Tribalism
  16. Conway’s Law
  17. If you have four teams working on a compiler you will end up with a four pass compiler
  18. Today’s Premise: Conway’s Law describes dysfunction; We must embrace architecture before organization; Technical analogs drive better organizational solutions
  19. Selfless Leadership: Company; Team; You. In that order!
  20. Today’s Program: Introductions; Framework; Scaling Teams; IQ v EQ; Conway’s Revenge
  21. Today’s Program: Introductions; Framework; Scaling Teams; IQ v EQ; Conway’s Revenge
  22. 1999 – 2009: Ecommerce (DVD → Streaming); 2009 – 2013: Streaming Infrastructure; 2013 – 2016: Operations Engineering; 2017: Time off, exploring options. Josh Evans, Engineering Leader at Large @ops_engineering @
  23. Global leader in subscription internet TV; Growing slate of original content; 100 million members; 190 countries, 10s of languages; 1000s of device types; Microservices on AWS; Unique company culture
  24. Today’s Program: Introductions; Framework; Scaling Teams; IQ v EQ; Conway’s Revenge
  25. Why do we refactor?
  26. We refactor to improve or sustain: Functionality; Engineering velocity; Functional and operational quality. As we scale!
  27. Functional Scalability: The ability to enhance a system by adding new functionality at minimal effort. Load Scalability: The ease with which a system or component can be modified, added, or removed, to accommodate changing load
  28. Organizational Scalability: The ability for an organization to easily add people and domain responsibilities in response to increased work and complexity; The ease with which an organization or team can adapt to shifts in business strategy
  29. When do we refactor? Common tasks are difficult; Strategic efforts are impractical or impossible
  30. How do we refactor?
  31. Technical Patterns: Object-oriented design; Micro-service architecture; Systems engineering
  32. Example: Organizational Polymorphism
  33. Netflix Culture: With the right people, instead of a culture of process adherence, we have a culture of creativity and self-discipline, freedom and responsibility
  34. You build it, you run it
  35. Today’s Program: Introductions; Framework; Scaling Teams; IQ v EQ; Conway’s Revenge
  36. 2009
  37. Devices in Production …
  38. Key Platforms in Progress
  39. Anthony Park
  40. Surprise! John Funge
  41. Big Picture: Device Ubiquity; Product Innovation; Cloud Migration; Internationalization; Service Reliability
  42. How many engineers? 6
  43. What would you do?
  44. Prioritize & Queue [diagram labels: Task Queue; Thread Pool; Completed Tasks]
  45. Prioritize & Queue. Prioritize: Service availability; Game consoles; Downloadable apps; CE expansion; Mobile; … Queue: Audio & subtitles; International support; New codecs; …
  46. Scale Up: Work profile; Roles & throughput; Team structure
  47. Monolithic Team: One leader; Undifferentiated roles; Ad hoc responsibilities [org chart labels: Manager; Engineer (×5); Test/Ops]
  48. Monolithic Decomposition: Distinct modules & services; Workload partitioning; Dependency awareness; Loose coupling
  49. Server NRDP Features Protocols Security Bootstrap Key Platforms Device integration Device launch Streaming Infrastructure Device & partner-oriented load balancing
  50. On Call Overload
  51. Vicious Cycle (Philip Fisher-Ogden). Philip was great at: Development; Test infrastructure; Project management; Troubleshooting
  52. You build it, you run it
  53. Risks (Philip Fisher-Ogden): Burnout; Slow progress on key initiatives
  54. Thread Starvation [diagram labels: Tasks; Shared exclusive resource; High priority/frequency; Other - blocked]
  55. Context Switching [diagram: Process 1 and Process 2 alternate between Executing and Idle; on each interrupt or system call the OS saves state to one PCB (pcb1, pcb2) and gets state from the other]
  56. Thread Pool Isolation: Partition pools & locks; Distribute problematic workloads
  57. Organizational Solution: Deepen troubleshooting skills; Distribute escalations; Engineer operations
  58. Key Platforms Device integration Device launch Server NRDP Protocols Security Bootstrap Insight/Tools Delivery Dashboards Performance Operational tools Consolidating Operations Engineering
  59. Cloud Migration
  60. Cloud v. Product: Rapid iteration v. systematic, long-cycle execution
  61. Heterogeneous Workloads [diagram labels: S; DB; Member Traffic; Batch Processes]
  62. Interference [diagram labels: S; DB; Member Path; Batch]
  63. Interference [diagram labels: S; DB; Member Path; Batch]
  64. Interference [diagram labels: S; DB; Member Path; Batch]
  65. Interference [diagram labels: S; DB; Member Path; Batch; X]
  66. Partitioning [diagram labels: S; DB; Member Path; Batch; Online; Offline]
  67. Silverlight Migration
  68. Partitioning & Domain Portability. Engineer as a Library [diagram labels: Streaming Infrastructure; Platform Engineering; Ranjit; Ranjit]
  69. Partitioning & Domain Portability. Engineer as a Library [diagram labels: Platform Engineering; Streaming Infrastructure]
  70. Systems Cloud migration Key Platforms Device integration Device launch Server NRDP Protocols Security Bootstrap Insight/Tools Delivery Dashboards Performance Ops tools Streaming Infrastructure
  71. Staffing
  72. 6 → 24
  73. Bottleneck
  74. Systems Cloud migration Viewing history Viewing sessions Key Platforms Device integration Device launch Server NRDP Protocols Security Bootstrap Insight/Tools Delivery Dashboards Performance Ops tools Cloning & Parallel Processing
  75. By 2012: cloud migration; Canada, Latin America, UK; massive device expansion; major product improvements; Netflix CDN
  76. Today’s Program: Introductions; Framework; Scaling Teams; IQ v EQ; Conway’s Revenge
  77. Bimodal Thinking. IQ: Task-oriented; Logical; Literal; Detached; Autocratic. EQ: Feeling-oriented; Emotional; Social; Empathetic; Democratic
  78. Overcoming Tribalism. IQ: Design; Evaluation; Implementation. EQ: Inception; Socialization
  79. Flawed Inception
  80. Today’s Program: Introductions; Framework; Scaling Teams; IQ v EQ; Conway’s Revenge
  81. 2012
  82. This… [diagram labels: Customer Device; Netflix Data Center; API; LB; Netflix App; Security; Activation; Playback; Platform (NRDP); UI; Content Metadata; NCCP; ED; LB]
  83. …has become this [diagram labels: ELB; NCCP; API; Zuul]
  84. …and this [diagram labels: ELB]
  85. Growing complexity; Duplication of effort; Engineering tax
  86. Raising the Stakes: Playback start in 500ms; More UI/Playback scenarios; Faster rate of innovation; Better service reliability
  87. When do we refactor? Common tasks are difficult; Strategic efforts are impractical or impossible
  88. Conway’s Revenge! If you have four teams working on a compiler you will end up with a four pass compiler. We had two teams and a two-service edge architecture
  89. A Better Foundation (Daniel Jacobson): Mature API team; Robust API platform; Strong operational focus; Trust & respect
  90. Moment of Truth. Josh: what’s the right architectural solution? Peter: do you care about the organizational implications?
  91. Selfless Leadership
  92. Moment of Truth. Josh: what’s the right architectural solution? Peter: do you care about the organizational implications? Josh: no, we’ll figure that out later
  93. Before [diagram labels: ELB; NCCP; API; Zuul]
  94. After: Integrated architecture; Distributed functionality; Shared services; Common practices
  95. Edge Services Zuul API Server Playback Services Features Security Data Systems Platform Insight/Tools Edge Services Shared services Organized around microservices, functionality, shared services
  96. Takeaways: Put architecture first; Leverage technical analogs; Know when to use IQ v EQ; Be selfless
  97. Where to find me: www.linkedin.com/in/jevansnflx
  98. ? Refactoring Organizations
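
Slides 44 and 45 frame a small team's backlog as a task queue drained by a thread pool: the most urgent items are worked on now and everything else waits. The sketch below is one possible reading of that analogy, not code from the talk. The function name `prioritize_and_queue`, the numeric priorities (taken from the order in which slide 45 lists the items), and the `capacity` parameter are all assumptions made for illustration.

```python
import heapq


def prioritize_and_queue(tasks, capacity):
    """Split (priority, name) pairs into active work and a waiting queue.

    Lower priority numbers are more urgent. `capacity` is how many tasks
    the team can work on at once (a hypothetical knob, not from the talk).
    """
    heap = list(tasks)
    heapq.heapify(heap)
    # Pull the most urgent tasks first, up to the team's capacity.
    active = [heapq.heappop(heap)[1] for _ in range(min(capacity, len(heap)))]
    # Whatever is left waits, still ordered by priority.
    queued = [name for _, name in sorted(heap)]
    return active, queued


# Items named on slide 45; the numeric priorities are assumed.
backlog = [
    (1, "Service availability"),
    (2, "Game consoles"),
    (3, "Downloadable apps"),
    (4, "CE expansion"),
    (5, "Mobile"),
    (6, "Audio & subtitles"),
    (7, "International support"),
    (8, "New codecs"),
]

active, queued = prioritize_and_queue(backlog, capacity=5)
print("Working on:", active)
print("Queued:", queued)
```

With a capacity of five, the first five items become active work and the remaining three stay queued, which mirrors the Prioritize and Queue columns on slide 45.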

Editor's Notes

  • Melvin Conway
  • A thread is unable to gain regular access to shared resources and is unable to make progress 
  • Leading to
    Slower innovation
    Longer service outages
    Inefficiency
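
The thread starvation note, read together with slides 54 to 57, can be illustrated with a toy simulation. This is a deliberately simplified model under stated assumptions, not anything shown in the talk: two workers, a steady stream of high-priority escalations, and an unlimited supply of lower-priority project work. The function `simulate` and its parameters are hypothetical names.

```python
def simulate(ticks, escalations_per_tick, shared):
    """Count completed work items per category after `ticks` time steps.

    Two workers exist in total. In the shared setup both serve one queue
    in which escalations always go first; in the isolated setup each
    category gets one dedicated worker. Project work is assumed to be
    always available.
    """
    done = {"escalation": 0, "project": 0}
    pending_escalations = 0
    for _ in range(ticks):
        pending_escalations += escalations_per_tick
        if shared:
            slots = 2
            served = min(slots, pending_escalations)
            pending_escalations -= served
            done["escalation"] += served
            # Project work only gets whatever capacity escalations leave over.
            done["project"] += slots - served
        else:
            served = min(1, pending_escalations)
            pending_escalations -= served
            done["escalation"] += served
            # The dedicated project worker always makes progress.
            done["project"] += 1
    return done


print("shared pool:  ", simulate(10, 2, shared=True))
print("isolated pool:", simulate(10, 2, shared=False))
```

In the shared run, escalations consume both workers on every tick and project work never advances, which is the starvation the note describes. In the isolated run project work progresses steadily, but unhandled escalations accumulate in this toy model, so partitioning alone trades one backlog for another.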
