Maintaining the Front Door to Netflix : The Netflix API

This presentation was given to the engineering organization at Zendesk. In it, I talk about the challenges the Netflix API faces in supporting 1000+ different device types, millions of users, and billions of transactions. Topics include resiliency, scale, API design, failure injection, continuous delivery, and more.

  1. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson
  2. There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation
  3. Global Streaming Video for TV Shows and Movies
  4. More than 44 Million Subscribers More than 40 Countries
  5. Netflix Accounts for ~33% of Peak Internet Traffic in North America Netflix subscribers are watching more than 1 billion hours a month
  6. Team Focus: Build the Best Global Streaming Product Three aspects of the Streaming Product: • Non-Member • Discovery • Streaming
  7. Key Responsibilities • Broker data between services and UIs • Maintain a resilient front-door • Scale the system vertically and horizontally • Maintain high velocity
  8. But Before Streaming…
  9. Monolithic Application In Netflix Data Centers
  10. The bigger the ship… the slower it turns
  11. Distributed Architecture
  12. 1000+ Device Types
  13. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies Reviews A/B Test Engine Dozens of Dependencies
  14. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  15. Dependency Relationships
  16. 2,000,000,000 Requests Per Day to the Netflix API
  17. 30 Distinct Dependent Services for the Netflix API
  18. ~500 Dependency jars Slurped into the Netflix API
  19. 14,000,000,000 Netflix API Calls Per Day to those Dependent Services
  20. 0 Dependent Services with 100% SLA
  21. 99.99%^30 = 99.7%; 0.3% of 2B = 6M failures per day; 2+ Hours of Downtime Per Month
  22. 99.99%^30 = 99.7%; 0.3% of 2B = 6M failures per day; 2+ Hours of Downtime Per Month
  23. 99.9%^30 = 97%; 3% of 2B = 60M failures per day; 20+ Hours of Downtime Per Month
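The arithmetic on slides 21 to 23 can be checked with a few lines of Python. This is a minimal sketch that assumes the 30 dependencies fail independently and that a month has 720 hours; the function names are illustrative and not taken from the presentation.

```python
def combined_availability(per_service: float, services: int) -> float:
    """Availability of a request that needs every one of `services` dependencies."""
    return per_service ** services


def failures_per_day(availability: float, requests_per_day: int) -> float:
    """Expected number of failed requests per day at the given availability."""
    return (1.0 - availability) * requests_per_day


def downtime_hours_per_month(availability: float, hours_in_month: float = 720.0) -> float:
    """Equivalent downtime, assuming a 30-day (720-hour) month."""
    return (1.0 - availability) * hours_in_month


if __name__ == "__main__":
    for sla in (0.9999, 0.999):
        a = combined_availability(sla, 30)
        print(
            f"{sla:.2%} per service -> {a:.2%} combined, "
            f"{failures_per_day(a, 2_000_000_000):,.0f} failures/day, "
            f"{downtime_hours_per_month(a):.1f} hours/month"
        )
```

Under these assumptions the results land close to the rounded figures on the slides.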
  24. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  25. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  26. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  27. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  28. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  29. Circuit Breaker Dashboard
  30. Call Volume and Health / Last 10 Seconds
  31. Call Volume / Last 2 Minutes
  32. Successful Requests
  33. Successful, But Slower Than Expected
  34. Short-Circuited Requests, Delivering Fallbacks
  35. Timeouts, Delivering Fallbacks
  36. Thread Pool & Task Queue Full, Delivering Fallbacks
  37. Exceptions, Delivering Fallbacks
  38. Error Rate: (# + # + # + #) / (# + # + # + # + #) = Error Rate
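The formula on slide 38 lost its color coding in this transcript. One plausible reading, and it is an assumption, is that the four terms in the numerator are the four fallback-delivering categories from slides 34 to 37, and the denominator adds successful requests. A minimal sketch under that assumption, with hypothetical field names:

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts for one dependency over a rolling window (names are hypothetical)."""
    success: int
    short_circuited: int
    timeout: int
    rejected: int   # thread pool and task queue full
    failure: int    # exceptions

    def error_rate(self) -> float:
        # Assumed mapping: every request that ended in a fallback counts as an error.
        errors = self.short_circuited + self.timeout + self.rejected + self.failure
        total = self.success + errors
        return errors / total if total else 0.0


if __name__ == "__main__":
    window = WindowCounts(success=90, short_circuited=2, timeout=3, rejected=1, failure=4)
    print(f"error rate: {window.error_rate():.1%}")
```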
  39. Status of Fallback Circuit
  40. Requests per Second, Over Last 10 Seconds
  41. SLA Information
  42. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  43. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  44. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine
  45. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
  46. Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
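The circuit-breaker-with-fallback idea behind the dashboard and the "Fallback" slides can be sketched as follows. This is a simplified illustration and not Netflix's implementation: the class name, the consecutive-failure threshold, and the reset interval are all assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures, then serves fallbacks until a reset interval passes."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the reset interval, let a trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_after

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuited: the dependency is not called at all
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, lambda: {})`, where `fetch_ratings` is a hypothetical client function and the empty dictionary is the degraded response.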
  47. Scaling the Distributed System
  48. AWS Cloud
  49. Autoscaling
  50. Autoscaling
  51. Amazon Auto Scaling Limitations • Hard to fit policies to variable traffic patterns (weekday vs weekend) • Limited control over capacity adjustments (absolute value or %)
  52. The Impact of AAS Limitations • A traffic drop can lead to scale-downs during an outage • Performance degradation between new instance launch and taking traffic • Excess capacity at peak and trough
  53. Scryer : Predictive Auto Scaling Not yet…
  54. Typical Traffic Patterns Over Five Days
  55. Predicted RPS Compared to Actual RPS
  56. Scaling Plan for Predicted Workload
  57. What is Scryer Doing? • Evaluates needs based on historical data – Week over week, month over month metrics • Adjusts instance minimums based on algorithms • Relies on Amazon Auto Scaling for unpredicted events
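The slides do not describe Scryer's algorithms, so the following is only a sketch of the general idea: predict load from the same hour in previous weeks, then turn the prediction into an instance minimum. The averaging rule, the headroom factor, and all names are assumptions for illustration.

```python
import math
from statistics import mean
from typing import List, Sequence


def predict_rps(history: Sequence[Sequence[float]], hour_of_week: int) -> float:
    """Average the RPS seen at the same hour of the week across past weeks.

    `history` is a list of weeks, each holding 168 hourly RPS values.
    """
    return mean(week[hour_of_week] for week in history)


def instance_minimum(predicted_rps: float, rps_per_instance: float,
                     headroom: float = 1.2) -> int:
    """Instances needed to serve the predicted load with some spare capacity."""
    return math.ceil(predicted_rps * headroom / rps_per_instance)


def scaling_plan(history: Sequence[Sequence[float]], rps_per_instance: float) -> List[int]:
    """One instance minimum per hour of the coming week."""
    return [instance_minimum(predict_rps(history, h), rps_per_instance)
            for h in range(168)]
```

The plan only sets a floor. As slide 57 notes, reactive auto scaling still handles events the prediction did not anticipate.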
  58. Results
  59. Results : Load Average Reactive Predictive
  60. Results : Response Latencies Reactive Predictive
  61. Results : Outage Recovery
  62. Results : Outage Recovery
  63. Results : AWS Costs
  64. Scaling Globally
  65. More than 44 Million Subscribers More than 40 Countries
  66. Zuul Gatekeeper for the Netflix Streaming Application
  67. Zuul * • Multi-Region Resiliency • Insights • Stress Testing • Canary Testing • Dynamic Routing • Load Shedding • Security • Static Response Handling • Authentication * Most closely resembles an API proxy
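Several of the capabilities on slide 67 (authentication, load shedding, routing) fit naturally into a chain of filters that each request passes through before reaching an origin. The sketch below illustrates that pattern in Python; it is not Zuul's actual API, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Request:
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


@dataclass
class Response:
    status: int
    body: str


# A filter returns a Response to stop the chain early, or None to continue.
Filter = Callable[[Request], Optional[Response]]


def require_auth(req: Request) -> Optional[Response]:
    """Reject requests that carry no credentials."""
    if "Authorization" not in req.headers:
        return Response(401, "unauthorized")
    return None


def make_load_shedder(current_load: Callable[[], float], limit: float) -> Filter:
    """Build a filter that rejects requests while load is above `limit`."""
    def shed(req: Request) -> Optional[Response]:
        if current_load() > limit:
            return Response(503, "shedding load")
        return None
    return shed


def make_gateway(filters: List[Filter],
                 route: Callable[[Request], Response]) -> Callable[[Request], Response]:
    """Run each filter in order, then hand the request to the origin."""
    def handle(req: Request) -> Response:
        for f in filters:
            early = f(req)
            if early is not None:
                return early
        return route(req)
    return handle
```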
  68. Isthmus
  69. All of these approaches are designed to prevent failures…
  70. But sometimes the best way to prevent failures is to force them!
  71. Chaos Monkey: I randomly terminate instances in production to identify dormant failures.
  72. Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.
  73. Chaos Kong: I simulate an outage in an AWS region.
  74. Conformity Monkey: I find instances that don’t adhere to best practices.
  75. Security Monkey: I extend Conformity Monkey to find security violations.
  76. Doctor Monkey: I detect unhealthy instances and remove them from service.
  77. Janitor Monkey: I clean up the clutter and waste that runs in the cloud.
  78. Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
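The fault-injection idea behind Latency Monkey can be illustrated with a small wrapper that adds a delay and, with some probability, an error to any call. This is a toy sketch and not the Simian Army's code; `with_faults`, `InjectedFault`, and the parameters are names invented for the example.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(Exception):
    """Raised in place of a real response when a failure is injected."""


def with_faults(func: Callable[..., Any], *, delay_s: float, error_rate: float,
                rng: Optional[random.Random] = None,
                sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Wrap `func` so each call is delayed and fails with probability `error_rate`."""
    rng = rng or random.Random()

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        sleep(delay_s)                     # artificial latency
        if rng.random() < error_rate:      # artificial error
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a dependency client this way in a test environment shows whether callers time out, fall back, or fail outright when that dependency degrades.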
  79. Deployments in the Cloud
  80. Dependency Relationships
  81. Testing Philosophy: Act Fast, React Fast
  82. That Doesn’t Mean We Don’t Test
  83. Automated Delivery Pipeline
  84. Cloud-Based Deployment Techniques
  85. Current Code In Production API Requests from the Internet
  86. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet
  87. Canary Analysis Automation
  88. Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet Error!
  89. Current Code In Production API Requests from the Internet
  90. Current Code In Production API Requests from the Internet
  91. Current Code In Production API Requests from the Internet Perfect!
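The canary slides describe sending around 1% of production traffic to a single instance running new code and analyzing the result automatically. A minimal sketch of both halves follows; the hash-based bucketing and the error-rate tolerance are assumptions chosen for illustration, not the criteria Netflix uses.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: float = 1.0) -> bool:
    """Deterministically send about `canary_percent` percent of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # buckets of 0.01%
    return bucket < canary_percent * 100


def canary_passes(baseline_errors: int, baseline_total: int,
                  canary_errors: int, canary_total: int,
                  tolerance: float = 1.5) -> bool:
    """Pass if the canary's error rate is within `tolerance` times the baseline's."""
    if canary_total == 0:
        return False   # no traffic reached the canary, so there is nothing to judge
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate * tolerance
```

Hashing the request identifier keeps routing stable across retries of the same request, which makes canary metrics easier to compare against the baseline.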
  92. Stress Test with Zuul
  93. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  94. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  95. Error! Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  96. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  97. Current Code In Production API Requests from the Internet Perfect!
  98. Stress Test with Zuul
  99. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  100. Current Code In Production API Requests from the Internet New Code Getting Prepared for Production
  101. API Requests from the Internet New Code Getting Prepared for Production
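Slides 93 to 101 appear to show new code being launched alongside the current code, traffic moving to it, a rollback when errors appear, and the old code being retired once the new code is healthy. The transcript does not spell this out, so the sketch below is one reading of that sequence, with hypothetical names throughout.

```python
from typing import Callable


def cut_over(current: str, candidate: str,
             is_healthy: Callable[[str], bool],
             log: Callable[[str], None] = print) -> str:
    """Try to move traffic to `candidate`; return the version left serving traffic."""
    log(f"launch {candidate} alongside {current}")
    log(f"route traffic to {candidate}; keep {current} running")
    if not is_healthy(candidate):
        # Rollback is quick because the old code was never shut down.
        log(f"errors on {candidate}: route traffic back to {current}")
        return current
    log(f"{candidate} is healthy: retire {current}")
    return candidate
```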
  102. Brokering Data to 1,000+ Device Types
  103. Screen Real Estate
  104. Controller
  105. Technical Capabilities
  106. One-Size-Fits-All API Request Request Request
  107. Courtesy of South Florida Classical Review
  108. Resource-Based API vs. Experience-Based API
  109. Resource-Based Requests • /users/<id>/ratings/title • /users/<id>/queues • /users/<id>/queues/instant • /users/<id>/recommendations • /catalog/titles/movie • /catalog/titles/series • /catalog/people
  110. REST API RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Network Border Network Border
  111. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border SERVER CODE CLIENT CODE
  112. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border DATA GATHERING, FORMATTING, AND DELIVERY USER INTERFACE RENDERING
  113. Experience-Based Requests • /ps3/homescreen
  114. JAVA API Network Border Network Border RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Groovy Layer
  115. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API SERVER CODE CLIENT CODE CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER) Network Border Network Border
  116. RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API DATA GATHERING DATA FORMATTING AND DELIVERY USER INTERFACE RENDERING Network Border Network Border
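The experience-based model on slides 113 to 116 puts device-specific adapter code on the server, so a device makes one request such as `/ps3/homescreen` instead of several resource requests. The slides name a Groovy layer for this; the sketch below uses Python only to stay consistent with the other examples here, and the data functions, registry, and response shape are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for calls into the server-side Java API.
def get_recommendations(user_id: str) -> List[str]:
    return ["title-1", "title-2"]

def get_ratings(user_id: str) -> Dict[str, int]:
    return {"title-1": 4}

def get_queue(user_id: str) -> List[str]:
    return ["title-3"]


# Registry of adapters, standing in for code uploaded by client teams.
ADAPTERS: Dict[str, Callable[[str], dict]] = {}

def register_adapter(path: str) -> Callable:
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        ADAPTERS[path] = func
        return func
    return decorator


@register_adapter("/ps3/homescreen")
def ps3_homescreen(user_id: str) -> dict:
    """Gather and format everything one device screen needs in a single response."""
    ratings = get_ratings(user_id)

    def row(name: str, titles: List[str]) -> dict:
        return {"name": name,
                "titles": [{"id": t, "rating": ratings.get(t)} for t in titles]}

    return {"rows": [row("Recommended", get_recommendations(user_id)),
                     row("Queue", get_queue(user_id))]}


def handle(path: str, user_id: str) -> dict:
    """Dispatch an incoming request to the adapter registered for its path."""
    return ADAPTERS[path](user_id)
```

With the resource-based endpoints from slide 109, the same screen would take separate calls for recommendations, ratings, and the queue, each crossing the network border.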
  117. https://www.github.com/Netflix
  118. Maintaining the Front Door to Netflix Daniel Jacobson @daniel_jacobson http://www.linkedin.com/in/danieljacobson http://www.slideshare.net/danieljacobson
