SlideShare a Scribd company logo
1 of 42
Performance and Fault Tolerance
for the Netflix API
Ben Christensen
Software Engineer – API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




http://techblog.netflix.com/
Netflix API




Dependency A              Dependency B           Dependency C



   Dependency D               Dependency E           Dependency F



       Dependency G              Dependency H            Dependency I



               Dependency J          Dependency K            Dependency L



                  Dependency M            Dependency N          Dependency O



                      Dependency P            Dependency Q              Dependency R
Netflix API




Dependency A              Dependency B           Dependency C



   Dependency D               Dependency E           Dependency F



       Dependency G              Dependency H            Dependency I



               Dependency J          Dependency K            Dependency L



                  Dependency M            Dependency N          Dependency O



                      Dependency P            Dependency Q              Dependency R
Dozens of dependencies.

    One going bad takes everything down.


99.99%30          = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.
No single dependency should
 take down the entire app.

         Fallback.
         Fail silent.
          Fail fast.

        Shed load.
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker
Tryable semaphores for “trusted” clients and fallbacks

       Separate threads for “untrusted” clients

 Aggressive timeouts on threads and network calls
             to “give up and move on”

       Circuit breakers as the “release valve”
30 rps x 0.2 seconds = 6 + breathing room = 10 threads

Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)

                Thread-pool Size + Queue Size

                      Queuing is Not Free
Cost of Thread @ 75rps
  median - 90th - 99th (time in ms)




                 Time for thread to execute   Time user thread waited
Netflix DependencyCommand Implementation
Netflix DependencyCommand Implementation

              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response
Netflix DependencyCommand Implementation
So, how does it work in the real world?
Visualizing Circuits in Near-Realtime
    (latency is single-digit seconds, generally 1-2)




        Video available at
  https://vimeo.com/33576628
Rolling 10 second counters


1 minute latency percentiles




  2 minute rate change



circle color and size represent
   health and traffic volume
API Daily Incoming vs Outgoing

Weekend                                        Weekend               Weekend




              8-10 Billion DependencyCommand Executions (threaded)




                        1.2 - 1.6 Billion Incoming Requests
API Hourly Incoming vs Outgoing

 Peak at 700M+ threaded DependencyCommand executions (200k+/second)




              Peak at 100M+ incoming requests (30k+/second)
Fallback.
Fail silent.
 Fail fast.

Shed load.
Netflix API




Dependency A              Dependency B           Dependency C



   Dependency D               Dependency E           Dependency F



       Dependency G              Dependency H            Dependency I



               Dependency J          Dependency K            Dependency L



                  Dependency M            Dependency N          Dependency O



                      Dependency P            Dependency Q              Dependency R
Single Network Request from Clients
     (use LAN instead of WAN)

  Send Only The Bytes That Matter
 (optimize responses for each client)

       Leverage Concurrency
  (but abstract away its complexity)
Single Network Request from Clients
     (use LAN instead of WAN)




                     Device
                              Server


                                       Netflix API
      landing page requires
       ~dozen API requests
Single Network Request from Clients
        (use LAN instead of WAN)




some clients are limited in the number of
   concurrent network connections
Single Network Request from Clients
         (use LAN instead of WAN)




network latency makes this even worse
(mobile, home, wifi, geographic distance, etc)
Single Network Request from Clients
     (use LAN instead of WAN)




                Device
                Server

                         Netflix API




  push call pattern to server ...
Single Network Request from Clients
     (use LAN instead of WAN)




                Device
                Server

                         Netflix API




 ... and eliminate redundant calls
Send Only The Bytes That Matter
         (optimize responses for each client)




                                              Netflix API
                       Device
                                Server
Client                                   Client




           part of client now on server
Send Only The Bytes That Matter
              (optimize responses for each client)




                                                   Netflix API
                            Device
                                     Server
     Client                                   Client




client retrieves and delivers exactly what their
       device needs in its optimal format
Send Only The Bytes That Matter
            (optimize responses for each client)




                    Device
                             Server
                                               Netflix API

                                               Service Layer

Client                                Client




         interface is now a Java API that client
            interacts with at a granular level
Leverage Concurrency
         (but abstract away its complexity)




                Device
                         Server
                                           Netflix API

                                           Service Layer

Client                            Client
Leverage Concurrency
         (but abstract away its complexity)




                Device
                         Server
                                           Netflix API

                                           Service Layer

Client                            Client




   no synchronized, volatile, locks, Futures or
Atomic*/Concurrent* classes in client-server code
Leverage Concurrency
                         (but abstract away its complexity)

Service calls are    def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
all asynchronous     def video2Call = api.getVideos(api.getUser(), 6789543);

                     // higher-order functions used to compose asynchronous calls together
                     wx.merge(video1Call, video2Call).toList().subscribe([
    Functional
                         onNext: {
  programming                listOfVideos ->
 with higher-order           for(video in listOfVideos) {
     functions                   response.getWriter().println("video: " + video.id + " " + video.title);
                             }
                         },
                         onError: {
                             exception ->
                             response.setStatus(500);
                             response.getWriter().println("Error: " + exception.getMessage());
                         }
                     ])




              Fully asynchronous API - Clients can’t block
Device
                             Server

                                      Netflix API




Optimize for each device. Leverage the server.
Netflix is Hiring
                              http://jobs.netflix.com




Fault Tolerance in a High Volume, Distributed System
     http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html



          Making the Netflix API More Resilient
    http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html



              Why REST Keeps Me Up At Night
 http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/



                         Ben Christensen
                           @benjchristensen
                  http://www.linkedin.com/in/benjchristensen

More Related Content

Similar to Netflix API Performance and Fault Tolerance Strategies

The Netflix API Platform for Server-Side Scripting
The Netflix API Platform for Server-Side ScriptingThe Netflix API Platform for Server-Side Scripting
The Netflix API Platform for Server-Side ScriptingKatharina Probst
 
Pros and Cons of a MicroServices Architecture talk at AWS ReInvent
Pros and Cons of a MicroServices Architecture talk at AWS ReInventPros and Cons of a MicroServices Architecture talk at AWS ReInvent
Pros and Cons of a MicroServices Architecture talk at AWS ReInventSudhir Tonse
 
Building a Service Mesh with NGINX Owen Garrett.pptx
Building a Service Mesh with NGINX Owen Garrett.pptxBuilding a Service Mesh with NGINX Owen Garrett.pptx
Building a Service Mesh with NGINX Owen Garrett.pptxPINGXIONG3
 
Application DoS In Microservice Architectures
Application DoS In Microservice ArchitecturesApplication DoS In Microservice Architectures
Application DoS In Microservice ArchitecturesScott Behrens
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleSudhir Tonse
 
QConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdf
QConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdfQConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdf
QConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdfSimranjyotSuri
 
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to MicroservicesMastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to MicroservicesJosh Evans
 
Azure Service Fabric: The road ahead for microservices
Azure Service Fabric: The road ahead for microservicesAzure Service Fabric: The road ahead for microservices
Azure Service Fabric: The road ahead for microservicesMicrosoft Tech Community
 
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...Amazon Web Services
 
Developing reliable applications with .net core and AKS
Developing reliable applications with .net core and AKSDeveloping reliable applications with .net core and AKS
Developing reliable applications with .net core and AKSAlessandro Melchiori
 
Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18
Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18
Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18CodeOps Technologies LLP
 
Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...
Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...
Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...Twilio Inc
 
C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)
C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)
C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)Stuart Lodge
 
Jeffrey Richter
Jeffrey RichterJeffrey Richter
Jeffrey RichterCodeFest
 
Web services - A Practical Approach
Web services - A Practical ApproachWeb services - A Practical Approach
Web services - A Practical ApproachMadhaiyan Muthu
 
REST to JavaScript for Better Client-side Development
REST to JavaScript for Better Client-side DevelopmentREST to JavaScript for Better Client-side Development
REST to JavaScript for Better Client-side DevelopmentHyunghun Cho
 
Windows Azure架构探析
Windows Azure架构探析Windows Azure架构探析
Windows Azure架构探析George Ang
 
Am 04 track1--salvatore orlando--openstack-apac-2012-final
Am 04 track1--salvatore orlando--openstack-apac-2012-finalAm 04 track1--salvatore orlando--openstack-apac-2012-final
Am 04 track1--salvatore orlando--openstack-apac-2012-finalOpenCity Community
 

Similar to Netflix API Performance and Fault Tolerance Strategies (20)

The Netflix API Platform for Server-Side Scripting
The Netflix API Platform for Server-Side ScriptingThe Netflix API Platform for Server-Side Scripting
The Netflix API Platform for Server-Side Scripting
 
Pros and Cons of a MicroServices Architecture talk at AWS ReInvent
Pros and Cons of a MicroServices Architecture talk at AWS ReInventPros and Cons of a MicroServices Architecture talk at AWS ReInvent
Pros and Cons of a MicroServices Architecture talk at AWS ReInvent
 
Building a Service Mesh with NGINX Owen Garrett.pptx
Building a Service Mesh with NGINX Owen Garrett.pptxBuilding a Service Mesh with NGINX Owen Garrett.pptx
Building a Service Mesh with NGINX Owen Garrett.pptx
 
Application DoS In Microservice Architectures
Application DoS In Microservice ArchitecturesApplication DoS In Microservice Architectures
Application DoS In Microservice Architectures
 
MicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scaleMicroServices at Netflix - challenges of scale
MicroServices at Netflix - challenges of scale
 
QConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdf
QConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdfQConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdf
QConSF2016-JoshEvans-MasteringChaosANetflixGuidetoMicroservices-compressed.pdf
 
Mastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to MicroservicesMastering Chaos - A Netflix Guide to Microservices
Mastering Chaos - A Netflix Guide to Microservices
 
Azure Service Fabric: The road ahead for microservices
Azure Service Fabric: The road ahead for microservicesAzure Service Fabric: The road ahead for microservices
Azure Service Fabric: The road ahead for microservices
 
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
Integrating Infrastructure as Code into a Continuous Delivery Pipeline | AWS ...
 
Developing reliable applications with .net core and AKS
Developing reliable applications with .net core and AKSDeveloping reliable applications with .net core and AKS
Developing reliable applications with .net core and AKS
 
Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18
Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18
Make Java Microservices Resilient with Istio - Mangesh - IBM - CC18
 
Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...
Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...
Asynchronous Architectures for Implementing Scalable Cloud Services - Evan Co...
 
Net Services
Net ServicesNet Services
Net Services
 
C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)
C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)
C# - Azure, WP7, MonoTouch and Mono for Android (MonoDroid)
 
Jeffrey Richter
Jeffrey RichterJeffrey Richter
Jeffrey Richter
 
Web services - A Practical Approach
Web services - A Practical ApproachWeb services - A Practical Approach
Web services - A Practical Approach
 
REST to JavaScript for Better Client-side Development
REST to JavaScript for Better Client-side DevelopmentREST to JavaScript for Better Client-side Development
REST to JavaScript for Better Client-side Development
 
Windows Azure架构探析
Windows Azure架构探析Windows Azure架构探析
Windows Azure架构探析
 
Edge architecture ieee international conference on cloud engineering
Edge architecture   ieee international conference on cloud engineeringEdge architecture   ieee international conference on cloud engineering
Edge architecture ieee international conference on cloud engineering
 
Am 04 track1--salvatore orlando--openstack-apac-2012-final
Am 04 track1--salvatore orlando--openstack-apac-2012-finalAm 04 track1--salvatore orlando--openstack-apac-2012-final
Am 04 track1--salvatore orlando--openstack-apac-2012-final
 

Recently uploaded

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central BankingThe Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central BankingSelcen Ozturkcan
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 

Recently uploaded (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central BankingThe Evolution of Money: Digital Transformation and CBDCs in Central Banking
The Evolution of Money: Digital Transformation and CBDCs in Central Banking
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 

Netflix API Performance and Fault Tolerance Strategies

  • 1. Performance and Fault Tolerance for the Netflix API Ben Christensen Software Engineer – API Platform at Netflix @benjchristensen http://www.linkedin.com/in/benjchristensen http://techblog.netflix.com/
  • 2. Netflix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  • 3. Netflix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  • 4. Dozens of dependencies. One going bad takes everything down. 99.99%30 = 99.7% uptime 0.3% of 1 billion = 3,000,000 failures 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse.
  • 5.
  • 6.
  • 7.
  • 8. No single dependency should take down the entire app. Fallback. Fail silent. Fail fast. Shed load.
  • 9. Options Aggressive Network Timeouts Semaphores (Tryable) Separate Threads Circuit Breaker
  • 10.
  • 11. Tryable semaphores for “trusted” clients and fallbacks Separate threads for “untrusted” clients Aggressive timeouts on threads and network calls to “give up and move on” Circuit breakers as the “release valve”
  • 12.
  • 13.
  • 14.
  • 15.
  • 16. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads Thread-pool Queue size: 5-10 (0 doesn't work but get close to it) Thread-pool Size + Queue Size Queuing is Not Free
  • 17. Cost of Thread @ 75rps median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited
  • 19. Netflix DependencyCommand Implementation Fallbacks Cache Eventual Consistency Stubbed Data Empty Response
  • 21. So, how does it work in the real world?
  • 22. Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2) Video available at https://vimeo.com/33576628
  • 23. Rolling 10 second counters 1 minute latency percentiles 2 minute rate change circle color and size represent health and traffic volume
  • 24. API Daily Incoming vs Outgoing Weekend Weekend Weekend 8-10 Billion DependencyCommand Executions (threaded) 1.2 - 1.6 Billion Incoming Requests
  • 25. API Hourly Incoming vs Outgoing Peak at 700M+ threaded DependencyCommand executions (200k+/second) Peak at 100M+ incoming requests (30k+/second)
  • 26.
  • 27. Fallback. Fail silent. Fail fast. Shed load.
  • 28. Netflix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  • 29. Single Network Request from Clients (use LAN instead of WAN) Send Only The Bytes That Matter (optimize responses for each client) Leverage Concurrency (but abstract away its complexity)
  • 30. Single Network Request from Clients (use LAN instead of WAN) Device Server Netflix API landing page requires ~dozen API requests
  • 31. Single Network Request from Clients (use LAN instead of WAN) some clients are limited in the number of concurrent network connections
  • 32. Single Network Request from Clients (use LAN instead of WAN) network latency makes this even worse (mobile, home, wifi, geographic distance, etc)
  • 33. Single Network Request from Clients (use LAN instead of WAN) Device Server Netflix API push call pattern to server ...
  • 34. Single Network Request from Clients (use LAN instead of WAN) Device Server Netflix API ... and eliminate redundant calls
  • 35. Send Only The Bytes That Matter (optimize responses for each client) Netflix API Device Server Client Client part of client now on server
  • 36. Send Only The Bytes That Matter (optimize responses for each client) Netflix API Device Server Client Client client retrieves and delivers exactly what their device needs in its optimal format
  • 37. Send Only The Bytes That Matter (optimize responses for each client) Device Server Netflix API Service Layer Client Client interface is now a Java API that client interacts with at a granular level
  • 38. Leverage Concurrency (but abstract away its complexity) Device Server Netflix API Service Layer Client Client
  • 39. Leverage Concurrency (but abstract away its complexity) Device Server Netflix API Service Layer Client Client no synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code
  • 40. Leverage Concurrency (but abstract away its complexity) Service calls are def video1Call = api.getVideos(api.getUser(), 123456, 7891234); all asynchronous def video2Call = api.getVideos(api.getUser(), 6789543); // higher-order functions used to compose asynchronous calls together wx.merge(video1Call, video2Call).toList().subscribe([ Functional onNext: { programming listOfVideos -> with higher-order for(video in listOfVideos) { functions response.getWriter().println("video: " + video.id + " " + video.title); } }, onError: { exception -> response.setStatus(500); response.getWriter().println("Error: " + exception.getMessage()); } ]) Fully asynchronous API - Clients can’t block
  • 41. Device Server Netflix API Optimize for each device. Leverage the server.
  • 42. Netflix is Hiring http://jobs.netflix.com Fault Tolerance in a High Volume, Distributed System http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html Making the Netflix API More Resilient http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html Why REST Keeps Me Up At Night http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/ Ben Christensen @benjchristensen http://www.linkedin.com/in/benjchristensen

Editor's Notes

  1. \n
  2. The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming. \n\nMore than 1 billion incoming calls per day are received which in turn fans out to several billion outgoing calls (averaging a ratio of 1:7) to dozens of underlying subsystems with peaks of over 200k dependency requests per second. \n
  3. First half of the presentation discusses resilience engineering implemented to handle failure and latency at the integration points with the various dependencies. \n
  4. Even when all dependencies are performing well the aggregate impact of even 0.01% downtime on each of dozens of services equates to potentially hours a month of downtime if not engineered for resilience. \n
  5. \n
  6. \n
  7. \n
  8. It is a requirement of high volume, high availability applications to build fault and latency tolerance into their architecture and not expect infrastructure to solve it for them. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. Sample of 1 dependency circuit for 12 hours from production cluster with a rate of 75rps on a single server. \n\nEach execution occurs in a separate thread with median, 90th and 99th percentile latencies shown in the first 3 legend values. \n\nThe calling thread median, 90th and 99th percentiles are the last 3 legend values. \n\nThus, the median cost of the thread is 1.62ms - 1.57ms = 0.05ms, at the 90th it is 4.57-2.05 = 2.52ms. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. Second half of the presentation discusses architectural changes to enable optimizing the API for each Netflix device as opposed to a generic one-size-fits-all API which treats all devices the same. \n
  29. Netflix has over 800 unique devices that fall into several dozens classes with unique user experiences, different calling patterns, capabilities and needs from the data and thus the API. \n
  30. The one-size-fits-all API results in chatty clients, some requiring ~dozen requests to render a page. \n
  31. \n
  32. \n
  33. The client should make a single request and push the 'chatty' part to the server where low-latency networks and multi-core servers can perform the work far more efficiently. \n
  34. \n
  35. The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server which then can access a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the devices exact requirements and user experience. \n
  36. The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server which then can access a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the devices exact requirements and user experience. \n
  37. The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server which then can access a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the devices exact requirements and user experience. \n
  38. Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using high-order-functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support that lends itself well to the functional programming style. \n
  39. Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using high-order-functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support that lends itself well to the functional programming style. \n
  40. Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using high-order-functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support that lends itself well to the functional programming style. \n
  41. The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.\n
  42. \n