SlideShare une entreprise Scribd logo
1  sur  97
Headline
Suudhan Rangarajan (@suudhan)
Senior Software Engineer
Netflix Play API
Why we built an Evolutionary
Architecture
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-play-api
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
Headline
Suudhan Rangarajan (@suudhan)
Senior Software Engineer
Netflix Play API
Why we built an Evolutionary
Architecture
Previous Architecture Workflow
Sign-up
Content
Discovery
Playback
API Service
← Services hosted in AWS →Devices
Domain specific
Microservices
API Proxy
Service
Signup Workflow
← Services hosted in AWS →Devices
Signup API
Sign-up
Content
Discovery
Playback
Domain specific
Microservices
API Proxy
Service
API Service
Content Discovery Workflow
← Services hosted in AWS →Devices
Discovery
API
Sign-up
Content
Discovery
Playback
Domain specific
Microservices
API Proxy
Service
API Service
Playback Workflow
← Services hosted in AWS →Devices
Play API
Sign-up
Content
Discovery
Playback
Domain specific
Microservices
API Proxy
Service
API Service
Previous Architecture
← Services hosted in AWS →Devices
Signup API
Discovery
API
Play API
Sign-up
Content
Discovery
Playback
Domain specific
Microservices
API Proxy
Service
API Service
Identity
Type 1/2 Decisions
Evolvability
Identity
Type 1/2 Decisions
Evolvability
Start with WHY: Ask why your
service exists
Lead the Internet TV revolution to
entertain billions of people across the
world
P
Maximize customer engagement
from signup to streaming
P
Enable acquisition, discovery,
playback functionality 24/7
API Identity: Deliver Acquisition,
Discovery and Playback functions
with high availability
Single Responsibility Principle: Be wary
of multiple-identities rolled up into a
single service
One API Service
Signup API
Discovery
API
Play API
Signup API
Discovery
API
Play API
API Service Per
function
Previous Architecture Current Architecture
Lead the Internet TV revolution to
entertain billions of people across the
world
P
Maximize user engagement of
Netflix customer from signup to
streaming
P
Enable non-member, discovery,
playback functionality 24/7
P
Deliver Playback
Lifecycle 24/7
Decide best
playback
experience
Track events
to measure
playback
experience
Authorize
playback
experience
Play API
Devices
API Proxy
Service
Decide best
playback
experience
Track events
to measure
playback
experience
Authorize
playback
experience
Devices
API Proxy
Service
High Coupling,
Low Evolvability
Play API Identity: Orchestrate
Playback Lifecycle with stable
abstractions
Guiding Principle: We believe in a simple
singular identity for our services. The
identity relates to and complements the
identities of the company, organization,
team and its peer services
Identity
Type 1/2 Decisions
Evolvability
“Some decisions are consequential and irreversible or nearly
irreversible – one-way doors – and these decisions must be made
methodically, carefully, slowly, with great deliberation and
consultation [...] We can call these Type 1 decisions…”
Quote from Jeff Bezos
“...But most decisions aren’t like that – they are changeable,
reversible – they’re two-way doors. If you’ve made a suboptimal
Type 2 decision, you don’t have to live with the consequences for
that long [...] Type 2 decisions can and should be made quickly by
high judgment individuals or small groups.”
Quote from Jeff Bezos
Three Type 1 Decisions to Consider
Synchronous &
Asynchronous
Data ArchitectureAppropriate
Coupling
Two types of Shared Libraries
Play API Service
Utilities
cache
Metrics
Shared
Libraries with
common
functions
Client Libraries
used for
inter-service
communications
Client 1
Client 2
Client 3
“Thick” shared
libraries with 100s of
dependent libraries
(e.g. utilities jar)
Previous Architecture
1) Binary Coupling
Hundreds of
shared libraries
spanning services
across network
boundaries
Previous Architecture
Binary coupling => Distributed
Monolith
Utilities Utilities
Utilities
Service1 Service2
Service3
“The evils of too much coupling between
services are far worse than the problems
caused by code duplication”
- Sam Newman (Building
Microservices)
Play API Service
Playback Decision
Service
Playback
Decision
Client
Previous Architecture
Requests Per
Second of API
Service
Increase in
Latencies
from the API
Service
Execution of
Fallback via
Play Decision
Client
Clients with heavy Fallbacks
Play API Service
Playback Decision
Service
Playback
Decision
Client
Previous Architecture
2) Operational Coupling
“Operational Coupling” might be an
ok choice, if some services/teams are
not yet ready to own and operate a
highly available service.
Many of the client
libraries had the
potential to bring down
the API Service
Previous Architecture
Operational Coupling impacts
Availability
Play API Service
Play API Service
Playback Decisions
Serviceclient
Java Java
Previous Architecture
3) Language Coupling
Play API Service
client
REST over HTTP 1.1
● Unidirectional
(Request/ Response
type APIs)
Previous Architecture
Playback Decisions
Service
Jersey Framework
Communication Protocol
Requirements
Operationally “thin” Clients No or limited shared libraries
Auto-generated clients for
Polyglot support
Bi-Directional Communication
● At Netflix, most use-cases were modelled as Request/Response
○ REST was a simple and easy way of communicating between services; so
choice of REST was more incidental rather than intentional
● Most of the services were not following RESTful principles.
○ The URL didn’t represent a unique resource, instead the parameters passed
in the call determined the response - effectively made them a RPC call
● So we were agnostic to REST vs RPC as long as it meets our requirements
REST vs RPC
Previous Architecture Current Architecture
Play API Service
Playback
Decisions
Playback
Authorize
Playback
Events
Playback
Decisions
Playback
Authorize
Playback
Events
1) Operationally Coupled Clients
2) High Binary Coupling
3) Only Java
4) Unidirectional communication
Play API Service
1) Minimal Operational Coupling
2) Limited Binary Coupling
3) Beyond Java
4) Beyond Request/ Response
gRPC/
HTTP2REST/
HTTP1
Consider “thin” auto-generated clients
with bi-directional communication and
minimize code reuse across service
boundaries
Type 1 Decision: Appropriate Coupling
Three Type 1 Decisions to Consider
Synchronous vs
Asynchronous
Data ArchitectureAppropriate
Coupling
PlayData getPlayData(string customerId, string titleId,
string deviceId){
CustomerInfo custInfo = getCustomerInfo(customerId);
DeviceInfo deviceInfo = getDeviceInfo(deviceId);
PlayData playdata = decidePlayData(custInfo,
deviceInfo, titleId);
return playdata;
}
Request Handler
Thread pool Client Thread pool
Typical Synchronous Architecture
Request Handler
Thread pool Client Thread pool
getPlayData
getCustomerInfo
decidePlayData
Return
One thread per request
Typical Synchronous Architecture
getDeviceInfo
Customer Service
Device Service
Play Data Decision
Service
Request Handler
Thread pool Client Thread pool
getPlayData
getCustomerInfo
decidePlayData
Return
One thread per request
Typical Synchronous Architecture
getDeviceInfo
Customer Service
Device Service
Play Data Decision
Service
Blocking Request Handler Blocking Client I/O
Request Handler
Thread pool Client Thread pool
getPlayData
getCustomerInfo
decidePlayData
Return
One thread per request
Typical Synchronous Architecture
getDeviceInfo
Blocking Request Handler Blocking Client I/O
Works for Simple
Request/Response
Works for Limited
Clients
Beyond Request/Response
One Request - One Response
Request Play-data for Title X
Receive Play-data for Title X
One Request - Stream Response
Request Play-data for Titles X,Y,Z
Receive Play-data for Title X
Receive Play-data for Title Y
Receive Play-data for Title Z
Stream Request - One Response
Request Play-data for Title X
Request Play-data for Title Y
Request Play-data for Title Z
Receive Play-data for Titles X,Y,Z
Stream Request - Stream Response
Request Play-data for Title X
Request Play-data for Title Y
Receive Play-data for Title X
Get Play-data for Title Z
Receive Play-data for Title Y
Receive Play-data for Title Z
Request/Response
Event Loop
Outgoing Event Loop
per client
Worker Threads
Asynchronous Architecture
PlayData getPlayData(string customerId, string titleId,
string deviceId){
Zip(getCustomerInfo(customerId),
getDeviceInfo(deviceId),
(custInfo, deviceInfo) ->
return decidePlayData(custInfo, deviceInfo,
titleId)
);
}
Request/Response
Event Loop
Outgoing Event Loop
per clientWorkflow spans many
worker threads
Asynchronous Architecture
Customer Service
Device Service
PlayData Service
setup
Request/Response
Event Loop
Outgoing Event Loop
per clientWorkflow spans many
worker threads
Asynchronous Architecture
Customer Service
Device Service
PlayData Service
getCustomerInfo
Request/Response
Event Loop
Outgoing Event Loop
per clientWorkflow spans many
worker threads
Asynchronous Architecture
Customer Service
Device Service
PlayData Service
getDeviceInfo
Request/Response
Event Loop
Outgoing Event Loop
per clientWorkflow spans many
worker threads
Asynchronous Architecture
Customer Service
Device Service
PlayData Service
zip
Request/Response
Event Loop
Outgoing Event Loop
per clientWorkflow spans many
worker threads
Asynchronous Architecture
Customer Service
Device Service
PlayData Service
decidePlayData
● All context is passed as messages from one processing unit to
another.
● If we need to follow and reason about a request, we need to build
tools to capture and reassemble the order of execution units
● None of the calls can block
Workflow spans multiple threads
Request/Response
Event Loop
Outgoing Event Loop
per client
Worker Threads
Asynchronous Architecture
Asynchronous Request Handler Non-Blocking I/O
Synchrony
Ask: Do you really have a need
beyond Request/Response?
Network Event Loop
Outgoing Event Loop
per client
Dedicated thread
Synchronous Execution + Asynchronous I/O
Blocking Request Handler Non-Blocking I/O
Current Architecture
getPlayData
getCustomerInfo
decidePlayData
Return
getDeviceInfo
If most of your APIs fit the
Request/Response pattern, consider a
synchronous request handler, with
nonblocking I/O
Type 1 Decision: Synchronous vs
Asynchronous
Three Type 1 Decisions to Consider
Synchronous
vs
Asynchronous
Data ArchitectureAppropriate
Coupling
Without an intentional Data
Architecture, Data becomes its
own monolith
Previous Architecture
What a Data Monolith looks like
Data Source
Data Source
Data Source
Service 1
Service 2
Service 3
Service 4
4 GB
1 GB
2 GB
400 MB
600 MB
API Service
← Multiple Data sources loaded in memory →
←MemoryLoad→
Previous Architecture
What a Data Monolith looks like
4 GB
1 GB
2 GB
400 MB
600 MB
API Service
Very small percentage of data
actually accessed
Previous Architecture
What a Data Monolith looks like
API Service
Each Data Source models gets
coupled across classes and libraries
Previous Architecture
What a Data Monolith looks like
API Service
Unpredictable Performance
Characteristics
Data
Update
CPU Utilization
Previous Architecture
What a Data Monolith looks like
What a Data Monolith looks like
API Service
Potential to bring down the service
Data
Update
Netflix was
down
Previous Architecture
"All problems in computer science can be
solved by another level of indirection."
David Wheeler
(World’s first Comp Sci PhD)
Current Architecture
Data Source
Data Source
Data Source
Data Source
Data Source
Data
Loader
Data
ServicePlay API Service
Data
Store
Materialized View
Current Architecture
Data Source
Data Source
Data Source
Data Source
Data Source
Data
Loader
Data
Service
Uses only the data
it needs
Predictable
Operational
Characteristics
Reduced
Dependency chain
Data
StorePlay API Service
Materialized View
Isolate Data from the Service. At the
very least, ensure that data sources
are accessed via a layer of
abstraction, so that it leaves room for
extension later
Type 1 Decision: Data Architecture
Three Type 1 Decisions to Consider
Synchrony Data ArchitectureAppropriate
Coupling
For Type 2 decisions, choose a path,
experiment and iterate
Guiding Principle: Identify your Type 1
and Type 2 decisions; Spend 80% of your
time debating and aligning on Type 1
Decisions
Identity
Type 1/2 Decisions
Evolvability
An Evolutionary Architecture
supports guided and incremental
change as first principle among
multiple dimensions
- ThoughtWorks
Choosing a microservices architecture
with appropriate coupling allows us to
evolve across multiple dimensions
How evolvable are the Type 1 decisions
Change Play API
Current
Architecture
Previous
Architecture
Asynchronous?
Polyglot services?
Bidirectional APIs?
Additional Data
Sources?
Known
Unknowns
Potential Type 1 decisions in the
future?
Change Play API
Current
Architecture
Previous
Architecture
Containers?
Serverless?
?
?
And we fully expect that there will
be Unknown Unknowns
As we evolve, how to ensure we are
not breaking our original goals?
Use Fitness Functions to guide
change
High Availability Low Latency
Simplicity
Reliability
High
Throughput
Observability Developer
Productivity
Continuous
Integration
Scalable
Evolvability
High Availability Low Latency
Simplicity
Reliability
High
Throughput
Observability Developer
Productivity
Continuous
Integration
Scalable
Evolvability
1
2
3
4
Why Simplicity over Reliability?
Increase in
Operational
Complexity Reliable
Fallback when
service is
down
Why Scalability over Throughput?
New
instances
were added
Increase in
Errors due to
cache
warming
Why Observability over Latency?
Decrease in latency
by using a fully
async executor
Cost of Async: Loss
in Observability
Four 9s of availability
Thin
Clients
P99
latencyResilience
to failures
Merge to
Deploy
Time
1
2
3
Guiding Principle: Define Fitness
functions to act as your guide for
architectural evolution
Previous Architecture Current Architecture
Operational Coupling
Binary Coupling
Only Java
Synchronous
communication
Data Monolith
Operational Isolation
No Binary Coupling
Beyond Java
Asynchronous
communication
Explicit Data
Architecture
Guided Fitness
Functions
Multiple Identities
Singular Identities
● No incidents in a
year
● 4.5 deployments
per week
● Just two rollbacks!
Identity
Type 1/2 Decisions
Evolvability
Build a Evolutionary Architecture
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
netflix-play-api

Contenu connexe

Plus de C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

Plus de C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Dernier

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Dernier (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Netflix Play API - An Evolutionary Architecture

  • 1. Headline Suudhan Rangarajan (@suudhan) Senior Software Engineer Netflix Play API Why we built an Evolutionary Architecture
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-play-api
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4.
  • 5.
  • 6.
  • 7. Headline Suudhan Rangarajan (@suudhan) Senior Software Engineer Netflix Play API Why we built an Evolutionary Architecture
  • 8. Previous Architecture Workflow Sign-up Content Discovery Playback API Service ← Services hosted in AWS →Devices Domain specific Microservices API Proxy Service
  • 9. Signup Workflow ← Services hosted in AWS →Devices Signup API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  • 10. Content Discovery Workflow ← Services hosted in AWS →Devices Discovery API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  • 11. Playback Workflow ← Services hosted in AWS →Devices Play API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  • 12. Previous Architecture ← Services hosted in AWS →Devices Signup API Discovery API Play API Sign-up Content Discovery Playback Domain specific Microservices API Proxy Service API Service
  • 15. Start with WHY: Ask why your service exists
  • 16. Lead the Internet TV revolution to entertain billions of people across the world P Maximize customer engagement from signup to streaming P Enable acquisition, discovery, playback functionality 24/7
  • 17. API Identity: Deliver Acquisition, Discovery and Playback functions with high availability
  • 18. Single Responsibility Principle: Be wary of multiple-identities rolled up into a single service
  • 19. One API Service Signup API Discovery API Play API Signup API Discovery API Play API API Service Per function Previous Architecture Current Architecture
  • 20. Lead the Internet TV revolution to entertain billions of people across the world P Maximize user engagement of Netflix customer from signup to streaming P Enable non-member, discovery, playback functionality 24/7 P Deliver Playback Lifecycle 24/7
  • 21. Decide best playback experience Track events to measure playback experience Authorize playback experience Play API Devices API Proxy Service
  • 22. Decide best playback experience Track events to measure playback experience Authorize playback experience Devices API Proxy Service High Coupling, Low Evolvability
  • 23.
  • 24. Play API Identity: Orchestrate Playback Lifecycle with stable abstractions
  • 25. Guiding Principle: We believe in a simple singular identity for our services. The identity relates to and complements the identities of the company, organization, team and its peer services
  • 27. “Some decisions are consequential and irreversible or nearly irreversible – one-way doors – and these decisions must be made methodically, carefully, slowly, with great deliberation and consultation [...] We can call these Type 1 decisions…” Quote from Jeff Bezos
  • 28. “...But most decisions aren’t like that – they are changeable, reversible – they’re two-way doors. If you’ve made a suboptimal Type 2 decision, you don’t have to live with the consequences for that long [...] Type 2 decisions can and should be made quickly by high judgment individuals or small groups.” Quote from Jeff Bezos
  • 29. Three Type 1 Decisions to Consider Synchronous & Asynchronous Data ArchitectureAppropriate Coupling
  • 30. Two types of Shared Libraries Play API Service Utilities cache Metrics Shared Libraries with common functions Client Libraries used for inter-service communications Client 1 Client 2 Client 3
  • 31. “Thick” shared libraries with 100s of dependent libraries (e.g. utilities jar) Previous Architecture 1) Binary Coupling
  • 32. Hundreds of shared libraries spanning services across network boundaries Previous Architecture Binary coupling => Distributed Monolith Utilities Utilities Utilities Service1 Service2 Service3
  • 33. “The evils of too much coupling between services are far worse than the problems caused by code duplication” - Sam Newman (Building Microservices)
  • 34. Play API Service Playback Decision Service Playback Decision Client Previous Architecture
  • 35. Requests Per Second of API Service Increase in Latencies from the API Service Execution of Fallback via Play Decision Client Clients with heavy Fallbacks
  • 36. Play API Service Playback Decision Service Playback Decision Client Previous Architecture 2) Operational Coupling
  • 37. “Operational Coupling” might be an ok choice, if some services/teams are not yet ready to own and operate a highly available service.
  • 38. Many of the client libraries had the potential to bring down the API Service Previous Architecture Operational Coupling impacts Availability Play API Service
  • 39. Play API Service Playback Decisions Serviceclient Java Java Previous Architecture 3) Language Coupling
  • 40. Play API Service client REST over HTTP 1.1 ● Unidirectional (Request/ Response type APIs) Previous Architecture Playback Decisions Service Jersey Framework Communication Protocol
  • 41. Requirements Operationally “thin” Clients No or limited shared libraries Auto-generated clients for Polyglot support Bi-Directional Communication
  • 42. ● At Netflix, most use-cases were modelled as Request/Response ○ REST was a simple and easy way of communicating between services; so choice of REST was more incidental rather than intentional ● Most of the services were not following RESTful principles. ○ The URL didn’t represent a unique resource, instead the parameters passed in the call determined the response - effectively made them a RPC call ● So we were agnostic to REST vs RPC as long as it meets our requirements REST vs RPC
  • 43.
  • 44. Previous Architecture Current Architecture Play API Service Playback Decisions Playback Authorize Playback Events Playback Decisions Playback Authorize Playback Events 1) Operationally Coupled Clients 2) High Binary Coupling 3) Only Java 4) Unidirectional communication Play API Service 1) Minimal Operational Coupling 2) Limited Binary Coupling 3) Beyond Java 4) Beyond Request/ Response gRPC/ HTTP2REST/ HTTP1
  • 45. Consider “thin” auto-generated clients with bi-directional communication and minimize code reuse across service boundaries Type 1 Decision: Appropriate Coupling
  • 46. Three Type 1 Decisions to Consider Synchronous vs Asynchronous Data ArchitectureAppropriate Coupling
  • 47. PlayData getPlayData(string customerId, string titleId, string deviceId){ CustomerInfo custInfo = getCustomerInfo(customerId); DeviceInfo deviceInfo = getDeviceInfo(deviceId); PlayData playdata = decidePlayData(custInfo, deviceInfo, titleId); return playdata; }
  • 48. Request Handler Thread pool Client Thread pool Typical Synchronous Architecture
  • 49. Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request Typical Synchronous Architecture getDeviceInfo Customer Service Device Service Play Data Decision Service
  • 50. Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request Typical Synchronous Architecture getDeviceInfo Customer Service Device Service Play Data Decision Service Blocking Request Handler Blocking Client I/O
  • 51. Request Handler Thread pool Client Thread pool getPlayData getCustomerInfo decidePlayData Return One thread per request Typical Synchronous Architecture getDeviceInfo Blocking Request Handler Blocking Client I/O Works for Simple Request/Response Works for Limited Clients
  • 52. Beyond Request/Response One Request - One Response Request Play-data for Title X Receive Play-data for Title X One Request - Stream Response Request Play-data for Titles X,Y,Z Receive Play-data for Title X Receive Play-data for Title Y Receive Play-data for Title Z Stream Request - One Response Request Play-data for Title X Request Play-data for Title Y Request Play-data for Title Z Receive Play-data for Titles X,Y,Z Stream Request - Stream Response Request Play-data for Title X Request Play-data for Title Y Receive Play-data for Title X Get Play-data for Title Z Receive Play-data for Title Y Receive Play-data for Title Z
  • 53. Request/Response Event Loop Outgoing Event Loop per client Worker Threads Asynchronous Architecture
  • 54. PlayData getPlayData(string customerId, string titleId, string deviceId){ Zip(getCustomerInfo(customerId), getDeviceInfo(deviceId), (custInfo, deviceInfo) -> return decidePlayData(custInfo, deviceInfo, titleId) ); }
  • 55. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service setup
  • 56. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service getCustomerInfo
  • 57. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service getDeviceInfo
  • 58. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service zip
  • 59. Request/Response Event Loop Outgoing Event Loop per clientWorkflow spans many worker threads Asynchronous Architecture Customer Service Device Service PlayData Service decidePlayData
  • 60. ● All context is passed as messages from one processing unit to another. ● If we need to follow and reason about a request, we need to build tools to capture and reassemble the order of execution units ● None of the calls can block Workflow spans multiple threads
  • 61. Request/Response Event Loop Outgoing Event Loop per client Worker Threads Asynchronous Architecture Asynchronous Request Handler Non-Blocking I/O
  • 62. Synchrony Ask: Do you really have a need beyond Request/Response?
  • 63. Network Event Loop Outgoing Event Loop per client Dedicated thread Synchronous Execution + Asynchronous I/O Blocking Request Handler Non-Blocking I/O Current Architecture getPlayData getCustomerInfo decidePlayData Return getDeviceInfo
  • 64. If most of your APIs fit the Request/Response pattern, consider a synchronous request handler, with nonblocking I/O Type 1 Decision: Synchronous vs Asynchronous
  • 65. Three Type 1 Decisions to Consider Synchronous vs Asynchronous Data ArchitectureAppropriate Coupling
  • 66. Without an intentional Data Architecture, Data becomes its own monolith
  • 67. Previous Architecture What a Data Monolith looks like Data Source Data Source Data Source Service 1 Service 2 Service 3 Service 4
  • 68. 4 GB 1 GB 2 GB 400 MB 600 MB API Service ← Multiple Data sources loaded in memory → ←MemoryLoad→ Previous Architecture What a Data Monolith looks like
  • 69. 4 GB 1 GB 2 GB 400 MB 600 MB API Service Very small percentage of data actually accessed Previous Architecture What a Data Monolith looks like
  • 70. API Service Each Data Source models gets coupled across classes and libraries Previous Architecture What a Data Monolith looks like
  • 71. API Service Unpredictable Performance Characteristics Data Update CPU Utilization Previous Architecture What a Data Monolith looks like
  • 72. What a Data Monolith looks like API Service Potential to bring down the service Data Update Netflix was down Previous Architecture
  • 73. "All problems in computer science can be solved by another level of indirection." David Wheeler (World’s first Comp Sci PhD)
  • 74. Current Architecture Data Source Data Source Data Source Data Source Data Source Data Loader Data ServicePlay API Service Data Store Materialized View
  • 75. Current Architecture Data Source Data Source Data Source Data Source Data Source Data Loader Data Service Uses only the data it needs Predictable Operational Characteristics Reduced Dependency chain Data StorePlay API Service Materialized View
  • 76. Isolate Data from the Service. At the very least, ensure that data sources are accessed via a layer of abstraction, so that it leaves room for extension later Type 1 Decision: Data Architecture
  • 77. Three Type 1 Decisions to Consider Synchrony Data ArchitectureAppropriate Coupling
  • 78. For Type 2 decisions, choose a path, experiment and iterate
  • 79. Guiding Principle: Identify your Type 1 and Type 2 decisions; Spend 80% of your time debating and aligning on Type 1 Decisions
  • 81. An Evolutionary Architecture supports guided and incremental change as first principle among multiple dimensions - ThoughtWorks
  • 82. Choosing a microservices architecture with appropriate coupling allows us to evolve across multiple dimensions
  • 83. How evolvable are the Type 1 decisions Change Play API Current Architecture Previous Architecture Asynchronous? Polyglot services? Bidirectional APIs? Additional Data Sources? Known Unknowns
  • 84. Potential Type 1 decisions in the future? Change Play API Current Architecture Previous Architecture Containers? Serverless? ? ? And we fully expect that there will be Unknown Unknowns
  • 85. As we evolve, how to ensure we are not breaking our original goals?
  • 86. Use Fitness Functions to guide change
  • 87. High Availability Low Latency Simplicity Reliability High Throughput Observability Developer Productivity Continuous Integration Scalable Evolvability
  • 88. High Availability Low Latency Simplicity Reliability High Throughput Observability Developer Productivity Continuous Integration Scalable Evolvability 1 2 3 4
  • 89. Why Simplicity over Reliability? Increase in Operational Complexity Reliable Fallback when service is down
  • 90. Why Scalability over Throughput? New instances were added Increase in Errors due to cache warming
  • 91. Why Observability over Latency? Decrease in latency by using a fully async executor Cost of Async: Loss in Observability
  • 92. Four 9s of availability Thin Clients P99 latencyResilience to failures Merge to Deploy Time 1 2 3
  • 93. Guiding Principle: Define Fitness functions to act as your guide for architectural evolution
  • 94. Previous Architecture Current Architecture Operational Coupling Binary Coupling Only Java Synchronous communication Data Monolith Operational Isolation No Binary Coupling Beyond Java Asynchronous communication Explicit Data Architecture Guided Fitness Functions Multiple Identities Singular Identities
  • 95. ● No incidents in a year ● 4.5 deployments per week ● Just two rollbacks!
  • 96. Identity Type 1/2 Decisions Evolvability Build a Evolutionary Architecture
  • 97. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ netflix-play-api