SlideShare une entreprise Scribd logo
1  sur  87
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Scaling Push Messaging for Millions
of Netflix Devices
Susheel Aroskar
Senior Software Engineer
Netflix Inc/Edge Gateway
8 9 7 1 5
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Why do we need push?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How I spend my time in Netflix application
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• What is Push?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• What is Push?
• How you can build it?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• What is Push?
• How you can build it?
• How you can operate it?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• What is Push?
• How you can build it?
• How you can operate it?
• What can you do with it?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Susheel Aroskar
Senior Software Engineer
Cloud Gateway
saroskar@netflix.com
github.com/raksoras
@susheelaroskar
PERSIST
UNTIL
SOMETHING
HAPPPENS
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Zuul Push Architecture
Zuul Push
Servers
WebSockets
Zuul Push
Servers
SSE
Register
User
WebSockets SSE
Zuul Push
Servers
Push
Registry
Register
User
WebSockets SSE
Zuul Push
Servers
Push
Registry
Register
User
Register
User
WebSockets
Push Library
SSE
Zuul Push
Servers
Push
Registry
Register
User
Register
User
WebSockets
Push Library
Push
Message
Queue
SSE
Zuul Push
Servers
Push
Registry
Register
User
Register
User
WebSockets
Push Library Message
Processor
Push
Message
Queue
SSE
Zuul Push
Servers
Push
Registry
Register
User
Register
User
WebSockets
Push Library Message
Processor
Push
Message
Queue
SSE
Zuul Push
Servers
Push
Registry
Register
User
Register
User
Looku
p
Server
WebSockets
Push Library Message
Processor
Push
Message
Queue
SSE
Zuul Push
Servers
Push
Registry
Register
User
Register
User
Looku
p
Server
Deliver
Message
WebSockets
Push Library Message
Processor
Push
Message
Queue
SSE
Zuul Push
Servers
Push
Registry
Register
User
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Zuul Push Server
Handling millions of
persistent connections
C10K Challenge
Handling 10,000+ connections on a single box
Socket Socket
Thread per Connection Model
Thread-1 Thread-2
Read
Write
Write
Read
Socket Socket
Thread per Connection Model
Thread-1 Thread-2
Read
Write
Write
Read
Async I/O Model
Socket
read
callback
write
callback
Socket
Single
Threadread
callback
write
callback
Vs
S
O
C
K
E
T
Channel
Inbound
Handler
Channel
Inbound
Handler
Channel
Outbound
Handler
Channel
Outbound
Handler
Channel Pipeline
Head Tail
Typical Netty Program Structure
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Netty Channel Handler Pipeline
protected void addPushHandlers(ChannelPipeline pl) {
pl.addLast(new HttpServerCodec());
pl.addLast(new HttpObjectAggregator());
pl.addLast(getPushAuthHandler());
pl.addLast(new WebSocketServerCompressionHandler());
pl.addLast(new WebSocketServerProtocolHandler());
pl.addLast(getPushRegistrationHandler());
}
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Plug in your custom authentication policy
Authenticate by Cookies, JWT or any other custom scheme
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Push Registry
Tracking clients’ connection
metadata in real time
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Custom Push Registration Handler
public class MyRegistration extends PushRegistrationHandler {
@Override
protected void registerClient(
ChannelHandlerContext ctx,
PushUserAuth auth,
PushConnection conn,
PushConnectionRegistry registry) {
super.registerClient(ctx, authEvent, conn, registry);
ctx.executor().submit(() -> storeInRedis(auth));
}
}
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Push Registry Features Checklist
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Push Registry Features Checklist
• Low read latency
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Push Registry Features Checklist
• Low read latency
• Automatic record expiry
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Push Registry Features Checklist
• Low read latency
• Automatic record expiry
• Sharding
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Push Registry Features Checklist
• Low read latency
• Automatic record expiry
• Sharding
• Replication
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Few Good Choices for Push Registry
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
We Use Dynomite for Our Push Registry
https://github.com/Netflix/dynomite
Redis
+ Auto-sharding
+ Read/Write quorum
+ Cross-region replication
Dynomite
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Message Processing
Queue, route and deliver
We use Kafka message
queues to decouple
message senders from
receivers
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Fire and Forget Message Delivery
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
We Deliver Messages Across Regions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Use Different Queues for Different Priorities
We run multiple message processor
instances in parallel to scale our
message processing throughput.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Operating Zuul Push
Different than REST of them
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Persistent Connections Make Zuul Push Server Stateful
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Persistent Connections Make Zuul Push Server Stateful
Long lived stable connections are great for client efficiency
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Persistent Connections Make Zuul Push Server Stateful
Long lived stable connections are great for client efficiency
but they are terrible for quick deploy or rollback
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
If You Love Your Clients Set Them Free
Tear down connections periodically
Randomize Each Connection’s Lifetime to
automatically dampen any reconnect storms
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reconnection Storm Under Randomized Connection
Lifetimes
Time
Numberofreconnects
Instead of Closing Connection From Server side,
Ask Client to Close it.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to Optimize the Push Server
Most connections are idle!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Use BIG Server to Handle Tons of Connections
ulimit -n 262144
net.ipv4.tcp_rmem="4096 87380
16777216"
net.ipv4.tcp_wmem="4096 87380
16777216"
-Xmx3g -Xms3g
-XX:MaxDirectMemorySize=256m
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Goldilocks Strategy - Just Right Server Size
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Optimize for Cost, NOT Instance Count
$$ $$
❌
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to Auto-scale Push Cluster?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to Auto-scale Push Cluster?
• Requests per second?
• CPU??
Amazon Autoscaling
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to Auto-scale Push Cluster?
Number of open connections per server
Amazon CloudWatch
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Using Classic Load Balancers with WebSockets
Classic Load Balancers do not proxy WebSockets
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Solution I - Run CLB as a TCP Load Balancer
7 Application
6 Presentation
5 Session
4. Transport
3. Network
2. Data Link
1. Physical
HTTP
TLS
TCP
IP
Ethernet
Layer 7 HTTP load balancer
Layer 4 TCP load balancer
OSI 7 network layers model HTTP
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Solution 1 - The Good
TLS Termination
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Solution 1 - The Good, The Bad
TLS Termination Cross Site Request Forgery
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Solution 1 - The Good, The Bad and The Ugly
TLS Termination Cross Site Request Forgery
Deregister == Disconnect
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Solution II - Use Application Load Balancers
Application Load Balancers can proxy WebSockets natively
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A Quick Recap of Push Operation Best Practices
• Recycle connections periodically
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A Quick Recap of Push Operation Best Practices
• Recycle connections periodically
• Randomize connection lifetime
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A Quick Recap of Push Operation Best Practices
• Recycle connections periodically
• Randomize connection lifetime
• More number of small servers >> few BIG servers
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A Quick Recap of Push Operation Best Practices
• Recycle connections periodically
• Randomize connection lifetime
• More number of small servers >> few BIG servers
• Auto-scale on number of open connections per box
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
A Quick Recap of Push Operation Best Practices
• Recycle connections periodically
• Randomize connection lifetime
• More number of small servers >> few BIG servers
• Auto-scale on number of open connections per box
• Either use ALB or CLB in TCP mode
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
If You Build It, They Will Push
Push messaging use cases
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Alexa + Netflix = Weekend Lost
Speech Recognition,
Lambda
“Alexa play Stranger Things”
Zuul Push
Alexa Voice Service
Trigger On-demand diagnostics
by sending a push message
Remote recovery by push message
User messaging
What will you
Use PUSH for?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
One Last Call to Action
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
One Last Call to Action - PULL!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
One Last Call to Action - PULL!
https://github.com/Netflix/zuul
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
In Conclusion
Push can make you
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
In Conclusion
Push can make you
Rich (in functionality),
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
In Conclusion
Push can make you
Rich (in functionality),
Thin (by getting rid of polling)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
In Conclusion
Push can make you
Rich (in functionality),
Thin (by getting rid of polling)
and Happy!
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Susheel Aroskar
saroskar@netflix.com
@susheelaroskar

Contenu connexe

Tendances

Kubernetes Workshop
Kubernetes WorkshopKubernetes Workshop
Kubernetes Workshoploodse
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREAraf Karsh Hamid
 
K8s on AWS - Introducing Amazon EKS
K8s on AWS - Introducing Amazon EKSK8s on AWS - Introducing Amazon EKS
K8s on AWS - Introducing Amazon EKSAmazon Web Services
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflixaspyker
 
Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...HostedbyConfluent
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes IntroductionPeng Xiao
 
Designing a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd productsDesigning a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd productsJulian Mazzitelli
 
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewOpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewMaría Angélica Bracho
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
Rancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep DiveRancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep DiveLINE Corporation
 
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...confluent
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack ArchitectureMirantis
 
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...Amazon Web Services
 
Kubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory GuideKubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory GuideBytemark
 
Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...
Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...
Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...Amazon Web Services
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetesRishabh Indoria
 
Monitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusMonitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusChandresh Pancholi
 

Tendances (20)

Kubernetes Workshop
Kubernetes WorkshopKubernetes Workshop
Kubernetes Workshop
 
Microservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SREMicroservices Docker Kubernetes Istio Kanban DevOps SRE
Microservices Docker Kubernetes Istio Kanban DevOps SRE
 
K8s on AWS - Introducing Amazon EKS
K8s on AWS - Introducing Amazon EKSK8s on AWS - Introducing Amazon EKS
K8s on AWS - Introducing Amazon EKS
 
Velocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ NetflixVelocity NYC 2016 - Containers @ Netflix
Velocity NYC 2016 - Containers @ Netflix
 
Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...Developing custom transformation in the Kafka connect to minimize data redund...
Developing custom transformation in the Kafka connect to minimize data redund...
 
Kubernetes Introduction
Kubernetes IntroductionKubernetes Introduction
Kubernetes Introduction
 
Designing a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd productsDesigning a complete ci cd pipeline using argo events, workflow and cd products
Designing a complete ci cd pipeline using argo events, workflow and cd products
 
Deep Dive Amazon EC2
Deep Dive Amazon EC2Deep Dive Amazon EC2
Deep Dive Amazon EC2
 
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless OverviewOpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
OpenShift Meetup - Tokyo - Service Mesh and Serverless Overview
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Rancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep DiveRancher 2.0 Technical Deep Dive
Rancher 2.0 Technical Deep Dive
 
Kubernetes Basics
Kubernetes BasicsKubernetes Basics
Kubernetes Basics
 
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
Apache Kafka vs. Integration Middleware (MQ, ETL, ESB) - Friends, Enemies or ...
 
Amazon EKS Deep Dive
Amazon EKS Deep DiveAmazon EKS Deep Dive
Amazon EKS Deep Dive
 
OpenStack Architecture
OpenStack ArchitectureOpenStack Architecture
OpenStack Architecture
 
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
Optimizing Network Performance for Amazon EC2 Instances (CMP308-R1) - AWS re:...
 
Kubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory GuideKubernetes for Beginners: An Introductory Guide
Kubernetes for Beginners: An Introductory Guide
 
Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...
Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...
Deep Dive on Amazon EC2 Instances & Performance Optimization Best Practices (...
 
Introduction to kubernetes
Introduction to kubernetesIntroduction to kubernetes
Introduction to kubernetes
 
Monitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheusMonitoring on Kubernetes using prometheus
Monitoring on Kubernetes using prometheus
 

Similaire à Scaling Push Messaging for Millions of Netflix Devices

[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...
[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...
[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...Amazon Web Services
 
Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...
Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...
Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...Amazon Web Services
 
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Amazon Web Services
 
Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...
Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...
Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...Amazon Web Services
 
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...Amazon Web Services
 
Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018
Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018
Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018Amazon Web Services
 
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018Amazon Web Services Korea
 
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Amazon Web Services
 
Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018
Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018
Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018AWS Germany
 
Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...
Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...
Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...Amazon Web Services
 
ServerlessConf 2018 Keynote - Debunking Serverless Myths
ServerlessConf 2018 Keynote - Debunking Serverless MythsServerlessConf 2018 Keynote - Debunking Serverless Myths
ServerlessConf 2018 Keynote - Debunking Serverless MythsTim Wagner
 
Building Microservices with the Twelve Factor App Pattern on AWS
Building Microservices with the Twelve Factor App Pattern on AWSBuilding Microservices with the Twelve Factor App Pattern on AWS
Building Microservices with the Twelve Factor App Pattern on AWSAmazon Web Services
 
A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018
A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018
A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018Amazon Web Services
 
ElastiCache & Redis: Database Week San Francisco
ElastiCache & Redis: Database Week San FranciscoElastiCache & Redis: Database Week San Francisco
ElastiCache & Redis: Database Week San FranciscoAmazon Web Services
 
ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...
ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...
ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...Tim Wagner
 
[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...
[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...
[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...Amazon Web Services
 
Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...
Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...
Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...Amazon Web Services
 
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfCome scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfAmazon Web Services
 

Similaire à Scaling Push Messaging for Millions of Netflix Devices (20)

[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...
[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...
[NEW LAUNCH!] How to Architect for Multi-Region Redundancy Using Anycast IPs ...
 
Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...
Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...
Get the Most out of Your Elastic Load Balancer for Different Workloads (NET31...
 
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
Scaling Up to Your First 10 Million Users (ARC205-R1) - AWS re:Invent 2018
 
Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...
Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...
Building Massively Parallel Event-Driven Architectures (SRV373-R1) - AWS re:I...
 
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
Vonage & Aspect: Transform Real-Time Communications & Customer Engagement (TL...
 
Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018
Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018
Day Two Operations of Kubernetes on AWS (GPSTEC309) - AWS re:Invent 2018
 
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
AWS 기반 Microservice 운영을 위한 데브옵스 사례와 Spinnaker 소개::김영욱::AWS Summit Seoul 2018
 
Microservices for Startups
Microservices for StartupsMicroservices for Startups
Microservices for Startups
 
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
Shift-Left SRE: Self-Healing with AWS Lambda Functions (DEV313-S) - AWS re:In...
 
Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018
Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018
Amazon Elastic Container Service for Kubernetes (Amazon EKS) I AWS Dev Day 2018
 
Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...
Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...
Building Microservices with the Twelve-Factor App Pattern - SRV346 - Chicago ...
 
ServerlessConf 2018 Keynote - Debunking Serverless Myths
ServerlessConf 2018 Keynote - Debunking Serverless MythsServerlessConf 2018 Keynote - Debunking Serverless Myths
ServerlessConf 2018 Keynote - Debunking Serverless Myths
 
Building Microservices with the Twelve Factor App Pattern on AWS
Building Microservices with the Twelve Factor App Pattern on AWSBuilding Microservices with the Twelve Factor App Pattern on AWS
Building Microservices with the Twelve Factor App Pattern on AWS
 
A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018
A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018
A Chronicle of Airbnb Architecture Evolution (ARC407) - AWS re:Invent 2018
 
ElastiCache & Redis: Database Week San Francisco
ElastiCache & Redis: Database Week San FranciscoElastiCache & Redis: Database Week San Francisco
ElastiCache & Redis: Database Week San Francisco
 
ElastiCache & Redis
ElastiCache & RedisElastiCache & Redis
ElastiCache & Redis
 
ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...
ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...
ServerlessConf 2018 Keynote - Debunking Serverless Myths (no video / detailed...
 
[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...
[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...
[NEW LAUNCH!] Introduction to AWS Global Accelerator (NET330) - AWS re:Invent...
 
Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...
Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...
Lessons Learned from a Large-Scale Legacy Migration with Sysco (STG311) - AWS...
 
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdfCome scalare da zero ai tuoi primi 10 milioni di utenti.pdf
Come scalare da zero ai tuoi primi 10 milioni di utenti.pdf
 

Dernier

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 

Dernier (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 

Scaling Push Messaging for Millions of Netflix Devices

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Scaling Push Messaging for Millions of Netflix Devices Susheel Aroskar Senior Software Engineer Netflix Inc/Edge Gateway 8 9 7 1 5
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Why do we need push?
  • 4.
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How I spend my time in Netflix application
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda • What is Push?
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda • What is Push? • How you can build it?
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda • What is Push? • How you can build it? • How you can operate it?
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda • What is Push? • How you can build it? • How you can operate it? • What can you do with it?
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Susheel Aroskar Senior Software Engineer Cloud Gateway saroskar@netflix.com github.com/raksoras @susheelaroskar
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Zuul Push Architecture
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Zuul Push Server Handling millions of persistent connections
  • 25. C10K Challenge Handling 10,000+ connections on a single box
  • 26. Socket Socket Thread per Connection Model Thread-1 Thread-2 Read Write Write Read
  • 27. Socket Socket Thread per Connection Model Thread-1 Thread-2 Read Write Write Read Async I/O Model Socket read callback write callback Socket Single Threadread callback write callback Vs
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Netty Channel Handler Pipeline protected void addPushHandlers(ChannelPipeline pl) { pl.addLast(new HttpServerCodec()); pl.addLast(new HttpObjectAggregator()); pl.addLast(getPushAuthHandler()); pl.addLast(new WebSocketServerCompressionHandler()); pl.addLast(new WebSocketServerProtocolHandler()); pl.addLast(getPushRegistrationHandler()); }
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Plug in your custom authentication policy Authenticate by Cookies, JWT or any other custom scheme
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Push Registry Tracking clients’ connection metadata in real time
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Custom Push Registration Handler public class MyRegistration extends PushRegistrationHandler { @Override protected void registerClient( ChannelHandlerContext ctx, PushUserAuth auth, PushConnection conn, PushConnectionRegistry registry) { super.registerClient(ctx, authEvent, conn, registry); ctx.executor().submit(() -> storeInRedis(auth)); } }
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Push Registry Features Checklist
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Push Registry Features Checklist • Low read latency
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Push Registry Features Checklist • Low read latency • Automatic record expiry
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Push Registry Features Checklist • Low read latency • Automatic record expiry • Sharding
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Push Registry Features Checklist • Low read latency • Automatic record expiry • Sharding • Replication
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Few Good Choices for Push Registry
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. We Use Dynomite for Our Push Registry https://github.com/Netflix/dynomite Redis + Auto-sharding + Read/Write quorum + Cross-region replication Dynomite
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Message Processing Queue, route and deliver
  • 41. We use Kafka message queues to decouple message senders from receivers
  • 42. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Fire and Forget Message Delivery
  • 43. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. We Deliver Messages Across Regions
  • 44. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Use Different Queues for Different Priorities
  • 45. We run multiple message processor instances in parallel to scale our message processing throughput.
  • 46. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Operating Zuul Push Different than REST of them
  • 47. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Persistent Connections Make Zuul Push Server Stateful
  • 48. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Persistent Connections Make Zuul Push Server Stateful Long lived stable connections are great for client efficiency
  • 49. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Persistent Connections Make Zuul Push Server Stateful Long lived stable connections are great for client efficiency but they are terrible for quick deploy or rollback
  • 50. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. If You Love Your Clients Set Them Free Tear down connections periodically
  • 51. Randomize Each Connection’s Lifetime to automatically dampen any reconnect storms
  • 52. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reconnection Storm Under Randomized Connection Lifetimes Time Numberofreconnects
  • 53. Instead of Closing Connection From Server side, Ask Client to Close it.
  • 54. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to Optimize the Push Server Most connections are idle!
  • 55. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Use BIG Server to Handle Tons of Connections ulimit -n 262144 net.ipv4.tcp_rmem="4096 87380 16777216" net.ipv4.tcp_wmem="4096 87380 16777216" -Xmx3g -Xms3g -XX:MaxDirectMemorySize=256m
  • 56. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 57. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 58. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Goldilocks Strategy - Just Right Server Size
  • 59. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Optimize for Cost, NOT Instance Count $$ $$ ❌
  • 60. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to Auto-scale Push Cluster?
  • 61. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to Auto-scale Push Cluster? • Requests per second? • CPU?? Amazon Autoscaling
  • 62. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to Auto-scale Push Cluster? Number of open connections per server Amazon CloudWatch
  • 63. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Using Classic Load Balancers with WebSockets Classic Load Balancers do not proxy WebSockets
  • 64. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Solution I - Run CLB as a TCP Load Balancer 7 Application 6 Presentation 5 Session 4. Transport 3. Network 2. Data Link 1. Physical HTTP TLS TCP IP Ethernet Layer 7 HTTP load balancer Layer 4 TCP load balancer OSI 7 network layers model HTTP
  • 65. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Solution 1 - The Good TLS Termination
  • 66. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Solution 1 - The Good, The Bad TLS Termination Cross Site Request Forgery
  • 67. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Solution 1 - The Good, The Bad and The Ugly TLS Termination Cross Site Request Forgery Deregister == Disconnect
  • 68. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Solution II - Use Application Load Balancers Application Load Balancers can proxy WebSockets natively
  • 69. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. A Quick Recap of Push Operation Best Practices • Recycle connections periodically
  • 70. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. A Quick Recap of Push Operation Best Practices • Recycle connections periodically • Randomize connection lifetime
  • 71. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. A Quick Recap of Push Operation Best Practices • Recycle connections periodically • Randomize connection lifetime • More number of small servers >> few BIG servers
  • 72. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. A Quick Recap of Push Operation Best Practices • Recycle connections periodically • Randomize connection lifetime • More number of small servers >> few BIG servers • Auto-scale on number of open connections per box
  • 73. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. A Quick Recap of Push Operation Best Practices • Recycle connections periodically • Randomize connection lifetime • More number of small servers >> few BIG servers • Auto-scale on number of open connections per box • Either use ALB or CLB in TCP mode
  • 74. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. If You Build It, They Will Push Push messaging use cases
  • 75. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Alexa + Netflix = Weekend Lost Speech Recognition, Lambda “Alexa play Stranger Things” Zuul Push Alexa Voice Service
  • 76. Trigger On-demand diagnostics by sending a push message
  • 77. Remote recovery by push message
  • 79. What will you Use PUSH for?
  • 80. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. One Last Call to Action
  • 81. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. One Last Call to Action - PULL!
  • 82. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. One Last Call to Action - PULL! https://github.com/Netflix/zuul
  • 83. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. In Conclusion Push can make you
  • 84. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. In Conclusion Push can make you Rich (in functionality),
  • 85. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. In Conclusion Push can make you Rich (in functionality), Thin (by getting rid of polling)
  • 86. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. In Conclusion Push can make you Rich (in functionality), Thin (by getting rid of polling) and Happy!
  • 87. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Susheel Aroskar saroskar@netflix.com @susheelaroskar

Notes de l'éditeur

  1. arc334
  2. Imagine it’s Friday night of this week. The conference is over, you are back home, sitting on your favorite couch, ready to unwind and you start Netflix. At least I hope you do :)
  3. This is the first thing you see as soon as you start Netflix. Interesting thing about this list is that it’s not static or universal. It is personalized to your taste. There are hundred and twenty five million versions of this list. One for each of our hundred and twenty five million members. But this one is mine. Personalized to my taste. Which I just realized is filled with crime shows. Let’s not read anything specific into that. Moving on…
  4. Raise your hand if you actually start watching something within a minute or two after seeing that list. Yeah, me neither! Most of us spend considerable amount of time on this screen scrolling, trying to pick something to watch. This behavior is actually relevant to our discussion. Let’s say 20 minutes later you are still browsing the list. Meanwhile, our personalization algorithms are continuously running. So during those 20 minutes we could generate a new, better personalized list of shows for you in the cloud. If that does happen, how do we get that new list in front of you? How do we tell our application that a new list is ready for it to download? Push messaging is a perfect solution for situations like this. Our old app polled our server periodically for new recommendations. It kinda worked but it was both wasteful and not that great latency wise. What’s worse is these twin goals of server efficiency and freshness of UI directly contradict each other. If you make polling interval too low to get freshest UI you are putting more load on your servers and if you increase polling interval to help your servers, the freshness of your UI suffers. Now our server just pushes the new list to the client. Just as one data point, we cut down total number of requests to our website by 12% when we shifted our browser app from polling to push. At more than million requests per second those 12% add up really fast! So please ignore all push messages on your phones for next 40 minutes because we are going to talk about push messaging now. Push notifications may be terrible for conference speakers but background push messages are awesome for applications.
  5. By the end of this presentation you’ll have a very clear understanding of
  6. My name is Susheel Aroskar. I am a software engineer in the Cloud Gateway team at Netflix. All of the Netflix HTTP API traffic passes through our Cloud Gateway. I have been at Netflix for 9 years now. Worked in three different teams. And somehow it still feels like I’m still just browsing the list, the real show is yet to start. So let’s start by defining push. What exactly is push? How is it different from the normal request / response paradigm that we all know and love?
  7. This is actually from a motivational poster at my local gym. That’s why I stopped going there. But it turns out to be surprisingly accurate definition for our purpose today. Push really is different in just two ways: There is a persistent, always-on connection between the server and the client for the entirety of the client’s lifetime, and It’s the server that initiates the data transfer. Something does happen on the server and then the server pushes the data to the client instead of the client requesting it We built our own push messaging system, named Zuul Push, to send background push messages to our app from our servers. Zuul push messages are similar to push messages you get on your smartphones except they work across all sorts of devices not just phones. They work anywhere where Netflix app runs. That includes TVs, game consoles, laptops and smartphones. To achieve this, Zuul Push uses standard, open web protocols like WebSockets and Server Sent Events (SSE) to push messages. Zuul Push server itself is open sourced too and is available today on GitHub.
  8. Zuul push is in fact not a single service but a complete push messaging infrastructure made up of multiple components
  9. There are Zuul Push servers. They sit on the network edge and accept connections from clients.
  10. Clients connect to push servers using either WebSocket or Server Sent Events protocol. Once connected, the client keeps the connection open for its entire lifetime. So these are persistent connections.
  11. Since there are many clients, connected to many push servers we need to track which client is connected to which server. This is the job of the push registry.
  12. On the backend, our push message senders need a simple, robust and high throughput mechanism to send push messages. But our senders don’t really want to know about all the internal details of our push infrastructure. What they really a want is a simple, one liner method call to send a push message to a given client. Our push library gives them this simple interface by hiding all this complexity behind a single sendMessage() call.
  13. Internally sendMessage() drops the message into a push message queue. By introducing message queues between senders and receivers we decouple them, making it easy to operate them independent of each other. Message queues also let us absorb wide variations in number of incoming messages. They act as a buffers that absorb big spikes in traffic.
  14. Finally our message processor ties all these components together to do the actual push message delivery.
  15. It reads push messages from the push message queue. Each push message is addressed to a specific client.
  16. The message router then looks up in the Push Registry which push server the requested client is connected to.
  17. If the push server is found in the registry, message processor connects to that push server and delivers the push message. The port used by the message processor to connect to the push server is reachable only on the internal 10/24 subnetwork and is guarded by Amazon security groups. On the other hand, If the push server is not found in the registry, it means requested client is not connected or online at this time. In such case processor just drops the message on the floor. Now that we have seen how all Zuul Push components fit together, we can dig a little deeper in each component’s details.
  18. Zuul Push server is probably the biggest piece of the whole infrastructure. Our push cluster today handles 10s of millions concurrent, persistent client connections at peak and is rapidly growing. Zuul Push server is based on our Zuul cloud gateway and hence shares its name. Zuul cloud gateway fronts all Netflix HTTP API traffic coming into our system. It handles millions of requests per second. It was recently rewritten to use async, non-blocking I/O so It provided a perfect foundation for building massively scalable push messaging server like Zuul Push.
  19. But why do we need async I/O? Many of you are probably familiar with the C10K challenge. The challenge was first coined in 1999. It simply asks how can we support 10,000 connections on a single server. We have long since blown past the original 10,000 number but the name stuck. This capability to support tens of thousands of connections on a single box is crucial for a service like Zuul Push that has to handle millions of mostly idle but always-on persistent connections.
  20. Traditional way of handling multiple connections is to spawn a new thread for each new connection. This thread then does blocking read/write operations on that connection. This model doesn’t scale to meet the C10K challenge. You would quickly exhaust your memory allocating 10,000 stacks for 10000 threads. It’d also pin your CPU down because of the frequent context switches between those 10,000 threads.
  21. Async I/O follows a different model. It uses operating system’s I/0 multiplexing primitives like epoll or keque to register read/write callbacks for all open connections on a single thread. Whenever any socket is ready for I/O, it’s callbacks get invoked using the same single thread so now you don’t need thousands and thousands of threads. The trade off is somewhat more complex programming model because now you as developer are responsible for keeping track of all the state inside your code. You can no longer rely on the thread stack to do it for you because the same single thread stack is now shared by all the open connections.
  22. We use Netty to do async I/O. Netty is a great open source networking library in Java. It is widely used by many popular open source Java projects like Cassandra, Hadoop etc. so it is well tested and battle proven. We are not going to go into details of Netty in this talk but this is how Netty async I/O program structure looks like from 10,000 feet. Those Inbound and Outbound channel handlers you see here are analogous to the read and write callbacks we just discussed. It’s very similar in essence to how Node.js handles multiple connections. If you know Node.js internals you can think of Netty as a libuv counterpart in the Java world.
  23. This is a simplified version of our push server’s Netty pipeline. There is a lot of stuff going on here but I really want to call out your attention to just two highlighted methods getPushAuthHandler() getPushRegistraionHandler() You can override these methods to plug in your own custom authentication and custom push session registration in Zuul Push server. Rest of the handlers you see here - things like HttpServerCodec or WebSocketServerProtocolHandler - are all off the shelf protocol handlers provided by Netty, which is great. Netty is doing most of the heavy lifting of parsing HTTP and WebSocket protocol here.
  24. Each client connecting to the Zuul push server must identify and authenticate itself before it can start receiving push messages on that connection. You can plugin your own custom authentication by extending PushAuthHandler and implementing its doAuth() method. doAuth() receives original HTTP WebSocket connection request as an argument. This allows you to inspect cookies, other headers and the body of the request inside doAuth() which you can use to implement your own, custom authentication.
  25. As we saw push registry is used to keep track of which client is connected to which Zuul Push server.
  26. Just like custom authentication, Zuul Push lets you plugin custom datastore of your choice for push registry. You’d extend PushRegistrationHandler class and implement it’s registerClient() method to do that.
  27. You can use any data store but for best results it should have following characteristics
  28. Low read latency is important because you write registration once when client connects and then look it up multiple times. Once every time someone sends a push message for that client.
  29. Support for record expiry is important because in the real world we cannot rely on every single client closing its connection cleanly, every single time. Most of the time they will close it cleanly which takes care of cleaning up their push registration record from the registry. But sometimes clients crash. Sometimes servers crash. This will leave behind phantom registration records in the registry. A record that indicates that a particular client is connected to particular server, but is no longer accurate. In such cases We need a way to clean up those phantom registration records automatically. Zuul push relies on automatic record expiry or TTL to do that.
  30. Beyond these two features then there are the usual suspects for high availability
  31. And fault tolerance.
  32. These are all great choices for push registry. There are probably several more
  33. We use Dynomite. Dynomite is another open source project from Netflix that wraps Redis and augments it with features like auto-sharding, read/write quorums and cross region replication. You can think of it as Amazon Dynamo meets Redis. We chose Dynomite since it supports replication across AWS regions out of the box which is important for our use case. And also because Dynomite is well supported operationally inside Netflix by a central data engineering team!
  34. This component handles backend message queuing, routing and delivery of push messages on behalf of message senders.
  35. Most of our push message senders use fire-and-forget approach to message delivery. Those who are interested in knowing the final delivery status of the push message can either subscribe to the push delivery status queue or read it from a Hive table.
  36. Netflix runs in three different AWS regions. A backend service trying to send a push message to a particular customer generally has no idea to which region that customer may be connected. Our message routing infrastructure takes care of routing that message to the correct AWS region for them. We use Kafka message queue replication to deliver messages across regions.
  37. In practice, we have found we can use single push message queue to deliver all sorts of push messages and still stay under our delivery latency SLA. However, our design lets you use different message queues for different message priorities to avoid “priority inversion” issue. Priority inversion happens when a message with higher priority is kept waiting behind lower priority messages for delivery because they all are sharing the same queue. Using different message queues for different priorities guarantees this can never happen.
  38. Our message processor is built on top of Mantis. Mantis is our internal scalable stream processing engine similar to Apache Flink. It uses Mesos container management system. This allows us to quickly spin up more message processor instances. It also has a support for auto-scaling number of processors based on the number of pending messages in the queue. This makes it very easy for us to meet our delivery SLA under wide variety of loads while still staying resource efficient.
  39. At this point I’d like to switch gears and cover some of the operational aspects of running Zuul Push in production at Netflix traffic scale. Zuul push is little different than usual stateless REST services so it requires a little TLC (tender love and care) when you run it in production.
  40. The first and biggest difference is the long lived stable connections. They make Zuul Push somewhat stateful.
  41. Persistent connections are great from client’s point of view because they improve clients’ efficiency dramatically. Unlike plain HTTP, clients don’t have to make and break connections constantly. That’s why we all rejoiced when WebSockets appeared in browsers and replaced hacks like long poll or Comet.
  42. But they are terrible from the point of view of anyone operating a server. Mainly because they complicate deployments and rollbacks. Let’s say you deploy a new build to fix some urgent issue. Your push clients will still be happily connected to your old cluster. Because they open connection once and then hang on to that connection for their lifetime. They won’t migrate to the new cluster just because you deployed the new build. You’d have to force them to migrate by killing the old cluster. But if you do that, they will all swarm to the new cluster at the same time like a thundering herd. It’s a lose-lose scenario. Thundering herd is large number of clients, all trying to connect at the same time. This causes a sudden and large spike in traffic that is order of magnitudes higher than your steady state traffic. It’s one of the thing you have to watch out for when you are trying design a robust, resilient system.
  43. We found our way out of this pickle by limiting client connection lifetime. We auto-close the client connection after certain time. Our clients are coded to reconnect back whenever they lose connection to the server. So the client will connect back and every time it does so it will most probably land on a different server. This limits client’s stickiness to a single server. We have tuned this connection lifetime period carefully to strike a good balance between client efficiency which we desire and client stickiness which we are trying to avoid. Empirically, we have found, somewhere between 25 to 35 minutes is the sweet spot.
  44. Not only we limit connection’s lifetime, we also randomize it within some band, every time the client reconnects. This means different clients end up with slightly different connection lifetimes (somewhere between 28 to 32 minutes in our case) after which they will disconnect and reconnect back.
  45. This randomization ensures that a random network-wide blip doesn’t end up accidentally synchronizing millions of connections’ reconnect schedules causing a thundering herd which would then repeat every 30 minutes after that. The only thing worse than a thundering herd is a recurring thundering herd!
  46. This is an extra optimization. I know I said earlier we auto-close client connection from server side but that’s not entirely accurate. Instead in the latest version, our server sends a special message to the client - using the same push channel - telling it to close the connection from the client side. Because of the way TCP works, the party that closes the connection enters the TCP TIME_WAIT state. This state can consume the file descriptor of that connection for up to 2 minutes on Linux. Since our server is handling tens of thousands of open connections simultaneously, server’s file descriptors are far more valuable than client’s file descriptors. By having client close the connection, we conserve server’s file descriptors. There is a flip side to this optimization though. You have to be prepared to handle misbehaving clients that won’t close their connections when told by the server. To handle such clients we start a timer when we send them CLOSE CONNECTION message and then close the connection forcefully from the server-side if the client doesn’t comply within set time limit.
  47. So we took care of stateful, sticky connections problem. Next we focused our attention on optimizing our push cluster size. Our big epiphany here was most of the connections were idle, most of the time. This meant neither memory nor CPU was under a lot of pressure even with large number of open connections.
  48. So we chose a big Amazon instance type for our push server, carefully tuned its Linux TCP kernel parameters and JVM options and packed it with as many connections as possible... ulimit -n 262144 sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216" sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216" sysctl -w net.core.somaxconn=65536 sysctl -w net.ipv4.tcp_max_syn_backlog=65536 -Xmx3g -Xms3g -XX:MaxDirectMemorySize=256m
  49. Then just one of our servers went down...
  50. And we got a visit from our dear old friend - the thundering herd! All those thousands and thousands of clients from that single server came roaring back with reconnects. You know you have a problem when loss of a single server can start a stampede!
  51. So we licked our wounds, learnt from our mistake, and tried the “Goldilocks” strategy for the second round. You don’t want to run your server either too hot or too cold. So we found the instance size that’s just right for us m4.large 2 vCPU 8 GB 84K connections / box
  52. The main lesson here is you should really optimize for the actual cost of your cluster not just low instance count. I know, when stated like that it seems obvious but it wasn’t obvious to us initially because we conflated small push cluster size - low number of push server instances - with efficient operation. In reality more number of cheaper instances are preferable to fewer number of big instances, cost being equal. Being able to support millions of connections on a single box is certainly impressive technically but it will eventually come back to bite you in production. And even if you don’t have huge traffic volume like Netflix, it may still make sense to use more number of smaller servers instead of few BIG servers. Mostly because using smaller servers gives you more cost-efficient autoscaling at lower traffic volume. At low enough traffic volume, you may only need couple of big servers to handle all your connections. But then you can’t autoscale up and down efficiently to match your traffic since your step size is big - a single server taking somewhere between 25 to 50% of all your traffic. You can fit your traffic curve with much more efficiently - in terms of autoscaling - with small increment/decrement steps, that is small servers.
  53. Next problem we ran into was autoscaling. How do we autoscale our push cluster as the traffic goes up and down?
  54. Our two go to strategies for auto-scaling REST services are either autoscale on RPS - requests per second or on CPU load average. Both are ineffective for Zuul Push. There is no continuous RPS - thanks to persistent, long lived connections and CPU is mostly idle as we saw earlier. So how do you autoscale?
  55. Real limiting factor for a push server is the number of open connections per box. So it makes perfect sense to auto-scale by average number of open connections per box. Thankfully AWS makes it easy to autoscale on anything as long as you can export it as a custom cloud-watch metric from your app. We export number of open connections from our server process.
  56. Final problem we had to solve was to make Amazon’s Classic Load Balancers play nice with WebSockets. Our push servers sit behind Amazon’s Classic Load Balancers or CLBs for short. Unfortunately CLBs can not proxy WebSocket connections. When WebSocket client - like browser - wants to open a WebSocket it sends a special HTTP request to the server called WebSocket upgrade request. If the server supports WebSockets it returns a special “Switching protocols” response and upgrades the original HTTP connection to a long lived WebSocket connection. CLBs do not understand this initial WebSocket upgrade requests. They treat it as any other HTTP request and tear down the connection as soon as server returns the response. So you can’t have persistent WebSocket connections through CLBs. BTW, this is issue is not specific to just CLBs. You’d run into similar issues with any reverse proxy or a load balancer that does not understand WebSocket protocol natively.
  57. We found a way around this by running our CLBs in TCP load balancing mode. Normally CLBs run as HTTP load balancers and do layer 7 load-balancing. But you can configure them to run as TCP load balancers. If you do that, you force them to do load balancing at layer 4. In this mode they just proxy TCP packets back and forth without trying to parse any layer 7 application protocol which would be HTTP in this case. This keeps CLBs from mangling the WebSocket upgrade requests that they do not understand.
  58. Good thing about CLBs in TCP mode is that they can still terminate TLS. This means you can still offload SSL handling to CLBs.
  59. Flip side of WebSockets is they are vulnerable to cross site request forgery if not properly secured. To secure them against CSRF, web server must ensure the “Origin” header has correct value before accepting the incoming WebSocket connection. Thankfully, Zuul Push server already does this for you.
  60. Deregistering server from an CLB kills all client connections to that server instantaneously. Whenever we deploy new build, we deregister old instances from CLB so that they no longer receive any traffic. What we ideally want in this case is for the CLB to not send old instances any new traffic but let the existing connections on those instances continue for the rest of their natural connection lifetime. However, by default, ELBs kill all connections to an instance as soon as you deregister it from the ELB. Fortunately it is possible to make CLBs behave in the manner we want. AWS console has an CLB setting called “connection draining”. Once you enable it and set it to a high enough timeout value, CLBs will gradually drain client connections from your old, out of traffic servers and let them migrate to new servers over time. Once you have made all these tweaks, your CLB will handle lots and lots of WebSocket connections happily, no problem.
  61. I do want to note here that Amazon has recently introduced a new load balancer type - ALB, short for application load balancer that does understand WebSocket protocol. Unfortunately it came too late for us. By then we had already figured how to get CLBs to do what we wanted. But if you are starting today, you may want to give ALB a try first.
  62. May be 20 to 30 minutes. This limits stateful, sticky issues.
  63. To spread out reconnect peaks as time progresses
  64. As long as final cost stays the same. This helps in limiting the size of a thundering herd.
  65. As CPU or RPS are not a good proxy for a load on the system for push cluster
  66. Most load balancers like HAProxy etc. let you to do load balancing at layer 4, TCP level. Most of these operational best practices are already built into Zuul Push.
  67. Finally, what can you do with this push messaging capability? Now that we finally have our push hammer in production we are seeing a lot of nails….
  68. Our recent integration with Alexa is one good example. Suppose user asks Alexa to play “Stranger Things”. The actual speech recognition of user’s spoken command happens in the Cloud using Alexa voice processing service. So now we need an ultra-low latency mechanism to transmit this synthesized command from the cloud to the Netflix application running on user’s TV. The application polling the cloud at fixed intervals clearly won’t do here. Push messaging to the rescue!
  69. We have even more exciting plans for using push in the future. For example, we could auto detect a client that is generating lots of errors and send that client a push message asking it to upload its state and any other relevant diagnostic to cloud.
  70. And if all that the diagnostic data still doesn’t help we could reach for the oldest tool in every software engineer’s toolbox and restart the application. Now we could do it remotely. What could go wrong?
  71. But if something does go wrong, we can now send you a push message, saying “We are sorry”
  72. Hopefully, these examples have already got you thinking about how you can use push messaging to add novel and rich functionality to your applications.
  73. I have been pleading the case for PUSH for the last 40 minutes. Now I have just one last request to make at this point...
  74. I have been pleading the case for PUSH for the last 40 minutes. Now I have just one last request to make at this point...
  75. All of the things we have discussed so far, all of it is open source. You can find it in the project Zuul under Netflix OSS on Github. It even comes with a sample, toy push server example that you can start playing with immediately. So go ahead, give it a spin. File bugs. And if you would be so kind, may be even send us a pull request or two..