© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Peter Bakas, Director of Engineering, Event and Data Pipelines, Netflix
October 2015
BDT318
Netflix Keystone
How Netflix Handles Data Streams Up to 8 Million Events Per Second
Peter Bakas
@ Netflix : Cloud Platform Engineering - Event and Data Pipelines
@ Ooyala : Analytics, Discovery, Platform Engineering & Infrastructure
@ Yahoo : Display Advertising, Behavioral Targeting, Payments
@ PayPal : Site Engineering and Architecture
@ Play : Advisor to various Startups (Data, Security, Containers)
Who is this guy?
What to Expect from the Session
• Architectural design and principles for Keystone
• Current state of technologies that Keystone is leveraging
• Best practices in operating Kafka and Samza
Why are we here?
Publish, Collect, Process, Aggregate & Move Data @ Cloud Scale
• 550 billion events per day
• 8.5 million events & 21 GB per second during peak
• 1+ PB per day
• Hundreds of event types
By the numbers
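The daily and peak figures above can be related with some quick arithmetic. The sketch below derives an average event rate from the 550 billion events per day and compares it with the stated peak; it assumes decimal units (1 GB = 10^9 bytes), and the class and method names are made up for illustration.

```java
// Back-of-the-envelope averages derived from the slide's totals.
// Assumption: decimal units (1 GB = 1e9 bytes).
class KeystoneNumbers {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average events per second for a given daily event count.
    static double avgEventsPerSecond(double eventsPerDay) {
        return eventsPerDay / SECONDS_PER_DAY;
    }

    // Average bytes per event given a byte rate and an event rate.
    static double avgBytesPerEvent(double bytes, double events) {
        return bytes / events;
    }

    public static void main(String[] args) {
        double eventsPerDay = 550e9;     // from the slide
        double peakEventsPerSec = 8.5e6; // from the slide
        double peakBytesPerSec = 21e9;   // 21 GB/s from the slide

        double avg = avgEventsPerSecond(eventsPerDay);
        System.out.printf("average events/s: %.2f million%n", avg / 1e6);
        System.out.printf("peak-to-average ratio: %.2f%n", peakEventsPerSec / avg);
        System.out.printf("bytes/event at peak: %.0f%n",
                avgBytesPerEvent(peakBytesPerSec, peakEventsPerSec));
    }
}
```

Under these assumptions the average works out to roughly 6.4 million events per second, so the 8.5 million peak is about 1.3 times the daily average, and an event at peak is on the order of 2.5 KB.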
How did we get here?
Chukwa
Chukwa/Suro + Real-Time
Now what?!!
Keystone
Split Fronting Kafka Clusters
Normal-priority (majority)
• 2 copies, 12 hour retention
High-priority (streaming activities etc.)
• 3 copies, 24 hour retention
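The two tiers above could be modeled roughly as follows. This is a sketch, assuming "copies" means the Kafka replication factor and that retention is applied through the topic-level `retention.ms` setting; the enum, its constants, and the `forTopic` rule are hypothetical names, not Netflix's code.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical model of the two fronting-cluster tiers from the slide.
enum FrontingCluster {
    NORMAL_PRIORITY(2, 12), // 2 copies, 12 hour retention
    HIGH_PRIORITY(3, 24);   // 3 copies, 24 hour retention

    final int replicationFactor;
    final int retentionHours;

    FrontingCluster(int replicationFactor, int retentionHours) {
        this.replicationFactor = replicationFactor;
        this.retentionHours = retentionHours;
    }

    // Value one would pass as Kafka's topic-level retention.ms.
    long retentionMs() {
        return TimeUnit.HOURS.toMillis(retentionHours);
    }

    // Assumed routing rule: only topics flagged high priority use the 3-copy cluster.
    static FrontingCluster forTopic(boolean highPriority) {
        return highPriority ? HIGH_PRIORITY : NORMAL_PRIORITY;
    }
}
```

Keeping the tiers in one place like this makes the trade-off explicit: the high-priority tier pays for an extra replica and twice the retention window.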
Split Fronting Kafka Clusters
Instance type - D2XL
• Large disk (6 TB) with 450-475 MB/s of measured sequential I/O throughput
• Large memory (30 GB)
• Medium network capability (~700 Mbps)
• Replication lag starts to show when bytes in exceeds 18 MB/second per broker with thousands of partitions
• A pull request is available for Apache Kafka
  • https://github.com/apache/kafka/pull/132
  • https://issues.apache.org/jira/browse/KAFKA-1215
• Improved availability
• Reduce cost of maintenance
Kafka Zone Aware Replica Assignment
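The idea behind zone-aware replica assignment can be illustrated with a small sketch. It is not the algorithm from the pull request above; it is a simplified round-robin that assumes equally sized zones and a replication factor no larger than the number of zones, in which case every partition's replicas land in distinct zones. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified zone-aware assignment: interleave brokers across zones, then
// place each partition's replicas on consecutive brokers in that order.
class ZoneAwareAssignment {

    // brokersByZone: one list of broker ids per availability zone.
    static List<Integer> interleave(List<List<Integer>> brokersByZone) {
        List<Integer> order = new ArrayList<>();
        int maxSize = 0;
        for (List<Integer> zone : brokersByZone) {
            maxSize = Math.max(maxSize, zone.size());
        }
        for (int i = 0; i < maxSize; i++) {
            for (List<Integer> zone : brokersByZone) {
                if (i < zone.size()) {
                    order.add(zone.get(i));
                }
            }
        }
        return order;
    }

    // Returns, for each partition, the list of broker ids holding its replicas.
    static List<List<Integer>> assign(List<List<Integer>> brokersByZone,
                                      int partitions, int replicationFactor) {
        if (replicationFactor > brokersByZone.size()) {
            throw new IllegalArgumentException("replication factor exceeds zone count");
        }
        List<Integer> order = interleave(brokersByZone);
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                replicas.add(order.get((p + r) % order.size()));
            }
            assignment.add(replicas);
        }
        return assignment;
    }
}
```

With replicas spread this way, losing one zone leaves at least one copy of every partition available, which is the availability benefit the slide refers to.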
Keystone
Control Plane + Data Plane
• Control plane for router is job manager
• Infrastructure is data plane
• Declarative, reconciliation driven
• Smart scheduling managing tradeoffs
• Auto Scaling based on traffic
• Fault tolerance
  • Application (router) faults
  • AWS hardware faults
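A declarative, reconciliation-driven control plane boils down to comparing the declared state with the observed state and emitting actions that close the gap. The sketch below shows one reconciliation pass over container counts per routing job; the job names and the string-based action format are invented for illustration and do not describe Netflix's job manager.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// One pass of a reconciliation loop: desired state in, corrective actions out.
class Reconciler {

    // desired/actual: number of containers per job name.
    static List<String> reconcile(Map<String, Integer> desired, Map<String, Integer> actual) {
        // TreeMap keeps the output order deterministic.
        TreeMap<String, Integer> jobs = new TreeMap<>(desired);
        // Jobs that are running but no longer declared are scaled to zero.
        for (String job : actual.keySet()) {
            jobs.putIfAbsent(job, 0);
        }
        List<String> actions = new ArrayList<>();
        for (Map.Entry<String, Integer> e : jobs.entrySet()) {
            int want = e.getValue();
            int have = actual.getOrDefault(e.getKey(), 0);
            if (want > have) {
                actions.add("START " + (want - have) + " " + e.getKey());
            } else if (want < have) {
                actions.add("STOP " + (have - want) + " " + e.getKey());
            }
        }
        return actions;
    }
}
```

Because the pass is driven only by the difference between the two maps, running it again after a container crash produces the start action needed to recover, which is how the same loop can cover both scaling and fault tolerance.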
Keystone
Routing Service - Samza
Amazon S3 Routing
• ~5800 long running containers for Amazon S3 routing
• ~500 C3-4XL AWS instances for Amazon S3 routing
Elasticsearch Routing
• ~850 long running containers for Elasticsearch routing
• ~70 C3-4XL AWS instances for Elasticsearch routing
Kafka Routing
• ~3200 long running containers for Kafka routing
• ~280 C3-4XL AWS instances for Kafka routing
Container Footprint:
• 2G - 5G memory
• 160 Mbps max network bandwidth
• 1 CPU Share
• 20G disk for buffer & logs
• Processes 1-12 partitions
• Periodically reports health to infrastructure
Observed Numbers:
• Avg memory usage of ~1.8G per container
• Avg memory usage per node ~20G (range: 7G - 25G)
• Avg CPU utilization of 8% per node
• Avg NetworkIn 256Mbps per node
• Avg NetworkOut 156Mbps per node
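The fleet sizes and observed numbers above can be cross-checked against each other. The sketch below computes containers per instance for each routing target and estimates per-node memory from the ~1.8G container average; the class and method names are hypothetical, and the inputs are the rounded figures from the slides.

```java
// Consistency check between fleet sizes and observed per-node memory.
class RouterFleetMath {

    static double containersPerInstance(int containers, int instances) {
        return (double) containers / instances;
    }

    static double estimatedNodeMemoryGb(double containersPerNode, double avgContainerMemGb) {
        return containersPerNode * avgContainerMemGb;
    }

    public static void main(String[] args) {
        double s3 = containersPerInstance(5800, 500);    // Amazon S3 routing
        double es = containersPerInstance(850, 70);      // Elasticsearch routing
        double kafka = containersPerInstance(3200, 280); // Kafka routing
        System.out.printf("containers per instance: S3 %.1f, ES %.1f, Kafka %.1f%n", s3, es, kafka);
        System.out.printf("estimated memory per S3 routing node: %.1fG%n",
                estimatedNodeMemoryGb(s3, 1.8));
    }
}
```

All three routers come out at roughly 11 to 12 containers per instance, and 11.6 containers at about 1.8G each gives about 21G, in line with the observed ~20G per node.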
Publish to Amazon S3 sink:
• Every 200 MB or 2 minutes
• S3 average upload latency 200ms
Producer to Router latency:
• 30th percentile of topics under 500 ms
• 70th percentile of topics under 1 sec
• 90th percentile under 2 sec
• Overall average about 2.5 seconds
Kafka to Router consumer lag (est time to catch up):
• 65th percentile under 500 ms
• 90th percentile under 5 seconds
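The S3 sink rule above (publish every 200 MB or 2 minutes, whichever comes first) is a size-or-time flush policy. A minimal sketch of such a policy follows; it assumes "200 MB" means 200 MiB and that the timer starts at the first buffered write, and the class name is hypothetical rather than taken from the router.

```java
// Flush when the buffer reaches a size limit or its oldest data reaches an age limit.
class SizeOrTimeFlushPolicy {
    private final long maxBytes;
    private final long maxAgeMillis;
    private long bufferedBytes = 0;
    private long firstWriteMillis = -1; // -1 means the buffer is empty

    SizeOrTimeFlushPolicy(long maxBytes, long maxAgeMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Records a write and reports whether the buffer should be flushed now.
    boolean onWrite(long bytes, long nowMillis) {
        if (firstWriteMillis < 0) {
            firstWriteMillis = nowMillis;
        }
        bufferedBytes += bytes;
        return shouldFlush(nowMillis);
    }

    // Can also be called from a timer so idle buffers are still flushed.
    boolean shouldFlush(long nowMillis) {
        if (bufferedBytes == 0) {
            return false;
        }
        return bufferedBytes >= maxBytes || nowMillis - firstWriteMillis >= maxAgeMillis;
    }

    // Call after a successful upload.
    void reset() {
        bufferedBytes = 0;
        firstWriteMillis = -1;
    }
}
```

A caller would construct it as `new SizeOrTimeFlushPolicy(200L * 1024 * 1024, 120_000L)` and pass in the current time explicitly, which keeps the policy easy to test.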
+ Alterations
• Internal build of Samza version 0.9.1
• Fixed SAMZA-41 in 0.9.1
  • Support static partition range assignment
• Added SAMZA-775 in 0.9.1
  • Prefetch buffer specified based on heap to use
• Backported SAMZA-655 to 0.9.1
  • Environment variable configuration rewriter
• Backported SAMZA-540 to 0.9.1
  • Expose latency related metrics in OffsetManager
• Integration with Netflix Alert & Monitoring systems
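To make the "environment variable configuration rewriter" item more concrete, here is a standalone sketch of the idea: values from the process environment override entries in a job's configuration map. It does not use Samza's classes, and the prefix and the rule for turning a variable name into a config key are assumptions chosen for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Overrides configuration entries with values from prefixed environment variables.
class EnvConfigRewriterSketch {

    // Assumed mapping rule: strip the prefix, lower-case the rest, '_' becomes '.'.
    // Example with prefix "KS_": KS_TASK_OPTS overrides the key "task.opts".
    static Map<String, String> rewrite(Map<String, String> config,
                                       Map<String, String> env,
                                       String prefix) {
        Map<String, String> result = new HashMap<>(config); // leave the input untouched
        for (Map.Entry<String, String> e : env.entrySet()) {
            String name = e.getKey();
            if (name.startsWith(prefix) && name.length() > prefix.length()) {
                String key = name.substring(prefix.length())
                        .toLowerCase(Locale.ROOT)
                        .replace('_', '.');
                result.put(key, e.getValue());
            }
        }
        return result;
    }
}
```

In a real process the second argument would be `System.getenv()`. The appeal of this approach in a container environment is that the same packaged job can be tuned per deployment without rebuilding its configuration files.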
Keystone
Real-time processing
• Streaming jobs to analyze movie plays, A/B tests, etc.
• Direct API for Kafka in Spark 1.3
• Observed 2x performance improvement compared to Spark 1.2
• Additional improvement possible with prefetching and connection pooling (not available yet)
• Campaigned for backpressure support
• Result: the Spark 1.5 release has community-developed backpressure support (SPARK-7398)
Great. How do I use it?
Annotation-based event definition
@Resource(type = ConsumerStorageType.DB, name = "S3Diagnostics")
public class S3Diagnostics implements Annotatable {
    ....
}
....
S3Diagnostics s3Diagnostics = new S3Diagnostics();
....
LogManager.logEvent(s3Diagnostics); // log this diagnostic event
Java
{
  "eventName" : "ksproxytest",
  "payload" : {
    "k1" : "v1",
    "k2" : "v2"
  }
}
Non-Java : Keystone Proxy
Wire format
• Extensible
• Currently supports JSON
• Will support Avro
• Encapsulated as a shareable jar
• Immutable message through the pipeline
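The "immutable message" property and the JSON envelope shown for the Keystone Proxy can be sketched together. The class below is hypothetical (it is not the shareable jar the slide mentions), assumes string-only payload values, and uses deliberately minimal escaping; a real producer would use a JSON library.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Immutable event envelope that serializes to the eventName/payload JSON shape.
final class KeystoneEvent {
    private final String eventName;
    private final Map<String, String> payload;

    KeystoneEvent(String eventName, Map<String, String> payload) {
        this.eventName = eventName;
        // Defensive copy plus unmodifiable view: later changes to the caller's map
        // cannot alter the event, and readers cannot mutate it.
        this.payload = Collections.unmodifiableMap(new LinkedHashMap<>(payload));
    }

    String eventName() {
        return eventName;
    }

    Map<String, String> payload() {
        return payload;
    }

    String toJson() {
        StringBuilder sb = new StringBuilder();
        sb.append("{\"eventName\":\"").append(escape(eventName)).append("\",\"payload\":{");
        boolean first = true;
        for (Map.Entry<String, String> e : payload.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append("}}").toString();
    }

    // Minimal escaping: backslash and double quote only (control characters not handled).
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Freezing the message at construction time means every stage of the pipeline sees the same bytes the producer emitted, which simplifies auditing and replay.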
Producer Resilience
• An outage should never keep existing instances from serving their business purpose
• An outage should never prevent new instances from starting up
• After service is restored, event producing should resume automatically
Fail, but never block
block.on.buffer.full=false
handle potential blocking of first metadata request
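The "fail, but never block" principle can be shown without a Kafka dependency. In the Kafka Java producer of that era, `block.on.buffer.full=false` made a send fail when the buffer was full instead of waiting; the sketch below mimics that behavior with a bounded in-memory queue that drops and counts events it cannot accept. The class is hypothetical and is not the Keystone producer library.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Bounded event buffer that rejects events when full instead of blocking the caller.
class NonBlockingEventBuffer {
    private final ArrayBlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    NonBlockingEventBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // offer() returns immediately; put() would block the application thread when full.
    boolean tryEnqueue(String event) {
        boolean accepted = queue.offer(event);
        if (!accepted) {
            dropped.incrementAndGet(); // surface the loss as a metric rather than stalling
        }
        return accepted;
    }

    // Called by the background sender; returns null when nothing is buffered.
    String poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

The design choice is to trade a bounded amount of event loss during an outage for the guarantee that the application's request threads are never held up by the logging path.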
Trust me, it works!
Keystone Dashboard
Trust, but verify!
• Broker monitoring
  • Alert on offline broker from ZooKeeper
• Consumer monitoring
  • Alert on lagging or stuck consumers and unconsumed partitions
• Heart-beating
  • Produce/consume messages and measure latency
• Broker performance testing
  • Produce tens of thousands of messages per second on a single instance
  • Create multiple consumer groups to test consumer impact on broker
Auditor
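The consumer monitoring checks listed above (lag, stuck consumers, unconsumed partitions) can be expressed as a classification over two consecutive offset samples. The sketch below is one plausible way to define those states, not the Auditor's actual logic; the thresholds, the use of -1 for "no committed offset", and all names are assumptions.

```java
// Classifies one partition of one consumer group from two offset samples.
class ConsumerPartitionCheck {

    enum Status { HEALTHY, LAGGING, STUCK, UNCONSUMED }

    // logEndOffset: latest offset in the partition.
    // previousCommitted/currentCommitted: committed offsets from two consecutive
    // samples, or -1 when the group has never committed for this partition.
    static Status classify(long logEndOffset, long previousCommitted,
                           long currentCommitted, long lagThreshold) {
        if (currentCommitted < 0) {
            return Status.UNCONSUMED;
        }
        long lag = logEndOffset - currentCommitted;
        // Messages are waiting but the committed offset did not move between samples.
        if (lag > 0 && currentCommitted == previousCommitted) {
            return Status.STUCK;
        }
        if (lag > lagThreshold) {
            return Status.LAGGING;
        }
        return Status.HEALTHY;
    }
}
```

Distinguishing a stuck consumer from a merely lagging one matters for alerting: the former needs intervention, while the latter may catch up on its own.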
Auditor - Broker Monitoring
Auditor - Consumer Monitoring
[Dashboard panels: Consumer Offset, Stuck Consumer, Unconsumed Partitions, Consumer Lag]
Meet Winston
New Internal Automation Engine:
• Collect diagnostic information based on alerts & other operational events
• Help services self-heal
• Reduce MTTR
• Reduce pager fatigue
• Improve developer productivity
Winston
How do you like your Kaffee?
Kaffee
What’s next?
• Performance tuning + optimizations
• Self service
• Schemas + registry
• Event discovery + visualization
• Open Source Auditor/Kaffee
Near Term
And then???
Global real-time data stream + stream processing network
Office Hours
Wed 4:00PM – 5:30PM
@ Booth
pbakas@netflix.com
@peter_bakas
Remember to complete your evaluations!
Thank you!
