Shriya Arora
Streaming datasets for Personalization
What is Netflix’s Mission?
Entertainment by allowing you to stream content anywhere, anytime
What is Netflix’s Mission?
Entertainment by allowing you to stream personalized content anywhere, anytime
How much data do we process to have a personalized Netflix for everyone?
● 125M hours/day
● 86M active members
● 450B unique events/day
● 600+ Kafka topics
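To put these figures in perspective, the sketch below converts them into average rates. It uses only the numbers on the slide and assumes that "125M hours/day" means hours of content streamed; the results are daily means, so peak rates would be higher.

```python
# Figures quoted on the slide.
EVENTS_PER_DAY = 450e9   # 450B unique events/day
ACTIVE_MEMBERS = 86e6    # 86M active members
HOURS_PER_DAY = 125e6    # 125M hours/day (assumed: hours of content streamed)
SECONDS_PER_DAY = 24 * 60 * 60

# Average rates implied by those figures.
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
events_per_member_per_day = EVENTS_PER_DAY / ACTIVE_MEMBERS
hours_per_member_per_day = HOURS_PER_DAY / ACTIVE_MEMBERS

print(f"{events_per_second:,.0f} events/s on average")
print(f"{events_per_member_per_day:,.0f} events per member per day")
print(f"{hours_per_member_per_day:.2f} hours per member per day")
```

That works out to roughly 5.2 million events per second on average, and a little over 5,000 events per member per day.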
Data Infrastructure
● Application instances
● Keystone Ingestion Pipeline
● Raw data (S3/HDFS)
● Stream processing (Spark, Flink …)
● Batch processing (Spark/Pig/Hive/MR)
● Processed data (Tables/Indexers)
What do we solve with streaming that we can’t solve with batch ETL?
● Business Wins
○ Algorithms become more dynamic/responsive
○ Enables research by reducing time delay between event generation and consumption
○ Creates opportunity for new types of algorithms
● Technical Wins
○ Fewer moving parts means fewer places for error
○ Save on storage costs
○ Avoid long-running jobs
■ Reduces processing resources
■ Shortens turnaround times
Picking a Stream Processing Engine
Things to consider:
● Problem Scope/Requirements
○ Event-based pure streaming or micro-batches?
○ Do you want to implement Lambda?
● Existing Internal Technologies
○ Streaming Infrastructure: What are other teams using?
○ ETL ecosystem: What about teams that don’t consume out of Kafka?
● What’s your team’s learning curve?
○ Do you know Spark?
○ Do you know Scala?
Getting started with Spark Streaming
Micro-batches
● Data is received as DStreams; each DStream is a sequence of RDDs, one per micro-batch
● Supports all fundamental RDD operations like map/flatMap/reduce/reduceByKey
● Basic time-based windowing
● Checkpointing support for resilience to failures
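The micro-batch model above can be illustrated without Spark. The sketch below is plain Python, not Spark Streaming code: `micro_batches` and `count_by_key` are hypothetical helpers that mimic how a stream is cut into fixed-interval batches and how a `reduceByKey`-style count is then applied to each batch independently. The event data is made up.

```python
from collections import defaultdict

def micro_batches(events, batch_interval):
    """Group (timestamp, key) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for ts, key in events:
        batches[int(ts // batch_interval)].append(key)
    # Return non-empty batches in time order; each list plays the role of one RDD.
    # (Intervals with no events are skipped here for brevity.)
    return [batches[i] for i in sorted(batches)]

def count_by_key(batch):
    """Per-batch equivalent of mapping each key to (key, 1) and summing by key."""
    counts = defaultdict(int)
    for key in batch:
        counts[key] += 1
    return dict(counts)

# Hypothetical (timestamp_seconds, title) play events.
events = [(0.5, "a"), (1.2, "b"), (4.9, "a"), (5.1, "a"), (9.0, "c"), (11.3, "b")]
per_batch_counts = [count_by_key(b) for b in micro_batches(events, batch_interval=5)]
print(per_batch_counts)
```

Each batch is counted in isolation, which is why results are available no sooner than one batch interval after an event arrives.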
Writing a basic Spark Streaming app
Performance tuning your Spark Streaming application
● Choice of micro-batch interval
○ The most important parameter
● Cluster memory
○ Large batch intervals need more memory
● Parallelism
○ DStreams are naturally partitioned by Kafka partition
○ Repartition can help with increased parallelism at the cost of shuffle
● # of CPUs
○ <= number of tasks
○ Depends on how computationally intensive your processing is
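Why the batch interval matters so much can be shown with a small simulation. This is a plain-Python sketch, not Spark code, and it assumes batches are processed one at a time; `scheduling_delays` and the timings are hypothetical. When processing a batch takes longer than the interval, each new batch waits longer than the one before it.

```python
def scheduling_delays(batch_interval, processing_times):
    """Return how long each batch waits before it starts processing.

    Batch i becomes ready at (i + 1) * batch_interval and batches run one at a
    time, so a batch must wait if the previous one is still running.
    """
    delays = []
    free_at = 0.0  # time at which the single processing slot becomes free
    for i, processing_time in enumerate(processing_times):
        ready_at = (i + 1) * batch_interval
        start_at = max(ready_at, free_at)
        delays.append(start_at - ready_at)
        free_at = start_at + processing_time
    return delays

# Hypothetical numbers: every batch takes 6 s to process.
stable = scheduling_delays(batch_interval=10, processing_times=[6] * 5)
unstable = scheduling_delays(batch_interval=5, processing_times=[6] * 5)
print(stable)    # processing keeps up, so no batch waits
print(unstable)  # processing falls behind, so the wait grows with every batch
```

With a 10 s interval the job keeps up; with a 5 s interval the delay grows by one second per batch and never recovers unless processing gets faster.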
Getting started with Flink
Performance tuning your Flink application (Yet to be productionised)
● Persistent data storage for checkpointing
○ Fault-tolerant, highly-available system
○ Supports high throughput for frequent state updates
● Parallelism
○ Optimized for # of Kafka Partitions
○ Optimal number of slots/CPU
● Size of cluster
○ Function of your incoming stream
○ What is your bottleneck? Network/Memory/Computation
● Code Optimization
○ Build an optimal DAG with the least network shuffle
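The point about matching parallelism to the number of Kafka partitions can be sketched as follows. The round-robin assignment here is an illustrative assumption, not Flink's actual connector logic, and `assign_partitions` and `idle_subtasks` are hypothetical helpers. The aim is only to show that parallelism which does not divide the partition count gives uneven load, and parallelism above the partition count leaves source subtasks idle.

```python
def assign_partitions(num_partitions, parallelism):
    """Spread Kafka partitions over source subtasks round-robin (illustrative)."""
    assignment = {subtask: [] for subtask in range(parallelism)}
    for partition in range(num_partitions):
        assignment[partition % parallelism].append(partition)
    return assignment

def idle_subtasks(assignment):
    """Subtasks that received no partition and therefore read nothing."""
    return [s for s, parts in assignment.items() if not parts]

balanced = assign_partitions(num_partitions=8, parallelism=4)
skewed = assign_partitions(num_partitions=8, parallelism=3)
over_provisioned = assign_partitions(num_partitions=8, parallelism=12)

print(balanced)
print([len(parts) for parts in skewed.values()])
print(idle_subtasks(over_provisioned))
```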
Challenges with Spark
● Not a ‘pure’ event streaming system
○ Minimum latency of batch interval
○ Un-intuitive for stream design
● Choice of batch interval is a little too critical
○ Everything can go wrong if you choose it badly
○ Build-up of scheduling delay can lead to data loss
● Only time-based windowing
○ Cannot be used to solve session-stitching use cases, or trigger-based event aggregations*
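To see why time-only windows are a poor fit for session stitching, compare fixed windows with gap-based sessions on the same events. This is a plain-Python sketch with made-up timestamps; `fixed_windows` and `session_windows` are hypothetical helpers, not Spark or Flink APIs.

```python
def fixed_windows(timestamps, size):
    """Bucket timestamps into fixed windows [k*size, (k+1)*size)."""
    windows = {}
    for ts in sorted(timestamps):
        windows.setdefault(ts // size * size, []).append(ts)
    return list(windows.values())

def session_windows(timestamps, gap):
    """Start a new session whenever the pause since the last event exceeds gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical event times (seconds) for one member: two bursts of activity.
activity = [50, 55, 62, 70, 200, 205]
by_minute = fixed_windows(activity, size=60)
by_session = session_windows(activity, gap=30)
print(by_minute)
print(by_session)
```

The 60-second windows cut the first burst in two at the minute boundary, while the gap-based grouping keeps it together as one session.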
Challenges with Flink
● Non-trivial to bring up a basic app; newer concepts to adjust to
○ Complex (though powerful) concepts like watermarking, checkpointing, custom windows
● Insufficient monitoring and debugging tools
● Documentation is basic; online community support is not as widespread
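Watermarking is one of the concepts that takes some adjusting to. The sketch below is a deliberately simplified, plain-Python illustration of a bounded-out-of-orderness watermark, not Flink's API: the watermark trails the largest event time seen so far by a fixed allowance, and any event older than the current watermark is treated as late. The function name and the arrival order are hypothetical.

```python
def split_on_time_and_late(events, max_out_of_orderness):
    """Classify events as on time or late using a simplified watermark.

    The watermark trails the largest event time seen so far by
    max_out_of_orderness; an event older than the current watermark is late.
    """
    watermark = float("-inf")
    on_time, late = [], []
    for event_time in events:  # events are listed in arrival order
        if event_time < watermark:
            late.append(event_time)
        else:
            on_time.append(event_time)
        watermark = max(watermark, event_time - max_out_of_orderness)
    return on_time, late

# Hypothetical event times, listed in the order they arrive.
arrivals = [10, 12, 11, 20, 14, 19, 30, 18]
on_time, late = split_on_time_and_late(arrivals, max_out_of_orderness=5)
print(on_time)
print(late)
```

A larger allowance tolerates more disorder but delays results; a smaller one gives faster results and more late events.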
Challenges with Streaming
● Pioneer Tax: batch.getInfrastructure >= streaming.getInfrastructure
○ Analytics has historically been batch; it is instinctively easier to formulate analytical problems in batch frameworks like MR, Pig, Hive, etc.
○ Deployments are non-trivial
● Moving towards unbounded === moving towards “On Call”
○ Batch failures have to be addressed urgently; streaming failures have to be addressed immediately.
● Streaming outages more critical than batch outages
○ In batch it’s easy/cheap to recover from outages (as long as the data isn’t lost).
○ In streaming, data recovery (beyond the fault-tolerant limits of the system) can be exhausting
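One way to picture the "fault-tolerant limits" is a retained log plus a checkpointed offset. The sketch below is a plain-Python illustration under that assumption; `recover` and the offsets are hypothetical. If the outage is short, the job replays from its checkpoint; if it outlasts the log's retention, the unprocessed offsets that aged out cannot be replayed from the stream.

```python
def recover(log_start_offset, log_end_offset, checkpointed_offset):
    """Return (offsets_to_replay, offsets_lost) after an outage.

    The log retains offsets [log_start_offset, log_end_offset); anything the
    job had not processed before log_start_offset has aged out of the log.
    """
    resume_from = max(checkpointed_offset, log_start_offset)
    lost = max(0, log_start_offset - checkpointed_offset)
    return range(resume_from, log_end_offset), lost

# Short outage: the checkpoint is still inside the retained log.
short_replay, short_lost = recover(100, 500, checkpointed_offset=450)

# Long outage: retention moved past the checkpoint while the job was down.
long_replay, long_lost = recover(300, 900, checkpointed_offset=250)

print(len(short_replay), short_lost)
print(len(long_replay), long_lost)
```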
Questions?


Editor's Notes

  1. The total number of processed events is much higher (~1.5T) because of duplicates and redundancy. Thousands of shows in every country, ma
  2. Post-processing required