SlideShare une entreprise Scribd logo
1  sur  21
1©	Cloudera,	Inc.	All	rights	reserved.
Marton	Balassi	|	Solutions	Architect	@	Cloudera*
@MartonBalassi |	mbalassi@cloudera.com
Judit Feher |	Data	Scientist	@	MTA	SZTAKI
jfeher@ilab.sztaki.hu
*Work	carried	out	while	employed	by	MTA	SZTAKI	on	the	Streamline	project.
Streaming	ML	with	Flink
This	project	has	received	funding	from	the	European	Union’s	Horizon	2020
research	and	innovation	program	under	grant	agreement	No	688191.
2©	Cloudera,	Inc.	All	rights	reserved.
Outline
• Current	FlinkML API	through	an	example
• Adding	streaming	predictors
• Online	learning
• Use	cases	in	the	Streamline	project
• Summary
3©	Cloudera,	Inc.	All	rights	reserved.
FlinkML example	usage
val env = ExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int,Int,Double)](trainFile)
val testData = env.readTextFile(testFile).map(_.toInt)
val model = ALS()
.setNumfactors(numFactors)
.setIterations(iterations)
.setLambda(lambda)
model.fit(trainData)
val prediction = model.predict(testData)
prediction.print()
Given	a	historical	(training)	dataset	of	user	preferences
let	us	recommend	desirable	items	for	a	set	of	users.
Design	motivated	by	the	sci-kit	learn	API.	More	at	http://arxiv.org/abs/1309.0238.
4©	Cloudera,	Inc.	All	rights	reserved.
FlinkML example	usage
val env = ExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int,Int,Double)](trainFile)
val testData = env.readTextFile(testFile).map(_.toInt)
val model = ALS()
.setNumfactors(numFactors)
.setIterations(iterations)
.setLambda(lambda)
model.fit(trainData)
val prediction = model.test(testData)
prediction.print()
Given	a	historical	(training)	dataset	of	user	preferences
let	us	recommend	desirable	items	for	a	set	of	users.
This	is	a	batch	input.	
But	does	it	need	to	be?
5©	Cloudera,	Inc.	All	rights	reserved.
A	little	recommender	theory
Item
factors
User side
information User-Item matrixUser factors
Item side
information
U
I
P
Q
R
• R	is	potentially	huge,	approximate	it	with	P∗Q
• Prediction	is	TopK(user’s	row	∗ Q)
6©	Cloudera,	Inc.	All	rights	reserved.
Prediction	is	a	natural	fit	for	streaming.
7©	Cloudera,	Inc.	All	rights	reserved.
A	closer (schematic)	look at the API
trait PredictDataSetOperation[Self, Testing, Prediction] {
def predictDataSet(instance: Self, input: DataSet[Testing]) : DataSet[Prediction]
}
trait PredictOperation[Instance, Model, Testing, Prediction] {
def getModel(instance: Instance) : DataSet[Model]
def predict(value: Testing, model: Model) : DataSet[Prediction]
}
A	DataSet and	a	record	level	API	to	implement	the	algorithm
(Prediction	is	always	done	on	a	model	already	trained)
The	record	level	version	is	arguably	more	convenient
It	is	wrapped	into	a	default	dataset	level	implementation
8©	Cloudera,	Inc.	All	rights	reserved.
A	closer (schematic)	look at the API
trait Estimator {
def fit[Training](training: DataSet[Training])(implicit f: FitOperation[Training]) = {
f.fit(training)
}
}
trait Transformer extends Estimator {
def transform[I,O](input: DataSet[I])(implicit t: TransformDataSetOperation[I,O]) = {
t.transform(input)
}
}
trait Predictor extends Estimator {
def predict[Testing](testing: DataSet[Testing])(implicit p: PredictDataSetOperation[T]) = {
p.predict(testing)
}
}
Three	well-picked	traits	go	a	long	way
9©	Cloudera,	Inc.	All	rights	reserved.
Could	we	share	the	model	with	a	streaming	job?
10©	Cloudera,	Inc.	All	rights	reserved.
Learn	in	batch,	predict	in	streaming
val env = ExecutionEnvironment.getExecutionEnvironment
val strEnv = StreamExecutionEnvironment.getExecutionEnvironment
val trainData = env.readCsvFile[(Int,Int,Double)](trainFile)
val testData = env.socketTextStream(...).map(_.toInt)
val model = ALS()
.setNumfactors(numFactors)
.setIterations(iterations)
.setLambda(lambda)
model.fit(trainData)
val prediction = model.predictStream(testData)
prediction.print()
11©	Cloudera,	Inc.	All	rights	reserved.
A	closer (schematic)	look at the streaming API
trait PredictDataSetOperation[Self, Testing, Prediction] {
def predictDataSet(instance: Self, input: DataSet[Testing]) : DataSet[Prediction]
}
trait PredictDataStreamOperation[Self, Testing, Prediction] {
def predictDataStream(instance: Self, input: DataStream[Testing]) : DataStream[Prediction]
}
• Implicit	conversions	from	the	batch	Predictors	to	StreamPredictors
• The	model	is	stored	then	loaded	into	a	stateful RichMapFunction processing	
the	input	stream
• Default	wrapper	implementations	to	support	both	the	DataStream	level	and	
the	record	level	implementations
• Adding	the	streaming	predictor	implementation	for	an	algorithm	given	the	
batch	one	is	trivial
12©	Cloudera,	Inc.	All	rights	reserved.
Recommender systems in batch	vs online	learning
• “30M”	Music	listening	dataset	crawled	by	the	
CrowdRec	team
• Implicit,	timestamped	music	listening	dataset
• Each	record	contains:
[	timestamp,	user	,	artist,	album,	track,	…	]
• We	always	recommend	and	learn	when	the	user	
interacts	with	an	item	at	the	first	time
• ~50,000	users,	~100,000	artists,	~500,000	tracks
• This happens when we shuffle the time
• A	partially batch	online	system
Use cases in the Streamline project
Judit Fehér
Hungarian Academy of Sciences
How iALS works and why is it different from ALS
ALS Problem to solve: 𝑅# = 𝑃& 𝑄#
– Linear regression
Error function
𝐿 = 𝑅 − 𝑅*
+,-.
/
+ 𝜆2 𝑃 +,-.
/
+ 𝜆3 𝑄 +,-.
/
Implicit error function
𝐿 = 4 𝑤6,# 𝑟̂6,# − 𝑟6,#
/
:;,:<
6=>,#=>
+ 𝜆2 4 𝑃6
/
:;
6=>
+ 𝜆3 4 𝑄#
/
:<
#=>
• Weighted MSE
• 𝑤6,# = ?
𝑤6,# if	(𝑢, 𝑖) ∈ 𝑇
𝑤I otherwise
𝑤I ≪
𝑤6,#
• Typical weights:
𝑤I = 1, 𝑤6,# = 100 ∗ 𝑠𝑢𝑝𝑝 𝑢, 𝑖
• What does it mean?
– Create two matrices from the events
– (1) Preference matrix
• Binary
• 1 represents the presence of an event
– (2) Confidence matrix
• Interprets our certainty on the
corresponding values in the first matrix
• Negative feedback is much less certain
Machine learning: batch, streaming? Combined?
Streaming recommeder
• Online learning
• Update immediately, e.g. with large
learning rate
• Data streaming
• Read training/testing data only once, no
chance to store
• Real time / Interactive
+ More timely, adapts fast
- Challenging to implement
Batch recommender
• Repeatedly read all training data
multiple times
• Stochastic gradient: use multiple
times in random order
• Elaborate optimization procedures,
e.g. SVM
+ More accurate (?)
+ Easy to implement (?)
Contextualized recommendation (NMusic)
Social recommendation Geo recommendation
R.Palovics,	A.A.Benczur,	L.Kocsis,	T.Kiss,	E.Frigo. "Exploiting temporal
influence in	online	recommendation",	ACM	RecSys (2014)
Palovics,	Szalai,	Kocsis,	Pap,	Frigo,	Benczur.	„Location-Aware Online	
Learning for Top-k	Hashtag Recommendation”,	LocalRec (2015)
Internet Memory Research use cases
Identify events that influence consumer behavior (product purchases, media consumption)
Events influence people
Before a football match, people buy beer, chips, …
Specific events influence specific people (requires user profiles)
A football fan does not play Angry Birds during a football match
Annotation by logistic regression
Train over data in rest
Streaming predict crawl time
Portugal Telecom use cases
MEO quadruple-play
Features
Internet
TV (IPTV)
Mobile phone
Landline phone
Current challenges
Heterogeneous data
Heterogeneous technical solutions
Customers profiling
Cross-domain recommendation
1TB/day
Rovio use cases
Development at Sztaki
iALS
- Flink already has explicit ALS
- The implementation of the implicit version is done
- Currently testing the algorithm's accuracy
Matrix factorization
- Distributed algorithm*
- We have a working prototype tested on smaller matrices but it still needs optimization
Logistic regression
- Implementation in progress
- It is based on stochastic gradient descent, but in Flink there is only a batch version
- Currently working on the gradient descent implementation
Metrics
- Implementation and testing is finished
- We need to create a pull request
*R. Gemulla et al, “Large scale Matrix Factorization with Distributed Stochastic Gradient Descent”, KDD 2011.
21©	Cloudera,	Inc.	All	rights	reserved.
Summary
• Scala	is	a	great tool for building	DSLs
• FlinkML’s API	is	motivated by scikit-learn
• Streaming	is	a	natural	fit	for	ML	predictors
• Online	learning	can	outperform	batch	in	certain	cases
• The	Streamline project	builds on Flink,	aims to contribute back	
as much of	the results as possible

Contenu connexe

Tendances

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
Flink Forward
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Flink Forward
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Flink Forward
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
Lance Co Ting Keh
 

Tendances (20)

FlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache FlinkFlinkML: Large Scale Machine Learning with Apache Flink
FlinkML: Large Scale Machine Learning with Apache Flink
 
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
K. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward KeynoteK. Tzoumas & S. Ewen – Flink Forward Keynote
K. Tzoumas & S. Ewen – Flink Forward Keynote
 
Christian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream ProcessingChristian Kreuzfeld – Static vs Dynamic Stream Processing
Christian Kreuzfeld – Static vs Dynamic Stream Processing
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache ZeppelinMoon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
Moon soo Lee – Data Science Lifecycle with Apache Flink and Apache Zeppelin
 
SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
 
Data Streaming Technology Overview
Data Streaming Technology OverviewData Streaming Technology Overview
Data Streaming Technology Overview
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
 
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas WeiseStream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas Weise
 
Spark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar CastanedaSpark Summit EU talk by Oscar Castaneda
Spark Summit EU talk by Oscar Castaneda
 

En vedette

Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward
 

En vedette (7)

Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP
 
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink TensorflowFlink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow
 
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...
 
Electricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural NetworksElectricity price forecasting with Recurrent Neural Networks
Electricity price forecasting with Recurrent Neural Networks
 
Feature Engineering
Feature EngineeringFeature Engineering
Feature Engineering
 

Similaire à Márton Balassi Streaming ML with Flink-

Similaire à Márton Balassi Streaming ML with Flink- (20)

SplunkLive! Zurich 2018: Integrating Metrics and Logs
SplunkLive! Zurich 2018: Integrating Metrics and LogsSplunkLive! Zurich 2018: Integrating Metrics and Logs
SplunkLive! Zurich 2018: Integrating Metrics and Logs
 
Daniel Villani
Daniel VillaniDaniel Villani
Daniel Villani
 
[DSC Europe 23] Igor Ilic - Redefining User Experience with Large Language Mo...
[DSC Europe 23] Igor Ilic - Redefining User Experience with Large Language Mo...[DSC Europe 23] Igor Ilic - Redefining User Experience with Large Language Mo...
[DSC Europe 23] Igor Ilic - Redefining User Experience with Large Language Mo...
 
Machine Learning and the Elastic Stack
Machine Learning and the Elastic StackMachine Learning and the Elastic Stack
Machine Learning and the Elastic Stack
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Log I am your father
Log I am your fatherLog I am your father
Log I am your father
 
Ch01lect1 et
Ch01lect1 etCh01lect1 et
Ch01lect1 et
 
Machine Learning and Analytics Breakout Session
Machine Learning and Analytics Breakout SessionMachine Learning and Analytics Breakout Session
Machine Learning and Analytics Breakout Session
 
Implementing the Genetic Algorithm in XSLT: PoC
Implementing the Genetic Algorithm in XSLT: PoCImplementing the Genetic Algorithm in XSLT: PoC
Implementing the Genetic Algorithm in XSLT: PoC
 
Making Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms ReliableMaking Netflix Machine Learning Algorithms Reliable
Making Netflix Machine Learning Algorithms Reliable
 
1025 track1 Malin
1025 track1 Malin1025 track1 Malin
1025 track1 Malin
 
Deconstructing a Machine Learning Pipeline with Virtual Data Lake
Deconstructing a Machine Learning Pipeline with Virtual Data LakeDeconstructing a Machine Learning Pipeline with Virtual Data Lake
Deconstructing a Machine Learning Pipeline with Virtual Data Lake
 
Machine Learning and Analytics Breakout Session
Machine Learning and Analytics Breakout SessionMachine Learning and Analytics Breakout Session
Machine Learning and Analytics Breakout Session
 
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
 
Data and Business Team Collaboration
Data and Business Team CollaborationData and Business Team Collaboration
Data and Business Team Collaboration
 
Sumologic <3 Open Source
Sumologic <3 Open SourceSumologic <3 Open Source
Sumologic <3 Open Source
 
Machine Learning and Analytics Breakout Session
Machine Learning and Analytics Breakout SessionMachine Learning and Analytics Breakout Session
Machine Learning and Analytics Breakout Session
 
Digital Transformation: How to Run Best-in-Class IT Operations in a World of ...
Digital Transformation: How to Run Best-in-Class IT Operations in a World of ...Digital Transformation: How to Run Best-in-Class IT Operations in a World of ...
Digital Transformation: How to Run Best-in-Class IT Operations in a World of ...
 

Plus de Flink Forward

Plus de Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Dernier

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
ahmedjiabur940
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
vexqp
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
vexqp
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
chadhar227
 

Dernier (20)

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi ArabiaIn Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
怎样办理旧金山城市学院毕业证(CCSF毕业证书)成绩单学校原版复制
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........Switzerland Constitution 2002.pdf.........
Switzerland Constitution 2002.pdf.........
 
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
SR-101-01012024-EN.docx  Federal Constitution  of the Swiss ConfederationSR-101-01012024-EN.docx  Federal Constitution  of the Swiss Confederation
SR-101-01012024-EN.docx Federal Constitution of the Swiss Confederation
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 

Márton Balassi Streaming ML with Flink-

  • 1. 1© Cloudera, Inc. All rights reserved. Marton Balassi | Solutions Architect @ Cloudera* @MartonBalassi | mbalassi@cloudera.com Judit Feher | Data Scientist @ MTA SZTAKI jfeher@ilab.sztaki.hu *Work carried out while employed by MTA SZTAKI on the Streamline project. Streaming ML with Flink This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 688191.
  • 2. 2© Cloudera, Inc. All rights reserved. Outline • Current FlinkML API through an example • Adding streaming predictors • Online learning • Use cases in the Streamline project • Summary
  • 3. 3© Cloudera, Inc. All rights reserved. FlinkML example usage val env = ExecutionEnvironment.getExecutionEnvironment val trainData = env.readCsvFile[(Int,Int,Double)](trainFile) val testData = env.readTextFile(testFile).map(_.toInt) val model = ALS() .setNumfactors(numFactors) .setIterations(iterations) .setLambda(lambda) model.fit(trainData) val prediction = model.predict(testData) prediction.print() Given a historical (training) dataset of user preferences let us recommend desirable items for a set of users. Design motivated by the sci-kit learn API. More at http://arxiv.org/abs/1309.0238.
  • 4. 4© Cloudera, Inc. All rights reserved. FlinkML example usage val env = ExecutionEnvironment.getExecutionEnvironment val trainData = env.readCsvFile[(Int,Int,Double)](trainFile) val testData = env.readTextFile(testFile).map(_.toInt) val model = ALS() .setNumfactors(numFactors) .setIterations(iterations) .setLambda(lambda) model.fit(trainData) val prediction = model.test(testData) prediction.print() Given a historical (training) dataset of user preferences let us recommend desirable items for a set of users. This is a batch input. But does it need to be?
  • 5. 5© Cloudera, Inc. All rights reserved. A little recommender theory Item factors User side information User-Item matrixUser factors Item side information U I P Q R • R is potentially huge, approximate it with P∗Q • Prediction is TopK(user’s row ∗ Q)
  • 7. 7© Cloudera, Inc. All rights reserved. A closer (schematic) look at the API trait PredictDataSetOperation[Self, Testing, Prediction] { def predictDataSet(instance: Self, input: DataSet[Testing]) : DataSet[Prediction] } trait PredictOperation[Instance, Model, Testing, Prediction] { def getModel(instance: Instance) : DataSet[Model] def predict(value: Testing, model: Model) : DataSet[Prediction] } A DataSet and a record level API to implement the algorithm (Prediction is always done on a model already trained) The record level version is arguably more convenient It is wrapped into a default dataset level implementation
  • 8. 8© Cloudera, Inc. All rights reserved. A closer (schematic) look at the API trait Estimator { def fit[Training](training: DataSet[Training])(implicit f: FitOperation[Training]) = { f.fit(training) } } trait Transformer extends Estimator { def transform[I,O](input: DataSet[I])(implicit t: TransformDataSetOperation[I,O]) = { t.transform(input) } } trait Predictor extends Estimator { def predict[Testing](testing: DataSet[Testing])(implicit p: PredictDataSetOperation[T]) = { p.predict(testing) } } Three well-picked traits go a long way
  • 10. 10© Cloudera, Inc. All rights reserved. Learn in batch, predict in streaming val env = ExecutionEnvironment.getExecutionEnvironment val strEnv = StreamExecutionEnvironment.getExecutionEnvironment val trainData = env.readCsvFile[(Int,Int,Double)](trainFile) val testData = env.socketTextStream(...).map(_.toInt) val model = ALS() .setNumfactors(numFactors) .setIterations(iterations) .setLambda(lambda) model.fit(trainData) val prediction = model.predictStream(testData) prediction.print()
  • 11. 11© Cloudera, Inc. All rights reserved. A closer (schematic) look at the streaming API trait PredictDataSetOperation[Self, Testing, Prediction] { def predictDataSet(instance: Self, input: DataSet[Testing]) : DataSet[Prediction] } trait PredictDataStreamOperation[Self, Testing, Prediction] { def predictDataStream(instance: Self, input: DataStream[Testing]) : DataStream[Prediction] } • Implicit conversions from the batch Predictors to StreamPredictors • The model is stored then loaded into a stateful RichMapFunction processing the input stream • Default wrapper implementations to support both the DataStream level and the record level implementations • Adding the streaming predictor implementation for an algorithm given the batch one is trivial
  • 12. 12© Cloudera, Inc. All rights reserved. Recommender systems in batch vs online learning • “30M” Music listening dataset crawled by the CrowdRec team • Implicit, timestamped music listening dataset • Each record contains: [ timestamp, user , artist, album, track, … ] • We always recommend and learn when the user interacts with an item at the first time • ~50,000 users, ~100,000 artists, ~500,000 tracks • This happens when we shuffle the time • A partially batch online system
  • 13. Use cases in the Streamline project Judit Fehér Hungarian Academy of Sciences
  • 14. How iALS works and why is it different from ALS ALS Problem to solve: 𝑅# = 𝑃& 𝑄# – Linear regression Error function 𝐿 = 𝑅 − 𝑅* +,-. / + 𝜆2 𝑃 +,-. / + 𝜆3 𝑄 +,-. / Implicit error function 𝐿 = 4 𝑤6,# 𝑟̂6,# − 𝑟6,# / :;,:< 6=>,#=> + 𝜆2 4 𝑃6 / :; 6=> + 𝜆3 4 𝑄# / :< #=> • Weighted MSE • 𝑤6,# = ? 𝑤6,# if (𝑢, 𝑖) ∈ 𝑇 𝑤I otherwise 𝑤I ≪ 𝑤6,# • Typical weights: 𝑤I = 1, 𝑤6,# = 100 ∗ 𝑠𝑢𝑝𝑝 𝑢, 𝑖 • What does it mean? – Create two matrices from the events – (1) Preference matrix • Binary • 1 represents the presence of an event – (2) Confidence matrix • Interprets our certainty on the corresponding values in the first matrix • Negative feedback is much less certain
  • 15. Machine learning: batch, streaming? Combined? Streaming recommeder • Online learning • Update immediately, e.g. with large learning rate • Data streaming • Read training/testing data only once, no chance to store • Real time / Interactive + More timely, adapts fast - Challenging to implement Batch recommender • Repeatedly read all training data multiple times • Stochastic gradient: use multiple times in random order • Elaborate optimization procedures, e.g. SVM + More accurate (?) + Easy to implement (?)
  • 16. Contextualized recommendation (NMusic) Social recommendation Geo recommendation R.Palovics, A.A.Benczur, L.Kocsis, T.Kiss, E.Frigo. "Exploiting temporal influence in online recommendation", ACM RecSys (2014) Palovics, Szalai, Kocsis, Pap, Frigo, Benczur. „Location-Aware Online Learning for Top-k Hashtag Recommendation”, LocalRec (2015)
  • 17. Internet Memory Research use cases Identify events that influence consumer behavior (product purchases, media consumption) Events influence people Before a football match, people buy beer, chips, … Specific events influence specific people (requires user profiles) A football fan does not play Angry Birds during a football match Annotation by logistic regression Train over data in rest Streaming predict crawl time
  • 18. Portugal Telecom use cases MEO quadruple-play Features Internet TV (IPTV) Mobile phone Landline phone Current challenges Heterogeneous data Heterogeneous technical solutions Customers profiling Cross-domain recommendation 1TB/day
  • 20. Development at Sztaki iALS - Flink already has explicit ALS - The implementation of the implicit version is done - Currently testing the algorithm's accuracy Matrix factorization - Distributed algorithm* - We have a working prototype tested on smaller matrices but it still needs optimization Logistic regression - Implementation in progress - It is based on stochastic gradient descent, but in Flink there is only a batch version - Currently working on the gradient descent implementation Metrics - Implementation and testing is finished - We need to create a pull request *R. Gemulla et al, “Large scale Matrix Factorization with Distributed Stochastic Gradient Descent”, KDD 2011.
  • 21. 21© Cloudera, Inc. All rights reserved. Summary • Scala is a great tool for building DSLs • FlinkML’s API is motivated by scikit-learn • Streaming is a natural fit for ML predictors • Online learning can outperform batch in certain cases • The Streamline project builds on Flink, aims to contribute back as much of the results as possible