SlideShare une entreprise Scribd logo
1  sur  28
Online Learning with Structured
Streaming
Ram Sriharsha, Vlad Feinberg
@halfabrane
Spark Summit, Brussels
27 October 2016
What is online learning?
• Update model parameters on each data point
• In batch setting get to see the entire dataset before update
• Cannot visit data points again
• In batch setting, can iterate over data points as many
times as we want!
2
An example: the perceptron
3
x
w
Update Rule: if (y != sign(w.x)), w -> w +
y(w.x)
Goal: Find the best line separating
positive
From negative examples on a plane
Why learn online?
• I want to adapt to changing patterns quickly
• data distribution can change
– e.g, distribution of features that affect learning might change
over time
• I need to learn a good model within resource +
time constraints (large-scale learning)
• Time to a given accuracy might be faster for certain online
algorithms
4
Online Classification Setting
• Pick a hypothesis
• For each labeled example (𝘅, y):
• Predict label ỹ using hypothesis
• Observe the loss 𝓛(y, ỹ) (and its gradient)
• Learn from mistake and update hypothesis
• Goal: to make as few mistakes as possible in
comparison to the best hypothesis in hindsight
5
An example: Online SGD
• Initialize weights 𝘄
• Loss function 𝓛 is known.
• For each labeled example (𝘅, y):
• Perform update 𝘄 -> 𝘄 – η∇𝓛(y , 𝘄.𝘅)
• For each new example x:
• Predict ỹ = σ(𝘄.𝘅) (σ is called link function)
6
𝓛(y , 𝘄.𝘅)
𝘄
ẘ
Distributed Online Learning
• Synchronous
• On each worker:
– Load training data, compute gradients and update model, push
model to driver
• On some node:
– Perform model merge
• Asynchronous
• On each worker:
– Load training data, compute gradients and push to server
• On each server:
– Aggregate the gradients, perform update step
7
Challenges
• Not all algorithms admit efficient online versions
• Lack of infrastructure
• (Single machine) Vowpal Wabbit works great but hard to use
from Scala, Java and other languages.
• (Distributed) No implementation that is fault tolerant,
scalable, robust
• Lack of framework in open source to provide
extensible algorithms
• Adagrad, normalized learning, L1 regularization,…
• Online SGD, FTRL, ...
8
Structured Streaming
1. One single API DataFrame for everything
- Same API for machine learning, batch processing, graphX
- Dataset is a typed version of DataFrame for Scala and Java
2. End-to-end exactly-once guarantees
- The guarantees extend into the sources/sinks, e.g. MySQL, S3
3. Understands external event-time
- Handling late arriving data
- Support sessionization based on event-time
Structured Streaming
How does it work?
at any time, the output of the application is
equivalent to executing a batch job on a prefix of
the data
11
The Model Trigger: every 1 sec
1 2 3
Time
data up
to 1
Input data up
to 2
data up
to 3
Quer
y
Input: data from source as an
append-only table
Trigger: how frequently to check
input for new data
Query: operations on input
usual map/filter/reduce
new window, session ops
The Model Trigger: every 1 sec
1 2 3
output
for data
up to 1
Result
Quer
y
Time
data up
to 1
Input data up
to 2
output
for data
up to 2
data up
to 3
output
for data
up to 3
Result: final operated table
updated every trigger interval
Output: what part of result to write
to data sink after every
trigger
Complete output: Write full result table every
time
Output
complet
e
output
The Model Trigger: every 1 sec
1 2 3
output
for data
up to 1
Result
Quer
y
Time
data up
to 1
Input data up
to 2
output
for data
up to 2
data up
to 3
output
for data
up to 3
Output
delta
output
Result: final operated table
updated every trigger
interval
Output: what part of result to write
to data sink after every
trigger
Complete output: Write full result table
every time
Delta output: Write only the rows that
changed
in result from previous batch
Streaming ML on
Structured Streaming
Streaming ML on Structured
Streaming
Trigger: every 1 sec
1 2 3
Time
data up
to 1
Input data up
to 2
data up
to 3
Quer
y
Input: append only table
containing
labeled examples
Query: Stateful aggregation
query: picks up the last trained
model, performs a distributed
update + merge
Streaming ML on Structured
Streaming
Trigger: every 1 sec
1 2 3
model
for data
up to t
Result
Quer
y
Time
labeled
examples
up
to time t
InputResult: table of model parameters
updated every trigger interval
Complete mode: table has one
row,
constantly being updated
Append mode (in the works):
table has timestamp-keyed model,
one
row per trigger
Output
intermediate models would have the same state at
this point of computation for the (abstract) queries #1
Why is this hard?
• Need to update model, i.e
• Update(previousModel, newDataPoint) = newModel
• Typical aggregation is associative, commutative
• e.g. sum( P1: sum(sum(0, data[0]), data[1]), P2:
sum(sum(0, data[2]), data[3]))
• General model update violates associativity +
commutativity!
18
Solution: Make Assumptions
• Result may be partition-dependent, but we don’t
care as long as we get some valid result.
average-models(
P1: update(update(previous model, data[0]), data[1]),
P2: update(update(previous model, data[2]), data[3]))
• Only partition-dependent if update and average
don’t commute - can still be deterministic
otherwise!
19
Stateful Aggregator
• Within each partition
• Initialize with previous state (instead of zero in regular
aggregator)
• For each item, update state
• Perform reduce step
• Output final state
Very general abstraction: works for sketches,
online statistics (quantiles), online clustering …
20
How does it work?
Driver
Map Map
State
Store
Labeled Stream
Source
Reduce
Is there more
data? yes
!
run
query
Map
Read labeled
examples
Feature transforms, gradient
updates
Model
averagingsave
model
read last saved
model
APIs
Spark Summit Brussels
27 October 2016
ML Estimator on Streams
• Interoperable with ML pipelines
23
Streaming
DF
m = estimator.fit()
m.writeStream
streaming sink
Input: stream of labelled data
Output: stream of models, updated over
time.
Batch Interoperability
• Seamless application on batch datasets
24
Static DF
for batch
ML
model = estimator.fit(batchDF)
1
n
API Goals
• Provide modern, regret-minimization-based
online algorithms.
• Online Logistic Regression
• Adagrad
• Online gradient descent
• L2 regularization
• Input streams of any kind accepted.
• Streaming aware feature engineering
26
What’s next?
Spark Summit Brussels
27 October 2016
What’s next?
• More bells and whistles
• Adaptive normalization
• L1 regularization
• More algorithms
• Online quantile estimation?
• More general Sketches?
• Online clustering?
• Scale testing and benchmarking
28
Demo
Spark Summit Brussels
27 October 2016

Contenu connexe

Tendances

Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Flink Forward
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Flink Forward
 

Tendances (20)

Spark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza KarimiSpark Summit EU talk by Reza Karimi
Spark Summit EU talk by Reza Karimi
 
Ufuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one SystemUfuc Celebi – Stream & Batch Processing in one System
Ufuc Celebi – Stream & Batch Processing in one System
 
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-timeChris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
Chris Hillman – Beyond Mapreduce Scientific Data Processing in Real-time
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
MALT: Distributed Data-Parallelism for Existing ML Applications (Distributed ...
 
Asymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, ExplainedAsymmetry in Large-Scale Graph Analysis, Explained
Asymmetry in Large-Scale Graph Analysis, Explained
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
 
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis -  Massimo PeriniDeep Stream Dynamic Graph Analytics with Grapharis -  Massimo Perini
Deep Stream Dynamic Graph Analytics with Grapharis - Massimo Perini
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
 
Apache Flink & Graph Processing
Apache Flink & Graph ProcessingApache Flink & Graph Processing
Apache Flink & Graph Processing
 
Inside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick ReissInside Apache SystemML by Frederick Reiss
Inside Apache SystemML by Frederick Reiss
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 

En vedette

A little bit of clojure
A little bit of clojureA little bit of clojure
A little bit of clojure
Ben Stopford
 

En vedette (20)

Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
 
Spark stream - Kafka
Spark stream - Kafka Spark stream - Kafka
Spark stream - Kafka
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 
Kafka for data scientists
Kafka for data scientistsKafka for data scientists
Kafka for data scientists
 
Wrangling Big Data in a Small Tech Ecosystem
Wrangling Big Data in a Small Tech EcosystemWrangling Big Data in a Small Tech Ecosystem
Wrangling Big Data in a Small Tech Ecosystem
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalization
 
Kafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache KafkaKafka Streams: The Stream Processing Engine of Apache Kafka
Kafka Streams: The Stream Processing Engine of Apache Kafka
 
Best Practices for testing of SOA-based systems - with examples of SOA Suite 11g
Best Practices for testing of SOA-based systems - with examples of SOA Suite 11gBest Practices for testing of SOA-based systems - with examples of SOA Suite 11g
Best Practices for testing of SOA-based systems - with examples of SOA Suite 11g
 
A little bit of clojure
A little bit of clojureA little bit of clojure
A little bit of clojure
 
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad FeinbergSpark Summit EU talk by Ram Sriharsha and Vlad Feinberg
Spark Summit EU talk by Ram Sriharsha and Vlad Feinberg
 
Big Data & the Enterprise
Big Data & the EnterpriseBig Data & the Enterprise
Big Data & the Enterprise
 
Sessionization with Spark streaming
Sessionization with Spark streamingSessionization with Spark streaming
Sessionization with Spark streaming
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 
Building a Real-Time Forecasting Engine with Scala and Akka
Building a Real-Time Forecasting Engine with Scala and Akka Building a Real-Time Forecasting Engine with Scala and Akka
Building a Real-Time Forecasting Engine with Scala and Akka
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
Data Stream Analytics - Why they are important
Data Stream Analytics - Why they are importantData Stream Analytics - Why they are important
Data Stream Analytics - Why they are important
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
React Fast by Processing Streaming Data in Real-Time
React Fast by Processing Streaming Data in Real-TimeReact Fast by Processing Streaming Data in Real-Time
React Fast by Processing Streaming Data in Real-Time
 

Similaire à Online learning with structured streaming, spark summit brussels 2016

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 

Similaire à Online learning with structured streaming, spark summit brussels 2016 (20)

08 neural networks
08 neural networks08 neural networks
08 neural networks
 
Deep Learning at Scale
Deep Learning at ScaleDeep Learning at Scale
Deep Learning at Scale
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Problem-solving and design 1.pptx
Problem-solving and design 1.pptxProblem-solving and design 1.pptx
Problem-solving and design 1.pptx
 
An Introduction to TensorFlow architecture
An Introduction to TensorFlow architectureAn Introduction to TensorFlow architecture
An Introduction to TensorFlow architecture
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Seven deadly sins of ElasticSearch Benchmarking
Seven deadly sins of ElasticSearch BenchmarkingSeven deadly sins of ElasticSearch Benchmarking
Seven deadly sins of ElasticSearch Benchmarking
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit faster
 
more loops lecture by Professor Evan korth
more loops  lecture by Professor Evan korth more loops  lecture by Professor Evan korth
more loops lecture by Professor Evan korth
 
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and TransformIntro to Apache Apex - Next Gen Platform for Ingest and Transform
Intro to Apache Apex - Next Gen Platform for Ingest and Transform
 
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
Automating Speed: A Proven Approach to Preventing Performance Regressions in ...
 
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
 
Unit 3 part2
Unit 3 part2Unit 3 part2
Unit 3 part2
 
Machine Learning for Capacity Management
 Machine Learning for Capacity Management Machine Learning for Capacity Management
Machine Learning for Capacity Management
 
Concurrency
ConcurrencyConcurrency
Concurrency
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Online learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and HadoopOnline learning, Vowpal Wabbit and Hadoop
Online learning, Vowpal Wabbit and Hadoop
 
Pregel
PregelPregel
Pregel
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 

Dernier

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
shivangimorya083
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
Lars Albertsson
 

Dernier (20)

Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Online learning with structured streaming, spark summit brussels 2016

  • 1. Online Learning with Structured Streaming Ram Sriharsha, Vlad Feinberg @halfabrane Spark Summit, Brussels 27 October 2016
  • 2. What is online learning? • Update model parameters on each data point • In batch setting get to see the entire dataset before update • Cannot visit data points again • In batch setting, can iterate over data points as many times as we want! 2
  • 3. An example: the perceptron 3 x w Update Rule: if (y != sign(w.x)), w -> w + y(w.x) Goal: Find the best line separating positive From negative examples on a plane
  • 4. Why learn online? • I want to adapt to changing patterns quickly • data distribution can change – e.g, distribution of features that affect learning might change over time • I need to learn a good model within resource + time constraints (large-scale learning) • Time to a given accuracy might be faster for certain online algorithms 4
  • 5. Online Classification Setting • Pick a hypothesis • For each labeled example (𝘅, y): • Predict label ỹ using hypothesis • Observe the loss 𝓛(y, ỹ) (and its gradient) • Learn from mistake and update hypothesis • Goal: to make as few mistakes as possible in comparison to the best hypothesis in hindsight 5
  • 6. An example: Online SGD • Initialize weights 𝘄 • Loss function 𝓛 is known. • For each labeled example (𝘅, y): • Perform update 𝘄 -> 𝘄 – η∇𝓛(y , 𝘄.𝘅) • For each new example x: • Predict ỹ = σ(𝘄.𝘅) (σ is called link function) 6 𝓛(y , 𝘄.𝘅) 𝘄 ẘ
  • 7. Distributed Online Learning • Synchronous • On each worker: – Load training data, compute gradients and update model, push model to driver • On some node: – Perform model merge • Asynchronous • On each worker: – Load training data, compute gradients and push to server • On each server: – Aggregate the gradients, perform update step 7
  • 8. Challenges • Not all algorithms admit efficient online versions • Lack of infrastructure • (Single machine) Vowpal Wabbit works great but hard to use from Scala, Java and other languages. • (Distributed) No implementation that is fault tolerant, scalable, robust • Lack of framework in open source to provide extensible algorithms • Adagrad, normalized learning, L1 regularization,… • Online SGD, FTRL, ... 8
  • 10. 1. One single API DataFrame for everything - Same API for machine learning, batch processing, graphX - Dataset is a typed version of DataFrame for Scala and Java 2. End-to-end exactly-once guarantees - The guarantees extend into the sources/sinks, e.g. MySQL, S3 3. Understands external event-time - Handling late arriving data - Support sessionization based on event-time Structured Streaming
  • 11. How does it work? at any time, the output of the application is equivalent to executing a batch job on a prefix of the data 11
  • 12. The Model Trigger: every 1 sec 1 2 3 Time data up to 1 Input data up to 2 data up to 3 Quer y Input: data from source as an append-only table Trigger: how frequently to check input for new data Query: operations on input usual map/filter/reduce new window, session ops
  • 13. The Model Trigger: every 1 sec 1 2 3 output for data up to 1 Result Quer y Time data up to 1 Input data up to 2 output for data up to 2 data up to 3 output for data up to 3 Result: final operated table updated every trigger interval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Output complet e output
  • 14. The Model Trigger: every 1 sec 1 2 3 output for data up to 1 Result Quer y Time data up to 1 Input data up to 2 output for data up to 2 data up to 3 output for data up to 3 Output delta output Result: final operated table updated every trigger interval Output: what part of result to write to data sink after every trigger Complete output: Write full result table every time Delta output: Write only the rows that changed in result from previous batch
  • 16. Streaming ML on Structured Streaming Trigger: every 1 sec 1 2 3 Time data up to 1 Input data up to 2 data up to 3 Quer y Input: append only table containing labeled examples Query: Stateful aggregation query: picks up the last trained model, performs a distributed update + merge
  • 17. Streaming ML on Structured Streaming Trigger: every 1 sec 1 2 3 model for data up to t Result Quer y Time labeled examples up to time t InputResult: table of model parameters updated every trigger interval Complete mode: table has one row, constantly being updated Append mode (in the works): table has timestamp-keyed model, one row per trigger Output intermediate models would have the same state at this point of computation for the (abstract) queries #1
  • 18. Why is this hard? • Need to update model, i.e • Update(previousModel, newDataPoint) = newModel • Typical aggregation is associative, commutative • e.g. sum( P1: sum(sum(0, data[0]), data[1]), P2: sum(sum(0, data[2]), data[3])) • General model update violates associativity + commutativity! 18
  • 19. Solution: Make Assumptions • Result may be partition-dependent, but we don’t care as long as we get some valid result. average-models( P1: update(update(previous model, data[0]), data[1]), P2: update(update(previous model, data[2]), data[3])) • Only partition-dependent if update and average don’t commute - can still be deterministic otherwise! 19
  • 20. Stateful Aggregator • Within each partition • Initialize with previous state (instead of zero in regular aggregator) • For each item, update state • Perform reduce step • Output final state Very general abstraction: works for sketches, online statistics (quantiles), online clustering … 20
  • 21. How does it work? Driver Map Map State Store Labeled Stream Source Reduce Is there more data? yes ! run query Map Read labeled examples Feature transforms, gradient updates Model averagingsave model read last saved model
  • 23. ML Estimator on Streams • Interoperable with ML pipelines 23 Streaming DF m = estimator.fit() m.writeStream streaming sink Input: stream of labelled data Output: stream of models, updated over time.
  • 24. Batch Interoperability • Seamless application on batch datasets 24 Static DF for batch ML model = estimator.fit(batchDF) 1 n
  • 25. API Goals • Provide modern, regret-minimization-based online algorithms. • Online Logistic Regression • Adagrad • Online gradient descent • L2 regularization • Input streams of any kind accepted. • Streaming aware feature engineering 26
  • 26. What’s next? Spark Summit Brussels 27 October 2016
  • 27. What’s next? • More bells and whistles • Adaptive normalization • L1 regularization • More algorithms • Online quantile estimation? • More general Sketches? • Online clustering? • Scale testing and benchmarking 28