SlideShare a Scribd company logo
1 of 39
Streaming Topic Model TrainingStreaming Topic Model Training
and Inferenceand Inference
Suneel MarthiSuneel Marthi
Joey FrazeeJoey Frazee
September 5, 2018September 5, 2018
Flink Forward, Berlin, GermanyFlink Forward, Berlin, Germany
1
$WhoAreWe$WhoAreWe
Joey FrazeeJoey Frazee
 @jfrazee@jfrazee
Member of Apache Software Foundation
Committer on Apache NiFi, and PMC on Apache Streams
Suneel MarthiSuneel Marthi
 @suneelmarthi@suneelmarthi
Member of Apache Software Foundation
Committer and PMC on Apache Mahout, Apache OpenNLP, Apache
Streams
2
AgendaAgenda
Motivation for Topic Modeling
Existing Approaches for Topic Modeling
Topic Modeling on Streams
3
Motivation for Topic ModelingMotivation for Topic Modeling
4
Topic ModelsTopic Models
Automatically discovering main topics in collections
of documents
Overall
What are the main themes discussed on a mailing
list or a discussion forum ?
What are the most recurring topics discussed in
research papers ?
5
Topic Models (Contd)Topic Models (Contd)
Per Document
What are the main topics discussed in a certain
(single) newspaper article ?
What is the most distinctive topic that makes a
certain research article more interesting than the
others ?
6
Uses of Topic ModelsUses of Topic Models
Search Engines to organize text
Annotating Documents according to these topics
Uncovering hidden topical patterns across a
document collection
7
Common Approaches to TopicCommon Approaches to Topic
ModelingModeling
8
Latent Dirichlet Allocation (LDA)
Topics are composed by probability distributions
over words
Documents are composed by probability
distributions over Topics
Batch Oriented approach
9
N: vocab size
M/D: # of documents
alpha: prior on per-document topic dist.
beta: prior on topic-word dist.
theta: per-document topic dist.
phi: per-topic word dist.
z: topic for word w
IntuitionIntuition: Documents are a mixture of topics, topics are a mixture of words. So documents: Documents are a mixture of topics, topics are a mixture of words. So documents
can be generated by drawing a sample of topics, and then drawing a sample of words.can be generated by drawing a sample of topics, and then drawing a sample of words.
LDA ModelLDA Model
10
Latent Semantic Analysis (LSA)
SVDed TF-IDF Document-Term Matrix
Batch Oriented approach
11
Drawbacks of Traditional Topic Modeling Approach
Problem : labelling topics is often a manual task
E.g. LDA topic:30% basket,15% ball,10% drunk,…
can be tagged as basketball
12
Common ApproachesCommon Approaches DeepDeep
Learning (of course)Learning (of course)
Learn to perform LDA: LDA supervised DNN training
(inference speed up)
https://arxiv.org/abs/1508.01011
Deep Belief Networks for Topic Modelling
https://arxiv.org/abs/1501.04325
13
What are Embeddings ?What are Embeddings ?
Represent a document as a point in space, semantically similarRepresent a document as a point in space, semantically similar
docs are close together when plotteddocs are close together when plotted
Such a representation is learned via a (shallow) neural networkSuch a representation is learned via a (shallow) neural network
algorithmalgorithm
14
Embeddings for Topic ModelingEmbeddings for Topic Modeling
LDA2Vec: Mixing LDA and word2vec word embeddings
https://arxiv.org/abs/1605.02019
Navigating embeddings: geometric navigation in word
and doc embeddings space looking for topics
15
Static vs Dynamic CorporaStatic vs Dynamic Corpora
Learning Topic Models over a fixed set of Documents
Fetch the Corpus
Train and Fit a topic model
Extract topics
...other downstream tasks
16
Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd)
What to do with Corpus updates?
Document collections are not static in real world
There may be no document collections to begin
with
Memory limits with large corpora
Reprocess entire corpora when new documents are
added
17
Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd)
Streams, Streams, Streams
Twitter stream
Reddit posts
Papers on ResearchGate and Arxiv
....
18
Questions for AudienceQuestions for Audience
Who’s doing inference or scoring on streaming data?
Who’s doing online learning and optimization on streaming
data?
(2) is hard, sometimes because the algorithms aren’t(2) is hard, sometimes because the algorithms aren’t
there.there. But, also because people don’t know theBut, also because people don’t know the
algorithms are there.algorithms are there.
19
Topic Modeling on StreamsTopic Modeling on Streams
20
2 Streaming Approaches2 Streaming Approaches
Learning Topics from Jira Issues
Online LDA
21
Learning Topics on Jira IssuesLearning Topics on Jira Issues
22
WorkflowWorkflow
23
FLINK-9963: Add a single value table formatFLINK-9963: Add a single value table format
factoryfactory
24
Topics extracted for Flink-9963Topics extracted for Flink-9963
format factory
table schema
table source
sink factory
table sink
25
FLINK-8286: Fix Flink-Yarn-Kerberos integrationFLINK-8286: Fix Flink-Yarn-Kerberos integration
for FLIP-6for FLIP-6
26
Topics extracted for Flink-8286Topics extracted for Flink-8286
yarn kerberos integration
kerberos integrations
yarn kerberos
27
Bayesian InferenceBayesian Inference
GoalGoal: Calculate the posterior distribution of a: Calculate the posterior distribution of a
probabilistic model given dataprobabilistic model given data
Gibbs sampling/MCMC: Repeatedly sample from conditional
distribution(s) to approximate the posterior of the joint
distribution; typically a batch process
Variational Bayes: Optimize the parameters of a function that
approximates joint distribution
28
Variational BayesVariational Bayes
Expectation Maximization-like optimization
Faster and more accurate
Can often be computed online (on windows over streams)
29
Online LDA (1/2)Online LDA (1/2)
Variational Bayes for LDA
Batch or online
Iterate between optimizing per-word topics and per-document
topics (“E” step) and topic-word distributions (“M” step)
Default topic model optimization in Gensim, scikit-learn,
Apache Spark MLlib -- not because it’s amenable to streaming
per se, but more because it’s memory efficient and can be
applied to large corpora.
Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010.Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010.
30
job-constant
K: number of topics
alpha: topic-document dist. prior
eta: topic-word dist. prior
tau0: learning rate
kappa: decay rate
(mini)batch-local
gamma: theta prior
theta: param. for per-document topic dist
phi: param. for per-topic word dist.
rho: update weight
job-global
lambda: parameter for topic-word dist.
t: documents/batches seen (index)
GoalGoal: Optimize the topic-word dist. parameter: Optimize the topic-word dist. parameter
lambda; it tells us what words belong to the topics.lambda; it tells us what words belong to the topics.
[Image source: Hoffman, Blei & Bach, 2010.][Image source: Hoffman, Blei & Bach, 2010.]
Job-global parameters can be tracked in a stateJob-global parameters can be tracked in a state
store. Batch-local parameters can be discarded.store. Batch-local parameters can be discarded.
Online LDA (2/2)Online LDA (2/2)
31
job-constant
K: number of topics
alpha: topic-document dist. prior
eta: topic-word dist. prior
tau0: learning rate
kappa: decay rate
Set job parameters globally withSet job parameters globally with
setGlobalJobParameters() on executionsetGlobalJobParameters() on execution
environment, or pass as arguments toenvironment, or pass as arguments to
operatorsoperators
(mini)batch-local
gamma: theta prior
theta: param. for per-document topic dist
phi: param. for per-topic word dist.
rho: update weight
Operator/task-internal and don't need to beOperator/task-internal and don't need to be
accessible by other tasksaccessible by other tasks
job-global
lambda: parameter for topic-word dist.
t: documents/batches seen (index)
Several options:
External key-value store
Accumulators: lambda update as increment, accumulator +
broadcast variables
Broadcast state
And, (mini)batches selected by Window sizeAnd, (mini)batches selected by Window size
Online LDA on FlinkOnline LDA on Flink
32
So what?So what?
33
For many tasks, the default is batch training/optimization
Often, even if good online training and optimization
algorithms exist (why?)
So if your algorithms only require 1-step of parameter state,
why not just do streaming?
You can get instant, always up-to-date results by putting the
effort into using a different class of optimization algorithms
an using a stream processing engine, so just do that
34
LinksLinks
http://papers.nips.cc/paper/3902-online-learning-
for-latentdirichlet-allocation
Learning from LDA using Deep Neural Networks -
https://arxiv.org/pdf/1508.01011.pdf
Deep Belief Nets for Topic Modeling -
https://arxiv.org/pdf/1501.04325.pdf
Topic Modeling on Jira issues:
https://github.com/tteofili/jtm
35
CreditsCredits
Tommaso Teofili, Simone Tripodi (Adobe - Rome)
Joern Kottmann, Bruno Kinoshita (Apache OpenNLP)
Fabian Hueske (Apache Flink, Data Artisans)
36
Questions ???Questions ???
37

More Related Content

What's hot

ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationBoris Glavic
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataYanchang Zhao
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlBen Healey
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in RAshraf Uddin
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Perl optimization tidbits
Perl optimization tidbitsPerl optimization tidbits
Perl optimization tidbitsDaina Pettit
 
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...Ángel Jesús Martínez Bernal
 
R language tutorial
R language tutorialR language tutorial
R language tutorialDavid Chiu
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval byIJNSA Journal
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)fridolin.wild
 
Improving data compression ratio by the use of optimality of lzw & adaptive h...
Improving data compression ratio by the use of optimality of lzw & adaptive h...Improving data compression ratio by the use of optimality of lzw & adaptive h...
Improving data compression ratio by the use of optimality of lzw & adaptive h...ijitjournal
 
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
 Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ... Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...Vladimir Alexiev, PhD, PMP
 

What's hot (19)

ICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database VirtualizationICDE 2015 - LDV: Light-weight Database Virtualization
ICDE 2015 - LDV: Light-weight Database Virtualization
 
Text Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter DataText Mining with R -- an Analysis of Twitter Data
Text Mining with R -- an Analysis of Twitter Data
 
HDF5 Tools
HDF5 ToolsHDF5 Tools
HDF5 Tools
 
Substituting HDF5 tools with Python/H5py scripts
Substituting HDF5 tools with Python/H5py scriptsSubstituting HDF5 tools with Python/H5py scripts
Substituting HDF5 tools with Python/H5py scripts
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco Control
 
Text Mining Infrastructure in R
Text Mining Infrastructure in RText Mining Infrastructure in R
Text Mining Infrastructure in R
 
Using HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py moduleUsing HDF5 and Python: The H5py module
Using HDF5 and Python: The H5py module
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Perl optimization tidbits
Perl optimization tidbitsPerl optimization tidbits
Perl optimization tidbits
 
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
Automated Multiplatform Compilation and Validation of a Collaborative Reposit...
 
R language tutorial
R language tutorialR language tutorial
R language tutorial
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval by
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
Introduction to R
Introduction to RIntroduction to R
Introduction to R
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)Natural Language Processing in R (rNLP)
Natural Language Processing in R (rNLP)
 
rips-hk-lenovo (1)
rips-hk-lenovo (1)rips-hk-lenovo (1)
rips-hk-lenovo (1)
 
Improving data compression ratio by the use of optimality of lzw & adaptive h...
Improving data compression ratio by the use of optimality of lzw & adaptive h...Improving data compression ratio by the use of optimality of lzw & adaptive h...
Improving data compression ratio by the use of optimality of lzw & adaptive h...
 
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
 Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ... Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
Large-scale Reasoning with a Complex Cultural Heritage Ontology (CIDOC CRM) ...
 

Similar to Flink Forward Berlin 2018: Suneel Marthi & Joey Frazee - "Streaming topic model training and inference with Apache Flink"

Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Chris Fregly
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD VivaAidan Hogan
 
SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)
SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)
SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)South Tyrol Free Software Conference
 
Streaming Topic Model Training and Inference with Apache Flink
Streaming Topic Model Training and Inference with Apache FlinkStreaming Topic Model Training and Inference with Apache Flink
Streaming Topic Model Training and Inference with Apache FlinkDataWorks Summit
 
A Deep Dive Implementing xAPI in Learning Games
A Deep Dive Implementing xAPI in Learning GamesA Deep Dive Implementing xAPI in Learning Games
A Deep Dive Implementing xAPI in Learning GamesGBLxAPI
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is DistributedAlluxio, Inc.
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Most Popular Hadoop Interview Questions and Answers
Most Popular Hadoop Interview Questions and AnswersMost Popular Hadoop Interview Questions and Answers
Most Popular Hadoop Interview Questions and AnswersSprintzeal
 
Unboxing ML Models... Plus CoreML!
Unboxing ML Models... Plus CoreML!Unboxing ML Models... Plus CoreML!
Unboxing ML Models... Plus CoreML!Ray Deck
 
Lpi Part 2 Basic Administration
Lpi Part 2 Basic AdministrationLpi Part 2 Basic Administration
Lpi Part 2 Basic AdministrationYemenLinux
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon PresentationGyula Fóra
 
FAIR Projector Builder
FAIR Projector BuilderFAIR Projector Builder
FAIR Projector BuilderMark Wilkinson
 
LangChain + Docugami Webinar
LangChain + Docugami WebinarLangChain + Docugami Webinar
LangChain + Docugami WebinarTaqi Jaffri
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingSaliya Ekanayake
 
Integration&SOA_v0.2
Integration&SOA_v0.2Integration&SOA_v0.2
Integration&SOA_v0.2Sergey Popov
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleMerelda
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
10 Ways To Improve Your Code( Neal Ford)
10  Ways To  Improve  Your  Code( Neal  Ford)10  Ways To  Improve  Your  Code( Neal  Ford)
10 Ways To Improve Your Code( Neal Ford)guestebde
 

Similar to Flink Forward Berlin 2018: Suneel Marthi & Joey Frazee - "Streaming topic model training and inference with Apache Flink" (20)

Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016Atlanta MLconf Machine Learning Conference 09-23-2016
Atlanta MLconf Machine Learning Conference 09-23-2016
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
Aidan's PhD Viva
Aidan's PhD VivaAidan's PhD Viva
Aidan's PhD Viva
 
SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)
SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)
SFSCON23 - Chris Mair - Self-hosted, Open Source Large Language Models (LLMs)
 
Streaming Topic Model Training and Inference with Apache Flink
Streaming Topic Model Training and Inference with Apache FlinkStreaming Topic Model Training and Inference with Apache Flink
Streaming Topic Model Training and Inference with Apache Flink
 
A Deep Dive Implementing xAPI in Learning Games
A Deep Dive Implementing xAPI in Learning GamesA Deep Dive Implementing xAPI in Learning Games
A Deep Dive Implementing xAPI in Learning Games
 
The Future of Computing is Distributed
The Future of Computing is DistributedThe Future of Computing is Distributed
The Future of Computing is Distributed
 
10 Ways To Improve Your Code
10 Ways To Improve Your Code10 Ways To Improve Your Code
10 Ways To Improve Your Code
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Most Popular Hadoop Interview Questions and Answers
Most Popular Hadoop Interview Questions and AnswersMost Popular Hadoop Interview Questions and Answers
Most Popular Hadoop Interview Questions and Answers
 
Unboxing ML Models... Plus CoreML!
Unboxing ML Models... Plus CoreML!Unboxing ML Models... Plus CoreML!
Unboxing ML Models... Plus CoreML!
 
Lpi Part 2 Basic Administration
Lpi Part 2 Basic AdministrationLpi Part 2 Basic Administration
Lpi Part 2 Basic Administration
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
FAIR Projector Builder
FAIR Projector BuilderFAIR Projector Builder
FAIR Projector Builder
 
LangChain + Docugami Webinar
LangChain + Docugami WebinarLangChain + Docugami Webinar
LangChain + Docugami Webinar
 
Towards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and BenchmarkingTowards a Systematic Study of Big Data Performance and Benchmarking
Towards a Systematic Study of Big Data Performance and Benchmarking
 
Integration&SOA_v0.2
Integration&SOA_v0.2Integration&SOA_v0.2
Integration&SOA_v0.2
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable Scale
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
10 Ways To Improve Your Code( Neal Ford)
10  Ways To  Improve  Your  Code( Neal  Ford)10  Ways To  Improve  Your  Code( Neal  Ford)
10 Ways To Improve Your Code( Neal Ford)
 

More from Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkFlink Forward
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraFlink Forward
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022Flink Forward
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink Forward
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 

More from Flink Forward (20)

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async Sink
 
Tuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptxTuning Apache Kafka Connectors for Flink.pptx
Tuning Apache Kafka Connectors for Flink.pptx
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Apache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native EraApache Flink in the Cloud-Native Era
Apache Flink in the Cloud-Native Era
 
Where is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in Flink
 
Using the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production DeploymentUsing the New Apache Flink Kubernetes Operator in a Production Deployment
Using the New Apache Flink Kubernetes Operator in a Production Deployment
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Flink SQL on Pulsar made easy
Flink SQL on Pulsar made easyFlink SQL on Pulsar made easy
Flink SQL on Pulsar made easy
 
Dynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data AlertsDynamic Rule-based Real-time Market Data Alerts
Dynamic Rule-based Real-time Market Data Alerts
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Processing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial ServicesProcessing Semantically-Ordered Streams in Financial Services
Processing Semantically-Ordered Streams in Financial Services
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 

Recently uploaded

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Flink Forward Berlin 2018: Suneel Marthi & Joey Frazee - "Streaming topic model training and inference with Apache Flink"

  • 1. Streaming Topic Model TrainingStreaming Topic Model Training and Inferenceand Inference Suneel MarthiSuneel Marthi Joey FrazeeJoey Frazee September 5, 2018September 5, 2018 Flink Forward, Berlin, GermanyFlink Forward, Berlin, Germany 1
  • 2. $WhoAreWe$WhoAreWe Joey FrazeeJoey Frazee  @jfrazee@jfrazee Member of Apache Software Foundation Committer on Apache NiFi, and PMC on Apache Streams Suneel MarthiSuneel Marthi  @suneelmarthi@suneelmarthi Member of Apache Software Foundation Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams 2
  • 3. AgendaAgenda Motivation for Topic Modeling Existing Approaches for Topic Modeling Topic Modeling on Streams 3
  • 4. Motivation for Topic ModelingMotivation for Topic Modeling 4
  • 5. Topic ModelsTopic Models Automatically discovering main topics in collections of documents Overall What are the main themes discussed on a mailing list or a discussion forum ? What are the most recurring topics discussed in research papers ? 5
  • 6. Topic Models (Contd)Topic Models (Contd) Per Document What are the main topics discussed in a certain (single) newspaper article ? What is the most distinctive topic that makes a certain research article more interesting than the others ? 6
  • 7. Uses of Topic ModelsUses of Topic Models Search Engines to organize text Annotating Documents according to these topics Uncovering hidden topical patterns across a document collection 7
  • 8. Common Approaches to TopicCommon Approaches to Topic ModelingModeling 8
  • 9. Latent Dirichlet Allocation (LDA) Topics are composed by probability distributions over words Documents are composed by probability distributions over Topics Batch Oriented approach 9
  • 10. N: vocab size M/D: # of documents alpha: prior on per-document topic dist. beta: prior on topic-word dist. theta: per-document topic dist. phi: per-topic word dist. z: topic for word w IntuitionIntuition: Documents are a mixture of topics, topics are a mixture of words. So documents: Documents are a mixture of topics, topics are a mixture of words. So documents can be generated by drawing a sample of topics, and then drawing a sample of words.can be generated by drawing a sample of topics, and then drawing a sample of words. LDA ModelLDA Model 10
  • 11. Latent Semantic Analysis (LSA) SVDed TF-IDF Document-Term Matrix Batch Oriented approach 11
  • 12. Drawbacks of Traditional Topic Modeling Approach Problem : labelling topics is often a manual task E.g. LDA topic:30% basket,15% ball,10% drunk,… can be tagged as basketball 12
  • 13. Common ApproachesCommon Approaches DeepDeep Learning (of course)Learning (of course) Learn to perform LDA: LDA supervised DNN training (inference speed up) https://arxiv.org/abs/1508.01011 Deep Belief Networks for Topic Modelling https://arxiv.org/abs/1501.04325 13
  • 14. What are Embeddings ?What are Embeddings ? Represent a document as a point in space, semantically similarRepresent a document as a point in space, semantically similar docs are close together when plotteddocs are close together when plotted Such a representation is learned via a (shallow) neural networkSuch a representation is learned via a (shallow) neural network algorithmalgorithm 14
  • 15. Embeddings for Topic ModelingEmbeddings for Topic Modeling LDA2Vec: Mixing LDA and word2vec word embeddings https://arxiv.org/abs/1605.02019 Navigating embeddings: geometric navigation in word and doc embeddings space looking for topics 15
  • 16. Static vs Dynamic CorporaStatic vs Dynamic Corpora Learning Topic Models over a fixed set of Documents Fetch the Corpus Train and Fit a topic model Extract topics ...other downstream tasks 16
  • 17. Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd) What to do with Corpus updates? Document collections are not static in real world There may be no document collections to begin with Memory limits with large corpora Reprocess entire corpora when new documents are added 17
  • 18. Static vs Dynamic Corpora (Contd)Static vs Dynamic Corpora (Contd) Streams, Streams, Streams Twitter stream Reddit posts Papers on ResearchGate and Arxiv .... 18
  • 19. Questions for AudienceQuestions for Audience Who’s doing inference or scoring on streaming data? Who’s doing online learning and optimization on streaming data? (2) is hard, sometimes because the algorithms aren’t(2) is hard, sometimes because the algorithms aren’t there.there. But, also because people don’t know theBut, also because people don’t know the algorithms are there.algorithms are there. 19
  • 20. Topic Modeling on StreamsTopic Modeling on Streams 20
  • 21. 2 Streaming Approaches2 Streaming Approaches Learning Topics from Jira Issues Online LDA 21
  • 22. Learning Topics on Jira IssuesLearning Topics on Jira Issues 22
  • 24. 23
  • 25. FLINK-9963: Add a single value table formatFLINK-9963: Add a single value table format factoryfactory 24
  • 26. Topics extracted for Flink-9963Topics extracted for Flink-9963 format factory table schema table source sink factory table sink 25
  • 27. FLINK-8286: Fix Flink-Yarn-Kerberos integrationFLINK-8286: Fix Flink-Yarn-Kerberos integration for FLIP-6for FLIP-6
  • 28. 26
  • 29. Topics extracted for Flink-8286Topics extracted for Flink-8286 yarn kerberos integration kerberos integrations yarn kerberos 27
  • 30. Bayesian InferenceBayesian Inference GoalGoal: Calculate the posterior distribution of a: Calculate the posterior distribution of a probabilistic model given dataprobabilistic model given data Gibbs sampling/MCMC: Repeatedly sample from conditional distribution(s) to approximate the posterior of the joint distribution; typically a batch process Variational Bayes: Optimize the parameters of a function that approximates joint distribution 28
  • 31. Variational BayesVariational Bayes Expectation Maximization-like optimization Faster and more accurate Can often be computed online (on windows over streams) 29
  • 32. Online LDA (1/2)Online LDA (1/2) Variational Bayes for LDA Batch or online Iterate between optimizing per-word topics and per-document topics (“E” step) and topic-word distributions (“M” step) Default topic model optimization in Gensim, scikit-learn, Apache Spark MLlib -- not because it’s amenable to streaming per se, but more because it’s memory efficient and can be applied to large corpora. Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010.Hoffman, Blei & Bach. Online Learning for Latent Dirichlet Allocation. Proceedings of Neural Information Processing Systems. 2010. 30
  • 33. job-constant K: number of topics alpha: topic-document dist. prior eta: topic-word dist. prior tau0: learning rate kappa: decay rate (mini)batch-local gamma: theta prior theta: param. for per-document topic dist phi: param. for per-topic word dist. rho: update weight job-global lambda: parameter for topic-word dist. t: documents/batches seen (index) GoalGoal: Optimize the topic-word dist. parameter: Optimize the topic-word dist. parameter lambda; it tells us what words belong to the topics.lambda; it tells us what words belong to the topics. [Image source: Hoffman, Blei & Bach, 2010.][Image source: Hoffman, Blei & Bach, 2010.] Job-global parameters can be tracked in a stateJob-global parameters can be tracked in a state store. Batch-local parameters can be discarded.store. Batch-local parameters can be discarded. Online LDA (2/2)Online LDA (2/2) 31
  • 34. job-constant K: number of topics alpha: topic-document dist. prior eta: topic-word dist. prior tau0: learning rate kappa: decay rate Set job parameters globally withSet job parameters globally with setGlobalJobParameters() on executionsetGlobalJobParameters() on execution environment, or pass as arguments toenvironment, or pass as arguments to operatorsoperators (mini)batch-local gamma: theta prior theta: param. for per-document topic dist phi: param. for per-topic word dist. rho: update weight Operator/task-internal and don't need to beOperator/task-internal and don't need to be accessible by other tasksaccessible by other tasks job-global lambda: parameter for topic-word dist. t: documents/batches seen (index) Several options: External key-value store Accumulators: lambda update as increment, accumulator + broadcast variables Broadcast state And, (mini)batches selected by Window sizeAnd, (mini)batches selected by Window size Online LDA on FlinkOnline LDA on Flink 32
  • 36. For many tasks, the default is batch training/optimization Often, even if good online training and optimization algorithms exist (why?) So if your algorithms only require 1-step of parameter state, why not just do streaming? You can get instant, always up-to-date results by putting the effort into using a different class of optimization algorithms an using a stream processing engine, so just do that 34
  • 37. LinksLinks http://papers.nips.cc/paper/3902-online-learning- for-latentdirichlet-allocation Learning from LDA using Deep Neural Networks - https://arxiv.org/pdf/1508.01011.pdf Deep Belief Nets for Topic Modeling - https://arxiv.org/pdf/1501.04325.pdf Topic Modeling on Jira issues: https://github.com/tteofili/jtm 35
  • 38. CreditsCredits Tommaso Teofili, Simone Tripodi (Adobe - Rome) Joern Kottmann, Bruno Kinoshita (Apache OpenNLP) Fabian Hueske (Apache Flink, Data Artisans) 36