This is an introduction to topic modeling, covering TF-IDF, LSA, pLSA, LDA, EM, and some related material. There are certainly some mistakes; corrections are welcome. Thank you.
2. Outline
Basic Concepts
Application and Background
Famous Researchers
Language Model
Vector Space Model (VSM)
Term Frequency-Inverse Document Frequency (TF-IDF)
Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis (LSA)
Probabilistic Latent Semantic Indexing (pLSI/pLSA)
Expectation-Maximization Algorithm (EM) & Maximum-Likelihood Estimation (MLE)
6/11/2014 2 Middleware, CCNT, ZJU, Yueshen Xu
3. Outline
Latent Dirichlet Allocation (LDA)
Conjugate Prior
Poisson Distribution
Variational Distribution and Variational Inference (VD & VI)
Markov Chain Monte Carlo (MCMC)
Metropolis-Hastings Sampling (MH)
Gibbs Sampling and GS for LDA
Bayesian Theory vs. Probability Theory
4. Concepts
Latent Semantic Analysis
Topic Model
Text Mining
Natural Language Processing
Computational Linguistics
Information Retrieval
Dimension Reduction
Expectation-Maximization(EM)
[Figure: diagram placing LSA/Topic Model among Information Retrieval, Computational Linguistics, Natural Language Processing, Text Mining, Data Mining, Machine Learning, Machine Translation, Dimension Reduction, and EM]
Aim: find the topic that a word or a document belongs to
Latent Factor Model (LFM)
5. Application
LFM has become a fundamental technique in modern search engines, recommender systems, tag extraction, blog clustering, Twitter topic mining, news (text) summarization, etc.
Search Engine
PageRank: how important is this web page?
LFM: how relevant is this web page?
LFM: how relevant is one document to the user's query?
Recommender System
Opinion Extraction
Spam Detection
Tag Extraction
Text Summarization
Abstract Generation
Twitter Topic Mining
Text: Steve Jobs left us about two years ago … Apple's price will fall …
6. Famous Researchers
David Blei, Princeton: LDA
Chengxiang Zhai, UIUC: Presidential Early Career Award
W. Bruce Croft, UMass Amherst (UMA): Language Model
Bing Liu, UIC: Opinion Mining
John D. Lafferty, CMU: CRF & IBM
Thomas Hofmann, Brown: pLSA
Andrew McCallum, UMass Amherst (UMA): CRF & IBM
Susan Dumais, Microsoft: LSI
7. Language Model
Unigram Language Model == Zero-order Markov Chain
Bigram Language Model == First-order Markov Chain
N-gram Language Model == (N-1)-order Markov Chain
Mixture-unigram Language Model
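Unigram: $p(\mathbf{w} \mid M) = \prod_{w_i \in s} p(w_i \mid M)$
Bag of Words (BoW): no order, no grammar, only multiplicity
Bigram: $p(\mathbf{w} \mid M) = \prod_{w_i \in s} p(w_i \mid w_{i-1}, M)$
[Figure: plate diagrams of the unigram model (w, with plates N and M) and the mixture of unigrams (z → w, with plates N and M)]
Mixture of unigrams: $p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$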
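The unigram and bigram factorizations can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the function names, and the choice of plain maximum-likelihood counts (no smoothing) are assumptions, not part of the slides.

```python
from collections import Counter

def unigram_prob(sentence, corpus):
    """prod_i p(w_i | M), with p(w | M) estimated by relative frequency."""
    counts = Counter(corpus)
    total = len(corpus)
    prob = 1.0
    for w in sentence:
        prob *= counts[w] / total
    return prob

def bigram_prob(sentence, corpus):
    """prod_i p(w_i | w_{i-1}, M); the first word falls back to its unigram probability."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    prob = unigrams[sentence[0]] / len(corpus)
    for prev, cur in zip(sentence, sentence[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

# Hypothetical toy corpus, already tokenized.
corpus = "the cat sat on the mat the cat ran".split()

p_uni = unigram_prob(["the", "cat"], corpus)          # 3/9 * 2/9
p_uni_swapped = unigram_prob(["cat", "the"], corpus)  # same value: order is ignored
p_bi = bigram_prob(["the", "cat"], corpus)            # 3/9 * 2/3
p_bi_swapped = bigram_prob(["cat", "the"], corpus)    # "cat the" never occurs
```

The unigram score is unchanged when the words are swapped, which is the bag-of-words property; the bigram score is not.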
8. Vector Space Model
A document is represented as a vector of identifiers
Identifier
Boolean: 0, 1
Term Count: how many times a term occurs
Term Frequency: how frequent a term is in this document
TF-IDF: how important a term is in the corpus (most used)
Relevance Ranking
First used in SMART (Gerard Salton, Cornell)
$d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$
$q = (w_{1q}, w_{2q}, \ldots, w_{tq})$
$\cos\theta = \dfrac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}$
Gerard Salton Award (SIGIR)
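Relevance ranking by the cosine between a document vector and the query vector might look like the following sketch. The document names, the term-count vectors, and the query are made-up illustrative data.

```python
import math

def cosine(d, q):
    """cos(theta) = (d . q) / (||d|| * ||q||); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical term-count vectors over a 4-term vocabulary.
docs = {"d1": [2, 1, 0, 0], "d2": [0, 0, 3, 1], "d3": [1, 1, 1, 0]}
query = [1, 1, 0, 0]

# Rank documents from most to least similar to the query.
ranking = sorted(docs, key=lambda name: cosine(docs[name], query), reverse=True)
```

Because the cosine normalizes by vector length, a long document is not favored merely for containing more words.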
9. TF-IDF
Mixture language model
Linear combination of certain distributions (e.g., Gaussian)
Better performance
TF: Term Frequency
IDF: Inverse Document Frequency
TF-IDF
$tf_{ij} = \dfrac{n_{ij}}{\sum_k n_{kj}}$, for term i and document j, where $n_{ij}$ is the count of i in j: how important the term is in this document
$idf_i = \log\left(\dfrac{N}{1 + |\{d \in D : t_i \in d\}|}\right)$, with N documents in the corpus: how important the term is in this corpus
$\text{tf-idf}(t_i, d_j, D) = tf_{ij} \times idf_i$
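A direct transcription of the three formulas above into Python could look like this. The tiny corpus and the function name are illustrative assumptions; the smoothing term 1 in the denominator follows the IDF formula on the slide.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document, using
    tf_ij = n_ij / sum_k n_kj and idf_i = log(N / (1 + df_i))."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            term: (c / total) * math.log(n_docs / (1 + df[term]))
            for term, c in counts.items()
        })
    return weights

# Hypothetical tokenized corpus of N = 4 documents.
corpus = [
    "topic model topic".split(),
    "language model".split(),
    "vector space model".split(),
    "markov chain".split(),
]
weights = tf_idf(corpus)
```

With this smoothed IDF, a term that occurs in every document gets a slightly negative weight, since N / (1 + N) < 1; some variants clip it at zero.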
10. Latent Semantic Indexing
Challenge
Compare documents in the same concept space
Compare documents across languages
Synonymy, e.g., buy - purchase, user - consumer
Polysemy, e.g., book - book, draw - draw
Key Idea
Dimensionality reduction of word-document co-occurrence matrix
Construction of latent semantic space
Defects of VSM
[Figure: VSM maps Word → Document; LSI maps Word → Concept → Document, where the concept is also called an aspect, topic, or latent factor]
11. Singular Value Decomposition
LSI ~= SVD
U, V: orthogonal matrices
Σ: the diagonal matrix of the singular values of N
$N = U \Sigma V^{T}$
Full SVD: N (t × d, terms × documents) = U (t × m) · Σ (m × m) · $V^{T}$ (m × d)
Entries of N: count, frequency, or TF-IDF
Truncated SVD with k < m (or k << m): N ≈ U (t × k) · Σ (k × k) · $V^{T}$ (k × d)
Word: exchangeability
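The rank-k truncation can be sketched with NumPy's `numpy.linalg.svd`. The 5 × 4 term-document matrix below is invented for illustration, and k = 2 is an arbitrary choice.

```python
import numpy as np

# Hypothetical term-document count matrix N: t = 5 terms (rows), d = 4 documents (columns).
N = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
])

# Thin SVD: U is t x m, s holds the m singular values (descending), Vt is m x d.
U, s, Vt = np.linalg.svd(N, full_matrices=False)

k = 2  # number of latent concepts to keep (assumed)
N_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of N
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T    # each document as a k-dim concept vector

# Frobenius error of the truncation equals the norm of the discarded singular values.
error = np.linalg.norm(N - N_k)
```

Documents can then be compared in the k-dimensional concept space (`doc_coords`) instead of the original term space.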
12. Singular Value Decomposition
The K-largest singular values
Distinguish the variance between words and documents to the greatest extent
Discarding the lowest dimensions
Reduce noise
Fill the matrix
Predict & Lower computational complexity
Enlarge the distinctiveness
Decomposition
Concept, semantic, topic (aspect)
(Probabilistic) Matrix Factorization / Factorization Model: analytic solution of SVD
Unsupervised learning
13. Probabilistic Latent Semantic Indexing
pLSI Model
[Figure: pLSI model drawn as three layers: documents d_1 … d_M, latent topics z_1 … z_K, and words w_1 … w_N; the edges are labeled p(d), p(z|d), and p(w|z)]
Assumption
Pairs (d, w) are assumed to be generated independently
Conditioned on z, w is generated independently of d
Words in a document are exchangeable
Documents are exchangeable
Latent topics z are independent
Generative Process/Model
$p(d, w) = p(d)\, p(w \mid d) = p(d) \sum_{z \in Z} p(w, z \mid d) = p(d) \sum_{z \in Z} p(z \mid d)\, p(w \mid z)$
$p(z \mid d)$ and $p(w \mid z)$ are both multinomial distributions
One layer of a "Deep Neural Network"
Global / Local
14. Probabilistic Latent Semantic Indexing
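[Figure: plate diagram of the first formulation, d → z → w, with plates N and M]
$p(w \mid d) = \sum_{z \in Z} p(z \mid d)\, p(w \mid z)$
[Figure: plate diagram of the second formulation, with z generating both d and w, with plates N and M]
$p(w, d) = \sum_{z \in Z} p(w, d, z) = \sum_{z \in Z} p(w \mid d, z)\, p(d, z) = \sum_{z \in Z} p(w \mid z)\, p(d \mid z)\, p(z)$
These are two ways to formulate pLSA; they are equivalent (by Bayes' rule) but lead to two different inference processes
Probabilistic graphical model: directed acyclic graph (DAG)
d: exchangeability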
15. Expectation-Maximization
EM is a general algorithm for maximum-likelihood estimation (MLE) when the data are "incomplete" or contain latent variables: pLSA, GMM, HMM, … (used across domains)
Derivation
$\theta$: the parameter to be estimated; $\theta^{0}$: initialized randomly; $\theta^{n}$: the current value; $\theta^{n+1}$: the next value
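Objective: $\theta^{n+1} = \arg\max_{\theta} \left[ L(\theta) - L(\theta^{n}) \right]$
$L(\theta) = \log p(X \mid \theta)$, $L_c(\theta) = \log p(X, H \mid \theta)$, where H is the latent variable
$L_c(\theta) = \log p(X, H \mid \theta) = \log p(X \mid \theta) + \log p(H \mid X, \theta) = L(\theta) + \log p(H \mid X, \theta)$
$L(\theta) - L(\theta^{n}) = L_c(\theta) - L_c(\theta^{n}) + \log \dfrac{p(H \mid X, \theta^{n})}{p(H \mid X, \theta)}$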
16. Expectation-Maximization
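Taking the expectation over $p(H \mid X, \theta^{n})$:
$L(\theta) - L(\theta^{n}) = \sum_H L_c(\theta)\, p(H \mid X, \theta^{n}) - \sum_H L_c(\theta^{n})\, p(H \mid X, \theta^{n}) + \sum_H p(H \mid X, \theta^{n}) \log \dfrac{p(H \mid X, \theta^{n})}{p(H \mid X, \theta)}$
The last term is a K-L divergence (Kullback-Leibler divergence, or relative entropy), which is non-negative
Lower bound: $L(\theta) \ge \sum_H L_c(\theta)\, p(H \mid X, \theta^{n}) + L(\theta^{n}) - \sum_H L_c(\theta^{n})\, p(H \mid X, \theta^{n})$
Q-function: $Q(\theta; \theta^{n}) = E_{p(H \mid X, \theta^{n})}[L_c(\theta)] = \sum_H L_c(\theta)\, p(H \mid X, \theta^{n})$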
E-step (expectation): Compute Q;
M-step(maximization): Re-estimate θ by maximizing Q
Convergence
How is EM used in pLSA?
17. EM in pLSA
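Notation: documents $d_i$ ($i = 1, \ldots, N$), words $w_j$ ($j = 1, \ldots, M$), topics $z_k$ ($k = 1, \ldots, K$); $n(d_i, w_j)$ is the count of $w_j$ in $d_i$
$Q(\theta; \theta^{n}) = E_{p(H \mid X, \theta^{n})}[L_c(\theta)] = \sum_H L_c(\theta)\, p(H \mid X, \theta^{n})$
$= \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} p(z_k \mid d_i, w_j) \log\big( p(w_j \mid z_k)\, p(z_k \mid d_i) \big)$
Posterior: $p(z_k \mid d_i, w_j)$
Random values at initialization: $p(w_j \mid z_k)$ and $p(z_k \mid d_i)$
Likelihood function: the weight $n(d, w)$ arises because $\log p(w \mid d)^{n(d,w)} = n(d, w) \log p(w \mid d)$
Constraints:
1. $\sum_{j=1}^{M} p(w_j \mid z_k) = 1$
2. $\sum_{k=1}^{K} p(z_k \mid d_i) = 1$
Lagrange multipliers: $H = E[L_c] + \sum_{k=1}^{K} \tau_k \Big( 1 - \sum_{j=1}^{M} p(w_j \mid z_k) \Big) + \sum_{i=1}^{N} \rho_i \Big( 1 - \sum_{k=1}^{K} p(z_k \mid d_i) \Big)$
Set the partial derivatives with respect to the independent variables $p(w_j \mid z_k)$ and $p(z_k \mid d_i)$ to 0
M-step:
$p(w_j \mid z_k) = \dfrac{\sum_{i=1}^{N} n(d_i, w_j)\, p(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m)\, p(z_k \mid d_i, w_m)}$
$p(z_k \mid d_i) = \dfrac{\sum_{j=1}^{M} n(d_i, w_j)\, p(z_k \mid d_i, w_j)}{n(d_i)}$
E-step:
$p(z_k \mid d_i, w_j) = \dfrac{p(w_j \mid z_k)\, p(z_k \mid d_i)\, p(d_i)}{\sum_{l=1}^{K} p(w_j \mid z_l)\, p(z_l \mid d_i)\, p(d_i)} = \dfrac{p(w_j \mid z_k)\, p(z_k \mid d_i)}{\sum_{l=1}^{K} p(w_j \mid z_l)\, p(z_l \mid d_i)}$
Tools used in the derivation: associative law and distributive law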
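The E-step and M-step updates for pLSA can be written compactly with NumPy arrays. The sketch below is an assumption-laden illustration, not the author's implementation: the function name `plsa_em`, the random initialization, the fixed iteration count, and the toy count matrix are all made up for the example.

```python
import numpy as np

def plsa_em(n, K, iters=50, seed=0):
    """Fit pLSA by EM. n is an (N docs x M words) count matrix with no empty rows.
    Returns p(w|z) of shape (K, M), p(z|d) of shape (N, K), and the
    log-likelihood sum_ij n(d_i, w_j) log p(w_j | d_i) before each update."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = n.shape
    # Random initialization, normalized so each distribution sums to 1.
    p_w_z = rng.random((K, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, K))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    history = []
    for _ in range(iters):
        # E-step: p(z_k | d_i, w_j) is proportional to p(w_j | z_k) p(z_k | d_i).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]    # shape (N, M, K)
        p_w_d = joint.sum(axis=2)                          # p(w_j | d_i), shape (N, M)
        post = joint / np.maximum(p_w_d[:, :, None], 1e-300)
        history.append(float((n * np.log(np.maximum(p_w_d, 1e-300))).sum()))
        # M-step: re-estimate both multinomials from posterior-weighted counts.
        weighted = n[:, :, None] * post                    # n(d_i, w_j) p(z_k | d_i, w_j)
        p_w_z = weighted.sum(axis=0).T                     # sum over documents -> (K, M)
        p_w_z = p_w_z / p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=1) / n.sum(axis=1, keepdims=True)  # divide by n(d_i)
    return p_w_z, p_z_d, history

# Hypothetical counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
n = np.array([[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 4, 3], [0, 0, 3, 4]], dtype=float)
p_w_z, p_z_d, history = plsa_em(n, K=2)
```

The recorded log-likelihood should never decrease from one iteration to the next, which is the usual sanity check for an EM implementation.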
18. Bayesian Theory vs. Probability Theory
Estimate $\theta$ through the posterior vs. estimate $\theta$ through maximization of the likelihood
Bayesian theory: prior vs. probability theory: statistics
When the number of samples → ∞, Bayesian theory == probability theory
Parameter Estimation
$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$. How to choose $p(\theta)$? A prior conjugate to the likelihood is helpful, but its use is limited. Otherwise?
Non-parametric Bayesian methods (complicated)
Kernel methods: I know only a little...
VSM → CF → MF → pLSA → LDA → Non-parametric Bayesian → Deep Learning
19. Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA)
David M. Blei, Andrew Y. Ng, Michael I. Jordan
Journal of Machine Learning Research,2003, cited > 3000
Hierarchical Bayesian model; Bayesian pLSI
[Figure: plate diagram of LDA with nodes α, θ, z, w, β and plates N and M; each plate is labeled with its number of iterations]
Generative process of a document d in a corpus according to LDA
Choose $N \sim \text{Poisson}(\xi)$; why?
For each document $d = \{w_1, w_2, \ldots, w_N\}$
Choose $\theta \sim \text{Dir}(\alpha)$; why?
For each of the N words $w_n$ in d:
a) Choose a topic $z_n \sim \text{Multinomial}(\theta)$; why?
b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on $z_n$; why?
ACM-Infosys Award
20. Latent Dirichlet Allocation
LDA(Cont.)
[Figure: plate diagram of LDA with nodes α, θ, z, w, φ, β and plates N, M, K]
Generative process of a document d in LDA
Choose $N \sim \text{Poisson}(\xi)$; not important
For each document $d = \{w_1, w_2, \ldots, w_N\}$
Choose $\theta \sim \text{Dir}(\alpha)$; $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, $|\theta| = K$, K is fixed, $\sum_{k=1}^{K} \theta_k = 1$; Dir is the conjugate prior of Multi
For each of the N words $w_n$ in d:
a) Choose a topic $z_n \sim \text{Multinomial}(\theta)$
b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on $z_n$
One word, one topic; one document, multiple topics
$\theta = (\theta_1, \theta_2, \ldots, \theta_K)$
$z = (z_1, z_2, \ldots, z_K)$
For each word $w_n$ there is a $z_n$
pLSA: the number of p(z|d) parameters is linear in the number of documents → overfitting
Regularization
M + K Dirichlet-Multinomial pairs
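The generative process of LDA can be simulated directly. In this sketch the hyper-parameter values, the topic-word matrix `beta`, and the function name are illustrative assumptions; the sampling calls are NumPy's `Generator.poisson`, `Generator.dirichlet`, and `Generator.choice`.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document following the LDA generative process.
    alpha: Dirichlet parameter, length K; beta: K x V matrix whose row k is p(w | z = k);
    xi: Poisson mean for the document length."""
    n_words = rng.poisson(xi)                 # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)              # theta ~ Dir(alpha): topic mixture of this document
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # z_n ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # w_n ~ p(w | z_n, beta)
        topics.append(int(z))
        words.append(int(w))
    return theta, topics, words

# Hypothetical setting: K = 3 topics over a vocabulary of V = 4 word ids.
alpha = np.array([0.5, 0.5, 0.5])
beta = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])
rng = np.random.default_rng(0)
theta, topics, words = generate_document(alpha, beta, xi=20, rng=rng)
```

Each word gets exactly one topic assignment, while the document as a whole mixes several topics through theta.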
22. Conjugate Prior & Distributions
Conjugate Prior:
If the posterior p(θ|x) is in the same family as the prior p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior of the likelihood p(x|θ): p(θ|x) ∝ p(x|θ)p(θ)
Distributions
Binomial Distribution ←→ Beta Distribution
Multinomial Distribution ←→ Dirichlet Distribution
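Binomial & Beta Distribution
Binomial: $\text{Bin}(m \mid N, \theta) = C(m, N)\, \theta^{m} (1-\theta)^{N-m}$ (likelihood), where $C(m, N) = \dfrac{N!}{(N-m)!\, m!}$
Beta: $\text{Beta}(\theta \mid a, b) = \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}$, where $\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\, dt$
Why do the prior and posterior need to be conjugate distributions?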
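23. Conjugate Prior & Distributions
$p(\theta \mid m, l, a, b) \propto C(m, m+l)\, \theta^{m} (1-\theta)^{l} \cdot \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \theta^{a-1} (1-\theta)^{b-1}$, where $l = N - m$
$p(\theta \mid m, l, a, b) = \dfrac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\, \theta^{m+a-1} (1-\theta)^{l+b-1}$
Beta distribution!
Parameter Estimation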
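The Beta-Binomial conjugacy can be checked numerically: the product of the binomial likelihood and a Beta(a, b) prior should be a constant multiple of the Beta(a + m, b + l) density. The prior parameters and the observed counts below are arbitrary example values.

```python
import math

def beta_pdf(theta, a, b):
    """Beta(theta | a, b) = Gamma(a+b) / (Gamma(a) Gamma(b)) * theta^(a-1) * (1-theta)^(b-1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)

def beta_posterior(a, b, m, l):
    """Conjugate update after m successes and l failures: Beta(a + m, b + l)."""
    return a + m, b + l

def unnormalized_posterior(theta, a, b, m, l):
    """Likelihood kernel theta^m (1-theta)^l times the prior density."""
    return theta ** m * (1 - theta) ** l * beta_pdf(theta, a, b)

a, b = 2.0, 2.0   # assumed prior
m, l = 7, 3       # assumed data: 7 successes, 3 failures
post_a, post_b = beta_posterior(a, b, m, l)

# If the posterior really is Beta(post_a, post_b), this ratio is the same for every theta.
ratios = [unnormalized_posterior(t, a, b, m, l) / beta_pdf(t, post_a, post_b)
          for t in (0.2, 0.5, 0.8)]
posterior_mean = post_a / (post_a + post_b)
```

The posterior mean 9/14 sits between the prior mean 0.5 and the maximum-likelihood estimate 0.7, which shows how the prior acts on the estimate.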
Multinomial & Dirichlet Distribution
$\mathbf{x}$ is a multivariate variable, e.g., $\mathbf{x} = (0,0,1,0,0,0)$: the event $x_3$ happens
The probability distribution of $\mathbf{x}$ for a single event: $p(\mathbf{x} \mid \theta) = \prod_{k=1}^{K} \theta_k^{x_k}$, $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$
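24. Conjugate Prior & Distributions
Multinomial & Dirichlet Distribution (Cont.)
$\text{Mult}(m_1, m_2, \ldots, m_K \mid \boldsymbol{\theta}, N) = \dfrac{N!}{m_1!\, m_2! \cdots m_K!} \prod_{k=1}^{K} \theta_k^{m_k} = C_N^{m_1}\, C_{N-m_1}^{m_2}\, C_{N-m_1-m_2}^{m_3} \cdots C_{N-\sum_{k=1}^{K-1} m_k}^{m_K} \prod_{k=1}^{K} \theta_k^{m_k}$: the likelihood function of $\theta$
Mult: the exact probability distribution of $p(z_k \mid d_j)$ and $p(w_j \mid z_k)$
In Bayesian theory, we need to find a conjugate prior of $\theta$ for Mult, where $0 < \theta_k < 1$ and $\sum_{k=1}^{K} \theta_k = 1$
Dirichlet Distribution
$\text{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \dfrac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$, where $\boldsymbol{\alpha}$ is a vector and $\alpha_0 = \sum_{k=1}^{K} \alpha_k$
Hyper-parameter: a parameter of a probability density function (pdf)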
26. Poisson Distribution
Why Poisson distribution?
The number of births per hour during a given day; the number of
particles emitted by a radioactive source in a given time; the number
of cases of a disease in different towns
For Bin(n, p), when n is large and p is small: $p(X = k) \approx \dfrac{\xi^{k} e^{-\xi}}{k!}$, $\xi \approx np$
$\text{Gamma}(x \mid \alpha) = \dfrac{x^{\alpha-1} e^{-x}}{\Gamma(\alpha)}$
$\text{Gamma}(x \mid \alpha = k+1) = \dfrac{x^{k} e^{-x}}{k!}$ (since $\Gamma(k+1) = k!$)
(Poisson is discrete; Gamma is continuous)
Poisson Distribution
$p(k \mid \xi) = \dfrac{\xi^{k} e^{-\xi}}{k!}$
Many experimental situations occur in which we observe the counts of events within a set unit of time, area, volume, length, etc.
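The Poisson approximation to Bin(n, p) for large n and small p is easy to check numerically. The values n = 1000 and p = 0.003 are arbitrary choices for this sketch.

```python
import math

def binomial_pmf(k, n, p):
    """Exact Bin(n, p) probability of k successes."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, xi):
    """p(k | xi) = xi^k e^(-xi) / k!"""
    return xi ** k * math.exp(-xi) / math.factorial(k)

n, p = 1000, 0.003   # large n, small p (assumed example values)
xi = n * p           # Poisson mean, xi = np

# Largest pointwise difference between the two distributions over k = 0..19.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, xi)) for k in range(20))
poisson_mass = sum(poisson_pmf(k, xi) for k in range(20))
```

The gap shrinks further as n grows with np held fixed, which is why word counts in long documents are often modeled as Poisson.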
28. Solution for LDA
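The most significant generative model in the machine learning community over the past ten years
$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \Big( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \Big) d\theta$
Rewritten in terms of the model parameters:
$p(\mathbf{w} \mid \alpha, \beta) = \dfrac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \int \Big( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \Big) \Big( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \Big) d\theta$
$\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_K)$; $\beta \in R^{K \times V}$: what we need to solve for
Variational inference: deterministic inference. Why variational inference? It simplifies the dependency structure
Gibbs sampling: stochastic inference. Why sampling? It approximates the statistical properties of the population with those of the samples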
29. Variational Inference
Variational Inference (Inference through a variational
distribution), VI
VI aims to use an approximating distribution that has a simpler
dependency structure than that of the exact posterior distribution
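$P(H \mid D) \approx Q(H)$, where $P(H \mid D)$ is the true posterior distribution and $Q(H)$ is the variational distribution
Dissimilarity between P and Q? Kullback-Leibler divergence:
$KL(Q \,\|\, P) = \int Q(H) \log \dfrac{Q(H)\, P(D)}{P(H, D)}\, dH = \int Q(H) \log \dfrac{Q(H)}{P(H, D)}\, dH + \log P(D)$
$L \overset{def}{=} \int Q(H) \log P(H, D)\, dH - \int Q(H) \log Q(H)\, dH = \langle \log P(H, D) \rangle_{Q(H)} + \mathbb{H}[Q]$, where $\mathbb{H}[Q]$ is the entropy of Q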
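Combining the two expressions gives log P(D) = L + KL(Q || P), so L is a lower bound on the log evidence and maximizing L minimizes the divergence. The sketch below checks this on a discrete hidden variable with three states; the joint probabilities and the trial distribution Q are made-up numbers.

```python
import math

# Hypothetical joint P(H = h, D = d) for one fixed observation d and three hidden states.
joint = [0.10, 0.25, 0.05]
evidence = sum(joint)                       # P(D)
posterior = [j / evidence for j in joint]   # true posterior P(H | D)

def elbo(q):
    """L = sum_h Q(h) log P(h, D) - sum_h Q(h) log Q(h)."""
    return sum(qh * (math.log(jh) - math.log(qh)) for qh, jh in zip(q, joint))

def kl(q, p):
    """KL(Q || P) = sum_h Q(h) log(Q(h) / P(h))."""
    return sum(qh * math.log(qh / ph) for qh, ph in zip(q, p))

q = [0.5, 0.3, 0.2]   # an arbitrary variational distribution
gap = math.log(evidence) - elbo(q)   # should equal KL(Q || posterior)
```

When Q equals the true posterior the divergence vanishes and L reaches log P(D); any simpler Q leaves a positive gap.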
33. Variational Inference
See the original paper for more details.
Variational EM Algorithm
Aim: $(\alpha^{*}, \beta^{*}) = \arg\max \prod_{d=1}^{M} p(\mathbf{w}_d \mid \alpha, \beta)$
Initialize $\alpha, \beta$
E-step: with $\alpha, \beta$ fixed, approximate the likelihood through variational inference
M-step: maximize the approximated likelihood with respect to $\alpha, \beta$
Repeat until convergence
36. Markov Chain Monte Carlo
MCMC Sampling
We should construct the relationship between $\pi(x)$ and the MC transition process → detailed balance condition
In a common MC with transition matrix P, if a distribution $\pi(x)$ satisfies $\pi(i) P_{ij} = \pi(j) P_{ji}$ for all i, j, then $\pi(x)$ is the stationary distribution of this MC
Proof: $\sum_{i=1}^{\infty} \pi(i) P_{ij} = \sum_{i=1}^{\infty} \pi(j) P_{ji} = \pi(j) \Rightarrow \pi P = \pi$, so $\pi$ is the solution of the equation $\pi P = \pi$. Done
For a common MC with transition probability q(i, j) (also written q(j|i) or q(i → j)) and any probability distribution p(x) (the dimension of x is arbitrary) → transformation:
$p(i)\, q(i,j)\, \alpha(i,j) = p(j)\, q(j,i)\, \alpha(j,i)$, with $Q'(i,j) = q(i,j)\, \alpha(i,j)$ and $Q'(j,i) = q(j,i)\, \alpha(j,i)$
$\alpha(i,j) = p(j)\, q(j,i)$, $\alpha(j,i) = p(i)\, q(i,j)$
Detailed balance is a sufficient condition for stationarity, not a necessary one
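The transformation can be verified on a small chain: starting from an arbitrary proposal matrix q and a target p, the modified transition Q'(i, j) = q(i, j) α(i, j) with α(i, j) = p(j) q(j, i) should satisfy detailed balance and leave p stationary. The target and the proposal matrix below are invented example values, and the helper name is hypothetical.

```python
def modified_transition(p, q):
    """Build Q' with Q'(i, j) = q(i, j) * alpha(i, j), alpha(i, j) = p(j) * q(j, i), for i != j.
    Rejected moves stay in place, so the diagonal absorbs the remaining probability."""
    n = len(p)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = q[i][j] * p[j] * q[j][i]
        P[i][i] = 1.0 - sum(P[i])  # the diagonal entry is still 0.0 when this sum is taken
    return P

# Hypothetical target distribution and row-stochastic proposal matrix.
p = [0.2, 0.5, 0.3]
q = [
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.5, 0.3, 0.2],
]
P = modified_transition(p, q)

# Detailed balance: p(i) Q'(i, j) == p(j) Q'(j, i); stationarity: p P == p.
balanced = all(abs(p[i] * P[i][j] - p[j] * P[j][i]) < 1e-12
               for i in range(3) for j in range(3))
p_next = [sum(p[i] * P[i][j] for i in range(3)) for j in range(3)]
```

The acceptance probabilities here are small, so the chain mostly stays in place; Metropolis-Hastings rescales them to speed up mixing.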
37. Markov Chain Monte Carlo
MCMC Sampling(cont.)
Step 1: Initialize $X_0 = x_0$
Step 2: For t = 0, 1, 2, …
With $X_t = x_t$, sample y from $q(x \mid x_t)$ (y in the domain of definition)
Sample u from Uniform[0, 1]
If $u < \alpha(x_t, y) = p(y)\, q(x_t \mid y)$, accept the move $x_t \to y$: $X_{t+1} = y$
Else $X_{t+1} = x_t$
Metropolis-Hastings Sampling
Step 1: Initialize $X_0 = x_0$
Step 2: For t = 0, 1, 2, …, n, n+1, n+2, …
With $X_t = x_t$, sample y from $q(x \mid x_t)$ (y in the domain of definition)
Burn-in period
Convergence
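A minimal Metropolis-Hastings sampler for a discrete target is sketched below. Note the assumptions: the slide text stops before the acceptance step, so the acceptance probability min(1, p(y) q(x|y) / (p(x) q(y|x))) used here is the standard Metropolis-Hastings form rather than something shown above; the uniform proposal, the target distribution, the step counts, and the seed are all illustrative choices.

```python
import random

def metropolis_hastings(p, n_steps, burn_in, seed=0):
    """Sample states 0..len(p)-1 with stationary distribution p.
    Uses a uniform (symmetric) proposal, so q cancels in the acceptance ratio."""
    rng = random.Random(seed)
    states = list(range(len(p)))
    x = 0                      # Step 1: initialize X_0
    samples = []
    for t in range(n_steps):   # Step 2: iterate
        y = rng.choice(states)             # propose y from q(. | x)
        accept = min(1.0, p[y] / p[x])     # standard MH acceptance probability
        if rng.random() < accept:
            x = y                          # accept: X_{t+1} = y
        if t >= burn_in:                   # discard the burn-in period
            samples.append(x)
    return samples

target = [0.2, 0.5, 0.3]   # hypothetical target distribution
samples = metropolis_hastings(target, n_steps=50000, burn_in=1000)
freq = [samples.count(s) / len(samples) for s in range(len(target))]
```

After the burn-in period the empirical frequencies should be close to the target, which is the practical meaning of convergence for the sampler.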