A Note on Latent LSTM Allocation
Tomonari MASADA @ Nagasaki University
August 31, 2017
(I am not fully confident about this note.)
1 ELBO
In latent LSTM allocation, the topic assignments $z_d = \{z_{d,1}, \ldots, z_{d,N_d}\}$ for each document $d$ are drawn from categorical distributions whose parameters are given by a softmax output of an LSTM. Based on the description of the generative process in the paper [1], we obtain the full joint distribution as follows:

\begin{align*}
p(\{w_1,\ldots,w_D\},\{z_1,\ldots,z_D\},\phi;\mathrm{LSTM},\beta)
= p(\phi;\beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM}) \tag{1}
\end{align*}
We maximize the evidence $p(\{w_1,\ldots,w_D\};\mathrm{LSTM},\beta)$, obtained by marginalizing out the topic assignments and $\phi$:

\begin{align*}
p(\{w_1,\ldots,w_D\};\mathrm{LSTM},\beta)
&= \sum_{\{z_1,\ldots,z_D\}} \int p(\{w_1,\ldots,w_D\},\{z_1,\ldots,z_D\},\phi;\mathrm{LSTM},\beta)\,d\phi \\
&= \sum_{\{z_1,\ldots,z_D\}} \int p(\phi;\beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})\,d\phi, \tag{2}
\end{align*}
where

\begin{align*}
p(w_d, z_d \mid \phi; \mathrm{LSTM})
= p(w_d \mid z_d, \phi)\, p(z_d; \mathrm{LSTM})
= \prod_t p(w_{d,t} \mid z_{d,t}, \phi)\, p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \tag{3}
\end{align*}
Jensen's inequality gives the following lower bound on the log of the evidence:

\begin{align*}
\log p(\{w_1,\ldots,w_D\};\mathrm{LSTM},\beta)
&= \log \sum_{Z} \int p(\phi;\beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})\,d\phi \\
&= \log \sum_{Z} \int q(Z,\phi)\,
\frac{p(\phi;\beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})}{q(Z,\phi)}\,d\phi \\
&\geq \sum_{Z} \int q(Z,\phi) \log
\frac{p(\phi;\beta) \prod_d p(w_d, z_d \mid \phi; \mathrm{LSTM})}{q(Z,\phi)}\,d\phi
\equiv \mathcal{L} \tag{4}
\end{align*}

We denote this lower bound, i.e., the ELBO, by $\mathcal{L}$.
We assume that the variational posterior factorizes as $q(Z,\phi) = \prod_k q(\phi_k) \times \prod_d q(z_d)$, where each $q(\phi_k)$ is a Dirichlet distribution with parameters $\xi_k = \{\xi_{k,1}, \ldots, \xi_{k,V}\}$.
Then the ELBO $\mathcal{L}$ can be rewritten as below.

\begin{align*}
\mathcal{L} &= \int q(\phi) \log p(\phi;\beta)\,d\phi
+ \sum_d \sum_{z_d} q(z_d) \log p(z_d;\mathrm{LSTM})
+ \sum_d \sum_{z_d} \int q(z_d)\, q(\phi) \log p(w_d \mid z_d, \phi)\,d\phi \\
&\quad - \sum_d \sum_{z_d} q(z_d) \log q(z_d)
- \int q(\phi) \log q(\phi)\,d\phi \tag{5}
\end{align*}
Further, we assume that $q(z_d)$ factorizes as $\prod_t q(z_{d,t})$, where the $q(z_{d,t})$ are categorical distributions satisfying $\sum_{k=1}^{K} q(z_{d,t} = k) = 1$. We let $\gamma_{d,t,k}$ denote $q(z_{d,t} = k)$.
The second term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.

\begin{align*}
\sum_{z_d} q(z_d) \log p(z_d;\mathrm{LSTM})
&= \sum_{z_d} \prod_t q(z_{d,t}) \sum_t \log p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM}) \\
&= \sum_{z_d} \prod_t q(z_{d,t}) \Big\{ \log p(z_{d,1};\mathrm{LSTM})
+ \log p(z_{d,2} \mid z_{d,1};\mathrm{LSTM})
+ \log p(z_{d,3} \mid z_{d,1}, z_{d,2};\mathrm{LSTM}) \\
&\qquad + \cdots + \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d-1};\mathrm{LSTM}) \Big\} \\
&= \sum_{z_{d,1}=1}^{K} q(z_{d,1}) \log p(z_{d,1};\mathrm{LSTM})
+ \sum_{z_{d,1}=1}^{K} \sum_{z_{d,2}=1}^{K} q(z_{d,1})\, q(z_{d,2}) \log p(z_{d,2} \mid z_{d,1};\mathrm{LSTM}) \\
&\qquad + \cdots + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d-1}=1}^{K}
q(z_{d,1}) \cdots q(z_{d,N_d-1}) \log p(z_{d,N_d-1} \mid z_{d,1}, \ldots, z_{d,N_d-2};\mathrm{LSTM}) \\
&\qquad + \sum_{z_{d,1}=1}^{K} \cdots \sum_{z_{d,N_d}=1}^{K}
q(z_{d,1}) \cdots q(z_{d,N_d}) \log p(z_{d,N_d} \mid z_{d,1}, \ldots, z_{d,N_d-1};\mathrm{LSTM}) \tag{6}
\end{align*}
The evaluation of Eq. (6) is intractable, because the term for each $t$ requires a sum over $K^{t-1}$ configurations of $z_{d,1:t-1}$. However, for each $t$, the $z_{d,1:t-1}$ in $p(z_{d,t} \mid z_{d,1:t-1}; \mathrm{LSTM})$ can be regarded as free variables whose values are set by some procedure having nothing to do with the generative model. We obtain the values of the $z_{d,1:t-1}$ by an LSTM forward pass and denote them by $\hat{z}_{d,1:t-1}$. Then we can simplify Eq. (6) as follows:

\begin{align*}
\sum_{z_d} q(z_d) \log p(z_d;\mathrm{LSTM})
&= \sum_{t=1}^{N_d} \sum_{z_{d,t}=1}^{K} q(z_{d,t}) \log p(z_{d,t} \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM}) \\
&= \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM}) \tag{7}
\end{align*}
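As a quick check of the array shapes involved, here is a minimal numpy sketch of the right-hand side of Eq. (7) for a single document; `gamma` and `log_theta` are names I introduce here for $\gamma_{d,t,k}$ and $\log p(z_{d,t}=k \mid \hat{z}_{d,1:t-1};\mathrm{LSTM})$, filled with random stand-in values.

```python
import numpy as np

N_d, K = 5, 3                         # tokens in document d, number of topics
rng = np.random.default_rng(0)

# gamma[t, k] = q(z_{d,t} = k); each row is a categorical distribution.
gamma = rng.dirichlet(np.ones(K), size=N_d)

# log_theta[t, k] = log p(z_{d,t} = k | \hat{z}_{d,1:t-1}; LSTM),
# i.e., the LSTM's log-softmax output at step t (random stand-in here).
log_theta = np.log(rng.dirichlet(np.ones(K), size=N_d))

# Eq. (7): sum_t sum_k gamma[t, k] * log_theta[t, k]
term2 = float((gamma * log_theta).sum())
print(term2)
```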
The third term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.

\begin{align*}
\sum_d \sum_{z_d} \int q(z_d)\, q(\phi) \log p(w_d \mid z_d, \phi)\,d\phi
&= \sum_d \int q(\phi) \sum_{z_d} q(z_d) \sum_t \log \phi_{z_{d,t}, w_{d,t}}\,d\phi \\
&= \int q(\phi) \sum_d \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log \phi_{k, w_{d,t}}\,d\phi \\
&= \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k} \int q(\phi_k) \log \phi_{k, w_{d,t}}\,d\phi_k \\
&= \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k}
\Big\{ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big(\textstyle\sum_v \xi_{k,v}\Big) \Big\} \tag{8}
\end{align*}
The first term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.

\begin{align*}
\int q(\phi) \log p(\phi;\beta)\,d\phi
&= \sum_k \int q(\phi_k) \log p(\phi_k;\beta)\,d\phi_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta)
+ \sum_k \sum_v (\beta - 1) \int q(\phi_k) \log \phi_{k,v}\,d\phi_k \\
&= K \log \Gamma(V\beta) - KV \log \Gamma(\beta)
+ (\beta - 1) \sum_k \sum_v \Big\{ \Psi(\xi_{k,v}) - \Psi\Big(\textstyle\sum_{v'} \xi_{k,v'}\Big) \Big\} \tag{9}
\end{align*}
The fourth term of $\mathcal{L}$ in Eq. (5) can be rewritten as below.

\begin{align*}
\sum_d \sum_{z_d} q(z_d) \log q(z_d)
= \sum_{d=1}^{D} \sum_{t=1}^{N_d} \sum_{k=1}^{K} q(z_{d,t} = k) \log q(z_{d,t} = k) \tag{10}
\end{align*}
The last term of $\mathcal{L}$ can be rewritten as below.

\begin{align*}
\int q(\phi) \log q(\phi)\,d\phi
&= \sum_k \int q(\phi_k) \log q(\phi_k)\,d\phi_k \\
&= \sum_k \log \Gamma\Big(\textstyle\sum_v \xi_{k,v}\Big)
- \sum_k \sum_v \log \Gamma(\xi_{k,v})
+ \sum_k \sum_v (\xi_{k,v} - 1) \Big\{ \Psi(\xi_{k,v}) - \Psi\Big(\textstyle\sum_{v'} \xi_{k,v'}\Big) \Big\} \tag{11}
\end{align*}
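Collecting Eqs. (9) and (11), the $\phi$-dependent prior and entropy contributions to $\mathcal{L}$ can be evaluated in closed form. The following is a minimal Python sketch, assuming `xi` is a $K \times V$ numpy array of the Dirichlet parameters $\xi_{k,v}$; the function name is mine, not from the paper.

```python
import numpy as np
from scipy.special import psi, gammaln   # digamma and log-gamma

def elbo_phi_terms(xi, beta):
    """Eq. (9) minus Eq. (11): the phi-dependent part of the ELBO.

    xi   : (K, V) array of Dirichlet parameters xi_{k,v}
    beta : scalar parameter of the symmetric Dirichlet prior
    """
    K, V = xi.shape
    # E_q[log phi_{k,v}] = Psi(xi_{k,v}) - Psi(sum_v' xi_{k,v'})
    e_log_phi = psi(xi) - psi(xi.sum(axis=1, keepdims=True))

    # Eq. (9): E_q[log p(phi; beta)]
    term1 = (K * gammaln(V * beta) - K * V * gammaln(beta)
             + (beta - 1.0) * e_log_phi.sum())

    # Eq. (11): E_q[log q(phi)], the negative entropy of q(phi)
    term5 = (gammaln(xi.sum(axis=1)).sum() - gammaln(xi).sum()
             + ((xi - 1.0) * e_log_phi).sum())

    return term1 - term5
```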
2 Inference
The partial derivative of $\mathcal{L}$ with respect to $\gamma_{d,t,k}$ is

\begin{align*}
\frac{\partial \mathcal{L}}{\partial \gamma_{d,t,k}}
= \log p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM})
+ \Psi(\xi_{k, w_{d,t}}) - \Psi\Big(\textstyle\sum_v \xi_{k,v}\Big)
- \log \gamma_{d,t,k} + \text{const.} \tag{12}
\end{align*}
By solving $\partial \mathcal{L} / \partial \gamma_{d,t,k} = 0$ under the normalization constraint $\sum_k \gamma_{d,t,k} = 1$, we obtain

\begin{align*}
\gamma_{d,t,k} \propto \phi_{k, w_{d,t}}\, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM}), \tag{13}
\end{align*}

where $\phi_{k, w_{d,t}} \equiv \exp\{\Psi(\xi_{k, w_{d,t}})\} / \exp\{\Psi(\sum_v \xi_{k,v})\}$. When $t = 1$, $\gamma_{d,1,k} \propto \phi_{k, w_{d,1}}\, p(z_{d,1} = k; \mathrm{LSTM})$. Therefore, $q(z_{d,1})$ does not depend on the $z_{d,t}$ for $t > 1$, and we can draw a sample from $q(z_{d,1})$ without seeing the $z_{d,t}$ for $t > 1$. When $t = 2$, $\gamma_{d,2,k} \propto \phi_{k, w_{d,2}}\, p(z_{d,2} = k \mid \hat{z}_{d,1}; \mathrm{LSTM})$. That is, $q(z_{d,2})$ depends only on $\hat{z}_{d,1}$. One possible way to determine $\hat{z}_{d,1}$ is to draw a sample from $q(z_{d,1})$, because this drawing can be performed without seeing the $z_{d,t}$ for $t > 1$. For each $t > 2$, we may repeat a similar argument. However, this procedure for determining the $\hat{z}_{d,t}$ is made possible only by the assumption that led to the approximation in Eq. (7), because without this assumption we cannot obtain the simple update $\gamma_{d,t,k} \propto \phi_{k, w_{d,t}}\, p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM})$. Moreover, this assumption tells us nothing about how we should sample the $z_{d,t}$. For example, we may draw the $z_{d,t}$ simply from the softmax output of the LSTM at each $t$, without using $\phi$. In any case, the assumption that leads to the approximation in Eq. (7) provides no answer to the question of why we should use $\phi$ when sampling the $z_{d,t}$.
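The following numpy sketch illustrates this sequential procedure for one document. It is only a sketch under stated assumptions: `next_topic_probs` is a hypothetical stand-in for the LSTM softmax given the sampled prefix $\hat{z}_{d,1:t-1}$, and `exp_psi_phi[k, v]` holds $\exp\{\Psi(\xi_{k,v}) - \Psi(\sum_{v'} \xi_{k,v'})\}$, i.e., the $\phi_{k,v}$ appearing in Eq. (13).

```python
import numpy as np
from scipy.special import psi

rng = np.random.default_rng(0)
K, V = 4, 100
xi = rng.gamma(1.0, 1.0, size=(K, V)) + 0.1            # Dirichlet parameters
exp_psi_phi = np.exp(psi(xi) - psi(xi.sum(axis=1, keepdims=True)))

def next_topic_probs(prefix):
    # Hypothetical stand-in for p(z_{d,t} = . | \hat{z}_{d,1:t-1}; LSTM):
    # a real implementation would run one LSTM step on the prefix here.
    return np.full(K, 1.0 / K)

w_d = rng.integers(V, size=10)                         # word ids of document d
z_hat, gamma = [], []
for t, w in enumerate(w_d):
    theta_t = next_topic_probs(z_hat)                  # LSTM softmax at step t
    g = exp_psi_phi[:, w] * theta_t                    # Eq. (13), unnormalized
    g /= g.sum()
    gamma.append(g)
    z_hat.append(rng.choice(K, p=g))                   # sample \hat{z}_{d,t} from q(z_{d,t})
gamma = np.array(gamma)                                # (N_d, K) array of gamma_{d,t,k}
```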
For $\xi_{k,v}$, we obtain the usual update $\xi_{k,v} = \beta + \sum_d \sum_{\{t : w_{d,t} = v\}} \gamma_{d,t,k}$.
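Given the per-document `gamma` arrays from the sketch above, this update is a simple accumulation over word ids; `docs` is a hypothetical list of `(w_d, gamma_d)` pairs.

```python
# xi[k, v] = beta + sum of gamma_{d,t,k} over all tokens with w_{d,t} = v.
beta = 0.1
xi_new = np.full((K, V), beta)
for w_d, gamma_d in docs:              # docs: hypothetical list of (word ids, (N_d, K) gamma)
    for t, v in enumerate(w_d):
        xi_new[:, v] += gamma_d[t]     # add gamma_{d,t,.} into column v
```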
Let $\theta_{d,t,k}$ denote $p(z_{d,t} = k \mid \hat{z}_{d,1:t-1}; \mathrm{LSTM})$, i.e., the softmax output of the LSTM. The partial derivative of $\mathcal{L}$ with respect to any LSTM parameter is

\begin{align*}
\frac{\partial \mathcal{L}}{\partial\, \mathrm{LSTM}}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K} \gamma_{d,t,k}
\frac{\partial \log \theta_{d,t,k}}{\partial\, \mathrm{LSTM}}
= \sum_{d \in B} \sum_{t=1}^{N_d} \sum_{k=1}^{K}
\frac{\gamma_{d,t,k}}{\theta_{d,t,k}} \frac{\partial \theta_{d,t,k}}{\partial\, \mathrm{LSTM}}, \tag{14}
\end{align*}

where $B$ denotes a mini-batch of documents.
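In practice one would not implement Eq. (14) by hand: it is the gradient of the weighted log-likelihood $\sum_{d \in B} \sum_t \sum_k \gamma_{d,t,k} \log \theta_{d,t,k}$, so automatic differentiation of that objective yields the same update. Below is a minimal PyTorch sketch with hypothetical module and variable names, assuming documents padded to a common length for simplicity.

```python
import torch
import torch.nn as nn

class TopicLSTM(nn.Module):
    """Hypothetical LSTM producing log theta_{d,t,k} from topic prefixes."""
    def __init__(self, K, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(K + 1, hidden)   # index K = start-of-document symbol
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, K)

    def forward(self, z_prefix):
        # z_prefix: (B, N) shifted topic sequence [K, z_1, ..., z_{N-1}]
        h, _ = self.lstm(self.embed(z_prefix))
        return torch.log_softmax(self.out(h), dim=-1)  # (B, N, K) log theta

K, N, B = 4, 10, 2
model = TopicLSTM(K)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

z_hat = torch.randint(K, (B, N))                       # sampled \hat{z}_{d,t}
z_prefix = torch.cat([torch.full((B, 1), K, dtype=torch.long),
                      z_hat[:, :-1]], dim=1)
gamma = torch.softmax(torch.randn(B, N, K), dim=-1)    # stand-in for gamma_{d,t,k}

log_theta = model(z_prefix)
loss = -(gamma * log_theta).sum()      # negative of the objective whose gradient is Eq. (14)
opt.zero_grad()
loss.backward()
opt.step()
```

Here the $\gamma_{d,t,k}$ enter as soft target distributions over topics at each position, so the loss has the form of a soft-label cross entropy.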
References
[1] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering
and non-linear dynamic modeling of sequence data. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of
Machine Learning Research, pages 3967–3976, International Convention Centre, Sydney, Australia,
06–11 Aug 2017. PMLR.