RBM from Scratch
Hadi Sinaee
Sharif University of Technology
Department of Computer Engineering
May 17, 2015
Outline
1 Unsupervised Learning
2 Likelihood
3 Optimization
4 Having Latent Variables
5 Markov Chain and Gibbs Sampling
6 Restricted Boltzmann Machines
Unsupervised Learning Markov Random Fields
Unsupervised Learning
• Unsupervised learning means learning an unknown distribution q based on sample data.
• This includes finding new representations of the data that foster learning and generalization.
• If the structure of the graphical model and the family of energy functions parameterized by θ are known, unsupervised learning of a data distribution with an MRF means adjusting the parameters θ.
• p(x|θ) shows this dependence.
Likelihood Likelihood of an MRF
Likelihood
• Training data S = {x_1, x_2, ..., x_N}, i.i.d. samples from the true distribution q. The standard way of finding the parameters is maximum likelihood (ML).
• Applying this to an MRF → finding the MRF parameters θ that maximize the probability of S under the MRF distribution p.
L : Θ → R, l(θ|S) = Σ_{i=1}^{N} ln p(x_i|θ)
For the Gibbs distribution of an MRF the maximum cannot be found analytically, so we resort to numerical approximation.
Likelihood KL divergence
Likelihood
KL divergence between the true distribution and the MRF distribution:
KL(q||p) = Σ_x [q(x) ln q(x) − q(x) ln p(x)]
• The KL divergence comprises the (negative) entropy of q and an expectation over q; only the latter depends on the parameters subject to optimization.
• Maximizing the likelihood → minimizing the KL divergence.
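To make the definition concrete, a minimal numeric sketch (with made-up example distributions q and p over four states; not part of the original slides):

import numpy as np

q = np.array([0.1, 0.4, 0.3, 0.2])      # illustrative "true" distribution q
p = np.array([0.25, 0.25, 0.25, 0.25])  # illustrative model distribution p

# KL(q||p) = sum_x q(x) ln q(x) - q(x) ln p(x)
kl = np.sum(q * np.log(q) - q * np.log(p))
print(kl)  # always >= 0, and 0 only if q == p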
Optimization Gradient Ascent
Optimization
Iteratively update the parameters from θ^t to θ^{t+1} based on the log-likelihood:
θ^{t+1} = θ^t + η ∂/∂θ^t (Σ_{i=1}^{N} l(θ|x_i))
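A minimal sketch of this update rule (not from the slides; grad_log_likelihood is a hypothetical placeholder for whatever per-sample gradient estimator is used):

import numpy as np

def gradient_ascent_step(theta, data, grad_log_likelihood, eta=0.01):
    """One update: theta <- theta + eta * sum_i d l(theta|x_i) / d theta."""
    grad = np.zeros_like(theta)
    for x in data:
        grad += grad_log_likelihood(theta, x)  # hypothetical per-sample gradient
    return theta + eta * grad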
Having Latent Variables Latent Variables
Having Latent Variables
• We want to model an m-dimensional probability distribution q (e.g. an image with m pixels).
• X = (V, H) is the set of all variables.
V = (V_1, V_2, ..., V_m) → visible units
H = (H_1, H_2, ..., H_n) → hidden units, n = |X| − m
• e.g. V = the set of all pixels, H = the set of relationships between the V units.
• The Gibbs distribution of the visible units:
p(v) = (1/Z) Σ_h e^{−E(v,h)},   Z = Σ_{v,h} e^{−E(v,h)}
Having Latent Variables Log-Likelihood of an MRF
• Log-likelihood for one sample:
l(θ|v) = ln[Σ_h e^{−E(v,h)}] − ln[Σ_{v,h} e^{−E(v,h)}]
• Then its gradient is:
∂l(θ|v)/∂θ = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ
• It is the difference of two expectations:
→ one over the conditional distribution of the hidden units,
→ one over the model distribution.
• We would have to sum over all possible values of (v,h) for this computation! Instead, we approximate this expectation.
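For reference, this gradient follows directly from differentiating the two log-sum terms (a short derivation written out here for completeness, not shown on the slides):

\frac{\partial l(\theta\mid v)}{\partial\theta}
= \frac{\partial}{\partial\theta}\ln\sum_{h} e^{-E(v,h)} - \frac{\partial}{\partial\theta}\ln\sum_{v,h} e^{-E(v,h)}
= -\sum_{h}\frac{e^{-E(v,h)}}{\sum_{h'} e^{-E(v,h')}}\frac{\partial E(v,h)}{\partial\theta}
  + \sum_{v,h}\frac{e^{-E(v,h)}}{\sum_{v',h'} e^{-E(v',h')}}\frac{\partial E(v,h)}{\partial\theta}
= -\sum_{h} p(h\mid v)\,\frac{\partial E(v,h)}{\partial\theta} + \sum_{v,h} p(v,h)\,\frac{\partial E(v,h)}{\partial\theta}.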
Markov Chain and Gibbs Sampling Markov Chain
Markov Chain
• Stationary distribution: a distribution π for which π^T = π^T P holds, where P is the transition matrix with elements p_ij.
• Detailed balance condition: a sufficient condition for π to be a stationary distribution w.r.t. the transition probabilities p_ij, i, j ∈ Ω:
π(i) p_ij = π(j) p_ji, ∀ i, j ∈ Ω
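As a small illustration (with a made-up two-state chain, not from the slides), the following checks stationarity and detailed balance numerically:

import numpy as np

# Hypothetical 2-state transition matrix P (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2/3, 1/3])  # candidate stationary distribution

print(np.allclose(pi @ P, pi))  # stationarity: pi^T P == pi^T
# detailed balance: pi(i) p_ij == pi(j) p_ji for all i, j
F = pi[:, None] * P
print(np.allclose(F, F.T))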
Markov Chain and Gibbs Sampling Gibbs Sampling
Gibbs Sampling
• An MRF X = (X_1, X_2, ..., X_N) for a graph G = (V, E) where V = {1, 2, ..., N}.
• Each X_i, i ∈ V takes values in a finite set Λ.
• Time-varying states X = {X^k | k ∈ N}, X^k = (X^k_1, X^k_2, ..., X^k_N).
• π(x) = (1/Z) e^{−ε(x)} is the joint probability distribution of X.
Markov Chain and Gibbs Sampling Gibbs Algorithm
Gibbs Sampling
Step 1: At each iteration we pick a random variable X_i, i ∈ V with probability q(i); q is a strictly positive distribution over V.
Step 2: Sample a new value for X_i from its conditional distribution given the state of all other variables, i.e. π(X_i|(x_v)_{v∈V∖{i}}) = π(X_i|(x_w)_{w∈N_i}).
Step 3: Keep doing this!
• Therefore the transition probability for the MRF X is defined as follows, for two states x and y:
p_xy = q(i) π(y_i|(x_v)_{v∈V∖{i}}), if ∃ i ∈ V such that ∀ v ∈ V, v ≠ i: x_v = y_v (x and y differ only in the i-th element)
p_xy = 0, otherwise
p_xx = Σ_{i∈V} q(i) π(x_i|(x_v)_{v∈V∖{i}})
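A minimal sketch of this single-site Gibbs step for binary variables (not from the slides; conditional_prob is a hypothetical function returning π(X_i = 1 | all other variables)):

import numpy as np

def gibbs_step(x, q, conditional_prob, rng=np.random.default_rng()):
    """One transition of the Gibbs chain described above.

    x: current state vector (binary numpy array), q: selection probabilities over sites,
    conditional_prob(i, x): hypothetical p(X_i = 1 | x_{-i}).
    """
    i = rng.choice(len(x), p=q)       # Step 1: pick site i with probability q(i)
    p1 = conditional_prob(i, x)       # Step 2: conditional given all other variables
    x = x.copy()
    x[i] = rng.random() < p1          # resample X_i
    return x                          # Step 3: repeat by calling again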
Markov Chain and Gibbs Sampling Convergence of Gibbs
Convergence of Gibbs
• π is strictly positive → the conditional distributions of the single variables are strictly positive.
Every single variable X_i can take every state x_i ∈ Λ in a single transition step → every state can be reached from any other state in a finite number of steps → the Markov chain is irreducible.
• p_xx > 0 and p_xy > 0, and the detailed balance condition holds → the Markov chain is aperiodic and π is its stationary distribution.
Convergence
Aperiodicity and irreducibility guarantee that the chain converges to the stationary distribution π.
Restricted Boltzmann Machines RBM
RBM
• p(v, h) = (1/Z) e^{−E(v,h)} with
E(v, h) = −Σ_{i=1}^{n} Σ_{j=1}^{m} w_ij h_i v_j − Σ_{j=1}^{m} b_j v_j − Σ_{i=1}^{n} c_i h_i
• c_i and b_j are real-valued bias terms for the hidden and visible units.
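A direct transcription of this energy function (an illustrative sketch; W is the n×m weight matrix, b the visible biases, c the hidden biases):

import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -h^T W v - b^T v - c^T h for a binary RBM."""
    return -h @ W @ v - b @ v - c @ h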
Restricted Boltzmann Machines RBM
• The hidden variables are independent of each other given the visible units, and vice versa:
p(h|v) = Π_{i=1}^{n} p(h_i|v) and p(v|h) = Π_{j=1}^{m} p(v_j|h)
• Marginal probability distribution of the observations:
p(v) = (1/Z) Π_{j=1}^{m} e^{b_j v_j} Π_{i=1}^{n} (1 + e^{c_i + Σ_{j=1}^{m} w_ij v_j})
• Conditional probability distribution of the components:
p(H_i = 1|v) = σ(Σ_{j=1}^{m} w_ij v_j + c_i)
p(V_j = 1|h) = σ(Σ_{i=1}^{n} w_ij h_i + b_j)
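A minimal sketch of these conditionals and of block sampling (an illustrative addition, using the same W, b, c as in the energy sketch above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, c, rng=np.random.default_rng()):
    """p(H_i = 1 | v) = sigma(sum_j w_ij v_j + c_i), sampled for all i at once."""
    p_h = sigmoid(W @ v + c)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_v_given_h(h, W, b, rng=np.random.default_rng()):
    """p(V_j = 1 | h) = sigma(sum_i w_ij h_i + b_j), sampled for all j at once."""
    p_v = sigmoid(W.T @ h + b)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v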
Restricted Boltzmann Machines Gradient of the Log-Likelihood
Gradient of the log-likelihood
• Recap: the gradient of the log-likelihood of an MRF for a single data point:
∂l(θ|v)/∂θ = −Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ
The first term is tractable, e.g. for w_ij:
−Σ_h p(h|v) ∂E(v,h)/∂w_ij = Σ_{h_i} Σ_{h_{−i}} Π_{k=1}^{n} p(h_k|v) h_i v_j = p(H_i = 1|v) v_j = σ(Σ_{j=1}^{m} w_ij v_j + c_i) v_j
• We can do the same thing for the second part, writing it as
Σ_v p(v) Σ_h p(h|v) ∂E(v,h)/∂θ or Σ_h p(h) Σ_v p(v|h) ∂E(v,h)/∂θ
• It is still intractable (exponential in the smaller layer, i.e. 2^m or 2^n terms).
Restricted Boltzmann Machines Gradient of the Log-Likelihood
Computing the derivatives of the log-likelihood
• w.r.t. w_ij:
∂l(θ|v)/∂w_ij = p(H_i = 1|v) v_j − Σ_v p(v) p(H_i = 1|v) v_j
• w.r.t. b_j:
∂l(θ|v)/∂b_j = v_j − Σ_v p(v) v_j
• w.r.t. c_i:
∂l(θ|v)/∂c_i = p(H_i = 1|v) − Σ_v p(v) p(H_i = 1|v)
• To avoid the summation over all possible values of v, we can approximate the expectation.
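A sketch of these three derivatives for a single training vector, with the model expectation approximated by a single sample v_model (an illustrative addition; how v_model is obtained is the topic of the next slides):

import numpy as np

def rbm_gradients(v_data, v_model, W, b, c):
    """Approximate d l / d(W, b, c) for one data vector, using one model sample."""
    p_h_data = 1.0 / (1.0 + np.exp(-(W @ v_data + c)))    # p(H_i = 1 | v_data)
    p_h_model = 1.0 / (1.0 + np.exp(-(W @ v_model + c)))  # p(H_i = 1 | v_model)
    dW = np.outer(p_h_data, v_data) - np.outer(p_h_model, v_model)
    db = v_data - v_model
    dc = p_h_data - p_h_model
    return dW, db, dc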
Restricted Boltzmann Machines Approximating the RBM Log-Likelihood
Contrastive Divergence
• Use a Gibbs chain run for only k steps (usually k = 1).
• Starting from a training sample v^0, it yields the sample v^k after k steps.
• Each step t consists of sampling h^t from p(h|v^t) and then sampling v^{t+1} from p(v|h^t).
• Then, using these samples, the gradient approximation is given by
CD_k(θ, v^0) = −Σ_h p(h|v^0) ∂E(v^0, h)/∂θ + Σ_h p(h|v^k) ∂E(v^k, h)/∂θ
Restricted Boltzmann Machines Contrastive Divergence
k-CD for Batch
[Algorithm figure: k-step contrastive divergence applied to a batch of training samples]
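The algorithm box itself is only a figure in the original slides; as a hedged stand-in, here is a self-contained sketch of what k-step CD over a batch of binary training vectors might look like (learning rate and update averaging are illustrative assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_batch(batch, W, b, c, k=1, eta=0.01, rng=np.random.default_rng()):
    """One k-step contrastive divergence update on a batch of binary training vectors (sketch)."""
    dW = np.zeros_like(W); db = np.zeros_like(b); dc = np.zeros_like(c)
    for v0 in batch:
        v = v0.copy()
        for _ in range(k):                                   # h^t ~ p(h|v^t), v^{t+1} ~ p(v|h^t)
            h = (rng.random(c.shape) < sigmoid(W @ v + c)).astype(float)
            v = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
        ph0 = sigmoid(W @ v0 + c)                            # positive phase: p(H = 1 | v^0)
        phk = sigmoid(W @ v + c)                             # negative phase: p(H = 1 | v^k)
        dW += np.outer(ph0, v0) - np.outer(phk, v)
        db += v0 - v
        dc += ph0 - phk
    n = len(batch)
    return W + eta * dW / n, b + eta * db / n, c + eta * dc / n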
Restricted Boltzmann Machines Other derivatives
• Persistent CD (PCD) relies on the chain of the previous parameter update: the v^k of the previous step is used to initialize the chain of the next step.
• Fast PCD introduces a set of parameters used only for sampling, not for the model, to increase the speed of mixing.
• Parallel Tempering:
We run k (usually k = 1) Gibbs sampling steps in each tempered Markov chain, yielding samples (v_1, h_1), ..., (v_M, h_M); then we choose two chains at consecutive temperatures and exchange their particles (v_r, h_r) and (v_{r−1}, h_{r−1}) with a certain probability.
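A hedged sketch of the persistent-chain idea (an illustrative addition; a single fantasy particle is used here for simplicity, whereas practical PCD usually keeps one chain per batch element): the only change relative to CD-k is that the negative-phase chain is carried across updates instead of being restarted at the data.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcd_update(batch, v_chain, W, b, c, k=1, eta=0.01, rng=np.random.default_rng()):
    """One persistent-CD update (sketch): v_chain is the fantasy particle kept across calls."""
    v = v_chain
    for _ in range(k):                                       # advance the persistent chain, not the data
        h = (rng.random(c.shape) < sigmoid(W @ v + c)).astype(float)
        v = (rng.random(b.shape) < sigmoid(W.T @ h + b)).astype(float)
    dW = np.zeros_like(W); db = np.zeros_like(b); dc = np.zeros_like(c)
    for v0 in batch:                                         # positive phase from data, negative from the chain
        ph0, phk = sigmoid(W @ v0 + c), sigmoid(W @ v + c)
        dW += np.outer(ph0, v0) - np.outer(phk, v)
        db += v0 - v
        dc += ph0 - phk
    n = len(batch)
    return (W + eta * dW / n, b + eta * db / n, c + eta * dc / n), v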
Restricted Boltzmann Machines Other derivatives
Results
[Figure: left: hidden sampling, right: visible sampling]
More Related Content

What's hot

powerpoint
powerpointpowerpoint
powerpoint
butest
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
Nesma
 

What's hot (19)

Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
Dr Chris Drovandi (QUT) - Bayesian Indirect Inference Using a Parametric Auxi...
 
Lect6 csp
Lect6 cspLect6 csp
Lect6 csp
 
A Unifying Review of Gaussian Linear Models (Roweis 1999)
A Unifying Review of Gaussian Linear Models (Roweis 1999)A Unifying Review of Gaussian Linear Models (Roweis 1999)
A Unifying Review of Gaussian Linear Models (Roweis 1999)
 
Elliptic Curve Cryptography: Arithmetic behind
Elliptic Curve Cryptography: Arithmetic behindElliptic Curve Cryptography: Arithmetic behind
Elliptic Curve Cryptography: Arithmetic behind
 
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information SystemsRuleML2015: Learning Characteristic Rules in Geographic Information Systems
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del Pra
 
powerpoint
powerpointpowerpoint
powerpoint
 
Iwsm2014 an analogy-based approach to estimation of software development ef...
Iwsm2014   an analogy-based approach to estimation of software development ef...Iwsm2014   an analogy-based approach to estimation of software development ef...
Iwsm2014 an analogy-based approach to estimation of software development ef...
 
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
QMC Opening Workshop, Support Points - a new way to compact distributions, wi...
 
Introduction to modern Variational Inference.
Introduction to modern Variational Inference.Introduction to modern Variational Inference.
Introduction to modern Variational Inference.
 
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market DataBoosted Tree-based Multinomial Logit Model for Aggregated Market Data
Boosted Tree-based Multinomial Logit Model for Aggregated Market Data
 
A Non--convex optimization approach to Correlation Clustering
A Non--convex optimization approach to Correlation ClusteringA Non--convex optimization approach to Correlation Clustering
A Non--convex optimization approach to Correlation Clustering
 
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
CARI-2020, Application of LSTM architectures for next frame forecasting in Se...
 
High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...High-dimensional polytopes defined by oracles: algorithms, computations and a...
High-dimensional polytopes defined by oracles: algorithms, computations and a...
 
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
Leveraging R in Big Data of Mobile Ads (R在行動廣告大數據的應用)
 
Stochastic optimization from mirror descent to recent algorithms
Stochastic optimization from mirror descent to recent algorithmsStochastic optimization from mirror descent to recent algorithms
Stochastic optimization from mirror descent to recent algorithms
 
Cryptography Baby Step Giant Step
Cryptography Baby Step Giant StepCryptography Baby Step Giant Step
Cryptography Baby Step Giant Step
 
20180722 pyro
20180722 pyro20180722 pyro
20180722 pyro
 
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
Program on Quasi-Monte Carlo and High-Dimensional Sampling Methods for Applie...
 

Viewers also liked

Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Indraneel Pole
 
Hebbian Learning
Hebbian LearningHebbian Learning
Hebbian Learning
ESCOM
 

Viewers also liked (10)

The Art Of Backpropagation
The Art Of BackpropagationThe Art Of Backpropagation
The Art Of Backpropagation
 
Introduction to Neural networks (under graduate course) Lecture 8 of 9
Introduction to Neural networks (under graduate course) Lecture 8 of 9Introduction to Neural networks (under graduate course) Lecture 8 of 9
Introduction to Neural networks (under graduate course) Lecture 8 of 9
 
Intro to Excel Basics: Part II
Intro to Excel Basics: Part IIIntro to Excel Basics: Part II
Intro to Excel Basics: Part II
 
restrictedboltzmannmachines
restrictedboltzmannmachinesrestrictedboltzmannmachines
restrictedboltzmannmachines
 
DNN and RBM
DNN and RBMDNN and RBM
DNN and RBM
 
From neural networks to deep learning
From neural networks to deep learningFrom neural networks to deep learning
From neural networks to deep learning
 
Learning RBM(Restricted Boltzmann Machine in Practice)
Learning RBM(Restricted Boltzmann Machine in Practice)Learning RBM(Restricted Boltzmann Machine in Practice)
Learning RBM(Restricted Boltzmann Machine in Practice)
 
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
Restricted Boltzmann Machine - A comprehensive study with a focus on Deep Bel...
 
Hebbian Learning
Hebbian LearningHebbian Learning
Hebbian Learning
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 

Similar to RBM from Scratch

pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
butest
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
butest
 

Similar to RBM from Scratch (20)

Hessian Matrices in Statistics
Hessian Matrices in StatisticsHessian Matrices in Statistics
Hessian Matrices in Statistics
 
More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?More investment in Research and Development for better Education in the future?
More investment in Research and Development for better Education in the future?
 
Joint contrastive learning with infinite possibilities
Joint contrastive learning with infinite possibilitiesJoint contrastive learning with infinite possibilities
Joint contrastive learning with infinite possibilities
 
A baseline for_few_shot_image_classification
A baseline for_few_shot_image_classificationA baseline for_few_shot_image_classification
A baseline for_few_shot_image_classification
 
Model Selection and Validation
Model Selection and ValidationModel Selection and Validation
Model Selection and Validation
 
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
【博士論文発表会】パラメータ制約付き特異モデルの統計的学習理論
 
Machine learning
Machine learningMachine learning
Machine learning
 
SVD and the Netflix Dataset
SVD and the Netflix DatasetSVD and the Netflix Dataset
SVD and the Netflix Dataset
 
An introduction to R
An introduction to RAn introduction to R
An introduction to R
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
Firefly exact MCMC for Big Data
Firefly exact MCMC for Big DataFirefly exact MCMC for Big Data
Firefly exact MCMC for Big Data
 
Large Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the TrenchesLarge Scale Recommendation: a view from the Trenches
Large Scale Recommendation: a view from the Trenches
 
Silicon valleycodecamp2013
Silicon valleycodecamp2013Silicon valleycodecamp2013
Silicon valleycodecamp2013
 
31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity31 Machine Learning Unsupervised Cluster Validity
31 Machine Learning Unsupervised Cluster Validity
 
Nimrita koul Machine Learning
Nimrita koul  Machine LearningNimrita koul  Machine Learning
Nimrita koul Machine Learning
 
Ikdd co ds2017presentation_v2
Ikdd co ds2017presentation_v2Ikdd co ds2017presentation_v2
Ikdd co ds2017presentation_v2
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative Models
 
Talwalkar mlconf (1)
Talwalkar mlconf (1)Talwalkar mlconf (1)
Talwalkar mlconf (1)
 
Composing graphical models with neural networks for structured representatio...
Composing graphical models with  neural networks for structured representatio...Composing graphical models with  neural networks for structured representatio...
Composing graphical models with neural networks for structured representatio...
 

Recently uploaded

An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
negromaestrong
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 

Recently uploaded (20)

Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptx
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 

RBM from Scratch

  • 1. RBM from Scratch Hadi Sinaee Sharif University of Technology Department of Computer Engineering May 17, 2015 Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 1 / 20
  • 2. Outline 1 Unsupervised Learning 2 Liklihood 3 Optimization 4 Having Latent Variables 5 Markov Chain and Gibbs Sampling 6 Restricted Boltzmann Machines Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 2 / 20
  • 3. Unsupervised Learning Markov Random Fields Unsupervised Learning • Unsupervised learning means learning an unkown distritbution q beased on sample data. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 3 / 20
  • 4. Unsupervised Learning Markov Random Fields Unsupervised Learning • Unsupervised learning means learning an unkown distritbution q beased on sample data. • This includes finding a new representatoins of data that foster learning and generalization. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 3 / 20
  • 5. Unsupervised Learning Markov Random Fields Unsupervised Learning • Unsupervised learning means learning an unkown distritbution q beased on sample data. • This includes finding a new representatoins of data that foster learning and generalization. • If structure of the graphical model Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 3 / 20
  • 6. Unsupervised Learning Markov Random Fields Unsupervised Learning • Unsupervised learning means learning an unkown distritbution q beased on sample data. • This includes finding a new representatoins of data that foster learning and generalization. • If structure of the graphical model and family of energy function parameterize by θ Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 3 / 20
  • 7. Unsupervised Learning Markov Random Fields Unsupervised Learning • Unsupervised learning means learning an unkown distritbution q beased on sample data. • This includes finding a new representatoins of data that foster learning and generalization. • If structure of the graphical model and family of energy function parameterize by θ is known, Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 3 / 20
  • 8. Unsupervised Learning Markov Random Fields Unsupervised Learning • Unsupervised learning means learning an unkown distritbution q beased on sample data. • This includes finding a new representatoins of data that foster learning and generalization. • If structure of the graphical model and family of energy function parameterize by θ is known, unsupervised learning of a data distribution with MRF means adjusting parameters θ. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 3 / 20
  • 9. Unsupervised Learning Markov Random Fields Unsupervised Learning • Unsupervised learning means learning an unkown distritbution q beased on sample data. • This includes finding a new representatoins of data that foster learning and generalization. • If structure of the graphical model and family of energy function parameterize by θ is known, unsupervised learning of a data distribution with MRF means adjusting parameters θ. • p(x|θ) shows this dependence. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 3 / 20
  • 10. Liklihood likilihood of MRF Liklihood • Training data S = {x1, x2, ..., xN}, i.i.d sampled from true distribution q. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 4 / 20
  • 11. Liklihood likilihood of MRF Liklihood • Training data S = {x1, x2, ..., xN}, i.i.d sampled from true distribution q. Standard way of finding the parameters is ML. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 4 / 20
  • 12. Liklihood likilihood of MRF Liklihood • Training data S = {x1, x2, ..., xN}, i.i.d sampled from true distribution q. Standard way of finding the parameters is ML. • Appling this to MRF → finding the MRF parameters(θ) that maximize the probability of S under the MRF distribution,p. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 4 / 20
  • 13. Liklihood likilihood of MRF Liklihood • Training data S = {x1, x2, ..., xN}, i.i.d sampled from true distribution q. Standard way of finding the parameters is ML. • Appling this to MRF → finding the MRF parameters(θ) that maximize the probability of S under the MRF distribution,p. L : Θ → R l(θ|S) = ΣN i=1ln(p(xi |θ)) Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 4 / 20
  • 14. Liklihood likilihood of MRF Liklihood • Training data S = {x1, x2, ..., xN}, i.i.d sampled from true distribution q. Standard way of finding the parameters is ML. • Appling this to MRF → finding the MRF parameters(θ) that maximize the probability of S under the MRF distribution,p. L : Θ → R l(θ|S) = ΣN i=1ln(p(xi |θ)) for the Gibbs distribution of an MRF → Cannot find maximum analytically! Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 4 / 20
  • 15. Liklihood likilihood of MRF Liklihood • Training data S = {x1, x2, ..., xN}, i.i.d sampled from true distribution q. Standard way of finding the parameters is ML. • Appling this to MRF → finding the MRF parameters(θ) that maximize the probability of S under the MRF distribution,p. L : Θ → R l(θ|S) = ΣN i=1ln(p(xi |θ)) for the Gibbs distribution of an MRF → Cannot find maximum analytically! So using numerical approximation. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 4 / 20
  • 16. Liklihood KL divergence Liklihood KL of true distribution and MRF distribution: KL(q||p) = Σx q(x)ln(q(x)) − q(x)ln(p(x)) • KL comprises of entropy of q and expectation over q. Only latter depends on the paramter subject to optimization. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 5 / 20
  • 17. Liklihood KL divergence Liklihood KL of true distribution and MRF distribution: KL(q||p) = Σx q(x)ln(q(x)) − q(x)ln(p(x)) • KL comprises of entropy of q and expectation over q. Only latter depends on the paramter subject to optimization. • Maximizing likilihood → Minimizing the KL-divergence. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 5 / 20
  • 18. Optimization Gradient Ascent Optimization Iteratively updating the parameters from θt to θt+1 based on log-liklihood. θt+1 = θt + η ∂ ∂θt (ΣN i=1l(θ|xi )) Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 6 / 20
  • 19. Having Latent Variables Latent Variables Having Latent Variables • We want to model m-dimensional prob. distribution q(e.g an image with m pixels) Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 7 / 20
  • 20. Having Latent Variables Latent Variables Having Latent Variables • We want to model m-dimensional prob. distribution q(e.g an image with m pixels) • X=(V,H) is a set of all variables. V = (V1, V2, ..., Vm) → visibles units H = (H1, H2, ..., Hn) → hidden units, n = |V | − m Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 7 / 20
  • 21. Having Latent Variables Latent Variables Having Latent Variables • We want to model m-dimensional prob. distribution q(e.g an image with m pixels) • X=(V,H) is a set of all variables. V = (V1, V2, ..., Vm) → visibles units H = (H1, H2, ..., Hn) → hidden units, n = |V | − m • e.g V = set of all pixels, H= set of relationships between V units Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 7 / 20
  • 22. Having Latent Variables Latent Variables Having Latent Variables • We want to model m-dimensional prob. distribution q(e.g an image with m pixels) • X=(V,H) is a set of all variables. V = (V1, V2, ..., Vm) → visibles units H = (H1, H2, ..., Hn) → hidden units, n = |V | − m • e.g V = set of all pixels, H= set of relationships between V units • Our Gibbs distribution of visible units: p(v) = 1 Z Σhe−E(v,h) , Z = Σv,he−E(v,h) Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 7 / 20
  • 23. Having Latent Variables Log-Liklihood of MRF • Log-Liklihood for one sample: l(θ|v) = ln[Σhe−E(v,h) ] − ln[ Z Σv,he−E(v,h) ] Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 8 / 20
  • 24. Having Latent Variables Log-Liklihood of MRF • Log-Liklihood for one sample: l(θ|v) = ln[Σhe−E(v,h) ] − ln[ Z Σv,he−E(v,h) ] • Then its gradient is: θl = −Σhp(h|v) ∂E(v, h) ∂θ + Σv,hp(v, h) ∂E(v, h) ∂θ Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 8 / 20
  • 25. Having Latent Variables Log-Liklihood of MRF • Log-Liklihood for one sample: l(θ|v) = ln[Σhe−E(v,h) ] − ln[ Z Σv,he−E(v,h) ] • Then its gradient is: θl = −Σhp(h|v) ∂E(v, h) ∂θ + Σv,hp(v, h) ∂E(v, h) ∂θ • It is difference of two expectations: Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 8 / 20
  • 26. Having Latent Variables Log-Liklihood of MRF • Log-Liklihood for one sample: l(θ|v) = ln[Σhe−E(v,h) ] − ln[ Z Σv,he−E(v,h) ] • Then its gradient is: θl = −Σhp(h|v) ∂E(v, h) ∂θ + Σv,hp(v, h) ∂E(v, h) ∂θ • It is difference of two expectations: →One over conditional dist. of hidden units Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 8 / 20
  • 27. Having Latent Variables Log-Liklihood of MRF • Log-Liklihood for one sample: l(θ|v) = ln[Σhe−E(v,h) ] − ln[ Z Σv,he−E(v,h) ] • Then its gradient is: θl = −Σhp(h|v) ∂E(v, h) ∂θ + Σv,hp(v, h) ∂E(v, h) ∂θ • It is difference of two expectations: →One over conditional dist. of hidden units →One over model dist. • We have to sum over all possible values of (v,h) for this computation!! Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 8 / 20
  • 28. Having Latent Variables Log-Liklihood of MRF • Log-Liklihood for one sample: l(θ|v) = ln[Σhe−E(v,h) ] − ln[ Z Σv,he−E(v,h) ] • Then its gradient is: θl = −Σhp(h|v) ∂E(v, h) ∂θ + Σv,hp(v, h) ∂E(v, h) ∂θ • It is difference of two expectations: →One over conditional dist. of hidden units →One over model dist. • We have to sum over all possible values of (v,h) for this computation!! Instead approximating this expectation. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 8 / 20
  • 29. Markov Chain and Gibbs Sampling Markov Chain Markov Chain • Stationary Distribution: for distribution π for which it holds πT = πT P where P is transition matrix with pij as matrix elements. • Detailed Balance Condition: sufficient condition for π to be stationary distribution w.r.t pij , i, j ∈ Ω as transition probablities: π(i)pij = π(j)pji , ∀i, j ∈ Ω Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 9 / 20
  • 30. Markov Chain and Gibbs Sampling Gibbs Sampling Gibbs Sampling • MRF X = (X1, X2, ..., XN) for a graph G = (V , E) where V = {1, 2, ..., N}. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 10 / 20
  • 31. Markov Chain and Gibbs Sampling Gibbs Sampling Gibbs Sampling • MRF X = (X1, X2, ..., XN) for a graph G = (V , E) where V = {1, 2, ..., N}. • Xi , i ∈ V takes values in a finite set Λ. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 10 / 20
  • 32. Markov Chain and Gibbs Sampling Gibbs Sampling Gibbs Sampling • MRF X = (X1, X2, ..., XN) for a graph G = (V , E) where V = {1, 2, ..., N}. • Xi , i ∈ V takes values in a finite set Λ. • Time varing states X = {Xk |k ∈ N}, Xk = {Xk 1 , Xk 2 , ..., Xk N} Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 10 / 20
  • 33. Markov Chain and Gibbs Sampling Gibbs Sampling Gibbs Sampling • MRF X = (X1, X2, ..., XN) for a graph G = (V , E) where V = {1, 2, ..., N}. • Xi , i ∈ V takes values in a finite set Λ. • Time varing states X = {Xk |k ∈ N}, Xk = {Xk 1 , Xk 2 , ..., Xk N} • π(x) = 1 Z e−ε(x) is the joint probability distribution of X. Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 10 / 20
  • 44. Markov Chain and Gibbs Sampling Gibbs Algorithm
  Step 1: At each iteration pick a random variable X_i, i ∈ V, with probability q(i); q is a strictly positive distribution over V.
  Step 2: Sample a new value for X_i from its conditional distribution given the state of all other variables, i.e. π(X_i | (x_v)_{v ∈ V\{i}}) = π(X_i | (x_w)_{w ∈ N_i}), where N_i is the neighborhood of i.
  Step 3: Keep doing this!
  • The resulting transition probability for the MRF X between two states x and y is:
  p_xy = q(i) π(y_i | (x_v)_{v ∈ V\{i}}) if x and y differ only in the i-th element, i.e. ∃ i ∈ V with x_v = y_v for all v ≠ i;
  p_xy = 0 otherwise;
  p_xx = Σ_{i ∈ V} q(i) π(x_i | (x_v)_{v ∈ V\{i}}).
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 11 / 20
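A minimal Gibbs-sampling sketch following these three steps, for a toy binary MRF of my own choosing (the couplings, sizes and uniform q(i) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

W = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])     # symmetric pairwise couplings

def energy(x):
    return -0.5 * x @ W @ x          # epsilon(x); pi(x) proportional to exp(-epsilon(x))

def gibbs_step(x):
    i = rng.integers(len(x))         # Step 1: pick a site with q(i) = 1/N
    x0, x1 = x.copy(), x.copy()
    x0[i], x1[i] = 0.0, 1.0
    # Step 2: conditional pi(X_i = 1 | rest) from the energy difference
    p1 = 1.0 / (1.0 + np.exp(energy(x1) - energy(x0)))
    x[i] = float(rng.random() < p1)
    return x

x = rng.integers(0, 2, size=3).astype(float)
for _ in range(10_000):              # Step 3: keep doing this!
    x = gibbs_step(x)
```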
  • 51. Markov Chain and Gibbs Sampling Convergence of Gibbs
  • If π is strictly positive, the conditional distributions of the single variables are strictly positive. Every single variable X_i can then take every state x_i ∈ Λ in a single transition step → every state can reach any other state in a finite number of steps → the Markov chain is irreducible.
  • p_xx > 0 and p_xy > 0, together with the detailed balance condition → the Markov chain is aperiodic.
  Convergence: aperiodicity and irreducibility guarantee that the chain converges to the stationary distribution π.
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 12 / 20
  • 52. Restricted Boltzmann Machines RBM
  • p(v, h) = (1/Z) e^{-E(v,h)} with E(v, h) = -Σ_{i=1}^{n} Σ_{j=1}^{m} w_ij h_i v_j - Σ_{j=1}^{m} b_j v_j - Σ_{i=1}^{n} c_i h_i.
  • c_i and b_j are real-valued bias terms for the hidden and visible units, respectively.
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 13 / 20
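A minimal sketch of this energy function (the sizes and random parameters are illustrative assumptions, not values from the slides):

```python
import numpy as np

n, m = 4, 6                               # n hidden, m visible binary units
rng = np.random.default_rng(1)
W = 0.01 * rng.standard_normal((n, m))    # weights w_ij
b = np.zeros(m)                           # visible biases b_j
c = np.zeros(n)                           # hidden biases c_i

def energy(v, h):
    # E(v, h) = -sum_ij w_ij h_i v_j - sum_j b_j v_j - sum_i c_i h_i
    return -(h @ W @ v) - b @ v - c @ h

v = rng.integers(0, 2, size=m).astype(float)
h = rng.integers(0, 2, size=n).astype(float)
print(energy(v, h))
```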
  • 55. Restricted Boltzmann Machines RBM
  • The hidden variables are independent of each other given the visible units (and vice versa): p(h|v) = Π_{i=1}^{n} p(h_i|v) and p(v|h) = Π_{j=1}^{m} p(v_j|h).
  • Marginal distribution of the observations: p(v) = (1/Z) Π_{j=1}^{m} e^{b_j v_j} Π_{i=1}^{n} (1 + e^{c_i + Σ_{j=1}^{m} w_ij v_j}).
  • Conditional probabilities of single components: p(H_i = 1|v) = σ(Σ_{j=1}^{m} w_ij v_j + c_i) and p(V_j = 1|h) = σ(Σ_{i=1}^{n} w_ij h_i + b_j).
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 14 / 20
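A minimal sketch of these two conditional distributions and of block sampling from them (the parameters are passed in as arguments; the helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, c):
    # p(H_i = 1 | v) = sigma(sum_j w_ij v_j + c_i), one entry per hidden unit
    return sigmoid(W @ v + c)

def p_v_given_h(h, W, b):
    # p(V_j = 1 | h) = sigma(sum_i w_ij h_i + b_j), one entry per visible unit
    return sigmoid(W.T @ h + b)

def sample_bernoulli(p, rng):
    # sample every unit independently -- valid because of the factorization above
    return (rng.random(p.shape) < p).astype(float)
```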
  • 63. Restricted Boltzmann Machines Gradient of the Log-Likelihood
  • Recap: the gradient of the log-likelihood of an MRF for a single data point is ∇_θ l(θ|v) = -Σ_h p(h|v) ∂E(v,h)/∂θ + Σ_{v,h} p(v,h) ∂E(v,h)/∂θ.
  • The first term is tractable, e.g. for w_ij: -Σ_h p(h|v) ∂E(v,h)/∂w_ij = Σ_{h_i} Σ_{h_{-i}} [Π_{k=1}^{n} p(h_k|v)] h_i v_j = p(H_i = 1|v) v_j = σ(Σ_{j=1}^{m} w_ij v_j + c_i) v_j.
  • We can do the same for the second term, writing it as Σ_v p(v) Σ_h p(h|v) ∂E(v,h)/∂θ or Σ_h p(h) Σ_v p(v|h) ∂E(v,h)/∂θ.
  • It is still intractable: the remaining sum is exponential in the size of the smaller layer, i.e. 2^m or 2^n terms.
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 15 / 20
  • 67. Restricted Boltzmann Machines Gradient of the Log-Likelihood
  Computing the derivatives of the log-likelihood:
  • w.r.t. w_ij: ∂l(θ|v)/∂w_ij = p(H_i = 1|v) v_j - Σ_v p(v) p(H_i = 1|v) v_j
  • w.r.t. b_j: ∂l(θ|v)/∂b_j = v_j - Σ_v p(v) v_j
  • w.r.t. c_i: ∂l(θ|v)/∂c_i = p(H_i = 1|v) - Σ_v p(v) p(H_i = 1|v)
  • To avoid the summation over all possible values of v, we approximate the expectation (see the sketch below).
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 16 / 20
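Putting the three derivatives together, a minimal sketch (helper names are my own; it assumes the intractable sum over v has already been replaced by a single model sample v_model, e.g. obtained via the CD procedure on the next slide):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_gradients(v_data, v_model, W, b, c):
    # Positive phase uses the data vector; the negative phase uses one model
    # sample in place of the full sum over v weighted by p(v).
    ph_data = sigmoid(W @ v_data + c)     # p(H_i = 1 | v_data)
    ph_model = sigmoid(W @ v_model + c)   # p(H_i = 1 | v_model)
    dW = np.outer(ph_data, v_data) - np.outer(ph_model, v_model)
    db = v_data - v_model
    dc = ph_data - ph_model
    return dW, db, dc
```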
  • 72. Restricted Boltzmann Machines Approximating the RBM Log-Likelihood
  Contrastive Divergence
  • Run a Gibbs chain for only k steps (usually k = 1).
  • Starting from a training sample v^0, this yields a sample v^k after k steps.
  • Each step t consists of sampling h^t from p(h|v^t) and then sampling v^{t+1} from p(v|h^t).
  • Using these samples, the gradient approximation is CD_k(θ, v^0) = -Σ_h p(h|v^0) ∂E(v^0,h)/∂θ + Σ_h p(h|v^k) ∂E(v^k,h)/∂θ.
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 17 / 20
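A minimal CD-k sketch for a single training vector (binary units, parameter shapes as before; the helper names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(v0, W, b, c, k=1, rng=None):
    # Run k Gibbs steps starting from the training sample v0 and return v^k.
    rng = np.random.default_rng() if rng is None else rng
    v = v0.copy()
    for _ in range(k):
        ph = sigmoid(W @ v + c)                           # p(h | v^t)
        h = (rng.random(ph.shape) < ph).astype(float)     # h^t ~ p(h | v^t)
        pv = sigmoid(W.T @ h + b)                         # p(v | h^t)
        v = (rng.random(pv.shape) < pv).astype(float)     # v^{t+1} ~ p(v | h^t)
    return v
```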
  • 73. Restricted Boltzmann Machines Contrastive Divergence
  k-CD for a batch of training samples (the algorithm box from the slide is not reproduced in this export; see the sketch below).
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 18 / 20
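The following is only a sketch of what a batch k-CD parameter update typically looks like, not the slide's pseudocode; it reuses cd_k and rbm_gradients from the sketches above, and the learning rate and batch averaging are my own assumptions.

```python
import numpy as np

def cd_k_batch_update(batch, W, b, c, k=1, lr=0.1, rng=None):
    # Accumulate CD-k gradient estimates over a batch and take one ascent step.
    rng = np.random.default_rng() if rng is None else rng
    dW = np.zeros_like(W)
    db = np.zeros_like(b)
    dc = np.zeros_like(c)
    for v0 in batch:
        vk = cd_k(v0, W, b, c, k=k, rng=rng)        # negative-phase sample v^k
        gW, gb, gc = rbm_gradients(v0, vk, W, b, c)
        dW += gW; db += gb; dc += gc
    W = W + lr * dW / len(batch)                    # gradient ascent on the log-likelihood
    b = b + lr * db / len(batch)
    c = c + lr * dc / len(batch)
    return W, b, c
```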
  • 78. Restricted Boltzmann Machines Other Derivatives
  • Persistent CD (PCD) relies on the previous chain state for each parameter update: the v^k of the previous step is used to initialize the chain for the next step.
  • Fast PCD introduces a set of additional parameters used only for sampling, not for the model, to increase mixing speed.
  • Parallel Tempering: run k (usually k = 1) Gibbs sampling steps in each tempered Markov chain, yielding samples (v_1, h_1), ..., (v_M, h_M); then choose two consecutive temperatures and exchange the particles (v_r, h_r) and (v_{r-1}, h_{r-1}) with a given exchange probability.
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 19 / 20
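A hedged PCD sketch (my own illustration of the idea, not the slide's pseudocode): the negative-phase Gibbs chain is kept alive across parameter updates instead of being restarted at the current training vector.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PersistentChain:
    """Keeps the state of the negative-phase Gibbs chain between updates."""
    def __init__(self, m, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        self.v = self.rng.integers(0, 2, size=m).astype(float)   # persistent visible state

    def step(self, W, b, c, k=1):
        # k Gibbs steps starting from the *previous* chain state, not from v^0.
        for _ in range(k):
            ph = sigmoid(W @ self.v + c)
            h = (self.rng.random(ph.shape) < ph).astype(float)
            pv = sigmoid(W.T @ h + b)
            self.v = (self.rng.random(pv.shape) < pv).astype(float)
        return self.v                                             # negative-phase sample
```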
  • 79. Restricted Boltzmann Machines Other Derivatives Results
  (Figure not reproduced in this export.) Left: hidden sampling; right: visible sampling.
  Hadi Sinaee (PGM Seminar) RBM from Scratch May 17, 2015 20 / 20