Gradient Estimation Using Stochastic
Computation Graphs
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
May 9, 2017
Outline
Stochastic Gradient Estimation
Stochastic Computation Graphs
Stochastic Computation Graphs in RL
Other Examples of Stochastic Computation Graphs
Stochastic Gradient Estimation
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

An approximation of ∇_θ R(θ) can be used for:
Computational Finance
Reinforcement Learning
Variational Inference
Explaining the brain with neural networks
Stochastic Gradient Estimation
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)] = ∫_Ω f(θ, w) P_θ(dw)
Stochastic Gradient Estimation
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)] = ∫_Ω f(θ, w) P_θ(dw)

     = ∫_Ω f(θ, w) (P_θ(dw) / G(dw)) G(dw)
Stochastic Gradient Estimation
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)] = ∫_Ω f(θ, w) P_θ(dw)

     = ∫_Ω f(θ, w) (P_θ(dw) / G(dw)) G(dw)

∇_θ R(θ) = ∇_θ ∫_Ω f(θ, w) (P_θ(dw) / G(dw)) G(dw)

         = ∫_Ω ∇_θ( f(θ, w) P_θ(dw) / G(dw) ) G(dw)

         = E_{w∼G(dw)}[ ∇_θ f(θ, w) (P_θ(dw) / G(dw)) + f(θ, w) (∇_θ P_θ(dw) / G(dw)) ]
Stochastic Gradient Estimation
So far we have
∇_θ R(θ) = E_{w∼G(dw)}[ ∇_θ f(θ, w) (P_θ(dw) / G(dw)) + f(θ, w) (∇_θ P_θ(dw) / G(dw)) ]
Stochastic Gradient Estimation
So far we have
∇_θ R(θ) = E_{w∼G(dw)}[ ∇_θ f(θ, w) (P_θ(dw) / G(dw)) + f(θ, w) (∇_θ P_θ(dw) / G(dw)) ]

Letting G = P_θ0 and evaluating at θ = θ0, we get

∇_θ R(θ)|_{θ=θ0} = E_{w∼P_θ0(dw)}[ ∇_θ f(θ, w) (P_θ(dw) / P_θ0(dw)) + f(θ, w) (∇_θ P_θ(dw) / P_θ0(dw)) ]|_{θ=θ0}

                 = E_{w∼P_θ0(dw)}[ ∇_θ f(θ, w) + f(θ, w) ∇_θ ln P_θ(dw) ]|_{θ=θ0}
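This identity can be checked numerically. A minimal sketch, assuming (purely for illustration) f(θ, w) = θ·w² and w ∼ N(θ, 1), so that R(θ) = θ(θ² + 1) and the exact gradient is 3θ² + 1:

    import numpy as np

    rng = np.random.default_rng(0)
    theta0 = 0.7
    w = rng.normal(theta0, 1.0, size=1_000_000)     # w ~ P_theta0 = N(theta0, 1)

    # grad_theta f = w**2, and grad_theta ln p(w) = (w - theta0) for a unit-variance Gaussian
    estimate = np.mean(w**2 + theta0 * w**2 * (w - theta0))
    exact = 3 * theta0**2 + 1                       # d/dtheta [theta * (theta**2 + 1)]
    print(estimate, exact)                          # the two agree up to Monte Carlo error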
Stochastic Gradient Estimation
So far we have
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

∇_θ R(θ)|_{θ=θ0} = E_{w∼P_θ0(dw)}[ ∇_θ f(θ, w) + f(θ, w) ∇_θ ln P_θ(dw) ]|_{θ=θ0}
Stochastic Gradient Estimation
So far we have
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

∇_θ R(θ)|_{θ=θ0} = E_{w∼P_θ0(dw)}[ ∇_θ f(θ, w) + f(θ, w) ∇_θ ln P_θ(dw) ]|_{θ=θ0}

1. Assuming w fully determines f:

∇_θ R(θ)|_{θ=θ0} = E_{w∼P_θ0(dw)}[ f(w) ∇_θ ln P_θ(dw) ]|_{θ=θ0}

2. Assuming P_θ(dw) is independent of θ:

∇_θ R(θ) = E_{w∼P(dw)}[ ∇_θ f(θ, w) ]
Stochastic Gradient Estimation
SF vs PD
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

Score Function (SF) Estimator:

∇_θ R(θ)|_{θ=θ0} = E_{w∼P_θ0(dw)}[ f(w) ∇_θ ln P_θ(dw) ]|_{θ=θ0}

Pathwise Derivative (PD) Estimator:

∇_θ R(θ) = E_{w∼P(dw)}[ ∇_θ f(θ, w) ]
Stochastic Gradient Estimation
SF vs PD
R(θ) = E_{w∼P_θ(dw)}[f(θ, w)]

Score Function (SF) Estimator:

∇_θ R(θ)|_{θ=θ0} = E_{w∼P_θ0(dw)}[ f(w) ∇_θ ln P_θ(dw) ]|_{θ=θ0}

Pathwise Derivative (PD) Estimator:

∇_θ R(θ) = E_{w∼P(dw)}[ ∇_θ f(θ, w) ]

SF allows discontinuous f.
SF requires only sample values of f; PD requires the derivatives of f.
SF differentiates the probability distribution of w; PD differentiates the function f itself.
SF has high variance when w is high-dimensional.
PD has high variance when f is rough.
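The variance tradeoff is visible on a toy problem. A minimal sketch, assuming f(w) = w² and w ∼ N(θ, 1), so that E[f] = θ² + 1 and the exact gradient is 2θ; both estimators are unbiased here, but their per-sample variances differ sharply:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n = 1.5, 10_000
    eps = rng.normal(size=n)
    w = theta + eps                        # reparameterized draw, w ~ N(theta, 1)

    sf = w**2 * (w - theta)                # f(w) * grad_theta ln p(w | theta)
    pd = 2 * w                             # f'(w) * dw/dtheta, with dw/dtheta = 1

    print("exact:", 2 * theta)
    print("SF mean %.3f  std %.3f" % (sf.mean(), sf.std()))
    print("PD mean %.3f  std %.3f" % (pd.mean(), pd.std()))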
Stochastic Gradient Estimation
Choices for PD estimator
blog.shakirm.com/2015/10/machine-learning-trick-of-the-day-4-reparameterisation-tricks/
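One choice covered in the post above is inverse-CDF sampling, which writes the sample as a differentiable function of the parameter. A minimal sketch for an Exp(λ) distribution (an illustrative assumption, not from the slides): with u ∼ Uniform(0, 1), w = −ln(u)/λ, so ∂w/∂λ = −w/λ:

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 2.0
    u = rng.uniform(size=1_000_000)
    w = -np.log(u) / lam              # inverse CDF: w ~ Exp(lam), differentiable in lam

    pd_grad = np.mean(-w / lam)       # PD estimate of d/dlam E[w]
    print(pd_grad, -1 / lam**2)       # exact: E[w] = 1/lam, so d/dlam E[w] = -1/lam^2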
Outline
Stochastic Gradient Estimation
Stochastic Computation Graphs
Stochastic Computation Graphs in RL
Other Examples of Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation
Graphs (NIPS 2015)
John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel
Stochastic Computation Graphs
SF estimator:

∇_θ R(θ)|_{θ=θ0} = E_{x∼P_θ0(dx)}[ f(x) ∇_θ ln P_θ(dx) ]|_{θ=θ0}

PD estimator:

∇_θ R(θ) = E_{z∼P(dz)}[ ∇_θ f(x(θ, z)) ]
Stochastic Computation Graphs
Stochastic computation graphs are a design choice rather
than an intrinsic property of a problem
Stochastic nodes 'block' gradient flow
Stochastic Computation Graphs
Neural Networks
Neural Networks are a special case of stochastic computation
graphs
Stochastic Computation Graphs
Formal Definition
Stochastic Computation Graphs
Deriving Unbiased Estimators
Condition 1. (Differentiability Requirements) Given input node θ ∈ Θ, for all edges (v, w) which satisfy θ ≺^D v and θ ≺^D w, the following condition holds: if w is deterministic, ∂w/∂v exists, and if w is stochastic, (∂/∂v) p(w | PARENTS_w) exists.

Theorem 1. Suppose θ ∈ Θ satisfies Condition 1. The following equations hold:

∂/∂θ E[ Σ_{c∈C} c ] = E[ Σ_{w∈S : θ≺^D w} ( ∂/∂θ log p(w | DEPS_w) ) Q̂_w + Σ_{c∈C : θ≺^D c} ∂/∂θ c(DEPS_c) ]

                    = E[ ( Σ_{c∈C} ĉ ) Σ_{w∈S : θ≺^D w} ∂/∂θ log p(w | DEPS_w) + Σ_{c∈C : θ≺^D c} ∂/∂θ c(DEPS_c) ]
Stochastic Computation Graphs
Surrogate loss functions
We have this equation from Theorem 1:
∂/∂θ E[ Σ_{c∈C} c ] = E[ Σ_{w∈S : θ≺^D w} ( ∂/∂θ log p(w | DEPS_w) ) Q̂_w + Σ_{c∈C : θ≺^D c} ∂/∂θ c(DEPS_c) ]

Note that by defining

L(Θ, S) := Σ_{w∈S} log p(w | DEPS_w) Q̂_w + Σ_{c∈C} c(DEPS_c)

we have

∂/∂θ E[ Σ_{c∈C} c ] = E[ ∂/∂θ L(Θ, S) ]
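In an autodiff package, L(Θ, S) is exactly the scalar one implements and differentiates. A minimal PyTorch sketch, assuming a single Gaussian stochastic node with learnable mean θ and one downstream cost; Q̂_w is the sampled cost, held constant with detach():

    import torch

    theta = torch.tensor([0.5], requires_grad=True)
    dist = torch.distributions.Normal(theta, 1.0)

    w = dist.sample()                        # stochastic node: sample() blocks gradients
    cost = (w - 3.0)**2                      # downstream cost c(DEPS_c)

    # Surrogate loss: log p(w | DEPS_w) * Q_hat_w (no directly differentiable cost terms here)
    surrogate = (dist.log_prob(w) * cost.detach()).sum()
    surrogate.backward()                     # theta.grad is an SF estimate of grad E[cost]
    print(theta.grad)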
Stochastic Computation Graphs
Surrogate loss functions
Stochastic Computation Graphs
Implementations
Making stochastic computation graphs with neural network
packages is easy
tensorflow.org/api_docs/python/tf/contrib/bayesflow/stochastic_gradient_estimators
github.com/pytorch/pytorch/blob/6a69f7007b37bb36d8a712cdf5cebe3ee0cc1cc8/torch/autograd/
variable.py
github.com/blei-lab/edward/blob/master/edward/inferences/klqp.py
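PyTorch's torch.distributions, for example, exposes both estimators through the sampling call: sample() cuts the graph (pair it with log_prob for the SF estimator), while rsample() reparameterizes so gradients flow through the sample. A minimal sketch of the PD path:

    import torch

    mu = torch.tensor([0.0], requires_grad=True)
    sigma = torch.tensor([1.0], requires_grad=True)

    w = torch.distributions.Normal(mu, sigma).rsample()   # w = mu + sigma * eps
    loss = ((w - 3.0)**2).sum()
    loss.backward()
    print(mu.grad, sigma.grad)        # both nonzero: gradients flow through the sample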
Outline
Stochastic Gradient Estimation
Stochastic Computation Graphs
Stochastic Computation Graphs in RL
Other Examples of Stochastic Computation Graphs
MDP
The MDP formulation itself is a stochastic computation graph
Policy Gradient
Surrogate Loss
The policy gradient algorithm is Theorem 1 applied to the MDP graph.¹

¹ Richard S Sutton et al. “Policy Gradient Methods for Reinforcement Learning with Function Approximation”. In: NIPS (1999).
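A minimal sketch of the resulting surrogate loss for one sampled trajectory, with hypothetical names: log_probs holds log π_θ(a_t|s_t) still attached to the autodiff graph, and returns holds the sampled returns Q̂_t:

    import torch

    def policy_gradient_surrogate(log_probs, returns):
        # Theorem 1 on the MDP graph: sum_t log pi_theta(a_t | s_t) * Q_hat_t
        # (negated so that minimizing the surrogate ascends the expected return)
        return -(torch.stack(log_probs) * returns.detach()).sum()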
Stochastic Value Gradient
SVG(0)
Learn Q_φ, and update²

Δθ ∝ ∇_θ Σ_{t=1}^T Q_φ(s_t, π_θ(s_t, ε_t))

² Nicolas Heess et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. In: NIPS (2015).
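A minimal sketch of one SVG(0) actor loss, with hypothetical names: q_net is the learned Q_φ, and policy maps (state, noise) to an action differentiably in θ:

    import torch

    def svg0_actor_loss(policy, q_net, states, noise):
        actions = policy(states, noise)         # a_t = pi_theta(s_t, eps_t), reparameterized
        return -q_net(states, actions).sum()    # minimizing ascends sum_t Q_phi(s_t, a_t)

    # Calling .backward() on this loss sends PD gradients through Q_phi into pi_theta.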
Stochastic Value Gradient
SVG(1)
Learn V_φ and a dynamics model f such that s_{t+1} = f(s_t, a_t) + δ_t.
Given a transition (s_t, a_t, s_{t+1}), infer the noise δ_t.³

Δθ ∝ ∇_θ Σ_{t=1}^T [ r_t + γ V_φ(f(s_t, a_t) + δ_t) ]

³ Nicolas Heess et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. In: NIPS (2015).
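A minimal sketch of the SVG(1) objective on one observed transition, with hypothetical names: model is the learned dynamics f, v_net is V_φ; the inferred noise δ_t is held fixed while differentiating:

    import torch

    def svg1_objective(policy, model, v_net, s_t, noise, s_next, r_t, gamma=0.99):
        a_t = policy(s_t, noise)                       # reparameterized action
        delta = (s_next - model(s_t, a_t)).detach()    # infer noise from the real transition
        return r_t + gamma * v_net(model(s_t, a_t) + delta)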
Stochastic Value Gradient
SVG(∞)
Learn f, infer all noise variables δ_t. Differentiate through the whole graph.⁴

⁴ Nicolas Heess et al. “Learning Continuous Control Policies by Stochastic Value Gradients”. In: NIPS (2015).
Deterministic Policy Gradient
⁵ David Silver et al. “Deterministic Policy Gradient Algorithms”. In: ICML (2014).
Evolution Strategies
⁶ Tim Salimans et al. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”. In: arXiv (2017).
Outline
Stochastic Gradient Estimation
Stochastic Computation Graphs
Stochastic Computation Graphs in RL
Other Examples of Stochastic Computation Graphs
Neural Variational Inference⁷

Where

L(x, θ, φ) = E_{h∼Q(h|x)}[ log P_θ(x, h) − log Q_φ(h|x) ]

Score function estimator

⁷ Andriy Mnih and Karol Gregor. “Neural Variational Inference and Learning in Belief Networks”. In: ICML (2014).
Variational Autoencoder⁸

Pathwise derivative estimator applied to the exact same graph as NVIL

⁸ Diederik P Kingma and Max Welling. “Auto-Encoding Variational Bayes”. In: ICLR (2014).
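The point is visible in code: NVIL and the VAE optimize the same ELBO on the same graph, and only the gradient path through h differs. A minimal PyTorch sketch with a hypothetical encoder and decoder that return torch distributions; swapping rsample() for sample() plus a score-function surrogate recovers NVIL:

    import torch

    def elbo(x, encoder, decoder):
        q = encoder(x)                          # q_phi(h | x), e.g. a Normal distribution
        h = q.rsample()                         # PD estimator through the sample: VAE
        prior = torch.distributions.Normal(0.0, 1.0)
        log_joint = decoder(h).log_prob(x).sum() + prior.log_prob(h).sum()
        return log_joint - q.log_prob(h).sum()  # E_q[log P(x, h) - log Q(h | x)]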
DRAW⁹

Sequential version of VAE
Same graph as SVG(∞) with (known) deterministic dynamics (e = state, z = action, L = reward)
Similarly, VAE can be seen as SVG(∞) with one timestep and deterministic dynamics

⁹ Karol Gregor et al. “DRAW: A Recurrent Neural Network For Image Generation”. In: ICML (2015).
GAN¹⁰

Connection to SVG(0) has been established in [8]
The actor has no access to state. The observation is a random choice between a real and a fake image. The reward is 1 if the image is real, 0 if fake. D learns a value function of the observation. G’s learning signal is D.

¹⁰ David Pfau and Oriol Vinyals. “Connecting Generative Adversarial Networks and Actor-Critic Methods”. In: (2016).
Bayes by backprop¹¹

Our cost function f is:

f(D, θ) = E_{w∼q(w|θ)}[f(w, θ)]
        = E_{w∼q(w|θ)}[ −log p(D|w) + log q(w|θ) − log p(w) ]

From the point of view of stochastic computation graphs, weights and activations are no different
Estimate the ELBO gradient of activations with the PD estimator: VAE
Estimate the ELBO gradient of weights with the PD estimator: Bayes by Backprop

¹¹ Charles Blundell et al. “Weight Uncertainty in Neural Networks”. In: ICML (2015).
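A minimal sketch of that weight-level PD estimator, assuming a factorized Gaussian posterior q(w|θ) = N(μ, σ²) over one weight matrix, with σ = softplus(ρ) to keep the scale positive:

    import torch
    import torch.nn.functional as F

    mu = torch.randn(10, 1, requires_grad=True)
    rho = torch.zeros(10, 1, requires_grad=True)

    def sample_weights():
        eps = torch.randn_like(mu)
        return mu + F.softplus(rho) * eps    # w = mu + sigma * eps: reparameterized weights

    # Every forward pass draws fresh weights; gradients of the ELBO flow into mu and rho
    # through the sample, exactly as they flow into activations in a VAE.
    w = sample_weights()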
Summary
Stochastic computation graphs explain many different algorithms with one gradient estimator (Theorem 1)
Stochastic computation graphs give us insight into the behavior of estimators
Stochastic computation graphs reveal equivalences:
  NVIL [7] = Policy gradient [12]
  VAE [6] / DRAW [4] = SVG(∞) [5]
  GAN [3] = SVG(0) [5]
References I
[1] Charles Blundell et al. “Weight Uncertainty in Neural
Networks”. In: ICML (2015).
[2] Michael Fu. “Stochastic Gradient Estimation”. In: (2005).
[3] Ian J. Goodfellow et al. “Generative Adversarial Networks”.
In: NIPS (2014).
[4] Karol Gregor et al. “DRAW: A Recurrent Neural Network
For Image Generation”. In: ICML (2015).
[5] Nicolas Heess et al. “Learning Continuous Control Policies
by Stochastic Value Gradients”. In: NIPS (2015).
[6] Diederik P Kingma and Max Welling. “Auto-Encoding
Variational Bayes”. In: ICLR (2014).
[7] Andriy Mnih and Karol Gregor. “Neural Variational Inference
and Learning in Belief Networks”. In: ICML (2014).
References II
[8] David Pfau and Oriol Vinyals. “Connecting Generative
Adversarial Networks and Actor-Critic Methods”. In: (2016).
[9] Tim Salimans et al. “Evolution Strategies as a Scalable
Alternative to Reinforcement Learning”. In: arXiv (2017).
[10] John Schulman et al. “Gradient Estimation Using Stochastic
Computation Graphs”. In: NIPS (2015).
[11] David Silver et al. “Deterministic Policy Gradient
Algorithms”. In: ICML (2014).
[12] Richard S Sutton et al. “Policy Gradient Methods for
Reinforcement Learning with Function Approximation”. In:
NIPS (1999).
Thank You