Distributional Reinforcement Learning
via Moment Matching
(MMDQN)
*백승언, 주정헌, 박혜진
12 Feb, 2023
Contents
⚫ Introduction
▪ Estimation of the probability distribution
▪ Limitation of distribution estimation in conventional Reinforcement Learning (RL)
⚫ Distributional RL via Moment Matching (MMDQN)
▪ Backgrounds
▪ MMDQN
⚫ Experiment results
Introduction
Estimation of the probability distribution
⚫ Moments of a function are quantitative measures related to the shape of the function's graph
▪ A probability density function (PDF) is generally summarized by four measures: mean, variance, skewness, and kurtosis
• Mean: first moment of the PDF, $\mu = E[X]$
• Variance: second central moment of the PDF, $\sigma^2 = E[(X-\mu)^2]$, i.e. $\sigma = \left(E[(X-\mu)^2]\right)^{1/2}$
• Skewness: third standardized moment of the PDF, $\gamma = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$
• Kurtosis: fourth standardized moment of the PDF, $\mathrm{Kurt}[X] = \frac{E[(X-\mu)^4]}{\left(E[(X-\mu)^2]\right)^2}$
⚫ Estimation methods for a PDF $f(x)$ in machine learning
▪ Explicit methods
• Fix a set of predetermined statistics, each defined as a functional $\zeta: f(x) \mapsto \mathbb{R}$
– Median of $f(x)$: $F^{-1}\left(\frac{1}{2}\right)$; mean of $f(x)$: $\int_{\mathcal{X}} x f(x)\,dx$
• Estimate the predetermined statistics from collected data
▪ Implicit methods
• Estimate $f(x; \theta)$ directly from collected data (GAN, VAE, …)
[Figure: various shapes of the normal distribution depending on $\mu$, $\sigma$]
[Figure: various shapes of the distribution depending on $\gamma$, $\mathrm{Kurt}[X]$]
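The four moment formulas above can be checked numerically. The following is a minimal sketch, assuming NumPy and a hypothetical helper name `four_moments` (not from the slides); it uses the population (biased) forms of each formula exactly as written above.

```python
import numpy as np

def four_moments(x):
    """Return mean, variance, skewness, and kurtosis of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                              # first moment
    centered = x - mu
    var = np.mean(centered ** 2)               # second central moment (sigma^2)
    sigma = np.sqrt(var)
    skew = np.mean((centered / sigma) ** 3)    # third standardized moment
    kurt = np.mean(centered ** 4) / var ** 2   # fourth standardized moment
    return mu, var, skew, kurt

# Illustrative data: a normal sample with mu = 1 and sigma = 2.
rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=200_000)
mu, var, skew, kurt = four_moments(sample)
print(mu, var, skew, kurt)  # for a normal sample: roughly 1, 4, 0, 3
```

Library routines such as those in SciPy may apply bias corrections or report excess kurtosis (kurtosis minus 3), so their values can differ slightly from this sketch.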
Limitation of distribution estimation in conventional Reinforcement Learning
⚫ Naïve approaches to estimating the return
▪ Because the return $G = \sum_t \gamma^t r_{t+1}$ is stochastic, conventional RL approximates $G$ with $Q(s, a)$ and $V(s)$
• $V(s) = E\left[\sum_t \gamma^t r_{t+1} \mid s\right]$
• $Q(s, a) = E\left[\sum_t \gamma^t r_{t+1} \mid s, a\right]$
▪ These functions, $Q(s, a)$ and $V(s)$, are generally estimated with the standard Bellman operator $\mathcal{T}^B$, which uses only the mean value
• $(\mathcal{T}^B V)(s) = E[r(s, \pi(s))] + \gamma E[V(s')]$
• $(\mathcal{T}^B Q)(s, a) = E[r(s, a)] + \gamma E[Q(s', a')]$
Variance, skewness, and kurtosis of the return $G$ are easily neglected
[Figure: return distributions, probability vs. value. These distributions have the same mean, but they are not the same!]
[Figure: return distribution, probability vs. value. Complex distributions could be modeled in this framework!]
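To make the limitation concrete, here is a small sketch with two hypothetical return distributions (invented for illustration, not data from the slides). Both have the same expected return, so a scalar $Q$ cannot tell them apart, yet their shapes differ sharply.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical returns of two actions; both have expected return 0.
safe = rng.normal(loc=0.0, scale=0.1, size=n)                           # narrow, unimodal
risky = rng.choice([-1.0, 1.0], size=n) + rng.normal(0.0, 0.1, size=n)  # bimodal

q_safe, q_risky = safe.mean(), risky.mean()   # what a mean-only critic learns
print(q_safe, q_risky)                        # both close to 0
print(safe.std(), risky.std())                # about 0.1 versus about 1.0
```

An agent that only sees `q_safe` and `q_risky` treats the two actions as equivalent, while the full distributions reveal that one of them is far riskier.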
Distributional RL via Moment Matching
(MMDQN)
https://arxiv.org/abs/2007.12354
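Backgrounds (I)
⚫ Basics of probability space $(\Omega, \Sigma, P)$
▪ Sample space $\Omega$
• In a probability space, the set $\Omega$ is the set of all possible outcomes. Its elements $\omega$ are the individual outcomes, and $\Omega$ is called the sample space
▪ $\sigma$-algebra $\Sigma$
• A $\sigma$-algebra $\Sigma$ is a set of subsets of $\Omega$ s.t.:
– $\emptyset \in \Sigma$
– If $A \in \Sigma$, then $A^C \in \Sigma$
– If $A_1, A_2, \ldots \in \Sigma$, then $\bigcup_{i=1}^{\infty} A_i \in \Sigma$
– Examples: $\Sigma = \{\emptyset, \Omega\}$ is a $\sigma$-algebra; for any $E \subset \Omega$, $\Sigma = \{\emptyset, E, E^C, \Omega\}$ is a $\sigma$-algebra
• $B(\mathbb{R})$ is the smallest $\sigma$-algebra containing the open intervals $I = \{(a, b) \mid a, b \in \mathbb{R}, a < b\}$. It is called the Borel $\sigma$-algebra, and its elements are called Borel sets
▪ Random variable
• Let $(\Omega, \Sigma)$ and $(E, \mathcal{E})$ be measurable spaces and $f$ a function from $\Omega$ to $E$. The function $f$ is called a measurable function iff $f^{-1}(\mathcal{E}) \subset \Sigma$, i.e. $f^{-1}(Y) \in \Sigma$ for every $Y \in \mathcal{E}$
• If $f$ is measurable from $(\Omega, \Sigma)$ to $(E, \mathcal{E})$, it is called a Borel function or a random variable (RV)
[Figure: $f$ maps $\Omega$ to $E$; the preimage $f^{-1}(Y)$ of a set $Y \in \mathcal{E}$ belongs to $\Sigma$]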
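Backgrounds (II)
⚫ Basics of probability space $(\Omega, \Sigma, P)$
▪ Probability distribution
• Let $(\Omega, \Sigma, \mu)$ be a measure space and $f$ a measurable function from $(\Omega, \Sigma)$ to $(E, \mathcal{E})$. The pushforward measure, denoted by $f_\#\mu$, $f_\#(\mu)$, or $\mu \circ f^{-1}$, is a measure on $\mathcal{E}$ defined as
– $f_\#\mu(Y) = f_\#(\mu)(Y) = (\mu \circ f^{-1})(Y) = \mu(f \in Y) = \mu(f^{-1}(Y)), \quad Y \in \mathcal{E}$
• If $\mu = P$ is a probability measure and $f$ is a random variable, then $P \circ f^{-1}$ is called the distribution (or the law) of $f$ and is denoted by $P_X$
– $P_X$ is itself a probability measure, also called a probability distribution
– In practice, one often uses $(E, \mathcal{E}) = (\mathbb{R}^n, B(\mathbb{R}^n))$, so that $f_\#P: B(\mathbb{R}^n) \to [0, 1]$
[Figure: $f$ maps $\Omega$ to $E = \mathbb{R}$; $P$ acts on $\Sigma$, and the pushforward $f_\#P = P \circ f^{-1}$ acts on $\mathcal{E}$ through the preimage $f^{-1}(Y)$]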
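On a finite sample space the pushforward measure can be computed by hand. The sketch below uses a fair die as an illustrative probability space and a hypothetical helper `pushforward`; neither comes from the slides. It computes the law $P \circ f^{-1}$ by summing $P$ over each preimage.

```python
from fractions import Fraction

# Probability space: a fair die. P assigns 1/6 to each outcome.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

def f(w):
    """A random variable: 1 if the roll is at least 5, else 0."""
    return 1 if w >= 5 else 0

def pushforward(P, f):
    """Return the law of f, i.e. the map y -> P(f^{-1}({y}))."""
    law = {}
    for w, p in P.items():
        law[f(w)] = law.get(f(w), Fraction(0)) + p
    return law

law = pushforward(P, f)
print(law)  # {0: Fraction(2, 3), 1: Fraction(1, 3)}
```

Here $f^{-1}(\{1\}) = \{5, 6\}$, so the law assigns it $P(\{5, 6\}) = 1/3$, matching the definition $f_\#P(Y) = P(f^{-1}(Y))$.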
Backgrounds (III)
⚫ Distributional RL
▪ In distributional RL, the cumulative return of a chosen action at a state is modeled with its full distribution rather than only its expectation, $Z_\theta(s, a) := \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i(s, a)}$
• This lets the model capture the intrinsic randomness of the return instead of just its first-order moment (higher-order moments, multi-modality in the state-action value distribution)
▪ Existing distributional RL algorithms
• C51: fixes the supports of the return distribution in advance and then trains a categorical distribution over them
• QR-DQN: introduces quantile regression to avoid predetermined supports and to narrow the theory-practice gap
• IQN: samples quantile fractions to train the full quantile function, and takes the risk of the policy into account
• IDAC: trains the full return distribution directly with an adversarial network, and trains the policy with semi-implicit methods
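A Dirac mixture is just a set of $N$ equally weighted values per action. The following sketch uses hypothetical particle values for one state (not taken from the slides) to show how the usual $Q$-value is recovered as the mean of the mixture, and what extra information the particles carry.

```python
import numpy as np

# Hypothetical particle outputs for one state: shape (num_actions, N).
particles = np.array([
    [0.0, 0.0, 0.0, 0.0],    # action 0: deterministic return 0
    [-2.0, -1.0, 1.0, 3.0],  # action 1: spread-out return, mean 0.25
])

q_values = particles.mean(axis=1)       # first moment of each Dirac mixture
spread = particles.std(axis=1)          # information a scalar Q-value discards
greedy_action = int(q_values.argmax())  # greedy choice still uses the mean
print(q_values, spread, greedy_action)
```

Greedy action selection only needs `q_values`, but quantities such as `spread` are available for risk-aware variants because the whole distribution is represented.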
MMDQN (I)
⚫ Overview of the MMDQN
▪ Unlike existing distributional RL methods, MMDQN makes no assumption about predetermined statistics and can learn unrestricted statistics
• For implementation, however, deterministic pseudo-samples of the return distribution are learned by minimizing the MMD
– The authors use the Dirac mixture $\hat{\mu}_\theta(s, a) = \frac{1}{N} \sum_{i=1}^{N} \delta_{Z_\theta(s, a)_i}$ to approximate $\mu^\pi(s, a)$
▪ The authors analyze the distributional Bellman operator and establish sufficient conditions for its contraction in the MMD
• They show that $\mathcal{T}^\pi$ is a contraction when the kernel is the unrectified kernel $k_\alpha(x, y) := -\|x - y\|^\alpha, \ \forall \alpha \in \mathbb{R}, \ \forall x, y \in \mathcal{X}$
• For practical purposes, however, they demonstrate that the commonly used Gaussian kernel performs better in this framework: $k(x, y) := \exp\left(-\frac{(x - y)^2}{2\sigma^2}\right)$
▪ MMDQN with a Gaussian kernel mixture achieved state-of-the-art results on 55 Atari 2600 games
• For a fair comparison, they used the same architecture as DQN and QR-DQN
MMDQN (II)
⚫ Problem setting
▪ For any policy $\pi$, let $\mu^\pi = \mathrm{law}(Z^\pi)$ be the distribution of the return RV $Z^\pi(s, a) := \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)$
▪ $\mathcal{T}^\pi \mu(s, a) := \int_{\mathcal{S}} \int_{\mathcal{A}} \int_{\mathcal{X}} (f_{\gamma, r})_\# \mu(s', a') \, \mathcal{R}(dr \mid s, a) \, \pi(da' \mid s') \, P(ds' \mid s, a)$, where $f_{\gamma, r}(z) := r + \gamma z, \ \forall z$, and $(f_{\gamma, r})_\# \mu(s', a')$ is the pushforward measure of $\mu(s', a')$ by $f_{\gamma, r}$
⚫ Algorithmic approach
▪ The authors use the Dirac mixture $\hat{\mu}_\theta(s, a) = \frac{1}{N} \sum_{i=1}^{N} \delta_{Z_\theta(s, a)_i}$ to approximate $\mu^\pi(s, a)$
• They refer to the deterministic samples $Z_\theta(s, a)$ as particles
▪ The goal of the algorithm reduces to learning the particles $Z_\theta(s, a)$ so that they approximate $\mu^\pi(s, a)$
• To this end, the particles $Z_\theta(s, a)$ are deterministically evolved to minimize the MMD distance between the approximate distribution and its distributional Bellman target
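For a Dirac mixture, the pushforward by $f_{\gamma, r}$ simply moves every particle to $r + \gamma z$. A minimal sketch follows, assuming a single sampled transition; the helper name `bellman_target_particles` and the `done` flag are assumptions introduced here for illustration.

```python
import numpy as np

def bellman_target_particles(reward, gamma, next_particles, done=False):
    """Push next-state particles through f_{gamma,r}(z) = r + gamma * z."""
    next_particles = np.asarray(next_particles, dtype=float)
    if done:  # assumption: a terminal transition has no future return
        return np.full_like(next_particles, reward)
    return reward + gamma * next_particles

# Illustrative particles for (s', a') with r = 1 and gamma = 0.5.
target = bellman_target_particles(1.0, 0.5, [0.0, 2.0, 4.0])
print(target)  # [1. 2. 3.]
```

The resulting array plays the role of the target particles that the current particles $Z_\theta(s, a)$ are moved toward.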
MMDQN (II)
⚫ Maximum Mean Discrepancy (MMD)
▪ Let $\mathcal{F}$ be a Reproducing Kernel Hilbert Space (RKHS) associated with a continuous kernel $k(\cdot, \cdot)$ on $\mathcal{X}$
▪ The MMD between $p \in P(\mathcal{X})$ and $q \in P(\mathcal{X})$ is defined as
• $\mathrm{MMD}(p, q; \mathcal{F}) := \sup_{f \in \mathcal{F}: \|f\|_{\mathcal{F}} \le 1} \left( \mathbb{E}[f(Z)] - \mathbb{E}[f(W)] \right) = \left\| \int_{\mathcal{X}} k(x, \cdot) \, p(dx) - \int_{\mathcal{X}} k(x, \cdot) \, q(dx) \right\|_{\mathcal{F}} = \left( \mathbb{E}[k(Z, Z')] + \mathbb{E}[k(W, W')] - 2\mathbb{E}[k(Z, W)] \right)^{1/2}$, where $Z, Z' \sim p$ and $W, W' \sim q$ are mutually independent
▪ In practice, the MMD is estimated with the biased estimator $\mathrm{MMD}_b$ from empirical samples $\{z_i\}_{i=1}^{N} \sim p$ and $\{w_i\}_{i=1}^{M} \sim q$
• $\mathrm{MMD}_b^2(\{z_i\}, \{w_i\}; k) = \frac{1}{N^2} \sum_{i,j} k(z_i, z_j) + \frac{1}{M^2} \sum_{i,j} k(w_i, w_j) - \frac{2}{NM} \sum_{i,j} k(z_i, w_j)$
▪ The authors use the Gaussian kernel $k(x, y) = \exp\left(-\frac{(x - y)^2}{h}\right)$ in the objective $\mathrm{MMD}_b$
• They give the following intuition:
– The first term serves as a repulsive force that pushes the particles $\{Z_\theta(s, a)_i\}$ away from each other, preventing them from collapsing into a single mode
– The third term acts as an attractive force that pulls the particles $\{Z_\theta(s, a)_i\}$ closer to their target particles $\{\hat{\mathcal{T}} Z_i\}$
• They use the kernel mixture trick with $K$ kernels that differ in bandwidth $h$
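The biased estimator $\mathrm{MMD}_b^2$ maps directly onto pairwise kernel matrices. The sketch below assumes NumPy, scalar particles, and a mixture that averages Gaussian kernels over a few illustrative bandwidths; the bandwidth values and the choice to average rather than sum are assumptions made here, not settings from the slides.

```python
import numpy as np

def gaussian_mixture_kernel(x, y, bandwidths):
    """Mean over bandwidths h of exp(-(x - y)^2 / h), for all pairs."""
    d2 = (x[:, None] - y[None, :]) ** 2            # shape (len(x), len(y))
    return np.mean([np.exp(-d2 / h) for h in bandwidths], axis=0)

def mmd_b_squared(z, w, bandwidths=(1.0, 2.0, 4.0)):
    """Biased estimate of MMD^2 between samples z ~ p and w ~ q."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    k_zz = gaussian_mixture_kernel(z, z, bandwidths).mean()  # (1/N^2) sum k(z_i, z_j)
    k_ww = gaussian_mixture_kernel(w, w, bandwidths).mean()  # (1/M^2) sum k(w_i, w_j)
    k_zw = gaussian_mixture_kernel(z, w, bandwidths).mean()  # (1/NM) sum k(z_i, w_j)
    return k_zz + k_ww - 2.0 * k_zw

particles = np.array([0.0, 0.5, 1.0])
near = particles + 0.1   # slightly shifted copy
far = particles + 5.0    # strongly shifted copy
print(mmd_b_squared(particles, near) < mmd_b_squared(particles, far))  # True
```

Taking `.mean()` over a full kernel matrix is the same as the double sum divided by $N^2$, $M^2$, or $NM$, so the three lines correspond one-to-one to the three terms of the formula.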
MMDQN (III)
⚫ Pseudo code
▪ The hyperparameters are given as input, and the target network is initialized to match the main network
▪ At every step, a transition is sampled from the replay buffer
• In the control setting, the action is inferred from the policy ($\epsilon$-greedy)
• In policy evaluation, the action is selected with the estimated return distribution $\hat{\mu}(s', a')$
▪ For each of the $N$ particles, the target return value $\hat{\mathcal{T}} Z_i$ is computed
▪ The $\mathrm{MMD}_b$ objective is computed and backpropagated with SGD
[Figure: pseudo code of MMDQN]
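The update described above can be imitated on a toy problem. The sketch below is a simplification under stated assumptions: the particles are free parameters rather than network outputs, a single Gaussian bandwidth is used, and a finite-difference gradient with plain gradient descent stands in for backpropagation with SGD. All numeric values are illustrative.

```python
import numpy as np

def mmd_b_squared(z, w, h=1.0):
    """Biased MMD^2 estimate with a single Gaussian kernel of bandwidth h."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / h)
    return k(z, z).mean() + k(w, w).mean() - 2.0 * k(z, w).mean()

def numerical_grad(loss, z, eps=1e-5):
    """Central-difference gradient; a stand-in for backpropagation."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (loss(z + step) - loss(z - step)) / (2 * eps)
    return grad

reward, gamma = 1.0, 0.9
next_particles = np.array([0.0, 1.0, 2.0, 3.0])  # stand-in for target-network output
target = reward + gamma * next_particles         # fixed Bellman target particles
z = np.array([-0.3, -0.1, 0.1, 0.3])             # stand-in for current particles

loss = lambda p: mmd_b_squared(p, target)
initial_loss = loss(z)
for _ in range(500):
    z = z - 0.2 * numerical_grad(loss, z)        # plain gradient descent step
final_loss = loss(z)
print(initial_loss, final_loss)                  # the loss decreases
```

In an actual agent the gradient would flow into the network weights $\theta$ that produce the particles, and the target particles would come from a separate target network held fixed during the step.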
Experiment Results
Experiment results (I)
⚫ Comparison with previous methods
▪ Experiments show the superior performance of MMDQN compared to previous methods on 55 Atari 2600 games (OpenAI Gym env)
• DQN
• C51
• RAINBOW
• FQF
• PRIOR
• QR-DQN
• IQN
[Table: median and mean of the best *HN scores]
[Table: median and mean of the *HN scores]
*HN score: human-normalized score
➔ First group
➔ Second group
Experiment results (II)
⚫ Comparison with the previous method
▪ Experiments show the superior performance of MMDQN compared to the previous method (QR-DQN) on 55 Atari 2600 games (OpenAI Gym env)
• MMDQN (3 seeds)
• QR-DQN (2 seeds)
[Figure: online training curves for MMDQN and QR-DQN]
Experiment results (III)
⚫ Ablation study
▪ Two sets of ablation studies were performed to answer the following questions
• (a): Which kernel gives MMDQN the best performance?
– A mixture of Gaussian kernels with different bandwidths gives the best performance
• (b): How many particles give MMDQN the best performance?
– Using more than 50 particles gives better performance, and using 200 particles exhibits the most stable performance
[Figure: the sensitivity of MMDQN in the 6 tuning games w.r.t. (a) the kernel choice and (b) the number of particles $N$]
Thank you!
Q&A
Appendix
Backgrounds
⚫ Basics of measure theory
▪ Measurable space $(X, \Sigma)$
• A pair $(X, \Sigma)$ is a measurable space if $X$ is a set and $\Sigma$ is a nonempty $\sigma$-algebra of subsets of $X$
• A measurable space allows us to define a function that assigns real values to the abstract elements of $\Sigma$
▪ Measure $\mu$
• Let $(X, \Sigma)$ be a measurable space. A set function $\mu$ defined on $\Sigma$ is called a measure iff it has the following properties
– $0 \le \mu(A) \le \infty$ for any $A \in \Sigma$
– $\mu(\emptyset) = 0$
– For any sequence of pairwise disjoint sets $\{A_n\} \subset \Sigma$, so that $\bigcup_{n=1}^{\infty} A_n \in \Sigma$, we have $\mu\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} \mu(A_n)$
▪ Measure space
• A triplet $(X, \Sigma, \mu)$ is a measure space if $(X, \Sigma)$ is a measurable space and $\mu: \Sigma \to [0, \infty]$ is a measure
• If $\mu(X) = 1$, then $\mu$ is a probability measure, usually written $P$, and the measure space is a probability space
Backgrounds
⚫ Basics of measure theory
▪ Measurable function
• Let $(\Omega, \Sigma)$ and $(\Lambda, G)$ be measurable spaces and $f$ a function from $\Omega$ to $\Lambda$. The function $f$ is called a measurable function from $(\Omega, \Sigma)$ to $(\Lambda, G)$ iff $f^{-1}(G) \subset \Sigma$
▪ Random variable
• A random variable $X$ is a measurable function from the probability space $(\Omega, \Sigma, P)$ into the probability space $(\mathcal{X}, B_{\mathcal{X}}, P_{\mathcal{X}})$, where $\mathcal{X} \subseteq \mathbb{R}$ is the range of $X$, $B_{\mathcal{X}}$ is the Borel $\sigma$-algebra on $\mathcal{X}$, and $P_{\mathcal{X}}$ is the probability measure (distribution) on $\mathcal{X}$
– Specifically, $X: \Omega \to \mathcal{X}$

Contenu connexe

Similaire à Distributional RL via Moment Matching

Metrics for generativemodels
Metrics for generativemodelsMetrics for generativemodels
Metrics for generativemodelsDai-Hai Nguyen
 
Super resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun YooSuper resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun YooJaeJun Yoo
 
Machine learning ppt and presentation code
Machine learning ppt and presentation codeMachine learning ppt and presentation code
Machine learning ppt and presentation codesharma239172
 
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain AdaptationAdversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain Adaptationtaeseon ryu
 
Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!ChenYiHuang5
 
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentationEleni Stamatelou
 
Jsai final final final
Jsai final final finalJsai final final final
Jsai final final finaldinesh malla
 
IJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsIJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsAkisato Kimura
 
Paper Study: Transformer dissection
Paper Study: Transformer dissectionPaper Study: Transformer dissection
Paper Study: Transformer dissectionChenYiHuang5
 
Intro. to computational Physics ch2.pdf
Intro. to computational Physics ch2.pdfIntro. to computational Physics ch2.pdf
Intro. to computational Physics ch2.pdfJifarRaya
 
Fortran chapter 2.pdf
Fortran chapter 2.pdfFortran chapter 2.pdf
Fortran chapter 2.pdfJifarRaya
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxSeungeon Baek
 
Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr taeseon ryu
 
Koh_Liang_ICML2017
Koh_Liang_ICML2017Koh_Liang_ICML2017
Koh_Liang_ICML2017Masa Kato
 
NS-CUK Seminar: H.E.Lee, Review on "Gated Graph Sequence Neural Networks", I...
NS-CUK Seminar: H.E.Lee,  Review on "Gated Graph Sequence Neural Networks", I...NS-CUK Seminar: H.E.Lee,  Review on "Gated Graph Sequence Neural Networks", I...
NS-CUK Seminar: H.E.Lee, Review on "Gated Graph Sequence Neural Networks", I...ssuser4b1f48
 

Similaire à Distributional RL via Moment Matching (20)

Seminar9
Seminar9Seminar9
Seminar9
 
Metrics for generativemodels
Metrics for generativemodelsMetrics for generativemodels
Metrics for generativemodels
 
Super resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun YooSuper resolution in deep learning era - Jaejun Yoo
Super resolution in deep learning era - Jaejun Yoo
 
Machine learning ppt and presentation code
Machine learning ppt and presentation codeMachine learning ppt and presentation code
Machine learning ppt and presentation code
 
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain AdaptationAdversarial Reinforced Learning for Unsupervised Domain Adaptation
Adversarial Reinforced Learning for Unsupervised Domain Adaptation
 
Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!Paper study: Attention, learn to solve routing problems!
Paper study: Attention, learn to solve routing problems!
 
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
 
04 Multi-layer Feedforward Networks
04 Multi-layer Feedforward Networks04 Multi-layer Feedforward Networks
04 Multi-layer Feedforward Networks
 
Time series clustering presentation
Time series clustering presentationTime series clustering presentation
Time series clustering presentation
 
Jsai final final final
Jsai final final finalJsai final final final
Jsai final final final
 
IJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphsIJCAI13 Paper review: Large-scale spectral clustering on graphs
IJCAI13 Paper review: Large-scale spectral clustering on graphs
 
Paper Study: Transformer dissection
Paper Study: Transformer dissectionPaper Study: Transformer dissection
Paper Study: Transformer dissection
 
Cs36565569
Cs36565569Cs36565569
Cs36565569
 
Intro. to computational Physics ch2.pdf
Intro. to computational Physics ch2.pdfIntro. to computational Physics ch2.pdf
Intro. to computational Physics ch2.pdf
 
Fortran chapter 2.pdf
Fortran chapter 2.pdfFortran chapter 2.pdf
Fortran chapter 2.pdf
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptxvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pptx
 
Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr Deep learning paper review ppt sourece -Direct clr
Deep learning paper review ppt sourece -Direct clr
 
Koh_Liang_ICML2017
Koh_Liang_ICML2017Koh_Liang_ICML2017
Koh_Liang_ICML2017
 
NS-CUK Seminar: H.E.Lee, Review on "Gated Graph Sequence Neural Networks", I...
NS-CUK Seminar: H.E.Lee,  Review on "Gated Graph Sequence Neural Networks", I...NS-CUK Seminar: H.E.Lee,  Review on "Gated Graph Sequence Neural Networks", I...
NS-CUK Seminar: H.E.Lee, Review on "Gated Graph Sequence Neural Networks", I...
 
Fa18_P1.pptx
Fa18_P1.pptxFa18_P1.pptx
Fa18_P1.pptx
 

Plus de taeseon ryu

OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...taeseon ryu
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splattingtaeseon ryu
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptxtaeseon ryu
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정taeseon ryu
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdftaeseon ryu
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories taeseon ryu
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extractiontaeseon ryu
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learningtaeseon ryu
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Modelstaeseon ryu
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuningtaeseon ryu
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdftaeseon ryu
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithmtaeseon ryu
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networkstaeseon ryu
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarizationtaeseon ryu
 
ProximalPolicyOptimization
ProximalPolicyOptimizationProximalPolicyOptimization
ProximalPolicyOptimizationtaeseon ryu
 

Plus de taeseon ryu (20)

VoxelNet
VoxelNetVoxelNet
VoxelNet
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
 
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdfLLaMA Open and Efficient Foundation Language Models - 230528.pdf
LLaMA Open and Efficient Foundation Language Models - 230528.pdf
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 
mPLUG
mPLUGmPLUG
mPLUG
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
 
ProximalPolicyOptimization
ProximalPolicyOptimizationProximalPolicyOptimization
ProximalPolicyOptimization
 

Dernier

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 

Dernier (20)

100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 

Distributional RL via Moment Matching

  • 1. Distributional Reinforcement Learning via Moment Matching (MMDQN). *백승언, 주정헌, 박혜진. 12 Feb, 2023
  • 2. Contents
      ⚫ Introduction
        ▪ Estimation of the probability distribution
        ▪ Limitation of distribution estimation in conventional Reinforcement Learning (RL)
      ⚫ Distributional RL via Moment Matching (MMDQN)
        ▪ Backgrounds
        ▪ MMDQN
      ⚫ Experiment results
  • 4. Estimation of the probability distribution
      ⚫ Moments of a function are quantitative measures related to the shape of the function's graph
        ▪ A probability density function (PDF) is commonly summarized by four measures: mean, variance, skewness, and kurtosis
          • Mean: first moment, $\mu = E[X]$
          • Variance: second central moment, $\sigma^2 = E[(X-\mu)^2]$
          • Skewness: third standardized moment, $\gamma = E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$
          • Kurtosis: fourth standardized moment, $\mathrm{Kurt}[X] = \frac{E[(X-\mu)^4]}{\left(E[(X-\mu)^2]\right)^2}$
      ⚫ Estimation methods of a PDF $f(x)$ in machine learning
        ▪ Explicit methods
          • Choose predetermined statistics, each a functional $\zeta: f(x) \mapsto \mathbb{R}$
            – Median of $f(x)$: $F^{-1}(\frac{1}{2})$; mean of $f(x)$: $\int_{\mathcal{X}} x f(x)\,dx$
          • Estimate the predetermined statistics from accumulated data
        ▪ Implicit methods
          • Estimate $f(x;\theta)$ directly from accumulated data (GAN, VAE, …)
      Figures: shapes of the normal distribution for different $\mu, \sigma$; distribution shapes for different $\gamma$, $\mathrm{Kurt}[X]$
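As a rough illustration of the four measures on this slide, the sketch below estimates them from a list of samples using only the standard library. The function name `moments` and the sample values are hypothetical, chosen for illustration; the estimators divide by the sample count, which is one common convention among several.

```python
import math

def moments(samples):
    """Return (mean, std, skewness, kurtosis) of a list of numbers.

    Uses the population (divide-by-n) convention for every moment.
    """
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n       # second central moment
    sigma = math.sqrt(var)
    skew = sum(((x - mu) / sigma) ** 3 for x in samples) / n
    kurt = sum((x - mu) ** 4 for x in samples) / n / var ** 2
    return mu, sigma, skew, kurt

# Hypothetical symmetric sample: skewness should vanish.
mu, sigma, skew, kurt = moments([1.0, 2.0, 3.0, 4.0, 5.0])
```

The kurtosis here is the non-excess form given on the slide, so a normal distribution would score about 3 rather than 0.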
  • 5. Limitation of distribution estimation in conventional Reinforcement Learning
      ⚫ Naïve approaches to estimating the return
        ▪ Because the return $G = \sum_t \gamma^t r_{t+1}$ is stochastic, conventional RL approximates $G$ by its expectation, $Q(s,a)$ or $V(s)$
          • $V(s) = E\left[\sum_t \gamma^t r_{t+1} \mid s\right]$
          • $Q(s,a) = E\left[\sum_t \gamma^t r_{t+1} \mid s, a\right]$
        ▪ $Q(s,a)$ and $V(s)$ are generally estimated with the standard Bellman operator $\mathcal{T}_B$, which uses only the mean value
          • $(\mathcal{T}_B V)(s) = E[r(s,\pi(s))] + \gamma E[V(s')]$
          • $(\mathcal{T}_B Q)(s,a) = E[r(s,a)] + \gamma E[Q(s',a')]$
      Variance, skewness, and kurtosis of the return $G$ are easily neglected
      Figure (return distribution; axes: value [-], probability [-]): these distributions have the same mean, but they are not the same!
      Figure (return distribution; axes: value [-], probability [-]): complex distributions could be modeled in this framework!
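A toy example may make the limitation concrete. The two return samples below are invented for illustration: they share the same mean, so an expectation-based value estimate cannot tell them apart, yet their variances differ widely.

```python
from statistics import mean, pvariance

# Hypothetical return samples for two actions (not from the slides).
safe_returns = [1.0, 1.0, 1.0, 1.0]
risky_returns = [-9.0, 11.0, -9.0, 11.0]

# A mean-only critic sees identical values for both actions.
same_mean = mean(safe_returns) == mean(risky_returns)

# The second moment separates them.
variances = (pvariance(safe_returns), pvariance(risky_returns))
```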
  • 6. Distributional RL via Moment Matching (MMDQN) https://arxiv.org/abs/2007.12354
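  • 7. Backgrounds (I)
      ⚫ Basics of probability space $(\Omega, \Sigma, P)$
        ▪ Sample space $\Omega$
          • In a probability space, $\Omega$ is the set of all possible outcomes $\omega$ and is called the sample space
        ▪ $\sigma$-algebra $\Sigma$
          • A $\sigma$-algebra $\Sigma$ is a collection of subsets $A$ of $\Omega$ such that:
            – $\emptyset \in \Sigma$
            – If $A \in \Sigma$, then $A^C \in \Sigma$
            – If $A_1, A_2, \ldots \in \Sigma$, then $\bigcup_{i=1}^{\infty} A_i \in \Sigma$
            – Examples: $\{\emptyset, \Omega\}$ is a $\sigma$-algebra; for any $E \subset \Omega$, $\{\emptyset, E, E^C, \Omega\}$ is a $\sigma$-algebra
        ▪ Random variable
          • $B(\mathbb{R})$ is the smallest $\sigma$-algebra containing every open interval. It is called the Borel $\sigma$-algebra, and its members are Borel sets
            – $B(\mathbb{R}) = \sigma(I)$, where $I = \{(a,b) \mid a,b \in \mathbb{R},\ a < b\}$
          • Let $(\Omega,\Sigma)$ and $(E,\mathcal{E})$ be measurable spaces and $f$ a function from $\Omega$ to $E$. The function $f$ is called a measurable function iff $f^{-1}(\mathcal{E}) \subset \Sigma$, that is, $f^{-1}(Y) \in \Sigma$ for every $Y \in \mathcal{E}$
          • If $f$ is measurable from $(\Omega,\Sigma)$ to $(E,\mathcal{E})$, it is called a Borel function or a random variable (RV)
      Figure: $f$ maps $\Omega$ to $E$; the preimage $f^{-1}(Y)$ of a set $Y \in \mathcal{E}$ belongs to $\Sigma$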
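On a finite sample space the three $\sigma$-algebra conditions can be checked mechanically. The sketch below is only an illustration under that finiteness assumption (countable unions then reduce to pairwise unions); the helper name `is_sigma_algebra` and the example sets are hypothetical.

```python
from itertools import combinations

def is_sigma_algebra(omega, collection):
    """Check the sigma-algebra axioms for a finite collection of subsets."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if any(not s <= omega for s in sets):
        return False                      # every member must be a subset of omega
    if frozenset() not in sets:
        return False                      # must contain the empty set
    if any(omega - s not in sets for s in sets):
        return False                      # closed under complement
    # For a finite collection, closure under pairwise union suffices.
    return all(a | b in sets for a, b in combinations(sets, 2))

omega = {1, 2, 3, 4}
trivial_ok = is_sigma_algebra(omega, [set(), omega])
generated_ok = is_sigma_algebra(omega, [set(), {1, 2}, {3, 4}, omega])
missing_complement = is_sigma_algebra(omega, [set(), {1}, omega])
```

The last call fails because the complement $\{2, 3, 4\}$ of $\{1\}$ is absent from the collection.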
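  • 8. Backgrounds (II)
      ⚫ Basics of probability space $(\Omega, \Sigma, P)$
        ▪ Probability distribution
          • Let $(\Omega,\Sigma,\mu)$ be a measure space and $f$ a measurable function from $(\Omega,\Sigma)$ to $(E,\mathcal{E})$. The pushforward measure, denoted $f_\#\mu$ or $\mu \circ f^{-1}$, is a measure on $\mathcal{E}$ defined as
            – $f_\#\mu(Y) = (\mu \circ f^{-1})(Y) = \mu(f \in Y) = \mu(f^{-1}(Y)),\quad Y \in \mathcal{E}$
          • If $\mu = P$ is a probability measure and $f$ is a random variable, then $P \circ f^{-1}$ is called the distribution (or the law) of $f$; for a random variable written $X$ it is denoted $P_X$
            – $P$ is referred to as a probability measure or probability distribution
            – In practice, one often uses $(E,\mathcal{E}) = (\mathbb{R}^n, B(\mathbb{R}^n))$, so that $f_\#P: B(\mathbb{R}^n) \to [0,1]$
      Figure: $f$ maps $\Omega$ to $E$ and $f^{-1}$ pulls $Y \in \mathcal{E}$ back to $f^{-1}(Y) \in \Sigma$; $f_\#P = P \circ f^{-1}$ assigns $Y$ its probability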
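For a finite sample space the pushforward can be computed by summing the probability of each outcome into the value it maps to. The following sketch assumes a fair six-sided die and a hypothetical random variable that flags outcomes of at least 5; neither comes from the slides.

```python
from fractions import Fraction

def pushforward(P, f):
    """Law of f under P on a finite space: (f#P)({y}) = P(f^{-1}({y}))."""
    law = {}
    for omega, p in P.items():
        y = f(omega)
        law[y] = law.get(y, Fraction(0)) + p
    return law

# Probability measure of a fair die (assumed example).
P = {omega: Fraction(1, 6) for omega in range(1, 7)}

# Random variable: 1 if the roll is 5 or 6, else 0.
law = pushforward(P, lambda omega: int(omega >= 5))
```

The resulting dictionary is the distribution of the random variable: it is again a probability measure, now on the value space $\{0, 1\}$.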
  • 9. Backgrounds (III)
      ⚫ Distributional RL
        ▪ In distributional RL, the cumulative return of a chosen action at a state is modeled with its full distribution rather than only its expectation, $Z_\theta(s,a) := \frac{1}{N}\sum_{i=1}^{N} \delta_{\theta_i(s,a)}$
          • The model can thus capture the intrinsic randomness of the return (higher-order moments, multi-modality of the state-action value) instead of just the first-order moment
        ▪ Existing distributional RL algorithms
          • C51: predetermines the supports of the return distribution and then trains a categorical distribution over them
          • QR-DQN: introduces quantile regression to avoid predetermined supports and to narrow the theory-practice gap
          • IQN: samples quantile fractions to train the full quantile function, and considers the risk of the policy
          • IDAC: trains the full return distribution directly with an adversarial network, and trains the policy with semi-implicit methods
  • 10. MMDQN (I)
      ⚫ Overview of the MMDQN
        ▪ Unlike existing distributional RL methods, MMDQN makes no assumption about predetermined statistics and can learn unrestricted statistics
          • For implementation, however, deterministic pseudo-samples of the return distribution are learned with MMD
            – The authors use the Dirac mixture $\hat{\mu}_\theta(s,a) = \frac{1}{N}\sum_{i=1}^{N} \delta_{Z_\theta(s,a)_i}$ to approximate $\mu^\pi(s,a)$
        ▪ The authors analyze the distributional Bellman operator and establish sufficient conditions for it to be a contraction in the MMD
          • They show that $\mathcal{T}^\pi$ is a contraction when the kernel is the unrectified kernel $k_\alpha(x,y) := -\|x-y\|^\alpha,\ \forall \alpha \in \mathbb{R},\ \forall x,y \in \mathcal{X}$
          • For practical purposes, however, they demonstrate that the commonly used Gaussian kernel $k(x,y) := \exp\left(-\frac{(x-y)^2}{2\sigma^2}\right)$ performs better in this framework
        ▪ MMDQN with a Gaussian kernel mixture achieved state-of-the-art performance on 55 Atari 2600 games
          • For a fair comparison, they used the same architecture as DQN and QR-DQN
  • 11. MMDQN (II)
      ⚫ Problem setting
        ▪ For any policy $\pi$, let $\mu^\pi = \mathrm{law}(Z^\pi)$ be the distribution of the return RV $Z^\pi(s,a) := \sum_{t=0}^{\infty} \gamma^t R(s_t,a_t)$
        ▪ $\mathcal{T}^\pi\mu(s,a) := \int_{\mathcal{S}}\int_{\mathcal{A}}\int_{\mathcal{X}} (f_{\gamma,r})_\#\mu(s',a')\,\mathcal{R}(dr \mid s,a)\,\pi(da' \mid s')\,P(ds' \mid s,a)$, where $f_{\gamma,r}(z) := r + \gamma z,\ \forall z$, and $(f_{\gamma,r})_\#\mu(s',a')$ is the pushforward measure of $\mu(s',a')$ by $f_{\gamma,r}$
      ⚫ Algorithmic approach
        ▪ The authors use the Dirac mixture $\hat{\mu}_\theta(s,a) = \frac{1}{N}\sum_{i=1}^{N} \delta_{Z_\theta(s,a)_i}$ to approximate $\mu^\pi(s,a)$
          • They refer to the deterministic samples $Z_\theta(s,a)$ as particles
        ▪ The goal of the algorithm reduces to learning the particles $Z_\theta(s,a)$ so that they approximate $\mu^\pi(s,a)$
          • To this end, the particles $Z_\theta(s,a)$ are deterministically evolved to minimize the MMD distance between the approximate distribution and its distributional Bellman target
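For a Dirac mixture, pushing the measure through $f_{\gamma,r}(z) = r + \gamma z$ amounts to applying the map to each particle. The sketch below shows this for a single sampled transition; the function names and the numbers are illustrative assumptions, and choosing the next action and handling terminal states are deliberately left out.

```python
def particle_mean(particles):
    """Mean of a Dirac mixture with equal weights, i.e. the implied Q-value."""
    return sum(particles) / len(particles)

def bellman_target_particles(reward, gamma, next_particles):
    """Apply f_{gamma,r}(z) = r + gamma * z to every next-state particle."""
    return [reward + gamma * z for z in next_particles]

# Hypothetical particles Z(s', a') for the chosen next action.
next_particles = [0.0, 2.0, 4.0]
targets = bellman_target_particles(reward=1.0, gamma=0.5,
                                   next_particles=next_particles)
```

Taking the mean of the target particles recovers the usual expected Bellman backup, $r + \gamma\,\mathrm{mean}(Z(s',a'))$, while the spread of the particles is preserved up to the factor $\gamma$.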
  • 12. MMDQN (II)
      ⚫ Maximum Mean Discrepancy (MMD)
        ▪ Let $\mathcal{F}$ be a Reproducing Kernel Hilbert Space (RKHS) associated with a continuous kernel $k(\cdot,\cdot)$ on $\mathcal{X}$
        ▪ The MMD between $p \in P(\mathcal{X})$ and $q \in P(\mathcal{X})$ is defined as
          • $\mathrm{MMD}(p,q;\mathcal{F}) := \sup_{f \in \mathcal{F}: \|f\| \le 1} \left(\mathbb{E}[f(Z)] - \mathbb{E}[f(W)]\right) = \left\|\int_{\mathcal{X}} k(x,\cdot)\,p(dx) - \int_{\mathcal{X}} k(x,\cdot)\,q(dx)\right\|_{\mathcal{F}} = \left(\mathbb{E}[k(Z,Z')] + \mathbb{E}[k(W,W')] - 2\mathbb{E}[k(Z,W)]\right)^{1/2}$, where $Z, Z' \sim p$ and $W, W' \sim q$ are mutually independent
        ▪ In practice, MMD is estimated with the biased estimator $\mathrm{MMD}_b$ from empirical samples $\{z_i\}_{i=1}^{N} \sim p$ and $\{w_i\}_{i=1}^{M} \sim q$
          • $\mathrm{MMD}_b^2(\{z_i\},\{w_i\};k) = \frac{1}{N^2}\sum_{i,j} k(z_i,z_j) + \frac{1}{M^2}\sum_{i,j} k(w_i,w_j) - \frac{2}{NM}\sum_{i,j} k(z_i,w_j)$
        ▪ The authors use the Gaussian kernel $k(x,y) = \exp\left(-\frac{(x-y)^2}{h}\right)$ with bandwidth $h$ in the objective $\mathrm{MMD}_b$
          • They give the following intuition:
            – The first term serves as a repulsive force that pushes the particles $\{Z_\theta(s,a)_i\}$ away from each other, preventing them from collapsing into a single mode
            – The third term acts as an attractive force that pulls the particles $\{Z_\theta(s,a)_i\}$ closer to their target particles $\{\hat{\mathcal{T}}Z_i\}$
          • They use the kernel mixture trick with $K$ kernels that differ in bandwidth $h$
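The biased estimator on this slide translates almost line by line into code. The sketch below uses plain Python lists of scalar particles and the Gaussian kernel with bandwidth $h$. Function names are hypothetical, and combining the $K$ kernels by summing their $\mathrm{MMD}_b^2$ values is an assumption made here for illustration, not something the slide specifies.

```python
import math

def gaussian_kernel(x, y, h):
    """k(x, y) = exp(-(x - y)^2 / h) for scalars x, y and bandwidth h."""
    return math.exp(-((x - y) ** 2) / h)

def mmd2_biased(z, w, h=1.0):
    """Biased squared-MMD estimate between particle lists z and w."""
    n, m = len(z), len(w)
    zz = sum(gaussian_kernel(a, b, h) for a in z for b in z) / n ** 2
    ww = sum(gaussian_kernel(a, b, h) for a in w for b in w) / m ** 2
    zw = sum(gaussian_kernel(a, b, h) for a in z for b in w) / (n * m)
    return zz + ww - 2.0 * zw   # repulsive term + target term - attractive term

def mmd2_mixture(z, w, bandwidths):
    """Sum of biased estimates over several bandwidths (assumed combination)."""
    return sum(mmd2_biased(z, w, h) for h in bandwidths)

particles = [0.0, 1.0, 2.0]
targets = [0.5, 1.5, 2.5]
loss = mmd2_mixture(particles, targets, bandwidths=[1.0, 2.0, 4.0])
```

In an actual agent the particles would be network outputs and the loss would be differentiated with an automatic-differentiation library; this version only evaluates the objective.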
  • 13. MMDQN (III)
      ⚫ Pseudo code
        ▪ Hyperparameters are taken as input, and the target network is initialized to match the main network
        ▪ At every step, a transition is sampled from the replay buffer
          • In the policy evaluation setting, the next action is sampled from the policy $\pi(\cdot \mid s')$
          • In the control setting, the next action is selected greedily with respect to the mean of the estimated return distribution $\hat{\mu}(s',a')$
        ▪ For each of the $N$ particles, the target return value $\hat{\mathcal{T}}Z_i$ is computed
        ▪ The $\mathrm{MMD}_b$ objective is computed and backpropagated with SGD
      Figure: pseudo code of MMDQN
  • 15. Experiment results (I)
      ⚫ Comparison with previous methods
        ▪ Experiments show that MMDQN outperforms previous methods on 55 Atari 2600 games (OpenAI Gym env)
          • DQN
          • C51
          • RAINBOW
          • FQF
          • PRIOR
          • QR-DQN
          • IQN
      Tables: median and mean of the best HN scores; median and mean of the HN scores (HN score: Human Normalized score). Labels: first group, second group
  • 16. Experiment results (II)
      ⚫ Comparison with the previous method
        ▪ Experiments show that MMDQN outperforms the previous method (QR-DQN) on 55 Atari 2600 games (OpenAI Gym env)
      Figure: online training curves for MMDQN (3 seeds) and QR-DQN (2 seeds)
  • 17. Experiment results (III)
      ⚫ Ablation study
        ▪ Two sets of ablation studies were performed to answer the following questions
          • (a): Which kernel gives MMDQN the best performance?
            – A mixture of Gaussian kernels with different bandwidths gives the best performance
          • (b): What number of particles gives MMDQN the best performance?
            – Using more than 50 particles gives better performance, and using 200 particles exhibits the most stable performance
      Figure: the sensitivity of MMDQN in the 6 tuning games w.r.t. (a) the kernel choice and (b) the number of particles $N$
  • 21. Backgrounds
      ⚫ Basics of measure theory
        ▪ Measurable space $(X, \Sigma)$
          • A pair $(X, \Sigma)$ is a measurable space if $X$ is a set and $\Sigma$ is a nonempty $\sigma$-algebra of subsets of $X$
          • A measurable space allows us to define a function that assigns real values to the abstract elements of $\Sigma$
        ▪ Measure $\mu$
          • Let $(X, \Sigma)$ be a measurable space. A set function $\mu$ defined on $\Sigma$ is called a measure iff it has the following properties
            – $0 \le \mu(A) \le \infty$ for any $A \in \Sigma$
            – $\mu(\emptyset) = 0$
            – For any sequence of pairwise disjoint sets $\{A_n\} \subset \Sigma$, we have $\mu\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} \mu(A_n)$
        ▪ Measure space
          • A triplet $(X, \Sigma, \mu)$ is a measure space if $(X, \Sigma)$ is a measurable space and $\mu: \Sigma \to [0, \infty]$ is a measure
          • If $\mu(X) = 1$, then $\mu$ is a probability measure, usually denoted $P$, and the measure space is a probability space
  • 22. Backgrounds
      ⚫ Basics of measure theory
        ▪ Measurable function
          • Let $(\Omega, \Sigma)$ and $(\Lambda, G)$ be measurable spaces and $f$ a function from $\Omega$ to $\Lambda$. The function $f$ is called a measurable function from $(\Omega, \Sigma)$ to $(\Lambda, G)$ iff $f^{-1}(G) \subset \Sigma$
        ▪ Random variable
          • A random variable $X$ is a measurable function from the probability space $(\Omega, \Sigma, P)$ into the probability space $(\mathcal{X}, B_{\mathcal{X}}, P_{\mathcal{X}})$, where $\mathcal{X} \subset \mathbb{R}$ is the range of $X$, $B_{\mathcal{X}}$ is the Borel $\sigma$-algebra on $\mathcal{X}$, and $P_{\mathcal{X}}$ is the probability measure (distribution) on $\mathcal{X}$
            – Specifically, $X: \Omega \to \mathcal{X}$