SlideShare une entreprise Scribd logo
1  sur  39
Télécharger pour lire hors ligne
Dueling Network Architectures for Deep
Reinforcement Learning (ICML 2016)
Yoonho Lee
Department of Computer Science and Engineering
Pohang University of Science and Technology
October 11, 2016
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Definition of RL
general setting of RL
Definition of RL
atari setting
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Mathematical formulations
objective of RL
Definition
Return Gt is the cumulative discounted reward from time t
Gt = rt+1 + γrt+2 + γ2
rt+3 + . . .
Mathematical formulations
objective of RL
Definition
Return Gt is the cumulative discounted reward from time t
Gt = rt+1 + γrt+2 + γ2
rt+3 + . . .
Definition
A policy π is a function that selects actions given states
π(s) = a
The goal of RL is to find π that maximizes G0
Mathematical formulations
Q-Value
Gt =
∞
i=0
γk
rt+i+1
Definition
The action-value (Q-value) function Qπ(s, a) is the expectation of
Gt under taking action a, and then following policy π afterwards
Qπ(s, a) = Eπ[Gt|St = s, At = a, At+i = π(St+i )∀i ∈ N]
”How good is action a in state s?”
Mathematical formulations
Optimal Q-value
Definition
The optimal Q-value function Q∗(s, a) is the maximum Q-value
over all policies
Q∗(s, a) = max
π
Qπ(s, a)
Mathematical formulations
Optimal Q-value
Definition
The optimal Q-value function Q∗(s, a) is the maximum Q-value
over all policies
Q∗(s, a) = max
π
Qπ(s, a)
Theorem
There exists a policy π∗ such that Qπ∗ (s, a) = Q∗(s, a) for all s, a
Thus, it suffices to find Q∗
Mathematical formulations
Q-Learning
Bellman Optimality Equation
Q∗(s, a) satisfies the following equation:
Q∗(s, a) = R(s, a) + γ
s
P(s |s, a)max
a
Q∗(s , a )
Mathematical formulations
Q-Learning
Bellman Optimality Equation
Q∗(s, a) satisfies the following equation:
Q∗(s, a) = R(s, a) + γ
s
P(s |s, a)max
a
Q∗(s , a )
Q-Learning
Let a be -greedy w.r.t. Q, and a be optimal w.r.t. Q. Q
converges to Q∗ if we iteratively apply the following update:
Q(s, a) ← α(R(s, a) + γQ(s , a )) + (1 − α)Q(s, a)
Mathematical formulations
Other approcahes to RL
Value-Based RL
Estimate Q∗(s, a)
Deep Q Network
Policy-Based RL
Search directly for optimal policy π∗
DDPG, TRPO. . .
Model-Based RL
Use the (learned or given) transition model of environment
Tree Search, DYNA . . .
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Neural Fitted Q Iteration
Supervised learning on (s,a,r,s’)
Neural Fitted Q Iteration
input
Neural Fitted Q Iteration
target equation
Neural Fitted Q Iteration
Shortcomings
Exploration is independent of experience
Exploitation does not occur at all
Policy evaluation does not occur at all
Even in exact(table lookup) case, not guaranteed to converge
to Q∗
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Deep Q Network
action policy
Choose an -greedy policy w.r.t. Q
Deep Q Network
network freezing
Get a copy of the network every C steps for stability
Deep Q Network
Deep Q Network
Deep Q Network
performance
Deep Q Network
overoptimism
What happens when we overestimate Q?
Deep Q Network
Problems
Overestimation of the Q function at any s spills over to
actions that lead to s → Double DQN
Sampling transitions uniformly from D is inefficient →
Prioritized Experience Replay
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Double Deep Q Network
target
We can write the DQN target as:
Y DQN
t = Rt+1 + γQ(St+1, argmax
a
Q(St+1, a; θ−
t ); θ−
t )
Double DQN’s target is:
Y DDQN
t = Rt+1 + γQ(St+1, argmax
a
Q(St+1, a; θt); θ−
t )
This has the effect of decoupling action selection and action
evaluation
Double Deep Q Network
performance
Much stabler with very little change in code
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Prioritized Experience Replay
Update ’more surprising’ experiences more frequently
Outline
Reinforcement Learning
Definition of RL
Mathematical formulations
Algorithms
Neural Fitted Q Iteration
Deep Q Network
Double Deep Q Network
Prioritized Experience Replay
Dueling Network
Dueling Network
The scalar approximates V , and the vector approximates A
Dueling Network
forward pass equation
The exact forward pass equation is:
Q(s, a; θ, α, β) = V (s; θ, β) + (A(s, a; θ, α) − max
a
A(s, a ; θ, α))
The following module was found to be more stable without losing
much performance:
Q(s, a; θ, α, β) = V (s; θ, β) + (A(s, a; θ, α) −
1
|A|
a
A(s, a ; θ, α))
Dueling Network
attention
Advantage network only pays attention when current action
crucial
Dueling Network
performance
Achieves state of the art in the Atari domain among DQN
algorithms
Dueling Network
Summary
Since this is an improvement only in network architecture,
methods that improve DQN(e.g. Double DQN) are all
applicable here as well
Solves problem of V and A typically being of different scale
Updates Q values more frequently than a single-stream DQN,
where only a single Q value is updated for each observation
Implicitly splits the credit assignment problem into a recursive
binary problem of “now or later”
Dueling Network
Shortcomings
Only works for |A| < ∞
Still not able to solve tasks involving long-term planning
Better than DQN, but sample complexity is still high
-greedy exploration is essentially random guessing

Contenu connexe

Tendances

강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction
Taehoon Kim
 

Tendances (20)

Deep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQNDeep sarsa, Deep Q-learning, DQN
Deep sarsa, Deep Q-learning, DQN
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
RLCode와 A3C 쉽고 깊게 이해하기
RLCode와 A3C 쉽고 깊게 이해하기RLCode와 A3C 쉽고 깊게 이해하기
RLCode와 A3C 쉽고 깊게 이해하기
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
deep reinforcement learning with double q learning
deep reinforcement learning with double q learningdeep reinforcement learning with double q learning
deep reinforcement learning with double q learning
 
Policy Gradient Theorem
Policy Gradient TheoremPolicy Gradient Theorem
Policy Gradient Theorem
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
AI - Introduction to Bellman Equations
AI - Introduction to Bellman EquationsAI - Introduction to Bellman Equations
AI - Introduction to Bellman Equations
 
Intro to Reinforcement learning - part III
Intro to Reinforcement learning - part IIIIntro to Reinforcement learning - part III
Intro to Reinforcement learning - part III
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction강화 학습 기초 Reinforcement Learning an introduction
강화 학습 기초 Reinforcement Learning an introduction
 
Imitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCSImitation Learning for Autonomous Driving in TORCS
Imitation Learning for Autonomous Driving in TORCS
 
Deep Q-learning explained
Deep Q-learning explainedDeep Q-learning explained
Deep Q-learning explained
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017
 
안.전.제.일. 강화학습!
안.전.제.일. 강화학습!안.전.제.일. 강화학습!
안.전.제.일. 강화학습!
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Q-learning and Deep Q Network (Reinforcement Learning)
Q-learning and Deep Q Network (Reinforcement Learning)Q-learning and Deep Q Network (Reinforcement Learning)
Q-learning and Deep Q Network (Reinforcement Learning)
 

Similaire à Dueling Network Architectures for Deep Reinforcement Learning

Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
Kim Hammar
 
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Naoki (Neo) SATO
 

Similaire à Dueling Network Architectures for Deep Reinforcement Learning (20)

Double Q-learning Paper Reading
Double Q-learning Paper ReadingDouble Q-learning Paper Reading
Double Q-learning Paper Reading
 
Automated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jecturesAutomated Security Response through Online Learning with Adaptive Con jectures
Automated Security Response through Online Learning with Adaptive Con jectures
 
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
pycon2018 "RL Adventure : DQN 부터 Rainbow DQN까지"
 
Hands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonHands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in Python
 
Practical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlowPractical Reinforcement Learning with TensorFlow
Practical Reinforcement Learning with TensorFlow
 
TensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and TricksTensorFlow and Deep Learning Tips and Tricks
TensorFlow and Deep Learning Tips and Tricks
 
Distributed Deep Q-Learning
Distributed Deep Q-LearningDistributed Deep Q-Learning
Distributed Deep Q-Learning
 
learned optimizer.pptx
learned optimizer.pptxlearned optimizer.pptx
learned optimizer.pptx
 
DDPG algortihm for angry birds
DDPG algortihm for angry birdsDDPG algortihm for angry birds
DDPG algortihm for angry birds
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
 
RLTopics_2021_Lect1.pdf
RLTopics_2021_Lect1.pdfRLTopics_2021_Lect1.pdf
RLTopics_2021_Lect1.pdf
 
Codes and Isogenies
Codes and IsogeniesCodes and Isogenies
Codes and Isogenies
 
First Place Memocode'14 Design Contest Entry
First Place Memocode'14 Design Contest EntryFirst Place Memocode'14 Design Contest Entry
First Place Memocode'14 Design Contest Entry
 
Playing Atari with Deep Reinforcement Learning
Playing Atari with Deep Reinforcement LearningPlaying Atari with Deep Reinforcement Learning
Playing Atari with Deep Reinforcement Learning
 
Learning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifoldLearning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifold
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del Pra
 
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
Backpropagation (DLAI D3L1 2017 UPC Deep Learning for Artificial Intelligence)
 
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
Deep Learning, Microsoft Cognitive Toolkit (CNTK) and Azure Machine Learning ...
 
Safety Verification of Deep Neural Networks_.pdf
Safety Verification of Deep Neural Networks_.pdfSafety Verification of Deep Neural Networks_.pdf
Safety Verification of Deep Neural Networks_.pdf
 
Value Function Approximation via Low-Rank Models
Value Function Approximation via Low-Rank ModelsValue Function Approximation via Low-Rank Models
Value Function Approximation via Low-Rank Models
 

Plus de Yoonho Lee

Plus de Yoonho Lee (12)

Meta-learning and the ELBO
Meta-learning and the ELBOMeta-learning and the ELBO
Meta-learning and the ELBO
 
On First-Order Meta-Learning Algorithms
On First-Order Meta-Learning AlgorithmsOn First-Order Meta-Learning Algorithms
On First-Order Meta-Learning Algorithms
 
New Insights and Perspectives on the Natural Gradient Method
New Insights and Perspectives on the Natural Gradient MethodNew Insights and Perspectives on the Natural Gradient Method
New Insights and Perspectives on the Natural Gradient Method
 
Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace
Gradient-Based Meta-Learning with Learned Layerwise Metric and SubspaceGradient-Based Meta-Learning with Learned Layerwise Metric and Subspace
Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace
 
Parameter Space Noise for Exploration
Parameter Space Noise for ExplorationParameter Space Noise for Exploration
Parameter Space Noise for Exploration
 
Meta Learning Shared Hierarchies
Meta Learning Shared HierarchiesMeta Learning Shared Hierarchies
Meta Learning Shared Hierarchies
 
Continuous Adaptation via Meta Learning in Nonstationary and Competitive Envi...
Continuous Adaptation via Meta Learning in Nonstationary and Competitive Envi...Continuous Adaptation via Meta Learning in Nonstationary and Competitive Envi...
Continuous Adaptation via Meta Learning in Nonstationary and Competitive Envi...
 
The Predictron: End-to-end Learning and Planning
The Predictron: End-to-end Learning and PlanningThe Predictron: End-to-end Learning and Planning
The Predictron: End-to-end Learning and Planning
 
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep NetworksModel-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
 
Modular Multitask Reinforcement Learning with Policy Sketches
Modular Multitask Reinforcement Learning with Policy SketchesModular Multitask Reinforcement Learning with Policy Sketches
Modular Multitask Reinforcement Learning with Policy Sketches
 
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Evolution Strategies as a Scalable Alternative to Reinforcement LearningEvolution Strategies as a Scalable Alternative to Reinforcement Learning
Evolution Strategies as a Scalable Alternative to Reinforcement Learning
 
Gradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation GraphsGradient Estimation Using Stochastic Computation Graphs
Gradient Estimation Using Stochastic Computation Graphs
 

Dernier

Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
ZurliaSoop
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Hung Le
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
amilabibi1
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
David Celestin
 

Dernier (17)

My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait Cityin kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
 
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
Jual obat aborsi Jakarta 085657271886 Cytote pil telat bulan penggugur kandun...
 
Introduction to Artificial intelligence.
Introduction to Artificial intelligence.Introduction to Artificial intelligence.
Introduction to Artificial intelligence.
 
Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
 
Zone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptxZone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptx
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 
ICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdfICT role in 21st century education and it's challenges.pdf
ICT role in 21st century education and it's challenges.pdf
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
Bring back lost lover in USA, Canada ,Uk ,Australia ,London Lost Love Spell C...
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
 

Dueling Network Architectures for Deep Reinforcement Learning