Hi
1
Deep Q-Network
Deep Reinforcement Learning
Azzeddine CHENINE
@azzeddineCH_
2
15 October 2019
3
Q-Learning
Quick Reminder
4
Quick Reminder
Learning by experience
5
Quick Reminder
Learning by experience
Episodic tasks and continuing tasks
6
Quick Reminder
Learning by experience
Episodic tasks and continuing tasks
An action yields a reward and results in a new state
7
Quick Reminder
Value-based methods try to reach an optimal policy by using a value function
8
Quick Reminder
Value-based methods try to reach an optimal policy by using a value function
The Q function of an optimal policy maximizes the expected return G
9
Quick Reminder
Value-based methods try to reach an optimal policy by using a value function
The Q function of an optimal policy maximizes the expected return G
Epsilon-greedy policies address the exploration-exploitation problem
10
Limitations of Q-Learning
11
Limitation
Not practical for environments with a large or infinite state or action space:
Games
Stock market
Robots
12
Modern problems require modern solutions
13
Deep Q-Networks
DeepMind - 2015 -
14
No more given definitions…
15
No more given definitions…
figuring it out by ourselves
16
1 - How to replace the Q-table?
Neural networks can be trained to fit a linear or a non-linear function.
17
1 - How to replace the Q-table?
Neural networks can be trained to fit a linear or a non-linear function.
Replace the Q-value function with a deep neural network
18
1 - How to replace the Q-table?
Neural networks can be trained to fit a linear or a non-linear function.
Replace the Q-value function with a deep neural network
The output layer holds the Q-value of each possible action ⟹ the action space must be discrete
19
1 - How to replace the Q-table?
Neural networks can be trained to fit a linear or a non-linear function.
Replace the Q-value function with a deep neural network
The output layer holds the Q-value of each possible action ⟹ the action space must be discrete
The input layer takes the desired representation of the state ⟹ a pipeline of convolutions, a Dense layer, or a sequence of LSTM/RNN cells
20
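The slides describe the network only in words; as a minimal sketch (not from the deck), here is such a Q-network in PyTorch, assuming a flat state vector and a discrete action space. The names QNetwork, state_size, action_size and hidden_size are illustrative.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state representation to one Q-value per possible (discrete) action."""
    def __init__(self, state_size, action_size, hidden_size=64):
        super().__init__()
        # The input layer takes the chosen state representation (here a flat vector;
        # for pixels it could be a pipeline of convolutions instead).
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        # The output layer holds the Q-value of each possible action.
        self.out = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.out(x)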
1 - How to replace the Q-table ✅
21
1 - How to replace the Q-table ✅
2 - How to optimize the model
How did we use to update the Q-table?
The agent took an action (a) in a state (S)
22
1 - How to replace the Q-table ✅
2 - How to optimize the model
How did we use to update the Q-table?
The agent took an action (a) in a state (S)
The agent moved to the next state (S') and received a reward (r)
23
1 - How to replace the Q-table ✅
2 - How to optimize the model
How did we use to update the Q-table?
The agent took an action (a) in a state (S)
The agent moved to the next state (S') and received a reward (r)
The discount rate γ reflects how much we care about the future reward
γ ≈ 1 ⟹ we care more about future rewards
γ ≈ 0 ⟹ we care more about immediate rewards
24
1 - How to replace the Q-table ✅
2 - How to optimize the model
How did we use to update the Q-table?
Q'(S, a) ⟸ Q(S, a) + α [ r(S, a, S') + γ max Q(S', a') - Q(S, a) ]
A good Q-value for an action (a) given a state (S) is a good approximation of the expected discounted cumulative reward.
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
25
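As a quick illustration of the tabular update rule above (not part of the deck), here is a one-step sketch in Python; the problem size and the values of α and γ are made up:

import numpy as np

n_states, n_actions = 16, 4                  # hypothetical tiny problem, e.g. a small grid world
Q = np.zeros((n_states, n_actions))          # the Q-table
alpha, gamma = 0.1, 0.99                     # learning rate and discount rate (illustrative)

def q_learning_update(s, a, r, s_next):
    # TD target: r(S, a, S') + γ max Q(S', a')
    td_target = r + gamma * np.max(Q[s_next])
    # Move Q(S, a) a step of size alpha toward that target.
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example: one transition (state 0, action 2, reward 1.0, next state 5).
q_learning_update(0, 2, 1.0, 5)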
1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
State → Q-network → Q-values
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
26
1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
State → Q-network → Q-values → Action selection → Action → Environment → New State S', Reward R
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
27
1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
State → Q-network → Q-values → Action selection → Action → Environment → New State S', Reward R
⨁ compute the target r(S, a, S') + γ max Q(S', a')
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
28
1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
State → Q-network → Q-values → Action selection → Action → Environment → New State S', Reward R
⨁ compute the target r(S, a, S') + γ max Q(S', a') and the prediction Q(S, a)
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
29
1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
State → Q-network → Q-values → Action selection → Action → Environment → New State S', Reward R
⨁ compute the target r(S, a, S') + γ max Q(S', a') and the prediction Q(S, a)
Loss → Optimizer
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
30
1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
State → Q-network → Q-values → Action selection → Action → Environment → New State S', Reward R
⨁ compute the target r(S, a, S') + γ max Q(S', a') and the prediction Q(S, a)
Loss → Optimizer → New network weights
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
31
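A hedged sketch of the optimization step drawn on these slides, assuming the QNetwork class sketched after slide 20 and mini-batches of transitions as tensors (states, long-typed action indices of shape [batch, 1], rewards, next states and done flags). Names and hyperparameters are illustrative, not from the deck:

import torch
import torch.nn.functional as F

q_network = QNetwork(state_size=8, action_size=4)          # hypothetical sizes
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
gamma = 0.99

def optimize_step(states, actions, rewards, next_states, dones):
    # Prediction Q(S, a) for the actions that were actually taken.
    q_pred = q_network(states).gather(1, actions)
    # Target r(S, a, S') + γ max Q(S', a'), computed from the same network
    # (which is exactly the "moving target" problem raised on the next slides).
    with torch.no_grad():
        q_next = q_network(next_states).max(dim=1, keepdim=True).values
        q_target = rewards + gamma * q_next * (1 - dones)
    loss = F.mse_loss(q_pred, q_target)      # loss between prediction and target
    optimizer.zero_grad()
    loss.backward()                          # backpropagate
    optimizer.step()                         # new network weights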
1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
⟹ Q(S, a) ≈ r(S, a, S') + γ max Q(S', a')
For n episodes: State → Action → New state → Update weights
32
We are chasing a moving target
33
The prediction Q(S, a) chases the moving target r(S, a, S') + γ max Q(S', a')
34
The prediction Q(S, a) chases the moving target r(S, a, S') + γ max Q(S', a')
35
The prediction Q(S, a) chases the moving target r(S, a, S') + γ max Q(S', a')
36
The prediction Q(S, a) chases the moving target r(S, a, S') + γ max Q(S', a')
37
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - Fixed Q-targets
We use a local network and a target network
38
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - Fixed Q-targets
We use a local network and a target network
The target network gets slightly updated based on the local network's weights
39
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - Fixed Q-targets
We use a local network and a target network
The target network gets slightly updated based on the local network's weights
The local network predicts the Q-values for the current state
40
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - Fixed Q-targets
We use a local network and a target network
The target network gets slightly updated based on the local network's weights
The local network predicts the Q-values for the current state
The target network predicts the Q-values for the next state
41
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - Fixed Q-targets
State → Local network → Q-values → Action selection → Action → Environment → New State S', Reward R
42
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - Fixed Q-targets
State → Local network → Q-values → Action selection → Action → Environment → New State S', Reward R
⨁ compute the target r(S, a, S') + γ max Q(S', a') with the target network and the prediction Q(S, a) with the local network
Loss → Optimizer for the local network → New network weights
43
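A minimal sketch (not from the deck) of the local/target pair and the "slight" update of the target network toward the local one, assuming the QNetwork class from the earlier sketch; the blending factor tau is an assumed hyperparameter:

import copy
import torch

local_network = QNetwork(state_size=8, action_size=4)    # predicts Q-values for the current state
target_network = copy.deepcopy(local_network)             # predicts Q-values for the next state
tau = 1e-3                                                 # how slightly the target follows the local network

def soft_update(local_net, target_net, tau):
    # target_weights ⟸ tau * local_weights + (1 - tau) * target_weights
    for t_param, l_param in zip(target_net.parameters(), local_net.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)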
Can someone answer the question, please:
In supervised learning, why does shuffling the dataset give better accuracy?
44
Can someone answer the question, please:
In supervised learning, why does shuffling the dataset give better accuracy?
If the data is not shuffled, it may be sorted or similar data points will lie next to each other, which leads to slow convergence.
45
Can someone answer the question, please:
In supervised learning, why does shuffling the dataset give better accuracy?
If the data is not shuffled, it may be sorted or similar data points will lie next to each other, which leads to slow convergence.
In our case there is a strong correlation between consecutive states, which may lead to inefficient learning.
46
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How to break the correlation between states?
3 - Fixed Q-targets ✅
4 - Experience replay
Stack the experiences in a buffer
After a specified number of steps, pick N random experiences
Optimize the loss on this learning batch
47
1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How to break the correlation between states?
3 - Fixed Q-targets ✅
4 - Experience replay ✅
Stack the experiences in a buffer
After a specified number of steps, pick N random experiences
Optimize the loss on this learning batch
48
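A possible implementation of such a replay buffer (a sketch, not code from the deck); the buffer capacity and batch size are illustrative:

import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Stacks experiences and returns random batches to break the correlation between consecutive states."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Pick N random experiences instead of consecutive ones.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)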
Build a DQL agent
Putting it all together
49
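For "putting it all together", here is a hedged sketch of the learning step of such an agent, assuming the pieces from the earlier sketches (QNetwork, ReplayBuffer and its Experience tuples, local_network, target_network, soft_update); it is one plausible implementation, not the exact code behind the deck:

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(local_network.parameters(), lr=5e-4)   # illustrative learning rate

def learn(batch, gamma=0.99, tau=1e-3):
    # Turn a sampled batch of experiences into tensors.
    states = torch.tensor([e.state for e in batch], dtype=torch.float32)
    actions = torch.tensor([[e.action] for e in batch], dtype=torch.long)
    rewards = torch.tensor([[e.reward] for e in batch], dtype=torch.float32)
    next_states = torch.tensor([e.next_state for e in batch], dtype=torch.float32)
    dones = torch.tensor([[float(e.done)] for e in batch], dtype=torch.float32)

    # Local network gives the prediction Q(S, a); target network gives max Q(S', a').
    q_pred = local_network(states).gather(1, actions)
    with torch.no_grad():
        q_next = target_network(next_states).max(dim=1, keepdim=True).values
    q_target = rewards + gamma * q_next * (1 - dones)

    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Slightly move the target network toward the local network (fixed Q-targets).
    soft_update(local_network, target_network, tau)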
Saha ftourkoum (enjoy your iftar!)