Reinforcement Learning
● Reinforcement Learning has been around since the 1950s and
○ Produced many interesting applications over the years
○ Like TD-Gammon, a Backgammon-playing program
Reinforcement Learning
● In 2013, researchers from a British startup called DeepMind
○ Demonstrated a system that could learn to play
○ Just about any Atari game from scratch and
○ Eventually outperform humans at most of them
○ It used only raw pixels as inputs and
○ Had no prior knowledge of the rules of the game
● In March 2016, their system AlphaGo defeated
○ Lee Sedol, the world champion of the game of Go
○ This was possible because of Reinforcement Learning
Reinforcement Learning
● In this chapter, we will learn about
○ What Reinforcement Learning is and
○ What it is good at
Reinforcement Learning
Learning to Optimize Rewards
Reinforcement Learning
● In Reinforcement Learning
○ A software agent makes observations and
○ Takes actions within an environment and
○ In return it receives rewards (a minimal loop sketch follows)
Learning to Optimize Rewards
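In code, this interaction is just a loop: observe, act, collect the reward, repeat. Below is a minimal self-contained sketch; DummyEnv and random_policy are made-up illustrations, not a real RL library:

import random

class DummyEnv:
    """A toy environment: reward +1 when the action matches a hidden target."""
    def __init__(self):
        self.target = random.randint(0, 1)
    def reset(self):
        return 0.0                          # a trivial observation
    def step(self, action):
        reward = 1 if action == self.target else 0
        return 0.0, reward                  # next observation, reward

def random_policy(observation):
    return random.randint(0, 1)             # the agent's (random) policy

env = DummyEnv()
obs = env.reset()
total_reward = 0
for step in range(100):                     # observe, act, receive a reward
    action = random_policy(obs)
    obs, reward = env.step(action)
    total_reward += reward
print("Total reward:", total_reward)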
Reinforcement Learning
Goal?
Learning to Optimize Rewards
Reinforcement Learning
Goal
Learn how to take actions in order to maximize
reward
Learning to Optimize Rewards
Reinforcement Learning
● In short, the agent acts in the environment and
○ Learns by trial and error to
○ Maximize its reward
Learning to Optimize Rewards
Reinforcement Learning
So how can we apply this in real-life
applications?
Learning to Optimize Rewards
Reinforcement Learning
Learning to Optimize Rewards - Walking Robot
Reinforcement Learning
● Agent - Program controlling a walking robot
● Environment - Real world
● The agent observes the environment through a set of sensors such as
○ Cameras and touch sensors
● Actions - Sending signals to activate motors
Learning to Optimize Rewards - Walking Robot
Reinforcement Learning
● It may be programmed to get
○ Positive rewards whenever it approaches the target destination and
○ Negative rewards whenever it
■ Wastes time
■ Goes in the wrong direction or
■ Falls down
Learning to Optimize Rewards - Walking Robot
Reinforcement Learning
Learning to Optimize Rewards - Ms. Pac-Man
Reinforcement Learning
● Agent - Program controlling Ms. Pac-Man
● Environment - Simulation of the Atari game
● Actions - Nine possible joystick positions
● Observations - Screenshots
● Rewards - Game points
Learning to Optimize Rewards - Ms. Pac-Man
Reinforcement Learning
Learning to Optimize Rewards - Thermostat
Reinforcement Learning
● Agent - Thermostat
○ Please note, the agent does not have to control a
○ Physically (or virtually) moving thing
● Rewards -
○ Positive rewards whenever the agent is close to the target temperature
○ Negative rewards when humans need to tweak the temperature
● Important - Agent must learn to anticipate human needs
Learning to Optimize Rewards - Thermostat
Reinforcement Learning
Learning to Optimize Rewards - Auto Trader
Reinforcement Learning
● Agent -
○ Observes stock market prices and
○ Decides how much to buy or sell every second
● Rewards - The monetary gains and losses
Learning to Optimize Rewards - Auto Trader
Reinforcement Learning
● There are many other examples such as
○ Self-driving cars
○ Placing ads on a web page or
○ Controlling where an image classification system
■ Should focus its attention
Learning to Optimize Rewards
Reinforcement Learning
● Note that there may not be any positive rewards at all
● For example
○ The agent may move around in a maze
○ Getting a negative reward at every time step
○ So it had better find the exit as quickly as possible
Learning to Optimize Rewards
Reinforcement Learning
Policy Search
Reinforcement Learning
● The algorithm used by the software agent to
○ Determine its actions is called its policy
● For example, the policy could be a neural network
○ Taking observations as inputs and
○ Outputting the action to take
Policy Search
Reinforcement Learning
● The policy can be any algorithm we can think of
○ And it does not even have to be deterministic
● For example, consider a robotic vacuum cleaner
○ Its reward is the amount of dust it picks up in 30 minutes
● Its policy could be to
○ Move forward with some probability p every second or
○ Randomly rotate left or right with probability 1 – p
○ The rotation angle would be a random angle between –r and +r
● Since this policy involves some randomness
○ It is called a stochastic policy (a short sketch follows)
Policy Search
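As a rough sketch, such a policy fits in a few lines of Python (the action names are illustrative; real motor control is abstracted away):

import random

p = 0.6   # probability of moving forward each second (a policy parameter)
r = 30    # rotation angle range in degrees (the other policy parameter)

def stochastic_policy():
    """Each second: move forward with probability p, otherwise rotate
    left or right by a random angle between -r and +r."""
    if random.random() < p:
        return ("forward", 0.0)
    return ("rotate", random.uniform(-r, r))

for second in range(5):
    print(stochastic_policy())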
Reinforcement Learning
● The robot will have an erratic trajectory, which guarantees that
○ It can get to any place it can reach and
○ Pick up all the dust
● The question is: how much dust will it pick up in 30 minutes?
Policy Search
Reinforcement Learning
How would we train such a robot?
Policy Search
Reinforcement Learning
● There are just two policy parameters which we can tweak
○ The probability p and
○ The angle range r
Policy Search
Reinforcement Learning
● One possible learning algorithm could be to
○ Try out many different values for these parameters, and
○ Pick the combination that performs best
● This is an example of policy search using a brute-force approach (sketched below)
Policy Search - Brute Force Approach
Four points in policy space and the agent’s corresponding behavior
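A brute-force search over (p, r) might look like the sketch below; evaluate_policy is a stand-in for actually running the robot for 30 minutes and weighing the dust it picked up:

import random

def evaluate_policy(p, r):
    """Stand-in for a 30-minute trial run; returns a made-up noisy score."""
    return -(p - 0.7) ** 2 - (r - 20) ** 2 / 1000 + random.gauss(0, 0.01)

best_score, best_params = float("-inf"), None
for p in [i / 10 for i in range(1, 10)]:     # candidate probabilities
    for r in [5, 10, 20, 45, 90]:            # candidate angle ranges
        score = evaluate_policy(p, r)
        if score > best_score:
            best_score, best_params = score, (p, r)
print("Best (p, r):", best_params)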
Reinforcement Learning
● When the policy space is too large then
○ Finding a good set of parameters this way is like
○ Searching for a needle in a gigantic haystack
Policy Search - Brute Force Approach
Four points in policy space and the agent’s corresponding behavior
Reinforcement Learning
● Another way to explore the policy space is to
○ Use genetic algorithms
● For example
○ Randomly create a first generation of 100 policies and
○ Try them out, then “kill” the 80 worst policies and
○ Make the 20 survivors produce 4 offspring each
● An offspring is just a copy of its parent
○ Plus some random variation
Policy Search - Genetic Algorithms
Reinforcement Learning
● The surviving policies plus their offspring together
○ Constitute the second generation
● Continue to iterate through generations this way
○ Until a good policy is found (see the toy sketch below)
Policy Search - Genetic Algorithms
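A toy sketch of this genetic algorithm for the (p, r) policy, with evaluate() again standing in for the 30-minute dust measurement:

import random

def evaluate(policy):
    """Stand-in fitness: how much dust the (p, r) policy would pick up."""
    p, r = policy
    return -(p - 0.7) ** 2 - (r - 20) ** 2 / 1000

def mutate(policy):
    """An offspring is a copy of its parent plus some random variation."""
    p, r = policy
    return (min(max(p + random.gauss(0, 0.05), 0.0), 1.0),
            max(r + random.gauss(0, 2.0), 0.0))

# First generation: 100 random (p, r) policies.
population = [(random.random(), random.uniform(0, 90)) for _ in range(100)]
for generation in range(20):
    population.sort(key=evaluate, reverse=True)
    survivors = population[:20]                    # "kill" the 80 worst
    offspring = [mutate(parent) for parent in survivors for _ in range(4)]
    population = survivors + offspring             # the next generation
print("Best policy found:", max(population, key=evaluate))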
Reinforcement Learning
● Another approach is to use
○ Optimization techniques
● Evaluate the gradients of the rewards
○ With regard to the policy parameters
○ Then tweak these parameters by following the
○ Gradient toward higher rewards (gradient ascent)
● This approach is called policy gradients
Policy Search - Optimization Techniques
Reinforcement Learning
● Let’s understand this with an example
● Going back to the vacuum cleaner robot example, we can
○ Slightly increase p and evaluate
○ Whether this increases the amount of dust
○ Picked up by the robot in 30 minutes
○ If it does, then increase p some more, or
○ Else reduce p (a hill-climbing sketch follows)
Policy Search - Optimization Techniques
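A naive hill-climbing sketch of that procedure for the single parameter p; dust_picked_up is a stand-in for actually running the robot:

def dust_picked_up(p):
    """Stand-in for a 30-minute trial with forward-probability p."""
    return -(p - 0.7) ** 2          # pretend the optimum is at p = 0.7

p, step = 0.5, 0.05
for iteration in range(50):
    if dust_picked_up(p + step) > dust_picked_up(p):
        p += step                    # increasing p helped: increase some more
    else:
        p = max(p - step, 0.0)       # otherwise reduce p
        step *= 0.5                  # and take smaller steps from now on
print("Tuned p:", round(p, 3))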
Reinforcement Learning
Introduction to OpenAI Gym
Reinforcement Learning
● To train an agent in Reinforcement Learning
○ We need a working environment
● For example, if we want the agent to learn how to play an Atari game,
○ We will need an Atari game simulator
● OpenAI Gym is a toolkit that provides a wide variety of simulated environments, such as
○ Atari games
○ Board games
○ 2D and 3D physical simulations, and so on (a usage sketch follows)
Introduction to OpenAI Gym
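The basic Gym workflow looks like this (a sketch using the classic interface; the exact return values vary across Gym versions):

import gym

env = gym.make("CartPole-v1")           # create a simulated environment
obs = env.reset()                       # get the initial observation
for step in range(100):
    action = env.action_space.sample()  # a random action, for now
    obs, reward, done, info = env.step(action)
    if done:                            # the pole fell: start a new episode
        obs = env.reset()
env.close()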
Reinforcement Learning
Let’s do a hands-on
Policy Search - Optimization Techniques
Reinforcement Learning
Goal - Balance a pole on top of a movable cart. The pole should stay upright
Policy Search - Optimization Techniques
Reinforcement Learning
Neural Network Policies
Let’s create a neural network policy.
● Take an observation as input
● Output the action to be executed
● Estimate a probability for each action
● Select an action randomly according to the estimated probabilities
For example, in Cart-Pole:
○ If the network outputs 0.7 for action 0, then we will pick action 0 with 70% probability, and action 1 with 30% probability (a sketch follows).
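A minimal version of such a policy network, sketched here with Keras (our assumption; the course notebook may build it differently). CartPole's observation is 4 numbers, and one sigmoid output is enough for two actions:

import numpy as np
from tensorflow import keras

# 4 inputs (the CartPole observation), one hidden layer, and one sigmoid
# output: the estimated probability of action 0.
model = keras.Sequential([
    keras.layers.Dense(4, activation="elu", input_shape=[4]),
    keras.layers.Dense(1, activation="sigmoid"),
])

obs = np.array([[0.01, 0.02, 0.03, 0.04]])          # a dummy observation
p_action_0 = float(model.predict(obs)[0, 0])        # e.g. 0.7
action = 0 if np.random.rand() < p_action_0 else 1  # sample, don't argmax
print("P(action 0) =", p_action_0, "-> chose action", action)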
Reinforcement Learning
Neural Network Policies
Q: Why do we pick a random action based on the probability
given by the neural network, rather than just picking the action
with the highest score?
Reinforcement Learning
Neural Network Policies
Q: Why do we pick a random action based on the probability
given by the neural network, rather than just picking the action
with the highest score?
This approach lets the agent find the right balance between
exploring new actions and exploiting the actions that are
known to work well.
Reinforcement Learning
Neural Network Policies
Q: Why do we pick a random action based on the probability
given by the neural network, rather than just picking the action
with the highest score?
This approach lets the agent find the right balance between
exploring new actions and exploiting the actions that are
known to work well.
We will never discover a new dish at a restaurant if we don’t try
anything new. Give serendipity a chance.
Reinforcement Learning
Check the complete code in Jupyter notebook
Neural Network Policies
Reinforcement Learning
If we knew what the best action was at each step,
● we could train the neural network as usual,
● by minimizing the cross entropy
● between the estimated probability and the target probability.
It would just be regular supervised learning.
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
If we knew what the best action was at each step,
● we could train the neural network as usual,
● by minimizing the cross entropy
● between the estimated probability and the target probability.
It would just be regular supervised learning.
However, in Reinforcement Learning
● the only guidance the agent gets is through rewards,
● and rewards are typically sparse and delayed.
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
For example,
If the agent manages to balance the pole for 100 steps, how can it
know which of the 100 actions it took were good, and which of
them were bad?
All it knows is that the pole fell after the last action, but surely this
last action is not entirely responsible.
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
This is called the credit assignment problem
● When the agent gets a reward, it is hard for it to know which
actions should get credited (or blamed) for it.
● Think of a dog that gets rewarded hours after it behaved well;
will it understand what it is rewarded for?
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
To tackle this problem,
● A common strategy is to evaluate an action based on the sum of
all the rewards that come after it
● usually applying a discount rate r at each step.
● Typical discount rates are 0.95 or 0.99.
Evaluating Actions: The Credit Assignment Problem
Reinforcement Learning
Evaluating Actions: The Credit Assignment Problem
With a discount rate r = 0.8, the first action’s score is:
10 + r × 0 + r² × (–50) = 10 + 0 – 0.64 × 50 = –22
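A small helper that computes such scores; it reproduces the numbers above (a sketch consistent with the formula, not necessarily the notebook's exact code):

def discount_rewards(rewards, rate):
    """Score of each action = sum of all later rewards, discounted per step."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for step in reversed(range(len(rewards))):
        running = rewards[step] + rate * running
        discounted[step] = running
    return discounted

print(discount_rewards([10, 0, -50], rate=0.8))   # [-22.0, -40.0, -50.0]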
Reinforcement Learning
PG algorithms
● Optimize the parameters of a policy
● by following the gradients toward higher rewards.
One popular class of PG algorithms, called REINFORCE algorithms:
● was introduced back in 1992 by Ronald Williams. Here is one
common variant:
Policy Gradients
Reinforcement Learning
Here is one common variant of REINFORCE:
1. Let the neural network policy play the game several times
a. Compute the gradients that would make the chosen action
even more likely,
b. but don’t apply these gradients yet.
Policy Gradients
Reinforcement Learning
Here is one common variant of REINFORCE:
1. Let the neural network policy play the game several times
a. Compute the gradients that would make the chosen action
even more likely,
b. but don’t apply these gradients yet.
2. Once you have run several episodes, compute each action’s
score (using the method described earlier).
Policy Gradients
Reinforcement Learning
Policy Gradients
Here is one common variant of REINFORCE:
3. If an action’s score is positive, it means that the action was good and you want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the score is negative, it means the action was bad and you want to apply the opposite gradients to make this action slightly less likely in the future. The solution is simply to multiply each gradient vector by the corresponding action’s score.
Reinforcement Learning
Here is one common variant of REINFORCE:
4. Finally, compute the mean of all the resulting gradient vectors,
and use it to perform a Gradient Descent step (a condensed sketch follows).
Policy Gradients
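Putting the four steps together, here is a condensed sketch of one REINFORCE training iteration, written with Keras and Gym as an illustration (the course's Jupyter notebook may implement it differently, and Gym API details vary by version):

import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym

# A small policy network: P(action 0) from a 4-number CartPole observation.
model = keras.Sequential([
    keras.layers.Dense(4, activation="elu", input_shape=[4]),
    keras.layers.Dense(1, activation="sigmoid"),
])
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.binary_crossentropy
env = gym.make("CartPole-v1")

def play_one_step(env, obs, model):
    # Step 1: record the gradients that would make the chosen action
    # more likely, but do not apply them yet.
    with tf.GradientTape() as tape:
        p_left = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > p_left)
        y_target = tf.constant([[1.0]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, p_left))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

def discount_rewards(rewards, rate):
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * rate
    return discounted

# Play several episodes, keeping every step's rewards and gradients.
all_rewards, all_grads = [], []
for episode in range(10):
    rewards, grads_seq, obs = [], [], env.reset()
    for step in range(200):
        obs, reward, done, grads = play_one_step(env, obs, model)
        rewards.append(reward)
        grads_seq.append(grads)
        if done:
            break
    all_rewards.append(rewards)
    all_grads.append(grads_seq)

# Step 2: score each action by its discounted future rewards, normalized.
all_disc = [discount_rewards(r, rate=0.95) for r in all_rewards]
flat = np.concatenate(all_disc)
all_scores = [(d - flat.mean()) / flat.std() for d in all_disc]

# Steps 3 and 4: weight each gradient by its action's score, average over
# all steps of all episodes, and apply one Gradient Descent step.
mean_grads = []
for var_index in range(len(model.trainable_variables)):
    mean_grads.append(tf.reduce_mean(
        [score * all_grads[ep][step][var_index]
         for ep, scores in enumerate(all_scores)
         for step, score in enumerate(scores)], axis=0))
optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))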
Reinforcement Learning
Markov Chains
● In the early 20th century, the mathematician Andrey Markov studied
stochastic processes with no memory, called Markov chains
Markov Decision Processes
Reinforcement Learning
Markov Chains
● Such a process has a fixed number of states, and it randomly evolves
from one state to another at each step
● The probability for it to evolve from a state s to a state s′ is fixed, and it
depends only on the pair (s,s′), not on past states
● The system has no memory
Markov Decision Processes
Reinforcement Learning
An example of a Markov chain with four states
Markov Decision Processes
Markov Chains
Reinforcement Learning
Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step.
Markov Decision Processes
Markov Chains
Reinforcement Learning
Eventually it is bound to leave that state and never come back, since no other state points back to s0.
Markov Decision Processes
Markov Chains
Reinforcement Learning
If it goes to state s1, it will then most likely go to state s2 (90% probability), then immediately back to state s1 (100% probability).
Markov Decision Processes
Markov Chains
Reinforcement Learning
It may alternate a number of times between these two states, but eventually it will fall into state s3 and remain there forever. This is a terminal state.
Markov Decision Processes
Markov Chains
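This chain can be simulated in a few lines. The transition matrix below fills in values consistent with the description above; the exact split of the unstated probabilities is an assumption:

import numpy as np

# Rows = current state, columns = next state. Values consistent with the
# description; the unspecified probabilities are assumptions.
T = np.array([[0.7, 0.2, 0.0, 0.1],    # s0: 70% stay, else to s1 or s3
              [0.0, 0.0, 0.9, 0.1],    # s1: 90% to s2, 10% to s3
              [0.0, 1.0, 0.0, 0.0],    # s2: always back to s1
              [0.0, 0.0, 0.0, 1.0]])   # s3: terminal, stays forever

state = 0
path = [state]
while state != 3 and len(path) < 50:
    state = np.random.choice(4, p=T[state])
    path.append(state)
print("Visited states:", path)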
Reinforcement Learning
● Markov chains can have very different dynamics, and they are heavily used
in thermodynamics, chemistry, statistics, and much more
● Markov decision processes were first described in the 1950s by Richard
Bellman
Markov Decision Processes
Reinforcement Learning
● They resemble Markov chains but with a twist:
○ At each step, an agent can choose one of several possible actions
○ And the transition probabilities depend on the chosen action
● Moreover, some state transitions return some reward, positive or
negative, and the agent’s goal is to find a policy that will maximize
rewards over time
Markov Decision Processes
Reinforcement Learning
This Markov Decision Process has three states and up to three possible
discrete actions at each step
Markov Decision Processes
Reinforcement Learning
If it starts in state s0, the agent can choose between actions a0, a1, or a2. If it chooses action a1, it just remains in state s0 with certainty, and without any reward.
Markov Decision Processes
Reinforcement Learning
It can thus decide to stay there forever if it wants. But if it chooses action a0, it has a 70% probability of gaining a reward of +10 and remaining in state s0.
Markov Decision Processes
Reinforcement Learning
It can then try again and again to gain as much reward as possible, but at one point it is going to end up in state s1 instead. In state s1 it has only two possible actions: a0 or a2.
Markov Decision Processes
Reinforcement Learning
It can choose to stay put by repeatedly choosing action a0, or it can choose to move on to state s2 and get a negative reward of –50.
Markov Decision Processes
Reinforcement Learning
Markov Decision Processes
In state s2 it has no other choice than to take action a1, which will most likely lead it back to state s0, gaining a reward of +40 on the way.
Reinforcement Learning
Markov Decision Processes
Question: By looking at this MDP, can we guess which strategy will gain the most reward over time?
Reinforcement Learning
Markov Decision Processes
● In state s0 it is clear that action a0 is the best option
● And in state s2 the agent has no choice but to take action a1
● But in state s1 it is not obvious whether the agent should stay put (a0) or go through the fire (a2)
So how do we estimate the optimal state value of any state s?
Reinforcement Learning
Markov Decision Processes
● Bellman found a way to estimate the optimal state value of any state s,
noted V*(s), which is the sum of all discounted future rewards the agent can
expect on average after it reaches a state s, assuming it acts optimally
● He showed that if the agent acts optimally, then the Bellman Optimality
Equation applies
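Written out (reconstructed here in the deck’s notation, since the slide shows the equation as an image), the Bellman Optimality Equation is:

V*(s) = max_a Σ_s′ T(s, a, s′) · [ R(s, a, s′) + γ · V*(s′) ]    for all s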
Reinforcement Learning
Markov Decision Processes
This recursive equation says that
● If the agent acts optimally, then the optimal value of the current state is
equal to the reward it will get on average after taking one optimal action,
plus the expected optimal value of all possible next states that this action can
lead to
Reinforcement Learning
Markov Decision Processes
● T(s, a, s′) is the transition probability from state s to state s′, given that
the agent chose action a
● R(s, a, s′) is the reward that the agent gets when it goes from state s to
state s′, given that the agent chose action a
● γ is the discount rate
Let’s understand this equation (a small Value Iteration sketch follows)
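To see the equation in action, here is a small Value Iteration sketch for the 3-state MDP above. The slides leave several transition probabilities unstated, so the figures marked below are assumptions consistent with the description:

import numpy as np

# T[s][a][s'] and R[s][a][s'] for the 3-state MDP. Splits the slides do not
# state (the 0.3 to s1, the 0.8/0.2 and 0.8/0.1/0.1 splits) are assumptions.
T = [[[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],   # s0: a0, a1, a2
     [[0.0, 1.0, 0.0], None,            [0.0, 0.0, 1.0]],   # s1: a0, --, a2
     [None,            [0.8, 0.1, 0.1], None]]              # s2: --, a1, --
R = [[[10, 0, 0], [0, 0, 0], [0, 0, 0]],
     [[0, 0, 0],  None,      [0, 0, -50]],
     [None,       [40, 0, 0], None]]
gamma = 0.95            # discount rate (one of the typical values)

V = np.zeros(3)         # estimates of V*(s), initialized to zero
for iteration in range(100):        # repeatedly apply the Bellman update
    V = np.array([
        max(sum(T[s][a][sp] * (R[s][a][sp] + gamma * V[sp])
                for sp in range(3))
            for a in range(3) if T[s][a] is not None)
        for s in range(3)])
print("V*(s0), V*(s1), V*(s2) ~", V.round(1))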
Questions?
https://discuss.cloudxlab.com
reachus@cloudxlab.com
Contenu connexe

Tendances

Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement LearningMulti-Agent Reinforcement Learning
Multi-Agent Reinforcement LearningSeolhokim
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentationOwin Will
 
Real-world Reinforcement Learning
Real-world Reinforcement LearningReal-world Reinforcement Learning
Real-world Reinforcement LearningMax Pagels
 
A brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to gamesA brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to gamesThomas da Silva Paula
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialOmar Enayet
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Relational knowledge distillation
Relational knowledge distillationRelational knowledge distillation
Relational knowledge distillationNAVER Engineering
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithmJie-Han Chen
 
Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningKai-Wen Zhao
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Jeong-Gwan Lee
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsBill Liu
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep LearningSebastian Ruder
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 

Tendances (20)

Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Multi-Agent Reinforcement Learning
Multi-Agent Reinforcement LearningMulti-Agent Reinforcement Learning
Multi-Agent Reinforcement Learning
 
Batch normalization presentation
Batch normalization presentationBatch normalization presentation
Batch normalization presentation
 
Real-world Reinforcement Learning
Real-world Reinforcement LearningReal-world Reinforcement Learning
Real-world Reinforcement Learning
 
A brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to gamesA brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to games
 
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Relational knowledge distillation
Relational knowledge distillationRelational knowledge distillation
Relational knowledge distillation
 
Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
 
Deep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-LearningDeep Reinforcement Learning: Q-Learning
Deep Reinforcement Learning: Q-Learning
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Deep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its ApplicationsDeep Reinforcement Learning and Its Applications
Deep Reinforcement Learning and Its Applications
 
Deep learning
Deep learningDeep learning
Deep learning
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 

Similaire à Reinforcement Learning: Learn How Agents Optimize Rewards

Building a deep learning ai.pptx
Building a deep learning ai.pptxBuilding a deep learning ai.pptx
Building a deep learning ai.pptxDaniel Slater
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introductionConnorShorten2
 
Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018Raul Eulogio
 
Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learningguruprasad110
 
Introduction to reinforcement learning
Introduction to reinforcement learningIntroduction to reinforcement learning
Introduction to reinforcement learningMarsan Ma
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsJaya Kawale
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptxManiMaran230751
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabCloudxLab
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxMohibKhan79
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningYigit UNALLAR
 
Machine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brainMachine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brainDevGAMM Conference
 
Online learning & adaptive game playing
Online learning & adaptive game playingOnline learning & adaptive game playing
Online learning & adaptive game playingSaeid Ghafouri
 
Simulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous DrivingSimulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous DrivingDonal Byrne
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsSeung Jae Lee
 
Reinforcement learning slides
Reinforcement learning slidesReinforcement learning slides
Reinforcement learning slidesOmranHakami
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperDataScienceLab
 
Reinforcement Learning 1. Introduction
Reinforcement Learning 1. IntroductionReinforcement Learning 1. Introduction
Reinforcement Learning 1. IntroductionSeung Jae Lee
 
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...Codemotion
 

Similaire à Reinforcement Learning: Learn How Agents Optimize Rewards (20)

Building a deep learning ai.pptx
Building a deep learning ai.pptxBuilding a deep learning ai.pptx
Building a deep learning ai.pptx
 
Rl chapter 1 introduction
Rl chapter 1 introductionRl chapter 1 introduction
Rl chapter 1 introduction
 
Supervised learning
Supervised learningSupervised learning
Supervised learning
 
Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018Machine Learning - Startup weekend UCSB 2018
Machine Learning - Startup weekend UCSB 2018
 
Structured prediction with reinforcement learning
Structured prediction with reinforcement learningStructured prediction with reinforcement learning
Structured prediction with reinforcement learning
 
Introduction to reinforcement learning
Introduction to reinforcement learningIntroduction to reinforcement learning
Introduction to reinforcement learning
 
Sequential Decision Making in Recommendations
Sequential Decision Making in RecommendationsSequential Decision Making in Recommendations
Sequential Decision Making in Recommendations
 
24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx24.09.2021 Reinforcement Learning Algorithms.pptx
24.09.2021 Reinforcement Learning Algorithms.pptx
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
reinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptxreinforcement-learning-141009013546-conversion-gate02.pptx
reinforcement-learning-141009013546-conversion-gate02.pptx
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Machine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brainMachine Learning in Unity - How to give your game AI a real brain
Machine Learning in Unity - How to give your game AI a real brain
 
Online learning & adaptive game playing
Online learning & adaptive game playingOnline learning & adaptive game playing
Online learning & adaptive game playing
 
Simulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous DrivingSimulation To Reality: Reinforcement Learning For Autonomous Driving
Simulation To Reality: Reinforcement Learning For Autonomous Driving
 
Reinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo MethodsReinforcement Learning 5. Monte Carlo Methods
Reinforcement Learning 5. Monte Carlo Methods
 
Reinforcement learning slides
Reinforcement learning slidesReinforcement learning slides
Reinforcement learning slides
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
Reinforcement Learning 1. Introduction
Reinforcement Learning 1. IntroductionReinforcement Learning 1. Introduction
Reinforcement Learning 1. Introduction
 
Deep einforcement learning
Deep einforcement learningDeep einforcement learning
Deep einforcement learning
 
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
Ciro Continisio - Implementing Machine Learning the Unity way - Codemotion Mi...
 

Plus de CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningCloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning OverviewCloudxLab
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural NetworksCloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingCloudxLab
 
Autoencoders
AutoencodersAutoencoders
AutoencodersCloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural NetsCloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...CloudxLab
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabCloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabCloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random ForestsCloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision TreesCloudxLab
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector MachinesCloudxLab
 

Plus de CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLabAdvanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 2 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 

Dernier

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 

Dernier (20)

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 

Reinforcement Learning: Learn How Agents Optimize Rewards

  • 2. Reinforcement Learning ● Reinforcement Learning has been around since 1950s and ○ Produced many interesting applications over the years ○ Like TD-Gammon, a Backgammon playing program Reinforcement Learning
  • 3. Reinforcement Learning ● In 2013, researchers from an English startup DeepMind ○ Demonstrated a system that could learn to play ○ Just about any Atari game from scratch and ○ Which outperformed human ○ It used only row pixels as inputs and ○ Without any prior knowledge of the rules of the game ● In March 2016, their system AlphaGo defeated ○ Lee Sedol, the world champion of the game of Go ○ This was possible because of Reinforcement Learning Reinforcement Learning
  • 4. Reinforcement Learning ● In this chapter, we will learn about ○ What Reinforcement Learning is and ○ What it is good at Reinforcement Learning
  • 5. Reinforcement Learning Learning to Optimize Rewards Reinforcement Learning
  • 6. Reinforcement Learning ● In Reinforcement Learning ○ A software agent makes observations and ○ Takes actions within an environment and ○ In return it receives rewards Learning to Optimize Rewards
  • 8. Reinforcement Learning Goal Learn how to take actions in order to maximize reward Learning to Optimize Rewards
  • 9. Reinforcement Learning ● In short, the agent acts in the environment and ○ Learns by trial and error to ○ Maximize its reward Learning to Optimize Rewards
  • 10. Reinforcement Learning So how can we apply this in real-life applications? Learning to Optimize Rewards
  • 11. Reinforcement Learning Learning to Optimize Rewards - Walking Robot
  • 12. Reinforcement Learning ● Agent - Program controlling a walking robot ● Environment - Real world ● The agent observes the environment through a set of sensors such as ○ Cameras and touch sensors ● Actions - Sending signals to activate motors Learning to Optimize Rewards - Walking Robot
  • 13. Reinforcement Learning ● It may be programmed to get ○ Positive rewards whenever it approaches the target destination and ○ Negative rewards whenever it ■ Wastes time ■ Goes in the wrong direction or ■ Falls down Learning to Optimize Rewards - Walking Robot
  • 14. Reinforcement Learning Learning to Optimize Rewards - Ms. Pac-Man
  • 15. Reinforcement Learning ● Agent - Program controlling Ms. Pac-Man ● Environment - Simulation of the Atari game ● Actions - Nine possible joystick positions ● Observations - Screenshots ● Rewards - Game points Learning to Optimize Rewards - Ms. Pac-Man
  • 16. Reinforcement Learning Learning to Optimize Rewards - Thermostat
  • 17. Reinforcement Learning ● Agent - Thermostat ○ Please note, the agent does not have to control a ○ Physically (or virtually) moving thing ● Rewards - ○ Positive rewards whenever agent is close to the target temperature ○ Negative rewards when humans need to tweak the temperature ● Important - Agent must learn to anticipate human needs Learning to Optimize Rewards - Thermostat
  • 18. Reinforcement Learning Learning to Optimize Rewards - Auto Trader
  • 19. Reinforcement Learning ● Agent - ○ Observes stock market prices and ○ Decide how much to buy or sell every second ● Rewards - The monetary gains and losses Learning to Optimize Rewards - Auto Trader
  • 20. Reinforcement Learning ● There are many other examples such as ○ Self-driving cars ○ Placing ads on a web page or ○ Controlling where an image classification system ■ Should focus its attention Learning to Optimize Rewards
  • 21. Reinforcement Learning ● Note that there may not be any positive rewards at all ● For example ○ The agent may move around in a maze ○ Getting a negative reward at every time step ○ So it better find the exit as quickly as possible Learning to Optimize Rewards
  • 23. Reinforcement Learning ● The algorithm used by the software agent to ○ Determine its actions is called its policy ● For example, the policy could be a neural network ○ Taking observations as inputs and ○ Outputting the action to take Policy Search
  • 24. Reinforcement Learning ● The policy can be any algorithm we can think of ○ And it does not even have to be deterministic ● For example, consider a robotic vacuum cleaner ○ Its reward is the amount of dust it picks up in 30 minutes ● Its policy could be to ○ Move forward with some probability p every second or ○ Randomly rotate left or right with probability 1 – p ○ The rotation angle would be a random angle between –r and +r ● Since this policy involves some randomness ○ It is called a stochastic policy Policy Search
  • 25. Reinforcement Learning ● The robot will have an erratic trajectory, which guarantees that ○ It can get to any place it can reach and ○ Pick up all the dust ● The question is: how much dust will it pick up in 30 minutes? Policy Search
  • 26. Reinforcement Learning How would we train such a robot? Policy Search
  • 27. Reinforcement Learning ● There are just two policy parameters which we can tweak ○ The probability p and ○ The angle range r Policy Search
  • 28. Reinforcement Learning ● One possible learning algorithm could be to ○ Try out many different values for these parameters, and ○ Pick the combination that performs best ● This is the example of policy search using a brute force approach Policy Search - Brute Force Approach Four points in policy space and the agent’s corresponding behavior
  • 29. Reinforcement Learning ● When the policy space is too large then ○ Finding a good set of parameters this way is like ○ Searching for a needle in a gigantic haystack Policy Search - Brute Force Approach Four points in policy space and the agent’s corresponding behavior
  • 30. Reinforcement Learning ● Another way to explore the policy space is to ○ Use genetic algorithms ● For example ○ Randomly create a first generation of 100 policies and ○ Try them out, then “kill” the 80 worst policies and ○ Make the 20 survivors produce 4 offspring each ● An offspring is just a copy of its parent ○ Plus some random variation Policy Search - Genetic Algorithms
  • 31. Reinforcement Learning ● The surviving policies plus their offspring together ○ Constitute the second generation ● Continue to iterate through generations this way ○ Until a good policy is found Policy Search - Genetic Algorithms
  • 32. Reinforcement Learning ● Another approach is to use ○ Optimization Techniques ● Evaluate the gradients of the rewards ○ With regards to the policy parameters ○ Then tweaking these parameters by following the ○ Gradient toward higher rewards (gradient ascent) ● This approach is called policy gradients Policy Search - Optimization Techniques
  • 33. Reinforcement Learning ● Let’s understand this with an example ● Going back to the vacuum cleaner robot example, we can ○ Slightly increase p and evaluate ○ Whether this increases the amount of dust ○ Picked up by the robot in 30 minutes ○ If it does, then increase p some more, or ○ Else reduce p Policy Search - Optimization Techniques
  • 35. Reinforcement Learning ● To train an agent in Reinforcement Learning ○ We need a working environment ● For example, if we want agent to run how to play Atari game, ○ We will need a Atari game simulator ● OpenAI gym is a toolkit that provides wide variety of simulations like ○ Atari games ○ Board games ○ 2D and 3D physical simulations and so on Introduction to OpenAI Gym
  • 36. Reinforcement Learning Let’s do a hands-on Policy Search - Optimization Techniques
  • 37. Reinforcement Learning Goal - Balance a pole on top of a movable cart. Pole should be upright Policy Search - Optimization Techniques Pole Cart
  • 38. Reinforcement Learning Neural Network Policies Let’s create a neural network policy. ● Take an observation as input ● Output the action to be executed ● Estimate probability for each action ● Select an action randomly according to the estimated probabilities Example, In cart pole: ○ If it outputs 0.7, then we will pick action 0 with 70% probability, and action 1 with 30% probability.
  • 39. Reinforcement Learning Neural Network Policies Q: Why are we picking a random action based on the probability given by the neural network, rather than just picking the action with the highest score?
  • 40. Reinforcement Learning Neural Network Policies Q: Why are we picking a random action based on the probability given by the neural network, rather than just picking the action with the highest score? This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well.
  • 41. Reinforcement Learning Neural Network Policies Q: Why are we picking a random action based on the probability given by the neural network, rather than just picking the action with the highest score? This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. We will never discover a new dish at a restaurant if we don’t try anything new. Give serendipity a chance.
  • 42. Reinforcement Learning Check the complete code in the Jupyter notebook Neural Network Policies
  • 43. Reinforcement Learning If we knew what the best action was at each step, ● we could train the neural network as usual, ● by minimizing the cross entropy ● between the estimated probability and the target probability. It would just be regular supervised learning. Evaluating Actions: The Credit Assignment Problem
  • 44. Reinforcement Learning If we knew what the best action was at each step, ● we could train the neural network as usual, ● by minimizing the cross entropy ● between the estimated probability and the target probability. It would just be regular supervised learning. However, in Reinforcement Learning ● the only guidance the agent gets is through rewards, ● and rewards are typically sparse and delayed. Evaluating Actions: The Credit Assignment Problem
  • 45. Reinforcement Learning For example, If the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad? All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible. Evaluating Actions: The Credit Assignment Problem
  • 46. Reinforcement Learning This is called the credit assignment problem ● When the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. ● Think of a dog that gets rewarded hours after it behaved well; will it understand what it is rewarded for? Evaluating Actions: The Credit Assignment Problem
  • 47. Reinforcement Learning To tackle this problem, ● A common strategy is to evaluate an action based on the sum of all the rewards that come after it ● usually applying a discount rate r at each step. ● Typical discount rates are 0.95 or 0.99. Evaluating Actions: The Credit Assignment Problem
  • 48. Reinforcement Learning Evaluating Actions: The Credit Assignment Problem For example, if an action is followed by rewards 10, 0, and –50 over the next three steps, then with a discount rate r = 0.8 its score is: 10 + r × 0 + r² × (–50) = –22
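A small helper makes this computation concrete; it works backwards through the rewards, adding each reward to the discounted score of the step after it, and reproduces the –22 above:

```python
import numpy as np

def discount_rewards(rewards, rate):
    # Work backwards: each step's score = its reward + rate * next score
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += rate * discounted[step + 1]
    return discounted

print(discount_rewards([10, 0, -50], rate=0.8))  # [-22. -40. -50.]
```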
  • 49. Reinforcement Learning PG algorithms ● Optimize the parameters of a policy ● by following the gradients toward higher rewards. One popular class of PG algorithms, called REINFORCE algorithms, ● was introduced back in 1992 by Ronald Williams. Here is one common variant: Policy Gradients
  • 50. Reinforcement Learning Here is one common variant of REINFORCE: 1. Let the neural network policy play the game several times; at each step a. Compute the gradients that would make the chosen action even more likely, b. but don’t apply these gradients yet. Policy Gradients
  • 51. Reinforcement Learning Here is one common variant of REINFORCE: 1. Let the neural network policy play the game several times; at each step a. Compute the gradients that would make the chosen action even more likely, b. but don’t apply these gradients yet. 2. Once you have run several episodes, compute each action’s score (using the method described earlier). Policy Gradients
  • 52. Reinforcement Learning Policy Gradients Here is one common variant of REINFORCE: 3. If an action’s score is positive, it means that the action was good and you want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the score is negative, it means the action was bad and you want to apply the opposite gradients to make this action slightly less likely in the future. The solution is simply to multiply each gradient vector by the corresponding action’s score.
  • 53. Reinforcement Learning Here is one common variant of REINFORCE: 4. Finally, compute the mean of all the resulting gradient vectors, and use it to perform a Gradient Descent step. Policy Gradients
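A condensed TensorFlow 2 sketch of this variant, assuming the `model` (sigmoid policy network), `env` (CartPole), and `discount_rewards` from the earlier sketches; the hyperparameters are illustrative, and the older 4-tuple Gym step() API is assumed:

```python
import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fn = tf.keras.losses.binary_crossentropy

def play_one_step(env, obs):
    # Step 1: compute (but don't apply) gradients whose descent would
    # make the chosen action more likely
    with tf.GradientTape() as tape:
        p_left = model(obs[np.newaxis])
        action = tf.random.uniform([1, 1]) > p_left     # True -> action 1
        y_target = tf.constant([[1.0]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, p_left))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

# Play several episodes, storing rewards and per-step gradients
all_rewards, all_grads = [], []
for episode in range(10):
    obs, ep_rewards, ep_grads = env.reset(), [], []
    for step in range(200):
        obs, reward, done, grads = play_one_step(env, obs)
        ep_rewards.append(reward); ep_grads.append(grads)
        if done:
            break
    all_rewards.append(ep_rewards); all_grads.append(ep_grads)

# Step 2: score each action (normalized discounted future rewards)
all_disc = [discount_rewards(r, rate=0.95) for r in all_rewards]
flat = np.concatenate(all_disc)
all_scores = [(d - flat.mean()) / flat.std() for d in all_disc]

# Steps 3 and 4: weight each gradient by its action's score,
# average over all episodes and steps, and apply one Gradient Descent step
mean_grads = [tf.reduce_mean(
    [score * all_grads[ep][step][var]
     for ep, scores in enumerate(all_scores)
     for step, score in enumerate(scores)], axis=0)
    for var in range(len(model.trainable_variables))]
optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
```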
  • 54. Reinforcement Learning Markov Chains ● In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains Markov Decision Processes
  • 55. Reinforcement Learning Markov Chains ● Such a process has a fixed number of states, and it randomly evolves from one state to another at each step ● The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s, s′), not on past states ● The system has no memory Markov Decision Processes
  • 56. Reinforcement Learning An example of a Markov chain with four states Markov Decision Processes Markov Chains
  • 57. Reinforcement Learning Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step Markov Decision Processes Markov Chains
  • 58. Reinforcement Learning Eventually it is bound to leave that state and never come back since no other state points back to s0 Markov Decision Processes Markov Chains
  • 59. Reinforcement Learning If it goes to state s1, it will then most likely go to state s2 with 90% probability, then immediately back to state s1 with 100% probability Markov Decision Processes Markov Chains
  • 60. Reinforcement Learning It may alternate a number of times between these two states, but eventually it will fall into state s3 and remain there forever. This is a terminal state Markov Decision Processes Markov Chains
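A short simulation of this chain; the probabilities stated above (70%, 90%, 100%) are used as given, while the remaining branches (e.g. how the 30% leaving s0 is split) are assumptions chosen to match the figure:

```python
import numpy as np

transition_probabilities = np.array([
    [0.7, 0.2, 0.0, 0.1],  # from s0: 70% stay (other branches assumed)
    [0.0, 0.0, 0.9, 0.1],  # from s1: 90% to s2 (other branch assumed)
    [0.0, 1.0, 0.0, 0.0],  # from s2: 100% back to s1
    [0.0, 0.0, 0.0, 1.0],  # s3 is terminal: it never leaves
])

state, path = 0, [0]
while state != 3 and len(path) < 50:
    state = np.random.choice(4, p=transition_probabilities[state])
    path.append(int(state))
print(path)   # e.g. [0, 0, 1, 2, 1, 2, ..., 3]
```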
  • 61. Reinforcement Learning ● Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more ● Markov decision processes were first described in the 1950s by Richard Bellman Markov Decision Processes
  • 62. Reinforcement Learning ● They resemble Markov chains but with a twist: ○ At each step, an agent can choose one of several possible actions ○ And the transition probabilities depend on the chosen action ● Moreover, some state transitions return some reward, positive or negative, and the agent’s goal is to find a policy that will maximize rewards over time Markov Decision Processes
  • 63. Reinforcement Learning This Markov Decision Process has three states and up to three possible discrete actions at each step Markov Decision Processes
  • 64. Reinforcement Learning If it starts in state s0, the agent can choose between actions a0, a1, or a2. If it chooses action a1, it just remains in state s0 with certainty, and without any reward Markov Decision Processes
  • 65. Reinforcement Learning It can thus decide to stay there forever if it wants. But if it chooses action a0, it has a 70% probability of gaining a reward of +10, and remaining in state s0 Markov Decision Processes
  • 66. Reinforcement Learning It can then try again and again to gain as much reward as possible. But at one point it is going to end up instead in state s1. In state s1, it has only two possible actions: a0 or a2 Markov Decision Processes
  • 67. Reinforcement Learning It can choose to stay put by repeatedly choosing action a0, or it can choose to move on to state s2 and get a negative reward of –50 Markov Decision Processes
  • 68. Reinforcement Learning Markov Decision Processes In state s2 it has no other choice than to take action a1, which will most likely lead it back to state s0, gaining a reward of +40 on the way
  • 69. Reinforcement Learning Markov Decision Processes Question By looking at this MDP, can we guess which strategy will gain the most reward over time?
  • 70. Reinforcement Learning Markov Decision Processes ● In state s0 it is clear that action a0 is the best option ● And in state s2 the agent has no choice but to take action a1 ● But in state s1 it is not obvious whether the agent should stay put (a0) or go through the fire (a2) So how do we estimate the optimal state value of any state s?
  • 71. Reinforcement Learning Markov Decision Processes ● Bellman found a way to estimate the optimal state value of any state s, denoted V*(s), which is the sum of all discounted future rewards the agent can expect on average after it reaches state s, assuming it acts optimally ● He showed that if the agent acts optimally, then the Bellman Optimality Equation applies
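In standard notation, the Bellman Optimality Equation is:

$$V^*(s) = \max_a \sum_{s'} T(s, a, s')\,\bigl[R(s, a, s') + \gamma \cdot V^*(s')\bigr] \quad \text{for all } s$$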
  • 72. Reinforcement Learning Markov Decision Processes This recursive equation says that ● If the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to
  • 73. Reinforcement Learning Markov Decision Processes ● T(s, a, s′) is the transition probability from state s to state s′, given that the agent chose action a ● R(s, a, s′) is the reward that the agent gets when it goes from state s to state s′, given that the agent chose action a ● γ is the discount rate Let’s understand this equation
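To make the equation concrete, here is a Value Iteration sketch for the three-state MDP above. The transitions and rewards follow the description on the previous slides, but the numbers not stated there (the 30% branch out of s0, the effect of a2 in s0, the exact split out of s2, and the discount rate) are assumptions:

```python
import numpy as np

gamma = 0.90   # discount rate (assumed)

# T[s][a][s'] = transition probability, R[s][a][s'] = reward
T = {0: {0: [0.7, 0.3, 0.0], 1: [1.0, 0.0, 0.0], 2: [0.8, 0.2, 0.0]},
     1: {0: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]},
     2: {1: [0.8, 0.1, 0.1]}}
R = {0: {0: [10, 0, 0], 1: [0, 0, 0], 2: [0, 0, 0]},
     1: {0: [0, 0, 0], 2: [0, 0, -50]},
     2: {1: [40, 0, 0]}}

V = np.zeros(3)          # start from V*(s) = 0 for every state
for _ in range(50):      # repeatedly apply the Bellman update
    V = np.array([max(sum(T[s][a][sp] * (R[s][a][sp] + gamma * V[sp])
                          for sp in range(3))
                      for a in T[s])       # best action in state s
                  for s in range(3)])
print(V)   # approximate optimal state values V*(s0), V*(s1), V*(s2)
```

Once the values converge, the optimal policy can be read off by taking, in each state, the action that achieves the maximum in the Bellman update.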