Reinforcement Learning
V.Saranya
AP/CSE
Sri Vidya College of Engineering and
Technology,
Virudhunagar
What is learning?
 Learning takes place as a result of interaction between an agent and the world; the idea behind learning is that
 Percepts received by an agent should be used not
only for acting, but also for improving the agent’s
ability to behave optimally in the future to achieve
the goal.
Learning types
 Learning types
 Supervised learning:
a situation in which sample (input, output) pairs of the
function to be learned can be perceived or are given

You can think of it as if there were a kind teacher
 Reinforcement learning:
the agent acts on its environment and receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal
Reinforcement learning
 Task
Learn how to behave successfully to achieve a
goal while interacting with an external environment
 Learn via experiences!
 Examples
 Game playing: the player knows whether it wins or loses, but does not know how to move at each step
 Control: a traffic system can measure the delay of cars, but does not know how to decrease it.
RL is learning from interaction
RL model
 Each percept (e) is enough to determine the State (the state is accessible)
 The agent can decompose the Reward component from a percept.
 The agent's task: to find an optimal policy, mapping states to actions, that maximizes a long-run measure of the reinforcement.
 Think of reinforcement as reward
 Can be modeled as “MDP” model!
Review of MDP model
 MDP model <S,T,A,R>
[Diagram: the agent-environment interaction loop. At each step the agent observes the State and Reward from the Environment and chooses an Action, producing the sequence s0, a0, r0 → s1, a1, r1 → s2, a2, r2 → s3 ...]
• S– set of states
• A– set of actions
• T(s,a,s’) = P(s’|s,a)– the
probability of transition from s to
s’ given action a
• R(s,a)– the expected reward for
taking action a in state s
R(s,a) = Σ_{s'} P(s'|s,a) r(s,a,s') = Σ_{s'} T(s,a,s') r(s,a,s')
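As a rough illustration of this identity, here is a minimal Python sketch; the tiny two-state tables and the dictionary layout are assumptions made for the example, not part of the slides.

# A minimal sketch of R(s,a) = sum_{s'} T(s,a,s') * r(s,a,s').
# The two-state model below is hypothetical.
T = {('s0', 'a0', 's0'): 0.2, ('s0', 'a0', 's1'): 0.8}   # T(s,a,s') = P(s'|s,a)
r = {('s0', 'a0', 's0'): 0.0, ('s0', 'a0', 's1'): 1.0}   # r(s,a,s')

def expected_reward(s, a, states):
    """R(s,a): probability-weighted average of the per-transition rewards."""
    return sum(T.get((s, a, s2), 0.0) * r.get((s, a, s2), 0.0) for s2 in states)

print(expected_reward('s0', 'a0', ['s0', 's1']))  # 0.2*0.0 + 0.8*1.0 = 0.8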
Model-based vs. model-free approaches
 But we don’t know anything about the environment model—the transition function T(s,a,s’)
 Here come two approaches
 Model-based RL:
learn the model, and use it to derive the optimal policy.
e.g. the Adaptive dynamic programming (ADP) approach
 Model-free RL:
derive the optimal policy without learning the model.
e.g. the LMS and Temporal difference approaches
 Which one is better?
Passive learning vs. Active learning
 Passive learning
 The agent simply watches the world going by and tries to learn the utilities of being in various states
 Active learning
 The agent does not simply watch, but also acts
Example environment
Passive learning scenario
 The agent sees sequences of state transitions and associated rewards
 The environment generates state transitions and the agent perceives them
e.g. (1,1) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)[+1]
(1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (1,2) → (1,1) → (2,1) → (3,1) → (4,1) → (4,2)[-1]
 Key idea: update the utility value using the given training sequences.
Passive learning scenario
LMS (least mean squares) updating
 Reward-to-go of a state:
the sum of the rewards from that state until a terminal state is reached
 Key: use the observed reward-to-go of a state as direct evidence of the actual expected utility of that state
 Learn the utility function directly from sequence examples
LMS updating
function LMS-UPDATE (U, e, percepts, M, N) returns an updated U
  if TERMINAL?[e] then
  { reward-to-go ← 0
    for each ei in percepts (starting from the end) do
      s ← STATE[ei]
      reward-to-go ← reward-to-go + REWARD[ei]
      U[s] ← RUNNING-AVERAGE (U[s], reward-to-go, N[s])
    end
  }
function RUNNING-AVERAGE (U[s], reward-to-go, N[s])
  U[s] ← [ U[s] × (N[s] – 1) + reward-to-go ] / N[s]
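Below is a hedged Python sketch of the same update, not the slides' exact pseudocode: after an episode terminates, walk the percept sequence backwards, accumulate the reward-to-go, and fold it into a running average of each state's utility. The (state, reward) episode format and the example rewards are assumptions.

def lms_update(U, N, episode):
    """Direct-utility (LMS) update after one terminated episode.

    U: state -> utility estimate, N: state -> visit count,
    episode: list of (state, reward) pairs ending in a terminal state.
    """
    reward_to_go = 0.0
    for state, reward in reversed(episode):   # walk back from the terminal state
        reward_to_go += reward                # sum of rewards from this state onward
        N[state] = N.get(state, 0) + 1
        # running average: U[s] <- [U[s]*(N[s]-1) + reward-to-go] / N[s]
        U[state] = (U.get(state, 0.0) * (N[state] - 1) + reward_to_go) / N[state]
    return U, N

# Example with the first training sequence from the slides (intermediate rewards
# assumed to be 0 here; only the terminal +1 is shown).
episode = [((1, 1), 0.0), ((1, 2), 0.0), ((1, 3), 0.0),
           ((2, 3), 0.0), ((3, 3), 0.0), ((4, 3), 1.0)]
U, N = lms_update({}, {}, episode)
print(U[(1, 1)])  # 1.0: the observed reward-to-go of (1,1) on this run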
LMS updating algorithm in
passive learning
 Drawbacks:
 The actual utility of a state is constrained to be the probability-weighted average of its successors’ utilities.
 Converges very slowly to the correct utility values (requires a lot of sequences); for our example, more than 1000!
Temporal difference method in passive learning
 TD(0) key idea:
 adjust the estimated utility value of the current state based on its immediate reward and the estimated value of the next state.
 The updating rule:
U(s) ← U(s) + α (R(s) + U(s') − U(s))
 α is the learning rate parameter
 Only when α is a function that decreases as the number of times a state has been visited increases can U(s) converge to the correct value.
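A minimal sketch of this rule, assuming tabular utilities and the illustrative decay schedule α = 1/N(s), one common choice of decreasing learning rate (the slide does not fix a particular schedule):

def td0_update(U, N, s, r, s_next):
    """TD(0) rule: U(s) <- U(s) + alpha * (R(s) + U(s') - U(s))."""
    u_s = U.get(s, 0.0)
    u_next = U.get(s_next, 0.0)
    N[s] = N.get(s, 0) + 1
    alpha = 1.0 / N[s]                 # decreasing learning rate (assumed schedule)
    U[s] = u_s + alpha * (r + u_next - u_s)
    return U, N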
The TD learning curve
[Figure: TD learning curves for the utility estimates of states (4,3), (2,3), (2,2), (1,1), (3,1), (4,1), (4,2)]
Adaptive dynamic programming (ADP) in passive learning
 Unlike the LMS and TD methods (model-free approaches),
 ADP is a model-based approach!
 The updating rule for passive learning:
U(s) = Σ_{s'} T(s,s') (r(s,s') + γ U(s'))
 However, in an unknown environment T is not given; the agent must learn T itself through experience with the environment.
 How to learn T?
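A sketch of the passive ADP value update under assumed tabular storage of the learned model, T[(s, s')] and r[(s, s')]; it simply reapplies the rule above over all states until the values stop changing.

def adp_passive_update(U, T, r, states, gamma=0.9, tol=1e-4):
    """Repeatedly apply U(s) = sum_{s'} T(s,s') * (r(s,s') + gamma*U(s')) until stable."""
    while True:
        delta = 0.0
        for s in states:
            new_u = sum(T.get((s, s2), 0.0) * (r.get((s, s2), 0.0) + gamma * U.get(s2, 0.0))
                        for s2 in states)
            delta = max(delta, abs(new_u - U.get(s, 0.0)))
            U[s] = new_u
        if delta < tol:
            return U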
ADP learning curves
[Figure: ADP learning curves for the utility estimates of states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2)]
Active learning
 An active agent must consider
 what actions to take
 what their outcomes may be
 Update utility equation:
U(s) = max_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )
 Rule to choose an action:
a = argmax_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )
Active ADP algorithm
For each s, initialize U(s), T(s,a,s') and R(s,a)
Initialize s to the current state that is perceived
Loop forever
{
  Select an action a and execute it (using the current model R and T), with
    a = argmax_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )
  Receive the immediate reward r and observe the new state s'
  Use the transition tuple <s,a,s',r> to update the model R and T (see below)
  For all states s, update U(s) using the updating rule
    U(s) = max_a ( R(s,a) + γ Σ_{s'} T(s,a,s') U(s') )
  s = s'
}
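A compact Python sketch of this loop under simplifying assumptions: the environment is a hypothetical env.step(s, a) call returning (s', r), the model is kept in count-based tables (as described on the next slide), and a fixed number of value-update sweeps stands in for running the update to convergence. None of these interfaces come from the slides.

def greedy_action(s, U, T, R, actions, states, gamma=0.9):
    """a = argmax_a [ R(s,a) + gamma * sum_{s'} T(s,a,s') U(s') ]"""
    def q(a):
        return R.get((s, a), 0.0) + gamma * sum(
            T.get((s, a, s2), 0.0) * U.get(s2, 0.0) for s2 in states)
    return max(actions, key=q)

def active_adp(env, s, states, actions, n_steps=1000, gamma=0.9):
    U, T, R = {}, {}, {}
    n_sa, n_sas = {}, {}
    for _ in range(n_steps):
        a = greedy_action(s, U, T, R, actions, states, gamma)
        s_next, r = env.step(s, a)                       # hypothetical environment interface
        # update the model from the transition tuple <s,a,s',r> (see the next slide)
        n_sa[(s, a)] = n_sa.get((s, a), 0) + 1
        n_sas[(s, a, s_next)] = n_sas.get((s, a, s_next), 0) + 1
        for s2 in states:
            T[(s, a, s2)] = n_sas.get((s, a, s2), 0) / n_sa[(s, a)]
        R[(s, a)] = R.get((s, a), 0.0) + (r - R.get((s, a), 0.0)) / n_sa[(s, a)]
        # update U for all states with U(s) = max_a [ R(s,a) + gamma * sum T*U ]
        for _ in range(50):                              # fixed sweeps instead of full convergence
            for st in states:
                U[st] = max(R.get((st, a2), 0.0) + gamma * sum(
                            T.get((st, a2, s2), 0.0) * U.get(s2, 0.0) for s2 in states)
                            for a2 in actions)
        s = s_next
    return U, T, R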
How to learn the model?
 Use the transition tuple <s, a, s', r> to learn T(s,a,s') and R(s,a). That's supervised learning!
 Since the agent observes every transition (s, a, s', r) directly, take (s,a)/s' as an input/output example of the transition probability function T.
 Different supervised-learning techniques can be used (see the further reading for details)
 Use r and T(s,a,s') to learn R(s,a):
R(s,a) = Σ_{s'} T(s,a,s') r
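A minimal sketch of this estimation step, assuming the transitions are collected as a list of (s, a, s', r) tuples: T(s,a,s') is the empirical fraction of times (s,a) led to s', and R(s,a) is recovered from the observed per-transition rewards using the formula above.

from collections import defaultdict

def estimate_model(transitions):
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from <s,a,s',r> tuples."""
    n_sa = defaultdict(int)           # how often (s,a) was tried
    n_sas = defaultdict(int)          # how often (s,a) led to s'
    reward_sum = defaultdict(float)   # summed r observed on each (s,a,s')
    for s, a, s2, r in transitions:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s2)] += 1
        reward_sum[(s, a, s2)] += r
    T = {(s, a, s2): n / n_sa[(s, a)] for (s, a, s2), n in n_sas.items()}
    # R(s,a) = sum_{s'} T(s,a,s') * r, with r taken as the observed average reward
    R = {}
    for (s, a, s2), n in n_sas.items():
        R[(s, a)] = R.get((s, a), 0.0) + T[(s, a, s2)] * (reward_sum[(s, a, s2)] / n)
    return T, R

T, R = estimate_model([('s0', 'a0', 's1', 1.0), ('s0', 'a0', 's0', 0.0)])
print(T[('s0', 'a0', 's1')], R[('s0', 'a0')])   # 0.5 0.5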
ADP approach pros and cons
 Pros:
 The ADP algorithm converges far faster than LMS and temporal-difference learning, because it uses the information from the model of the environment.
 Cons:
 Intractable for large state spaces
 In each step, U is updated for all states
 Improve this by prioritized sweeping
Exploration problem in Active learning
 An action has two kinds of outcomes:
 Gaining rewards on the current experience tuple (s,a,s')
 Affecting the percepts received, and hence the ability of the agent to learn
Exploration problem in Active learning
 A trade-off when choosing an action between
 its immediate good (reflected in its current utility estimates, using what we have learned)
 its long-term good (exploring more about the environment helps it behave optimally in the long run)
 Two extreme approaches
 “wacky” approach: acts randomly, in the hope that it will eventually explore the entire environment.
 “greedy” approach: acts to maximize its utility using the current model estimate
 Just like humans in the real world! People need to decide between
 continuing in a comfortable existence
 or striking out into the unknown in the hope of discovering a new and better life
Exploration problem in Active learning
 One kind of solution: the agent should be more wacky when it has little idea of the environment, and more greedy when it has a model that is close to being correct
 In a given state, the agent should give some weight to actions that it has not tried very often,
 while tending to avoid actions that are believed to be of low utility
 Implemented by an exploration function f(u,n):
 assigns a higher utility estimate to relatively unexplored action-state pairs
 Change the updating rule of the value function to
U+(s) = max_a ( r(s,a) + γ Σ_{s'} T(s,a,s') f(U+(s'), N(a,s)) )
 U+ denotes the optimistic estimate of the utility
Exploration problem in Active learning
 One kind of definition of f(u,n):
f(u,n) = R+  if n < Ne
         u   otherwise
 R+ is an optimistic estimate of the best possible reward obtainable in any state
 The agent will try each action-state pair (s,a) at least Ne times
 The agent will behave initially as if there were wonderful rewards scattered all over the place – optimistic.
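A sketch of this exploration function and its use inside the optimistic update; R+ (R_PLUS) and Ne are tunable constants chosen by the designer, and N(a,s) counts how often action a has been tried in state s (the concrete values below are assumptions).

R_PLUS = 1.0   # assumed optimistic estimate of the best reward obtainable in any state
NE = 5         # assumed number of times each (state, action) pair should be tried

def f(u, n):
    """Exploration function: pretend untried pairs are wonderful."""
    return R_PLUS if n < NE else u

def optimistic_value(s, U_plus, T, r, N, actions, states, gamma=0.9):
    """U+(s) = max_a [ r(s,a) + gamma * sum_{s'} T(s,a,s') * f(U+(s'), N(a,s)) ]"""
    return max(
        r.get((s, a), 0.0) + gamma * sum(
            T.get((s, a, s2), 0.0) * f(U_plus.get(s2, 0.0), N.get((a, s), 0))
            for s2 in states)
        for a in actions)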
Generalization in
Reinforcement Learning
 Use generalization techniques to deal with large state or action spaces.
 Function approximation techniques
Genetic algorithm and Evolutionary
programming
 Start with a set of individuals
 Apply selection and reproduction operators to “evolve” an individual that is
successful (measured by a fitness function)
Genetic algorithm and Evolutionary
programming
 Imagine the individuals as agent functions
 Fitness function as performance measure or
reward function
 No attempt is made to learn the relationship between the rewards and the actions taken by an agent
 It simply searches directly in the individual space to find one that maximizes the fitness function
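As a toy illustration of evolving agent functions directly against a fitness function, here is a minimal genetic-algorithm sketch; the bit-string representation, the fitness used in the example, and the rates are all assumptions made for the sketch.

import random

def evolve(fitness, genome_len=10, pop_size=20, generations=50, mutation_rate=0.05):
    """Evolve bit-string 'agent functions' by selection, crossover and mutation."""
    pop = [[random.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)           # selection: keep the fitter half
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if random.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve(fitness=sum)   # fitness = number of 1s, a stand-in for a reward measure
print(best)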