This document introduces reinforcement learning (RL): how an agent interacts with an environment to maximize cumulative reward. It covers the RL problem setting, the components of an RL agent (policies and value functions), and the main solution approaches: value-based, policy-based, and model-based RL. Examples from process control, Atari games, and robotics provide context and motivation for using RL.
5. Reinforcement Learning Applications
RL application areas
[Pie chart: Process Control 23%, Networking 21%, Resource Management 18%, Robotics 13%, Other 8%, Autonomic Computing 6%, Traffic 6%, Finance 4%]
Survey by Csaba Szepesvári of 77 recent application papers, based on an IEEE.org search for the keywords “RL” and “application”.
signal processing
natural language processing
web services
brain-computer interfaces
aircraft control
engine control
bio/chemical reactors
sensor networks
routing
call admission control
network resource management
power systems
inventory control
supply chains
customer service
mobile robots, motion control, Robocup, vision
stoplight control, trains, unmanned vehicles
load balancing
memory management
algorithm tuning
option pricing
asset management
Rich Sutton. Deconstructing Reinforcement Learning. ICML 2009
17. Agent and Environment
At each step the agent:
• Executes action At
• Receives observation Ot
• Receives scalar reward Rt
The environment:
• Receives action At
• Emits observation Ot+1
• Emits scalar reward Rt+1
Approaches:
• MDP, POMDP
• Multi-armed bandit
[Diagram: agent-environment loop. The agent sends action At to the environment; the environment returns observation Ot and reward Rt.]
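To make this loop concrete, here is a minimal Python sketch of the agent-environment interaction; the toy ThermostatEnv class, its dynamics, and the random action choice are illustrative assumptions, not part of the original slides.

import random

class ThermostatEnv:
    # Toy environment (assumed): temperature drifts; "on" cools, "off" warms.
    def reset(self):
        self.temp = 90
        return self.temp                        # initial observation O1

    def step(self, action):                     # environment receives action At
        self.temp += -4 if action == "on" else 4
        self.temp += random.choice([-2, 0, 2])  # assumed noise
        reward = -abs(self.temp - 72)           # assumed comfort setpoint of 72
        return self.temp, reward                # emits Ot+1 and Rt+1

env = ThermostatEnv()
obs = env.reset()
for t in range(10):
    action = random.choice(["on", "off"])       # agent executes action At
    obs, reward = env.step(action)              # agent receives Ot+1, Rt+1
    print(t, action, obs, reward)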
18. History and State
• The history is the sequence of observations, actions, and rewards
• i.e. all observable variables up to time t
• i.e. the sensorimotor stream of a robot or embodied agent
• What happens next depends on the history:
• The agent selects actions
• The environment selects observations and rewards
• State is the information used to determine what happens next
• Formally, state is a function of the history:
Ht = O1, R1, A1, …, At-1, Ot, Rt
St = f(Ht)
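As a small illustration of St = f(Ht), the sketch below stores the history as a flat event stream and picks one possible f that keeps only the latest observation; both the data layout and this choice of f are assumptions made for the example.

# History Ht as a flat stream: O1, R1, A1, O2, R2, ...
history = [("O", 90), ("R", -18),     # O1, R1
           ("A", "on"),               # A1
           ("O", 86), ("R", -14)]     # O2, R2

def state(history):
    # f(Ht): here, simply the most recent observation in the stream.
    return next(value for tag, value in reversed(history) if tag == "O")

print(state(history))                 # -> 86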
22. Approaches To Reinforcement Learning
• Value-based RL
• Estimate the optimal value function Q*(S,A)
• This is the maximum value achievable under any policy
• Policy-based RL
• Search directly for the optimal policy 𝜋*
• This is the policy achieving maximum future reward
• Model-based RL
• Build a model of the environment
• Plan (e.g. by lookahead) using the model
• Use deep neural networks to represent value functions, policies, and models → Deep RL
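As a concrete instance of the value-based approach, the sketch below shows a tabular Q-learning update toward Q*(S,A); the hyperparameters (alpha, gamma, epsilon) and the epsilon-greedy action choice are illustrative assumptions, not prescribed by the slides.

import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # assumed hyperparameters
actions = ["on", "off"]

def choose_action(state):
    # Epsilon-greedy policy derived from the current Q estimates.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    # One-step Q-learning: move Q(s,a) toward r + gamma * max_a' Q(s',a').
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_update(90, "on", -1.0, 86)           # example transition
print(Q[(90, "on")])                   # -> -0.1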
26. The result: A model
Temp before | Cooler [Action] | Temp after | Opportunities | Observations | Probability
90 | on  | 80 | 404 |  10 | 0.025
90 | on  | 82 | 404 | 134 | 0.332
90 | on  | 84 | 404 | 215 | 0.532
90 | on  | 86 | 404 |  34 | 0.084
90 | on  | 88 | 404 |   9 | 0.022
90 | on  | 90 | 404 |   2 | 0.005
90 | off | 88 | 381 |   1 | 0.003
90 | off | 90 | 381 |  23 | 0.059
90 | off | 92 | 381 | 101 | 0.261
90 | off | 94 | 381 | 163 | 0.421
90 | off | 96 | 381 |  75 | 0.194
90 | off | 98 | 381 |  24 | 0.062
27. Now: take it backward: St → At → St+1
(Same transition table as on the previous slide: given the current temperature St and the cooler action At, each row gives the probability of the next temperature St+1.)
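The table above can be estimated and queried by simple frequency counting, sketched below: tally (temp before, action, temp after) transitions, then normalize by the "Opportunities" column to obtain the "Probability" column. The function names are hypothetical; only the counts come from the table.

from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}

def observe(s, a, s_next, n=1):
    counts[(s, a)][s_next] += n

def model(s, a):
    # Estimate P(St+1 | St = s, At = a) from observed frequencies.
    total = sum(counts[(s, a)].values())         # the "Opportunities" column
    return {s_next: c / total for s_next, c in counts[(s, a)].items()}

# Load the (90, on) rows of the table: 404 opportunities in total.
for s_next, c in [(80, 10), (82, 134), (84, 215), (86, 34), (88, 9), (90, 2)]:
    observe(90, "on", s_next, n=c)

print(model(90, "on")[84])                       # -> 0.532..., as in the table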
34. Wrap-up
• RL could become the next star in ML
• More storage space
• More compute power
• Applications in IoT, autonomous driving, process control
• Good foundational research
• Convincing prototypes and applications
→ Focus shift
David Silver: “Reinforcement Learning + Deep Learning = AI”
36. References
• Some content is reused from:
• Introduction to Reinforcement Learning – Shane M. Conway
• Lecture 1: Introduction to Reinforcement Learning – David Silver
• How reinforcement learning works in Becca 7 – Brandon Rohrer
• Johnson M., Hofmann K., Hutton T., Bignell D. (2016). The Malmo Platform for Artificial Intelligence Experimentation. Proc. 25th International Joint Conference on Artificial Intelligence, ed. Kambhampati S., p. 4246. AAAI Press, Palo Alto, California, USA. https://github.com/Microsoft/malmo