4. What is reinforcement
learning?
In a reinforcement learning setting, one takes actions in an
environment & receives rewards. The ultimate goal is to
maximise rewards over time.
6. What is reinforcement
learning?
A good real-world analogy is teaching your dog a new
command. If the dog correctly performs (acts) the command
you (the environment) gave, he or she is given a treat (a
reward). Over time, your dog will learn to act as commanded
in order to maximise reward over time.
7. What is reinforcement
learning?
Reinforcement learning isn’t entirely dissimilar from the
notion of classical conditioning or Pavlovian response:
“Classical conditioning (also known as Pavlovian or
respondent conditioning) refers to a learning procedure in
which a biologically potent stimulus (e.g. food) is paired with
a previously neutral stimulus (e.g. a bell). It also refers to the
learning process that results from this pairing, through which
the neutral stimulus comes to elicit a response (e.g. salivation)
that is usually similar to the one elicited by the potent
stimulus.”
Classical conditioning,
https://en.wikipedia.org/wiki/Classical_conditioning
8. What is reinforcement
learning?
In the beginning, a reinforcement learning agent knows
nothing about the world. It must explore different options to
learn what works and what doesn’t.
11. What is reinforcement
learning?
We’ll get back to the details later. Before that, let’s think about
why you might want to use reinforcement learning, and how
to do it in a way that actually works in the real world.
13. The case for using
reinforcement learning
Intentionally provocative statement: you can’t really call
machine learning systems intelligent unless they are
reinforcement learning systems.
Let’s dissect this through some observations.
14. The case for using
reinforcement learning
Observation #1: any system that doesn’t use machine
learning generates data that is ultimately based on human
expertise.
15. The case for using
reinforcement learning
Observation #2: any supervised machine learning system that
uses such data is effectively learning from data generated by
human expertise.
16. The case for using
reinforcement learning
Observation #3: humans aren’t great at everything.
17. The case for using
reinforcement learning
Observation #4: deploying a supervised learning system itself
generates data from a new distribution. However, it still has
its roots in human expertise.
18. The case for using
reinforcement learning
Is this type of source information really the way to go? Is it
really the correct signal?
19. The case for using
reinforcement learning
I don’t think so. Let me elaborate with an example.
20. The case for using
reinforcement learning
Which of the following would I be most interested in?
22. The case for using
reinforcement learning
Personal opinion: the only way to uncover the correct signal is
to assume nothing, try out different things (explore), and
learn to act optimally (exploit) based on environmental
feedback. It’s causal by nature. Everything else is a hack*.
* supervised learning can be a massively useful, perhaps even glorious,
hack, but it is still a hack.
24. A fundamentally correct machine learning system is a loop of
four stages: Explore, Log, Learn, Deploy.
The case for using
reinforcement learning
25. The case for using
reinforcement learning
If you agree with this train of thought, it begs a question: why
don’t we use more reinforcement learning?
28. The problem with
reinforcement learning
Max's Difficulty Continuum: supervised learning sits at the
straightforward* end; full reinforcement learning sits at the
hard-as-nails end.
* not necessarily easy
36. The problem with
reinforcement learning
The standard way to deal with this is to build an
environment simulator that generates an endless supply of
states & rewards.
This works in constrained, fully digital settings like games.
But for loads of real-world problems, you literally can’t build
a simulator.
37. The case for using
reinforcement learning
How on earth would a simulator know I
enjoy Tudor history?
38. The case for using
reinforcement learning
It can’t.
40. The problem with
reinforcement learning
In a full reinforcement learning setting, rewards can arrive
immediately, or sometime in the future.
41. The problem with
reinforcement learning
Let’s say you have a sequence of ten yes/no decisions to make.
1. If you decide yes at step 1, you get a small immediate
reward and no rewards for the remaining 9 steps.
2. If you say no at step 1, and then follow a very specific
sequence of yeses and nos for the remaining steps, you
get a large reward.
It would make sense to sacrifice short-term rewards in this
case, because the payoff at the end is large.
42. The problem with
reinforcement learning
Consequence: you need to be able to learn to assign (partial)
rewards to actions that possibly happened a long time ago.
This is known as the credit assignment problem. Solving
this problem means full RL algorithms necessarily depend
on the number of observations, exacerbating the sample
complexity issue even more.
44. The problem with
reinforcement learning
Despite all the issues, RL is still much too promising to
give up on. If we solve RL in real-world settings, we stand to
advance the state of the art significantly.
45. The problem with
reinforcement learning
So how do we do it? Currently, via a set of clever tricks and
simplifications. We aren’t yet able to solve all real-world RL
problems, but you’d be surprised what we can solve today.
46. How can you do reinforcement learning
in the real world?
47. How can you do
reinforcement learning in
the real world?
Currently: via some simplifications.
Let’s look at the Difficulty Continuum again, and add some
pros & cons.
49. Max's Difficulty Continuum, with pros & cons:
● Supervised learning (straightforward*): incorrect signal;
independent of the number of observations.
● Full reinforcement learning (hard as nails): correct signal;
depends on the number of observations.
* not necessarily easy
How can you do
reinforcement learning in
the real world?
50. If we can find a way to get rid of the dependence on sample
size, yet preserve the correctness of signal as well as possible,
we are on to something.
But can we?
How can you do
reinforcement learning in
the real world?
51. Yes. By making some simple yet critical modifications to
the full RL problem, we can make reinforcement learning
agents capable of solving a huge amount of real-world
problems. Not all problems, but a significant portion.
How can you do
reinforcement learning in
the real world?
52. Simplification #1: we are going to require that the reward for
an action is revealed (almost) immediately and, more
importantly, that it is attributable only to the previous action.
How can you do
reinforcement learning in
the real world?
54. Q: Isn’t the immediate reward requirement a problem?
A: It depends. Though tricky, there is a huge class of
problems for which you can find short-term proxy rewards
that align well with long-term rewards. This is especially true
in online applications.
How can you do
reinforcement learning in
the real world?
55. Proxy reward examples
● News site. Long-term reward: user satisfaction. Short-term
proxy: dwell time.
● Weight loss program. Long-term reward: kilos lost. Short-term
proxy: exercise time.
● Video site. Long-term reward: annual viewing time. Short-term
proxy: seconds viewed following an action.
● General-purpose. If you can build a predictor that accurately
predicts the long-term reward using short-term features, use
the prediction as a short-term reward (see the sketch below).
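A minimal sketch of that general-purpose recipe, under the assumption that you already have historical short-term features paired with eventual long-term rewards (the model choice and all names here are illustrative, not from the talk):

```python
from sklearn.linear_model import Ridge

def build_proxy_reward(short_term_features_history, long_term_reward_history):
    # Fit a predictor of long-term reward from short-term features.
    model = Ridge().fit(short_term_features_history, long_term_reward_history)

    def proxy_reward(short_term_features):
        # The model's prediction serves as the immediate proxy reward.
        return float(model.predict([short_term_features])[0])

    return proxy_reward
```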
How can you do
reinforcement learning in
the real world?
56. Simplification #2: we are going to require that possible states
do not depend on previous actions we took.
How can you do
reinforcement learning in
the real world?
58. Given these simplifications, we have what is known as
immediate-reward reinforcement learning, or contextual
bandits as it’s more commonly known.
How can you do
reinforcement learning in
the real world?
59. With no dependence on the number of observations, we have
a setting that is still RL, but closer to supervised learning in
terms of tractability.
How can you do
reinforcement learning in
the real world?
60-61. Max's Difficulty Continuum, revisited:
● Supervised learning (straightforward*): incorrect signal;
independent of the number of observations.
● Contextual bandits (in between): rightish signal;
independent of the number of observations.
● Full reinforcement learning (hard as nails): correct signal;
depends on the number of observations.
* not necessarily easy
How can you do
reinforcement learning in
the real world?
62. The contextual bandit (CB) problem, in CB lingo:
Repeatedly do:
1. Observe features x (analogous to state in RL)
2. Choose action a given x
3. Receive immediate reward r for the action
Objective: maximise expected reward over time.
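A minimal sketch of this loop in Python, where the environment and policy objects are hypothetical stand-ins rather than any particular library's API:

```python
def run_contextual_bandit(environment, policy, n_rounds=1000):
    total_reward = 0.0
    for _ in range(n_rounds):
        x = environment.observe_features()   # 1. observe features x
        a = policy.choose_action(x)          # 2. choose action a given x
        r = environment.reward(x, a)         # 3. receive immediate reward r for the action
        policy.update(x, a, r)               # learn from the (x, a, r) feedback
        total_reward += r
    return total_reward / n_rounds           # average reward, to be maximised over time
```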
How can you do
reinforcement learning in
the real world?
63. Given the simplifications, contextual bandit problems are
solvable using much less data than full RL problems. This
makes CBs an excellent candidate for solving real-world
problems.
How can you do
reinforcement learning in
the real world?
64. Next question: how might we go about solving a contextual
bandit problem?
How can you do
reinforcement learning in
the real world?
68. One possible solution: ML
reductions
There are two approaches to solving a machine learning
problem:
1. Design new algorithms
2. Figure out how to reuse existing algorithms
The subfield of machine learning reductions focuses on 2).
It’s one of my favourite ML topics.
69. One possible solution: ML
reductions
General approach: reduce your original data distribution into
something that can be solved by an existing, simpler
algorithm. Solve that, then roll the solution back up to solve
your original problem.
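As a small illustration of the idea (not from the original slides): multiclass classification reduced to binary classification via one-vs-all, assuming each binary learner exposes a fit method and a real-valued score method:

```python
class OneVsAll:
    # Sketch: reduce multiclass classification to binary classification.
    def __init__(self, make_binary_learner, n_classes):
        # One binary learner per class: "is this example class k, or not?"
        self.learners = [make_binary_learner() for _ in range(n_classes)]

    def fit(self, xs, ys):
        for k, learner in enumerate(self.learners):
            learner.fit(xs, [1 if y == k else 0 for y in ys])
        return self

    def predict(self, x):
        # Roll the binary answers back up into a multiclass answer.
        scores = [learner.score(x) for learner in self.learners]
        return max(range(len(scores)), key=lambda k: scores[k])
```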
70. One possible solution: ML
reductions
Some of these may be hard to believe, but using either a single
reduction or a stack of reductions, you can reduce at least the
following:
71. One possible solution: ML
reductions
● Importance-weighted binary classification to binary
classification
● Regression to binary classification
● Quantile regression to binary classification
● Multiclass classification to binary classification
● Cost-sensitive multiclass classification to
importance-weighted binary classification
● Cost-sensitive multiclass classification to regression
● Ranking to binary classification
● Contextual bandits to multiclass classification
● Contextual bandits to binary classification
● Contextual bandits to regression
72. One possible solution: ML
reductions
Putting our ML reductionist hat on, let’s take a closer look at
the agent part of the contextual bandit process.
73. Environment and agent: the agent observes features from the
environment, plays an action, and receives a reward back.
Goal: learn to act so as to maximise reward over time.
One possible solution: ML
reductions
74. The agent is an exploration policy. Job: at each timestep,
observe the state (features) and play an action, either the best
one or one chosen according to some exploration strategy.
One possible solution: ML
reductions
75. The exploration policy is made of two parts (features in,
action out):
● Policy. Job: at each timestep, observe the state and output
the best action.
● Exploration strategy. Job: at each timestep, decide whether
to choose the best action, or try some other action.
One possible solution: ML
reductions
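As a hedged illustration of this split, an ε-greedy exploration strategy wrapped around a policy might look like the sketch below (class and method names are mine):

```python
import random

class EpsilonGreedy:
    # Illustrative exploration strategy: mostly exploit, sometimes explore.
    def __init__(self, policy, actions, epsilon=0.1):
        self.policy = policy      # outputs the best action for given features
        self.actions = actions    # list of available actions
        self.epsilon = epsilon    # probability of trying some other action

    def choose_action(self, x):
        best = self.policy.best_action(x)
        if random.random() < self.epsilon:
            a = random.choice(self.actions)   # explore: try some other action
        else:
            a = best                          # exploit: play the best action
        # Probability this particular action would be chosen; logging it
        # becomes important later, when we build (x, a, p, r) quads.
        k = len(self.actions)
        p = (1 - self.epsilon) + self.epsilon / k if a == best else self.epsilon / k
        return a, p
```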
76. You could argue finding the best way to explore is basically
what RL is all about. It’s such a broad topic that we’ll skip it*
in this talk, and focus on the policy itself.
*give me a shout after the talk if this is something you’d like to learn
more about.
One possible solution: ML
reductions
77. A policy is a learned function that takes a state as input and
outputs a prediction of the best action.
Replace “state” with “features” and “action” with “class” and
you get:
…a learned function that takes features as input and
outputs a prediction of the best class.
Another way to think about this: a policy is a classifier that
acts.
One possible solution: ML
reductions
78. *Puts reductionist hat on*: all of this sounds an awful lot like
supervised learning.
One possible solution: ML
reductions
79. Supervised learning assumes a full information setting, so we
can’t use it directly. The bad, and beautiful, thing about
reinforcement learning is that you never get to see rewards
for actions you didn’t take.
One possible solution: ML
reductions
80. However, it is possible to fill in “fake” reward information in
such a way that you get a dataset without missing
observations.
One possible solution: ML
reductions
81. This doesn’t seem possible, but it is (we’ll learn one technique
later on). And this is massively exciting, because it means we
can solve the policy part of contextual bandits with any
supervised learning classifier.
One possible solution: ML
reductions
82. By any classifier, I do mean any. We treat the classifier as an
oracle, a black box whose inner workings we don’t even need
to know about. Any classifier (assuming sufficient
expressiveness) will do:
● Gradient boosted classifiers
● Neural nets
● Logistic regression
● Decision trees
● KNN
● SVMs
● Random Forests
● ...
One possible solution: ML
reductions
83. The exploration policy, with an oracle plugged in (modified
features in, action out):
● Supervised classifier oracle. Job: at each timestep, observe
the state and output the best action.
● Exploration strategy. Job: at each timestep, decide whether
to choose the best action, or try some other action.
One possible solution: ML
reductions
84. Exciting conclusion: we can reduce contextual bandits to
supervised learning + exploration, and solve the learning part
using an oracle learner.
But how do we deal with the partial information problem
inherent to all RL?
One possible solution: ML
reductions
86. The reinforcement learning setting, including the contextual
bandit setting, suffers from severe selection bias,
because we never get to see rewards for actions we
never take.
It makes evaluating the goodness of a policy less than
straightforward. Let’s look at an example.
Contextual bandits & the
partial information problem
87. Let’s pretend we’ve collected data (also known as experience)
from a contextual bandit agent that chooses between 4
actions (e.g. news articles) according to some exploration
policy π.
Contextual bandits & the
partial information problem
88. Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
Contextual bandits & the
partial information problem
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
89. Contextual bandits & the
partial information problem
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
Now, let’s say we want to improve on the existing system and
train a new policy using the logged data. It chooses:
(a: 1, x, r: ?) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: ?) (a: 4, x, r: ?)
How can we tell if our new policy is better?
Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
90. Contextual bandits & the
partial information problem
Now, let’s say we want to improve on the existing system and
train a new policy using the logged data. It chooses:
(a: 1, x, r: 1) (a: 3, x, r: ?) (a: 2, x, r: ?) (a: 1, x, r: 4) (a: 4, x, r: ?)
If we only use rewards for the actions observed, we get an
expected reward of 5/2 = 2.5. But is this policy actually
better? Not necessarily.
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
91. Contextual bandits & the
partial information problem
Now, let’s say we want to improve on the existing system and
train a new policy using the logged data. It chooses:
(a: 1, x, r: 1) (a: 3, x, r: 0) (a: 2, x, r: 0) (a: 1, x, r: 4) (a: 4, x, r: 0)
Setting unseen rewards to zero doesn’t help, either: now the
policy seems worse (expectation 1.0), but we don’t really
know since we are just guessing unseen rewards.
(a: 1, x, r: 1) (a: 2, x, r: 0) (a: 1, x, r: 3) (a: 1, x, r: 4) (a: 1, x, r: 1)
Let’s imagine we’ve logged the following reward sequence
(expected reward: 9/5 = 1.8):
92. Suppose the actual best sequence (hidden from us) is the
one our new policy would have chosen:
(a: 1, x, r: 1) (a: 3, x, r: 4) (a: 2, x, r: 5) (a: 1, x, r: 4) (a: 4, x, r: 3)
We have a “perfect” policy, with an expected reward of 3.4
(1.9 times better than our previous one), but neither of our
previous attempts at evaluation estimated this well at all.
What we need is a way of filling in fake rewards that is
unbiased, in order to build an unbiased estimator.
Contextual bandits & the
partial information problem
93. In math notation, our previous (bad) zero-filling estimator can
be formalised as follows (reconstruction below):
Contextual bandits & the
partial information problem
Where:
n: the number of logged rounds (actions taken)
x: the features observed during each round
a: the action chosen by the policy during each round
r: the reward observed for the (x, a) pair during each round (missing
observations zero-filled)
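The formula itself was shown as an image on the original slide; a plausible reconstruction from the definitions above:

```latex
\hat{V}_{\text{zero-fill}}(\pi) \;=\; \frac{1}{n} \sum_{i=1}^{n} r\big(x_i, \pi(x_i)\big),
\qquad \text{where } r(x_i, a) := 0 \text{ if no reward for } (x_i, a) \text{ was logged.}
```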
94. In order to overcome these bias issues, we are going to leverage
one piece of information that we can collect but haven’t used
yet: action probabilities (the probability of choosing a
particular arm at a given timestep).
Since a contextual bandit policy both explores and exploits, at
any given time step, there’s some probability a given action
will be chosen.
So, in addition to features x, action a, and observed reward r
at each timestep, we also have p, the probability the action
was chosen, giving us an (x,a,p,r) quad.
Contextual bandits & the
partial information problem
95. Let’s tweak our bad estimator. If our new policy disagrees with
the logged action at any given time, we fill in a zero reward as
before.
Contextual bandits & the
partial information problem
However, if our new policy agrees, we take the observed reward
and weight it by the inverse of the probability it was chosen in
our logged data. This estimator is known as IPS (inverse
propensity scoring, a.k.a. inverse probability weighting).
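A hedged sketch of IPS-based offline evaluation, assuming logged (x, a, p, r) quads and a new policy exposing a best_action method (names are mine):

```python
def ips_estimate(logged, new_policy):
    total = 0.0
    for x, a, p, r in logged:
        if new_policy.best_action(x) == a:   # new policy agrees with the logged action
            total += r / p                   # weight reward by the inverse logging probability
        # If it disagrees, the filled-in reward is zero, so nothing is added.
    return total / len(logged)
```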
96. It is possible to show that an IPS estimator provides an
unbiased estimate of the reward. In fact, the proof is so short
that we can do it now.
Contextual bandits & the
partial information problem
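The proof slides aren't reproduced here, but the standard one-line argument runs as follows. For a single round with features x, where the logging policy picks action a with probability p(a|x):

```latex
\mathbb{E}_{a \sim p(\cdot\mid x)}\!\left[\frac{\mathbb{1}[\pi(x)=a]\, r(x,a)}{p(a\mid x)}\right]
= \sum_{a} p(a\mid x)\,\frac{\mathbb{1}[\pi(x)=a]\, r(x,a)}{p(a\mid x)}
= r(x,\pi(x)).
```

Averaging over logged rounds therefore estimates the new policy's true expected reward, provided every action has a non-zero probability of being logged.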
99. IPS isn’t the only estimator. Other candidates include:
● Direct method (DM): estimate reward directly using a
separate predictor
● Doubly Robust (DR): combine IPS & DM
● Clipping, Weighted IPS, MTR (upcoming)
Contextual bandits & the
partial information problem
100. What does all this mean? We’ll get to the most interesting bit
shortly, but first, let’s return to the problem of actually
implementing a contextual bandit.
Oracle learners
101. As we saw before, you can reduce contextual bandits to
exploration + supervised learning, and use any supervised
learning algorithm as an oracle learner.
Oracle learners
The exploration policy, as before: a supervised classifier oracle
(job: at each timestep, observe the state and output the best
action) plus an exploration strategy (job: at each timestep,
decide whether to choose the best action, or try some other
action). Features in, action out.
102. Let’s say we want to use multiclass logistic regression as the
oracle. Since we don’t observe all possible rewards at each
timestep, we can’t use it directly.
Oracle learners
103. If we did, we’d be learning from incomplete data (as we saw
before) and the classifier wouldn’t work well.
Oracle learners
104. We would also run into massive class imbalance issues, since
the majority of the reward information we do have is for
whatever the logged policy thought was best.
Oracle learners
105. Let’s fiddle around with the data to make it compatible with
oracle classification algorithms.
Oracle learners
(Same picture as before, except the oracle is now fed
modified features.)
106. Given experience (x,a,p,r) and a supervised classification
algorithm, set rewards as follows (for each timestep):
● For the reward of the action that was taken, set r = r/p(a)
● For all other actions, set r = 0
Oracle learners
107. Given experience (x,a,p,r) and a supervised classification
algorithm, set rewards as follows (for each timestep):
● For the reward of the action that was taken, set r = r/p(a)
● For all other actions, set r = 0
This is simply IPS!
Oracle learners
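A sketch of this data modification in code (names are mine; it assumes actions are indexed 0..n_actions-1, and a cost-sensitive or one-example-per-action encoding would work just as well):

```python
# Turn logged (x, a, p, r) quads into fully labelled examples for a classifier oracle.
def to_supervised_examples(logged, n_actions):
    examples = []
    for x, a, p, r in logged:
        rewards = [0.0] * n_actions   # all other actions: reward 0
        rewards[a] = r / p            # the action that was taken: r / p(a), which is simply IPS
        examples.append((x, rewards))
    return examples
```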
108. Result: all missing rewards filled in in an unbiased fashion,
creating a supervised learning problem. The class imbalance
issue is also gone.
It’s really that simple.
Note: not all oracle learners need this tweak, but most classification
algorithms do.
Oracle learners
109. Up next: the last, and most interesting, bit of this talk.
Oracle learners
111. Using an unbiased estimator to fill in missing rewards allows
us to solve contextual bandits with oracle learners. That’s
neat, but not the best part.
Policy evaluation
112. We never explicitly mentioned what assumptions our logged
quads (x,a,p,r) must satisfy in order for us
to estimate rewards in an unbiased fashion.
Policy evaluation
113. Answer: apart from assumptions related to the contextual
bandit setting itself, pretty much nothing.
Policy evaluation
114. We can take any logged experience of the form (x,a,r,p) and
evaluate a new policy offline, just like we do in supervised
learning.
Policy evaluation
115. What if the experience was generated by 10 different policies,
each deployed after the other?
Doesn’t matter.
Policy evaluation
116. What if the experience was generated by a policy using an
entirely different learning algorithm (e.g. gradient boosting vs.
logistic regression)?
Doesn’t matter.
Policy evaluation
117. What if the experience was generated by a policy just
randomly exploring, possibly without any machine learning at
all?
Doesn’t matter.
Policy evaluation
118. Regardless of the policy that generated our experience, we can
use it for training a new policy and evaluating it offline. We
can run hundreds of experiments a day, testing new
hyperparameters, exploration options, learning algorithms,
features etc.
And we can do this without using a simulator, using real
world data collected from real users.
Policy evaluation
119. Putting it all together, this gives us a pretty fantastic recipe for
success:
1. Implement data collection system, collecting quads
(x,a,p,r)
2. Deploy your policy (at first, could even be a random
choice sans machine learning)
3. Train a better policy using experience, deploy
4. Repeat step 3 using your ever-growing experience data
Policy evaluation
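A hypothetical end-to-end version of the recipe, reusing the earlier sketches (EpsilonGreedy, to_supervised_examples, ips_estimate); train_oracle stands in for fitting whatever supervised learner you prefer on the modified data:

```python
def improvement_loop(environment, actions, initial_policy, rounds_per_deploy=10_000):
    policy, logged = initial_policy, []
    while True:
        explorer = EpsilonGreedy(policy, actions)          # 2. deploy with exploration
        for _ in range(rounds_per_deploy):                 # 1. collect (x, a, p, r) quads
            x = environment.observe_features()
            a, p = explorer.choose_action(x)
            r = environment.reward(x, a)
            logged.append((x, a, p, r))
        candidate = train_oracle(to_supervised_examples(logged, len(actions)))  # 3. train a better policy
        if ips_estimate(logged, candidate) > ips_estimate(logged, policy):      # evaluate offline
            policy = candidate                             # 4. deploy the better policy and repeat
```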
123. Supervised learning is useful, but doesn’t really uncover the
right signal in many cases.
Full reinforcement learning does uncover the correct signal and
is causal by nature, but it is also very difficult to apply to
real-world problems because of the sample complexity
required for credit assignment.
Contextual bandits provide a happy medium by relaxing the
full RL setting to only consider immediate rewards.
Summary
124. Summary
The Difficulty Continuum, from straightforward* to hard as nails:
● Supervised learning: incorrect signal; independent of the
number of observations.
● Contextual bandits: rightish signal; independent of the
number of observations.
● Full reinforcement learning: correct signal; depends on the
number of observations.
* not necessarily easy
125. Contextual bandits can be reduced to exploration +
supervised learning, allowing us to take advantage of
ready-made, state-of-the-art learning algorithms.
Contextual bandit policies can be evaluated offline, using
experience quads (x,a,p,r) generated by any previous policy.
A properly implemented contextual bandit learning system is
a self-improving loop: better policies generate more reward,
and provide more data for improving further.
Contextual bandits allow you to solve a host of real-world
problems, using real data instead of simulation, in a causal
manner.
Summary
126. If you have a problem where it is possible to explore, and a desire
to make a machine learning system capable of uncovering new
things, consider immediate-reward RL.