Bachelor’s Thesis
Emil Østergård Johansen
Calculation of Reward Prediction in Behaving Animals
Supervisors: Christian Igel & Jakob Dreyer
January 4, 2015
Contents

1 Introduction
2 Background
  2.1 Dopamine reward system
  2.2 Reinforcement Learning & Prediction Error
  2.3 Model of learning in the brain
3 Modeling
  3.1 Experimental Setup
  3.2 The Model
  3.3 Parameter estimation
  3.4 Results
4 Discussion
  4.1 Local minima
  4.2 Uncertainty
  4.3 Discounting
5 Conclusion
Appendices
Abstract
The theory of reinforcement learning can be applied to modeling learning processes in humans
and animals. The neurotransmitter dopamine is thought to play a central role in the
reward system, encoding the temporal difference error. This thesis reviews the theory of
reinforcement learning along with the dopamine reward system in the brain. A temporal
difference learning algorithm is used to model behavioral data of wild type and dopamine
transporter knockdown (DATkd) mice. The parameters of the model are adjusted using
data-driven optimization. In the literature, significant behavioral differences between wild
type and DATkd mice were explained by differences in the temperature parameter of the
model. However, the experiments in this thesis showed that there are several local minima
in the parameter space, which allow for alternative explanations of the observed behavioral
differences.
1 Introduction
How does the brain work? Understanding this fundamental question is the main objective
in neuroscience and many other disciplines. Among the functions of the brain, learning
remains an area of key interest. Behavioral psychology has paved the road for a massive
body of knowledge to which famous names such as B.F. Skinner (the Skinner box) and Ivan
Pavlov (classical conditioning) have contributed. While behavior is readily observed and can be quite informative for hypothesizing about the inner mechanics of learning in the brain, neurophysiology, the branch of physiology concerned with the functioning of the brain, offers a detailed look at what ‘makes it all tick’.
The brain consists of a myriad of living cells among which the most important are neurons
(colloquially known as brain cells) and glial cells. The intriguing property of this mass of cells is that it is highly organized. Neurons communicate with one another through cable-like projections, and each neuron can communicate simultaneously with upwards of
10,000 other neurons. Connections are strengthened and weakened dynamically over time,
but the gross organization appears static. This is why it is possible for neurosurgeons to
perform complicated operations with a high degree of confidence that they can predict the
consequences, and it is also the reason why researchers can investigate the physiology of
specific structures of the brain with 10−6m precision.
The discovery of the neurotransmitter dopamine, and subsequently of networks in the brain where dopamine plays a central role, kick-started intense research showing that dopamine is responsible for much of the behavior observed by psychologists. A series
of experiments on monkeys [Schultz, 1998] let researchers directly monitor the activity of
dopamine neurons during learning tasks and a clear correlation was observed.
While neuroscientists have investigated the inner workings of the brain, computer sci-
entists have tried to teach machines to learn and while doing so have developed a group
of algorithms under the common term ‘reinforcement learning’. A recurring quantity in these algorithms, namely the prediction error, was surprisingly observed to imitate the
behavior of the dopamine neurons monitored in the experiments by Schultz [1998] and since
then reinforcement learning algorithms have been used to model behavioral data.
The work by Beeler et al. [2010] is an example of combining neuroscience with reinforce-
ment learning. They investigate the effect of the background activity of dopamine neurons on observed behavior by presenting two groups of mice with the same learning task. One group has been genetically modified to have increased background activity of dopamine neurons, and a reinforcement learning model is used to compare various parameters between the two groups.
It is the objective of this thesis to describe the background theory of learning in the
context of both the dopamine system and of reinforcement learning. The data obtained by
Beeler et al. [2010] will then serve as a framework for applying this knowledge to an actual
modeling problem and stimulate a discussion of these methods. There is a large amount of
literature on the subject of modeling behavioral data and it is not within the scope of this
thesis to review the newest findings nor implement them. The background review is based
mostly on classical references.
2 Background
2.1 Dopamine reward system
Various theories about the organization of the brain have been proposed throughout the
centuries. In the 18th century Franz Joseph Gall founded the school of phrenology which
claimed that various bumps on the external surface of the cranium would predict various
psychological traits about the person being examined. When focus shifted to the brain,
this idea paved the road for the functional specialization theory, namely that each cognitive
function and psychological trait corresponded to a bounded area in the brain.
Lesion case studies, where the subjects of study were patients with a brain tumor, internal bleeding or other kinds of local damage to an area of the brain, produced many interesting results that appeared to prove this theory. The most famous of these patients was Phineas Gage, who miraculously survived after having an iron rod driven through his skull, destroying his left frontal lobe. Gage survived with most cognitive skills intact but with a personality change that rendered him unrecognizable to his friends and family. This suggested that the frontal lobes are somehow responsible for personality. The famous Broca's area in the inferior (lower) part of the frontal lobe, which supposedly is responsible for language, was similarly discovered by observing that patients with damage to this area all appeared to have trouble with speech production.
Other theories are of course also being explored. The opposite of functional specialization is the theory of distributed processing, where cognitive functions are assumed to be networks within the brain, not bound to a specific area [Uttal, 2001].

Figure 1: Publications on various neurotransmitters throughout the years. Source: Google Scholar.
Whatever the organization of function, pathways, which consist of the projections of axons from a collection of neurons in one part of the brain to another part, have been thoroughly documented, and a group of these pathways is of special interest when investigating motivation and reward.
Dopamine was discovered as an independent neurotransmitter by Carlsson and colleagues (1957) at Lund University, and research on dopamine has been increasing ever since (Figure 1). Dopamine was first associated with motor function [Wise, 2004], but later also with motivational behavior, first described in Ungerstedt [1970], where feeding and drinking deficits were observed after inducing damage to the mesolimbic dopamine system. Wise and Schwartz [1981] showed that rats injected with the dopamine D2 receptor blocker pimozide did not learn to press a lever in order to obtain food and water, whereas rats that were not injected did learn this task. This further suggests that dopamine plays a key role in the modulation of motivation and reward-based learning.
The dopaminergic system in the brain consists of several pathways that together span most of the brain but all emanate from the midbrain (Figure 2). The most important pathway in regard to reward learning is the mesolimbic pathway, which originates from dopaminergic neurons in the ventral tegmental area (VTA) in the midbrain that project their axons especially to the cortex.
In a series of groundbreaking experiments, starting with Olds and Milner [1954], a rat
was placed in a box with a lever which upon pressing delivered electrical stimulation to the rat's brain through an implanted electrode (see Figure 3). In this set-up the rat would first step on the lever by accident and then retreat away from the lever. However, it would
soon come back and press again, and as time went by it would spend most of its time
pressing the lever. In extreme cases rats were observed to press the lever continuously until
fainting from exhaustion. These electrical self-stimulation experiments were repeated, each
time changing the position of the stimulation electrode in the rat brain, and the structure
identified as producing the strongest behavioral response was the ventral tegmental area
(VTA). Blocking dopamine receptors reduced the degree of self-stimulation so the conclusion
was that the rats worked to release dopamine, and thus dopamine served as a reward that
reinforced the behavior that caused the release of it.
Figure 2: Dopaminergic system overview (adapted from Felten et al. [2003]).

Wise and colleagues have carried out numerous self-stimulation experiments throughout the years and located many of the structures involved in motivation and reward by noting
that when they observed a positive result (boosted reward experience of the animal seen by
repeated self stimulation), the stimulation electrode was located either in close proximity to
axons of dopaminergic neurons or to axons that terminated at dopaminergic neurons [Schultz,
2002].
The important experiments which gave rise to the data essential for this thesis were
done by Schultz and his colleagues (various articles 1990-1994) on behaving monkeys. In
the experiments firing patterns of dopaminergic neurons were measured in vivo (while the
monkeys were alive) as the monkeys performed various learning tasks. They were placed in
front of a box which was covered to prevent vision of its contents. Electrodes were placed near a population of dopamine neurons in the monkey's brain, and the monkey was then allowed to insert its hand into the box, which could contain either nothing, a reward (an apple) or a neutral object (a string).
Figure 4A shows a transient increase in dopaminergic neuron firing when the hand of the
primate touches an apple (upper figure) inside the box and an absence of this firing when
there is no apple (lower figure). Control experiments were performed (Figure 4B and C)
which confirmed that: 1. The qualitative observation was indeed due to touching an apple
(a reward) and not just any object (a string was used in place of an apple). 2. The same
result was not due to movement of the arm alone.
The more interesting result, however, is found in Figure 4E where the movement of
reaching for the apple inside the box was initiated by a stimulus, in this case the opening of the door to the box. In the control trial (Figure 4D) the dopaminergic neurons exhibit a transient increase in firing rate, as we would by now expect, when the hand reaches the reward. The movement here is self-initiated. When the movement is initiated by the door-opening stimulus, the neural response is at first observed when the monkey reaches the reward, similarly to the control. However, after learning the association between stimulus
and the possibility of finding a reward, the increase in firing rate is observed at the time of
the stimulus and not at the time of discovery of the reward. This observation was replicated
in Ljungberg et al. (1992), also with monkeys. It was concluded that this transferring of
neuronal response happens over the course of learning. This is an impressive direct view of a change on the cellular level in the brain that happens during learning.

Figure 3: Self-stimulation experimental setup (adapted from Bear et al. [2007]).
Let us look not at the transferring of neuronal activity but just at the decrease of activation at the time of reward. How is this decrease in phasic activation at the time of reward to be interpreted? An obvious interpretation is that the animals develop an insensitivity to the rewards and thus less activation in the dopamine neurons is observed. This was, however, falsified in [Mirenowicz and Schultz, 1994]. Rather, it seems that the degree of activation somehow reflects the uncertainty or unpredictability [Schultz, 2002] of the reward. In the
first trials, when the reward is unexpected, the activation of dopamine neurons is at its
highest. Then slowly over the course of learning the activity decreases until it is not present
at all. This can be interpreted as a form of learning or in other words, a steady increase in
the certainty that the reward will be presented.
When the reward, against expectation, is deliberately withheld or delayed, a significant drop in dopamine neuron activity is observed [Schultz et al., 1993], which seems like
a natural extension of the unpredictability theory. The drop in activity occurs exactly at
the time where the reward is usually delivered which suggests a sensitivity to the timing of
the reward in addition to the occurrence. The important theme in these individual neuron
experiments by Schultz and others is the unpredictability of reward, as it has come to be
termed, in both timing and occurrence. It is becoming evident that the phasic activity (as
opposed to the constant tonic firing) of dopamine neurons can be interpreted to be pro-
portional to the error in the animals’ prediction of the reward: When the expectation of a
reward is low or non-existent (as it would be prior to any learning task), a large amount
of activity is observed when a reward is presented - contrary to expectation. However, over
time, as the animal becomes surer that the reward follows the stimulus, the unpredictability decreases and with it the prediction error, and thus smaller neuronal activity is observed. The term ‘prediction error’ is important because it is a cornerstone of reinforcement learning algorithms, and it is in this quantity that the connection was made between the results discussed here and the reinforcement learning theory most notably described
and developed in Sutton and Barto [1998].
Figure 4: Mesencephalic dopamine neuron firing responses in the experiment. From [Schultz and Romo, 1990].

Figure 5: Transferring of predictive firing. From [Schultz, 1998].
2.2 Reinforcement Learning & Prediction Error
The prediction error, as introduced in the previous section, is a central theme in modern conditioning theories [Rescorla et al., 1972, Schultz, 2002]. In these theories it is suggested that in order for learning to take place, the prediction error must be different from zero. A positive or negative prediction error will lead to the acquisition or extinction of behavior.
To see why this is so, imagine a dog being presented with a stimulus such as the ring of a
bell. In all likelihood the dog predicts that nothing will happen after the bell has sounded.
And if nothing happens, the dog will continue to expect nothing to happen the next time the bell rings. But if something, contrary to the dog's prediction, does happen, there is a prediction error between the dog's prediction and what actually happened, and as a result the dog changes its prediction for the next time the bell rings; in other words, it learns.
This theory is pleasing because it fits well with our intuitive understanding of learning and
the classic Pavlovian dog experiments.
It should by now be increasingly clear that the prediction error observed in the single dopamine neuron experiments by Schultz and the prediction error observed in learning tasks in classical conditioning (such as the dog example above) are one and the same, and it is this subtle point that provides the main motivation for both the data set used in this thesis and the model which will be introduced later. First, however, I think it is appropriate to clarify some terms.
In conditioning we distinguish between two main fields of study: classical and instrumental (operant) conditioning. The main difference is that the actor in classical conditioning has no influence on whether it is presented with a reward or not; in other words, reward delivery is completely automatic. In instrumental or operant conditioning, the opposite is true. Here the actor's interaction with the environment can influence the chance of whether
or not a reward is presented. A system where the actor is presented with positive rewards
and negative rewards (punishments) which provide the basis for learning, is what is known
as a reinforcement learning system which is studied widely in both psychology and computer
science for machine learning. Reinforcement is here taken to mean the strengthening of
stimulus-stimulus, stimulus-response, or stimulus-reward associations that result from the
timely presentation of a reward [Wise, 2004].
Many results from psychological classical conditioning experiments can be predicted by a simple learning rule which is one of the classical references in the field, namely the Rescorla-Wagner rule [Rescorla et al., 1972]. It offers a simple linear relationship for pre-
dicting reward. If we denote v as the expected reward, u as a binary variable representing
the presence or absence of a stimulus and w as a weight associated with the stimulus, then
the expected reward is predicted as
v = wu (1)
The learning takes place in updating the value of the weight w with an update rule that is designed to minimize the prediction error, that is, the squared error between expected reward and actual reward, (r − v)^2 [Dayan and Abbott, 2001]. This is what is known as the Rescorla-Wagner rule

w ← w + ε δ u    (2)

where
• δ = r − v is the prediction error
• ε is a learning rate determining how fast the learner incorporates new information into prior knowledge
If a reward follows every time a stimulus is presented, then w will converge to 1 as the number
of trials increases. If the reward is stochastic and normally distributed, w will converge to
the theoretical mean of that distribution.
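As a minimal sketch of the update in (2) for a single stimulus with a stochastic binary reward, the following MATLAB lines illustrate the convergence; the learning rate and the reward probability are arbitrary example values.

% Minimal sketch of the Rescorla-Wagner update (2) for a single stimulus.
% The learning rate and the reward probability are arbitrary example values.
epsilon = 0.1;           % learning rate
p_reward = 0.7;          % probability that the stimulus is followed by a reward
w = 0;                   % association weight; v = w*u with u = 1 on every trial
for trial = 1:500
    r = double(rand() < p_reward);  % stochastic binary reward
    delta = r - w;                  % prediction error (u = 1, so v = w)
    w = w + epsilon * delta;        % Rescorla-Wagner update
end
fprintf('Final weight %.3f, expected reward %.1f\n', w, p_reward);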
As already mentioned, this is a model of classical conditioning, meaning that the actor does not interact with the environment and hence has no influence over whether a reward is presented
or not. It does a poor job modeling instrumental conditioning because there is no notion of
time or interaction between the actor and the environment in the model. For this we need a
slightly more complicated setup.
Reinforcement learning is a problem of optimal control in the sense that the actor, who is
learning, has some control over the environment. Actions lead to rewards or the lack thereof
and the actor seeks to reap the largest amount of rewards. In order to describe this problem
mathematically we need some clearly defined abstractions:
• a set of states S
• a set of actions A
• a transition function T
• a reward function R
• a value function V
• a policy π
For the remainder of this introduction we will use the experimental setup described in
Section 3.1 as an example. This setup consists of a mouse in a cage with two levers that
deliver rewards in the form of food. Obviously time is continuous in the real world, but for modeling purposes it is beneficial to discretize the world into discrete time steps t and states s_t ∈ S, where s_t is the state at time t and S is the set of all possible states.
From the mouse’s point of view, the way to transition between these states is through
actions which, like states, come from a set of actions so the action at time t we write as
at ∈ A. A certain action then, like pressing a lever, alters the state of the environment
iterating t to t + 1 so that the new state is st+1 ∈ S. The next thing we need is a function
that maps state s_t and action a_t to the next state s_{t+1}. This we call the transition function T : (s_t, a_t) → s_{t+1}, and it describes the laws that govern the interaction between the mouse and its environment. It is also termed an environment model. The environment can be deterministic in the sense that a specific action will always lead to a specific state, but often it is stochastic, so that T is a set of transition probabilities.
Finally there is the reward function R : (s_t, a_t) → r_t, which maps a state and an action to the probability of a reward. Often r_t is a binary variable which represents the presence or absence of reward at time t, just like the stimulus variable u in the Rescorla-Wagner setup.
With this set of abstractions we have a well-defined environment S, a way for the mouse to interact with the environment through actions A, and a transition function T which tells us how the environment changes under each action. Finally, the incentive for learning, the rewards, are also well defined by R.
Now comes the interesting part, namely how the mouse chooses what action to take
in order to maximize reward. The way we model this is to assume that the mouse keeps
some value estimate for each possible action so that by choosing the action with the highest
estimated value, it can maximize reward. This is called the value function V : a_t → v_t. The mouse updates this value function based on the rewards it receives, a trial-and-error concept. If V^*(a) is the true value for action a, the mouse obviously tries to achieve the following property:

lim_{t→∞} V_t(a) = V^*(a)    (3)
A simple way to update V is inspired directly by the Rescorla-Wagner rule, where the weight w is replaced with the value and the stimulus u is omitted:

v_t ← v_t + α δ_t    (4)

Here δ_t = r_t − v_t, and the learning rate ε has been renamed to α to keep the two update rules separate.
As already suggested, the mouse chooses its actions based on the value function. The
function that encapsulates this behavior of the mouse is termed its policy, π : s −→ a. The
policy π is in words a function that maps from a state to an action. This will usually have
the form of a simple look-up table although it can be imagined to take more complex forms.
The process of updating value estimates and thus the policy is known as policy optimization.
The relationship between the value function and the policy can vary from model to model.
If the policy of the mouse is designed so that it always chooses the action which has the highest value estimate (highest expected value), then this is known as the greedy choice, and it is one of two extremes. The other extreme is to completely disregard the value estimates and assign equal probabilities to choosing each action. Many different strategies have been proposed. One easy-to-understand strategy is the so-called ε-greedy method (this ε is not to be confused with the learning rate ε from the Rescorla-Wagner rule). Here ε is a number between 0 and 1 which controls the balance between choosing the greedy action and choosing randomly among all actions. Another approach is to assign weighted probabilities according to the value
estimates V (a) for each action. A common implementation of this idea is to use the Gibbs
distribution, given as

e^{V(a)/β} / Σ_{b=0}^{k} e^{V(b)/β}    (5)

to calculate the probability weights [Sutton and Barto, 1998]. This is also known as the softmax selection rule; it chooses stochastically among all actions, but with higher-value actions weighted more heavily. More specifically, it computes the probability P(a_t = a). This model
is used in the data modeling of this project.
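As a minimal sketch, the softmax selection in (5) can be implemented in a few lines of MATLAB; the value estimates and the temperature below are arbitrary example values.

% Minimal sketch of softmax (Gibbs) action selection as in (5).
% The value estimates and the temperature are arbitrary example values.
V = [0.2, 0.8];                          % value estimates for two actions
beta = 0.3;                              % temperature
p = exp(V / beta) / sum(exp(V / beta));  % selection probabilities P(a_t = a)
a = find(rand() < cumsum(p), 1);         % sample one action according to p
fprintf('P = [%.3f  %.3f], chose action %d\n', p(1), p(2), a);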
The fact that the value estimates converge as in (3) is less interesting than how fast they converge. That is, not only does the mouse want to optimize its policy to yield the largest value from its future actions, but it also wants to optimize the rewards it gets during learning, as it cannot go on learning forever. This dilemma is known as the exploit-explore problem and it is central to reinforcement learning.
The following is based on the introduction to temporal difference learning in Dayan and
Abbott [2001]. As suggested above, we interpret the value function V as the estimated value of an action. To elaborate, what we really mean is that at time t, V(t) is the total future reward expected from time t onward to the end of the trial:

Σ_{τ=0}^{T−t} r_{t+τ}    (6)
Usually future rewards are discounted which means that future rewards are assigned a lower
value than immediate rewards. This is intuitive in the sense that we can imagine an actor to
have more incentive to obtain a reward immediately rather than obtaining the same reward
a year later. This is modelled with a discounting factor γ where 0 ≤ γ ≤ 1. If γ = 0 we say
that the actor is myopic [Sutton and Barto, 1998] or short-sighted meaning that it does not
take into account future rewards but is only interested in maximizing immediate rewards.
As γ approaches 1, the actor becomes more and more far-sighted. So we rewrite (6) as

Σ_{τ=0}^{T−t} γ^τ r_{t+τ}    (7)
In this setting, the prediction error δ(t) must then be the difference between the actual future rewards and the expected total future reward:

δ(t) = Σ_{τ=0}^{T−t} γ^τ r_{t+τ} − v_t    (8)
This, however, is impossible to compute at time t because we do not yet know the actual
future rewards. This is solved by noting that the first term in (8) can be rewritten as

Σ_{τ=0}^{T−t} γ^τ r_{t+τ} = r_t + Σ_{τ=0}^{T−t−1} γ^{τ+1} r_{t+1+τ} = r_t + γ Σ_{τ=0}^{T−t−1} γ^τ r_{t+1+τ}    (9)
and, as we just discussed, v_{t+1} is an estimate of the total future reward from time t + 1 onward, so for the latter term

γ Σ_{τ=0}^{T−t−1} γ^τ r_{t+1+τ} ≈ γ v_{t+1}    (10)
so the prediction error becomes

δ(t) = r_t + γ v_{t+1} − v_t    (11)

and we insert this in (4) and get the update rule

v_t ← v_t + α (r_t + γ v_{t+1} − v_t)    (12)
The difference v_{t+1} − v_t gives this method its name, the temporal difference method. At first glance it looks like we again need information about the future (v_{t+1}) to compute the prediction error. One must remember, however, that v_t designates the value of the action taken at time t and, similarly, as soon as we know the action a_{t+1}, which is computed with e.g. the softmax selection rule, we know the value estimate v_{t+1}. This method is called a bootstrapping method [Sutton and Barto, 1998] because it uses an estimate for estimation.
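As a minimal sketch, a single temporal difference update following (11) and (12) looks as follows in MATLAB; all numerical values are arbitrary examples.

% Minimal sketch of one temporal difference update, equations (11)-(12).
% All numerical values are arbitrary examples.
alpha = 0.1;      % learning rate
gamma = 0.9;      % discount factor
r_t      = 1;     % reward observed at time t
v_t      = 0.5;   % value estimate of the action taken at time t
v_tplus1 = 0.7;   % value estimate of the action selected at time t+1
delta = r_t + gamma * v_tplus1 - v_t;  % prediction error (11)
v_t   = v_t + alpha * delta;           % bootstrapped update (12)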
We now have the complete framework for classical reinforcement learning in order. We have defined the environment and the actor's interaction with this environment through the sets of states S and actions A, and we have defined how the actor selects its actions based on the
value estimates it holds about each action (its policy). Finally we have shown how the actor
can update its knowledge by observing the rewards it receives (12).
2.3 Model of learning in the brain
In this section I want to formalize the connection between the dopamine system and rein-
forcement learning a bit further. Beyond the intuitive idea that the brain uses the prediction error for learning, a well-defined model of what the various functions in reinforcement learning represent has been proposed [Montague et al., 1996] and is depicted in Figure 6.
In this model, information about the task to be learned is stored in the cortex which con-
sists of different modalities. When some task is engaged, the cortex outputs its information
about this task to an intermediate layer which represents possible information processing in
other parts of the dopamine system (See Figure 2) . This information can be both excita-
tory (encouraging) or inhibitory (discouraging). For example some modality of the cortex
can discourage the task because it is learned to be dangerous while another modality of
the cortex can encourage the task because it often yields food and the actor is hungry. All
these outputs are weighted individually and summed in our familiar value function V(t). In the figure, V̇(t) represents a temporal difference, which we recognize as V(t + 1) − V(t) (see Section 2.2). The information is then passed downwards to a group of dopamine neurons P, for example in the ventral tegmental area, which also receives external information about the presence or absence of a reward r(t) and other factors which may have influence over the output from P. It is in the dopamine neurons that the prediction error δ(t) is computed simply as the net input

δ(t) = r(t) + V̇(t) = r(t) + V(t + 1) − V(t)    (13)
and then output as a dopamine signal back to the cortex (again refer to Figure 2). This
model is in accordance with the experiments by Schultz et al. discussed in Section 2.1 where
it was observed that the dopamine signal in monkeys appeared to represent the prediction
error δ(t). We also note that (13) is a special case of (11) where γ = 1.
Figure 6: Model of information flow through the dopamine system. Adapted from Montague et al. [1996].
The model in Figure 6 explains how the
prediction error and the consequent learning
might be physiologically instantiated in the
brain. The other part that needs an inter-
pretation is the action selection policy. The
softmax action selection model (5) was in-
troduced in Section 2.2 and it is the pre-
ferred model for action selection in reinforce-
ment models as it has been observed to fit
well with behavioral data [Daw and Doya,
2006]. The variable parameter of the soft-
max model is the β parameter also called
‘temperature’, an analogue to the movement
of molecules under different temperatures.
As β −→ 0, the probability of selecting the
action with the highest value approaches 1.
Conversely, as β −→ ∞ the probability is
spread out so that all actions have the same
probability of being selected.
In Figure 7 a simulation of a famous instructional reinforcement learning problem is
shown. The problem is that of a two-armed bandit. Each arm has its own probability of
yielding a reward and it is up to the learner to balance exploitation of learned knowledge of
which lever is more likely to yield a reward and exploration of the other lever in case it might
actually be better. Figure 7 shows a game that lasts 2000 plays. At play 400, the reward-yielding probabilities of the levers are switched in order to see how fast the model is able to adapt to this. The result has been averaged over 2000 episodes to smooth out the curves
sufficiently. The choice of which arm to pull was made with the softmax action selection rule
and the result for four different values of β is shown. As is expected, the most conservative
player with β = 0.01 realizes the change the latest and is slowest at adapting. Similarly the
player with β = 0.5 is the quickest at realizing the change and adapting. There are many
qualitative attributes that are interesting to investigate (e.g. it seems that a medium value
is better in the long run, which it is), but overall it serves as a nice illustration of how β
affects the ability to learn and adapt. Although there are hypotheses as to how the brain controls this, it is still an open question [Beeler et al., 2010, Daw and Doya, 2006].
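A single episode of such a two-armed bandit simulation could be sketched in MATLAB as below; the reward probabilities and the learning rate are assumed example values (they are not stated in the text), while the number of plays, the switch at play 400 and the softmax rule follow the description above.

% Sketch of one episode of a two-armed bandit simulation as in Figure 7.
% The reward probabilities and the learning rate are assumed example values.
n_plays = 2000; switch_at = 400;
p_reward = [0.8, 0.2];        % assumed reward probabilities of the two arms
beta = 0.1; alpha = 0.1;      % temperature and learning rate (example values)
V = [0, 0];                   % value estimates for the two arms
reward_history = zeros(1, n_plays);
for t = 1:n_plays
    if t == switch_at
        p_reward = fliplr(p_reward);          % the two arms are swapped
    end
    p = exp(V / beta) / sum(exp(V / beta));   % softmax selection (5)
    a = find(rand() < cumsum(p), 1);          % sample an arm
    r = double(rand() < p_reward(a));         % stochastic binary reward
    V(a) = V(a) + alpha * (r - V(a));         % value update as in (4)
    reward_history(t) = r;
end
% Averaging reward_history over many such episodes gives curves like Figure 7.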
Figure 7: Two-armed bandit simulation with different temperature values.
3 Modeling
3.1 Experimental Setup
The data used for modeling in this thesis is the same data used in Beeler et al. [2010].
The main objective of the original experiment was to investigate whether tonic dopamine
modulates learning. This was done by comparing the performance of dopamine transporter
knockdown (DATkd) mice against wild type (normal) mice (n = 10 for both groups).
A home cage operant paradigm was used in the experimental setup, which means that
mice earn their food entirely through lever pressing. Each mouse lives in its own cage for
10 consecutive days where water is freely available, but food can only be obtained through
lever pressing. In each cage are two levers, each of which has a probability of yielding food upon pressing. When the probabilities of food release for the two levers are not equal, one lever has a lower probability of food release than the other, making the latter ‘cheaper’ in terms of the amount of work required to gain a reward. Which lever is the cheaper one changes every 20-40 minutes, thus constantly offering a learning opportunity for the mouse.
The data is recorded as event codes and accompanying time codes for each event with a
temporal resolution of 100 ms. Events are e.g. ‘left lever pressed’, ‘reward delivered’, ‘lever
probabilities interchanged’ etc.
For the parameter estimation, the data has been converted into two vectors: a binary choice vector c, where an element c_t is 1 or −1 depending on which lever is pressed. c contains the entire choice sequence for a single mouse concatenated over ten days, which is about 50000 events. The other vector is a binary reward vector r of the same length as c. An element r_t is 1 if a reward was delivered when lever press c_t was performed and 0 otherwise. All temporal information (the time codes in the original data set) has been removed, since the model only predicts what choices are made in which order, not when they are made.
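As a purely illustrative toy example (the values are invented, not taken from the data set), the two vectors have the following form:

% Toy example of the data representation; all values are invented.
% c(t) is the lever pressed at event t, r(t) whether that press was rewarded.
c = [ 1  1 -1 -1 -1  1 ];   % choice vector, coded 1 / -1 for the two levers
r = [ 0  1  0  0  1  0 ];   % reward vector, 1 if the press delivered food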
3.2 The Model
Figure 8
In the original work, Beeler et al. [2010] fit two models to the behavioral data described in the previous section. One is a general logistic regression model with the choice variable c as the dependent variable and the 100 previous rewards r_{t−100}, ..., r_{t−1} for each t as the explanatory variables. The result of this model is shown in Figure 8, where
left on the horizontal axis corresponds to
the most recent rewards. A positive coeffi-
cient value indicates that a reward tends to
promote staying on the same lever whereas
a negative value indicates switching to the
other lever. Intuitively we would expect the most recent rewards to promote staying on the same lever more strongly than rewards further in the past, due to e.g. memory. This holds true for the figure shown here except
for the very most recent rewards where coefficient values tend to −1. This means that the
mice tend to switch to the other lever one or two time steps after receiving a reward but in
the longer run they follow a familiar learning pattern.
A simple softmax action selection model as in (5) will not predict this behavior but rather
that a reward received immediately prior to the current time step will promote staying on
the same lever the strongest. So this model has to be extended. The simple softmax model for two possible actions (levers) takes the following form:
P(c_t = 1) = e^{V_t(1)/β} / (e^{V_t(1)/β} + e^{V_t(−1)/β}) = 1 / (1 + e^{−(V_t(1) − V_t(−1))/β}) = σ(β_V (V_t(1) − V_t(−1)))    (14)

where σ(z) = 1/(1 + e^{−z}) is the logistic function and β_V = 1/β is the inverse temperature. Whether one uses the temperature or its inverse is simply a matter of convention, and since the inverse is used in the original article, we adopt this notation here. Similarly for P(c_t = −1).
The model used in Beeler et al. [2010] is as follows:
P(c_t = 1) = σ(β_V [V_t(1) − V_t(−1)] + β_1 + β_c c_{t−1} + β_S [S_t(1) − S_t(−1)])    (15)

where
• β_V is the inverse temperature for V
• V_t = V_{t−1} + α_V (r_{t−1} − V_{t−1}) is the value function
• β_1 is the bias term towards one lever
• β_c is the bias towards the previously pressed lever
• β_S is the inverse temperature for S
• S_t = S_{t−1} + α_S (r_{t−1} − S_{t−1}) is the short-term value function

Apart from the simple bias terms, the model has been augmented relative to (14) to include an additional value function S, which captures the short-term effects observed in the logistic regression result (Figure 8). By forcing the parameter β_S to be negative we make sure that the last term has the desired effect.
There are six degrees of freedom in this model, namely the parameters β_V, α_V, β_1, β_c, β_S and α_S, so given a value for each parameter we are able to calculate the action probabilities at time t. In symbols this probability is written as

P(c_t = 1 | β_V, α_V, β_1, β_c, β_S, α_S)    (16)
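To make (15) concrete, a single evaluation of the choice probability could look as follows in MATLAB; the parameter and state values below are invented for illustration, not fitted estimates.

% Sketch of evaluating the choice probability (15) for one time step.
% All parameter and state values are invented, not fitted estimates.
betaV = 35; beta1 = 0; betac = 3; betaS = -7;
V = [0.4, 0.3];      % long-term value estimates for levers 1 and -1
S = [0.6, 0.1];      % short-term value estimates for levers 1 and -1
c_prev = 1;          % previously pressed lever, coded 1 or -1
z = betaV*(V(1) - V(2)) + beta1 + betac*c_prev + betaS*(S(1) - S(2));
P_press_1 = 1 / (1 + exp(-z));   % logistic function sigma(z)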
3.3 Parameter estimation
In this section, the method for parameter estimation is established. As we just saw, given any
six parameter values, we can compute the probability of the choice at time t with (15). In fact we not only get the probability for the choice at time t; since the value functions are defined recursively, depending solely on the parameters and on the previous values, we get choice probabilities for the whole sequence. So by finding good estimates of the parameters we mean finding parameters that will produce data that is, to the highest degree possible, like the training data, which is the actual choice sequence of a mouse. So if we, given parameter values, can compute the probability that a single choice equals some actual choice (c_t = 1 or c_t = −1) as
P(c_t | β_V, α_V, β_1, β_c, β_S, α_S) =
    P(c_t = 1)        if c_t = 1
    1 − P(c_t = 1)    if c_t = −1

where P(c_t = 1) is given in (15), and if we assume that each choice is independent of all other choices, we can compute the probability of an entire choice sequence as the product

L(β_V, α_V, β_1, β_c, β_S, α_S | c_1, c_2, ..., c_T) = Π_{t=1}^{T} P(c_t)    (17)
Note that we here assume the choice sequence to be given and let the parameters vary. This
is what is known as the likelihood function and by maximizing this function we can get the
parameters that best fit the actual data. For several reasons it is more practical to convert
the product into a sum by taking the logarithm so that the likelihood function becomes
log L(β_V, α_V, β_1, β_c, β_S, α_S | c_1, c_2, ..., c_T) = Σ_{t=1}^{T} log P(c_t)    (18)
Since the logarithm is a monotone transformation, it will have the same extrema as the
transformed function. Furthermore it is also common to use the negative of the log likelihood
which means the likelihood function should be minimized rather than maximized.
The likelihood computation is outlined in Algorithm 1. This procedure returns the like-
lihood for a set of parameters. Since the choice probabilities depend on the value functions,
and the value functions are recursively defined, the likelihood has to be computed iteratively.
The value functions V and S are both initialized at 0; experimentation with other initial values showed that the results did not differ significantly except for the first few time steps. After implementing this procedure in MATLAB, the built-in optimizer fmincon was used to obtain the results presented in the following section.
input : β_V, α_V, α_S, β_S, β_1, β_c
output: Likelihood (scalar)
// Initialize value functions and likelihood
V_1 = [0, 0]
S_1 = [0, 0]
Likelihood = 0
for t = 2 to end of choice sequence do
    c_t = action chosen at time t
    c_other = action not chosen at time t
    // Update value functions, for the chosen action ...
    V_{t,c_t} = V_{t−1,c_t} + α_V (|r_{t−1}| − V_{t−1,c_t})
    // ... and for the action not chosen
    V_{t,c_other} = V_{t−1,c_other} + α_V (0 − V_{t−1,c_other})
    // Same for the S-values
    S_{t,c_t} = S_{t−1,c_t} + α_S (|r_{t−1}| − S_{t−1,c_t})
    S_{t,c_other} = S_{t−1,c_other} + α_S (0 − S_{t−1,c_other})
    // Update choice probabilities
    P_{t,1} = σ(β_V (V_{t,1} − V_{t,2}) + β_1 + β_c c_{t−1} + β_S (S_{t,1} − S_{t,2}))
    P_{t,2} = 1 − P_{t,1}
    // Update the likelihood with the negative log of the probability of the choice at hand
    Likelihood = Likelihood − log P_{t,c_t}
end
return Likelihood

Algorithm 1: Parameter estimation using maximum likelihood estimation.
3.4 Results
For the fitting a choice of start values and boundaries had to be made. As for start values,
seeing that the primary objective is to replicate the results obtained in Beeler et al. [2010], it
appears natural to choose their parameter estimates as start values. As for boundaries, we
want to make sure that the optimization procedure is searching within the same subspace of the solution space, so boundaries that are relatively tight around the target solution were selected. For the learning rates α_V and α_S, the predetermined boundaries between 0 and 1 were used. After running the fitting procedure, the following results were returned.
Parameter    Wild-type    DATkd     t       p
β_V          36.80        35.02     0.482   0.636
α_V          0.724        0.638     0.378   0.710
α_S          0.361        0.501     0.721   0.480
β_S          −7.633       −6.923    0.317   0.756
β_1          0.087        −0.161    0.851   0.406
β_c          3.291        3.184     0.162   0.413
The parameter estimates are sample means over 10 mice each and the comparison of
means was done with a two-sample t-test. The lowest p-value is 0.406 so we cannot reject
the null-hypothesis that the means are equal for any of the parameters. This is in accordance
with Beeler et al. [2010] except for the parameter βV which in their work was found to be
significantly different when comparing wild-type and DATkd mice. After several experiments
it became evident that the solution is highly dependent on the start values given. This
suggests that the solution space contains many local minima even within relatively tight
boundaries on the parameters. To investigate this further, I performed 100 fits on two different mice with 100 random start values. This gives 100 solutions in R^6. In order to visualize them, multidimensional scaling (MDS) was used to reduce the dimensions to two. MDS is a way to visualize the similarity or distance between high-dimensional data points in a lower dimension as precisely as possible. The ‘error function’ of MDS is termed the strain function, and by plotting the strain as a function of the number of dimensions (a scree plot), an acceptable number of dimensions for visualization can be determined. In this case two is appropriate.
Figure 9: MDS output of 100 parameter fits on two different mice. Similar colors represent similar likelihood.

The result of the MDS analysis is shown in Figure 9. Each point is a solution to the optimization problem and the solutions are color-coded according to their negative log-likelihood
and the hollow blue dots have the lowest value. Several qualitative traits meet the eye.
Firstly, there are two clear clusters of solutions to the left in each plot and the group to
the right could also be interpreted as one or two clusters. Secondly, solutions appear to be
arranged vertically with very little variation in the horizontal direction. In order to interpret this, one must remember that in a dimensionality reduction analysis like MDS, the axes in the lower dimension do not have a clear interpretation, because the plane formed does not necessarily lie along any of the original axes. So an arrangement of solutions parallel to the vertical axis, as in this case, is not the variation of a single parameter, but rather some combination of parameters that in this case does not change the likelihood value. This offers a nice visualization of specific directions in R^6 which do not change the solution considerably. In other words, the likelihood surface has flat regions where many local minima exist.
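Such an MDS visualization could be produced roughly as sketched below, assuming the Statistics Toolbox functions pdist and mdscale; 'solutions' (a 100-by-6 matrix of fitted parameter vectors) and 'nll' (their negative log-likelihoods) are hypothetical variable names.

% Sketch of the MDS visualization, assuming the Statistics Toolbox.
% 'solutions' (100-by-6 fitted parameter vectors) and 'nll' (their negative
% log-likelihoods) are hypothetical variable names.
D = pdist(solutions);                        % pairwise Euclidean distances in R^6
Y = mdscale(D, 2);                           % multidimensional scaling down to two dimensions
scatter(Y(:,1), Y(:,2), 30, nll, 'filled');  % color-code the points by likelihood value
colorbar;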
The results presented here highlight a central problem in mathematical modeling which
will be discussed further in the following section. Although a fit is obtained and parameter estimates computed, whether these are actually meaningful for individual mice and whether they
can be compared between mice requires careful consideration. When several local minima
are present and especially when they have the same value, the result may be ambiguous.
4 Discussion
4.1 Local minima
This subject has already been touched upon in the Results section where it was demonstrated
that the solution space to the parameter estimation problem contains several local minima
and that these minima almost all have the same value. This can be a problem with big
consequences for the conclusions drawn from the model. In another setting, however, it may
not be significant. Imagine a minimization problem of the same sort where each parameter
controls a part of a production line and the goal is to optimize the production output. If
the parameters of this model are fitted and several solutions with the same value are found, one may choose equally among all solutions, since they all translate to the same production output. What the parameter configurations that produce these high-efficiency production lines mean, or how they should be interpreted, is not of importance. However, in the context of the
data in this project where the objective is to compare parameters from two groups and where
each parameter has a behavioral interpretation, different parameter combinations translate
into different behavior and it is impossible to say which are the ‘true’ parameters for the
mice given that data. What is then the problem, the data or the model? Obviously the
experimental paradigm used in the experiments may have been imperfect. This could for
example mean that what seems like an independent choice of the mouse may in fact have
been strongly influenced by other factors not incorporated in the model. On the other hand,
the model itself could also be too simple. Concrete examples will be discussed in the following
sections.
4.2 Uncertainty
In reinforcement learning algorithms it is often beneficial to vary the learning rate α depend-
ing on the uncertainty in the system. Imagine a simple stochastic one-arm bandit with a
certain probability of yielding a reward. After several plays, a players’ value function for
the bandit will converge to the true probability of yielding a reward. We can say that the
player becomes more and more certain of the value he is attributing to the bandit. Recall
that the learning rate α determines to what degree new information is taken into account.
In the beginning the player is very uncertain and any new information is precious so α is
high. But as time goes on, the player becomes sure about his estimate and even if the bandit
once in a while behaves differently (it is stochastic), the player will not change his estimate
considerably, hence α is low. In machine learning, letting the learning rate vary can produce much faster learning than a constant learning rate. In the model used for the modeling in this thesis (15), the learning rates α_V and α_S are kept constant. This obviously fails to incorporate the above idea, and it is possible that letting the learning rate vary would improve the model. Then arises the question of how to quantify the uncertainty of the mice, which is not a trivial problem. Work has been done along this line of thought [O'Reilly, 2013].
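As a purely illustrative sketch (this is neither the model used in this thesis nor the approach of O'Reilly [2013]), one simple way to let the learning rate shrink as the estimate becomes more certain is a 1/n sample-average rule:

% Purely illustrative: a learning rate that decreases as more observations
% arrive (a simple 1/n sample-average rule). Not the model used in this thesis.
p_reward = 0.3;            % example reward probability of a one-armed bandit
v = 0; n = 0;              % value estimate and number of observations
for t = 1:1000
    r = double(rand() < p_reward);
    n = n + 1;
    alpha = 1 / n;         % learning rate shrinks with experience
    v = v + alpha * (r - v);
end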
4.3 Discounting
In Section 2.2 we derived the prediction error (11) to be
δ(t) = r_t + γ v_{t+1} − v_t    (19)

where γ is the discounting factor. In the model used for the modeling, the prediction error is

δ(t) = r_t − v_t    (20)
so it can be seen as a special case of (19) where γ = 0. Let us recall where this discounting
took place. It came from the notion of the total future reward from time t onwards,

Σ_{τ=0}^{T−t} γ^τ r_{t+τ}    (21)

where a discounting factor close to 0 means that the actor is mostly concerned with maximizing immediate rewards. If γ = 0, then the only term in (21) which is different from 0 is the
reward at time t, r_t. In other words, we assume that the mouse is extremely short-sighted. It is questionable whether this is a valid assumption. Evidence exists that rats do have an
expectation about a future reward even if they have to work to get there [Howe et al., 2013].
This conclusion was made by observing the dopamine level in rat brains as they navigated
a maze task with a reward at the end. As the rat came closer to the goal, dopamine levels
rose even though it had not encountered the reward yet. This suggests that rats, at least, do not discount future rewards completely, as is assumed in the model that has been used in this thesis, and it is something worth looking into.
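As a small worked example of (21), with an invented reward sequence, the difference between a myopic and a far-sighted valuation can be computed directly:

% Worked example of the discounted return (21) for an invented reward sequence.
r = [0 0 0 1 1 1];                     % future rewards from time t onward
tau = 0:numel(r)-1;
G_myopic     = sum(0.0 .^ tau .* r);   % gamma = 0: only the immediate reward counts (= 0 here)
G_farsighted = sum(0.9 .^ tau .* r);   % gamma = 0.9: delayed rewards still contribute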
5 Conclusion
The important experiments by Schultz et al. offered an insight into the activity on a cellular
level during learning. The phasic dopamine signals observed when an unexpected reward is
encountered represent the prediction error, that is the difference between what the learner
expects and what it actually finds. The experiments are important because they offer the
ability to easily quantify learning as it takes place in the brain. The discovery that the
prediction error from a group of algorithms in reinforcement learning can in fact predict this
neuronal behavior paved the road for much investigation. More specifically, the temporal
difference algorithm has since been used to model behavioral data. Montague et al. [1996]
offers an interpretation of the various elements of the model in relation to the brain, and
although simplified it strengthens the connection between these two areas of research.
In this thesis, behavioral data of normal and genetically modified mice has been used as a target for modeling using the aforementioned method. A set of parameter estimates was
obtained by maximizing the likelihood function as implemented in Algorithm 1. Although
a solution was found, it turned out that the solution was far from unique. This points to
a central problem in modeling, where even a large change in parameter values does not
change the solution value much. It poses a problem when the parameters have physiological
interpretations and are thought to control behavior, because we cannot say which predicted
behavior is the true one.
It is likely that the model used has limitations that prevent it from including important
parameters in its explanation of behavior, but as stated in the introduction it has not been
the purpose of this thesis to use the latest and most complex model, but rather to illustrate this connection both theoretically and in its practical application.
It has become clear that although a simple relationship is discovered (here between
dopamine neuron activity and behavior), the modeling of this relationship easily becomes
complex and leaves room for much interpretation by the investigator. Even so, it is a methodology that offers much insight into one of the most complex dynamic systems we know of, the brain, and it will no doubt be the subject of much research to come.
References
M. F. Bear, B. W. Connors, and M. A. Paradiso. Neuroscience, volume 2. Lippincott
Williams & Wilkins, 2007.
J. A. Beeler, N. Daw, C. R. Frazier, and X. Zhuang. Tonic dopamine modulates exploitation
of reward learning. Frontiers in behavioral neuroscience, 4, 2010.
N. D. Daw and K. Doya. The computational neurobiology of learning and reward. Current
opinion in neurobiology, 16(2):199–204, 2006.
P. Dayan and L. Abbott. Theoretical neuroscience: Computational and mathematical modeling of neural systems. The MIT Press, 2001.
D. L. Felten, R. F. Józefowicz, and F. H. Netter. Netter's Atlas of Human Neuroscience. Icon Learning Systems, 2003.
M. W. Howe, P. L. Tierney, S. G. Sandberg, P. E. Phillips, and A. M. Graybiel. Prolonged
dopamine signalling in striatum signals proximity and value of distant rewards. Nature,
500(7464):575–579, 2013.
P. R. Montague, P. Dayan, and T. J. Sejnowski. A framework for mesencephalic dopamine
systems based on predictive hebbian learning. The Journal of neuroscience, 16(5):1936–
1947, 1996.
J. Olds and P. Milner. Positive reinforcement produced by electrical stimulation of septal
area and other regions of rat brain. Journal of comparative and physiological psychology,
47(6):419, 1954.
J. X. O'Reilly. Making predictions in a changing world—inference, uncertainty, and learning. Frontiers in neuroscience, 7, 2013.
R. A. Rescorla, A. R. Wagner, et al. A theory of pavlovian conditioning: Variations in the
effectiveness of reinforcement and nonreinforcement. Classical conditioning II: Current
research and theory, 2:64–99, 1972.
W. Schultz. Predictive reward signal of dopamine neurons. Journal of neurophysiology, 80
(1):1–27, 1998.
W. Schultz. Getting formal with dopamine and reward. Neuron, 36(2):241–263, 2002.
W. Schultz and R. Romo. Dopamine neurons of the monkey midbrain: contingencies of
responses to stimuli eliciting immediate behavioral reactions. J. Neurophysiol, 63(3):607–
24, 1990.
W. Schultz, P. Apicella, and T. Ljungberg. Responses of monkey dopamine neurons to reward
and conditioned stimuli during successive steps of learning a delayed response task. The
Journal of Neuroscience, 13(3):900–913, 1993.
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
U. Ungerstedt. Adipsia and aphagia after 6-hydroxydopamine induced degeneration of the
nigro-striatal dopamine system. Acta Physiologica Scandinavica. Supplementum, 367:95–
122, 1970.
W. R. Uttal. The new phrenology: The limits of localizing cognitive processes in the brain.
The MIT Press, 2001.
R. A. Wise. Dopamine, learning and motivation. Nature reviews neuroscience, 5(6):483–494,
2004.
R. A. Wise and H. V. Schwartz. Pimozide attenuates acquisition of lever-pressing for food
in rats. Pharmacology Biochemistry and Behavior, 15(4):655–656, 1981.
Appendices
Likelihood function implementation in MATLAB
function [Likelihood] = GetValueFunctions_optimize(params, Choice, Reward)
% Negative log-likelihood of a choice sequence under the model in (15).
% Choice is the choice vector c (elements 1/-1), Reward the reward vector r (0/1).
betaV  = params(1);
alphaV = params(2);
alphaS = params(3);
betaS  = params(4);
beta1  = params(5);
betac  = params(6);
V  = zeros(2, length(Choice));    % long-term value estimates for the two levers
S  = zeros(2, length(Choice));    % short-term value estimates
TD = zeros(2, length(Choice));    % prediction errors (stored for inspection only)
P  = zeros(2, length(Choice)-1);  % choice probabilities
Likelihood = 0;
for K = 2:length(Choice)
    cur_choice = Choice(K);
    if cur_choice == 1
        RK = [Reward(K); 0];
        cur_choice_index = 1;
    else
        RK = [0; Reward(K)];
        cur_choice_index = 2;
    end
    % Value updates as in Algorithm 1
    V(:,K) = V(:,K-1) + alphaV*(RK - V(:,K-1));
    % Note: the original listing used V(:,K-1) in the S update; S(:,K-1)
    % follows the short-term value function defined in (15).
    S(:,K) = S(:,K-1) + alphaS*(RK - S(:,K-1));
    TD(:,K) = RK - V(:,K-1);
    % Choice probability (15); the end of this line was truncated in the
    % source, and the betaS term is reconstructed from the model definition.
    P(1,K-1) = 1 / (1 + exp(-(betaV*(V(1,K) - V(2,K)) + beta1 + ...
        betac*Choice(K-1) + betaS*(S(1,K) - S(2,K)))));
    P(2,K-1) = 1 - P(1,K-1);
    % Accumulate the negative log of the probability of the observed choice
    Likelihood = Likelihood - log(P(cur_choice_index, K-1));
end
end
The optimization procedure which calls the likelihood function above
load matlab_mouse_struct;                        % struct array MOUSE with the raw event data
options = optimoptions('fmincon', 'MaxFunEvals', 1000, 'UseParallel', 'Always');
beeler = [39.9 0.042 0.342 -8.185 0.461 3.082];  % start values: estimates from Beeler et al. [2010]
sol      = zeros(20,6);
fval     = zeros(20,1);
exitflag = zeros(20,1);
for i = 1:20
    mus = MOUSE(i);
    mus.File                                     % display which data file is being fitted
    % Build the choice vector (1 = left lever, -1 = right lever)
    Choice = double(mus.LeftLeverPress | mus.RightLeverPress);
    Choice(mus.RightLeverPress) = -1;
    % Build the reward vector: 1 if the press at that time step delivered food
    Reward = zeros(size(Choice));
    Rewardindx = find(mus.reward) - 1;
    Reward(Rewardindx) = 1;
    % Keep only the time steps where a lever was actually pressed
    keepindx = Choice ~= 0;
    Reward = Reward(keepindx);
    Choice = Choice(keepindx);
    % Fit the six parameters by minimizing the negative log-likelihood
    % (the closing parenthesis of the objective handle is restored here)
    [sol(i,:), fval(i), exitflag(i)] = fmincon(@(x) GetValueFunctions_optimize(x, Choice, Reward), ...
        beeler, [], [], [], [], [0 0 0 -25 -1 0], [50 2 2 0 5 5], [], options);
end
23

Calculation of Reward Prediction in Behaving Animals

  • 1.                               Bachelor’s Thesis Emil Østergård Johansen Calculation of Reward Prediction in Be- having Animals Supervisors: Christian Igel & Jakob Dreyer January 4, 2015
  • 2. Contents 1 Introduction 2 2 Background 3 2.1 Dopamine reward system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Reinforcement Learning & Prediciton Error . . . . . . . . . . . . . . . . . . . 8 2.3 Model of learning in the brain . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3 Modeling 13 3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4 Discussion 17 4.1 Local minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.2 Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.3 Discounting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 5 Conclusion 19 Appendices 22 Abstract The theory of reinforcement learning can be applied to modeling learning processes in humans and animals. The neurotransmitter dopamine is supposed to play a central role in the reward system, encoding the temporal difference error. This thesis reviews the theory of reinforcement learning along with the dopamine reward system in the brain. A temporal difference learning algorithm is used to model behavioral data of wild type and dopamine transporter knockdown (DATkd) mice. The parameters of the model are adjusted using data-driven optimization. In the literature, significant behavioral differences between wild type and DATKd mice were explained by differences in the temperature parameter of the model. However, the experiments in this thesis showed that there are several local minima in the parameter space, which allow for alternative explanations of the observed behavioral differences. 1 Introduction How does the brain work? Understanding this fundamental question is the main objective in neuroscience and many other disciplines. Among the functions of the brain, learning remains an area of key interest. Behavioral psychology has paved the road for a massive body of knowledge to which famous names such as B.F. Skinner (the Skinner box) and Ivan Pavlov (classical conditioning) have contributed. While behavior is readily observed and can be quite informative as to hypothesize about the inner mechanics of learning in the brain, neurophysiology, that is the branch of physiology concerned with the functioning of the brain, offers a detailed look of what ‘makes it all tick’. The brain consists of a myriad of living cells among which the most important are neurons (colloquially known as brain cells) and glia cells. The intriguing property of this mass of cells is that it is highly organized. Neurons communicate with one another through cable- like projections and each neuron can be communicating simultaneously with upwards of 2
  • 3. 10,000 other neurons. Connections are strengthened and weakened dynamically over time, but the gross organization appears static. This is why it is possible for neurosurgeons to perform complicated operations with a high degree of confidence that they can predict the consequences, and it is also the reason why researchers can investigate the physiology of specific structures of the brain with 10−6m precision. The discovery of the neurotransmitter dopamine and consequently networks in brain where dopamine plays an central role has been kick starting intense research that shows how dopamine is responsible for much of the behavior observed by psychologists. A series of experiments on monkeys [Schultz, 1998] let researchers directly monitor the activity of dopamine neurons during learning tasks and a clear correlation was observed. While neuroscientists have investigated the inner workings of the brain, computer sci- entists have tried to teach machines to learn and while doing so have developed a group of algorithms under the common term, ‘reinforcement learning’. A reoccurring parameter in these algorithms, namely the prediction error, was surprisingly observed to imitate the behavior of the dopamine neurons monitored in the experiments by Schultz [1998] and since then reinforcement learning algorithms have been used to model behavioral data. The work by Beeler et al. [2010] is an example of combining neuroscience with reinforce- ment learning. They investigate the background activity of dopamine neurons’ effect on observed behavior by presenting two groups of mice with the same learning task. One group has been genetically modified to have increased background activity of dopamine neurons and they use a reinforcement learning model to compare various parameters between the two groups. It is the objective of this thesis to describe the background theory of learning in the context of both the dopamine system and of reinforcement learning. The data obtained by Beeler et al. [2010] will then serve as a framework for applying this knowledge to an actual modeling problem and stimulate a discussion of these methods. There is a large amount of literature on the subject of modeling behavioral data and it is not within the scope of this thesis to review the newest findings nor implement them. The background review is based mostly on classical references. 2 Background 2.1 Dopamine reward system Various theories about the organization of the brain have been proposed throughout the centuries. In the 18th century Franz Joseph Gall founded the school of phrenology which claimed that various bumps on the external surface of the cranium would predict various psychological traits about the person being examined. When focus shifted to the brain, this idea paved the road for the functional specialization theory, namely that each cognitive function and psychological trait corresponded to a bounded area in the brain. Lesion case studies, where the subject of study were patients with a brain tumor, internal bleeding or other kinds of local damage to an area of the brain, produced many interesting results that appeared to prove this theory. The most famous of these patients was Phileas Cage who miraculously survived after having an iron rod driven through his skull, destroying his left frontal lobe. Cage survived with most cognitive skills intact but with a personality change that rendered him unrecognizable to his friends and family. 
This suggested that the frontal lobes are somehow responsible for personality. The famous brocca’s area in the inferior (lower) part of the frontal lobe which supposedly is responsible for language, was similarly discovered by observing that patients with damage to this area all appeared to have trouble with speech production. Other theories are of course also being explored. The opposite of functional specialization is the theory of distributive processing where cognitive functions are assumed to be a network 3
  • 4. Figure 1: Publications on various neurotransmitters throughout the years. Source: Google Scholar. within the brain not bound to a specific area [Uttal, 2001]. Whatever the organization of function, pathways which consists of the projections of axons from a collection of neurons in one part of the brain to another part, has been thor- oughly documented and a group of these pathways are of special interest when investigating motivation and reward. Dopamine was discovered as an independent neurotransmitter by Carlson et al (1957) at Lund University and the research on dopamine has been increasing ever since (Figure 1). Dopamine was first associated with motor function [Wise, 2004], but later also with motiva- tional behavior first described in Ungerstedt [1970] where feeding and drinking deficits were observed after inducing damage to the mesolimpic dopamine system. In Wise and Schwartz [1981] they showed that by injecting the dopamine D2 receptor blocker pimozide, they ob- served that rats did not learn to press a lever in order to obtain food and water, whereas rats who were not injected learned this task. This further suggests that dopamine plays a key role in modulation of motivation and reward-based learning. The dopaminergic system in the brain consists of several pathways that together span most of the brain but all emanate from the mid brain (Figure 2). The most important path- way in regard to reward learning is the mesolimpic pathway which originates from dopamin- ergic neurons in the ventral tegmental area (VTA) in the mid brain that project its axons to especially the cortex. In a series of groundbreaking experiments, starting with Olds and Milner [1954], a rat was placed in a box with a lever which upon pressing delivered electrical stimulation to the rats’ brain through an implanted electrode (see Figure 3). In this set-up the rat would first step on lever by accident and then retreat away from the lever. However, it would soon come back and press again, and as time went by it would spend most of its time pressing the lever. In extreme cases rats were observed to press the lever continuously until fainting from exhaustion. These electrical self-stimulation experiments were repeated, each time changing the position of the stimulation electrode in the rat brain, and the structure identified as producing the strongest behavioral response was the ventral tegmental area (VTA). Blocking dopamine receptors reduced the degree of self-stimulation so the conclusion was that the rats worked to release dopamine, and thus dopamine served as a reward that reinforced the behavior that caused the release of it. Wise and colleagues have carried out numerous self stimulation experiments throughout 4
  • 5. Figure 2: Dopaminergic System Overview (adapted from Felten et al. [2003]) the years and located many of the structures involved in motivation and reward by noting that when they observed a positive result (boosted reward experience of the animal seen by repeated self stimulation), the stimulation electrode was located either in close proximity to axons of dopaminergic neurons or to axons that terminated at dopminergic neurons [Schultz, 2002]. The important experiments which gave rise to the data essential for this thesis were done by Schultz and his colleagues (various articles 1990-1994) on behaving monkeys. In the experiments firing patterns of dopaminergic neurons were measured in vivo (while the monkeys were alive) as the monkeys performed various learning tasks. They were placed in front of a box which was covered to prevent vision of the contents of the box. Electrodes were placed near a population of dopamine neurons in the monkeys’ brain and they then let the monkey insert its hand into the box which could contain either nothing, a reward (apple) or a neutral object (a string). Figure 4A shows a transient increase in dopaminergic neuron firing when the hand of the primate touches an apple (upper figure) inside the box and an absence of this firing when there is no apple (lower figure). Control experiments were performed (Figure 4B and C) which confirmed that: 1. The qualitative observation was indeed due to touching an apple (a reward) and not just any object (a string was used in place of an apple). 2. The same result was not due to movement of the arm alone. The more interesting result, however, is found in Figure 4E where the movement of reaching for the apple inside the box was initiated by a stimuli which in this case is the opening of the door to the box. In the control trial (Figure 4D) the dopaminergic neurons exhibit a transient increase in firing rate, as we would expect by now, when the hand reaches the reward. The movement here is self-initiated. When the movement is initiated by the door-opening stimulus, the neural response is to start with observed when the monkey reaches the reward, similarly to the control. However, after learning the association between stimulus and the possibility of finding a reward, the increase in firing rate is observed at the time of the stimulus and not at the time of discovery of the reward. This observation was replicated in Ljungberg et al. (1992), also with monkeys. It was concluded that this transferring of neuronal response happens over the course of learning. This is an impressive direct view of 5
  • 6. Figure 3: Self stimulation experimental setup (adapted from Bear et al. [2007]). a change on the cellular level in the brain that happens during learning. Let us look at not the transferring of neuronal activity but just the decrease of activation at the time of reward. How is this decrease in phasic activation at the time of reward to be interpreted? An obvious interpretation is that the animals develop an insensitivity to the rewards and thus less activation in the dopamine neurons is observed. This was however falsified in [Mirenowicz and Schultz, 1994]. Rather it seems that the degree of activation somehow depicts the uncertainty or unpredictability [Schultz, 2002] of the reward. In the first trials, when the reward is unexpected, the activation of dopamine neurons is at its highest. Then slowly over the course of learning the activity decreases until it is not present at all. This can be interpreted as a form of learning or in other words, a steady increase in the certainty that the reward will be presented. When the reward, against expectations, is on purpose being withheld or delayed, a sig- nificant drop in dopamine neuron activity is observed [Schultz et al., 1993] which seems like a natural extension of the unpredictability theory. The drop in activity occurs exactly at the time where the reward is usually delivered which suggests a sensitivity to the timing of the reward in addition to the occurrence. The important theme in these individual neuron experiments by Schultz and others is the unpredictability of reward, as it has come to be termed, in both timing and occurrence. It is becoming evident that the phasic activity (as opposed to the constant tonic firing) of dopamine neurons can be interpreted to be pro- portional to the error in the animals’ prediction of the reward: When the expectation of a reward is low or non-existent (as it would be prior to any learning task), a large amount of activity is observed when a reward is presented - contrary to expectation. However, over time as the animal becomes surer that the reward follows the stimuli, the unpredictability is decreasing and the prediction error also decreasing, thus a smaller neuronal activity is ob- served. The term ‘prediction error’ is important because it is a corner stone of reinforcement learning algorithms and it is in this quantity that the connection was made between the results discussed here and the reinforcement learning theory most note-worthily described and developed in Sutton and Barto [1998]. 6
  • 7. Figure 4: Mensencephalic dopamine neuron firing response to experiment. From [Schultz and Romo, 1990] Figure 5: Transferring of predictive firing. From [Schultz, 1998] 7
  • 8. 2.2 Reinforcement Learning & Prediciton Error The prediction error as introduced in the previous section, is a central theme in modern conditioning theories [Rescorla et al., 1972, Schultz, 2002]. In these theories it is suggested that in order for learning to take place, the prediction error must be different from zero. A positive or negative prediction error will lead to the acquisition or extinction or behavior. To see why this is so, imagine a dog being presented with a stimulus such as the ring of a bell. In all likelihood the dog predicts that nothing will happen after the bell has sounded. And if nothing happens, the dog will continue to expect nothing to happen the next time the bell rings. But if something, contrary to the dog’s prediction, does happens, there is a prediction error between the dog’s prediction and what actually happened and as a result the dog changes its prediction for the next time the bell rings or in other words, it learns. This theory is pleasing because it fits well with our intuitive understanding of learning and the classic Pavlovian dog experiments. It should by now become more and more clear that the prediction error observed in the single dopamine neuron experiments by Schultz and the prediction error observed in learning tasks in classical conditioning (such as the dog example above), is the same and it is this subtle point that provides the main motivation for both the data set used in this thesis and the model which will be introduced later. First, however, I think it is appropriate to clarify some terms. In conditioning we distinguish between two main fields of study: Classical and instrumental or operant conditioning. The main difference being that the actor in classical conditioning does not have any influence on whether he/she is presented with a reward or not - in other words it is completely automatic. In instrumental or operant conditioning, the opposite is true. Here the actor’s interaction with the environment can influence the chance of whether or not a reward is presented. A system where the actor is presented with positive rewards and negative rewards (punishments) which provide the basis for learning, is what is known as a reinforcement learning system which is studied widely in both psychology and computer science for machine learning. Reinforcement is here taken to mean the strengthening of stimulus-stimulus, stimulus-response, or stimulus-reward associations that result from the timely presentation of a reward [Wise, 2004]. Many results from psychological classical conditioning experiments can be predicted by a simple learning rule which is one of the classical references in the field, namely the The Rescorla-Wagner rule [Rescorla et al., 1972]. It offers a simple linear relationship for pre- dicting reward. 
If we denote v as the expected reward, u as a binary variable representing the presence or absence of a stimulus and w as a weight associated with the stimulus, then the expected reward is predicted as v = wu (1) The learning takes place in updating the value of the weight w with an update rule that is designed to minimize the prediction error, that is the squared error between expected reward and actual reward (r − v)2 [Dayan and Abbott, 2001] and this is what is known as the Rescorla-Wagner rule w = w + δu (2) where • δ = r − v is the prediction error • is a learning rate determining how fast the learner incorporates new information into prior knowledge If a reward follows every time a stimulus is presented, then w will converge to 1 as the number of trials increases. If the reward is stochastic and normally distributed, w will converge to 8
  • 9. the theoretical mean of that distribution. As already mentioned this is a model of classical conditioning meaning that the actor does not interact with environment and hence has no influence over whether a reward is presented or not. It does a poor job modeling instrumental conditioning because there is no notion of time or interaction between the actor and the environment in the model. For this we need a slightly more complicated setup. Reinforcement learning is a problem of optimal control in the sense that the actor, who is learning, has some control over the environment. Actions lead to rewards or the lack thereof and the actor seeks to reap the largest amount of rewards. In order to describe this problem mathematically we need some clearly defined abstractions: • a set of states S • a set of action A • a transition function TTT • a reward function RRR • a value function VVV • a policy π For the remainder of this introduction we will use the experimental setup described in Section 3.1 as an example. This setup consists of a mouse in a cage with two levers that deliver rewards in the form of food. Obviously time is continuous in the real world, but it is for modeling purposes beneficial to discretize the world in to discrete timesteps t and states st ∈ S where st is the state at time t and S is the set of all possible states. From the mouse’s point of view, the way to transition between these states is through actions which, like states, come from a set of actions so the action at time t we write as at ∈ A. A certain action then, like pressing a lever, alters the state of the environment iterating t to t + 1 so that the new state is st+1 ∈ S. The next thing we need is a function that maps state st and action at to the next state st+1. This we call the transition function TTT : st, at −→ st+1 and this is what describes the laws that govern the interaction between the mouse and its environment. It is also termed an environment model. The environment can be deterministic in the sense that a specific action will always lead to specific state, but often it is stochastic so that TTT is a set of transition probabilities. Finally there is the reward function RRR : st, at −→ rt which maps from a state and an action to the probability of a reward. Often rt is a binary variable which represents the presence of absence of reward at time t, just like the stimulus variable u in the Rescorla- Wagner setup. With this set of abstractions we have a well defined environment S, a way for the mouse to interact with the environment through actions A and a transition function which tells us how the environment changes under each action TTT. Finally, the incentive for learning, the rewards, are also well defined by RRR. Now comes the interesting part, namely how the mouse chooses what action to take in order to maximize reward. The way we model this is to assume that the mouse keeps some value estimate for each possible action so that by choosing the action with the highest estimated value, it can maximize reward. This is called the value function VVV : at −→ vt. The mouse updates this value function based on the reward it receives, a trial and error concept. If VVV∗(a) is the true value for action a, the mouse obviously tries to achieve the following property: lim t→∞ VVVt(a) = VVV∗ (a) (3) 9
  • 10. A simple way to update VVV is inspired directly by the Rescorla-Wagner rule where the weight w is replaced with the value and the stimulus u is omitted: vt = vt + αδt (4) Here δt = (rt − vt) and has been renamed to α to keep the two update rules separate. As already suggested, the mouse chooses its actions based on the value function. The function that encapsulates this behavior of the mouse is termed its policy, π : s −→ a. The policy π is in words a function that maps from a state to an action. This will usually have the form of a simple look-up table although it can be imagined to take more complex forms. The process of updating value estimates and thus the policy is known as policy optimization. The relationship between the value function and the policy can vary from model to model. If the policy of the mouse is designed so that it always chooses the action which has the highest value estimate (highest expected value), then this is known as the greedy choice and it is the one of two extremes. The other extreme is to completely disregard the value estimates, and assign equal probabilities to choosing each action. Many different strategies have been proposed. One easy to understand is the so-called -greedy method (not be confused with the learning rate from the Rescorla-Wagner rule). Here is a number between 0 and 1 which defines the probability of choosing the greedy action versus choosing equally among all other action. Another approach is to assign weighted probabilities according to the value estimates V (a) for each action. A common implementation of this idea is to use the Gibbs distribution, given as eVVV(a)/β k b=0 eVVV(b)/β , (5) to calculate the probability weights [Sutton and Barto, 1998]. This is also known as the softmax selection rule and it chooses stochastically among all actions but with higher value actions weighted higher, more specifically it computes the probability P(at = a). This model is used in the data modeling of this project. The fact that the value estimations converge as in (3) is less interesting than how fast they converge. That is, not only does the mouse want to optimize its policy to yield the largest value on its future actions, but it also wants to optimize the rewards it gets during learning, as it can’t go on learning forever. This dilemma is known as the exploit-explore problem and it is central to reinforcement learning. The following is based on the introduction to temporal difference learning in Dayan and Abbott [2001]. As suggested above we interpret the value function VVV as the estimated value of an action. To elaborate, what we really mean is that at time t, VVV(t) is the total future reward expected from time t and onward to the end of the trial: T−t τ=0 rt+τ (6) Usually future rewards are discounted which means that future rewards are assigned a lower value than immediate rewards. This is intuitive in the sense that we can imagine an actor to have more incentive to obtain a reward immediately rather than obtaining the same reward a year later. This is modelled with a discounting factor γ where 0 ≤ γ ≤ 1. If γ = 0 we say that the actor is myopic [Sutton and Barto, 1998] or short-sighted meaning that it does not take into account future rewards but is only interested in maximizing immediate rewards. As γ approaches 1, the actor becomes more and more far-sighted. So we rewrite (6) as T−t τ=0 γτ rt+τ (7) 10
  • 11. In this setting, the prediction error δ(t) must then be the difference in actual future rewards and expected total future rewards δ(t) = T−t τ=0 γτ rt+τ − vt (8) This, however, is impossible to compute at time t because we do not yet know the actual future rewards. This is solved by noting that the first term in (8) can be rewritten as T−t τ=0 γτ rt+τ = rt + T−t−1 τ=0 γτ+1 rt+1+τ = γ T−t−1 τ=0 γτ rt+1+τ (9) and as we just discussed, vt is an estimate of the total future reward (the latter term) γ T−t−1 τ=0 γτ rt+1+τ ≈ γ vt+1 (10) so the prediction error becomes δ(t) = rt + γ vt+1 − vt (11) and we insert this in (4) and get the update rule vt = vt + α(rt + γ vt+1 − vt) (12) The difference vt+1 − vt gives the name to this method, the temporal difference method. At a first glance it looks like we again need information about the future (vt+1) to compute the prediction error. One must remember, however, that vt designates the value of the action taken at time t and similarly, as soon as we know the action at+1 which is computed with e.g. the softmax selection rule, we know the value of the value estimate vt+1. This method is called a bootstrapping method [Sutton and Barto, 1998] because it uses an estimate for estimation. We now have the complete framework for classical reinforcement learning in order. We have defined the environment and the actors interaction with this environment through sets of states S and actions A and we have defined how the actor selects its actions based on the value estimates it holds about each action (its policy). Finally we have shown how the actor can update its knowledge by observing the rewards it receives (12). 2.3 Model of learning in the brain In this section I want to formalize the connection between the dopamine system and rein- forcement learning a bit further. Although the idea that the brain uses the prediction error for learning is intuitive, a well defined model of what the various functions in reinforcement learning represent has been proposed [Montague et al., 1996] and is depicted in Figure 6. In this model, information about the task to be learned is stored in the cortex which con- sists of different modalities. When some task is engaged, the cortex outputs its information about this task to an intermediate layer which represents possible information processing in other parts of the dopamine system (See Figure 2) . This information can be both excita- tory (encouraging) or inhibitory (discouraging). For example some modality of the cortex can discourage the task because it is learned to be dangerous while another modality of the cortex can encourage the task because it often yields food and the actor is hungry. All these outputs are weighted individually and summed in our familiar value function VVV(t). In the figure ˙VVV(t) represents a temporal difference which we recognize as VVV(t + 1) − VVV(t) (see 11
  • 12. section 2.2). The information is then passed downwards to a group of dopamine neurons PPP in for example the ventral tegmental area which also receives external information about the presence or absence of a reward rrr(t) and other factors which may have influence over the output from PPP. It is in the dopamine neurons that the prediction error δ(t) is computed simply as the net input δ(t) = r(t) + ˙V (t) = r(t) + V (t + 1) − V (t) (13) and then output as a dopamine signal back to the cortex (again refer to Figure 2). This model is in accordance with the experiments by Schultz et al. discussed in Section 2.1 where it was observed that the dopamine signal in monkeys appeared to represent the prediction error δ(t). We also note that (13) is a special case of (11) where γ = 1. Figure 6: Model of information flow through the dopamine system. Adapted from Mon- tague et al. [1996] The model in Figure 6 explains how the prediction error and the consequent learning might be physiologically instantiated in the brain. The other part that needs an inter- pretation is the action selection policy. The softmax action selection model (5) was in- troduced in Section 2.2 and it is the pre- ferred model for action selection in reinforce- ment models as it has been observed to fit well with behavioral data [Daw and Doya, 2006]. The variable parameter of the soft- max model is the β parameter also called ‘temperature’, an analogue to the movement of molecules under different temperatures. As β −→ 0, the probability of selecting the action with the highest value approaches 1. Conversely, as β −→ ∞ the probability is spread out so that all actions have the same probability of being selected. In Figure 7 a simulation of a famous instructional reinforcement learning problem is shown. The problem is that of a two-armed bandit. Each arm has its own probability of yielding a reward and it is up to the learner to balance exploitation of learned knowledge of which lever is more likely to yield a reward and exploration of the other lever in case it might actually be better. Figure 7 shows a game that lasts 2000 plays. At play 400, the reward yielding probability of the levers are switched in order to see how fast the model is able to adapt to this. The result has been averaged over 2000 episodes to smooth out the curves sufficiently. The choice of which arm to pull was made with the softmax action selection rule and the result for four different values of β is shown. As is expected, the most conservative player with β = 0.01 realizes the change the latest and is slowest at adapting. Similarly the player with β = 0.5 is the quickest at realizing the change and adapting. There are many qualitative attributes that are interesting to investigate (e.g. it seems that a medium value is better in the long run, which it is), but overall it serves as a nice illustration of how β affects the ability to learn and adapt. Although there are hypothesis as to how the brain controls this, it is still an open question [Beeler et al., 2010, Daw and Doya, 2006]. 12
  • 13. Figure 7: Two-armed bandit simulation with different temperature values. 3 Modeling 3.1 Experimental Setup The data used for modeling in this thesis is the same data used in Beeler et al. [2010]. The main objective of the original experiment was to investigate whether tonic dopamine modulates learning. This was done by comparing the performance of dopamine transporter knockdown (DATkd) mice against wild type (normal) mice (n = 10 for both groups). A home cage operant paradigm was used in the experimental setup, which means that mice earn their food entirely through lever pressing. Each mouse lives in its own cage for 10 consecutive days where water is freely available, but food can only be obtained through lever pressing. In each cage are two levers which has a probability of yielding food upon pressing. When the probabilities of food release for each lever is not equal, one lever has a lower probability of food release than the other making it ‘cheaper’ in terms of the amount of work required to gain a reward. Which lever is the cheaper changes every 20-40 minutes thus constantly offering a learning opportunity for the mouse. The data is recorded as event codes and accompanying time codes for each event with a temporal resolution of 100 ms. Events are e.g. ‘left lever pressed’, ‘reward delivered’, ‘lever probabilities interchanged’ etc. For the parameter estimation, the data has been converted into two vectors: A binary choice vector ccc where an element ct is 1 or -1 depending on which lever is pressed. ccc contains the entire choice sequence for a single mouse concatenated over ten days which is about 50000 events. The other vector is a binary reward vector rrr of the same length as ccc. An element rt is 1 if a reward was delivered when lever press ct was performed and 0 otherwise. All temporal information (the time codes in the original data set) has been removed since the model only predicts what choices are made in which order, not when they are made. 3.2 The Model 13
  • 14. Figure 8 In the original work by Beeler et al. [2010] they fit two models to the behavioral data described in the previous section. One is a general logistic regression model with the choice variable c as the dependent variable and the 100 previous rewards rt−100...t−1 for each t as the explanatory variables. The re- sult of this model is shown in Figure 8 where left on the horizontal axis corresponds to the most recent rewards. A positive coeffi- cient value indicates that a reward tends to promote staying on the same lever whereas a negative value indicates switching to the other lever. Intuitively we would expect that the most recent rewards to promote staying on the same lever more strongly than re- wards in the past due to e.g. memory. This holds true for the figure shown here except for the very most recent rewards where coefficient values tend to −1. This means that the mice tend to switch to the other lever one or two time steps after receiving a reward but in the longer run they follow a familiar learning pattern. A simple softmax action selection model as in (5) will not predict this behavior but rather that a reward received immediately prior to the current time step will promote staying on the same lever the strongest. So this model has to be extended. The simple softmax model for two possible actions (levers) take the following form: P(ct = 1) = eVt(1)/β eVt(1)/β + eVt(−1)/β = 1 1 + e−(Vt(1)−Vt(−1))/β = σ(βV (Vt(1) − Vt(−1))) (14) where σ(z) = 1 1+e−z is the logistic function and βV is the inverse temperature. Whether one use the temperature or its inverse is simply a matter of convention and since they use its inverse in the original article, we are adapting this notation here. Similarly for P(ct = −1). The model used in Beeler et al. [2010] is as follows: P(ct = 1) = σ(βV [Vt(1) − Vt(−1)] + β1 + βcct−1 + βS[St(1) − St(−1)]) (15) where • βV is the inverse temperature for V • Vt = Vt−1 + αV (rt−1 − Vt−1) is the value function • β1 is the bias term towards one lever • βc is the bias towards the previously pressed lever • βS is the inverse temperature for S • St = St−1 + αS(rt−1 − St−1) is the short-term value function Apart form the simple bias terms, it as been augmented from (14) to include an additional value function S which will capture the short term effects observed in the logistic regression result (Figure 8). By forcing the parameter βS to be negative we make sure that the last term has the desired effect. There are 6 degrees of freedom in this model, namely the parameters βV , αV , β1, βc, βS and αS, so given a value for each parameter we are able to calculate the action probabilities at time t. In symbols this probability is written as P(ct = 1 | βV , αV , β1, βc, βS, αS) (16) 14
  • 15. 3.3 Parameter estimation In this section, the method for parameter estimation is established. As we just saw, given any six parameter values, we can compute the probability of the choice at time t with (15). In fact we not only get the probability for the choice at time t, but since the entire choice sequence is defined recursively depending solely on the parameters and on the previous values, we get the whole choice sequence. So by finding good estimates of the parameters we mean finding parameters that will produce data that is, to the highest degree possible, like the training data which is the actual choice sequence of a mouse. So if we, given parameter values, can compute the probability that a single choice is equal some actual choice (ct = 1 or ct = −1) as P(ct | βV , αV , β1, βc, βS, αS) = P(ct = 1) : ct = 1 1 − P(ct = 1) : ct = −1 where P(ct = 1) is given in (15) and if we assume that each choice is independent of all other choices, we can compute the probility of an entire choice sequence as the product L(βV , αV , β1, βc, βS, αS | c1, c2, ..., cT ) = T t=1 P(ct) (17) Note that we here assume the choice sequence to be given and let the parameters vary. This is what is known as the likelihood function and by maximizing this function we can get the parameters that best fit the actual data. For several reasons it is more practical to convert the product into a sum by taking the logarithm so that the likelihood function becomes log(L(βV , αV , β1, βc, βS, αS | c1, c2, ..., cT )) = T t=1 log(P(ct)) (18) Since the logarithm is a monotone transformation, it will have the same extrema as the transformed function. Furthermore it is also common to use the negative of the log likelihood which means the likelihood function should be minimized rather than maximized. The likelihood computation is outlined in Algorithm 1. This procedure returns the like- lihood for a set of parameters. Since the choice probabilities depend on the value functions, and the value functions are recursively defined, the likelihood has to be computed iteratively. The value functions V and S are both initialized at 0. Experimentation showed that the results did not differ significantly except for the first few time steps. After implementing this procedure in MATLAB, the native optimizer ‘fmincon’ was used to obtain the results presented in the following section. 15
  • 16. input : βV , αV , αS, βS, β1, βc output: Likelihood (scalar) // Initialize value functions and likelihood V1 = [0, 0] ; S1 = [0, 0] ; Likelihood = 0; for t = 2 to end of choice sequence do ct = action chosen at time t; cother = action not chosen at time t; // Update value functions // For the chosen action Vt,ct = Vt−1,ct + αV (|rt−1| − Vt−1,ct ); // and for the not chosen action Vt,cother = Vt−1,cother + αV (0 − Vt−1,cother ); // Same for the S-values St,ct = St−1,ct + αS(|rt−1| − St−1,ct ); St,cother = St−1,cother + αS(0 − St−1,cother ); // Update choice probability Pt,1 = σ(βV (Vt,1 − Vt,2) + β1 + βcct−1 + βS(St,1 − St,2)); Pt,2 = 1 − Pt,1 // Update likelihood with the negative log of the probability of the choice at hand Likelihood = Likelihood − log Pt,ct ; end return Likelihood Algorithm 1: Parameter estimation using maximum likelihood estimation. 3.4 Results For the fitting a choice of start values and boundaries had to be made. As for start values, seeing that the primary objective is to replicate the results obtained in Beeler et al. [2010], it appears natural to choose their parameter estimates as start values. As for boundaries, we want to make sure that the optimization procedure is searching within the same subspace of the solution space, so boundaries that are relatively tight around the target solution was selected. For the learning rates αV and αS, the predetermined boundaries between 0 and 1 was used. After running the fitting procedure, the following results were returned. Parameters Wild-type DATkd t p βV 36.80 35.02 0.482 0.636 αV 0.724 0.638 0.378 0.710 αS 0.361 0.501 0.721 0.480 βS −7.633 −6.923 0.317 0.756 β1 0.087 −0.161 0.851 0.406 βS 3.291 3.184 0.162 0.413 The parameter estimates are sample means over 10 mice each and the comparison of means was done with a two-sample t-test. The lowest p-value is 0.406 so we cannot reject the null-hypothesis that the means are equal for any of the parameters. This is in accordance with Beeler et al. [2010] except for the parameter βV which in their work was found to be 16
  • 17. Figure 9: MDS output of 100 parameter fits on two different mice. Similar colors represent similar likelihood. significantly different when comparing wild-type and DATkd mice. After several experiments it became evident that the the solution is highly dependent on the start values given. This suggests that the solution space contains many local minima even within relatively tight boundaries on the parameters. To investigate this further I performed 100 fits on two different mice with 100 random start values. This gives 100 solutions in R6. In order to visualize it, multidimensional scaling (MDS) was used to reduce the dimensions to two. MDS is a way to visualize the similarity or distance between high dimensional data points in a lower dimension as precisely as possible. The ‘error function’ of MDS is termed the strain function and by plotting the strain as a function of time (a scree plot), an acceptable number of dimensions for visualization can be determined. In this case two is appropriate. The result of the MDS analysis is shown in Figure 9. Each point is a solution to the opti- mization problem and the solutions are color-coded according to their negative log-likelihood and the hollow blue dots have the lowest value. Several qualitative traits meet the eye. Firstly, there are two clear clusters of solutions to the left in each plot and the group to the right could also be interpreted as one or two clusters. Secondly, solutions appear to be arranged vertically with very little variation in the horizontal direction. In order to interpret this, one must remember that in dimensionality reduction analysis like MDS, the axis in the lower dimension do not have a clear interpretation because the plane formed does not necessarily lie along any of the original axis. So a arrangement of solutions parallel to the vertical axis, as in this case, is not the variation of a single parameter, but rather some com- bination of variables that in this case does not change the likelihood value. This offers a nice visualization of a specific direction(s) in R6 which does not change the solution considerably. In other words, the solution hyper-plane has flat areas where many local minima exist. The results presented here highlight a central problem in mathematical modeling which will be discussed further in the following section. Although a fit is obtained and value esti- mates computed, whether these are actually meaningful for individual mice and whether they can be compared between mice requires careful consideration. When several local minima are present and especially when they have the same value, the result may be ambiguous. 4 Discussion 4.1 Local minima This subject has already been touched upon in the Results section where it was demonstrated that the solution space to the parameter estimation problem contains several local minima and that these minima almost all have the same value. This can be a problem with big 17
  • 18. consequences for the conclusions drawn from the model. In another setting, however, it may not be significant. Imagine a minimization problem of the same sort where each parameter controls a part of a production line and the goal is to optimize the production output. If the parameters of this model is fit and several solutions with the same value are found, one may choose equally among all solutions since they all translate to the same production output. What the parameter configurations that produce these high efficiency production lines mean or should be interpreted is, is not of importance. However, in the context of the data in this project where the objective is to compare parameters from two groups and where each parameter has a behavioral interpretation, different parameter combinations translate into different behavior and it is impossible to say which are the ‘true’ parameters for the mice given that data. What is then the problem, the data or the model? Obviously the experimental paradigm used in the experiments may have been imperfect. This could for example mean that what seems like an independent choice of the mouse may in fact have been strongly influenced by other factors not incorporated in the model. On the other hand, the model itself could also be too simple. Concrete examples will be discussed in the following sections. 4.2 Uncertainty In reinforcement learning algorithms it is often beneficial to vary the learning rate α depend- ing on the uncertainty in the system. Imagine a simple stochastic one-arm bandit with a certain probability of yielding a reward. After several plays, a players’ value function for the bandit will converge to the true probability of yielding a reward. We can say that the player becomes more and more certain of the value he is attributing to the bandit. Recall that the learning rate α determines to what degree new information is taken into account. In the beginning the player is very uncertain and any new information is precious so α is high. But as time goes on, the player becomes sure about his estimate and even if the bandit once in a while behaves differently (it is stochastic), the player will not change his estimate considerably, hence α is low. In machine learning, letting the learning rate vary, produces much faster learning than a constant learning rate. In the model used in the modeling in this thesis (15), the learning rates (αV and αS) are kept constant. This obviously fails to incorporate the above idea and there is the possibility that it would improve the model to let the learning rate vary. Then rises the question of how to quantify the uncertainty of the mice and this not a trivial problem. Work has been done on this train of thought [O’reilly, 2013]. 4.3 Discounting In Section 2.2 we derived the prediction error (11) to be δ(t) = rt + γ vt+1 − vt (19) where γ is the discounting factor. In the model use for modeling, the prediction error is δ(t) = rt − vt (20) so it can be seen as a special case of (19) where γ = 0. Let us recall where this discounting took place. It came from the notion of the total future reward from time t and onwards T−t τ=0 γτ rt+τ (21) where a discounting factor close to 0 means that the actor is mostly concerned with maximiz- ing immediate rewards. If γ = 0 then the only term in (21) which is different from 0 is the 18
4.3 Discounting

In Section 2.2 we derived the prediction error (11) to be

\delta(t) = r_t + \gamma v_{t+1} - v_t \qquad (19)

where γ is the discounting factor. In the model used for the modeling, the prediction error is

\delta(t) = r_t - v_t \qquad (20)

which can be seen as a special case of (19) with γ = 0. Let us recall where this discounting took place. It came from the notion of the total future reward from time t onwards,

\sum_{\tau=0}^{T-t} \gamma^{\tau} r_{t+\tau} \qquad (21)

where a discounting factor close to 0 means that the actor is mostly concerned with maximizing immediate rewards. If γ = 0, the only term in (21) that differs from 0 is the reward at time t, r_t. In other words, we assume that the mouse is extremely short-sighted. It is questionable whether this is a reasonable assumption. There is evidence that rats do have an expectation about a future reward even if they have to work to get to it [Howe et al., 2013]. This conclusion was drawn by observing the dopamine level in rat brains as the animals navigated a maze task with a reward at the end: as a rat came closer to the goal, dopamine levels rose even though it had not yet encountered the reward. This suggests that rats, at least, do not discount future rewards completely, as the model used in this thesis assumes, and it is something worth looking into.
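As a small numerical illustration with made-up numbers, consider a reward that arrives three steps in the future. With γ = 0 the discounted return in (21) ignores it completely, whereas larger values of γ give it weight:

% Discounted return of (21) for a reward three steps ahead (illustrative numbers).
r = [0 0 0 1];                              % reward only at the final step
for g = [0 0.5 0.9]
    G = sum(g.^(0:numel(r)-1) .* r);        % discounted sum of future rewards
    fprintf('gamma = %.1f  ->  return = %.3f\n', g, G);
end

A mouse with γ = 0 would thus assign no value at all to states that merely lead towards a reward, which is exactly what the rising dopamine signal reported by Howe et al. [2013] argues against.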
5 Conclusion

The important experiments by Schultz et al. offered an insight into the activity on a cellular level during learning. The phasic dopamine signals observed when an unexpected reward is encountered represent the prediction error, that is, the difference between what the learner expects and what it actually receives. The experiments are important because they make it possible to quantify learning as it takes place in the brain. The discovery that the prediction error of a class of reinforcement learning algorithms can in fact predict this neuronal behavior paved the road for much further investigation. More specifically, the temporal difference algorithm has since been used to model behavioral data. Montague et al. [1996] offer an interpretation of the various elements of the model in relation to the brain, and although simplified, it strengthens the connection between these two areas of research.

In this thesis, behavioral data of normal and genetically modified mice have been used as the target for modeling with the aforementioned method. A set of parameter estimates was obtained by maximizing the likelihood function as implemented in Algorithm 1. Although a solution was found, it turned out to be far from unique. This points to a central problem in modeling, where even a large change in parameter values hardly changes the value of the objective. It poses a problem when the parameters have physiological interpretations and are thought to control behavior, because we cannot say which predicted behavior is the true one. It is likely that the model used has limitations that prevent it from including important factors in its explanation of behavior, but as stated in the introduction, the purpose of this thesis has not been to use the latest and most complex model, but rather to illustrate the connection between reinforcement learning and the dopamine reward system, both theoretically and in practical application. It has become clear that although a simple relationship has been discovered (here between dopamine neuron activity and behavior), modeling that relationship easily becomes complex and leaves room for much interpretation by the investigator. Even so, it is a methodology that offers much insight into one of the most complex dynamic systems we know of, the brain, and it will no doubt be the subject of much research to come.

References

M. F. Bear, B. W. Connors, and M. A. Paradiso. Neuroscience, volume 2. Lippincott Williams & Wilkins, 2007.

J. A. Beeler, N. Daw, C. R. Frazier, and X. Zhuang. Tonic dopamine modulates exploitation of reward learning. Frontiers in Behavioral Neuroscience, 4, 2010.

N. D. Daw and K. Doya. The computational neurobiology of learning and reward. Current Opinion in Neurobiology, 16(2):199–204, 2006.

P. Dayan and L. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. The MIT Press, 2001.

D. L. Felten, R. F. Józefowicz, and F. H. Netter. Netter's Atlas of Human Neuroscience. Icon Learning Systems, 2003.

M. W. Howe, P. L. Tierney, S. G. Sandberg, P. E. Phillips, and A. M. Graybiel. Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500(7464):575–579, 2013.

P. R. Montague, P. Dayan, and T. J. Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience, 16(5):1936–1947, 1996.

J. Olds and P. Milner. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. Journal of Comparative and Physiological Psychology, 47(6):419, 1954.

J. X. O'Reilly. Making predictions in a changing world—inference, uncertainty, and learning. Frontiers in Neuroscience, 7, 2013.

R. A. Rescorla, A. R. Wagner, et al. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory, 2:64–99, 1972.

W. Schultz. Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1):1–27, 1998.

W. Schultz. Getting formal with dopamine and reward. Neuron, 36(2):241–263, 2002.

W. Schultz and R. Romo. Dopamine neurons of the monkey midbrain: contingencies of responses to stimuli eliciting immediate behavioral reactions. Journal of Neurophysiology, 63(3):607–624, 1990.

W. Schultz, P. Apicella, and T. Ljungberg. Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. The Journal of Neuroscience, 13(3):900–913, 1993.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

U. Ungerstedt. Adipsia and aphagia after 6-hydroxydopamine induced degeneration of the nigro-striatal dopamine system. Acta Physiologica Scandinavica, Supplementum, 367:95–122, 1970.

W. R. Uttal. The New Phrenology: The Limits of Localizing Cognitive Processes in the Brain. The MIT Press, 2001.
R. A. Wise. Dopamine, learning and motivation. Nature Reviews Neuroscience, 5(6):483–494, 2004.

R. A. Wise and H. V. Schwartz. Pimozide attenuates acquisition of lever-pressing for food in rats. Pharmacology Biochemistry and Behavior, 15(4):655–656, 1981.
Appendices

Likelihood function implementation in MATLAB

function [Likelihood] = GetValueFunctions_optimize(params, Choice, Reward)
% Negative log-likelihood of a choice sequence under the value-learning model.
% params = [betaV alphaV alphaS betaS beta1 betac]
betaV  = params(1);
alphaV = params(2);
alphaS = params(3);
betaS  = params(4);
beta1  = params(5);
betac  = params(6);

V  = zeros(2, length(Choice));      % value estimates for the two levers
S  = zeros(2, length(Choice));      % second value trace
TD = zeros(2, length(Choice));      % prediction errors
P  = zeros(2, length(Choice)-1);    % choice probabilities
Likelihood = 0;

for K = 2:length(Choice)
    cur_choice = Choice(K);
    if cur_choice == 1
        RK = [Reward(K); 0];        % reward assigned to the left lever
        cur_choice_index = 1;
    else
        RK = [0; Reward(K)];        % reward assigned to the right lever
        cur_choice_index = 2;
    end
    V(:,K)  = V(:,K-1) + alphaV*(RK - V(:,K-1));
    S(:,K)  = S(:,K-1) + alphaS*(RK - V(:,K-1));
    TD(:,K) = RK - V(:,K-1);
    % Softmax probability of choosing the left lever. This line was truncated
    % in the source; only the closing parentheses have been restored, so any
    % further terms (e.g. involving betaS and S) may be missing.
    P(1, K-1) = 1 / (1 + exp(-(betaV * (V(1,K) - V(2,K)) + beta1 + betac * Choice(K-1))));
    P(2, K-1) = 1 - P(1, K-1);
    Likelihood = Likelihood - log(P(cur_choice_index, K-1));
end
end
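As a hypothetical usage example (the choice and reward vectors below are made up and not taken from the experimental data), the function can be evaluated directly at the parameter values from Beeler et al. that serve as the starting point in the optimization procedure below:

params = [39.9 0.042 0.342 -8.185 0.461 3.082];            % [betaV alphaV alphaS betaS beta1 betac]
Choice = [1 -1 1 1 -1 1];                                  % 1 = left lever press, -1 = right lever press
Reward = [1  0 1 0  0 1];                                  % 1 = reward delivered on that trial
nll = GetValueFunctions_optimize(params, Choice, Reward)   % negative log-likelihood of the sequence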
The optimization procedure, which calls the likelihood function above

% Fit the model to each of the 20 mice, starting from the parameter values
% reported by Beeler et al.
load matlab_mouse_struct;
options = optimoptions('fmincon', 'MaxFunEvals', 1000, 'UseParallel', 'Always');
beeler = [39.9 0.042 0.342 -8.185 0.461 3.082];    % starting point for the fit
sol      = zeros(20, 6);
fval     = zeros(20, 1);
exitflag = zeros(20, 1);

for i = 1:20
    mus = MOUSE(i);
    mus.File                                       % display which data file is being fitted
    % Encode choices: 1 = left lever, -1 = right lever, 0 = no press
    Choice = double(mus.LeftLeverPress | mus.RightLeverPress);
    Choice(mus.RightLeverPress) = -1;
    % Align rewards with the lever press that produced them
    Reward = zeros(size(Choice));
    Rewardindx = find(mus.reward) - 1;
    Reward(Rewardindx) = 1;
    % Keep only time bins in which a lever was pressed
    keepindx = Choice ~= 0;
    Reward = Reward(keepindx);
    Choice = Choice(keepindx);
    % Bounded maximum-likelihood fit. The closing parenthesis of the anonymous
    % function was lost in the source; it has been restored here so that
    % 'beeler' is the starting point and the two vectors are the lower/upper bounds.
    [sol(i,:), fval(i), exitflag(i)] = fmincon(@(x) GetValueFunctions_optimize(x, Choice, Reward), ...
        beeler, [], [], [], [], [0 0 0 -25 -1 0], [50 2 2 0 5 5], [], options);
end
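The 100 fits with random start values discussed in the Results section could be generated with a multi-start wrapper of the following form. This is a sketch under the assumption that Choice, Reward and options have been prepared as in the listing above; it is not part of the original appendix.

% Multi-start fitting: repeat the optimization from random points inside the box constraints.
lb = [0 0 0 -25 -1 0];                              % lower bounds on the parameters
ub = [50 2 2 0 5 5];                                % upper bounds on the parameters
nStarts = 100;
solutions = zeros(nStarts, 6);
nll = zeros(nStarts, 1);
for s = 1:nStarts
    x0 = lb + rand(1, 6) .* (ub - lb);              % random start point inside the bounds
    [solutions(s,:), nll(s)] = fmincon(@(x) GetValueFunctions_optimize(x, Choice, Reward), ...
        x0, [], [], [], [], lb, ub, [], options);
end
% 'solutions' and 'nll' can then be passed to the MDS sketch in the Results section.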