Bachelor’s Thesis
Emil Østergård Johansen
Calculation of Reward Prediction in Behaving Animals
Supervisors: Christian Igel & Jakob Dreyer
January 4, 2015
Contents

1 Introduction
2 Background
  2.1 Dopamine reward system
  2.2 Reinforcement Learning & Prediction Error
  2.3 Model of learning in the brain
3 Modeling
  3.1 Experimental Setup
  3.2 The Model
  3.3 Parameter estimation
  3.4 Results
4 Discussion
  4.1 Local minima
  4.2 Uncertainty
  4.3 Discounting
5 Conclusion
Appendices
Abstract
The theory of reinforcement learning can be applied to modeling learning processes in humans
and animals. The neurotransmitter dopamine is thought to play a central role in the
reward system, encoding the temporal difference error. This thesis reviews the theory of
reinforcement learning along with the dopamine reward system in the brain. A temporal
difference learning algorithm is used to model behavioral data of wild type and dopamine
transporter knockdown (DATkd) mice. The parameters of the model are adjusted using
data-driven optimization. In the literature, significant behavioral differences between wild
type and DATkd mice were explained by differences in the temperature parameter of the
model. However, the experiments in this thesis showed that there are several local minima
in the parameter space, which allow for alternative explanations of the observed behavioral
differences.
1 Introduction
How does the brain work? Understanding this fundamental question is the main objective
in neuroscience and many other disciplines. Among the functions of the brain, learning
remains an area of key interest. Behavioral psychology has paved the road for a massive
body of knowledge to which famous names such as B.F. Skinner (the Skinner box) and Ivan
Pavlov (classical conditioning) have contributed. While behavior is readily observed and can be quite informative for hypothesizing about the inner mechanics of learning in the brain, neurophysiology, the branch of physiology concerned with the functioning of the brain, offers a detailed look at what ‘makes it all tick’.
The brain consists of a myriad of living cells among which the most important are neurons
(colloquially known as brain cells) and glial cells. The intriguing property of this mass of cells is that it is highly organized. Neurons communicate with one another through cable-like projections, and each neuron can communicate simultaneously with upwards of
10,000 other neurons. Connections are strengthened and weakened dynamically over time,
but the gross organization appears static. This is why it is possible for neurosurgeons to
perform complicated operations with a high degree of confidence that they can predict the
consequences, and it is also the reason why researchers can investigate the physiology of
specific structures of the brain with 10−6m precision.
The discovery of the neurotransmitter dopamine, and subsequently of networks in the brain where dopamine plays a central role, kick-started intense research showing that dopamine is responsible for much of the behavior observed by psychologists. A series
of experiments on monkeys [Schultz, 1998] let researchers directly monitor the activity of
dopamine neurons during learning tasks and a clear correlation was observed.
While neuroscientists have investigated the inner workings of the brain, computer sci-
entists have tried to teach machines to learn and while doing so have developed a group
of algorithms under the common term ‘reinforcement learning’. A recurring quantity in these algorithms, namely the prediction error, was surprisingly observed to imitate the
behavior of the dopamine neurons monitored in the experiments by Schultz [1998] and since
then reinforcement learning algorithms have been used to model behavioral data.
The work by Beeler et al. [2010] is an example of combining neuroscience with reinforce-
ment learning. They investigate the effect of the background activity of dopamine neurons on observed behavior by presenting two groups of mice with the same learning task. One group has been genetically modified to have increased background activity of dopamine neurons, and a reinforcement learning model is used to compare various parameters between the two groups.
It is the objective of this thesis to describe the background theory of learning in the
context of both the dopamine system and of reinforcement learning. The data obtained by
Beeler et al. [2010] will then serve as a framework for applying this knowledge to an actual
modeling problem and stimulate a discussion of these methods. There is a large amount of
literature on the subject of modeling behavioral data and it is not within the scope of this
thesis to review the newest findings nor implement them. The background review is based
mostly on classical references.
2 Background
2.1 Dopamine reward system
Various theories about the organization of the brain have been proposed throughout the
centuries. In the 18th century Franz Joseph Gall founded the school of phrenology which
claimed that various bumps on the external surface of the cranium would predict various
psychological traits about the person being examined. When focus shifted to the brain,
this idea paved the road for the functional specialization theory, namely that each cognitive
function and psychological trait corresponded to a bounded area in the brain.
Lesion case studies, where the subjects of study were patients with a brain tumor, internal bleeding or other kinds of local damage to an area of the brain, produced many interesting results that appeared to prove this theory. The most famous of these patients was Phineas Gage, who miraculously survived after having an iron rod driven through his skull, destroying his left frontal lobe. Gage survived with most cognitive skills intact but with a personality change that rendered him unrecognizable to his friends and family. This suggested that the frontal lobes are somehow responsible for personality. The famous Broca's area in the inferior (lower) part of the frontal lobe, which supposedly is responsible for language, was similarly discovered by observing that patients with damage to this area all appeared to have trouble with speech production.
Other theories are of course also being explored. The opposite of functional specialization is the theory of distributed processing, where cognitive functions are assumed to be networks within the brain, not bound to a specific area [Uttal, 2001].

Figure 1: Publications on various neurotransmitters throughout the years. Source: Google Scholar.
Whatever the organization of function, pathways, which consist of the projections of axons from a collection of neurons in one part of the brain to another part, have been thoroughly documented, and a group of these pathways is of special interest when investigating motivation and reward.
Dopamine was discovered as an independent neurotransmitter by Carlsson and colleagues (1957) at Lund University, and research on dopamine has been increasing ever since (Figure 1). Dopamine was first associated with motor function [Wise, 2004], but later also with motivational behavior, first described in Ungerstedt [1970], where feeding and drinking deficits were observed after inducing damage to the mesolimbic dopamine system. Wise and Schwartz [1981] showed that rats injected with the dopamine D2 receptor blocker pimozide did not learn to press a lever in order to obtain food and water, whereas rats that were not injected did learn this task. This further suggests that dopamine plays a key role in the modulation of motivation and reward-based learning.
The dopaminergic system in the brain consists of several pathways that together span most of the brain but all emanate from the midbrain (Figure 2). The most important pathway in regard to reward learning is the mesolimbic pathway, which originates from dopaminergic neurons in the ventral tegmental area (VTA) in the midbrain that project their axons especially to the cortex.
In a series of groundbreaking experiments, starting with Olds and Milner [1954], a rat
was placed in a box with a lever which upon pressing delivered electrical stimulation to the rat's brain through an implanted electrode (see Figure 3). In this set-up the rat would first step on the lever by accident and then retreat away from the lever. However, it would
soon come back and press again, and as time went by it would spend most of its time
pressing the lever. In extreme cases rats were observed to press the lever continuously until
fainting from exhaustion. These electrical self-stimulation experiments were repeated, each
time changing the position of the stimulation electrode in the rat brain, and the structure
identified as producing the strongest behavioral response was the ventral tegmental area
(VTA). Blocking dopamine receptors reduced the degree of self-stimulation so the conclusion
was that the rats worked to release dopamine, and thus dopamine served as a reward that
reinforced the behavior that caused the release of it.
Figure 2: Dopaminergic system overview (adapted from Felten et al. [2003]).

Wise and colleagues have carried out numerous self-stimulation experiments throughout the years and located many of the structures involved in motivation and reward by noting
that when they observed a positive result (boosted reward experience of the animal seen by
repeated self stimulation), the stimulation electrode was located either in close proximity to
axons of dopaminergic neurons or to axons that terminated at dopaminergic neurons [Schultz,
2002].
The important experiments which gave rise to the data essential for this thesis were
done by Schultz and his colleagues (various articles 1990-1994) on behaving monkeys. In
the experiments firing patterns of dopaminergic neurons were measured in vivo (while the
monkeys were alive) as the monkeys performed various learning tasks. They were placed in
front of a box which was covered to prevent vision of its contents. Electrodes were placed near a population of dopamine neurons in the monkey's brain, and the monkey was then allowed to insert its hand into the box, which could contain either nothing, a reward (an apple) or a neutral object (a string).
Figure 4A shows a transient increase in dopaminergic neuron firing when the hand of the
primate touches an apple (upper figure) inside the box and an absence of this firing when
there is no apple (lower figure). Control experiments were performed (Figure 4B and C)
which confirmed that: 1. The qualitative observation was indeed due to touching an apple
(a reward) and not just any object (a string was used in place of an apple). 2. The same
result was not due to movement of the arm alone.
The more interesting result, however, is found in Figure 4E where the movement of
reaching for the apple inside the box was initiated by a stimulus, in this case the opening of the door to the box. In the control trial (Figure 4D) the dopaminergic neurons exhibit a transient increase in firing rate, as we would by now expect, when the hand reaches the reward. The movement here is self-initiated. When the movement is initiated by the door-opening stimulus, the neural response is at first observed when the monkey reaches the reward, similarly to the control. However, after learning the association between stimulus
and the possibility of finding a reward, the increase in firing rate is observed at the time of
the stimulus and not at the time of discovery of the reward. This observation was replicated
in Ljungberg et al. (1992), also with monkeys. It was concluded that this transferring of
neuronal response happens over the course of learning. This is an impressive direct view of a change on the cellular level in the brain that happens during learning.

Figure 3: Self-stimulation experimental setup (adapted from Bear et al. [2007]).
Let us look not at the transferring of neuronal activity but just at the decrease of activation at the time of reward. How is this decrease in phasic activation at the time of reward to be interpreted? An obvious interpretation is that the animals develop an insensitivity to the rewards and thus less activation in the dopamine neurons is observed. This was, however, falsified in [Mirenowicz and Schultz, 1994]. Rather, it seems that the degree of activation somehow reflects the uncertainty or unpredictability [Schultz, 2002] of the reward. In the
first trials, when the reward is unexpected, the activation of dopamine neurons is at its
highest. Then slowly over the course of learning the activity decreases until it is not present
at all. This can be interpreted as a form of learning or in other words, a steady increase in
the certainty that the reward will be presented.
When the reward, against expectation, is deliberately withheld or delayed, a significant drop in dopamine neuron activity is observed [Schultz et al., 1993], which seems like
a natural extension of the unpredictability theory. The drop in activity occurs exactly at
the time where the reward is usually delivered which suggests a sensitivity to the timing of
the reward in addition to the occurrence. The important theme in these individual neuron
experiments by Schultz and others is the unpredictability of reward, as it has come to be
termed, in both timing and occurrence. It is becoming evident that the phasic activity (as
opposed to the constant tonic firing) of dopamine neurons can be interpreted to be pro-
portional to the error in the animals’ prediction of the reward: When the expectation of a
reward is low or non-existent (as it would be prior to any learning task), a large amount
of activity is observed when a reward is presented - contrary to expectation. However, over
time, as the animal becomes surer that the reward follows the stimulus, the unpredictability decreases and with it the prediction error, and thus smaller neuronal activity is observed. The term ‘prediction error’ is important because it is a cornerstone of reinforcement learning algorithms, and it is in this quantity that the connection was made between the results discussed here and the reinforcement learning theory most notably described
and developed in Sutton and Barto [1998].
Figure 4: Mesencephalic dopamine neuron firing responses in the experiment. From [Schultz and Romo, 1990].

Figure 5: Transferring of predictive firing. From [Schultz, 1998].
2.2 Reinforcement Learning & Prediction Error
The prediction error, as introduced in the previous section, is a central theme in modern conditioning theories [Rescorla et al., 1972, Schultz, 2002]. In these theories it is suggested that in order for learning to take place, the prediction error must be different from zero. A positive or negative prediction error will lead to the acquisition or extinction of behavior.
To see why this is so, imagine a dog being presented with a stimulus such as the ring of a
bell. In all likelihood the dog predicts that nothing will happen after the bell has sounded.
And if nothing happens, the dog will continue to expect nothing to happen the next time the bell rings. But if something, contrary to the dog's prediction, does happen, there is a prediction error between the dog's prediction and what actually happened, and as a result the dog changes its prediction for the next time the bell rings; in other words, it learns.
This theory is pleasing because it fits well with our intuitive understanding of learning and
the classic Pavlovian dog experiments.
It should by now be increasingly clear that the prediction error observed in the single dopamine neuron experiments by Schultz and the prediction error observed in learning tasks in classical conditioning (such as the dog example above) are one and the same, and it is this subtle point that provides the main motivation for both the data set used in this thesis and the model which will be introduced later. First, however, I think it is appropriate to clarify some terms.
In conditioning we distinguish between two main fields of study: classical and instrumental (operant) conditioning. The main difference is that the actor in classical conditioning has no influence on whether it is presented with a reward or not; in other words, reward delivery is completely automatic. In instrumental or operant conditioning, the opposite is true. Here the actor's interaction with the environment can influence the chance of whether
or not a reward is presented. A system where the actor is presented with positive rewards
and negative rewards (punishments) which provide the basis for learning, is what is known
as a reinforcement learning system which is studied widely in both psychology and computer
science for machine learning. Reinforcement is here taken to mean the strengthening of
stimulus-stimulus, stimulus-response, or stimulus-reward associations that result from the
timely presentation of a reward [Wise, 2004].
Many results from psychological classical conditioning experiments can be predicted by a simple learning rule which is one of the classical references in the field, namely the Rescorla-Wagner rule [Rescorla et al., 1972]. It offers a simple linear relationship for pre-
dicting reward. If we denote v as the expected reward, u as a binary variable representing
the presence or absence of a stimulus and w as a weight associated with the stimulus, then
the expected reward is predicted as
v = wu (1)
The learning takes place in updating the value of the weight w with an update rule that is designed to minimize the prediction error, that is, the squared error between expected reward and actual reward, (r − v)^2 [Dayan and Abbott, 2001]. This is what is known as the Rescorla-Wagner rule

w ← w + ε δ u    (2)

where
• δ = r − v is the prediction error
• ε is a learning rate determining how fast the learner incorporates new information into prior knowledge
If a reward follows every time a stimulus is presented, then w will converge to 1 as the number
of trials increases. If the reward is stochastic and normally distributed, w will converge to
the theoretical mean of that distribution.
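As a minimal sketch of the update in (2) for a single stimulus with a stochastic binary reward, the following MATLAB lines illustrate the convergence; the learning rate and the reward probability are arbitrary example values.

% Minimal sketch of the Rescorla-Wagner update (2) for a single stimulus.
% The learning rate and the reward probability are arbitrary example values.
epsilon = 0.1;           % learning rate
p_reward = 0.7;          % probability that the stimulus is followed by a reward
w = 0;                   % association weight; v = w*u with u = 1 on every trial
for trial = 1:500
    r = double(rand() < p_reward);  % stochastic binary reward
    delta = r - w;                  % prediction error (u = 1, so v = w)
    w = w + epsilon * delta;        % Rescorla-Wagner update
end
fprintf('Final weight %.3f, expected reward %.1f\n', w, p_reward);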
As already mentioned, this is a model of classical conditioning, meaning that the actor does not interact with the environment and hence has no influence over whether a reward is presented
or not. It does a poor job modeling instrumental conditioning because there is no notion of
time or interaction between the actor and the environment in the model. For this we need a
slightly more complicated setup.
Reinforcement learning is a problem of optimal control in the sense that the actor, who is
learning, has some control over the environment. Actions lead to rewards or the lack thereof
and the actor seeks to reap the largest amount of rewards. In order to describe this problem
mathematically we need some clearly defined abstractions:
• a set of states S
• a set of actions A
• a transition function T
• a reward function R
• a value function V
• a policy π
For the remainder of this introduction we will use the experimental setup described in
Section 3.1 as an example. This setup consists of a mouse in a cage with two levers that
deliver rewards in the form of food. Obviously time is continuous in the real world, but for modeling purposes it is beneficial to discretize the world into discrete time steps t and states s_t ∈ S, where s_t is the state at time t and S is the set of all possible states.
From the mouse’s point of view, the way to transition between these states is through
actions which, like states, come from a set of actions so the action at time t we write as
at ∈ A. A certain action then, like pressing a lever, alters the state of the environment
iterating t to t + 1 so that the new state is st+1 ∈ S. The next thing we need is a function
that maps state s_t and action a_t to the next state s_{t+1}. This we call the transition function T : (s_t, a_t) → s_{t+1}, and it describes the laws that govern the interaction between the mouse and its environment. It is also termed an environment model. The environment can be deterministic in the sense that a specific action will always lead to a specific state, but often it is stochastic, so that T is a set of transition probabilities.
Finally there is the reward function R : (s_t, a_t) → r_t, which maps a state and an action to the probability of a reward. Often r_t is a binary variable which represents the presence or absence of reward at time t, just like the stimulus variable u in the Rescorla-Wagner setup.
With this set of abstractions we have a well-defined environment S, a way for the mouse to interact with the environment through actions A, and a transition function T which tells us how the environment changes under each action. Finally, the incentive for learning, the rewards, are also well defined by R.
Now comes the interesting part, namely how the mouse chooses what action to take
in order to maximize reward. The way we model this is to assume that the mouse keeps
some value estimate for each possible action so that by choosing the action with the highest
estimated value, it can maximize reward. This is called the value function V : a_t → v_t. The mouse updates this value function based on the rewards it receives, a trial-and-error concept. If V^*(a) is the true value for action a, the mouse obviously tries to achieve the following property:

lim_{t→∞} V_t(a) = V^*(a)    (3)
A simple way to update V is inspired directly by the Rescorla-Wagner rule, where the weight w is replaced with the value and the stimulus u is omitted:

v_t ← v_t + α δ_t    (4)

Here δ_t = r_t − v_t, and the learning rate ε has been renamed to α to keep the two update rules separate.
As already suggested, the mouse chooses its actions based on the value function. The
function that encapsulates this behavior of the mouse is termed its policy, π : s −→ a. The
policy π is in words a function that maps from a state to an action. This will usually have
the form of a simple look-up table although it can be imagined to take more complex forms.
The process of updating value estimates and thus the policy is known as policy optimization.
The relationship between the value function and the policy can vary from model to model.
If the policy of the mouse is designed so that it always chooses the action which has the highest value estimate (highest expected value), then this is known as the greedy choice, and it is one of two extremes. The other extreme is to completely disregard the value estimates and assign equal probabilities to choosing each action. Many different strategies have been proposed. One easy-to-understand strategy is the so-called ε-greedy method (this ε is not to be confused with the learning rate ε from the Rescorla-Wagner rule). Here ε is a number between 0 and 1 which controls the balance between choosing the greedy action and choosing randomly among all actions. Another approach is to assign weighted probabilities according to the value
estimates V (a) for each action. A common implementation of this idea is to use the Gibbs
distribution, given as

e^{V(a)/β} / Σ_{b=0}^{k} e^{V(b)/β}    (5)

to calculate the probability weights [Sutton and Barto, 1998]. This is also known as the softmax selection rule; it chooses stochastically among all actions, but with higher-value actions weighted more heavily. More specifically, it computes the probability P(a_t = a). This model
is used in the data modeling of this project.
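As a minimal sketch, the softmax selection in (5) can be implemented in a few lines of MATLAB; the value estimates and the temperature below are arbitrary example values.

% Minimal sketch of softmax (Gibbs) action selection as in (5).
% The value estimates and the temperature are arbitrary example values.
V = [0.2, 0.8];                          % value estimates for two actions
beta = 0.3;                              % temperature
p = exp(V / beta) / sum(exp(V / beta));  % selection probabilities P(a_t = a)
a = find(rand() < cumsum(p), 1);         % sample one action according to p
fprintf('P = [%.3f  %.3f], chose action %d\n', p(1), p(2), a);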
The fact that the value estimates converge as in (3) is less interesting than how fast they converge. That is, not only does the mouse want to optimize its policy to yield the largest value from its future actions, but it also wants to optimize the rewards it gets during learning, as it cannot go on learning forever. This dilemma is known as the exploit-explore problem and it is central to reinforcement learning.
The following is based on the introduction to temporal difference learning in Dayan and
Abbott [2001]. As suggested above, we interpret the value function V as the estimated value of an action. To elaborate, what we really mean is that at time t, V(t) is the total future reward expected from time t onward to the end of the trial:

Σ_{τ=0}^{T−t} r_{t+τ}    (6)
Usually future rewards are discounted which means that future rewards are assigned a lower
value than immediate rewards. This is intuitive in the sense that we can imagine an actor to
have more incentive to obtain a reward immediately rather than obtaining the same reward
a year later. This is modelled with a discounting factor γ where 0 ≤ γ ≤ 1. If γ = 0 we say
that the actor is myopic [Sutton and Barto, 1998] or short-sighted meaning that it does not
take into account future rewards but is only interested in maximizing immediate rewards.
As γ approaches 1, the actor becomes more and more far-sighted. So we rewrite (6) as

Σ_{τ=0}^{T−t} γ^τ r_{t+τ}    (7)
In this setting, the prediction error δ(t) must then be the difference between the actual future rewards and the expected total future reward:

δ(t) = Σ_{τ=0}^{T−t} γ^τ r_{t+τ} − v_t    (8)
This, however, is impossible to compute at time t because we do not yet know the actual
future rewards. This is solved by noting that the first term in (8) can be rewritten as

Σ_{τ=0}^{T−t} γ^τ r_{t+τ} = r_t + Σ_{τ=0}^{T−t−1} γ^{τ+1} r_{t+1+τ} = r_t + γ Σ_{τ=0}^{T−t−1} γ^τ r_{t+1+τ}    (9)
and, as we just discussed, v_{t+1} is an estimate of the total future reward from time t + 1 onward, so for the latter term

γ Σ_{τ=0}^{T−t−1} γ^τ r_{t+1+τ} ≈ γ v_{t+1}    (10)
so the prediction error becomes

δ(t) = r_t + γ v_{t+1} − v_t    (11)

and we insert this in (4) and get the update rule

v_t ← v_t + α (r_t + γ v_{t+1} − v_t)    (12)
The difference v_{t+1} − v_t gives this method its name, the temporal difference method. At first glance it looks like we again need information about the future (v_{t+1}) to compute the prediction error. One must remember, however, that v_t designates the value of the action taken at time t and, similarly, as soon as we know the action a_{t+1}, which is computed with e.g. the softmax selection rule, we know the value estimate v_{t+1}. This method is called a bootstrapping method [Sutton and Barto, 1998] because it uses an estimate for estimation.
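As a minimal sketch, a single temporal difference update following (11) and (12) looks as follows in MATLAB; all numerical values are arbitrary examples.

% Minimal sketch of one temporal difference update, equations (11)-(12).
% All numerical values are arbitrary examples.
alpha = 0.1;      % learning rate
gamma = 0.9;      % discount factor
r_t      = 1;     % reward observed at time t
v_t      = 0.5;   % value estimate of the action taken at time t
v_tplus1 = 0.7;   % value estimate of the action selected at time t+1
delta = r_t + gamma * v_tplus1 - v_t;  % prediction error (11)
v_t   = v_t + alpha * delta;           % bootstrapped update (12)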
We now have the complete framework for classical reinforcement learning in order. We have defined the environment and the actor's interaction with this environment through the sets of states S and actions A, and we have defined how the actor selects its actions based on the
value estimates it holds about each action (its policy). Finally we have shown how the actor
can update its knowledge by observing the rewards it receives (12).
2.3 Model of learning in the brain
In this section I want to formalize the connection between the dopamine system and rein-
forcement learning a bit further. Beyond the intuitive idea that the brain uses the prediction error for learning, a well-defined model of what the various functions in reinforcement learning represent has been proposed [Montague et al., 1996] and is depicted in Figure 6.
In this model, information about the task to be learned is stored in the cortex which con-
sists of different modalities. When some task is engaged, the cortex outputs its information
about this task to an intermediate layer which represents possible information processing in
other parts of the dopamine system (See Figure 2) . This information can be both excita-
tory (encouraging) or inhibitory (discouraging). For example some modality of the cortex
can discourage the task because it is learned to be dangerous while another modality of
the cortex can encourage the task because it often yields food and the actor is hungry. All
these outputs are weighted individually and summed in our familiar value function V(t). In the figure, V̇(t) represents a temporal difference, which we recognize as V(t + 1) − V(t) (see Section 2.2). The information is then passed downwards to a group of dopamine neurons P, for example in the ventral tegmental area, which also receives external information about the presence or absence of a reward r(t) and other factors which may have influence over the output from P. It is in the dopamine neurons that the prediction error δ(t) is computed simply as the net input

δ(t) = r(t) + V̇(t) = r(t) + V(t + 1) − V(t)    (13)
and then output as a dopamine signal back to the cortex (again refer to Figure 2). This
model is in accordance with the experiments by Schultz et al. discussed in Section 2.1 where
it was observed that the dopamine signal in monkeys appeared to represent the prediction
error δ(t). We also note that (13) is a special case of (11) where γ = 1.
Figure 6: Model of information flow through the dopamine system. Adapted from Montague et al. [1996].
The model in Figure 6 explains how the
prediction error and the consequent learning
might be physiologically instantiated in the
brain. The other part that needs an inter-
pretation is the action selection policy. The
softmax action selection model (5) was in-
troduced in Section 2.2 and it is the pre-
ferred model for action selection in reinforce-
ment models as it has been observed to fit
well with behavioral data [Daw and Doya,
2006]. The variable parameter of the soft-
max model is the β parameter also called
‘temperature’, an analogue to the movement
of molecules under different temperatures.
As β −→ 0, the probability of selecting the
action with the highest value approaches 1.
Conversely, as β −→ ∞ the probability is
spread out so that all actions have the same
probability of being selected.
In Figure 7 a simulation of a famous instructional reinforcement learning problem is
shown. The problem is that of a two-armed bandit. Each arm has its own probability of
yielding a reward and it is up to the learner to balance exploitation of learned knowledge of
which lever is more likely to yield a reward and exploration of the other lever in case it might
actually be better. Figure 7 shows a game that lasts 2000 plays. At play 400, the reward-yielding probabilities of the levers are switched in order to see how fast the model is able to adapt to this. The result has been averaged over 2000 episodes to smooth out the curves
sufficiently. The choice of which arm to pull was made with the softmax action selection rule
and the result for four different values of β is shown. As is expected, the most conservative
player with β = 0.01 realizes the change the latest and is slowest at adapting. Similarly the
player with β = 0.5 is the quickest at realizing the change and adapting. There are many
qualitative attributes that are interesting to investigate (e.g. it seems that a medium value
is better in the long run, which it is), but overall it serves as a nice illustration of how β
affects the ability to learn and adapt. Although there are hypotheses as to how the brain controls this, it is still an open question [Beeler et al., 2010, Daw and Doya, 2006].
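A single episode of such a two-armed bandit simulation could be sketched in MATLAB as below; the reward probabilities and the learning rate are assumed example values (they are not stated in the text), while the number of plays, the switch at play 400 and the softmax rule follow the description above.

% Sketch of one episode of a two-armed bandit simulation as in Figure 7.
% The reward probabilities and the learning rate are assumed example values.
n_plays = 2000; switch_at = 400;
p_reward = [0.8, 0.2];        % assumed reward probabilities of the two arms
beta = 0.1; alpha = 0.1;      % temperature and learning rate (example values)
V = [0, 0];                   % value estimates for the two arms
reward_history = zeros(1, n_plays);
for t = 1:n_plays
    if t == switch_at
        p_reward = fliplr(p_reward);          % the two arms are swapped
    end
    p = exp(V / beta) / sum(exp(V / beta));   % softmax selection (5)
    a = find(rand() < cumsum(p), 1);          % sample an arm
    r = double(rand() < p_reward(a));         % stochastic binary reward
    V(a) = V(a) + alpha * (r - V(a));         % value update as in (4)
    reward_history(t) = r;
end
% Averaging reward_history over many such episodes gives curves like Figure 7.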
Figure 7: Two-armed bandit simulation with different temperature values.
3 Modeling
3.1 Experimental Setup
The data used for modeling in this thesis is the same data used in Beeler et al. [2010].
The main objective of the original experiment was to investigate whether tonic dopamine
modulates learning. This was done by comparing the performance of dopamine transporter
knockdown (DATkd) mice against wild type (normal) mice (n = 10 for both groups).
A home cage operant paradigm was used in the experimental setup, which means that
mice earn their food entirely through lever pressing. Each mouse lives in its own cage for
10 consecutive days where water is freely available, but food can only be obtained through
lever pressing. In each cage are two levers, each of which has a probability of yielding food upon pressing. When the probabilities of food release for the two levers are not equal, one lever has a lower probability of food release than the other, making the latter ‘cheaper’ in terms of the amount of work required to gain a reward. Which lever is the cheaper one changes every 20-40 minutes, thus constantly offering a learning opportunity for the mouse.
The data is recorded as event codes and accompanying time codes for each event with a
temporal resolution of 100 ms. Events are e.g. ‘left lever pressed’, ‘reward delivered’, ‘lever
probabilities interchanged’ etc.
For the parameter estimation, the data has been converted into two vectors: a binary choice vector c, where an element c_t is 1 or −1 depending on which lever is pressed. c contains the entire choice sequence for a single mouse concatenated over ten days, which is about 50000 events. The other vector is a binary reward vector r of the same length as c. An element r_t is 1 if a reward was delivered when lever press c_t was performed and 0 otherwise. All temporal information (the time codes in the original data set) has been removed, since the model only predicts what choices are made in which order, not when they are made.
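As a purely illustrative toy example (the values are invented, not taken from the data set), the two vectors have the following form:

% Toy example of the data representation; all values are invented.
% c(t) is the lever pressed at event t, r(t) whether that press was rewarded.
c = [ 1  1 -1 -1 -1  1 ];   % choice vector, coded 1 / -1 for the two levers
r = [ 0  1  0  0  1  0 ];   % reward vector, 1 if the press delivered food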
3.2 The Model
Figure 8
In the original work, Beeler et al. [2010] fit two models to the behavioral data described in the previous section. One is a general logistic regression model with the choice variable c as the dependent variable and the 100 previous rewards r_{t−100}, ..., r_{t−1} for each t as the explanatory variables. The result of this model is shown in Figure 8, where
left on the horizontal axis corresponds to
the most recent rewards. A positive coeffi-
cient value indicates that a reward tends to
promote staying on the same lever whereas
a negative value indicates switching to the
other lever. Intuitively we would expect the most recent rewards to promote staying on the same lever more strongly than rewards further in the past, due to e.g. memory. This holds true for the figure shown here except
for the very most recent rewards where coefficient values tend to −1. This means that the
mice tend to switch to the other lever one or two time steps after receiving a reward but in
the longer run they follow a familiar learning pattern.
A simple softmax action selection model as in (5) will not predict this behavior but rather
that a reward received immediately prior to the current time step will promote staying on
the same lever the strongest. So this model has to be extended. The simple softmax model for two possible actions (levers) takes the following form:
P(c_t = 1) = e^{V_t(1)/β} / (e^{V_t(1)/β} + e^{V_t(−1)/β}) = 1 / (1 + e^{−(V_t(1) − V_t(−1))/β}) = σ(β_V (V_t(1) − V_t(−1)))    (14)

where σ(z) = 1/(1 + e^{−z}) is the logistic function and β_V = 1/β is the inverse temperature. Whether one uses the temperature or its inverse is simply a matter of convention, and since the inverse is used in the original article, we adopt this notation here. Similarly for P(c_t = −1).
The model used in Beeler et al. [2010] is as follows:
P(c_t = 1) = σ(β_V [V_t(1) − V_t(−1)] + β_1 + β_c c_{t−1} + β_S [S_t(1) − S_t(−1)])    (15)

where
• β_V is the inverse temperature for V
• V_t = V_{t−1} + α_V (r_{t−1} − V_{t−1}) is the value function
• β_1 is the bias term towards one lever
• β_c is the bias towards the previously pressed lever
• β_S is the inverse temperature for S
• S_t = S_{t−1} + α_S (r_{t−1} − S_{t−1}) is the short-term value function

Apart from the simple bias terms, the model has been augmented relative to (14) to include an additional value function S, which captures the short-term effects observed in the logistic regression result (Figure 8). By forcing the parameter β_S to be negative we make sure that the last term has the desired effect.
There are six degrees of freedom in this model, namely the parameters β_V, α_V, β_1, β_c, β_S and α_S, so given a value for each parameter we are able to calculate the action probabilities at time t. In symbols this probability is written as

P(c_t = 1 | β_V, α_V, β_1, β_c, β_S, α_S)    (16)
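To make (15) concrete, a single evaluation of the choice probability could look as follows in MATLAB; the parameter and state values below are invented for illustration, not fitted estimates.

% Sketch of evaluating the choice probability (15) for one time step.
% All parameter and state values are invented, not fitted estimates.
betaV = 35; beta1 = 0; betac = 3; betaS = -7;
V = [0.4, 0.3];      % long-term value estimates for levers 1 and -1
S = [0.6, 0.1];      % short-term value estimates for levers 1 and -1
c_prev = 1;          % previously pressed lever, coded 1 or -1
z = betaV*(V(1) - V(2)) + beta1 + betac*c_prev + betaS*(S(1) - S(2));
P_press_1 = 1 / (1 + exp(-z));   % logistic function sigma(z)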
3.3 Parameter estimation
In this section, the method for parameter estimation is established. As we just saw, given any
six parameter values, we can compute the probability of the choice at time t with (15). In fact we not only get the probability for the choice at time t; since the value functions are defined recursively, depending solely on the parameters and on the previous values, we get choice probabilities for the whole sequence. So by finding good estimates of the parameters we mean finding parameters that will produce data that is, to the highest degree possible, like the training data, which is the actual choice sequence of a mouse. So if we, given parameter values, can compute the probability that a single choice equals some actual choice (c_t = 1 or c_t = −1) as
P(c_t | β_V, α_V, β_1, β_c, β_S, α_S) =
    P(c_t = 1)        if c_t = 1
    1 − P(c_t = 1)    if c_t = −1

where P(c_t = 1) is given in (15), and if we assume that each choice is independent of all other choices, we can compute the probability of an entire choice sequence as the product

L(β_V, α_V, β_1, β_c, β_S, α_S | c_1, c_2, ..., c_T) = Π_{t=1}^{T} P(c_t)    (17)
Note that we here assume the choice sequence to be given and let the parameters vary. This
is what is known as the likelihood function and by maximizing this function we can get the
parameters that best fit the actual data. For several reasons it is more practical to convert
the product into a sum by taking the logarithm so that the likelihood function becomes
log L(β_V, α_V, β_1, β_c, β_S, α_S | c_1, c_2, ..., c_T) = Σ_{t=1}^{T} log P(c_t)    (18)
Since the logarithm is a monotone transformation, it will have the same extrema as the
transformed function. Furthermore it is also common to use the negative of the log likelihood
which means the likelihood function should be minimized rather than maximized.
The likelihood computation is outlined in Algorithm 1. This procedure returns the like-
lihood for a set of parameters. Since the choice probabilities depend on the value functions,
and the value functions are recursively defined, the likelihood has to be computed iteratively.
The value functions V and S are both initialized at 0; experimentation with other initial values showed that the results did not differ significantly except for the first few time steps. After implementing this procedure in MATLAB, the built-in optimizer fmincon was used to obtain the results presented in the following section.
input : β_V, α_V, α_S, β_S, β_1, β_c
output: Likelihood (scalar)
// Initialize value functions and likelihood
V_1 = [0, 0]
S_1 = [0, 0]
Likelihood = 0
for t = 2 to end of choice sequence do
    c_t = action chosen at time t
    c_other = action not chosen at time t
    // Update value functions, for the chosen action ...
    V_{t,c_t} = V_{t−1,c_t} + α_V (|r_{t−1}| − V_{t−1,c_t})
    // ... and for the action not chosen
    V_{t,c_other} = V_{t−1,c_other} + α_V (0 − V_{t−1,c_other})
    // Same for the S-values
    S_{t,c_t} = S_{t−1,c_t} + α_S (|r_{t−1}| − S_{t−1,c_t})
    S_{t,c_other} = S_{t−1,c_other} + α_S (0 − S_{t−1,c_other})
    // Update choice probabilities
    P_{t,1} = σ(β_V (V_{t,1} − V_{t,2}) + β_1 + β_c c_{t−1} + β_S (S_{t,1} − S_{t,2}))
    P_{t,2} = 1 − P_{t,1}
    // Update the likelihood with the negative log of the probability of the choice at hand
    Likelihood = Likelihood − log P_{t,c_t}
end
return Likelihood

Algorithm 1: Parameter estimation using maximum likelihood estimation.
3.4 Results
For the fitting a choice of start values and boundaries had to be made. As for start values,
seeing that the primary objective is to replicate the results obtained in Beeler et al. [2010], it
appears natural to choose their parameter estimates as start values. As for boundaries, we
want to make sure that the optimization procedure is searching within the same subspace of the solution space, so boundaries that are relatively tight around the target solution were selected. For the learning rates α_V and α_S, the predetermined boundaries between 0 and 1 were used. After running the fitting procedure, the following results were returned.
Parameter    Wild-type    DATkd     t       p
β_V          36.80        35.02     0.482   0.636
α_V          0.724        0.638     0.378   0.710
α_S          0.361        0.501     0.721   0.480
β_S          −7.633       −6.923    0.317   0.756
β_1          0.087        −0.161    0.851   0.406
β_c          3.291        3.184     0.162   0.413
The parameter estimates are sample means over 10 mice each and the comparison of
means was done with a two-sample t-test. The lowest p-value is 0.406 so we cannot reject
the null-hypothesis that the means are equal for any of the parameters. This is in accordance
with Beeler et al. [2010] except for the parameter βV which in their work was found to be
significantly different when comparing wild-type and DATkd mice. After several experiments
it became evident that the solution is highly dependent on the start values given. This
suggests that the solution space contains many local minima even within relatively tight
boundaries on the parameters. To investigate this further, I performed 100 fits on two different mice with 100 random start values. This gives 100 solutions in R^6. In order to visualize them, multidimensional scaling (MDS) was used to reduce the dimensions to two. MDS is a way to visualize the similarity or distance between high-dimensional data points in a lower dimension as precisely as possible. The ‘error function’ of MDS is termed the strain function, and by plotting the strain as a function of the number of dimensions (a scree plot), an acceptable number of dimensions for visualization can be determined. In this case two is appropriate.
Figure 9: MDS output of 100 parameter fits on two different mice. Similar colors represent similar likelihood.

The result of the MDS analysis is shown in Figure 9. Each point is a solution to the optimization problem and the solutions are color-coded according to their negative log-likelihood
and the hollow blue dots have the lowest value. Several qualitative traits meet the eye.
Firstly, there are two clear clusters of solutions to the left in each plot and the group to
the right could also be interpreted as one or two clusters. Secondly, solutions appear to be
arranged vertically with very little variation in the horizontal direction. In order to interpret this, one must remember that in a dimensionality reduction analysis like MDS, the axes in the lower dimension do not have a clear interpretation, because the plane formed does not necessarily lie along any of the original axes. So an arrangement of solutions parallel to the vertical axis, as in this case, is not the variation of a single parameter, but rather some combination of parameters that in this case does not change the likelihood value. This offers a nice visualization of specific directions in R^6 which do not change the solution considerably. In other words, the likelihood surface has flat regions where many local minima exist.
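Such an MDS visualization could be produced roughly as sketched below, assuming the Statistics Toolbox functions pdist and mdscale; 'solutions' (a 100-by-6 matrix of fitted parameter vectors) and 'nll' (their negative log-likelihoods) are hypothetical variable names.

% Sketch of the MDS visualization, assuming the Statistics Toolbox.
% 'solutions' (100-by-6 fitted parameter vectors) and 'nll' (their negative
% log-likelihoods) are hypothetical variable names.
D = pdist(solutions);                        % pairwise Euclidean distances in R^6
Y = mdscale(D, 2);                           % multidimensional scaling down to two dimensions
scatter(Y(:,1), Y(:,2), 30, nll, 'filled');  % color-code the points by likelihood value
colorbar;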
The results presented here highlight a central problem in mathematical modeling which
will be discussed further in the following section. Although a fit is obtained and parameter estimates computed, whether these are actually meaningful for individual mice and whether they
can be compared between mice requires careful consideration. When several local minima
are present and especially when they have the same value, the result may be ambiguous.
4 Discussion
4.1 Local minima
This subject has already been touched upon in the Results section where it was demonstrated
that the solution space to the parameter estimation problem contains several local minima
and that these minima almost all have the same value. This can be a problem with big
consequences for the conclusions drawn from the model. In another setting, however, it may
not be significant. Imagine a minimization problem of the same sort where each parameter
controls a part of a production line and the goal is to optimize the production output. If
the parameters of this model are fitted and several solutions with the same value are found, one may choose equally among all solutions, since they all translate to the same production output. What the parameter configurations that produce these high-efficiency production lines mean, or how they should be interpreted, is not of importance. However, in the context of the
data in this project where the objective is to compare parameters from two groups and where
each parameter has a behavioral interpretation, different parameter combinations translate
into different behavior and it is impossible to say which are the ‘true’ parameters for the
mice given that data. What is then the problem, the data or the model? Obviously the
experimental paradigm used in the experiments may have been imperfect. This could for
example mean that what seems like an independent choice of the mouse may in fact have
been strongly influenced by other factors not incorporated in the model. On the other hand,
the model itself could also be too simple. Concrete examples will be discussed in the following
sections.
4.2 Uncertainty
In reinforcement learning algorithms it is often beneficial to vary the learning rate α depend-
ing on the uncertainty in the system. Imagine a simple stochastic one-arm bandit with a
certain probability of yielding a reward. After several plays, a players’ value function for
the bandit will converge to the true probability of yielding a reward. We can say that the
player becomes more and more certain of the value he is attributing to the bandit. Recall
that the learning rate α determines to what degree new information is taken into account.
In the beginning the player is very uncertain and any new information is precious so α is
high. But as time goes on, the player becomes sure about his estimate and even if the bandit
once in a while behaves differently (it is stochastic), the player will not change his estimate
considerably, hence α is low. In machine learning, letting the learning rate vary can produce much faster learning than a constant learning rate. In the model used for the modeling in this thesis (15), the learning rates α_V and α_S are kept constant. This obviously fails to incorporate the above idea, and it is possible that letting the learning rate vary would improve the model. Then arises the question of how to quantify the uncertainty of the mice, which is not a trivial problem. Work has been done along this line of thought [O'Reilly, 2013].
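As a purely illustrative sketch (this is neither the model used in this thesis nor the approach of O'Reilly [2013]), one simple way to let the learning rate shrink as the estimate becomes more certain is a 1/n sample-average rule:

% Purely illustrative: a learning rate that decreases as more observations
% arrive (a simple 1/n sample-average rule). Not the model used in this thesis.
p_reward = 0.3;            % example reward probability of a one-armed bandit
v = 0; n = 0;              % value estimate and number of observations
for t = 1:1000
    r = double(rand() < p_reward);
    n = n + 1;
    alpha = 1 / n;         % learning rate shrinks with experience
    v = v + alpha * (r - v);
end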
4.3 Discounting
In Section 2.2 we derived the prediction error (11) to be
δ(t) = r_t + γ v_{t+1} − v_t    (19)

where γ is the discounting factor. In the model used for the modeling, the prediction error is

δ(t) = r_t − v_t    (20)
so it can be seen as a special case of (19) where γ = 0. Let us recall where this discounting
took place. It came from the notion of the total future reward from time t onwards,

Σ_{τ=0}^{T−t} γ^τ r_{t+τ}    (21)

where a discounting factor close to 0 means that the actor is mostly concerned with maximizing immediate rewards. If γ = 0, then the only term in (21) which is different from 0 is the
reward at time t, r_t. In other words, we assume that the mouse is extremely short-sighted. It is questionable whether this is a valid assumption. Evidence exists that rats do have an
expectation about a future reward even if they have to work to get there [Howe et al., 2013].
This conclusion was made by observing the dopamine level in rat brains as they navigated
a maze task with a reward at the end. As the rat came closer to the goal, dopamine levels
rose even though it had not encountered the reward yet. This suggests that rats, at least, do not discount future rewards completely, as is assumed in the model that has been used in this thesis, and it is something worth looking into.
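As a small worked example of (21), with an invented reward sequence, the difference between a myopic and a far-sighted valuation can be computed directly:

% Worked example of the discounted return (21) for an invented reward sequence.
r = [0 0 0 1 1 1];                     % future rewards from time t onward
tau = 0:numel(r)-1;
G_myopic     = sum(0.0 .^ tau .* r);   % gamma = 0: only the immediate reward counts (= 0 here)
G_farsighted = sum(0.9 .^ tau .* r);   % gamma = 0.9: delayed rewards still contribute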
5 Conclusion
The important experiments by Schultz et al. offered an insight into the activity on a cellular
level during learning. The phasic dopamine signals observed when an unexpected reward is
encountered represent the prediction error, that is the difference between what the learner
expects and what it actually finds. The experiments are important because they offer the
ability to easily quantify learning as it takes place in the brain. The discovery that the
prediction error from a group of algorithms in reinforcement learning can in fact predict this
neuronal behavior paved the road for much investigation. More specifically, the temporal
difference algorithm has since been used to model behavioral data. Montague et al. [1996]
offers an interpretation of the various elements of the model in relation to the brain, and
although simplified it strengthens the connection between these two areas of research.
In this thesis, behavioral data of normal and genetically modified mice has been used as a target for modeling using the aforementioned method. A set of parameter estimates was
obtained by maximizing the likelihood function as implemented in Algorithm 1. Although
a solution was found, it turned out that the solution was far from unique. This points to
a central problem in modeling, where even a large change in parameter values does not
change the solution value much. It poses a problem when the parameters have physiological
interpretations and are thought to control behavior, because we cannot say which predicted
behavior is the true one.
It is likely that the model used has limitations that prevent it from including important
parameters in its explanation of behavior, but as stated in the introduction it has not been
the purpose of this thesis to use the latest and most complex model, but rather to illustrate this connection both theoretically and in its practical application.
It has become clear that although a simple relationship is discovered (here between
dopamine neuron activity and behavior), the modeling of this relationship easily becomes
complex and leaves room for much interpretation by the investigator. Even so, it is a methodology that offers much insight into one of the most complex dynamic systems we know of, the brain, and it will no doubt be the subject of much research to come.
References
M. F. Bear, B. W. Connors, and M. A. Paradiso. Neuroscience, volume 2. Lippincott
Williams & Wilkins, 2007.
J. A. Beeler, N. Daw, C. R. Frazier, and X. Zhuang. Tonic dopamine modulates exploitation
of reward learning. Frontiers in behavioral neuroscience, 4, 2010.
N. D. Daw and K. Doya. The computational neurobiology of learning and reward. Current
opinion in neurobiology, 16(2):199–204, 2006.
P. Dayan and L. Abbott. Theoretical neuroscience: Computational and mathematical modeling of neural systems. The MIT Press, 2001.
D. L. Felten, R. F. Józefowicz, and F. H. Netter. Netter's Atlas of Human Neuroscience. Icon Learning Systems, 2003.
M. W. Howe, P. L. Tierney, S. G. Sandberg, P. E. Phillips, and A. M. Graybiel. Prolonged
dopamine signalling in striatum signals proximity and value of distant rewards. Nature,
500(7464):575–579, 2013.
P. R. Montague, P. Dayan, and T. J. Sejnowski. A framework for mesencephalic dopamine
systems based on predictive hebbian learning. The Journal of neuroscience, 16(5):1936–
1947, 1996.
J. Olds and P. Milner. Positive reinforcement produced by electrical stimulation of septal
area and other regions of rat brain. Journal of comparative and physiological psychology,
47(6):419, 1954.
J. X. O'Reilly. Making predictions in a changing world—inference, uncertainty, and learning. Frontiers in neuroscience, 7, 2013.
R. A. Rescorla, A. R. Wagner, et al. A theory of pavlovian conditioning: Variations in the
effectiveness of reinforcement and nonreinforcement. Classical conditioning II: Current
research and theory, 2:64–99, 1972.
W. Schultz. Predictive reward signal of dopamine neurons. Journal of neurophysiology, 80
(1):1–27, 1998.
W. Schultz. Getting formal with dopamine and reward. Neuron, 36(2):241–263, 2002.
W. Schultz and R. Romo. Dopamine neurons of the monkey midbrain: contingencies of
responses to stimuli eliciting immediate behavioral reactions. J. Neurophysiol, 63(3):607–
24, 1990.
W. Schultz, P. Apicella, and T. Ljungberg. Responses of monkey dopamine neurons to reward
and conditioned stimuli during successive steps of learning a delayed response task. The
Journal of Neuroscience, 13(3):900–913, 1993.
R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 1998.
U. Ungerstedt. Adipsia and aphagia after 6-hydroxydopamine induced degeneration of the
nigro-striatal dopamine system. Acta Physiologica Scandinavica. Supplementum, 367:95–
122, 1970.
W. R. Uttal. The new phrenology: The limits of localizing cognitive processes in the brain.
The MIT Press, 2001.
R. A. Wise. Dopamine, learning and motivation. Nature reviews neuroscience, 5(6):483–494,
2004.
R. A. Wise and H. V. Schwartz. Pimozide attenuates acquisition of lever-pressing for food
in rats. Pharmacology Biochemistry and Behavior, 15(4):655–656, 1981.
Appendices
Likelihood function implementation in MATLAB
function [Likelihood] = GetValueFunctions_optimize(params, Choice, Reward)
% Negative log-likelihood of a choice sequence under the model in (15).
% Choice is the choice vector c (elements 1/-1), Reward the reward vector r (0/1).
betaV  = params(1);
alphaV = params(2);
alphaS = params(3);
betaS  = params(4);
beta1  = params(5);
betac  = params(6);
V  = zeros(2, length(Choice));    % long-term value estimates for the two levers
S  = zeros(2, length(Choice));    % short-term value estimates
TD = zeros(2, length(Choice));    % prediction errors (stored for inspection only)
P  = zeros(2, length(Choice)-1);  % choice probabilities
Likelihood = 0;
for K = 2:length(Choice)
    cur_choice = Choice(K);
    if cur_choice == 1
        RK = [Reward(K); 0];
        cur_choice_index = 1;
    else
        RK = [0; Reward(K)];
        cur_choice_index = 2;
    end
    % Value updates as in Algorithm 1
    V(:,K) = V(:,K-1) + alphaV*(RK - V(:,K-1));
    % Note: the original listing used V(:,K-1) in the S update; S(:,K-1)
    % follows the short-term value function defined in (15).
    S(:,K) = S(:,K-1) + alphaS*(RK - S(:,K-1));
    TD(:,K) = RK - V(:,K-1);
    % Choice probability (15); the end of this line was truncated in the
    % source, and the betaS term is reconstructed from the model definition.
    P(1,K-1) = 1 / (1 + exp(-(betaV*(V(1,K) - V(2,K)) + beta1 + ...
        betac*Choice(K-1) + betaS*(S(1,K) - S(2,K)))));
    P(2,K-1) = 1 - P(1,K-1);
    % Accumulate the negative log of the probability of the observed choice
    Likelihood = Likelihood - log(P(cur_choice_index, K-1));
end
end
The optimization procedure which calls the likelihood function above
load matlab_mouse_struct;                        % struct array MOUSE with the raw event data
options = optimoptions('fmincon', 'MaxFunEvals', 1000, 'UseParallel', 'Always');
beeler = [39.9 0.042 0.342 -8.185 0.461 3.082];  % start values: estimates from Beeler et al. [2010]
sol      = zeros(20,6);
fval     = zeros(20,1);
exitflag = zeros(20,1);
for i = 1:20
    mus = MOUSE(i);
    mus.File                                     % display which data file is being fitted
    % Build the choice vector (1 = left lever, -1 = right lever)
    Choice = double(mus.LeftLeverPress | mus.RightLeverPress);
    Choice(mus.RightLeverPress) = -1;
    % Build the reward vector: 1 if the press at that time step delivered food
    Reward = zeros(size(Choice));
    Rewardindx = find(mus.reward) - 1;
    Reward(Rewardindx) = 1;
    % Keep only the time steps where a lever was actually pressed
    keepindx = Choice ~= 0;
    Reward = Reward(keepindx);
    Choice = Choice(keepindx);
    % Fit the six parameters by minimizing the negative log-likelihood
    % (the closing parenthesis of the objective handle is restored here)
    [sol(i,:), fval(i), exitflag(i)] = fmincon(@(x) GetValueFunctions_optimize(x, Choice, Reward), ...
        beeler, [], [], [], [], [0 0 0 -25 -1 0], [50 2 2 0 5 5], [], options);
end
23

Calculation of Reward Prediction in Behaving Animals

  • 1.                               Bachelor’s Thesis Emil Østergård Johansen Calculation of Reward Prediction in Be- having Animals Supervisors: Christian Igel & Jakob Dreyer January 4, 2015
  • 2. Contents 1 Introduction 2 2 Background 3 2.1 Dopamine reward system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Reinforcement Learning & Prediciton Error . . . . . . . . . . . . . . . . . . . 8 2.3 Model of learning in the brain . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3 Modeling 13 3.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Parameter estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 4 Discussion 17 4.1 Local minima . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 4.2 Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.3 Discounting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 5 Conclusion 19 Appendices 22 Abstract The theory of reinforcement learning can be applied to modeling learning processes in humans and animals. The neurotransmitter dopamine is supposed to play a central role in the reward system, encoding the temporal difference error. This thesis reviews the theory of reinforcement learning along with the dopamine reward system in the brain. A temporal difference learning algorithm is used to model behavioral data of wild type and dopamine transporter knockdown (DATkd) mice. The parameters of the model are adjusted using data-driven optimization. In the literature, significant behavioral differences between wild type and DATKd mice were explained by differences in the temperature parameter of the model. However, the experiments in this thesis showed that there are several local minima in the parameter space, which allow for alternative explanations of the observed behavioral differences. 1 Introduction How does the brain work? Understanding this fundamental question is the main objective in neuroscience and many other disciplines. Among the functions of the brain, learning remains an area of key interest. Behavioral psychology has paved the road for a massive body of knowledge to which famous names such as B.F. Skinner (the Skinner box) and Ivan Pavlov (classical conditioning) have contributed. While behavior is readily observed and can be quite informative as to hypothesize about the inner mechanics of learning in the brain, neurophysiology, that is the branch of physiology concerned with the functioning of the brain, offers a detailed look of what ‘makes it all tick’. The brain consists of a myriad of living cells among which the most important are neurons (colloquially known as brain cells) and glia cells. The intriguing property of this mass of cells is that it is highly organized. Neurons communicate with one another through cable- like projections and each neuron can be communicating simultaneously with upwards of 2
  • 3. 10,000 other neurons. Connections are strengthened and weakened dynamically over time, but the gross organization appears static. This is why it is possible for neurosurgeons to perform complicated operations with a high degree of confidence that they can predict the consequences, and it is also the reason why researchers can investigate the physiology of specific structures of the brain with 10−6m precision. The discovery of the neurotransmitter dopamine and consequently networks in brain where dopamine plays an central role has been kick starting intense research that shows how dopamine is responsible for much of the behavior observed by psychologists. A series of experiments on monkeys [Schultz, 1998] let researchers directly monitor the activity of dopamine neurons during learning tasks and a clear correlation was observed. While neuroscientists have investigated the inner workings of the brain, computer sci- entists have tried to teach machines to learn and while doing so have developed a group of algorithms under the common term, ‘reinforcement learning’. A reoccurring parameter in these algorithms, namely the prediction error, was surprisingly observed to imitate the behavior of the dopamine neurons monitored in the experiments by Schultz [1998] and since then reinforcement learning algorithms have been used to model behavioral data. The work by Beeler et al. [2010] is an example of combining neuroscience with reinforce- ment learning. They investigate the background activity of dopamine neurons’ effect on observed behavior by presenting two groups of mice with the same learning task. One group has been genetically modified to have increased background activity of dopamine neurons and they use a reinforcement learning model to compare various parameters between the two groups. It is the objective of this thesis to describe the background theory of learning in the context of both the dopamine system and of reinforcement learning. The data obtained by Beeler et al. [2010] will then serve as a framework for applying this knowledge to an actual modeling problem and stimulate a discussion of these methods. There is a large amount of literature on the subject of modeling behavioral data and it is not within the scope of this thesis to review the newest findings nor implement them. The background review is based mostly on classical references. 2 Background 2.1 Dopamine reward system Various theories about the organization of the brain have been proposed throughout the centuries. In the 18th century Franz Joseph Gall founded the school of phrenology which claimed that various bumps on the external surface of the cranium would predict various psychological traits about the person being examined. When focus shifted to the brain, this idea paved the road for the functional specialization theory, namely that each cognitive function and psychological trait corresponded to a bounded area in the brain. Lesion case studies, where the subject of study were patients with a brain tumor, internal bleeding or other kinds of local damage to an area of the brain, produced many interesting results that appeared to prove this theory. The most famous of these patients was Phileas Cage who miraculously survived after having an iron rod driven through his skull, destroying his left frontal lobe. Cage survived with most cognitive skills intact but with a personality change that rendered him unrecognizable to his friends and family. 
This suggested that the frontal lobes are somehow responsible for personality. The famous brocca’s area in the inferior (lower) part of the frontal lobe which supposedly is responsible for language, was similarly discovered by observing that patients with damage to this area all appeared to have trouble with speech production. Other theories are of course also being explored. The opposite of functional specialization is the theory of distributive processing where cognitive functions are assumed to be a network 3
  • 4. Figure 1: Publications on various neurotransmitters throughout the years. Source: Google Scholar. within the brain not bound to a specific area [Uttal, 2001]. Whatever the organization of function, pathways which consists of the projections of axons from a collection of neurons in one part of the brain to another part, has been thor- oughly documented and a group of these pathways are of special interest when investigating motivation and reward. Dopamine was discovered as an independent neurotransmitter by Carlson et al (1957) at Lund University and the research on dopamine has been increasing ever since (Figure 1). Dopamine was first associated with motor function [Wise, 2004], but later also with motiva- tional behavior first described in Ungerstedt [1970] where feeding and drinking deficits were observed after inducing damage to the mesolimpic dopamine system. In Wise and Schwartz [1981] they showed that by injecting the dopamine D2 receptor blocker pimozide, they ob- served that rats did not learn to press a lever in order to obtain food and water, whereas rats who were not injected learned this task. This further suggests that dopamine plays a key role in modulation of motivation and reward-based learning. The dopaminergic system in the brain consists of several pathways that together span most of the brain but all emanate from the mid brain (Figure 2). The most important path- way in regard to reward learning is the mesolimpic pathway which originates from dopamin- ergic neurons in the ventral tegmental area (VTA) in the mid brain that project its axons to especially the cortex. In a series of groundbreaking experiments, starting with Olds and Milner [1954], a rat was placed in a box with a lever which upon pressing delivered electrical stimulation to the rats’ brain through an implanted electrode (see Figure 3). In this set-up the rat would first step on lever by accident and then retreat away from the lever. However, it would soon come back and press again, and as time went by it would spend most of its time pressing the lever. In extreme cases rats were observed to press the lever continuously until fainting from exhaustion. These electrical self-stimulation experiments were repeated, each time changing the position of the stimulation electrode in the rat brain, and the structure identified as producing the strongest behavioral response was the ventral tegmental area (VTA). Blocking dopamine receptors reduced the degree of self-stimulation so the conclusion was that the rats worked to release dopamine, and thus dopamine served as a reward that reinforced the behavior that caused the release of it. Wise and colleagues have carried out numerous self stimulation experiments throughout 4
  • 5. Figure 2: Dopaminergic System Overview (adapted from Felten et al. [2003]) the years and located many of the structures involved in motivation and reward by noting that when they observed a positive result (boosted reward experience of the animal seen by repeated self stimulation), the stimulation electrode was located either in close proximity to axons of dopaminergic neurons or to axons that terminated at dopminergic neurons [Schultz, 2002]. The important experiments which gave rise to the data essential for this thesis were done by Schultz and his colleagues (various articles 1990-1994) on behaving monkeys. In the experiments firing patterns of dopaminergic neurons were measured in vivo (while the monkeys were alive) as the monkeys performed various learning tasks. They were placed in front of a box which was covered to prevent vision of the contents of the box. Electrodes were placed near a population of dopamine neurons in the monkeys’ brain and they then let the monkey insert its hand into the box which could contain either nothing, a reward (apple) or a neutral object (a string). Figure 4A shows a transient increase in dopaminergic neuron firing when the hand of the primate touches an apple (upper figure) inside the box and an absence of this firing when there is no apple (lower figure). Control experiments were performed (Figure 4B and C) which confirmed that: 1. The qualitative observation was indeed due to touching an apple (a reward) and not just any object (a string was used in place of an apple). 2. The same result was not due to movement of the arm alone. The more interesting result, however, is found in Figure 4E where the movement of reaching for the apple inside the box was initiated by a stimuli which in this case is the opening of the door to the box. In the control trial (Figure 4D) the dopaminergic neurons exhibit a transient increase in firing rate, as we would expect by now, when the hand reaches the reward. The movement here is self-initiated. When the movement is initiated by the door-opening stimulus, the neural response is to start with observed when the monkey reaches the reward, similarly to the control. However, after learning the association between stimulus and the possibility of finding a reward, the increase in firing rate is observed at the time of the stimulus and not at the time of discovery of the reward. This observation was replicated in Ljungberg et al. (1992), also with monkeys. It was concluded that this transferring of neuronal response happens over the course of learning. This is an impressive direct view of 5
  • 6. Figure 3: Self stimulation experimental setup (adapted from Bear et al. [2007]). a change on the cellular level in the brain that happens during learning. Let us look at not the transferring of neuronal activity but just the decrease of activation at the time of reward. How is this decrease in phasic activation at the time of reward to be interpreted? An obvious interpretation is that the animals develop an insensitivity to the rewards and thus less activation in the dopamine neurons is observed. This was however falsified in [Mirenowicz and Schultz, 1994]. Rather it seems that the degree of activation somehow depicts the uncertainty or unpredictability [Schultz, 2002] of the reward. In the first trials, when the reward is unexpected, the activation of dopamine neurons is at its highest. Then slowly over the course of learning the activity decreases until it is not present at all. This can be interpreted as a form of learning or in other words, a steady increase in the certainty that the reward will be presented. When the reward, against expectations, is on purpose being withheld or delayed, a sig- nificant drop in dopamine neuron activity is observed [Schultz et al., 1993] which seems like a natural extension of the unpredictability theory. The drop in activity occurs exactly at the time where the reward is usually delivered which suggests a sensitivity to the timing of the reward in addition to the occurrence. The important theme in these individual neuron experiments by Schultz and others is the unpredictability of reward, as it has come to be termed, in both timing and occurrence. It is becoming evident that the phasic activity (as opposed to the constant tonic firing) of dopamine neurons can be interpreted to be pro- portional to the error in the animals’ prediction of the reward: When the expectation of a reward is low or non-existent (as it would be prior to any learning task), a large amount of activity is observed when a reward is presented - contrary to expectation. However, over time as the animal becomes surer that the reward follows the stimuli, the unpredictability is decreasing and the prediction error also decreasing, thus a smaller neuronal activity is ob- served. The term ‘prediction error’ is important because it is a corner stone of reinforcement learning algorithms and it is in this quantity that the connection was made between the results discussed here and the reinforcement learning theory most note-worthily described and developed in Sutton and Barto [1998]. 6
  • 7. Figure 4: Mensencephalic dopamine neuron firing response to experiment. From [Schultz and Romo, 1990] Figure 5: Transferring of predictive firing. From [Schultz, 1998] 7
  • 8. 2.2 Reinforcement Learning & Prediciton Error The prediction error as introduced in the previous section, is a central theme in modern conditioning theories [Rescorla et al., 1972, Schultz, 2002]. In these theories it is suggested that in order for learning to take place, the prediction error must be different from zero. A positive or negative prediction error will lead to the acquisition or extinction or behavior. To see why this is so, imagine a dog being presented with a stimulus such as the ring of a bell. In all likelihood the dog predicts that nothing will happen after the bell has sounded. And if nothing happens, the dog will continue to expect nothing to happen the next time the bell rings. But if something, contrary to the dog’s prediction, does happens, there is a prediction error between the dog’s prediction and what actually happened and as a result the dog changes its prediction for the next time the bell rings or in other words, it learns. This theory is pleasing because it fits well with our intuitive understanding of learning and the classic Pavlovian dog experiments. It should by now become more and more clear that the prediction error observed in the single dopamine neuron experiments by Schultz and the prediction error observed in learning tasks in classical conditioning (such as the dog example above), is the same and it is this subtle point that provides the main motivation for both the data set used in this thesis and the model which will be introduced later. First, however, I think it is appropriate to clarify some terms. In conditioning we distinguish between two main fields of study: Classical and instrumental or operant conditioning. The main difference being that the actor in classical conditioning does not have any influence on whether he/she is presented with a reward or not - in other words it is completely automatic. In instrumental or operant conditioning, the opposite is true. Here the actor’s interaction with the environment can influence the chance of whether or not a reward is presented. A system where the actor is presented with positive rewards and negative rewards (punishments) which provide the basis for learning, is what is known as a reinforcement learning system which is studied widely in both psychology and computer science for machine learning. Reinforcement is here taken to mean the strengthening of stimulus-stimulus, stimulus-response, or stimulus-reward associations that result from the timely presentation of a reward [Wise, 2004]. Many results from psychological classical conditioning experiments can be predicted by a simple learning rule which is one of the classical references in the field, namely the The Rescorla-Wagner rule [Rescorla et al., 1972]. It offers a simple linear relationship for pre- dicting reward. 
If we denote v as the expected reward, u as a binary variable representing the presence or absence of a stimulus and w as a weight associated with the stimulus, then the expected reward is predicted as v = wu (1) The learning takes place in updating the value of the weight w with an update rule that is designed to minimize the prediction error, that is the squared error between expected reward and actual reward (r − v)2 [Dayan and Abbott, 2001] and this is what is known as the Rescorla-Wagner rule w = w + δu (2) where • δ = r − v is the prediction error • is a learning rate determining how fast the learner incorporates new information into prior knowledge If a reward follows every time a stimulus is presented, then w will converge to 1 as the number of trials increases. If the reward is stochastic and normally distributed, w will converge to 8
  • 9. the theoretical mean of that distribution. As already mentioned this is a model of classical conditioning meaning that the actor does not interact with environment and hence has no influence over whether a reward is presented or not. It does a poor job modeling instrumental conditioning because there is no notion of time or interaction between the actor and the environment in the model. For this we need a slightly more complicated setup. Reinforcement learning is a problem of optimal control in the sense that the actor, who is learning, has some control over the environment. Actions lead to rewards or the lack thereof and the actor seeks to reap the largest amount of rewards. In order to describe this problem mathematically we need some clearly defined abstractions: • a set of states S • a set of action A • a transition function TTT • a reward function RRR • a value function VVV • a policy π For the remainder of this introduction we will use the experimental setup described in Section 3.1 as an example. This setup consists of a mouse in a cage with two levers that deliver rewards in the form of food. Obviously time is continuous in the real world, but it is for modeling purposes beneficial to discretize the world in to discrete timesteps t and states st ∈ S where st is the state at time t and S is the set of all possible states. From the mouse’s point of view, the way to transition between these states is through actions which, like states, come from a set of actions so the action at time t we write as at ∈ A. A certain action then, like pressing a lever, alters the state of the environment iterating t to t + 1 so that the new state is st+1 ∈ S. The next thing we need is a function that maps state st and action at to the next state st+1. This we call the transition function TTT : st, at −→ st+1 and this is what describes the laws that govern the interaction between the mouse and its environment. It is also termed an environment model. The environment can be deterministic in the sense that a specific action will always lead to specific state, but often it is stochastic so that TTT is a set of transition probabilities. Finally there is the reward function RRR : st, at −→ rt which maps from a state and an action to the probability of a reward. Often rt is a binary variable which represents the presence of absence of reward at time t, just like the stimulus variable u in the Rescorla- Wagner setup. With this set of abstractions we have a well defined environment S, a way for the mouse to interact with the environment through actions A and a transition function which tells us how the environment changes under each action TTT. Finally, the incentive for learning, the rewards, are also well defined by RRR. Now comes the interesting part, namely how the mouse chooses what action to take in order to maximize reward. The way we model this is to assume that the mouse keeps some value estimate for each possible action so that by choosing the action with the highest estimated value, it can maximize reward. This is called the value function VVV : at −→ vt. The mouse updates this value function based on the reward it receives, a trial and error concept. If VVV∗(a) is the true value for action a, the mouse obviously tries to achieve the following property: lim t→∞ VVVt(a) = VVV∗ (a) (3) 9
  • 10. A simple way to update VVV is inspired directly by the Rescorla-Wagner rule where the weight w is replaced with the value and the stimulus u is omitted: vt = vt + αδt (4) Here δt = (rt − vt) and has been renamed to α to keep the two update rules separate. As already suggested, the mouse chooses its actions based on the value function. The function that encapsulates this behavior of the mouse is termed its policy, π : s −→ a. The policy π is in words a function that maps from a state to an action. This will usually have the form of a simple look-up table although it can be imagined to take more complex forms. The process of updating value estimates and thus the policy is known as policy optimization. The relationship between the value function and the policy can vary from model to model. If the policy of the mouse is designed so that it always chooses the action which has the highest value estimate (highest expected value), then this is known as the greedy choice and it is the one of two extremes. The other extreme is to completely disregard the value estimates, and assign equal probabilities to choosing each action. Many different strategies have been proposed. One easy to understand is the so-called -greedy method (not be confused with the learning rate from the Rescorla-Wagner rule). Here is a number between 0 and 1 which defines the probability of choosing the greedy action versus choosing equally among all other action. Another approach is to assign weighted probabilities according to the value estimates V (a) for each action. A common implementation of this idea is to use the Gibbs distribution, given as eVVV(a)/β k b=0 eVVV(b)/β , (5) to calculate the probability weights [Sutton and Barto, 1998]. This is also known as the softmax selection rule and it chooses stochastically among all actions but with higher value actions weighted higher, more specifically it computes the probability P(at = a). This model is used in the data modeling of this project. The fact that the value estimations converge as in (3) is less interesting than how fast they converge. That is, not only does the mouse want to optimize its policy to yield the largest value on its future actions, but it also wants to optimize the rewards it gets during learning, as it can’t go on learning forever. This dilemma is known as the exploit-explore problem and it is central to reinforcement learning. The following is based on the introduction to temporal difference learning in Dayan and Abbott [2001]. As suggested above we interpret the value function VVV as the estimated value of an action. To elaborate, what we really mean is that at time t, VVV(t) is the total future reward expected from time t and onward to the end of the trial: T−t τ=0 rt+τ (6) Usually future rewards are discounted which means that future rewards are assigned a lower value than immediate rewards. This is intuitive in the sense that we can imagine an actor to have more incentive to obtain a reward immediately rather than obtaining the same reward a year later. This is modelled with a discounting factor γ where 0 ≤ γ ≤ 1. If γ = 0 we say that the actor is myopic [Sutton and Barto, 1998] or short-sighted meaning that it does not take into account future rewards but is only interested in maximizing immediate rewards. As γ approaches 1, the actor becomes more and more far-sighted. So we rewrite (6) as T−t τ=0 γτ rt+τ (7) 10
  • 11. In this setting, the prediction error δ(t) must then be the difference in actual future rewards and expected total future rewards δ(t) = T−t τ=0 γτ rt+τ − vt (8) This, however, is impossible to compute at time t because we do not yet know the actual future rewards. This is solved by noting that the first term in (8) can be rewritten as T−t τ=0 γτ rt+τ = rt + T−t−1 τ=0 γτ+1 rt+1+τ = γ T−t−1 τ=0 γτ rt+1+τ (9) and as we just discussed, vt is an estimate of the total future reward (the latter term) γ T−t−1 τ=0 γτ rt+1+τ ≈ γ vt+1 (10) so the prediction error becomes δ(t) = rt + γ vt+1 − vt (11) and we insert this in (4) and get the update rule vt = vt + α(rt + γ vt+1 − vt) (12) The difference vt+1 − vt gives the name to this method, the temporal difference method. At a first glance it looks like we again need information about the future (vt+1) to compute the prediction error. One must remember, however, that vt designates the value of the action taken at time t and similarly, as soon as we know the action at+1 which is computed with e.g. the softmax selection rule, we know the value of the value estimate vt+1. This method is called a bootstrapping method [Sutton and Barto, 1998] because it uses an estimate for estimation. We now have the complete framework for classical reinforcement learning in order. We have defined the environment and the actors interaction with this environment through sets of states S and actions A and we have defined how the actor selects its actions based on the value estimates it holds about each action (its policy). Finally we have shown how the actor can update its knowledge by observing the rewards it receives (12). 2.3 Model of learning in the brain In this section I want to formalize the connection between the dopamine system and rein- forcement learning a bit further. Although the idea that the brain uses the prediction error for learning is intuitive, a well defined model of what the various functions in reinforcement learning represent has been proposed [Montague et al., 1996] and is depicted in Figure 6. In this model, information about the task to be learned is stored in the cortex which con- sists of different modalities. When some task is engaged, the cortex outputs its information about this task to an intermediate layer which represents possible information processing in other parts of the dopamine system (See Figure 2) . This information can be both excita- tory (encouraging) or inhibitory (discouraging). For example some modality of the cortex can discourage the task because it is learned to be dangerous while another modality of the cortex can encourage the task because it often yields food and the actor is hungry. All these outputs are weighted individually and summed in our familiar value function VVV(t). In the figure ˙VVV(t) represents a temporal difference which we recognize as VVV(t + 1) − VVV(t) (see 11
  • 12. section 2.2). The information is then passed downwards to a group of dopamine neurons PPP in for example the ventral tegmental area which also receives external information about the presence or absence of a reward rrr(t) and other factors which may have influence over the output from PPP. It is in the dopamine neurons that the prediction error δ(t) is computed simply as the net input δ(t) = r(t) + ˙V (t) = r(t) + V (t + 1) − V (t) (13) and then output as a dopamine signal back to the cortex (again refer to Figure 2). This model is in accordance with the experiments by Schultz et al. discussed in Section 2.1 where it was observed that the dopamine signal in monkeys appeared to represent the prediction error δ(t). We also note that (13) is a special case of (11) where γ = 1. Figure 6: Model of information flow through the dopamine system. Adapted from Mon- tague et al. [1996] The model in Figure 6 explains how the prediction error and the consequent learning might be physiologically instantiated in the brain. The other part that needs an inter- pretation is the action selection policy. The softmax action selection model (5) was in- troduced in Section 2.2 and it is the pre- ferred model for action selection in reinforce- ment models as it has been observed to fit well with behavioral data [Daw and Doya, 2006]. The variable parameter of the soft- max model is the β parameter also called ‘temperature’, an analogue to the movement of molecules under different temperatures. As β −→ 0, the probability of selecting the action with the highest value approaches 1. Conversely, as β −→ ∞ the probability is spread out so that all actions have the same probability of being selected. In Figure 7 a simulation of a famous instructional reinforcement learning problem is shown. The problem is that of a two-armed bandit. Each arm has its own probability of yielding a reward and it is up to the learner to balance exploitation of learned knowledge of which lever is more likely to yield a reward and exploration of the other lever in case it might actually be better. Figure 7 shows a game that lasts 2000 plays. At play 400, the reward yielding probability of the levers are switched in order to see how fast the model is able to adapt to this. The result has been averaged over 2000 episodes to smooth out the curves sufficiently. The choice of which arm to pull was made with the softmax action selection rule and the result for four different values of β is shown. As is expected, the most conservative player with β = 0.01 realizes the change the latest and is slowest at adapting. Similarly the player with β = 0.5 is the quickest at realizing the change and adapting. There are many qualitative attributes that are interesting to investigate (e.g. it seems that a medium value is better in the long run, which it is), but overall it serves as a nice illustration of how β affects the ability to learn and adapt. Although there are hypothesis as to how the brain controls this, it is still an open question [Beeler et al., 2010, Daw and Doya, 2006]. 12
  • 13. Figure 7: Two-armed bandit simulation with different temperature values. 3 Modeling 3.1 Experimental Setup The data used for modeling in this thesis is the same data used in Beeler et al. [2010]. The main objective of the original experiment was to investigate whether tonic dopamine modulates learning. This was done by comparing the performance of dopamine transporter knockdown (DATkd) mice against wild type (normal) mice (n = 10 for both groups). A home cage operant paradigm was used in the experimental setup, which means that mice earn their food entirely through lever pressing. Each mouse lives in its own cage for 10 consecutive days where water is freely available, but food can only be obtained through lever pressing. In each cage are two levers which has a probability of yielding food upon pressing. When the probabilities of food release for each lever is not equal, one lever has a lower probability of food release than the other making it ‘cheaper’ in terms of the amount of work required to gain a reward. Which lever is the cheaper changes every 20-40 minutes thus constantly offering a learning opportunity for the mouse. The data is recorded as event codes and accompanying time codes for each event with a temporal resolution of 100 ms. Events are e.g. ‘left lever pressed’, ‘reward delivered’, ‘lever probabilities interchanged’ etc. For the parameter estimation, the data has been converted into two vectors: A binary choice vector ccc where an element ct is 1 or -1 depending on which lever is pressed. ccc contains the entire choice sequence for a single mouse concatenated over ten days which is about 50000 events. The other vector is a binary reward vector rrr of the same length as ccc. An element rt is 1 if a reward was delivered when lever press ct was performed and 0 otherwise. All temporal information (the time codes in the original data set) has been removed since the model only predicts what choices are made in which order, not when they are made. 3.2 The Model 13
  • 14. Figure 8 In the original work by Beeler et al. [2010] they fit two models to the behavioral data described in the previous section. One is a general logistic regression model with the choice variable c as the dependent variable and the 100 previous rewards rt−100...t−1 for each t as the explanatory variables. The re- sult of this model is shown in Figure 8 where left on the horizontal axis corresponds to the most recent rewards. A positive coeffi- cient value indicates that a reward tends to promote staying on the same lever whereas a negative value indicates switching to the other lever. Intuitively we would expect that the most recent rewards to promote staying on the same lever more strongly than re- wards in the past due to e.g. memory. This holds true for the figure shown here except for the very most recent rewards where coefficient values tend to −1. This means that the mice tend to switch to the other lever one or two time steps after receiving a reward but in the longer run they follow a familiar learning pattern. A simple softmax action selection model as in (5) will not predict this behavior but rather that a reward received immediately prior to the current time step will promote staying on the same lever the strongest. So this model has to be extended. The simple softmax model for two possible actions (levers) take the following form: P(ct = 1) = eVt(1)/β eVt(1)/β + eVt(−1)/β = 1 1 + e−(Vt(1)−Vt(−1))/β = σ(βV (Vt(1) − Vt(−1))) (14) where σ(z) = 1 1+e−z is the logistic function and βV is the inverse temperature. Whether one use the temperature or its inverse is simply a matter of convention and since they use its inverse in the original article, we are adapting this notation here. Similarly for P(ct = −1). The model used in Beeler et al. [2010] is as follows: P(ct = 1) = σ(βV [Vt(1) − Vt(−1)] + β1 + βcct−1 + βS[St(1) − St(−1)]) (15) where • βV is the inverse temperature for V • Vt = Vt−1 + αV (rt−1 − Vt−1) is the value function • β1 is the bias term towards one lever • βc is the bias towards the previously pressed lever • βS is the inverse temperature for S • St = St−1 + αS(rt−1 − St−1) is the short-term value function Apart form the simple bias terms, it as been augmented from (14) to include an additional value function S which will capture the short term effects observed in the logistic regression result (Figure 8). By forcing the parameter βS to be negative we make sure that the last term has the desired effect. There are 6 degrees of freedom in this model, namely the parameters βV , αV , β1, βc, βS and αS, so given a value for each parameter we are able to calculate the action probabilities at time t. In symbols this probability is written as P(ct = 1 | βV , αV , β1, βc, βS, αS) (16) 14
  • 15. 3.3 Parameter estimation In this section, the method for parameter estimation is established. As we just saw, given any six parameter values, we can compute the probability of the choice at time t with (15). In fact we not only get the probability for the choice at time t, but since the entire choice sequence is defined recursively depending solely on the parameters and on the previous values, we get the whole choice sequence. So by finding good estimates of the parameters we mean finding parameters that will produce data that is, to the highest degree possible, like the training data which is the actual choice sequence of a mouse. So if we, given parameter values, can compute the probability that a single choice is equal some actual choice (ct = 1 or ct = −1) as P(ct | βV , αV , β1, βc, βS, αS) = P(ct = 1) : ct = 1 1 − P(ct = 1) : ct = −1 where P(ct = 1) is given in (15) and if we assume that each choice is independent of all other choices, we can compute the probility of an entire choice sequence as the product L(βV , αV , β1, βc, βS, αS | c1, c2, ..., cT ) = T t=1 P(ct) (17) Note that we here assume the choice sequence to be given and let the parameters vary. This is what is known as the likelihood function and by maximizing this function we can get the parameters that best fit the actual data. For several reasons it is more practical to convert the product into a sum by taking the logarithm so that the likelihood function becomes log(L(βV , αV , β1, βc, βS, αS | c1, c2, ..., cT )) = T t=1 log(P(ct)) (18) Since the logarithm is a monotone transformation, it will have the same extrema as the transformed function. Furthermore it is also common to use the negative of the log likelihood which means the likelihood function should be minimized rather than maximized. The likelihood computation is outlined in Algorithm 1. This procedure returns the like- lihood for a set of parameters. Since the choice probabilities depend on the value functions, and the value functions are recursively defined, the likelihood has to be computed iteratively. The value functions V and S are both initialized at 0. Experimentation showed that the results did not differ significantly except for the first few time steps. After implementing this procedure in MATLAB, the native optimizer ‘fmincon’ was used to obtain the results presented in the following section. 15
  • 16. input : βV , αV , αS, βS, β1, βc output: Likelihood (scalar) // Initialize value functions and likelihood V1 = [0, 0] ; S1 = [0, 0] ; Likelihood = 0; for t = 2 to end of choice sequence do ct = action chosen at time t; cother = action not chosen at time t; // Update value functions // For the chosen action Vt,ct = Vt−1,ct + αV (|rt−1| − Vt−1,ct ); // and for the not chosen action Vt,cother = Vt−1,cother + αV (0 − Vt−1,cother ); // Same for the S-values St,ct = St−1,ct + αS(|rt−1| − St−1,ct ); St,cother = St−1,cother + αS(0 − St−1,cother ); // Update choice probability Pt,1 = σ(βV (Vt,1 − Vt,2) + β1 + βcct−1 + βS(St,1 − St,2)); Pt,2 = 1 − Pt,1 // Update likelihood with the negative log of the probability of the choice at hand Likelihood = Likelihood − log Pt,ct ; end return Likelihood Algorithm 1: Parameter estimation using maximum likelihood estimation. 3.4 Results For the fitting a choice of start values and boundaries had to be made. As for start values, seeing that the primary objective is to replicate the results obtained in Beeler et al. [2010], it appears natural to choose their parameter estimates as start values. As for boundaries, we want to make sure that the optimization procedure is searching within the same subspace of the solution space, so boundaries that are relatively tight around the target solution was selected. For the learning rates αV and αS, the predetermined boundaries between 0 and 1 was used. After running the fitting procedure, the following results were returned. Parameters Wild-type DATkd t p βV 36.80 35.02 0.482 0.636 αV 0.724 0.638 0.378 0.710 αS 0.361 0.501 0.721 0.480 βS −7.633 −6.923 0.317 0.756 β1 0.087 −0.161 0.851 0.406 βS 3.291 3.184 0.162 0.413 The parameter estimates are sample means over 10 mice each and the comparison of means was done with a two-sample t-test. The lowest p-value is 0.406 so we cannot reject the null-hypothesis that the means are equal for any of the parameters. This is in accordance with Beeler et al. [2010] except for the parameter βV which in their work was found to be 16
  • 17. Figure 9: MDS output of 100 parameter fits on two different mice. Similar colors represent similar likelihood. significantly different when comparing wild-type and DATkd mice. After several experiments it became evident that the the solution is highly dependent on the start values given. This suggests that the solution space contains many local minima even within relatively tight boundaries on the parameters. To investigate this further I performed 100 fits on two different mice with 100 random start values. This gives 100 solutions in R6. In order to visualize it, multidimensional scaling (MDS) was used to reduce the dimensions to two. MDS is a way to visualize the similarity or distance between high dimensional data points in a lower dimension as precisely as possible. The ‘error function’ of MDS is termed the strain function and by plotting the strain as a function of time (a scree plot), an acceptable number of dimensions for visualization can be determined. In this case two is appropriate. The result of the MDS analysis is shown in Figure 9. Each point is a solution to the opti- mization problem and the solutions are color-coded according to their negative log-likelihood and the hollow blue dots have the lowest value. Several qualitative traits meet the eye. Firstly, there are two clear clusters of solutions to the left in each plot and the group to the right could also be interpreted as one or two clusters. Secondly, solutions appear to be arranged vertically with very little variation in the horizontal direction. In order to interpret this, one must remember that in dimensionality reduction analysis like MDS, the axis in the lower dimension do not have a clear interpretation because the plane formed does not necessarily lie along any of the original axis. So a arrangement of solutions parallel to the vertical axis, as in this case, is not the variation of a single parameter, but rather some com- bination of variables that in this case does not change the likelihood value. This offers a nice visualization of a specific direction(s) in R6 which does not change the solution considerably. In other words, the solution hyper-plane has flat areas where many local minima exist. The results presented here highlight a central problem in mathematical modeling which will be discussed further in the following section. Although a fit is obtained and value esti- mates computed, whether these are actually meaningful for individual mice and whether they can be compared between mice requires careful consideration. When several local minima are present and especially when they have the same value, the result may be ambiguous. 4 Discussion 4.1 Local minima This subject has already been touched upon in the Results section where it was demonstrated that the solution space to the parameter estimation problem contains several local minima and that these minima almost all have the same value. This can be a problem with big 17
  • 18. consequences for the conclusions drawn from the model. In another setting, however, it may not be significant. Imagine a minimization problem of the same sort where each parameter controls a part of a production line and the goal is to optimize the production output. If the parameters of this model is fit and several solutions with the same value are found, one may choose equally among all solutions since they all translate to the same production output. What the parameter configurations that produce these high efficiency production lines mean or should be interpreted is, is not of importance. However, in the context of the data in this project where the objective is to compare parameters from two groups and where each parameter has a behavioral interpretation, different parameter combinations translate into different behavior and it is impossible to say which are the ‘true’ parameters for the mice given that data. What is then the problem, the data or the model? Obviously the experimental paradigm used in the experiments may have been imperfect. This could for example mean that what seems like an independent choice of the mouse may in fact have been strongly influenced by other factors not incorporated in the model. On the other hand, the model itself could also be too simple. Concrete examples will be discussed in the following sections. 4.2 Uncertainty In reinforcement learning algorithms it is often beneficial to vary the learning rate α depend- ing on the uncertainty in the system. Imagine a simple stochastic one-arm bandit with a certain probability of yielding a reward. After several plays, a players’ value function for the bandit will converge to the true probability of yielding a reward. We can say that the player becomes more and more certain of the value he is attributing to the bandit. Recall that the learning rate α determines to what degree new information is taken into account. In the beginning the player is very uncertain and any new information is precious so α is high. But as time goes on, the player becomes sure about his estimate and even if the bandit once in a while behaves differently (it is stochastic), the player will not change his estimate considerably, hence α is low. In machine learning, letting the learning rate vary, produces much faster learning than a constant learning rate. In the model used in the modeling in this thesis (15), the learning rates (αV and αS) are kept constant. This obviously fails to incorporate the above idea and there is the possibility that it would improve the model to let the learning rate vary. Then rises the question of how to quantify the uncertainty of the mice and this not a trivial problem. Work has been done on this train of thought [O’reilly, 2013]. 4.3 Discounting In Section 2.2 we derived the prediction error (11) to be δ(t) = rt + γ vt+1 − vt (19) where γ is the discounting factor. In the model use for modeling, the prediction error is δ(t) = rt − vt (20) so it can be seen as a special case of (19) where γ = 0. Let us recall where this discounting took place. It came from the notion of the total future reward from time t and onwards T−t τ=0 γτ rt+τ (21) where a discounting factor close to 0 means that the actor is mostly concerned with maximiz- ing immediate rewards. If γ = 0 then the only term in (21) which is different from 0 is the 18
4.3 Discounting

In Section 2.2 we derived the prediction error (11) to be

\delta(t) = r_t + \gamma v_{t+1} - v_t \qquad (19)

where γ is the discounting factor. In the model used for the modeling, the prediction error is

\delta(t) = r_t - v_t \qquad (20)

which can be seen as a special case of (19) with γ = 0. Let us recall where this discounting took place. It came from the notion of the total future reward from time t onwards,

\sum_{\tau=0}^{T-t} \gamma^{\tau} r_{t+\tau} \qquad (21)

where a discounting factor close to 0 means that the actor is mostly concerned with maximizing immediate rewards. If γ = 0, the only term in (21) that differs from 0 is the reward at time t, r_t. In other words, we assume that the mouse is extremely short-sighted. It is questionable whether this is a reasonable assumption. There is evidence that rats do have an expectation about a future reward even if they have to work to get to it [Howe et al., 2013]. This conclusion was drawn by observing the dopamine level in rat brains as the animals navigated a maze task with a reward at the end: as a rat came closer to the goal, dopamine levels rose even though it had not yet encountered the reward. This suggests that rats, at least, do not discount future rewards completely, as the model used in this thesis assumes, and it is something worth looking into.
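As a small numerical illustration with made-up numbers, consider a reward that arrives three steps in the future. With γ = 0 the discounted return in (21) ignores it completely, whereas larger values of γ give it weight:

% Discounted return of (21) for a reward three steps ahead (illustrative numbers).
r = [0 0 0 1];                              % reward only at the final step
for g = [0 0.5 0.9]
    G = sum(g.^(0:numel(r)-1) .* r);        % discounted sum of future rewards
    fprintf('gamma = %.1f  ->  return = %.3f\n', g, G);
end

A mouse with γ = 0 would thus assign no value at all to states that merely lead towards a reward, which is exactly what the rising dopamine signal reported by Howe et al. [2013] argues against.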
5 Conclusion

The important experiments by Schultz et al. offered an insight into the activity on a cellular level during learning. The phasic dopamine signals observed when an unexpected reward is encountered represent the prediction error, that is, the difference between what the learner expects and what it actually receives. The experiments are important because they make it possible to quantify learning as it takes place in the brain. The discovery that the prediction error of a class of reinforcement learning algorithms can in fact predict this neuronal behavior paved the road for much further investigation. More specifically, the temporal difference algorithm has since been used to model behavioral data. Montague et al. [1996] offer an interpretation of the various elements of the model in relation to the brain, and although simplified, it strengthens the connection between these two areas of research.

In this thesis, behavioral data of normal and genetically modified mice have been used as the target for modeling with the aforementioned method. A set of parameter estimates was obtained by maximizing the likelihood function as implemented in Algorithm 1. Although a solution was found, it turned out to be far from unique. This points to a central problem in modeling, where even a large change in parameter values hardly changes the value of the objective. It poses a problem when the parameters have physiological interpretations and are thought to control behavior, because we cannot say which predicted behavior is the true one. It is likely that the model used has limitations that prevent it from including important factors in its explanation of behavior, but as stated in the introduction, the purpose of this thesis has not been to use the latest and most complex model, but rather to illustrate the connection between reinforcement learning and the dopamine reward system, both theoretically and in practical application. It has become clear that although a simple relationship has been discovered (here between dopamine neuron activity and behavior), modeling that relationship easily becomes complex and leaves room for much interpretation by the investigator. Even so, it is a methodology that offers much insight into one of the most complex dynamic systems we know of, the brain, and it will no doubt be the subject of much research to come.

References

M. F. Bear, B. W. Connors, and M. A. Paradiso. Neuroscience, volume 2. Lippincott Williams & Wilkins, 2007.

J. A. Beeler, N. Daw, C. R. Frazier, and X. Zhuang. Tonic dopamine modulates exploitation of reward learning. Frontiers in Behavioral Neuroscience, 4, 2010.

N. D. Daw and K. Doya. The computational neurobiology of learning and reward. Current Opinion in Neurobiology, 16(2):199–204, 2006.

P. Dayan and L. Abbott. Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. The MIT Press, 2001.

D. L. Felten, R. F. Józefowicz, and F. H. Netter. Netter's Atlas of Human Neuroscience. Icon Learning Systems, 2003.

M. W. Howe, P. L. Tierney, S. G. Sandberg, P. E. Phillips, and A. M. Graybiel. Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500(7464):575–579, 2013.

P. R. Montague, P. Dayan, and T. J. Sejnowski. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience, 16(5):1936–1947, 1996.

J. Olds and P. Milner. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. Journal of Comparative and Physiological Psychology, 47(6):419, 1954.

J. X. O'Reilly. Making predictions in a changing world—inference, uncertainty, and learning. Frontiers in Neuroscience, 7, 2013.

R. A. Rescorla, A. R. Wagner, et al. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical Conditioning II: Current Research and Theory, 2:64–99, 1972.

W. Schultz. Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1):1–27, 1998.

W. Schultz. Getting formal with dopamine and reward. Neuron, 36(2):241–263, 2002.

W. Schultz and R. Romo. Dopamine neurons of the monkey midbrain: contingencies of responses to stimuli eliciting immediate behavioral reactions. Journal of Neurophysiology, 63(3):607–624, 1990.

W. Schultz, P. Apicella, and T. Ljungberg. Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. The Journal of Neuroscience, 13(3):900–913, 1993.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

U. Ungerstedt. Adipsia and aphagia after 6-hydroxydopamine induced degeneration of the nigro-striatal dopamine system. Acta Physiologica Scandinavica, Supplementum, 367:95–122, 1970.

W. R. Uttal. The New Phrenology: The Limits of Localizing Cognitive Processes in the Brain. The MIT Press, 2001.
R. A. Wise. Dopamine, learning and motivation. Nature Reviews Neuroscience, 5(6):483–494, 2004.

R. A. Wise and H. V. Schwartz. Pimozide attenuates acquisition of lever-pressing for food in rats. Pharmacology Biochemistry and Behavior, 15(4):655–656, 1981.
Appendices

Likelihood function implementation in MATLAB

function [Likelihood] = GetValueFunctions_optimize(params, Choice, Reward)
% Negative log-likelihood of a choice sequence under the value-learning model.
% params = [betaV alphaV alphaS betaS beta1 betac]
betaV  = params(1);
alphaV = params(2);
alphaS = params(3);
betaS  = params(4);
beta1  = params(5);
betac  = params(6);

V  = zeros(2, length(Choice));      % value estimates for the two levers
S  = zeros(2, length(Choice));      % second value trace
TD = zeros(2, length(Choice));      % prediction errors
P  = zeros(2, length(Choice)-1);    % choice probabilities
Likelihood = 0;

for K = 2:length(Choice)
    cur_choice = Choice(K);
    if cur_choice == 1
        RK = [Reward(K); 0];        % reward assigned to the left lever
        cur_choice_index = 1;
    else
        RK = [0; Reward(K)];        % reward assigned to the right lever
        cur_choice_index = 2;
    end
    V(:,K)  = V(:,K-1) + alphaV*(RK - V(:,K-1));
    S(:,K)  = S(:,K-1) + alphaS*(RK - V(:,K-1));
    TD(:,K) = RK - V(:,K-1);
    % Softmax probability of choosing the left lever. This line was truncated
    % in the source; only the closing parentheses have been restored, so any
    % further terms (e.g. involving betaS and S) may be missing.
    P(1, K-1) = 1 / (1 + exp(-(betaV * (V(1,K) - V(2,K)) + beta1 + betac * Choice(K-1))));
    P(2, K-1) = 1 - P(1, K-1);
    Likelihood = Likelihood - log(P(cur_choice_index, K-1));
end
end
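As a hypothetical usage example (the choice and reward vectors below are made up and not taken from the experimental data), the function can be evaluated directly at the parameter values from Beeler et al. that serve as the starting point in the optimization procedure below:

params = [39.9 0.042 0.342 -8.185 0.461 3.082];            % [betaV alphaV alphaS betaS beta1 betac]
Choice = [1 -1 1 1 -1 1];                                  % 1 = left lever press, -1 = right lever press
Reward = [1  0 1 0  0 1];                                  % 1 = reward delivered on that trial
nll = GetValueFunctions_optimize(params, Choice, Reward)   % negative log-likelihood of the sequence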
The optimization procedure, which calls the likelihood function above

% Fit the model to each of the 20 mice, starting from the parameter values
% reported by Beeler et al.
load matlab_mouse_struct;
options = optimoptions('fmincon', 'MaxFunEvals', 1000, 'UseParallel', 'Always');
beeler = [39.9 0.042 0.342 -8.185 0.461 3.082];    % starting point for the fit
sol      = zeros(20, 6);
fval     = zeros(20, 1);
exitflag = zeros(20, 1);

for i = 1:20
    mus = MOUSE(i);
    mus.File                                       % display which data file is being fitted
    % Encode choices: 1 = left lever, -1 = right lever, 0 = no press
    Choice = double(mus.LeftLeverPress | mus.RightLeverPress);
    Choice(mus.RightLeverPress) = -1;
    % Align rewards with the lever press that produced them
    Reward = zeros(size(Choice));
    Rewardindx = find(mus.reward) - 1;
    Reward(Rewardindx) = 1;
    % Keep only time bins in which a lever was pressed
    keepindx = Choice ~= 0;
    Reward = Reward(keepindx);
    Choice = Choice(keepindx);
    % Bounded maximum-likelihood fit. The closing parenthesis of the anonymous
    % function was lost in the source; it has been restored here so that
    % 'beeler' is the starting point and the two vectors are the lower/upper bounds.
    [sol(i,:), fval(i), exitflag(i)] = fmincon(@(x) GetValueFunctions_optimize(x, Choice, Reward), ...
        beeler, [], [], [], [], [0 0 0 -25 -1 0], [50 2 2 0 5 5], [], options);
end
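The 100 fits with random start values discussed in the Results section could be generated with a multi-start wrapper of the following form. This is a sketch under the assumption that Choice, Reward and options have been prepared as in the listing above; it is not part of the original appendix.

% Multi-start fitting: repeat the optimization from random points inside the box constraints.
lb = [0 0 0 -25 -1 0];                              % lower bounds on the parameters
ub = [50 2 2 0 5 5];                                % upper bounds on the parameters
nStarts = 100;
solutions = zeros(nStarts, 6);
nll = zeros(nStarts, 1);
for s = 1:nStarts
    x0 = lb + rand(1, 6) .* (ub - lb);              % random start point inside the bounds
    [solutions(s,:), nll(s)] = fmincon(@(x) GetValueFunctions_optimize(x, Choice, Reward), ...
        x0, [], [], [], [], lb, ub, [], options);
end
% 'solutions' and 'nll' can then be passed to the MDS sketch in the Results section.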