Reinforcement Learning and Neuroscience
Michael Bosello
Università di Bologna – Department of Computer Science and Engineering, Cesena, Italy
Intelligent Robotic Systems – Exam
Outline
1 Introduction
2 Temporal Difference
Computational Temporal Difference
Temporal Difference in the Brain
Classical Conditioning
TD Model
The Reward Prediction Error Hypothesis
TD Error / Dopamine Correspondences
3 Agent Navigation Inspired by Neuroscience
Navigation using grid-like representations
The Network architecture
Navigation Experiments
4 Conclusions and Suggestions
Introduction
Machine Learning and Neuroscience
Intertwined history [Hassabis et al., 2017][Pillow and Sahani, 2019]
Since the dawn of artificial intelligence (AI), neuroscience and AI have influenced each other virtuously
Some of the most important tools available to machine learning (ML) originate from neuroscientific metaphors
Reinforcement Learning (RL)
Deep Learning (DL)
ML concepts, in turn, have guided the exploration of neural functions by providing functional metaphors useful for formulating hypotheses
Why it is important now
ML: leading researchers have particularly emphasized the opportunity of drawing inspiration from neuroscience for the next generation of AI [Hassabis et al., 2017][Lake et al., 2017][Ford, 2018][Ullman, 2019]
Neuroscience: AI offers ideas about possible mechanisms in the brain and means for formalizing concepts. ML is also used to analyze neuroimaging datasets to find patterns [Hassabis et al., 2017][Pillow and Sahani, 2019]
Reinforcement Learning and Neuroscience
This topic is very broad. We will focus on two points:
RL [Sutton and Barto, 2018]
The most remarkable virtuous circle between ML and neuroscience
Trial and error: inspired by research on animal learning (the Law of Effect)
Temporal Difference (TD) learning: inspired by classical conditioning
Evidence that the brain implements a form of TD learning
Core parallel: TD learning / dopamine
Why it is useful now
RL: findings in neuroscience can point the way toward more effective agents [Hassabis et al., 2017], e.g., agent navigation inspired by the entorhinal cortex [Banino et al., 2018]
Neuroscience: RL provides insights into psychiatric diseases [Pillow and Sahani, 2019][Sutton and Barto, 2018]
Suggestions for further reading are provided
Disclaimer
Background
We assume prior knowledge of RL (and DL)
We do not assume prior knowledge of neuroscience (or psychology)
Sources
The part on TD is based on [Sutton and Barto, 2018], unless otherwise stated
The part on navigation is based on [Banino et al., 2018], unless otherwise stated
All images come from these two sources (adapted for the context)
Temporal Difference
Temporal Difference in Agents
Central idea in RL
Learn from experience (sampling, like Monte Carlo methods)
→ Does not need a model
Estimation updates are based on other estimates (bootstrapping, like Dynamic Programming)
→ Does not need to wait for the end of an episode to update (online learning)
Estimation of $v_\pi$ – TD(0)
$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
TD error
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
Sarsa
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$
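A minimal Python sketch of the two updates above may help fix ideas; the tabular storage and the parameter values are illustrative choices, not taken from any particular source.

```python
# Hedged sketch of tabular TD(0) and Sarsa updates; alpha and gamma are
# illustrative. States and actions can be any hashable values.
from collections import defaultdict

alpha, gamma = 0.1, 0.9
V = defaultdict(float)   # state-value estimates V(s)
Q = defaultdict(float)   # action-value estimates Q(s, a), keyed by (s, a)

def td0_update(s, r, s_next):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma*V(s')."""
    delta = r + gamma * V[s_next] - V[s]   # the TD error delta_t
    V[s] += alpha * delta
    return delta

def sarsa_update(s, a, r, s_next, a_next):
    """One Sarsa step: the same update applied to action values."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return delta
```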
Classical Conditioning (Pavlovian Conditioning) I
The physiologist Ivan Pavlov discovered that animals' innate reflexes to certain stimuli can also come to be triggered by other, initially unrelated stimuli
In his experiment, a dog receives food after the sound of a metronome
Initially, the dog salivates only in response to the sight of food
After some trials, the dog starts salivating also in response to the sound stimulus
Unconditioned stimulus (US): the natural trigger (food)
Unconditioned response (UR): the inborn reflex (salivation)
Conditioned stimulus (CS): the new predictive stimulus (the sound of the metronome)
Conditioned response (CR): the acquired response (salivation)
Classical Conditioning (Pavlovian Conditioning) II
The animal learns the predictive relationship between the CS and the US
so that it can anticipate the US and prepare for it, or protect itself, with a CR (which can differ from the UR and be more effective)
We consider only the prediction part, i.e., policy evaluation
Classical Conditioning (Pavlovian Conditioning) III
Blocking
The learning of a CR to a potential CS is blocked by another, previously learned CS
If you use a tone as a CS, and only afterwards add a light as a second CS, there will be no response to the light alone
Does conditioning depend only on simple temporal contiguity?
Higher-order conditioning
A previously learned CS acts as a US in conditioning another CS
As in the previous experiment, a dog is conditioned with a metronome sound
In the following trials, a black box is placed in the dog's line of vision before the metronome sound, but no food is given
The dog starts responding with salivation also to the sight of the black box, even though the box was never paired with the original US (food)
The Rescorla-Wagner Model
It explains blocking: the animal learns only when its expectations are violated (i.e., when it is surprised)
Each CS has an associative strength that represents its predictive reliability
Assume a US $Y$, a CS $A$, a CS $X$, and the compound CS $AX$, with respective associative strengths $V_A$, $V_X$, $V_{AX}$
Aggregate associative strength: $V_{AX} = V_A + V_X$
The associative strengths change over successive trials according to:
$$\Delta V_A = \alpha_A \beta_Y (R_Y - V_{AX}) \qquad \Delta V_X = \alpha_X \beta_Y (R_Y - V_{AX})$$
where $\alpha_A \beta_Y$ and $\alpha_X \beta_Y$ are the step-size parameters and $R_Y$ is the associative strength supported by the US $Y$
The associative strengths increase until they reach the supported level $R_Y$
If an animal is first conditioned with a CS $A$, adding a CS $X$ has almost no effect, since the prediction error has already been reduced to a low value – there is no surprise
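The blocking effect falls straight out of the update. Below is a hedged simulation sketch; the step size and the US-supported level are illustrative values.

```python
# Rescorla-Wagner update reproducing blocking. alpha_beta plays the role of
# the combined step size alpha_CS * beta_US; R_Y is the US-supported level.
alpha_beta = 0.3
R_Y = 1.0
V = {"A": 0.0, "X": 0.0}

# Phase 1: condition A alone until V_A approaches R_Y.
for _ in range(50):
    error = R_Y - V["A"]              # only A is present
    V["A"] += alpha_beta * error

# Phase 2: present the compound AX. The aggregate prediction V_A + V_X
# already matches R_Y, so the error is ~0 and X gains almost no strength.
for _ in range(50):
    error = R_Y - (V["A"] + V["X"])
    V["A"] += alpha_beta * error
    V["X"] += alpha_beta * error

print(V)   # V_A ~ 1.0, V_X ~ 0.0: learning about X is blocked
```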
TD Model
From Rescorla-Wagner to TD
The TD model is based on the Rescorla-Wagner one
A state is described by a feature vector: $x(s) = (x_1(s), x_2(s), \dots, x_n(s))$
In Rescorla-Wagner it is the set of CSs, while in TD it is more abstract
$t$ is a time step, not a trial
The aggregate associative strength is $\hat{v}(s, w) = w^\top x(s)$
$w$ is the associative strength vector
Like a value estimate
The new update formula
Associative strength vector update: $w_{t+1} = w_t + \alpha \delta_t x(S_t)$
TD error: $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)$
With $\gamma = 0$, the model reduces to Rescorla-Wagner
For the origins of the model, see [Sutton and Barto, 1981] [Sutton and Barto, 1987]
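Written as code, the update is a few lines of linear TD(0); this is a minimal sketch with the feature map left abstract, not an implementation from the cited works.

```python
# One step of the TD model: linear value estimate v(s, w) = w . x(s),
# TD error, and update of the associative strength vector w.
import numpy as np

alpha, gamma = 0.1, 0.95   # illustrative values

def td_model_step(w, x_t, r_next, x_next):
    """Return the updated weights and the TD error delta_t."""
    delta = r_next + gamma * (w @ x_next) - (w @ x_t)
    return w + alpha * delta * x_t, delta
```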
Neuroscience Basics
Neuroscience
The study of the nervous system, its functions, and its changes over time
Neurons
Neurons are cells that process and transfer electrical and chemical signals
A neuron is said to fire when it generates a spike (an electrical pulse)
A neuron can reach many other neurons
The background activity of a neuron is its firing rate when not driven by the stimuli of an experiment
The phasic activity of a neuron is the activity caused by synaptic input
Synapses are the structures that mediate communication between neurons
A synapse releases a chemical neurotransmitter when its neuron fires
A neurotransmitter can inhibit or excite the postsynaptic neuron
Neuromodulators are neurotransmitters with additional effects that can alter the operation of synapses
Two assumptions
Since the firing rate of a neuron cannot be negative, the modeled neuron activity is $\delta_{t-1} + b_t$, where $b_t$ is the background firing rate
→ A negative error corresponds to a firing rate below $b_t$
The representation of the states allows keeping track of the time elapsed between the cue and the reward
There is a different signal for each time step → the state is time-dependent
The complete serial compound (CSC) representation is used
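Concretely, the CSC representation amounts to one indicator feature per time step after stimulus onset; a minimal sketch follows (the trial length T is an illustrative choice).

```python
# Complete serial compound (CSC) features: each time step since CS onset
# gets its own one-hot indicator, making the state time-dependent.
import numpy as np

T = 10   # time steps per trial (illustrative)

def csc(t):
    """Feature vector marking the time elapsed since CS onset."""
    x = np.zeros(T)
    x[t] = 1.0
    return x

# csc(0) is the CS-onset state; csc(T - 1) the state at the reward time
```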
The Reward Prediction Error Hypothesis
It states that “one of the functions of the phasic activity of dopamine-producing neurons in mammals is to deliver an error between an old and a new estimate of expected future reward to target areas throughout the brain.” [Sutton and Barto, 2018]
[Montague et al., 1996] showed that the concept of TD error aligns with the behavior of dopamine neurons
Dopamine
is a neuromodulator
that broadcasts reward prediction errors (not rewards, as was previously thought)
Experimental Support for the Hypothesis
Schultz’s group conducted a series of experiments supporting the view that the responses of dopamine neurons correspond to a TD error, and not to a simpler error like the Rescorla-Wagner one.
Experiment
Task 1: monkeys are trained to depress a lever after a light (the trigger cue) is illuminated, in order to obtain juice
Task 2: there are two levers, each with a light (the instruction cue) indicating which lever will produce juice. The instruction precedes the trigger, which must be awaited
Initially, dopamine neurons respond to the reward
During training, the dopamine response shifts to the earlier stimulus
A hallmark of TD learning
When the task is learned, the dopamine response decreases
When moving to task 2, the dopamine response increases
Correspondence I
Dopamine neurons respond to unpredicted rewards
$$\delta_{t-1} = R_t + V_t - V_{t-1} = R_t + 0 - 0 = R_t$$
We consider tabular TD(0)
Without discounting
⇒ The return to be predicted from each state is simply the final reward $R$
Correspondence II
Dopamine neurons respond to the earliest predictor
From any predicting state to another: $\delta_{t-1} = R_t + V_t - V_{t-1} = 0 + R - R = 0$
From the last predicting state to the end: $\delta_{t-1} = R_t + V_t - V_{t-1} = R + 0 - R = 0$
From any state to the earliest predicting state: $\delta_{t-1} = R_t + V_t - V_{t-1} = 0 + R - 0 = R$
The reward prediction spreads backward until convergence
The states preceding the earliest reward-predicting state are not reliable predictors, so their values stay low
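The backward spread is easy to reproduce with tabular TD(0) on a chain of predicting states; the chain length and step size below are illustrative.

```python
# TD(0) on a chain with a reward at the end (gamma = 1, no discounting):
# the TD error spikes at the reward on the first episode and migrates
# backward over episodes until all within-chain errors are ~0.
alpha = 0.5
n = 4                    # predicting states 0..3; reward on leaving state 3
V = [0.0] * (n + 1)      # V[n] is a terminal dummy with value 0

for episode in range(60):
    deltas = []
    for s in range(n):
        r = 1.0 if s == n - 1 else 0.0
        delta = r + V[s + 1] - V[s]
        V[s] += alpha * delta
        deltas.append(round(delta, 2))
    if episode in (0, 1, 2, 59):
        print(f"episode {episode:2d}: deltas = {deltas}")
```

In the experiments, a positive response persists on entering the earliest predictor because the states preceding it are unreliable, so their values remain near zero.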
Correspondence III
Dopamine neuron firing rates decrease below baseline if a reward does not occur at its expected time
When monkeys pull the wrong lever, they receive no juice
They must internally keep track of time somehow
$$\delta_{t-1} = R_t + V_t - V_{t-1} = 0 + 0 - R = -R$$
Further readings
Actor-critic in the brain
Dopamine mainly targets two parts of the striatum
Its effects depend on the properties of the target
It is conjectured that
the dorsal striatum acts as the actor
the ventral striatum acts as the critic
Other parallels
More parallels and topics are introduced in [Sutton and Barto, 2018]
Eligibility traces and hedonistic neurons
...
Agent Navigation Inspired by Neuroscience
Navigation using grid-like representations I
Entorhinal cortex basics [Rowland et al., 2016] [Banino et al., 2018]
The entorhinal cortex creates a neural representation of space
through functionally dedicated cell types
whose firing rates depend on the animal's position
Grid cells respond to the animal's location in the environment
There are also border cells, speed cells, and head direction cells
Grid cells have hexagonally arranged firing fields that tile the surface of the environment
A population of grid cells provides a unique representation (code) of each location
Grid cells perform path integration by taking inputs from speed cells and head direction cells, with the result expressed in place cells
Path integration is the ability to self-localize based on self-motion
Grid cells are considered critical for vector-based navigation, as they provide a Euclidean spatial metric supporting the calculation of goal-directed vectors
Vector-based navigation is the process of following direct routes to a remembered goal
Navigation using grid-like representations II
[Banino et al., 2018] developed a deep RL agent with mammal-like navigation abilities
They trained a recurrent network to perform path integration, leading to the emergence of representations resembling entorhinal cells
The network is then incorporated into a deep RL architecture to perform navigation
Results
Grid-cell representations endow agents with the ability to perform proficient vector-based navigation
The emergent representation provides the Euclidean spatial metric needed to
calculate goal-directed vectors and the relative positions of two points, by comparing the current vector code with the code of a remembered goal
locate goals in challenging, unfamiliar, and changeable environments
The results provide strong empirical support for
theories that see grid cells as critical for vector-based navigation (support that was previously missing)
the role of grid cells in providing a location code updated by self-motion cues
The performance of the agent surpassed that of comparison agents and of an expert human
The agent exhibited shortcut behaviors like those performed by mammals
Neural Network Performing Path Integration
The network is a long short-term memory (LSTM)
As happens in the brain:
It must update its estimate of location and head direction (by predicting place-cell and head-direction-cell activations)
It takes as input translational and angular velocities (with perturbations) and a visual input
It is trained with simulated place-cell and head-direction-cell activations during trajectories modeled on those of foraging rodents
This form of supervision is also present in rodent pups, where place and head direction cells guide the development of grid cells
The visual input is processed by a CNN
This mimics the correction performed by place cells based on environmental cues
It generates place-cell and head-direction-cell activations
Its output is silenced 95% of the time (to mimic the imperfect observations available to behaving animals)
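For orientation, here is a hedged PyTorch-style sketch mirroring the pipeline above (an LSTM over velocity inputs, a linear "grid" layer with dropout, and place and head-direction readouts); every layer size and name is an assumption, not the paper's exact architecture.

```python
# Illustrative path-integration network: velocities -> LSTM -> linear layer
# with dropout (where grid-like units emerged in the paper) -> readouts
# trained against simulated place-cell and head-direction-cell activations.
import torch
import torch.nn as nn

class PathIntegrator(nn.Module):
    def __init__(self, n_place=256, n_hd=12, n_grid=512):
        super().__init__()
        self.lstm = nn.LSTM(input_size=3, hidden_size=128, batch_first=True)
        self.grid = nn.Linear(128, n_grid)   # the "grid code" layer
        self.drop = nn.Dropout(p=0.5)        # dropout was essential for grid cells
        self.place_head = nn.Linear(n_grid, n_place)
        self.hd_head = nn.Linear(n_grid, n_hd)

    def forward(self, vel):
        # vel: (batch, time, 3) = translational speed, sin/cos of angular velocity
        h, _ = self.lstm(vel)
        g = self.drop(self.grid(h))          # the grid code passed to the RL agent
        return self.place_head(g), self.hd_head(g), g

place_logits, hd_logits, grid_code = PathIntegrator()(torch.randn(8, 100, 3))
```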
Grid-Like Representations
The last layer is a linear layer with dropout
Individual units in the linear layer developed stable spatial activity profiles similar to those of neurons within the entorhinal cortex
Grid-like representations did not emerge without dropout regularization
Dropout is also present in the brain – it is an inductive bias [Hassabis et al., 2017]
Neural Network Performing Vector-Based Navigation
Another LSTM controls the agent, taking as input:
The current grid code (a grid code is the activity of the linear layer of the previous network)
The goal grid code, once the goal has been reached for the first time (one-shot learning)
A preprocessed visual cue, the last action, and the current reward
The first output is a discrete action (the actor)
The second output is the value estimate (the critic)
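A hedged sketch of this controller follows: an LSTM over the concatenated inputs, with actor and critic heads. All dimensions and the action count are illustrative assumptions.

```python
# Illustrative navigation controller with actor (policy logits) and critic
# (value) heads, as in the actor-critic setup described above.
import torch
import torch.nn as nn

class NavigationAgent(nn.Module):
    def __init__(self, grid_dim=512, visual_dim=64, n_actions=6):
        super().__init__()
        in_dim = 2 * grid_dim + visual_dim + n_actions + 1   # +1 for reward
        self.lstm = nn.LSTM(input_size=in_dim, hidden_size=256, batch_first=True)
        self.actor = nn.Linear(256, n_actions)   # logits over discrete actions
        self.critic = nn.Linear(256, 1)          # state-value estimate

    def forward(self, grid_code, goal_code, visual, prev_action, reward, state=None):
        x = torch.cat([grid_code, goal_code, visual, prev_action, reward], dim=-1)
        h, state = self.lstm(x, state)
        return self.actor(h), self.critic(h), state
```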
Navigation Experiments
A goal grid code provides sufficient information to navigate to an arbitrary location
The authors substituted the goal grid code with a 'fake' one sampled randomly
The agent followed a direct path to the newly specified location, circling around the absent goal
Like rodents in probe trials of the Morris water maze
Grid cells are crucial: silencing most of the grid-like units (simulating a targeted lesion), rather than other units, has a dramatic effect on performance
Only the grid-cell agent was able to exploit shortcuts
At the beginning of an episode, the agent explores to find an unmarked goal.
When the agent reaches the goal, it is teleported to a new random location. It then exploits the goal code until the episode ends (after a fixed number of steps)
The maze's layout, textures, landmarks, and goal change at each episode
The state of the doors changes randomly during an episode, at every new run
Conclusions and Suggestions
Deepening I
Takeaways
Classical conditioning was essential to formulating the core RL update rule
Vice versa, RL was crucial to uncovering the functioning of the brain's reward system
Neuroscience continues to inspire novel and more powerful algorithms, like the navigation one
More on RL and Neuroscience
Meta-RL [Hassabis et al., 2017]
RL optimizes an RNN from which a second RL algorithm emerges, faster than the original
It may be implemented by the recurrent activity of the prefrontal cortex
Study of addiction [Sutton and Barto, 2018]
RL theory could be used to understand the neural basis of drug abuse
Deepening II
Where to start (Neuroscience for AI)
Building machines that learn and think like people [Lake et al., 2017]
Recalls the history of neural inspiration
Proposes cognitive challenges for agents
Defines the essential ingredients for building human-like intelligence
Neuroscience-inspired artificial intelligence [Hassabis et al., 2017]
Highlights the importance of the human brain as an inspiration for AI
Analyzes past and current influences of neuroscience on ML techniques
Underlines key areas for bridging the gap between machine and human-level intelligence
Using neuroscience to develop artificial intelligence [Ullman, 2019]
Ullman calls into question current highly reductionist approaches
We should use knowledge about biological neurons – their structure, types, and connectivity – to guide the building of brain-like network models
Intelligence likely lies in both experience and preexisting structures (inductive biases)
Where to start (AI for Neuroscience)
A deep learning framework for neuroscience [Richards et al., 2019]
Neuroscientists need an approach for dealing with large experimental datasets
The three components of an ANN – (i) objective functions, (ii) learning rules, and (iii) architectures – could be used to produce compact (tractable) brain models
Deepening III
Practical Insight
Prior knowledge and inbuilt capacities (inductive biases) are crucial for fast learning and for making inferences [Ullman, 2019][Richards et al., 2019]
Which neuron features – type, connectivity, structure – could be used to improve ANNs?
A recent report claims grid cells could be critical also for abstract reasoning and concept representation [Constantinescu et al., 2016]
Could ANNs featuring grid-like regularities be used to process abstract concepts?
References
Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., Pritzel, A.,
Chadwick, M. J., Degris, T., Modayil, J., Wayne, G., Soyer, H., Viola, F., Zhang, B.,
Goroshin, R., Rabinowitz, N., Pascanu, R., Beattie, C., Petersen, S., Sadik, A.,
Gaffney, S., King, H., Kavukcuoglu, K., Hassabis, D., Hadsell, R., and Kumaran, D.
(2018).
Vector-based navigation using grid-like representations in artificial agents.
Nature, 557(7705):429–433.
Constantinescu, A. O., O’Reilly, J. X., and Behrens, T. E. J. (2016).
Organizing conceptual knowledge in humans with a gridlike code.
Science, 352(6292):1464–1468.
Ford, M. (2018).
Architects of Intelligence: The truth about AI from the people building it.
Packt Publishing Ltd.
Hassabis, D., Kumaran, D., Summerfield, C., and Botvinick, M. (2017).
Neuroscience-inspired artificial intelligence.
Neuron, 95(2):245–258.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017).
Building machines that learn and think like people.
Behavioral and Brain Sciences, 40:e253.
Montague, P., Dayan, P., and Sejnowski, T. (1996).
A framework for mesencephalic dopamine systems based on predictive Hebbian learning.
The Journal of Neuroscience, 16:1936–1947.
Pillow, J. and Sahani, M. (2019).
Editorial overview: Machine learning, big data, and neuroscience.
Current Opinion in Neurobiology, 55:iii–iv.
Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen,
A., Clopath, C., Costa, R. P., de Berker, A., Ganguli, S., Gillon, C. J., Hafner, D.,
Kepecs, A., Kriegeskorte, N., Latham, P., Lindsay, G. W., Miller, K. D., Naud, R.,
Pack, C. C., Poirazi, P., Roelfsema, P., Sacramento, J., Saxe, A., Scellier, B.,
Schapiro, A. C., Senn, W., Wayne, G., Yamins, D., Zenke, F., Zylberberg, J.,
Therien, D., and Kording, K. P. (2019).
A deep learning framework for neuroscience.
Nature Neuroscience, 22(11):1761–1770.
Rowland, D. C., Roudi, Y., Moser, M.-B., and Moser, E. I. (2016).
Ten years of grid cells.
Annual Review of Neuroscience, 39(1):19–40.
Sutton, R. and Barto, A. (1981).
Toward a modern theory of adaptive networks: Expectation and prediction.
Psychological Review, 88:135–170.
Sutton, R. S. and Barto, A. G. (1987).
A temporal-difference model of classical conditioning.
In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 355–378, Seattle, WA.
Sutton, R. S. and Barto, A. G. (2018).
Reinforcement Learning: An Introduction.
The MIT Press.
Ullman, S. (2019).
Using neuroscience to develop artificial intelligence.
Science, 363(6428):692–693.
Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 36 / 37
Reinforcement Learning and Neuroscience
Michael Bosello
Universit`a di Bologna – Department of Computer Science and Engineering, Cesena, Italy
Intelligent Robotic Systems – Exam
Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 37 / 37

Contenu connexe

Tendances

abstrakty přijatých příspěvků.doc
abstrakty přijatých příspěvků.docabstrakty přijatých příspěvků.doc
abstrakty přijatých příspěvků.docbutest
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksGiuseppe Broccolo
 
Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesSho Takase
 
MLconf NYC Animashree Anandkumar
MLconf NYC Animashree AnandkumarMLconf NYC Animashree Anandkumar
MLconf NYC Animashree AnandkumarMLconf
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesJinYeong Bak
 
Rethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast TrainingRethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast TrainingSho Takase
 
Visualising Quantum Physics using Mathematica
Visualising Quantum Physics using MathematicaVisualising Quantum Physics using Mathematica
Visualising Quantum Physics using MathematicaAndreas Dewanto
 
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONSFUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONSijcsity
 
Extension principle
Extension principleExtension principle
Extension principleSavo Delić
 
Jarrar: Introduction to logic and Logic Agents
Jarrar: Introduction to logic and Logic Agents Jarrar: Introduction to logic and Logic Agents
Jarrar: Introduction to logic and Logic Agents Mustafa Jarrar
 
Jarrar: Description Logic
Jarrar: Description LogicJarrar: Description Logic
Jarrar: Description LogicMustafa Jarrar
 

Tendances (18)

Artificial Intelligence - Reasoning in Uncertain Situations
Artificial Intelligence - Reasoning in Uncertain SituationsArtificial Intelligence - Reasoning in Uncertain Situations
Artificial Intelligence - Reasoning in Uncertain Situations
 
Chapter 9
Chapter 9Chapter 9
Chapter 9
 
abstrakty přijatých příspěvků.doc
abstrakty přijatých příspěvků.docabstrakty přijatých příspěvků.doc
abstrakty přijatých příspěvků.doc
 
Non-parametric regressions & Neural Networks
Non-parametric regressions & Neural NetworksNon-parametric regressions & Neural Networks
Non-parametric regressions & Neural Networks
 
Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic Rules
 
MLconf NYC Animashree Anandkumar
MLconf NYC Animashree AnandkumarMLconf NYC Animashree Anandkumar
MLconf NYC Animashree Anandkumar
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
 
Rethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast TrainingRethinking Perturbations in Encoder-Decoders for Fast Training
Rethinking Perturbations in Encoder-Decoders for Fast Training
 
Visualising Quantum Physics using Mathematica
Visualising Quantum Physics using MathematicaVisualising Quantum Physics using Mathematica
Visualising Quantum Physics using Mathematica
 
Lesson 28
Lesson 28Lesson 28
Lesson 28
 
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONSFUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
FUZZY ROUGH INFORMATION MEASURES AND THEIR APPLICATIONS
 
Fuzzy hypersoft sets and its weightage operator for decision making
Fuzzy hypersoft sets and its weightage operator for decision makingFuzzy hypersoft sets and its weightage operator for decision making
Fuzzy hypersoft sets and its weightage operator for decision making
 
AI applications in education, Pascal Zoleko, Flexudy
AI applications in education, Pascal Zoleko, FlexudyAI applications in education, Pascal Zoleko, Flexudy
AI applications in education, Pascal Zoleko, Flexudy
 
Bayesnetwork
BayesnetworkBayesnetwork
Bayesnetwork
 
Extension principle
Extension principleExtension principle
Extension principle
 
Chapter 5 (final)
Chapter 5 (final)Chapter 5 (final)
Chapter 5 (final)
 
Jarrar: Introduction to logic and Logic Agents
Jarrar: Introduction to logic and Logic Agents Jarrar: Introduction to logic and Logic Agents
Jarrar: Introduction to logic and Logic Agents
 
Jarrar: Description Logic
Jarrar: Description LogicJarrar: Description Logic
Jarrar: Description Logic
 

Similaire à Reinforcement Learning and Neuroscience

Building better models in cognitive neuroscience. Part 1: Theory
Building better models in cognitive neuroscience. Part 1: TheoryBuilding better models in cognitive neuroscience. Part 1: Theory
Building better models in cognitive neuroscience. Part 1: TheoryBrian Spiering
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..butest
 
Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...butest
 
Neuroeconomic Class
Neuroeconomic ClassNeuroeconomic Class
Neuroeconomic Classtkvaran
 
Analogy, Causality, and Discovery in Science: The engines of human thought
Analogy, Causality, and Discovery in Science: The engines of human thoughtAnalogy, Causality, and Discovery in Science: The engines of human thought
Analogy, Causality, and Discovery in Science: The engines of human thoughtCITE
 
Cognitive Science Unit 4
Cognitive Science Unit 4Cognitive Science Unit 4
Cognitive Science Unit 4CSITSansar
 
Particle swarm optimization
Particle swarm optimizationParticle swarm optimization
Particle swarm optimizationMahesh Tibrewal
 
soft computing BTU MCA 3rd SEM unit 1 .pptx
soft computing BTU MCA 3rd SEM unit 1 .pptxsoft computing BTU MCA 3rd SEM unit 1 .pptx
soft computing BTU MCA 3rd SEM unit 1 .pptxnaveen356604
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceJonathan Mugan
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceJonathan Mugan
 
脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治
脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治
脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治The Whole Brain Architecture Initiative
 
Continuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep ArchitecturesContinuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep ArchitecturesVincenzo Lomonaco
 
Math viva [www.onlinebcs.com]
Math viva [www.onlinebcs.com]Math viva [www.onlinebcs.com]
Math viva [www.onlinebcs.com]Itmona
 
Learning about the brain: Neuroimaging and Beyond
Learning about the brain: Neuroimaging and BeyondLearning about the brain: Neuroimaging and Beyond
Learning about the brain: Neuroimaging and BeyondIrina Rish
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsMark Gerstein
 

Similaire à Reinforcement Learning and Neuroscience (20)

Building better models in cognitive neuroscience. Part 1: Theory
Building better models in cognitive neuroscience. Part 1: TheoryBuilding better models in cognitive neuroscience. Part 1: Theory
Building better models in cognitive neuroscience. Part 1: Theory
 
32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..
 
Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...
 
Neuroeconomic Class
Neuroeconomic ClassNeuroeconomic Class
Neuroeconomic Class
 
Analogy, Causality, and Discovery in Science: The engines of human thought
Analogy, Causality, and Discovery in Science: The engines of human thoughtAnalogy, Causality, and Discovery in Science: The engines of human thought
Analogy, Causality, and Discovery in Science: The engines of human thought
 
Lec4
Lec4Lec4
Lec4
 
6238578.ppt
6238578.ppt6238578.ppt
6238578.ppt
 
Cognitive Science Unit 4
Cognitive Science Unit 4Cognitive Science Unit 4
Cognitive Science Unit 4
 
Lab 1 intro
Lab 1 introLab 1 intro
Lab 1 intro
 
Particle swarm optimization
Particle swarm optimizationParticle swarm optimization
Particle swarm optimization
 
soft computing BTU MCA 3rd SEM unit 1 .pptx
soft computing BTU MCA 3rd SEM unit 1 .pptxsoft computing BTU MCA 3rd SEM unit 1 .pptx
soft computing BTU MCA 3rd SEM unit 1 .pptx
 
PPT
PPTPPT
PPT
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 
What Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial IntelligenceWhat Deep Learning Means for Artificial Intelligence
What Deep Learning Means for Artificial Intelligence
 
Soft computing BY:- Dr. Rakesh Kumar Maurya
Soft computing BY:- Dr. Rakesh Kumar MauryaSoft computing BY:- Dr. Rakesh Kumar Maurya
Soft computing BY:- Dr. Rakesh Kumar Maurya
 
脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治
脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治
脳とAIの接点から何を学びうるのか@第5回WBAシンポジウム: 銅谷賢治
 
Continuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep ArchitecturesContinuous Unsupervised Training of Deep Architectures
Continuous Unsupervised Training of Deep Architectures
 
Math viva [www.onlinebcs.com]
Math viva [www.onlinebcs.com]Math viva [www.onlinebcs.com]
Math viva [www.onlinebcs.com]
 
Learning about the brain: Neuroimaging and Beyond
Learning about the brain: Neuroimaging and BeyondLearning about the brain: Neuroimaging and Beyond
Learning about the brain: Neuroimaging and Beyond
 
Cornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 NetsCornell Pbsb 20090126 Nets
Cornell Pbsb 20090126 Nets
 

Dernier

Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxSilpa
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .Poonam Aher Patil
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Silpa
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Silpa
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...Monika Rani
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.Silpa
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLkantirani197
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professormuralinath2
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body Areesha Ahmad
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Silpa
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.Silpa
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspectsmuralinath2
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Silpa
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxRenuJangid3
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learninglevieagacer
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxMohamedFarag457087
 

Dernier (20)

Cyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptxCyanide resistant respiration pathway.pptx
Cyanide resistant respiration pathway.pptx
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.Porella : features, morphology, anatomy, reproduction etc.
Porella : features, morphology, anatomy, reproduction etc.
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRLGwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
Gwalior ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Gwalior ESCORT SERVICE❤CALL GIRL
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body GBSN - Microbiology (Unit 3)Defense Mechanism of the body
GBSN - Microbiology (Unit 3)Defense Mechanism of the body
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 
LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.LUNULARIA -features, morphology, anatomy ,reproduction etc.
LUNULARIA -features, morphology, anatomy ,reproduction etc.
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 

Reinforcement Learning and Neuroscience

  • 1. Reinforcement Learning and Neuroscience Michael Bosello Universit`a di Bologna – Department of Computer Science and Engineering, Cesena, Italy Intelligent Robotic Systems – Exam Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 1 / 37
  • 2. Outline 1 Introduction 2 Temporal Difference Computational Temporal Difference Temporal Difference in the Brain Classical Conditioning TD Model The Reward Prediction Error Hypothesis TD Error / Dopamine Correspondences 3 Agent navigation Inspired by Neuroscience Navigation using grid-like representation The Network architecture Navigation Experiments 4 Conclusions and Suggestions Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 2 / 37
  • 3. Introduction Next in Line... 1 Introduction 2 Temporal Difference Computational Temporal Difference Temporal Difference in the Brain Classical Conditioning TD Model The Reward Prediction Error Hypothesis TD Error / Dopamine Correspondences 3 Agent navigation Inspired by Neuroscience Navigation using grid-like representation The Network architecture Navigation Experiments 4 Conclusions and Suggestions Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 3 / 37
  • 4. Introduction Machine Learning and Neuroscience Intertwined history [Hassabis et al., 2017][Pillow and Sahani, 2019] Since the dawn of artificial intelligence (AI), neuroscience and AI have virtuously influenced each other Some of the most important tools that are available to machine learning (ML) originate from neuroscientific metaphors Reinforcement Learning (RL) Deep Learning (DL) ML metaphors have guided the exploration of neural functions by providing functional metaphors useful to formulate hypotheses Why it is important now ML Leading researchers have particularly emphasized the opportunity of drawing inspiration from neuroscience for the next generation of AI [Hassabis et al., 2017][Lake et al., 2017][Ford, 2018][Ullman, 2019] Neuroscience AI offers ideas about the possible mechanism in the brain and means for formalizing concepts. Also, the use of ML to analyze neuroimaging datasets to find patterns [Hassabis et al., 2017][Pillow and Sahani, 2019] Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 4 / 37
  • 5. Introduction Reinforcement Learning and Neuroscience This topic is very wide. We will focus on two points: RL [Sutton and Barto, 2018] The most remarkable virtuous circle of ML and neuroscience Trial and error Inspired by research into animal learning (Law of effect) Temporal Difference (TD) learning inspired by classical conditioning Evidence that the brain implements a form of TD learning Core parallel: TD learning / dopamine Why it is useful now RL Findings in neuroscience can provide a direction to more effective agents [Hassabis et al., 2017] e.g. agent navigation inspired by the entorhinal cortex [Banino et al., 2018] Neuroscience RL provides insight for psychiatric diseases [Pillow and Sahani, 2019][Sutton and Barto, 2018] Suggestions for further reading are provided Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 5 / 37
  • 6. Introduction Disclaimer Background We assume prior knowledge of RL (and DL) We don’t assume prior knowledge of neuroscience (and psychology) Sources The part on TD is based on [Sutton and Barto, 2018], unless otherwise stated The part on navigation is based on [Banino et al., 2018], unless otherwise stated All the images come from these two sources (but adapted for the context) Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 6 / 37
  • 7. Temporal Difference Next in Line... 1 Introduction 2 Temporal Difference Computational Temporal Difference Temporal Difference in the Brain Classical Conditioning TD Model The Reward Prediction Error Hypothesis TD Error / Dopamine Correspondences 3 Agent navigation Inspired by Neuroscience Navigation using grid-like representation The Network architecture Navigation Experiments 4 Conclusions and Suggestions Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 7 / 37
  • 8. Temporal Difference Computational Temporal Difference Temporal Difference in Agents Central idea in RL Learn from experience (sampling like Monte Carlo methods) → Doesn’t need a model Estimation updates are based on other estimations (bootstrapping like Dynamic Programming) → Doesn’t need to wait for the end of an episode to update (online learning) Estimation of vπ – TD(0) V (St) ← V (St) + α[Rt+1 + γV (St+1) − V (St)] TD error δt = Rt+1 + γV (St+1) − V (St) Sarsa Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) − Q(St, At)] Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 8 / 37
  • 9. Temporal Difference Temporal Difference in the Brain Classical Conditioning (Pavlovian Conditioning) I The physiologist Ivan Pavlov discovered that animals’ innate reflexes to certain stimuli can come to be triggered also by other unrelated stimuli In his experiment, a dog receives food after the sound of a metronome Initially, the dog produces more saliva only in response to the sight of food After some trials, the dog starts salivating also in response to the sound stimulus Unconditioned stimulus (US) The natural trigger (food) Unconditioned response (UR) The unborn reflex (salivation) Conditioned stimulus (CS) The new predictive stimulus (sound of the metronome) Conditioned response (CR) The acquired response (salivation) Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 9 / 37
  • 10. Temporal Difference Temporal Difference in the Brain Classical Conditioning (Pavlovian Conditioning) II The animal learns the predictive relationship between the CS and the US So that the animal can anticipate the US and prepare or protect himself with a CR (which can differ from the UR and be more effective) We are considering only the prediction part i.e. policy evaluation Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 10 / 37
  • 11. Temporal Difference Temporal Difference in the Brain Classical Conditioning (Pavlovian Conditioning) III Blocking When the learning of one CR to a potential CS is blocked by another CS If you use a tone as a CS, and after that, you add light as a CS, there will be no response to the light alone Do conditioning depends only on simple temporal contiguity? Higher-order conditioning When a learned CS acts as a US in conditioning another CS Like in the previous experiment, a dog is conditioned with a metronome sound In the following trials, a black box is placed in the dog’s line of vision before the metronome sound, but no food is given The dog starts responding with salivation also to the sight of the black box, even though it has never anticipated the original US (food) Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 11 / 37
  • 12. Temporal Difference Temporal Difference in the Brain The Rescorla-Wagner Model It explains blocking: the animal learns only when his expectations are violated (surprising) Each CS has an associative strength that represent its reliability Let’s assume we have a US Y , a CS A, a CS X and the compound CS AX with the respectively associative strengths VA, VX , VAX Aggregate associative strength Vax = VA + VX The associative strengths change over successive trials according to: ∆VA = αAβY (RY − VAX ) ∆VX = αX βY (RY − VAX ) Where αAβY αX βY are the step-size parameters, and RY is the associative strength supported by the US Y The associative strengths increase until they reach the supported level RY If an animal is conditioned with a CS A, adding a CS X will have almost no effect since the prediction error is already reduced to a low value – there is no surprise – Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 12 / 37
  • 13. Temporal Difference Temporal Difference in the Brain TD Model From Rescorla-Wagner to TD The TD model is based on the Rescorla-Wagner one A state is defined as a vector: x(s) = (x1(s), x2(s), . . . , xn(s)) In Rescorla-Wagner it is the set of CS meanwhile in TD it is more abstract t is a time step, not a trial The aggregate associative strength is: ˆv(s, w) = wT x(s) w is the associative strength vector Like a value estimate The new update formula Associative strength vector update wt+1 = wt + αδtx(St) TD error δt = Rt+1 + γˆv(St+1, wt) − ˆv(St, wt) With γ = 0 you return to Rescorla-Wagner To check the origins of the model: [Sutton and Barto, 1981] [Sutton and Barto, 1987] Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 13 / 37
  • 14. Temporal Difference Temporal Difference in the Brain Neuroscience Basics Neuroscience Is the study of the nervous system, its functions, and its changes over time Neurons Neurons are cells that process and transfer electrical and chemical signals A neuron is said to fire when it generates a spike (electrical pulses) A neuron can reach many other neurons The background activity of a neuron is its firing rate not related to the stimuli of an experiment The phasic activity of a neuron is caused by synaptic input Synapses are structures that mediate neurons communication A synapse can produce a chemical neurotransmitter when the neuron fires A neurotransmitter can inhibit or excite the postsynaptic neuron Neuromodulators are neurotransmitters having additional effects that can alter the operation of synapses Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 14 / 37
  • 15. Temporal Difference Temporal Difference in the Brain Two assumptions Since the firing rate of a neuron can’t be negative, the neuron activity is δt−1 + bt, where bt is the background firing rate. → A negative error corresponds to a firing rate below b The representation of the states allow keeping track of time passed between the cue and the reward There is a different signal for each time step → the state is time-dependent It is used the complete serial compound representation Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 15 / 37
  • 16. Temporal Difference Temporal Difference in the Brain The Reward Prediction Error Hypothesis States that “one of the functions of the phasic activity of dopamine-producing neurons in mammals is to deliver an error between an old and a new estimate of expected future reward to target areas throughout the brain.” [Sutton and Barto, 2018] [Montague et al., 1996] showed that the concept of TD error aligns with the feature of dopamine neurons Dopamine is a neuromodulator that broadcasts reward prediction errors (not rewards as was previously thought) Michael Bosello Reinforcement Learning and Neuroscience Intelligent Robotic Systems – Exam 16 / 37
• 17. Temporal Difference / Temporal Difference in the Brain – Experimental Support for the Hypothesis
- Schultz's group conducted a series of experiments supporting the idea that the responses of dopamine neurons actually correspond to a TD error, and not to a simpler error like the Rescorla-Wagner one
- Experiment
  - Task 1: monkeys are trained to depress a lever after a light (the trigger cue) is illuminated, in order to obtain juice
  - Task 2: there are two levers, each with a light (the instruction cue) indicating which lever will produce juice; the instruction precedes the trigger, which must be awaited
- Results
  - Initially, dopamine neurons respond to the reward
  - During training, the dopamine response shifts to the earlier stimulus – a landmark of TD learning
  - When the task is learned, the dopamine response decreases
  - When moving to task 2, the dopamine response increases again
• 18. Temporal Difference / Temporal Difference in the Brain – Correspondence I
- Dopamine neurons respond to unpredicted rewards:
    δ_{t−1} = R_t + V_t − V_{t−1} = R_t + 0 − 0 = R_t
- We consider tabular TD(0) without discounting ⇒ the return to be predicted is simply the final reward R for each state
- (a numeric sketch covering all three correspondences follows Correspondence III)
• 19. Temporal Difference / Temporal Difference in the Brain – Correspondence II
- Dopamine neurons respond to the earliest predictor
  - From any predicting state to another one: δ_{t−1} = R_t + V_t − V_{t−1} = 0 + R − R = 0
  - From the last predicting state to the end: δ_{t−1} = R_t + V_t − V_{t−1} = R + 0 − R = 0
  - From any state to the earliest predicting state: δ_{t−1} = R_t + V_t − V_{t−1} = 0 + R − 0 = R
- The reward prediction spreads backward until convergence
- The states preceding the earliest reward-predicting state are not reliable predictors
• 20. Temporal Difference / Temporal Difference in the Brain – Correspondence III
- The firing rate of dopamine neurons decreases below baseline if a reward does not occur at its expected time
  - When monkeys pull the wrong lever, they receive no juice
  - They somehow keep track of time internally
    δ_{t−1} = R_t + V_t − V_{t−1} = 0 + 0 − R = −R
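The three correspondences can be reproduced with a few lines of tabular TD(0) without discounting. The chain length, step size, and trial counts below are illustrative choices; the unpredictive pre-cue baseline (value fixed at 0) stands for the unreliable states preceding the earliest predictor:

```python
n_states = 5                  # state 0 = earliest predictor; reward follows state 4
alpha, R = 0.1, 1.0
V = [0.0] * n_states

def run_trial(reward=R):
    """One trial; returns the TD errors delta_{t-1} = R_t + V_t - V_{t-1}."""
    # cue onset: transition from the unpredictive baseline (value 0) into state 0
    deltas = [0.0 + V[0] - 0.0]
    for t in range(n_states):
        v_next = V[t + 1] if t + 1 < n_states else 0.0   # value is 0 after the trial
        r = reward if t == n_states - 1 else 0.0         # reward on the final transition
        delta = r + v_next - V[t]
        V[t] += alpha * delta
        deltas.append(delta)
    return deltas

print(run_trial())             # early training: error (burst) only at the reward (I)
for _ in range(5000):
    run_trial()
print(run_trial())             # after learning: the burst has moved to cue onset (II)
print(run_trial(reward=0.0))   # omitted reward: negative error, a dip below baseline (III)
```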
• 21. Temporal Difference / Temporal Difference in the Brain – Further Readings
- Actor-critic in the brain
  - Dopamine mainly targets two parts of the striatum
  - Its effects depend on the properties of the target
  - It is hypothesized that the dorsal striatum acts as an actor while the ventral striatum acts as a critic (a toy sketch follows)
- Other parallels
  - More parallels and topics are introduced in [Sutton and Barto, 2018], e.g., eligibility traces and hedonistic neurons
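To make the actor/critic split concrete, here is a toy sketch in which a single TD error (the dopamine-like signal) trains both a critic and an actor. The two-armed bandit task, step sizes, and payoff probabilities are illustrative assumptions; the striatum mapping is only the metaphor from the slide, not a biological model:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_v, alpha_pi = 0.1, 0.1
V = 0.0                      # critic: value estimate ("ventral striatum")
prefs = np.zeros(2)          # actor: action preferences ("dorsal striatum")

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

for _ in range(2000):
    probs = softmax(prefs)
    a = rng.choice(2, p=probs)
    r = float(rng.random() < (0.8 if a == 0 else 0.2))  # arm 0 pays off more often
    delta = r - V                     # TD error (episode ends after one step), shared signal
    V += alpha_v * delta              # critic update
    grad = -probs                     # gradient of log pi(a) w.r.t. preferences...
    grad[a] += 1.0                    # ...for the chosen action
    prefs += alpha_pi * delta * grad  # actor update, scaled by the same TD error

print(softmax(prefs))  # the actor comes to prefer the better arm
```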
• 22. Agent navigation Inspired by Neuroscience – Next in Line...
1 Introduction
2 Temporal Difference
    Computational Temporal Difference
    Temporal Difference in the Brain
    Classical Conditioning
    TD Model
    The Reward Prediction Error Hypothesis
    TD Error / Dopamine Correspondences
3 Agent navigation Inspired by Neuroscience
    Navigation using grid-like representation
    The Network architecture
    Navigation Experiments
4 Conclusions and Suggestions
• 23. Agent navigation Inspired by Neuroscience / Navigation using grid-like representation – Navigation Using Grid-Like Representations I
- Entorhinal cortex basics [Rowland et al., 2016] [Banino et al., 2018]
  - The entorhinal cortex creates a neural representation of space through functionally dedicated cell types whose firing rates depend on the animal's position
  - Grid cells respond to the animal's location in the environment; there are also border cells, speed cells, and head-direction cells
  - Grid cells have hexagonally arranged firing fields that tile the surface of the environment
  - A population of grid cells provides a unique representation (code) of locations
- Grid cells perform path integration, taking inputs from speed cells and head-direction cells, with place cells as the result (a sketch of path integration follows)
  - Path integration is the ability to self-localize based on self-motion
- Grid cells are critical for vector-based navigation, as they provide a Euclidean spatial metric supporting the calculation of goal-directed vectors
  - Vector-based navigation is the process of following direct routes to a remembered goal
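A minimal sketch of path integration as dead reckoning: the position estimate is updated from self-motion signals (speed and head direction) alone, with no visual input. The step count and motion statistics are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
pos, heading = np.zeros(2), 0.0

for _ in range(100):
    heading += rng.normal(0.0, 0.2)        # angular velocity (head-direction-cell input)
    speed = abs(rng.normal(0.1, 0.05))     # linear speed (speed-cell input)
    pos += speed * np.array([np.cos(heading), np.sin(heading)])

print(pos)  # self-localized position; errors accumulate without landmark corrections
```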
• 24. Agent navigation Inspired by Neuroscience / Navigation using grid-like representation – Navigation Using Grid-Like Representations II
- [Banino et al., 2018] developed a deep RL agent with mammal-like navigation abilities
  - They trained a recurrent network to perform path integration, leading to the emergence of representations resembling entorhinal cells
  - The network was then incorporated in a deep RL architecture to perform navigation
- Results
  - Grid-cell representations endow agents with the ability to perform proficient vector-based navigation
  - The emergent representation provides the Euclidean spatial metric needed to calculate goal-directed vectors and the relative position of two points, by examining the difference between the current vector code and the code of a remembered goal
  - It allows agents to locate goals in challenging, unfamiliar, and changeable environments
- The results provide strong empirical support to
  - theories that see grid cells as critical in vector-based navigation (support that was previously missing)
  - the grid cells' role in providing a location code updated by self-motion cues
- The performance of the agent surpassed comparison agents and an expert human
- The agents exhibited shortcut behaviors like those performed by mammals
• 25. Agent navigation Inspired by Neuroscience / The Network architecture – Neural Network Performing Path Integration
- The network is a long short-term memory (LSTM). As happens in the brain:
  - It must update its estimate of location and head direction (by predicting place-cell and head-direction-cell activations)
  - It takes as input translational and angular velocities (with perturbation) and a visual input
- It is trained with simulated place-cell and head-direction-cell activations along trajectories modeled on those of foraging rodents
  - This form of supervision is also present in rodent pups, where place and head-direction cells guide the development of grid cells
- The visual input is processed by a CNN
  - This mimics the correction performed by place cells based on environmental cues
  - It generates place-cell and head-direction-cell activations
  - Its output is silenced 95% of the time (to mimic the imperfect observations of behaving animals)
• 26. Agent navigation Inspired by Neuroscience / The Network architecture – Grid-Like Representations
- The last layer is a linear layer with dropout
- Individual units in the linear layer developed stable spatial activity profiles similar to those of neurons within the entorhinal cortex
- Grid-like representations did not emerge without dropout regularization
  - Dropout is also present in the brain – it is an inductive bias [Hassabis et al., 2017]
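A rough PyTorch sketch of this path-integration network (velocity-input LSTM, linear bottleneck with dropout where grid-like units emerge, and place/head-direction readouts) may help. This is a hypothetical reconstruction under stated assumptions, not the authors' code: the layer sizes, the 3-dimensional velocity input, and the class name GridNetwork are my choices:

```python
import torch
import torch.nn as nn

class GridNetwork(nn.Module):
    """Sketch of a supervised path-integration network; all sizes are assumptions."""
    def __init__(self, n_place=256, n_hd=12, hidden=128, bottleneck=512, dropout=0.5):
        super().__init__()
        # input: (speed, sin and cos of angular velocity) at each time step
        self.lstm = nn.LSTM(input_size=3, hidden_size=hidden, batch_first=True)
        self.bottleneck = nn.Sequential(       # linear layer with dropout:
            nn.Linear(hidden, bottleneck),     # grid-like units would emerge here
            nn.Dropout(dropout),
        )
        self.place_head = nn.Linear(bottleneck, n_place)  # predicted place-cell logits
        self.hd_head = nn.Linear(bottleneck, n_hd)        # predicted head-direction logits

    def forward(self, velocities):
        h, _ = self.lstm(velocities)           # velocities: (batch, time, 3)
        g = self.bottleneck(h)                 # the "grid code"
        return self.place_head(g), self.hd_head(g), g

# usage sketch: the heads would be trained against simulated place / head-direction
# cell activations (e.g., with cross-entropy) along rodent-like trajectories
net = GridNetwork()
place_logits, hd_logits, grid_code = net(torch.randn(8, 100, 3))
```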
• 27. Agent navigation Inspired by Neuroscience / The Network architecture – Neural Network Performing Vector-Based Navigation
- Another LSTM controls the agent; it takes as input:
  - the current grid code (a grid code is the activity of the linear layer of the previous network)
  - the goal grid code, after the goal is reached the first time (one-shot learning)
  - a preprocessed visual cue, the last action, and the current reward
- The first output is a discrete action (the actor)
- The second output is the value estimate (the critic)
- A rough sketch of this policy network follows
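As with the previous sketch, this is a hedged PyTorch reconstruction of the actor-critic policy network described above, not the paper's implementation; the hidden size, input dimensions, number of discrete actions, and the class name NavigationAgent are assumptions:

```python
import torch
import torch.nn as nn

class NavigationAgent(nn.Module):
    """Sketch of an actor-critic policy LSTM; all sizes are assumptions."""
    def __init__(self, grid_dim=512, vision_dim=64, n_actions=6, hidden=256):
        super().__init__()
        # inputs: current grid code, goal grid code, visual features, last action, reward
        in_dim = grid_dim + grid_dim + vision_dim + n_actions + 1
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.actor = nn.Linear(hidden, n_actions)   # logits over discrete actions
        self.critic = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, grid_code, goal_code, vision, last_action, reward, state=None):
        x = torch.cat([grid_code, goal_code, vision, last_action, reward], dim=-1)
        h, state = self.lstm(x, state)
        return self.actor(h), self.critic(h), state

# usage sketch (batch of 1, single time step, zeroed inputs)
agent = NavigationAgent()
t = lambda *shape: torch.zeros(1, 1, *shape)
logits, value, rnn_state = agent(t(512), t(512), t(64), t(6), t(1))
```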
• 28. Agent navigation Inspired by Neuroscience / Navigation Experiments – Navigation Experiments
- A goal grid code provides sufficient information to navigate to an arbitrary location
  - The authors substituted the goal grid code with a 'fake' one sampled randomly
  - The agent followed a direct path to the newly specified location and circled around the absent goal – like rodents in probe trials of the Morris water maze
- Grid cells are crucial: silencing most of the grid-like units (simulating a targeted lesion), rather than other units, has a dramatic effect on performance
- Only the grid-cell agent was able to exploit shortcuts
- Experimental setting
  - At the beginning of an episode, the agent explores to find an unmarked goal; when the agent reaches the goal, it is teleported to a new random location, and it then exploits the goal code until the episode ends (after a fixed number of steps)
  - The mazes' layout, textures, landmarks, and goal change at each episode
  - The state of a door changes randomly during an episode, at every new run
• 29. Conclusions and Suggestions – Next in Line...
1 Introduction
2 Temporal Difference
    Computational Temporal Difference
    Temporal Difference in the Brain
    Classical Conditioning
    TD Model
    The Reward Prediction Error Hypothesis
    TD Error / Dopamine Correspondences
3 Agent navigation Inspired by Neuroscience
    Navigation using grid-like representation
    The Network architecture
    Navigation Experiments
4 Conclusions and Suggestions
• 30. Conclusions and Suggestions – Deepening I
- Takeaways
  - Classical conditioning was essential to formulate the core RL rule
  - Vice versa, RL was crucial to determine the functioning of the brain's reward system
  - Neuroscience continues to inspire novel and more powerful algorithms, like the navigation one
- More on RL and Neuroscience
  - Meta-RL [Hassabis et al., 2017]: RL optimizes an RNN, which gives rise to a second RL algorithm, faster than the original; it could be inspired by the recurrent activity of the prefrontal cortex
  - Study of addiction [Sutton and Barto, 2018]: RL theory could be used to understand the neural basis of drug abuse
• 31. Conclusions and Suggestions – Deepening II
- Where to start (Neuroscience for AI)
  - Building machines that learn and think like people [Lake et al., 2017]
    - Historical recall of neural inspiration
    - Proposes cognitive challenges for agents
    - Defines the essential ingredients for building human-like intelligence
  - Neuroscience-inspired artificial intelligence [Hassabis et al., 2017]
    - Highlights the importance of the human brain as an inspiration to AI
    - Analyzes past and current influences of neuroscience on ML techniques
    - Underlines key areas to bridge the gap between machine and human-level intelligence
  - Using neuroscience to develop artificial intelligence [Ullman, 2019]
    - Ullman calls into question current, highly reductionist approaches
    - We should use knowledge about biological neurons – their structure, type, and connectivity – to guide the building of brain-like network models
    - Intelligence likely lies in both experience and preexisting structures (inductive biases)
- Where to start (AI for Neuroscience)
  - A deep learning framework for neuroscience [Richards et al., 2019]
    - Neuroscientists need an approach to deal with large experimental datasets
    - The three components of an ANN – (i) objective functions, (ii) learning rules, and (iii) architectures – could be used to produce compact (tractable) brain models
• 32. Conclusions and Suggestions – Deepening III
- Practical insights
  - Prior knowledge and inbuilt capacities (inductive biases) are crucial for fast learning and making inferences [Ullman, 2019][Richards et al., 2019]
    - Which neuron features – type, connectivity, structure – could be used to improve ANNs?
  - A recent report claims that grid cells could be critical also for abstract reasoning and concept representation [Constantinescu et al., 2016]
    - Could ANNs featuring grid-like regularities be used to process abstract concepts?
• 33. Conclusions and Suggestions – References I
Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P., Pritzel, A., Chadwick, M. J., Degris, T., Modayil, J., Wayne, G., Soyer, H., Viola, F., Zhang, B., Goroshin, R., Rabinowitz, N., Pascanu, R., Beattie, C., Petersen, S., Sadik, A., Gaffney, S., King, H., Kavukcuoglu, K., Hassabis, D., Hadsell, R., and Kumaran, D. (2018). Vector-based navigation using grid-like representations in artificial agents. Nature, 557(7705):429–433.
Constantinescu, A. O., O'Reilly, J. X., and Behrens, T. E. J. (2016). Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292):1464–1468.
Ford, M. (2018). Architects of Intelligence: The Truth About AI from the People Building It. Packt Publishing Ltd.
Hassabis, D., Kumaran, D., Summerfield, C., and Botvinick, M. (2017). Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258.
• 34. Conclusions and Suggestions – References II
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253.
Montague, P., Dayan, P., and Sejnowski, T. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience, 16:1936–1947.
Pillow, J. and Sahani, M. (2019). Editorial overview: Machine learning, big data, and neuroscience. Current Opinion in Neurobiology, 55:iii–iv.
• 35. Conclusions and Suggestions – References III
Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker, A., Ganguli, S., Gillon, C. J., Hafner, D., Kepecs, A., Kriegeskorte, N., Latham, P., Lindsay, G. W., Miller, K. D., Naud, R., Pack, C. C., Poirazi, P., Roelfsema, P., Sacramento, J., Saxe, A., Scellier, B., Schapiro, A. C., Senn, W., Wayne, G., Yamins, D., Zenke, F., Zylberberg, J., Therien, D., and Kording, K. P. (2019). A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1761–1770.
Rowland, D. C., Roudi, Y., Moser, M.-B., and Moser, E. I. (2016). Ten years of grid cells. Annual Review of Neuroscience, 39(1):19–40.
Sutton, R. and Barto, A. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88:135–170.
• 36. Conclusions and Suggestions – References IV
Sutton, R. S. and Barto, A. G. (1987). A temporal-difference model of classical conditioning. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, pages 355–378, Seattle, WA.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. The MIT Press.
Ullman, S. (2019). Using neuroscience to develop artificial intelligence. Science, 363(6428):692–693.
• 37. Reinforcement Learning and Neuroscience
Michael Bosello – Università di Bologna, Department of Computer Science and Engineering, Cesena, Italy