JAIST Summer School 2016 "Theories for Understanding the Brain" Lecture 04: Neural Networks and Neuroscience
1. SS2016 Modern Neural Computation
Lecture 5: Neural Networks and Neuroscience
Hirokazu Tanaka
School of Information Science
Japan Advanced Institute of Science and Technology
2. Supervised learning as functional approximation.
In this lecture we will learn:
• Single-layer neural networks
Perceptron and the perceptron theorem.
Cerebellum as a perceptron.
• Multi-layer feedforward neural networks
Universal functional approximation, back-propagation algorithms
• Recurrent neural networks
Back-propagation-through-time (BPTT) algorithms
• Tempotron
Spike-based perceptron
3. Gradient-descent learning for optimization.
4. Cost function: classification and regression.
• Classification problem: to output discrete labels.
For a binary classification (i.e., 0 or 1), a cross-entropy is often used:
$$-\log \prod_{i:\,\text{samples}} \hat{y}_i^{\,y_i}\,(1-\hat{y}_i)^{1-y_i} = -\sum_{i:\,\text{samples}} \left[\, y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i) \,\right]$$
• Regression problem: to output continuous values.
Sum of squared errors is often used:
$$\sum_{i:\,\text{samples}} \left(\hat{y}_i - y_i\right)^2$$
Here $\hat{y}_i$ is the output of the network and $y_i$ is the desired output.
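The two cost functions can be evaluated in a few lines. The following is a minimal Python sketch, not part of the lecture: the function names and the sample values are illustrative assumptions. It also checks numerically that the product form and the sum form of the cross-entropy agree.

```python
import math


def cross_entropy(y_hat, y):
    """Binary cross-entropy: -sum_i [y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i)]."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(y_hat, y))


def sum_squared_error(y_hat, y):
    """Sum of squared errors: sum_i (y_hat_i - y_i)^2."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y))


# Illustrative values (assumed): desired labels and network outputs.
y = [1, 0, 1]
y_hat = [0.9, 0.2, 0.7]

# Product form of the likelihood; its negative log equals the cross-entropy.
likelihood = math.prod(p ** t * (1 - p) ** (1 - t) for p, t in zip(y_hat, y))

print(cross_entropy(y_hat, y), -math.log(likelihood))
print(sum_squared_error(y_hat, y))
```

The cross-entropy requires outputs strictly between 0 and 1, which is why it is usually paired with a sigmoid output unit.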
5. Perceptron: single-layer neural network.
• Assume a single-layer neural network with an input layer
composed of N units and an output layer composed of
one unit.
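• Input units are specified by
$$\mathbf{x} = (x_1 \; \cdots \; x_N)^{\mathsf T}$$
and the output unit is determined by
$$y = f\Big(w_0 + \sum_{i=1}^{N} w_i x_i\Big) = f\big(w_0 + \mathbf{w}^{\mathsf T}\mathbf{x}\big), \qquad f(u) = \begin{cases} 1 & \text{if } u \ge 0 \\ 0 & \text{if } u < 0 \end{cases}$$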
7. Perceptron: single-layer neural network.
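• [Remark] Instead of using
$$\mathbf{x} = (x_1 \; \cdots \; x_N)^{\mathsf T}$$
an augmented input vector and an augmented weight vector
$$\mathbf{x} = (1 \;\; x_1 \; \cdots \; x_N)^{\mathsf T}, \qquad \mathbf{w} = (w_0 \;\; w_1 \; \cdots \; w_N)^{\mathsf T}$$
are often used. Then,
$$y = f\big(w_0 + \mathbf{w}^{\mathsf T}\mathbf{x}\big) = f\big(\mathbf{w}^{\mathsf T}\mathbf{x}\big)$$
where the vectors in the last expression are the augmented ones.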
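The equivalence of the two forms can be checked directly. This is a small Python sketch under assumed, hand-picked weights (a logical AND of two binary inputs); the function names are illustrative only.

```python
def step(u):
    """Threshold activation: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0


def perceptron(w0, w, x):
    """Output with an explicit bias term w0."""
    return step(w0 + sum(wi * xi for wi, xi in zip(w, x)))


def perceptron_augmented(w_aug, x):
    """Same output, with the bias folded into the weight vector."""
    x_aug = [1.0] + list(x)  # prepend the constant input 1
    return step(sum(wi * xi for wi, xi in zip(w_aug, x_aug)))


# Hypothetical weights that implement a logical AND of two binary inputs.
w0, w = -1.5, [1.0, 1.0]
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
plain = [perceptron(w0, w, x) for x in inputs]
augmented = [perceptron_augmented([w0] + w, x) for x in inputs]
print(plain, augmented)
```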
8. Perceptron Learning Algorithm.
• Given a training set:
$$\{(\mathbf{x}_1, d_1), (\mathbf{x}_2, d_2), \ldots, (\mathbf{x}_P, d_P)\}$$
• Perceptron learning rule:
$$\Delta \mathbf{w} = \eta\,(d_i - y_i)\,\mathbf{x}_i$$
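% Assumed inputs: X is N-by-P (one pattern per column), d is P-by-1 with
% targets in {-1,+1} (sign is used here instead of the 0/1 step function),
% and w is an N-by-1 initial weight vector of unit norm.
err = Inf; count = 0;
while err > 1e-4 && count < 10
    y = sign(w'*X)';                 % current outputs, P-by-1
    wnew = w + X*(d - y)/P;          % batch perceptron update, learning rate 1/P
    wnew = wnew/norm(wnew);          % keep the weight vector at unit length
    count = count + 1;
    err = norm(w - wnew)/norm(w);    % relative change of the weights
    w = wnew;
end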
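For comparison, the learning rule can also be applied one pattern at a time. The Python sketch below is an illustrative variant, not the lecture's code: it assumes 0/1 targets with the step activation, augmented inputs, zero initial weights, and a hypothetical toy problem (logical AND).

```python
def step(u):
    """Threshold activation: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0


def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Online perceptron learning on augmented inputs.

    samples: list of (x, d) pairs with x a tuple of inputs and d in {0, 1}.
    Returns (weights, converged); weights[0] is the bias w_0.
    """
    n_inputs = len(samples[0][0])
    w = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        n_errors = 0
        for x, d in samples:
            x_aug = (1.0,) + tuple(x)
            y = step(sum(wi * xi for wi, xi in zip(w, x_aug)))
            if y != d:
                n_errors += 1
                # Learning rule: delta w = eta * (d - y) * x
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x_aug)]
        if n_errors == 0:  # a full pass without mistakes: stop
            return w, True
    return w, False


def predict(w, x):
    return step(sum(wi * xi for wi, xi in zip(w, (1.0,) + tuple(x))))


# Toy training set (assumed): logical AND, which is linearly separable.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and, converged = train_perceptron(and_samples)
predictions = [predict(w_and, x) for x, _ in and_samples]
print(converged, predictions)
```

Because the update is zero whenever a pattern is classified correctly, the weights stop changing once the whole training set is learned.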
11. Perceptron’s capacity: Cover’s Counting Theorem.
• Question: Suppose that there are P vectors in N-dimensional Euclidean space:
$$\{\mathbf{x}_1, \ldots, \mathbf{x}_P\}, \qquad \mathbf{x}_i \in \mathbb{R}^N$$
There are $2^P$ possible patterns of two classes. How many of them are linearly separable?
[Remark] They are assumed to be in general position.
• Answer: Cover’s Counting Theorem.
$$C(P, N) = 2\sum_{k=0}^{N-1} \binom{P-1}{k}$$
12. Perceptron’s capacity: Cover’s Counting Theorem.
• Cover’s Counting Theorem.
$$C(P, N) = 2\sum_{k=0}^{N-1} \binom{P-1}{k}$$
• Case $P \le N$: $C(P, N) = 2^P$
• Case $P = 2N$: $C(P, N) = 2^{P-1}$
• Case $P \gg N$: $C(P, N) \approx A P^N$
Cover (1965) IEEE Information; Sompolinsky (2013) MIT lecture note
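The counting formula is easy to evaluate exactly with integer arithmetic. The following Python sketch (the function name and the chosen values of P and N are illustrative assumptions) prints the number of linearly separable dichotomies and the fraction of all $2^P$ dichotomies that they represent.

```python
from math import comb


def cover_count(P, N):
    """C(P, N) = 2 * sum_{k=0}^{N-1} binom(P-1, k)."""
    return 2 * sum(comb(P - 1, k) for k in range(N))


N = 10
for P in (5, 10, 20, 40):
    count = cover_count(P, N)
    fraction = count / 2 ** P
    print(P, count, fraction)
```

With N = 10, every dichotomy is separable for P = 5 and P = 10, exactly half are separable at P = 2N = 20, and the fraction drops quickly beyond that point.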
13. Perceptron’s capacity: Cover’s Counting Theorem.
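• Case for large P:
$$\frac{C(P, N)}{2^P} \approx \frac{1}{2}\left[\,1 + \operatorname{erf}\!\left(\sqrt{\frac{P}{2}}\left(\frac{2N}{P} - 1\right)\right)\right]$$
Orhan (2014) “Cover’s Function Counting Theorem”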
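The quality of the large-P approximation can be checked against the exact count. This is a rough Python sketch with illustrative function names and arbitrarily chosen values of P and N; it only compares the two expressions numerically.

```python
from math import comb, erf, sqrt


def exact_fraction(P, N):
    """Exact fraction of linearly separable dichotomies, C(P, N) / 2^P."""
    return 2 * sum(comb(P - 1, k) for k in range(N)) / 2 ** P


def approx_fraction(P, N):
    """Large-P approximation based on the error function."""
    return 0.5 * (1.0 + erf(sqrt(P / 2.0) * (2.0 * N / P - 1.0)))


N = 100
for P in (150, 180, 200, 220, 250):
    print(P, exact_fraction(P, N), approx_fraction(P, N))
```

Both expressions equal 1/2 at P = 2N, and the transition from "almost all separable" to "almost none separable" becomes sharper as N grows.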
14. Cerebellum as a Perceptron.
Llinas (1974) Scientific American
15. Cerebellum as a Perceptron.
• Cerebellar cortex has a feedforward structure:
mossy fibers -> granule cells -> parallel fibers -> Purkinje cells
Ito (1984) “Cerebellum and Neural Control”
16. Cerebellum as a Perceptron (or its extensions)
• Perceptron model
Marr (1969): Long-term potentiation (LTP) learning.
Albus (1971): Long-term depression (LTD) learning.
• Adaptive filter theory
Fujita (1982): Reverberation among granule and Golgi
cells for generating temporal templates.
• Liquid-state machine model
Yamazaki and Tanaka (2007):
17. Perceptron: a new perspective.
• Evaluation of memory capacity of a Purkinje cell using
perceptron methods (the Gardner limit).
Brunel, N., Hakim, V., Isope, P., Nadal, J. P., & Barbour, B. (2004). Optimal
information storage and the distribution of synaptic weights: perceptron versus
Purkinje cell. Neuron, 43(5), 745-757.
• Estimation of dimensions of neural representations
during visual memory task in the prefrontal cortex using
perceptron methods (Cover’s counting theorem).
Rigotti, M., Barak, O., Warden, M. R., Wang, X. J., Daw, N. D., Miller, E. K., & Fusi,
S. (2013). The importance of mixed selectivity in complex cognitive tasks.
Nature, 497(7451), 585-590.
18. Limitation of Perceptron.
• Only linearly separable input-output sets can be learned.
• Sets that are not linearly separable, even one as simple as XOR, CANNOT be learned.
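The XOR limitation can be illustrated by brute force. The Python sketch below searches a coarse grid of weights (the grid and the function names are assumptions made for illustration) for a single threshold unit that reproduces a given truth table. A grid search is not a proof, but the result agrees with the algebraic argument: XOR would require $w_0 < 0$, $w_0 + w_1 \ge 0$, $w_0 + w_2 \ge 0$ and $w_0 + w_1 + w_2 < 0$ simultaneously, which is impossible.

```python
from itertools import product


def step(u):
    """Threshold activation: 1 if u >= 0, else 0."""
    return 1 if u >= 0 else 0


def find_weights(targets, grid):
    """Return (w0, w1, w2) on the grid that reproduces the truth table, or None."""
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    for w0, w1, w2 in product(grid, repeat=3):
        outputs = [step(w0 + w1 * x1 + w2 * x2) for x1, x2 in inputs]
        if outputs == targets:
            return (w0, w1, w2)
    return None


grid = [k / 2 for k in range(-8, 9)]  # -4.0, -3.5, ..., 4.0

and_solution = find_weights([0, 0, 0, 1], grid)  # linearly separable
xor_solution = find_weights([0, 1, 1, 0], grid)  # not linearly separable
print(and_solution, xor_solution)
```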
19. Multilayer neural network: feedforward design
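[Figure: a feedforward network with Layer 1, ..., Layer n-1, Layer n, ..., Layer N; unit activities $x_j^{(n-1)}$ in layer n-1 connect to $x_i^{(n)}$ in layer n through weights $w_{ij}^{(n-1)}$]
• Feedforward network: a unit in layer n receives inputs from layer n-1 and projects to layer n+1.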
21. Multilayer neural network: forward propagation.
A feedforward multilayer neural network propagates its activities from one layer to the next in one direction:
$$x_i^{(n)} = f\big(u_i^{(n)}\big) = f\Big(\sum_{j=1} w_{ij}^{(n-1)} x_j^{(n-1)}\Big)$$
Inputs to neurons in layer n are a weighted summation of activities of neurons in layer n-1:
$$u_i^{(n)} = \sum_{j=1} w_{ij}^{(n-1)} x_j^{(n-1)}$$
The function f is called an activation function, and its derivative is easy to compute:
$$f(u) = \frac{1}{1+e^{-u}}, \qquad f'(u) = \frac{e^{-u}}{(1+e^{-u})^2} = \frac{1}{1+e^{-u}}\left(1 - \frac{1}{1+e^{-u}}\right) = f(u)\big(1 - f(u)\big)$$
[Figure: units $x_j^{(n-1)}$ in layer n-1 connect to unit $x_i^{(n)}$ in layer n through weights $w_{ij}^{(n-1)}$]
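One layer of forward propagation, and the derivative identity for the sigmoid, can be sketched as follows. This Python fragment is illustrative only: the weight matrix and input values are made up, and the derivative is compared with a finite-difference estimate.

```python
import math


def sigmoid(u):
    """Logistic activation f(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))


def sigmoid_prime(u):
    """Derivative written in terms of the activation: f'(u) = f(u) (1 - f(u))."""
    s = sigmoid(u)
    return s * (1.0 - s)


def forward_layer(W, x_prev):
    """x_i = f(sum_j W[i][j] * x_prev[j]) for every unit i of the next layer."""
    u = [sum(w_ij * x_j for w_ij, x_j in zip(row, x_prev)) for row in W]
    return [sigmoid(u_i) for u_i in u]


# Made-up weights: 3 units in layer n, 2 units in layer n-1.
W = [[0.5, -1.0],
     [1.5, 2.0],
     [0.0, 0.0]]
x_prev = [1.0, 0.5]
x = forward_layer(W, x_prev)
print(x)

# Finite-difference check of the derivative at an arbitrary point.
u0, h = 0.3, 1e-6
numeric = (sigmoid(u0 + h) - sigmoid(u0 - h)) / (2 * h)
print(numeric, sigmoid_prime(u0))
```

Since the derivative depends only on the activation itself, the backward pass can reuse the activations stored during the forward pass.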
22. Multilayer neural network: error backpropagation
• Define a cost function E as a sum of squared errors in output units:
$$E = \frac{1}{2}\sum_i \big(x_i^{(N)} - z_i\big)^2 = \frac{1}{2}\sum_i \big(\Delta_i^{(N)}\big)^2$$
The neurons in the output layer have explicit supervised errors (the difference between the network outputs and the desired outputs). How, then, to compute the supervising signals for neurons in intermediate layers?
Gradients of the cost function with respect to the weights require the errors of every layer, which are obtained by passing the errors of layer n back to layer n-1:
$$\Delta_i^{(n-1)} = \sum_j \Delta_j^{(n)}\, x_j^{(n)} \big(1 - x_j^{(n)}\big)\, w_{ji}^{(n-1)}$$
[Figure: errors $\Delta^{(n)}$ in layer n are passed back to errors $\Delta^{(n-1)}$ in layer n-1]
23. Multilayer neural network: error backpropagation
1. Compute activations of units in all layers, $\{x_i^{(1)}\}, \ldots, \{x_i^{(n)}\}, \ldots, \{x_i^{(N)}\}$.
2. Compute errors in the output units, $\{\Delta_i^{(N)}\}$.
3. “Back-propagate” the errors to lower layers using
$$\Delta_i^{(n-1)} = \sum_j \Delta_j^{(n)}\, x_j^{(n)} \big(1 - x_j^{(n)}\big)\, w_{ji}^{(n-1)}$$
4. Update the weights
$$\Delta w_{ij}^{(n)} = -\eta\, \Delta_i^{(n+1)}\, x_i^{(n+1)} \big(1 - x_i^{(n+1)}\big)\, x_j^{(n)}$$
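The four steps can be sketched for a small network as follows. This is a minimal Python illustration, assuming sigmoid units without bias terms, a made-up 2-3-1 architecture, random initial weights, and a single training pattern; none of these choices come from the lecture. The analytic gradients are compared with finite differences as a sanity check.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward(weights, x_in):
    """Step 1: activations of every layer, starting with the input layer."""
    layers = [list(x_in)]
    for W in weights:
        prev = layers[-1]
        layers.append([sigmoid(sum(w * xj for w, xj in zip(row, prev))) for row in W])
    return layers


def cost(weights, x_in, z):
    out = forward(weights, x_in)[-1]
    return 0.5 * sum((xi - zi) ** 2 for xi, zi in zip(out, z))


def backprop(weights, x_in, z):
    """Return dE/dw for every weight; weights[n][i][j] links unit j to unit i above."""
    layers = forward(weights, x_in)
    delta = [xi - zi for xi, zi in zip(layers[-1], z)]       # step 2: output errors
    grads = [None] * len(weights)
    for n in reversed(range(len(weights))):
        upper, lower = layers[n + 1], layers[n]
        local = [d * x * (1.0 - x) for d, x in zip(delta, upper)]
        grads[n] = [[l_i * x_j for x_j in lower] for l_i in local]
        # Step 3: pass the errors back to the layer below.
        delta = [sum(local[i] * weights[n][i][j] for i in range(len(upper)))
                 for j in range(len(lower))]
    return grads


def gradient_step(weights, grads, eta):
    """Step 4: delta w = -eta * dE/dw."""
    return [[[w - eta * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, G)] for W, G in zip(weights, grads)]


def numerical_gradient(weights, x_in, z, n, i, j, h=1e-6):
    plus = [[row[:] for row in W] for W in weights]
    minus = [[row[:] for row in W] for W in weights]
    plus[n][i][j] += h
    minus[n][i][j] -= h
    return (cost(plus, x_in, z) - cost(minus, x_in, z)) / (2 * h)


random.seed(0)
sizes = [2, 3, 1]  # made-up architecture
weights = [[[random.uniform(-1, 1) for _ in range(sizes[n])]
            for _ in range(sizes[n + 1])] for n in range(len(sizes) - 1)]
x_in, z = [0.2, 0.9], [1.0]

grads = backprop(weights, x_in, z)
print(grads[0][1][0], numerical_gradient(weights, x_in, z, 0, 1, 0))
new_weights = gradient_step(weights, grads, eta=0.1)
print(cost(weights, x_in, z), cost(new_weights, x_in, z))
```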
24. Multilayer neural network as universal machine for
functional approximation.
A multilayer neural network is in principle able to approximate any continuous functional relationship between inputs and outputs to any desired accuracy (Funahashi, 1989).
Intuition: A sum or a difference of two sigmoid functions is a “bump-like” function, and a sufficiently large number of bump functions can approximate any function.
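The bump intuition can be tried numerically. In the Python sketch below, the bump width, the steepness, the number of bumps, and the target function $\sin(2\pi x)$ are all arbitrary choices made for illustration; the construction is a crude piecewise-constant approximation, not an optimized network.

```python
import math


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def bump(x, left, right, steepness):
    """Difference of two shifted sigmoids: near 1 on (left, right), near 0 elsewhere."""
    return sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right))


def bump_approximation(target, x, n_bumps=20, steepness=400.0):
    """Approximate target on [0, 1] by a weighted sum of n_bumps adjacent bumps."""
    width = 1.0 / n_bumps
    total = 0.0
    for k in range(n_bumps):
        left, right = k * width, (k + 1) * width
        center = 0.5 * (left + right)
        total += target(center) * bump(x, left, right, steepness)
    return total


def target(x):
    return math.sin(2.0 * math.pi * x)


print(bump(0.5, 0.3, 0.7, 50.0), bump(0.0, 0.3, 0.7, 50.0))

grid = [i / 100 for i in range(101)]
max_error = max(abs(bump_approximation(target, x) - target(x)) for x in grid)
print(max_error)
```

Using more, narrower bumps reduces the error, at the cost of more hidden units.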
25. NETtalk: A parallel network that learns to read aloud.
Sejnowski & Rosenberg (1987) Complex Systems
A feedforward three-layer neural network with delay lines.
26. NETtalk: A parallel network that learns to read aloud.
Sejnowski & Rosenberg (1987) Complex Systems; https://www.youtube.com/watch?v=gakJlr3GecE
A feedforward three-layer neural network with delay lines.
27. NETtalk: A parallel network that learns to read aloud.
Sejnowski & Rosenberg (1987) Complex Systems
Activations of hidden units for the same sound but different inputs
28. Hinton diagrams: characterizing and visualizing
connections to and from hidden units.
Hinton (1992) Sci Am
29. Autonomous driving learning by backpropagation.
Pomerleau (1991) Neural Comput
31. Gradient vanishing problem: why is training a multi-layer
neural network so difficult?
Hochreiter et al. (1991)
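• In practice, the back-propagation algorithm works well only for neural networks of three or four layers.
• Training neural networks with many hidden layers, called “deep neural networks”, is notoriously difficult.
$$\Delta_j^{(N-1)} = \sum_i \Delta_i^{(N)}\, x_i^{(N)} \big(1 - x_i^{(N)}\big)\, w_{ij}^{(N-1)}$$
$$\Delta_k^{(N-2)} = \sum_j \Delta_j^{(N-1)}\, x_j^{(N-1)} \big(1 - x_j^{(N-1)}\big)\, w_{jk}^{(N-2)} = \sum_j \sum_i \Delta_i^{(N)}\, x_i^{(N)} \big(1 - x_i^{(N)}\big)\, w_{ij}^{(N-1)}\, x_j^{(N-1)} \big(1 - x_j^{(N-1)}\big)\, w_{jk}^{(N-2)}$$
$$\Delta^{(n)} \sim x^{(n+1)}\big(1 - x^{(n+1)}\big) \times \cdots \times x^{(N-1)}\big(1 - x^{(N-1)}\big) \times x^{(N)}\big(1 - x^{(N)}\big) \times \Delta^{(N)}$$
Each factor $x(1-x)$ is at most 1/4, so unless the weights are large the back-propagated error shrinks roughly geometrically with the number of layers it passes through.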
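The size of the product of $x(1-x)$ factors can be computed directly. The Python sketch below ignores the weights entirely (a simplifying assumption) and uses made-up activation values; it only shows how quickly the scale factor decays with depth.

```python
def gradient_scale(activations):
    """Product of x * (1 - x) over the layers that the error passes through."""
    scale = 1.0
    for x in activations:
        scale *= x * (1.0 - x)
    return scale


# Best case: every unit sits at x = 0.5, where x * (1 - x) takes its maximum 1/4.
for depth in (1, 2, 5, 10):
    print(depth, gradient_scale([0.5] * depth))

# Nearly saturated units (x = 0.9) shrink the error even faster.
print(gradient_scale([0.9] * 10))
```

Even in the best case the factor after ten layers is $4^{-10}$, below one millionth.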
32. Multilayer neural network: recurrent connections
• A feedforward neural network can represent an
instantaneous relationship between inputs and outputs
- memoryless: it depends on current inputs but not on
previous inputs.
• In order to describe a history, a neural network should
have its own dynamics.
• One way to incorporate dynamics into a neural network
is to introduce recurrent connections between units.
33. Working memory in the parietal cortex.
• A feedforward neural network can represent an
instantaneous relationship between inputs and outputs
- memoryless: it depends on current inputs x(t) but not
on previous inputs x(t-1), x(t-2), ...
• In order to describe a history, a neural network should
have its own dynamics.
• One way to incorporate dynamics into a neural network
is to introduce recurrent connections between units.
34. Multilayer neural network: recurrent connections
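Recurrent dynamics of neural network:
$$x_i(t+1) = f\big(u_i(t+1)\big) = f\Big(\big(\mathbf{W}\mathbf{x}(t) + \mathbf{U}\mathbf{a}(t)\big)_i\Big)$$
Output readout:
$$z_i(t) = g\Big(\big(\mathbf{V}\mathbf{x}(t)\big)_i\Big)$$
[Figure: input a projects to the recurrent units x through U; the units x are recurrently connected through W and project to the output z through V]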
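The recurrent update and readout can be simulated in a few lines. The Python sketch below assumes f = tanh and g = identity (the lecture does not fix these), and the small matrices W, U, V are made up. It compares the response to a single input pulse with the response to no input, to show that the output at later times still depends on an earlier input.

```python
import math


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_step(W, U, x, a):
    """x(t+1) = f(W x(t) + U a(t)), with f = tanh assumed here."""
    return [math.tanh(wx + ua) for wx, ua in zip(matvec(W, x), matvec(U, a))]


def run(W, U, V, inputs, x0):
    """Feed a_0, a_1, ... and collect the readouts z_1, z_2, ... (g = identity assumed)."""
    x, outputs = list(x0), []
    for a in inputs:
        x = rnn_step(W, U, x, a)
        outputs.append(matvec(V, x))
    return outputs


# Made-up weights: 2 recurrent units, 1 input, 1 output.
W = [[0.5, -0.3], [0.2, 0.4]]
U = [[1.0], [0.5]]
V = [[1.0, -1.0]]
x0 = [0.0, 0.0]

pulse = [[1.0], [0.0], [0.0]]    # input only at the first time step
silence = [[0.0], [0.0], [0.0]]

with_pulse = run(W, U, V, pulse, x0)
without_pulse = run(W, U, V, silence, x0)
print(with_pulse)
print(without_pulse)
```

The second and third outputs differ between the two runs even though the inputs at those times are identical, which is the memory that a feedforward network lacks.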
35. Temporal unfolding: backpropagation through time (BPTT)
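Training set for a recurrent network:
Input series: $\{\mathbf{a}_0, \mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_t, \ldots, \mathbf{a}_{T-1}\}$
Output series: $\{\mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3, \ldots, \mathbf{z}_t, \ldots, \mathbf{z}_T\}$
Optimize the weight matrices $\{\mathbf{U}, \mathbf{W}, \mathbf{V}\}$ so as to approximate the training set.
[Figure: one time step of the network, in which $\mathbf{a}_{t-1}$ and $\mathbf{x}_{t-1}$ produce $\mathbf{x}_t$ and $\mathbf{z}_t$ through U, W, V]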
36. Temporal unfolding: backpropagation through time (BPTT)
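[Figure: the recurrent network unfolded in time. $\mathbf{z}_1$ is computed from $\mathbf{a}_0$ through $\mathbf{x}_1$; $\mathbf{z}_2$ from $\mathbf{a}_0, \mathbf{a}_1$ through $\mathbf{x}_1, \mathbf{x}_2$; $\mathbf{z}_3$ from $\mathbf{a}_0, \mathbf{a}_1, \mathbf{a}_2$ through $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$; in general $\mathbf{z}_t$ is computed from $\mathbf{a}_{t-1}$ and $\mathbf{x}_{t-1}$. Every unfolded step uses the same weights U, W, V.]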
42. Tempotron: Spike-based perceptron.
Consider five neurons, each emitting one spike but at different times:
Rate coding: Information is coded in the number of spikes in a given period.
$$(r_1, r_2, r_3, r_4, r_5) = (1, 1, 1, 1, 1)$$
Temporal coding: Information is coded in temporal patterns of spiking.
45. Tempotron: Spike-based perceptron.
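Consider a classification problem of two spike patterns:
[Figure: pattern 1, in which neuron 1 contributes $w_1 e^{-3\Delta t} + w_1 e^{-\Delta t}$ and neuron 2 contributes $w_2 e^{-2\Delta t} + w_2$; pattern 2, in which neuron 1 contributes $w_1 e^{-2\Delta t} + w_1$ and neuron 2 contributes $w_2 e^{-3\Delta t} + w_2 e^{-\Delta t}$]
$$w_1\big(e^{-3\Delta t} + e^{-\Delta t}\big) + w_2\big(e^{-2\Delta t} + 1\big) > \theta, \qquad w_1\big(e^{-2\Delta t} + 1\big) + w_2\big(e^{-3\Delta t} + e^{-\Delta t}\big) < \theta$$
If a vector notation is introduced:
$$\mathbf{w} = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}, \qquad \mathbf{x}_1 = \begin{pmatrix} e^{-3\Delta t} + e^{-\Delta t} \\ e^{-2\Delta t} + 1 \end{pmatrix}, \qquad \mathbf{x}_2 = \begin{pmatrix} e^{-2\Delta t} + 1 \\ e^{-3\Delta t} + e^{-\Delta t} \end{pmatrix}$$
This classification problem is reduced to a perceptron problem:
$$\mathbf{w}^{\mathsf T}\mathbf{x}_1 > \theta, \qquad \mathbf{w}^{\mathsf T}\mathbf{x}_2 < \theta$$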
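The reduction can be made concrete by computing the two feature vectors from spike times. In this Python sketch the spike lags are read off the exponents of the two patterns, while the value of $\Delta t$ (in units of the decay time constant), the weights, and the threshold are hypothetical choices made for illustration.

```python
import math


def summed_psp(lags, dt):
    """Sum of exponentially decaying contributions exp(-k * dt), one per spike."""
    return sum(math.exp(-k * dt) for k in lags)


def features(pattern, dt):
    """Map a spike pattern (one list of spike lags per neuron) to a feature vector."""
    return [summed_psp(lags, dt) for lags in pattern]


# Spike lags in units of dt before the readout time: [neuron 1, neuron 2].
pattern_1 = [[3, 1], [2, 0]]
pattern_2 = [[2, 0], [3, 1]]

dt = 0.5                                  # hypothetical value
x1, x2 = features(pattern_1, dt), features(pattern_2, dt)

w = [0.0, 1.0]                            # hypothetical weights
theta = 0.5 * (x1[1] + x2[1])             # threshold halfway between the two drives
drive_1 = sum(wi * xi for wi, xi in zip(w, x1))
drive_2 = sum(wi * xi for wi, xi in zip(w, x2))
print(drive_1 > theta, drive_2 < theta)
```

Both neurons fire two spikes in either pattern, so a pure rate code cannot tell the patterns apart; only the timing-dependent features separate them.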
47. Learning a tempotron: intuition.
[Figure: the two spike patterns with their synaptic contributions; the last spike of neuron #1 in the second pattern is shown in red]
$$w_1\big(e^{-3\Delta t} + e^{-\Delta t}\big) + w_2\big(e^{-2\Delta t} + 1\big) > \theta, \qquad w_1\big(e^{-2\Delta t} + 1\big) + w_2\big(e^{-3\Delta t} + e^{-\Delta t}\big) > \theta$$
What was wrong if the second pattern was misclassified?
The last spike of neuron #1 (red one) is most responsible for the error, so the synaptic strength of this neuron should be reduced.
$$\Delta w_1 = -\lambda$$
48. Learning a tempotron: intuition.
[Figure: the two spike patterns with their synaptic contributions; the last spike of neuron #2 in the first pattern is shown in red]
$$w_1\big(e^{-3\Delta t} + e^{-\Delta t}\big) + w_2\big(e^{-2\Delta t} + 1\big) < \theta, \qquad w_1\big(e^{-2\Delta t} + 1\big) + w_2\big(e^{-3\Delta t} + e^{-\Delta t}\big) < \theta$$
What was wrong if the first pattern was misclassified?
The last spike of neuron #2 (red one) is most responsible for the error, so the synaptic strength of this neuron should be potentiated.
$$\Delta w_2 = +\lambda$$
49. Exercise: Capacity of perceptron.
• Generate a set of random vectors.
• Write code for the perceptron learning algorithm.
• By randomly relabeling the vectors, count how many of the labelings are linearly separable.
Rigotti, M., Barak, O., Warden, M. R., Wang, X. J., Daw, N. D., Miller, E. K., & Fusi, S.
(2013). The importance of mixed selectivity in complex cognitive tasks. Nature,
497(7451), 585-590.
50. Exercise: Training of recurrent neural networks.
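Goal: Investigate the effects of chaos and feedback in a recurrent network.
Recurrent dynamics without feedback:
$$\mathbf{x}_{n+1} = \mathbf{x}_n + \big(-\mathbf{x}_n + \mathbf{M}\mathbf{r}_n\big)\Delta t, \qquad \mathbf{r}_n = \tanh \mathbf{x}_n, \qquad z_n = \mathbf{w}_n^{\mathsf T} \tanh \mathbf{x}_n$$
Update of covariance matrix:
$$\mathbf{P}_0 = \frac{\mathbf{I}}{\alpha}, \qquad \mathbf{P}_{n+1} = \mathbf{P}_n - \frac{\mathbf{P}_n \mathbf{r}_n \mathbf{r}_n^{\mathsf T} \mathbf{P}_n}{1 + \mathbf{r}_n^{\mathsf T} \mathbf{P}_n \mathbf{r}_n}$$
Update of weight matrix:
$$\mathbf{w}_{n+1} = \mathbf{w}_n - e_n \mathbf{P}_n \mathbf{r}_n, \qquad e_n = z_n - f_n$$
force_internal_all2all.m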
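The covariance and weight updates are those of recursive least squares, and they can be tried in isolation from the network. The Python sketch below is not the exercise script: it replaces the network activity $\mathbf{r}_n$ with random vectors, uses a hypothetical target readout `w_true`, and applies the weight update with the freshly updated $\mathbf{P}$, which is the ordering of standard recursive least squares and is an assumption here.

```python
import random


def rls_step(w, P, r, f):
    """One recursive-least-squares update of the readout weights w toward target f."""
    dim = len(w)
    z = sum(wi * ri for wi, ri in zip(w, r))            # readout z = w^T r
    e = z - f                                           # error e = z - f
    Pr = [sum(P[i][j] * r[j] for j in range(dim)) for i in range(dim)]
    denom = 1.0 + sum(ri * pri for ri, pri in zip(r, Pr))
    # P <- P - (P r r^T P) / (1 + r^T P r); P is symmetric, so r^T P = (P r)^T.
    P = [[P[i][j] - Pr[i] * Pr[j] / denom for j in range(dim)] for i in range(dim)]
    # Assumption: use the updated P for the weights; note that P_new r = P r / denom.
    w = [wi - e * pri / denom for wi, pri in zip(w, Pr)]
    return w, P, e


random.seed(1)
dim, alpha = 3, 1.0
w_true = [0.5, -1.0, 2.0]                 # hypothetical target readout
w = [0.0] * dim
P = [[(1.0 / alpha if i == j else 0.0) for j in range(dim)] for i in range(dim)]

errors = []
for _ in range(200):
    r = [random.uniform(-1.0, 1.0) for _ in range(dim)]   # stand-in for tanh(x_n)
    f = sum(wt * ri for wt, ri in zip(w_true, r))          # target output f_n
    w, P, e = rls_step(w, P, r, f)
    errors.append(abs(e))

print(w)
print(errors[0], errors[-1])
```

The parameter alpha acts as a regularizer: a smaller alpha gives a larger initial P and therefore faster but less cautious early updates.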
51. Exercise: Training of recurrent neural networks.
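Goal: Investigate the effects of chaos and feedback in a recurrent network.
Recurrent dynamics with feedback:
$$\mathbf{x}_{n+1} = \mathbf{x}_n + \big(-\mathbf{x}_n + \mathbf{M}\mathbf{r}_n + \mathbf{w}^{f} z_n\big)\Delta t, \qquad \mathbf{r}_n = \tanh \mathbf{x}_n, \qquad z_n = \mathbf{w}_n^{\mathsf T} \tanh \mathbf{x}_n$$
Update of covariance matrix:
$$\mathbf{P}_0 = \frac{\mathbf{I}}{\alpha}, \qquad \mathbf{P}_{n+1} = \mathbf{P}_n - \frac{\mathbf{P}_n \mathbf{r}_n \mathbf{r}_n^{\mathsf T} \mathbf{P}_n}{1 + \mathbf{r}_n^{\mathsf T} \mathbf{P}_n \mathbf{r}_n}$$
Update of weight matrix:
$$\mathbf{w}_{n+1} = \mathbf{w}_n - e_n \mathbf{P}_n \mathbf{r}_n, \qquad e_n = z_n - f_n$$
force_external_feedback_loop.m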
52. Exercise: Training of recurrent neural networks.
Goal: Investigate the effects of chaos and feedback in a recurrent
network.
• Investigate the effect of output feedback. Is there any difference in the activities of the recurrent units?
• Investigate the effect of the gain parameter g. What happens if the gain parameter is smaller than 1?
• Try to approximate some other time series such as chaotic ones. Use the Lorenz model, for example.
53. References
• Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5(3), 1.
• Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text.
Complex systems, 1(1), 145-168.
• Funahashi, K. I. (1989). On the approximate realization of continuous mappings by neural networks.
Neural networks, 2(3), 183-192.
• Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
• Zipser, D. (1991). Recurrent network model of the neural mechanism of short-term active memory.
Neural Computation, 3(2), 179-193.
• Johansson, R. S., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code
complex spatial fingertip events. Nature neuroscience, 7(2), 170-177.
• Branco, T., Clark, B. A., & Häusser, M. (2010). Dendritic discrimination of temporal input sequences
in cortical neurons. Science, 329(5999), 1671-1675.
• Gütig, R., & Sompolinsky, H. (2006). The tempotron: a neuron that learns spike timing–based
decisions. Nature neuroscience, 9(3), 420-428.
• Sussillo, D., & Abbott, L. F. (2009). Generating coherent patterns of activity from chaotic neural
networks. Neuron, 63(4), 544-557.