10 - Connectionist Methods #1
• Introduction
• Simple Threshold Unit
• Perceptron
• Linear Threshold Unit
• Problems
• Backpropagation

Introduction
• Human Brain
  • Complex fabric of multiply networked cells (neurons) which exchange signals.
  • First recognised by Golgi & Cajal.
• Neocortex
  • Site of intelligent capabilities of brain.
  • Approx. 0.2 m², approx. 2-3 mm thick.
  • Approx. 100,000 interconnected nerve cells lie under every square mm.

Neuron
• Three main structures make up a typical neuron:
  • Dendritic tree
  • Cell body (soma)
  • Axon
[Figure: a neuron with dendrites, cell body (soma), axon and synapses labelled.]

Neuron
• Dendritic tree
  • Branched structure of thin cell extensions.
  • Sums output signals of surrounding neurons in form of an electric potential.
• Cell Body
  • If input potential exceeds a certain threshold value, cell body produces a short electrical spike.
  • Spike is conducted along axon.
• Axon
  • Branches out & conducts pulse to several thousand target neurons.
  • Contacts of axon are either located on dendritic tree or directly on cell body of target neuron.
  • Connections - known as synapses.

Simple Threshold Unit (TU)
• McCulloch & Pitts - 1943
  • Developed neuron-like threshold units.
  • Could be used to represent logical expressions.
  • Demonstrated how networks of such units might effectively carry out computations.
• Two important weaknesses in this model:
  • Does not explain how interconnections between neurons could be formed.
    • ie How this might occur through learning.
  • Such networks depend on error-free functioning of all their components (cf error tolerance of biological neural networks).

Simple Threshold Unit (TU)
• TU has N input channels & 1 output channel.
• Each input channel is either active (input = 1) or silent (input = 0).
• Activity states of all channels encode input information as a binary sequence of N bits.
• State of TU
  • Given by linear summation of all input signals & comparison of this sum with a threshold value, s.
  • If sum exceeds threshold, "neuron" is excited (output = 1); otherwise it is quiescent (output = 0).
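
The TU described above can be sketched in a few lines. This is only an illustration, not McCulloch & Pitts' own formulation: the function names and the two threshold values (1.5 for AND, 0.5 for OR) are assumptions chosen to show how a single unit can represent a logical expression.

```python
from typing import Sequence


def threshold_unit(inputs: Sequence[int], s: float) -> int:
    """Return 1 if the sum of the binary inputs exceeds threshold s, else 0."""
    if any(x not in (0, 1) for x in inputs):
        raise ValueError("TU inputs must be binary (0 or 1)")
    return 1 if sum(inputs) > s else 0


# Assumed thresholds: with two inputs, the threshold alone selects the function.
def logical_and(a: int, b: int) -> int:
    return threshold_unit([a, b], s=1.5)  # fires only when both inputs are active


def logical_or(a: int, b: int) -> int:
    return threshold_unit([a, b], s=0.5)  # fires when at least one input is active
```

Note that nothing here is learned: the thresholds are set by hand, which is the first weakness listed above.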

Perceptron
• Rosenblatt - 1958
  • Pioneered LTUs as basic unit in neural networks.
• Mark I Perceptron:
  • 20x20 array of photocells to act as a retina.
  • Layer of 512 association units (AUs).
  • Each AU took input from randomly selected subset of photocells & formed simple logical combination of them.
  • Output of AUs connected to 8 response units (RUs).
  • Strength of connections between AUs & RUs set by motor-driven potentiometers.
  • RUs could mutually interact - eventually agreeing on a response.

Perceptron
• Basic structure:
[Figure: digitized image, layer of AUs and layer of RUs, with weighted links between AUs & RUs.]

Linear Threshold Unit (LTU)
• Many similarities with simple TU.
• LTU:
  • Can accept real (not just binary) inputs.
  • Has real-valued weights associated with its input connections.
• TU - activation status is determined by summing inputs.
• LTU - activation status is determined by summing the products of the inputs & the weights attached to the relevant connections and then testing this sum against unit's threshold value.
• If sum > threshold, unit's activation is set to 1; otherwise set to 0.

LTU
• Simple LTU network:
[Figure: three input units with activations 0.7, 0.3 & 0.1, connected to one output unit (LTU) by weights 0.2, -0.3 & 0.9.]
• Weights on input connections form unit's weight vector:
  • [ 0.2 -0.3 0.9 ]
• Set of input activation levels form input vector:
  • [ 0.7 0.3 0.1 ]
• Sum of products of multiplying 2 vectors together - inner product of vectors:
  • (0.2 x 0.7) + (-0.3 x 0.3) + (0.9 x 0.1) = 0.14
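
The inner product for the simple LTU network above can be checked with a short sketch. The weight and input vectors are the ones on the slide; the two threshold values used at the end are hypothetical, since the slide does not give one.

```python
from typing import Sequence


def inner_product(weights: Sequence[float], inputs: Sequence[float]) -> float:
    """Sum of the products of corresponding weights and input activations."""
    if len(weights) != len(inputs):
        raise ValueError("weight and input vectors must have the same length")
    return sum(w * x for w, x in zip(weights, inputs))


def ltu(weights: Sequence[float], inputs: Sequence[float], threshold: float) -> int:
    """Activation is 1 if the inner product exceeds the threshold, else 0."""
    return 1 if inner_product(weights, inputs) > threshold else 0


weights = [0.2, -0.3, 0.9]   # weight vector from the slide
inputs = [0.7, 0.3, 0.1]     # input vector from the slide
total = inner_product(weights, inputs)
print(round(total, 2))       # 0.14

# Hypothetical thresholds, for illustration only.
print(ltu(weights, inputs, threshold=0.1))  # 1: inner product is above 0.1
print(ltu(weights, inputs, threshold=0.5))  # 0: inner product is below 0.5
```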

LTU
• Any input vector or weight vector with n components specifies a point in n-dimensional space.
  • Components of vector - coordinates of point.
• Advantage of viewing vectors as points or rays in an n-dimensional space - can understand behaviour of LTU in terms of way in which it divides input space into two:
  • Region containing all input vectors which turn LTU on.
  • Region containing all input vectors which turn LTU off.

LTU
• Consider two input units:
  • Activation levels range between 0 & 1
  • Weights range between -1 & +1
[Figure: weight vector W & input vector I plotted in the same 2-dimensional space.]
• Weight components: (0.11 0.6)
• Input components: (0.7 0.7)

Learning AND
• We can train an LTU to discriminate between two different classes of input.
  • eg to compute a simple logic function - AND
• To construct an LTU to do this - must ensure that it returns right output for each input.
  (1 0) (0)
  (0 0) (0)
  (0 1) (0)
  (1 1) (1)
• NB Only get 1 as output if inner product > threshold value.

Learning AND
• Threshold = 0.5
[Figure: one two-input LTU per input pattern, with weights w1 & w2, showing the constraint each pattern places on the weights.]
  Input    Output    Weighted sum
  (1 1)    1         w1 + w2 ≥ 0.5
  (1 0)    0         w1 < 0.5
  (0 1)    0         w2 < 0.5
  (0 0)    0         0

Learning AND
• Weight vectors must be in filled region to give satisfactory output.
[Figure: weight space (w1, w2) with the filled region bounded by the constraints w1 < 0.5, w2 < 0.5 & w1 + w2 ≥ 0.5, one for each of the inputs (1 0), (0 1) & (1 1).]

Perceptron
• Basis of algorithm:
  • Iterative reduction of LTU error.
• Take each input/output pair in turn & present input vector to LTU.
• Calculate degree to which activation of unit differs from desired activation & adjust weight vector so as to reduce this difference by a small, fixed amount.

Perceptron Algorithm
• Initialize:
  • Randomly initialize weights in network.
• Cycle through training set applying the following three rules to the weights on the connections to the output unit.
  • If activation level of output unit is 1 when it should be 0 - reduce weight on link to the ith input unit by r x Ii, where Ii is activation level of ith input unit & r is a fixed weight step.
  • If activation level of output unit is 0 when it should be 1 - increase weight on link to ith input unit by r x Ii.
  • If activation level of output unit is at desired level - do nothing.

Perceptron
• Minsky & Papert - late 1960s
  • Developed mathematical analysis of perceptron & related architectures.
• Central result:
  • Since basic element of perceptron is LTU - it can only discriminate between linearly separable classes.
  • Large proportion of interesting classes are NOT linearly separable.
  • Therefore, perceptron approach is restricted.
• Minsky & Papert dampened enthusiasm for neural networks.
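
The three update rules of the Perceptron Algorithm can be sketched as follows, trained here on AND, which is linearly separable. Several details are assumptions made for a reproducible illustration: the function names, the weight step r = 0.1, the cap on the number of passes, and the fixed starting weights of zero (the slide calls for random initialisation). The threshold of 0.5 is the one used in the Learning AND slides.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]


def ltu_output(weights: Sequence[float], inputs: Sequence[float], threshold: float) -> int:
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def train_perceptron(samples: List[Sample], weights: Sequence[float],
                     threshold: float, r: float, max_epochs: int = 100) -> List[float]:
    """Cycle through the training set until a full pass makes no mistakes."""
    weights = list(weights)
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            actual = ltu_output(weights, inputs, threshold)
            if actual == target:
                continue                      # at desired level: do nothing
            mistakes += 1
            step = -r if actual == 1 else r   # reduce if too high, increase if too low
            weights = [w + step * x for w, x in zip(weights, inputs)]
        if mistakes == 0:
            break
    return weights


and_samples: List[Sample] = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(and_samples, weights=[0.0, 0.0], threshold=0.5, r=0.1)
print([ltu_output(weights, x, 0.5) for x, _ in and_samples])  # [0, 0, 0, 1]
```

Starting from zero, only the (1 1) pattern is ever misclassified, so both weights rise in steps of 0.1 until their sum exceeds the threshold.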

Linear Separability
• LTU discriminates between classes by separating them with a line (more generally, a hyperplane) in the input space.
• A great many classes cannot be separated in this way - ie many classes are NOT linearly separable.
• For example, XOR:
  (0 0) (0)
  (0 1) (1)
  (1 0) (1)
  (1 1) (0)

Backpropagation
• Rumelhart & McClelland - 1986
• To achieve more powerful performance - need to move to more complex networks.
  • Networks containing input & output units + hidden units.
• Terminology:
  • 2-layer network:
    • Layer of input units, layer of hidden units & layer of output units.
  • Feed-forward network:
    • Activation flows from input units, through hidden units to output units.
  • Completely connected network:
    • Every unit in every layer receives input from every unit in layer below.
  • Strictly layered network:
    • Only has connections between units in adjacent layers.
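
The XOR example from the Linear Separability slide, which is the motivation for adding hidden units, can be checked numerically. The sketch below searches a coarse grid of weights and thresholds for a single two-input LTU that fits a truth table. The grid (steps of 0.1 between -1 and +1) and the function names are assumptions; a grid search illustrates the result but does not prove it.

```python
from itertools import product

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# Assumed search grid: -1.0, -0.9, ..., 1.0 for both weights and the threshold.
GRID = [i / 10 for i in range(-10, 11)]


def ltu_output(w1, w2, threshold, inputs):
    return 1 if w1 * inputs[0] + w2 * inputs[1] > threshold else 0


def find_ltu(samples):
    """Return the first (w1, w2, threshold) on the grid that fits every sample, or None."""
    for w1, w2, threshold in product(GRID, repeat=3):
        if all(ltu_output(w1, w2, threshold, x) == t for x, t in samples):
            return (w1, w2, threshold)
    return None


print(find_ltu(AND) is not None)  # True: AND is linearly separable
print(find_ltu(XOR))              # None: no single LTU on the grid computes XOR
```

The search fails for XOR on any grid: (0 0) must give 0, so the threshold is at least 0; (0 1) and (1 0) must give 1, so each weight exceeds the threshold; but then w1 + w2 also exceeds it, and (1 1) wrongly gives 1.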

Backpropagation
• Simple two-layer network:
[Figure: simple two-layer network.]
• If introduction of hidden units is to achieve anything - it is essential for units to have non-linear activation functions.
• Most common approach is to use a function (often known as logistic function):
  $\frac{1}{1 + e^{-x}}$
  x - total input to unit.

Bias
• Activation level of a unit is calculated by applying logistic function to the inputs of the unit.
• But: To obtain satisfactory performance of Backpropagation it is necessary to allow units to have some level of activation independent of inputs.
  • This activation = bias.
  • Implemented by connecting units to a dummy unit (which always has activation = 1).
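
A minimal sketch of a unit's activation, combining the logistic function with the dummy-unit bias described above. The function and parameter names (`unit_activation`, `bias_weight`) are assumptions for illustration.

```python
import math
from typing import Sequence


def logistic(x: float) -> float:
    """Logistic function 1 / (1 + e^-x), where x is the total input to the unit."""
    return 1.0 / (1.0 + math.exp(-x))


def unit_activation(inputs: Sequence[float], weights: Sequence[float],
                    bias_weight: float = 0.0) -> float:
    """Activation of one unit; the bias arrives over a link from a dummy unit."""
    dummy_activation = 1.0  # the dummy unit always has activation 1
    total = sum(w * a for w, a in zip(weights, inputs)) + bias_weight * dummy_activation
    return logistic(total)


# With all inputs silent, only the bias link determines the activation.
print(unit_activation([0.0, 0.0], [0.3, 0.4]))                   # 0.5
print(unit_activation([0.0, 0.0], [0.3, 0.4], bias_weight=2.0))  # above 0.5
```

Treating the bias as one more weight means it can be adjusted by the same update rule as every other connection.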

Errors
• Computing output unit error - straightforward.
• Compare actual & target activation level:
  $E_o = S_o \, d(A_o)$
  • $E_o$ - error on output unit.
  • $S_o$ - difference between actual & target activation (target - actual).
  • $d(A_o)$ - first derivative of logistic function.
• First derivative is easily computed:
  • If $A_i$ is current activation of unit i, first derivative is just:
    $A_i (1 - A_i)$

Hidden Unit Error
• How to compute error for hidden units?
• Although we cannot assign errors to hidden units directly - can deduce level of error by computing errors of output units & propagating this error backwards through the network.
• Thus, for a 2-layer feed-forward network:
  • Contribution that a hidden unit makes to error of an output unit to which it is connected is simply degree to which the hidden unit was responsible for giving the output unit the wrong level of activation.
  • Size of contribution depends on two factors:
    • Weight on link which connects the two units.
    • Activation of hidden unit.
  • Can arrive at an estimated error value for any hidden unit by summing all the "contributions" which it makes to errors of the units to which it is connected.
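
The shortcut A (1 - A) for the first derivative, stated on the Errors slide, follows from differentiating the logistic function directly. A short derivation, writing A for the activation produced by total input x (the intermediate steps are filled in here and are not on the slides):

```latex
\begin{aligned}
A &= \frac{1}{1 + e^{-x}} \\
\frac{dA}{dx} &= \frac{e^{-x}}{(1 + e^{-x})^{2}}
  = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} \\
  &= A \, (1 - A), \qquad \text{since } 1 - A = \frac{e^{-x}}{1 + e^{-x}}
\end{aligned}
```

This is why the derivative needs no extra computation: it is available from the activation already obtained in the forward pass.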

Hidden Unit Error
• Error of a hidden unit, i is:
  $S_i = \sum_j E_j W_{ij}$
  • $E_j$ - error value of jth unit.
• Must take account of activation of hidden unit.
  • $S_i$ is multiplied by derivative of activation.
• If $A_i$ is activation of unit i, final error of i is:
  $E_i = S_i \, d(A_i)$
  $E_i = S_i A_i (1 - A_i)$

Weights
• Once error values have been determined for all units in the network - weights are then updated.
• Weight on connection which feeds activation from unit i to unit j is updated by an amount proportional to the product of $E_j$ & $A_i$:
  $\Delta W_{ij} = E_j A_i r$ (r - learning rate)
• Known as generalised delta rule.

Sloshing
• Backpropagation performs gradient descent in squared error.
  • Tries to find a global minimum of the error surface - by modifying weights.
• This can lead to problem behaviour: sloshing.
• Example:
  • A weight configuration corresponds to a point high up, on one side of a long thin valley in error surface.
  • Changes to weights may result in move to a position high up on other side of valley.
  • Next iteration will jump back, and so on ...

Momentum
• Sloshing can substantially slow down learning.
• To solve this problem: introduce momentum term.
• Weight updating process modified to consider last change applied to the weight.
• Effect is to smooth out weight changes & thus prevent oscillations.
  $\Delta W = (\Delta W_{\text{previous}} \times \text{momentum}) + (\Delta W_{\text{current}} \times (1 - \text{momentum}))$
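
The momentum formula above can be tried on a sloshing sequence of weight changes. In this sketch the raw changes (alternating +1 and -1), the momentum value of 0.9, the function names, and the assumption that no change precedes the first step are all illustrative choices, not values from the slides.

```python
from typing import Iterable, List


def blend(previous_delta: float, current_delta: float, momentum: float) -> float:
    """Change actually applied: a mix of the last applied change and the current one."""
    return previous_delta * momentum + current_delta * (1.0 - momentum)


def smoothed_changes(raw_deltas: Iterable[float], momentum: float) -> List[float]:
    applied = []
    previous = 0.0  # assumption: no change has been applied before the first step
    for current in raw_deltas:
        previous = blend(previous, current, momentum)
        applied.append(previous)
    return applied


raw = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # weight jumping across the valley
applied = smoothed_changes(raw, momentum=0.9)
print([round(d, 3) for d in applied])
```

With momentum 0.9 no applied change is larger than about 0.1 in magnitude, against raw changes of magnitude 1; with momentum 0 the formula reduces to the unsmoothed update.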
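
Backpropagation - Example
• Learning XOR
• 2-2-1 network
• Randomly initialised weights.
[Figure: 2-2-1 network. Weights from left input unit: 0.39 & -0.07. Weights from right input unit: 0.37 (to left hidden unit) & -0.06 (to right hidden unit). Weights to output unit: 0.5 (from left hidden unit) & 0.22 (from right hidden unit).]

Backpropagation - Example
• Present first training pair: (0 1) (1)
• Set activation of left input unit to 0.0 & activation of right input unit to 1.0
• Propagate activation forward through network using logistic function & compute new activations:
[Figure: same network with activations filled in. Input units: 0.0 & 1.0. Hidden units: 0.59 & 0.48. Output unit: 0.6.]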
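
The forward pass for this training pair can be reproduced with a short sketch. Two things are assumptions: the units have no bias, and 0.39 feeds the left hidden unit while -0.07 feeds the right one (both come from the left input, whose activation is 0.0, so this choice does not affect the result here).

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, output_weights):
    """hidden_weights[h][i] is the weight from input unit i to hidden unit h."""
    hidden = [logistic(sum(w * a for w, a in zip(ws, inputs))) for ws in hidden_weights]
    output = logistic(sum(w * a for w, a in zip(output_weights, hidden)))
    return hidden, output


inputs = [0.0, 1.0]                              # training pair (0 1)
hidden_weights = [[0.39, 0.37], [-0.07, -0.06]]  # assumed left/right assignment
output_weights = [0.5, 0.22]

hidden, output = forward(inputs, hidden_weights, output_weights)
print([round(h, 3) for h in hidden], round(output, 3))
```

The computed values agree with the slide's 0.59, 0.48 and 0.6 to within rounding in the second decimal place.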

Backpropagation - Example
• Calculate error on output unit & propagate error back through network.
• Output unit error = (target output (1.0) - actual output (0.6)) x derivative of activation (0.6 (1 - 0.6))
  = 0.4 x 0.24 = 0.1 (to 1 decimal place)
• Left hidden unit (activation 0.59, weight to output unit 0.5):
  Error = (0.1 x 0.5) x 0.59 (1 - 0.59) = 0.01
• Right hidden unit (activation 0.48, weight to output unit 0.22):
  Error = (0.1 x 0.22) x 0.48 (1 - 0.48) = 0.01

Backpropagation - Example
• Update weights.
• Weights altered by amount proportional to the product of error value of the receiving unit & the activation of sending unit.
  • Hidden units to output unit:
    0.5 + (0.1 x 0.59) = 0.559
    0.22 + (0.1 x 0.48) = 0.268
  • Input units to hidden units:
    0.39 + (0.01 x 0.0) = 0.39
    0.37 + (0.01 x 1.0) = 0.38
    -0.07 + (0.01 x 0.0) = -0.07
    -0.06 + (0.01 x 1.0) = -0.05
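
The error and weight-update arithmetic of this example can be checked with a short sketch, starting from the rounded activations shown on the slides (0.6, 0.59, 0.48). The function names are assumptions, and the learning rate r = 1.0 is inferred from the slide's arithmetic, which applies the full product of error and activation.

```python
def output_error(target, actual):
    """(target - actual) times the logistic derivative actual * (1 - actual)."""
    return (target - actual) * actual * (1.0 - actual)


def hidden_error(activation, weights_out, errors_out):
    """Sum of E_j * W_ij over receiving units, times the derivative of the activation."""
    s = sum(e * w for e, w in zip(errors_out, weights_out))
    return s * activation * (1.0 - activation)


def updated_weight(weight, receiver_error, sender_activation, r=1.0):
    """Generalised delta rule: add r * E_j * A_i to the weight from unit i to unit j."""
    return weight + r * receiver_error * sender_activation


e_out = output_error(target=1.0, actual=0.6)   # about 0.096, shown as 0.1 on the slide
e_left = hidden_error(0.59, [0.5], [e_out])    # rounds to 0.01
e_right = hidden_error(0.48, [0.22], [e_out])  # rounds to 0.01
print(round(e_out, 3), round(e_left, 4), round(e_right, 4))

# The slide carries the rounded errors (0.1 and 0.01) into the weight updates.
print(updated_weight(0.5, 0.1, 0.59))    # close to 0.559
print(updated_weight(-0.06, 0.01, 1.0))  # close to -0.05
```

Because the slide rounds the errors before updating, its new weights differ slightly from an unrounded calculation; for instance 0.5 + (0.096 x 0.59) is closer to 0.557 than to 0.559.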