This document describes analyzing images of English alphabets using a neural network with backpropagation. It preprocesses 26 alphabet images into 10x10 bipolar arrays as input data. A multilayer feedforward neural network with backpropagation training is used. The network has 100 input nodes, a hidden layer with variable nodes, and 26 output nodes. Different network structures are tested by varying the hidden nodes. The network is trained on the alphabet images and tested to evaluate performance for character recognition. Results are analyzed and conclusions are drawn based on the size and accuracy of the network structures.
Analysis of Alphabets Image Rishi P. Metawala
Table of Contents
1. Introduction and Background
2. Problem Description
3. Detailed Description of Data set being used
4. Image Preprocessing
5. Description of Neural network used to solve the problem
6. Training the Network (simulation results)
7. Testing Performance of Neural Network
8. Comparison of results
9. Conclusions
10. Appendixes: Program
11. References
1. Introduction and Background:
Artificial neural networks have been developed as generalizations of mathematical models of
human cognition or neural biology. A neural network is characterized by its pattern of
connections between the neurons, its method of determining the weights on the connections
and its activation function. Many interesting problems fall under the application of neural
networks. One of the application areas is pattern recognition. Neural networks have been
extensively developed in the area of character recognition (digits or letters).
Single layer neural networks have some major disadvantages: they can represent only a limited
set of functions, their decision boundaries must be hyperplanes, and they can solve only
linearly separable problems. The essential character of these neural networks is that they map
similar input patterns to similar output patterns. However, this constraint is also a limitation
of such networks, because for many practical problems very similar input patterns may have very
different output requirements [1].
Such networks cannot even solve the XOR problem. Recall the truth table of XOR: the hyperplane
would need to divide the space so that (0,0) and (1,1) lie on one side of the decision boundary
and (0,1) and (1,0) lie on the other side. This is clearly impossible for a single layer
network. Hence we use multilayer (feedforward) networks with the backpropagation training
algorithm.
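The XOR limitation can be checked numerically. The following is a minimal sketch, not part of the original program: it applies the classic perceptron learning rule to the four XOR patterns in bipolar form. The variable names and the epoch count are illustrative assumptions.

```matlab
% Perceptron learning rule on XOR (bipolar inputs and targets).
X = [-1 -1  1  1;
     -1  1 -1  1];            % four input patterns, one per column
t = [-1  1  1 -1];            % XOR targets
w = zeros(2, 1);              % weights of the single layer
b = 0;                        % bias
for epoch = 1:1000            % assumed epoch budget
    for p = 1:4
        y = sign(w' * X(:, p) + b);
        if y ~= t(p)          % misclassified pattern: apply the update
            w = w + t(p) * X(:, p);
            b = b + t(p);
        end
    end
end
% Count patterns that are still misclassified after training.
numErrors = sum(sign(w' * X + b) ~= t);
```

However long the loop runs, at least one pattern stays misclassified, because no single line separates the two XOR classes.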
General-purpose multilayer neural nets, such as the backpropagation net, have been used for
character recognition. My goal in this report is to present and analyze the character recognition
problem (for the letters of the English alphabet) using this backpropagation network. I will
discuss the backpropagation algorithm in detail in subsequent chapters.
2. Problem Description:
For this assignment a neural network has to be implemented and used to analyze images
containing letters of the English alphabet. These images have to be preprocessed before being
fed to the neural network. They have to be 10x10 images, converted to an array containing the
values 1 and -1. 1 represents black blocks, -1 the white blocks.
Different neural network structures have to be examined, and the influence of the different
structures has to be measured as well. The structures differ in the number of nodes in the
hidden layer, the number of hidden layers (max 2), and the transfer function of every layer. The
learning algorithm is backpropagation or one of its variants, since that is the most
popular and best known algorithm nowadays.
The analysis will compare the accuracy of the network against the size of the network.
3. Description of Data Set:
The character recognition assignment is loosely based on the typical character recognition
problem documented in the MATLAB Neural Network Toolbox [5]. MATLAB provides the command
prprob, which generates the alphabet as a matrix of 26 columns, where each column has 35
values that can be either 1 or 0. Each column of 35 values defines a 5x7 bitmap of a letter. In
my case I made some changes as demanded by the problem statement: instead of using the
MATLAB generated characters, I drew the characters in Microsoft Paint, resized them there,
saved them as monochrome images to make sure that each is a 2D array (and not 3D),
and then transformed the data set into a bipolar array. Thus in the end I had 26 columns, with
each column having 100 values which can be either -1 or 1. I then defined a target matrix, just
like in the standard character recognition problem documented in the Neural Network Toolbox,
which maps the 26 input vectors to 26 classes.
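As a minimal sketch of such a target matrix, assuming the same one-column-per-class layout that prprob returns in [5]: the 26 x 26 identity matrix assigns letter k a target vector with a 1 in row k. The output vector y below is made up for illustration.

```matlab
T = eye(26);                         % column k is the target for letter k
letters = char('A' + (0:25));        % 'ABC...Z', used only for decoding

% Decoding a (made-up) network output: the largest entry picks the class.
y = zeros(26, 1);
y(3) = 0.9;
y(7) = 0.4;
[~, k] = max(y);
predicted = letters(k);
```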
The diagram above shows how the image looks after rescaling it into 10 x 10 and making it
bipolar.
The second type of data set used was also influenced by the character recognition problem as
stated in [5]. For simplicity I used the same noise generation as in [5]. Note that the noisy
bits in our case fall in the range 0 to 1. The reason for using noise is that we would like to
recognize not only the perfectly formed letters but also noisy versions of the letters. This
gives a better evaluation of the neural network.
Noise was generated as described in [5]:
numNoise = 30;
Xn = min(max(repmat(X,1,numNoise)+randn(35,26*numNoise)*0.2,0),1);
Tn = repmat(T,1,numNoise);
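The snippet from [5] hard-codes 35 rows for the 5x7 bitmaps. A hedged sketch of the same idea that takes the dimensions from the data, so it also fits 100-row letters, is shown below. The stand-in matrix X (random 0/1 values) and the clipping range of 0 to 1 are assumptions for illustration, not the report's data.

```matlab
X = double(rand(100, 26) > 0.5);     % stand-in for the 100 x 26 letter matrix
T = eye(26);                         % one target column per letter
numNoise = 30;                       % noisy copies per letter, as in [5]

[numInputs, numLetters] = size(X);   % sizes come from the data, not constants
noise = randn(numInputs, numLetters * numNoise) * 0.2;
Xn = min(max(repmat(X, 1, numNoise) + noise, 0), 1);   % clip to [0, 1]
Tn = repmat(T, 1, numNoise);         % targets repeated to match Xn
```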
4. Image Preprocessing:
The aim is to identify images of letters of the English alphabet. I used 26 images, each in
.bmp format. The images were in gray scale. The figure below shows a few of the 26 images
used for preprocessing. These images are read using the function “imread” from the Image
Processing Toolbox. I illustrate the image preprocessing step by step below. (Note: I simply
used Microsoft Paint to write down the characters myself and saved them as monochrome
.bmp files.)
A = imread('A.bmp');
and so on….
i) Converting to logical
One thing to note is that the images are 24 x 42 pixels, so each image forms a 24 x 42 matrix
of 0’s and 1’s. We convert the image to the logical data type using the
MATLAB function “logical”.
IA = logical(A)
ii) Resizing
The main aim of the preprocessing is to resize the array to a 10 x 10 image. This is easily
achieved with the Image Processing Toolbox function “imresize”, which resizes
the image to the size specified in the command.
A = imresize(IA, [10 10]);
iii) Reshaping the alphabets.
For the purpose of recognition we need to present the inputs sequentially. That is, since the
array of one single letter is 10 x 10, each letter is represented by a vector of 10 x 10 =
100 input values, and we have 26 such letters. We first concatenate the letters into
a single big array of size 10 x 260. The final concatenated image is illustrated
below.
Because we need vectors of 100 input values, we reshape the image so
that we have 26 columns, one column for each letter, each of size (100 x
1). This is done using the following command.
letternew = reshape(letter,100,26);
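The reshape works because MATLAB stores arrays column by column, so the first 100 elements of the 10 x 260 array are exactly the first letter. A small sketch that checks this, using random 10 x 10 stand-in images (an assumption; the real letters come from the .bmp files):

```matlab
% Build 26 stand-in 10x10 "letters" and concatenate them side by side.
letters = cell(1, 26);
for k = 1:26
    letters{k} = rand(10, 10);
end
letter = [letters{:}];                    % 10 x 260 array
letternew = reshape(letter, 100, 26);     % one 100 x 1 column per letter

% Column k of letternew equals letter k unrolled column by column.
sameAsSecond = isequal(letternew(:, 2), letters{2}(:));

% The original 10x10 image can be recovered for display.
restored = reshape(letternew(:, 2), 10, 10);
```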
We also need to convert our logical array into a double precision array. We need to
do this because the neural network command which creates a backpropagation network needs
the min/max values of the input vectors. The logical data type does not provide
this information, but the double precision data type carries the minimum and
maximum values in the array.
alphabet = double(alphabeto)
The final step is to convert the data into bipolar form, that is, to make sure that white is
represented by +1 and black is represented by -1. This is achieved by changing each
element of the array to -1 if it is 0 and leaving it unchanged if it is 1. I implemented this
with a simple for loop which scans each element of the array and replaces each 0 by -1. Bipolar
data is preferred because the 0 units do not learn.
for i = 1:size(alphabet,1)
    for j = 1:size(alphabet,2)
        % replace every 0 by -1; entries equal to 1 stay unchanged
        if alphabet(i,j) == 0
            alphabet(i,j) = -1;
        end
    end
end
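The same conversion can be written without loops. This is a sketch of two equivalent vectorized forms on a small made-up 0/1 array:

```matlab
alphabet = double([0 1 1; 1 0 0]);   % small stand-in for the 100 x 26 array

% Form 1: logical indexing replaces every 0 by -1.
bipolarA = alphabet;
bipolarA(bipolarA == 0) = -1;

% Form 2: the affine map 2*x - 1 sends 0 to -1 and 1 to +1.
bipolarB = 2 * alphabet - 1;
```

Both forms assume the array contains only 0's and 1's, as it does after the logical conversion.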
Finally, the character images were displayed again to verify that the image
preprocessing was successful, with each letter being a 10 x 10 array of bipolar data.
Alphabets after Image Processing
Since the features were retained there was really no need for any feature extraction.
5. Description of the Neural network:
We will use a feedforward network with the backpropagation training algorithm [2].
A multilayer feedforward network consists of an input layer, one or more hidden layers
and an output layer. These networks use a smooth activation function. The networks are
trained using the backpropagation algorithm, which adjusts the weights of the various layers of
the network.
5.1 Architecture:
Typical architecture [1]
A multilayer neural network with one hidden layer is shown. The output units and the hidden
units may also have biases. The bias terms act like weights on connections from units whose
output is always 1. Only the direction of information flow for the feedforward phase of
operation is shown. During the backpropagation phase of learning, signals are sent in the
reverse direction.
Training multilayer networks is complicated by the fact that a desired response is only given
for the output nodes [3]. The desired response of the intermediate layers must be derived by
distributing the error observed at the output layer back to the previous layer. If there is more
than one hidden layer, then errors at a layer L must be recursively propagated back to adjust
layer L-1. In addition, the error at one node must be distributed among only the connected
nodes at the previous layer.
[Figure: Backpropagation [3]. Input layer (Xi), hidden layer (Hij) and output (Y), with an error calculation step feeding back at each layer.]
5.1.1 Explanation of Backpropagation:
One more disadvantage of the single layer perceptron is linear dependency: if the outputs are
not independent and start influencing each other, there is no common variable shared between
the neurons which we could tune to bring them into sync. This is one reason people turned to
multilayer networks. Let us look at the multilayer figure above. By tuning the very first
neuron of the first hidden layer, we are in fact able to change the activation
of all output neurons. The question is whether there is an algorithm which can tune not only the
second layer weights but also the first layer weights. To do so we start from the output layer
and propagate backwards.
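[Figure: hand-drawn three-layer network. Input nodes 1, 2, 3, ..., i with inputs X1 to X4; hidden nodes 4, 5, 6, ..., j; output nodes 7, 8, 9, ..., k with outputs O7, O8, O9, ..., Ok; bias nodes feeding the hidden and output layers; connections labelled with their weights W.]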
In order to update the weights of the first neuron of the very first hidden layer, we need to
know the error at that specific neuron. But the error is only given at the
output neurons of the second layer. We therefore propagate the error back to the
hidden layer, so that we know the error there. It is like saying that the hidden neuron takes
the blame for the error! Now that we have the error at the hidden layer, we can update the
weights just like in the perceptron. I have tried to explain the calculations in my own words
below.
5.1.2. Error Calculation and Weights Adjustments:
On sending a specified training pattern to the neural network, the weighted sum of the inputs to
the jth node in the hidden layer is given as
$$Net_j = \sum_i W_{i,j} X_i + \theta_j$$
The above equation calculates the total weighted input to the neuron. The $\theta_j$ term is the
weighted value from a bias node that always has an output value of 1. The bias node acts as a
pseudo input and is used to overcome the problems associated with situations where the values of
an input pattern are zero. To decide whether the neuron should fire, the net term is passed
to an appropriate activation function. The resulting value from the activation function
determines the neuron’s output and becomes the input value for the neurons in the next layer.
A typical activation function is a sigmoid function such as tansig; a linear function (purelin)
can also be used.
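The weighted sum and activation for one layer can be sketched in a few lines. The sizes and all numeric values below are made up for illustration. tanh is used as the bipolar sigmoid because it is mathematically the same function as the toolbox's tansig and needs no toolbox.

```matlab
x     = [1; -1; 1];                      % input pattern X_i (3 inputs)
W     = [ 0.2 -0.1;
          0.4  0.3;
         -0.5  0.1];                     % W(i,j): 3 inputs x 2 hidden nodes
theta = [0.1; -0.2];                     % bias weights theta_j

netj = W' * x + theta;                   % Net_j = sum_i W(i,j)*X_i + theta_j
oj   = tanh(netj);                       % hidden outputs in (-1, 1)
```

Here netj evaluates to [-0.6; -0.5], and oj is passed on as the input of the next layer.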
Backpropagation:
If the activation value of the output node k is $o_k$ (see my hand drawn diagram above) and the
expected target output for node k is $t_k$, the difference between the expected output and the
actual output is given as:
$$\Delta_k = t_k - o_k$$
The error signal for node k in the output layer is then given as
$$\delta_k = \Delta_k \, o_k (1 - o_k)$$
where the term $o_k(1 - o_k)$ is the derivative of the sigmoid function. By the delta rule, the
change in the weight connecting node j and output node k is proportional to the error signal at
node k multiplied by the activation of node j, written $o_j$.
The weights $W_{j,k}$ between node j and the output node k are therefore modified as:
$$\Delta W_{j,k} = l_r \, \delta_k \, o_j$$
$$W_{j,k} = W_{j,k} + \Delta W_{j,k}$$
where $\Delta W_{j,k}$ is the change in the weight between nodes j and k and $l_r$ is the
learning rate. If the learning rate is too low, the network learns very slowly; if it is too
high, training can become unstable. With an adaptive learning rate, the rate is typically
decreased as training approaches the optimal point.
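Hidden Layer:
The error signal for node j in the hidden layer, whose activation is $o_j$, is given as
$$\delta_j = o_j (1 - o_j) \sum_k W_{j,k} \, \delta_k$$
The summation term adds the weighted error signals of all nodes k in the output layer. The
formula to adjust the weight between the input node i and the hidden node j is:
$$\Delta W_{i,j} = l_r \, \delta_j \, X_i + \mu \, \Delta W_{i,j}^{prev}$$
$$W_{i,j} = W_{i,j} + \Delta W_{i,j}$$
where $\mu$ is the momentum coefficient and $\Delta W_{i,j}^{prev}$ is the weight change from
the previous update.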
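Putting the forward pass, the error signals and the weight updates together, one training step for a single pattern can be sketched as follows. This is an illustration under stated assumptions, not the report's program: a tiny 2-3-2 network, logistic sigmoid activations (so the derivative is o(1 - o)), made-up weights, and no momentum term.

```matlab
x  = [1; -1];                            % input pattern X_i
t  = [1; 0];                             % targets t_k
W1 = [0.1 -0.2  0.3;
      0.4  0.1 -0.1];                    % W(i,j): 2 inputs x 3 hidden nodes
b1 = zeros(3, 1);                        % hidden biases
W2 = [ 0.2 -0.1;
       0.1  0.3;
      -0.3  0.2];                        % W(j,k): 3 hidden x 2 output nodes
b2 = zeros(2, 1);                        % output biases
lr = 0.5;                                % learning rate (assumed value)
sigmoid = @(n) 1 ./ (1 + exp(-n));

% Forward pass.
oj = sigmoid(W1' * x + b1);
ok = sigmoid(W2' * oj + b2);
errBefore = sum((t - ok).^2);

% Backward pass: output error signals, then hidden error signals.
deltaK = (t - ok) .* ok .* (1 - ok);
deltaJ = oj .* (1 - oj) .* (W2 * deltaK);   % sum over k of W(j,k)*delta_k

% Weight and bias updates.
W2 = W2 + lr * oj * deltaK';
b2 = b2 + lr * deltaK;
W1 = W1 + lr * x * deltaJ';
b1 = b1 + lr * deltaJ;

% Forward pass again to see the effect of the step.
oj = sigmoid(W1' * x + b1);
ok = sigmoid(W2' * oj + b2);
errAfter = sum((t - ok).^2);
```

Note that deltaJ is computed before W2 is changed, so the hidden error signals use the same weights as the forward pass.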
5.2 Proposed Network
The target patterns defined in the assignment (26 target patterns for the 26 letters, each
defined by 100 bipolar values) are quite simple, so a two-layer or three-layer
feedforward network should generally be enough for the character recognition task. As noted
before, the network needs 100 inputs and 26 output neurons. We can choose the bipolar
sigmoid as the transfer function for both the hidden and output layers because its output
range [-1 1] suits our problem.
We choose small initial values for the weights and biases so that the active region of
each neuron is not close to the unresponsive (saturated) part of the transfer function;
otherwise the network will not learn. In MATLAB we use the function newff to create the network.
The command initializes each layer’s weights and biases, and we then scale them down by a factor
of 0.01. Initially we set 1 hidden layer with 15 neurons. The weights were scaled down simply to
be on the safer side.
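The effect of starting in the saturated region can be seen from the slope of the bipolar sigmoid. A short sketch, using tanh as a toolbox-free equivalent of tansig and two arbitrarily chosen net inputs:

```matlab
dtansig = @(n) 1 - tanh(n).^2;     % derivative of the bipolar sigmoid

slopeSmall = dtansig(0.1);         % net input near zero: slope close to 1
slopeLarge = dtansig(10);          % saturated neuron: slope close to 0
```

Since every backpropagated error signal is multiplied by this slope, a neuron whose net input is large in magnitude receives almost no weight update.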
Generally a rule of thumb is used to select the size of the network. The mathematical
description is given in the lecture slides; the following is a generalization of what I
understood from various other sources.
The number of hidden neurons should be between the size of the input layer and the size
of the output layer.
The number of hidden neurons should be less than twice the size of the input layer.
Or more specifically, the size of the hidden layer can be given as [3]
$$H = \frac{T}{5(N + M)}$$
where H = number of hidden layer neurons
N = size of the input layer
M = size of the output layer
T = size of the training set
But these rules only provide a starting point for selecting the number of neurons; ultimately
the selection of the architecture always comes down to trial and error. So I will illustrate the
network with different numbers of hidden layers and neurons.
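As a worked instance of the formula, with N = 100 and M = 26 from this report and a hypothetical training set size T (the report does not state T; 26 clean letters plus 30 noisy copies of each is assumed here purely for illustration):

```matlab
N = 100;               % size of the input layer
M = 26;                % size of the output layer
T = 26 * 31;           % hypothetical training set size: 806 patterns

H = T / (5 * (N + M)); % rule-of-thumb number of hidden neurons
```

With these numbers H comes out at roughly 1.3, far below the sizes tested later, which shows why the rule is only a starting point.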
6. Training the network:
To analyze the impact of neural networks which differ in their
structure (number of nodes, number of hidden layers (maximum 2) and
activation function), I decided to introduce some random noise, that is, to corrupt some bits in
each letter. I then train the network without noise and with noise.
traingdx [4] was used to train the network. traingdx is a network training function that updates
weight and bias values according to gradient descent with momentum and an adaptive learning rate.
The simulation results are shown below. Each graph displays the error and indicates the
iteration at which the validation performance reached a minimum. It gives a clear idea of the
accuracy of the network against the size (type) of the network. Various combinations of size and
type of network are possible, and I have tested a random selection of them. I have also
introduced the functionality to change the number of training epochs.
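The idea behind gradient descent with momentum and an adaptive learning rate can be sketched on a toy problem. This is a simplified illustration of the general scheme, not the toolbox's implementation. The quadratic objective, the parameter names and the values (momentum 0.9, increase factor 1.05, decrease factor 0.7, allowed increase 4%) are all assumptions chosen for the example.

```matlab
% Toy objective with its minimum at wStar.
wStar = [1; -2];
f     = @(w) sum((w - wStar).^2);
grad  = @(w) 2 * (w - wStar);

w  = [0; 0];                  % initial weights
dw = zeros(size(w));          % previous weight change (momentum memory)
lr = 0.01;                    % initial learning rate
mc = 0.9;                     % momentum constant
lrInc = 1.05;                 % factor to grow lr after an improving step
lrDec = 0.7;                  % factor to shrink lr after a rejected step
maxPerfInc = 1.04;            % largest tolerated relative increase in error

perf0 = f(w);
perf  = perf0;
for epoch = 1:200
    % Momentum step: blend the previous change with the new gradient step.
    dwNew   = mc * dw - (1 - mc) * lr * grad(w);
    wNew    = w + dwNew;
    perfNew = f(wNew);
    if perfNew > perf * maxPerfInc
        lr = lr * lrDec;      % error grew too much: reject step, reduce lr
        dw = zeros(size(w));  % and drop the momentum memory
    else
        if perfNew < perf
            lr = lr * lrInc;  % error fell: accept step and increase lr
        end
        w = wNew;
        dw = dwNew;
        perf = perfNew;
    end
end
```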
15 hidden Neurons - 1 hidden layer – purelin transfer function
Training Data set: The training data set (seen data) is the data used to build the model
(determine its parameters). The training data set in our case is the known data, that is, the
image array of size 100 x 26 as explained above in section 2. We also train the network with
noise to improve the performance. For the noisy training data set, we corrupt the original
training data set with random noise values between 0 and 1 at random positions; the code is
explained in the MATLAB comments. I train the network with both the actual and the noisy data
set to make the network insensitive to noise, so that it is better able to recognize the noisy
letters.
The performance can be checked by clicking on the performance tab when the neural
network window pops up during training. In fact you can click on the performance tab even while
training is going on to follow the evolution of the training.
25 hidden Neurons - 1 hidden layer – tansig transfer function
100 hidden Neurons - 1 hidden layer – tansig transfer function
Training stops when the network is no longer likely to improve on the training or validation
sets. You may get different results: when I rerun the training I see some differences, but I
expect the differences to be small.
7. Testing the performance of the Neural Network.
Since the network was trained on both noisy and noise free patterns, we now present both the
noisy patterns and the noise free patterns to the network to test it.
Characters with noise
In order to make the network insensitive to the presence of noise, it is trained with noise as
well. In the figure above, 3 random bits of noise have been introduced in each letter.
Testing Data Set: We now test on unseen data to check the overall performance.
I do this using the sim command from the Neural Network Toolbox. When testing, I test
both the network trained on noisy patterns, “netn”, and the network trained on noise free
patterns, “net”, and compare the average recognition rate. Note that I test both these networks
on inputs with some added noise.
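Once the network outputs are available, the recognition error can be computed by comparing the winning output node with the target class for each test pattern. A minimal sketch with made-up outputs for 4 classes (in the report the output matrix would come from sim and have 26 rows):

```matlab
T = eye(4);                         % targets: pattern k belongs to class k
Y = [0.9 0.1 0.2 0.1;
     0.0 0.8 0.1 0.6;
     0.1 0.0 0.7 0.2;
     0.0 0.1 0.0 0.1];              % made-up outputs, one column per pattern

[~, predicted] = max(Y, [], 1);     % winning output node per pattern
[~, actual]    = max(T, [], 1);     % true class per pattern
errorRate = 100 * mean(predicted ~= actual);   % recognition error in percent
```

In this example the fourth pattern is assigned to class 2 instead of class 4, so one pattern in four is misclassified.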
Changing the Size:
The neural network command to create the network used is
net = newff(minmax(input),[s1 s2 26],{Tf1 Tf2 Tf3},'trainlm');
My code prompts for the values of these variables so that the desired size can be set. Here s1
is the size of the first hidden layer and s2 the size of the second hidden layer. We also need
to define the transfer function of each hidden layer and of the output layer, which are given by
Tf1, Tf2 and Tf3. In the last section, and also in the attached manual, I have described in
detail how to change the size when running the code.
8. Comparison of the Results.
In this section I will plot the result directly of the recognized characters and compare them.
2 layer - 15 - 10 neurons (tansig) 2 layer - 10 - 10 neurons(tansig)
As seen above, the networks with two hidden layers are not able to reproduce the characters very well.
1 Hidden Layer - 25 neurons(tansig) 1 Hidden Layer - 100 neurons (tansig)
Finally I plot the result with 1 hidden layer 25 neurons but now with a different activation function
purelin.
1 hidden layer -25 neurons (purelin) 1 hidden layer -15 neurons (tansig - purelin)
One thing to note, however, is that the percentage of recognition error with 1 hidden layer of
100 neurons and the tansig activation function is negligible, as shown below. For a
combination of the activation functions tansig and purelin, the recognition error was also
almost negligible if we train the network with noise; note that only 15 neurons were used in
this case, and training was much faster. The figure below is plotted at the end of the
simulation.
1 hidden layer 100 neurons-tansig 1 hidden layer 15 neurons-(purelin)
9. Conclusion
Clearly we see some underfitting when the number of neurons is chosen too small.
Regression plot for 1 hidden layer – 5 neurons
The above graph shows an example of underfitting (obtained by clicking on regression in the
neural network window). The small circles are the data and the solid blue line is the fit. It is
obvious that there is really no fit when I choose the number of neurons to be 5.
Choosing the number of neurons too high results in overfitting.
Overfitting occurs when the neural network has so much information processing capacity that
the limited amount of information contained in the training set is not enough to train all of
the neurons in the hidden layers. Another visible problem observed with a high number of neurons
was that the training time increased to the point that the neural network could not be trained
completely in the time available.
In our case, networks with 2 hidden layers were also tested. Frankly speaking, there was really
no need for the second hidden layer: as seen from the results, in our case one hidden
layer of 15 neurons outperformed 2 hidden layers. In the literature there has been much
debate about the need for deeper networks. Most authors conclude that one hidden layer is enough
for practical applications, and that two hidden layers are generally used for discontinuities
such as a sawtooth waveform. One of the reasons, I believe, that two hidden layers perform
poorly compared to one hidden layer in our case is that the two hidden layer network is probably
converging to a local minimum. In short, two hidden layers may introduce a greater risk of
convergence to a local minimum, as stated in [6].
In terms of activation function tansig performed much better compared to purelin when
trained with same number of hidden neurons. The table below gives a summary of my
observations, with 5000 epochs.
Neurons in 1st | Neurons in 2nd | Transfer functions of the | Number of     | Mean square | Recognition
hidden layer   | hidden layer   | layers and output layer   | hidden layers | error (MSE) | error
15             | -              | Tansig-Tansig             | 1             | 0.014793    | 38.5%
20             | -              | Tansig-Tansig             | 1             | 0.00739     | 19%
25             | -              | Tansig-Tansig             | 1             | 4.1786e-06  | 0%
25             | -              | Purelin-Purelin           | 1             | 3.7115e-07  | 10%
20             | 10             | Tansig-Tansig-Tansig      | 2             | 0.01488     | 32%
25             | 15             | Tansig-Tansig-Tansig      | 2             | 0.00746     | 15%
30             | 25             | Tansig-Tansig-Tansig      | 2             | 8.5162e-06  | 0%
25             | 15             | Tansig-purelin-purelin    | 2             | 0.014793    | 1.5%
In conclusion we can say that 1 hidden layer with 25 neurons gives us a perfect fit on the data,
with minimum recognition error and mean square error. There is no need for two hidden layers,
as they are computationally more complex and take longer to converge to a minimum, while giving
no visible advantage over a single hidden layer.
10. Appendixes and MATLAB files. (Matlab file will be attached)
Bias: An input of a neural net with any value except zero. Its purpose is to generate different
inputs for different input patterns given to the net.
Epochs: Number of iterations through the backpropagation algorithm.
Learning rate: A changeable value, used by several learning algorithms, which affects the
changing of weight values. The greater the learning rate, the more the weight values are
changed. It is usually decreased during the learning process.
Weights: Connections between two neurons, with a value that is dynamically changed during a
neural net's learning process.
Matlab:
The MATLAB files will be attached to the email, and I have tried to explain most of
the code using comments in MATLAB itself. In summary the program is divided into 6 parts,
namely:
Generating characters
Generating characters with noise.
Creating the neural network.
Training network.
Testing the performance of the network.
Result plots
The main file will call all the functions.
How to use the Matlab files provided?
1) Please make sure that all the files including the bmp image files are in the same directory, in
the same folder.
2) Select all and open all the files, there are about 7 files.
3) Locate the file named “main.m” and run the file.
For simulating one hidden layer network.
After running steps 1 to 3:
4) After some time you will be prompted to enter the size of the first hidden layer. Enter the
size of hidden layer you wish to use.
You will be prompted again to enter the size of the second hidden layer. Without entering
anything, hit enter.
5) Now you are asked to select the transfer function of the first layer. Please enter the
transfer function (purelin, tansig or logsig) within inverted commas.
Eg: ‘tansig’
6) You will be asked again to enter the transfer function of the second layer. Repeat the same
step illustrated in 5.
7) You will again be asked to enter the transfer function of the 3rd layer. Do not enter
anything and hit enter.
For simulating two hidden layer network.
After steps 1 to 3:
8) After some time you will be prompted to enter the size of the first hidden layer. Enter the
size of hidden layer you wish to use.
9) You will be prompted again to enter the size of the second hidden layer. Enter the size of
the second layer you wish to use.
10) Now you are asked to select the transfer function of the first layer. Please enter the
transfer function (purelin, tansig or logsig) within inverted commas.
Eg: ‘tansig’
11) You will be asked again to enter the transfer function of the second layer. Repeat the same
step illustrated in 10: enter the transfer function (purelin, tansig or logsig) within inverted
commas.
Eg: ‘tansig’
12) You will again be asked to enter the transfer function of the 3rd layer. Again repeat the
last step and enter the transfer function, like ‘tansig’, ‘purelin’ or ‘logsig’, within inverted
commas.
Eg: ‘purelin’
The training might take some time depending on the size of the network you have entered.
13) Additionally I have made provision to enter the number of training epochs. I have
tested with 5000.
Note:
I have tested my code in MATLAB version 2013a. The command newff is obsolete, but it still
worked in the version I tested. Please make sure you have both the Image
Processing Toolbox and the Neural Network Toolbox installed in your MATLAB version.
11. References:
[1] http://web.mit.edu/marshall/www/papers/XOR.pdf
[2] http://psych.stanford.edu/~jlm/papers/PDP/Volume%201/Chap8_PDP86.pdf
[3] Prof. V. Mladenov Lecture slides.
[4] http://www.mathworks.nl/help/nnet/ref/traingdx.html
[5] http://www.mathworks.nl/help/nnet/examples/character-recognition.html
[6] http://www.ictic.sk/archive/?vid=1&aid=3&kid=50101-3&q=f1