Neural Networks and
Fuzzy Systems
Multi-layer Feed forward Networks
Dr. Tamer Ahmed Farrag
Course No.: 803522-3
Course Outline
Part I : Neural Networks (11 weeks)
• Introduction to Machine Learning
• Fundamental Concepts of Artificial Neural Networks
(ANN)
• Single-layer Perceptron Classifier
• Multi-layer Feed forward Networks
• Single-layer Feedback Networks
• Unsupervised learning
Part II : Fuzzy Systems (4 weeks)
• Fuzzy set theory
• Fuzzy Systems
Outline
• Why do we need Multi-layer Feed forward Networks (MLFF)?
• Error Function (or Cost Function or Loss Function)
• Gradient Descent
• Backpropagation
Why do we need Multi-layer Feed forward Networks (MLFF)?
• To overcome the failure of the single-layer perceptron in solving nonlinear problems.
• First suggestion:
• Divide the problem space into smaller linearly separable regions.
• Use a perceptron for each linearly separable region.
• Combine the outputs of multiple hidden neurons into a final decision neuron.
[Figure: problem space split into Region 1 and Region 2]
Why do we need Multi-layer Feed forward Networks (MLFF)?
• Second suggestion:
• In some cases we need a curved decision boundary, or we are trying to solve more complicated classification and regression problems.
• So, we need to:
• Add more layers.
• Increase the number of neurons in each layer.
• Use non-linear activation functions in the hidden layers.
• So, we need Multi-layer Feed forward Networks (MLFF).
Notation for Multi-Layer Networks
• Dealing with multi-layer networks is easy if a sensible notation is adopted.
• We simply need another label (n) to tell us which layer in the network we
are dealing with.
• Each unit $j$ in layer $n$ receives activations $out_i^{(n-1)} w_{ij}^{(n)}$ from the previous layer of processing units and sends activations $out_j^{(n)}$ to the next layer of units.
[Figure: units 1–3 of layer (n−1) connected to units 1–2 of layer (n) by weights $w_{ij}^{(n)}$; the first such mapping, from layer (0) to layer (1), uses weights $w_{ij}^{(1)}$.]
ANN Representation
(1 input layer + 1 hidden layer + 1 output layer)
For example:
$z_1^{(1)} = w_{11}^{(1)} x_1 + w_{21}^{(1)} x_2 + w_{31}^{(1)} x_3 + b_1^{(1)}$
$a_1^{(1)} = f(z_1^{(1)}) = \sigma(z_1^{(1)})$
$z_2^{(2)} = w_{12}^{(2)} a_1^{(1)} + w_{22}^{(2)} a_2^{(1)} + w_{32}^{(2)} a_3^{(1)} + b_2^{(2)}$
$y_2 = a_2^{(2)} = f(z_2^{(2)}) = \sigma(z_2^{(2)})$
In general:
$z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$
$a_j^{(l)} = f(z_j^{(l)}) = \sigma(z_j^{(l)})$
[Figure: a 3-3-2 network. Layer (0) inputs $x_1 = a_1^{(0)}$, $x_2 = a_2^{(0)}$, $x_3 = a_3^{(0)}$; layer (1) units $(z_1^{(1)}, a_1^{(1)})$, $(z_2^{(1)}, a_2^{(1)})$, $(z_3^{(1)}, a_3^{(1)})$ connected by weights $w_{11}^{(1)} \ldots w_{33}^{(1)}$; layer (2) units $(z_1^{(2)}, a_1^{(2)})$, $(z_2^{(2)}, a_2^{(2)})$ connected by weights $w_{11}^{(2)} \ldots w_{32}^{(2)}$, producing outputs $y_1$, $y_2$.]
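To make the notation concrete, here is a minimal sketch of the forward pass for the 3-3-2 network above, written in Python/NumPy (the slides themselves contain no code, and the variable names W1, b1, W2, b2 are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.1, 0.8])   # inputs x1..x3, i.e. a^(0)

W1 = rng.normal(size=(3, 3))     # W1[i, j] = w_ij^(1)
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 2))     # W2[i, j] = w_ij^(2)
b2 = np.zeros(2)

z1 = x @ W1 + b1                 # z_j^(1) = sum_i w_ij^(1) a_i^(0) + b_j^(1)
a1 = sigmoid(z1)                 # a_j^(1) = sigma(z_j^(1))
z2 = a1 @ W2 + b2                # z_j^(2) = sum_i w_ij^(2) a_i^(1) + b_j^(2)
y = sigmoid(z2)                  # outputs y_1, y_2
print(y)
```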
Gradient Descent
and Backpropagation
Error Function
● How can we evaluate the performance of a neuron?
● We can use an error function (or cost function, or loss function) to measure how far off we are from the expected value.
● Choosing an appropriate error function helps the learning algorithm reach the best values for the weights and biases.
● We’ll use the following variables:
○ D to represent the true (desired) value
○ y to represent the neuron’s prediction
Error Functions
(Cost Function or Loss Function)
• There are many formulas for error functions.
• In this course, we will deal with two error function formulas.
Sum Squared Error (SSE):
$e_{pj} = (y_j - D_j)^2$ for a single perceptron
$E_{SSE} = \sum_{j=1}^{n} (y_j - D_j)^2$  (1)
Cross Entropy (CE):
$E_{CE} = -\frac{1}{n} \sum_{j=1}^{n} \left[ D_j \ln(y_j) + (1 - D_j) \ln(1 - y_j) \right]$  (2)
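As a concrete illustration, both error functions can be computed in a few lines; a sketch in Python/NumPy (not from the lecture; the function names sse and cross_entropy are made up for this example):

```python
import numpy as np

def sse(y, d):
    """Sum Squared Error, equation (1): sum_j (y_j - D_j)^2."""
    y, d = np.asarray(y, float), np.asarray(d, float)
    return np.sum((y - d) ** 2)

def cross_entropy(y, d, eps=1e-12):
    """Cross entropy, equation (2), averaged over the n outputs.
    eps keeps log() away from zero when a prediction saturates."""
    y, d = np.asarray(y, float), np.asarray(d, float)
    y = np.clip(y, eps, 1.0 - eps)
    return -np.mean(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))

print(sse([0.9, 0.2], [1.0, 0.0]))            # 0.05
print(cross_entropy([0.9, 0.2], [1.0, 0.0]))  # ~0.164
```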
Why does the error in an ANN occur?
• Each weight and bias in the network contributes to the error.
• To solve this we need:
• A cost (error) function to compute the error (the SSE or CE error function).
• An optimization algorithm to minimize the error function (gradient descent).
• A learning algorithm to modify the weights and biases to new values that drive the error down (backpropagation).
• Repeat this process until the best solution is found.
Gradient Descent (in 1 dimension)
• Assume we have an error function E and we need to use it to update one weight w.
• The figure shows the error function in terms of w.
• Our target is to learn the value of w that produces the minimum value of E. How?
[Figure: error E plotted against weight w, with the minimum marked]
Gradient Descent (in 1 dimension)
• In the gradient descent algorithm, we use the following equation to get a better value of w:
$w = w - \alpha \Delta w$ (called the delta rule)
Where:
$\alpha$ is the learning rate
$\Delta w$ can be computed mathematically as the derivative of E with respect to w, $\frac{dE}{dw}$
So:
$w = w - \alpha \frac{dE}{dw}$  (3)
[Figure: error E plotted against weight w, with the minimum marked]
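To see equation (3) in action, here is a tiny sketch (Python, illustrative only) that minimizes a made-up one-dimensional error function E(w) = (w − 3)² by repeatedly applying the delta rule:

```python
# Hypothetical error function E(w) = (w - 3)^2, so dE/dw = 2 * (w - 3)
def dE_dw(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial weight
alpha = 0.1    # learning rate
for step in range(50):
    w = w - alpha * dE_dw(w)   # equation (3)
print(w)       # converges toward the minimum at w = 3
```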
Local Minima Problem
[Figure illustrating the local minima problem]
Choosing the Learning Rate
[Figure illustrating the choice of learning rate]
Gradient Descent (multiple dimensions)
• In an ANN with many layers and many neurons in each layer, the error function is a multi-variable function.
• So, the derivative in equation (3) becomes a partial derivative:
$w_{ij} = w_{ij} - \alpha \frac{\partial E_j}{\partial w_{ij}}$  (4)
• We write equation (4) in shorthand as:
$w_{ij} = w_{ij} - \alpha\, \partial w_{ij}$
• The same process is used to get the new bias value:
$b_j = b_j - \alpha\, \partial b_j$
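A minimal sketch (Python/NumPy; the helper name gradient_descent_step is invented here) of applying equation (4) to a whole layer at once, assuming the gradient arrays dE_dW and dE_db have already been computed:

```python
import numpy as np

def gradient_descent_step(W, b, dE_dW, dE_db, alpha=0.1):
    """Element-wise updates: w_ij <- w_ij - alpha * dE/dw_ij,
    b_j <- b_j - alpha * dE/db_j."""
    return W - alpha * dE_dW, b - alpha * dE_db

# Hypothetical 3x2 weight matrix and precomputed gradients
W, b = np.ones((3, 2)), np.zeros(2)
dE_dW, dE_db = np.full((3, 2), 0.5), np.array([0.2, -0.1])
W, b = gradient_descent_step(W, b, dE_dW, dE_db)
print(W, b)
```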
Derivatives of Activation Functions
[Table of common activation functions and their derivatives; for the sigmoid, $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$]
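For reference, a short sketch (Python, not part of the slides) of the sigmoid and the derivative used on the next slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # sigma'(z) = sigma(z) * (1 - sigma(z))

print(sigmoid_derivative(0.0))  # 0.25, the maximum of the derivative
```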
Learning Rule in the Output Layer
Using SSE as the error function and the sigmoid as the activation function:
$\frac{\partial E_j}{\partial w_{ij}} = \frac{\partial E_j}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$
Where:
$E_j = (y_j - D_j)^2$
$y_j = a_j^{(l)} = f(z_j^{(l)}) = \sigma(z_j^{(l)})$
$z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$
From the previous table:
$\sigma'(z_j^{(l)}) = \sigma(z_j^{(l)}) \left(1 - \sigma(z_j^{(l)})\right) = y_j (1 - y_j)$
Learning Rule in the Output Layer (cont.)
So (how?):
$\frac{\partial y_j}{\partial z_j} = y_j (1 - y_j)$
$\frac{\partial z_j}{\partial w_{ij}} = a_i^{(l-1)}$
$\frac{\partial E_j}{\partial y_j} = 2 (y_j - D_j)$
• Then:
$\frac{\partial E_j}{\partial w_{ij}} = 2\, a_i^{(l-1)} (y_j - D_j)\, y_j (1 - y_j)$
$w_{ij} = w_{ij} - 2\alpha\, a_i^{(l-1)} (y_j - D_j)\, y_j (1 - y_j)$
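A sketch of this output-layer update for a single training pattern (Python/NumPy; output_layer_update and its arguments are hypothetical names, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer_update(W, b, a_prev, D, alpha=0.1):
    """One gradient-descent step on the output layer.
    W[i, j] = w_ij, a_prev = activations a^(l-1), D = desired outputs."""
    z = a_prev @ W + b
    y = sigmoid(z)                          # y_j = sigma(z_j)
    # dE_j/dw_ij = 2 * a_i^(l-1) * (y_j - D_j) * y_j * (1 - y_j)
    delta = 2.0 * (y - D) * y * (1.0 - y)   # per-output factor
    dE_dW = np.outer(a_prev, delta)
    dE_db = delta
    return W - alpha * dE_dW, b - alpha * dE_db

# Hypothetical example: 3 hidden activations feeding 2 output neurons
rng = np.random.default_rng(1)
W, b = rng.normal(size=(3, 2)), np.zeros(2)
W, b = output_layer_update(W, b,
                           a_prev=np.array([0.2, 0.7, 0.1]),
                           D=np.array([1.0, 0.0]))
```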
Learning Rule in the Hidden Layer
• Now we have to determine the appropriate weight change for an input-to-hidden weight.
• This is more complicated because it depends on the error at all of the nodes that this weighted connection can lead to.
• The mathematical proof is beyond the scope of this course.
Gradient Descent (Notes)
Note 1:
• The neuron activation function f must be a defined and differentiable function.
Note 2:
• The previous calculation is repeated for every weight and every bias in the ANN.
• So, we need substantial computational power (what about deeper networks?).
Note 3:
• Calculating $\partial w_{ij}$ for the hidden layers is more difficult (why?).
Gradient Descent (Notes)
• $\partial w_{ij}$ represents the change in the value of $w_{ij}$ needed to get a better output.
• The equation for $\partial w_{ij}$ depends on the choice of error (cost) function and activation function.
• The gradient descent algorithm helps calculate the new values of the weights and biases.
• Question: is one iteration (one trial) enough to get the best values for the weights and biases?
• Answer: No, we need an extended procedure: backpropagation.
How Does Backpropagation Work?
[Figure: a 3-2-1 network with layers 0, 1 and 2, hidden activations such as $a_1^{(1)}$, weights $w_{11}^{(1)} \ldots w_{32}^{(1)}$ and $w_{11}^{(2)}$, $w_{21}^{(2)}$, and output $y$. Forward propagation runs left to right; backpropagation runs right to left, updating each weight as $w_{11}^{(1)} = w_{11}^{(1)} - \alpha\, \partial w_{11}^{(1)}$, $w_{11}^{(2)} = w_{11}^{(2)} - \alpha\, \partial w_{11}^{(2)}$, and so on.]
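To tie the forward and backward passes together, here is a compact sketch (Python/NumPy, written for this summary rather than taken from the lecture) that trains a small 3-2-1 sigmoid network on a single pattern with the SSE loss, repeating forward propagation, backpropagation, and the weight updates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
x = np.array([0.1, 0.6, 0.3])        # one input pattern (layer 0)
D = np.array([1.0])                  # desired output

W1, b1 = rng.normal(size=(3, 2)), np.zeros(2)   # layer 0 -> layer 1
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)   # layer 1 -> layer 2
alpha = 0.5

for epoch in range(1000):
    # ---- forward propagation ----
    z1 = x @ W1 + b1;  a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2; y  = sigmoid(z2)

    # ---- backpropagation (SSE + sigmoid) ----
    delta2 = 2.0 * (y - D) * y * (1.0 - y)       # output-layer error term
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)   # hidden-layer error term

    # ---- gradient-descent weight updates ----
    W2 -= alpha * np.outer(a1, delta2); b2 -= alpha * delta2
    W1 -= alpha * np.outer(x, delta1);  b1 -= alpha * delta1

print(y[0])   # ends up close to the desired value D = 1.0
```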
Online Learning vs. Offline Learning
• Online: pattern-by-pattern learning
• Error calculated for each pattern
• Weights updated after each individual pattern
$\Delta w_{ij} = -\alpha \frac{\partial E_p}{\partial w_{ij}}$
• Offline: batch learning
• Error calculated for all patterns
• Weights updated once at the end of each epoch
$\Delta w_{ij} = -\alpha \sum_p \frac{\partial E_p}{\partial w_{ij}}$
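A toy sketch (Python/NumPy; the data and the simple linear unit are invented for illustration, not a full MLFF network) contrasting the two update schedules:

```python
import numpy as np

# Toy model y = w . x with SSE per pattern E_p = (y - d)^2,
# so dE_p/dw = 2 * (y - d) * x
patterns = [(np.array([1.0, 0.0]), 1.0),
            (np.array([0.0, 1.0]), 0.0),
            (np.array([1.0, 1.0]), 1.0)]
alpha = 0.1

def grad(w, x, d):
    return 2.0 * (w @ x - d) * x

# Online (pattern-by-pattern): update after every pattern
w_online = np.zeros(2)
for epoch in range(100):
    for x, d in patterns:
        w_online -= alpha * grad(w_online, x, d)

# Offline (batch): sum the gradients, update once per epoch
w_batch = np.zeros(2)
for epoch in range(100):
    total = sum(grad(w_batch, x, d) for x, d in patterns)
    w_batch -= alpha * total

print(w_online, w_batch)   # both approach the least-squares solution
```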
Choosing Appropriate Activation and Cost Functions
• From our consideration of single-layer networks, we already know what output activation and cost functions should be used for particular problem types.
• We have also seen that non-linear hidden unit activations, such as sigmoids, are needed.
• So we can summarize the required network properties:
• Regression / function approximation problems:
• SSE cost function, linear output activations, sigmoid hidden activations
• Classification problems (2 classes, 1 output):
• CE cost function, sigmoid output and hidden activations
• Classification problems (multiple classes, 1 output per class):
• CE cost function, softmax outputs, sigmoid hidden activations
• In each case, applying the gradient descent learning algorithm (by computing the partial derivatives) leads to the appropriate back-propagation weight update equations.
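The summary above can be kept as a small lookup table; a sketch (Python, labels are illustrative only):

```python
# Problem type -> (cost function, output activation, hidden activation)
network_recipes = {
    "regression":                ("SSE", "linear",  "sigmoid"),
    "binary_classification":     ("CE",  "sigmoid", "sigmoid"),
    "multiclass_classification": ("CE",  "softmax", "sigmoid"),
}
print(network_recipes["binary_classification"])
```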
Overall Picture: the Learning Process in an ANN
[Figure summarizing the learning process]
Neural Network Simulators
• Search the internet for a neural network simulator and report on it.
For example:
• https://www.mladdict.com/neural-network-simulator
• http://playground.tensorflow.org/