This is a single-day course that gives the learner hands-on experience with the basics of deep learning. In the first half we build a network using Python/numpy only; in the second half we build a more advanced network using TensorFlow/Keras.
At the end you will find a list of useful pointers for continuing.
course git: https://gitlab.com/eshlomo/EazyDnn
2. About me
Haifa IoT Ignition Lab and IPP (Intel Ingenuity Partnership Program) tech lead.
Intel Perceptual Computing.
Compute, cloud, and embedded expert.
Maker and entrepreneur.
Focused on data science and machine learning in recent years.
3. Agenda
Let's talk some theory
Let's define a problem
Time to code our network
Meet the pro tools
Time for fancier networks
4. By the end of the day…
You will:
• Get some intuition about what DL is and what you can use it for.
• Understand the mechanics behind deep learning.
• Get a basic feel for the concepts and how DL works.
• Get some hands-on time with well-known tools.
• Receive a list of pointers to continue your learning and experimenting.
You will not have:
• Practical experience solving problems using DL
• An understanding of the different types of networks and their uses
• The math skills required to be an expert.
5. We are going to work in "try, catch up" mode
• Along the way we have exercises; you will get time to try them.
• Usually the next slide will contain the solution.
• So for every task:
• Try.
• Catch up once the solution is on the board; focus on understanding the solution.
• Make sure you have it working; each step is required for the one after it.
7. Where we are from a technology-timeline perspective
Assembly → C (compiler) → C++ (OOP) → Java (managed) → Python (runtime) → model protos → high level (Keras) → ????
8. Deep learning – basic anatomy
• Data driven
• Training a model
• Input, output, and hidden neurons
[Diagram: input layer → hidden layer(s) → output layer]
Deep learning = many hidden (deep) layers
9. The essence of deep learning
[Diagram: a small network with inputs $X_i$, outputs $Y_i$, and weight matrices $W_{ij}^{(1)}$, $W_{ij}^{(2)}$]
$Y = f(X) = WX + b$
A deep network is essentially a function we train to detect some pattern.
b (the bias) is omitted in this drawing.
Why the sudden success?
A lot of data
A lot of compute
Improved networks
10. Before we start, some math…
• As data science becomes part of every business, math is gaining extra popularity.
• A question you need to ask yourself if you wish to go deeper into the field: do I want to (and can I) refresh and extend my math skills?
• For our basic deep learning course we need some:
• Algebra, mainly matrix/vector operations
• Calculus, mainly derivatives
• You can never get enough statistics in data science; go over variance, mean, distributions, probabilities
• Python
11. Some math references to get started with
https://www.youtube.com/watch?v=K5BLNZw7UeU Matrix operations
https://www.youtube.com/watch?v=kuixY2bCc_0 Multiplying matrices
https://www.youtube.com/watch?v=rAof9Ld5sOg Derivatives
https://www.youtube.com/watch?v=TUJgZ4UDY2g The chain rule
https://www.youtube.com/watch?v=ZkjP5RJLQF4 Linear regression
https://www.youtube.com/watch?v=_Po-xZJflPM Logistic regression
https://www.youtube.com/watch?v=Y4lTTHua0TE Mean, Variance,…
12. Let's practice some basics
• We are working with Python 3.5. You are encouraged to work with the conda package manager, but pip is OK as well.
• numpy is THE math-operations package for Python; we will be using it to play with matrices. Install it.
• Let's create 5 random normal numbers, making sure numpy is good to go.
• Visualization is very important in general and in this course; install matplotlib.
• Visualize 100 random numbers like the example above (both steps are sketched below).
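A minimal sketch of the two warm-up steps (the plotting style is my choice):

import numpy as np
import matplotlib.pyplot as plt

# Five standard-normal samples: a sanity check that numpy is good to go.
print(np.random.randn(5))

# Visualize 100 random numbers as points.
samples = np.random.randn(100)
plt.plot(samples, 'co')  # 'co' = cyan points, no connecting line
plt.show()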
13. Normal random
We got 100 normally distributed numbers; let's create a histogram of them.
By default our normal distribution has mean 0 and variance 1: $X \sim N(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma^2$ is the variance.
Create two 10x10 matrices, A = N(0, 1) and B = N(3, 16), and plot their histograms.
Hint: flatten each matrix before passing it to the histogram.
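One way to do it; note that numpy's normal sampler takes the standard deviation, not the variance, so N(3, 16) means scale=4:

import numpy as np
import matplotlib.pyplot as plt

A = np.random.normal(loc=0, scale=1, size=(10, 10))  # A ~ N(0, 1)
B = np.random.normal(loc=3, scale=4, size=(10, 10))  # B ~ N(3, 16)

# plt.hist expects a 1-D array, hence the flatten.
plt.hist(A.flatten(), bins=20, alpha=0.5, label='A ~ N(0, 1)')
plt.hist(B.flatten(), bins=20, alpha=0.5, label='B ~ N(3, 16)')
plt.legend()
plt.show()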
14. Normal random
You can switch between the standard normal $Z \sim N(0, 1^2)$ and any normal via $X = \mu + \sigma Z$.
Code a function that multiplies two matrices explicitly. You can assume the inputs are ndarrays; don't use numpy's matrix-multiplication operator.
def mul_matrix(a, b):
    pass
Hint: a for loop goes over rows in numpy; matrix.T is the transpose.
A possible implementation follows.
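A straightforward triple-loop sketch, checked against numpy's own operator:

import numpy as np

def mul_matrix(a, b):
    # a is (n, m), b is (m, p); result is (n, p).
    n, m = a.shape
    m2, p = b.shape
    assert m == m2, "inner dimensions must match"
    result = np.zeros((n, p))
    for i in range(n):          # rows of a
        for j in range(p):      # columns of b
            for k in range(m):  # dot product of row i with column j
                result[i, j] += a[i, k] * b[k, j]
    return result

a, b = np.random.randn(3, 4), np.random.randn(4, 2)
assert np.allclose(mul_matrix(a, b), a @ b)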
16. Neural networks – background and inspiration
It is pretty common to compare neural networks to how our brain works:
• It couples well with the term AI.
• There is some sense in it, as many different studies show; yet we are quite far from really understanding how the brain works.
[Diagram: a single neuron computing $f(\sum_{k=0}^{n} W_k X_k)$ from inputs $X_1, X_2, X_3$ and weights $W_1, W_2, W_3$]
17. Artificial neural networks
$\text{Output} = f(\sum_{k=0}^{n} W_k X_k)$, where:
• $WX$ – inputs multiplied by weights
• $f(x)$ is an activation function
Common activation functions: sigmoid, relu, tanh, linear, …
1. Code the sigmoid function. Use numpy; z can be a vector.
def sigmoid(z):
    …
2. Plot 100 points of your sigmoid.
Hint: plt.plot(X, Y, 'co') draws points only.
A sketch of both follows.
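A minimal sketch (the plotting range is my choice):

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # np.exp broadcasts, so z may be a scalar, a vector, or a matrix.
    return 1.0 / (1.0 + np.exp(-z))

# Plot 100 points of the sigmoid.
X = np.linspace(-6, 6, 100)
plt.plot(X, sigmoid(X), 'co')  # points only
plt.show()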
18. A synthetic problem for our neural network
One of the biggest challenges with deep learning is data, and a lot of it.
We will sidestep this problem with a synthetic one: we will model a predefined function.
We usually divide our dataset into (at least) two groups:
• Training, ~70% of our data.
• Test, ~30% of our data.
In real-life cases you are likely to have a validation set as well.
Our problem is to predict some function's behavior given points from that function, later comparing our model against newly generated points from the function.
Let's see an example.
19. Utils.py
You got the utils.py module; it contains some helper functions.
Git repo: https://gitlab.com/eshlomo/EazyDnn , utils.py is under base_network.
Let's generate a pattern:
• Signature: the function takes a range, a number of samples, and a generator, which is the function itself.
• Generate 100 samples of $f(x) = x^2$ between -1 and 1.
• Plot the generated function.
A possible sketch follows.
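A hypothetical stand-in for the utils.py helper, following the signature described above (the real code lives in the repo and may differ):

import numpy as np
import matplotlib.pyplot as plt

def generate_pattern(start, stop, n_samples, generator):
    # Sample the generator function evenly over [start, stop].
    X = np.linspace(start, stop, n_samples)
    return X, generator(X)

# 100 samples of f(x) = x^2 between -1 and 1.
X, Y = generate_pattern(-1, 1, 100, lambda x: x ** 2)
plt.plot(X, Y, 'co')
plt.show()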
20. Before we go into our network
• This is a lot of info to push into a short time.
• You are likely not to be able to follow all the details in real time; to fully feel you got it you need to train and train and …
• We are going through these details to give you a solid base for self-learning.
• Don't be too alarmed by the math.
• Feel free to contact me.
21. Let's model a line
Previous exercise solution: $f(x) = x^2$.
We will create the following network, $x \rightarrow f(x)$, to model $f(x) = x$:
[Diagram: input $x$, a hidden layer with weights $w_{11}^{(1)}, w_{12}^{(1)}, w_{13}^{(1)}$, and an output layer with weights $w_{11}^{(2)}, w_{21}^{(2)}$; the weight vectors are $w^{(1)}$ and $w^{(2)}$]
• Generate 100 samples of $f(x) = x$ between 0 and 1.
• Plot the generated function.
22. Network basics
We are going to train our network to estimate our line. In other words:
• We are going to find the best vectors $w^{(1)}, w^{(2)}$ that make our network model our line.
• Training is an iterative process in which, in every iteration, we reduce our model's error by small changes to our weight vectors.
• For that purpose we define a cost (error) function on our model.
[Plot: previous exercise solution, $f(x) = x$]
23. Cost function
• Let's mark our model output as $\hat{Y}$ and our real output as $Y$.
• We use the quadratic cost (marked with J), also known as the mean squared error or sum of squared errors (and equivalent to maximum likelihood under Gaussian noise): $\text{Error} = J = \frac{1}{2}(Y - \hat{Y})^2$
• Keep in mind we know the real output; we have training data == ground truth == annotated data.
• Since we want to minimize our error, we would like to move our weights against the derivative at any given iteration.
• This is similar to finding the minimum of a function in calculus, only done numerically. This process is called gradient descent.
24. Gradient descent
A process in which, every iteration, we:
• Predict our output – forward pass
• Calculate our error using our cost function on the prediction vs. the ground truth
• Calculate the error's derivative with respect to our weights
• Update each weight by a small step Δ opposite to the gradient direction (to minimize) – backward pass
• The step size Δ is called our learning rate
Let's start creating our network. Create a class that will manage it (a sketch follows):
• 3 layers, whose sizes are constructor parameters
• 2 weight matrices
• Init all weights with standard random normal values
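A minimal sketch of such a class; the class and attribute names are mine, not the course's reference solution, and samples are assumed to be stored as rows:

import numpy as np

class Network:
    def __init__(self, input_size, hidden_size, output_size):
        # 3 layers whose sizes are constructor parameters.
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        # 2 weight matrices, initialized with standard random normal values.
        self.W1 = np.random.randn(input_size, hidden_size)   # input -> hidden
        self.W2 = np.random.randn(hidden_size, output_size)  # hidden -> output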
25. Our forward pass
[Diagram: inputs $X_1, X_2, X_3$ with weights $W_1, W_2, W_3$ feeding $\sum_{k=0}^{n} W_k X_k$, followed by a sigmoid; intermediate values labeled $Z^{(2)}$, $a^{(2)}$]
• Add a method to our class, called forward.
• This method will calculate $\hat{Y}$, our model's predicted output.
• Use the following naming conventions:
• $Z^{(n)} = W^{(n-1)} X$
• $a^{(n)} = \text{sigmoid}(Z^{(n)}) = f_a(Z^{(n)})$
• forward will return our model output, AKA the sigmoid of the last layer's activation sum: $f_a(Z^{(3)})$.
A sketch follows.
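A sketch of forward using the slide's naming and the Network class and sigmoid function from the earlier sketches; with samples as rows, the product order becomes X @ W:

def forward(self, X):
    self.z2 = X @ self.W1        # Z(2): W(1) applied to the input
    self.a2 = sigmoid(self.z2)   # a(2) = sigmoid(Z(2))
    self.z3 = self.a2 @ self.W2  # Z(3) = a(2) W(2)
    return sigmoid(self.z3)      # model output y_hat = f_a(Z(3))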
26. Let's add our cost function
• Add a method to our class, called cost (sketched below).
• This method will calculate our cost for every iteration.
• It gets our input and output as parameters and returns the cost $J = \frac{1}{2}(Y - \hat{Y})^2$.
• Keep in mind $\hat{Y} = \text{forward}(X)$.
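A sketch, summing the per-sample quadratic cost over the whole dataset (whether to sum or average is a convention; I sum here):

def cost(self, X, Y):
    # J = 1/2 * sum((Y - y_hat)^2), with y_hat = forward(X)
    y_hat = self.forward(X)
    return 0.5 * np.sum((Y - y_hat) ** 2)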
27. Back propagation
Once we have the cost (error), we want to calculate the derivative of the cost with respect to each weight.
Remember: we want to move each weight in the direction opposite to its error gradient.
We want to minimize $J = \frac{1}{2}(Y - \hat{Y})^2$, where $\hat{Y} = f(WX)$, so $J(W) = \frac{1}{2}(Y - f(WX))^2$.
So we need to calculate $\frac{\partial J}{\partial W}$.
We have a composition of functions, so we need to use the chain rule.
28. The chain rule
• https://en.wikipedia.org/wiki/Chain_rule
• A way to compute the derivative of a function composition.
We want to calculate $\frac{\partial J}{\partial W}$, i.e. both $\frac{\partial J}{\partial W^{(1)}}$ and $\frac{\partial J}{\partial W^{(2)}}$. Let's do some chaining, starting with $\frac{\partial J}{\partial W^{(2)}}$:
$\frac{\partial J}{\partial W^{(2)}} = \frac{\partial}{\partial W^{(2)}} \frac{1}{2}(Y - \hat{Y})^2 = -(Y - \hat{Y}) \frac{\partial \hat{Y}}{\partial W^{(2)}}$
$\hat{Y} = f_a(Z^{(3)})$, so $\frac{\partial \hat{Y}}{\partial W^{(2)}} = \frac{\partial \hat{Y}}{\partial Z^{(3)}} \frac{\partial Z^{(3)}}{\partial W^{(2)}}$, meaning we need to calculate the sigmoid's derivative.
$Z^{(3)} = a^{(2)} W^{(2)}$, so $\frac{\partial Z^{(3)}}{\partial W^{(2)}} = a^{(2)}$: linear propagation of the error per weight.
29. Some derivatives
Code a function called sigmoidPrime that calculates the sigmoid derivative for a matrix Z.
Code a function called costPrime that calculates $\frac{\partial J}{\partial W^{(1)}}$ and $\frac{\partial J}{\partial W^{(2)}}$.
Sigmoid derivative: $\frac{d}{dz} \frac{1}{1+e^{-z}} = \frac{e^{-z}}{(1+e^{-z})^2}$
A sketch of both follows.
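A sketch under the slides' naming, assuming the forward pass sketched earlier (which caches z2, a2, z3); costPrime applies the chain-rule result from slide 28 layer by layer:

def sigmoidPrime(z):
    # d/dz sigmoid(z) = e^-z / (1 + e^-z)^2
    return np.exp(-z) / (1 + np.exp(-z)) ** 2

def costPrime(self, X, Y):
    y_hat = self.forward(X)
    # Error signal at the output layer: -(Y - y_hat) * f_a'(Z(3)).
    delta3 = -(Y - y_hat) * sigmoidPrime(self.z3)
    dJdW2 = self.a2.T @ delta3
    # Propagate the error back through W2 and the hidden activation.
    delta2 = (delta3 @ self.W2.T) * sigmoidPrime(self.z2)
    dJdW1 = X.T @ delta2
    return dJdW1, dJdW2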
31. Time for some training
Go over the methods you have just added. Can you tell what they are doing?
In utils.py there is a class linear_trainer – what does it do?
In utils.py there is a method test_line – what does it do?
Create a network instance plus training and test data, and train your network using the function test_line.
32. Let's run some more
The default number of training iterations is 100; let's make it 10000.
Much better. Now let's try 100K. Hmm… same result. Ideas?
33. When the error is too high
We usually tend to (in a few minutes we will get into the why):
• Train for more time
• Get more data
• Get a bigger (deeper) model, which usually comes with more data
Do the following:
• Install scipy
• Increase your hidden layer to size 30
• Replace the linear trainer with BFGS_trainer inside the method test_line – find it in utils.py
34. Optimization
Gradient descent looks for a minimum and can suffer from these problems:
• Getting stuck in a local minimum
• Getting stuck on a plateau
• A learning rate too big to reach the minimum, bouncing…
Read more @ http://sebastianruder.com/optimizing-gradient-descent/
35. The bias/variance tradeoff
We can look at our model error as follows:
[Diagram: total error as the combination of noise and model error]
Our error usually comes from a combination of these two. These are all equivalent:
• High variance = modeling noise = not enough data = model too big = overfit
• High bias = model too simple = underfit
36. Bias / Variance
[Three plots of the same data (x range 0-15, y range 0-160): a good model, a high-bias fit, and a high-variance fit]
How can you tell which one of these you have?
37. Rules of thumb regarding bias/variance
• Good accuracy on training and test → good model
• Good accuracy on training, poor on test → overfit
• Poor on both → underfit
Set our training iterations back to 1000, still BFGS.
Generate the training data as before on the range 0-1, but the test data on the range 0-2.
How does it look? Can you guess why?
38. Deep neural nets have limits
• In general, the network is trained on bounded data.
• It is likely not to generalize well outside those bounds.
• So you need your datasets to contain the full data range, OR
• Have a more suitable model; for curve prediction (time series), an RNN might have been a better choice here.
• Try to fit the following curve: $2.5 e^{-x/2} \cos(\pi x)$, range 0-10.
• Install tensorflow (CPU version; make sure you are on Python 3.5.x).
• Install keras.
• Create our model using Keras (Google time…), as sketched below:
• Sequential model
• Dense layers implement the weighted sum
• Use relu activation
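A minimal Keras sketch for this exercise; the layer sizes and training settings are my guesses, not the course's reference solution:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# The target curve, sampled on the range 0-10.
X = np.linspace(0, 10, 1000).reshape(-1, 1)
Y = 2.5 * np.exp(-X / 2) * np.cos(np.pi * X)

# Sequential model built from Dense (weighted-sum) layers with relu.
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=1))
model.add(Dense(64, activation='relu'))
model.add(Dense(1))  # linear output for regression
model.compile(optimizer='adam', loss='mse')
model.fit(X, Y, epochs=200, verbose=0)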
39. Next steps if you wish to go deeper
• CNN and Caffe: http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/
• Udacity deep learning course (TF examples walkthrough)
• Andrew Ng's ML course on Coursera
• Geoff Hinton's neural networks course on Coursera
• Stanford CS231 on YouTube
• RNN: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
• A quick tour of different networks, a great YouTube series:
• https://www.youtube.com/playlist?list=PLjJh1vlSEYgvGod9wWiydumYl8hOXixNu