Introduction to Neural Networks and Deep Learning from Scratch
http://ahmedbesbes.com
https://github.com/ahmedbesbes
• Five decades of research in machine learning
• Cloud services and hardware (CPU, GPU, TPU)
• Lots of data from “the internet”
• Tools and culture of collaborative and reproducible science
• Resources and efforts from large corporations
[Alexa]
Oh Zeus.
Biological model vs. computational model:
• Incoming impulses ↔ Inputs
• Synapses ↔ Weights
• Outcoming impulse ↔ Output activation
• Firing rate ↔ Activation function
A single neuron with inputs $x_1, \dots, x_d$, weights $w_1, \dots, w_d$ and bias $b$ computes the activation

$a = \sigma(x \cdot w^T + b) = \sigma\left(\sum_{i=1}^{d} w_i x_i + b\right)$

• x ∈ ℝᵈ: input vector
• w ∈ ℝᵈ: weights
• b: bias (scalar)
• $z = x \cdot w^T + b$: pre-activation
• a: output scalar, or activation
• w, b: weights and bias (the learnable parameters of the neuron)

More generally, with an arbitrary activation function f:

$a = f\left(\sum_{i=1}^{d} w_i x_i + b\right)$
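A minimal numpy sketch of this computation (the weights and input values here are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# illustrative values: d = 3 inputs
x = np.array([0.5, -1.0, 2.0])   # input vector x in R^d
w = np.array([0.1, 0.4, -0.2])   # weights w in R^d
b = 0.05                         # scalar bias

z = np.dot(w, x) + b             # pre-activation z = x . w^T + b
a = sigmoid(z)                   # activation a = sigma(z)
print(a)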
[Diagram: a network with inputs x1, x2 and outputs y1, y2 — input layer, hidden layer, output layer]
What happens inside the hidden layer(s):
• Activations of the previous layer's neurons become inputs to the neurons of the next layer
How to interpret that?
• Intermediate non-linear computations ~ feature engineering
• Transformation of the input space
• New representations of the data over one or many layers
[Diagram: inputs x1, x2 feed a hidden layer, which produces the outputs (predictions) y1, y2 — input layer, hidden layer, output layer]
George Cybenko (universal approximation theorem: a feed-forward network with a single hidden layer and a sigmoidal activation can approximate any continuous function on a compact set)
The XOR truth table:
X1 X2 | Y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0
[Diagram: a 2-2-1 network solving XOR. Inputs x1, x2 connect to hidden units h1 (weights 20, 20, bias -10) and h2 (weights -20, -20, bias 30); h1 and h2 connect to the output o (weights 20, 20, bias -30)]
h1 = σ(20·x1 + 20·x2 - 10)
h2 = σ(-20·x1 - 20·x2 + 30)
o = σ(20·h1 + 20·h2 - 30)
X1 X2 h1 h2 O
0 0 0 1 0
0 1 1 1 1
1 0 1 1 1
1 1 1 0 0
h1 = σ (20 * 0 + 20 * 0 – 10) = σ ( -10) ~ 0
h1 = σ (20 * 0 + 20 * 1 – 10) = σ (10) ~ 1
h1 = σ (20 * 1 + 20 * 0 – 10) = σ (10) ~ 1
h1 = σ (20 * 1 + 20 * 1 – 10) = σ (30) ~ 1
h2 = σ (-20 * 0 - 20 * 0 + 30) = σ (30) ~ 1
h2 = σ (-20 * 0 - 20 * 1 + 30) = σ (10) ~ 1
h2 = σ (-20 * 1 - 20 * 0 + 30) = σ (10) ~ 1
h2 = σ (-20 * 1 - 20 * 1 + 30) = σ (-10) ~ 0
o = σ (20 * 0 + 20 * 1 - 30) = σ (-10) ~ 0
o = σ (20 * 1 + 20 * 1 - 30) = σ (10) ~ 1
o = σ (20 * 1 + 20 * 1 - 30) = σ (10) ~ 1
o = σ (20 * 1 + 20 * 0 - 30) = σ (-10) ~ 0
X1 X2 h1
0 0 0
0 1 1
1 0 1
1 1 1
X1 X2 h2
0 0 1
0 1 1
1 0 1
1 1 0
h1 h2 O
0 1 0
1 1 1
1 1 1
1 0 0
h1 = X1 OR X2    h2 = NOT(X1 AND X2) = X1 NAND X2    o = h1 AND h2
[Diagram: the same 2-2-1 network with each unit labeled by the Boolean function it computes — h1: X1 OR X2, h2: X1 NAND X2, o: h1 AND h2]
[Plot: the four XOR points mapped into the (h1, h2) space]
X1 X2 h1 h2 O Y
0 0 0 1 0 0
0 1 1 1 1 1
1 0 1 1 1 1
1 1 1 0 0 0
A linearly separable problem: the network learned a new space (h1, h2) in which the data is linearly separable.
h1 = σ (20x1 + 20x2 -10)
h2 = σ (-20x1 - 20x2 + 30)
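A quick numpy check of the hand-crafted XOR network above (a verification sketch; only the weights and biases come from the slides):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# the four XOR inputs, one per row
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

h1 = sigmoid(20 * X[:, 0] + 20 * X[:, 1] - 10)   # X1 OR X2
h2 = sigmoid(-20 * X[:, 0] - 20 * X[:, 1] + 30)  # X1 NAND X2
o = sigmoid(20 * h1 + 20 * h2 - 30)              # h1 AND h2

print(np.round(o))  # [0. 1. 1. 0.] -> XOR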
https://playground.tensorflow.org/
Wait … How did we come up with the weights
to solve the XOR problem?
…
We trained the network!
[Pipeline diagram: 0 - input: raw pixels → 1 - forward propagation → 2 - prediction ("Car") → 3 - loss computation]
$\ell\big(f(x^{(i)}; W),\, y^{(i)}\big)$

• $\ell$: loss function
• $f(x^{(i)}; W)$: model prediction ("car")
• $y^{(i)}$: ground-truth label ("boat")
• $x^{(i)}$: training example
• $W$: parameters of the network

The loss function quantifies the cost that we pay when misclassifying a boat as a car.
Examples of loss functions:

Loss | Formula for a single training example | Formula for all training data | Task
Mean Square Error (MSE) | $\frac{1}{2}(\hat{y}^{(i)} - y^{(i)})^2$ | $\frac{1}{2N}\sum_{i=1}^{N}(\hat{y}^{(i)} - y^{(i)})^2$ | Regression
Cross Entropy | $-\,y^{(i)} \log(\hat{y}^{(i)})$ | $-\frac{1}{N}\sum_{i=1}^{N} y^{(i)} \log(\hat{y}^{(i)})$ | Classification
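A small numpy illustration of both losses on toy values (the numbers are illustrative, not from the slides):

import numpy as np

# MSE on a toy regression problem
y_true_reg = np.array([3.0, -0.5, 2.0])
y_pred_reg = np.array([2.5, 0.0, 2.0])
mse = (1 / (2 * len(y_true_reg))) * np.sum((y_pred_reg - y_true_reg) ** 2)

# cross-entropy on a toy 3-class problem with one-hot labels
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
cross_entropy = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(mse, cross_entropy)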
Optimization problem:

$\min_{W} \; \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x^{(i)}; W),\, y^{(i)}\big)$

• $\ell$: loss function
• $f(x^{(i)}; W)$: model prediction
• $x^{(i)}$: training example
• $y^{(i)}$: example label
• $W$: model parameters
• the sum is an average over the training set

Training the network means finding the parameters that minimize the average loss on the training set.
Gradient descent algorithm:

$w_{n+1} \leftarrow w_n - \eta \, \frac{df(w_n)}{dw}, \qquad \eta > 0$
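A tiny worked example of this update rule on a one-dimensional function (the function and learning rate are illustrative):

# minimize f(w) = (w - 3)^2 with gradient descent
# df/dw = 2 * (w - 3)
w = 0.0      # initial value
eta = 0.1    # learning rate

for n in range(50):
    grad = 2 * (w - 3)
    w = w - eta * grad   # w_{n+1} <- w_n - eta * df/dw

print(w)  # close to 3, the minimizer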
$W^{[l]}_{n+1} \leftarrow W^{[l]}_{n} - \eta \, \nabla_{W^{[l]}} \left( \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x^{(i)}; W),\, y^{(i)}\big) \right)$

• $W^{[l]}_{n+1}$: weight values of layer l at iteration n + 1
• $W^{[l]}_{n}$: weight values of layer l at iteration n
• $\eta$: learning rate
• $\nabla_{W^{[l]}}(\cdot)$: gradient of the average training loss w.r.t. the weights of layer l
[Pipeline diagram: 0 - input: raw pixels → 1 - forward propagation → 2 - prediction ("Car") → 3 - loss computation → 4 - backward propagation of the gradients $\frac{\partial L}{\partial W^{[6]}}, \frac{\partial L}{\partial W^{[5]}}, \dots, \frac{\partial L}{\partial W^{[1]}}$ → 5 - weight update using gradient descent]
A single training example at each step: stochastic gradient descent.
A batch of training examples at each step: (mini-)batch gradient descent.
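A runnable sketch contrasting the two update schedules on a toy one-parameter regression (the data and learning rate are illustrative):

import numpy as np

# toy linear regression y = 2x, fit with MSE
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2 * X

def grad(w, xs, ys):
    # gradient of the average loss 1/(2n) * sum (w*x - y)^2 w.r.t. w
    return np.mean((w * xs - ys) * xs)

eta = 0.5

# stochastic gradient descent: one example per update
w_sgd = 0.0
for x_i, y_i in zip(X, y):
    w_sgd -= eta * grad(w_sgd, np.array([x_i]), np.array([y_i]))

# batch gradient descent: all examples per update
w_batch = 0.0
for _ in range(100):
    w_batch -= eta * grad(w_batch, X, y)

print(w_sgd, w_batch)  # both approach 2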
Terms | Definition | Formula
$z_j^l$ | Weighted input to neuron j in layer l (pre-activation) | $z_j^l = \sum_{k=1}^{n_{l-1}} w_{jk}^{l} \, a_k^{l-1} + b_j^{l}$
$a_j^l$ | Activation of neuron j in layer l | $a_j^l = \sigma(z_j^l)$
$b_j^l$ | Bias of neuron j in layer l | -
$w_{jk}^l$ | Weight connecting neuron k in layer l-1 to neuron j in layer l | -
[Diagram: a network with layers L0, L1, L2; activations $a_j^l$, pre-activations $z_j^l$ and weights such as $w_{22}^{1}$ and $w_{23}^{2}$ labeled on the connections]
[Diagram: layer L1 computes $z^1 = W^1 a^0 + b^1$ from the inputs X1, X2, with weights $w_{11}^{1}, w_{12}^{1}, w_{21}^{1}, w_{22}^{1}, w_{31}^{1}, w_{32}^{1}$]
[Diagram: layer L2 computes $z^2 = W^2 a^1 + b^2$ and produces the outputs $\hat{y}_1, \hat{y}_2$, with weights $w_{11}^{2}, w_{12}^{2}, w_{13}^{2}, w_{21}^{2}, w_{22}^{2}, w_{23}^{2}$]
[Diagram: the same structure for generic consecutive layers l-2, l-1, l]
Term | Definition | Formula | Shape
$z^l$ | Vector of weighted inputs to the neurons in layer l | $z^l = W^l a^{l-1} + b^l$ | $(n_l,)$
$a^l$ | Vector of neuron activations in layer l | $a^l = \sigma(z^l)$ | $(n_l,)$
$b^l$ | Vector of neuron biases in layer l | - | $(n_l,)$
$W^l$ | Weight matrix connecting the neurons in layer l-1 to the neurons in layer l | - | $(n_l, n_{l-1})$
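A shape-check sketch of one vectorized layer in numpy (the layer sizes are chosen arbitrarily for illustration):

import numpy as np

n_prev, n_l = 2, 3                      # n_{l-1} = 2 neurons, n_l = 3 neurons
a_prev = np.random.randn(n_prev)        # a^{l-1}, shape (n_{l-1},)
W = np.random.randn(n_l, n_prev)        # W^l, shape (n_l, n_{l-1})
b = np.random.randn(n_l)                # b^l, shape (n_l,)

z = W @ a_prev + b                      # z^l = W^l a^{l-1} + b^l, shape (n_l,)
a = 1 / (1 + np.exp(-z))                # a^l = sigma(z^l), shape (n_l,)
print(z.shape, a.shape)                 # (3,) (3,)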
Chain rule
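The chain rule is what lets us push the loss gradient backwards through these compositions; as a sketch for one weight of layer l, consistent with the notation above:

$\frac{\partial L}{\partial w_{jk}^{l}} = \frac{\partial L}{\partial z_j^{l}} \cdot \frac{\partial z_j^{l}}{\partial w_{jk}^{l}} = \frac{\partial L}{\partial a_j^{l}} \, \sigma'(z_j^{l}) \, a_k^{l-1}$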
[Grid of experiments: number of layers ∈ {2, 5, 8, 10, 20} against learning rate ∈ {1e-3, 1e-2, 1e-1, 0.5, 1}]
Don’t:
• Initialize weights to 0 → this causes symmetry and the same gradient for all weights
• Initialize very small weights → this causes very small gradients
Do:
• He initialization: w = np.random.randn(D, H) * np.sqrt(2.0 / D), where D is the fan-in and H the fan-out
• Initialize all biases with a small constant value ~ 0.01
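A small sketch of He initialization for a list of layer sizes (the sizes are illustrative; the NeuralNetwork class below does something very similar in its constructor):

import numpy as np

layer_sizes = [2, 3, 1]   # illustrative architecture: 2 inputs, 3 hidden units, 1 output

weights = [np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.full((n_out, 1), 0.01) for n_out in layer_sizes[1:]]

print([w.shape for w in weights])  # [(3, 2), (1, 3)]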
Dropout: during training, randomly zero out a fraction of the activations at each step, which reduces overfitting by preventing neurons from co-adapting.
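A minimal numpy sketch of (inverted) dropout applied to one layer's activations (keep_prob and the activations are illustrative):

import numpy as np

def dropout(a, keep_prob=0.8):
    # inverted dropout: zero out units with probability 1 - keep_prob,
    # then rescale so the expected activation stays unchanged
    mask = (np.random.rand(*a.shape) < keep_prob).astype(a.dtype)
    return a * mask / keep_prob

a = np.random.randn(3, 4)         # activations of some layer
print(dropout(a, keep_prob=0.8))  # at test time, dropout is simply turned off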
Ideal situation
http://cs231n.github.io/
http://neuralnetworksanddeeplearning.com/
https://www.coursera.org/learn/neural-networks-deep-learning
def activation(z, derivative=False):
if derivative:
return activation(z) * (1 - activation(z))
else:
return 1 / (1 + np.exp(-z))
def cost_function(y_true, y_pred):
n = y_pred.shape[0]
cost = (1./(2*n)) * np.sum((y_true - y_pred) ** 2)
return cost
def cost_function_prime(y_true, y_pred):
cost_prime = y_pred - y_true
return cost_prime
import numpy as np
from sklearn.metrics import accuracy_score
from tqdm import tqdm, tqdm_notebook
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
Basic imports
Sigmoid activation function
$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$
Mean square error (loss)
class NeuralNetwork(object):
def __init__(self, size):
self.size = size
self.weights = [np.random.randn(self.size[i], self.size[i-1]) * np.sqrt(2 / self.size[i-1]) for i in range(1, len(self.size))]
self.biases = [np.random.rand(n, 1) for n in self.size[1:]]
def forward(self, input):
# input shape : (input_shape, batch_size)
a = input
pre_activations = []
activations = [a]
for w, b in zip(self.weights, self.biases):
z = np.dot(w, a) + b
a = activation(z)
pre_activations.append(z)
activations.append(a)
return a, pre_activations, activations
$a^0 = X$
$z^l = W^l a^{l-1} + b^l$
$a^l = \sigma(z^l)$
$a^L = \hat{Y}$
def compute_deltas(self, pre_activations, y_true, y_pred):
delta_L = cost_function_prime(y_true, y_pred) * activation(pre_activations[-1], derivative=True)
deltas = [0] * (len(self.size) - 1)
deltas[-1] = delta_L
for l in range(len(deltas) - 2, -1, -1):
delta = np.dot(self.weights[l + 1].transpose(), deltas[l + 1]) * activation(pre_activations[l], derivative=True)
deltas[l] = delta
return deltas
$\delta^L = \frac{\partial L}{\partial z^L} = \nabla_a L \odot \sigma'(z^L)$
$\delta^l = \frac{\partial L}{\partial z^l} = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l)$
def backpropagate(self, deltas, pre_activations, activations):
dW = []
db = []
deltas = [0] + deltas
for l in range(1, len(self.size)):
dW_l = np.dot(deltas[l], activations[l-1].transpose())
db_l = deltas[l]
dW.append(dW_l)
db.append(np.expand_dims(db_l.mean(axis=1), 1))
return dW, db
$\frac{\partial L}{\partial W^l} = \frac{\partial L}{\partial z^l} \, (a^{l-1})^T = \delta^l \, (a^{l-1})^T$
$\frac{\partial L}{\partial b^l} = \frac{\partial L}{\partial z^l} = \delta^l$
def train(self, X, y, batch_size, epochs, learning_rate, validation_split=0.2,
print_every=10):
history_train_losses = []
history_train_accuracies = []
history_test_losses = []
history_test_accuracies = []
x_train, x_test, y_train, y_test = train_test_split(X.T, y.T,
test_size=validation_split)
x_train, x_test, y_train, y_test = x_train.T, x_test.T, y_train.T, y_test.T
for e in tqdm_notebook(range(epochs)):
if x_train.shape[1] % batch_size == 0:
n_batches = int(x_train.shape[1] / batch_size)
else:
n_batches = int(x_train.shape[1] / batch_size ) - 1
x_train, y_train = shuffle(x_train.T, y_train.T, random_state=0)
x_train, y_train = x_train.T, y_train.T
batches_x = [x_train[:, batch_size*i:batch_size*(i+1)] for i in
range(0, n_batches)]
batches_y = [y_train[:, batch_size*i:batch_size*(i+1)] for i in
range(0, n_batches)]
train_losses = []
train_accuracies = []
test_losses = []
test_accuracies = []
train/test split
Preparation of mini batches of data and
labels
Keep track of KPIs (accuracy/loss) on the train and validation sets
Training over mini batches
dw_per_epoch = [np.zeros(w.shape) for w in self.weights]
db_per_epoch = [np.zeros(b.shape) for b in self.biases]
for batch_x, batch_y in zip(batches_x, batches_y):
batch_y_pred, pre_activations, activations = self.forward(batch_x)
deltas = self.compute_deltas(pre_activations, batch_y, batch_y_pred)
dW, db = self.backpropagate(deltas, pre_activations, activations)
for i, (dw_i, db_i) in enumerate(zip(dW, db)):
dw_per_epoch[i] += dw_i / batch_size
db_per_epoch[i] += db_i / batch_size
batch_y_train_pred = self.predict(batch_x)
train_loss = cost_function(batch_y, batch_y_train_pred)
train_losses.append(train_loss)
train_accuracy = accuracy_score(batch_y.T, batch_y_train_pred.T)
train_accuracies.append(train_accuracy)
batch_y_test_pred = self.predict(x_test)
test_loss = cost_function(y_test, batch_y_test_pred)
test_losses.append(test_loss)
test_accuracy = accuracy_score(y_test.T, batch_y_test_pred.T)
test_accuracies.append(test_accuracy)
# weight update
for i, (dw_epoch, db_epoch) in enumerate(zip(dw_per_epoch, db_per_epoch)):
self.weights[i] = self.weights[i] - learning_rate * dw_epoch
self.biases[i] = self.biases[i] - learning_rate * db_epoch
history_train_losses.append(np.mean(train_losses))
history_train_accuracies.append(np.mean(train_accuracies))
history_test_losses.append(np.mean(test_losses))
history_test_accuracies.append(np.mean(test_accuracies))
if e % print_every == 0:
print('Epoch {} / {} | train loss: {} | train accuracy: {} | val loss : {} | val accuracy : {}'.format(
e, epochs, np.round(np.mean(train_losses), 3), np.round(np.mean(train_accuracies), 3),
np.round(np.mean(test_losses), 3), np.round(np.mean(test_accuracies), 3)))
history = {'epochs': epochs,
'train_loss': history_train_losses,
'train_acc': history_train_accuracies,
'test_loss': history_test_losses,
'test_acc': history_test_accuracies
}
return history
def predict(self, a):
# input shape : (input_shape, batch_size)
for w, b in zip(self.weights, self.biases):
z = np.dot(w, a) + b
a = activation(z)
predictions = (a > 0.5).astype(int)
# predictions = predictions.reshape(-1)
return predictions
Monitoring
the model
performance
Inference
method
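A hypothetical usage sketch of this class on a toy binary problem (the data, architecture and hyperparameters are illustrative; note that inputs are laid out as (features, samples), as assumed by forward and train):

import numpy as np

np.random.seed(0)

# toy dataset: 500 points in 2D, label 1 if the point falls inside a centered circle
X = np.random.uniform(-1, 1, size=(2, 500))                     # shape: (features, samples)
y = (np.sum(X ** 2, axis=0) < 0.5).astype(int).reshape(1, -1)   # shape: (1, samples)

net = NeuralNetwork(size=[2, 16, 1])          # 2 inputs, 16 hidden units, 1 output
history = net.train(X, y, batch_size=16, epochs=200,
                    learning_rate=0.4, print_every=50)

predictions = net.predict(X)                  # shape: (1, samples), values 0/1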
import torch
import torch.nn as nn
import torch.optim as optim

h = 50                                 # number of hidden neurons

# architecture: one-hidden-layer neural net
net = nn.Sequential(
    nn.Linear(2, h),
    nn.ReLU(),
    nn.Linear(h, 1),
    nn.Sigmoid()
)

# defining the optimizer and the loss
optimizer = optim.SGD(net.parameters(), lr=1)
criterion = nn.BCELoss()

# training loop over the data: one epoch
for i in range(100):
    optimizer.zero_grad()              # set stored gradients to zero
    output = net(X[i])                 # forward pass
    loss = criterion(output, Y[i])     # compute the loss
    loss.backward()                    # backprop
    optimizer.step()                   # weight update
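X and Y are not defined in the snippet above; a hypothetical way to build them for the 2D XOR-like problem used earlier (illustrative only):

import torch

# 100 random 2D points, labeled by the XOR of the thresholded coordinates
X = torch.rand(100, 2)
Y = ((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5)).float().unsqueeze(1)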