3. Introduction to ANN
• First introduced in 1943 by Warren McCulloch and Walter Pitts.
• Early successes of ANNs lasted until the 1960s.
• In the early 1980s there was a revival of
interest in ANNs as new network architectures were invented.
• By the 1990s, powerful alternative Machine Learning techniques were favored by most researchers.
4. Reasons why ANN is much more profound
❑There is now a huge quantity of data.
❑The tremendous increase in computing power.
❑The training algorithms have been improved.
❑Theoretical limitations of ANNs have turned
out to be benign.
❑ANNs seem to have entered a virtuous circle of funding and progress.
7. The Perceptron
• One of the simplest ANN architectures, invented in
1957 by Frank Rosenblatt.
• It is based on a linear threshold unit (LTU).
$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{w}^T \cdot \mathbf{x}$
$h_{\mathbf{w}}(\mathbf{x}) = \mathrm{step}(z) = \mathrm{step}(\mathbf{w}^T \cdot \mathbf{x})$
9. Training Algorithm
While an epoch produces an error:
    Present the network with the next inputs from the epoch
    Err = T – O
    If Err <> 0 then
        Wj new = Wj old + LR * Ij * Err
• T: target (actual) output, O: predicted output
• LR: learning rate, Ij: the j-th input
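The training rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the names `step`, `train_perceptron` and `predict`, the bias stored as `w[0]`, the `max_epochs` cap and the AND dataset are all assumptions added for the example.

```python
def step(z):
    """Heaviside step: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def train_perceptron(samples, lr=1.0, max_epochs=100):
    """samples: list of (inputs, target) pairs. Returns weights, bias first."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)              # w[0] is the bias weight (its input is fixed at 1)
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in samples:
            x = [1.0] + list(inputs)
            out = step(sum(wj * xj for wj, xj in zip(w, x)))
            err = target - out       # Err = T - O
            if err != 0:
                errors += 1
                # Wj new = Wj old + LR * Ij * Err
                w = [wj + lr * xj * err for wj, xj in zip(w, x)]
        if errors == 0:              # the epoch produced no error: stop
            break
    return w

def predict(w, inputs):
    return step(w[0] + sum(wj * xj for wj, xj in zip(w[1:], inputs)))

# Hypothetical dataset: the logical AND function, which is linearly separable.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(AND)
print([predict(w_and, x) for x, _ in AND])
```

The loop stops as soon as a full epoch produces no error, which is only guaranteed to happen when the classes are linearly separable.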
10. XOR classification problem and an MLP
that solves it
X1 XOR X2 = (X1 AND NOT X2) OR (X2 AND NOT X1)
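The decomposition above maps directly onto a two-layer network of LTUs. The sketch below hard-codes one possible set of weights and biases; these particular values and the helper names `ltu` and `xor_mlp` are assumptions chosen for illustration, not values taken from the slide.

```python
def ltu(weights, bias, inputs):
    """Linear threshold unit: outputs 1 when the weighted sum plus bias is positive."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def xor_mlp(x1, x2):
    h1 = ltu([1, -1], -0.5, [x1, x2])    # hidden unit: x1 AND NOT x2
    h2 = ltu([-1, 1], -0.5, [x1, x2])    # hidden unit: x2 AND NOT x1
    return ltu([1, 1], -0.5, [h1, h2])   # output unit: h1 OR h2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))
```

A single LTU cannot represent XOR because the two classes are not linearly separable; the hidden layer makes them separable for the output unit.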
11. Multi-Layer Perceptron and
• An MLP is composed of one input layer, one or more
layers of LTUs, called hidden layers, and one final output layer.
• When an ANN has two or more hidden layers, it is called
a deep neural network (DNN).
13. Deep learning Problems
• First, the vanishing gradients problem (or the related
exploding gradients problem) makes lower layers
very hard to train.
• Second, with such a large network, training
would be extremely slow.
• Third, a model with millions of parameters
would severely risk overfitting the training set.
14. Gradients problems
• Gradients often get smaller as the algorithm
progresses down to the lower layers.
• As a result, the Gradient Descent update leaves the lower layer
weights virtually unchanged, and training never converges to
a good solution.
• This is called the vanishing gradients problem.
• In other cases the opposite happens: the gradients grow bigger and bigger, so many
layers get insanely large weight updates and the
algorithm diverges. This is the exploding gradients problem.
15. Solving the first problem(Van…)
A paper titled “Understanding the Difficulty of
Training Deep Feedforward Neural Networks” by
Xavier Glorot and Yoshua Bengio found a few suspects:
1. The popular logistic sigmoid activation function.
2. Weight initialization using a normal distribution with a mean of 0
and a standard deviation of 1.
3. The hyperbolic tangent function has a mean
of 0 and behaves slightly better than the
logistic function in DNNs.
16. Sigmoid activation function
• When the inputs become large (negative or positive),
the function saturates at 0 or 1, with a derivative
close to 0.
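A quick numeric check makes the saturation visible. The sketch below uses the standard logistic function σ(z) = 1 / (1 + e^(−z)) and its derivative σ(z)(1 − σ(z)); the function names and the sample inputs are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and shrinks quickly as |z| grows.
for z in (-10, -2, 0, 2, 10):
    print(z, round(sigmoid(z), 5), round(sigmoid_derivative(z), 5))
```

Because backpropagation multiplies such derivatives layer after layer, saturated units leave very little gradient for the lower layers.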
17. The problem of ReLU: max(0, z)
• It suffers from a problem known as dying ReLUs: during
training, some neurons effectively die,
meaning they stop outputting anything other than 0.
• In some cases, you may find that half of your
network’s neurons are dead.
• To solve this problem, you may want to use a
variant of the ReLU function, such as the leaky ReLU.
18. leaky ReLU (RReLU).
• In the reported experiments, the leaky variants always outperformed the strict ReLU activation
function. In fact, setting α = 0.2 (huge leak) seemed to result in
better performance than α = 0.01 (small leak).
• The authors also evaluated the randomized leaky ReLU (RReLU).
• They also evaluated the parametric leaky ReLU (PReLU), where α is learned during training.
19. Exponential linear unit (ELU)
• Outperformed all the ReLU variants in their
experiments: training time was reduced and the
neural network performed better on the test set.
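For comparison, the three activation functions discussed on these slides can be written side by side. The definitions below are the commonly used ones and are assumed here, since the slides do not spell them out: leaky ReLU is max(αz, z), and ELU is z for z > 0 and α(e^z − 1) otherwise.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # A small slope alpha for z < 0 keeps the gradient from being exactly zero.
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Smooth for z < 0 and saturating toward -alpha for very negative z.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in (-3.0, -0.5, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z, alpha=0.2), round(elu(z), 4))
```

Unlike the strict ReLU, both variants keep a nonzero output (and gradient) for negative inputs, which is what prevents neurons from dying.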
20. Batch Normalization
• The technique consists of adding an operation in the model just before
the activation function of each layer.
• Simply zero-centering and normalizing the inputs, then scaling and
shifting the result using two new parameters per layer (one for scaling,
the other for shifting).
• In other words, this operation lets the model learn the optimal scale and
mean of the inputs for each layer.
• γ is the scaling parameter for the layer.
• β is the shifting parameter (offset) for the layer.
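The operation described above can be sketched for a single feature over one mini-batch. This is a simplified training-time illustration: the function name `batch_norm`, the smoothing term `eps` and the sample values are assumptions, and a real layer would also track moving averages for use at test time.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Zero-center and normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # gamma rescales and beta shifts the normalized values; both are learned in practice.
    return [gamma * x_hat + beta for x_hat in normalized]

print(batch_norm([1.0, 2.0, 3.0, 4.0]))
print(batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0))
```

With γ = 1 and β = 0 the output has zero mean and roughly unit variance; other values let the layer recover whatever scale and mean work best.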
22. Reusing Pretrained Layers
• It is generally not a good idea to train a very
large DNN from scratch.
• Try to find an existing neural network that
accomplishes a similar task.
• Reuse the lower layers of this network.
• This is called transfer learning.
• Example: suppose you have a DNN that was trained to
classify pictures into 100 categories.
• You now want to train a
DNN to classify specific
types of vehicles.
• Freezing the Lower Layers
• Tweaking, Dropping, or
Replacing the Upper Layers.
25. Faster Optimizers
• Five ways to speed up training (and reach a better solution):
➢ Applying a good initialization strategy for the connection weights.
➢ Using a good activation function.
➢ Using Batch Normalization.
➢ Reusing parts of a pretrained network.
➢ Using a faster optimizer than the regular Gradient Descent optimizer.
• The most popular ones: Momentum optimization, Nesterov
Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.
26. Momentum Optimization Algorithm
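• Gradient Descent simply updates the weights θ by directly subtracting
the gradient of the cost function J(θ) with regard to the weights
(∇θJ(θ)) multiplied by the learning rate η (equation 1).
• Momentum optimization cares a great deal about what previous
gradients were: at each iteration, it adds the local gradient (multiplied by η) to the momentum vector m.
• It updates the weights by simply subtracting this momentum vector.
• A new hyperparameter β, simply called the momentum, which must be
set between 0 and 1, typically 0.9 (equation 2).
Gradient Descent (1): $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$
Momentum Optimization (2): $m \leftarrow \beta m + \eta \nabla_\theta J(\theta)$, then $\theta \leftarrow \theta - m$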
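The two update rules can be compared on a toy problem. The sketch below assumes the usual form of momentum (accumulate β·m plus η times the gradient, then subtract m from θ) and a hypothetical one-dimensional cost J(θ) = θ²; the function names and the chosen learning rate are illustrative.

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = theta**2 (a hypothetical example)."""
    return 2.0 * theta

def gradient_descent(theta, lr=0.01, steps=200):
    for _ in range(steps):
        theta = theta - lr * gradient(theta)    # subtract the scaled gradient
    return theta

def momentum_descent(theta, lr=0.01, beta=0.9, steps=200):
    m = 0.0
    for _ in range(steps):
        m = beta * m + lr * gradient(theta)     # accumulate past gradients
        theta = theta - m                       # subtract the momentum vector
    return theta

print(abs(gradient_descent(5.0)), abs(momentum_descent(5.0)))
```

With this small learning rate, plain Gradient Descent is still noticeably far from the minimum after 200 steps, while the momentum version has built up speed and ends much closer to 0.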
28. Nesterov Momentum optimization
▪ The only difference from vanilla Momentum
optimization is that the gradient is measured
at θ − βm rather than at θ, that is, slightly ahead in the direction
of the upcoming step (the weights are updated by subtracting m).
▪ This small tweak works because in general the
momentum vector will be pointing in the
right direction (i.e., toward the optimum).
▪ In the accompanying figure, ∇1 represents the gradient of the cost
function measured at the starting point θ,
and ∇2 represents the gradient at the point
located at θ − βm.
29. RMSProp Optimization
• RMSProp accumulates only the gradients from the most recent iterations (as
opposed to all the gradients since the beginning of training).
• It does so by using exponential decay in the first step.
• It generally performs better than Momentum optimization and
Nesterov Accelerated Gradient.
• In fact, it was the preferred optimization algorithm of many
researchers until Adam optimization came around.
30. Adam Optimization
• Stands for adaptive moment estimation.
• Combines the ideas of Momentum optimization and RMSProp.
• Steps 3 and 4 are somewhat of a technical detail: since m and s
are initialized at 0, they will be biased toward 0 at the beginning of
training, so these two steps will help boost m and s at the
beginning of training.
• Typical initialization: β1 = 0.9, β2 = 0.999, η = 0.001.
• The smoothing term ϵ is initialized to a tiny number such as $10^{-8}$ to avoid division by 0.
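Since the algorithm listing itself is not reproduced on the slide, the sketch below assumes the commonly published form of Adam, with the five steps numbered in the comments so that "steps 3 and 4" correspond to the bias corrections. The function name `adam_minimize` and the scalar toy cost are illustrative assumptions.

```python
import math

def adam_minimize(grad, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a scalar function given its gradient, using Adam-style updates."""
    m, s = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # step 1: decaying average of gradients
        s = beta2 * s + (1 - beta2) * g * g    # step 2: decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # step 3: bias-correct m
        s_hat = s / (1 - beta2 ** t)           # step 4: bias-correct s
        theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)  # step 5: update
    return theta

# Toy cost J(theta) = theta**2, so the gradient is 2 * theta.
print(adam_minimize(lambda th: 2.0 * th, 1.0, lr=0.1, steps=500))
```

Thanks to the bias corrections, the very first update already has a magnitude close to the learning rate, even though m and s start at 0.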
33. Learning rate techniques
❑ Predetermined piecewise constant learning rate
For example, set the learning rate to η0 = 0.1 at first,
then to η1 = 0.001 after 50 epochs.
❑ Performance scheduling
Measure the validation error every N steps (just like for early stopping) and
reduce the learning rate by a factor of λ when the error stops dropping.
❑ Exponential scheduling
Set the learning rate to a function of the iteration number t: $\eta(t) = \eta_0 \, 10^{-t/r}$.
This works great, but it requires tuning η0 and r. The learning rate will drop by
a factor of 10 every r steps.
❑ Power scheduling
Set the learning rate to $\eta(t) = \eta_0 (1 + t/r)^{-c}$. The hyperparameter c is typically set to 1.
This is similar to exponential scheduling, but the learning rate drops much more slowly.
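Three of these schedules are simple enough to write as plain functions. In the sketch below, the exponential form η(t) = η0 · 10^(−t/r) is inferred from the statement that the rate drops by a factor of 10 every r steps; the function names and the default values of `eta0` and `r` are assumptions for illustration.

```python
def piecewise_constant(epoch, eta0=0.1, eta1=0.001, boundary=50):
    """eta0 for the first `boundary` epochs, eta1 afterwards."""
    return eta0 if epoch < boundary else eta1

def exponential_schedule(t, eta0=0.1, r=1000):
    """Drops by a factor of 10 every r steps."""
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=1000, c=1):
    """eta0 * (1 + t/r)^(-c): halves after r steps when c = 1, then decays slowly."""
    return eta0 * (1 + t / r) ** (-c)

for t in (0, 1000, 5000):
    print(t, exponential_schedule(t), power_schedule(t))
```

Comparing the two columns shows why power scheduling is described as dropping much more slowly than exponential scheduling.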
34. Dropout
❑ It is a fairly simple algorithm: at every training step, every neuron
(including the input neurons but excluding the output neurons) has
a probability p of being temporarily “dropped out,” meaning it will
be entirely ignored during this training step, but it may be active
during the next step.
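A dropout mask for one layer and one training step can be sketched as follows. The rescaling of surviving activations by 1 / (1 − p) (often called inverted dropout) is an assumption added here so that the expected activation stays unchanged; the slide itself only describes the random dropping.

```python
import random

def dropout(activations, p=0.5, rng=random):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)      # seeded generator so the run is repeatable
layer_output = [1.0] * 10
print(dropout(layer_output, p=0.5, rng=rng))
print(dropout(layer_output, p=0.5, rng=rng))   # a different mask at the next step
```

Each call draws a fresh mask, so a neuron dropped at one step may be active at the next. At test time no neurons are dropped.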
35. Data Augmentation
❑ Consists of generating new training instances from existing ones
(for example by rotating, resizing, flipping, and cropping them), artificially boosting
the size of the training set.
❑ This will reduce overfitting, making this a regularization technique.
The trick is to generate realistic training instances.
36. Convolutional Neural Networks
❑ A convolutional neural network (or ConvNet) is a type of feed-forward
artificial neural network.
❑ The architecture of a ConvNet is designed to take advantage of the 2D
structure of an input image.
❑ A ConvNet consists of one or more convolutional layers (often
with a pooling step) followed by one or more fully connected
layers as in a standard multilayer neural network.
37. How CNN works
• For example, a ConvNet takes an image as input and
classifies it as ‘X’ or ‘O’.
38. ConvNet Layers
▪CONV layer will compute the output of neurons that are connected
to local regions in the input, each computing a dot product between
their weights and a small region they are connected to in the input.
▪RELU layer will apply an elementwise activation function, such as
the max(0,x) thresholding at zero. This leaves the size of the volume unchanged.
▪POOL layer will perform a downsampling operation along the
spatial dimensions (width, height).
▪FC (i.e. fully-connected) layer will compute the class scores,
resulting in a volume of size [1x1xN], where each of the N numbers
corresponds to the score of one of the N categories.
39. Convolutional Layer - Filters
▪ The CONV layer’s parameters consist of a set of learnable filters.
▪ Every filter is small spatially (along width and height), but
extends through the full depth of the input volume.
▪ During the forward pass, we slide (more precisely, convolve)
each filter across the width and height of the input volume and
compute dot products between the entries of the filter and the
input at any position.
40. Convolutional Layer - Filters
• Sliding the filter over the width and height of the input gives
a 2-dimensional activation map that gives the response of that filter at
every spatial position.
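The sliding dot product can be sketched for the simplest case. The example below assumes a single-channel input (depth 1), stride 1 and no padding; with a deeper input, the filter and the sum would also extend over the depth dimension. The function name, the toy image and the diagonal filter are illustrative.

```python
def convolve2d(image, kernel):
    """Slide the filter over the image and take a dot product at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

image = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
diagonal = [[1, 0], [0, 1]]          # responds to top-left to bottom-right strokes
activation_map = convolve2d(image, diagonal)
for row in activation_map:
    print(row)
```

The largest values in the activation map sit where the image contains the diagonal pattern the filter looks for.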
45. Pool Layer
▪ The pooling layers down-sample the previous layer’s feature maps.
▪ Its function is to progressively reduce the spatial size of the
representation to reduce the number of parameters and
computation in the network.
▪ The pooling layer often uses the Max operation to perform
the down sampling process.
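Max pooling can be sketched in the same style. The window size of 2 and stride of 2 used below are common defaults assumed for the example, as are the function name and the sample feature map.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the maximum of each size x size window."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(feature_map[i * stride + di][j * stride + dj]
                for di in range(size) for dj in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [3, 4, 1, 5],
]
print(max_pool2d(feature_map))
```

A 4x4 map becomes 2x2, so the spatial size shrinks by a factor of 4 while the strongest response in each window is kept.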
47. Fully connected layer
❑ Fully connected layers are the normal
flat feed-forward neural network
❑ These layers may have a non-linear
activation function or a softmax
activation in order to predict classes.
❑ To compute our output, we simply
rearrange the output matrices as a 1-D array.
48. SoftMax operation
❑ A special kind of activation layer,
usually at the end of the FC layers.
❑ Can be viewed as a fancy normalizer
(a.k.a. normalized exponential function).
❑ Produces a discrete probability distribution.
❑ Very convenient when combined
with cross-entropy loss.
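The normalized exponential and its pairing with cross-entropy can be sketched as below. Subtracting the maximum score before exponentiating is a standard numerical-stability trick added here as an assumption; it does not change the result. The function names and sample scores are illustrative.

```python
import math

def softmax(scores):
    """Turn raw class scores into a discrete probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Loss is the negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(cross_entropy(probs, 0), cross_entropy(probs, 2))
```

The loss is small when the correct class receives a high probability and grows as that probability approaches 0.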
49. Recurrent Neural Network
❑ Some problems require previous history/context in order
to be able to give proper output (speech recognition,
stock forecasting, target tracking, etc.).
❑ One way to do that is to just provide all the necessary
context in one "snap-shot" and use standard learning.
➢ How big should the snap-shot be? It varies for different
instances of the problem.
✓ If the input sequences are of fixed length, or can be
easily padded to a fixed length, they can be
collapsed into a single input vector and passed to any of the
standard pattern classification algorithms.
50. Sequential data
❑ There are many tasks that require learning a temporal sequence
❑ These problems can be broken into 3 distinct types of tasks
➢ Sequence Recognition: Produce a particular output pattern
when a specific input sequence is seen. Applications:
Sentiment Analysis, handwriting recognition
➢ Sequence Reproduction: Generate the rest of a sequence
when the network sees only part of the sequence.
Applications: Time series prediction (stock market, sun spots,
etc), language model.
➢ Temporal Association: Produce a particular output sequence
in response to a specific input sequence. Applications:
machine translation, speech generation
✓ Recurrent networks are flexible enough to solve these tasks.
51. Recurrent Networks offer a lot of flexibility:
(1) Fixed-sized input to fixed-sized output.
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified).
(4) Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence
in English and then outputs a sentence in another language).
(5) Synced sequence input and output (e.g. video classification where we wish to
label each frame of the video).
52. Recurrent Neural Networks
❑ A recurrent neural network lets the
network dynamically learn how much
context it needs in order to solve the problem.
❑ An RNN is a multilayer NN with the previous
set of hidden unit activations feeding
back into the network along with the inputs.
❑ RNNs have a “memory” which captures
information about what has been
calculated so far.
53. Recurrent neural networks
❑ Parameter sharing makes it possible to extend and apply the model to
examples of different lengths and to generalize across them.
❑ It means local connections are shared (same weights) across different
temporal instances of the hidden units.
❑ If we have to define a different function Gt for each possible sequence
length, each with its own parameters, we would not get any
generalization to sequences of a size not seen in the training set.
54. Dynamic systems
❑ A means of describing how one state develops into another state
over the course of time.
❑ Consider the classical form of a dynamical system: $s_t = f_\theta(s_{t-1})$
✓ Where $s_t$ is the system state at time t and $f_\theta$ is a mapping function.
❑ The same parameters (the same function $f_\theta$) are used for all time steps.
❑ The system can be unfolded into a flow graph by applying the definition repeatedly, e.g. $s_3 = f_\theta(f_\theta(s_1))$.
55. Dynamic systems
❑ Now consider a dynamical system driven by an external signal $x_t$: $s_t = f_\theta(s_{t-1}, x_t)$
The state $s_t$ now contains information about the whole past sequence.
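Unfolding such a system is just a loop that reuses the same function at every step. The sketch below assumes the usual recurrence, in which the new state is computed from the previous state and the current input; the particular transition function (a tanh of a weighted sum) and its weights `w_s` and `w_x` are hypothetical stand-ins for f and its parameters θ.

```python
import math

def f(state, x, w_s=0.5, w_x=1.0):
    """Hypothetical transition function; (w_s, w_x) play the role of theta."""
    return math.tanh(w_s * state + w_x * x)

def unfold(xs, s0=0.0):
    """Apply the same f at every time step and collect the states."""
    states = []
    s = s0
    for x in xs:
        s = f(s, x)          # new state from previous state and current input
        states.append(s)
    return states

print(unfold([1.0, 0.0, 0.0]))
```

Even though the input is 0 after the first step, the later states stay nonzero: the state carries information about the earlier part of the sequence.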
57. Cost function
❑ The total loss for a given input/target sequence pair
(x, y), measured in cross entropy
$L(y, \hat{y}) = \sum_t L_t = \sum_t -y_t \log \hat{y}_t$
• where $y_t$ is the category that should be associated
with time step t in the output sequence, and $\hat{y}_t$ is the corresponding predicted output.
58. Computing the gradient in
Using generalized back-propagation, one can obtain the so-called
Back-propagation Through Time (BPTT) algorithm.
We can then iterate backwards in time to back-propagate
gradients through time, from t = T − 1 down to t = 1,
noting that $s_t$ (for t < T) has as descendants both $o_t$ and $s_{t+1}$.
59. Exploding or vanishing gradient
❑ In recurrent nets (also in very deep nets), the final output is the
composition of a large number of non-linear transformations.
❑ Even if each of these non-linear transformations is smooth, their
composition might not be.
❑ The derivative (i.e. Jacobian matrix) through the whole composition
will tend to be either very small or very large.
❑ For example, suppose all numbers in the product are scalar and have the
same value α > 0. As the number of factors T goes to ∞, $\alpha^T \to \infty$ if α > 1
and $\alpha^T \to 0$ if α < 1.
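The scalar example is easy to check numerically. The sketch below multiplies the same factor T times; the function name and the values 0.9 and 1.1 are illustrative choices.

```python
def repeated_product(alpha, T):
    """Multiply alpha by itself T times, as in a chain of T identical derivatives."""
    result = 1.0
    for _ in range(T):
        result *= alpha
    return result

for alpha in (0.9, 1.0, 1.1):
    print(alpha, repeated_product(alpha, 100))
```

A factor only 10% below 1 already shrinks the product to almost nothing after 100 steps, and a factor 10% above 1 blows it up by several orders of magnitude.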
60. Gradient clipping
❑ Once the gradient value grows extremely large, it causes an overflow
(i.e. NaN), which is easily detectable at runtime.
❑ A simple heuristic solution is to clip gradients to a small number
whenever they explode. That is, whenever they reach a certain
threshold, they are set back to a small number.
[Figure: error surface of a single hidden unit RNN]
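One common way to implement this heuristic is to rescale the gradient when its norm exceeds the threshold, which keeps its direction. This norm-based variant is an assumption, since the algorithm listing is not reproduced here; the function name and the threshold value are illustrative.

```python
import math

def clip_gradient(grad, threshold=1.0):
    """Rescale grad so that its L2 norm never exceeds threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        return [g * threshold / norm for g in grad]
    return list(grad)

print(clip_gradient([3.0, 4.0], threshold=1.0))   # norm 5 is scaled down to norm 1
print(clip_gradient([0.3, 0.4], threshold=1.0))   # norm 0.5 is left untouched
```

Clipping by norm rather than element by element preserves the direction of the update and only limits its size.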
61. Facing the vanishing gradient problem
❑ Echo State Networks (ESN)
❑ Long delays
❑ Leaky Units
❑ Gated Recurrent Neural Networks
62. Echo State Networks (ESN)
❑ How do we set the input and recurrent weights so that a rich set of
histories can be represented in the recurrent neural network state?
❑ Answer: make the dynamical system associated with the recurrent
net sit nearly on the edge of stability, i.e., more precisely, with values
around 1 for the leading eigenvalue of the Jacobian of the state-to-state
transition function.
❑ ESNs propose to fix the weights of the input → hidden connections
and the hidden → hidden connections at carefully chosen random values that make the
Jacobians slightly contractive. This is achieved by making the leading eigenvalue λ of the
weight matrix large but slightly less than 1.
❑ ESNs only learn the hidden → output connections.
63. Skip Connections (Long delays)
❑ Adding longer-delay connections allows
connecting past states to future states
through short paths.
❑ If we only have connections between consecutive time steps, the
gradients may vanish or explode after
T time steps as $O(\lambda^T)$.
❑ Instead, if we have recurrent connections
with a time-delay of D, gradients scale as
$O(\lambda^{T/D})$, so they vanish more slowly but may still
explode.
❑ This is because the number of effective steps is T/D.
This allows the learning algorithm to capture longer dependencies.
64. Gated Recurrent Neural Networks
❑ GRNNs are a special kind of RNN, capable of learning long-term
dependencies by having more persistent memory. Two popular architectures:
➢ Long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997]
➢ Gated recurrent unit (GRU) [Cho et al., 2014]
❑ Applications: handwriting recognition (Graves et al., 2009), speech
recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting
generation (Graves, 2013), machine translation (Sutskever et al., 2014a),
image to text conversion (captioning) (Kiros et al., 2014b; Vinyals et al.,
2014b; Xu et al., 2015b) and parsing (Vinyals et al., 2014a).
65. Long Short-Term Memory (LSTM)
❑ Standard RNNs have a very
simple repeating module
structure, such as a single tanh layer.
❑ LSTMs also have this chain-like
structure, but the repeating
module has a different
structure. Instead of having a
single neural network layer,
there are four, interacting in a
very special way.
66. Generate image caption
❑ Vinyals et al., Show and Tell: A Neural Image Caption Generator,arXiv
❑ Use a CNN as an image encoder that transforms the image into a fixed-length vector.
❑ This vector is used as the initial hidden state of a “decoder” RNN that generates
the target sequence.
67. Translate videos to sentences
❑ Venugopalan et al. arXiv 2014
❑ The challenge is to capture the joint dependencies of a sequence of
frames and a corresponding sequence of words
68. Reinforcement Learning
❑ One of the most exciting fields of Machine Learning today,
and also one of the oldest.
❑ It has been around since the 1950s, producing many
interesting applications over the years, in particular in
games (e.g., TD-Gammon, a Backgammon playing program).
❑ A revolution took place in 2013 when researchers from an
English startup called DeepMind demonstrated a system
that could learn to play just about any Atari game from scratch.
❑ DeepMind was bought by Google for over 500 million
dollars in 2014.
69. Learning to Optimize Rewards
❑ In Reinforcement Learning, a software agent makes
observations and takes actions within an environment, and
in return it receives rewards.
❑ Its objective is to learn to act in a way that will maximize its
expected long-term rewards.
❑ The agent acts in the environment and learns by trial and
error to maximize its pleasure and minimize its pain.
71. Policy Search
❑ The algorithm used by the software agent to determine its
actions is called its policy.
❑ For example, the policy could be a neural network taking
observations as inputs and outputting the action to take.
72. Stochastic policy
❑ The policy can be any algorithm you can think of, and it
does not even have to be deterministic.
❑ For example, consider a robotic vacuum cleaner whose
reward is the amount of dust it picks up in 30 minutes. Its
policy could be to move forward with some probability p
every second, or randomly rotate left or right with
probability 1 – p.
❑ The rotation angle would be a random angle between –r
and +r. Since this policy involves some randomness, it is
called a stochastic policy.
73. Introduction to OpenAI Gym
❑ One of the challenges of Reinforcement Learning is that in
order to train an agent, you first need to have a working environment.
❑ If you want to program an agent that will learn to play an
Atari game, you will need an Atari game simulator.
❑ If you want to program a walking robot, then the
environment is the real world and you can directly train
your robot in that environment.
74. Example of environment
❑ CartPole environment . This is a 2D simulation in which a
cart can be accelerated left or right in order to balance a
pole placed on top of it
75. Neural Network Policies
❑ In the case of the CartPole
environment, there are just two
possible actions (left or right).
❑ For example, if the network outputs 0.7,
then we will pick action 0 with
70% probability, and action 1 with
30% probability.
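Sampling an action from the network's output can be sketched without any RL library. In the example below, `p_action0` stands for the probability produced by the policy network (here simply fixed at 0.7); the function name and the seeded random generator are assumptions for illustration.

```python
import random

def choose_action(p_action0, rng=random):
    """Return action 0 with probability p_action0, otherwise action 1."""
    return 0 if rng.random() < p_action0 else 1

rng = random.Random(42)     # seeded so the run is repeatable
actions = [choose_action(0.7, rng) for _ in range(10000)]
print(actions.count(0) / len(actions))   # close to 0.7
```

Sampling instead of always picking the most likely action lets the agent keep exploring actions that currently look worse.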
76. Markov Decision Processes
❑ In the early 20th century, the mathematician Andrey
Markov studied stochastic processes with no memory,
called Markov chains.
❑ Such a process has a fixed number of states, and it
randomly evolves from one state to another at each step.
❑ The probability for it to evolve from a state s to a state s′ is
fixed, and it depends only on the pair (s,s′), not on past
states (the system has no memory).
❑ Markov chains can have very different dynamics, and they
are heavily used in thermodynamics, chemistry, statistics,
and much more.
77. MDP Example
❑ Suppose that the process starts in
state s0, and there is a 70% chance
that it will remain in that state at the next step.
❑ Eventually it is bound to leave that
state and never come back since no
other state points back to s0.
❑ If it goes to state s1, it will then most
likely go to state s2 (90% probability),
then immediately back to state s1
(with 100% probability).
79. Example: Grid World
❑ Noisy movement: actions do not always go as planned
❑ 80% of the time, the action North takes the agent North
(if there is no wall there)
❑ 10% of the time, North takes the agent West; 10% East
❑ If there is a wall in the direction the agent would have been
taken, the agent stays put.
❑ The agent receives rewards each time step
▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)
❑ Goal: maximize sum of rewards
81. Markov Decision Processes
❑ An MDP is defined by:
▪ A set of states s ∈ S
▪ A set of actions a ∈ A
▪ A transition function T(s, a, s’)
▪ Probability that a from s leads to s’, i.e., P(s’ | s, a)
▪ Also called the model or the dynamics
▪ A reward function R(s, a, s’)
▪ Sometimes just R(s) or R(s’)
▪ A start state
▪ Maybe a terminal state
82. What is Markov about MDPs?
❑ “Markov” generally means that given the present state, the future
and the past are independent
❑ For Markov decision processes, “Markov” means action outcomes
depend only on the current state
❑ This is just like search, where the successor function could only
depend on the current state (not the history)
❑ In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of
actions, from start to a goal
❑ For MDPs, we want an optimal policy π*: S → A
▪ A policy π gives an action for each state
▪ An optimal policy is one that maximizes
expected utility if followed
▪ An explicit policy defines a reflex agent
[Figure: optimal policy when R(s, a, s’) = -0.03 for all non-terminal states s]
86. Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ More or less? [1, 2, 2] or [2, 3, 4]
▪ Now or later? [0, 0, 1] or [1, 0, 0]
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later
▪ One solution: values of rewards decay exponentially
[Figure: a reward is worth 1 now, γ at the next step, and γ² in two steps]
▪ How to discount?
▪ Each time we descend a level, we
multiply in the discount once
▪ Why discount?
▪ Sooner rewards probably do have higher
utility than later rewards
▪ Also helps our algorithms converge
▪ Example: discount of 0.5
▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 = 2.75
▪ U([1,2,3]) < U([3,2,1]), since U([3,2,1]) = 1*3 + 0.5*2 + 0.25*1 = 4.25
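The discounted sum is a one-liner. The sketch below multiplies the reward at step t by the discount raised to the power t; the function name `discounted_utility` is an illustrative choice.

```python
def discounted_utility(rewards, gamma):
    """Sum of rewards, each multiplied by gamma once per elapsed step."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3], 0.5))   # 2.75
print(discounted_utility([3, 2, 1], 0.5))   # 4.25
```

The same rewards in a different order give a different utility, which is how discounting encodes a preference for rewards now over rewards later.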
89. Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite rewards?
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g. life)
▪ Gives nonstationary policies (π depends on time left)
▪ Discounting: use 0 < γ < 1
▪ Smaller γ means smaller “horizon” – shorter term focus
▪ Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached