
Hands-On Machine Learning with Scikit-Learn
Presented by: Ahmed Yousry
Agenda
❑Introduction to Artificial Neural Networks.
❑Training Deep Neural Nets.
❑Convolutional Neural Networks.
❑Recurrent Neural Network.
❑Reinforcement Learning.
Introduction to ANN
• First introduced back in 1943 by Warren McCulloch and Walter Pitts.
• ANNs enjoyed early successes until the 1960s.
• In the early 1980s there was a revival of interest in ANNs as new network architectures were invented.
• By the 1990s, powerful alternative Machine Learning techniques (such as SVMs) had overshadowed ANNs.
Reasons why ANNs now have a much more profound impact
❑There is now a huge quantity of data.
❑The tremendous increase in computing power.
❑The training algorithms have been improved.
❑Theoretical limitations of ANNs have turned
out to be benign.
❑A virtuous circle of funding, progress, and products.
Biological Neurons
ANN simulation
The Perceptron
• One of the simplest ANN architectures, invented in
1957 by Frank Rosenblatt.
• It is based on a linear threshold unit (LTU).
z = w1·x1 + w2·x2 + ⋯ + wn·xn = wᵀ·x
hw(x) = step(z) = step(wᵀ·x)
Multioutput perceptron
❑ A Perceptron with two inputs and three outputs.
Note: a Perceptron has no hidden layers.
Training Algorithm
While the epoch produces an error
    Present the network with the next input from the epoch
    Err = T – O
    If Err <> 0 then
        Wj_new = Wj_old + LR * Ij * Err
    End If
End While
• T: target (true) output, O: predicted output
• LR: learning rate, Ij: j-th input
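A minimal numpy sketch of this update rule (the AND-function data and the epoch count are illustrative assumptions, not from the slides):

import numpy as np

def train_perceptron(X, T, LR=0.1, epochs=20):
    """Perceptron rule: Wj_new = Wj_old + LR * Ij * Err, with Err = T - O."""
    X = np.c_[np.ones(len(X)), X]                # prepend a constant bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, T):
            o = 1 if np.dot(w, x) >= 0 else 0    # O = step(w^T x)
            err = t - o
            if err != 0:
                w += LR * err * x
    return w

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])
print(train_perceptron(X, T))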
XOR classification problem and an MLP
that solves it
XOR Function
X1 XOR X2 = (X1 AND NOT X2) OR (X2 AND NOT X1)
[Network diagram: inputs X1 and X2 feed hidden units Z1 and Z2, which feed the output Y; the labeled weights (2 and −1) implement the XOR decomposition above.]
Multi-Layer Perceptron and
Backpropagation
• An MLP is composed of one input layer, one or more
layers of LTUs, called hidden layers, and one final
output layer
• When an ANN has two
or more hidden layers, it is called
a deep neural network (DNN).
A modern MLP (including ReLU and
softmax) for classification
Deep learning Problems
• The vanishing gradients problem (or the related exploding gradients problem) makes the lower layers very hard to train.
• Second, with such a large network, training
would be extremely slow.
• Third, a model with millions of parameters
would severely risk overfitting the training
set.
Gradients problems
• Gradients often get smaller as the algorithm
progresses down to the lower layers.
• The Gradient Descent update leaves the lower layer
weights unchanged, and training never converges to
a good solution.
• This is called the vanishing gradients problem.
• The gradients can grow bigger and bigger, so many
layers get insanely large weight updates and the
algorithm diverges. This is the exploding gradients
problem
Solving the first problem(Van…)
A paper titled “Understanding the Difficulty of Training Deep Feedforward Neural Networks” by Xavier Glorot and Yoshua Bengio identified the main culprits:
1. The popular logistic sigmoid activation function.
2. The weight-initialization technique that was standard at the time: a normal distribution with a mean of 0 and a standard deviation of 1.
3. The paper also notes that the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in DNNs.
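The fix proposed in that paper is Xavier (Glorot) initialization: scale each layer's initial weight variance to its fan-in and fan-out rather than using a fixed standard deviation of 1. A minimal numpy sketch (the layer sizes are illustrative):

import numpy as np

def xavier_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform init: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W1 = xavier_uniform(784, 300)   # input layer -> first hidden layer
W2 = xavier_uniform(300, 100)   # first hidden layer -> second hidden layer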
Sigmoid activation function
you can see that when
inputs become large
(negative or positive),
the function saturates
at 0 or 1, with a
derivative extremely
close to 0.
σ(x) = 1 / (1 + e^(−x))
The problem of ReLU: max(0, x)
• It suffers from a problem known as dying ReLUs: during training, some neurons effectively die.
• They stop outputting anything other than 0.
• In some cases, you may find that half of your network’s neurons are dead during training.
• To solve this problem, you may want to use a
variant of the ReLU function, such as the
leaky ReLU.
Leaky ReLU and its variants
• Leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak).
• They also evaluated the randomized leaky ReLU (RReLU).
• They also evaluated the parametric leaky ReLU (PReLU).
Exponential linear unit (ELU)
• Outperformed all the ReLU variants in their
experiments: training time was reduced and the
neural network performed better on the test set.
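A quick numpy sketch of the three activations discussed above (the α defaults follow the values mentioned in the text; treat this as an illustration, not a library API):

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)                     # small slope instead of 0 for negative inputs

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))     # smooth, saturates at -alpha for large negative inputs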
Batch Normalization
• The technique consists of adding an operation in the model just before
the activation function of each layer.
• Simply zero-centering and normalizing the inputs, then scaling and
shifting the result using two new parameters per layer (one for scaling,
the other for shifting).
• In other words, this operation lets the model learn the optimal scale and
mean of the inputs for each layer.
• γ is the scaling parameter for the layer.
• β is the shifting parameter (offset) for the
layer.
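For reference, the batch normalization operation described above can be written out as follows (μB and σB² are computed over the current mini-batch of size mB; ε is a small smoothing term):

μB = (1/mB) Σi x⁽ⁱ⁾                      (mini-batch mean)
σB² = (1/mB) Σi (x⁽ⁱ⁾ − μB)²             (mini-batch variance)
x̂⁽ⁱ⁾ = (x⁽ⁱ⁾ − μB) / √(σB² + ε)          (zero-center and normalize)
z⁽ⁱ⁾ = γ · x̂⁽ⁱ⁾ + β                      (scale and shift)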
Activation functions
Reusing Pretrained Layers
• It is generally not a good idea to train a very
large DNN from scratch.
• Try to find an existing neural network that
accomplishes a similar task.
• Reuse the lower layers of this network.
• This is called transfer learning.
Example
• DNN that was trained to
classify pictures into 100
different categories.
• You now want to train a
DNN to classify specific
types of vehicles.
• Freezing the Lower Layers
weights.
• Tweaking, Dropping, or
Replacing the Upper Layers.
Understanding AlexNet
Consists of 5 Convolutional Layers and 3 Fully Connected Layers (classify 1000 classes)
Faster Optimizers
• Five ways to speed up training (and reach a better
solution):
➢ Applying a good initialization strategy for the connection weights.
➢ using a good activation function.
➢ Using Batch Normalization.
➢ Reusing parts of a pretrained network.
➢ Using a faster optimizer than the regular Gradient Descent
optimizer.
• The most popular ones are Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.
Momentum Optimization Algorithm
• Gradient Descent simply updates the weights θ by directly subtracting
the gradient of the cost function J(θ) with regards to the weights
(∇θJ(θ)) multiplied by the learning rate η (equation 1)
• Momentum optimization cares a great deal about what previous
gradients were.
• It updates the weights by simply subtracting this momentum vector.
• A new hyperparameter β, simply called the momentum, which must be
set between 0 and 1, typically 0.9. (equation 2)
Gradient Descent (1): θ ← θ − η ∇θJ(θ)
Momentum Optimization (2): m ← β·m + η ∇θJ(θ), then θ ← θ − m
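A minimal numpy sketch of these two update rules (grad stands in for the gradient ∇θJ(θ) at the current θ; the values below are illustrative):

import numpy as np

def gradient_descent_step(theta, grad, eta=0.001):
    return theta - eta * grad                        # equation (1)

def momentum_step(theta, m, grad, eta=0.001, beta=0.9):
    m = beta * m + eta * grad                        # accumulate the momentum vector
    return theta - m, m                              # equation (2): subtract the momentum vector

theta, m = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.3])                         # stand-in for the true gradient
theta, m = momentum_step(theta, m, grad)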
Nesterov Momentum optimization
▪ The only difference from vanilla Momentum
optimization is that the gradient is measured
at θ + βm rather than at θ.
▪ This small tweak works because in general the momentum vector will be pointing in the right direction.
▪ ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm.
RMSProp Optimization
• Accumulating only the gradients from the most recent iterations (as
opposed to all the gradients since the beginning of training).
• It does so by using exponential decay in the first step.
• It generally performs better than Momentum optimization and Nesterov Accelerated Gradient.
• In fact, it was the preferred optimization algorithm of many
researchers until Adam optimization came around.
Adam Optimization
• Stands for adaptive moment estimation.
• Combines the ideas of Momentum optimization and RMSProp.
• Steps 3 and 4 are somewhat of a technical detail: since m and s
are initialized at 0, they will be biased toward 0 at the beginning of
training, so these two steps will help boost m and s at the
beginning of training.
Initialize β1 = 0.9, β2 = 0.999, η = 0.001.
The smoothing term ϵ is initialized to a tiny number (e.g. 10⁻⁸) to avoid division by 0.
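A numpy sketch of one Adam step using the hyperparameter values listed above (grad again stands in for the gradient of the cost function; the step index t starts at 1):

import numpy as np

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                      # 1. momentum-style moving average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2                 # 2. RMSProp-style moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                            # 3. bias correction: boosts m early in training
    s_hat = s / (1 - beta2 ** t)                            # 4. bias correction: boosts s early in training
    theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)    # 5. parameter update
    return theta, m, s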
Difference between optimizers
Learning rate curves
Learning rate techniques
❑ Predetermined piecewise
constant learning rate For example, set the learning rate to η0 = 0.1 at first,
then to η1 = 0.001 after 50 epochs.
❑ Performance scheduling
Measure the validation error every N steps (just like for early stopping) and
reduce the learning rate by a factor of λ when the error stops dropping.
❑ Exponential scheduling
Set the learning rate to a function of the iteration number t: η(t) = η0 · 10^(−t/r).
This works great, but it requires tuning η0 and r. The learning rate will drop by
a factor of 10 every r steps.
❑ Power scheduling
Set the learning rate to η(t) = η0 · (1 + t/r)^(−c). The hyperparameter c is typically set to 1.
This is similar to exponential scheduling, but the learning rate drops much more slowly.
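Two small Python helpers for the exponential and power schedules above (η0, r, and c keep the names used in the text; the default values are illustrative):

def exponential_schedule(t, eta0=0.1, r=10000):
    """eta(t) = eta0 * 10**(-t/r): drops by a factor of 10 every r steps."""
    return eta0 * 10 ** (-t / r)

def power_schedule(t, eta0=0.1, r=10000, c=1):
    """eta(t) = eta0 * (1 + t/r)**(-c): similar shape, but decays much more slowly."""
    return eta0 * (1 + t / r) ** (-c)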
Dropout
❑ It is a fairly simple algorithm: at every training step, every neuron
(including the input neurons but excluding the output neurons) has
a probability p of being temporarily “dropped out,” meaning it will
be entirely ignored during this training step, but it may be active
during the next step.
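A numpy sketch of dropout at training time (this is the common “inverted dropout” variant, which rescales the kept activations by 1/(1 − p) so nothing needs to change at test time; the rescaling is an implementation detail, not stated on the slide):

import numpy as np

def dropout(activations, p=0.5, training=True):
    """Temporarily drop each unit with probability p during training."""
    if not training:
        return activations
    keep_mask = (np.random.rand(*activations.shape) >= p)
    return activations * keep_mask / (1.0 - p)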
Data Augmentation
❑ Consists of generating new training instances from existing ones (by rotating, resizing, flipping, and cropping), artificially boosting the size of the training set.
❑ This will reduce overfitting, making this a regularization technique.
The trick is to generate realistic training instances.
Convolutional Neural Networks
❑ A convolutional neural network (or ConvNet) is a type of feed-forward
artificial neural network.
❑ The architecture of a ConvNet is designed to take advantage of the 2D
structure of an input image.
❑ A ConvNet is comprised of one or more convolutional layers (often
with a pooling step) and then followed by one or more fully connected
layers as in a standard multilayer neural network.
How CNN works
• For example, a ConvNet takes the input as an image which
can be classified as ‘X’ or ‘O’
ConvNet Layers
▪CONV layer will compute the output of neurons that are connected
to local regions in the input, each computing a dot product between
their weights and a small region they are connected to in the input
volume.
▪RELU layer will apply an elementwise activation function, such as
the max(0,x) thresholding at zero. This leaves the size of the
volume unchanged.
▪POOL layer will perform a downsampling operation along the spatial dimensions (width, height).
▪FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1xN], where each of the N numbers corresponds to a class score for one of the N categories.
Convolutional Layer - Filters
▪ The CONV layer’s parameters consist of a set of learnable
filters.
▪ Every filter is small spatially (along width and height), but
extends through the full depth of the input volume.
▪ During the forward pass, we slide (more precisely, convolve)
each filter across the width and height of the input volume and
compute dot products between the entries of the filter and the
input at any position.
Convolutional Layer - Filters
• Sliding the filter over the width and height of the input gives a 2-dimensional activation map that responds to that filter at every spatial position.
Convolutional Layer – Filters –Example
Convolutional Layer – Filters – Computation Example
Convolutional Layer – Filters – Output Feature Map
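A minimal numpy sketch of sliding a single filter over a single-channel input (stride 1, no padding; the 7×7 input and the vertical-edge filter are illustrative):

import numpy as np

def convolve2d(image, kernel):
    """Compute a dot product between the kernel and each local region of the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return feature_map

image = np.random.rand(7, 7)
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])   # simple vertical-edge filter
print(convolve2d(image, kernel).shape)    # (5, 5) activation map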
Relu Layer
Pool Layer
▪ The pooling layers down-sample the previous layer’s feature map.
▪ Its function is to progressively reduce the spatial size of the representation, to reduce the number of parameters and computation in the network.
▪ The pooling layer often uses the Max operation to perform
the down sampling process.
Pooling filter example: size = 2 × 2, stride = 2
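A matching numpy sketch of the 2 × 2, stride-2 max pooling shown above (assumes the feature map’s height and width are even):

import numpy as np

def max_pool_2x2(feature_map):
    """Down-sample by taking the max over each non-overlapping 2x2 block."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))    # 2x2 output; each entry is the max of one block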
Fully connected layer
❑ Fully connected layers are the normal
flat feed-forward neural network
layers.
❑ These layers may have a non-linear
activation function or a softmax
activation in order to predict classes.
❑ To compute our output, we simply rearrange the output matrices into a 1-D array.
SoftMax operation
❑ A special kind of activation layer, usually applied to the outputs of the final FC layer.
❑ Can be viewed as a fancy normalizer
(a.k.a. Normalized exponential
function)
❑ Produce a discrete probability
distribution vector
❑ Very convenient when combined
with cross-entropy loss
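A numpy sketch of the softmax operation (subtracting the maximum logit is a standard numerical-stability trick, not something stated on the slide):

import numpy as np

def softmax(logits):
    """Turn a vector of class scores into a discrete probability distribution."""
    exp = np.exp(logits - np.max(logits))    # shift for numerical stability
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())                    # probabilities that sum to 1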
Recurrent Neural Network
❑ Some problems require previous history/context in order to be able to give proper output (speech recognition, stock forecasting, target tracking, etc.).
❑ One way to do that is to just provide all the necessary
context in one "snap-shot" and use standard learning
➢ How big should the snap-shot be? Varies for different
instances of the problem.
✓ If the input sequences are of fixed length, or can be easily padded to a fixed length, they can be collapsed into a single input vector and fed to any of the standard pattern classification algorithms.
Sequential data
❑ There are many tasks that require learning a temporal sequence
of events
❑ These problems can be broken into 3 distinct types of tasks
➢ Sequence Recognition: Produce a particular output pattern
when a specific input sequence is seen. Applications:
Sentiment Analysis, handwriting recognition
➢ Sequence Reproduction: Generate the rest of a sequence
when the network sees only part of the sequence.
Applications: Time series prediction (stock market, sun spots,
etc), language model.
➢ Temporal Association: Produce a particular output sequence
in response to a specific input sequence. Applications:
machine translation, speech generation
✓ Recurrent networks are flexible enough to solve these problems.
Recurrent Networks offer a lot of flexibility:
(1) Fixed-sized input to fixed-sized output (e.g. image classification).
(2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).
(5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
Recurrent Neural Networks
❑ Recurrent neural network lets the
network dynamically learn how much
context it needs in order to solve the
problem.
❑ RNN is a multilayer NN with the previous
set of hidden unit activations feeding
back into the network along with the
inputs.
❑ RNNs have a “memory” which captures
information about what has been
calculated so far.
Recurrent neural networks
❑ Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them.
❑ It means local connections are shared (same weights) across different
temporal instances of the hidden units.
❑ If we have to define a different function Gt for each possible sequence
length, each with its own parameters, we would not get any
generalization to sequences of a size not seen in the training set.
Dynamic systems
❑ A means of describing how one state develops into another state
over the course of time.
❑ Consider the classical form of a dynamical system: st = fθ(st−1)
✓ Where st is the system state at time t and fθ is a mapping function parameterized by θ.
❑ The same parameters (the same function fθ) are used for all time steps.
❑ The unfolded flow graph of such a system is:
Dynamic systems
❑ Now consider a dynamical system driven by an external signal xt: st = fθ(st−1, xt).
❑ The state st now contains information about the whole past sequence.
Recurrent Neural Networks
Cost function
❑ The total loss for a given input/target sequence pair
(x, y), measured in cross entropy
L(y, ŷ) = Σt Lt = Σt −yt log ŷt
• where yt is the category that should be associated with time step t in the output sequence, and ŷt is the predicted output.
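A minimal numpy sketch of the forward pass of a simple RNN together with this loss (the tanh hidden unit, the weight shapes, and the random data are assumptions about the simple-RNN form, not taken from the slides):

import numpy as np

def rnn_forward(xs, ys, Wx, Ws, Wy):
    """s_t = tanh(Wx x_t + Ws s_{t-1}); y_hat_t = softmax(Wy s_t); L = sum_t -y_t . log(y_hat_t)."""
    s = np.zeros(Ws.shape[0])
    loss = 0.0
    for x, y in zip(xs, ys):
        s = np.tanh(Wx @ x + Ws @ s)                 # the same Wx, Ws are reused at every time step
        logits = Wy @ s
        y_hat = np.exp(logits - logits.max())
        y_hat /= y_hat.sum()                         # softmax over output classes
        loss += -np.sum(y * np.log(y_hat))           # cross-entropy term L_t
    return loss

n_in, n_hidden, n_out, T = 3, 5, 2, 4
xs = [np.random.randn(n_in) for _ in range(T)]
ys = [np.eye(n_out)[t % n_out] for t in range(T)]    # one-hot targets
Wx = 0.1 * np.random.randn(n_hidden, n_in)
Ws = 0.1 * np.random.randn(n_hidden, n_hidden)
Wy = 0.1 * np.random.randn(n_out, n_hidden)
print(rnn_forward(xs, ys, Wx, Ws, Wy))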
Computing the gradient in
RNN
Using the generalized back-propagation one can obtain the so-
called Back-propagation Through Time (BPTT) algorithm.
We can then iterate backwards in time to back-propagate
gradients through time, from t = T − 1 down to t = 1,
noting that st (for t < T) has as descendants both ot and
st+1
Exploding or vanishing gradient
❑ In recurrent nets (also in very deep nets), the final output is the
composition of a large number of non-linear transformations.
❑ Even if each of these non-linear transformations is smooth. Their
composition might not be.
❑ The derivative (i.e. Jacobian matrix) through the whole composition
will tend to be either very small or very large.
❑ Example: suppose all numbers in the product are scalar and have the same value α. If the number of multiplications T goes to ∞, then α^T → ∞ if α > 1 and α^T → 0 if α < 1.
Gradient clipping
❑ Once the gradient value grows extremely large, it causes an overflow
(i.e. NaN) which is easily detectable at runtime.
❑ A simple heuristic solution is to clip the gradients whenever they explode: whenever their norm reaches a certain threshold, they are scaled back down to that threshold.
Error surface of a single hidden unit RNN
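A numpy sketch of the norm-clipping heuristic (the threshold value is an illustrative choice):

import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient norm exceeds the threshold, rescale it back down to that norm."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])           # norm 50, well above the threshold
print(clip_gradient(g))               # rescaled to norm 5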
Facing the vanishing gradient problem
❑ Echo State Networks (ESN)
❑ Long delays
❑ Leaky Units
❑ Gated Recurrent Neural Networks
Echo State Networks (ESN)
❑ How do we set the input and recurrent weights so that a rich set of
histories can be represented in the recurrent neural network state?
❑ Answer: is to make the dynamical system associated with the recurrent
net nearly be on the edge of stability, i.e., more precisely with values
around 1 for the leading eigenvalue of the Jacobian of the state-to-
state transition function.
❑ ESNs fix the weights of the input→hidden and hidden→hidden connections at carefully chosen random values so that the Jacobians are slightly contractive. This is achieved by making the spectral radius (leading eigenvalue λ) of the weight matrix large but slightly less than 1.
❑ ESNs only learn the hidden→output connections.
Skip Connections (Long Delays)
❑ Adding longer-delay connections allows past states to be connected to future states through short paths.
❑ If we have a connection at every time step, the gradients vanish or explode over T time steps as O(λ^T).
❑ Instead, if we have recurrent connections with a time-delay of D, gradients grow as O(λ^(T/D)), vanishing more slowly, but they may still explode over T steps.
❑ Because the number of effective steps is T/D, this allows the learning algorithm to capture longer dependencies.
Gated Recurrent Neural Networks
❑ GRNNs are a special kind of RNN, capable of learning long-term
dependencies by having more persistent memory. Two popular
architectures:
➢ Long short-term memory (LSTM) [Hochreiter and Schmidhuber,
1997].
➢ Gated recurrent unit (GRU), [Cho et al., 2014]
❑ Applications: handwriting recognition (Graves et al., 2009), speech
recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting
generation (Graves, 2013), machine translation (Sutskever et al., 2014a),
image to text conversion (captioning) (Kiros et al., 2014b; Vinyals et al.,
2014b; Xu et al., 2015b) and parsing (Vinyals et al., 2014a).
Long Short-Term Memory (LSTM)
❑ Standard RNNs have a very
simple repeating module
structure, such as a single tanh
layer.
❑ LSTMs also have this chain like
structure, but the repeating
module has a different
structure. Instead of having a
single neural network layer,
there are four, interacting in a
very special way.
Generate image caption
❑ Vinyals et al., Show and Tell: A Neural Image Caption Generator, arXiv 2014.
❑ Use a CNN as an image encoder and transform it to a fixed-length
vector
❑ It is used as the initial hidden state of a “decoder” RNN that generates
the target sequence
Translate videos to sentences
❑ Venugopalan et al. arXiv 2014
❑ The challenge is to capture the joint dependencies of a sequence of
frames and a corresponding sequence of words
Reinforcement Learning
❑ One of the most exciting fields of Machine Learning today,
and also one of the oldest.
❑ It has been around since the 1950s, producing many interesting applications over the years, in particular in games (e.g., TD-Gammon, a Backgammon-playing program).
❑ A revolution took place in 2013, when researchers from an English startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch.
❑ DeepMind was bought by Google for over 500 million
dollars in 2014.
Learning to Optimize Rewards
❑ In Reinforcement Learning, a software agent makes
observations and takes actions within an environment, and
in return it receives rewards.
❑ Its objective is to learn to act in a way that will maximize its
expected long-term rewards.
❑ The agent acts in the environment and learns by trial and
error to maximize its pleasure and minimize its pain.
Examples of RL agents
• (a) walking robot, (b) Ms. Pac-Man, (c) Go player, (d) thermostat, (e) automatic trader
Policy Search
❑ The algorithm used by the software agent to determine its actions is called its policy.
❑ For example, the policy could be a neural network taking observations as inputs and outputting the action to take.
Stochastic policy
❑ The policy can be any algorithm you can think of, and it
does not even have to be deterministic.
❑ For example, consider a robotic vacuum cleaner whose
reward is the amount of dust it picks up in 30 minutes. Its
policy could be to move forward with some probability p
every second, or randomly rotate left or right with
probability 1 – p.
❑ The rotation angle would be a random angle between –r
and +r. Since this policy involves some randomness, it is
called a stochastic policy.
Introduction to OpenAI Gym
❑ One of the challenges of Reinforcement Learning is that in
order to train an agent, you first need to have a working
environment.
❑ If you want to program an agent that will learn to play an
Atari game, you will need an Atari game simulator.
❑ If you want to program a walking robot, then the
environment is the real world and you can directly train
your robot in that environment.
Example of environment
❑ The CartPole environment: a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it.
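A minimal CartPole loop with a hard-coded policy, written against the classic OpenAI Gym API (obs, reward, done, info returned by step; newer Gym/Gymnasium releases change these signatures slightly):

import gym

env = gym.make("CartPole-v1")
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    angle = obs[2]                              # observation: [cart pos, cart vel, pole angle, pole angular vel]
    action = 0 if angle < 0 else 1              # push left if the pole leans left, else push right
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("Episode reward:", total_reward)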
Neural Network Policies
❑ In the case of the CartPole
environment, there are just two
possible actions (left or right)
❑ For example, if the network outputs 0.7 for action 0, then we will pick action 0 with 70% probability, and action 1 with 30% probability.
Markov Decision Processes
❑ In the early 20th century, the mathematician Andrey
Markov studied stochastic processes with no memory,
called Markov chains.
❑ Such a process has a fixed number of states, and it
randomly evolves from one state to another at each step.
❑ The probability for it to evolve from a state s to a state s′ is
fixed, and it depends only on the pair (s,s′), not on past
states (the system has no memory).
❑ Markov chains can have very different dynamics, and they
are heavily used in thermodynamics, chemistry, statistics,
and much more.
MDP Example
❑ Suppose that the process starts in
state s0, and there is a 70% chance
that it will remain in that state at the
next step.
❑ Eventually it is bound to leave that
state and never come back since no
other state points back to s0.
❑ If it goes to state s1, it will then most
likely go to state s2 (90% probability),
then immediately back to state s1
(with 100% probability).
Another Example
Example: Grid World
❑ Noisy movement: actions do not always go as planned
❑ 80% of the time, the action North takes the agent North
(if there is no wall there)
❑ 10% of the time, North takes the agent West; 10% East
❑ If there is a wall in the direction the agent would have been
taken, the agent stays put.
❑ The agent receives rewards each time step
▪ Small “living” reward each step (can be negative)
▪ Big rewards come at the end (good or bad)
❑ Goal: maximize sum of rewards
Grid World Actions
Deterministic Grid World vs. Stochastic Grid World
Markov Decision Processes
❑ An MDP is defined by:
▪ A set of states s ∈ S
▪ A set of actions a ∈ A
▪ A transition function T(s, a, s’)
▪ Probability that a from s leads to s’, i.e., P(s’|
s, a)
▪ Also called the model or the dynamics
▪ A reward function R(s, a, s’)
▪ Sometimes just R(s) or R(s’)
▪ A start state
▪ Maybe a terminal state
What is Markov about MDPs?
❑ “Markov” generally means that given the present state, the future
and the past are independent
❑ For Markov decision processes, “Markov” means action outcomes
depend only on the current state
❑ This is just like search, where the successor function could only
depend on the current state (not the history)
Andrey Markov
(1856-1922)
Markov Property
P(St+1 | S0, S1, …, St−1, St) = P(St+1 | St)
Policies
❑ In deterministic single-agent search problems,
we wanted an optimal plan, or sequence of
actions, from start to a goal
❑ For MDPs, we want an optimal policy π*: S → A
▪ A policy π gives an action for each state
▪ An optimal policy is one that maximizes
expected utility if followed
▪ An explicit policy defines a reflex agent
Optimal policy when
R(s, a, s’) = -0.03 for all
non-terminals s
Optimal Policies
R(s) = -0.03 | R(s) = -0.01 | R(s) = -2.0 | R(s) = -0.4
Utilities of Sequences
▪ What preferences should an agent have over reward sequences?
▪ More or less?
▪ Now or later?
▪ [1, 2, 2] or [2, 3, 4]?
▪ [0, 0, 1] or [1, 0, 0]?
Discounting
▪ It’s reasonable to maximize the sum of rewards
▪ It’s also reasonable to prefer rewards now to rewards later
▪ One solution: values of rewards decay exponentially
(Worth now · Worth next step · Worth in two steps)
Discounting
▪ How to discount?
▪ Each time we descend a level, we
multiply in the discount once
▪ Why discount?
▪ Sooner rewards probably do have higher
utility than later rewards
▪ Also helps our algorithms converge
▪ Example: discount of 0.5
▪ U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3
▪ U([1,2,3]) < U([3,2,1])
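A tiny helper that reproduces this arithmetic (γ is the discount factor):

def discounted_return(rewards, gamma=0.5):
    """Sum of rewards, multiplying in the discount once per level."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_return([1, 2, 3]))   # 1*1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_return([3, 2, 1]))   # 3*1 + 0.5*2 + 0.25*1 = 4.25 (larger)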
Infinite Utilities?!
▪ Problem: What if the game lasts forever? Do we get infinite
rewards?
▪ Solutions:
▪ Finite horizon: (similar to depth-limited search)
▪ Terminate episodes after a fixed T steps (e.g. life)
▪ Gives nonstationary policies (π depends on time left)
▪ Discounting: use 0 < γ < 1
▪ Smaller γ means smaller “horizon” – shorter term focus
▪ Absorbing state: guarantee that for every policy, a terminal state will
eventually be reached
THANKS
QUESTIONS?
  • 89. Infinite Utilities?! ▪ Problem: What if the game lasts forever? Do we get infinite rewards? ▪ Solutions: ▪ Finite horizon: (similar to depth-limited search) ▪ Terminate episodes after a fixed T steps (e.g. life) ▪ Gives nonstationary policies (π depends on time left) ▪ Discounting: use 0 < γ < 1 ▪ Smaller γ means smaller “horizon” – shorter term focus ▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached