Introduction to Artificial Neural Networks. Training Deep Neural Nets. Convolutional Neural Networks. Recurrent Neural Network. Reinforcement Learning.

- 1. Hands-On Machine Learning with Scikit-Learn. Presented by: Ahmed Yousry
- 2. Agenda ❑Introduction to Artificial Neural Networks. ❑Training Deep Neural Nets. ❑Convolutional Neural Networks. ❑Recurrent Neural Network. ❑Reinforcement Learning.
- 3. Introduction to ANN • First introduced back in 1943 by Warren McCulloch and Walter Pitts. • Early successes of ANNs lasted until the 1960s. • In the early 1980s there was a revival of interest in ANNs as new network architectures were invented. • By the 1990s, powerful alternative Machine Learning techniques (such as Support Vector Machines) had overtaken ANNs in popularity.
- 4. Reasons why ANNs now have a far more profound impact ❑There is now a huge quantity of data available to train them. ❑The tremendous increase in computing power. ❑The training algorithms have been improved. ❑Theoretical limitations of ANNs have turned out to be benign in practice. ❑A virtuous circle of funding, progress, and products.
- 7. The Perceptron • One of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. • It is based on a linear threshold unit (LTU): z = w1·x1 + w2·x2 + ⋯ + wn·xn = wᵀ·x, and h_w(x) = step(z) = step(wᵀ·x).
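To make the LTU concrete, here is a minimal numpy sketch of the forward pass; the weight and input values are illustrative, not from the slides:

```python
import numpy as np

def step(z):
    # Heaviside step: 1 if z >= 0, else 0
    return (z >= 0).astype(int)

def ltu_forward(x, w):
    # z = w^T . x, then h_w(x) = step(z)
    return step(np.dot(w, x))

w = np.array([0.5, -0.2, 0.1])   # illustrative weights
x = np.array([1.0, 2.0, 3.0])
print(ltu_forward(x, w))         # 1, since z = 0.5 - 0.4 + 0.3 = 0.4 >= 0
```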
- 8. Multioutput perceptron ❑ A Perceptron with two inputs and three outputs. Note : No hidden layers in perceptron.
- 9. Training Algorithm: While the epoch produces an error: present the network with the next input from the epoch; Err = T − O; if Err <> 0 then Wj_new = Wj_old + LR * Ij * Err. • T: target (actual) output, O: predicted output • LR: learning rate, Ij: the j-th input. (A runnable version follows below.)
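A runnable Python version of this training loop, assuming binary targets and a bias handled as an extra constant input; the learning rate and epoch count are illustrative:

```python
import numpy as np

def train_perceptron(X, T, lr=0.1, epochs=20):
    # X: (n_samples, n_features), T: (n_samples,) targets in {0, 1}
    X = np.hstack([X, np.ones((len(X), 1))])   # append a constant bias input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, T):
            o = 1 if np.dot(w, x) >= 0 else 0  # predicted output O
            err = t - o                        # Err = T - O
            if err != 0:
                w += lr * x * err              # Wj_new = Wj_old + LR * Ij * Err
    return w

# Learn the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])
w = train_perceptron(X, T)
```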
- 10. XOR classification problem and an MLP that solves it. XOR Function: X1 XOR X2 = (X1 AND NOT X2) OR (X2 AND NOT X1). (Figure: an MLP with inputs X1, X2, hidden units Z1, Z2, output Y, and edge labels 2, 2, 2, 2, −1, −1.)
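A sketch of this XOR MLP with step units. The exact weights and thresholds below are one common textbook choice that implements the decomposition above; they are not necessarily the values in the slide's figure:

```python
def step(z):
    # Heaviside step as a scalar function
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # hidden unit z1 fires for (x1 AND NOT x2), z2 for (x2 AND NOT x1)
    z1 = step(2*x1 - 2*x2 - 1)
    z2 = step(2*x2 - 2*x1 - 1)
    # output y fires for (z1 OR z2)
    return step(2*z1 + 2*z2 - 1)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_mlp(a, b))   # 0, 1, 1, 0
```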
- 11. Multi-Layer Perceptron and Backpropagation • An MLP is composed of one input layer, one or more layers of LTUs, called hidden layers, and one final output layer • When an ANN has two or more hidden layers, it is called a deep neural network (DNN).
- 12. A modern MLP (including ReLU and softmax) for classification
- 13. Deep Learning Problems • The vanishing gradients problem (or the related exploding gradients problem) makes lower layers very hard to train. • Second, with such a large network, training would be extremely slow. • Third, a model with millions of parameters would severely risk overfitting the training set.
- 14. Gradients problems • Gradients often get smaller and smaller as the algorithm progresses down to the lower layers. • As a result, the Gradient Descent update leaves the lower layers' weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. • Conversely, the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem.
- 15. Solving the first problem (vanishing gradients). A paper titled “Understanding the Difficulty of Training Deep Feedforward Neural Networks” by Xavier Glorot and Yoshua Bengio identified two culprits: (1) the popular logistic sigmoid activation function, and (2) weight initialization using a normal distribution with a mean of 0 and a standard deviation of 1. As a partial remedy, the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in DNNs.
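The same paper proposed what is now called Xavier (Glorot) initialization. A sketch of the normal variant, which scales the weight variance by 2/(n_in + n_out) to keep signal variance roughly constant across layers:

```python
import numpy as np

def glorot_normal(n_in, n_out, seed=42):
    # sample from N(0, sigma^2) with sigma = sqrt(2 / (n_in + n_out))
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, sigma, size=(n_in, n_out))

W = glorot_normal(300, 100)   # weights for a 300 -> 100 fully connected layer
```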
- 16. Sigmoid activation function: σ(x) = 1/(1 + e^(−x)). You can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0.
- 17. The problem of ReLU, max(0, x) • It suffers from the dying ReLUs problem: during training, some neurons effectively die, meaning they stop outputting anything other than 0. • In some cases, you may find that half of your network's neurons are dead during training. • To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU.
- 18. Leaky ReLU and its variants • In one evaluation, leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak). • The authors also evaluated the randomized leaky ReLU (RReLU) and the parametric leaky ReLU (PReLU).
- 19. Exponential linear unit (ELU) • Outperformed all the ReLU variants in the authors' experiments: training time was reduced and the neural network performed better on the test set. (See the activation-function sketch below.)
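The ReLU family above as numpy functions; the α values follow the slides (0.01 or 0.2 for leaky ReLU), and α = 1 is a common default for ELU:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # small negative slope keeps "dead" neurons trainable
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # smooth negative saturation at -alpha; mean activations closer to 0
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

z = np.linspace(-3, 3, 7)
print(relu(z), leaky_relu(z, 0.2), elu(z), sep="\n")
```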
- 20. Batch Normalization • The technique consists of adding an operation in the model just before the activation function of each layer. • Simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). • In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer. • γ is the scaling parameter for the layer. • β is the shifting parameter (offset) for the layer.
- 22. Reusing Pretrained Layers • It is generally not a good idea to train a very large DNN from scratch. • Try to find an existing neural network that accomplishes a similar task. • Reuse the lower layers of this network. • This is called transfer learning.
- 23. Example • Suppose you have a DNN that was trained to classify pictures into 100 different categories. • You now want to train a DNN to classify specific types of vehicles. • Freeze the lower layers' weights. • Tweak, drop, or replace the upper layers. (A sketch follows below.)
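A hedged tf.keras sketch of this recipe. The file name `pretrained.h5`, the layer split, and the 10-class vehicle head are placeholders, and the exact API details can vary across TensorFlow versions:

```python
import tensorflow as tf

# hypothetical pretrained 100-class model saved earlier
base = tf.keras.models.load_model("pretrained.h5")

# freeze the lower layers: reuse their learned low-level features as-is
for layer in base.layers[:-2]:
    layer.trainable = False

# replace the upper layers with a new head for, say, 10 vehicle types
x = base.layers[-3].output
new_output = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs=base.input, outputs=new_output)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```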
- 24. Understanding AlexNet. Consists of 5 convolutional layers and 3 fully connected layers (classifying 1,000 classes).
- 25. Faster Optimizers • Five ways to speed up training (and reach a better solution): ➢ Applying a good initialization strategy for the connection weights. ➢ Using a good activation function. ➢ Using Batch Normalization. ➢ Reusing parts of a pretrained network. ➢ Using a faster optimizer than the regular Gradient Descent optimizer. • The most popular ones: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.
- 26. Momentum Optimization Algorithm • Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regard to the weights, multiplied by the learning rate η: (1) θ ← θ − η∇θJ(θ). • Momentum optimization cares a great deal about what previous gradients were: (2) m ← βm + η∇θJ(θ), then θ ← θ − m. • It updates the weights by simply subtracting this momentum vector. • The new hyperparameter β, simply called the momentum, must be set between 0 and 1; a typical value is 0.9.
- 28. Nesterov Momentum optimization ▪ The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ. ▪ This small tweak works because in general the momentum vector will be pointing in the right direction. ▪ (In the figure, ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm.)
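A numpy sketch of both update rules on a toy quadratic cost. Note the sign convention: since this sketch subtracts m from θ, the Nesterov look-ahead is taken along the upcoming step, θ − βm; the slide's θ + βm corresponds to the convention where m is added to θ:

```python
import numpy as np

def grad(theta):
    return 2 * theta                    # gradient of the toy cost J(theta) = theta^2

def momentum_step(theta, m, lr=0.05, beta=0.9):
    m = beta * m + lr * grad(theta)     # equation (2): accumulate the momentum vector
    return theta - m, m                 # subtract it from the weights

def nesterov_step(theta, m, lr=0.05, beta=0.9):
    # gradient measured slightly ahead, in the direction of the momentum
    m = beta * m + lr * grad(theta - beta * m)
    return theta - m, m

theta, m = 5.0, 0.0
for _ in range(200):
    theta, m = nesterov_step(theta, m)
print(theta)                            # approaches the minimum at 0
```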
- 29. RMSProp Optimization • Accumulates only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). • It does so by using exponential decay in the first step: s ← βs + (1 − β)∇θJ(θ)², then θ ← θ − η∇θJ(θ)/√(s + ε), elementwise. • Generally performs better than Momentum optimization and Nesterov Accelerated Gradient. • In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.
- 30. Adam Optimization • Stands for adaptive moment estimation. • Combines the ideas of Momentum optimization and RMSProp. • Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps help boost m and s early on. • Typical initialization: β1 = 0.9, β2 = 0.999, η = 0.001, and the smoothing term ε set to a tiny number such as 10⁻⁸ to avoid division by zero.
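A numpy sketch of both adaptive updates, using the hyperparameter defaults from the slide (β1 = 0.9, β2 = 0.999, η = 0.001, ε = 10⁻⁸); the toy gradient is illustrative:

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad**2          # decaying average of squared grads
    return theta - lr * grad / np.sqrt(s + eps), s

def adam_step(theta, m, s, grad, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # 1. momentum-like first moment
    s = beta2 * s + (1 - beta2) * grad**2        # 2. RMSProp-like second moment
    m_hat = m / (1 - beta1**t)                   # 3. bias correction (m starts at 0)
    s_hat = s / (1 - beta2**t)                   # 4. bias correction (s starts at 0)
    return theta - lr * m_hat / (np.sqrt(s_hat) + eps), m, s

theta, m, s = np.array([5.0]), 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * theta                             # gradient of J(theta) = theta^2
    theta, m, s = adam_step(theta, m, s, grad, t)
```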
- 33. Learning rate techniques ❑ Predetermined piecewise constant learning rate: for example, set the learning rate to η0 = 0.1 at first, then to η1 = 0.001 after 50 epochs. ❑ Performance scheduling: measure the validation error every N steps (just like for early stopping) and reduce the learning rate by a factor of λ when the error stops dropping. ❑ Exponential scheduling: set the learning rate to a function of the iteration number t: η(t) = η0 10^(−t/r). This works great, but it requires tuning η0 and r. The learning rate drops by a factor of 10 every r steps. ❑ Power scheduling: set the learning rate to η(t) = η0 (1 + t/r)^(−c). The hyperparameter c is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly. (See the sketch below.)
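The three parameterized schedules as plain Python functions; the η0, r, and c values are illustrative defaults:

```python
import numpy as np

def piecewise_constant(t, boundaries=(50,), rates=(0.1, 0.001)):
    # e.g. 0.1 for the first 50 epochs, then 0.001
    return rates[np.searchsorted(boundaries, t, side="right")]

def exponential(t, eta0=0.1, r=1000):
    # drops by a factor of 10 every r steps
    return eta0 * 10 ** (-t / r)

def power(t, eta0=0.1, r=1000, c=1):
    # similar shape to exponential, but decays much more slowly
    return eta0 * (1 + t / r) ** (-c)
```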
- 34. Dropout ❑ It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step
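A sketch of (inverted) dropout at training time; scaling the surviving activations by 1/(1 − p) keeps their expected value unchanged, so no rescaling is needed at test time:

```python
import numpy as np

def dropout(activations, p=0.5, training=True, seed=0):
    if not training:
        return activations                      # every neuron is active at test time
    rng = np.random.default_rng(seed)
    keep = rng.random(activations.shape) >= p   # drop each neuron with probability p
    return activations * keep / (1.0 - p)       # inverted-dropout rescaling

h = np.ones((2, 8))
print(dropout(h, p=0.5))
```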
- 35. Data Augmentation ❑ Consists of generating new training instances from existing ones (by rotating, resizing, flipping, or cropping them), artificially boosting the size of the training set. ❑ This reduces overfitting, making it a regularization technique. The trick is to generate realistic training instances.
- 36. Convolutional Neural Networks ❑ A convolutional neural network (or ConvNet) is a type of feed-forward artificial neural network. ❑ The architecture of a ConvNet is designed to take advantage of the 2D structure of an input image. ❑ A ConvNet is comprised of one or more convolutional layers (often with a pooling step) and then followed by one or more fully connected layers as in a standard multilayer neural network.
- 37. How CNN works • For example, a ConvNet takes an image as input and classifies it as ‘X’ or ‘O’
- 38. ConvNet Layers ▪CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the small region they are connected to in the input volume. ▪RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves the size of the volume unchanged. ▪POOL layer will perform a down-sampling operation along the spatial dimensions (width, height). ▪FC (i.e. fully connected) layer will compute the class scores, resulting in a volume of size [1x1xN], where each of the N numbers corresponds to a class score among the N categories.
- 39. Convolutional Layer - Filters ▪ The CONV layer’s parameters consist of a set of learnable filters. ▪ Every filter is small spatially (along width and height), but extends through the full depth of the input volume. ▪ During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position.
- 40. Convolutional Layer - Filters • Sliding the filter over the width and height of the input gives a 2-dimensional activation map that records the filter's response at every spatial position.
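A naive numpy implementation of this sliding dot product for a single 2-D channel (no padding, stride 1), just to make the computation concrete. Strictly speaking this is cross-correlation, which is what CNN libraries actually compute:

```python
import numpy as np

def conv2d(image, kernel):
    # image: (H, W), kernel: (kH, kW); output: (H-kH+1, W-kW+1) activation map
    kH, kW = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the filter and one local region
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

image = np.arange(25.0).reshape(5, 5)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
print(conv2d(image, edge_filter))                # 3x3 activation map
```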
- 41. Convolutional Layer – Filters – Example
- 42. Convolutional Layer – Filters – Computation Example
- 43. Convolutional Layer – Filters – Output Feature Map
- 44. ReLU Layer
- 45. Pool Layer ▪ The pooling layers down-sample the previous layer's feature map. ▪ Their function is to progressively reduce the spatial size of the representation, reducing the number of parameters and the amount of computation in the network. ▪ The pooling layer often uses the Max operation to perform the down-sampling.
- 46. Pooling filter example: size = 2 × 2, stride = 2
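A numpy sketch of 2 × 2 max pooling with stride 2, matching the example above (height and width assumed even):

```python
import numpy as np

def max_pool_2x2(x):
    # x: (H, W) with even H, W; output: (H//2, W//2)
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))   # [[6, 8], [3, 4]]
```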
- 47. Fully connected layer ❑ Fully connected layers are the normal flat feed-forward neural network layers. ❑ These layers may have a non-linear activation function or a softmax activation in order to predict classes. ❑ To compute the output, we simply flatten the output matrices into a 1-D array.
- 48. SoftMax operation ❑ A special kind of activation layer, usually applied to the outputs of the final FC layer ❑ Can be viewed as a fancy normalizer (a.k.a. normalized exponential function) ❑ Produces a discrete probability distribution vector ❑ Very convenient when combined with cross-entropy loss
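A numerically stable numpy softmax (subtracting the max avoids overflow in the exponentials), together with the cross-entropy pairing the slide mentions:

```python
import numpy as np

def softmax(logits):
    # shift by the max for numerical stability; the result sums to 1
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target_index):
    # convenient pairing: loss = -log(probability of the true class)
    return -np.log(probs[target_index])

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, cross_entropy(p, 0))
```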
- 49. Recurrent Neural Network ❑ Some problems require previous history/context in order to give proper output (speech recognition, stock forecasting, target tracking, etc.). ❑ One way to do that is to provide all the necessary context in one "snapshot" and use standard learning. ➢ But how big should the snapshot be? It varies for different instances of the problem. ✓ If the input sequences are of fixed length, or can easily be padded to a fixed length, they can be collapsed into a single input vector and handled by any of the standard pattern classification algorithms.
- 50. Sequential data ❑ There are many tasks that require learning a temporal sequence of events. ❑ These problems can be broken into 3 distinct types of tasks: ➢ Sequence Recognition: produce a particular output pattern when a specific input sequence is seen. Applications: sentiment analysis, handwriting recognition. ➢ Sequence Reproduction: generate the rest of a sequence when the network sees only part of it. Applications: time series prediction (stock market, sunspots, etc.), language models. ➢ Temporal Association: produce a particular output sequence in response to a specific input sequence. Applications: machine translation, speech generation. ✓ Recurrent networks are flexible enough to solve all of these problems.
- 51. Recurrent Networks offer a lot of flexibility: (1) Fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).
- 52. Recurrent Neural Networks ❑ Recurrent neural network lets the network dynamically learn how much context it needs in order to solve the problem. ❑ RNN is a multilayer NN with the previous set of hidden unit activations feeding back into the network along with the inputs. ❑ RNNs have a “memory” which captures information about what has been calculated so far.
- 53. Recurrent neural networks ❑ Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them. ❑ It means local connections are shared (same weights) across different temporal instances of the hidden units. ❑ If we had to define a different function Gt for each possible sequence length, each with its own parameters, we would not get any generalization to sequences of a size not seen in the training set.
- 54. Dynamic systems ❑ A means of describing how one state develops into another state over the course of time. ❑ Consider the classical form of a dynamical system: s_t = f_θ(s_{t−1}) ✓ where s_t is the system state at time t and f_θ is a mapping function. ❑ The same parameters (the same function f_θ) are used for all time steps. ❑ (Figure: the unfolded flow graph of such a system.)
- 55. Dynamic systems ❑ Now consider a dynamical system driven by an external signal x_t: s_t = f_θ(s_{t−1}, x_t). The state s_t now contains information about the whole past sequence.
- 57. Cost function ❑ The total loss for a given input/target sequence pair (x, y), measured in cross entropy: L(y, ŷ) = Σ_t L_t = −Σ_t y_t log ŷ_t • where y_t is the category that should be associated with time step t in the output sequence and ŷ_t is the predicted output.
- 58. Computing the gradient in RNN. Using generalized back-propagation one can obtain the so-called Back-Propagation Through Time (BPTT) algorithm. We can then iterate backwards in time to back-propagate gradients through time, from t = T − 1 down to t = 1, noting that s_t (for t < T) has as descendants both o_t and s_{t+1}.
- 59. Exploding or vanishing gradient ❑ In recurrent nets (and also in very deep nets), the final output is the composition of a large number of non-linear transformations. ❑ Even if each of these non-linear transformations is smooth, their composition might not be. ❑ The derivative (i.e. the Jacobian matrix) through the whole composition will tend to be either very small or very large. ❑ Example: suppose all factors in the product are scalars with the same value α. As the number of multiplications T goes to ∞, α^T → ∞ if α > 1 and α^T → 0 if α < 1.
- 60. Gradient clipping ❑ Once the gradient value grows extremely large, it causes an overflow (i.e. NaN) which is easily detectable at runtime. ❑ A simple heuristic solution clips gradients to a small number whenever they explode: whenever they reach a certain threshold, they are set back to a small number, as in the sketch below. (Figure: error surface of a single-hidden-unit RNN.)
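A sketch of the norm-clipping variant of this heuristic: if the global gradient norm exceeds a threshold, rescale the gradient back to that threshold while keeping its direction:

```python
import numpy as np

def clip_by_norm(grads, threshold=1.0):
    # grads: list of numpy arrays (one per parameter tensor)
    norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]  # keep direction, shrink length
    return grads

g = [np.array([30.0, 40.0])]           # norm = 50
print(clip_by_norm(g, threshold=5.0))  # rescaled to norm 5: [3.0, 4.0]
```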
- 61. Facing the vanishing gradient problem ❑ Echo State Networks (ESN) ❑ Long delays ❑ Leaky Units ❑ Gated Recurrent Neural Networks
- 62. Echo State Networks (ESN) ❑ How do we set the input and recurrent weights so that a rich set of histories can be represented in the recurrent neural network state? ❑ Answer: make the dynamical system associated with the recurrent net nearly on the edge of stability, i.e., more precisely, with values around 1 for the leading eigenvalue of the Jacobian of the state-to-state transition function. ❑ ESNs fix the weights of the input→hidden and hidden→hidden connections at carefully chosen random values that make the Jacobians slightly contractive. This is achieved by making the leading eigenvalue λ of the weight matrix large but slightly less than 1. ❑ ESNs only learn the hidden→output connections.
- 63. Skip Connections (Long delays) ❑ Adding longer-delay connections allows us to connect past states to future states through short paths. ❑ If we have a connection at every time step, gradients vanish or explode over T time steps as O(λ^T). ❑ Instead, if we have recurrent connections with a time delay of D, gradients evolve as O(λ^(T/D)), mitigating vanishing, although they may still explode for large T. ❑ This is because the number of effective steps is T/D, which allows the learning algorithm to capture longer dependencies.
- 64. Gated Recurrent Neural Networks ❑ GRNNs are a special kind of RNN, capable of learning long-term dependencies by having more persistent memory. Two popular architectures: ➢ Long short-term memory (LSTM) [Hochreiter and Schmidhuber, 1997]. ➢ Gated recurrent unit (GRU), [Cho et al., 2014] ❑ Applications: handwriting recognition (Graves et al., 2009), speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), handwriting generation (Graves, 2013), machine translation (Sutskever et al., 2014a), image to text conversion (captioning) (Kiros et al., 2014b; Vinyals et al., 2014b; Xu et al., 2015b) and parsing (Vinyals et al., 2014a).
- 65. Long Short-Term Memory (LSTM) ❑ Standard RNNs have a very simple repeating module structure, such as a single tanh layer. ❑ LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.
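A minimal numpy sketch of one LSTM step showing the four interacting layers (forget gate, input gate, candidate layer, output gate); the weight shapes and random initialization are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    # W: (4*n_hidden, n_input + n_hidden), b: (4*n_hidden,)
    n = len(h_prev)
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[:n])            # forget gate: what to erase from the cell state
    i = sigmoid(z[n:2*n])         # input gate: what to write
    g = np.tanh(z[2*n:3*n])       # candidate values
    o = sigmoid(z[3*n:])          # output gate: what to expose
    c = f * c_prev + i * g        # new long-term cell state
    h = o * np.tanh(c)            # new hidden state / output
    return h, c

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```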
- 66. Generate image caption ❑ Vinyals et al., Show and Tell: A Neural Image Caption Generator, arXiv 2014 ❑ Use a CNN as an image encoder and transform the image into a fixed-length vector ❑ This vector is used as the initial hidden state of a “decoder” RNN that generates the target sequence
- 67. Translate videos to sentences ❑ Venugopalan et al. arXiv 2014 ❑ The challenge is to capture the joint dependencies of a sequence of frames and a corresponding sequence of words
- 68. Reinforcement Learning ❑ One of the most exciting fields of Machine Learning today, and also one of the oldest. ❑ It has been around since the 1950s, producing many interesting applications over the years, in particular in games (e.g., TD-Gammon, a Backgammon-playing program). ❑ A revolution took place in 2013, when researchers from a British startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch. ❑ DeepMind was bought by Google for over 500 million dollars in 2014.
- 69. Learning to Optimize Rewards ❑ In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards. ❑ Its objective is to learn to act in a way that will maximize its expected long-term rewards. ❑ The agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.
- 70. Examples of RL agents: (a) walking robot, (b) Ms. Pac-Man, (c) Go player, (d) thermostat, (e) automatic trader
- 71. Policy Search ❑ The algorithm a software agent uses to determine its actions is called its policy. ❑ For example, the policy could be a neural network taking observations as inputs and outputting the action to take.
- 72. Stochastic policy ❑ The policy can be any algorithm you can think of, and it does not even have to be deterministic. ❑ For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with some probability p every second, or randomly rotate left or right with probability 1 – p. ❑ The rotation angle would be a random angle between –r and +r. Since this policy involves some randomness, it is called a stochastic policy.
- 73. Introduction to OpenAI Gym ❑ One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working environment. ❑ If you want to program an agent that will learn to play an Atari game, you will need an Atari game simulator. ❑ If you want to program a walking robot, then the environment is the real world and you can directly train your robot in that environment.
- 74. Example of environment ❑ The CartPole environment: a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it
- 75. Neural Network Policies ❑ In the case of the CartPole environment, there are just two possible actions (left or right), so the policy network needs a single output: the probability of action 0 (left). ❑ For example, if it outputs 0.7, then we pick action 0 with 70% probability, and action 1 with 30% probability.
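A sketch of this sampling rule with OpenAI Gym. The classic `env.reset()`/`env.step()` API is assumed (return signatures changed in later Gym versions), and the policy here is a constant placeholder rather than a trained network:

```python
import gym
import numpy as np

env = gym.make("CartPole-v1")
obs = env.reset()                       # older Gym API; newer versions return (obs, info)

def policy_prob_left(obs):
    # placeholder for a neural network's output: P(action = 0, i.e. push left)
    return 0.7

for _ in range(100):
    p_left = policy_prob_left(obs)
    action = 0 if np.random.rand() < p_left else 1   # sample, don't take the argmax
    obs, reward, done, info = env.step(action)       # 4-tuple in older Gym versions
    if done:
        obs = env.reset()
```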
- 76. Markov Decision Processes ❑ In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains. ❑ Such a process has a fixed number of states, and it randomly evolves from one state to another at each step. ❑ The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s,s′), not on past states (the system has no memory). ❑ Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more.
- 77. Markov Chain Example ❑ Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step. ❑ Eventually it is bound to leave that state and never come back, since no other state points back to s0. ❑ If it goes to state s1, it will then most likely go to state s2 (90% probability), then immediately back to state s1 (with 100% probability).
- 78. Another Example
- 79. Example: Grid World ❑ Noisy movement: actions do not always go as planned ❑ 80% of the time, the action North takes the agent North (if there is no wall there) ❑ 10% of the time, North takes the agent West; 10% East ❑ If there is a wall in the direction the agent would have been taken, the agent stays put. ❑ The agent receives rewards each time step ▪ Small “living” reward each step (can be negative) ▪ Big rewards come at the end (good or bad) ❑ Goal: maximize sum of rewards
- 80. Grid World Actions Deterministic Grid World Stochastic Grid World
- 81. Markov Decision Processes ❑ An MDP is defined by: ▪ A set of states s ∈ S ▪ A set of actions a ∈ A ▪ A transition function T(s, a, s’) ▪ Probability that a from s leads to s’, i.e., P(s’| s, a) ▪ Also called the model or the dynamics ▪ A reward function R(s, a, s’) ▪ Sometimes just R(s) or R(s’) ▪ A start state ▪ Maybe a terminal state
- 82. What is Markov about MDPs? ❑ “Markov” generally means that given the present state, the future and the past are independent ❑ For Markov decision processes, “Markov” means action outcomes depend only on the current state ❑ This is just like search, where the successor function could only depend on the current state (not the history) Andrey Markov (1856-1922)
- 83. Markov Property. (Figure: a state sequence S0, S1, …, St−1, St, St+1; the Markov property states that P(St+1 | St, St−1, …, S0) = P(St+1 | St).)
- 84. Policies ❑ In deterministic single-agent search problems, we wanted an optimal plan, or sequence of actions, from start to a goal ❑ For MDPs, we want an optimal policy π*: S → A ▪ A policy π gives an action for each state ▪ An optimal policy is one that maximizes expected utility if followed ▪ An explicit policy defines a reflex agent Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
- 85. Optimal Policies. (Figure: optimal policies for living rewards R(s) = −0.01, −0.03, −0.4, and −2.0.)
- 86. Utilities of Sequences ▪ What preferences should an agent have over reward sequences? ▪ More or less? [1, 2, 2] or [2, 3, 4]? ▪ Now or later? [0, 0, 1] or [1, 0, 0]?
- 87. Discounting ▪ It's reasonable to maximize the sum of rewards. ▪ It's also reasonable to prefer rewards now to rewards later. ▪ One solution: values of rewards decay exponentially. (Figure: a reward is worth 1 now, γ one step later, and γ² two steps later.)
- 88. Discounting ▪ How to discount? Each time we descend a level, we multiply in the discount once. ▪ Why discount? Sooner rewards probably do have higher utility than later rewards; discounting also helps our algorithms converge. ▪ Example with a discount of 0.5: U([1,2,3]) = 1·1 + 0.5·2 + 0.25·3 = 2.75, while U([3,2,1]) = 3 + 0.5·2 + 0.25·1 = 4.25, so U([1,2,3]) < U([3,2,1]).
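The arithmetic from the example as a small helper (γ = 0.5):

```python
def discounted_utility(rewards, gamma=0.5):
    # U([r0, r1, r2, ...]) = r0 + gamma*r1 + gamma^2*r2 + ...
    return sum(r * gamma**t for t, r in enumerate(rewards))

print(discounted_utility([1, 2, 3]))   # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_utility([3, 2, 1]))   # 3 + 0.5*2 + 0.25*1 = 4.25
```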
- 89. Infinite Utilities?! ▪ Problem: What if the game lasts forever? Do we get infinite rewards? ▪ Solutions: ▪ Finite horizon: (similar to depth-limited search) ▪ Terminate episodes after a fixed T steps (e.g. life) ▪ Gives nonstationary policies (π depends on time left) ▪ Discounting: use 0 < γ < 1 ▪ Smaller γ means smaller “horizon” – shorter term focus ▪ Absorbing state: guarantee that for every policy, a terminal state will eventually be reached
- 90. THANKS
- 91. QUESTIONS?