This presentation is Part 2 of my September Lisp NYC presentation on Reinforcement Learning and Artificial Neural Nets. We will continue from where we left off by covering Convolutional Neural Nets (CNN) and Recurrent Neural Nets (RNN) in depth.
Time permitting I also plan on having a few slides on each of the following topics:
1. Generative Adversarial Networks (GANs)
2. Differentiable Neural Computers (DNCs)
3. Deep Reinforcement Learning (DRL)
Some code examples will be provided in Clojure.
After a very brief recap of Part 1 (ANN & RL), we will jump right into CNNs and their appropriateness for image recognition. We will start by covering the convolution operator. We will then explain feature maps and pooling operations and then explain the LeNet-5 architecture. The MNIST data set will be used to illustrate a fully functioning CNN.
Next we cover Recurrent Neural Nets in depth and describe how they have been used in Natural Language Processing. We will explain why gated networks and LSTM are used in practice.
Please note that some exposure or familiarity with Gradient Descent and Backpropagation will be assumed. These are covered in the first part of the talk for which both video and slides are available online.
A lot of material will be drawn from the new Deep Learning book by Goodfellow & Bengio, from Michael Nielsen's online book Neural Networks and Deep Learning, as well as several other online resources.
Bio
Pierre de Lacaze has over 20 years industry experience with AI and Lisp based technologies. He holds a Bachelor of Science in Applied Mathematics and a Master’s Degree in Computer Science.
https://www.linkedin.com/in/pierre-de-lacaze-b11026b/
3. Deep Neural Networks
• A deep neural network is a neural network with multiple
layers of hidden units.
– E.g. Multi-Layered Perceptrons (MLPs)
• Convolutional Neural Nets (CNNs)
– Biologically-inspired variants of MLPs
– Successfully used in image recognition, speech recognition
• Recurrent Neural Nets (RNN)
– Cyclic graphs where later layers feed back into earlier layers
– Allow for a window of time into past data
– Successfully used for Natural Language Processing.
4. Application: Combining CNNs & RNNs
GENERATING IMAGE DESCRIPTIONS
Together with Convolutional Neural Networks, RNNs have been used as part of a model to generate
descriptions for unlabeled images. It’s quite amazing how well this seems to work. The combined model even
aligns the generated words with features found in the images.
Deep Visual-Semantic Alignments for Generating Image Descriptions. Source: http://cs.stanford.edu/people/karpathy/deepimagesent
5. Part 0
ANN Review &
Multi-Layered Perceptrons
(MLPs)
Multi Layered Perceptrons (MLPs) are fully
connected feed forward networks with several
layers of hidden units.
6. Linear Units and Perceptrons
• Linear Unit: A linear combination of weighted inputs (real-valued)
• Perceptron: Thresholded Linear Unit (discrete-valued)
Note: w0 is a bias whose purpose is to move the threshold of the activation function.
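As a first taste of the Clojure examples, here is a minimal sketch of a linear unit and a perceptron (the function names and the AND-gate weights are illustrative, not from the slides):

;; Minimal sketch: a linear unit and a thresholded perceptron.
;; The bias w0 is folded in by prepending a constant 1 to the inputs.
(defn linear-unit [weights inputs]
  (reduce + (map * weights (cons 1.0 inputs))))

(defn perceptron [weights inputs]
  (if (pos? (linear-unit weights inputs)) 1 -1))

;; Example: a perceptron computing logical AND of two inputs.
(perceptron [-1.5 1.0 1.0] [1 1]) ;=> 1
(perceptron [-1.5 1.0 1.0] [0 1]) ;=> -1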
7. Multi Layered Perceptrons
• These are fully connected Deep Feed Forward Networks
• Every output from the previous layer is connected to every unit in the next layer
• They are typically trained using the Backpropagation Algorithm
• Backpropagation is effectively Gradient Descent applied to every unit in the network.
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 2.
9. ANN Backpropagation Algorithm
(Using incremental gradient descent)
1. Initialize weights to small random numbers
2. Until the termination criterion is met, for each training example:
a. Compute the network outputs for the training example
b. For each output unit k compute its error:
δk = ok (1 – ok) (tk – ok)
c. For each hidden unit h compute its error:
δh = oh (1 – oh) Σk (whk δk)
d. Update each network weight wij
wij = wij + η δj xij
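A minimal Clojure sketch of steps (b)–(d) for a network with one hidden layer (variable names are illustrative; w-hk holds, for each hidden unit h, its outgoing weights whk to the output units):

;; outputs : activations o_k of the output units
;; targets : target values t_k
;; hidden  : activations o_h of the hidden units
(defn output-deltas [outputs targets]
  (mapv (fn [o t] (* o (- 1 o) (- t o))) outputs targets))

(defn hidden-deltas [hidden w-hk out-deltas]
  (mapv (fn [o ws]
          (* o (- 1 o) (reduce + (map * ws out-deltas))))
        hidden w-hk))

;; w_ij <- w_ij + eta * delta_j * x_ij, for a single unit j with inputs xs
(defn update-weights [weights eta delta xs]
  (mapv (fn [w x] (+ w (* eta delta x))) weights xs))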
12. MLP Training Comparisons
❶ MLP with 1 hidden layer of 3 hidden units: 4,500 iterations to converge
❷ MLP with 2 hidden layers of 3 hidden units: 28,000 iterations to converge
❸ MLP with 3 hidden layers of 3 hidden units: 1,000,000+ iterations to converge
13. Part 1
Convolutional Neural Nets
(CNNs)
Convolutional Neural Networks are
biologically-inspired variants of Multi Layered
Perceptrons (MLPs)
14. History of CNNs
• Research dates back to the 1970’s
• Seminal Paper on CNNs:
– Gradient-based learning applied to document recognition,
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, 1998
• Really took off in 2012
– ILSVRC (ImageNet Large-Scale Visual Recognition Challenge)
– 2012 ILSVRC: AlexNet, Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
– 2013 ILSVRC: ZF Net, Matthew Zeiler and Rob Fergus, NYU
– 2014: VGG Net, Karen Simonyan and Andrew Zisserman, University of Oxford
15. CNN Overview
• A CNN typically consists of one or more convolutional and sampling layers
followed by one or more fully connected layers.
• Specifically designed to exploit 2D input such as an image or speech input
• Faster to train than fully connected networks.
• Sparse Connectivity
– CNNs exploit spatially-local correlation using local connectivity pattern between units of adjacent layers.
– These are called local receptive fields
• Shared Weights
– Replicated units share the same parameterization (weight vector and bias) and form a feature map.
• Max Pooling
– A form of non-linear down-sampling. Max pooling partitions the input image into a set of non-overlapping
rectangles and, for each such sub-region, outputs the maximum value.
16. Local Receptive Fields
• In a fully connected network, every input in the input layer is connected to
every hidden unit.
• This prevents the network from learning spatial features of the image.
• The idea is to map (connect) small rectangular sections of the image (inputs)
to different hidden units.
• These hidden units are called local receptive fields and result in a sparse
connectivity between the input layer and the first hidden layer.
• The stride length is the amount by which we shift the rectangular sections;
typically the sections are shifted by 1 pixel.
• Different sets of local receptive fields form feature maps each of which
represent a potentially different feature.
17. Feature Maps
• Each hidden unit shares the same set of weights and bias but
for a different spatial area of the input.
• This allows that layer to learn the same feature but for
different regions of the image.
• The complete hidden layer will in fact consist of several
feature maps. This is called a convolutional layer.
• The shared bias and weights in each feature map are often
called filters or kernels.
18. How Feature Maps Work
The amount by which
the local receptive field
is shifted is called the
stride length.
A stride length of 1 is
common.
All hidden units in a
feature map share the
same weights and bias.
This greatly reduces the
number of parameters in
a layer.
Image credit: Michael Nielsen’s Neural Networks and Deep Learning, Chapter 6.
19. Why Do Feature Maps Learn Different Features?
• From Quora: Andy Thomas
• Two reasons:
– The weights of the filters are randomly initialized
– Different feature maps reduce the cost function
• Random initialization of the weights will likely ensure each filter
converges to different local minima in the cost function. It is very
unlikely that each filter would begin to resemble other filters, as
that would almost certainly result in an increase of the cost
function and therefore no gradient descent algorithm would head
in that direction.
• Some feature maps may learn the same feature.
20. The Convolution Operator
• A Convolution is a simple mathematical operation common to many image
processing operators.
• Provides a way of “multiplying” two arrays of numbers of different sizes
but same dimensionality
• If the input image has M rows and N columns and the kernel has m rows and n columns,
• then the output image will have M − m + 1 rows and N − n + 1 columns.
• The purpose of Convolution in a CNN is to extract features from the input image.
• Convolution preserves the spatial relationship between pixels by learning image
features using small squares of input data
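A minimal Clojure sketch of this "valid" convolution (pure Clojure, no libraries; as in most CNN implementations the kernel is not flipped, so strictly speaking this is a cross-correlation):

(defn conv2d
  "Valid 2D convolution of an M x N image with an m x n kernel,
   producing an (M - m + 1) x (N - n + 1) output."
  [image kernel]
  (let [m    (count kernel)
        n    (count (first kernel))
        rows (inc (- (count image) m))
        cols (inc (- (count (first image)) n))]
    (vec
     (for [i (range rows)]
       (vec
        (for [j (range cols)]
          (reduce +
                  (for [k (range m) l (range n)]
                    (* (get-in kernel [k l])
                       (get-in image [(+ i k) (+ j l)]))))))))))

;; Example: a 3x3 image convolved with a 2x2 kernel gives a 2x2 output.
(conv2d [[1 2 3] [4 5 6] [7 8 9]] [[1 0] [0 1]])
;=> [[6 8] [12 14]]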
21. Output of the Convolutional Layer
• For each hidden unit in each feature map, only take
into account pixels in the local receptive field (sparse
connectivity)
• For each feature map, for the j,k-th hidden unit in
that feature map, assuming a 5×5 filter (aka kernel),
the output of that unit is given by:
– σ( b + Σl=0..4 Σm=0..4 wl,m aj+l,k+m )
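Using the conv2d sketch from the previous slide's notes, the whole feature map can be computed in one step (a sketch, assuming a sigmoid activation):

(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

;; One feature map: convolve the input with the shared kernel,
;; add the shared bias b, and apply the activation to every unit.
(defn feature-map [image kernel b]
  (mapv (fn [row] (mapv #(sigmoid (+ b %)) row))
        (conv2d image kernel)))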
22. Pooling
• A pooling layer typically follows a convolutional layer.
• Intuitively it is a down sampling of the previous layer.
• Max pooling is a technique that selects the maximum
activation from a set of units in the convolutional
layer.
• Effectively, each feature map from the convolutional
layer is reduced to a smaller feature map.
• Other pooling techniques:
– L2 Pooling
• Takes the square root of the sum of the squares of a set of units
23. How Pooling Works
• Pooling is a form of statistical aggregation or downsampling of the previous layer.
• Pooling layers do not learn anything
• While it is common, it is not required to have a pooling layer after a convolutional layer
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6
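A minimal Clojure sketch of 2×2 max pooling over a single feature map (assumes even dimensions; purely illustrative):

;; Each non-overlapping 2x2 region of the feature map is replaced
;; by its maximum activation.
(defn max-pool-2x2 [feature-map]
  (vec (for [rows (partition 2 feature-map)]
         (vec (for [cols (partition 2 (apply map vector rows))]
                (apply max (flatten cols)))))))

;; Example: a 4x4 feature map pools down to 2x2.
(max-pool-2x2 [[1 3 2 1]
               [4 2 5 7]
               [3 1 0 2]
               [6 8 4 3]])
;=> [[4 7] [8 4]]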
24. Backpropagation in CNNs Overview
• Applying backpropagation to a convolutional layer is very similar to
applying backpropagation to a fully connected layer except that errors and
gradients are computed separately for each filter.
• Applying backpropagation to a pooling layer involves using an
upsampling function which propagates the error over the sampling
function using its derivatives.
• Backpropagation for a fully connected layer is exactly the same as
for MLPs.
• Yoshua Bengio on Quora: “There is a general recipe for obtaining a
back-propagation algorithm associated with ANY computational
graph. You can find it described in my book, for example, in the
feedforward nets (mlp) chapter (6): DEEP LEARNING”
25. Backpropagation in CNNs
• Error and gradient for fully connected layers
• Error and gradient for convolutional layer
• k indexes the filter number and upsample propagates the error through the pooling layer
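A hedged reconstruction of these formulas, in the notation of slide 21 (wl,m and b are the shared filter weights and bias, a the layer's inputs, σ′ the derivative of the activation; the slide's exact notation may differ):
δj,k = upsample(δpool)j,k · σ′(zj,k)
∂C/∂wl,m = Σj,k δj,k aj+l,k+m      ∂C/∂b = Σj,k δj,k
computed separately for each filter.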
27. MNIST Data Set
• National Institute of Standards and Technology (NIST)
• Modified NIST Data Set maintained by Yann LeCun
• MNIST Data in CSV format
28. A Simple Architecture for MNIST
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6.
• Input layer: 784 inputs (28×28) encode the MNIST image
• Convolutional layer: 1728 units representing 3 feature maps
• Max-Pooling layer: 432 units representing 3 feature maps
• Output layer: 10 units, one for each MNIST digit
29. Shared Weights and Training CNNs
• CNN
– 28×28 = 784 input neurons
– 20 feature maps, each with 5×5 = 25 shared weights plus 1 bias: 20×26 = 520
– Total of 520 parameters to learn.
• MLP
– 784=28×28 inputs,
– 30 hidden units,
– Total of 784×30 = 23,520 weights,
– Total of 30 biases,
– Total of 23,550 parameters to learn.
• A single fully-connected layer would have more than 40 times as
many weights as the convolutional layer.
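The arithmetic behind this comparison, as a quick Clojure check (assuming 5×5 filters):

(def cnn-params (* 20 (+ (* 5 5) 1)))   ; 20 filters x (25 weights + 1 bias) = 520
(def mlp-params (+ (* 784 30) 30))      ; 784x30 weights + 30 biases = 23,550
(double (/ mlp-params cnn-params))      ;=> ~45.3, i.e. more than 40 times as many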
30. A CNN Architecture for MNIST
Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6
• 9,967 test images correctly classified out of 10,000
• Very similar to LeNet-5 architecture
• Softmax Regression aka Multi-class Logistic Regression is a generalization of
logistic regression that is used for multi-class classification and is based on the
softmax function.
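A minimal Clojure sketch of the softmax function (subtracting the max is a standard numerical-stability trick, not something the slide mentions):

;; Turns raw scores into a probability distribution, e.g. over the 10 digits.
(defn softmax [zs]
  (let [m    (apply max zs)
        exps (map #(Math/exp (- % m)) zs)
        s    (reduce + exps)]
    (mapv #(/ % s) exps)))

(softmax [1.0 2.0 3.0]) ;=> [0.090 0.245 0.665] (approximately)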
31. Incorrectly Classified MNIST Images
Of the 10,000 MNIST test images, 9,967 were correctly classified and 33 were incorrectly classified.
32. What features are learned?
• The images above show the type of features the convolutional layer learns.
• Lighter regions mean a smaller, typically negative weight,
• Darker regions mean a larger weight
• Many of the features have distinguishable sub-regions of light and dark
• It’s clear that it’s learning “stuff” related to spatial structure
33. Performance Enhancements
• Regularization Terms to help with overfitting
– Regularization is a technique that adds a penalty term to
the loss function.
• Ensemble methods
– Train several nets and have them vote on the output.
• Generating expanded data sets
– Basically apply distortions to the original data set
– E.g. 50,000 images → 250,000 images
35. CNN Summary
• There are four main operations in a CNN:
– Convolution
– Non Linearity (ReLU)
– Pooling or Sub Sampling
– Classification (Fully Connected Layer)
• These operations are the basic building blocks of every CNN.
• CNN’s Faster to train than MLPs because fewer parameters need to be learned.
• Work well with two-dimensional data in which locality is meaningful,
– e.g. object recognition in images.
• CNNs can also be used with higher dimensional data
– e.g. MRI Images
• Additional convolutional layers provide higher-level features (meta-features)
• Pooling layers progressively reduce the spatial size of the representation to reduce the number of parameters and the
computational complexity of the network
• Fully Connected layer at the end provides the classifier
• Networks based on Rectified Linear Units (ReLU) typically outperform networks based on saturating activation functions
(sigmoid or tanh).
36. Part 2
Recurrent Neural Nets
(RNNs)
Recurrent Neural Networks are a family of
Neural Networks for processing sequential data.
37. Recurrent Neural Nets Overview
• Leverage the ideas
– unfolding computational graphs
– parameter-sharing to abstract away input position
• “In 2009 I visited Nepal” vs “I visited Nepal in 2009”
• RNNs represent cyclical graphs so information flows in both directions through the
network.
– They are networks with loops in them, allowing information to persist.
• Different flavors of RNNs
– An output at each time-step and recurrent connections between hidden units
– An output at each time-step and recurrent connections only from output units
– An output only after the entire sequence is fed into the network and connections between
hidden units.
• RNNs can simulate a Turing Machine and can represent any computable function
– Siegelmann and Sontag, 1995.
– Used an RNN of finite size consisting of 886 units
38. RNNs in Practice
• Types of RNN used in Practice
– Vanilla RNNs
– Bidirectional RNNs
– Deep Bidirectional RNNs
– Long Short-Term Memory (LSTM)
• Practical Applications of RNNs
– Language Modeling And Generating Text
– Machine Translation
– Speech Recognition
– Generating Image Descriptions
39. Computational Graphs
• Computational Graph: Formalization of the
structure of a set of computations.
• Unfolding a recursive computation into a
graph with repetitive structure results in
parameter sharing across a deep network
structure.
• Any function involving a recurrence can be considered an RNN
• Hidden Units in RNN:
– h(t) = f(h(t-1), x(t), θ)
– Notice that θ is the same at each time step.
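A minimal Clojure sketch of this recurrence, assuming f is tanh(Wh·h + Wx·x + b); the parameter names are illustrative:

(defn mat-vec [m v] (mapv (fn [row] (reduce + (map * row v))) m))

;; One time step: h(t) = tanh(Wh . h(t-1) + Wx . x(t) + b).
;; The same parameters θ = {Wh, Wx, b} are reused at every step.
(defn rnn-step [{:keys [Wh Wx b]} h x]
  (mapv #(Math/tanh %)
        (map + (mat-vec Wh h) (mat-vec Wx x) b)))

;; Folding rnn-step over the input sequence "unfolds" the network in time.
(defn run-rnn [params h0 xs]
  (reductions (partial rnn-step params) h0 xs))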
41. Training RNNs
• Backpropagation in Computational Graphs
– Backpropagation can be derived for any computational graph by recursively applying the chain
rule. (Deep Learning, Chapter 6)
– The backpropagation algorithm consists of performing a Jacobian-gradient-product for each
operation in the graph
– In vector calculus, the Jacobian matrix is the matrix of all first-order partial derivatives of a
vector-valued function
• Backpropagation Through Time (BPTT).
– Gradient at each output depends not only on the calculations of the current time step, but
also the previous time steps.
– Vanilla RNNs trained with BPTT have difficulties learning long-term dependencies, i.e.
dependencies between steps (e.g. words) that are far apart
• “I grew up in France… I speak fluent French”
– Suffers from vanishing/exploding gradient problem.
• Vanishing gradient: your gradients get smaller and smaller in magnitude as you backpropagate through earlier
layers (or through time).
• Activation functions like the sigmoid have derivatives with magnitude less than 1, which easily causes the gradient
to vanish in earlier layers.
• Exploding gradient: more of an issue with recurrent networks, where the opposite happens because the gradient is
repeatedly multiplied by a Jacobian that amplifies it.
– Certain types of RNNs (like LSTMs) were specifically designed to get around these problems.
42. Long Short Term Memory (LSTM)
• LSTMs are a special kind of RNN, capable of learning long-term dependencies.
• Successful in handwriting recognition, speech recognition, image captioning and machine
translation
• Type of gated network
• Introduced by Hochreiter & Schmidhuber (1997)
– Added self-loops which allowed gradient to flow for long durations.
– Weight on the self-loop based on context rather than fixed. (Gers et al., 2000)
– Based on the idea of creating paths through the network in which the gradient neither vanishes nor
explodes.
• Leaky units allowed information to accumulate over a long duration
• LSTMs generalize leaky units by allowing connection weights to change over time.
• LSTMs allow the network to decide when to forget information.
• A single hidden unit in an LSTM is replaced with a recurrent network cell consisting of 4
components that interact with each other.
43. Gated Network Cells
• Gated network cells replace the hidden units of RNNs
• The input feature is computed using a regular ANN unit.
• The input can be accumulated if the input gate allows it.
• The state has a self-loop controlled by the forget gate
• The output can be turned off by the output gate
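A minimal Clojure sketch of how these gates combine in one LSTM step, assuming the gate pre-activations have already been computed (e.g. as W·[h, x] + b) from the previous hidden state and the current input:

(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

;; c is the previous cell state; zi, zf, zo, zg are the pre-activations of
;; the input, forget and output gates and the candidate cell input.
(defn lstm-step [c zi zf zo zg]
  (let [i  (mapv sigmoid zi)                     ; input gate
        f  (mapv sigmoid zf)                     ; forget gate
        o  (mapv sigmoid zo)                     ; output gate
        g  (mapv #(Math/tanh %) zg)              ; candidate values
        c' (mapv + (mapv * f c) (mapv * i g))    ; self-loop controlled by the forget gate
        h' (mapv * o (mapv #(Math/tanh %) c'))]  ; output gate controls the output
    [c' h']))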
44. LSTM in NLP Generation
Image credit: Google Research Blog
45. LSTM Summary
• A type of RNN architecture that addresses the
vanishing/exploding gradient problem.
• LSTMs allow the learning of long-term
dependencies which is crucial for sequences
of inputs.
• Recently achieved state-of-the-art
performance in speech recognition, language
modeling, translation, image captioning
47. Part 3
Generative Adversarial
Networks
(GANs)
Generative Adversarial Networks are an example of generative
models. GANs focus primarily on sample generation, though it is
possible to design GANs that can estimate the probability
distribution.
48. GAN Framework
• Based on the idea of a two player game
– Player 1: Generator
– Player 2: Discriminator
• The generator generates samples and tries to
fool the discriminator
• The discriminator determines whether the samples it
is given are real or fake
49. Why GANs are useful
• When predicting the next frame in a video, using the Mean Squared Error
(MSE) loss averages over many possible futures, which causes the
ear to disappear and the eyes to blur
• The adversarial version does a much better job preserving the ear and not
blurring the eyes.
Image credit: Ian Goodfellow, GANs Tutorial, NIPS 2016
50. GANs Summary
• GANs are generative models that use
supervised learning to approximate an
intractable cost function
• Training GANs requires finding Nash equilibria in high-
dimensional, continuous, non-convex games.
• GANs are crucial to many different state of the
art image generation and manipulation
systems.
51. Part 4
Deep Reinforcement Learning
(DRL)
Deep Reinforcement Learning combines both Deep Learning and
Reinforcement Learning by using Deep Learning techniques to learn values
for the Q Function in Reinforcement Learning. This is described in Google
DeepMind's Atari paper and exemplified by the AlphaGo program
52. Deep Reinforcement Learning
• Combines Reinforcement Learning with Deep Learning
• A form of model-free learning
• Uses Neural Nets to estimate Q Values.
• Very new field. No Wikipedia Page on this topic.
• The idea is to feed states and actions into the network to predict Q values.
• Neural networks are exceptionally good at coming up with good features
for highly structured data.
• This is the technology used by Google DeepMind’s AlphaGo program.
53. Reinforcement Learning Revisited
• Definitions
– Policy π is a way of selecting an action given a state
– Value function Qπ (s,a) is the expected total reward for
performing action a from state s given policy π
• Different Approaches
– Policy Based RL
• Search for the optimal policy in space of policies
– Value-based RL
• Estimate optimal value function Q*(s,a)
– Model-based RL
• Build a model of the environment and use look ahead
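A minimal Clojure sketch of value-based RL in its simplest, tabular form: the Q-learning update Q(s,a) ← Q(s,a) + α(r + γ maxa′ Q(s′,a′) − Q(s,a)). The map-based Q-table is illustrative:

;; Q is a map from [state action] to a value; one update per observed
;; transition <s, a, r, s'>.
(defn q-update [Q [s a r s'] actions alpha gamma]
  (let [best-next (apply max (map #(get Q [s' %] 0.0) actions))
        target    (+ r (* gamma best-next))
        old       (get Q [s a] 0.0)]
    (assoc Q [s a] (+ old (* alpha (- target old))))))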
54. The Many States Problem
• In the Nature DeepMind Atari paper:
• Take the last four screen images, resize them to 84×84 and
convert them to gray scale with 256 gray levels.
• This yields 256^(84×84×4) ≈ 10^67970 possible game states.
• This means 10^67970 rows in our imaginary Q-table.
• That is more than the number of atoms in the known
universe!
56. Deep Q-Learning Error & Gradient
• Represent Q function using a deep network.
• Error function
• Gradient
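A hedged sketch of the standard Deep Q-Learning formulation (assuming the usual squared temporal-difference error):
L = ½ [ r + γ maxa′ Q(s′, a′) − Q(s, a) ]²
The gradient is obtained by backpropagating this loss through the network that represents Q, with the target r + γ maxa′ Q(s′, a′) treated as a constant.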
57. Strategies & Tricks
• Experience Replay
– During gameplay all the experiences <s,a,r,s′> are stored in a replay memory.
– When training the network, random samples from the replay memory are
used instead of the most recent transition.
– This breaks the similarity of subsequent training samples, which otherwise
might drive the network into a local minimum.
– Also experience replay makes the training task more similar to usual
supervised learning, which simplifies debugging and testing the algorithm.
– One could actually collect all those experiences from human gameplay and then
train the network on these.
• Exploration-Exploitation
– ε-greedy exploration
– with probability ε choose a random action, otherwise go with the “greedy”
action with the highest Q-value.
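A minimal Clojure sketch of ε-greedy selection over a tabular Q function (illustrative; a DQN would query the network instead of a table):

;; With probability epsilon pick a random action (explore),
;; otherwise pick the action with the highest Q-value (exploit).
(defn epsilon-greedy [Q s actions epsilon]
  (if (< (rand) epsilon)
    (rand-nth actions)
    (apply max-key #(get Q [s %] 0.0) actions)))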