Introduction to neural networks and deep learning. Seminar given by Héloïse Nonne on February 19th, 2015 at CINaM (Centre Interdisciplinaire de Nanosciences de Marseille), Aix-Marseille University.
3. Big Data?
Explosion of data size
Falling cost of data storage
Increase of computing power
“Information is the oil of the 21st century, and analytics is the combustion engine.”
Peter Sondergaard, Senior Vice President, Gartner Research
4. The falling cost of data storage
Storage cost for 1 GB:
• 1980: $300,000
• 1990: $1,000
• 2000: $100
• 2014: $0.10
1956: IBM 350 RAMAC, capacity 3.75 MB
5. Data growing exponentially
• Over 90% of all the data in the world was created in the past 2 years.
• Now, every year, 2 ZB are generated (1 ZB (zettabyte) = 1 trillion GB)
• IDC (International Data Corporation) predicts a generation of 40 ZB in 2020
• Around 100 hours of video are uploaded to YouTube every minute
• Today’s datacenters occupy an area of land equal in size to almost 6,000 football fields
7. Two approaches to large databases
Source: www.tomshardware.com
High-tech hardware
• Roughly double the cost of commodity
• Roughly 5% failure rate
Commodity (≠ low-end) hardware
• Roughly half the cost
• Roughly 10-15% failure rate
Total failure rate = product of local failure rates -> design for failure at the software level
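As an illustrative calculation (assuming independent failures): if a block of data is replicated on three commodity nodes that each fail with probability 0.15, the probability of losing all three copies is 0.15 × 0.15 × 0.15 ≈ 0.003, i.e. about 0.3%, already far below the 5% failure rate of a single high-tech server at twice the price.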
8. Distribution algorithm: MapReduce
Key principles of a distributed file system (DFS)
• Duplication of data
• Distribution of data
• Co-location of processing with the data
• Parallel processing
• Horizontal and vertical elasticity
Hadoop Distributed File System (HDFS) / computing
Distribution of data over multiple servers
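As a minimal, single-machine sketch of the MapReduce idea (purely illustrative Python; it does not use Hadoop's actual API), counting word occurrences across documents:

```python
# Minimal single-machine sketch of the MapReduce idea (illustrative only,
# not the Hadoop API): count word occurrences across a set of documents.
from collections import defaultdict

def map_phase(document):
    # Emit (key, value) pairs locally, where the data lives.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # Combine all values emitted for the same key.
    return (key, sum(values))

def mapreduce(documents):
    grouped = defaultdict(list)
    for doc in documents:                  # in Hadoop, maps run in parallel on each node
        for key, value in map_phase(doc):
            grouped[key].append(value)     # the "shuffle" step groups values by key
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(mapreduce(["big data", "big neural networks"]))
# {'big': 2, 'data': 1, 'neural': 1, 'networks': 1}
```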
9. Yes but, what for?
Big Data is about having an understanding of what your relationship is with the people who are the most important to you, and an awareness of the potential in that relationship.
Joe Rospars, Chief Digital Strategist, Obama for America
10. The underlying trends of Big Data
The massive digitalization of the economic, industrial, and social spheres opens the way to new approaches in marketing, finance, and industry.
The challenge for executive and operational management is to master this opportunity in order to face profound changes in the markets and to anticipate evolving customer expectations, usages, processes, and infrastructures.
Data Science, or the art of mastering Big Data, tends to overshadow its purely technological aspect, owing to its strategic importance.
Big Data and Data Science are profoundly redefining the relationships between business lines, statistics, and technology.
[Diagram: digitalization of social relations -> marketing; the digital enterprise -> finance; the digital factory -> industry; data monetization -> TMT / banking]
11. Quantmetry: Big Data & Data Science
• Quantmetry is a "pure player" consulting firm dedicated to Data Science and Big Data
• We help companies create value through the analysis of their data
• We are a multidisciplinary team of consultants, data scientists, and Big Data experts
• We base our recommendations on mathematical and statistical models
• Creation and development of dedicated products built on Big Data technologies
• Technological and scientific watch
• Research and development in Data Science
13. Examples of data projects
• Marketing, targeting
• Smart meters: predicting electricity or water consumption
• Identifying the most effective molecules in chemotherapy against breast cancer
• Predicting the occupancy of Vélib bike-sharing stations
• Optimizing air routes according to traffic
• Predicting breakdowns in vehicle fleets
• Predicting droughts using satellite images
• Fraud detection (social security, insurance, taxes)
16. Artificial intelligence (1956)
How to mimic the brain? Build artificial intelligences able to think and act like humans.
• Information travels as electric signals (spikes) along the dendrites and axon
• A neuron gets activated if the electric signal at the synapse is higher than a threshold
• Activation is more intense if the frequency of the signal is high
17. McCulloch & Pitts, Rosenblatt (1950s) The perceptron
a(x) = w1·x1 + w2·x2 + b
h(x) = g(a(x))
Artificial neuron = a computational unit that makes a computation based on the information it gets from other neurons
• x = input vector (real-valued): the electric signal
• w = connection weights: excitation or inhibition of the neuron
• b = neuron bias: simulates a threshold (in combination with the weights)
• g = activation function: activation of the neuron
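A minimal sketch of such a neuron in Python, assuming a sigmoid activation for g; the input, weight, and bias values below are made up for illustration:

```python
import numpy as np

# Illustrative artificial neuron in the slide's notation:
# a(x) = w . x + b, h(x) = g(a(x)), with a sigmoid activation g.
def neuron(x, w, b):
    a = np.dot(w, x) + b             # pre-activation
    return 1.0 / (1.0 + np.exp(-a))  # activation g(a) = sigmoid(a)

x = np.array([0.5, -1.2])   # input vector (the "electric signal")
w = np.array([0.8, 0.3])    # connection weights (excitation / inhibition)
b = -0.1                    # bias, acts as a threshold
print(neuron(x, w, b))      # output between 0 and 1
```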
18. Activation functions
• Heaviside (perceptron): g(a) = 1 if a > 0, 0 otherwise
• Linear function: g(a) = a
• Sigmoid: g(a) = 1 / (1 + exp(−a))
• Tanh: g(a) = (e^a − e^(−a)) / (e^a + e^(−a))
Linear function:
• Does not introduce non-linearity
• Does not bound the output
-> Not very interesting
Heaviside function:
• A little too harsh -> a smoother activation is preferable to extract valuable information
Sigmoid and tanh are commonly used (together with softmax).
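The same four activation functions, sketched with NumPy for reference:

```python
import numpy as np

# The four activation functions from the slide.
def heaviside(a):
    return np.where(a > 0, 1.0, 0.0)    # perceptron: hard threshold

def linear(a):
    return a                             # no non-linearity, unbounded

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))      # smooth, output in (0, 1)

def tanh(a):
    return np.tanh(a)                    # smooth, output in (-1, 1)
```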
19. Capacity of a neuron: how much can it do?
Sigmoid function: g(a) = 1 / (1 + exp(−a))
Output ∈ [0, 1]
h(x) = p(y = 1 | x)
Interpretation: the output is the probability of belonging to a given class (y = 0 or 1)
[Figure: decision boundary in the (x1, x2) plane]
A neuron can solve linearly separable problems
21. The XOR affair (1969)
Minsky and Papert (1969), Perceptrons: an introduction to computational geometry
XOR(x1, x2) is impossible with only two layers (input and output).
[Figure: the four XOR points in the (x1, x2) plane: (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0; the two classes cannot be separated by a single line]
OK with three layers: an intermediate layer builds a better representation (with AND functions).
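A sketch of this three-layer solution in Python with hand-picked weights (illustrative only; in practice the weights are learned): the hidden layer computes OR-like and AND-like features, which makes XOR linearly separable for the output unit.

```python
import numpy as np

# XOR with one hidden layer and hand-picked weights (illustrative):
# the hidden layer builds an intermediate representation (OR and AND)
# that the output neuron can separate with a single threshold.
def step(a):
    return (a > 0).astype(float)

def xor_net(x):
    W1 = np.array([[1.0, 1.0],    # hidden unit 1 ~ OR(x1, x2)
                   [1.0, 1.0]])   # hidden unit 2 ~ AND(x1, x2)
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    w2 = np.array([1.0, -1.0])    # output ~ OR and not AND == XOR
    b2 = -0.5
    return step(w2 @ h + b2)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_net(np.array(x, dtype=float)))
# -> 0, 1, 1, 0
```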
23. Towards a multiply distributed representation
Multiple-layer neural networks
Each layer is a distributed representation: the units are not mutually exclusive (neurons can all be activated simultaneously).
This is different from a partition of the input (where the input belongs to a specific cluster).
24. The treachery of images
The CAR concept
• An infinity of possible images!
• A high-level abstraction represented by pixels
• Many problems:
– Orientation
– Perspective
– Reflection
– Irrelevant background
25. A CAR detector
Build a CAR detector: decompose the problem
• What are the different shapes?
• How are they combined?
• Orientation?
• Perspective?
[Diagram: Pixels -> low-level abstraction -> intermediate-level abstraction -> … -> high-level abstraction -> Car]
26. Spectrum of machine learning tasks (Hinton’s view)
Statistics
• Low-dimensional data (<100 dimensions)
• Lots of noise in the data
• Little structure, and what there is can be captured by a rather simple model
Main challenge: separate the true structure from the noise
Artificial Intelligence
• High-dimensional data (>100 dimensions)
• Noise should not be a problem
• A huge amount of structure, very complicated
Main challenge: represent the complicated structure so that it can be learned
27. Training a NN / Learning
Training / learning is an optimization problem.
M examples with n features (x1, x2, …, xn)
Two-class {0, 1} classification
Prediction: 1 if f(x) = p(y = 1 | x) > 0.5, 0 otherwise
• Classification error is not a smooth function
• Better to optimize a smooth upper-bound substitute: the loss function
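As an illustration (one common choice of loss, not necessarily the exact one used in the talk): the cross-entropy loss is a smooth, differentiable substitute for the 0/1 classification error.

```python
import numpy as np

# The 0/1 classification error is piecewise constant in the parameters
# (zero gradient almost everywhere), while the cross-entropy loss is a
# smooth surrogate that can be optimized by gradient descent.
def classification_error(y, p):
    return np.mean((p > 0.5).astype(float) != y)

def cross_entropy(y, p):
    eps = 1e-12   # avoid log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

y = np.array([1, 0, 1])          # true labels
p = np.array([0.9, 0.2, 0.6])    # predicted probabilities f(x) = p(y=1|x)
print(classification_error(y, p), cross_entropy(y, p))
```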
28. Learning algorithm
Backpropagation algorithm
• Invented in 1969 (Bryson and Ho)
• Independently re-discovered in the mid-1980s by several groups
• 1989: First successful application to deep neural network (LeCun) – Recognition of hand-written digits
1. Initialize the parameters θ = (w, b)
2. For i = 1…M iterations (examples), for each training example (x(t), y(t)):
Δ = −∇θ l(f(x(t); θ), y(t)) − λ ∇θ Ω(θ)
θ = θ + αΔ
• The gradient tells us in which direction the loss function decreases fastest, i.e. how to change the parameters to reduce the loss.
• α: hyperparameter = learning rate
Important things: a good loss function, an initialization method, and an efficient way of computing the gradient many times (for each example!)
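A minimal sketch of this update for a single sigmoid neuron with cross-entropy loss and an L2 regularizer Ω(θ) = ||w||²/2, where the gradient has a simple closed form (a deep network obtains the same gradients layer by layer via backpropagation); the learning rate and λ below are arbitrary example values.

```python
import numpy as np

# One stochastic gradient step for a single sigmoid neuron with
# cross-entropy loss and L2 regularizer Omega(theta) = ||w||^2 / 2.
def sgd_step(w, b, x, y, alpha=0.1, lam=0.01):
    p = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))  # forward pass: f(x) = p(y=1|x)
    grad_w = (p - y) * x + lam * w                 # d loss/dw + lambda * d Omega/dw
    grad_b = (p - y)
    w = w - alpha * grad_w                         # theta <- theta + alpha*Delta, with Delta = -grad
    b = b - alpha * grad_b
    return w, b

w, b = np.zeros(2), 0.0
for x, y in [(np.array([0.0, 1.0]), 1), (np.array([1.0, 0.0]), 0)] * 100:
    w, b = sgd_step(w, b, x, y)
print(w, b)
```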
29. Training a NN / Learning
For each training example, do forward propagation -> get f(x)
Then backpropagate -> modify (w, b) for each layer
30. Many tricks for training a NN
• Mini-batch learning
• Regularization: the bias-variance tradeoff
• Variance: how much does the model vary with the data? reduced when λ ≫ 0
• Bias: how far away from the true model are we? λ ∼ 0
• Tuning the hyperparameter λ for better generalization: do not optimize too much
Early stopping
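A sketch of mini-batch training with early stopping on a validation set; `compute_gradient` and `validation_loss` are hypothetical placeholders for whichever model and loss are being trained.

```python
import numpy as np

# Mini-batch SGD with early stopping (sketch). `compute_gradient` and
# `validation_loss` are hypothetical callables supplied by the user.
def train(theta, X, y, X_val, y_val, compute_gradient, validation_loss,
          alpha=0.1, batch_size=32, max_epochs=100, patience=5):
    best_loss, best_theta, bad_epochs = np.inf, theta, 0
    for epoch in range(max_epochs):
        idx = np.random.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]          # one mini-batch
            theta = theta - alpha * compute_gradient(theta, X[batch], y[batch])
        loss = validation_loss(theta, X_val, y_val)
        if loss < best_loss:
            best_loss, best_theta, bad_epochs = loss, theta, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # early stopping: do not optimize too much
                break
    return best_theta
```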
32. Why is it so difficult?
Usually better to use only 1 layer! Why?
• Underfitting situation: a very difficult optimization problem; we would do better with a better optimization procedure.
• Saturated units -> vanishing gradient -> updates are difficult (close to 0)
• But saturation corresponds to the non-linearity of NNs, their interesting part
• Overfitting situation: too many layers -> too fancy a model
• Not enough data!!!! -> But with big data, things tend to improve
Remedies: better optimization, better initialization, and better regularization
33. 2006: The Breakthrough
Before 2006: training deep neural networks was unsuccessful!
(except for CNN)
2006: 3 seminal papers
• Hinton, Osindero, and Teh,
A Fast Learning Algorithm for Deep Belief Nets
Neural Computation, 2006
• Bengio, Lamblin, Popovici, Larochelle,
Greedy Layer-Wise Training of Deep Networks
Advances in neural information processing systems, 2007
• Ranzato, Poultney, Chopra, LeCun,
Efficient Learning of Sparse Representations with an Energy-Based Model
Advances in neural information processing systems, 2006
34. The main point: greedy learning
Find the good representation: do it using unsupervised training -> let the neural network learn by itself!
• Recognize the difference between a character and a random image -> try to understand instead of copying -> less overfitting and improved generalization
• Unsupervised pretraining: train layer by layer (greedy learning) -> local extraction of information -> the previous layer is seen as raw input representing features
• Each layer is able to find the most common features in the training inputs (more common than random).
Once a good representation has been found at each level, it can be used to initialize and successfully train a deep neural network with the usual supervised gradient-based optimization (backpropagation).
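A high-level sketch of the greedy procedure; `train_autoencoder`, `encode`, and `finetune_with_backprop` are hypothetical callables standing in for any concrete unsupervised learner (RBM, autoencoder, …) and supervised trainer.

```python
# High-level sketch of greedy layer-wise pretraining followed by supervised
# fine-tuning. The three callables are hypothetical, user-supplied helpers;
# they are not from any specific library.
def greedy_pretrain(X_unlabelled, layer_sizes, train_autoencoder, encode):
    layers, inputs = [], X_unlabelled
    for size in layer_sizes:
        layer = train_autoencoder(inputs, n_hidden=size)  # unsupervised: learn features
        layers.append(layer)
        inputs = encode(layer, inputs)   # previous layer's output becomes the new "raw input"
    return layers

def pretrain_then_finetune(X_unlabelled, X_labelled, y, layer_sizes,
                           train_autoencoder, encode, finetune_with_backprop):
    # 1. unsupervised pretraining gives a good initialization
    layers = greedy_pretrain(X_unlabelled, layer_sizes, train_autoencoder, encode)
    # 2. usual supervised gradient-based training (backpropagation) on labelled data
    return finetune_with_backprop(layers, X_labelled, y)
```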
37. Many unsupervised learning techniques
• Restricted Boltzmann machines
• Stacked denoising autoencoders
• Semi-supervised embeddings
• Stacked kernel PCA
• Stacked independent subspace analysis
• …
Partially solves the problem of unlabelled data
• Pre-train on unlabelled data
• Fine-tuning using labelled data (supervised learning)
38. Pretraining does help deep learning
Why does unsupervised pre-training help deep learning?
Erhan, Courville, Manzagol, Bengio, 2011
39. Google Brain
2012: Google’s Large Scale Deep Learning Experiments
• an artificial neural network
• computation spread across 16,000 CPUs
• models with more than 1 billion connections
40. The next steps
Deep learning is good for:
• Automatic speech recognition
• Image recognition
• Natural language processing
• How well can deep learning be adapted to distributed systems (Big Data)?
• Online learning?
• Application to other problems?
• Time series (consumption prediction)
• Scoring (churn prediction, marketing)
• Application to clustering
• How much more data?