More information in our Deep Learning webinar: http://www.slideshare.net/0xdata/h2-o-deeplearningarnocandel052114
Latest slide deck: http://www.slideshare.net/0xdata/h2o-distributed-deep-learning-by-arno-candel-071614
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
2. Who am I?
PhD in Computational Physics, 2005, from ETH Zurich, Switzerland
!
6 years at SLAC - Accelerator Physics Modeling
2 years at Skytree, Inc - Machine Learning
3 months at 0xdata/H2O - Machine Learning
!
10+ years in HPC, C++, MPI, Supercomputing
Arno Candel
4. Distributed in-memory math platform
➔ GLM, GBM, RF, K-Means, PCA, Deep Learning
Easy-to-use SDK / API
➔ Java, R, Scala, Python, JSON, Browser-based GUI
!
Businesses can use ALL of their data (with or without Hadoop)
➔ Modeling without Sampling
Big Data + Better Algorithms
➔ Better Predictions
H2O Open Source in-memory
Prediction Engine for Big Data
5. About H2O (aka 0xdata)
Pure Java, Apache v2 Open Source
Join the community: www.h2o.ai/community!
6. H2O with or without Hadoop
[Diagram: three deployment modes — Standalone, over YARN, on Hadoop MRv1 — all reading data from HDFS, with client APIs in R, Java, Scala, JSON, and Python]
8. Wikipedia:
Deep learning is a set of algorithms in machine learning
that attempt to model high-level abstractions in data by
using architectures composed of multiple non-linear
transformations.
!
Facebook DeepFace (LeCun): “Almost as good as humans at recognising faces”
!
Google Brain (Andrew Ng, Jeff Dean & Geoffrey Hinton)
!
FBI FACE: $1 billion face recognition project
What is Deep Learning?
Example: input data (facial image) -> prediction (person's ID)
10. 1970s multi-layer feed-forward Neural Network
(supervised learning with back-propagation)
!
+ distributed processing for big data
(H2O in-memory MapReduce paradigm on distributed data)
!
+ multi-threaded speedup
(H2O Fork/Join worker threads update the model asynchronously)
!
+ smart algorithms for accuracy
(weight initialization, adaptive learning, momentum, dropout, regularization)
!
= Top-notch prediction engine!
Deep Learning in H2O
11. “fully connected” directed graph of neurons
[Figure: example network. Inputs: age, income, employment. Outputs: married, not married. Layers: input layer, hidden layer 1, hidden layer 2, output layer; #neurons: 3, 4, 3, 2; #connections: 3x4, 4x3, 3x2. Arrows show the information flow from input neurons through hidden neurons to output neurons. The sketch after this slide verifies these counts.]
Example Neural Network
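A minimal NumPy sketch (illustrative only, not H2O code) of the layer sizes above, confirming the 3x4, 4x3, 3x2 connection counts and the total number of parameters:

```python
import numpy as np

layer_sizes = [3, 4, 3, 2]          # input, hidden 1, hidden 2, output

# one weight matrix and one bias vector per pair of consecutive layers
weights = [np.zeros((n_in, n_out)) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases  = [np.zeros(n_out) for n_out in layer_sizes[1:]]

for W in weights:
    print(W.shape)                   # (3, 4), (4, 3), (3, 2) -> 3x4, 4x3, 3x2 connections
n_params = sum(W.size for W in weights) + sum(b.size for b in biases)
print(n_params)                      # 12 + 12 + 6 weights plus 4 + 3 + 2 biases = 39
```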
12. Prediction: Forward Propagation
“neurons activate each other via weighted sums”
Inputs x_i (age, income, employment) propagate through two hidden layers to the output classes (married, not married):
y_j = tanh(sum_i(x_i*u_ij) + b_j)
z_k = tanh(sum_j(y_j*v_jk) + c_k)
p_l = softmax(sum_k(z_k*w_kl) + d_l)
softmax(x_k) = exp(x_k) / sum_k(exp(x_k))
p_l: per-class probabilities, sum(p_l) = 1
u_ij, v_jk, w_kl: weights; b_j, c_k, d_l: bias values (independent of the inputs)
activation function: tanh; alternative: x -> max(0,x) “rectifier”
p_l is a non-linear function of x_i: can approximate ANY function with enough layers!
(A NumPy sketch of this forward pass follows below.)
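A small NumPy sketch of the forward pass above (tanh hidden layers, softmax output). The weights here are random placeholders, not a trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([0.3, -1.2, 0.7])                   # standardized age, income, employment
U, b = rng.normal(size=(3, 4)), np.zeros(4)      # input -> hidden 1
V, c = rng.normal(size=(4, 3)), np.zeros(3)      # hidden 1 -> hidden 2
W, d = rng.normal(size=(3, 2)), np.zeros(2)      # hidden 2 -> output

y = np.tanh(x @ U + b)               # y_j = tanh(sum_i(x_i*u_ij) + b_j)
z = np.tanh(y @ V + c)               # z_k = tanh(sum_j(y_j*v_jk) + c_k)
p = softmax(z @ W + d)               # p_l = softmax(sum_k(z_k*w_kl) + d_l)
print(p, p.sum())                    # per-class probabilities for {married, not married}, sum = 1
```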
13. Data Preparation & Initialization
Neural networks are sensitive to numerical noise and operate best in the linear regime (not saturated).
Standardize each input x_i: mean = 0, stddev = 1
“Horizontalize” (one-hot encode) categorical variables, e.g.
{full-time, part-time, none, self-employed}
->
{0,1,0} = part-time, {0,0,0} = self-employed
Poor man’s initialization: random weights
Better: uniform distribution in +/- sqrt(6/(#units + #units_previous_layer))
(A short NumPy sketch of these steps follows below.)
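A NumPy sketch of these preparation steps, using a made-up numeric column and the four-level employment factor from the slide:

```python
import numpy as np

# standardize a numeric input column: mean 0, stddev 1
income = np.array([30_000., 55_000., 80_000., 42_000.])
income_std = (income - income.mean()) / income.std()

# "horizontalize" a categorical column: one indicator per level, minus one
levels = ["full-time", "part-time", "none"]            # "self-employed" = all zeros
def encode(value):
    return [1 if value == lvl else 0 for lvl in levels]
print(encode("part-time"), encode("self-employed"))    # [0, 1, 0] [0, 0, 0]

# weight initialization: uniform in +/- sqrt(6 / (#units + #units_previous_layer))
def init_weights(n_in, n_out, rng=np.random.default_rng(0)):
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))
print(init_weights(3, 4))
```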
14. Stochastic Gradient Descent
For each training row, we make a prediction and compare it with the actual label (supervised training), e.g. for the classes (married, not married):
predicted: (0.8, 0.2)   actual: (1, 0)
Mean Square Error = (0.2^2 + 0.2^2)/2 “penalize differences per-class”
Cross-entropy = -log(0.8) “strongly penalize non-1-ness”
Objective: minimize prediction error (MSE or cross-entropy)
SGD: improve weights and biases for EACH training row
w <- w - rate * ∂E/∂w
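Plugging the slide's numbers into both error measures, plus a single SGD step (illustrative NumPy only; the gradient value is arbitrary):

```python
import numpy as np

predicted = np.array([0.8, 0.2])     # p(married), p(not married)
actual    = np.array([1.0, 0.0])

mse = np.mean((predicted - actual) ** 2)      # (0.2^2 + 0.2^2)/2 = 0.04
ce  = -np.log(predicted[actual == 1.0])[0]    # -log(0.8) ~= 0.223
print(mse, ce)

# one stochastic gradient descent step for a single weight
w, rate, dE_dw = 0.5, 0.01, -1.3              # dE/dw would come from backprop
w = w - rate * dE_dw                          # w <- w - rate * dE/dw
print(w)
```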
15. Backward Propagation
How do we compute ∂E/∂w_i for w_i <- w_i - rate * ∂E/∂w_i ?
Naive: for every i, evaluate E twice at (w1,…,wi±∆,…,wN)… Slow!
Backprop: compute ∂E/∂w_i via the chain rule, going backwards through
net = sum_i(w_i*x_i) + b
y = activation(net)
E = error(y)
∂E/∂w_i = ∂E/∂y * ∂y/∂net * ∂net/∂w_i
        = ∂(error(y))/∂y * ∂(activation(net))/∂net * x_i
(see the gradient-check sketch below)
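A sketch for one neuron with a tanh activation and squared error, checking the chain-rule gradient against the naive finite-difference approach mentioned above (all numbers are made up):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])       # inputs
w = np.array([0.1, 0.4, -0.3])       # weights
b, target = 0.2, 1.0

def forward(w):
    net = np.dot(w, x) + b           # net = sum_i(w_i*x_i) + b
    y = np.tanh(net)                 # y = activation(net)
    return 0.5 * (y - target) ** 2, y, net   # E = error(y)

E, y, net = forward(w)
# chain rule: dE/dw_i = dE/dy * dy/dnet * dnet/dw_i
grad = (y - target) * (1.0 - np.tanh(net) ** 2) * x

# naive alternative: two extra evaluations of E per weight (slow for big networks)
eps = 1e-6
grad_naive = np.array([(forward(w + eps * np.eye(3)[i])[0] -
                        forward(w - eps * np.eye(3)[i])[0]) / (2 * eps) for i in range(3)])
print(grad, grad_naive)              # the two gradients should agree closely
```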
16. H2O Deep Learning Architecture
[Diagram: multi-node cluster; each node/JVM runs an HTTPD and a shard of the distributed K-V store. Communication between nodes/JVMs is synchronous, between threads asynchronous.]
initial weights and biases w
map: each node trains a copy of the weights and biases on (some* or all of) its local data, using asynchronous Fork/Join worker threads
reduce: average the weights and biases from all nodes, w* = (w1+w2+w3+w4)/4
updated weights and biases w* are written to the H2O atomic in-memory K-V store
Keep iterating over the data (“epochs”), score from time to time
Query & display the model via JSON, WWW
*mini-batch: number of total rows per iteration, can be less than 1 epoch
(A minimal averaging sketch follows below.)
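A minimal sketch of the reduce step: each node's locally trained weight copy is averaged element-wise. This is plain NumPy for illustration; the real implementation lives in H2O's distributed K-V store:

```python
import numpy as np

# weight copies trained independently on 4 nodes' local data (map step)
rng = np.random.default_rng(1)
node_weights = [rng.normal(size=(3, 4)) for _ in range(4)]   # w1, w2, w3, w4

# reduce step: w* = (w1 + w2 + w3 + w4) / 4
w_star = sum(node_weights) / len(node_weights)
print(w_star.shape)

# w* is then redistributed as the starting point of the next iteration / epoch
```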
17. “Secret” Sauce to Higher Accuracy
Momentum training:
keep changing weights and biases (even if there’s no error)
“find other local minima, and go faster along valleys”
Adaptive learning rate - ADADELTA (Google):
automatically set learning rate for each neuron based on its
training history, combines annealing and momentum features
Learning rate annealing:
rate r = r0 / (1 + β*N), N = number of training samples seen
“dig deeper into local minimum”
Grid Search and Checkpointing:
Run a grid search over multiple hyper-parameters,
then continue training the best model
L1/L2/Dropout/MaxSumWeights regularization:
L1: penalizes non-zero weights, L2: penalizes large weights
Dropout: randomly ignore certain inputs “train exp. many models at once”
MaxSumWeights: Reduce all incoming weights if the sum > max value
“regularization avoids overtraining and improves generalization error”
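For reference, a per-weight ADADELTA update as commonly described in the literature (Zeiler, 2012); H2O's actual implementation may differ in details, and rho/epsilon here correspond to the parameters mentioned in the tips later:

```python
import numpy as np

def adadelta_step(w, grad, state, rho=0.95, epsilon=1e-6):
    """One ADADELTA update; state holds running averages of g^2 and dx^2."""
    state["Eg2"]  = rho * state["Eg2"]  + (1 - rho) * grad ** 2
    dx = -np.sqrt(state["Edx2"] + epsilon) / np.sqrt(state["Eg2"] + epsilon) * grad
    state["Edx2"] = rho * state["Edx2"] + (1 - rho) * dx ** 2
    return w + dx, state

w = np.array([0.5, -0.2])
state = {"Eg2": np.zeros_like(w), "Edx2": np.zeros_like(w)}
grad = np.array([0.1, -0.3])          # would come from backprop
w, state = adadelta_step(w, grad, state)
print(w)
```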
18. MNIST: digits classification
Train: 60,000 rows, 784 integer columns, 10 classes
Test: 10,000 rows, 784 integer columns, 10 classes
MNIST: Digitized handwritten digits database (Yann LeCun)
Data: 28x28=784 pixels with values in 0…255 (gray-scale)
One of the most popular multi-class classification problems
Without distortions or convolutions
(which help), the best-ever published
error rate on test set: 0.83% (Microsoft)
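A hedged sketch of running H2O Deep Learning on MNIST with the current h2o Python package (the API at the time of this talk differed); the file paths and layer sizes are placeholders, not the exact configuration behind the quoted results:

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("mnist_train.csv.gz")   # 60,000 x 785 (784 pixels + label)
test  = h2o.import_file("mnist_test.csv.gz")    # 10,000 x 785
y = train.columns[-1]                           # label column
x = train.columns[:-1]                          # 784 pixel columns
train[y] = train[y].asfactor()                  # 10-class classification
test[y]  = test[y].asfactor()

model = H2ODeepLearningEstimator(hidden=[128, 128],
                                 activation="RectifierWithDropout",
                                 epochs=10)
model.train(x=x, y=y, training_frame=train, validation_frame=test)
print(model.model_performance(test))
```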
19. most frequent mistakes:
confuse 4 with 6 and 9, and 7 with 2
test set error: 1.5% after 40 epochs
1.02% after 400 epochs
0.95% after 4000 epochs
H2O Deep Learning on MNIST:
0.95% test set error (so far)
1 node
21. Live Demo: Cancer Prediction
Interactive ROC curve with real-time updates
22. Live Demo: Cancer Prediction
0% training error with only 322 model parameters, in seconds!
23. Live Demo: Grid Search Regression
Doing a grid search to find good hyper-parameters to predict AGE from the 7 other features,
then continue training the best model:
5 hidden layers of 50 tanh neurons each, rho = 0.99, epsilon = 1e-10
MSE < 1 for test set
ages in 44…79
Regression: 1 linear output neuron
24. Live Demo: ebay Text Classification
Users enter a description when selling an item
Task: Predict the type of item
Data prep: Binary word vector 0,0,1,0,0,0,0,0,1,0,0,0,1,…,0
H2O parses SVMLight sparse format: label 3:1 9:1 13:1 …
!
“Small” sample dataset on jewelry and watches:
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
!
H2O compressed columnar in-memory store:
Only needs 60MB to store 5 billion entries (never inflated)
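A small Python sketch of the data prep step: turning each item description into a binary word vector written as an SVMLight sparse line (the vocabulary and class labels are made up for illustration):

```python
# hypothetical vocabulary: word -> 1-based column index (SVMLight indices start at 1)
vocab = {"gold": 1, "ring": 2, "quartz": 3, "watch": 4, "silver": 5}
label_ids = {"jewelry": 1, "watches": 2}

def to_svmlight(label, description):
    cols = sorted({vocab[w] for w in description.lower().split() if w in vocab})
    return " ".join([str(label_ids[label])] + [f"{c}:1" for c in cols])

print(to_svmlight("watches", "silver quartz watch"))   # "2 3:1 4:1 5:1"
print(to_svmlight("jewelry", "gold ring"))             # "1 1:1 2:1"
```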
25. Live Demo: ebay Text Classification
Work in progress, shown results are for illustration only!
Default parameters, no tuning, 4 nodes (16-cores each)
Train: 578,361 rows, 8,647 cols, 467 classes
Test: 64,263 rows, 8,647 cols, 143 classes
26. Tips for H2O Deep Learning
!
General:
More layers: more complex functions (non-linearity)
More neurons per layer: detect finer structure in data
More regularization: less overfitting (better validation error)
!
Do a grid search to get a feel for convergence, then continue training.
Try Tanh first. For Rectifier, try max_w2 = 50 and/or L1=1e-5.
Try TanhDropout or RectifierDropout with test/validation set after
finding good parameters for convergence on training set.
Distributed: smaller mini-batch means more communication and slower training, but higher accuracy.
With ADADELTA: Try epsilon = 1e-4,1e-6,1e-8,1e-10, rho = 0.9,0.95,0.99
Without ADADELTA: Try rate = 1e-4…1e-2, rate_annealing = 1e-5…1e-8
Try momentum_start = 0.5, momentum_stable = 0.99,
momentum_ramp = 1/rate_annealing
Try balance_classes = true for imbalanced classes.
Try force_load_balance for small datasets.
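The same tips expressed as constructor arguments, assuming the current h2o Python package (parameter and activation names in the 2014-era API differed slightly); the values are the starting points suggested above, not tuned settings:

```python
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# with ADADELTA (adaptive_rate=True is the default)
dl_adaptive = H2ODeepLearningEstimator(activation="TanhWithDropout",
                                       hidden=[50, 50, 50],
                                       l1=1e-5, max_w2=50,
                                       rho=0.99, epsilon=1e-8,
                                       balance_classes=True,
                                       force_load_balance=True)

# without ADADELTA: manual rate, annealing, and momentum schedule
dl_manual = H2ODeepLearningEstimator(activation="Tanh",
                                     hidden=[50, 50, 50],
                                     adaptive_rate=False,
                                     rate=1e-3, rate_annealing=1e-6,
                                     momentum_start=0.5,
                                     momentum_stable=0.99,
                                     momentum_ramp=1e6)   # ~ 1/rate_annealing
```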
27. Summary
H2O is a distributed in-memory math platform that
allows fast prototyping in Java, R, Scala and Python.
!
H2O enables the development of enterprise-quality
blazing fast machine learning applications.
!
H2O Deep Learning is distributed, easy to use, and
early results compete with the world’s best.
!
Deep Learning makes better predictions!
!
Try it yourself and join our next meetup!
git clone https://github.com/0xdata/h2o