Solving Large-Scale Machine Learning Problems
in a Distributed Way
Martin Takáč
Cognitive Systems Institute Group Speaker Series
June 9, 2016
Outline
1 Machine Learning - Examples and Algorithm
2 Distributed Computing
3 Learning Large-Scale Deep Neural Network (DNN)
Examples of Machine Learning
binary classification
classify a person as having cancer or not
decide which class an input image belongs to, e.g. car/person
spam detection/credit card fraud detection
multi-class classification
hand-written digits classification
speech understanding
face detection
product recommendation (collaborative filtering)
stock trading
. . . and many many others. . .
Support Vector Machines (SVM)
Exhaled breath analysis for lung cancer: predict whether a patient has cancer or not
(blue: healthy person, green: e.g. patient with lung cancer)
ImageNet - Large Scale Visual Recognition Challenge
Two main challenges
Object detection - 200 categories
Object localization - 1000 categories (over 1.2 million images for training)
The state-of-the-art solution method is the Deep Neural Network (DNN)
e.g. the input layer has the dimension of the input image
the output layer has dimension e.g. 1000 (the number of categories)
Deep Neural Network
we have to learn the weights between neurons (blue arrows)
the neural network defines a non-linear and non-convex function (of the
weights w) from input x to output y:
y = f(w; x)
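To make the notation concrete, here is a tiny sketch (not from the slides) of such a function f(w; x) with one hidden layer; the sizes and the tanh activation are arbitrary choices.

    import numpy as np

    def f(w, x):
        # w collects all the weights; here simply a pair of matrices (W1, W2)
        W1, W2 = w
        hidden = np.tanh(W1 @ x)   # first layer followed by a non-linearity
        return W2 @ hidden         # output layer

    rng = np.random.default_rng(0)
    w = (rng.normal(size=(5, 4)), rng.normal(size=(3, 5)))  # toy weights
    x = rng.normal(size=4)                                  # toy input
    y = f(w, x)                                             # y = f(w; x)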
Example - MNIST handwritten digits recognition
A good w could give us
f(w; [image of a handwritten digit]) = (0, 0, 0, 0.991, . . .)
f(w; [image of another handwritten digit]) = (0, 0, . . . , 0, 0.999)
Mathematical Formulation
Expected Loss Minimization
let (X, Y) be the distribution of input samples and their labels
we would like to find w such that
w* = arg min_w E_{(x,y)~(X,Y)} [ ℓ(f(w; x), y) ]
where ℓ is a loss function, e.g. ℓ(f(w; x), y) = ||f(w; x) − y||²
Impossible, as we do not know the distribution (X, Y)
Common approach: Empirical loss minimization:
we sample n points from (X, Y): {(x_i, y_i)}_{i=1}^n
we minimize the regularized empirical loss
w* = arg min_w (1/n) Σ_{i=1}^n ℓ(f(w; x_i), y_i) + (λ/2) ||w||²
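As a concrete (purely illustrative) example, the sketch below evaluates this regularized empirical loss for a linear model f(w; x) = wᵀx with the squared loss; the data and λ are made up.

    import numpy as np

    def empirical_loss(w, X, Y, lam):
        # (1/n) sum_i (f(w; x_i) - y_i)^2 + (lam/2) ||w||^2  with  f(w; x) = w . x
        residuals = X @ w - Y
        return np.mean(residuals ** 2) + 0.5 * lam * np.dot(w, w)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))   # n = 5 samples, 3 features
    Y = rng.normal(size=5)
    w = np.zeros(3)
    print(empirical_loss(w, X, Y, lam=0.1))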
Stochastic Gradient Descent (SGD) Algorithm
How can we solve
min_w F(w) := (1/n) Σ_{i=1}^n ℓ(f(w; x_i), y_i) + (λ/2) ||w||²
1 we can use an iterative algorithm
2 we start with some initial w
3 we compute g = ∇F(w)
4 we take a new iterate w ← w − αg
5 if w is still not good enough, go to step 3
if n is very large, computing g can take a while . . . even a few hours/days
Trick:
choose i ∈ {1, . . . , n} randomly
define g_i = ∇_w [ ℓ(f(w; x_i), y_i) + (λ/2) ||w||² ]
use g_i instead of g in the algorithm (step 4)
Note: E[g_i] = g, so in expectation the "direction" the algorithm moves in is the
same as with the true gradient, but g_i can be computed n times faster!
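A minimal sketch of this loop for the same toy linear model as above; the step size α, iteration count, and stopping rule are arbitrary placeholder choices, not values from the talk.

    import numpy as np

    def sgd(X, Y, lam=0.1, alpha=0.01, iters=1000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)                       # step 2: some initial w
        for _ in range(iters):                # steps 3-5, repeated
            i = rng.integers(n)               # pick one sample at random
            # g_i = grad of [ l(f(w; x_i), y_i) + (lam/2)||w||^2 ] for squared loss
            g_i = 2.0 * (X[i] @ w - Y[i]) * X[i] + lam * w
            w -= alpha * g_i                  # step 4, with g_i in place of g
        return w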
Outline
1 Machine Learning - Examples and Algorithm
2 Distributed Computing
3 Learning Large-Scale Deep Neural Network (DNN)
The Architecture
What if the size of the data {(x_i, y_i)} exceeds the memory of a single
computing node?
each node can store a portion of the data {(x_i, y_i)}
each node is connected to the computer network
nodes can communicate with any other node (possibly over 1 or more switches)
Fact: every communication is much more expensive than accessing local data
(it can be even 100,000 times slower).
Outline
1 Machine Learning - Examples and Algorithm
2 Distributed Computing
3 Learning Large-Scale Deep Neural Network (DNN)
Using SGD for DNN in a Distributed Way
assume that the size of the data or the size of the weights (or both) is so big that
we cannot store them on one machine
. . . or we can store them, but it takes too long to compute anything . . .
SGD: we need to compute ∇_w ℓ(f(w; x_i), y_i)
The DNN has a nice structure
∇_w ℓ(f(w; x_i), y_i) can be computed by the backpropagation procedure (this is
nothing else than automatic differentiation)
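As an illustration of what backpropagation delivers for one sample, a hedged sketch for a one-hidden-layer network with tanh activation and squared loss; the shapes and names (W1, W2) are made up, not taken from the slides.

    import numpy as np

    def loss_and_grads(W1, W2, x, y):
        # forward pass: y_hat = f(w; x)
        a1 = np.tanh(W1 @ x)
        y_hat = W2 @ a1
        loss = np.sum((y_hat - y) ** 2)
        # backward pass: chain rule, i.e. automatic differentiation done by hand
        d_yhat = 2.0 * (y_hat - y)        # d loss / d y_hat
        dW2 = np.outer(d_yhat, a1)        # d loss / d W2
        d_a1 = W2.T @ d_yhat              # d loss / d a1
        d_z1 = d_a1 * (1.0 - a1 ** 2)     # back through tanh
        dW1 = np.outer(d_z1, x)           # d loss / d W1
        return loss, dW1, dW2

    rng = np.random.default_rng(0)        # toy shapes: 4 inputs, 5 hidden, 3 outputs
    loss, dW1, dW2 = loss_and_grads(rng.normal(size=(5, 4)), rng.normal(size=(3, 5)),
                                    rng.normal(size=4), rng.normal(size=3))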
Why is SGD a Bad Distributed Algorithm
it samples only 1 data point and computes g_i (this is very fast)
then w is updated
each update of w requires a communication (cost c seconds)
hence one iteration is suddenly much slower than if we ran SGD on
one computer
The trick: Mini-batch SGD
In each iteration
1 Choose S ⊂ {1, 2, . . . , n} randomly with |S| = b
2 Use g_b = (1/b) Σ_{i∈S} g_i instead of just g_i
Cost of one epoch
number of MPI calls per epoch: n/b
amount of data sent over the network: (n/b) × log(N) × sizeof(w)
if we increase b → n we minimize both the amount of data and the number of
communications per epoch! Caveat: there is no free lunch!
Very large b means slower convergence!
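A minimal sketch of the mini-batch variant for the same toy model; the batch size b and the other hyper-parameters are placeholders.

    import numpy as np

    def minibatch_sgd(X, Y, b=32, lam=0.1, alpha=0.01, epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for _ in range(n // b):                       # n/b updates per epoch
                S = rng.choice(n, size=b, replace=False)  # random mini-batch S, |S| = b
                residuals = X[S] @ w - Y[S]
                # g_b = (1/b) sum_{i in S} g_i  (squared loss + L2 term)
                g_b = 2.0 * X[S].T @ residuals / b + lam * w
                w -= alpha * g_b
        return w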
Model Parallelism
Model parallelism: we partition the weights w across many nodes; every node
has all data points (but maybe only a few of their features)
[Figure: forward and backward propagation with each layer split between Node 1 and Node 2; all samples are visible to both nodes, which exchange activations during the forward pass and deltas during the backward pass.]
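A conceptual sketch of the model-parallel split, with two "nodes" emulated inside one process; in a real system the concatenation step is the "exchange activations" communication.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=16)          # one input sample (every node sees all samples)
    W = rng.normal(size=(8, 16))     # full hidden layer with 8 units

    # model parallelism: node 1 owns hidden units 0-3, node 2 owns units 4-7
    W_node1, W_node2 = W[:4], W[4:]
    a_node1 = np.tanh(W_node1 @ x)   # partial activation computed on node 1
    a_node2 = np.tanh(W_node2 @ x)   # partial activation computed on node 2

    # "exchange activations": every node needs the full hidden vector
    a_full = np.concatenate([a_node1, a_node2])
    assert np.allclose(a_full, np.tanh(W @ x))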
Data Parallelism
Data parallelism: we partition the data samples across many nodes; each node
keeps its own copy of w
[Figure: forward and backward propagation with the samples split between Node 1 and Node 2; each node holds the full network and the nodes exchange gradients.]
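A hedged sketch of the data-parallel gradient exchange using mpi4py (assuming it is available and the script is launched with mpirun); each rank would own its own shard X_local, Y_local of the data, here just generated randomly per rank for illustration.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    d = 8
    rng = np.random.default_rng(rank)          # each rank: its own data shard
    X_local = rng.normal(size=(100, d))
    Y_local = rng.normal(size=100)
    w = np.zeros(d)                            # every node keeps a copy of w

    for _ in range(100):
        # local gradient on this rank's shard (squared loss + L2 term)
        g_local = 2.0 * X_local.T @ (X_local @ w - Y_local) / len(Y_local) + 0.1 * w
        g_global = np.empty_like(g_local)
        comm.Allreduce(g_local, g_global, op=MPI.SUM)   # "exchange gradient"
        w -= 0.01 * (g_global / size)          # identical update on every node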
Large-Scale Deep Neural Network¹
¹Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas
Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey: Distributed Deep Learning Using
Synchronous Stochastic Gradient Descent, arXiv:1602.06709
There is almost no speedup for large b
The Dilemma
large b allows the algorithm to be run efficiently on a large computer cluster (more
nodes)
very large b doesn't reduce the number of iterations, but each iteration is more
expensive!
The Trick: Do not use just the gradient, but also the Hessian (Martens 2010)
Caveat: the Hessian matrix can be very large, e.g. the dimension of the weights for
the TIMIT dataset is almost 1.5M, hence storing the Hessian would need almost
10TB (1.5 · 10⁶ × 1.5 · 10⁶ entries at 4 bytes each ≈ 9 TB).
The Trick:
We can use a Hessian-free approach (we only need to be able to compute
Hessian-vector products)
Algorithm:
w ← w − α [∇²F(w)]⁻¹ ∇F(w)
Non-convexity
We want to minimize
min_w F(w)
∇²F(w) is NOT positive semi-definite at every w!
Computing Step
recall the algorithm
w ← w − α [∇²F(w)]⁻¹ ∇F(w)
we need to compute p = [∇²F(w)]⁻¹ ∇F(w), i.e. to solve
∇²F(w) p = ∇F(w)    (1)
we can use a few iterations of the CG method to solve it
(CG assumes that ∇²F(w) ≻ 0)
In our case this may not be true, hence it is suggested to stop CG sooner if it
is detected during CG that ∇²F(w) is indefinite
We can use a Bi-CG algorithm to solve (1) and modify the algorithm² as
follows
w ← w − α · (p if pᵀ∇F(w) > 0, −p otherwise)
PS: we use just b samples to estimate ∇²F(w)
²Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy and Martin Takáč: Large Scale Distributed
Hessian-Free Optimization for Deep Neural Network, arXiv:1606.00511, 2016.
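For illustration, a hedged sketch of a truncated CG solve that uses only Hessian-vector products and stops early when negative curvature (an indefinite ∇²F) is detected; the hvp callable, tolerance, and iteration cap are placeholders, not the exact procedure of the cited paper.

    import numpy as np

    def truncated_cg(hvp, g, max_iter=50, tol=1e-6):
        # approximately solve H p = g, given only a function v -> H v
        p = np.zeros_like(g)
        r = g.copy()                  # residual g - H p (p starts at 0)
        d = r.copy()                  # search direction
        rs = r @ r
        for _ in range(max_iter):
            Hd = hvp(d)
            dHd = d @ Hd
            if dHd <= 0:              # negative curvature detected: H is indefinite
                break
            step = rs / dHd
            p += step * d
            r -= step * Hd
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            d = r + (rs_new / rs) * d
            rs = rs_new
        return p

    # toy usage: the Hessian is only accessed through matrix-vector products
    H = np.array([[3.0, 1.0], [1.0, 2.0]])
    p = truncated_cg(lambda v: H @ v, np.array([1.0, 1.0]))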
Saddle Point
Gradient descent slows down around saddle points. Second-order methods can help
a lot to avoid that.
[Figure: MNIST, 4 layers. Three panels show training error vs. number of iterations for SGD (b = 64, 128) against ggn-cg, hess-bicgstab, hess-cg and hybrid-cg with b = 512, 1024 and 2048; a fourth panel shows the number of iterations vs. mini-batch size for the second-order variants.]
[Figure: TIMIT, T = 18. Run time per iteration (split into Gradient, CG and Linesearch) vs. log2(number of nodes) for b = 512, 1024, 4096 and 8192.]
[Figure: TIMIT, T = 18. Run time per one line search vs. log2(number of nodes) for b = 512, 1024, 4096 and 8192.]
Learning Artistic Style by Deep Neural Network³
³Joint work with Jiawei Zhang, based on Leon A. Gatys, Alexander S. Ecker, Matthias Bethge: A Neural Algorithm of Artistic Style, arXiv:1508.06576
References
1 Albert Berahas, Jorge Nocedal and Martin Takáč: A Multi-Batch L-BFGS Method for Machine Learning, arXiv:1605.06049, 2016.
2 Xi He, Dheevatsa Mudigere, Mikhail Smelyanskiy and Martin Takáč: Large Scale Distributed Hessian-Free Optimization for Deep Neural Network, arXiv:1606.00511, 2016.
3 Chenxin Ma and Martin Takáč: Partitioning Data on Features or Samples in Communication-Efficient Distributed Optimization?, OptML@NIPS 2015.
4 Chenxin Ma, Virginia Smith, Martin Jaggi, Michael I. Jordan, Peter Richtárik and Martin Takáč: Adding vs. Averaging in Distributed Primal-Dual Optimization, ICML 2015.
5 Martin Jaggi, Virginia Smith, Martin Takáč, Jonathan Terhorst, Thomas Hofmann and Michael I. Jordan: Communication-Efficient Distributed Dual Coordinate Ascent, NIPS 2014.
6 Richtárik, P. and Takáč, M.: Distributed coordinate descent method for learning with big data, Journal of Machine Learning Research (to appear), 2016.
7 Richtárik, P. and Takáč, M.: On optimal probabilities in stochastic coordinate descent methods, Optimization Letters, 2015.
8 Richtárik, P. and Takáč, M.: Parallel coordinate descent methods for big data optimization, Mathematical Programming, 2015.
9 Richtárik, P. and Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, Mathematical Programming, 2012.
10 Takáč, M., Bijral, A., Richtárik, P. and Srebro, N.: Mini-batch primal and dual methods for SVMs, ICML 2013.
11 Qu, Z., Richtárik, P. and Zhang, T.: Randomized dual coordinate ascent with arbitrary sampling, arXiv:1411.5873, 2014.
12 Qu, Z., Richtárik, P., Takáč, M. and Fercoq, O.: SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization, arXiv:1502.02268, 2015.
13 Tappenden, R., Takáč, M. and Richtárik, P.: On the Complexity of Parallel Coordinate Descent, arXiv:1503.03033, 2015.