Large-scale Machine Learning: Deep, Distributed and Multi-Dimensional:
Modern machine learning relies on deep neural network architectures that yield state-of-the-art performance in domains such as computer vision, natural language processing and speech recognition. As data and models scale, multiple processing units become necessary for both training and inference. Apache MXNet is an open-source framework developed for distributed deep learning. I will describe its lightweight hierarchical parameter server architecture, which results in high efficiency in distributed settings.
Pushing the current boundaries of deep learning requires using multiple dimensions and modalities. These can be encoded into tensors, which are natural extensions of matrices. We present new deep learning architectures that preserve the multi-dimensional information in data end-to-end. We show that tensor contraction and tensor regression layers are an effective replacement for fully connected layers in deep learning architectures, yielding significant space savings with negligible performance degradation. These functionalities are available in the TensorLy package with an MXNet backend for large-scale efficient learning.
Bio: Anima Anandkumar is a principal scientist at Amazon Web Services and a Bren Professor in the CMS department at Caltech. Her research interests are in the areas of large-scale machine learning, non-convex optimization and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms. She is the recipient of several awards, such as the Alfred P. Sloan Fellowship, the Microsoft Faculty Fellowship, the Google Research Award, ARO and AFOSR Young Investigator Awards, the NSF CAREER Award, the Early Career Excellence in Research Award at UCI, the Best Thesis Award from the ACM SIGMETRICS society, the IBM Fran Allen PhD Fellowship, and several best-paper awards. She has been featured in a number of forums such as YourStory, the Quora ML session, O'Reilly Media, and so on. She received her B.Tech in Electrical Engineering from IIT Madras in 2004 and her PhD from Cornell University in 2009. She was a postdoctoral researcher at MIT from 2009 to 2010, an assistant professor at UC Irvine between 2010 and 2016, and a visiting researcher at Microsoft Research New England in 2012 and 2014.
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Anima Anandkumar, Principal Scientist, Amazon Web Services, and Endowed Professor, Caltech, at MLconf SF 2017
1. Learning at Scale:
Deep, Distributed and Multi-dimensional
Anima Anandkumar
Amazon AI & Caltech
2. Deep learning has significantly improved many applications across multiple domains
[Figure: the "deep learning" search trend over the past 10 years, with application areas surrounding deep learning: image understanding, speech recognition, natural language processing, autonomy]
3. Image Classification
[Figure: feature maps from Layer 1 and Layer 2 through to the Output layer]
Multilevel feature extraction, from raw pixels to semantic meanings
Exploit spatial information with convolution layers (see the sketch below)
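For concreteness, a minimal sketch of such convolution layers in MXNet's symbolic API (the kernel and filter sizes here are illustrative, not taken from the slide):

import mxnet as mx
data = mx.symbol.Variable('data')
conv = mx.symbol.Convolution(data=data, kernel=(3, 3), num_filter=32)   # extract spatial features
act = mx.symbol.Activation(data=conv, act_type='relu')
pool = mx.symbol.Pooling(data=act, kernel=(2, 2), stride=(2, 2), pool_type='max')  # downsample feature maps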
4. Image Classification
§ Hard to define the network
§ the definition of the Inception network is >1k lines of code in Caffe
§ A single image requires billions of floating-point operations
§ Intel i7: ~500 GFLOPS
§ Nvidia Titan X: ~5 TFLOPS
§ Memory consumption grows linearly with the number of layers
State-of-the-art networks have tens to hundreds of layers
8. Auto Parallelization
Write serial programs; MXNet runs them in parallel
>>> import mxnet as mx
>>> A = mx.nd.ones((2, 2)) * 2
>>> C = A + 2
>>> B = A + 1
>>> D = B * C
>>> D.wait_to_read()
[Dependency graph: A = 2; C = A + 2 and B = A + 1 depend only on A, so the engine runs them in parallel; D = B ⨉ C waits for both]
14. Back-end System
✧ Optimization
✓ Memory optimization
✓ Operator fusion
✧ Scheduling
✓ Auto-parallelization
[Figure: computation graphs built by the back-end. The imperative snippet yields nodes a, b, ⨉, +, 1 feeding c; the symbolic snippet yields a graph with weight, bias, fullc, and softmax nodes]
Back-end
import mxnet as mx
a = mx.nd.zeros((100, 50))
b = mx.nd.ones((100, 50))
c = a * b
c += 1
import mxnet as mx
net = mx.symbol.Variable('data')
net = mx.symbol.FullyConnected(data=net, num_hidden=128)
net = mx.symbol.SoftmaxOutput(data=net, name='softmax')
mod = mx.mod.Module(net)
mod.bind(data_shapes=[('data', (100, 50))],
         label_shapes=[('softmax_label', (100,))])
mod.init_params()
mod.forward(mx.io.DataBatch(data=[c], label=[mx.nd.zeros((100,))]))
mod.backward()
Front-end
15. In summary
✦ Symbolic
❖ efficient & portable
❖ but hard to use
✦ Imperative
❖ flexible
❖ may be slow
✦ Gluon
❖ imperative for developing
❖ symbolic for deploying (see the sketch below)
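A minimal Gluon sketch of this workflow, with illustrative layer sizes: the network is written and debugged imperatively, and a single hybridize() call switches it to symbolic execution for deployment.

import mxnet as mx
from mxnet.gluon import nn

net = nn.HybridSequential()
net.add(nn.Dense(128, activation='relu'),
        nn.Dense(10))
net.initialize()
net.hybridize()                                   # compile into a symbolic graph for speed and portability
out = net(mx.nd.random.uniform(shape=(4, 20)))    # first call triggers graph construction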
25. Speeding up Tensor Contractions
1 Tensor contractions are a core primitive of multilinear algebra.
2 BLAS 3: Unbounded compute intensity (no. of ops per I/O)
Consider single-index contractions: C = A B
e.g. C_mnp = A_mnk B_kp
[Figure: the contraction evaluated slice-by-slice, with slices A(:,1,:), A(:,2,:) of A combined with columns of B to form slices of C]
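A NumPy sketch (illustrative sizes) of the single-index contraction above, and its equivalent formulation as a single GEMM on an unfolded tensor:

import numpy as np
M, N, K, P = 4, 5, 6, 3
A = np.random.rand(M, N, K)
B = np.random.rand(K, P)
C = np.einsum('mnk,kp->mnp', A, B)                     # C_mnp = A_mnk B_kp
C_gemm = (A.reshape(M * N, K) @ B).reshape(M, N, P)    # C_(mn)p = A_(mn)k B_kp as one GEMM
assert np.allclose(C, C_gemm)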
26. Speeding up Tensor Contraction
Explicit permutation dominates, especially for small tensors.
Consider C_mnp = A_km B_pkn:
1 A_km → A_mk
2 B_pkn → B_kpn
3 C_mnp → C_mpn
4 C_m(pn) = A_mk B_k(pn)
5 C_mpn → C_mnp
[Figure: fraction of time spent in copies/transpositions as a function of n, for contractions requiring 1, 2, 3, and 6 transpositions. (Top) CPU. (Bottom) GPU.]
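A NumPy sketch (illustrative sizes) of the transpose-and-GEMM mapping in steps 1-5 above; the reshape of the permuted B forces exactly the explicit copy whose cost the figure measures:

import numpy as np
K, M, N, P = 6, 4, 5, 3
A = np.random.rand(K, M)                                  # A_km
B = np.random.rand(P, K, N)                               # B_pkn
Amk = A.T                                                 # step 1: A_km -> A_mk
Bkpn = B.transpose(1, 0, 2)                               # step 2: B_pkn -> B_kpn
Cmpn = (Amk @ Bkpn.reshape(K, P * N)).reshape(M, P, N)    # step 4: C_m(pn) = A_mk B_k(pn); reshape copies
Cmnp = Cmpn.transpose(0, 2, 1)                            # step 5: C_mpn -> C_mnp
assert np.allclose(Cmnp, np.einsum('km,pkn->mnp', A, B))  # matches the direct contraction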
27. Existing Primitives
GEMM
Suboptimal for many small matrices.
Pointer-to-Pointer BatchedGEMM
Available in MKL 11.3β and cuBLAS 4.1
C[p] = α op(A[p]) op(B[p]) + β C[p]
cublas<T>gemmBatched(cublasHandle_t handle,
cublasOperation_t transA, cublasOperation_t transB,
int M, int N, int K,
const T* alpha,
const T** A, int ldA,
const T** B, int ldB,
const T* beta,
T** C, int ldC,
int batchCount)
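In NumPy terms, the batched primitive computes the following (illustrative shapes; op(·) taken as the identity here):

import numpy as np
P, M, N, K = 8, 16, 16, 16
A = np.random.rand(P, M, K)
B = np.random.rand(P, K, N)
C = np.zeros((P, M, N))
alpha, beta = 1.0, 0.0
for p in range(P):                                       # one small GEMM per batch entry
    C[p] = alpha * A[p] @ B[p] + beta * C[p]
assert np.allclose(C, np.einsum('pmk,pkn->pmn', A, B))   # the whole batch in one expression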
28. Tensor Contraction with Extended BLAS Primitives
C_mn[p] = A_mk B_kn[p]
cublasDgemmStridedBatched(handle,
    CUBLAS_OP_N, CUBLAS_OP_N,
    M, N, K,
    &alpha,
    A, ldA1, 0,
    B, ldB1, ldB2,
    &beta,
    C, ldC1, ldC2,
    P)
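In NumPy terms, this single strided-batched call computes the following contraction (illustrative sizes; note the zero batch stride for A, which is shared across the batch):

import numpy as np
M, N, K, P = 4, 5, 6, 3
A = np.random.rand(M, K)                     # shared across the batch (batch stride 0)
B = np.random.rand(P, K, N)                  # one K-by-N slice per batch index p
C = np.empty((P, M, N))
for p in range(P):                           # what the single strided-batched call performs
    C[p] = A @ B[p]                          # C_mn[p] = A_mk B_kn[p]
assert np.allclose(C, np.einsum('mk,pkn->pmn', A, B))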
30. A new primitive: StridedBatchedGEMM
Performance on par with pure GEMM (P100 and beyond).
31. Applications: Tucker Decomposition
T_mnp = G_ijk A_mi B_nj C_pk
[Figure: Tucker factorization of T into a core tensor G and factor matrices A, B, C]
Main steps in the algorithm (superscript t is the iteration index):
Y_mjk = T_mnp B^t_nj C^t_pk
Y_ink = T_mnp A^{t+1}_mi C^t_pk
Y_ijp = T_mnp A^{t+1}_mi B^{t+1}_nj
Performance on Tucker decomposition:
[Figure: running time (sec) vs. n for TensorToolbox, BTAS, Cyclops, CPU Batched, and GPU Batched implementations]
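A NumPy/einsum sketch (illustrative sizes) of the Tucker model and the first factor-update contraction above; TensorLy provides the full decomposition routine on top of such contractions:

import numpy as np
M, N, P, I, J, K = 6, 7, 8, 2, 3, 4
G = np.random.rand(I, J, K)                        # core tensor G_ijk
A = np.random.rand(M, I)
B = np.random.rand(N, J)
C = np.random.rand(P, K)
T = np.einsum('ijk,mi,nj,pk->mnp', G, A, B, C)     # T_mnp = G_ijk A_mi B_nj C_pk
Y = np.einsum('mnp,nj,pk->mjk', T, B, C)           # Y_mjk = T_mnp B^t_nj C^t_pk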
32. Tensor Sketches
Randomized dimensionality reduction through sketching.
◮ Complexity independent of tensor order: exponential gain!
[Figure: a tensor T hashed into a short sketch vector s using random ±1 signs]
Applications
Tensor Decomposition via Sketching
Visual Question Answering
[Figure: VQA architecture. A CNN encodes the image into C ⨉ W ⨉ H features and an RNN encodes the question ("What is the mustache made of?"); the two are combined by a multimodal compact tensor (MCT) sketch of length L, followed by average pooling, FC, ReLU, BatchNorm, FC, and a softmax that outputs "Banana"]
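A minimal count-sketch example of the idea (illustrative sizes, with independent hash/sign functions per mode): the sketch of an outer product u ⊗ v is the circular convolution of the two individual sketches, computed with FFTs, so the full tensor is never formed and the cost stays independent of the tensor order.

import numpy as np

d, sketch_dim = 128, 64
rng = np.random.default_rng(0)

def make_count_sketch():
    h = rng.integers(0, sketch_dim, size=d)        # hash bucket per coordinate
    s = rng.choice([-1.0, 1.0], size=d)            # random sign per coordinate
    def sketch(x):
        y = np.zeros(sketch_dim)
        np.add.at(y, h, s * x)                     # y[h[i]] += s[i] * x[i]
        return y
    return sketch

cs1, cs2 = make_count_sketch(), make_count_sketch()
u, v = rng.standard_normal(d), rng.standard_normal(d)
sketch_uv = np.real(np.fft.ifft(np.fft.fft(cs1(u)) * np.fft.fft(cs2(v))))   # sketch of u ⊗ v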
41. Conclusion
Distributed Deep Learning at Scale
MXNet has many attractive features
◮ Flexible programming
◮ Portable
◮ Highly efficient
Easy to deploy large-scale DL on the AWS cloud
◮ Deep Learning AMI
◮ CloudFormation templates
Tensors are the future of ML
Tensor contractions: space savings in deep architectures.
New primitives speed up tensor contractions: extended BLAS
[Figure: a tensor T written as a sum of rank-1 components formed from vectors u, v, ...]