5. Optimising SGD
• Linear-regression-like stochastic gradient descent with d=5 features and n=1,000,000 examples.
• Implemented in pure Python (1), Numba (2), Numpy (3) and Cython (4) (https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd)
• Also compared it to pure C++ code (https://gist.github.com/zermelozf/4df67d14f72f04b4338a)
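The four Python-side variants live in the gists above; as a point of reference, the pure Python baseline (1) is essentially a double loop like the sketch below (a minimal, illustrative version, not the exact benchmarked code):

```python
import random

def sgd_linear(X, y, lr=0.01, epochs=5):
    """Plain-Python SGD for least-squares linear regression.

    Minimal sketch of the pure Python variant; the gists linked above
    hold the actual benchmarked implementations.
    """
    d = len(X[0])
    w = [0.0] * d
    idx = list(range(len(X)))
    for _ in range(epochs):
        random.shuffle(idx)  # stochastic: visit examples in random order
        for i in idx:
            pred = sum(w[j] * X[i][j] for j in range(d))  # w·x
            err = pred - y[i]
            for j in range(d):
                w[j] -= lr * err * X[i][j]  # gradient step on one example
    return w
```

The Numba and Cython variants compile essentially this same inner loop, which is why the per-example loop cost dominates the comparison.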
18. Runtime optimisation
Cache optimisation (d=5 & n=1,000,000)
[Bar chart: time (ms), 0–160, comparing Numba, C++ and Cython with random vs linear data access; linear access yields cache hits, random access cache misses]
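The “random” vs “linear” bars refer to the order in which examples are visited: scanning memory sequentially keeps the CPU cache warm, while random indexing provokes cache misses. A sketch of the two access patterns (illustrative only; in pure Python the interpreter overhead hides most of the effect, which is why the chart uses the Numba/Cython/C++ kernels):

```python
import numpy as np

def scan(a, order):
    """Sum the elements of a in the given visiting order."""
    s = 0.0
    for i in order:
        s += a[i]
    return s

a = np.random.rand(1_000_000)
linear_order = np.arange(len(a))               # cache-friendly sequential scan
random_order = np.random.permutation(len(a))   # cache-hostile random jumps
# scan(a, linear_order) and scan(a, random_order) compute the same sum,
# but the second touches memory in an unpredictable order.
```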
19. (d>>1) Gensim word2vec case study
• Elman-style RNN trained with SGD: 15,079×200 matrix on a 1M-word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Řehůřek in Python.
• Optimised by Radim Řehůřek using Cython, BLAS…
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
20. (d>>1) Gensim word2vec case study
• Elman-style RNN trained with SGD: 15,079×200 matrix on a 1M-word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Řehůřek in Python.
• Optimised by Radim Řehůřek using Cython, BLAS…
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
[Bar chart, built up over slides 20–26: words/sec (×1000), 0–120, for Original C, Numpy, Cython, Cython + BLAS, and Cython + BLAS + sigmoid table; the fastest variants work with raw pointers]
27. What’s this BLAS magic?
Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
• vectorised y = alpha*x !
• replaced 3 lines of code!
• translated into a 3x speedup over Cython alone!
• please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
**On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s ACML, Sun’s SunPerf or the automatically tuned ATLAS are all good choices.
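The replaced loop is an axpy-style update (y += alpha*x in BLAS terms), handed off to the BLAS library in a single call. A toy NumPy rendering of the idea (illustrative; the real word2vec_inner.pyx calls the BLAS routine directly through SciPy’s bindings from Cython):

```python
import numpy as np

def axpy_update(y, x, alpha):
    """In-place y += alpha * x, the saxpy-style update from the kernel.

    NumPy dispatches the whole update in one vectorised operation
    instead of a per-element Python loop.
    """
    y += alpha * x
    return y

y = np.ones(4, dtype=np.float32)   # float32, as in the word2vec vectors
x = np.arange(4, dtype=np.float32)
axpy_update(y, x, 2.0)             # y is now [1, 3, 5, 7]
```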
30. (d>>1) Gensim word2vec continued
• Elman-style RNN trained with SGD: 15,079×200 matrix on a 1M-word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Řehůřek in Python.
• Optimised by Radim Řehůřek using Cython, BLAS…
• … and parallelised with threads!
Source: http://rare-technologies.com/parallelizing-word2vec-in-python/
[Bar chart, built up over slides 31–37: words/sec (×1000), 0–400, for 1–4 threads, comparing Original C against Cython + BLAS + sigmoid table; the threaded version reaches a 2.85x speedup]
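The “sigmoid table” in the winning variant is a precomputed lookup: sigmoid is evaluated once on a fixed grid, and the hot loop replaces each exp() call with an array index. A sketch using word2vec’s grid constants (EXP_TABLE_SIZE=1000, MAX_EXP=6 in the original C code):

```python
import numpy as np

EXP_TABLE_SIZE = 1000
MAX_EXP = 6.0

# Precompute sigmoid on a uniform grid over (-MAX_EXP, MAX_EXP).
_grid = (np.arange(EXP_TABLE_SIZE) / EXP_TABLE_SIZE * 2.0 - 1.0) * MAX_EXP
EXP_TABLE = 1.0 / (1.0 + np.exp(-_grid))

def fast_sigmoid(x):
    """Table-lookup sigmoid, valid for |x| < MAX_EXP.

    Trades a tiny accuracy loss (grid spacing 0.012) for removing
    exp() from the inner loop.
    """
    idx = int((x + MAX_EXP) * (EXP_TABLE_SIZE / (2.0 * MAX_EXP)))
    return EXP_TABLE[idx]
```

Inputs outside (-MAX_EXP, MAX_EXP) are clamped to 0 or 1 in the real kernels, since sigmoid is flat there anyway.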
38. (d>>1) Hogwild! on SAG
• Fabian’s experimentation with Julia (lang).
• Running SAG in parallel, without a lock.
• Very nice speedup!!!
40. Data does not fit in memory…
Stream data from disk…
… but you cannot read in parallel…
Producer/Consumer pattern
[Animated diagram, slides 40–49: thread 1 (producer) reads chunk 1, chunk 2, … sequentially from disk and queues them as jobs; two consumer threads pull jobs off the queue, process them and mark them done. Et cetera…]
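The pattern above maps directly onto Python’s thread-safe queue. A minimal sketch (chunk contents are simulated; in real use the producer reads sequentially from disk):

```python
import queue
import threading

def producer(q, chunks, n_consumers):
    for chunk in chunks:          # in real use: stream chunks from disk
        q.put(chunk)
    for _ in range(n_consumers):  # one sentinel per consumer: "done"
        q.put(None)

def consumer(q, results, lock):
    while True:
        job = q.get()
        if job is None:           # sentinel: no more chunks coming
            break
        total = sum(job)          # stand-in for the real per-chunk work
        with lock:
            results.append(total)

chunks = [list(range(i, i + 100)) for i in range(0, 600, 100)]
q = queue.Queue(maxsize=2)        # bounded: producer blocks if consumers lag
results, lock = [], threading.Lock()
threads = [threading.Thread(target=consumer, args=(q, results, lock))
           for _ in range(2)]
threads.append(threading.Thread(target=producer, args=(q, chunks, 2)))
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The bounded queue is the key design choice: it caps memory use, so the producer can never read far ahead of what the consumers can process.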
50. How many consumers?
It depends…
• Gensim (R. Řehůřek)
• Saw the impact up to 4 consumers earlier
• Vowpal Wabbit (J. Langford)
• Claims no gain with more than 1 consumer!
• 2’10’’ on my MacBook Pro for ~10GB and 50MM lines (Criteo’s advertising dataset).
• CNNs pre-processing (S. Dieleman)
• Big impact with ?? (several) consumers!
• Useful for data augmentation/preprocessing
51. 5.3GB (~105MM lines) word count
[Bar chart: word-count Java benchmark, time vs number of consumers (1–6); y-axis 0–220]
source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
• MacBook Pro 15’’ 2014
• `sudo purge`
56. Distributed computing
Scalability - A perspective on Big data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Memory-bound tasks… usually.
Most “big data” problems are I/O bound. Hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
61. Bring computation to data
Map-Reduce: Statistical query model
[Diagram, built up over slides 61–63: f, the map function, is sent to every machine; the sum corresponds to a reduce operation]
• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst. 2004
• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.
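The statistical query idea is that any objective (or gradient) written as a sum over examples distributes trivially: each machine maps its partition to a partial sum, and one reduce adds the pieces. A toy in-process sketch with a least-squares gradient (partitions simulated as plain lists):

```python
from functools import reduce

def partial_gradient(partition, w):
    """Map step: gradient of sum((w·x - y)^2)/2 over one data partition."""
    g = [0.0] * len(w)
    for x, y in partition:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        for j, xj in enumerate(x):
            g[j] += err * xj
    return g

def add(g1, g2):
    """Reduce step: element-wise sum of partial gradients."""
    return [a + b for a, b in zip(g1, g2)]

# Two simulated partitions of a toy d=2 dataset.
partitions = [
    [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)],
    [([1.0, 1.0], 3.0)],
]
w = [0.0, 0.0]
grad = reduce(add, (partial_gradient(p, w) for p in partitions))
```

In a real Map-Reduce or Spark job, each call to partial_gradient runs on the machine holding that partition, and only the small d-dimensional partial gradients travel over the network.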
64. Spark on Criteo’s data
• Logistic regression trained with minibatch SGD
• 10GB of data (50MM lines). Caveat: quite small for a benchmark
• Super-linear strong scalability. Not theoretically possible => small dataset + few instances saturate.
[Chart: time (sec, 0–1300) and number of cores (0–40) vs number of AWS nodes (4–10)]
Manual setup of the cluster was a bit painful…
74. Software stack: Mesos vs YARN
• Standalone mode is fastest…
• … but resources are requested for the entire job.
Cluster management frameworks
• Concurrent access (multiuser)
• Hyperparameter tuning (multijob)
Mesos
• Frameworks receive offers
• Easy install on AWS, GCE
• Lots of compatible frameworks: Spark, MPI, Cassandra, HDFS…
• Mesosphere’s DCOS is really, really easy to use.
YARN
• Frameworks make requests
• Configuration hell (can be made easier with puppet/ansible recipes)
• Several compatible frameworks: Spark, Flink, HDFS…