As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
Building a cutting-edge data processing environment on a budget
1. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
This talk is not about
rocket science!
2. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
3. Growing up as a penniless academic
I did a PhD in
quantum physics
4. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
management
5. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
6. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Shaped my vision
of computing as a
means to an end
Computers were only one
of the many moving parts
Matlab
Instrument control
7. Growing up as a penniless academic
2011
Tenured researcher
in computer science
8. Growing up as a penniless academic
2011
Today
Tenured researcher
in computer science
Growing team with
data science
rock stars
9. 1 Using machine learning to
understand brain function
Link neural activity to thoughts and cognition
12. 1 Encoding models of stimuli
Predicting neural response
→ a window into brain representations of stimuli
“feature engineering” a description of the world
14. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“brain reading”
15. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“if it’s not open and verifiable by others, it’s not
science, or engineering...”
Stodden, 2010
16. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
17. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
18. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
Software development challenge
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
19. 1 Data accumulation
When data processing is routine...
“big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
20. 1 Data accumulation
When data processing is routine...
“big data”
for rich models of
brain function
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observations on the basis of a model that contains only a few
arbitrary elements, and it must make definite predictions about the results of future observations.”
Stephen Hawking, A Brief History of Time.
Accumulation of scientific knowledge
and learning formal representations
21. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
22. 1 Petty day-to-day technicalities
A lab is no different from a startup
Difficulties:
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the code I wrote a year ago
Risks:
Recruitment
Bus factor
Technical debt
Limited resources (people & hardware)
23. 1 Petty day-to-day technicalities
A lab is no different from a startup
Difficulties:
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the code I wrote a year ago
Risks:
Recruitment
Bus factor
Technical debt
Limited resources (people & hardware)
Our mission is to revolutionize brain data processing
on a tight budget
25. 2 The data processing workflow
agile
Interaction...
→ script...
→ module...
→ interaction again...
Consolidation,
progressively
Low tech and short
turn-around times
26. 2 From statistics to statistical learning
Paradigm shift as the
dimensionality of data grows:
# features, not only # samples
From parameter inference to prediction
Statistical learning is
spreading everywhere
27. 3 Let’s just make software
to solve all these problems.
© Theodore W. Gray
28. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
29. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
Not “one software to rule them all”:
break down projects by expertise
31. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
32. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature-engineering, domain-specific cases...
Python is a programming language. Use it.
Cover the 80% use cases in one package
33. 3 Performance in high-level programming
High-level programming
is what keeps us
alive and kicking
34. 3 Performance in high-level programming
The secret sauce:
Optimize algorithms, not for loops
Know NumPy and SciPy perfectly
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler/memory-profiler
scipy-lectures.github.io
Cython, not C/C++
35. 3 Performance in high-level programming
Example: hierarchical clustering, PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need a sparse growable structure for neighborhoods
→ skip-list in Cython!
O(log n) insert, remove, access
→ bind C++ map[int, float] with Cython
Fast traversal, possibly in Cython, for step 3.
38. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects: pipeline, merging, etc...
39. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects: pipeline, merging, etc...
Ideas from:
the configuration/run pattern (traits, pyre)
currying in functional programming (functools.partial)
the MVC pattern
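To make the configuration/run pattern concrete, here is a minimal sketch of an estimator in that style; MeanPredictor and its shrinkage parameter are illustrative inventions, not scikit-learn code.

import numpy as np

class MeanPredictor:
    def __init__(self, shrinkage=0.0):
        # Configuration: parameters only, no data yet
        self.shrinkage = shrinkage

    def fit(self, X, y):
        # Run: data comes in here; learned state gets a trailing underscore
        self.mean_ = (1 - self.shrinkage) * np.mean(y)
        return self  # returning self allows chaining

    def predict(self, X):
        return np.full(len(X), self.mean_)

pred = MeanPredictor(shrinkage=0.1).fit(np.zeros((5, 2)), np.arange(5.0))
print(pred.predict(np.zeros((3, 2))))  # three copies of the shrunk mean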
41. 4 Big data on smallish hardware
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
42. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
43. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
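A minimal sketch of that running mean: one pass, one sample at a time, constant memory (illustrative code, not a library implementation).

def running_mean(stream):
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # incremental update, no giant sum kept
    return mean

print(running_mean(iter(range(1000000))))  # 499999.5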
44. 4 On-line algorithms
Converges to expectations
Mini-batch = a bunch of observations, for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10_000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2):  11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X):  0.62 s
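The same idea can be driven explicitly with partial_fit, feeding one mini-batch at a time so the full X never has to sit in memory; a small sketch using MiniBatchKMeans as above (batch sizes and counts are arbitrary):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10, n_init=2)
for _ in range(100):  # 100 mini-batches of 100 samples each
    batch = np.random.normal(size=(100, 200))
    km.partial_fit(batch)  # updates the centroids incrementally
print(km.cluster_centers_.shape)  # (10, 200)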
45. 4 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
46. 4 On-the-fly data reduction
Dropping data (a toy sketch follows)
1. loop: take a random fraction of the data
2. run the algorithm on that fraction
3. aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel
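A toy sketch of the three steps, estimating a mean, under the assumption that observations are redundant:

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(loc=3.0, size=100000)

estimates = []
for _ in range(10):
    # 1. take a random 1% fraction of the data
    sub = X[rng.randint(len(X), size=len(X) // 100)]
    # 2. run the algorithm (here, a mean) on that fraction
    estimates.append(sub.mean())
# 3. aggregate results across sub-samplings
print(np.mean(estimates))  # close to 3.0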
47. 4 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing, when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
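A hedged usage sketch of two of these reducers; the classes are scikit-learn's, but the shapes and parameters below are arbitrary choices for illustration:

import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.feature_extraction.text import HashingVectorizer

X = np.random.normal(size=(1000, 5000))
X_small = SparseRandomProjection(n_components=100).fit_transform(X)
print(X_small.shape)  # (1000, 100): random combinations of features

vec = HashingVectorizer(n_features=2**10)  # stateless: no fit needed
print(vec.transform(["big data", "small budget"]).shape)  # (2, 1024)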
48. 4 On-the-fly data reduction
Example: randomized SVD (a random projection)
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
49. 4 Biggish iron
Our new box:
48 cores
384 GB RAM
70 TB storage
(SSD cache on RAID controller)
15 k€
Gets our work done faster than our 800-CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
51. 5 Parallel processing:
big picture
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism:
too fine → overhead
too coarse → memory shortage
Scale by the relevant cache pool
52. 5 Parallel processing
joblib
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
53. 5 Parallel processing
joblib
IPython, multiprocessing, celery, MPI?
joblib is higher-level
No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs – memory-friendly
Threads or processes backend
55. 5 Parallel processing
Queues
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
→ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
→ pre_dispatch in joblib
→ back and forth communication
Door open to race conditions
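joblib exposes the “slow fill” as Parallel's pre_dispatch argument; a small sketch ('2*n_jobs' is its usual default) showing a generator consumed lazily:

from math import sqrt
from joblib import Parallel, delayed

# At most 2*n_jobs tasks are queued ahead of the workers, so a
# memory-hungry generator is consumed lazily rather than all at once.
out = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(sqrt)(i ** 2) for i in range(8))
print(out)  # [0.0, 1.0, ..., 7.0]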
56. 5 Parallel processing:
what happens where
joblib design: caller, dispatch queue, and collect
queue in the same process
Benefit: robustness
Grand-central-dispatch design: dispatch queue has
a process of its own
Benefit: resource management in nested for loops
57. 5 Caching
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
58. 5 Caching
The joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)
# computes f(a)
c = g(a)
# retrieves the result from the store
59. 5 Caching
The joblib approach
Challenges in the context of big data:
a & b are big
Design goals:
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code
60. 5 Caching
The joblib approach
Lego bricks for out-of-core algorithms coming soon:
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
61. 5 Efficient input argument hashing – joblib.hash
Compute the md5* of input arguments
Trade-off between features and cost
Black-boxy
Robust and completely generic
62. 5 Efficient input argument hashing – joblib.hash
Compute the md5* of input arguments
Implementation (sketched below):
1. Create an md5 hash object
2. Subclass the standard-library pickler
= a state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass the data pointer to the md5 algorithm
(“update” method)
- the rest: pickle
4. Update the md5 with the pickle
* md5 is in the Python standard library
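A simplified sketch of steps 1-4 (the idea only, not joblib's actual implementation):

import hashlib, io, pickle
import numpy as np

def arg_hash(*args):
    md5 = hashlib.md5()  # 1. an md5 hash object

    class HashingPickler(pickle.Pickler):  # 2. subclass the pickler
        def persistent_id(self, obj):  # 3. called on every object walked
            if isinstance(obj, np.ndarray):
                # ndarrays: feed the raw data buffer straight to md5
                md5.update(np.ascontiguousarray(obj).data)
                return 'ndarray'  # keep the array itself out of the pickle
            return None  # the rest: pickle as usual

    buf = io.BytesIO()
    HashingPickler(buf).dump(args)
    md5.update(buf.getvalue())  # 4. update the md5 with the pickle
    return md5.hexdigest()

print(arg_hash(np.arange(10), {'alpha': 0.1}))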
63. 5 Fast, disk-based, concurrent store – joblib.dump
Persisting arbitrary objects
Once again, sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
→ multiple files
Store concurrency issues
Strategy: atomic operations + try/except (sketched below)
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (clusters)
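A sketch of the atomic-rename strategy under the stated assumptions; paths and layout are illustrative, not joblib's actual store format:

import os, pickle, shutil, tempfile
import numpy as np

def dump_atomic(obj, array, target_dir):
    # Write into a private temporary directory first...
    tmp = tempfile.mkdtemp(dir='.')
    np.save(os.path.join(tmp, 'data.npy'), array)  # .npy for large arrays
    with open(os.path.join(tmp, 'rest.pkl'), 'wb') as f:
        pickle.dump(obj, f)  # pickle for the rest
    try:
        os.rename(tmp, target_dir)  # ...then publish with an atomic rename
    except OSError:
        shutil.rmtree(tmp)  # a concurrent writer got there first

dump_atomic({'param': 1}, np.arange(5), 'cache_entry')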
64. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
65. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
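A sketch of the copy-light, in-memory path: hand zlib the array's raw buffer directly, keeping dtype (and, in general, shape, strides, class) aside as meta-information:

import zlib
import numpy as np

a = np.arange(1000000, dtype=np.float64)
packed = zlib.compress(a.data, 1)  # level 1: fast, since CPU beats disk
print(len(packed), '<', a.nbytes)

restored = np.frombuffer(zlib.decompress(packed), dtype=a.dtype)
print(np.array_equal(a, restored))  # True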
66. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
Single file dump
coming soon
File opening is slow on cluster
Challenge: streaming the above for memory usage
67. 5 Making I/O fast
What matters on large systems:
Number of bytes stored
brings the network/SATA bus down
Memory usage
brings compute nodes down
Number of atomic file accesses
brings shared storage down
68. 5 Benchmarking against np.save and pytables
[Figure: I/O benchmarks on NeuroImaging data (MNI atlas); y-axis scale: 1 is np.save]
69. 6 The bigger picture: building
an ecosystem
Helping your future self
70. 6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
~12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
71. 6 The economics of open source
Code maintenance is too expensive to do alone
scikit-learn: ~300 emails/month
nipy: ~45 emails/month
joblib: ~45 emails/month
mayavi: ~30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
72. 6 The economics of open source
Code maintenance is too expensive to do alone
scikit-learn: ~300 emails/month
nipy: ~45 emails/month
joblib: ~45 emails/month
mayavi: ~30 emails/month
Your “benefits” come from a fraction of the code
Data loading?
Maybe?
Standard algorithms?
Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
73. 6 Many eyes make code fast
Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
74. 6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/
scikit-learn-dveloppement-communautaire
75. 6 Core project contributors
[Figure: normalized number of commits since 2009-06, per individual committer]
Credit: Fernando Perez, Gist 5843625
76. 6 The tragedy of the commons
Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) are taken for granted
→ hard to fund, less excitement
They need citations, in papers & on corporate web pages
77. Solving problems that matter
The 80/20 rule:
80% of the use cases can be solved
with 20% of the lines of code
scikit-learn, joblib, nilearn, ... (I hope)
@GaelVaroquaux
78. Cutting-edge ... environment ... on a budget
1 Set the goals right
Don’t solve hard problems
What’s your original problem?
@GaelVaroquaux
79. Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication
@GaelVaroquaux
80. Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
With your users (documentation)
With your contributors
@GaelVaroquaux
81. Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
A perfect
design?
@GaelVaroquaux