As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost.
I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?
Building a cutting-edge data processing environment on a budget
1. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
This talk is not about
rocket science!
2. Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
3. Growing up as a penniless academic
I did a PhD in
quantum physics
4. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
management
5. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
6. Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Shaped my vision
of computing as a
means to an end
Computers were only one
of the many moving parts
Matlab
Instrument control
7. Growing up as a penniless academic
2011
Tenured researcher
in computer science
8. Growing up as a penniless academic
2011
Today
Tenured researcher
in computer science
Growing team with
data science
rock stars
9. 1 Using machine learning to
understand brain function
Link neural activity to thoughts and cognition
12. 1 Encoding models of stimuli
Predicting neural response
→ a window into brain representations of stimuli
“feature engineering” a description of the world
14. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“brain reading”
15. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
“if it’s not open and verifiable by others, it’s not
science, or engineering...”
Stodden, 2010
16. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
17. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
18. 1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]
Make it work, make it right, make it boring
Software development challenge
http://nilearn.github.io/auto_examples/
plot_miyawaki_reconstruction.html
Code, data, ... just works™
http://nilearn.github.io
19. 1 Data accumulation
When data processing is routine...
“big data”
for rich models of
brain function
Accumulation of scientific knowledge
and learning formal representations
20. 1 Data accumulation
When data processing is routine...
“big data”
for rich models of
brain function
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observations on the basis of a model that contains only a few
arbitrary elements, and it must make definite predictions about the results of future observations.”
Stephen Hawking, A Brief History of Time.
Accumulation of scientific knowledge
and learning formal representations
21. 1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I wrote a year ago
22. 1 Petty day-to-day technicalities
A lab is no different from a startup
Difficulties:
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the code I wrote a year ago
Risks:
Recruitment
Bus factor
Technical debt
Limited resources (people & hardware)
23. 1 Petty day-to-day technicalities
A lab is no different from a startup
Difficulties:
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the code I wrote a year ago
Risks:
Recruitment
Bus factor
Technical debt
Limited resources (people & hardware)
Our mission is to revolutionize brain data processing
on a tight budget
25. 2 The data processing workflow
agile
Interaction...
→ script...
→ module...
→ interaction again...
Consolidation,
progressively
Low tech and short
turn-around times
26. 2 From statistics to statistical learning
Paradigm shift as the
dimensionality of data grows:
# features, not only # samples
From parameter inference to prediction
Statistical learning is
spreading everywhere
27. 3 Let’s just make software
to solve all these problems.
© Theodore W. Gray
28. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
29. 3 Design philosophy
1. Don’t solve hard problems
The original problem can be bent.
2. Easy setup, works out of the box
Installing software sucks.
Convention over configuration.
3. Fail gracefully
Robust to errors. Easy to debug.
4. Quality, quality, quality
What’s not excellent won’t be used.
Not “one software to rule them all”:
break down projects by expertise
31. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
32. Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature-engineering, domain-specific cases...
Python is a programming language. Use it.
Cover the 80% use cases in one package
33. 3 Performance in high-level programming
High-level programming
is what keeps us
alive and kicking
34. 3 Performance in high-level programming
The secret sauce:
Optimize algorithms, not for loops
Know NumPy and SciPy perfectly
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler/memory-profiler
scipy-lectures.github.io
Cython, not C/C++
35. 3 Performance in high-level programming
Example: hierarchical clustering, PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need a sparse growable structure for neighborhoods
→ skip-list in Cython!
O(log n) insert, remove, access
→ bind C++ map[int, float] with Cython
Fast traversal, possibly in Cython, for step 3.
38. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects: pipeline, merging, etc...
39. 3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language
Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects: pipeline, merging, etc...
Ideas from:
the configuration/run pattern (traits, pyre)
currying in functional programming (functools.partial)
the MVC pattern
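To make the configuration/run pattern concrete, here is a minimal sketch of an estimator in that style; MeanPredictor and its shrinkage parameter are illustrative inventions, not scikit-learn code.

import numpy as np

class MeanPredictor:
    def __init__(self, shrinkage=0.0):
        # Configuration: parameters only, no data yet
        self.shrinkage = shrinkage

    def fit(self, X, y):
        # Run: data comes in here; learned state gets a trailing underscore
        self.mean_ = (1 - self.shrinkage) * np.mean(y)
        return self  # returning self allows chaining

    def predict(self, X):
        return np.full(len(X), self.mean_)

pred = MeanPredictor(shrinkage=0.1).fit(np.zeros((5, 2)), np.arange(5.0))
print(pred.predict(np.zeros((3, 2))))  # three copies of the shrunk mean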
41. 4 Big data on smallish hardware
“Big data”:
Petabytes...
Distributed storage
Computing cluster
Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
42. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
43. 4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?
No: just do a running mean
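A minimal sketch of that running mean: one pass, one sample at a time, constant memory (illustrative code, not a library implementation).

def running_mean(stream):
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n  # incremental update, no giant sum kept
    return mean

print(running_mean(iter(range(1000000))))  # 499999.5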
44. 4 On-line algorithms
Converges to expectations
Mini-batch = a bunch of observations, for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10_000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2):  11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X):  0.62 s
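The same idea can be driven explicitly with partial_fit, feeding one mini-batch at a time so the full X never has to sit in memory; a small sketch using MiniBatchKMeans as above (batch sizes and counts are arbitrary):

import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10, n_init=2)
for _ in range(100):  # 100 mini-batches of 100 samples each
    batch = np.random.normal(size=(100, 200))
    km.partial_fit(batch)  # updates the centroids incrementally
print(km.cluster_centers_.shape)  # (10, 200)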
45. 4 On-the-fly data reduction
Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work
46. 4 On-the-fly data reduction
Dropping data (a toy sketch follows)
1. loop: take a random fraction of the data
2. run the algorithm on that fraction
3. aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Exploits redundancy across observations
Run the loop in parallel
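A toy sketch of the three steps, estimating a mean, under the assumption that observations are redundant:

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(loc=3.0, size=100000)

estimates = []
for _ in range(10):
    # 1. take a random 1% fraction of the data
    sub = X[rng.randint(len(X), size=len(X) // 100)]
    # 2. run the algorithm (here, a mean) on that fraction
    estimates.append(sub.mean())
# 3. aggregate results across sub-samplings
print(np.mean(estimates))  # close to 3.0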
47. 4 On-the-fly data reduction
Random projections (will average features)
sklearn.random_projection
random linear combinations of the features
Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy
Hashing, when observations have varying size (e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
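A hedged usage sketch of two of these reducers; the classes are scikit-learn's, but the shapes and parameters below are arbitrary choices for illustration:

import numpy as np
from sklearn.random_projection import SparseRandomProjection
from sklearn.feature_extraction.text import HashingVectorizer

X = np.random.normal(size=(1000, 5000))
X_small = SparseRandomProjection(n_components=100).fit_transform(X)
print(X_small.shape)  # (1000, 100): random combinations of features

vec = HashingVectorizer(n_features=2**10)  # stateless: no fit needed
print(vec.transform(["big data", "small budget"]).shape)  # (2, 1024)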
48. 4 On-the-fly data reduction
Example: randomized SVD (a random projection)
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
49. 4 Biggish iron
Our new box:
48 cores
384 GB RAM
70 TB storage
(SSD cache on RAID controller)
15 k€
Gets our work done faster than our 800-CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
51. 5 Parallel processing:
big picture
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism:
too fine → overhead
too coarse → memory shortage
Scale by the relevant cache pool
52. 5 Parallel processing
joblib
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
53. 5 Parallel processing
joblib
IPython, multiprocessing, celery, MPI?
joblib is higher-level
No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs – memory-friendly
Threads or processes backend
55. 5 Parallel processing
Queues
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
→ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
→ pre_dispatch in joblib
→ back and forth communication
Door open to race conditions
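joblib exposes the “slow fill” as Parallel's pre_dispatch argument; a small sketch ('2*n_jobs' is its usual default) showing a generator consumed lazily:

from math import sqrt
from joblib import Parallel, delayed

# At most 2*n_jobs tasks are queued ahead of the workers, so a
# memory-hungry generator is consumed lazily rather than all at once.
out = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(sqrt)(i ** 2) for i in range(8))
print(out)  # [0.0, 1.0, ..., 7.0]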
56. 5 Parallel processing:
what happens where
joblib design: caller, dispatch queue, and collect
queue in the same process
Benefit: robustness
Grand-central-dispatch design: dispatch queue has
a process of its own
Benefit: resource management in nested for loops
57. 5 Caching
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
58. 5 Caching
The joblib approach
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization
Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)
# computes f(a)
c = g(a)
# retrieves the result from the store
59. 5 Caching
The joblib approach
Challenges in the context of big data:
a & b are big
Design goals:
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code
60. 5 Caching
The joblib approach
Lego bricks for out-of-core algorithms coming soon:
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
61. 5 Efficient input argument hashing – joblib.hash
Compute the md5* of input arguments
Trade-off between features and cost
Black-boxy
Robust and completely generic
62. 5 Efficient input argument hashing – joblib.hash
Compute the md5* of input arguments
Implementation (sketched below):
1. Create an md5 hash object
2. Subclass the standard-library pickler
= a state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass the data pointer to the md5 algorithm
(“update” method)
- the rest: pickle
4. Update the md5 with the pickle
* md5 is in the Python standard library
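A simplified sketch of steps 1-4 (the idea only, not joblib's actual implementation):

import hashlib, io, pickle
import numpy as np

def arg_hash(*args):
    md5 = hashlib.md5()  # 1. an md5 hash object

    class HashingPickler(pickle.Pickler):  # 2. subclass the pickler
        def persistent_id(self, obj):  # 3. called on every object walked
            if isinstance(obj, np.ndarray):
                # ndarrays: feed the raw data buffer straight to md5
                md5.update(np.ascontiguousarray(obj).data)
                return 'ndarray'  # keep the array itself out of the pickle
            return None  # the rest: pickle as usual

    buf = io.BytesIO()
    HashingPickler(buf).dump(args)
    md5.update(buf.getvalue())  # 4. update the md5 with the pickle
    return md5.hexdigest()

print(arg_hash(np.arange(10), {'alpha': 0.1}))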
63. 5 Fast, disk-based, concurrent store – joblib.dump
Persisting arbitrary objects
Once again, sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
→ multiple files
Store concurrency issues
Strategy: atomic operations + try/except (sketched below)
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (clusters)
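A sketch of the atomic-rename strategy under the stated assumptions; paths and layout are illustrative, not joblib's actual store format:

import os, pickle, shutil, tempfile
import numpy as np

def dump_atomic(obj, array, target_dir):
    # Write into a private temporary directory first...
    tmp = tempfile.mkdtemp(dir='.')
    np.save(os.path.join(tmp, 'data.npy'), array)  # .npy for large arrays
    with open(os.path.join(tmp, 'rest.pkl'), 'wb') as f:
        pickle.dump(obj, f)  # pickle for the rest
    try:
        os.rename(tmp, target_dir)  # ...then publish with an atomic rename
    except OSError:
        shutil.rmtree(tmp)  # a concurrent writer got there first

dump_atomic({'param': 1}, np.arange(5), 'cache_entry')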
64. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
65. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
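A sketch of the copy-light, in-memory path: hand zlib the array's raw buffer directly, keeping dtype (and, in general, shape, strides, class) aside as meta-information:

import zlib
import numpy as np

a = np.arange(1000000, dtype=np.float64)
packed = zlib.compress(a.data, 1)  # level 1: fast, since CPU beats disk
print(len(packed), '<', a.nbytes)

restored = np.frombuffer(zlib.decompress(packed), dtype=a.dtype)
print(np.array_equal(a, restored))  # True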
66. 5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)
Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
Single file dump
coming soon
File opening is slow on cluster
Challenge: streaming the above for memory usage
67. 5 Making I/O fast
What matters on large systems:
Number of bytes stored
brings the network/SATA bus down
Memory usage
brings compute nodes down
Number of atomic file accesses
brings shared storage down
68. 5 Benchmarking against np.save and pytables
[Figure: I/O benchmarks on NeuroImaging data (MNI atlas); y-axis scale: 1 is np.save]
69. 6 The bigger picture: building
an ecosystem
Helping your future self
70. 6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
~12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
71. 6 The economics of open source
Code maintenance is too expensive to do alone
scikit-learn: ~300 emails/month
nipy: ~45 emails/month
joblib: ~45 emails/month
mayavi: ~30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
72. 6 The economics of open source
Code maintenance is too expensive to do alone
scikit-learn: ~300 emails/month
nipy: ~45 emails/month
joblib: ~45 emails/month
mayavi: ~30 emails/month
Your “benefits” come from a fraction of the code
Data loading?
Maybe?
Standard algorithms?
Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
73. 6 Many eyes make code fast
Bench WiseRF anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
74. 6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/
scikit-learn-dveloppement-communautaire
75. 6 Core project contributors
[Figure: normalized number of commits since 2009-06, per individual committer]
Credit: Fernando Perez, Gist 5843625
76. 6 The tragedy of the commons
Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia
Make it work, make it right, make it boring
Core projects (boring) are taken for granted
→ hard to fund, less excitement
They need citations, in papers & on corporate web pages
77. Solving problems that matter
The 80/20 rule:
80% of the use cases can be solved
with 20% of the lines of code
scikit-learn, joblib, nilearn, ... (I hope)
@GaelVaroquaux
78. Cutting-edge ... environment ... on a budget
1 Set the goals right
Don’t solve hard problems
What’s your original problem?
@GaelVaroquaux
79. Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication
@GaelVaroquaux
80. Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
With your users (documentation)
With your contributors
@GaelVaroquaux
81. Cutting-edge ... environment ... on a budget
1 Set the goals right
2 Use the simplest technological solutions possible
3 Don’t forget the human factors
A perfect
design?
@GaelVaroquaux