Building a cutting-edge data processing
environment on a budget
Gaël Varoquaux

This talk is not about
rocket science!
Disclaimer: this talk is as much about people
and projects as it is about code and algorithms.
Growing up as a penniless academic
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)

Best training ever
for agile project
management
Computers were only one
of the many moving parts
Matlab
Instrument control
Shaped my vision
of computing as a
means to an end

Growing up as a penniless academic

2011

Tenured researcher
in computer science
Today

Growing team with
data science
rock stars
1 Using machine learning to
understand brain function

Link neural activity to thoughts and cognition
G Varoquaux

1 Functional MRI


Recordings of brain activity

1 Cognitive NeuroImaging

Learn a bilateral link between brain activity
and cognitive function
1 Encoding models of stimuli

Predicting neural response
⇒ a window into brain representations of stimuli
“feature engineering” a description of the world
1 Decoding brain activity

“brain reading”
1 Data processing feats
Visual image reconstruction from human brain activity
[Miyawaki, et al. (2008)]

“brain reading”
“if it’s not open and verifiable by others, it’s not
science, or engineering...”
Stodden, 2010
Make it work, make it right, make it boring

http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html

Code, data, ... just works™

http://nilearn.github.io
Software development challenge
1 Data accumulation
When data processing is routine...

“big data”
for rich models of
brain function

Accumulation of scientific knowledge
and learning formal representations
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observations on the basis of a model that contains only a few
arbitrary elements, and it must make definite predictions about the results of future observations.”
Stephen Hawking, A Brief History of Time.

1 Petty day-to-day technicalities
Buggy code
Slow code
Lead data scientist leaves
New intern to train
I don’t understand the
code I have written a year ago

A lab is no different from a startup
Difficulties
- Recruitment
- Limited resources (people & hardware)
Risks
- Bus factor
- Technical debt
Our mission is to revolutionize brain data processing
on a tight budget
2 Patterns in data processing

2 The data processing workflow

agile

Interaction...
→ script...
→ module...
↻ interaction again...
Consolidation,
progressively
Low tech and short
turn-around times

2 From statistics to statistical learning

Paradigm shift as the
dimensionality of data
grows

# features,
not only # samples

From parameter
inference to prediction

Statistical learning is
spreading everywhere
3 Let’s just make software
to solve all these problems.

© Theodore W. Gray
3 Design philosophy

1. Don’t solve hard problems

The original problem can be bent.

2. Easy setup, works out of the box

Installing software sucks.
Convention over configuration.

3. Fail gracefully

Robust to errors. Easy to debug.

4. Quality, quality, quality
What’s not excellent won’t be used.
Not “one software to rule them all”
Break down projects by expertise
Vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between “just works” and versatility
(think Apple vs Linux)

We’re not going to solve all the problems for you
I don’t solve hard problems

Feature-engineering, domain-specific cases...
Python is a programming language. Use it.
Cover all the 80% use cases in one package
3 Performance in high-level programming

High-level programming
is what keeps us
alive and kicking

The secret sauce
Optimize algorithms, not for loops
Know NumPy and SciPy perfectly
- Significant data should be arrays/memoryviews
- Avoid memory copies, rely on BLAS/LAPACK
line-profiler/memory-profiler
scipy-lectures.github.io
Cython, not C/C++
Hierarchical clustering
PR #2199
1. Take the 2 closest clusters
2. Merge them
3. Update the distance matrix
...
Faster with constraints: sparse distance matrix
- Keep a heap queue of distances: cheap minimum
- Need sparse growable structure for neighborhoods
skip-list in Cython!
O(log n) insert, remove, access
bind C++ map[int, float] with Cython
Fast traversal, possibly in Cython, for step 3.
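The heap-queue idea can be sketched in a few lines of pure Python. This is a minimal illustration, not the code of the pull request: it assumes single linkage, so step 3 reduces to skipping heap entries whose endpoints already share a cluster, and the function name `single_linkage` and the `(distance, i, j)` edge format are made up for this sketch.

```python
import heapq


def single_linkage(n_points, edges, n_clusters):
    """Merge clusters along the cheapest allowed edges (hypothetical helper).

    edges: iterable of (distance, i, j), listing only connected neighbors,
    which plays the role of the sparse distance matrix.
    """
    heap = list(edges)
    heapq.heapify(heap)                 # cheap access to the minimum distance
    parent = list(range(n_points))      # union-find forest over the points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    remaining = n_points
    merges = []
    while heap and remaining > n_clusters:
        dist, i, j = heapq.heappop(heap)    # step 1: take the 2 closest
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue                        # stale entry: already merged
        parent[root_j] = root_i             # step 2: merge them
        merges.append((root_i, root_j, dist))
        remaining -= 1
    labels = [find(i) for i in range(n_points)]
    return labels, merges


# A chain of 5 points where only neighbors are connected.
edges = [(1.0, 0, 1), (4.0, 1, 2), (1.5, 2, 3), (0.5, 3, 4)]
labels, merges = single_linkage(5, edges, n_clusters=2)
```

Other linkages (Ward, average) need a real distance update at each merge, which is where the growable neighborhood structure mentioned on the slide comes in.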

3 Architecture of a data-manipulation toolkit
Separate data from operations,
but keep an imperative-like language

bokeh, chaco, hadoop, Mayavi, CPUs

Object API exposes a data-processing language
fit, predict, transform, score, partial_fit
Instantiated without data but with all the parameters
Objects pipeline, merging, etc...
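As a rough illustration of this object API, here is a toy sketch in plain Python. The classes `MeanCenterer` and `SimplePipeline` are hypothetical and written for this example only; they mimic the fit/transform convention but are not scikit-learn code.

```python
class MeanCenterer:
    """Toy estimator: all parameters at construction, data only in fit."""

    def __init__(self, with_scaling=False):
        self.with_scaling = with_scaling    # configuration, no data yet

    def fit(self, X):
        n = len(X)
        self.mean_ = sum(X) / n
        variance = sum((x - self.mean_) ** 2 for x in X) / n
        use_scale = self.with_scaling and variance > 0
        self.scale_ = variance ** 0.5 if use_scale else 1.0
        return self                         # returning self allows chaining

    def transform(self, X):
        return [(x - self.mean_) / self.scale_ for x in X]


class SimplePipeline:
    """Chains objects that share the same fit/transform interface."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        for step in self.steps:
            X = step.fit(X).transform(X)
        return self

    def transform(self, X):
        for step in self.steps:
            X = step.transform(X)
        return X


train = [1.0, 2.0, 3.0]
model = SimplePipeline([MeanCenterer()]).fit(train)
centered = model.transform([4.0])
```

Because every object exposes the same few verbs, a pipeline is itself an object with those verbs, which is what makes composition cheap.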

configuration/run pattern (traits, pyre)
curry in functional programming (functools.partial)
Ideas from MVC pattern
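The link with currying can be made concrete with `functools.partial` from the standard library: the parameters are bound first, the data comes later. The `threshold` function below is a made-up example, not part of any library.

```python
from functools import partial


def threshold(values, cutoff, keep_above):
    """Hypothetical operation with two parameters and one data argument."""
    if keep_above:
        return [v for v in values if v >= cutoff]
    return [v for v in values if v < cutoff]


# Configuration step: bind the parameters, no data involved yet.
high_pass = partial(threshold, cutoff=0.5, keep_above=True)

# Run step: apply the configured operation to data.
result = high_pass([0.1, 0.7, 0.5, 0.3])
```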
4 Big data on small hardware

4 Big(gish) data on small(ish) hardware

“Big data”:
Petabytes...
Distributed storage
Computing cluster

Mere mortals:
Gigabytes...
Python programming
Off-the-shelf computers
4 On-line algorithms
Process the data one sample at a time
Compute the mean of a gazillion
numbers
Hard?

No: just do a running mean
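A running mean needs only the current estimate and a counter, so the data can be streamed one sample at a time. A minimal sketch (the function name is chosen for this example):

```python
def running_mean(stream):
    """Mean of an iterable, computed in one pass with constant memory."""
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        mean += (x - mean) / n      # incremental update of the estimate
    return mean


# A generator: the numbers are never all held in memory at once.
values = (float(i) for i in range(1, 1001))
result = running_mean(values)
```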

Converges to expectations
Mini-batch = a bunch of observations, for vectorization
Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)    11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)    0.62 s
4 On-the-fly data reduction

Big data is often I/O bound
Layer memory access
CPU caches
RAM
Local disks
Distant storage
Less data also means less work

Dropping data
1 loop: take a random fraction of the data
2 run algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation

Exploits redundancy across observations
Run the loop in parallel
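The three steps can be sketched with the standard library alone. This is an illustration under assumptions: the helper name, the 10% fraction, and averaging as the aggregation rule are choices made for this example, and the sampling is without replacement, whereas classical bagging resamples with replacement.

```python
import random
import statistics


def subsample_aggregate(data, estimator, fraction=0.1, n_rounds=20, seed=0):
    """Run `estimator` on random fractions of `data` and average the results."""
    rng = random.Random(seed)
    size = max(1, int(len(data) * fraction))
    estimates = []
    for _ in range(n_rounds):               # each round is independent,
        subset = rng.sample(data, size)     # so the loop could run in parallel
        estimates.append(estimator(subset))
    return statistics.mean(estimates)       # aggregate across sub-samplings


data = [float(i % 10) for i in range(10_000)]
estimate = subsample_aggregate(data, statistics.median)
```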

Random projections (will average features)
sklearn.random_projection
random linear combinations of the features

Fast clustering of features
sklearn.cluster.WardAgglomeration
on images: super-pixel strategy

Hashing
when observations have varying size
(e.g. words)
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
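To show why hashing is stateless, here is a toy version of the hashing trick using only the standard library. It is a sketch, not scikit-learn's implementation: the function name, the use of md5 as the hash, and the 16-bucket default are assumptions made for this example.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map a variable-length list of tokens to a fixed-size count vector."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1          # collisions simply add up
    return vector


# No vocabulary is stored, so documents can be processed independently.
doc_a = hash_features("the quick brown fox the".split())
doc_b = hash_features("fox".split())
```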
Example: randomized SVD
Random projection
sklearn.utils.extmath.randomized_svd
X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
4 Biggish iron
Our new box: 15 k€
48 cores
384G RAM
70T storage
(SSD cache on RAID controller)

Gets our work done faster than our 800 CPU cluster
It’s the access patterns!
“Nobody ever got fired for using Hadoop on a cluster”
A. Rowstron et al., HotCDP ’12
5 Avoiding the framework

joblib

5 Parallel processing

big picture

Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
Memory bus is a bottleneck
The right grain of parallelism
Too fine ⇒ overhead
Too coarse ⇒ memory shortage
Scale by the relevant cache pool

5 Parallel processing

joblib

>>> from math import sqrt
>>> from joblib import Parallel, delayed
>>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
...                    for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
5 Parallel processing

joblib

IPython, multiprocessing, celery, MPI?
joblib is higher-level

No dependencies, works everywhere
Better traceback reporting
Memmapping arrays to share memory (O. Grisel)
On-the-fly dispatch of jobs – memory-friendly
Threads or processes backend

5 Parallel processing

Queues

Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
⇒ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib
⇒ Back and forth communication
Door open to race conditions
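The idea of a dispatch queue that fills up slowly can be illustrated with a bounded queue from the standard library. This is a simplified sketch with threads, not joblib's implementation; `run_bounded` and `max_pending` are names invented for this example, and error handling in the workers is left out.

```python
import queue
import threading


def run_bounded(tasks, func, n_workers=2, max_pending=4):
    """Apply func to each task, keeping at most max_pending tasks queued."""
    todo = queue.Queue(maxsize=max_pending)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = todo.get()
            if item is None:            # sentinel: no more work
                break
            index, arg = item
            value = func(arg)
            with lock:                  # protect the shared results dict
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for index, arg in enumerate(tasks):
        todo.put((index, arg))          # blocks while the queue is full, so
                                        # tasks are only generated as needed
    for _ in threads:
        todo.put(None)
    for thread in threads:
        thread.join()
    return [results[i] for i in range(len(results))]


squares = run_bounded((i for i in range(8)), lambda x: x * x)
```

Because `tasks` can be a generator, inputs are materialized only a few at a time, which is the memory-friendly behavior the slide refers to.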
5 Parallel processing:

what happens where

joblib design: Caller, dispatch queue, and collect
queue in same process
Benefit: robustness
Grand-central dispatch design: dispatch queue has
a process of its own
Benefit: resource management in nested for loops
5 Caching
For reproducibility:
avoid manually chained scripts (make-like usage)
For performance:
avoiding re-computing is the crux of optimization

The joblib approach

Memoize pattern
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)
b = g(a)  # computes f(a) and stores the result
c = g(a)  # retrieves the result from the store
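To make the memoize pattern concrete without any dependency, here is a toy disk-backed cache built on `pickle` and `hashlib`. It is a sketch of the general idea, not how joblib is implemented; `disk_cache` and the file layout are assumptions of this example.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cache(func, cache_dir):
    """Return a version of func that stores its results in cache_dir."""
    def wrapper(*args):
        # Key: hash of the function name and of the pickled arguments.
        payload = pickle.dumps((func.__name__, args))
        key = hashlib.md5(payload).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as handle:
                return pickle.load(handle)      # cache hit: no recomputation
        result = func(*args)
        with open(path, "wb") as handle:
            pickle.dump(result, handle)
        return result
    return wrapper


calls = []


def f(x):
    calls.append(x)     # records every real computation
    return x ** 2


cache_dir = tempfile.mkdtemp()
g = disk_cache(f, cache_dir)
b = g(3)    # computes f(3) and stores the result on disk
c = g(3)    # loads the stored result, f is not called again
```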
Challenges in the context of big data
a & b are big
Design goals
a & b arbitrary Python objects
No dependencies
Drop-in, framework-less code
Lego bricks for out-of-core algorithms (coming soon)
>>> result = g.call_and_shelve(a)
>>> result
MemorizedResult(cachedir="...", func="g...", argument_hash="...")
>>> c = result.get()
5 Efficient input argument hashing – joblib.hash

Compute md5* of input arguments
Trade-off between features and cost
Black boxy
Robust and completely generic

Implementation

1. Create an md5 hash object
2. Subclass the standard-library pickler

= state machine that walks the object graph

3. Walk the object graph:

- ndarrays: pass data pointer to md5 algorithm
(“update” method)

- the rest: pickle

4. Update the md5 with the pickle
* md5 is in the Python standard library
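A simplified variant of these steps fits in a few lines. Instead of subclassing the pickler and special-casing arrays, this sketch streams the pickle output straight into the md5 object, so the full pickle is never held in memory. The names `_Md5Sink` and `hash_args` are made up for the example, and the approach is an assumption-laden simplification rather than joblib's code.

```python
import hashlib
import pickle


class _Md5Sink:
    """File-like object that feeds everything written to it into md5."""

    def __init__(self):
        self.md5 = hashlib.md5()

    def write(self, data):
        self.md5.update(data)       # step 4: update the md5 with the pickle
        return len(data)


def hash_args(*args, **kwargs):
    """Hex digest identifying a set of call arguments (hypothetical helper)."""
    sink = _Md5Sink()
    # The pickler walks the object graph and writes into the sink.
    pickler = pickle.Pickler(sink, protocol=4)
    pickler.dump((args, sorted(kwargs.items())))
    return sink.md5.hexdigest()


same_1 = hash_args(1, [2, 3], mode="fast")
same_2 = hash_args(1, [2, 3], mode="fast")
different = hash_args(1, [2, 4], mode="fast")
```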

5 Fast, disk-based, concurrent, store – joblib.dump
Persisting arbitrary objects
Once again sub-class the pickler
Use .npy for large numpy arrays (np.save),
pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (cluster)
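The "atomic operation + try/except" strategy can be illustrated on a single file: write to a temporary name, then rename into place. This sketch is an assumption-based example with the standard library (the slide talks about renaming a directory, here a file is renamed), and `atomic_dump` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def atomic_dump(obj, path):
    """Pickle obj to path so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one
    # filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle)
        os.replace(tmp_path, path)      # atomic rename into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)         # clean up the partial file
        raise


store = tempfile.mkdtemp()
target = os.path.join(store, "result.pkl")
atomic_dump({"answer": 42}, target)
with open(target, "rb") as handle:
    loaded = pickle.load(handle)
```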
5 Making I/O fast
Fast compression
CPU may be faster than disk access
in particular in parallel
Standard library: zlib.compress with buffers
(bypass gzip module to work online + in-memory)

Avoiding copies
zlib.compress: C-contiguous buffers
Copyless storage of raw buffer
+ meta-information (strides, class...)
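As a small illustration of compressing a raw buffer and keeping its meta-information separately, the sketch below uses `array.array` from the standard library as a stand-in for a numpy array. The helper names and the metadata fields are assumptions of this example, not joblib's on-disk format.

```python
import array
import zlib


def dump_buffer(arr, level=3):
    """Compress the raw buffer of arr; return (meta, compressed bytes)."""
    meta = {"typecode": arr.typecode, "length": len(arr)}
    # memoryview exposes the contiguous buffer without copying it first.
    payload = zlib.compress(memoryview(arr), level)
    return meta, payload


def load_buffer(meta, payload):
    """Rebuild the array from its meta-information and compressed buffer."""
    out = array.array(meta["typecode"])
    out.frombytes(zlib.decompress(payload))
    return out


original = array.array("d", [0.0] * 1000 + [1.5, 2.5])
meta, payload = dump_buffer(original)
restored = load_buffer(meta, payload)
```

A low compression level is used here on the assumption that speed matters more than ratio when the goal is to beat disk access.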

Single file dump (coming soon)
File opening is slow on cluster
Challenge: streaming the above for memory usage
What matters on large systems
Numbers of bytes stored
brings network/SATA bus down
Memory usage
brings compute nodes down
Number of atomic file access
brings shared storage down
5 Benchmarking to np.save and pytables

y axis scale: 1 is np.save
NeuroImaging data (MNI atlas)
6 The bigger picture: building
an ecosystem

Helping your future self
6 Community-based development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
~ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $6 million
COCOMO model,
http://www.ohloh.net/p/scikit-learn
6 The economics of open source
Code maintenance too expensive to be alone
scikit-learn ~ 300 emails/month
nipy ~ 45 emails/month
joblib ~ 45 emails/month
mayavi ~ 30 emails/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
6 Many eyes make code fast

Bench WiseRF anybody?

L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
6 6 steps to a community-driven project
1 Focus on quality
2 Build great docs and examples
3 Use github
4 Limit the technicality of your codebase
5 Releasing and packaging matter
6 Focus on your contributors,
give them credit, decision power
http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
6 Core project contributors

[Figure: normalized number of commits since 2009-06, per individual committer]
Credit: Fernando Perez, Gist 5843625
6 The tragedy of the commons
Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the
whole group’s long-term best interests by depleting
some common resource.
Wikipedia

Make it work, make it right, make it boring

Core projects (boring) taken for granted
⇒ Hard to fund, less excitement
They need citation, in papers & on corporate web pages
Solving problems that matter

The 80/20 rule
80% of the use cases can be solved
with 20% of the lines of code
scikit-learn, joblib, nilearn, ...

@GaelVaroquaux

I hope
Cutting-edge ... environment ... on a budget
1 Set the goals right
Don’t solve hard problems
What’s your original problem?

2 Use the simplest technological solutions possible
Be very technically sophisticated
Don’t use that sophistication

3 Don’t forget the human factors
With your users (documentation)
With your contributors


A perfect
design?

@GaelVaroquaux

Contenu connexe

Tendances

Towards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaTowards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaMaurizio Tomasi
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakesDataWorks Summit
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIAI Frontiers
 
Data-driven Hypothesis Management
Data-driven Hypothesis ManagementData-driven Hypothesis Management
Data-driven Hypothesis Managementbgoncalves2
 
Machine teaching tbo_20190518
Machine teaching tbo_20190518Machine teaching tbo_20190518
Machine teaching tbo_20190518Yi-Fan Liou
 
20190927 generative models_aia
20190927 generative models_aia20190927 generative models_aia
20190927 generative models_aiaYi-Fan Liou
 
Data Science With Python
Data Science With PythonData Science With Python
Data Science With PythonMosky Liu
 

Tendances (8)

Towards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of JuliaTowards new solutions for scientific computing: the case of Julia
Towards new solutions for scientific computing: the case of Julia
 
Machine learning on Hadoop data lakes
Machine learning on Hadoop data lakesMachine learning on Hadoop data lakes
Machine learning on Hadoop data lakes
 
Jay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AIJay Yagnik at AI Frontiers : A History Lesson on AI
Jay Yagnik at AI Frontiers : A History Lesson on AI
 
Data-driven Hypothesis Management
Data-driven Hypothesis ManagementData-driven Hypothesis Management
Data-driven Hypothesis Management
 
Machine teaching tbo_20190518
Machine teaching tbo_20190518Machine teaching tbo_20190518
Machine teaching tbo_20190518
 
20190927 generative models_aia
20190927 generative models_aia20190927 generative models_aia
20190927 generative models_aia
 
Data Science With Python
Data Science With PythonData Science With Python
Data Science With Python
 
H20: A platform for big math
H20: A platform for big math H20: A platform for big math
H20: A platform for big math
 

En vedette

Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataGael Varoquaux
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsGael Varoquaux
 
A hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstructionA hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstructionGael Varoquaux
 
Advanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysisAdvanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysisGael Varoquaux
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectGael Varoquaux
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIGael Varoquaux
 
Machine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsMachine learning and cognitive neuroimaging: new tools can answer new questions
Machine learning and cognitive neuroimaging: new tools can answer new questionsGael Varoquaux
 
Connectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsConnectomics: Parcellations and Network Analysis Methods
Connectomics: Parcellations and Network Analysis MethodsGael Varoquaux
 
Scikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonScikit learn: apprentissage statistique en Python
Scikit learn: apprentissage statistique en PythonGael Varoquaux
 
Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Scikit-learn: the state of the union 2016
Scikit-learn: the state of the union 2016Gael Varoquaux
 
Brain reading, compressive sensing, fMRI and statistical learning in Python
Brain reading, compressive sensing, fMRI and statistical learning in PythonBrain reading, compressive sensing, fMRI and statistical learning in Python
Brain reading, compressive sensing, fMRI and statistical learning in PythonGael Varoquaux
 
Kuala Lumpur Hotels Airport
Kuala Lumpur Hotels AirportKuala Lumpur Hotels Airport
Kuala Lumpur Hotels Airportquicksweet
 
Charles dellschau
Charles dellschau Charles dellschau
Charles dellschau Mossmickey
 
Adobe Acrobat Pro X - 2014 UVM Extension Professional Improvement Conference
Adobe Acrobat Pro X - 2014 UVM Extension Professional Improvement ConferenceAdobe Acrobat Pro X - 2014 UVM Extension Professional Improvement Conference
Adobe Acrobat Pro X - 2014 UVM Extension Professional Improvement ConferenceCathy Yandow
 
Presentacion Ad Media Net
Presentacion Ad Media NetPresentacion Ad Media Net
Presentacion Ad Media NetAndres Castillo
 
ACTFL Best of Toys 2011 3 modes presentation
ACTFL Best of Toys 2011  3 modes presentation ACTFL Best of Toys 2011  3 modes presentation
ACTFL Best of Toys 2011 3 modes presentation Toni Theisen
 
Newsletter AudiSEC Sàrl Prévention SST
Newsletter AudiSEC Sàrl Prévention SSTNewsletter AudiSEC Sàrl Prévention SST
Newsletter AudiSEC Sàrl Prévention SSTvalerienaef
 

En vedette (20)

Scientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of dataScientist meets web dev: how Python became the language of data
Scientist meets web dev: how Python became the language of data
 
Brain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizationsBrain maps from machine learning? Spatial regularizations
Brain maps from machine learning? Spatial regularizations
 
A hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstructionA hand-waving introduction to sparsity for compressed tomography reconstruction
A hand-waving introduction to sparsity for compressed tomography reconstruction
 
Advanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysisAdvanced network modelling 2: connectivity measures, goup analysis
Advanced network modelling 2: connectivity measures, goup analysis
 
Scikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the projectScikit-learn for easy machine learning: the vision, the tool, and the project
Scikit-learn for easy machine learning: the vision, the tool, and the project
 
Inter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRIInter-site autism biomarkers from resting state fMRI
Inter-site autism biomarkers from resting state fMRI
 
Building a cutting-edge data processing environment on a budget

  • 1. Building a cutting-edge data processing environment on a budget. Gaël Varoquaux. This talk is not about rocket science!
  • 2. Building a cutting-edge data processing environment on a budget. Gaël Varoquaux. Disclaimer: this talk is as much about people and projects as it is about code and algorithms.
  • 3. Growing up as a penniless academic: I did a PhD in quantum physics.
  • 4. Growing up as a penniless academic: I did a PhD in quantum physics. Vacuum (leaks), electronics (shorts), lasers (misalignment). Best training ever for agile project management.
  • 5. Growing up as a penniless academic: I did a PhD in quantum physics. Vacuum (leaks), electronics (shorts), lasers (misalignment). Computers were only one of the many moving parts: Matlab, instrument control.
  • 6. Growing up as a penniless academic: I did a PhD in quantum physics. Vacuum (leaks), electronics (shorts), lasers (misalignment). Computers were only one of the many moving parts: Matlab, instrument control. This shaped my vision of computing as a means to an end.
  • 7. Growing up as a penniless academic. 2011: tenured researcher in computer science.
  • 8. Growing up as a penniless academic. 2011: tenured researcher in computer science. Today: growing team with data science rock stars.
  • 9. 1 Using machine learning to understand brain function Link neural activity to thoughts and cognition G Varoquaux 6
  • 10. 1 Functional MRI: recordings of brain activity.
  • 11. 1 Cognitive neuroimaging: learn a bilateral link between brain activity and cognitive function.
  • 12. 1 Encoding models of stimuli: predicting neural response → a window into brain representations of stimuli. “Feature engineering”: a description of the world.
  • 13. 1 Decoding brain activity: “brain reading”.
  • 14. 1 Data processing feats: visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. “Brain reading”.
  • 15. 1 Data processing feats: visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. “If it’s not open and verifiable by others, it’s not science, or engineering...” (Stodden, 2010).
  • 16. 1 Data processing feats: visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. Make it work, make it right, make it boring.
  • 17. 1 Data processing feats: visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. Make it work, make it right, make it boring. http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html Code, data, ... just works™. http://nilearn.github.io
  • 18. 1 Data processing feats: visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]. Make it work, make it right, make it boring. Software development challenge. http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html Code, data, ... just works™. http://nilearn.github.io
  • 19. 1 Data accumulation: when data processing is routine... “big data” for rich models of brain function. Accumulation of scientific knowledge and learning formal representations.
  • 20. 1 Data accumulation: when data processing is routine... “big data” for rich models of brain function. “A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations.” Stephen Hawking, A Brief History of Time. Accumulation of scientific knowledge and learning formal representations.
  • 21. 1 Petty day-to-day technicalities: buggy code; slow code; lead data scientist leaves; new intern to train; I don’t understand the code I wrote a year ago.
  • 22. 1 Petty day-to-day technicalities: buggy code; slow code; lead data scientist leaves; new intern to train; I don’t understand the code I wrote a year ago. A lab is no different from a startup. Difficulties: recruitment, limited resources (people & hardware). Risks: bus factor, technical debt.
  • 23. 1 Petty day-to-day technicalities: buggy code; slow code; lead data scientist leaves; new intern to train; I don’t understand the code I wrote a year ago. A lab is no different from a startup. Difficulties: recruitment, limited resources (people & hardware). Risks: bus factor, technical debt. Our mission is to revolutionize brain data processing on a tight budget.
  • 24. 2 Patterns in data processing
  • 25. 2 The data processing workflow: agile. Interaction... → script... → module... ↻ interaction again... Consolidation, progressively. Low tech and short turn-around times.
  • 26. 2 From statistics to statistical learning: a paradigm shift as the dimensionality of data grows. # features, not only # samples. From parameter inference to prediction. Statistical learning is spreading everywhere.
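The shift "from parameter inference to prediction" can be illustrated with a minimal sketch: a model is judged by its error on data it has not seen, not by the fitted coefficients themselves. The toy data, the train/test split, and the helper names `fit_line` and `mean_squared_error` are assumptions made for illustration; they are not taken from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(slope, intercept, xs, ys):
    """Average squared prediction error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: roughly y = 2x + 1, with small fixed perturbations.
xs = [float(i) for i in range(8)]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]

# Fit on the first six samples, evaluate on the two held-out samples.
slope, intercept = fit_line(xs[:6], ys[:6])
print("fitted parameters:", slope, intercept)
print("held-out error:", mean_squared_error(slope, intercept, xs[6:], ys[6:]))
```

The first printed line is the inference view (what are the parameters?); the second is the prediction view (how well does the model generalize?).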
  • 27. 3 Let’s just make software to solve all these problems. G Varoquaux c Theodore W. Gray 17
  • 28. 3 Design philosophy 1. Don’t solve hard problems The original problem can be bent. 2. Easy setup, works out of the box Installing software sucks. Convention over configuration. 3. Fail gracefully Robust to errors. Easy to debug. 4. Quality, quality, quality What’s not excellent won’t be used. G Varoquaux 18
  • 29. 3 Design philosophy 1. Don’t solve hard problems The original problem can be bent. 2. Easy setup, works out of the box Installing software sucks. Not “one software to rule them all” Convention over configuration. Break down projects by expertise 3. Fail gracefully Robust to errors. Easy to debug. 4. Quality, quality, quality What’s not excellent won’t be used. G Varoquaux 18
  • 31. Vision Machine learning without learning the machinery Black box that can be opened Right trade-off between ”just works” and versatility (think Apple vs Linux) G Varoquaux 19
  • 32. Vision Machine learning without learning the machinery Black box that can be opened Right trade-off between ”just works” and versatility (think Apple vs Linux) We’re not going to solve all the problems for you I don’t solve hard problems Feature-engineering, domain-specific cases... Python is a programming language. Use it. Cover all the 80% usecases in one package G Varoquaux 19
  • 33. 3 Performance in high-level programming. High-level programming is what keeps us alive and kicking.
  • 34. 3 Performance in high-level programming. The secret sauce: optimize algorithms, not for loops. Know Numpy and Scipy perfectly: significant data should be arrays/memoryviews; avoid memory copies, rely on blas/lapack. Tools: line-profiler/memory-profiler, scipy-lectures.github.io. Cython, not C/C++.
  • 35. 3 Performance in high-level programming. The secret sauce: optimize algorithms, not for loops. Know Numpy and Scipy perfectly: significant data should be arrays/memoryviews; avoid memory copies, rely on blas/lapack. Tools: line-profiler/memory-profiler, scipy-lectures.github.io. Cython, not C/C++. Hierarchical clustering, PR #2199: 1. Take the 2 closest clusters. 2. Merge them. 3. Update the distance matrix. ... Faster with constraints: sparse distance matrix. Keep a heap queue of distances: cheap minimum. Need sparse growable structure for neighborhoods: skip-list in Cython! O(log n) insert, remove, access; bind C++ map[int, float] with Cython. Fast traversal, possibly in Cython, for step 3.
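The three clustering steps and the heap-queue idea can be sketched in pure Python. This is only an illustration under simplifying assumptions: it uses single linkage (so step 3 needs no distance update), a union-find structure instead of the skip-list or C++ map mentioned on the slide, and a hypothetical function name `sparse_single_linkage`. It is not the code of PR #2199.

```python
import heapq


def sparse_single_linkage(n_points, edges, n_clusters):
    """Agglomerate points, merging only along the allowed (sparse) edges.

    edges: iterable of (distance, i, j) tuples for neighbouring points.
    Returns one cluster label per point (the label is a representative index).
    """
    parent = list(range(n_points))

    def find(i):
        # Follow parent links to the cluster representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    heap = list(edges)
    heapq.heapify(heap)  # cheap access to the minimum distance
    remaining = n_points
    while heap and remaining > n_clusters:
        _, i, j = heapq.heappop(heap)      # step 1: take the 2 closest
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue                       # stale entry: already merged
        parent[root_j] = root_i            # step 2: merge them
        remaining -= 1                     # step 3 is a no-op for single linkage
    return [find(i) for i in range(n_points)]


# Five points on a chain; only consecutive points are neighbours.
chain_edges = [(0.1, 0, 1), (0.1, 1, 2), (4.8, 2, 3), (0.1, 3, 4)]
labels = sparse_single_linkage(5, chain_edges, n_clusters=2)
```

Stale heap entries are skipped lazily when popped, which avoids needing a remove operation on the heap.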
  • 37. 3 Architecture of a data-manipulation toolkit. Separate data from operations, but keep an imperative-like language. [Figure: blocks of numeric data; labels: bokeh, chaco, hadoop, Mayavi, CPUs]
  • 38. 3 Architecture of a data-manipulation toolkit. Separate data from operations, but keep an imperative-like language. Object API exposes a data-processing language: fit, predict, transform, score, partial_fit. Instantiated without data but with all the parameters. Objects: pipeline, merging, etc...
  • 39. 3 Architecture of a data-manipulation toolkit. Separate data from operations, but keep an imperative-like language. Object API exposes a data-processing language: fit, predict, transform, score, partial_fit. Instantiated without data but with all the parameters. Objects: pipeline, merging, etc... Configuration/run pattern: traits, pyre. Curry in functional programming: functools.partial. Ideas from MVC pattern.
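A toy illustration of "instantiated without data but with all the parameters" may help. The class `MeanCenterer` and the function `center` below are hypothetical names made up for this sketch; they are not scikit-learn objects. The second half shows the same configuration/run split written as a curried function with `functools.partial`.

```python
from functools import partial


class MeanCenterer:
    """Toy transformer: parameters at construction, data only at fit time."""

    def __init__(self, with_scaling=False):
        self.with_scaling = with_scaling  # configuration, no data yet

    def fit(self, X):
        n = len(X)
        self.mean_ = sum(X) / n
        if self.with_scaling:
            variance = sum((x - self.mean_) ** 2 for x in X) / n
            self.scale_ = variance ** 0.5 or 1.0  # avoid dividing by zero
        else:
            self.scale_ = 1.0
        return self  # returning self allows chaining fit(...).transform(...)

    def transform(self, X):
        return [(x - self.mean_) / self.scale_ for x in X]


def center(X, with_scaling=False):
    """Functional equivalent of configuring, fitting and transforming."""
    return MeanCenterer(with_scaling=with_scaling).fit(X).transform(X)


# Configuration step (no data), then run step (data only).
scaled_center = partial(center, with_scaling=True)
centered = MeanCenterer().fit([1.0, 2.0, 3.0]).transform([1.0, 2.0, 3.0])
scaled = scaled_center([1.0, 3.0])
```

Keeping the parameters on the object means a configured estimator can be passed around, copied, or placed in a pipeline before any data is seen.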
  • 40. 4 Big data on small hardware
  • 41. 4 Biggish data on smallish hardware. “Big data”: petabytes..., distributed storage, computing cluster. Mere mortals: gigabytes..., Python programming, off-the-shelf computers.
  • 42. 4 On-line algorithms. Process the data one sample at a time. Compute the mean of a gazillion numbers: hard?
  • 43. 4 On-line algorithms. Process the data one sample at a time. Compute the mean of a gazillion numbers: hard? No: just do a running mean.
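A running mean can be written in a few lines. The sketch below is a generic incremental-mean update; the function name `running_mean` is chosen for illustration. It consumes any iterable one sample at a time and never holds the data in memory.

```python
def running_mean(stream):
    """Return the mean of an iterable, seeing each sample exactly once."""
    mean = 0.0
    count = 0
    for x in stream:
        count += 1
        # Move the current estimate towards x by 1/count of the gap.
        mean += (x - mean) / count
    return mean


# Works on a generator: the numbers are never stored together.
result = running_mean(float(i) for i in range(1, 101))
```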
  • 44. 4 On-line algorithms. Converges to expectations. Mini-batch = bunch of observations, for vectorization. Example: K-Means clustering. X = np.random.normal(size=(10000, 200)). scipy.cluster.vq.kmeans(X, 10, iter=2): 11.33 s. sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X): 0.62 s.
  • 45. 4 On-the-fly data reduction. Big data is often I/O bound. Layer memory access: CPU caches, RAM, local disks, distant storage. Less data also means less work.
  • 46. 4 On-the-fly data reduction: dropping data. 1. Loop: take a random fraction of the data. 2. Run algorithm on that fraction. 3. Aggregate results across sub-samplings. Looks like bagging: bootstrap aggregation. Exploits redundancy across observations. Run the loop in parallel.
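The sub-sampling loop can be sketched as follows. This is an illustration with assumed names (`subsample_and_aggregate`, `fraction`, `n_rounds`); it samples without replacement, so it resembles bagging rather than being a strict bootstrap, and it aggregates by a plain average.

```python
import random
import statistics


def subsample_and_aggregate(data, estimator, fraction=0.1, n_rounds=20, seed=0):
    """Run `estimator` on random fractions of `data` and average the results."""
    rng = random.Random(seed)  # seeded for reproducible sub-samplings
    size = max(1, int(fraction * len(data)))
    estimates = []
    for _ in range(n_rounds):
        subset = rng.sample(data, size)       # 1. take a random fraction
        estimates.append(estimator(subset))   # 2. run the algorithm on it
    return statistics.mean(estimates)         # 3. aggregate across rounds


data = list(range(1000))
approx_mean = subsample_and_aggregate(data, statistics.mean)
```

Each round is independent of the others, which is why the loop can be run in parallel.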
  • 47. 4 On-the-fly data reduction. Random projections (will average features): sklearn.random_projection, random linear combinations of the features. Fast clustering of features: sklearn.cluster.WardAgglomeration; on images: super-pixel strategy. Hashing, when observations have varying size (e.g. words): sklearn.feature_extraction.text.HashingVectorizer; stateless: can be used in parallel.
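The hashing idea can be illustrated with the standard library alone. The function `hash_features` below is a hypothetical sketch, not HashingVectorizer: it uses md5 from hashlib as the hash function and simple counts, purely to show why the approach is stateless.

```python
import hashlib


def hash_features(tokens, n_features=16):
    """Map a variable-length list of tokens to a fixed-size count vector."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The column depends only on the token itself: no vocabulary to fit.
        index = int.from_bytes(digest[:4], "little") % n_features
        vector[index] += 1
    return vector


short_doc = hash_features("a b a".split())
long_doc = hash_features("the quick brown fox jumps over the lazy dog".split())
```

Because no vocabulary is stored, separate workers can vectorize different documents and still agree on the column of every token. Distinct tokens may collide in the same column, which is the price paid for the fixed size.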
  • 48. 4 On-the-fly data reduction. Example: randomized SVD. Random projection: sklearn.utils.extmath.randomized_svd. X = np.random.normal(size=(50000, 200)). %timeit lapack = linalg.svd(X, full_matrices=False): 1 loops, best of 3: 6.09 s per loop. %timeit arpack = splinalg.svds(X, 10): 1 loops, best of 3: 2.49 s per loop. %timeit randomized = randomized_svd(X, 10): 1 loops, best of 3: 303 ms per loop. linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000: 0.0022360679774997738. linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000: 0.0022121161221386925.
  • 49. 4 Biggish iron. Our new box (15 k€): 48 cores, 384G RAM, 70T storage (SSD cache on RAID controller). Gets our work done faster than our 800 CPU cluster. It’s the access patterns! “Nobody ever got fired for using Hadoop on a cluster”, A. Rowstron et al., HotCDP ’12.
  • 50. 5 Avoiding the framework: joblib
  • 51. 5 Parallel processing: big picture. Focus on embarrassingly parallel for loops. Life is too short to worry about deadlocks. Workers compete for data access: memory bus is a bottleneck. The right grain of parallelism: too fine ⇒ overhead; too coarse ⇒ memory shortage. Scale by the relevant cache pool.
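The grain of parallelism can be made concrete by batching loop iterations before dispatching them. The sketch below uses the standard library's concurrent.futures with threads, as a stand-in chosen so that it runs anywhere; it is not joblib, and the helper names (`chunks`, `process_chunk`, `parallel_map_in_chunks`) are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def chunks(items, size):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_chunk(chunk):
    """Work done by one task: a plain loop over one batch."""
    return [sqrt(x) for x in chunk]


def parallel_map_in_chunks(items, chunk_size, n_workers=2):
    """Apply sqrt to every item, dispatching one task per chunk."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(process_chunk, chunks(items, chunk_size))
        # pool.map preserves input order, so the parts concatenate in order.
        return [y for part in parts for y in part]


roots = parallel_map_in_chunks([i ** 2 for i in range(8)], chunk_size=3)
```

A `chunk_size` of 1 gives one task per item (dispatch overhead dominates); a chunk as large as the input gives a single task (no parallelism, and one worker holds everything in memory). For CPU-bound pure-Python work, processes rather than threads would normally be needed.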
  • 52. 5 Parallel processing: joblib. Focus on embarrassingly parallel for loops. Life is too short to worry about deadlocks. >>> from math import sqrt >>> from joblib import Parallel, delayed >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
  • 53. 5 Parallel processing: joblib. IPython, multiprocessing, celery, MPI? joblib is higher-level. No dependencies, works everywhere. Better traceback reporting. Memmapping arrays to share memory (O. Grisel). On-the-fly dispatch of jobs: memory-friendly. Threads or processes backend.
  • 55. 5 Parallel processing: queues. Queues: high-performance, concurrent-friendly. Difficulty: callback on result arrival ⇒ multiple threads in caller + risk of deadlocks. Dispatch queue should fill up “slowly” ⇒ pre_dispatch in joblib ⇒ back and forth communication. Door open to race conditions.
  • 56. 5 Parallel processing: what happens where. joblib design: caller, dispatch queue, and collect queue in same process. Benefit: robustness. Grand-central dispatch design: dispatch queue has a process of its own. Benefit: resource management in nested for loops.
  • 57. 5 Caching. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization.
  • 58. 5 Caching: the joblib approach. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization. Memoize pattern: mem = joblib.Memory(cachedir='.'); g = mem.cache(f); b = g(a) # computes b using f; c = g(a) # retrieves results from store
  • 59. 5 Caching: the joblib approach. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization. Challenges in the context of big data: a & b are big. Design goals: a & b arbitrary Python objects; no dependencies; drop-in, framework-less code. Memoize pattern: mem = joblib.Memory(cachedir='.'); g = mem.cache(f); b = g(a) # computes b using f; c = g(a) # retrieves results from store
  • 60. 5 Caching: the joblib approach. For reproducibility: avoid manually chained scripts (make-like usage). For performance: avoiding re-computing is the crux of optimization. Lego bricks for out-of-core algorithms (coming soon): >>> result = g.call_and_shelve(a) >>> result MemorizedResult(cachedir="...", func="g...", argument_hash="...") >>> c = result.get() Memoize pattern: mem = joblib.Memory(cachedir='.'); g = mem.cache(f); b = g(a) # computes b using f; c = g(a) # retrieves results from store
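The memoize-to-disk pattern can be sketched with the standard library. The function `disk_cache` below is a hypothetical, much-simplified stand-in for what a disk-backed cache does; it is not joblib's implementation, and it ignores keyword arguments, changes to the function's code, and concurrent writers.

```python
import hashlib
import os
import pickle
import tempfile


def disk_cache(func, cache_dir):
    """Wrap `func` so results are stored on disk, keyed by a hash of the inputs."""
    os.makedirs(cache_dir, exist_ok=True)

    def wrapper(*args):
        # The key depends on the function name and on the pickled arguments.
        key = hashlib.md5(pickle.dumps((func.__name__, args))).hexdigest()
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as fh:   # cache hit: load from the store
                return pickle.load(fh)
        result = func(*args)               # cache miss: compute and persist
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result

    return wrapper


calls = []


def slow_square(x):
    calls.append(x)  # records how many times the real function runs
    return x * x


with tempfile.TemporaryDirectory() as cache_dir:
    cached_square = disk_cache(slow_square, cache_dir)
    first = cached_square(3)    # computed
    second = cached_square(3)   # read back from disk
```

Since the store lives on disk, a cached result survives across interpreter sessions, which is what makes the pattern usable in place of manually chained scripts.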
  • 61. 5 Efficient input argument hashing: joblib.hash. Compute md5 of input arguments. Trade-off between features and cost. Black boxy. Robust and completely generic.
  • 62. 5 Efficient input argument hashing: joblib.hash. Compute md5* of input arguments. Implementation: 1. Create an md5 hash object. 2. Subclass the standard-library pickler = state machine that walks the object graph. 3. Walk the object graph: ndarrays: pass data pointer to md5 algorithm (“update” method); the rest: pickle. 4. Update the md5 with the pickle. (* md5 is in the Python standard library.)
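The four implementation steps can be approximated with the standard library. In the sketch below, `array.array` stands in for numpy arrays so that the example has no dependencies, and the pickler's `persistent_id` hook is used to intercept buffers; both choices, along with the names `BufferAwareHasher` and `hash_args`, are assumptions of this illustration rather than the way joblib.hash is written.

```python
import array
import hashlib
import io
import pickle


class BufferAwareHasher(pickle.Pickler):
    """Pickler subclass that feeds raw array buffers straight to md5."""

    def __init__(self):
        self._stream = io.BytesIO()
        super().__init__(self._stream, protocol=4)
        self._md5 = hashlib.md5()            # step 1: create the hash object

    def persistent_id(self, obj):
        # Steps 2-3: called for each object met while walking the graph.
        if isinstance(obj, array.array):
            self._md5.update(memoryview(obj))  # hash the buffer, no pickling
            return ("array", obj.typecode, len(obj))  # small placeholder
        return None                            # everything else: pickle it

    def hash(self, obj):
        self.dump(obj)
        self._md5.update(self._stream.getvalue())  # step 4: add the pickle
        return self._md5.hexdigest()


def hash_args(*args):
    """Return an md5 hex digest of arbitrary (picklable) arguments."""
    return BufferAwareHasher().hash(args)


h1 = hash_args(array.array("d", [1.0, 2.0]), "x")
h2 = hash_args(array.array("d", [1.0, 2.0]), "x")
h3 = hash_args(array.array("d", [1.0, 3.0]), "x")
```

Passing the buffer directly to `update` avoids building a pickled copy of large numerical data, which is the point of treating arrays separately.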
  • 63. 5 Fast, disk-based, concurrent store: joblib.dump. Persisting arbitrary objects: once again sub-class the pickler; use .npy for large numpy arrays (np.save), pickle for the rest ⇒ multiple files. Store concurrency issues: strategy: atomic operations + try/except. Renaming a directory is atomic. Directory layout consistent with remove operations. Good performance, usable on shared disks (cluster).
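The "atomic operations + try/except" strategy can be sketched as: write everything into a private temporary directory, then publish it with a single rename. The function `atomic_dump` and the file name `output.pkl` are hypothetical; the sketch assumes the temporary and final directories are on the same filesystem, and it is not joblib's store code.

```python
import os
import pickle
import shutil
import tempfile


def atomic_dump(obj, final_dir):
    """Persist `obj` under `final_dir`; readers never see a partial write."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    tmp_dir = tempfile.mkdtemp(dir=parent)   # private scratch directory
    with open(os.path.join(tmp_dir, "output.pkl"), "wb") as fh:
        pickle.dump(obj, fh)
    try:
        os.rename(tmp_dir, final_dir)        # single publishing operation
    except OSError:
        # Another writer published first: keep its result, drop ours.
        shutil.rmtree(tmp_dir, ignore_errors=True)
    return final_dir


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "result")
    atomic_dump({"a": 1}, target)
    atomic_dump({"a": 2}, target)            # arrives second, is discarded
    with open(os.path.join(target, "output.pkl"), "rb") as fh:
        stored = pickle.load(fh)
    entries = os.listdir(root)
```

No lock is needed: concurrent writers each work in their own directory, and the rename decides which one wins.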
  • 64. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory).
  • 65. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory). Avoiding copies: zlib.compress: C-contiguous buffers. Copyless storage of raw buffer + meta-information (strides, class...).
  • 66. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory). Avoiding copies: zlib.compress: C-contiguous buffers. Copyless storage of raw buffer + meta-information (strides, class...). Single file dump (coming soon): file opening is slow on cluster. Challenge: streaming the above for memory usage.
  • 67. 5 Making I/O fast. Fast compression: CPU may be faster than disk access, in particular in parallel. Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory). Avoiding copies: zlib.compress: C-contiguous buffers. Copyless storage of raw buffer + meta-information (strides, class...). Single file dump (coming soon): file opening is slow on cluster. Challenge: streaming the above for memory usage. What matters on large systems: number of bytes stored: brings network/SATA bus down. Memory usage: brings compute nodes down. Number of atomic file accesses: brings shared storage down.
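Compressing a raw buffer together with a little meta-information can be sketched with zlib from the standard library. As in the other sketches, `array.array` stands in for a numpy array, and the function names and the compression level 3 are arbitrary choices made for illustration.

```python
import array
import zlib


def compress_array(values):
    """Compress the raw buffer of an array and keep the meta-information."""
    meta = {"typecode": values.typecode, "length": len(values)}
    # memoryview exposes the contiguous buffer without copying it first.
    payload = zlib.compress(memoryview(values), 3)
    return meta, payload


def decompress_array(meta, payload):
    """Rebuild the array from its meta-information and compressed bytes."""
    restored = array.array(meta["typecode"])
    restored.frombytes(zlib.decompress(payload))
    if len(restored) != meta["length"]:
        raise ValueError("stored length does not match decompressed data")
    return restored


original = array.array("d", [0.0] * 1000)
meta, payload = compress_array(original)
roundtrip = decompress_array(meta, payload)
```

Whether compression pays off depends on the data and on how slow the disk or network is relative to the CPU, so measuring on real data is the only reliable guide.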
  • 68. 5 Benchmarking to np.save and pytables. NeuroImaging data (MNI atlas). [Figure: benchmark plot; y axis scale: 1 is np.save]
  • 69. 6 The bigger picture: building an ecosystem. Helping your future self.
  • 70. 6 Community-based development in scikit-learn. Huge feature set: benefits of a large team. Project growth: more than 200 contributors, ~12 core contributors, 1 full-time INRIA programmer from the start. Estimated cost of development: $6 million (COCOMO model, http://www.ohloh.net/p/scikit-learn).
  • 71. 6 The economics of open source. Code maintenance too expensive to be alone: scikit-learn ~300 emails/month, nipy ~45 emails/month, joblib ~45 emails/month, mayavi ~30 emails/month. “Hey Gael, I take it you’re too busy. That’s okay, I spent a day trying to install XXX and I think I’ll succeed myself. Next time though please don’t ignore my emails, I really don’t like it. You can say, ‘sorry, I have no time to help you.’ Just don’t ignore.”
  • 72. 6 The economics of open source. Code maintenance too expensive to be alone: scikit-learn ~300 emails/month, nipy ~45 emails/month, joblib ~45 emails/month, mayavi ~30 emails/month. Your “benefits” come from a fraction of the code. Data loading? Maybe? Standard algorithms? Nah. Share the common code... to avoid dying under code. Code becomes less precious with time. And somebody might contribute features.
  • 73. 6 Many eyes make code fast. Bench WiseRF anybody? L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
  • 74. 6 6 steps to a community-driven project. 1. Focus on quality. 2. Build great docs and examples. 3. Use github. 4. Limit the technicality of your codebase. 5. Releasing and packaging matter. 6. Focus on your contributors, give them credit, decision power. http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
  • 75. 6 Core project contributors. [Figure: normalized number of commits since 2009-06, per individual committer] Credit: Fernando Perez, Gist 5843625
  • 76. 6 The tragedy of the commons. “Individuals, acting independently and rationally according to each one’s self-interest, behave contrary to the whole group’s long-term best interests by depleting some common resource.” (Wikipedia) Make it work, make it right, make it boring. Core projects (boring) taken for granted ⇒ hard to fund, less excitement. They need citation, in papers & on corporate web pages.
  • 77. Solving problems that matter. The 80/20 rule: 80% of the use cases can be solved with 20% of the lines of code. scikit-learn, joblib, nilearn, ... I hope. @GaelVaroquaux
  • 78. Cutting-edge ... environment ... on a budget. 1. Set the goals right: don’t solve hard problems. What’s your original problem?
  • 79. Cutting-edge ... environment ... on a budget. 1. Set the goals right. 2. Use the simplest technological solutions possible: be very technically sophisticated; don’t use that sophistication.
  • 80. Cutting-edge ... environment ... on a budget. 1. Set the goals right. 2. Use the simplest technological solutions possible. 3. Don’t forget the human factors: with your users (documentation), with your contributors.
  • 81. Cutting-edge ... environment ... on a budget. 1. Set the goals right. 2. Use the simplest technological solutions possible. 3. Don’t forget the human factors. A perfect design?