The document discusses Python and its suitability for data science. It describes Python's Zen-like approach of focusing on simplicity and empowering users, surveys the PyData stack (NumPy, pandas, scikit-learn, and others) that enables rapid data analysis and model building, and describes the Anaconda distribution and the conda package manager for easily managing Python environments and packages. It then covers Blaze, Dask, and Numba for querying, scaling, and accelerating analyses.
5. Zen Approach
• Right Practice
• Right Attitude
• Right Understanding
“The Zen way of calligraphy is to write in the most straightforward,
simple way as if you were a beginner, not trying to make something
skillful or beautiful, but simply writing with full attention as if
you were discovering what you were writing for the first time.”
— Zen Mind, Beginner’s Mind
6. “Right Understanding”
Zen: The purpose of studying Buddhism is not to study Buddhism but to study ourselves. You are not your body.
Python: The purpose of writing Python code is not just to produce software, but to study ourselves. You are not your technology stack!
7. Pythonic Approach
• Compose language primitives, built-ins, and classes whenever possible
• Much more powerful and accessible than trying to memorize a huge list of proprietary functions
• Reject artificial distinctions between where data “should live” and where it gets computed
• Empower each individual to use their own knowledge, instead of taking design power out of their hands with pre-ordained architectures and “stacks”.
8. Why Python?
Analyst
• Uses graphical tools
• Can call functions, cut & paste code
• Can change some variables
• Gets paid for: insight
• Typical tools: Excel, VB, Tableau, Python

Analyst / Data Developer
• Builds simple apps & workflows
• Used to be "just an analyst"
• Likes coding to solve problems
• Doesn't want to be a "full-time programmer"
• Gets paid (like a rock star) for: code that produces insight
• Typical tools: SAS, R, Matlab, Python

Programmer
• Creates frameworks & compilers
• Uses IDEs
• Degree in CompSci
• Knows multiple languages
• Gets paid for: code
• Typical tools: C, C++, Java, JS, Python

Python fits all three roles.
9. Pythonic Data Analysis
• Make an immediate connection with the data (using Pandas+ / NumPy+ /
scikit-learn / Bokeh and matplotlib)
• The PyData stack means agility: you can rapidly build understanding with powerful modeling and easy manipulation.
• Let the data drive the analysis, not the technology stack.
• Empower the data scientist, quant, geophysicist, or biochemist directly, with simple language constructs that “fit their brain.”
• Scale-out is a later step, and Python can work with you there too: there are great tools no matter where your data is located, how it is stored, or how your cluster is managed. It is not tied to a particular distributed story.
10. Zen of Python
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
11. Zen of NumPy (and Pandas)
• Strided is better than scattered
• Contiguous is better than strided
• Descriptive is better than imperative (use data-types)
• Array-oriented and data-oriented is often better than object-oriented
• Broadcasting is a great idea – use where possible
• Split-apply-combine is a great idea – use where possible
• Vectorized is better than an explicit loop
• Write more ufuncs and generalized ufuncs (numba can help)
• Unless it’s complicated — then use numba
• Think in higher dimensions
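A small illustration of the vectorization and broadcasting bullets above (array sizes and names are purely illustrative):

import numpy as np

x = np.random.random((1000, 3))
mu = x.mean(axis=0)                 # shape (3,)

# explicit loop: one element at a time
out = np.empty_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        out[i, j] = x[i, j] - mu[j]

# vectorized with broadcasting: the (1000, 3) array and the (3,) vector
# line up on the trailing axis, so a single expression does the same work
out = x - mu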
12. “Zen of Data Science”
• Get more and better data.
• Better data is determined by better models.
• How you compute matters.
• Put the data in the hands and minds of people with knowledge.
• Fail quickly and often — but not in the same way.
• Where and how the data is stored is secondary to analysis and
understanding.
• Premature horizontal scaling is the root of all evil.
• When you must scale — data locality and parallel algorithms are the key.
• Learn to think in building blocks that can be parallelized.
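One hedged sketch of what “building blocks that can be parallelized” can mean in plain Python: a pure per-chunk function that is split, applied in parallel, and combined (the helper names are illustrative):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def block_stats(block):
    # pure function over one chunk: trivial to ship to any worker
    return block.sum(), block.size

if __name__ == '__main__':
    data = np.random.random(1000000)
    chunks = np.array_split(data, 8)                      # split
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(block_stats, chunks))    # apply in parallel
    mean = sum(s for s, _ in partials) / sum(n for _, n in partials)  # combine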
13. PyData Stack -- about 3,000,000 users
NumPy, SciPy, Matplotlib, pandas, Bokeh,
scikit-learn, scikit-image, statsmodels, Cython,
PyTables/Numexpr, Dask / Blaze, SymPy, Numba,
OpenCV, astropy, BioPython, GDAL, PySAL,
... many many more ...
14. How do we get 100s of dependencies delivered?
We started Continuum with 3 primary goals:
1. Scale the NumPy/PyData stack horizontally
2. Make it easy for Python users to produce data-science applications in the browser/notebook
3. Get more adoption of the PyData stack
So we wrote a package manager, conda, and a distribution of Python + R: Anaconda.
15. Game-Changing, Enterprise-Ready Python Distribution
• 2 million downloads in last 2 years
• 200k / month and growing
• conda package manager serves up 5 million
packages per month
• Recommended installer for IPython/Jupyter,
Pandas, SciPy, Scikit-learn, etc.
21. Conda features
• Excellent support for “system-level” environments — like having mini VMs but much
lighter weight than docker (micro containers)
• Minimizes code-copies (uses hard/soft links if possible)
• Simple format: binary tar-ball + metadata
• Metadata allows static analysis of dependencies
• Easy to create multiple “channels” which are repositories for packages
• User installable (no root privileges needed)
• Integrates very well with pip and other language-specific package managers.
• Cross Platform
22. Basic Conda Usage
Install a package:             conda install sympy
List all installed packages:   conda list
Search for packages:           conda search llvm
Create a new environment:      conda create -n py3k python=3
Remove a package:              conda remove nose
Get help:                      conda install --help
23. Advanced Conda Usage
Install a package in an environment:          conda install -n py3k sympy
Update all packages:                          conda update --all
Export list of packages:                      conda list --export packages.txt
Install packages from an export:              conda install --file packages.txt
See package history:                          conda list --revisions
Revert to a revision:                         conda install --revision 23
Remove unused packages and cached tarballs:   conda clean -pt
24. Environments
• Environments are simple: just link the package to a different directory
• Hard-links are very cheap, and very fast — even on Windows.
• Conda environments are completely independent installations of
everything
• No fiddling with PYTHONPATH or sym-linking site-packages
• “Activating” an environment just means changing your PATH so
that its bin/ or Scripts/ comes first.
conda create -n py3k python=3.5
• Unix: source activate py3k
• Windows: activate py3k
26. Anaconda Platform Analytics Repository
• Commercial long-term support
• Private, on-premise package mirror
• Proprietary tools for building custom
distribution, like Anaconda
• Enterprise tools for managing custom
packages and environments
• Available on the cloud at
http://anaconda.org
27. Anaconda Cluster: Anaconda + Hadoop + Spark
For Data Scientists:
• Rapidly, easily create clusters on EC2, DigitalOcean, on-prem cloud/provisioner
• Manage Python, R, Java, JS packages across the cluster
For Operations & IT:
• Robustly manage runtime state across the cluster
• Outside the scope of rpm, chef, puppet, etc.
• Isolate/sandbox packages & libraries for different jobs or groups of users
• Without introducing complexity of Docker / virtualization
• Cross platform - same tooling for laptops, workstations, servers, clusters
32. Blaze
• Infrastructure for meta-data, meta-compute, and expression graphs/dataflow
• Data glue for scale-up or scale-out
• Generic remote computation & query system
• (NumPy+Pandas+LINQ+OLAP+PADL).mashup()
Blaze is an extensible high-level interface for data analytics. It feels like NumPy/Pandas. It drives other data systems. Blaze expressions enable high-level reasoning. It’s an ecosystem of tools.
http://blaze.pydata.org
33. Glue 2.0
Python’s legacy as a powerful glue language:
• manipulate files
• call fast libraries
Next-gen Glue:
• Link data silos
• Link disjoint memory & compute
• Unify disparate runtime models
• Transcend legacy models of computers
47. APIs, syntax, language
[Diagram: Blaze expressions sit over data and runtime]
• blaze: expressions
• datashape: metadata
• odo: storage/containers
• dask: compute (parallelize, optimize, JIT)
48. Blaze
Interface to query data on different storage systems: http://blaze.pydata.org/en/latest/

from blaze import Data

iris = Data('iris.csv')                         # CSV
iris = Data('sqlite:///flowers.db::iris')       # SQL
iris = Data('mongodb://localhost/mydb::iris')   # MongoDB
iris = Data('iris.json')                        # JSON
iris = Data('s3://blaze-data/iris.csv')         # S3
…

Current focus is “dark data” and the PyData stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) plus customer needs (i.e. kdb, mongo).
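Once connected, expressions read like pandas. A hedged sketch against the CSV example above, using Blaze's split-apply-combine helper by(...) (the column names follow the dshape shown on the next slide):

from blaze import Data, by

iris = Data('iris.csv')
# mean petal length per species; the expression is symbolic and is computed
# by whichever backend actually holds the data
result = by(iris.species, avg_petal=iris.petal_length.mean())
print(result)   # displaying (or converting with odo) materializes it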
50. Blaze and datashape
Blaze uses datashape as its type system (like DyND)
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
    petal_length: float64,
    petal_width: float64,
    sepal_length: float64,
    sepal_width: float64,
    species: string
    }""")
51. Datashape
A structured data description language
http://datashape.pydata.org/
[Diagram: a datashape combines dimensions (e.g. var, 3, 4) with unit dtypes (e.g. string, int32, float64), joined by *]

var * { x : int32, y : string, z : float64 }   (a tabular datashape)
{ x : int32, y : string, z : float64 }   (a record: an ordered struct dtype, a collection of types keyed by labels)
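Datashape also ships as a standalone library, so the same description can be parsed and inspected programmatically; a minimal hedged sketch:

from datashape import dshape

ds = dshape("var * { x : int32, y : string, z : float64 }")
print(ds.shape)     # the dimensions, e.g. (var,)
print(ds.measure)   # the record dtype: {x: int32, y: string, z: float64}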
53. Blaze Server — Lights up your Dark Data

Builds off of the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml:

iriscsv:
    source: iris.csv
irisdb:
    source: sqlite:///flowers.db::iris
irisjson:
    source: iris.json
    dshape: "var * {name: string, amount: float64}"
irismongo:
    source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e
localhost:6363/compute.json
55. Compute recipes work with existing libraries and have multiple backends
• python list
• numpy arrays
• dynd
• pandas DataFrame
• Spark, Impala
• Mongo
• dask
56. • Ideally, you can layer expressions over any data
• Write once, deploy anywhere
• Practically, expressions will work better on specific data
structures, formats, and engines
• Use odo to copy from one format and/or engine to another
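A hedged sketch of that odo step, moving the iris data between a CSV file and a SQLite table (the URIs follow the Blaze examples from earlier slides):

import pandas as pd
from odo import odo

# copy the CSV contents into a SQLite table; odo discovers the datashape
# and picks a conversion path between the two formats
odo('iris.csv', 'sqlite:///flowers.db::iris')

# pull the same table back out into an in-memory pandas DataFrame
df = odo('sqlite:///flowers.db::iris', pd.DataFrame)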
57. Dask: Out-of-Core PyData
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler
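A minimal sketch of those ideas with dask.array (the sizes are arbitrary): the array is cut into blocks, the expression builds a task graph, and the scheduler runs the blocked algorithm with bounded memory:

import dask.array as da

x = da.random.random((20000, 20000), chunks=(1000, 1000))  # ~3.2 GB, in 1000x1000 blocks
y = (x - x.mean(axis=0)).sum()    # builds a task graph; nothing is computed yet
print(y.compute())                # the dynamic task scheduler executes the graph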
58. Example: Ocean Temp Data
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
• Every 1/4 degree: a 720 x 1440 array each day
59. Bigger data...
36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed
If you don't have this much RAM...
... better start chunking.
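A hedged sketch of what that chunking can look like with dask.array; the file and variable names ('sst.day.mean.nc', 'sst') are assumptions about the NOAA OISST layout:

import h5py                         # netCDF4 files are HDF5 underneath
import dask.array as da

f = h5py.File('sst.day.mean.nc', mode='r')
sst = da.from_array(f['sst'], chunks=(365, 180, 360))   # roughly one year per chunk
climatology = sst.mean(axis=0)      # mean over time, computed block by block
result = climatology.compute()      # never holds the full 51 GB in RAM at once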
69.
from dask import dataframe as dd
columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)
with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')
starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]
locs = dd.compute(starbucks.Longitude,
starbucks.Latitude,
dunkin.Longitude,
dunkin.Latitude)
# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
def draw_USA():
    """initialize a basemap centered on the continental USA"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)
m = draw_USA()
# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')
# Plot the values in Starbucks Green and Dunkin Donuts Orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True,
label="Starbucks", color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True,
label="Dunkin' Donuts", color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);
70. Distributed
Pythonic multiple-machine parallelism that understands Dask graphs
1) Defines a Center (dcenter) and Workers (dworker)
2) Simplified setup with dcluster, for example:
   dcluster 192.168.0.{1,2,3,4}
   or
   dcluster --hostfile hostfile.txt
3) Create Executor objects like concurrent.futures (Python 3) or futures (Python 2.7 back-port)
4) Data locality supported with ad-hoc task graphs by returning futures wherever possible
New library but stabilizing quickly — communicate with blaze-dev@continuum.io
71. Python and Hadoop (without the JVM)
Join the conversation!
Chat: http://gitter.im/blaze/dask
Email: blaze-dev@continuum.io
72. HDFS without Java
1. HDFS splits large files into many small blocks replicated on many
datanodes
2. For efficient computation we must use data directly on datanodes
3. distributed.hdfs queries the locations of the individual blocks
4. distributed executes functions directly on those blocks on the
datanodes
5. distributed+pandas enables distributed CSV processing on HDFS
in pure Python
6. Coming soon — dask on hdfs
73.
$ hdfs dfs -cp yellow_tripdata_2014-01.csv /data/nyctaxi/
>>> from distributed import hdfs
>>> blocks = hdfs.get_locations('/data/nyctaxi/', '192.168.50.100', 9000)
>>> columns = ['vendor_id', 'pickup_datetime', 'dropoff_datetime',
'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude',
'rate_code', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount',
'tolls_amount', 'total_amount']
>>> import pandas as pd
>>> from distributed import Executor
>>> executor = Executor('192.168.1.100:8787')
>>> dfs = [executor.submit(pd.read_csv, block['path'], workers=block['hosts'],
...                        columns=columns, skiprows=1)
...        for block in blocks]
These operations produce Future objects that point to remote results on the worker computers. This does not
pull results back to local memory. We can use these futures in later computations with the executor.
74.
def sum_series(seq):
    result = seq[0]
    for s in seq[1:]:
        result = result.add(s, fill_value=0)
    return result
>>> counts = executor.map(lambda df: df.passenger_count.value_counts(), dfs)
>>> total = executor.submit(sum_series, counts)
>>> total.result()
0 259
1 9727301
2 1891581
3 566248
4 267540
5 789070
6 540444
7 7
8 5
9 16
208 19
87. Additional Demos & Topics
• Airline flights
• Pandas table
• Streaming / Animation
• Large data rendering
88. Numba
• Dynamic, just-in-time compiler for Python & NumPy
• Uses LLVM
• Outputs x86 and GPU code (CUDA, HSA)
• (Premium version is in Accelerate, part of the Anaconda Workgroup and Anaconda Enterprise subscriptions)
http://numba.pydata.org
89. Python Compilation Space
• Ahead of Time, relies on CPython/libpython: Cython, Shedskin, Nuitka (today), Pythran
• Just in Time, relies on CPython/libpython: Numba, HOPE, Theano
• Ahead of Time, replaces CPython/libpython: Nuitka (future)
• Just in Time, replaces CPython/libpython: Pyston, PyPy
91.
from numba import jit

@jit('void(f8[:,:], f8[:,:], f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result
~1500x speed-up
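A quick way to exercise the kernel above (the shapes and the averaging filter are illustrative); the first call triggers compilation, subsequent calls run at full speed:

import numpy as np

image = np.random.random((2000, 2000))
filt = np.ones((15, 15)) / 225.0      # simple box-blur kernel
output = np.zeros_like(image)
filter(image, filt, output)           # fills output in place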
92. Numba Features
• Numba supports:
– Windows, OS X, and Linux
– 32 and 64-bit x86 CPUs and NVIDIA GPUs
– Python 2 and 3
– NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user’s system.
• < 70 MB to install.
• Does not replace the standard Python interpreter
(all of your existing Python libraries are still available)
93. Numba Modes
• object mode: Compiled code operates on Python objects. The only significant performance improvement comes from compiling loops that can be compiled in nopython mode (see below).
• nopython mode: Compiled code operates on “machine native” data.
Usually within 25% of the performance of equivalent C or FORTRAN.
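A hedged illustration of the two modes with a trivial kernel (function names are illustrative). Passing nopython=True makes Numba raise an error rather than silently fall back to object mode:

import numpy as np
from numba import jit

@jit(nopython=True)            # nopython mode: machine-native data, near-C speed
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

@jit                           # no flag: Numba tries nopython mode first and
def describe(x):               # falls back to object mode when it meets
    return {'mean': x.mean()}  # unsupported objects such as this dict

a = np.random.random(1000)
print(dot(a, a))
print(describe(a))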
95. The Basics
Annotations from the example on this slide:
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• Numba decorator (nopython=True not required)
2.7x speedup!
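The example code itself is not reproduced in this transcript; a minimal sketch that touches the same points (allocation, iteration, NumPy math, returning a slice), with an illustrative function name, might look like:

import numpy as np
from numba import jit

@jit                                  # nopython=True not required; it is inferred
def smoothed(x):
    out = np.empty(x.shape[0])        # array allocation
    i = 0
    for value in x:                   # looping over ndarray x as an iterator
        out[i] = np.sqrt(value * value + 1.0)   # numpy math functions
        i += 1
    return out[1:-1]                  # returning a slice of the array

x = np.random.random(1000000)
print(smoothed(x)[:5])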
96. CUDA Python (in open-source Numba!)
CUDA development using Python syntax for optimal performance!
You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU.
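A small hedged sketch of a CUDA kernel in open-source Numba (sizes and names are illustrative): you write the per-thread body, then choose the launch configuration yourself:

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)                  # absolute index of this thread
    if i < x.size:                    # guard: the grid may overshoot the array
        out[i] = x[i] + y[i]

n = 100000
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)   # kernels launch in parallel on the GPU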
99. Other interesting things
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize); a short sketch follows this list
• Call ctypes and cffi functions directly and pass them as arguments
• Preliminary support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.20.0/
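For the generalized-ufunc bullet above, a hedged sketch with @guvectorize (the signature shorthand mirrors the f8 notation on slide 91; names are illustrative):

import numpy as np
from numba import guvectorize

@guvectorize(['void(f8[:], f8, f8[:])'], '(n),()->(n)')
def shift(x, offset, out):
    # the core operation works on one row; broadcasting over any leading
    # dimensions comes for free, just like a regular NumPy ufunc
    for i in range(x.shape[0]):
        out[i] = x[i] + offset

a = np.arange(12.0).reshape(3, 4)
print(shift(a, 10.0))                 # applied row by row, returns a (3, 4) array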
100. What Doesn’t Work?
(A non-comprehensive list)
• Sets, lists, dictionaries, user defined classes (tuples do work!)
• List, set and dictionary comprehensions
• Recursion
• Exceptions with non-constant parameters
• Most string operations (buffer support is very preliminary!)
• yield from
• closures inside a JIT function (compiling JIT functions inside a closure works…)
• Modifying globals
• Passing an axis argument to numpy array reduction functions
• Easy debugging (you have to debug in Python mode).
101. How Numba Works
[Pipeline: the Python function’s bytecode plus its argument types go through bytecode analysis and type inference to produce Numba IR; the IR is rewritten, lowered to LLVM IR, JIT-compiled by LLVM to machine code, then cached and executed]

@jit
def do_math(a, b):
    …

>>> do_math(x, y)
102. Recently Added Numba Features
• A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in) — both LLVM and pre-compiled
• A simulator for debugging GPU functions with the Python debugger on the CPU
• Can choose to release the GIL in nopython functions
• Ahead of time compilation
• vectorize and guvectorize on GPU and parallel targets now in open-source Numba!
• JIT Classes — coming soon!