The document discusses Python and its suitability for data science. It describes Python's Zen-like approach of focusing on simplicity and empowering users, surveys the PyData stack (NumPy, pandas, scikit-learn, and others) that enables rapid data analysis and model building, and describes the Anaconda distribution and the conda package manager for easily managing Python environments and packages. It then covers Blaze, Dask, and Numba for querying, scaling, and accelerating analyses.
5. Zen Approach
• Right Practice
• Right Attitude
• Right Understanding
“The Zen way of calligraphy is to write in the most straightforward,
simple way as if you were a beginner, not trying to make something
skillful or beautiful, but simply writing with full attention as if
you were discovering what you were writing for the first time.”
— Zen Mind, Beginner’s Mind
6. “Right Understanding”
Zen: The purpose of studying Buddhism is not to study Buddhism but to study ourselves. You are not your body.
Python: The purpose of writing Python code is not just to produce software, but to study ourselves. You are not your technology stack!
7. Pythonic Approach
• Compose language primitives, built-ins, and classes whenever possible
• Much more powerful and accessible than trying to memorize a huge list of proprietary functions
• Reject artificial distinctions between where data “should live” and where it gets computed
• Empower each individual to use their own knowledge, instead of taking design power out of their hands with pre-ordained architectures and “stacks”.
8. Why Python?
Analyst
• Uses graphical tools
• Can call functions, cut & paste code
• Can change some variables
• Gets paid for: insight
• Typical tools: Excel, VB, Tableau, Python

Analyst / Data Developer
• Builds simple apps & workflows
• Used to be "just an analyst"
• Likes coding to solve problems
• Doesn't want to be a "full-time programmer"
• Gets paid (like a rock star) for: code that produces insight
• Typical tools: SAS, R, Matlab, Python

Programmer
• Creates frameworks & compilers
• Uses IDEs
• Degree in CompSci
• Knows multiple languages
• Gets paid for: code
• Typical tools: C, C++, Java, JS, Python

Python fits all three roles.
9. Pythonic Data Analysis
• Make an immediate connection with the data (using Pandas+ / NumPy+ /
scikit-learn / Bokeh and matplotlib)
• The PyData stack means agility: you can rapidly build understanding with powerful modeling and easy manipulation.
• Let the data drive the analysis, not the technology stack.
• Empower the data scientist, quant, geophysicist, or biochemist directly, with simple language constructs that “fit their brain.”
• Scale-out is a later step, and Python can work with you there too: there are great tools no matter where your data is located, how it is stored, or how your cluster is managed. It is not tied to a particular distributed story.
10. Zen of Python
>>> import this
The Zen of Python, by Tim Peters
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
11. Zen of NumPy (and Pandas)
• Strided is better than scattered
• Contiguous is better than strided
• Descriptive is better than imperative (use data-types)
• Array-oriented and data-oriented is often better than object-oriented
• Broadcasting is a great idea – use where possible
• Split-apply-combine is a great idea – use where possible
• Vectorized is better than an explicit loop
• Write more ufuncs and generalized ufuncs (numba can help)
• Unless it’s complicated — then use numba
• Think in higher dimensions
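A small illustration of the vectorization and broadcasting bullets above (array sizes and names are purely illustrative):

import numpy as np

x = np.random.random((1000, 3))
mu = x.mean(axis=0)                 # shape (3,)

# explicit loop: one element at a time
out = np.empty_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        out[i, j] = x[i, j] - mu[j]

# vectorized with broadcasting: the (1000, 3) array and the (3,) vector
# line up on the trailing axis, so a single expression does the same work
out = x - mu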
12. “Zen of Data Science”
• Get more and better data.
• Better data is determined by better models.
• How you compute matters.
• Put the data in the hands and minds of people with knowledge.
• Fail quickly and often — but not in the same way.
• Where and how the data is stored is secondary to analysis and
understanding.
• Premature horizontal scaling is the root of all evil.
• When you must scale — data locality and parallel algorithms are the key.
• Learn to think in building blocks that can be parallelized.
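One hedged sketch of what “building blocks that can be parallelized” can mean in plain Python: a pure per-chunk function that is split, applied in parallel, and combined (the helper names are illustrative):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def block_stats(block):
    # pure function over one chunk: trivial to ship to any worker
    return block.sum(), block.size

if __name__ == '__main__':
    data = np.random.random(1000000)
    chunks = np.array_split(data, 8)                      # split
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(block_stats, chunks))    # apply in parallel
    mean = sum(s for s, _ in partials) / sum(n for _, n in partials)  # combine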
13. PyData Stack -- about 3,000,000 users
NumPy, SciPy, Matplotlib, pandas, Bokeh,
scikit-learn, scikit-image, statsmodels, Cython,
PyTables/Numexpr, Dask / Blaze, SymPy, Numba,
OpenCV, astropy, BioPython, GDAL, PySAL,
... many many more ...
14. How do we get 100s of dependencies delivered?
We started Continuum with 3 primary goals:
1. Scale the NumPy/PyData stack horizontally
2. Make it easy for Python users to produce data-science applications in the browser/notebook
3. Get more adoption of the PyData stack
So we wrote a package manager, conda, and a distribution of Python + R: Anaconda.
15. Game-Changing, Enterprise-Ready Python Distribution
• 2 million downloads in last 2 years
• 200k / month and growing
• conda package manager serves up 5 million
packages per month
• Recommended installer for IPython/Jupyter,
Pandas, SciPy, Scikit-learn, etc.
21. Conda features
• Excellent support for “system-level” environments — like having mini VMs but much
lighter weight than docker (micro containers)
• Minimizes code-copies (uses hard/soft links if possible)
• Simple format: binary tar-ball + metadata
• Metadata allows static analysis of dependencies
• Easy to create multiple “channels” which are repositories for packages
• User installable (no root privileges needed)
• Integrates very well with pip and other language-specific package managers.
• Cross Platform
22. Basic Conda Usage
Install a package:             conda install sympy
List all installed packages:   conda list
Search for packages:           conda search llvm
Create a new environment:      conda create -n py3k python=3
Remove a package:              conda remove nose
Get help:                      conda install --help
23. Advanced Conda Usage
Install a package in an environment:          conda install -n py3k sympy
Update all packages:                          conda update --all
Export list of packages:                      conda list --export packages.txt
Install packages from an export:              conda install --file packages.txt
See package history:                          conda list --revisions
Revert to a revision:                         conda install --revision 23
Remove unused packages and cached tarballs:   conda clean -pt
24. Environments
• Environments are simple: just link the package to a different directory
• Hard-links are very cheap, and very fast — even on Windows.
• Conda environments are completely independent installations of
everything
• No fiddling with PYTHONPATH or sym-linking site-packages
• “Activating” an environment just means changing your PATH so
that its bin/ or Scripts/ comes first.
conda create -n py3k python=3.5
• Unix: source activate py3k
• Windows: activate py3k
26. Anaconda Platform Analytics Repository
• Commercial long-term support
• Private, on-premise package mirror
• Proprietary tools for building custom
distribution, like Anaconda
• Enterprise tools for managing custom
packages and environments
• Available on the cloud at
http://anaconda.org
27. Anaconda Cluster: Anaconda + Hadoop + Spark
For Data Scientists:
• Rapidly, easily create clusters on EC2, DigitalOcean, on-prem cloud/provisioner
• Manage Python, R, Java, JS packages across the cluster
For Operations & IT:
• Robustly manage runtime state across the cluster
• Outside the scope of rpm, chef, puppet, etc.
• Isolate/sandbox packages & libraries for different jobs or groups of users
• Without introducing complexity of Docker / virtualization
• Cross platform - same tooling for laptops, workstations, servers, clusters
32. Blaze
• Infrastructure for meta-data, meta-compute, and expression graphs/dataflow
• Data glue for scale-up or scale-out
• Generic remote computation & query system
• (NumPy+Pandas+LINQ+OLAP+PADL).mashup()
Blaze is an extensible high-level interface for data analytics. It feels like NumPy/Pandas. It drives other data systems. Blaze expressions enable high-level reasoning. It’s an ecosystem of tools.
http://blaze.pydata.org
33. Glue 2.0
Python’s legacy as a powerful glue language:
• manipulate files
• call fast libraries
Next-gen Glue:
• Link data silos
• Link disjoint memory & compute
• Unify disparate runtime models
• Transcend legacy models of computers
47. APIs, syntax, language
[Diagram: Blaze expressions sit over data and runtime]
• blaze: expressions
• datashape: metadata
• odo: storage/containers
• dask: compute (parallelize, optimize, JIT)
48. Blaze
Interface to query data on different storage systems: http://blaze.pydata.org/en/latest/

from blaze import Data

iris = Data('iris.csv')                         # CSV
iris = Data('sqlite:///flowers.db::iris')       # SQL
iris = Data('mongodb://localhost/mydb::iris')   # MongoDB
iris = Data('iris.json')                        # JSON
iris = Data('s3://blaze-data/iris.csv')         # S3
…

Current focus is “dark data” and the PyData stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) plus customer needs (i.e. kdb, mongo).
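Once connected, expressions read like pandas. A hedged sketch against the CSV example above, using Blaze's split-apply-combine helper by(...) (the column names follow the dshape shown on the next slide):

from blaze import Data, by

iris = Data('iris.csv')
# mean petal length per species; the expression is symbolic and is computed
# by whichever backend actually holds the data
result = by(iris.species, avg_petal=iris.petal_length.mean())
print(result)   # displaying (or converting with odo) materializes it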
50. Blaze and datashape
Blaze uses datashape as its type system (like DyND)
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
    petal_length: float64,
    petal_width: float64,
    sepal_length: float64,
    sepal_width: float64,
    species: string
    }""")
51. Datashape
A structured data description language
http://datashape.pydata.org/
[Diagram: a datashape combines dimensions (e.g. var, 3, 4) with unit dtypes (e.g. string, int32, float64), joined by *]

var * { x : int32, y : string, z : float64 }   (a tabular datashape)
{ x : int32, y : string, z : float64 }   (a record: an ordered struct dtype, a collection of types keyed by labels)
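Datashape also ships as a standalone library, so the same description can be parsed and inspected programmatically; a minimal hedged sketch:

from datashape import dshape

ds = dshape("var * { x : int32, y : string, z : float64 }")
print(ds.shape)     # the dimensions, e.g. (var,)
print(ds.measure)   # the record dtype: {x: int32, y: string, z: float64}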
53. Blaze Server — Lights up your Dark Data

Builds off of the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml:

iriscsv:
    source: iris.csv
irisdb:
    source: sqlite:///flowers.db::iris
irisjson:
    source: iris.json
    dshape: "var * {name: string, amount: float64}"
irismongo:
    source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e
localhost:6363/compute.json
55. Compute recipes work with existing libraries and have multiple backends
• python list
• numpy arrays
• dynd
• pandas DataFrame
• Spark, Impala
• Mongo
• dask
56. • Ideally, you can layer expressions over any data
• Write once, deploy anywhere
• Practically, expressions will work better on specific data
structures, formats, and engines
• Use odo to copy from one format and/or engine to another
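A hedged sketch of that odo step, moving the iris data between a CSV file and a SQLite table (the URIs follow the Blaze examples from earlier slides):

import pandas as pd
from odo import odo

# copy the CSV contents into a SQLite table; odo discovers the datashape
# and picks a conversion path between the two formats
odo('iris.csv', 'sqlite:///flowers.db::iris')

# pull the same table back out into an in-memory pandas DataFrame
df = odo('sqlite:///flowers.db::iris', pd.DataFrame)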
57. Dask: Out-of-Core PyData
• A parallel computing framework
• That leverages the excellent Python ecosystem
• Using blocked algorithms and task scheduling
• Written in pure Python
Core Ideas
• Dynamic task scheduling yields sane parallelism
• Simple library to enable parallelism
• Dask.array/dataframe to encapsulate the functionality
• Distributed scheduler
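A minimal sketch of those ideas with dask.array (the sizes are arbitrary): the array is cut into blocks, the expression builds a task graph, and the scheduler runs the blocked algorithm with bounded memory:

import dask.array as da

x = da.random.random((20000, 20000), chunks=(1000, 1000))  # ~3.2 GB, in 1000x1000 blocks
y = (x - x.mean(axis=0)).sum()    # builds a task graph; nothing is computed yet
print(y.compute())                # the dynamic task scheduler executes the graph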
58. Example: Ocean Temp Data
• http://www.esrl.noaa.gov/psd/data/gridded/data.noaa.oisst.v2.highres.html
• Every 1/4 degree: a 720 x 1440 array each day
59. Bigger data...
36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed
If you don't have this much RAM...
... better start chunking.
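A hedged sketch of what that chunking can look like with dask.array; the file and variable names ('sst.day.mean.nc', 'sst') are assumptions about the NOAA OISST layout:

import h5py                         # netCDF4 files are HDF5 underneath
import dask.array as da

f = h5py.File('sst.day.mean.nc', mode='r')
sst = da.from_array(f['sst'], chunks=(365, 180, 360))   # roughly one year per chunk
climatology = sst.mean(axis=0)      # mean over time, computed block by block
result = climatology.compute()      # never holds the full 51 GB in RAM at once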
69.
from dask import dataframe as dd
columns = ["name", "amenity", "Longitude", "Latitude"]
data = dd.read_csv('POIWorld.csv', usecols=columns)
with_name = data[data.name.notnull()]
with_amenity = data[data.amenity.notnull()]
is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
is_dunkin = with_name.name.str.contains('[Dd]unkin')
starbucks = with_name[is_starbucks]
dunkin = with_name[is_dunkin]
locs = dd.compute(starbucks.Longitude,
starbucks.Latitude,
dunkin.Longitude,
dunkin.Latitude)
# extract arrays of values from the series:
lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
def draw_USA():
    """initialize a basemap centered on the continental USA"""
    plt.figure(figsize=(14, 10))
    return Basemap(projection='lcc', resolution='l',
                   llcrnrlon=-119, urcrnrlon=-64,
                   llcrnrlat=22, urcrnrlat=49,
                   lat_1=33, lat_2=45, lon_0=-95,
                   area_thresh=10000)
m = draw_USA()
# Draw map background
m.fillcontinents(color='white', lake_color='#eeeeee')
m.drawstates(color='lightgray')
m.drawcoastlines(color='lightgray')
m.drawcountries(color='lightgray')
m.drawmapboundary(fill_color='#eeeeee')
# Plot the values in Starbucks Green and Dunkin Donuts Orange
style = dict(s=5, marker='o', alpha=0.5, zorder=2)
m.scatter(lon_s, lat_s, latlon=True,
label="Starbucks", color='#00592D', **style)
m.scatter(lon_d, lat_d, latlon=True,
label="Dunkin' Donuts", color='#FC772A', **style)
plt.legend(loc='lower left', frameon=False);
70. Distributed
Pythonic multiple-machine parallelism that understands Dask graphs
1) Defines a Center (dcenter) and Workers (dworker)
2) Simplified setup with dcluster, for example:
   dcluster 192.168.0.{1,2,3,4}
   or
   dcluster --hostfile hostfile.txt
3) Create Executor objects like concurrent.futures (Python 3) or futures (Python 2.7 back-port)
4) Data locality supported with ad-hoc task graphs by returning futures wherever possible
New library but stabilizing quickly — communicate with blaze-dev@continuum.io
71. Python and Hadoop (without the JVM)
Join the conversation!
Chat: http://gitter.im/blaze/dask
Email: blaze-dev@continuum.io
72. HDFS without Java
1. HDFS splits large files into many small blocks replicated on many
datanodes
2. For efficient computation we must use data directly on datanodes
3. distributed.hdfs queries the locations of the individual blocks
4. distributed executes functions directly on those blocks on the
datanodes
5. distributed+pandas enables distributed CSV processing on HDFS
in pure Python
6. Coming soon — dask on hdfs
73.
$ hdfs dfs -cp yellow_tripdata_2014-01.csv /data/nyctaxi/
>>> from distributed import hdfs
>>> blocks = hdfs.get_locations('/data/nyctaxi/', '192.168.50.100', 9000)
>>> columns = ['vendor_id', 'pickup_datetime', 'dropoff_datetime',
'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude',
'rate_code', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount',
'tolls_amount', 'total_amount']
>>> import pandas as pd
>>> from distributed import Executor
>>> executor = Executor('192.168.1.100:8787')
>>> dfs = [executor.submit(pd.read_csv, block['path'], workers=block['hosts'],
...                        columns=columns, skiprows=1)
...        for block in blocks]
These operations produce Future objects that point to remote results on the worker computers. This does not
pull results back to local memory. We can use these futures in later computations with the executor.
74.
def sum_series(seq):
    result = seq[0]
    for s in seq[1:]:
        result = result.add(s, fill_value=0)
    return result
>>> counts = executor.map(lambda df: df.passenger_count.value_counts(), dfs)
>>> total = executor.submit(sum_series, counts)
>>> total.result()
0 259
1 9727301
2 1891581
3 566248
4 267540
5 789070
6 540444
7 7
8 5
9 16
208 19
87. Additional Demos & Topics
• Airline flights
• Pandas table
• Streaming / Animation
• Large data rendering
88. Numba
• Dynamic, just-in-time compiler for Python & NumPy
• Uses LLVM
• Outputs x86 and GPU code (CUDA, HSA)
• (Premium version is in Accelerate, part of the Anaconda Workgroup and Anaconda Enterprise subscriptions)
http://numba.pydata.org
89. Python Compilation Space
• Ahead of Time, relies on CPython/libpython: Cython, Shedskin, Nuitka (today), Pythran
• Just in Time, relies on CPython/libpython: Numba, HOPE, Theano
• Ahead of Time, replaces CPython/libpython: Nuitka (future)
• Just in Time, replaces CPython/libpython: Pyston, PyPy
91.
from numba import jit

@jit('void(f8[:,:], f8[:,:], f8[:,:])')
def filter(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m//2, M-m//2):
        for j in range(n//2, N-n//2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i+k-m//2, j+l-n//2] * filt[k, l]
            output[i, j] = result
~1500x speed-up
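A quick way to exercise the kernel above (the shapes and the averaging filter are illustrative); the first call triggers compilation, subsequent calls run at full speed:

import numpy as np

image = np.random.random((2000, 2000))
filt = np.ones((15, 15)) / 225.0      # simple box-blur kernel
output = np.zeros_like(image)
filter(image, filt, output)           # fills output in place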
92. Numba Features
• Numba supports:
– Windows, OS X, and Linux
– 32 and 64-bit x86 CPUs and NVIDIA GPUs
– Python 2 and 3
– NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user’s system.
• < 70 MB to install.
• Does not replace the standard Python interpreter
(all of your existing Python libraries are still available)
93. Numba Modes
• object mode: Compiled code operates on Python objects. The only significant performance improvement comes from compiling loops that can be compiled in nopython mode (see below).
• nopython mode: Compiled code operates on “machine native” data.
Usually within 25% of the performance of equivalent C or FORTRAN.
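A hedged illustration of the two modes with a trivial kernel (function names are illustrative). Passing nopython=True makes Numba raise an error rather than silently fall back to object mode:

import numpy as np
from numba import jit

@jit(nopython=True)            # nopython mode: machine-native data, near-C speed
def dot(a, b):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total

@jit                           # no flag: Numba tries nopython mode first and
def describe(x):               # falls back to object mode when it meets
    return {'mean': x.mean()}  # unsupported objects such as this dict

a = np.random.random(1000)
print(dot(a, a))
print(describe(a))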
95. The Basics
Annotations from the example on this slide:
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• Numba decorator (nopython=True not required)
2.7x speedup!
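The example code itself is not reproduced in this transcript; a minimal sketch that touches the same points (allocation, iteration, NumPy math, returning a slice), with an illustrative function name, might look like:

import numpy as np
from numba import jit

@jit                                  # nopython=True not required; it is inferred
def smoothed(x):
    out = np.empty(x.shape[0])        # array allocation
    i = 0
    for value in x:                   # looping over ndarray x as an iterator
        out[i] = np.sqrt(value * value + 1.0)   # numpy math functions
        i += 1
    return out[1:-1]                  # returning a slice of the array

x = np.random.random(1000000)
print(smoothed(x)[:5])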
96. CUDA Python (in open-source Numba!)
CUDA development using Python syntax for optimal performance!
You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU.
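A small hedged sketch of a CUDA kernel in open-source Numba (sizes and names are illustrative): you write the per-thread body, then choose the launch configuration yourself:

import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)                  # absolute index of this thread
    if i < x.size:                    # guard: the grid may overshoot the array
        out[i] = x[i] + y[i]

n = 100000
x = np.arange(n, dtype=np.float32)
y = 2 * x
out = np.zeros_like(x)

threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)   # kernels launch in parallel on the GPU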
99. Other interesting things
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize); a short sketch follows this list
• Call ctypes and cffi functions directly and pass them as arguments
• Preliminary support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.20.0/
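For the generalized-ufunc bullet above, a hedged sketch with @guvectorize (the signature shorthand mirrors the f8 notation on slide 91; names are illustrative):

import numpy as np
from numba import guvectorize

@guvectorize(['void(f8[:], f8, f8[:])'], '(n),()->(n)')
def shift(x, offset, out):
    # the core operation works on one row; broadcasting over any leading
    # dimensions comes for free, just like a regular NumPy ufunc
    for i in range(x.shape[0]):
        out[i] = x[i] + offset

a = np.arange(12.0).reshape(3, 4)
print(shift(a, 10.0))                 # applied row by row, returns a (3, 4) array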
100. What Doesn’t Work?
(A non-comprehensive list)
• Sets, lists, dictionaries, user defined classes (tuples do work!)
• List, set and dictionary comprehensions
• Recursion
• Exceptions with non-constant parameters
• Most string operations (buffer support is very preliminary!)
• yield from
• closures inside a JIT function (compiling JIT functions inside a closure works…)
• Modifying globals
• Passing an axis argument to numpy array reduction functions
• Easy debugging (you have to debug in Python mode).
101. How Numba Works
[Pipeline: the Python function’s bytecode plus its argument types go through bytecode analysis and type inference to produce Numba IR; the IR is rewritten, lowered to LLVM IR, JIT-compiled by LLVM to machine code, then cached and executed]

@jit
def do_math(a, b):
    …

>>> do_math(x, y)
102. Recently Added Numba Features
• A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs
• Support for named tuples in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in) — both LLVM and pre-compiled
• A simulator for debugging GPU functions with the Python debugger on the CPU
• Can choose to release the GIL in nopython functions
• Ahead of time compilation
• vectorize and guvectorize on GPU and parallel targets now in open-source Numba!
• JIT Classes — coming soon!