Python as the Zen of Data Science
Travis E. Oliphant, Ph.D.
Peter Wang
Continuum Analytics
A description of how Python and particularly Anaconda helps achieve Data Science Nirvana.

Published in: Technology
  1. 1. Python as the Zen of Data Science Travis E. Oliphant, Ph.D. Peter Wang Continuum Analytics
  2. 2. 2 Data Science
  3. 3. 3 禪 Zen
  4. 4. 4 道
  5. 5. Zen Approach 5 • Right Practice • Right Attitude • Right Understanding “The Zen way of calligraphy is to write in the most straightforward, simple way as if you were a beginner, not trying to make something skillful or beautiful, but simply writing with full attention as if you were discovering what you were writing for the first time.” — Zen Mind, Beginner’s Mind
  6. 6. “Right Understanding” 6 The purpose of studying Buddhism is not to study Buddhism but to study ourselves. You are not your body. Zen: Python: The purpose of writing Python code is not to just produce software, but to study ourselves. You are not your technology stack!
  7. 7. 7 • Compose language primitives, built-ins, classes whenever possible • Much more powerful and accessible than trying to memorize a huge list of proprietary functions • Reject artificial distinctions between where data “should live” and where it gets computed • Empower each individual to use their own knowledge, instead of taking design power out of their hands with pre-ordained architectures and “stacks”. Pythonic Approach
  8. 8. Why Python? 8
     Analyst • Uses graphical tools • Can call functions, cut & paste code • Can change some variables. Gets paid for: Insight. Tools: Excel, VB, Tableau, Python
     Analyst / Data Developer • Builds simple apps & workflows • Used to be "just an analyst" • Likes coding to solve problems • Doesn't want to be a "full-time programmer". Gets paid (like a rock star) for: Code that produces insight. Tools: SAS, R, Matlab, Python
     Programmer • Creates frameworks & compilers • Uses IDEs • Degree in CompSci • Knows multiple languages. Gets paid for: Code. Tools: C, C++, Java, JS, Python
  9. 9. Pythonic Data Analysis 9 • Make an immediate connection with the data (using Pandas+ / NumPy+ / scikit-learn / Bokeh and matplotlib) • PyData stack means agility which allows you to rapidly build understanding with powerful modeling and easy manipulation. • Let the Data drive the analysis not the technology stack. • Empower the data-scientist, quant, geophysicist, biochemist, directly with simple language constructs they can use that “fits their brain.” • Scale-out is a later step. Python can work with you there too. There are great tools no matter where your data is located or how it is stored or how your cluster is managed. Not tied to a particular distributed story.
  10. 10. Zen of Python 10 >>> import this The Zen of Python, by Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!
  11. 11. Zen of NumPy (and Pandas) 11 • Strided is better than scattered • Contiguous is better than strided • Descriptive is better than imperative (use data-types) • Array-oriented and data-oriented is often better than object-oriented • Broadcasting is a great idea – use where possible • Split-apply-combine is a great idea – use where possible • Vectorized is better than an explicit loop • Write more ufuncs and generalized ufuncs (numba can help) • Unless it’s complicated — then use numba • Think in higher dimensions
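For illustration only (not from the slides): a minimal sketch of two of these aphorisms, broadcasting/vectorization in NumPy and split-apply-combine in pandas, using made-up data.

    import numpy as np
    import pandas as pd

    # Vectorized + broadcasting: scale each column by a per-column factor, no explicit loop
    x = np.random.rand(1000, 3)
    scale = np.array([1.0, 10.0, 100.0])    # shape (3,) broadcasts across the rows of x
    scaled = x * scale

    # Split-apply-combine: group rows, apply a reduction, combine the results
    df = pd.DataFrame({"species": ["setosa", "setosa", "virginica"],
                       "petal_length": [1.4, 1.3, 5.1]})
    means = df.groupby("species").petal_length.mean()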
  12. 12. “Zen of Data Science” 12 • Get more and better data. • Better data is determined by better models. • How you compute matters. • Put the data in the hands and minds of people with knowledge. • Fail quickly and often — but not in the same way. • Where and how the data is stored is secondary to analysis and understanding. • Premature horizontal scaling is the root of all evil. • When you must scale — data locality and parallel algorithms are the key. • Learn to think in building blocks that can be parallelized.
  13. 13. PyData Stack -- about 3,000,000 users 13 NumPy scikit-learn scikit-image statsmodels Cython PyTables/Numexpr Dask / Blaze SymPy Numba OpenCV astropy BioPython GDAL PySAL ... many many more ... Matplotlib SciPy Bokeh pandas
  14. 14. How do we get 100s of dependencies delivered? 14 We started Continuum with 3 primary goals:
 1. Scale the NumPy/PyData stack horizontally 2. Make it easy for Python users to produce data-science applications in the browser/notebook 3. Get more adoption of the PyData stack. So, we wrote a package manager, conda, and a distribution of Python + R, Anaconda.
  15. 15. Game-Changing Enterprise Ready Python Distribution 15 • 2 million downloads in last 2 years • 200k / month and growing • conda package manager serves up 5 million packages per month • Recommended installer for IPython/Jupyter, Pandas, SciPy, Scikit-learn, etc.
  16. 16. Some Users 16
  17. 17. Anaconda — Portable Environments 17 • Easy to install • Quick & agile data exploration • Powerful data analysis • Simple to collaborate • Accessible to all PYTHON & R OPEN SOURCE ANALYTICS NumPy SciPy Pandas Scikit-learn Jupyter/ IPython Numba Matplotlib Spyder Numexpr Cython Theano Scikit-image NLTK NetworkX IRKernel dplyr shiny ggplot2 tidyr caret nnet And 330+ packages conda
  18. 18. Traditional Analytics (pipeline diagram): many Data sources feed ETL, then Data Mining and/or Modelling, then Ad hoc usage and/or BI report and Production deployment
  19. 19. Traditional Analytics (the same pipeline diagram, now with ANACONDA overlaid)
  20. 20. Package Managers 20 yum (rpm) apt-get (dpkg) Linux OSX macports homebrew fink Windows chocolatey npackd Cross-platform conda Sophisticated light-weight environments included! http://conda.pydata.org
  21. 21. Conda features 21 • Excellent support for “system-level” environments — like having mini VMs but much lighter weight than docker (micro containers) • Minimizes code-copies (uses hard/soft links if possible) • Simple format: binary tar-ball + metadata • Metadata allows static analysis of dependencies • Easy to create multiple “channels” which are repositories for packages • User installable (no root privileges needed) • Integrates very well with pip and other language-specific package managers. • Cross Platform
  22. 22. Basic Conda Usage 22
     Install a package: conda install sympy
     List all installed packages: conda list
     Search for packages: conda search llvm
     Create a new environment: conda create -n py3k python=3
     Remove a package: conda remove nose
     Get help: conda install --help
  23. 23. Advanced Conda Usage 23
     Install a package in an environment: conda install -n py3k sympy
     Update all packages: conda update --all
     Export list of packages: conda list --export packages.txt
     Install packages from an export: conda install --file packages.txt
     See package history: conda list --revisions
     Revert to a revision: conda install --revision 23
     Remove unused packages and cached tarballs: conda clean -pt
  24. 24. Environments 24 • Environments are simple: just link the package to a different directory • Hard-links are very cheap, and very fast — even on Windows. • Conda environments are completely independent installations of everything • No fiddling with PYTHONPATH or sym-linking site-packages • “Activating” an environment just means changing your PATH so that its bin/ or Scripts/ comes first.
     conda create -n py3k python=3.5
     Unix: source activate py3k
     Windows: activate py3k
  25. 25. 25
  26. 26. Anaconda Platform Analytics Repository 26 • Commercial long-term support • Private, on-premise package mirror • Proprietary tools for building custom distribution, like Anaconda • Enterprise tools for managing custom packages and environments • Available on the cloud at 
 http://anaconda.org
  27. 27. Anaconda Cluster: Anaconda + Hadoop + Spark For Data Scientists: • Rapidly, easily create clusters on EC2, DigitalOcean, on-prem cloud/provisioner • Manage Python, R, Java, JS packages across the cluster For Operations & IT: • Robustly manage runtime state across the cluster • Outside the scope of rpm, chef, puppet, etc. • Isolate/sandbox packages & libraries for different jobs or groups of users • Without introducing complexity of Docker / virtualization • Cross platform - same tooling for laptops, workstations, servers, clusters 27
  28. 28. Cluster Creation 28
     $ acluster create mycluster --profile=spark_profile
     $ acluster submit mycluster mycode.py
     $ acluster destroy mycluster
     spark_profile:
       provider: aws_east
       num_nodes: 4
       node_id: ami-3c994355
       node_type: m1.large
     aws_east:
       secret_id: <aws_access_key_id>
       secret_key: <aws_secret_access_key>
       keyname: id_rsa.pub
       location: us-east-1
       private_key: ~/.ssh/id_rsa
       cloud_provider: ec2
       security_group: all-open
     http://continuumio.github.io/conda-cluster/quickstart.html
  29. 29. Cluster Management 29
     Package & environment management:
     $ acluster manage mycluster list
     ... info -e
     ... install python=3 pandas flask
     ... set_env
     ... push_env <local> <remote>
     Easy SSH & remote commands:
     $ acluster ssh mycluster
     $ acluster run.cmd mycluster "cat /etc/hosts"
     http://continuumio.github.io/conda-cluster/manage.html
  30. 30. Anaconda Cluster & Spark 30
     # example.py
     conf = SparkConf()
     conf.setMaster("yarn-client")
     conf.setAppName("MY APP")
     sc = SparkContext(conf=conf)
     # analysis
     sc.parallelize(range(1000)).map(lambda x: (x, x % 2)).take(10)
     $ acluster submit MY_CLUSTER /path/to/example.py
     Remember that Blaze has a higher-level interface to Spark and Dask provides a more Pythonic approach.
  31. 31. ANACONDA OPEN SOURCE TECHNOLOGY 31
  32. 32. 32 • Infrastructure for meta-data, meta-compute, and expression graphs/dataflow • Data glue for scale-up or scale-out • Generic remote computation & query system • (NumPy+Pandas+LINQ+OLAP+PADL).mashup() Blaze is an extensible high-level interface for data analytics. It feels like NumPy/Pandas. It drives other data systems. Blaze expressions enable high-level reasoning. It’s an ecosystem of tools. http://blaze.pydata.org Blaze
  33. 33. Glue 2.0 33 Python’s legacy as a powerful glue language • manipulate files • call fast libraries Next-gen Glue: • Link data silos • Link disjoint memory & compute • Unify disparate runtime models • Transcend legacy models of computers
  34. 34. 34
  35. 35. 35 Data
  36. 36. 36 “Math” Data
  37. 37. 37 Math Big Data
  38. 38. 38 Math Big Data
  39. 39. 39 Math Big Data
  40. 40. 40 Math Big Data Programs
  41. 41. 41 “General Purpose Programming”
  42. 42. 42 Domain-Specific Query Language Analytics System
  43. 43. 43 html,css,js,… py,r,sql,… java,c,cpp,cs
  44. 44. 44 ?
  45. 45. 45 Expressions Metadata Runtime
  46. 46. 46 Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk. Metadata: datashape, dtype, shape, stride, hdf5, json, csv, xls, protobuf, avro, ... Runtime: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
  47. 47. APIs, syntax, language 47 Data Runtime Expressions metadata storage/containers compute datashape blaze dask odo parallelize optimize, JIT
  48. 48. Blaze 48 Interface to query data on different storage systems http://blaze.pydata.org/en/latest/
     from blaze import Data
     CSV: iris = Data('iris.csv')
     SQL: iris = Data('sqlite:///flowers.db::iris')
     MongoDB: iris = Data('mongodb://localhost/mydb::iris')
     JSON: iris = Data('iris.json')
     S3: iris = Data('s3://blaze-data/iris.csv')
     …
     Current focus is the “dark data” and pydata stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer needs (i.e. kdb, mongo).
  49. 49. Blaze 49
     Select columns: iris[['sepal_length', 'species']]
     Operate: log(iris.sepal_length * 10)
     Reduce: iris.sepal_length.mean()
     Split-apply-combine: by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean())
     Add new columns: transform(iris, sepal_ratio = iris.sepal_length / iris.sepal_width, petal_ratio = iris.petal_length / iris.petal_width)
     Text matching: iris.like(species='*versicolor')
     Relabel columns: iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
     Filter: iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
  50. 50. 50 datashape / blaze Blaze uses datashape as its type system (like DyND) >>> iris = Data('iris.json') >>> iris.dshape dshape("""var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }""")
  51. 51. Datashape 51 A structured data description language http://datashape.pydata.org/ A datashape is built from dimensions (e.g. var, 3, 4) and a dtype built from unit types (e.g. string, int32, float64), joined with *. Example tabular datashape: var * { x : int32, y : string, z : float64 }. A record ({ x : int32, y : string, z : float64 }) is an ordered struct dtype: a collection of types keyed by labels.
  52. 52. Datashape 52 { flowersdb: { iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }, iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string }, irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }, irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } } # Arrays of Structures 100 * { name: string, birthday: date, address: { street: string, city: string, postalcode: string, country: string } } # Structure of Arrays { x: 100 * 100 * float32, y: 100 * 100 * float32, u: 100 * 100 * float32, v: 100 * 100 * float32, } # Function prototype (3 * int32, float64) -> 3 * float64 # Function prototype with broadcasting dimensions (A... * int32, A... * int32) -> A... * int32 # Arrays 3 * 4 * int32 3 * 4 * int32 10 * var * float64 3 * complex[float64]
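As a side note, these descriptions can also be parsed from Python; a minimal sketch assuming the standalone datashape package:

    import datashape

    # Parse a tabular datashape: a variable-length dimension of records
    ds = datashape.dshape("var * {petal_length: float64, species: string}")
    print(ds)          # var * {petal_length: float64, species: string}
    print(ds.measure)  # just the record (dtype) part of the datashape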
  53. 53. Blaze Server — Lights up your Dark Data 53 Builds off of Blaze uniform interface to host data remotely through a JSON web API.
     $ blaze-server server.yaml -e localhost:6363/compute.json
     server.yaml:
     iriscsv:
       source: iris.csv
     irisdb:
       source: sqlite:///flowers.db::iris
     irisjson:
       source: iris.json
       dshape: "var * {name: string, amount: float64}"
     irismongo:
       source: mongodb://localhost/mydb::iris
  54. 54. Blaze Server 54 Blaze Client >>> from blaze import Data >>> s = Data('blaze://localhost:6363') >>> t.fields [u'iriscsv', u'irisdb', u'irisjson', u’irismongo'] >>> t.iriscsv sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa >>> t.irisdb petal_length petal_width sepal_length sepal_width species 0 1.4 0.2 5.1 3.5 Iris-setosa 1 1.4 0.2 4.9 3.0 Iris-setosa 2 1.3 0.2 4.7 3.2 Iris-setosa
  55. 55. Compute recipes work with existing libraries and have multiple backends • python list • numpy arrays • dynd • pandas DataFrame • Spark, Impala • Mongo • dask 55
  56. 56. • Ideally, you can layer expressions over any data
 • Write once, deploy anywhere
 • Practically, expressions will work better on specific data structures, formats, and engines
 • Use odo to copy from one format and/or engine to another 56
  57. 57. 57 Dask: Out-of-Core PyData • A parallel computing framework • That leverages the excellent Python ecosystem • Using blocked algorithms and task scheduling • Written in pure Python Core Ideas • Dynamic task scheduling yields sane parallelism • Simple library to enable parallelism • Dask.array/dataframe to encapsulate the functionality • Distributed scheduler
  58. 58. Example: Ocean Temp Data 58 • http://www.esrl.noaa.gov/psd/data/gridded/ data.noaa.oisst.v2.highres.html • Every 1/4 degree, 720x1440 array each day
  59. 59. Bigger data... 59 36 years: 720 x 1440 x 12341 x 4 = 51 GB uncompressed If you don't have this much RAM... ... better start chunking.
  60. 60. DAG of Computation 60
  61. 61. • Collections build task graphs • Schedulers execute task graphs • Graph specification = uniting interface • A generalization of RDDs 61
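A minimal sketch of the graph specification idea: a Dask graph is just a dict whose values are either data or tuples of (function, arguments), and any scheduler can execute it (here the threaded scheduler).

    from operator import add, mul
    from dask.threaded import get   # one of several schedulers

    # Keys name intermediate results; tuples are deferred calls
    dsk = {
        'a': 1,
        'b': 2,
        'c': (add, 'a', 'b'),   # c = a + b
        'd': (mul, 'c', 10),    # d = c * 10
    }
    print(get(dsk, 'd'))        # 30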
  62. 62. Simple Architecture for Scaling 62 Dask collections (• dask.array • dask.dataframe • dask.bag • dask.imperative*) build on the Python Ecosystem and produce the Dask Graph Specification, which Dask Schedulers execute
  63. 63. dask.array: OOC, parallel, ND array 63 Arithmetic: +, *, ... Reductions: mean, max, ... Slicing: x[10:, 100:50:-2] Fancy indexing: x[:, [3, 1, 2]] Some linear algebra: tensordot, qr, svd Parallel algorithms (approximate quantiles, topk, ...) Slightly overlapping arrays Integration with HDF5
  64. 64. Dask Array 64 numpy dask >>> import numpy as np >>> np_ones = np.ones((5000, 1000)) >>> np_ones array([[ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.], ..., [ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.]]) >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1) >>> np_y array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056]) >>> import dask.array as da >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000)) >>> da_ones.compute() array([[ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.], ..., [ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.], [ 1., 1., 1., ..., 1., 1., 1.]]) >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1) >>> np_da_y = np.array(da_y) #fits in memory array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, …, 693.14718056]) # If result doesn’t fit in memory >>> da_y.to_hdf5('myfile.hdf5', 'result')
  65. 65. dask.dataframe: OOC, parallel dataframe 65 Elementwise operations: df.x + df.y Row-wise selections: df[df.x > 0] Aggregations: df.x.max() groupby-aggregate: df.groupby(df.x).y.max() Value counts: df.x.value_counts() Drop duplicates: df.x.drop_duplicates() Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
  66. 66. Dask Dataframe 66 pandas dask >>> import pandas as pd >>> df = pd.read_csv('iris.csv') >>> df.head() sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa >>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max() 5.7999999999999998 >>> import dask.dataframe as dd >>> ddf = dd.read_csv('*.csv') >>> ddf.head() sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa 3 4.6 3.1 1.5 0.2 Iris-setosa 4 5.0 3.6 1.4 0.2 Iris-setosa … >>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max() >>> d_max_sepal_length_setosa.compute() 5.7999999999999998
  67. 67. More Complex Graphs 67 cross validation
  68. 68. 68 http://continuum.io/blog/xray-dask
  69. 69. 69
     from dask import dataframe as dd
     columns = ["name", "amenity", "Longitude", "Latitude"]
     data = dd.read_csv('POIWorld.csv', usecols=columns)
     with_name = data[data.name.notnull()]
     with_amenity = data[data.amenity.notnull()]
     is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
     is_dunkin = with_name.name.str.contains('[Dd]unkin')
     starbucks = with_name[is_starbucks]
     dunkin = with_name[is_dunkin]
     locs = dd.compute(starbucks.Longitude, starbucks.Latitude, dunkin.Longitude, dunkin.Latitude)
     # extract arrays of values from the series:
     lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]
     %matplotlib inline
     import matplotlib.pyplot as plt
     from mpl_toolkits.basemap import Basemap
     def draw_USA():
         """initialize a basemap centered on the continental USA"""
         plt.figure(figsize=(14, 10))
         return Basemap(projection='lcc', resolution='l', llcrnrlon=-119, urcrnrlon=-64, llcrnrlat=22, urcrnrlat=49, lat_1=33, lat_2=45, lon_0=-95, area_thresh=10000)
     m = draw_USA()
     # Draw map background
     m.fillcontinents(color='white', lake_color='#eeeeee')
     m.drawstates(color='lightgray')
     m.drawcoastlines(color='lightgray')
     m.drawcountries(color='lightgray')
     m.drawmapboundary(fill_color='#eeeeee')
     # Plot the values in Starbucks Green and Dunkin Donuts Orange
     style = dict(s=5, marker='o', alpha=0.5, zorder=2)
     m.scatter(lon_s, lat_s, latlon=True, label="Starbucks", color='#00592D', **style)
     m.scatter(lon_d, lat_d, latlon=True, label="Dunkin' Donuts", color='#FC772A', **style)
     plt.legend(loc='lower left', frameon=False);
  70. 70. Distributed 70 Pythonic multiple-machine parallelism that understands Dask graphs 1) Defines Center (dcenter) and Worker (dworker) 2) Simplified setup with dcluster, for example: dcluster 192.168.0.{1,2,3,4} or dcluster --hostfile hostfile.txt 3) Create Executor objects like concurrent.futures (Python 3) or futures (Python 2.7 back-port) 4) Data locality supported with ad-hoc task graphs by returning futures wherever possible. New library but stabilizing quickly — communicate with blaze-dev@continuum.io
  71. 71. Python and Hadoop (without the JVM) 71 Chat: http://gitter.im/blaze/dask Email: blaze-dev@continuum.io Join the conversation!
  72. 72. HDFS without Java 72 1. HDFS splits large files into many small blocks replicated on many datanodes 2. For efficient computation we must use data directly on datanodes 3. distributed.hdfs queries the locations of the individual blocks 4. distributed executes functions directly on those blocks on the datanodes 5. distributed+pandas enables distributed CSV processing on HDFS in pure Python 6. Coming soon — dask on hdfs
  73. 73. 73 $ hdfs dfs -cp yellow_tripdata_2014-01.csv /data/nyctaxi/ >>> from distributed import hdfs >>> blocks = hdfs.get_locations('/data/nyctaxi/', '192.168.50.100', 9000) >>> columns = ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code', 'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount', 'tolls_amount', 'total_amount'] >>> from distributed import Executor >>> executor = Executor('192.168.1.100:8787') >>> dfs = [executor.submit(pd.read_csv, block['path'], workers=block['hosts'], ... columns=columns, skiprows=1) ... for block in blocks] These operations produce Future objects that point to remote results on the worker computers. This does not pull results back to local memory. We can use these futures in later computations with the executor.
  74. 74. 74 def sum_series(seq): result = seq[0] for s in seq[1:]: result = result.add(s, fill_value=0) return result >>> counts = executor.map(lambda df: df.passenger_count.value_counts(), dfs) >>> total = executor.submit(sum_series, counts) >>> total.result() 0 259 1 9727301 2 1891581 3 566248 4 267540 5 789070 6 540444 7 7 8 5 9 16 208 19
  75. 75. Bokeh 75 http://bokeh.pydata.org • Interactive visualization • Novel graphics • Streaming, dynamic, large data • For the browser, with or without a server • No need to write Javascript
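A minimal sketch of the bokeh.plotting interface (file name and data are made up); it writes a standalone HTML page, so no JavaScript is required from the user:

    from bokeh.plotting import figure, output_file, show

    output_file("lines.html")      # standalone HTML output, no server needed
    p = figure(title="simple line", x_axis_label="x", y_axis_label="y")
    p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5], line_width=2)
    show(p)                        # open the interactive plot in a browser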
  76. 76. Versatile Plots 76
  77. 77. Novel Graphics 77
  78. 78. Previous: Javascript code generation 78 server.py Browser js_str = """ <d3.js> <highchart.js> <etc.js> """ plot.js.template App Model D3 highcharts flot crossfilter etc. ... One-shot; no MVC interaction; no data streaming HTML
  79. 79. bokeh.py & bokeh.js 79 server.py BrowserApp Model BokehJS object graph bokeh-server bokeh.py object graph JSON
  80. 80. 80
  81. 81. 81 4GB Interactive Web Viz
  82. 82. rBokeh 82 http://hafen.github.io/rbokeh
  83. 83. 83
  84. 84. 84
  85. 85. 85
  86. 86. 86 http://nbviewer.ipython.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/00 - intro.ipynb#Interaction
  87. 87. Additional Demos & Topics 87 • Airline flights • Pandas table • Streaming / Animation • Large data rendering
  88. 88. 88 • Dynamic, just-in-time compiler for Python & NumPy • Uses LLVM • Outputs x86 and GPU (CUDA, HSA) • (Premium version is in Accelerate part of 
 Anaconda Workgroup and Anaconda Enterprise subscriptions) http://numba.pydata.org Numba
  89. 89. Python Compilation Space 89
     Relies on CPython / libpython: Ahead Of Time (Cython, Shedskin, Nuitka (today), Pythran); Just In Time (Numba, HOPE, Theano)
     Replaces CPython / libpython: Ahead Of Time (Nuitka (future)); Just In Time (Pyston, PyPy)
  90. 90. Example 90 Numba
  91. 91. 91
     @jit('void(f8[:,:],f8[:,:],f8[:,:])')
     def filter(image, filt, output):
         M, N = image.shape
         m, n = filt.shape
         for i in range(m//2, M-m//2):
             for j in range(n//2, N-n//2):
                 result = 0.0
                 for k in range(m):
                     for l in range(n):
                         result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                 output[i,j] = result
     ~1500x speed-up
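A hypothetical driver for the jitted filter above, matching its float64 2-D signature (the image size and box kernel are made up):

    import numpy as np

    image = np.random.rand(512, 512)      # f8[:,:]
    filt = np.ones((5, 5)) / 25.0         # simple 5x5 box filter
    output = np.zeros_like(image)
    filter(image, filt, output)           # fills `output` in place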
  92. 92. Numba Features • Numba supports: – Windows, OS X, and Linux – 32 and 64-bit x86 CPUs and NVIDIA GPUs – Python 2 and 3 – NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user’s system. • < 70 MB to install. • Does not replace the standard Python interpreter
 (all of your existing Python libraries are still available) 92
  93. 93. Numba Modes • object mode: Compiled code operates on Python objects. Only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below). • nopython mode: Compiled code operates on “machine native” data. Usually within 25% of the performance of equivalent C or FORTRAN. 93
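A small sketch of the nopython-mode contract described above (the summation function is made up): requesting nopython=True makes Numba raise an error instead of silently falling back to object mode.

    import numpy as np
    from numba import jit

    @jit(nopython=True)    # compile to machine-native operations or fail loudly
    def nb_sum(arr):
        total = 0.0
        for x in arr:      # the loop over a NumPy array compiles to native code
            total += x
        return total

    x = np.arange(1000000, dtype=np.float64)
    nb_sum(x)              # first call compiles; subsequent calls run at native speed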
  94. 94. The Basics 94
  95. 95. The Basics 95 Array Allocation Looping over ndarray x as an iterator Using numpy math functions Returning a slice of the array 2.7x speedup! Numba decorator
 (nopython=True not required)
  96. 96. CUDA Python (in open-source Numba!) 96 CUDA Development using Python syntax for optimal performance! You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU
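A minimal sketch of a CUDA kernel in open-source Numba (array size and scale factor are made up); the launch configuration is still chosen by hand, which is the "understand CUDA at least a little" part:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, factor):
        i = cuda.grid(1)                  # absolute index of this thread
        if i < x.size:                    # guard threads that fall past the end
            x[i] *= factor

    x = np.arange(1024, dtype=np.float64)
    threads_per_block = 128
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    scale[blocks, threads_per_block](x, 2.0)   # array is copied to the GPU and back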
  97. 97. Example: Black-Scholes 97
  98. 98. Black-Scholes: Results 98 core i7 GeForce GTX 560 Ti About 9x faster on this GPU ~ same speed as CUDA-C
  99. 99. Other interesting things • CUDA Simulator to debug your code in Python interpreter • Generalized ufuncs (@guvectorize) • Call ctypes and cffi functions directly and pass them as arguments • Preliminary support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • “numba annotate” to dump HTML annotated version of compiled code • See: http://numba.pydata.org/numba-doc/0.20.0/ 99
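A small sketch of a generalized ufunc with @guvectorize, as mentioned above (the row-addition example and its layout are made up):

    import numpy as np
    from numba import guvectorize

    @guvectorize(['void(float64[:], float64[:], float64[:])'], '(n),(n)->(n)')
    def add_rows(x, y, out):
        for i in range(x.shape[0]):
            out[i] = x[i] + y[i]

    a = np.arange(12, dtype=np.float64).reshape(3, 4)
    b = np.ones(4)
    print(add_rows(a, b))   # the gufunc broadcasts over the leading dimension of a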
  100. 100. What Doesn’t Work? (A non-comprehensive list) • Sets, lists, dictionaries, user defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Passing an axis argument to numpy array reduction functions • Easy debugging (you have to debug in Python mode). 100
  101. 101. How Numba Works 101 Python Function (bytecode) + Function Arguments → Bytecode Analysis → Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code, which is cached and executed. @jit def do_math(a,b): … >>> do_math(x, y)
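One way to watch this pipeline from the outside (a sketch; do_math is a made-up function): each new argument-type combination triggers a compile, and the dispatcher caches one result per signature.

    from numba import jit

    @jit
    def do_math(a, b):
        return a * b + 1

    do_math(2, 3)              # compiles an (int64, int64) specialization
    do_math(2.0, 3.0)          # compiles a second (float64, float64) specialization
    print(do_math.signatures)  # the compiled signatures the dispatcher has cached
    do_math.inspect_types()    # dump the type-annotated view of the compiled function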
  102. 102. Recently Added Numba Features • A new GPU target: the Heterogeneous System Architecture, supported by AMD APUs • Support for named tuples in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) — both LLVM and pre-compiled • A simulator for debugging GPU functions with the Python debugger on the CPU • Can choose to release the GIL in nopython functions • Ahead of time compilation • vectorize and guvectorize on GPU and parallel targets now in open-source Numba! • JIT Classes — coming soon! 102
