This talk gives a high-level overview of the motivation, design goals, and status of Blaze, a Continuum Analytics project to build a large-scale array object for Python.
Blaze: a large-scale, array-oriented infrastructure for Python
1. Blaze: a large-scale, array-oriented infrastructure for Python
Travis E. Oliphant
PyData Silicon Valley 2013
Tuesday, March 19, 13
2. Brief History
Person                                      Package                  Year
Jim Fulton                                  Matrix Object in Python  1994
Jim Hugunin                                 Numeric                  1995
Perry Greenfield, Rick White, Todd Miller   Numarray                 2001
Travis Oliphant                             NumPy                    2005
3. Early pieces of SciPy
fftw wrappers: June 1998
cephesmodule: November 1998
stats.py (Gary Strangman): December 1998
4. 1999: Early SciPy emerges
Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
In response, on 15 Jan 1999 I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
Gaussian quadrature: 5 Jan 1999
cephes 1.0: 30 Jan 1999
sigtools 0.40: 23 Feb 1999
Numeric docs: March 1999
cephes 1.1: 9 Mar 1999
multipack 0.3: 13 Apr 1999
Helper routines: 14 Apr 1999
multipack 0.6 (leastsq, ode, fsolve, quad): 29 Apr 1999
sparse plan described: 30 May 1999
multipack 0.7: 14 Jun 1999
SparsePy 0.1: 5 Nov 1999
cephes 1.2 (vectorize): 29 Dec 1999
Plotting??: Gist, XPLOT, DISLIN, Gnuplot
Helping with f2py
5. SciPy 2001
Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc
Eric Jones: weave, cluster, GA*
Pearu Peterson: linalg, interpolate, f2py
Founded in 2001 with Travis Vaught
6. Community effort (and many, many others --- forgive me!)
• Chuck Harris
• Pauli Virtanen
• David Cournapeau
• Stefan van der Walt
• Jake Vanderplas
• Josef Perktold
• Anne Archibald
• Dag Sverre Seljebotn
• Robert Kern
• Matthew Brett
• Warren Weckesser
• Ralf Gommers
• Joe Harrington --- Documentation effort
• Andrew Straw --- www.scipy.org
7. 1,000,000 to 2,000,000 users of NumPy!
9. What is good about NumPy?
• Array-oriented
• Extensive DType System (including structures)
• C-API --- lots of libraries
• Simple-to-understand data structure
• Memory mapping
• Syntax support from Python
• Large community of users
• Ufuncs and more
• Broadcasting
• Easy to interface C/C++/Fortran code
10. What is wrong with NumPy?
• Dtype system is difficult to extend
• Immediate mode creates huge temporaries (spawning Numexpr)
• “Almost” an in-memory database comparable to SQLite (missing indexes)
• Integration with sparse arrays
• Lots of un-optimized parts
• Minimal support for multi-core / GPU
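The point about temporaries can be sketched without NumPy. The toy `Vec` class below is hypothetical and only counts how many full-length buffers get created: evaluating `a + b * c` eagerly materializes `b * c` before the addition, while a single fused loop produces only the result. This is an illustration of the problem, not a description of how NumPy or Numexpr are implemented.

```python
class Vec:
    """Toy eager vector (hypothetical); counts full-length buffers created."""

    allocations = 0

    def __init__(self, values):
        self.values = list(values)
        Vec.allocations += 1

    def __add__(self, other):
        return Vec(x + y for x, y in zip(self.values, other.values))

    def __mul__(self, other):
        return Vec(x * y for x, y in zip(self.values, other.values))


a, b, c = Vec([1, 2, 3]), Vec([4, 5, 6]), Vec([7, 8, 9])

# Immediate mode: b * c is materialized first, then added to a.
Vec.allocations = 0
eager = a + b * c
eager_allocs = Vec.allocations   # one temporary plus the result

# Fused evaluation: one pass over the inputs, no intermediate vector.
Vec.allocations = 0
fused = Vec(x + y * z for x, y, z in zip(a.values, b.values, c.values))
fused_allocs = Vec.allocations   # only the result

print(eager.values, eager_allocs)
print(fused.values, fused_allocs)
```

For arrays that barely fit in memory, that extra full-size temporary is the difference between finishing and running out of RAM.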
11. Improvements needed
• NDArray improvements
• Indexes (esp. for Structured arrays)
• SQL front-end
• Multi-level, hierarchical labels
• selection via mappings (labeled arrays)
• Memory spaces (array made up of regions)
• Distributed arrays (global array)
• Compressed arrays
• Standard distributed persistence
• fancy indexing as view and optimizations
• streaming arrays
12. Improvements needed
• Dtype improvements
• Enumerated types (including dynamic enumeration)
• Derived fields
• Specification as a class (or JSON)
• Pointer dtype (i.e. C++ object, or varchar)
• Missing data: masks and bit-patterns
• Parameterized field names
• Computed fields
13. Improvements needed
• Ufunc improvements
• Generalized ufuncs support more than just contiguous arrays
• Specification of ufuncs in Python
• Move most dtype “array functions” to ufuncs
• Unify error-handling for all computations
• Allow lazy-evaluation and remote computation --- streaming and generator data
• Structured and string dtype ufuncs
• Multi-core and GPU optimized ufuncs
• Group-by reduction
14. More Improvements needed
• Miscellaneous improvements
• ABI-management
• Eventual Move to library (NDLib)?
• NDLib could serve as base for Javascript and other high-level languages?
• Integration with LLVM
• Possible dtype / shape / stride unification into a “table interface”
• Remote computation
• Fast I/O for CSV and Excel
15. New Project
NumPy
Blaze: Out of Core, Distributed and Optimized NumPy
17. Blaze: Different kinds of Arrays
Indexable
• Record Type: NDTable (Deferred or Concrete)
• Primitive Type: NDArray (Deferred or Concrete)
18. Blaze Deferred Arrays
• Symbolic objects which build a graph
• Represents deferred computation
• Usually what you have when you have a Blaze Array
A + B*C builds the graph:
      +
     / \
    A   *
       / \
      B   C
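A minimal sketch of the deferred idea, assuming nothing about Blaze's real API: the `Deferred` class, `leaf`, and `evaluate` below are hypothetical names. Operators only build a graph for `A + B * C`, and no arithmetic happens until `evaluate` walks that graph.

```python
import operator

# Hypothetical names throughout; this is not Blaze's API.
OPS = {"+": operator.add, "*": operator.mul}


class Deferred:
    """A symbolic node: a named leaf holding data, or an operator over two nodes."""

    def __init__(self, op=None, args=(), name=None, data=None):
        self.op, self.args, self.name, self.data = op, args, name, data

    def __add__(self, other):
        return Deferred("+", (self, other))

    def __mul__(self, other):
        return Deferred("*", (self, other))

    def __repr__(self):
        if self.op is None:
            return self.name
        left, right = self.args
        return f"({left!r} {self.op} {right!r})"


def leaf(name, data):
    return Deferred(name=name, data=list(data))


def evaluate(node):
    """Walk the graph and compute element-wise; nothing runs before this call."""
    if node.op is None:
        return node.data
    left, right = (evaluate(arg) for arg in node.args)
    return [OPS[node.op](x, y) for x, y in zip(left, right)]


A = leaf("A", [1, 2, 3])
B = leaf("B", [10, 20, 30])
C = leaf("C", [2, 2, 2])

expr = A + B * C          # builds a graph only
print(expr)               # (A + (B * C))
print(evaluate(expr))     # [21, 42, 63]
```

Because the whole expression is visible before anything runs, an evaluator is free to fuse loops, pick chunk sizes, or ship parts of the graph elsewhere.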
19. Deferred allows handling large arrays
Can be handled out-of-core using chunks to stream through memory.
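As a rough sketch of chunked streaming (the helper names `chunks` and `stream_eval` are made up for illustration, and `range` objects stand in for on-disk or remote sources), the expression `a + b*c` can be computed so that only one chunk per operand is resident at a time:

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def stream_eval(a, b, c, size=4):
    """Compute a + b*c one chunk at a time, yielding each result chunk."""
    for ca, cb, cc in zip(chunks(a, size), chunks(b, size), chunks(c, size)):
        yield [x + y * z for x, y, z in zip(ca, cb, cc)]


n = 10
a, b, c = range(n), range(n), range(n)   # stand-ins for large external sources

total = 0
for block in stream_eval(a, b, c):
    total += sum(block)                  # consume each chunk, then let it go
print(total)
```

Peak memory here depends on the chunk size rather than on `n`, which is the property that makes out-of-core evaluation possible.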
20. Blaze Concrete Array
URLs (one or more)
Data Descriptor: Where are the bytes? (Indexes)
DataShape: What do the bytes mean? (Extensible Type System which includes shape)
MetaData: Labels, provenance, etc. (Dictionary)
22. URLs Provide Bytes
Memory-Like: Random Seeks; arbitrarily sliced
File-Like: Random Seeks; deal with in chunks
Stream-Like: Sequential Seeks; deal with in chunks
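The three access patterns can be imitated with the standard library. This is a sketch only: `read_chunk` and `stream_chunks` are hypothetical helpers, `io.BytesIO` stands in for a real file or network stream, and the native C `int` item type is an arbitrary choice.

```python
import io
import struct

ITEM = "i"                          # native C int; an assumption for this sketch
SIZE = struct.calcsize(ITEM)
raw = struct.pack(f"10{ITEM}", *range(10))

# Memory-like: random access, and slices are views rather than copies.
mem = memoryview(raw).cast(ITEM)
print(mem[2:5].tolist())


def read_chunk(f, start, count):
    """File-like: seek to any element, then read a bounded chunk."""
    f.seek(start * SIZE)
    data = f.read(count * SIZE)
    n = len(data) // SIZE
    return list(struct.unpack(f"{n}{ITEM}", data[: n * SIZE]))


def stream_chunks(f, count):
    """Stream-like: no seeking, chunks arrive strictly in order."""
    while True:
        data = f.read(count * SIZE)
        if not data:
            return
        n = len(data) // SIZE
        yield list(struct.unpack(f"{n}{ITEM}", data[: n * SIZE]))


print(read_chunk(io.BytesIO(raw), 6, 3))
print(list(stream_chunks(io.BytesIO(raw), 4)))
```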
23. Blaze Data Container
Index
Data Buffer
Operation
ByteProvider
Data Descriptor
Protocol
NumPy BLZ RDBMS
Persistent
Data Stream Format CSV
24. Indexes
Contiguous / Strided NumPy-Like
Chunked / Tiled Special Access
Opaque
Opaque Element-only Iterator-access
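For the contiguous / strided case, the index is just arithmetic: a multi-dimensional position maps to a flat offset through per-dimension strides, and changing the strides changes the ordering without moving data. The sketch below uses element strides (not byte strides) and hypothetical helper names `strides_for` and `flat_offset`.

```python
def strides_for(shape, order="C"):
    """Element strides for row-major ("C") or column-major ("F") layout."""
    strides = []
    step = 1
    # Walk dimensions from fastest-varying to slowest-varying.
    dims = reversed(shape) if order == "C" else shape
    for extent in dims:
        strides.append(step)
        step *= extent
    return tuple(reversed(strides)) if order == "C" else tuple(strides)


def flat_offset(index, strides):
    """Position of a multi-dimensional index inside the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3)
c_strides = strides_for(shape, "C")
f_strides = strides_for(shape, "F")
print(c_strides, flat_offset((1, 0), c_strides))
print(f_strides, flat_offset((1, 0), f_strides))
```

Chunked, tiled, and opaque indexes replace this closed-form arithmetic with a lookup or an iterator, which is why they need special access paths.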
25. Indexes allow for many orderings
26. DataShape Type System
Shape and DType combine into DataShape
• A data description language
• A super-set of NumPy’s dtype
• Provides more flexibility
27. Allows for all kinds of containers
28. Advanced Types
Parametrized Types
type SquareMatrix T = N, N, T
Alias Types
type IntMatrix N = N, N, int32
type Point = {
x : int;
y : int
}
type Space = {
a: Point;
b: Point
}
5, 10, Space
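To make the `5, 10, Space` example concrete, here is a sketch that models records as nested dictionaries and computes how many bytes such a datashape would describe. It is not the DataShape grammar or its implementation: the `SIZES` table, the helper names, and the choice to treat the slide's `int` as a 4-byte `int32` with no padding are all assumptions.

```python
from math import prod

# Illustrative byte sizes; a real type system would define many more.
SIZES = {"int32": 4, "int64": 8, "float64": 8}

# Records as dicts of field name -> type, mirroring Point and Space.
# Assumption: the slide's `int` is taken to be int32.
Point = {"x": "int32", "y": "int32"}
Space = {"a": Point, "b": Point}


def square_matrix(n, dtype):
    """Parametrized type in the spirit of `SquareMatrix T = N, N, T`."""
    return (n, n), dtype


def itemsize(dtype):
    """Bytes per element; records are the packed sum of their fields."""
    if isinstance(dtype, dict):
        return sum(itemsize(field) for field in dtype.values())
    return SIZES[dtype]


def nbytes(shape, dtype):
    """Total bytes described by a shape plus a dtype."""
    return prod(shape) * itemsize(dtype)


print(itemsize(Space))                        # 16
print(nbytes((5, 10), Space))                 # 800
print(nbytes(*square_matrix(3, "float64")))   # 72
```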
29. Advanced Shapes
{1,2,4,2,1}, int32
Could Represent
[
[1],
[1,2],
[1,3,2,9],
[3,2],
[3]
]
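One common way to store such a ragged array is a flat buffer plus row offsets derived from the per-row lengths. The sketch below uses that layout for the `{1,2,4,2,1}` example; it is an illustration of the idea, not a claim about how Blaze stores variable-length dimensions.

```python
from itertools import accumulate

lengths = [1, 2, 4, 2, 1]                  # the variable dimension {1,2,4,2,1}
flat = [1, 1, 2, 1, 3, 2, 9, 3, 2, 3]      # all int32 values, back to back
offsets = [0, *accumulate(lengths)]        # row i spans offsets[i]:offsets[i + 1]


def row(i):
    """Slice row i out of the flat buffer."""
    return flat[offsets[i]:offsets[i + 1]]


rows = [row(i) for i in range(len(lengths))]
print(offsets)   # [0, 1, 3, 7, 9, 10]
print(rows)      # [[1], [1, 2], [1, 3, 2, 9], [3, 2], [3]]
```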
30. Execution Model
• Graphs dispatch to specialized library code that is “registered with the system” based on type and meta-data of array (blaze Modules)
• Many operations can be compiled with LLVM to machine-code
  • BLIR (simple typed expression syntax)
  • Numba (Python compiler)
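The "registered with the system" idea can be sketched as a lookup table keyed by operation and type. Everything here is hypothetical (`REGISTRY`, `register`, `dispatch`, and the string type names); it only shows how a graph node could be routed to specialized code, not how blaze Modules actually work.

```python
import math

REGISTRY = {}   # (operation, dtype) -> implementation; hypothetical structure


def register(op, dtype):
    """Decorator that records an implementation for one (op, dtype) pair."""
    def decorator(fn):
        REGISTRY[(op, dtype)] = fn
        return fn
    return decorator


@register("sum", "int32")
def sum_int32(values):
    return sum(values)


@register("sum", "float64")
def sum_float64(values):
    return math.fsum(values)   # specialized: accurate floating-point summation


def dispatch(op, dtype, *args):
    """Pick the registered implementation for this operation and type."""
    try:
        impl = REGISTRY[(op, dtype)]
    except KeyError:
        raise TypeError(f"no implementation for {op!r} on {dtype!r}") from None
    return impl(*args)


print(dispatch("sum", "int32", [1, 2, 3]))       # 6
print(dispatch("sum", "float64", [0.1] * 10))    # 1.0
```

A fuller version would also key on metadata such as layout or location, so the same graph node could run on a local strided buffer or be handed to a remote backend.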
31. Blaze Agents
Code Graph with Blaze Arrays (exchanges Code and Data with each agent)
Blaze Agent: CSV Directory
Blaze Agent: MongoDB
Blaze Agent: Vertica
Blaze Agent: HDFS
32. How?
“I think you should be more explicit here in step two.”
33. Team
Travis Oliphant (NumPy, SciPy)
Peter Wang (Chaco, Bokeh)
Mark Wiebe (NumPy, DyND)
Stephen Diehl
Mark Florisson (Numba)
Francesc Alted (PyTables)
Oscar Villellas
34. DARPA providing help
DARPA-BAA-12-38: XDATA
TA-1: Scalable analytics and data processing technology
TA-2: Visual user interface technology
41. Roadmap
• 0.1 release expected in May
• 0.3 release at end of August
• 1.0 Release by PyData west-coast 2014
• Now only get involved if you want to develop
• Then, continue building the PyData ecosystem around a scalable array.
42. NumFOCUS
Num(Py) Foundation for Open Code for Usable Science
http://www.numfocus.org