3. @jreback
● Former quant
● Senior Engineer at Two Sigma, working on holistic approaches to modeling
● Core committer to pandas for the last 5 years
● Managed pandas since 2013
13. pandas's role in the Python Data Ecosystem
pandas sits between the libraries and their users, spanning:
● Numerical Computing
● IO / Data Access
● Data Visualization
● Statistics + Machine Learning
24. Performance
● Custom in-memory format
● Eager evaluation model, no query planning
● "Slow", limited multicore algorithms for large datasets
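To make the eager-evaluation point concrete, here is a toy sketch in plain Python (not pandas internals): every eager step materializes a full intermediate result, while a deferred engine with a query planner can fuse the same steps into a single pass.

```python
import math

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Eager model: two full temporaries are allocated before the reduction.
tmp1 = [x + y for x, y in zip(a, b)]      # temporary #1
tmp2 = [math.log(x) for x in tmp1]        # temporary #2
eager = sum(tmp2)

# Deferred/fused model: one pass over the data, no temporaries.
fused = sum(math.log(x + y) for x, y in zip(a, b))

assert math.isclose(eager, fused)
```

On real datasets the temporaries are full-size arrays, which is part of why large pipelines blow up memory.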
25. Data Tooling Spectrum
● Small Data: < 5 GB
● "Medium" Data: 5-100 GB
● "Big" Data: > 100 GB
pandas starts to fail as an effective tool somewhere around the 10 GB mark.
29. Big Data Unfriendly
● Each system has its own internal memory format
● 70-80% of computation is wasted on serialization and deserialization
● Similar functionality is implemented in multiple projects
33. pandas2 architecture
● pandas2: Python user API, user-defined functions; DataFrame semantics & compatibility
● Ibis: logical data frame expression graphs
● Apache Arrow: Arrow in-memory format, Arrow-optimized data connectors, parallel dataflow execution engine
41. Ibis in a nutshell
Python code -> abstract syntax tree (Ibis) -> compiler back end -> compiled code
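The pipeline above can be sketched with a toy expression tree and compiler; the class names here are illustrative, not Ibis's real API.

```python
# Toy AST: Python method calls build nodes instead of computing values.
class Column:
    def __init__(self, name):
        self.name = name

    def sum(self):
        return Call("sum", self)

class Call:
    def __init__(self, fn, arg):
        self.fn, self.arg = fn, arg

def compile_sql(expr, table):
    # Walk the AST and emit code for one particular back end (here, SQL).
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Call):
        return f"SELECT {expr.fn}({compile_sql(expr.arg, table)}) FROM {table}"

ast = Column("amount").sum()       # Python code -> abstract syntax tree
sql = compile_sql(ast, "orders")   # AST -> compiler back end -> compiled code
print(sql)                          # SELECT sum(amount) FROM orders
```

Because the AST is independent of any back end, the same expression can be compiled for SQL engines, Spark, or an in-memory runtime.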
42. pandas2 architecture
● pandas2: Python user API, user-defined functions; DataFrame semantics & compatibility
● Ibis: logical data frame expression graphs
● Apache Arrow: Arrow in-memory format, Arrow-optimized data connectors, parallel dataflow execution engine
43. Apache Arrow project
● Fast: Arrow supports zero-copy reads and is optimized for data locality.
● Flexible: Arrow acts as a new high-performance interface between various systems.
● Standard: Apache Arrow is backed by many key open source projects.
44. Big Data Friendly
● All systems utilize the same memory format
● No overhead for cross-system communication
● Projects can share functionality (e.g., a Parquet-to-Arrow reader)
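The "same memory format, no serialization" idea can be illustrated with Python's stdlib `memoryview`; Arrow does the analogous thing across processes and languages by standardizing the columnar buffer layout.

```python
import array

buf = array.array("d", [1.0, 2.0, 3.0, 4.0])

view = memoryview(buf)   # no bytes are copied
half = view[2:]          # slicing a view is also zero-copy

buf[2] = 99.0            # mutate the underlying buffer...
assert half[0] == 99.0   # ...and every consumer of the view sees it

copied = buf.tolist()    # by contrast, serialization materializes a copy
buf[2] = 3.0
assert copied[2] == 99.0 # the copy is now stale and cost extra memory
```

With a shared standard layout, two systems can hand each other the view instead of the copy, which is where the 70-80% serialization overhead disappears.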
pandas is the go-to pre- and post-processor: the UX and some of the UI. But it has:
● Warty missing data support
● Limited, non-extensible type metadata
● Weak support for categorical data
Issues with apply:
● the output shape of the result is hard to predict
● it's not obvious how to predict performance
● maybe show unwrapping groupby; note that Ibis does this automatically
● no type checking on arguments
● not syntactically natural
● can't easily rename
What is a view? A leaky implementation detail; copy-on-write to the rescue.
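The copy-on-write idea can be sketched in a few lines of plain Python (this is an illustration, not the pandas2 implementation): a view shares storage until the first write, at which point it copies, so a mutation can never leak back into the parent.

```python
class CowColumn:
    """Toy column with copy-on-write semantics."""

    def __init__(self, data, shared=False):
        self._data = data
        self._shared = shared

    def view(self):
        # Views share storage; both sides are marked as shared.
        self._shared = True
        return CowColumn(self._data, shared=True)

    def __getitem__(self, i):
        return self._data[i]

    def __setitem__(self, i, value):
        if self._shared:
            self._data = list(self._data)   # copy happens on first write only
            self._shared = False
        self._data[i] = value

col = CowColumn([1, 2, 3])
v = col.view()
v[0] = 99
assert v[0] == 99 and col[0] == 1   # the write did not leak into col
```

The payoff is predictability: users no longer need to know whether an indexing operation returned a view or a copy, because writes behave identically either way.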
You need 2x-5x the data size in memory to be comfortable.
other talks: Li Jin & streamz
Why this storage method?
● input was traditionally 2-D NumPy arrays, taken zero-copy
● predicting performance is hard when you add/remove blocks
● we store in Fortran (column-major) ordering, i.e., a columnar format
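A short sketch of what Fortran (column-major) ordering means for the layout: a "2-D" block is one flat buffer laid out column after column, which is why block storage is effectively columnar.

```python
n_rows, n_cols = 3, 2
# Logical table:  col0 = [1, 2, 3], col1 = [4, 5, 6]
flat = [1, 2, 3, 4, 5, 6]   # column-major: all of col0, then all of col1

def at(row, col):
    # Fortran-order index math: rows vary fastest within a column.
    return flat[col * n_rows + row]

assert at(0, 1) == 4
assert [at(r, 0) for r in range(n_rows)] == [1, 2, 3]  # a column is contiguous
```

Contiguous columns are exactly what columnar operations (scans, aggregations) want to read, so this layout choice lines up with Arrow's format.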
We have implemented a number of short-term fixes. These won't solve some of the major issues, but can be helpful in removing rough edges and incrementally improving performance.
We want to de-couple our *intent* from the *execution*
pandas2 as a drop-in replacement: ~95% compatible (changes to missing value support, deferred execution, copy-on-write, Table API).
talk about SQLAlchemy here.
mention mypy here
We describe our intent and dtypes; deferred execution allows query planning.
We can chain predicates.
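Chained predicates under deferred execution can be sketched like this (a hypothetical mini-API, not pandas2's actual one): each `.filter()` call only records intent, and `execute()` is free to combine all predicates into a single scan.

```python
class DeferredTable:
    """Toy deferred table: filters accumulate, nothing runs until execute()."""

    def __init__(self, rows, predicates=()):
        self.rows = rows
        self.predicates = predicates

    def filter(self, pred):
        # No work happens here; we only extend the logical plan.
        return DeferredTable(self.rows, self.predicates + (pred,))

    def execute(self):
        # A planner could reorder or fuse these; here we AND them in one scan.
        return [r for r in self.rows
                if all(p(r) for p in self.predicates)]

t = DeferredTable([1, 5, 10, 15])
small_odds = t.filter(lambda x: x < 12).filter(lambda x: x % 2 == 1)
assert small_odds.execute() == [1, 5]
```

Because intent and execution are decoupled, the same chained expression could be pushed down to a database or run on an Arrow-backed engine.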
This is what Arrow is today. In the future, it will also include computation.
compute / SIMD / vectors
GPU dataframe (GoAI /
top image: looks like a dataframe (but is actually a Table, i.e., no indices)
distinguish between logical & physical plans, using (a+b).log().sum()
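The distinction for `(a + b).log().sum()` can be sketched as follows: the logical plan is the expression tree the user wrote; the physical plan is one particular execution strategy a planner picks (here, a single fused pass; the data structures are illustrative).

```python
import math

# Logical plan: what the user asked for, as a backend-agnostic tree.
logical_plan = ("sum", ("log", ("add", "a", "b")))

def physical_plan(a, b):
    # One possible physical plan: fuse add, log, and sum into one kernel pass.
    return sum(math.log(x + y) for x, y in zip(a, b))

result = physical_plan([1.0, 2.0], [3.0, 4.0])
assert math.isclose(result, math.log(4.0) + math.log(6.0))
```

The same logical plan could instead compile to SQL, a Spark job, or a chunked out-of-core loop; only the physical plan changes.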
Inspired by prior art in peer ecosystems:
● Analytic SQL databases
● Modern deep learning frameworks
● Kernel functions -> UFuncs
Mention contributors? e.g. Phil c
Ibis:
● BigQuery, file backends (CSV, HDF5, Parquet) - done
● Spark backend - wanted
Arrow:
● compute kernels
Heart surgeon / mechanic joke