Shared Infrastructure for Data Science

Wes McKinney @wesmckinn
Rice Data Science Conference | October 2017

IMPORTANT LEGAL INFORMATION
• The information presented here is offered for informational purposes only and should not be used for any other purpose
(including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes
only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any
offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of
Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at
any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such
copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright
or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma,
nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved

THINKING ON THE LAST 10 YEARS
2007 2017

Shared front-ends
for data science

THE NEXT 10 YEARS AND BEYOND
2017 2027 …

THE AI ARMS RACE

CHANGING HARDWARE LANDSCAPE
DISK
PROCESSING
MEMORY

DATA SCIENCE LANGUAGE “SILOS”
FRONT-END
PYTHON R JVM JULIA …

WHAT’S IN A SILO?
• Storage / data access
• Data structures / in-memory formats
• General compute engine(s)
• Advanced analytics

WHAT’S IN A SILO?
In Python, these layers are filled by pandas, NumPy, and scikit-learn.

RENOVATING PANDAS

MAKING THE SILOS “SMALLER”
FRONT-END
PYTHON R JVM JULIA
?
…

PROGRAMMING LANGUAGES AS USER INTERFACES

GRAPHIC: Iceberg under sea (only top part visible to naked eye)

SAME ANALYSIS, DIFFERENT IMPLEMENTATION
R
df <- read_csv(…)
df %>% group_by(…) %>% summarise(…)
PYTHON
df = read_csv(…)
df.groupby(…).aggregate(…)
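Both snippets describe the same logical operation: read a table, group its rows by a key, and aggregate each group. As a rough illustration of that shared operation, here is a minimal sketch in plain Python with no data frame library. The CSV contents, the column names city and temp, and the helper names read_table and group_mean are all hypothetical; the slide elides the real arguments.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV contents; the slide elides the real file and columns.
CSV_TEXT = """city,temp
Houston,30
Austin,28
Houston,34
Austin,32
"""

def read_table(text):
    """Parse CSV text into a list of row dicts (the 'read_csv' step)."""
    return list(csv.DictReader(io.StringIO(text)))

def group_mean(rows, key, value):
    """Group rows by `key` and average `value` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return {k: sum(v) / len(v) for k, v in groups.items()}

rows = read_table(CSV_TEXT)
means = group_mean(rows, key="city", value="temp")
print(means)
```

The point of the slide is that R and Python each reimplement this logic behind a different surface syntax.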

A SHARED RUNTIME FOR DATA SCIENCE
FRONT-END
PYTHON R JVM JULIA
SHARED DATA SCIENCE RUNTIME
…

PART 1: STANDARD IN-MEMORY FORMAT
R PYTHON JVM …
Non-Portable Data Frames
PORTABLE DATA FRAME

PART 2: ZERO COPY INTERCHANGE
R PYTHON JVM …
SHARED MEMORY + STANDARD MEMORY FORMATS
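To make "zero copy" concrete, the sketch below shows the idea inside a single Python process: two consumers read and write the same buffer through views, and only an explicit conversion makes a copy. This is an analogy using the standard library's array and memoryview, not Arrow's API. Sharing across languages or processes additionally needs the agreed memory format named on the slide.

```python
from array import array

# One producer owns a contiguous buffer of float64 values.
values = array("d", [1.0, 2.0, 3.0, 4.0])

# Two "consumers" get views onto the same memory; no bytes are copied.
view_a = memoryview(values)
view_b = memoryview(values)[1:3]   # slicing a memoryview is also zero-copy

# A write through one view is visible through the other view and the owner.
view_a[1] = 20.0
print(view_b[0], values[1])

# By contrast, converting to a list copies the data,
# so a later write through a view does not reach the copy.
copied = list(values)
view_a[2] = 30.0
print(copied[2], view_b[1])
```

Because both sides agree on the layout (contiguous 8-byte floats), handing over the data costs the same regardless of its size.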

PART 3: HIGH PERFORMANCE DATA ACCESS
Storage Formats / Databases: BINARY COLUMNAR CSV SQL …
PORTABLE DATA FRAME
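One way to read this slide is that each storage format or database needs a connector that delivers data directly in the portable, column-oriented form. The sketch below assumes a hypothetical trades table in an in-memory SQLite database and a hypothetical helper fetch_columns. It turns a row-oriented result set into one typed buffer per column. It illustrates the shape of such a connector and is not how any Arrow connector is implemented.

```python
import sqlite3
from array import array

# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [(10, 1.5), (20, 2.5), (30, 3.5)],
)

def fetch_columns(conn, query, typecodes):
    """Run `query` and pivot its rows into one typed array per column."""
    cursor = conn.execute(query)
    names = [d[0] for d in cursor.description]
    columns = [array(tc) for tc in typecodes]
    for row in cursor:
        for col, value in zip(columns, row):
            col.append(value)
    return dict(zip(names, columns))

# "q" = 64-bit signed integer, "d" = 64-bit float
table = fetch_columns(conn, "SELECT qty, price FROM trades ORDER BY qty", "qd")
print(table)
```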

PART 4: FLEXIBLE COMPUTATION ENGINE
• Zero-overhead User-defined Functions
• Portable Operator “Graphs”
• “Embeddable” in Larger Systems

APACHE ARROW
Language-agnostic Data Frame Format
Zero-Copy Interchange

Simple, fast data interchange
GRAPHIC: Without Arrow / With Arrow

Big picture Arrow goals
• Cache-efficient columnar memory: optimized for CPU affinity and
SIMD / parallel processing, O(1) random value access
• Zero-copy messaging / IPC: Language-agnostic metadata,
batch/file-based and streaming binary formats
• Complex schema support: Flat and nested data types
• Main implementations in C++ and Java: with integration tests
• Bindings / implementations for C, Python, Ruby, JavaScript in
various stages of development
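The first goal can be illustrated with a simplified columnar layout. In the sketch below, an integer column is a contiguous values buffer plus a validity bitmap (one bit per slot, least-significant bit first). A string column is one contiguous byte buffer plus an offsets buffer. This mirrors the general idea of Arrow's layout but omits details such as alignment and padding, and the helper names are hypothetical.

```python
from array import array

def build_int_column(items):
    """Pack Python ints/None into a values buffer plus a validity bitmap."""
    values = array("q", (0 if x is None else x for x in items))
    bitmap = bytearray((len(items) + 7) // 8)
    for i, x in enumerate(items):
        if x is not None:
            bitmap[i // 8] |= 1 << (i % 8)   # least-significant bit first
    return values, bitmap

def get_int(values, bitmap, i):
    """O(1) random access: one bit test plus one array index."""
    if not (bitmap[i // 8] >> (i % 8)) & 1:
        return None
    return values[i]

def build_string_column(items):
    """Pack strings into one contiguous data buffer plus an offsets buffer."""
    offsets = array("i", [0])
    data = bytearray()
    for s in items:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, data

def get_string(offsets, data, i):
    """O(1) random access: slice between two adjacent offsets."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

values, bitmap = build_int_column([7, None, 9])
offsets, data = build_string_column(["R", "Python", "JVM"])
print(get_int(values, bitmap, 1), get_string(offsets, data, 1))
```

Each column's values sit next to each other in memory, which is what makes the layout friendly to CPU caches and SIMD.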

BUILDING THE ARROW FORMAT
• “Superset” of representations supported by R, pandas, SQL engines
• Optimized for CPU cache affinity
• ASF Governance: Open + Transparent Community Project

FEATHER: MINIMALIST ARROW ON DISK

Some Arrow OSS Users
Feather Format
Ray Project

Logical Operator Graphs
(a + b).log()
GRAPHIC: operator graph in which inputs a and b feed Add, and Add feeds Log
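A logical operator graph for (a + b).log() can be sketched with a few lines of operator overloading: building the expression only records nodes, and a separate step walks the graph and runs one kernel per node. The Node class, the KERNELS table, and the evaluate function are hypothetical names for illustration; they do not describe pandas2 or Arrow internals.

```python
import math

class Node:
    """A node in a logical operator graph; nothing is computed on creation."""
    def __init__(self, op, inputs=(), name=None):
        self.op = op            # "input", "add", or "log"
        self.inputs = inputs    # upstream Node objects
        self.name = name        # only set for input nodes

    def __add__(self, other):
        return Node("add", (self, other))

    def log(self):
        return Node("log", (self,))

    def describe(self):
        """Render the graph as a nested expression string."""
        if self.op == "input":
            return self.name
        args = ", ".join(n.describe() for n in self.inputs)
        return f"{self.op}({args})"

# Kernel functions: the atomic, element-wise units of computation.
KERNELS = {
    "add": lambda x, y: [p + q for p, q in zip(x, y)],
    "log": lambda x: [math.log(v) for v in x],
}

def evaluate(node, bindings):
    """Walk the graph bottom-up, running one kernel per operator node."""
    if node.op == "input":
        return bindings[node.name]
    args = [evaluate(n, bindings) for n in node.inputs]
    return KERNELS[node.op](*args)

a = Node("input", name="a")
b = Node("input", name="b")
expr = (a + b).log()            # builds Log(Add(a, b)); no work done yet
result = evaluate(expr, {"a": [1.0, 2.0], "b": [0.0, -1.0]})
print(expr.describe(), result)
```

Separating the graph from its evaluation is what lets an engine inspect, optimize, or ship the graph before any data is touched.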

Terminology
• Kernel functions: atomic units of
computation
• Operator nodes: input/output types,
operator parallelism properties

Parallel Execution of Operator Graphs
GRAPHIC: a, b → ADD → tmp → LOG → out

Some Optimization strategies
• Multicore scheduling
• Elimination of temporaries
• Operator fusion / pipelining
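These strategies can be sketched on the ADD then LOG example. The unfused version materializes the temporary between the two operators. The fused version computes log(x + y) in a single pass, and the inputs are split into chunks that a worker pool processes independently. Function names and chunk sizes are hypothetical. In CPython the thread pool shows the scheduling structure only and does not speed up this pure-Python kernel, and the list slices used for chunking copy data, which a real engine would avoid.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def unfused(a, b):
    """Run ADD then LOG as separate passes, materializing `tmp`."""
    tmp = [x + y for x, y in zip(a, b)]
    return [math.log(t) for t in tmp]

def fused_kernel(a_chunk, b_chunk):
    """ADD and LOG fused into one pass; no intermediate `tmp` list is built."""
    return [math.log(x + y) for x, y in zip(a_chunk, b_chunk)]

def fused_parallel(a, b, chunk_size=2, workers=2):
    """Split the inputs into chunks and run the fused kernel on each."""
    starts = range(0, len(a), chunk_size)
    a_chunks = [a[s:s + chunk_size] for s in starts]
    b_chunks = [b[s:s + chunk_size] for s in starts]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(fused_kernel, a_chunks, b_chunks)  # preserves order
        return [v for part in parts for v in part]

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 1.0, 1.0, 1.0, 1.0]
out_unfused = unfused(a, b)
out_fused = fused_parallel(a, b)
print(out_fused == out_unfused)
```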

Arrow-optimized data connectors
Arrow in-memory format
Logical Data Frame Expression Graphs
Parallel Dataflow Execution Engine
Python user API, DataFrame semantics, User-defined functions
pandas2
Apache Arrow

THANK YOU
WES MCKINNEY @WESMCKINN
Apache Arrow: http://arrow.apache.org
