2. DREMIO
Who
Wes McKinney
• Engineer at Cloudera, formerly DataPad CEO/founder
• Wrote the bestseller Python for Data Analysis (2012)
• Open source projects
– Python {pandas, Ibis, statsmodels}
– Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly works in Python and Cython/C/C++
Jacques Nadeau
• CTO & Co-Founder at Dremio, formerly Architect at MapR
• Open source projects
– Apache {Arrow, Parquet, Calcite, Drill, HBase, Phoenix}
• Mostly works in Java
3. DREMIO
Arrow in a Slide
• New Top-level Apache Software Foundation project
– Announced Feb 17, 2016
• Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
– A significant % of the world’s data will be processed through Arrow!
Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
6. DREMIO
Overview
• A high-speed in-memory representation
• Well-documented and cross-language compatible
• Designed to take advantage of modern CPU characteristics
• Embeddable in execution engines, storage layers, etc.
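To make the columnar idea concrete, here is a minimal sketch (illustrative only; Arrow's actual layout adds validity bitmaps, offset buffers, and padding guarantees) contrasting row-wise records with column-contiguous storage:

```python
# Sketch: row-wise vs. columnar storage of the same records.
# Illustrative only; not Arrow's real physical layout.

rows = [
    {"name": "a", "iq": 100},
    {"name": "b", "iq": 120},
    {"name": "c", "iq": 140},
]

# Row-wise: values of different fields are interleaved in memory.
# Columnar: each field's values are contiguous, so a scan over one
# column is sequential and cache/SIMD friendly.
columns = {
    "name": [r["name"] for r in rows],
    "iq": [r["iq"] for r in rows],
}

# Summing one column touches only that column's buffer.
total_iq = sum(columns["iq"])
assert total_iq == 360
```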
8. DREMIO
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% of CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., a Parquet-to-Arrow reader)
[Diagram: today, each pair of systems (Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, Spark) must copy & convert between formats; with Arrow, all of them share a single Arrow memory format.]
9. DREMIO
Shared Need > Open Source Opportunity
• Columnar is complex
• Shredded columnar is even more complex
• We all need to go to the same place
• Take advantage of the open source approach
• Once we pick a shared solution, we get interchange for "free"
"We are also considering switching to a columnar canonical in-memory format for data that needs to be materialized during query processing, in order to take advantage of SIMD instructions" – Impala Team
"A large fraction of the CPU time is spent waiting for data to be fetched from main memory… we are designing cache-friendly algorithms and data structures so Spark applications will spend less time waiting to fetch data from memory and more time doing useful work" – Spark Team
18. DREMIO
Java: Memory Management (& NVMe)
• Chunk-based managed allocator
– Built on top of Netty's jemalloc implementation
• Create a tree of allocators
– Limit and transfer semantics across allocators
– Leak detection and location accounting
• Wrap native memory from other applications
• New support for integration with Intel's persistent memory library via Apache Mnemonic
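The allocator-tree idea can be sketched in a few lines of Python (a conceptual model, not Arrow's actual Java allocator API): each child's usage counts against every ancestor's limit, and closing an allocator with outstanding bytes reports a leak.

```python
# Conceptual sketch of a tree of allocators with limit semantics
# and leak detection. Names and structure are illustrative, not
# Arrow's real BufferAllocator interface.

class Allocator:
    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit       # max bytes this subtree may hold
        self.parent = parent
        self.allocated = 0       # bytes currently held in this subtree

    def new_child(self, name, limit):
        return Allocator(name, limit, parent=self)

    def allocate(self, nbytes):
        # The request must fit under every ancestor's limit...
        node = self
        while node is not None:
            if node.allocated + nbytes > node.limit:
                raise MemoryError(f"limit exceeded at allocator '{node.name}'")
            node = node.parent
        # ...then it is charged to every level of the tree.
        node = self
        while node is not None:
            node.allocated += nbytes
            node = node.parent
        return nbytes

    def release(self, nbytes):
        node = self
        while node is not None:
            node.allocated -= nbytes
            node = node.parent

    def close(self):
        # Leak detection: closing with outstanding bytes is an error.
        if self.allocated != 0:
            raise RuntimeError(f"leak: {self.allocated} bytes in '{self.name}'")

root = Allocator("root", limit=1024)
child = root.new_child("operator-1", limit=256)
child.allocate(128)
child.release(128)
child.close()
```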
20. DREMIO
Common Message Pattern
• Schema Negotiation
– Logical description of structure
– Identification of dictionary-encoded nodes
• Dictionary Batch
– Dictionary ID, values
• Record Batch
– Batches of up to 64K records
– Leaf nodes up to 2B values
[Diagram: message flow is a schema negotiation, then 0..N dictionary batches, then 1..N record batches.]
21. DREMIO
Record Batch Construction
[Diagram: the record batch for the example below is laid out as a series of buffers: a data header (describing offsets into the data), then for each field a validity bitmap plus its data buffers: name (bitmap, offset, data), iq (bitmap, data), addresses (bitmap, list offset), addresses.number (bitmap, data), addresses.street (bitmap, offset, data).]

{
  name: 'wes',
  iq: 180,
  addresses: [
    {number: 2, street: 'a'},
    {number: 3, street: 'bb'}
  ]
}

Each box is contiguous memory, and the batch is entirely contiguous on the wire.
22. DREMIO
RPC & IPC: Moving Data Between Systems
RPC
• Avoid serialization & deserialization
• Layer TBD: focused on supporting vectored I/O
– Scatter/gather reads/writes against a socket
IPC
• Alpha implementation using memory-mapped files
– Moving data between Python and Drill
• Working on a shared allocation approach
– Shared reference counting and well-defined ownership semantics
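A minimal sketch of the memory-mapped-file approach (illustrative only; real Arrow IPC frames record batches with flatbuffer metadata): a writer lays a length-prefixed buffer into a file, and a reader maps it and reads the values without pushing bytes through a serializer.

```python
# Sketch of memory-mapped-file IPC. The file name and framing are
# invented for illustration; this is not the Arrow IPC format.

import mmap
import os
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "batch.bin")

# Writer: a length-prefixed payload, standing in for a record batch.
payload = struct.pack("<3i", 1, 2, 3)  # three little-endian int32 values
with open(path, "wb") as f:
    f.write(struct.pack("<I", len(payload)))
    f.write(payload)

# Reader: map the file and decode in place, with no socket and no
# serialization layer in between.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (n,) = struct.unpack_from("<I", mm, 0)
    values = struct.unpack_from(f"<{n // 4}i", mm, 4)
    mm.close()

# values == (1, 2, 3)
```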
24. DREMIO
Real World Example: Python With Spark or Drill
[Diagram: the SQL engine streams input partitions 0 through n-1 into user-supplied Python functions; their outputs become output partitions 0 through n-1, consumed by the SQL engine.]
25. DREMIO
Real World Example: Feather File Format for Python and R
• Problem: a fast, language-agnostic binary data frame file format
• Written by Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk I/O performance

[Diagram: a Feather file is a sequence of Arrow arrays (0..n) followed by Feather metadata; it builds on Apache Arrow memory and Google flatbuffers.]
26. DREMIO
Real World Example: Feather File Format for Python and R

R:
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)

Python:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
27. DREMIO
What’s Next
• Parquet for Python & C++
– Using Arrow Representation
• Available IPC Implementation
• Spark, Drill Integration
– Faster UDFs, Storage interfaces
28. DREMIO
Get Involved
• Join the community
– dev@arrow.apache.org
– Slack: https://apachearrowslackin.herokuapp.com/
– http://arrow.apache.org
– @ApacheArrow, @wesmckinn, @intjesus