SkyhookDM - Towards an Arrow-Native Storage System
1. Jayjeet Chakraborty
Mentored by: Carlos Maltzahn, Ivo Jimenez, Jeff LeFevre
2. Who am I?
• Incoming Grad Student at UC Santa Cruz
• CS Graduate from NIT Durgapur, India
• IRIS-HEP Fellow Summer 2020
• Twitter: @heyjc25
• Github: JayjeetAtGithub
• LinkedIn: https://www.linkedin.com/in/jayjeet-chakraborty-077579162/
• E-Mail: jchakra1@ucsc.edu
3. Problem
• With high-speed network and storage devices, the CPU is the new bottleneck.
• Client-side processing of data from highly efficient storage formats like Parquet and ORC exhausts the client CPUs.
• Scalability is severely hampered.
Our Solution
• Offload computation from the client to the storage layer.
• Take advantage of the idle CPUs of storage systems for increased processing rates and faster queries.
• Results in less data movement and network traffic.
4. Introduction to Ceph
1. Provides 3 types of storage interface: File, Object, Block.
2. No central point of failure. Uses CRUSH maps that contain the object-to-OSD mapping. Each client holds a CRUSH map and talks directly to the OSDs.
3. Highly extensible object storage layer via the Ceph Object Classes SDK.
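The second point, clients locating objects without asking a central service, can be illustrated with a minimal sketch. It uses rendezvous hashing as a simplified stand-in for CRUSH, not the real algorithm, and the OSD names and the `place` function are hypothetical.

```python
import hashlib

def place(object_name, osds, replicas=3):
    """Pick `replicas` OSDs for an object using rendezvous hashing.

    Simplified stand-in for CRUSH (an assumption for illustration): every
    client holding the same OSD list computes the same answer locally,
    so no central directory is consulted.
    """
    def score(osd):
        digest = hashlib.sha256(f"{osd}:{object_name}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Highest-scoring OSDs win; the ranking is deterministic for a given name.
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = ["osd.0", "osd.1", "osd.2", "osd.3"]  # hypothetical cluster
print(place("10000000001.00000000", osds))
```

Because the placement is a pure function of the object name and the OSD list, two clients never need to coordinate to agree on where an object lives.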
5. Apache Arrow
• Language-independent columnar memory format for flat and hierarchical data, organised for efficient analytic operations on modern hardware.
• Share data between processes without serialization overhead.
[Figure: before Arrow vs. after Arrow]
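To make the two ideas on this slide concrete, here is a small sketch that uses only the Python standard library rather than Arrow itself (an assumption made to keep the example dependency-free). It stores each column in one contiguous typed buffer and hands a consumer a view over that memory instead of a serialized copy, all within a single process.

```python
from array import array

# Row-oriented: each record is a separate Python tuple.
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Column-oriented: each column lives in one contiguous typed buffer.
ids = array("q", (r[0] for r in rows))     # 64-bit signed integers
prices = array("d", (r[1] for r in rows))  # 64-bit floats

# A consumer receives a view over the same memory; nothing is copied
# or serialized when the view is created.
view = memoryview(prices)

# An in-place update by the producer is visible through the view.
prices[0] = 11.0
print(view[0])

# An analytic scan touches only the one column it needs.
print(sum(view))
```

The names `ids`, `prices`, and `view` are illustrative only; the point is the layout and the absence of a copy, not the specific API.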
7. Design Paradigm
• Extend client and storage layers of programmable storage systems with data access libraries.
• Embed a FS shim inside storage nodes to have a file-like view over objects.
• Allow direct interaction with objects in an object store while bypassing the filesystem layer by utilising FS metadata.
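The last bullet relies on being able to map a file to the object that backs it. As a hedged sketch: CephFS names a file's data objects after the file's inode number in hexadecimal, followed by an 8-digit hexadecimal object index, so a client can derive an object name from a `stat` call. The helper name `object_name_for` is hypothetical, and the snippet assumes the file fits in a single object (index 0).

```python
import os
import tempfile

def object_name_for(inode, index=0):
    """Build a CephFS-style data object name: <inode in hex>.<index as 8 hex digits>."""
    return f"{inode:x}.{index:08x}"

# On a real deployment the path would live on a CephFS mount; a temporary
# local file is used here only so that the sketch runs anywhere.
with tempfile.NamedTemporaryFile() as f:
    inode = os.stat(f.name).st_ino
    print(object_name_for(inode))
```

With the name in hand, a client can address the object in the object store directly instead of reading the file through the filesystem layer.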
8. Architecture
• Arrow data access libraries are embedded inside Ceph OSDs to allow file fragment scanning inside the storage layer.
• The functionality is exposed through the Arrow Dataset API by creating a new file format abstraction, “RadosParquetFileFormat”.
9. File Layout Design
• Large multi-gigabyte Parquet files are split into smaller ~128 MB Parquet files.
• Each Parquet file is stored in a single RADOS object for SkyhookDM to access.
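One way to plan such a split is sketched below. The slide does not say how the files are cut, so this is an assumption: it greedily groups consecutive Parquet row groups so that each output file stays near the ~128 MB target. The function name `plan_split` and the example sizes are hypothetical.

```python
TARGET_BYTES = 128 * 1024 * 1024  # ~128 MB target taken from the slide

def plan_split(row_group_sizes, target=TARGET_BYTES):
    """Group consecutive row groups so each output file stays within `target` bytes.

    Returns a list of lists of row-group indices, one inner list per output file.
    A single row group larger than `target` still gets a file of its own.
    """
    files, current, current_bytes = [], [], 0
    for i, size in enumerate(row_group_sizes):
        # Start a new file when adding this row group would exceed the target.
        if current and current_bytes + size > target:
            files.append(current)
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += size
    if current:
        files.append(current)
    return files

mb = 1024 * 1024
print(plan_split([64 * mb] * 5))
```

Each inner list would then be written out as one small Parquet file and stored as one object.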
10. Experiments: Latency
• Offloading makes more selective queries faster, since less data is moved around the system. Less time also goes into data (de)serialization and more into processing.
• With LZ4-compressed Arrow IPC files (bottom), SkyhookDM performs better than with Parquet files (top), since IPC files are faster to read and write.
[Figure: latency with Parquet on disk (top) and LZ4 IPC on disk (bottom)]
11. Experiments: CPU Usage
• SkyhookDM offloads CPU usage from the client layer to the storage layer, for example with 4 OSDs and 100% selectivity.
[Figure: CPU usage without Skyhook vs. with Skyhook]
12. Experiments: Network Traffic
• SkyhookDM saves network bandwidth by transferring only the data that is requested by the client.
• At 100% selectivity we end up transferring a little more data, as LZ4-compressed Arrow is larger than Parquet binary data.
[Figure: network traffic at 1%, 10%, and 100% selectivity]
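The first bullet can be illustrated with a toy model that counts how many rows cross the network when filtering happens on the client versus on the storage node. The row counts and function names are hypothetical, and the model deliberately ignores the format-size effect described in the second bullet.

```python
def client_side_scan(rows, predicate):
    """All rows cross the network; the client filters them afterwards."""
    transferred = list(rows)
    result = [r for r in transferred if predicate(r)]
    return result, len(transferred)

def storage_side_scan(rows, predicate):
    """The storage node filters first; only matching rows cross the network."""
    result = [r for r in rows if predicate(r)]
    return result, len(result)

rows = range(1000)  # hypothetical table with 1000 rows
for pct in (1, 10, 100):
    # Select the first pct% of the rows.
    pred = lambda r, limit=pct * 10: r < limit
    _, moved_client = client_side_scan(rows, pred)
    _, moved_storage = storage_side_scan(rows, pred)
    print(f"{pct}% selectivity: client-side moves {moved_client} rows, "
          f"offloaded moves {moved_storage} rows")
```

Both strategies return the same result; only the amount of data moved differs, and the gap closes as selectivity approaches 100%.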
13. Experiments: Crash Recovery
• In SkyhookDM, since processing is colocated with storage nodes, the crash recovery and consistency semantics of the storage layer apply naturally to query processing.
[Figure: crash recovery experiment, with the crash point marked]
14. Coffea + SkyhookDM
• Implemented a run_parquet_job executor method in Coffea to be able to read from Parquet files using the Arrow Dataset API. This in turn allowed integrating Coffea with SkyhookDM seamlessly.
15. Ongoing Work
• Arrow’s memory layout requires internal memory copies to serialize it to a contiguous on-the-wire format, and this has a very high overhead.
Sending uncompressed IPC:
  [6] Serialize Result Table: 41.5%
  [5] Scan Parquet Data: 30.5%
  [7] Result Transfer: 24.6%
  [4] Disk I/O: 3.34%
  [3] Deserialize Scan Request: 0.103%
  [1] Stat Fragment: 0.0324%
  [8] Deserialize Result Table: 0.00855%
  [2] Serialize Scan Request: 0.00511%
Sending LZ4 compressed IPC:
  [5] Scan Parquet Data: 48.3%
  [6] Serialize Result Table: 29.5%
  [7] Result Transfer: 11.7%
  [8] Deserialize Result Table: 5.37%
  [4] Disk I/O: 5.11%
  [3] Deserialize Scan Request: 0.0513%
  [1] Stat Fragment: 0.0304%
  [2] Serialize Scan Request: 0.00771%
• Collaborating with the ServiceX and Coffea teams to integrate SkyhookDM into the larger analysis facility ecosystem.
16. Check out our work
• Github Repository: https://github.com/uccross/skyhookdm-arrow
• Docker containers: https://github.com/uccross/skyhookdm-arrow-docker
• ArXiv Paper: https://arxiv.org/pdf/2105.09894.pdf
• Coffea Skyhook Plugin: https://github.com/CoffeaTeam/coffea/tree/master/docker/coffea_rados_parquet
• Several bugs found and reported in Apache Arrow: ARROW-13161, ARROW-13126, ARROW-13088.