16. End-to-end engines: drawbacks
• Examples: SQL-on-Hadoop systems, Apache Spark,
and others
• Serve some use cases well, others less well
• Fall short in the ML/AI domain
22. How to Eliminate Serialization
“Serialized” and In-Memory Format
must be the same (or nearly so)
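A minimal sketch of the idea on this slide, using only the Python standard library. It is not Arrow's actual format or API: the helper names `as_wire` and `from_wire` are hypothetical, and a single `array` of doubles stands in for a column. The point it illustrates is that when the bytes handed to the transport are the same bytes used in memory, "deserializing" reduces to reinterpreting a buffer.

```python
from array import array


def as_wire(column: array) -> memoryview:
    """Expose the column's buffer as raw bytes without copying (hypothetical helper)."""
    return memoryview(column).cast("B")


def from_wire(payload: memoryview, typecode: str) -> memoryview:
    """Reinterpret raw bytes as typed values; no parsing, no copy (hypothetical helper)."""
    return payload.cast(typecode)


# Producer: one contiguous buffer of 64-bit floats.
column = array("d", [1.5, 2.5, 4.0])

# "Serialize": hand out a byte view of the same memory.
payload = as_wire(column)

# "Deserialize": view those bytes as doubles again.
received = from_wire(payload, "d")
total = sum(received)

# Both sides share one buffer, so a write on the producer side
# is visible through the consumer's view.
column[0] = 10.0
shares_memory = received[0] == 10.0
```

In a real system the byte view would cross a process or network boundary, but the receiving side would still only wrap the bytes rather than decode them value by value.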
23. A Collective Realization in 2015
Many open source developers had noted the
absence of an in-memory standard for
structured data analytics
24. Mission
● Language-agnostic in-memory format for
analytical query processing on modern
hardware
● Low-overhead data sharing and transport
● A cross-language development platform to
build Arrow-powered applications
26. Apache Arrow “meta” goals
• Forge collaborations between database
systems and data science / ML / AI
communities
• Eliminate barriers to code sharing between
application ecosystems and programming
languages
27. Community over Code
• ASF open governance model
• ~400 unique contributors
• 49 committers, 28 PMC members
• 11 programming languages
represented
28. Arrow Development in Practice
• “Core” format and protocol implementations
• “Batteries-included” standard libraries
• Common build / test / package infrastructure and
compatibility testing
31. Arrow Flight: Fast Data Services
• gRPC-based framework for custom data
services
• High-speed network dataset transfer
• Now available for C++, Java, Python
32. Flight key ideas
• Zero-serialization
• Bidirectional streaming transfers
• Parallel transfers + horizontal scalability
designed into the protocol
• Reap benefits of Google’s work on gRPC
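The "parallel transfers" idea can be sketched as a toy model in plain Python. This is not the Flight API: the local functions `get_flight_info` and `do_get` only borrow the names of Flight's GetFlightInfo and DoGet calls, and the `Endpoint` class, node names, tickets, and data are all assumptions made up for illustration. The pattern shown is that a client first asks where the pieces of a dataset live, then pulls each piece from its own endpoint concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    location: str  # which server holds this partition (hypothetical name)
    ticket: bytes  # opaque token identifying the partition


# Toy cluster state: partitions stored on three imaginary nodes.
STORAGE = {
    ("node-a", b"part-0"): [1, 2, 3],
    ("node-b", b"part-1"): [4, 5],
    ("node-c", b"part-2"): [6, 7, 8, 9],
}
DATASETS = {"example": [Endpoint(loc, ticket) for (loc, ticket) in STORAGE]}


def get_flight_info(dataset: str) -> list:
    """Return the endpoints that together make up the dataset."""
    return DATASETS[dataset]


def do_get(endpoint: Endpoint) -> list:
    """Fetch one partition from the node that holds it."""
    return STORAGE[(endpoint.location, endpoint.ticket)]


def fetch_dataset(dataset: str) -> list:
    endpoints = get_flight_info(dataset)
    # One worker per endpoint; pool.map preserves endpoint order.
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        streams = list(pool.map(do_get, endpoints))
    return [value for stream in streams for value in stream]


rows = fetch_dataset("example")
```

Because each partition is addressed by its own endpoint, adding nodes adds transfer capacity without changing the client logic, which is the sense in which horizontal scalability is designed into the protocol.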
33. Flight use cases
• Replacing slow database protocols like
JDBC / ODBC
• General network data movement
• Retrofit legacy systems with fast Arrow IO
35. Funding Arrow Development
• Apache projects are technically communities of
volunteers
• Much development contributed by direct users of Arrow
• Ursa Labs: not-for-profit group I founded in 2018 with
initial support of RStudio and Two Sigma