Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.
Apache Arrow - An Overview
1. Dremio Confidential. UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Apache Arrow
Columnar In-Memory Analytics
2. Dremio [NOT TODAY’S TOPIC]
Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran, Founder & CEO
• VP Product, MapR; Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem, Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Founded in June 2015
• Led by experts in Big Data and open source (Apache Parquet, Drill, Pig, Calcite and more)
• Currently in stealth
3. Introducing Apache Arrow
• New open source project under the Apache Software Foundation
– Top-level project (directly!)
• Introduces new era of Columnar In-Memory Analytics
1. 10-100x speedup & concurrency for most workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Users can utilize any programming language
4. Works with relational and complex data as-is; no ETL required
• 13 major open source Big Data projects are already on board
– A significant % of the world’s data will be processed through Arrow!
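Point 4 above mentions complex data. One way to picture nested data held in columns is an offsets-plus-values layout. The following is a minimal Python sketch of that idea using only the standard library, not Arrow's API; the `sessions` records and the `get_list` helper are hypothetical names chosen for illustration.

```python
from array import array

# Hypothetical records: each session holds a variable-length list of page ids.
sessions = [[10, 11], [], [12, 13, 14]]

# Columnar form: one flat buffer of values plus one buffer of offsets.
# List i occupies values[offsets[i]:offsets[i + 1]].
values = array("q")          # 64-bit signed integers, stored contiguously
offsets = array("i", [0])    # 32-bit offsets, starting at 0

for pages in sessions:
    values.extend(pages)
    offsets.append(len(values))


def get_list(i):
    """Rebuild the i-th nested list from the two flat buffers."""
    return list(values[offsets[i]:offsets[i + 1]])


if __name__ == "__main__":
    print(list(offsets))
    print(list(values))
    print([get_list(i) for i in range(len(sessions))])
```

The nested records are never flattened into separate tables; the empty list is simply two equal consecutive offsets.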
4. Arrow Turbo-Charges Big Data Execution Engines
[Figure: execution engines (Impala, …), each shown on top of Apache Arrow]
5. Performance Advantage of Columnar In-Memory
SELECT * FROM clickstream WHERE session_id = 1331246351
[Figure: an Intel CPU reading from a traditional memory buffer versus an Arrow memory buffer]
• Arrow leverages the data parallelism (SIMD) in modern Intel CPUs
• Arrow optimizes CPU prefetching and caching
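To illustrate the layout difference behind the query on this slide, here is a minimal sketch using only Python's standard library. The sample rows and the `timestamp` and `url` columns are invented for illustration, and the function names are hypothetical. Pure Python does not use SIMD; the sketch only shows that a columnar filter scans one contiguous buffer of `session_id` values instead of walking every field of every row.

```python
from array import array

# Hypothetical clickstream rows: (session_id, timestamp, url).
rows = [
    (1331246351, 1000, "/home"),
    (1331246352, 1001, "/cart"),
    (1331246351, 1002, "/checkout"),
    (1331246353, 1003, "/home"),
]


def filter_rows(rows, target):
    """Row-oriented scan: every row object is visited to test one field."""
    return [i for i, row in enumerate(rows) if row[0] == target]


# Columnar layout: one buffer per column. The integer columns are stored
# as contiguous 64-bit values, the kind of layout a CPU can prefetch well.
session_id = array("q", [r[0] for r in rows])
timestamp = array("q", [r[1] for r in rows])
url = [r[2] for r in rows]


def filter_column(column, target):
    """Columnar scan: only the filtered column is read."""
    return [i for i, value in enumerate(column) if value == target]


# WHERE session_id = 1331246351, then SELECT * for the matching positions.
matches = filter_column(session_id, 1331246351)
result = [(session_id[i], timestamp[i], url[i]) for i in matches]

if __name__ == "__main__":
    print(matches)
    print(result)
```

Both scans return the same row positions; the difference is how much unrelated data sits between consecutive `session_id` values in memory.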
6. Evolution Towards Heterogeneous Data Infrastructure
RDBMS
Hadoop MapReduce
Databases: Cassandra, Elasticsearch, HBase, Kudu, MongoDB, Parquet, Phoenix
Execution Engines: Drill, Ibis, Impala, MapReduce, Pandas, Spark, Storm
Phase 1, Common Scheduler: YARN, Mesos, Kubernetes
Phase 2, Common Data/Memory: Arrow
7. Advantages of a Common Data Layer
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
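The contrast between the two columns can be sketched in a few lines of Python. This is an in-process analogy under stated assumptions, not Arrow's API: `pickle` stands in for a system-specific serialization step, and a `memoryview` over the same buffer stands in for two systems that agree on one memory format.

```python
import pickle
from array import array

# A column of 64-bit integers owned by a hypothetical "producer" system.
column = array("q", range(5))

# Today: the consumer gets its own copy through a serialize/deserialize
# round trip, which costs CPU time and memory.
wire = pickle.dumps(column)
copy = pickle.loads(wire)

# Shared format: the consumer reads the producer's buffer directly.
# No bytes are copied or re-encoded.
view = memoryview(column)

# A later write by the producer is visible through the shared view,
# but not through the deserialized copy.
column[0] = 42
shared_sees_update = view[0] == 42
copy_sees_update = copy[0] == 42

if __name__ == "__main__":
    print(shared_sees_update, copy_sees_update, view.nbytes)
```

The view reading the producer's write shows that both sides are looking at the same bytes, which is the property a common memory format is meant to provide across systems.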
8. Who’s Behind Apache Arrow?
• The creators and lead developers of 13 major open source Big Data projects
– Employees of Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce, Twitter
• Jacques Nadeau is the PMC Chair (aka VP Apache Arrow)
– Co-founder & CTO of Dremio
Projects: Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm
9. Current Status
• C, C++, Python and Java implementations currently underway
• Will be adopted by Drill, Ibis, Impala, Kudu, Parquet and Spark by EOY
• Additional languages (e.g., R, JavaScript) and projects also expected to adopt Arrow by EOY
10. Questions?
Jacques Nadeau
Dremio Founder & CTO
VP Apache Arrow
Julien Le Dem
Dremio Architect
VP Apache Parquet
12. PMC Members/Committers
Jacques Nadeau (PMC Chair)
Todd Lipcon
Ted Dunning
Michael Stack
P. Taylor Goetz
Reynold Xin
Julian Hyde
Julien Le Dem
James Taylor
Jake Luciani
Parth Chandra
Alex Levenson
Marcel Kornacker
Steven Phillips
Hanifi Gunes
Jason Altekruse
Abdel Hakim Deneche
Wes McKinney
Karthik Ramasamy
David Alves
Seshadri Mahalingam
Ippokratis Pandis
Editor's notes
This is changing the world! Emphasize that.
Trying to turbo-charge all the major technologies that people use today.
Explain that columnar on disk existed for several years, this is columnar in memory
Is this only CPU and cache, or also main memory? BOTH, EVERYTHING. That’s what’s amazing here.
Very technical explanation – simplify it. One blue vs 4 blues
Maybe improve the slide – from common scheduling to common data in memory
Don’t say it will come in the coming months and years. Years is too far in the future. Everyone has the need today.
We’re not offloading the work for them, they are going to do the work.
Relationships – good point
Call this a platform?