Leveling Up Analytics with Apache Arrow


Wes McKinney

Keynote from OmniSci Converge conference 10/22/2019 in Mountain View, CA


Leveling up the Analytics Stack
Wes McKinney
@wesmckinn

Cloudera’s Jeff Hammerbacher
“Data first, ask questions later”
Enable everyone to “party on the data”
(paraphrasing!)

Simplified MapReduce Architecture
Storage
Step 1 Step 2 Step 3 ...

“Scalability! But at what COST?”
McSherry, Isard, Murray 2015
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

COST: Configuration that Outperforms a Single Thread

Up the hierarchy of needs...
• Storage Feasibility
• Computational Feasibility
• Resource Utilization
• Interactivity

Rise of “End-to-end” Execution Engines
Storage
Step 1 Step 2 Step 3 ...
INPUT OUTPUT

End-to-end engines: drawbacks
• Examples: SQL-on-Hadoop systems, Apache Spark, others
• Serve some use cases well, others less well
• Fall short in ML/AI domain

The Interoperability Conundrum
Engine A Engine B
Data Handoff

Some Hardware Trends
• Manycore processor architectures
• Much faster disk
• Much faster networking
• Beyond CPUs

“Why Modern CPUs Are Starving
and What Can Be Done about It”
Francesc Alted, IEEE 2010

Serialization
Translation of data into a form that can
be stored or transmitted, and
reconstructed later

How to Eliminate Serialization
“Serialized” and In-Memory Format
must be the same (or nearly so)
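
To make the idea concrete, here is a minimal sketch using only the Python standard library, not Arrow itself. The column name `prices` and its values are made up for illustration. It contrasts a text round trip, which must be parsed on arrival, with a buffer whose "wire" bytes are already the in-memory layout and can be read through a typed view.

```python
import json
from array import array

# Hypothetical column of float64 values held in one contiguous buffer.
prices = array("d", [9.5, 12.25, 3.0, 7.75])

# Serialization route: encode to text, then parse every value back.
encoded = json.dumps(list(prices))
decoded = json.loads(encoded)  # rebuilds a new Python object per value

# Same-format route: the bytes handed over are the in-memory bytes.
# (tobytes() copies once here; a real shared-memory or network buffer
# would simply be the source of these bytes.)
wire_bytes = prices.tobytes()

# The receiver reinterprets the buffer as float64 without parsing it.
received = memoryview(wire_bytes).cast("d")

total = sum(received)
```

The receiver does no per-value decoding; it only needs to know the element type and the length of the buffer.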

A Collective Realization in 2015
Many open source developers had noted the
absence of an in-memory standard for
structured data analytics

Apache Arrow Mission
● Language-agnostic in-memory format for
analytical query processing on modern
hardware
● Low-overhead data sharing and transport
● A cross-language development platform to
build Arrow-powered applications

Why Column-oriented?
• Reduce unnecessary IO
• Increase memory throughput
• Better parallelism
• Leverage SIMD instructions
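
As a rough illustration of the layout difference, the sketch below stores the same three records row-wise and column-wise using only the Python standard library. The field names (`id`, `price`, `qty`) and values are hypothetical, and this is not Arrow's actual memory format. An aggregate over one field only has to walk that field's contiguous typed buffer in the columnar case.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    (1, 9.5, 3),
    (2, 12.25, 1),
    (3, 3.0, 7),
]

# Column-oriented: each field lives in its own contiguous, typed buffer.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "price": array("d", (r[1] for r in rows)),
    "qty": array("q", (r[2] for r in rows)),
}

# Row layout: every record is visited even though only one field is needed.
row_total = sum(r[1] for r in rows)

# Column layout: only the "price" buffer is read.
col_total = sum(columns["price"])
```

Pure Python will not show the SIMD or cache benefits directly; the point is only that the columnar form gives a query engine one dense, uniformly typed buffer per field to scan.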

Apache Arrow “meta” goals
• Forge collaborations between database
systems and data science / ML / AI
communities
• Eliminate barriers to code sharing between
application ecosystems and programming
languages

Community over Code
• ASF open governance model
• ~400 unique contributors
• 49 committers, 28 PMC members
• 11 programming languages
represented

Arrow Development in Practice
• “Core” format and protocol implementations
• “Batteries-included” standard libraries
• Common build / test / package infrastructure and
compatibility testing

Language Relationships
C++
Java
Go
Rust
C#
JavaScript
C Ruby
Python
R
MATLAB

Arrow Flight: Fast Data Services
• gRPC-based framework for custom data
services
• High-speed network dataset transfer
• Now available for C++, Java, Python
Development Partners

Flight key ideas
• Zero-serialization
• Bidirectional streaming transfers
• Parallel transfers + horizontal scalability
designed into the protocol
• Reap benefits of Google’s work on gRPC
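
The parallel-transfer idea can be sketched without any networking. The toy below is not the Flight API: the endpoint names, the `fetch` helper, and the in-process "servers" are all assumptions made for illustration. A dataset is split across several endpoints, and the client pulls every partition concurrently and reads each one as a typed view.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for several servers, each holding one partition
# of a dataset as a raw float64 buffer.
ENDPOINTS = {
    "node-a": array("d", [1.0, 2.0]).tobytes(),
    "node-b": array("d", [3.0, 4.0]).tobytes(),
    "node-c": array("d", [5.0]).tobytes(),
}

def fetch(endpoint: str) -> memoryview:
    """Pretend network call: return one partition as a float64 view."""
    return memoryview(ENDPOINTS[endpoint]).cast("d")

def fetch_all(endpoints):
    # One stream per endpoint, pulled concurrently; map() preserves order.
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        return list(pool.map(fetch, endpoints))

parts = fetch_all(list(ENDPOINTS))
values = [v for part in parts for v in part]
```

Because each partition arrives in a directly usable layout, adding endpoints adds streams rather than adding decoding work on the client.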

Flight use cases
• Replacing slow database protocols like
JDBC / ODBC
• General network data movement
• Retrofit legacy systems with fast Arrow IO

Notable Arrow subcomponents
• Rust DataFusion: Arrow-native Rust query engine
• Gandiva: LLVM analytical expression compiler
• Plasma: Shared memory object store

Funding Arrow Development
• Apache projects are technically communities of
volunteers
• Much development contributed by direct users of Arrow
• Ursa Labs: not-for-profit group I founded in 2018 with
initial support of RStudio and Two Sigma

Thank you!
https://arrow.apache.org
https://ursalabs.org

Leveling Up Analytics with Apache Arrow

2. Leveling up the Analytics Stack Wes McKinney @wesmckinn

3. Apache ibis

4. The Need for Speed

5. 21 Years Ago...

7. A semi-revisionist history of Big Data

8. Cloudera’s Jeff Hammerbacher “Data first, ask questions later” Enable everyone to “party on the data” (paraphrasing!)

9. Decoupling Storage from Processing

10. Simplified MapReduce Architecture Storage Step 1 Step 2 Step 3 ...

11. “Scalability! But at what COST?” McSherry, Isard, Murray 2015 https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf

12. Configuration that Outperforms a Single Thread

13. Brute Force Scalability

14. Storage Feasibility Computational Feasibility Resource Utilization Interactivity Up the hierarchy of needs...

15. Rise of “End-to-end” Execution Engines Storage Step 1 Step 2 Step 3 ... INPUT OUTPUT

16. End-to-end engines: drawbacks • Example: SQL on Hadoop systems, Apache Spark, others • Serve some use cases well, others less well • Fall short in ML/AI domain

17. The Interoperability Conundrum Engine A Engine B Data Handoff

18. Some Hardware Trends • Manycore processor architectures • Much faster disk • Much faster networking • Beyond CPUs

19. “Why Modern CPUs Are Starving and What Can Be Done about It” Francesc Alted, IEEE 2010

20. Recognizing Serialization as an Enemy

21. Serialization Translation of data into a form that can be stored or transmitted, and reconstructed later

22. How to Eliminate Serialization “Serialized” and In-Memory Format must be the same (or nearly so)

23. A Collective Realization in 2015 Many open source developers had noted the absence of an in-memory standard for structured data analytics

24. ● Language-agnostic in-memory format for analytical query processing on modern hardware ● Low-overhead data sharing and transport ● A cross-language development platform to build Arrow-powered applications Mission

25. Why Column-oriented? • Reduce unnecessary IO • Increase memory throughput • Better parallelism • Leverage SIMD instructions

26. Apache Arrow “meta” goals • Forge collaborations between database systems and data science / ML / AI communities • Eliminate barriers to code sharing between application ecosystems and programming languages

27. Community over Code • ASF open governance model • ~400 unique contributors • 49 committers, 28 PMC members • 11 programming languages represented

28. Arrow Development in Practice • “Core” format and protocol implementations • “Batteries-included” standard libraries • Common build / test / package infrastructure and compatibility testing

29. Language Relationships C++ Java Go Rust C# JavaScript C Ruby Python R MATLAB

30. Some Arrow Success Stories Apache

31. • gRPC-based framework for custom data services • High-speed network dataset transfer • Now available for C++, Java, Python Arrow Flight: Fast Data Services Development Partners

32. Flight key ideas • Zero-serialization • Bidirectional streaming transfers • Parallel transfers + horizontal scalability designed into the protocol • Reap benefits of Google’s work on gRPC

33. Flight use cases • Replacing slow database protocols like JDBC / ODBC • General network data movement • Retrofit legacy systems with fast Arrow IO

34. Notable Arrow subcomponents Rust DataFusion Arrow-native Rust query engine Gandiva LLVM analytical expression compiler Plasma Shared memory object store

35. Funding Arrow Development • Apache projects are technically communities of volunteers • Much development contributed by direct users of Arrow • Ursa Labs: not-for-profit group I founded in 2018 with initial support of RStudio and Two Sigma

36. Ursa Labs Sponsors

37. Thank you! https://arrow.apache.org https://ursalabs.org