Fulfilling Apache Arrow's Promises: Pandas on JVM memory without a copy


This document discusses how Apache Arrow enables sharing data between Python and Java without copying. It summarizes Arrow's capabilities for efficient in-memory columnar data and its ability to exchange data between different programming languages. The document then outlines how Arrow, through its Java and Python libraries, allows querying data in Java from Python without copying, by passing memory addresses between the two environments. This enables faster data science workflows that involve both Python and Java/Scala.

1
Fulfilling Apache Arrow's Promises:
Pandas on JVM memory without a copy
PyCon.DE Karlsruhe 2018
Uwe L. Korn

2
About me
• Senior Data Scientist at Blue Yonder (@BlueYonderTech)
• Apache {Arrow, Parquet} PMC
• Data Engineer and Architect with heavy focus around Pandas
• xhochy
• mail@uwekorn.com

3
What’s Apache Arrow?
• Published in February 2016
• Specification for in-memory columnar data layout
• No overhead for cross-system communication
• Designed for efficiency (exploit SIMD, cache locality, ...)
• Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, MATLAB and the JVM
• Brought Parquet to Pandas and made PySpark fast (@pandas_udf)

4
February 2016: Birth of Apache Arrow
Just a goal…

5
Data Science Workflow in 2018
[Diagram: SQL Engine → pre-processing with pandas (Python) → machine learning model (Python) → probability density function (PDF)]

6
Looks simple?
• It isn’t.
• “Data” is a very heterogeneous landscape
• Most common setup:
• Java/Scala, i.e. JVM, for data processing
• Python for machine learning

7
Data Science Workflow in 2018
[Diagram: SQL Engine → JDBC Driver → (JDBC rows) → JayDeBeApi → (Python rows) → pre-processing with pandas → machine learning model; the pandas and model steps run in Python]

8
org.apache.arrow.adapter.jdbc
• Retrieve JDBC results as Arrow RecordBatch / VectorSchemaRoot
• Do conversion of rows to columns in the JVM
• Data is stored “off-heap”, i.e.:
• not managed by the JVM
• native memory layout, same as in pyarrow
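
To make the row-to-column step concrete, here is a minimal Python sketch that uses only the standard library. It is not the adapter's API: the real conversion happens in Java inside org.apache.arrow.adapter.jdbc. The `rows` data, the column names, and the `rows_to_columns` helper are assumptions made up for illustration.

```python
from array import array

# Hypothetical JDBC-style result set: one tuple per row (id, price).
rows = [
    (1, 9.5),
    (2, 7.25),
    (3, 3.0),
]


def rows_to_columns(rows):
    """Pivot row tuples into one contiguous, typed buffer per column."""
    ids = array("q")     # signed 64-bit integers
    prices = array("d")  # 64-bit floats
    for row_id, price in rows:
        ids.append(row_id)
        prices.append(price)
    return {"id": ids, "price": prices}


columns = rows_to_columns(rows)
```

Each column ends up as a single typed buffer, which is the property that lets a columnar layout be handed to another runtime as a block of memory rather than as a list of row objects.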

9
Workflow in 2018 with Arrow
[Diagram: SQL Engine → JDBC Driver → (JDBC rows) → org.apache.arrow.adapter.jdbc → (Arrow) → ? → pre-processing with pandas → machine learning model; the pandas and model steps run in Python]

10
So we’re done? No.
• We still only have Arrow data in the JVM
• Arrow and Pandas have a slightly different memory layout
• We have this today in PySpark
• It’s fast
• Still involves a copy over the network
• Arrow → pandas conversion is tuned but still a copy

11
pyarrow.jvm
• Access Arrow data created in the JVM from Python
• Involves no copy of the data
• Translation of the helper objects
• Actually passes memory addresses around
No copy between the JVM and Python!
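
The idea of handing over an address instead of the data can be illustrated with `ctypes` from the Python standard library. This is only an analogy for what pyarrow.jvm does, not its implementation: the "producer" and "consumer" below live in the same Python process, and all variable names are hypothetical.

```python
import ctypes

# "Producer" side: allocate a buffer of four 64-bit integers.
# It must stay alive for as long as the consumer uses the address.
n = 4
producer_buf = (ctypes.c_int64 * n)(10, 20, 30, 40)
address = ctypes.addressof(producer_buf)  # a plain Python integer

# "Consumer" side: only the integer address and the length are handed over.
consumer_view = (ctypes.c_int64 * n).from_address(address)

# Both names refer to the same bytes, so a write through one
# is visible through the other: nothing was copied.
producer_buf[0] = 99
first_value = consumer_view[0]
```

Because only an integer crosses the boundary, the cost of the hand-over does not depend on the size of the data. The price is lifetime management: the consumer must not outlive the memory the producer owns.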

12
NumPy & the BlockManager
Photo by Susan Holt Simpson on Unsplash

13
Pandas Shortcomings
• Limited to NumPy data types, otherwise object
• Columns are not separate, grouped by type
• Nullability is not type-safe (yet)
→ Arrow memory does not match Pandas memory
→ Copy 😢
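
One concrete reason the layouts differ is null handling. Arrow records missing values in a separate validity bitmap (one bit per value, least-significant bit first, 1 meaning "valid"), whereas a NumPy-backed integer column with missing values is typically stored as floats with NaN. The following standard-library sketch builds such a bitmap by hand; the helper names `build_validity_bitmap` and `is_valid` are assumptions for illustration and not part of pyarrow.

```python
def build_validity_bitmap(values):
    """Return (bitmap, null_count); bit i is 1 when values[i] is not None."""
    bitmap = bytearray((len(values) + 7) // 8)
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            null_count += 1
        else:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
    return bytes(bitmap), null_count


def is_valid(bitmap, i):
    """Check whether slot i is marked as valid in the bitmap."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


values = [1, None, 3, 4, None]
bitmap, null_count = build_validity_bitmap(values)
```

With the bitmap kept next to the data buffer, the integers themselves can stay integers, which is the kind of type-safe nullability the slide says pandas lacks.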

14
Pandas ExtensionArrays
• Introduced new interfaces in 0.23
• ExtensionDtype
• What type of scalars?
• ExtensionArray
• Implement basic array ops
• Pandas provides algorithms on top
• Still experimental; wait for 0.24

16
fletcher
• https://github.com/xhochy/fletcher
• Implements Extension{Array,Dtype} with Apache Arrow as storage
• Uses Numba to implement the necessary analytics on top
• Needs {pandas, Arrow, …} master
No copy between Apache Arrow and pandas!

17
Workflow in 2018 with Arrow
[Diagram: SQL Engine → JDBC Driver → (JDBC rows) → org.apache.arrow.adapter.jdbc → (Arrow) → pyarrow.jvm / fletcher → pre-processing with pandas → machine learning model; the pandas and model steps run in Python]

21
Make your best decision today.
blueyonder.ai/en/careers
Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA

22
Get Involved!
Apache Arrow
Cross language DataFrame library
• Website: https://arrow.apache.org/
• ML: dev@arrow.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/ARROW
• Slack: https://apachearrowslackin.herokuapp.com/
• Github mirror: https://github.com/apache/arrow
Apache Parquet
Famous columnar file format
• Website: https://parquet.apache.org/
• ML: dev@parquet.apache.org
• Issues & Tasks: https://issues.apache.org/jira/browse/PARQUET
• Slack: https://parquet-slack-invite.herokuapp.com/
• C++ Github mirror: https://github.com/apache/parquet-cpp
