Apache Arrow as a full stack data
engineering solution
Alessandro Molina
@__amol__
https://alessandro.molina.fyi/
Who am I, Alessandro
● Maintainer of TurboGears2,
Apache Arrow Contributor,
Author of DukPy and DEPOT
● Director of Engineering at Voltron Data Labs
● Author of
“Modern Python Standard Library Cookbook” and
“Crafting Test-Driven Software with Python”.
What’s Apache Arrow?
● a data interchange standard
● an in-memory format
● a networking format
● a storage format
● an i/o library
● a computation engine
● a tabular data library
● a query engine
● a partitioned dataset
manager
So much there!
The Apache Arrow project is a huge effort, aimed at
solving the fundamental problems of the data
analytics world.
Since it aims at providing a “write everywhere, run
everywhere” experience, it’s easy to get lost if you
don’t know where to start.
PyArrow is the entry point to the Apache Arrow
ecosystem for Python developers, and it can easily
give you access to many of the benefits of Arrow itself.
Introducing PyArrow
● Apache Arrow was born as a Columnar Data Format
● So the fundamental type in PyArrow is a “column of data”,
which is exposed by the pyarrow.Array object and its
subclasses.
● At this level, PyArrow is similar to NumPy single dimension
arrays.
PyArrow Arrays
import pyarrow as pa
# Arrays can be made of numbers
>>> pa.array([1, 2, 3, 4, 5])
<pyarrow.lib.Int64Array object at 0xffff77d75d20>
# Or strings
>>> pa.array(["A", "B", "C", "D", "E"])
<pyarrow.lib.StringArray object at 0xffff77d75b40>
# And even complex objects
>>> pa.array([{"a": 5}, {"a": 7}])
<pyarrow.lib.StructArray object at 0xffff77d75d20>
# Arrays can also be masked
>>> pa.array([1, 2, 3, 4, 5],
... mask=pa.array([True, False, True, False, True]))
<pyarrow.lib.Int64Array object at 0xffff77d75d80>
Compared to classic NumPy arrays, PyArrow
arrays are a bit more complex.
● They pair a buffer holding the data with one
holding the validity bitmap, so that null values
can be represented properly rather than just as None.
● Arrays of strings also retain the guarantee of
having a single contiguous buffer for the values
(see the sketch below).
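A small, hedged illustration of the validity bitmap; the array contents are made up and the buffer layout is an implementation detail of the format:

import pyarrow as pa

arr = pa.array([1, None, 3, None, 5])

print(arr.null_count)   # 2 -> nulls are tracked in the validity bitmap
print(arr.is_valid())   # BooleanArray: [true, false, true, false, true]

# The array is backed by two buffers: the validity bitmap and the values.
# Null entries still occupy an int64 slot in the values buffer.
validity, values = arr.buffers()
print(validity, values)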
Introducing PyArrow Tables
● As Arrays are “columns”, grouping them forms a pyarrow.Table
● Tables are actually constituted by pyarrow.ChunkedArray columns, so
that appending rows to them is a cheap operation.
● At this level, PyArrow is similar to pandas DataFrames
PyArrow Tables
>>> table = pa.table([
... pa.array([1, 2, 3, 4, 5]),
... pa.array(["a", "b", "c", "d", "e"]),
... pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
... ], names=["col1", "col2", "col3"])
>>> table.take([0, 1, 4])
col1: [[1,2,5]]
col2: [["a","b","e"]]
col3: [[1,2,5]]
>>> table.schema
col1: int64
col2: string
col3: double
Compared to pandas, PyArrow tables are fully
implemented in C++ and never modify data in
place.
Tables are based on ChunkedArrays, so that
appending data to them is a zero-copy
operation: a new table is created that
references the data from the existing table as
the first chunks of the arrays and the added
data as the new chunks, as sketched below.
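A minimal sketch of that behaviour; the tables here are made up for illustration:

import pyarrow as pa

table = pa.table({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
more_rows = pa.table({"col1": [4, 5], "col2": ["d", "e"]})

# "Appending" builds a new table that references the chunks of both
# source tables; no values are copied.
combined = pa.concat_tables([table, more_rows])

print(combined.num_rows)                   # 5
print(combined.column("col1").num_chunks)  # 2 -> one chunk per source table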
The Acero compute engine in Arrow is able to
provide many common analytics and
transformation capabilities, like joining, filtering
and aggregating data in tables.
Running Analytics
The Acero compute engine
powers the analytics and
transformation capabilities
available on tables.
Many pyarrow.compute
functions provide kernels
that work on tables, and
Table exposes join, filter
and group_by methods.
import pyarrow as pa
import pyarrow.compute as pc
>>> table = pa.table([
... pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
... pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
... ], names=["keys", "values"])
>>> table.filter(pc.field("values") == 4)
keys: [["b","e"]]
values: [[4,4]]
>>> table.group_by("keys").aggregate([("values", "sum")])
values_sum: [[31,7,15,1,4]]
keys: [["a","b","c","d","e"]]
>>> table1 = pa.table({'id': [1, 2, 3],
... 'year': [2020, 2022, 2019]})
>>>
>>> table2 = pa.table({'id': [3, 4],
... 'n_legs': [5, 100],
... 'animal': ["Brittle stars", "Centipede"]})
>>>
>>> table1.join(table2, keys="id")
id: [[3,1,2]]
year: [[2019,2020,2022]]
n_legs: [[5,null,null]]
animal: [["Brittle stars",null,null]]
PyArrow, Numpy and Pandas
One of the original design goals of Apache Arrow was
to allow easy exchange of data without the cost of
converting it across multiple formats or marshaling it
before transfer.
In the spirit of those capabilities, PyArrow provides
copy-free support for converting data to and from
pandas and NumPy.
If you have data in PyArrow you can invoke to_numpy
on pyarrow.Array and to_pandas on pyarrow.Array and
pyarrow.Table to get them as pandas or numpy
objects without facing any additional conversion cost.
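A hedged sketch of those conversions; note that to_pandas may still need a copy for some column types, such as strings:

import pyarrow as pa

arr = pa.array([1, 2, 3, 4, 5])
table = pa.table({"values": arr})

# zero_copy_only raises if a copy would be required (e.g. when nulls are
# present), making the "no conversion cost" guarantee explicit.
np_view = arr.to_numpy(zero_copy_only=True)

# Build a pandas DataFrame from the table; for compatible numeric types
# the column data can be reused without copying.
df = table.to_pandas()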
And it’s fast!
>>> import numpy as np
>>> data = [a % 5 for a in range(100000000)]
>>> npdata = np.array(data)
>>> padata = pa.array(data)
>>> import timeit
>>> timeit.timeit(
... lambda: np.unique(npdata, return_counts=True),
... number=1
... )
1.5212857750011608
>>> timeit.timeit(
... lambda: pc.value_counts(padata),
... number=1
... )
0.3754262370057404
Very fast!
In [3]: timeit df = pd.DataFrame(dict_of_numpy_arrays)
82.5 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas()
50.2 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df = pd.read_csv("large.csv", engine="pyarrow")
Full-Stack Solution
DISK
Arrow Storage Format
Data can be stored in the Arrow
Disk Format itself.
Arrow In-Memory Format
When loaded,
it will still be in the Arrow
Format.
MEMORY
Acero
Computation can be
performed natively on the
Arrow format.
COMPUTE
Arrow Flight
The Arrow format can be used to
ship data across the network
through Arrow Flight
NETWORK
Arrow from disk to memory
● Saving data in the Arrow format allows PyArrow
to leverage the exact same format for disk and
in-memory data.
● This means that no marshaling cost is paid
when loading the data back.
● It also allows leveraging memory mapping to
avoid processing data until it’s actually
accessed (see the sketch below).
● This can reduce the latency to access data
from seconds to milliseconds.
● Memory mapping also makes it possible to
manage data bigger than memory.
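A minimal sketch of the round trip, with an illustrative "data.arrow" file name and made-up table contents:

import pyarrow as pa

table = pa.table({"col1": list(range(1_000)), "col2": ["x"] * 1_000})

# Write the table in the Arrow IPC file format.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map it back: no parsing or marshaling happens, and data is only
# paged in from disk when a column is actually accessed.
with pa.memory_map("data.arrow") as source:
    loaded = pa.ipc.open_file(source).read_all()

print(loaded.num_rows)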
Arrow format does not solve it all
● The Arrow format can make working with your data very fast
● But it is expensive in terms of disk space, as it’s optimized for fast computation and SIMD
instructions, not for storage size.
● It natively supports compression algorithms, but those come at a cost that nullifies most of the
benefits of using the Arrow format itself.
● The Arrow format is a great hot format, but there are better solutions for cold storage, such as
Parquet (compared in the listing and sketch below).
total 1.3G
-rw-r--r-- 1 root root 1.2G Nov 2 16:10 data.arrow
-rw-r--r-- 1 root root 155M Nov 2 16:10 data.pqt
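The listing above compares the same dataset stored as Arrow IPC ("data.arrow") and as Parquet ("data.pqt"). A hedged sketch of how such a comparison can be reproduced; the table contents are made up, so the exact sizes will differ:

import os
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"keys": [i % 5 for i in range(1_000_000)],
                  "values": [float(i) for i in range(1_000_000)]})

# Hot format: Arrow IPC file, uncompressed and ready for zero-copy access.
with pa.OSFile("data.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Cold format: Parquet, encoded and compressed for storage.
pq.write_table(table, "data.pqt")

print(os.path.getsize("data.arrow"), os.path.getsize("data.pqt"))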
Yes, you can read 17 Million Rows in 9ms*
* for some definitions of read
From memory-to-network: Arrow Flight
● Arrow Flight is a protocol and implementation provided in Arrow itself that is optimized for
transferring columnar data using the Apache Arrow format.
● pyarrow.flight.FlightServerBase provides the server implementation and
pyarrow.flight.connect allows creating clients that connect to Flight servers (see the sketch
after this list).
● Flight hooks directly into gRPC,
thus no marshaling or
unmarshaling happens when
sending data through the network.
● https://arrow.apache.org/cookbook/py/flight.html
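A minimal, hedged sketch of a Flight exchange; the class name, port and table contents are illustrative, not part of the library:

import pyarrow as pa
import pyarrow.flight as flight


class TinyFlightServer(flight.FlightServerBase):
    """A minimal Flight server exposing a single in-memory table."""

    def __init__(self, location="grpc://0.0.0.0:8815"):
        super().__init__(location)
        self._table = pa.table({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})

    def do_get(self, context, ticket):
        # Record batches are streamed back over gRPC with no marshaling step.
        return flight.RecordBatchStream(self._table)


# In one process: start the server and block until shutdown.
#     TinyFlightServer().serve()
# In another process: connect and fetch the data back as an Arrow table.
#     client = flight.connect("grpc://localhost:8815")
#     table = client.do_get(flight.Ticket(b"anything")).read_all()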
Arrow Flight speed
Based on the same foundations that we saw for dealing with data on disk, using Arrow Flight for
data on the network can provide major performance gains compared to other existing solutions for
transferring data.
Full-Stack Solution, evolved
DISK
Arrow Storage Format
Data can be stored in the Arrow
Disk Format itself.
Arrow In-Memory Format
When loaded,
it will still be in the Arrow
Format.
MEMORY
Acero
Computation can be
performed natively on the
Arrow format.
COMPUTE
Arrow Flight
The Arrow format can be used to
ship data across the network
through Arrow Flight
NETWORK
COLD
STORAGE
Parquet
PyArrow natively
supports optimized
Parquet loading
FLIGHT
SQL
ADBC & FlightSQL
Native support for fetching data
from databases in Arrow format.
ADBC
NANO
ARROW
NanoArrow
Sharing Arrow data
between languages
and libraries in the
same process
Arrow & Database: FlightSQL
● Flight SQL aims to provide broadly similar functionality to existing APIs like JDBC and ODBC
● Using Flight, it provides an efficient implementation of a wire format that supports features
like encryption and authentication out of the box, while allowing for further optimizations like
parallel data access
● You get the performance
of Flight, with the
convenience of a SQL
database.
● FlightSQL is mostly a
transport for higher-level
APIs; you are not meant
to use it directly.
Arrow & Database: ADBC
● Standard database interface built around
Arrow data, especially for efficiently fetching
large datasets (i.e. with minimal or no
serialization and copying)
● ADBC can leverage FlightSQL or directly
connect to the database (currently supports
Postgres, DuckDB, SQLite, …)
● Optimized for transferring column-major data
instead of row-major data like most database
drivers do.
● Supports both SQL dialects and the emerging
Substrait standard.
Arrow & Database: ADBC
# A hedged setup: assumes the adbc_driver_sqlite driver package is
# installed; connect() defaults to an in-memory SQLite database.
import pyarrow
import pandas
from pandas.testing import assert_frame_equal
import adbc_driver_sqlite.dbapi

sqlite = adbc_driver_sqlite.dbapi.connect()

with sqlite.cursor() as cur:
    cur.execute('SELECT 1, "foo", 2.0')
    assert cur.fetch_arrow_table() == pyarrow.table(
        {
            "1": [1],
            '"foo"': ["foo"],
            "2.0": [2.0],
        }
    )

with sqlite.cursor() as cur:
    cur.execute('SELECT 1, "foo", 2.0')
    assert_frame_equal(
        cur.fetch_df(),
        pandas.DataFrame(
            {
                "1": [1],
                '"foo"': ["foo"],
                "2.0": [2.0],
            }
        ),
    )
Arrow Table
pandas DataFrame
Questions?
● PyArrow Documentation
https://arrow.apache.org/docs/python/getstarted.html
● PyArrow Cookbook
https://arrow.apache.org/cookbook/py/index.html