This document introduces Apache Arrow and PyArrow, describing them as a full-stack solution for data engineering. It discusses how PyArrow provides access to Apache Arrow's capabilities through Python, including representing data as columnar arrays and tables. It explains how PyArrow tables can be used for analytics and how data can be efficiently transferred between disk, memory, and over networks using Arrow formats and Flight. It also discusses how Arrow interacts with databases through standards like ADBC and FlightSQL.
PyCon Ireland 2022 - PyArrow full stack.pdf
1. Apache Arrow as a full stack data
engineering solution
Alessandro Molina
@__amol__
https://alessandro.molina.fyi/
2. Who am I, Alessandro
● Maintainer of TurboGears2,
Apache Arrow Contributor,
Author of DukPy and DEPOT
● Director of Engineering at Voltron Data Labs
● Author of
“Modern Python Standard Library Cookbook” and
“Crafting Test-Driven Software with Python”.
3. What’s Apache Arrow?
● a data interchange standard
● an in-memory format
● a networking format
● a storage format
● an i/o library
● a computation engine
● a tabular data library
● a query engine
● a partitioned dataset
manager
4. So much there!
The Apache Arrow project is a huge effort, aimed at
solving the fundamental problems in the data
analytics world.
Because it aims to provide a “write everywhere, run
everywhere” experience, it’s easy to get lost if you
don’t know where to start.
PyArrow is the entry point to the Apache Arrow
ecosystem for Python developers, and it can easily
give you access to many of the benefits of Arrow itself.
5. Introducing PyArrow
● Apache Arrow was born as a Columnar Data Format
● So the fundamental type in PyArrow is a “column of data”,
which is exposed by the pyarrow.Array object and its
subclasses.
● At this level, PyArrow is similar to one-dimensional NumPy
arrays.
6. PyArrow Arrays
import pyarrow as pa
# Arrays can be made of numbers
>>> pa.array([1, 2, 3, 4, 5])
<pyarrow.lib.Int64Array object at 0xffff77d75d20>
# Or strings
>>> pa.array(["A", "B", "C", "D", "E"])
<pyarrow.lib.StringArray object at 0xffff77d75b40>
# And even complex objects
>>> pa.array([{"a": 5}, {"a": 7}])
<pyarrow.lib.StructArray object at 0xffff77d75d20>
# Arrays can also be masked
>>> pa.array([1, 2, 3, 4, 5],
... mask=pa.array([True, False, True, False, True]))
<pyarrow.lib.Int64Array object at 0xffff77d75d80>
Compared to classic NumPy arrays, PyArrow
arrays are a bit more complex.
● They pair the buffer holding the data with a
second buffer holding the validity bitmap, so
that null values can be represented without
a sentinel like None.
● Even arrays of strings retain the guarantee
of a single contiguous buffer for the values.
7. Introducing PyArrow Tables
● As Arrays are “columns”, grouping them forms a pyarrow.Table
● Table columns are actually pyarrow.ChunkedArray objects, so
that appending rows to a table is a cheap operation.
● At this level, PyArrow is similar to Pandas DataFrames
8. PyArrow Tables
>>> table = pa.table([
... pa.array([1, 2, 3, 4, 5]),
... pa.array(["a", "b", "c", "d", "e"]),
... pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
... ], names=["col1", "col2", "col3"])
>>> table.take([0, 1, 4])
col1: [[1,2,5]]
col2: [["a","b","e"]]
col3: [[1,2,5]]
>>> table.schema
col1: int64
col2: string
col3: double
Compared to Pandas, PyArrow tables are fully
implemented in C++ and never modify data in
place.
Tables are based on ChunkedArrays, so
appending data to them is a zero-copy
operation. A new table is created that
references the data from the existing table as
the first chunks of the arrays and the added
data as the new chunks.
The Acero compute engine in Arrow is able to
provide many common analytics and
transformation capabilities, like joining, filtering
and aggregating data in tables.
9. Running Analytics
The Acero compute engine
powers the analytics and
transformation capabilities
available on tables.
Many pyarrow.compute
functions provide kernels
that work on tables and
Table exposes join, filtering
and grouping methods
import pyarrow as pa
import pyarrow.compute as pc
>>> table = pa.table([
... pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
... pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
... ], names=["keys", "values"])
>>> table.filter(pc.field("values") == 4)
keys: [["b","e"]]
values: [[4,4]]
>>> table.group_by("keys").aggregate([("values", "sum")])
values_sum: [[31,7,15,1,4]]
keys: [["a","b","c","d","e"]]
>>> table1 = pa.table({'id': [1, 2, 3],
... 'year': [2020, 2022, 2019]})
>>>
>>> table2 = pa.table({'id': [3, 4],
... 'n_legs': [5, 100],
... 'animal': ["Brittle stars", "Centipede"]})
>>>
>>> table1.join(table2, keys="id")
id: [[3,1,2]]
year: [[2019,2020,2022]]
n_legs: [[5,null,null]]
animal: [["Brittle stars",null,null]]
10. PyArrow, Numpy and Pandas
One of the original design goals of Apache Arrow was
to allow easy exchange of data without the cost of
converting it across multiple formats or marshaling it
before transfer.
In the spirit of those capabilities, PyArrow provides
copy-free support for converting data to and from
pandas and numpy.
If you have data in PyArrow, you can invoke to_numpy
on pyarrow.Array, and to_pandas on both pyarrow.Array
and pyarrow.Table, to get them as NumPy or pandas
objects without paying any additional conversion cost.
11. And it’s fast!
>>> data = [a % 5 for a in range(100000000)]
>>> npdata = np.array(data)
>>> padata = pa.array(data)
>>> import timeit
>>> timeit.timeit(
... lambda: np.unique(npdata, return_counts=True),
... number=1
... )
1.5212857750011608
>>> timeit.timeit(
... lambda: pc.value_counts(padata),
... number=1
... )
0.3754262370057404
12. Very fast!
In [3]: timeit df = pd.DataFrame(dict_of_numpy_arrays)
82.5 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas()
50.2 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df = pd.read_csv("large.csv", engine="pyarrow")
13. Full-Stack Solution
DISK
Arrow Storage Format
Data can be stored in the Arrow
Disk Format itself.
Arrow InMemory Format
When loaded,
it will still be in the Arrow
Format.
MEMORY
Acero
Computation can be
performed natively on the
Arrow format.
COMPUTE
Arrow Flight
Arrow format can be used to
ship data across network
through Arrow Flight
NETWORK
14. Arrow from disk to memory
● Saving data in the Arrow format allows PyArrow
to leverage the same exact format for disk and
in-memory data.
● This means that no marshaling cost is paid
when loading the data back.
● It also allows leveraging memory mapping to
avoid processing data until it’s actually
accessed.
● This reduces the latency of accessing data
from seconds to milliseconds.
● Memory mapping also allows managing data
bigger than memory.
15. Arrow format does not solve it all
● The Arrow format can make working with your data very fast
● But it is expensive in terms of disk space, as it’s optimized for fast computation and SIMD
instructions, not for storage size.
● It natively supports compression algorithms, but those come at a cost that nullifies most
benefits of using the Arrow format itself.
● The Arrow format is a great hot format, but there are better solutions for cold storage.
total 1.3G
-rw-r--r-- 1 root root 1.2G Nov 2 16:10 data.arrow
-rw-r--r-- 1 root root 155M Nov 2 16:10 data.pqt
16. Yes, you can read 17 Million Rows in 9ms*
* for some definitions of read
17. From memory-to-network: Arrow Flight
● Arrow Flight is a protocol and implementation provided in Arrow itself that is optimized for
transferring columnar data using Apache Arrow format.
● pyarrow.flight.FlightServerBase provides the server implementation and
pyarrow.flight.connect allows creating clients that connect to Flight servers.
● Flight hooks directly into gRPC,
thus no marshaling or
unmarshaling happens when
sending data through network.
● https://arrow.apache.org/cookbook/py/flight.html
18. Arrow Flight speed
Based on the same foundations that we saw for dealing with data on disk, using Arrow Flight for
data on the network can provide major performance gains compared to other existing solutions
for transferring data.
19.
20. Full-Stack Solution, evolved
DISK
Arrow Storage Format
Data can be stored in the Arrow
Disk Format itself.
Arrow InMemory Format
When loaded,
it will still be in the Arrow
Format.
MEMORY
Acero
Computation can be
performed natively on the
Arrow format.
COMPUTE
Arrow Flight
Arrow format can be used to
ship data across network
through Arrow Flight
NETWORK
COLD
STORAGE
Parquet
PyArrow natively
supports optimized
parquet loading
FLIGHT
SQL
ADBC & FlightSQL
Native support for fetching data
from databases in Arrow format.
ADBC
NANO
ARROW
NanoArrow
Sharing Arrow data
between languages
and libraries in the
same process
21. Arrow & Database: FlightSQL
● Flight SQL aims to provide broadly similar functionality to existing APIs like JDBC and ODBC
● Using Flight, it provides an efficient implementation of a wire format that supports features
like encryption and authentication out of the box, while allowing for further optimizations like
parallel data access
● You get the performance
of Flight, with the
convenience of a SQL
database.
● FlightSQL is mostly a
transport for higher-level
APIs; you are not meant
to use it directly.
22. Arrow & Database: ADBC
● Standard database interface built around
Arrow data, especially for efficiently fetching
large datasets (i.e. with minimal or no
serialization and copying)
● ADBC can leverage FlightSQL or directly
connect to the database (currently supports
Postgres, DuckDB, SQLite, …)
● Optimized for transferring column-major data,
instead of the row-major data of most
database drivers.
● Supports both SQL dialects and the emerging
Substrait standard.