This document introduces Apache Arrow and PyArrow, describing them as a full-stack solution for data engineering. It discusses how PyArrow provides access to Apache Arrow's capabilities through Python, including representing data as columnar arrays and tables. It explains how PyArrow tables can be used for analytics and how data can be efficiently transferred between disk, memory, and over networks using Arrow formats and Flight. It also discusses how Arrow interacts with databases through standards like ADBC and FlightSQL.
PyCon Ireland 2022 - PyArrow full stack.pdf
1. Apache Arrow as a full stack data
engineering solution
Alessandro Molina
@__amol__
https://alessandro.molina.fyi/
2. Who am I, Alessandro
● Maintainer of TurboGears2,
Apache Arrow Contributor,
Author of DukPy and DEPOT
● Director of Engineering at Voltron Data Labs
● Author of
“Modern Python Standard Library Cookbook” and
“Crafting Test-Driven Software with Python”.
3. What’s Apache Arrow?
● a data interchange standard
● an in-memory format
● a networking format
● a storage format
● an i/o library
● a computation engine
● a tabular data library
● a query engine
● a partitioned dataset
manager
4. So much there!
The Apache Arrow project is a huge effort, aimed at
solving the fundamental problems in the data
analytics world.
Because it aims to provide a “write everywhere, run
everywhere” experience, it’s easy to get lost if you
don’t know where to start.
PyArrow is the entry point to the Apache Arrow
ecosystem for Python developers, and it can easily
give you access to many of the benefits of Arrow itself.
5. Introducing PyArrow
● Apache Arrow was born as a Columnar Data Format
● So the fundamental type in PyArrow is a “column of data”,
which is exposed by the pyarrow.Array object and its
subclasses.
● At this level, PyArrow is similar to one-dimensional NumPy
arrays.
6. PyArrow Arrays
import pyarrow as pa
# Arrays can be made of numbers
>>> pa.array([1, 2, 3, 4, 5])
<pyarrow.lib.Int64Array object at 0xffff77d75d20>
# Or strings
>>> pa.array(["A", "B", "C", "D", "E"])
<pyarrow.lib.StringArray object at 0xffff77d75b40>
# And even complex objects
>>> pa.array([{"a": 5}, {"a": 7}])
<pyarrow.lib.StructArray object at 0xffff77d75d20>
# Arrays can also be masked
>>> pa.array([1, 2, 3, 4, 5],
... mask=pa.array([True, False, True, False, True]))
<pyarrow.lib.Int64Array object at 0xffff77d75d80>
Compared to classic NumPy arrays, PyArrow
arrays are a bit more complex.
● They pair the buffer holding the data with a
second buffer holding the validity bitmap, so
that null values can be represented without
a sentinel like None.
● Even arrays of strings retain the guarantee
of a single contiguous buffer for the values.
7. Introducing PyArrow Tables
● As Arrays are “columns”, grouping them forms a pyarrow.Table
● Table columns are actually pyarrow.ChunkedArray objects, so
that appending rows to a table is a cheap operation.
● At this level, PyArrow is similar to Pandas DataFrames
8. PyArrow Tables
>>> table = pa.table([
... pa.array([1, 2, 3, 4, 5]),
... pa.array(["a", "b", "c", "d", "e"]),
... pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
... ], names=["col1", "col2", "col3"])
>>> table.take([0, 1, 4])
col1: [[1,2,5]]
col2: [["a","b","e"]]
col3: [[1,2,5]]
>>> table.schema
col1: int64
col2: string
col3: double
Compared to Pandas, PyArrow tables are fully
implemented in C++ and never modify data in
place.
Tables are based on ChunkedArrays, so
appending data to them is a zero-copy
operation. A new table is created that
references the data from the existing table as
the first chunks of the arrays and the added
data as the new chunks.
The Acero compute engine in Arrow is able to
provide many common analytics and
transformation capabilities, like joining, filtering
and aggregating data in tables.
9. Running Analytics
The Acero compute engine
powers the analytics and
transformation capabilities
available on tables.
Many pyarrow.compute
functions provide kernels
that work on tables and
Table exposes join, filtering
and grouping methods
import pyarrow as pa
import pyarrow.compute as pc
>>> table = pa.table([
... pa.array(["a", "a", "b", "b", "c", "d", "e", "c"]),
... pa.array([11, 20, 3, 4, 5, 1, 4, 10]),
... ], names=["keys", "values"])
>>> table.filter(pc.field("values") == 4)
keys: [["b","e"]]
values: [[4,4]]
>>> table.group_by("keys").aggregate([("values", "sum")])
values_sum: [[31,7,15,1,4]]
keys: [["a","b","c","d","e"]]
>>> table1 = pa.table({'id': [1, 2, 3],
... 'year': [2020, 2022, 2019]})
>>>
>>> table2 = pa.table({'id': [3, 4],
... 'n_legs': [5, 100],
... 'animal': ["Brittle stars", "Centipede"]})
>>>
>>> table1.join(table2, keys="id")
id: [[3,1,2]]
year: [[2019,2020,2022]]
n_legs: [[5,null,null]]
animal: [["Brittle stars",null,null]]
10. PyArrow, Numpy and Pandas
One of the original design goals of Apache Arrow was
to allow easy exchange of data without the cost of
converting it across multiple formats or marshaling it
before transfer.
In the spirit of those capabilities, PyArrow provides
copy-free support for converting data to and from
pandas and numpy.
If you have data in PyArrow, you can invoke to_numpy
on pyarrow.Array, and to_pandas on both pyarrow.Array
and pyarrow.Table, to get them as NumPy or pandas
objects without paying any additional conversion cost.
11. And it’s fast!
>>> data = [a % 5 for a in range(100000000)]
>>> npdata = np.array(data)
>>> padata = pa.array(data)
>>> import timeit
>>> timeit.timeit(
... lambda: np.unique(npdata, return_counts=True),
... number=1
... )
1.5212857750011608
>>> timeit.timeit(
... lambda: pc.value_counts(padata),
... number=1
... )
0.3754262370057404
12. Very fast!
In [3]: timeit df = pd.DataFrame(dict_of_numpy_arrays)
82.5 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [4]: timeit df = pa.table(dict_of_numpy_arrays).to_pandas()
50.2 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
df = pd.read_csv("large.csv", engine="pyarrow")
13. Full-Stack Solution
DISK
Arrow Storage Format
Data can be stored in the Arrow
Disk Format itself.
Arrow InMemory Format
When loaded,
it will still be in the Arrow
Format.
MEMORY
Acero
Computation can be
performed natively on the
Arrow format.
COMPUTE
Arrow Flight
Arrow format can be used to
ship data across network
through Arrow Flight
NETWORK
14. Arrow from disk to memory
● Saving data in the Arrow format allows PyArrow
to leverage the same exact format for disk and
in-memory data.
● This means that no marshaling cost is paid
when loading the data back.
● It also allows leveraging memory mapping to
avoid processing data until it’s actually
accessed.
● This reduces the latency of accessing data
from seconds to milliseconds.
● Memory mapping also allows managing data
bigger than memory.
15. Arrow format does not solve it all
● The Arrow format can make working with your data very fast
● But it is expensive in terms of disk space, as it’s optimized for fast computation and SIMD
instructions, not for storage size.
● It natively supports compression algorithms, but those come at a cost that nullifies most
benefits of using the Arrow format itself.
● The Arrow format is a great hot format, but there are better solutions for cold storage.
total 1.3G
-rw-r--r-- 1 root root 1.2G Nov 2 16:10 data.arrow
-rw-r--r-- 1 root root 155M Nov 2 16:10 data.pqt
16. Yes, you can read 17 Million Rows in 9ms*
* for some definitions of read
17. From memory-to-network: Arrow Flight
● Arrow Flight is a protocol and implementation provided in Arrow itself that is optimized for
transferring columnar data using Apache Arrow format.
● pyarrow.flight.FlightServerBase provides the server implementation and
pyarrow.flight.connect allows creating clients that connect to Flight servers.
● Flight hooks directly into gRPC,
thus no marshaling or
unmarshaling happens when
sending data through network.
● https://arrow.apache.org/cookbook/py/flight.html
18. Arrow Flight speed
Based on the same foundations that we saw for dealing with data on disk, using Arrow Flight for
data on the network can provide major performance gains compared to other existing solutions
for transferring data.
19.
20. Full-Stack Solution, evolved
DISK
Arrow Storage Format
Data can be stored in the Arrow
Disk Format itself.
Arrow InMemory Format
When loaded,
it will still be in the Arrow
Format.
MEMORY
Acero
Computation can be
performed natively on the
Arrow format.
COMPUTE
Arrow Flight
Arrow format can be used to
ship data across network
through Arrow Flight
NETWORK
COLD
STORAGE
Parquet
PyArrow natively
supports optimized
parquet loading
FLIGHT
SQL
ADBC & FlightSQL
Native support for fetching data
from databases in Arrow format.
ADBC
NANO
ARROW
NanoArrow
Sharing Arrow data
between languages
and libraries in the
same process
21. Arrow & Database: FlightSQL
● Flight SQL aims to provide broadly similar functionality to existing APIs like JDBC and ODBC
● Using Flight, it provides an efficient implementation of a wire format that supports features
like encryption and authentication out of the box, while allowing for further optimizations like
parallel data access
● You get the performance
of Flight, with the
convenience of a SQL
database.
● FlightSQL is mostly a
transport for higher-level
APIs; you are not meant
to use it directly.
22. Arrow & Database: ADBC
● Standard database interface built around
Arrow data, especially for efficiently fetching
large datasets (i.e. with minimal or no
serialization and copying)
● ADBC can leverage FlightSQL or directly
connect to the database (currently supports
Postgres, DuckDB, SQLite, …)
● Optimized for transferring column-major data,
instead of the row-major data of most
database drivers.
● Supports both SQL dialects and the emerging
Substrait standard.