1. Make Your PySpark Data Fly with Apache Arrow!
Bryan Cutler
Software Engineer
@BryanCutler
2. About Bryan
@BryanCutler on GitHub
Software Engineer, IBM
Center for Open-Source Data & AI Technologies (CODAIT)
Big Data Machine Learning & AI
Apache Spark committer
Apache Arrow committer
TensorFlow I/O maintainer
3. CODAIT
Center for Open Source Data and AI Technologies
codait.org
CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.
A relaunch of the Spark Technology Center (STC) to reflect an expanded mission: improving the Enterprise AI lifecycle in open source.
4. Agenda
Overview of Apache Arrow
Intro to Arrow Flight
How to talk Arrow
Flight in Action
6. About Arrow
Apache Arrow
Standard format for in-memory columnar data
● Implementations in many languages and growing
● Built for efficient analytic operations on modern hardware
Has built-in primitives for basic exchange of Arrow data
● Zero-copy data within a process
● IPC with Arrow record batch messages
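A minimal sketch (not from the slides) of these primitives using the pyarrow package; the column names here are made up:

import pyarrow as pa

# Build a record batch from column arrays
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(['a', 'b', 'c'])],
    names=['id', 'label'])

# IPC: write a stream of record batch messages to a buffer
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Read the stream back within the same process
reader = pa.ipc.open_stream(sink.getvalue())
for b in reader:
    print(b.num_rows)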
7. Why use Arrow
Apache Arrow
Arrow brings many benefits
● Common standard with cross-language support
● Better interoperability between frameworks
● Avoid costly data serialization
8. Who is using Arrow
Apache Arrow
The Apache® Software Foundation Announces Apache Arrow™ Momentum
● Adopted by dozens of open source and commercial technologies
● Exceeded 1,000,000 monthly downloads within its first three years as an Apache Top-Level Project
● Apache Spark, NVIDIA RAPIDS, pandas, and Dremio, among others
https://arrow.apache.org/powered_by
Source: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces46
10. Introduction
Arrow Flight
Arrow Flight is an Arrow-native RPC framework
Defines a standard protocol for data exchange
Makes it easy to efficiently move data around a network by providing [1]:
● Arrow Data as a Service
● Batch Streams
● Stream Management
11. Arrow Data as a Service
Arrow Flight
Extensible data service
● Clients get/put Arrow data
● List available data
● Custom actions
● Can think of it as ODBC for in-memory data
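As an illustration (not from the slides), the pyarrow.flight client exposes these same operations; note the Python API was still experimental at the time of this talk, and the address and action name below are hypothetical:

import pyarrow as pa
import pyarrow.flight as flight

client = flight.FlightClient('grpc://localhost:5005')

# Put a table under a path descriptor
table = pa.table({'x': [1, 2, 3]})
writer, _ = client.do_put(
    flight.FlightDescriptor.for_path('datasets/example'), table.schema)
writer.write_table(table)
writer.close()

# List the data the service makes available
for info in client.list_flights():
    print(info.descriptor, info.total_records)

# Invoke a custom, service-defined action
for result in client.do_action(flight.Action('clear-cache', b'')):
    print(result.body.to_pybytes())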
12. Stream Batching
Arrow Flight
An Arrow stream is a schema + record batches
A Flight is composed of multiple streams
● Streams could come from different endpoints
● Transfer data in bulk for efficiency
● Location info can be used to improve data locality
[Diagram: a Flight containing Stream 1 and Stream 2, each a sequence of record batches]
13. Stream Management
Arrow Flight
The service manages Flights for the clients
● FlightInfo gives a list of endpoints with the location of each stream in the Flight
● Streams are referenced by a ticket
– A ticket is an opaque struct that is unique for each stream
● Flight descriptors differentiate between Flights
– Can define how a Flight is composed
– Batch size, or even a SQL query
14. FlightDescriptor Types
Arrow Flight
Simple path-like:
"datasets/catsdogs/training"
Custom proto:
message MyDescriptor {
  string sql_query = 1;
  int32 records_per_batch = 2;
}
message MyTicket {
  MyDescriptor desc = 1;
  string uuid = 2;
}
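Both styles map onto the pyarrow.flight API; a small sketch (the command payload below is a stand-in for a serialized MyDescriptor):

import pyarrow.flight as flight

# Simple path-like descriptor
desc = flight.FlightDescriptor.for_path('datasets', 'catsdogs', 'training')

# Command descriptor carrying an opaque payload, e.g. a custom proto
cmd = flight.FlightDescriptor.for_command(b'<serialized MyDescriptor>')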
15. Ticket Sequence for Consumer
Flight Example
To consume an entire Flight:
● Get FlightInfo for the list of endpoints with tickets
● For each endpoint
– Use the ticket to get the endpoint stream
– Process each RecordBatch in the stream
[Sequence diagram: the consumer sends Get FlightInfo (FlightDescriptor) to the Flight service and receives FlightInfo; for each endpoint it calls Get Stream (Ticket), then repeatedly calls Get Next and processes each RecordBatch in the returned stream]
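A sketch of this sequence with the pyarrow.flight client; the address is hypothetical and process() stands in for application logic:

import pyarrow.flight as flight

client = flight.FlightClient('grpc://localhost:5005')

# Get FlightInfo for the list of endpoints with tickets
info = client.get_flight_info(
    flight.FlightDescriptor.for_path('datasets/catsdogs/training'))

# For each endpoint, use its ticket to get that endpoint's stream
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    # Process each RecordBatch in the stream
    for chunk in reader:
        process(chunk.data)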
16. Benefits
Arrow Flight
● Applications use the client interface and exchange standard record batches
● Complex communication is handled internally
● Efficient: uses batches and minimal copies
● Standardized protocol
– Authentication
– Support for different transports
– Able to handle backpressure
17. Current Status
Arrow Flight
Common protocol defined using protocol buffers
Prototype implementations in Java, C++, and Python
Still experimental, but lots of work is being done to make it production-ready
18. How to Talk Arrow
Arrow Flight
If a system wants to exchange Arrow Flight data, it needs to be able to produce/consume an Arrow stream
● Spark kind of does this already, but it is not externalized
● See SPARK-24579 and SPARK-26413
● Can build a Scala Flight connector with a little hacking
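For instance, PySpark already carries a private helper that collects a DataFrame as Arrow record batches; a sketch (internal API, subject to change, shown only to illustrate the point):

import pyarrow as pa

# _collect_as_arrow() is PySpark-internal, not a public API
batches = df._collect_as_arrow()   # list of pyarrow.RecordBatch
table = pa.Table.from_batches(batches)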
19. How to Talk Arrow
Arrow Flight
TensorFlow I/O has Arrow Datasets
● Maintained by the SIG-IO community
– Also many other inputs to TF
– Many sources from legacy contrib/
● Several Arrow datasets
– ArrowStreamDataset used here
● Input ops only for now
● Install: pip install tensorflow-io
Check it out at https://github.com/tensorflow/io
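A small sketch of the tensorflow-io Arrow datasets as they looked around the time of this talk (the API has evolved since):

import pandas as pd
import tensorflow_io.arrow as arrow_io

# Feed a pandas DataFrame to TensorFlow through Arrow memory
pdf = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [1, 2, 3]})
ds = arrow_io.ArrowDataset.from_pandas(pdf, preserve_index=False)

# TF 1.x-style iteration, matching the walkthrough later in the deck
it = ds.make_one_shot_iterator()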
21. Define the Service
Flight Example
Simple service backed by an in-memory data store
● Keeps streams in memory
● The Flight descriptor is a string id
● This is from the Java Flight examples
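The service used in the talk is the Java in-memory example; an approximate Python analogue with pyarrow.flight (a sketch, with a made-up port) looks like this:

import pyarrow.flight as flight

class InMemoryFlightServer(flight.FlightServerBase):
    """Keeps each uploaded stream in memory, keyed by a string id."""

    def __init__(self, location='grpc://0.0.0.0:5005'):
        super().__init__(location)
        self._store = {}  # descriptor path (bytes) -> pyarrow.Table

    def do_put(self, context, descriptor, reader, writer):
        # Read the client's entire stream and keep it in memory
        self._store[descriptor.path[0]] = reader.read_all()

    def do_get(self, context, ticket):
        # The ticket bytes are the same string id used in the descriptor
        return flight.RecordBatchStream(self._store[ticket.ticket])

InMemoryFlightServer().serve()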
22. Make the Clients
Flight Example
PySpark will put Arrow data
● Map partition op of DataFrame to Arrow
● Each partition sent as a stream of batches
– A ticket is roughly the partition index
TensorFlow Dataset will get Arrow data
● Requests the entire Flight, which is multiple streams
● Gets one batch at a time to process
● Op outputs tensors
23. Data Flow
Flight Example
[Diagram: a Spark worker puts Stream 1 and Stream 2, each a sequence of record batches, to the Flight service; TensorFlow gets the Flight (Flight = Stream 1 + Stream 2) and processes the batches]
24. Walkthrough
Flight Example
Application code is simple
– Only a few lines
– Focus on working with data
– Don't need to worry about conversion, file formats, networking
Example is in Python, but the data never needs to go through Python!
Worker JVM → Flight Service → TF C++

""" PySpark Client """
# Spark job to put partitions to the service
SparkFlightConnector.put(
    df,          # existing DataFrame
    host, port,  # Flight service location
    'radspark'   # data descriptor
)

""" TensorFlow Client """
# Arrow tf.data.Dataset gets the Flight data
dataset = ArrowFlightDataset.from_schema(
    host, port,                 # Flight service location
    'radspark',                 # data descriptor
    to_arrow_schema(df.schema)  # Arrow schema
)
# Iterate over Flight data as tensors
it = dataset.make_one_shot_iterator()
25. Recap
Arrow Flight
Apache Arrow – standard for in-memory data
Arrow Flight – efficiently move data around a network
● Arrow data as a service
● Stream batching
● Stream management
Simple example with PySpark + TensorFlow
● Data transfer never goes through Python
26. Links & References
Apache Arrow and Flight specification
https://arrow.apache.org/
https://github.com/apache/arrow/blob/master/format/Flight.proto
TensorFlow I/O
https://github.com/tensorflow/io
Related Spark JIRAs
SPARK-24579
SPARK-26413
Example Code
https://github.com/BryanCutler/SparkArrowFlight
References: Flight Overview by Arrow PMC Jacques Nadeau
[1] https://www.slideshare.net/JacquesNadeau5/apache-arrow-flight-overview
31. Spark Client
Code
Map partitions to RecordBatches
Add partition batches into a stream
Put the stream to the service

// Spark job to put partitions to the service
rdd.mapPartitions { it =>
  val allocator = it.allocator.newChildAllocator(
    "SparkFlightConnector", 0, Long.MaxValue)
  val client = new FlightClient(allocator, new Location(host, port))
  val desc = FlightDescriptor.path(descriptor)
  val stream = client.startPut(desc, it.root)
  // Use VectorSchemaRootIterator to convert Rows -> Vectors
  it.foreach { root =>
    // doPut on the populated VectorSchemaRoot
    stream.putNext()
  }
  stream.completed()
  stream.getResult
  client.close()
  Iterator.empty
}.count()