Apache Arrow Flight Overview

An overview of Apache Arrow's RPC protocol built on gRPC

  1. Apache Arrow: Apache Arrow Flight. By Jacques Nadeau, PMC, Apache Arrow
  2. Why Arrow Flight: Arrow Promises Interoperability
     • But its primary medium is in-memory
     • Some work to support shared memory in-process
     • But not all systems can be colocated
       – Especially in a modern K8s/containerized deployment
     • Shared memory has other problems:
       – Reference management and security are complex
       – Different requirements for long-term datasets versus ephemeral datasets
     Arrow needs an RPC layer to simplify the creation of Data Applications
  3. Arrow Messaging Paradigm: Batch Streams
     Primary Communication:
     • A stream of Arrow Record Batches
     • Bulk transfer targeting efficient movement
     • Effectively peer to peer
     Specific Methods:
     • Put Stream: Client sends a stream to server
     • Get Stream: Server sends a stream to client
     • Both initiated by client
     [Diagram: Client and Server. Put: the client sends Header, Data, Data, Data, end and the server replies "Thanks". Get: the client sends a Descriptor and the server replies Header, Data, Data, Data, end.]
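The Put/Get exchange on this slide can be modelled with plain Python generators. This is a toy in-memory sketch and not the Flight API: `ToyServer`, `put_stream`, `get_stream`, and `make_stream` are hypothetical names, and plain lists stand in for Arrow record batches.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# A message is a (kind, payload) pair; kind is "header", "data", or "end".
Message = Tuple[str, object]


def make_stream(schema: List[str], batches: Iterable[list]) -> Iterator[Message]:
    """Frame batches as header, data..., end, mirroring the slide's diagram."""
    yield ("header", schema)
    for batch in batches:
        yield ("data", batch)
    yield ("end", None)


class ToyServer:
    """Hypothetical in-memory stand-in for a Flight server."""

    def __init__(self) -> None:
        self._store: Dict[str, List[Message]] = {}

    def put_stream(self, descriptor: str, stream: Iterable[Message]) -> str:
        # Put: the client pushes a whole stream; the server acknowledges.
        self._store[descriptor] = list(stream)
        return "thanks"

    def get_stream(self, descriptor: str) -> Iterator[Message]:
        # Get: the client names a stream; the server sends it back.
        yield from self._store[descriptor]


# Both calls are initiated by the client.
server = ToyServer()
ack = server.put_stream(
    "marketing.yesterday.sales",
    make_stream(["a", "b"], [[1, 2], [3, 4]]),
)
received = list(server.get_stream("marketing.yesterday.sales"))
```

In a real deployment the two directions would be network streams, so neither side would need to hold the whole dataset in memory at once; the list in `_store` is only a convenience of the sketch.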
  4. Arrow Messaging Paradigm: Stream Management
     • Parallel consumption and locality awareness
       – A flight is composed of streams
       – Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
       – Systems can take advantage of location information to improve data locality
     • Flights have two reference systems:
       – Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
       – Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
     • Support for stream listing
       – ListFlights(Criteria)
       – GetFlightInfo(FlightDescriptor)
     [Diagram: a Flight made up of four Streams spread across Location 1 and Location 2. Endpoint: retrieved with Ticket.]
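The two reference systems and the ticket-plus-location endpoint can be sketched as small data classes. This is an illustrative model only: `ToyDescriptor`, `ToyEndpoint`, `pick_local`, and the location strings are hypothetical names, not the types defined by Arrow Flight.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ToyDescriptor:
    """Either a dotted path or an opaque binary command, never both."""
    path: Optional[Tuple[str, ...]] = None
    command: Optional[bytes] = None

    def __post_init__(self) -> None:
        if (self.path is None) == (self.command is None):
            raise ValueError("set exactly one of path or command")

    @classmethod
    def for_path(cls, dotted: str) -> "ToyDescriptor":
        return cls(path=tuple(dotted.split(".")))

    @classmethod
    def for_command(cls, command: bytes) -> "ToyDescriptor":
        return cls(command=command)


@dataclass(frozen=True)
class ToyEndpoint:
    ticket: bytes    # opaque to the client; only the server interprets it
    location: str    # where this stream can be consumed


def pick_local(endpoints: List[ToyEndpoint], my_location: str) -> List[ToyEndpoint]:
    """Prefer endpoints at the caller's location; fall back to all of them."""
    local = [e for e in endpoints if e.location == my_location]
    return local or endpoints


by_path = ToyDescriptor.for_path("marketing.yesterday.sales")
by_command = ToyDescriptor.for_command(b"select a,b from foo where c > 10")

# A flight is a set of streams, each reachable through its own endpoint.
flight = [
    ToyEndpoint(b"ticket-0", "location-1"),
    ToyEndpoint(b"ticket-1", "location-1"),
    ToyEndpoint(b"ticket-2", "location-2"),
    ToyEndpoint(b"ticket-3", "location-2"),
]
nearby = pick_local(flight, "location-2")
```

Because each endpoint carries its own ticket, a client could hand the four endpoints to four workers and read them in parallel; `pick_local` shows one way a worker might use the location hint.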
  5. Arrow Messaging Paradigm: Data as a Service Customization
     • Arrow Flight also supports a simple generic messaging framework
       – Supports customization and extensibility within the Arrow Flight context
     • ListActions()
       – Each Data Service can expose actions along with descriptions of what they support
       – Each action should describe how to structure the action and the corresponding result
       – Normal HTTP/2 exceptions can be used to manage error states
     • DoAction(Action) => Result
       – Generic containers that can carry Data Service-specific operations to execute
       – Examples might include: forget stream, load stream from disk
     • Actions and Results each have:
       – ActionType: string token
       – Body: JSON body of instruction
     • Arrow Flight clients can be written without knowledge of custom Actions/Results
       – Lightweight wrappers can be built for Data Services as needed
       – Or simply use existing JSON tooling on top of the generic API
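The ListActions/DoAction pattern can be illustrated with a small dispatcher that uses only the standard `json` module. This is a sketch under assumptions: `ToyDataService`, `ToyAction`, `ToyResult`, the `forget_stream` token, and the JSON field names are all hypothetical and chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyAction:
    type: str    # ActionType string token
    body: bytes  # JSON-encoded instruction


@dataclass
class ToyResult:
    type: str
    body: bytes  # JSON-encoded reply


class ToyDataService:
    """Hypothetical service exposing one custom action."""

    def __init__(self) -> None:
        self.streams = {"sales": [1, 2, 3], "clicks": [4, 5]}
        # Each entry maps a token to (description, handler).
        self._actions: Dict[str, Tuple[str, Callable[[dict], dict]]] = {
            "forget_stream": (
                'Drop a stream. Body: {"name": <str>}. '
                'Result: {"forgotten": <bool>}',
                self._forget_stream,
            ),
        }

    def _forget_stream(self, args: dict) -> dict:
        removed = self.streams.pop(args["name"], None)
        return {"forgotten": removed is not None}

    def list_actions(self) -> List[Tuple[str, str]]:
        # Lets a generic client discover what the service supports.
        return [(name, desc) for name, (desc, _) in self._actions.items()]

    def do_action(self, action: ToyAction) -> ToyResult:
        if action.type not in self._actions:
            raise KeyError(f"unknown action: {action.type}")
        _, handler = self._actions[action.type]
        reply = handler(json.loads(action.body))
        return ToyResult(action.type, json.dumps(reply).encode("utf-8"))


service = ToyDataService()
request = ToyAction("forget_stream", json.dumps({"name": "sales"}).encode("utf-8"))
result = service.do_action(request)
```

The client side needs nothing beyond a string token and a JSON body, which is the point the slide makes about writing clients without knowledge of custom actions.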
  6. But How? gRPC as a Foundation
     • Generic RPC generation framework
     • Built on the HTTP/2 standard
     • Many language bindings (shown at right on the slide)
     • Supports security & compression
     • Uses Protobuf as primary format
     • Designed primarily for application messaging
  7. Extend gRPC To Better Work With Arrow Streams
     • Streams are valid Protobuf objects, so systems that don’t have custom processing can still consume Arrow streams
       – The entirety of the Arrow RecordBatch is a single length-delimited Protobuf “bytes” field
     • For high-performance situations, do direct byte encoding and one-copy reads/zero-copy writes to avoid extra copies and overhead
       – The Java Flight implementation cuts through multiple layers to achieve this using currently released gRPC (despite no formal support for it)
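To make the "single length-delimited bytes field" point concrete, here is a minimal sketch of how Protobuf frames such a field on the wire: a varint tag (field number shifted left by three, combined with wire type 2), a varint length, then the raw payload. The field number `1` and the helper names are assumptions for illustration; this is not the Flight message definition or the Java implementation.

```python
from typing import Tuple

LENGTH_DELIMITED = 2  # Protobuf wire type for bytes, strings, sub-messages


def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    if value < 0:
        raise ValueError("varint sketch handles non-negative values only")
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, position after the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_bytes_field(field_number: int, payload: bytes) -> bytes:
    tag = (field_number << 3) | LENGTH_DELIMITED
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def decode_bytes_field(buf: bytes, pos: int = 0) -> Tuple[int, memoryview, int]:
    """Return (field_number, payload view, next position) without copying."""
    tag, pos = decode_varint(buf, pos)
    if tag & 0x07 != LENGTH_DELIMITED:
        raise ValueError("not a length-delimited field")
    length, pos = decode_varint(buf, pos)
    # Slicing a memoryview references the original buffer instead of copying.
    return tag >> 3, memoryview(buf)[pos:pos + length], pos + length


record_batch = b"\x00" * 300  # stand-in for serialized Arrow batch bytes
frame = encode_bytes_field(1, record_batch)
field_number, view, end = decode_bytes_field(frame)
```

A generic Protobuf reader sees an ordinary bytes field here, while a specialised reader can skip the few header bytes and hand the payload region on directly, which is the idea behind the reduced-copy path the slide describes.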
  8. Check it out
     • Arrow Flight Proposal
       – https://github.com/jacques-n/arrow
     • Example Usage in Dremio Formation
       – https://github.com/jacques-n/formation