
Apache Arrow Flight Overview

An overview of Apache Arrow's RPC protocol, built on gRPC



  1. Apache Arrow Flight. By Jacques Nadeau, PMC, Apache Arrow
  2. Why Arrow Flight: Arrow Promises Interoperability
     • But its primary medium is in-memory
     • Some work to support shared memory in-process
     • But not all systems can be colocated
       – Especially in a modern K8s/containerized deployment
     • Shared memory has other problems:
       – Reference management and security are complex
       – Requirements differ for long-term datasets versus ephemeral datasets
     Arrow needs an RPC layer to simplify the creation of data applications.
  3. Arrow Messaging Paradigm: Batch Streams
     Primary communication:
     • A stream of Arrow record batches
     • Bulk transfer targeting efficient movement
     • Effectively peer to peer
     Specific methods:
     • Put Stream: the client sends a stream to the server
     • Get Stream: the server sends a stream to the client
     • Both are initiated by the client
     [Diagram, client and server. Put: the client sends a header, data messages, and an end marker; the server answers "Thanks". Get: the client sends a descriptor; the server returns a header, data messages, and an end marker.]
  4. Arrow Messaging Paradigm: Stream Management
     • Parallel consumption and locality awareness
       – A flight is composed of streams
       – Each stream has a FlightEndpoint: an opaque stream ticket along with a consumption location
       – Systems can take advantage of location information to improve data locality
     • Flights have two reference systems:
       – Dotted path namespace for simple services (e.g. marketing.yesterday.sales)
       – Arbitrary binary command descriptor (e.g. “select a,b from foo where c > 10”)
     • Support for stream listing
       – ListFlights(Criteria)
       – GetFlightInfo(FlightDescriptor)
     [Diagram: a Flight made up of four Streams, with Location 1 and Location 2; each Endpoint is retrieved with a Ticket.]
  5. Arrow Messaging Paradigm: Data as a Service Customization
     • Arrow Flight also supports a simple generic messaging framework
       – Supports customization and extensibility within the Arrow Flight context
     • ListActions()
       – Each Data Service can expose actions along with descriptions of what they support
       – Each action should describe how to structure the action and the corresponding result
       – Normal HTTP/2 exceptions can be used to manage error states
     • DoAction(Action) => Result
       – Generic containers that can carry Data Service specific operations
       – Examples might include: forget stream, load stream from disk
     • Actions and Results each have:
       – ActionType: string token
       – Body: JSON body of the instruction
     • Arrow Flight clients can be written without knowledge of custom Actions/Results
       – Lightweight wrappers can be built for Data Services as needed
       – Or simply use existing JSON tooling on top of the generic API
  6. But How? gRPC as a Foundation
     • Generic RPC generation framework
     • Built on the HTTP/2 standard
     • Many language bindings
     • Supports security and compression
     • Uses Protobuf as its primary format
     • Designed primarily for application messaging
  7. Extend gRPC to Work Better with Arrow Streams
     • Streams are valid Protobuf objects, so systems that don’t have custom processing can still consume Arrow streams
       – The entirety of the Arrow RecordBatch is a single length-delimited Protobuf “bytes” field
     • For high-performance situations, do direct byte encoding and one-copy reads/zero-copy writes to avoid extra copies and overhead
       – The Java Flight implementation cuts through multiple layers to achieve this using the currently released gRPC (despite no formal support for it)
  8. Check It Out
     • Arrow Flight proposal
       – https://github.com/jacques-n/arrow
     • Example usage in Dremio Formation
       – https://github.com/jacques-n/formation
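
The concepts in slides 3 to 5 (Put/Get streams, descriptors, endpoints with tickets, and JSON-bodied actions) can be sketched as a small in-memory model. This is not the Arrow Flight API and involves no gRPC or Arrow code: the class `InMemoryFlightService`, the `Batch` stand-in, the `forget_stream` action name, the ticket format, and the location `grpc://node-1:9000` are all hypothetical, chosen only to mirror the method names on the slides.

```python
import itertools
import json
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Batch = List[dict]  # hypothetical stand-in for an Arrow record batch


@dataclass(frozen=True)
class FlightDescriptor:
    """Names a flight by a dotted path or by an arbitrary binary command."""
    path: Optional[Tuple[str, ...]] = None
    command: Optional[bytes] = None


@dataclass(frozen=True)
class FlightEndpoint:
    """An opaque stream ticket plus the location where it can be consumed."""
    ticket: bytes
    location: str


@dataclass
class FlightInfo:
    descriptor: FlightDescriptor
    endpoints: List[FlightEndpoint]


class InMemoryFlightService:
    """Toy model of the slide's RPC surface; every call is a plain method."""

    def __init__(self, location: str) -> None:
        self.location = location
        self._ids = itertools.count()
        self._streams: Dict[bytes, List[Batch]] = {}
        self._flights: Dict[FlightDescriptor, FlightInfo] = {}

    def do_put(self, descriptor: FlightDescriptor,
               batches: Iterable[Batch]) -> FlightEndpoint:
        # Put Stream: the client pushes batches; each put adds one stream.
        ticket = f"ticket-{next(self._ids)}".encode()
        self._streams[ticket] = list(batches)
        endpoint = FlightEndpoint(ticket, self.location)
        info = self._flights.setdefault(descriptor, FlightInfo(descriptor, []))
        info.endpoints.append(endpoint)
        return endpoint

    def list_flights(self) -> List[FlightInfo]:
        return list(self._flights.values())

    def get_flight_info(self, descriptor: FlightDescriptor) -> FlightInfo:
        return self._flights[descriptor]

    def do_get(self, ticket: bytes) -> Iterator[Batch]:
        # Get Stream: the server streams back the batches behind a ticket.
        yield from self._streams[ticket]

    def list_actions(self) -> Dict[str, str]:
        # Each action is described by a type token and its expected JSON body.
        return {"forget_stream": 'JSON body {"ticket": "<ticket>"}; drops that stream'}

    def do_action(self, action_type: str, body: str) -> str:
        if action_type != "forget_stream":
            raise ValueError(f"unknown action: {action_type}")
        ticket = json.loads(body)["ticket"].encode()
        removed = self._streams.pop(ticket, None) is not None
        for info in self._flights.values():
            info.endpoints = [e for e in info.endpoints if e.ticket != ticket]
        return json.dumps({"removed": removed})


service = InMemoryFlightService("grpc://node-1:9000")
sales = FlightDescriptor(path=("marketing", "yesterday", "sales"))
service.do_put(sales, [[{"a": 1}, {"a": 2}]])
service.do_put(sales, [[{"a": 3}]])

# A consumer looks up the flight, then reads every endpoint's stream.
info = service.get_flight_info(sales)
rows = [row for ep in info.endpoints
        for batch in service.do_get(ep.ticket) for row in batch]

# A custom action carried as a type token plus a JSON body.
first_ticket = info.endpoints[0].ticket.decode()
result = json.loads(
    service.do_action("forget_stream", json.dumps({"ticket": first_ticket})))
```

Because each endpoint carries its own ticket and location, a real consumer could read the endpoints in parallel or choose the nearest location, which is the locality point made in slide 4. The sketch reads them sequentially for simplicity.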