Make Your PySpark Data Fly with Apache Arrow!
Bryan Cutler
Software Engineer
@BryanCutler
About Bryan
@BryanCutler on Github
Software Engineer, IBM
Center for Open-Source Data & AI Technologies
(CODAIT)
Big Data Machine Learning & AI
Apache Spark committer
Apache Arrow committer
TensorFlow I/O maintainer
Center for Open Source Data and AI Technologies (CODAIT)
codait.org
CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.
Relaunch of the Spark Technology Center (STC) to reflect an expanded mission.
Improving the Enterprise AI Lifecycle in Open Source.
Agenda
Overview of Apache Arrow
Intro to Arrow Flight
How to talk Arrow
Flight in Action
Apache Arrow Overview
About Arrow
Apache Arrow
Standard format for in-memory columnar data
● Implementations in many languages, and growing
● Built for efficient analytic operations on modern hardware
Has built-in primitives for basic exchange of Arrow data
● Zero-copy data within a process
● IPC with Arrow record batch messages (see the sketch below)
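The IPC stream format can be exercised directly from Python. A minimal sketch, assuming pyarrow is installed ("pip install pyarrow"); the column names here are made up for illustration:

import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "label"])

# IPC stream format: a schema message followed by record batch messages.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Reading back uses the same buffer layout; no row-by-row decoding.
reader = pa.ipc.open_stream(sink.getvalue())
for b in reader:
    print(b.num_rows, b.schema.names)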
Why use Arrow
Apache Arrow
Arrow brings many benefits
● Common standard with cross-language support
● Better interoperability between frameworks
● Avoid costly data serialization
Who is using Arrow
Apache Arrow
The Apache® Software Foundation Announces Apache Arrow™ Momentum
● Adopted by dozens of Open Source and commercial technologies
● Exceeded 1,000,000 monthly downloads within first three years as an Apache Top-Level Project
● Apache Spark, NVIDIA RAPIDS, pandas, and Dremio, among others
https://arrow.apache.org/powered_by
Source: https://blogs.apache.org/foundation/entry/the-apache-software-foundation-announces46
Arrow Flight
Introduction
Arrow Flight
Arrow Flight is an Arrow-native RPC framework
Defines a standard protocol for data exchange
Makes it easy to efficiently move data around a network by providing [1]:
● Arrow Data as a Service
● Batch Streams
● Stream Management
Arrow Data as a Service
Arrow Flight
Extensible data service
● Clients get/put Arrow data
● List available data
● Custom actions
● Can think of it as ODBC for in-memory data (a service sketch follows below)
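To make the service idea concrete, here is a hedged sketch of a tiny in-memory Flight service using the pyarrow.flight API (experimental at the time of this talk). The class name DemoFlightServer and the port are illustrative, not from any released connector:

import pyarrow.flight as flight

class DemoFlightServer(flight.FlightServerBase):
    """In-memory service: keeps put streams, keyed by descriptor path."""

    def __init__(self, location="grpc://0.0.0.0:5005"):
        super().__init__(location)
        self._location = location
        self._tables = {}  # path bytes -> pyarrow.Table

    def do_put(self, context, descriptor, reader, writer):
        # Client pushes a stream of record batches; store them as a table.
        self._tables[descriptor.path[0]] = reader.read_all()

    def do_get(self, context, ticket):
        # The opaque ticket bytes name the stored stream to send back.
        return flight.RecordBatchStream(self._tables[ticket.ticket])

    def get_flight_info(self, context, descriptor):
        table = self._tables[descriptor.path[0]]
        # One endpoint; its ticket is just the path bytes in this toy service.
        endpoint = flight.FlightEndpoint(descriptor.path[0], [self._location])
        return flight.FlightInfo(table.schema, descriptor, [endpoint],
                                 table.num_rows, -1)  # -1 = unknown byte size

if __name__ == "__main__":
    DemoFlightServer().serve()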
Stream Batching
Arrow Flight
An Arrow Stream is a schema + record batches
A Flight is composed of multiple streams
● Streams could come from different endpoints
● Transfer data in bulk for efficiency
● Location info can be used to improve data locality
[Diagram: a Flight composed of Stream 1 (Record Batch, Record Batch) and Stream 2 (Record Batch, Record Batch)]
Stream Management
Arrow Flight
Service manages Flights for the clients
● FlightInfo gives a list of endpoints with locations of each stream in the Flight
● Streams are referenced by a ticket
– A ticket is an opaque struct that is unique for each stream
● Flight descriptors differentiate between flights
– Can define how a Flight is composed
– Batch size, or even a SQL query
FlightDescriptor Types
Arrow Flight
Simple path-like:
 "datasets/cats-dogs/training"

Custom proto:
 message MyDescriptor {
   string sql_query = 1;
   int32 records_per_batch = 2;
 }
 message MyTicket {
   MyDescriptor desc = 1;
   string uuid = 2;
 }
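For reference, here is how the two descriptor styles and a ticket look through the pyarrow.flight API; a sketch only, where the serialized proto bytes are stand-ins for whatever your application encodes:

import pyarrow.flight as flight

# Simple path-like descriptor:
desc = flight.FlightDescriptor.for_path("datasets", "cats-dogs", "training")

# Custom descriptor: serialize your own proto (e.g. MyDescriptor above)
# and wrap the bytes in a command descriptor.
cmd = flight.FlightDescriptor.for_command(b"<serialized MyDescriptor>")

# Tickets are opaque bytes the service hands out, one per stream.
ticket = flight.Ticket(b"<serialized MyTicket>")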
Ticket Sequence for Consumer
Flight Example
To consume an entire Flight
● Get FlightInfo for list of endpoints with tickets
● For each endpoint
– Use ticket to get endpoint stream
– Process each RecordBatch in the stream
[Sequence diagram: Consumer sends Get FlightInfo (FlightDescriptor) and the Flight Service returns FlightInfo; for each endpoint, Get Stream (Ticket) returns a RecordBatch stream; for each batch in the stream, Get Next, then process the batch]
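The same sequence with the pyarrow.flight client, as a minimal sketch; the address and descriptor path are placeholders, and a robust client would dial each endpoint's listed locations rather than reusing one connection:

import pyarrow.flight as flight

client = flight.FlightClient("grpc://localhost:5005")
info = client.get_flight_info(
    flight.FlightDescriptor.for_path("datasets", "cats-dogs", "training"))

# For each endpoint: use its ticket to get the endpoint stream,
# then process each RecordBatch in the stream.
total_rows = 0
for endpoint in info.endpoints:
    reader = client.do_get(endpoint.ticket)
    for chunk in reader:                   # chunk.data is a RecordBatch
        total_rows += chunk.data.num_rows  # stand-in for real processing
print("rows consumed:", total_rows)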
Benefits
Arrow Flight
● Applications use client interface and exchange standard record batches
● Complex communication handled internally
● Efficient, uses batches and minimum copies
● Standardized protocol
– Authentication
– Support different transports
– Able to handle backpressure
Current Status
Arrow Flight
Common protocol defined using protocol buffers
Prototype implementations in Java, C++, Python
Still experimental, but lots of work being done to make it production ready
How to Talk Arrow
Arrow Flight
If a system wants to exchange Arrow Flight data, it needs to be able to produce/consume an Arrow stream
● Spark kind of does already, but not externalized
● See SPARK-24579 and SPARK-26413
● Can build a Scala Flight connector with a little hacking (see the backup slide at the end)
How to Talk Arrow
Arrow Flight
TensorFlow I/O has Arrow Datasets
● Maintained by SIG-IO community
– Also many other inputs to TF
– Many sources from legacy contrib/
● Several Arrow datasets
– ArrowStreamDataset used here (sketched below)
● Input ops only for now
● Install: "pip install tensorflow-io"
Check it out at https://github.com/tensorflow/io
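A hedged sketch of ArrowStreamDataset usage; the constructor arguments have shifted across tensorflow-io releases, so the endpoint, column indices, and types below are illustrative assumptions, not a fixed API:

import tensorflow as tf
import tensorflow_io.arrow as arrow_io

# Consume record batches from an endpoint serving the Arrow stream
# format; columns and output_types must match the stream's schema.
dataset = arrow_io.ArrowStreamDataset(
    "localhost:8888",                     # assumed host:port endpoint
    columns=(0, 1),                       # column indices to read
    output_types=(tf.int64, tf.float32))  # per-column tensor types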
Flight in Action:
Spark to TensorFlow
Define the Service
Flight Example
Simple Service backed by an in-memory data store
● Keeps streams in memory
● Flight descriptor is a string id
● This is from the Java Flight examples
Make the Clients
Flight Example
PySpark will put Arrow data
● Map partition op of DataFrame to Arrow
● Each partition sent as a stream of batches
– A ticket is roughly the partition index
TensorFlow Dataset will get Arrow data
● Request entire Flight, which is multiple streams
● Gets one batch at a time to process
● Op outputs tensors
Data Flow
Flight Example
[Diagram: a Spark Worker sends Stream 1 and Stream 2, each a sequence of Record Batches, to the Flight Service; TensorFlow gets the record batches from the service and processes them. Flight = Stream 1 + Stream 2]
Walkthrough
Flight Example
Application code is simple
– Only a few lines
– Focus on working with data
– No need to worry about conversion, file formats, or networking
Example is in Python, but the data never needs to go through Python!
Worker JVM → Flight Service → TF C++
""" PySpark Client """
# Spark job to put partitions to service
SparkFlightConnector.put(
    df,           # Existing DataFrame
    host, port,   # Flight Service ip
    'rad-spark'   # Data descriptor
)

""" TensorFlow Client """
# Arrow tf.data.Dataset gets Flight data
dataset = ArrowFlightDataset.from_schema(
    host, port,   # Flight Service ip
    'rad-spark',  # Data descriptor
    to_arrow_schema(df.schema)  # Schema
)
# Iterate over Flight data as tensors
it = dataset.make_one_shot_iterator()
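Continuing the walkthrough in TF1 graph mode (implied by make_one_shot_iterator above), a sketch of draining the iterator; nothing here is specific to Flight:

import tensorflow as tf

next_batch = it.get_next()
with tf.Session() as sess:
    while True:
        try:
            print(sess.run(next_batch))  # tensors built from Arrow batches
        except tf.errors.OutOfRangeError:
            break                        # Flight data exhausted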
Recap
Arrow Flight
Apache Arrow – standard for in-memory data
Arrow Flight – efficiently move data around a network
● Arrow data as a service
● Stream batching
● Stream management
Simple example with PySpark + TensorFlow
● Data transfer never goes through Python
Links & References
Apache Arrow and Flight specification
https://arrow.apache.org/
https://github.com/apache/arrow/blob/master/format/Flight.proto
TensorFlow I/O
https://github.com/tensorflow/io
Related Spark JIRAs
SPARK-24579
SPARK-26413
Example Code
https://github.com/BryanCutler/SparkArrowFlight
References: Flight Overview by Arrow PMC Jacques Nadeau
[1] https://www.slideshare.net/JacquesNadeau5/apache-arrow-flight-overview
Thank you!
codait.org
http://github.com/BryanCutler
developer.ibm.com/code
Sign up for IBM Cloud and try Watson Studio!
https://ibm.biz/BdZgcx
https://datascience.ibm.com/
Backup Slides
Spark Client
Code
● Map partitions to RecordBatches
● Add partition batches into a stream
● Put stream to service
// Spark job to put partitions to the service
// (assumes the Arrow Flight Java classes are imported, e.g.
// org.apache.arrow.flight.{FlightClient, FlightDescriptor, Location})
rdd.mapPartitions { it =>
   val allocator = it.allocator.newChildAllocator(
       "SparkFlightConnector", 0, Long.MaxValue)
   val client = new FlightClient(allocator, new Location(host, port))
   val desc = FlightDescriptor.path(descriptor)
   val stream = client.startPut(desc, it.root)
   // Use VectorSchemaRootIterator to convert Rows -> Vectors
   it.foreach { root =>
     // doPut on the populated VectorSchemaRoot
     stream.putNext()
   }
   stream.completed()
   stream.getResult
   client.close()
   Iterator.empty
 }.count()