CPU performance has scaled well in recent years through more complex microarchitectures, wider execution pipelines, more cores per processor, and higher frequencies. Accelerators, however, deliver more computational power and higher throughput at lower cost in their dedicated domains, which is driving their adoption in Spark. Yet when accelerators are integrated into Spark, a common outcome is a large performance promise from micro-benchmarks followed by little actual speedup. One reason is the cost of transferring data between the JVM and the accelerator; another is that the accelerator lacks information about how it is used in Spark. In this research, we investigate Apache Arrow based data frames as a unified way to share and transfer data between the CPU and accelerators, and we make the hardware and software stack data-frame aware by design. In this way we integrate Spark and accelerator designs seamlessly and get close to the promised performance.
2. Apache Arrow* Based Unified Data Exchange
Binwei Yang, Intel
Carson Wang, Intel
#UnifiedAnalytics #SparkAISummit
3. Me
• 13 years of experience in performance analysis
• Software -> CPU simulator -> Spark
• Joined the Intel Spark team in Aug. 2018
• A “layman” of Apache Spark
4. Pursuit of Performance Is Endless
• Intel® 2nd Gen Xeon® Scalable Processors
• Intel® Optane™ DC persistent memory
• Intel® FPGA
• Software optimization
9. Overhead of Offload
[Diagram: on the CPU side, Internal Row data is converted to an FPGA batch; on the FPGA side, FPGA DMA RX moves the batch in, the FPGA engine processes it, and FPGA DMA TX moves results back to be converted to Internal Row again. The two overheads are the format conversion and the data movement.]
10. Optimize – Unified Format
[Diagram: with a unified data format on both CPU and FPGA, batches flow through FPGA DMA RX, the FPGA engine, and FPGA DMA TX with no conversion step.]
• With a unified format, the FPGA is easy to debug
• The FPGA library can be shared with all other projects
11. Optimize – Double Buffer
[Diagram: two DMA channels, FPGA DMA RX1 and FPGA DMA RX2, alternately fill unified-format buffers so the FPGA engine never waits on a transfer.]
12. Optimize – Double Buffer
[Timeline diagram: column transfers Col1, Col2, Col3, … overlap with engine runs Eng 1, Eng 2, Eng 3, … over time.]
• A columnar data format is friendly to most accelerators
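The double-buffering idea above can be sketched in plain Python (a toy model, not FPGA code: a thread stands in for the DMA RX channel, and a bounded queue for the two in-flight buffers):

```python
import threading
import queue

def double_buffered_pipeline(batches, transfer, compute):
    """Overlap data transfer with compute using two in-flight buffers.

    `transfer` models FPGA DMA RX (host -> device copy); `compute`
    models the FPGA engine. While the engine works on batch N, the
    DMA thread fills the second buffer with batch N+1.
    """
    ready = queue.Queue(maxsize=2)  # at most two buffers in flight
    results = []

    def rx():
        for b in batches:
            ready.put(transfer(b))  # DMA RX into a free buffer slot
        ready.put(None)             # end-of-stream marker

    t = threading.Thread(target=rx)
    t.start()
    while True:
        buf = ready.get()
        if buf is None:
            break
        results.append(compute(buf))  # engine runs while RX continues
    t.join()
    return results

# Toy usage: "transfer" copies a batch, "compute" sums it.
out = double_buffered_pipeline([[1, 2], [3, 4]], list, sum)
print(out)  # [3, 7]
```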
13. Do We Fully Utilize CPU?
df.agg(F.sum('a_float')).show()
perf stat -e fp_arith_inst_retired.128b_packed_single -A -a sleep 1
CPU0 0 fp_arith_inst_retired.128b_packed_single
CPU1 0 fp_arith_inst_retired.128b_packed_single
CPU2 0 fp_arith_inst_retired.128b_packed_single
…
14. Add AVX Support
• We need
– A columnar data format
– A native LLVM SQL engine
• Take advantage of other highly optimized libraries
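Why a columnar format matters for such libraries can be shown with a small sketch (assuming `numpy`; the data is invented for illustration): summing a field of row objects touches every row one by one, while the same field stored as one contiguous column can be handed straight to a vectorized, SIMD-capable kernel:

```python
import numpy as np

# Row-oriented data: summing one field walks every row object.
rows = [{"a_float": float(i), "b": i % 3} for i in range(1000)]
row_sum = sum(r["a_float"] for r in rows)

# Columnar data: the same field is one contiguous float32 buffer,
# so a vectorized kernel like np.sum can use packed SIMD over it.
col = np.array([r["a_float"] for r in rows], dtype=np.float32)
col_sum = float(np.sum(col))

print(row_sum, col_sum)  # 499500.0 499500.0
```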
15. Recap
• A standard columnar data format
– Easy to debug
– Shared by all projects
• Implement a series of Tungsten backends
16. Apache Arrow* Is the Answer
• Apache Arrow* is the best choice
• A standard data frame format
– For the native Tungsten backend
– For all accelerators offloading the Spark SQL engine
*Other names and brands may be claimed as the property of others.
17. Plug and Play Backend
[Diagram: a data frame physical plan (op1 → op2 → op3 → op4, including a Python UDF) runs on pluggable Tungsten backends: JVM, LLVM/AVX, accelerators (ACC1, ACC2), and Intel Python over off-heap data, with data frames flowing between operators.]
18. Make Use of Intel Optane DC Persistent Memory
[Diagram: the same pluggable-backend plan (op1 → op2 → op3 → op4) with the off-heap data frames held in Intel Optane DC persistent memory.]
19. Make Use of Intel Optane DC Persistent Memory
[Diagram: the plan resumes at op2 → op3 → op4, with op2 reading its shuffle input from the off-heap data frame kept in persistent memory.]
22. Connect Other ML/AI Frameworks
• The proposal of SPARK-24579
• No extra data format conversion
23. Call to Action
• Share your comments on SPARK-27396, created by Robert
• Follow our work at https://github.com/Intel-bigdata
• Let’s bring Spark’s performance to a higher level
24. Don’t forget to rate and review the sessions. Search Spark + AI Summit.