"Data science workflows can benefit tremendously from being accelerated, to enable data scientists to explore more and larger datasets. This allows data scientist to drive towards their business goals, faster, and more reliably. Accelerating Apache Spark with GPU is the next step for data science. In this talk, we will share our work in accelerating Spark applications via CUDA and NCCL.
We have identified several bottlenecks in Spark 2.4 in the areas of data serialization and data scalability. To address these, we accelerated Spark-based data analytics with enhancements that allow large columnar datasets to be analyzed directly in CUDA with Python. The GPU dataframe library, cuDF (github.com/rapidsai/cudf), can be used to express advanced analytics easily. By applying Apache Arrow and cuDF, we have achieved over 20x speedup over regular RDDs.
For distributed machine learning, Spark 2.4 introduced a barrier execution mode to support MPI allreduce-style algorithms. We will demonstrate how the latest NVIDIA NCCL library, NCCL 2, can further scale out distributed learning algorithms such as XGBoost.
Finally, an enhancement to the Spark Kubernetes scheduler will be introduced so that GPU resources can be scheduled from a Kubernetes cluster for Spark applications. We will share our experience deploying Spark on NVIDIA Tesla T4 server clusters. Based on the new NVIDIA Turing architecture, the T4, an energy-efficient 70-watt small-form-factor PCIe GPU, is optimized for scale-out computing environments and features multi-precision Turing Tensor Cores and new RT Cores. "
Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
1. Richard Whitcomb, NVIDIA
Rong Ou, NVIDIA
Accelerating Machine Learning
Workloads and Apache Spark
Applications via CUDA and NCCL
#UnifiedAnalytics #SparkAISummit
2. About Us
Richard Whitcomb: Senior Engineer working on AI
Infrastructure. Previously at Spotify, Twitter.
Rong Ou: Principal Engineer at Nvidia working on
AI Infrastructure. Previously at Google.
7. XGBoost
• Popular gradient boosting library
• Distributed mode via Spark
• GPU support via CUDA
• Multi-GPU support via NCCL 2
• Recent addition: multi-node GPU support
• Experimental: running on Spark with GPUs
8. Spark Cluster
• Standalone cluster on GCP
• 5 virtual machines, each has:
– 64 vCPUs (32 physical cores)
– 416 GB memory
– 4 x NVIDIA Tesla T4
– 400 GB SSD persistent disk
– Default networking
• 4 Spark workers per VM
11. But...
• XGBoost training is pretty fast on GPUs
• ETL is slow in comparison
• We need to accelerate the machine learning
workflow end to end
12. Apache Arrow RDD
• Store Arrow batches directly in RDDs
• Already has library support
• Zero-copy movement from RDDs to CUDA
• Eliminates PySpark serialization overhead
– 20x speed improvement in PySpark vs Pickle
13. Arrow RDD Problems
• Users moving to Dataset/DataFrame API
• Difficult to use (columns vs rows)
• Most Spark features aren't usable; it mostly
works on distributed Pandas dataframes
• Users would have to rewrite all of their ETL jobs
to make use of GPUs
14. Moving towards DataFrames
• Can we provide similar speed improvements
under the DataFrame API?
• Little to no code changes for ETL jobs
• Same API users are already comfortable with
15. ETL on GPUs
• Ability to process columnar data across ops is key
• Added interface so DataFrame ops can “opt-in”
to consume and produce columnar data
• Added columnar processing to a few
DataFrame ops (CSV parsing, hash join, hash
aggregate, etc.)
• Can switch between row/columnar with config
16. Simple Benchmark
dfc = spark.read.schema(schema).csv("...")
dfc.groupBy("id").agg(F.sum("x"))
• 18x speedup
• No user code changes
• Config settings to enable GPU acceleration
• Uses RAPIDS library under the covers: https://rapids.ai/
17. Spark on GPU
• Encouraging early results with room to improve
– 6X speedup of XGBoost training loop
– 18X speedup of dataset-based ETL example
• Eager to collaborate with the Spark community
– Accelerator-aware scheduling (SPARK-24615)
– Stage level resource scheduling (SPARK-27495)
– Columnar processing (SPARK-27396)
– cuDF integration into XGBoost (XGBOOST-3997)
– Out-of-core XGBoost GPU (XGBOOST-4357)