"Data science workflows can benefit tremendously from being accelerated, to enable data scientists to explore more and larger datasets. This allows data scientist to drive towards their business goals, faster, and more reliably. Accelerating Apache Spark with GPU is the next step for data science. In this talk, we will share our work in accelerating Spark applications via CUDA and NCCL.
We have identified several bottlenecks in Spark 2.4 in the areas of data serialization and data scalability. To address these, we accelerated Spark-based data analytics with enhancements that allow large columnar datasets to be analyzed directly in CUDA with Python. The GPU dataframe library, cuDF (github.com/rapidsai/cudf), can be used to express advanced analytics easily. By applying Apache Arrow and cuDF, we have achieved over 20x speedup over regular RDDs.
For distributed machine learning, Spark 2.4 introduced a barrier execution mode to support MPI allreduce-style algorithms. We will demonstrate how the latest NVIDIA NCCL library, NCCL 2, can further scale out distributed learning algorithms such as XGBoost.
Finally, an enhancement to the Spark Kubernetes scheduler will be introduced so that GPU resources can be scheduled from a Kubernetes cluster for Spark applications. We will share our experience deploying Spark on NVIDIA Tesla T4 server clusters. Based on the new NVIDIA Turing architecture, the T4, an energy-efficient 70-watt small-form-factor PCIe GPU, is optimized for scale-out computing environments and features multi-precision Turing Tensor Cores and new RT Cores. "
Accelerating Machine Learning Workloads and Apache Spark Applications via CUDA and NCCL
1. Richard Whitcomb, NVIDIA
Rong Ou, NVIDIA
Accelerating Machine Learning
Workloads and Apache Spark
Applications via CUDA and NCCL
#UnifiedAnalytics #SparkAISummit
2. About Us
Richard Whitcomb: Senior Engineer working on AI
Infrastructure. Previously at Spotify, Twitter.
Rong Ou: Principal Engineer at Nvidia working on
AI Infrastructure. Previously at Google.
7. XGBoost
• Popular gradient boosting library
• Distributed mode via Spark
• GPU support via CUDA
• Multi-GPU support via NCCL 2
• Recent addition: multi-node GPU support
• Experimental: running on Spark with GPUs
8. Spark Cluster
• Standalone cluster on GCP
• 5 virtual machines, each has:
– 64 vCPUs (32 physical cores)
– 416 GB memory
– 4 x NVIDIA Tesla T4
– 400 GB SSD persistent disk
– Default networking
• 4 Spark workers per VM
11. But...
• XGBoost training is pretty fast on GPUs
• ETL is slow in comparison
• We need to accelerate the machine learning
workflow end to end
12. Apache Arrow RDD
• Store Arrow batches directly in RDDs
• Already has library support
• Zero-copy movement from RDDs to CUDA
• Eliminates PySpark serialization overhead
– 20x speed improvement in PySpark vs Pickle
13. Arrow RDD Problems
• Users moving to Dataset/DataFrame API
• Difficult to use (columns vs rows)
• Most Spark features aren't usable; it mostly
works on distributed Pandas dataframes
• Users would have to rewrite all of their ETL jobs
to make use of GPUs
14. Moving towards DataFrames
• Can we provide similar speed improvements
under the DataFrame API?
• Little to no code changes for ETL jobs
• Same API users are already comfortable with
15. ETL on GPUs
• Ability to process columnar data across ops is key
• Added interface so DataFrame ops can “opt-in”
to consume and produce columnar data
• Added columnar processing to a few
DataFrame ops (CSV parsing, hash join, hash
aggregate, etc.)
• Can switch between row/columnar with config
16. Simple Benchmark
dfc = spark.read.schema(schema).csv("...")
dfc.groupBy("id").agg(F.sum("x"))
• 18x speedup
• No user code changes
• Config settings to enable GPU acceleration
• Uses RAPIDS library under the covers: https://rapids.ai/
17. Spark on GPU
• Encouraging early results with room to improve
– 6X speedup of XGBoost training loop
– 18X speedup of dataset-based ETL example
• Eager to collaborate with the Spark community
– Accelerator-aware scheduling (SPARK-24615)
– Stage level resource scheduling (SPARK-27495)
– Columnar processing (SPARK-27396)
– cuDF integration into XGBoost (XGBOOST-3997)
– Out-of-core XGBoost GPU (XGBOOST-4357)