2. Background
• Spark
– Batch/streaming processing, Spark SQL, MLlib
• Deep learning has many use cases at Baidu and has shown
significant quality improvements:
– Image retrieval ranking
– Ads CTR prediction
– Machine translation
– Speech recognition
• Goal: enable distributed deep learning from within Spark
3. Typical Training Data in Baidu
• Image recognition: 100 million
• OCR: 100 million
• Speech: 10 billion
• CTR: 100 billion
• Growing every year since 2012
5. Paddle
• Parallel Asynchronous Distributed Deep Learning Library
– Supports distributed parameter servers with
synchronous/asynchronous parameter updates
– Supports multi-GPU / CPU training
– Supports sparse models
– Easy-to-understand API for users to add new layers
– Rich feature support for deep learning use cases,
especially NLP
6. Deep learning options comparison
                        Caffe    TensorFlow           Torch                Paddle
Distributed Training    Yes      Yes                  No                   Yes
Communication Cost      Medium   High                 N/A                  Medium to Low
Easy to customize/code  Yes      More learning curve  More learning curve  Yes
Sparse Model Support    No       Yes                  Yes                  Yes
Area of Focus           Vision   All                  All                  All
Integration with Spark  Yes      No                   No                   Yes
7. High Level Goals
• Implement Spark ML abstractions so users can train
deep learning models with minimal code changes (see the
API sketch after this list)
• Leverage Paddle's native training and parameter
server mechanisms, scheduled as Spark deep
learning jobs
• Handle multi-tenancy and heterogeneity
• Parallelize hyperparameter selection
• Support both batch and streaming learning
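To make the first goal concrete, here is a minimal sketch of a Spark ML-compatible estimator, assuming the standard Estimator/Model contract. PaddleEstimator and PaddleModel are hypothetical names, and fit() is a stub where the real integration would launch Paddle trainers and parameter servers; this is an illustration, not the actual Paddle-on-Spark API.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical model: a real one would append a prediction column.
class PaddleModel(override val uid: String) extends Model[PaddleModel] {
  override def copy(extra: ParamMap): PaddleModel = new PaddleModel(uid)
  override def transformSchema(schema: StructType): StructType = schema
  override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
}

// Hypothetical estimator obeying the standard Spark ML contract.
class PaddleEstimator(override val uid: String) extends Estimator[PaddleModel] {
  def this() = this(Identifiable.randomUID("paddle"))
  override def copy(extra: ParamMap): PaddleEstimator = new PaddleEstimator(uid)
  override def transformSchema(schema: StructType): StructType = schema
  override def fit(dataset: Dataset[_]): PaddleModel = {
    // Stub: distributed Paddle training and parameter servers would run here.
    new PaddleModel(uid)
  }
}
```

Because it obeys the usual fit()/transform() contract, such an estimator drops into existing pipelines: new PaddleEstimator().fit(df).transform(df) behaves like any other Spark ML stage, which is exactly the "minimal code change" goal.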
16. Design decisions
• Spark ML compatible API
– Being compatible with Spark matters more than
being implemented on top of Spark
• Code-level configuration (sketched after this list)
– Easy and flexible
– Manual configuration is error-prone
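A toy illustration of the code-level configuration idea follows; the Layer and Topology types are invented for this sketch and are not Paddle's actual configuration API.

```scala
// Invented topology DSL: a programmatic configuration can validate itself,
// whereas a hand-edited config file fails only after the job is submitted.
sealed trait Layer
case class Input(size: Int) extends Layer
case class FullyConnected(size: Int, activation: String) extends Layer

case class Topology(layers: Seq[Layer]) {
  require(layers.headOption.exists(_.isInstanceOf[Input]),
    "topology must start with an Input layer")
  require(layers.forall {
    case Input(n)             => n > 0
    case FullyConnected(n, _) => n > 0
  }, "layer sizes must be positive")
}

// Misconfigurations are caught at construction time,
// before a cluster job is ever scheduled.
val dnn = Topology(Seq(
  Input(784),
  FullyConnected(256, "relu"),
  FullyConnected(10, "softmax")
))
```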
18. Sharded Parameter Server
• One parameter server and one trainer are co-located on each machine.
• Parameters are sharded, not replicated (see the sharding sketch below).
• All-to-all communication.
• Our environment:
  • 4 GPUs per machine
  • 4-10 machines
  • All machines on one switch
  • Reliable data center
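A sketch of the sharding arithmetic described above, assuming parameters are range-partitioned evenly across the co-located servers; shardRanges is an illustrative helper, not Paddle code.

```scala
// Each trainer pushes the gradient slice for shard i to server i, so every
// machine talks to every other one (all-to-all), while each parameter lives
// on exactly one server (sharded, not replicated).
def shardRanges(numParams: Int, numServers: Int): Seq[Range] = {
  val base  = numParams / numServers
  val rem   = numParams % numServers
  val sizes = Seq.tabulate(numServers)(i => base + (if (i < rem) 1 else 0))
  sizes.scanLeft(0)(_ + _).sliding(2).map { case Seq(a, b) => a until b }.toSeq
}

// Example: 10 parameters over 4 machines -> shards of size 3, 3, 2, 2.
shardRanges(10, 4).foreach(println)
```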
19. GPU Ring Synchronization
• Each parameter only needs to cross the slow connection twice (simulated below):
  • once for the reduce pass,
  • once for the scatter pass.
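Below is a single-process simulation of the ring scheme as I read the slide (a reduce pass, often called reduce-scatter, then a scatter/all-gather pass); it is a sketch of the general algorithm, not Paddle's implementation.

```scala
// data(g)(c) is GPU g's local value for parameter chunk c; one Double per
// chunk keeps the sketch short.
def ringAllReduce(data: Array[Array[Double]]): Unit = {
  val n = data.length
  def next(g: Int) = (g + 1) % n
  // Reduce pass: after n-1 steps, GPU g holds the fully summed chunk (g+1) % n.
  for (step <- 0 until n - 1) {
    val msgs = for (g <- 0 until n) yield {
      val c = ((g - step) % n + n) % n          // chunk GPU g forwards this step
      (next(g), c, data(g)(c))
    }
    for ((dst, c, v) <- msgs) data(dst)(c) += v // neighbor accumulates
  }
  // Scatter pass: circulate the summed chunks until every GPU has all of them.
  for (step <- 0 until n - 1) {
    val msgs = for (g <- 0 until n) yield {
      val c = ((g + 1 - step) % n + n) % n      // summed chunk GPU g forwards
      (next(g), c, data(g)(c))
    }
    for ((dst, c, v) <- msgs) data(dst)(c) = v  // neighbor overwrites
  }
}

// 3 GPUs, 3 chunks: every entry should end up as the column sum.
val data = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0), Array(7.0, 8.0, 9.0))
ringAllReduce(data)
data.foreach(row => println(row.mkString(" ")))  // each row: 12.0 15.0 18.0
```

Each chunk crosses any given link once per pass, so every parameter traverses the slow connection exactly twice, matching the slide's claim.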
20. ImageNet Scale Experiments
[Chart: time per 100 batches (seconds) vs. number of machines (1-5), comparing TCP and RDMA]
• AlexNet on ImageNet
• Batch size = 64
• Each machine has 4 K10 GPUs
22. Sparse Training Experiment
[Chart: time per 100 batches (seconds) vs. number of nodes (1-16), non-sparse vs. sparse]
• 1,451,594-dimensional sparse features
• Embedded to 128d, 96d, 96d, and 128d
• Ranking cost on top
• Batch size = 128 (see the cost sketch below)
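A back-of-the-envelope sketch of why sparse updates win at this scale, assuming only the embedding rows touched by a batch are communicated; activePerExample is an invented illustrative number, not from the slide.

```scala
// Assumption: only embedding rows for features present in the batch carry
// gradients, so only those rows travel to the parameter servers.
val vocab            = 1451594L  // sparse input dimension (from the slide)
val embedDim         = 128L      // first embedding size (from the slide)
val batch            = 128L      // batch size (from the slide)
val activePerExample = 100L      // invented for illustration

val touchedRows  = math.min(vocab, batch * activePerExample)
val denseFloats  = vocab * embedDim        // full-table (non-sparse) update
val sparseFloats = touchedRows * embedDim  // rows actually touched

println(f"dense update:  $denseFloats%,d floats")
println(f"sparse update: $sparseFloats%,d floats " +
        f"(${100.0 * sparseFloats / denseFloats}%.2f%% of dense)")
```

Under these assumptions the sparse update moves under 1% of the dense traffic, consistent with the scaling gap in the chart above.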