Distributed Deep Learning on Hadoop Clusters

Andy Feng & Jun Shi
Yahoo! Inc.

Our Talks @ Hadoop Summit
• Storm on YARN (2013)
– http://bit.ly/1W02tZy
• Spark on YARN (2014)
– http://bit.ly/1W03dxE
• Machine Learning on Hadoop/Spark (2015)
– http://bit.ly/1NW3GvO

Agenda
• Why Deep Learning on Hadoop?
• CaffeOnSpark
– Architecture
– API: Scala + Python
• Demo
– CaffeOnSpark + Python Notebook

Use Case: Flickr Magic View (flickr.com/cameraroll)

Yahoo Use Case: Yahoo Weather
• Beauty
– Computationally assessed
• Relevant
– Location
– Time
– Cloudy
– Shower
– …
[Figure: Weather App | Yahoo Weather App]

Flickr DL/ML Pipeline
(1) Prepare Datasets @ Scale
(2) Deep Learning @ Scale
(3) Non-deep Learning @ Scale
(4) Apply ML Model @ Scale
* 10 billion photos; 7.5 million per day
* http://bit.ly/1KIDfof by Pierre Garrigues, Deep Learning Summit 2015

Machine Learning & Deep Learning on Hadoop

Hadoop Cluster Enhanced
• GPU servers added
– 4 Tesla K80 cards, each with 2 GK210 GPUs and 24GB memory
• Network interface enhanced
– InfiniBand for direct access to GPU memory
– Ethernet for external communication

Deep Learning Frameworks
• Caffe
– Available since Sept. 2013; 6.3k forks
– Popular in the vision community and at Yahoo
• TensorFlow
– Released in Nov. 2015; 9.8k forks
• Theano, Torch, DL4J, etc.

CaffeOnSpark Open Sourced
github.com/yahoo/CaffeOnSpark
• Released in Feb. 2016
• Apache 2.0 license
• Distributed deep learning
– GPU or CPU
– Ethernet or InfiniBand
• Easily deployed on public cloud or private cloud

CaffeOnSpark: Scalable Architecture

CaffeOnSpark: 19x Speedup (est.)
[Figure: top-5 validation error vs. training latency (hours)]

CaffeOnSpark: Deployment Options
• Single node
– spark-submit --master local
• Multiple nodes
– spark-submit --master URL -connection ethernet
– Ex.: EC2
– spark-submit --master URL -connection infiniband
– Ex.: Yahoo Hadoop cluster

CaffeOnSpark: DL Made Easy
Spark CLI
• spark-submit
    --num-executors #_Processes
    --class com.yahoo.ml.CaffeOnSpark
    caffe-on-spark.jar
        -devices #_gpus_per_proc
        -conf solver_config_file
        -model model_file
        -train | -test | -feature
Caffe Configuration
layer {
  name: "data"
  type: "MemoryData"
  source_class: "com.yahoo.ml.caffe.LMDB"
  memory_data_param {
    source: "hdfs:///mnist/trainingdata/"
    batch_size: 64
    channels: 1
    height: 28
    width: 28
  }
  …
}
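The Spark CLI template and the deployment options can be combined into a single command line. The sketch below only assembles the argument list shown on the slides; it does not launch anything. The helper name `build_spark_submit` and the file names in the usage lines (`solver.prototxt`, `mnist.model`) are hypothetical placeholders, and the class and jar names are copied from the slide as written.

```python
import shlex

# Modes taken verbatim from the slide's "-train | -test | -feature" line.
VALID_MODES = ("train", "test", "feature")


def build_spark_submit(master, num_executors, devices, solver, model,
                       mode="train", connection=None):
    """Assemble the spark-submit argument list from the slide's template.

    Hypothetical helper for illustration; it is not part of CaffeOnSpark.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    cmd = [
        "spark-submit",
        "--master", master,
        "--num-executors", str(num_executors),
        "--class", "com.yahoo.ml.CaffeOnSpark",  # as written on the slide
        "caffe-on-spark.jar",                     # as written on the slide
        "-devices", str(devices),
        "-conf", solver,
        "-model", model,
        "-" + mode,
    ]
    if connection is not None:
        # "ethernet" or "infiniband", per the deployment-options slide
        cmd += ["-connection", connection]
    return cmd


# Single node: no connection option needed.
local_cmd = build_spark_submit("local", 1, 1, "solver.prototxt", "mnist.model")
# Multiple nodes over Ethernet (for example, EC2); URL is a placeholder.
cluster_cmd = build_spark_submit("URL", 4, 2, "solver.prototxt", "mnist.model",
                                 connection="ethernet")
print(shlex.join(cluster_cmd))
```

Keeping the command as a list rather than a string avoids quoting problems if it is later passed to `subprocess.run`.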

CaffeOnSpark: One Program (Scala)
http://bit.ly/21ZY1c2
val cos = new CaffeOnSpark(ctx)
val conf = new Config(ctx, args).init()
// (1) train the DL model
val dl_train_source = DataSource.getSource(conf, true)
cos.train(dl_train_source)
// (2) extract features via DL
val lr_raw_source = DataSource.getSource(conf, false)
val ext_df = cos.features(lr_raw_source)
// (3) apply ML
val lr_input_df = ext_df
  .withColumn("L", cos.floats2doubleUDF(ext_df(conf.label)))
  .withColumn("F", cos.floats2doublesUDF(ext_df(conf.features(0))))
val lr = new LogisticRegression().setLabelCol("L").setFeaturesCol("F")
val lr_model = lr.fit(lr_input_df)
Steps (1) and (2): Deep Learning; step (3): Non-deep Learning
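Step (3) fits a logistic regression on features that the deep network extracted. The following is a minimal pure-Python sketch of that last stage only. It makes no CaffeOnSpark or Spark calls: the "features" are synthetic 4-dimensional vectors standing in for network outputs, and every name in it is an assumption chosen for illustration.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(x, weights, bias):
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def fit_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean log-loss."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(score(x, weights, bias)) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi / n
            grad_b += err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(x, weights, bias):
    return 1 if sigmoid(score(x, weights, bias)) >= 0.5 else 0


# Synthetic stand-in for extracted features: two well-separated classes.
rng = random.Random(0)
features, labels = [], []
for label, center in ((0, -2.0), (1, 2.0)):
    for _ in range(100):
        features.append([rng.gauss(center, 1.0) for _ in range(4)])
        labels.append(label)

weights, bias = fit_logistic_regression(features, labels)
accuracy = sum(predict(x, weights, bias) == y
               for x, y in zip(features, labels)) / len(labels)
print(f"training accuracy: {accuracy:.3f}")
```

In the Scala program the same role is played by Spark ML's `LogisticRegression`, with the label and feature columns converted from floats to doubles first.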

CaffeOnSpark: One Notebook (Python)
http://bit.ly/1REZ0cN

Demo: CaffeOnSpark on EC2
• https://github.com/yahoo/CaffeOnSpark/wiki
– Get started on EC2
– Python for CaffeOnSpark

CaffeOnSpark: What’s Next?
• Validation within training
• Enhanced data layer
• RNN and LSTM
• Java API
• Asynchronous distributed training

Related Work: SparkNet & DL4J
1) [driver] sc.broadcast(model) to executors
2) [executor] apply DL training against a mini-batch of the dataset to update models locally
3) [driver] aggregate(models) to produce a new model
Repeat steps 1-3.
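The broadcast / local-update / aggregate loop described for SparkNet and DL4J can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not either project's code: the "model" is a small linear-regression weight vector, each "executor" is a list entry rather than a Spark task, aggregation is assumed to be a simple element-wise average, and each executor treats its whole partition as one mini-batch. All function names are hypothetical.

```python
import random


def local_sgd_step(weights, batch, lr):
    """[executor] One least-squares gradient step on the local mini-batch."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xi in enumerate(x):
            grad[j] += error * xi / len(batch)
    return [w - lr * g for w, g in zip(weights, grad)]


def average_models(models):
    """[driver] Element-wise mean of the executors' weight vectors."""
    return [sum(ws) / len(models) for ws in zip(*models)]


def train(partitions, dim, rounds=300, lr=0.1):
    weights = [0.0] * dim  # model held by the driver
    for _ in range(rounds):
        # 1) broadcast: every executor starts from a copy of the driver model
        # 2) each executor updates its copy on its own data
        local_models = [local_sgd_step(list(weights), batch, lr)
                        for batch in partitions]
        # 3) the driver aggregates the local models into a new model
        weights = average_models(local_models)
    return weights


def make_partitions(true_weights, num_partitions=4, rows=50, seed=0):
    """Noise-free synthetic data, split into one partition per executor."""
    rng = random.Random(seed)
    partitions = []
    for _ in range(num_partitions):
        batch = []
        for _ in range(rows):
            x = [rng.gauss(0.0, 1.0) for _ in true_weights]
            y = sum(w * xi for w, xi in zip(true_weights, x))
            batch.append((x, y))
        partitions.append(batch)
    return partitions


TRUE_WEIGHTS = [2.0, -1.0, 0.5]
partitions = make_partitions(TRUE_WEIGHTS)
learned = train(partitions, dim=len(TRUE_WEIGHTS))
print([round(w, 3) for w in learned])
```

Every round passes the full model through the driver twice (once out, once back), which is why the driver is the point to watch when models are large.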

Summary
• Yahoo Hadoop clusters enhanced for deep learning
– GPU nodes + CPU nodes
– InfiniBand network for fast communication
• CaffeOnSpark open sourced
– Empowers Flickr and other Yahoo services
   ◦ In production since Q3 2015
   ◦ Reduced training latency and improved accuracy
– Scalable deep learning made easy
   ◦ spark-submit on your Spark cluster

Thank You!
Repo: github.com/yahoo/CaffeOnSpark
Email: caffeonspark-users@googlegroups.com
