Scaling Deep Learning on Hadoop at LinkedIn
Anthony Hsu, Staff Software Engineer
DataWorks Summit, Washington, D.C., May 23, 2019
About Me: Anthony Hsu
• https://www.linkedin.com/in/erwaman/
• Staff Software Engineer at LinkedIn working on the Hadoop Dev team
• Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban), dataset access (Dali), machine learning infra (TonY, this talk)
LinkedIn's Vision
Create economic opportunity
for every member of the global workforce
• 630M Members
• 30M Companies
• 20M Jobs
• 50K Skills
• 90K Schools
Machine Learning at LinkedIn
People You May Know
Job Recommendations
News Feed
LinkedIn Learning Recommendations
Why Deep Learning?
• Prediction accuracy of traditional ML models tends to plateau quickly as data increases
• Deep networks continue to improve as data increases
Figure source: "Building AI Applications Using Deep Learning", https://blog.easysol.net/building-ai-applications/
Which framework to use?
Andrej Karpathy, Director of AI at Tesla
https://twitter.com/karpathy/status/972295865187512320
Machine Learning process
• ML process has many parts
○ Data Ingestion
○ Data Preparation
○ Model Training
○ Model Deployment
○ Model Serving
• At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
• This talk will focus on model training.
Early days: how AI engineers did training
• Copy code and dependencies to each host
• Manually specify host and port of each process
• Customize arguments for each process
# On ps0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=ps --task_index=0
# On ps1.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=ps --task_index=1
# On worker0.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=worker --task_index=0
# On worker1.example.com:
$ python trainer.py \
    --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
    --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
    --job_name=worker --task_index=1
Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
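To make the manual burden concrete, here is a minimal sketch that generates the per-host commands above from two host lists. The function name `build_commands` is hypothetical and only illustrates the bookkeeping; it does not copy code to hosts or launch anything.

```python
def build_commands(ps_hosts, worker_hosts, script="trainer.py"):
    """Return (host, command) pairs, one per process in the cluster.

    Every process receives the full host lists; only --job_name and
    --task_index differ between processes.
    """
    ps_arg = ",".join(ps_hosts)
    worker_arg = ",".join(worker_hosts)
    commands = []
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index, address in enumerate(hosts):
            host = address.split(":")[0]  # drop the port to get the login host
            command = (
                f"python {script} "
                f"--ps_hosts={ps_arg} "
                f"--worker_hosts={worker_arg} "
                f"--job_name={job_name} --task_index={task_index}"
            )
            commands.append((host, command))
    return commands


if __name__ == "__main__":
    pairs = build_commands(
        ["ps0.example.com:2222", "ps1.example.com:2222"],
        ["worker0.example.com:2222", "worker1.example.com:2222"],
    )
    for host, command in pairs:
        print(f"# On {host}:\n$ {command}")
```

Adding or moving a single host changes the arguments of every process, which is the kind of orchestration work a cluster scheduler can take over.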
Challenges of scaling up training
• Managing code and dependencies
• Orchestrating distributed training
• Resource contention (especially for GPUs)
• Managing an ML workflow (data preparation, training, deployment)
• Fault tolerance
E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 693.00M (726663168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Existing YARN features to leverage
• YARN is Hadoop's scheduler
• YARN supports
○ GPU resources and other resource types
○ Team-based and hierarchical queues
○ Elasticity between queues
○ User-based limits
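As a rough illustration of hierarchical queues, the sketch below computes each queue's guaranteed share of the cluster by multiplying percentage capacities down the tree. It is a toy model, not YARN scheduler code; the queue names and percentages are made up, and elasticity (a queue borrowing idle capacity beyond its guarantee) is not modeled.

```python
def absolute_capacities(queue_tree, parent_path="root", parent_share=100.0):
    """Flatten a nested queue tree into {queue path: guaranteed % of cluster}.

    Each queue's "capacity" is a percentage of its parent's share, so the
    absolute share is the product of the percentages along the path.
    """
    result = {}
    for name, spec in queue_tree.items():
        path = f"{parent_path}.{name}"
        share = parent_share * spec["capacity"] / 100.0
        result[path] = share
        # Recurse into child queues, if any.
        result.update(absolute_capacities(spec.get("children", {}), path, share))
    return result


# Hypothetical layout: an ML queue split between two teams, plus an ETL queue.
QUEUES = {
    "ml": {
        "capacity": 40,
        "children": {
            "team_a": {"capacity": 50},
            "team_b": {"capacity": 50},
        },
    },
    "etl": {"capacity": 60},
}

if __name__ == "__main__":
    for path, share in sorted(absolute_capacities(QUEUES).items()):
        print(f"{path}: {share:.1f}% of cluster")
```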
New and upcoming YARN features useful for ML
• Docker container support productionized in Hadoop 3.x
• YARN Native Service in Hadoop 3.x
• Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
How can we do distributed training on YARN?
• Want to take a program developed on a single machine and run it in distributed mode with little or no modifications
• Want to take advantage of YARN's features
• Some existing open-source solutions we looked at:
○ Kubeflow (Google)
○ TensorFlow on Spark (Yahoo!)
○ Spark Deep Learning (Databricks)
○ TOY: TensorFlow on YARN (Intel)
○ XLearning (Qihoo)
○ Horovod (Uber)
○ YARN Native Service (in Hadoop 3.x)
Kubeflow + Kubernetes
• Kubeflow is an ML toolkit built on Kubernetes
○ Has a rich ecosystem and active community
• Kubernetes is one of the most popular cluster managers
• Challenges in adopting Kubernetes at LinkedIn
○ Large investment in YARN
■ Many clusters of 1000s of nodes (our largest is ~6000)
■ Expertise and tooling for YARN
○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/)
○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens)
○ Lack of hierarchical namespaces
Spark-based solutions
• TensorFlow on Spark (Yahoo!)
• Spark Deep Learning (Databricks)
• Pros
○ Integrates well with native Spark processing
• Cons
○ GPU resource requests not supported until Spark 3.0 (SPARK-20327)
○ No heterogeneous resource support (e.g., more memory + GPUs for workers, less memory + only CPUs for parameter servers)
YARN-native solutions
• TOY: TensorFlow on YARN (Intel)
• XLearning (Qihoo)
• Pros
○ Works with YARN out-of-the-box
• Cons
○ No GPU resource support
Horovod
• Horovod (Uber)
• Wraps existing optimizer to allow synchronous distributed training
• Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet)
• Uses MPI or NCCL for communication
○ Multi-node MPI on YARN requires Docker containers running sshd
YARN Native Service
• YARN Native Service (available in Hadoop 3.x)
• Configure distributed training jobs via XML, YAML, or JSON config file
• Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper
• Relatively new; LinkedIn is still on Hadoop 2.x
Summary of open-source solutions
| Open-source solution | Pros | Cons |
| --- | --- | --- |
| Kubeflow / Kubernetes (Google) | Large marketplace of libraries and plugins; active community | Does not run on Hadoop; may not scale to very large clusters |
| TensorFlow on Spark (Yahoo!); Spark Deep Learning (Databricks) | Integrates with Spark | No GPU resource support until Spark 3.0 (SPARK-20327); no heterogeneous resource support |
| TOY: TensorFlow on YARN (Intel); XLearning (Qihoo) | YARN native, works out-of-the-box | No GPU resource support |
| Horovod (Uber) | Supports synchronous distributed training | MPI on YARN requires Docker |
| YARN Native Service | YARN native | Distributed TensorFlow requires YARN DNS Registry and ZooKeeper |
Building our own solution: TonY
• TonY is a YARN application for running distributed ML jobs
• We started with TensorFlow support (hence TensorFlow on YARN (TonY))
• Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt)
A Comparison of MapReduce, Spark, and TonY
MapReduce
• 2 task types
• Map tasks connected to Reduce tasks
Spark
• 1 task type
• All connected to all
TonY
• N task types
• Heterogeneous connections
[Diagram: three Map tasks and two Reduce tasks (MapReduce); four Spark executors (Spark); Foo, Bar, Baz, and Qux tasks (TonY)]
TonY supports many different models
• Parallel tasks, no communication (diagram: five Scoring tasks)
• Worker + Parameter Server Model (diagram: three Worker tasks, two Parameter server tasks)
• Ring All-Reduce Model (diagram: four Worker tasks)
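For readers unfamiliar with the ring all-reduce pattern, the following is a minimal single-process simulation of it. It assumes each worker holds an equal-length list of numbers (standing in for gradients) and passes chunks only to its right-hand neighbor; `ring_all_reduce` is an illustrative name, not an API from TonY, Horovod, or any framework.

```python
def ring_all_reduce(vectors):
    """Simulate ring all-reduce (sum) over equal-length lists, one per worker.

    Returns one list per worker; after the call every worker holds the
    element-wise sum of all inputs.
    """
    n = len(vectors)
    if n == 0:
        raise ValueError("need at least one worker")
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all workers must hold vectors of the same length")

    data = [list(v) for v in vectors]
    # Split the index range into n contiguous chunks (some may be empty).
    bounds = [(k * length) // n for k in range(n + 1)]

    def chunk(k):
        return range(bounds[k], bounds[k + 1])

    def exchange(chunk_for_worker, combine):
        # Snapshot outgoing chunks first so all sends in a step are simultaneous.
        outgoing = []
        for i in range(n):
            k = chunk_for_worker(i)
            outgoing.append((k, [data[i][j] for j in chunk(k)]))
        for i, (k, values) in enumerate(outgoing):
            dst = (i + 1) % n  # each worker sends to its right-hand neighbor
            for j, v in zip(chunk(k), values):
                data[dst][j] = combine(data[dst][j], v)

    # Phase 1, scatter-reduce: after n-1 steps worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        exchange(lambda i: (i - step) % n, lambda old, new: old + new)

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        exchange(lambda i: (i + 1 - step) % n, lambda old, new: new)

    return data


if __name__ == "__main__":
    print(ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]))
```

Each worker sends and receives only one chunk per step, so per-worker traffic stays roughly constant as workers are added, in contrast to funneling all gradients through parameter servers.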
TonY also supports more exotic setups
• Worker-PS with chief worker and evaluator (diagram: one Chief worker task, three Worker tasks, two Parameter server tasks, one Evaluator task)
• Ring All-Reduce with in-memory distributed hash table (DHT) (diagram: four Worker tasks, three DHT tasks)
TonY supports multiple frameworks
TonY under the hood
[Diagram: TonY Client; YARN ResourceManager; TonY ApplicationMaster; three TonY Task Executors, paired with two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task. Legend: TonY component, TensorFlow component, YARN component, YARN container.]
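The diagram does not show how each task learns the addresses of its peers. One convention an orchestrator could use is TensorFlow's `TF_CONFIG` environment variable, a JSON document with `cluster` and `task` fields. The sketch below assembles such a document; it is an assumption for illustration, the slides do not say that TonY does exactly this, and `build_tf_config` and the host names are hypothetical.

```python
import json


def build_tf_config(cluster, task_type, task_index):
    """Return a TF_CONFIG-style JSON string for one task.

    cluster maps a task type (e.g. "ps", "worker") to a list of "host:port"
    addresses collected from the running tasks.
    """
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index out of range: {task_index}")
    config = {
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    }
    return json.dumps(config, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical addresses, as if reported by three task executors.
    cluster = {
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    }
    print(build_tf_config(cluster, "worker", 1))
```

With this approach every task receives the same `cluster` section and only the `task` section differs, which replaces the hand-written `--job_name` and `--task_index` flags from the early-days example.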
Related YARN changes
• Backport of GPU support to Hadoop 2.x (YARN-8200)
• Support for updating tracking URL (YARN-7974)
○ Contributed to Hadoop 2.x and 3.x
Using TonY
• TonY client lets you easily launch a job with only a few required arguments
java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar \
    com.linkedin.tony.cli.ClusterSubmitter \
    --python_venv=venv.zip \
    --python_binary_path=Python/bin/python \
    --src_dir=src \
    --executes=my_model.py \
    --conf_file=tony-test.xml
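When the submission is driven from a script, the same command can be assembled as an argument list. This is a small sketch using only the class name and flags shown above; `build_submit_command` is a hypothetical helper, and the classpath value is a placeholder for the output of `hadoop classpath`.

```python
import shlex


def build_submit_command(classpath, jar, venv, python_binary, src_dir,
                         script, conf_file):
    """Return the argv list for submitting a job with the TonY client."""
    return [
        "java", "-cp", f"{classpath}:{jar}",
        "com.linkedin.tony.cli.ClusterSubmitter",
        f"--python_venv={venv}",
        f"--python_binary_path={python_binary}",
        f"--src_dir={src_dir}",
        f"--executes={script}",
        f"--conf_file={conf_file}",
    ]


if __name__ == "__main__":
    argv = build_submit_command(
        classpath="/placeholder/hadoop/classpath",  # stand-in for `hadoop classpath`
        jar="tony-cli-0.3.7-all.jar",
        venv="venv.zip",
        python_binary="Python/bin/python",
        src_dir="src",
        script="my_model.py",
        conf_file="tony-test.xml",
    )
    # Print a shell-safe rendering; pass argv to subprocess.run to execute it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```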
Using TonY
• For a list of all configurations, see https://github.com/linkedin/TonY/wiki/TonY-Configurations
• Example configuration file:
<configuration>
  <property>
    <name>tony.worker.instances</name>
    <value>3</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>1</value>
  </property>
  <property>
    <name>tony.ps.instances</name>
    <value>1</value>
  </property>
</configuration>
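A configuration file in this shape can also be generated rather than written by hand. The sketch below uses Python's standard `xml.etree.ElementTree` and only the three property names from the example; `build_tony_config` is a hypothetical helper, not part of TonY.

```python
import xml.etree.ElementTree as ET


def build_tony_config(properties):
    """Render {property name: value} as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tony_config({
        "tony.worker.instances": 3,
        "tony.worker.gpus": 1,
        "tony.ps.instances": 1,
    }))
```

Here the workers request a GPU each while the parameter server does not, which is the kind of per-task-type resource request described earlier as missing from the Spark-based options.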
Using TonY
$ java ... com.linkedin.tony.cli.ClusterSubmitter ...
...
INFO impl.YarnClientImpl: Submitted application application_XXX
INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://...
INFO tony.TonyClient: ResourceManager web address for application: http://...
...
INFO tony.TonyClient: Logs for ps 0 at: http://...
INFO tony.TonyClient: Logs for worker 0 at: http://...
INFO tony.TonyClient: Logs for worker 1 at: http://...
INFO tony.TonyClient: Logs for worker 2 at: http://...
TonY Portal for accessing job events and configs
Using TonY to launch notebooks and tools on demand
• TonY can be used to launch
○ Jupyter notebooks
○ TensorBoard
○ MLflow
○ etc.
• Run any Python virtual environment, PEX, or shiv
• Run any Docker image
TonY is open-source
• Open-source repo: https://github.com/linkedin/tony
○ Contributions welcome!
• OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago)
• LinkedIn engineering blog post: https://bit.ly/2O6L5WD
TonY integrations with other projects
Azkaban workflow scheduler integration
• Azkaban is a workflow scheduler for Hadoop
• Run TonY jobs inside a workflow that includes Spark and other data processing jobs
TonY job tuning recommendations by Dr. Elephant
• Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
Run TonY on Google Cloud Dataproc
• Dataproc lets you run Hadoop and Spark on Google Cloud
• TonY setup script for Dataproc: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony
• TonY on Dataproc blog post: https://bit.ly/2HEYemT
TonY runtime for Hadoop Submarine
• Submarine is a deep learning CLI for Hadoop
• TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0)
TonY on Microsoft Azure HDInsight (coming soon)
• HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka
• TonY integration is coming soon
Demo
• Live demo using TonY Client from CLI
• Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
Future Work
• GPU metrics + tuning suggestions for Dr. Elephant
• Expand TonY Portal to support launching notebooks, visualization, and managing experiments
• TonY CLI + Python library
• TonY support on Azure HDInsight
• TonY support for other ML frameworks, schedulers, and cloud services
Thank you!
Questions?

Scaling Deep Learning on Hadoop at LinkedIn

  • 1. Anthony Hsu Staff Software Engineer Scaling Deep Learning on Hadoop at LinkedIn DataWorks Summit, Washington, D.C., May 23, 2019
  • 2. About Me: Anthony Hsu • https://www.linkedin.com/in/erwaman/ • Staff Software Engineer at LinkedIn working on the Hadoop Dev team • Been working in the Hadoop space for 5.5 years on workflow scheduling (Azkaban), dataset access (Dali), machine learning infra (TonY, this talk)
  • 3. LinkedIn's Vision: Create economic opportunity for every member of the global workforce • 630M members • 30M companies • 20M jobs • 50K skills • 90K schools
  • 4. Machine Learning at LinkedIn • People You May Know • Job Recommendations • News Feed • LinkedIn Learning Recommendations
  • 5. Why Deep Learning? • Prediction accuracy of traditional ML models tends to plateau quickly as data increases • Deep networks continue to improve as data increases (Figure source: Building AI Applications Using Deep Learning, https://blog.easysol.net/building-ai-applications/)
  • 6. Which framework to use? (Andrej Karpathy, Director of AI at Tesla: https://twitter.com/karpathy/status/972295865187512320)
  • 7. Machine Learning process • ML process has many parts: Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving
  • 8. Machine Learning process • ML process has many parts: Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline.
  • 9. Machine Learning process • ML process has many parts: Data Ingestion, Data Preparation, Model Training, Model Deployment, Model Serving • At LinkedIn, we have a Productive ML (Pro-ML) initiative to accelerate this loop. We have teams working on every part of the ML pipeline. • This talk will focus on model training.
  • 10. Early days: how AI engineers did training • Copy code and dependencies to each host • Manually specify host and port of each process • Customize arguments for each process
      # On ps0.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=0
      # On ps1.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=ps --task_index=1
      # On worker0.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=0
      # On worker1.example.com:
      $ python trainer.py --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 --job_name=worker --task_index=1
      Source: https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
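
The four commands on slide 10 differ only in `--job_name` and `--task_index`, which is what makes hand-launching error-prone. As a rough sketch (not part of the talk), the per-host commands could be generated from a single cluster description. The function name `build_launch_commands` and the dictionary layout are assumptions made for illustration; the flags and host names are the ones shown on the slide.

```python
from typing import Dict, List


def build_launch_commands(cluster: Dict[str, List[str]],
                          script: str = "trainer.py") -> Dict[str, str]:
    """Return {host: command} for every process listed in the cluster.

    `cluster` maps a job name ("ps" or "worker") to its host:port addresses.
    This helper is hypothetical; it only reproduces the slide's commands.
    """
    ps_hosts = ",".join(cluster["ps"])
    worker_hosts = ",".join(cluster["worker"])
    commands = {}
    for job_name, addresses in cluster.items():
        for task_index, address in enumerate(addresses):
            host = address.split(":")[0]
            commands[host] = (
                f"python {script} --ps_hosts={ps_hosts} "
                f"--worker_hosts={worker_hosts} "
                f"--job_name={job_name} --task_index={task_index}"
            )
    return commands


cluster = {
    "ps": ["ps0.example.com:2222", "ps1.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}
commands = build_launch_commands(cluster)
for host, command in commands.items():
    print(f"# On {host}:\n$ {command}")
```

Even with such a generator, someone still has to copy code to each host and pick free ports by hand, which is the gap the rest of the talk addresses.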
  • 11. Challenges of scaling up training • Managing code and dependencies • Orchestrating distributed training • Resource contention (especially for GPUs) • Managing an ML workflow (data preparation, training, deployment) • Fault tolerance • Example GPU out-of-memory error: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 693.00M (726663168 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
  • 12. Existing YARN features to leverage • YARN is Hadoop's scheduler
  • 13. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types
  • 14. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues
  • 15. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues
  • 16. Existing YARN features to leverage • YARN is Hadoop's scheduler • YARN supports ○ GPU resources and other resource types ○ Team-based and hierarchical queues ○ Elasticity between queues ○ User-based limits
  • 17. New and upcoming YARN features useful for ML • Docker container support productionized in Hadoop 3.x • YARN Native Service in Hadoop 3.x • Submarine ML CLI released in Hadoop 3.2.0, now its own Hadoop subproject
  • 18. How can we do distributed training on YARN? • Want to take a program developed on a single machine and run it in distributed mode with little or no modification • Want to take advantage of YARN's features • Some existing open-source solutions we looked at: ○ Kubeflow (Google) ○ TensorFlow on Spark (Yahoo!) ○ Spark Deep Learning (Databricks) ○ TOY: TensorFlow on YARN (Intel) ○ XLearning (Qihoo) ○ Horovod (Uber) ○ YARN Native Service (in Hadoop 3.x)
  • 19. Kubeflow + Kubernetes • Kubeflow is an ML toolkit built on Kubernetes ○ Has a rich ecosystem and active community • Kubernetes is one of the most popular cluster managers • Challenges in adopting Kubernetes at LinkedIn ○ Large investment in YARN ■ Many clusters of 1000s of nodes (our largest is ~6000) ■ Expertise and tooling for YARN ○ Scalability: "No more than 5000 nodes" (https://kubernetes.io/docs/setup/cluster-large/) ○ Need to integrate with Hadoop security (Kerberos and Hadoop delegation tokens) ○ Lack of hierarchical namespaces
  • 20. Spark-based solutions • TensorFlow on Spark (Yahoo!) • Spark Deep Learning (Databricks) • Pros ○ Integrates well with native Spark processing • Cons ○ GPU resource requests not supported until Spark 3.0 (SPARK-20327) ○ No heterogeneous resource support (e.g., more memory + GPUs for workers, less memory + only CPUs for parameter servers)
  • 21. YARN-native solutions • TOY: TensorFlow on YARN (Intel) • XLearning (Qihoo) • Pros ○ Works with YARN out-of-the-box • Cons ○ No GPU resource support
  • 22. Horovod • Horovod (Uber) • Wraps existing optimizer to allow synchronous distributed training • Works with many frameworks (TensorFlow, PyTorch, Keras, MXNet) • Uses MPI or NCCL for communication ○ Multi-node MPI on YARN requires Docker containers running sshd daemons
  • 23. YARN Native Service • YARN Native Service (available in Hadoop 3.x) • Configure distributed training jobs via XML, YAML, or JSON config file • Distributed TensorFlow requires deploying YARN DNS Registry and ZooKeeper • Relatively new; LinkedIn is still on Hadoop 2.x
  • 24. Summary of open-source solutions
      ○ Kubeflow / Kubernetes (Google). Pros: large marketplace of libraries and plugins; active community. Cons: does not run on Hadoop; may not scale to very large clusters.
      ○ TensorFlow on Spark (Yahoo!) and Spark Deep Learning (Databricks). Pros: integrates with Spark. Cons: no GPU resource support until Spark 3.0 (SPARK-20327); no heterogeneous resource support.
      ○ TOY: TensorFlow on YARN (Intel) and XLearning (Qihoo). Pros: YARN native, works out-of-the-box. Cons: no GPU resource support.
      ○ Horovod (Uber). Pros: supports synchronous distributed training. Cons: MPI on YARN requires Docker.
      ○ YARN Native Service. Pros: YARN native. Cons: distributed TensorFlow requires YARN DNS Registry and ZooKeeper.
  • 25. Building our own solution: TonY • TonY is a YARN application for running distributed ML jobs • We started with TensorFlow support (hence TensorFlow on YARN (TonY)) • Now we also support PyTorch and Horovod (so perhaps Things on YARN is more apt)
  • 26. A Comparison of MapReduce, Spark, and TonY • MapReduce: 2 task types; Map tasks connected to Reduce tasks • Spark: 1 task type; all connected to all • TonY: N task types; heterogeneous connections (Diagram: Map and Reduce tasks; Spark executors; Foo, Bar, Baz, and Qux tasks)
  • 27. TonY supports many different models • Parallel tasks, no communication (diagram: five scoring tasks) • Worker + Parameter Server Model (diagram: three worker tasks, two parameter server tasks) • Ring All-Reduce Model (diagram: four worker tasks)
  • 28. TonY also supports more exotic setups • Worker-PS with chief worker and evaluator (diagram: three worker tasks, two parameter server tasks, a chief worker task, an evaluator task) • Ring All-Reduce with in-memory distributed hash table (DHT) (diagram: four worker tasks, three DHT tasks)
  • 29. TonY supports multiple frameworks
  • 30. TonY under the hood
  • 31. TonY under the hood (Diagram: TonY Client, YARN ResourceManager. Legend: TonY component, YARN component)
  • 32. TonY under the hood (Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster. Legend: TonY component, YARN component, YARN container)
  • 33. TonY under the hood (Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors. Legend: TonY component, YARN component, YARN container)
  • 34. TonY under the hood (Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors running two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task. Legend: TonY component, TensorFlow component, YARN component, YARN container)
  • 35. TonY under the hood (Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors running two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task. Legend: TonY component, TensorFlow component, YARN component, YARN container)
  • 36. TonY under the hood (Diagram: TonY Client, YARN ResourceManager, TonY ApplicationMaster, three TonY Task Executors running two TensorFlow Worker Tasks and one TensorFlow Parameter Server Task. Legend: TonY component, TensorFlow component, YARN component, YARN container)
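
The diagrams do not spell out how each TensorFlow task learns the addresses of its peers. One common convention in distributed TensorFlow is the `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The sketch below assumes a task executor would export something of that shape before starting the Python process; the helper name `make_tf_config` and the host names are hypothetical, and this is not taken from TonY's source.

```python
import json
from typing import Dict, List


def make_tf_config(cluster: Dict[str, List[str]],
                   task_type: str, task_index: int) -> str:
    """Build a TF_CONFIG-style JSON string for one task (illustrative only)."""
    if task_type not in cluster:
        raise ValueError(f"unknown task type: {task_type}")
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError(f"task index out of range: {task_index}")
    config = {
        "cluster": cluster,  # every task sees the same host:port lists
        "task": {"type": task_type, "index": task_index},  # who "I" am
    }
    return json.dumps(config, sort_keys=True)


# Hypothetical addresses, as they might look once containers are allocated.
cluster = {
    "ps": ["host-a:2222"],
    "worker": ["host-b:2222", "host-c:2222"],
}
env_for_worker_1 = {"TF_CONFIG": make_tf_config(cluster, "worker", 1)}
print(env_for_worker_1["TF_CONFIG"])
```

The point of the sketch is that the cluster description is assembled once and handed to every task, so the training script needs no hand-written host lists.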
  • 38. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200)
  • 39. Related YARN changes • Backport of GPU support to Hadoop 2.x (YARN-8200) • Support for updating tracking URL (YARN-7974) ○ Contributed to Hadoop 2.x and 3.x
  • 40. Using TonY • TonY client lets you easily launch a job with only a few required arguments
      java -cp `hadoop classpath`:tony-cli-0.3.7-all.jar com.linkedin.tony.cli.ClusterSubmitter \
        --python_venv=venv.zip \
        --python_binary_path=Python/bin/python \
        --src_dir=src \
        --executes=my_model.py \
        --conf_file=tony-test.xml
  • 41. Using TonY • Example configuration file:
      <configuration>
        <property>
          <name>tony.worker.instances</name>
          <value>3</value>
        </property>
        <property>
          <name>tony.worker.gpus</name>
          <value>1</value>
        </property>
        <property>
          <name>tony.ps.instances</name>
          <value>1</value>
        </property>
      </configuration>
    • For a list of all configurations, see https://github.com/linkedin/TonY/wiki/TonY-Configurations
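
To make the shape of this configuration concrete, here is a small sketch that reads the slide's XML and summarizes what each task type asks for. It assumes property names follow a `tony.<task type>.<setting>` pattern and that `gpus` is counted per instance; both are inferences from the three properties on the slide, and `summarize_tasks` is a hypothetical helper, not part of TonY.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The example configuration from the slide, inlined so the sketch is self-contained.
CONFIG_XML = """<configuration>
  <property><name>tony.worker.instances</name><value>3</value></property>
  <property><name>tony.worker.gpus</name><value>1</value></property>
  <property><name>tony.ps.instances</name><value>1</value></property>
</configuration>"""


def summarize_tasks(xml_text):
    """Group integer 'instances' and 'gpus' settings by task type."""
    tasks = defaultdict(dict)
    for prop in ET.fromstring(xml_text).findall("property"):
        name = prop.findtext("name", default="").strip()
        value = prop.findtext("value", default="").strip()
        parts = name.split(".")
        # Only handle the two settings seen on the slide; ignore anything else.
        if len(parts) == 3 and parts[0] == "tony" and parts[2] in ("instances", "gpus"):
            tasks[parts[1]][parts[2]] = int(value)
    return dict(tasks)


summary = summarize_tasks(CONFIG_XML)
# Assumption: 'gpus' is a per-instance request, so the total is instances * gpus.
total_gpus = sum(t.get("instances", 0) * t.get("gpus", 0) for t in summary.values())
print(summary, "total GPUs requested:", total_gpus)
```

The summary shows the heterogeneous request that the Spark-based options could not express: GPU-backed workers next to a CPU-only parameter server.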
  • 42. Using TonY
      $ java ... com.linkedin.tony.cli.ClusterSubmitter ...
      ...
      INFO impl.YarnClientImpl: Submitted application application_XXX
      INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://...
      INFO tony.TonyClient: ResourceManager web address for application: http://...
      ...
      INFO tony.TonyClient: Logs for ps 0 at: http://...
      INFO tony.TonyClient: Logs for worker 0 at: http://...
      INFO tony.TonyClient: Logs for worker 1 at: http://...
      INFO tony.TonyClient: Logs for worker 2 at: http://...
  • 43. TonY Portal for accessing job events and configs
  • 44. Using TonY to launch notebooks and tools on demand • TonY can be used to launch ○ Jupyter notebooks ○ TensorBoard ○ MLflow ○ etc. • Run any Python virtual environment, PEX, or shiv • Run any Docker image
  • 45. TonY is open-source • Open-source repo: https://github.com/linkedin/tony ○ Contributions welcome! • OpML '19 paper: https://arxiv.org/abs/1904.01631 (presented 3 days ago) • LinkedIn engineering blog post: https://bit.ly/2O6L5WD
  • 46. TonY integrations with other projects
  • 47. Azkaban workflow scheduler integration • Azkaban is a workflow scheduler for Hadoop • Run TonY jobs inside a workflow that includes Spark and other data processing jobs
  • 48. TonY job tuning recommendations by Dr. Elephant • Dr. Elephant is a job tuning and performance analysis tool for Hadoop jobs.
  • 49. Run TonY on Google Cloud DataProc • DataProc lets you run Hadoop and Spark on Google's Cloud • TonY setup script for DataProc: https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/tree/master/tony • TonY on DataProc blog post: https://bit.ly/2HEYemT
  • 50. TonY runtime for Hadoop Submarine • Submarine is a deep learning CLI for Hadoop • TonY is a supported runtime implementation for Submarine (SUBMARINE-40, in Submarine 0.2.0)
  • 51. TonY on Microsoft Azure HDInsight (coming soon) • HDInsight lets you run open-source frameworks on Azure, including Hadoop, Spark, and Kafka • TonY integration is coming soon
  • 52. Demo • Live demo using TonY Client from CLI • Video of using TonY job in Azkaban: https://youtu.be/DM89y8BGFaY
  • 53. Future Work • GPU metrics + tuning suggestions for Dr. Elephant • Expand TonY Portal to support launching notebooks, visualization, and managing experiments • TonY CLI + Python library • TonY support on Azure HDInsight • TonY support for other ML frameworks, schedulers, and cloud services