Taming Your Deep Learning Workflow by Determined AI
1. Taming the Deep Learning Workflow
Neil Conway
CTO, Determined AI
January 24, 2019
2. Today’s Reality
Deep Learning is very difficult. DL requires finding scarce talent and making a major investment in a high-performance GPU cluster. Even so, most organizations struggle: time-to-market for DL applications is often measured in years!
4. Wait, what about TensorFlow?
TensorFlow is great! (So is Keras, PyTorch, etc.) However, these tools are focused on solving the problems of 1 researcher, training 1 model, using ~1 GPU.
5. Training A Single Model
(Diagram: training a single model sits at the center of many other concerns)
• Hyperparameter tuning, architecture search
• GPU cluster scheduling
• Metrics collection and storage
• Model management
• Training data management
• Collaboration
• Deployment
• Operations and monitoring
• Data augmentation
• Data prep and ETL
• Parallel and distributed training
6. What Are Your Options?
• For some of these problems: no OSS solutions.
• For others: narrow technical tools. Up to you to figure out how to put them together!
Result: highly trained DL researchers spend most of their time on drudgery!
7. We’re in the Golden Age of Deep Learning, but Deep Learning infrastructure is still stuck in the Dark Ages!
8. What Do We Need?
• End-to-end system design, not narrow technical tools
• Driven by a deep understanding of real-world DL
workflows
• New APIs, new abstractions, and new platforms!
11. Hyperparameter Tuning
Search over a space of similar models to find the “best” model configuration: a hard problem, compounded by DL-specific challenges:
• Large, complex HP spaces are common (e.g., optimization method, batch size, LR, model architecture, etc.)
• Evaluating a single HP configuration can take 10-100+ GPU hours!
12. HP Tuning Today
• Intuition: pick a few points and try them out manually.
• Grid search: exhaustive search over all points in a grid.
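Grid search is easy to state precisely. A minimal sketch, using a hypothetical search space (the parameter names and values below are illustrative assumptions, not from the talk):

```python
import itertools

# Hypothetical search space; names and values are illustrative assumptions.
space = {
    "optimizer": ["sgd", "adam"],
    "batch_size": [32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}

# Grid search: exhaustively enumerate every combination of values.
grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]
print(len(grid))  # 2 * 3 * 3 = 18 configurations
```

At 10-100+ GPU hours per configuration (slide 11), even this tiny grid costs hundreds to thousands of GPU hours, and realistic spaces are far larger; that cost is why smarter searching matters.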
13. Step 1: Smarter Searching
• Lots of academic research on HP tuning algorithms
• Recent work: Hyperband [ICLR 2017]
• Intuition: spend more compute time on “promising” configurations, give up on “bad” configurations quickly
• 5-50x faster than prior methods!
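Hyperband is built around successive halving. A minimal sketch of that subroutine, assuming a hypothetical `evaluate(config, budget)` function that trains a configuration for a given compute budget and returns a validation loss:

```python
def successive_halving(configs, evaluate, budget=1, eta=3):
    """One bracket of successive halving, the subroutine Hyperband builds on.

    `evaluate(config, budget)` is assumed to train `config` for `budget`
    units of compute and return a validation loss (lower is better).
    """
    while len(configs) > 1:
        # Give every surviving configuration a little more compute.
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        # Keep the most promising 1/eta fraction; give up on the rest quickly.
        configs = scored[: max(1, len(scored) // eta)]
        budget *= eta  # survivors earn exponentially more compute
    return configs[0]

# Toy demo: pretend a config's loss is its distance from lr = 0.01,
# regardless of budget (a real evaluate() would actually train the model).
configs = [{"lr": lr} for lr in [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.03, 0.005, 0.02, 0.09]]
best = successive_halving(configs, lambda c, b: abs(c["lr"] - 0.01))
print(best)  # {'lr': 0.01}
```

The speedup over grid search comes from spending most of the compute budget on the small surviving fraction of configurations rather than on every point in the grid.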
16. Step 2: Scheduler Integration
• What if the job scheduler were deeply integrated with the HP search algorithm?
• Smarter scheduling
• Intelligent fault tolerance and task migration
• More efficient caching
• Aside: Much more efficient than distributed training of a single model!
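The fault-tolerance and task-migration point can be sketched as follows. Because an integrated platform owns the training loop, it can checkpoint a trial and resume it on another machine after a failure; the `Trial` class and `run_with_migration` function below are hypothetical names for illustration, not Determined's API:

```python
# All names here (Trial, run_with_migration) are illustrative assumptions.

class Trial:
    def __init__(self, trial_id, step=0):
        self.trial_id = trial_id
        self.step = step

    def checkpoint(self):
        return {"trial_id": self.trial_id, "step": self.step}

    @classmethod
    def restore(cls, ckpt):
        return cls(ckpt["trial_id"], ckpt["step"])

    def train(self, steps):
        self.step += steps  # stand-in for real gradient updates


def run_with_migration(trial, total_steps, fail_at):
    """Run a trial to completion, surviving one simulated machine failure."""
    ckpt = trial.checkpoint()
    failed = False
    while trial.step < total_steps:
        trial.train(1)
        if trial.step == fail_at and not failed:
            failed = True
            trial = Trial.restore(ckpt)  # resume from the last checkpoint
            continue                     # on a "new machine"
        ckpt = trial.checkpoint()        # periodic checkpointing
    return trial.step

print(run_with_migration(Trial("t1"), total_steps=10, fail_at=5))  # 10
```

Without this integration, a failed machine typically means restarting that trial (or the whole search) from scratch.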
17. Step 3: Metadata Storage
• A single HP search might involve thousands of tasks on hundreds of machines, and run for days or weeks
• Result: lots of crucial metadata!
• Training and validation metrics
• Hyperparameter settings
• Library versions, random seeds, logs, etc.
• Where does this data live? How can your teammates make use of it?
• What happens when you want to replace the production model 9 months later?
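One answer to "where does this data live?" is a shared, queryable store rather than scattered notebooks and home directories. A minimal sketch using SQLite; the schema and experiment names are illustrative assumptions:

```python
import json
import sqlite3

# Illustrative schema for centralizing trial metadata.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE trials (
        experiment   TEXT,
        hparams      TEXT,    -- JSON-encoded hyperparameter settings
        val_accuracy REAL,
        seed         INTEGER,
        lib_versions TEXT     -- library versions needed to reproduce the run
    )
""")

db.execute(
    "INSERT INTO trials VALUES (?, ?, ?, ?, ?)",
    ("resnet-v2", json.dumps({"lr": 0.01, "batch_size": 64}), 0.93, 42,
     json.dumps({"tensorflow": "1.12.0"})),
)

# Nine months later, a teammate can still find the run behind the
# production model and reproduce it (hyperparameters + seed + versions).
row = db.execute(
    "SELECT hparams, seed FROM trials WHERE experiment = ? "
    "ORDER BY val_accuracy DESC LIMIT 1",
    ("resnet-v2",),
).fetchone()
print(json.loads(row[0]), row[1])  # {'lr': 0.01, 'batch_size': 64} 42
```

The design point is less the storage engine than the contract: every trial's metrics, hyperparameters, seeds, and versions land in one place that the whole team can query.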
18. Takeaways
1. Progress on deep learning is held back by the current state of DL infrastructure.
2. End-to-end system design can yield massive performance and usability wins.
3. What are the key high-level DL workflows we need infra to support? What are the right APIs and abstractions for doing so?
https://determined.ai