Taming Your Deep Learning Workflow by Determined AI
1. Taming the Deep Learning Workflow
Neil Conway
CTO, Determined AI
January 24, 2019
2. Today’s Reality
Deep Learning is very difficult. DL requires finding scarce talent and making a major investment in a high-performance GPU cluster. Even so, most organizations struggle: time-to-market for DL applications is often measured in years!
4. Wait, what about TensorFlow?
TensorFlow is great! (So is Keras, PyTorch, etc.) However, these tools are focused on solving the problems of 1 researcher, training 1 model, using ~1 GPU.
5. Training A Single Model
(Diagram: training a single model sits at the center of many other concerns)
• Hyperparameter tuning, architecture search
• GPU cluster scheduling
• Metrics collection and storage
• Model management
• Training data management
• Collaboration
• Deployment
• Operations and monitoring
• Data augmentation
• Data prep and ETL
• Parallel and distributed training
6. What Are Your Options?
• For some of these problems: no OSS solutions.
• For others: narrow technical tools. Up to you to figure out how to put them together!
Result: highly trained DL researchers spend most of their time on drudgery!
7. We’re in the Golden Age of Deep Learning, but Deep Learning infrastructure is still stuck in the Dark Ages!
8. What Do We Need?
• End-to-end system design, not narrow technical tools
• Driven by a deep understanding of real-world DL
workflows
• New APIs, new abstractions, and new platforms!
11. Hyperparameter Tuning
Search over a space of similar models to find the “best” model configuration: a hard problem, compounded by DL-specific challenges:
• Large, complex HP spaces are common (e.g., optimization method, batch size, LR, model architecture, etc.)
• Evaluating a single HP configuration can take 10-100+ GPU hours!
12. HP Tuning Today
• Intuition: pick a few points and try them out manually.
• Grid search: exhaustive search over all points in a grid.
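Grid search is easy to state precisely. A minimal sketch, using a hypothetical search space (the parameter names and values below are illustrative assumptions, not from the talk):

```python
import itertools

# Hypothetical search space; names and values are illustrative assumptions.
space = {
    "optimizer": ["sgd", "adam"],
    "batch_size": [32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}

# Grid search: exhaustively enumerate every combination of values.
grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]
print(len(grid))  # 2 * 3 * 3 = 18 configurations
```

At 10-100+ GPU hours per configuration (slide 11), even this tiny grid costs hundreds to thousands of GPU hours, and realistic spaces are far larger; that cost is why smarter searching matters.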
13. Step 1: Smarter Searching
• Lots of academic research on HP tuning algorithms
• Recent work: Hyperband [ICLR 2017]
• Intuition: spend more compute time on “promising” configurations, give up on “bad” configurations quickly
• 5-50x faster than prior methods!
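Hyperband is built around successive halving. A minimal sketch of that subroutine, assuming a hypothetical `evaluate(config, budget)` function that trains a configuration for a given compute budget and returns a validation loss:

```python
def successive_halving(configs, evaluate, budget=1, eta=3):
    """One bracket of successive halving, the subroutine Hyperband builds on.

    `evaluate(config, budget)` is assumed to train `config` for `budget`
    units of compute and return a validation loss (lower is better).
    """
    while len(configs) > 1:
        # Give every surviving configuration a little more compute.
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        # Keep the most promising 1/eta fraction; give up on the rest quickly.
        configs = scored[: max(1, len(scored) // eta)]
        budget *= eta  # survivors earn exponentially more compute
    return configs[0]

# Toy demo: pretend a config's loss is its distance from lr = 0.01,
# regardless of budget (a real evaluate() would actually train the model).
configs = [{"lr": lr} for lr in [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.03, 0.005, 0.02, 0.09]]
best = successive_halving(configs, lambda c, b: abs(c["lr"] - 0.01))
print(best)  # {'lr': 0.01}
```

The speedup over grid search comes from spending most of the compute budget on the small surviving fraction of configurations rather than on every point in the grid.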
16. Step 2: Scheduler Integration
• What if the job scheduler were deeply integrated with the HP search algorithm?
• Smarter scheduling
• Intelligent fault tolerance and task migration
• More efficient caching
• Aside: Much more efficient than distributed training of a single model!
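The fault-tolerance and task-migration point can be sketched as follows. Because an integrated platform owns the training loop, it can checkpoint a trial and resume it on another machine after a failure; the `Trial` class and `run_with_migration` function below are hypothetical names for illustration, not Determined's API:

```python
# All names here (Trial, run_with_migration) are illustrative assumptions.

class Trial:
    def __init__(self, trial_id, step=0):
        self.trial_id = trial_id
        self.step = step

    def checkpoint(self):
        return {"trial_id": self.trial_id, "step": self.step}

    @classmethod
    def restore(cls, ckpt):
        return cls(ckpt["trial_id"], ckpt["step"])

    def train(self, steps):
        self.step += steps  # stand-in for real gradient updates


def run_with_migration(trial, total_steps, fail_at):
    """Run a trial to completion, surviving one simulated machine failure."""
    ckpt = trial.checkpoint()
    failed = False
    while trial.step < total_steps:
        trial.train(1)
        if trial.step == fail_at and not failed:
            failed = True
            trial = Trial.restore(ckpt)  # resume from the last checkpoint
            continue                     # on a "new machine"
        ckpt = trial.checkpoint()        # periodic checkpointing
    return trial.step

print(run_with_migration(Trial("t1"), total_steps=10, fail_at=5))  # 10
```

Without this integration, a failed machine typically means restarting that trial (or the whole search) from scratch.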
17. Step 3: Metadata Storage
• A single HP search might involve thousands of tasks on hundreds of machines, and run for days or weeks
• Result: lots of crucial metadata!
• Training and validation metrics
• Hyperparameter settings
• Library versions, random seeds, logs, etc.
• Where does this data live? How can your teammates make use of it?
• What happens when you want to replace the production model 9 months later?
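One answer to "where does this data live?" is a shared, queryable store rather than scattered notebooks and home directories. A minimal sketch using SQLite; the schema and experiment names are illustrative assumptions:

```python
import json
import sqlite3

# Illustrative schema for centralizing trial metadata.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE trials (
        experiment   TEXT,
        hparams      TEXT,    -- JSON-encoded hyperparameter settings
        val_accuracy REAL,
        seed         INTEGER,
        lib_versions TEXT     -- library versions needed to reproduce the run
    )
""")

db.execute(
    "INSERT INTO trials VALUES (?, ?, ?, ?, ?)",
    ("resnet-v2", json.dumps({"lr": 0.01, "batch_size": 64}), 0.93, 42,
     json.dumps({"tensorflow": "1.12.0"})),
)

# Nine months later, a teammate can still find the run behind the
# production model and reproduce it (hyperparameters + seed + versions).
row = db.execute(
    "SELECT hparams, seed FROM trials WHERE experiment = ? "
    "ORDER BY val_accuracy DESC LIMIT 1",
    ("resnet-v2",),
).fetchone()
print(json.loads(row[0]), row[1])  # {'lr': 0.01, 'batch_size': 64} 42
```

The design point is less the storage engine than the contract: every trial's metrics, hyperparameters, seeds, and versions land in one place that the whole team can query.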
18. Takeaways
1. Progress on deep learning is held back by the current state of DL infrastructure.
2. End-to-end system design can yield massive performance and usability wins.
3. What are the key high-level DL workflows we need infra to support? What are the right APIs and abstractions for doing so?
https://determined.ai