This document provides an overview of deep learning and its applications. It discusses how deep learning can be used for image classification and how neural networks learn hierarchical representations from data. It highlights some of the challenges of deep learning, such as the large amounts of data and computation required, and covers how deep learning models can be deployed in production, using Dato Predictive Services on AWS infrastructure, to ensure low latency, high availability, and continuous learning.
8. What can a linear classifier represent?
x1 OR x2: y = threshold(1·x1 + 1·x2 − 0.5)
x1 AND x2: y = threshold(1·x1 + 1·x2 − 1.5)
[Diagram: two single-unit networks over inputs x1, x2, and a constant 1; the OR unit uses weights 1, 1 with bias −0.5, the AND unit uses weights 1, 1 with bias −1.5]
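To make this concrete, here is a minimal Python sketch (not from the slides) of these two threshold units; the function names are illustrative:

def step(v):
    # Threshold to 0 or 1
    return 1 if v > 0 else 0

def linear_or(x1, x2):
    # Weights 1, 1 and bias -0.5, as in the diagram
    return step(1 * x1 + 1 * x2 - 0.5)

def linear_and(x1, x2):
    # Weights 1, 1 and bias -1.5
    return step(1 * x1 + 1 * x2 - 1.5)

# Check all four input combinations
for a in (0, 1):
    for b in (0, 1):
        print(a, b, linear_or(a, b), linear_and(a, b))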
9. What can't a simple linear classifier represent?
XOR: the counterexample to everything.
Need non-linear features.
10. Solving the XOR problem: adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
z1 = threshold(1·x1 − 1·x2 − 0.5)
z2 = threshold(−1·x1 + 1·x2 − 0.5)
y = threshold(1·z1 + 1·z2 − 0.5)
All units are thresholded to 0 or 1.
[Diagram: a two-layer network over inputs x1, x2, and a constant 1, with hidden units z1, z2 feeding output y]
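Stacking the same threshold units gives the XOR network above; a minimal Python sketch (illustrative, not from the slides):

def step(v):
    # Threshold to 0 or 1
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    z1 = step(1 * x1 - 1 * x2 - 0.5)    # x1 AND NOT x2
    z2 = step(-1 * x1 + 1 * x2 - 0.5)   # NOT x1 AND x2
    return step(1 * z1 + 1 * z2 - 0.5)  # z1 OR z2

# Verify all four input combinations
assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]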
11. A neural network
• Layers and layers of linear models and non-linear transformations
• Around for about 50 years
• Big resurgence in the last few years
- Impressive accuracy on several benchmark problems
- Advances in hardware make the computation feasible (e.g., AWS g2 instances)
[Diagram: a two-layer network with inputs x1, x2, hidden units z1, z2, bias units, and output y]
13. Feature detection: the traditional approach
• Features = local detectors
- Combined to make a prediction
- (In reality, features are more low-level)
[Illustration: eye, eye, nose, and mouth detectors combined to conclude "Face!"]
14. Many hand-created features exist for finding interest points…
• SIFT [Lowe '99]
• Spin Images [Johnson & Hebert '99]
• Textons [Malik et al. '99]
• RIFT [Lazebnik '04]
• GLOH [Mikolajczyk & Schmid '05]
• HoG [Dalal & Triggs '05]
• …
16. Many hand-created features (SIFT, Spin Images, Textons, RIFT, GLOH, HoG, …) exist for finding interest points…
… but they are very painful to design.
17. Deep learning implicitly learns features
[Figure: detectors learned at Layer 1, Layer 2, and Layer 3, the final prediction, and example interest points detected; from Zeiler & Fergus '13]
19. Deep learning accuracy
• German traffic sign recognition benchmark: 99.5% accuracy (IDSIA team)
• House number recognition: 97.8% accuracy per character [Goodfellow et al. '13]
20. ImageNet 2012 competition: 1.2M training images, 1000 categories
[Bar chart: error (best of 5 guesses) for the top 3 teams, SuperVision, ISI, and OXFORD_VGG; SuperVision shows a huge gain over the others, which exploited hand-coded features like SIFT]
21. ImageNet 2012 competition: 1.2M training images, 1000 categories
Winning entry: SuperVision, 8 layers, 60M parameters [Krizhevsky et al. '12]
Achieving these amazing results required:
• New learning algorithms
• GPU implementation
22. Deep learning performance
• ImageNet: 1.2M images
[Bar chart: running time in hours on AWS g2.xlarge vs. g2.8xlarge instances]
27. Designed a simple user interface

import graphlab

# Train the model
model = graphlab.neuralnet.create(train_images)

# Predict classes for new images
outcome = model.predict(test_images)
30. Deep learning score card
Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains
- Computer vision
- Speech recognition
- Some text analysis
• Potential for more impact
32. Many tricks needed to work well…
Different types of layers, connections, … are needed for high accuracy [Krizhevsky et al. '12]
33. Deep learning score card
Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains
- Computer vision
- Speech recognition
- Some text analysis
• Potential for more impact
Cons
• Requires a lot of data for high accuracy
• Computationally really expensive
• Extremely hard to tune
- Choice of architecture
- Parameter types
- Hyperparameters
- Learning algorithm
- …
Computational cost + so many choices = incredibly hard to tune
35. Standard image classification approach
Input → extract hand-created features → use a simple classifier (e.g., logistic regression, SVMs) → Face?
Can we learn features from data, even when we don't have much data or time?
36. What's learned in a neural net
Neural net trained for Task 1: cat vs. dog
• Early layers: more generic; can be used as a feature extractor
• Later layers: very specific to Task 1; should be ignored for other tasks
37. Transfer learning in more detail…
Neural net trained for Task 1: cat vs. dog
• Early, more generic layers: keep the weights fixed and use them as a feature extractor
• Later, Task 1-specific layers: ignore them
For Task 2 (predicting 101 categories), learn only the end part of the neural net: a simple classifier (e.g., logistic regression, SVMs, nearest neighbor, …) that predicts the class.
38. Careful where you cut: latter layers may be too task-specific
[Figure: Layer 1, Layer 2, and Layer 3 detectors and example interest points, from Zeiler & Fergus '13; the latter layers are too specific for the new task, so use the earlier ones!]
39. Transfer learning with deep features workflow
Some labeled data → split into a training set and a validation set
Training set → extract features with a neural net trained on a different task → learn a simple classifier
Validation set → validate
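A minimal sketch of this workflow in GraphLab-style Python. The file names, column names, and the pretrained model's extract_features call are assumptions for illustration, not the exact Dato API:

import graphlab as gl

# Some labeled data, split into training and validation sets
data = gl.SFrame.read_csv('my_labeled_images.csv')   # hypothetical file
train, valid = data.random_split(0.8)

# Extract features with a neural net trained on a different task
# (a pretrained model exposing an extract_features method is assumed)
pretrained = gl.load_model('pretrained_imagenet_model')   # hypothetical path
train['deep_features'] = pretrained.extract_features(train)
valid['deep_features'] = pretrained.extract_features(valid)

# Learn a simple classifier on top of the fixed deep features
clf = gl.logistic_classifier.create(train,
                                    features=['deep_features'],
                                    target='label')

# Validate
print(clf.evaluate(valid))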
44. How to use deep learning in production?
• Predictive: understands input and takes actions or makes decisions
• Interactive: responds in real time
• Learning: improves its performance with experience
47. Essential ingredients of an intelligent service
• Responsive: intelligent applications are interactive; they need low latency, high throughput & high availability
• Adaptive: ML models are out of date the moment learning is done; we need to constantly understand & improve end-to-end performance
• Manageable: many thousands of models, created by hundreds of people, need versioning, attribution, provenance & reproducibility
48. Responsive: Now and Always
(Highlighting the Responsive ingredient from slide 47: intelligent applications are interactive and need low latency, high throughput & high availability.)
50. Challenge: Scoring Latency
Compute predictions in < 20 ms for complex models and queries, all while under heavy query load.
[Diagram: models scoring incoming queries, e.g., TopK over features produced by
SELECT * FROM users JOIN items, click_logs, pages WHERE …]
51. The Common Solutions to Latency
• Faster online model scoring: "Execute Predict(query) in real time as queries arrive"
• Pre-materialization and lookup: "Pre-compute Predict(query) for all queries and look up the answer at query time"
Dato Predictive Services does both.
52. Faster Online Model Scoring: highly optimized machine learning
• SFrame: native code, optimized data frame
- Available open-source (BSD)
• Model query acceleration with native code, e.g.:
- TopK and nearest-neighbor evaluation: LSH, ball trees, …
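As a hedged illustration of accelerated nearest-neighbor evaluation, here is GraphLab-style Python using the nearest_neighbors toolkit; the file and column names are made up:

import graphlab as gl

# Build an index over item feature vectors
# ('ball_tree' and 'lsh' are among the methods the slide mentions)
items = gl.SFrame.read_csv('item_features.csv')   # hypothetical input
index = gl.nearest_neighbors.create(items,
                                    features=['embedding'],
                                    method='ball_tree')

# TopK evaluation: retrieve the 10 nearest items for a few query rows
topk = index.query(items.head(5), k=10)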
53. The Common Solutions to Latency (recap of slide 51)
Dato Predictive Services does both: faster online model scoring and pre-materialization with lookup.
54. Smart Materialization Caching
[Chart: query frequency vs. unique queries, a heavy-tailed distribution]
Example: the top 10% of all unique queries cover 90% of all queries performed.
Caching a small number of unique queries has a very large impact.
55. Distributed shared caching
Distributed shared cache (Redis) stores:
• Model query results
• Common features (e.g., product info)
Scale-out improves throughput and latency.
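A minimal sketch of such a cache in Python with redis-py; the key scheme, TTL, and model object are illustrative assumptions, not Dato's implementation:

import json
import redis

r = redis.Redis(host='localhost', port=6379)   # shared cache node

def cached_predict(model, query, ttl_seconds=300):
    # Look up Predict(query) in the shared cache; compute and store on a miss
    key = 'predict:' + json.dumps(query, sort_keys=True)   # illustrative key scheme
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)        # easy case: cache hit
    result = model.predict(query)     # hard case: score the model online
    r.set(key, json.dumps(result), ex=ttl_seconds)   # expire so results stay fresh
    return result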
56. Dato latency by the numbers
• Easy case (cache hit): ~2 ms
• Hard case (cache miss):
- Simple linear models: 5-6 ms
- Complex random forests: 7-8 ms (P99: ~15 ms)
[Using an AWS m3.xlarge instance]
59. Adaptive: Accounting for Constant Change
(Highlighting the Adaptive ingredient from slide 47: ML models are out of date the moment learning is done, so we need to constantly understand & improve end-to-end performance.)
60. Change at Different Scales and Rates
Shopping for Mom vs. shopping for me:
• Rate of change: months → minutes
• Granularity of change: population → session
61. Change at Different Scales and Rates (continued)
Individual- and session-level change calls for:
• Small data
• Online learning
• Bandits to assess models
62. The Dangerous Feedback Loop
I once looked at cameras on Amazon…
[Screenshot: recommendations showing similar cameras, accessories, and bags]
If this is all they showed, how would they ever learn that I also like bikes and shoes?
63. Exploration / Exploitation Tradeoff
Systems that take actions can adversely affect their own future data.
• Exploration (random action): learn more about what is good and bad
• Exploitation (best action): make the best use of what we believe is good
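One standard way to balance the two is an epsilon-greedy bandit; this Python sketch is illustrative and not necessarily the algorithm Dato Predictive Services uses:

import random

class EpsilonGreedy:
    # Pick the best-looking action most of the time; explore occasionally
    def __init__(self, n_actions, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_actions     # times each action was taken
        self.rewards = [0.0] * n_actions  # cumulative observed reward

    def choose(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))   # explore: random action
        means = [r / c if c else 0.0
                 for r, c in zip(self.rewards, self.counts)]
        return max(range(len(means)), key=means.__getitem__)   # exploit: best action

    def update(self, action, reward):
        self.counts[action] += 1
        self.rewards[action] += reward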
64. Dato Solution to Adaptivity
• Rapid offline learning with GraphLab Create
• Online bandit adaptation in Predictive Services
• Demo
65. Manageable: Unification and Simplification
(Highlighting the Manageable ingredient from slide 47: many thousands of models, created by hundreds of people, need versioning, attribution, provenance & reproducibility.)
66. Ecosystem of Intelligent Services
[Diagram: data infrastructure (MySQL tables TableA, TableB), data science (ModelA, ModelB), and serving (Service A, Service B) spread across separate systems]
Complicated! Many systems with overlapping roles and no single source of truth for the intelligent service.
68. Model management: like code management, but for the life cycle of intelligent applications
• Provenance & reproducibility: track changes & roll back; cover code, model type, parameters, data, …
• Collaboration: review & blame; share; common feature-engineering pipelines
• Continuous integration: deploy & update; measure & improve; avoid downtime and impact on end users
69. Responsive, Adaptive, Manageable
• Dato Predictive Services: serving models and managing the machine learning lifecycle
• GraphLab Create: accurate, robust, and scalable model training
70. GraphLab Create: sophisticated machine learning made easy
• High-level ML toolkits
• AutoML: tunes parameters, performs model selection, …, so you can focus on the creative parts
• Reusable features: transferrable feature engineering; accuracy with less data & less effort
71. High-level ML toolkits
Get started with 4 lines of code, then modify, blend, add your own…
Recommender, image search, sentiment analysis, data matching, auto tagging, churn prediction, object detection, product sentiment, click prediction, fraud detection, user segmentation, data completion, anomaly detection, document clustering, forecasting, search ranking, summarization, …
import graphlab as gl

data = gl.SFrame.read_csv('my_data.csv')
model = gl.recommender.create(data,
                              user_id='user',
                              item_id='movie',
                              target='rating')
recommendations = model.recommend(k=5)
72. SFrame & SGraph: sophisticated machine learning made scalable
SFrame ❤️ all ML tools
73. Opportunity for Out-of-Core ML
[Chart of the storage hierarchy: ~0.1 TB capacity at ~1 GB/s throughput (fast, but significantly limits data size); ~1 TB at ~0.5 GB/s; ~10 TB at ~0.1 GB/s. The larger tiers are an opportunity for big data on one machine, but for sequential reads only: random access is very slow]
The out-of-core ML opportunity is huge:
• Usual design → lots of random access → slow
• Instead, design to maximize sequential access for ML algorithm patterns
• GraphChi was an early example; SFrame is a data frame for ML
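To illustrate the sequential-access pattern (not SFrame's actual implementation), here is a small out-of-core Python sketch that streams a large binary file in chunks; the file format is an assumption:

import numpy as np

def streaming_mean(path, chunk_elems=1_000_000):
    # Stream a large file of float64 values in sequential chunks,
    # never materializing the full dataset in RAM
    total, count = 0.0, 0
    with open(path, 'rb') as f:
        while True:
            chunk = np.fromfile(f, dtype=np.float64, count=chunk_elems)
            if chunk.size == 0:
                break
            total += chunk.sum()
            count += chunk.size
    return total / count if count else float('nan')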
75. SFrame & SGraph
• Optimized out-of-core computation for ML
• High performance: one machine can handle TBs of data and 100s of billions of edges
• Optimized for ML: columnar transformations, feature creation, iterators, filter/join/group-by/aggregate, user-defined functions; easily extended through the SDK
• Handles tables, graphs, text, and images
• Open-source ❤️ BSD license
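A small illustration of these columnar operations in GraphLab-style Python; the file and column names are made up:

import graphlab as gl

sf = gl.SFrame.read_csv('events.csv')   # hypothetical input

# Columnar transformation: create a feature from an existing column
sf['hour'] = sf['timestamp'].apply(lambda t: t // 3600)

# Filter, group-by, aggregate
active = sf[sf['clicks'] > 0]
per_user = active.groupby('user_id',
                          {'total_clicks': gl.aggregate.SUM('clicks')})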
76. The Dato Machine Learning Platform
• Predictive Services: serve models and manage the machine learning lifecycle
• GraphLab Create: train accurate, robust, and scalable models