This document provides an overview of deep learning and its applications. It discusses how deep learning can be used for image classification and how neural networks learn hierarchical representations from data. It highlights some of the challenges of deep learning, such as the large amounts of data and computation required, and covers how deep learning models can be deployed in production, using Dato Predictive Services on AWS infrastructure, to ensure low latency, high availability, and continuous learning.
8. What can a linear classifier represent?
x1 OR x2: y = threshold(1·x1 + 1·x2 − 0.5)
x1 AND x2: y = threshold(1·x1 + 1·x2 − 1.5)
[Diagram: two single-unit networks over inputs x1, x2, and a constant 1; the OR unit uses weights 1, 1 with bias −0.5, the AND unit uses weights 1, 1 with bias −1.5]
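To make this concrete, here is a minimal Python sketch (not from the slides) of these two threshold units; the function names are illustrative:

def step(v):
    # Threshold to 0 or 1
    return 1 if v > 0 else 0

def linear_or(x1, x2):
    # Weights 1, 1 and bias -0.5, as in the diagram
    return step(1 * x1 + 1 * x2 - 0.5)

def linear_and(x1, x2):
    # Weights 1, 1 and bias -1.5
    return step(1 * x1 + 1 * x2 - 1.5)

# Check all four input combinations
for a in (0, 1):
    for b in (0, 1):
        print(a, b, linear_or(a, b), linear_and(a, b))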
9. What can't a simple linear classifier represent?
XOR: the counterexample to everything.
Need non-linear features.
10. Solving the XOR problem: adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
z1 = threshold(1·x1 − 1·x2 − 0.5)
z2 = threshold(−1·x1 + 1·x2 − 0.5)
y = threshold(1·z1 + 1·z2 − 0.5)
All units are thresholded to 0 or 1.
[Diagram: a two-layer network over inputs x1, x2, and a constant 1, with hidden units z1, z2 feeding output y]
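Stacking the same threshold units gives the XOR network above; a minimal Python sketch (illustrative, not from the slides):

def step(v):
    # Threshold to 0 or 1
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    z1 = step(1 * x1 - 1 * x2 - 0.5)    # x1 AND NOT x2
    z2 = step(-1 * x1 + 1 * x2 - 0.5)   # NOT x1 AND x2
    return step(1 * z1 + 1 * z2 - 0.5)  # z1 OR z2

# Verify all four input combinations
assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]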
11. A neural network
• Layers and layers of linear models and non-linear transformations
• Around for about 50 years
• Big resurgence in the last few years
- Impressive accuracy on several benchmark problems
- Advances in hardware make the computation feasible (e.g., AWS g2 instances)
[Diagram: a two-layer network with inputs x1, x2, hidden units z1, z2, bias units, and output y]
13. Feature detection: the traditional approach
• Features = local detectors
- Combined to make a prediction
- (In reality, features are more low-level)
[Illustration: eye, eye, nose, and mouth detectors combined to conclude "Face!"]
14. Many hand-created features exist for finding interest points…
• SIFT [Lowe '99]
• Spin Images [Johnson & Hebert '99]
• Textons [Malik et al. '99]
• RIFT [Lazebnik '04]
• GLOH [Mikolajczyk & Schmid '05]
• HoG [Dalal & Triggs '05]
• …
16. Many hand-created features (SIFT, Spin Images, Textons, RIFT, GLOH, HoG, …) exist for finding interest points…
… but they are very painful to design.
17. Deep learning implicitly learns features
[Figure: detectors learned at Layer 1, Layer 2, and Layer 3, the final prediction, and example interest points detected; from Zeiler & Fergus '13]
19. Deep learning accuracy
• German traffic sign recognition benchmark: 99.5% accuracy (IDSIA team)
• House number recognition: 97.8% accuracy per character [Goodfellow et al. '13]
20. ImageNet 2012 competition: 1.2M training images, 1000 categories
[Bar chart: error (best of 5 guesses) for the top 3 teams, SuperVision, ISI, and OXFORD_VGG; SuperVision shows a huge gain over the others, which exploited hand-coded features like SIFT]
21. ImageNet 2012 competition: 1.2M training images, 1000 categories
Winning entry: SuperVision, 8 layers, 60M parameters [Krizhevsky et al. '12]
Achieving these amazing results required:
• New learning algorithms
• GPU implementation
22. Deep learning performance
• ImageNet: 1.2M images
[Bar chart: running time in hours on AWS g2.xlarge vs. g2.8xlarge instances]
27. Designed a simple user interface

import graphlab

# Train the model
model = graphlab.neuralnet.create(train_images)

# Predict classes for new images
outcome = model.predict(test_images)
30. Deep learning score card
Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains
- Computer vision
- Speech recognition
- Some text analysis
• Potential for more impact
32. Many tricks needed to work well…
Different types of layers, connections, … are needed for high accuracy [Krizhevsky et al. '12]
33. Deep learning score card
Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains
- Computer vision
- Speech recognition
- Some text analysis
• Potential for more impact
Cons
• Requires a lot of data for high accuracy
• Computationally really expensive
• Extremely hard to tune
- Choice of architecture
- Parameter types
- Hyperparameters
- Learning algorithm
- …
Computational cost + so many choices = incredibly hard to tune
35. Standard image classification approach
Input → extract hand-created features → use a simple classifier (e.g., logistic regression, SVMs) → Face?
Can we learn features from data, even when we don't have much data or time?
36. What's learned in a neural net
Neural net trained for Task 1: cat vs. dog
• Early layers: more generic; can be used as a feature extractor
• Later layers: very specific to Task 1; should be ignored for other tasks
37. Transfer learning in more detail…
Neural net trained for Task 1: cat vs. dog
• Early, more generic layers: keep the weights fixed and use them as a feature extractor
• Later, Task 1-specific layers: ignore them
For Task 2 (predicting 101 categories), learn only the end part of the neural net: a simple classifier (e.g., logistic regression, SVMs, nearest neighbor, …) that predicts the class.
38. Careful where you cut: latter layers may be too task-specific
[Figure: Layer 1, Layer 2, and Layer 3 detectors and example interest points, from Zeiler & Fergus '13; the latter layers are too specific for the new task, so use the earlier ones!]
39. Transfer learning with deep features workflow
Some labeled data → split into a training set and a validation set
Training set → extract features with a neural net trained on a different task → learn a simple classifier
Validation set → validate
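A minimal sketch of this workflow in GraphLab-style Python. The file names, column names, and the pretrained model's extract_features call are assumptions for illustration, not the exact Dato API:

import graphlab as gl

# Some labeled data, split into training and validation sets
data = gl.SFrame.read_csv('my_labeled_images.csv')   # hypothetical file
train, valid = data.random_split(0.8)

# Extract features with a neural net trained on a different task
# (a pretrained model exposing an extract_features method is assumed)
pretrained = gl.load_model('pretrained_imagenet_model')   # hypothetical path
train['deep_features'] = pretrained.extract_features(train)
valid['deep_features'] = pretrained.extract_features(valid)

# Learn a simple classifier on top of the fixed deep features
clf = gl.logistic_classifier.create(train,
                                    features=['deep_features'],
                                    target='label')

# Validate
print(clf.evaluate(valid))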
44. How to use deep learning in production?
• Predictive: understands input and takes actions or makes decisions
• Interactive: responds in real time
• Learning: improves its performance with experience
47. Essential ingredients of an intelligent service
• Responsive: intelligent applications are interactive; they need low latency, high throughput & high availability
• Adaptive: ML models are out of date the moment learning is done; we need to constantly understand & improve end-to-end performance
• Manageable: many thousands of models, created by hundreds of people, need versioning, attribution, provenance & reproducibility
48. Responsive: Now and Always
(Highlighting the Responsive ingredient from slide 47: intelligent applications are interactive and need low latency, high throughput & high availability.)
50. Challenge: Scoring Latency
Compute predictions in < 20 ms for complex models and queries, all while under heavy query load.
[Diagram: models scoring incoming queries, e.g., TopK over features produced by
SELECT * FROM users JOIN items, click_logs, pages WHERE …]
51. The Common Solutions to Latency
• Faster online model scoring: "Execute Predict(query) in real time as queries arrive"
• Pre-materialization and lookup: "Pre-compute Predict(query) for all queries and look up the answer at query time"
Dato Predictive Services does both.
52. Faster Online Model Scoring: highly optimized machine learning
• SFrame: native code, optimized data frame
- Available open-source (BSD)
• Model query acceleration with native code, e.g.:
- TopK and nearest-neighbor evaluation: LSH, ball trees, …
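As a hedged illustration of accelerated nearest-neighbor evaluation, here is GraphLab-style Python using the nearest_neighbors toolkit; the file and column names are made up:

import graphlab as gl

# Build an index over item feature vectors
# ('ball_tree' and 'lsh' are among the methods the slide mentions)
items = gl.SFrame.read_csv('item_features.csv')   # hypothetical input
index = gl.nearest_neighbors.create(items,
                                    features=['embedding'],
                                    method='ball_tree')

# TopK evaluation: retrieve the 10 nearest items for a few query rows
topk = index.query(items.head(5), k=10)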
53. The Common Solutions to Latency (recap of slide 51)
Dato Predictive Services does both: faster online model scoring and pre-materialization with lookup.
54. Smart Materialization Caching
[Chart: query frequency vs. unique queries, a heavy-tailed distribution]
Example: the top 10% of all unique queries cover 90% of all queries performed.
Caching a small number of unique queries has a very large impact.
55. Distributed shared caching
Distributed shared cache (Redis) stores:
• Model query results
• Common features (e.g., product info)
Scale-out improves throughput and latency.
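A minimal sketch of such a cache in Python with redis-py; the key scheme, TTL, and model object are illustrative assumptions, not Dato's implementation:

import json
import redis

r = redis.Redis(host='localhost', port=6379)   # shared cache node

def cached_predict(model, query, ttl_seconds=300):
    # Look up Predict(query) in the shared cache; compute and store on a miss
    key = 'predict:' + json.dumps(query, sort_keys=True)   # illustrative key scheme
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)        # easy case: cache hit
    result = model.predict(query)     # hard case: score the model online
    r.set(key, json.dumps(result), ex=ttl_seconds)   # expire so results stay fresh
    return result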
56. Dato latency by the numbers
• Easy case (cache hit): ~2 ms
• Hard case (cache miss):
- Simple linear models: 5-6 ms
- Complex random forests: 7-8 ms (P99: ~15 ms)
[Using an AWS m3.xlarge instance]
59. Adaptive: Accounting for Constant Change
(Highlighting the Adaptive ingredient from slide 47: ML models are out of date the moment learning is done, so we need to constantly understand & improve end-to-end performance.)
60. Change at Different Scales and Rates
Shopping for Mom vs. shopping for me:
• Rate of change: months → minutes
• Granularity of change: population → session
61. Change at Different Scales and Rates (continued)
Individual- and session-level change calls for:
• Small data
• Online learning
• Bandits to assess models
62. The Dangerous Feedback Loop
I once looked at cameras on Amazon…
[Screenshot: recommendations showing similar cameras, accessories, and bags]
If this is all they showed, how would they ever learn that I also like bikes and shoes?
63. Exploration / Exploitation Tradeoff
Systems that take actions can adversely affect their own future data.
• Exploration (random action): learn more about what is good and bad
• Exploitation (best action): make the best use of what we believe is good
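One standard way to balance the two is an epsilon-greedy bandit; this Python sketch is illustrative and not necessarily the algorithm Dato Predictive Services uses:

import random

class EpsilonGreedy:
    # Pick the best-looking action most of the time; explore occasionally
    def __init__(self, n_actions, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_actions     # times each action was taken
        self.rewards = [0.0] * n_actions  # cumulative observed reward

    def choose(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))   # explore: random action
        means = [r / c if c else 0.0
                 for r, c in zip(self.rewards, self.counts)]
        return max(range(len(means)), key=means.__getitem__)   # exploit: best action

    def update(self, action, reward):
        self.counts[action] += 1
        self.rewards[action] += reward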
64. Dato Solution to Adaptivity
• Rapid offline learning with GraphLab Create
• Online bandit adaptation in Predictive Services
• Demo
65. Manageable: Unification and Simplification
(Highlighting the Manageable ingredient from slide 47: many thousands of models, created by hundreds of people, need versioning, attribution, provenance & reproducibility.)
66. Ecosystem of Intelligent Services
[Diagram: data infrastructure (MySQL tables TableA, TableB), data science (ModelA, ModelB), and serving (Service A, Service B) spread across separate systems]
Complicated! Many systems with overlapping roles and no single source of truth for the intelligent service.
68. Model management: like code management, but for the life cycle of intelligent applications
• Provenance & reproducibility: track changes & roll back; cover code, model type, parameters, data, …
• Collaboration: review & blame; share; common feature-engineering pipelines
• Continuous integration: deploy & update; measure & improve; avoid downtime and impact on end users
69. Responsive, Adaptive, Manageable
• Dato Predictive Services: serving models and managing the machine learning lifecycle
• GraphLab Create: accurate, robust, and scalable model training
70. GraphLab Create: sophisticated machine learning made easy
• High-level ML toolkits
• AutoML: tunes parameters, performs model selection, …, so you can focus on the creative parts
• Reusable features: transferrable feature engineering; accuracy with less data & less effort
71. High-level ML toolkits
Get started with 4 lines of code, then modify, blend, add your own…
Recommender, image search, sentiment analysis, data matching, auto tagging, churn prediction, object detection, product sentiment, click prediction, fraud detection, user segmentation, data completion, anomaly detection, document clustering, forecasting, search ranking, summarization, …
import graphlab as gl

data = gl.SFrame.read_csv('my_data.csv')
model = gl.recommender.create(data,
                              user_id='user',
                              item_id='movie',
                              target='rating')
recommendations = model.recommend(k=5)
72. SFrame & SGraph: sophisticated machine learning made scalable
SFrame ❤️ all ML tools
73. Opportunity for Out-of-Core ML
[Chart of the storage hierarchy: ~0.1 TB capacity at ~1 GB/s throughput (fast, but significantly limits data size); ~1 TB at ~0.5 GB/s; ~10 TB at ~0.1 GB/s. The larger tiers are an opportunity for big data on one machine, but for sequential reads only: random access is very slow]
The out-of-core ML opportunity is huge:
• Usual design → lots of random access → slow
• Instead, design to maximize sequential access for ML algorithm patterns
• GraphChi was an early example; SFrame is a data frame for ML
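To illustrate the sequential-access pattern (not SFrame's actual implementation), here is a small out-of-core Python sketch that streams a large binary file in chunks; the file format is an assumption:

import numpy as np

def streaming_mean(path, chunk_elems=1_000_000):
    # Stream a large file of float64 values in sequential chunks,
    # never materializing the full dataset in RAM
    total, count = 0.0, 0
    with open(path, 'rb') as f:
        while True:
            chunk = np.fromfile(f, dtype=np.float64, count=chunk_elems)
            if chunk.size == 0:
                break
            total += chunk.sum()
            count += chunk.size
    return total / count if count else float('nan')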
75. SFrame & SGraph
• Optimized out-of-core computation for ML
• High performance: one machine can handle TBs of data and 100s of billions of edges
• Optimized for ML: columnar transformations, feature creation, iterators, filter/join/group-by/aggregate, user-defined functions; easily extended through the SDK
• Handles tables, graphs, text, and images
• Open-source ❤️ BSD license
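A small illustration of these columnar operations in GraphLab-style Python; the file and column names are made up:

import graphlab as gl

sf = gl.SFrame.read_csv('events.csv')   # hypothetical input

# Columnar transformation: create a feature from an existing column
sf['hour'] = sf['timestamp'].apply(lambda t: t // 3600)

# Filter, group-by, aggregate
active = sf[sf['clicks'] > 0]
per_user = active.groupby('user_id',
                          {'total_clicks': gl.aggregate.SUM('clicks')})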
76. The Dato Machine Learning Platform
• Predictive Services: serve models and manage the machine learning lifecycle
• GraphLab Create: train accurate, robust, and scalable models