1© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deep Learning at AWS:
Embeddings & Attention
Models
Leo Dirac, Principal Engineer
July 20, 2017
Goals of this talk
• Inspire you to think big!
• Explain some key Deep Learning concepts
• Share impressive research results
• Applications at Amazon
How similar are these products?
• Identical?
• Different {sizes, styles} of the same product?
• Different products?
Supervised ML
Training Data Labels
Supervised ML
Model
Learning Code
Model
Training
Data
Algorithm Code
ML models as Code
• Linear Models (e.g., Logistic Regression)
– Very simple algorithm: SUMPRODUCT
– Fast to run, pretty easy to train
• Deep Neural Networks
– Arbitrarily complex algorithm
– Tricky & slow to train, requires GPU hardware
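The "SUMPRODUCT" view of a linear model can be sketched in a few lines (toy weights and features, not from the talk):

```python
import math

# A linear model really is just a sum of products, optionally squashed.
weights = [0.5, -1.2, 0.3]     # learned parameters (toy values)
features = [1.0, 2.0, 3.0]     # one input example

z = sum(w * x for w, x in zip(weights, features))  # SUMPRODUCT
probability = 1 / (1 + math.exp(-z))               # logistic regression output
print(round(probability, 4))  # 0.2689
```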
Floating Point Performance
Multiply two (10,000 x 10,000) matrices
(400MB each, 32-bit)
• Native BLAS (python numpy): ~30 seconds*
• Java (naïve triple for-loop): ~5 hours*
• p2.xlarge GPU: ~0.6 seconds
*Tested on a 4-core (8 w/ HT) iMac w/ Intel Core i7 @ 3.4GHz; similar to c4.2xl
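The numpy measurement can be reproduced with a sketch like this (timings vary widely by machine; the full 10,000×10,000 case needs well over 1 GB of RAM):

```python
import time
import numpy as np

n = 10_000  # shrink this on memory-constrained machines
a = np.random.rand(n, n).astype(np.float32)  # ~400 MB at 32-bit
b = np.random.rand(n, n).astype(np.float32)

start = time.time()
c = a @ b   # dispatched to native BLAS (sgemm)
print(f"matmul took {time.time() - start:.1f} s, result shape {c.shape}")
```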
EC2 p2.16xlarge
68,000,000,000,000
operations/second
(16 GPUs, each about 4 TFLOPS)
Clustering
• Finds “similar” items
• What is Similar?
• Vector Distance
– Data points are coordinates
Euclidean Distance / L2-norm
Pixel Similarity Distance
[Image pairs with pixel-distance values: 0.483, 1.412, 1.770]
Preparing your data for Math
N×D matrix
“Embedding”
“Encoding”
“Latent Features”
“Feature Embedding”
“Feature Vector”
“Vector”
“Point”
“Coordinates”
☐ Image Embeddings
☐ Word Embeddings … tricky
Word Embedding
• Why not just use char[]?
Not semantically meaningful
s("Duck") = [68.00, 117.00, 99.00, 107.00, 0.00, 0.00, 0.00, 0.00]
• Closest to "Euck", "Dudk", "Dtck".
• Not very similar to "duck" or "Ducks".
• Very far from "Goose".
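A quick sketch of why byte values make a bad embedding, using the zero-padded encoding shown above:

```python
import numpy as np

def s(word, width=8):
    """Encode a word as a fixed-width array of byte values, zero-padded."""
    padded = word.encode("ascii").ljust(width, b"\x00")
    return np.frombuffer(padded, dtype=np.uint8).astype(float)

def d(a, b):
    return np.linalg.norm(s(a) - s(b))

print(d("Duck", "Euck"))   # 1.0 -- one code point apart, so "close"
print(d("Duck", "duck"))   # 32.0 -- a case flip looks like a huge change
print(d("Duck", "Goose"))  # large, but meaninglessly so
```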
Bag of Words / 1-Hot
Holds in higher dimensions
Everything is equidistant
D("king", "queen") = 1.4142
D("king", "kings") = 1.4142
D("king", "small") = 1.4142
D("small", "tiny") = 1.4142
D("frog", "diesel") = 1.4142
D("soccer", "ball") = 1.4142
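This equidistance is easy to verify (a sketch with a toy vocabulary):

```python
import numpy as np

# With 1-hot vectors, every pair of distinct words is exactly sqrt(2) apart.
vocab = ["king", "queen", "kings", "small", "tiny", "frog"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

for a, b in [("king", "queen"), ("king", "kings"), ("small", "tiny")]:
    print(a, b, round(np.linalg.norm(one_hot[a] - one_hot[b]), 4))  # 1.4142
```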
Semantic meaning in geometry
D("king", "queen") = 0.188
D("king", "kings") = 0.052
D("king", "small") = 1.385
D("small", "tiny") = 0.165
Word2vec Embedding
W2v("king") = [-3.168, -0.136, 3.770, 4.767, 3.558, -4.168, 0.464, 2.034, 3.411, …, 0.866]
• float[128]
• Meaningless to a human
• Like a hash code
• Pre-computed
• Map<String,float[]>
• Takes a long time to train
Similar words: nearby embeddings
W2v("king") = [-3.168, -0.136, 3.770, 4.767, 3.558, -4.168, 0.464, 2.034, 3.411, …, 0.866]
W2v("queen") = [-3.101, -0.057, 3.800, 4.862, 3.632, -4.157, 0.549, 2.064, 3.428, …, 0.884]
D(W2v("king"), W2v("queen")) = 0.188
Algebra
W2v("king") − W2v("queen") + W2v("aunt") =
[5.409, 5.281, -1.331, 3.714, -1.727, -3.167, -2.130, 1.213, -3.285, …, -2.000]
≈ W2v("uncle")
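The analogy arithmetic can be sketched with toy 3-d vectors (hypothetical values; real word2vec vectors have far more dimensions):

```python
import numpy as np

# Toy "embeddings" chosen so the analogy works out; illustrative only.
w2v = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "aunt":  np.array([0.1, 0.2, 0.9]),
    "uncle": np.array([0.1, 0.8, 0.9]),
}

target = w2v["king"] - w2v["queen"] + w2v["aunt"]
nearest = min(w2v, key=lambda w: np.linalg.norm(w2v[w] - target))
print(nearest)  # uncle
```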
Analogies in Geometry
Analogies
Word2Vec Embedding
Training data: Large text corpus (like Wikipedia)
Word Embeddings: Word2Vec
☐Image Embeddings: ?
Goal: Semantic Similarity
[Image pairs with semantic-distance values: 0.058, 0.731, 0.782]
ImageNet
Training data: 10^6 <Image,Noun> pairs
Noun vocabulary: 1000
Convolutional Neural Network
[Diagram: image X → F(X) ≈ Y = "Leopard"; F is a ConvNet (CNN).]
Human-level performance
Dark Knowledge
[Diagram: Training Data with labels like "grille", "grille", "convertible" feeds the Training Algorithm; the trained model's Predictions are probability vectors such as [0.001, 0.000, 0.685, 0.013, …, 0.004, 0.134, 0.000, …, 0.007].]
Predictions as Embedding?
[Diagram: the same probability vector [0.001, 0.000, 0.685, 0.013, …, 0.004, 0.134, 0.000, …, 0.007], with its largest entries at labels like "grille" and "convertible".]
Image Features
[Diagram: X → F(X) → y; the image features are read from the penultimate layer.]
Best Linearly Separable Space
Learned features
Penultimate Layer
Dim1: Four legs?
Dim2: Straps?
Dim3: Brown & furry?
Dim4: Human leg?
Dim5: Standing in grass?
Dim6: Person holding it?
Dim7: Has laces?
…
Dim4096: In the sky?
Output Layer
Dim1: Is this an aardvark?
Dim2: Is this an airplane?
Dim3: Is this an apple?
…
Dim258: Is this a dress shoe?
…
Dim721: Is this a sandal?
…
Dim1000: Is this a zebra?
Image Embedding
Embedding Features
Dim1: Four legs?
Dim2: Straps?
Dim3: Brown & furry?
Dim4: Human leg?
Dim5: Standing in grass?
Dim6: Person holding it?
Dim7: Has laces?
…
Dim4096: In the sky?
Word Embeddings: Word2Vec w/ Wikipedia
Image Embeddings: ConvNet w/ ImageNet data
Word Embeddings: Word2Vec w/ Wikipedia
Image Embeddings: ConvNet w/ ImageNet data
☐Phrase Embeddings: ?
Machine Translation
Training data: list of
<English Phrase, French Phrase> pairs
Encoder/Decoder Network
[Diagram: a Recurrent Neural Network (RNN) encoder reads the English words "seriously", "powerful", "technique" into a single Phrase Embedding; an RNN decoder then emits the French words "technique", "au", "puissant", "sérieux".]
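A minimal encoder/decoder sketch in PyTorch (hypothetical sizes; real systems add word-embedding lookups, a softmax over the vocabulary, beam search, etc.):

```python
import torch
import torch.nn as nn

emb_dim, hidden = 32, 64
encoder = nn.GRU(emb_dim, hidden, batch_first=True)
decoder = nn.GRU(emb_dim, hidden, batch_first=True)

english = torch.rand(1, 3, emb_dim)        # word vectors for 3 English words
_, phrase_embedding = encoder(english)     # one fixed-size vector for the phrase

french_in = torch.rand(1, 4, emb_dim)      # decoder inputs (teacher forcing)
outputs, _ = decoder(french_in, phrase_embedding)
print(outputs.shape)  # torch.Size([1, 4, 64])
```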
Embeddings as Interfaces
English Words → English Word2Vec → Encoder RNN → Joint English/French Phrase Embedding → Decoder RNN → French Word2Vec → French Words
“Joint Embedding”
Combines two kinds of data into the same
embedding space.
Here: English & French phrases.
Or…
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
☐Image/Phrase joint embedding: ?
Neural Image Captioning
Training Data: list of <Image,Phrase> pairs
Composition of Neural Networks
Image → Raw Pixel Encoding → Image ConvNet → Joint Image/Phrase Embedding → Language Decoder RNN → English Word2Vec → Descriptive Phrase
Composition of Neural Networks
Image → Raw Pixel Encoding → Image ConvNet → Joint Image/Phrase Embedding → unrolled RNN decoder → Word1, Word2, Word3, … (each via English Word2Vec)
NIC examples
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
Image captioning: ConvNet + Decoder RNN
☐Limits of embedding models
Phrase embeddings don’t work well
Why?
ℝ^512 is too small
You’re nuts!
Information content in embeddings
• How many points can be organized in ℝ^2?
Information content in embeddings
• How many points can be organized in 2 single-precision floats?
2^64 = 18,446,744,073,709,551,616
Information content in embeddings
• Only using 1 bit per dimension:
2^512 ≈ 10^154
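Both counts are quick to check:

```python
import math

# All bit patterns of two 32-bit floats:
print(2 ** 64)                      # 18446744073709551616
# Even 1 bit per dimension in R^512 is astronomically many points:
print(round(math.log10(2 ** 512)))  # 154 -- i.e. 2^512 ~ 10^154
```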
Cover’s Function Counting Theorem
(1965)
http://www.cns.nyu.edu/~eorhan/notes/covers-theorem.pdf
Simplification: up to O(N) points in ℝ^N are probably linearly separable.
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
Image captioning: ConvNet + Decoder RNN
☐Attention models
Seq2Seq Network
[Diagram: encoder RNNs read the English words "seriously", "powerful", "technique" into a Phrase Embedding; decoder RNNs emit the French words "technique", "au", "puissant", "sérieux".]
Seq2Seq with Attention
Attention Model
f(Decoder_state, Input_Word) -> [0,1]
How relevant is this input word to the current
output?
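A minimal sketch of such a relevance function (dot-product scoring plus a softmax; the shapes are hypothetical):

```python
import numpy as np

def attention_weights(decoder_state, encoder_states):
    """Score each input word against the decoder state; normalize to [0, 1]."""
    scores = encoder_states @ decoder_state  # one relevance score per input word
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    return exp / exp.sum()

encoder_states = np.random.rand(5, 8)   # 5 input words, 8-d encoder states
decoder_state = np.random.rand(8)
weights = attention_weights(decoder_state, encoder_states)
print(round(weights.sum(), 6))  # 1.0 -- a distribution over the input words
```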
NMT attention
https://arxiv.org/pdf/1409.0473.pdf
Attention in NIC
https://arxiv.org/pdf/1502.03044.pdf
Attention in NIC
https://arxiv.org/pdf/1502.03044.pdf
Far from perfect
https://arxiv.org/pdf/1502.03044.pdf
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
Image captioning: ConvNet + Decoder RNN
Attention models
☐Amazon Applications
Product2Vec
[Diagram: two copies of one NN map each product's features (Title, Description) to a Product Embedding; the distance between the two embeddings is trained against Product 1-2 Similarity observed in aggregate customer behavior.]
Product Embeddings w/ Images
[Diagram: as in Product2Vec, but each product's Image also passes through a CNN to an Image Embedding, which joins the product features before the shared NN produces the Product Embedding.]
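The Product2Vec diagram can be sketched as a shared encoder trained so that embedding distance matches observed similarity (hypothetical sizes and data):

```python
import torch
import torch.nn as nn

# One shared network embeds both products' (title, description) features.
encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))

p1 = torch.rand(1, 300)                    # product 1 text features (stand-in)
p2 = torch.rand(1, 300)                    # product 2 text features
predicted = torch.norm(encoder(p1) - encoder(p2))  # embedding distance
observed = torch.tensor(0.2)               # similarity from customer behavior
loss = (predicted - observed) ** 2         # push distance toward the observation
loss.backward()                            # gradients flow into the shared encoder
```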
Analogies
[Image analogy: product A − product B + product C = product D]
Analogies
Dave's Killer Bread 21 Whole Grains Bread (2 loaves, USDA Organic)
− Stroehmann King Bread Loaf
+ Quaker Chewy Variety Pack (60 Granola Bars)
= Nature's Path Organic Chewy Granola Bars
Linear Combinations
[Image: a linear combination of product embeddings, e.g. 2·A + B ≈ C]
Image2Vec on Product Images
Understanding Points in High-Dimensional Space
• Excel
• Clustering – assigns integer values
• Projection – maps to 2D space (or 3D, 4D, etc)
– PCA
– “t-SNE” is a learned projection
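For example, projecting embeddings down to 2-D with scikit-learn's t-SNE (random vectors stand in for real image embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(100, 64)       # 100 items, 64-d embeddings
points_2d = TSNE(n_components=2, perplexity=10,
                 init="pca", random_state=0).fit_transform(embeddings)
print(points_2d.shape)  # (100, 2)
```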
t-SNE Projection of Product Images
Lessons
• Embeddings need a context to have meaning
– Similarity & Distance become relevant
• Supervised ML can create useful embeddings
– Weak labels are often good enough
• Neural networks are composable
– Re-use network architectures or trained networks
• Attention mechanisms extend embeddings
– Embeddings have limited capacity.
– Attention provides interpretability
Think big!
It’s still day 1 for Deep Learning.