1© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deep Learning at AWS:
Embeddings & Attention
Models
Leo Dirac, Principal Engineer
July 20, 2017
Goals of this talk
• Inspire you to think big!
• Explain some key Deep Learning concepts
• Share impressive research results
• Applications at Amazon
How similar are these products?
• Identical?
• Different {sizes, styles} of the same product?
• Different products?
Supervised ML
Training Data Labels
Supervised ML
Model
Learning Code
Model
Training
Data
Algorithm Code
ML models as Code
• Linear Models (e.g., Logistic Regression)
– Very simple algorithm: SUMPRODUCT
– Fast to run, pretty easy to train
• Deep Neural Networks
– Arbitrarily complex algorithm
– Tricky & slow to train, requires GPU hardware
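The "SUMPRODUCT" view of a linear model can be sketched in a few lines (toy weights and features, not from the talk):

```python
import math

# A linear model really is just a sum of products, optionally squashed.
weights = [0.5, -1.2, 0.3]     # learned parameters (toy values)
features = [1.0, 2.0, 3.0]     # one input example

z = sum(w * x for w, x in zip(weights, features))  # SUMPRODUCT
probability = 1 / (1 + math.exp(-z))               # logistic regression output
print(round(probability, 4))  # 0.2689
```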
Floating Point Performance
Multiply two (10,000 x 10,000) matrices
(400MB each, 32-bit)
• Native BLAS (python numpy): ~30 seconds*
• Java (naïve triple for-loop): ~5 hours*
• p2.xlarge GPU: ~0.6 seconds
*Tested on a 4-core (8 w/ HT) iMac w/ Intel Core i7 @ 3.4GHz; similar to c4.2xl
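The numpy measurement can be reproduced with a sketch like this (timings vary widely by machine; the full 10,000×10,000 case needs well over 1 GB of RAM):

```python
import time
import numpy as np

n = 10_000  # shrink this on memory-constrained machines
a = np.random.rand(n, n).astype(np.float32)  # ~400 MB at 32-bit
b = np.random.rand(n, n).astype(np.float32)

start = time.time()
c = a @ b   # dispatched to native BLAS (sgemm)
print(f"matmul took {time.time() - start:.1f} s, result shape {c.shape}")
```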
EC2 p2.16xlarge
68,000,000,000,000
operations/second
(16 GPUs, each about 4 TFLOPS)
Clustering
• Finds “similar” items
• What is Similar?
• Vector Distance
– Data points are coordinates
Euclidean Distance / L2-norm
Pixel Similarity Distance
[Image pairs with pixel-distance values: 0.483, 1.412, 1.770]
Preparing your data for Math
N×D matrix
“Embedding”
“Encoding”
“Latent Features”
“Feature Embedding”
“Feature Vector”
“Vector”
“Point”
“Coordinates”
☐ Image Embeddings
☐ Word Embeddings … tricky
Word Embedding
• Why not just use char[]?
Not semantically meaningful
s("Duck") = [68.00, 117.00, 99.00, 107.00, 0.00, 0.00, 0.00, 0.00]
• Closest to "Euck", "Dudk", "Dtck".
• Not very similar to "duck" or "Ducks".
• Very far from "Goose".
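A quick sketch of why byte values make a bad embedding, using the zero-padded encoding shown above:

```python
import numpy as np

def s(word, width=8):
    """Encode a word as a fixed-width array of byte values, zero-padded."""
    padded = word.encode("ascii").ljust(width, b"\x00")
    return np.frombuffer(padded, dtype=np.uint8).astype(float)

def d(a, b):
    return np.linalg.norm(s(a) - s(b))

print(d("Duck", "Euck"))   # 1.0 -- one code point apart, so "close"
print(d("Duck", "duck"))   # 32.0 -- a case flip looks like a huge change
print(d("Duck", "Goose"))  # large, but meaninglessly so
```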
Bag of Words / 1-Hot
Holds in higher dimensions
Everything is equidistant
D("king", "queen") = 1.4142
D("king", "kings") = 1.4142
D("king", "small") = 1.4142
D("small", "tiny") = 1.4142
D("frog", "diesel") = 1.4142
D("soccer", "ball") = 1.4142
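This equidistance is easy to verify (a sketch with a toy vocabulary):

```python
import numpy as np

# With 1-hot vectors, every pair of distinct words is exactly sqrt(2) apart.
vocab = ["king", "queen", "kings", "small", "tiny", "frog"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

for a, b in [("king", "queen"), ("king", "kings"), ("small", "tiny")]:
    print(a, b, round(np.linalg.norm(one_hot[a] - one_hot[b]), 4))  # 1.4142
```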
Semantic meaning in geometry
D("king", "queen") = 0.188
D("king", "kings") = 0.052
D("king", "small") = 1.385
D("small", "tiny") = 0.165
Word2vec Embedding
W2v("king") = [-3.168, -0.136, 3.770, 4.767, 3.558, -4.168, 0.464, 2.034, 3.411, …, 0.866]
• float[128]
• Meaningless to a human
• Like a hash code
• Pre-computed
• Map<String,float[]>
• Takes a long time to train
Similar words: nearby embeddings
W2v("king") = [-3.168, -0.136, 3.770, 4.767, 3.558, -4.168, 0.464, 2.034, 3.411, …, 0.866]
W2v("queen") = [-3.101, -0.057, 3.800, 4.862, 3.632, -4.157, 0.549, 2.064, 3.428, …, 0.884]
D(W2v("king"), W2v("queen")) = 0.188
Algebra
W2v("king") − W2v("queen") + W2v("aunt") =
[5.409, 5.281, -1.331, 3.714, -1.727, -3.167, -2.130, 1.213, -3.285, …, -2.000]
≈ W2v("uncle")
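The analogy arithmetic can be sketched with toy 3-d vectors (hypothetical values; real word2vec vectors have far more dimensions):

```python
import numpy as np

# Toy "embeddings" chosen so the analogy works out; illustrative only.
w2v = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "aunt":  np.array([0.1, 0.2, 0.9]),
    "uncle": np.array([0.1, 0.8, 0.9]),
}

target = w2v["king"] - w2v["queen"] + w2v["aunt"]
nearest = min(w2v, key=lambda w: np.linalg.norm(w2v[w] - target))
print(nearest)  # uncle
```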
Analogies in Geometry
Analogies
Word2Vec Embedding
Training data: Large text corpus (like Wikipedia)
Word Embeddings: Word2Vec
☐Image Embeddings: ?
Goal: Semantic Similarity
[Image pairs with semantic-distance values: 0.058, 0.731, 0.782]
ImageNet
Training data: 10^6 <Image,Noun> pairs
Noun vocabulary: 1000
Convolutional Neural Network
[Diagram: image X → F(X) ≈ Y = "Leopard"; F is a ConvNet (CNN).]
Human-level performance
Dark Knowledge
[Diagram: Training Data with labels like "grille", "grille", "convertible" feeds the Training Algorithm; the trained model's Predictions are probability vectors such as [0.001, 0.000, 0.685, 0.013, …, 0.004, 0.134, 0.000, …, 0.007].]
Predictions as Embedding?
[Diagram: the same probability vector [0.001, 0.000, 0.685, 0.013, …, 0.004, 0.134, 0.000, …, 0.007], with its largest entries at labels like "grille" and "convertible".]
Image Features
[Diagram: X → F(X) → y; the image features are read from the penultimate layer.]
Best Linearly Separable Space
Learned features
Penultimate Layer
Dim1: Four legs?
Dim2: Straps?
Dim3: Brown & furry?
Dim4: Human leg?
Dim5: Standing in grass?
Dim6: Person holding it?
Dim7: Has laces?
…
Dim4096: In the sky?
Output Layer
Dim1: Is this an aardvark?
Dim2: Is this an airplane?
Dim3: Is this an apple?
…
Dim258: Is this a dress shoe?
…
Dim721: Is this a sandal?
…
Dim1000: Is this a zebra?
Image Embedding
Embedding Features
Dim1: Four legs?
Dim2: Straps?
Dim3: Brown & furry?
Dim4: Human leg?
Dim5: Standing in grass?
Dim6: Person holding it?
Dim7: Has laces?
…
Dim4096: In the sky?
Word Embeddings: Word2Vec w/ Wikipedia
Image Embeddings: ConvNet w/ ImageNet data
Word Embeddings: Word2Vec w/ Wikipedia
Image Embeddings: ConvNet w/ ImageNet data
☐Phrase Embeddings: ?
Machine Translation
Training data: list of
<English Phrase, French Phrase> pairs
Encoder/Decoder Network
[Diagram: a Recurrent Neural Network (RNN) encoder reads the English words "seriously", "powerful", "technique" into a single Phrase Embedding; an RNN decoder then emits the French words "technique", "au", "puissant", "sérieux".]
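A minimal encoder/decoder sketch in PyTorch (hypothetical sizes; real systems add word-embedding lookups, a softmax over the vocabulary, beam search, etc.):

```python
import torch
import torch.nn as nn

emb_dim, hidden = 32, 64
encoder = nn.GRU(emb_dim, hidden, batch_first=True)
decoder = nn.GRU(emb_dim, hidden, batch_first=True)

english = torch.rand(1, 3, emb_dim)        # word vectors for 3 English words
_, phrase_embedding = encoder(english)     # one fixed-size vector for the phrase

french_in = torch.rand(1, 4, emb_dim)      # decoder inputs (teacher forcing)
outputs, _ = decoder(french_in, phrase_embedding)
print(outputs.shape)  # torch.Size([1, 4, 64])
```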
Embeddings as Interfaces
English Words → English Word2Vec → Encoder RNN → Joint English/French Phrase Embedding → Decoder RNN → French Word2Vec → French Words
“Joint Embedding”
Combines two kinds of data into the same
embedding space.
Here: English & French phrases.
Or…
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
☐Image/Phrase joint embedding: ?
Neural Image Captioning
Training Data: list of <Image,Phrase> pairs
Composition of Neural Networks
Image → Raw Pixel Encoding → Image ConvNet → Joint Image/Phrase Embedding → Language Decoder RNN → English Word2Vec → Descriptive Phrase
Composition of Neural Networks
Image → Raw Pixel Encoding → Image ConvNet → Joint Image/Phrase Embedding → unrolled RNN decoder → Word1, Word2, Word3, … (each via English Word2Vec)
NIC examples
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
Image captioning: ConvNet + Decoder RNN
☐Limits of embedding models
Phrase embeddings don’t work well
Why?
ℝ^512 is too small
You’re nuts!
Information content in embeddings
• How many points can be organized in ℝ^2?
Information content in embeddings
• How many points can be organized in 2 single-precision floats?
2^64 = 18,446,744,073,709,551,616
Information content in embeddings
• Only using 1 bit per dimension:
2^512 ≈ 10^154
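Both counts are quick to check:

```python
import math

# All bit patterns of two 32-bit floats:
print(2 ** 64)                      # 18446744073709551616
# Even 1 bit per dimension in R^512 is astronomically many points:
print(round(math.log10(2 ** 512)))  # 154 -- i.e. 2^512 ~ 10^154
```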
Cover’s Function Counting Theorem
(1965)
http://www.cns.nyu.edu/~eorhan/notes/covers-theorem.pdf
Simplification: up to O(N) points in ℝ^N are probably linearly separable.
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
Image captioning: ConvNet + Decoder RNN
☐Attention models
Seq2Seq Network
[Diagram: encoder RNNs read the English words "seriously", "powerful", "technique" into a Phrase Embedding; decoder RNNs emit the French words "technique", "au", "puissant", "sérieux".]
Seq2Seq with Attention
Attention Model
f(Decoder_state, Input_Word) -> [0,1]
How relevant is this input word to the current
output?
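A minimal sketch of such a relevance function (dot-product scoring plus a softmax; the shapes are hypothetical):

```python
import numpy as np

def attention_weights(decoder_state, encoder_states):
    """Score each input word against the decoder state; normalize to [0, 1]."""
    scores = encoder_states @ decoder_state  # one relevance score per input word
    exp = np.exp(scores - scores.max())      # numerically stable softmax
    return exp / exp.sum()

encoder_states = np.random.rand(5, 8)   # 5 input words, 8-d encoder states
decoder_state = np.random.rand(8)
weights = attention_weights(decoder_state, encoder_states)
print(round(weights.sum(), 6))  # 1.0 -- a distribution over the input words
```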
NMT attention
https://arxiv.org/pdf/1409.0473.pdf
Attention in NIC
https://arxiv.org/pdf/1502.03044.pdf
Attention in NIC
https://arxiv.org/pdf/1502.03044.pdf
Far from perfect
https://arxiv.org/pdf/1502.03044.pdf
Word Embeddings: Word2Vec
Image Embeddings: ConvNet
Phrase Embeddings: Encoder/Decoder RNN
Image captioning: ConvNet + Decoder RNN
Attention models
☐Amazon Applications
Product2Vec
[Diagram: two copies of one NN map each product's features (Title, Description) to a Product Embedding; the distance between the two embeddings is trained against Product 1-2 Similarity observed in aggregate customer behavior.]
Product Embeddings w/ Images
[Diagram: as in Product2Vec, but each product's Image also passes through a CNN to an Image Embedding, which joins the product features before the shared NN produces the Product Embedding.]
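The Product2Vec diagram can be sketched as a shared encoder trained so that embedding distance matches observed similarity (hypothetical sizes and data):

```python
import torch
import torch.nn as nn

# One shared network embeds both products' (title, description) features.
encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))

p1 = torch.rand(1, 300)                    # product 1 text features (stand-in)
p2 = torch.rand(1, 300)                    # product 2 text features
predicted = torch.norm(encoder(p1) - encoder(p2))  # embedding distance
observed = torch.tensor(0.2)               # similarity from customer behavior
loss = (predicted - observed) ** 2         # push distance toward the observation
loss.backward()                            # gradients flow into the shared encoder
```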
Analogies
[Image analogy: product A − product B + product C = product D]
Analogies
Dave's Killer Bread 21 Whole Grains Bread (2 loaves, USDA Organic)
− Stroehmann King Bread Loaf
+ Quaker Chewy Variety Pack (60 Granola Bars)
= Nature's Path Organic Chewy Granola Bars
Linear Combinations
[Image: a linear combination of product embeddings, e.g. 2·A + B ≈ C]
Image2Vec on Product Images
Understanding Points in High-Dimensional Space
• Excel
• Clustering – assigns integer values
• Projection – maps to 2D space (or 3D, 4D, etc)
– PCA
– “t-SNE” is a learned projection
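For example, projecting embeddings down to 2-D with scikit-learn's t-SNE (random vectors stand in for real image embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.rand(100, 64)       # 100 items, 64-d embeddings
points_2d = TSNE(n_components=2, perplexity=10,
                 init="pca", random_state=0).fit_transform(embeddings)
print(points_2d.shape)  # (100, 2)
```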
t-SNE Projection of Product Images
Lessons
• Embeddings need a context to have meaning
– Similarity & Distance become relevant
• Supervised ML can create useful embeddings
– Weak labels are often good enough
• Neural networks are composable
– Re-use network architectures or trained networks
• Attention mechanisms extend embeddings
– Embeddings have limited capacity.
– Attention provides interpretability
Think big!
It’s still day 1 for Deep Learning.