Alex Smola, Director of Machine Learning, AWS/Amazon, at MLconf SF 2016

Alex Smola is the Manager of the Cloud Machine Learning Platform at Amazon. Prior to his role at Amazon, Smola was a Professor in the Machine Learning Department of Carnegie Mellon University and cofounder and CEO of Marianas Labs. Prior to that he worked at Google Strategic Technologies, Yahoo Research, and National ICT Australia. Prior to joining CMU, he was a professor at UC Berkeley and the Australian National University. Alex obtained his PhD at TU Berlin in 1998. He has published over 200 papers and written or coauthored 5 books.

Abstract summary

Personalization and Scalable Deep Learning with MXNET: User return times and movie preferences are inherently time dependent. In this talk I will show how both can be modeled efficiently with deep learning, using an LSTM (Long Short-Term Memory) network. Moreover, I will show how to train large-scale distributed parallel models efficiently using MXNet. This includes a brief overview of the key components for defining networks and optimization, and a walkthrough of the steps required to allocate machines and train a model.



  1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Alexander Smola, AWS Machine Learning. Personalization and Scalable Deep Learning with MXNET
  2. Outline • Personalization • Latent Variable Models • User Engagement and Return Times • Deep Recommender Systems • MXNet • Basic concepts • Launching a cluster in a minute • Imagenet for beginners
  3. Personalization
  4. Latent Variable Models • Temporal sequence of observations (purchases, likes, app use, e-mails, ad clicks, queries, ratings) • Latent state to explain behavior • Clusters (navigational, informational queries in search) • Topics (interest distributions for users over time) • Kalman filter (trajectory and location modeling) [diagram: latent explanation vs. observed action]
  5. Latent Variable Models • Temporal sequence of observations (purchases, likes, app use, e-mails, ad clicks, queries, ratings) • Latent state to explain behavior • Clusters (navigational, informational queries in search) • Topics (interest distributions for users over time) • Kalman filter (trajectory and location modeling) [diagram: latent explanation vs. observed action] Are the parametric models really true?
  6. Latent Variable Models • Temporal sequence of observations (purchases, likes, app use, e-mails, ad clicks, queries, ratings) • Latent state to explain behavior • Nonparametric / spectral model • Use data to determine the shape • Sidestep approximate inference • h_t = f(x_{t-1}, h_{t-1}), x_t = g(x_{t-1}, h_t)
  7. Latent Variable Models • Temporal sequence of observations (purchases, likes, app use, e-mails, ad clicks, queries, ratings) • Latent state to explain behavior • Plain deep network = RNN • Deep network with attention = LSTM / GRU … (learn when to update state, how to read out)
  8. Long Short-Term Memory (Hochreiter and Schmidhuber, 1997): i_t = σ(W_i(x_t, h_t) + b_i), f_t = σ(W_f(x_t, h_t) + b_f), z_{t+1} = f_t · z_t + i_t · tanh(W_z(x_t, h_t) + b_z), o_t = σ(W_o(x_t, h_t, z_{t+1}) + b_o), h_{t+1} = o_t · tanh(z_{t+1})
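   A minimal NumPy sketch of the cell equations above, for illustration only (the weight matrices W_i, W_f, W_z, W_o and biases are assumed given; a gate acting on (x_t, h_t) is implemented as a matrix applied to their concatenation):

     import numpy as np

     def sigmoid(a):
         return 1.0 / (1.0 + np.exp(-a))

     def lstm_step(x_t, h_t, z_t, W_i, b_i, W_f, b_f, W_z, b_z, W_o, b_o):
         # One step of the slide's equations: x_t input, h_t hidden state, z_t cell state.
         xh = np.concatenate([x_t, h_t])
         i_t = sigmoid(W_i @ xh + b_i)                              # input gate
         f_t = sigmoid(W_f @ xh + b_f)                              # forget gate
         z_next = f_t * z_t + i_t * np.tanh(W_z @ xh + b_z)         # cell update
         o_t = sigmoid(W_o @ np.concatenate([xh, z_next]) + b_o)    # output gate
         h_next = o_t * np.tanh(z_next)                             # new hidden state
         return z_next, h_next, o_t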
  9. Long Short-Term Memory (Hochreiter and Schmidhuber, 1997): (z_{t+1}, h_{t+1}, o_t) = LSTM(z_t, h_t, x_t). Treat it as a black box.
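   Treating the cell as a black box in code: a hedged sketch with MXNet's pre-Gluon symbolic RNN API (assuming mx.rnn.LSTMCell is available in the installed version; one call advances the state by one step):

     import mxnet as mx

     lstm = mx.rnn.LSTMCell(num_hidden=128)     # the black box: state update + readout
     states = lstm.begin_state()                # initial (h, z)
     x_t = mx.sym.Variable('x_t')               # embedded observation at time t
     output, states = lstm(x_t, states)         # one step: returns o_t and the new state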
  10. User Engagement [figure: app sessions opened at 9:01, 8:55, 11:50, 12:30; will the next return be soon, next week, or never? (app frames from toutiao.com)]
  11. User Engagement Modeling • User engagement is gradual • Daily average users? • Weekly average users? • Number of active users? • Number of users? • Abandonment is passive • When did you last tweet? Pin? Like? Skype? • Churn models assume active abandonment (insurance, phone, bank)
  12. User Engagement Modeling • User engagement is gradual • Model user returns • Context of activity • World events (elections, Super Bowl, …) • User habits (morning reader, night owl) • Previous reading behavior (poor-quality content will discourage return)
  13. Survival Analysis 101 • Model a population where something dramatic happens • Cancer patients (death; efficacy of a drug) • Atoms (radioactive decay) • Japanese women (marriage) • Users (opening the app) • Survival probability: given the hazard rate function λ(t), Pr(t_survival ≥ T) = exp(−∫_0^T λ(t) dt)
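   An illustrative sketch (not from the talk) of the formula above: given a hazard rate sampled on a time grid, the survival curve is the exponential of the negative cumulative integral of the rate. The rate below is a made-up daily pattern, purely for illustration.

     import numpy as np

     # Hypothetical hazard rate on an hourly grid over 80 hours.
     t = np.linspace(0.0, 80.0, 801)
     lam = 0.05 + 0.02 * np.sin(2 * np.pi * t / 24.0) ** 2

     # Pr(survival >= T) = exp(-integral_0^T lambda(s) ds), via the trapezoid rule.
     increments = 0.5 * (lam[1:] + lam[:-1]) * np.diff(t)
     cum_hazard = np.concatenate([[0.0], np.cumsum(increments)])
     survival = np.exp(-cum_hazard)             # survival[i] = Pr(survival >= t[i])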
  14. Session Model • User activity is a sequence of times • b_i when the app is opened • e_i when the app is closed • In between, wait for the user to return • Model the likelihood of user activity [diagram: session start and end times]
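   For intuition, a hedged sketch (not code from the paper): with a hazard rate λ(t) for the waiting time between closing the app at e_i and reopening it at b_{i+1}, the log-likelihood of an observed gap t = b_{i+1} − e_i is log λ(t) − ∫_0^t λ(s) ds.

     import numpy as np

     def gap_log_likelihood(gap_hours, rate_fn, grid_step=0.1):
         # log p(gap) = log(lambda(gap)) - integral_0^gap lambda(s) ds,
         # where rate_fn is any callable returning the hazard rate at time s.
         s = np.arange(0.0, gap_hours + grid_step, grid_step)
         cum_hazard = np.trapz(rate_fn(s), s)
         return np.log(rate_fn(gap_hours)) - cum_hazard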
  15. Session Model [Fig. 1: a personalized time-aware architecture for survival analysis; one-hot UserID and TimeID indices pass through lookup tables into user and time embeddings, which together with external features feed hidden layers (Hidden1, Hidden2) that predict the (quantized) rate values for the next session, given the data from the previous session.]
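   A hedged sketch of the embedding-lookup part of that figure, written with MXNet's symbolic API (the vocabulary sizes and dimensions are illustrative assumptions, not values from the paper):

     import mxnet as mx

     user_id = mx.sym.Variable('user_id')       # integer user index (conceptually one-hot)
     time_id = mx.sym.Variable('time_id')       # integer time-slot index
     features = mx.sym.Variable('external_features')

     # Lookup tables: index -> dense embedding vector.
     user_emb = mx.sym.Embedding(data=user_id, input_dim=100000, output_dim=64,
                                 name='user_embedding')
     time_emb = mx.sym.Embedding(data=time_id, input_dim=168, output_dim=16,
                                 name='time_embedding')

     # Concatenate and feed the hidden layers that predict the next-session rate.
     joint = mx.sym.Concat(user_emb, time_emb, features, dim=1)
     hidden1 = mx.sym.FullyConnected(joint, num_hidden=256, name='hidden1')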
  16. Personalized LSTM [Fig. 2: unfolded LSTM network across three sessions; the input vector for session s is the concatenation of the user embedding, the time-slot embedding, and …] • LSTM for global state update • LSTM for individual state update • Update both of them • Learn using backprop and SGD. Jing and Smola, WSDM'17
  17. Perplexity (quality of prediction) [Fig. 6: histogram of the time between two sessions, Toutiao on top and Last.fm on the bottom; the small bump around 24 hours corresponds to users with a daily habit of opening the app at the same time.]
   Models compared: • global constant: a static model with a single parameter, assuming the rate is constant throughout the time frame for all users • global+user constant: a static model where the rate is an additive function of a global constant and a user-specific constant • piecewise constant: a more flexible static model that learns a parameter for each discretized bin • Hawkes process: a self-exciting point process that respects past sessions • integrated model: a combination of all the components above • DNN: the rate is a function of time, user, and session features, parameterized by a deep neural network • LSTM: a recurrent neural network that incorporates past activities • Cox's model: hazard rate λ_u(t) = λ_0(t) · exp(⟨w, x_u(t)⟩)
   Perplexity: perp = exp( −(1/M) Σ_{u=1}^{m} Σ_{i=1}^{m_u} log p({b_i, e_i}; λ) ), where M is the total number of sessions in the test set. The lower the value, the better the model explains the test data; perplexity measures the amount of surprise in a user's behavior relative to our prediction.
   Table 1: average perplexity on the test set.
     model                Toutiao   Last.fm
     Cox model            27.13     28.31
     global constant      45.29     59.98
     user constant        28.74     45.44
     piecewise constant   26.88     26.12
     Hawkes process       22.58     30.80
     integrated model     21.56     26.06
     DNN                  18.87     20.62
     LSTM                 18.10     19.80
   There is a big gap between the linear models and the two deep models; the Cox model is inferior to the integrated model and significantly worse than the deep networks.
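   A hedged sketch of the perplexity formula above (the per-session log-likelihoods are assumed to come from whichever model is being evaluated):

     import numpy as np

     def perplexity(log_likelihoods):
         # perp = exp(-(1/M) * sum_i log p(session_i)), one entry per test session.
         M = len(log_likelihoods)
         return float(np.exp(-np.sum(log_likelihoods) / M))

     # A model that explains the test sessions better gets a lower perplexity.
     print(perplexity(np.log([0.20, 0.25, 0.30])))    # ~4.1
     print(perplexity(np.log([0.05, 0.04, 0.06])))    # ~20.3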
  18. Perplexity (quality of prediction) [Fig. 7: average test perplexity as a function of the fraction of sessions, Toutiao in the left column and Last.fm in the right, comparing global constant, user constant, piecewise constant, Hawkes process, integrated, Cox, DNN, and LSTM; bottom row: relative improvement (%) of the LSTM over the integrated and Cox models.] Jing and Smola, WSDM'17
  19. [Fig. 9: six randomly sampled learned predictive rate functions, three from Toutiao (left) and three from Last.fm (right); each pair of plots shows the instantaneous rate λ(t) (purple), the survival function Pr(return ≥ t) (red), and the actual return time (blue).]
  20. Recommender Systems
  21. Recommender systems, not recommender archaeology [diagram: a (users × items) matrix over time; use this (past) to predict that (future) relative to NOW; don't predict this (archaeology)]
  22. The Netflix contest got it wrong …
  23. Getting it right • change in taste and expertise (LSTM) • change in perception and novelty (LSTM). Wu et al., WSDM'17
  24. Wu et al., WSDM'17
  25. Prizes
  26. Sanity Check
  27. Deep Learning with MXNet
  28. Caffe, Torch, Theano, TensorFlow, CNTK, Keras, Paddle (image: Banksy/Wikipedia). Why yet another deep networks tool?
  29. Why yet another deep networks tool? • Frugality & resource efficiency: engineered for cheap GPUs with smaller memory and slow networks • Speed • Linear scaling with #machines and #GPUs • High efficiency on a single machine, too (C++ backend) • Simplicity: mix declarative and imperative code [diagram: a single implementation of the backend system and common operators gives a performance guarantee regardless of which frontend language is used]
  30. Imperative Programs
     import numpy as np
     a = np.ones(10)
     b = np.ones(10) * 2
     c = b * a
     print(c)
     d = c + 1        # easy to tweak with Python code
   Pro: • Straightforward and flexible • Takes advantage of native language features (loops, conditions, debugger). Con: • Hard to optimize
  31. Declarative Programs
     A = Variable('A')
     B = Variable('B')
     C = B * A
     D = C + 1
     f = compile(D)
     d = f(A=np.ones(10), B=np.ones(10)*2)
   Pro: • More chances for optimization • Works across different languages. Con: • Less flexible. [computation graph: A and B feed ⨉ to give C, then +1 gives D; C can share memory with D because C is deleted later]
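   For reference, a hedged sketch of the same declarative example written with MXNet's symbolic API (pre-Gluon; "compilation" happens when the graph is bound to concrete shapes and a device):

     import mxnet as mx
     import numpy as np

     A = mx.sym.Variable('A')
     B = mx.sym.Variable('B')
     C = B * A
     D = C + 1

     exe = D.simple_bind(ctx=mx.cpu(), A=(10,), B=(10,))    # bind = compile for given shapes
     exe.arg_dict['A'][:] = np.ones(10)
     exe.arg_dict['B'][:] = np.ones(10) * 2
     exe.forward()
     print(exe.outputs[0].asnumpy())                        # [3. 3. ... 3.]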
  32. Imperative vs. Declarative for Deep Learning • The computational graph of the deep architecture (forward / backward) needs heavy optimization: this fits declarative programs • Updates and interactions with the graph need mutation and more native language features: good for imperative programs • Iteration loops • Parameter update w ← w − η ∂_w f(w) • Beam search • Feature extraction …
  33. Mixed Style Training Loop in MXNet
     executor = neuralnetwork.bind()          # executor is bound from the declarative program that describes the network
     for i in range(3):
         train_iter.reset()
         for dbatch in train_iter:
             args["data"][:] = dbatch.data[0]           # imperative NDArrays can be set as input nodes of the graph
             args["softmax_label"][:] = dbatch.label[0]
             executor.forward(is_train=True)
             executor.backward()
             for key in update_keys:
                 args[key] -= learning_rate * grads[key]    # imperative parameter update on the GPU
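   The slide leaves args, grads, and update_keys implicit; a hedged sketch of where they would come from with the same pre-Gluon executor API (the input names and shapes here are assumptions for illustration):

     import mxnet as mx

     batch_size = 128
     # `neuralnetwork` is assumed to be an mx.sym.Symbol ending in a softmax output.
     executor = neuralnetwork.simple_bind(ctx=mx.gpu(0), data=(batch_size, 784),
                                          grad_req='write')
     args = executor.arg_dict       # name -> NDArray for inputs and parameters
     grads = executor.grad_dict     # name -> NDArray for the corresponding gradients
     update_keys = [k for k in args if k not in ('data', 'softmax_label')]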
  34. Mixed API for Quick Extensions • Runtime switching between different graphs depending on the input • Useful for sequence modeling and image-size reshaping • Uses imperative code in Python (about 10 additional lines of Python) • Bucketing for variable-length sentences (see the sketch below)
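   A simplified, hypothetical sketch of the bucketing idea: keep one compiled graph per sequence length and pick the right one at runtime (build_unrolled_lstm is a stand-in for whatever builds the unrolled symbol; later MXNet versions also provide mx.mod.BucketingModule to automate this):

     import mxnet as mx

     batch_size = 32
     executors = {}                 # one bound graph per sentence length (bucket)

     def get_executor(seq_len):
         # Switch between graphs at runtime depending on the input length.
         if seq_len not in executors:
             sym = build_unrolled_lstm(seq_len)        # hypothetical graph builder
             executors[seq_len] = sym.simple_bind(ctx=mx.gpu(0),
                                                  data=(batch_size, seq_len))
         return executors[seq_len]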
  35. 3D Image Construction: Deep3D, 100 lines of Python code, https://github.com/piiswrong/deep3d
  36. Distributed Deep Learning
  37. Distributed Deep Learning
  38. Distributed Deep Learning
     ## train
     num_gpus = 4
     gpus = [mx.gpu(i) for i in range(num_gpus)]    # 2 lines for multi-GPU
     model = mx.model.FeedForward(
         ctx = gpus,
         symbol = softmax,
         num_round = 20,
         learning_rate = 0.01,
         momentum = 0.9,
         wd = 0.00001)
     model.fit(X = train, eval_data = val,
               batch_end_callback = mx.callback.Speedometer(batch_size=batch_size))
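   For the multi-machine case, a hedged sketch assuming a launched parameter-server cluster and the same pre-Gluon API: the only change is passing a distributed key-value store to fit, so gradients are aggregated on the kvstore servers instead of locally.

     import mxnet as mx

     kv = mx.kvstore.create('dist_sync')    # synchronous parameter server
     model.fit(X=train, eval_data=val, kvstore=kv,
               batch_end_callback=mx.callback.Speedometer(batch_size=batch_size))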
  39. Scaling on p2.16xlarge [charts: average throughput per GPU and aggregate throughput vs. number of GPUs, with GPU-GPU sync, for alexnet, inception-v3, and resnet-50; the plots report 108x and 75x speedups]
  40. Demo
  41. Getting Started • Website
 http://mxnet.io/ • GitHub repository
 git clone --recursive git@github.com:dmlc/mxnet.git • Docker
 docker pull dmlc/mxnet • Amazon AWS Deep Learning AMI (with other toolkits & anaconda)
 https://aws.amazon.com/marketplace/pp/B01M0AXXQB
 http://bit.ly/deepami • CloudFormation Template
 https://github.com/dmlc/mxnet/tree/master/tools/cfn 
 http://bit.ly/deepcfn
  42. Acknowledgements • User engagement
 How Jing, Chao-Yuan Wu • Temporal recommenders
 Chao-Yuan Wu, Alex Beutel, Amr Ahmed • MXNet & Deep Learning AMI
 Mu Li, Tianqi Chen, Bing Xu, Eric Xie, Joseph Spisak, Naveen Swamy, Anirudh Subramanian and many more … We are hiring {smola, thakerb, spisakj}@amazon.com
