Alex Smola is the Manager of the Cloud Machine Learning Platform at Amazon. Prior to his role at Amazon, Smola was a Professor in the Machine Learning Department of Carnegie Mellon University and cofounder and CEO of Marianas Labs. Prior to that he worked at Google Strategic Technologies, Yahoo Research, and National ICT Australia. Prior to joining CMU, he was a professor at UC Berkeley and the Australian National University. Alex obtained his PhD at TU Berlin in 1998. He has published over 200 papers and written or coauthored 5 books.
Abstract summary
Personalization and Scalable Deep Learning with MXNet: User return times and movie preferences are inherently time dependent. In this talk I will show how this temporal dependence can be modeled efficiently with deep learning by employing an LSTM (Long Short-Term Memory) network. Moreover, I will show how to train large-scale distributed parallel models efficiently using MXNet. This includes a brief overview of the key components for defining networks and for optimization, and a walkthrough of the steps required to allocate machines and to train a model.
2. Outline
• Personalization
• Latent Variable Models
• User Engagement and Return Times
• Deep Recommender Systems
• MXNet
• Basic concepts
• Launching a cluster in a minute
• ImageNet for beginners
4. Latent Variable Models
• Temporal sequence of observations
Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Clusters (navigational, informational queries in search)
• Topics (interest distributions for users over time)
• Kalman Filter (trajectory and location modeling)
(Diagram: latent explanation → observed action)
5. Latent Variable Models
• Temporal sequence of observations
Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Clusters (navigational, informational queries in search)
• Topics (interest distributions for users over time)
• Kalman Filter (trajectory and location modeling)
Are the parametric models really true?
6. Latent Variable Models
• Temporal sequence of observations
Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Nonparametric model / spectral
• Use data to determine shape
• Sidestep approximate inference
(Diagram: observed x, latent h)
h_t = f(x_{t-1}, h_{t-1})
x_t = g(x_{t-1}, h_t)
7. Latent Variable Models
• Temporal sequence of observations
Purchases, likes, app use, e-mails, ad clicks, queries, ratings
• Latent state to explain behavior
• Plain deep network = RNN
• Deep network with attention = LSTM / GRU …
(learn when to update state, how to read out)
8. Long Short Term Memory
Hochreiter and Schmidhuber, 1997
i_t = σ(W_i(x_t, h_t) + b_i)
f_t = σ(W_f(x_t, h_t) + b_f)
z_{t+1} = f_t · z_t + i_t · tanh(W_z(x_t, h_t) + b_z)
o_t = σ(W_o(x_t, h_t, z_{t+1}) + b_o)
h_{t+1} = o_t · tanh z_{t+1}
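As a concrete illustration, here is a minimal NumPy sketch of the cell update above (weight shapes, initialization, and the toy sequence are assumptions for illustration; the output gate sees z_{t+1} exactly as in the equations):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x_t, h_t, z_t, params):
    # gates act on the concatenated input and hidden state (x_t, h_t)
    xh = np.concatenate([x_t, h_t])
    i_t = sigmoid(params['Wi'] @ xh + params['bi'])                        # input gate
    f_t = sigmoid(params['Wf'] @ xh + params['bf'])                        # forget gate
    z_next = f_t * z_t + i_t * np.tanh(params['Wz'] @ xh + params['bz'])   # cell state
    o_t = sigmoid(params['Wo'] @ np.concatenate([xh, z_next]) + params['bo'])  # output gate sees z_{t+1}
    h_next = o_t * np.tanh(z_next)                                         # new hidden state
    return z_next, h_next, o_t

dx, dh = 3, 4
rng = np.random.default_rng(0)
params = {
    'Wi': 0.1 * rng.normal(size=(dh, dx + dh)), 'bi': np.zeros(dh),
    'Wf': 0.1 * rng.normal(size=(dh, dx + dh)), 'bf': np.zeros(dh),
    'Wz': 0.1 * rng.normal(size=(dh, dx + dh)), 'bz': np.zeros(dh),
    'Wo': 0.1 * rng.normal(size=(dh, dx + dh + dh)), 'bo': np.zeros(dh),
}
z, h = np.zeros(dh), np.zeros(dh)
for x_t in rng.normal(size=(5, dx)):   # a short input sequence
    z, h, o = lstm_step(x_t, h, z, params)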
9. Long Short Term Memory
Hochreiter and Schmidhuber, 1997
(z_{t+1}, h_{t+1}, o_t) = LSTM(z_t, h_t, x_t)
Treat it as a black box
11. User Engagement Modeling
• User engagement is gradual
• Daily average users?
• Weekly average users?
• Number of active users?
• Number of users?
• Abandonment is passive
• When did you last tweet? Pin? Like? Skype?
• Churn models assume active abandonment
(insurance, phone, bank)
12. User Engagement Modeling
• User engagement is gradual
• Model user returns
• Context of activity
• World events (elections, Super Bowl, …)
• User habits (morning reader, night owl)
• Previous reading behavior
(poor quality content will discourage return)
13. Survival Analysis 101
• Model population where something dramatic happens
• Cancer patients (death; efficacy of a drug)
• Atoms (radioactive decay)
• Japanese women (marriage)
• Users (opens app)
• Survival probability
Pr(t_survival ≥ T) = exp( − ∫_0^T λ(t) dt )   (2)
where λ(t) is the hazard rate function.
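As a concrete illustration (a minimal sketch, not from the talk): with a piecewise-constant hazard rate the integral above reduces to a sum, so the survival probability is easy to evaluate.

import numpy as np

def survival_prob(T, bin_edges, rates):
    # Pr(t_survival >= T) for a piecewise-constant hazard rate.
    # bin_edges: increasing array of length len(rates)+1, starting at 0
    # rates: hazard rate lambda within each bin
    lo = np.minimum(bin_edges[:-1], T)
    hi = np.minimum(bin_edges[1:], T)
    integral = np.sum(rates * (hi - lo))
    return np.exp(-integral)

# e.g. hazard 0.1/hour for the first 12 hours, 0.3/hour afterwards (up to 48h)
print(survival_prob(24.0, np.array([0.0, 12.0, 48.0]), np.array([0.1, 0.3])))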
14. Session Model
• User activity is a sequence of times
• b_i when the app is opened
• e_i when the app is closed
• In between, wait for the user to return
• Model the likelihood of user activity
(Diagram: session start and end markers on a timeline)
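A minimal sketch of how such session data might be represented (the list layout and the toy numbers are illustrative assumptions): sessions are (b_i, e_i) pairs, and what the return-time model has to explain are the gaps between one session's end and the next session's start.

# sessions for one user: (begin, end) timestamps in hours, sorted by begin time
sessions = [(0.0, 0.5), (9.0, 9.3), (33.5, 34.0)]

durations = [e - b for b, e in sessions]                            # time spent in the app
gaps = [nb - e for (_, e), (nb, _) in zip(sessions, sessions[1:])]  # waiting times between sessions
print(durations)   # ≈ [0.5, 0.3, 0.5]
print(gaps)        # ≈ [8.5, 24.2]  -> what the survival model predicts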
16. Personalized LSTM
Fig. 2. Unfolded LSTM network for 3 sessions. The input vector for session s is the concatenation of the user embedding, the time slot embedding, and …
• LSTM for global state update
• LSTM for individual state update
• Update both of them
• Learn using backprop and SGD
Jing and Smola, WSDM’17
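A minimal sketch of the two-level state update described above (simple tanh cells stand in for the two LSTMs, and the way the global and per-user states are combined is an illustrative assumption, not the paper's exact architecture):

import numpy as np

rng = np.random.default_rng(0)
dx, dh = 3, 4
Wg = 0.1 * rng.normal(size=(dh, dx + dh))       # "global" cell weights (stand-in for an LSTM)
Wu = 0.1 * rng.normal(size=(dh, dx + dh + dh))  # "per-user" cell weights (stand-in for an LSTM)

def session_step(x_t, h_user, h_global):
    # update the state shared across all users, then the user-specific state
    h_global = np.tanh(Wg @ np.concatenate([x_t, h_global]))
    h_user = np.tanh(Wu @ np.concatenate([x_t, h_global, h_user]))
    readout = np.concatenate([h_global, h_user])  # would feed the return-rate head
    return h_user, h_global, readout

h_u, h_g = np.zeros(dh), np.zeros(dh)
for x_t in rng.normal(size=(3, dx)):              # features of three consecutive sessions
    h_u, h_g, readout = session_step(x_t, h_u, h_g)

Both states would be trained jointly with backpropagation and SGD, as on the slide.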
17. Perplexity (quality of prediction)
Fig. 6. Histogram of the time between two sessions (top: Toutiao, bottom: Last.fm). The small bump around 24 hours corresponds to users having a daily habit of using the app at the same time.
Baselines:
• global constant model: a static model with only one parameter, assuming the rate is constant throughout the time frame for all users.
• global+user constant model: a static model that assumes the rate is an additive function of a global constant and a user-specific constant.
• piecewise constant model: a more flexible static model that learns a parameter for each discretized bin.
• Hawkes process: a self-exciting point process that respects past sessions.
• integrated model: a combined model with all of the above components.
• DNN: a model that assumes the rate is a function of time, user, and session features, parameterized by a deep neural network.
• LSTM: a recurrent neural network that incorporates past activities.
For completeness, we also report the result for Cox's model, where the hazard rate is given by
λ_u(t) = λ_0(t) exp(⟨w, x_u(t)⟩)   (28)
perp = exp( − (1/M) Σ_{u=1}^{m} Σ_{i=1}^{m_u} log p({b_i, e_i}; λ) )   (29)
where M is the total number of sessions in the test set. The lower the value, the better the model is at explaining the test data. In other words, perplexity measures the amount of surprise in a user's behavior relative to our prediction. A good model predicts well, hence there is less surprise.
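A minimal sketch of the metric itself (assuming the per-session log-likelihoods log p({b_i, e_i}; λ) have already been computed by the model under evaluation):

import numpy as np

def perplexity(session_log_likelihoods):
    # exp of the negative average per-session log-likelihood;
    # M is the total number of sessions in the test set
    ll = np.asarray(session_log_likelihoods)
    return np.exp(-ll.sum() / ll.size)

print(perplexity([-3.1, -2.7, -3.4]))   # lower is better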
6.6 Model Comparison
The summarized results are shown in Table 1. As can be seen from the table, there is a big gap between the linear models and the two deep models. The Cox model is inferior to our integrated model and significantly worse than the deep networks.

model                Toutiao   Last.fm
Cox model            27.13     28.31
global constant      45.29     59.98
user constant        28.74     45.44
piecewise constant   26.88     26.12
Hawkes process       22.58     30.80
integrated model     21.56     26.06
DNN                  18.87     20.62
LSTM                 18.10     19.80

TABLE 1. Average perplexity evaluated on the test set for different models.
18. Perplexity (quality of prediction)
Fig. 7. Top row: average test perplexity as a function of the fraction of observed sessions, comparing the global constant, user constant, piecewise constant, Hawkes process, integrated, Cox, DNN, and LSTM models. Bottom row: relative improvement (%) of the LSTM over the integrated and the Cox model. Left column: Toutiao; right column: Last.fm.
Jing and Smola, WSDM'17
19. Learned predictive rate functions
Fig. 9. Six randomly sampled learned predictive rate functions: three from Toutiao (left) and three from Last.fm (right). Each pair of panels shows the instantaneous rate λ(t) (purple) and the survival function Pr(return ≥ t) (red) over t (hours), together with the actual return time (blue).
29. Why yet another deep network tool?
• Frugality & resource efficiency
Engineered for cheap GPUs with smaller memory, slow networks
• Speed
• Linear scaling with #machines and #GPUs
• High efficiency on single machine, too (C++ backend)
• Simplicity
Mix declarative and imperative code
• A single implementation of the backend system and common operators
• Performance guarantees regardless of which frontend language is used
(Diagram: multiple language frontends sharing one C++ backend)
30. Imperative Programs
import numpy as np
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
print(c)
d = c + 1

Easy to tweak with Python code

Pro
• Straightforward and flexible
• Takes advantage of native language features (loops, conditionals, debugger)
Con
• Hard to optimize
31. Declarative Programs
A = Variable('A')
B = Variable('B')
C = B * A
D = C + 1
f = compile(D)
d = f(A=np.ones(10),
      B=np.ones(10)*2)

Pro
• More opportunities for optimization
• Works across different languages
Con
• Less flexible

(Computation graph: A, B → ⨉ → C → +1 → D)
C can share memory with D, because C is not needed afterwards.
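A rough MXNet counterpart of this declarative example (a sketch assuming the classic mxnet.sym / Executor interface of MXNet 1.x; shapes and values are illustrative):

import mxnet as mx

# declare the computation graph D = A * B + 1
A = mx.sym.Variable('A')
B = mx.sym.Variable('B')
C = B * A
D = C + 1

# bind the graph to concrete arrays, then execute it
executor = D.bind(ctx=mx.cpu(),
                  args={'A': mx.nd.ones((10,)),
                        'B': mx.nd.ones((10,)) * 2})
out = executor.forward()
print(out[0].asnumpy())   # ten entries equal to 3.0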
32. Imperative vs. Declarative for Deep Learning
Computational Graph
of the Deep Architecture
forward backward
Needs heavy optimization,
fits declarative programs
Needs mutation and more
language native features, good for
imperative programs
Updates and Interactions
with the graph
• Iteration loops
• Parameter update
• Beam search
• Feature extraction …
w ← w − η ∂_w f(w)
33. Mixed Style Training Loop in MXNet
executor = neuralnetwork.bind()
for i in range(3):
    train_iter.reset()
    for dbatch in train_iter:
        args["data"][:] = dbatch.data[0]
        args["softmax_label"][:] = dbatch.label[0]
        executor.forward(is_train=True)
        executor.backward()
        for key in update_keys:
            args[key] -= learning_rate * grads[key]

• The executor is bound from a declarative program that describes the network
• Imperative NDArrays can be set as input nodes to the graph
• Imperative parameter update on the GPU
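For context, a sketch of how the objects used in the loop above might be created (the tiny network, shapes, and synthetic data iterator are illustrative assumptions based on MXNet 1.x's Symbol API, not the talk's exact setup):

import mxnet as mx

# a tiny declarative network: one fully connected layer with a softmax output
data = mx.sym.Variable('data')
fc = mx.sym.FullyConnected(data=data, num_hidden=10, name='fc')
neuralnetwork = mx.sym.SoftmaxOutput(data=fc, name='softmax')

# bind the graph; MXNet allocates the argument and gradient arrays
executor = neuralnetwork.simple_bind(ctx=mx.cpu(), data=(32, 100))
args, grads = executor.arg_dict, executor.grad_dict
update_keys = [k for k in args if k not in ('data', 'softmax_label')]

# a synthetic data iterator and the learning rate used in the loop
train_iter = mx.io.NDArrayIter(mx.nd.random.uniform(shape=(320, 100)),
                               mx.nd.zeros((320,)), batch_size=32,
                               label_name='softmax_label')
learning_rate = 0.1

Here simple_bind stands in for the abbreviated bind() call on the slide, since it also allocates the args and grads dictionaries that the loop mutates.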
34. Mixed API for Quick Extensions
• Runtime switching between different graphs depending on input
• Useful for sequence modeling and image size reshaping
• Uses imperative code in Python: roughly 10 additional lines of Python code
• Bucketing: variable-length sentences
41. Getting Started
• Website
http://mxnet.io/
• GitHub repository
git clone --recursive git@github.com:dmlc/mxnet.git
• Docker
docker pull dmlc/mxnet
• Amazon AWS Deep Learning AMI (with other toolkits & anaconda)
https://aws.amazon.com/marketplace/pp/B01M0AXXQB
http://bit.ly/deepami
• CloudFormation Template
https://github.com/dmlc/mxnet/tree/master/tools/cfn
http://bit.ly/deepcfn
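A quick smoke test once any of these options is set up (assuming the mxnet Python package is importable; switch the context to a GPU if one is available):

import mxnet as mx

# create a small array and run a basic computation on the CPU
a = mx.nd.ones((2, 3))
b = a * 2 + 1
print(b.asnumpy())              # a 2x3 array of 3.0

# on a GPU-equipped machine (e.g. the Deep Learning AMI), use a GPU context instead:
# a = mx.nd.ones((2, 3), ctx=mx.gpu(0))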
42. Acknowledgements
• User engagement
How Jing, Chao-Yuan Wu
• Temporal recommenders
Chao-Yuan Wu, Alex Beutel, Amr Ahmed
• MXNet & Deep Learning AMI
Mu Li, Tianqi Chen, Bing Xu, Eric Xie, Joseph Spisak,
Naveen Swamy, Anirudh Subramanian and many more …
We are hiring
{smola, thakerb, spisakj}@amazon.com