Building Content Recommendation Systems Using Apache MXNet and Gluon - MCL402 - re:Invent 2017

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Building Content Recommendation
Systems Using Apache MXNet and Gluon
Cyrus M Vahid
cyrusmv@amazon.com
Principal Deep Learning Solution Architect @ Amazon AI
December 1, 2017

[Screenshots: My Profile – amazon.co.uk; My Profile – amazon.de]
Motivation

• Recommender systems are an important
mechanism to personalize and enhance customer
experience.
• Recommender systems can be applied to achieve
different goals:
  • Increasing the time spent on a platform
  • Bringing to a customer's attention
  that an item might need complementary
  pieces
• At the heart of it all, the intention is to give
the person interacting with the system the results
and content most relevant to that person.
Motivation

Collaborative Filtering Using Matrix Factorization
Step by Step Coding
Challenges
Deep Structured Semantic Models (DSSM)
Production Notes
Common Approaches to Build a Recommender System

• User-based recommenders try to identify a short
list of other users who are "similar" to you and who
have rated this new item. They then take a
weighted average of the ratings for the item from
these users and predict that number as your likely
rating for that item.
• Item-based recommenders try to identify a short
list of other items that are "similar" to the item
in question, take a weighted average of
the ratings you gave those top few items,
and predict that number as your likely
rating for that item.
Collaborative Filtering
https://www.oreilly.com/ideas/deep-matrix-factorization-using-apache-mxnet?cmp=tw-data-na-article-engagement_sponsored+kibird
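
The user-based approach above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy ratings, the user names, the choice of cosine similarity, and the helper names `cosine` and `predict_user_based` are all hypothetical and do not come from the talk.

```python
import math

# Toy ratings: user -> {item: rating}. All values are made up for illustration.
ratings = {
    "you":   {"a": 5, "b": 3, "c": 4},
    "alice": {"a": 5, "b": 3, "c": 4, "new": 4},
    "bob":   {"a": 1, "b": 5, "c": 2, "new": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def predict_user_based(target, item, k=2):
    """Similarity-weighted average of the item's ratings from the k most similar users."""
    neighbours = [
        (cosine(ratings[target], other), other[item])
        for name, other in ratings.items()
        if name != target and item in other
    ]
    neighbours.sort(reverse=True)          # most similar users first
    top = neighbours[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None                        # no usable neighbours
    return sum(sim * rating for sim, rating in top) / total

prediction = predict_user_based("you", "new")
print(prediction)
```

Because "alice" has rated the shared items exactly as "you" did, her rating of the new item carries more weight than "bob"'s, so the prediction lands closer to her rating. An item-based recommender would apply the same weighted average over similar items instead of similar users.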

Collaborative Filtering

• User-Item matrices are very sparse. For instance,
amazon.com provides a catalogue of millions of
items, but each user has rated only a handful of
items.
• The intention of a recommendation engine is to
complete the sparse matrix based on the known
data.
• Matrix Factorization factorizes a matrix into
separate matrices that, when multiplied,
approximate the completed matrix.
• For the sake of efficiency we would like to
factorize the matrix into a tall matrix and a wide matrix.
Matrix Factorization
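
As a rough illustration of the idea, the sketch below factorizes a tiny rating matrix with stochastic gradient descent in plain Python. The observed ratings, the number of latent factors `k`, the learning rate, and the number of passes are arbitrary assumptions chosen for the example, not values from the talk.

```python
import random

random.seed(0)

# Known ratings as (user, item, rating); every other cell is missing.
observed = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
            (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2

# P is the tall matrix (users x k). Q holds one k-vector per item,
# i.e. it is the transpose of the wide matrix (k x items).
P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
Q = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    """Predicted rating: inner product of the user and item factor vectors."""
    return sum(P[u][f] * Q[i][f] for f in range(k))

def mse():
    """Mean squared error over the observed cells only."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in observed) / len(observed)

initial_error = mse()
lr = 0.05
for _ in range(2000):
    for u, i, r in observed:
        err = r - predict(u, i)
        for f in range(k):
            p, q = P[u][f], Q[i][f]
            P[u][f] += lr * err * q   # gradient step on the user factor
            Q[i][f] += lr * err * p   # gradient step on the item factor
final_error = mse()

# Multiplying the factors fills in every cell, including the missing ones.
completed = [[predict(u, i) for i in range(n_items)] for u in range(n_users)]
print(initial_error, final_error)
```

Only the observed cells contribute to the error, yet the product of the two factor matrices yields a value for every user-item pair, which is how the sparse matrix gets "completed".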

import mxnet as mx

# max_user and max_item (the number of distinct user and item IDs)
# are assumed to be defined earlier, e.g. while loading the dataset.
def plain_net(k):
    # input
    user = mx.symbol.Variable('user')
    item = mx.symbol.Variable('item')
    score = mx.symbol.Variable('score')
    # user feature lookup
    user = mx.symbol.Embedding(data=user, input_dim=max_user, output_dim=k)
    # item feature lookup
    item = mx.symbol.Embedding(data=item, input_dim=max_item, output_dim=k)
    # predict by the inner product, which is elementwise product and then sum
    pred = user * item
    pred = mx.symbol.sum(data=pred, axis=1)
    pred = mx.symbol.Flatten(data=pred)
    # loss layer
    pred = mx.symbol.LinearRegressionOutput(data=pred, label=score)
    return pred

net1 = plain_net(64)
mx.viz.plot_network(net1)
Linear Model
[Diagram: User Embedding and Item Embedding are combined by elementwise product (⨀) and sum (⨁) to produce the score output.]

import mxnet as mx

# max_user and max_item are assumed to be defined earlier, as for plain_net.
def get_one_layer_mlp(hidden, k):
    # input
    user = mx.symbol.Variable('user')
    item = mx.symbol.Variable('item')
    score = mx.symbol.Variable('score')
    # user latent features
    user = mx.symbol.Embedding(data=user, input_dim=max_user, output_dim=k)
    user = mx.symbol.Activation(data=user, act_type='relu')
    user = mx.symbol.FullyConnected(data=user, num_hidden=hidden)
    # item latent features
    item = mx.symbol.Embedding(data=item, input_dim=max_item, output_dim=k)
    item = mx.symbol.Activation(data=item, act_type='relu')
    item = mx.symbol.FullyConnected(data=item, num_hidden=hidden)
    # predict by the inner product
    pred = user * item
    pred = mx.symbol.sum(data=pred, axis=1)
    pred = mx.symbol.Flatten(data=pred)
    # loss layer
    pred = mx.symbol.LinearRegressionOutput(data=pred, label=score)
    return pred
Adding Non-Linearity
[Diagram: User Embedding and Item Embedding each pass through dense layers before being combined by elementwise product (⨀) and sum (⨁) to produce the score output.]

• The Embedding Layer is where a network extracts
the importance of features from data.
• Embedding is frequently used in NLP. For instance,
in sentiment analysis embedding distills sentiment
information from words.
Embedding Layer
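
Mechanically, an embedding layer is a table of trainable weights with one row per ID. The plain-Python sketch below, with made-up weights and hypothetical helper names `embed` and `one_hot_matmul`, shows that looking up a row gives the same result as multiplying a one-hot vector by the weight matrix, which is why the rows can be trained like any other layer.

```python
# A toy embedding table: one row of weights per ID (values are made up).
embedding_table = [
    [0.1, 0.3],    # ID 0
    [0.7, -0.2],   # ID 1
    [-0.5, 0.9],   # ID 2
]

def embed(idx):
    """Embedding lookup: simply return the row for this ID."""
    return embedding_table[idx]

def one_hot_matmul(idx):
    """The same result computed as one_hot(idx) multiplied by the table."""
    n_ids = len(embedding_table)
    dim = len(embedding_table[0])
    one_hot = [1.0 if j == idx else 0.0 for j in range(n_ids)]
    return [sum(one_hot[j] * embedding_table[j][d] for j in range(n_ids))
            for d in range(dim)]

for idx in range(3):
    print(idx, embed(idx), one_hot_matmul(idx))
```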

Coding Steps in Gluon
• Loading Data into an Array Iterator
• Defining the Network
• Initializing the Network
• Choosing the Loss Function
• Choosing Optimizer
• Training the Model
• Measuring Performance
• Adding Non-linearity
• Measuring Performance

from mxnet import gluon, ndarray

class SparseMatrixDataset(gluon.data.Dataset):
    """Each row of data holds (user, item); label holds the matching rating."""
    def __init__(self, data, label):
        assert data.shape[0] == len(label)
        self.data = data
        self.label = label
        if isinstance(label, ndarray.NDArray) and len(label.shape) == 1:
            self._label = label.asnumpy()
        else:
            self._label = label

    def __getitem__(self, idx):
        return self.data[idx, 0], self.data[idx, 1], self.label[idx]

    def __len__(self):
        return self.data.shape[0]

# train_data, train_label, test_data, test_label and batch_size
# are assumed to be defined earlier.
train_data_iter = gluon.data.DataLoader(SparseMatrixDataset(train_data, train_label),
                                        shuffle=True, batch_size=batch_size)
test_data_iter = gluon.data.DataLoader(SparseMatrixDataset(test_data, test_label),
                                       shuffle=True, batch_size=batch_size)
Loading Data into an Array Iterator

from mxnet import gluon, nd

class MFBlock(gluon.Block):
    def __init__(self, max_users, max_items, num_emb, dropout_p=0.5):
        super(MFBlock, self).__init__()
        self.max_users = max_users
        self.max_items = max_items
        self.dropout_p = dropout_p
        self.num_emb = num_emb
        with self.name_scope():
            # one num_emb-dimensional vector per user and per item
            self.user_embeddings = gluon.nn.Embedding(max_users, num_emb)
            self.item_embeddings = gluon.nn.Embedding(max_items, num_emb)

    def forward(self, users, items):
        a = self.user_embeddings(users)
        b = self.item_embeddings(items)
        # inner product: elementwise product, then sum over the embedding axis
        predictions = a * b
        predictions = nd.sum(predictions, axis=1)
        return predictions
Defining the Network

net.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx, force_reinit=True)
Initializing the Network

loss_function = gluon.loss.L2Loss()
Choosing the Loss Function

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': lr, 'wd': wd, 'momentum': 0.9})
Choosing Optimizer

import mxnet as mx

epochs = 10

# eval_net (returns the RMSE over a data iterator) is not shown on the slides;
# ctx, batch_size, loss_function and trainer come from the previous steps.
def train(data_iter, net):
    for e in range(epochs):
        print("epoch: {}".format(e))
        for i, (user, item, label) in enumerate(data_iter):
            # the reshape assumes every batch holds exactly batch_size samples
            user = user.as_in_context(ctx).reshape((batch_size,))
            item = item.as_in_context(ctx).reshape((batch_size,))
            label = label.as_in_context(ctx).reshape((batch_size,))
            with mx.autograd.record():
                output = net(user, item)
                loss = loss_function(output, label)
            loss.backward()
            trainer.step(batch_size)
        print("EPOCH {}: RMSE ON TRAINING and TEST: {}. {}".format(
            e, eval_net(data_iter, net), eval_net(test_data_iter, net)))
    return output
Training the Model

EPOCH 0: RMSE ON TRAINING and TEST: 3.5231035657055676. 3.53353935778141
Measuring Performance
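
The slides do not show how the reported RMSE is computed. A minimal sketch of the metric itself, assuming plain Python lists of predictions and true ratings (the function name `rmse` and the sample values are hypothetical), could look like this:

```python
import math

def rmse(predictions, labels):
    """Root mean squared error between predicted and true ratings."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up predictions and ratings, for illustration only.
example = rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0])
print(example)
```

An RMSE of about 3.5 after the first epoch, as reported above, means the predictions are off by several stars on average, so the value is expected to drop sharply as training proceeds.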

from mxnet import gluon, nd

class MFBlock(gluon.Block):
    def __init__(self, max_users, max_items, num_emb, dropout_p=0.5):
        super(MFBlock, self).__init__()
        self.max_users = max_users
        self.max_items = max_items
        self.dropout_p = dropout_p
        self.num_emb = num_emb
        with self.name_scope():
            self.user_embeddings = gluon.nn.Embedding(max_users, num_emb)
            self.item_embeddings = gluon.nn.Embedding(max_items, num_emb)
            # a single dense layer with ReLU, shared by the user and item branches
            self.dense = gluon.nn.Dense(num_emb, activation='relu')

    def forward(self, users, items):
        a = self.user_embeddings(users)
        a = self.dense(a)
        b = self.item_embeddings(items)
        b = self.dense(b)
        predictions = a * b
        predictions = nd.sum(predictions, axis=1)
        return predictions
Adding Non-Linearity

RMSE ON TRAINING and TEST, first series:
Training RMSE         Test RMSE
0.7397429539039732    0.7727593539834022
0.7340302160739899    0.7680136477231979
0.7290934163123369    0.7636049427747726
0.7489639871120453    0.7812525823950768
0.7219638488218189    0.7587113852620124
0.7221305305734277    0.7613962271451951
0.6964019835107028    0.7500802919447422
0.7040959010727703    0.7654881411790848
0.662088219345361     0.7430168891996145
0.641668664816767     0.7430033217519522
Measuring Performance
RMSE ON TRAINING and TEST, second series:
Training RMSE         Test RMSE
3.5231035657055676    3.53353935778141
3.3170668120235205    3.3881871127337218
1.7826270855292679    2.011566595607996
1.1699977076664567    1.3101321067839862
0.9599943867214024    1.0654756610274314
0.8661853047579527    0.951712748876214
0.8170913911752403    0.8906555083304644
0.7873748381830752    0.8524201203614473
0.7683705289691687    0.8290898406863213
0.7545523364305496    0.8126902643978595

• Matrix Factorization is ideal for small catalogues
and can perform well with a small amount of data.
• As the catalogues get larger, memory becomes a
challenge.
• For instance, MovieLens 20M has 27,278 items and
138,493 users. The User x Item matrix will have
138,493 x 27,278 = 3,777,812,054 cells.
Of all possible ratings the dataset includes only
20,000,263. This means only about 0.53 percent of the
matrix contains data and about 99.47 percent is
empty.
Limitations
ref: Leo Dirac – re:Invent 2016
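
The arithmetic behind the MovieLens 20M example can be checked directly. The sketch below uses only the user, item, and rating counts quoted on the slide; the variable names are illustrative.

```python
# Counts quoted for MovieLens 20M.
n_users = 138_493
n_items = 27_278
n_ratings = 20_000_263

cells = n_users * n_items      # size of the full User x Item matrix
density = n_ratings / cells    # fraction of cells that hold a rating
sparsity = 1.0 - density       # fraction of cells that are empty

print("cells:    {:,}".format(cells))
print("density:  {:.2f} percent".format(density * 100))
print("sparsity: {:.2f} percent".format(sparsity * 100))
```

The density works out to roughly half a percent: fewer than one cell in a hundred carries a rating, while a dense representation would still have to allocate all 3.7 billion cells.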

• Factorization Machines (FM) combine ideas from
Support Vector Machines and MF in
order to scale and deal with sparsity.
• The problem with FMs is that they run on a single
machine and have huge memory requirements.
Scaling Is the Issue…
• DiFacto, or Distributed Factorization Machines, can
distribute the computation of sub-gradients
on minibatches asynchronously and thus
spread the load across several machines.
(Alex Smola et al. - 2016)
[Diagram: parameter server with asynchronous SGD]
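
For orientation, a second-order Factorization Machine as formulated in the Rendle paper listed in the references predicts with a bias, a linear term, and pairwise interactions whose weights are inner products of per-feature factor vectors. The sketch below compares the direct pairwise sum with the well-known linear-time rewriting of the same sum. The feature vector and all weights are made-up values, and the function names are hypothetical.

```python
def fm_predict_naive(x, w0, w, V):
    """Bias + linear term + explicit sum over all feature pairs (i < j)."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

def fm_predict_fast(x, w0, w, V):
    """Same prediction using 0.5 * ((sum of v*x)^2 - sum of (v*x)^2) per factor."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise

# A sparse input: one-hot user, one-hot item, plus one extra feature (made up).
x = [1.0, 0.0, 0.0, 1.0, 0.5]
w0 = 0.1
w = [0.2, -0.1, 0.0, 0.3, 0.05]
V = [[0.1, 0.2], [0.0, -0.3], [0.4, 0.1], [-0.2, 0.5], [0.3, 0.3]]

print(fm_predict_naive(x, w0, w, V), fm_predict_fast(x, w0, w, V))
```

Because the interaction weights are factorized, the model can estimate an interaction between two features that never co-occurred in training, which is what helps with sparse data.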

• New users have no previous rating history.
• New items have never been rated before.
• Platforms that do not require a user login must
make recommendations without any prior
information about the user.
Cold-Start Problem
     itm0  itm1  itm2  itm3  itm4  itm5  itm6  itm7  itm8  itm9  itm10
u0   2     ?     ?     ?     ?     ?     4     ?     ?     ?     ?
u1   ?     2     3     ?     4     ?     ?     3     ?     5     ?
u2   2     ?     ?     ?     4     ?     4     ?     ?     ?     ?
u3   1     ?     ?     1     ?     ?     ?     ?     1     ?     ?
u4   5     ?     ?     2     ?     ?     ?     ?     ?     2     ?
u5   ?     ?     3     ?     ?     ?     ?     ?     ?     5     ?
u6   ?     1     ?     ?     ?     ?     4     ?     1     ?     ?
u7   ?     ?     ?     ?     ?     ?     1     ?     2     3     ?
u8   ?     ?     ?     5     ?     3     ?     ?     ?     ?     ?
u9   ?     2     3     ?     3     ?     ?     3     ?     5     ?
u10  ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?
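
The rating matrix above contains one cold-start user and one cold-start item. A small plain-Python sketch that finds them, assuming "?" marks a missing rating (the variable names are illustrative):

```python
# The rating matrix from the slide, one row per user u0..u10; "?" means not rated.
rows = """
2 ? ? ? ? ? 4 ? ? ? ?
? 2 3 ? 4 ? ? 3 ? 5 ?
2 ? ? ? 4 ? 4 ? ? ? ?
1 ? ? 1 ? ? ? ? 1 ? ?
5 ? ? 2 ? ? ? ? ? 2 ?
? ? 3 ? ? ? ? ? ? 5 ?
? 1 ? ? ? ? 4 ? 1 ? ?
? ? ? ? ? ? 1 ? 2 3 ?
? ? ? 5 ? 3 ? ? ? ? ?
? 2 3 ? 3 ? ? 3 ? 5 ?
? ? ? ? ? ? ? ? ? ? ?
""".strip().splitlines()

matrix = [[None if cell == "?" else int(cell) for cell in line.split()]
          for line in rows]

# A user is "cold" if their row is empty; an item is "cold" if its column is empty.
cold_users = [u for u, row in enumerate(matrix)
              if all(value is None for value in row)]
cold_items = [i for i in range(len(matrix[0]))
              if all(row[i] is None for row in matrix)]

print("cold-start users:", ["u{}".format(u) for u in cold_users])
print("cold-start items:", ["itm{}".format(i) for i in cold_items])
```

Collaborative filtering has nothing to learn from for these rows and columns: with no ratings, the corresponding embeddings never receive a gradient.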

• We might not have user data, but often there is a
wealth of product information available. We can
use this data in order to recommend similar
products to a user.
Content (Feature)-Based
code   category  sub category  weight  price   colour  dimensions
itm1   1         1             20      123.1   blue    23x12x19
itm2   2         1             20      900     white   23x12x20
itm3   1         1             22      123.1   green   20x10x18
itm4   3         1             20      600     blue    23x12x22
itm5   1         2             19      200     yellow  23x12x23
itm6   1         1             1       12      red     2x1x3
itm7   1         1             900     2000    blue    100x80x99
itm8   9         8             20      6000    grey    99x99x99
itm9   7         5             1000    123.1   blue    123x5x8
itm10  9         8             20      5000    brown   99x99x99
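
One simple way to recommend "similar products" from such a table is a nearest-neighbour search over item features. The sketch below is an assumption-laden illustration: it drops the colour column, treats the category codes as plain numbers (a real system would more likely one-hot encode them), standardizes each column, and uses Euclidean distance. The helper names are hypothetical.

```python
import math

# (category, sub category, weight, price, dim1, dim2, dim3) per item, from the table.
items = {
    "itm1": (1, 1, 20, 123.1, 23, 12, 19),
    "itm2": (2, 1, 20, 900, 23, 12, 20),
    "itm3": (1, 1, 22, 123.1, 20, 10, 18),
    "itm4": (3, 1, 20, 600, 23, 12, 22),
    "itm5": (1, 2, 19, 200, 23, 12, 23),
    "itm6": (1, 1, 1, 12, 2, 1, 3),
    "itm7": (1, 1, 900, 2000, 100, 80, 99),
    "itm8": (9, 8, 20, 6000, 99, 99, 99),
    "itm9": (7, 5, 1000, 123.1, 123, 5, 8),
    "itm10": (9, 8, 20, 5000, 99, 99, 99),
}

def standardize(table):
    """Scale every feature column to zero mean and unit variance."""
    columns = list(zip(*table.values()))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return {name: [(v - m) / s for v, m, s in zip(features, means, stds)]
            for name, features in table.items()}

scaled = standardize(items)

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(code):
    """Return the other item whose standardized features are closest."""
    return min((name for name in scaled if name != code),
               key=lambda name: distance(scaled[code], scaled[name]))

print(most_similar("itm8"))
```

itm8 and itm10 share category, sub category, weight, and dimensions and differ only in price, so each is the other's nearest neighbour. No user history is needed, which is why this approach sidesteps the cold-start problem for new users.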

Content Based

• Hybrid models combine item/user-based models
with content-based ones.
• After searching for travel guides for Peru,
Argentina, and Brazil while being in Germany,
I received the recommendation in the opposite
pane.
• The recommender has taken my location into account,
offered me a South American travel guide, and
given more weight to a specific publisher that
I deliberately clicked on more frequently.
Hybrid Models

• There is still a wealth of information we have not
tapped into.
• Movies have images.
• Images can be captioned.
• Products have names, while we have so
far reduced them to item numbers.
• TV series have episode names.
• Products have verbose reviews.
• We can harvest all of this information and enrich
our recommendation system.
Most of the Data Is Still Untapped
There is a strong similarity in the ambience and
composition of these images:

• A Matrix Factorization solution is, at its core, a
multiplication of two matrices.
• Neural Networks are good at picking up semantic
intent at phrase/sentence level.
• Neural Networks perform well at image
captioning.
• The output of a network is a tensor.
• So we can use the output of several networks as our
embedding layer for an enriched recommendation
system.
DSSM
[Diagram: User Embedding and Item Embedding are combined by elementwise product (⨀) and sum (⨁) to produce the score output.]

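DSSM
[Diagram: DSSM network. Inputs: User Embedding, user search (BOW), Item Embedding, title words (BOW), and ResNet image features with dropout. The inputs pass through dense layers, are concatenated, pass through further dense layers, and are combined by elementwise product (⨀) and sum (⨁) into the score output.]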
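
The wiring of such a network can be sketched without any deep learning framework. The plain-Python example below is a loose illustration rather than the architecture from the talk: the feature vectors stand in for the outputs of separate networks, the weights are random, and the names `dense`, `random_weights`, `W_user`, and `W_item` are hypothetical.

```python
import random

random.seed(0)

def dense(vec, weights):
    """Linear layer followed by ReLU; weights has shape out_dim x in_dim."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in weights]

def random_weights(out_dim, in_dim):
    return [[random.uniform(-0.5, 0.5) for _ in range(in_dim)]
            for _ in range(out_dim)]

# Stand-ins for the outputs of separate networks (all values are made up).
user_embedding = [0.2, -0.1, 0.4]     # learned per-user vector
search_bow = [1.0, 0.0, 1.0, 0.0]     # bag of words of the user's search
item_embedding = [0.3, 0.3, -0.2]     # learned per-item vector
title_bow = [0.0, 1.0, 1.0, 0.0]      # bag of words of the item title
image_features = [0.5, 0.1]           # e.g. pooled output of an image network

# Each tower concatenates its inputs, then projects to a shared dimension.
user_input = user_embedding + search_bow
item_input = item_embedding + title_bow + image_features

hidden = 4
W_user = random_weights(hidden, len(user_input))
W_item = random_weights(hidden, len(item_input))
user_vec = dense(user_input, W_user)
item_vec = dense(item_input, W_item)

# As in matrix factorization: elementwise product, then sum.
score = sum(u * i for u, i in zip(user_vec, item_vec))
print(score)
```

The final step is the same inner product used in the matrix factorization model. Only the vectors being multiplied are richer, since they now also encode text and image information.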

Development and Deployment
• One of the most important areas to pay attention to is the loss function.
• Multi-label cross-entropy loss has worked well in the past.
• This is relatively simple to apply to a wide variety of model types. Ranking loss often works better, but is
more complex to apply correctly.
• Scalability is always a big challenge. Offline batch computation and saving the results can help.
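
The slides do not say which ranking loss is meant. As one common example, a pairwise hinge loss penalizes the model whenever an item the user interacted with does not outscore a sampled negative item by at least a margin. The function names, the margin, and the scores below are assumptions made for illustration.

```python
def pairwise_hinge_loss(pos_score, neg_score, margin=1.0):
    """Zero once the positive item outscores the negative one by the margin."""
    return max(0.0, margin - (pos_score - neg_score))

def mean_ranking_loss(pairs, margin=1.0):
    """Average hinge loss over (positive score, negative score) pairs."""
    return sum(pairwise_hinge_loss(p, n, margin) for p, n in pairs) / len(pairs)

# Made-up model scores: (score of clicked item, score of sampled non-clicked item).
pairs = [(3.0, 1.0), (1.0, 1.5), (2.0, 1.5)]
print(mean_ranking_loss(pairs))
```

Part of the extra complexity is visible even here: every training example needs a negative item, and how those negatives are sampled strongly affects what the model learns.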

Logging and Measurement
• Deploying a recommender system requires some care since a model only succeeds if good behavioral data
can be logged.
• Moreover, without good logging it is impossible to assess the quality of the deployment. Tools such as
Amazon Kinesis are ideally suited for this purpose.
• Display bias is very strong – this means that customers are more likely to click on a mediocre
recommendation that they see than an excellent recommendation they don’t see.
• An improved model might not outperform the baseline when evaluated on historical data (and a model
that performs well on historical data might not work well on future data).
• Online comparisons are critical. This suggests an iterative strategy to building out a recommender system,
starting with a relatively simple model that is easy to deploy, and iteratively testing improvements.

Metrics
• Some common and effective metrics to measure are
• Recall: How many of the relevant items were selected.
• Precision: How many of the selected items were relevant. Optimize for precision and for top x.
• Try to increase the number of customers that viewed a product with some frequency.
• Weblab: take half of the customers and check the end-result.
• Time spent on the platform
• Distinct days on platform
• Forecast customer score
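
Precision and recall for the top x recommendations can be computed as in the sketch below. The function name `precision_recall_at_k` and the sample data are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall of the first k recommendations.

    recommended: ranked list of item IDs, best first
    relevant:    set of item IDs the user actually found relevant
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up example: five ranked recommendations, three truly relevant items.
precision, recall = precision_recall_at_k(
    ["a", "b", "c", "d", "e"], {"a", "c", "f"}, k=3)
print(precision, recall)
```

Here two of the top three recommendations are relevant, and two of the three relevant items were found. Given the display bias mentioned earlier, such offline numbers should be read alongside online comparisons.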

How to Choose a Model
Iterative Process      →                    →                    →                    →
Data Available         Limited user data;   Limited user data;   User data;           Extensive user data;
                       binary user-item     additional user-     extensive item       extensive item
                       interaction          item interaction     data                 data
Relevant Algorithms    MF, Binary MF        FM, DiFacto          DSSM                 Customized and more
                                                                                      advanced DSSM
Relative Complexity    2                    4                    5                    5
Deployment Considerations
• Historical data size – 30d/60d/1y…
• Fine-tuning techniques (daily, weekly..)
• Inference – compressed model? Trade-off between model complexity
and inference latency
• Validation system setup
• Iterate fast and simple

References
• https://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
• http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
• https://www.cs.cmu.edu/~muli/file/difacto.pdf
• http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/proceedings/p173.pdf
• http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/
• https://arxiv.org/pdf/1310.4546.pdf
• http://faculty.washington.edu/xiaohe/UW_DeepNLP_slides.pdf
• https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/
• http://alex.smola.org/teaching/berkeley2012/recommender.html
• https://arxiv.org/pdf/1312.4894.pdf
• http://www.simafore.com/blog/bid/207476/The-intuition-behind-recommender-systems-collaborative-filtering

Thank you!
cyrusmv@amazon.com
