With so many advances in machine learning recently, it’s not unreasonable to ask: why aren’t my recommendations perfect by now?
Aish provides a walkthrough of the open problems in the area of recommender systems, especially as they apply to Netflix’s personalization and recommender algorithms. He also provides a brief overview of recommender systems, and sketches out some tentative solutions for the problems he presents.
Telling stories has always been at the core of human nature. They provide us with a sense of community and let us communicate deeper truths.
Major technological breakthroughs have changed society in fundamental ways, and have allowed us to tell richer stories.
It’s not hard to imagine our ancestors gathered around a campfire to share stories. And you can see how, from that desire to share stories, symbolic representation developed into writing.
And then later the printing press,
And then later again the invention of the TV.
Whole new ways to express and understand ourselves through stories were possible.
Today, we’re lucky to be witnessing the changes brought about by the Internet. And like previous technological breakthroughs, the internet is also having a profound impact on how we tell stories.
Netflix lies at this crossroads of technology and entertainment. We’re inventing Internet TV.
In the world of linear TV, the job of the “content programmer” was to select what shows were on. And even with hundreds of cable TV channels, your choice is still limited.
The promise of Internet TV is that we can provide 70 million channels. Because each user is their own channel.
So producing a completely personalized experience is central to everything we do.
ML is used everywhere at Netflix. In fact, 80% of what is played comes from some form of recommendation system.
You’re probably aware that rows such as “Top Picks” are driven by machine learning.
But you might not have realized that most of the other rows,
The hero images at the top of the page,
What information (evidence) we show about a video,
And even how we combine all these elements onto a single page,
Is all driven by machine-learned algorithms that are optimized to provide you with a completely personalized experience.
So I’m here to tell you a story today too. A story about recommender systems and why they break.
Unfortunately it’s a sad story, and it only has villains. I hope you won’t think me overly negative.
But before I get into that. Let’s do a quick refresher on the algorithms behind recommender systems.
Here’s some of the classics:
Factorizations are still the workhorse of recommender systems.
And indeed the classic matrix factorization model popularized by the Netflix Challenge is still a good place to start. Although many modifications and extensions have been made, it remains a great launching point even today.
The model is basically: given a matrix of observed ratings R, find two lower-rank matrices, U and M, that minimize the Frobenius norm between their product and R.
Or putting it another way: we learn a latent representation of a user’s tastes, and a movie’s genres, by minimizing the squared error between what we predict you’ll rate, and what you actually rate.
http://mathurl.com/zcdk4ld
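Reading the objective off the description above (the mathurl link is the slide’s rendered equation; this reconstruction assumes the usual sum over the set Ω of observed entries and an ℓ2 regularization term, which the slide may or may not include):

```latex
\min_{U, M} \; \sum_{(i,j) \in \Omega} \left( r_{ij} - u_i^{\top} m_j \right)^2
  \;+\; \lambda \left( \lVert U \rVert_F^2 + \lVert M \rVert_F^2 \right)
```

where u_i and m_j are the latent rows of U and M for user i and movie j.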
And we can nest this in a probabilistic setting too, so that we can better understand some of the assumptions we’re making.
So we now say that each rating is drawn from a Gaussian whose mean is equal to the dot product of the corresponding rows of U and M. And U and M are themselves drawn from zero-mean Gaussians.
The nice thing about this formulation is that the assumptions behind our model are now explicit: it reflects a Gaussian noise assumption. And we now have a clear way to extend the model for different sorts of observations.
http://mathurl.com/juhenka
http://mathurl.com/ztylr32
http://mathurl.com/jxglmrs
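The three links above correspond to the probabilistic formulation just described. In standard probabilistic matrix factorization notation (a hedged reconstruction from the surrounding description, not necessarily the exact slide content), they read:

```latex
r_{ij} \mid u_i, m_j \sim \mathcal{N}\!\left( u_i^{\top} m_j,\; \sigma^2 \right),
\qquad
u_i \sim \mathcal{N}\!\left( 0,\; \sigma_U^2 I \right),
\qquad
m_j \sim \mathcal{N}\!\left( 0,\; \sigma_M^2 I \right)
```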
This basic model can be extended in many ways, and indeed we see papers here today that continue to extend it.
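As a concrete sketch of this basic model, here is a minimal squared-error factorization trained by SGD. Everything here (dimensions, learning rate, the toy ratings) is illustrative, not a production implementation:

```python
import numpy as np

def factorize(ratings, n_users, n_movies, k=2, lr=0.05, reg=0.02,
              epochs=500, seed=0):
    """Learn rank-k user/movie factors by SGD on the squared error
    over observed ratings only (a sketch of the classic MF model)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))
    M = 0.1 * rng.standard_normal((n_movies, k))
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - U[u] @ M[m]
            u_vec = U[u].copy()               # cache before updating
            U[u] += lr * (err * M[m] - reg * u_vec)
            M[m] += lr * (err * u_vec - reg * M[m])
    return U, M

# Toy data: two users with roughly opposite tastes over four movies.
ratings = [(0, 0, 5), (0, 1, 5), (0, 2, 1),
           (1, 1, 1), (1, 2, 5), (1, 3, 5)]
U, M = factorize(ratings, n_users=2, n_movies=4)
prediction = U[0] @ M[3]  # predict user 0 on the movie they never rated
```

The learned dot products reproduce the observed ratings, and unobserved entries are filled in from the shared latent structure.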
Some of the most popular extensions have been:
Let’s quickly review another approach. Graphical models.
In this approach:
We assume a certain generative model, and we then find parameters that best explain the observations.
For example, we could assume an underlying model like the following:
Each user has a distribution over a set of tastes.
And each taste is defined as being a distribution over movies.
Now when a user goes to rate a video highly, they draw a taste from their personal distribution over tastes, and then draw a movie from that taste’s distribution over videos.
If we can learn these two distributions, theta and phi, then we can use them to predict what other movies they may like.
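To make the generative story concrete, here is a toy simulation with made-up theta and phi (two tastes, four movies; every number is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# theta: each user's distribution over tastes (rows sum to 1).
theta = np.array([[0.9, 0.1],    # user 0 mostly holds taste 0
                  [0.2, 0.8]])   # user 1 mostly holds taste 1
# phi: each taste's distribution over movies (rows sum to 1).
phi = np.array([[0.7, 0.2, 0.1, 0.0],   # taste 0 favours movies 0-1
                [0.0, 0.1, 0.3, 0.6]])  # taste 1 favours movies 2-3

def draw_highly_rated_movie(user):
    taste = rng.choice(len(phi), p=theta[user])    # draw a taste...
    return rng.choice(phi.shape[1], p=phi[taste])  # ...then a movie from it

# The implied distribution over movies for each user is theta @ phi,
# i.e. the same dot-product structure as matrix factorization.
pred = theta @ phi
```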
If this sounds an awful lot like the MF model from a minute ago, I agree. They’re closely related.
In fact, if you look closely at the predictive posterior for the model I just gave, you’ll see that embedded within it is just another dot product.
http://mathurl.com/hqqpcyq
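Writing out the prediction for user i and movie j under this generative story (a reconstruction from the description above, using point estimates of theta and phi):

```latex
p(j \mid i) \;=\; \sum_{k} \theta_{ik} \, \phi_{kj} \;=\; \theta_i \cdot \phi_{\cdot j}
```

which is exactly a dot product between the user’s taste vector and the movie’s column across topics.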
Or looking at it geometrically.
If we take the simplex created by the distribution over movies:
Each topic is a point somewhere on this simplex, since it is itself a distribution over movies.
What our model is saying is that each user can be represented as a convex combination of these topics.
If you throw in a non-negativity constraint, and normalize the user and movie vectors, you can see the connection to MF.
http://mathurl.com/jb8dj9m
Although factorization approaches are still very useful, it’s rare these days that they’re used by themselves.
Often we blend different flavors of them in an ensemble instead, together with other features.
I won’t go too much into this side, as that’s a whole talk unto itself. But you get the idea.
So, now onto the fun stuff.
Why, with all these great methods, do we still get it wrong sometimes?
So let’s kick off with a somewhat philosophical question.
What makes a good recommendation? What are we trying to do?
I have a question for you. Which of these is the better movie?
You’re probably not surprised to learn that CK is more highly rated than Sharknado 2. In fact Rotten Tomatoes users give it 100%. Our users too will consistently rate content like CK more highly.
But if we recommend Sharknado 2, more people will actually watch it.
So what do we believe? And how do we explain the discrepancy?
Well here’s a few reasons:
1. What users self-report as liking is often aspirational. “We’re totally going to get around to watching that consciousness-expanding, life-changing title… just not tonight, I’m too tired.”
2. We’re asking the wrong question. The way I think most recommender systems, us included, frame the question is very confusing. Are we asking for a user’s critical assessment of the title within the oeuvre of world cinema? Or are we asking if it’s a good recommendation for you? We’re only interested in the latter, but users often answer with the former.
3. The feedback we get is hopelessly biased, both in terms of the sub-population of users who provide feedback, and in terms of which titles they bother to rate. Most models assume the data is missing at random (MAR), but in real-world systems it typically isn’t.
So because of this gap between what people self-report as good recommendations and what they actually want to watch,
most recommender systems rely on implicit feedback over explicit feedback.
And that typically means we’re using their consumption of the recommendation as a proxy for how relevant the recommendation is to them. And that makes some sense: after all, if they’re consuming our recommendation, that seems like a reasonable signal that it was a good one.
Implicit data is king. But there’s a gap between what we observe, and what we think we’re training on.
So let me sketch out what I think is a more complete (although not totally complete) picture of what we’re observing.
We want to train our model to produce relevant recommendations, but we only have observations on what they consumed,
And consumption is biased by:
The position of the item. Where was it on screen? Could the user even see it? How hard was it to navigate to?
And what we call “evidence”. How was the item sold to them? What supporting information was provided: Rotten Tomatoes reviews, whether the box art stood out, etc.
When training our model, we should really be controlling for these, but in practice that’s hard to do.
All of this means that there’s a gap between what we want to train our model to do, vs. what observations we really have to train with.
In the next few points I’ll drill into some of these more in detail.
So here’s a big one. And although I think most people have heard of it, it’s still generally ignored in recommender systems, because, well, it’s hard to solve.
At Netflix, 80% of what people play comes from the recommendation algorithms. This is a great success, but within it lies a hazard. We train our algorithms on what content people consume, but what they consume is based on what we recommend. So there’s a feedback loop.
Presentation Bias comes in many forms, although they’re all interrelated.
#1. The position the item is shown on screen impacts the probability of a user consuming it. It’s far more likely that a user will consume a title from the top-left corner than something they have to scroll right down the screen to get at.
In some ways this is the easiest type to deal with. If you can get a good handle on the distribution P(consume | position), independent of other factors, then you could condition on it, for example by weighting your training set.
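For instance, a hedged sketch of the weighting idea: if you had position-examination propensities from somewhere (randomized placements, say), you could weight each training example by the inverse propensity. The numbers and names below are invented:

```python
# Hypothetical probability that a user even examines a (row, column) slot,
# estimated elsewhere, e.g. from randomized placements.
examine_prob = {(0, 0): 0.9, (0, 5): 0.4, (3, 0): 0.2, (3, 5): 0.02}

def ips_weight(position, clip=20.0):
    """Inverse-propensity weight for an observation shown at `position`.
    Clipping keeps rarely-seen slots from dominating the training set."""
    return min(1.0 / examine_prob[position], clip)

w_top = ips_weight((0, 0))    # a play at top-left counts roughly once
w_deep = ips_weight((3, 5))   # a play deep in the page counts much more
```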
But that’s a big if. Because
The position itself is confounded by the relevance of the recommendation.
So unless you’re prepared to randomize the recommendation, and present some not-that-relevant recommendations, you really have no idea.
And even if you think that sounds like a good idea, it turns out that randomizing introduces its own bias: users who consume random items are not really indicative of the general population.
#2 So the second type of presentation bias is:
The number of times you present a title. The more times users see a title the more likely they are to play it. So this biases our observations too.
You can start to model this by thinking about the take-rate of a title. And indeed this is something used in ad models quite a lot.
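A minimal sketch of a smoothed take-rate, in the style of ad-model click-through estimates. The prior values are invented tuning constants:

```python
def take_rate(plays, impressions, prior_rate=0.05, prior_strength=100):
    """Plays per impression, shrunk toward a prior rate so that titles
    with few impressions don't get extreme estimates."""
    return (plays + prior_rate * prior_strength) / (impressions + prior_strength)

established = take_rate(800, 10_000)  # lots of evidence: close to 800/10000
newcomer = take_rate(5, 20)           # little evidence: pulled to the prior
```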
But, there is a problem with this, that I’ll get to in a minute.
#3 The third type, and this is really just a more extreme version of the first two, is: what happens if a user was never recommended the title at all.
Now this is really hard to deal with, because the counterfactual isn’t known: what would have happened if the user had known about this title? We don’t know.
So again, you may think: hey, this is easy to solve. Let’s make the titles that were presented and unplayed true negatives, and place less weight on all the missing observations.
So what’s the problem with adjustments such as take-rates?
So let’s imagine that we only treat titles that were presented to the user as true negatives, and down-weight the titles the user never saw.
So you make your training set look something like this.
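Sketched as code, the labeling scheme just described might look like this (a toy construction; the down-weight value is an arbitrary illustration):

```python
def build_training_set(impressions, plays, catalog, missing_weight=0.1):
    """Label every title for one user as (title, label, weight).
    Presented-but-unplayed titles become full-weight negatives; titles
    the user never saw become low-weight negatives, since we can't be
    sure they would have been skipped."""
    rows = []
    for title in catalog:
        if title in plays:
            rows.append((title, 1, 1.0))
        elif title in impressions:
            rows.append((title, 0, 1.0))             # seen and skipped
        else:
            rows.append((title, 0, missing_weight))  # never presented
    return rows

rows = build_training_set(impressions={"A", "B", "C"}, plays={"A"},
                          catalog=["A", "B", "C", "D"])
```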
On one level this sounds like a great idea. You could argue that you’ll find a finer decision boundary between the play and non-play classes, and that you’re addressing presentation bias.
But there are a few problems with this reasoning.
Firstly, you don’t know why they didn’t play the title.
They may love that movie, but have already seen it.
Or it may be a great rec, but they have to finish binge-watching Jessica Jones before they get to that other great recommendation.
Second:
Position is confounded by what the recommender system decides is relevant.
And what the recommender system decides is relevant is confounded by position.
This feedback loop makes it hard to control for.
So, somewhat paradoxically, the better your recommender system is at ranking relevant titles highly, the less likely these unplayed titles are to be truly negative.
And conversely, the titles that are the most unseen, because they’ve been ranked lowly by your recommender system, are probably the least relevant.
So there’s this push and pull between the two. If you try to adjust for presentation bias, in a way you’re undoing what your recommender system has learned.
I haven’t seen any great answers to this yet.
And what’s the consequence of ignoring this?
You may have heard in the media about Filter Bubbles: the idea that users only consume what we decide to show them, and that there is all this great content out there that lies outside the filter bubble but is hard to get to.
This is no idle problem for Netflix. We take this seriously. The great promise of Internet TV is that we’re no longer dependent on purchasing only content that is broadly popular. If we can find an audience for a niche title, and the economics of owning the rights to that title make sense, then absolutely we’ll do it.
But this is dependent on the recommender finding that audience.
So let’s turn to the next problem. Context.
Intuitively we all know that context has a big impact on our decisions.
There’s context that is observable. Such as:
Time of Day, Weekday vs. Weekend
Or Device (Big screen TV vs. iPhone)
Etc.
But there’s also context that we don’t observe too.
Maybe you’re sitting in front of Netflix with your SO, and they’re really not into Westerns like you are.
We don’t know if you’re in a mood to take a risk, and discover new content that is outside your comfort zone. Or if you’ve had a really bad day at work, and you want a familiar comforting old favorite.
I’m not going to talk much about dealing with observable context, because fortunately we now have many tools with which to attack the problem: tensor factorizations, factorization machines, adding context into your graphical models, and so on.
All these approaches basically factorize the user, the item, and the context jointly, recognizing that our recommendations require more than a two-way interaction.
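A minimal sketch of that three-way interaction: score a user-item pair through an elementwise product with a context embedding, so the context can amplify or mute each latent taste dimension. All vectors here are random stand-ins, and the context names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # latent dimensionality (illustrative)
user_vec = rng.standard_normal(k)
item_vec = rng.standard_normal(k)
context_vecs = {"tv_weekend": rng.standard_normal(k),
                "phone_commute": rng.standard_normal(k)}

def score(u, v, c):
    """Three-way (tensor-style) interaction over a shared latent space."""
    return float(np.sum(u * v * c))

s_tv = score(user_vec, item_vec, context_vecs["tv_weekend"])
s_phone = score(user_vec, item_vec, context_vecs["phone_commute"])
# The same user-item pair scores differently in different contexts.
```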
I wouldn’t say that this is a “solved problem”, but we’ve made steady progress on it.
The fundamental problem though is the unobserved context.
We can try to make the unobservable context observable, by coming up with clever UIs that make it as low-cost as possible for the user to fill us in.
You can see an example of this at Netflix, where we introduced Profiles a few years back. We now ask which person is watching Netflix, because within most households there are usually several different people who use Netflix, and they can have quite dissimilar tastes.
But each time we do something like this it levies a tax on the user. It’s another thing that they have to do before they can relax and watch a show. It’s another button to click. It’s another decision they have to make.
The context we don’t observe is the dark matter of recommender systems.
The path forward here is really in the hands of product designers and smart devices. Our hope is that they come up with further innovations for capturing more and more of this missing context.
The most important features are those that carry information about the user, and information about the video.
But we don’t have these in the case of new users, or new titles entering the Netflix catalog.
In these cases we need to cold start the user, or the title.
For items, we have metadata. We’re fortunate at Netflix to have a well-curated set of tags for every title that enters our service, telling us everything from who stars in it, to more abstract concepts such as whether it contains kick-ass women.
So we can make use of this data to cold-start the item via a classic content-recommender system.
But behavioural information always trumps metadata, so a pure metadata approach seems to throw too much away.
After all, it’s only the one item that needs to be cold-started; if you know which titles the new title is similar to, you should be able to leverage that information.
So a more profitable line of attack seems to be blending your metadata into your collaborative filtering approach, so that items transition smoothly from cold to warm and benefit fully from everything you’ve learned about the warm items.
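One simple way to sketch that cold-to-warm transition: interpolate between a metadata-derived embedding and a collaborative one, with the mixing weight driven by how much play evidence has accumulated. The warmup constant and the vectors are invented:

```python
import numpy as np

def blended_item_vector(cf_vec, meta_vec, n_plays, warmup=1000):
    """Blend a metadata embedding with a collaborative-filtering one.
    alpha rises from 0 (brand-new title) toward 1 as plays accumulate,
    so the item transitions smoothly from cold to warm."""
    alpha = n_plays / (n_plays + warmup)
    return alpha * cf_vec + (1 - alpha) * meta_vec

meta = np.array([1.0, 0.0])  # embedding projected from tags (hypothetical)
cf = np.array([0.0, 1.0])    # embedding learned from behavior (hypothetical)
cold = blended_item_vector(cf, meta, n_plays=0)      # pure metadata
warm = blended_item_vector(cf, meta, n_plays=9000)   # mostly collaborative
```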
These are all areas of active research, but I don’t think we’ve found the magic bullet yet.
For new users there’s similar problems, and pretty much all that I’ve said about item cold starting equally applies.
Except now we have an additional problem: the user’s behavior is also changing and adapting within those first few weeks. Or, to put this in more technical terms, we have a non-stationarity problem.
Additionally, we may be optimizing for the wrong thing here.
A new user’s mission is to evaluate Netflix and decide if it’s worth spending $9 a month.
And part of that decision is based on their perception of the breadth of content. So personalization could actually be harmful.
So how should we model this? Ideally we want to transition smoothly as the user settles into their more long term behavior.
Most of the approaches suggested for cold-starting users overlook these problems.
Factorization is still the workhorse of recommender systems.
But despite that, they have limitations, many of which are still active research areas.
#1. Most assume that you can be described as a linear combination of your tastes.
Even within the graphical model space, there’s still a linear assumption baked in; it’s just well hidden behind the probability formalism.
For example, most factorizations do a terrible job with situations like this:
You may be lukewarm on Action films, and lukewarm on Sci-Fi, but you love the combination of Sci-Fi Action films. This interaction isn’t well captured.
Or another example: if you like Zombie films, and your partner likes Romance films, does that mean you’re likely to watch a Zombie-Romance film together? Our factorization models say yes.
Any place where the probability of you playing something differs from the sum of your tastes, we’re going to do a poor job.
#2. The point of embedding users and items within a lower-dimensional space is that we’re assuming there’s some kind of archetypal set of tastes from which all users draw.
And that is kind of the point of creating lower-dimensional embeddings.
But if we have content that is fairly unique within its genre, or genre-defying, then there’s really no natural home “topic” for it.
In practice this means that broadly populated genres dominate niche genres, even if individual titles within the niche genre are popular. They’re simply overpowered by the sheer mass of the other topics.
#3. You typically have to choose between low-dimensional and sparse. But really we want both.
If you use a model like SLIM, then this’ll capture fine-grained interactions, such as: if you watch Rocky 1, then the probability of you watching Rocky 2 should be higher.
In a factorization approach, these very fine grained interactions tend to get lost.
On the other hand, factorization’s advantage is that it handles synonymy: the ability to roll up many sparse examples into a more general topic to give us more statistical strength.
Ideally though we want both.
We want a model that adaptively fits local structure, but embeds into a lower-dimensional space where there isn’t enough signal in the fine-grained interactions.
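One hedged sketch of such a hybrid scorer: a sparse SLIM-style item-item term added to a low-rank factorization term. Every weight, vector, and title below is made up for illustration:

```python
import numpy as np

def hybrid_score(user_history, item, W_sparse, u, V):
    """Score = sparse item-item term (SLIM-style) + low-rank term.
    W_sparse maps (watched, candidate) -> learned weight; most pairs
    are absent, so the low-rank term carries the general structure."""
    sparse_term = sum(W_sparse.get((h, item), 0.0) for h in user_history)
    low_rank_term = float(u @ V[item])
    return sparse_term + low_rank_term

# Invented numbers: Rocky 1 -> Rocky 2 carries a strong direct weight
# that a low-rank model alone would smear away.
W = {("rocky1", "rocky2"): 0.8}
V = {"rocky2": np.array([0.2, 0.1]), "some_drama": np.array([0.3, 0.4])}
u = np.array([0.5, 0.5])
s_sequel = hybrid_score(["rocky1"], "rocky2", W, u, V)
s_other = hybrid_score(["rocky1"], "some_drama", W, u, V)
```

The fine-grained sequel interaction survives in the sparse term, while items with no direct coefficients still get scored by the embedding.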
#4. Most factorization approaches are still trained in an unsupervised setting, but all we really care about is the predictive performance.
You can work around this by using factorizations as features within a larger supervised ensemble.
But that isn’t ideal. We should be finding your latent space in the context of the actual end goal, producing a recommendation, so that the embeddings we find are optimized for that task.
And possibly that’s the great advantage of deep learning. And there are some models that address this, such as sLDA, but this is an area that needs much more attention.
#5. In the real-world items come and go.
So for any training set you actually have a mixture of items of different tenures.
In practice this confounds the factorizations. Items that would normally cluster together are instead separated into different topics.
This happens because, to the model, the lack of interaction between the items appears as evidence that they should be separated, whereas what we really have is partially missing data for one of the items. But this typically isn’t incorporated into these models.
Most recommender systems model the problem as producing the. single. best. recommendation. But in the real-world we’re typically tasked with recommending a basket of titles.
We typically bridge this gap by ignoring it. We start by putting our best recommendation in the basket, and then our second best, and so on.
But this can clearly be suboptimal. If our goal is to maximize the probability of a user finding something they like, then recommending a set of titles that are all very similar to each other probably isn’t the best strategy. Hedging our bets a little would be wiser.
There are a few different ways to tackle this problem. The most common approach is a post-ranking step, where diversity is injected post-hoc.
But here’s another approach: if we put our best bet at the top of the screen, and the user rejects it, then we now have a new piece of information -- that the user didn’t feel in the mood for that title.
We could condition on this new piece of information in selecting our second title, and so on, until we’ve built a full page of recommendations.
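Here is a toy version of that sequential idea: greedily fill slots, each pick maximizing the chance the user finds *something*, under an invented model where the user is in exactly one “mood” and plays the row if any chosen title matches it. The titles, moods, and probabilities are all made up:

```python
def build_row(candidates, appeal, taste_probs, slots=2):
    """Greedy set selection: each pick maximizes P(user finds something),
    which bakes diversity in, since covering a new taste beats doubling
    up on an already-covered one."""
    def p_success(titles):
        covered = set().union(*(appeal[t] for t in titles))
        return sum(p for taste, p in taste_probs.items() if taste in covered)
    chosen = []
    while len(chosen) < slots:
        best = max((t for t in candidates if t not in chosen),
                   key=lambda t: p_success(chosen + [t]))
        chosen.append(best)
    return chosen

tastes = {"action": 0.5, "comedy": 0.3, "docs": 0.2}
appeal = {"die_hard": {"action"}, "rush_hour": {"action", "comedy"},
          "office_space": {"comedy"}, "planet_earth": {"docs"}}
row = build_row(["die_hard", "rush_hour", "office_space", "planet_earth"],
                appeal, tastes)
# Greedy picks rush_hour first (covers action + comedy), then the niche
# documentary planet_earth, not the second-best single title.
```

Since this coverage objective is monotone submodular, the greedy pick also comes with the classic approximation guarantee.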
The nice thing about this approach is that diversity is baked in, while still maximizing our overall probability of the user finding something.
But even in this approach we’ve made a drastic simplification, that in practice hurts our ability to make a compelling recommendation.
We’re assuming that the interaction of the titles on the screen, how they look compared to their neighbours, has no effect. We’re assuming that a user considers each title individually before moving on to the next.
But in reality a title can receive more plays simply because it stands out from the titles displayed around it.
And conversely, a more niche title can be helped by being surrounded by better-known but highly related titles. In this case the surrounding titles provide more context to the user on why they should consider it.