The Netflix prize: yet another million dollar problem

The Problem
Strategies
Some Funny New Science

David Bessis

École Normale Supérieure, 27/01/2010

7 + 1 Million Dollar Problems

Millennium Prize Problems:
Funded in 2000 by the Clay Mathematics Institute.
Seven classical open problems in Mathematics.
Solutions must
"be published in a refereed mathematics publication of
worldwide repute"
"have general acceptance in the mathematics community two
years after"
Fuzzy rules.
The Poincaré conjecture was solved by Perelman in 2003.
No award yet.

Netflix Prize:
Funded in 2006 by the DVD rental company Netflix.
A problem in Applied Mathematics, Computer Science,
Psychology (do we really care?)
A problem in Some Funny New Science.
Reasonably clear rules.
Prize awarded in September 2009.


The Problem
Rules
Strategies
Competition

Context

Netflix has an "all-you-can-eat" pricing model.
They need their users to watch a lot of movies.
Beyond a few obvious choices, people don't know what they
want to watch.
Collaborative filtering: recommending products based on prior
evaluations by other users (just like Amazon does).
The Netflix Prize is a collaborative filtering competition:
Based on a huge dataset of actual ratings by Netflix users.
Open to almost everyone.
Endowed with a $1,000,000 prize.


The Dataset

The user space U consists of 480 189 users
(identified by a meaningless non-sequential integral id).
The movie space M consists of 17 770 movies
(identified by integers 1, ..., 17 770, and the associated list of titles
and release years is provided; this data is meaningful and minable).
The date space D spans the period Oct. 1998 – Dec. 2005
(extremely meaningful data; no time of day is provided).
The rating space R is {1, 2, 3, 4, 5} ("stars").
The training dataset T contains 100 480 507 quadruples
(u, m, d, r) ∈ U × M × D × R.
The qualifying dataset Q contains 2 817 131 triples
(u, m, d) ∈ U × M × D.


The Challenge

Open to everyone except Netflix employees and their relatives and
residents of Cuba, Iran, Syria, North Korea, Myanmar and Sudan.
Participants can join efforts in teams.
They can upload their predictions up to once a day.
Predictions are maps from the qualifying set Q to the interval
[1, 5].
The metric used to benchmark predictions is the RMSE ("root
of mean square error"):

\[ \mathrm{RMSE} = \sqrt{ \frac{1}{|Q|} \sum_{q \in Q} \bigl| \text{predicted rating for } q - \text{actual rating for } q \bigr|^2 } \]
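
As a minimal sketch of the metric just defined (the function name `rmse` and the toy numbers are illustrative assumptions, not contest material), the RMSE of a list of predictions against the actual ratings could be computed like this:

```python
import math


def rmse(predicted, actual):
    """Root of mean square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: three predictions in [1, 5] against three actual star ratings.
toy_predicted = [3.5, 4.0, 2.0]
toy_actual = [4, 4, 1]
toy_score = rmse(toy_predicted, toy_actual)
```

Because the errors are squared before averaging, one badly wrong prediction costs more than several slightly wrong ones.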


Typical RMSEs

Theoretically, the RMSE need never be greater than 2: always
predicting 3 stars is off by at most 2.
Users tend to view and rate movies they like, so they typically
give 3, 4 or 5 stars rather than 1 or 2 (the above upper bound
is unrealistically pessimistic).
A basic prediction consists of mapping a triple (u, m, d) to the
average rating obtained by the movie m. It achieves 1.0540.
At the beginning of the Challenge, Netflix's in-house
prediction system Cinematch achieved 0.9514
(roughly a 10% improvement).
Netflix set the following target: obtain a further 10%
improvement over Cinematch.
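
The basic movie-average prediction can be sketched as follows. This is only an illustration on made-up data: the helper name `movie_average_predictor`, the toy quadruples and the fallback to the global mean for unseen movies are assumptions, not something the slides specify. The last line redoes the "roughly 10%" arithmetic from the two RMSEs quoted above.

```python
from collections import defaultdict


def movie_average_predictor(training):
    """Return a function mapping (u, m, d) to the average training rating of m.

    `training` is an iterable of (u, m, d, r) quadruples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _user, movie, _date, rating in training:
        totals[movie] += rating
        counts[movie] += 1
    # Assumed fallback for a movie absent from the training data.
    global_mean = sum(totals.values()) / sum(counts.values())

    def predict(_user, movie, _date):
        if movie not in counts:
            return global_mean
        return totals[movie] / counts[movie]

    return predict


# Made-up training quadruples (user id, movie id, date, stars).
toy_training = [
    (1, 10, "2005-01-01", 4),
    (2, 10, "2005-01-02", 5),
    (1, 20, "2005-01-03", 2),
    (3, 20, "2005-01-04", 3),
    (3, 20, "2005-01-05", 1),
]
predict = movie_average_predictor(toy_training)

# Relative improvement of Cinematch (0.9514) over the movie average (1.0540).
improvement = 1 - 0.9514 / 1.0540
```

The user and the date are ignored entirely, which is why this baseline is so easy to beat.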


Very Smart Rules 1: a Cryptographic Trick

Netflix has secretly partitioned the qualifying set

Q = Q1 ⊔ Q2

into two subsets of (approximately) equal sizes.
The RMSE achieved on Q1 is revealed to participants
(there is a public leaderboard).
The RMSE achieved on Q2 is used to determine the winner.
This prevented participants from "learning from the oracle".
The goal was to achieve 0.8572.
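
The effect of the secret partition can be illustrated with a toy sketch. Everything here is hypothetical: the slides do not say how Netflix drew the split, so the seeded shuffle, the helper names `secret_split` and `rmse_on`, and the toy data are assumptions made only to show the mechanism of a public score on Q1 and a hidden score on Q2.

```python
import math
import random


def secret_split(qualifying, seed):
    """Randomly partition the qualifying triples into two halves (Q1, Q2)."""
    shuffled = list(qualifying)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def rmse_on(subset, predictions, truth):
    """RMSE of `predictions` against `truth`, restricted to `subset`."""
    total = sum((predictions[q] - truth[q]) ** 2 for q in subset)
    return math.sqrt(total / len(subset))


# Toy qualifying set of (u, m, d) triples with made-up true ratings.
toy_q = [(u, m, 0) for u in range(5) for m in range(2)]
toy_truth = {q: 1 + (q[0] + q[1]) % 5 for q in toy_q}
toy_predictions = {q: 3.0 for q in toy_q}

q1, q2 = secret_split(toy_q, seed=2006)
public_score = rmse_on(q1, toy_predictions, toy_truth)   # shown on the leaderboard
private_score = rmse_on(q2, toy_predictions, toy_truth)  # decides the winner
```

A participant who tunes a submission against the public score only learns about Q1, so the score on Q2 remains an honest estimate.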


Very Smart Rules 2: Crowd Psychology Tricks

The Challenge opened on October 2, 2006.
Annual $50,000 prizes were offered to current leaders provided
they made their current methodology public.
The Challenge was to last for 30 more days after the goal was
achieved.
The winner would be the team with the best RMSE after this
30-day period (no backstabbing arXiv-style "I posted first" effect).
Every detail was carefully anticipated (even the possibility of a
tie).
These smart rules, together with the $1,000,000 prize,
attracted thousands of participants.


Timeline

October 2006: Cinematch RMSE = 0.9514.
October 2007: team KorBell leads with 0.8712 (8.43%
improvement).
October 2008: team "BellKor in BigChaos" (two teams
merging efforts) leads with 0.8616 (9.44% improvement).
June 26, 2009: the goal is achieved.
July 26, 2009: Netflix stops gathering solutions.
The winner is announced on September 18, 2009.


The winning team

Three teams combined their results to win the competition:
BellKor
  Bob Bell (AT&T)
  Yehuda Koren (Yahoo)
  Chris Volinsky (AT&T)
BigChaos
  Michael Jahrer (Commendo research and consulting)
  Andreas Töscher (Commendo research and consulting)
Pragmatic Theory
  Martin Chabbert (Pragmatic Theory)
  Martin Piotte (Pragmatic Theory)
Their winning submission achieved an RMSE of 0.8567 (10.06%
improvement over Cinematch).
Another team, The Ensemble, achieved the same RMSE...
...and lost because their submission was posted 24 minutes later!

Practical issues
The Problem
Regressions
Strategies
Latent factors
Tuning and Blending

Computer implementation

Memory requirements:
Movies can be encoded on 2 bytes (17 770 < 256^2).
Viewers can be encoded on 3 bytes (480 189 < 256^3).
Dates can be encoded on 2 bytes.
A triple (m, v, d) can be encoded on 7 bytes.
700 MB suffice to store the dataset.
It is possible (necessary) to work in RAM.
Commodity hardware is sufficient.
(I have some Ruby code to interactively play with the dataset.)
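
The byte counts above can be checked with a small sketch. It is not the author's Ruby code: the function names, the little-endian layout, the renumbering of viewers to 0 ... 480 188 and the encoding of dates as a day count are all assumptions made for illustration. Like the slide, the sketch packs only the triple and leaves the rating aside.

```python
import struct

N_RATINGS = 100_480_507  # size of the training set T, as quoted in the slides


def pack_triple(movie, viewer, day):
    """Pack (m, v, d) on 2 + 3 + 2 = 7 bytes, little-endian.

    Assumes viewers were renumbered to small integers and that `day`
    counts days from some fixed origin (both hypothetical choices).
    """
    if not (0 <= movie < 256 ** 2 and 0 <= viewer < 256 ** 3 and 0 <= day < 256 ** 2):
        raise ValueError("field out of range")
    return struct.pack("<H", movie) + viewer.to_bytes(3, "little") + struct.pack("<H", day)


def unpack_triple(blob):
    """Inverse of pack_triple."""
    movie = struct.unpack_from("<H", blob, 0)[0]
    viewer = int.from_bytes(blob[2:5], "little")
    day = struct.unpack_from("<H", blob, 5)[0]
    return movie, viewer, day


record = pack_triple(17770, 480188, 2000)
total_bytes = 7 * N_RATINGS  # about 700 MB for the whole training set
```

At 7 bytes per record the whole training set takes 703 363 549 bytes, which is where the 700 MB figure comes from.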


Remarks

About 200 ratings per user.
This is likely caused by Cinematch's data gathering procedure:
users sometimes rate tens of movies on a single day.
This causes an insanely huge bias within the dataset (movies
are perceived differently when rated individually or within a
rating spree), not fully exploited by the winners.
Netflix, do you read me?
Some movies were rated by hundreds of thousands of viewers,
some by just a few (long-tail distribution).
Similarly, one user rated all the movies, and many rated just a few.
Let F be the set of all final 9 ratings for all individual users.
Then F = Q ⊔ P, with P ⊂ T publicly tagged by Netflix.
Q is a random draw of 2/3 of F.
Q resembles P but is very dissimilar from T.

Algorithms

The machine learning toolbox consists of many methods:
Clustering methods.
Regressions.
Latent parameters methods (SVD).
Neural networks.
SVM.
...


Beginner's mistakes

Underestimate the volume effect.
Think conceptually and discretely rather than globally and
continuously.
Put users and movies into categories (clustering introduces
unwanted discontinuities).
Learn from the probe.
Dealing with 100 000 000 data points isn't a logic puzzle.
It resembles Thermodynamics.


Linear regression

Suppose all viewers in X have rated all movies in Y: the rating
matrix is
$(r_{x,y})_{(x,y) \in X \times Y}$.
Suppose you want to model the ratings given to a particular movie
$y_0$ based on the ratings given to the movies in $Y' = Y - \{y_0\}$.
A linear regression is a natural way to do that.
Write $(r_{x,y}) = (C_y)_{y \in Y}$ where the $C_y$ are the column vectors.
Performing the linear regression consists of approximating $C_{y_0}$ by
its orthogonal projection $\hat{C}_{y_0}$ on the subspace generated by the
$(C_y)_{y \in Y'}$.
Clearly, there exists a unique solution.
It optimizes RMSE.
Write the formula!
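
One standard way to write the formula, offered as a sketch rather than as the version given in the talk: let $A$ be the matrix whose columns are the $C_y$ for the movies $y$ other than $y_0$, and assume these columns are linearly independent. The symbols $A$ and $\lambda$ are introduced here for illustration.

```latex
\hat{C}_{y_0} = A\,\lambda,
\qquad
\lambda = (A^{\top} A)^{-1} A^{\top} C_{y_0},
\qquad\text{equivalently}\qquad
A^{\top}\bigl(C_{y_0} - A\,\lambda\bigr) = 0 .
```

The last identity says that the residual $C_{y_0} - \hat{C}_{y_0}$ is orthogonal to every column of $A$, which is exactly the condition that minimizes $\| C_{y_0} - A\lambda \|^2$ and hence the RMSE over the viewers in $X$.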

Real life problems 1: missing data

Not all viewers have seen all movies.
Worse, there are virtually no complete rectangular blocks
within the dataset.
Regression by viewers or by movies?
It is better to do regression by movies.
Normalize ratings:
replace the rating $r_{v,m}$ by the meaningful signal, i.e., the difference
$\bar{r}_{v,m}$ between $r_{v,m}$ and the average rating for m.
Then it becomes natural to set $\bar{r}_{v,m}$ to 0 when v hasn't rated m.
Actually, whether or not v has rated m is meaningful information!
Add normalized bit columns to account for that.
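
The normalization step can be sketched on a toy table. The data, the helper names and, in particular, the reading of "normalized bit columns" as centered 0/1 indicators are assumptions; the slides do not spell out the exact normalization.

```python
# Toy ratings: viewer -> {movie: stars}. A missing entry means "not rated".
toy_ratings = {
    "v1": {"m1": 5, "m3": 3},
    "v2": {"m1": 4, "m2": 2},
    "v3": {"m2": 1, "m3": 5},
    "v4": {"m1": 3, "m3": 4},
}
movies = ["m1", "m2", "m3"]
viewers = sorted(toy_ratings)


def movie_means(ratings, movies):
    """Average rating of each movie over the viewers who rated it."""
    means = {}
    for m in movies:
        stars = [r[m] for r in ratings.values() if m in r]
        means[m] = sum(stars) / len(stars)
    return means


def normalized_rows(ratings, viewers, movies):
    """One row per viewer: centered ratings, then centered 'has rated' bits."""
    means = movie_means(ratings, movies)
    # Fraction of viewers who rated each movie, used to center the bit columns.
    freq = {m: sum(m in ratings[v] for v in viewers) / len(viewers) for m in movies}
    rows = []
    for v in viewers:
        # Rating minus movie average where rated, 0 where not rated.
        signal = [ratings[v][m] - means[m] if m in ratings[v] else 0.0 for m in movies]
        bits = [(1.0 if m in ratings[v] else 0.0) - freq[m] for m in movies]
        rows.append(signal + bits)
    return rows


means = movie_means(toy_ratings, movies)
rows = normalized_rows(toy_ratings, viewers, movies)
```

After centering, a missing rating and an exactly average rating both contribute 0 to the signal columns; the bit columns are what lets a regression tell the two cases apart.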


Real life problems 2: the curse of dimensionality

We all know that Lagrange interpolators are not to be used on
noisy data. Rather, one should look at best-fitting polynomials of a
given low degree.
Similarly, the curse of dimensionality asserts that:
With high-dimensionality datasets, one will always find stupid
predictors, making perfect predictions on the dataset, and
failing to generalize.
By looking at my audience today, what should I be able to infer?
That having long hair is a reasonably good gender predictor?
That wearing a grey sweater is a reasonably good gender
predictor?
Dilemma: overlearning vs underlearning.
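
The Lagrange interpolation remark can be made concrete with a deterministic toy experiment. The setup is an assumption chosen for illustration: the true law is y = x, observed at 8 points with alternating ±0.1 noise, and the two fits are compared at the midpoints against the noise-free law.

```python
import math


def lagrange_interpolate(xs, ys, t):
    """Value at t of the polynomial of degree < len(xs) through all the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (t - xj) / (xi - xj)
        total += yi * weight
    return total


def fit_line(xs, ys):
    """Least-squares line through the points: returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


def rmse(predicted, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Training data: true law y = x, observed with alternating +/-0.1 noise.
train_x = [float(i) for i in range(8)]
train_y = [x + 0.1 * (-1) ** i for i, x in enumerate(train_x)]
# Test points: midpoints between training points, scored against y = x.
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = list(test_x)

intercept, slope = fit_line(train_x, train_y)
interp_train = rmse([lagrange_interpolate(train_x, train_y, x) for x in train_x], train_y)
interp_test = rmse([lagrange_interpolate(train_x, train_y, x) for x in test_x], test_y)
line_train = rmse([intercept + slope * x for x in train_x], train_y)
line_test = rmse([intercept + slope * x for x in test_x], test_y)
```

The interpolator reproduces the training data perfectly and is far worse than the straight line between the data points: it has learned the noise. That is overlearning in miniature.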


Ridge regression (aka Tikhonov regularization)

Linear regression: given vectors $x, y_1, \dots, y_n \in \mathbb{R}^m$, find
$\lambda_1, \dots, \lambda_n$ that minimize
\[ \Bigl\| x - \sum_{i=1}^{n} \lambda_i y_i \Bigr\|^2 . \]
When n is large (with respect to m), the linear system has more
unknowns than equations: it is underdetermined. Overfitting occurs.
A telltale sign of overfitting is the presence of $\lambda_i$'s with huge
magnitudes compensating each other.
Ridge regression (Tikhonov regularization): find $\lambda_1, \dots, \lambda_n$ that
minimize
\[ \Bigl\| x - \sum_{i=1}^{n} \lambda_i y_i \Bigr\|^2 + \varepsilon \sum_{i=1}^{n} |\lambda_i|^2 \]
where ε is a well-adjusted (small) penalty term.
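
A minimal sketch of the penalized minimization, assuming the usual closed form: the minimizer solves $(G + \varepsilon I)\lambda = b$ with $G_{ij} = \langle y_i, y_j \rangle$ and $b_i = \langle y_i, x \rangle$. The helper names, the tiny Gaussian elimination and the toy vectors are illustrative choices, not taken from the talk. With ε = 0 the same code performs plain linear regression, which makes the comparison easy.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def solve(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n + 1):
                rows[r][c] -= factor * rows[col][c]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(rows[i][c] * solution[c] for c in range(i + 1, n))
        solution[i] = (rows[i][n] - tail) / rows[i][i]
    return solution


def ridge(x, ys, eps):
    """Minimize ||x - sum_i lam_i * y_i||^2 + eps * sum_i lam_i^2."""
    n = len(ys)
    gram = [[dot(ys[i], ys[j]) + (eps if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    return solve(gram, [dot(y, x) for y in ys])


# Two nearly collinear predictors in R^3 and a target vector.
y1 = [1.0, 0.0, 0.0]
y2 = [1.0, 1e-4, 0.0]
x = [1.0, 1.0, 0.0]

plain = ridge(x, [y1, y2], 0.0)    # no penalty: plain linear regression
shrunk = ridge(x, [y1, y2], 0.01)  # small penalty
```

Without the penalty the two coefficients are close to -9999 and 10000, the huge compensating values described above. With ε = 0.01 both stay near 0.5, at the cost of a slightly worse fit on this particular x.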


Assigning attributes to movies

Assume that movies differ by their amount of certain qualities:
Violence.
Sex.
Anything else?
Maybe not.

