A Top-N Recommender System Evaluation Protocol Inspired by Deployed Systems


Alan Said, Alejandro Bellogín, Arjen De Vries

CWI

@alansaid, @abellogin, @arjenpdevries

Outline

•  Evaluation
   –  Real world
   –  Offline
•  Not algorithmic comparison!
•  Comparison of evaluation
•  Protocol
•  Experiments & Results
•  Conclusions

2013-10-13    LSRS'13

EVALUATION


Evaluation

•  Does p@10 in [Smith, 2010a] measure the same quality as p@10 in [Smith, 2012b]?
   –  Even if it does
      •  is the underlying data the same?
      •  was cross-validation performed similarly?
      •  etc.


Evaluation

•  What metrics should we use?
•  How should we evaluate?
   –  Relevance criteria for test items
   –  Cross validation (n-fold, random)
•  Should all users and items be treated the same way?
   –  Do certain users and items reflect different evaluation qualities?


Offline Evaluation

Recommender System accuracy evaluation is currently based on methods from IR/ML

–  One training set
–  One test set
–  (One validation set)
–  Algorithms are trained on the training set
–  Evaluate using metric@N (e.g. p@N, with N a page size)
   •  Even when N is larger than the number of test items
   •  p@N = 1.0 is (almost) impossible
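The ceiling on p@N mentioned above can be illustrated with a minimal sketch. The function names and the toy data are hypothetical and not taken from the talk; the point is only that a user with fewer than N test items cannot reach p@N = 1.0, even with a perfect ranking.

```python
def precision_at_n(ranked_items, relevant_items, n):
    """Fraction of the top-n recommended items that appear in the test set."""
    top_n = ranked_items[:n]
    hits = sum(1 for item in top_n if item in relevant_items)
    return hits / n


def max_precision_at_n(num_relevant, n):
    """Best p@N any ranking can reach given the number of test items."""
    return min(num_relevant, n) / n


# Hypothetical user with only 3 items in the test set.
relevant = {"a", "b", "c"}
# A perfect ranking: all test items come first.
perfect_ranking = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]

p10 = precision_at_n(perfect_ranking, relevant, 10)
ceiling = max_precision_at_n(len(relevant), 10)
p3 = precision_at_n(perfect_ranking, relevant, 3)
```

Here the perfect ranking scores only 0.3 at N = 10, while the same ranking reaches 1.0 at N = 3, the size of the test set.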


Evaluation in production

•  One dynamic training set
   –  All of the available data at a certain point in time
   –  Continuously updated
•  No test set
   –  Only live user interactions
      •  Clicked/purchased items are good recommendations

Can we simulate this offline?


Evaluation Protocol

•  Based on “real world” concepts
•  Uses as much available data as possible
•  Trains algorithms once per user and evaluation setting (e.g. N)
•  Evaluates p@N when there are exactly N correct items in the test set
   –  possible p@N = 1 (gold standard)


Evaluation Protocol

Three concepts:

1.  Personalized training & test sets
    –  Use all available information about the system for the candidate user
    –  Different test/training sets for different levels of N
2.  Candidate item selection (items in test sets)
    –  Only “good” items go in test sets (no random 80%-20% splits)
    –  How “good” an item is is based on each user’s personal preference
3.  Candidate user selection (users in test sets)
    –  Candidate users must have items in the training set
    –  When evaluating p@N, each user in test set should have N items in test set
       •  Effectively precision becomes R-precision

Train each algorithm once for each user in the test set and once for each N.
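The three concepts above can be sketched for a single candidate user and a single N. This is an illustrative sketch only: `personalized_split`, `r_precision`, the `is_relevant` callback and the toy ratings are hypothetical names, and drawing the N test items at random from the user's relevant items is an assumption, since the slides do not say how they are chosen.

```python
import random


def personalized_split(ratings, user, n, is_relevant, seed=0):
    """Build one training/test split for one candidate user and one N.

    ratings: list of (user, item, rating) tuples.
    Returns (train, test_items), or None if the user is not a candidate at N.
    """
    relevant = [(u, i, r) for (u, i, r) in ratings
                if u == user and is_relevant(u, r)]
    user_total = sum(1 for (u, _, _) in ratings if u == user)
    # The user needs N "good" items for the test set and at least
    # one item left over in the training set.
    if len(relevant) < n or user_total <= n:
        return None
    # Assumption: the N test items are sampled at random from the good items.
    test = random.Random(seed).sample(relevant, n)
    held_out = {(u, i) for (u, i, _) in test}
    # Everything else in the system, including all other users, is training data.
    train = [t for t in ratings if (t[0], t[1]) not in held_out]
    return train, {i for (_, i, _) in test}


def r_precision(ranked_items, test_items):
    """p@N with N equal to the number of test items, so 1.0 is attainable."""
    n = len(test_items)
    return sum(1 for i in ranked_items[:n] if i in test_items) / n


ratings = [("u1", "i1", 5), ("u1", "i2", 4), ("u1", "i3", 2),
           ("u1", "i4", 5), ("u2", "i1", 3), ("u2", "i5", 4)]
good = lambda user, rating: rating >= 4  # placeholder relevance criterion

train, test_items = personalized_split(ratings, "u1", 2, good)
not_a_candidate = personalized_split(ratings, "u2", 2, good)
```

An algorithm would then be trained on `train` and its top-N list for the user scored with `r_precision`, repeating the whole procedure for every candidate user and every N.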


Evaluation Protocol


EXPERIMENTS


Experiments

•  Datasets:
   –  Movielens 100k
      •  Minimum 20 ratings per user
      •  943 users
      •  6.43% density
      •  Not realistic
   –  Movielens 1M sample
      •  100k ratings
      •  1000 users
      •  3.0% density
•  Algorithms
   –  SVD
   –  User-based CF (kNN)
   –  Item-based CF

[Two plots, one per dataset: number of users (y-axis) against number of ratings (x-axis), log-scaled axes]
Experimental Settings

According to proposed protocol:
•  Evaluate R-precision for N=[1,5,10,20,50,100]
•  Users evaluated at N must have at least N items rated above the relevance threshold (RT)
•  RT depends on the user's mean rating and standard deviation
•  Number of runs: |N|*|users|

Baseline
•  Evaluate p@N for N=[1,5,10,20,50,100]
•  80%-20% training-test split
   –  Items in test set rated at least 3
•  Number of runs: 1
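The eligibility rule for the proposed protocol can be sketched as follows. The slides only state that RT depends on the user's mean rating and standard deviation; the form RT = mean + k * std and the parameter `k` used here are assumptions made for illustration, as are the function names and the toy ratings.

```python
from statistics import mean, pstdev

N_VALUES = [1, 5, 10, 20, 50, 100]  # values of N listed in the settings


def relevance_threshold(user_ratings, k=1.0):
    """Per-user relevance threshold.

    Assumption: RT = mean + k * std. The source only says that RT
    depends on both quantities, not how they are combined.
    """
    return mean(user_ratings) + k * pstdev(user_ratings)


def eligible_n(user_ratings, k=1.0):
    """Values of N at which this user can be evaluated."""
    rt = relevance_threshold(user_ratings, k)
    n_relevant = sum(1 for r in user_ratings if r > rt)
    return [n for n in N_VALUES if n_relevant >= n]


# Hypothetical user: six 5-star and six 1-star ratings (mean 3, std 2).
user_ratings = [5] * 6 + [1] * 6
rt = relevance_threshold(user_ratings, k=0.5)
levels = eligible_n(user_ratings, k=0.5)
```

With k = 0.5 this user has six ratings above the threshold and is evaluated at N = 1 and N = 5 only; with k = 1.0 the threshold equals the highest rating and the user is not evaluated at all, which shows how strongly the choice of threshold affects the set of candidate users.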

Results

[Figure: User-based CF, ML1M sample]


Results

[Figure panels: User-based CF, ML1M sample; User-based CF, ML100k; SVD, ML1M sample; SVD, ML1M sample]

Results

What about time?
–  |N|*|users| vs. 1?
–  Trade-off between a realistic evaluation and complexity?
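To make the |N|*|users| vs. 1 comparison concrete, the number of training runs can be counted as in the sketch below. The function names and the per-user counts of relevant items are hypothetical; the 943 users and the list of N values come from the slides. Since only users with at least N relevant items are evaluated at N, |N|*|users| is an upper bound rather than the exact count.

```python
N_VALUES = [1, 5, 10, 20, 50, 100]


def upper_bound_runs(num_users, n_values=N_VALUES):
    """|N| * |users|: one training run per user and per N."""
    return len(n_values) * num_users


def protocol_runs(relevant_counts, n_values=N_VALUES):
    """Runs actually needed: one per (user, N) pair where the user
    has at least N relevant items."""
    return sum(1 for count in relevant_counts
               for n in n_values if count >= n)


BASELINE_RUNS = 1  # a single 80%-20% split, trained once

ml100k_bound = upper_bound_runs(943)
# Hypothetical users with 3, 12 and 60 relevant items respectively.
toy_runs = protocol_runs([3, 12, 60])
```

For the 943 Movielens 100k users the bound is 5658 training runs against a single run for the baseline, which is the trade-off the slide asks about.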


Conclusions

•  We can emulate a realistic production scenario by creating personalized training/test sets and evaluating them for each candidate user separately
•  We can see how well a recommender performs at different levels of recall (page size)
•  We can compare against a gold standard
•  We can reduce evaluation time


Questions?

•  Thanks!
•  Also: check out
   –  ACM TIST Special Issue on RecSys Benchmarking: bit.ly/RecSysBe
   –  The ACM RecSys Wiki: www.recsyswiki.com

