The Web of Data is the natural evolution of the World Wide Web from a set of interlinked documents to a set of interlinked entities. It is a graph of information resources interconnected by semantic relations, thereby yielding the name Linked Data. The proliferation of Linked Data is for sure an opportunity to create a new family of data-intensive applications such as recommender systems. In particular, since content-based recommender systems base on the notion of similarity between items, the selection of the right graph-based similarity metric is of paramount importance to build an effective recommendation engine. In this paper, we review two existing metrics, SimRank and PageRank, and investigate their suitability and performance for computing similarity between resources in RDF graphs and investigate their usage to feed a content-based recommender system. Finally, we conduct experimental evaluations on a dataset for musical artists and bands recommendations thus comparing our results with two other content-based baselines
measuring their performance with precision and recall, catalog coverage, items distribution and novelty metrics.
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
An evaluation of SimRank and Personalized PageRank to build a recommender system for the Web of Data
1. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
An evaluation of SimRank and Personalized PageRank to
build a recommender system for the Web of Data
Phuong T. Nguyen, Paolo Tomeo, Tommaso Di Noia, Eugenio Di Sciascio
{phuong.nguyen, paolo.tomeo, tommaso.dinoia, eugenio.disciascio}@poliba.it
Polytechnic University of Bari - Bari (ITALY)
2. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Outline
Introduction
Recommender systems
Linked Open Data
SimRank and Personalized PageRank
Experimental Evaluation
Conclusions
3. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Introduction
Content-based recommender systems base on the
notion of similarity between items, but obviously
they need content
Web of data is an opportunity to foster data-
intensive applications
We have investigated how effectively SimRank and
Personalized PageRank work with Linked Data in the
recommendation task.
4. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Recommender Systems (RSs) are software tools and
techniques providing suggestions for items to be of
use to a user.
[F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors. Recommender Systems Handbook. Springer, 2011.]
Recommender Systems
5. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Recommender Systems (RSs) are software tools and
techniques providing suggestions for items to be of
use to a user.
[F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, editors. Recommender Systems Handbook. Springer, 2011.]
Recommender Systems
• Content-based filtering
• Collaborative filtering
6. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Content-based RSs try to recommend items similar to those a given user
has liked in the past. The recommendations are based upon a description
of the items and a profile of the user’s interests
Recommender
System
User profile
Content-based RSs
Items description
Personalized
Recommendations
7. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
We need domain knowledge
and rich descriptions of the items
No suitable suggestion without enough information
Quality of CB recommendations depends on quantity and quality of
the features explicitly associated to the items
Main drawback: Limited Content Analysis
P. Lops, M. de Gemmis, G. Semeraro. Content-based Recommender Systems: State of the Art and Trends. In
Recommender Systems Handbook: A Complete Guide for Research Scientists & Practitioners, Chapter 3, Springer, 2010.
8. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Linked Open Data
• Initiative for publishing and connecting data on the Web using
Semantic Web technologies
• >30 billion of RDF triples from hundreds of data sources
• Good resource to mine existing information and deduce new
knowledge
9. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Florence in DBpedia
Florence
Formers capital of Italydcterms:subject
Tuscany
dbpedia-owl:region
Italy
dbpedia-owl:country
.
.
.
Example of facts in Triple-form: subject- predicate - object
10. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Enrich Data model
Catalog Items Knowledge Graph
1 – Mapping (Entity linking)
2 – Subgraph extraction
11. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Enrich Data model
Catalog Items Knowledge Graph
Each item is represented by a part of this subgraph.
We need for algorithms able to compute similarity between graphs.
12. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
SimRank
SimRank computes similarity between nodes in a graph using the
structural context: two nodes are similar if they are referenced by
similar nodes.
G. Jeh and J. Widom. Simrank: A measure of structural-context similarity. In ACM KDD ’02, pages 538–543, 2002.
Given k ≥ 0, R(k) (α, β) = 1 with α = β. R(k) (α, β) = 0 with k = 0 and α = β.
Otherwise, the general formula is
13. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Personalized PageRank
A node receives an amount of rank from every node which points to
it and in turn transfers an amount of its rank to the nodes it refers to.
T. H. Haveliwala. Topic-sensitive pagerank. In ACM WWW ’02, pages 517–526, 2002. ACM.
The similarity between two items α and β represented by vectors α = {ai }i=1,..,n
and β = {bi }i=1,..,n is computed as the inner product space between the two vectors
14. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
K-Nearest Neighbors
K-Nearest Neighbors (k-NN) algorithm find the set neighbors(α)
containing the k most similar entities β to a given item α using a
similarity function sim(α, β)
A k-NN recommendation algorithm predicts the rating for item α for
an user u, taking into account the neighbors of α and the profile of u
15. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Evaluation Methodology
• Top-N Item recommendation task, for different number of N
(from top-1 to top-50 )
• 60-40% holdout split for each user
• Baselines: Vector Space Model (VSM) and a simplified variation of
the algorithm proposed in [T. Di Noia, R. Mirizzi, V. C. Ostuni, D. Romito, and
M. Zanker. Linked open data to support content-based recommender systems. In
ACM I-SEMANTICS ’12. ACM]
• The results for the four algorithms were computed with 10, 20, 40,
60, 80 neighbors
16. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Evaluation Metrics
Accuracy Precision (P@N)
Recall (R@N)
Sales diversity Catalog coverage
Entropy
Gini index
Novelty Long-tail percentage
Expected Popularity Complement (EPC@N)
17. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Dataset
Subset of Last.fm mapped to DBpedia
• 1,867 users
• 700 most popular artists and bands
• 47,330 ratings
• 113,386 number of extracted triples
Mappings
http://sisinflab.poliba.it/semanticweb/lod/recsys/datasets/
18. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Properties extracted
Properties extracted from DBpedia (113,386 extracted triples)
Inbound Outbound
dbo:producer
dbo:artist
dbo:writer
dbo:associatedBand
dbo:associatedMusicalArtist
dbo:musicalArtist
dbo:musicComposer
dbo:bandMember
dbo:formerBandMember
dbo:starring
dbo:composer
dcterms:subject
dbo:genre
dbo:associatedBand
dbo:associatedMusicalArtist
dbo:instrument
dbo:occupation
dbo:birthPlace
dbo:background
19. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Accuracy with 40 neighbors
PageRank and SimRank more accurate
20. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Catalog coverage with 40 neighbors
Worst coverage
21. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Distribution with 40 neighbors
Recommendations concentred on a few items
22. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Novelty Results
23. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Conclusions
SimRank and Personalized PageRank obtained interesting
results compared to the two baselines in terms of precision,
recall and novelty, even though catalog coverage and items
distribution decrease
Future work:
Same experiments on two more dataset: Movielens and
TheLibraryThing
Same experiments using other graph similarity metrics
24. WWW 2015 – 7th International Workshop on Web Intelligence & Communities
May 18, 2015 in Florence, Italy
Thanks for your
attention!
Q & A
Notes de l'éditeur
1
2
3
RSs are software tool able to suggest relevant items or information to the user in a personalized way
trying to deal with the information overload problem.
These systems must to be able to predict items that the user may have an intest in.
Typically, they present a list of recommendations, produced in one of two ways: content-based filtering or collaborative filtering.
In this work we are interested in content-based
In general Content Based approaches have some limitations with respect to other approaches like collaborative filtering, but at the same time they have also some advantages like dealing with cold start situations. One of the main drawback is the limited content analysis.
If there are no item’s descriptions the system cannot compute recommendations. To apply CB RS we need domain knowledge. Furthermore the better is the quality of the item’s descriptions the better is the quality of the recommendations