On the Web, the amount of structured and Linked Data about entities is constantly growing. Descriptions of single entities often include thousands of statements and it becomes difficult to comprehend the data, unless a selection of the most relevant facts is provided. This doctoral thesis addresses the problem of Linked Data entity summarization. The contributions involve two entity summarization approaches, a common API for entity summarization, and an approach for entity data fusion.
1. KIT – The Research University in the Helmholtz Association
INSTITUTE OF APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS (AIFB)
www.kit.edu
Linked Data Entity Summarization
Dipl.-Inf. Univ. Andreas Thalhammer 08.12.2016
Outline
1. Motivation
2. Research Questions
3. Contributions
a) LinkSUM (Contribution 1)
b) SUMMA API (Contribution 3)
4. Related Work
5. Summary and Outlook
Andreas Thalhammer – Linked Data Entity Summarization, 03.10.2018
1. MOTIVATION
Information need versus availability
Information need (in the US*)
More than 40% of all search queries focus on one specific entity.
579 million searches per day come from home and work devices in the US.
~232 million searches for entities (per day; in the US; desktop)
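The estimate above follows from the two preceding figures. A minimal sketch of the arithmetic (variable names are illustrative; since the share is "more than 40%", the result is a lower bound):

```python
# Figures taken from the slide; the variable names are illustrative only.
searches_per_day_millions = 579  # US desktop searches per day
entity_share = 0.40              # "more than 40%" of queries target one entity

entity_searches_millions = searches_per_day_millions * entity_share
print(round(entity_searches_millions))  # lower bound, in millions per day
```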
Information availability (Wikidata**)
Wikidata covers 24.5 million entities (growth of 55% in the last year).
3.2 million entities have more than 10 statements (growth of 78% in the last year).
* https://www.comscore.com/Insights/Rankings/comScore-Releases-February-2016-US-Desktop-Search-Engine-Rankings
** https://www.wikidata.org/wiki/Wikidata:Statistics
Growing amount of structured data on the Web
Wikidata entry for Pulp Fiction: ~614 facts
Naïve solution: Entity presentation based on class summaries
(Source: yahoo.com)
Problems of class summaries
1. The patterns are very static and do not reflect the individual particularities of entities.
2. A pattern needs to be created for each type, and class hierarchies need to be considered.
3. Some entities are of multiple (distinct) types with an unclear main type.
4. Some of the properties can have many values for which no ranking or cut-off is defined.
[Figure: Arnold Schwarzenegger typed as Person, Athlete, and Body builder; Angkor Wat]
Entity Summarization
Propositions:
Every entity is individual.
For different entities, different properties are of importance.
Entities of the same type do not always have the same attributes.
The same property-value pair can differ in relevance from one entity to another.
Solution:
Focus on individual particularities of each entity:
Entity Summarization
2. RESEARCH QUESTIONS
Challenge #1
RQ1: How can we effectively summarize entities with limited background information?
RQ1.1: How can we use link analysis effectively in order to derive summaries of entities?
RQ1.2: How can we use usage data analysis effectively in order to derive summaries of entities?
RDF data typically does not reflect importance levels in its relations.
Proprietary entity summarization systems have access to a lot of data (e.g., search queries) and infrastructure (e.g., a full Web index).
Other knowledge panel providers (such as publishers) lack that information and infrastructure.
(Source: google.com)
Challenge #2
RQ2: Is there a minimum set of recurring/common features of entity summarization systems that allows us to provide a generic API?
Providers of knowledge panels hide the original graph structure behind strongly abstracted interfaces.
Standardized programmatic access is desirable (but not available).
(Source: google.com)
(Source: developers.google.com/knowledge-graph)
Challenge #3
RQ3: How can we align duplicate/similar facts about Linked Data entities on the Web?
Different Web sources provide structured information about a single entity.
The different sources often cover similar information but do not provide corresponding links or vocabulary mappings.
Alignments are particularly difficult as the sources typically provide data at different levels of modeling granularity.
(Source: imdb.com)
(Source: wikidata.org)
3. CONTRIBUTIONS
Overview: Research Questions and Contributions
[Figure: overview of the components Knowledge Base(s), Input, Output, and UI together with the four contributions: LinkSUM (1, link structure), UBES (2, usage data), SUMMA API (3), and Entity Data Fusion (4)]
RQ1: How can we effectively summarize entities with limited background information?
RQ1.1: How can we use link analysis effectively in order to derive summaries of entities? (Contribution 1)
RQ1.2: How can we use usage data analysis effectively in order to derive summaries of entities? (Contribution 2)
RQ2: Is there a minimum set of recurring/common features of entity summarization systems that allows us to provide a generic API? (Contribution 3)
RQ3: How can we align duplicate/similar facts about Linked Data entities on the Web? (Contribution 4)
Contribution 1
LinkSUM
Idea: Use link analysis for selecting facts.
Step 1: Select top-k important related resources.
Step 2: Select the most relevant connecting predicate.
Approach: Resource Selection
[Figure: Quentin Tarantino and Pulp Fiction, connected by the relation “director”]
Compute PageRank [5] scores (pr) of entities from the (untyped) links that occur in the textual descriptions of entities.
Use “Backlinks” [7] (also called “mutual links”) to find strong connections (bl).
Combine the pr and bl scores.
dbpedia:Category:English-language_films 220.961
dbpedia:Quentin_Tarantino 13.7403
dbpedia:John_Travolta 10.5771
dbpedia:Miramax_Films 9.9398
... ...
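The resource selection step can be illustrated with a minimal sketch. The exact combination formula is not given here, so the code assumes a linear mix of a max-normalized PageRank score and a binary backlink indicator; the weight `alpha`, the function name `rank_resources`, and the backlink set are hypothetical.

```python
def rank_resources(pr, backlinks, alpha=0.5, k=5):
    """Rank the resources related to one entity (illustrative sketch).

    pr        -- dict: related resource -> PageRank score
    backlinks -- set of related resources that also link back to the entity
    alpha     -- assumed weight between the two signals (hypothetical)
    """
    max_pr = max(pr.values())
    scores = {
        r: alpha * (score / max_pr) + (1 - alpha) * (1.0 if r in backlinks else 0.0)
        for r, score in pr.items()
    }
    # Highest combined score first, cut off at k.
    return sorted(scores, key=scores.get, reverse=True)[:k]


# PageRank values as listed above.
pr_scores = {
    "dbpedia:Category:English-language_films": 220.961,
    "dbpedia:Quentin_Tarantino": 13.7403,
    "dbpedia:John_Travolta": 10.5771,
    "dbpedia:Miramax_Films": 9.9398,
}
# Assumed for illustration: these resources link back to Pulp Fiction.
mutual = {"dbpedia:Quentin_Tarantino", "dbpedia:John_Travolta", "dbpedia:Miramax_Films"}

top = rank_resources(pr_scores, mutual, alpha=0.5, k=3)
```

Under these assumptions the backlink signal pushes the category page, which has by far the highest PageRank but no mutual link, below the three resources that link back.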
Approach: Relation Selection
Problem: multiple relations can connect the same pair of resources (e.g., Quentin Tarantino and Pulp Fiction: director, writer of).
Approaches:
Frequency (FRQ): number of times the predicate is used.
Exclusivity (EXC): 1 / (N + M).
Description (DSC): #domain + #range + #label.
Combinations of those, e.g., FRQ * EXC.
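A minimal sketch of relation selection with FRQ * EXC. N and M are not defined here, so the code assumes N counts the triples (e, p, *) and M counts the triples (*, p, r); the toy triples and all function names are hypothetical, and DSC is left out.

```python
from collections import Counter


def frq(triples):
    """FRQ: how often each predicate is used in the data set."""
    return Counter(p for _, p, _ in triples)


def exc(triples, e, p, r):
    """EXC = 1 / (N + M). Assumed reading: N counts triples (e, p, *),
    M counts triples (*, p, r)."""
    n = sum(1 for s, q, _ in triples if s == e and q == p)
    m = sum(1 for _, q, o in triples if q == p and o == r)
    return 1.0 / (n + m)


def best_predicate(triples, e, r):
    """Pick the predicate linking e and r with the highest FRQ * EXC."""
    freq = frq(triples)
    candidates = sorted({p for s, p, o in triples if s == e and o == r})
    return max(candidates, key=lambda p: freq[p] * exc(triples, e, p, r))


# Hypothetical toy data: PF = Pulp Fiction, QT = Quentin Tarantino, and so on.
toy = [
    ("PF", "director", "QT"),
    ("PF", "writer", "QT"),
    ("PF", "starring", "JT"),
    ("RD", "director", "QT"),
    ("RD", "writer", "QT"),
    ("JB", "director", "QT"),
]
chosen = best_predicate(toy, "PF", "QT")
```

In this toy data "director" scores 3 * 1/4 and "writer" scores 2 * 1/3, so "director" is selected.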
Quantitative Evaluation: Dataset and Measures
Used reference dataset:
Introduced in Gunaratna et al. [3].
Contains human-created summaries of 50 entities (DBpedia 3.9, outgoing relations).
Includes seven top-5 and seven top-10 summaries for each entity.
The dataset was created by 15 experts from the Semantic Web field.
Used similarity measure:
Reference system:
FACES (introduced in [3]).
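The formula of the similarity measure is not reproduced here. A minimal sketch, assuming an overlap-based quality measure that averages the number of facts a system summary shares with each human-created reference summary; the function name and the toy data are hypothetical.

```python
def summary_quality(system_summary, reference_summaries):
    """Average overlap between one system summary and the reference summaries.

    Facts may be full triples or subject-object pairs; any hashable
    representation works.
    """
    system = set(system_summary)
    overlaps = [len(system & set(ref)) for ref in reference_summaries]
    return sum(overlaps) / len(overlaps)


# Hypothetical facts, abbreviated to single letters.
system = ["a", "b", "c"]
references = [["a", "b", "x"], ["a", "y", "z"]]
quality = summary_quality(system, references)
```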
Quantitative Evaluation: Results
SO: subject-object pairs (predicates not considered).
SPO: full triple.
SD: standard deviation.
Significance markers: with respect to both LinkSUM configurations (p < 0.05); with respect to the best LinkSUM configuration (p < 0.05).
[Table: results of LinkSUM config-1 and config-2 compared with FACES]
Qualitative Evaluation: Setup
Scenario: Search Engine Result Page (SERP).
20 users, 10 entities (from the FACES dataset).
Qualitative Evaluation: Results
In some cases the task is subjective.
Reasons for:
Selection
- the presented related resources are relevant for the entity.
Rejection
- redundancy.
- related resources do not characterize the entity.
Focus: PageRank (1)
PageRank is not perfect, for example:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX v: <http://purl.org/voc/vrank#>
# Top five scientists by PageRank score
SELECT ?e ?r
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
  ?e rdf:type dbo:Scientist ;
     v:hasRank/v:rankValue ?r .
}
ORDER BY DESC(?r) LIMIT 5
dbpedia:Carl_Linnaeus 551.791
dbpedia:Charles_Darwin 215.028
dbpedia:Albert_Einstein 186.549
dbpedia:Isaac_Newton 167.811
dbpedia:Sigmund_Freud 140.245
Focus: PageRank (2)
Important parameters (for resources r):
l(r) – returns all pages that link to r.
c(r) – the number of outgoing links of r.
d – the damping factor
Traditional PageRank [5]:
Variant: Weighted Links Rank (WLRank) [6]:
Link weights (lw): relative position of a link in the article
[8]
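A minimal sketch of both rankings, using the parameters named above. It assumes the non-normalized PageRank form pr(r) = (1 - d) + d * sum over n in l(r) of pr(n) / c(n) from [5], and for WLRank it assumes that each source distributes its score in proportion to the link weights lw. How a relative position is mapped to a weight is not specified here, so the weights are plain inputs; function names and the toy graph are hypothetical.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """out_links: resource -> list of distinct resources it links to."""
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    pr = {r: 1.0 for r in nodes}
    for _ in range(iterations):
        nxt = {r: 1.0 - d for r in nodes}
        for source, targets in out_links.items():
            for t in targets:
                # Each source splits its score evenly over its c(source) links.
                nxt[t] += d * pr[source] / len(targets)
        pr = nxt
    return pr


def wlrank(weighted_links, d=0.85, iterations=50):
    """weighted_links: resource -> {target: positive link weight lw}."""
    nodes = set(weighted_links) | {t for ws in weighted_links.values() for t in ws}
    pr = {r: 1.0 for r in nodes}
    for _ in range(iterations):
        nxt = {r: 1.0 - d for r in nodes}
        for source, weights in weighted_links.items():
            total = sum(weights.values())
            for t, lw in weights.items():
                # Each source splits its score in proportion to the link weights.
                nxt[t] += d * pr[source] * lw / total
        pr = nxt
    return pr


# Hypothetical three-page graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
plain = pagerank(graph)
uniform = wlrank({s: {t: 1.0 for t in ts} for s, ts in graph.items()})
```

With uniform weights the weighted variant reduces to the traditional ranking, which gives a simple sanity check for any weighting scheme plugged in.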
Focus: PageRank (3)
Newly constructed rankings:
ALL – all links from the article text and from the templates.
ATL – article text links.
TEL – template links.
ATL-RP – article text links with WLRank and relative position.
Size of input dataset:
         ALL          ATL          TEL         ATL-RP
# links  159,398,815  142,305,605  26,460,273  143,056,545
Reference rankings (page-view-based):
TOWR-PV – “The Open Wikipedia Ranking”
SUB – SubjectiveEye3D by Paul Houle
Focus: PageRank (4)
Measure: Spearman rank correlation (range: [-1, 1])
Results:
Conclusions:
The poor correlation of TEL with TOWR-PV/SUB results from the small input dataset.
Weighting by relative position improves the correlation with SUB. These findings are supported by [4].
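A minimal sketch of the Spearman rank correlation used to compare two rankings. It assumes both rankings cover the same entities and contain no ties, so the closed form 1 - 6 * sum(d^2) / (n * (n^2 - 1)) applies; the function name and the scores are hypothetical.

```python
def spearman(scores_a, scores_b):
    """Spearman rank correlation of two score dicts over the same keys (no ties)."""
    keys = list(scores_a)

    def ranks(scores):
        ordered = sorted(keys, key=lambda key: scores[key], reverse=True)
        return {key: position + 1 for position, key in enumerate(ordered)}

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(keys)
    d2 = sum((ra[key] - rb[key]) ** 2 for key in keys)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for three entities.
link_based = {"x": 3.0, "y": 2.0, "z": 1.0}
view_based = {"x": 1.0, "y": 2.0, "z": 3.0}

same = spearman(link_based, link_based)
opposite = spearman(link_based, view_based)
```

Identical orderings yield 1 and fully reversed orderings yield -1, matching the stated range [-1, 1].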
Conclusions and Impact
Conclusions:
LinkSUM significantly outperforms the state of the art.
Entity summarization:
Focus should be on selecting relevant resources.
Redundancies at the object level should be avoided.
LinkSUM is lightweight and can be applied in other scenarios, e.g.
Web sites with semantic annotations.
Semantic MediaWikis.
Impact:
Published and presented as full research paper at ICWE 2016.
The PageRank scores are published online and found many adopters
(e.g., the official DBpedia SPARQL endpoint includes the scores)
In use at the WDAqua project (http://wdaqua.eu/).
Contribution 3
SUMMA API
Idea: A common API for entity summaries
Use cases:
Quantitative evaluation.
Qualitative evaluation.
A/B testing.
Combination of summary services.
Approach: SUMMA API
Parameters:
URI (of the entity e) – the entity needs to be identified
k (number) – an upper limit of facts related to e
Multi-language support
Statement groups (e.g., biographical data)
Restriction to specific properties
Multi-hop search space
SUMMA Vocabulary:
summa:Summary (class)
summa:SummaryGroup (class)
summa:entity → rdfs:Resource
summa:topK → xsd:positiveInteger
summa:language → xsd:string
summa:maxHops → xsd:positiveInteger
summa:fixedProperty → rdf:Property
summa:statement → rdf:Statement
summa:group
summa:path
[Figure: multi-hop example with PF, JT, VV, a blank node (_:), and the relations starring, actor, and role]
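A minimal sketch of how the parameters could be serialized with the SUMMA vocabulary terms. The class `SummaryRequest`, its field names, and the plain-integer encoding of `summa:topK` and `summa:maxHops` are assumptions for illustration, not the reference implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SUMMA_NS = "http://purl.org/voc/summa/"


@dataclass
class SummaryRequest:
    """Hypothetical container for the parameters of one summary request."""
    entity: str                       # URI of the entity e
    top_k: int                        # upper limit of facts related to e
    language: Optional[str] = None    # optional language restriction
    max_hops: Optional[int] = None    # optional multi-hop search space
    fixed_properties: Tuple[str, ...] = ()  # optional property restriction

    def to_turtle(self) -> str:
        pairs = [("summa:entity", f"<{self.entity}>"),
                 ("summa:topK", str(self.top_k))]
        if self.language is not None:
            pairs.append(("summa:language", f'"{self.language}"'))
        if self.max_hops is not None:
            pairs.append(("summa:maxHops", str(self.max_hops)))
        for prop in self.fixed_properties:
            pairs.append(("summa:fixedProperty", f"<{prop}>"))
        body = " ;\n".join(f"    {p} {o}" for p, o in pairs)
        return f"@prefix summa: <{SUMMA_NS}> .\n\n[ a summa:Summary ;\n{body} ] ."


request = SummaryRequest("http://dbpedia.org/resource/Barack_Obama", top_k=10)
turtle = request.to_turtle()
```

Optional parameters are emitted only when set, so a request with just the entity and k stays as small as possible.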
SUMMA RESTful Interaction:
Client → Server:
POST [ a :Summary; :entity dbpedia:Barack_Obama; :topK 10 ] .
Server → Client:
201 CREATED
Location: http://example.com/summary?entity=dbpedia:Barack_Obama&topK=10
@prefix summa: <http://purl.org/voc/summa/> .
...
Client → Server:
GET http://example.com/summary?entity=dbpedia:Barack_Obama&topK=10
Server → Client:
200 OK
@prefix summa: <http://purl.org/voc/summa/> .
...
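The two-step interaction can be sketched with an in-memory stand-in for the server. The functions `post_summary` and `get_summary`, the dictionary store, and the placeholder summary body are hypothetical; only the status codes, the Location pattern, and the prefix line follow the exchange shown above.

```python
from urllib.parse import urlencode

BASE = "http://example.com/summary"
_store = {}  # Location URL -> summary document (stand-in for a real server)


def post_summary(entity, top_k):
    """Handle POST: create the summary resource, answer 201 plus its Location."""
    location = BASE + "?" + urlencode({"entity": entity, "topK": top_k}, safe=":")
    document = (
        "@prefix summa: <http://purl.org/voc/summa/> .\n"
        f"# placeholder for the top-{top_k} summary of {entity}"
    )
    _store[location] = document
    return 201, {"Location": location}, document


def get_summary(url):
    """Handle GET: return the stored summary, or 404 if it was never created."""
    if url in _store:
        return 200, {}, _store[url]
    return 404, {}, ""


status, headers, _ = post_summary("dbpedia:Barack_Obama", 10)
follow_up_status, _, body = get_summary(headers["Location"])
```

Because the Location URL encodes the request parameters, a client can later retrieve the same summary with a plain GET.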
Analysis: Setup
Search Engines:
Google Knowledge Graph
Microsoft Bing Satori/Snapshots
Yahoo Knowledge
News Portals (Alexa Top 25 News sites):
Forbes
BBC News
Can the user interfaces be generated with data from the
SUMMA API without changing their layout?
Analysis: Criteria
Features:
1. Property Restriction
2. Statement Groups
3. Multi-hop Search Space
4. Languages
Five entities:
Spain (country)
Dirk Nowitzki (person/athlete)
Ramones (band)
SAP (company/organization)
Inglourious Basterds (movie)
(Source: http://google.com)
Analysis: Results
Which features were required by the respective system?
Conclusions and Impact
Conclusions:
Decouple the user interface from the actual entity summarization system by defining a common API.
Lightweight and extensible vocabulary and interaction mechanism.
Reference implementations and their source code are publicly available.
Empirical analysis demonstrates applicability in real-world scenarios.
Impact:
Published and presented as full research paper at ICWE 2015.
Best Paper Candidate at ICWE 2015.
Best Demo Award at ICWE 2016.
4. RELATED WORK
Related Work
Who else is working on this?
Google [1], Microsoft, Yahoo, etc.
Other researchers in the field of the Semantic Web, e.g.:
Cheng et al. [2]
Gunaratna et al. [3]
What distinguishes the presented work from theirs?
LinkSUM is a lightweight and effective approach.
UBES is the first approach that uses usage data for entity summarization.
SUMMA API: first and currently only API definition that enables the
exchange of entity summaries.
Entity Data Fusion: First approach that focuses on general alignment of
structured entity data on the Web.
[Figure: the presented work positioned between approaches that use RDF plus lots of background data and approaches that use (only) RDF data]
5. SUMMARY AND OUTLOOK
Summary
We provided contributions for Linked Data Entity Summarization.
Impact was created on the levels of research and dataset/system adoption.
Combination with entity linking is possible.
The addressed problem is highly relevant for search and question answering engines.
Outlook
Full integration of the entity data fusion approach.
Addressing literal values.
Personalized/contextualized summaries of entities.
Abstract entity summarization.
Questions?
Publications
Contribution 1
Andreas Thalhammer, Nelia Lasierra, Achim Rettinger: LinkSUM: Using Link Analysis to Summarize Entity Data, In Web Engineering: 16th
International Conference, ICWE 2016. Proceedings, vol. 9671 of Lecture Notes in Computer Science, pages 244–261. Springer, 2016
Andreas Thalhammer and Achim Rettinger: Browsing DBpedia Entities with Summaries. The Semantic Web: ESWC 2014 Satellite Events,
Lecture Notes in Computer Science 2014, pages 511-515, Springer 2014
Andreas Thalhammer and Achim Rettinger: PageRank on Wikipedia: Towards General Importance Scores for Entities. In The Semantic
Web: ESWC 2016 Satellite Events, Heraklion, Crete, Greece, May 29 – June 2, 2016, Revised Selected Papers, pages 227–240. Springer,
2016.
Contribution 2
Andreas Thalhammer, Ioan Toma, Antonio J. Roa-Valverde, Dieter Fensel: Leveraging Usage Data for Linked Data Movie Entity
Summarization. In Proceedings of the 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD’12), 2012.
Andreas Thalhammer, Magnus Knuth, Harald Sack: Evaluating Entity Summarization Using a Game-Based Ground Truth. In International
Semantic Web Conference (2), vol. 7650, pages 350–361. Springer, 2012.
Contribution 3
Antonio Roa-Valverde, Andreas Thalhammer, Ioan Toma, and Miguel-Angel Sicilia: Towards a formal model for sharing and reusing
ranking computations. In Proceedings of the 6th International Workshop on Ranking in Databases In conjunction with VLDB 2012.
Andreas Thalhammer and Steffen Stadtmüller. SUMMA: A Common API for Linked Data Entity Summaries. In P. Cimiano, F. Frasincar,
G.-J. Houben, and D. Schwabe, editors, Engineering the Web in the Big Data Era, vol. 9114, pages 430-446. Springer, 2015.
Andreas Thalhammer, Achim Rettinger: ELES: Combining Entity Linking and Entity Summarization. In Web Engineering: 16th International
Conference, ICWE 2016. Proceedings, vol. 9671 of Lecture Notes in Computer Science, pages 547–550. Springer, 2016
Contribution 4
Andreas Thalhammer, Steffen Thoma, Andreas Harth: Entity-Centric Claim Reconciliation in Web Data, Submitted to WWW 2017.
References
[1] A. Singhal. Introducing the knowledge graph: things, not strings. http://goo.gl/kH1NKq, 2012.
[2] G. Cheng, T. Tran, and Y. Qu. RELIN: relatedness and informativeness-based centrality for entity summarization. In Proc. of the 10th int. conf. on The Semantic Web - Vol. Part I, ISWC’11. Springer, 2011.
[3] K. Gunaratna, K. Thirunarayan, and A. P. Sheth. FACES: diversity-aware entity summarization using incremental hierarchical conceptual clustering. In Proc. of the 29th AAAI Conf. on Artificial Intelligence, Austin, Texas, USA, 2015.
[4] D. Dimitrov, P. Singer, F. Lemmerich, and M. Strohmaier. What Makes a Link Successful on Wikipedia? https://arxiv.org/abs/1611.02508
[5] S. Brin and L. Page. The Anatomy of a Large-scale Hypertextual Web Search Engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117. Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 1998.
[6] R. Baeza-Yates and E. Davis. Web Page Ranking Using Link Attributes. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, WWW Alt. ’04, pages 328–329, New York, NY, USA, 2004. ACM.
[7] J. Waitelonis and H. Sack. Towards exploratory video search using linked data. Multimedia Tools and Applications, 59:645–672, 2012. 10.1007/s11042-011-0733-1.
[8] Artwork by Felipe Micaroni Lalli (micaroni@gmail.com).
Editor's notes
Good afternoon. I would like to welcome the committee and the audience to my PhD defense. My name is Andreas Thalhammer and the title of my PhD thesis is “Linked Data Entity Summarization”.
Wikidata is a Wikipedia project ...
Roughly 600 facts.
Now you could say: that’s too much, just show me the top part.
Show facts in a common order: release date, rating, ...
This seems reasonable.
But: the second one has an important part missing: “it was the first animated feature film by Walt Disney, it is based on a fairy tale”.
Arnold Schwarzenegger – body builder, actor, politician
Angkor Wat – tourist attraction, human-built structure, Hindu and Buddhist temple
Example: for Snow White the production company is of particular importance; for Pulp Fiction not so much.
Ocean: Sri Lanka borders one (Indian Ocean); Austria doesn’t.
If two movies have John Travolta as an actor, it might be more important for the one and not so important for the other.
So why is it desirable: exchange, combine and remix summaries. Evaluate summaries in different ways.
Baeza-Yates
Filling the gap between approaches that have large amounts of background data and those that only use RDF.