This presentation on data enrichment is part of the ARCOMEM training curriculum. Feel free to roam around or contact us on Twitter via @arcomem to learn more about ARCOMEM training on archiving Social Media.
1. Entity Enrichment and Clustering in ARCOMEM
Elena Demidova (1),
including slides by: Stefan Dietze (1), Diana Maynard (2), Thomas Risse (1), Wim Peters (2), Katerina Doka (3), Yannis Stavrakas (3)
(1) L3S Research Center, Hannover, Germany
(2) University of Sheffield, UK
(3) IMIS, RC ATHENA, Athens, Greece
2. The ARCOMEM approach
• Make use of the Social Web
– Huge source of user generated content
– Wide range of articulation methods, from simple "I like it" buttons to complete articles
– Represents the diversity of opinions of the public
• User activities are often triggered by
– Events and related entities (e.g. sport events, celebrations, crises, news articles, persons, locations)
– Topics (e.g. global warming, financial crisis, swine flu)
A semantics-aware and socially driven preservation model is a natural way to go
3. ARCOMEM architecture
The ARCOMEM system architecture foresees four processing levels: the crawler level, the online processing level, the offline processing level, and cross-crawl analysis.
4. ETOE offline processing chain
The processing chain depicted here describes all components involved in
the offline processing of Web objects.
5. The extraction components for text
Aim
• Extraction of Entities, Topics, Events and Opinions (ETOEs) from
– Web pages
– Social Web (Twitter, YouTube, Facebook, …)
Challenges
• Entity recognition from degraded input sources (tweets etc.)
• Advancing state-of-the-art NLP and text mining
• Dynamics detection: evolution of terms/entities
• Semantic representation of Web objects and entities
• Appropriate RDF schemas for ETOEs and Web objects
• Exploiting (Linked Open) Web data to enrich extracted ETOEs
• Entity classification (into events, locations, topics etc.) & consolidation
7. Data consolidation & integration problem
Data extracted by different components or during different processing cycles is not aligned
=> consolidation, disambiguation & correlation required.
<Location>Greece</Location>
<Person>Venizelos</Person>
<Location>Griechenland</Location>
<Organisation>Greek Parliament</Organisation>
?
8. Data enrichment & clustering
Enrichment of entities with related references to Linked
Data, particularly reference datasets (DBpedia, Freebase, …)
=> use enrichments for clustering/correlation/consolidation
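As a rough illustration of how enrichments can drive consolidation, the sketch below groups extracted entities by the Linked Data URI they were enriched with. The record layout, the function name cluster_by_enrichment and the URIs assigned to each entity are assumptions made for this example; they are not taken from the ARCOMEM system.

```python
from collections import defaultdict

# Hypothetical extraction output: (entity type, surface label, enrichment URI).
# The URIs are illustrative assumptions, not actual ARCOMEM results.
entities = [
    ("Location", "Greece", "http://dbpedia.org/resource/Greece"),
    ("Location", "Griechenland", "http://dbpedia.org/resource/Greece"),
    ("Person", "Venizelos", "http://dbpedia.org/resource/Evangelos_Venizelos"),
    ("Organisation", "Greek Parliament", "http://dbpedia.org/resource/Hellenic_Parliament"),
]


def cluster_by_enrichment(records):
    """Group entity labels by the enrichment URI they were linked to."""
    clusters = defaultdict(list)
    for _entity_type, label, uri in records:
        clusters[uri].append(label)
    return dict(clusters)


clusters = cluster_by_enrichment(entities)
for uri, labels in clusters.items():
    print(uri, labels)
```

Under these assumptions, "Greece" and "Griechenland" end up in the same cluster because both point to the same reference URI, even though their surface labels differ.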
9. Enrichment with DBpedia & Freebase
• DBpedia and Freebase are particularly well suited for three reasons: their vast size; the availability of disambiguation techniques that exploit the variety of multilingual labels both datasets provide for individual data items; and the high level of interconnectedness of both datasets, which allows a wealth of related information to be retrieved for a given item.
• For DBpedia, we use the DBpedia Spotlight service, which performs approximate string matching with an adjustable confidence level in the interval [0,1]. Based on experiments, we set the confidence to 0.6.
• For Freebase, we use structured queries that take into account the entity types extracted by GATE.
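The two lookups above might be sketched as follows. This is a minimal sketch under several assumptions: the endpoint URL is the public DBpedia Spotlight service rather than whatever deployment ARCOMEM used, the GATE-to-Freebase type mapping FREEBASE_TYPES is hypothetical, and the Freebase query is only an MQL-style structure for illustration (the Freebase API has since been retired). The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: public DBpedia Spotlight endpoint; the ARCOMEM deployment may differ.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

# Hypothetical mapping from GATE entity types to Freebase types.
FREEBASE_TYPES = {
    "Person": "/people/person",
    "Organisation": "/organization/organization",
    "Location": "/location/location",
}


def spotlight_request(text, confidence=0.6):
    """Build (but do not send) an annotate request with the given confidence."""
    query = urlencode({"text": text, "confidence": confidence})
    return Request(f"{SPOTLIGHT_URL}?{query}", headers={"Accept": "application/json"})


def parse_spotlight(payload):
    """Extract (surface form, DBpedia URI) pairs from a Spotlight JSON response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in payload.get("Resources", [])]


def freebase_query(label, gate_type):
    """Build an MQL-style query restricted to the type suggested by GATE."""
    return [{"id": None, "name": label, "type": FREEBASE_TYPES[gate_type]}]


request = spotlight_request("Jean Claude Trichet")

# Hand-written sample in the shape of a Spotlight JSON response (not real output).
sample = {
    "Resources": [
        {
            "@surfaceForm": "Jean Claude Trichet",
            "@URI": "http://dbpedia.org/resource/Jean-Claude_Trichet",
        }
    ]
}
pairs = parse_spotlight(sample)
mql = freebase_query("ECB", "Organisation")
```

Restricting the Freebase query by type is what lets the entity type from GATE narrow down candidates that share the same label.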
10. Enrichment for clustering & correlation: example
<Person>Jean Claude Trichet</Person>
<Organisation>ECB</Organisation>
<Event>Trichet warns of systemic debt crisis</Event>
11. Enrichment for clustering & correlation: example
<Person>Jean Claude Trichet</Person>
<Organisation>ECB</Organisation>
<Event>Trichet warns of systemic debt crisis</Event>
<Enrichment>http://dbpedia.org/resource/Jean-Claude_Trichet</Enrichment>
<Enrichment>http://dbpedia.org/resource/ECB</Enrichment>
12. Enrichment for clustering & correlation: example
<Person>Jean Claude Trichet</Person>
<Organisation>ECB</Organisation>
<Event>Trichet warns of systemic debt crisis</Event>
<Enrichment>http://dbpedia.org/resource/Jean-Claude_Trichet</Enrichment>
<Enrichment>http://dbpedia.org/resource/ECB</Enrichment>
=> dbpprop:office
dbpedia:President_of_the_European_Central_Bank
dbpedia:Governor_of_the_Banque_de_France
=> dcterms:subject
category:Living_people
category:Karlspreis_recipients
category:Alumni_of_the_École_Nationale_d'Administration
category:People_from_Lyon
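Property values like those listed above could be retrieved from a SPARQL endpoint. The sketch below only builds the query text; the function name enrichment_properties_query is hypothetical, and the slides do not say how ARCOMEM actually fetched these values.

```python
def enrichment_properties_query(resource_uri):
    """Return a SPARQL query for the office and subject values of one resource."""
    return f"""
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?property ?value WHERE {{
  VALUES ?property {{ dbpprop:office dcterms:subject }}
  <{resource_uri}> ?property ?value .
}}
""".strip()


query = enrichment_properties_query("http://dbpedia.org/resource/Jean-Claude_Trichet")
print(query)
```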
13. ARCOMEM entities, enrichments & clusters
Nodes: entities/events (blue), enrichments DBpedia (green), Freebase (orange)
1013 clusters of correlated entities/events
Cluster built around enrichment db:Market
14. Cluster expansion with related enrichments
Clusters can be further expanded by considering related enrichments in the reference knowledge
base. This is an experimental feature that is currently not included in the SARA application.
Figure labels: "Cluster built around enrichment db:Market"; "Cluster expansion"
15. Clustering of entities via enrichment relatedness
Discovery of “related” entities by discovering related enrichments
(a) Retrieving possible paths between two enrichments (e.g. via RelFinder, http://www.visualdataweb.org/relfinder.php)
(b) Computing a relatedness measure (considering variables such as shortest path, number of paths, relationship types, number of directly connected edges of both enrichments, …)
(c) Clustering enrichments (entities) whose relatedness is above a certain threshold
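Steps (a) to (c) can be sketched on a toy graph. Everything specific here is an assumption for illustration: the link graph GRAPH is invented, the measure (each path contributes 1/length, so short and numerous paths score higher) is only one simple way to combine path length and path count, and the threshold is arbitrary. The slides do not specify the measure actually used.

```python
from itertools import combinations

# Hypothetical link graph between enrichments (undirected adjacency sets).
GRAPH = {
    "db:Jean-Claude_Trichet": {"db:ECB"},
    "db:ECB": {"db:Jean-Claude_Trichet", "db:Euro"},
    "db:Euro": {"db:ECB", "db:Greece"},
    "db:Greece": {"db:Euro"},
    "db:Lyon": set(),
}


def simple_path_lengths(graph, start, goal, max_len):
    """(a) Yield the edge count of each simple path with at most max_len edges."""
    stack = [(start, {start}, 0)]
    while stack:
        node, seen, length = stack.pop()
        if node == goal:
            yield length
            continue
        if length == max_len:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                stack.append((nxt, seen | {nxt}, length + 1))


def relatedness(graph, a, b, max_len=3):
    """(b) Assumed measure: sum of 1/length over all paths found between a and b."""
    if a == b:
        return float("inf")
    return sum(1.0 / n for n in simple_path_lengths(graph, a, b, max_len))


def cluster(graph, threshold, max_len=3):
    """(c) Merge enrichments whose pairwise relatedness exceeds the threshold."""
    parent = {n: n for n in graph}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(graph, 2):
        if relatedness(graph, a, b, max_len) > threshold:
            parent[find(a)] = find(b)
    groups = {}
    for n in graph:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


clusters = cluster(GRAPH, threshold=0.4)
```

Because merging is transitive, two enrichments can share a cluster even if their direct relatedness is below the threshold, as long as a chain of sufficiently related pairs connects them.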
16. RDF schema for the Knowledge Base
Relationships between ARCOMEM entities (ETOE etc) and enrichments
RDF schema: http://www.gate.ac.uk/ns/ontologies/arcomem-datamodel.rdf
17. Enrichment evaluation results
Manual evaluation of 240 enrichment-entity pairs
Available scores: 1 (correct), 0 (incorrect), 0.5 (vague or
ambiguous relationship)
Entity Type          Average score DBpedia   Average score Freebase   Average score Total
arco:Event           0.71                    -                        0.71
arco:Location        0.81                    0.94                     0.88
arco:Money           0.67                    -                        0.67
arco:Organization    0.93                    1                        0.97
arco:Person          0.9                     0.89                     0.89
arco:Time            0.74                    -                        0.74
Total                0.79                    0.94                     0.87
18. Further reading
• Entity Extraction and Consolidation for Social Web Content Preservation. S. Dietze, D. Maynard, E. Demidova, T. Risse, W. Peters, K. Doka and Y. Stavrakas. SDA, volume 912 of CEUR Workshop Proceedings, pages 18-29. CEUR-WS.org, 2012.
• Can entities be friends? B. P. Nunes, R. Kawase, S. Dietze, D. Taibi, M. A. Casanova, W. Nejdl. Web of Linked Entities (WOLE2012), Workshop at the 11th International Semantic Web Conference (ISWC2012), Boston, US, 2012.
• Combining a co-occurrence-based and a semantic measure for entity linking. B. P. Nunes, S. Dietze, M. A. Casanova, R. Kawase, B. Fetahu, W. Nejdl. ESWC 2013 - 10th Extended Semantic Web Conference, 2013.
• Linked Data - The Story So Far. C. Bizer, T. Heath and T. Berners-Lee. Special Issue on Linked Data, International Journal on Semantic Web and Information Systems (IJSWIS), 2009.