Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes

Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto

NLP & DBpedia 2014 Workshop @ ISWC 2014, Riva del Garda, Italy, October 20, 2014


Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes
Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto
Data and Web Science Research Group, University of Mannheim, Germany
NLP & DBpedia @ ISWC, Riva del Garda, Italy, October 20, 2014
Outline
1. State of the art: temporally annotated data in DBpedia and LOD
2. Temporally annotated data extraction pipeline
3. Company dataset
  • Statistics
  • Comparison with other KBs
4. Ongoing and future work
Why we need historical LOD
• “Historical data” == any data that is or can be temporally annotated
  • population of a city, revenue of a company, current club of a football player
• Why we need such data
  • Allows having a more precise description of an entity
  • Enables LOD-based data mining for trend prediction
• Availability of temporally annotated data on the Web of Data
  • Poor and scarce
  • Examples can be found in Freebase, Wikidata, YAGO, …
  • Temporally annotated facts or – not so frequently – time series
• Some exceptionally good examples follow…
Temporally annotated data: Examples
Apple Inc. in Wikidata
http://www.wikidata.org/wiki/Q312
Temporally annotated data: Examples
Apple Inc. in Freebase
http://www.freebase.com/m/0k8z
Temporally annotated data in DBpedia
• DBpedia's main source of knowledge is Wikipedia infoboxes
• Temporal (time-dependent) infobox attributes
  1. Not at all temporally annotated
  2. Temporally annotated, annotation is modeled as a separate attribute
  3. Temporally annotated, annotation is a part of the attribute value
• Often only the latest value is present
• When a new value is available, the old one is overwritten
Our focus: case 3, where the temporal annotation is a part of the attribute value
Temporally annotated data in DBpedia
• Temporal (time-dependent) infobox attributes
  1. Not at all temporally annotated
Temporally annotated data in DBpedia
• Temporal (time-dependent) infobox attributes
  1. Not at all temporally annotated
  2. Temporally annotated, annotation is modeled as a separate attribute
    • Often lost during DBpedia data extraction
    • E.g. no connection between the populationTotal and populationAsOf properties
Temporally annotated data in DBpedia
• Temporal (time-dependent) infobox attributes
  1. Not at all temporally annotated
  2. Temporally annotated, annotation is modeled as a separate attribute
    • Ends up in DBpedia only if an intermediate node mapping is defined in the mapping wiki
Temporally annotated data in DBpedia
• Temporal (time-dependent) infobox attributes
  1. Not at all temporally annotated
  2. Temporally annotated, annotation is modeled as a separate attribute
  3. Temporally annotated, annotation is a part of the attribute value
    • Annotation is lost during extraction
    • In most cases the value is regularly overwritten
Idea: go back in time
• Properties of interest
  • Temporally annotated, annotation is a part of the attribute value
  • Use case: business and financial data (companies)
• Key observations
  • Attribute values are often temporally annotated
  • If the annotation is part of the attribute value, the DBpedia extraction framework ignores it
  • Attribute values are regularly overwritten by Wikipedia editors, but the trace remains in the Wikipedia revision history
  • The DBpedia data extraction process is run on one (e.g. the latest) dump only
• Proposed solution
  • Run extraction on (part of) the revision history
  • Add a temporal tagger to the process
Extraction pipeline
1. Select and download Wikipedia revisions
2. Extract temporal facts
3. Merge facts
• Code available at https://github.com/normalerweise/mte
Extraction pipeline
1. Select and download Wikipedia revisions
  • Select 4 revisions per year (the 1st, 2nd and 3rd quartile and the last revision)
  • Use the MediaWiki API to download the revisions
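The per-year sampling above can be sketched as follows; the exact quartile indexing used by the pipeline is not given on the slide, so the positions chosen here (and the function name) are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime

def select_revisions(revisions):
    """Pick up to 4 revisions per year: those at the 1st, 2nd and 3rd
    quartile positions of the year's chronological revision list, plus
    the last one. `revisions` is a list of (timestamp, revision_id)."""
    by_year = defaultdict(list)
    for ts, rev_id in sorted(revisions):
        by_year[ts.year].append(rev_id)
    selected = []
    for year in sorted(by_year):
        revs = by_year[year]
        n = len(revs)
        # quartile positions; duplicates collapse for years with few revisions
        picks = {revs[(n - 1) // 4], revs[(n - 1) // 2],
                 revs[3 * (n - 1) // 4], revs[-1]}
        selected.extend(sorted(picks, key=revs.index))
    return selected
```

For a year with eight evenly spaced revisions this keeps every second one plus the year's final revision.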
Extraction pipeline
2. Extract temporal facts
  • Parse each infobox attribute twice
    • For the value: Mapping Extractor of the DBpedia Extraction Framework
    • For the time validity (point or interval): HeidelTime
  • HeidelTime is a multilingual, cross-domain, rule-based temporal tagger
    • Developed at the University of Heidelberg
    • http://dbs.ifi.uni-heidelberg.de/index.php?id=129
Extraction pipeline
2. Extract temporal facts: example

{{ Infobox company
| name = Netflix, Inc.
| revenue = US$4.37 billion (''FY 2013'')
...

<Netflix, revenue, 4.37E9, usDollar, 2013, 610604061>
(the last element of the tuple is the source revision ID)
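The two-pass parse behind this example can be illustrated with a deliberately simplified stand-in: the real pipeline uses the DBpedia Mapping Extractor for the value and HeidelTime for the temporal expression, whereas the sketch below substitutes two regexes (the function name and regex patterns are assumptions, not the pipeline's actual code).

```python
import re

AMOUNT = re.compile(r"US\$([\d.]+)\s*(million|billion)")
YEAR = re.compile(r"\b(19|20)\d{2}\b")
SCALE = {"million": 1e6, "billion": 1e9}

def extract_fact(subject, prop, raw_value, revision_id):
    """Parse one infobox attribute value into a temporal fact tuple
    (subject, property, value, unit, year, revision), or None on failure."""
    m = AMOUNT.search(raw_value)   # value pass (Mapping Extractor's role)
    y = YEAR.search(raw_value)     # temporal pass (HeidelTime's role)
    if not (m and y):
        return None
    value = float(m.group(1)) * SCALE[m.group(2)]
    return (subject, prop, value, "usDollar", int(y.group(0)), revision_id)
```

Applied to the Netflix attribute above, this yields the `<Netflix, revenue, 4.37E9, usDollar, 2013, 610604061>` tuple shown on the slide.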
Extraction pipeline
3. Merge facts
  • Group triples by subject, property, temporal validity, value
  • In case of value conflicts, select the most frequent value
  • In case of ties, select the most recent value
Extraction pipeline
3. Merge facts: example

<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234>
<Netflix, operatingIncome, 1.92E8, usDollar, 2009, 387048342>
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478>
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 426138580>

are merged into

<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234>
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478>
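The merge rules stated on the previous slide (group, keep the most frequent value, break ties by recency) can be sketched as a small function; the tuple layout follows the examples above, but the function name is an assumption.

```python
from collections import Counter, defaultdict

def merge_facts(facts):
    """Merge fact tuples (subject, property, value, unit, validity, revision):
    group by (subject, property, validity); on conflicting values keep the
    most frequent one, breaking ties by the most recent (highest) revision."""
    groups = defaultdict(list)
    for s, p, v, unit, validity, rev in facts:
        groups[(s, p, validity)].append((v, unit, rev))
    merged = []
    for (s, p, validity), candidates in groups.items():
        counts = Counter(v for v, _, _ in candidates)
        top = max(counts.values())
        # among the most frequent values, keep the latest-revision occurrence
        best = max((c for c in candidates if counts[c[0]] == top),
                   key=lambda c: c[2])
        merged.append((s, p, best[0], best[1], validity, best[2]))
    return merged
```

Run on the four Netflix tuples above, this keeps the 2008 fact unchanged and resolves the 2009 conflict in favour of 1.94E8 (two occurrences), choosing revision 439282478 as the more recent of the two.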
Data model
• Our choice for RDF representation: the singleton property approach
  • Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. Don't like RDF reification? Making statements about statements using singleton property, WWW 2014
• Motivation: performance in terms of #triples, query size and execution time
• Main idea: a unique URI for each predicate instance

<Netflix, revenue#uniqueId, 4.37E9>
<revenue#uniqueId, singletonPropertyOf, revenue>
<revenue#uniqueId, date, 2013>
<revenue#uniqueId, sourceRevision, 610604061>
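Generating the four triples of this pattern for one temporal fact can be sketched as follows; the slide's `#uniqueId` placeholder is realised here with a simple counter, and the exact URI scheme is an assumption.

```python
from itertools import count

_ids = count(1)  # stand-in for a globally unique id generator

def singleton_triples(subject, prop, value, date, revision):
    """Emit the singleton-property triples for one temporally annotated
    fact: a fresh predicate URI carries the date and source revision."""
    sp = f"{prop}#{next(_ids)}"  # unique predicate instance
    return [
        (subject, sp, value),
        (sp, "singletonPropertyOf", prop),
        (sp, "date", date),
        (sp, "sourceRevision", revision),
    ]
```

Each fact thus costs a fixed four triples, which is the source of the #triples and query-size advantage the slide cites over classical RDF reification.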
Company dataset
• Dataset available at http://tiny.cc/tmpcompany
• Started from DBpedia resources of type dbpedia-owl:Company and yago:Company108058098
• 51,214 companies; for 18,489 at least one fact is extracted for
  • assets
  • equity
  • netIncome
  • numberOfEmployees
  • operatingIncome
  • revenue
Company dataset vs other KBs
• 10 random companies with well-maintained infoboxes
• Manually mapped ontology properties
• YAGO2: 0 triples for these companies for hasNumberOfPeople and hasRevenue
• Freebase: 201 triples in our dataset vs 58 in Freebase
Evaluation
• Evaluating the precision (preliminary, not in the paper)
  • 100 random tuples, 2 properties, so far only one annotator
  • 75% for numberOfEmployees and 78% for revenue
  • Caused by parsing errors: the DBpedia extraction framework is always tuned to work with the latest Wikipedia version
  • After fixing some errors: 97% for numberOfEmployees and 92% for revenue
Ongoing and future work
• Ongoing: extracting missing attributes from Wikipedia article texts
  • The company dataset is used for distant supervision
• Anticipating some questions
  • Yes, we tried the approach on another domain: American football
  • Yes, making the data available through an endpoint is on our todo list
