Mining Historical Data for DBpedia 
via Temporal Tagging 
of Wikipedia Infoboxes 
Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto 
Data and Web Science Research Group 
University of Mannheim 
Germany 
NLP & DBpedia @ ISWC, Riva del Garda, Italy, October 20, 2014
Outline 
1. State of the art: Temporally annotated data in DBpedia and LOD 
2. Temporally annotated data extraction pipeline 
3. Company Dataset 
• Statistics 
• Comparison with other KBs 
4. Ongoing and future work 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 2
Why we need historical LOD 
• “Historical data” == any data that is or can be temporally annotated 
• population of a city, revenue of a company, current club for a football player 
• Why we need such data 
• Allows for a more precise description of an entity 
• Enables LOD-based data mining for trend prediction 
• Availability of temporally annotated data on the Web of Data 
• Poor and scarce 
• Examples can be found in Freebase, Wikidata, YAGO, … 
• Temporally annotated facts or – not so frequently – time series 
• Some exceptionally good examples follow… 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 3
Temporally annotated data: Examples 
Apple Inc. in Wikidata 
http://www.wikidata.org/wiki/Q312 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 4
Temporally annotated data: Examples 
Apple Inc. in Freebase 
http://www.freebase.com/m/0k8z 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 5
Temporally annotated data in DBpedia 
• DBpedia's main source of knowledge is Wikipedia infoboxes 
• Temporal (time-dependent) infobox attributes 
1. Not at all temporally annotated 
2. Temporally annotated, annotation is modeled as a separate attribute 
3. Temporally annotated, annotation is a part of an attribute value 
• Often only the latest value is present 
• When a new value is available, the old one is overwritten 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 6
Temporally annotated data in DBpedia 
• DBpedia's main source of knowledge is Wikipedia infoboxes 
• Temporal (time-dependent) infobox attributes 
1. Not at all temporally annotated 
2. Temporally annotated, annotation is modeled as a separate attribute 
3. Temporally annotated, annotation is a part of an attribute value 
• Often only the latest value is present 
• When a new value is available, the old one is overwritten 
Our focus: case 3, temporal annotation is a part of an attribute value 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 7
Temporally annotated data in DBpedia 
• Temporal (time-dependent) infobox attributes 
1. Not at all temporally annotated 
[Infobox screenshot illustrating case (1)] 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 8
Temporally annotated data in DBpedia 
• Temporal (time-dependent) infobox attributes 
1. Not at all temporally annotated 
2. Temporally annotated, annotation is modeled as a 
separate attribute 
• Often lost during DBpedia data extraction 
• E.g. no connection between populationTotal and 
populationAsOf properties 
[Infobox screenshot illustrating case (2)] 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 9
Temporally annotated data in DBpedia 
• Temporal (time-dependent) infobox attributes 
1. Not at all temporally annotated 
2. Temporally annotated, annotation is modeled as a 
separate attribute 
• Ends up in DBpedia only if an intermediate 
node mapping is defined in the mapping wiki 
[Screenshots illustrating case (2)] 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 10
Temporally annotated data in DBpedia 
• Temporal (time-dependent) infobox attributes 
1. Not at all temporally annotated 
2. Temporally annotated, annotation is modeled as a 
separate attribute 
• Ends up in DBpedia only if an intermediate 
node mapping is defined in the mapping wiki 
[Screenshots illustrating case (2)] 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 11
Temporally annotated data in DBpedia 
• Temporal (time-dependent) infobox attributes 
1. Not at all temporally annotated 
2. Temporally annotated, annotation is modeled as a 
separate attribute 
3. Temporally annotated, annotation is a part of an 
attribute value 
• Annotation is lost during extraction 
• In most cases the value is regularly overwritten 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 12
Idea: go back in time 
• Properties of interest 
• Temporally annotated, annotation is a part of an attribute value 
• Use case: Business and Financial Data (Companies) 
• Key observations 
• Attribute values are often temporally annotated 
• If the annotation is part of the attribute value, the DBpedia extraction framework ignores it 
• Attribute values are regularly overwritten by Wikipedia editors, but the trace 
remains in Wikipedia revision history 
• DBpedia data extraction process is run on one (e.g. the latest) dump only 
• Proposed solution 
• Run extraction on (part of) revision history 
• Add a temporal tagger to the process 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 14
Extraction pipeline 
1. Select and download Wikipedia revisions 
2. Extract temporal facts 
3. Merge facts 
• Code available at https://github.com/normalerweise/mte 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 15
Extraction pipeline 
1. Select and download Wikipedia revisions 
• Select four revisions per year (at the 1st, 2nd and 3rd quartile, plus the last revision of the year) 
• Use MediaWiki API to download the revisions 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 16
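A minimal sketch of step 1, for illustration only: it assumes Python with the requests library, the parameter names follow the public MediaWiki API, and helpers such as pick_four_per_year are our own naming, not the pipeline's actual code.

import requests
from collections import defaultdict

API = "https://en.wikipedia.org/w/api.php"

def list_revisions(title):
    # List (revid, timestamp) pairs for an article, oldest first.
    # Continuation handling is omitted for brevity (one request caps at 500 revisions).
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "prop": "revisions", "titles": title,
        "rvprop": "ids|timestamp", "rvlimit": "500", "rvdir": "newer",
    }
    revs = requests.get(API, params=params).json()["query"]["pages"][0].get("revisions", [])
    return [(r["revid"], r["timestamp"]) for r in revs]

def pick_four_per_year(revisions):
    # Keep about four revisions per year: the ones at the 1st, 2nd and 3rd
    # quartile of that year's revision list, plus the last revision of the year.
    by_year = defaultdict(list)
    for revid, timestamp in revisions:
        by_year[timestamp[:4]].append(revid)
    selected = []
    for _, ids in sorted(by_year.items()):
        n = len(ids)
        selected.extend(sorted({ids[n // 4], ids[n // 2], ids[3 * n // 4], ids[-1]}))
    return selected

def download_revision(revid):
    # Fetch the wikitext of one specific revision by its ID.
    params = {
        "action": "query", "format": "json", "formatversion": "2",
        "prop": "revisions", "revids": str(revid),
        "rvprop": "ids|content", "rvslots": "main",
    }
    rev = requests.get(API, params=params).json()["query"]["pages"][0]["revisions"][0]
    return rev["slots"]["main"]["content"]

# e.g. pages = [download_revision(r) for r in pick_four_per_year(list_revisions("Netflix"))]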
Extraction pipeline 
2. Extract temporal facts 
• Parse each infobox attribute twice 
• For the value: Mapping Extractor of the DBpedia Extraction Framework 
• For the time validity (point or interval): HeidelTime 
• HeidelTime is a multilingual, cross-domain, rule-based temporal tagger 
• Developed at the University of Heidelberg 
• http://dbs.ifi.uni-heidelberg.de/index.php?id=129 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 17
Extraction pipeline 
2. Extract temporal facts 
• Parse each infobox attribute twice 
• For the value: Mapping Extractor of the DBpedia Extraction Framework 
• For the time validity (point or interval): HeidelTime 
• HeidelTime is a multilingual, cross-domain, rule-based temporal tagger 
• Developed at the University of Heidelberg 
• http://dbs.ifi.uni-heidelberg.de/index.php?id=129 
Example (the last element of the extracted tuple is the revision ID): 
{{ Infobox company 
| name = Netflix, Inc. 
| revenue = US$4.37 billion (''FY 2013'') 
... 
<Netflix, revenue, 4.37E9, usDollar, 2013, 610604061> 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 18
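The sketch below only illustrates the shape of step 2; it is not the pipeline's code. The real system combines the DBpedia Mapping Extractor with HeidelTime, which are approximated here with simple regular expressions, and the function name is our own.

import re

SCALES = {"million": 1e6, "billion": 1e9}

def extract_temporal_fact(subject, prop, raw_value, revision_id):
    # Stand-in for the value parser: pull a US$ amount out of the raw infobox value.
    m = re.search(r"US\$\s*([\d.,]+)\s*(million|billion)?", raw_value)
    if not m:
        return None
    amount = float(m.group(1).replace(",", "")) * SCALES.get(m.group(2), 1)
    # Stand-in for the temporal tagger: take the first four-digit year, if any.
    year = re.search(r"\b(?:19|20)\d{2}\b", raw_value)
    return (subject, prop, amount, "usDollar", year.group(0) if year else None, revision_id)

print(extract_temporal_fact("Netflix", "revenue",
                            "US$4.37 billion (''FY 2013'')", 610604061))
# roughly ('Netflix', 'revenue', 4.37e9, 'usDollar', '2013', 610604061)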
Extraction pipeline 
3. Merge facts 
• Group triples by subject, property, temporal validity, value 
• In case of value conflicts, select the most frequent value 
• In case of ties, select the most recent value 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 19
Extraction pipeline 
3. Merge facts 
• Group triples by subject, property, temporal validity, value 
• In case of value conflicts, select the most frequent value 
• In case of ties, select the most recent value 
Before merging: 
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234> 
<Netflix, operatingIncome, 1.92E8, usDollar, 2009, 387048342> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 426138580> 
After merging: 
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478> 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 20
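A small sketch of the merge rule as we read it from the bullets above (grouping key and names are our own illustration, not the pipeline's actual code):

from collections import Counter, defaultdict

def merge_facts(facts):
    # facts: tuples of (subject, property, value, unit, temporal_validity, revision_id).
    groups = defaultdict(list)
    for s, p, v, unit, t, rev in facts:
        groups[(s, p, t)].append((v, unit, rev))
    merged = []
    for (s, p, t), candidates in groups.items():
        counts = Counter(v for v, _, _ in candidates)
        # Most frequent value wins; frequency ties go to the value seen most recently.
        best = max(counts, key=lambda v: (counts[v],
                                          max(r for cv, _, r in candidates if cv == v)))
        # Report the unit and revision of the latest occurrence of the chosen value.
        unit, rev = max(((u, r) for cv, u, r in candidates if cv == best),
                        key=lambda ur: ur[1])
        merged.append((s, p, best, unit, t, rev))
    return merged

facts = [
    ("Netflix", "operatingIncome", 1.28e8, "usDollar", "2008", 352001234),
    ("Netflix", "operatingIncome", 1.92e8, "usDollar", "2009", 387048342),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", "2009", 439282478),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", "2009", 426138580),
]
print(merge_facts(facts))  # keeps the 2008 fact and 1.94E8 for 2009 (revision 439282478)

On the example above, the 2008 fact is unique and kept as-is; for 2009 the value 1.94E8 is more frequent than 1.92E8, and among its two occurrences the most recent revision (439282478) is reported.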
Data model 
• Our choice for RDF representation 
• Singleton property approach 
• Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. Don’t like RDF reification? 
Making statements about statements using singleton property, WWW 2014 
• Motivation: performance in terms of #triples, query size and execution time 
• Main idea: unique URI for each predicate instance 
<Netflix, revenue#uniqueId, 4.37E9> 
<revenue#uniqueId, singletonPropertyOf, revenue> 
<revenue#uniqueId, date, 2013> 
<revenue#uniqueId, sourceRevision, 610604061> 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 21
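A sketch of how one temporal fact could be serialized in the singleton property style, assuming Python with rdflib; the namespaces, the use of the revision ID as the unique suffix, and the property names mirroring the slide (singletonPropertyOf, date, sourceRevision) are illustrative, not the dataset's actual vocabulary.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

# Illustrative namespaces only; the released dataset may use different URIs.
DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")
EX = Namespace("http://example.org/temporal/")

def add_singleton_fact(g, subject, prop, value, year, revision_id):
    # One unique predicate URI per fact; the temporal and provenance
    # information is attached to that predicate instance, not to the subject.
    singleton = URIRef(f"{prop}#{revision_id}")
    g.add((subject, singleton, Literal(value, datatype=XSD.double)))
    g.add((singleton, EX.singletonPropertyOf, prop))
    g.add((singleton, EX.date, Literal(year, datatype=XSD.gYear)))
    g.add((singleton, EX.sourceRevision, Literal(revision_id)))

g = Graph()
add_singleton_fact(g, DBR.Netflix, DBO.revenue, 4.37e9, "2013", 610604061)
print(g.serialize(format="turtle"))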
Company dataset 
• Dataset available at http://tiny.cc/tmpcompany 
• Started from DBpedia resources of type dbpedia-owl:Company and 
yago:Company108058098 
• 51,214 companies; for 18,489 of them at least one fact is extracted for one of the following properties: 
• assets 
• equity 
• netIncome 
• numberOfEmployees 
• operatingIncome 
• revenue 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 22
Company dataset 
• Dataset available at http://tiny.cc/tmpcompany 
• 51,214 companies; for 18,489 of them at least one fact is extracted for: 
• assets, equity, netIncome, numberOfEmployees, operatingIncome, revenue 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 23
Company dataset 
• Dataset available at http://tiny.cc/tmpcompany 
• 51,214 companies; for 18,489 of them at least one fact is extracted for: 
• assets, equity, netIncome, numberOfEmployees, operatingIncome, revenue 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 24
Company dataset vs other KBs 
• 10 random companies with well-maintained infoboxes 
• Manually mapped ontology properties 
• YAGO2: 0 triples for these companies for hasNumberOfPeople and hasRevenue 
[Chart: our dataset] 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 25
Company dataset vs other KBs 
• 10 random companies with well-maintained infoboxes 
• Manually mapped ontology properties 
• YAGO2: 0 triples for these companies for hasNumberOfPeople and hasRevenue 
• Freebase: 201 vs 58 triples 
[Charts: our dataset vs Freebase] 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 26
Evaluation 
• Evaluating the precision (preliminary, not in the paper) 
• 100 random tuples, 2 properties, so far only one annotator 
• 75% for numberOfEmployees and 78% for revenue 
• Errors are caused by parsing issues: the DBpedia extraction framework is tuned to work with the latest Wikipedia version 
• After fixing some errors: 97% for numberOfEmployees and 92% for revenue 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 27
Ongoing and future work 
• Ongoing: extracting missing attributes from Wikipedia article texts 
• Company dataset is used for distant supervision 
• Anticipating some questions 
• Yes, we tried the approach for another domain: American football 
• Yes, making the data available through an endpoint is on our to-do list 
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 28


Editor's Notes

  1. Only the latest value is present in the infobox – of interest?
  2. Comparison with DBpedia: http://dbpedia.org/resource/Apple_Inc. Our dataset contains 45 temporal facts, whereas DBpedia currently has one fact for each relation, i.e. 6 triples
  3. Freebase lists EDGAR as one of its data sources. EDGAR is a database of information about publicly traded US companies, operated by the United States Securities and Exchange Commission.