Crowdsourcing the Quality of Knowledge Graphs:
A DBpedia Study
Dr.-Ing. Maribel Acosta
Correctness Challenges
- Different types of incorrectness
- Semi-structured data model
Completeness Challenges
- Skewed data distributions
- Semi-structured data model
- Open World Assumption
[Figure: example DBpedia graph around a Drug resource, with rdf:type and route edges and values such as Oral and Cheek; one value is marked "?".]
Data source: DBpedia endpoint (December 2018).
Crowdsourcing KG Correctness
Find-Verify Approach
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S., & Lehmann, J. Crowdsourcing Linked Data Quality Assessment. In
International Semantic Web Conference (pp. 260-276), 2013.
Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., & Lehmann, J. Detecting Linked Data Quality issues via Crowdsourcing: A
DBpedia Study. Semantic Web Journal, 9(3), 303-335, 2018.
Find-Verify Approach
• Find stage: subject-centric
  • In each task, the crowd assesses the triples of a subject in the KG
  • Incorrect triples are annotated with the corresponding quality issue
• Verify stage: issue-centric
  • The crowd assesses the triples annotated as incorrect in the previous stage
  • In each task, the crowd assesses triples annotated with the same quality issue
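The two stages differ mainly in how triples are batched into tasks. A minimal sketch of that batching, assuming triples are plain (subject, predicate, object) tuples and that Find-stage annotations arrive as a mapping from triple to issue label. The function names, the issue labels as strings, and the second Dave_Dobbyn triple are illustrative assumptions, not taken from the study's implementation:

```python
from collections import defaultdict

def find_tasks(triples):
    """Find stage (subject-centric): group all triples of a subject into one task."""
    tasks = defaultdict(list)
    for s, p, o in triples:
        tasks[s].append((s, p, o))
    return dict(tasks)

def verify_tasks(annotations):
    """Verify stage (issue-centric): group flagged triples by their quality issue."""
    tasks = defaultdict(list)
    for triple, issue in annotations.items():
        tasks[issue].append(triple)
    return dict(tasks)

triples = [
    ("dbr:Dave_Dobbyn", "dbp:dateOfBirth", '"3"'),
    # Hypothetical second triple, added only so that one subject has two triples.
    ("dbr:Dave_Dobbyn", "ex:hypotheticalProperty", '"example"'),
    ("dbr:Torishima_Izu_Islands", "foaf:name", '"鳥島"@en'),
]

# Hypothetical output of the Find stage: triple -> quality issue.
annotations = {
    triples[0]: "incorrect object value",
    triples[2]: "incorrect datatype or language tag",
}

find_batches = find_tasks(triples)
verify_batches = verify_tasks(annotations)
```

Here `find_batches` has one entry per subject, while `verify_batches` has one entry per quality issue and contains only the triples flagged in the Find stage.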
[Figure: Find-Verify workflow. Labels: RDF Triples (<<input>>); Find Stage with Tasks, Crowdsourcing Interface, and Crowd; Incorrect RDF Triples with Quality Issues; Verify Stage with Tasks, Crowdsourcing Interface, and Crowd; Incorrect RDF Triples (<<output>>).]
Studied DBpedia Quality Issues
Three categories of quality issues that occur in DBpedia [Zaveri2013]:
• Incorrect object value
dbr:Dave_Dobbyn dbp:dateOfBirth “3” .
• Incorrect data type or language tags
dbr:Torishima_Izu_Islands foaf:name “鳥島”@en .
• Incorrect link to external sources
dbr:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/> .
Examples from DBpedia 2014.
Overview of the Results on DBpedia 3.9
Two workflows: Expert-Worker (EW), Worker-Worker (WW)
Triples crowdsourced in Find Stage: >30,000
Triples crowdsourced in Verify Stage: 1,073
Distribution of DBpedia quality issues (according to the experts):
509 triples with incorrect object value
341 triples with incorrect datatype/language
223 triples with incorrect link
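As a quick consistency check, the three per-issue counts add up to the 1,073 triples sent to the Verify stage. The percentage shares below are derived here from those counts and are not stated on the slide:

```python
# Counts copied from the slide; the shares are computed, not reported.
issue_counts = {
    "incorrect object value": 509,
    "incorrect datatype/language": 341,
    "incorrect link": 223,
}

total = sum(issue_counts.values())
shares = {issue: round(100 * n / total, 1) for issue, n in issue_counts.items()}

for issue, share in shares.items():
    print(f"{issue}: {issue_counts[issue]} triples ({share}%)")
print(f"total: {total}")
```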
Results: Expert-Worker Workflow
• Crowd workers cannot detect datatype issues correctly
• Experts do not check the outlinks properly
[Figure: Expert-Worker results per quality issue (Datatype / Language, Link, Object Value). Y-axis: metric value, 0.00 to 1.00. Series: EW (Find) Precision, EW (Verify) Precision, EW (Verify) Sensitivity, EW (Verify) Specificity.]
Results: Worker-Worker Workflow
• Precision of crowd workers in the find stage is very low
• Sensitivity values indicate that crowd workers reliably confirm incorrect triples
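The result charts report precision, sensitivity, and specificity. A minimal sketch of the standard definitions, assuming that the positive class is "the triple is incorrect" (which matches the reading of sensitivity in the bullet above, but is an assumption about the study's setup). The counts in the example are hypothetical and not results from the study:

```python
def precision(tp, fp):
    """Share of triples flagged as incorrect that really are incorrect."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def sensitivity(tp, fn):
    """Share of truly incorrect triples that were flagged (also called recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def specificity(tn, fp):
    """Share of truly correct triples that were left unflagged."""
    return tn / (tn + fp) if (tn + fp) else 0.0

# Hypothetical confusion counts, chosen only to make the arithmetic easy.
tp, fp, fn, tn = 40, 10, 10, 40

scores = {
    "precision": precision(tp, fp),
    "sensitivity": sensitivity(tp, fn),
    "specificity": specificity(tn, fp),
}
```

With these counts all three metrics evaluate to 0.8. In real data they can diverge sharply, which is why the slides report them separately.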
[Figure: Worker-Worker results per quality issue (Datatype / Language, Link, Object Value). Y-axis: metric value, 0.00 to 1.00. Series: WW (Find) Precision, WW (Verify) Precision, WW (Verify) Sensitivity, WW (Verify) Specificity.]
Crowdsourcing KG Completeness
HARE (Hybrid SPARQL Query Engine)
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: A hybrid SPARQL engine to enhance query answers via crowdsourcing. In
Proceedings of the 8th International Conference on Knowledge Capture (p. 11). 2015. Best Student Paper Award.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. Enhancing answer completeness of SPARQL queries via crowdsourcing. Journal of Web
Semantics, 45, 41-62, 2017.
Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: An engine for enhancing answer completeness of SPARQL queries via
crowdsourcing. Companion Volume of the Web Conference (pp. 501-505). 2018.
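PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?drug WHERE {
  ?drug rdf:type dbo:Drug .
  ?drug dbo:atcPrefix "C01" .
  ?drug dbp:routesOfAdministration ?route .
}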
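The query returns a drug only if all three triple patterns match, so a drug whose route of administration is missing from the KG silently drops out of the answer. A minimal sketch of that effect over a toy in-memory graph. The `ex:` resources and their triples are invented for illustration and are not DBpedia data, and the matching function is a hand-written stand-in for a SPARQL engine:

```python
toy_graph = {
    ("ex:DrugA", "rdf:type", "dbo:Drug"),
    ("ex:DrugA", "dbo:atcPrefix", "C01"),
    ("ex:DrugA", "dbp:routesOfAdministration", "Oral"),
    ("ex:DrugB", "rdf:type", "dbo:Drug"),
    ("ex:DrugB", "dbo:atcPrefix", "C01"),
    # ex:DrugB has no dbp:routesOfAdministration triple: a missing value.
}

def drugs_with_route(graph):
    """Return subjects that match all three triple patterns of the query."""
    answers = set()
    for s, p, o in graph:
        if (p, o) != ("rdf:type", "dbo:Drug"):
            continue
        has_prefix = (s, "dbo:atcPrefix", "C01") in graph
        has_route = any(
            s2 == s and p2 == "dbp:routesOfAdministration"
            for s2, p2, _ in graph
        )
        if has_prefix and has_route:
            answers.add(s)
    return answers

answers = drugs_with_route(toy_graph)
```

Only `ex:DrugA` is returned. `ex:DrugB` satisfies the first two patterns but is lost because of the missing route value, which is the kind of gap the crowd is asked to resolve.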
HARE Overview
[Figure: HARE overview. Components: Query Engine, RDF Completeness Model, Microtask Manager, Crowd Knowledge (CKB+, CKB-, CKB~). Other labels: D, τ. Example bindings shown for ?drug: dbr:Ibuprofen, dbr:Flecainide, dbr:Acadesine.]
HARE
• A hybrid machine/human SPARQL query engine that is able to enhance
the size of query answers.
• Based on a novel RDF completeness model, HARE implements query
optimization and execution techniques:
  • Identifying portions of queries that yield missing values.
• HARE resorts to microtask crowdsourcing:
  • Resolving missing values.
Microtask Manager
• Receives triple patterns to crowdsource.
(dbr:Flecainide, dbp:routesOfAdministration, ?route)
• Creates human tasks using data from the KG
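One way to picture this step is a function that takes a triple pattern with a variable in the object position and renders it as a question, using human-readable labels looked up in the KG. This is only a sketch: the question template, the `build_microtask` name, and the label dictionary are hypothetical and do not reproduce HARE's actual interfaces:

```python
def build_microtask(pattern, labels):
    """Render a (subject, predicate, ?variable) pattern as a human task."""
    s, p, o = pattern
    if not o.startswith("?"):
        raise ValueError("expected a variable in the object position")
    # Fall back to the raw identifier when no label is available.
    subject_label = labels.get(s, s)
    predicate_label = labels.get(p, p)
    return {
        "pattern": pattern,
        "question": f"What is the {predicate_label} of {subject_label}?",
    }

# Hypothetical labels; in practice these would come from the KG.
labels = {
    "dbr:Flecainide": "Flecainide",
    "dbp:routesOfAdministration": "route of administration",
}

task = build_microtask(
    ("dbr:Flecainide", "dbp:routesOfAdministration", "?route"), labels
)
print(task["question"])
```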
Experimental Settings
• Benchmark: 50 queries against DBpedia (English version, 2014).
  • Ten queries in each of five knowledge domains:
    History, Life Sciences, Movies, Music, and Sports.
• Implementation details:
  • Dataset (queries executed directly against the dataset).
  • HARE (our proposed approach).
  • HARE BL (generates microtask interfaces replacing URIs by labels).
• Crowdsourcing configuration:
  • The crowd is reached via CrowdFlower.
  • Four different triple patterns per task, 0.07 US$ per task.
  • At least 3 answers were collected per task.
Overview of the Results
Total triple patterns crowdsourced: 1,004
Total answers collected from the crowd: 3,163
75%-98% of the crowd answers were produced in 12 minutes.
Completeness of Query Answers
Recall of tested approaches w.r.t. D* per SPARQL query
Recall varies across queries and knowledge domains in DBpedia.
HARE outperforms the other approaches across all knowledge domains.
Our RDF completeness model captures the skewed distributions of values.
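Recall in these charts is measured with respect to D*. A minimal sketch of the computation, assuming D* denotes a reference (ideally complete) dataset, so that recall is the fraction of the answers obtainable over D* that an approach actually returns. The answer sets below are invented for illustration and are not results from the study:

```python
def recall(returned, reference):
    """Fraction of reference answers that appear in the returned answers."""
    if not reference:
        # Convention chosen for this sketch: nothing to find means full recall.
        return 1.0
    return len(returned & reference) / len(reference)

# Hypothetical answer sets for a single query.
reference_answers = {"ex:DrugA", "ex:DrugB", "ex:DrugC", "ex:DrugD"}
dataset_only = {"ex:DrugA"}
with_crowd = {"ex:DrugA", "ex:DrugB", "ex:DrugC"}

recall_dataset = recall(dataset_only, reference_answers)
recall_with_crowd = recall(with_crowd, reference_answers)
```

In this toy case recall rises from 0.25 to 0.75 once crowd-provided values are added, which mirrors the kind of per-query comparison the charts make.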
Quality of Crowd Answers: Precision
The crowd exhibits heterogeneous performance within DBpedia domains.
Conclusions & Outlook
• Crowdsourcing is a feasible solution for KG quality processing.
• KG correctness: The crowd performs best in verification tasks
(confirming incorrect facts).
• KG completeness: The precision of crowd answers varies
within knowledge domains in DBpedia.
• Outlook (Scalability): Further integration of the crowd answers
with automatic methods to scale up to large datasets.
Thank you
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 

Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study

  • 1. Crowdsourcing the Quality of Knowledge Graphs: A DBpedia Study Dr.-Ing. Maribel Acosta
  • 2. Correctness Challenges: different types of incorrectness; semi-structured data model. Completeness Challenges: skewed data distributions; semi-structured data model; Open World Assumption. [Figure: example DBpedia graph with the nodes Drug, Oral, and Cheek and the edge labels rdf:type and route.] Data source: DBpedia endpoint (December 2018).
  • 3. Crowdsourcing KG Correctness: Find-Verify Approach. References: Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Auer, S., & Lehmann, J. Crowdsourcing Linked Data Quality Assessment. In International Semantic Web Conference (pp. 260-276), 2013. Acosta, M., Zaveri, A., Simperl, E., Kontokostas, D., Flöck, F., & Lehmann, J. Detecting Linked Data Quality issues via Crowdsourcing: A DBpedia Study. Semantic Web Journal, 9(3), 303-335, 2018.
  • 4. Find-Verify Approach. Find stage (subject-centric): in each task, the crowd assesses the triples of a subject in the KG; incorrect triples are annotated with the corresponding quality issue. Verify stage (issue-centric): the crowd assesses the triples annotated as incorrect in the previous stage; in each task, the crowd assesses triples annotated with the same quality issue. [Figure: Find-Verify workflow. RDF triples are the input to the Find stage, whose tasks reach the crowd through a crowdsourcing interface and output incorrect RDF triples with their quality issues; these are the input to the Verify stage, which works the same way and outputs incorrect RDF triples.]
  • 5. Studied DBpedia Quality Issues. Three categories of quality issues that occur in DBpedia [Zaveri2013]: (a) Incorrect object value, e.g. dbr:Dave_Dobbyn dbp:dateOfBirth "3" . (b) Incorrect data type or language tags, e.g. dbr:Torishima_Izu_Islands foaf:name "鳥島"@en . (c) Incorrect link to external sources, e.g. dbr:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/> . Examples from DBpedia 2014.
  • 6. Overview of the Results on DBpedia 3.9. Two workflows: Expert-Worker (EW) and Worker-Worker (WW). Triples crowdsourced in Find Stage: >30,000. Triples crowdsourced in Verify Stage: 1,073. Distribution of DBpedia quality issues (according to the experts): 509 triples with incorrect object value, 341 triples with incorrect datatype/language, and 223 triples with incorrect link.
  • 7. Results: Expert-Worker Workflow. Crowd workers cannot detect datatype issues correctly. Experts do not check the outlinks properly. [Figure: metric value (0.00 to 1.00) per quality issue (Datatype / Language, Link, Object Value); series: EW (Find) Precision, EW (Verify) Precision, EW (Verify) Sensitivity, EW (Verify) Specificity.]
  • 8. Results: Worker-Worker Workflow. Precision of crowd workers in the find stage is very low. Sensitivity values indicate that crowd workers reliably confirm incorrect triples. [Figure: metric value (0.00 to 1.00) per quality issue (Datatype / Language, Link, Object Value); series: WW (Find) Precision, WW (Verify) Precision, WW (Verify) Sensitivity, WW (Verify) Specificity.]
  • 9. Crowdsourcing KG Completeness: HARE (Hybrid SPARQL Query Engine). References: Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: A hybrid SPARQL engine to enhance query answers via crowdsourcing. In Proceedings of the 8th International Conference on Knowledge Capture (p. 11), 2015. Best Student Paper Award. Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. Enhancing answer completeness of SPARQL queries via crowdsourcing. Journal of Web Semantics, 45, 41-62, 2017. Acosta, M., Simperl, E., Flöck, F., & Vidal, M. E. HARE: An engine for enhancing answer completeness of SPARQL queries via crowdsourcing. Companion Volume of the Web Conference (pp. 501-505), 2018.
  • 10. HARE Overview. Example query: SELECT DISTINCT ?drug WHERE { ?drug rdf:type dbo:Drug . ?drug dbo:atcPrefix "C01" . ?drug dbp:routesOfAdministration ?route . } [Figure: HARE architecture with the components Query Engine, RDF Completeness Model, Microtask Manager, and Crowd Knowledge (CKB+, CKB-, CKB~), plus the labels D and τ; example bindings shown for ?drug: dbr:Ibuprofen, dbr:Flecainide, dbr:Acadesine.]
  • 11. HARE. A hybrid machine/human SPARQL query engine that is able to enhance the size of query answers. Based on a novel RDF completeness model, HARE implements query optimization and execution techniques that identify the portions of queries that yield missing values. HARE resorts to microtask crowdsourcing to resolve these missing values.
  • 12. Microtask Manager. Receives triple patterns to crowdsource, e.g. (dbr:Flecainide, dbp:routesOfAdministration, ?route). Creates human tasks using data from the KG.
  • 13. Experimental Settings. Benchmark: 50 queries against DBpedia (English version, 2014); ten queries in each of five knowledge domains: History, Life Sciences, Movies, Music, and Sports. Implementation details: Dataset (queries executed directly against the dataset); HARE (our proposed approach); HARE BL (generates microtask interfaces replacing URIs by labels). Crowdsourcing configuration: the crowd is reached via CrowdFlower; four different triple patterns per task, 0.07 US$ per task; at least 3 answers were collected per task.
  • 14. Overview of the Results. Total triple patterns crowdsourced: 1,004. Total answers collected from the crowd: 3,163. 75%-98% of the crowd answers were produced in 12 minutes.
  • 15. Completeness of Query Answers. [Figure: recall of tested approaches w.r.t. D* per SPARQL query.] Recall varies across queries and knowledge domains in DBpedia.
  • 16. Completeness of Query Answers. [Figure: recall of tested approaches w.r.t. D* per SPARQL query.] HARE outperforms the other approaches across all knowledge domains. Our RDF completeness model captures the skewed distributions of values. Recall varies across queries and knowledge domains in DBpedia.
  • 17. Quality of Crowd Answers: Precision. The crowd exhibits heterogeneous performance within DBpedia domains.
  • 19. Conclusions & Outlook. Crowdsourcing is a feasible solution for KG quality processing. KG correctness: the crowd performs best in verification tasks (confirming incorrect facts). KG completeness: the precision of the crowd answers varies within the knowledge domains in DBpedia. Outlook (scalability): further integration of the crowd answers with automatic methods to scale up to large datasets.
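
The Find-Verify approach on slide 4 groups triples in two different ways: by subject in the Find stage and by quality issue in the Verify stage. The following is a minimal sketch of that grouping only, not the authors' implementation. The function names, the issue labels, and the `rdf:type` triple are assumptions made for illustration; the other two triples are the examples from slide 5.

```python
from collections import defaultdict

# Hypothetical issue labels, named after two of the categories on slide 5.
OBJECT_VALUE = "incorrect object value"
DATATYPE_LANGUAGE = "incorrect data type or language tag"


def find_tasks(triples):
    """Find stage (subject-centric): one task per subject with all its triples."""
    tasks = defaultdict(list)
    for subject, predicate, obj in triples:
        tasks[subject].append((subject, predicate, obj))
    return dict(tasks)


def verify_tasks(annotations):
    """Verify stage (issue-centric): one task per quality issue."""
    tasks = defaultdict(list)
    for triple, issue in annotations:
        tasks[issue].append(triple)
    return dict(tasks)


triples = [
    ("dbr:Dave_Dobbyn", "dbp:dateOfBirth", '"3"'),
    ("dbr:Dave_Dobbyn", "rdf:type", "dbo:Person"),  # illustrative, not from the slides
    ("dbr:Torishima_Izu_Islands", "foaf:name", '"鳥島"@en'),
]

# Hypothetical output of the Find stage: (triple, issue) pairs.
annotations = [
    (triples[0], OBJECT_VALUE),
    (triples[2], DATATYPE_LANGUAGE),
]

find = find_tasks(triples)
verify = verify_tasks(annotations)
print(len(find), "Find tasks;", len(verify), "Verify tasks")
```

Grouping Verify tasks by issue means a worker judges one kind of error at a time, which matches the "issue-centric" description on the slide.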
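
Slides 7 and 8 report precision, sensitivity, and specificity for the Verify stage. The sketch below shows the standard definitions of these metrics, taking "triple is incorrect" as the positive class, which is consistent with the slide's remark that sensitivity reflects how reliably workers confirm incorrect triples. The function name and the counts are hypothetical and are not results from the study.

```python
def verify_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; positive class = 'triple is incorrect'.

    tp: incorrect triples confirmed as incorrect
    fp: correct triples wrongly marked as incorrect
    tn: correct triples recognised as correct
    fn: incorrect triples missed by the crowd
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return precision, sensitivity, specificity


# Hypothetical counts for illustration only.
precision, sensitivity, specificity = verify_metrics(tp=40, fp=10, tn=30, fn=20)
print(f"precision={precision:.2f} sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```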
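
Slides 10 to 12 describe how a query that requires a route of administration loses drugs whose route is missing from the KG, and how such gaps become triple patterns for the crowd. The toy example below illustrates that effect on an in-memory set of triples. It is a naive check under stated assumptions, not HARE's RDF completeness model: the helper `subjects_with`, the simplified two-pattern query, and the triples themselves are made up for illustration.

```python
ROUTE = "dbp:routesOfAdministration"

# Toy KG; the triples are illustrative and not taken from DBpedia.
kg = {
    ("dbr:Ibuprofen", "rdf:type", "dbo:Drug"),
    ("dbr:Ibuprofen", ROUTE, "Oral"),
    ("dbr:Flecainide", "rdf:type", "dbo:Drug"),  # no route triple in the toy KG
}


def subjects_with(kg, predicate, obj=None):
    """Subjects having a triple with this predicate (and object, if given)."""
    return {s for s, p, o in kg if p == predicate and (obj is None or o == obj)}


drugs = subjects_with(kg, "rdf:type", "dbo:Drug")
with_route = subjects_with(kg, ROUTE)

# A conjunctive query over both patterns only returns drugs that have a route.
answers = sorted(drugs & with_route)

# Naive assumption: every drug without a route is a candidate for the crowd.
to_crowdsource = [(drug, ROUTE, "?route") for drug in sorted(drugs - with_route)]

print(answers)
print(to_crowdsource)
```

Under the Open World Assumption an absent value is not necessarily a missing one, which is why the slides rely on a completeness model rather than a plain absence check like this one.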