3. Crowds or no crowds?
• Study different ways to crowdsource entity typing using paid microtasks
• Three workflows
  – Free associations
  – Validating the machine
  – Exploring the DBpedia ontology
5. What to crowdsource (2)
• Entity typing (from a list of suggestions)
[Figure: example microtask showing an entity (E) and suggested classes: City, SportsTeam, Municipality, PopulatedPlace (C)]
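To make the task concrete, here is a minimal sketch of how one such microtask could be represented before being posted to a crowdsourcing platform; the field names and the example entity are illustrative assumptions, not the task schema used in the study.

```python
# Hypothetical microtask payload for entity typing from a list of
# suggestions; field names and the entity are illustrative only.
task = {
    "entity": "http://dbpedia.org/resource/Southampton",
    "question": "Which class best describes this entity?",
    "suggestions": ["City", "SportsTeam", "Municipality", "PopulatedPlace"],
    "allow_none": True,  # let workers flag that no suggested class fits
}
```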
6. How to crowdsource: no suggestions
Workflow: ask crowd to suggest classes → take top k → ask crowd to vote for the best match (see the sketch below)
Pros/cons
+ No biases
+ No pre-processing
– Vocabulary convergence
– Time and costs
– Needs many classifications (the more the better)
– Two steps
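A minimal Python sketch of how this two-step workflow's answers could be aggregated, assuming the free-text suggestions have already been normalized to comparable labels (that normalization is exactly the vocabulary-convergence problem noted above); k and the function names are assumptions, not the paper's implementation.

```python
from collections import Counter

def top_k_suggestions(free_text_answers, k=3):
    # Step 1: count free-text class suggestions from the first crowd
    # and keep the k most frequent ones as the shortlist.
    return [label for label, _ in Counter(free_text_answers).most_common(k)]

def best_match(votes):
    # Step 2: a second crowd votes on the shortlist; take the plurality winner.
    return Counter(votes).most_common(1)[0][0]

# Toy run with already-normalized suggestions for one entity:
shortlist = top_k_suggestions(
    ["city", "town", "city", "port", "populated place", "city"])
print(shortlist)                                         # ['city', 'town', 'port']
print(best_match(["city", "city", "populated place"]))   # 'city'
```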
7. How to crowdsource: with suggestions
Two options
• Generate a shortlist
  – Automatically
• Show all available options
  – As a tree (see the sketch below)
Pros/cons
+ Focused, cheap, fast
– Too many classes (685!), see [Miller, 1956]
– Not the right classes
– Tool does not perform well
– Crowd is not familiar with classes, see [Rosch et al., 1976], [Tanaka & Taylor, 1991]
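The slides do not say how the class tree was rendered; as a rough sketch under that caveat, the DBpedia ontology hierarchy can be pulled from the public SPARQL endpoint via rdfs:subClassOf and printed as an indented tree (the endpoint, query, and rendering below are assumptions, not the tool evaluated in the study):

```python
import requests
from collections import defaultdict

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # public endpoint, assumed here
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?child ?parent WHERE {
  ?child rdfs:subClassOf ?parent .
  FILTER(STRSTARTS(STR(?child),  "http://dbpedia.org/ontology/"))
  FILTER(STRSTARTS(STR(?parent), "http://dbpedia.org/ontology/"))
}
"""

def fetch_children():
    # Map each DBpedia ontology class to its direct subclasses.
    resp = requests.get(SPARQL_ENDPOINT,
                        params={"query": QUERY,
                                "format": "application/sparql-results+json"})
    resp.raise_for_status()
    children = defaultdict(list)
    for b in resp.json()["results"]["bindings"]:
        parent = b["parent"]["value"].rsplit("/", 1)[-1]
        child = b["child"]["value"].rsplit("/", 1)[-1]
        children[parent].append(child)
    return children

def print_tree(children, node="Place", depth=0):
    # Indented view of the subtree rooted at `node`.
    print("  " * depth + node)
    for c in sorted(children.get(node, [])):
        print_tree(children, c, depth + 1)

print_tree(fetch_children())  # e.g. Place -> PopulatedPlace -> ... -> City
```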
10. Experiments: Data
• E1: Baseline, 120 entities
  – Classified entities in popular categories
  – Test workflows, compare crowd and machine performance
• E2: Unclassified entities, 120 entities
  – Test the three workflows on data that cannot be classified automatically
• E3: Unclassified entities, optimized, 120 entities
  – Fewer judgements
  – Lower level of tool support
11. Experiments: Methods
• Adjusted precision metric to take into account broader and narrower matches, as well as synonyms (see the sketch below)
• Gold standard (for E2 and E3)
  – Two annotators, Cohen's kappa of 0.7
  – Conflicts resolved via a small set of rules and discussions
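The slides give the idea behind the adjusted precision metric but not its exact definition; here is a minimal sketch assuming full credit for exact matches and synonyms and partial credit for broader/narrower matches, with hypothetical weights:

```python
# Hypothetical partial-credit weights; the paper's exact scheme is not
# given in the slides.
CREDIT = {"exact": 1.0, "synonym": 1.0, "broader": 0.5, "narrower": 0.5, "wrong": 0.0}

def adjusted_precision(match_types):
    """match_types: one label per crowd answer, judged against the gold
    standard, e.g. ["exact", "broader", "wrong"]."""
    if not match_types:
        return 0.0
    return sum(CREDIT[m] for m in match_types) / len(match_types)

# Example: 3 exact, 1 broader, 1 wrong -> (3*1.0 + 0.5 + 0.0) / 5 = 0.7
print(adjusted_precision(["exact", "exact", "exact", "broader", "wrong"]))
```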
12. Overall results
• Shortlists are easy & fast
• Freedom comes with a price
• Working at the basic level of abstraction achieves greatest precision
  – Even when there is too much choice
13. Other observations
• Unclassified entities might be unclassifiable
  – Different entity summary
  – Free-text or exploratory workflow
• Popular classes are not enough
  – Alternative approach to browse the taxonomy
• The basic level of abstraction in DBpedia is user-friendly
  – But when given the freedom to choose, users suggest more specific classes
  – Domain-specific vocabulary is not welcome
14. Conclusions
• In knowledge engineering, microtask crowdsourcing has focused on improving the results of automatic algorithms
• We know too little about those cases in which algorithms fail
• No optimal workflow in sight
• The DBpedia ontology needs revision
15. Using microtasks to crowdsource DBpedia entity classification: a study in workflow design
E. Simperl, Q. Bu, Y. Li
Submitted to SWJ, 2015
Email: e.simperl@soton.ac.uk
Twitter: @esimperl