Improving FamilySearch Indexing Efficiency and Quality
1. IMPROVING INDEXING EFFICIENCY & QUALITY:
COMPARING A-B-ARBITRATE AND PEER REVIEW
FAMILY HISTORY TECHNOLOGY WORKSHOP
FEBRUARY 3, 2012
DEREK HANSEN, JAKE GEHRING,
PATRICK SCHONE, AND MATTHEW REID
7. HISTORICAL DATA ANALYSIS
• Quality (estimated based on A-B agreement)
  • Measures difficulty more than actual quality
  • Underestimates quality, since an experienced arbitrator reviews all A-B disagreements
  • Good at capturing differences across people, fields, and projects
• Time (calculated using keystroke-logging data)
  • Idle time is tracked separately, making actual time measurements more accurate
  • Outliers removed
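The quality metric above can be illustrated with a short sketch: agreement is the fraction of records for which indexers A and B transcribed identical values. The function name and sample data are illustrative assumptions, not part of the actual FamilySearch pipeline.

```python
def ab_agreement(a_values, b_values):
    """Fraction of records where indexers A and B produced the same value."""
    assert len(a_values) == len(b_values)
    matches = sum(1 for a, b in zip(a_values, b_values) if a == b)
    return matches / len(a_values)

# Example: a surname field transcribed by two independent indexers.
a = ["Smith", "Jones", "Miler", "Brown"]
b = ["Smith", "Jones", "Miller", "Brown"]
print(ab_agreement(a, b))  # 0.75
```

Note that this measures agreement per field, which is why the slides can report separate figures for given names, surnames, birth places, and so on.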
9. A-B AGREEMENT BY LANGUAGE
1871 Canadian Census
English Language:
• Given Name: 79.8%
• Surname: 66.4%
French Language:
• Given Name: 62.7%
• Surname: 48.8%
10. A-B AGREEMENT BY EXPERIENCE
Birth Place: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
11. A-B AGREEMENT BY EXPERIENCE
Given Name: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
12. A-B AGREEMENT BY EXPERIENCE
Surname: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
13. A-B AGREEMENT BY EXPERIENCE
Gender: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
14. A-B AGREEMENT BY EXPERIENCE
[Charts: A-B agreement by experience for four collections: U.S. - English, Canada - English, Mexico - Spanish, Canada - French]
19. FIELD EXPERIMENT
• Develop truth set of 2,000 1930 Census images
• Use historical A-B-ARB data
• Create new A-R-ARB dataset by having new indexers review and arbitrate
• Compare quality & efficiency
• Qualitatively identify types of errors
20. DISCUSSION
IMPLICATIONS
• Transition users from novice to expert
• Recruit foreign language indexers
• Intelligent matching based on expertise
(in A-B-ARB &/or A-R-ARB)
FUTURE POSSIBILITIES
• Peer review by algorithms?
• Initial indexing by algorithms?
21. QUESTIONS
• Derek Hansen (dlhansen@byu.edu)
• Jake Gehring (GehringJG@familysearch.org)
• Patrick Schone (BoiseBound@aol.com)
• Matthew Reid (matthewreid007@gmail.com)
Editor's notes
The goal of FamilySearch is to help people find their ancestors. It is a freely available resource that compiles information from databases around the world. The LDS Church sponsors it, but anyone can use it for free.
FamilySearch Indexing's role is to transcribe text from scanned images into a machine-readable format that can be searched. This is done by hundreds of thousands of indexers. [It would be nice to include some background slides on FamilySearch Indexing.] This is likely one of the largest crowdsourcing projects in the world.
The current quality control mechanism is called A-B-Arbitrate (or just A-B-ARB for short). In this process, A and B index the document independently, and an experienced arbitrator (ARB) reviews any discrepancies between the two.
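The A-B-ARB process just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and the arbitrator callback are hypothetical, not FamilySearch's actual implementation.

```python
def ab_arbitrate(value_a, value_b, arbitrator):
    """A and B index independently; an experienced arbitrator (ARB)
    resolves any discrepancy between their transcriptions."""
    if value_a == value_b:
        return value_a  # agreement: accept the value, no arbitration needed
    return arbitrator(value_a, value_b)

# Example: A and B disagree on a surname; the arbitrator sides with B.
final = ab_arbitrate("Millar", "Miller", lambda a, b: b)
print(final)  # Miller
```

The key cost driver is visible here: every record is indexed twice regardless of difficulty, and the arbitrator is only invoked on disagreements.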
Documents are being scanned at an increasing rate. If we are to benefit from these new resources we’ll need to keep pace with the indexing efforts.
A new approach based on peer review instead of independent indexing would likely improve efficiency, but its effect on quality is unknown. Anecdotal evidence suggests that peer reviewing may be twice as fast as indexing from scratch.
The model could include arbitration (ARB) or that step could be skipped if A-B results in high enough quality on its own.
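The peer-review variant, with arbitration as an optional step as described above, could be sketched as follows. Function and parameter names are illustrative assumptions only.

```python
def a_r_arbitrate(value_a, review, arbitrator=None):
    """Peer-review model: R reviews (and may correct) A's transcription,
    which is typically faster than indexing from scratch. If R changes
    the value and an arbitration step is configured, ARB decides;
    otherwise R's value stands."""
    value_r = review(value_a)
    if value_r == value_a:
        return value_a  # reviewer confirmed A's work
    if arbitrator is not None:
        return arbitrator(value_a, value_r)
    return value_r  # no ARB step: the reviewer's correction stands

# Example: the reviewer corrects a misspelled surname; no ARB configured.
print(a_r_arbitrate("Miler", lambda v: "Miller"))  # Miller
```

Whether the arbitration step can be dropped depends on the quality results of the field experiment described on the previous slides.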
Data is currently being collected for R and ARB. It should be done in a few weeks.
Combining humans and algorithms in the same process would allow FamilySearch to continue improving its machine learning algorithms based on millions of records.