Improving FamilySearch Indexing Efficiency and Quality
1. IMPROVING INDEXING EFFICIENCY & QUALITY:
COMPARING A-B-ARBITRATE AND PEER REVIEW
FAMILY HISTORY TECHNOLOGY WORKSHOP
FEBRUARY 3, 2012
DEREK HANSEN, JAKE GEHRING,
PATRICK SCHONE, AND MATTHEW REID
7. HISTORICAL DATA ANALYSIS
• Quality (estimated based on A-B agreement)
  • Measures difficulty more than actual quality
  • Underestimates quality, since an experienced arbitrator reviews all A-B disagreements
  • Good at capturing differences across people, fields, and projects
• Time (calculated using keystroke-logging data)
  • Idle time is tracked separately, making actual time measurements more accurate
  • Outliers removed
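The quality metric above can be illustrated with a short sketch: agreement is the fraction of records for which indexers A and B transcribed identical values. The function name and sample data are illustrative assumptions, not part of the actual FamilySearch pipeline.

```python
def ab_agreement(a_values, b_values):
    """Fraction of records where indexers A and B produced the same value."""
    assert len(a_values) == len(b_values)
    matches = sum(1 for a, b in zip(a_values, b_values) if a == b)
    return matches / len(a_values)

# Example: a surname field transcribed by two independent indexers.
a = ["Smith", "Jones", "Miler", "Brown"]
b = ["Smith", "Jones", "Miller", "Brown"]
print(ab_agreement(a, b))  # 0.75
```

Note that this measures agreement per field, which is why the slides can report separate figures for given names, surnames, birth places, and so on.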
9. A-B AGREEMENT BY LANGUAGE
1871 Canadian Census
English Language:
• Given Name: 79.8%
• Surname: 66.4%
French Language:
• Given Name: 62.7%
• Surname: 48.8%
10. A-B AGREEMENT BY EXPERIENCE
Birth Place: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
11. A-B AGREEMENT BY EXPERIENCE
Given Name: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
12. A-B AGREEMENT BY EXPERIENCE
Surname: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
13. A-B AGREEMENT BY EXPERIENCE
Gender: All U.S. Censuses
[Chart: A-B agreement plotted by A's and B's experience level (novice ↔ expert)]
14. A-B AGREEMENT BY EXPERIENCE
[Charts: A-B agreement by experience for four collections: U.S. - English, Canada - English, Mexico - Spanish, Canada - French]
19. FIELD EXPERIMENT
• Develop truth set of 2,000 1930 Census images
• Use historical A-B-ARB data
• Create new A-R-ARB dataset by having new indexers review and arbitrate
• Compare quality & efficiency
• Qualitatively identify types of errors
20. DISCUSSION
IMPLICATIONS
• Transition users from novice to expert
• Recruit foreign language indexers
• Intelligent matching based on expertise
(in A-B-ARB &/or A-R-ARB)
FUTURE POSSIBILITIES
• Peer review by algorithms?
• Initial indexing by algorithms?
21. QUESTIONS
• Derek Hansen (dlhansen@byu.edu)
• Jake Gehring (GehringJG@familysearch.org)
• Patrick Schone (BoiseBound@aol.com)
• Matthew Reid (matthewreid007@gmail.com)
Editor's notes
The goal of FamilySearch is to help people find their ancestors. It is a freely available resource that compiles information from databases around the world. The LDS Church sponsors it, but anyone can use it for free.
FamilySearch Indexing's role is to transcribe text from scanned images into a machine-readable format that can be searched. This is done by hundreds of thousands of indexers. [It would be nice to include some background slides on FamilySearch Indexing.] This is likely one of the largest crowdsourcing projects in the world.
The current quality control mechanism is called A-B-Arbitrate (or just A-B-ARB for short). In this process, A and B index the document independently, and an experienced arbitrator (ARB) reviews any discrepancies between the two.
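The A-B-ARB process just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and the arbitrator callback are hypothetical, not FamilySearch's actual implementation.

```python
def ab_arbitrate(value_a, value_b, arbitrator):
    """A and B index independently; an experienced arbitrator (ARB)
    resolves any discrepancy between their transcriptions."""
    if value_a == value_b:
        return value_a  # agreement: accept the value, no arbitration needed
    return arbitrator(value_a, value_b)

# Example: A and B disagree on a surname; the arbitrator sides with B.
final = ab_arbitrate("Millar", "Miller", lambda a, b: b)
print(final)  # Miller
```

The key cost driver is visible here: every record is indexed twice regardless of difficulty, and the arbitrator is only invoked on disagreements.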
Documents are being scanned at an increasing rate. If we are to benefit from these new resources we’ll need to keep pace with the indexing efforts.
A new approach based on peer review instead of independent indexing would likely improve efficiency, but its effect on quality is unknown. Anecdotal evidence suggests that peer reviewing may be twice as fast as indexing from scratch.
The model could include arbitration (ARB) or that step could be skipped if A-B results in high enough quality on its own.
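The peer-review variant, with arbitration as an optional step as described above, could be sketched as follows. Function and parameter names are illustrative assumptions only.

```python
def a_r_arbitrate(value_a, review, arbitrator=None):
    """Peer-review model: R reviews (and may correct) A's transcription,
    which is typically faster than indexing from scratch. If R changes
    the value and an arbitration step is configured, ARB decides;
    otherwise R's value stands."""
    value_r = review(value_a)
    if value_r == value_a:
        return value_a  # reviewer confirmed A's work
    if arbitrator is not None:
        return arbitrator(value_a, value_r)
    return value_r  # no ARB step: the reviewer's correction stands

# Example: the reviewer corrects a misspelled surname; no ARB configured.
print(a_r_arbitrate("Miler", lambda v: "Miller"))  # Miller
```

Whether the arbitration step can be dropped depends on the quality results of the field experiment described on the previous slides.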
Data is currently being collected for R and ARB. It should be done in a few weeks.
Combining humans and algorithms in the same process would allow FamilySearch to continue improving its machine learning algorithms based on millions of records.