Many Hands Make Light Work, the American Version [Charleston Library Conference 201111]
1. Many Hands Make Light Work, the American Version
Experiences with User-Text-Correction at California Digital Newspaper Collection (CDNC):
How crowd-sourcing OCR text correction impacts a historic newspaper collection
2. About the Collection
The California Digital Newspaper Collection contains over 490,000 pages of significant California newspapers published from 1846 to 1922.
The newspapers were digitized to both page and article level METS/ALTO data as part of the National Digital Newspaper Program.
The collection is displayed using Veridian digital library software.
[Site statistics between Nov. 2010 and Aug. 2011: visits per month, minutes per visit, pages per visit]
3. Poor OCR reduces search recall to low levels
OCR quality ranges from 50% to 90% word-level accuracy
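Word-level accuracy can be defined in more than one way. The following is a minimal sketch of one simple definition, which is not necessarily the one behind the CDNC figures: the share of words in a hand-corrected reference text that also appear, in order, in the OCR output. The function name `word_accuracy` and the sample strings are illustrative assumptions. Python's `difflib.SequenceMatcher` uses a matching heuristic, so the result approximates a true word-level alignment and may not equal it.

```python
from difflib import SequenceMatcher


def word_accuracy(ocr_text: str, reference_text: str) -> float:
    """Fraction of reference words that are matched, in order, in the OCR text."""
    ocr_words = ocr_text.split()
    ref_words = reference_text.split()
    if not ref_words:
        # An empty reference is fully matched only by an empty OCR output.
        return 1.0 if not ocr_words else 0.0
    # Compare word lists rather than characters; disable the junk heuristic.
    matcher = SequenceMatcher(None, ref_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Hypothetical example: two of the four words were misread by the OCR engine.
sample_accuracy = word_accuracy("tbe quick brown f0x", "the quick brown fox")
```

A search for a misread word ("fox" in the sample above) finds nothing in the uncorrected text, which is how low word-level accuracy turns into low search recall.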
4. Daily Alta California, 2 January 1850
Post-OCR text correction is expensive:
≈ $0.50 per 1,000 characters, or $5.00 to $10.00 per newspaper page
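To put these rates in perspective, here is a rough sketch of the arithmetic. It takes the per-character and per-page rates from this slide and the collection size ("over 490,000 pages") from slide 2. The function names are hypothetical, and the collection-wide total is an extrapolation from the slide's figures; the presentation itself does not report it.

```python
COST_PER_1000_CHARS_USD = 0.50        # rate quoted on the slide
PAGE_COST_RANGE_USD = (5.00, 10.00)   # per-page range quoted on the slide
COLLECTION_PAGES = 490_000            # "over 490,000 pages" (slide 2)


def cost_for_characters(num_chars: float) -> float:
    """Cost in USD to correct num_chars characters at the quoted rate."""
    return num_chars / 1000 * COST_PER_1000_CHARS_USD


def characters_for_cost(cost_usd: float) -> float:
    """Number of characters that cost_usd pays for at the quoted rate."""
    return cost_usd / COST_PER_1000_CHARS_USD * 1000


def collection_cost_range(pages: int = COLLECTION_PAGES) -> tuple[float, float]:
    """Low and high cost in USD to correct `pages` pages at the per-page range."""
    low, high = PAGE_COST_RANGE_USD
    return pages * low, pages * high


if __name__ == "__main__":
    # Page sizes implied by the per-page range: 10,000 to 20,000 characters.
    print(characters_for_cost(5.00), characters_for_cost(10.00))
    # Extrapolated cost for the whole collection: $2.45 million to $4.9 million.
    print(collection_cost_range())
```

At these rates, paid correction of the full collection would cost several million dollars, which is why low-cost correction by users appears among the project's goals.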
5. The Average CDNC User
Like the users of many digital newspaper collections, patrons of the CDNC visit the site for personal reasons, consider themselves genealogists or family historians, and return to the site frequently.
[Chart labels: users above 40 years old, users who consider themselves genealogists, users who visit the site at least weekly]
6. Wikipedia on Crowdsourcing:
“distributed problem-solving and production model”
“sourcing tasks traditionally performed by specific individuals
to an undefined large group of people or community (crowd)
through an open call”
7. Crowd-Sourcing Projects
Project Gutenberg
Family Search
The National Library of Australia
The National Library of Finland
FreeBMD.org
9. ‘Engaging with users and building virtual communities is just as important to the users as providing the data itself. They want to be part of a community.’
Rose Holley, The National Library of Australia
lines per month corrected by the top corrector: 30,000
total lines corrected since 2008: 49 million
total number of text correctors: 30,000
lines corrected per month in 2011: 2,000,000+
11. Results
August 22 - October 22
[Chart labels: users who have corrected text, lines corrected per month, lines corrected by top corrector, total number of lines corrected]
12. Goals
• Improve OCR text at low cost
• Improve search precision / recall
• Build user community
13. Risks?
• User text correction of newspapers is (relatively) new
• Users won’t know what to do, interface is confusing
• Users don’t understand errors in OCR text
• Vandalism of text
15. User Reaction
“Great feature (I tested it during the beta) for a great site, which I have used extensively. I plan to use the edit feature when I get back to research in the Los Angeles Herald and the Daily Alta California.”
~Lawrence B.
“I have used the new system and like it. The user correction is great idea.”
~Pat
“Exactly what the system needed!!! Pulled up a couple articles in the beta system and made some text corrections. Went back and tried the old system using the words I corrected and it worked!! Outstanding enhancement!”
~Mary B.
“STUNNINGLY FANTASTIC!!!! is what I think!”
~A fifth generation Californian of multiple Forty-niner families
16. “The addition of user text correction (UTC) to the California Digital
Newspaper Collection has dramatically improved the quality of the
computer-generated text and enlivened our relationship with our
users. Within a couple of weeks of implementing UTC, and with little
publicity, a handful of users had already corrected thousands of lines
of text. Many of those users emailed us directly with questions about
or praise for the UTC, building direct, personal connections between
our staff and users that hadn’t existed before.”
~Brian Geiger, Center for Bibliographic Studies and Research, UC Riverside
17. ?
Brian Geiger, Director, Center for Bibliographic Studies and Research
University of California Riverside
bgeiger@ucr.edu
Frederick Zarndt, Chair, IFLA Newspapers Section
frederick@frederickzarndt.com