My contribution to the panel session "Developing a Collaborative Sandbox for Digital Library Research" at the 2010 iConference at the University of Illinois at Urbana-Champaign
http://www.ischools.org/iConference10/2010index/
iConference 2010: How to collect reference questions and answers for research purposes?
1. How to collect reference questions and answers for research purposes?
2. What the IPL does currently
Archive of Reference Questions (ARQ)
Updated every 6 months
Organized in biweekly batches (15th & last day of the month)
Example record
Entire record is a single HTML file
Jeffrey Pomerantz
UNC-CH SILS
pomerantz@unc.edu
iConference 2010
3. What we've done
Crawled the ARQ and downloaded all records
77,883 records from Aug 1995 - Dec 2008
Built a parser to tokenize question-answer records
question text block
answer text block
URLs within answers
subject categories
timestamps
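A minimal sketch of how such a parser might pull those five fields out of one record's HTML file. The slides do not show the actual ARQ record layout, so the "Date:", "Subject:", "QUESTION:" and "ANSWER:" labels, the sample record, and the names RecordParser and parse_record are all assumptions made for illustration, not the parser described above.

```python
import re
from html.parser import HTMLParser

# Hypothetical record; the real ARQ markup is not shown in the slides.
SAMPLE = """<html><body>
<p>Date: 2008-12-15</p>
<p>Subject: History</p>
<p>QUESTION: When was the printing press invented?</p>
<p>ANSWER: Around 1440. See <a href="http://example.org/press">this page</a>.</p>
</body></html>"""


class RecordParser(HTMLParser):
    """Collects visible text, plus link targets that appear after the answer label."""

    def __init__(self, answer_label="ANSWER:"):
        super().__init__()
        self.answer_label = answer_label  # assumed label, not from the source
        self.text_parts = []
        self.urls = []
        self.in_answer = False

    def handle_data(self, data):
        self.text_parts.append(data)
        if self.answer_label in data:
            self.in_answer = True

    def handle_starttag(self, tag, attrs):
        # Keep only links that occur inside the answer block.
        if tag == "a" and self.in_answer:
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def _field(label, text):
    """Return the value of a 'Label: value' line, or None if the line is absent."""
    match = re.search(rf"^{label}:\s*(.+)$", text, re.MULTILINE)
    return match.group(1).strip() if match else None


def parse_record(html):
    parser = RecordParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.text_parts)
    # Split the text at the (assumed) question and answer labels.
    head, _, answer = text.partition("ANSWER:")
    _, _, question = head.partition("QUESTION:")
    return {
        "question": question.strip(),
        "answer": answer.strip(),
        "urls": parser.urls,
        "subject": _field("Subject", text),
        "timestamp": _field("Date", text),
    }


if __name__ == "__main__":
    print(parse_record(SAMPLE))
```

Because each record is a single HTML file, a per-file function like this could in principle be run over the whole downloaded crawl, one file at a time.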
4. What we have yet to do
Deposit crawler and parser in the IPL Learning Community
Better still, redo the ARQ database
And make it queryable
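One way a queryable version of the archive might look, sketched with SQLite from the Python standard library. The table layout, the column names, the two sample records, and the functions build_db and count_by_subject are hypothetical; the slides do not specify a schema or a database system.

```python
import sqlite3

# Hypothetical parsed records, using the fields the parser extracts.
SAMPLE_RECORDS = [
    {"timestamp": "2008-12-15", "subject": "History",
     "question": "When was the printing press invented?",
     "answer": "Around 1440.", "urls": ["http://example.org/press"]},
    {"timestamp": "1995-08-20", "subject": "Science",
     "question": "Why is the sky blue?",
     "answer": "Rayleigh scattering.", "urls": []},
]


def build_db(records, path=":memory:"):
    """Load parsed records into two tables: records and answer_urls."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, asked_on TEXT, subject TEXT, "
        "question TEXT, answer TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS answer_urls ("
        "record_id INTEGER REFERENCES records(id), url TEXT)"
    )
    for rec in records:
        cur = conn.execute(
            "INSERT INTO records (asked_on, subject, question, answer) "
            "VALUES (?, ?, ?, ?)",
            (rec["timestamp"], rec["subject"], rec["question"], rec["answer"]),
        )
        # One row per URL found in the answer, linked back to its record.
        conn.executemany(
            "INSERT INTO answer_urls (record_id, url) VALUES (?, ?)",
            [(cur.lastrowid, url) for url in rec["urls"]],
        )
    conn.commit()
    return conn


def count_by_subject(conn, start, end):
    """Count records per subject with asked_on between two ISO dates (inclusive)."""
    return conn.execute(
        "SELECT subject, COUNT(*) FROM records "
        "WHERE asked_on BETWEEN ? AND ? "
        "GROUP BY subject ORDER BY subject",
        (start, end),
    ).fetchall()


if __name__ == "__main__":
    db = build_db(SAMPLE_RECORDS)
    print(count_by_subject(db, "2008-01-01", "2008-12-31"))
```

Storing timestamps as ISO date strings keeps date-range queries simple, since string comparison then matches chronological order.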