The HathiTrust Research Center: Digital Humanities at Scale
1. The HathiTrust Research Center:
Digital Humanities at Scale
Robert H. McDonald | @mcdonald
Executive Committee-HathiTrust Research Center (HTRC)
Deputy Director-Data to Insight Center
Associate Dean-University Libraries
Indiana University
2. Presentation Overview
• What is the HathiTrust Research Center
(HTRC)?
• How does the HTRC work?
• Overview of Current HTRC Research
– Applied Technology for Access to the HT Corpus
– New Models for Non-Consumptive Research
• New questions for scalable Digital
Humanities computing?
• How to get engaged with the HTRC?
3. HathiTrust Research Center (HTRC):
The HTRC is dedicated to enabling
computational access for nonprofit and
educational users to published works in
the public domain and, in the future, on
limited terms to in-copyright works from
the HathiTrust.
4. Big Data and Access Problem
[Architecture diagram: the HTRC connects the HT volume store (UM) with volume and index stores at IU (IUB, IUPUI) and with compute allocations on XSEDE, the FutureGrid computation cloud, UIUC, and IU.]
5. New questions require computational
access to a large corpus
Investigate the ways in which concepts of philosophy
are used in physics by
– Extracting argumentative structure from a large
dataset using a mixture of automated and social
computing techniques
– Capturing evidence for the conjecture that the availability
of such analyses will enable innovative
interdisciplinary research
– Digging into Data 2012 award, Colin Allen, IU
6. New questions, cont.
Document, through software text-analysis
techniques, the appearance, frequency, and
context of terms, concepts, and usages related
to human rights in a selection of English-language
novels.
– Ronnie Lipschutz of UCSC is currently doing this
analysis on one of Jane Austen’s books. He’d like
to extend the work to encompass a far larger
corpus.
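As a sketch of this kind of analysis, the snippet below counts target terms and captures the words around each occurrence (a simple keyword-in-context view). The sample text and term list are illustrative only, not drawn from Lipschutz's study:

```python
import re
from collections import Counter

def term_contexts(text, terms, window=5):
    """Count target terms and capture a window of surrounding
    words for each occurrence (keyword-in-context)."""
    words = re.findall(r"[a-z']+", text.lower())
    wanted = {t.lower() for t in terms}
    counts = Counter()
    contexts = []
    for i, w in enumerate(words):
        if w in wanted:
            counts[w] += 1
            lo = max(0, i - window)
            contexts.append(" ".join(words[lo:i + window + 1]))
    return counts, contexts

# Illustrative sample text, not a passage from the actual corpus.
sample = ("It is a truth universally acknowledged that liberty and "
          "justice are prized; liberty, once lost, is dear.")
counts, contexts = term_contexts(sample, ["liberty", "justice"], window=2)
print(counts["liberty"])   # 2
print(counts["justice"])   # 1
```

Scaling this from one Austen novel to the full corpus is exactly the move that requires computation to run next to the data rather than on a desktop.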
7. New questions, cont.
Identify all 18th-century published books in the
HathiTrust corpus, and apply topic modeling
to create consistent overall subject metadata
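A minimal sketch of the first step, filtering bibliographic records to 18th-century imprints. The `pub_date` field and the sample records are hypothetical; real HathiTrust metadata arrives as MARC records with dates in fields such as 008 and 260$c:

```python
import re

def is_18th_century(record):
    """Return True if a record's publication date falls in 1700-1799.
    'pub_date' is a hypothetical field standing in for MARC date data."""
    m = re.search(r"\b(1[0-9]{3})\b", str(record.get("pub_date", "")))
    return m is not None and 1700 <= int(m.group(1)) <= 1799

# Hypothetical sample records for illustration.
records = [
    {"title": "Tristram Shandy", "pub_date": "1759"},
    {"title": "Middlemarch", "pub_date": "1871-72"},
    {"title": "Pamela", "pub_date": "[1740]"},
]
eighteenth = [r["title"] for r in records if is_18th_century(r)]
print(eighteenth)  # ['Tristram Shandy', 'Pamela']
```

The filtered subset would then feed a topic-modeling stage (e.g., LDA) to generate the consistent subject metadata.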
8. Towards an HT Research Center
• Started in response to proposed Google
Settlement – June 2009
– Specific funding set aside by Google to build a public
research center
– Worked to identify key stakeholders from HT institutions to
collaborate and write the RFP
• Developed specific RFP for HathiTrust to
solicit proposals – Summer/Fall 2009
– HTRC RFP Working Group
• RFP Released – Winter 2010
9. HTRC RFP Working Group
http://www.hathitrust.org/documents/hathitrust-research-center-rfp.pdf
• Kat Hagedorn-HathiTrust
• Jeremy York-HathiTrust
• Melissa Levine-HathiTrust
• John Wilkin-HathiTrust
• Steve Abney-Michigan
• Jack Bernard-Michigan
• Geoffrey Fox-Indiana
• David Goldberg-California-Irvine
• John Unsworth-Illinois
• Robert H. McDonald-Indiana
• Qiaozhu Mei-Michigan
• Shawn Newsam-California-Merced
• John Ober-California Digital Library
• Beth Plale-Indiana
• Scott Poole-Illinois
• Sarah Shreeves-Illinois
10. HTRC Governance
• Reports to the HathiTrust Board of Governors
– Laine Farley-CDL is liaison to the HT BOG
• HTRC Executive Committee
– Co-Directors: Beth Plale (IU), J. Stephen Downie (UIUC)
– Robert H. McDonald (IU)
– Beth Sandore (UIUC)
– Marshall Scott Poole (UIUC)
– John Unsworth (Brandeis)
• HTRC Advisory Board (See members next slide)
• Google Public Domain agreement – in place for IU
and UIUC
11. HTRC Advisory Board
• Cathy Blake, University of Illinois, Urbana-Champaign
• Beth Cate, Indiana University
• Greg Crane, Tufts University
• Laine Farley, California Digital Library
• Brian Geiger, University of California at Riverside
• David Greenbaum, University of California at Berkeley
• Fotis Jannidis, University of Würzburg, Germany
• Matthew Jockers, Stanford University
• Jim Neal, Columbia University
• Bill Newman, Indiana University
• Bethany Nowviskie, University of Virginia
• Andrey Rzhetsky, University of Chicago
• Pat Steele, University of Maryland
• Craig Stewart, Indiana University
• David Theo Goldberg, University of California at Irvine
• John Towns, National Center for Supercomputing Applications
• Madelyn Wessel, University of Virginia
12. GOOGLE DIGITAL HUMANITIES AWARDS RECIPIENT
INTERVIEWS REPORT
PREPARED FOR THE HATHITRUST RESEARCH CENTER
VIRGIL E. VARVEL JR.
ANDREA THOMER
CENTER FOR INFORMATICS RESEARCH IN SCIENCE AND
SCHOLARSHIP
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Fall 2011
13. The study
• Dr. John Unsworth, a representative of HTRC,
distributed invitations to participate in this study via
email to 22 researchers given Google Digital
Humanities Research Awards.
• Interviews were conducted via telephone, Skype®, or
face-to-face, and all were audio recorded. All
participants agreed to the IRB permission statement via
email.
• A semi-structured interview protocol was developed
with input from HTRC to elicit responses from
participants on the primary goals of the project.
14. Select Findings
• Optical Character Recognition
– Steps should be taken to improve OCR quality if
and when possible
– Scalability of scanned image viewing is necessary
for OCR reference and correction
– Metadata should expose the quality of OCR
15. Select Findings
• “Would like better metadata about text languages,
particularly in multi-text documents and on language
by sections within text. Automatic language
identification functions would be helpful, but
human-created metadata is preferred, particularly
for documents with low OCR quality.”
• “primary issue was retrieving the bibliographic
records in usable form, unparsed by Google. […]
process took 10 months to design the queries and
get the data.”
16. HathiTrust is a large corpus
providing opportunity for new
forms of computational
investigation.
The bigger the data, the less
able we are to move it to a
researcher's desktop machine.
Future research on large
collections will require that
computation move to the
data, not vice versa.
17. HTRC Goals
• HTRC will provide a persistent and sustainable
structure to enable original and cutting-edge
research.
• Stimulate the development, within the community,
of new functionality and tools to enable
discoveries that would not be possible
without the HTRC.
18. HTRC Goals (cont.)
• Leverage data storage and computational
infrastructure at Indiana University and UIUC.
• Provision a secure computational and data
environment for scholars to perform research
using the HathiTrust Digital Library.
• The Center will break new ground, allowing
scholars to fully utilize the content of the HathiTrust
Library while preventing intellectual-property
misuse within the confines of current U.S.
copyright law.
19. HathiTrust Research Center
• Analysis on 10,000,000+ volumes of the HathiTrust
digital repository (public domain and copyrighted
works); estimated initial size: 300-500 TB
• Founded 2011
• Working with OCR
• Large-scale data storage and access
• HPC and Cloud
Type of data and estimated sizes:
– Solr indexes: 6.8 TB (3 indexes)
– File system rsync (pairtree): 2.3 TB
– Fast volume access store (Cassandra): 3.7 TB
– Versions of collection (5): 11.5 TB (2.3 TB x 5)
– Volume store indexes: 6.8 TB
8/17/2012 19
20. Corpus Usage Patterns
[Diagram: three access patterns]
• Access by chapter (e.g., Chapter 1 of each book)
• Access by page (e.g., Page IV)
• Access by special contents (table of contents, index, glossary)
21. HTRC Timeline
• Phase I: an 18-mo development cycle
– Began 01 July 2011
– Demo of capability September 2012 (14 mo mark)
• Phase II: broad availability of resource, begins
01 January 2013
22. HTRC UnCamp
• HTRC UnCamp to be held at Indiana University
– September 10-11, 2012
• Goals of HTRC UnCamp
– Demo of initial HTRC API and SOA infrastructure
for 4 use cases, based on the original 2010 RFP
submission
– Researcher and Librarian input on next steps for
HTRC services
23. • Web services architecture and protocols
• Registry of services and algorithms
• Solr full text indexes
• NoSQL store as volume store
• Large-scale computing
• OpenID authentication
• Portal front-end, programmatic access
• SEASR mining algorithms
24. HTRC Architecture Overview
[Architecture diagram]
• Portal access and high-level apps: token filter, sentence
tokenizer, named entity, latent semantic analysis, word position
• Direct programmatic access (by Meandre programs running
on HTRC machines)
• Data API access interface
• Algorithm Registry (WSO2), Security (OAuth), Audit
• Cassandra cluster volume store, Solr index
• Application submission
• Compute resources and storage resources
25. [Architecture diagram]
• Web portal, desktop SEASR client, CI logon (NCSA)
• Agent framework: agent service and agent instances
• SEASR analytics service; WSO2 registry (services, collections,
data capsule images); Blacklight
• HTRC Data API v0.1
• Access control (e.g., Grouper)
• Solr index; programmatic access, e.g., Meandre
• Task deployment and orchestration; non-consumptive
data capsules
• Volume stores (Cassandra) on NCSA local resources
• FutureGrid, NCSA HPC resources, Penguin on Demand
• rsync of the HathiTrust corpus; page/volume tree (file system)
at the University of Michigan
30. HTRC Solr index
• The Solr Data API 0.1 test version is available. It:
– Preserves all query syntax of the original Solr,
– Prevents users from making modifications,
– Hides the host machine and port number the HTRC
Solr is actually running on,
– Creates an audit log of requests, and
– Provides a filtered term vector for words starting
with a user-specified letter.
• A test version of the service will be available soon.
31. HTRC Solr index: most used API patterns
• ===== query
• http://coffeetree.cs.indiana.edu:9994/solr/select/?q=ocr:war
• ===== faceted search
• http://coffeetree.cs.indiana.edu:9994/solr/select/?q=*:*&facet=on&facet.field=genre
• ===== get frequency and offset of words starting with a letter
• http://coffeetree.cs.indiana.edu:9994/solr/getfreqoffset/inu.32000011575976/w
• ===== banned modification request:
• http://localhost:8983/solr/update?stream.body=<delete><query>id:298253</query></delete>&commit=true
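The read-only query patterns above can be assembled programmatically. A small sketch using only Python's standard library, assuming the `coffeetree` endpoint shown on the slide (which may no longer be live):

```python
from urllib.parse import urlencode

# Base address of the HTRC Solr proxy as shown on the slide;
# the endpoint may have moved or been retired since this talk.
BASE = "http://coffeetree.cs.indiana.edu:9994/solr"

def solr_query_url(q, extra=None):
    """Build a read-only Solr select URL like the slide's examples.
    `extra` holds additional parameters such as facet settings."""
    params = [("q", q)] + sorted((extra or {}).items())
    return BASE + "/select/?" + urlencode(params)

# Full-text query for the word "war" in the OCR field.
print(solr_query_url("ocr:war"))
# Faceted search over the genre field.
print(solr_query_url("*:*", {"facet": "on", "facet.field": "genre"}))
```

Update requests like the banned `<delete>` example would be rejected by the proxy layer, which is why only `select`-style URLs are built here.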
32. Requirement: run big jobs on
large-scale, free or nearly free
(leveraged) compute resources
33. Data Capsules
[Diagram] Scholars submit secure data capsule images to the
FutureGrid computation cloud; map/reduce jobs in a VM cluster
access the HTRC volume store and index; scholars reach the
secure VM via remote desktop or VNC, then receive and review
results from the secure data capsule.
34. HTRC FutureGrid
• Take an experiment with 5M volumes (4 TB of uncompressed
data); the MapReduce headnode partitions the 4 TB evenly
amongst 1000 map nodes as 4 GB chunks.
• Block abstraction at volume size or larger; the HTRC block
store is read-only.
• Single source of write from each map node: < 1 GB per
map node.
• Trusted because run as user HTRC.
• The user buffer for secondary data the user submits for use
in the computation needs to be copied to each map node.
• Map nodes (0 … 999) feed a reduce step on Data Capsule
nodes.
• Provenance capture (through the Karma provenance tool)
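The chunked layout above can be illustrated with a toy in-memory map/reduce word count, standing in for a Hadoop-style job over the 4 GB chunks. This is an illustrative sketch, not HTRC's actual code:

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Map step: each node counts words in its own chunk of volumes."""
    return Counter(chunk.lower().split())

def reduce_counts(a, b):
    """Reduce step: merge per-node counts into a single tally."""
    return a + b

# Toy stand-ins for the 4 GB chunks handed to each map node.
chunks = ["war and peace", "war of the worlds", "the peace treaty"]
per_node = map(map_chunk, chunks)          # runs on map nodes 0..N
totals = reduce(reduce_counts, per_node)   # runs on the reduce node
print(totals["war"])    # 2
print(totals["peace"])  # 2
```

Because each map node writes well under its input size (< 1 GB against a 4 GB chunk), the read-only block store plus small per-node outputs keeps the workflow within non-consumptive bounds.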
36. Notion of a Proxy
• "Researcher" in the Google Book Settlement definition
means a human.
• An avatar is a virtual life form acting on behalf of a
person.
• A computer program acts on behalf of a person.
• A computer program (i.e., a proxy) must be able to read
Google texts; otherwise computational analysis is
impossible to carry out.
• So "non-consumption" applied to books refers to
human consumption.
37. Non-consumptive research –
implementation definition
• No action or set of actions on the part of users,
either acting alone or in cooperation with
other users over the duration of one or multiple
sessions, can result in sufficient information
gathered from the collection of copyrighted works
to reassemble pages from the collection.
• The definition disallows collusion between users,
or accumulation of material over time. It
differentiates the human researcher from the proxy,
which is not a user. Users are human beings.
38. Notion of Algorithm
• Computational analysis is accomplished
through algorithms.
– An algorithm carries out one coherent analysis
task: sort a list of words, compute word frequency
for a text.
• A researcher's computational analysis often
requires running a sequence of algorithms. An
important distinction for implementing non-consumptive
research is: who owns the
algorithm?
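The sequence-of-algorithms point can be sketched as a small pipeline chaining tokenization, frequency counting, and sorting. This is illustrative only, not an HTRC or SEASR implementation:

```python
import re
from collections import Counter

def tokenize(text):
    """Algorithm 1: split a text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequency(tokens):
    """Algorithm 2: compute the frequency of each token."""
    return Counter(tokens)

def top_words(freq, n=3):
    """Algorithm 3: sort words by descending count."""
    return [w for w, _ in freq.most_common(n)]

# A researcher's analysis chains such algorithms in sequence.
text = "To be or not to be, that is the question."
print(top_words(word_frequency(tokenize(text))))  # ['to', 'be', 'or']
```

Whether each stage is center-supplied or user-supplied is exactly the ownership question the slide raises.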
39. Infrastructure for computational
analysis
• When needing to support computation over a
10+M volume corpus, algorithms must be co-located
with the data.
• That is, algorithms must be located where the
repository is located, and not on the user's
desktop.
• When computational analysis is to be non-consumptive,
there will likely be one location for the data.
40. Who owns algorithm?
• HTRC owns the algorithms,
– using the Software Environment for the Advancement of
Scholarly Research (SEASR) suite of algorithms
– we are examining the security requirements of users,
algorithms, and data
41. User owns and submits their
algorithms
• HTRC recently received funding from the Alfred P.
Sloan Foundation to prototype a "data capsule
framework" that provisions for non-consumptive
research.
• Founded on the principle of "trust but verify": the
informatics-savvy humanities scholar is given
freedom to experiment with new algorithms
on protected information, but technological
mechanisms are in place to prevent undesirable
behavior (leakage).
42. Non-consumptive, user-owned algorithms
infrastructure; requirements:
• Implements non-consumptive research
• Openness – users are not limited to a known
set of algorithms
• Efficiency – it is not possible to analyze algorithms
for conformance prior to running them
• Low cost and scale – runs at large scale and
low cost to the scholarly community of users
• Long-term value – adoption for other purposes
43. Categories of algorithms. Can fair use be
determined based on categorization of
algorithm? Or is all computational use fair use?
44. Algorithm results fair use?
• Center supplied
– Easier because we know category of algorithm
• User supplied
– HTRC is not examining code, so open question
45. HTRC Philosophy and NCR
• Finally, the results of computational research that
conforms to the restrictions of non-consumptive
research must belong to the researcher
46. How to Engage
• Building partnerships with researchers and
research communities is a key goal of the
HathiTrust Research Center
• HTRC can give technical advice to researchers
as they look for funding opportunities
involving access to research data
• The upcoming "Fix the OCR and Metadata
Shortage Community Challenge" will help us
address key weaknesses of the HT corpus
47. HTRC Upcoming Events
• JADH2012 Sept 15, 2012 - HathiTrust Research
Center: Pushing the Frontiers of Large Scale
Text Analytics – J. Stephen Downie-UIUC
• HathiTrust Research Center UnCamp – Sept
10-11, 2012 – Indiana University
http://www.hathitrust.org/htrc_uncamp2012
• Demonstration Presentation for the HathiTrust
Digital Library Board of Governors – Sept 2012
48. Thank You
• This presentation was made possible with content
provided by many HTRC colleagues: Beth Plale, Marshall
Scott Poole, John Unsworth, J. Stephen Downie, Stacy
Kowalczyk, Yiming Sun, Guangchen Ruan, Loretta Auvil,
Kirk Hess, and many others…
• The HTRC Non-Consumptive Research Grant is
graciously funded by the Alfred P. Sloan Foundation
• IU D2I-PTI is graciously funded by The Lilly Endowment,
Inc.
• HTRC - http://www.hathitrust.org/htrc
• IU D2I Center - http://d2i.indiana.edu/
49. Contact Information
• Robert H. McDonald
– Email – robert@indiana.edu
– Chat – rhmcdonald on googletalk | skype
– Twitter - @mcdonald
– Blog – http://www.rmcdonald.net
Editor's notes
== Corpus Usage Patterns (cont'd) ==
1) In this use case, the user may wish to select a chapter from each book in the collection he/she assembled and perform analysis over these chapters.
2) In this use case, analysis is performed only on some special pages of each book (e.g., the foreword, or the first page of each chapter) in his/her collection.
3) In this case, analysis is performed on the special contents of each book (e.g., table of contents, indices, glossaries).