These are my slides from DPLAFest 2015, held in Indianapolis, IN, on April 17-18, 2015.
For more, see https://dplafest2015.sched.org/event/a1cfbaca67fd71a2409d28d9b27b1351
The HathiTrust Research Center: An Overview of Advanced Computational Services
1. The HathiTrust Research Center: An Overview of Advanced Computational Services
April 18, 2015 | DPLAFest 2015 | Indianapolis, IN
Robert H. McDonald
Indiana University Libraries | Data To Insight Center
Indiana University
Tweet us - #HTRC #DPLAFest
HATHI TRUST RESEARCH CENTER
2. Many thanks …
HTRC IU Team
• Beth Plale
• Robert H. McDonald
• Dirk Herr-Hoyman
• Miao Chen
• Guangchen Ruan
• Zong Peng
• Milinda Pathirage
• Samitha Liyanage
• Leena Unnikrishnan
• Nicholae Cline
HTRC UIUC Team
• J. Stephen Downie
• Beth Namachchivaya
• Ryan Dubnicek
• Megan Senseney
• Sayan Bhattacharyya
• Colleen Fallaw
• Loretta Auvil
• Boris Capitanu
• Harriet Green
• Jacob Jett
• Dan Bassett
3. 4/18/15 #HTRC @HathiTrust
HathiTrust Digital Library
• HathiTrust is a partnership of academic & research institutions, offering a collection of millions of titles digitized from libraries around the world.
– IU is a founding member of the HathiTrust along with University of Michigan, University of California, and the University of Virginia.
http://www.hathitrust.org/htrc
http://www.hathitrust.org
4. HathiTrust “Wow” Numbers
• 13,284,163 total volumes
• 6,742,394 book titles
• 352,534 serial titles
• 4,649,457,050 pages
• 595 terabytes
• 157 miles
• 10,793 tons
• 4,979,599 volumes in the public domain
5. [Chart: counts by institution, plotted against a 0% to 100% axis]
1. Michigan 4,712,752
2. California 3,612,596
3. Harvard 838,115
4. Wisconsin 561,094
5. Indiana 529,601
6. Cornell 510,286
7. Penn State 388,713
8. Illinois 329,136
9. NYPL 294,883
10. Princeton 252,837
11. Minnesota 193,124
12. Madrid 117,291
13. Library of Congress 108,892
14. Keio University 90,112
6. Goals for HTRC
• Provide a persistent and sustainable structure to enable scholars to ask and answer new questions.
– Leverage data storage and computational infrastructure at Indiana & Illinois
– Stimulate community development of new functionality and tools
– Use tools to enable discoveries that would not be possible without the HTRC
• Enable scholars to fully utilize content of HathiTrust Library while preventing intellectual property misuse within U.S. copyright law.
– Provide a secure computational and data environment for scholars to perform research using HathiTrust Digital Library.
7. HathiTrust and HTRC
[Organizational diagram; labels: HathiTrust, University of Illinois, Indiana University, HathiTrust Research Center, University of Michigan]
• Board of Governors
• Executive Committee
• Executive Director
9. Non-Consumptive Research Paradigm
• No action or set of actions on the part of users, either acting alone or in cooperation with other users, over the duration of one or multiple sessions, can result in sufficient information being gathered from a collection of copyrighted works to reassemble pages from the collection.
• The definition disallows collusion between users and the accumulation of material over time. It differentiates a human researcher from a proxy, which is not a user. Users are human beings.
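As a rough illustration of the idea behind the definition above, the sketch below reduces page text to aggregate term counts and discards word order, so the result describes a volume without reproducing its pages. This is an assumption-laden example, not HTRC's implementation: the function name `term_frequencies`, the `min_count` parameter, and the sample pages are all hypothetical, and counts alone do not by themselves guarantee that a policy like the one above is met.

```python
import re
from collections import Counter


def term_frequencies(page_texts, min_count=1):
    """Return aggregate term counts for a volume (hypothetical helper).

    Word order and page boundaries are discarded; only counts are kept.
    """
    counts = Counter()
    for text in page_texts:
        # Lowercase and keep alphabetic runs only; a deliberately crude tokenizer.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    # Optionally drop rare terms before anything leaves the secure environment.
    return {term: n for term, n in counts.items() if n >= min_count}


# Made-up sample pages standing in for a volume's text.
pages = ["The whale, the white whale!", "Call me Ishmael."]
print(term_frequencies(pages))
```

Raising `min_count` is one simple way to withhold rare terms from the returned counts; whether that is appropriate depends on the policy being enforced.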
11. Working with HTRC Tools
Get started at: https://htrc2.pti.indiana.edu/
Build Worksets
Execute Algorithms
Visualize Term Frequency
http://sandbox.htrc.illinois.edu/bookworm/
12. Working with HTRC Staff
• Scholarly Commons: Workshops, tutorials, and guidance for using HTRC
• Advanced Collaborative Support: One-on-one research support provided through a competitive awards process
• Advanced Research: Collaborative research partnership with HTRC
19. ACS (Advanced Collaborative Support)
1. Tracing Technology Diffusion Over Time: Dr. Michelle Alexopoulos, a scholar in economics from the University of Toronto.
2. Detecting Literary Plagiarisms: The Case of Oliver Goldsmith: Douglas Duhaime, University of Notre Dame. Duhaime will develop tools for detecting plagiarism, focusing on the case of Oliver Goldsmith and using machine learning techniques to detect Goldsmith's literary thefts.
3. Taxonomizing the Texts: Towards Cultural-Scale Models of Full Text: Colin Allen and Jaimie Murdock, Indiana University Bloomington. Allen and Murdock will carry out a cultural-scale investigation and topic modeling of HT public-domain full text, using random sampling to select collections according to the Library of Congress Subject Headings (LCSH).
4. The Trace of Theory: Geoffrey Rockwell, Laura Mandell, Stefan Sinclair, Matthew Wilkens, and Susan Brown; University of Alberta, Texas A&M University, University of Notre Dame. The project aims to identify theoretical subsets of the HT public corpus and apply large-scale topic modeling to those subsets. The researchers will develop tools and computational methods for tracking the concept of "theory."
20. WCSA Funded Projects
1. Workset Creation through Image Analysis of Document Pages - Texas A&M University (PI: Keith Biggers)
2. Semantic Analysis of Documents from the HathiTrust Corpus - Waikato University (PI: Annika Hinze)
3. Distributed Metadata Correction and Annotation - Maryland Institute for Technology in the Humanities, University of Maryland (PI: Trevor Muñoz)
4. ElEPHãT: Early English Print in HathiTrust, a Linked Semantic Workset Prototype - Oxford University (PI: Kevin Page)
21. HTRC Data Capsule for Secure Text-Mining at Scale
Funded at $606,000 by The Alfred P. Sloan Foundation; Beth Plale, Indiana University, PI; Atul Prakash, University of Michigan, Co-PI; Fall 2011 – Fall 2014.
Goal: Prototype a system that enables secure text mining to be carried out at scale using public cloud resources, including:
1. a software cloud infrastructure based on OpenStack
2. mechanisms for managing a secure virtual machine We plan
The Sloan Cloud will provide users with dedicated virtual machines that are pre-configured with appropriate tools and provide secure access to remote data that cannot be funneled through the VM to outside filesystems.
24. HTRC Upcoming Events
1. Tutorial at JCDL 2015 – June 21, 2015 – Knoxville, TN
– Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research
– http://www.jcdl2015.org/tutorials-workshops
2. HASTAC 2015 Post Conference Workshop – May 30, 2015 – East Lansing, MI
– Workshop on Text Mining with the HathiTrust Research Center
– http://www.hastac2015.org/schedule/post-conference-workshops/
Editor's notes
Data Capsule
Enhanced Feature Extraction
Alpha Version of Bookworm