Sarah Michalak, HathiTrust #RLUK14

HATHITRUST
A Shared Digital Repository
HathiTrust:
An Above Campus Solution
Sarah Michalak
RLUK Birmingham
November 14, 2014

Today’s Discussion - HathiTrust
• Mission and partnership
• Collections
• Services
• HathiTrust Research Center
• Benefits for Libraries
12/18/2014

The Name
• The meaning behind the name
• Hathi (hah-tee): Hindi for elephant
• Never forgets
• Full of wisdom
• Secure
• Trustworthy
• Big, strong

The Mission and Partnership

Mission
To contribute to the common good by collecting, organizing, preserving,
communicating, and sharing the record of human knowledge.
Efforts include, but are not limited to
…building comprehensive collections co-owned and managed by
partners.
…enabling access by users with print disabilities.
…supporting computational research with the collections.
…stimulating shared collection storage strategies among libraries.

HathiTrust Members
Allegheny College
Arizona State University
Baylor University
Boston College
Boston University
Brandeis University
Brown University
California Digital Library
Carnegie Mellon University
Colby College
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Florida State University**
Getty Research Institute
Harvard University Library
Indiana University
Iowa State University
Johns Hopkins University
Kansas State University
Lafayette College
Library of Congress
Massachusetts Institute of Technology
McGill University
Michigan State University
Montana State University
Mount Holyoke College
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Rutgers University
Stanford University
Syracuse University
Temple University
Texas A&M University
Texas Tech
Tufts University
Universidad Complutense de Madrid
University of Alabama
University of Alberta
University of Arizona
University of British Columbia
University of Calgary
University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz)
The University of Chicago
University of Connecticut
University of Delaware
University of Florida
University of Houston
University of Illinois
University of Illinois at Chicago
The University of Iowa**
University of Kansas**
University of Maine
University of Maryland**
University of Massachusetts, Amherst
University of Miami
University of Michigan
University of Minnesota**
University of Missouri**
University of Nebraska-Lincoln**
University of New Mexico
The University of North Carolina at Chapel Hill**
University of Notre Dame
University of Oklahoma
University of Pennsylvania
University of Pittsburgh
University of Queensland
University of Tennessee, Knoxville**
University of Texas
University of Utah
University of Vermont
University of Virginia**
University of Washington**
University of Wisconsin-Madison**
Utah State University**
Vanderbilt University
Virginia Tech
Wake Forest University
Washington University
Yale University Library
November 3, 2014 7

How Are Costs Shared?
• Public domain volumes: All partners share in infrastructure costs
for each item.
• In copyright volumes: Partners share costs based on their
holdings.
• Infrastructure cost per volume: ~$0.168 per volume per year.
• All partners pay an additional amount above costs to fund new
programs and investigations.
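
The cost-sharing rules above can be sketched as a small calculation. This is an illustration only, not HathiTrust's actual fee formula: the function name `annual_shares` and its parameters are hypothetical, the even split of public domain costs across all partners is one reading of "all partners share", and dividing each in-copyright volume's cost among the partners that hold it is an assumption about what "based on their holdings" means. The additional amount for new programs is left out.

```python
from collections import Counter

# Approximate infrastructure cost from the slide: USD per volume per year.
COST_PER_VOLUME = 0.168


def annual_shares(partners, public_domain_count, ic_holdings):
    """Return {partner: estimated annual infrastructure cost in USD}.

    partners            -- list of partner names (hypothetical input)
    public_domain_count -- number of public domain volumes in the repository
    ic_holdings         -- {partner: set of in-copyright volume ids it holds}
    """
    if not partners:
        raise ValueError("at least one partner is required")

    # Public domain: assumed to be split evenly across all partners.
    pd_share = public_domain_count * COST_PER_VOLUME / len(partners)

    # In copyright: count how many partners hold each volume, then
    # (assumption) split that volume's cost among those holders.
    holders = Counter(v for vols in ic_holdings.values() for v in vols)

    shares = {}
    for partner in partners:
        held = ic_holdings.get(partner, ())
        ic_share = sum(COST_PER_VOLUME / holders[v] for v in held)
        shares[partner] = pd_share + ic_share
    return shares
```

Under these assumptions the partner shares add up to the total infrastructure cost of all volumes, so a partner with more overlapping in-copyright holdings pays more while the public domain portion stays equal for everyone.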

Collections and Access

HATHITRUST.ORG

12.5 million total volumes
6.4 million book titles
327,000 serial titles
575,889 government publications
4.6 million volumes in the public domain (~37%)

Link takes you to HathiTrust
Records loaded into DPLA, local library
catalogs, and commercial databases

Collective Stewardship
• Leverage expertise across institutions
• Distributed Functions and Services
• Preservation repository and access services
• University of Michigan
• Mirror site: Indiana University
• Metadata management services
• California Digital Library
• HathiTrust Research Center
• Indiana University and University of Illinois
5 November 2014 13

Collection Sources
Michigan, 37.54%
California, 28.63%
Harvard, 6.15%
Wisconsin, 4.47%
Indiana, 4.19%
Cornell, 4.02%
Illinois (UC), 2.45%
NYPL, 2.35%
Princeton, 2.02%
PSU, 1.19%
Minnesota, 1.11%
Universidad Complutense, 0.92%
LoC, 0.87%
Keio, 0.72%
Columbia, 0.52%
Northwestern, 0.45%
Ohio State, 0.42%
Chicago, 0.41%
Virginia, 0.41%
Purdue, 0.38%
Yale, 0.19%
UNC Chapel Hill, 0.14%
Getty Research Institute, 0.13%
Massachusetts, 0.09%
Florida, 0.08%
Duke, 0.06%
Connecticut, 0.04%
Boston College, 0.03%
NC State, 0.03%
McGill, 0.01%
Texas A&M, 0.01%
Alberta, < 0.01%
Delaware, < 0.01%
Utah State, < 0.01%

Dates
2000-2009, 10%
1990-1999, 14%
1980-1989, 14%
1970-1979, 13%
1960-1969, 11%
1950-1959, 6%
1940-1949, 4%
1930-1939, 4%
1920-1929, 4%
1910-1919, 4%
1900-1909, 4%
1850-1899, 10%
1800-1849, 3%
1700-1799, 0.01%
1600-1699, 0.01%
1500-1599, 0.07%
0-1500, 0.04%

Language Distribution (1)
The top 10 languages make up ~87% of all content
English, 49%
German, 9%
French, 7%
Spanish, 5%
Chinese, 4%
Russian, 4%
Japanese, 3%
Italian, 3%
Arabic, 2%
Latin, 1%
Remaining languages, 13%

Language Distribution (2)
Portuguese, 7%
Polish, 7%
Dutch, 5%
Hebrew, 5%
Hindi, 5%
Indonesian, 4%
Korean, 4%
Swedish, 4%
Thai, 3%
Urdu, 3%
Turkish, 3%
Danish, 3%
Czech, 3%
Croatian, 3%
Persian, 2%
Tamil, 2%
Hungarian, 2%
Bengali, 2%
Norwegian, 2%
Sanskrit, 2%
Greek, Modern (1453-), 2%
Vietnamese, 1%
Ukrainian, 1%
Serbian, 1%
Bulgarian, 1%
Greek, Ancient (to 1453), 1%
Armenian, 1%
Romanian, 1%
Marathi, 1%
Panjabi, 1%
Telugu, 1%
Catalan, 1%
Malay, 1%
Multiple languages, 1%
Malayalam, 1%
Finnish, 1%
Slovak, 1%
Slovenian, 1%
Turkish, Ottoman, 1%
Yiddish, 1%
Nepali, 0%
The next 40 languages make up ~12% of total

Copyright Distribution
In Copyright or undetermined, 63%
Public Domain Worldwide, 21%
US Government Documents, 5%
Public Domain (US), 11%
Open Access, 0.06%
Creative Commons, 0.06%
“Public domain”, 38%

10 September, 2014 | 20
Preservation with Access
• Preservation
– TRAC-certified
– Long-term commitments to preserve digital content facilitate planning,
decision-making
• Discovery
– Bibliographic and full-text search of all materials
– Mechanisms for local loading of records
• Access and Use
– Full text search (all users)
– Public domain and open access works (all users)
– Collections and APIs (all users)
– Lawful uses of in-copyright works (members)

Access: Lawful uses of
in-copyright works
• Sensitive to multiple legal regimes
– Full-text search (everyone everywhere)
– Access to users who have print disabilities (through member proxy in
US, and where law permits)**
– Access works that are damaged or missing and also out of print and
unavailable (members in US only)
**Terms and conditions at http://www.hathitrust.org/access_use#ic-access

Collective Action: Copyright Review
• Copyright Review Management System
– Systematic manual review of copyright registrations to determine
status of portions of the HathiTrust Collection
– CRMS US: Published in US, 1923-1963
• 316,396 reviewed / 166,753 PD (~53%)
– CRMS-World: Published in UK (1874-1944), Canada, Australia (1894-1964)
• 145,804 reviewed / 75,775 PD-world (~52%)
21 October 2014 22

HathiTrust Research Center
• http://www.hathitrust.org/htrc
• Operated by the University of Illinois, Urbana-Champaign and
Indiana University, with additional financial support from
HathiTrust.
• Co-led by Beth Plale (Indiana) and Stephen Downie (Illinois).
• Goal: enable researchers world-wide to carry out
computational investigation of HT repository.

Aims of the HTRC
• Focus on developing services to researchers
• Develop model for access: the ‘workset’
• Develop tools that facilitate research by digital humanities and
informatics communities
• Develop secure cyberinfrastructure that allows computational
investigation of entire copyrighted and public domain
HathiTrust repository

Example Projects Supported by HTRC
• Muñoz, Trevor, University of Maryland. “Distributed Metadata Correction and Annotation.”
– Correction, annotation and enhancement of HT records and export as linked data
• Page, Kevin, Oxford University. “ElEPHãT: Early English Print in HathiTrust, a Linked Semantic Workset Prototype”
– Development of secondary worksets based on both HT and the Early English Books Online Text Creation Partnership (EEBO-TCP).
• Burton, Vernon. “The South as ‘Other,’ the Southerner as ‘Stranger.’”
– Explore how attitudes expressed in print about slavery, southerners, and non-southerners have changed over both time and space.
• Ted Underwood, Associate Professor of English at the University of Illinois, Urbana-Champaign.
– Using public domain texts received from HathiTrust to explore changing relationships in literary genres from 1700-1899.

HathiTrust overall benefits to libraries
• Digital Curation
– Drive costs down
– Reduce “bibliographic indeterminacy”
– Make meaningful decisions about formats and quality
– Increase discoverability, use
– Consolidate development talent
– Improve strength of archiving
• Print Curation
– Means to associate our print holdings
– Coordinated record-keeping
• Subsidiary benefits
– Quantify problems
– Collective attention to solving shared problems
– Understanding relationship between collective and local

Benefits for UNC-Chapel Hill
• Preservation solution for UNC digitized books and journals.
• Online access to hundreds of thousands of titles we do not
have in our collection.
• Live links to Hathi materials in our catalog are a convenience for
users and enrich our collections.
• Hathi-led “community developments” provide tools and
expertise we might not have otherwise.
• Digital humanities scholars and other researchers have the
benefit of computational research over the large-scale corpus.

The HathiTrust Digital Library
Large Scale Digital Preservation and Access
For the Public Good
