3. Today’s Discussion - HathiTrust
• Mission and partnership
• Collections
• Services
• HathiTrust Research Center
• Benefits for Libraries
12/18/2014
4. The Name
• The meaning behind the name
• Hathi (hah-tee)--Hindi for elephant
• Never forgets
• Full of wisdom
• Secure
• Trustworthy
• Big, strong
6. Mission
To contribute to the common good by collecting, organizing, preserving,
communicating, and sharing the record of human knowledge.
Efforts include, but are not limited to
…building comprehensive collections co-owned and managed by
partners.
…enabling access by users with print disabilities.
…supporting computational research with the collections.
…stimulating shared collection storage strategies among libraries.
7. HathiTrust Members
Allegheny College
Arizona State University
Baylor University
Boston College
Boston University
Brandeis University
Brown University
California Digital Library
Carnegie Mellon University
Colby College
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Florida State University**
Getty Research Institute
Harvard University Library
Indiana University
Iowa State University
Johns Hopkins University
Kansas State University
Lafayette College
Library of Congress
Massachusetts Institute of Technology
McGill University
Michigan State University
Montana State University
Mount Holyoke College
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northwestern University
The Ohio State University
The Pennsylvania State University
Princeton University
Purdue University
Rutgers University
Stanford University
Syracuse University
Temple University
Texas A&M University
Texas Tech
Tufts University
Universidad Complutense de Madrid
University of Alabama
University of Alberta
University of Arizona
University of British Columbia
University of Calgary
University of California
Berkeley
Davis
Irvine
Los Angeles
Merced
Riverside
San Diego
San Francisco
Santa Barbara
Santa Cruz
The University of Chicago
University of Connecticut
University of Delaware
University of Florida
University of Houston
University of Illinois
University of Illinois at Chicago
The University of Iowa**
University of Kansas**
University of Maine
University of Maryland**
University of Massachusetts, Amherst
University of Miami
University of Michigan
University of Minnesota**
University of Missouri**
University of Nebraska-Lincoln**
University of New Mexico
The University of North Carolina at Chapel Hill**
University of Notre Dame
University of Oklahoma
University of Pennsylvania
University of Pittsburgh
University of Queensland
University of Tennessee, Knoxville**
University of Texas
University of Utah
University of Vermont
University of Virginia**
University of Washington**
University of Wisconsin-Madison**
Utah State University**
Vanderbilt University
Virginia Tech
Wake Forest University
Washington University
Yale University Library
November 3, 2014
8. How Are Costs Shared?
• Public domain volumes: All partners share in infrastructure costs
for each item.
• In copyright volumes: Partners share costs based on their
holdings.
• Infrastructure cost: ~$0.168 per volume per year.
• All partners pay an additional amount above costs to fund new
programs and investigations.
11. 12.5 million total volumes
6.4 million book titles
327,000 serial titles
575,889 government publications
4.6 million volumes in the public domain (~37%)
12. Link takes you to HathiTrust
Records loaded into DPLA, local library
catalogs, and commercial databases
13. Collective Stewardship
• Leverage expertise across institutions
• Distributed Functions and Services
– Preservation repository and access services
• University of Michigan
• Mirror site: Indiana University
– Metadata management services
• California Digital Library
– HathiTrust Research Center
• Indiana University and University of Illinois
5 November 2014
16. Language Distribution (1)
The top 10 languages make up ~87% of all content
English, 49%
German, 9%
French, 7%
Spanish, 5%
Chinese, 4%
Russian, 4%
Japanese, 3%
Italian, 3%
Arabic, 2%
Latin, 1%
Remaining Languages, 13%
17. Language Distribution (2)
The next 40 languages make up ~12% of total
Portuguese, 7%
Polish, 7%
Dutch, 5%
Hebrew, 5%
Hindi, 5%
Indonesian, 4%
Korean, 4%
Swedish, 4%
Thai, 3%
Urdu, 3%
Turkish, 3%
Danish, 3%
Czech, 3%
Croatian, 3%
Persian, 2%
Tamil, 2%
Hungarian, 2%
Bengali, 2%
Norwegian, 2%
Sanskrit, 2%
Greek, Modern (1453-), 2%
Vietnamese, 1%
Ukrainian, 1%
Serbian, 1%
Bulgarian, 1%
Greek, Ancient (to 1453), 1%
Armenian, 1%
Romanian, 1%
Marathi, 1%
Panjabi, 1%
Telugu, 1%
Catalan, 1%
Malay, 1%
Multiple languages, 1%
Malayalam, 1%
Finnish, 1%
Slovak, 1%
Slovenian, 1%
Turkish, Ottoman, 1%
Yiddish, 1%
Nepali, 0%
18. Copyright Distribution
In Copyright or undetermined, 63%
Public Domain Worldwide, 21%
US Government Documents, 5%
Public Domain (US), 11%
Open Access, 0.06%
Creative Commons, 0.06%
“Public domain”, 38%
20. 10 September, 2014 | 20
Preservation with Access
• Preservation
– TRAC-certified
– Long-term commitments to preserve digital content facilitate planning,
decision-making
• Discovery
– Bibliographic and full-text search of all materials
– Mechanisms for local loading of records
• Access and Use
– Full text search (all users)
– Public domain and open access works (all users)
– Collections and APIs (all users)
– Lawful uses of in-copyright works (members)
21. Access: Lawful uses of in-copyright works
• Sensitive to multiple legal regimes
– Full-text search (everyone everywhere)
– Access to users who have print disabilities (through member proxy in
US, and where law permits)**
– Access works that are damaged or missing and also out of print and
unavailable (members in US only)
**Terms and conditions at http://www.hathitrust.org/access_use#ic-access
22. Collective Action: Copyright Review
• Copyright Review Management System
– Systematic manual review of copyright registrations to determine
status of portions of the HathiTrust Collection
– CRMS-US: Published in US, 1923-1963
• 316,396 reviewed / 166,753 PD (~53%)
– CRMS-World: Published in UK (1874-1944), Canada, Australia (1894-1964)
• 145,804 reviewed / 75,775 PD-world (~52%)
21 October 2014
23. HathiTrust Research Center
• http://www.hathitrust.org/htrc
• Operated by the University of Illinois, Urbana-Champaign and
Indiana University, with additional financial support from
HathiTrust.
• Co-led by Beth Plale (Indiana) and Stephen Downie (Illinois).
• Goal: enable researchers world-wide to carry out
computational investigation of HT repository.
24. Aims of the HTRC
• Focus on developing services to researchers
• Develop model for access: the ‘workset’
• Develop tools that facilitate research by digital humanities and
informatics communities
• Develop secure cyberinfrastructure that allows computational
investigation of entire copyrighted and public domain
HathiTrust repository
25. Example Projects Supported by HTRC
• Muñoz, Trevor, University of Maryland. “Distributed Metadata Correction and Annotation.”
– Correction, annotation and enhancement of HT records and export as linked data
• Page, Kevin, Oxford University. “ElEPHãT: Early English Print in HathiTrust, a Linked Semantic
Workset Prototype”
– Development of secondary worksets based on both HT and the Early English Books Online Text
Creation Partnership (EEBO-TCP).
• Burton, Vernon. “The South as ‘Other,’ the Southerner as ‘Stranger.’”
– Explore how attitudes expressed in print about slavery, southerners, and non-southerners have
changed over both time and space.
• Ted Underwood, Associate Professor of English at the University of Illinois, Urbana-Champaign.
– Using public domain texts received from HathiTrust to explore changing relationships in literary
genres from 1700-1899.
26. HathiTrust overall benefits to libraries
• Digital Curation
– Drive costs down
– Reduce “bibliographic indeterminacy”
– Make meaningful decisions about formats and quality
– Increase discoverability, use
– Consolidate development talent
– Improve strength of archiving
• Print Curation
– Means to associate our print holdings
– Coordinated record-keeping
• Subsidiary benefits
– Quantify problems
– Collective attention to solving shared problems
– Understanding relationship between collective and local
28. Benefits for UNC-Chapel Hill
• Preservation solution for UNC digitized books and journals.
• Online access to hundreds of thousands of titles we do not
have in our collection.
• Live links to Hathi materials in our catalog are a convenience for
users and enrich our collections.
• Hathi-led “community developments” provide tools and
expertise we might not have otherwise.
• Digital humanities scholars and other researchers have the
benefit of computational research over the large-scale corpus.
30. The HathiTrust Digital Library
Large Scale Digital Preservation and Access
For the Public Good
Editor's Notes
I am University Librarian and Associate Provost for University Libraries at the University of North Carolina at Chapel Hill. UNC is a comprehensive research university with a law school, a medical school, and strong research programs in numerous disciplines, especially biomedicine and public health. There are 28,000 students. The library system holds about 7.6 million volumes across twelve libraries in various campus locations.
This is the Wilson Special Collections Library, located exactly at the center of our beautiful campus.
I am speaking today as chair of the Board of Governors of the HathiTrust Digital Library, but the UNC library is also a member of HathiTrust, and I will point out some of the benefits we receive along the way.
Why
Annual income: approximately $2.5 million (Operations: $1.688 million; Programmatic: $863K)
Public domain formula: PD fee = (PD*C*X)/N
In-copyright formula: IC = (C*X)/H
Computation of fees must be approved by the membership every year. We are now working on collecting holdings and computing costs for 2015.
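The two fee formulas above can be sketched in code. This is only an illustration, not HathiTrust's actual billing computation. The notes do not define the symbols, so the reading used here is an assumption: C is the infrastructure cost per volume (~$0.168 per year, from the cost-sharing slide), X is a multiplier covering the additional amount paid above costs, PD is the number of public domain volumes, N is the number of partners, and H is the number of partners holding a given in-copyright volume. Summing (C*X)/H over the volumes a partner holds is likewise an assumption, and the function names and sample figures are hypothetical.

```python
# Sketch of the cost-sharing formulas; symbol meanings are assumptions,
# since the notes give the formulas without defining the variables.

COST_PER_VOLUME = 0.168  # C: dollars per volume per year, from the slide


def public_domain_fee(pd_volumes: int, num_partners: int,
                      cost_per_volume: float = COST_PER_VOLUME,
                      markup: float = 1.0) -> float:
    """(PD * C * X) / N: each partner pays an equal share for all PD volumes."""
    if num_partners <= 0:
        raise ValueError("num_partners must be positive")
    return pd_volumes * cost_per_volume * markup / num_partners


def in_copyright_fee(holders_per_volume: list[int],
                     cost_per_volume: float = COST_PER_VOLUME,
                     markup: float = 1.0) -> float:
    """Sum of (C * X) / H over the in-copyright volumes one partner holds.

    holders_per_volume[i] is H for the i-th volume: the number of partners
    (including this one) that hold it.
    """
    if any(h <= 0 for h in holders_per_volume):
        raise ValueError("each volume needs at least one holder")
    return sum(cost_per_volume * markup / h for h in holders_per_volume)


if __name__ == "__main__":
    # Hypothetical inputs: 4.6 million PD volumes shared by 100 partners, and
    # a partner holding three in-copyright volumes held by 1, 2, and 4 partners.
    print(f"PD share: ${public_domain_fee(4_600_000, 100):,.2f}")
    print(f"IC share: ${in_copyright_fee([1, 2, 4]):,.4f}")
```

Under this reading, a markup of 1.0 only recovers infrastructure costs, while a value above 1.0 models the additional amount partners pay to fund new programs. A widely held in-copyright volume costs each holder less than a rarely held one, because its cost is split among more partners.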
Here is where it all starts – the HathiTrust search page
Read services.
WE HAVE WORKED TO IMPLEMENT A SHARED APPROACH TO CURATION AND STEWARDSHIP
We have an intent that our members will be able to develop new services over time.
HOW WORK GETS DONE
At Michigan: 7 FTE on the operational side, 2 on the programs side. PLUS additional subsidized FTE that is not on the budget.
Indiana and California are provided funds to support the services they operate.
HTRC: The Board of Governors has approved some central funding, but Indiana and Illinois are also picking up about 2/3 to 3/4 of the costs, plus grant funds.
Why are these schools supporting? Because they get something back greater than what they pay in. And they can take advantage of running these operations in a way to help facilitate other mission-driven programs on their campus.
The intent is for the central staff to stay lean and promote contributed effort from other institutions.
One of the goals: “To create and sustain this ‘public good’ in a way that mitigates the problem of free-riders.”
In short: the costs are shared in a way that ensures that everyone pays for what they benefit most directly from. And since almost the entire collection is based on what we have reformatted from libraries, we are basing it on Library holdings. We are NOT basing it on what your library has digitized and put in the repository.
This slide is not so surprising….
Two ways we determine copyright status. First: date of publication listed in catalog.
Then, manually, through collective effort on certain subsets: large-scale review of materials in collection to check on copyright status. Have identified an additional 250,000 volumes as public domain in US or worldwide.
18 member libraries contributing at least ¼ of a person’s time to this work.
US Govt Publications are not all assumed to be public domain. Certain publishers, such as Smithsonian, are not opened automatically. We also close NTIS materials published in the last 6 years.
If a work is known to have substantial 3rd party copyrighted material in the publication, then we would close it.
We have taken a more liberal approach to this than Google did; they go primarily by date, not content.
We do have clearly published guidelines for receiving a complaint, and a process for taking down and reviewing status of a work.
So how do you use all this stuff? Talk for just a second before going into end uses, about how libraries view HathiTrust.
Our unofficial slogan is “We make lawful uses of published works.”
Search
Print disabilities
Preservation access
In HathiTrust we’ve been willing to take vigorous advantage of rights and protections available to libraries and individuals in US Copyright Law, and in doing so I think we have been able to make the concept of “Fair Use” in our law much stronger and, in some ways, more available to libraries. Multiple courts have agreed that libraries can digitize items for the purposes of enabling computationally-driven search, and that we can digitize items for the purposes of providing access to users with print disabilities. We feel very confident that our policies and practices around providing replacement copy access for damaged/lost items are lawful. But you see here how these differing laws affect the types of benefits our partner libraries receive. While we are confident that we can provide access to damaged or missing works to US institutions under the fair use and libraries exceptions in our laws, we cannot do so elsewhere.
KEY FOCUS: EXPAND COMPUTATIONAL ACCESS, BOTH BY SIMPLIFYING FOR NON-EXPERTS AND BY DEVELOPING ROBUST INFRASTRUCTURE FOR SPECIALISTS
Focus on developing services to researchers
Develop model for access: the ‘workset’
Develop tools that facilitate research by digital humanities and informatics communities
Develop secure cyberinfrastructure that allows computational investigation of entire copyrighted and public domain HathiTrust repository
Beth is Director, Data to Insight Center
Managing Director, Pervasive Technology Institute (PTI)
Professor, School of Informatics and Computing, Indiana University
Stephen is Professor and Associate Dean for Research at the Graduate School of Library and Information Science.
Management team includes Robert McDonald, Beth Sandore, and John Unsworth.
A secure computing framework that:
Trusts that researcher will not deliberately leak repository data, but
Prevents malware acting on user's behalf from leaking data.
Enforces:
Non-consumptive use: framework provides safe handling of large volumes of protected data
Openness: framework supports user-contributed analysis tools (that is, it does not limit uses to a known set of algorithms)
Efficiency: framework supports user-contributed analysis tools without resorting to code walkthroughs prior to acceptance
Large-scale and low cost: protections can be extended to utilization of large-scale national (public) supercomputers
GOAL: ENABLE ACCESS FOR COMPUTATION WITHIN A SECURE FRAMEWORK
Pushing out the Services through Scholarly Commons
Gives HT institutions exclusive access to training and learning materials that help them establish programs that integrate HTRC tools and services into their scholarly commons programs in libraries and digital humanities centers.
Physically located in the University of Illinois Library’s Scholarly Commons.
Supported by several Library staff and faculty. Key among these is the Digital Humanities Research Specialist, who will assist with the development of training and outreach initiatives in support of researchers working with the HathiTrust Research Center and HathiTrust Digital Library affiliates who seek to start their own HTRC research services.
Effort involves planning, implementation and continuous development of training materials, educational workshops, and potential tools, and outreach activities in support of the usage of HTRC tools and datasets.
“Workset Creation through Image Analysis of Document Pages”, Texas A&M University (PI: Keith Biggers)
Biggers will work with Neal Audenaert and Natalie M. Houston to develop a software application that uses the visual characteristics of digitized printed pages to identify documents that contain three types of visually distinctive materials of interest to humanities researchers: poetry, music, and illustrations. This prototype will demonstrate the value of using visual analysis of document images in conjunction with more traditional textual analysis to enable scholars to ask more refined questions about texts and their physical manifestations.
“Semantic Analysis of Documents from the HathiTrust Corpus”, Waikato University (PI: Annika Hinze)
Hinze’s team will develop a suite of tools that analyze documents by the semantics of their content and metadata. Clustering documents by semantic similarity will open up a wealth of opportunities for scholarly research. The project is designed in close collaboration with two humanities scholars from the areas of Maori & Pacific Studies, and Historical Anthropology, who not only drive this project with research questions based on their scholarly practice, but also provide ongoing input and feedback during the development process.
“Distributed Metadata Correction and Annotation”, Maryland Institute for Technology in the Humanities, University of Maryland. (PI: Trevor Muñoz)
Muñoz will collaborate with Peter Mallios and the Foreign Literatures in America (FLA) project team to develop a set of services and interfaces that will allow the FLA project (and other projects like it) to pull metadata records from the HathiTrust, correct and annotate these records using standardized vocabularies, gather corrections and annotations from other teams or scholars, and export enhanced metadata in formats suitable for publication as linked data.
“ElEPHãT: Early English Print in HathiTrust, a Linked Semantic Workset Prototype”, Oxford University (PI: Kevin Page)
Page will work with colleagues from the Bodleian Library to produce software that exposes the necessary metadata from individual collections for building aggregate worksets drawn from multiple sources. The prototype will build integrated worksets that combine resources from the HathiTrust and from the Early English Books Online Text Creation Partnership (EEBO-TCP) collection, which focuses on high quality images and accurate transcriptions of items usually found in libraries’ special collections.