SlideShare une entreprise Scribd logo
1  sur  49
The HathiTrust Research Center:
  Digital Humanities at Scale
        Robert H. McDonald | @mcdonald
  Executive Committee-HathiTrust Research Center (HTRC)
           Deputy Director-Data to Insight Center
             Associate Dean-University Libraries
                  Indiana University
Presentation Overview
• What is the HathiTrust Research Center
  (HTRC)?
• How does the HTRC Work?
• Overview of Current HTRC Research
  – Applied Technology for Access to the HT Corpus
  – New Models for Non-Consumptive Research
• New questions for scaleable Digital
  Humanities Computing?
• How to get engaged with the HTRC?
HathiTrust Research Center (HTRC):

   The HTRC is dedicated to enable
 computational access for nonprofit and
 educational users to published works in
the public domain and, in the future, on
limited terms to works in-copyright from
             the HathiTrust.
Big Data and Access Problem
                                 HTRC
  HT                            Volume
               HT              Store and
Volume
            Volume               Index
 Store
             Store               (IUB)
 (UM)                                            XSEDE
            (IUPUI)
                                               Compute
                       FutureGrid              Allocation
                      Computation
                         Cloud                UIUC
                                           Compute
                                           Allocation
                           IU
                       Compute
                       Allocation
New questions require computational
         access to large corpus
Investigate way in which concepts of philosophy
  are used in physics through
  – Extracting argumentative structure from large
    dataset using mixture of automated and social
    computing techniques
  – Capture evidence for conjecture that availability of
    such analyses will enable innovative
    interdisciplinary research
  – Digging into Data 2012 award, Colin Allen, IU
New questions, cont.
Document through software text analysis
 techniques, the appearance, frequency and
 context of terms, concepts and usages related
 to human rights in a selection of English-
 language novels.
  – Ronnie Lipschutz of UCSC is currently doing this
    analysis on one of Jane Austen’s books. He’d like
    to extend the work to encompass a far larger
    corpus.
New questions, cont.
Identify all 18th century published books in
  HathiTrust corpus, and apply topic modeling
  to create a consistent overall subject metadata
Towards an HT Research Center
• Started in response to proposed Google
  Settlement - June 2009
  Specific Funding set aside by Google to build a public
   research center
  Worked to identify key stakeholders from HT institutions to
   collaborate and write RFP
• Developed specific RFP for HathiTrust to
  solicit proposals – Summer/Fall 2009
  – HTRC RFP Working Group
• RFP Released – Winter 2010
HTRC RFP Working Group
               http://www.hathitrust.org/documents/hathitrust-research-center-rfp.pdf or




•   Kat Hagedorn-HathiTrust                          • Robert H. McDonald-Indiana
•   Jeremy York-HathiTrust                           • Qiaozhu Mei-Michigan
•   Melissa Levine-HathiTrust                        • Shawn Newsam-California-
•   John Wilkin-HathiTrust                             Merced
•   Steve Abney-Michigan                             • John Ober-
                                                       California Digital Library
•   Jack Bernard-Michigan
                                                     • Beth Plale-Indiana
•   Geoffrey Fox-Indiana
                                                     • Scott Poole-Illinois
•   David Goldberg-California-Irvine
                                                     • Sarah Shreeves-Illinois
                                                     • John Unsworth-Illinois
HTRC Governance
• Reports to the HathiTrust Board of Governors
  – Laine Farley-CDL is liaison to the HT BOG
• HTRC Executive Committee
  –   Co-Directors: Beth Plale (IU) – J. Stephen Downie (UIUC)
  –   IU-Robert H. McDonald (IU)
  –   UIUC-Beth Sandore (UIUC) – Marshall Scott Poole (UIUC)
  –   John Unsworth (Brandeis)
• HTRC Advisory Board (See members next slide)
• Google Public Domain agreement – in place for IU
  and UIUC
HTRC Advisory Board
•   Cathy Blake, University of Illinois, Urbana-Champaign
•   Beth Cate, Indiana University
•   Greg Crane, Tufts University
•   Laine Farley, California Digital Library
•   Brian Geiger, University of California at Riverside
•   David Greenbaum, University of California at Berkeley
•   Fotis Jannidis, University of Wurzberg, Germany
•   Matthew Jockers, Stanford University
•   Jim Neal, Columbia University
•   Bill Newman, Indiana University
•   Bethany Nowviskie, University of Virginia
•   Andrey Rzhetsky, University of Chicago
•   Pat Steele, University of Maryland
•   Craig Stewart, Indiana University
•   David Theo Goldberg, University of California at Irvine
•   John Towns, National Center for Supercomputing Applications
•   Madelyn Wessel, University of Virginia
GOOGLE DIGITAL HUMANITIES AWARDS RECIPIENT
             INTERVIEWS REPORT
PREPARED FOR THE HATHITRUST RESEARCH CENTER
                 VIRGIL E. VARVEL JR.
                  ANDREA THOMER
  CENTER FOR INFORMATICS RESEARCH IN SCIENCE AND
                    SCHOLARSHIP
    UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN



                    Fall 2011
The study
• Dr. John Unsworth, a representative of HTRC,
  distributed invitations to participate in this study via
  email to 22 researchers given Google Digital
  Humanities Research Awards.
• Interviews were conducted via telephone, Skype®, or
  face-to-face, and all were audio recorded. All
  participants agreed to IRB permission statement via
  email.
• A semi-structured interview protocol was developed
  with input from HTRC to elicit responses from
  participants on primary goals of project.
Select Findings
• Optical Character Recognition
  – Steps should be taken to improve OCR quality if
    and when possible
  – Scalability of scanned image viewing is necessary
    for OCR reference and correction
  – Metadata should expose the quality of OCR
Select Findings
• “Would like better metadata about text languages,
  particularly in multi-text documents and on language
  by sections within text. Automatic language
  identification functions would be helpful, but
  human-created metadata is preferred, particularly
  for documents with low OCR quality.”
• “primary issue was retrieving the bibliographic
  records in usable form, unparsed by Google. […]
  process took 10 months to design the queries and
  get the data.”
 HathiTrust is large corpus
providing opportunity for new
forms of computation
investigation.
 The bigger the data, the less
able we are to move it to a
researcher’s desktop machine
 Future research on large
collections will require
computation moves to the
data, not vice versa
HTRC Goals
• HTRC will provide a persistent and sustainable
  structure to enable original and cutting edge
  research.
• Stimulate the development in community of
  new functionality and tools to enable new
  discoveries that would not be possible
  without the HTRC.
HTRC Goals (cont.)
• Leverage data storage and computational
  infrastructure at Indiana U and UIUC,
• Provision secure computational and data
  environment for scholars to perform research
  using HathiTrust Digital Library.
• Center will break new ground, allowing
  scholars to fully utilize content of HathiTrust
  Library while preventing intellectual property
  misuse within confines of current U.S.
  copyright law.
HathiTrust
            Research
            Center
• Analysis on 10,000,000+
  volumes of HathiTrust
                             Type of Data (Public         Estimated initial
  digital repository         domain and copyrighted       size: 300-500 TB
• Founded 2011               works)
                             Solr Indexes                 6.8 TB (3 indexes)
• Working with OCR
                             File system rsync            2.3 TB
• Large-scale data storage   (pairtree)
  and access                 Fast volume access store     3.7 TB
                             (Cassandra)
• HPC and Cloud              Versions of collection (5)   11.5 TB (2.3TB x 5)

                             Volume store indexes         6.8 TB


8/17/2012                                                                      19
Corpus Usage Patterns
                  Chapter
                  1


                   Chapter 1
                                  Access by
                  Chapter
                  1
                                  chapter



                    Page IV
                                  Access by
                        Page IV
                                  page
                    Page IV




                  Table of
                  Contents
                                  Access by
                   Table of
                  1………….
                  #Contents
                   1………….
                  2…………
                   #
                     Table of
                  ## Contents
                                  special contents
                   2…………#
                     1………….
                   #
                     #
                     2…………#
                     #
                                  (table of
                                  contents, index,
                                  glossary)
8/17/2012                                       20
HTRC Timeline
• Phase I: an 18-mo development cycle
  – Began 01 July 2011
  – Demo of capability September 2012 (14 mo mark)
• Phase II: broad availability of resource, begins
  01 January 2013
HTRC UnCamp
• HTRC UnCamp to be held at Indiana University
  – September 10-11, 2012
• Goals of HTRC UnCamp
  – Demo of initial HTRC API and SOA Infrastructure
    per 4 use cases as based on original RFP
    submission of 2010
  – Researcher and Librarian input on next steps for
    HTRC services
•   Web services architecture and protocols
•   Registry of services and algorithms
•   Solr full text indexes
•   noSQL store as volume store
•   Large scale computing
•   openID authentication
•   Portal front-end, programmatic access
•   SEASR mining algorithms
8/17/2012                                     23
HTRC Architecture
     Overview
                Portal access
                                        High level apps
                  Nam
                              Token        Sentence                Direct
                    e
                               filter      tokenizer           programmatic
                  entity
                                                                 access (by         Meandre
                   Latent semantic          Word             programs running
                       analysis            position         on HTRC machines)



                                         Data API access interface

  Algorithm            Security                               Cassandra
Registry (WS02)        (OAuth)                                cluster
                                                              volume store

  Application                                                          Solr index
  submission               Audit




Compute resources                                         Storage resources
Web portal
                                                        Desktop SEASR client




  CI logon                Agent                           SEASR analytics         WSO2 registry
                                                                               services, collections, data
  (NCSA)               framework                             service                capsule images                                       Blacklight




                                                                                                                 HTRC Data API v0.1
                         Agent            Agent
                       instance
                              Agent     instance
                                               Agent
    Access                  instance         instance

 control (e.g.                                                                       Solr index                                             Programmatic
  Grouper)                                                                                                                                   access e.g.,
                          Task                            Meandre
                       deployment                       Orchestration

                 Non-consumptive                                                   Volume store
                 Data capsules                                                       Volume store
                                                                                    (Cassandra)
                                                                                       Volume store
                                                                                      (Cassandra)
                                               NCSA local resources
                                                                                        (Cassandra)

        Future
         Grid                                                                                                                               HathiTrust
                                       NCSA HPC resources                                                    rsync                           corpus
                                                                                  Page/volume
                                   Penguin on                                   tree (file system)
                                    Demand
                                                                                                                                      University of Michigan
8/17/2012                                                                                                                                             25
One access point: through SEASR




8/17/2012                            26
SEASR: workflow
used to generate
tagcloud




Query = title:Abraham Lincoln
AND publishDate:1920
 8/17/2012                      27
Query = author:Withers, Hartley
 8/17/2012                        28
Workflow invokes HTRC Solr index and HTRC data API.
 8/17/2012                                            29
HTRC Solr index
• The Solr Data API 0.1 test version available.
  – Preserves all query syntax of original Solr,
  – Prevents user from modification,
  – Hides the host machine and port number HTRC
    Solr is actually running on,
  – Creates audit log of requests, and
  – Provides filtered term vector for words starting
    with user-specified letter
• Test version service soon available
HTRC Solr index: most used API patterns

•   ===== query
•   http://coffeetree.cs.indiana.edu:9994/solr/select/?q=ocr:war
•   =====faceted search
•   http://coffeetree.cs.indiana.edu:9994/solr/select/?q=*:*&fac
    et=on&facet.field=genre
•   ===get frequency and offset of words starting with letter
•   http://coffeetree.cs.indiana.edu:9994/solr/getfreqoffset/inu.3
    2000011575976/w
•   ===== banned modification request:
•   http://localhost:8983/solr/update?stream.body=<delete><qu
    ery>id:298253</query></delete>&commit=truethanks,
Requirement: run big jobs on
large scale (free or nearly
free (leveraged)) compute
resources
Data Capsules VM
           Cluster

                                         HTRC Volume
                                        Store and Index

     Remote
                 Provide secure
     Desktop
                 VM
     Or VNC

                      Submit secure
Scholars              capsule
                                         FutureGrid
                      map/reduce Data
                                        Computation
                      Capsule images
                                           Cloud
                      to FutureGrid.
                      Receive and
                      review results



Secure Data Capsule
HTRC                                                      FutureGrid
                                                                                   Single source
                                                        4GB chunk +                of write from
                                                         user buffer   Map         map node. <
   HTRC block store
                                    Mapreduce               Read only node (0)     1GB per map
                                     headnode.                                     node
 Take experiment with
                                     Partitions          4GB chunk
     5M vols (4TB                                                        Map
                                     4TB evenly
 uncompressed data.)                                                    node (1)
                                      amongst                                            Reduce
Block abstraction at vol
                                    1000 nodes.
     size or larger.
                                       Trusted
                                    because run          4GB chunk       Map
                                   as user HTRC.                        node (2)           Data
                                    User buffer                                            Capsule
                                    needs to be                                            node
     User buffer
   for secondary                      copied to
      data user                      each map
                                        node.            4GB chunk        Map
     submits for
        use in                                                            node
    computation                                                           (999)

                                                                        Data Capsule
                                                                        nodes
                           Provenance capture (through Karma provenance tool)
Creating Functionality for
Non-consumptive Research
          Key notions
Notion of a Proxy
• “researcher” in Google Book Settlement definition
  means a human
• An avatar is a virtual life form acting on behalf of
  person
• Computer program acts on behalf of person
• Computer program (i.e., proxy) must be able to read
  Google texts, otherwise computational analysis is
  impossible to carry out.
• So non-consumption applied to books applies to
  human consumption.
Non-consumptive research –
       implementation definition
• No action or set of actions on part of users,
  either acting alone or in cooperation with
  other users over duration of one or multiple
  sessions can result in sufficient information
  gathered from collection of copyrighted works
  to reassemble pages from collection.
• Definition disallows collusion between users,
  or accumulation of material over time.
  Differentiates human researcher from proxy
  which is not a user. Users are human beings.
Notion of Algorithm
• Computational analysis is accomplished
  through algorithms
  – An algorithm carries out one coherent analysis
    task: sort list of words, compute word frequency
    for text
• Researcher’s computational analysis often
  requires running sequence of algorithms.
  Important distinction for implementing non-
  consumptive research is “who owns the
  algorithm”?
Infrastructure for computational
               analysis
• When needing to support computation over
  10+M volume corpus, algorithms must be co-
  located with data.
• That is, algorithms must be located where
  repository is located, and not on user’s
  desktop.
• When computational analysis is to be non-
  consumptive, likely one location for the data.
Who owns algorithm?
• HTRC owns the algorithms,
  – use Software Environment for Advancement of
    Scholarly Research (SEASR) suite of algorithms
  – we are examining security requirements of users,
    algorithms, and data
User owns and submits their
            algorithms
• HTRC recently received funding from Alfred P.
  Sloan foundation to prototype “data capsule
  framework” that provisions for non-
  consumptive research.
• Founded on principle of “trust but verify”.
  Informatics-savvy humanities scholar is given
  freedom to experiment with new algorithms
  on protected information, but technological
  mechanisms in place to prevent undesirable
  behavior (leakage.)
Non-consumptive, user-owned algorithms
     infrastructure; requirements:
• Implements non-consumptive
• Openness – users not limited to using known
  set of algorithms
• Efficiency – Not possible to analyze algorithms
  for conformance prior to running
• Low cost and scale – Run at large-scale and
  low cost to scholarly community of users
• Long term value –adoption for other purposes
Categories of algorithms. Can fair use be
     determined based on categorization of
algorithm? Or is all computational use fair use?




8/17/2012                                          43
Algorithm results fair use?
• Center supplied
  – Easier because we know category of algorithm
• User supplied
  – HTRC is not examining code, so open question
HTRC Philosophy and NCR
• Finally, results of computational research that
  conforms to restrictions of non-consumptive
  research must belong to researcher
How to Engage
• Building partnership with researchers and
  research communities is key goal of the
  HathiTrust Research Center
• HTRC can give technical advice to researchers
  as they look for funding opportunities
  involving access to research data
• Upcoming “Fix the OCR and Metadata
  Shortage Community Challenge” : help us
  address key weaknesses of HT corpus
HTRC Upcoming Events
• JADH2012 Sept 15, 2012 - HathiTrust Research
  Center: Pushing the Frontiers of Large Scale
  Text Analytics – J. Stephen Downie-UIUC
• HathiTrust Research Center UnCamp – Sept
  10-11, 2012 – Indiana University
  http://www.hathitrust.org/htrc_uncamp2012

• Demonstration Presentation for the HathiTrust
  Digital Library Board of Governors – Sept 2012
Thank You
• This presentation was made possible with content
  provided by many HTRC colleagues Beth Plale, Marshall
  Scott Poole, John Unsworth, J. Stephen Downie, Stacy
  Kowalczyk, Yiming Sun, Guangchen Ruan, Loretta Auvil,
  Kirk Hess, and many others…
• The HTRC Non-Consumptive Research Grant is
  graciously funded by the Alfred P. Sloan Foundation
• IU D2I-PTI is graciously funded by The Lilly Endowment,
  Inc.
• HTRC - http://www.hathitrust.org/htrc
• IU D2I Center - http://d2i.indiana.edu/
Contact Information
• Robert H. McDonald
  – Email – robert@indiana.edu
  – Chat – rhmcdonald on googletalk | skype
  – Twitter - @mcdonald
  – Blog – http://www.rmcdonald.net

Contenu connexe

Similaire à THe HathiTrust Research Center: Digital Humanities at Scale

Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
lljohnston
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional Practice
Eric Kansa
 

Similaire à THe HathiTrust Research Center: Digital Humanities at Scale (20)

Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
 
Getting Started with Institutional Repositories and Open Access
Getting Started with Institutional Repositories and Open AccessGetting Started with Institutional Repositories and Open Access
Getting Started with Institutional Repositories and Open Access
 
Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014Plale HathiTrust El Colegio de Mexico May2014
Plale HathiTrust El Colegio de Mexico May2014
 
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
INNOVATION AND ‎RESEARCH (Digital Library ‎Information Access)‎
 
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital TextBridging Digital Humanities Research and Big Data Repositories of Digital Text
Bridging Digital Humanities Research and Big Data Repositories of Digital Text
 
Institutional Repositories
Institutional RepositoriesInstitutional Repositories
Institutional Repositories
 
The HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and DemoThe HathiTrust Research Center (HTRC): An Overview and Demo
The HathiTrust Research Center (HTRC): An Overview and Demo
 
Why manage research data?
Why manage research data?Why manage research data?
Why manage research data?
 
JCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening SlidesJCDL 2015 Tutorial Opening Slides
JCDL 2015 Tutorial Opening Slides
 
Big Data
Big Data Big Data
Big Data
 
Towards Research Engines: Supporting Search Stages in Web Archives (2015)
Towards Research Engines: Supporting Search Stages in Web Archives (2015)Towards Research Engines: Supporting Search Stages in Web Archives (2015)
Towards Research Engines: Supporting Search Stages in Web Archives (2015)
 
The biodiversity informatics landscape: a systematics perspective
The biodiversity informatics landscape: a systematics perspectiveThe biodiversity informatics landscape: a systematics perspective
The biodiversity informatics landscape: a systematics perspective
 
"Filling the digital preservation gap" with Archivematica
"Filling the digital preservation gap" with Archivematica"Filling the digital preservation gap" with Archivematica
"Filling the digital preservation gap" with Archivematica
 
Text Mining : Experience
Text Mining : ExperienceText Mining : Experience
Text Mining : Experience
 
Elephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research CenterElephant in the Room: Scaling Storage for the HathiTrust Research Center
Elephant in the Room: Scaling Storage for the HathiTrust Research Center
 
Beyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional PracticeBeyond Preservation: Situating Archaeological Data in Professional Practice
Beyond Preservation: Situating Archaeological Data in Professional Practice
 
Building a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital LibraryBuilding a Public Research Center for the HathiTrust Digital Library
Building a Public Research Center for the HathiTrust Digital Library
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 
RLG Partnership Update Webinar Slides
RLG Partnership Update Webinar SlidesRLG Partnership Update Webinar Slides
RLG Partnership Update Webinar Slides
 
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
Identifiers for Researchers and Data: Increasing Attribution and Discovery– J...
 

Plus de Robert H. McDonald

Plus de Robert H. McDonald (20)

ER&L The Role of Choice in the Future of Discovery Evaluations Panel
ER&L The Role of Choice in the Future of Discovery Evaluations PanelER&L The Role of Choice in the Future of Discovery Evaluations Panel
ER&L The Role of Choice in the Future of Discovery Evaluations Panel
 
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
 
Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
Academic Libraries and Big Data: Trends in Collection, Publication, Preservat...
 
TLT Discussion on "Saving My Stuff" - 06.05.15
TLT Discussion on "Saving My Stuff" - 06.05.15TLT Discussion on "Saving My Stuff" - 06.05.15
TLT Discussion on "Saving My Stuff" - 06.05.15
 
The HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational ServicesThe HathiTrust Research Center: An Overview of Advanced Computational Services
The HathiTrust Research Center: An Overview of Advanced Computational Services
 
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
Creating Sustainable Communities in Open Data Resources: The eagle-i and VIVO...
 
ER&L 2015 Closing Keynote Slides
ER&L 2015 Closing Keynote SlidesER&L 2015 Closing Keynote Slides
ER&L 2015 Closing Keynote Slides
 
HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14HathiTrust Research Center Data Capsule Overview 09.10.14
HathiTrust Research Center Data Capsule Overview 09.10.14
 
The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkThe HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
 
Owning the Discovery Experience for Your Patrons
Owning the Discovery Experience for Your PatronsOwning the Discovery Experience for Your Patrons
Owning the Discovery Experience for Your Patrons
 
Kuali OLE: Enabling Choices for Libraries
Kuali OLE: Enabling Choices for LibrariesKuali OLE: Enabling Choices for Libraries
Kuali OLE: Enabling Choices for Libraries
 
Charleston Seminar Being Earnest with our Collections - Legacy to Cloud
Charleston Seminar Being Earnest with our Collections - Legacy to CloudCharleston Seminar Being Earnest with our Collections - Legacy to Cloud
Charleston Seminar Being Earnest with our Collections - Legacy to Cloud
 
SCONUL Kuali OLE Briefing
SCONUL Kuali OLE BriefingSCONUL Kuali OLE Briefing
SCONUL Kuali OLE Briefing
 
SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science SEAD Datanet and Sustainability Science
SEAD Datanet and Sustainability Science
 
New Perspectives for Business Intelligence: Library and Research Technologies...
New Perspectives for Business Intelligence: Library and Research Technologies...New Perspectives for Business Intelligence: Library and Research Technologies...
New Perspectives for Business Intelligence: Library and Research Technologies...
 
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...
Kuali OLE: Deep Library Collaboration and the Release of a Community-Sourced ...
 
GOKb & KB+: An International Partnership to leverage Open Access and Communit...
GOKb & KB+: An International Partnership to leverage Open Access and Communit...GOKb & KB+: An International Partnership to leverage Open Access and Communit...
GOKb & KB+: An International Partnership to leverage Open Access and Communit...
 
Kuali OLE @ LITA Forum 2012
Kuali OLE @ LITA Forum 2012Kuali OLE @ LITA Forum 2012
Kuali OLE @ LITA Forum 2012
 
HathiTrust Research Center: The Fast Version
HathiTrust Research Center: The Fast VersionHathiTrust Research Center: The Fast Version
HathiTrust Research Center: The Fast Version
 
HTRC Architecture Overview
HTRC Architecture OverviewHTRC Architecture Overview
HTRC Architecture Overview
 

Dernier

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Dernier (20)

Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Asian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptxAsian American Pacific Islander Month DDSD 2024.pptx
Asian American Pacific Islander Month DDSD 2024.pptx
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 

THe HathiTrust Research Center: Digital Humanities at Scale

  • 1. The HathiTrust Research Center: Digital Humanities at Scale Robert H. McDonald | @mcdonald Executive Committee-HathiTrust Research Center (HTRC) Deputy Director-Data to Insight Center Associate Dean-University Libraries Indiana University
  • 2. Presentation Overview • What is the HathiTrust Research Center (HTRC)? • How does the HTRC Work? • Overview of Current HTRC Research – Applied Technology for Access to the HT Corpus – New Models for Non-Consumptive Research • New questions for scaleable Digital Humanities Computing? • How to get engaged with the HTRC?
  • 3. HathiTrust Research Center (HTRC): The HTRC is dedicated to enable computational access for nonprofit and educational users to published works in the public domain and, in the future, on limited terms to works in-copyright from the HathiTrust.
  • 4. Big Data and Access Problem HTRC HT Volume HT Store and Volume Volume Index Store Store (IUB) (UM) XSEDE (IUPUI) Compute FutureGrid Allocation Computation Cloud UIUC Compute Allocation IU Compute Allocation
  • 5. New questions require computational access to large corpus Investigate way in which concepts of philosophy are used in physics through – Extracting argumentative structure from large dataset using mixture of automated and social computing techniques – Capture evidence for conjecture that availability of such analyses will enable innovative interdisciplinary research – Digging into Data 2012 award, Colin Allen, IU
  • 6. New questions, cont. Document through software text analysis techniques, the appearance, frequency and context of terms, concepts and usages related to human rights in a selection of English- language novels. – Ronnie Lipschutz of UCSC is currently doing this analysis on one of Jane Austen’s books. He’d like to extend the work to encompass a far larger corpus.
  • 7. New questions, cont. Identify all 18th century published books in HathiTrust corpus, and apply topic modeling to create a consistent overall subject metadata
  • 8. Towards an HT Research Center • Started in response to proposed Google Settlement - June 2009  Specific Funding set aside by Google to build a public research center  Worked to identify key stakeholders from HT institutions to collaborate and write RFP • Developed specific RFP for HathiTrust to solicit proposals – Summer/Fall 2009 – HTRC RFP Working Group • RFP Released – Winter 2010
  • 9. HTRC RFP Working Group http://www.hathitrust.org/documents/hathitrust-research-center-rfp.pdf or • Kat Hagedorn-HathiTrust • Robert H. McDonald-Indiana • Jeremy York-HathiTrust • Qiaozhu Mei-Michigan • Melissa Levine-HathiTrust • Shawn Newsam-California- • John Wilkin-HathiTrust Merced • Steve Abney-Michigan • John Ober- California Digital Library • Jack Bernard-Michigan • Beth Plale-Indiana • Geoffrey Fox-Indiana • Scott Poole-Illinois • David Goldberg-California-Irvine • Sarah Shreeves-Illinois • John Unsworth-Illinois
  • 10. HTRC Governance • Reports to the HathiTrust Board of Governors – Laine Farley-CDL is liaison to the HT BOG • HTRC Executive Committee – Co-Directors: Beth Plale (IU) – J. Stephen Downie (UIUC) – IU-Robert H. McDonald (IU) – UIUC-Beth Sandore (UIUC) – Marshall Scott Poole (UIUC) – John Unsworth (Brandeis) • HTRC Advisory Board (See members next slide) • Google Public Domain agreement – in place for IU and UIUC
  • 11. HTRC Advisory Board • Cathy Blake, University of Illinois, Urbana-Champaign • Beth Cate, Indiana University • Greg Crane, Tufts University • Laine Farley, California Digital Library • Brian Geiger, University of California at Riverside • David Greenbaum, University of California at Berkeley • Fotis Jannidis, University of Wurzberg, Germany • Matthew Jockers, Stanford University • Jim Neal, Columbia University • Bill Newman, Indiana University • Bethany Nowviskie, University of Virginia • Andrey Rzhetsky, University of Chicago • Pat Steele, University of Maryland • Craig Stewart, Indiana University • David Theo Goldberg, University of California at Irvine • John Towns, National Center for Supercomputing Applications • Madelyn Wessel, University of Virginia
  • 12. GOOGLE DIGITAL HUMANITIES AWARDS RECIPIENT INTERVIEWS REPORT PREPARED FOR THE HATHITRUST RESEARCH CENTER VIRGIL E. VARVEL JR. ANDREA THOMER CENTER FOR INFORMATICS RESEARCH IN SCIENCE AND SCHOLARSHIP UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Fall 2011
  • 13. The study • Dr. John Unsworth, a representative of HTRC, distributed invitations to participate in this study via email to 22 researchers given Google Digital Humanities Research Awards. • Interviews were conducted via telephone, Skype®, or face-to-face, and all were audio recorded. All participants agreed to IRB permission statement via email. • A semi-structured interview protocol was developed with input from HTRC to elicit responses from participants on primary goals of project.
  • 14. Select Findings • Optical Character Recognition – Steps should be taken to improve OCR quality if and when possible – Scalability of scanned image viewing is necessary for OCR reference and correction – Metadata should expose the quality of OCR
  • 15. Select Findings • “Would like better metadata about text languages, particularly in multi-text documents and on language by sections within text. Automatic language identification functions would be helpful, but human-created metadata is preferred, particularly for documents with low OCR quality.” • “primary issue was retrieving the bibliographic records in usable form, unparsed by Google. […] process took 10 months to design the queries and get the data.”
  • 16.  HathiTrust is large corpus providing opportunity for new forms of computation investigation.  The bigger the data, the less able we are to move it to a researcher’s desktop machine  Future research on large collections will require computation moves to the data, not vice versa
  • 17. HTRC Goals • HTRC will provide a persistent and sustainable structure to enable original and cutting edge research. • Stimulate the development in community of new functionality and tools to enable new discoveries that would not be possible without the HTRC.
  • 18. HTRC Goals (cont.) • Leverage data storage and computational infrastructure at Indiana U and UIUC, • Provision secure computational and data environment for scholars to perform research using HathiTrust Digital Library. • Center will break new ground, allowing scholars to fully utilize content of HathiTrust Library while preventing intellectual property misuse within confines of current U.S. copyright law.
  • 19. HathiTrust Research Center • Analysis on 10,000,000+ volumes of HathiTrust Type of Data (Public Estimated initial digital repository domain and copyrighted size: 300-500 TB • Founded 2011 works) Solr Indexes 6.8 TB (3 indexes) • Working with OCR File system rsync 2.3 TB • Large-scale data storage (pairtree) and access Fast volume access store 3.7 TB (Cassandra) • HPC and Cloud Versions of collection (5) 11.5 TB (2.3TB x 5) Volume store indexes 6.8 TB 8/17/2012 19
  • 20. Corpus Usage Patterns Chapter 1 Chapter 1 Access by Chapter 1 chapter Page IV Access by Page IV page Page IV Table of Contents Access by Table of 1…………. #Contents 1…………. 2………… # Table of ## Contents special contents 2…………# 1…………. # # 2…………# # (table of contents, index, glossary) 8/17/2012 20
  • 21. HTRC Timeline • Phase I: an 18-mo development cycle – Began 01 July 2011 – Demo of capability September 2012 (14 mo mark) • Phase II: broad availability of resource, begins 01 January 2013
  • 22. HTRC UnCamp • HTRC UnCamp to be held at Indiana University – September 10-11, 2012 • Goals of HTRC UnCamp – Demo of initial HTRC API and SOA Infrastructure per 4 use cases as based on original RFP submission of 2010 – Researcher and Librarian input on next steps for HTRC services
  • 23. Web services architecture and protocols • Registry of services and algorithms • Solr full text indexes • noSQL store as volume store • Large scale computing • openID authentication • Portal front-end, programmatic access • SEASR mining algorithms 8/17/2012 23
  • 24. HTRC Architecture Overview Portal access High level apps Nam Token Sentence Direct e filter tokenizer programmatic entity access (by Meandre Latent semantic Word programs running analysis position on HTRC machines) Data API access interface Algorithm Security Cassandra Registry (WS02) (OAuth) cluster volume store Application Solr index submission Audit Compute resources Storage resources
  • 25. Web portal Desktop SEASR client CI logon Agent SEASR analytics WSO2 registry services, collections, data (NCSA) framework service capsule images Blacklight HTRC Data API v0.1 Agent Agent instance Agent instance Agent Access instance instance control (e.g. Solr index Programmatic Grouper) access e.g., Task Meandre deployment Orchestration Non-consumptive Volume store Data capsules Volume store (Cassandra) Volume store (Cassandra) NCSA local resources (Cassandra) Future Grid HathiTrust NCSA HPC resources rsync corpus Page/volume Penguin on tree (file system) Demand University of Michigan 8/17/2012 25
  • 26. One access point: through SEASR 8/17/2012 26
  • 27. SEASR: workflow used to generate tagcloud Query = title:Abraham Lincoln AND publishDate:1920 8/17/2012 27
  • 28. Query = author:Withers, Hartley 8/17/2012 28
  • 29. Workflow invokes HTRC Solr index and HTRC data API. 8/17/2012 29
  • 30. HTRC Solr index • The Solr Data API 0.1 test version available. – Preserves all query syntax of original Solr, – Prevents user from modification, – Hides the host machine and port number HTRC Solr is actually running on, – Creates audit log of requests, and – Provides filtered term vector for words starting with user-specified letter • Test version service soon available
  • 31. HTRC Solr index: most used API patterns • ===== query • http://coffeetree.cs.indiana.edu:9994/solr/select/?q=ocr:war • =====faceted search • http://coffeetree.cs.indiana.edu:9994/solr/select/?q=*:*&fac et=on&facet.field=genre • ===get frequency and offset of words starting with letter • http://coffeetree.cs.indiana.edu:9994/solr/getfreqoffset/inu.3 2000011575976/w • ===== banned modification request: • http://localhost:8983/solr/update?stream.body=<delete><qu ery>id:298253</query></delete>&commit=truethanks,
  • 32. Requirement: run big jobs on large scale (free or nearly free (leveraged)) compute resources
  • 33. Data Capsules VM Cluster HTRC Volume Store and Index Remote Provide secure Desktop VM Or VNC Submit secure Scholars capsule FutureGrid map/reduce Data Computation Capsule images Cloud to FutureGrid. Receive and review results Secure Data Capsule
  • 34. HTRC FutureGrid Single source 4GB chunk + of write from user buffer Map map node. < HTRC block store Mapreduce Read only node (0) 1GB per map headnode. node Take experiment with Partitions 4GB chunk 5M vols (4TB Map 4TB evenly uncompressed data.) node (1) amongst Reduce Block abstraction at vol 1000 nodes. size or larger. Trusted because run 4GB chunk Map as user HTRC. node (2) Data User buffer Capsule needs to be node User buffer for secondary copied to data user each map node. 4GB chunk Map submits for use in node computation (999) Data Capsule nodes Provenance capture (through Karma provenance tool)
  • 36. Notion of a Proxy • “researcher” in Google Book Settlement definition means a human • An avatar is a virtual life form acting on behalf of person • Computer program acts on behalf of person • Computer program (i.e., proxy) must be able to read Google texts, otherwise computational analysis is impossible to carry out. • So non-consumption applied to books applies to human consumption.
  • 37. Non-consumptive research – implementation definition • No action or set of actions on part of users, either acting alone or in cooperation with other users over duration of one or multiple sessions can result in sufficient information gathered from collection of copyrighted works to reassemble pages from collection. • Definition disallows collusion between users, or accumulation of material over time. Differentiates human researcher from proxy which is not a user. Users are human beings.
  • 38. Notion of Algorithm • Computational analysis is accomplished through algorithms – An algorithm carries out one coherent analysis task: sort list of words, compute word frequency for text • Researcher’s computational analysis often requires running sequence of algorithms. Important distinction for implementing non- consumptive research is “who owns the algorithm”?
  • 39. Infrastructure for computational analysis • When needing to support computation over 10+M volume corpus, algorithms must be co- located with data. • That is, algorithms must be located where repository is located, and not on user’s desktop. • When computational analysis is to be non- consumptive, likely one location for the data.
  • 40. Who owns algorithm? • HTRC owns the algorithms, – use Software Environment for Advancement of Scholarly Research (SEASR) suite of algorithms – we are examining security requirements of users, algorithms, and data
  • 41. User owns and submits their algorithms • HTRC recently received funding from Alfred P. Sloan foundation to prototype “data capsule framework” that provisions for non- consumptive research. • Founded on principle of “trust but verify”. Informatics-savvy humanities scholar is given freedom to experiment with new algorithms on protected information, but technological mechanisms in place to prevent undesirable behavior (leakage.)
  • 42. Non-consumptive, user-owned algorithms infrastructure; requirements: • Implements non-consumptive • Openness – users not limited to using known set of algorithms • Efficiency – Not possible to analyze algorithms for conformance prior to running • Low cost and scale – Run at large-scale and low cost to scholarly community of users • Long term value –adoption for other purposes
  • 43. Categories of algorithms. Can fair use be determined based on categorization of algorithm? Or is all computational use fair use? 8/17/2012 43
  • 44. Algorithm results fair use? • Center supplied – Easier because we know category of algorithm • User supplied – HTRC is not examining code, so open question
  • 45. HTRC Philosophy and NCR • Finally, results of computational research that conforms to restrictions of non-consumptive research must belong to researcher
  • 46. How to Engage • Building partnership with researchers and research communities is key goal of the HathiTrust Research Center • HTRC can give technical advice to researchers as they look for funding opportunities involving access to research data • Upcoming “Fix the OCR and Metadata Shortage Community Challenge” : help us address key weaknesses of HT corpus
  • 47. HTRC Upcoming Events • JADH2012 Sept 15, 2012 - HathiTrust Research Center: Pushing the Frontiers of Large Scale Text Analytics – J. Stephen Downie-UIUC • HathiTrust Research Center UnCamp – Sept 10-11, 2012 – Indiana University http://www.hathitrust.org/htrc_uncamp2012 • Demonstration Presentation for the HathiTrust Digital Library Board of Governors – Sept 2012
  • 48. Thank You • This presentation was made possible with content provided by many HTRC colleagues Beth Plale, Marshall Scott Poole, John Unsworth, J. Stephen Downie, Stacy Kowalczyk, Yiming Sun, Guangchen Ruan, Loretta Auvil, Kirk Hess, and many others… • The HTRC Non-Consumptive Research Grant is graciously funded by the Alfred P. Sloan Foundation • IU D2I-PTI is graciously funded by The Lilly Endowment, Inc. • HTRC - http://www.hathitrust.org/htrc • IU D2I Center - http://d2i.indiana.edu/
  • 49. Contact Information • Robert H. McDonald – Email – robert@indiana.edu – Chat – rhmcdonald on googletalk | skype – Twitter - @mcdonald – Blog – http://www.rmcdonald.net

Notes de l'éditeur

  1. == Corpus Usage Patterns (cont&apos;d) === 1)  In this use case, the user may wish to select a chapter from each book in the collection he/she assembled and perform analysis over these chapters2) In this use case, analysis only on some special pages of each book (e.g. foreword, or first page of each chapter) in his/her collection.3) in this case, analysis on special contents of each book (e.g. table of contents, indices, glossaries