SlideShare a Scribd company logo
1 of 31
Use Cases

the gap between Humanities and GRID



           Dirk Roorda, coordinator infrastructure

           http://www.dans.knaw.nl
Prologue




GRID: The rationalist’s paradise
SSH and
ICT

an
accelerating
story
in 3 gears

so far ...
first gear:

online acces
to sources
...
for people
second gear:

collaborate
in
annotating
and
enriched
publications
third gear:

resources
subject
to
intensive
computing
ESFRI
      https://documents.egi.eu/public/ShowDocument?docid=12



• EEF report (April 2010)
  • CESSDA, CLARIN, DARIAH
  • ESS, SHARE
• Common requirements
  •    single sign on
  •    unified virtual organisations
  •    persistent storage (and identifiers)
  •    webservices and workflows
•   represents the linguistic community
•   relatively strong in computing
•   many exploitable language tools
•   research benefits from high performance
    computing

                             http://www.clarin.eu
• initial expectations of GRID
   • enabling technology for workflows
   • sustainable tools
   • sustainable data

..., and all of this will be available on the internet
using a service oriented architecture based on
secure grid technologies
                                 http://www.clarin.eu/the-vision
• Turning point
  There is no off-the-shelf grid/federation technology; much
  relies on the availability of specialists who know about the
  details. Much adaptation and configuration work has to be
  done, which requires a deep understanding of the
  components. ... it is the interaction and integration that
  requires lots of efforts.
                                     http://www-sk.let.uu.nl/u/D2R-2a.pdf
• Current point of view
  • interested in data grid
  • for webservices: consider cloud
  • unclear what will turn out to be a stable technology
    for the CLARIN RI
• humanities, arts and cultural heritage
  • mission
     • explore and apply ICT-based methods
     • link digital source materials of many kinds
     • exchange expertise across disciplines
  • key concepts
     • virtual infrastructure
     • virtual competence center
        • VCC1 e-Infrastructure ; VCC3 Scholarly Content

http://www.dariah.eu
•   DARIAH on the GRID?
•   no mention on high performance computing
•   scarcely mentions grid
•   programmatic access to data: yes
•   emphasis on preservation of data
•   But then there is TextGRID


                           http://www.textgrid.de/
• First Periodic Report 2009
    • announces infrastructure experiment
        • showing large computational needs for humanities
        • indexing unstructured resources in an on-demand and
          flexible manner
        • conducted on D4Science



http://dariah.eu/wiki/images/e/ea/090426_dest09_interactivegrid.pdf
• Umbrella for social science archives
  • since 1970
• Upgrade in ESFRI context (FP7)
• developing:
  • the data portal to allow seamless access to data holdings
    across Europe
  • common authentication and access middleware tools
  • investigating the potential of grid technologies
  • improving data harmonisation tools
• Use cases contain
  •   data collection
  •   applying harmonisation
  •   collect results and analyse them further
  •   in a collaborative environment
  •   under severe access restrictions
  •   develop new surveys comparable to existing ones
•   In February 2009
•   Yes GRID can help
•   but many approaches
•   of which maybe none enables all scenarios at
    the same time



http://www.cessda.org/project/doc/D11.1a_Grid-enabling.pdf
• May 2009 SWOT analysis
    • GRID solves several partial problems
    • current GRID technology not geared to the
      core of CESSDA’s problems
    • do realistic case studies on
        • GRID, Cloud, traditional client server
http://www.cessda.org/project/doc/D11.1b_Sustainability_CESSDA_e-Infrastructure.pdf
requirements
• functional
  • enable data intensive scholarship
     • better support for existing hypotheses
     • new research questions
     • enriched publications
  • enable collaboration around sources
     • incremental enrichment (annotations)
     • connecting sources (linked data)
     • connecting people (collaboratories)
requirements
• “non-functional”
  •   respect access rights
  •   sustainability of data and tools
  •   ease of use and programming
  •   performance
characteristics
•   data from irrepeatable processes
•   data in many small chunks
•   complex interrelationships
•   emphasis on human interpretation
    • natural language processing
    • symbolic processing
    • linking to background knowledge
human interpretation
• only part of the dataprocessing can be
  automated
  • higher level patterns require higher performance
    computing
• researchers need interaction with data and
  processes
  • store intermediate results
  • build virtual collections
(virtual) use cases
• a use case spells out a concrete task from the
  perspective of an end user
• the humanities present few use cases
  • where HPC is essential
  • with “significant” amounts of data
  • which can be translated into batch jobs
• so: virtually no concrete use cases!
virtual use cases
• a virtual use case
  • spells out a generic task
  • from an infrastructural perspective
  • still meaningful to the end user
• virtual use cases
  • pave the way for real use cases
  • bridge big gaps between
     • end users of
     • “foreign” infrastructure
a specific virtual use case
• smart indexes on unique resources, e.g.
  •   speech sounds in audio sources
  •   names in hand-written sources
  •   artefacts in images
  •   motifs in literary texts
• task
  • create a comprehensive index
  • of patterns in all accessible collections
  • share that index for follow-up research
smart indexes: sources
• accessible collections <=
  • sources are readily accessible
     • programmatically
     • with minimal latency
     • with persistent, shareable addresses
  • access rights are implemented
     • identities can be passed on to programs
  • sources can be gathered in virtual collections
smart indexes: processing
• a smart spider visits a virtual collection
  • on behalf of a researcher
     • with his credentials
• every addressable element is subjected to a
  smart pattern analyzer
  • this lends itself to parallellism
  • smart pattern detection <= high performance
  • occurrences are added to the index in the form
    <pattern, persistent id, fragment id>
smart indexes: results
• the resulting index is stored in the researcher’s
  workspace
• the researcher may publish his index
  • for programmatic use
  • addressed with a new persistent id
• new smart virtual collections are possible
• third parties write subsequent applications
  • portals
  • visualisations
conclusions
• research projects in the humanities:
  • “small” projects, repeatedly accessing shared
    sources
  • many people annotating the same data
  • computation becoming more intensive – gradually
  • many computing steps to be combined in
    workflows
  • “small” data, but diverse, and deeply linked
  • need for sustaining sources and results
conclusions
• what is the best supporting technology?
• if it is GRID then
• do not stick to concrete use cases
  • they do not mobilise the SSH comunity
• start supporting virtual use cases
  • because then you connect technology with a
    community
  • and lots of concrete use cases will follow
Epilogue




nothing is simple and rational except
what we ourselves have invented ...

More Related Content

What's hot

Digital libraries
Digital libraries Digital libraries
Digital libraries Dheeraj Negi
 
New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...Derek Keats
 
National Digital Library
National Digital LibraryNational Digital Library
National Digital Libraryguesta45bc80
 
Role of Information Technology in Library Management in Digital Era
Role of Information Technology  in Library Management in Digital EraRole of Information Technology  in Library Management in Digital Era
Role of Information Technology in Library Management in Digital Erarsgiri75
 
Digital Libraries: the process, initiatives and developmental issues in India...
Digital Libraries: the process, initiatives and developmental issues in India...Digital Libraries: the process, initiatives and developmental issues in India...
Digital Libraries: the process, initiatives and developmental issues in India...Sudesh Sood
 
Digital library presentation
Digital library presentationDigital library presentation
Digital library presentationmedo9477
 
Introduction to digital libraries - definitions, examples, concepts and trend...
Introduction to digital libraries - definitions, examples, concepts and trend...Introduction to digital libraries - definitions, examples, concepts and trend...
Introduction to digital libraries - definitions, examples, concepts and trend...Olaf Janssen
 
DIGITAL LIBRARIES PPT
DIGITAL LIBRARIES PPT DIGITAL LIBRARIES PPT
DIGITAL LIBRARIES PPT 03062679929
 
SGCI - S2I2: Science Gateways Community Institute
SGCI - S2I2: Science Gateways Community InstituteSGCI - S2I2: Science Gateways Community Institute
SGCI - S2I2: Science Gateways Community InstituteSandra Gesing
 
Toward universal information access on the digital object cloud
Toward universal information access on the digital object cloudToward universal information access on the digital object cloud
Toward universal information access on the digital object cloudNational Institute of Informatics
 
DIGITAL LIBRARY ARCHITECTURE
DIGITAL LIBRARY ARCHITECTUREDIGITAL LIBRARY ARCHITECTURE
DIGITAL LIBRARY ARCHITECTUREsarika meher
 
NordForsk Open Access Reykjavik 14-15/8-2014:NeIC
NordForsk Open Access Reykjavik 14-15/8-2014:NeICNordForsk Open Access Reykjavik 14-15/8-2014:NeIC
NordForsk Open Access Reykjavik 14-15/8-2014:NeICNordForsk
 

What's hot (20)

Digital libraries
Digital libraries Digital libraries
Digital libraries
 
Introduction to Digital libraries
Introduction to Digital librariesIntroduction to Digital libraries
Introduction to Digital libraries
 
Information Systems
Information SystemsInformation Systems
Information Systems
 
Digital library Assignment
Digital library AssignmentDigital library Assignment
Digital library Assignment
 
New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...New challenges for digital scholarship and curation in the era of ubiquitous ...
New challenges for digital scholarship and curation in the era of ubiquitous ...
 
Digital Library
Digital LibraryDigital Library
Digital Library
 
National Digital Library
National Digital LibraryNational Digital Library
National Digital Library
 
Humanities and ICT
Humanities and ICTHumanities and ICT
Humanities and ICT
 
Role of Information Technology in Library Management in Digital Era
Role of Information Technology  in Library Management in Digital EraRole of Information Technology  in Library Management in Digital Era
Role of Information Technology in Library Management in Digital Era
 
Digital Libraries: the process, initiatives and developmental issues in India...
Digital Libraries: the process, initiatives and developmental issues in India...Digital Libraries: the process, initiatives and developmental issues in India...
Digital Libraries: the process, initiatives and developmental issues in India...
 
Digital Library
Digital LibraryDigital Library
Digital Library
 
Digital library presentation
Digital library presentationDigital library presentation
Digital library presentation
 
Introduction to digital libraries - definitions, examples, concepts and trend...
Introduction to digital libraries - definitions, examples, concepts and trend...Introduction to digital libraries - definitions, examples, concepts and trend...
Introduction to digital libraries - definitions, examples, concepts and trend...
 
DIGITAL LIBRARIES PPT
DIGITAL LIBRARIES PPT DIGITAL LIBRARIES PPT
DIGITAL LIBRARIES PPT
 
Digital library
 Digital library Digital library
Digital library
 
SGCI - S2I2: Science Gateways Community Institute
SGCI - S2I2: Science Gateways Community InstituteSGCI - S2I2: Science Gateways Community Institute
SGCI - S2I2: Science Gateways Community Institute
 
Toward universal information access on the digital object cloud
Toward universal information access on the digital object cloudToward universal information access on the digital object cloud
Toward universal information access on the digital object cloud
 
DIGITAL LIBRARY ARCHITECTURE
DIGITAL LIBRARY ARCHITECTUREDIGITAL LIBRARY ARCHITECTURE
DIGITAL LIBRARY ARCHITECTURE
 
NordForsk Open Access Reykjavik 14-15/8-2014:NeIC
NordForsk Open Access Reykjavik 14-15/8-2014:NeICNordForsk Open Access Reykjavik 14-15/8-2014:NeIC
NordForsk Open Access Reykjavik 14-15/8-2014:NeIC
 
Innovative ICT Based Library Services
Innovative ICT Based Library ServicesInnovative ICT Based Library Services
Innovative ICT Based Library Services
 

Similar to 2010 EGITF Amsterdam - Gap between GRID and Humanities

The Reasons Why the Science Gateways Community Needs an Institute
The Reasons Why the Science Gateways Community Needs an InstituteThe Reasons Why the Science Gateways Community Needs an Institute
The Reasons Why the Science Gateways Community Needs an InstituteSandra Gesing
 
CLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationCLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationEnno Meijers
 
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...Sirris
 
What is eScience, and where does it go from here?
What is eScience, and where does it go from here?What is eScience, and where does it go from here?
What is eScience, and where does it go from here?Daniel S. Katz
 
g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking FunctionalityNicholas Loulloudes
 
Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs vty
 
20191210 NDLI KEDL2019 Building the dutch digital heritage network
20191210 NDLI KEDL2019 Building the dutch digital heritage network20191210 NDLI KEDL2019 Building the dutch digital heritage network
20191210 NDLI KEDL2019 Building the dutch digital heritage networkEnno Meijers
 
Data Citizen Driven City at IoT London Meetup 6
Data Citizen Driven City at IoT London Meetup 6Data Citizen Driven City at IoT London Meetup 6
Data Citizen Driven City at IoT London Meetup 6Cesar Garcia
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsSimon Cockell
 
A distributed network of digital heritage information - Unesco/NDL India
A distributed network of digital heritage information - Unesco/NDL IndiaA distributed network of digital heritage information - Unesco/NDL India
A distributed network of digital heritage information - Unesco/NDL IndiaEnno Meijers
 
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...Carole Goble
 
cloudcomputingdistributedcomputing-171208050503 (1).pdf
cloudcomputingdistributedcomputing-171208050503 (1).pdfcloudcomputingdistributedcomputing-171208050503 (1).pdf
cloudcomputingdistributedcomputing-171208050503 (1).pdfArchanaPandiyan
 
Software and Education at NSF/ACI
Software and Education at NSF/ACISoftware and Education at NSF/ACI
Software and Education at NSF/ACIDaniel S. Katz
 
A distributed network of digital heritage information by Enno Meijers - Europ...
A distributed network of digital heritage information by Enno Meijers - Europ...A distributed network of digital heritage information by Enno Meijers - Europ...
A distributed network of digital heritage information by Enno Meijers - Europ...Europeana
 
Azure Brain: 4th paradigm, scientific discovery & (really) big data
Azure Brain: 4th paradigm, scientific discovery & (really) big dataAzure Brain: 4th paradigm, scientific discovery & (really) big data
Azure Brain: 4th paradigm, scientific discovery & (really) big dataMicrosoft Technet France
 
Cloud and grid computing by Leen Blom, Centric
Cloud and grid computing by Leen Blom, CentricCloud and grid computing by Leen Blom, Centric
Cloud and grid computing by Leen Blom, CentricCentric
 
Cloud and Grid Computing
Cloud and Grid ComputingCloud and Grid Computing
Cloud and Grid ComputingLeen Blom
 
SGCI - The Science Gateways Community Institute: Going Beyond Borders
SGCI - The Science Gateways Community Institute: Going Beyond BordersSGCI - The Science Gateways Community Institute: Going Beyond Borders
SGCI - The Science Gateways Community Institute: Going Beyond BordersSandra Gesing
 

Similar to 2010 EGITF Amsterdam - Gap between GRID and Humanities (20)

The Reasons Why the Science Gateways Community Needs an Institute
The Reasons Why the Science Gateways Community Needs an InstituteThe Reasons Why the Science Gateways Community Needs an Institute
The Reasons Why the Science Gateways Community Needs an Institute
 
CLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage informationCLARIAH Toogdag 2018: A distributed network of digital heritage information
CLARIAH Toogdag 2018: A distributed network of digital heritage information
 
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
 
What is eScience, and where does it go from here?
What is eScience, and where does it go from here?What is eScience, and where does it go from here?
What is eScience, and where does it go from here?
 
g-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionalityg-Social - Enhancing e-Science Tools with Social Networking Functionality
g-Social - Enhancing e-Science Tools with Social Networking Functionality
 
Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs Decentralised identifiers and knowledge graphs
Decentralised identifiers and knowledge graphs
 
20191210 NDLI KEDL2019 Building the dutch digital heritage network
20191210 NDLI KEDL2019 Building the dutch digital heritage network20191210 NDLI KEDL2019 Building the dutch digital heritage network
20191210 NDLI KEDL2019 Building the dutch digital heritage network
 
Data Citizen Driven City at IoT London Meetup 6
Data Citizen Driven City at IoT London Meetup 6Data Citizen Driven City at IoT London Meetup 6
Data Citizen Driven City at IoT London Meetup 6
 
Reproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformaticsReproducibility - The myths and truths of pipeline bioinformatics
Reproducibility - The myths and truths of pipeline bioinformatics
 
A distributed network of digital heritage information - Unesco/NDL India
A distributed network of digital heritage information - Unesco/NDL IndiaA distributed network of digital heritage information - Unesco/NDL India
A distributed network of digital heritage information - Unesco/NDL India
 
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...RDMkit, a Research Data Management Toolkit.  Built by the Community for the ...
RDMkit, a Research Data Management Toolkit. Built by the Community for the ...
 
cloudcomputingdistributedcomputing-171208050503 (1).pdf
cloudcomputingdistributedcomputing-171208050503 (1).pdfcloudcomputingdistributedcomputing-171208050503 (1).pdf
cloudcomputingdistributedcomputing-171208050503 (1).pdf
 
Cloud Computing & Distributed Computing
Cloud Computing & Distributed ComputingCloud Computing & Distributed Computing
Cloud Computing & Distributed Computing
 
Software and Education at NSF/ACI
Software and Education at NSF/ACISoftware and Education at NSF/ACI
Software and Education at NSF/ACI
 
A distributed network of digital heritage information by Enno Meijers - Europ...
A distributed network of digital heritage information by Enno Meijers - Europ...A distributed network of digital heritage information by Enno Meijers - Europ...
A distributed network of digital heritage information by Enno Meijers - Europ...
 
Azure Brain: 4th paradigm, scientific discovery & (really) big data
Azure Brain: 4th paradigm, scientific discovery & (really) big dataAzure Brain: 4th paradigm, scientific discovery & (really) big data
Azure Brain: 4th paradigm, scientific discovery & (really) big data
 
Cloud and grid computing by Leen Blom, Centric
Cloud and grid computing by Leen Blom, CentricCloud and grid computing by Leen Blom, Centric
Cloud and grid computing by Leen Blom, Centric
 
Cloud and Grid Computing
Cloud and Grid ComputingCloud and Grid Computing
Cloud and Grid Computing
 
Corrado -- Establishing the Landscape
Corrado -- Establishing the LandscapeCorrado -- Establishing the Landscape
Corrado -- Establishing the Landscape
 
SGCI - The Science Gateways Community Institute: Going Beyond Borders
SGCI - The Science Gateways Community Institute: Going Beyond BordersSGCI - The Science Gateways Community Institute: Going Beyond Borders
SGCI - The Science Gateways Community Institute: Going Beyond Borders
 

More from Dirk Roorda

General Missives
General MissivesGeneral Missives
General MissivesDirk Roorda
 
Text Display (when it gets tricky)
Text Display (when it gets tricky)Text Display (when it gets tricky)
Text Display (when it gets tricky)Dirk Roorda
 
Quran and Text-Fabric
Quran and Text-FabricQuran and Text-Fabric
Quran and Text-FabricDirk Roorda
 
Ancient corpora analysis
Ancient corpora analysisAncient corpora analysis
Ancient corpora analysisDirk Roorda
 
Verbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsVerbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsDirk Roorda
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchersDirk Roorda
 
Annotating the Hebrew Bible
Annotating the Hebrew BibleAnnotating the Hebrew Bible
Annotating the Hebrew BibleDirk Roorda
 
20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissenDirk Roorda
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleDirk Roorda
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDirk Roorda
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDirk Roorda
 
Hebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsHebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsDirk Roorda
 
Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Dirk Roorda
 
Data Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleData Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleDirk Roorda
 

More from Dirk Roorda (20)

TF-FAIR.pdf
TF-FAIR.pdfTF-FAIR.pdf
TF-FAIR.pdf
 
Textpy
TextpyTextpy
Textpy
 
General Missives
General MissivesGeneral Missives
General Missives
 
Text Display (when it gets tricky)
Text Display (when it gets tricky)Text Display (when it gets tricky)
Text Display (when it gets tricky)
 
Tf in-context
Tf in-contextTf in-context
Tf in-context
 
Quran and Text-Fabric
Quran and Text-FabricQuran and Text-Fabric
Quran and Text-Fabric
 
Ancient corpora analysis
Ancient corpora analysisAncient corpora analysis
Ancient corpora analysis
 
Qdf2tf
Qdf2tfQdf2tf
Qdf2tf
 
Text fabric
Text fabricText fabric
Text fabric
 
Verbal Valency in Hebrew Verbs
Verbal Valency in Hebrew VerbsVerbal Valency in Hebrew Verbs
Verbal Valency in Hebrew Verbs
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchers
 
Annotating the Hebrew Bible
Annotating the Hebrew BibleAnnotating the Hebrew Bible
Annotating the Hebrew Bible
 
20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen20151111 utrecht ver theolbibliothecarissen
20151111 utrecht ver theolbibliothecarissen
 
Text as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew BibleText as Data: processing the Hebrew Bible
Text as Data: processing the Hebrew Bible
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case Study
 
Award
AwardAward
Award
 
Datamanagement for Research: A Case Study
Datamanagement for Research: A Case StudyDatamanagement for Research: A Case Study
Datamanagement for Research: A Case Study
 
Hebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, LessonsHebrew Bible as Data: Laboratory, Sharing, Lessons
Hebrew Bible as Data: Laboratory, Sharing, Lessons
 
Laf fabric-dh benelux2014
Laf fabric-dh benelux2014Laf fabric-dh benelux2014
Laf fabric-dh benelux2014
 
Data Analysis in the Hebrew Bible
Data Analysis in the Hebrew BibleData Analysis in the Hebrew Bible
Data Analysis in the Hebrew Bible
 

Recently uploaded

Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdfssuser54595a
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 

Recently uploaded (20)

Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
18-04-UA_REPORT_MEDIALITERAСY_INDEX-DM_23-1-final-eng.pdf
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

2010 EGITF Amsterdam - Gap between GRID and Humanities

  • 1. Use Cases the gap between Humanities and GRID Dirk Roorda, coordinator infrastructure http://www.dans.knaw.nl
  • 4. first gear: online acces to sources ... for people
  • 7. ESFRI https://documents.egi.eu/public/ShowDocument?docid=12 • EEF report (April 2010) • CESSDA, CLARIN, DARIAH • ESS, SHARE • Common requirements • single sign on • unified virtual organisations • persistent storage (and identifiers) • webservices and workflows
  • 8. represents the linguistic community • relatively strong in computing • many exploitable language tools • research benefits from high performance computing http://www.clarin.eu
  • 9. • initial expectations of GRID • enabling technology for workflows • sustainable tools • sustainable data ..., and all of this will be available on the internet using a service oriented architecture based on secure grid technologies http://www.clarin.eu/the-vision
  • 10. • Turning point There is no off-the-shelf grid/federation technology; much relies on the availability of specialists who know about the details. Much adaptation and configuration work has to be done, which requires a deep understanding of the components. ... it is the interaction and integration that requires lots of efforts. http://www-sk.let.uu.nl/u/D2R-2a.pdf
  • 11. • Current point of view • interested in data grid • for webservices: consider cloud • unclear what will turn out to be a stable technology for the CLARIN RI
  • 12. • humanities, arts and cultural heritage • mission • explore and apply ICT-based methods • link digital source materials of many kinds • exchange expertise across disciplines • key concepts • virtual infrastructure • virtual competence center • VCC1 e-Infrastructure ; VCC3 Scholarly Content http://www.dariah.eu
  • 13. DARIAH on the GRID? • no mention on high performance computing • scarcely mentions grid • programmatic access to data: yes • emphasis on preservation of data • But then there is TextGRID http://www.textgrid.de/
  • 14. • First Periodic Report 2009 • announces infrastructure experiment • showing large computational needs for humanities • indexing unstructured resources in an on-demand and flexible manner • conducted on D4Science http://dariah.eu/wiki/images/e/ea/090426_dest09_interactivegrid.pdf
  • 15. • Umbrella for social science archives • since 1970 • Upgrade in ESFRI context (FP7) • developing: • the data portal to allow seamless access to data holdings across Europe • common authentication and access middleware tools • investigating the potential of grid technologies • improving data harmonisation tools
  • 16. • Use cases contain • data collection • applying harmonisation • collect results and analyse them further • in a collaborative environment • under severe access restrictions • develop new surveys comparable to existing ones
  • 17. In February 2009 • Yes GRID can help • but many approaches • of which maybe none enables all scenarios at the same time http://www.cessda.org/project/doc/D11.1a_Grid-enabling.pdf
  • 18. • May 2009 SWOT analysis • GRID solves several partial problems • current GRID technology not geared to the core of CESSDA’s problems • do realistic case studies on • GRID, Cloud, traditional client server http://www.cessda.org/project/doc/D11.1b_Sustainability_CESSDA_e-Infrastructure.pdf
  • 19. requirements • functional • enable data intensive scholarship • better support for existing hypotheses • new research questions • enriched publications • enable collaboration around sources • incremental enrichment (annotations) • connecting sources (linked data) • connecting people (collaboratories)
  • 20. requirements • “non-functional” • respect access rights • sustainability of data and tools • ease of use and programming • performance
  • 21. characteristics • data from irrepeatable processes • data in many small chunks • complex interrelationships • emphasis on human interpretation • natural language processing • symbolic processing • linking to background knowledge
  • 22. human interpretation • only part of the dataprocessing can be automated • higher level patterns require higher performance computing • researchers need interaction with data and processes • store intermediate results • build virtual collections
  • 23. (virtual) use cases • a use case spells out a concrete task from the perspective of an end user • the humanities present few use cases • where HPC is essential • with “significant” amounts of data • which can be translated into batch jobs • so: virtually no concrete use cases!
  • 24. virtual use cases • a virtual use case • spells out a generic task • from an infrastructural perspective • still meaningful to the end user • virtual use cases • pave the way for real use cases • bridge big gaps between • end users of • “foreign” infrastructure
  • 25. a specific virtual use case • smart indexes on unique resources, e.g. • speech sounds in audio sources • names in hand-written sources • artefacts in images • motifs in literary texts • task • create a comprehensive index • of patterns in all accessible collections • share that index for follow-up research
  • 26. smart indexes: sources • accessible collections <= • sources are readily accessible • programmatically • with minimal latency • with persistent, shareable addresses • access rights are implemented • identities can be passed on to programs • sources can be gathered in virtual collections
  • 27. smart indexes: processing • a smart spider visits a virtual collection • on behalf of a researcher • with his credentials • every addressable element is subjected to a smart pattern analyzer • this lends itself to parallellism • smart pattern detection <= high performance • occurrences are added to the index in the form <pattern, persistent id, fragment id>
  • 28. smart indexes: results • the resulting index is stored in the researcher’s workspace • the researcher may publish his index • for programmatic use • addressed with a new persistent id • new smart virtual collections are possible • third parties write subsequent applications • portals • visualisations
  • 29. conclusions • research projects in the humanities: • “small” projects, repeatedly accessing shared sources • many people annotating the same data • computation becoming more intensive – gradually • many computing steps to be combined in workflows • “small” data, but diverse, and deeply linked • need for sustaining sources and results
  • 30. conclusions • what is the best supporting technology? • if it is GRID then • do not stick to concrete use cases • they do not mobilise the SSH comunity • start supporting virtual use cases • because then you connect technology with a community • and lots of concrete use cases will follow
  • 31. Epilogue nothing is simple and rational except what we ourselves have invented ...

Editor's Notes

  1. is there a gap?When you look at the texts of ESFRI projects CLARIN, DARIAH, CESSDA, there is mention of GRID technology, in positive or even very positive terms.A framework for collaboration, a source of sustainability, a potential for new research.And from the side of the GRID initiatives there have been many invitations to come and talk and layout the use cases.But, lately there has not been much news on the humanities front, and GRID is not mentioned that often, and where are the success stories?So, what is happening? What is the dynamics?
  2. 1925 Aldous Huxley Along the Road Views of Hollandmy love for plane geometry prepared me to feel a special affection for Holland. For the Dutch landscape has all the qualities that make geometry so delightful. ...In the interminable polders, the dykes and ditches intersect another at right angles, a criss-cross of perfect parallels. ...No wonder Descartes preferred the Dutch landscape to any other scene.It’s the rationalist’s paradise. ...One feels that the universe is a simple affair that can be explained in terms of physics and mechanics, that all men are equally endowed with reason and that it is only a question of putting the right arguments before them ... to inaugurate the reign of justice and common sense.
  3. Yes, there is a lot of dynamics indeed.Reckoning from where the humanities and social sciences come, and reckoning from where GRID is coming.ICT can be seen as an engine that powers industry, technology, science and scholarship in general.In the non-technical, interpretative sciences, the data is not produced by instruments in endless streams, but data is found as relics of times long passed, or as expressions of people doing very human things such as thinking, understanding, creating, enjoying, deciding.All this data used to find its first expression on analogue media, such as stone, paper, vinyl disks, audio tapes. Multiplication was expensive, there was much traveling to and from the remaining copies by people that wanted to study them.
  4. But then ICT came along. One by one all types of analogue expressions of sources became subject to digitisation.And digital sources are easy to copy and to distribute.It was a great improvement to have online access to sources, it opened up a wealth of material to individual researchers.We are still in first gear. Not everything has been digitised, and online access is far from streamlined.Yet, part of the enterprise is shifting to second gear.
  5. With so many sources now in the scope of the individual scholar, something has to be done.The traditional way of working: close reading, interpreting, solving puzzles cannot cope with the amount of material available.We need teams or even crowds of people doing small but intelligent things to the sources: making annotations.And then: making editions, beautiful works with a rich set of ideas incorporated in it, around the old and new products of the human mind.It is kind of an Industrial Revolution taking place.This is not a trivial step. The sources have to be findable, referable, metadata must be normalised and added, results must be collected and be made available as resources themselves. Last but not least, access rights must be regulated. The international scale and improved technological accessibility of resources requires us to give the concept of trust a new meaning.This second gear is where we can position the ESFRI projects, especially DARIAH for the humanities and cultural heritage, CLARIN for language resources and tools, and CESSDA for the social sciences.And while these projects are by no means finished, they all are trying to shift to third gear.
  6. There is something in the word GRID that suggests: solidity, rationality, efficiency, straightforwardness.And in a sense the third gear is a straightforward step: leave the paradigm of downloading your sources to your pc, run local programs, upload your results, and publish on the outcomes.If your resources are already out there, in cyberspace, and if, in cyberspace, you can tap unlimited computing resources, why force your processing through the bottleneck of your own pc?This is really compelling given the increasing volume of data, and the second reason is that the algorithms that you want to run require more and more computing power.There is, however, a but: it is not yet practical. For some disciplines, for some tasks, GRID performs well, but for the humanities and the social sciences it just has not materialized, yet.Let us see how DARIAH, CLARIN and CESSDA position themselves with respect to high performance computing.
  7. Within ESFRI there has been a study to find common requirements for RIs, leading to a report in April 2010 of the EEF.The report has consulted from the SSH projects: CESSDA, CLARIN, DARIAH, ESS, SHARE, of which I will highight the first three.Data archiving and curation is a common need for several of the ESFRI projects.The sensitive nature of the data to be stored leads to a need for a fine-grained Authentication and Authorization system, it also is imperative that such a system provides Single Sign On (SSO) functionality. The ability to use grid/cloud compute facilities for the processing of the stored data is also foreseen in some projects. In discussions with these projects they expressed their willingness to work together on pilot projects that can explore key functionality satisfying agreed requirements for identified and engaged users. It is proposed that a series of such joint pilot projects are identified and planned across all the ESFRI sectors. The objective would be that such pilot projects lead to working prototypes that are relevant for a range of user communities and ESFRI projects. Subsequently the results could lead to the introduction of the successful functionality and services in the production systems of European e-infrastructures.
  8. CLARIN is the RI for linguistics. CLARIN is a key example for various reasons.First, many products of the human mind are expressed through language, and there are more than 6000 languages. If a humanities’ research project wants to study phenomena that are mediated through language, they either need human brain power that is conditioned for that language, or they need tools to extract the patterns of interest right through the language screen. Linguists are building resources in the form of lexica, grammars, word-nets on which computational linguists can build tools that make the various languages more transparent to digital research.Secondly, linguistics is a rather exact science among the humanities disciplines, and has built its own track record in computing science. There is real affinity between linguists and computers.Thirdly, the results of computational linguistics lead to technology that is applicable in real life. Applications are a driver for research.
  9. Last, but not least, CLARIN started out with a vision to build on existing HPC technologies, such as GRID.When connecting the world of webservices with the GRID world, something has to be done with authentication and authorisation mechanisms, because they are rather different in both worlds.In fact the problem already occurs when you mix webservices with local tools, if you want single sign on.In an early project CLARIN, together with BIG-GRID and SURFfederation did a pilot project that employed Short-Lived Credential Service as a bridge, connecting the Shibboleth access negotiation with that of the local tools.So it looks as if CLARIN is really moving forward in adapting its infrastructure services to GRID technology.
  10. But then, in Deliverable D2R-2a Federation Foundation for Language Resource and Technology, one can read this passage.In the same report it is stated that the detailed knowledge that is needed to build on the GRID is not available in the client communities, and that it is difficult to get access to that expertise where it can be found, in the computing centres, due to capacity problems.This all handles about just a part of the challenge: implementing authentication and authorisation in a webservices/grid context.
  11. When it comes down to the construction stage, where things get implemented, CLARIN does not see readily available GRID technology that can perform the following things at the same time:maintain large amounts of data securelycreate and maintain workflows of webservices using that dataused by communities of many userswith proper authentication and authorisationIf these things can be realised quicker using cloud technology, even commercially, so be it.
  12. DARIAH is for arts and humanities, and cultural heritage. The mission is to give ICT-based methods more scope, add meaningful links between sources of all (digital) kinds, and exchange expertise across disciplines.It is planned to organise DARIAH as four Virtual Competency Centres (VCCs). Each of the four VCCs is focused on one particular area of expertise: -1- e-Infrastructure,-2- Research and Education Liaison,-3-Scholarly Content,-4- AdvocacyOf these, 1 and 3 are most relevant for GRID computing.
  13. When I look in the documents of DARIAH and search for references to GRID, I do not find high performance computing, but there is mention of the grid, not much though, and programmatic access to data.The emphasis is on preservation of data.Yet there is a real interest in GRID and a real dedication to use it.One of the best testimonies is the TextGRID project and foundation in Göttingen, one of the DARIAH partners, which is itself a joint project between 10 partners.It is funded by the German Federal Ministry of Education and Research (BMBF) for the period starting June 1st 2009 to May 31th 2012.It consists of The TextGrid Repositorya long-term archive for research data in the humanities, embedded in a grid infrastructure, andThe TextGrid Laboratory, a virtual research environment, with a growing set of tools.This can be seen as an excellent testbed of what GRID can do for humanities. Note that TextGRID is not about high performance computing. The GRID is mainly used for its datastoring capabilities.
  14. The Report in 2009 announces two experiments, with TextGRID and D4Science, but to me it was unclear whether these experiments have taken place by now.But there is a report: Infrastructure for Interactivity – Decoupled Systems on the Loose, with many acute observations of the practice in the humanities on the one hand, and the many providing technologies in the grid world on the other hand.I see it as a real attempt to pave the way for humanities’ use cases towards using grid and cloud technologies.FIndings are: repositories still have to be linked to grid infrastructure.Ways to do so are via cloud interfaces on a GRID, or loosely coupled through web technologies (REST based interface).All in all, In the DARIAH deliverables I have not seen something conclusive about using the GRID. I think TextGRID is the most visible step forward.I know that there have many profound discussions about the nature of the technology that should carry the DARIAH research infrastructure.
  15. CESSDA is venerable and old, a liaison of social sciences archives since 1970.Now, in the ESFRI Framework Program 7 it is undertaking an overhaul, basically to give the ICT induced dynamics a better scope.The business of CESSDA is to provide access to large amounts of data on society, and to do meaningful things with that data, such as harmonising the results of different studies, and then analysing the quantities and qualities.One of the subtasks of the preparatory project is to investigate the potential of grid technologies.
  16. The use cases in CESSDA center around the collection of datadeveloping methods to harmonise them,do analytical studies on the resultsIt is important to do that in collaboration with researchers all over Europe,but due to the sensitive nature of portions of the data,the access conditions must be tightly controlled.New scientific survey methods may be developed, and then it is important to be able to compare the new results with those of older surveys, for which lots of data of various sources must be accessed, harmonised, and analysed.
  17. The February 2009 report says that yes GRID is useful, but with certain caveats: GRID might solve partial problems,but it is unclear which precise approach to choose,and whether there is an approach that will satisfy all constraints in the relevant scenarios.Richard Sinnott
  18. The May 2009 analysis of Strengths, Weaknesses, Opportunities and Threats makes it more explicit.While the GRID solves several partial problems (like single sign on),the technology as a whole is not geared to the heart of CESSDA’s problems,and the recommendation is: do realistic case studies and conduct them on various technologies:GRID, Cloud, and traditional client server.Richard Sinnott
  19. A step back.Why are humanities interested in RIs?Let’s summarise the most important requirements, on a very high level, so that we can focus on what is really needed from GRID technology.The highest level is to accomodate a change in scholarly dynamics due to the new possibilities generated by ICT.We have to translate this into implementable requirements.
  20. Before we do that, it is handy to list some (very important) “non-functional” requirements.Access rights may be more complicated here, because the multiple roles that the sources play: they are also exhibited in musea, published by commercial parties, etc.Sustainability is more profound than elsewhere: these sources have often existed for centuries, consulting digital representations of these sources is likely to last for a few centuries more.Writing applications to give beautiful access to sources should not be inordinately difficult due to underlying technology. That technology should be shielded from application programmers working for the humanities. It is this kind of “ease of use” that I have in mind, not the ease of use of the end-user (which is important as well, but that is a concern for the application developer).Performance is important, but not that important. The quest for real high performance is still isolated. But when the need arises, HPC should be “around the corner”, so that applications can be written for every kind of performance without dramatic redesign.
  21. GRID technology is optimised for large volumes of data, for parallelism and many instructions per second.When GRID people look at humanities data, a first reaction might be: out of scope.All files are small, there is no real computation.Yes. And there are other things that do not enter into the considerations: the complex structure of the data, and the many interrelationships.We talk about data that does not make sense without extensive interpretation, symbolic processing, and links with human background knowledge.So people want to work with this data in other ways than with data that flows from scientific instruments.
  22. The humanities are working on an exciting (meta)program of trying to capture as much of human interpretation skills in software as they can.Yet they are far from an unbroken algorithmic chain from data to cognition. So, intermediate results have to be gathered, looked at by humans, reordered, enriched, submitted to other workflows.There is much interaction needed between researchers and their data, so the infrastructure should facilitate that.How can we direct the technology people in such a way that they build the infrastructure that we need?
  23. By means of use cases, but not limited, concrete, end-user use cases, but by virtual use cases.We can learn a lot from CLARIN, DARIAH and CESSDA in terms of generic requirements for research infrastructures.Starting with concrete end users and their concrete jobs is an unnecessary step back.There is an other disadvantage: there are not that many concrete use cases to be found in the humanities.
  24. So let us formulate a use case that incorporates the collective wisdom of what these RIs have yielded so far.That means that such a virtual use case spells tasks that are generic: you might see the task, in different guises, again and again in different research projects.Such use cases highlight pieces of infrastructure that should be there in order to make it work.They should be a foundation on which you can develop concrete use cases, in two senses: (a) concrete use cases that require the infrastructure yielded by virtual use cases as a prerequisite (b) concrete use cases that are instantiations of virtual use cases.Such virtual use cases might cover the wide tract of land between the humanities end users and the existing, “foreign”, GRID infrastructure.
  25. There is nothing that forces a virtual use case to be vague.I want to formulate a very definite, explicit, detail-rich use case, that still is a virtual use case: smart indexes.What is an index? An index is a list of occurrences of patterns in a collection of materials. Indexes are often a first step to gain control over unanalysed material.What is smart? An index of higher-level patterns is smart. In a corpus of spoken language, an index of occurences of emotional speech is a smart index.You need real computing power to identify higher-level patterns. If you have the power, it is efficient to make the upfront investment to identify all possible higher-level patterns, and make them available in an index. This is a very valuable resource for all kinds of further applications and research.With a little bit of imagination you can come up with all kinds of smart indexes, for all kinds of media and data.It is clear that this is a relevant use case: it connects lots of data with lots of computing, and both data and results should be persistent, and it is not a bad idea to take care that indexes can be rebuilt over time!Now let us spell out the virtual use case of creating smart indexes. We shall see that in doing so, we need to introduce various well known pieces of infrastructure.
  26. It all starts with the availability of sources. It will not do that the sources can be found on the web, negotiated, downloaded, converted, normalised.If a project undertakes to provide sources, and does all the necessary steps, it would be nice if the result were from then on easily accessible, at a single location, in a single format, by humans but also by programs, subject to reasonable rights, that can be negotiated without human interaction.It should be possible then to gather resources in virtual collections, in order to make preliminary selections. E.g. select all sources on the right kind of medium.
  27. When the sources are in place, designated as a set of persistent identifiers, brought together in a virtual collection, we can start processing.A spider visits every dataset in the collection, with the credentials of the researcher on whose’s behalf the index is being constructed.Every addressable element in those datasets is analysed by means of a smart pattern analyzer, which identifies the occurrences of patterns and generates fragment identifiers from those occurrences.Probably the identification of a single pattern is computing intensive, so in order to finish the index in reasonable time, it is necessary (and possible) to parallellize the spidering.
  28. The new smart index is experimental work of a researcher. Maybe the pattern recogniser’s settings were not optimal. He might want to redo the indexing.After a while, he is confident that he has a worthwhile index at his disposal. That means, the index is in his private workspace.The next step is that the researcher shares his index. So it should be possible to share items in your public workspace.And, in the case of an index, it is particularly important that the index can be accessed by programs. Programs written by other researchers that have detailed questions to ask to the source material.So the index becomes a new resource in its own right, shared or public, with a persistent identifier.And other people can build portals, visualisations or dedicated applications that make use of this index.
  29. Now let us oversee the specific characteristics of humanities’ data and projects and people:Small projects, repeatedly accessing shared sources.Many people annotating the same data.Computation that gradually becomes more intensive.Many computing steps to be composed into workflows.Small data, but diverse and intricately and meaningfully linked together.Need for sustaining the data and the growing web of links: persistent identification.
  30. The question is not: how can humanities best use the GRID as we know it?But: what is the best supporting technology,and can GRID as we know it provide it?If so, we need a step forward, out of the situation that the humanities are not expressing what they want in terms that the GRID people are used to deal with.We need a language in which humanities express their infrastructural concerns, and that guides the GRID people to build the components that are needed.In the ESFRI projects the humanities and social sciences really do the excercise to think “infrastructural”, now it is time for the GRID people to respond.I think, virtual use cases such as presented here, could bridge the gap in understanding.Whereas the concrete use case connects a scientist with a technician, these virtual use cases connect a community with a technology.Then lots of concrete use cases will follow.
  31. 1925 Aldous Huxley Along the RoadThose were noble and touching dreams. We are soberer now.We have learnt that nothing is simple and rational except what we ourselves have invented, ...that reason is unequally distributed,that instinct is the sole source of action,that prejudice is stronger than argument.EGI here has landed right between the grid-like landscape and the chaotic town of Amsterdambetween the dreams of rationality and the intricacies of human realityMake the best of both worlds!