Creating Metadata for Legacy Research Data
Collaborate, Automate,
Prepare, Prioritize
Stacy Konkiel
IU Libraries
Inna Kouper
Data to Insight Center, IU
Jennifer A. Liss
IU Libraries
Juliet L. Hardesty
IU Libraries
Data & Metadata Management as
“Grand Challenge”
SEAD is funded by the National Science Foundation under Cooperative
Agreement #OCI0940824
SEAD Virtual Archive
(SVA)
-- sustainability science window to multiple IRs
[Diagram: SVA functions (manage, ingest, publish, associate, discover) linked to three institutional repositories: IU ScholarWorks IR, UIUC IDEALS IR, UMich Deep Blue IR]
Investigation
• How can the curation of legacy data be
improved by supplying necessary
metadata?
• How much time and effort is required to
supply domain-specific metadata?
Goals
• Enable discovery of research data
• Communicate experiences with metadata
creation for legacy datasets to the community
• Begin conversation about metadata
practices for legacy data
Methodology
• 20 NCED legacy datasets
• Federal Geographic Data Committee (FGDC)
Content Standard for Digital Geospatial
Metadata
Methodology
• 4 encoders, each
assigned 5 datasets
• Datasets ranged greatly
in size and composition
• 0.01–664 GB
• 1–140,000 files
Methodology
• Phase I
Standalone XML files using basic
NCED-provided information &
“Googleable” facts
• Phase II
Extensive research re: processes by
which datasets were created and used
Findings–Phase I
[Chart: Metadata creation time, Phase I (h:mm), for Datasets 1 to 5, one series per encoder (Encoders 1 to 4); time axis from 0:00 to 4:48]
Findings–Phase I
Successes:
• Supplied many mandatory elements
• Thesauri & Controlled Vocabularies
Challenges:
• Time-intensive startup
• Lacking geospatial information
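The gap between "supplied many mandatory elements" and "lacking geospatial information" can be made visible with a simple completeness check. The following is a minimal sketch, not the encoders' actual tooling: the element paths follow the FGDC CSDGM short names (for example `westbc` for the west bounding coordinate), but the particular subset in `REQUIRED_PATHS`, the helper name `missing_elements`, and the sample record are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Paths relative to the <metadata> root, using CSDGM short names.
# This subset is an illustrative assumption, not the standard's full
# list of mandatory elements.
REQUIRED_PATHS = [
    "idinfo/citation/citeinfo/title",
    "idinfo/descript/abstract",
    "idinfo/keywords/theme/themekey",
    "idinfo/spdom/bounding/westbc",
    "idinfo/spdom/bounding/eastbc",
    "idinfo/spdom/bounding/northbc",
    "idinfo/spdom/bounding/southbc",
]


def missing_elements(xml_text, required=REQUIRED_PATHS):
    """Return the required paths that are absent or empty in the record."""
    root = ET.fromstring(xml_text)
    missing = []
    for path in required:
        node = root.find(path)
        if node is None or not (node.text or "").strip():
            missing.append(path)
    return missing


# Hypothetical record: descriptive metadata is present, spatial domain is not.
SAMPLE_RECORD = """
<metadata>
  <idinfo>
    <citation><citeinfo><title>Hypothetical flume experiment data</title></citeinfo></citation>
    <descript><abstract>Placeholder abstract for illustration.</abstract></descript>
    <keywords><theme><themekt>None</themekt><themekey>sediment transport</themekey></theme></keywords>
  </idinfo>
</metadata>
"""

if __name__ == "__main__":
    for path in missing_elements(SAMPLE_RECORD):
        print("missing:", path)
```

A check like this only reports presence; it is not a substitute for validating against the full FGDC schema.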
Findings–Phase II
Successes:
• Enhanced 10 metadata fields
Challenges:
• Accessing and processing the
datasets (size, complexity)
Observations
Though the information that we found may
enhance opportunities for the discovery of
legacy research data, the available
information was unlikely to be sufficient to
support the tasks of preservation,
reproducibility, and re-use.
Observations
• FGDC is insufficient for dealing with
legacy research data
• Data curators without domain expertise
can be successful in creating some types
of metadata
• Structural and administrative metadata are
difficult to curate without the help of
researchers
Proposal: The CAPP Framework
• Collaborate: subject specialists, librarians, researchers, tool developers
• Automate: file format identification, provenance, native environment, entity extraction
• Prepare: choice of metadata standards, licensing and contact information, instructions / manuals, workflows / software
• Prioritize: labor, datasets, types of metadata, user needs
Collaborate
• Subject specialists
• Librarians
• Researchers
• Tool developers
Automate
• File format identification
• Provenance
• Native environment
• Entity extraction
Prepare
• Choice of metadata standards
• Licensing and contact information
• Instructions and manuals
• Workflows and software
Prioritize
• Labor
• Datasets
• Types of metadata
• User needs
Future Work
Benchmark:
• Effectiveness of tools and workflows
• Collaborations and relationships
• Domains/interdisciplinarity
Thank you!
Jennifer A. Liss
Metadata/Cataloging Librarian
jaliss@indiana.edu
http://sead-data.net


Editor's notes

  1. Research data management has been recognized by many international governmental bodies and their agencies as a grand challenge: JISC, UK Data Archive, Bill & Melinda Gates Foundation, US Department of Energy’s Office of Science. All are struggling with data management, particularly with the increase in born-digital data. Many governmental funding agencies now require that researchers pay heed to metadata, but they don’t explain how researchers should go about it.
  2. This research falls within the context of an NSF-funded DataNet project called SEAD, or “Sustainable Environment, Actionable Data.” My colleagues Inna Kouper and Stacy Konkiel are part of the SEAD research team, which is made up of a larger team of scientists and research data specialists. SEAD is a federation of repositories for sustainability science, which is a highly interdisciplinary field. The SEAD project focuses on the development of tools that enable sustainability scientists to curate and share their data at earlier stages of research as well as “downstream,” after the data have been collected and stored.
  3. The federation leverages existing IR platforms, where data is preserved, and also has value-added services built on its interface so that scientists can easily work with and annotate data, harvest metadata, and use the VIVO social network to connect with other researchers in the field. More information about SEAD is available at its website, sead-data.net. The work presented in our project report focuses on supplying metadata for the ingest of legacy datasets into the SEAD Virtual Archive.
  4. Data repositories and federations are becoming more prevalent, given the success of Dryad and DataONE, among other such projects. We wanted to address the following questions in order to better understand (in absolute, quantifiable terms) what effects metadata has on data management in these spaces: How can the curation of legacy datasets be improved by supplying necessary metadata? How much time and effort is required to create domain-specific metadata?
  5. Our paper reports on quantitative and qualitative metrics of creating domain-specific metadata. In benchmarking the process of enhancing the metadata for legacy datasets, we pursue several goals. First, to make datasets available for effective search and re-use within newer data sharing environments. Second, to advance knowledge among researchers and data professionals about the needs, barriers, and requirements of curating legacy research data. Ultimately, we hope to advance a conversation about efficient metadata creation practices for research data within the broader context of data curation and management.
  6. For this project, we used 20 datasets that are publicly available via the National Center on Earth-surface Dynamics (NCED) repository. Because these datasets originate from the interdisciplinary domain of earth sciences, the choice of a domain-specific metadata standard was not easy. But given that the datasets contained a significant amount of geospatial information, we decided to use the Federal Geographic Data Committee’s (FGDC) Content Standard for Digital Geospatial Metadata.
  7. A team of four librarians and data professionals (or “encoders”) contributed to metadata creation. We all have different backgrounds: I come from a metadata perspective in a traditional library context; Julie from a digital library metadata and usability context; Stacy and Inna from a scientific data context. Each encoder received 5 datasets of varying sizes, ranging from 0.01 to 664 gigabytes and from 1 to ~140,000 files per dataset. A dataset could consist of one or many different file types: text files, spreadsheets, images, applications, zip files.
  8. Metadata encoding was done in two phases. During Phase I, encoders created standalone XML-based metadata files for each dataset using basic information provided by the NCED repository and information available via quick Internet searches. During Phase II, encoders undertook extensive research to find more information about datasets, particularly concerning the processes by which datasets were created and used. Encoders timed all of their encoding activities and logged their experiences in a journal.
  9. Encoding the basic metadata during Phase I required 9 minutes to 4 hours per dataset (average time: 54 minutes). Time dropped significantly after the first dataset was cataloged, reflecting a learning curve.
  10. Successes: The metadata that we were able to encode during Phase I (i.e., the metadata that was easiest to obtain) largely corresponded to the mandatory elements required by the FGDC content standard. Librarians had great success assigning subject terms using controlled vocabularies. Challenges: While Phase I allowed us to collect descriptive metadata, which describes resources for the purposes of discovery and identification, it was very hard to encode spatial information: it was not included in the datasets’ readme files or in the NCED repository metadata, and we lacked the software, and the expertise with such software, needed to work out the geospatial details.
  11. During Phase II, we attempted to supply richer metadata, which included encoding the composition of the complex research objects, as well as encoding relevant technical and preservation information. Providing additional metadata during Phase II required 20 minutes to 1.5 hours per dataset.
  12. Successes: 10 metadata fields were enhanced during Phase II, adding such information as references to grants and funding information, distribution conditions, digital access and transfer information, and citations to related datasets and published articles. Challenges: We were still unable to provide a few key mandatory elements, such as geospatial coordinates, resulting in XML files that did not validate against the FGDC schema.
  13. The FGDC Content Standard for Digital Geospatial Metadata is a powerful tool for representing descriptive, structural, and administrative metadata. In dealing with legacy research data, however, the capabilities of this tool become seriously limited. Unlike other information resources, such as books or images that remain accessible and relatively transparent for preservation and sharing efforts, research data are complex compound objects. Formats, structure, relationships, and provenance become opaque once the data has been created. Our project demonstrates that data curators who are handed legacy research data “as is” can be very effective in creating descriptive metadata – particularly, in conducting subject analysis and assigning keywords based on controlled vocabularies and thesauri. However, identifying structural and administrative metadata for legacy data is extremely difficult.
  14. Our approach, CAPP (Collaborate, Automate, Prepare, Prioritize), is based on the premise that metadata creation or enhancement projects need to rely on a collaborative effort and on a combination of automated and manual labor.
  15. Data managers need to collaborate with subject specialists, researchers, and tool developers to define the requirements of specific data curation projects. Researchers, as data producers and consumers, can contribute to metadata creation by indicating what elements are valuable and for what purposes. Researchers can also supply additional information that can be used in completing metadata records.
  16. Tasks such as file format identification, provenance capture, and entity extraction need to be automated. Existing tools, such as the JSTOR/Harvard Object Validation Environment (JHOVE, http://jhove.sourceforge.net/), MIME Type Detection Utility (mime-util, http://sourceforge.net/projects/mime-util), or Internet Assigned Numbers Authority’s MIME Media Types (IANA, http://www.iana.org/assignments/media-types) can be used to automate identification of technical metadata, including file formats. Tools such as GeoServer (http://geoserver.org/display/GEOS/Welcome) can provide access to specific metadata within certain formats, such as shapefiles, and automate the extraction of bounding coordinates and other geospatial information. Researcher identification registries such as ORCID (http://orcid.org/) may help mitigate some of the challenges of finding up-to-date information about data set contributors that the encoders encountered in Phase I.
  17. Librarians and data managers can contribute to automation by providing system and user requirements, identifying a minimal set of metadata elements, and encouraging other partners to become involved in data sharing initiatives.
  18. At the beginning of a legacy data curation project, data managers may also want to make the decision-making explicit by prioritizing which datasets should be curated and what user needs should guide curation.
  19. In the future, we plan to enhance the CAPP framework by benchmarking other processes of metadata creation, such as the usability and effectiveness of certain tools and workflows, the impact of collaborations on metadata creation, and the effects of domain orientation or interdisciplinarity on the effectiveness and completeness of metadata. At its current early stage, the CAPP framework is a proposition that needs to be developed into a rich research agenda. We hope that our framework will be considered by the Dublin Core community for further development, testing, improvement, and eventual incorporation into the set of best practices for metadata creation.