This presentation discusses methodologies for sharing long-tail data and the lessons learned. It notes that globally unique, persistent identifiers (PIDs) are important for identifying the same entity across contexts. Standards like MINI and common data elements (CDEs) help make data findable, accessible, interoperable, and reusable (FAIR). The Neuroscience Information Framework (NIF) aggregates community ontologies and searches more than 200 data sources to organize information. The lessons learned are that data belongs in repositories, not on personal servers; that people are key to these efforts; and that resources should be comprehensive and support one another to advance open data sharing.
Martone grethe
1. Methodologies for Long-Tail Data Sharing: What Have We Learned?
Maryann E. Martone, Ph.D.
University of California, San Diego
and
Hypothesis
Jeffrey S. Grethe, Ph.D.
University of California, San Diego
2. [Figure: resource types (Database, Software Application, Data Analysis Service, Topical Portal, Core Facility, Ontology, Software Resource) by year added]
NIF is an initiative of the NIH Blueprint consortium of institutes
– NIF has been tracking and cataloging the biomedical resource landscape since 2008
3. The current “Addictome”
NIF searches across:
• Resource Registry (13,000+)
• > 200 deeply integrated data sources (>800 million records)
• literature
Query: Addiction
4. e-Science Ecosystem
The digital world runs on globally unique and persistent identifiers (PIDs); PIDs serve as a “key” for identifying the same entity across different contexts.
eScience goal: Make data Findable, Accessible, Interoperable, Re-usable (FAIR) for both human and machine
[Diagram labels: People, Research resources, Data, Concepts, Protocols, Non-digital, Translation; ORCID, RRID, DOI, PID; Ontology, Metadata standards, Minimal Information Models, CDE; Aggregator; Repositories and Registries (e.g. NIF, Monarch); NIH Data Discovery Index]
5. Resource Identification Initiative: Supplying unique identifiers for key research resources
“The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich)…”
VS
“The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich, RRID:AB_262137)…”
https://scicrunch.org/resolver/RRID:AB_262137
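The contrast above is what makes a citation machine-actionable: once the RRID is in the text, software can find it and turn it into a resolver link. The following is a minimal sketch of that idea, not the initiative's own tooling. The regular expression is an assumed simplification of the RRID syntax (a letter prefix, an underscore, then an alphanumeric accession), and the function names are hypothetical; only the resolver base URL comes from the slide.

```python
import re

# Assumption: an RRID looks like "RRID:" + letter prefix + "_" + alphanumeric accession.
# Real RRIDs may use other shapes; this pattern is illustrative only.
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")

# Resolver base URL as shown on the slide.
RESOLVER_BASE = "https://scicrunch.org/resolver/"


def find_rrids(text):
    """Return every RRID found in a passage of methods text."""
    return ["RRID:" + accession for accession in RRID_PATTERN.findall(text)]


def resolver_url(rrid):
    """Build the resolver link for a single RRID."""
    return RESOLVER_BASE + rrid


without_rrid = ("The following antibodies were used for immunoblotting: "
                "-actin mAb (1:10,000 dilution, Sigma-Aldrich)")
with_rrid = ("The following antibodies were used for immunoblotting: "
             "-actin mAb (1:10,000 dilution, Sigma-Aldrich, RRID:AB_262137)")

for passage in (without_rrid, with_rrid):
    links = [resolver_url(r) for r in find_rrids(passage)]
    print(links if links else "no resolvable identifier")
```

Only the second passage yields a link; the first gives a program nothing to resolve, which is the point of the comparison.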
8. Value Sets
The set of possible values or responses. A Value Set often includes concepts from established Vocabularies, Ontologies or Data Standards. A value set may also include a range of permissible values and indicate the required units. For a survey question, the value set may be a list of possible responses.
http://neurolex.org/wiki/Category:Hippocampus_CA1_pyramidal_cell
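The definition above can be sketched as a small data structure: a value set is either an enumerated list of responses or a numeric range with required units. This is a hedged illustration only; the `ValueSet` class, its field names, and both example value sets (a survey response list and an age range in years) are hypothetical and do not come from any CDE standard.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class ValueSet:
    """Hypothetical value set: enumerated responses, or a numeric range with units."""
    permitted: Optional[FrozenSet[str]] = None        # list of possible responses
    value_range: Optional[Tuple[float, float]] = None  # inclusive (low, high)
    units: Optional[str] = None                        # required units, if any

    def accepts(self, value, units=None):
        # Reject values reported in the wrong (or missing) units.
        if self.units is not None and units != self.units:
            return False
        # Enumerated value set: the value must be one of the listed responses.
        if self.permitted is not None:
            return value in self.permitted
        # Range value set: the value must be a number inside the range.
        if self.value_range is not None:
            low, high = self.value_range
            return isinstance(value, (int, float)) and low <= value <= high
        return True


# Illustrative value sets (assumptions, not taken from the slides).
frequency = ValueSet(permitted=frozenset({"never", "sometimes", "often"}))
age = ValueSet(value_range=(0, 120), units="years")

print(frequency.accepts("often"))        # listed response
print(frequency.accepts("daily"))        # not in the list
print(age.accepts(34, units="years"))    # in range, correct units
print(age.accepts(34, units="months"))   # wrong units
```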
9. Neuroscience Information Framework
“a tool for analyzing and structuring information”
“a reduction in uncertainty”
• Ontologies are the major way that NIF searches for and organizes information
• Aggregate of community ontologies, e.g., Gene Ontology, ChEBI, Protein Ontology
• Still significant gaps for behavioral and physiological concepts and techniques
• Available as services through NIF so they can be built into applications
[Diagram: NIFSTD modules: Organism, Molecule, Macromolecule, Gene, Molecule Descriptors, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure, NS Function, Subcellular structure, Investigation, Protocols, Reagent, Techniques]
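One common way an ontology can drive search is query expansion: a query term is widened to include its synonyms and its more specific child terms before the data sources are searched. The sketch below illustrates that general technique under stated assumptions; it is not NIF's implementation and does not call NIF's services. The toy ontology, its synonyms, and the `expand_query` function are hypothetical.

```python
# Toy ontology (illustrative only): each term lists synonyms and narrower child terms.
TOY_ONTOLOGY = {
    "neuron": {
        "synonyms": ["nerve cell"],
        "children": ["pyramidal cell"],
    },
    "pyramidal cell": {
        "synonyms": ["pyramidal neuron"],
        "children": ["hippocampus ca1 pyramidal cell"],
    },
    "hippocampus ca1 pyramidal cell": {
        "synonyms": ["ca1 pyramidal neuron"],
        "children": [],
    },
}


def expand_query(term, ontology):
    """Return the term plus synonyms and all descendant terms (with their synonyms)."""
    start = term.lower()
    if start not in ontology:
        return [start]  # unknown term: search for it unchanged
    expanded, stack, seen = [], [start], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles or repeated children
        seen.add(current)
        entry = ontology.get(current, {})
        expanded.append(current)
        expanded.extend(entry.get("synonyms", []))
        stack.extend(entry.get("children", []))
    return expanded


print(expand_query("Pyramidal cell", TOY_ONTOLOGY))
```

A search for "pyramidal cell" then also matches records annotated with the CA1-specific term, which a plain keyword search would miss.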
11. What have we learned?
• The landscape is vibrant, dynamic and growing, but also littered with abandoned and unrealized projects
• Data belongs in a data repository, not on your lab server
• People are important in this endeavor: Leaders, curators, community engagement specialists
• Data and ontology resources become interesting when they are comprehensive: populate!!!
• Assume that you will be resource limited and plan accordingly: time, money, personnel
• Cost-benefit analysis; what to do now vs later
• Technology will improve
• Don’t start from square one: resources exist to help; help support them
13. Dimensions of FAIR data sharing
• Discoverability
– Data can be found
– Data set has an identifier and links are stable
• Accessibility
– Data can be accessed programmatically
– Access rights are clear
• Assessability
– Provenance is known
– Reliability can be determined
• Understandability
– The data can be understood
• Usability
– The data are actionable
– Data are not in a proprietary format
Goodman, A. et al. Ten simple rules for the care and feeding of scientific data. PLoS Comput Biol 10, e1003542, doi:10.1371/journal.pcbi.1003542 (2014)
Science as an open enterprise, Royal Society: https://royalsociety.org/policy/projects/science-public-enterprise/Report/
14. FORCE11: Future of Research Communications and e-Scholarship
• Resource Identification Initiative: https://www.force11.org/group/resource-identification-initiative
• FAIR Data Guiding principles: https://www.force11.org/group/fairgroup/fairprinciples
• Data Citation Principles: https://www.force11.org/group/joint-declaration-data-citation-principles-final
• On creating machine-readable data citations: https://peerj.com/articles/cs-1/
• 10 Simple rules for design, provision, and reuse of persistent identifiers for life science data: https://zenodo.org/record/18003#.VeOxxLQjvyA
FORCE11.org: Grass roots organization dedicated to transforming scholarship through
Figure X: Resource types and year added to the registry. Each research resource is tagged with one or more resource types; the most common are represented in this graph (for all data see http://neurolex.org/wiki/Resource_Type_Hierarchy). The color denotes the year that a resource was added to the registry; note that 2009 and earlier data are lumped into 2010.