2. We say this to each other all the
time, but we set up systems for
scholarly advancement and
communication that are the
antithesis of integration
Whole brain data
(20 µm
microscopic MRI)
Mosaic LM
images (1 GB+)
Conventional LM
images
Individual cell
morphologies
EM volumes &
reconstructions
Solved molecular
structures
No single technology serves
these all equally well.
Multiple data types;
multiple scales; multiple
databases
A data integration problem
3. Solving the large problems of
science?
• Observation
• Experimentation
• Modeling
• Cooperative data
intensive science
“An unaided human’s ability to process
large data sets is comparable to a dog’s
ability to do arithmetic, and not much more
valuable.” –Michael Nielsen, Reinventing
Discovery, 2012.
4. Old Model: Single type of content;
single mode of distribution
Scholar
Library
Scholar
Publisher
FORCE11.org: Future of research communications and e-scholarship
6. The duality of modern scholarship
Observation: Those who build information systems from the
machine side don’t understand the requirements of the
human very well
Those who build information systems from the human side,
don’t understand requirements of machines very well
Production of “reusable scholarly artifacts” = usable by humans and machines
Findable, accessible, citable
7. • NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the
neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
NIF has been
surveying,
cataloging and
tracking the
neuroscience
resource
landscape since
< 2008
9. Database
Software Application
Data Analysis Service
Topical Portal
Core Facility
Ontology
Software Resource
Years:
Anita Bandrowski and Burak Ozyurt
Population, Coverage and Linkage of Resource
Registry
10. • Automated text mining is used to look
for “web page last updated” or
copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2
years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years
(~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
What happens to these resources?
The Registry provides a persistent identifier and metadata
record for what once existed but no longer does
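The automated check described on this slide can be sketched roughly as follows. This is a minimal illustration, not NIF's actual text-mining pipeline: the two regular expressions, the function names `latest_year` and `is_stale`, and the two-year threshold parameter are all assumptions made for the example, and real pages would need many more patterns.

```python
import re

# Hypothetical patterns; real resource pages vary far more than this.
LAST_UPDATED = re.compile(
    r"last\s+(?:updated|modified).{0,30}?\b((?:19|20)\d{2})\b",
    re.IGNORECASE,
)
COPYRIGHT = re.compile(
    r"(?:copyright|\(c\)|©)\s*(?:(?:19|20)\d{2}\s*[-–]\s*)?((?:19|20)\d{2})",
    re.IGNORECASE,
)


def latest_year(page_text):
    """Return the most recent year found in an update or copyright notice."""
    years = [
        int(match.group(1))
        for pattern in (LAST_UPDATED, COPYRIGHT)
        for match in pattern.finditer(page_text)
    ]
    return max(years) if years else None


def is_stale(page_text, current_year, max_age_years=2):
    """True if the newest date on the page is older than the threshold.

    Returns None when no date could be identified at all.
    """
    year = latest_year(page_text)
    if year is None:
        return None
    return current_year - year > max_age_years


page = "<p>Copyright 2005-2010 Example Lab</p><p>Last updated: March 3, 2011</p>"
print(latest_year(page), is_stale(page, current_year=2014))
```

A date found this way is only a hint. The manual review figures on the slide suggest that automated detection overestimates how many resources are really out of date.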
11. BD2K: Big Data to Knowledge
• BD2K - a trans-NIH initiative established to enable biomedical research as a
digital research enterprise, to facilitate discovery and support new knowledge,
and to maximize community engagement.
• BD2K aims to develop the new approaches, standards, methods, tools,
software, and competencies that will enhance the use of biomedical Big Data
by:
– Facilitating broad use of biomedical digital assets by making them
discoverable, accessible, and citable
– Conducting research and developing the methods, software, and tools
needed to analyze biomedical Big Data
– Enhancing training in the development and use of methods and tools
necessary for biomedical Big Data science
– Supporting a data ecosystem that accelerates discovery as part of a digital
enterprise
http://bd2k.nih.gov/
13. What resources are available for GRM1?
With the thousands of databases and other information sources
available, simple descriptive metadata will not suffice
14. NIF data federation
NIF was designed to accommodate the multiplicity of heterogeneous and distributed data
resources, providing deep query of the contents and unified views
250 sources
> 800 M records
15. What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for reanalysis, e.g.,
microarray data sets from GEO;
brain images from XNAT;
microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through
data processing and sometimes
normalization, e.g., brain structure
volumes (IBVD), gene expression
levels (Allen Brain Atlas); brain
connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the
meaning of data
• E.g., gene
upregulation/downregulation,
brain activation as a function of
task
• Registries:
– Metadata
– Pointers to data sets or
materials stored elsewhere
• Data aggregators
– Aggregate data of the same
type from multiple sources,
e.g., Cell Image Library,
SUMSdb, Brede
• Single source
– Data acquired within a single
context, e.g., Allen Brain Atlas
Researchers are producing a variety of
information artifacts using a multitude of
technologies
18. Making it easier to access and understand
distributed databases
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases
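One way to picture a unified view over resources that each implement a different model is a thin adapter per source. The sketch below is only an illustration of that idea: the two in-memory sources, their field names, the `UnifiedRecord` type and the `search` function are all hypothetical and do not describe how the NIF federation is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnifiedRecord:
    """Common view shared by every source in this toy federation."""
    source: str
    gene: str
    description: str


# Two hypothetical sources, each with its own native schema.
SOURCE_A = [{"symbol": "GRM1", "note": "expression record"}]
SOURCE_B = [
    {"gene_name": "grm1", "summary": "antibody record"},
    {"gene_name": "gad1", "summary": "antibody record"},
]


def from_source_a(row):
    # Source A already uses upper-case gene symbols.
    return UnifiedRecord("source_a", row["symbol"].upper(), row["note"])


def from_source_b(row):
    # Source B uses a different field name and lower-case symbols.
    return UnifiedRecord("source_b", row["gene_name"].upper(), row["summary"])


FEDERATION = [(SOURCE_A, from_source_a), (SOURCE_B, from_source_b)]


def search(gene):
    """Query every source through its adapter and return unified records."""
    wanted = gene.upper()
    return [
        record
        for rows, adapt in FEDERATION
        for record in map(adapt, rows)
        if record.gene == wanted
    ]


for hit in search("grm1"):
    print(hit.source, hit.gene, hit.description)
```

The user asks one question in one vocabulary, and each adapter absorbs the differences of its source's model.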
19. Current challenge: With so much
available, how do I find what I need?
• “What genes are upregulated by
chronic morphine?”
– It depends
• Most often use cases require
connecting a researcher to
relevant data sets and
appropriate tools
– Depending upon the data and tools,
the answers may differ
• Many databases have tool bases
and workflows that they support
• Much value has been added to
individual data sets, which we can
tap only if we can connect to them
22. SciCrunch: A “social network” for
resources
• NIF is a general search
engine across all of
neuroscience (biomedicine)
• Very powerful for discovery
and general browsing
• Can perform analytics across
the spectrum of biomedical
resources
• Many communities want to
create more focused portals
• Specialized for their domain
• Restrict the particular sources
• Organize the data according
to their needs
• Use their own branding
• How do we create a system
that satisfies community
needs without creating
another silo?
28. What is an effective information
framework for neuroscience?
Knowledge in space and spatial relationships
(the “where”)
Knowledge in words, terminologies and
logical relationships (the “what”)
30. What can ontology do for us?
• Express neuroscience concepts in a way that is machine readable
– Unique identifier
– Synonyms, lexical variants
– Definitions
• Provide means of disambiguation of strings
– Nucleus part of cell; nucleus part of brain; nucleus part of atom
– Each of these concepts has a unique identifier that distinguishes them
• Properties
– Support reasoning
• Provide universals for navigating across different data sources
– Semantic “index”
– Link data through relationships not just one-to-one mappings
• Provide the basis for concept-based queries to probe and mine data
• Establish a semantic framework for landscape analysis
• Deep data integration for some types of knowledge
Mathematics, Computer code or Esperanto
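The disambiguation point can be made concrete with a toy lookup table. This is a sketch under stated assumptions: the `EX:` identifiers, the `Concept` fields and the `part_of` values are invented for the example and are not identifiers from NIFSTD or any real ontology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    identifier: str  # hypothetical ID, not a real ontology identifier
    label: str
    part_of: str
    synonyms: tuple = ()


CONCEPTS = [
    Concept("EX:0001", "nucleus", "cell", ("cell nucleus",)),
    Concept("EX:0002", "nucleus", "brain", ("brain nucleus",)),
    Concept("EX:0003", "nucleus", "atom", ("atomic nucleus",)),
]


def build_index(concepts):
    """Map every label and synonym (lower-cased) to the concepts using it."""
    index = defaultdict(list)
    for concept in concepts:
        for name in (concept.label, *concept.synonyms):
            index[name.lower()].append(concept)
    return index


INDEX = build_index(CONCEPTS)


def candidates(string):
    """All identifiers a bare string could refer to."""
    return [c.identifier for c in INDEX.get(string.strip().lower(), [])]


def resolve(string, part_of):
    """Use one property (part_of) to pick a single concept, if possible."""
    matches = [
        c for c in INDEX.get(string.strip().lower(), []) if c.part_of == part_of
    ]
    return matches[0].identifier if len(matches) == 1 else None


print(candidates("Nucleus"))
print(resolve("nucleus", part_of="brain"))
```

The string alone is ambiguous. The identifier plus one property is enough to tell the three meanings apart.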
31. The scourge of neuroanatomical nomenclature
•NIF Connectivity: 7 databases containing connectivity primary data or claims
from literature on connectivity between brain regions
•Brain Architecture Management System (rodent)
•Temporal-lobe.com (rodent)
•Connectome Wiki (human)
•Brain Maps (various)
•CoCoMac (primate cortex)
•UCLA Multimodal database (Human fMRI)
•Avian Brain Connectivity Database (Bird)
•Total: 1800 unique brain terms (excluding Avian)
•Number of exact terms used in > 1 database: 42
•Number that map to the same identifier, i.e., synonyms: 99
•Number of 1st order partonomy matches: 385
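The kind of comparison behind these counts can be sketched with plain set operations. The three toy databases and the synonym table below are hypothetical and unrelated to the real numbers on the slide. They only show why matching on identifiers finds more overlap than matching on exact strings.

```python
from collections import Counter

# Hypothetical term lists for three connectivity databases.
DATABASE_TERMS = {
    "db_a": {"cerebellum", "caudate nucleus", "CA1"},
    "db_b": {"cerebellum", "nucleus caudatus", "field CA1"},
    "db_c": {"caudate nucleus", "hippocampal CA1"},
}

# Hypothetical synonym table: lower-cased term -> made-up identifier.
TERM_TO_ID = {
    "cerebellum": "EX:10",
    "caudate nucleus": "EX:20",
    "nucleus caudatus": "EX:20",
    "ca1": "EX:30",
    "field ca1": "EX:30",
    "hippocampal ca1": "EX:30",
}


def shared_exact_terms(db_terms):
    """Terms whose exact string appears in more than one database."""
    counts = Counter(term for terms in db_terms.values() for term in terms)
    return {term for term, n in counts.items() if n > 1}


def shared_identifiers(db_terms, term_to_id):
    """Identifiers reached from more than one database, via any synonym."""
    counts = Counter()
    for terms in db_terms.values():
        ids = {term_to_id[t.lower()] for t in terms if t.lower() in term_to_id}
        counts.update(ids)  # each database counts once per identifier
    return {identifier for identifier, n in counts.items() if n > 1}


print(sorted(shared_exact_terms(DATABASE_TERMS)))
print(sorted(shared_identifiers(DATABASE_TERMS, TERM_TO_ID)))
```

In this toy data, the CA1 concept is shared by all three databases but never under the same string, so only the identifier-based comparison finds it.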
32. : C
Neurolex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
This is your brain on
computers
33. Looking across the ecosystem: Where are the data?
Data Sources
Bringing knowledge to data: Gap analysis
35. How much information makes it into
the data space?
∞
What is easily machine
processable and accessible
What is potentially knowable
What is known:
Literature, images, human
knowledge
Unstructured; Natural
language processing,
entity recognition,
image processing and
analysis; paywalls; file
drawers
Abstracts vs full
text vs tables etc
Estimates suggest that more than 50% of scientific output is not recovered
Chan et al. Lancet, 383, 2014
36. The tale of the tail
“Human neuroimaging typically is performed on a whole brain basis.
However, for several reasons tail of the caudate activity can easily be missed.
•One reason is limitations in the normalization algorithms, that typically are
optimized to maximize accuracy for cortical rather than subcortical
structures. ...
•A second reason is that standard neuroimaging atlases such as the Harvard-
Oxford structural atlas used with neuroimaging analysis programs such as
FreeSurfer truncate the caudate at the body, and completely exclude the
tail...
•A final reason is that the tail of the caudate is close to the hippocampus, and
could be misidentified as such especially in tasks involving learning and
memory.
Therefore, the tail of the caudate may be recruited in additional cognitive
tasks, but yet not have been properly identified and reported in the
neuroimaging literature”
Seger CA. The visual corticostriatal loop through the tail of the caudate: circuitry and function. Front
Syst Neurosci. 2013 Dec 6;7:104. doi: 10.3389/fnsys.2013.00104. eCollection 2013.
38. Data-Knowledge Mismatch
Dutkowski et al., 2013:
Nature Biotechnology
A major impediment
for researchers using
ontology identifiers
is the perception
that ontologies
require a consensus
on definition of
terms
By matching
assertions about
biological entities
to data, we can
test both our
knowledge and
our data
39. The Monarch Initiative
•Genotype-Phenotype
comparison engine
•Integrates large amounts
of genotype-phenotype
data
•Semantic similarity
analytics
•Human disease
Animal model
Monarchinitiative.org
Melissa Haendel, OHSU
Chris Mungall, LBL
42. SO ALL I AM IS A NUMBER?
The power of unique and persistent identifiers
43.
44. What studies used my monoclonal mouse antibody
against actin in humans?
“The following antibodies were used for immunoblotting: -actin
mAb (1:10,000 dilution, Sigma-Aldrich)…”
Papers are
currently poor at
identifying the
simplest part of
the paper, the
materials used
45. Pilot Project
• Authors to identify 3 types of
research resources:
– Software/databases
– Antibodies
– Model organisms
• Include unique identifier = RRID
in methods section
• Voluntary for authors
• Journals did not have to modify
their submission system
Launched February 2014: 3 month commitment and more…
Two simple questions:
Could authors do it?
Would authors do it?
46. Resource IDs from NIF aggregated databases
•A single portal for
authors
•>10 authoritative
databases
•One search interface
•Simple directions
•Prominent “Cite
This” button
•Help desk
RII Portal
http://scicrunch.org/resources
Initiative was possible because of
the massive registries available
and aggregation services of
NIF/SciCrunch
47. RRIDs in the wild!
• >300 articles
have appeared to
date
• 47 journals
• 800+ RRIDs
• 96% correct!
Database available at: https://www.force11.org/node/5635
Authors can and will
adopt new citation
styles for research
resources
48. Increased identifiability of resources after the
Resource Identification Initiative Pilot
Update of Vasilevsky et al, PeerJ, 2013
49. What can we do with an RRID?
• A resolver
service has
been created
• 3rd party tools
are being
created to
provide linkage
between
resources and
papers
http://scicrunch.com/resolver/RRID:AB_90755
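A third-party linking tool of the kind mentioned here might start by pulling RRIDs out of a methods section and turning each into a resolver link. The sketch below assumes a simplified RRID shape (letters, underscore, alphanumerics), which will not cover every real RRID. The base URL is copied from the example on this slide, and the function names and the `SCR_000001` identifier are made up for illustration. No network request is made.

```python
import re

# Assumed, simplified shape of an RRID such as "RRID:AB_90755".
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+_[A-Za-z0-9]+)")

# Base URL taken from the resolver example on the slide.
RESOLVER_BASE = "http://scicrunch.com/resolver/"


def find_rrids(text):
    """Return the RRIDs mentioned in text, in order, without duplicates."""
    return list(dict.fromkeys(RRID_PATTERN.findall(text)))


def resolver_url(rrid):
    """Build a resolver link for one identifier (string formatting only)."""
    return f"{RESOLVER_BASE}RRID:{rrid}"


methods = (
    "Blots were probed with antibody X (RRID:AB_90755) and analyzed with "
    "software Y (RRID:SCR_000001). Antibody X (RRID:AB_90755) was diluted."
)

for rrid in find_rrids(methods):
    print(rrid, resolver_url(rrid))
```

Because the identifier is a fixed, machine-recognizable string, the same few lines work on any paper that follows the citation style, which free-text reagent descriptions do not allow.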
50. “Alerting” service
• Teaming with
Hypothes.is and
ORCID to
develop
annotation tools
for RRIDs,
including
“alerts” on
reagents and
tools
51. Hypothes.is is a tool for creating and
sharing annotations on web pages
http://hypothes.is
52. Article
Code
Blogs
Workflows
Data
Portals
Unique and persistent identifiers and a system for
referencing them allow an ecosystem to function
An ecosystem for research objects: the social network of
research resources
[Diagram: articles, data, code, blogs, workflows, portals and search engines, each linked by IDs]
53. WHAT CAN WE DO NOW?
Lessons learned from my career
54. Share your data and share it
effectively
• Discoverability
– Data can be found
• Accessibility
– Data can be accessed and
access rights are clear
– Links to data are stable
• Assessability
– The reliability of the data can
be determined
• Understandability
– The data can be understood
• Usability
– The data are in a usable form
• Publishing data on your
website or as
supplemental material is
not the best way to make
it available
55. What about my data?
•Best practice:
•Put it in a repository
•What repository?
•Community repository for
your data type, e.g.,
NITRC, GEO
•General repository:
•Dryad
•FigShare
•NIH Data Commons
•Institutional repository
•Research libraries are
setting up repositories to
manage their “digital
assets”
NIF can help you find a place for your data
56. Make sure you and your scholarly outputs
can be linked
A distributed system like the biomedical data ecosystem runs
on the ability to uniquely identify relevant entities
ORCID ID: Unique researcher
identifier
Editors, authors: participate in
the Resource Identification
Initiative
“Sound, reproducible scholarship rests upon a
foundation of robust, accessible data. Data should be
considered legitimate, citable products of research. Data
citation, like the citation of other evidence and sources,
is good research practice.”
-Joint Declaration of Data Citation
Principles http://www.force11.org/datacitation
Coming soon: Formal
standards for citing data sets
57. Future of Research Communications
and e-Scholarship (FORCE11.org)
http://force11.org
Join FORCE11!
58. NIF team (past and present)
Jeff Grethe, UCSD, Co-PI
Amarnath Gupta, UCSD,
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer
(retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11
59.
60. The
Encyclopedia
of Life
A…
Access to data has
changed over the
years
Tim Berners-Lee: Web of data
Wikipedia defines Linked Data as “a
term used to describe a
recommended best practice for
exposing, sharing, and connecting
pieces of data, information, and
knowledge on the Semantic Web
using URIs and RDF.”
http://linkeddata.org/
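The "URIs and RDF" idea in the definition above comes down to statements of the form subject, predicate, object, where each part is a URI. The sketch below models such triples as plain Python tuples rather than using an RDF library. The `http://example.org/` namespace, the vocabulary terms and the `match` helper are all hypothetical.

```python
EX = "http://example.org/"  # hypothetical namespace for illustration

ABOUT = EX + "vocab/about"
CITES = EX + "vocab/cites"
GRM1 = EX + "gene/GRM1"

# Each triple is (subject URI, predicate URI, object URI).
TRIPLES = [
    (EX + "dataset/42", ABOUT, GRM1),
    (EX + "paper/7", CITES, EX + "dataset/42"),
    (EX + "paper/7", ABOUT, GRM1),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


# Everything that is "about" GRM1, regardless of what kind of thing it is.
for subject, _, _ in match(TRIPLES, predicate=ABOUT, obj=GRM1):
    print(subject)
```

Because papers, data sets and genes all share one identifier scheme, a single query can cross what would otherwise be separate databases.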
Genbank
PDB
“Whichever technology wins broad adoption will become, by
default, the data web. That’s why we don’t need to know
which technological vision of the data web will win to conclude
that the data web is inevitable” –Michael Nielsen
61. “Empty Archives”
| Repository | Type of data | Date started | Host | Public data | Comments |
|---|---|---|---|---|---|
| CARMEN | neuroscience / electrophysiology | 2008 | Newcastle University; United Kingdom | 100 | Requires account |
| INCF Dataspace | various | 2012 | International Neuroinformatics Coordinating Facility | ? | |
| Open Source Brain | models | 2014 | University College London | 47 | Cells and Networks; 23 (Technology showcases) |
| XNAT Central | Neuroimaging | 2010 | Washington University School of Medicine in St. Louis; Missouri; USA | 34 | States 370 projects, 3804 subjects, and 5172 imaging sessions. 123 were visible but do not all appear to be public. 34 public data were listed under “Recent” |
| Open Connectome | Serial electron microscopy and magnetic resonance (graphs) | 2011 | Johns Hopkins University; Maryland; USA | 9 | 9, 7 - image projects; 19 - graphs |
| UCSF DataShare | biomedical including neuroimaging, MRI, cognitive impairment, dementia, aging | 2011 | University of California at San Francisco; California; USA | 15 | |
| BrainLiner | various functional data | 2011 | ATR; Kyoto; Japan | 10 | |
| ModelDB | neuron models | 1996 | Yale University; Connecticut; USA | 875 | |
| NeuroMorpho | digitally reconstructed neurons | 2006 | George Mason University; Virginia; USA | 10004 | |
| Cell Image Library/Cell Centered Database | images, videos, and animations of cell | 2002 (CCDB); 2010 (CIL) | American Society for Cell Biology / University of California at San Diego; California; USA | 10,360 | The CCDB had 450 data sets when it merged with CIL. CIL also contains large imaging data sets that are not counted as separate images |
| CRCNS | computational neuroscience datasets | 2008 | University of California at Berkeley; California; USA | 38 | |
| OpenfMRI | fMRI | 2012 | University of Texas at Austin; Texas; USA | 22 | |
“I finally gave NeuroMorpho my data so they would stop
63. Make your data machine-actionable
Van De Werd HJ, Uylings HB. Brain Struct Funct. 2014 Mar;219(2):433-59. doi:
10.1007/s00429-013-0630-
64. Use RRIDs in your papers,
databases and journals!
• Antibody and
model
organism
databases
are adopting
65. NIF Information Framework: Query and alignment
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene
Ontology, ChEBI, Protein Ontology
• Available as services through NIF and BioPortal
NIFSTD
Organism
NS Function Molecule Investigation
Subcellular
structure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction Quality
Anatomical
Structure
NIF uses ontologies to enhance search
and discovery but is not constrained by
them