The Neuroscience Information Framework (NIF) is an initiative of the NIH Blueprint to maximize access to and utility of worldwide neuroscience research resources. NIF catalogs over 10,000 resources including databases, literature, and materials. It provides search capabilities across these resources and develops ontologies and semantic frameworks to integrate diverse data types and scales. NIF aims to make dispersed neuroscience information more findable, accessible, interoperable, and reusable to enable new insights.
2. We say this to each other all the time,
but we set up systems for scholarly
advancement and communication that
are the antithesis of integration
Whole brain data
(20 µm
microscopic MRI)
Mosaic LM
images (1 GB+)
Conventional LM
images
Individual cell
morphologies
EM volumes &
reconstructions
Solved molecular
structures
No single technology serves
these all equally well.
Multiple data types;
multiple scales; multiple
databases
A data integration problem
3. • NIF is an initiative of the NIH Blueprint consortium of institutes
– What types of resources (data, tools, materials, services) are available to the
neuroscience community?
– How many are there?
– What domains do they cover? What domains do they not cover?
– Where are they?
• Web sites
• Databases
• Literature
• Supplementary material
• PDF files
• Desk drawers
– Who uses them?
– Who creates them?
– How can we find them?
– How can we make them better in the future?
http://neuinfo.org
4. How many resources
are there?
•NIF Registry: A catalog of
neuroscience-relevant
resources
•> 10,000 currently
listed
•> 2500 databases
•And we are finding more
every day
June 10, 2013
5. But we have Google!
• Current web is
designed to share
documents
– Documents are
unstructured data
• Much of the content
of digital resources is
part of the “hidden
web”
• Wikipedia: The Deep Web
(also called Deepnet, the
invisible Web, DarkNet,
Undernet or the hidden
Web) refers to World
Wide Web content that is
not part of the Surface
Web, which is indexed by
standard search engines.
6. Which databases do you use?
• Mouse Genome Database
• Allen Brain Atlas
• ClinicalTrials.gov
• PubMed
• dbGaP
• GEO
• NIH RePORTER
• OMIM
• BioNumbers
– a database of numerical values extracted from literature
• Epigenomics
– human epigenomic data to catalyze basic biology and disease-oriented research
• Antibody Registry
– 2M antibodies
• BioGRID
– an interaction repository of protein and genetic interactions
Most resources are largely unknown and underutilized
7. NIF: A New Type of Entity for New Modes of
Scientific Dissemination
• NIF’s mission is to maximize the awareness of, access to
and utility of research resources produced worldwide to
enable better science and promote efficient use
– NIF unites neuroscience information without respect to domain,
funding agency, institute or community
– NIF is like a “PubMed” for all biomedical resources and a “PubMed Central” for databases
– Makes them searchable from a single interface
– Practical and cost-effective; tries to be sensible
– Learned a lot about effective data sharing
8. How do resources get added to the NIF?
• NIF curators
• Nomination by the community
• Semi-automated text mining pipelines
NIF Registry
• Requires no special skills
• Site map available for local hosting
NIF Data Federation
• DISCO interop
• Requires some programming skill
• Open Source Brain < 2 hr
Two-tiered system: low barrier to entry
9. NIF searches across 3 main indices: Registry, Federation and
Literature
Data Federation:
200 databases/400M
records
Registry: 6300
resources
(2500 databases)
Literature: 22 million
articles
10. What resources are available for GRM1?
With the thousands of databases and other information sources
available, simple descriptive metadata will not suffice
11. NIF makes it easier to browse different databases
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn”
Query expansion: Synonyms and related concepts
Boolean queries
Data sources
categorized by
“data type” and
level of nervous
system
Common views
across multiple
sources
Tutorials for using
full resource when
getting there from
NIF
Link back to
record in
original source
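The synonym expansion shown on this slide can be sketched in a few lines. This is only an illustration, not NIF's implementation: the `SYNONYMS` table and the `expand_query` and `quote` helpers are hypothetical names, and a real system would draw its synonyms from an ontology rather than a hard-coded dictionary.

```python
# Hypothetical synonym table; a real system would load these from an ontology.
SYNONYMS = {
    "hippocampus": ["Cornu Ammonis", "Ammon's horn"],
}


def quote(term):
    """Wrap multi-word terms in double quotes so they are matched as phrases."""
    return f'"{term}"' if " " in term else term


def expand_query(term, synonyms=SYNONYMS):
    """Return the term OR-ed with its known synonyms; unknown terms pass through."""
    alternatives = [term] + synonyms.get(term.lower(), [])
    return " OR ".join(quote(t) for t in alternatives)


print(expand_query("Hippocampus"))
```

A term with no entry in the table, such as a gene symbol, is returned unchanged, so the expansion never narrows a search.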
12. Making it easier to access and understand
distributed databases
Each resource implements a different, though related model;
systems are complex and difficult to learn, in many cases
13. NIF Semantic Framework: NIFSTD ontology
• NIF covers multiple structural scales and domains of relevance to neuroscience
• Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, ChEBI, Protein Ontology
NIFSTD modules (diagram labels): Organism, NS Function, Molecule, Investigation, Subcellular structure, Macromolecule, Gene, Molecule Descriptors, Techniques, Reagent, Protocols, Cell, Resource, Instrument, Dysfunction, Quality, Anatomical Structure
NIF capitalizes on the growing set of community ontologies
available in biomedical science
15. Is there a framework for neuroscience?
• Of the ~ 4000 columns
that NIF queries,
~1300 map to one of
our core categories:
– Organism
– Anatomical structure
– Cell
– Molecule
– Function
– Dysfunction
– Technique
• When NIF combines
multiple sources, a set
of common fields
emerges
– Basic information models/semantic models exist for certain types of entities
Biomedical science does have a conceptual framework
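One way to picture the column-to-category mapping is a simple alias lookup. The sketch below is an assumption-laden toy: the alias sets, the column names, and the `categorize` and `coverage` functions are all hypothetical, and the slides do not say how NIF actually performs the mapping. Only the seven category names come from the slide.

```python
# The seven core categories are from the slide; the alias sets are invented examples.
CORE_CATEGORIES = {
    "Organism": {"organism", "species", "strain"},
    "Anatomical structure": {"anatomical structure", "brain region", "region"},
    "Cell": {"cell", "cell type", "neuron type"},
    "Molecule": {"molecule", "gene", "protein"},
    "Function": {"function"},
    "Dysfunction": {"dysfunction", "disease", "disorder"},
    "Technique": {"technique", "method"},
}


def categorize(column_name):
    """Map a source column name to a core category, or None if nothing matches."""
    key = " ".join(column_name.lower().replace("_", " ").split())
    for category, aliases in CORE_CATEGORIES.items():
        if key in aliases:
            return category
    return None


def coverage(column_names):
    """Fraction of columns that map to some core category."""
    mapped = sum(categorize(name) is not None for name in column_names)
    return mapped / len(column_names)


print(categorize("brain_region"))
print(coverage(["Species", "brain_region", "record_id", "Gene"]))
```

By the slide's own figures, roughly 1300 of about 4000 columns map to a core category, which is a coverage of about one third.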
17. This is your brain on computers
NeuroLex: > 1 million triples
Dr. Yi Zeng: Chinese neural knowledge base
NIF Cell Graph
18. NIF Concept-Based Search
• Incorporate basic neuroscience knowledge into search
– Google: searches for the string “GABAergic neuron”
– NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
Neuroscience Information Framework – http://neuinfo.org
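The difference between string search and concept-based search can be illustrated with a toy is-a hierarchy. This is a minimal sketch under stated assumptions: the `IS_A` table is a hand-written stand-in for an ontology, the cell types in it are illustrative only, and `descendants` and `concept_search_terms` are hypothetical helper names.

```python
# Hypothetical is-a edges (child -> parent) standing in for an ontology.
IS_A = {
    "GABAergic neuron": "neuron",
    "Purkinje cell": "GABAergic neuron",
    "basket cell": "GABAergic neuron",
    "cerebellar basket cell": "basket cell",
    "pyramidal cell": "neuron",
}


def descendants(concept, is_a=IS_A):
    """Collect every term that is a direct or indirect subtype of `concept`."""
    found = []
    frontier = [concept]
    while frontier:
        parent = frontier.pop()
        for child, child_parent in is_a.items():
            if child_parent == parent and child not in found:
                found.append(child)
                frontier.append(child)
    return found


def concept_search_terms(concept, is_a=IS_A):
    """Return the concept plus all of its subtypes, for use as search terms."""
    return [concept] + descendants(concept, is_a)


print(concept_search_terms("GABAergic neuron"))
```

A plain string search would look only for the literal phrase, while the expanded list also retrieves records that mention a subtype without ever using the parent term.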
19. Ontologies as a data integration framework
• NIF Connectivity: 7 databases containing connectivity primary data or claims from literature on connectivity between brain regions
– Brain Architecture Management System (rodent)
– Temporal lobe.com (rodent)
– Connectome Wiki (human)
– Brain Maps (various)
– CoCoMac (primate cortex)
– UCLA Multimodal database (Human fMRI)
– Avian Brain Connectivity Database (Bird)
• Total: 1800 unique brain terms (excluding Avian)
• Number of exact terms used in > 1 database: 42
• Number of synonym matches: 99
• Number of 1st order partonomy matches: 385
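The three kinds of match counted above (exact term, synonym, first-order partonomy) can be sketched as a tiered comparison. The code is a hypothetical illustration, not the procedure NIF used: the `SYNONYMS` and `PART_OF` tables are invented toy data, and `match_tier` is an assumed name.

```python
# Toy lookup tables; real ones would come from an ontology such as NIFSTD.
SYNONYMS = {"hippocampus": {"cornu ammonis", "ammon's horn"}}
PART_OF = {"ca1": "hippocampus"}  # child -> direct parent (first-order partonomy)


def normalize(term):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def match_tier(term_a, term_b, synonyms=SYNONYMS, part_of=PART_OF):
    """Classify how two brain-region terms relate: exact, synonym, partonomy, or None."""
    a, b = normalize(term_a), normalize(term_b)
    if a == b:
        return "exact"
    if b in synonyms.get(a, set()) or a in synonyms.get(b, set()):
        return "synonym"
    if part_of.get(a) == b or part_of.get(b) == a:
        return "partonomy"
    return None


print(match_tier("Ammon's horn", "Hippocampus"))
print(match_tier("CA1", "Hippocampus"))
```

Checking the tiers in this order means each pair of terms is counted once, under the strictest relation that holds.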
23. What can we learn from the NIF Registry?
NIF supports a semantic model for
describing research resources
24. Resource Curation
• NIF Registry is hosted on the Semantic MediaWiki platform (NeuroLex)
– Community can add,
review, edit without
special privileges
– Searchable by Google
– Integrated with NIF
ontologies
– Graph structure
http://neurolex.org
25. Can we mine relationships between resources?
http://neuinfo.org
NIF semantic graph of
research resources
Text mining gives a picture of the most used resources
PDB
http://force11.org/Resource_identification_initiative
26. • Automated text mining is used to look
for “web page last updated” or
copyright dates
– Identified for 570 resources
– 373 were not updated within the last 2
years (65%)
• Manual review of ~200 resources
– 38 not updated within the past 2 years
(~20%)
– 8 migrated to new addresses or institutions
– 7 are no longer in service (~3%)
– 3 were deemed no longer appropriate
Tracking digital resources since 2008
NIF helps stabilize the dynamic resource landscape
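A check of this kind can be approximated with two regular expressions, one for "last updated" notices and one for copyright lines. This is a rough sketch under stated assumptions, not NIF's pipeline: the patterns, the function names `latest_year` and `is_stale`, and the year-level granularity are all choices made for illustration.

```python
import re
from datetime import date

# Assumed patterns: a four-digit year shortly after "last updated" or a copyright mark.
UPDATED_RE = re.compile(
    r"last\s+(?:updated|modified).{0,30}?\b((?:19|20)\d{2})\b", re.IGNORECASE
)
COPYRIGHT_RE = re.compile(
    r"(?:copyright|©|\(c\))[^\d]{0,20}((?:19|20)\d{2})(?:\s*[-–]\s*((?:19|20)\d{2}))?",
    re.IGNORECASE,
)


def latest_year(page_text):
    """Return the most recent year found in an update or copyright notice, else None."""
    years = [int(m.group(1)) for m in UPDATED_RE.finditer(page_text)]
    for m in COPYRIGHT_RE.finditer(page_text):
        years.extend(int(y) for y in m.groups() if y)
    return max(years) if years else None


def is_stale(page_text, today, max_age_years=2):
    """True if the newest year found is more than `max_age_years` old; None if no date."""
    year = latest_year(page_text)
    if year is None:
        return None
    return today.year - year > max_age_years


print(latest_year("Copyright © 2005-2009 Example Lab"))
print(is_stale("Web page last updated: March 3, 2010", date(2013, 6, 10)))
```

Comparing years only is coarse, and a copyright year can lag real activity, which may be one reason the automated estimate (373 of 570, about 65%) is so much higher than the manual one (38 of about 200, about 20%).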
27. Keeping content up to date
Connectome
Tractography
Epigenetics
•New tags come into
existence
•New resource types come
into existence, e.g., Mobile
apps
•Resources add new types of
content
•Change name
•Change scope
•> 7000 updates to the
registry last year
It’s a challenge to keep the registry up to date;
sitemaps, curation, ontologies, community review
28. What can we learn from the NIF Data
Federation?
NIF supports a semantic model for
describing research resources
29. [Chart: Data Federation Growth. Axes: number of federated databases (0 to 250) and number of federated records in millions (0.01 to 1000, log scale), plotted from Jun-08 to May-13.]
NIF searches the largest collation of
neuroscience-relevant data on the web
DISCO
June 10, 2013, dkCOIN Investigator's Retreat
30. What do you mean by data?
Databases come in many shapes and sizes
• Primary data:
– Data available for
reanalysis, e.g., microarray data sets
from GEO; brain images from XNAT;
microscopic images (CCDB/CIL)
• Secondary data
– Data features extracted through
data processing and sometimes
normalization, e.g., brain structure
volumes (IBVD), gene expression
levels (Allen Brain Atlas); brain
connectivity statements (BAMS)
• Tertiary data
– Claims and assertions about the
meaning of data
• E.g., gene
upregulation/downregulation,
brain activation as a function of
task
• Registries:
– Metadata
– Pointers to data sets or
materials stored elsewhere
• Data aggregators
– Aggregate data of the same type from multiple sources, e.g., Cell Image Library, SUMSdb, Brede
• Single source
– Data acquired within a single context, e.g., Allen Brain Atlas
Researchers are producing a variety of
information artifacts using a multitude of
technologies
31. What have we learned: Grabbing the long tail
of small data
• NIF is in a unique position to ask
questions against the data resource
landscape
• The data space is not uniform
• Data “flows” from one resource to
the next
– Data is reinterpreted, reanalyzed or added
to
• Currently very difficult to track data
as it moves across the landscape
– Makes it difficult to learn from combined
efforts
NIF is trying to make it easier to work with diverse data
32. Phases of NIF
• 2006-2008: A survey of what was out there
• 2008-2009: Strategy for resource discovery
– NIF Registry vs NIF data federation
– Ingestion of data contained within different technology platforms, e.g., XML vs relational
vs RDF
– Effective search across semantically diverse sources
• NIFSTD ontologies
• 2009-2011: Strategy for data integration
– Unified views across common sources
– Mapping of content to NIF vocabularies
• 2011-present: Data analytics
– Uniform external data references
• 2012-present: SciCrunch: unified biomedical resource
services
NIF provides a strategy and set of tools applicable to all
biomedical science
33. NIF team (past and present)
Jeff Grethe, UCSD, Co-Investigator, Interim PI
Amarnath Gupta, UCSD, Co-Investigator
Anita Bandrowski, NIF Project Leader
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
Rixin Wang
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Caltech
Arun Rangarajan
Hans Michael Muller
Yuling Li
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam
Larry Lui
Andrea Arnaud Stagg
Jonathan Cachat
Jennifer Lawrence
Svetlana Sulima
Davis Banks
Vadim Astakhov
Xufei Qian
Chris Condit
Mark Ellisman
Stephen Larson
Willie Wong
Tim Clark, Harvard University
Paolo Ciccarese
Karen Skinner, NIH, Program Officer (retired)
Jonathan Pollock, NIH, Program Officer
And my colleagues in Monarch, dkNet, 3DVC, Force 11