Data integration is a perennial challenge for data scientists working with large-scale data. Bio-ontologies help in this endeavour, both as sources of synonyms and as the basis for rules-based fuzzy integration pipelines.
Ontology-based data integration
Ontologies can help with the semantic and the content aspects of data integration
• Semantic: definition for schemas
• OWL is a good language for defining schemas
• See RDF and Semantic Web presentations, today
• Content: definition of the entities referred to by data
• Ontologies embedded into a data integration workflow help facilitate content-aware data integration
Core challenge: biological knowledge
The answer to the question: “Is
Entity A from Data Source 1
the same thing as
Entity B from Data Source 2?”
often depends on who is asking and who is answering!
Left lung vs. lung
Hippocampus vs. brain
Dopamine vs. L-dopamine
In vitro vs. In vivo cells of type X
Gene Y and post-translationally modified form Y'
Gene Z in mouse, Gene Z in human
Hierarchy
left lung --is a--> lung --is a--> organ
Generalise to the
nearest common ancestor
i.e. if you are integrating data about tissue
samples annotated to 'lung' in the one
dataset, and 'left lung' in the other,
the ontology can compute 'lung' as the
nearest common ancestor
Also for 'left lung' and 'right lung'
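The nearest-common-ancestor idea can be sketched in a few lines of Python. This is a minimal illustration, not a real ontology API: the `IS_A` dictionary is a hand-written toy hierarchy, and the function names are made up for this example. It assumes 'right lung' and 'heart' sit in the hierarchy as shown.

```python
from collections import deque

# Toy is_a hierarchy (child -> parents). Hand-written for illustration only.
IS_A = {
    "left lung": ["lung"],
    "right lung": ["lung"],
    "lung": ["organ"],
    "heart": ["organ"],
    "organ": [],
}

def ancestors_with_distance(term, is_a):
    """Return {ancestor: number of is_a steps}, including the term itself at 0."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in is_a.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist

def nearest_common_ancestor(a, b, is_a):
    """Pick the shared ancestor with the smallest combined distance to a and b."""
    da = ancestors_with_distance(a, is_a)
    db = ancestors_with_distance(b, is_a)
    common = set(da) & set(db)
    if not common:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    return min(common, key=lambda t: (da[t] + db[t], t))

print(nearest_common_ancestor("lung", "left lung", IS_A))        # lung
print(nearest_common_ancestor("left lung", "right lung", IS_A))  # lung
```

Data annotated to either term can then be pooled under the returned term. A production pipeline would more likely load the hierarchy from an ontology file and use a reasoner than hand-code a dictionary.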
Other relationships
Relationships encode biological knowledge
Rules let you specify which relationships
can be traversed for data integration purposes
e.g. for tissue samples, part_of:
sample_from o part_of => sample_from
A sample from a part of the brain (e.g. the
hippocampus) is a sample from the brain
(Quite aside from the 'is a' hierarchy!)
hippocampus --part of--> brain
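The rule that a sample from a part is also a sample from the whole can be sketched as a traversal over part_of links. This is a minimal illustration under stated assumptions: `PART_OF`, `SAMPLES`, and the function names are hypothetical, and the 'dentate gyrus' entry is added only to show that the rule applies across more than one step.

```python
# Toy part_of relationships (part -> wholes). Illustrative, not from a real ontology.
PART_OF = {
    "dentate gyrus": ["hippocampus"],
    "hippocampus": ["brain"],
}

# Hypothetical sample annotations: sample id -> tissue it was taken from.
SAMPLES = {"S1": "hippocampus", "S2": "brain", "S3": "dentate gyrus"}

def expand_sample_from(tissue, part_of):
    """Apply 'sample_from o part_of => sample_from' until nothing new is found.

    Returns the annotated tissue plus every whole it is (transitively) part of.
    """
    seen = {tissue}
    stack = [tissue]
    while stack:
        current = stack.pop()
        for whole in part_of.get(current, []):
            if whole not in seen:
                seen.add(whole)
                stack.append(whole)
    return seen

def samples_from(tissue, samples, part_of):
    """List sample ids that count as 'sample from <tissue>' under the rule."""
    return sorted(
        sample_id
        for sample_id, annotated in samples.items()
        if tissue in expand_sample_from(annotated, part_of)
    )

print(samples_from("brain", SAMPLES, PART_OF))        # ['S1', 'S2', 'S3']
print(samples_from("hippocampus", SAMPLES, PART_OF))  # ['S1', 'S3']
```

Only part_of is traversed here. Whether other relationships may be followed is a per-use-case decision, which is the point of stating the rule explicitly.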
Core challenge: flexibility
Fixed-depth hierarchies force some classes to be too big (>150 members),
with the lowest level collapsing biological hierarchy,
and others too small (<1 member)
Ontologies in content integration
Integrating sources A and B into A&B:
1. Schema mappings
2. Ontology-provided synonyms
3. Hierarchy and relationship rules for integration
OWL language and tools: web-embedded
(but whole-ontology rule reasoning may be slow)
Is ontology integration
just another type of data integration?
Which ontology(-ies) to use?
How to use them together?
How to plug the gaps?
Why should I (as a user) have to
do this integration over and over?
Desiderata for ontologies for data
integration
• Ontologies should be neutral and shared community-wide
• Users should be able to directly and rapidly extend the ontology where there are gaps (responsiveness)
• The ontology should use semantics-free identifiers and at the same time energetically annotate synonyms
• When necessary, ontologies should take care of ontology integration to provide the community with a one-stop service and appropriate cross-references
• The ontologies should be used in data annotation
See http://www.obofoundry.org/
I visited a website listing 10 reasons why data integration is hard. It focused mainly on business data integration scenarios, but was still relevant for bioinformatics. It included sterling, true points such as: technology changes very rapidly, but legacy never 100% goes away; different applications have fundamentally different needs; we keep inventing new products; etc. The first comment was almost brilliantly dumb. It offered, as an 11th reason, the pearl of wisdom that the developers can’t just use the same one golden way to design <schemas/content/etc>! If only they could, data integration would be SO much easier.
Ontologies gather synonyms together around a semantics-free identifier which acts as a “hub” for all the possible labels that could refer to things of that type. This works for ambiguous labels too, since the same label can be associated with multiple ontology terms.
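The "hub" idea can be sketched as a label index that maps every label and synonym to the identifiers it may refer to. This is a minimal illustration: the `EX:` identifiers, the term entries, and the function names are all invented for this example and do not come from any real ontology.

```python
from collections import defaultdict

# Hypothetical terms keyed by semantics-free identifiers (the "EX:" ids are made up).
TERMS = {
    "EX:0000001": {"label": "lung", "synonyms": ["pulmo"]},
    "EX:0000002": {"label": "cell nucleus", "synonyms": ["nucleus"]},
    "EX:0000003": {"label": "neural nucleus", "synonyms": ["nucleus"]},
}

def build_label_index(terms):
    """Map each normalised label or synonym to the set of term ids that carry it."""
    index = defaultdict(set)
    for term_id, info in terms.items():
        for name in [info["label"], *info["synonyms"]]:
            index[name.strip().lower()].add(term_id)
    return index

def lookup(label, index):
    """Return all candidate term ids for a label (several if it is ambiguous)."""
    return sorted(index.get(label.strip().lower(), set()))

INDEX = build_label_index(TERMS)
print(lookup("Pulmo", INDEX))    # ['EX:0000001']
print(lookup("nucleus", INDEX))  # ['EX:0000002', 'EX:0000003']
print(lookup("spleen", INDEX))   # []
```

An ambiguous label such as 'nucleus' returns more than one identifier, so the integration pipeline, or the user, still has to choose between them using context.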