Reference Data Integration:
 A Strategy For The Future

              Barry Smith
National Center for Ontological Research
          University at Buffalo

        presented at FIMA, March 21, 2012



                                            1
Who am I?
    National Center for Biomedical Ontology
based in Stanford Medical School, the Mayo Clinic
     and Buffalo Department of Philosophy
    •   Cleveland Clinic Semantic Database
    •   Duke University Health System
    •   University of Pittsburgh Medical Center
    •   German Federal Ministry of Health
    •   European Union eHealth Directorate
    •   Plant Genome Research Resource
    •   Protein Information Resource
                              2
Who am I?
National Center for Ontological Research (http://ncor.us)

  • Joint Warfighting Center, US Joint Forces Command
  • Intelligence and Information Warfare Directorate
    (I2WD)
  • US Department of the Army Net-Centric Data
    Strategy Center of Excellence
  • NextGen (Next Generation Air Transportation
    System) Ontology Team
  • National Nuclear Security Administration (NNSA),
    Department of Energy
                           3
Some questions

•   How to find data?
•   How to understand data when you find it?
•   How to use data when you find it?
•   How to compare and integrate with other data?
•   How to avoid data silos?




                                                4
The Web (net-centricity) as part of the
                 solution

• You build a site
• Others discover the site and they link to it
• The more they link, the more well known the
  page becomes (Google …)
• Your data becomes discoverable




                                             5
The roots of Semantic Technology

1. Make your data available in a standard way
   on the Web
2. Use controlled vocabularies (‘ontologies’) to
   capture common meanings, in ways
   understandable to both humans and
   computers – Web Ontology Language
   (OWL)
3. Build links among the datasets to create a
   ‘web of data’
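
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary, its term IDs (`EX:0001`, `EX:0002`), the two datasets and the helper `link_by_term` are all hypothetical and do not come from any real ontology or from the talk.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: stable term IDs mapped to agreed labels.
VOCABULARY = {
    "EX:0001": "corporate bond",
    "EX:0002": "equity share",
}

# Two hypothetical datasets that use different local wording for the same things,
# but tag each record with a term from the shared vocabulary.
dataset_a = [{"id": "a1", "local_label": "Corp. Bond", "term": "EX:0001"}]
dataset_b = [
    {"id": "b7", "local_label": "bond (corporate)", "term": "EX:0001"},
    {"id": "b8", "local_label": "common stock", "term": "EX:0002"},
]


def link_by_term(*datasets):
    """Group record IDs from all datasets under the vocabulary term they are tagged with."""
    links = defaultdict(list)
    for records in datasets:
        for record in records:
            if record["term"] not in VOCABULARY:
                raise ValueError(f"unknown term: {record['term']}")
            links[record["term"]].append(record["id"])
    return dict(links)


links = link_by_term(dataset_a, dataset_b)
print(links)
```

Records `a1` and `b7` end up linked even though their local labels differ, because both carry the same term ID.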
Controlled vocabularies for tagging
          (‘annotating’) data
• Hardware changes rapidly
• Organizations rapidly forming and
  disbanding
• Data is exploding
• Meanings of common words change slowly
• Use web architecture to annotate exploding
  data stores using ontologies to capture
  these common meanings in a stable way
                                               7
Where we stand today
• increasing availability of semantically enhanced
  data and semantic software
• increasing use of XML, RDF, OWL in attempts to
  create useful integration of on-line data and
  information
• “Linked Open Data” the New Big Thing




                                                     8
Ontology success stories, and some
           reasons for failure




                                         9
as of September 2010
The problem: the more Semantic
Technology is successful, the more it fails

The original idea was to break down silos via
  common controlled vocabularies for the tagging
  of data
The very success of the approach leads to the
  creation of ever new controlled vocabularies –
  semantic silos – as ever more ontologies are
  created in ad hoc ways
The Semantic Web framework as currently
  conceived and governed by the W3C yields
  minimal standardization
Multiplying (Meta)data registries are creating
  data cemeteries
                                                   11
NCBO Bioportal (Ontology Registry)




                                 12
13/24
14/24
Reasons for this effect
• Low incentives for reuse of existing ontologies
• Each organization wants its own ontology
• Poor licensing regime, poor standards, poor
  training
• People think: Information technology (hardware)
  is changing constantly, so it’s not worth the effort
  of getting things right
• People have egos: “We have done it this way for
  30 years, we are not going to change now”

                                                     15
Why should you care?
• when there are many ad hoc systems, average
  quality will be low
• constant need for ad hoc repair through
  manual effort
• DoD alone spends $6 billion per annum on
  this problem
• regulatory agencies are recognizing the need
  for common controlled vocabularies

                                             16/24
So now people are scrambling

• to learn how to create ontologies
• serious lag in creating trained expertise
• poor quality coding leads to poor quality
  ontologies
• poor quality ontology management



                                              17
How to do it right?
• how to create an incremental, evolutionary
  process, where what is good survives?
• how to bring about ontology death?


      A success story from biology



                                            18
Old biology data




                   19/
New biology data
MKVSDRRKFEKANFDEFESALNNKNDLVHCPSITLFESIPTEVRSFYED
EKSGLIKVVKFRTGAMDRKRSFEKVVISVMVGKNVKKFLTFVEDEPDF
QGGPISKYLIPKKINLMVYTLFQVHTLKFNRKDYDTLSLFYLNRGYYNE
LSFRVLERCHEIASARPNDSSTMRTFTDFVSGAPIVRSLQKSTIRKYGY
NLAPYMFLLLHVDELSIFSAYQASLPGEKKVDTERLKRDLCPRKPIEIKY
FSQICNDMMNKKDRLGDILHIILRACALNFGAGPRGGAGDEEDRSIT
NEEPIIPSVDEHGLKVCKLRSPNTPRRLRKTLDAVKALLVSSCACTARD
LDIFDDNNGVAMWKWIKILYHEVAQETTLKDSYRITLVPSSDGISLLAF
AGPQRNVYVDDTTRRIQLYTDYNKNGSSEPRLKTLDGLTSDYVFYFVT
VLRQMQICALGNSYDAFNHDPWMDVVGFEDPNQVTNRDISRIVLY
SYMFLNTAKGCLVEYATFRQYMRELPKNAPQKLNFREMRQGLIALGR
HCVGSRFETDLYESATSELMANHSVQTGRNIYGVDFSLTSVSGTTATLL
Ontology in PubMed
[Chart: a single data series by year, 2000 to 2010; vertical axis from 0 to 1200]
By far the most successful: GO (Gene Ontology)




                                           22
the Gene Ontology is not an ontology of genes



what cellular component?

what molecular function?

what biological process?



                             23
Microarray data shows changed expression of thousands of genes.
How will you spot the patterns?

[Figure: gene expression display over time, attacked vs. control; gene list: all genes (14010); gene groups labeled:]
  Defense response
  Immune response
  Response to stimulus
  Toll regulated genes
  JAK-STAT regulated genes
  Puparial adhesion
  Molting cycle
  hemocyanin
  Amino acid catabolism
  Lipid metabolism
  Peptidase activity
  Protein catabolism
  Immune response
  Immune response
  Toll regulated genes
                                                 24
Why is GO successful
• built by bench biologists
• multi-species, multi-disciplinary, open source
• compare use of kilograms, meters, seconds in
  formulating experimental results
• natural language and logical definitions for all
  terms
• initially low-tech to ensure aggressive use and
  testing
                                                 25
now used not just in
biology but also in
hospital research




                       26
Lab / pathology data
EHR data
Clinical trial data
Family history data
Medical imaging
Microarray data
Model organism data
Flow cytometry
Mass spec
Genotype / SNP data

How will you spot the patterns?
How will you find the data you
need?
                                  27
• Over 11 million annotations relating
  UniProt, Ensembl and other databases to terms in
  the GO




                                                     28
Hierarchical view representing
relations between represented types
                                 29
~ $200 million invested in the GO so far
  A new kind of biomedical research
• Over 11 million GO annotations to biomedical
  research literature freely available on the web

• Powerful software tool support for navigating
  this data means that what used to take
  researchers months of data comparison effort
  can now be performed in milliseconds
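
A minimal sketch of why shared annotations make such comparisons fast: once entries from different databases are tagged with terms from one hierarchy, a query for a general term can pick up everything annotated to its more specific terms. The term labels are taken from the slides, but the `IS_A` hierarchy is a simplified illustration rather than the actual GO graph, and the entry IDs are hypothetical.

```python
# Simplified, illustrative is_a hierarchy (child -> parent); not the actual GO graph.
IS_A = {
    "immune response": "response to stimulus",
    "defense response": "response to stimulus",
}

# Hypothetical annotations: (database, entry ID, term label).
ANNOTATIONS = [
    ("UniProt", "P_EXAMPLE_1", "immune response"),
    ("Ensembl", "G_EXAMPLE_2", "defense response"),
    ("UniProt", "P_EXAMPLE_3", "peptidase activity"),
]


def ancestors(term):
    """Return the term together with all of its is_a ancestors."""
    result = {term}
    while term in IS_A:
        term = IS_A[term]
        result.add(term)
    return result


def entries_annotated_to(query):
    """Find entries annotated to the query term or to any more specific term."""
    return [(db, entry) for db, entry, term in ANNOTATIONS if query in ancestors(term)]


hits = entries_annotated_to("response to stimulus")
print(hits)
```

The query returns entries from both databases without any pairwise mapping between them, which is the point of annotating to a common vocabulary.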

                                                   30
If controlled vocabularies are to serve
            to remove silos
• they have to be respected by many owners of
   data as resources that ensure accurate
   description of their data
 – GO maintained not by computer scientists but
   by biologists
• they have to be willingly used in annotations by
   many owners of data
• they have to be maintained by persons who are
   trained in common principles of ontology
   maintenance
                                                31
The new profession of biocurator




               32
GO has been amazingly successful

Has created a community consensus
Has created a web of feedback loops where
  users of the GO can easily report errors
  and gaps
Has identified principles for successful
  ontology management
Indispensable to every drug company and
  every biology lab

                                             33
But GO is limited in its scope

it covers only generic biological entities of three
sorts:
     – cellular components
     – molecular functions
     – biological processes
no diseases, symptoms, disease
biomarkers, protein interactions, experimental
processes …
                                                 34
Extending the GO methodology to
  other domains of biology and
            medicine




                                  35
RELATION TO TIME    CONTINUANT                                                    OCCURRENT
GRANULARITY         INDEPENDENT                     DEPENDENT

ORGAN AND           Organism (NCBI Taxonomy)        Organ Function (FMP, CPRO)    Biological Process (GO)
ORGANISM            Anatomical Entity (FMA, CARO)   Phenotypic Quality (PaTO)

CELL AND CELLULAR   Cell (CL)                       Cellular Function (GO)
COMPONENT           Cellular Component (FMA, GO)

MOLECULE            Molecule                        Molecular Function (GO)       Molecular Process (GO)
                    (ChEBI, SO, RnaO, PrO)

OBO (Open Biomedical Ontology) Foundry proposal
                    (Gene Ontology in yellow)                                36
[Figure: the same matrix of ontologies by granularity and relation to time as on the previous slide]

    The strategy of orthogonal modules
                                                                           37
Ontology, with scope, URL and custodians:

Cell Ontology (CL)
  Scope:      cell types from prokaryotes to mammals
  URL:        obo.sourceforge.net/cgi-bin/detail.cgi?cell
  Custodians: Jonathan Bard, Michael Ashburner, Oliver Hofman

Chemical Entities of Biological Interest (ChEBI)
  Scope:      molecular entities
  URL:        ebi.ac.uk/chebi
  Custodians: Paula Dematos, Rafael Alcantara

Common Anatomy Reference Ontology (CARO)
  Scope:      anatomical structures in human and model organisms
  URL:        (under development)
  Custodians: Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland

Foundational Model of Anatomy (FMA)
  Scope:      structure of the human body
  URL:        fma.biostr.washington.edu
  Custodians: JLV Mejino Jr., Cornelius Rosse

Functional Genomics Investigation Ontology (FuGO)
  Scope:      design, protocol, data instrumentation, and analysis
  URL:        fugo.sf.net
  Custodians: FuGO Working Group

Gene Ontology (GO)
  Scope:      cellular components, molecular functions, biological processes
  URL:        www.geneontology.org
  Custodians: Gene Ontology Consortium

Phenotypic Quality Ontology (PaTO)
  Scope:      qualities of anatomical structures
  URL:        obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value
  Custodians: Michael Ashburner, Suzanna Lewis, Georgios Gkoutos

Protein Ontology (PrO)
  Scope:      protein types and modifications
  URL:        (under development)
  Custodians: Protein Ontology Consortium

Relation Ontology (RO)
  Scope:      relations
  URL:        obo.sf.net/relationship
  Custodians: Barry Smith, Chris Mungall

RNA Ontology (RnaO)
  Scope:      three-dimensional RNA structures
  URL:        (under development)
  Custodians: RNA Ontology Consortium

Sequence Ontology (SO)
  Scope:      properties and features of nucleic sequences
  URL:        song.sf.net
  Custodians: Karen Eilbeck
How to recreate the success of the
        GO in other areas
1. create a portal for sharing of information
   about existing controlled vocabularies, needs
   and institutions operating in a given area
2. create a library of ontologies in this area
3. create a consortium of developers of these
   ontologies who agree to pool their efforts to
   create a single set of non-overlapping
   ontology modules
  – one ontology for each sub-area
                                               39
NextGen Ontology Portal

Ontology Portal
• Two-Tiered Registry
   – NextGen Ontology – consists of vetted ontologies
   – Ontology Library – open to the wider community
• Ontology Metadata
   – Ontology owner, domain, and location
• Ontology Search*
   – Support ontology discovery

[Diagram labels: Portal, Communities, Ontology Library, NextGen Enterprise Ontology, Search]
                                                                       40
The OBO Foundry: a step-by-step,
     principles-based approach

• Developers commit in advance to
  collaborating with developers of ontologies
  in adjacent domains and

• to working to ensure that, for each
  domain, there is community convergence on
  a single ontology

          http://obofoundry.org
                                                41
OBO Foundry Principles
• Common governance
• Common training
• Robust versioning
• Common architecture




                              42
top level      Basic Formal Ontology (BFO)

mid-level      Information Artifact Ontology (IAO)
               Ontology for Biomedical Investigations (OBI)
               Ontology of General Medical Science (OGMS)

domain level   Anatomy Ontology (FMA*, CARO)
               Cell Ontology (CL)
               Cellular Component Ontology (FMA*, GO*)
               Environment Ontology (EnvO)
               Infectious Disease Ontology (IDO*)
               Subcellular Anatomy Ontology (SAO)
               Sequence Ontology (SO*)
               Protein Ontology (PRO*)
               Phenotypic Quality Ontology (PaTO)
               Biological Process Ontology (GO*)
               Molecular Function (GO*)

                   OBO Foundry Modular Organization                                    43
Extension Strategy

top level      UCore 2.0 / UCore SL
mid-level
domain level

   Military domain ontologies as extensions of the
            Universal Core Semantic Layer
                                         44
Existing efforts to create modular
           ontology suites
NASA Sweet Ontologies
Military Intelligence Ontology Foundry
Planned OMG efforts:
• OMG (CIA) Financial Event Ontology
• Semantic Layer for ISO 20022 (Financial
Industry Message Scheme)
Example:
Financial Securities Ontology
Mike Bennett (2007)
                                46
Basic principles of ontology
            development
– for formulating definitions
– of modularity
– of user feedback for error correction and gap
  identification
– for ensuring compatibility between modules
– for using ontologies to annotate legacy data
– for using ontologies to create new data
– for developing user-specific views
Modularity designed to ensure
•    non-redundancy
•    annotations can be additive
•    division of labor among SMEs
•    lessons learned in one module can benefit work on
     other modules
•    transferrable training
•    motivation of SME users
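
The point that annotations can be additive can be illustrated with a short sketch. It assumes two hypothetical modules whose term sets do not overlap (prefixes `INST:` and `PARTY:`), so that tags made independently by different teams can simply be merged per record; all names and records here are invented for illustration.

```python
# Hypothetical annotations of the same records, made independently with two modules.
instrument_tags = {"rec1": {"INST:bond"}, "rec2": {"INST:equity"}}
party_tags = {"rec1": {"PARTY:issuer"}, "rec3": {"PARTY:custodian"}}


def combine(*annotation_sets):
    """Merge per-record term sets; terms from disjoint modules never collide."""
    combined = {}
    for annotations in annotation_sets:
        for record, terms in annotations.items():
            combined.setdefault(record, set()).update(terms)
    return combined


combined = combine(instrument_tags, party_tags)
print(combined)
```

Because each module owns its own sub-area, merging needs no reconciliation step: `rec1` simply ends up described from both perspectives.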



                                                     49
How should the FIMA Reference Data
community solve this problem?
Major financial institutions
Major software vendors
Major data management companies
EDMC and government principals
   – should pool information about the controlled vocabularies
     which already exist
   – create a common library of these controlled vocabularies
   – create a subset of thought leaders who agree to pool their
     efforts in the creation of a suite of ontology modules for
     common use
   – create a strategy to disseminate and evolve the selected
     modules
   – create a governance strategy to manage the modules over time
   – allow bad ontologies to die
Urgent need for trained ontologists
• Severe shortage of persons with the needed
  expertise
• University at Buffalo Online Training and
  Certification Program for Ontologists

      for details: phismith@buffalo.edu

IAO-Intel: An Ontology of Information Artifacts in the Intelligence DomainBarry Smith
 
Science of Emerging Social Media
Science of Emerging Social MediaScience of Emerging Social Media
Science of Emerging Social MediaBarry Smith
 
Ethics, Informatics and Obamacare
Ethics, Informatics and ObamacareEthics, Informatics and Obamacare
Ethics, Informatics and ObamacareBarry Smith
 
e‐Human Beings: The contribution of internet ranking systems to the developme...
e‐Human Beings: The contribution of internet ranking systems to the developme...e‐Human Beings: The contribution of internet ranking systems to the developme...
e‐Human Beings: The contribution of internet ranking systems to the developme...Barry Smith
 
Ontology of aging and death
Ontology of aging and deathOntology of aging and death
Ontology of aging and deathBarry Smith
 
Ontology in-buffalo-2013
Ontology in-buffalo-2013Ontology in-buffalo-2013
Ontology in-buffalo-2013Barry Smith
 
ImmPort strategies to enhance discoverability of clinical trial data
ImmPort strategies to enhance discoverability of clinical trial dataImmPort strategies to enhance discoverability of clinical trial data
ImmPort strategies to enhance discoverability of clinical trial dataBarry Smith
 
Ontology of Documents (2005)
Ontology of Documents (2005)Ontology of Documents (2005)
Ontology of Documents (2005)Barry Smith
 
Ontology and the National Cancer Institute Thesaurus (2005)
Ontology and the National Cancer Institute Thesaurus (2005)Ontology and the National Cancer Institute Thesaurus (2005)
Ontology and the National Cancer Institute Thesaurus (2005)Barry Smith
 

Plus de Barry Smith (20)

Towards an Ontology of Philosophy
Towards an Ontology of PhilosophyTowards an Ontology of Philosophy
Towards an Ontology of Philosophy
 
An application of Basic Formal Ontology to the Ontology of Services and Commo...
An application of Basic Formal Ontology to the Ontology of Services and Commo...An application of Basic Formal Ontology to the Ontology of Services and Commo...
An application of Basic Formal Ontology to the Ontology of Services and Commo...
 
Ways of Worldmarking: The Ontology of the Eruv
Ways of Worldmarking: The Ontology of the EruvWays of Worldmarking: The Ontology of the Eruv
Ways of Worldmarking: The Ontology of the Eruv
 
The Division of Deontic Labor
The Division of Deontic LaborThe Division of Deontic Labor
The Division of Deontic Labor
 
Ontology of Aging (August 2014)
Ontology of Aging (August 2014)Ontology of Aging (August 2014)
Ontology of Aging (August 2014)
 
Meaningful Use
Meaningful UseMeaningful Use
Meaningful Use
 
The Fifth Cycle of Philosophy
The Fifth Cycle of PhilosophyThe Fifth Cycle of Philosophy
The Fifth Cycle of Philosophy
 
Ontology of Poker
Ontology of PokerOntology of Poker
Ontology of Poker
 
Clinical trial data wants to be free: Lessons from the ImmPort Immunology Dat...
Clinical trial data wants to be free: Lessons from the ImmPort Immunology Dat...Clinical trial data wants to be free: Lessons from the ImmPort Immunology Dat...
Clinical trial data wants to be free: Lessons from the ImmPort Immunology Dat...
 
Enhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort DataEnhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort Data
 
The Philosophome: An Exercise in the Ontology of the Humanities
The Philosophome: An Exercise in the Ontology of the HumanitiesThe Philosophome: An Exercise in the Ontology of the Humanities
The Philosophome: An Exercise in the Ontology of the Humanities
 
IAO-Intel: An Ontology of Information Artifacts in the Intelligence Domain
IAO-Intel: An Ontology of Information Artifacts in the Intelligence DomainIAO-Intel: An Ontology of Information Artifacts in the Intelligence Domain
IAO-Intel: An Ontology of Information Artifacts in the Intelligence Domain
 
Science of Emerging Social Media
Science of Emerging Social MediaScience of Emerging Social Media
Science of Emerging Social Media
 
Ethics, Informatics and Obamacare
Ethics, Informatics and ObamacareEthics, Informatics and Obamacare
Ethics, Informatics and Obamacare
 
e‐Human Beings: The contribution of internet ranking systems to the developme...
e‐Human Beings: The contribution of internet ranking systems to the developme...e‐Human Beings: The contribution of internet ranking systems to the developme...
e‐Human Beings: The contribution of internet ranking systems to the developme...
 
Ontology of aging and death
Ontology of aging and deathOntology of aging and death
Ontology of aging and death
 
Ontology in-buffalo-2013
Ontology in-buffalo-2013Ontology in-buffalo-2013
Ontology in-buffalo-2013
 
ImmPort strategies to enhance discoverability of clinical trial data
ImmPort strategies to enhance discoverability of clinical trial dataImmPort strategies to enhance discoverability of clinical trial data
ImmPort strategies to enhance discoverability of clinical trial data
 
Ontology of Documents (2005)
Ontology of Documents (2005)Ontology of Documents (2005)
Ontology of Documents (2005)
 
Ontology and the National Cancer Institute Thesaurus (2005)
Ontology and the National Cancer Institute Thesaurus (2005)Ontology and the National Cancer Institute Thesaurus (2005)
Ontology and the National Cancer Institute Thesaurus (2005)
 

Reference Data Integration: A Strategy for the Future

  • 1. Reference Data Integration: A Strategy For The Future. Barry Smith, National Center for Ontological Research, University at Buffalo. Presented at FIMA, March 21, 2012.
  • 2. Who am I? National Center for Biomedical Ontology, based in Stanford Medical School, the Mayo Clinic and Buffalo Department of Philosophy • Cleveland Clinic Semantic Database • Duke University Health System • University of Pittsburgh Medical Center • German Federal Ministry of Health • European Union eHealth Directorate • Plant Genome Research Resource • Protein Information Resource
  • 3. Who am I? National Center for Ontological Research (http://ncor.us) • Joint Warfighting Center, US Joint Forces Command • Intelligence and Information Warfare Directorate (I2WD) • US Department of the Army Net-Centric Data Strategy Center of Excellence • NextGen (Next Generation Air Transportation System) Ontology Team • National Nuclear Security Administration (NNSA), Department of Energy
  • 4. Some questions • How to find data? • How to understand data when you find it? • How to use data when you find it? • How to compare and integrate with other data? • How to avoid data silos?
  • 5. The Web (net-centricity) as part of the solution • You build a site • Others discover the site and they link to it • The more they link, the more well known the page becomes (Google …) • Your data becomes discoverable
  • 6. The roots of Semantic Technology 1. Make your data available in a standard way on the Web 2. Use controlled vocabularies (‘ontologies’) to capture common meanings, in ways understandable to both humans and computers – Web Ontology Language (OWL) 3. Build links among the datasets to create a ‘web of data’
  • 7. Controlled vocabularies for tagging (‘annotating’) data • Hardware changes rapidly • Organizations rapidly forming and disbanding • Data is exploding • Meanings of common words change slowly • Use web architecture to annotate exploding data stores, using ontologies to capture these common meanings in a stable way
  • 8. Where we stand today • increasing availability of semantically enhanced data and semantic software • increasing use of XML, RDF, OWL in attempts to create useful integration of on-line data and information • “Linked Open Data”: the New Big Thing
  • 9. Ontology success stories, and some reasons for failure
  • 11. The problem: the more Semantic Technology is successful, the more it fails • The original idea was to break down silos via common controlled vocabularies for the tagging of data • The very success of the approach leads to the creation of ever new controlled vocabularies – semantic silos – as ever more ontologies are created in ad hoc ways • The Semantic Web framework as currently conceived and governed by the W3C yields minimal standardization • Multiplying (meta)data registries are creating data cemeteries
  • 12. NCBO Bioportal (Ontology Registry) 12
  • 15. Reasons for this effect • Low incentives for reuse of existing ontologies • Each organization wants its own ontology • Poor licensing regime, poor standards, poor training • People think: Information technology (hardware) is changing constantly, so it’s not worth the effort of getting things right • People have egos: “We have done it this way for 30 years, we are not going to change now”
  • 16. Why should you care? • when there are many ad hoc systems, average quality will be low • constant need for ad hoc repair through manual effort • DoD alone spends $6 billion per annum on this problem • regulatory agencies are recognizing the need for common controlled vocabularies
  • 17. So now people are scrambling • to learn how to create ontologies • serious lag in creating trained expertise • poor quality coding leads to poor quality ontologies • poor quality ontology management
  • 18. How to do it right? • how to create an incremental, evolutionary process, where what is good survives? • how to bring about ontology death? A success story from biology
  • 21. Ontology in PubMed [chart: yearly values for 2000 to 2010, vertical axis from 0 to 1200]
  • 22. By far the most successful: GO (Gene Ontology)
  • 23. The Gene Ontology is not an ontology of ge[nes] • what cellular component? • what molecular function? • what biological process?
  • 24. Microarray data shows changed expression of thousands of genes. How will you spot the patterns? [Figure: gene expression over time, attacked vs. control, with gene groups labelled Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes, Puparial adhesion, Molting cycle, hemocyanin, Amino acid catabolism, Lipid metabolism, Peptidase activity, Protein catabolism]
  • 25. Why is GO successful? • built by bench biologists • multi-species, multi-disciplinary, open source • compare use of kilograms, meters, seconds in formulating experimental results • natural language and logical definitions for all terms • initially low-tech to ensure aggressive use and testing
  • 26. now used not just in biology but also in hospital research
  • 27. Lab / pathology data • EHR data • Clinical trial data • Family history data • Medical imaging • Microarray data • Model organism data • Flow cytometry • Mass spec • Genotype / SNP data. How will you spot the patterns? How will you find the data you need?
  • 28. Over 11 million annotations relating UniProt, Ensembl and other databases to terms in the GO
  • 29. Hierarchical view representing relations between represented types
  • 30. ~$200 million invested in the GO so far • A new kind of biomedical research • Over 11 million GO annotations to biomedical research literature freely available on the web • Powerful software tool support for navigating this data means that what used to take researchers months of data comparison effort can now be performed in milliseconds
  • 31. If controlled vocabularies are to serve to remove silos • they have to be respected by many owners of data as resources that ensure accurate description of their data – GO is maintained not by computer scientists but by biologists • they have to be willingly used in annotations by many owners of data • they have to be maintained by persons who are trained in common principles of ontology maintenance
  • 32. The new profession of biocurator
  • 33. GO has been amazingly successful • Has created a community consensus • Has created a web of feedback loops where users of the GO can easily report errors and gaps • Has identified principles for successful ontology management • Indispensable to every drug company and every biology lab
  • 34. But GO is limited in its scope • it covers only generic biological entities of three sorts: – cellular components – molecular functions – biological processes • no diseases, symptoms, disease biomarkers, protein interactions, experimental processes …
  • 35. Extending the GO methodology to other domains of biology and medicine
  • 36. OBO (Open Biomedical Ontology) Foundry proposal (Gene Ontology in yellow). Ontologies arranged by relation to time (columns: independent continuant; dependent continuant; occurrent) and by granularity (rows) • Organ and organism: Organism (NCBI Taxonomy), Anatomical Entity (FMA, CARO); Organ Function (FMP, CPRO), Phenotypic Quality (PaTO); Biological Process (GO) • Cell and cellular component: Cell (CL), Cellular Component (FMA, GO); Cellular Function (GO) • Molecule: Molecule (ChEBI, SO, RnaO, PrO); Molecular Function (GO); Molecular Process (GO)
  • 37. The strategy of orthogonal modules (the same grid of ontologies, by relation to time and granularity, as on slide 36)
  • 38. Table with columns Ontology; Scope; URL; Custodians • Cell Ontology (CL); cell types from prokaryotes to mammals; obo.sourceforge.net/cgi-bin/detail.cgi?cell; Jonathan Bard, Michael Ashburner, Oliver Hofman • Chemical Entities of Biological Interest (ChEBI); molecular entities; ebi.ac.uk/chebi; Paula Dematos, Rafael Alcantara • Common Anatomy Reference Ontology (CARO); anatomical structures in human and model organisms; (under development); Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland • Foundational Model of Anatomy (FMA); structure of the human body; fma.biostr.washington.edu; JLV Mejino Jr., Cornelius Rosse • Functional Genomics Investigation Ontology (FuGO); design, protocol, data instrumentation, and analysis; fugo.sf.net; FuGO Working Group • Gene Ontology (GO); cellular components, molecular functions, biological processes; www.geneontology.org; Gene Ontology Consortium • Phenotypic Quality Ontology (PaTO); qualities of anatomical structures; obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value; Michael Ashburner, Suzanna Lewis, Georgios Gkoutos • Protein Ontology (PrO); protein types and modifications; (under development); Protein Ontology Consortium • Relation Ontology (RO); relations; obo.sf.net/relationship; Barry Smith, Chris Mungall • RNA Ontology (RnaO); three-dimensional RNA structures; (under development); RNA Ontology Consortium • Sequence Ontology (SO); properties and features of nucleic sequences; song.sf.net; Karen Eilbeck
  • 39. How to recreate the success of the GO in other areas: 1. create a portal for sharing of information about existing controlled vocabularies, needs and institutions operating in a given area 2. create a library of ontologies in this area 3. create a consortium of developers of these ontologies who agree to pool their efforts to create a single set of non-overlapping ontology modules – one ontology for each sub-area
  • 40. NextGen Ontology Portal • Two-Tiered Registry – NextGen Ontology: consists of vetted ontologies – Ontology Library: open to the wider community • Ontology Metadata – Ontology owner, domain, and location • Ontology Search* – Support ontology discovery
  • 41. The OBO Foundry: a step-by-step, principles-based approach • Developers commit in advance to collaborating with developers of ontologies in adjacent domains, and • to working to ensure that, for each domain, there is community convergence on a single ontology. http://obofoundry.org
  • 42. OBO Foundry Principles • Common governance • Common training • Robust versioning • Common architecture
  • 43. OBO Foundry Modular Organization • top level: Basic Formal Ontology (BFO) • mid-level: Information Artifact Ontology (IAO), Ontology for Biomedical Investigations (OBI), Ontology of General Medical Science (OGMS) • domain level: Anatomy Ontology (FMA*, CARO), Environment Ontology (EnvO), Infectious Disease Ontology (IDO*), Cell Ontology (CL), Cellular Component Ontology (FMA*, GO*), Phenotypic Quality Ontology (PaTO), Biological Process Ontology (GO*), Subcellular Anatomy Ontology (SAO), Sequence Ontology (SO*), Molecular Function (GO*), Protein Ontology (PRO*)
  • 44. Extension Strategy: military domain ontologies as extensions of the Universal Core Semantic Layer • top level: UCore 2.0 / UCore SL • mid-level • domain level
  • 45. Existing efforts to create modular ontology suites • NASA Sweet Ontologies • Military Intelligence Ontology Foundry • Planned OMG efforts: – OMG (CIA) Financial Event Ontology – Semantic Layer for ISO 20022 (Financial Industry Message Scheme)
  • 48. Basic principles of ontology development – for formulating definitions – of modularity – of user feedback for error correction and gap identification – for ensuring compatibility between modules – for using ontologies to annotate legacy data – for using ontologies to create new data – for developing user-specific views
  • 49. Modularity designed to ensure • non-redundancy • annotations can be additive • division of labor among SMEs • lessons learned in one module can benefit work on other modules • transferable training • motivation of SME users
  • 50. How should the FIMA Reference Data community solve this problem? Major financial institutions, major software vendors, major data management companies, EDMC and government principals should – pool information about the controlled vocabularies which already exist – create a common library of these controlled vocabularies – create a subset of thought leaders who agree to pool their efforts in the creation of a suite of ontology modules for common use – create a strategy to disseminate and evolve the selected modules – create a governance strategy to manage the modules over time – allow bad ontologies to die
  • 51. Urgent need for trained ontologists • Severe shortage of persons with the needed expertise • University at Buffalo Online Training and Certification Program for Ontologists • For details: phismith@buffalo.edu
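
The annotation idea that runs through the slides (tagging data with terms from a shared controlled vocabulary, so that it can be retrieved by meaning rather than by local label) can be illustrated with a small sketch. This is not taken from the talk: the term names are borrowed from slide 24, but the is_a hierarchy, the gene identifiers and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical toy hierarchy: child term -> parent term (is_a).
# It does not reproduce the real Gene Ontology graph.
IS_A = {
    "immune response": "defense response",
    "defense response": "response to stimulus",
    "response to stimulus": "biological process",
    "molting cycle": "biological process",
}

# Hypothetical annotations: data item -> set of vocabulary terms.
ANNOTATIONS = {
    "gene_A": {"immune response"},
    "gene_B": {"defense response"},
    "gene_C": {"molting cycle"},
}


def ancestors(term):
    """Return the term followed by all of its is_a ancestors, nearest first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = IS_A.get(term)  # None once the root is reached
    return chain


def items_annotated_with(query_term):
    """Return items tagged with query_term or with any of its descendants."""
    return sorted(
        item
        for item, terms in ANNOTATIONS.items()
        if any(query_term in ancestors(t) for t in terms)
    )


if __name__ == "__main__":
    print(items_annotated_with("defense response"))
    print(items_annotated_with("biological process"))
```

Because every item is tagged against the same hierarchy, a query for a general term also finds items tagged with more specific terms, which is the kind of pattern-spotting the slides attribute to shared ontologies. Independent ad hoc vocabularies would need a manual mapping step before such a query could work.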

Editor's notes

  1. http://www.w3.org/People/Ivan/CorePresentations/HighLevelIntro/
  2. http://www.w3.org/People/Ivan/CorePresentations/HighLevelIntro/
  3. http://www.w3.org/People/Ivan/CorePresentations/HighLevelIntro/
  4. Ivan Herman
  5. http://dbpedia.org/fct/images/lod-datasets_2009-03-27_colored.png
  6. http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=116006492 (sequence of X chromosome in baker’s yeast)
  7. http://1105govinfoevents.com/EA/Presentations/EA09_2-2_Robinson.pdf