CIBB2011
                                8th INTERNATIONAL MEETING ON COMPUTATIONAL INTELLIGENCE
                                       METHODS FOR BIOINFORMATICS AND BIOSTATISTICS
                                              Gargnano Sul Garda, Lombardia, Italy
                                                         July 2nd 2011




           Dipartimento di
Elettronica e Informazione




               Biomolecular annotation prediction
               through information integration

             Davide Chicco, Marco Tagliasacchi, Marco Masseroli
                             davide.chicco@elet.polimi.it
Summary

          1. Data warehousing and information integration

          2. Computational methods

          3. Software infrastructures




© Davide Chicco Biomolecular annotation prediction through information integration   2
Data warehousing
          and information integration (1)

          • Given the large amount of biomolecular information and
            data available nowadays in different web sources,
            information integration has become a very important
            task for the bioinformatics community.

          • There are different data integration approaches available:
                     • federated databases (e.g. BioKleisli, K2)
                     • link-driven federations (e.g. SRS)
                     • link-driven integration of data sources (e.g. Entrez,
                       GeneCards, SOURCE)
                     • mediator-based architectures (e.g. BioDataServer,
                       DiscoveryLink)
                     • data warehousing (e.g. EnsMart, BioWarehouse)
Data warehousing
          and information integration (2)

          • To integrate heterogeneous bioinformatics data available
            in a variety of formats from many different sources,
            we followed this last approach:
                                Data Warehousing

          • Data warehousing is quicker, more efficient and more effective
            than “on demand” systems.

          • We generated, keep updated and extend a multi-species
            Genomic and Proteomic Data Warehouse (GPDW) based at
            Politecnico di Milano.

          [Figure: on-line databanks (Entrez Gene, Homologene, IPI, eVOC,
          BioCyc, KEGG, Reactome, GOA, Gene Ontology) feed, through
          automatic updating procedures, the Genomic and Proteomic Data
          Warehouse hosted on a database server.]
Data warehousing
          and information integration (3)

          • The integration is managed in three phases:
                     1. Importing: data from the different sources are imported
                        into one repository.
                     2. Aggregation: data are gathered and normalized into a
                        single representation in the instance-aggregation tier of
                        our global data model.
                     3. Integration: data are organized into informative clusters
                        in the concept-integration tier of the global model

          • During the initial importing and aggregation phases, tables of
            the features described by the imported data are created and
            populated.

          For further information, see the CIBB2011 paper by my colleague
          Arif Canakoglu.
Computational methods:
          Prediction of biomolecular annotations (1)

          • To better understand the functions of genes and proteins,
            scientists developed the concept of annotation (association of
            nucleotide or amino acid sequences with useful information).

          • This information is expressed through controlled vocabularies,
            sometimes structured as ontologies, where every controlled
            term of the vocabulary is associated with a unique
            alphanumeric code.

          • Such a code associated with a gene or protein ID constitutes
            an annotation.

          [Figure: a Gene is linked to a Biological function feature
          through an Annotation (gene2bff).]
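
As a minimal sketch of the definition above, an annotation can be modelled as a pair of a gene (or protein) ID and the code of a controlled term. The class name, field names and gene identifiers below are illustrative assumptions, not part of the GPDW schema; GO:0006915 is the Gene Ontology code for "apoptotic process".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_id: str  # identifier of the gene or protein (hypothetical values below)
    term_id: str  # unique alphanumeric code of the controlled term


# A set keeps each (gene, term) association only once.
annotations = {
    Annotation("GENE_0001", "GO:0006915"),
    Annotation("GENE_0001", "GO:0006915"),  # duplicate, collapses in the set
    Annotation("GENE_0002", "GO:0006915"),
}

# All terms associated with one gene.
terms_of_gene = sorted(a.term_id for a in annotations if a.gene_id == "GENE_0001")
```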
Computational methods:
          Prediction of biomolecular annotations (2)

          • However, available annotations are incomplete, and, for
            recently studied genomes, they are mostly computationally
            derived.

          • So only a few of them represent highly reliable
            human-curated information.

          • To support and quicken the time-consuming curation process,
            prioritized lists of computationally predicted annotations
            are extremely useful.

          • In our work, we assessed, improved and extended a prediction
            method already available in the literature, based on the SVD
            (Singular Value Decomposition) of the annotation matrix, by
            proposing a new method, SIM (Semantic IMprovement).
Computational methods:
          Singular Value Decomposition – SVD (1)

          • Matrix $A \in \{-1, 0, 1\}^{m \times n}$
              • m rows: genes
              • n columns: annotation terms
          A(i, j) = 1 if gene i is annotated to term j or to any descendant of j
          in the considered ontology structure.
          A(i, j) = -1 if it's known that gene i is NOT annotated to term j.
          A(i, j) = 0 otherwise (we don't know).

          • SVD of A:                                   $A = U \Sigma V^T$

          • An annotation prediction is performed by computing a reduced
          rank approximation $A_k$ of matrix A
               • (where 0 < k < r, with r the number of non-zero singular
               values).
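
The construction above can be sketched in Python with NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the function names, the `ancestors` map and the three-gene example are hypothetical, and known negative annotations (the -1 entries) are left out for brevity.

```python
import numpy as np


def build_annotation_matrix(genes, terms, direct, ancestors):
    """A[i, j] = 1 if gene i is annotated to term j or to a descendant of j."""
    gi = {g: i for i, g in enumerate(genes)}
    tj = {t: j for j, t in enumerate(terms)}
    A = np.zeros((len(genes), len(terms)))
    for gene, term in direct:
        A[gi[gene], tj[term]] = 1.0
        # Propagate the annotation upwards to every ancestor of the term.
        for anc in ancestors.get(term, ()):
            A[gi[gene], tj[anc]] = 1.0
    return A


def rank_k_approximation(A, k):
    """Reduced rank approximation A_k from the first k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > 1e-12))  # number of non-zero singular values
    if not 0 < k < r:
        raise ValueError("k must satisfy 0 < k < r")
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Hypothetical toy ontology: t1 and t2 are children of root.
genes = ["g1", "g2", "g3"]
terms = ["root", "t1", "t2"]
ancestors = {"t1": ["root"], "t2": ["root"]}
direct = [("g1", "t1"), ("g2", "t2"), ("g3", "t1"), ("g3", "t2")]

A = build_annotation_matrix(genes, terms, direct, ancestors)
A_k = rank_k_approximation(A, k=2)
```

The entries of `A_k` are real valued; how they are turned into predictions depends on the threshold discussed next.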
Computational methods:
          Singular Value Decomposition – SVD (2)

          • $A_k$ contains real-valued entries related to the likelihood that
            gene i should be annotated to term j.

          • To evaluate the prediction, we compare each entry A(i,j) with its
            corresponding $A_k(i,j)$:

                 For a certain real threshold τ
                     if $A_k(i,j)$ > τ : gene i is predicted to be annotated to term j

          • The threshold τ can be chosen in order to obtain the B best
            predicted annotations (with B a natural number).
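
One possible way to pick τ for a given B is sketched below. It assumes that the "best predicted annotations" are ranked among the entries not yet annotated (A(i,j) ≤ 0); that reading, the function name and the toy values are assumptions made for illustration.

```python
def threshold_for_top_b(A, Ak, B):
    """Return tau such that Ak(i,j) > tau selects the B highest-scoring
    candidate entries, i.e. those with A(i,j) <= 0.

    With tied scores around the cut, fewer than B entries may be selected.
    """
    if B <= 0:
        raise ValueError("B must be a positive integer")
    scores = sorted(
        (Ak[i][j] for i in range(len(A)) for j in range(len(A[0])) if A[i][j] <= 0),
        reverse=True,
    )
    if B >= len(scores):
        return float("-inf")  # every candidate is selected
    return scores[B]  # the (B+1)-th best score is the largest excluded value


# Hypothetical toy values.
A = [[1, 0, 0], [0, 0, 1]]
Ak = [[0.9, 0.4, 0.1], [0.7, 0.2, 0.8]]
tau = threshold_for_top_b(A, Ak, B=2)
predicted = [(i, j) for i in range(2) for j in range(3)
             if A[i][j] <= 0 and Ak[i][j] > tau]
```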



Computational methods:
          Semantic IMproved – SIM (1)

            • The SVD method implicitly adopts a global term-to-term
              correlation matrix $T = A^T A$, estimated from the whole set of
              available annotations.

            • In contrast, we propose an adaptive approach, called the SIM
              method, which clusters genes (the rows of matrix A) based on
              their original annotation profile and evaluates a set of distinct
              correlation matrices $T_c$, one for each cluster.

            • For each matrix $T_c$, a predicted annotation profile for gene i is
              computed.

            • The selected predicted annotation profile for gene i is the one
              that minimizes the variation, measured by the ℓ2 norm, with
              respect to the original annotation profile of the gene.
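
A short derivation may help to see why the SVD method "implicitly" relies on T. The symbols $U_k$, $\Sigma_k$ and $V_k$ (the first k columns of U and V and the leading k × k block of Σ) are notation introduced here for illustration. Since the columns of V are the eigenvectors of T, the rank-k approximation is the projection of each annotation profile onto the top k eigenvectors of T:

```latex
\begin{align*}
A &= U \Sigma V^T \\
T = A^T A &= V \Sigma^T U^T U \Sigma V^T = V \left( \Sigma^T \Sigma \right) V^T \\
A \, V_k V_k^T &= U \Sigma V^T V_k V_k^T = U_k \Sigma_k V_k^T = A_k
\end{align*}
```

Replacing the global T with a cluster-specific matrix changes the eigenvectors used in the projection, which is where the SIM method departs from plain SVD.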
Computational methods:
          Semantic IMproved – SIM (2)

            1. We heuristically fix a number C of clusters, and completely
               discard the columns of matrix U where j = C+1, ..., n.

            2. Each column $u_c$ of SVD matrix U represents a cluster, and the
               value U(i,c) indicates the membership of gene i to the c-th
               cluster.

            3. We use this membership degree to cluster the genes (the
               rows of matrix A). As such, each gene might belong to more
               than one cluster with different degrees of membership.

            4. To estimate the correlation matrices $T_c$, we cluster genes
               based on their functional similarity, expressed through their
               annotations, by exploiting the SVD of matrix A.

Computational methods:
          Semantic IMproved – SIM (2)

            5. To calculate $T_c$ for each cluster, first we generate a modified
               gene-to-term matrix $A_c = W_c A$, in which the i-th row of A is
               weighted by the membership score of the corresponding
               gene to the c-th cluster.

            6. Then, we compute $T_c = A_c^T A_c$.

            7. Furthermore, to obtain a more accurate clustering, we compute
               the eigenvectors of the matrix $\tilde{G} = A S A^T$, where the real
               n × n matrix S is the term similarity matrix.
               Starting from a pair of ontology terms, j1 and j2, the term
               functional similarity S(j1, j2) can be calculated using different
               methods; here we use Lin's similarity metric.
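
The steps so far can be sketched with NumPy as follows. This is a rough reading of the slides, not the authors' code: the function names are hypothetical, $W_c$ is taken to be the diagonal matrix built from the c-th membership vector, the k leading eigenvectors of each cluster matrix are used for the projection, and S defaults to the identity (in which case the eigenvectors of $A S A^T$ coincide with the columns of U).

```python
import numpy as np


def top_eigenvectors(M, k):
    """k eigenvectors of a symmetric matrix with the largest eigenvalues."""
    w, V = np.linalg.eigh(M)  # eigenvalues in ascending order
    return V[:, np.argsort(w)[::-1][:k]]


def sim_predict(A, C, k, S=None):
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    S = np.eye(n) if S is None else np.asarray(S, dtype=float)

    # Membership vectors: leading eigenvectors of G = A S A^T (step 7).
    memberships = top_eigenvectors(A @ S @ A.T, C)

    best = np.zeros_like(A)
    best_err = np.full(m, np.inf)
    for c in range(C):
        Ac = memberships[:, [c]] * A        # A_c = W_c A (step 5)
        Tc = Ac.T @ Ac                      # T_c = A_c^T A_c (step 6)
        Vkc = top_eigenvectors(Tc, k)
        candidate = A @ Vkc @ Vkc.T         # predicted profiles for cluster c
        err = np.linalg.norm(A - candidate, axis=1)
        better = err < best_err             # keep, per gene, the closest profile
        best[better] = candidate[better]
        best_err[better] = err[better]
    return best


# Hypothetical toy annotation matrix (4 genes, 4 terms).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 0],
              [0, 0, 0, 1]], dtype=float)
A_k = sim_predict(A, C=2, k=2)
```

Because $T_c$ depends on the square of the membership weights, the sign ambiguity of the eigenvectors does not affect the result.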



Computational methods:
          Semantic IMproved – SIM (3)

            8. Then, every element of the $A_k$ matrix is computed considering
               the c-th cluster that minimizes its ℓ2 norm distance to the
               original vector: $a_{k,i} = a_i V_{k,c,i} V_{k,c,i}^T$
            9. In the end, the final $A_k$ is built.

            10. To evaluate the prediction, finally we compare each A(i,j) element
                to its corresponding $A_k(i,j)$:
            • if A(i,j) ≤ 0 and $A_k(i,j)$ > τ: a new annotation is suggested (AP++)
            • if A(i,j) = 1 and $A_k(i,j)$ ≤ τ: the annotation is suggested to be
               semantically inconsistent with the available data (AR++)
            • if A(i,j) = 1 and $A_k(i,j)$ > τ: the original annotation is confirmed as
               present (AC++)
            • if A(i,j) ≤ 0 and $A_k(i,j)$ ≤ τ: the absence of annotation is
               confirmed (NAC++)
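
The four counters can be accumulated with a few lines of Python. The function names and toy values are illustrative; the decision rules are the ones listed above.

```python
from collections import Counter


def classify(a, ak, tau):
    """Outcome for one entry: a is A(i,j) in {-1, 0, 1}, ak is A_k(i,j)."""
    if a <= 0:
        return "AP" if ak > tau else "NAC"  # predicted vs. confirmed absent
    return "AC" if ak > tau else "AR"       # confirmed vs. to be reviewed


def count_outcomes(A, Ak, tau):
    return Counter(
        classify(a, ak, tau)
        for row_a, row_k in zip(A, Ak)
        for a, ak in zip(row_a, row_k)
    )


# Hypothetical toy values.
A = [[1, 0, 0], [0, 0, 1]]
Ak = [[0.9, 0.4, 0.1], [0.7, 0.2, 0.8]]
counts = count_outcomes(A, Ak, tau=0.5)
```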
Computational methods:
          data considered

            •       We considered the Gene Ontology annotations of the organism
                    Saccharomyces cerevisiae, excluding annotations with
                    evidence code IEA (Inferred from Electronic Annotation), since
                    they have not been checked by a manual curator.

            •       After this, the Saccharomyces cerevisiae dataset included
                    3,676 genes related to 1,256 ontology terms, for a total of
                    23,384 annotations.




Computational methods:
          results evaluation (1)

            •       We performed K-fold cross-validation (as in the works of King
                    et al. and Tao et al.) to assess the performance of our methods.

            •       Then we discarded terms used to annotate fewer than M genes,
                    in order to avoid considering terms with very few reference
                    annotations, which could bias the evaluation.

            •       As for the SIM method, we evaluated two variants:
                   • in SIM1, we set S = I, i.e. the clustering step does not rely
                       on the functional similarity between terms.
                   • in SIM2, matrix S is computed by means of Lin's
                       metric.
                   In both cases we heuristically set a fixed number of clusters
                   C=5 for all ontologies, based on our own tests (C << p).
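
For reference, Lin's metric in its standard form scores two terms as twice the information content (IC) of their most informative common ancestor divided by the sum of their own ICs, with IC(t) = -log p(t). The sketch below uses that standard definition; the toy ontology, the term probabilities and the function names are assumptions, and the slides do not state how p(t) was estimated.

```python
import math


def information_content(term, prob):
    """IC(t) = -log p(t), where p(t) is the annotation frequency of t."""
    return -math.log(prob[term])


def lin_similarity(t1, t2, ancestors, prob):
    """Lin similarity; ancestors[t] must include t itself."""
    common = ancestors[t1] & ancestors[t2]
    if not common:
        return 0.0
    ic_mica = max(information_content(t, prob) for t in common)
    denom = information_content(t1, prob) + information_content(t2, prob)
    # Convention chosen here: 0 when both terms carry no information.
    return 2.0 * ic_mica / denom if denom > 0 else 0.0


# Hypothetical ontology: root -> a -> {b, c}.
prob = {"root": 1.0, "a": 0.5, "b": 0.25, "c": 0.25}
ancestors = {
    "root": {"root"},
    "a": {"a", "root"},
    "b": {"b", "a", "root"},
    "c": {"c", "a", "root"},
}
s_bc = lin_similarity("b", "c", ancestors, prob)
```

Setting S to the identity (SIM1) amounts to treating every pair of distinct terms as unrelated.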

Computational methods:
          results evaluation (2)




   [Figure: the trade-off between the annotations-to-be-reviewed rate AR / (AC + AR) and the predicted
   annotations rate AP / (AP + NAC) for the annotations of Saccharomyces cerevisiae genes, when the
   prediction is performed using, heuristically, the first k = 40 eigenvectors of the available annotation
   matrix, $A_k$.]
Computational methods:
          results evaluation (3)

          • As an aggregated indicator of the prediction performance, we
            computed the area above the AR rate versus AP rate curve (AAC)
            in the [0, 0.01] range, for Saccharomyces cerevisiae.

          • Indeed, we are typically interested in the low range of the AP rate, since it
            corresponds to the top-ranked predictions of newly inferred annotations
            (AP) with the highest score. The normalized AAC metric is limited to
            the [0, 1] interval, where a value closer to 1 implies more accurate
            predictions.

          • In all cases, we computed the AAC metrics considering only the
            prediction of Gene Ontology terms with depth L from the root of
            the ontology greater than either 2 or 6.
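
A possible way to compute the two rates and a normalized AAC is sketched below. The slides do not specify the numerical integration scheme, so the trapezoidal rule, the linear interpolation at the right edge and the function names are assumptions; the curve is expected to start at an AP rate of 0 and to reach at least `x_max`.

```python
def rates(ap, ar, ac, nac):
    """(AP rate, AR rate) from the four outcome counters."""
    return ap / (ap + nac), ar / (ac + ar)


def normalized_aac(points, x_max=0.01):
    """Area above the AR-rate vs. AP-rate curve on [0, x_max], scaled to [0, 1].

    points: iterable of (ap_rate, ar_rate) pairs, one per threshold value.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= x_max:
            break
        if x1 > x_max:
            # Cut the last segment at x_max by linear interpolation.
            y1 = y0 + (y1 - y0) * (x_max - x0) / (x1 - x0)
            x1 = x_max
        # Trapezoid between the curve and the line AR rate = 1.
        area += (x1 - x0) * ((1.0 - y0) + (1.0 - y1)) / 2.0
    return area / x_max


# Hypothetical counters and curves.
ap_rate, ar_rate = rates(ap=5, ar=10, ac=90, nac=995)
flat = normalized_aac([(0.0, 0.0), (0.01, 0.0)])
steep = normalized_aac([(0.0, 0.0), (0.02, 1.0)])
```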




Computational methods:
          results evaluation (4)

          • Using the AAC metric, Table 1 shows that the SIM method
            outperforms the SVD method for all Gene Ontology ontologies. In
            66% of cases SIM2 was better than, or equal to, the SIM1 method,
            showing that clustering based on the functional similarity between
            terms might be beneficial.




Software infrastructures

          • Since the amount of data was very large and the objectives and
            computation quite demanding, we had to pay particular attention to
            software performance and memory usage.

          • To make the software easy to modify and extend, we
            first chose the Java programming language for implementation.

          • Java is largely independent of platform and
            operating system, which is very important for the software
            objectives, but it has a high response time and uses a lot of
            memory.

          • We therefore implemented the mathematical core of our software in
            the C++ programming language, using a multiplatform, multithreaded,
            optimized mathematical kernel: the AMD Core Math Library (ACML).


Software infrastructures (2)

          • Another effective library we used for the mathematical core was
            SVDLIBC.

          • The multithreaded native part was developed by using OpenMP
            (Open Multi-Processing) compiler directives, which are exploited
            by the mathematical kernels independently of the operating
            system.

          • The interaction between the native C++ code and the Java code was
            handled through the Java Native Interface (JNI).




Conclusions (1)

          • Easy and integrated access to the large amount of biomolecular
            information and knowledge now available in many heterogeneous
            and distributed data sources is required to answer biological
            questions.

          • Data warehousing and computational systems can provide support
            for comprehensive use and analysis of sparsely available genomic and
            proteomic structural, functional and phenotypic information and
            knowledge.

          • We designed a method for integrating biomolecular knowledge
            (mainly expressed through controlled terminologies or ontologies),
            into a data warehouse, and exploiting such integrated data to
            predict new biomolecular annotations.



Conclusions (2)

          • We proposed a novel contribution in the context of prediction of
            genomic ontological annotations, SIM.

          • Our Semantic IMproved (SIM) version of the Singular Value
            Decomposition (SVD) method produced better predictions than the
            SVD method alone.

          • The method we proposed turned out to be very useful and powerful
            compared to those already present in the literature.

          • Furthermore, our approach is not limited to the Gene Ontology
            and can be applied to any controlled annotation.
          • Increasingly available multiple annotations of genes and gene
            products from different controlled vocabularies and ontologies could
            be jointly considered to further improve prediction reliability.

Conclusions (3)




                                                         Thanks for your attention




© Davide Chicco Biomolecular annotation prediction through information integration   23

Contenu connexe

Similaire à "Biomolecular annotation prediction through information integration" - Davide Chicco (PoliMi) @ CIBB 2011

Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...
Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...
Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...Manuel GEA - Bio-Modeling Systems
 
Introduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptxIntroduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptxRAJESHKUMAR428748
 
Knowledge Discovery using an Integrated Semantic Web
Knowledge Discovery using an Integrated Semantic WebKnowledge Discovery using an Integrated Semantic Web
Knowledge Discovery using an Integrated Semantic WebMichel Dumontier
 
BioDiscovery Solutions for Future
BioDiscovery Solutions for FutureBioDiscovery Solutions for Future
BioDiscovery Solutions for Futurecontactmeasif
 
BioDec Srl Company Profile
BioDec Srl Company ProfileBioDec Srl Company Profile
BioDec Srl Company ProfileBioDec
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataPhilip Cheung
 
LAS - System Biology Lesson
LAS - System Biology LessonLAS - System Biology Lesson
LAS - System Biology LessonLASircc
 
57 bio infomark
57 bio infomark57 bio infomark
57 bio infomarkphdcao
 
Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.Monica Munoz-Torres
 
SooryaKiran Bioinformatics
SooryaKiran BioinformaticsSooryaKiran Bioinformatics
SooryaKiran Bioinformaticscontactsoorya
 
Proteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASyProteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASyChrist College, Rajkot
 
BITS: Basics of sequence databases
BITS: Basics of sequence databasesBITS: Basics of sequence databases
BITS: Basics of sequence databasesBITS
 
A Reliable Password-based User Authentication Scheme for Web-based Human Geno...
A Reliable Password-based User Authentication Scheme for Web-based Human Geno...A Reliable Password-based User Authentication Scheme for Web-based Human Geno...
A Reliable Password-based User Authentication Scheme for Web-based Human Geno...Thitichai Sripan
 

Similaire à "Biomolecular annotation prediction through information integration" - Davide Chicco (PoliMi) @ CIBB 2011 (20)

Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...
Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...
Bm Systems Scientific Epa Conference Heuristic Mathematic Concepts Synergies ...
 
NRNB EAC Report 2011
NRNB EAC Report 2011NRNB EAC Report 2011
NRNB EAC Report 2011
 
Introduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptxIntroduction to Biological database ppt(1).pptx
Introduction to Biological database ppt(1).pptx
 
Knowledge Discovery using an Integrated Semantic Web
Knowledge Discovery using an Integrated Semantic WebKnowledge Discovery using an Integrated Semantic Web
Knowledge Discovery using an Integrated Semantic Web
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Biorel
BiorelBiorel
Biorel
 
Bioinformatics principles and applications
Bioinformatics principles and applicationsBioinformatics principles and applications
Bioinformatics principles and applications
 
BioDiscovery Solutions for Future
BioDiscovery Solutions for FutureBioDiscovery Solutions for Future
BioDiscovery Solutions for Future
 
BioDec Srl Company Profile
BioDec Srl Company ProfileBioDec Srl Company Profile
BioDec Srl Company Profile
 
Brizio rossibiodec
Brizio rossibiodecBrizio rossibiodec
Brizio rossibiodec
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadata
 
LAS - System Biology Lesson
LAS - System Biology LessonLAS - System Biology Lesson
LAS - System Biology Lesson
 
57 bio infomark
57 bio infomark57 bio infomark
57 bio infomark
 
Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.Web Apollo Tutorial for the i5K copepod research community.
Web Apollo Tutorial for the i5K copepod research community.
 
SooryaKiran Bioinformatics
SooryaKiran BioinformaticsSooryaKiran Bioinformatics
SooryaKiran Bioinformatics
 
Neo4j and bioinformatics
Neo4j and bioinformaticsNeo4j and bioinformatics
Neo4j and bioinformatics
 
Proteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASyProteomics resources at the EBI & ExPASy
Proteomics resources at the EBI & ExPASy
 
BITS: Basics of sequence databases
BITS: Basics of sequence databasesBITS: Basics of sequence databases
BITS: Basics of sequence databases
 
Bio4j
Bio4jBio4j
Bio4j
 
A Reliable Password-based User Authentication Scheme for Web-based Human Geno...
A Reliable Password-based User Authentication Scheme for Web-based Human Geno...A Reliable Password-based User Authentication Scheme for Web-based Human Geno...
A Reliable Password-based User Authentication Scheme for Web-based Human Geno...
 

Dernier

Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room servicediscovermytutordmt
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 

Dernier (20)

Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"

"Biomolecular annotation prediction through information integration" - Davide Chicco (PoliMi) @ CIBB 2011

  • 1. CIBB2011: 8th International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics. Gargnano sul Garda, Lombardia, Italy, July 2nd 2011. Dipartimento di Elettronica e Informazione. Biomolecular annotation prediction through information integration. Davide Chicco, Marco Tagliasacchi, Marco Masseroli. davide.chicco@elet.polimi.it
  • 2. Summary 1. Data warehousing and information integration 2. Computational methods 3. Software infrastructures © Davide Chicco Biomolecular annotation prediction through information integration 2
  • 3. Data warehousing and information integration (1)
      • Given the large amount of biomolecular information and data now available from different web sources, information integration has become a very important task for the bioinformatics community.
      • Several data integration approaches are available:
          • federated databases (e.g. BioKleisli, K2)
          • link-driven federations (e.g. SRS)
          • link-driven integration of data sources (e.g. Entrez, GeneCards, SOURCE)
          • mediator-based architectures (e.g. BioDataServer, DiscoveryLink)
          • data warehousing (e.g. EnsMart, BioWarehouse)
  • 4. Data warehousing and information integration (2)
      • To integrate heterogeneous bioinformatics data available in a variety of formats from many different sources, we followed this last approach: data warehousing.
      • It is quicker, more efficient and more effective than "on demand" systems.
      • We generated, keep updated and extend a multi-species Genomic and Proteomic Data Warehouse (GPDW) based at Politecnico di Milano.
      • [Figure: on-line databanks (Entrez Gene, eVOC, BioCyc, KEGG, Reactome, GOA, IPI, Gene Ontology, Homologene); automatic updating procedures; database server; Genomic and Proteomic Data Warehouse]
  • 5. Data warehousing and information integration (3)
      • The integration is managed in three phases:
          1. Importing: data from the different sources are imported into one repository.
          2. Aggregation: data are gathered and normalized into a single representation in the instance-aggregation tier of our global data model.
          3. Integration: data are organized into informative clusters in the concept-integration tier of the global model.
      • During the initial importing and aggregation phases, tables of the features described by the imported data are created and populated. For further information, see my colleague Arif Canakoglu's CIBB2011 paper.
  • 6. Computational methods: Prediction of biomolecular annotations (1)
      • To better understand the functions of genes and proteins, scientists developed the concept of annotation (the association of nucleotide or amino acid sequences with useful information).
      • This information is expressed through controlled vocabularies, sometimes structured as ontologies, where every controlled term of the vocabulary is associated with a unique alphanumeric code.
      • Such a code associated with a gene or protein ID constitutes an annotation.
      • [Figure: Gene; Biological function feature; Annotation (gene2bff)]
  • 7. Computational methods: Prediction of biomolecular annotations (2)
      • However, available annotations are incomplete and, for recently studied genomes, they are mostly computationally derived.
      • So only a few of them represent highly reliable human-curated information.
      • To support and speed up the time-consuming curation process, prioritized lists of computationally predicted annotations are extremely useful.
      • In our work, we assessed, improved and extended a prediction method already available in the literature, based on the SVD (Singular Value Decomposition) of the annotation matrix, by proposing a new method, SIM (Semantic IMprovement).
  • 8. Computational methods: Singular Value Decomposition – SVD (1)
      • Matrix $A \in \{-1, 0, 1\}^{m \times n}$
          • m rows: genes
          • n columns: annotation terms
      • $A(i, j) = 1$ if gene i is annotated to term j or to any descendant of j in the considered ontology structure.
      • $A(i, j) = -1$ if it is known that gene i is NOT annotated to term j.
      • $A(i, j) = 0$ otherwise (we don't know).
      • SVD of A: $A = U \Sigma V^T$
      • An annotation prediction is performed by computing a reduced-rank approximation $A_k$ of matrix A (where $0 < k < r$, with r the number of non-zero singular values).
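The construction of A and its rank-k approximation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy ontology, the gene and term names, and the helper functions `ancestors`, `build_matrix` and `truncated_svd` are all hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {"t_root": [], "t_a": ["t_root"], "t_b": ["t_root"], "t_a1": ["t_a"]}
TERMS = ["t_root", "t_a", "t_b", "t_a1"]
GENES = ["g1", "g2", "g3"]

def ancestors(term):
    """Return the term together with all of its ancestors in the toy ontology."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

def build_matrix(positive, negative):
    """Build A in {-1, 0, 1}^(m x n) from direct (gene, term) annotations."""
    A = np.zeros((len(GENES), len(TERMS)))
    for gene, term in negative:            # known absences
        A[GENES.index(gene), TERMS.index(term)] = -1.0
    for gene, term in positive:            # an annotation also counts for every ancestor term
        for t in ancestors(term):
            A[GENES.index(gene), TERMS.index(t)] = 1.0
    return A

def truncated_svd(A, k):
    """Rank-k approximation A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = build_matrix(
    positive=[("g1", "t_a1"), ("g2", "t_b"), ("g3", "t_a")],
    negative=[("g2", "t_a1")],
)
A_k = truncated_svd(A, k=2)
print(A)
print(np.round(A_k, 3))
```

In this sketch a positive annotation overrides a negative one on the same entry; how such conflicts are handled in the original work is not stated in the slides.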
  • 9. Computational methods: Singular Value Decomposition – SVD (2)
      • $A_k$ contains real-valued entries related to the likelihood that gene i should be annotated to term j.
      • To evaluate the prediction, we consider each entry $A(i,j)$ and its corresponding $A_k(i,j)$: for a certain real threshold $\tau$, if $A_k(i,j) > \tau$, gene i is predicted to be annotated to term j.
      • The threshold $\tau$ can be chosen so as to obtain the B best predicted annotations (with B a natural number).
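One way to choose the threshold so that B predictions are returned is sketched below. It assumes that the "best predicted annotations" are ranked only among entries that are not already annotated (A(i,j) ≤ 0); the function names `threshold_for_top_b` and `predict` are hypothetical, and tied scores at the cut-off may yield fewer than B predictions.

```python
import numpy as np

def threshold_for_top_b(A, A_k, B):
    """Return tau so that the B highest-scoring non-annotated entries exceed it."""
    candidates = np.sort(A_k[A <= 0])[::-1]   # scores of non-annotated entries, descending
    if B >= candidates.size:
        return -np.inf                        # every candidate is predicted
    return candidates[B]                      # the (B+1)-th largest score

def predict(A, A_k, tau):
    """List the (gene, term) index pairs newly predicted at threshold tau."""
    rows, cols = np.where((A <= 0) & (A_k > tau))
    return [(int(i), int(j)) for i, j in zip(rows, cols)]

A = np.array([[1, 0, 0], [0, 1, -1]], dtype=float)
A_k = np.array([[0.9, 0.4, 0.1], [0.3, 0.8, 0.2]])
tau = threshold_for_top_b(A, A_k, B=2)
predicted = predict(A, A_k, tau)
print(tau, predicted)
```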
  • 10. Computational methods: Semantic IMproved – SIM (1)
      • The SVD method implicitly adopts a global term-to-term correlation matrix $T = A^T A$, estimated from the whole set of available annotations.
      • In contrast, we propose an adaptive approach, called the SIM method, which clusters genes (the rows of matrix A) based on their original annotation profile and estimates a set of distinct correlation matrices $T_c$, one for each cluster.
      • For each matrix $T_c$, a predicted annotation profile for gene i is computed.
      • The selected predicted annotation profile for gene i is the one that minimizes the variation, measured by the $\ell_2$ norm, with respect to the original annotation profile of the gene.
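The statement that SVD implicitly uses the global matrix T = A^T A can be checked numerically: the right singular vectors of A are the eigenvectors of T, so the rank-k approximation equals the projection of A onto the top-k eigenvectors of T. The small matrix below is made up for illustration, and the check assumes the k-th and (k+1)-th singular values are distinct.

```python
import numpy as np

# Made-up 3 genes x 4 terms annotation matrix.
A = np.array([[1, 1, 0, 1],
              [1, 0, 1, -1],
              [1, 1, 0, 0]], dtype=float)
k = 2

# Rank-k approximation from the SVD of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Same approximation from the eigenvectors of T = A^T A.
T = A.T @ A
eigvals, eigvecs = np.linalg.eigh(T)            # eigh returns ascending eigenvalues
top = np.argsort(eigvals)[::-1][:k]             # indices of the k largest
V_k = eigvecs[:, top]
A_k_proj = A @ V_k @ V_k.T

same = np.allclose(A_k_svd, A_k_proj)
print(same)
```

Replacing the single global T with per-cluster matrices is what distinguishes the SIM method described on this slide.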
  • 11. Computational methods: Semantic IMproved – SIM (2)
      1. We heuristically fix a number C of clusters, and completely discard the columns of matrix U where j = C+1, ..., n.
      2. Each column $u_c$ of the SVD matrix U represents a cluster, and the value $U(i,c)$ indicates the membership of gene i in the c-th cluster.
      3. We use this membership degree to cluster the genes (the rows of matrix A). As such, each gene might belong to more than one cluster, with different degrees of membership.
      4. To estimate the correlation matrices $T_c$, we cluster genes based on their functional similarity, expressed through their annotations, by exploiting the SVD of matrix A.
  • 12. Computational methods: Semantic IMproved – SIM (2)
      5. To calculate $T_c$ for each cluster, we first generate a modified gene-to-term matrix $A_c = W_c A$, in which the i-th row of A is weighted by the membership score of the corresponding gene in the c-th cluster.
      6. Then, we compute $T_c = A_c^T A_c$.
      7. Furthermore, to obtain a more accurate clustering, we compute the eigenvectors of the matrix $\tilde{G} = A S A^T$, where the real $n \times n$ matrix S is the term similarity matrix. Starting from a pair of ontology terms, $j_1$ and $j_2$, the term functional similarity $S(j_1, j_2)$ can be calculated using different methods; here we use Lin's similarity metric.
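The per-cluster correlation matrices of steps 5 to 7 can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the function name `cluster_correlation_matrices` is hypothetical, the membership scores are taken directly from the leading eigenvectors of A S A^T, S is assumed symmetric, and S defaults to the identity, in which case those eigenvectors coincide with the columns of U up to sign.

```python
import numpy as np

def cluster_correlation_matrices(A, C, S=None):
    """Return [T_1, ..., T_C] with T_c = A_c^T A_c and A_c = W_c A."""
    n = A.shape[1]
    S = np.eye(n) if S is None else S          # S = I means no term-similarity weighting
    G = A @ S @ A.T                            # gene-to-gene matrix whose eigenvectors define clusters
    eigvals, eigvecs = np.linalg.eigh(G)       # ascending order; G assumed symmetric
    order = np.argsort(eigvals)[::-1]
    membership = eigvecs[:, order[:C]]         # column c holds the membership of each gene in cluster c
    matrices = []
    for c in range(C):
        A_c = membership[:, c][:, None] * A    # same as diag(membership[:, c]) @ A
        matrices.append(A_c.T @ A_c)
    return matrices

A = np.array([[1, 1, 0, 1],
              [1, 0, 1, -1],
              [1, 1, 0, 0]], dtype=float)
T_list = cluster_correlation_matrices(A, C=2)
print([T.shape for T in T_list])
```

Because T_c depends on the squared membership scores, the arbitrary sign of each eigenvector does not affect the result. When all eigenvectors are kept, the T_c matrices sum to the global T = A^T A.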
  • 13. Computational methods: Semantic IMproved – SIM (3)
      8. Then, every row of the $A_k$ matrix is computed considering the c-th cluster that minimizes its $\ell_2$-norm distance to the original vector: $a_{k,i} = a_i \, V_{k,c,i} \, V_{k,c,i}^T$
      9. In the end, the final $A_k$ is built.
      10. To evaluate the prediction, we finally compare each element $A(i,j)$ with its corresponding $A_k(i,j)$:
          • if $A(i,j) \le 0$ and $A_k(i,j) > \tau$: a new annotation is suggested (AP++)
          • if $A(i,j) = 1$ and $A_k(i,j) \le \tau$: the annotation is suggested to be semantically inconsistent with the available data (AR++)
          • if $A(i,j) = 1$ and $A_k(i,j) > \tau$: the original annotation is confirmed as present (AC++)
          • if $A(i,j) \le 0$ and $A_k(i,j) \le \tau$: the absence of the annotation is confirmed (NAC++)
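Steps 8 to 10 can be sketched as follows, assuming the per-cluster predicted profiles have already been computed and stacked in an array of shape (C, m, n). The function names `select_closest_profiles` and `count_outcomes` and the example numbers are hypothetical.

```python
import numpy as np

def select_closest_profiles(A, candidates):
    """For each gene, keep the candidate row closest to its original profile (L2 norm)."""
    dist = np.linalg.norm(candidates - A[None, :, :], axis=2)   # shape (C, m)
    best = np.argmin(dist, axis=0)                              # best cluster per gene
    return candidates[best, np.arange(A.shape[0]), :]           # shape (m, n)

def count_outcomes(A, A_k, tau):
    """Count the four outcomes AP, AR, AC and NAC at threshold tau."""
    known = A == 1          # entries of A are -1, 0 or 1, so ~known means A <= 0
    high = A_k > tau
    return {
        "AP": int(np.sum(~known & high)),    # new annotation suggested
        "AR": int(np.sum(known & ~high)),    # annotation to be reviewed
        "AC": int(np.sum(known & high)),     # annotation confirmed
        "NAC": int(np.sum(~known & ~high)),  # absence confirmed
    }

A = np.array([[1, 0], [0, 1]], dtype=float)
candidates = np.array([
    [[0.9, 0.6], [0.5, 0.5]],   # profiles predicted with cluster 0
    [[0.2, 0.1], [0.1, 0.8]],   # profiles predicted with cluster 1
])
A_k = select_closest_profiles(A, candidates)
counts = count_outcomes(A, A_k, tau=0.5)
print(A_k, counts)
```

Here gene 0 takes its profile from cluster 0 and gene 1 from cluster 1, since those rows lie closest to the original annotation profiles.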
  • 14. Computational methods: data considered
      • We considered the Gene Ontology annotations of the organism Saccharomyces cerevisiae, excluding annotations with evidence code IEA (Inferred from Electronic Annotation), since they have not been checked by a manual curator.
      • After this, the Saccharomyces cerevisiae dataset included 3,676 genes related to 1,256 ontology terms, for a total of 23,384 annotations.
  • 15. Computational methods: results evaluation (1)
      • We performed K-fold cross-validation (as in the works of King et al. and Tao et al.) to assess the performance of our methods.
      • Then we discarded terms used to annotate fewer than M genes, in order to avoid considering terms with very few reference annotations, which could bias the evaluation.
      • As for the SIM method, we evaluated two variants:
          • in SIM1, we set S = I, i.e. the clustering step does not rely on the functional similarity between terms.
          • in SIM2, matrix S is computed by means of Lin's metric.
      • In both cases we heuristically set a fixed number of clusters C = 5 for all ontologies, based on our own tests (C << p).
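The term filtering and a generic K-fold split can be sketched as below. The slides do not say how the folds were formed, so hiding one fold of known annotations at a time (setting them to 0) is purely an assumption made for this illustration, as are the function names `drop_rare_terms` and `k_fold_splits`.

```python
import numpy as np

def drop_rare_terms(A, M):
    """Keep only the terms (columns) that annotate at least M genes."""
    keep = (A == 1).sum(axis=0) >= M
    return A[:, keep], keep

def k_fold_splits(A, K, seed=0):
    """Yield (training matrix, held-out positions), hiding one fold of annotations each time."""
    rng = np.random.default_rng(seed)
    rows, cols = np.where(A == 1)
    order = rng.permutation(rows.size)
    for fold in np.array_split(order, K):
        train = A.copy()
        train[rows[fold], cols[fold]] = 0          # hidden annotations become "unknown"
        yield train, list(zip(rows[fold].tolist(), cols[fold].tolist()))

A = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, -1]], dtype=float)
filtered, keep = drop_rare_terms(A, M=2)
folds = list(k_fold_splits(filtered, K=5))
print(keep, len(folds))
```

A prediction method would then be run on each training matrix and scored on how well it recovers the held-out positions.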
  • 16. Computational methods: results evaluation (2)
      • [Figure] Trade-off between the annotations-to-be-reviewed rate AR / (AC + AR) and the predicted-annotations rate AP / (AP + NAC) for the annotations of Saccharomyces cerevisiae genes, when the prediction is performed using the first k = 40 eigenvectors (chosen heuristically) of the available annotation matrix, $A_k$.
  • 17. Computational methods: results evaluation (3)
      • As an aggregated indicator of the prediction performance, we computed the area above the AR rate versus AP rate curve (AAC) in the [0, 0.01] range, for Saccharomyces cerevisiae.
      • Indeed, we are typically interested in the low range of the AP rate, since it corresponds to the top-ranked predictions of newly inferred annotations (AP) with the highest scores. The normalized AAC metric is limited to the [0, 1] interval, where a value closer to 1 implies more accurate predictions.
      • In all cases, we computed the AAC metric considering only the prediction of Gene Ontology terms with depth L from the root of the ontology greater than either 2 or 6.
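The two rates and a normalized area above the curve can be sketched as follows. The exact normalization used in the original work is not given in the slides; this sketch assumes the AR rate is plotted against the AP rate, integrates 1 minus the AR rate over [0, 0.01] with the trapezoidal rule, and divides by the width of that range. The function names `rates` and `normalized_aac` are hypothetical.

```python
import numpy as np

def rates(A, A_k, tau):
    """Return (AP rate, AR rate) = (AP / (AP + NAC), AR / (AC + AR)) at threshold tau."""
    known = A == 1
    high = A_k > tau
    ap, nac = np.sum(~known & high), np.sum(~known & ~high)
    ar, ac = np.sum(known & ~high), np.sum(known & high)
    return ap / (ap + nac), ar / (ar + ac)

def normalized_aac(ap_rates, ar_rates, ap_max=0.01):
    """Area above the AR-rate vs AP-rate curve on [0, ap_max], scaled to [0, 1]."""
    order = np.argsort(ap_rates)
    x = np.asarray(ap_rates, dtype=float)[order]
    y = np.asarray(ar_rates, dtype=float)[order]
    grid = np.linspace(0.0, ap_max, 101)
    curve = np.interp(grid, x, y)                  # linear interpolation of the curve
    area_under = np.sum((grid[1:] - grid[:-1]) * (curve[1:] + curve[:-1]) / 2.0)
    return 1.0 - area_under / ap_max

A = np.array([[1, 0], [0, 1]], dtype=float)
A_k = np.array([[0.9, 0.6], [0.1, 0.8]])
print(rates(A, A_k, tau=0.5))
print(normalized_aac([0.0, 0.01], [1.0, 0.0]))
```

In practice the curve would be traced by sweeping the threshold over many values and collecting one (AP rate, AR rate) pair per threshold.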
  • 18. Computational methods: results evaluation (4)
      • Using the AAC metric, Table 1 shows that the SIM method outperforms the SVD method for all the Gene Ontology ontologies. In 66% of cases SIM2 was better than, or equal to, the SIM1 method, showing that clustering based on the functional similarity between terms might be beneficial.
  • 19. Software infrastructures
      • Since the amount of data was very large and the objectives and computation quite demanding, we had to pay particular attention to software performance and memory usage.
      • To keep the software easy to modify and extend, we first chose the Java programming language for the implementation.
      • Java is largely independent of platform and operating system, which is very important for the software objectives, but it has high response times and uses a lot of memory.
      • We therefore implemented the mathematical core of our software in the C++ programming language, using a multiplatform, multithreaded, optimized mathematical kernel: the AMD Core Math Library (ACML).
  • 20. Software infrastructures (2)
      • Another effective library we used for the mathematical core was SVDLIBC.
      • The multithreaded native part was developed using OpenMP (Open Multi-Processing, OMP) compiler directives, which are exploited by the mathematical kernels independently of the operating system.
      • The interaction between the native C++ code and the Java code was handled through the Java Native Interface (JNI).
  • 21. Conclusions (1)
      • Easy and integrated access to the large amount of biomolecular information and knowledge now available in many heterogeneous and distributed data sources is required to answer biological questions.
      • Data warehousing and computational systems can provide support for the comprehensive use and analysis of sparsely available genomic and proteomic structural, functional and phenotypic information and knowledge.
      • We designed a method for integrating biomolecular knowledge (mainly expressed through controlled terminologies or ontologies) into a data warehouse, and for exploiting such integrated data to predict new biomolecular annotations.
  • 22. Conclusions (2)
      • We proposed a novel contribution in the context of the prediction of genomic ontological annotations: SIM.
      • Our Semantic IMproved (SIM) version of the Singular Value Decomposition (SVD) method produced better predictions than the SVD method alone.
      • The method we proposed turned out to be useful and effective compared with those already present in the literature.
      • Furthermore, our approach is not limited to the Gene Ontology and can be applied to any controlled annotation.
      • Increasingly available multiple annotations of genes and gene products from different controlled vocabularies and ontologies could be jointly considered to further improve prediction reliability.
  • 23. Conclusions (3)
      Thanks for your attention