Ondex is a data integration and visualization platform used to integrate large amounts of biological data from multiple sources. It transforms the data into a graph of biological concepts and relationships. Ondex allows users to integrate data, perform semantic alignment of concepts, and visualize the integrated network. Filters and annotators can then be used to highlight specific areas of interest within the large integrated network. Ondex has been applied to problems such as candidate gene prioritization, pathway mapping, and analysis of quantitative trait loci regions in plants.
14. Syntactic integration challenge Over 1000 databases freely available to public Over 60 million sequences in GenBank Over 870 complete genomes and many ongoing projects Over 17 million citations in PubMed PubMed growth by 600,000 publications each year Integration of Life Science data sources is essential for Systems Biology research http://www.ncbi.nlm.nih.gov/Database
15. Ear Semantic Integration challenge Same concept different names Synonyms Same name different concepts Homographs
19. Concepts and relations (1/2) interact Cell Protein – Protein interaction network (PPI) Cellular location of proteins Protein Protein e.g. Network of Concepts and Relations RelationType interact located in ConceptClass ConceptClass Protein CelComp Protein Protein Properties: compound name, protein sequence, protein structure, cellular component, KM-value, PH optimum … Ontology of Concept Classes, Relation Types and additional Properties
20. Reaction Reaction produced by consumed by consumed by produced by Metabolite Metabolite Metabolite Concepts and relations (2/2) Transformation to binary graph Properties: compound name, protein sequence, protein structure, cellular component, KM-value, PH optimum … Concepts: Relations:
24. Importing data into Ondex What databases to import What format these are in Ondex parsers already written Generic OBO, PSI-MI, SBML, Tab-delimited, Fasta Database-specific Aracyc, AtRegNet, BioCyc, BioGRID, Brenda, Drastic, EcoCyc, GO, GOA, Gramene, Grassius, KEGG, Medline, MetaCyc, Oglycbase, OMIM, PDB, Pfam, SGD, TAIR, TIGR, Transfac, Transpath, UniProt, WGS, WordNet
25. Example of resulting graph Has similar sequence Target sequence Binds to, has similar sequence Repressed by, regulated by, activated by Member is part of Gene Protein Encoded by Is_a Member is part of Is_a Transcription factor Is_a Member is part of Enzyme Protein complex Is_a catalyses Catalysing class Member is part of Reaction Member is part of EC Is_a Pathway
26. Ondex Data Integration Scheme Treatments from DRASTIC Graph alignment Pathways from KEGG Data input& transformation Data integration Visualisation Clients/Tools Heterogeneous data sources Ondexgraph warehouse Integration Methods Ondex Visualization Tool Kit UniProt Accession Generalized Object Data Model Database Layer Parser Name based Web Client AraCyc Parser Transitive Taverna KEGG Blast Parser ProteinFamily Transfac Data Exchange Parser Pfam2GO OXL/RDF Microarray Lucene Parser Web Service
27. Semantic Integration by Graph Alignment Create relations between equivalent entries from different data sources Identified by mapping methods Concept accessions (UniProt ID) Concept name (gene name), synonyms Sequence methods Graph neighbourhood Text mining
34. Filters Integrating different datasets large resulting graph Need to narrow down Select meaningful areas of the graph Example in Ondex protein-protein interaction network
46. Annotators (2/3) Virtual Knock-out Annotator to see how important a single concept is to all possible paths contained in a network Ondex resizes the concepts based on this score Scale Concept by Value Pie charts Up/down regulation is indicated in red/green
47. AraCyc ONDEX Application case2: Mapping microarray expression data to integrated pathways Parser tab file Arabidopsis C/N uptake OXL tab file Jan Taubert Accession based Mapping usingTAIR IDs Ondex Interactive exploration Enriched spreadsheet, e.g. AraCyc pathways
54. Application case 3: Arabidopsis PPI network Artem Lysenko IntAct TAIR BioGRID Mapping the 3 databases based on TAIR accessions
55. Adding 3 sources of evidence co-expression sequence similarity co-occurrence in scientific literature facilitate the identification of functionally related groups of proteins
56. Added attributes to nodes/edges Network stats Betweenness centrality (BWC) How influential (bridge) Degree centrality (DC) Hub likeness Markov Clustering Identifies strongly connected groups of proteins in the network
60. Filters, annotators and layouts Combination of these three types of tools in Ondex a more complex application case …
61. Application case 4: Bioenergy Project Use bioinformatics to support phenotype-genotype research in bioenergy crops Given a phenotypic variant is it possible to pin down the relevant genes? Develop tools to support systematic analysis of QTL regions to pin down relevant genes Identify genes implicated in biomass production in willow Prioritise genes for experimental validation Keywan Hassani-Pak Biofuel Conversion Process http://www.jgi.doe.gov/education/bioenergy/bioenergy_1.html
62. QTL and Genomic Data QTL Willow genome is not sequenced yetQTL may encompass many potentialcandidates, perhaps hundreds Poplar is the first tree with fully sequenced genome 19 Chromosomes, 45778 predicted genes 4x larger than Arabidopsis genome Not much known about the function of the genes
63. Linking genes to data sources Linked References model e.g. Poplar, Arabidopsis Willow Pathways Plant Hormones QTL Map Orthologous Markers Physical map Expression Patterns Genes Gene Function List of candidate genes linked to biological processes
64. Relevant Data Sources Release 15.10 Poplar Gene Prediction v2.0 (Jan 2010) All plants: 739,396 proteins Reviewed: 28,404 proteins (3,84%) PoplarCyc 1.0: 285 pathways, 3434 enzymes, 1363 compounds (Oct 2009) Pfam 24.0: 11,912 protein families (Oct 2009) Poplar Transcription Factors - DPTF: 2,576 putative TF (March 2007) - PlnTFDB: 2,901 putative TF (July 2009) 29,365 GO terms (Jan 2010) Poplar/ Willow QTL - work in progress - preliminary dataset available Only loading referenced publications ~15,000 articles
65. Unique Knowledge Base for Poplar Proteins annotated with functional information and publications Based on Comparative genomics and Protein familyanalysis Genes, QTLs enriched withpositionalinformation Data integration was done in Ondex
66. Ondex Genomics Layout Genomic Layout displays chromosomes, genes and QTLs Chromosomal regions and QTLs can be selected
68. Phenotypic Information in Literature HMMer: 650581 – HLH E-Value: 3.4E-7 Score: 30.0 BLAST 217086 – LAX E-Value: 8.3E-17 Score: 80.88 BLAST 217086 – BHLH63 E-Value: 8.3E-9 Score: 54.3 PMID:13130077 “LAX and SPA: major regulators of shoot branching in rice.” Poplar protein 217086 We identified two remote homologs in Rice (LAX) and in Arabidopsis (BHLH63), as well as one protein domain HLH The LAX homolog contains evidence to be a major regulator of shoot branching Hypothesis generation
Light pink – Increased virulenceLight blue – Reduced virulenceLight Green – Loss of pathogenicityYellow – Unaffected pathogenicityStar – animalCircle – plant
Virtual KO scoreis based on 3 other scores: - "extension" gives the number of paths that would be extended if a concept was added- "deletion" gives the number of paths that would be deleted if this concept was deleted- "nochange" gives the number of paths that would not be shortened/extended if this concept was deleted
IntAct4625 protein interactions (data derived from literature curation or direct user submissions)TAIR (The Arabidopsis Information Resource) – 1143 interactionsgenome sequence, gene structure, gene product information, metabolism, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publicationsBioGrid (General Repository for Interaction Datasets)collections of protein and genetic interactions from major model organism species1223 interactions for Arabidopsis derived from high-throughput studies and conventional focused studies
ATTED II (Arabidopsis thalianatrans-factor and cis-element prediction database)provides co-regulated gene relationships in Arabidopsis to estimate gene functionsgives the Pearson correlation coefficients of co-expressed genes in Arabidopsis calculated from available microarray dataNCBI PSI-BLASTidentify similarities between our reference set of proteinsMatching against Arabidopsis subset of UNIPROTCo-occurrence of protein names25,900 Medline abstracts related to Arabidopsis ThalianaIntegrated Lucene-based mapping method
Solid biomass (in the form of plants and trees) can be converted into liquid fuels (such as ethanol, methanol, and biodiesel)The challenge lies in efficient conversion,creating more energy than the input required to produce itincrease biomass yieldDevelop means to support systematic analysis of QTL regions and prioritise genes for experimental analyses identify genes controlling biomass production in willow
QTL are genomic regions that assign variations observed in a phenotype to a region on the genetic mapBiomass traits: branching, height, leaf number etc.Going from Willow to Poplar to Arabidopsis and other species
Reduced hypothesis space from 100 potential candidates to 3 hot candidates.Next steps: Cloning and transformation for experimental validation.