Combining large-scale evolutionary analyses
with multiple biological data sources to predict
human protein function


David Jones
UCL Depts. of Computer Science and Structural and Molecular Biology
Background
In UniProt, 30% of human proteins still have no functional annotations at all,
and only 0.5% have completely specific ones for all aspects.

[Figure: two charts over the GO aspects MF, BP and CC; one segment is labelled 30%]
Main approaches for function annotation
• Annotation transfers by homology
  e.g. BLAST, HMMER
  Only applicable to a subset of the data
  Has reached a plateau in terms of novel function
    annotation but provides highest quality information


• Model-classifier based using sequence features
  Limited to common and broad functions for which there
    are many examples
FFPRED - Function Prediction Pipeline
[Diagram: pipeline stages]
Novel sequence: amino acid sequence
Characteristics: structure, disorder, aa, transmem, motifs, localisation
Classification: GO term SVM, giving a posterior probability estimate
Going further – computing gene function
from multiple data sources
• FFPRED is a currently available server for human
  (and vertebrate) proteins

• It works well but is limited to predicting only the
  functional classes that it was trained to recognize
• Extending the library requires time-consuming
  training of new SVM models
• It also cannot be applied to rare functional classes
  due to limited training sets
Desirable features of a new approach


• Able to annotate all sequences

• Able to predict rare functions

• Able to offer something more than simple
  homology-based approaches

• Amenable to easy and quick updating
FunctionSpace Data Sources for H. sapiens

•   Sequence similarity
•   Signal peptides and other local features
•   Predicted secondary structure
•   Transmembrane segments
•   Predicted disordered regions
•   Domain architecture patterns
•   Gene fusion information
•   Gene co-expression
•   Protein-protein interactions



For each sequence, 49,231 features were derived
Aim
To estimate the functional similarity (a.k.a. semantic distance)
between two human proteins from their sequence features
plus available high throughput data.



[Diagram: Protein A and Protein B are compared to give a Functional Similarity Score]
Large-scale (domain-based) evolutionary
features

• Patterns of domain occurrence can provide
  valuable functional clues

• “Deeper” homology detection allows greater
  coverage

• We make use of our in-house fold/domain
  recognition method and several public domain
  libraries
pDomTHREADER Domain Coverage

Residues: 81.6% threading, 35.7% Gene3D
Sequences: 64.8% threading, 59.4% Gene3D

[Bar chart: CATH domain annotations (axis 0 to 7,000,000), public domain vs threading]

37.56% increase in domain annotations across 5.5M sequences
~1.7 million novel domain assignments over public domain data
Computational Practicalities
[Diagram: 5.5M query sequences are searched with PSI-BLAST (1 min to 3 hours) on Legion nodes against a 2Gb sequence database (5.5M seqs) to find matches and generate alignments; the results are then stored and post-processed]

“Embarrassingly parallel” application: one sequence = one job.
Ideal capacity filling task for a modern supercomputer like Legion.
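
The one sequence = one job layout can be sketched in a few lines. This is only an illustration under assumptions: `search_one` is a hypothetical stand-in for a single PSI-BLAST run, and a local thread pool stands in for the cluster scheduler. Neither comes from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def search_one(seq_id, sequence):
    """Hypothetical placeholder for one independent search job.

    A real job would run PSI-BLAST for this sequence; here we only
    return the sequence length so the sketch stays runnable.
    """
    return seq_id, len(sequence)


def run_all(sequences, max_workers=4):
    """Submit one job per sequence; the jobs share no state."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(search_one, sid, seq)
                   for sid, seq in sequences.items()]
        # Collect (seq_id, result) pairs into a dictionary.
        return dict(f.result() for f in futures)


results = run_all({"seqA": "MKTAYIAK", "seqB": "MSD"})
```

Because no job depends on another, the same pattern scales by adding workers, which is what makes the task a good fit for filling spare capacity.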
Gene Fusion Events can Predict Protein-
Protein Interactions from Sequence Data
[Diagram: H1 (CATH 3.90.850.10, fumarylacetoacetase) and H2 (CATH 3.60.15.10, beta-lactamase) occur fused as a bi-functional enzyme (3.90.850.10 + 3.60.15.10) in Mycobacterium tuberculosis, Mycobacterium paratuberculosis and Mycobacterium avium. Function labels: hydrolase activity; hydrolysis of C-N bonds; hydrolysis of C-C bonds]
A Novel Gene Fusion Discovered using CATH
domain fusion analysis

[Diagram: Phosphoglyceromutase (CATH 3.40.120.10, Alpha-D-Glucose-1,6-Bisphosphate) and DNA repair (RAD50) (CATH 3.40.50.300, P-loop nucleotide triphosphate hydrolases) occur fused as a transcription coupling repair factor (3.40.120.10 + 3.40.50.300) in Saccharopolyspora erythraea and Syntrophomonas wolfei. Function labels: oxidative stress; D-glucose metabolism; DNA repair]
Novel Gene Fusion Discovery

[Diagram: fused domain architectures]
Saccharopolyspora erythraea: 3.40.120.10, 3.40.50.300, 3.40.50.300
Syntrophomonas wolfei: 3.40.50.300, 3.40.120.10

Novel annotations



  • Rice PGM1 gene annotated as GO:0006950 response to
    stress
  • PGM3 has relationship with DNA repair sequence

   Kanazawa K, Ashida H (1991) Relationship between oxidative stress
   and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225
Domain based features

[Diagram labels: score architectures; score complexes]

7960 features
11210 features
Fusion scoring




  Each domain is a feature; the score has 2 components:

  1. Prediction quality (logistic transform of feature)
  2. Promiscuity weight, related to the number of times the sequence
     occurs as part of a fused product: w_i = log(fus_i)
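
The two components of the fusion score can be sketched as follows. The slide does not give the logistic parameters, the base of the logarithm, or how the components are combined, so this sketch assumes the standard logistic function and a natural log, and returns the two values separately. All function names are hypothetical.

```python
import math


def logistic(x):
    """Standard logistic transform (assumed form), mapping x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def promiscuity_weight(fusion_count):
    """w_i = log(fus_i); natural log is an assumption of this sketch."""
    if fusion_count <= 0:
        raise ValueError("fusion_count must be positive")
    return math.log(fusion_count)


def fusion_components(raw_feature, fusion_count):
    """Return (prediction quality, promiscuity weight) without combining them."""
    return logistic(raw_feature), promiscuity_weight(fusion_count)


quality, weight = fusion_components(raw_feature=0.0, fusion_count=1)
```

With a single observed fusion the weight is zero, and the weight grows only slowly as the fusion count increases.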
Integration of “External” Features:
Microarray Expression Data
[Figure: normalised microarray datasets; expression profiles of Gene A and Gene B plotted as probe signal (log2, axis 0 to 14) across 16 experiments (conditions) and compared by Pearson correlation (R)]
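
As a minimal sketch of the co-expression measure named on this slide, the Pearson correlation between two expression profiles can be computed directly from its definition. The profile values below are made up for illustration and are not taken from the talk.

```python
import math


def pearson(x, y):
    """Pearson correlation R between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


# Illustrative log2 probe signals for two genes over four conditions.
gene_a = [2.0, 4.0, 6.0, 8.0]
gene_b = [1.0, 2.0, 3.0, 4.0]
r = pearson(gene_a, gene_b)
```

Values of R near 1 indicate that the two genes rise and fall together across the conditions; values near -1 indicate opposite behaviour.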
Biclustering Microarray Expression Data




[Figure: two example biclusters: zinc binding sequences (global correlation 0.42) and a set of transcription factors (global correlation 0.48)]

23912 features generated from biclustering of 2346 publicly available microarrays
(81 experiments) using the BIMAX algorithm
FunctionSpace: Two-stage Integration of Data


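[Diagram: feature vectors for Protein A and Protein B feed eleven first-stage SVMs (SVMsw, SVMloc, SVMss, SVMtm, SVMdis, SVMdpc, SVMgfc, SVMdpp, SVMgfp, SVMge, SVMppi); their outputs feed a second-stage SVM (SVMfsc), which gives the Functional Similarity Score]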
A 3-D Projection of Annotated Human Proteins

• 49,231 dimensions first
  reduced to 11 dimensions by
  SVM regression with 11
  different groups of features
• Each protein is here
  represented as a point in this
  derived 11-D feature space
  projected into 3-D
• Colouring is according to
  functional similarity which
  shows that proteins with similar
  functions (warmer colours)
  cluster strongly in this space
• 75% of nearest neighbour pairs
  share common GO terms
Individual Feature Contributions
[Chart: Matthews Correlation Coefficient for individual features]
Function Annotation Results for 20674
Unannotated IPI Human Sequences




 Each sequence is classed “Easy”, “Medium” or “Hard” depending on
 degree of homology to functionally annotated proteins in UNIPROT.
Preliminary Results
In 2009 FunctionSpace produced GO term predictions for 19678 IPI
uncharacterized human sequences. 2746 have been annotated since.
Measure                   MF      BP
% Exact matches           16%     9%
Mean semantic distance    -1.3    -1.7

[Charts for MF and BP: axis runs from less specific to more specific]
Initial considerations for CAFA


•   50,000 sequences
•   11 eukaryotic & 7 prokaryotic species
•   High specificity annotations needed
•   Partial descriptive text already in Swiss-Prot/Uniprot for some
    entries

• FFPRED/FunctionSpace would not be enough

• Need to incorporate textual information from databases
  and comprehensive homology(orthology)-derived labels

• Need to get all this working in a few months!
Best Laid Plans for CAFA


• Plan A
   – Build separate annotation pipelines for missing data
   – Calibrate each pipeline according to precision values derived from
     benchmark on 500 highly annotated Swiss-Prot entries
   – Combine pipeline annotations using high-level classifier (SVM or Naive
     Bayes)


• Plan B
   – No time to build high-level classifier!
   – Combine annotation sources using heuristic graphical approach

• Hope for the best!
  (and expect the worst...)
GO term prediction from Swiss-Prot
 text-mining

• For targets which already had
  descriptive text, keywords or
  comments in Swiss-Prot, GO terms
  were assigned using a naive Bayes
  text-classification approach
• Single words and groups of 2 and 3
  words were counted
• Words occurring in different Swiss-Prot
  record types were distinguished in the
  analysis, and some simple pre-parsing
  of feature (FT) records was carried out
  in addition.
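
The counting of single words and groups of 2 and 3 words described above can be sketched as follows. This covers only the feature-counting step, not the naive Bayes classifier itself, and the lower-casing and whitespace tokenisation are assumptions of this sketch rather than details from the slide.

```python
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count all word n-grams of length 1..max_n in a piece of text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("DNA repair protein involved in DNA repair")
```

To distinguish words from different Swiss-Prot record types, one option would be to prefix each token with its record type before counting; that is a suggestion, not something the slide specifies.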
Homology-based annotation sources

• PSI-BLAST searches against Uniprot
    – Low E-value threshold to ensure close homologues used for
      annotation transfer
    – Alignment length threshold to avoid domain problem
• Transfer of annotations from orthologues
    – EggNOG 2.0
    – More reliable GO term transfer than for PSI-BLAST but lower
      coverage
• Profile-profile searches against Swiss-Prot
    – Low reliability transfer from very distant homologues
    – Improves coverage where needed (at expense of specificity)
Heuristic back-propagation of precision
estimates


[Diagram: back-propagation of precision estimates; back-propagation is repeated for each annotation source to define a consensus for each node]

   P' = 1 - (1 - P)(1 - Q)
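
One possible reading of this update rule is sketched below. It assumes that P is the current precision estimate at a GO term and Q is an estimate arriving from one of its child terms; the slide does not define P and Q explicitly, and the graph traversal here is an illustrative choice, not the authors' implementation.

```python
def combine(p, q):
    """P' = 1 - (1 - P)(1 - Q): combine two precision estimates."""
    return 1.0 - (1.0 - p) * (1.0 - q)


def back_propagate(scores, parents):
    """Propagate estimates from child terms up to their parents.

    scores:  term -> initial precision estimate (terms not listed start at 0)
    parents: term -> list of parent terms (must form an acyclic graph)
    """
    result = dict(scores)
    terms = set(scores) | set(parents)
    for plist in parents.values():
        terms.update(plist)
    # Count unprocessed children so each term is visited after all of them.
    pending = {t: 0 for t in terms}
    for plist in parents.values():
        for p in plist:
            pending[p] += 1
    ready = [t for t in terms if pending[t] == 0]
    while ready:
        term = ready.pop()
        q = result.get(term, 0.0)
        for p in parents.get(term, []):
            result[p] = combine(result.get(p, 0.0), q)
            pending[p] -= 1
            if pending[p] == 0:
                ready.append(p)
    return result


updated = back_propagate({"child": 0.5, "parent": 0.2}, {"child": ["parent"]})
```

In this example the parent estimate rises from 0.2 to 1 - 0.8 × 0.5 = 0.6, while the child keeps its own value.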
Final steps
• After back-propagation, all referenced GO terms
  are ranked according to final confidence scores

• To reduce conflicting annotations, pairs of terms
  with zero observed co-occurrence frequency in
  GOA are subjected to pairwise tournament
  selection.

• Results submitted to server using the
  mouse-window-cut-paste-click-submit
  algorithm
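
The conflict-reduction step for terms that never co-occur can be sketched as a greedy pairwise contest. The slide does not say how each tournament is decided, so this sketch assumes the term with the higher final confidence wins; the function name and data layout are hypothetical.

```python
def resolve_conflicts(confidence, cooccurrence):
    """Keep terms in rank order, dropping any that conflict with a kept term.

    confidence:   term -> final confidence score
    cooccurrence: frozenset({term_a, term_b}) -> observed co-occurrence count;
                  pairs that are missing are treated as never co-occurring.
    """
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    kept = []
    for term in ranked:
        # A term survives only if it co-occurs with every higher-ranked survivor.
        if all(cooccurrence.get(frozenset((term, k)), 0) > 0 for k in kept):
            kept.append(term)
    return kept


confidence = {"a": 0.9, "b": 0.8, "c": 0.7}
cooccurrence = {
    frozenset(("a", "b")): 0,  # a and b are never annotated together
    frozenset(("a", "c")): 5,
    frozenset(("b", "c")): 3,
}
kept = resolve_conflicts(confidence, cooccurrence)
```

Here term "b" loses to the higher-scoring "a", while "c" is kept because it co-occurs with "a".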
CASP vs CAFA from a Predictor’s Point of
View
• Number of targets
   – Manual vs automated approaches
• Difficulty of targets
   – A major limit in driving CASP forwards
• Assessment
   – Hard to pre-judge impact of decisions made during
     prediction season
• Tools for the community
   – Standards and methods in CASP have been very useful
• Getting the word out to the wider community
Acknowledgements

 Anna Lobley
 Domenico Cozzetto
 Daniel Buchan



 Kevin Bryson
 Christine Orengo

Contenu connexe

Tendances

Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...Integrated DNA Technologies
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishingNikolay Vyahhi
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataThomas Keane
 
subtractive hybridization
subtractive hybridizationsubtractive hybridization
subtractive hybridizationSakshi Saxena
 
Genome editing & targeting tools
Genome editing & targeting toolsGenome editing & targeting tools
Genome editing & targeting toolsS Rasouli
 
Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...PerkinElmer, Inc.
 
Cpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesCpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesIntegrated DNA Technologies
 
3D-Screen Technology overview
3D-Screen Technology overview3D-Screen Technology overview
3D-Screen Technology overviewpguedat
 
J.1747 0285.2009.00940.x
J.1747 0285.2009.00940.xJ.1747 0285.2009.00940.x
J.1747 0285.2009.00940.xdaisydew
 
UIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IAUIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IARandy Simpson
 
Protein engineering and its techniques himanshu
Protein engineering and its techniques himanshuProtein engineering and its techniques himanshu
Protein engineering and its techniques himanshuhimanshu kamboj
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assemblyRamya P
 
Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...Integrated DNA Technologies
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Integrated DNA Technologies
 

Tendances (20)

Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
Ribonucleoprotein delivery of CRISPR-Cas9 reagents for increased gene editing...
 
Genome Assembly
Genome AssemblyGenome Assembly
Genome Assembly
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
Assembly and finishing
Assembly and finishingAssembly and finishing
Assembly and finishing
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Suman (2)
Suman (2)Suman (2)
Suman (2)
 
subtractive hybridization
subtractive hybridizationsubtractive hybridization
subtractive hybridization
 
Thesis
ThesisThesis
Thesis
 
Genome editing & targeting tools
Genome editing & targeting toolsGenome editing & targeting tools
Genome editing & targeting tools
 
Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...Simultaneious monitoring of phosphorylation events and protein protein intera...
Simultaneious monitoring of phosphorylation events and protein protein intera...
 
Cpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexesCpf1-based genome editing using ribonucleoprotein complexes
Cpf1-based genome editing using ribonucleoprotein complexes
 
3D-Screen Technology overview
3D-Screen Technology overview3D-Screen Technology overview
3D-Screen Technology overview
 
J.1747 0285.2009.00940.x
J.1747 0285.2009.00940.xJ.1747 0285.2009.00940.x
J.1747 0285.2009.00940.x
 
UIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IAUIowa 2005 - Iowa City, IA
UIowa 2005 - Iowa City, IA
 
Protein Engineering
Protein EngineeringProtein Engineering
Protein Engineering
 
RNA-seq Analysis
RNA-seq AnalysisRNA-seq Analysis
RNA-seq Analysis
 
Protein engineering and its techniques himanshu
Protein engineering and its techniques himanshuProtein engineering and its techniques himanshu
Protein engineering and its techniques himanshu
 
Sequence assembly
Sequence assemblySequence assembly
Sequence assembly
 
Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...Getting started with CRISPR: a review of gene knockout and homology-directed ...
Getting started with CRISPR: a review of gene knockout and homology-directed ...
 
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
Alt-R™ CRISPR-Cas9 System: Ribonucleoprotein delivery optimization for improv...
 

En vedette

Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010Wolfgang_Polt
 
Zongshen Cyclone Fly
Zongshen Cyclone FlyZongshen Cyclone Fly
Zongshen Cyclone Flyhi.interest
 
Ignobel2010
Ignobel2010Ignobel2010
Ignobel2010Iddo
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012Iddo
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacaoIddo
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011Iddo
 
Manual Vaic Mp9 T800
Manual Vaic Mp9 T800Manual Vaic Mp9 T800
Manual Vaic Mp9 T800Psyfers
 
Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERAIddo
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Iddo
 
Katrina Photos Short Version
Katrina Photos Short VersionKatrina Photos Short Version
Katrina Photos Short Versionhomestarmy26
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Iddo
 
Afp cafa djuric
Afp cafa djuricAfp cafa djuric
Afp cafa djuricIddo
 
A Year In the Western English Channel
A Year In the Western English ChannelA Year In the Western English Channel
A Year In the Western English ChannelIddo
 
Innovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles DebartInnovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles DebartCarles Debart
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentationgawump
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryIddo
 
Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005Iddo
 

En vedette (20)

Biotech Clusters - Nutopya
Biotech Clusters - NutopyaBiotech Clusters - Nutopya
Biotech Clusters - Nutopya
 
Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010Polt Presentation Priority Setting Vienna 18 02 2010
Polt Presentation Priority Setting Vienna 18 02 2010
 
The Chinese Way of Innovation
The Chinese Way of InnovationThe Chinese Way of Innovation
The Chinese Way of Innovation
 
Zongshen Cyclone Fly
Zongshen Cyclone FlyZongshen Cyclone Fly
Zongshen Cyclone Fly
 
Ignobel2010
Ignobel2010Ignobel2010
Ignobel2010
 
Ismb grant-writing-2012
Ismb grant-writing-2012Ismb grant-writing-2012
Ismb grant-writing-2012
 
Go camp 2010_cacao
Go camp 2010_cacaoGo camp 2010_cacao
Go camp 2010_cacao
 
David Jones AFP/CAFA2011
David Jones AFP/CAFA2011David Jones AFP/CAFA2011
David Jones AFP/CAFA2011
 
Manual Vaic Mp9 T800
Manual Vaic Mp9 T800Manual Vaic Mp9 T800
Manual Vaic Mp9 T800
 
Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERA
 
Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013Ewan Birney Biocuration 2013
Ewan Birney Biocuration 2013
 
Portfolio 01
Portfolio 01Portfolio 01
Portfolio 01
 
Katrina Photos Short Version
Katrina Photos Short VersionKatrina Photos Short Version
Katrina Photos Short Version
 
Metagenomics Biocuration 2013
Metagenomics Biocuration 2013Metagenomics Biocuration 2013
Metagenomics Biocuration 2013
 
Afp cafa djuric
Afp cafa djuricAfp cafa djuric
Afp cafa djuric
 
A Year In the Western English Channel
A Year In the Western English ChannelA Year In the Western English Channel
A Year In the Western English Channel
 
Innovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles DebartInnovation in China: Zonghsen case study. Carles Debart
Innovation in China: Zonghsen case study. Carles Debart
 
Thesis Presentation
Thesis PresentationThesis Presentation
Thesis Presentation
 
Genome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin DiscoveryGenome Informatics 2015 Bacteriocin Discovery
Genome Informatics 2015 Bacteriocin Discovery
 
Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005Critical Assessment of Function Annotation, 2005
Critical Assessment of Function Annotation, 2005
 

Similaire à Vienna afp2011

Microarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarraysMicroarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarraysayeshasattarsandhu
 
Protein function and bioinformatics
Protein function and bioinformaticsProtein function and bioinformatics
Protein function and bioinformaticsNeil Saunders
 
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Copenhagenomics
 
From sample-to-spray: high performance workflow for top down protein analysis
From sample-to-spray: high performance workflow for top down protein analysisFrom sample-to-spray: high performance workflow for top down protein analysis
From sample-to-spray: high performance workflow for top down protein analysisExpedeon
 
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...Mark Berjanskii
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology Sean Ekins
 
Using ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyUsing ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyChris Evelo
 
Functional annotation of invertebrate genomes
Functional annotation of invertebrate genomesFunctional annotation of invertebrate genomes
Functional annotation of invertebrate genomesSurya Saha
 
Thesis def
Thesis defThesis def
Thesis defJay Vyas
 
GeneArt® services - Gene synthesis through protein production
GeneArt® services - Gene synthesis through protein productionGeneArt® services - Gene synthesis through protein production
GeneArt® services - Gene synthesis through protein productionThermo Fisher Scientific
 
HIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
HIV Vaccines Process Development & Manufacturing - Pitfalls & PossibilitiesHIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
HIV Vaccines Process Development & Manufacturing - Pitfalls & PossibilitiesKBI Biopharma
 
20081216 05袁國芳 紅麴菌基因體計畫及基因研究
20081216 05袁國芳 紅麴菌基因體計畫及基因研究20081216 05袁國芳 紅麴菌基因體計畫及基因研究
20081216 05袁國芳 紅麴菌基因體計畫及基因研究Monascus2008
 
Giab agbt small_var_2020
Giab agbt small_var_2020Giab agbt small_var_2020
Giab agbt small_var_2020GenomeInABottle
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_coursehansjansen9999
 
PomBase conventions for improving annotation depth, breadth, consistency and ...
PomBase conventions for improving annotation depth, breadth, consistency and ...PomBase conventions for improving annotation depth, breadth, consistency and ...
PomBase conventions for improving annotation depth, breadth, consistency and ...Valerie Wood
 
GMOD 2014 MAKER Lecture
GMOD 2014 MAKER LectureGMOD 2014 MAKER Lecture
GMOD 2014 MAKER Lecturebarrymoore
 

Similaire à Vienna afp2011 (20)

Microarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarraysMicroarray biotechnologg ppy dna microarrays
Microarray biotechnologg ppy dna microarrays
 
Protein function and bioinformatics
Protein function and bioinformaticsProtein function and bioinformatics
Protein function and bioinformatics
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
Discovery of Cow Rumen Biomass-Degrading Genes and Genomes through DNA Sequen...
 
From sample-to-spray: high performance workflow for top down protein analysis
From sample-to-spray: high performance workflow for top down protein analysisFrom sample-to-spray: high performance workflow for top down protein analysis
From sample-to-spray: high performance workflow for top down protein analysis
 
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
Humans Aliens And E Harmony Or Why There Is No Such Thing As A Free Lunch In ...
 
SOT short course on computational toxicology
SOT short course on computational toxicology SOT short course on computational toxicology
SOT short course on computational toxicology
 
Using ontologies to do integrative systems biology
Using ontologies to do integrative systems biologyUsing ontologies to do integrative systems biology
Using ontologies to do integrative systems biology
 
Functional annotation of invertebrate genomes
Functional annotation of invertebrate genomesFunctional annotation of invertebrate genomes
Functional annotation of invertebrate genomes
 
Discovering drugs (I. Belda)
Discovering drugs (I. Belda)Discovering drugs (I. Belda)
Discovering drugs (I. Belda)
 
Thesis def
Thesis defThesis def
Thesis def
 
HPLC2005
HPLC2005HPLC2005
HPLC2005
 
GeneArt® services - Gene synthesis through protein production
GeneArt® services - Gene synthesis through protein productionGeneArt® services - Gene synthesis through protein production
GeneArt® services - Gene synthesis through protein production
 
HIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
HIV Vaccines Process Development & Manufacturing - Pitfalls & PossibilitiesHIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
HIV Vaccines Process Development & Manufacturing - Pitfalls & Possibilities
 
Seminar 20150920.2
Seminar 20150920.2Seminar 20150920.2
Seminar 20150920.2
 
20081216 05袁國芳 紅麴菌基因體計畫及基因研究
20081216 05袁國芳 紅麴菌基因體計畫及基因研究20081216 05袁國芳 紅麴菌基因體計畫及基因研究
20081216 05袁國芳 紅麴菌基因體計畫及基因研究
 
Giab agbt small_var_2020
Giab agbt small_var_2020Giab agbt small_var_2020
Giab agbt small_var_2020
 
20150601 bio sb_assembly_course
20150601 bio sb_assembly_course20150601 bio sb_assembly_course
20150601 bio sb_assembly_course
 
PomBase conventions for improving annotation depth, breadth, consistency and ...
PomBase conventions for improving annotation depth, breadth, consistency and ...PomBase conventions for improving annotation depth, breadth, consistency and ...
PomBase conventions for improving annotation depth, breadth, consistency and ...
 
GMOD 2014 MAKER Lecture
GMOD 2014 MAKER LectureGMOD 2014 MAKER Lecture
GMOD 2014 MAKER Lecture
 


Vienna afp2011

  • 1. Combining large-scale evolutionary analyses with multiple biological data sources to predict human protein function David Jones UCL Depts. of Computer Science and Structural and Molecular Biology
  • 2. Background: In Uniprot, 30% of human proteins still have no functional annotations at all … and only 0.5% have completely specific ones for all aspects. [Figure: diagrams over the CC, MF and BP aspects, labelled 30%]
  • 3. Main approaches for function annotation • Annotation transfers by homology e.g. BLAST, HMMER Only applicable to a subset of the data Has reached a plateau in terms of novel function annotation but provides highest quality information • Model-classifier based using sequence features Limited to common and broad functions for which there are many examples
  • 4. FFPRED - Function Prediction Pipeline: Novel sequence (amino acid sequence) → Characteristics (structure, disorder, aa, transmem, motifs, localisation) → Classification (GO Term SVM) → posterior probability estimate
  • 5.
  • 6.
  • 7. Going further – computing gene function from multiple data sources • FFPRED is a currently available server for human (and vertebrate) proteins • It works well but is limited to predicting only the functional classes that it was trained to recognize • Extending the library requires time consuming training of new SVM models • It also cannot be applied to rare functional classes due to limited training sets
  • 8. Desirable features of a new approach • Able to annotate all sequences • Able to predict rare functions • Able to offer something more than simple homology-based approaches • Amenable to easy and quick updating
  • 9. FunctionSpace Data Sources for H. sapiens • Sequence similarity • Signal peptides and other local features • Predicted secondary structure • Transmembrane segments • Predicted disordered regions • Domain architecture patterns • Gene fusion information • Gene co-expression • Protein-protein interactions For each sequence 49,231 features were derived
  • 10. Aim To estimate the functional similarity (a.k.a. semantic distance) between two human proteins from their sequence features plus available high throughput data. Protein A Functional Similarity Score Protein B
  • 11. Large-scale (domain-based) evolutionary features • Patterns of domain occurrence can provide valuable functional clues • “Deeper” homology detection allows greater coverage • We make use of our in-house fold/domain recognition method and several public domain libraries
  • 12. pDomTHREADER Domain Coverage. [Figure: residue coverage and sequence counts (axis 0 to 7,000,000) of domain annotations, Gene3d (CATH, public domain) versus threading; percentages shown: 35.7%, 81.6%, 64.8%, 59.4%.] 37.56% increase in domain annotations across 5.5M sequences; ~1.7 million novel domain assignments over public domain data
  • 13. Computational Practicalities: 5.5M query sequences are distributed across Legion nodes; each is searched with PSIBLAST against a 2Gb sequence database (5.5M seqs) to find matches and generate alignments (1 min to 3 hours), and the results are then stored and post-processed. “Embarrassingly parallel” application: one sequence = one job. Ideal capacity filling task for a modern supercomputer like Legion.
  • 14. Gene Fusion Events can Predict Protein-Protein Interactions from Sequence Data. [Figure: H1 (3.90.850.10, fumaryl aceto acetase) and H2 (3.60.15.10, beta lactamase); bi-functional enzyme 3.90.850.10 + 3.60.15.10; species shown: Mycobacterium tuberculosis, Mycobacterium paratuberculosis, Mycobacterium avium; functions shown: hydrolase activity, hydrolysis of C-N bonds, hydrolysis of C-C bonds]
  • 15. A Novel Gene Fusion Discovered using CATH domain fusion analysis Phosphoglyceromutase DNA repair (RAD50) 3.40.120.10 3.40.50.300 Alpha-D-Glucose-1,6-Bisphosphate P-loop nucleotide triphosphate hydrolases 3.40.120.10 Transcription coupling repair factor 3.40.120.10 3.40.50.300 Saccharopolyspora erythraea Syntrophomonas wolfei Oxidative stress D-glucose metabolism DNA repair
  • 16. Novel Gene Fusion Discovery 3.40.120.10 3.40.50.300 3.40.50.300 Saccharopolyspora erythraea 3.40.50.300 3.40.120.10 Syntrophomonas wolfei Novel annotations • Rice PGM1 gene annotated as GO:0006950 response to stress • PGM3 has relationship with DNA repair sequence Kanazawa K, Ashida H (1991) Relationship between oxidative stress and hepatic phosphoglucomutase activity in rats. Int J Tissue React 13: 225
  • 17. Domain based features Score architectures Score complexes 7960 features 11210 features
  • 18. Fusion scoring: Each domain is a feature; the score has 2 components: 1. Prediction quality (logistic transform of feature) 2. Promiscuity weight related to the number of times the sequence occurs as part of a fused product, $w_i = \log(\mathrm{fus}_i)$
  • 19. Integration of “External” Features: Microarray Expression Data. [Figure: probe signal (log2) of Gene A and Gene B across 16 experiments (conditions) from normalised microarray datasets, compared by Pearson correlation (R)]
  • 20. Biclustering Microarray Expression Data. [Figure panels: zinc binding sequences; a set of transcription factors; global correlations 0.42 and 0.48.] 23912 features generated from biclustering of 2346 publicly available microarrays (81 experiments) using BIMAX algorithm
  • 21. FunctionSpace: Two-stage Integration of Data. Feature vectors for Protein A and Protein B → SVMsw, SVMloc, SVMss, SVMtm, SVMdis, SVMdpc, SVMfsc, SVMgfc, SVMdpp, SVMgfp, SVMge, SVMppi → Functional Similarity Score
  • 22. A 3-D Projection of Annotated Human Proteins • 49,231 dimensions first reduced to 11 dimensions by SVM regression with 11 different groups of features • Each protein is here represented as a point in this derived 11-D feature space projected into 3-D • Colouring is according to functional similarity which shows that proteins with similar functions (warmer colours) cluster strongly in this space • 75% of nearest neighbour pairs share common GO terms
  • 23. Individual Feature Contributions Matthews Correlation Coefficient
  • 24. Function Annotation Results for 20674 Unannotated IPI Human Sequences Each sequence is classed “Easy”, “Medium” or “Hard” depending on degree of homology to functionally annotated proteins in UNIPROT.
  • 25. Preliminary Results: In 2009 FunctionSpace produced GO term predictions for 19678 IPI uncharacterized human sequences. 2746 have been annotated since. % Exact Matches: MF 16%, BP 9%. Mean semantic distance (scale from less specific to more specific): MF -1.3, BP -1.7.
  • 26. Initial considerations for CAFA • 50,000 sequences • 11 eukaryotic & 7 prokaryotic species • High specificity annotations needed • Partial descriptive text already in Swiss-Prot/Uniprot for some entries • FFPRED/FunctionSpace would not be enough • Need to incorporate textual information from databases and comprehensive homology(orthology)-derived labels • Need to get all this working in a few months!
  • 27. Best Laid Plans for CAFA • Plan A – Build separate annotation pipelines for missing data – Calibrate each pipeline according to precision values derived from benchmark on 500 highly annotated Swiss-Prot entries – Combine pipeline annotations using high-level classifier (SVM or Naive Bayes) • Plan B – No time to build high-level classifier! – Combine annotation sources using heuristic graphical approach • Hope for the best! (and expect the worst...)
  • 28. GO term prediction from Swiss-Prot text-mining • For targets which already had descriptive text, keywords or comments in Swiss-Prot, GO terms were assigned using a naive Bayes text-classification approach • Single words and groups of 2 and 3 words were counted • Words occurring in different Swiss-Prot record types were distinguished in the analysis, and some simple pre-parsing of feature (FT) records was carried out in addition.
  • 29. Homology-based annotation sources • PSI-BLAST searches against Uniprot – Low E-value threshold to ensure close homologues used for annotation transfer – Alignment length threshold to avoid domain problem • Transfer of annotations from orthologues – EggNOG 2.0 – More reliable GO term transfer than for PSI-BLAST but lower coverage • Profile-profile searches against Swiss-Prot – Low reliability transfer from very distant homologues – Improves coverage where needed (at expense of specificity)
  • 30. Heuristic back-propagation of precision estimates: Back-propagation of precision estimates is repeated for each annotation source to define a consensus for each node: $P' = 1 - (1 - P)(1 - Q)$
  • 31. Final steps • After back-propagation, all referenced GO terms are ranked according to final confidence scores • To reduce conflicting annotations, pairs of terms with zero observed co-occurrence frequency in GOA are subjected to pairwise tournament selection. • Results submitted to server using the mouse-window-cut-paste-click-submit algorithm
  • 32. CASP vs CAFA from a Predictor’s Point of View • Number of targets – Manual vs automated approaches • Difficulty of targets – A major limit in driving CASP forwards • Assessment – Hard to pre-judge impact of decisions made during prediction season • Tools for the community – Standards and methods in CASP have been very useful • Getting the word out to the wider community
  • 33. Acknowledgements Anna Lobley Domenico Cozzetto Daniel Buchan Kevin Bryson Christine Orengo
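
Slide 19 compares the expression profiles of two genes by Pearson correlation (R). The following is a minimal sketch of that calculation, assuming each profile is a plain Python list of log2 probe signals over the same experiments. The function name `pearson_r` and the example values are illustrative and not taken from the talk.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviations of each measurement from its profile mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical log2 probe signals for two genes over six conditions.
gene_a = [8.1, 9.4, 10.2, 7.5, 11.0, 9.0]
gene_b = [7.9, 9.0, 10.5, 7.2, 10.8, 9.3]
r_ab = pearson_r(gene_a, gene_b)
```

A value of `r_ab` close to 1 indicates that the two genes rise and fall together across conditions, which is the kind of co-expression signal the slide describes.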
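
Slide 23 reports individual feature contributions as Matthews Correlation Coefficients. As a reminder of how that measure is computed from a binary confusion matrix, here is a small sketch. The function name and the convention of returning 0.0 when the denominator is zero are assumptions of this example, not details given in the talk.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an empty row or column gives an MCC of 0.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike raw accuracy, the coefficient stays informative when positives are rare, which is the usual situation for specific GO terms.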
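
Slide 30 combines precision estimates with the rule P' = 1 - (1 - P)(1 - Q) while propagating them back through the GO graph. The sketch below shows one plausible reading of that step and should not be taken as the authors' implementation. It assumes that P is a node's current estimate, that Q is the precision of a predicted term at or below that node, and that each prediction contributes once to itself and once to each of its ancestors. The toy graph, term names and function names are hypothetical.

```python
def combine(p, q):
    """Combination rule from slide 30: P' = 1 - (1 - P)(1 - Q)."""
    return 1.0 - (1.0 - p) * (1.0 - q)


def ancestors(term, parents):
    """All ancestors of `term` in a DAG given as {child: [parents]}."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def back_propagate(sources, parents):
    """Accumulate a consensus score per node over all annotation sources.

    `sources` is a list of {term: precision} dicts, one per source.
    """
    scores = {}
    for predictions in sources:
        for term, precision in predictions.items():
            # The predicted term and every ancestor receive the evidence once.
            for node in {term} | ancestors(term, parents):
                scores[node] = combine(scores.get(node, 0.0), precision)
    return scores


# Toy hierarchy: C and D are children of B, and B is a child of A.
toy_parents = {"C": ["B"], "D": ["B"], "B": ["A"]}
toy_sources = [{"C": 0.5}, {"D": 0.4}]
consensus = back_propagate(toy_sources, toy_parents)
```

In the toy case the shared ancestors B and A end up with a higher score (0.7) than either specific prediction, reflecting that two independent pieces of evidence agree on the more general term.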