Protein loop classification using Artificial Neural Networks

Armando Vieira (1) and Baldomero Oliva (2)
(1) ISEP and Centro de Física Computacional, Coimbra, Portugal
    www.defi.isep.ipp.pt/~asv
(2) Structural Bioinformatics Laboratory (GRIB), IMIM/Universitat Pompeu Fabra, Barcelona, Spain
XXI: the century of BIO
BIOINFORMATICS
joining two worlds apart
Outline
Brief review of protein structure
Statement of the problem and why it is so hard
Data pre-processing, corrections, updates
and beyond multiple alignments…
Neural Networks in protein structure
prediction
HLVQ
Results and future work
Proteins

All proteins are chains built from 20 types of amino acids
Not all chains of amino acids are proteins
Proteins fold rapidly and repeatedly
Proteins are the machinery of life
Essential to all (known) organisms
The Gist of it



Amino acid sequence → Physical structure → Function
Typical globular protein
MEKKEFHIVAETG
IHARPATLLVQTA
SLFNSDINLETLG
KSVNLKSIMGVMS
LGVGQGSDVTITV
DGADEADGMAAI
VETLQLQGLAQ
Coarse-Grained Model
ψ runs from +180 (top row) to -180 (bottom row); φ runs from -180 (left column) to +180 (right column)

  b  b  b  p  o  M  e  e  e
  b  b  b  p  o  M  M  e  e
  b  b  b  p  .  l  l  s  e
  a  a  a  T  .  l  l  g  N
  N  a  a  a  .  U  l  g  N
  N  a  a  a  .  U  g  g  N
  I  a  a  a  .  G  G  G  I
  e  F  F  F  o  e  e  e  e
  b  b  b  p  o  e  e  e  e
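As a minimal sketch of how such a coarse-grained grid could be used, the function below maps a pair of backbone angles (φ, ψ) to the letter of the cell it falls in. The 9 × 9 letter grid is copied from the slide; the assumption that every cell spans exactly 40° is ours, as are the names `GRID`, `CELL`, and `conformation_letter`.

```python
# Letter grid copied from the slide. Rows run from psi = +180 (top) to
# psi = -180 (bottom); columns run from phi = -180 (left) to phi = +180 (right).
GRID = [
    "bbbpoMeee",
    "bbbpoMMee",
    "bbbp.llse",
    "aaaT.llgN",
    "Naaa.UlgN",
    "Naaa.UggN",
    "Iaaa.GGGI",
    "eFFFoeeee",
    "bbbpoeeee",
]

# Assumption: the nine cells along each axis are uniform, 40 degrees wide.
CELL = 360.0 / 9


def conformation_letter(phi, psi):
    """Return the grid letter for angles given in degrees within [-180, 180]."""
    if not (-180.0 <= phi <= 180.0 and -180.0 <= psi <= 180.0):
        raise ValueError("angles must lie in [-180, 180] degrees")
    # Clamp so that an angle of exactly +180 (phi) or -180 (psi) stays in the last cell.
    col = min(int((phi + 180.0) // CELL), 8)
    row = min(int((180.0 - psi) // CELL), 8)
    return GRID[row][col]


# Typical helical angles land in an 'a' cell, typical strand angles in a 'b' cell.
helix = conformation_letter(-60.0, -45.0)
strand = conformation_letter(-120.0, 130.0)
```

Applying the function residue by residue would turn a list of (φ, ψ) pairs into a string over the grid's alphabet.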
Ramachandran Alphabet
[Figure: Ramachandran plot with φ on the horizontal axis and ψ on the vertical axis, both from -180° to 180°, with regions labelled B, A, G and E]
5-letter alphabet
Residue Sequence    3° Structure



 MEKKEFHIVAET      ACCDECBAABDE
 GIHARPATLLVQT     CBDABCDBEABD
 ASLFNSDINLETL     BCBDBAEBDBDB
 GKSVNLKSIMGV      AEBABDCBBDBA
 MSLGVGQGSDVT      DDCBDBCBDBEB
 ITVDGADEADGM      DBCBBDCAABDE
 AAIVETLQLQGLA     DCDCEAABACAA
 Q...              AADC…
What shall we do?
• Ab initio:
  Quantum mechanics + big computers + large number of configurations = huge problems…

• Machine Learning:
  Use known cases to learn a suitable map: sequence → structure
Machine Learning Approach
Artificial Neural Networks
• A problem-solving paradigm modeled after the
  physiological functioning of the human brain.

• Neurons in the brain are modeled by computational nodes, and synapses by the weighted connections between them.

• The firing of a neuron is modeled by input, output, and
  threshold functions.

• The network “learns” based on problems to which answers
  are known (supervised learning).

• The network can then produce answers to entirely new
  problems of the same type.
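The bullets above can be illustrated with a very small feed-forward pass. This is only a sketch in plain Python: the sigmoid activation, the layer sizes, and all weight values are hypothetical choices for illustration and are not taken from the talk.

```python
import math


def sigmoid(x):
    """Smooth threshold function squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Each node sums its weighted inputs, adds a bias, and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> one hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)


# Hypothetical weights: 2 inputs, 2 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, -1.0]]
out_b = [0.0]

output = forward([1.0, 0.0], hidden_w, hidden_b, out_w, out_b)
```

Supervised learning then amounts to adjusting the weights and biases so that `forward` reproduces the known answers on the training cases.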
Neural Networks
[Figure: feed-forward network with an input layer, hidden layers, and an output layer]
Overfitting – high risk!




A less complicated hypothesis can have a lower error rate on new data
Hidden Layer Vector Quantization (HLVQ)

[Figure: two scatter plots of "o" and "x" samples, labelled "Traditional NN" and "HLVQ"]

Main advantage: detects and corrects predictions for outliers
Loops, loops everywhere!!!
Look for a loop…
Geometry of the Motif
Loop Types

α−α: α-helix - α-helix
α−β: α-helix - β-strand
β−α: β-strand - α-helix
β-hairpin: β-strand - β-strand
β-link: β-strand - β-strand
α−α
Similar conformation: aa{b}aa / aa{p}aa
Identical geometry: (4,6)(0,45)(45,90)(180,225)

[Figure: loops of classes 1.3.1 aa{p}aa and 1.1.2 aa{b}aa, annotated "Pro 75%" and "Ser 75%"]
© Baldomero Oliva
Class α−α
ArchDB database

~20 000 loops classified into ~3000 classes.
Example class identifier: EE-3.4.1
Format: loop type - loop size . consensus . motif

TASK: classify a loop from sequence alone.
If not possible, get as much information as possible.
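The identifier format shown on the slide (loop type - loop size . consensus . motif, as in EE-3.4.1) can be split into its fields with a short parser. This is a sketch under stated assumptions: it treats the loop type as letters and the other three fields as integers, which fits the one example on the slide but has not been checked against ArchDB itself. The names `LoopClass` and `parse_loop_class` are hypothetical.

```python
import re
from typing import NamedTuple


class LoopClass(NamedTuple):
    loop_type: str
    loop_size: int
    consensus: int
    motif: int


# Assumed pattern: letters, a hyphen, then three dot-separated integers.
_PATTERN = re.compile(r"^([A-Za-z]+)-(\d+)\.(\d+)\.(\d+)$")


def parse_loop_class(identifier: str) -> LoopClass:
    """Split an identifier such as 'EE-3.4.1' into its four fields."""
    match = _PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"unrecognised loop class identifier: {identifier!r}")
    loop_type, size, consensus, motif = match.groups()
    return LoopClass(loop_type, int(size), int(consensus), int(motif))


example = parse_loop_class("EE-3.4.1")
```

Keeping the fields separate makes it possible to predict at a coarser level (for example loop type only) when the full class cannot be recovered from sequence.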
Problems

• Coding of amino acids

• Huge search space, sparsely populated

• How to assign the loop classes?

• High dimensionality → large networks → poor
  generalization
Amino acid coding
 the classical way
      A → (1, 0, …, 0)
      C → (0, 1, …, 0)
      Y → (0, 0, …, 1)

  Useful but not efficient!!!
I am working to improve it…
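The classical coding above is one-hot encoding: each of the 20 amino acids gets a 20-component vector with a single 1. A minimal sketch follows, assuming the residues are ordered alphabetically by one-letter code, which matches the slide's A first, C second, Y last. The names `AMINO_ACIDS`, `one_hot`, and `encode_sequence` are ours.

```python
# The 20 standard amino acids, ordered alphabetically by one-letter code (assumed order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return a 20-component vector with a 1 at the residue's position."""
    vec = [0] * len(AMINO_ACIDS)
    vec[AMINO_ACIDS.index(residue)] = 1  # raises ValueError for unknown letters
    return vec


def encode_sequence(seq):
    """Concatenate the one-hot vectors of all residues into one input vector."""
    return [bit for residue in seq for bit in one_hot(residue)]


inputs = encode_sequence("MEKK")
```

A sequence of n residues yields 20n inputs of which only n are non-zero, which is one way to see why this coding is useful but not efficient.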
Theory; but how about applications?!
β-β links and β-β hairpins from sequence

   HLVQ (MLP)         Predicted β-β link   Predicted β-β hairpin
   Real β-β link      88.4 (79.4)          11.6 (20.6)
   Real β-β hairpin   12.5 (16.1)          87.5 (83.9)
Prediction of all loop types from sequence alone

          β-β lk   α-β    β-β hp   β-α    α-α
 β-β lk   45.9     28.5   3.7      19.8   2.1
 α-β      8.8      67.4   1.2      18.0   4.6
 β-β hp   0.4      0.9    96.1     2.1    0.5
 β-α      4.4      6.2    2.4      79.5   7.6
 α-α      4.0      15.7   1.3      20.3   58.6
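To read the table at a glance, the per-class rate of correct prediction can be taken from the diagonal. The sketch below assumes, as in the two-class table, that rows are the real loop types and columns the predicted ones, with values in percent; each row does sum to roughly 100. The names `LABELS`, `CONFUSION`, `per_class_rate`, and `mean_rate` are ours.

```python
LABELS = ["β-β lk", "α-β", "β-β hp", "β-α", "α-α"]

# Values copied from the slide (assumed: rows = real class, columns = predicted class).
CONFUSION = [
    [45.9, 28.5, 3.7, 19.8, 2.1],
    [8.8, 67.4, 1.2, 18.0, 4.6],
    [0.4, 0.9, 96.1, 2.1, 0.5],
    [4.4, 6.2, 2.4, 79.5, 7.6],
    [4.0, 15.7, 1.3, 20.3, 58.6],
]


def per_class_rate(matrix):
    """Diagonal entry divided by its row total, as a percentage."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def mean_rate(matrix):
    """Unweighted mean of the per-class rates."""
    rates = per_class_rate(matrix)
    return sum(rates) / len(rates)


for label, rate in zip(LABELS, per_class_rate(CONFUSION)):
    print(f"{label}: {rate:.1f}%")
print(f"unweighted mean: {mean_rate(CONFUSION):.1f}%")
```

The unweighted mean treats all five loop types equally; an overall accuracy would additionally need the number of loops of each type, which the slide does not give.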
What does it all mean?
Given a loop residue sequence, we can
(usually) identify its native structure.
Not ab initio: we cannot tell the structure
of a novel sequence.
HLVQ is superior to MLP
Future Work


Better coding of amino acids
Larger sequences / low complexity
Going beyond structure
Clever alphabet that explores similarities
Multiobjective Genetic Algorithms
Beyond Multiple Alignments

• Alignments are good … but expensive and
  boring ...
• Information contained in a multiple
  alignment can, in principle, be expressed
  using an adequate amino acid coding
  scheme
  Sensibility

• How? Genetic Algorithm
Coded Amino Acids

Alanine (A), Arginine (R), Asparagine (N), Aspartic Acid (D), Cysteine (C),
Glutamic Acid (E), Glutamine (Q), Glycine (G), Histidine (H), Isoleucine (I),
Leucine (L), Lysine (K), Methionine (M), Phenylalanine (F), Proline (P),
Serine (S), Threonine (T), Tryptophan (W), Tyrosine (Y), Valine (V)

http://www.chemie.fu-berlin.de/chemistry/bio/
ArchDB database
Protein Data Bank (PDB), http://www.rcsb.org, contains ~25 000
proteins with known structure, out of ~10^6
entries in SWISS-PROT

ArchDB: ~20 000 classified loops
