Introduction to Bioinformatics
                   Dr. Jaume Bacardit
  Interdisciplinary Computing and Complex Systems
                 (ICOS) research group
                University of Nottingham
          jaume.bacardit@nottingham.ac.uk
About me
• Did my PhD in evolutionary learning
• Postdoc in Protein Structure Prediction 2005-
  2007
• Since 2008 lecturer in Bioinformatics at the
  University of Nottingham
• Research interests
  – Large-scale data mining
  – Biological data mining
Outline
•   What is Bioinformatics?
•   Basic molecular biology
•   Public databases
•   Sequence analysis
•   The scales of bioinformatics
•   Biological data mining
WHAT IS BIOINFORMATICS?
What is Bioinformatics?
• Several definitions exist. Michael Liebman proposed a quite
  elegant one:
    – “The study of the information content and information flow in
      biological systems and processes” (Michael Liebman)
    – Information content: genome project
    – Information flow: molecular transport
    – Biological systems: cells, organisms, …
    – Biological processes: metabolic networks
• Bioinformatics is the science of using information to
  understand aspects of Biology. That is, a discipline where
  techniques such as applied mathematics, computer
  science, statistics, artificial intelligence, etc. are integrated to
  solve biological problems
Information, information, information

•    As we know there have been major advances in the
     field of molecular biology
•    These have been coupled with advances in laboratory
     (post)genomic technology
•    This has led to an explosive growth in the
     collection of biological information
•    This deluge of information has led to an absolute
     requirement for
     1. Computerized databases to store, organise and index the data
     2. Specialized tools to view and analyse the data
     3. Specialized tools to infer new knowledge from the data
Areas of research (taxonomy of the
         Bioinformatics Journal)
•   Genome Analysis
•   Sequence Analysis
•   Phylogenetics
•   Structural Bioinformatics
•   Gene Expression
•   Genetics and Population Analysis
•   Systems Biology
•   Data and Text Mining
•   Databases and Ontologies
•   Bioimage Informatics
(Borrowed from “An Introduction to Bioinformatics Algorithms” by Neil C.
Jones and Pavel A. Pevzner and further modified by Prof. Natalio
Krasnogor)

BASIC MOLECULAR BIOLOGY
Life begins with Cell




•   A cell is the smallest structural unit of an organism that is capable of
    sustained independent functioning
•   All cells have some common features
•   What is Life? Can we create it in the lab? Read:
The imitation game—a computational chemical approach to
   recognizing life. Nature Biotechnology, 24:1203-1206, 2006
2 types of cells:
Prokaryotes          &     Eukaryotes
Example of cell signaling
Terminology
•   The genome is an organism’s complete set of DNA.
    – a small bacterial genome contains about 600,000 DNA base pairs
    – human and mouse genomes have some 3 billion.
•   The human genome is organised into 23 pairs of chromosomes.
    – Each chromosome contains many genes.
•   Gene
    – basic physical and functional units of heredity.
    – specific sequences of DNA bases that encode
      instructions on how and when to make proteins.
•   Proteins
    – Make up the cellular structure
    – large, complex molecules made up of smaller subunits
      called amino acids.
All Life depends on 3 critical molecules
• DNAs
   – Hold information on how cell works
• RNAs
   – Act to transfer short pieces of information to different parts of cell
   – Provide templates to synthesize into protein
• Proteins
   – Form enzymes that send signals to other cells and regulate gene
     activity
   – Form body’s major components (e.g. hair, skin, etc.)
   – Are life’s laborers!
• Computationally, all three can be represented as
  sequences of a certain 4-letter (DNA/RNA) or 20-letter
  (Proteins) alphabet
DNA, RNA, and the Flow of Information
[Figure: Replication, Transcription, Translation. Weismann Barrier / Central Dogma of Molecular Biology]
Overview of DNA to RNA to Protein




•    A gene is expressed in two steps
    1) Transcription: RNA synthesis
    2) Translation: Protein synthesis
DNA: The Basis of Life
• Deoxyribonucleic Acid (DNA)
  – Double stranded with complementary strands A-T, C-G
• DNA is a polymer
  – Sugar-Phosphate-Base
  – Bases held together by H bonding to the opposite strand
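The complementary pairing rule (A-T, C-G) can be sketched in a few lines. This is a minimal illustration, not part of the original slides; the function name `reverse_complement` is chosen here for the example, and the reversal reflects the fact that the two strands run in opposite directions.

```python
# Watson-Crick pairing rules from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    # The strands are antiparallel, so the partner strand is read in reverse.
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


if __name__ == "__main__":
    print(reverse_complement("ATGC"))  # GCAT
```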
RNA
   • RNA is chemically similar to DNA. It is usually
     only a single strand. T(hymine) is replaced by
     U(racil)
   • Some forms of RNA can form secondary
     structures by “pairing up” with themselves. This can
     dramatically affect their properties.

                                                         DNA and RNA
                                                         can pair with
                                                         each other.


tRNA linear and 3D view:   http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
RNA, continued
Several types exist, classified by function:

• hnRNA (heterogeneous nuclear RNA): Eukaryotic mRNA primary
  transcripts with introns that have not yet been excised (pre-mRNA).

• mRNA: this is what is usually being referred to when a
  Bioinformatician says “RNA”. It carries a gene’s
  message out of the nucleus.

• tRNA: transfers genetic information from mRNA to an amino acid
  sequence so as to build a protein

• rRNA: ribosomal RNA. Part of the ribosome which is involved in
  translation.
Transcription
• Transcription is highly regulated. Most DNA is in a
  dense form where it cannot be transcribed.
• To start, transcription requires a promoter, a small
  specific sequence of DNA to which polymerase can
  bind (~40 base pairs “upstream” of the gene)
• Finding these promoter regions is only a partially
  solved problem that is related to motif finding.
• There can also be repressors and inhibitors acting in
  various ways to stop transcription. This makes
  regulation of gene transcription complex to
  understand.
Definition of a Gene



•   Regulatory regions: up to 50 kb upstream of +1 site

•   Exons: protein coding and untranslated regions (UTR)
        1 to 178 exons per gene (mean 8.8)
        8 bp to 17 kb per exon (mean 145 bp)

•   Introns: splice acceptor and donor sites, junk DNA
          average 1 kb – 50 kb per intron

•   Gene size:    Largest – 2.4 Mb (Dystrophin). Mean – 27 kb.
Splicing
Splicing and other RNA processing

• In Eukaryotic cells, RNA is processed
  between transcription and translation.
• This complicates the relationship between
  a DNA gene and the protein it codes for.
• Sometimes alternate RNA processing can
  lead to an alternate protein (splice
  variants) as a result. This is true in the
  immune system.
Proteins: Crucial molecules for
           the functioning of life
• Structural proteins: the organism's basic building blocks, e.g.
collagen, nails, hair, etc.
• Enzymes: biological engines that mediate a multitude of biochemical
reactions. Enzymes are usually very specific and catalyze only a single type
of reaction, but they can play a role in more than one pathway.
• Transmembrane proteins: the cell’s housekeepers, e.g. by
regulating cell volume, extracting and concentrating small molecules from
the extracellular environment, and generating ionic gradients essential for
muscle and nerve cell function (the sodium/potassium pump is an example)

• Proteins are polypeptide chains, constructed by joining amino acids
in a linear way through peptide bonds
• The chain of amino acids, however, folds to create very complex 3D
structures
Translation

• The process of going
  from RNA to
  polypeptide.
• Three bases of
  RNA (called a codon)
  correspond to one
  amino acid based on a
  fixed table.
• Always starts with
  Methionine and ends
  with a stop codon
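The codon-to-amino-acid lookup can be sketched as follows. This is an illustrative example only: it uses the standard genetic code, assumes the input DNA is the coding strand (so transcription reduces to replacing T with U), and the names `transcribe` and `translate` are chosen here rather than taken from the slides.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Assumption: input is the coding strand, so mRNA is the same text with U for T."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG (Methionine) up to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGGCCTAA")))  # MA
```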
Amino Acids
Protein Structure: Introduction

• Different amino acids
  have different properties
• These properties will
  affect the protein
  structure and function
• Hydrophobicity, for
  instance, is the main
  driving force (but not the
  only one) of the folding
  process
Protein Structure: Hierarchical nature of protein
                     structure
Primary Structure = Sequence of amino acids
  MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL
  PFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQRE
  KIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKK
  HLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYL
  IKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE




Secondary Structure                                      Tertiary




               Local Interactions                              Global Interactions
Protein Structure: Why is structure
              important?
• The function of a protein depends greatly on
  its structure
• The structure that a protein adopts is vital to
  its chemistry
• Its structure determines which of its amino
  acids are exposed to carry out the protein’s
  function
• Its structure also determines what substrates
  it can react with
Protein Structure: Mostly lacking
               information
• Therefore, it is clear that knowing the structure of a
  protein is crucial for many tasks
• However, we only know the structure for a very small
  fraction of all the proteins that we are aware of
   – The UniProtKB/TrEMBL archive contains 23165610 (16886838)
     sequences
   – The PDB archive of protein structure contains only
     84223(76669) structures
• In the native state, proteins fold on their own as soon as
  they are generated, amino acid by amino acid (with few
  exceptions, e.g. chaperones) → can we predict this
  process so as to close the gap between protein sequences
  and their 3D structures?
Central Dogma of Biology: A Bioinformatics
               Perspective
  The information for making proteins is stored in DNA. There is
  a process (transcription and translation) by which DNA is
  converted to protein. By understanding this process and how it
  is regulated we can make predictions and models of cells.

[Figure: computational problems along the central dogma: Assembly, Sequence analysis, Gene Finding, Protein Sequence/Structure Analysis]
PUBLIC DATABASES
Information flow in bioinformatics
• Data enters the “bioinformatics scope” when a scientist deposits an
  experimental result in an appropriate archive
• The archive curates and annotates the data
• The data is released to the public
• Afterwards, the data may be retrieved/analysed:
   –   Integrating the new entry into a search engine
   –   Extracting useful subsets of the data
   –   Deriving new types of information from the data
   –   Aggregating the data, by homology, function, structure
   –   Reannotating the data with new discovered/inferred info.
• Quality of data depends on many factors: the techniques used to
  experimentally create the data, the degree of inference and prediction
  involved in the annotation process, etc.
• Many publicly available databases:
  http://en.wikipedia.org/wiki/List_of_biological_databases
NCBI’s Entrez system
          http://www.ncbi.nlm.nih.gov/
Entrez is a search and retrieval system that integrates
information from databases at NCBI (National Center for
Biotechnology Information).
Uniprot http://www.uniprot.org
• The Universal Protein Resource (UniProt) is a collaboration between the
  European Bioinformatics Institute (EBI), the SIB Swiss Institute of
  Bioinformatics and the Protein Information Resource (PIR)
KEGG - http://www.genome.jp/kegg/

                  • Not just about
                    genes/proteins but
                    also pathways, that
                    is, their interactions
DAVID - http://david.abcc.ncifcrf.gov/
SEQUENCE ANALYSIS
Sequences
• Be it DNA, RNA or proteins we have many data
  that can be represented as sequences of a
  certain alphabet
• Many generic algorithms to deal with
  biological sequences exist

• Sequence alignment
• Motif representation
Sequence Alignment
• Is the assignment of residue-residue correspondences
  between nucleotide/proteomic sequences


 Query 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
      MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY
 Sbjct 1 MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60      matches
 Query 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120
       YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL
 Sbjct 61 YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL------------- 107
                                                                               gap
 ...

 Query 301 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360
        QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ   + C P+
 Sbjct 281 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ-----DSFHLECQFPS 335

 Query 361 S-PSVN 365
         P VN                                             mismatches
 Sbjct 336 KFPGVN 341
Motivation
• Similarity is expected among biomolecules that are
  descended from a common ancestor.
   – Mutations cause differences, but survival of the organism requires
     that mutations occur in regions that are less critical to function
   – Important catalytic, regulatory or structural regions remain similar
• An alignment between two or more genetic or proteomic
  sequences represents an explicit hypothesis vis a vis their
  evolutionary histories.
• Thus comparison of related gene/protein sequences has
  been instrumental in shedding light on the information
  content of these sequences and their biological
  functions.
Definition and aims

•    Why align sequences?
    1. Start with a query sequence with unknown
       properties and search within a database of
       millions of sequences to find those which share
       similarity with the query.
    2. Start with a small set of sequences and identify
       similarities and differences among them.
    3. In many sequences or very long
       sequences, detect commonly occurring patterns
Similarity vs. Homology
• Similarity is the observation or measurement of resemblance
  and difference, independent of the source of resemblance.
• There are many examples of different organisms with
  functionally similar organs that came from distinct
  evolutionary origins
• When similarity is due to a common ancestry, we call it
  homology.
• Sequence alignment helps to infer homology hypotheses:
   – If two sequences are very similar, it is probable that there is a common
     origin
   – Therefore, if we know some information (structure, function) from
     sequence X, and sequence X is similar to sequence Y, it is probable that
     the same information applies to Y
Metrics of similarity: Definitions
• Gap: a break in the alignment, in either one of the
  sequences.
   – For nucleotides, a consequence of an insertion or deletion
     mutation.
   – For proteins, it’s more difficult to say.
• Regions of matching residues.
   – Indicate parts of a sequence that are well conserved
• Mismatched residues.
   – For nucleotides, a consequence of a substitution mutation
   – Less conserved regions
Metrics of similarity: Distance scoring
• Distance scoring
   – Given an alignment with matches, mismatches and
     gaps, we compute a score following:
      • For each mismatch, score is increased by 2
      • For each gap, score is increased by 4
      • For each match, no increase in score
   – Higher score, less similarity

                 A - G C C G T A T
                 A C G A - - T - T
                 0 4 0 2 4 4 0 4 0            = 18

• Equivalent metrics exist for similarity (not
  distance) where higher score means good
  similarity
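The distance scoring rule above can be checked with a short sketch. The function name `alignment_distance` and its parameter names are chosen for this example; the scores (2 per mismatch, 4 per gap, 0 per match) and the two aligned rows come from the slide.

```python
def alignment_distance(row_a: str, row_b: str,
                       mismatch: int = 2, gap: int = 4,
                       gap_char: str = "-") -> int:
    """Score an existing alignment column by column; higher means less similar."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == gap_char or b == gap_char:
            score += gap        # a gap in either sequence
        elif a != b:
            score += mismatch   # two different residues
        # a match adds nothing
    return score


if __name__ == "__main__":
    # The alignment shown on the slide.
    print(alignment_distance("A-GCCGTAT", "ACGA--T-T"))  # 18
```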
Metrics of similarity: Mismatches and gaps


• Are all mismatches equally bad?
   – For protein sequences, there are several subgroups of amino
     acids with similar properties. Mismatches within a group have
     less impact
   – For nucleotide sequences, transition mutations (a↔g and
     t↔c) are more common than transversions (a or g ↔ t or c)
     mutations
   – Distance scoring of mismatches could be smarter → substitution
     matrices
      • Using statistical analysis on a large corpus of real sequences to generate
        better scores
• How to penalize gaps
   – Each gap slot gets equal distance score
   – One score to open a gap, another (smaller) score to extend the
     same gap
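The two gap-penalty schemes can be compared with a small sketch. The penalty values (4 per slot, 4 to open, 1 to extend) and the function names are assumptions made for illustration only.

```python
import re


def linear_gap_cost(row: str, per_slot: int = 4, gap_char: str = "-") -> int:
    """Every gap slot gets the same distance score."""
    return per_slot * row.count(gap_char)


def affine_gap_cost(row: str, gap_open: int = 4, gap_extend: int = 1,
                    gap_char: str = "-") -> int:
    """One score to open a gap, a smaller one for each extra slot in the same gap."""
    cost = 0
    for run in re.findall(re.escape(gap_char) + "+", row):
        cost += gap_open + gap_extend * (len(run) - 1)
    return cost


if __name__ == "__main__":
    row = "ACGA--T-T"  # one gap of length 2 and one gap of length 1
    print(linear_gap_cost(row))  # 12
    print(affine_gap_cost(row))  # 9
```

Under the affine scheme a single long gap is cheaper than several scattered short ones, which matches the intuition that one insertion or deletion event can span several residues.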
Global vs Local alignment

• We know how to score good or bad
  alignments
  – How to find the optimal one?
• Two classes of alignment methods
  – Global alignment
     • Finds the best alignment of one entire sequence with
       another entire sequence
  – Local alignment
     • Find the best alignment of one segment of a sequence
       against another segment of another sequence
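One exact way to find the optimal global alignment is dynamic programming, in the style of the Needleman-Wunsch algorithm. The sketch below is a simplified illustration that minimises the distance score (mismatch 2, gap 4, match 0) with a linear gap penalty; the function name and return format are assumptions of this example.

```python
def global_alignment(seq_a: str, seq_b: str, mismatch: int = 2, gap: int = 4):
    """Return (distance, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(seq_a), len(seq_b)
    # dist[i][j] = minimal distance between seq_a[:i] and seq_b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap
    for j in range(1, m + 1):
        dist[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if seq_a[i - 1] == seq_b[j - 1] else mismatch
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # match or mismatch
                             dist[i - 1][j] + gap,      # gap in seq_b
                             dist[i][j - 1] + gap)      # gap in seq_a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + sub:
            row_a.append(seq_a[i - 1])
            row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + gap:
            row_a.append(seq_a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(seq_b[j - 1])
            j -= 1
    return dist[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(global_alignment("ACGT", "AGT"))  # (4, 'ACGT', 'A-GT')
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths.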
Exact vs. Approximate methods
• Exact methods for both global and local alignment exist, based on
  dynamic programming, but are slow
   – Good enough when there are few sequences
   – Not so good when comparing a target sequence to a database of millions
     of known sequences
• Approximate methods have been used for many years for large-
  scale alignment tasks
   – They use some kind of heuristic to speed up the alignment process
   – BLAST (Basic Local Alignment Search Tool) is the most famous approximate
     method
       • It identifies potential hits by looking for perfect matches of very small sub-sequences
         (seeds)
       • It only tries to create a full alignment for sequences where several seeds are identified
       • PSI-BLAST: version that takes into account that multiple hits are identified. It constructs a
         tailored substitution matrix based on hits and then refines the alignment
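The seeding idea can be illustrated with a toy k-mer index. This is not BLAST's actual implementation (which also uses scoring thresholds, neighbourhood words and seed extension); the function names and the seed length `k=3` are assumptions for this sketch.

```python
from collections import defaultdict


def index_kmers(sequence: str, k: int) -> dict:
    """Map every length-k sub-sequence to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where a perfect k-mer match occurs."""
    subject_index = index_kmers(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in subject_index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds


if __name__ == "__main__":
    print(find_seeds("ACGT", "TTACGA"))  # [(0, 2)]
```

Only subject sequences that collect several seeds would then be passed on to the slower full alignment step.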
Multiple Sequence Alignment
• When we have to align more than two sequences
• Progressive methods (e.g. ClustalW)
  – Start with seed alignment
  – Iteratively incorporate other alignments to
    seed, without modifying what is aligned so far
  – ClustalW uses phylogenetic trees (representations of
    the evolutionary relationship between sequences) to
    progressively construct MSA
• Iterative methods (e.g. MUSCLE)
  – Can re-edit the partial MSA based on the newly
    incorporated alignments
ClustalW
Interface in
Uniprot
Motifs
• When visualising a MSA we can see regions of
  high agreement and regions of low
  agreement.
• The high agreement regions define that a
  certain protein belongs to a family
• What if we concentrate on modelling and
  identifying these regions instead of the whole
  sequences? → Motif finding
Modelling motifs
• Patterns
   – Model the subsequence as a regular expression
       • C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]
       • Zinc Finger motif
       • Can cope with moderate level of variability
• Profiles
   – Specify the most likely values for each position in the motif → acts as a
     substitution matrix
   – Use sequence similarity metrics to compute a score of the motif for a given
     sequence
              1     2     3     4     5     6     7     8     9
         A:   0.5   0.25  0     1     0     0     0     0     0.5
         T:   0     0.25  1     0     0     1     0     0.25  0
         G:   0.5   0.25  0     0     1     0     0     0.75  0.25
         C:   0     0.25  0     0     0     0     1     0     0.25

             http://drmotifs.genouest.org/2010/07/profiles-pwm-pswm-pssm/


• PROSITE implements both types of motifs
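Both motif models can be sketched briefly. The pattern and the profile values below are the ones on the slide; the converter handles only the pattern elements that appear in this example (`x`, `x(n)`, bracketed sets and single letters), and scoring a window as the product of per-position frequencies is one simple choice assumed here, not necessarily what PROSITE does.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        count = ""
        repeat = re.fullmatch(r"(.+?)\((\d+)\)", element)
        if repeat:  # e.g. x(2) means "any residue, twice"
            element, count = repeat.group(1), "{" + repeat.group(2) + "}"
        parts.append(("." if element == "x" else element) + count)
    return "".join(parts)


# Profile from the slide: per-position frequency of each base (9 positions).
PROFILE = {
    "A": [0.5, 0.25, 0, 1, 0, 0, 0, 0, 0.5],
    "T": [0, 0.25, 1, 0, 0, 1, 0, 0.25, 0],
    "G": [0.5, 0.25, 0, 0, 1, 0, 0, 0.75, 0.25],
    "C": [0, 0.25, 0, 0, 0, 0, 1, 0, 0.25],
}


def profile_score(window: str) -> float:
    """Multiply the frequencies of the observed base at each motif position."""
    score = 1.0
    for position, base in enumerate(window):
        score *= PROFILE[base][position]
    return score


if __name__ == "__main__":
    regex = prosite_to_regex("C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]")
    print(regex)  # C.H.[LIVMFY]C.{2}C[LIVMYA]
    print(bool(re.search(regex, "AACAHALCAACLGG")))  # True
    print(profile_score("AATAGTCGA"))  # 0.046875
```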
Modelling Motifs
• Hidden Markov Models
  – Model the motif as a series of state transitions
    with probabilities associated to each input symbol
    and state

  – Easy to visualise




               http://drmotifs.genouest.org/2010/07/hiden-markov-models-hmm/


  – PFAM uses HMM motifs
DNA -> RNA -> Proteins
THE SCALES OF BIOINFORMATICS
DNA
•   Coding/non coding
•   SNPs
•   Copy number variation
•   Assembly
•   Methylation
•   Primer design
Coding/Non Coding
• Identifying the regions from an organism’s
  genome that contain genes
• Many different factors involved in this
  identification
  –   Promoter identification
  –   Long enough Open Reading Frames (ORF)
  –   Splice variants
  –   Introns/Exons (in Eukaryotes)
  –   Statistical properties of gene-coding DNA
• HMMs are also used for gene finding
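The "long enough Open Reading Frame" criterion can be sketched as a simple scan. This is a deliberately naive illustration: it looks only at the three forward-strand reading frames, ignores introns and the reverse strand, and the `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end) pairs of ORFs; end is exclusive and includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it has enough codons before the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17)]
```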
Single Nucleotide Polymorphisms
                (SNPs)
• One base-pair variation in DNA
• In most cases in non-coding regions of DNA, but
  not always
• When frequent enough in a population they can
  be linked to specific traits, e.g. a disease
• SNP microarrays can be used to probe hundreds
  of thousands of SNPs in parallel
• In reality few SNPs act on their own
  – Genome-Wide Association Studies identify groups of
    SNPs linked to a certain condition
Copy Number Variation
• In general two copies of each gene exist in a
  genome
• It may be the case that more or fewer than two
  copies of a certain gene exist in a specific
  sub-population
• It has been suggested that certain CNV can be
  linked to specific diseases
Genome assembly
• Sequencing technologies are able to read (sequence) a
  complete genome as a series of short overlapping
  fragments
• How to assemble back all these fragments?
• Greedy approach
   – Pair-wise alignments of all fragments
   – Merge fragments of largest overlap
   – Keep iterating until all segments are merged
• Worked more or less well on old sequencing
  technologies, not so well on next-generation
  sequencing data, due to smaller fragment sizes and
  larger error rate
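The greedy approach can be sketched on error-free toy fragments. This simplified example uses exact suffix/prefix overlaps instead of full pair-wise alignments, and the function names and the `min_overlap` threshold are assumptions; real assemblers must also cope with sequencing errors, repeats and reverse-complement reads.

```python
def overlap(left: str, right: str, min_length: int = 1) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments: list, min_overlap: int = 3) -> list:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining pair overlaps enough
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"]))  # ['ATGGCGTACCTTA']
```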
Genome mapping
• Given a large set of short fragments, as a result of
  next-generation sequencing, map them to a
  reference genome
• Different from genome assembly: we do not want to
  reconstitute a complete genome, just identify to
  which genes each fragment belongs (among
  other applications).
• Speed is an issue
• Modern methods (e.g. SOAP2) compress the
  genome and are able to align the fragments in
  the compressed space
Methylation
• It is a chemical reaction that can block a
  certain region of a chromosome, preventing
  its transcription
• The process can be reverted, so essentially it is
  an on/off switch of the affected gene
• Specialised microarrays exist for the high-
  throughput detection of methylated genes
• Afterwards, data analysis can take place
DNA library specification
• A DNA library is a combinatorial set of DNA sequences suited to
  manufacture via DNA reuse
• The first stage towards the creation of a DNA library is the formal
  specification of the target DNA molecules that comprise it
• A set of sequences does not convey the intention behind the library


• Key challenge is to enable precise editing of DNA sequences in an
  extensible and reproducible manner whilst avoiding manual handling of
  these unwieldy objects
DNALD library format
• A DNALD library consists of three sets of definitions:
  inputs, intermediates and outputs, with different
  semantics
 – Inputs: existing DNA sequences to be provided with the design
 – Intermediates: conceptual means of factoring common sequences
 – Outputs: to be produced through DNA reuse
DNALD expressions
• A DNALD expression is a combination of explicit sequences, definition
  names, operators and functions that are interpreted according to rules of
  precedence and association ("evaluated") to produce a set of DNA
  sequences.
• Definitions bind names to the results of expressions.
Workbench interface



[Figure: workbench interface: manage projects; text editor with syntax highlighting, auto-completion, code folding, etc.; viewed from different perspectives]
CADMAD’s DNALD (DNA Library Design)
• A specification language that produces a set of target DNA
  sequences as a function of operations on a set of inputs
• To maximise CADMAD's impact the specification process must be:
  – user friendly and debuggable
  – but expressively powerful enough to:
     • define non-trivial combinatorial constructs
     • communicate degrees of freedom
[Slide background: FASTA sequences]
>Ret_human
GGCCTCTACTTCTCGAGGGATGCTTACTGGGAGAAGCTGTATGTGGACCAGGCGGCCGGCA
CGCCCTTGCTGTACGTCCATGCCCTGCGGGACGCCCCTGAGGAGGTGCCCAGCTTCCGCCT
GGGCCAGCATCTCTACGGCACGTACCGCACACGGCTGCATGAGAACAACTGGATCTGCATC
CAGGAGGACACCGGCCTCCTCTACCTTAACCGGAGCCTGGACCATAGCTCCTGGGAGAAGC
TCAGTGTCCGCAACCGCGGCTTTCCCCTGCTCACCGTCTACCTCAAGGTCTTCCTGTCACC
CACATCCCTTCGTGAGGGCGAGTGCCAGTGGCCAGGCTGTGCCCGCGTATACTTCTCCTTC
TTCAACACCTCCTTTCCAGCCTGCAGCTCCCTCAAGCCCCGGGAGCTCTGCTTCCCAGAGA
CAAGGCCCTCCTTCCGCATTCGGGAGAACCGACCCCCAGGCACCTTCCACCAGTTCCGCCT
GCTGCCTGTGCAGTTCTTGTGCCCCAACATCAGCGTGGCCTACAGGCTCCTGGAGGGTGAG
GGTCTGCCCTTCCGCTGCGCCCCGGACAGCCTGGAGGTGAGCACGCGCTGGGCCCTGGACC
GCGAGCAGCGGGAGAAGTACGAGCTGGTGGCCGTGTGCACCGTGCACGCCGGCGCGCGCGA
GGAGGTGGTGATGGTGCCCTTCCCGGTGACCGTGTACGACGAGGACGACTCGGCGCCCACC
TTCCCCGCGGGCGTCGACACCGCCAGCGCCGTGGTGGAGTTC
>Ret_mouse
GGCCTCTATTTCTCAAGGGATGCTTACTGGGAGAGGCTGTATGTAGACCAGCCAGCTGGCA
CACCTCTGCTCTATGTCCATGCCCTACGGGATGCCCCTGGAGAAGTGCCGAGCTTCCGCCT
GGGCCAGCATCTCTATGGCGTCTACCGTACACGGCTGCATGAGAATGACTGGATCCGCATC
AATGAGACTACTGGCCTTCTCTACCTCAATCAGAGCCTGGACCACAGTTCCTGGGAACAGC
TCAGCATCCGCAATGGTGGTTTCCCCCTGCTCACCATCTTCCTCCAGGTCTTTCTGGTGGA
AAACTGCCAGGAGTTCAGCGGTGTCTCCATCCAGTACAAGCTGCAGCCTTCCAGCATCAAC
TGCACTGCCCTAGGTGTGGTCACCTCACCCGAGGACACCTCGGGGACCCTATTTGTAAATG
ACACAGAGGCCCTGCGGCGACCTGAGTGCACCAAGCTTCAGTACACGGTGGTAGCCACTGA
CCGGCAGACCCGCAGACAGACCCAGGCTTCGCTAGTGGTCACTGTGGAGGGGACATCCATT
ACTGAAGAAGTAGGCT
>Ret_zebrafish
GGGCTGTATTTTCCTCAAAGGCTTTACACAGAGAACATCTACGTGGGTCAGCAGCAGGGAT
CACCGTTGCTTCAGGTCATTTCAATGCGGGAATTCCCTACAGAGAGGCCTTATTTCTTCCT
GTGCTCGCACAGAGACGCTTTTACATCATGGTTTCACATAGATGAGGCGTCCGGAGTTCTT
TATCTCAACAAAACCCTGGAGTGGAGCGACTTCAGTAGTTTACGCAGCGGCTCAGTTCGCT
CCCCGAAGGATCTCTGACCTATCAGTTAGAGATTGTCGACAGGAACATCACTGCTGAAGCT
CAGTCCTGTTACTGGGCGGTTAGTCTTGCACAAAACCCGAATGATAATACAGGCGTTCTCT
ATGTGAACGACACCAAAGTGTTACGCAGACCAGAGTGCCAAGAGCTGGAGTATGTGGTCAT
TGCCCAGGAGCAGCAGAACAAGCTTCAGGCCAAGACACAGCTCACCGTCAGTTTTCAAGGC
GAAGCAGATTCACTGAAAACGGATG
>Ret_chicken
GGTCTGTACTTCCCCAGAAAGGAGTACTCAGAGAACGTCTACATTGACCAGCCAGCAGGTG
CGCCGCTCCTACGCATCCACGCCTTGAGGGATTCACATGGGAAACAGCCCACTTTCATCTG
TGCCAGAAGTCTCATCATTTCTCGAGCAAGATCCCATGAAAATCACTGGTTTCAAATCAGA
GAAAAAATGGGACTTCTCTACCTCAGCAAGAGCCTAGATAGAGAAGACTTTAACATGCTGT
CTGTAGGAAACTGGATGCCATTATCAAAGGTGATGCTGTATGTCTTCCTCTCATCTCACCC
TTTCCAAGAGAAGGAATGTGACTCTGCTACTCGTACCACAGTCGTCCTCTCTTTGATCAAT
GCTACTGCACCAGCTTGCAGTTCACTGTCAGCAAGGCAGCTTTGCTTCACAGAAATGGATC
TCTCCTTTCACATCAAGGAGAATAAACCCCCTGGTACATTTCATCAGCTCCAGTTACCCTC
AGTTCATCATCTGTGTCAGAATCTCAGCATTACCTACAAACTGTTGGCAGCCGAAGGCCTG
CCTTTTCGGTACAATGAGAACACCACTGGTGTGAGTGTAACACAGCGCCTAGATCGAGAGG
AGAGAGAGAGATATGAGCTGATCGCCAAATGCACCGTGAGAGAAGGCTTCAGGGAAATGGA
GGTTGAGGTGCCCTTCCTCGTCAACGTGTTAGATGAAGATGACTCTCCTCCCTTCCTTCCC
RNA
• Expression
• Structure prediction
RNA expression
• Not all genes are transcribed/translated into proteins
  all the time
• The expression of genes is highly sophisticated and
  depends on many factors
• Identifying the genes being expressed in a given point
  of time in a specific tissue provides crucial information
  about the roles and interactions of such genes
   – Compare the genes expressed between different groups of
     samples to identify those that are differentially expressed
   – Identify co-expressed genes, that present patterns of
     correlation
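Both analyses can be illustrated on toy numbers. The sketch below is an assumption-laden simplification: a log2 fold change with a pseudocount stands in for differential expression, and a Pearson correlation stands in for co-expression. Real studies use proper statistical tests and normalisation, and all function names here are illustrative.

```python
import math
from statistics import mean


def log2_fold_change(group_a: list, group_b: list, pseudocount: float = 1.0) -> float:
    """Positive values mean the gene is more expressed in group_b than in group_a."""
    return math.log2((mean(group_b) + pseudocount) / (mean(group_a) + pseudocount))


def pearson(x: list, y: list) -> float:
    """Pearson correlation between the expression profiles of two genes."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    # Made-up expression values, for illustration only.
    print(log2_fold_change([1, 1, 1], [3, 3, 3]))  # 1.0
    print(round(pearson([1, 2, 3], [2, 4, 6]), 6))  # 1.0
```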
Measuring RNA expression
• RT-PCR (real-time reverse transcription polymerase chain
  reaction)
  – Measures accurately the expression of a pre-
    determined gene
• RNA Microarrays
  – Measures, in parallel, the expression of tens of
    thousands of genes, but with considerable level of
    noise
• RNA-Seq
  – The next-generation sequencing variant for measuring
    gene expression
RNA Structure prediction
• An RNA sequence can bind with itself to create
  complex shapes with a certain pattern of
  loops
• Can we predict, from a given sequence, the
  structural shape of the RNA?
Proteins
•   Protein classification
•   Structure prediction
•   Structure comparison
•   Function and interaction
Protein classification
• Proteins can be annotated in many different ways
   – Function
      • DNA-binding? Enzyme?
   – Tissue/Cellular/Sub-cellular localisation
   – Interacting with other proteins?
• Can we predict this annotation using ML?
• We need to transform the protein sequence into a
  uniform representation of equal size for all proteins
• Many different representations exist
• Several of these problems can be modelled as a
  hierarchical classification problem
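One of the simplest fixed-size representations is the amino acid composition: a 20-dimensional vector of residue frequencies, whatever the protein length. This is only one possible encoding, chosen here as an illustration; the function name is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def composition_vector(protein: str) -> list:
    """Return the relative frequency of each of the 20 amino acids."""
    protein = protein.upper()
    counts = [protein.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    # Proteins of any length map to a vector of the same size (20).
    return [count / total if total else 0.0 for count in counts]


if __name__ == "__main__":
    short = composition_vector("MKYNNHDK")
    longer = composition_vector("MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEF")
    print(len(short), len(longer))  # 20 20
```

Such vectors can be fed directly to standard machine learning classifiers, at the cost of discarding the order of the residues.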
Protein Structure Prediction
• PSP aims to predict the 3D structure of a protein
  based on its primary sequence
Protein Structure Prediction
• PSP is an open problem. The 3D structure
  depends on many variables
• It has been one of the main holy grails of
  computational biology for many decades
• The impacts of having better protein structure models
  are countless
  –   Genetic therapy
  –   Synthesis of drugs for incurable diseases
  –   Improved crops
  –   Environmental remediation
Prediction types of PSP
• There are several kinds of prediction problems within
  the scope of PSP
   – The main one, of course, is to predict the 3D coordinates
     of all atoms of a protein (or at least the backbone) based
     on its primary sequence
   – There are many structural properties of individual residues
     within a protein that can be predicted, for instance:
      • The secondary structure state of the residue
      • If a residue is buried in the core of the protein or exposed in the
        surface
   – Accurate predictions of these sub-problems can simplify
     the general 3D PSP problem
3D Protein Structure Prediction
• Some PSP methods try to find similar proteins and then
  adapt the structure of the homolog (template) to the
  target protein → Homology Modeling
• Other methods try to find the structure of the protein
  from scratch (Ab Initio Modelling), optimizing some
  energy function that models the stability of the
  protein, in case no homolog can be identified
• In between there are other kinds of methods, for
  varying degrees of good homology of our target, for
  instance, Fold Recognition or Threading
   • These methods identify a template based on more than
     homology (i.e. sequence alignment).
Coordination Number Prediction
• Two residues of a chain are said to be in contact if their
  distance is less than a certain threshold (e.g. 8 Å)
[Figure: Primary Sequence, Contact, Native State]
• CN of a residue: count of contacts that a certain
  residue has
• CN gives us a simplified profile of the density of packing
  of the protein
Contact Map prediction
• Prediction, given two residues
  from a chain, whether these two
  residues are in contact or not
• This problem can be represented
  by a binary matrix. 1= contact, 0
  = non contact
• Plotting this matrix reveals many
  characteristics from the protein
  structure
• Very sparse characteristic: Less
  than 2% of contacts in native
  structures
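Given known 3D coordinates, the binary contact matrix and the per-residue contact counts (coordination numbers) can be computed as below. This sketch assumes one representative point per residue (for instance an alpha carbon) and the 8 Å threshold; prediction methods have to estimate these values from sequence alone, which this example does not attempt.

```python
import math


def contact_map(coords: list, threshold: float = 8.0) -> list:
    """Binary matrix: 1 if two residues are closer than the threshold, else 0."""
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = 1  # the matrix is symmetric
    return contacts


def coordination_numbers(contacts: list) -> list:
    """Coordination number of each residue: how many contacts it has."""
    return [sum(row) for row in contacts]


if __name__ == "__main__":
    # Made-up coordinates (in Å) for four residues on a line.
    toy_coords = [(0, 0, 0), (5, 0, 0), (10, 0, 0), (30, 0, 0)]
    cmap = contact_map(toy_coords)
    print(cmap)                        # [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    print(coordination_numbers(cmap))  # [1, 2, 1, 0]
```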


                                 helices   sheets
Other predictions
• Other kinds of residue
  structural aspects that can be
  predicted
   – Solvent accessibility: Amount of
     surface of each residue that is
     exposed to solvent
   – Recursive Convex Hull: A metric
     that models a protein as an
     onion, and assigns each residue
     to a layer. Formally, each layer is
     a convex hull of points
• These features (and
  others) are predicted in a
  similar way to secondary structure (SS)
Protein Structure Comparison
Protein Structure Comparison
• Protein Structure Comparison (PSC) aims to
  – Assess the degree of similarity between protein structures
  – Given a query structure, identify other proteins with similar
    structure
• Why?
  – Group proteins by structural similarities
  – Determine the impact of individual residues on the protein
    structure
  – Identify distant homologues of protein families
  – Predict function of proteins with low degree of primary
    structure (i.e. sequence) similarity with other proteins
  – Engineer new proteins for specific functions
  – Assess ab-initio predictions
Protein Structure Comparison

•    Sequence-Structure-Function relationships
    1) Conserved 1º sequences → similar structures

    2) Similar structures →(?) conserved 1º sequences

    3) Similar structures → conserved function
•    PSC shares many similarities with sequence
     alignment. Our aim is to infer new
     knowledge from the comparison process
Protein Structure Comparison
• Existing Approaches
  – SSAP (Orengo & Taylor, 96)
  – ProSup (Feng & Sippl, 96)
  – DALI (Holm & Sander, 93)
  – CE (Shindyalov & Bourne, 98)
  – LGA (Zemla, 2003)
  – SCOP (Murzin, Brenner, Hubbard & Chothia, 95)
  – CATH (Orengo, Michie, Jones, Jones, Swindells &
    Thornton, 97)
  – ProCKSI – Consensus of multiple PSC methods
Prediction of Protein Function

• In an ideal world, the cascade of inference
  should flow from sequence → structure →
  function
• That is, if we can identify similar sequences or
  structures to our query target we can (at
  varying degrees of certainty) infer that they
  have similar function
Prediction of Protein Function
• As proteins evolve, they may
  – Retain function and specificity
  – Retain function but alter specificity
  – Change to a related function, or a similar function in a
    different metabolic context
  – Change to a completely unrelated function
• How much must a protein change before the
  function changes?
  – Sometimes, not at all. There are many cases of
    proteins with different functions in different
    environments
Prediction of Protein Function

• Thus, sequence or structure similarity is not
  always reliable to assign function
• Other ways of determining protein function
  – By identifying patterns of co-regulated genes
     • Using data from Microarray experiments
  – By identifying protein-protein interactions
Prediction of Protein Function
• A related question is: where is the function of a protein
  taking place? → active site
• Several methods exist to predict active/binding sites of
  proteins from local patterns of sequence or structure
• A crude way of doing this prediction is to look at the
  conserved residues of a sequence → they may be
  related to either the core of the protein (structural
  stability) or the function of the protein (a change of
  function is a risk for survival)
• More sophisticated methods exist that learn how to
  predict active sites. They use ML, in a way similar to that used to
  predict residue structural features in PSP
• Still, it is a very tough problem, and ML methods are not
  much better than BLAST-based methods
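The crude conservation-based approach can be illustrated with a small sketch. It assumes a gap-free toy alignment and uses Shannon entropy per column as the conservation measure; both the measure and the made-up sequences are assumptions for illustration only.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residue types found in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conserved_positions(alignment, max_entropy=0.0):
    """Indices of the columns whose entropy does not exceed `max_entropy`."""
    return [i for i, col in enumerate(zip(*alignment))
            if column_entropy(col) <= max_entropy]

# Made-up alignment of four short sequences.
msa = ["MKVLH", "MRVLH", "MKILH", "MQVLH"]
print(conserved_positions(msa))  # [0, 3, 4]
```

Fully conserved columns have zero entropy; raising `max_entropy` would also accept nearly conserved positions.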
BIOLOGICAL DATA MINING
Three case studies
• Mining –omics data
• Predicting structural aspects of protein
  residues
• Automated alphabet reduction for protein
  datasets

• In all these three case studies we use the
  same evolutionary learning system: BioHEL
  [Bacardit et al., 09]
BioHEL
• BioHEL [Bacardit et al., 09] is an evolutionary
  learning system that applies the Iterative Rule
  Learning (IRL) approach
• Designed explicitly to deal with noisy large-scale
  datasets
• IRL was first used in EC by the SIA system
  [Venturini, 93]
BioHEL’s learning paradigm
– IRL has been used for many years in the ML
  community, with the name of separate-and-conquer
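The separate-and-conquer loop can be sketched as follows. This is not BioHEL itself: the evolutionary rule search is replaced here by a trivial stand-in (`learn_one_rule`, a hypothetical helper that picks the purest single attribute test), and the data is made up. Only the outer loop (learn a rule, remove the examples it covers, repeat, fall back to a default class) reflects the paradigm.

```python
from collections import Counter

def learn_one_rule(examples, default_label):
    """Stand-in for the rule search: return the purest single-test rule
    (attribute index, value, label) for a non-default class, or None."""
    best_score, best_rule = None, None
    for a in range(len(examples[0][0])):
        for v in sorted({x[a] for x, _ in examples}):
            labels = Counter(y for x, y in examples if x[a] == v)
            label, correct = labels.most_common(1)[0]
            if label == default_label:
                continue
            matched = sum(labels.values())
            score = (correct / matched, matched)  # accuracy first, then coverage
            if best_score is None or score > best_score:
                best_score, best_rule = score, (a, v, label)
    return best_rule

def separate_and_conquer(examples, default_label):
    rules, remaining = [], list(examples)
    while remaining:
        rule = learn_one_rule(remaining, default_label)
        if rule is None:  # only the default class is left to predict
            break
        rules.append(rule)
        a, v, _ = rule
        remaining = [(x, y) for x, y in remaining if x[a] != v]  # "separate"
    return rules

def predict(rules, default_label, x):
    """Apply the rules in order; the first match wins, else the default class."""
    for a, v, label in rules:
        if x[a] == v:
            return label
    return default_label

data = [(("A", "x"), "pos"), (("A", "y"), "pos"), (("B", "x"), "pos"),
        (("B", "y"), "neg"), (("C", "y"), "neg")]
rules = separate_and_conquer(data, default_label="neg")
print(rules)  # [(0, 'A', 'pos'), (1, 'x', 'pos')]
```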
BioHEL’s objective function
• An objective function based on the Minimum-Description-Length
  (MDL) principle (Rissanen, 1978)
  that tries to promote rules with
   – High accuracy: not making mistakes
   – High coverage: covering as many examples as possible
     without sacrificing accuracy. Recall (TP/(TP+FN)) will be
     used to define coverage
   – Low complexity: rules as simple and general as possible
   – The objective function is a linear combination of the three
     objectives above
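To make the trade-off concrete, here is a hedged sketch of a linear combination of the three objectives. The weights, the way complexity is measured (number of conditions) and the function name are assumptions for illustration; they are not BioHEL's actual MDL formulation.

```python
def rule_objective(tp, fp, fn, n_conditions, max_conditions,
                   w_acc=1.0, w_cov=1.0, w_simple=0.1):
    """Hypothetical linear combination of accuracy, coverage (recall) and simplicity."""
    accuracy = tp / (tp + fp) if tp + fp else 0.0   # fraction of matched examples that are correct
    coverage = tp / (tp + fn) if tp + fn else 0.0   # recall, TP/(TP+FN)
    simplicity = 1.0 - n_conditions / max_conditions
    return w_acc * accuracy + w_cov * coverage + w_simple * simplicity

# A perfectly accurate but very narrow rule versus a slightly noisy, broad one.
narrow = rule_objective(tp=5, fp=0, fn=95, n_conditions=3, max_conditions=10)
broad = rule_objective(tp=60, fp=10, fn=40, n_conditions=3, max_conditions=10)
print(narrow, broad)
```

With these illustrative weights the broad rule scores higher, which is the behaviour a coverage term is meant to encourage.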
BioHEL’s objective function
• Intuitively, we would like to have accurate
  rules covering as many examples as possible.
• However, in complex and inconsistent
  domains it is rare to obtain such rules
• In these cases, the easier path for evolutionary search is to
  maximize accuracy at the expense of coverage
• Therefore, we need to enforce that the evolved rules cover
  enough examples
BioHEL’s objective function




• Three parameters define the shape of the function
• The choice of the coverage break is crucial for the proper performance of
  the system
• Also, the coverage term penalizes rules that do not cover a minimum
  percentage of examples or that cover too many
BioHEL’s characteristics
• Attribute list rule representation
   – Automatically identifying the relevant attributes for a given rule and
     discarding all the other ones
• The ILAS windowing scheme
   – Efficiency enhancement method, not all training points are used for
     each fitness computation
• An explicit default rule mechanism
   – Generating more compact rule sets
   – Iterative process terminates when it is impossible to evolve a rule
     where the associated class is the majority class among the matched
     examples
   – At this point, all remaining training instances are assigned to the
     default class
MINING –OMICS DATA
Mining –omics data
• Biological data can be generated at many
  different levels
  – Genomics (DNA)
  – Transcriptomics (RNA)
  – Proteomics (proteins)
  – Metabolomics (small compounds)
  – Lipidomics (lipids)

• Hundreds of –omics have been catalogued
What does an –omics dataset look like?
• In most cases datasets present a similar structure
• Each sample is characterised by a large number
  of variables (RNA, Proteins, lipids, etc.)
• Each variable indicates (usually quantitatively)
  the presence of that element in the sample
• Due to the high cost of most –omics technologies,
  variables >> samples
  – Problems of over-fitting
What can we do with the dataset?
• In most cases, samples are annotated with a
  qualitative label
   – Cancer/Non-cancer patients
   – Samples of seed tissue for which it is known if the seed
     germinated or not
   – Age of the sample
• Therefore, we can treat these datasets as
  classification problems, and generate prediction
  models from the data
• Not just as classification problems
   – Clustering/Biclustering
   – Association Rule Mining
   – Regression
But in most cases, domain experts are
 not (only) interested in predictions
• Biomarker identification
  – Identify the key variables
     • Most strongly associated to each outcome
        – Using e.g. t-tests to identify those
     • Presenting higher prediction capacity
        – As identified by ML methods
  – Identify interactions between variables
     • By presenting very high (anti)correlation between them
     • By acting together to generate predictions
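The t-test route to ranking variables can be sketched as below. Welch's t statistic is used as one possible choice; the variable names and values are made up, and with variables >> samples a real analysis would also need to correct for multiple testing.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent groups of measurements."""
    return (mean(a) - mean(b)) / math.sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_variables(table, labels):
    """table: {variable: [one value per sample]}; labels: one 0/1 label per sample.
    Returns the variables sorted by |t|, strongest association first."""
    scores = {}
    for name, values in table.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(welch_t(pos, neg))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up dataset: six samples, two variables.
labels = [1, 1, 1, 0, 0, 0]
table = {"geneA": [10.0, 12.0, 11.0, 1.0, 2.0, 3.0],
         "geneB": [5.0, 6.0, 7.0, 5.0, 7.0, 6.0]}
ranking = rank_variables(table, labels)
print(ranking)
```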
Functional Network Reconstruction for
          seed germination
• Microarray data obtained from seed tissue of
  Arabidopsis thaliana
• 122 samples represented by the expression level
  of almost 14000 genes
• It had been experimentally determined whether
  each of the seeds had germinated or not
• Can we learn to predict germination/dormancy
  from the microarray data?
• [Bassel et al., 2011]
Generating rule sets
• BioHEL was able to predict the
  outcome of the samples with
  93.5% accuracy (10 x 10-fold cross-validation)
• Learning from a scrambled dataset
  (labels randomly assigned to
  samples) produced ~50% accuracy
If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination
If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination
If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination
If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination
Everything else → Predict dormancy
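A rule set like this one is straightforward to apply in code. The sketch below transcribes the four rules (gene identifiers and thresholds as shown on the slide); the sample values and the handling of missing genes are made up for illustration.

```python
# Each rule maps gene identifier -> expression threshold; all conditions must hold.
RULES = [
    {"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96},
    {"At4g34710": 349.67, "At4g37760": 150.75, "At1g30135": 17.66},
    {"At3g03050": 37.90, "At2g20630": 96.01, "At3g02885": 9.66},
    {"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76, "At1g48320": 56.80},
]

def predict(sample):
    """sample: {gene: expression level}. A gene missing from the sample never
    satisfies a condition (an assumption of this sketch)."""
    for rule in RULES:
        if all(sample.get(gene, float("-inf")) > threshold
               for gene, threshold in rule.items()):
            return "germination"
    return "dormancy"  # default rule: everything else

# Made-up expression values.
print(predict({"At3g03050": 40.0, "At2g20630": 120.5, "At3g02885": 11.2}))  # germination
print(predict({"At3g03050": 40.0, "At2g20630": 80.0, "At3g02885": 11.2}))   # dormancy
```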
Identifying regulators
• Rule building process is stochastic
   – Generates different rule sets each time the system
     is run
• But if we run the system many times, we can
  see some patterns in the rule sets
   – Some genes appear much more frequently than the
     rest
      • Some associated to dormancy
      • Some associated to germination
Known regulators appear with high
     frequency in the rules
Generating co-prediction networks of
                 interactions
• For each of the rules shown before to be
  true, all of the conditions in it need to be
  true at the same time
    – Each rule is expressing an interaction
      between certain genes
• From a high number of rule sets we can
  identify pairs of genes that co-occur with
  high frequency and generate functional
  networks
• The network shows a different topology
  when compared to other types of network
  construction methods (e.g. by gene
  co-expression)
• Different regions in the network contain
  the germination and dormancy genes
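Counting co-occurring gene pairs across many rule sets could look like the sketch below. The data layout (a rule as a tuple of gene names), the `min_count` cut-off and the gene names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def co_prediction_edges(rule_sets, min_count=2):
    """Count how often two genes appear in the same rule, over all rule sets,
    and keep the pairs seen at least `min_count` times as network edges."""
    pair_counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            for pair in combinations(sorted(set(rule)), 2):
                pair_counts[pair] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= min_count}

# Three made-up rule sets, as if produced by three runs of the learner.
rule_sets = [
    [("g1", "g2", "g3"), ("g4", "g5")],
    [("g1", "g2"), ("g4", "g6")],
    [("g2", "g1", "g5")],
]
print(co_prediction_edges(rule_sets))  # {('g1', 'g2'): 3}
```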
Experimental validation
• We have experimentally verified this analysis
   – By ordering and planting knockouts for the highly ranked
     genes
   – We have been able to identify four new regulators of
     germination, with different phenotype from the wild type
PREDICTING STRUCTURAL ASPECTS OF
PROTEIN RESIDUES
Prediction of structural aspects of protein
                  residues
• Many of these features are due to local interactions of an amino
  acid and its immediate neighbours
   – Can it be predicted using information from the closest
      neighbours in the chain?

             Ri-5    Ri-4    Ri-3    Ri-2    Ri-1    Ri    Ri+1    Ri+2    Ri+3    Ri+4    Ri+5
             SSi-5   SSi-4   SSi-3   SSi-2   SSi-1   SSi   SSi+1   SSi+2   SSi+3   SSi+4   SSi+5




                                 Ri-1 Ri Ri+1 → SSi
                                 Ri Ri+1 Ri+2 → SSi+1
                                 Ri+1 Ri+2 Ri+3 → SSi+2
   – In this simplified example, to predict the SS state of residue i we
     would use information from residues i-1, i and i+1. That is a
     window of ±1 residues around the target
ARFF file for a simple PSP dataset
     @relation AA+CN_Q2
     @attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
     @attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
     @attribute class {0,1}
     @data
     X,X,X,X,A,E,I,K,H,0
     X,X,X,A,E,I,K,H,Y,0
     X,X,A,E,I,K,H,Y,Q,0
     X,A,E,I,K,H,Y,Q,F,0
     A,E,I,K,H,Y,Q,F,N,0
     E,I,K,H,Y,Q,F,N,V,0
     I,K,H,Y,Q,F,N,V,V,0
     K,H,Y,Q,F,N,V,V,M,1
     H,Y,Q,F,N,V,V,M,T,0
     Y,Q,F,N,V,V,M,T,C,1
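The data rows above can be generated with a sliding window over the sequence, padding the ends with X. In this sketch the sequence prefix and the class labels are read off the rows shown (the chain continues beyond this prefix); the function name is hypothetical.

```python
def window_rows(sequence, labels, half_width=4, pad="X"):
    """One comma-separated row per labelled residue: the window of
    2*half_width+1 residues centred on it, followed by its class label."""
    assert len(labels) <= len(sequence)
    padded = pad * half_width + sequence + pad * half_width
    rows = []
    for i, label in enumerate(labels):
        window = padded[i:i + 2 * half_width + 1]
        rows.append(",".join(window) + "," + str(label))
    return rows

prefix = "AEIKHYQFNVVMTC"                  # first residues of the chain
classes = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]   # labels of the first ten residues
for row in window_rows(prefix, classes):
    print(row)
```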
What information do we include for each
               residue?

 – Early prediction methods used just the primary
   sequence → the AA types of the residues in the
   window
 – However, the primary sequence has a limited
   amount of information
    • It does not contain any evolutionary information: it does not
      say which residues are conserved and which are not
 – Where can we obtain this information?
    • Position-Specific Scoring Matrices, which are a product of a
      Multiple Sequence Alignment
Position-Specific Scoring Matrices (PSSM)

– For each residue in the query sequence compute
  the distribution of amino acids of the corresponding
  residues in all aligned sequences (discarding those
  too similar to the query)
– These distributions will tell us which mutations are
  likely and which mutations are less likely for each
  residue in the query sequence
– In essence it’s similar to a substitution matrix but
  tailored for the sequence that we are aligning
– A PSSM profile will also tell us which residues are
  more conserved and which residues are more
  subject to insertions or deletions
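A much simplified version of this computation is sketched below: per-column amino acid frequencies turned into log-odds scores against a background distribution. Real profiles (for instance those produced by PSI-BLAST) also weight sequences and handle gaps; the uniform background, the pseudocount and the toy alignment are assumptions of this sketch.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_profile(alignment, background=None, pseudocount=1.0):
    """For each column of a gap-free alignment, return {amino acid: log2-odds score}."""
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            row[aa] = math.log2(freq / background[aa])
        profile.append(row)
    return profile

# Made-up alignment: the first sequence plays the role of the query.
toy_msa = ["MKV", "MRV", "MKI", "MQV"]
profile = position_profile(toy_msa)
print(round(profile[0]["M"], 2), round(profile[0]["A"], 2))
```

Positive scores mark residue types seen more often than the background predicts at that position; negative scores mark unlikely substitutions.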
PSSM for the first 10 residues of 1n7lA
  A R N D C Q E G H I L K M F P S T W Y V
A: 4 -1 -2 -2 0 -1 -1 0 -2 -1 -2 -1 -1 -2 -1 1 0 -3 -2 0
M:-1 -2 -3 -4 -2 -1 -2 -3 -2 1 2 -2 7 0 -3 -2 -1 -2 -1 1
E:-1 0 0 2 -4 2 6 -2 0 -4 -3 1 -2 -4 -1 0 -1 -3 -2 -3
K:-1 2 0 -1 -4 1 1 -2 -1 -3 -3 5 -2 -4 -1 0 -1 -3 -2 -3
V: 0 -3 -3 -4 -1 -3 -3 -4 -4 3 1 -3 1 -1 -3 -2 0 -3 -1 5
Q:-1 1 0 0 -3 6 2 -2 0 -3 -3 1 -1 -4 -2 0 -1 -2 -2 -3
Y:-2 -1 -1 -3 -3 -1 -1 -3 6 -2 -2 -2 -1 2 -3 -2 -2 1 7 -2
L:-2 -3 -4 -4 -2 -3 -3 -4 -3 2 5 -3 2 0 -3 -3 -1 -2 -1 1
T: 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 2 5 -3 -2 0
R:-2 6 -1 -2 -4 1 0 -3 0 -3 -3 2 -2 -3 -2 -1 -1 -3 -2 -3
Secondary Structure Prediction
– The most usual way is to predict whether a
  residue belongs to an α helix, a β sheet or is in
  coil state
– Several programs can determine the actual SS
  state of a protein from a PDB file. The most
  common of them is DSSP
– Typically, a window of ±7 amino acids (15 in total)
  is used. This means 300 attributes (when using
  PSSM).
– A dataset with 1000 proteins with
  ~250AA/protein would have ~250000 instances
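The numbers above follow from simple arithmetic, sketched here with a hypothetical helper: each residue in the window contributes one PSSM row of 20 values.

```python
def n_attributes(half_width, values_per_residue=20):
    """Attributes per instance for a window of ±half_width residues."""
    return (2 * half_width + 1) * values_per_residue

print(n_attributes(7))   # 300 attributes for a window of ±7 (15 residues)
print(1000 * 250)        # 250000 instances: one per residue of every protein
```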
Secondary Structure Prediction

[Figure: prediction pipeline. Primary sequence (R1 R2 R3 … Rn-1 Rn) → MSA → PSSM profile of sequence (PSSM1 PSSM2 PSSM3 … PSSMn-1 PSSMn) → windows generation → window of PSSM profiles (PSSMi-1 PSSMi PSSMi+1) → prediction method → prediction (SSi?)]
Other prediction problems
• This same structure of prediction can be
  applied to most 1D structural aspects
• However, many of these features are natively
  continuous measures (or integer)
• To treat these problems as classification
  problems, we need to discretise the output
• Unsupervised methods are applied
     – Uniform length (UL) and uniform frequency (UF) discretisation
[Figure: cut points produced by UF and by UL discretisation]
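The two unsupervised schemes can be compared with a short sketch (function names and the skewed toy values are made up): uniform length splits the value range into equal-width intervals, while uniform frequency places the cut points so that each state receives roughly the same number of examples.

```python
def uniform_length_cuts(values, n_bins):
    """Cut points that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]

def uniform_frequency_cuts(values, n_bins):
    """Cut points that put (roughly) the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]

def discretise(value, cuts):
    """State index of a value = number of cut points it reaches or exceeds."""
    return sum(value >= c for c in cuts)

values = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 10.0]   # skewed toy measurements
ul = uniform_length_cuts(values, 2)
uf = uniform_frequency_cuts(values, 2)
print([discretise(v, ul) for v in values])  # [0, 0, 0, 0, 0, 0, 0, 1]
print([discretise(v, uf) for v in values])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

On skewed data UL yields very unbalanced classes while UF yields balanced ones.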
PSP datasets are good ML benchmarks
• These problems can be modelled in many ways:
  – Regression or classification problems
  – Low/high number of classes
  – Balanced/unbalanced classes
  – Adjustable number of attributes
• Ideal benchmarks!
• http://icos.cs.nott.ac.uk/datasets/psp_benchmark.html
Contact Map Prediction
• We participated in the CASP9 competition
• CASP = Critical Assessment of Techniques for Protein Structure Prediction.
  Biennial competition
• Every day, for about three months, the organizers release some protein
  sequences for which nobody knows the structure (129 sequences were
  released in CASP9, in 2010)
• Each prediction group is given three weeks to return their predictions
• If the machinery is not well oiled, it is not feasible to participate !!
• For CM, prediction groups have to return a list of predicted contacts (they
  are not interested in non-contacts) and, for each predicted pair of
  contacting residues, a confidence level
Steps for CM prediction (Nottingham
                 method)
1. Prediction of
     – Secondary structure (using PSIPRED)
     – Solvent Accessibility, Recursive Convex Hull and
       Coordination Number (using BioHEL [Bacardit et al., 09])
2. Integration of all these predictions plus other
   sources of information
3. Final CM prediction (using BioHEL)
Prediction of RCH, SA and CN
• We selected a set of 3262 protein chains from
  PDB-REPRDB with:
   – A resolution less than 2Å
   – Less than 30% sequence identity
   – No chain breaks or non-standard residues
• 90% of this set was used for training (~490000
  residues)
• 10% for test
Prediction of RCH, SA and CN
• All three features were predicted based on a
  window of ±4 residues around the target
   – Evolutionary information (as a Position-Specific
     Scoring Matrix) is the basis of this local
     information
   – Each residue is characterised by a vector of 180
     values
• The domain for all three features was
  partitioned into 5 states
Characterisation of the contact map
                        problem
• Three types of input information were used
    1. Detailed information of three different windows of
       residues centered around
        – The two target residues (2x)
        – The middle point between them
    2. Information about the connecting segment between the
       two target residues
    3. Global protein information
[Figure: diagram labelling the three sources of information 1, 2 and 3]
Contact Map dataset
• From the original set of 3262 proteins we kept
  all that had <250 AA and a randomly selected
  20% of larger proteins
• Still, the resulting training set contained 32
  million pairs of AA and 631 attributes
• Less than 2% of those are actual contacts
• More than 60 GB of disk space
Samples and ensembles
[Figure: Training set → (x50) Samples → (x25) Rule sets → Consensus → Predictions]
• 50 samples of 660K examples are generated from the training set with a
  ratio of 2:1 non-contacts/contacts
• BioHEL is run 25 times for each sample
• Prediction is done by a consensus of 1250 rule sets
• Confidence of prediction is computed based on the distribution of votes in the
  ensemble
• The whole training process took about 25K CPU hours
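A consensus of this kind can be sketched as a simple vote. The slides do not give the exact confidence formula, so the fraction of rule sets voting "contact" is used here as an assumed stand-in, and the votes are made up.

```python
def consensus(votes):
    """votes: one 0/1 prediction per rule set (1 = contact).
    Returns (predicted_contact, confidence), with confidence taken as the
    fraction of rule sets voting for a contact (an assumption of this sketch)."""
    confidence = sum(votes) / len(votes)
    return confidence > 0.5, confidence

n_rule_sets = 50 * 25                   # 50 samples x 25 runs each = 1250 rule sets
votes = [1] * 1000 + [0] * 250          # made-up votes for one pair of residues
print(n_rule_sets, consensus(votes))    # 1250 (True, 0.8)
```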
Contact Map prediction in CASP
• Predictor groups are asked to submit a list of
  predicted contacts and a confidence level for each
  prediction
• The assessors then rank the predictions for each
  protein and take a look at the top L/x ones, where L
  is the length of the protein and x={5,10}
• From these L/x top ranked contacts two
  measures are computed
   – Accuracy: TP/(TP+FP)
   – Xd: difference between the distribution of
     predicted distance and a random distribution
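The accuracy measure on the top L/x predictions can be sketched as below (Xd is not covered). The data layout, function name and toy predictions are assumptions; the official assessment applies further filtering that is not described here.

```python
def top_accuracy(predictions, native_contacts, length, x):
    """Accuracy TP/(TP+FP) over the length // x most confident predictions.
    predictions: list of (i, j, confidence); native_contacts: set of (i, j), i < j."""
    k = max(1, length // x)
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:k]
    if not ranked:
        return 0.0
    tp = sum((min(i, j), max(i, j)) in native_contacts for i, j, _ in ranked)
    return tp / len(ranked)

# Made-up example: a protein of length 20 with five predicted contacts.
predicted = [(1, 10, 0.9), (2, 12, 0.8), (3, 15, 0.7), (4, 18, 0.6), (5, 19, 0.2)]
native = {(1, 10), (2, 12), (5, 19)}
print(top_accuracy(predicted, native, length=20, x=5))   # top 4 -> 0.5
print(top_accuracy(predicted, native, length=20, x=10))  # top 2 -> 1.0
```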
CASP9 results

                                                These two groups derived contact
                                                predictions from 3D models




http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf
Understanding the rule sets
• Each rule set has on average 135 rules
• We have a total of 168470 rules
• Impossible to read all of them individually, but
  we can extract useful statistics
• For instance, how often was each attribute
  used in the rules?
• Full analysis
Distribution of frequency of use of
                      attributes
• All 631 attributes are
  actually used (min
  frequency=429)
• However, some of
  them are used much
  more frequently than
  others
Top 10 attributes
              Attribute     Frequency   Counts
              PredSS_r1_1   1.48%       18141
              PredCN_r1     1.66%       20336
              propensity    1.74%       21288
              PredSS_r2     1.75%       21350
              PredSS_r1     1.82%       22205
              PredRCH_r2    1.87%       22856
              PredRCH_r1    2.04%       24961
              PredSA_r2     2.12%       25891
              PredSA_r1     2.39%       29246
              separation    4.17%       50951

All four kinds of residue predictions are highly ranked
AUTOMATED ALPHABET REDUCTION
FOR PROTEIN DATASETS
Motivation
• PSP is a very costly process
• As an example, one of the best PSP methods in
  CASP8, Rosetta@Home, could dedicate up to 104
  computing years to predict a single protein’s 3D
  structure
• One of the possible ways to alleviate this
  computational cost is to simplify the representation
  used to model the proteins
Target for reduction: the primary sequence

• The primary sequence of a protein is
  a usual target for such
  simplification
   – It is composed of a quite high cardinality
     alphabet of 20 symbols, which share
     commonalities between them
   – One example of reduction widely used
     in the community is the hydrophobic-polar
     (HP) alphabet, reducing these 20
     symbols to just two
   – The HP representation is usually too
     simple; too much information is lost in
     the reduction process [Stout et al., 06]
• Can we automatically generate these
  reduced alphabets and tailor them
  to the specific problem at hand?
Automated Alphabet Reduction
        [Bacardit et al., 09]
• We will use an automated information theory-driven
  method to optimize alphabet reduction policies for PSP
  datasets
• An optimization algorithm will cluster the AA alphabet
  into a predefined number of new letters
• Fitness function of optimization is based on the Mutual
  Information (MI) metric. A metric that quantifies the
  interrelationship between two discrete variables
   – Aim is to find the reduced representation that maintains as much
     relevant information as possible for the feature being predicted
• Afterwards we will feed the reduced dataset into a
  learning method to verify if the reduction was proper
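The role of Mutual Information as a fitness function can be illustrated with a small sketch: it scores how much a reduced alphabet still tells us about the feature being predicted. The residues, the binary labels and the two candidate groupings are made up, and the optimization algorithm that searches over groupings is not shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI (in bits) between two discrete variables, estimated from paired samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def reduce_alphabet(residues, grouping):
    """Replace each amino acid by the letter of the group it is assigned to."""
    return [grouping[r] for r in residues]

# Toy data: residue types and a made-up binary structural label for each.
residues = list("AAVVDDKK")
labels = [1, 1, 1, 1, 0, 0, 0, 0]

good = {"A": "h", "V": "h", "D": "p", "K": "p"}   # keeps the relevant distinction
bad = {"A": "a", "D": "a", "V": "b", "K": "b"}    # mixes the two kinds of residue

print(mutual_information(residues, labels))                          # 1.0
print(mutual_information(reduce_alphabet(residues, good), labels))   # 1.0
print(mutual_information(reduce_alphabet(residues, bad), labels))    # 0.0
```

A two-letter alphabet can preserve all of the information in this toy case, but only if the right amino acids are grouped together.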
Alphabet Reduction protocol

[Figure: alphabet reduction protocol. Dataset (Card=20) → ECGA (Size = N, guided by Mutual Information) → Dataset (Card=N) → BioHEL → Ensemble of rule sets → Test set → Accuracy]
Automated Alphabet Reduction
• Competent 5-letter alphabet (similar performance to
  the AA alphabet)
• Different alphabets for CN and SA domains
• Unexpected explanations: Alphabet reduction
  clustered AA types that experts did not expect
Automated Alphabet Reduction
• Our method produces better reduced alphabets than other
  reduced alphabets from the literature and than
  expert-designed ones
 Alphabet     Letters   CN acc.    SA acc.    Diff.     Ref.                       Source
 AA           20        74.0±0.6   70.7±0.4   ---       ---                        ---
 Our method   5         73.3±0.5   70.3±0.4   0.7/0.4   [Bacardit et al., 07]      ---
 WW5          6         73.1±0.7   69.6±0.4   0.9/1.1   [Wang & Wang, 99]          Literature
 SR5          6         73.1±0.7   69.6±0.4   0.9/1.1   [Solis & Rackovsky, 00]    Literature
 MU4          5         72.6±0.7   69.4±0.4   1.4/1.3   [Murphy et al., 00]        Literature
 MM5          6         73.1±0.6   69.3±0.3   0.9/1.4   [Melo & Marti-Renom, 06]   Literature
 HD1          7         72.9±0.6   69.3±0.4   1.1/1.4   [Bacardit et al., 07]      Expert designed
 HD2          9         73.0±0.6   69.3±0.4   1.0/1.4   [Bacardit et al., 07]      Expert designed
 HD3          11        73.2±0.6   69.9±0.4   0.8/0.8   [Bacardit et al., 07]      Expert designed
 Diff. = accuracy lost with respect to the AA alphabet (CN/SA)
Efficiency gains from the alphabet
                 reduction
• We have extrapolated the reduced alphabet to the much
  larger and richer Position-Specific Scoring Matrices (PSSM)
  representation




• Accuracy difference is still less than 1%
• Obtained rule sets are simpler and training process is much
  faster
• Performance levels are similar to recent works in the
  literature [Kinjo et al., 05][Dor and Zhou, 07]
• Won the bronze medal of the 2007 Humies awards
Conclusions
• Bioinformatics contains many challenges that
  computer science can tackle
  – Optimisation
  – Machine learning
  – Software engineering
• Evolutionary computation has been shown to be very
  competitive across a large range of bioinformatics
  problems
• Facing these challenges for EC has led to the
  development of many new methods
References/Bibliography
• Journals
   – The Bioinformatics Journal
   – BMC Bioinformatics
   – BioData Mining
• Bioinformatics books
   – Introduction to Bioinformatics by Arthur Lesk, Oxford University Press.
   – Introduction to Bioinformatics. A. Tramontano, Chapman and Hall/CRC
• Specialised topics
   – Bioinformatics for –omics data. Methods and Protocols. Bernd Mayer
      (ed). Springer
   – Next-Generation Sequencing special issue of the Bioinformatics
      Journal;
      http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
References/Bibliography
•   J. Bacardit, M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz, Coordination number
    prediction using Learning Classifier Systems: Performance and interpretability. In Proceedings
    of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO2006), pp.
    247-254, ACM Press, 2006
•   Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Prediction of Recursive Convex Hull Class
    Assignments for Protein Residues. Bioinformatics, 24(7):916-923, 2008
•   Stout, M., Bacardit, J., Hirst, J.D., Smith, R.E. and Krasnogor, N. Prediction of Topological
    Contacts in Proteins Using Learning Classifier Systems. Soft Computing Journal, 13(3):245-
    258, 2009
•   J. Bacardit, E.K. Burke and N. Krasnogor. Improving the scalability of rule-based evolutionary
    learning. Memetic Computing journal 1(1):55-67, 2009
•   J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. Automated Alphabet
    Reduction for Protein Datasets. BMC Bioinformatics 10:6, 2009
•   George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume Bacardit.
    Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on
    Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011
•   J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio
    Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the fusion of
    multiple predicted structural features. Bioinformatics first published online July 25, 2012
    doi:10.1093/bioinformatics/bts472
References/Bibliography
•   Jason H. Moore et al., Bioinformatics challenges for genome-wide association studies
    Bioinformatics (2010) 26(4): 445-455
•   Loris Nanni, Sheryl Brahnam, Alessandra Lumini, High performance set of PseAAC and
    sequence based descriptors for protein classification, Journal of Theoretical Biology 266(1):1-
    10, 2010
•   Fernando Otero et al., A hierarchical multi-label classification ant colony algorithm for protein
    function prediction, Memetic Computing 2(3):165-181, 2010
•   Daniel Barthel et al., Procksi: a decision support system for protein (structure)
    comparison, knowledge, similarity and information. BMC Bioinformatics, 8:416, 2007
•   http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics
•   Federico Divina and Jesus S. Aguilar-Ruiz. 2006. Biclustering of Expression Data with
    Evolutionary Computation. IEEE Trans. on Knowl. and Data Eng. 18, 5 (May 2006), 590-602.
•   Martinez-Ballesteros, M Nepomuceno-Chamorro, J C Riquelme (2011) Inferring gene-gene
    associations from Quantitative Association Rules In: 11th International Conference on
    Intelligent Systems Design and Applications (ISDA 2011 ) 1241 – 1246
•   Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Flores, Jose Lozano, Yves
    Peer, Rosa Blanco, Víctor Robles, Concha Bielza, Pedro Larrañaga. A review of estimation of
    distribution algorithms in bioinformatics. BioData Mining 2008, 1:6 (11 September 2008)
Acknowledgements
•   Prof. Natalio Krasnogor
•   Prof. Michael Holdsworth
•   Prof. Jonathan Hirst
•   Dr. Michael Stout
•   Dr. George Bassel
•   Dr. Enrico Glaab
•   Dr. Pawel Widera

• EPSRC GR/T07534/01 & EP/H016597/1
• EU FP7 CADMAD project

Contenu connexe

Tendances (20)

Proteins databases
Proteins databasesProteins databases
Proteins databases
 
MULTIPLE SEQUENCE ALIGNMENT
MULTIPLE  SEQUENCE  ALIGNMENTMULTIPLE  SEQUENCE  ALIGNMENT
MULTIPLE SEQUENCE ALIGNMENT
 
Sequence alignment
Sequence alignmentSequence alignment
Sequence alignment
 
RNA secondary structure prediction
RNA secondary structure predictionRNA secondary structure prediction
RNA secondary structure prediction
 
Introduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASEIntroduction OF BIOLOGICAL DATABASE
Introduction OF BIOLOGICAL DATABASE
 
Phylogenetic data analysis
Phylogenetic data analysisPhylogenetic data analysis
Phylogenetic data analysis
 
PIR- Protein Information Resource
PIR- Protein Information ResourcePIR- Protein Information Resource
PIR- Protein Information Resource
 
LECTURE NOTES ON BIOINFORMATICS
LECTURE NOTES ON BIOINFORMATICSLECTURE NOTES ON BIOINFORMATICS
LECTURE NOTES ON BIOINFORMATICS
 
Msa
MsaMsa
Msa
 
Databases pathways of genomics and proteomics
Databases pathways of genomics and proteomics Databases pathways of genomics and proteomics
Databases pathways of genomics and proteomics
 
Tools and database of NCBI
Tools and database of NCBITools and database of NCBI
Tools and database of NCBI
 
Prosite
PrositeProsite
Prosite
 
Finding ORF
Finding ORFFinding ORF
Finding ORF
 
Primary, secondary, tertiary biological database
Primary, secondary, tertiary biological databasePrimary, secondary, tertiary biological database
Primary, secondary, tertiary biological database
 
Cath
CathCath
Cath
 
sequence of file formats in bioinformatics
sequence of file formats in bioinformaticssequence of file formats in bioinformatics
sequence of file formats in bioinformatics
 
Swiss prot
Swiss protSwiss prot
Swiss prot
 
Biological database
Biological databaseBiological database
Biological database
 
Introduction of bioinformatics
Introduction of bioinformaticsIntroduction of bioinformatics
Introduction of bioinformatics
 
Biological databases
Biological databasesBiological databases
Biological databases
 

Introduction to Bioinformatics

  • 1. Introduction to Bioinformatics Dr. Jaume Bacardit Interdisciplinary Computing and Complex Systems (ICOS) research group University of Nottingham jaume.bacardit@nottingham.ac.uk
  • 2. About me • Did my PhD in evolutionary learning • Postdoc in Protein Structure Prediction 2005- 2007 • Since 2008 lecturer in Bioinformatics at the University of Nottingham • Research interests – Large-scale data mining – Biological data mining
  • 3. Outline • What is Bioinformatics? • Basic molecular biology • Public databases • Sequence analysis • The scales of bioinformatics • Biological data mining
  • 5. What is Bioinformatics? • Several definitions exist. Michael Liebman proposed a quite elegant definition: – “The study of the information content and information flow in biological systems and processes” (Michael Liebman) – Information content: genome project – Information flow: molecular transport – Biological systems: cells, organisms, … – Biological processes: metabolic networks • Bioinformatics is the science of using information to understand aspects of Biology. That is, a discipline where techniques such as applied mathematics, computer science, statistics, artificial intelligence, etc. are integrated to solve biological problems
  • 6. Information, information, information • As we know there have been major advances in the field of molecular biology • These have been coupled with advances in laboratory (post)genomic technology • This has led to an explosive growth in the collection of biological information • This deluge of information has led to an absolute requirement for 1. Computerized databases to store, organise and index the data 2. Specialized tools to view and analyse the data 3. Specialized tools to infer new knowledge from the data
  • 7. Areas of research (taxonomy of the Bioinformatics Journal) • Genome Analysis • Sequence Analysis • Phylogenetics • Structural Bioinformatics • Gene Expression • Genetics and Population Analysis • Systems Biology • Data and Text Mining • Databases and Ontologies • Bioimage Informatics
  • 8. (Borrowed from “An Introduction to Bioinformatics Algorithms” by Neil C. Jones and Pavel A. Pevzner and further modified by Prof. Natalio Krasnogor) BASIC MOLECULAR BIOLOGY
  • 9. Life begins with Cell • A cell is the smallest structural unit of an organism that is capable of sustained independent functioning • All cells have some common features • What is Life? Can we create it in the lab? Read: The imitation game—a computational chemical approach to recognizing life. Nature Biotechnology, 24:1203-1206, 2006
  • 10. 2 types of cells: Prokaryotes & Eukaryotes
  • 11. Example of cell signaling
  • 12. Terminology • The genome is an organism’s complete set of DNA. – a small bacterial genome contains about 600,000 DNA base pairs – human and mouse genomes have some 3 billion. • The human genome has 23 pairs of chromosomes. – Each chromosome contains many genes. • Gene – basic physical and functional unit of heredity. – specific sequences of DNA bases that encode instructions on how and when to make proteins. • Proteins – Make up the cellular structure – large, complex molecules made up of smaller subunits called amino acids.
  • 13. All Life depends on 3 critical molecules • DNAs – Hold information on how cell works • RNAs – Act to transfer short pieces of information to different parts of cell – Provide templates to synthesize into protein • Proteins – Form enzymes that send signals to other cells and regulate gene activity – Form body’s major components (e.g. hair, skin, etc.) – Are life’s laborers! • Computationally, all three can be represented as sequences of a certain 4-letter (DNA/RNA) or 20-letter (Proteins) alphabet
  • 14. DNA, RNA, and the Flow of Information Replication Transcription Translation Weismann Barrier / Central Dogma of Molecular Biology
  • 15. Overview of DNA to RNA to Protein • A gene is expressed in two steps 1) Transcription: RNA synthesis 2) Translation: Protein synthesis
  • 16. DNA: The Basis of Life • Deoxyribonucleic Acid (DNA) – Double stranded with complementary strands A-T, C-G • DNA is a polymer – Sugar-Phosphate-Base – Bases held together by H bonding to the opposite strand
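The complementary pairing on this slide (A-T, C-G) can be sketched in a few lines of Python. This is only an illustration: the function name `reverse_complement` and the assumption that the input contains only the letters A, C, G and T are ours, not part of the slides.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction.

    Assumes the input holds only A, C, G and T (hypothetical helper).
    """
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice gives back the original strand, which is a quick sanity check on the pairing table.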
  • 17. RNA • RNA is chemically similar to DNA. It is usually only a single strand. T(hymine) is replaced by U(racil) • Some forms of RNA can form secondary structures by “pairing up” with themselves. This can dramatically affect their properties. DNA and RNA can pair with each other. tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
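As a rough sketch of the T-to-U replacement, the snippet below writes a DNA coding strand as RNA. It is a simplification under a stated assumption: the input is the coding (sense) strand, so the transcript has the same letters with U in place of T. In the cell the polymerase reads the opposite (template) strand; the helper name `transcribe` is hypothetical.

```python
def transcribe(coding_strand: str) -> str:
    """Write the coding (sense) DNA strand as RNA by swapping T for U.

    Simplified model: no promoters, splicing or regulation are considered.
    """
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCT"))  # AUGGCU
```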
  • 18. RNA, continued Several types exist, classified by function: • hnRNA (heterogeneous nuclear RNA): Eukaryotic mRNA primary transcripts with introns that have not yet been excised (pre-mRNA). • mRNA: this is what is usually being referred to when a Bioinformatician says “RNA”. This is used to carry a gene’s message out of the nucleus. • tRNA: translates the genetic information in mRNA into an amino acid sequence in order to build a protein • rRNA: ribosomal RNA. Part of the ribosome, which is involved in translation.
  • 19. Transcription • Transcription is highly regulated. Most DNA is in a dense form where it cannot be transcribed. • To start, transcription requires a promoter, a small specific sequence of DNA to which polymerase can bind (~40 base pairs “upstream” of gene) • Finding these promoter regions is only a partially solved problem that is related to motif finding. • There can also be repressors and inhibitors acting in various ways to stop transcription. This makes regulation of gene transcription complex to understand.
  • 20. Definition of a Gene • Regulatory regions: up to 50 kb upstream of +1 site • Exons: protein coding and untranslated regions (UTR) 1 to 178 exons per gene (mean 8.8) 8 bp to 17 kb per exon (mean 145 bp) • Introns: splice acceptor and donor sites, junk DNA average 1 kb – 50 kb per intron • Gene size: Largest – 2.4 Mb (Dystrophin). Mean – 27 kb.
  • 22. Splicing and other RNA processing • In Eukaryotic cells, RNA is processed between transcription and translation. • This complicates the relationship between a DNA gene and the protein it codes for. • Sometimes alternate RNA processing can lead to an alternate protein (splice variants) as a result. This is true in the immune system.
  • 23. Proteins: Crucial molecules for the functioning of life • Structural Proteins: the organism's basic building blocks, e.g. collagen, nails, hair, etc. • Enzymes: biological engines which mediate a multitude of biochemical reactions. Usually enzymes are very specific and catalyze only a single type of reaction, but they can play a role in more than one pathway. • Transmembrane proteins: they are the cell’s housekeepers, e.g. by regulating cell volume, extraction and concentration of small molecules from the extracellular environment and generation of ionic gradients essential for muscle and nerve cell function (the sodium/potassium pump is an example) • Proteins are polypeptide chains, constructed by joining a certain kind of peptides, amino acids, in a linear way • The chain of amino acids, however, folds to create very complex 3D structures
  • 24. Translation • The process of going from RNA to polypeptide. • Three consecutive bases of RNA (called a codon) correspond to one amino acid based on a fixed table. • Always starts with Methionine and ends with a stop codon
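The codon rule on this slide can be sketched as follows. The table is the standard genetic code written in the usual compact U, C, A, G order; the function name `translate` and the choice to start at the first AUG found are illustrative assumptions, and real gene finding is considerably more involved.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order
# (UUU, UUC, UUA, UUG, UCU, ...). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG (Methionine) up to the first stop codon.

    Hypothetical helper: assumes an upper-case string over A, C, G, U.
    """
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends the polypeptide
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGAUGGCUUGGUAAUU"))  # MAW
```

In the example the reading frame is AUG GCU UGG UAA, that is Methionine, Alanine, Tryptophan, then a stop codon.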
  • 26. Protein Structure: Introduction • Different amino acids have different properties • These properties will affect the protein structure and function • Hydrophobicity, for instance, is the main driving force (but not the only one) of the folding process
  • 27. Protein Structure: Hierarchical nature of protein structure • Primary structure = sequence of amino acids: MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL PFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQRE KIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKK HLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYL IKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE • Secondary structure: local interactions • Tertiary structure: global interactions
  • 28. Protein Structure: Why is structure important? • The function of a protein depends greatly on its structure • The structure that a protein adopts is vital to its chemistry • Its structure determines which of its amino acids are exposed to carry out the protein’s function • Its structure also determines what substrates it can react with
  • 29. Protein Structure: Mostly lacking information • Therefore, it is clear that knowing the structure of a protein is crucial for many tasks • However, we only know the structure for a very small fraction of all the proteins that we are aware of – The UniProtKB/TrEMBL archive contains 23165610 (16886838) sequences – The PDB archive of protein structure contains only 84223 (76669) structures • In the native state, proteins fold on their own as soon as they are generated, amino acid by amino acid (with few exceptions, e.g. chaperones) → Can we predict this process so as to close the gap between protein sequences and their 3D structures?
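To put the two archive counts from this slide side by side, the short calculation below divides one by the other. It is only a rough indication: the two archives are not strictly comparable (for instance, they count different kinds of entries), and the variable names are ours.

```python
# Counts quoted on the slide (first figure of each pair).
trembl_sequences = 23_165_610  # UniProtKB/TrEMBL sequences
pdb_structures = 84_223        # PDB structures

# Structures available per known sequence, as a plain ratio.
fraction = pdb_structures / trembl_sequences
print(f"{fraction:.2%}")  # 0.36%
```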
  • 30. Central Dogma of Biology: A Bioinformatics Perspective The information for making proteins is stored in DNA. There is a process (transcription and translation) by which DNA is converted to protein. By understanding this process and how it is regulated we can make predictions and models of cells. Computational problems: Assembly, Sequence Analysis, Gene Finding, Protein Sequence/Structure Analysis
  • 32. Information flow in bioinformatics • Data enters the “bioinformatics scope” when a scientist deposits an experimental result in an appropriate archive • The archive curates and annotates the data • The data is released to the public • Afterwards, the data may be retrieved/analysed: – Integrating the new entry into a search engine – Extracting useful subsets of the data – Deriving new types of information from the data – Aggregating the data by homology, function, structure – Reannotating the data with newly discovered/inferred info. • Quality of data depends on many factors: the techniques used to experimentally create the data, the degree of inference and prediction involved in the annotation process, etc. • Many publicly available databases: http://en.wikipedia.org/wiki/List_of_biological_databases
  • 33. NCBI’s Entrez system http://www.ncbi.nlm.nih.gov/ Entrez is a search and retrieval system that integrates information from databases at NCBI (National Center for Biotechnology Information).
  • 34. Uniprot http://www.uniprot.org • The Universal Protein Resource (UniProt) is a collaboration between the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR)
  • 35. KEGG - http://www.genome.jp/kegg/ • Not just about genes/proteins but also pathways, that is, their interactions
  • 38. Sequences • Be it DNA, RNA or proteins, we have a lot of data that can be represented as sequences over a certain alphabet • Many generic algorithms to deal with biological sequences exist • Sequence alignment • Motif representation
  • 39. Sequence Alignment • Is the assignment of residue-residue correspondences between nucleotide/proteomic sequences
        Query 1   MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
                  MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY       (matches)
        Sbjct 1   MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60

        Query 61  YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120
                  YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL
        Sbjct 61  YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL------------- 107   (gap)
        ...
        Query 301 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360
                  QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ      +    C  P+
        Sbjct 281 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ-----DSFHLECQFPS 335

        Query 361 S-PSVN 365
                    P VN                                                             (mismatches)
        Sbjct 336 KFPGVN 341
  • 40. Motivation • Similarity is expected among biomolecules that are descended from a common ancestor. – Mutations cause differences, but survival of the organism requires that mutations occur in regions that are less critical to function – Important catalytic, regulatory or structural regions remain similar • An alignment between two or more genetic or proteomic sequences represents an explicit hypothesis about their evolutionary histories. • Thus comparison of related gene/protein sequences has been instrumental in shedding light on the information content of these sequences and their biological functions.
  • 41. Definition and aims • Why align sequences? 1. Start with a query sequence with unknown properties and search within a database of millions of sequences to find those which share similarity with the query. 2. Start with a small set of sequences and identify similarities and differences among them. 3. In many sequences or very long sequences, detect commonly occurring patterns
  • 42. Similarity vs. Homology • Similarity is the observation or measurement of resemblance and difference, independent of the source of resemblance. • There are many examples of different organisms with functionally similar organs that came from distinct evolutionary origins • When similarity is due to a common ancestry, we call it homology. • Sequence alignment helps to infer homology hypotheses: – If two sequences are very similar, it is probable that there is a common origin – Therefore, if we know some information (structure, function) from sequence X, and sequence X is similar to sequence Y, it is probable that the same information applies to Y
  • 43. Metrics of similarity: Definitions • Gap: a break in the alignment, in either one of the sequences. – For nucleotides, a consequence of an insertion or deletion mutation. – For proteins, it’s more difficult to say. • Regions of matching residues. – Indicate parts of a sequence that are well conserved • Mismatched residues. – For nucleotides, a consequence of a substitution mutation – Less conserved regions
  • 44. Metrics of similarity: Distance scoring • Distance scoring – Given an alignment with matches, mismatches and gaps, we compute a score following: • For each mismatch, score is increased by 2 • For each gap, score is increased by 4 • For each match, no increase in score – Higher score, less similarity A – G C C G T A T A C G A - - T - T 0 4 0 2 4 4 0 4 0 = 18 • Equivalent metrics exist for similarity (not distance) where higher score means good similarity
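The distance scoring rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `distance_score` and the use of `-` as the gap symbol are assumptions, while the penalties (2 per mismatch, 4 per gap, 0 per match) and the example alignment are taken from the slide.

```python
def distance_score(row_a, row_b, mismatch=2, gap=4):
    """Sum the slide's penalties over the columns of an existing alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":   # a gap in either row
            score += gap
        elif a != b:               # two different residues
            score += mismatch
    return score

# The alignment shown on the slide
print(distance_score("A-GCCGTAT", "ACGA--T-T"))  # 18
```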
  • 45. Metrics of similarity: Mismatches and gaps • Are all mismatches equally bad? – For protein sequences, there are several subgroups of amino acids with similar properties. Mismatches within a group have less impact – For nucleotide sequences, transition mutations (a↔g and t↔c) are more common than transversion mutations (a or g ↔ t or c) – Distance scoring of mismatches could be smarter → substitution matrices • Using statistical analysis on a large corpus of real sequences to generate better scores • How to penalize gaps – Each gap slot gets equal distance score – One score to open a gap, another (smaller) score to extend the same gap
  • 46. Global vs Local alignment • We know how to score good or bad alignments – How to find the optimal one? • Two classes of alignment methods – Global alignment • Finds the best alignment of one entire sequence with another entire sequence – Local alignment • Find the best alignment of one segment of a sequence against another segment of another sequence
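One standard way to find the optimal global alignment is dynamic programming, in the style of the Needleman-Wunsch algorithm. The sketch below is an illustration under assumptions: it reuses the distance penalties from the earlier slide (mismatch 2, gap 4, every gap slot penalised equally), minimises the distance rather than maximising a similarity, and returns only the optimal score, not the alignment itself. The function name is hypothetical.

```python
def global_distance(s, t, mismatch=2, gap=4):
    """Minimum distance score over all global alignments of s and t."""
    # prev[j] = best score for aligning the first i-1 residues of s with the first j of t
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # aligning s[:i] against an empty prefix costs i gaps
        for j in range(1, len(t) + 1):
            substitute = prev[j - 1] + (0 if s[i - 1] == t[j - 1] else mismatch)
            curr.append(min(substitute, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_distance("ACGT", "AGT"))  # 4: one gap, three matches
```

The table has (len(s)+1) x (len(t)+1) cells, which is why exact methods become slow when one sequence is replaced by a database of millions.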
  • 47. Exact vs. Approximate methods • Exact methods for both global and local alignment exist, based on dynamic programming, but are slow – Good enough when there are few sequences – Not so good when comparing a target sequence to a database of millions of known sequences • Approximate methods have been used for many years for large-scale alignment tasks – They use some kind of heuristic to speed up the alignment process – BLAST (Basic Local Alignment Search Tool) is the most famous approximate method • It identifies potential hits by looking for perfect matches of very small sub-sequences (seeds) • It only tries to create a full alignment for sequences where several seeds are identified • PSI-BLAST: version that takes into account that multiple hits are identified. It constructs a tailored substitution matrix based on hits and then refines the alignment
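The seed idea can be illustrated with a small k-mer index. This is not BLAST itself, only a sketch of the filtering step described above; the function names, the seed length `k` and the `min_seeds` cutoff are all assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map every length-k substring to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def candidate_hits(query, index, k, min_seeds=2):
    """Ids of database sequences with at least min_seeds exact seed matches to the query."""
    seeds = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for seq_id, _ in index.get(query[pos:pos + k], ()):
            seeds[seq_id] += 1
    return sorted(seq_id for seq_id, n in seeds.items() if n >= min_seeds)

db = {"s1": "MQNSHSGVNQ", "s2": "AAAAAAAAAA", "s3": "XXSHSGXXXX"}
index = build_kmer_index(db, k=3)
print(candidate_hits("NSHSGV", index, k=3))  # ['s1', 's3']
```

Only the sequences returned here would be passed on to a full (and expensive) alignment.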
  • 48. Multiple Sequence Alignment • When we have to align more than two sequences • Progressive methods (e.g. ClustalW) – Start with seed alignment – Iteratively incorporate other alignments to seed, without modifying what is aligned so far – ClustalW uses phylogenetic trees (representations of the evolutionary relationship between sequences) to progressively construct MSA • Iterative methods (e.g. MUSCLE) – Can re-edit the partial MSA based on the newly incorporated alignments
  • 50. Motifs • When visualising a MSA we can see regions of high agreement and regions of low agreement. • The high agreement regions define that a certain protein belongs to a family • What if we concentrate on modelling and identifying these regions instead of the whole sequences? → Motif finding
  • 51. Modelling motifs • Patterns – Model the subsequence as a regular expression • C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA] • Zinc finger motif • Can cope with a moderate level of variability • Profiles – Specify the most likely values for each position in the motif → acts as a substitution matrix – Use sequence similarity metrics to compute a score of the motif for a given sequence
         1     2     3     4     5     6     7     8     9
    A:   0.5   0.25  0     1     0     0     0     0     0.5
    T:   0     0.25  1     0     0     1     0     0.25  0
    G:   0.5   0.25  0     0     1     0     0     0.75  0.25
    C:   0     0.25  0     0     0     0     1     0     0.25
    Source: http://drmotifs.genouest.org/2010/07/profiles-pwm-pswm-pssm/
  • PROSITE implements both types of motifs
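Both kinds of motif model can be sketched briefly. The first function translates a simple PROSITE-style pattern into a Python regular expression (it handles only `x`, `x(n)`, `[..]` and `{..}`, which is an assumption about how much of the syntax is needed here). The second scores a 9-base site against the profile from the slide by multiplying the per-position values, treating positions as independent. Function names are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern (x, x(n), [..], {..}) into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("{", "[^").replace("}", "]")  # {P} = anything but P
        element = re.sub(r"\((\d+)\)", r"{\1}", element)        # (2) = repeat twice
        parts.append(element)
    return "".join(parts)

# Profile values copied from the slide (positions 1..9)
PROFILE = {
    "A": [0.5, 0.25, 0, 1, 0, 0, 0, 0, 0.5],
    "T": [0, 0.25, 1, 0, 0, 1, 0, 0.25, 0],
    "G": [0.5, 0.25, 0, 0, 1, 0, 0, 0.75, 0.25],
    "C": [0, 0.25, 0, 0, 0, 0, 1, 0, 0.25],
}

def profile_probability(site, profile=PROFILE):
    """Probability of a site under the profile, assuming independent positions."""
    p = 1.0
    for pos, base in enumerate(site):
        p *= profile[base][pos]
    return p

print(prosite_to_regex("C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]"))  # C.H.[LIVMFY]C.{2}C[LIVMYA]
print(profile_probability("AATAGTCGA"))                         # 0.046875
```

A pattern either matches or it does not, whereas the profile gives a graded score, which is why profiles cope better with variability.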
  • 52. Modelling Motifs • Hidden Markov Models – Model the motif as a series of state transitions with probabilities associated to each input symbol and state – Easy to visualise http://drmotifs.genouest.org/2010/07/hiden-markov-models-hmm/ – PFAM uses HMM motifs
  • 53. DNA -> RNA -> Proteins THE SCALES OF BIOINFORMATICS
  • 54. DNA • Coding/non coding • SNPs • Copy number variation • Assembly • Methylation • Primer design
  • 55. Coding/Non Coding • Identifying the regions from an organism’s genome that contain genes • Many different factors involved in this identification – Promoter identification – Long enough Open Reading Frames (ORF) – Splice variants – Introns/Exons (in Eukaryotes) – Statistical properties of gene-coding DNA • HMMs are also used for gene finding
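Of the factors listed, the search for long enough open reading frames is the easiest to illustrate. The sketch below scans only the three forward-strand frames and uses the standard start codon ATG and stop codons TAA, TAG and TGA; the function name and the `min_codons` cutoff are assumptions. Real gene finders combine this signal with the others on the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end) of forward-strand ORFs (ATG ... stop), end exclusive."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                # length in codons, counting both the start and the stop codon
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

print(find_orfs("CCATGAAATTTTAGCC"))  # [(2, 14)]
```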
  • 56. Single Nucleotide Polymorphisms (SNPs) • One base-pair variation in DNA • In most cases in non-coding regions of DNA, but not always • When frequent enough in a population they can be linked to specific traits, e.g. a disease • SNP microarrays can be used to probe hundreds of thousands of SNPs in parallel • In reality few SNPs act on their own – Genome-Wide Association Studies identify groups of SNPs linked to a certain condition
  • 57. Copy Number Variation • In general two copies of each gene exist in a genome • It may be the case that more or fewer than two copies of a certain gene exist for a specific sub-population • It has been suggested that certain CNVs can be linked to specific diseases
  • 58. Genome assembly • Sequencing technologies are able to read (sequence) a complete genome as a series of short overlapping fragments • How to assemble back all these fragments? • Greedy approach – Pair-wise alignments of all fragments – Merge fragments of largest overlap – Keep iterating until all segments are merged • Worked more or less well on old sequencing technologies, not so well on next-generation sequencing data, due to smaller fragment sizes and larger error rate
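The greedy approach can be sketched as follows. This is a toy version under assumptions: overlaps must be exact (no sequencing errors), fragments contained inside other fragments are not handled specially, and the all-against-all overlap search is quadratic in the number of fragments. Function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: remaining fragments stay as separate contigs
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)] + [merged]
    return frags

print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))  # ['ATGCGTACCTTG']
```

Shorter fragments and higher error rates make the overlaps both less reliable and far more numerous, which is consistent with the slide's remark about next-generation data.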
  • 59. Genome mapping • Given a large set of short fragments, as a result of next-generation sequencing, map them to a reference genome • This differs from genome assembly: we do not want to reconstitute a complete genome, just identify to which genes each fragment belongs (among other applications). • Speed is an issue • Modern methods (e.g. SOAP2) compress the genome and are able to align the fragments in the compressed space
  • 60. Methylation • It is a chemical reaction that can block a certain region of a chromosome, preventing its transcription • The process can be reversed, so essentially it is an on/off switch of the affected gene • Specialised microarrays exist for the high-throughput detection of methylated genes • Afterwards, data analysis can take place
  • 61. DNA library specification • A DNA library is a combinatorial set of DNA sequences suited to manufacture via DNA reuse • The first stage towards the creation of a DNA library is the formal specification of the target DNA molecules that comprise it • A set of sequences does not convey the intention behind the library • The key challenge is to enable precise editing of DNA sequences in an extensible and reproducible manner whilst avoiding manual handling of these unwieldy objects
  • 62. DNALD library format • A DNALD library consists of three sets of definitions: inputs, intermediates and outputs, with different semantics – Inputs: existing DNA sequences to be provided with design – Intermediates: conceptual means of factoring common sequences – Outputs: to be produced through DNA reuse
  • 63. DNALD expressions • A DNALD expression is a combination of explicit sequences, definition names, operators and functions that are interpreted according to rules of precedence and association ("evaluated") to produce a set of DNA sequences. • Definitions bind names to the results of expressions.
  • 64. Workbench interface manage projects text editor with: • syntax highlighting • auto-completion • code folding • etc. viewed from different perspectives
  • 65. CADMAD’s DNALD (DNA Library Design) • A specification language that produces a set of target DNA sequences as a function of operations on a set of inputs • To maximise CADMAD's impact the specification process must be: – user friendly and debuggable – but expressively powerful enough to: • define non-trivial combinatorial constructs • communicate degrees of freedom • [Figure: FASTA listing of the input sequences Ret_human, Ret_mouse, Ret_zebrafish and Ret_chicken]
  • 67. RNA expression • Not all genes are transcribed/translated into proteins all the time • The expression of genes is highly sophisticated and depends on many factors • Identifying the genes being expressed at a given point in time in a specific tissue provides crucial information about the roles and interactions of such genes – Compare the genes expressed between different groups of samples to identify those that are differentially expressed – Identify co-expressed genes that present patterns of correlation
  • 68. Measuring RNA expression • RT-PCR (real-time reverse-transcription polymerase chain reaction) – Accurately measures the expression of a predetermined gene • RNA Microarrays – Measure, in parallel, the expression of tens of thousands of genes, but with a considerable level of noise • RNA-Seq – The next-generation sequencing variant for measuring gene expression
  • 69. RNA Structure prediction • An RNA sequence can bind with itself to create complex shapes with a certain pattern of loops • Can we predict, from a given sequence, the structural shape of the RNA?
  • 70. Proteins • Protein classification • Structure prediction • Structure comparison • Function and interaction
  • 71. Protein classification • Proteins can be annotated in many different ways – Function • DNA-binding? Enzyme? – Tissue/Cellular/Sub-cellular localisation – Interacting with other proteins? • Can we predict this annotation using ML? • We need to transform the protein sequence into a uniform representation of equal size for all proteins • Many different representations exist • Several of these problems can be modelled as a hierarchical classification problem
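One of the simplest fixed-size representations is the amino acid composition: every protein, whatever its length, becomes a vector of 20 fractions. The slide does not say which representations were used, so this is only an illustrative choice, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(sequence):
    """Fraction of each standard amino acid; always 20 values, whatever the length."""
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    return [c / total if total else 0.0 for c in counts]

short = composition("AAC")
longer = composition("MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSG")
print(len(short), len(longer))  # 20 20
```

The price of this uniformity is that all information about residue order is lost, which is one reason many alternative representations exist.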
  • 72. Protein Structure Prediction • PSP aims to predict the 3D structure of a protein based on its primary sequence
  • 73. Protein Structure Prediction • PSP is an open problem. The 3D structure depends on many variables • It has been one of the main holy grails of computational biology for many decades • The impacts of having better protein structure models are countless – Genetic therapy – Synthesis of drugs for incurable diseases – Improved crops – Environmental remediation
  • 74. Prediction types of PSP • There are several kinds of prediction problems within the scope of PSP – The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence – There are many structural properties of individual residues within a protein that can be predicted, for instance: • The secondary structure state of the residue • If a residue is buried in the core of the protein or exposed on the surface – Accurate predictions of these sub-problems can simplify the general 3D PSP problem
  • 75. 3D Protein Structure Prediction • Some PSP methods try to find similar proteins and then adapt the structure of the homolog (template) to the target protein → Homology Modeling • Other methods try to find the structure of the protein from scratch (Ab Initio Modelling), optimizing some energy function that models the stability of the protein, in case no homolog can be identified • In between there are other kinds of methods, for varying degrees of good homology of our target, for instance, Fold Recognition or Threading • These methods identify a target based on more than homology (i.e. sequence alignment).
  • 76. Coordination Number Prediction • Two residues of a chain are said to be in contact if their distance is less than a certain threshold (e.g. 8 Å) • CN of a residue: count of contacts that a certain residue has • CN gives us a simplified profile of the density of packing of the protein • [Figure: primary sequence, native state and contacts]
  • 77. Contact Map prediction • Prediction, given two residues from a chain, whether these two residues are in contact or not • This problem can be represented by a binary matrix. 1 = contact, 0 = non contact • Plotting this matrix reveals many characteristics from the protein structure • Very sparse characteristic: Less than 2% of contacts in native structures • [Figure: contact map with helices and sheets marked]
  • 78. Other predictions • Other kinds of residue structural aspects that can be predicted – Solvent accessibility: Amount of surface of each residue that is exposed to solvent – Recursive Convex Hull: A metric that models a protein as an onion, and assigns each residue to a layer. Formally, each layer is a convex hull of points • These features (and others) are predicted in a similar way to SS
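Given the 3D coordinates of a known structure, the contact map and the coordination numbers defined on the previous slide follow directly from pairwise distances. The sketch below assumes one representative point per residue (which atom to use is not stated on the slides) and the 8 Å threshold given as an example; function names are hypothetical.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 if two different residues are closer than threshold (in Å)."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)] for i in range(n)]

def coordination_numbers(cmap):
    """CN of each residue = number of contacts in its row of the map."""
    return [sum(row) for row in cmap]

# Four toy residues on a line, the last one far away from the rest
coords = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (20, 0, 0)]
cmap = contact_map(coords)
print(coordination_numbers(cmap))  # [2, 2, 2, 0]
```

This is how the training labels are derived from native structures; the prediction task is to recover the same matrix from sequence-derived information alone.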
  • 80. Protein Structure Comparison • Protein Structure Comparison (PSC) aims to – Assess the degree of similarity between protein structures – Given a query structure, identify other proteins with similar structure • Why? – Group proteins by structural similarities – Determine the impact of individual residues on the protein structure – Identify distant homologues of protein families – Predict function of proteins with low degree of primary structure (i.e. sequence) similarity with other proteins – Engineer new proteins for specific functions – Assess ab initio predictions
  • 81. Protein Structure Comparison • Sequence-Structure-Function relationships 1) Conserved primary sequences → similar structures 2) Similar structures →? conserved primary sequences 3) Similar structures → conserved function • PSC shares many similarities with sequence alignment. Our aim is to infer new knowledge from the comparison process
  • 82. Protein Structure Comparison • Existing Approaches – SSAP (Orengo & Taylor, 96) – ProSup (Feng & Sippl, 96) – DALI (Holm & Sander, 93) – CE (Shindyalov & Bourne, 98) – LGA (Zemla, 2003) – SCOP (Murzin, Brenner, Hubbard & Chothia, 95) – CATH (Orengo, Michie, Jones, Jones, Swindells & Thornton, 97) – ProCKSI – Consensus of multiple PSC methods
  • 83. Prediction of Protein Function • In an ideal world, the cascade of inference should flow from sequence → structure → function • That is, if we can identify similar sequences or structures to our query target we can (at varying degrees of certainty) infer that they have similar function
  • 84. Prediction of Protein Function • As proteins evolve, they may – Retain function and specificity – Retain function but alter specificity – Change to a related function, or a similar function in a different metabolic context – Change to a completely unrelated function • How much must a protein change before the function changes? – Sometimes, not at all. There are many cases of proteins with different functions in different environments
  • 85. Prediction of Protein Function • Thus, sequence or structure similarity is not always reliable to assign function • Other ways of determining protein function – By identifying patterns of co-regulated genes • Using data from Microarray experiments – By identifying protein-protein interactions
  • 86. Prediction of Protein Function • A related question is: where is the function of a protein taking place? → active site • Several methods exist to predict active/binding sites of proteins from local patterns of sequence or structure • A crude way of doing this prediction is to take a look at the conserved residues of a sequence → they may be related to either the core of the protein (structural stability) or the function of a protein (a change of function is a risk for survival) • More sophisticated methods exist to learn how to predict active sites. They use ML, in a way similar to that used to predict residue structural features in PSP • Still, it is a very tough problem, and ML methods are not much better than BLAST-based methods
  • 88. Three case studies • Mining –omics data • Predicting structural aspects of protein residues • Automated alphabet reduction for protein datasets • In all these three case studies we use the same evolutionary learning system: BioHEL [Bacardit et al., 09]
  • 89. BioHEL • BioHEL [Bacardit et al., 09] is an evolutionary learning system that applies the Iterative Rule Learning (IRL) approach • Designed explicitly to deal with noisy large-scale datasets • IRL was first used in EC by the SIA system [Venturini, 93]
  • 90. BioHEL’s learning paradigm – IRL has been used for many years in the ML community, under the name of separate-and-conquer
  • 91. BioHEL’s objective function • An objective function based on the Minimum-Description-Length (MDL) (Rissanen, 1978) principle that tries to promote rules with – High accuracy: not making mistakes – High coverage: covering as many examples as possible without sacrificing accuracy. Recall (TP/(TP+FN)) will be used to define coverage – Low complexity: rules as simple and general as possible – The objective function is a linear combination of the three objectives above
  • 92. BioHEL’s objective function • Intuitively, we would like to have accurate rules covering as many examples as possible. • However, in complex and inconsistent domains it is rare to obtain such rules • In these cases, the easier path for evolutionary search is to maximize accuracy at the expense of coverage • Therefore, we need to enforce that the evolved rules cover enough examples
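The separate-and-conquer loop itself is short, and a generic sketch may help. Everything here is illustrative: the rule learner is passed in as a function, and `toy_learner` is a deliberately trivial stand-in, not BioHEL's evolutionary search. Each iteration learns one rule, removes the examples it covers, and repeats; whatever remains at the end falls to a default class.

```python
def separate_and_conquer(examples, learn_rule, default_class, max_rules=50):
    """examples: list of (features, label). learn_rule returns (predicate, label) or None."""
    remaining = list(examples)
    rules = []
    while remaining and len(rules) < max_rules:
        rule = learn_rule(remaining)          # "conquer": learn one rule
        if rule is None:
            break
        predicate, _ = rule
        if not any(predicate(features) for features, _ in remaining):
            break                             # a rule covering nothing makes no progress
        rules.append(rule)
        # "separate": drop the covered examples before learning the next rule
        remaining = [ex for ex in remaining if not predicate(ex[0])]
    return rules, default_class

def predict(rules, default_class, features):
    """Rules are tried in the order they were learnt; the default class is the fallback."""
    for predicate, label in rules:
        if predicate(features):
            return label
    return default_class

def toy_learner(remaining):
    """Trivial stand-in learner: one exact-match rule for the first positive example."""
    positives = [features for features, label in remaining if label == 1]
    if not positives:
        return None
    target = positives[0]
    return (lambda features, target=target: features == target), 1

examples = [((1,), 1), ((2,), 0), ((3,), 1), ((1,), 1)]
rules, default = separate_and_conquer(examples, toy_learner, default_class=0)
print(len(rules), predict(rules, default, (3,)), predict(rules, default, (2,)))  # 2 1 0
```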
  • 93. BioHEL’s objective function • Three parameters define the shape of the function • The choice of the coverage break is crucial for the proper performance of the system • Also, coverage term penalizes rules that do not cover a minimum percentage of examples or that cover too many
  • 94. BioHEL’s characteristics • Attribute list rule representation – Automatically identifying the relevant attributes for a given rule and discarding all the other ones • The ILAS windowing scheme – Efficiency enhancement method, not all training points are used for each fitness computation • An explicit default rule mechanism – Generating more compact rule sets – Iterative process terminates when it is impossible to evolve a rule where the associated class is the majority class among the matched examples – At this point, all remaining training instances are assigned to the default class
  • 96. Mining –omics data • Biological data can be generated at many different levels – Genomics (DNA) – Transcriptomics (RNA) – Proteomics (proteins) – Metabolomics (small compounds) – Lipidomics (lipids) • Hundreds of –omics have been catalogued
  • 97. What does an –omics dataset look like? • In most cases datasets present a similar structure • Each sample is characterised by a large number of variables (RNA, Proteins, lipids, etc.) • Each variable indicates (usually quantitatively) the presence of that element in the sample • Due to the high cost of most –omics technologies, variables >> samples – Problems of over-fitting
  • 98. What can we do with the dataset? • In most cases, samples are annotated with a qualitative label – Cancer/Non-cancer patients – Samples of seed tissue for which it is known if the seed germinated or not – Age of the sample • Therefore, we can treat these datasets as classification problems, and generate prediction models from the data • Not just as classification problems – Clustering/Biclustering – Association Rule Mining – Regression
  • 99. But in most cases, domain experts are not (only) interested in predictions • Biomarker identification – Identify the key variables • Most strongly associated to each outcome – Using e.g. t-tests to identify those • Presenting higher prediction capacity – As identified by ML methods – Identify interactions between variables • By presenting very high (anti)correlation between them • By acting together to generate predictions
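Ranking variables by a t-test can be sketched with the standard library alone. The example computes Welch's t statistic (the unequal-variance form, chosen here as an assumption) for each variable between two groups of samples and sorts by its absolute value. Function names and the toy data are hypothetical, and a real analysis with thousands of variables would also need p-values and a multiple-testing correction.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups of measurements (each needs >= 2 values)."""
    diff = mean(group_a) - mean(group_b)
    se2 = variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    if se2 == 0:  # no spread in either group
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(se2)

def rank_variables(data, labels):
    """data: {variable: [value per sample]}, labels: [0/1 per sample]. Sort by |t|."""
    scores = {}
    for name, values in data.items():
        a = [v for v, y in zip(values, labels) if y == 1]
        b = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)

data = {"g1": [1, 2, 1, 2, 10, 11, 10, 11], "g2": [5, 6, 5, 6, 5, 6, 5, 6]}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(rank_variables(data, labels))  # ['g1', 'g2']
```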
  • 100. Functional Network Reconstruction for seed germination • Microarray data obtained from seed tissue of Arabidopsis thaliana • 122 samples represented by the expression level of almost 14000 genes • It had been experimentally determined whether each of the seeds had germinated or not • Can we learn to predict germination/dormancy from the microarray data? • [Bassel et al., 2011]
  • 101. Generating rule sets • BioHEL was able to predict the outcome of the samples with 93.5% accuracy (10 x 10-fold cross-validation) • Learning from a scrambled dataset (labels randomly assigned to samples) produced ~50% accuracy
    If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 → Predict germination
    If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 → Predict germination
    If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 → Predict germination
    If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 → Predict germination
    Everything else → Predict dormancy
  • 102. Identifying regulators • Rule building process is stochastic • Generates different rule sets each time the system is run • But if we run the system many times, we can see some patterns in the rule sets • Some genes appear considerably more frequently than the rest – Some associated to dormancy – Some associated to germination
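A rule set like the one on this slide is applied as an ordered list with a default rule. The sketch below encodes the four rules with the gene identifiers and thresholds shown above; the data layout (one dictionary of thresholds per rule) and the function name are assumptions made for illustration.

```python
# Each rule: all listed genes must exceed their threshold (values from the slide)
RULES = [
    {"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96},
    {"At4g34710": 349.67, "At4g37760": 150.75, "At1g30135": 17.66},
    {"At3g03050": 37.90, "At2g20630": 96.01, "At3g02885": 9.66},
    {"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76, "At1g48320": 56.80},
]

def predict_germination(expression, rules=RULES):
    """expression: {gene id: expression level}. Missing genes never satisfy a condition."""
    for rule in rules:
        if all(expression.get(gene, float("-inf")) > cutoff
               for gene, cutoff in rule.items()):
            return "germination"
    return "dormancy"  # default rule: everything else

sample = {"At1g27595": 101.0, "At3g49000": 70.0, "At2g40475": 60.0}
print(predict_germination(sample))  # germination
```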
  • 103. Known regulators appear with high frequency in the rules
  • 104. Generating co-prediction networks of interactions • For each of the rules shown before to be true, all of the conditions in it need to be true at the same time – Each rule is expressing an interaction between certain genes • From a high number of rule sets we can identify pairs of genes that co-occur with high frequency and generate functional networks • The network shows different topology when compared to other types of network construction methods (e.g. by gene co-expression) • Different regions in the network contain the germination and dormancy genes
  • 105. Experimental validation • We have experimentally verified this analysis • By ordering and planting knockouts for the highly ranked genes • We have been able to identify four new regulators of germination, with different phenotype from the wild type
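Counting gene pairs that co-occur in rules can be sketched as below. The input layout (a list of rule sets, each rule given as the gene identifiers it tests), the `min_count` cutoff and the function name are assumptions; how the actual network was thresholded is not stated on the slide.

```python
from collections import Counter
from itertools import combinations

def co_prediction_edges(rule_sets, min_count=2):
    """rule_sets: list of rule sets; each rule is an iterable of gene ids."""
    pair_counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            # every unordered pair of genes tested by the same rule co-occurs once
            for pair in combinations(sorted(set(rule)), 2):
                pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

runs = [
    [["g1", "g2", "g3"]],              # rule set from run 1
    [["g2", "g1"], ["g3", "g4"]],      # rule set from run 2
]
print(co_prediction_edges(runs))  # {('g1', 'g2'): 2}
```

The surviving pairs are the edges of the network, with the count as a natural edge weight.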
  • 106. PREDICTING STRUCTURAL ASPECTS OF PROTEIN RESIDUES
  • 107. Prediction of structural aspects of protein residues • Many of these features are due to local interactions of an amino acid and its immediate neighbours – Can it be predicted using information from the closest neighbours in the chain? • [Figure: residues Ri-5 ... Ri+5 aligned with their secondary structure states SSi-5 ... SSi+5; sliding windows (Ri-1, Ri, Ri+1) → SSi, (Ri, Ri+1, Ri+2) → SSi+1, (Ri+1, Ri+2, Ri+3) → SSi+2] – In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1. That is a window of ±1 residues around the target
  • 108. ARFF file for a simple PSP dataset
    @relation AA+CN_Q2
    @attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
    @attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
    @attribute class {0,1}
    @data
    X,X,X,X,A,E,I,K,H,0
    X,X,X,A,E,I,K,H,Y,0
    X,X,A,E,I,K,H,Y,Q,0
    X,A,E,I,K,H,Y,Q,F,0
    A,E,I,K,H,Y,Q,F,N,0
    E,I,K,H,Y,Q,F,N,V,0
    I,K,H,Y,Q,F,N,V,V,0
    K,H,Y,Q,F,N,V,V,M,1
    H,Y,Q,F,N,V,V,M,T,0
    Y,Q,F,N,V,V,M,T,C,1
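The data rows of this ARFF file can be regenerated with a sliding window. The sketch below pads the sequence with `X` at both ends (as the rows above do) and emits, for each residue, the ±4 window followed by the class. The sequence prefix and the ten labels are read off the rows shown on the slide; the function name is hypothetical.

```python
def window_instances(sequence, labels, half_window=4, pad="X"):
    """Yield one row per labelled residue: the window around it, then its class."""
    padded = pad * half_window + sequence + pad * half_window
    for i, label in enumerate(labels):
        yield list(padded[i:i + 2 * half_window + 1]) + [label]

sequence = "AEIKHYQFNVVMTC"                # first 14 residues, recovered from the rows
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]    # classes of the first 10 residues
for row in window_instances(sequence, labels):
    print(",".join(map(str, row)))
# first line printed: X,X,X,X,A,E,I,K,H,0
```

Note that the central attribute `AA` has no `X` in its domain, since the target residue itself is never padding.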
  • 109. What information do we include for each residue? – Early prediction methods used just the primary sequence → the AA types of the residues in the window – However, the primary sequence has a limited amount of information • It does not contain any evolutionary information: it does not say which residues are conserved and which are not – Where can we obtain this information? • Position-Specific Scoring Matrices, which are a product of a Multiple Sequence Alignment
  • 110. Position-Specific Scoring Matrices (PSSM) – For each residue in the query sequence compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query) – These distributions will tell us which mutations are likely and which mutations are less likely for each residue in the query sequence – In essence it’s similar to a substitution matrix but tailored for the sequence that we are aligning – A PSSM profile will also tell us which residues are more conserved and which residues are more subject to insertions or deletions
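The first step described above, the per-position distribution of amino acids in the aligned sequences, can be sketched as follows. This computes raw frequencies only; turning them into the integer scores of an actual PSSM additionally involves background frequencies, pseudocounts and a log-odds transform, none of which is shown. The function name and the toy alignment are hypothetical.

```python
from collections import Counter

def column_frequencies(aligned, gap="-"):
    """Per-column amino-acid frequencies of an MSA given as equal-length strings."""
    profile = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)  # gaps are not counted
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile

msa = ["ACD", "ACE", "A-D", "GCD"]  # toy alignment, one string per sequence
for position, freqs in enumerate(column_frequencies(msa), start=1):
    print(position, freqs)
```

In this toy case the second column is fully conserved (only C), while the first and third tolerate substitutions.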
  • 111. PSSM for the first 10 residues of 1n7lA
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A:  4 -1 -2 -2  0 -1 -1  0 -2 -1 -2 -1 -1 -2 -1  1  0 -3 -2  0
    M: -1 -2 -3 -4 -2 -1 -2 -3 -2  1  2 -2  7  0 -3 -2 -1 -2 -1  1
    E: -1  0  0  2 -4  2  6 -2  0 -4 -3  1 -2 -4 -1  0 -1 -3 -2 -3
    K: -1  2  0 -1 -4  1  1 -2 -1 -3 -3  5 -2 -4 -1  0 -1 -3 -2 -3
    V:  0 -3 -3 -4 -1 -3 -3 -4 -4  3  1 -3  1 -1 -3 -2  0 -3 -1  5
    Q: -1  1  0  0 -3  6  2 -2  0 -3 -3  1 -1 -4 -2  0 -1 -2 -2 -3
    Y: -2 -1 -1 -3 -3 -1 -1 -3  6 -2 -2 -2 -1  2 -3 -2 -2  1  7 -2
    L: -2 -3 -4 -4 -2 -3 -3 -4 -3  2  5 -3  2  0 -3 -3 -1 -2 -1  1
    T:  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  2  5 -3 -2  0
    R: -2  6 -1 -2 -4  1  0 -3  0 -3 -3  2 -2 -3 -2 -1 -1 -3 -2 -3
  • 112. Secondary Structure Prediction – The most usual way is to predict whether a residue belongs to an α helix, a β sheet or is in coil state – Several programs can determine the actual SS state of a protein from a PDB file. The most common of them is DSSP – Typically, a window of ±7 amino acids (15 in total) is used. This means 300 attributes (when using PSSM). – A dataset with 1000 proteins with ~250 AA/protein would have ~250000 instances
  • 113. Secondary Structure Prediction • [Figure: pipeline from the primary sequence (R1 ... Rn), through an MSA, to the PSSM profile of the sequence (PSSM1 ... PSSMn); the windows method then builds a window of PSSM profiles (PSSMi-1, PSSMi, PSSMi+1) that is used to predict SSi]
  • 114. Other prediction problems • This same structure of prediction can be applied to most 1D structural aspects • However, many of these features are natively continuous measures (or integer) • To treat these problems as classification problems, we need to discretise the output • Unsupervised methods are applied – Uniform length (UL) and uniform frequency (UF) discretisation
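The two unsupervised discretisation schemes differ only in how the cut points are chosen, which a short sketch makes concrete. Uniform length splits the value range into equal-width intervals; uniform frequency places the cuts so that each interval holds roughly the same number of values. Function names and the toy values are hypothetical.

```python
def uniform_length_cuts(values, bins):
    """Cut points that split [min, max] into intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + step * k for k in range(1, bins)]

def uniform_frequency_cuts(values, bins):
    """Cut points that put roughly the same number of values in each interval."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // bins] for k in range(1, bins)]

def discretise(value, cuts):
    """Index of the interval that value falls into (0 .. len(cuts))."""
    return sum(value >= c for c in cuts)

values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]   # one outlier stretches the range
print(uniform_length_cuts(values, 2))        # [50.0]
print(uniform_frequency_cuts(values, 2))     # [5]
```

With the outlier present, uniform length puts nine of the ten values in the first class, whereas uniform frequency yields balanced classes; this is the balanced/unbalanced choice that makes these datasets flexible benchmarks.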
  • 115. PSP datasets are good ML benchmarks • These problems can be modelled in many ways: – Regression or classification problems – Low/high number of classes – Balanced/unbalanced classes – Adjustable number of attributes • Ideal benchmarks !! • http://icos.cs.nott.ac.uk/datasets/psp_benchmark.html
  • 116. Contact Map Prediction • We participated in the CASP9 competition • CASP = Critical Assessment of Techniques for Protein Structure Prediction. Biennial competition • Every day, for about three months, the organizers release some protein sequences for which nobody knows the structure (129 sequences were released in CASP9, in 2010) • Each prediction group is given three weeks to return their predictions • If the machinery is not well oiled, it is not feasible to participate !! • For CM, prediction groups have to return a list of predicted contacts (they are not interested in non-contacts) and, for each predicted pair of contacting residues, a confidence level
  • 117. Contact Map prediction • Prediction, given two residues from a chain, whether these two residues are in contact or not • This problem can be represented by a binary matrix. 1 = contact, 0 = non contact • Plotting this matrix reveals many characteristics from the protein structure • [Figure: contact map with helices and sheets marked]
  • 118. Steps for CM prediction (Nottingham method) 1. Prediction of  Secondary structure (using PSIPRED)  Solvent Accessibility  Recursive Convex Hull Using BioHEL [Bacardit et al., 09]  Coordination Number 2. Integration of all these predictions plus other sources of information 3. Final CM prediction (using BioHEL)
  • 119. Prediction of RCH, SA and CN  We selected a set of 3262 protein chains from PDB-REPRDB with:  A resolution less than 2Å  Less than 30% sequence identify  Without chain breaks nor non-standard residues  90% of this set was used for training (~490000 residues)  10% for test
  • 120. Prediction of RCH, SA and CN • All three features were predicted from a window of ±4 residues around the target • Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information • Each residue is characterised by a vector of 180 values • The domain of each of the three features was partitioned into 5 states
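One reading consistent with the numbers on this slide is that the 180 values are the 9 window positions (±4 around the target) times the 20 columns of a PSSM row. The sketch below builds such a representation under that assumption; the zero padding at the chain ends and the function name are also assumptions, since the slides do not say how chain ends are handled.

```python
import numpy as np

def window_features(pssm, half_window=4):
    """Return an (L, (2*half_window+1)*n_cols) matrix of windowed PSSM features.

    Window positions that fall off either end of the chain are zero-padded
    (an assumed convention, not taken from the slides).
    """
    pssm = np.asarray(pssm, dtype=float)
    length, n_cols = pssm.shape
    pad = np.zeros((half_window, n_cols))
    padded = np.vstack([pad, pssm, pad])
    width = 2 * half_window + 1
    # Row i of the result is the flattened window centred on residue i.
    return np.stack([padded[i:i + width].ravel() for i in range(length)])

# Hypothetical PSSM for a 6-residue chain (20 columns, one per amino acid).
pssm = np.arange(6 * 20).reshape(6, 20)
features = window_features(pssm)
```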
  • 121. Characterisation of the contact map problem • Three types of input information were used 1. Detailed information from three windows of residues, centred on – The two target residues (2x) – The middle point between them 2. Information about the connecting segment between the two target residues 3. Global protein information [Figure: diagram locating the three input types, labelled 1, 2 and 3]
  • 122. Contact Map dataset • From the original set of 3262 proteins we kept all those with <250 AA and a randomly selected 20% of the larger proteins • Still, the resulting training set contained 32 million pairs of AA and 631 attributes • Less than 2% of those pairs are actual contacts • More than 60GB of disk space
  • 123. Samples and ensembles • 50 samples of 660K examples are generated from the training set, with a ratio of 2:1 non-contacts/contacts • BioHEL is run 25 times for each sample • Prediction is done by a consensus of 1250 rule sets • Confidence of prediction is computed from the distribution of votes in the ensemble • The whole training process took about 25K CPU hours [Figure: Training set → (x50) Samples → (x25) Rule sets → Consensus → Predictions]
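The sampling and consensus steps on this slide can be sketched as below. This is a simplified illustration, not the BioHEL pipeline: sampling without replacement, taking the vote fraction as the confidence value, and all names and toy data are assumptions.

```python
import random
from collections import Counter

def draw_sample(contacts, non_contacts, size, ratio=2, rng=None):
    """Draw one training sample with `ratio` non-contacts per contact."""
    rng = rng or random.Random(0)
    n_pos = size // (ratio + 1)
    n_neg = size - n_pos
    return rng.sample(contacts, n_pos) + rng.sample(non_contacts, n_neg)

def consensus(votes):
    """Majority vote; confidence is the fraction of votes for the winning class."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

# Hypothetical toy data: 20 contact pairs and 100 non-contact pairs.
contacts = [("contact", i) for i in range(20)]
non_contacts = [("non-contact", i) for i in range(100)]
sample = draw_sample(contacts, non_contacts, size=30)

# Hypothetical votes from 50 x 25 = 1250 rule sets for one residue pair.
votes = [1] * 900 + [0] * 350
label, confidence = consensus(votes)
```

With the slide's figures, each 660K-example sample would hold about 220K contacts and 440K non-contacts.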
  • 124. Contact Map prediction in CASP • Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction • The assessors then rank the predictions for each protein and examine the top L/x, where L is the length of the protein and x={5,10} • Two measures are computed from these top-ranked L/x contacts – Accuracy: TP/(TP+FP) – Xd: difference between the distribution of predicted distances and a random distribution
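The accuracy measure over the top L/x predictions can be sketched as follows. The data layout (tuples of residue indices and a confidence) and the toy values are assumptions for illustration; the official CASP assessment applies further conventions not described on the slide.

```python
def top_accuracy(predictions, true_contacts, length, x):
    """Accuracy TP/(TP+FP) over the L/x most confident predicted contacts.

    predictions: list of (i, j, confidence) tuples.
    true_contacts: set of (i, j) pairs with i < j.
    """
    n_top = max(1, length // x)
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:n_top]
    if not ranked:
        return 0.0
    tp = sum((min(i, j), max(i, j)) in true_contacts for i, j, _ in ranked)
    return tp / len(ranked)

# Hypothetical predictions for a protein of length L = 20.
predictions = [(1, 10, 0.9), (2, 12, 0.8), (3, 15, 0.7), (4, 18, 0.6), (5, 19, 0.5)]
true_contacts = {(1, 10), (2, 12), (3, 15)}
acc_l5 = top_accuracy(predictions, true_contacts, length=20, x=5)    # top 4 predictions
acc_l10 = top_accuracy(predictions, true_contacts, length=20, x=10)  # top 2 predictions
```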
  • 125. CASP9 results These two groups derived contact predictions from 3D models http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf
  • 126. Understanding the rule sets • Each rule set has on average 135 rules • We have a total of 168470 rules • It is impossible to read all of them individually, but we can extract useful statistics • For instance, how often was each attribute used in the rules? • Full analysis
  • 127. Distribution of frequency of use of attributes • All 631 attributes are actually used (min frequency=429) • However, some of them are used much more frequently than others
  • 128. Top 10 attributes • All four kinds of residue-level predictions are highly ranked
      Attribute      Frequency   Count
      PredSS_r1_1    1.48%       18141
      PredCN_r1      1.66%       20336
      propensity     1.74%       21288
      PredSS_r2      1.75%       21350
      PredSS_r1      1.82%       22205
      PredRCH_r2     1.87%       22856
      PredRCH_r1     2.04%       24961
      PredSA_r2      2.12%       25891
      PredSA_r1      2.39%       29246
      separation     4.17%       50951
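The attribute-usage statistic can be sketched as a simple count over all rules in the ensemble. The percentages reported on the slide appear consistent with each count being divided by the total number of attribute occurrences, but that is an inference. The representation of a rule as a set of attribute names is a hypothetical simplification, not BioHEL's rule format.

```python
from collections import Counter

def attribute_usage(rule_sets):
    """Count how often each attribute appears across all rules.

    rule_sets: list of rule sets; each rule is a set of attribute names
    (a hypothetical simplification of the real rule representation).
    Returns {attribute: (count, share of all attribute occurrences)}.
    """
    counts = Counter(attr for rules in rule_sets for rule in rules for attr in rule)
    total = sum(counts.values())
    return {attr: (n, n / total) for attr, n in counts.most_common()}

# Two tiny hypothetical rule sets.
rule_sets = [
    [{"separation", "PredSA_r1"}, {"separation"}],
    [{"separation", "PredSS_r1"}],
]
usage = attribute_usage(rule_sets)
```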
  • 129. AUTOMATED ALPHABET REDUCTION FOR PROTEIN DATASETS
  • 130. Motivation • PSP is a very costly process • As an example, one of the best PSP methods in CASP8, Rosetta@Home, could dedicate up to 104 computing years to predict a single protein’s 3D structure • One possible way to alleviate this computational cost is to simplify the representation used to model the proteins
  • 131. Target for reduction: the primary sequence • The primary sequence of a protein is a usual target for such simplification – It is composed of a fairly high-cardinality alphabet of 20 symbols, which share commonalities – One reduction widely used in the community is the hydrophobic-polar (HP) alphabet, which reduces these 20 symbols to just two – The HP representation is usually too simple: too much information is lost in the reduction process [Stout et al., 06] • Can we automatically generate these reduced alphabets and tailor them to the specific problem at hand?
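A reduced alphabet is simply a many-to-one mapping over the 20 amino acid letters. The sketch below applies such a mapping for an HP-style alphabet. The particular assignment of residues to the hydrophobic group varies between authors, so the grouping used here is an assumption for illustration, not the one used in the cited work.

```python
def reduce_sequence(seq, groups):
    """Rewrite a protein sequence using a reduced alphabet.

    groups: {new_letter: string of amino acid letters mapped to it}.
    """
    mapping = {aa: letter for letter, members in groups.items() for aa in members}
    return "".join(mapping[aa] for aa in seq)

# One possible hydrophobic/polar split (assumed; definitions differ in the literature).
HP_GROUPS = {"H": "ACFILMVW", "P": "DEGHKNPQRSTY"}

reduced = reduce_sequence("MKVLAAGE", HP_GROUPS)
```

The same function works for any number of groups, so a 5-letter alphabet is just a dictionary with five keys.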
  • 132. Automated Alphabet Reduction [Bacardit et al., 09] • We use an automated, information theory-driven method to optimize alphabet reduction policies for PSP datasets • An optimization algorithm clusters the AA alphabet into a predefined number of new letters • The fitness function of the optimization is based on Mutual Information (MI), a metric that quantifies the interrelationship between two discrete variables – The aim is to find the reduced representation that retains as much relevant information as possible for the feature being predicted • Afterwards we feed the reduced dataset into a learning method to verify whether the reduction was adequate
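The mutual information criterion can be illustrated on a toy example: a good reduction keeps the MI between the (reduced) residue letter and the class label close to that of the full alphabet, while a poor one destroys it. This sketch only shows the scoring idea for single residues. The actual method optimises the grouping with an evolutionary algorithm and works on richer inputs, and the toy data and groupings below are invented for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI in bits between two equally long sequences of discrete symbols."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def reduce_letters(seq, groups):
    """Apply a reduced alphabet given as {new_letter: original letters}."""
    mapping = {aa: letter for letter, members in groups.items() for aa in members}
    return [mapping[aa] for aa in seq]

# Hypothetical toy data: residue letters and a binary class per residue.
residues = list("AAVVKKEE")
labels = [1, 1, 1, 1, 0, 0, 0, 0]

mi_full = mutual_information(residues, labels)
mi_good = mutual_information(reduce_letters(residues, {"h": "AV", "p": "KE"}), labels)
mi_bad = mutual_information(reduce_letters(residues, {"x": "AK", "y": "VE"}), labels)
```

Here the first grouping loses no information about the label, whereas the second merges letters with opposite labels and drops the MI to zero.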
  • 133. Alphabet Reduction protocol [Figure: Dataset (Card=20) → ECGA, guided by Mutual Information, with target alphabet Size = N → Dataset (Card=N) → BioHEL → Ensemble of rule sets → Accuracy on the test set]
  • 134. Automated Alphabet Reduction • Competent 5-letter alphabet (similar performance to the AA alphabet) • Different alphabets for CN and SA domains • Unexpected explanations: alphabet reduction clustered AA types that experts did not expect
  • 135. Automated Alphabet Reduction • Our method produces better reduced alphabets than other reduced alphabets from the literature and than expert-designed ones
      Group                          Alphabet     Letters   CN acc.    SA acc.    Diff.     Ref.
                                     AA           20        74.0±0.6   70.7±0.4   ---       ---
                                     Our method   5         73.3±0.5   70.3±0.4   0.7/0.4   [Bacardit et al., 07]
      Alphabets from the literature  WW5          6         73.1±0.7   69.6±0.4   0.9/1.1   [Wang & Wang, 99]
                                     SR5          6         73.1±0.7   69.6±0.4   0.9/1.1   [Solis & Rackovsky, 00]
                                     MU4          5         72.6±0.7   69.4±0.4   1.4/1.3   [Murphy et al., 00]
                                     MM5          6         73.1±0.6   69.3±0.3   0.9/1.4   [Melo & Marti-Renom, 06]
      Expert-designed alphabets      HD1          7         72.9±0.6   69.3±0.4   1.1/1.4   [Bacardit et al., 07]
                                     HD2          9         73.0±0.6   69.3±0.4   1.0/1.4   [Bacardit et al., 07]
                                     HD3          11        73.2±0.6   69.9±0.4   0.8/0.8   [Bacardit et al., 07]
  • 136. Efficiency gains from the alphabet reduction • We have extrapolated the reduced alphabet to the much larger and richer Position-Specific Scoring Matrices (PSSM) representation • Accuracy difference is still less than 1% • Obtained rule sets are simpler and training process is much faster • Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07] • Won the bronze medal of the 2007 Humies awards
  • 137. Conclusions • Bioinformatics contains many challenges that computer science can tackle – Optimisation – Machine learning – Software engineering • Evolutionary computation has proven to be very competitive across a large range of bioinformatics problems • Facing these challenges has led to the development of many new EC methods
  • 138. References/Bibliography • Journals – The Bioinformatics Journal – BMC Bioinformatics – BioData Mining • Bioinformatics books – Introduction to Bioinformatics by Arthur Lesk, Oxford University Press. – Introduction to Bioinformatics. A. Tramontano, Chapman and Hall/CRC • Specialised topics – Bioinformatics for –omics data. Methods and Protocols. Bernd Mayer (ed). Springer – Next-Generation Sequencing special issue of the Bioinformatics Journal; http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
  • 139. References/Bibliography • J. Bacardit, M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz, Coordination number prediction using Learning Classifier Systems: Performance and interpretability. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO2006), pp. 247-254, ACM Press, 2006 • Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Prediction of Recursive Convex Hull Class Assignments for Protein Residues. Bioinformatics, 24(7):916-923, 2008 • Stout, M., Bacardit, J., Hirst, J.D., Smith, R.E. and Krasnogor, N. Prediction of Topological Contacts in Proteins Using Learning Classifier Systems. Soft Computing Journal, 13(3):245-258, 2009 • J. Bacardit, E.K. Burke and N. Krasnogor. Improving the scalability of rule-based evolutionary learning. Memetic Computing journal 1(1):55-67, 2009 • J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. Automated Alphabet Reduction for Protein Datasets. BMC Bioinformatics 10:6, 2009 • George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume Bacardit. Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011 • J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics first published online July 25, 2012 doi:10.1093/bioinformatics/bts472
  • 140. References/Bibliography • Jason H. Moore et al., Bioinformatics challenges for genome-wide association studies. Bioinformatics 26(4):445-455, 2010 • Loris Nanni, Sheryl Brahnam, Alessandra Lumini, High performance set of PseAAC and sequence based descriptors for protein classification, Journal of Theoretical Biology 266(1):1-10, 2010 • Fernando Otero et al., A hierarchical multi-label classification ant colony algorithm for protein function prediction, Memetic Computing 2(3):165-181, 2010 • Daniel Barthel et al., Procksi: a decision support system for protein (structure) comparison, knowledge, similarity and information. BMC Bioinformatics, 8:416, 2007 • http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics • Federico Divina and Jesus S. Aguilar-Ruiz. Biclustering of Expression Data with Evolutionary Computation. IEEE Trans. on Knowl. and Data Eng. 18(5):590-602, 2006 • M. Martinez-Ballesteros, I. Nepomuceno-Chamorro, J.C. Riquelme. Inferring gene-gene associations from Quantitative Association Rules. In: 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011), pp. 1241-1246, 2011 • Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Flores, Jose Lozano, Yves Peer, Rosa Blanco, Víctor Robles, Concha Bielza, Pedro Larrañaga. A review of estimation of distribution algorithms in bioinformatics. BioData Mining 1:6, 2008
  • 141. Acknowledgements • Prof. Natalio Krasnogor • Prof. Michael Holdsworth • Prof. Jonathan Hirst • Dr. Michael Stout • Dr. George Bassel • Dr. Enrico Glaab • Dr. Pawel Widera • EPSRC GR/T07534/01 & EP/H016597/1 • EU FP7 CADMAD project

Editor's notes

  1. Definitions consist of a name and a DNALD expression. Inputs are to be defined using a subset of DNALD expressions: unambiguous nucleotide sequences; the reverse and/or complement of those; the imported outputs of other DNALD libraries (facilitating iterative library consumption); standard sequence formats.
  2. Unary operations include subsequence extraction and mutation. These can be chained together, each operating on the result of the previous one. Binary operations include concatenation, repetition, and unions. Functions include reverse, complement and back-translation. Sequences of nucleotides and amino acids are quoted strings containing single-letter symbols according to the IUPAC nomenclature (and numbers, which are ignored). Amino acid sequences are only expected in the context of back-translations. Ambiguous nucleotides expand to the set of unambiguous alternatives (within reason: 10×N = 4^10 ≈ 10^6 sequences). Circular sequences are defined by a parenthesised overlap at the 3'-end: 'ACGT…(AC)'. Reverse and complement functions: reverse(complement('ATAGAGTAG')). The repetition operation is multiplication of an expression by either a positive integer or a range of positive integers, creating a set: 'A'*3 -> 'AAA', 'A'*(2:4) -> {'AA', 'AAA', 'AAAA'}. Back-translation returns the set of DNA sequences that could encode an amino acid sequence using a particular codon table. The complete set of sequences will likely be unfeasibly large, so it must be handled appropriately. Various strategies for sampling the space of possible sequences will be developed, and algorithms such as GeneOptimizer will be incorporated if the source is available, or reimplemented from the description if possible. User-defined constraints are yet to be formalised.
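The semantics of some of the operations described in these notes can be mimicked in a few lines of Python. This is not DNALD and does not reproduce its syntax; it is a sketch of the described behaviour only. The function names are hypothetical, the IUPAC ambiguity table is the standard one, and the codon table is a deliberately tiny excerpt of the standard genetic code.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT",
         "S": "CG", "W": "AT", "K": "GT", "M": "AC", "B": "CGT",
         "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT"}
# Excerpt of the standard genetic code, for illustration only.
CODONS = {"M": ["ATG"], "W": ["TGG"], "K": ["AAA", "AAG"], "F": ["TTT", "TTC"]}

def complement(seq):
    return seq.translate(COMPLEMENT)

def reverse(seq):
    return seq[::-1]

def repeat(seq, counts):
    """Repetition by a range of counts yields a set of sequences."""
    return {seq * n for n in counts}

def expand(seq):
    """Expand ambiguous nucleotides into the set of unambiguous sequences."""
    return {"".join(p) for p in product(*(IUPAC[s] for s in seq))}

def back_translate(protein, table=CODONS):
    """All DNA sequences encoding `protein` under the given codon table."""
    return {"".join(c) for c in product(*(table[aa] for aa in protein))}

rc = reverse(complement("ATAGAGTAG"))
repeats = repeat("A", range(2, 5))
variants = expand("AN")
encodings = back_translate("MKF")
```

The set sizes multiply quickly: ten N symbols already give 4^10 = 1,048,576 sequences, which is why the notes call for sampling strategies rather than full enumeration.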