Introduction to Bioinformatics
1. Introduction to Bioinformatics
Dr. Jaume Bacardit
Interdisciplinary Computing and Complex Systems
(ICOS) research group
University of Nottingham
jaume.bacardit@nottingham.ac.uk
2. About me
• Did my PhD in evolutionary learning
• Postdoc in Protein Structure Prediction 2005-
2007
• Since 2008 lecturer in Bioinformatics at the
University of Nottingham
• Research interests
– Large-scale data mining
– Biological data mining
3. Outline
• What is Bioinformatics?
• Basic molecular biology
• Public databases
• Sequence analysis
• The scales of bioinformatics
• Biological data mining
5. What is Bioinformatics?
• Several definitions exist. Michael Liebman proposed a quite
elegant definition:
– “The study of the information content and information flow in
biological systems and processes” (Michael Liebman)
– Information content: genome project
– Information flow: molecular transport
– Biological systems: cells, organisms, …
– Biological processes: metabolic networks
• Bioinformatics is the science of using information to
understand aspects of Biology. That is, a discipline where
techniques such as applied mathematics, computer
science, statistics, artificial intelligence, etc. are integrated to
solve biological problems
6. Information, information, information
• As we know there have been major advances in the
field of molecular biology
• These have been coupled with advances in laboratory
(post)genomic technology
• This has led to an explosive growth in the
collection of biological information
• This deluge of information has led to an absolute
requirement for
1. Computerized databases to store, organise and index the data
2. Specialized tools to view and analyse the data
3. Specialized tools to infer new knowledge from the data
7. Areas of research (taxonomy of the
Bioinformatics Journal)
• Genome Analysis
• Sequence Analysis
• Phylogenetics
• Structural Bioinformatics
• Gene Expression
• Genetics and Population Analysis
• Systems Biology
• Data and Text Mining
• Databases and Ontologies
• Bioimage Informatics
8. (Borrowed from “An Introduction to Bioinformatics Algorithms” by Neil C.
Jones and Pavel A. Pevzner and further modified by Prof. Natalio
Krasnogor)
BASIC MOLECULAR BIOLOGY
9. Life begins with the Cell
• A cell is the smallest structural unit of an organism that is capable of
sustained independent functioning
• All cells have some common features
• What is Life? Can we create it in the lab? Read:
The imitation game—a computational chemical approach to
recognizing life. Nature Biotechnology, 24:1203-1206, 2006
12. Terminology
• The genome is an organism’s complete set of DNA.
– a small bacterial genome contains about 600,000 DNA base pairs
– human and mouse genomes have some 3 billion.
• the human genome is organised into 23 pairs of chromosomes.
– Each chromosome contains many genes.
• Gene
– basic physical and functional units of heredity.
– specific sequences of DNA bases that encode
instructions on how and when to make proteins.
• Proteins
– Make up the cellular structure
– large, complex molecules made up of smaller subunits
called amino acids.
13. All Life depends on 3 critical molecules
• DNAs
– Hold information on how cell works
• RNAs
– Act to transfer short pieces of information to different parts of cell
– Provide templates to synthesize into protein
• Proteins
– Form enzymes that send signals to other cells and regulate gene
activity
– Form body’s major components (e.g. hair, skin, etc.)
– Are life’s laborers!
• Computationally, all three can be represented as
sequences of a certain 4-letter (DNA/RNA) or 20-letter
(Proteins) alphabet
14. DNA, RNA, and the Flow of Information
[Figure: DNA replication; transcription (DNA to RNA) and translation (RNA to protein). Caption: Weismann Barrier / Central Dogma of Molecular Biology]
15. Overview of DNA to RNA to Protein
• A gene is expressed in two steps
1) Transcription: RNA synthesis
2) Translation: Protein synthesis
16. DNA: The Basis of Life
• Deoxyribonucleic Acid (DNA)
– Double stranded with complementary strands A-T, C-G
• DNA is a polymer
– Sugar-Phosphate-Base
– Bases held together by H bonding to the opposite strand
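The complementary pairing above (A-T, C-G) means one strand determines the other. A minimal sketch, assuming sequences are plain uppercase strings; the function name is hypothetical and chosen for illustration:

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

print(reverse_complement("ATGCCG"))  # CGGCAT
```

Applying the function twice returns the original strand, which is a quick sanity check for the pairing table.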
17. RNA
• RNA is chemically similar to DNA. It is usually
only a single strand. T(hymine) is replaced by
U(racil)
• Some forms of RNA can form secondary
structures by “pairing up” with themselves. This can
dramatically affect their properties.
• DNA and RNA can pair with each other.
tRNA linear and 3D view: http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
18. RNA, continued
Several types exist, classified by function:
• hnRNA (heterogeneous nuclear RNA): Eukaryotic mRNA primary
transcripts with introns that have not yet been excised (pre-mRNA).
• mRNA: this is what is usually being referred to when a
Bioinformatician says “RNA”. This is used to carry a gene’s
message out of the nucleus.
• tRNA: transfers genetic information from mRNA to an amino acid
sequence in order to build a protein
• rRNA: ribosomal RNA. Part of the ribosome which is involved in
translation.
19. Transcription
• Transcription is highly regulated. Most DNA is in a
dense form where it cannot be transcribed.
• To start, transcription requires a promoter, a small
specific sequence of DNA to which polymerase can
bind (~40 base pairs “upstream” of gene)
• Finding these promoter regions is only a partially
solved problem that is related to motif finding.
• There can also be repressors and inhibitors acting in
various ways to stop transcription. This makes
regulation of gene transcription complex to
understand.
20. Definition of a Gene
• Regulatory regions: up to 50 kb upstream of +1 site
• Exons: protein coding and untranslated regions (UTR)
1 to 178 exons per gene (mean 8.8)
8 bp to 17 kb per exon (mean 145 bp)
• Introns: splice acceptor and donor sites, junk DNA
average 1 kb – 50 kb per intron
• Gene size: Largest – 2.4 Mb (Dystrophin). Mean – 27 kb.
22. Splicing and other RNA processing
• In Eukaryotic cells, RNA is processed
between transcription and translation.
• This complicates the relationship between
a DNA gene and the protein it codes for.
• Sometimes alternate RNA processing can
lead to an alternate protein (splice
variants) as a result. This is true in the
immune system.
23. Proteins: Crucial molecules for
the functioning of life
• Structural Proteins: the organism's basic building blocks, e.g.
collagen, nails, hair, etc.
• Enzymes: biological engines which mediate a multitude of biochemical
reactions. Usually enzymes are very specific and catalyze only a single type
of reaction, but they can play a role in more than one pathway.
• Transmembrane proteins: they are the cell’s housekeepers, e.g. by
regulating cell volume, extraction and concentration of small molecules from
the extracellular environment and generation of ionic gradients essential for
muscle and nerve cell function (the sodium/potassium pump is an example)
• Proteins are polypeptide chains, constructed by joining amino acids
in a linear way through peptide bonds
• The chain of amino acids, however, folds to create very complex 3D
structures
24. Translation
• The process of going
from RNA to
polypeptide.
• Three bases of
RNA (called a codon)
correspond to one
amino acid based on a
fixed table.
• Always starts with
Methionine and ends
with a stop codon
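The fixed codon table can be written down compactly. The sketch below assumes the standard genetic code and an mRNA given as an uppercase string; `transcribe` and `translate` are hypothetical helper names, and real translation involves details (reading-frame choice, alternative start codons) that are ignored here.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(coding_dna: str) -> str:
    """mRNA with the same sequence as the coding strand, T replaced by U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate from the first AUG (Methionine) to the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # assumption: no start codon means no protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate(transcribe("GGATGAAATACTAAGG")))  # MKY
```

In the example the codons read are AUG (M), AAA (K), UAC (Y) and UAA (stop).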
26. Protein Structure: Introduction
• Different amino acids
have different properties
• These properties will
affect the protein
structure and function
• Hydrophobicity, for
instance, is the main
driving force (but not the
only one) of the folding
process
27. Protein Structure: Hierarchical nature of protein
structure
Primary Structure = Sequence of amino acids
MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL
PFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQRE
KIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKK
HLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYL
IKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE
Secondary Structure: local interactions
Tertiary Structure: global interactions
28. Protein Structure: Why is structure
important?
The function of a protein depends greatly on
its structure
The structure that a protein adopts is vital to
its chemistry
Its structure determines which of its amino
acids are exposed to carry out the protein’s
function
Its structure also determines what substrates
it can react with
29. Protein Structure: Mostly lacking
information
• Therefore, it is clear that knowing the structure of a
protein is crucial for many tasks
• However, we only know the structure for a very small
fraction of all the proteins that we are aware of
– The UniProtKB/TrEMBL archive contains 23165610 (16886838)
sequences
– The PDB archive of protein structure contains only
84223 (76669) structures
• In the native state, proteins fold on their own as soon as
they are generated, amino acid by amino acid (with few
exceptions, e.g. chaperones). Can we predict this
process so as to close the gap between protein sequences
and their 3D structures?
30. Central Dogma of Biology: A Bioinformatics
Perspective
The information for making proteins is stored in DNA. There is
a process (transcription and translation) by which DNA is
converted to protein. By understanding this process and how it
is regulated we can make predictions and models of cells.
[Figure: computational problems: assembly, gene finding, sequence analysis, protein sequence/structure analysis]
32. Information flow in bioinformatics
• Data enters the “bioinformatics scope” when a scientist deposits an
experimental result in an appropriate archive
• The archive curates and annotates the data
• The data is released to the public
• Afterwards, the data may be retrieved/analysed:
– Integrating the new entry into a search engine
– Extracting useful subsets of the data
– Deriving new types of information from the data
– Aggregating the data, by homology, function, structure
– Reannotating the data with new discovered/inferred info.
• Quality of data depends on many factors: the techniques used to
experimentally create the data, the degree of inference and prediction
involved in the annotation process, etc.
• Many publicly available databases:
http://en.wikipedia.org/wiki/List_of_biological_databases
33. NCBI’s Entrez system
http://www.ncbi.nlm.nih.gov/
Entrez is a search and retrieval system that integrates
information from databases at NCBI (National Center for
Biotechnology Information).
34. Uniprot http://www.uniprot.org
• The Universal Protein Resource (UniProt) is a collaboration between the
European Bioinformatics Institute (EBI), the SIB Swiss Institute of
Bioinformatics and the Protein Information Resource (PIR)
38. Sequences
• Be it DNA, RNA or proteins, we have a lot of data
that can be represented as sequences over a
certain alphabet
• Many generic algorithms to deal with
biological sequences exist
• Sequence alignment
• Motif representation
39. Sequence Alignment
• Is the assignment of residue-residue correspondences
between nucleotide/proteomic sequences
Query 1   MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
          MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY
Sbjct 1   MQNSHSGVNQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRY 60
Query 61  YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSV 120
          YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL
Sbjct 61  YETGSIRPRAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLL------------- 107
...
Query 301 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQPPVPSQTSSYSCMLPT 360
          QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ      +    C  P+
Sbjct 281 QPTTPVSSFTSGSMLGRTDTALTNTYSALPPMPSFTMANNLPMQ-----DSFHLECQFPS 335
Query 361 S-PSVN 365
            P VN
Sbjct 336 KFPGVN 341
Annotations: letters on the middle line are matches; runs of dashes are gaps; blank or + positions on the middle line are mismatches.
40. Motivation
• Similarity is expected among biomolecules that are
descended from a common ancestor.
– Mutations cause differences, but survival of the organism requires
that mutations occur in regions that are less critical to function
– Important catalytic, regulatory or structural regions remain similar
• An alignment between two or more genetic or proteomic
sequences represents an explicit hypothesis about their
evolutionary histories.
• Thus comparison of related gene/protein sequences has
been instrumental in shedding light on the information
content of these sequences and their biological
functions.
41. Definition and aims
• Why align sequences?
1. Start with a query sequence with unknown
properties and search within a database of
millions of sequences to find those which share
similarity with the query.
2. Start with a small set of sequences and identify
similarities and differences among them.
3. In many sequences or very long
sequences, detect commonly occurring patterns
42. Similarity vs. Homology
• Similarity is the observation or measurement of resemblance
and difference, independent of the source of resemblance.
• There are many examples of different organisms with
functionally similar organs that came from distinct
evolutionary origins
• When similarity is due to a common ancestry, we call it
homology.
• Sequence alignment helps to infer homology hypotheses:
– If two sequences are very similar, it is probable that there is a common
origin
– Therefore, if we know some information (structure, function) from
sequence X, and sequence X is similar to sequence Y, it is probable that
the same information applies to Y
43. Metrics of similarity: Definitions
• Gap: a break in the alignment, in either one of the
sequences.
– For nucleotides, a consequence of an insertion or deletion
mutation.
– For proteins, it’s more difficult to say.
• Regions of matching residues.
– Indicate parts of a sequence that are well conserved
• Mismatched residues.
– For nucleotides, a consequence of a substitution mutation
– Less conserved regions
44. Metrics of similarity: Distance scoring
• Distance scoring
– Given an alignment with matches, mismatches and
gaps, we compute a score following:
• For each mismatch, score is increased by 2
• For each gap, score is increased by 4
• For each match, no increase in score
– Higher score, less similarity
A - G C C G T A T
A C G A - - T - T
0 4 0 2 4 4 0 4 0 = 18
• Equivalent metrics exist for similarity (not
distance) where higher score means good
similarity
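The distance scoring rule can be checked mechanically. A minimal sketch, assuming the two aligned rows are given as equal-length strings with '-' marking a gap slot; the function name is hypothetical:

```python
def distance_score(row_a: str, row_b: str, mismatch: int = 2, gap: int = 4) -> int:
    """Distance of an already-aligned pair of rows; '-' marks a gap slot."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap        # each gap slot adds 4
        elif x != y:
            score += mismatch   # each mismatch adds 2
        # matches add nothing
    return score

print(distance_score("A-GCCGTAT", "ACGA--T-T"))  # 18
```

The call reproduces the slide's example: four gap slots (16) plus one mismatch (2) give 18.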
45. Metrics of similarity: Mismatches and gaps
• Are all mismatches equally bad?
– For protein sequences, there are several subgroups of amino
acids with similar properties. Mismatches within a group have
less impact
– For nucleotide sequences, transition mutations (a↔g and
t↔c) are more common than transversion mutations (a or g ↔ t or c)
– Distance scoring of mismatches could be smarter -> substitution
matrices
• Using statistical analysis on a large corpus of real sequences to generate
better scores
• How to penalize gaps
– Each gap slot gets equal distance score
– One score to open a gap, another (smaller) score to extend the
same gap
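The two gap-penalty schemes differ only in how a run of consecutive gap slots is charged. The sketch below compares them on one aligned row; the opening cost of 4 follows the earlier distance-scoring slide, while the extension cost of 1 is an assumed value chosen for illustration.

```python
from itertools import groupby

def gap_runs(row: str) -> list[int]:
    """Lengths of the consecutive runs of '-' in one aligned row."""
    return [len(list(run)) for is_gap, run in groupby(row, key=lambda c: c == "-") if is_gap]

def linear_gap_cost(row: str, per_slot: int = 4) -> int:
    """Every gap slot gets the same distance score."""
    return per_slot * sum(gap_runs(row))

def affine_gap_cost(row: str, gap_open: int = 4, gap_extend: int = 1) -> int:
    """The first slot of each run pays gap_open, every further slot gap_extend."""
    # gap_extend = 1 is an assumption, not a value from the slides.
    return sum(gap_open + gap_extend * (length - 1) for length in gap_runs(row))

row = "ACGA--T-T"
print(linear_gap_cost(row), affine_gap_cost(row))  # 12 9
```

With the open/extend scheme, one long gap is cheaper than several short ones of the same total length, which matches the intuition that a single insertion or deletion event can span several positions.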
46. Global vs Local alignment
• We know how to score good or bad
alignments
– How to find the optimal one?
• Two classes of alignment methods
– Global alignment
• Finds the best alignment of one entire sequence with
another entire sequence
– Local alignment
• Find the best alignment of one segment of a sequence
against another segment of another sequence
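For global alignment, the optimal alignment can be found exactly with dynamic programming (the Needleman-Wunsch approach). The sketch below minimises the distance score used earlier (mismatch 2, gap slot 4); it is an illustration under those assumed costs, not the implementation of any particular tool.

```python
def global_alignment(a: str, b: str, mismatch: int = 2, gap: int = 4):
    """Return (distance, aligned_a, aligned_b) for a minimum-distance global alignment."""
    n, m = len(a), len(b)
    # dist[i][j] holds the smallest distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap
    for j in range(1, m + 1):
        dist[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else mismatch
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or mismatch
                             dist[i - 1][j] + gap,       # gap in b
                             dist[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return dist[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

print(global_alignment("ACGT", "AGT"))  # (4, 'ACGT', 'A-GT')
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two sequence lengths.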
47. Exact vs. Approximate methods
• Exact methods for both global and local alignment exist, based on
dynamic programming, but are slow
– Good enough when there are few sequences
– Not so good when comparing a target sequence to a database of millions
of known sequences
• Approximate methods have been used for many years for large-
scale alignment tasks
– They use some kind of heuristic to speed up the alignment process
– BLAST (Basic Local Alignment Search Tool) is the most famous approximate
method
• It identifies potential hits by looking for perfect matches of very small sub-sequences
(seeds)
• It only tries to create a full alignment for sequences where several seeds are identified
• PSI-BLAST: version that takes into account that multiple hits are identified. It constructs a
tailored substitution matrix based on hits and then refines the alignment
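The seed idea can be illustrated with exact matches of short words. This is a simplified sketch of the heuristic, not BLAST's actual algorithm: the word length `k`, the `min_seeds` threshold and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def index_kmers(sequence: str, k: int) -> dict[str, list[int]]:
    """Map every length-k word of the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def find_seeds(query: str, subject: str, k: int = 3) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    subject_index = index_kmers(subject, k)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in subject_index.get(query[i:i + k], [])]

def worth_aligning(query: str, subject: str, k: int = 3, min_seeds: int = 2) -> bool:
    """Only sequences with several seeds go on to a full (slow) alignment."""
    return len(find_seeds(query, subject, k)) >= min_seeds

print(find_seeds("MQNSHSGV", "AAQNSHSAA"))  # [(1, 2), (2, 3), (3, 4)]
```

Indexing the database words once makes each lookup cheap, which is where the speed-up over running dynamic programming against every database sequence comes from.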
48. Multiple Sequence Alignment
• When we have to align more than two sequences
• Progressive methods (e.g. ClustalW)
– Start with seed alignment
– Iteratively incorporate other alignments to
seed, without modifying what is aligned so far
– ClustalW uses phylogenetic trees (representations of
the evolutionary relationship between sequences) to
progressively construct MSA
• Iterative methods (e.g. MUSCLE)
– Can re-edit the partial MSA based on the newly
incorporated alignments
50. Motifs
• When visualising a MSA we can see regions of
high agreement and regions of low
agreement.
• The high agreement regions define that a
certain protein belongs to a family
• What if we concentrate on modelling and
identifying these regions instead of the whole
sequences? -> Motif finding
51. Modelling motifs
• Patterns
– Model the subsequence as a regular expression
• C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]
• Zinc finger motif
• Can cope with a moderate level of variability
• Profiles
– Specify the most likely values for each position in the motif; the profile acts as a
substitution matrix
– Use sequence similarity metrics to compute a score of the motif for a given
sequence
      1     2     3     4     5     6     7     8     9
A:    0.5   0.25  0     1     0     0     0     0     0.5
T:    0     0.25  1     0     0     1     0     0.25  0
G:    0.5   0.25  0     0     1     0     0     0.75  0.25
C:    0     0.25  0     0     0     0     1     0     0.25
http://drmotifs.genouest.org/2010/07/profiles-pwm-pswm-pssm/
• PROSITE implements both types of motifs
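Both motif models can be sketched in a few lines. The converter below handles only a simple subset of the PROSITE pattern syntax (x, [..], {..}, (n) and (n,m)), and the profile score is taken here as the product of the per-position values from the table above; both choices, and the function names, are assumptions made for illustration rather than PROSITE's own scoring.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("{", "[^").replace("}", "]")  # {..} = forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # (n) or (n,m) = repetition
        parts.append(element)
    return "".join(parts)

# Profile from the slide: value of each base at each of the 9 motif positions.
PROFILE = {
    "A": [0.5, 0.25, 0, 1, 0, 0, 0, 0, 0.5],
    "T": [0, 0.25, 1, 0, 0, 1, 0, 0.25, 0],
    "G": [0.5, 0.25, 0, 0, 1, 0, 0, 0.75, 0.25],
    "C": [0, 0.25, 0, 0, 0, 0, 1, 0, 0.25],
}

def profile_score(window: str) -> float:
    """Product of the profile values of a 9-letter window (assumed scoring rule)."""
    score = 1.0
    for position, base in enumerate(window):
        score *= PROFILE[base][position]
    return score

regex = prosite_to_regex("C-x-H-x-[LIVMFY]-C-x(2)-C-[LIVMYA]")
print(regex)                                          # C.H.[LIVMFY]C.{2}C[LIVMYA]
print(re.search(regex, "AACAHALCAACLAA").group())     # CAHALCAACL
print(profile_score("GATAGTCGA"))                     # 0.046875
```

A window containing any base with value 0 at its position scores 0, which shows why practical profiles usually add pseudocounts and work with log-odds scores instead of raw products.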
52. Modelling Motifs
• Hidden Markov Models
– Model the motif as a series of state transitions
with probabilities associated to each input symbol
and state
– Easy to visualise
http://drmotifs.genouest.org/2010/07/hiden-markov-models-hmm/
– PFAM uses HMM motifs
53. DNA -> RNA -> Proteins
THE SCALES OF BIOINFORMATICS
54. DNA
• Coding/non coding
• SNPs
• Copy number variation
• Assembly
• Methylation
• Primer design
55. Coding/Non Coding
• Identifying the regions from an organism’s
genome that contain genes
• Many different factors involved in this
identification
– Promoter identification
– Long enough Open Reading Frames (ORF)
– Splice variants
– Introns/Exons (in Eukaryotes)
– Statistical properties of gene-coding DNA
• HMMs are also used for gene finding
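Of the factors listed, the search for long enough Open Reading Frames is the easiest to sketch. The code below scans the three reading frames of one strand only and ignores introns, promoters and the reverse strand; the `min_codons` default is an arbitrary illustrative threshold, far smaller than what real gene finders use.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ORFs on the given strand.

    An ORF here starts at ATG and runs to the first in-frame stop codon;
    end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

sequence = "CCATGAAATTTTAGGATGTAA"
print(find_orfs(sequence))  # [(2, 14)]
```

In the example, the ORF ATGAAATTTTAG (four codons) is reported, while the shorter ATGTAA at the end falls below the threshold.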
56. Single Nucleotide Polymorphisms
(SNPs)
• One base-pair variation in DNA
• In most cases in non-coding regions of DNA, but
not always
• When frequent enough in a population they can
be linked to specific traits, e.g. a disease
• SNP microarrays can be used to probe hundreds
of thousands of SNPs in parallel
• In reality few SNPs act on their own
– Genome-Wide Association Studies identify groups of
SNPs linked to a certain condition
57. Copy Number Variation
• In general two copies of each gene exist in a
genome
• It may be the case that more or fewer than two
copies of a certain gene exist for a specific
sub-population
• It has been suggested that certain CNV can be
linked to specific diseases
58. Genome assembly
• Sequencing technologies are able to read (sequence) a
complete genome as a series of short overlapping
fragments
• How to assemble back all these fragments?
• Greedy approach
– Pair-wise alignments of all fragments
– Merge fragments of largest overlap
– Keep iterating until all segments are merged
• Worked more or less well on old sequencing
technologies, not so well on next-generation
sequencing data, due to smaller fragment sizes and
larger error rate
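The greedy approach can be sketched as follows. To keep the example short it uses exact suffix-prefix overlaps instead of pairwise alignments, so it assumes error-free fragments from a single strand; `min_length` and the function names are illustrative assumptions.

```python
def overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(fragments: list[str], min_length: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap(left, right, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: several contigs are left
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))  # ['ATGCGTACCTTG']
```

Every iteration compares all pairs of contigs, which hints at why this strategy scales poorly to the very large numbers of short reads mentioned above.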
59. Genome mapping
• Given a large set of short fragments, as a result of
next-generation sequencing, map them to a
reference genome
• Different from genome assembly. We do not want to
reconstitute a complete genome, just identify to
which genes each fragment belongs (among
other applications).
• Speed is an issue
• Modern methods (e.g. SOAP2) compress the
genome and are able to align the fragments in
the compressed space
60. Methylation
• It is a chemical reaction that can block a
certain region of a chromosome, preventing
its transcription
• The process can be reverted, so essentially it is
an on/off switch of the affected gene
• Specialised microarrays exist for the high-
throughput detection of methylated genes
• Afterwards, data analysis can take place
61. DNA library specification
• A DNA library is a combinatorial set of DNA sequences suited to
manufacture via DNA reuse
• The first stage towards the creation of a DNA library is the formal
specification of the target DNA molecules that comprise it
• A set of sequences does not convey the intention behind the library
Key challenge is to enable precise
editing of DNA sequences in an
extensible and reproducible manner
whilst avoiding manual handling of
these unwieldy objects
62. DNALD library format
• A DNALD library consists of three sets of definitions:
inputs, intermediates and outputs, with different
semantics
– Inputs: existing DNA sequences to be provided with design
– Intermediates: conceptual means of factoring out common sequences
– Outputs: to be produced through DNA reuse
63. DNALD expressions
• A DNALD expression is a combination of explicit sequences, definition
names, operators and functions that are interpreted according to rules of
precedence and association ("evaluated") to produce a set of DNA
sequences.
• Definitions bind names to the results of expressions.
64. Workbench interface
manage
projects
text editor with:
• syntax highlighting
• auto-completion
• code folding
• etc.
viewed from different
perspectives
65. CADMAD’s DNALD (DNA Library
Design)
[Background figure: FASTA-format DNA sequences labelled Ret_human, Ret_mouse, Ret_zebrafish and Ret_chicken]
A specification language that produces a set of target DNA
sequences as a function of operations on a set of inputs
To maximise CADMAD's impact the specification process must be:
• user friendly and debuggable
• but expressively powerful enough to:
– define non-trivial combinatorial constructs
– communicate degrees of freedom
67. RNA expression
• Not all genes are transcribed/translated into proteins
all the time
• The expression of genes is highly sophisticated and
depends on many factors
• Identifying the genes being expressed at a given point
in time in a specific tissue provides crucial information
about the roles and interactions of such genes
– Compare the genes expressed between different groups of
samples to identify those that are differentially expressed
– Identify co-expressed genes, that present patterns of
correlation
68. Measuring RNA expression
• RT-PCR (Real-time reverse transcription polymerase chain
reaction)
– Measures accurately the expression of a pre-
determined gene
• RNA Microarrays
– Measures, in parallel, the expression of tens of
thousands of genes, but with considerable level of
noise
• RNA-Seq
– The next-generation sequencing variant for measuring
gene expression
69. RNA Structure prediction
• An RNA sequence can bind with itself to create
complex shapes with a certain pattern of
loops
• Can we predict, from a given sequence, the
structural shape of the RNA?
70. Proteins
• Protein classification
• Structure prediction
• Structure comparison
• Function and interaction
71. Protein classification
• Proteins can be annotated in many different ways
– Function
• DNA-binding? Enzyme?
– Tissue/Cellular/Sub-cellular localisation
– Interacting with other proteins?
• Can we predict this annotation using ML?
• We need to transform the protein sequence into a
uniform representation of equal size for all proteins
• Many different representations exist
• Several of these problems can be modelled as a
hierarchical classification problem
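One of the simplest fixed-size representations is the amino acid composition: whatever the length of the protein, it becomes a vector of 20 fractions. A minimal sketch, given as one possible representation among the many mentioned above; the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, fixed order

def composition_vector(protein: str) -> list[float]:
    """Fraction of each of the 20 standard amino acids, in a fixed order."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in protein.upper():
        if residue in counts:  # skip non-standard symbols such as X
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]

short = composition_vector("MKYNNHDK")
long_ = composition_vector("MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTL")
print(len(short), len(long_))  # 20 20
```

The composition discards the order of the residues, so richer representations are usually needed when the position of a residue matters.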
73. Protein Structure Prediction
PSP is an open problem. The 3D structure
depends on many variables
It has been one of the main holy grails of
computational biology for many decades
• The impacts of having better protein structure models
are countless
– Genetic therapy
– Synthesis of drugs for incurable diseases
– Improved crops
– Environmental remediation
74. Prediction types of PSP
• There are several kinds of prediction problems within
the scope of PSP
– The main one, of course, is to predict the 3D coordinates
of all atoms of a protein (or at least the backbone) based
on its primary sequence
– There are many structural properties of individual residues
within a protein that can be predicted, for instance:
• The secondary structure state of the residue
• If a residue is buried in the core of the protein or exposed on the
surface
– Accurate predictions of these sub-problems can simplify
the general 3D PSP problem
75. 3D Protein Structure Prediction
• Some PSP methods try to find similar proteins and then
adapt the structure of the homolog (template) to the
target protein -> Homology Modeling
• Other methods try to find the structure of the protein
from scratch (Ab Initio Modelling), optimizing some
energy function that models the stability of the
protein, in case no homolog can be identified
• In between there are other kinds of methods, for
varying degrees of good homology of our target, for
instance, Fold Recognition or Threading
• These methods identify a template based on more than
homology (i.e. sequence alignment).
76. Coordination Number Prediction
Two residues of a chain are said to be in contact if their
distance is less than a certain threshold (e.g. 8Å)
[Figure: primary sequence, native state, and a contact between two residues]
CN of a residue : count of contacts that a certain
residue has
CN gives us a simplified profile of the density of packing
of the protein
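Given one 3D coordinate per residue, the CN definition translates directly into code. The sketch assumes coordinates in Å and counts every other residue of the chain; practical definitions often skip immediate sequence neighbours, which the hypothetical `min_separation` parameter allows.

```python
import math

def coordination_numbers(coords, threshold=8.0, min_separation=1):
    """Count, for each residue, the other residues closer than threshold (in Å)."""
    n = len(coords)
    counts = [0] * n
    for i in range(n):
        # min_separation=1 compares every pair once; larger values skip
        # residues that are close in sequence (an assumed option).
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                counts[i] += 1
                counts[j] += 1
    return counts

# Toy chain: four residues on a straight line, 3.8 Å apart.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
print(coordination_numbers(chain))  # [2, 3, 3, 2]
```

In the toy chain only the two end residues (11.4 Å apart) are not in contact.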
77. Contact Map prediction
• Prediction, given two residues
from a chain, whether these two
residues are in contact or not
• This problem can be represented
by a binary matrix. 1= contact, 0
= non contact
• Plotting this matrix reveals many
characteristics from the protein
structure
• Very sparse: less than 2% of residue
pairs are in contact in native
structures
[Figure: contact map with helices and sheets labelled]
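The binary matrix described above can be built from residue coordinates when the structure is known, which is how training targets for a predictor would be obtained. A minimal sketch assuming one coordinate per residue in Å and the 8 Å threshold used earlier; the function names are hypothetical.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 if two different residues are closer than threshold (Å)."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)]
            for i in range(n)]

def contact_density(cmap):
    """Fraction of residue pairs (i < j) that are in contact."""
    n = len(cmap)
    pairs = n * (n - 1) // 2
    contacts = sum(cmap[i][j] for i in range(n) for j in range(i + 1, n))
    return contacts / pairs if pairs else 0.0

chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(chain)
for row in cmap:
    print(row)
```

The matrix is symmetric, and the density of this four-residue toy chain is far higher than in a real protein, where most residue pairs are far apart.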
78. Other predictions
• Other kinds of residue
structural aspects that can be
predicted
– Solvent accessibility: Amount of
surface of each residue that is
exposed to solvent
– Recursive Convex Hull: A metric
that models a protein as an
onion, and assigns each residue
to a layer. Formally, each layer is
a convex hull of points
• These features (and
others) are predicted in a
similar way to that used for secondary structure (SS)
80. Protein Structure Comparison
• Protein Structure Comparison (PSC) aims to
– Assess the degree of similarity between protein structures
– Given a query structure, identify other proteins with similar
structure
• Why?
– Group proteins by structural similarities
– Determine the impact of individual residues on the protein
structure
– Identify distant homologues of protein families
– Predict function of proteins with low degree of primary
structure (i.e. sequence) similarity with other proteins
– Engineer new proteins for specific functions
– Assess ab-initio predictions
81. Protein Structure Comparison
• Sequence-Structure-Function relationships
1) Conserved 1º sequences -> similar structures
2) Similar structures ? conserved 1º sequences
3) Similar structures -> conserved function
• PSC shares many similarities with sequence
alignment. Our aim is to infer new
knowledge from the comparison process
83. Prediction of Protein Function
• In an ideal world, the cascade of inference
should flow from sequence -> structure ->
function
• That is, if we can identify similar sequences or
structures to our query target we can (at
varying degrees of certainty) infer that they
have similar function
84. Prediction of Protein Function
• As proteins evolve, they may
– Retain function and specificity
– Retain function but alter specificity
– Change to a related function, or a similar function in a
different metabolic context
– Change to a completely unrelated function
• How much must a protein change before the
function changes?
– Sometimes, not at all. There are many cases of
proteins with different functions in different
environments
85. Prediction of Protein Function
• Thus, sequence or structure similarity is not
always reliable to assign function
• Other ways of determining protein function
– By identifying patterns of co-regulated genes
• Using data from Microarray experiments
– By identifying protein-protein interactions
86. Prediction of Protein Function
• A related question is: where is the function of a protein
taking place? -> active site
• Several methods exist to predict active/binding sites of
proteins from local patterns of sequence or structure
• A crude way of doing this prediction is to take a look at the
conserved residues of a sequence: they may be
related to either the core of the protein (structural
stability) or the function of a protein (a change of
function is a risk for survival)
• More sophisticated methods exist to learn how to
predict active sites. They use ML, in a way similar to that used to
predict residue structural features in PSP
• Still, it is a very tough problem, and ML methods are not
much better than BLAST-based methods
88. Three case studies
• Mining –omics data
• Predicting structural aspects of protein
residues
• Automated alphabet reduction for protein
datasets
• In all these three case studies we use the
same evolutionary learning system: BioHEL
[Bacardit et al., 09]
89. BioHEL
• BioHEL [Bacardit et al., 09] is an evolutionary
learning system that applies the Iterative Rule
Learning (IRL) approach
• Designed explicitly to deal with noisy large-scale
datasets
• IRL was first used in EC by the SIA system
[Venturini, 93]
90. BioHEL’s learning paradigm
– IRL has been used for many years in the ML
community, with the name of separate-and-conquer
91. BioHEL’s objective function
• An objective function based on the Minimum-
Description-Length (MDL) (Rissanen,1978) principle
that tries to promote rules with
– High accuracy: not making mistakes
– High coverage: covering as many examples as possible
without sacrificing accuracy. Recall (TP/(TP+FN)) will be
used to define coverage
– Low complexity: rules as simple and general as possible
– The objective function is a linear combination of the three
objectives above
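The three objectives can be illustrated with a small sketch. The accuracy and recall definitions follow the slide; the weights and the complexity measure (number of conditions in the rule) are assumptions made for illustration, and the actual MDL-based formula used by BioHEL is not reproduced here.

```python
def rule_stats(matches, labels, rule_class):
    """Accuracy and recall of one rule.

    matches[i] is True when the rule matches example i,
    labels[i] is the true class of example i.
    """
    pairs = list(zip(matches, labels))
    tp = sum(1 for m, y in pairs if m and y == rule_class)
    fp = sum(1 for m, y in pairs if m and y != rule_class)
    fn = sum(1 for m, y in pairs if not m and y == rule_class)
    accuracy = tp / (tp + fp) if tp + fp else 0.0   # no mistakes among matched examples
    recall = tp / (tp + fn) if tp + fn else 0.0     # coverage, TP/(TP+FN)
    return accuracy, recall


def linear_score(accuracy, recall, n_conditions,
                 w_acc=1.0, w_cov=1.0, w_len=0.1):
    """Hypothetical linear combination; the weights are illustrative only."""
    return w_acc * accuracy + w_cov * recall - w_len * n_conditions


acc, rec = rule_stats([True, True, False, False], ["g", "d", "g", "d"], "g")
score = linear_score(acc, rec, n_conditions=3)
```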
92. BioHEL’s objective function
• Intuitively, we would like to have accurate
rules covering as many examples as possible.
• However, in complex and inconsistent
domains it is rare to obtain such rules
• In these cases, the easier path for the evolutionary search is to
maximize accuracy at the expense of coverage
• Therefore, we need to enforce that the evolved rules cover
enough examples
93. BioHEL’s objective function
• Three parameters define the shape of the function
• The choice of the coverage break is crucial for the proper performance of
the system
• Also, the coverage term penalizes rules that do not cover a minimum
percentage of examples or that cover too many
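One possible shape for such a coverage term is a piecewise-linear function that rises steeply up to the coverage break and slowly afterwards. The sketch below is an assumption about the general shape only: the parameter names `coverage_break` and `coverage_ratio` and their default values are hypothetical, it uses two parameters rather than the three mentioned above, and it does not model any penalty for covering too many examples.

```python
def coverage_term(recall, coverage_break=0.1, coverage_ratio=0.9):
    """Piecewise-linear reward for the coverage (recall) of a rule.

    Below the break the reward grows quickly, so rules covering very few
    examples are heavily penalised; above it, extra coverage adds little.
    """
    if not 0.0 <= recall <= 1.0:
        raise ValueError("recall must be in [0, 1]")
    if recall < coverage_break:
        return coverage_ratio * recall / coverage_break
    return coverage_ratio + (1.0 - coverage_ratio) * (
        (recall - coverage_break) / (1.0 - coverage_break)
    )


# A rule covering 5% of its class gets half of the "break" reward;
# a rule at the break already gets 90% of the maximum.
low, at_break, full = coverage_term(0.05), coverage_term(0.1), coverage_term(1.0)
```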
94. BioHEL’s characteristics
• Attribute list rule representation
– Automatically identifying the relevant attributes for a given rule and
discarding all the other ones
• The ILAS windowing scheme
– Efficiency enhancement method, not all training points are used for
each fitness computation
• An explicit default rule mechanism
– Generating more compact rule sets
– Iterative process terminates when it is impossible to evolve a rule
where the associated class is the majority class among the matched
examples
– At this point, all remaining training instances are assigned to the
default class
96. Mining –omics data
• Biological data can be generated at many
different levels
– Genomics (DNA)
– Transcriptomics (RNA)
– Proteomics (proteins)
– Metabolomics (small compounds)
– Lipidomics (lipids)
• Hundreds of –omics have been catalogued
97. What does an –omics dataset look like?
• In most cases datasets present a similar structure
• Each sample is characterised by a large number
of variables (RNA, Proteins, lipids, etc.)
• Each variable indicates (usually quantitatively)
the presence of that element in the sample
• Due to the high cost of most –omics technologies,
variables >> samples
– Problems of over-fitting
98. What can we do with the dataset?
• In most cases, samples are annotated with a
qualitative label
– Cancer/Non-cancer patients
– Samples of seed tissue for which it is known if the seed
germinated or not
– Age of the sample
• Therefore, we can treat these datasets as
classification problems, and generate prediction
models from the data
• Not just as classification problems
– Clustering/Biclustering
– Association Rule Mining
– Regression
99. But in most cases, domain experts are
not (only) interested in predictions
• Biomarker identification
– Identify the key variables
• Most strongly associated to each outcome
– Using e.g. t-tests to identify those
• Presenting higher prediction capacity
– As identified by ML methods
– Identify interactions between variables
• By presenting very high (anti)correlation between them
• By acting together to generate predictions
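As a rough sketch of the t-test route to biomarker identification, variables can be ranked by the absolute value of Welch's t statistic between the two classes. The gene names and expression values below are invented for illustration, and a real analysis would also compute p-values and correct for multiple testing.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups of measurements."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_variables(expression, labels, positive):
    """Rank variables by |t| between the positive class and the rest."""
    scores = {}
    for name, values in expression.items():
        pos = [v for v, y in zip(values, labels) if y == positive]
        neg = [v for v, y in zip(values, labels) if y != positive]
        scores[name] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)


# Invented toy data: six samples, two variables
labels = ["germinated"] * 3 + ["dormant"] * 3
expression = {
    "gene_a": [4.0, 5.0, 6.0, 1.0, 2.0, 3.0],
    "gene_b": [1.0, 5.0, 3.0, 2.0, 4.0, 3.5],
}
ranking = rank_variables(expression, labels, "germinated")
```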
100. Functional Network Reconstruction for
seed germination
Microarray data obtained from seed tissue of
Arabidopsis thaliana
122 samples represented by the expression level
of almost 14000 genes
It had been experimentally determined whether
each of the seeds had germinated or not
Can we learn to predict germination/dormancy
from the microarray data?
[Bassel et al., 2011]
101. Generating rule sets
BioHEL was able to predict the
outcome of the samples with
93.5% accuracy (10 x 10-fold cross-validation)
Learning from a scrambled dataset
(labels randomly assigned to
samples) produced ~50% accuracy
If At1g27595>100.87 and At3g49000>68.13 and At2g40475>55.96 Predict germination
If At4g34710>349.67 and At4g37760>150.75 and At1g30135>17.66 Predict germination
If At3g03050>37.90 and At2g20630>96.01 and At3g02885>9.66 Predict germination
If At5g54910>45.03 and At4g18975>16.74 and At3g28910>52.76 and At1g48320>56.80 Predict germination
Everything else Predict dormancy
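A rule set like this one is applied as an ordered list with a default class. The sketch below encodes the four rules exactly as shown; the expression values in the two example samples are invented, and treating a missing gene as "condition not satisfied" is an assumption.

```python
# Each rule maps gene identifier -> threshold; all conditions must hold.
RULES = [
    {"At1g27595": 100.87, "At3g49000": 68.13, "At2g40475": 55.96},
    {"At4g34710": 349.67, "At4g37760": 150.75, "At1g30135": 17.66},
    {"At3g03050": 37.90, "At2g20630": 96.01, "At3g02885": 9.66},
    {"At5g54910": 45.03, "At4g18975": 16.74, "At3g28910": 52.76,
     "At1g48320": 56.80},
]


def predict(sample):
    """Return 'germination' if any rule fires, else the default 'dormancy'."""
    for rule in RULES:
        if all(sample.get(gene, float("-inf")) > threshold
               for gene, threshold in rule.items()):
            return "germination"
    return "dormancy"  # default rule: everything else


# Invented expression levels, for illustration only
sample_1 = {"At1g27595": 120.0, "At3g49000": 70.0, "At2g40475": 60.0}
sample_2 = {"At1g27595": 120.0, "At3g49000": 10.0}
```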
102. Identifying regulators
Rule building process is stochastic
Generates different rule sets each time the system
is run
But if we run the system many times, we can
see some patterns in the rule sets
Some genes appear much more frequently than the
rest
Some associated to dormancy
Some associated to germination
104. Generating co-prediction networks of
interactions
• For each of the rules shown before to be
true, all of the conditions in it need to be
true at the same time
– Each rule is expressing an interaction
between certain genes
• From a high number of rule sets we can
identify pairs of genes that co-occur with
high frequency and generate functional
networks
• The network shows a different topology
when compared to other types of network
construction methods (e.g. by gene co-expression)
• Different regions in the network contain
the germination and dormancy genes
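Counting how often pairs of genes appear in the same rule, across many rule sets, can be sketched as below. The `min_count` cut-off and the toy gene names are assumptions; the source does not state what threshold was used to keep an edge.

```python
from collections import Counter
from itertools import combinations


def co_prediction_edges(rule_sets, min_count=2):
    """Count gene pairs that co-occur in a rule and keep the frequent ones.

    rule_sets is a list of rule sets; each rule is a list of gene names
    (the genes used in that rule's conditions).
    """
    pair_counts = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            for pair in combinations(sorted(set(rule)), 2):
                pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


# Three toy rule sets, one rule each
toy_rule_sets = [[["A", "B", "C"]], [["B", "A"]], [["C", "D"]]]
edges = co_prediction_edges(toy_rule_sets)
```

The returned dictionary is an edge list with weights, which can be handed to any graph library for layout and topology analysis.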
105. Experimental validation
We have experimentally verified this analysis
By ordering and planting knockouts for the highly ranked
genes
We have been able to identify four new regulators of
germination, with different phenotype from the wild type
107. Prediction of structural aspects of protein
residues
• Many of these features are due to local interactions of an amino
acid and its immediate neighbours
– Can it be predicted using information from the closest
neighbours in the chain?
Ri-5 Ri-4 Ri-3 Ri-2 Ri-1 Ri Ri+1 Ri+2 Ri+3 Ri+4 Ri+5
SSi-5 SSi-4 SSi-3 SSi-2 SSi-1 SSi SSi+1 SSi+2 SSi+3 SSi+4 SSi+5
Ri-1 Ri Ri+1 SSi
Ri Ri+1 Ri+2 SSi+1
Ri+1 Ri+2 Ri+3 SSi+2
– In this simplified example, to predict the SS state of residue i we
would use information from residues i-1, i and i+1. That is, a
window of ±1 residues around the target
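The windowing step can be sketched as follows. Padding the chain ends with a placeholder symbol `X` is an assumption; the slides do not say how terminal residues are handled.

```python
def window_instances(residues, ss_states, half_window=1, pad="X"):
    """Build one (window, target) instance per residue.

    residues and ss_states are equal-length sequences; the window holds
    the residues from i - half_window to i + half_window.
    """
    if len(residues) != len(ss_states):
        raise ValueError("residues and ss_states must have the same length")
    padded = [pad] * half_window + list(residues) + [pad] * half_window
    size = 2 * half_window + 1
    return [(padded[i:i + size], target) for i, target in enumerate(ss_states)]


# Toy chain of four residues with secondary structure labels H/E/C
instances = window_instances("ACDE", "HHEC")
```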
109. What information do we include for each
residue?
– Early prediction methods used just the primary
sequence: the AA types of the residues in the
window
– However, the primary sequence contains a limited
amount of information
• It does not contain any evolutionary information: it does not
say which residues are conserved and which are not
– Where can we obtain this information?
• Position-Specific Scoring Matrices, which are a product of a
Multiple Sequence Alignment
110. Position-Specific Scoring Matrices (PSSM)
– For each residue in the query sequence compute
the distribution of amino acids of the corresponding
residues in all aligned sequences (discarding those
too similar to the query)
– These distributions will tell us which mutations are
likely and which mutations are less likely for each
residue in the query sequence
– In essence it’s similar to a substitution matrix but
tailored for the sequence that we are aligning
– A PSSM profile will also tell us which residues are
more conserved and which residues are more
subject to insertions or deletions
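The first step, computing the per-position distribution of amino acids from an alignment, can be sketched as below. This is only the frequency part: real PSSMs (for example those produced by PSI-BLAST) are log-odds scores with pseudocounts and sequence weighting, none of which is modelled here. The toy alignment is invented.

```python
from collections import Counter


def column_distributions(alignment):
    """Per-column amino-acid frequencies of a gapped alignment.

    alignment is a list of equal-length strings; '-' marks a gap and is
    ignored when computing the frequencies of a column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Toy alignment: column 0 is mostly A, column 1 is fully conserved (C)
profile = column_distributions(["ACD", "ACE", "A-D", "GCD"])
```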
112. Secondary Structure Prediction
– The most usual way is to predict whether a
residue belongs to an α helix, a β sheet or is in
coil state
– Several programs can determine the actual SS
state of a protein from a PDB file. The most
common of them is DSSP
– Typically, a window of ±7 amino acids (15 in total)
is used. This means 300 attributes (when using
PSSM).
– A dataset with 1000 proteins with
~250AA/protein would have ~250000 instances
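The numbers above follow from simple arithmetic, assuming each residue contributes 20 PSSM values (one per amino-acid type):

```python
half_window = 7                       # residues on each side of the target
values_per_residue = 20               # assumption: one PSSM value per AA type

window_size = 2 * half_window + 1     # target residue plus both sides
attributes = window_size * values_per_residue

proteins = 1000
residues_per_protein = 250            # approximate average from the slide
instances = proteins * residues_per_protein   # one instance per residue
```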
114. Other prediction problems
• This same structure of prediction can be
applied to most 1D structural aspects
• However, many of these features are natively
continuous measures (or integer)
• To treat these problems as classification
problems, we need to discretise the output
• Unsupervised methods are applied
– Uniform length (UL) and uniform frequency (UF) discretisation
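The difference between the two unsupervised discretisations can be sketched as below: uniform length splits the value range into equal-width bins, while uniform frequency places cut points so that each bin holds roughly the same number of values. The toy values are invented, and the exact cut-point conventions are an assumption.

```python
def uniform_length_cuts(values, bins):
    """Cut points that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + width * k for k in range(1, bins)]


def uniform_frequency_cuts(values, bins):
    """Cut points that put (roughly) the same number of values in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // bins] for k in range(1, bins)]


def discretise(value, cuts):
    """Index of the bin that value falls into."""
    return sum(1 for cut in cuts if value >= cut)


# Skewed toy data: one outlier stretches the range
values = [0, 1, 2, 3, 4, 5, 6, 7, 8, 100]
ul_cuts = uniform_length_cuts(values, 2)
uf_cuts = uniform_frequency_cuts(values, 2)
```

With skewed data like this, uniform length leaves almost everything in the first bin, which produces very unbalanced classes; uniform frequency keeps them balanced.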
115. PSP datasets are good ML benchmarks
• These problems can be modelled in many ways:
– Regression or classification problems
– Low/high number of classes
– Balanced/unbalanced classes
– Adjustable number of attributes
• Ideal benchmarks !!
• http://icos.cs.nott.ac.uk/datasets/psp_benchmark.html
116. Contact Map Prediction
• We participated in the CASP9 competition
• CASP = Critical Assessment of Techniques for Protein Structure Prediction.
Biennial competition
• Every day, for about three months, the organizers release some protein
sequences for which nobody knows the structure (129 sequences were
released in CASP9, in 2010)
• Each prediction group is given three weeks to return their predictions
• If the machinery is not well oiled, it is not feasible to participate !!
• For CM, prediction groups have to return a list of predicted contacts (they
are not interested in non-contacts) and, for each predicted pair of
contacting residues, a confidence level
117. Contact Map prediction
• Predicting, given two residues
from a chain, whether these
two residues are in contact or
not
• This problem can be
represented by a binary
matrix. 1= contact 0 = non
contact
• Plotting this matrix reveals
many characteristics of the
protein structure
[Figure: contact map plot with helices and sheets labelled]
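Building the binary matrix from 3D coordinates can be sketched as follows. The 8 Å distance threshold, the use of one coordinate per residue, and leaving the diagonal at 0 are assumptions; the slides do not state the contact definition used.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact matrix: 1 if two residues are within threshold.

    coords holds one (x, y, z) point per residue, in chain order.
    A residue is not counted as being in contact with itself.
    """
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= threshold else 0
         for j in range(n)]
        for i in range(n)
    ]


# Three toy residues on a line: the first two are close, the third is far
cm = contact_map([(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)])
```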
118. Steps for CM prediction (Nottingham
method)
1. Prediction of
Secondary structure (using PSIPRED)
Solvent Accessibility (using BioHEL [Bacardit et al., 09])
Recursive Convex Hull (using BioHEL [Bacardit et al., 09])
Coordination Number (using BioHEL [Bacardit et al., 09])
2. Integration of all these predictions plus other
sources of information
3. Final CM prediction (using BioHEL)
119. Prediction of RCH, SA and CN
We selected a set of 3262 protein chains from
PDB-REPRDB with:
A resolution of less than 2Å
Less than 30% sequence identity
No chain breaks or non-standard residues
90% of this set was used for training (~490000
residues)
10% for test
120. Prediction of RCH, SA and CN
All three features were predicted based on a
window of ±4 residues around the target
Evolutionary information (as a Position-Specific
Scoring Matrix) is the basis of this local
information
Each residue is characterised by a vector of 180
values
The domain for all three features was
partitioned into 5 states
121. Characterisation of the contact map
problem
Three types of input information were used
1. Detailed information of three different windows of
residues centered around
The two target residues (2x)
The middle point between them
2. Information about the connecting segment between the
two target residues and
3. Global protein information.
122. Contact Map dataset
From the original set of 3262 proteins we kept
all that had <250 AA and a randomly selected
20% of larger proteins
Still, the resulting training set contained 32
million pairs of AA and 631 attributes
Less than 2% of those are actual contacts
+60GB of disk space
123. Samples and ensembles
50 samples of 660K examples are generated from the training set, with a
ratio of 2:1 non-contacts/contacts
BioHEL is run 25 times for each sample
Prediction is done by a consensus of 1250 rule sets
Confidence of prediction is computed based on the votes distribution in the
ensemble
The whole training process took about 25K CPU hours
[Diagram: Training set → (x50) Samples → (x25) Rule sets → Consensus → Predictions]
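The sampling and voting scheme can be sketched as below. The function names are hypothetical, the toy data is invented, and using the fraction of "contact" votes as the confidence is an assumption about how the votes distribution is turned into a number.

```python
import random


def sample_training_set(contacts, non_contacts, n_contacts, ratio=2, rng=None):
    """Draw one sample with `ratio` non-contacts for every contact."""
    rng = rng or random.Random(0)
    return (rng.sample(contacts, n_contacts)
            + rng.sample(non_contacts, ratio * n_contacts))


def vote_confidence(votes):
    """Fraction of rule sets in the ensemble that voted 'contact' (1)."""
    return sum(votes) / len(votes)


n_samples, runs_per_sample = 50, 25
n_rule_sets = n_samples * runs_per_sample      # size of the ensemble

# Toy data: integers stand in for residue pairs
toy_contacts = list(range(10))
toy_non_contacts = list(range(100, 200))
sample = sample_training_set(toy_contacts, toy_non_contacts, n_contacts=3)
confidence = vote_confidence([1, 1, 0, 1])
```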
124. Contact Map prediction in CASP
Predictor groups are asked to submit a list of
predicted contacts and a confidence level for each
prediction
The assessors then rank the predictions for each
protein and take a look at the top L/x ones, where L
is the length of the protein and x={5,10}
From these L/x top ranked contacts two
measures are computed
Accuracy: TP/(TP+FP)
Xd: difference between the distribution of
predicted distance and a random distribution
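The accuracy measure over the top L/x predictions can be sketched as follows (Xd is not covered). Rounding L/x down with integer division and the toy predictions are assumptions.

```python
def top_lx_accuracy(predictions, true_contacts, length, x):
    """Accuracy, TP/(TP+FP), over the L/x most confident predictions.

    predictions: list of (i, j, confidence) tuples
    true_contacts: set of (i, j) pairs with i < j
    length: protein length L
    """
    k = max(1, length // x)
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:k]
    tp = sum(1 for i, j, _ in ranked
             if (min(i, j), max(i, j)) in true_contacts)
    return tp / len(ranked) if ranked else 0.0


# Toy protein of length 10: with x=5 only the top 2 predictions are scored
toy_predictions = [(1, 5, 0.9), (2, 8, 0.8), (3, 9, 0.4), (4, 10, 0.2)]
toy_truth = {(1, 5), (3, 9)}
acc_l5 = top_lx_accuracy(toy_predictions, toy_truth, length=10, x=5)
acc_l10 = top_lx_accuracy(toy_predictions, toy_truth, length=10, x=10)
```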
125. CASP9 results
These two groups derived contact
predictions from 3D models
http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf
126. Understanding the rule sets
Each rule set has on average 135 rules
We have a total of 168470 rules
Impossible to read all of them individually, but
we can extract useful statistics
For instance, how often was each attribute
used in the rules?
Full analysis
127. Distribution of frequency of use of
attributes
All 631 attributes are
actually used (min
frequency=429)
However, some of
them are used much
more frequently than
others
128. Top 10 attributes
Attribute Frequency Counts
PredSS_r1_1 1.48% 18141
PredCN_r1 1.66% 20336
propensity 1.74% 21288
PredSS_r2 1.75% 21350
PredSS_r1 1.82% 22205
PredRCH_r2 1.87% 22856
PredRCH_r1 2.04% 24961
PredSA_r2 2.12% 25891
PredSA_r1 2.39% 29246
separation 4.17% 50951
All four kinds of residue predictions are highly ranked
130. Motivation
• PSP is a very costly process
• As an example, one of the best PSP methods in
CASP8, Rosetta@Home, could dedicate up to 104
computing years to predict a single protein’s 3D
structure
• One of the possible ways to alleviate this
computational cost is to simplify the representation
used to model the proteins
131. Target for reduction: the primary sequence
• The primary sequence of a protein is
a usual target for such
simplification
– It is composed of a quite high cardinality
alphabet of 20 symbols, which share
commonalities between them
– One example of reduction widely used
in the community is the hydrophobic-
polar (HP) alphabet, reducing these 20
symbols to just two
– The HP representation is usually too
simple: too much information is lost in
the reduction process [Stout et al., 06]
• Can we automatically generate these
reduced alphabets and tailor them
to the specific problem at hand?
132. Automated Alphabet Reduction
[Bacardit et al., 09]
• We will use an automated information theory-driven
method to optimize alphabet reduction policies for PSP
datasets
• An optimization algorithm will cluster the AA alphabet
into a predefined number of new letters
• The fitness function of the optimization is based on the Mutual
Information (MI) metric, which quantifies the
interrelationship between two discrete variables
– Aim is to find the reduced representation that maintains as much
relevant information as possible for the feature being predicted
• Afterwards we will feed the reduced dataset into a
learning method to verify if the reduction was proper
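How MI can score a candidate alphabet reduction is sketched below. The toy sequence, the buried/exposed labels and the two candidate groupings are invented, and scoring single residues against the class is a simplification of what a real fitness function over windows of residues would do.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def reduce_alphabet(sequence, groups):
    """Replace each amino acid by the index of the group it belongs to."""
    mapping = {aa: k for k, group in enumerate(groups) for aa in group}
    return [mapping[aa] for aa in sequence]


# Invented toy data: residue types and a two-state structural label
residues = list("AAVVDDKK")
labels = ["buried"] * 4 + ["exposed"] * 4

# Two candidate 2-letter alphabets over the same four amino acids
mi_good = mutual_information(reduce_alphabet(residues, ["AV", "DK"]), labels)
mi_bad = mutual_information(reduce_alphabet(residues, ["AD", "VK"]), labels)
```

An optimizer would search over groupings and keep the one with the highest MI; here the first grouping preserves all the information about the label and the second preserves none.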
133. Alphabet Reduction protocol
[Diagram: Dataset (Card=20) → ECGA (Size = N, guided by Mutual Information) → Dataset (Card=N) → BioHEL → Ensemble of rule sets → Accuracy on the test set]
134. Automated Alphabet Reduction
Competent 5-letter alphabet (similar performance to
the AA alphabet)
Different alphabets for CN and SA domains
Unexpected explanations: Alphabet reduction
clustered AA types that experts did not expect
135. Automated Alphabet Reduction
Our method produces better reduced alphabets than other
reduced alphabets from the literature and than other expert-
designed ones
Alphabet    Letters  CN acc.   SA acc.   Diff.    Ref.                       Group
AA          20       74.0±0.6  70.7±0.4  ---      ---                        ---
Our method  5        73.3±0.5  70.3±0.4  0.7/0.4  [Bacardit et al., 07]      ---
WW5         6        73.1±0.7  69.6±0.4  0.9/1.1  [Wang & Wang, 99]          From the literature
SR5         6        73.1±0.7  69.6±0.4  0.9/1.1  [Solis & Rackovsky, 00]    From the literature
MU4         5        72.6±0.7  69.4±0.4  1.4/1.3  [Murphy et al., 00]        From the literature
MM5         6        73.1±0.6  69.3±0.3  0.9/1.4  [Melo & Marti-Renom, 06]   From the literature
HD1         7        72.9±0.6  69.3±0.4  1.1/1.4  [Bacardit et al., 07]      Expert designed
HD2         9        73.0±0.6  69.3±0.4  1.0/1.4  [Bacardit et al., 07]      Expert designed
HD3         11       73.2±0.6  69.9±0.4  0.8/0.8  [Bacardit et al., 07]      Expert designed
136. Efficiency gains from the alphabet
reduction
• We have extrapolated the reduced alphabet to the much
larger and richer Position-Specific Scoring Matrices (PSSM)
representation
• Accuracy difference is still less than 1%
• Obtained rule sets are simpler and training process is much
faster
• Performance levels are similar to recent works in the
literature [Kinjo et al., 05][Dor and Zhou, 07]
• Won the bronze medal of the 2007 Humies awards
137. Conclusions
• Bioinformatics contains many challenges that
computer science can tackle
– Optimisation
– Machine learning
– Software engineering
• Evolutionary computation has been shown to be very
competitive across a large range of bioinformatics
problems
• Facing these challenges for EC has led to the
development of many new methods
138. References/Bibliography
• Journals
– The Bioinformatics Journal
– BMC Bioinformatics
– BMC Biodata Mining
• Bioinformatics books
– Introduction to Bioinformatics by Arthur Lesk, Oxford University Press.
– Introduction to Bioinformatics. A. Tramontano, Chapman and Hall/CRC
• Specialised topics
– Bioinformatics for –omics data. Methods and Protocols. Bernd Mayer
(ed). Springer
– Next-Generation Sequencing special issue of the Bioinformatics
Journal;
http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
139. References/Bibliography
• J. Bacardit, M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz, Coordination number
prediction using Learning Classifier Systems: Performance and interpretability. In Proceedings
of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO2006), pp.
247-254, ACM Press, 2006
• Stout, M., Bacardit, J., Hirst, J.D. and Krasnogor, N. Prediction of Recursive Convex Hull Class
Assignments for Protein Residues. Bioinformatics, 24(7):916-923, 2008
• Stout, M., Bacardit, J., Hirst, J.D., Smith, R.E. and Krasnogor, N. Prediction of Topological
Contacts in Proteins Using Learning Classifier Systems. Soft Computing Journal, 13(3):245-
258, 2009
• J. Bacardit, E.K. Burke and N. Krasnogor. Improving the scalability of rule-based evolutionary
learning. Memetic Computing journal 1(1):55-67, 2009
• J. Bacardit, M. Stout, J.D. Hirst, A. Valencia, R.E. Smith and N. Krasnogor. Automated Alphabet
Reduction for Protein Datasets. BMC Bioinformatics 10:6, 2009
• George W. Bassel, Enrico Glaab, Julietta Marquez, Michael J. Holdsworth and Jaume Bacardit.
Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on
Large-Scale Data Sets. The Plant Cell, 23(9):3101-3116, 2011
• J. Bacardit, P. Widera, A. Márquez-Chamorro, F. Divina, J.S. Aguilar-Ruiz and Natalio
Krasnogor. Contact map prediction using a large-scale ensemble of rule sets and the fusion of
multiple predicted structural features. Bioinformatics first published online July 25, 2012
doi:10.1093/bioinformatics/bts472
140. References/Bibliography
• Jason H. Moore et al., Bioinformatics challenges for genome-wide association studies
Bioinformatics (2010) 26(4): 445-455
• Loris Nanni, Sheryl Brahnam, Alessandra Lumini, High performance set of PseAAC and
sequence based descriptors for protein classification, Journal of Theoretical Biology 266(1):1-
10, 2010
• Fernando Otero et al., A hierarchical multi-label classification ant colony algorithm for protein
function prediction, Memetic Computing 2(3):165-181, 2010
• Daniel Barthel et al., Procksi: a decision support system for protein (structure)
comparison, knowledge, similarity and information. BMC Bioinformatics, 8:416, 2007
• http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics
• Federico Divina and Jesus S. Aguilar-Ruiz. 2006. Biclustering of Expression Data with
Evolutionary Computation. IEEE Trans. on Knowl. and Data Eng. 18, 5 (May 2006), 590-602.
• Martinez-Ballesteros, M Nepomuceno-Chamorro, J C Riquelme (2011) Inferring gene-gene
associations from Quantitative Association Rules In: 11th International Conference on
Intelligent Systems Design and Applications (ISDA 2011 ) 1241 – 1246
• Rubén Armañanzas, Iñaki Inza, Roberto Santana, Yvan Saeys, Jose Flores, Jose Lozano, Yves
Peer, Rosa Blanco, Víctor Robles, Concha Bielza, Pedro Larrañaga. A review of estimation of
distribution algorithms in bioinformatics. BioData Mining 2008, 1:6 (11 September 2008)
141. Acknowledgements
• Prof. Natalio Krasnogor
• Prof. Michael Holdsworth
• Prof. Jonathan Hirst
• Dr. Michael Stout
• Dr. George Bassel
• Dr. Enrico Glaab
• Dr. Pawel Widera
• EPSRC GR/T07534/01 & EP/H016597/1
• EU FP7 CADMAD project
142. Introduction to Bioinformatics
Dr. Jaume Bacardit
Interdisciplinary Computing and Complex Systems
(ICOS) research group
University of Nottingham
jaume.bacardit@nottingham.ac.uk
Editor's notes
• Definitions consist of a name and a DNALD expression.
• Inputs are to be defined using a subset of DNALD expressions:
– unambiguous nucleotide sequences
– the reverse and/or complement of those
– the imported outputs of other DNALD libraries (facilitating iterative library consumption)
– standard sequence formats
• Unary operations include subsequence extraction and mutation. These can be chained together, each operating on the result of the previous one.
• Binary operations include concatenation, repetition, and unions.
• Functions include reverse, complement and back-translation.
• Sequences of nucleotides and amino acids are quoted strings containing single-letter symbols according to the IUPAC nomenclature (and numbers, which are ignored)
• Amino acid sequences are only expected in the context of back-translations
• Ambiguous nucleotides expand to the set of unambiguous alternatives (within reason: 10×N = 4^10 ≈ 10^6 sequences)
• Circular sequences are defined by a parenthesised overlap at the 3'-end: 'ACGT…(AC)'
• Reverse and complement functions: reverse(complement('ATAGAGTAG'))
• The repetition operation is the multiplication of an expression by either a positive integer or a range of positive integers, creating a set: 'A'*3 -> 'AAA', 'A'*(2:4) -> {'AA', 'AAA', 'AAAA'}
• Back-translation returns the set of DNA sequences that could encode an amino acid sequence using a particular codon table
• The complete set of sequences will likely be unfeasibly large, so it must be handled appropriately
• Various strategies for sampling the space of possible sequences will be developed, and algorithms such as GeneOptimizer will be incorporated if the source is available, or reimplemented from the description if possible
• User-defined constraints are yet to be formalised