Challenges of Bioinformatics

Bioinformatics: biology by other means

Alberto Labarga
UGR, November 2008

Computational Biology
Bioinformatics
[Biological Information]

Toward a scientific theory of heredity

1859 1866 1870 1900 1902

In 1859 Charles Darwin publishes
'On the Origin of Species',
proposing that living beings
are the result of natural
selection and that all
creatures have evolved over
the generations through
small changes.

Mendel's laws,
published in 1866,
rediscovered in 1900

Around 1870, the Swiss scientist
Friedrich Miescher isolates the
material stored in the cell
nucleus, composed mainly of
proteins and nucleic acids. At the
time it was believed that the carrier
of hereditary information
had to be protein,
built from 20 amino acids,
whereas nucleic acids
had only 4 components.

At the beginning of the 20th century, Phoebus Levene
discovered that DNA is a chain of
nucleotides, in which each nucleotide is
made up of a sugar (deoxyribose), a
phosphate group and a nitrogenous base, which
can be one of four types: adenine, thymine,
guanine and cytosine.

Walter Sutton, a graduate student in E. B. Wilson’s
lab at Columbia University, observed that in the
process of cell division, called meiosis, that produces
sperm and egg cells, each sperm or egg receives only
one chromosome of each type. (In other parts of the
body, cells have two chromosomes of each type, one
inherited from each parent.) The segregation pattern
of chromosomes during meiosis matched the
segregation patterns of Mendel’s genes.


The discovery of DNA

1928 1944 1949 1952 1953

1928 Frederick Griffith: the transforming principle

When he mixed R pneumococci
with S pneumococci previously
killed by heat, the
mice died. Moreover, in the
blood of these dead mice
Griffith found pneumococci
with a capsule (S).

In 1944 Oswald Avery and his collaborators, who
were studying Pneumococcus, the bacterium that
causes pneumonia, discovered that
bacteria contain nucleic acids and that the DNA
molecule is what stores the genes. Other
studies with viruses went on to confirm this
theory, even though DNA was still believed
to be too simple.

Life can be seen as a process
of storing and transmitting
biological information.
Chromosomes are the carriers of
this information.
The information is stored in the
form of a molecular code.
To understand life we must
identify these molecules and decipher
the code.

1949 DNA is duplicated during cell division
Chargaff: A = T and G = C

1952 - Hershey-Chase Experiment


M.H.F. Wilkins, A.R. Stokes, H.R. Wilson:
Molecular Structure of Deoxypentose Nucleic
Acids. Nature 171, 738 (1953)

R.E. Franklin and R.G. Gosling
Molecular Configuration in Sodium
Thymonucleate, Nature 171, 740
(1953)


MOLECULAR STRUCTURE
OF NUCLEIC ACIDS
“We wish to propose a
structure for the salt of
desoxyribose nucleic acid
(DNA). This structure has
novel features which are of
considerable biological
interest”
Nature, 25 April 1953

“It has not escaped our
attention that the specific
pairing we have
postulated immediately
suggests a possible
copying mechanism for
the genetic material.”


In 1955 Ochoa, together with the French-Russian biochemist
Marianne Grunberg-Manago, published in the Journal of the American
Chemical Society the isolation of a
bacterial enzyme that catalyzes the synthesis of RNA, the
intermediary between DNA and proteins. The
discoverers named the enzyme "polynucleotide phosphorylase".
It was at first taken for the enzyme that synthesizes RNA in the cell,
a role later assigned to RNA polymerase. The
discovery of polynucleotide phosphorylase made possible
the preparation of synthetic polynucleotides of
varying base composition, with which the group of
Severo Ochoa, in parallel with the group of Marshall
Nirenberg, went on to decipher the genetic
code.

1955 1959 1962 1966

When Perutz arrived in Cambridge, the
largest molecular structure that had
been solved was that of the pigment
phthalocyanine, with 58 atoms. A protein
has thousands of atoms. Bernal, his supervisor,
had recorded some X-ray
diffraction images of crystals of a
protein, pepsin, but had not managed to
interpret them. The subject Perutz chose
for his thesis was another protein,
hemoglobin, the oxygen carrier
that gives our blood its red color.
Hemoglobin has no fewer than 11,000
atoms. It took him 23 years.

Over the course of several years,
Marshall Nirenberg, Har Gobind Khorana and
Severo Ochoa and their colleagues
elucidated the genetic code – showing
how nucleic acids with their 4-letter
alphabet determine the order of the 20
kinds of amino acids in proteins.
Messenger RNA is interpreted three
letters at a time; a set of three
nucleotides forms a "codon" that
encodes an amino acid. A three-letter
word made of four possible letters can
have 64 (4 x 4 x 4) permutations, which
is more than enough to encode the 20
amino acids in living beings.
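
The counting argument above (4 x 4 x 4 = 64 three-letter words) can be checked directly. The sketch below is only an illustration; the function name `all_codons` is made up for this example and does not come from the slides.

```python
from itertools import product

BASES = "ACGU"  # the four letters of the RNA alphabet


def all_codons(bases=BASES):
    """Return every three-letter word that can be written with the given letters."""
    return ["".join(triplet) for triplet in product(bases, repeat=3)]


codons = all_codons()
print(len(codons))  # number of possible codons
print(codons[:4])   # the first few codons in enumeration order
```

Since 64 codons must cover only 20 amino acids plus stop signals, several codons necessarily map to the same amino acid.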

Understanding the mechanisms, creating the tools

1970 1971 1975 1977 1980

The Central Dogma

Protein Data Bank (PDB): created in 1971
with seven
structures

Recombinant DNA is
a DNA molecule formed by joining
two heterologous molecules, that is, molecules of different
origin.
It is made using restriction enzymes,
which can "cut" DNA at specific
sites.
Put very simply, we
"cut out" a human gene and "paste" it into the
DNA of a bacterium; if, for example, it is the gene
that governs the production of insulin,
putting it into a bacterium
"forces" the bacterium to make insulin.

A precursor-RNA may often be matured to
mRNAs with alternative structures. An example
where alternative splicing has a dramatic
consequence is somatic sex determination in the
fruit fly Drosophila melanogaster.

In this system, the female-specific sxl-protein
is a key regulator. It controls a cascade of
alternative RNA splicing decisions that finally
result in female flies.

Understanding the mechanisms, creating the tools

1981 1982 1983 1985 1987 1990

Read out the letters from a DNA sequence

GTGAGGCGCTGC

1983 The polymerase chain reaction,
known as PCR,
is a molecular biology
technique described in 1986 by
Kary Mullis. Its aim is to obtain a
large number of copies of a particular DNA
fragment starting from a minimal amount; in
theory a single copy of the
original fragment, or template, is enough.

Total nucleotides (Nov 07): 188,490,792,445
Number of entries (Nov 07): 106,144,026

The Human Genome Project (HGP)
aims to
determine the relative positions of all the
nucleotides (or base pairs) of the human genome and to identify
the 100,000 genes then thought to be present in it.
The project, with a budget of 3 billion
dollars, was launched in 1990 by the
US Department of Energy and the National Institutes of
Health, with a planned
duration of 15 years.

"Imagine several copies of a book, each cut into
10 million small pieces, in such a way
that the pieces overlap. Suppose that 1
million pieces have been lost, and that the
other 9 million are stained with ink.
Recover the original text."

HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by
fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The
genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones
are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct
the sequence of the genome.
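
The assembly step described above can be pictured with a toy example. The sketch below is a deliberately naive greedy merger of overlapping reads, not the algorithm used by any real assembler; the function names, the `min_len` threshold and the example reads are all hypothetical, and the approach breaks down on repeats and sequencing errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping fragments of a made-up 14-base sequence.
print(greedy_assemble(["ATGCGTA", "CGTACGT", "ACGTTAGC"]))
```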

Deciphering the book of life

1990 1995 1996 1997 1998 1999 2001

S.F. Altschul, et al. (1990), "Basic Local
Alignment Search Tool," J. Molec.
Biol., 215(3): 403-10, 1990. 15,306
citations
Altschul, S.F. et al (1997), “Gapped
BLAST and PSI-BLAST: a new
generation of protein database search
programs”, Nucleic Acids Res., vol. 25,
no. 17, pp. 3389-402.


• SSAHA (Ning et al., 2001)
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• SSAHA is an algorithm for very fast matching and alignment of DNA
sequences. It stands for Sequence Search and Alignment by Hashing
Algorithm. It achieves its fast search speed by converting sequence
information into a `hash table' data structure, which can then be
searched very rapidly for matches.

• BLAT (J. Kent, 2002)
• http://genome.ucsc.edu/cgi-bin/hgBlat
• BLAT on DNA is designed to quickly find sequences of 95% and greater
similarity of length 40 bases or more. It may miss more divergent or
shorter sequence alignments. It will find perfect sequence matches of 33
bases, and sometimes find them down to 20 bases. BLAT on proteins
finds sequences of 80% and greater similarity of length 20 amino acids
or more.
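
The hash-table idea behind SSAHA, as described above, can be sketched in a few lines. This is a simplified illustration rather than the SSAHA implementation: it indexes every overlapping k-mer, and the function names, the sequence name `chr_demo` and the choice of k are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to the list of (sequence name, offset) where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seed_hits(query, index, k):
    """Return (query offset, sequence name, subject offset) for every shared k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, spos))
    return hits


database = {"chr_demo": "GTGAGGCGCTGC"}  # toy subject sequence
index = build_kmer_index(database, k=4)
print(find_seed_hits("AGGCGC", index, k=4))
```

Hits that share the same difference between subject and query offsets lie on one diagonal, which is what a seed-and-extend search would then try to grow into an alignment.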

J. Thompson, T. Gibson, D.
Higgins (1994), CLUSTAL W:
improving the sensitivity of
progressive multiple sequence
alignment … Nuc. Acids. Res. 22,
4673 - 4680


Flowchart of computation steps in
Clustal W (Thompson et al., 1994)

Pairwise alignment: calculation of distance matrix

Creation of unrooted neighbor-joining tree

Rooted nJ tree (guide tree) and calculation of sequence weights

Progressive alignment following the guide tree
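
The first step of the flowchart, pairwise alignment feeding a distance matrix, can be sketched with a basic Needleman-Wunsch global alignment score. This is an illustration only: the scoring values (match 1, mismatch -1, gap -1) and the distance formula are assumptions chosen for simplicity, not the parameters or the distance measure Clustal W actually uses.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b by dynamic programming (Needleman-Wunsch)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of crude distances: 1 - (non-negative score / longer length).

    Assumes non-empty sequences and the default match score of 1.
    """
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = max(nw_score(seqs[i], seqs[j]), 0)
            d = 1.0 - score / max(len(seqs[i]), len(seqs[j]))
            dist[i][j] = dist[j][i] = d
    return dist


print(distance_matrix(["ACGT", "ACGT", "AGT"]))
```

A matrix like this is the input from which a neighbor-joining guide tree would be built in the next step.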

Other methods

Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for
fast and accurate multiple sequence alignment. J. Mol. Biol, 302, 205–217.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high
accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5:
improvement in accuracy of multiple sequence alignment. Nucleic Acids
Res, 33, 511–518.

Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple
sequence alignment algorithm. BMC Bioinformatics , 6, 298.

Larkin M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics 2007
23(21): 2947-2948.

Tree of Life

http://tolweb.org/tree/phylogeny.html http://itol.embl.de/

1995
• The first complete genome
of an organism:
Haemophilus influenzae.

1996
• The yeast genome is completed:
approximately 6,000 genes and
14,000,000 base pairs

1997

• The genome of the
bacterium E. coli is sequenced: 4,600 genes,
4.5 million nucleotides.

1998

The genome of the worm
Caenorhabditis elegans
has 18,000 genes and about
100 million nucleotides

1999
• The complete sequence of
chromosome 22 is obtained.
The HGP is ahead of
schedule.
The small number of genes
found (about 300)
comes as a surprise.

Fire A, Xu S, Montgomery M, Kostas
S, Driver S, Mello C (1998). "Potent
and specific genetic interference by
double-stranded RNA in
Caenorhabditis elegans". Nature 391
(6669): 806–11. doi:10.1038/35888.
PMID 9486653

Hamilton A, Baulcombe D
(1999). "A species of small
antisense RNA in
posttranscriptional gene
silencing in plants". Science
286 (5441): 950–2.
PMID 10542148

Dr Alan Wolffe (1999)
• Epigenetics is the study of heritable
changes in gene expression
that occur without a change
in DNA sequence
• Such changes cannot be
attributed to changes in DNA
sequence (mutations)
• They are as irreversible as
mutations (or difficult to
reverse)

Gene prediction

Where are the genes?

In humans:

~22,000 genes
~1.5% of human DNA

The GENCODE pipeline

1. mapping of known transcripts sequences (ESTs, cDNAs, proteins) into the
human genome
2. manual curation to resolve conflicting evidence
3. additional computational predictions
4. experimental verification
5. FINAL ANNOTATION

Genome annotation - building a pipeline

Genome sequence

Map repeats Map ESTs Map Peptides

Genefinding

nc-RNAs Protein-coding genes

Functional annotation

Release

August 2008, Bioinformatics tools for Comparative Genomics of Vectors

Genefinding - ab initio predictions

• Use compositional features of the DNA sequence to define coding
segments (essentially exons)
– ORFs
– Coding bias
– Splice site consensus sequences
– Start and stop codons
• Each feature is assigned a log likelihood score
• Use dynamic programming to find the highest scoring path
• Need to be trained using a known set of coding sequences
• Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
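
One of the compositional features listed for ab initio genefinding, the open reading frame, is easy to sketch. The code below is a minimal illustration and not how any of the named gene finders work: it scans only the forward strand, ignores splicing, and the function name and `min_codons` threshold are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None
    return orfs


demo = "CCATGAAATTTTAGGG"  # made-up sequence with one short ORF in frame 2
for start, end, frame in find_orfs(demo):
    print(frame, demo[start:end])
```

A real gene finder would score such candidates together with coding bias and splice signals instead of accepting every ORF.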

ab initio prediction

[Diagram, built up over three slides: the genome is annotated with coding potential, ATG & stop codons, and splice sites; the final step is "Find best prediction".]

Genefinding - similarity

• Use known coding sequence to define coding regions
– EST sequences
– Peptide sequences
• Needs to handle fuzzy alignment regions around splice sites
• Needs to attempt to find start and stop codons
• Examples: EST2Genome, exonerate, genewise

• Use 2 or more genomic sequences to predict genes based on
conservation of exon sequences
• Examples: Twinscan and SLAM

Similarity-based prediction

Genome

Align
cDNA/peptide

Create prediction


Example of a simple HMM

Top: model architecture and parameters. Bottom: sequence generation process.
green: state transition probabilities, red: emission probabilities.
Prob(sequence, path|model) = 6.8e-8.
EPFL – Bioinformatics I – 05 Dec 2005
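
The quantity quoted in the caption, Prob(sequence, path | model), is a product of one start probability, one transition probability per step, and one emission probability per symbol. The sketch below shows that calculation for a hypothetical two-state model; the state names and all probabilities are invented for illustration and are not the parameters of the slide's model, so it does not reproduce the 6.8e-8 value.

```python
def joint_probability(seq, path, start, trans, emit):
    """P(sequence, path | model) for a hidden Markov model."""
    if not seq or len(seq) != len(path):
        raise ValueError("sequence and path must be non-empty and of equal length")
    prob = start[path[0]] * emit[path[0]][seq[0]]
    for prev_state, state, symbol in zip(path, path[1:], seq[1:]):
        prob *= trans[prev_state][state] * emit[state][symbol]
    return prob


# Hypothetical model: "E" (exon-like) and "I" (intron-like) states.
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.2, "I": 0.8}}
EMIT = {
    "E": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "I": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

print(joint_probability("GCA", "EEI", START, TRANS, EMIT))
```

Gene finders search over all paths for the most probable one, typically with the Viterbi algorithm, instead of scoring a single given path.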

Automatic Annotation vs Manual

Automatic Annotation
• Quick whole genome analysis ~ weeks
• Consistent annotation
• Use unfinished sequence/shotgun assembly
• No polyA sites/signals, pseudogene
• Predicts ~70% loci

Manual Annotation
• Extremely slow ~3 months Chr 6
• Need finished seq
• Flexible, can deal with inconsistencies in data
• Most rules have exception
• Consult publications as well as databases

Analysis: EGASP predictions vs manual annotation

[Figure: four bar charts of sensitivity (Sn) and specificity (Sp) at the nucleotide, exon, transcript and gene levels, for the entries labelled 9_101_1, 20_79_1, 36_46_1 and 41_77_1.]

And this is only the beginning

2002 2004 2005 2007 2010

Date                           10/3/02  8/28/03   5/07  10/08
Published complete genomes:        104      156    500    874
Ongoing prokaryotic genomes:       316      386   1500   2124
Ongoing eukaryotic genomes:        218      246    700   1004

http://www.genomesonline.org 4000

[Figure: bases per run by date of instrument introduction, 1994-2006, for Applied Biosystems instruments (ABI 370/377, 3700, 3730, 3730XL; 1 Mb / day) and Roche / 454 (454-GS20: 32,000,000 bases per run; Genome Sequencer FLX: 100 Mb / run).]

Illumina / Solexa Genetic Analyzer: 2000 Mb / run
Applied Biosystems SOLiD: 3000 Mb / run

Although we humans share
99.9 percent of our genetic information,
we carry small variations, called
single nucleotide polymorphisms or
SNPs (pronounced
"snips"). An estimated 10
million SNPs exist in the human species, and
these differences are thought to be
related to greater resistance or
susceptibility to diseases and
drugs.

VARIATION IN THE HUMAN DNA
SEQUENCE

Mutation rate = 10^-8 per site per generation
Number of generations from common ancestor to present-day humans: 10^4 to 10^5
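
These two figures can be combined in a back-of-the-envelope estimate. The sketch below assumes a simple neutral model in which two present-day sequences each accumulate mutations independently since their common ancestor, hence the factor of 2, and ignores repeated hits at the same site; the function name is made up for this illustration.

```python
def expected_divergence(mutation_rate, generations):
    """Expected fraction of sites that differ between two present-day sequences.

    Both lineages accumulate mutations, so the rate is counted twice.
    """
    return 2 * mutation_rate * generations


for generations in (1e4, 1e5):
    d = expected_divergence(1e-8, generations)
    print(f"{generations:.0e} generations: {d:.4%} of sites differ")
```

With 10^4 to 10^5 generations the estimate ranges from 0.02% to 0.2%, which brackets the roughly 0.1% difference implied by the 99.9 percent of shared sequence quoted earlier in the deck.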

ENCyclopedia Of DNA Elements


• Sequence (DNA/RNA) & phylogeny
• Comparative genomics
• Protein sequence analysis & evolution
• Regulation of gene expression; transcription factors & micro RNAs
• Protein structure & function: computational crystallography
• Protein families, motifs and domains
• Chemical biology
• Protein interactions & complexes: modelling and prediction
• Pathway analysis
• Data integration & literature mining
• Image analysis
• Systems modelling

1. Copies of the DNA of the genes of interest are prepared...
2. ...and printed on the chip.
3. The RNA samples of interest are prepared (control and sample).
4. Reverse transcription; add fluorescence.
5. The samples are hybridized on the microarray.
6. The chip is excited with different lasers (Laser 1, Laser 2): the control responds to one of them and the sample to the other.
7. Comparing the two images shows which genes are expressed differently.

Schena et al. Science 1995

Microarray analysis
Clinical prediction of Leukemia type

• 2 types
– Acute lymphoid (ALL)
– Acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?

Golub et al. Science 286:531-537. (1999)

Biomarkers discovery

Stages: data management, statistical analysis, annotation, network analysis, selection

30,000 genes → 1,500 genes → 150 genes → 50 elements → 10 targets

RT-PCR Standard Processing Procedure (TaqMan Assays)

Step 1: Calculate Ct with SDS and export text file
Step 2: Retrieve data and define experiment design
Step 3: Biological replicates
Step 4: Selection of optimal endogenous controls & calculation of ΔCt
Step 5: Differential expression analysis (ΔΔCt)

Checkpoints (!): overview of plates & samples; quality control of raw values; discard samples; quality control with ΔCt overview
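
The ΔCt and ΔΔCt quantities in steps 4 and 5 can be written out explicitly. The sketch below follows the common 2^-ΔΔCt approach and assumes perfect amplification efficiency (a doubling per cycle) and a single endogenous control; the function names and the Ct values are hypothetical, not taken from the slides.

```python
def delta_ct(ct_target, ct_reference):
    """Normalize a target gene's Ct against an endogenous control."""
    return ct_target - ct_reference


def fold_change(ct_target_sample, ct_ref_sample, ct_target_control, ct_ref_control):
    """Relative expression of the target in sample vs control, as 2 ** (-ddCt)."""
    ddct = delta_ct(ct_target_sample, ct_ref_sample) - delta_ct(ct_target_control, ct_ref_control)
    return 2.0 ** (-ddct)


# Made-up Ct values: the target crosses the threshold 2 cycles earlier
# in the sample than in the control, relative to the same reference gene.
print(fold_change(22.0, 18.0, 24.0, 18.0))
```

A lower Ct means more starting template, so a negative ΔΔCt corresponds to higher expression in the sample.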

Example of Array CGH Technology*

Chari et al, Cancer Informatics, 2006, 2, 48-58

Chip-on-chip

Source: http://www.chiponchip.org/

ChIP (Chromatin ImmunoPrecipitation)

• Chromatin immunoprecipitation, or ChIP, refers to a procedure
used to determine whether a given protein binds to a specific
DNA sequence in vivo

1. DNA-binding proteins are crosslinked
to DNA with formaldehyde in vivo.
2. Isolate the chromatin. Shear DNA
along with bound proteins into small
fragments.
3. Bind antibodies specific to the DNA-binding
protein to isolate the complex
by precipitation. Reverse the cross-linking
to release the DNA and digest
the proteins.
4. Use PCR (Polymerase Chain Reaction)
to amplify specific DNA sequences to
see if they were precipitated with the
antibody.

Protein Microarray
G. MacBeath and S.L. Schreiber, 2000, Science 289:1760

arrayIT TM

Spotting platform and protein microarray

Different Kinds of Protein Arrays*

Antibody Array Antigen Array Ligand Array

Detection by: SELDI MS, fluorescence, SPR,
electrochemical, radioactivity, microcantilever

Some Questions:

• Which genes have expression levels that are correlated
with some external variable?
• For a given pathway, which of the genes in our collection
are most likely to be involved?
• For a diffuse disease, which genes are associated with
different outcomes?

Challenges for Data Analysis

• Normalization (removing systematic measurement effects)
• Variable Selection (Identification of relevant Variables)
• Large sample Effects:

Type I and Type II errors (False positives / False negatives)

• Dimensionality Reduction
• Identification of new disease classes
• Classification of data into known disease classes

Data Analysis Methods
Dimension Reduction
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Multidimensional Scaling

Unsupervised Learning
• K-Means / K-Medoid
• Hierarchical Clustering Algorithms

Supervised Learning
• Linear Discriminant Analysis
• Maximum Likelihood Discrimination
• Nearest Neighbor Methods
• Decision Trees
• Random Forests

Popular Classification Methods

• Decision Trees/Rules
– Find smallest gene sets, but not robust – poor performance
• Neural Nets - work well for reduced number of genes
• K-nearest neighbor – good results for small number of genes, but
no model
• Naïve Bayes – simple, robust, but ignores gene interactions
• Support Vector Machines (SVM)
– Good accuracy, does own gene selection,
but hard to understand
• Specialized methods, D/S/A (Dudoit), …


Support Vector Machine (SVM)

• Main idea: Select hyperplane that is more likely to
generalize on a future datum

Best Practices

• Capture the complete process, from raw data to final
results
• Gene (feature) selection inside cross-validation
• Randomization testing
• Robust classification algorithms
– Simple methods give good results
– Advanced methods can be better
• Wrapper approach for best gene subset selection
• Use bagging to improve accuracy
• Remove/relabel mislabeled or poorly differentiated
samples


Enrichment Analysis

• What are major enriched GO terms?
• What are the highly active pathways?
• What are the frequently interacting proteins?
• What are the known disease associations?

Alistair Chalk, 2008
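
Questions such as "what are the major enriched GO terms?" are commonly answered with a hypergeometric (one-sided Fisher) test: given how many genes carry a term overall, how surprising is the number found in the selected gene list? The sketch below is a minimal version using only the standard library; the function name and the toy counts are assumptions for illustration, and a real analysis would also correct for testing many terms.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """P(X >= hits) when drawing `sample` genes from `population`,
    of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


# Toy numbers: 10 genes in total, 4 annotated with the term,
# 3 genes selected, 2 of them annotated.
print(hypergeom_enrichment_p(10, 4, 3, 2))
```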

Meta-analysis example: “Creation and
implications of a phenome-genome network”
Butte and Kohane. Nat Biotech. 2006
• Clustered experiments based on
mapping concepts found in sample
annotations to UMLS meta-thesaurus.
• Relationships found between
phenotype (e.g., aging), disease (e.g.,
leukemia), environmental (e.g., injury)
and experimental (e.g., muscle cells)
factors and genes with differential
expression.
• “the ease and accuracy of automating
inferences across data are crucially
dependent on the accuracy and
consistency of the human annotation
process, which will only happen when
every investigator has a better
prospective understanding of the long-
term value of the time invested in
improving annotations.”

PPI ANNOTATION AND DATABASES

Database  Reference                  URL
MINT      (Zanoni et al., 2002)      http://mint.bio.uniroma2.it/mint
IntAct    (Hermjakob et al., 2004)   http://www.ebi.ac.uk/intact
DIP       (Xenarios et al., 2002)    http://dip.doe-mbi.ucla.edu/
HPID      (Han et al., 2004)         http://www.hpid.org
HPRD      (Peri et al., 2004)        http://www.hprd.org/

• IMEx agreement to share curation efforts
• Proteomics Standards Initiative (PSI) recommendation
• Molecular Interaction (MI) Ontology
• Large scale experiments
• Literature curation

Complex networks

• Many systems can be represented as
networks (graphs)
– Nodes: individual component (proteins)
– Edges: relationships (interactions)
• They share common properties
– Scale-free
– Hierarchical
– Clustering
• Some properties may be intrinsic
and can be understood better when
putting into the context of evolution

Detecting Hierarchical Organization

Summary: Network Measures

• Degree ki
The number of edges involving node i
• Degree distribution P(k)
The probability (frequency) of nodes of degree k
• Mean path length
The avg. shortest path between all node pairs
• Network Diameter
– i.e. the longest shortest path
• Clustering Coefficient
– A high CC is found for modules
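
The measures summarized above can be computed on a small undirected graph with the standard library alone. The sketch below is illustrative: the function names and the four-node example network are invented, node labels are assumed to be comparable (for example strings), and the path statistics assume a connected graph.

```python
from collections import Counter, deque


def build_graph(edges):
    """Undirected graph as a dict mapping each node to its set of neighbours."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree_distribution(graph):
    """P(k): fraction of nodes that have degree k."""
    counts = Counter(len(nbrs) for nbrs in graph.values())
    return {k: c / len(graph) for k, c in counts.items()}


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    nbrs = graph[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in graph[u])
    return 2.0 * links / (k * (k - 1))


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(graph):
    """Return (mean shortest path length, diameter) over all reachable node pairs."""
    lengths = [
        d
        for s in graph
        for t, d in shortest_path_lengths(graph, s).items()
        if t != s
    ]
    return sum(lengths) / len(lengths), max(lengths)


g = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(degree_distribution(g))
print(clustering_coefficient(g, "C"))
print(path_statistics(g))
```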

Mapping the phenotypic data to the network
•Systematic phenotyping
of 1615 gene knockout
strains in yeast
•Evaluation of growth of
each strain in the presence
of MMS (and other DNA
damaging agents)
•Screening against a
network of 12,232 protein
interactions

Begley TJ, Rosenbach AS, Ideker T,
Samson LD. Damage recovery pathways
in Saccharomyces cerevisiae revealed by
genomic phenotyping and interactome
mapping. Mol Cancer Res. 2002
Dec;1(2):103-12.

The Role of Proteomics

• The existence of an ORF does not imply the
existence of a functional gene.
• Limitations of comparative genomics.
• mRNA levels may not correlate with protein levels.
• Protein modifications: post-transcriptional
modifications, isoforms, post-translational
modifications, mutants.
• Issues of proteolysis, sequestration, etc. relevant only
at the protein level.
• Protein complex composition, protein-protein
interactions, structures.

Structural proteomics

• Folding
• Structure and function
• Protein structure prediction
• Secondary structure
• Tertiary structure
• Function
• Post-translational modification
• Prot.-Prot. Interaction -- Docking algorithm
• Molecular dynamics/Monte Carlo

What kinds of methods are there?

5 main levels of protein Structure prediction:

1. Extensive Sequence Search
2. Threading and 1D-3D profiles
3. Ab initio prediction of protein structure
4. Comparative Modelling
5. Docking (domain interaction prediction)

Prediction of Protein Structures

• Examples – a few good examples

[Figure: four pairs of actual versus predicted structures.]

MODPIPE: Large-Scale Comparative Protein Structure Modeling

For each target sequence:
1. Get profile for sequence (NR) with PSI-BLAST
2. Scan sequence profile against representative PDB chains
3. Scan PDB chain profiles against sequence
4. Select templates using permissive E-value cutoff

For each template structure (MODELLER):
5. Expand match to cover complete domains
6. Align matched parts of sequence and structure
7. Build model for target segment by satisfaction of spatial restraints
8. Evaluate model

R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.
N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali. 3/25/03

Structural Proteomics:
The Motivation*

[Figure: number of known sequences (axis up to 2,000,000) and known structures (axis up to 200,000) by year, 1980-2005.]

The hierarchies of protein structure

Docking Programs

• Dock (UCSF)
• AutoDock (Scripps)
• Glide (Schrödinger)
• ICM (Molsoft)
• FRED (OpenEye)
• Gold, FlexX, etc.

Graphical Notation: a necessity for the conceptual representation
of biopathways

Qualitative / Mechanistic: various degrees of
detail, mixed levels
of presentation

Aladjem et al., Science STKE pe8 (2004)
Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol 7:131 (2006)

Strategies: simulate or analyse?
(or rather what to do first)

Simulate first: convert diagram into a quantitative model; simulate model behavior numerically; obtain qualitative understanding through numerical results and model reduction

Analyse first: qualitatively analyze network topology, stability, etc.; identify "elementary modes"; build and simulate a reduced model

Space of modeling methods

continuous ↔ discrete (examples shown include stochsim and Boolean networks)

Continuum of modeling approaches

Top-down ↔ Bottom-up

Frazier et al. (2003) Science 11 April Vol 300:290-293

Nucleic Acids Research article lists
1078 public databases

Nucleic Acids Research, 2008, Vol. 36, Database issue
http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2

Growth in Available Bioinformatics Databases

Too much unintegrated data

• Data sources incompatible
• No (or few) standard naming convention
• No common interface (varying tools for browsing,
querying and visualizing data)

– Large experiments or large research groups/labs, possibly distributed
– Large service provider institutes.
– Tightly coupled provider-consumer of resources.
– Commonly resource providers.
– Some or lots of access to sys admin

– Small, isolated, independent groups/individuals
– Loosely coupled provider-consumer of resources.
– Commonly resource consumers
– Boutique suppliers.
– Poor access to systems admins
Challenges: Names and Identity

Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor
• WSL-1 protein
• Apoptosis-mediating receptor DR3
• Apoptosis-mediating receptor TRAMP
• Death domain receptor 3
• WSL protein
• Apoptosis-inducing receptor AIR
• Apo-3
• Lymphocyte-associated receptor of death
• LARD
• GENE: Name=TNFRSF25

Annotation history: Q92983 P78515 O00275 Q93036 O00276 Q93037 O00277 Q99722 O00278 Q99830 O00279 Q99831 O00280 Q9BY86 O14865 Q9UME0 O14866 Q9UME1 P78507 Q9UME5

GUIDs, Life Science Identifier? Normalisation

http://www.expasy.org/uniprot/Q93038

Why support standards?

• Unambiguous representation, description
and communication
– Final results and metadata
• Interoperability
– Data management and analysis
• Integration of OMICS → systems biology

What to standardize?

• CONTENT: Minimal/Core Information to be reported
• MIBBI (http://www.mibbi.org)
• SEMANTIC: Terminology Used -> Ontologies
• OBI (http://obi-ontology.org)
• SYNTAX: Data Model, Data Exchange
• FuGE (http://fuge.sourceforge.net/)

MIBBI: Standard Content

Promoting Coherent Minimum Reporting Requirements for
Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotech.

Link Integration: Integration Lite

[Diagram labels: Application interface, User interface, Application, Ontology Authority, Identity Authority]

Warehouse

[Diagram labels: Wrappers, Unified model, Data Access and Query, Application, User interface]

• Copy the data sets, clean and massage data into shape
• Combine them into a (different) pre-determined model before query
• ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART
• Often called "Knowledge bases"

View integration

[Diagram labels: Wrappers, Unified model, Data Access and Query, Application, User interface]

• Data at Source; Virtual integrating database view
• Global as View / Local as View mappings between models
• Map from model to databases dynamically so always fresh
• TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE

Specialist Integrating Application

[Diagram labels: Wrappers, Application, User interface]

E.g. Ensembl, UTOPIA
• Very popular. Known to be one application.

Workflows

[Diagram labels: Workflow Engine, Wrapper, Application, User interface]

• Data flow protocol. Automated data chaining.
• General technique for describing and enacting a process
• Describes what you want to do, not how you want to do it
• Various degrees of data type compliance anticipated

Mash-Up

[Diagram labels: Data Marshalling objects, Protocol, Mash Up Application, User interface]

• Content syndication and feeds
• Emphasis on User creating specific integration by mapping.
• Just in time, just enough design
• On demand integration

Semantic Web help?

[Diagram labels: Wrappers, Access and Query, Application, User interface, Semantic Enrichment, Model flattening, Mapping Transparency]

• Slight problem: we have no first class metadata migration and
management infrastructure, where metadata is outside the application and
in the middleware, and we can handle progressive curation

Service Oriented Architecture

[Diagram labels: Advanced Search, Retrieve data, Submit data, submission, curation, web services (ws), dataflow, workflow]

An Integrative Analysis Example

[Diagram: linked analysis components, including relational data mining, text mining, a decision tree model of a metabonomic profile, cluster statistics, and visualization of serial/spectrum data, clusters, multidimensional data, chemical structure data, sequence data and pathway data.]

From experiments to scientific publications

1- Experiments: planning and carrying out experiments (lab work)
2- Results: processing and interpretation of obtained results
3- Scientific peer-reviewed articles: 'relevant' results are published in scientific journals

PubMed/Medline database at NCBI

- Developed at the National
Center for Biotechnology
Information (NCBI).

- The core 'Textome'.

- repository of citation
entries of scientific
articles.

- PubMed titles and
abstracts
are primary data source for
Bio-NLP.

- ~ 450,000 new abstracts per year

- > 4,800 biomedical
journals

- ENTREZ search engine

Data in scientific articles

Scientific journals: free text, tables, figures
Free text: title, abstract, keywords, text body, references

Journal-specific information:
• Format
• Paper structure (sections)
• Article type

Biomedical literature characteristics
- Heavy use of domain specific terminology (12% biochemistry
related technical terms).
- Polysemic words (word sense disambiguation).
- Most words with low frequency (data sparseness).
- New names and terms created.
- Typographical variants
- Different writing styles (native languages)

BioCreative results

TP: prediction evaluated as protein
and GO terms correct

Precision: TP / Total nr. of
evaluated submissions

1: Chiang et al.
2: Couto et al.
3: Ehrler et al.
4: Ray et al.
5: Rice et al.
6: Verspoor et al.

• Data Integration
– Standards, DBs Infrastructure

• Knowledge Discovery
– Algorithms, Informatics, Machine Learning

• Integrate knowledge
– Text mining, Ontologies

• Modelling
– Pathways, Circuits, Abstraction

Research Support

The challenges of biology over the next
50 years
• List all the molecular components that
make up an organism:
– Genes, proteins, and other functional elements
• Understand the function of each component
• Understand how they interact
• Study how function has evolved
• Find the genetic defects that cause disease
• Design drugs and therapies rationally
• Sequence the genome of every individual and use it in
personalized medicine

• Bioinformatics is an essential component
in achieving all of these goals
