3. Towards a scientific theory of heredity
1859 1866 1870 1900 1902
4. In 1859 Charles Darwin publishes
"On the Origin of Species",
which proposes that living
beings are the result of
natural selection and that all
creatures have evolved
over the generations
through small changes.
5. Mendel's laws,
published in 1866,
rediscovered in 1900
6. Around 1870, a Swiss scientist named
Friedrich Miescher isolates the
material stored in the cell
nucleus, composed mainly of
proteins and nucleic acids. At the
time it was believed that the carrier of
hereditary information
had to be protein,
built from 20 amino acids,
whereas nucleic acids
had only 4 components.
7. In the early twentieth century, Phoebus Levene
discovered that DNA is a chain of
nucleotides, in which each nucleotide is
composed of a sugar (deoxyribose), a
phosphate group and a nitrogenous base, which
can be one of four types: adenine, thymine,
guanine and cytosine.
8. Walter Sutton, a graduate student in E. B. Wilson’s
lab at Columbia University, observed that in the
process of cell division, called meiosis, that produces
sperm and egg cells, each sperm or egg receives only
one chromosome of each type. (In other parts of the
body, cells have two chromosomes of each type, one
inherited from each parent.) The segregation pattern
of chromosomes during meiosis matched the
segregation patterns of Mendel’s genes.
10. 1928, Frederick Griffith: the transforming principle
When he mixed R pneumococci
with S pneumococci previously
killed by heat, the
mice died. Moreover, in the
blood of these dead mice
Griffith found encapsulated
(S) pneumococci.
1928 1944 1949 1952 1953
11. In 1944 Oswald Avery and his collaborators, who
were studying Pneumococcus, the bacterium that causes
pneumonia, discovered that
bacteria contain nucleic acids and that it is the DNA
molecule that stores the genes. Other
studies with viruses went on to confirm this
theory, even though DNA was still widely believed
to be too simple.
12. Life can be seen as a process
of storage and transmission of
biological information.
Chromosomes are the carriers of
this information.
The information is stored in the
form of a molecular code.
To understand life we must
identify these molecules and decipher
the code.
13. 1949: DNA is duplicated during cell division
Chargaff: A = T and G = C
15. M.H.F. Wilkins, A.R. Stokes, H.R. Wilson:
Molecular Structure of Deoxypentose Nucleic
Acids. Nature 171, 738 (1953)
R.E. Franklin and R.G. Gosling
Molecular Configuration in Sodium
Thymonucleate, Nature 171, 740
(1953)
16. MOLECULAR STRUCTURE
OF NUCLEIC ACIDS
“We wish to propose a
structure for the salt of
desoxyribose nucleic acid
(DNA). This structure has
novel features which are of
considerable biological
interest”
Nature, 25 April 1953
17. “It has not escaped our
attention that the specific
pairing we have
postulated immediately
suggests a possible
copying mechanism for
the genetic material.”
20. In 1955 Ochoa, together with the French-Russian biochemist
Marianne Grunberg-Manago, published in the Journal of the American
Chemical Society the isolation of a
bacterial enzyme that catalyses the synthesis of RNA, the
intermediary between DNA and proteins. The
discoverers named the enzyme "polynucleotide phosphorylase".
It was at first thought to be the enzyme that makes RNA in the cell,
a role later shown to belong to RNA polymerase. The
discovery of polynucleotide phosphorylase made possible
the preparation of synthetic polynucleotides of
varying base composition, with which the group of
Severo Ochoa, in parallel with the group of Marshall
Nirenberg, deciphered the genetic
code.
1955 1959 1962 1966
23. When Perutz arrived in Cambridge, the
largest molecular structure that
had been solved was that of the pigment
phthalocyanine, with 58 atoms. A protein
has thousands of atoms. Bernal, his supervisor,
had recorded some X-ray
diffraction images of crystals of a
protein, pepsin, but without managing to
interpret them. The subject Perutz chose
for his thesis was another protein,
hemoglobin, the oxygen carrier
that gives our blood its red color.
Hemoglobin has no fewer than 11,000
atoms. It took him 23 years.
25. Over the course of several years,
Marshall Nirenberg, Har Gobind Khorana and
Severo Ochoa and their colleagues
elucidated the genetic code, showing
how nucleic acids with their 4-letter
alphabet determine the order of the 20
kinds of amino acids in proteins.
Messenger RNA is interpreted three
letters at a time; a set of three
nucleotides forms a "codon" that
encodes an amino acid. A three-letter
word made of four possible letters can
take 64 (4 x 4 x 4) different forms, which
is more than enough to encode the 20
amino acids in living beings.
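
The 4 x 4 x 4 count can be checked by listing every triplet. This is a minimal sketch; the names `BASES` and `codons` are illustrative and not part of the original slides.

```python
from itertools import product

BASES = "ACGU"  # the four letters of the messenger RNA alphabet

# Every ordered triplet of bases is one possible codon.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(codons))   # number of three-letter words over four letters
print(codons[:4])    # first few codons in lexicographic order
```

With 64 codons and only 20 amino acids, several codons can encode the same amino acid.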
30. Protein Data Bank (PDB): created in 1971
with seven
structures
1970 1971 1975 1977 1980
31. Recombinant DNA is
a DNA molecule formed by joining
two heterologous molecules, that is, molecules of different
origin.
It is made using restriction enzymes,
which are able to "cut" DNA at specific
sites.
Put very simply, we
"cut" a human gene and "paste" it into the
DNA of a bacterium; if, for example, it is the gene
that codes for insulin, what
we do by giving it to the bacterium is
"force" it to produce insulin.
33. A precursor-RNA may often be matured to
mRNAs with alternative structures. An example
where alternative splicing has a dramatic
consequence is somatic sex determination in the
fruit fly Drosophila melanogaster.
In this system, the female-specific sxl-protein
is a key regulator. It controls a cascade of
alternative RNA splicing decisions that finally
result in female flies.
35. Read out the letters from a DNA sequence
GTGAGGCGCTGC
1981 1982 1983 1985 1987 1990
36. 1983: The polymerase chain reaction,
known as PCR,
is a molecular biology technique
conceived in 1983 by
Kary Mullis and published in the following years. Its aim is to obtain a
large number of copies of a particular DNA
fragment starting from a minimal amount; in
theory a single copy of the
original fragment, or template, is enough.
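
Why a single template copy can be enough follows from the doubling per cycle. The sketch below assumes an idealised reaction in which every copy is duplicated in every cycle; the function name and the `efficiency` parameter are illustrative, and real reactions fall short of perfect doubling and eventually plateau.

```python
def copies_after(cycles: int, start_copies: int = 1, efficiency: float = 1.0) -> float:
    """Idealised number of template copies after a number of PCR cycles.

    efficiency = 1.0 means every copy is duplicated in every cycle.
    """
    return start_copies * (1.0 + efficiency) ** cycles


# Starting from one molecule, 30 ideal cycles give 2**30 copies.
print(copies_after(30))
# A less efficient reaction (assumed 80% per cycle) yields fewer copies.
print(copies_after(30, efficiency=0.8))
```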
37. Total nucleotides (Nov 07): 188,490,792,445
Number of entries (Nov 07): 106,144,026
39. The Human Genome Project (HGP)
consists of
determining the relative positions of all the
nucleotides (or base pairs) and identifying
the genes present in the genome, estimated at the time at 100,000.
The project, funded with 3 billion
dollars, was launched in 1990 by the
Department of Energy and the National Institutes of
Health of the United States, with a planned
duration of 15 years.
40. "Imagine several copies of a book, each cut into
10 million small pieces, in such a way
that the pieces overlap. Suppose that 1
million pieces have been lost, and that the
other 9 million are stained with ink.
Recover the original text."
41.
42. HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by
fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The
genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones
are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct
the sequence of the genome.
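
The final assembly step can be illustrated with a toy greedy overlap merger. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); it is not the assembler used by the genome project, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]
    return contigs


# Three overlapping reads from the 12-base example sequence GTGAGGCGCTGC.
reads = ["GTGAGGC", "AGGCGCT", "GCGCTGC"]
print(greedy_assemble(reads))
```

Real genomes contain repeats and sequencing errors, which is why the hierarchical strategy first places large clones on a physical map and only then assembles each clone.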
44. S.F. Altschul, et al. (1990), "Basic Local
Alignment Search Tool," J. Molec.
Biol., 215(3): 403-10, 1990. 15,306
citations
Altschul, S.F. et al (1997), “Gapped
BLAST and PSI-BLAST: a new
generation of protein database search
programs”, Nucleic Acids Res., vol. 25,
no. 17, pp. 3389-402.
1990 1995 1996 1997 1998 1999 2001
45.
46.
47. • SSAHA (Ning et al., 2001)
• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• SSAHA is an algorithm for very fast matching and alignment of DNA
sequences. It stands for Sequence Search and Alignment by Hashing
Algorithm. It achieves its fast search speed by converting sequence
information into a `hash table' data structure, which can then be
searched very rapidly for matches.
• BLAT (J. Kent, 2002)
• http://genome.ucsc.edu/cgi-bin/hgBlat
• BLAT on DNA is designed to quickly find sequences of 95% and greater
similarity of length 40 bases or more. It may miss more divergent or
shorter sequence alignments. It will find perfect sequence matches of 33
bases, and sometimes find them down to 20 bases. BLAT on proteins
finds sequences of 80% and greater similarity of length 20 amino acids
or more.
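
The hash-table idea behind fast matchers of this kind can be sketched as a k-mer index. The code below is a simplified illustration, not the published SSAHA or BLAT implementation; the function names and the choice of k are assumptions.

```python
from collections import defaultdict


def build_index(subject: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of the subject sequence to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def find_hits(query: str, index: dict[str, list[int]], k: int) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where a k-mer matches exactly."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits


subject = "GTGAGGCGCTGC"
index = build_index(subject, k=4)
hits = find_hits("GGCGCT", index, k=4)
print(hits)
# Hits that share the same offset (subject_pos - query_pos) lie on one diagonal
# and are candidates for extension into a full alignment.
print({spos - qpos for qpos, spos in hits})
```

Building the index costs one pass over the database; each query k-mer is then a constant-time lookup, which is where the speed comes from.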
48. J. Thompson, T. Gibson, D.
Higgins (1994), CLUSTAL W:
improving the sensitivity of
progressive multiple sequence
alignment … Nuc. Acids. Res. 22,
4673 - 4680
49. Flowchart of computation steps in
Clustal W (Thompson et al., 1994)
Pairwise alignment: calculation of distance matrix
Creation of unrooted neighbor-joining tree
Rooted NJ tree (guide tree) and calculation of sequence weights
Progressive alignment following the guide tree
50. Other methods
Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for
fast and accurate multiple sequence alignment. J. Mol. Biol, 302, 205–217.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high
accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.
Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5:
improvement in accuracy of multiple sequence alignment. Nucleic Acids
Res, 33, 511–518.
Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple
sequence alignment algorithm. BMC Bioinformatics , 6, 298.
Larkin M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics 2007
23(21): 2947-2948.
55. 1997
• The genome of the bacterium E. coli is sequenced: about 4,600 genes,
about 4.6 million nucleotides.
56. 1998
The genome of the worm
Caenorhabditis elegans
has about 18,000 genes and some
100 million nucleotides
57. 1999
• The complete sequence of
chromosome 22 is obtained.
The HGP is ahead of
schedule.
The small number of genes
found (about 300)
comes as a surprise.
58. Fire A, Xu S, Montgomery M, Kostas
S, Driver S, Mello C (1998). "Potent
and specific genetic interference by
double-stranded RNA in
Caenorhabditis elegans". Nature 391
(6669): 806–11. doi:10.1038/35888.
PMID 9486653
59. Hamilton A, Baulcombe D
(1999). "A species of small
antisense RNA in
posttranscriptional gene
silencing in plants". Science
286 (5441): 950–2.
PMID 10542148
60. Dr Alan Wolffe (1999)
• Epigenetics is heritable
changes in gene expression
that occur without a change
in DNA sequence
• Such changes cannot be
attributed to changes in DNA
sequence (mutations)
• They are as irreversible as
mutations (or difficult to
reverse)
62. Gene prediction
Where are the genes?
In humans:
~22,000 genes
~1.5% of human DNA
63. The GENCODE pipeline
1. mapping of known transcript sequences (ESTs, cDNAs, proteins) into the
human genome
2. manual curation to resolve conflicting evidence
3. additional computational predictions
4. experimental verification
5. FINAL ANNOTATION
64. Genome annotation - building a pipeline
Genome sequence
Map repeats Map ESTs Map Peptides
Genefinding
nc-RNAs Protein-coding genes
Functional annotation
Release
August 2008, Bioinformatics tools for Comparative Genomics of Vectors
65. Genefinding - ab initio predictions
Use compositional features of the DNA sequence to define coding
segments (essentially exons)
ORFs
Coding bias
Splice site consensus sequences
Start and stop codons
Each feature is assigned a log likelihood score
Use dynamic programming to find the highest scoring path
Need to be trained using a known set of coding sequences
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
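
One of the compositional features listed above, the open reading frame, can be sketched in a few lines. This is a minimal illustration on the forward strand only, with hypothetical function names; the gene finders named above combine ORFs with coding bias and splice-site signals in a trained probabilistic model, which this sketch does not attempt.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand.

    min_codons counts the codons before the stop codon, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGGCTTAAG"))
```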
68. ab initio prediction
[Diagram: the genome is scored for coding potential, ATG and stop codons, and splice sites; the best-scoring prediction is selected]
69. Genefinding - similarity
Use known coding sequence to define coding regions
EST sequences
Peptide sequences
Needs to handle fuzzy alignment regions around splice sites
Needs to attempt to find start and stop codons
Examples: EST2Genome, exonerate, genewise
Use 2 or more genomic sequences to predict genes based on
conservation of exon sequences
Examples: Twinscan and SLAM
70. Similarity-based prediction
[Diagram: genome, then align cDNA/peptide, then create prediction]
71. Example of a simple HMM
Top: model architecture and parameters. Bottom: sequence generation process.
green: state transition probabilities, red: emission probabilities.
Prob(sequence, path|model) = 6.8e-8.
EPFL – Bioinformatics I – 05 Dec 2005
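
The quantity Prob(sequence, path | model) is the product of the start probability, the transition probabilities along the path and the emission probability at each position. The slide's own model parameters did not survive in this transcript, so the two-state model below (an AT-rich and a GC-rich state) is entirely hypothetical and will not reproduce the 6.8e-8 figure.

```python
def joint_probability(seq, path, start, trans, emit):
    """P(sequence, path | model) for a hidden Markov model."""
    prob = start[path[0]] * emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        # move from the previous state to the current one, then emit a symbol
        prob *= trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return prob


# Hypothetical parameters, chosen only for illustration.
start = {"AT": 0.5, "GC": 0.5}
trans = {"AT": {"AT": 0.9, "GC": 0.1},
         "GC": {"AT": 0.1, "GC": 0.9}}
emit = {"AT": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "GC": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}

p = joint_probability("ATGC", ["AT", "AT", "GC", "GC"], start, trans, emit)
print(p)
```

Because the product shrinks quickly with sequence length, practical implementations usually sum log-probabilities instead.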
72. Automatic Annotation vs Manual
Automatic Annotation
• Quick whole genome analysis ~ weeks
• Consistent annotation
• Use unfinished sequence/shotgun assembly
• No polyA sites/signals, pseudogene
• Predicts ~70% loci
Manual Annotation
• Extremely slow ~3 months Chr 6
• Need finished seq
• Flexible, can deal with inconsistencies in data
• Most rules have exception
• Consult publications as well as databases
77. [Figure: sequencing throughput (bases per run, in millions) by date of introduction, 1994 to 2006. Applied Biosystems instruments (ABI 370/377, 3700, 3730, 3730XL; 1 Mb/day) compared with Roche/454 (454-GS20, 32,000,000; Genome Sequencer FLX, 100 Mb/run), Illumina/Solexa Genetic Analyzer (2000 Mb/run) and Applied Biosystems SOLiD (3000 Mb/run).]
2002 2004 2005 2007 2010
78. Although human beings share
99.9 percent of their genetic information,
we have small variations, called
single nucleotide polymorphisms or
SNPs (pronounced
"snips"). It is estimated that there are some 10
million SNPs in the human species, and
these differences are thought to be
related to greater resistance or
susceptibility to diseases and
drugs.
79. VARIATION IN THE HUMAN DNA SEQUENCE
Mutation rate = 10^-8 per site per generation
Number of generations from the common ancestor to present-day humans: 10^4 to 10^5
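
These two numbers give a rough estimate of how much two human sequences should differ. The sketch assumes a simple neutral model in which each of the two lineages accumulates mutations independently since the common ancestor (hence the factor of 2) and in which no site is hit twice; the function name is illustrative.

```python
def expected_divergence(mu: float, generations: float) -> float:
    """Expected fraction of sites that differ between two present-day sequences.

    Assumes both lineages mutate independently at rate mu per site per
    generation since their common ancestor.
    """
    return 2.0 * mu * generations


low = expected_divergence(1e-8, 1e4)
high = expected_divergence(1e-8, 1e5)
print(f"{low:.2%} to {high:.2%} of sites differ")
```

Under these assumptions, a difference of about 0.1 percent between two people (99.9 percent shared sequence) lies inside the estimated range.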
83. [Diagram: areas of bioinformatics]
• Sequence (DNA/RNA)
• Comparative genomics & phylogeny
• Protein sequence analysis & evolution
• Regulation of gene expression; transcription factors & micro RNAs
• Protein structure & function: computational crystallography
• Protein families, motifs and domains
• Chemical biology
• Protein interactions & complexes: modelling and prediction
• Pathway analysis
• Data integration & literature mining
• Image analysis
• Systems modelling
84. [Figure: two-colour microarray experiment]
• Copies of the DNA of the genes of interest are prepared and printed on the chip.
• The RNA samples of interest (control and sample) are prepared, reverse transcribed and labelled with fluorescence.
• The samples are hybridized on the microarray.
• The chip is excited with two different lasers (Laser 1, Laser 2): the control responds to one of them and the sample to the other.
• Comparing the two images shows which genes are expressed differently.
Schena et al. Science 1995
85. Microarray analysis
Clinical prediction of Leukemia type
• 2 types
– Acute lymphoid (ALL)
– Acute myeloid (AML)
• Different treatment & outcomes
• Predict type before treatment?
Golub et al. Science 286:531-537. (1999)
91. ChIP (Chromatin ImmunoPrecipitation)
• Chromatin immunoprecipitation, or ChIP, refers to a procedure
used to determine whether a given protein binds to a specific
DNA sequence in vivo
1. DNA-binding proteins are crosslinked
to DNA with formaldehyde in vivo.
2. Isolate the chromatin. Shear DNA
along with bound proteins into small
fragments.
3. Bind antibodies specific to the DNA-binding
protein to isolate the complex
by precipitation. Reverse the cross-linking
to release the DNA and digest
the proteins.
4. Use PCR (Polymerase Chain Reaction)
to amplify specific DNA sequences to
see if they were precipitated with the
antibody.
92.
93. Protein Microarray
G. MacBeath and S.L. Schreiber, 2000, Science 289:1760
arrayIT TM
Spotting platform and protein microarray
94. Different Kinds of Protein Arrays*
Antibody Array Antigen Array Ligand Array
Detection by: SELDI MS, fluorescence, SPR,
electrochemical, radioactivity, microcantilever
97. Some Questions:
• Which genes have expression levels that are correlated
with some external variable?
• For a given pathway, which of the genes in our collection
are most likely to be involved?
• For a diffuse disease, which genes are associated with
different outcomes?
98. Challenges for Data Analysis
• Normalization (removing systematic measurement effects)
• Variable Selection (Identification of relevant Variables)
• Large sample Effects:
Type I and Type II errors (False positives / False negatives)
• Dimensionality Reduction
• Identification of new disease classes
• Classification of data into known disease classes
99. Data Analysis Methods
Dimension Reduction
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• Multidimensional Scaling
Unsupervised Learning
• K-Means / K-Medoid
• Hierarchical Clustering Algorithms
Supervised Learning
• Linear Discriminant Analysis
• Maximum Likelihood Discrimination
• Nearest Neighbor Methods
• Decision Trees
• Random Forests
102. Popular Classification Methods
• Decision Trees/Rules
– Find smallest gene sets, but not robust – poor performance
• Neural Nets - work well for reduced number of genes
• K-nearest neighbor – good results for small number of genes, but
no model
• Naïve Bayes – simple, robust, but ignores gene interactions
• Support Vector Machines (SVM)
– Good accuracy, does own gene selection,
but hard to understand
• Specialized methods, D/S/A (Dudoit), …
103. Support Vector Machine (SVM)
• Main idea: Select hyperplane that is more likely to
generalize on a future datum
104. Best Practices
• Capture the complete process, from raw data to final
results
• Gene (feature) selection inside cross-validation
• Randomization testing
• Robust classification algorithms
– Simple methods give good results
– Advanced methods can be better
• Wrapper approach for best gene subset selection
• Use bagging to improve accuracy
• Remove/relabel mislabeled or poorly differentiated
samples
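
"Gene (feature) selection inside cross-validation" means that the genes are re-chosen in every fold using only the training samples, so the held-out sample never influences the selection. The sketch below shows this with leave-one-out cross-validation and a nearest-centroid classifier; the toy data, the ranking score and all function names are assumptions made for illustration, and the code assumes exactly two classes.

```python
from statistics import mean


def select_genes(X, y, k):
    """Rank genes by absolute difference of the two class means; return top-k indices."""
    classes = sorted(set(y))
    scores = []
    for g in range(len(X[0])):
        a = mean(row[g] for row, label in zip(X, y) if label == classes[0])
        b = mean(row[g] for row, label in zip(X, y) if label == classes[1])
        scores.append((abs(a - b), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def nearest_centroid(X, y, genes, sample):
    """Predict the class whose centroid on the selected genes is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, l in zip(X, y) if l == label]
        dist = sum((mean(r[g] for r in rows) - sample[g]) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy with gene selection repeated inside each fold."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = select_genes(X_train, y_train, k)  # uses training samples only
        correct += nearest_centroid(X_train, y_train, genes, X[i]) == y[i]
    return correct / len(X)


# Toy expression matrix: 6 samples x 3 genes; only gene 0 separates the classes.
X = [[5.0, 1.0, 3.0], [5.2, 0.9, 2.0], [4.8, 1.1, 3.5],
     [1.0, 1.0, 3.1], [1.2, 1.2, 2.2], [0.9, 0.8, 3.4]]
y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
print(loocv_accuracy(X, y, k=1))
```

Selecting genes once on the full data set and then cross-validating only the classifier would leak information from the test samples and tend to overstate accuracy.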
105. Enrichment Analysis
• What are major enriched GO terms?
• What are the highly active pathways?
• What are the frequently interacting proteins?
• What are the known disease associations?
Alistair Chalk, 2008
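
A common way to answer "which GO terms are enriched?" is a one-sided hypergeometric test: given how many genes in the whole population carry the term, how surprising is the number seen in the selected gene list? The sketch below is one such test under that assumption; the slides do not say which statistic was used, and the function and parameter names are illustrative.

```python
from math import comb


def enrichment_pvalue(population: int, annotated: int, selected: int, hits: int) -> float:
    """P(X >= hits) when `selected` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(comb(annotated, x) * comb(population - annotated, selected - x)
               for x in range(hits, upper + 1))
    return tail / total


# 10 genes in total, 4 annotated with the term; 3 genes selected, 2 of them annotated.
print(enrichment_pvalue(population=10, annotated=4, selected=3, hits=2))
```

When many terms are tested at once, the resulting p-values need a multiple-testing correction before they are interpreted.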
107. Meta-analysis example: “Creation and
implications of a phenome-genome network”
Butte and Kohane. Nat Biotech. 2006
• Clustered experiments based on
mapping concepts found in sample
annotations to UMLS meta-thesaurus.
• Relationships found between
phenotype (e.g., aging), disease (e.g.,
leukemia), environmental (e.g., injury)
and experimental (e.g., muscle cells)
factors and genes with differential
expression.
• “the ease and accuracy of automating
inferences across data are crucially
dependent on the accuracy and
consistency of the human annotation
process, which will only happen when
every investigator has a better
prospective understanding of the long-
term value of the time invested in
improving annotations.”
110. PPI ANNOTATION AND DATABASES
Database Reference URL
MINT (Zanoni et al., 2002) http://mint.bio.uniroma2.it/mint
IntAct (Hermjakob et al., 2004) http://www.ebi.ac.uk/intact
DIP (Xenarios et al., 2002) http://dip.doe-mbi.ucla.edu/
HPID (Han et al., 2004) http://www.hpid.org
HPRD (Peri et al., 2004) http://www.hprd.org/
IMEx agreement to share curation efforts
Proteomics Standards Initiative (PSI) recommendation
Molecular Interaction (MI) Ontology
Large scale experiments
Literature curation
111.
112. Complex networks
• Many systems can be represented as
networks (graphs)
– Nodes: individual component (proteins)
– Edges: relationships (interactions)
• They share common properties
– Scale-free
– Hierarchical
– Clustering
• Some properties may be intrinsic
and can be understood better when
putting into the context of evolution
114. Summary: Network Measures
• Degree ki
The number of edges involving node i
• Degree distribution P(k)
The probability (frequency) of nodes of degree k
• Mean path length
The avg. shortest path between all node pairs
• Network Diameter
– i.e. the longest shortest path
• Clustering Coefficient
– A high CC is found for modules
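
The measures summarised above can be computed directly on a small graph. This sketch stores an undirected network as a dictionary of neighbour sets and assumes the network is connected; the function names and the example graph are illustrative.

```python
from collections import deque
from itertools import combinations


def degree(graph, node):
    """Number of edges involving the node."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours that are themselves connected."""
    nbrs = graph[node]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
    return 2 * links / (len(nbrs) * (len(nbrs) - 1))


def shortest_paths_from(graph, source):
    """Breadth-first search: shortest path length from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_path_length_and_diameter(graph):
    """Average and longest shortest path over all node pairs (connected graph)."""
    lengths = [d for s in graph
               for t, d in shortest_paths_from(graph, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)


# Toy network: a triangle A-B-C with a pendant node D attached to C.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(degree(graph, "C"), clustering_coefficient(graph, "C"))
print(mean_path_length_and_diameter(graph))
```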
115. Mapping the phenotypic data to the network
•Systematic phenotyping
of 1615 gene knockout
strains in yeast
•Evaluation of growth of
each strain in the presence
of MMS (and other DNA
damaging agents)
•Screening against a
network of 12,232 protein
interactions
Begley TJ, Rosenbach AS, Ideker T,
Samson LD. Damage recovery pathways
in Saccharomyces cerevisiae revealed by
genomic phenotyping and interactome
mapping. Mol Cancer Res. 2002
Dec;1(2):103-12.
116.
117. The Role of Proteomics
• The existence of an ORF does not imply the
existence of a functional gene.
• Limitations of comparative genomics.
• mRNA levels may not correlate with protein levels.
• Protein modifications: post-transcriptional
modifications, isoforms, post-translational
modifications, mutants.
• Issues of proteolysis, sequestration, etc. relevant only
at the protein level.
• Protein complex composition, protein-protein
interactions, structures.
118. Structural proteomics
• Folding
• Structure and function
• Protein structure prediction
• Secondary structure
• Tertiary structure
• Function
• Post-translational modification
• Prot.-Prot. Interaction -- Docking algorithm
• Molecular dynamics/Monte Carlo
119. What kinds of methods are around?
5 main levels of protein Structure prediction:
1. Extensive Sequence Search
2. Threading and 1D-3D profiles
3. Ab initio prediction of protein structure
4. Comparative Modelling
5. Docking (domain interaction prediction)
120.
121. Prediction of Protein Structures
• Examples – a few good examples
[Figure: four pairs of actual and predicted structures]
122.
123. MODPIPE: Large-Scale Comparative Protein Structure Modeling
START
For each target sequence:
• Get profile for sequence (NR) [PSI-BLAST]
• Scan sequence profile against representative PDB chains
• Scan PDB chain profiles against sequence
• Select templates using permissive E-value cutoff
For each template structure:
• Expand match to cover complete domains
• Align matched parts of sequence and structure [MODELLER]
• Build model for target segment by satisfaction of spatial restraints
• Evaluate model
END
R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.
N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali. 3/25/03
128. Graphical Notation: a necessity for the conceptual representation
of biopathways
Qualitative / Mechanistic: various degrees of
detail, mixed level
of presentation
Aladjem et al., Science STKE pe8 (2004)
Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol. 7:131 (2006)
129. Strategies: simulate or analyse?
(or rather what to do first)
Simulate first: convert diagram into a quantitative model; simulate model behavior numerically; obtain qualitative understanding through numerical results and model reduction
Analyse first: qualitatively analyze network topology, stability, etc.; identify "elementary modes"; build and simulate a reduced model
130. Space of modeling methods
continuous ↔ discrete
Examples shown: stochsim, Boolean networks
136. Too much unintegrated data
• Data sources incompatible
• No (or few) standard naming convention
• No common interface (varying tools for browsing,
querying and visualizing data)
137.
– Large experiments or large research groups/labs, possibly distributed
– Large service provider institutes.
– Tightly coupled provider-consumer of resources.
– Commonly resource providers.
– Some or lots of access to sys admin

– Small, isolated, independent groups/individuals
– Loosely coupled provider-consumer of resources.
– Commonly resource consumers
– Boutique suppliers.
– Poor access to systems admins
138. Challenges: Names and Identity
Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor
• WSL-1 protein
• Apoptosis-mediating receptor DR3
• Apoptosis-mediating receptor TRAMP
• Death domain receptor 3
• WSL protein
• Apoptosis-inducing receptor AIR
• Apo-3
• Lymphocyte-associated receptor of death
• LARD
• GENE: Name=TNFRSF25
Annotation history: Q92983, P78515, O00275, Q93036, O00276, Q93037, O00277, Q99722, O00278, Q99830, O00279, Q99831, O00280, Q9BY86, O14865, Q9UME0, O14866, Q9UME1, P78507, Q9UME5
GUIDs, Life Science Identifier? Normalisation
http://www.expasy.org/uniprot/Q93038
139.
140. Why must we support standards?
• Unambiguous representation, description
and communication
– Final results and metadata
• Interoperability
– Data management and analysis
• Integration of OMICS system biology
141. What to standardize?
• CONTENT: Minimal/Core Information to be reported
• MIBBI (http://www.mibbi.org)
• SEMANTIC: Terminology Used -> Ontologies
• OBI (http://obi-ontology.org)
• SYNTAX: Data Model, Data Exchange
• FuGE (http://fuge.sourceforge.net/)
142. MIBBI: Standard Content
Promoting Coherent Minimum Reporting Requirements for
Biological and Biomedical Investigations: The MIBBI Project, Taylor et al., Nature Biotech.
143. Link Integration: Integration Lite
Application interface
User interface
Application
Ontology
Authority
Identity Authority
144. Warehouse
Wrappers Wrappers
Data Access and Query
User interface
Application
Unified
Wrappers
model
• Copy the data sets, clean and massage data into shape
• Combine them into a (different) pre-determined model before query
• ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMART
• Often called “Knowledge bases”
145. View integration
Wrappers Wrappers
Data Access and Query
User interface
Application
Unified
Wrappers
model
• Data at Source; Virtual integrating database view
• Global as View / Local as View mappings between models
• Map from model to databases dynamically so always fresh
• TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE
146. Specialist Integrating Application
Wrappers Wrappers
User interface
Application
Wrappers
E.g. Ensembl, UTOPIA
• Very popular. Known to be one application.
147. Workflows
Workflow
Engine
User interface
Application
Wrapper
• Data flow protocol. Automated data chaining.
• General technique for describing and enacting a process
• Describes what you want to do, not how you want to do it
• Various degrees of data type compliance anticipated
148. Mash-Up Data Marshalling
objects
Protocol
Mash Up Application
User interface
Protocol
Protocol
• Content syndication and feeds
• Emphasis on User creating specific integration by mapping.
• Just in time, just enough design
• On demand integration
150. Semantic Web help?
Access and Query
Wrappers
User interface
Application
Wrapper
Wrappers
Semantic Enrichment
Model flattening
Mapping Transparency
• Slight problem: we have no first class metadata migration and
management infrastructure, where metadata is outside the application and
in the middleware, and we can handle progressive curation
151.
152. Service Oriented Architecture
Advanced Search
Retrieve data
Submit data
submission
curation
ws ws ws ws ws
dataflow workflow
155. An Integrative Analysis Example
[Diagram: a workflow combining relational data mining, text mining, statistics and a decision tree model of a metabonomic profile with visualization of serial/spectrum data, clusters, chemical structures, sequence data, pathways, multidimensional data and relational data]
156. From experiments to scientific publications
1- Experiments: planning and carrying out experiments (lab work)
2- Results: processing and interpretation of obtained results
3- Scientific peer-reviewed articles: 'relevant' results are published in scientific journals
157. PubMed/Medline database at NCBI
- Developed at the National
Center for Biotechnology
Information (NCBI).
- The core 'Textome'.
- repository of citation
entries of scientific
articles.
- PubMed titles and
abstracts
are primary data source for
Bio-NLP.
- ~ 450,000 new abstracts/year
- > 4,800 biomedical
journals
- ENTREZ search engine
158. Data in scientific articles
Scientific journals: free text, tables, figures
Free text:
• Title
• Abstracts
• Keywords
• Text body
• References
Journal-specific information:
• Format
• Paper structure (sections)
• Article type
Biomedical literature characteristics
- Heavy use of domain specific terminology (12% biochemistry related technical terms).
- Polysemic words (word sense disambiguation).
- Most words with low frequency (data sparseness).
- New names and terms created.
- Typographical variants
- Different writing styles (native languages)
162. BioCreative results
TP: prediction evaluated as protein
and GO terms correct
Precision: TP / Total nr. of
evaluated submissions
1: Chiang et al.
2: Couto et al.
3: Ehrler et al.
4: Ray et al.
5: Rice et al.
6: Verspoor et al.
163. Data Integration
• Standards, DBs Infrastructure
Knowledge Discovery
• Algorithms, Informatics, Machine Learning
Integrate knowledge
• Text mining, Ontologies
Modelling
• Pathways, Circuits, Abstraction
Research Support
164. The challenges of biology in the next
50 years
• List all the molecular components that
make up an organism:
– Genes, proteins and other functional elements
• Understand the function of each component
• Understand how they interact
• Study how function has evolved
• Find the genetic defects that cause diseases
• Design drugs and therapies rationally
• Sequence the genome of every individual and use it in
personalized medicine
• Bioinformatics is an essential component
in achieving all these goals