Bioinformatics Introduction
1. Bioinformatics in medicine
today
David Montaner
dmontaner@cipf.es
Centro de Investigación Príncipe Felipe
Institute of Computational Genomics
9 May 2013
in Valencia
David Montaner Bioinformatics in medicine 1/26
2. Genomics
“Progress in science depends on new techniques, new
discoveries and new ideas, probably in that order.”
Sydney Brenner, 1980
Microarray devices and high-throughput sequencing allow us
to measure thousands or millions of genomic characteristics.
3. Genomics vs. genetics
Genetics:
• Single genes are responsible for biological changes.
• one gene → one hypothesis → one p-value → conclusions
Genomics:
• Genes or genomic features act together to produce
biological changes.
• many genes → many hypotheses → many p-values →
→ more data analysis
• Computational support is needed even for drawing
conclusions
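The "many p-values" step is where extra data analysis comes in: with thousands of tests, some small p-values appear by chance alone. A minimal sketch of one common correction, the Benjamini-Hochberg adjustment, is given here as an illustration. The slides do not name a specific method, and the p-values below are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values, one per gene (illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in adjusted]
print(adjusted)
print(significant)
```

In this toy input, two p-values that fall below 0.05 on their own no longer pass the threshold once all five tests are considered together.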
4. Genomic numbers
Microarray:
• 30,000 genes
• 2 million SNPs
• 100 Mb
Measured features:
• genes, isoforms
• SNPs, polymorphisms
• indels
• loss of heterozygosity
• methylation
• copy number alterations
NGS:
• 30,000 genes
• 30,000 transcripts
• 20 million SNPs
• 10-100 GB
Registered information:
• Genomic characteristics:
position, chromosome ...
• Biological function
• Disease association
• miRNA targets
5. Genomic databases
Nucleic Acids Research lists more than 1,500 online databases!
http://www.oxfordjournals.org/nar/database/c
• Many different databases for each category, which should I
use?
• No standards: different IDs, methods, servers, formats, ...
• Lack of international initiatives, many local and small
databases
• Different gene IDs, more than 50
• In vivo vs in silico databases
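To make the identifier problem concrete, here is a minimal sketch of a cross-reference table that converts between ID types. The two genes and their identifiers are illustrative entries that should be checked against current database releases. The helper names `build_index` and `convert` are hypothetical and not part of any library.

```python
# Illustrative cross-reference records; verify IDs against current releases.
GENE_IDS = [
    {"symbol": "TP53", "entrez": "7157",
     "ensembl": "ENSG00000141510", "uniprot": "P04637"},
    {"symbol": "BRCA1", "entrez": "672",
     "ensembl": "ENSG00000012048", "uniprot": "P38398"},
]


def build_index(records):
    """Index every record by each (id_type, identifier) pair it contains."""
    index = {}
    for record in records:
        for id_type, identifier in record.items():
            index[(id_type, identifier)] = record
    return index


def convert(index, identifier, from_type, to_type):
    """Translate an identifier between ID types; None if it is unknown."""
    record = index.get((from_type, identifier))
    return record[to_type] if record else None


index = build_index(GENE_IDS)
print(convert(index, "7157", "entrez", "ensembl"))
print(convert(index, "P38398", "uniprot", "symbol"))
```

Real projects usually rely on a maintained mapping resource rather than a hand-written table, since identifiers are retired and merged between releases.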
6. Biological databases (Wikipedia)
1 Primary nucleotide
sequence databases
2 Metadatabases
3 Genome databases
4 Protein sequence
databases
5 Proteomics databases
6 Protein structure
databases
7 Protein model databases
8 RNA databases
9 Carbohydrate structure
databases
10 Protein-protein interactions
11 Signal transduction
pathway databases
12 Metabolic pathway
databases
13 Experimental data
repositories (Microarrays,
NGS, Sanger)
14 Exosomal databases
15 Mathematical model
databases
16 PCR / real time PCR
primer databases
17 Specialized databases
18 Taxonomic databases
19 Wiki-style databases
7. Primary nucleotide sequence
databases
Contain any kind of nucleotide sequences, from genes to
genomes.
The International Nucleotide Sequence Database (INSD)
Collaboration:
• GenBank
National Center for Biotechnology Information (NCBI)
• European Nucleotide Archive (ENA)
European Bioinformatics Institute (EBI)
• DNA Data Bank of Japan (DDBJ)
8. GenBank
Primary nucleotide sequence databases
• available on the NCBI ftp site:
http://www.ncbi.nlm.nih.gov/Ftp/
• A new release is made every two months.
• 3 types of entries:
• CoreNucleotide (the main collection)
• dbEST (Expressed Sequence Tags)
• dbGSS (Genome Survey Sequences)
Access:
• Search for sequence identifiers using Entrez Nucleotide:
http://www.ncbi.nlm.nih.gov/nucleotide/
• Align GenBank sequences to a query sequence using
BLAST (Basic Local Alignment Search Tool).
http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Several other e-utilities (see book)
See an example of a GenBank record.
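As a rough illustration of programmatic access, the sketch below builds an NCBI E-utilities `efetch` request for a GenBank flat-file record. The accession is a RefSeq mRNA identifier chosen only for illustration, and the helper name `efetch_genbank_url` is hypothetical. The code only constructs the URL and makes no network call.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_genbank_url(accession):
    """Build an efetch URL asking for a GenBank flat-file record as text."""
    params = {
        "db": "nuccore",    # the Entrez Nucleotide database
        "id": accession,    # accession or numeric identifier
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Illustrative accession only; any nucleotide accession can be substituted.
url = efetch_genbank_url("NM_000546")
print(url)
```

The resulting address can be opened in a browser or passed to `urllib.request.urlopen`. NCBI asks users to keep automated request rates low.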
9. Metadatabases
• Collect and organize data from primary nucleotide
sequence databases and many other resources.
• Make the information available in a convenient format and
provide data handling resources: web pages, application
programming interface (API) …
• Focus on particular species, diseases …
Examples
• Entrez: searches through almost all NCBI resources.
http://www.ncbi.nlm.nih.gov/sites/gquery
• GeneCards: provides genomic, proteomic, transcriptomic,
genetic and functional information for human genes (known
and predicted)
http://www.genecards.org/
10. Entrez
Metadatabases
• Searches through almost all NCBI resources.
• Entrez search page: http://www.ncbi.nlm.nih.gov/sites/gquery
• Queries can be saved if you have a MyNCBI account
http://www.ncbi.nlm.nih.gov/
11. Genome databases
Collect genome sequences and annotation (specification about
genes) for particular organisms, and try to improve them:
• Data curation.
• Complete missing information using in silico methods.
• Generate new relational organization.
• Complement feature IDs.
• Provide “easy” access, visualization …
Examples
• Ensembl: automatic annotation on selected eukaryote
genomes.
• UCSC Genome Browser: reference sequence and working
draft assemblies for a large collection of genomes
• WormBase: genome of the model organism C. elegans.
12. Ensembl
Genome databases
• Ensembl is a joint project between the European
Bioinformatics Institute (EBI), part of the European Molecular
Biology Laboratory (EMBL), and the Wellcome Trust Sanger Institute.
• Develops a software system that produces and maintains
automatic annotation on selected vertebrate and other
eukaryote genomes.
• http://www.ensembl.org
13. UCSC Genome Browser
Genome databases
• UCSC: University of California, Santa Cruz.
• This site contains the reference sequence and working
draft assemblies for a large collection of genomes.
• http://genome.ucsc.edu/
14. Protein sequence databases
• Proteins are usually the final unit of interest in research.
• There is a direct conversion from DNA/RNA sequences to
protein sequences.
• Researchers often use gene IDs and protein IDs
interchangeably (biologists, not bioinformaticians …)
Examples
• UniProt: Universal Protein Resource (EBI)
• Swiss-Prot (Swiss Institute of Bioinformatics)
• InterPro: classifies proteins into families and predicts the
presence of domains and sites.
• Pfam: protein families database of alignments and HMMs
(Sanger Institute)
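The direct conversion from DNA/RNA to protein sequence mentioned above can be sketched with the standard genetic code. This is a minimal illustration that assumes a coding sequence already in frame and containing only A, C, G and T (or U); the function name `translate` and the input sequence are made up for the example.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(sequence):
    """Translate an in-frame DNA or RNA sequence up to the first stop codon."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    # Read complete codons only; trailing bases that do not fill a codon are ignored.
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]  # KeyError on ambiguous bases
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("ATGGCCAAGTGGTAA")  # made-up sequence
print(peptide)
```

The reverse direction is not unique, because several codons encode the same amino acid.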
15. RNA databases
• Contain information about RNA molecules.
• Most of them concern gene regulatory factors. (Gene
information is usually in other repositories).
Examples
• miRBase: microRNAs
http://www.mirbase.org/
• TRANSFAC: transcription factors in eukaryotes (Proprietary
database).
• JASPAR: transcription factor binding sites for eukaryotes
(Open access, curated, non-redundant).
http://jaspar.genereg.net/
16. Protein-protein interactions
• Proteins are the main functional units.
• But they do not work in isolation.
• Pretty useless at the moment but promising in the future …
• Some information is experimental, but most of it is
generated in silico.
Examples
• IntAct: protein–small molecule
and protein–nucleic acid
interactions.
• BIND: Biomolecular Interaction
Network Database.
17. Signal transduction pathway
databases
& Metabolic pathway databases
• Information about how genes (or proteins) interact with
one another.
• Not only physical interactions …
Examples
• Reactome: free online database of biological pathways.
http://www.reactome.org
• KEGG: Kyoto Encyclopedia of Genes and Genomes.
Metabolic pathways.
http://www.genome.jp/kegg/pathway.html
19. Experimental data repositories
Contain microarray, NGS, Sanger, and other high-throughput
experimental data.
• GEO: Gene Expression Omnibus (NCBI)
http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress: database of functional genomics
experiments (EBI)
http://www.ebi.ac.uk/arrayexpress/
• The Cancer Genome Atlas (TCGA): data on different
cancer-related tissues.
http://cancergenome.nih.gov/
20. Bioinformatics
Training
• Biology 1/3
• Statistics 1/3
• Computer science 1/3 ←−
Efficiently combine:
• Experimental information
• Database registered knowledge
Time and resources:
• As in the wet lab
22. Example I
Autistic children
1 (microarray) NGS data processing
• data quality control, filtering...
• map against reference genome
• CNV calling
2 CNV filtering
• just 75 rare de novo CNV events (not registered in
databases)
• filter out the long ones
• keep the ones that contain genes
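The three filtering rules above can be sketched as a simple pass over CNV records. This is only an illustration under stated assumptions: the record fields, the toy data, the gene names and the length cutoff `max_length` are all hypothetical, since the slides give no threshold or file format.

```python
def filter_cnvs(cnvs, max_length=5_000_000):
    """Keep CNVs that are unregistered, not too long, and overlap a gene.

    max_length is a hypothetical cutoff in base pairs.
    """
    kept = []
    for cnv in cnvs:
        length = cnv["end"] - cnv["start"]
        if cnv["in_database"]:   # already registered, so not a rare event
            continue
        if length > max_length:  # filter out the long ones
            continue
        if not cnv["genes"]:     # keep only CNVs that contain genes
            continue
        kept.append(cnv)
    return kept


# Toy records with made-up coordinates and placeholder gene names.
cnvs = [
    {"id": "cnv1", "start": 100_000, "end": 400_000,
     "in_database": False, "genes": ["GENE_A"]},
    {"id": "cnv2", "start": 1_000_000, "end": 9_000_000,
     "in_database": False, "genes": ["GENE_B", "GENE_C"]},
    {"id": "cnv3", "start": 50_000, "end": 80_000,
     "in_database": True, "genes": ["GENE_D"]},
    {"id": "cnv4", "start": 200_000, "end": 250_000,
     "in_database": False, "genes": []},
]
kept_ids = [cnv["id"] for cnv in filter_cnvs(cnvs)]
print(kept_ids)
```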
23. Example II
3 move to the gene level
• 47 loci in total affecting 433 human genes
4 Building the background likelihood network
• GO annotations
• KEGG pathways
• InterPro domains
• protein-protein interactions. Databases: BIND, BioGRID,
DIP, HPRD, InNetDB, IntAct, BiGG, MINT, and MIPS
• sequence homology between the gene pair (BLAST)
24. Example III
5 Search for high scoring clusters affected by CNVs
6 Evaluating significance of cluster scores:
10,000 simulations
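Simulation-based significance of this kind is often computed as an empirical p-value: the observed cluster score is compared with scores from randomly drawn gene sets of the same size. The sketch below assumes a hypothetical scoring rule (the sum of per-gene scores) and toy data. The actual score used in the study is not described in the slides.

```python
import random


def empirical_p_value(observed, simulated):
    """Fraction of simulated scores at least as large as the observed one.

    Adding 1 to both counts keeps the estimate above zero.
    """
    exceed = sum(1 for score in simulated if score >= observed)
    return (exceed + 1) / (len(simulated) + 1)


def simulate_scores(gene_scores, cluster_size, n_sim, seed=0):
    """Score n_sim random gene sets of the same size as the cluster."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.sample(gene_scores, cluster_size)) for _ in range(n_sim)]


# Toy per-gene scores (hypothetical) and a hypothetical observed cluster score.
gene_scores = [float(i % 10) for i in range(1000)]
simulated = simulate_scores(gene_scores, cluster_size=5, n_sim=10_000)
p_value = empirical_p_value(45.0, simulated)
print(p_value)
```

With 10,000 simulations the smallest p-value that can be reported this way is 1/10,001, which is one reason the number of simulations matters.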
25. Example IV
7 Functional characterization of the identified network
8 And, finally, draw conclusions