Biological databases
10/11/2017
Dr. Ayaz Ahmad
Biological databases
1. Biological information and databases
– Overview and definition, types of biological databases
2. Popular databases, records, data format
– GenBank, SwissProt, OMIM, PDB, KEGG, BIND, Pfam, PROSITE, PubMed
3. Accessing biological databases, retrieval systems
– Entrez, SRS
4. Searching biological databases
– Data quality, coverage, redundancy, errors
Textbook:
– T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics.
Biological databases: chapters 3 and 4
Biological Information
Nucleic acids:
• DNA sequence, genes, gene products (proteins), mutation,
gene coding, distribution patterns, motifs
• Genomics: genome, gene structure and expression, genetic
map, genetic disorder
• RNA sequence, secondary structure, 3D structure,
interactions
Proteins:
• Protein sequence, corresponding gene, secondary structure,
3D structure, function, motifs, homology, interactions
• Proteomics: expression profile, proteins in disease processes
etc.
• Ligands and drugs (inhibitors, activators, substrates,
metabolites)
Biological Information
Pathways:
• Molecular networks, biological chain events,
regulation, feedback, kinetic data
Function:
• Binding sites, interactions, molecular action
(binding, chemical reaction, etc.)
• Biological effect (signaling, transport, feedback,
regulation, modification, etc.)
• Functional relationship, protein families, motifs, and
homologs
WHAT IS A DATABASE?
• Structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined
data related to the record.
• For example, a protein database would have protein
entries as records and protein properties as fields (e.g.,
name of protein, length, amino-acid sequence)
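The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the helper `find_by_min_length`, and the two peptide sequences are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One record (entry); each attribute is a field."""
    name: str        # field: name of protein
    sequence: str    # field: amino-acid sequence

    @property
    def length(self) -> int:
        # field derived from the sequence: number of residues
        return len(self.sequence)


# A toy "database": a collection of records (illustrative data only).
database = [
    ProteinRecord(name="example peptide A", sequence="MKTAYIAKQR"),
    ProteinRecord(name="example peptide B", sequence="MADEEKLPPGWEKRMS"),
]


def find_by_min_length(records, min_length):
    """Return the records whose length field is at least min_length."""
    return [r for r in records if r.length >= min_length]


long_records = find_by_min_length(database, 12)
```

Searching a database then amounts to filtering records on the values of their fields, as `find_by_min_length` does here.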
THE ‘PERFECT’ DATABASE
• Comprehensive, but easy to search.
• Annotated, but not “too annotated”.
• A simple, easy to understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.
Biological databases
Purpose
1. To disseminate biological data and information
2. To provide biological data in computer-readable form
3. To allow analysis of biological data
TYPES OF MOLECULAR DATABASES
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, Trace, SRA, SNP, GEO
• Derived Databases
– Derived from primary data
– Content controlled by third party (e.g. NCBI)
• Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO
datasets, UniGene, Homologene, Structure,
Conserved Domain
PRIMARY VS. DERIVED SEQUENCE DATABASES
[Figure: flow of sequence data from primary to derived databases]
• Primary: GenBank
– Receives sequences from sequencing centers and labs
– Updated ONLY by submitters
• Derived: UniGene (algorithms), RefSeq (curators), genome assembly
– Updated continually by NCBI
Types of Biological Databases
• Bibliographic Databases
• Integrated Databases
• Structural Databases
• Sequence Databases
• Clinical Databases
“Ten Important Bioinformatics Databases”
Database      URL                     Content
GenBank       www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl       www.ensembl.org         human/mouse genome (and others)
PubMed        www.ncbi.nlm.nih.gov    literature references
NR            www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT    www.expasy.ch           protein sequences
InterPro      www.ebi.ac.uk           protein domains
OMIM          www.ncbi.nlm.nih.gov    genetic diseases
Enzymes       www.chem.qmul.ac.uk     enzymes
PDB           www.rcsb.org/pdb/       protein structures
KEGG          www.genome.ad.jp        metabolic pathways
Source: Bioinformatics for Dummies
GenBank
http://www.ncbi.nih.gov/Genbank/
GenBank database
(http://www.ncbi.nih.gov/Genbank/)
– Contains publicly available DNA sequences from more than
100,000 organisms.
– Also contains derived protein sequences, and annotations
describing biological, structural, and other relevant features.
– Accessible through Entrez, NCBI’s integrated retrieval system
– Sequence similarity search tools: BLAST
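As a rough illustration of programmatic access through Entrez, the sketch below builds a request URL for NCBI's E-utilities `efetch` service, which returns a GenBank flat file as plain text. The function name `genbank_flatfile_url` is invented for this example, and the accession `U49845` is used only as a sample identifier; the code constructs the URL but does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession: str) -> str:
    """Build an efetch URL asking for one nucleotide record in GenBank format."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # accession number of the record
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return EUTILS_EFETCH + "?" + urlencode(params)


url = genbank_flatfile_url("U49845")  # sample accession, for illustration
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the flat file for that accession.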
GenBank
• Annotated collection of all publicly
available nucleotide sequences and their
protein translations.
• Receives sequences produced in
laboratories throughout the world from
more than 100,000 distinct organisms.
• Grows exponentially, doubling every 10
months
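To get a feel for what "doubling every 10 months" implies, the growth factor after t months is 2^(t/10). The short sketch below works this out; it assumes the doubling time stays constant, which is a simplification, and the function name `growth_factor` is chosen just for this example.

```python
def growth_factor(months: float, doubling_time_months: float = 10.0) -> float:
    """Factor by which the database grows after the given number of months,
    assuming a constant doubling time."""
    return 2.0 ** (months / doubling_time_months)


one_year = growth_factor(12)    # 2 ** 1.2, roughly 2.3-fold
five_years = growth_factor(60)  # 2 ** 6 = 64-fold
```

Under this assumption the database more than doubles each year and grows 64-fold in five years.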
GENBANK - PRIMARY SEQUENCE DB
http://www.ncbi.nlm.nih.gov/genbank/
• Nucleotide-only sequence database
• Archival in nature
– Historical
– Reflective of submitter point of view
– Redundant
• Data
– Direct submissions
– Batch submissions
– FTP accounts (genome data)
GenBank
• Data shared nightly among three
collaborating databases
– GenBank at NCBI
– DNA Database of Japan (DDBJ)
– EMBL at EBI
The International Sequence Database Collaboration
Source: NCBI
GenBank Release 220
June 2017
• full release every two months
• incremental and cumulative updates daily
• available only over the Internet
ftp://ftp.ncbi.nih.gov/genbank/
GenBank Record
➢ Header
information that applies to
the whole record
➢ Features
annotations on the record
➢ Sequence
GenBank Record: Header
• Locus name
• Sequence length
• Molecule type
• GenBank division
• Modification date
• Accession number
• Version number
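Several of the header fields listed above sit on the first line of a flat file, the LOCUS line. A minimal parsing sketch follows, assuming a whitespace-separated nucleotide LOCUS line of the form shown in the example string. The function name `parse_locus_line` and the dictionary keys are chosen for this illustration, and real records may need more careful, column-aware handling.

```python
def parse_locus_line(line: str) -> dict:
    """Split a nucleotide LOCUS line into its header fields.

    Assumes whitespace-separated tokens:
    LOCUS <name> <length> bp <molecule type> [topology] <division> <date>
    """
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a recognizable nucleotide LOCUS line")
    return {
        "locus_name": tokens[1],
        "sequence_length": int(tokens[2]),
        "molecule_type": tokens[4],
        "division": tokens[-2],           # second-to-last token
        "modification_date": tokens[-1],  # last token
    }


# Example line, used here purely for illustration.
example = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
fields = parse_locus_line(example)
```

Taking the division and date from the end of the line lets the same sketch cope with lines that also carry a topology token such as `linear`.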
Direct Submission
• A typical GenBank submission consists of
a single, contiguous stretch of DNA or
RNA sequence (contigs) with annotations
(metadata).
• If part of a nucleotide sequence encodes a
protein, a conceptual translation, called a
CDS (coding sequence) is annotated.
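A conceptual translation can be sketched as follows, assuming the standard genetic code, a coding region that starts in frame at position 0, and translation that stops at the first stop codon. The names `CODON_TABLE` and `translate_cds` are invented for this example, and the input sequence is a toy string rather than a real gene.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate a coding sequence codon by codon until the first stop codon.

    Unknown codons (for example, ones containing N) become 'X'.
    """
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate_cds("ATGGCCTGGTAA")  # toy sequence: ATG GCC TGG TAA
```

In a GenBank record the CDS feature also gives the location of the coding region, so a real translation would first extract that region from the full sequence.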
High-Throughput Genomic
Sequence (HTGS)
• HTGS entries are submitted in bulk by
genome centers, processed by an
automated system, and then released to
GenBank.
• Currently, more than 30 genome centers
are submitting data for a number of
organisms, including human, mouse, rat,
rice, and Plasmodium falciparum.
Whole Genome Shotgun
Sequences (WGS)
• Shotgun sequence reads are assembled into contigs,
submitted, and updated as the sequencing project
progresses and new assemblies are computed.
Submission Tools
• BankIt: Web-based form for submission of
a small number of sequences with minimal
annotation to GenBank.
• Sequin: More appropriate for complicated
submissions containing a significant
amount of annotation or many sequences.
Sequence Data Flow and
Processing
• Within 48 hours of direct submission with BankIt or Sequin,
the database staff reviews the submission to determine
whether it meets the minimal criteria and then assigns an
Accession number.
– All sequences must be > 50 bp in length and be sequenced by,
or on behalf of, the group submitting the sequence.
– GenBank will not accept sequences constructed in silico
– GenBank will not accept noncontiguous sequences containing
internal, unsequenced spacers.
– GenBank will not accept sequences for which there is no
physical counterpart, such as those derived from a mix of
genomic DNA and mRNA.
– Submissions are checked to determine whether they are new or
updates.
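Two of the acceptance criteria above lend themselves to a simple pre-submission check. The sketch below is only an illustration and is not how GenBank staff screen submissions: the function `check_submission` is invented here, and treating a run of 10 or more internal N characters as a possible unsequenced spacer is an assumed threshold chosen for the example.

```python
import re

MIN_LENGTH_BP = 50        # from the slide: sequences must be > 50 bp
MAX_INTERNAL_N_RUN = 10   # hypothetical threshold, chosen for illustration


def check_submission(sequence: str) -> list:
    """Return a list of problems found; an empty list means no issue detected."""
    problems = []
    seq = sequence.upper()
    if len(seq) <= MIN_LENGTH_BP:
        problems.append("sequence must be longer than 50 bp")
    if re.search(r"[^ACGTUNRYSWKMBDHV]", seq):
        problems.append("sequence contains non-IUPAC characters")
    # Ignore leading/trailing N; look only for long internal runs.
    core = seq.strip("N")
    if re.search("N{%d,}" % MAX_INTERNAL_N_RUN, core):
        problems.append("possible internal unsequenced spacer (long run of N)")
    return problems


ok = check_submission("ACGT" * 20)
too_short = check_submission("ACGT" * 5)
gapped = check_submission("ACGT" * 10 + "N" * 20 + "ACGT" * 10)
```

The remaining criteria, such as whether the sequence was determined by the submitting group or has a physical counterpart, cannot be checked from the sequence text alone.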
Sequence Data Flow and
Processing
• Indexing:
– Biological validity: Translation, organism lineage, BLAST
searches
– Vector contamination: Is there any vector DNA present in the
sequence?
– Publication status: If published, citation is included in annotation
and linked to Entrez
– Formatting and spelling
• Sequences are sent to submitter for final review before
release into the public database.
• Sequences must become publicly available once the
accession number or the sequence has been published.
• GenBank annotation staff process about 1900
submissions/month, or about 20,000 sequences.
Essential Bioinformatics and Biocomputing (LSM2104), NUS
DNA databases
• An example from GenBank (flat file)
– Human alpha-lactalbumin gene
This protein is a complex of 2 proteins, A and B. In the absence of the
B protein, the enzyme catalyzes the transfer of
galactose from UDP-galactose to N-acetylglucosamine (cf. EC 2.4.1.90).
A GenBank entry – HEADER
GenBank Entry – Links provided in the Header
• MapViewer – find the gene's position on the chromosome
• Related Sequences – other entries related to this gene (or sequence)
• OMIM – link to the catalog of human genes and genetic disorders
• Protein – retrieve the protein record from GenPept
• Medline and PubMed – literature abstracts related to this gene
• Taxonomy – classification of organisms
• UniGene – unified gene data
• UniSTS – unified sequence tagged sites, marker and mapping data
• LinkOut – links to publishers, aggregators, libraries, biological databases,
sequence centers, and other Web resources
• REFSEQ – reference sequence standards
Note: These links are representative. Other links may also be found in GenBank
entries.
GenBank entry - FEATURES