10/11/2017
Biological Databases
Dr. Ayaz Ahmad
Biological databases
1. Biological information and databases
– Overview and definition, types of biological databases
2. Popular databases, records, data format
– GenBank, SwissProt, OMIM, PDB, KEGG, BIND, Pfam, PROSITE, PubMed
3. Accessing biological databases, retrieval systems
– Entrez, SRS
4. Searching biological databases
– Data quality, coverage, redundancy, errors
Textbook:
– T.K. Attwood and D.J. Parry-Smith, Introduction to Bioinformatics.
Biological databases: chapters 3 and 4
Biological Information
Nucleic acids:
• DNA sequence, genes, gene products (proteins), mutation,
gene coding, distribution patterns, motifs
• Genomics: genome, gene structure and expression, genetic
map, genetic disorder
• RNA sequence, secondary structure, 3D structure,
interactions
Proteins:
• Protein sequence, corresponding gene, secondary structure,
3D structure, function, motifs, homology, interactions
• Proteomics: expression profile, proteins in disease processes
etc.
• Ligands and drugs (inhibitors, activators, substrates,
metabolites)
Biological Information
Pathways:
• Molecular networks, biological chain events,
regulation, feedback, kinetic data
Function:
• Binding sites, interactions, molecular action
(binding, chemical reaction, etc.)
• Biological effect (signaling, transport, feedback,
regulation, modification, etc.)
• Functional relationship, protein families, motifs, and
homologs
WHAT IS A DATABASE?
• Structured collection of information.
• Consists of basic units called records or entries.
• Each record consists of fields, which hold pre-defined
data related to the record.
• For example, a protein database would have protein
entries as records and protein properties as fields (e.g.,
name of protein, length, amino-acid sequence)
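The record/field idea above can be sketched in a few lines of Python. This is only an illustration: the class name `ProteinRecord`, the key `example1`, and the sequence are made up for the example and do not come from any real database.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    """One record (entry); each attribute below is a field."""
    name: str
    sequence: str  # amino-acid sequence in one-letter code

    @property
    def length(self) -> int:
        # The length field is derived from the sequence field.
        return len(self.sequence)


# A toy "database": a collection of records keyed by an identifier.
database = {
    "example1": ProteinRecord(name="Example protein", sequence="MKTAYIAKQR"),
}

record = database["example1"]
print(record.name, record.length)
```

Real databases add many more fields (organism, cross-references, annotations), but the structure is the same: records made of pre-defined fields.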
THE ‘PERFECT’ DATABASE
• Comprehensive, but easy to search.
• Annotated, but not “too annotated”.
• A simple, easy to understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.
Biological databases
Purpose
1. To disseminate biological data and information
2. To provide biological data in computer-readable form
3. To allow analysis of biological data
TYPES OF MOLECULAR DATABASES
• Primary Databases
– Original submissions by experimentalists
– Content controlled by the submitter
• Examples: GenBank, Trace, SRA, SNP, GEO
• Derived Databases
– Derived from primary data
– Content controlled by third party (e.g. NCBI)
• Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO
datasets, UniGene, HomoloGene, Structure,
Conserved Domain
PRIMARY VS. DERIVED SEQUENCE DATABASES
[Figure: Sequencing centers and labs submit sequences (e.g., TATAGCCG) to GenBank, the primary database, which is updated ONLY by submitters. Derived databases are updated continually by NCBI: UniGene (built by algorithms), RefSeq (built by curators), and genome assemblies.]
Types of Biological Databases
• Bibliographic Databases
• Integrated Databases
• Structural Databases
• Sequence Databases
• Clinical Databases
“Ten Important Bioinformatics Databases”
Database     URL                     Content
GenBank      www.ncbi.nlm.nih.gov    nucleotide sequences
Ensembl      www.ensembl.org         human/mouse genome (and others)
PubMed       www.ncbi.nlm.nih.gov    literature references
NR           www.ncbi.nlm.nih.gov    protein sequences
SWISS-PROT   www.expasy.ch           protein sequences
InterPro     www.ebi.ac.uk           protein domains
OMIM         www.ncbi.nlm.nih.gov    genetic diseases
Enzymes      www.chem.qmul.ac.uk     enzymes
PDB          www.rcsb.org/pdb/       protein structures
KEGG         www.genome.ad.jp        metabolic pathways
Source: Bioinformatics for Dummies
GenBank
http://www.ncbi.nih.gov/Genbank/
GenBank database
(http://www.ncbi.nih.gov/Genbank/)
– Contains publicly available DNA sequences from more than
100,000 organisms.
– Also contains derived protein sequences, and annotations
describing biological, structural, and other relevant features.
– Accessible through Entrez, NCBI’s integrated retrieval system
– Sequence similarity search tools: BLAST
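Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds an EFetch URL that requests a nucleotide record as a GenBank flat file; it makes no network call. The function name `genbank_flatfile_url` is hypothetical, and the accession `U49845` is used purely as an illustrative identifier.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession: str) -> str:
    """Build an EFetch URL for a nucleotide record in GenBank flat-file format."""
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number of the record
        "rettype": "gb",    # GenBank flat file
        "retmode": "text",  # plain text rather than XML
    }
    return f"{EFETCH}?{urlencode(params)}"


print(genbank_flatfile_url("U49845"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the flat-file text of the record.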
GenBank
• Annotated collection of all publicly
available nucleotide sequences and their
protein translations.
• Receives sequences produced in
laboratories throughout the world from
more than 100,000 distinct organisms.
• Grows exponentially, doubling every 10
months
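To get a feel for what a 10-month doubling time implies, the growth can be written as size(t) = size(0) × 2^(t/10), with t in months. The short sketch below evaluates that formula; the function name `projected_size` is hypothetical, and the doubling time is simply the figure quoted above, not a measured value.

```python
def projected_size(current_size: float, months: float,
                   doubling_months: float = 10.0) -> float:
    """Project a size forward assuming steady exponential growth."""
    return current_size * 2 ** (months / doubling_months)


# Under this assumption the database grows about 2.3-fold in one year.
print(round(projected_size(1.0, 12), 2))
```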
GENBANK - PRIMARY SEQUENCE DB
http://www.ncbi.nlm.nih.gov/genbank/
• Nucleotide-only sequence database
• Archival in nature
– Historical
– Reflective of submitter point of view
– Redundant
• Data
– Direct submissions
– Batch submissions
– FTP accounts (genome data)
GenBank
• Data shared nightly among three collaborating databases
– GenBank at NCBI
– DNA Database of Japan (DDBJ)
– EMBL at EBI
The International Sequence Database Collaboration
Source: NCBI
GenBank Release 220
June 2017
• full release every two months
• incremental and cumulative updates daily
• available only over the Internet
ftp://ftp.ncbi.nih.gov/genbank/
GenBank Record
➢ Header
information that applies to the whole record
➢ Features
annotations on the record
➢ Sequence
GenBank Record: Header
[Figure: header of a GenBank record with labeled fields: locus name, sequence length, molecule type, GenBank division, modification date, accession number, version number.]
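Several of the header fields named above (locus name, sequence length, molecule type, GenBank division, modification date) sit on the first line of the flat file, the LOCUS line. The sketch below splits such a line on whitespace. It assumes a nucleotide record, and the sample line and the names `Locus` and `parse_locus` are illustrative only. The accession and version numbers are on separate ACCESSION and VERSION lines and are not handled here.

```python
from typing import NamedTuple, Optional


class Locus(NamedTuple):
    name: str
    length: int
    molecule_type: str
    topology: Optional[str]  # "linear" or "circular" when present
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Parse the LOCUS line of a nucleotide GenBank flat file."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a nucleotide LOCUS line")
    name, length, molecule_type = tokens[1], int(tokens[2]), tokens[4]
    rest = tokens[5:]
    # The topology token is optional; older records may omit it.
    topology = rest.pop(0) if rest[0] in ("linear", "circular") else None
    if len(rest) != 2:
        raise ValueError("unexpected LOCUS line layout")
    division, date = rest
    return Locus(name, length, molecule_type, topology, division, date)


# Illustrative line, not copied from a real record.
sample_locus = "LOCUS       EXAMPLE01               5028 bp    DNA     linear   PLN 21-JUN-1999"
print(parse_locus(sample_locus))
```

Splitting on whitespace is a simplification; a production parser would follow the column layout defined in the GenBank release notes.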
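GenBank Record: Features
[Figure: FEATURES section of a GenBank record, with a link to the sequence.]
GenBank Record: Sequence
[Figure: sequence section of a GenBank record.]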
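In the flat file, the sequence section starts at the ORIGIN keyword and ends at a line containing `//`; each line in between carries a position number followed by blocks of bases. A minimal sketch for pulling the sequence out of that section follows. The function name `extract_sequence` and the sample text are illustrative, not taken from a real record.

```python
def extract_sequence(flatfile_text: str) -> str:
    """Return the sequence between ORIGIN and // as one uppercase string."""
    in_origin = False
    blocks = []
    for line in flatfile_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # Drop the leading position number, keep the base blocks.
            blocks.extend(line.split()[1:])
    return "".join(blocks).upper()


# Illustrative sequence section.
sample_record = """ORIGIN
        1 gatcctccat atacaacggt
       21 atctccacct
//
"""
print(extract_sequence(sample_record))
```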
Direct Submission
• A typical GenBank submission consists of
a single, contiguous stretch of DNA or
RNA sequence (contigs) with annotations
(metadata).
• If part of a nucleotide sequence encodes a
protein, a conceptual translation, called a
CDS (coding sequence), is annotated.
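A conceptual translation reads the coding sequence three bases at a time and maps each codon to an amino acid. The sketch below does this with the standard genetic code and stops at the first stop codon. It is a simplification: it assumes the input is already the spliced coding sequence in the correct reading frame, and the function name `translate_cds` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # Codons with ambiguous bases are reported as X.
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))  # prints "MA"
```

Organisms and organelles that use a different genetic code would need a different amino-acid string.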
High-Throughput Genomic
Sequence (HTGS)
• HTGS entries are submitted in bulk by
genome centers, processed by an
automated system, and then released to
GenBank.
• Currently, more than 30 genome centers
are submitting data for a number of
organisms, including human, mouse, rat,
rice, and Plasmodium falciparum.
Whole Genome Shotgun
Sequences (WGS)
• Shotgun sequence reads are assembled into contigs,
submitted, and updated as the sequencing project
progresses and new assemblies are computed.
Submission Tools
• BankIt: Web-based form for submission of
a small number of sequences with minimal
annotation to GenBank.
• Sequin: More appropriate for complicated
submissions containing a significant
amount of annotation or many sequences.
Sequence Data Flow and
Processing
• Within 48 hours of direct submission with BankIt or Sequin,
the database staff reviews the submission to determine
whether it meets the minimal criteria and then assigns an
Accession number.
– All sequences must be > 50 bp in length and be sequenced by,
or on behalf of, the group submitting the sequence.
– GenBank will not accept sequences constructed in silico
– GenBank will not accept noncontiguous sequences containing
internal, unsequenced spacers.
– GenBank will not accept sequences for which there is no
physical counterpart, such as those derived from a mix of
genomic DNA and mRNA.
– Submissions are checked to determine whether they are new or
updates.
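Some of the criteria above can be checked before submitting. The sketch below tests only two of them: the minimum length, and, as a rough heuristic, long runs of N that might indicate an internal unsequenced spacer. The function name `preflight_problems` and the threshold `max_n_run` are assumptions made for this example, not GenBank rules. Whether a sequence was constructed in silico or has a physical counterpart cannot be checked from the sequence alone.

```python
import re
from typing import List


def preflight_problems(sequence: str, min_bp: int = 50,
                       max_n_run: int = 10) -> List[str]:
    """Return a list of likely problems; an empty list means none were found."""
    seq = "".join(sequence.split()).upper()  # remove whitespace and line breaks
    problems = []
    if len(seq) <= min_bp:
        problems.append(f"length {len(seq)} bp is not greater than {min_bp} bp")
    # Heuristic only: a run of more than max_n_run N's may be an unsequenced spacer.
    if re.search("N{%d,}" % (max_n_run + 1), seq):
        problems.append("long run of N: possible internal unsequenced spacer")
    return problems


print(preflight_problems("ACGT" * 5))
```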
Sequence Data Flow and
Processing
• Indexing:
– Biological validity: Translation, organism lineage, BLAST
searches
– Vector contamination: Is there any vector DNA present in the
sequence?
– Publication status: If published, citation is included in annotation
and linked to Entrez
– Formatting and spelling
• Sequences are sent to submitter for final review before
release into the public database.
• Sequences must become publicly available once the
accession number or the sequence has been published.
• GenBank annotation staff process about 1900
submissions/month, or about 20,000 sequences.
Essential Bioinformatics and Biocomputing (LSM2104), NUS
DNA databases
• An example from GenBank – flat file
– Human alpha-lactalbumin gene
Lactose synthase is a complex of two proteins, A and B (alpha-lactalbumin is the B protein). In the absence of the
B protein, the enzyme catalyzes the transfer of
galactose from UDP-galactose to N-acetylglucosamine (cf. EC 2.4.1.90).
A GenBank entry – HEADER
GenBank Entry – Links provided in the Header
• MapViewer – find the gene's position on the chromosome
• Related Sequences – other entries related to this gene (or sequence)
• OMIM – link to catalog of human genes and genetic disorders
• Protein – retrieve protein record from GenPept
• Medline and PubMed – literature abstracts related to this gene
• Taxonomy – classification of organisms
• UniGene – unified gene data
• UniSTS – unified sequence tagged sites, marker and mapping data
• LinkOut – links to publishers, aggregators, libraries, biological databases,
sequence centers, and other Web resources
• REFSEQ – reference sequence standards
Note: These links are representative. Other links may also be found in GenBank
entries.
GenBank entry - FEATURES
GenBank - SEQUENCE

Biological databases

  • 1. 10/11/2017 1 Biological Databases Dr. Ayaz Ahmad 2 Biological databases 1. Biological information and databases – Overview and definition, types of biological databases 2. Popular databases, records, data format – Genbank, SwissProt, OMIM, PDB, KEGG, BIND, Pfam, PROSITE, PubMed 3. Accessing biological databases, retrieval systems – Entrez, SRS 4. Searching biological databases – Data quality, coverage, redundancy, errors Textbook: --T.K.Atwood and D.J. Parry Smith, Introduction to Bioinformatics. Biological databases: chapters 3 and 4
  • 2. 10/11/2017 2 3 Biological Information Nucleic acids: • DNA sequence, genes, gene products (proteins), mutation, gene coding, distribution patterns, motifs • Genomics: genome, gene structure and expression, genetic map, genetic disorder • RNA sequence, secondary structure, 3D structure, interactions Proteins: • Protein sequence, corresponding gene, secondary structure, 3D structure, function, motifs, homology, interactions • Proteomics: expression profile, proteins in disease processes etc. • Ligands and drugs (inhibitors, activators, substrates, metabolites) 4 Biological Information Pathways: • Molecular networks, biological chain events, regulation, feedback, kinetic data Function: • Binding sites, interactions, molecular action (binding, chemical reaction, etc.) • Biological effect (signaling, transport, feedback, regulation, modification, etc.) • Functional relationship, protein families, motifs, and homologs
  • 3. 10/11/2017 3 WHAT IS A DATABASE? • Structured collection of information. • Consists of basic units called records or entries. • Each record consists of fields, which hold pre-defined data related to the record. • For example, a protein database would have protein entries as records and protein properties as fields (e.g., name of protein, length, amino-acid sequence) THE ‘PERFECT’ DATABASE • Comprehensive, but easy to search. • Annotated, but not “too annotated”. • A simple, easy to understand structure. • Cross-referenced. • Minimum redundancy. • Easy retrieval of data.
  • 4. 10/11/2017 4 7 Biological databases Purpose 1. To disseminate biological data and information 2. To provide biological data in computer-readable form 3. To allow analysis of biological data TYPES OF MOLECULAR DATABASES • Primary Databases – Original submissions by experimentalists – Content controlled by the submitter • Examples: GenBank, Trace, SRA, SNP, GEO • Derived Databases – Derived from primary data – Content controlled by third party (e.g. NCBI) • Examples: NCBI Protein, Refseq, TPA, RefSNP, GEO datasets, UniGene, Homologene, Structure, Conserved Domain
  • 5. PRIMARY VS. DERIVED SEQUENCE DATABASES
    [Figure: sequencing centers and labs submit sequences to GenBank; algorithms and curators derive UniGene, RefSeq, and genome assemblies from it.]
    • GenBank (primary): updated ONLY by submitters
    • UniGene, RefSeq, Genome Assembly (derived): updated continually by NCBI
    Types of Biological Databases
    • Bibliographic Databases
    • Integrated Databases
    • Structural Databases
    • Sequence Databases
    • Clinical Databases
  • 6. “Ten Important Bioinformatics Databases”
    • GenBank (www.ncbi.nlm.nih.gov): nucleotide sequences
    • Ensembl (www.ensembl.org): human/mouse genome (and others)
    • PubMed (www.ncbi.nlm.nih.gov): literature references
    • NR (www.ncbi.nlm.nih.gov): protein sequences
    • SWISS-PROT (www.expasy.ch): protein sequences
    • InterPro (www.ebi.ac.uk): protein domains
    • OMIM (www.ncbi.nlm.nih.gov): genetic diseases
    • Enzymes (www.chem.qmul.ac.uk): enzymes
    • PDB (www.rcsb.org/pdb/): protein structures
    • KEGG (www.genome.ad.jp): metabolic pathways
    Source: Bioinformatics for Dummies
    GenBank
    http://www.ncbi.nih.gov/Genbank/
  • 7. GenBank database (http://www.ncbi.nih.gov/Genbank/)
    – Contains publicly available DNA sequences from more than 100,000 organisms.
    – Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features.
    – Accessible through Entrez, NCBI’s integrated retrieval system
    – Sequence similarity search tools: BLAST
    GenBank
    • Annotated collection of all publicly available nucleotide sequences and their protein translations.
    • Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.
    • Grows exponentially, doubling every 10 months
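To make the "doubling every 10 months" figure concrete, here is a small worked sketch. It assumes idealized exponential growth with a constant doubling time, which is a simplification of how the database actually grows; the function name `growth_factor` is hypothetical.

```python
def growth_factor(months: float, doubling_time_months: float = 10.0) -> float:
    """Factor by which the database grows after `months`,
    assuming a constant doubling time (idealized model)."""
    return 2.0 ** (months / doubling_time_months)


# After one doubling time the size has doubled; after one year it has
# grown by 2 ** 1.2; after five years (six doublings) by 2 ** 6.
for months in (10, 12, 60):
    print(months, round(growth_factor(months), 2))
```

Under this assumption the database a little more than doubles each year (a factor of about 2.3) and grows 64-fold in five years.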
  • 8. GENBANK - PRIMARY SEQUENCE DB
    http://www.ncbi.nlm.nih.gov/genbank/
    • Nucleotide-only sequence database
    • Archival in nature
      – Historical
      – Reflective of submitter point of view
      – Redundant
    • Data
      – Direct submissions
      – Batch submissions
      – FTP accounts (genome data)
    GenBank
    • Data shared nightly among three collaborating databases
      – GenBank at NCBI
      – DNA Database of Japan (DDBJ)
      – EMBL at EBI
  • 9. The International Sequence Database Collaboration
    Source: NCBI
    GenBank Release 220, June 2017
    • Full release every two months
    • Incremental and cumulative updates daily
    • Available only over the Internet: ftp://ftp.ncbi.nih.gov/genbank/
  • 10. GenBank Record
    ➢ Header: information that applies to the whole record
    ➢ Features: annotations on the record
    ➢ Sequence
    GenBank Record: Header
    [Figure: annotated record header, with labels for the following fields.]
    • Locus name
    • Sequence length
    • Molecule type
    • GenBank division
    • Modification date
    • Accession number
    • Version number
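As an illustration of the header fields listed above, the sketch below splits a nucleotide LOCUS line into locus name, sequence length, molecule type, division, and modification date. It is a minimal sketch, not NCBI's parser: it assumes whitespace-separated tokens with an optional topology word (`linear` or `circular`), and the names `LocusLine` and `parse_locus` are hypothetical. The accession and version numbers sit on separate ACCESSION and VERSION lines and are not handled here.

```python
from typing import NamedTuple, Optional


class LocusLine(NamedTuple):
    name: str
    length_bp: int
    molecule_type: str
    topology: Optional[str]
    division: str
    modification_date: str


def parse_locus(line: str) -> LocusLine:
    """Parse a nucleotide LOCUS line by splitting on whitespace."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a nucleotide LOCUS line: %r" % line)
    name, length_bp, molecule_type = tokens[1], int(tokens[2]), tokens[4]
    rest = tokens[5:]
    topology = None
    # Some records state the topology before the division code.
    if rest[0] in ("linear", "circular"):
        topology, rest = rest[0], rest[1:]
    if len(rest) != 2:
        raise ValueError("unexpected LOCUS layout: %r" % line)
    division, modification_date = rest
    return LocusLine(name, length_bp, molecule_type, topology,
                     division, modification_date)


# LOCUS line in the style of NCBI's sample GenBank record.
example = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
print(parse_locus(example))
```

For real work an established parser (for example, one from a bioinformatics library) is a safer choice than hand-rolled token splitting.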
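  • 11. GenBank Record: FEATURES
    [Figure: FEATURES section of a GenBank record, with a link to the sequence.]
    GenBank Record: Sequence
    [Figure: sequence section of a GenBank record.]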
  • 12. Direct Submission
    • A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (a contig) with annotations (metadata).
    • If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated.
    High-Throughput Genomic Sequence (HTGS)
    • HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.
    • Currently, more than 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.
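The "conceptual translation" of a CDS mentioned above can be sketched as follows. This is a minimal illustration using the standard genetic code, reading from the first base and stopping at the first stop codon; it ignores alternative genetic codes, partial codons, and ambiguity rules. The function name `translate_cds` and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # TNN codons
               "LLLLPPPPHHQQRRRR"   # CNN codons
               "IIIMTTTTNNKKSSRR"   # ANN codons
               "VVVVAAAADDEEGGGG")  # GNN codons
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence codon by codon until a stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAGTAA"))  # toy CDS: ATG GCC AAG TAA
```

In a GenBank record the result of this kind of translation appears in the `/translation` qualifier of the CDS feature.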
  • 13. Whole Genome Shotgun Sequences (WGS)
    • Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.
    Submission Tools
    • BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.
    • Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences.
  • 14. Sequence Data Flow and Processing
    • Within 48 hours of direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an accession number.
      – All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.
      – GenBank will not accept sequences constructed in silico.
      – GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.
      – GenBank will not accept sequences for which there is no physical counterpart, such as those derived from a mix of genomic DNA and mRNA.
      – Submissions are checked to determine whether they are new or updates.
    Sequence Data Flow and Processing (continued)
    • Indexing:
      – Biological validity: translation, organism lineage, BLAST searches
      – Vector contamination: is there any vector DNA present in the sequence?
      – Publication status: if published, the citation is included in the annotation and linked to Entrez
      – Formatting and spelling
    • Sequences are sent to the submitter for final review before release into the public database.
    • Sequences must become publicly available once the accession number or the sequence has been published.
    • GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.
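The minimal acceptance criteria listed above can be expressed as a simple checklist function. This is a hypothetical sketch for illustration only: it is not how GenBank staff or software screen submissions, and the function name, parameters, and messages are all assumptions. Facts that cannot be read from the sequence itself (who sequenced it, whether it was built in silico) are passed in as flags.

```python
from typing import List

MIN_LENGTH_BP = 50  # sequences must be longer than this


def minimal_criteria_problems(sequence: str,
                              sequenced_by_submitter: bool,
                              constructed_in_silico: bool,
                              has_unsequenced_spacers: bool) -> List[str]:
    """Return the list of criteria the submission fails (empty if none)."""
    problems = []
    if len(sequence) <= MIN_LENGTH_BP:
        problems.append("sequence must be > 50 bp")
    if not sequenced_by_submitter:
        problems.append("must be sequenced by, or on behalf of, the submitter")
    if constructed_in_silico:
        problems.append("in silico constructs are not accepted")
    if has_unsequenced_spacers:
        problems.append("noncontiguous sequence with internal spacers")
    return problems


# An 80 bp toy sequence that passes every check.
print(minimal_criteria_problems("ACGT" * 20, True, False, False))
# A 50 bp sequence fails the length rule (the limit is strict).
print(minimal_criteria_problems("A" * 50, True, False, False))
```

Returning a list of problems rather than a single boolean lets a submitter see every failed criterion at once.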
  • 15. DNA databases
    (Slides from Essential Bioinformatics and Biocomputing (LSM2104), NUS)
    • An example from GenBank (flat file): human alpha-lactalbumin gene
    This protein is a complex of two proteins, A and B. In the absence of the B protein, the enzyme catalyzes the transfer of galactose from UDP-galactose to N-acetylglucosamine (cf. EC 2.4.1.90).
    A GenBank entry: HEADER
    [Figure: header section of the GenBank entry.]
  • 16. GenBank Entry: Links provided in the Header
    • MapViewer: find the gene position on the chromosome
    • Related Sequences: other entries related to this gene (or sequence)
    • OMIM: link to catalog of human genes and genetic disorders
    • Protein: retrieve protein record from GenPept
    • Medline and PubMed: literature abstracts related to this gene
    • Taxonomy: classification of organisms
    • UniGene: unified gene data
    • UniSTS: unified sequence tagged sites, marker and mapping data
    • LinkOut: links to publishers, aggregators, libraries, biological databases, sequence centers, and other Web resources
    • REFSEQ: reference sequence standards
    Note: These links are representative. Other links may also be found in GenBank entries.
    GenBank entry: FEATURES
    [Figure: FEATURES section of the GenBank entry.]