Mapping Proteins to Functions
Part 1
dsdht.wikispaces.com
Points to remember
Proteins are single, unbranched chains of
amino acid monomers.
There are 20 standard amino acids.
There are four levels of protein structure: primary, secondary, tertiary and quaternary.
A protein’s amino acid sequence determines
its three-dimensional structure
(conformation).
Protein Functional Classes

Why do we care about protein
function?
• Diagnose the causes of disease.
• Discover new drugs.
• Understand the mechanism of action of
processes in the system.
Data used for prediction of protein function
• Amino acid sequences
• Protein structure
• Genome sequences
• Phylogenetic data
• Microarray expression data
• Protein interaction networks and protein
complexes
• Biomedical literature
The concept of protein function is highly
context-sensitive and not very well-defined.
In fact, this concept typically acts as an umbrella
term for all types of activities that a protein is
involved in, be it cellular, molecular or
physiological.
Characterization of protein function

Molecular function, cellular function and
phenotypic function are hierarchically
related.

Predicting function: from genes to genomes. Bork et al.,
1998.
Gene Ontology classification scheme categorizes
protein function into cellular component,
molecular function and biological process.
In computer science and information science, an ontology formally represents
knowledge as a set of concepts within a domain, and the relationships between pairs of
concepts. It can be used to model a domain and support reasoning about entities.
Read more at http://www.answers.com/topic/ontology-computer-science

http://www.nature.com/ng/journal/v25/n1/full/ng0500_25.html

http://www.geneontology.org/
GO Format

Figure adapted from [Ashburner et al. 2000]

• Wide coverage
• Standardized format
• Hierarchical structure
• Disjoint categories
• Multiple functions
• Dynamic nature
Molecular function
• Molecular function describes activities, such as
catalytic or binding activities, at the molecular level
• GO molecular function terms represent activities rather
than the entities that perform the actions, and do not
specify where or when, or in what context, the action
takes place
• Examples of broad functional terms are catalytic
activity or transporter activity; an example of a
narrower term is adenylate cyclase activity
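The relation between broad and narrow terms can be pictured as a small graph of "is a" links. The sketch below is only an illustration: the term names catalytic activity, transporter activity and adenylate cyclase activity come from the slide, while the intermediate term and all parent links in `IS_A` are assumptions made for the example, not an extract of the real Gene Ontology.

```python
# Toy "is a" hierarchy. The links below are illustrative assumptions,
# not a dump of the real Gene Ontology.
IS_A = {
    "adenylate cyclase activity": ["cyclase activity"],
    "cyclase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "transporter activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term, is_a=IS_A):
    """Return every broader term reachable from `term` through is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


if __name__ == "__main__":
    # A protein annotated with the narrow term is implicitly annotated
    # with all of its broader terms as well.
    print(sorted(ancestors("adenylate cyclase activity")))
```

Walking upwards like this is why an annotation to a narrow term also counts as an annotation to each of its broader terms.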
Biological process
• A biological process is a series of events
accomplished by one or more ordered assemblies
of molecular functions
• An example of a broad GO biological process
term is signal transduction; examples of more
specific terms are pyrimidine metabolism or
alpha-glucoside transport.
• It can be difficult to distinguish between a
biological process and a molecular function.
Cellular component
• A cellular component is just that, a component of
a cell that is part of some larger object
• It may be an anatomical structure (for example,
the rough endoplasmic reticulum or the nucleus)
or a gene product group (for example, the
ribosome, the proteasome or a protein dimer)
• The cellular component categories are probably
the best defined categories since they correspond
to actual entities
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene
lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.
DAVID (Gene Ontology Enrichment)
YouTube videos
http://www.youtube.com/watch?v=xIu9mm6b7N0
http://www.youtube.com/watch?v=zedjRViji2c
Try analyzing the list of microarray probe set IDs given below with DAVID.
31741_at 31734_at 32696_at 37559_at 41400_at 35985_at
39304_g_at 41438_at 35067_at 32919_at 35429_at 36674_at
967_g_at 36669_at 39242_at 39573_at 39407_at 33346_r_at
40319_at 2043_s_at 1788_s_at 36651_at 41788_i_at 35595_at
36285_at 39586_at 35160_at 39424_at 36865_at 2004_at
36728_at 37218_at 40347_at 36226_r_at 33012_at 37906_at
32872_at
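GO enrichment asks whether a term annotates more genes in the submitted list than expected by chance. A minimal sketch of that idea, assuming a plain hypergeometric tail test: DAVID's own statistic (the EASE score) is a more conservative variant, so values from this sketch will not match DAVID's output exactly. The function name and the counts in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(list_hits, list_size, background_hits, background_size):
    """P(X >= list_hits) when list_size genes are drawn from the background.

    background_hits genes in the background carry the GO term of interest.
    """
    max_hits = min(list_size, background_hits)
    total = comb(background_size, list_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 8 of 37 list genes carry the term,
    # against 200 of 12000 genes in the background.
    print(hypergeom_enrichment_p(8, 37, 200, 12000))
```

A small p-value suggests the term is over-represented in the list; when many terms are tested, a multiple-testing correction is normally applied on top.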
Sequence & Structure based methods
Part 2
Basic Set of Protein Annotations

• Protein name
- descriptive common name for the protein
- e.g. “kinase”
• Gene symbol
- mnemonic abbreviation for the gene
- e.g. “recA”
• EC number
- for enzymes, the numeric classification of the reaction catalyzed
• Role
- what the protein is doing in the cell and why
- e.g. “involved in glycolysis”
• Supporting evidence
- accession numbers of BER and HMM matches
- whatever information you used to make the annotation
• Unique identifier
- e.g. locus IDs
Sequence Similarity Evidence
• Pairwise alignments - two proteins’ amino acid sequences aligned next to
each other so that the maximum number of amino acids match
• Multiple alignment - 3 or more amino acid sequences aligned to each other
so that the maximum number of amino acids match in each column
• Protein families - clusters of proteins that all share sequence similarity and
presumably similar function
• Motifs - short regions of amino acid sequence shared by many proteins. A
motif can be found in a number of different proteins where it carries out
similar functions.
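The idea of aligning two sequences so that as many residues as possible match can be sketched with the classic Needleman-Wunsch dynamic programme. This is a minimal illustration, assuming a simple identity scoring scheme (match, mismatch and gap values chosen here for the example); real tools use substitution matrices and more elaborate gap penalties. The sequences are made up.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical peptide fragments.
    print(global_alignment_score("MKTAYIAK", "MKTAIAK"))
```

Only the score is returned here; recovering the alignment itself would need a traceback through the full score matrix.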
Important terms to understand
• Homologs – two sequences that have evolved from the same
common ancestor; they may or may not share the same function
• Orthologs – a type of homolog where the two sequences are
in different species and arose from a common ancestor.
Speciation has created the two copies of the sequence.
• Paralogs – a type of homolog where the two sequences
have arisen due to a gene duplication within one
species. They initially have the same function, but as time
goes by one copy will be free to evolve new functions, while
the other copy will maintain the original function.
• Xenologs – a type of homolog where the two gene
sequences have arisen due to horizontal (lateral) transfer
between organisms, rather than by inheritance through reproduction
Taken from http://ae.igs.umaryland.edu/docs/FunctionalAnnotApril.pdf
Sequence similarity, sequence homology, and
functional homology

• Sequence similarity means that the sequences
are similar – no more, no less
• Sequence homology implies that the proteins are
encoded by genes that share a common ancestry.
• Functional homology means that two proteins
from two organisms have the same function.
• Sequence similarity or sequence homology does
not guarantee functional homology
Existing Sequence based function prediction methods
Homology based approaches
• BLAST
• FASTA
• SSEARCH
• PSI-BLAST - iterates searches by using a sequence profile computed from a multiple
sequence alignment obtained from the previous round of searching.
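The notion of a profile computed from a multiple sequence alignment can be illustrated by counting residues column by column. This is a deliberately simplified sketch: PSI-BLAST itself builds a position-specific scoring matrix with sequence weighting and pseudocounts, none of which is modelled here. The function name and the toy alignment are assumptions.

```python
from collections import Counter


def column_frequencies(alignment):
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gap characters
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: n / total for r, n in counts.items()} if total else {})
    return profile


if __name__ == "__main__":
    # Toy alignment of three hypothetical sequences.
    for position, freqs in enumerate(column_frequencies(["ACD", "ACE", "A-E"]), 1):
        print(position, freqs)
```

Conserved columns end up dominated by one residue, which is what lets a profile search detect more distant relatives than a single query sequence can.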

Subsequence based approaches
• Motifs and domains
http://molbiol-tools.ca/Motifs.htm

Feature based approaches
• Properties such as normalized Van der Waals volume, polarity, charge and surface tension are
averaged over all the residues in the sequence to obtain the feature-value vector for
the protein, which is then used to train a classifier
• SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi)
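Averaging per-residue properties into a fixed-length vector can be sketched as follows. This is a simplified illustration and not SVMProt's actual encoding: of the properties named above only charge is used, and Kyte-Doolittle hydropathy is swapped in as a second property because its values are widely tabulated. The function name and example sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values (swapped in for illustration).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Simplified side-chain charge at neutral pH; all other residues count as 0.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def feature_vector(sequence):
    """Average each per-residue property over the whole sequence."""
    residues = [r for r in sequence.upper() if r in HYDROPATHY]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    n = len(residues)
    mean_hydropathy = sum(HYDROPATHY[r] for r in residues) / n
    mean_charge = sum(CHARGE.get(r, 0.0) for r in residues) / n
    return [mean_hydropathy, mean_charge]


if __name__ == "__main__":
    print(feature_vector("MKTAYIAKQR"))
```

Because the vector has the same length for every protein, vectors from proteins of known function can be passed directly to a classifier such as a support vector machine.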
Drawbacks of BLAST and FASTA
• Typically provide functional annotation for only about half
of the genes in a genome, since homologous
sequences are often not found at accepted
significance thresholds.
• Automated methods of annotation transfer
between similar sequences contribute to error
propagation.
Enhanced Sequence based methods
• PFP – Kihara lab (http://kiharalab.org/web/pfp.php)
• The PFP algorithm uses PSI-BLAST (version 2.2.6) to predict probable GO function annotations in
three categories (molecular function, biological process, and cellular component) with statistical
significance scores (P-value).
• For each sequence retrieved by PSI-BLAST, the associated GO terms are scored.
• GO terms are scored according to
a) frequency of association to similar sequences
b) degree of similarity those sequences share with the query

$$ s(f_a) = \sum_{i=1}^{N} \sum_{j=1}^{N_{func}(i)} \left( \left( -\log(E\_value(i)) + b \right) P(f_a \mid f_j) \right) $$

where s(fa) is the final score assigned to the GO term fa, N is the number of similar sequences retrieved by PSI-BLAST,
Nfunc(i) is the number of GO terms annotating sequence i, E_value(i) is the E-value given to sequence i,
fj is a GO term annotating sequence i, and b is a constant, 2 (= log10 100), which keeps the score
positive. P(fa|fj) is the association score for fa given fj obtained from the function association matrix (FAM).

$$ P(f_a \mid f_j) = \frac{c(f_a, f_j) + \epsilon}{c(f_j) + l \cdot \epsilon} $$

where c(fa, fj) is the number of times fa and fj are assigned simultaneously to a sequence in UniProt, c(fj) is the total number of
times fj appears in UniProt, l is the size of one dimension of the FAM (i.e. the total number of unique GO terms), and ε is the
pseudocount.
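The scoring scheme described by these definitions can be sketched in a few lines. This is an illustration under stated assumptions, not the PFP implementation: it assumes the score sums (−log10 E_value(i) + b) · P(fa|fj) over every retrieved sequence i and every GO term fj annotating it, with a base-10 logarithm (consistent with b = log10 100). The function names, the GO identifiers and the association values in the usage example are hypothetical placeholders.

```python
import math


def fam_probability(c_joint, c_fj, fam_size, epsilon=1e-3):
    """Association score P(fa|fj) with a pseudocount epsilon.

    c_joint:  number of sequences annotated with both fa and fj
    c_fj:     number of sequences annotated with fj
    fam_size: number of unique GO terms (one dimension of the FAM)
    """
    return (c_joint + epsilon) / (c_fj + fam_size * epsilon)


def pfp_scores(hits, association, b=2.0):
    """Score GO terms from search hits.

    hits:        list of (e_value, [GO terms annotating that hit])
    association: {(fa, fj): P(fa|fj)}
    """
    scores = {}
    for e_value, terms in hits:
        weight = -math.log10(e_value) + b  # non-negative for E-values up to 100
        for fj in terms:
            for (fa, given), p in association.items():
                if given == fj:
                    scores[fa] = scores.get(fa, 0.0) + weight * p
    return scores


if __name__ == "__main__":
    # Placeholder identifiers and values, for illustration only.
    hits = [(1e-30, ["GO:X"]), (5.0, ["GO:X", "GO:Y"])]
    association = {
        ("GO:X", "GO:X"): 1.0,
        ("GO:Y", "GO:Y"): 1.0,
        ("GO:Z", "GO:X"): 0.4,
    }
    print(pfp_scores(hits, association))
```

Note how a weak hit still contributes a little, and how a term that never annotates any hit directly (GO:Z here) can receive a score through its association with a term that does.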
When Homology searches fail
• Sometimes no orthologs or even paralogs can be
identified by sequence similarity searches, or they are
all of unknown function.
• No functional information can thus be transferred
based on simple sequence homology

• By instead analyzing the various parts that make up the
complete protein, it is nonetheless often possible to
predict the protein function
Protein domains
• Many eukaryotic proteins consist of multiple
globular domains that can fold independently
• These domains have been mixed and matched
through evolution
• Each type of domain contributes towards the
molecular function of the complete protein
• Numerous resources are able to identify such
domains from sequence alone using HMMs
Which domain resource should I use?
• SMART is focused on signal transduction domains
• Pfam is very actively developed and thus tends to
have the most up-to-date domain collection

• InterPro is useful for genome annotation since
the domains are annotated with GO terms
• CDD is conveniently integrated with the NCBI
BLAST web interface
Function prediction from post translational
modifications
• Proteins with similar function may
not be related in sequence
• Still, they must perform their
function in the context of the same
cellular machinery
• Similarities in features such as
PTMs and physical/chemical
properties could be expected for
proteins with similar function
The concept of ProtFun

http://www.cbs.dtu.dk/services/ProtFun/
Function prediction on the
human prion sequence
############## ProtFun 1.1 predictions ##############
>PRIO_HUMAN
# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.020    0.909
  Biosynthesis_of_cofactors            0.032    0.444
  Cell_envelope                        0.146    2.393
  Cellular_processes                   0.053    0.726
  Central_intermediary_metabolism      0.130    2.063
  Energy_metabolism                    0.029    0.322
  Fatty_acid_metabolism                0.017    1.308
  Purines_and_pyrimidines              0.528    2.173
  Regulatory_functions                 0.013    0.081
  Replication_and_transcription        0.020    0.075
  Translation                          0.035    0.795
  Transport_and_binding             => 0.831    2.027

# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                               0.250    0.873
  Nonenzyme                         => 0.750    1.051

# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.070    0.336
  Transferase    (EC 2.-.-.-)          0.031    0.090
  Hydrolase      (EC 3.-.-.-)          0.057    0.180
  Lyase          (EC 4.-.-.-)          0.020    0.426
  Isomerase      (EC 5.-.-.-)          0.010    0.313
  Ligase         (EC 6.-.-.-)          0.017    0.334
ProtFun data sets
• Labeling of training and test data
– Cellular role categories: human SwissProt sequences
were categorized using EUCLID
– Enzyme categories: top-level enzyme classifications
were extracted from human SwissProt description lines
– Gene Ontology terms were transferred from InterPro

• The sequences were divided into training and test
sets without significant sequence similarity
• Binary predictors were trained for each category
Structure based methods

Three standard databases dominate the structure data
landscape:
PDB - structure data from NMR and X-ray crystallography
SCOP - organizes the available structures in a hierarchy (Family,
Superfamily and Fold) so as to elicit the evolutionary relationships between them

CATH - (Class, Architecture, Topology and Homologous
superfamily)
Structure based methods
Protein Folds Super Secondary Structures

Biological function

Adapted from Martin 1998. Protein folds and
functions
Approaches for deriving functional information from 3D
structure

Adapted from: From structure to function. Thornton et al., 2000, Nature.
http://www.jove.com/video/3259/a-protocol-forcomputer-based-protein-structure-function
ProFunc server methods
Sequence-based methods:
• BLAST search against the UniProt Knowledgebase.
• FASTA search against sequences of structures in the Protein Data Bank.
• InterProScan
• Superfamily search
• Residue conservation mapped onto structure
• Genome location analysis
Structure-based methods:
• Fold matching using MSDfold and DALI
• Helix-Turn-Helix motif search
• Nest analysis
• Surface clefts analysis
• Template methods
- Enzyme active sites
- Ligand binding sites
- DNA binding sites
- Reverse template search
References
• http://www.sciencedirect.com/science/article/pii/S0022283698921441
• http://www.nature.com/ng/journal/v25/n1/full/ng0500_25.html
• http://www.nature.com/nprot/journal/v4/n1/full/nprot.2008.211.html
• http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/0708/itb/slides/itb0708_slides_83-116.pdf (BLAST and FASTA)
• http://kiharalab.org/web/paper/HawkinsChitaleLubanKihara_Proteins09.pdf
• http://www.ebi.ac.uk/thorntonsrv/databases/profunc/doc/profunc_tutorial.pdf

Mapping Protein Functions Using Sequence and Structure Data

  • 1. Mapping Proteins to Functions Part 1 dsdht.wikispaces.com
  • 2. Points to remember • Proteins are single, unbranched chains of amino acid monomers. • There are 20 different standard amino acids. • There are four levels of protein structure: primary, secondary, tertiary and quaternary. • A protein’s amino acid sequence determines its three-dimensional structure (conformation).
  • 3. Protein functional classes. Why do we care about protein function? • Diagnose the causes of disease. • Discover new drugs. • Understand the mechanism of action of processes in the system.
  • 4. Data used for prediction of protein function • Amino acid sequences • Protein structure • Genome sequences • Phylogenetic data • Microarray expression data • Protein interaction networks and protein complexes • Biomedical literature
  • 5. The concept of protein function is highly context-sensitive and not very well defined. In fact, it typically acts as an umbrella term for all types of activities that a protein is involved in, be they cellular, molecular or physiological.
  • 6. Characterization of protein function: molecular function, cellular function and phenotypic function are hierarchically related. Predicting function: from genes to genomes. Bork et al., 1998.
  • 7. Gene Ontology classification scheme categorizes protein function into cellular component, molecular function and biological process. In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between pairs of concepts. It can be used to model a domain and support reasoning about entities Read more at http://www.answers.com/topic/ontology-computer-science http://www.nature.com/ng/journal/v25/n1/full/ng0500_25.html http://www.geneontology.org/
  • 8. GO Format (figure adapted from Ashburner et al., 2000) • Wide coverage • Standardized format • Hierarchical structure • Disjoint categories • Multiple functions • Dynamic nature
  • 9. Molecular function • Molecular function describes activities, such as catalytic or binding activities, at the molecular level • GO molecular function terms represent activities rather than the entities that perform the actions, and do not specify where or when, or in what context, the action takes place • Examples of broad functional terms are catalytic activity or transporter activity; an example of a narrower term is adenylate cyclase activity
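The broad-to-narrow relationship between GO terms can be sketched as a small parent lookup. This is only an illustration: the `PARENTS` table and the helper names `ancestors` and `propagate` are assumptions made for this example, and the edges are a simplified toy hierarchy built from the terms named on the slide, not a copy of the real GO graph.

```python
# Toy "is_a" hierarchy; a simplified illustration, not the real GO graph.
PARENTS = {
    "adenylate cyclase activity": ["cyclase activity"],
    "cyclase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "transporter activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, parents=PARENTS):
    """Return every broader term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(annotations, parents=PARENTS):
    """Add all broader terms implied by a set of specific annotations."""
    implied = set(annotations)
    for term in annotations:
        implied |= ancestors(term, parents)
    return implied


if __name__ == "__main__":
    print(sorted(propagate({"adenylate cyclase activity"})))
```

Under this toy table, a protein annotated with the narrow term is implicitly annotated with every broader term above it, which is why a hierarchical vocabulary lets annotations be compared at different levels of detail.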
  • 10. Biological process • A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions • An example of a broad GO biological process term is signal transduction; examples of more specific terms are pyrimidine metabolism or alpha-glucoside transport. • It can be difficult to distinguish between a biological process and a molecular function.
  • 11. Cellular component • A cellular component is just that, a component of a cell that is part of some larger object • It may be an anatomical structure (for example, the rough endoplasmic reticulum or the nucleus) or a gene product group (for example, the ribosome, the proteasome or a protein dimer) • The cellular component categories are probably the best defined categories since they correspond to actual entities
  • 12. Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.
  • 13. DAVID (Gene Ontology Enrichment) Youtube Videos http://www.youtube.com/watch?v=xIu9mm6b7N0 http://www.youtube.com/watch?v=zedjRViji2c Try out the microarray list given below for analyzing Proteins. 31741_at 31734_at 32696_at 37559_at 41400_at 35985_at 39304_g_at 41438_at 35067_at 32919_at 35429_at 36674_at 967_g_at 36669_at 39242_at 39573_at 39407_at 33346_r_at 40319_at 2043_s_at 1788_s_at 36651_at 41788_i_at 35595_at 36285_at 39586_at 35160_at 39424_at 36865_at 2004_at 36728_at 37218_at 40347_at 36226_r_at 33012_at 37906_at 32872_at
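The idea behind GO term enrichment can be sketched with a plain hypergeometric test: given a gene list, how surprising is the number of genes carrying a particular GO term compared with the background? This is a minimal sketch and not DAVID's own statistic (DAVID reports a modified Fisher exact score); the function name `enrichment_p_value` and all counts in the usage lines are hypothetical.

```python
from math import comb


def enrichment_p_value(list_hits, list_size, background_hits, background_size):
    """Probability of seeing at least `list_hits` annotated genes when
    `list_size` genes are drawn without replacement from a background of
    `background_size` genes, of which `background_hits` carry the term."""
    upper = min(list_size, background_hits)
    total = comb(background_size, list_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 8 of 37 listed genes carry a term that
    # annotates 300 of 12000 background genes.
    print(enrichment_p_value(8, 37, 300, 12000))
```

A small p-value suggests the term is over-represented in the list; in practice the values would also need correction for testing many GO terms at once.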
  • 14. Sequence & Structure based methods Part 2
  • 15. Basic Set of Protein Annotations • Protein name - descriptive common name for the protein, e.g. “kinase” • Gene symbol - mnemonic abbreviation for the gene, e.g. “recA” • EC number - Enzyme Commission number classifying the reaction an enzyme catalyzes • Role - what the protein is doing in the cell and why, e.g. “involved in glycolysis” • Supporting evidence - accession numbers of BER and HMM matches, or whatever information you used to make the annotation • Unique identifier - e.g. locus IDs
  • 16. Sequence Similarity Evidence • Pairwise alignments - two proteins’ amino acid sequences aligned next to each other so that the maximum number of amino acids match • Multiple alignments - three or more amino acid sequences aligned to each other so that the maximum number of amino acids match in each column • Protein families - clusters of proteins that all share sequence similarity and presumably similar function • Motifs - short regions of amino acid sequence shared by many proteins. A motif can be found in a number of different proteins, where it carries out similar functions.
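The "maximum number of matches" idea behind pairwise alignment can be sketched with a small dynamic-programming routine in the style of Needleman-Wunsch global alignment. This is a minimal sketch under simplifying assumptions: the function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -2) are chosen for illustration, whereas real tools typically use substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    Only the previous row of the dynamic-programming table is kept,
    so memory use grows with len(b) rather than len(a) * len(b).
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against nothing
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(
                max(
                    prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                    prev[j] + gap,       # gap in b
                    curr[j - 1] + gap,   # gap in a
                )
            )
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("HEAGAWGHEE", "PAWHEAE"))
```

Recovering the alignment itself, rather than only its score, would require keeping the full table and tracing back from the last cell.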
  • 17. Important terms to understand • Homologs – two sequences that have evolved from the same common ancestor; they may not share the same function. • Orthologs – a type of homolog where the two sequences are in different species and arose from a common ancestor. Speciation has created the two copies of the sequence. • Paralogs – a type of homolog where the two sequences have arisen due to a gene duplication within one species. They initially have the same function, but as time goes by one copy will be free to evolve new functions while the other copy maintains the original function. • Xenologs – a type of homolog where the two gene sequences have arisen due to horizontal transfer (that is, by means other than reproduction).
  • 19. Sequence similarity, sequence homology, and functional homology • Sequence similarity means that the sequences are similar – no more, no less • Sequence homology implies that the proteins are encoded by genes that share a common ancestry. • Functional homology means that two proteins from two organisms have the same function. • Sequence similarity or sequence homology does not guarantee functional homology
  • 20. Existing Sequence based function prediction methods. Homology based approaches: • BLAST • FASTA • SSEARCH • PSI-BLAST - iterates searches by using a sequence profile computed from a multiple sequence alignment obtained from the previous round of searching. Subsequence based approaches: • Motifs and domains http://molbiol-tools.ca/Motifs.htm Feature based approaches: • Properties such as normalized Van der Waals volume, polarity, charge and surface tension are averaged over all the residues in the sequence to obtain the feature-value vector for the protein, which is used to train a classifier • SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi)
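The feature-based idea (average a per-residue property over the sequence to get a fixed-length vector) can be sketched as follows. This is an illustration only and not SVMProt's feature set: the properties listed on the slide need published per-residue tables that are not reproduced here, so the sketch substitutes two simpler properties, the Kyte-Doolittle hydropathy index and the nominal side-chain charge at neutral pH. The function name `feature_vector` is an assumption.

```python
# Kyte-Doolittle hydropathy index per residue (substituted for the
# slide's properties, which are not tabulated in this transcript).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Nominal side-chain charge at neutral pH; all other residues count as 0.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def feature_vector(sequence):
    """Return [mean hydropathy, mean charge] over the standard residues."""
    residues = [r for r in sequence.upper() if r in HYDROPATHY]
    if not residues:
        raise ValueError("no standard amino acids in sequence")
    n = len(residues)
    mean_hydropathy = sum(HYDROPATHY[r] for r in residues) / n
    mean_charge = sum(CHARGE.get(r, 0.0) for r in residues) / n
    return [mean_hydropathy, mean_charge]


if __name__ == "__main__":
    print(feature_vector("MKTAYIAKQR"))
```

Because every protein maps to a vector of the same length regardless of its sequence length, such vectors can be passed directly to a standard classifier.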
  • 21. Drawbacks of BLAST and FASTA • They typically provide functional annotation for only about half of the genes in a genome, since homologous sequences are often not found at accepted significance thresholds. • Automated methods of annotation transfer between similar sequences contribute to error propagation.
  • 22. Enhanced Sequence based methods • PFP – Kihara lab (http://kiharalab.org/web/pfp.php) • The PFP algorithm uses PSI-BLAST (version 2.2.6) to predict probable GO function annotations in three categories (molecular function, biological process, and cellular component) with statistical significance scores (P-values). • For each sequence retrieved by PSI-BLAST, the associated GO terms are scored according to a) their frequency of association with similar sequences and b) the degree of similarity those sequences share with the query. • In the scoring formulas, s(fa) is the final score assigned to the GO term fa, N is the number of similar sequences retrieved by PSI-BLAST, Nfunc(i) is the number of GO terms annotating sequence i, E_value(i) is the E-value given to sequence i, fj is a GO term annotating sequence i, and b is a constant, 2 (= log10 100), which keeps the score positive. P(fa|fj) is the association score for fa given fj, obtained from the function association matrix (FAM). c(fa, fj) is the number of times fa and fj are assigned simultaneously to a sequence in UniProt, c(fj) is the total number of times fj appears in UniProt, l is the size of one dimension of the FAM (i.e. the total number of unique GO terms), and ε is the pseudocount.
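The formulas themselves did not survive in this transcript. Read off the symbol definitions above and the PFP paper cited in the references, they appear to have the form s(fa) = sum over i = 1..N and j = 1..Nfunc(i) of (-log10(E_value(i)) + b) * P(fa|fj), with P(fa|fj) = (c(fa, fj) + ε) / (c(fj) + l * ε). The sketch below implements that reading. The function names, the input layout, the default pseudocount, the E-value floor and the treatment of a term as co-occurring with itself c(fj) times are all assumptions made for illustration, not details taken from the PFP software.

```python
from collections import defaultdict
from math import log10


def association(fa, fj, pair_counts, term_counts, n_terms, eps):
    """P(fa|fj) with a pseudocount.

    Assumption: a term co-occurs with itself c(fj) times, so pair_counts
    only needs entries for distinct (fa, fj) pairs.
    """
    if fa == fj:
        joint = term_counts.get(fj, 0)
    else:
        joint = pair_counts.get((fa, fj), 0)
    return (joint + eps) / (term_counts.get(fj, 0) + n_terms * eps)


def pfp_scores(hits, pair_counts, term_counts, n_terms, b=2.0, eps=0.05):
    """Score every known GO term from a list of (e_value, [GO terms]) hits."""
    scores = defaultdict(float)
    for e_value, terms in hits:
        # Floor the E-value so that a reported 0.0 does not break log10
        # (an assumption of this sketch).
        weight = -log10(max(e_value, 1e-300)) + b
        for fj in terms:
            for fa in term_counts:
                scores[fa] += weight * association(
                    fa, fj, pair_counts, term_counts, n_terms, eps
                )
    return dict(scores)


if __name__ == "__main__":
    # Hypothetical counts and hits, for illustration only.
    term_counts = {"GO:A": 10, "GO:B": 5}
    pair_counts = {("GO:B", "GO:A"): 5, ("GO:A", "GO:B"): 5}
    hits = [(1e-10, ["GO:A"]), (5.0, ["GO:B"])]
    print(pfp_scores(hits, pair_counts, term_counts, n_terms=2))
```

With b = 2, a hit at an E-value of 100 contributes a weight of zero and any more significant hit contributes a positive weight, which matches the remark that b keeps the score positive. Terms that never annotate a hit directly can still receive a score through the association term.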
  • 23. When Homology searches fail • Sometimes no orthologs or even paralogs can be identified by sequence similarity searches, or they are all of unknown function. • No functional information can thus be transferred based on simple sequence homology • By instead analyzing the various parts that make up the complete protein, it is nonetheless often possible to predict the protein function
  • 24. Protein domains • Many eukaryotic proteins consist of multiple globular domains that can fold independently • These domains have been mixed and matched through evolution • Each type of domain contributes towards the molecular function of the complete protein • Numerous resources are able to identify such domains from sequence alone using HMMs
  • 29. Which domain resource should I use? • SMART is focused on signal transduction domains • Pfam is very actively developed and thus tends to have the most up-to-date domain collection • InterPro is useful for genome annotation since the domains are annotated with GO terms • CDD is conveniently integrated with the NCBI BLAST web interface
  • 30. Function prediction from post-translational modifications • Proteins with similar function may not be related in sequence • Still, they must perform their function in the context of the same cellular machinery • Similarities in features such as PTMs and physical/chemical properties can therefore be expected for proteins with similar function
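One way to turn a PTM into a sequence-derived feature is to count candidate modification sites. The sketch below scans for the classic N-glycosylation sequon (asparagine, then any residue except proline, then serine or threonine). It is a deliberately crude stand-in: dedicated predictors use trained models rather than a bare pattern, and the names `sequon_positions` and `sequon_density` are assumptions made for this example.

```python
import re

# Zero-width lookahead so that overlapping sequons are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(sequence):
    """1-based positions of asparagines in an N-X-[S/T] sequon (X not P)."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


def sequon_density(sequence):
    """Candidate sites per residue, usable as one entry of a feature vector."""
    if not sequence:
        return 0.0
    return len(sequon_positions(sequence)) / len(sequence)


if __name__ == "__main__":
    print(sequon_positions("MNGTANPSLNNSS"))
```

A matching pattern only marks a candidate site; whether the residue is actually modified depends on cellular context, which is why such counts are used as features rather than as predictions in themselves.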
  • 31. The concept of ProtFun http://www.cbs.dtu.dk/services/ProtFun/
  • 33. Function prediction on the human prion sequence
      ############## ProtFun 1.1 predictions ##############
      >PRIO_HUMAN
      # Functional category                   Prob     Odds
        Amino_acid_biosynthesis               0.020    0.909
        Biosynthesis_of_cofactors             0.032    0.444
        Cell_envelope                         0.146    2.393
        Cellular_processes                    0.053    0.726
        Central_intermediary_metabolism       0.130    2.063
        Energy_metabolism                     0.029    0.322
        Fatty_acid_metabolism                 0.017    1.308
        Purines_and_pyrimidines               0.528    2.173
        Regulatory_functions                  0.013    0.081
        Replication_and_transcription         0.020    0.075
        Translation                           0.035    0.795
        Transport_and_binding              => 0.831    2.027
      # Enzyme/nonenzyme                      Prob     Odds
        Enzyme                                0.250    0.873
        Nonenzyme                          => 0.750    1.051
      # Enzyme class (EC 1.-.-.- to 6.-.-.-)  Prob     Odds
        Oxidoreductase                        0.070    0.336
        Transferase                           0.031    0.090
        Hydrolase                             0.057    0.180
        Isomerase                             0.020    0.426
        Ligase                                0.010    0.313
        Lyase                                 0.017    0.334
  • 34. ProtFun data sets • Labeling of training and test data – Cellular role categories: human SwissProt sequences were categorized using EUCLID – Enzyme categories: top-level enzyme classifications were extracted from human SwissProt description lines – Gene Ontology terms were transferred from InterPro • The sequences were divided into training and test sets without significant sequence similarity between them • Binary predictors were trained for each category
  • 35. Structure based methods. Three standard databases dominate the structure data landscape: • PDB - structure data from NMR and X-ray crystallography • SCOP - organizes the available structures in a hierarchy (Family, Superfamily and Fold) so as to elicit the evolutionary relationships between them • CATH - classifies structures by Class, Architecture, Topology and Homologous superfamily
  • 36. Structure based methods [Figure: protein folds, super-secondary structures and biological function. Adapted from Martin 1998, Protein folds and functions.]
  • 37. Approaches for deriving functional information from 3D structure. Adapted from Thornton et al., 2000, From structure to function, Nature.
  • 40. ProFunc server methods. Sequence-based methods: • BLAST search against the UniProt Knowledgebase • FASTA search against sequences of structures in the Protein Data Bank • InterProScan • Superfamily search • Residue conservation mapped onto structure • Genome location analysis. Structure-based methods: • Fold matching using MSDfold and DALI • Helix-Turn-Helix motif search • Nest analysis • Surface clefts analysis • Template methods: enzyme active sites, ligand binding sites, DNA binding sites, reverse template search
  • 41. References • http://www.sciencedirect.com/science/article/pii/S0022283698921441 • http://www.nature.com/ng/journal/v25/n1/full/ng0500_25.html • http://www.nature.com/nprot/journal/v4/n1/full/nprot.2008.211.html • http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/0708/itb/slides/itb0708_slides_83-116.pdf (BLAST and FASTA) • http://kiharalab.org/web/paper/HawkinsChitaleLubanKihara_Proteins09.pdf • http://www.ebi.ac.uk/thorntonsrv/databases/profunc/doc/profunc_tutorial.pdf