SlideShare a Scribd company logo
1 of 1
Download to read offline
DISCOVERY	
  OF	
  FUNCTIONAL	
  PROTEIN	
  LINEAR	
  MOTIFS	
  
                                                            USING	
  A	
  GREEDY	
  ALGORITHM	
  AND	
  INFORMATION	
  THEORY	
  
                                                      LEANDRO	
  G.	
  RADUSKY§,	
  JULIANA	
  GLAVINA§,	
  MARIA	
  FATIMA	
  LADELFA¶,	
  MARTIN	
  MONTE¶	
  	
  
                                                                                          AND	
  IGNACIO	
  E.	
  SANCHEZ§	
  
                   §PROTEIN	
  PHYSIOLOGY	
  LABORATORY,	
  DEPARTAMENTO	
  DE	
  QUIMICA	
  BIOLOGICA,	
  FACULTAD	
  DE	
  CIENCIAS	
  EXACTAS	
  Y	
  NATURALES-­‐UNIVERSIDAD	
  DE	
  BUENOS	
  AIRES,	
  ARGENTINA	
  ¶MOLECULAR	
  

                            AND	
  CELL	
  BIOLOGY	
  LABORATORY,	
  DEPARTAMENTO	
  DE	
  QUIMICA	
  BIOLOGICA,	
  FACULTAD	
  DE	
  CIENCIAS	
  EXACTAS	
  Y	
  NATURALES-­‐UNIVERSIDAD	
  DE	
  BUENOS	
  AIRES,	
  ARGENTINA	
  .	
  	
  




INTRODUCTION	
  
The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many
globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called “linear motifs” [1]. It is
likely that recognition of yet unknown linear motifs lies behind many protein-protein complexes of biological interest. We present an algorithm that extracts linear
motifs from protein-protein interaction datasets.	
  


                                          ALGORITHM	
                                                                                                                                             RESULTS	
  
1.	
  DATASET                                                                                                                              VALIDATION:	
  SEARCH	
  FOR	
  KNOWN	
  MOTIFS	
  
                                                                                              Protein
The algorithm takes as input the sequence of all the                                        under study                                    We have tested the ability of our algorithm to identify known functional linear motifs in
protein targets bound by the protein under study.                                                                                          sequence sets taken from the ELM database [6].
                                                                                                      Physically
The hypothesis is that any linear motif mediating                                                     interacts with                       Motif       14-3-3 type 1        Gamma-adaptin            Clathrin box        Mannosylation              CtBP                Dynein
the interaction will be overrepresented in the
sequence of these proteins.                                                                                                                                                  (DE)(DES)xF               L(ILM)x                                    Px(DEN)
                                                                                                          Several                           ELM       R(SFYW)xSxP                                                             WxxW                                    (QR)xTQT
                                                                                                                                                                            x(DE)(LVIMFD)            (ILMF)(DE)                                   L(VAST)
                                                                                                          Protein
The user also determines the length of the putative                                                       targets
                                                                                                                                          Dilimot         RSxSxP                DDxFxxF                  LIxLD                DGxW                DxPxDL                KxTQT
linear motif to be looked for, e.g., ten residues.
                                                                                                                                          Our
                                                                                                                                         method


2.	
  INPUT	
  FILTERS                                                                                                                     Our algorithm captures the known motif in six cases (top), suggesting significant sequence
                                                                                                                                           specificity in positions marked as “x” in the consensus. There is a partial match with the
1.  The presence of homologous proteins in the dataset would                                                                               known consensus in two cases (bottom left) and no match in three cases (bottom right).
    lead to spurious motif overrepresentation. We use the CD-                                                                              The performance is comparable to that of Dilimot [1], a similar software that describes
    HIT algorithm [2] to identify this kind of redundancy and                                                                              motifs as consensus sequences
    remove it from the input.
2.  Most functional linear motifs are located within disordered
                                                                                                                                            Motif            Integrin                TRAF6                  Motif          NR box                  EH1                   HP1
    protein domains [1]. Disordered regions are identified
    using the VSL software [3] and kept for analysis.                                                                                        ELM               RGD                     PxE                   ELM             LxLL          Fx(IV)xx(IL)(ILM)          PxVx(LM)

                                                                                                                                           Dilimot            RxDV                    PQE                  Dilimot        Not found              FxIxNI               KVPxVxL

3.	
  MOTIF	
  SEARCH                                input                                                                                  Our
                                                                                                                                           method
                                                                                                                                                                                                           Our
                                                                                                                                                                                                          method
                                                                                                                                                                                                                          Not found            Not found              Not found
                                                     Matrix M: sequences to be analyzed
Our software is an adaptation of a                   Integer L: motif length
method used for motif search in DNA
sequences [4], implemented in Python.                output
                                                                                                                                          CASE	
  STUDY:	
  NUCLEOLAR	
  LOCALIZATION	
  OF	
  MAGE	
  PROTEINS	
  
                                                     Matrix Res: All k-word alingments
It first calculates all possible alignments
of two k-words in the dataset.                       Algorithm                                                                            The MAGE (melanoma-associated antigen) family of proteins are plausible targets for
                                                                                                                                          anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2
Next, we offer all possible k-words to  {                                                                                                 protein is observed in both the nucleus and the nucleolus.
each growing alignment and incorporate                    M’ = ObtainAllKWords(M)
the one resulting in the highest score.                   Res = CreateAlignmentsOfTwoKWords (M’)
                                                                                                                                          Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar
                                                          While (Res) has changed
                                                          {
                                                                                                                                          proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (red) but not of
We repeat this procedure until                               CurrentKWordss = ObtainAllKWords (M)                                         MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus.
incorporation of new k-words does not                        For all alignments A in Res                                                                                                                                                            Truncated MAGE-B2-GFP
increase the score of any alignment.                         {                                                                                                                     GFP-MAGE-A2                       GFP-MAGE-B2
                                                                AddBestKword (A, CurrentKwords)
Last, we sort the alignments by their                        }
                                                          }
scores. The sorted list is the output of                  SortByScore (Res)
the search.                                               Print Res
                                                     }




4.	
  MOTIF	
  SCORING
                                                                                                                                             Transfected U2Os cells.
We use the information content [5] of each alignment to quantify the overrepresentation of                                                   Green: GFP tag, blue: DAPI.
the motif contained in each sequence alignment.                                                                                              Magnification 100x.

The uncertainty at a position of the alignment is:                               H(l) = -Σ f(aa,l) log2 f(aa,l) (bits)

The information content at a position is the decrease in
uncertainty between a random sequence and the                                                                                              CONCLUDING	
  REMARKS	
  
observed sequences, with a correction e(n) for the                               Rsequence(l) = log220 +
sampling of a finite number of sequences:                                        Σ f(aa,l) log2 f(aa,l)-e(n) (bits)                        •  We have implemented an algorithm for the discovery of novel protein
                                                                                                                                              functional motifs within sets of unaligned sequences.
The information content of an alignment is the sum over
all positions:                                                                    Rsequence = Rsequence(l) (bits)                          •  The algorithm shows good performance in the recovery of known motifs.
                                                                                                                                           •  We propose a putative motif responsible for localization of MAGE proteins
                                                                                                                                              in the nucleolus.
5.	
  OUTPUT	
  
                                                                                                                                         REFERENCES	
  
                                                                                                                                         [1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biology 2005, 3:e405.
We measure the similarity between two motifs as the Pearson correlation coefficient R                                                    [2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682.
                                                                                                                                         [3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182.
between the corresponding amino acid frequencies. The group alignments above the                                                         [4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A. 1989, 86:1183-1187.
                                                                                                                                         [5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990 Oct 25;18(20):6097-100.
desired value of R.                                                                                                                      [6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 2010 Jan;38(Database issue):D167-80.
Finally, we use sequence logos [4] to picture the motifs in the highest scoring alignments.                                              [7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer, 2005, 5: 615-625
                                                                                                                                         [8] Emmot E, Hiscox JA Nucleolar targeting: the hub of the matter. EMBO Rel 2009 10(3):231-8.
                                                                                                                                         [9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res. 2010 Nov 1;38(21):7388-99.

More Related Content

Similar to Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and Information Theory

The chaotic structure of
The chaotic structure ofThe chaotic structure of
The chaotic structure ofcsandit
 
The Chaotic Structure of Bacterial Virulence Protein Sequences
The Chaotic Structure of Bacterial Virulence Protein SequencesThe Chaotic Structure of Bacterial Virulence Protein Sequences
The Chaotic Structure of Bacterial Virulence Protein Sequencescsandit
 
Emerging Approach to Computing Techniques.pptx
Emerging Approach to Computing Techniques.pptxEmerging Approach to Computing Techniques.pptx
Emerging Approach to Computing Techniques.pptxPoonamKumarSharma
 
A general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernelsA general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernelsinfopapers
 
A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...CSCJournals
 
5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological data5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological dataKrish_ver2
 
Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalizationKamal Bhatt
 
2 partners ed_kickoff_dtai
2 partners ed_kickoff_dtai2 partners ed_kickoff_dtai
2 partners ed_kickoff_dtaiSirris
 
Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...eSAT Journals
 
Comparative analysis of dynamic programming
Comparative analysis of dynamic programmingComparative analysis of dynamic programming
Comparative analysis of dynamic programmingeSAT Publishing House
 
chapter10
chapter10chapter10
chapter10butest
 
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTGENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTcscpconf
 
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedJonathan Eisen
 
A neuro fuzzy decision support system
A neuro fuzzy decision support systemA neuro fuzzy decision support system
A neuro fuzzy decision support systemR A Akerkar
 
Unison: Enabling easy, rapid, and comprehensive proteomic mining
Unison: Enabling easy, rapid, and comprehensive proteomic miningUnison: Enabling easy, rapid, and comprehensive proteomic mining
Unison: Enabling easy, rapid, and comprehensive proteomic miningReece Hart
 
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM Model
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM ModelCrimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM Model
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM ModelCrimsonPublishers-SBB
 
Motif Finding.pdf
Motif Finding.pdfMotif Finding.pdf
Motif Finding.pdfShimoFcis
 
Prediction of protein function
Prediction of protein functionPrediction of protein function
Prediction of protein functionLars Juhl Jensen
 
Capital market applications of neural networks etc
Capital market applications of neural networks etcCapital market applications of neural networks etc
Capital market applications of neural networks etc23tino
 

Similar to Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and Information Theory (20)

The chaotic structure of
The chaotic structure ofThe chaotic structure of
The chaotic structure of
 
The Chaotic Structure of Bacterial Virulence Protein Sequences
The Chaotic Structure of Bacterial Virulence Protein SequencesThe Chaotic Structure of Bacterial Virulence Protein Sequences
The Chaotic Structure of Bacterial Virulence Protein Sequences
 
Emerging Approach to Computing Techniques.pptx
Emerging Approach to Computing Techniques.pptxEmerging Approach to Computing Techniques.pptx
Emerging Approach to Computing Techniques.pptx
 
A general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernelsA general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernels
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 
A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...A Biological Sequence Compression Based on cross chromosomal similarities usi...
A Biological Sequence Compression Based on cross chromosomal similarities usi...
 
5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological data5.4 mining sequence patterns in biological data
5.4 mining sequence patterns in biological data
 
Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalization
 
2 partners ed_kickoff_dtai
2 partners ed_kickoff_dtai2 partners ed_kickoff_dtai
2 partners ed_kickoff_dtai
 
Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...Comparative analysis of dynamic programming algorithms to find similarity in ...
Comparative analysis of dynamic programming algorithms to find similarity in ...
 
Comparative analysis of dynamic programming
Comparative analysis of dynamic programmingComparative analysis of dynamic programming
Comparative analysis of dynamic programming
 
chapter10
chapter10chapter10
chapter10
 
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXTGENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
GENERAL REGRESSION NEURAL NETWORK BASED POS TAGGING FOR NEPALI TEXT
 
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from UnculturedMicrobial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
Microbial Phylogenomics (EVE161) Class 17: Genomes from Uncultured
 
A neuro fuzzy decision support system
A neuro fuzzy decision support systemA neuro fuzzy decision support system
A neuro fuzzy decision support system
 
Unison: Enabling easy, rapid, and comprehensive proteomic mining
Unison: Enabling easy, rapid, and comprehensive proteomic miningUnison: Enabling easy, rapid, and comprehensive proteomic mining
Unison: Enabling easy, rapid, and comprehensive proteomic mining
 
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM Model
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM ModelCrimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM Model
Crimson Publishers-Predicting Protein Transmembrane Regionsby Using LSTM Model
 
Motif Finding.pdf
Motif Finding.pdfMotif Finding.pdf
Motif Finding.pdf
 
Prediction of protein function
Prediction of protein functionPrediction of protein function
Prediction of protein function
 
Capital market applications of neural networks etc
Capital market applications of neural networks etcCapital market applications of neural networks etc
Capital market applications of neural networks etc
 

More from Asociación Argentina de Bioinformática y Biología Computacional

More from Asociación Argentina de Bioinformática y Biología Computacional (16)

About using new descriptors for cheminformatics
About using new descriptors for cheminformaticsAbout using new descriptors for cheminformatics
About using new descriptors for cheminformatics
 
La Unidad de Bioinformática del INTA
La Unidad de Bioinformática del INTALa Unidad de Bioinformática del INTA
La Unidad de Bioinformática del INTA
 
Structural Order and Disorder Dictate Sequence And Functional Evolution of th...
Structural Order and Disorder Dictate Sequence And Functional Evolution of th...Structural Order and Disorder Dictate Sequence And Functional Evolution of th...
Structural Order and Disorder Dictate Sequence And Functional Evolution of th...
 
Cooperatividad en la Expresión Génica: Abordaje Estocástico
Cooperatividad en la Expresión Génica: Abordaje EstocásticoCooperatividad en la Expresión Génica: Abordaje Estocástico
Cooperatividad en la Expresión Génica: Abordaje Estocástico
 
Prediction of heparin binding sites on GAPDH
Prediction of heparin binding sites on GAPDHPrediction of heparin binding sites on GAPDH
Prediction of heparin binding sites on GAPDH
 
Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
Signals of Evolution: Conservation, Specificity Determining Positions and Coe...Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
 
Predicting peptide/MHC interactions: Application to epitope identification an...
Predicting peptide/MHC interactions: Application to epitope identification an...Predicting peptide/MHC interactions: Application to epitope identification an...
Predicting peptide/MHC interactions: Application to epitope identification an...
 
Design of degenerated primers from bioinformatics online software for putativ...
Design of degenerated primers from bioinformatics online software for putativ...Design of degenerated primers from bioinformatics online software for putativ...
Design of degenerated primers from bioinformatics online software for putativ...
 
A structure-function analysis of s HSPs in plants
A structure-function analysis of s HSPs in plantsA structure-function analysis of s HSPs in plants
A structure-function analysis of s HSPs in plants
 
Modelado de la proteína p35 de toxoplasma gondii
Modelado de la proteína p35 de toxoplasma gondiiModelado de la proteína p35 de toxoplasma gondii
Modelado de la proteína p35 de toxoplasma gondii
 
Data balancing for phenotype classification based on SNPs
Data balancing for phenotype classification based on SNPsData balancing for phenotype classification based on SNPs
Data balancing for phenotype classification based on SNPs
 
Gene selection via significant subset using silhouette index
Gene selection via significant subset using silhouette indexGene selection via significant subset using silhouette index
Gene selection via significant subset using silhouette index
 
Bolstered error estimation for discrete classifier applied to genomic signal ...
Bolstered error estimation for discrete classifier applied to genomic signal ...Bolstered error estimation for discrete classifier applied to genomic signal ...
Bolstered error estimation for discrete classifier applied to genomic signal ...
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
¿Cuál es la estabilidad relevante de las proteínas?
¿Cuál es la estabilidad relevante de las proteínas?¿Cuál es la estabilidad relevante de las proteínas?
¿Cuál es la estabilidad relevante de las proteínas?
 
Biogeografía histórica y Análisis de Vicarianza: Una perspectiva computacional
Biogeografía histórica y Análisis de Vicarianza: Una perspectiva computacionalBiogeografía histórica y Análisis de Vicarianza: Una perspectiva computacional
Biogeografía histórica y Análisis de Vicarianza: Una perspectiva computacional
 

Recently uploaded

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 

Recently uploaded (20)

Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 

Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and Information Theory

  • 1. DISCOVERY  OF  FUNCTIONAL  PROTEIN  LINEAR  MOTIFS   USING  A  GREEDY  ALGORITHM  AND  INFORMATION  THEORY   LEANDRO  G.  RADUSKY§,  JULIANA  GLAVINA§,  MARIA  FATIMA  LADELFA¶,  MARTIN  MONTE¶     AND  IGNACIO  E.  SANCHEZ§   §PROTEIN  PHYSIOLOGY  LABORATORY,  DEPARTAMENTO  DE  QUIMICA  BIOLOGICA,  FACULTAD  DE  CIENCIAS  EXACTAS  Y  NATURALES-­‐UNIVERSIDAD  DE  BUENOS  AIRES,  ARGENTINA  ¶MOLECULAR   AND  CELL  BIOLOGY  LABORATORY,  DEPARTAMENTO  DE  QUIMICA  BIOLOGICA,  FACULTAD  DE  CIENCIAS  EXACTAS  Y  NATURALES-­‐UNIVERSIDAD  DE  BUENOS  AIRES,  ARGENTINA  .     INTRODUCTION   The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called “linear motifs” [1]. It is likely that recognition of yet unknown linear motifs lies behind many protein-protein complexes of biological interest. We present an algorithm that extracts linear motifs from protein-protein interaction datasets.   ALGORITHM   RESULTS   1.  DATASET VALIDATION:  SEARCH  FOR  KNOWN  MOTIFS   Protein The algorithm takes as input the sequence of all the under study We have tested the ability of our algorithm to identify known functional linear motifs in protein targets bound by the protein under study. sequence sets taken from the ELM database [6]. Physically The hypothesis is that any linear motif mediating interacts with Motif 14-3-3 type 1 Gamma-adaptin Clathrin box Mannosylation CtBP Dynein the interaction will be overrepresented in the sequence of these proteins. (DE)(DES)xF L(ILM)x Px(DEN) Several ELM R(SFYW)xSxP WxxW (QR)xTQT x(DE)(LVIMFD) (ILMF)(DE) L(VAST) Protein The user also determines the length of the putative targets Dilimot RSxSxP DDxFxxF LIxLD DGxW DxPxDL KxTQT linear motif to be looked for, e.g., ten residues. Our method 2.  INPUT  FILTERS Our algorithm captures the known motif in six cases (top), suggesting significant sequence specificity in positions marked as “x” in the consensus. There is a partial match with the 1.  The presence of homologous proteins in the dataset would known consensus in two cases (bottom left) and no match in three cases (bottom right). lead to spurious motif overrepresentation. We use the CD- The performance is comparable to that of Dilimot [1], a similar software that describes HIT algorithm [2] to identify this kind of redundancy and motifs as consensus sequences remove it from the input. 2.  Most functional linear motifs are located within disordered Motif Integrin TRAF6 Motif NR box EH1 HP1 protein domains [1]. Disordered regions are identified using the VSL software [3] and kept for analysis. ELM RGD PxE ELM LxLL Fx(IV)xx(IL)(ILM) PxVx(LM) Dilimot RxDV PQE Dilimot Not found FxIxNI KVPxVxL 3.  MOTIF  SEARCH input Our method Our method Not found Not found Not found Matrix M: sequences to be analyzed Our software is an adaptation of a Integer L: motif length method used for motif search in DNA sequences [4], implemented in Python. output CASE  STUDY:  NUCLEOLAR  LOCALIZATION  OF  MAGE  PROTEINS   Matrix Res: All k-word alingments It first calculates all possible alignments of two k-words in the dataset. Algorithm The MAGE (melanoma-associated antigen) family of proteins are plausible targets for anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2 Next, we offer all possible k-words to { protein is observed in both the nucleus and the nucleolus. each growing alignment and incorporate M’ = ObtainAllKWords(M) the one resulting in the highest score. Res = CreateAlignmentsOfTwoKWords (M’) Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar While (Res) has changed { proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (red) but not of We repeat this procedure until CurrentKWordss = ObtainAllKWords (M) MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus. incorporation of new k-words does not For all alignments A in Res Truncated MAGE-B2-GFP increase the score of any alignment. { GFP-MAGE-A2 GFP-MAGE-B2 AddBestKword (A, CurrentKwords) Last, we sort the alignments by their } } scores. The sorted list is the output of SortByScore (Res) the search. Print Res } 4.  MOTIF  SCORING Transfected U2Os cells. We use the information content [5] of each alignment to quantify the overrepresentation of Green: GFP tag, blue: DAPI. the motif contained in each sequence alignment. Magnification 100x. The uncertainty at a position of the alignment is: H(l) = -Σ f(aa,l) log2 f(aa,l) (bits) The information content at a position is the decrease in uncertainty between a random sequence and the CONCLUDING  REMARKS   observed sequences, with a correction e(n) for the Rsequence(l) = log220 + sampling of a finite number of sequences: Σ f(aa,l) log2 f(aa,l)-e(n) (bits) •  We have implemented an algorithm for the discovery of novel protein functional motifs within sets of unaligned sequences. The information content of an alignment is the sum over all positions: Rsequence = Rsequence(l) (bits) •  The algorithm shows good performance in the recovery of known motifs. •  We propose a putative motif responsible for localization of MAGE proteins in the nucleolus. 5.  OUTPUT   REFERENCES   [1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biology 2005, 3:e405. We measure the similarity between two motifs as the Pearson correlation coefficient R [2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682. [3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182. between the corresponding amino acid frequencies. The group alignments above the [4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A. 1989, 86:1183-1187. [5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990 Oct 25;18(20):6097-100. desired value of R. [6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 2010 Jan;38(Database issue):D167-80. Finally, we use sequence logos [4] to picture the motifs in the highest scoring alignments. [7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer, 2005, 5: 615-625 [8] Emmot E, Hiscox JA Nucleolar targeting: the hub of the matter. EMBO Rel 2009 10(3):231-8. [9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res. 2010 Nov 1;38(21):7388-99.