The algorithm discovers novel functional linear motifs within sets of unaligned protein sequences using a greedy approach. It identifies overrepresented short sequences or "motifs" that may mediate protein-protein interactions. The algorithm is tested on known motifs from databases, showing it can correctly identify several known motifs. As a case study, the algorithm extracts a putative nucleolar localization motif present in nucleolar proteins including the N-terminus of protein MAGE-B2, explaining its nucleolar localization.
Discovery of Functional Protein Linear Motifs Using a Greedy Algorithm and Information Theory
Leandro G. Radusky§, Juliana Glavina§, Maria Fatima Ladelfa¶, Martin Monte¶ and Ignacio E. Sanchez§
§Protein Physiology Laboratory, Departamento de Quimica Biologica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina
¶Molecular and Cell Biology Laboratory, Departamento de Quimica Biologica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires, Argentina
INTRODUCTION
The molecular basis of many protein-protein interactions reported in the literature is unknown, especially for those observed in high-throughput studies [1]. Many
globular domains bind in a specific manner to short (5-15 residues) sequences embedded within intrinsically disordered regions, the so-called “linear motifs” [1]. It is
likely that recognition of yet unknown linear motifs lies behind many protein-protein complexes of biological interest. We present an algorithm that extracts linear
motifs from protein-protein interaction datasets.
RESULTS

VALIDATION: SEARCH FOR KNOWN MOTIFS
We have tested the ability of our algorithm to identify known functional linear motifs in sequence sets taken from the ELM database [6].

Motif           ELM                        Dilimot     Our method
14-3-3 type 1   R(SFYW)xSxP                RSxSxP      Match
Gamma-adaptin   (DE)(DES)xFx(DE)(LVIMFD)   DDxFxxF     Match
Clathrin box    L(ILM)x(ILMF)(DE)          LIxLD       Match
Mannosylation   WxxW                       DGxW        Match
CtBP            Px(DEN)L(VAST)             DxPxDL      Match
Dynein          (QR)xTQT                   KxTQT       Match
Integrin        RGD                        RxDV        Partial match
TRAF6           PxE                        PQE         Partial match
NR box          LxLL                       Not found   Not found
EH1             Fx(IV)xx(IL)(ILM)          FxIxNI      Not found
HP1             PxVx(LM)                   KVPxVxL     Not found

The motifs found by our method were displayed as sequence logos on the poster.

Our algorithm captures the known motif in six cases (14-3-3 type 1, Gamma-adaptin, Clathrin box, Mannosylation, CtBP and Dynein), suggesting significant sequence specificity in positions marked as “x” in the consensus. There is a partial match with the known consensus in two cases (Integrin and TRAF6) and no match in three cases (NR box, EH1 and HP1). The performance is comparable to that of Dilimot [1], a similar program that describes motifs as consensus sequences.

CASE STUDY: NUCLEOLAR LOCALIZATION OF MAGE PROTEINS
The MAGE (melanoma-associated antigen) family of proteins are plausible targets for anticancer therapy [7]. The MAGE-A2 protein localizes to the nucleus, while the MAGE-B2 protein is observed in both the nucleus and the nucleolus.
Our algorithm extracted a putative nucleolar localization motif from a database of nucleolar proteins [8,9]. The motif matches the Lys/Arg-rich N-terminus of MAGE-B2 (shown in red in the poster figure) but not that of MAGE-A2. A truncated MAGE-B2 variant that retains the motif localizes to the nucleolus.
[Figure panels: GFP-MAGE-A2, GFP-MAGE-B2, truncated MAGE-B2-GFP. Transfected U2OS cells. Green: GFP tag; blue: DAPI. Magnification 100x.]

CONCLUDING REMARKS
• We have implemented an algorithm for the discovery of novel protein functional motifs within sets of unaligned sequences.
• The algorithm shows good performance in the recovery of known motifs.
• We propose a putative motif responsible for localization of MAGE proteins in the nucleolus.

ALGORITHM

1. DATASET
The algorithm takes as input the sequences of all the protein targets bound by the protein under study.
The hypothesis is that any linear motif mediating the interaction will be overrepresented in the sequences of these proteins.
The user also determines the length of the putative linear motif to be looked for, e.g., ten residues.
[Diagram: the protein under study physically interacts with several protein targets.]

2. INPUT FILTERS
1. The presence of homologous proteins in the dataset would lead to spurious motif overrepresentation. We use the CD-HIT algorithm [2] to identify this kind of redundancy and remove it from the input.
2. Most functional linear motifs are located within disordered protein regions [1]. Disordered regions are identified using the VSL software [3] and kept for analysis.

3. MOTIF SEARCH
Our software is an adaptation of a method used for motif search in DNA sequences [4], implemented in Python.
It first calculates all possible alignments of two k-words (words of length k = L) in the dataset.
Next, we offer all possible k-words to each growing alignment and incorporate the one resulting in the highest score.
We repeat this procedure until incorporation of new k-words does not increase the score of any alignment.
Last, we sort the alignments by their scores. The sorted list is the output of the search.

    Input
        Matrix M: sequences to be analyzed
        Integer L: motif length
    Output
        Matrix Res: all k-word alignments
    Algorithm
    {
        M' = ObtainAllKWords(M, L)
        Res = CreateAlignmentsOfTwoKWords(M')
        While Res has changed
        {
            CurrentKWords = ObtainAllKWords(M, L)
            For all alignments A in Res
            {
                AddBestKWord(A, CurrentKWords)
            }
        }
        SortByScore(Res)
        Print Res
    }

4. MOTIF SCORING
We use the information content [5] of each alignment to quantify the overrepresentation of the motif contained in each sequence alignment.
The uncertainty at a position l of the alignment, where f(aa,l) is the frequency of amino acid aa at that position, is:
$$H(l) = -\sum_{aa} f(aa,l) \log_2 f(aa,l) \quad \text{(bits)}$$
The information content at a position is the decrease in uncertainty between a random sequence and the observed sequences, with a correction e(n) for the sampling of a finite number n of sequences:
$$R_{sequence}(l) = \log_2 20 + \sum_{aa} f(aa,l) \log_2 f(aa,l) - e(n) \quad \text{(bits)}$$
The information content of an alignment is the sum over all positions:
$$R_{sequence} = \sum_{l} R_{sequence}(l) \quad \text{(bits)}$$
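The motif search and motif scoring steps can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and several details are assumptions the poster does not state: seed alignments pair k-words from two different sequences, each sequence contributes at most one k-word to an alignment, and e(n) uses the common approximation (20 - 1) / (2 ln 2 · n), which is rough for small n.

```python
import math
from collections import Counter
from itertools import combinations

ALPHABET_SIZE = 20  # number of amino acid types


def k_words(sequences, k):
    """Return every (sequence index, k-word) pair found in the dataset."""
    return [(i, s[j:j + k])
            for i, s in enumerate(sequences)
            for j in range(len(s) - k + 1)]


def small_sample_correction(n):
    """ASSUMPTION: e(n) ~ (20 - 1) / (2 ln2 n); the poster gives no formula."""
    return (ALPHABET_SIZE - 1) / (2 * math.log(2) * n)


def information_content(words):
    """R_sequence: sum over positions of log2(20) - H(l) - e(n), in bits."""
    n = len(words)
    total = 0.0
    for column in zip(*words):  # one column per alignment position
        counts = Counter(column)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += math.log2(ALPHABET_SIZE) - h - small_sample_correction(n)
    return total


def greedy_motif_search(sequences, k):
    """Grow every two-word alignment greedily; return (score, words) sorted."""
    words = k_words(sequences, k)
    # ASSUMPTION: seed alignments pair k-words from two different sequences.
    alignments = [[a, b] for a, b in combinations(words, 2) if a[0] != b[0]]
    changed = True
    while changed:
        changed = False
        for alignment in alignments:
            used = {i for i, _ in alignment}
            members = [w for _, w in alignment]
            best, best_score = None, information_content(members)
            for candidate in words:
                # ASSUMPTION: at most one k-word per sequence per alignment.
                if candidate[0] in used:
                    continue
                score = information_content(members + [candidate[1]])
                if score > best_score:
                    best, best_score = candidate, score
            if best is not None:  # only add a word that raises the score
                alignment.append(best)
                changed = True
    scored = [(information_content([w for _, w in a]), [w for _, w in a])
              for a in alignments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    toy = ["AAKRPWG", "CDKRPWE", "FKRPWHI", "LMNKRPW"]  # made-up sequences
    top_score, top_words = greedy_motif_search(toy, 4)[0]
    print(top_words, round(top_score, 3))
```

The sketch enumerates every pair of k-words, so its cost grows quadratically with the number of k-words. It is intended for toy inputs such as the made-up sequences in the example, not for full interaction datasets.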
5. OUTPUT
We measure the similarity between two motifs as the Pearson correlation coefficient R between the corresponding amino acid frequencies. We group alignments whose similarity is above the desired value of R.
Finally, we use sequence logos [5] to picture the motifs in the highest scoring alignments.

REFERENCES
[1] Neduva V et al. Systematic discovery of new recognition peptides mediating protein interaction networks. PLoS Biol. 2005, 3:e405.
[2] Huang Y et al. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010, 26:680-682.
[3] Obradovic Z et al. Exploiting heterogeneous sequence properties improves prediction of protein disorder. Proteins 2005, 61:S176-182.
[4] Stormo GD, Hartzell GW 3rd. Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci U S A 1989, 86:1183-1187.
[5] Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18:6097-6100.
[6] Gould CM et al. ELM: the status of the 2010 eukaryotic linear motif resource. Nucleic Acids Res. 2010, 38:D167-D180.
[7] Simpson AJ et al. Cancer/testis antigens, gametogenesis and cancer. Nat Rev Cancer 2005, 5:615-625.
[8] Emmott E, Hiscox JA. Nucleolar targeting: the hub of the matter. EMBO Rep. 2009, 10:231-238.
[9] Scott MS et al. Characterization and prediction of protein nucleolar localization sequences. Nucleic Acids Res. 2010, 38:7388-7399.
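The motif comparison used in the output step can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code. The function names are hypothetical. The two motifs are assumed to have the same length and to be compared position by position. The single-pass grouping rule is also an assumption, since the poster only says that alignments above a chosen value of R are grouped.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def frequency_vector(words):
    """Flatten the per-position amino acid frequencies f(aa, l) into a list."""
    n = len(words)
    vector = []
    for column in zip(*words):  # one column per alignment position
        counts = Counter(column)
        vector.extend(counts[aa] / n for aa in AMINO_ACIDS)
    return vector


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.

    Undefined (division by zero) if either vector is constant.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def motif_similarity(words_a, words_b):
    """R between two motifs; ASSUMPTION: same length, aligned position-wise."""
    return pearson(frequency_vector(words_a), frequency_vector(words_b))


def group_alignments(alignments, threshold):
    """ASSUMPTION: single pass; each alignment joins the first group whose
    first member it resembles with R >= threshold, else starts a new group."""
    groups = []
    for words in alignments:
        for group in groups:
            if motif_similarity(group[0], words) >= threshold:
                group.append(words)
                break
        else:
            groups.append([words])
    return groups
```

Passing the alignments in descending score order makes the first member of each group its highest-scoring alignment, which is then a natural candidate for the sequence logo.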