Background / Purpose:
Automatically constructing an accurate protein structure alignment remains challenging, especially when the proteins to be aligned are distantly related.
Main conclusion:
We present a novel method, DeepAlign, which aligns two protein structures using not only spatial proximity of equivalent residues (after rigid-body superposition), but also evolutionary distance and hydrogen bonding similarity.
Protein structure alignment beyond spatial proximity 3 dsig_2012
1. Protein structure alignment
beyond spatial proximity
3DSIG 2012
Jul 14, Long Beach, California
Sheng WANG
Toyota Technological Institute at Chicago
2. Related works on Pairwise Structure Alignment
1 Almost all the structure alignment tools
2 TMalign, fr-TMalign
3 DALI, MUSTANG
4 MAMMOTH, Vorolign, YAKUSA
5 FATCAT, CE, MATT, FlexProt
Note: for all proteins we align, only the C-alpha atoms are considered
3. Our contribution
Design a scoring function
• local sub-structure similarity
• evolutionary and functional information
• angular similarity for hydrogen bonding
Employ a fast and efficient search algorithm
• start from highly similar local sub-structure pairs (SFPs)
• recruit new SFPs that satisfy spatial constraints
• finally refine the alignment within a bound
4. Scoring Function
local similarity: max(0,BLOSUM(i,j))+CLESUM(i,j); global similarity: v(i,j)*d(i,j)
CLESUM is the local structure substitution matrix;
BLOSUM is the amino acid substitution matrix;
v(i,j) measures the angular similarity using three vectors;
d(i,j) measures the spatial proximity of two aligned residues.
Note: both v(i,j) and d(i,j) are calculated after rigid-body superposition.
Score(i,j)=( max(0,BLOSUM(i,j) )+CLESUM(i,j) )*v(i,j)*d(i,j)
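The per-residue score on this slide can be sketched in a few lines of Python. This is only an illustration of the formula as written above: the function name and argument names are hypothetical, and the BLOSUM and CLESUM entries, v(i,j) and d(i,j) are assumed to have been looked up or computed elsewhere (after rigid-body superposition for the last two).

```python
def deepalign_pair_score(blosum_ij, clesum_ij, v_ij, d_ij):
    """Score(i,j) = (max(0, BLOSUM(i,j)) + CLESUM(i,j)) * v(i,j) * d(i,j).

    All four inputs are plain numbers supplied by the caller; this sketch
    does not contain the substitution matrices themselves.
    """
    # Local part: negative BLOSUM entries are clipped to 0, CLESUM is added as is.
    local_similarity = max(0, blosum_ij) + clesum_ij
    # Global part: angular similarity times spatial proximity.
    global_similarity = v_ij * d_ij
    return local_similarity * global_similarity


# Made-up numbers, for illustration only.
print(deepalign_pair_score(blosum_ij=-2, clesum_ij=3, v_ij=0.5, d_ij=0.5))
```

Because the score is a product, a residue pair that is locally similar but badly oriented (small or negative v) or far apart (small d) contributes little or negatively.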
12. SFP_long
From TopK coarse-grained to TopJ fine-grained initial alignment
Example: TopK = 5; TopJ = 1
[Figure: the five top-ranked SFP_long, labeled by score rank 1 to 5. Superposition on the Top2 SFP: # of consistent SFPs = 4. Superposition on the Top1 SFP: # of consistent SFPs = 1.]
Top2 SFP is globally supported by three other SFPs,
while Top1 SFP is supported only by itself.
13. Refine each fine-grained initial alignment by three iterations
[Figure: SFP_short sorted by score rank (high -> low) are added in a First Update, Second Update and Third Update, using distance cutoffs d1 > d2 > d3, followed by Final refinement and the Output Alignment.]
14. Final refinement on DeepAlign-score only in bounded area
(1) refined fine-grained alignment
(2) bounded area upon the alignment
(3) dynamic programming to find a path with maximal DeepAlign-score within bounded area
15. Result on manually-curated data
• CDD (Conserved Domain Database): contains 3591
conserved domain structure alignments.
• MALIDUP: contains 241 alignments for homologous
domains originated from internal duplication.
• MALISAM: contains 130 alignments for structurally
analogous motifs in proteins.
16. Result on discrimination data
• We use SABmark to test the ability to identify distant
homologs (super-family) and structural analogs (fold)
among negative data (with no structural similarity)
[Figure: two panels, super-family and fold, with DeepAlign marked in each]
17. One example
Superimposition of domain d1pqsa_ and d1poh__ from MALISAM.
(A) TMalign: TMscore 0.288
(B) DeepAlign optimizing TM-score: TMscore 0.514
(C) DeepAlign: TMscore 0.473
18. Thank you !!
Please find the executable program of DeepAlign at:
http://ttic.uchicago.edu/~jinbo/DeepAlign/DeepAlign_exe_V1.00.tar.gz
Editor's notes
Many pairwise structure alignment tools are currently available. RMSD and the number of aligned residue pairs (Ne) can be regarded as the universal measures shared by all of them. However, scoring with RMSD has a drawback: if just a few aligned pairs have a very large local distance, the overall RMSD becomes large even though all the other aligned pairs are very close. In fact, each method is distinguished by its own scoring function and by the search algorithm that aims to maximize that score. Take TMalign as an example: it puts the distance d_i in the denominator, which addresses the RMSD drawback by lowering the contribution of outlier pairs while emphasizing well-aligned pairs. Still, the currently popular scoring functions mostly consider geometric distance and neglect other important measures, such as the evolutionary or functional relationship between the aligned residue pairs.
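The outlier effect can be made concrete with a small numerical sketch. The distances below are made up for illustration, and the TM-score expression (including the d0 length normalization) is the standard published definition rather than something stated in these notes; the function names are hypothetical.

```python
import math


def rmsd(distances):
    """Root-mean-square deviation over the aligned residue pairs."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(distances, target_length):
    """Standard TM-score form: each pair contributes 1 / (1 + (d/d0)^2),
    so the distance sits in the denominator and outliers add almost nothing."""
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Hypothetical alignment of 100 residue pairs (distances in Angstrom).
all_close = [1.0] * 100
few_outliers = [1.0] * 95 + [20.0] * 5  # same alignment, 5 badly placed pairs

for name, dists in (("all close", all_close), ("few outliers", few_outliers)):
    print(name, "RMSD =", round(rmsd(dists), 3),
          "TM-score =", round(tm_score(dists, 100), 3))
```

In this toy case the five outliers inflate the RMSD several-fold, while the TM-score drops only slightly, which is the behavior described above.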
Our contribution in this work is twofold. First, we have designed a new scoring function that considers three things: (1) the similarity of the two proteins' local sub-structures; (2) the evolutionary as well as functional information shared by the two proteins; and (3) hydrogen bonding similarity, described with a vector-based score. Second, we employ a search algorithm that maximizes our scoring function while keeping the running time short. It starts from highly similar local sub-structure pairs (SFPs) selected with the local part of the score, then recruits new SFPs that satisfy the constraints of the global part of the score. Finally, a dynamic programming refinement is applied with the whole scoring function, but only within a bound, which makes this step only O(n) in time complexity.
It is very challenging to design a scoring function that captures all the criteria used by human experts, who align protein structures using not only geometric information but also evolutionary and functional information. Here we design a score for two corresponding residues i and j from two proteins. It is composed of two parts. The first part describes local similarity and combines a max function with 0 and two substitution matrices, BLOSUM and CLESUM. BLOSUM, as we all know, describes the similarity between two amino acids, while CLESUM is a substitution matrix that describes the similarity between two local structures. The second part describes global similarity, where global means that the score is calculated after rigid-body superposition. d(i,j) can be derived from a conventional measure of the spatial proximity of two aligned residues, such as RMSD or TM-score, while v(i,j) measures the angular similarity and is designed for hydrogen bonding comparison. In the following slides, we will discuss CLESUM and v(i,j) separately.
In our previous work, we clustered local protein structure fragments, each built from 4 consecutive CA atoms, into 17 motifs. These motifs are called CLEs (conformational letters), and each CLE is actually a distribution over three angles. Among the 17 CLEs, 4 are alpha-helix like, 4 are beta-sheet like, and 9 are coil states. Given a protein, it is straightforward to transform its 3D structure into a 1D CLE string: slide a window of four consecutive CA atoms along the chain and, for each window, check which CLE is most similar to it. It is worth noting that, among the 9 coil-state motifs, A appears mostly at the beginning of a helix and O at the end of a helix, while L appears at the beginning of a sheet and G at the end of a sheet. M is mostly 3-10 helix like. So these 5 states can be regarded as conserved coil states. On the other hand, Q is a state that varies a lot and appears in disordered or flexible loop regions. Given two proteins that are evolutionarily or functionally related, their corresponding residues should also be similar in their CLE motifs, especially in the conserved coil regions such as CLE A, O, L and G. So, just as BLOSUM serves as the amino acid substitution matrix, a substitution matrix that measures the relationship between two CLEs is also required.
CLESUM is such a matrix: a similarity measure for CLEs. It is constructed from the pairwise alignments of representative structures in the FSSP database. The entries of CLESUM are derived in the same way as those of BLOSUM. As we see from this matrix, typical helix and typical sheet states do not get a higher score, although their local geometric distance should be smaller. By contrast, the evolutionarily and functionally related regions, such as the two termini of helices and sheets (e.g., A, O), and the flexible regions (e.g., Q), do score higher. This makes CLESUM a proper measure of similarity between two CLE states, one that goes beyond spatial proximity to capture evolutionary and functional relationships.
Consider this alignment of two helices. If we only consider the local structure pattern, every residue is marked as "H". So if the alignment is shifted by one position, the two helices still align well under the CLESUM score. However, this may not be correct once we consider amino acid information. Measured with BLOSUM, the alignment shifted by one position is much worse than the correct one. This case shows that, to define a good scoring function for the evolutionary similarity of a local fragment pair, both CLESUM and BLOSUM should be considered.
Both CLESUM and BLOSUM measure the evolutionary similarity between two residues from two proteins. The more positive the value, the more similar the residues; the more negative, the less similar. First question: why do we use the max function? Because we only consider the evolutionarily conserved or functionally related residue pairs and neglect all the unrelated ones. Second question: why do we add the two matrices instead of multiplying them? If we assume that the CLESUM pair and the BLOSUM pair are independent, then, given their original log-odds form, the sum is again a log-odds score. This also explains why, if BLOSUM and CLESUM are on the same scale, there should be no need to weight either matrix. We may therefore conclude that the score max(0,CLESUM(i,j)+BLOSUM(i,j) ) can be used to rank the importance of SFPs between two proteins, considering not only their purely geometric similarity but also their evolutionary and functional relationship.
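The additivity argument can be written out as a short derivation. The symbols are assumptions introduced here for illustration: p_B and p_C are the frequencies with which a pair of amino acids (a,b) or a pair of CLEs (x,y) is observed aligned, q_B and q_C are the corresponding background frequencies, and lambda is a common scaling factor. Independence is taken to mean that both the aligned-pair and the background frequencies of the combined (amino acid, CLE) states factorize.

```latex
\begin{align*}
S_B(a,b) &= \lambda \log \frac{p_B(a,b)}{q_B(a)\,q_B(b)}, \qquad
S_C(x,y) = \lambda \log \frac{p_C(x,y)}{q_C(x)\,q_C(y)} \\[4pt]
S\big((a,x),(b,y)\big)
  &= \lambda \log \frac{p_B(a,b)\,p_C(x,y)}{q_B(a)\,q_C(x)\;q_B(b)\,q_C(y)} \\
  &= \lambda \log \frac{p_B(a,b)}{q_B(a)\,q_B(b)}
   + \lambda \log \frac{p_C(x,y)}{q_C(x)\,q_C(y)}
   = S_B(a,b) + S_C(x,y)
\end{align*}
```

The sum is itself a log-odds score only when both matrices share the same scaling factor, which is why no extra weight is needed under that condition.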
We use the angular similarity v(i,j) for the following reason. When aligning two superimposed beta-sheets, for example, minimizing only the geometric distance can produce an incorrect alignment. A human expert, however, would choose the correct alignment shown in (B), even though it has a larger RMSD value. This is because two beta-sheets that ought to be aligned should point in the same direction.
The vector score v(i,j) is designed to solve this problem. Consider the three vectors shown here: i -> i-1, i -> i+1 and i -> i_cb. If residues i and j point in the same direction, the vector score defined here has a large value; in the crossed case, which is incorrect, the score is very negative.
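The notes do not spell out the formula for v(i,j), so the following is only a sketch of one way such a score could behave, assuming it is the average cosine between the three corresponding unit vectors of residues i and j after superposition. The function names and the averaging are assumptions, not the DeepAlign definition.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def residue_vectors(prev_ca, ca, next_ca, cb):
    """Unit vectors i -> i-1, i -> i+1 and i -> i_cb for one residue."""
    return (_unit(_sub(prev_ca, ca)),
            _unit(_sub(next_ca, ca)),
            _unit(_sub(cb, ca)))


def angular_similarity(vectors_i, vectors_j):
    """Assumed form: mean dot product of the three matching unit vectors.
    Gives +1 for identical orientation and -1 for fully opposite orientation."""
    dots = [sum(x * y for x, y in zip(a, b))
            for a, b in zip(vectors_i, vectors_j)]
    return sum(dots) / 3.0


# Hypothetical coordinates: residue j either matches residue i or runs the other way.
res_i = residue_vectors((-1, 0, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0))
same = residue_vectors((-1, 0, 0), (0, 0, 0), (1, 0, 0), (0, 1, 0))
reverse = residue_vectors((1, 0, 0), (0, 0, 0), (-1, 0, 0), (0, -1, 0))
print(angular_similarity(res_i, same), angular_similarity(res_i, reverse))
```

Whatever the exact definition, the useful property is the sign change: two strands that overlap in space but run in opposite directions get a negative v, which turns their contribution to the product score negative.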
So far, we have a good scoring function for aligning two proteins. The next question is how to optimize this scoring function effectively without sacrificing running speed. Our strategy is to create two lists of Similar Fragment Pairs (SFPs): one with long fragments and a high similarity cutoff, the other with short fragments and a low similarity cutoff. Both SFP lists are sorted by score. We choose only the TopK SFP_long, which are used to construct the coarse-grained initial alignments. Then, based on a consistency-degree check, we select the TopJ coarse-grained initial alignments and turn each into a fine-grained initial alignment by enlarging its correspondence set with the SFP_short that pass the consistency check. After the final refinement, we return M solutions sorted by their DeepAlign-score. Since we add SFPs to the correspondence set in the order of their local similarity score while at the same time keeping them spatially consistent, this strategy ensures that we obtain an alignment with a sufficiently high DeepAlign-score. We have also found that using small values of TopK and TopJ does not affect the result much.
Suppose we set TopK=5 and TopJ=1, for example. Each coarse-grained initial alignment is obtained simply by superimposing the two structures according to the given SFP. Then, for each such superimposition, we check the degree of consistency of all the other SFPs. The example here shows that, although the Top1 SFP has a higher local similarity score than the Top2 SFP, its consistency degree is lower than that of Top2. So if only one fine-grained initial alignment is chosen, the alignment based on the Top1 SFP is omitted.
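The selection step described here can be sketched as follows. The function name and data layout are hypothetical, and the consistency counts are assumed to be given, since computing them needs the actual superposition. Only the counts for Top1 and Top2 come from the slide; the other three are made up.

```python
def select_initial_alignments(sfp_long, consistent_count, top_k, top_j):
    """Pick TopK SFP_long by local score, then keep the TopJ with the most
    consistent SFPs.

    sfp_long: list of (name, local_similarity_score)
    consistent_count: dict name -> number of consistent SFPs after
        superimposing on that SFP (assumed to be precomputed)
    """
    by_score = sorted(sfp_long, key=lambda s: s[1], reverse=True)[:top_k]
    # Python's sort is stable, so ties in consistency keep the score order.
    by_support = sorted(by_score, key=lambda s: consistent_count[s[0]],
                        reverse=True)
    return [name for name, _ in by_support[:top_j]]


# Scores and the counts for Top3..Top5 are invented for this example.
sfps = [("Top1", 9.0), ("Top2", 8.5), ("Top3", 8.0), ("Top4", 7.5),
        ("Top5", 7.0), ("Top6", 6.5)]
counts = {"Top1": 1, "Top2": 4, "Top3": 3, "Top4": 3, "Top5": 3, "Top6": 5}
print(select_initial_alignments(sfps, counts, top_k=5, top_j=1))
```

With TopK=5 and TopJ=1 the Top2 SFP is kept and Top1 is dropped, as in the slide. Top6 is never considered, however well supported, because it falls outside the TopK cut.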
Given one fine-grained initial alignment, we gradually add SFP_short to the correspondence set according to their score rank. At the earlier updates, the superimposition is determined by only a small set of correspondences, so we use a higher spatial-consistency (distance) cutoff in order to admit the highly similar SFPs. At the later updates, many SFPs have already been added, so we lower the distance cutoff. This procedure ensures that the total DeepAlign-score keeps increasing over the iterations.
The final refinement step on the DeepAlign-score is carried out by running dynamic programming on the L1*L2 matrix whose entry (i,j) is the DeepAlign-score(i,j) under the refined fine-grained superimposition. This procedure is very similar to the final step of CE, ProSup and TMalign. However, one improvement in our method is the bounded area around the initial alignment: since the alignment from the previous steps is already accurate enough, we only need to consider a small band around it. This reduces the time complexity from O(n^2) to about O(n), which makes the algorithm fast compared with current methods.
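A bounded dynamic programming step of this kind can be sketched as below. This is a generic banded alignment without gap penalties, not the DeepAlign implementation: the function name, the `center` list (the column around which each row's band is placed, taken from the initial alignment) and the `band` half-width are all hypothetical. Only cells inside the band are visited, so the work is about n times the band width instead of n^2.

```python
def banded_alignment(score, center, band):
    """Maximize the summed score of aligned pairs, visiting only cells (i, j)
    with |j - center[i]| <= band. Returns (best_score, aligned_pairs)."""
    n_rows, n_cols = len(score), len(score[0])
    value = {}  # (i, j) -> best score using rows <= i, cols <= j inside the band
    move = {}   # (i, j) -> (pair (i, j) aligned?, predecessor cell or None)
    best_cell = None
    for i in range(n_rows):
        lo = max(0, center[i] - band)
        hi = min(n_cols - 1, center[i] + band)
        for j in range(lo, hi + 1):
            diag = (i - 1, j - 1)
            # Option A: align i with j on top of the diagonal predecessor.
            best = score[i][j] + value.get(diag, 0.0)
            how = (True, diag if diag in value else None)
            # Option B: leave (i, j) unaligned and carry a neighbour's score.
            for prev in (diag, (i - 1, j), (i, j - 1)):
                if prev in value and value[prev] > best:
                    best, how = value[prev], (False, prev)
            # No neighbour in the band and a negative pair: start empty.
            if best < 0.0:
                best, how = 0.0, (False, None)
            value[(i, j)], move[(i, j)] = best, how
            if best_cell is None or best > value[best_cell]:
                best_cell = (i, j)
    if best_cell is None:
        return 0.0, []
    # Trace back from the best cell to recover the aligned residue pairs.
    pairs, cell = [], best_cell
    while cell is not None:
        aligned, prev = move[cell]
        if aligned:
            pairs.append(cell)
        cell = prev
    pairs.reverse()
    return value[best_cell], pairs


# Toy 5x5 score matrix: +1 on the diagonal, -1 elsewhere; band of half-width 1.
toy = [[1.0 if i == j else -1.0 for j in range(5)] for i in range(5)]
print(banded_alignment(toy, center=[0, 1, 2, 3, 4], band=1))
```

The sketch assumes that the bands of consecutive rows overlap. If the centers jump by more than the band allows, the accumulated score cannot be carried across the jump, and the function returns the best fragment it found.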
Here is the result of DeepAlign on the three manually-curated datasets. We use human-curated data because it is very hard to judge which of two protein structure alignments is better by a single criterion such as the commonly used RMSD, or by some combination of RMSD and alignment length. It is also unfair to compare them by an algorithm-specific score such as TM-score or DALI-score, since the corresponding algorithm maximizes this score while the others do not. A good way to compare different methods is to take a human-curated alignment benchmark as the gold standard. Here we used three such datasets: CDD from NCBI, which contains 3591 alignments; MALIDUP from Nick Grishin's group, which contains 241 alignments; and MALISAM, also from Nick Grishin's group, which contains 130 alignments. The difficulty of these three datasets ranges from the SCOP family level to the superfamily level and finally to the fold level. From the result we can see that, at all of these structural levels, DeepAlign agrees best with the human-curated alignments compared with all the other popular methods.
Here is another fair comparison method, which first appeared in the MATT paper in 2008. In particular, we can use SABmark to test the ability to discriminate positive data, which are within the same SCOP level, from negative data with no structural similarity at all. We can use the ROC curve and the AUC value to judge the performance of a method. From the result we see that, at both the super-family and the fold level, DeepAlign reaches a higher AUC value than the other methods.
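For readers less familiar with the AUC measure, here is a minimal sketch of how it can be computed from alignment scores. It uses the rank interpretation of AUC (the probability that a randomly chosen positive pair scores higher than a randomly chosen negative pair, with ties counted as one half). The scores are made up and the function name is hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical alignment scores: related pairs vs. structurally unrelated pairs.
related = [0.9, 0.8, 0.4]
unrelated = [0.5, 0.3]
print(roc_auc(related, unrelated))
```

An AUC of 1.0 means that every related pair outscores every unrelated pair, while 0.5 corresponds to no discrimination at all.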
Finally, I will finish this talk with one example. The three superimpositions here come from three different methods applied to the alignment of the same two proteins. In (A) we use TMalign, in (B) we use DeepAlign but optimize the TM-score, and in (C) we run ordinary DeepAlign. From the result we see that TMalign fails completely on this case, returning a TM-score of only 0.288. DeepAlign can generate an alignment with the highest TM-score, 0.514, but if we look at the details of that alignment, we find that the beta-sheet and alpha-helix regions are not aligned well. Finally, if we run DeepAlign optimizing the DeepAlign-score, the errors in these regions are fixed, even though the final TM-score is not as high as the previous one.