1) The document presents a study that uses an evolutionary algorithm (EA) called FG-EA to generate complex features from DNA sequence data for predicting splice sites.
2) FG-EA uses genetic programming techniques to evolve feature representations as trees, which are then evaluated using a surrogate fitness function and support vector machine (SVM) classifier.
3) Experimental results show that FG-EA outperforms state-of-the-art feature generation methods, and reveals the importance of novel conjunctive and disjunctive features for splice site classification.
Evolutionary algorithm for DNA splice site prediction
Title: An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and Its Application to DNA Splice Site Prediction.
Authors: Uday Kamath, Kenneth A. De Jong, Amarda Shehu, Jack Compton, and Rezarta Islamaj-Dogan.
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, No. 5, September/October 2012.
Speaker: Nguyen Dinh Chien (阮庭戰), Student ID: P0126557.
Sequence-based classification aims to discover signals or features hidden in sequence data that carry predictive information: features that correlate with the sought property and discriminate between sequences that contain the property and those that do not. Reduction techniques, such as information gain, chi-square, mutual information, and KL-distance, are additionally employed to further reduce the size of the feature set. It is important to propose feature generation methods that are not limited by biological insight, by the considered type of feature, or by the ability to enumerate features.
Transcription of a eukaryotic DNA sequence into mRNA occurs only after enzymes splice away the noncoding regions (introns) from the pre-mRNA sequence, leaving only the coding regions (exons). Prediction of splice sites is therefore a fundamental component of the gene-finding problem.
Splice site prediction is a difficult problem: the AG and GT (GU in mRNA) dinucleotides that flank true splice sites cannot be used as features by themselves, due to their abundance in non-splice-site sequences.
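The abundance problem can be made concrete by simply counting these dinucleotides; a small sketch (the sequence and helper below are illustrative, not from the paper):

```python
# Sketch: count candidate donor (GT) and acceptor (AG) dinucleotides in a
# made-up sequence. Their frequency outside true splice sites is why GT/AG
# alone cannot discriminate splice sites from non-sites.
def count_dinucleotide(seq, motif):
    """Count occurrences of a 2-letter motif, including overlapping ones."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == motif)

seq = "ACGTGTAGGTAAGTCCAGCAGGT"  # illustrative 23-nt sequence
print(count_dinucleotide(seq, "GT"), count_dinucleotide(seq, "AG"))  # 5 4
```

Even this short toy sequence contains many GT and AG occurrences, almost all of which would be false candidates.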
Many studies have demonstrated the advantages of EAs for feature generation in different domains, such as fast genetic selection for a nearest-neighbor classifier… Methods that obtain predictive features from sequence data have shown success in diverse bioinformatics problems. The common structure of an evolutionary algorithm (EA) is shown in the following figure.
In this study, the authors explore the use of evolutionary algorithms to search a large and complex feature space, obtaining features from sequence data that significantly improve the classification accuracy of a Support Vector Machine (SVM). The approach is called FG-EA (Feature Generation with an Evolutionary Algorithm), and it uses Genetic Programming (GP) techniques to evolve the kind of structures illustrated in the following figure.
Compared with state-of-the-art feature-based classification methods, FG-EA features significantly improve classification performance. The FG-EA algorithm generates complex features, represented internally as GP trees, and evaluates them on splice site training data using a surrogate fitness function.
- The features in the hall of fame transform input sequence data into feature vectors.
- An SVM operating over the feature vectors then allows evaluating the accuracy of the resulting classifier.
This diagram shows how FG-EA is employed to predict DNA splice sites: the top features obtained after exploring the feature space with FG-EA transform input sequences into feature vectors on which an SVM classifier can then operate.
The above diagram shows the main steps of the FG-EA algorithm. Features (individuals) are evolved until a maximum number Gen_Max of generations has been reached. The mutation and crossover operators detailed below are employed to obtain new features in each generation. The top features of a generation are added to a growing hall of fame, which in turn contributes randomly selected features to seed the next generation.
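The generational loop just described can be sketched as follows. All hooks (random_feature, mutate, crossover, fitness) are placeholders for the paper's operators, and the sizes are scaled-down defaults, not the paper's settings:

```python
import random

# Minimal sketch of the FG-EA generational loop: evolve for gen_max
# generations, add each generation's top individuals to a hall of fame,
# and seed the next generation with randomly chosen hall-of-fame features.
def fg_ea(fitness, random_feature, mutate, crossover,
          gen_size=100, gen_max=10, hof_size=25, seed_m=10):
    population = [random_feature() for _ in range(gen_size)]
    hall_of_fame = []
    for _ in range(gen_max):
        ranked = sorted(population, key=fitness, reverse=True)
        # Keep only the overall fittest individuals in the hall of fame.
        hall_of_fame = sorted(hall_of_fame + ranked[:hof_size],
                              key=fitness, reverse=True)[:hof_size]
        # Seed the next generation from the hall of fame, then breed the rest.
        parents = random.sample(hall_of_fame, min(seed_m, len(hall_of_fame)))
        children = []
        while len(parents) + len(children) < gen_size:
            if random.random() < 0.5:
                children.append(mutate(random.choice(parents)))
            else:
                children.append(crossover(*random.sample(parents, 2)))
        population = parents + children
    return hall_of_fame
```

With toy integers as "features" (fitness = the value itself), the returned hall of fame is the descending list of the fittest individuals seen.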
The tree represents the feature "GTT with length = 3 at position 30 AND GTT with length = 3 at position 36." Using an efficient fitness function, the authors identify a set of candidate features (a hall of fame) to be used as input to a standard SVM classification procedure.
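The quoted conjunctive feature can be evaluated on a sequence as follows. Positions are treated here as 0-based offsets; the paper's indexing convention may differ:

```python
# Sketch: evaluating the conjunctive feature
# "GTT at position 30 AND GTT at position 36" on a sequence.
def motif_at(seq, motif, pos):
    """True if `motif` occurs at 0-based offset `pos` of `seq`."""
    return seq[pos:pos + len(motif)] == motif

def gtt30_and_gtt36(seq):
    return motif_at(seq, "GTT", 30) and motif_at(seq, "GTT", 36)

# Made-up 42-nt sequence with GTT planted at offsets 30 and 36:
seq = "A" * 30 + "GTT" + "AAA" + "GTT" + "AAA"
print(gtt30_and_gtt36(seq))  # True
```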
In this paper, the authors propose several types of features: compositional and positional features, correlational features, and conjunctive and disjunctive features.
The initial population is generated at random and consists of N = 15,000 features. The trees representing features are generated using the well-known ramped half-and-half generative method, which includes both the Full and Grow techniques. These techniques yield a mixture of fully balanced trees and bushy trees, with each technique employed with equal probability 0.5. Subsequent generations are evolved using standard GP (Genetic Programming) selection, crossover, and mutation mechanisms.
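The ramped half-and-half idea can be sketched as below: "full" grows every branch to the chosen depth, "grow" may stop a branch early, and each method is picked with probability 0.5. The node sets and the early-stop probability are illustrative, not the paper's feature grammar:

```python
import random

OPERATORS = ["and", "or"]               # illustrative internal nodes
LEAVES = ["GT@30", "AG@80", "GTT@36"]   # illustrative positional leaves

def gen_tree(depth, method):
    """Generate a random tree; 'full' always reaches `depth`, 'grow' may not."""
    stop_early = method == "grow" and random.random() < 0.5
    if depth == 0 or stop_early:
        return random.choice(LEAVES)
    op = random.choice(OPERATORS)
    return (op, gen_tree(depth - 1, method), gen_tree(depth - 1, method))

def ramped_half_and_half(max_depth):
    """Pick 'full' or 'grow' with equal probability and a ramped depth."""
    method = "full" if random.random() < 0.5 else "grow"
    return gen_tree(random.randint(1, max_depth), method)
```

Mixing the two methods over a range ("ramp") of depths gives an initial population with varied shapes and sizes.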
Given a set of m features extracted from the hall of fame to serve as parents in a generation, the remaining GenSize − m features are generated using the mutation and crossover operators. Three breeding pipelines are employed: mutation-ERC, mutation, and crossover.
The fitness function is key to achieving an efficient and effective EA search heuristic. FG-EA uses a surrogate fitness function given by

Fitness(f) = (C+,f / C+) * IG(f)

where f is the feature under evaluation and the ratio C+,f / C+ (positive sequences containing f over all positive sequences) is weighted by the information gain IG(f).
Given m class attributes:

IG(f) = − Σ_{i=1..m} P(c_i) log P(c_i) + P(f) Σ_{i=1..m} P(c_i|f) log P(c_i|f) + P(¬f) Σ_{i=1..m} P(c_i|¬f) log P(c_i|¬f)
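For the binary case (m = 2), the surrogate fitness can be computed from four counts, as in this sketch (the counts and function names are illustrative; the equivalent entropy form H(C) − P(f)H(C|f) − P(¬f)H(C|¬f) is used):

```python
from math import log2

def info_gain(pos_with_f, pos_total, neg_with_f, neg_total):
    """Information gain of a binary feature f over two classes (+/−)."""
    n = pos_total + neg_total
    with_f = pos_with_f + neg_with_f
    without_f = n - with_f

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c)

    h_class = entropy([pos_total, neg_total])
    h_given_f = entropy([pos_with_f, neg_with_f]) if with_f else 0.0
    h_given_not_f = entropy([pos_total - pos_with_f,
                             neg_total - neg_with_f]) if without_f else 0.0
    return h_class - (with_f / n) * h_given_f - (without_f / n) * h_given_not_f

def fitness(pos_with_f, pos_total, neg_with_f, neg_total):
    """Surrogate fitness: (C+,f / C+) weighted by IG(f)."""
    return (pos_with_f / pos_total) * info_gain(pos_with_f, pos_total,
                                                neg_with_f, neg_total)
```

A feature present in every positive sequence and no negative one gets fitness 1.0; a feature equally common in both classes gets fitness 0.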
The ℓ fittest individuals of a generation are added to a hall of fame, which keeps the fittest individuals of each generation. Maintaining a hall of fame guarantees that fit individuals are not lost or changed. The hall of fame serves two purposes: maintaining diversity in the solution space and guaranteeing that top performance is preserved.
In this study, the ℓ = 250 fittest individuals of each generation are kept, and a generation seeds its population with m = 100 randomly chosen individuals from the current set of features in the hall of fame.
The set of features in the hall of fame can be further narrowed through Recursive Feature Elimination (RFE). RFE starts with a large feature set and gradually reduces it by removing the least successful features until a stopping criterion is met. The authors employ RFE to estimate the impact of feature set size on the precision and accuracy of the classification, and to compare directly with existing work.
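The RFE loop just described reduces to a few lines; the scoring function below is a placeholder for whatever ranks features (e.g., weights from a trained classifier):

```python
# Sketch of Recursive Feature Elimination: repeatedly drop the weakest
# features until the target set size is reached.
def rfe(features, score, target_size, drop_per_round=1):
    features = list(features)
    while len(features) > target_size:
        features.sort(key=score, reverse=True)   # strongest first
        features = features[:-drop_per_round]    # drop the weakest
    return features

# Toy usage: keep the 3 "strongest" of 6 dummy features.
print(rfe(range(6), score=lambda f: f, target_size=3))  # [5, 4, 3]
```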
SVMs are popular and successful in a wide variety of binary classification problems. First, the sequence data are mapped into a Euclidean vector space. Second, a kernel function is selected to map that vector space into a higher-dimensional, more effective Euclidean space. Finally, the kernel and other SVM parameters are tuned to improve performance.
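The first step, mapping sequences to vectors, can be sketched as below: each coordinate records whether a hall-of-fame feature fires on the sequence. The (motif, position) features are chosen for illustration:

```python
# Sketch: transform a sequence into a binary feature vector, one coordinate
# per feature; an SVM would then operate on these vectors.
def to_vector(seq, features):
    """features: list of (motif, 0-based position) pairs."""
    return [1.0 if seq[p:p + len(m)] == m else 0.0 for (m, p) in features]

features = [("GT", 0), ("AG", 2), ("GTT", 1)]
print(to_vector("GTAGC", features))  # [1.0, 1.0, 0.0]
```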
The classification performance is compared to two groups of state-of-the-art methods in splice site prediction: feature-based and kernel-based. Feature-based (FGA and GeneSplicer): data extracted from the 2005 NCBI RefSeq collection of 5,057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov), used to extract 51,008 positive (splice-site-containing) and 200,000 negative sequences; the positives comprise 25,504 acceptor and 25,504 donor sequences of 162 nucleotides each (80 upstream + AG|GT + 80 downstream). Kernel-based (WD and WDS): data extracted from the worm data set with EST sequences (http://www.wormbase.org); this group uses 64,844 donor and 64,838 acceptor splice site sequences, each 142 nucleotides long (60 + AG|GT + 80).
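Cutting fixed-length windows like these around a candidate site can be sketched as follows; the function name and edge handling are illustrative, not the paper's extraction code:

```python
# Sketch: extract an (upstream + 2-nt site + downstream) window around a
# candidate splice site, e.g. 80 + 2 + 80 = 162 nt for the human data.
def site_window(seq, site_pos, upstream=80, downstream=80):
    """Return the window around a 2-nt site at `site_pos`, or None near edges."""
    start = site_pos - upstream
    end = site_pos + 2 + downstream
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]

seq = "A" * 80 + "GT" + "C" * 80   # illustrative 162-nt sequence
print(len(site_window(seq, 80)))   # 162
```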
Two sets of classification experiments are conducted: comparing the performance of FG-EA to FGA and GeneSplicer, and comparing it with the WD and WDS methods. The values of the parameters can be found at http://www.cs.gmu.edu/~ashehu/?q=OurTools. Performance is measured in terms of 11ptAVG, FPR, auROC, and auPRC. For example, the following figure shows FPR plotted over recall (left: acceptor, right: donor) for the B2Hum testing data set.
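Of the reported metrics, the FPR is simple to state directly; a minimal sketch over 0/1 labels (the example labels are illustrative):

```python
# Sketch: false positive rate = fraction of negative examples that are
# wrongly classified as splice sites (FP / (FP + TN)).
def false_positive_rate(y_true, y_pred):
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn)

print(false_positive_rate([0, 0, 0, 1, 1], [1, 0, 0, 1, 0]))  # 0.333...
```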
In the two following tables, the authors compare auROC values with auPRC values on 40K sequences sampled (left) and on 10 different sets of 360K sequences sampled (right) from the worm data sets.
The hall of fame features are divided into three subsets: the first contains all compositional features; the second, all region-specific compositional, positional, and correlational features; and the third, all remaining features, including the conjunctive and disjunctive features. The right-hand table shows the IG (information gain) sums of these subsets of features evaluated over acceptor and donor data.
FG-EA outperforms state-of-the-art feature generation methods in splice site classification and reveals the significant role of novel complex conjunctive and disjunctive features. The proposed FG-EA algorithm can easily be employed in other prediction problems on biological sequences. Further extensions of FG-EA could combine the evolution of features with the evolution of SVM kernels for greater classification accuracy. The authors also plan to employ regular expressions to further combine features and reduce bloat in the expressions, improving both readability and performance.