SlideShare une entreprise Scribd logo
1  sur  4
Télécharger pour lire hors ligne
1
Title: An Evolutionary algorithm approach for Feature generation from Sequence data and its
application to DNA Splice site prediction.
Authors: Uday Kamath, Keneth A. De Jong, Amarda Shehu, Jack Compton, and Rezarta Islamaj-
Dogan.
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, No. 5,
September/October 2012.
Speaker: Nguyen Dinh Chien (阮庭戰), Student ID: P0126557.
Sequence-based classification aims to discover signals or features hidden in the sequence data in
prediction information. That sequence data correlate with the sought property a discriminate between
sequences that contain the property and those that do not. Reduction techniques, such as Information
gain, Chi-Square, Mutual Information, and KL-distance, are additionally employed to further reduce
the size of feature set. Reducing is important to propose feature generation methods that are not
limited by biological insight, the considered type of feature or the ability to enumerate features.
Transcription of an eukaryotic DNA sequence into mRNA occurs only after enzymes splice away
nocoding regions (intron) from pre-mRNA sequence to leave only coding regions (exons), so that,
prediction of splice sites is a fundamental component of the gene-finding problem.
Splice site prediction is a difficult problem. AG and GT (GU) cannot be used as features due to
their abundance in non-splice site sequences.
Many studies have demonstrated the advantages of EAs for feature generation in different domains,
such as, Fast Genetic Selection, Nearest Neighbor Classifier… All of these methods obtain predictive
features from sequence data have shown success in diverse bioinformatics problems. The common
structure of Evolution algorithm (EA) is showed in following figure.
In this study, they explored the use of evolutionary algorithms to search a large and complex feature
space. They obtained features from sequence data that can significantly improve the classification
accuracy of a Support Vector Machine (SVM). This approach is called FG-EA, means Feature
Generation with Evolution Algorithm, and they used Genetic Programming (GP) techniques to evolve
the kind of structures illustrated in following figure.
2
Comparison with state-of-the-art feature-based classification method, they realized that FG-EA
features significantly improve the classification performance. The FG-EA algorithm generates
complex features represented internally as GP trees and evaluates them on splice site training data
using a surrogate fitness function.
- The features in the hall of fame transform input sequence data into features vectors.
- SVM operating over the feature vectors finally allow evaluating the accuracy of the resulting
classifier.
With this diagram, they employ to predict DNA splice sites. The top features obtained after the
exploration of the feature space with FG-EA allow transforming input sequences into feature vectors
on which a SVM classifier can then operate.
The above diagram showed the main steps in FG-EA algorithm. Features/individuals are evolved until
a maximum number Gen_Max of generations has been reached. The mutation and crossover operators
detailed bellow are employed to obtain new features in a generation. Top features of a generation are
contributed to a growing hall of fame which then in turn contributes randomly selected features to seed
the next generation.
The tree represents the feature, “GTT with length=3 in position 30 AND GTT with length=3 in
position 36.” Using an efficient fitness function, they identified a set of candidate features (a hall of
fame) to be used as input to a standard SVM classification procedure.
3
In this paper, the Authors proposed some type of features, such as, Compositional features and
positional features, Correlational features, Conjunctive and disjunctive features
Generating random initial features from 0 consists of N=15,000 features. The tree representing
features are generated using the well-known ramped half-and-half generative method, which includes
both Full and Grow techniques. These techniques obtain a mixture of full-balanced trees and bushy
trees with each technique is employed with equal probability of 0.5. Subsequent generations are
evolved using standard GP (Genetic Programming) selection, crossover, and mutation mechanisms.
Given a set of m features extracted from the hall of fame to serve as parents in a generation, the rest of
GenSize-m features are generated using the mutation and crossover operators. They employed three
breeding pipelines: mutation-ERC, mutation, and Crossover.
The fitness function is key to achieving an efficient and effective EA search heuristic. FG-EA uses a
surrogate fitness function given by:
)(*)(
,
fIG
C
C
fFitness
f



f – feature; the ratio C+,f/C+ is weighted by the information gain (IG)
Given m class attributes:
   

m
i ii
m
i i
m
i iii fcPfcPfPfcPfcpfPcPcPfIG 11 1
))|(log().|(.)())|(log().|().())(log().()(
The ℓ fittest individuals of a generation are added to a hall of fame, which keeps the fittest individuals
of each generation. Maintaining a hall of fame guarantees that fit individuals will not be lost or
changed. They used hall of fame with two reasons, such as, Maintaining diversity in the solution space,
and Guarantee optimal performance.
In this study, they used ℓ=250 fittest individuals of a generation, and a generation seeds its population
with m=100 randomly chosen individuals from the current set of features in the hall of fame.
The set of features in the hall of fame can be further narrowed through Recursive Feature Elimination
(RFE). RFE starts with a large feature set and gradually reduce this set by removing the least successful
features until a stopping criterion is met. They employed RFE to estimate the impact of feature set sizes
on the precision and accuracy of the classification, and directly compare with existing work.
SVMs are popular and successful in a wide variety of binary classification problems. First, they
mapped sequence data into a Euclidean vector space. Second, they selected a kernel function to map
the vector space into higher dimensional and more effective Euclidean space. And final, they turn
parameters for the kernel and other SVM parameters to improve performance.
Compare the classification performance to two different groups of state-of-the-art methods in splice
site prediction, feature-based, and kernel-based. Feature-based: (FGA and GeneSplicer) extracted from
the 2005 NCBI RefSeq collection of 5057 human pre-mRNA sequences
(http://www.ncbi.nlm.nih.gov). Used to extract 51008 positive (contain splice sites) and 200000
negative sequences, and 25504 acceptor and 25504 donor consist of 162 nucleotides each (80
upstream + AG|GT + 80 downstream). Kernel-based: (WD and WDS) extracted from the worm data
set with EST sequences (http://www.wormbase.org). In this group, they used 64844 donors and 64838
acceptor splice site sequences. Each sequence is 142 nucleotides long (60+AG|GT+80).
4
There are two sets of classification experiments are conducted, such as, compare the performance of
FG-EA to FGA and GeneSplicer, and Compare with WD and WDS methods. The values of these
parameters can be found on the website http://www.cs.gmu.edu/~ashehu/?q=OurTools. They measure
performance in terms of 11ptAVG, FPR, auROC, and auPRC. For example, the following figure show
that FPR over recall (left: acceptor, right: donor) are plotted for the B2Hum testing data set.
And, in two following table, the authors compared auROC values with auPRC values on 40K
Sequences sampled (left hand), and on 10 different sets of 360K sequences sampled (right hand) from
the Worm data sets.
They divide the hall of fame features in three types of subsets. First subset is all composition features;
second subset is all region-specific
compositional, positional, correlational
features; and third subset is all remaining
features include conjunctive and
disjunctive features. In the right-hand table,
we can see that IG (information gain) sums
of subsets of features evaluated over
acceptor and donor data.
FG-EA outperforms state-of-the-art feature generation methods in splice site classification. FG-EA
reveals the significant role of novel complex conjunctive and disjunctive features. The proposed FG-
EA algorithm can easily be employed in other prediction problems on biological sequences. Further
extensions of the FG-EA can combine the evolution of features with evolution of SVM kernels for
greater classification accuracy. Plan on employing regular expressions to further combine and reduce
the bloat in the expressions and so improve readability and performance.

Contenu connexe

Tendances

Genetic Programming for Generating Prototypes in Classification Problems
Genetic Programming for Generating Prototypes in Classification ProblemsGenetic Programming for Generating Prototypes in Classification Problems
Genetic Programming for Generating Prototypes in Classification ProblemsTarundeep Dhot
 
protein sequence analysis
protein sequence analysisprotein sequence analysis
protein sequence analysisRamikaSingla
 
Seminar Slides
Seminar SlidesSeminar Slides
Seminar Slidespannicle
 
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...ArchiLab 7
 
BITS: Basics of sequence analysis
BITS: Basics of sequence analysisBITS: Basics of sequence analysis
BITS: Basics of sequence analysisBITS
 

Tendances (7)

Genetic Programming for Generating Prototypes in Classification Problems
Genetic Programming for Generating Prototypes in Classification ProblemsGenetic Programming for Generating Prototypes in Classification Problems
Genetic Programming for Generating Prototypes in Classification Problems
 
Swaati modeling
Swaati modeling Swaati modeling
Swaati modeling
 
protein sequence analysis
protein sequence analysisprotein sequence analysis
protein sequence analysis
 
Seminar Slides
Seminar SlidesSeminar Slides
Seminar Slides
 
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
 
BITS: Basics of sequence analysis
BITS: Basics of sequence analysisBITS: Basics of sequence analysis
BITS: Basics of sequence analysis
 
Ijetr042111
Ijetr042111Ijetr042111
Ijetr042111
 

En vedette

Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14Nguyen Chien
 
Vxl Dahl 2009 05 08
Vxl Dahl 2009 05 08Vxl Dahl 2009 05 08
Vxl Dahl 2009 05 08Nguyen Chien
 
Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)Nguyen Chien
 
Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)Nguyen Chien
 
Noi quy van phong
Noi quy van phongNoi quy van phong
Noi quy van phongNgan Hoang
 

En vedette (8)

Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14Quy Che Quyet Dinh 14
Quy Che Quyet Dinh 14
 
Mips Assembly
Mips AssemblyMips Assembly
Mips Assembly
 
Vxl Dahl 2009 05 08
Vxl Dahl 2009 05 08Vxl Dahl 2009 05 08
Vxl Dahl 2009 05 08
 
Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)Evolutionary inaccuracy of pairwise structural alignments (slide)
Evolutionary inaccuracy of pairwise structural alignments (slide)
 
Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)Giao Trinh Vi Xu Ly (20 12 2008)
Giao Trinh Vi Xu Ly (20 12 2008)
 
7. quy che khen thuong ky luat
7. quy che khen thuong ky luat7. quy che khen thuong ky luat
7. quy che khen thuong ky luat
 
P0126557 slides
P0126557 slidesP0126557 slides
P0126557 slides
 
Noi quy van phong
Noi quy van phongNoi quy van phong
Noi quy van phong
 

Similaire à Evolutionary algorithm for DNA splice site prediction

A general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernelsA general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernelsinfopapers
 
Optimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature setOptimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Application of support vector machines for prediction of anti hiv activity of...
Application of support vector machines for prediction of anti hiv activity of...Application of support vector machines for prediction of anti hiv activity of...
Application of support vector machines for prediction of anti hiv activity of...Alexander Decker
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Setijccmsjournal
 
The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...IJECEIAES
 
Automatic Recognition of Medicinal Plants using Machine Learning Techniques
Automatic Recognition of Medicinal Plants using Machine Learning TechniquesAutomatic Recognition of Medicinal Plants using Machine Learning Techniques
Automatic Recognition of Medicinal Plants using Machine Learning TechniquesIRJET Journal
 
Evaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsEvaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsinfopapers
 
PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...
PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...
PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...ijesajournal
 
Human Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataHuman Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataIRJET Journal
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeESCOM
 
Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...
Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...
Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...Editor IJMTER
 
Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...
Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...
Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...infopapers
 

Similaire à Evolutionary algorithm for DNA splice site prediction (20)

A general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernelsA general frame for building optimal multiple SVM kernels
A general frame for building optimal multiple SVM kernels
 
Optimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature setOptimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Application of support vector machines for prediction of anti hiv activity of...
Application of support vector machines for prediction of anti hiv activity of...Application of support vector machines for prediction of anti hiv activity of...
Application of support vector machines for prediction of anti hiv activity of...
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1  Feature SetOptimal Feature Selection from VMware ESXi 5.1  Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...
 
Automatic Recognition of Medicinal Plants using Machine Learning Techniques
Automatic Recognition of Medicinal Plants using Machine Learning TechniquesAutomatic Recognition of Medicinal Plants using Machine Learning Techniques
Automatic Recognition of Medicinal Plants using Machine Learning Techniques
 
Evaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernelsEvaluation of a hybrid method for constructing multiple SVM kernels
Evaluation of a hybrid method for constructing multiple SVM kernels
 
PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...
PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...
PERFORMANCE EVALUATION OF FUZZY LOGIC AND BACK PROPAGATION NEURAL NETWORK FOR...
 
Human Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataHuman Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerData
 
Adaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on CooperativeAdaptive Training of Radial Basis Function Networks Based on Cooperative
Adaptive Training of Radial Basis Function Networks Based on Cooperative
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...
Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...
Computer aided classification of Bascal cell carcinoma using adaptive Neuro-f...
 
Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...
Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...
Optimization of Complex SVM Kernels Using a Hybrid Algorithm Based on Wasp Be...
 
B017410916
B017410916B017410916
B017410916
 
B017410916
B017410916B017410916
B017410916
 
20120140505011
2012014050501120120140505011
20120140505011
 
2224d_final
2224d_final2224d_final
2224d_final
 

Dernier

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Dernier (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Evolutionary algorithm for DNA splice site prediction

  • 1. 1 Title: An Evolutionary algorithm approach for Feature generation from Sequence data and its application to DNA Splice site prediction. Authors: Uday Kamath, Keneth A. De Jong, Amarda Shehu, Jack Compton, and Rezarta Islamaj- Dogan. Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, No. 5, September/October 2012. Speaker: Nguyen Dinh Chien (阮庭戰), Student ID: P0126557. Sequence-based classification aims to discover signals or features hidden in the sequence data in prediction information. That sequence data correlate with the sought property a discriminate between sequences that contain the property and those that do not. Reduction techniques, such as Information gain, Chi-Square, Mutual Information, and KL-distance, are additionally employed to further reduce the size of feature set. Reducing is important to propose feature generation methods that are not limited by biological insight, the considered type of feature or the ability to enumerate features. Transcription of an eukaryotic DNA sequence into mRNA occurs only after enzymes splice away nocoding regions (intron) from pre-mRNA sequence to leave only coding regions (exons), so that, prediction of splice sites is a fundamental component of the gene-finding problem. Splice site prediction is a difficult problem. AG and GT (GU) cannot be used as features due to their abundance in non-splice site sequences. Many studies have demonstrated the advantages of EAs for feature generation in different domains, such as, Fast Genetic Selection, Nearest Neighbor Classifier… All of these methods obtain predictive features from sequence data have shown success in diverse bioinformatics problems. The common structure of Evolution algorithm (EA) is showed in following figure. In this study, they explored the use of evolutionary algorithms to search a large and complex feature space. They obtained features from sequence data that can significantly improve the classification accuracy of a Support Vector Machine (SVM). This approach is called FG-EA, means Feature Generation with Evolution Algorithm, and they used Genetic Programming (GP) techniques to evolve the kind of structures illustrated in following figure.
  • 2. 2 Comparison with state-of-the-art feature-based classification method, they realized that FG-EA features significantly improve the classification performance. The FG-EA algorithm generates complex features represented internally as GP trees and evaluates them on splice site training data using a surrogate fitness function. - The features in the hall of fame transform input sequence data into features vectors. - SVM operating over the feature vectors finally allow evaluating the accuracy of the resulting classifier. With this diagram, they employ to predict DNA splice sites. The top features obtained after the exploration of the feature space with FG-EA allow transforming input sequences into feature vectors on which a SVM classifier can then operate. The above diagram showed the main steps in FG-EA algorithm. Features/individuals are evolved until a maximum number Gen_Max of generations has been reached. The mutation and crossover operators detailed bellow are employed to obtain new features in a generation. Top features of a generation are contributed to a growing hall of fame which then in turn contributes randomly selected features to seed the next generation. The tree represents the feature, “GTT with length=3 in position 30 AND GTT with length=3 in position 36.” Using an efficient fitness function, they identified a set of candidate features (a hall of fame) to be used as input to a standard SVM classification procedure.
  • 3. 3 In this paper, the Authors proposed some type of features, such as, Compositional features and positional features, Correlational features, Conjunctive and disjunctive features Generating random initial features from 0 consists of N=15,000 features. The tree representing features are generated using the well-known ramped half-and-half generative method, which includes both Full and Grow techniques. These techniques obtain a mixture of full-balanced trees and bushy trees with each technique is employed with equal probability of 0.5. Subsequent generations are evolved using standard GP (Genetic Programming) selection, crossover, and mutation mechanisms. Given a set of m features extracted from the hall of fame to serve as parents in a generation, the rest of GenSize-m features are generated using the mutation and crossover operators. They employed three breeding pipelines: mutation-ERC, mutation, and Crossover. The fitness function is key to achieving an efficient and effective EA search heuristic. FG-EA uses a surrogate fitness function given by: )(*)( , fIG C C fFitness f    f – feature; the ratio C+,f/C+ is weighted by the information gain (IG) Given m class attributes:      m i ii m i i m i iii fcPfcPfPfcPfcpfPcPcPfIG 11 1 ))|(log().|(.)())|(log().|().())(log().()( The ℓ fittest individuals of a generation are added to a hall of fame, which keeps the fittest individuals of each generation. Maintaining a hall of fame guarantees that fit individuals will not be lost or changed. They used hall of fame with two reasons, such as, Maintaining diversity in the solution space, and Guarantee optimal performance. In this study, they used ℓ=250 fittest individuals of a generation, and a generation seeds its population with m=100 randomly chosen individuals from the current set of features in the hall of fame. The set of features in the hall of fame can be further narrowed through Recursive Feature Elimination (RFE). RFE starts with a large feature set and gradually reduce this set by removing the least successful features until a stopping criterion is met. They employed RFE to estimate the impact of feature set sizes on the precision and accuracy of the classification, and directly compare with existing work. SVMs are popular and successful in a wide variety of binary classification problems. First, they mapped sequence data into a Euclidean vector space. Second, they selected a kernel function to map the vector space into higher dimensional and more effective Euclidean space. And final, they turn parameters for the kernel and other SVM parameters to improve performance. Compare the classification performance to two different groups of state-of-the-art methods in splice site prediction, feature-based, and kernel-based. Feature-based: (FGA and GeneSplicer) extracted from the 2005 NCBI RefSeq collection of 5057 human pre-mRNA sequences (http://www.ncbi.nlm.nih.gov). Used to extract 51008 positive (contain splice sites) and 200000 negative sequences, and 25504 acceptor and 25504 donor consist of 162 nucleotides each (80 upstream + AG|GT + 80 downstream). Kernel-based: (WD and WDS) extracted from the worm data set with EST sequences (http://www.wormbase.org). In this group, they used 64844 donors and 64838 acceptor splice site sequences. Each sequence is 142 nucleotides long (60+AG|GT+80).
  • 4. 4 There are two sets of classification experiments are conducted, such as, compare the performance of FG-EA to FGA and GeneSplicer, and Compare with WD and WDS methods. The values of these parameters can be found on the website http://www.cs.gmu.edu/~ashehu/?q=OurTools. They measure performance in terms of 11ptAVG, FPR, auROC, and auPRC. For example, the following figure show that FPR over recall (left: acceptor, right: donor) are plotted for the B2Hum testing data set. And, in two following table, the authors compared auROC values with auPRC values on 40K Sequences sampled (left hand), and on 10 different sets of 360K sequences sampled (right hand) from the Worm data sets. They divide the hall of fame features in three types of subsets. First subset is all composition features; second subset is all region-specific compositional, positional, correlational features; and third subset is all remaining features include conjunctive and disjunctive features. In the right-hand table, we can see that IG (information gain) sums of subsets of features evaluated over acceptor and donor data. FG-EA outperforms state-of-the-art feature generation methods in splice site classification. FG-EA reveals the significant role of novel complex conjunctive and disjunctive features. The proposed FG- EA algorithm can easily be employed in other prediction problems on biological sequences. Further extensions of the FG-EA can combine the evolution of features with evolution of SVM kernels for greater classification accuracy. Plan on employing regular expressions to further combine and reduce the bloat in the expressions and so improve readability and performance.