IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 6, Ver. VI (Nov – Dec. 2015), PP 45-51
www.iosrjournals.org
DOI: 10.9790/0661-17664551 www.iosrjournals.org 45 | Page
Filter-Wrapper Approach to Feature Selection Using PSO-GA for
Arabic Document Classification with Naive Bayes Multinomial
Indriyani 1), Wawan Gunawan 2), Ardhon Rakhmadi 3)
1, 2, 3) Department of Informatics, Faculty of Information Technology, Institut Teknologi Sepuluh Nopember, Jl. Teknik Kimia, Sukolilo, Surabaya, 60117, Indonesia
Abstract: Text categorization and feature selection are two of the many text data mining problems. In text categorization, a document containing a collection of text is converted to a dataset format consisting of features and a class: words become the features, and the categories of the documents become the class. Too many features can degrade classifier performance, because many features are redundant or suboptimal, so feature selection is required to select an optimal feature subset. This paper proposes a feature selection strategy based on Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) methods for Arabic document classification with Naive Bayes Multinomial (NBM). PSO is adopted in the first phase to eliminate insignificant features and prepare the reduced feature set for the next phase. In the second phase, the reduced features are optimized using an evolutionary computation method, the Genetic Algorithm (GA). These methods greatly reduce the number of features and achieve higher classification accuracy than the full feature set without feature selection. The experiments yield accuracies of 85.31% for NBM, 83.91% for NBM-PSO, and 90.20% for NBM-PSO-GA.
Keywords: Document Classification, Feature Selection, Particle Swarm Optimization (PSO), Genetic
Algorithm (GA).
I. Introduction
Text data mining[1] is a research domain involving many research areas, such as natural
language processing, machine learning, information retrieval[2], and data mining. Text categorization
and feature selection are two of the many text data mining problems. The text document
categorization problem has been studied by many researchers[3][4][5]. Yang and Pedersen presented a comparative study of feature selection in text classification[5]. Many machine learning methods have been used for the text categorization problem, but the problem of feature selection, finding a subset of features that yields optimal classification, remains. Even a few noisy features may sharply reduce classification accuracy. Furthermore, a high number of features can slow down the classification process or even make some classifiers inapplicable.
According to John, Kohavi, and Pfleger[6], there are mainly two types of feature selection
methods in machine learning: wrappers and filters. Wrappers use the classification accuracy of some
learning algorithm as their evaluation function. Since wrappers have to train a classifier for each
feature subset to be evaluated, they are usually much more time consuming especially when the
number of features is high. So wrappers are generally not suitable for text classification.
Hence, feature selection is commonly used in text classification to reduce the dimensionality of the feature space and improve the efficiency and accuracy of classifiers. Text classification tasks can usually be performed accurately with fewer than 100 words in simple cases, while complex tasks do best with thousands of words[7].
As opposed to wrappers, filters perform feature selection independently of the learning
algorithm that will use the selected features. In order to evaluate a feature, filters use an evaluation
metric that measures the ability of the feature to differentiate each class.
For this reason, our motivation is to build a good text classifier by investigating a hybrid filter and wrapper method for feature selection. Our proposed approach applies Particle Swarm Optimization (PSO) as a filter in the first phase; in the second phase, the filtered features are further reduced using a Genetic Algorithm (GA) wrapper, and the resulting subset is used for classification with Naive Bayes Multinomial on Arabic document text. The evaluation used an Arabic corpus of 478 documents from www.shamela.ws, independently classified into seven categories.
II. Methodology
A general description of the research method is shown in Figure 1. The stages and methods used to underpin this study include:
A. Dataset Document
The documents used as experimental data were taken from www.shamela.ws, in the seven most frequently discussed categories: Sholat, Zakat, Shaum, Hajj, Mutazawwij, Baa' and Isytaro, and Wakaf.
B. Preprocessing
Fig 1. Research Stages
In preprocessing, each document is converted to a document vector in the following sequence:
- Filtering: removal of illegal characters (numbers and symbols) from the document contents.
- Stoplist removal: removal of words in the stopword category, i.e. high-frequency words contained in the stoplist data.
- Term extraction: extraction of the terms (words) of each document into a vector of terms that represents the document.
- Stemming: reduction of each term in the document vector to its base form, grouping similar terms together.
- TF-IDF weighting: TF-IDF weighting of each term in the document vector.
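The preprocessing pipeline above can be sketched as follows (a minimal illustration, not the authors' implementation: the tokenization rule, the placeholder stoplist, and the function names are our own, and the stemming step is omitted):

```python
import math
import re
from collections import Counter

# Placeholder stoplist; a real Arabic stoplist would be loaded here.
STOPWORDS = {"fi", "min", "ala"}

def preprocess(text):
    # Filtering + term extraction: keep only letter sequences,
    # dropping illegal characters (numbers and symbols).
    tokens = re.findall(r"[^\W\d_]+", text)
    # Stoplist removal: drop high-frequency stopwords.
    return [t for t in tokens if t not in STOPWORDS]

def tf_idf(docs):
    # docs: list of token lists; returns one {term: weight} vector per document.
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

Each returned dictionary maps a term to its TF-IDF weight; a real pipeline would plug in an Arabic stoplist and stemmer before weighting.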
C. Filter - Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO) is an evolutionary computation technique developed by Kennedy and Eberhart [11]. Particle swarms find optimal regions of a complex search space through the interaction of individuals in the population. PSO is attractive for feature selection in that particle swarms discover the best feature combinations as they fly within the subset space. During movement, the current position of particle i is represented by a vector xi = (xi1, xi2, ..., xiD), where D is the dimensionality of the search space. The velocity of particle i is represented as vi = (vi1, vi2, ..., viD) and must lie in a range defined by the parameters vmin and vmax. The personal best position of a particle (local best) is the best previous position of that particle, and the best position obtained by the population thus far is called the global best. PSO searches for the optimal solution by updating the velocity and the position of each particle according to the following equations:
xid(t+1) = xid(t) + vid(t+1)    (1)

vid(t+1) = w * vid(t) + c1 * r1i * (pid - xid(t)) + c2 * r2i * (pgd - xid(t))    (2)

where t denotes the t-th iteration, d ∈ {1, ..., D} denotes the d-th dimension in the search space, w is the inertia weight, c1 and c2 are the learning factors, called the cognitive and social parameters respectively, r1i and r2i are random values uniformly distributed in [0, 1], and pid and pgd represent the elements of the local best and the global best in the d-th dimension. The personal best position of particle i is updated as

pi(t+1) = xi(t+1) if fitness(xi(t+1)) > fitness(pi(t)), otherwise pi(t+1) = pi(t)    (3)
In this work, PSO acts as a filter with CFS (Correlation-based Feature Selection) as its fitness function. Like most feature selection techniques, CFS uses a search algorithm along with a function to evaluate the worth of feature subsets. CFS measures the usefulness of each feature for predicting the class label along with the level of intercorrelation among the features, based on the hypothesis that good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other [12].
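Under the update rules of Eqs. (1)-(3), a binary-PSO feature selector can be sketched as follows (a simplified illustration with our own names: the `fitness` callback stands in for the CFS merit function, which is not implemented here, and the sigmoid-based binary position update is one common variant for 0/1 search spaces):

```python
import math
import random

def pso_feature_select(fitness, n_features, n_particles=10, iters=30,
                       w=0.7, c1=1.5, c2=1.5, vmax=4.0):
    # Binary PSO: each particle is a 0/1 mask over the features.
    # Velocities follow Eq. (2); positions are resampled through a sigmoid.
    pos = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                   # local best per particle
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-vmax, min(vmax, vel[i][d]))  # clamp to vmax
                # Sigmoid of velocity used as P(bit = 1)
                p_one = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if random.random() < p_one else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:                  # Eq. (3): update local best
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest
```

Calling `pso_feature_select(cfs_merit, n_features)` with a CFS merit function as `fitness` yields a 0/1 mask selecting the filtered feature subset.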
D. Design of the Wrapper Phase
After filtering, the next phase is the wrapper. Wrapper methods previously used for this purpose usually adopt random search strategies, such as the Adaptive Genetic Algorithm (AGA), Chaotic Binary Particle Swarm Optimization (CBPSO), and the Clonal Selection Algorithm (CSA). The wrapper approach consists of methods that choose a minimum subset of features satisfying an evaluation criterion.
It has been shown that the wrapper approach produces the best results among feature selection methods [13], although it is time consuming, since each feature subset considered must be evaluated with the classifier algorithm. To overcome this, we perform PSO filtering in advance so that the execution time is reduced. In the wrapper method, the feature subset selection algorithm exists as a wrapper around the data mining algorithm and its outcome evaluation. The induction algorithm is used as a black box: the feature selection algorithm searches for a good subset using the induction algorithm itself as part of the evaluation function. GA-based wrapper methods employ a Genetic Algorithm (GA) as the search method over feature subsets.
GA is a random search method that effectively explores large search spaces [14]. The basic idea of GA is to evolve a population of individuals (chromosomes), where each individual is a possible solution to a given problem. When searching for an appropriate subset of features, the population consists of different subsets evolved by mutation, crossover, and selection operations. After reaching the maximum number of generations, the algorithm returns the chromosome with the highest fitness, i.e. the subset of features with the highest accuracy.
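The GA wrapper loop described above can be sketched as follows (an illustrative sketch with our own names: `eval_accuracy` stands in for training and evaluating the Naive Bayes Multinomial classifier on a candidate subset; one-point crossover, bit-flip mutation, and tournament selection are assumed, as the paper does not specify the operators):

```python
import random

def ga_feature_select(eval_accuracy, n_features, pop_size=20, generations=40,
                      crossover_rate=0.8, mutation_rate=0.02):
    # Chromosome = 0/1 mask over features; fitness = wrapper accuracy callback.
    def tournament(pop, fits):
        a, b = random.sample(range(len(pop)), 2)
        return pop[a] if fits[a] >= fits[b] else pop[b]

    pop = [[random.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        fits = [eval_accuracy(c) for c in pop]    # classifier as a black box
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tournament(pop, fits), tournament(pop, fits)
            if random.random() < crossover_rate:
                cut = random.randrange(1, n_features)   # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation on each gene
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]
            nxt.append(child)
        pop = nxt
    fits = [eval_accuracy(c) for c in pop]
    # Return the chromosome with the highest fitness
    return max(zip(pop, fits), key=lambda t: t[1])[0]
```

In the proposed pipeline, `n_features` would be the size of the PSO-filtered subset, so each wrapper evaluation trains the classifier on a much smaller feature space.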
Figure 2. GA-based wrapper feature selection with Naive Bayes Multinomial as an induction
algorithm evaluating feature subset.
E. Naive Bayes Multinomial
Multinomial Naive Bayes (MNB) is the version of Naive Bayes that is commonly used for
text categorization problems. In the MNB classifier, each document is viewed as a collection of words
and the order of words is considered irrelevant.
The probability of a class value c given a test document d is computed as

P(c|d) = P(c) * Π_w P(w|c)^nwd / P(d)    (4)

where nwd is the number of times word w occurs in document d, P(w|c) is the probability of observing word w given class c, P(c) is the prior probability of class c, and P(d) is a constant that makes the probabilities for the different classes sum to one. P(c) is estimated by the proportion of training documents pertaining to class c, and P(w|c) is estimated with a Laplace correction as

P(w|c) = (1 + Σ_{d in c} nwd) / (k + Σ_{w'} Σ_{d in c} nw'd)    (5)

where k is the size of the vocabulary.
Many text categorization problems are unbalanced. This can cause problems because of the Laplace correction used in (5). Consider a word w in a two-class problem with classes c1 and c2, where w is completely irrelevant to the classification. This means the odds ratio for that particular word should be one, i.e. P(w|c1) / P(w|c2) = 1, so that the word does not influence the class probability. Assume the word occurs with equal relative frequency 0.1 in the text of each of the two classes, that there are 20,000 words in the vocabulary (k = 20,000), and that the total size of the corpus is 100,000 words. If the two classes differ in size, the Laplace correction in (5) shifts the two smoothed estimates by different amounts, so the odds ratio deviates from one and the irrelevant word influences the predicted class. Fortunately, there is a simple remedy: we can normalize the word counts in each class so that the total size of the two classes is the same after normalization[17].
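A minimal sketch of the MNB classifier described by Eqs. (4) and (5) (the class and method names are our own; the class-size normalization remedy of [17] is omitted for brevity, and words outside the training vocabulary are simply ignored at prediction time):

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes: class priors from document proportions
    and Laplace-smoothed word probabilities, per Eqs. (4)-(5)."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        k = len(self.vocab)                        # vocabulary size in Eq. (5)
        counts = defaultdict(Counter)              # word counts per class
        for d, c in zip(docs, labels):
            counts[c].update(d)
        self.log_prior = {}                        # log P(c)
        self.log_cond = {}                         # log P(w|c)
        for c in set(labels):
            self.log_prior[c] = math.log(labels.count(c) / len(labels))
            total = sum(counts[c].values())
            self.log_cond[c] = {w: math.log((counts[c][w] + 1) / (total + k))
                                for w in self.vocab}
        return self

    def predict(self, doc):
        # argmax over classes of log P(c) + sum_w nwd * log P(w|c);
        # P(d) is a constant and can be dropped from the argmax.
        def score(c):
            return self.log_prior[c] + sum(self.log_cond[c][w]
                                           for w in doc if w in self.vocab)
        return max(self.log_prior, key=score)
```

Working in log space avoids numerical underflow when a document contains many words, which is why the products of Eq. (4) appear here as sums of logarithms.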
III. Experimental results
In order to evaluate the performance of the proposed method, our algorithm was tested on 478 documents from www.shamela.ws, independently classified into seven categories, with 70% of the data used for training and 30% for testing. The number of words (features) in this dataset is 1578. In our experiments, the CFS-PSO filtering algorithm reduced the number of features to 618; in the second phase, the filtered features were further optimized with the Genetic Algorithm (GA) wrapper, down to 266. We compared the accuracy of our proposed method, NBM-PSO-GA, with NBM without feature selection and NBM with PSO filtering on the same dataset.
3.1. The classification results of the NBM, NBM-PSO, and NBM-PSO-GA
The classification results of the NBM, NBM-PSO, and NBM-PSO-GA Classifier are shown
in Tables 3.2, 3.3, and 3.4 respectively.
Table 3.1. Number of documents selected from the dataset
Class Name                         Doc Num.
Sholat (صلاة)                      136
Zakat (الزكاة)                     36
Shaum (صيام)                       42
Hajj (الحاج)                       64
Mutazawwij (متزوج)                 46
Baa' and Isytaro (البيع والشراء)    109
Wakaf (الأوقاف)                    45
As shown in Table 3.2, when the Naive Bayes Multinomial classifier is used, the overall Accuracy Rate of the classification result is 81.8%, with Recall 0.82, Precision 0.82, and F-Measure 0.82. The table also indicates that the "Sholat" class reaches 97.4% accuracy, while the Accuracy Rate of the "Shaum" class is as low as 66.7%, which weighs down the overall classification result.
As shown in Table 3.3, when the Naive Bayes Multinomial classifier with PSO filter feature selection is used, the overall Accuracy Rate is again 81.8%, with Recall 0.82, Precision 0.84, and F-Measure 0.82. The "Sholat" class again reaches 97.4% accuracy, while the Accuracy Rate of the "Baa' and Isytaro" class is as low as 68.8%, which strongly influences the overall classification result.
Finally, for the method we propose, shown in Table 3.4, the Naive Bayes Multinomial classifier with hybrid feature selection combining the filter (PSO) and the wrapper (GA), the overall Accuracy Rate is 90.2%, with Recall 0.90, Precision 0.91, and F-Measure 0.90. The "Sholat" class again reaches 97.4% accuracy, while the Accuracy Rate of the "Mutazawwij" class is as low as 71.4%, which strongly influences the overall classification result.
Table 3.2. Classification results of the Naive Bayes Multinomial
Class Name                         Recall  Precision  F-Measure  Accuracy (%)
Sholat (صلاة)                      0.97    0.93       0.95       97.4
Zakat (الزكاة)                     0.82    0.69       0.75       81.8
Shaum (صيام)                       0.67    0.75       0.71       66.7
Hajj (الحاج)                       0.86    0.67       0.75       85.7
Mutazawwij (متزوج)                 0.71    0.83       0.77       71.4
Baa' and Isytaro (البيع والشراء)    0.75    0.84       0.79       75.0
Wakaf (الأوقاف)                    0.73    0.79       0.76       73.3
Total                              0.82    0.82       0.82       81.8
Table 3.3. Classification results of the Naive Bayes Multinomial with PSO filtering
Class Name                         Recall  Precision  F-Measure  Accuracy (%)
Sholat (صلاة)                      0.97    0.93       0.95       97.4
Zakat (الزكاة)                     0.91    0.71       0.80       90.9
Shaum (صيام)                       0.78    0.70       0.74       77.8
Hajj (الحاج)                       0.93    0.65       0.77       92.9
Mutazawwij (متزوج)                 0.71    0.71       0.71       71.4
Baa' and Isytaro (البيع والشراء)    0.69    0.92       0.79       68.8
Wakaf (الأوقاف)                    0.73    0.73       0.73       73.3
Total                              0.82    0.84       0.82       81.8
Table 3.4. Classification results of the Naive Bayes Multinomial with PSO filtering + GA wrapper
Class Name                         Recall  Precision  F-Measure  Accuracy (%)
Sholat (صلاة)                      0.97    0.93       0.95       97.4
Zakat (الزكاة)                     0.91    0.83       0.87       90.9
Shaum (صيام)                       0.89    0.89       0.89       88.9
Hajj (الحاج)                       0.93    0.81       0.87       92.9
Mutazawwij (متزوج)                 0.71    0.71       0.71       71.4
Baa' and Isytaro (البيع والشراء)    0.88    0.96       0.91       87.5
Wakaf (الأوقاف)                    0.87    0.93       0.90       86.7
Total                              0.90    0.91       0.90       90.2
Tables 3.2-3.4 show that the method we propose achieves the highest accuracy, 90.2%, compared with 81.8% for both NBM and NBM-PSO.
Figure 3. Accuracy with number of features
From Figure 3 it can be seen that feature selection with PSO filtering successfully reduced the number of features from 1578 to 493 while matching the accuracy of classification without any filtering. Our proposed method then optimizes the PSO-filtered features with GA, which not only reduces the number of features to 266 but also increases accuracy by 8.4%.
IV. Conclusion and future work
This paper proposed the integration of NBM-PSO-GA methods for Arabic document classification. Our experiments revealed the importance of the feature selection process in building a classifier: the integration of PSO+GA greatly reduced the features, keeping resource usage to a minimum, while at the same time improving classification accuracy. The accuracy rate for our method is 90.2%. For future work, we will try to combine this classification method with other feature selection techniques for optimal results.
References
[1]. Chakrabarti, S., 2000. Data mining for hypertext: A tutorial survey. SIGKDD Explorations 1(2), ACM SIGKDD.
[2]. Salton, G., 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[3]. Rifkin, R.M., 2002. Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. Dissertation, Sloan School of Management Science, MIT, September.
[4]. Salton, G., 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[5]. Yang, Y., Pedersen, J., 1997. A comparative study on feature selection in text categorization. In: Proceedings of ICML-97, Fourteenth International Conference on Machine Learning.
[6]. John, G.H., Kohavi, R., Pfleger, K., 1994. Irrelevant features and the subset selection problem. In: Proceedings of the 11th International Conference on Machine Learning, pp. 121-129. Morgan Kaufmann, San Francisco.
[7]. McCallum, A., Nigam, K., 1998. A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, pp. 41-48.
[8]. Forman, G., 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3, 1289-1305.
[9]. Rogati, M., Yang, Y., 2002. High-performing feature selection for text classification. In: CIKM '02, Proceedings of the 11th International Conference on Information and Knowledge Management, pp. 659-661.
[10]. Yang, Y., Pedersen, J.O., 1997. A comparative study on feature selection in text categorization. In: The Fourteenth International Conference on Machine Learning (ICML 97), pp. 412-420.
[11]. Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. In: Proc. IEEE Int. Conf. on Neural Networks, Perth, pp. 1942-1948.
[12]. Settles, M., 2007. An Introduction to Particle Swarm Optimization. November 2007.
[13]. Zhiwei, X., Xinghua, W., 2010. Research for information extraction based on wrapper model algorithm. In: 2010 Second International Conference on Computer Research and Development, Kuala Lumpur, Malaysia, pp. 652-655.
[14]. Goldberg, D., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.
[15]. Karegowda, A.G., Jayaram, M.A., Manjunath, A.S., 2011. Feature subset selection using cascaded GA & CFS: A filter approach in supervised learning. International Journal of Computer Applications 23(2), June 2011.
[16]. Hall, M.A., Smith, L.A., 1997. Feature subset selection: a correlation-based filter approach. Springer.
[17]. Frank, E., Bouckaert, R.R., 2006. Naive Bayes for text classification with unbalanced classes. Computer Science Department, University of Waikato.
An Examination of Effectuation Dimension as Financing Practice of Small and M...
 
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
 
Childhood Factors that influence success in later life
Childhood Factors that influence success in later lifeChildhood Factors that influence success in later life
Childhood Factors that influence success in later life
 
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
 
Customer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in DubaiCustomer’s Acceptance of Internet Banking in Dubai
Customer’s Acceptance of Internet Banking in Dubai
 
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
 
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model ApproachConsumer Perspectives on Brand Preference: A Choice Based Model Approach
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
 
Student`S Approach towards Social Network Sites
Student`S Approach towards Social Network SitesStudent`S Approach towards Social Network Sites
Student`S Approach towards Social Network Sites
 
Broadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperativeBroadcast Management in Nigeria: The systems approach as an imperative
Broadcast Management in Nigeria: The systems approach as an imperative
 
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
 
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
 
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on BangladeshConsumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
 
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
 
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
 
Media Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & ConsiderationMedia Innovations and its Impact on Brand awareness & Consideration
Media Innovations and its Impact on Brand awareness & Consideration
 
Customer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative studyCustomer experience in supermarkets and hypermarkets – A comparative study
Customer experience in supermarkets and hypermarkets – A comparative study
 
Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...Social Media and Small Businesses: A Combinational Strategic Approach under t...
Social Media and Small Businesses: A Combinational Strategic Approach under t...
 
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
 
Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...Implementation of Quality Management principles at Zimbabwe Open University (...
Implementation of Quality Management principles at Zimbabwe Open University (...
 
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
 

Dernier

AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
MsecMca
 

Dernier (20)

Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
Unit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdfUnit 1 - Soil Classification and Compaction.pdf
Unit 1 - Soil Classification and Compaction.pdf
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 

Filter-Wrapper Approach to Feature Selection Using PSO-GA for Arabic Document Classification with Naive Bayes Multinomial

I. Introduction

Text data mining [1] is a research domain involving many research areas, such as natural language processing, machine learning, information retrieval [2], and data mining. Text categorization and feature selection are two of the many problems in text data mining. The text document categorization problem has been studied by many researchers [3][4][5]; Yang et al. presented a comparative study of feature selection in text classification [5]. Many machine learning methods have been applied to text categorization, but the feature selection problem remains: finding a subset of features that yields optimal classification. Even a few noisy features can sharply reduce classification accuracy, and a large number of features can slow down the classification process or even make some classifiers inapplicable.

According to John, Kohavi, and Pfleger [6], there are mainly two types of feature selection methods in machine learning: wrappers and filters. Wrappers use the classification accuracy of some learning algorithm as their evaluation function. Since wrappers must train a classifier for each feature subset evaluated, they are usually much more time consuming, especially when the number of features is high, so wrappers alone are generally not suitable for text classification. Feature selection is therefore commonly used in text classification to reduce the dimensionality of the feature space and to improve the efficiency and accuracy of classifiers. Text classification tasks can usually be performed accurately with fewer than 100 words in simple cases, and require thousands of words in complex ones [7]. As opposed to wrappers, filters perform feature selection independently of the learning algorithm that will use the selected features; to evaluate a feature, filters use a metric that measures the ability of the feature to discriminate between classes.
For this reason, our motivation is to build a good text classifier by investigating a hybrid filter and wrapper method for feature selection. Our proposed approach applies Particle Swarm Optimization (PSO) for filtering in the first phase; in the second phase, the filtered features are further reduced with a Genetic Algorithm (GA) wrapper, and the optimized features are used for Arabic document classification with Naive Bayes Multinomial. The evaluation uses an Arabic corpus of 478 documents from www.shamela.ws, independently classified into seven categories.

II. Methodology

A general description of the research method is shown in Figure 1. The stages and methods used in this study are as follows.

A. Dataset

The documents used as experimental data were taken from www.shamela.ws, covering the seven most frequently discussed categories: Sholat, Zakat, Shaum, Hajj, Mutazawwij, Baa'a and Isytaro, and Wakaf.

B. Preprocessing

Fig 1. Research Stage

In preprocessing, each document is converted into a document vector in the following sequence:
- Filtering: elimination of illegal characters (numbers and symbols) in the document's contents.
- Stoplist removal: removal of stopwords, i.e. high-frequency words contained in the stoplist data.
- Term extraction: extraction of the terms (words) of each document, compiled into a vector of terms that represents the document.
- Stemming: reduction of each term found in the document vector to its base form, grouping similar terms together.
- TF-IDF weighting: TF-IDF weighting of every term in the document vector.

C. Filter: Particle Swarm Optimization (PSO)

Particle swarm optimization (PSO) is an evolutionary computation technique developed by Kennedy and Eberhart [11].
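The preprocessing sequence of Section B can be sketched end to end. This is a minimal illustration only: the stoplist is a toy English one and the stemmer is the identity function, standing in for the Arabic stoplist and stemmer used in the paper.

```python
# Minimal sketch of the Section B pipeline: filtering, stoplist removal,
# term extraction, (placeholder) stemming, and TF-IDF weighting.
import math
import re

def preprocess(doc, stoplist, stem):
    doc = re.sub(r"[^\w\s]|\d", " ", doc)                  # filtering: drop symbols/digits
    terms = [t for t in doc.split() if t not in stoplist]  # stoplist removal
    return [stem(t) for t in terms]                        # stemming (identity here)

def tf_idf(corpus_terms):
    """corpus_terms: list of term lists, one per document."""
    n = len(corpus_terms)
    df = {}                                                # document frequency per term
    for terms in corpus_terms:
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for terms in corpus_terms:
        tf = {}
        for t in terms:
            tf[t] = tf.get(t, 0) + 1
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["the cat sat", "the dog sat", "the dog barked"]
stop = {"the"}
vecs = tf_idf([preprocess(d, stop, lambda t: t) for d in docs])
```

A term that occurs in every document gets weight 0, while rarer terms are weighted up, which is exactly the discriminative signal the later feature selection phases operate on.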
The particle swarm finds optimal regions of a complex search space through the interaction of the individuals in the population. PSO is attractive for feature selection because the particles discover good feature combinations as they fly within the subset space. During movement, the current position of particle i is represented by a vector xi = (xi1, xi2, ..., xiD), where D is the dimensionality of the search space. The velocity of particle i is represented as vi = (vi1, vi2, ..., viD) and must lie in a range defined by the parameters vmin and vmax. The personal best (local best) position of a particle is the best previous position of that particle, and the best position obtained by the population so far is called the global best. PSO searches for the optimal solution by updating the velocity and the position of each particle according to the following equations:

xid(t+1) = xid(t) + vid(t+1)   (1)

vid(t+1) = w * vid(t) + c1 * r1i * (pid - xid(t)) + c2 * r2i * (pgd - xid(t))   (2)

where t denotes the t-th iteration, d ∈ D denotes the d-th dimension of the search space, w is the inertia weight, c1 and c2 are the learning factors (called the cognitive and social parameters, respectively), r1i and r2i are random values uniformly distributed in [0, 1], and pid and pgd are the elements of the local best and the global best in the d-th dimension. The personal best position of particle i keeps the best position visited so far:

pi(t+1) = xi(t+1) if the fitness of xi(t+1) is better than that of pi(t), otherwise pi(t+1) = pi(t)   (3)

In this work, PSO is used as a filter with CFS (Correlation-based Feature Selection) as the fitness function. Like the majority of feature selection techniques, CFS uses a search algorithm together with a function that evaluates the worth of feature subsets. CFS measures the usefulness of each feature for predicting the class label, along with the level of intercorrelation among the features, based on the hypothesis: good feature subsets contain features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other [12].
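Equations (1)–(3) can be turned into a small binary-PSO feature selector. The sketch below is an illustration, not the paper's implementation: it maps velocities to bit probabilities through a sigmoid (a common binary-PSO variant the paper does not specify), and the fitness function is a toy stand-in for the CFS merit.

```python
# Binary-PSO feature selection sketch following Eqs. (1)-(3).
import math
import random

def pso_select(num_features, fitness, particles=10, iters=30,
               w=0.7, c1=1.5, c2=1.5, vmax=4.0, seed=0):
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(num_features)] for _ in range(particles)]
    V = [[0.0] * num_features for _ in range(particles)]
    P = [x[:] for x in X]                          # personal bests, Eq. (3)
    pf = [fitness(x) for x in X]
    g = max(range(particles), key=lambda i: pf[i])
    G, gf = P[g][:], pf[g]                         # global best
    for _ in range(iters):
        for i in range(particles):
            for d in range(num_features):
                V[i][d] = (w * V[i][d]
                           + c1 * rng.random() * (P[i][d] - X[i][d])
                           + c2 * rng.random() * (G[d] - X[i][d]))  # Eq. (2)
                V[i][d] = max(-vmax, min(vmax, V[i][d]))            # clamp to [-vmax, vmax]
                s = 1.0 / (1.0 + math.exp(-V[i][d]))                # sigmoid transfer
                X[i][d] = 1 if rng.random() < s else 0              # binary position, cf. Eq. (1)
            f = fitness(X[i])
            if f > pf[i]:                                           # Eq. (3)
                P[i], pf[i] = X[i][:], f
                if f > gf:
                    G, gf = X[i][:], f
    return G, gf

# Toy fitness standing in for CFS merit: reward the first three features,
# penalize subset size.
relevant = {0, 1, 2}
fit = lambda x: sum(x[d] for d in relevant) - 0.2 * sum(x)
best, score = pso_select(10, fit)
```

Swapping `fit` for a real CFS merit function (class correlation over feature intercorrelation) recovers the filter used in the paper.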
D. Design of the Wrapper Phase

After filtering, the next phase is the wrapper. Wrapper methods usually adopt random search strategies, such as the Adaptive Genetic Algorithm (AGA), Chaotic Binary Particle Swarm Optimization (CBPSO), and the Clonal Selection Algorithm (CSA). The wrapper approach consists of methods that choose a minimal subset of features satisfying an evaluation criterion. The wrapper approach has been shown to produce the best results among feature selection methods [13], although it is time consuming, since each candidate feature subset must be evaluated with the classification algorithm. To overcome this, we perform PSO filtering first, which reduces the execution time. In the wrapper method, the feature subset selection algorithm exists as a wrapper around the data mining algorithm and its outcome evaluation; the induction algorithm is used as a black box, and the feature selection algorithm searches for a proper subset using the induction algorithm itself as part of the evaluation function. GA-based wrapper methods use a Genetic Algorithm (GA) as the search method over feature subsets. GA is a random search method that effectively explores large search spaces [14]. The basic idea of GA is to evolve a population of individuals (chromosomes), where an individual is a possible solution to the given problem. When searching for an appropriate subset of features, the population consists of different subsets evolved by mutation, crossover, and selection operations. After reaching the maximum number of generations, the algorithm returns the chromosome with the highest fitness, i.e. the feature subset with the highest accuracy.
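The GA wrapper loop described above (selection, crossover, and mutation over bit-mask chromosomes) can be sketched as follows. In the paper the fitness of a chromosome is the NBM classification accuracy on the candidate subset; here `fitness` is a pluggable function, the demo uses a toy stand-in, and all parameter values are illustrative.

```python
# GA wrapper sketch: bit-mask chromosomes evolved by tournament selection,
# single-point crossover, and bit-flip mutation, with elitism.
import random

def ga_select(num_features, fitness, pop_size=20, generations=40,
              crossover_rate=0.8, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(num_features)] for _ in range(pop_size)]

    def tournament():
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = [max(pop, key=fitness)]                 # elitism: keep the best subset
        while len(nxt) < pop_size:
            p1, p2 = tournament()[:], tournament()[:]
            if rng.random() < crossover_rate:         # single-point crossover
                cut = rng.randrange(1, num_features)
                p1 = p1[:cut] + p2[cut:]
            for d in range(num_features):             # bit-flip mutation
                if rng.random() < mutation_rate:
                    p1[d] = 1 - p1[d]
            nxt.append(p1)
        pop = nxt
    best = max(pop, key=fitness)
    return best, fitness(best)

# Toy stand-in for "train NBM on this subset and return its accuracy".
relevant = {0, 2, 4}
toy = lambda x: sum(x[d] for d in relevant) - 0.1 * sum(x)
subset, score = ga_select(12, toy)
```

Replacing `toy` with a function that trains and scores the induction algorithm on the masked feature set turns this into the black-box wrapper the section describes.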
Figure 2. GA-based wrapper feature selection with Naive Bayes Multinomial as the induction algorithm evaluating each feature subset.

E. Naive Bayes Multinomial

Multinomial Naive Bayes (MNB) is the version of Naive Bayes commonly used for text categorization problems. In the MNB classifier, each document is viewed as a collection of words and the order of the words is considered irrelevant. The probability of a class value c given a test document d is computed as

P(c|d) = P(c) * Πw P(w|c)^nwd / P(d)   (4)

where nwd is the number of times word w occurs in document d, P(w|c) is the probability of observing word w given class c, P(c) is the prior probability of class c, and P(d) is a constant that makes the probabilities for the different classes sum to one. P(c) is estimated by the proportion of training documents pertaining to class c, and P(w|c) is estimated with the Laplace correction as

P(w|c) = (1 + Σd∈Dc nwd) / (k + Σw' Σd∈Dc nw'd)   (5)

where Dc is the set of training documents of class c and k is the vocabulary size. Many text categorization problems are unbalanced, which can cause problems because of the Laplace correction used in (5). Consider a word w in a two-class problem with classes c1 and c2, where w is completely irrelevant to the classification. The odds ratio for that word should then be one, i.e. P(w|c1)/P(w|c2) = 1, so that the word does not influence the class probability. Assume the word occurs with equal relative frequency 0.1 in the text of each of the two classes, that there are 20,000 words in the vocabulary (k = 20,000), and that the total size of the corpus is 100,000 words; if the two classes differ in size, the Laplace correction then pushes the estimated odds ratio away from one even though the word is irrelevant. Fortunately, there is a simple remedy: the word counts in each class can be normalized so that the total size of the two classes is the same after normalization [17].
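Equations (4) and (5) translate directly into a compact classifier. Below is a minimal sketch, computed in log space to avoid underflow; the tiny training set and tokens are illustrative, not the paper's corpus.

```python
# Minimal Multinomial Naive Bayes following Eqs. (4)-(5):
# P(c|d) ∝ P(c) * Π_w P(w|c)^n_wd, with Laplace-corrected
# P(w|c) = (1 + count(w, c)) / (k + total_count(c)).
import math
from collections import Counter

class MultinomialNB:
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        vocab = set()
        for words, c in zip(docs, labels):
            self.counts[c].update(words)
            vocab.update(words)
        self.k = len(vocab)                           # vocabulary size k in Eq. (5)
        return self

    def predict(self, words):
        def log_post(c):
            total = sum(self.counts[c].values())      # total word count of class c
            return math.log(self.prior[c]) + sum(
                math.log((1 + self.counts[c][w]) / (self.k + total))
                for w in words)                       # Eq. (4) in log space; P(d) dropped
        return max(self.classes, key=log_post)

train = [["zakat", "nisab", "charity"], ["prayer", "rakaah"], ["prayer", "mosque"]]
labels = ["zakat", "sholat", "sholat"]
clf = MultinomialNB().fit(train, labels)
pred = clf.predict(["prayer"])
```

The normalization remedy of [17] would be applied before `fit`, rescaling each class's word counts to a common total.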
III. Experimental Results

To evaluate the performance of the proposed method, the algorithm was tested on 478 documents from www.shamela.ws, independently classified into seven categories, with 70% of the data used for training and 30% for testing. The number of words (features) in this dataset is 1578. In our experiments, we reduced the number of features to 618 using the CFS-PSO filtering algorithm; in the second phase, the filtered features were further optimized with the Genetic Algorithm (GA) wrapper, reducing them to 266. We compared the accuracy of the proposed NBM-PSO-GA method with NBM without feature selection and with NBM using PSO filtering, on the same dataset.

3.1. Classification results of NBM, NBM-PSO, and NBM-PSO-GA

The classification results of the NBM, NBM-PSO, and NBM-PSO-GA classifiers are shown in Tables 3.2, 3.3, and 3.4, respectively.

Tabel 3.1. Number of documents selected from the dataset

Class Name | Doc Num.
Sholat (صلاة) | 136
Zakat (الزكاة) | 36
Shaum (صيام) | 42
Hajj (الحاج) | 64
Mutazawwij (متزوج) | 46
Baa'a and Isytaro (البيع والشراء) | 109
Wakaf (الأوقاف) | 45

As shown in Table 3.2, when the plain Naive Bayes Multinomial classifier is used, the overall accuracy is 81.8%, with Recall 0.82, Precision 0.82, and F-Measure 0.82. The table also shows that the "Sholat" class reaches 97.4% accuracy, while the "Shaum" class is as low as 66.7%, which pulls down the overall result.

As shown in Table 3.3, when Naive Bayes Multinomial with the PSO feature filter is used, the overall accuracy is again 81.8%, with Recall 0.82, Precision 0.84, and F-Measure 0.82. The "Sholat" class again reaches 97.4% accuracy, while the "Baa'a and Isytaro" class is as low as 68.8%, which strongly influences the overall result.

Finally, the method we propose is shown in Table 3.4.
When Naive Bayes Multinomial with the hybrid feature selection, the PSO filter followed by the GA wrapper, is used, the overall accuracy is 90.2%, with Recall 0.90, Precision 0.91, and F-Measure 0.90. The table shows that the "Sholat" class reaches 97.4% accuracy, while the "Mutazawwij" class is as low as 71.4%, which influences the overall result.

Tabel 3.2. Classification results of Naive Bayes Multinomial

Class Name | Recall | Precision | F-Measure | Accuracy (%)
Sholat (صلاة) | 0.97 | 0.93 | 0.95 | 97.4
Zakat (الزكاة) | 0.82 | 0.69 | 0.75 | 81.8
Shaum (صيام) | 0.67 | 0.75 | 0.71 | 66.7
Hajj (الحاج) | 0.86 | 0.67 | 0.75 | 85.7
Mutazawwij (متزوج) | 0.71 | 0.83 | 0.77 | 71.4
Baa'a and Isytaro (البيع والشراء) | 0.75 | 0.84 | 0.79 | 75.0
Wakaf (الأوقاف) | 0.73 | 0.79 | 0.76 | 73.3
Total | 0.82 | 0.82 | 0.82 | 81.8

Tabel 3.3. Classification results of Naive Bayes Multinomial with PSO filtering

Class Name | Recall | Precision | F-Measure | Accuracy (%)
Sholat (صلاة) | 0.97 | 0.93 | 0.95 | 97.4
Zakat (الزكاة) | 0.91 | 0.71 | 0.80 | 90.9
Shaum (صيام) | 0.78 | 0.70 | 0.74 | 77.8
Hajj (الحاج) | 0.93 | 0.65 | 0.77 | 92.9
Mutazawwij (متزوج) | 0.71 | 0.71 | 0.71 | 71.4
Baa'a and Isytaro (البيع والشراء) | 0.69 | 0.92 | 0.79 | 68.8
Wakaf (الأوقاف) | 0.73 | 0.73 | 0.73 | 73.3
Total | 0.82 | 0.84 | 0.82 | 81.8

Tabel 3.4. Classification results of Naive Bayes Multinomial with PSO filtering + GA wrapper

Class Name | Recall | Precision | F-Measure | Accuracy (%)
Sholat (صلاة) | 0.97 | 0.93 | 0.95 | 97.4
Zakat (الزكاة) | 0.91 | 0.83 | 0.87 | 90.9
Shaum (صيام) | 0.89 | 0.89 | 0.89 | 88.9
Hajj (الحاج) | 0.93 | 0.81 | 0.87 | 92.9
Mutazawwij (متزوج) | 0.71 | 0.71 | 0.71 | 71.4
Baa'a and Isytaro (البيع والشراء) | 0.88 | 0.96 | 0.91 | 87.5
Wakaf (الأوقاف) | 0.87 | 0.93 | 0.90 | 86.7
Total | 0.90 | 0.91 | 0.90 | 90.2

The tables above show that the proposed method achieves the highest accuracy, 90.2%, compared with NBM and NBM-PSO, which both achieve 81.8%.

Figure 3. Accuracy versus number of features

As Figure 3 shows, feature selection using PSO filtering reduces the number of features from 1578 to 618 while preserving the classification accuracy obtained without any filtering; the proposed method then optimizes the filtered features with GA, reducing them further to 266 while also increasing the accuracy by 8.4%.

IV. Conclusion and Future Work

This paper proposed the integration of NBM-PSO-GA for Arabic document classification. Our experiments reveal the importance of the feature selection process in building a classifier. The integration of PSO and GA greatly reduces the number of features, keeping resource usage to a minimum, while at the same time improving the classification accuracy; the accuracy of our method is 90.2%. For future work, we will try to combine the two classification methods with other feature selection techniques for optimal results.
References

[1]. Chakrabarti, S., 2000. Data mining for hypertext: A tutorial survey. ACM SIGKDD Explorations 1(2).
[2]. Salton, G., 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[3]. Rifkin, R.M., 2002. Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning. Ph.D. Dissertation, Sloan School of Management Science, MIT.
[4]. Salton, G., 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[5]. Yang, Y., Pedersen, J., 1997. A comparative study on feature selection in text categorization. In: Proceedings of ICML-97, Fourteenth International Conference on Machine Learning.
[6]. John, G.H., Kohavi, R., Pfleger, K., 1994. Irrelevant features and the subset selection problem. In: Proceedings of the 11th International Conference on Machine Learning, pp. 121–129. Morgan Kaufmann, San Francisco.
[7]. McCallum, A., Nigam, K., 1998. A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, pp. 41–48.
[8]. Forman, G., 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research 3, 1289–1305.
[9]. Rogati, M., Yang, Y., 2002. High-performing feature selection for text classification. In: CIKM '02, Proceedings of the 11th International Conference on Information and Knowledge Management, pp. 659–661.
[10]. Yang, Y., Pedersen, J.O., 1997. A comparative study on feature selection in text categorization. In: The Fourteenth International Conference on Machine Learning (ICML 97), pp. 412–420.
[11]. Kennedy, J., Eberhart, R., 1995. Particle swarm optimization. In: Proceedings of the IEEE International Conference on Neural Networks, Perth, pp. 1942–1948.
[12]. Settles, M., 2007. An Introduction to Particle Swarm Optimization.
[13]. Zhiwei, X., Xinghua, W., 2010. Research for information extraction based on wrapper model algorithm. In: 2010 Second International Conference on Computer Research and Development, Kuala Lumpur, Malaysia, pp. 652–655.
[14]. Goldberg, D., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading, MA.
[15]. Karegowda, A.G., Jayaram, M.A., Manjunath, A.S., 2011. Feature subset selection using cascaded GA & CFS: a filter approach in supervised learning. International Journal of Computer Applications 23(2).
[16]. Hall, M.A., Smith, L.A., 1997. Feature subset selection: a correlation based filter approach. Springer.
[17]. Frank, E., Bouckaert, R.R., 2006. Naive Bayes for text classification with unbalanced classes. Computer Science Department, University of Waikato.