ACEEE Int. J. on Information Technology, Vol. 02, No. 02, April 2012



Using Class Frequency for Improving Centroid-based Text Classification

Verayuth Lertnattee1 and Chanisara Leuviphan1
1 Department of Health-related Informatics, Faculty of Pharmacy,
Silpakorn University, Maung, Nakorn Pathom, 73000 Thailand
E-mail: verayuths@hotmail.com, verayuth@su.ac.th, chanisara@su.ac.th

Abstract— Most previous works on text classification represented the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first is to explore the effectiveness of inverse class frequency on the popular term weighting TFIDF, as a replacement of idf and as an addition to TFIDF. The second approach is to evaluate some functions that adjust the power of inverse class frequency. The third approach is to apply terms that are found in only one class or a few classes to improve classification performance, using two-step classification. The results show that class frequency is useful for text classification, especially in the two-step classification.

Index Terms—text classification, term weighting, class frequency, linear classifier

I. INTRODUCTION

With the increasing availability of online text information, there has been a pressing need to find and organize relevant information in text documents. Automated text categorization (also known as text classification) has become a significant tool for organizing text documents efficiently. A variety of classification methods have been developed and used in different schemes, such as probabilistic models [1], neural networks [2], example-based models (e.g., k-nearest neighbor) [3], linear models [4], [5], support vector machines [6], and so on. Among these methods, a linear model called the centroid-based method is attractive since it requires less computation than other methods in both the learning and classification stages. Despite its low computation time, the centroid-based method has been shown in several studies, including [4], [5], to achieve relatively high accuracy. In this method, an individual class is modeled by weighting the terms appearing in the training documents assigned to that class. Classification performance therefore depends strongly on the term weighting applied in the model. Most previous works on centroid-based methods focused on weighting factors related to frequency patterns of terms or documents in the class. The two most popular factors are term frequency ($tf$) and inverse document frequency ($idf$). Some previous works, such as [7], [8], attempted to apply another factor called inverse class frequency ($icf$). However, the impact of this factor needs more investigation.

This paper investigates the usefulness of inverse class frequency in a more systematic way. The traditional term frequency and inverse document frequency utilize information within classes and in the whole collection of training data. Besides these two sources of information, information among classes, so-called inter-class information, is also expected to be useful, and inverse class frequency can utilize this source of information. The effectiveness of applying class frequency in term weighting and in the classification process is investigated through three approaches. The first is to explore the effectiveness of inverse class frequency on the popular term weighting $tf \times idf$, as a replacement of $idf$ and as an addition to $tf \times idf$. The second approach is to evaluate some functions that adjust the power of inverse class frequency. The third approach is to apply terms that are found in only one class or a few classes to improve classification performance, using two steps of classification. In the rest of this paper, Section II presents centroid-based text categorization. Term weighting in centroid-based text classification by frequency-based patterns is given in Section III. Section IV describes three approaches for applying class frequency in text classification. The data sets and experimental settings are described in Section V. In Section VI, the experimental results are given. A conclusion is drawn in Section VII.

II. CENTROID-BASED TEXT CLASSIFICATION

In centroid-based text categorization, a document (or a class) is represented by a vector using a vector space model with a bag of words (BOW) [9]. The simplest and most popular scheme applies term frequency (tf) and inverse document frequency (idf) in the form $tf \times idf$ as the term weight for representing a document. In a vector space model, given a set of documents $D = \{d_1, d_2, \ldots, d_{|D|}\}$, a document $d_j$ is represented by a document vector $\vec{d}_j = \{w_{1j}, w_{2j}, \ldots, w_{|T|j}\} = \{tf_{1j} \cdot idf_1, tf_{2j} \cdot idf_2, \ldots, tf_{|T|j} \cdot idf_{|T|}\}$, where $w_{ij}$ is the weight assigned to a term $t_i$ in the set of terms $T$ of the document. In this definition, $tf_{ij}$ is the term frequency of a term $t_i$ in a document $d_j$, and $idf_i$ is the inverse document frequency, defined as $\log(|D|/df_i)$.
Here, $|D|$ is the total number of documents in the collection and $df_i$ is the number of documents that contain the term $t_i$. Besides term weighting, normalization is another important factor in representing a document or a class. The class prototype $\vec{c}_k$ is obtained by summing all document vectors in $C_k$ and then normalizing the result by its size. The formal description of a class prototype is $\vec{c}_k = \sum_{d_j \in C_k} \vec{d}_j / \| \sum_{d_j \in C_k} \vec{d}_j \|$, where $C_k = \{ d_j \mid d_j \text{ is a document belonging to the class } c_k \}$. The simple term weighting is $\overline{tf} \times idf$, where $\overline{tf}$ is the average class term frequency of the term; formally, $\overline{tf}_{ik} = \sum_{d_j \in C_k} tf_{ijk} / |C_k|$, where $|C_k|$ is the number of documents in the class $c_k$. The term weighting described above can also be applied to a query or a test document; in general, the term weighting for a query is $tf \times idf$. Once a class prototype vector and a query vector have been constructed, the similarity between the two vectors can be calculated. The most popular measure is the cosine distance [10]; this similarity can be computed as the dot product between the two normalized vectors. The test document is therefore assigned to the class whose prototype vector is most similar to the vector of the test document.
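For concreteness, the sketch below illustrates the scheme just described: $tf \times idf$ document vectors, class prototypes formed by summing and normalizing the vectors of each class, and classification by dot product. It is our own minimal Python illustration under simplifying assumptions (documents are pre-tokenized lists of terms; helper names such as `idf_table` and `train_centroids` are hypothetical), not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def idf_table(docs):
    """idf_i = log(|D| / df_i), where df_i is the number of documents containing t_i."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return {t: math.log(n / c) for t, c in df.items()}

def tfidf(doc, idf):
    """Document vector d_j with weights w_ij = tf_ij * idf_i."""
    tf = Counter(doc)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def train_centroids(docs, labels, idf):
    """Class prototype c_k: sum of the document vectors in C_k, normalized by its size."""
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for t, w in tfidf(doc, idf).items():
            sums[label][t] += w
    return {c: normalize(v) for c, v in sums.items()}

def classify(doc, centroids, idf):
    """Assign the class whose (unit-length) prototype has the largest dot product."""
    q = normalize(tfidf(doc, idf))
    return max(centroids,
               key=lambda c: sum(w * centroids[c].get(t, 0.0) for t, w in q.items()))

# Usage sketch: docs are token lists, e.g. [["course", "exam"], ["faculty", "research"]].
# idf = idf_table(train_docs)
# centroids = train_centroids(train_docs, train_labels, idf)
# predicted = classify(test_doc, centroids, idf)
```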
III. TERM WEIGHTING IN CENTROID-BASED TEXT CLASSIFICATION BY FREQUENCY-BASED PATTERNS

This section presents the concept of term weighting in centroid-based text classification using frequency-based patterns. Before the details are described, some general characteristics of terms that are significant for representing a certain class are listed below:
• A term that is representative of a certain class should appear frequently, with a high occurrence frequency, within that class.
• An important term tends to exist in relatively few documents, while a general term tends to appear in most documents in a collection.
• A crucial term tends to exist in only one class or a few classes.
To capture these characteristics, frequency-based patterns in a training collection can be applied. The three main frequency factors are term frequency, document frequency, and class frequency. The simplest and most popular scheme applies term frequency ($tf$) and inverse document frequency ($idf$) in the form $tf \times idf$ for representing a document. In a centroid-based method, the term frequency of a class vector is the average of the term frequencies of the documents in the class; for this reason, we use the symbol $\overline{tf}$ instead of $tf$. This represents an average class term frequency. The $\overline{tf}$ is considered an intra-class factor: a term that is important for a specific class should have a high $\overline{tf}_{ik}$. The $\overline{tf}$ addresses the first property of significant terms for classification. Term frequency alone, however, may not be enough to represent the contribution of a term in a document. To achieve better performance, the well-known inverse document frequency can be applied to reduce the impact of frequent terms that exist in almost all documents. The $idf_i$ is the inverse ratio of the number of training documents that contain the term $t_i$ to the total number of training documents, usually applied as the logarithm of its original value. It addresses the second property of significant terms. However, this tends to hold only if the documents that contain the term are in the same class; conversely, if the distribution of a term is uniform across all classes, $idf$ becomes useless and does not help classification. The $idf$ is considered a collection factor, i.e., its value is the same for a particular term, independent of the arrangement of classes in the collection.

The first and second properties can be handled by the conventional term frequency and inverse document frequency, respectively. For the third property, however, it is necessary to utilize another frequency factor, and class frequency is expected to be useful here. Class frequency enables the classifier to utilize information among classes and is considered an inter-class factor. Some collections may have several possible organizations of documents (viewpoints). For example, a collection may be grouped into a set of classes based on its content (e.g., course, faculty, project, student, ...) or by its university (e.g., Cornell, Texas, Washington, Wisconsin, ...). In these two cases, the collection factor (e.g., $idf$) of a term is identical while the inter-class factor of that term varies. An important term of a specific category tends to exist in one class or only a few classes (the third property of significant terms); therefore, class frequency can be applied to represent the importance of that term. In the past, some works utilized class frequency in general forms, such as the logarithm of inverse class frequency ($icf$), which is analogous to $idf$. In situations where $idf$ cannot be calculated, such as in [8], it is possible to use a sentence as the processing unit instead of a document, so that $icf$ replaces $idf$. Although $icf$ seems useful, it has not been popular in term weighting.
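As a minimal sketch of how the three frequency factors can be collected from a labeled training set (our own illustration; the function name `frequency_factors` and the data layout are assumptions, not from the paper):

```python
from collections import Counter, defaultdict

def frequency_factors(docs, labels):
    """Return (avg_tf, df, cf): average class term frequency per class,
    document frequency, and class frequency of each term."""
    df = Counter()                    # documents containing each term
    tf_sum = defaultdict(Counter)     # summed term counts per class
    class_terms = defaultdict(set)    # terms observed in each class
    n_docs = Counter()                # |C_k|, documents per class
    for doc, label in zip(docs, labels):
        df.update(set(doc))
        tf_sum[label].update(doc)
        class_terms[label].update(doc)
        n_docs[label] += 1
    avg_tf = {c: {t: tf_sum[c][t] / n_docs[c] for t in tf_sum[c]} for c in tf_sum}
    cf = Counter()                    # number of classes containing each term
    for terms in class_terms.values():
        cf.update(terms)
    return avg_tf, df, cf
```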
IV. APPLYING CLASS FREQUENCY IN TEXT CLASSIFICATION

In this paper, the usefulness of class frequency is
presented in both concept and experiments. Three approaches are taken into account: (1) the effectiveness of $icf$ in term weighting; (2) several functions applied to $icf$ to enhance classification performance; and (3) an algorithm that applies terms found in only one class or a few classes to improve classification performance, using two steps of classification. The details are described as follows.

A. The Effectiveness of ICF in Term Weighting

The popular method to utilize class frequency is a logarithmic function of inverse class frequency:

$ICF_i = \log \frac{|C|}{cf_i}$    (1)

Here $cf_i$ is the number of classes that contain the term $t_i$. The effect of $icf$ is to promote a term that occurs in only a few classes (later called few-class terms) and to demote a term that appears in many classes (later called most-class terms). In the extreme case where a term occurs in only one class ($cf_i = 1$), the function reaches its maximum value; these terms are called one-class terms. We believe that few-class terms are quite important for classifying documents, and inverse class frequency promotes the importance of these terms. Conversely, when a term occurs in all classes ($cf_i = |C|$), the function of inverse class frequency reaches its minimum value, i.e., 0; these terms are called all-class terms. They are considered less important and may cause the classifier to misclassify the document, although their usefulness also depends on their term frequencies. The promotion of few-class terms is more effective than the demotion of most-class terms, since demotion may suppress terms that occur in all classes but are still useful for classifying documents. The $icf$ can be included in term weighting, and two simple patterns are applied. The first pattern is $tf \times icf$, i.e., substituting $icf$ for $idf$. The other pattern is $tf \times idf \times icf$, i.e., including $icf$ in the $tf \times idf$. In a normal situation, the number of documents is greater than the number of classes ($|D| > |C|$), which means $idf$ promotes the importance of terms more strongly than $icf$ does. When the logarithm of base 2 is used, the value of $icf$ for a term equals 1 when the term is found in half of the classes in a data set ($cf_i = |C|/2$); therefore $icf > 1$ when $cf_i < |C|/2$ and $icf < 1$ when $cf_i > |C|/2$. If the value of $icf$ is more than 1, it promotes $tf$ and $tf \times idf$; on the contrary, it demotes them when its value is less than 1.
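A minimal sketch of Eq. (1) and the two weighting patterns, assuming a base-2 logarithm as discussed above and reusing the factor tables from the earlier sketches (our own illustration):

```python
import math

def icf(n_classes, cf_i):
    """ICF_i = log2(|C| / cf_i): maximal for one-class terms, 0 for all-class terms."""
    return math.log2(n_classes / cf_i)

def weight_tficf(avg_tf_ik, icf_i):
    """First pattern: substitute icf for idf."""
    return avg_tf_ik * icf_i

def weight_tfidficf(avg_tf_ik, idf_i, icf_i):
    """Second pattern: include icf in tf*idf; icf > 1 promotes, icf < 1 demotes."""
    return avg_tf_ik * idf_i * icf_i
```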
B. Functions to Modify ICF

In this approach, several functions are applied to $icf$ to adjust the degree of promotion and demotion applied to $tf$ and $tf \times idf$. The first pattern adjusts the power of $icf$: starting from the term weighting $tf \times idf \times icf$, the new term weighting is

$tf \times idf \times icf^{\beta}$    (2)

Here $\beta$ is the power of $icf$. The $icf$ promotes $tf \times idf$ when its value is greater than 1. If the power of the $icf$ is greater than 1, it promotes or demotes a term more strongly than the normal $icf$; on the other hand, when the power of the $icf$ is less than 1, it promotes or demotes a term more weakly than the normal $icf$. The optimum value of $\beta$ depends on the data set. In this first pattern, a term may be either promoted or demoted. The second pattern considers only the promotion of a term. An example of such a term weighting formula is

$tf \times idf \times (icf + 1)$    (3)

With this equation, terms that are found in all classes are given a term weight of plain $tf \times idf$, while the other terms are given values higher than $tf \times idf$.
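The modifying functions of Eqs. (2) and (3) can be sketched as follows (our own illustration; the exponent symbol, garbled in the source, is written as `beta` here):

```python
def icf_power(icf_i, beta):
    """Eq. (2): icf**beta; beta > 1 strengthens and beta < 1 weakens the effect of icf."""
    return icf_i ** beta

def icf_plus_one(icf_i):
    """Eq. (3): icf + 1; all-class terms keep plain tf*idf, every other term is promoted."""
    return icf_i + 1.0

def weight_modified(avg_tf_ik, idf_i, icf_i, modify=icf_plus_one):
    """tf * idf * f(icf) for a chosen modifying function f."""
    return avg_tf_ik * idf_i * modify(icf_i)
```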
C. Two-step Classification

In this approach, the class frequency of a term is applied to select the representations of the class prototype and test document vectors. The representative vectors are based on n-class terms: a representative vector of n-class terms is built only from terms that occur in $1, 2, \ldots, n$ classes, where $1 \le n \le |C|$. In the first step of classification, $n$ is set and a test document is classified to an appropriate class based on its n-class terms. The remaining test documents, which have no n-class terms in their representation, are classified using all terms in the second step. For example, when $n = 1$, the representation vectors of prototypes and test documents consist only of terms that are found in exactly one class; a test document that has one-class terms of a class is automatically classified to that class in the first step, and the remaining test documents are classified by all terms in the second step.
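The two-step procedure, as we read it, can be sketched as follows (our own interpretation with hypothetical helpers: `tfidf` and `normalize` are from the Section II sketch, and `cf` is the class-frequency table from the Section III sketch):

```python
def restrict(vec, cf, n):
    """Keep only n-class terms, i.e. terms occurring in at most n classes."""
    return {t: w for t, w in vec.items() if 0 < cf.get(t, 0) <= n}

def classify_vec(query, centroids):
    q = normalize(query)               # normalize() as in the Section II sketch
    return max(centroids,
               key=lambda c: sum(w * centroids[c].get(t, 0.0) for t, w in q.items()))

def two_step_classify(doc, centroids, idf, cf, n=1):
    q = tfidf(doc, idf)                # tfidf() as in the Section II sketch
    q_n = restrict(q, cf, n)
    if q_n:                            # step 1: the document has n-class terms
        cents_n = {c: restrict(v, cf, n) for c, v in centroids.items()}
        return classify_vec(q_n, cents_n)
    return classify_vec(q, centroids)  # step 2: fall back to all terms
```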
V. EXPERIMENTAL SETTING

To evaluate our concept of class frequency, three collections, WebKB, 7Sectors, and 20Newsgroups, are used. WebKB is a collection of web pages of computer

science departments in four universities with some additional            WebKB1, WebKB2, 7Sectors and 20Newsgroups. Table I
pages from other universities. This collection can be viewed             showed the result in forms of the classification accuracy.
as two-dimensional viewpoints. In our experiment, we use                             TABLE I. EFFECT   OF   ICF ON THE FOUR D ATA SETS
the four most popular classes: student, faculty, course and
 project. This includes 4,199 web pages. Focusing on each
class, five subclasses (the 2 nd dimension) are defined
according to the university a web page belongs to: Cornell,
Texas, Washington, Wisconsin and miscellaneous. We use
this collection in two viewpoints: first dimension (WebKB1)              From the result, some observations can be made as follows.
and second dimension (WebKB2). The total number of data
sets for this collection is two. Therefore, the effect of inverse        The performance of a prototype vector with tf  idf , is
class frequency with different numbers of classes on the
                                                                         better than that of a prototype vector with tf  icf on three
same collection, can be evaluated. The second collection is
the 7Sectors, a data set from CMU World Wide Knowledge                   data sets, except only the 7Sectors. The performance of
Base Project from the CMU Text Learning Group. This data                 classifiers with tf  icf on the same collection but different
set has seven classes: basic material, energy, financial,
healthcare, technology, transportation and utilities. The                viewpoints, i.e. WebKB1 and WebKB2, may be different with
number of documents in this collection is 4,582. The third               a large gap. The most effective term weighting for all data
collection is 20Newsgroups (20News). The articles are                    sets is tf  idf  icf . It can be concluded that the icf
grouped into 20 different UseNet discussion groups. It
contains 19,997 documents and some groups are very similar.              express its usefulness when it is used to combine with
All experiments were performed as a closed test, i.e., we used the training set as the test set. Performance was measured by classification accuracy, defined as the ratio between the number of documents assigned to the correct class and the total number of test documents. As preprocessing, stop words (e.g., a, an, the) were excluded from all data sets. For the HTML-based data sets, all HTML tags (e.g., <B>, </HTML>) were removed from the documents to eliminate the effect of these common and typographic tokens. All headers were removed from the Newsgroups documents. A unigram model is applied in all experiments.
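A preprocessing step along these lines might look like the following sketch (our own approximation; the stop-word set shown is a tiny illustrative subset, not the list used in the paper):

```python
import re

STOP_WORDS = {"a", "an", "the"}      # illustrative subset only

TAG_RE = re.compile(r"<[^>]+>")      # matches HTML tags such as <B> or </HTML>

def preprocess(text):
    """Strip HTML tags and stop words; return a list of unigram tokens."""
    text = TAG_RE.sub(" ", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```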
Three experiments are performed. The first investigates the effect of inverse class frequency on the four data sets of the three collections: the $icf$ is multiplied into $tf$ and $tf \times idf$, with the standard term weighting $tf \times idf$ used as the baseline for comparison. The default term weighting for a test document is $tf \times idf$ in all experiments. In the second experiment, we investigate the effect of the three functions applied to $icf$. In the last experiment, two-step classification is performed, using one-class terms and two-class terms in the first step.
VI. EXPERIMENTAL RESULTS

A. Effect of ICF

In the first experiment, the effect of inverse class frequency is investigated on $tf$ and $tf \times idf$. Three term weightings, $tf \times idf$ (TFIDF), $tf \times icf$ (TFICF), and $tf \times idf \times icf$ (TFIDFICF), are applied to centroid-based classifiers. Four data sets are used for evaluation: WebKB1, WebKB2, 7Sectors, and 20Newsgroups. Table I shows the results in terms of classification accuracy.

TABLE I. EFFECT OF ICF ON THE FOUR DATA SETS

From the results, some observations can be made. The performance of a prototype vector with $tf \times idf$ is better than that of a prototype vector with $tf \times icf$ on three of the data sets, the only exception being 7Sectors. The performance of classifiers with $tf \times icf$ on the same collection but different viewpoints, i.e., WebKB1 and WebKB2, can differ by a large gap. The most effective term weighting for all data sets is $tf \times idf \times icf$. It can be concluded that $icf$ expresses its usefulness when it is combined with $tf \times idf$; in general, the $icf$ should not be used to replace the $idf$.

B. Functions to Modify ICF

In this experiment, three functions are applied to $icf$: $icf + 1$, $\sqrt{icf}$, and $icf^2$, denoted by (ICF+1), SqrtICF, and (ICF)^2, respectively. The results are shown in Table II; note that the result of TFIDFICF is repeated for comparison.

TABLE II. EFFECT OF DIFFERENT FUNCTIONS APPLIED TO ICF

Some observations on the functions applied to ICF can be made. The (ICF)^2 is the best on average over the four data sets. It improves the performance of the classifier on three of the four data sets, the exception being WebKB2; on 20News, (ICF)^2 improves performance by a gap of 4.43%. Although (ICF+1) gives the best performance on WebKB2, its average performance is lower than that of ICF. The average performance of SqrtICF is slightly lower than that of ICF.

C. Two-step Classification

In the last experiment, two-step classification is used, applying n-class terms in the first step. The values of n are 1 and 2, i.e., one-class terms and two-class terms are investigated. For n=1, only the set of one-class terms is used as the representation of the prototype and test document vectors; a test document that contains a set of one-class terms can be automatically classified to the class of those terms. In the case of n=2, the set of one-class terms and two-class terms is used as the representation.
The term weighting of the prototype vectors is $tf \times idf \times icf$. The remaining test documents are classified in the second step, where two term weightings are applied, i.e., $tf \times idf$ and $tf \times idf \times icf$. The results are shown in Table III.

TABLE III. TWO-STEP CLASSIFICATION WITH ONE-CLASS TERMS AND TWO-CLASS TERMS

The results show a large improvement in performance when two-step classification is applied. In the first step, the average classification accuracy on the four data sets is relatively high. Although the first-step performance of the one-class-terms representation is lower than that of the two-class-terms representation, the number of documents left for the second step is larger under the one-class-terms representation. As a consequence, the number of documents assigned to the correct class in the second step is larger for the one-class-terms representation than for the two-class-terms representation. When the classification process is complete, the classification accuracy over all documents in a collection is higher when beginning with the one-class-terms representation than when beginning with the two-class-terms representation. The results also show that the performance of classifiers with $tf \times idf$ and $tf \times idf \times icf$ in the second step is quite competitive. Overall, the performance of two-step classification is superior to that of single-step classification.
VII. CONCLUSION

This paper showed that class frequency is useful in centroid-based classification. Three approaches to class frequency were investigated in order to exploit information among classes in a systematic way, and the evaluation was conducted on several data sets. The first approach was to explore the effectiveness of inverse class frequency on the popular term weighting $tf \times idf$, as a replacement of $idf$ and as an addition to $tf \times idf$. The experimental results showed that the classification accuracy of a classifier using the term weighting $tf \times idf \times icf$ outperformed those using $tf \times idf$ and $tf \times icf$. The second approach was to evaluate some functions used to adjust the power of inverse class frequency. The results showed that a function of class frequency that both promotes and demotes terms is more effective; moreover, increasing the exponent of $icf$ improved classifier performance on several data sets. The other approach was to apply terms found in only one class or a few classes to improve classification performance, using two-step classification. The results clearly showed a large improvement in performance over single-step classification. In conclusion, class frequency has expressed its usefulness in text classification.

For future work, the effect of class frequency on other classification methods, and the performance of classifiers with class frequency on cross data sets, should be evaluated.

ACKNOWLEDGMENT

This work was funded by the Research and Development Institute, Silpakorn University, via research grant SURDI 53/01/12.

REFERENCES

[1] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell, “Text classification from labeled and unlabeled documents using EM,” Machine Learning, vol. 39, no. 2/3, pp. 103–134, 2000.
[2] M. Ruiz and P. Srinivasan, “Hierarchical text classification using neural networks,” Information Retrieval, vol. 5, no. 1, pp. 87–118, 2002.
[3] M. Kubat and M. Cooperson, Jr., “Voting nearest-neighbor subclassifiers,” in Proceedings of the 17th International Conference on Machine Learning, pp. 503–510, Morgan Kaufmann, San Francisco, CA, 2000.
[4] E.-H. Han and G. Karypis, “Centroid-based document classification: Analysis and experimental results,” in Proceedings of PKDD-00, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, Lyon, FR, pp. 424–431, Springer-Verlag, 2000.
[5] V. Lertnattee and T. Theeramunkong, “Effect of term distributions on centroid-based text categorization,” Information Sciences, vol. 158, pp. 89–115, 2004.
[6] T. Joachims, Learning to Classify Text using Support Vector Machines. Dordrecht, NL: Kluwer Academic Publishers, 2002.
[7] K. Cho and J. Kim, “Automatic text categorization on hierarchical category structure by using ICF (inverse category frequency) weighting,” in Proceedings of KISS-97, Conference of Korean Institute of Intelligent Systems, pp. 507–510, 1997.
[8] Y. Ko and J. Seo, “Automatic text categorization by unsupervised learning,” in Proceedings of COLING-00, the 18th International Conference on Computational Linguistics, pp. 453–459, Saarbrücken, DE, 2000.
[9] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information Processing and Management, vol. 24, no. 5, pp. 513–523, 1988.
[10] A. Singhal, G. Salton, and C. Buckley, “Length normalization in degraded text collections,” Tech. Rep. TR95-1507, 1995.





Contenu connexe

Tendances

Blei lafferty2009
Blei lafferty2009Blei lafferty2009
Blei lafferty2009
Ajay Ohri
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
Sihan Chen
 
Search Engines
Search EnginesSearch Engines
Search Engines
butest
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...
butest
 
Lecture Notes in Computer Science:
Lecture Notes in Computer Science:Lecture Notes in Computer Science:
Lecture Notes in Computer Science:
butest
 
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
RENDER project
 

Tendances (19)

A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...A comparative study on term weighting methods for automated telugu text categ...
A comparative study on term weighting methods for automated telugu text categ...
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...
 
Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis Indexing of Arabic documents automatically based on lexical analysis
Indexing of Arabic documents automatically based on lexical analysis
 
Blei lafferty2009
Blei lafferty2009Blei lafferty2009
Blei lafferty2009
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
graduate_thesis (1)
graduate_thesis (1)graduate_thesis (1)
graduate_thesis (1)
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
The vector space model
The vector space modelThe vector space model
The vector space model
 
The effect of training set size in authorship attribution: application on sho...
The effect of training set size in authorship attribution: application on sho...The effect of training set size in authorship attribution: application on sho...
The effect of training set size in authorship attribution: application on sho...
 
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
Efficient Feature Selection for Fault Diagnosis of Aerospace System Using Syn...
 
Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...Statistical Named Entity Recognition for Hungarian – analysis ...
Statistical Named Entity Recognition for Hungarian – analysis ...
 
Question Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical featuresQuestion Classification using Semantic, Syntactic and Lexical features
Question Classification using Semantic, Syntactic and Lexical features
 
Lecture Notes in Computer Science:
Lecture Notes in Computer Science:Lecture Notes in Computer Science:
Lecture Notes in Computer Science:
 
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
 
IRJET- Personality Recognition using Multi-Label Classification
IRJET- Personality Recognition using Multi-Label ClassificationIRJET- Personality Recognition using Multi-Label Classification
IRJET- Personality Recognition using Multi-Label Classification
 

En vedette

Development and Testing of Feedback Control System for Fused Deposition by El...
Development and Testing of Feedback Control System for Fused Deposition by El...Development and Testing of Feedback Control System for Fused Deposition by El...
Development and Testing of Feedback Control System for Fused Deposition by El...
IDES Editor
 

En vedette (9)

Application of Bio-Inspired Optimization Technique for Finding the Optimal se...
Application of Bio-Inspired Optimization Technique for Finding the Optimal se...Application of Bio-Inspired Optimization Technique for Finding the Optimal se...
Application of Bio-Inspired Optimization Technique for Finding the Optimal se...
 
Context-Aware Intrusion Detection and Tolerance in MANETs
Context-Aware Intrusion Detection and Tolerance in MANETsContext-Aware Intrusion Detection and Tolerance in MANETs
Context-Aware Intrusion Detection and Tolerance in MANETs
 
Illumination-robust Recognition and Inspection in Carbide Insert Production
Illumination-robust Recognition and Inspection in Carbide Insert ProductionIllumination-robust Recognition and Inspection in Carbide Insert Production
Illumination-robust Recognition and Inspection in Carbide Insert Production
 
Spread Spectrum Based Energy Efficient Wireless Sensor Networks
Spread Spectrum Based Energy Efficient Wireless Sensor NetworksSpread Spectrum Based Energy Efficient Wireless Sensor Networks
Spread Spectrum Based Energy Efficient Wireless Sensor Networks
 
Development and Testing of Feedback Control System for Fused Deposition by El...
Development and Testing of Feedback Control System for Fused Deposition by El...Development and Testing of Feedback Control System for Fused Deposition by El...
Development and Testing of Feedback Control System for Fused Deposition by El...
 
Differential Amplifiers in Bioimpedance Measurement Systems: A Comparison Bas...
Differential Amplifiers in Bioimpedance Measurement Systems: A Comparison Bas...Differential Amplifiers in Bioimpedance Measurement Systems: A Comparison Bas...
Differential Amplifiers in Bioimpedance Measurement Systems: A Comparison Bas...
 
Loss Reduction by Optimal Placement of Distributed Generation on a Radial feeder
Loss Reduction by Optimal Placement of Distributed Generation on a Radial feederLoss Reduction by Optimal Placement of Distributed Generation on a Radial feeder
Loss Reduction by Optimal Placement of Distributed Generation on a Radial feeder
 
Analysis and Simulation of Sub-threshold Leakage Current in P3 SRAM Cell at D...
Analysis and Simulation of Sub-threshold Leakage Current in P3 SRAM Cell at D...Analysis and Simulation of Sub-threshold Leakage Current in P3 SRAM Cell at D...
Analysis and Simulation of Sub-threshold Leakage Current in P3 SRAM Cell at D...
 
Power System State Estimation - A Review
Power System State Estimation - A ReviewPower System State Estimation - A Review
Power System State Estimation - A Review
 

Similaire à Using Class Frequency for Improving Centroid-based Text Classification

IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
onlmcq
 
Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...
ijbuiiir1
 
Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435
IJRAT
 

Similaire à Using Class Frequency for Improving Centroid-based Text Classification (20)

DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
 
Hc3612711275
Hc3612711275Hc3612711275
Hc3612711275
 
Ir 09
Ir   09Ir   09
Ir 09
 
LEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfLEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdf
 
Text Categorizationof Multi-Label Documents For Text Mining
Text Categorizationof Multi-Label Documents For Text MiningText Categorizationof Multi-Label Documents For Text Mining
Text Categorizationof Multi-Label Documents For Text Mining
 
Farthest Neighbor Approach for Finding Initial Centroids in K- Means
Farthest Neighbor Approach for Finding Initial Centroids in K- MeansFarthest Neighbor Approach for Finding Initial Centroids in K- Means
Farthest Neighbor Approach for Finding Initial Centroids in K- Means
 
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...IRS-Lecture-Notes irsirs    IRS-Lecture-Notes irsirs   IRS-Lecture-Notes irsi...
IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsirs IRS-Lecture-Notes irsi...
 
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen...
 
Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...Indian Language Text Representation and Categorization Using Supervised Learn...
Indian Language Text Representation and Categorization Using Supervised Learn...
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
Text categorization
Text categorizationText categorization
Text categorization
 
Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435
 
An Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationAn Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text Classification
 
3517 10753-1-pb
3517 10753-1-pb3517 10753-1-pb
3517 10753-1-pb
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly Information
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniques
 
Different Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using KnnDifferent Similarity Measures for Text Classification Using Knn
Different Similarity Measures for Text Classification Using Knn
 
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
Measure the Similarity of Complaint Document Using Cosine Similarity Based on...
 
UNIT 3 IRT.docx
UNIT 3 IRT.docxUNIT 3 IRT.docx
UNIT 3 IRT.docx
 

Plus de IDES Editor

Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
IDES Editor
 
Line Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFCLine Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFC
IDES Editor
 
Cloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability FrameworkCloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability Framework
IDES Editor
 
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
IDES Editor
 

Plus de IDES Editor (20)

Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Artificial Intelligence Technique based Reactive Power Planning Incorporating...Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
 
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
 
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
 
Line Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFCLine Losses in the 14-Bus Power System Network using UPFC
Line Losses in the 14-Bus Power System Network using UPFC
 
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
 
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
Assessing Uncertainty of Pushover Analysis to Geometric ModelingAssessing Uncertainty of Pushover Analysis to Geometric Modeling
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
 
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
 
Selfish Node Isolation & Incentivation using Progressive Thresholds
Selfish Node Isolation & Incentivation using Progressive ThresholdsSelfish Node Isolation & Incentivation using Progressive Thresholds
Selfish Node Isolation & Incentivation using Progressive Thresholds
 
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
 
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
 
Cloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability FrameworkCloud Security and Data Integrity with Client Accountability Framework
Cloud Security and Data Integrity with Client Accountability Framework
 
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP BotnetGenetic Algorithm based Layered Detection and Defense of HTTP Botnet
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
 
Enhancing Data Storage Security in Cloud Computing Through Steganography
Enhancing Data Storage Security in Cloud Computing Through SteganographyEnhancing Data Storage Security in Cloud Computing Through Steganography
Enhancing Data Storage Security in Cloud Computing Through Steganography
 
Low Energy Routing for WSN’s
Low Energy Routing for WSN’sLow Energy Routing for WSN’s
Low Energy Routing for WSN’s
 
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
 
Rotman Lens Performance Analysis
Rotman Lens Performance AnalysisRotman Lens Performance Analysis
Rotman Lens Performance Analysis
 
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral ImagesBand Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
 
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
 
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
information in text documents. Automated text categorization (also known as text classification) has become a significant tool for organizing text documents efficiently. A variety of classification methods have been developed and used in different schemes, such as probabilistic models [1], neural networks [2], example-based models (e.g., k-nearest neighbor) [3], linear models [4], [5], support vector machines [6] and so on.

Among these methods, a linear model called the centroid-based method is attractive, since it requires less computation than other methods in both the learning and classification stages. Despite the lower computation time, the centroid-based method has been shown in several studies, including [4], [5], to achieve relatively high accuracy. In this method, an individual class is modeled by weighting the terms appearing in the training documents assigned to that class. Classification performance therefore depends strongly on the term weighting applied in the model. Most previous work on centroid-based methods focused on weighting factors related to frequency patterns of terms or documents in a class; the two most popular factors are term frequency (tf) and inverse document frequency (idf). Some previous works, such as [7], [8], attempted to apply another factor called inverse class frequency (icf), but the impact of this factor needs more investigation.

patterns is given in Section III. Section IV describes three approaches for applying class frequency in text classification. The data sets and experimental settings are described in Section V. Section VI presents the experimental results, and Section VII draws the conclusion.
II. CENTROID-BASED TEXT CLASSIFICATION

In centroid-based text categorization, a document (or a class) is represented by a vector, using a vector space model with a bag of words (BOW) [9]. The simplest and most popular weighting applies term frequency (tf) and inverse document frequency (idf) in the form $tf \times idf$ as the term weight for representing a document. In a vector space model, given a set of documents $D = \{d_1, d_2, \ldots, d_{|D|}\}$, a document $d_j$ is represented by a document vector

$$\vec{d}_j = \{w_{1j}, w_{2j}, \ldots, w_{|T|j}\} = \{tf_{1j}\,idf_1,\ tf_{2j}\,idf_2,\ \ldots,\ tf_{|T|j}\,idf_{|T|}\},$$

where $w_{ij}$ is the weight assigned to a term $t_i$ in the set of terms $T$ of the document. In this definition, $tf_{ij}$ is the term frequency of a term $t_i$ in a document $d_j$, and $idf_i$ is the inverse document frequency, defined as $\log(|D|/df_i)$. Here, $|D|$ is the total number of documents in the collection and $df_i$ is the number of documents which contain the term $t_i$. Besides term weighting, normalization is another important factor in representing a document or a class. A class prototype $\vec{c}_k$ is obtained by summing all document vectors in $C_k$ and then normalizing the result by its size:

$$\vec{c}_k = \sum_{d_j \in C_k} \vec{d}_j \Big/ \Big\| \sum_{d_j \in C_k} \vec{d}_j \Big\|,$$

where $C_k = \{d_j \mid d_j \text{ is a document belonging to the class } c_k\}$. The simple term weighting is $\overline{tf} \times idf$, where $\overline{tf}$ is the average class term frequency of the term, $\overline{tf}_{ik} = \sum_{d_j \in C_k} tf_{ijk} / |C_k|$, and $|C_k|$ is the number of documents in class $c_k$. The term weighting described above can also be applied to a query or a test document; in general, the term weighting for a query is $tf \times idf$. Once a class prototype vector and a query vector have been constructed, the similarity between the two vectors can be calculated. The most popular measure is the cosine similarity [10], which can be computed as the dot product between the two normalized vectors. The test document is then assigned to the class whose prototype vector is most similar to the vector of the test document.
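To make the procedure concrete, the sketch below builds class prototypes with $\overline{tf} \times idf$ weights and assigns a test document by the dot product of two L2-normalized vectors, i.e., cosine similarity. It is a minimal illustration of the scheme just described, not the authors' implementation; the function names, the base-2 logarithm and the zero-norm guard are our assumptions.

```python
import math
from collections import Counter, defaultdict

def build_prototypes(docs, labels):
    """docs: list of token lists; labels: parallel list of class names.
    Returns (prototypes, idf): normalized tf-bar x idf class vectors and the idf table."""
    n_docs = len(docs)
    df = Counter()                                   # document frequency df_i
    for tokens in docs:
        df.update(set(tokens))
    idf = {t: math.log(n_docs / df[t], 2) for t in df}

    by_class = defaultdict(list)
    for tokens, c in zip(docs, labels):
        by_class[c].append(Counter(tokens))

    prototypes = {}
    for c, tf_list in by_class.items():
        tf_bar = defaultdict(float)                  # average class term frequency
        for tf in tf_list:
            for t, f in tf.items():
                tf_bar[t] += f / len(tf_list)
        vec = {t: v * idf[t] for t, v in tf_bar.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        prototypes[c] = {t: w / norm for t, w in vec.items()}
    return prototypes, idf

def classify(tokens, prototypes, idf):
    """Weight the test document with tf x idf and return the most similar class
    by the dot product of the two normalized vectors (cosine similarity)."""
    tf = Counter(tokens)
    query = {t: f * idf.get(t, 0.0) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in query.values())) or 1.0
    score = lambda proto: sum(w / norm * proto.get(t, 0.0) for t, w in query.items())
    return max(prototypes, key=lambda c: score(prototypes[c]))
```

For instance, with `prototypes, idf = build_prototypes([["drug", "dose"], ["law", "court"]], ["pharma", "legal"])`, the call `classify(["drug"], prototypes, idf)` returns `"pharma"`.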
III. TERM WEIGHTING IN CENTROID-BASED TEXT CLASSIFICATION BY FREQUENCY-BASED PATTERNS

This section presents the concept of term weighting in centroid-based text classification using frequency-based patterns. Before the details are described, some general characteristics of terms that are significant for representing a certain class are listed below:
• A term that is a good representative of a certain class should appear frequently, with a high occurrence frequency, within that class.
• An important term tends to occur in relatively few documents, while a general term tends to appear in most documents in a collection.
• A crucial term tends to occur in only one class or a few classes.

To exploit these characteristics, frequency-based patterns in a training collection can be applied. The three main frequency factors are term frequency, document frequency and class frequency. The simplest and most popular weighting applies term frequency (tf) and inverse document frequency (idf) in the form $tf \times idf$ to represent a document. In a centroid-based method, the term frequency of a class vector is the average of the term frequencies of the documents in the class; for this reason, we use the symbol $\overline{tf}$ instead of $tf$. It represents the average class term frequency in a class.
The $\overline{tf}$ is considered an intra-class factor: a term that is important for a specific class should have a high $\overline{tf}_{ik}$. The $\overline{tf}$ addresses the first property of significant terms for classification. Term frequency alone, however, may not be enough to represent the contribution of a term in a document. To achieve better performance, the well-known inverse document frequency can be applied to reduce the impact of frequent terms that occur in almost all documents. The $idf_i$ is the inverse ratio of the number of training documents that contain the term $t_i$ to the total number of training documents, and it is usually applied as the logarithm of its original value. It addresses the second property of significant terms. However, this tends to hold only when the documents containing a term belong to the same class; conversely, if a term is distributed uniformly over all classes, idf becomes useless and does not help classification. The idf is considered a collection factor, i.e., its value is the same for a particular term, independent of the arrangement of classes in the collection.

The first and second properties can be handled by the conventional term frequency and inverse document frequency, respectively. For the third property, however, it is necessary to utilize another frequency factor, and class frequency is expected to be useful here. Class frequency enables the classifier to utilize information among classes; it is considered an inter-class factor. Some collections may have several possible organizations of documents (viewpoints). For example, a collection may be grouped into a set of classes based on its content (e.g., course, faculty, project, student, ...) or by the university the pages belong to (e.g., Cornell, Texas, Washington, Wisconsin, ...). In these two cases, the collection factor (e.g., idf) of a term is identical, while the inter-class factor of the term varies. An important term of a specific category tends to occur in one class or only a few classes (the third property of significant terms); therefore, we can apply class frequency to represent the importance of such a term. In the past, some works utilized class frequency in general forms, such as the logarithm of the inverse class frequency (icf), an analogue of idf. In situations where idf cannot be calculated, such as in [8], it is possible to use a sentence as the processing unit instead of a document, so that icf replaces idf. Although icf seems useful, it has not been popular in term weighting.
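The three factors can be read directly off a training collection. The following toy computation (the four two-class documents are invented for illustration) extracts the intra-class factor $\overline{tf}_{ik}$, the collection factor $df_i$ and the inter-class factor $cf_i$, from which one-class, few-class and all-class terms can be identified.

```python
from collections import Counter, defaultdict

# Toy training collection (invented documents), as (tokens, class) pairs.
train = [
    (["drug", "dose", "trial"], "pharma"),
    (["drug", "patient", "dose"], "pharma"),
    (["court", "law", "drug"], "legal"),
    (["judge", "law"], "legal"),
]

df = Counter()                  # collection factor: document frequency df_i
classes_of = defaultdict(set)   # classes in which each term occurs
tf_sum = defaultdict(Counter)   # summed term frequencies per class
size = Counter()                # |C_k|: number of documents per class

for tokens, c in train:
    df.update(set(tokens))
    size[c] += 1
    tf_sum[c].update(tokens)
    for t in set(tokens):
        classes_of[t].add(c)

cf = {t: len(cs) for t, cs in classes_of.items()}     # inter-class factor cf_i
tf_bar = {c: {t: f / size[c] for t, f in cnt.items()}
          for c, cnt in tf_sum.items()}               # intra-class factor tf-bar_ik

print(cf["drug"], cf["court"])  # 2 1 -> "drug" is an all-class term in this
                                #        two-class set, "court" a one-class term
```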
IV. APPLYING CLASS FREQUENCY IN TEXT CLASSIFICATION

In this paper, the usefulness of class frequency is presented both conceptually and experimentally. Three approaches are taken into account: (1) the effectiveness of icf in term weighting; (2) several functions applied to icf to enhance classification performance; and (3) an algorithm that applies terms found in only one class or a few classes to improve classification performance, using two steps of classification. The details are described as follows.

A. The Effectiveness of ICF in Term Weighting

The popular way to utilize class frequency is a logarithmic function of the inverse class frequency:

$$icf_i = \log \frac{|C|}{cf_i} \qquad (1)$$

Here, $cf_i$ is the number of classes that contain a term $t_i$. The effect of icf is to promote a term which occurs in only a few classes (later called few-class terms) and to demote a term which appears in many classes (later called most-class terms). In the extreme case where a term occurs in only one class ($cf_i = 1$), the function attains its maximum value; such terms are called one-class terms. We believe that few-class terms are quite important for classifying documents, and inverse class frequency promotes the importance of these terms. Conversely, when a term occurs in all classes ($cf_i = |C|$), the function attains its minimum value, i.e., 0; such terms are called all-class terms. They are considered less important and may cause the classifier to misclassify a document, although their usefulness also depends on their term frequencies. The promotion of few-class terms is more effective than the demotion of most-class terms, since demotion may suppress terms that occur in all classes but are still useful for classifying documents.

The icf can be included in term weighting in two simple patterns. The first pattern is $\overline{tf} \times icf$, i.e., substituting icf for idf. The other pattern is $\overline{tf} \times idf \times icf$, i.e., including icf in the $\overline{tf} \times idf$. In a normal situation, the number of documents is greater than the number of classes ($|D| > |C|$), which means idf promotes the importance of terms more strongly than icf. When the logarithm of base 2 is used, the value of icf for a term equals 1 when the term is found in half of the classes in a data set ($cf_i = |C|/2$); therefore, $icf > 1$ when $cf_i < |C|/2$ and $icf < 1$ when $cf_i > |C|/2$. When the value of icf is more than 1, it promotes $\overline{tf}$ and $\overline{tf} \times idf$; on the contrary, it demotes them when its value is less than 1.

B. Functions to Modify ICF

In this approach, several functions are applied to icf to adjust the level of promotion and demotion applied to $\overline{tf}$ and $\overline{tf} \times idf$. The first pattern adjusts the power of icf. When the term weighting is $\overline{tf} \times idf \times icf$, the new term weighting becomes

$$\overline{tf} \times idf \times icf^{\alpha} \qquad (2)$$

Here, $\alpha$ is the power of icf. The icf promotes the $\overline{tf} \times idf$ when its value is greater than 1. If the power of icf is greater than 1, the weighting promotes or demotes a term more strongly than the normal icf; when the power is less than 1, it promotes or demotes a term less strongly. The optimum value of $\alpha$ depends on the data set. Under the first pattern, a term may be either promoted or demoted. The second pattern considers only the promotion of a term; an example term weighting formula is

$$\overline{tf} \times idf \times (icf + 1) \qquad (3)$$

Under this equation, terms that are found in all classes are given a term weight of $\overline{tf} \times idf$, while all other terms are given values higher than $\overline{tf} \times idf$.
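Below is a sketch of Eqs. (1)-(3) as multipliers on a $\overline{tf} \times idf$ weight; the base-2 logarithm follows the discussion in Section IV-A, while the function and parameter names (`weight`, `alpha`, `promote_only`) are ours.

```python
import math

def icf(cf_i, n_classes):
    """Inverse class frequency, Eq. (1), using a base-2 logarithm."""
    return math.log(n_classes / cf_i, 2)

def weight(tf_bar, idf_i, cf_i, n_classes, alpha=1.0, promote_only=False):
    """tf-bar x idf x icf^alpha (Eq. 2), or tf-bar x idf x (icf + 1) (Eq. 3)."""
    base = icf(cf_i, n_classes)
    if promote_only:                       # Eq. (3): all-class terms keep tf-bar x idf
        return tf_bar * idf_i * (base + 1.0)
    return tf_bar * idf_i * base ** alpha  # alpha > 1 strengthens promotion/demotion

# A one-class term in an 8-class data set is strongly promoted, an all-class
# term is zeroed out; squaring (alpha = 2) widens that gap further.
print(icf(1, 8), icf(8, 8))               # 3.0 0.0
print(weight(2.0, 1.5, 2, 8, alpha=2.0))  # 2.0 * 1.5 * 2.0**2 = 12.0
```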
C. Two-step Classification

In this approach, the class frequency of a term is applied to select the representations of the class prototype and the test document vectors. The representative vectors are based on n-class terms: a representative vector of n-class terms is built from terms which occur in $1, 2, \ldots, n$ classes, where $1 \le n \le |C|$. In the first step of classification, $n$ is set, and a test document is classified to an appropriate class based on its n-class terms. The remaining test documents, which have no n-class terms in their representation, are classified using all terms in the second step. For example, when $n = 1$, the representation vectors of prototypes and test documents consist only of terms that are found in a single class; a test document that contains one-class terms of a class is automatically assigned to that class in the first step. In the second step, the remaining test documents are classified using all terms.
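A sketch of this two-step control flow is given below, reusing the `classify` helper from Section II. The majority vote over several one-class terms is our reading of the assignment rule, and `term_class` is our own bookkeeping that maps each one-class term to its single class.

```python
from collections import Counter

def two_step_classify(tokens, prototypes, idf, cf, term_class, n=1):
    """Step 1: decide from n-class terms only (terms with cf_i <= n).
    Step 2: fall back to the full vocabulary for documents with no such terms."""
    n_class_terms = [t for t in tokens if 0 < cf.get(t, 0) <= n]
    if n_class_terms:
        if n == 1:
            # A one-class term identifies its class directly; with several such
            # terms we take a majority vote (one reasonable reading of the paper).
            votes = Counter(term_class[t] for t in n_class_terms)
            return votes.most_common(1)[0][0]
        return classify(n_class_terms, prototypes, idf)  # step 1, n >= 2
    return classify(tokens, prototypes, idf)             # step 2: all terms
```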
V. EXPERIMENTAL SETTING

To evaluate our concept of class frequency, three collections called WebKB, 7Sectors and 20Newsgroups are used. WebKB is a collection of web pages of computer science departments in four universities, with some additional pages from other universities. This collection can be viewed from two dimensions. In our experiments, we use the four most popular classes: student, faculty, course and project, comprising 4,199 web pages. Within each class, five subclasses (the 2nd dimension) are defined according to the university a web page belongs to: Cornell, Texas, Washington, Wisconsin and miscellaneous. We use this collection in two viewpoints: the first dimension (WebKB1) and the second dimension (WebKB2). The total number of data sets for this collection is therefore two, so the effect of inverse class frequency with different numbers of classes on the same collection can be evaluated. The second collection is 7Sectors, a data set from the CMU World Wide Knowledge Base Project of the CMU Text Learning Group. This data set has seven classes: basic material, energy, financial, healthcare, technology, transportation and utilities; the number of documents in this collection is 4,582. The third collection is 20Newsgroups (20News). Its articles are grouped into 20 different UseNet discussion groups; it contains 19,997 documents, and some groups are very similar.

All experiments were performed as a closed test, i.e., the training set was also used as the test set. Performance was measured by classification accuracy, defined as the ratio between the number of documents assigned the correct class and the total number of test documents. As preprocessing, stop words (e.g., a, an, the) were excluded from all data sets. For the HTML-based data sets, all HTML tags (e.g., <B>, </HTML>) were removed from the documents to eliminate the effect of these common and typographic tokens. All headers were removed from the Newsgroups documents. A unigram model is applied in all experiments.

Three experiments are performed. The first investigates the effect of inverse class frequency on the four data sets of the three collections; the icf is multiplied with $\overline{tf}$ and $\overline{tf} \times idf$, and the standard term weighting, $\overline{tf} \times idf$, is used as the baseline for comparison. The default term weighting for a test document is $tf \times idf$ in all experiments. In the second experiment, we investigate the effects of the three functions applied to icf. In the last experiment, two-step classification is performed using one-class terms and two-class terms in the first step.
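A minimal version of this preprocessing and of the closed-test accuracy measure is shown below; the three-word stop list and the regular expressions are illustrative stand-ins for the actual resources used in the paper.

```python
import re

STOP_WORDS = {"a", "an", "the"}        # illustrative subset of a real stop list

def preprocess(text):
    """Strip HTML tags, lowercase, split into unigrams, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)         # removes tags such as <B>, </HTML>
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def accuracy(predicted, gold):
    """Closed-test accuracy: correctly classified documents / all documents."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

print(preprocess("<B>The drug dose</B>"))   # ['drug', 'dose']
```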
VI. EXPERIMENTAL RESULTS

A. Effect of ICF

In the first experiment, the effect of inverse class frequency is investigated on $\overline{tf}$ and $\overline{tf} \times idf$. Three term weightings, i.e., $\overline{tf} \times idf$ (TFIDF), $\overline{tf} \times icf$ (TFICF) and $\overline{tf} \times idf \times icf$ (TFIDFICF), are applied to centroid-based classifiers. Four data sets are used for evaluation: WebKB1, WebKB2, 7Sectors and 20Newsgroups. Table I shows the results in the form of classification accuracy.

TABLE I. EFFECT OF ICF ON THE FOUR DATA SETS

From the results, some observations can be made. The performance of a prototype vector with $\overline{tf} \times idf$ is better than that of a prototype vector with $\overline{tf} \times icf$ on three data sets, the only exception being 7Sectors. The performance of classifiers with $\overline{tf} \times icf$ on the same collection but different viewpoints, i.e., WebKB1 and WebKB2, may differ by a large gap. The most effective term weighting on all data sets is $\overline{tf} \times idf \times icf$. It can be concluded that icf expresses its usefulness when it is combined with $\overline{tf} \times idf$; in general, it should not be used to replace the idf.

B. Functions to Modify ICF

In this experiment, three functions are applied to icf, namely $icf + 1$, $\sqrt{icf}$ and $icf^2$, denoted by (ICF+1), SqrtICF and (ICF)^2, respectively. The results are shown in Table II; the result of TFIDFICF is repeated for comparison.

TABLE II. EFFECT OF DIFFERENT FUNCTIONS APPLIED TO ICF

Some observations on the functions applied to ICF can be made. The (ICF)^2 is the best on average over the four data sets; it improves the performance of the classifier on three of the four data sets, with the exception of WebKB2. On 20News, (ICF)^2 improves the performance by a gap of 4.43%. Although the classifier performs best with (ICF+1) on WebKB2, its average performance is lower than that of ICF. The average performance of the classifier with SqrtICF is slightly lower than that of ICF.

C. Two-step Classification

In the last experiment, two-step classification is used, applying n-class terms in the first step. The values of n are 1 and 2, i.e., one-class terms and two-class terms are investigated. For n = 1, only the set of one-class terms is used as the representation of the prototype and test document vectors; a test document that contains a set of one-class terms can be automatically classified to a class by those terms. In the case of n = 2, the set of one-class and two-class terms is used as the representation. The term weighting of the prototype vector is $\overline{tf} \times idf \times icf$. The remaining test documents are classified in the second step, where two term weightings are applied, i.e., $\overline{tf} \times idf$ and $\overline{tf} \times idf \times icf$. The results are shown in Table III.

TABLE III. TWO-STEP CLASSIFICATION WITH ONE-CLASS TERMS AND TWO-CLASS TERMS

The results show a large improvement in performance when two-step classification is applied. In the first step, the average classification accuracy on the four data sets is relatively high. Although the first-step performance with the one-class-term representation is lower than with the two-class-term representation, the number of remaining documents under the one-class-term representation is larger; consequently, the number of documents assigned the correct class in the second step is larger for the one-class-term representation. When the classification process is completed, the overall classification accuracy on all documents in a collection is higher when beginning with one-class terms than when beginning with two-class terms. The performance of classifiers with $\overline{tf} \times idf$ and $\overline{tf} \times idf \times icf$ in the second step is quite competitive. Overall, the performance of two-step classification is superior to that of single-step classification.
VII. CONCLUSION

This paper showed that class frequency is useful in centroid-based classification. Three approaches using class frequency were investigated to exploit information among classes in a systematic way, and the evaluation was conducted on various data sets. The first approach explored the effectiveness of inverse class frequency in the popular term weighting, i.e., $\overline{tf} \times idf$, as a replacement of idf and as an addition to $\overline{tf} \times idf$. The experimental results showed that the classification accuracy of a classifier using the term weighting $\overline{tf} \times idf \times icf$ outperformed those using $\overline{tf} \times idf$ and $\overline{tf} \times icf$. The second approach evaluated several functions used to adjust the power of the inverse class frequency. From the results, a function of class frequency that both promotes and demotes terms was more effective; moreover, increasing the exponent of icf improved classifier performance on several data sets. The third approach applied terms found in only one class or a few classes to improve classification performance, using two-step classification; the results clearly showed a large improvement in performance over single-step classification. In conclusion, class frequency expressed its usefulness in text classification.

For future work, the effect of class frequency on other classification methods, and the performance of classifiers with class frequency on cross data sets, should be evaluated.

ACKNOWLEDGMENT

This work was funded by the Research and Development Institute, Silpakorn University, via research grant SURDI 53/01/12.
424–431, term weighting, i.e., tf  idf , as a replacement of idf and Springer-Verlag Publisher, 2000. [5] V. Lertnattee  and  T. Theeramunkong,  “Effect  of  term an addition to tf  idf . The experimental results showed distributions on centroid-based text categorization,” Information that classification accuracy of a classifier using the term Sciences, vol. 158, pp. 89–115, 2004. [6] T. Joachims, Learning to Classify Text using Support Vector weighting of tf  idf  icf , outperformed those of Machines. Dordrecht, NL: Kluwer Academic Publishers, 2002. [7] K. Cho  and  J. Kim,  “Automatic  text  categorization  on tf  idf and tf  icf . The second approach was to evalu- hierarchical category structure by using icf (inverse category frequency) weighting.,” in Proceedings of KISS-97, Conference of ate some functions, which were used to adjust the power of Korean Institute of Intelligent Systems, pp. 507–510, 1997. inverse class frequency. [8] Y. Ko  and  J. Seo,  “Automatic  text  categorization  by TABLE III. TWO -STEP CLASSIFICATION WITH O NE-CLASS TERMS unsupervised learning,” in Proceedings of COLING-00, the 18th AND T WO-CLASS T ERMS International Conference on Computational Linguistics, pp. 453- 459, Saarbrücken, DE, 2000. [9] G. Salton  and  C. Buckley,  “Term-weighting  approaches  in automatic text retrieval,” Information Processing and Management, vol. 24,  no. 5, pp. 513–523,  1988. [10] A. Singhal, G. Salton, and  C. Buckley, “Length  normalization in degraded text collections,” Tech. Rep. TR95-1507, 1995. © 2012 ACEEE 66 DOI: 01.IJIT.02.02.57