Project Report on

Building Descriptor Based SVM for Document Categorization

Student: Naveen Kumar Ratkal
Advisor: Dr. Zubair
ABSTRACT
INTRODUCTION
SUPPORT VECTOR MACHINE
   Introduction
   SVM Model
SVM IMPLEMENTATION
   Transformation
   Scaling
   RBF Kernel
APPROACH
ARCHITECTURE
   SVM Based Architecture
   Training
   Assignment
EXPERIMENTS AND RESULTS
CONCLUSION
REFERENCE
Abstract
Automated document categorization has been used and studied extensively, and various approaches have been implemented. The Support Vector Machine (SVM) is one machine learning technique that is promising for text categorization. Even so, text categorization systems are not widespread in practice; one reason is that very often it is not clear how to adapt a machine learning approach to an organization's collection and its unique needs. DTIC currently uses a dictionary-based approach to assign DTIC thesaurus descriptors to an acquired document based on its title and abstract; the thesaurus currently consists of around 14K descriptors. The main objective of this project is to build a framework that automatically downloads documents from the DTIC website, OCRs (Optical Character Recognition) the downloaded files, and finally trains SVMs on these files using positive documents and negative documents (the latter taken from the positive sets of the other classes).

Introduction
Automated document categorization has been extensively studied, and a good survey article [1] discusses the evolution of various techniques for document categorization, with particular focus on machine learning approaches. One of the machine learning techniques, the Support Vector Machine (SVM), is promising for text categorization (Dumais 1998, Joachims 1998). Dumais et al. evaluated SVMs on the Reuters-21578 collection and found them to be the most accurate for text categorization and quick to train.

The automatic text categorization area has matured, and a number of experimental prototypes are available. However, most of these prototypes, built to evaluate different techniques, have been restricted to a standard collection such as Reuters. As has been pointed out, commercial text categorization systems are not widespread. One of the reasons is that very often it is not clear how to adapt a machine learning approach to an organization's collection with its unique needs.

Here we present a framework we have created that allows one to evaluate various machine learning approaches to the categorization problem for a specific collection. Specifically, we evaluate the applicability of SVMs for Defense Technical Information Center (DTIC) needs. DTIC currently uses a dictionary-based approach to assign DTIC thesaurus descriptors to an acquired document based on its title and abstract. The descriptor assignment is validated by humans, and one or more fields/groups (subject categorization) are assigned to the target document. The thesaurus currently consists of around 14K descriptors. For subject categorization, DTIC uses 25 main categories (fields) and 251 subcategories (groups). Typically a document is assigned two or three fields/groups. Further, a document is also assigned five or six descriptors from the DTIC thesaurus.

The dictionary-based approach currently used by DTIC relies on mapping document terms to DTIC thesaurus descriptors and is not efficient, as it requires continuous updates of the mapping table. It also suffers from the poor quality of the descriptors assigned this way. There is a need to improve this process in terms of both time and the quality of the descriptors assigned to documents.

Support Vector Machine
Introduction

The Support Vector Machine (SVM) was introduced by V. Vapnik in the late 1970s. It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, and text categorization [11]. The goal of text categorization is the classification of documents into a fixed number of predefined categories; each document can belong to multiple categories, exactly one, or none at all [8]. Text categorization with SVMs eliminates the need for difficult, time-consuming manual classification by learning classifiers from examples and performing the category assignments automatically. This is a supervised learning problem.

SVMs are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence the SVM is also known as a maximum-margin classifier [8].

The main idea of the Support Vector Machine is to find an optimal hyperplane that separates two classes with the largest margin from pre-classified data [12]. The use of the maximum-margin hyperplane is motivated by Vapnik-Chervonenkis theory, which provides a probabilistic test error bound that is minimized when the margin is maximized [1].

SVM Model

SVM models can be divided into four distinct groups based on the error function:
(1) C-SVM classification
(2) nu-SVM classification
(3) epsilon-SVM regression
(4) nu-SVM regression
Compared to regular C-SVM, the formulation of nu-SVM is more complicated, so up to
now there have been no effective methods for solving large-scale nu-SVM [3]. We are
using C-SVM classification in our project. For C-SVM, training involves the
minimization of the error function

    (1/2) wᵀw + C Σᵢ ξᵢ

subject to the constraints

    yᵢ (wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ,   ξᵢ ≥ 0,   i = 1, …, N

where C is the capacity constant (the penalty parameter of the error function), w is the vector of coefficients, b is a constant, and the ξᵢ are slack variables for handling non-separable data (inputs). The index i labels the N training cases, yᵢ ∈ {−1, +1} are the class labels, and xᵢ are the independent variables. The kernel, through the feature map φ, is used to transform data from the input (independent variable) space to the feature space [13].

SVM kernels:

Kernel methods are a class of algorithms whose task is to detect and exploit complex patterns in data (e.g., by clustering, classifying, ranking, or cleaning the data). The typical problems are how to represent complex patterns, which is a computational problem, and how to exclude spurious, unstable patterns (i.e., overfitting), which is a statistical problem [13].

The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data, for example, similarity between documents by length, topic, language, etc. Kernel methods exploit information about the inner products between data items: many standard algorithms can be rewritten so that they only require inner products between data (inputs). When a kernel is given, there is no need to specify what features of the data are being used [13].

Kernel functions correspond to inner products in some (potentially very complex) feature space.

In SVM there are four common kernels:

   •   linear
   •   polynomial
   •   radial basis function (RBF)
   •   sigmoid

In general, RBF is a reasonable first choice: the kernel matrix produced by the sigmoid kernel may not be positive definite and its accuracy is generally not better than RBF's [12]; the linear kernel is a special case of RBF; and the polynomial kernel may have numerical difficulties if a high degree is used. In this implementation we use an RBF kernel, whose kernel function is:

    K(x, y) = exp(−γ ||x − y||²)

Kernel trick:

In machine learning, the kernel trick is a method for easily converting a linear classifier algorithm into a non-linear one, by mapping the original observations into a higher-dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space [1]. Let us look at this with an example. Here we are interested in classifying data as part of a machine learning process. These data points may not necessarily be points in ℝ² but may be multidimensional ℝⁿ points [1]. We are interested in whether we can separate them by an (n−1)-dimensional hyperplane; this is a typical form of linear classifier. Let us consider a two-class, linearly separable classification problem as below:




Figure 1: Linear Classifier [12].

But not all classification problems are as simple as this.




Figure 2: Non Linear Input space [12].

For complex problems, where the input space is not linearly separable, the solution is a complex curve, as below.




Figure 3: Kernel Trick [12].

In the case of such a non-linear input space, the data space is mapped into a linear feature space. This mapping into the feature space is performed by a class of algorithms called kernels.


Good Decision Boundary:
As discussed above, the input data points may not necessarily be points in ℝ² but may be multidimensional ℝⁿ (computer science notation) points. We are interested in whether we can separate them by an (n−1)-dimensional hyperplane [1]. This is a typical form of linear classifier, and there are many linear classifiers that satisfy this property: a given set of linearly separable data points may have more than one margin or boundary that separates them. A Perceptron algorithm can be used to find these multiple boundaries.




Figure 3: Good Decision Boundary (separating Class 1 and Class 2)

SVM resolves this ambiguity by finding the maximum separation (margin) between the two classes; the resulting separator is known as the maximum-margin hyperplane.




Figure 4: Maximum-margin hyperplane

SVM Implementation
The SVM implementation in this project follows these steps [7]:

   •   Transform the data into the format of an SVM software package
   •   Conduct simple scaling/normalization on the data
   •   Consider the RBF kernel
   •   Find the best parameters C and gamma (a search sketch follows this list)
   •   Use the best parameters C and gamma to train on the training set
   •   Test
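The report does not reproduce the parameter-search code itself; the sketch below shows one way to carry out the fourth step with the LIBSVM Java API (svm_problem, svm_parameter, svm.svm_cross_validation), assuming the scaled training data has already been loaded into an svm_problem. The exponential grid ranges follow common LIBSVM practice, not values taken from this project.

    import libsvm.*;

    // Sketch: pick (C, gamma) for the RBF kernel by 5-fold cross validation
    // over an exponential grid. `prob` is assumed to hold the scaled data.
    public final class GridSearch {
        public static double[] bestParams(svm_problem prob) {
            double bestC = 1, bestGamma = 1, bestAccuracy = -1;
            for (int logC = -5; logC <= 15; logC += 2) {
                for (int logGamma = -15; logGamma <= 3; logGamma += 2) {
                    svm_parameter param = new svm_parameter();
                    param.svm_type = svm_parameter.C_SVC;
                    param.kernel_type = svm_parameter.RBF;
                    param.C = Math.pow(2, logC);
                    param.gamma = Math.pow(2, logGamma);
                    param.cache_size = 100; // kernel cache, in MB
                    param.eps = 1e-3;       // stopping tolerance
                    double[] predicted = new double[prob.l];
                    svm.svm_cross_validation(prob, param, 5, predicted);
                    int correct = 0;
                    for (int i = 0; i < prob.l; i++)
                        if (predicted[i] == prob.y[i]) correct++;
                    double accuracy = (double) correct / prob.l;
                    if (accuracy > bestAccuracy) {
                        bestAccuracy = accuracy;
                        bestC = param.C;
                        bestGamma = param.gamma;
                    }
                }
            }
            return new double[] { bestC, bestGamma }; // feed into svm.svm_train
        }
    }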

Transformation
There are many existing SVM libraries we can use, such as rainbow [9] and LIBSVM. LIBSVM is integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR), etc. [5]. In this project we selected LIBSVM for the following reasons:
(1) cross validation for model selection
(2) support for different SVM formulations
(3) Java sources

The data format supported by LIBSVM represents every document as a line of feature id/value pairs:

[label] [feature Id]:[feature Value] [feature Id]:[feature Value] ...

The label is the class ({0, 1} or {−1, +1}) that the document belongs to.
The input here is a text file containing the articles or the metadata of the images, whereas LIBSVM expects data files in the format described above. The input files are transformed into the LIBSVM format by processing each document as follows: every document is scanned to pick out features (key words that represent the document), and each term's frequency (TF) and inverse document frequency (IDF) are calculated. These are then used to compute the weight of every term in the training set.
There are tools, such as Lucene, that calculate the TF and DF for a given document and create its SVM data model. In this project I developed my own API to calculate, compute, and create the SVM data files (based on the required feature sets).
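The project's own conversion API is not listed in this report; purely as an illustration, a document whose per-term TF-IDF weights have already been computed could be serialized into the LIBSVM line format as follows (the term-to-feature-id dictionary and all names are hypothetical):

    import java.util.Map;
    import java.util.TreeMap;

    // Sketch: serialize one document as "[label] [featureId]:[featureValue] ...".
    // LIBSVM expects feature ids in ascending order, hence the TreeMap copy.
    public final class LibSvmLine {
        public static String format(int label, Map<Integer, Double> weightById) {
            StringBuilder line = new StringBuilder().append(label);
            for (Map.Entry<Integer, Double> e : new TreeMap<>(weightById).entrySet()) {
                line.append(' ').append(e.getKey()).append(':').append(e.getValue());
            }
            return line.toString();
        }
    }

For example, format(1, Map.of(3, 0.12, 1, 0.7)) yields "1 1:0.7 3:0.12".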

Scaling
The weighted frequencies are scaled so that attributes in greater numeric ranges do not dominate those in smaller ranges.
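The report does not specify the target range; a minimal sketch of min-max scaling to [0, 1], applied per attribute, would be:

    // Sketch: rescale each attribute (column) to [0, 1] so that attributes with
    // large numeric ranges do not dominate the kernel distance computation.
    public static void minMaxScale(double[][] data) {
        for (int j = 0; j < data[0].length; j++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            if (max > min) {
                for (double[] row : data) row[j] = (row[j] - min) / (max - min);
            }
        }
    }

The same per-attribute minimum and maximum must be reused when scaling test documents; otherwise the training and test features end up in different ranges.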

RBF Kernel
In this project we use the RBF kernel, as it is a simple model to start with. The number of hyperparameters influences the complexity of model selection, and the polynomial kernel has more hyperparameters than the RBF kernel.




Approach
There are several ways one can apply SVMs to the problem of assigning fields/groups and descriptors to new documents based on a learning set. In one approach, we would treat the problems of categorization and descriptor selection as independent problems, each solved by independent SVMs. A combined approach would start with either the categorization problem or the descriptor problem and then solve the other as a restricted-domain problem. For instance, we could solve the categorization problem first and then, for the resulting specific field/group, solve the descriptor selection problem for the document at hand. Here, we discuss the "independent" approach and how to resolve inconsistencies that can result when the two problems are solved independently. An equally important task, which is not the focus of this paper, is to study different ways to apply the SVM approach, their tradeoffs, and possibly other machine learning techniques that can yield better quality descriptor and fields/groups mappings. The overall approach consists of the following major steps.

Step 1. We use the existing DTIC collection for the training phase. For this, we first need a representation of a document. A document is represented by a vector of weighted terms, where the weights are determined by term frequency and inverse document frequency, a standard technique in the IR area. Using this representation, we train SVMs for the 251 fields/groups and 14000 descriptors.
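The exact weighting formula is not given in the report; a standard TF-IDF weighting consistent with this description is

    w(t, d) = tf(t, d) · log(N / df(t))

where tf(t, d) is the frequency of term t in document d, df(t) is the number of documents containing t, and N is the number of documents in the collection.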

Step 2. We use the trained SVMs to identify the fields and groups for a document. We also assign a likelihood factor (varying from 0 to 1) to each assigned field/group based on the document's distance from the hyperplane (see the SVM background above). We sort the assigned fields/groups by likelihood factor and select the first k fields/groups based on a threshold.

Step 3. Similar to Step 2, we identify descriptors, sort them, and select m descriptors based on a threshold.

Step 4. Note that the fields/groups identified in Step 2 and the descriptors identified in Step 3 may be inconsistent; that is, we may have a field/group assignment for a new document without its descriptor as identified by the DTIC thesaurus. One straightforward way to resolve this is to use the intersection of the descriptors implied by the fields/groups mapping and the descriptors identified by the SVMs. The likelihood factor can then be used to select a few fields/groups (around two or three) and five or six descriptors.
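A minimal sketch of this reconciliation step, assuming a (hypothetical) map from each field/group to the thesaurus descriptors it covers and a map of per-descriptor likelihood factors produced by the SVMs:

    import java.util.*;

    // Sketch of Step 4: keep only SVM-selected descriptors consistent with the
    // assigned fields/groups, then return the top-m by likelihood factor.
    // All inputs are hypothetical; every descriptor is assumed to have a likelihood.
    public static List<String> reconcile(Set<String> svmDescriptors,
                                         Set<String> assignedFields,
                                         Map<String, Set<String>> fieldToDescriptors,
                                         Map<String, Double> likelihood,
                                         int m) {
        Set<String> allowed = new HashSet<>();
        for (String field : assignedFields)
            allowed.addAll(fieldToDescriptors.getOrDefault(field, Collections.emptySet()));
        List<String> consistent = new ArrayList<>(svmDescriptors);
        consistent.retainAll(allowed); // intersection of the two descriptor sets
        consistent.sort(Comparator.comparing(likelihood::get).reversed());
        return consistent.subList(0, Math.min(m, consistent.size()));
    }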

As we are using SVMs to classify a large number of classes (251 + 14000), there can be performance issues. As part of this work we investigate ways to improve the performance without sacrificing the quality of the fields/groups and descriptor assignments. In this paper, we focus on Step 3 of the overall process, with particular attention on the automated framework that allows a collection to be analyzed, trained, processed, and the results presented.


Architecture
SVM Based Architecture
The process of identifying descriptors for a document consists of two phases: a training phase and an assignment phase. In the training phase, we train as many SVMs as there are thesaurus terms. In the assignment phase, we present a document to all the trained SVMs and select m descriptors based on a threshold. We now give details of the two processes.

Training




As the DTIC thesaurus consists of 14000 terms, the training process to build 14000 SVMs needs to be automated. Figure 2 illustrates the training process. For each thesaurus term, we construct a URL and use it to search for documents from the DTIC collection in that category. For our testing purposes we have enabled this process for only five thesaurus terms. From the search results, we download a randomly selected subset of the documents. These documents, which are in PDF format, are converted to text using OmniPage, an OCR engine. We use the traditional IR document model, which is based on term frequency (TF) and document frequency (DF); in this project, we use the Apache Lucene package to determine TF and DF and build a document representation suitable for training an SVM. These documents form the positive training set for the selected thesaurus term. The negative training set is created by randomly selecting documents from the positive training sets of terms other than the selected term.
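A minimal sketch of this negative-set construction (names and the sampling detail are illustrative, not from the project code):

    import java.util.*;

    // Sketch: build the negative training set for `term` by sampling documents
    // at random from the positive sets of all other thesaurus terms.
    public static List<String> negativeSet(String term,
                                           Map<String, List<String>> positiveSets,
                                           int size, Random rng) {
        List<String> pool = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : positiveSets.entrySet())
            if (!e.getKey().equals(term)) pool.addAll(e.getValue());
        Collections.shuffle(pool, rng);
        return pool.subList(0, Math.min(size, pool.size()));
    }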
Assignment




Figure 3 illustrates the assignment phase. The input document is represented using TF and DF, as in the training phase. The document is presented to all the trained SVMs, each of which outputs an estimate in the range 0 to 1 indicating how likely its term is to apply to the test document. Based on a threshold, we can then assign m thesaurus terms (descriptors) to the test document.
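LIBSVM can produce such 0-to-1 estimates when its models are trained with probability output enabled (param.probability = 1). The sketch below shows an assignment loop under that assumption; the per-descriptor model map and the threshold are hypothetical inputs:

    import libsvm.*;
    import java.util.*;

    // Sketch: present one document (sorted svm_node features) to every trained
    // per-descriptor SVM and keep descriptors whose estimate passes the threshold.
    public static List<String> assignDescriptors(svm_node[] doc,
                                                 Map<String, svm_model> models,
                                                 double threshold) {
        List<String> assigned = new ArrayList<>();
        for (Map.Entry<String, svm_model> e : models.entrySet()) {
            double[] probs = new double[2]; // ordered by the model's label array;
                                            // index 0 is assumed to be the positive class
            svm.svm_predict_probability(e.getValue(), doc, probs);
            if (probs[0] >= threshold) assigned.add(e.getKey());
        }
        return assigned;
    }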

Experiments and Results
We use the recall and precision metrics commonly used by the information extraction and data mining communities. The general definitions of recall and precision are:

Recall = Correct Answers / Total Possible Answers
Precision = Correct Answers / Answers Produced

To compute recall and precision, one typically uses a confusion matrix C [10], which is a K × K matrix for a K-class classifier. In our case it is a 5 × 5 matrix, and element cᵢⱼ indicates how many documents in class i have been classified as class j. For an ideal classifier, all off-diagonal entries are zero and, if there are n documents in class j, the diagonal entry cⱼⱼ equals n. Per class, the metrics are:
    Recallᵢ = cᵢᵢ / Σⱼ cᵢⱼ        Precisionᵢ = cᵢᵢ / Σⱼ cⱼᵢ        Correct Rate = Σᵢ cᵢᵢ / Σᵢ Σⱼ cᵢⱼ
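A small sketch that computes these metrics from a K × K confusion matrix c, where c[i][j] counts the class-i documents classified as class j:

    // Sketch: per-class recall/precision and the overall correct rate from a
    // confusion matrix c, where c[i][j] = class-i documents classified as j.
    public static void report(int[][] c) {
        int total = 0, diagonal = 0;
        for (int i = 0; i < c.length; i++) {
            int rowSum = 0, colSum = 0;
            for (int j = 0; j < c.length; j++) {
                rowSum += c[i][j]; // all documents truly in class i
                colSum += c[j][i]; // all documents predicted as class i
                total  += c[i][j];
            }
            diagonal += c[i][i];
            System.out.printf("class %d: recall=%.2f precision=%.2f%n",
                    i, (double) c[i][i] / rowSum, (double) c[i][i] / colSum);
        }
        System.out.printf("correct rate=%.2f%n", (double) diagonal / total);
    }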



We created testbeds with various descriptors; the details are given below.

Testbed 1: In this case we took 5 descriptors (damage tolerance, fabrication, machine, military history and tactical analysis).
Training size: 50 positive documents and 50 negative documents.
Test set size: 20 documents per descriptor.
Pages used: the first five and the last five of each document.

D: Damage Tolerance
F: Fabrication
M: Machine
H: Military History
T: Tactical Analysis

         D   F   M   H   T   Recall  Precision
    D   18   2   0   0   0    0.90     0.90
    F    2  14   1   0   3    0.70     0.77
    M    0   1  18   0   1    0.90     0.81
    H    0   1   1  16   2    0.80     0.66
    T    0   0   2   8  10    0.50     0.62

Correct Rate: 0.76

Testbed 2: We took 5 narrower descriptors (military budgets, military capabilities, military commanders, military government and military operations).
Training size: 50 positive documents and 50 negative documents.
Test set size: 40 documents per descriptor.
Pages used: the first five and the last five of each document.


          MB  MC  MCO  MG  MO   Recall  Precision
    MB    22   6    4   3   5    0.55     0.79
    MC     1  28    4   4   3    0.70     0.61
    MCO    1   2   24  10   3    0.60     0.52
    MG     2   6    5  25   2    0.63     0.53
    MO     2   4    9   5  20    0.50     0.61

Correct Rate: 0.59

We then multiplied the weights of the terms occurring in the first two pages of each document by a constant (using the same Testbed 2).


          MB  MCO  MC  MG  MO   Recall  Precision
    MB    15    0   4   0   1    0.75     0.65
    MCO    2   10   1   3   3    0.53     0.50
    MC     4    3  10   2   1    0.50     0.59
    MG     1    1   0   8   0    0.80     0.50
    MO     1    6   2   3   8    0.40     0.62

Correct Rate: 0.57

We then added a constant to the weights of the terms occurring in the first two pages of each document (again using the same Testbed 2).

          MB  MCO  MC  MG  MO   Recall  Precision
    MB    16    0   2   1   1    0.80     0.73
    MCO    1   13   1   3   2    0.65     0.57
    MC     3    2  14   1   0    0.70     0.67
    MG     1    1   0   8   0    0.80     0.53
    MO     1    7   4   2   6    0.30     0.67

Correct Rate: 0.63



Conclusion
In this paper, we proposed a framework to evaluate the effectiveness of SVMs for both the subject categorization and descriptor selection problems for a DTIC collection. We have improved on the existing DTIC process, which uses a dictionary to assign thesaurus descriptors to an acquired document based on its title and abstract. Our preliminary results are encouraging, but more testing is needed to determine the right training set sizes for the various SVMs.
Reference
[1] Sebastiani, F. (2002). "Machine learning in automated text categorization". ACM Computing Surveys, Vol. 34(1), pp. 1-47.

[2] Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M. (1998). "Inductive learning algorithms and representations for text categorization". In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Washington, US, 1998), pp. 148-155.

[3] Joachims, T. (1998). "Text categorization with support vector machines: learning with many relevant features". In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137-142.

[4] Reuters-21578 collection. URL: http://www.research.att.com/~lewis/reuters21578.htm

[5] Vapnik, V. N. The Nature of Statistical Learning Theory. Springer, Berlin, 1995.

[6] Burges, C. J. C. "A tutorial on support vector machines for pattern recognition". Data Mining and Knowledge Discovery, 2(2): 955-974, 1998.

[7] Joachims, T. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features". Proceedings of the European Conference on Machine Learning, Springer, 1998.

[8] Joachims, T. Learning to Classify Text Using Support Vector Machines. Dissertation, Kluwer, 2002.

[9] Kwok, J. T. "Automated text categorization using support vector machine". In Proceedings of the International Conference on Neural Information Processing, Kitakyushu, Japan, Oct. 1998, pp. 347-351.

[10] Kohavi, R. and Provost, F. "Glossary of Terms". Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Vol. 30, No. 2/3, February/March 1998.

Contenu connexe

Tendances

Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Aiswaryadevi Jaganmohan
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Miningijdmtaiir
 
Chapter 4 Classification
Chapter 4 ClassificationChapter 4 Classification
Chapter 4 ClassificationKhalid Elshafie
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & predictionhktripathy
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reductionKrish_ver2
 
Classifiers
ClassifiersClassifiers
ClassifiersAyurdata
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data miningKamal Acharya
 
Classification vs clustering
Classification vs clusteringClassification vs clustering
Classification vs clusteringKhadija Parween
 
Decision tree induction
Decision tree inductionDecision tree induction
Decision tree inductionthamizh arasi
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic conceptsKrish_ver2
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretizationKrish_ver2
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsIJERA Editor
 
Comparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorizationComparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorizationeSAT Journals
 
Classification and prediction
Classification and predictionClassification and prediction
Classification and predictionAcad
 

Tendances (20)

Data Mining
Data MiningData Mining
Data Mining
 
Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641
 
Analysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data MiningAnalysis of Classification Algorithm in Data Mining
Analysis of Classification Algorithm in Data Mining
 
Chapter 4 Classification
Chapter 4 ClassificationChapter 4 Classification
Chapter 4 Classification
 
Lect8 Classification & prediction
Lect8 Classification & predictionLect8 Classification & prediction
Lect8 Classification & prediction
 
1.7 data reduction
1.7 data reduction1.7 data reduction
1.7 data reduction
 
Classifiers
ClassifiersClassifiers
Classifiers
 
Clustering
ClusteringClustering
Clustering
 
Classification techniques in data mining
Classification techniques in data miningClassification techniques in data mining
Classification techniques in data mining
 
Classification vs clustering
Classification vs clusteringClassification vs clustering
Classification vs clustering
 
Decision tree induction
Decision tree inductionDecision tree induction
Decision tree induction
 
Data Mining
Data MiningData Mining
Data Mining
 
2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts2.1 Data Mining-classification Basic concepts
2.1 Data Mining-classification Basic concepts
 
18 ijcse-01232
18 ijcse-0123218 ijcse-01232
18 ijcse-01232
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
1.8 discretization
1.8 discretization1.8 discretization
1.8 discretization
 
Classification Continued
Classification ContinuedClassification Continued
Classification Continued
 
Hypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining AlgorithmsHypothesis on Different Data Mining Algorithms
Hypothesis on Different Data Mining Algorithms
 
Comparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorizationComparative study of classification algorithm for text based categorization
Comparative study of classification algorithm for text based categorization
 
Classification and prediction
Classification and predictionClassification and prediction
Classification and prediction
 

En vedette

Dynamic model of pmsm (lq and la)
Dynamic model of pmsm  (lq and la)Dynamic model of pmsm  (lq and la)
Dynamic model of pmsm (lq and la)warluck88
 
51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...
51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...
51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...warluck88
 
Implementation of hysteresis current control for single phase grid connected ...
Implementation of hysteresis current control for single phase grid connected ...Implementation of hysteresis current control for single phase grid connected ...
Implementation of hysteresis current control for single phase grid connected ...Asoka Technologies
 
Pmsm mathematical model
Pmsm mathematical modelPmsm mathematical model
Pmsm mathematical modelwarluck88
 
Modeling and simulation of pmsm
Modeling and simulation of pmsmModeling and simulation of pmsm
Modeling and simulation of pmsmRavi teja Damerla
 
Permanent magnet Synchronous machines
Permanent magnet Synchronous machinesPermanent magnet Synchronous machines
Permanent magnet Synchronous machinesRajeev Kumar
 
Space Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM InverterSpace Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM InverterPurushotam Kumar
 

En vedette (10)

Dynamic model of pmsm (lq and la)
Dynamic model of pmsm  (lq and la)Dynamic model of pmsm  (lq and la)
Dynamic model of pmsm (lq and la)
 
51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...
51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...
51640977 modelling-simulation-and-analysis-of-low-cost-direct-torque-control-...
 
Implementation of hysteresis current control for single phase grid connected ...
Implementation of hysteresis current control for single phase grid connected ...Implementation of hysteresis current control for single phase grid connected ...
Implementation of hysteresis current control for single phase grid connected ...
 
Pmsm mathematical model
Pmsm mathematical modelPmsm mathematical model
Pmsm mathematical model
 
THREE PHASE INVERTER FED BLDC MOTOR DRIVE
THREE PHASE INVERTER FED BLDC MOTOR DRIVETHREE PHASE INVERTER FED BLDC MOTOR DRIVE
THREE PHASE INVERTER FED BLDC MOTOR DRIVE
 
Modeling and simulation of pmsm
Modeling and simulation of pmsmModeling and simulation of pmsm
Modeling and simulation of pmsm
 
Svpwm
SvpwmSvpwm
Svpwm
 
Permanent magnet Synchronous machines
Permanent magnet Synchronous machinesPermanent magnet Synchronous machines
Permanent magnet Synchronous machines
 
MODELLING OF PMSM
MODELLING OF PMSMMODELLING OF PMSM
MODELLING OF PMSM
 
Space Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM InverterSpace Vector Modulation(SVM) Technique for PWM Inverter
Space Vector Modulation(SVM) Technique for PWM Inverter
 

Similaire à report.doc

A Survey Of Various Machine Learning Techniques For Text Classification
A Survey Of Various Machine Learning Techniques For Text ClassificationA Survey Of Various Machine Learning Techniques For Text Classification
A Survey Of Various Machine Learning Techniques For Text ClassificationJoshua Gorinson
 
Generalization of linear and non-linear support vector machine in multiple fi...
Generalization of linear and non-linear support vector machine in multiple fi...Generalization of linear and non-linear support vector machine in multiple fi...
Generalization of linear and non-linear support vector machine in multiple fi...CSITiaesprime
 
Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435IJRAT
 
MACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXMACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXmlaij
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniquesijnlc
 
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018Codemotion
 
Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...IJECEIAES
 
powerpoint
powerpointpowerpoint
powerpointbutest
 
Project Presentation
Project PresentationProject Presentation
Project Presentationbutest
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET Journal
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsIJERA Editor
 
SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.bhavinecindus
 
A report of the work done in this project is available here
A report of the work done in this project is available hereA report of the work done in this project is available here
A report of the work done in this project is available herebutest
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code OptimizationIRJET Journal
 
SVM Based POS Tagger (copy)
SVM Based POS Tagger (copy)SVM Based POS Tagger (copy)
SVM Based POS Tagger (copy)Sidharth Kamboj
 
IRJET- Machine Learning and Deep Learning Methods for Cybersecurity
IRJET- Machine Learning and Deep Learning Methods for CybersecurityIRJET- Machine Learning and Deep Learning Methods for Cybersecurity
IRJET- Machine Learning and Deep Learning Methods for CybersecurityIRJET Journal
 
The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...IJECEIAES
 

Similaire à report.doc (20)

Stock Market Prediction Using ANN
Stock Market Prediction Using ANNStock Market Prediction Using ANN
Stock Market Prediction Using ANN
 
A Survey Of Various Machine Learning Techniques For Text Classification
A Survey Of Various Machine Learning Techniques For Text ClassificationA Survey Of Various Machine Learning Techniques For Text Classification
A Survey Of Various Machine Learning Techniques For Text Classification
 
Generalization of linear and non-linear support vector machine in multiple fi...
Generalization of linear and non-linear support vector machine in multiple fi...Generalization of linear and non-linear support vector machine in multiple fi...
Generalization of linear and non-linear support vector machine in multiple fi...
 
Paper id 25201435
Paper id 25201435Paper id 25201435
Paper id 25201435
 
MACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOXMACHINE LEARNING TOOLBOX
MACHINE LEARNING TOOLBOX
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniques
 
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
And Then There Are Algorithms - Danilo Poccia - Codemotion Rome 2018
 
Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...Text classification supervised algorithms with term frequency inverse documen...
Text classification supervised algorithms with term frequency inverse documen...
 
powerpoint
powerpointpowerpoint
powerpoint
 
Project Presentation
Project PresentationProject Presentation
Project Presentation
 
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
IRJET- Diverse Approaches for Document Clustering in Product Development Anal...
 
Novel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data StreamsNovel Ensemble Tree for Fast Prediction on Data Streams
Novel Ensemble Tree for Fast Prediction on Data Streams
 
SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.SYNOPSIS on Parse representation and Linear SVM.
SYNOPSIS on Parse representation and Linear SVM.
 
journal for research
journal for researchjournal for research
journal for research
 
A report of the work done in this project is available here
A report of the work done in this project is available hereA report of the work done in this project is available here
A report of the work done in this project is available here
 
IRJET- Machine Learning Techniques for Code Optimization
IRJET-  	  Machine Learning Techniques for Code OptimizationIRJET-  	  Machine Learning Techniques for Code Optimization
IRJET- Machine Learning Techniques for Code Optimization
 
SVM Based POS Tagger (copy)
SVM Based POS Tagger (copy)SVM Based POS Tagger (copy)
SVM Based POS Tagger (copy)
 
IRJET- Machine Learning and Deep Learning Methods for Cybersecurity
IRJET- Machine Learning and Deep Learning Methods for CybersecurityIRJET- Machine Learning and Deep Learning Methods for Cybersecurity
IRJET- Machine Learning and Deep Learning Methods for Cybersecurity
 
The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...The effect of gamma value on support vector machine performance with differen...
The effect of gamma value on support vector machine performance with differen...
 
Aggreagate awareness
Aggreagate awarenessAggreagate awareness
Aggreagate awareness
 

Plus de butest

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEbutest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jacksonbutest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALbutest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer IIbutest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.docbutest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1butest
 
Facebook
Facebook Facebook
Facebook butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTbutest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docbutest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docbutest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.docbutest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!butest
 

Plus de butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
 
PPT
PPTPPT
PPT
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

report.doc

  • 1. Project Report on Building Descriptor Based SVM for Document Categorization Student: Naveen Kumar Ratkal Advisor: Dr. Zubair
  • 2. ABSTRACT........................................................................................................................3 INTRODUCTION..............................................................................................................3 SUPPORT VECTOR MACHINE....................................................................................4 INTRODUCTION.....................................................................................................................4 SVM MODEL....................................................................................................................4 SVM IMPLEMENTATION.............................................................................................7 TRANSFORMATION................................................................................................................8 SCALING.............................................................................................................................8 RBF KERNEL......................................................................................................................8 APPROACH.......................................................................................................................8 ARCHITECTURE.............................................................................................................9 SVM BASED ARCHITECTURE..............................................................................................10 TRAINING..........................................................................................................................10 ASSIGNMENT......................................................................................................................11 EXPERIMENTS AND RESULTS.................................................................................11 CONCLUSION ...............................................................................................................13 REFERENCE...................................................................................................................14
  • 3. Abstract Automated document categorization is extensively used and studied for document categorization, various approaches have been implemented. SVM is one of the machine learning technique which is promising for text categorization. Most of the text categorization systems are not widespread. One of the reasons is that very often it is not clear how to adapt a machine learning approach to a collection of an organization with its unique needs. DTIC currently uses a dictionary based approach to assign DTIC thesaurus descriptors to an acquired document based on its title and abstract. The thesaurus currently consists of around 14K descriptors. The main objective of this project is to build a framework for automatically downloading the documents from the DTIC website, OCRing (Optical Character Recognition) the downloaded files and finally train these files using the positive documents and negative documents (which are taken from the others class positive set). Introduction Automated document categorization has been extensively studied and a good survey article discusses evolution of various techniques for document categorization with particular focus on machine learning approaches. One of the machine learning techniques, Support Vector Machines (SVMs), is promising for text categorization (Dumais 1998, Joachims 1998). Dumais et al. evaluated SVMs for the Reuters-21578 collection. They found SVMs to be most accurate for text categorization and quick to train. The automatic text categorization area has matured and a number of experimental prototypes are available. However, most of these experimental prototypes, for the purpose of evaluating different techniques, have restricted to a standard collection such as Reuters. As pointed out in, the commercial text categorization systems are not widespread. One of the reasons is that very often it is not clear how to adapt a machine learning approach to a collection of an organization with its unique needs. Here we implement a framework that we have created that allows one to evaluate various approaches of machine learning to the categorization problem for a specific collection. Specifically, we valuate the applicability of SVMs for Defense Technical Information Center (DTIC) needs. The DTIC currently uses a dictionary based approach to assign DTIC thesaurus descriptors to an acquired document based on its title and abstract. The descriptor assignment is validated by humans and one or more fields/groups (subject categorization) are assigned to the target document. The thesaurus currently consists of around 14K descriptors. For subject categorization, DTIC uses 25 main categories (fields) and 251 subcategories (groups). Typically a document is assigned two or three fields/groups. Further, a document is also assigned five or six descriptors from the DTIC thesaurus.
  • 4. The dictionary based approach currently used by DTIC relies on mapping of document terms to DTIC thesaurus descriptions and is not efficient as it is requires continuous update of the mapping table. Additionally, it suffers from the quality of descriptors that are assigned using this approach. There is a need to improve this process to make it efficient in terms of time and the quality of descriptors that are assigned to documents. Support Vector Machine Introduction SVM (Support Vector Machine) was introduced by V. Vapnik in late 70s. It is widely used in pattern recognition areas such as face detection, isolated handwriting digit recognition, gene classification, and text categorization [11]. The goal of text categorization is the classification of documents into a fixed number of predefined categories. Each document can be in multiple, exactly one, or no category at all [8]. Text categorization with SVM eliminates the need for manual, difficult, time-consuming classifying by learning classifiers from examples and performing the category assignments automatically. This is a supervised learning problem SVMs are a set of related supervised learning methods used for classification and regression. They belong to a family of generalized linear classifiers. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin. Hence it is also known as the maximum margin classifier. [8] The main idea of Support Vector Machine is to find an optimal hyperplane to separate two classes with the largest margin from pre-classified data [12]. The use of the maximum-margin hyperplane is motivated by Vapnik Chervonenkis theory, which provides a probabilistic test error bond that is minimized when the margin is maximized [1]. SVM Model SVM models can be divided into four distinct groups based on the error function: (1) C-SVM classification (2) nu-SVM classification (3) epsilon-SVM regression (4) nu-SVM regression Compared to regular C-SVM, the formulation of nu-SVM is more complicated, so up to now there have been no effective methods for solving large-scale nu-SVM [3]. We are using C-SVM classification in our project. For C-SVM, training involves the minimization of the error function: Subject to the constraints:
  • 5. Where C is the capacity constant or penalty parameter of the error function, w is the vector of coefficients, b is a constant and is parameter for handling non separable data (inputs). The index i labels the N training cases. Note that is the class labels and xi is the independent variables. The kernel is used to transform data from the input (independent) to the feature space [13]. SVM kernels: Kernels area class of algorithms whose task is to detect and exploit complex patterns in data (eg: by clustering, classifying, ranking, cleaning, etc. the data). Typical problems are: how to represent complex patterns; and how to exclude spurious (unstable) patterns (= over fitting). The first is a computational problem; the second a statistical problem. [13]. The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data, for example, similarity between documents by length, topic, language, etc. Kernel methods exploit information about the inner products between data items. Many standard algorithms can be rewritten so that they only require inner products between data (inputs) . When a kernel is given there is no need to specify what features of the data are being used. [13]. Kernel functions = inner products in some feature space (potentially very complex) In SVM there are four common kernels: • linear • polynomial • (RBF) radial basis function • sigmoid In general RBF is a reasonable first choice because the kernel matrix using sigmoid may not be positive definite and in general it’s accuracy is not better than RBF[12], linear is a special case of RBF, and polynomial may have numerical difficulties if a high degree is used. In this implementation we use an RBF kernel whose kernel function is: exp (γ||xy||²) Kernel trick: In machine learning, the kernel trick is a method for easily converting a linear classifier algorithm into a non-linear one, by mapping the original observations into a higher- dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space [1]. Let’s look at this with an example.
  • 6. Here we are interested in classifying data as a part of a machine-learning process. These data points may not necessarily be points in but may be multidimensional points [1]. We are interested in whether we can separate them by a n-1 dimensional hyperplane. This is a typical form of linear classifier. Let us consider a two-class, linearly separable classification problem as below: Figure 1: Linear Classifier [12]. But not all classification problems are as simple as this. Figure 2: Non Linear Input space [12]. In case of complex problems, where the input space is not linear, the solution is a complex curve as below. Figure 3: Kernel Trick [12]. Incase of such non linear input space, the data space is mapped into a linear feature space. This mapping of the feature space is performed by a class of algorithms called Kernels. Good Decision Boundary: As discussed above theinput data points may not necessarily be points in but may be multidimensional (computer science notation) points. We are interested in whether
  • 7. we can separate them by a n-1 dimensional hyperplane [1]. This is a typical form of linear classifier. There are many linear classifiers that might satisfy this property. However, a given set of linear class data points may have more than one margin or boundary that separates them. A Perceptron algorithm can be used to find these multiple boundaries. Class 1 Class 2 Figure 3: Good Decision Boundary SVM solves this by finding the maximum separation (margin) between the two classes. and is known as the maximum-margin hyper plane. Figure 4: maximum margin hyper plane SVM Implementation The SVM implemented in this project follows the following steps [7]:
  • 8. Transform data to the format of an SVM software • Conduct simple scaling/normalization on the data • Consider the RBF kernel • Find the best parameter C and Gamma • Use the best parameter C and Gamma to train the training set • Test Transformation There are many existing SVM library we can use, such as rainbow [9], LIBSVM etc. LIBSVM is an integrated software for support vector classification, (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) etc [5]. In this project we select LibSVM because of the following reasons: (1) cross validation for model selection (2) different SVM formulations (3) Java sources The data format supported by LIBSVM consists of representing every documents as a line of feature id pairs: [label] [feature Id]:[feature Value] [feature Id]:[feature Value]…… The label is the class [0,1] or [-1,1] that the document belongs to . The input here is the text file containing the articles or metadata of the images. LIBSVM however takes the data files of the above described format. The input files are transformed into the LIBSVM format by processing each document as follows. Every document is scanned to pick up features (or key words that represent a document) and every terms frequency (TF) and the inverse document frequency (IDF) is calculated. These are then used to compute the weight of every term in the training set. There are many tools such as lucene that calculate the TF and DF for a given document and create its SVM data model. In this project I have developed my own API to calculate, compute and create the svm data files (based on the required feature sets). Scaling The weighted frequencies are scaled to avoid attributes in greater numeric ranges dominate those in smaller ranges. RBF Kernel In this project we use the RBF kernel as it is a simple model to start with. The number of hyper parameters influences the complexity of model selection and Polynomial kernel has more hyper parameters than the RBF kernel. Approach
Approach

There are several ways one can apply SVM to the problem of assigning fields/groups and descriptors to new documents based on a learning set. In one approach we would treat categorization and descriptor selection as independent problems, each solved by independent SVMs. A combined approach would start with either the categorization problem or the descriptor problem and then solve the other as a restricted-domain problem; for instance, we could solve the categorization problem first and then, for the resulting specific field/group, solve the descriptor selection problem for the document at hand. Here we discuss the "independent" approach and how to resolve the inconsistencies that can result when the two problems are solved independently. An equally important task, which is not the focus of this paper, is to study different ways of applying the SVM approach, their tradeoffs, and other machine learning techniques that could yield better-quality descriptor and field/group mappings.

The overall approach consists of the following major steps.

Step 1. We use the existing DTIC collection for the training phase. For this, we first need a representation of a document: a document is represented by a vector of weighted terms, with the weights determined by term frequency and inverse document frequency, a standard technique in the IR area. Using this representation, we train the SVMs for 251 fields/groups and 14000 descriptors.

Step 2. We use the trained SVMs to identify the fields and groups for a document. We also assign a likelihood factor (varying from 0 to 1) to each assigned field/group based on the document's distance from the hyperplane (please refer to the SVM background). We sort the assigned fields/groups by likelihood factor and select the first k fields/groups based on a threshold (a sketch of this selection appears at the end of this section).

Step 3. Similarly to Step 2, we identify descriptors, sort them, and select m descriptors based on a threshold.

Step 4. Note that the fields/groups identified in Step 2 and the descriptors identified in Step 3 may be inconsistent; that is, we may have a field/group assignment for a new document without its descriptor as identified by the DTIC thesaurus. One straightforward resolution is to take the intersection of the descriptors implied by the field/group mapping and the descriptors identified by the SVMs. The likelihood factor can then be used to select a few fields/groups (around two or three) and five or six descriptors.

As we are using SVMs to classify a large number of classes (251 + 14000), there can be performance issues. As part of this work we investigate ways to improve performance without sacrificing the quality of the field/group and descriptor assignments. In this paper, we focus on Step 3 of the overall process, with particular attention to the automated framework that allows a collection to be analyzed, trained, processed, and its results presented.
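The selection logic of Steps 2 and 3 amounts to thresholding and ranking by the likelihood factor. A minimal sketch (illustrative names, not the project's code):

    import java.util.*;
    import java.util.stream.Collectors;

    // Sketch of the selection logic in Steps 2 and 3: rank candidate labels by the
    // likelihood factor produced by their SVMs and keep at most k labels whose
    // likelihood clears the threshold. Names are illustrative.
    public class LikelihoodSelector {
        public static List<String> select(Map<String, Double> likelihood,
                                          double threshold, int k) {
            return likelihood.entrySet().stream()
                    .filter(e -> e.getValue() >= threshold)        // apply threshold
                    .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                    .limit(k)                                      // keep the top k
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }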
Architecture

SVM Based Architecture

The process of identifying descriptors for a document consists of two phases: a training phase and an assignment phase. In the training phase, we train as many SVMs as there are thesaurus terms. In the assignment phase, we present a document to all the trained SVMs and select m descriptors based on a threshold. We now give details of the two phases.

Training

As the DTIC thesaurus consists of 14000 terms, the training process to build 14000 SVMs needs to be automated. Figure 6 illustrates the training process. For each thesaurus term, we construct a URL and use it to search for documents in that category in the DTIC collection. (For our testing purposes we have enabled this process for only five thesaurus terms.) From the search results, we download a randomly selected subset of the documents. These documents, which are in PDF format, are converted to text using OmniPage, an OCR engine. We use the traditional IR document model, based on term frequency (TF) and document frequency (DF), and use the Apache Lucene package to determine TF and DF and build a document representation suitable for training an SVM. These documents form the positive training set for the selected thesaurus term. The negative training set is created by randomly selecting documents from the positive training sets of terms other than the selected term.
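A minimal sketch of the negative-set construction just described, assuming the positive documents for each thesaurus term have already been collected (all names are illustrative):

    import java.util.*;

    // Sketch of negative-set construction: for a given thesaurus term, sample
    // documents at random from the positive sets of all other terms.
    // positiveSets maps each term to the file paths of its positive documents.
    public class NegativeSetBuilder {
        public static List<String> build(String term,
                                         Map<String, List<String>> positiveSets,
                                         int size, Random rng) {
            // Pool together every other term's positive documents.
            List<String> pool = new ArrayList<>();
            for (Map.Entry<String, List<String>> e : positiveSets.entrySet())
                if (!e.getKey().equals(term)) pool.addAll(e.getValue());
            Collections.shuffle(pool, rng);                 // random selection
            return new ArrayList<>(pool.subList(0, Math.min(size, pool.size())));
        }
    }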
Assignment

Figure 7 illustrates the assignment phase. The input document is represented using TF and DF as in the training phase. This document is presented to all the trained SVMs, each of which outputs an estimate in the range 0 to 1 indicating how likely its term is to map to the test document. Based on a threshold, we can then assign m thesaurus terms (descriptors) to the test document.

Experiments and Results

We use the recall and precision metrics commonly used by the information extraction and data mining communities:

Recall = Correct Answers / Total Possible Answers
Precision = Correct Answers / Answers Produced

To compute recall and precision one typically uses a confusion matrix C [10], a K x K matrix for a K-class classifier. In our case it is a 5 x 5 matrix, and element c_ij indicates how many documents of class i have been classified as class j. For an ideal classifier all off-diagonal entries are zero, and c_jj = n if there are n documents in class j. The per-class metrics and the overall correct rate are:

    \mathrm{Recall}_i = \frac{c_{ii}}{\sum_j c_{ij}}, \qquad \mathrm{Precision}_i = \frac{c_{ii}}{\sum_j c_{ji}}, \qquad \mathrm{Correct\;Rate} = \frac{\sum_i c_{ii}}{\sum_i \sum_j c_{ij}}
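These metrics follow directly from the confusion matrix; a small sketch of the computation (illustrative, not the evaluation code used in the project):

    // Sketch of the metric computation from a K x K confusion matrix c, where
    // c[i][j] counts documents of true class i classified as class j.
    public class ConfusionMetrics {
        public static void report(int[][] c) {
            int k = c.length, diagonal = 0, total = 0;
            for (int i = 0; i < k; i++) {
                int rowSum = 0, colSum = 0;
                for (int j = 0; j < k; j++) {
                    rowSum += c[i][j];   // all documents of class i
                    colSum += c[j][i];   // all documents labeled as class i
                    total  += c[i][j];
                }
                diagonal += c[i][i];
                System.out.printf("class %d: recall %.2f, precision %.2f%n",
                        i, (double) c[i][i] / rowSum, (double) c[i][i] / colSum);
            }
            System.out.printf("correct rate: %.2f%n", (double) diagonal / total);
        }
    }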
We created a testbed with various descriptors; the details are below.

Testbed 1: Five descriptors (damage tolerance, fabrication, machine, military history, and tactical analysis). Training size: 50 positive and 50 negative documents per descriptor. Test set: 20 documents per descriptor. Pages used: the first five and the last five pages of each document.

D: Damage Tolerance, F: Fabrication, M: Machine, H: Military History, T: Tactical Analysis

          D    F    M    H    T   Recall  Precision
    D    18    2    0    0    0    0.90     0.90
    F     2   14    1    0    3    0.70     0.77
    M     0    1   18    0    1    0.90     0.81
    H     0    1    1   16    2    0.80     0.66
    T     0    0    2    8   10    0.50     0.62

Correct Rate: 0.76

Testbed 2: Five narrower descriptors (military budgets, military capabilities, military commanders, military government, and military operations). Training size: 50 positive and 50 negative documents per descriptor. Test set: 40 documents per descriptor. Pages used: the first five and the last five pages of each document.

         MB   MC  MCO   MG   MO   Recall  Precision
    MB   22    6    4    3    5    0.55     0.79
    MC    1   28    4    4    3    0.70     0.61
    MCO   1    2   24   10    3    0.60     0.52
    MG    2    6    5   25    2    0.63     0.53
    MO    2    4    9    5   20    0.50     0.61

Correct Rate: 0.59

We then multiplied the weights of terms occurring in the first two pages of each document by a constant (using the same Testbed 2):

         MB  MCO   MC   MG   MO   Recall  Precision
    MB   15    0    4    0    1    0.75     0.65
    MCO   2   10    1    3    3    0.53     0.50
    MC    4    3   10    2    1    0.50     0.59
    MG    1    1    0    8    0    0.80     0.50
    MO    1    6    2    3    8    0.40     0.62

Correct Rate: 0.57

We then added a constant to the weights of terms occurring in the first two pages of each document (again using the same Testbed 2):

         MB  MCO   MC   MG   MO   Recall  Precision
    MB   16    0    2    1    1    0.80     0.73
    MCO   1   13    1    3    2    0.65     0.57
    MC    3    2   14    1    0    0.70     0.67
    MG    1    1    0    8    0    0.80     0.53
    MO    1    7    4    2    6    0.30     0.67

Correct Rate: 0.63

Conclusion

In this paper, we proposed a framework to evaluate the effectiveness of SVMs for both the subject categorization and the descriptor selection problems for a DTIC collection. The framework improves on the existing DTIC process, which uses a dictionary to assign thesaurus descriptors to an acquired document based on its title and abstract. Our preliminary results are encouraging, but more testing is needed to determine the right training sizes for the various SVMs.
Reference

[1] Sebastiani, F. (2002). "Machine learning in automated text categorization." ACM Computing Surveys, 34(1), pp. 1-47.
[2] Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M. (1998). "Inductive learning algorithms and representations for text categorization." In Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management (Washington, US, 1998), pp. 148-155.
[3] Joachims, T. (1998). "Text categorization with support vector machines: learning with many relevant features." In Proceedings of ECML-98, 10th European Conference on Machine Learning (Chemnitz, DE, 1998), pp. 137-142.
[4] Reuters-21578 collection. URL: http://www.research.att.com/~lewis/reuters21578.htm
[5] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, Berlin.
[6] Burges, C. J. C. (1998). "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery, 2(2), pp. 121-167.
[7] Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning, Springer.
[8] Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Dissertation, Kluwer.
[9] Kwok, J. T. (1998). "Automated text categorization using support vector machine." In Proceedings of the International Conference on Neural Information Processing (Kitakyushu, Japan, Oct. 1998), pp. 347-351.
[10] Kohavi, R. and Provost, F. (1998). "Glossary of terms." Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Machine Learning, 30(2/3).