Project Report on
Building Descriptor Based SVM for Document Categorization
Student: Naveen Kumar Ratkal
Advisor: Dr. Zubair
ABSTRACT
INTRODUCTION
SUPPORT VECTOR MACHINE
INTRODUCTION
SVM MODEL
SVM IMPLEMENTATION
TRANSFORMATION
SCALING
RBF KERNEL
APPROACH
ARCHITECTURE
SVM BASED ARCHITECTURE
TRAINING
ASSIGNMENT
EXPERIMENTS AND RESULTS
CONCLUSION
REFERENCE
Abstract
Automated document categorization has been studied extensively, and various approaches
have been implemented. The Support Vector Machine (SVM) is one machine learning
technique that is promising for text categorization. Nevertheless, most text
categorization systems are not widespread. One of the reasons is that very often it is not
clear how to adapt a machine learning approach to an organization's collection with its
unique needs. The Defense Technical Information Center (DTIC) currently uses a
dictionary-based approach to assign DTIC thesaurus descriptors to an acquired document
based on its title and abstract. The thesaurus currently consists of around 14K
descriptors. The main objective of this project is to build a framework that automatically
downloads documents from the DTIC website, converts the downloaded files to text with
OCR (Optical Character Recognition), and finally trains SVMs on positive documents and
negative documents (the latter taken from the positive sets of the other classes).
Introduction
Automated document categorization has been extensively studied, and a good survey
article discusses the evolution of various techniques for document categorization with
particular focus on machine learning approaches. One of the machine learning
techniques, Support Vector Machines (SVMs), is promising for text categorization
(Dumais 1998, Joachims 1998). Dumais et al. evaluated SVMs on the Reuters-21578
collection. They found SVMs to be the most accurate for text categorization and quick to
train.
The automatic text categorization area has matured and a number of experimental
prototypes are available. However, most of these experimental prototypes, for the
purpose of evaluating different techniques, have been restricted to a standard collection
such as Reuters. As has been pointed out, commercial text categorization systems are not
widespread. One of the reasons is that very often it is not clear how to adapt a machine
learning approach to an organization's collection with its unique needs.
Here we implement a framework that allows one to evaluate various machine learning
approaches to the categorization problem for a specific collection.
Specifically, we evaluate the applicability of SVMs to the needs of the Defense Technical
Information Center (DTIC). DTIC currently uses a dictionary-based approach to assign
DTIC thesaurus descriptors to an acquired document based on its title and abstract. The
descriptor assignment is validated by humans and one or more fields/groups (subject
categorization) are assigned to the target document. The thesaurus currently consists of
around 14K descriptors. For subject categorization, DTIC uses 25 main categories
(fields) and 251 subcategories (groups). Typically a document is assigned two or three
fields/groups. Further, a document is also assigned five or six descriptors from the DTIC
thesaurus.
The dictionary-based approach currently used by DTIC relies on mapping document
terms to DTIC thesaurus descriptors and is not efficient, as it requires continuous
updating of the mapping table. Additionally, the quality of the descriptors that
are assigned using this approach suffers. There is a need to improve this process to make
it efficient in terms of both time and the quality of descriptors that are assigned to
documents.
Support Vector Machine
Introduction
The SVM (Support Vector Machine) was introduced by V. Vapnik in the late 1970s. It is
widely used in pattern recognition areas such as face detection, isolated handwritten digit
recognition, gene classification, and text categorization [11]. The goal of text
categorization is the classification of documents into a fixed number of predefined
categories. Each document can be in multiple categories, exactly one, or none at all [8].
Text categorization with SVMs eliminates the need for difficult, time-consuming manual
classification by learning classifiers from examples and performing the category
assignments automatically. This is a supervised learning problem.
SVMs are a set of related supervised learning methods used for classification and
regression. They belong to a family of generalized linear classifiers. A special property of
SVMs is that they simultaneously minimize the empirical classification error and
maximize the geometric margin. Hence an SVM is also known as a maximum-margin
classifier [8].
The main idea of the Support Vector Machine is to find an optimal hyperplane that
separates two classes with the largest margin, using pre-classified data [12]. The use of
the maximum-margin hyperplane is motivated by Vapnik-Chervonenkis theory, which
provides a probabilistic test error bound that is minimized when the margin is maximized
[1].
SVM Model
SVM models can be divided into four distinct groups based on the error function:
(1) C-SVM classification
(2) nu-SVM classification
(3) epsilon-SVM regression
(4) nu-SVM regression
Compared to regular C-SVM, the formulation of nu-SVM is more complicated, so up to
now there have been no effective methods for solving large-scale nu-SVM [3]. We are
using C-SVM classification in our project. For C-SVM, training involves the
minimization of the error function:
$$\frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i$$
subject to the constraints:
$$y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i \quad \text{and} \quad \xi_i \ge 0, \quad i = 1, \ldots, N$$
where C is the capacity constant or penalty parameter of the error function, w is the
vector of coefficients, b is a constant, and the $\xi_i$ are parameters for handling
non-separable data (inputs). The index i labels the N training cases. Note that
$y_i \in \{-1, +1\}$ are the class labels and the $x_i$ are the independent variables. The
kernel $\phi$ is used to transform data from the input (independent) space to the feature
space [13].
SVM kernels:
Kernel methods are a class of algorithms whose task is to detect and exploit complex
patterns in data (e.g., by clustering, classifying, ranking, or cleaning the data). Typical
problems are how to represent complex patterns and how to exclude spurious (unstable)
patterns (overfitting). The first is a computational problem; the second is a statistical
problem [13].
The class of kernel methods implicitly defines the class of possible patterns by
introducing a notion of similarity between data, for example, similarity between
documents by length, topic, language, etc. Kernel methods exploit information about the
inner products between data items. Many standard algorithms can be rewritten so that
they only require inner products between data (inputs). When a kernel is given there is
no need to specify what features of the data are being used [13].
In short, a kernel function is an inner product in some (potentially very complex) feature
space.
In SVM there are four common kernels:
• linear
• polynomial
• (RBF) radial basis function
• sigmoid
In general RBF is a reasonable first choice: the kernel matrix using sigmoid may
not be positive definite and in general its accuracy is not better than RBF [12]; linear is
a special case of RBF; and polynomial may have numerical difficulties if a high degree is
used. In this implementation we use an RBF kernel, whose kernel function is:
$$K(x, y) = \exp\left(-\gamma \lVert x - y \rVert^2\right), \quad \gamma > 0$$
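As a small illustration, the RBF kernel in its usual form, K(x, y) = exp(-γ‖x - y‖²) with γ > 0, can be computed directly from two feature vectors. The sketch below is not taken from the project code; the function name and the sample vectors are assumptions chosen only for illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Identical points have similarity 1; distant points approach 0.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)
```

A larger gamma makes the similarity fall off faster with distance, which is why gamma has to be tuned together with C.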
Kernel trick:
In machine learning, the kernel trick is a method for easily converting a linear classifier
algorithm into a non-linear one, by mapping the original observations into a higher-
dimensional non-linear space so that linear classification in the new space is equivalent to
non-linear classification in the original space [1]. Let’s look at this with an example.
Here we are interested in classifying data as a part of a machine-learning process. These
data points may not necessarily be points in $\mathbb{R}^2$ but may be multidimensional
$\mathbb{R}^n$ points [1]. We are interested in whether we can separate them by an
(n-1)-dimensional hyperplane. This is a typical form of linear classifier. Let us consider
a two-class, linearly separable classification problem as below:
Figure 1: Linear Classifier [12].
But not all classification problems are as simple as this.
Figure 2: Non Linear Input space [12].
In case of complex problems, where the input space is not linear, the solution is a
complex curve as below.
Figure 3: Kernel Trick [12].
In case of such a non-linear input space, the data space is mapped into a linear feature
space. This mapping to the feature space is performed by a class of algorithms called
kernels.
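To make the mapping idea concrete, the sketch below uses a degree-2 polynomial kernel on 2-D points, a textbook example rather than the kernel used in this project. It checks that evaluating the kernel in the input space gives the same number as an inner product after an explicit feature map; the names phi and poly2_kernel are illustrative assumptions.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D point (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel computed in the original input space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(y))   # inner product in the 3-D feature space
implicit = poly2_kernel(x, y)    # same value without ever building phi
```

The kernel never constructs the feature vectors, which is what keeps the approach practical when the feature space is very large.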
Good Decision Boundary:
As discussed above, the input data points may not necessarily be points in $\mathbb{R}^2$
but may be multidimensional $\mathbb{R}^n$ (computer science notation) points. We are
interested in whether we can separate them by an (n-1)-dimensional hyperplane [1]. This
is a typical form of linear classifier. There are many linear classifiers that might satisfy
this property. However, a given set of linearly separable data points may have more than
one boundary that separates them. A perceptron algorithm can be used to find these
multiple boundaries.
Figure 3: Good Decision Boundary (Class 1 and Class 2)
SVM solves this by finding the boundary with the maximum separation (margin) between
the two classes, which is known as the maximum-margin hyperplane.
Figure 4: Maximum-margin hyperplane
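The difference between "a separating boundary" and "the maximum-margin boundary" can be shown numerically. The following sketch compares the geometric margin of two hyperplanes that both separate the same toy data; the data points and the two hyperplanes are made up for illustration and do not come from the DTIC experiments.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance y * (w.x + b) / ||w|| over all points.

    A positive result means the hyperplane separates the two classes.
    """
    norm = math.sqrt(sum(c * c for c in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, 1, 1]

# Boundary x1 + x2 = 6 versus boundary x1 = 2.5; both separate the data.
margin_a = geometric_margin((1.0, 1.0), -6.0, points, labels)
margin_b = geometric_margin((1.0, 0.0), -2.5, points, labels)
```

Both boundaries classify the toy points correctly, but the first leaves more room on either side, which is the property an SVM optimizes for.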
SVM Implementation
The SVM implementation in this project consists of the following steps [7]:
• Transform the data to the format of the SVM software
• Conduct simple scaling/normalization on the data
• Consider the RBF kernel
• Find the best parameters C and gamma
• Use the best parameters C and gamma to train on the training set
• Test
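The parameter search step in the list above is commonly done as a grid search over exponentially spaced values of C and gamma. The sketch below assumes a caller-supplied scoring function (for example, cross-validation accuracy); toy_score is a hypothetical stand-in so the example runs on its own, and the exponent ranges are one common starting grid, not values reported for this project.

```python
import math
from itertools import product

def grid_search(score_fn, c_exponents=range(-5, 16, 2), gamma_exponents=range(-15, 4, 2)):
    """Try C = 2^c and gamma = 2^g for every pair and keep the best score."""
    best = None
    for c_exp, g_exp in product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best

def toy_score(c, gamma):
    """Hypothetical stand-in for cross-validation accuracy; peaks at C=8, gamma=2^-5."""
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 5)

best_score, best_c, best_gamma = grid_search(toy_score)
```

In a real run, score_fn would train on part of the training set and evaluate on the held-out part for each (C, gamma) pair.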
Transformation
There are many existing SVM libraries we can use, such as rainbow [9] and LIBSVM.
LIBSVM is an integrated software package for support vector classification (C-SVC,
nu-SVC), regression (epsilon-SVR, nu-SVR), etc. [5]. In this project we select LIBSVM
for the following reasons:
(1) cross validation for model selection
(2) different SVM formulations
(3) Java sources
The data format supported by LIBSVM represents every document as a line of
feature id/value pairs:
[label] [feature Id]:[feature Value] [feature Id]:[feature Value] ...
The label is the class (0/1 or -1/+1) that the document belongs to.
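A line in this format can be produced from a sparse feature dictionary as sketched below. The helper name and the sample feature ids are assumptions for illustration; the sketch writes feature ids in ascending order, as LIBSVM expects, and omits zero-valued features.

```python
def to_libsvm_line(label, features):
    """Format one document as '<label> <id>:<value> ...'.

    features maps integer feature ids to numeric values.
    """
    parts = [str(label)]
    for feature_id in sorted(features):
        value = features[feature_id]
        if value != 0:
            parts.append(f"{feature_id}:{value:g}")
    return " ".join(parts)

line = to_libsvm_line(1, {7: 0.25, 3: 1.5, 12: 0.0})
```

Here `line` is the string `1 3:1.5 7:0.25`.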
The input here is a text file containing the articles or the metadata of the images.
LIBSVM, however, takes data files in the format described above. The input files are
transformed into the LIBSVM format by processing each document as follows. Every
document is scanned to pick up features (key words that represent the document), and
each term's term frequency (TF) and inverse document frequency (IDF) are calculated.
These are then used to compute the weight of every term in the training set.
There are many tools, such as Lucene, that calculate the TF and DF for a given document
and create its SVM data model. In this project I have developed my own API to compute
these values and create the SVM data files (based on the required feature sets).
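A minimal sketch of the term weighting is given below. It assumes the simplest TF-IDF variant, raw term count multiplied by log(N / DF); the report does not state which variant was used, and the function name and sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        vectors.append({
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        })
    return vectors

docs = [["radar", "signal", "radar"], ["signal", "noise"], ["armor", "signal"]]
weights = tf_idf_vectors(docs)
```

A term such as "signal" that appears in every document gets weight 0, while a term concentrated in one document, such as "radar", gets a high weight there.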
Scaling
The weighted frequencies are scaled to prevent attributes in greater numeric ranges
from dominating those in smaller ranges.
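One simple way to do this is min-max scaling of each feature to the range [0, 1], sketched below. The report does not say which scaling method was used, so this is an assumption; the important point illustrated is that the ranges are learned from the training data and then reused unchanged for test data.

```python
def fit_min_max(rows):
    """rows: list of equal-length feature lists. Returns per-feature (min, max)."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, ranges):
    """Scale every feature to [0, 1] using previously fitted ranges."""
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else (value - lo) / (hi - lo)
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled

train = [[0.0, 100.0], [5.0, 300.0], [10.0, 200.0]]
ranges = fit_min_max(train)
scaled_train = apply_min_max(train, ranges)
```

After scaling, the second feature (originally in the hundreds) no longer dominates the first in distance computations such as the one inside the RBF kernel.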
RBF Kernel
In this project we use the RBF kernel as it is a simple model to start with. The number of
hyperparameters influences the complexity of model selection, and the polynomial kernel
has more hyperparameters than the RBF kernel.
Approach
There are several ways one can apply SVMs to the problem of assigning fields/groups and
descriptors to new documents based on a learning set. In one approach we would treat the
problems of categorization and descriptor selection as independent problems, each one
solved by independent SVMs. A combined approach would start with either the
categorization problem or the descriptor problem and then solve the other as a restricted
domain problem. For instance, we could solve the categorization problem first and then,
for the resulting specific field/group, solve the descriptor selection problem for the
document at hand. Here, we discuss the "independent" approach and how to resolve
inconsistencies that can result when the two problems are solved independently. An
equally important task, which is not the focus of this paper, is to study different ways to
apply the SVM approach and their tradeoffs, and possible other machine learning
techniques that can result in better quality descriptors and fields/groups mapping. The
overall approach consists of the following major steps.
Step 1. We use the existing DTIC collection for the training phase. For this, we first need
to have a representation of a document. A document is represented by a vector of
weighted terms. The weights are determined by the term frequency and inverse document
frequency, a standard technique used in the IR area. Using this representation of a
document, we train SVMs for the 251 fields/groups and 14000 descriptors.
Step 2. We use trained SVM to identify the fields and groups for a document. We also
assign a likelihood factor (varying from 0 to 1) to an assigned field/group based on the
document's distance from the hyperplane (see the SVM background above). We sort the
assigned fields/groups based on the likelihood factor and select the first k fields/groups
based on a threshold.
Step 3. Similar to Step 2, we identify descriptors, sort them, and select m descriptors
based on a threshold.
Step 4. Note that fields/groups identified in Step 2 and descriptors identified in Step 3
may be inconsistent. That is, we may have a field/group assignment for a new document
without its descriptor as identified by the DTIC thesaurus. One straightforward way to
resolve this is to use the intersection of the descriptors identified by the fields/groups
mapping and the descriptors identified by the SVM. The likelihood factor can then be used
to select a few fields/groups (around two or three) and five or six descriptors.
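Steps 2 to 4 can be sketched as a threshold-and-rank selection followed by an intersection. In the sketch below the group names, likelihood values, thresholds, and the group-to-descriptor mapping are all hypothetical placeholders; in practice the mapping would come from the DTIC thesaurus and the likelihoods from the trained SVMs.

```python
def select_top(scores, threshold, limit):
    """Keep labels whose likelihood is >= threshold, best first, at most `limit`."""
    kept = [(label, s) for label, s in scores.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept[:limit]]

def consistent_descriptors(groups, descriptors, group_to_descriptors):
    """Keep only descriptors that belong to at least one selected field/group."""
    allowed = set()
    for group in groups:
        allowed |= group_to_descriptors.get(group, set())
    return [d for d in descriptors if d in allowed]

# Hypothetical likelihood factors for one document.
group_scores = {"group A": 0.9, "group B": 0.4, "group C": 0.7}
descriptor_scores = {"fabrication": 0.8, "machine": 0.6,
                     "military history": 0.55, "tactical analysis": 0.2}
# Hypothetical fields/groups-to-descriptors mapping.
mapping = {"group A": {"fabrication", "damage tolerance"},
           "group B": {"military history"},
           "group C": {"machine"}}

groups = select_top(group_scores, threshold=0.5, limit=3)
descriptors = select_top(descriptor_scores, threshold=0.5, limit=6)
final = consistent_descriptors(groups, descriptors, mapping)
```

In this toy case "military history" passes the descriptor threshold but is dropped, because its only field/group ("group B") was not selected.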
As we are using SVM to classify a large number of classes (251 + 14000), there can be
performance issues. As part of this work we investigate ways to improve the performance
without sacrificing the quality of the fields/groups and descriptor assignments. In this
paper, we focus on Step 3 of the overall process, with particular attention to the
automated framework that allows a collection to be analyzed, trained on, and processed,
and the results presented.
Architecture
SVM Based Architecture
The process to identify a descriptor for a document consists of two phases: training phase
and assignment phase. In the training phase, we train as many SVMs as the number of
thesaurus terms. Next, in the assignment phase, we present a document to all the trained
SVMs and select m descriptors based on a threshold. We now give details of the two
phases.
Training
As the DTIC thesaurus consists of 14000 terms, the training process to build 14000 SVMs
needs to be automated. Figure 2 illustrates the training process. For each thesaurus term,
we construct a URL and use this URL to search the DTIC collection for documents in that
category. For our testing purposes we have enabled this process for only five thesaurus
terms. From the search results, we download a randomly selected subset of the
documents. These documents, which are in PDF format, are converted to text using
Omnipage, an OCR engine. We use the traditional IR document model, which is based on
term frequency (TF) and document frequency (DF). In this project, we use the Apache
Lucene package to determine TF and DF and arrive at a document representation that
can be used for training an SVM. These documents form the positive training set for the
selected thesaurus term. The negative training set is created by randomly selecting
documents from the positive training sets of terms other than the selected term.
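The construction of the two training sets can be sketched as follows. The document ids and the function name are hypothetical, and the sketch adds one assumption the report does not state: documents that are also positive for the selected term are excluded from the negative pool, since a document may carry several descriptors.

```python
import random

def build_training_sets(positives_by_term, term, n_negative, seed=0):
    """Return (positive, negative) document lists for one thesaurus term.

    Negatives are sampled from the positive sets of all other terms.
    """
    rng = random.Random(seed)
    positive = list(positives_by_term[term])
    positive_ids = set(positive)
    pool = sorted({
        doc
        for other, docs in positives_by_term.items() if other != term
        for doc in docs if doc not in positive_ids
    })
    negative = rng.sample(pool, min(n_negative, len(pool)))
    return positive, negative

positives_by_term = {
    "fabrication": ["doc1", "doc2", "doc3"],
    "machine": ["doc3", "doc4", "doc5"],
    "military history": ["doc6", "doc7"],
}
positive, negative = build_training_sets(positives_by_term, "fabrication", n_negative=3)
```

Fixing the seed makes the random selection repeatable, which helps when comparing results across training runs.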
Assignment
Figure 3 illustrates the assignment phase. The input document is represented using TF
and DF as in the training phase. This document is presented to all the trained SVMs,
which in turn output an estimate in the range from 0 to 1 indicating how likely the
selected term maps to the test document. Based on a threshold, we can then assign m
thesaurus terms (descriptors) to the test document.
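One possible way to obtain an estimate between 0 and 1 from an SVM is to pass the signed distance from the hyperplane through a logistic function. The report does not specify how its estimate is computed, so the logistic mapping, the slope parameter, and the sample decision values below are all assumptions made for illustration.

```python
import math

def likelihood(decision_value, slope=1.0):
    """Map a signed distance from the hyperplane to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * decision_value))

def assign_descriptors(decision_values, threshold=0.5, m=6):
    """Rank terms by likelihood and keep at most m that reach the threshold."""
    scored = {term: likelihood(v) for term, v in decision_values.items()}
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [(term, p) for term, p in ranked if p >= threshold][:m]

# Hypothetical decision values from three per-term SVMs for one document.
decision_values = {"fabrication": 2.0, "machine": 0.3, "military history": -1.5}
assigned = assign_descriptors(decision_values)
assigned_terms = [term for term, _ in assigned]
```

A document on the positive side of a term's hyperplane gets a likelihood above 0.5, so with the default threshold only terms whose SVM votes "yes" are assigned.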
Experiments and Results
We use recall and precision metrics that are commonly used by the information
extraction and data mining communities. The general definition of recall and precision is:
Recall = Correct Answers / Total Possible Answers
Precision = Correct Answers / Answers Produced
To compute recall and precision one typically uses a confusion matrix C [10], which is a
K x K matrix for a K-class classifier. In our case, it is a 5 by 5 matrix, and an element
c_ij of the matrix indicates how many documents in class i have been classified as class j.
For an ideal classifier all off-diagonal entries will be zero, and if there are n documents
in a class j, then c_jj = n. Recall, precision, and the correct rate are computed from C as:
$$\text{Recall}_i = \frac{c_{ii}}{\sum_j c_{ij}}$$
$$\text{Precision}_i = \frac{c_{ii}}{\sum_j c_{ji}}$$
$$\text{Correct Rate} = \frac{\sum_i c_{ii}}{\sum_j \sum_i c_{ij}}$$
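These metrics can be computed from a confusion matrix with a few lines of code. The sketch below follows the definitions of recall and precision given above, with rows as true classes and columns as assigned classes; the function name is an illustrative choice, and the sample matrix is the first Testbed 2 matrix reported in this section.

```python
def recall_precision(confusion):
    """Return (recall list, precision list, correct rate) for a K x K matrix.

    confusion[i][j] is the number of class-i documents classified as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    recall = [confusion[i][i] / sum(confusion[i]) for i in range(k)]
    precision = [
        confusion[i][i] / sum(confusion[j][i] for j in range(k))
        for i in range(k)
    ]
    return recall, precision, correct / total

matrix = [
    [22, 6, 4, 3, 5],
    [1, 28, 4, 4, 3],
    [1, 2, 24, 10, 3],
    [2, 6, 5, 25, 2],
    [2, 4, 9, 5, 20],
]
recall, precision, correct_rate = recall_precision(matrix)
```

Recall divides each diagonal entry by its row sum and precision by its column sum, so a class can have high recall and low precision at the same time.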
We created testbeds with various descriptors; the details are given below.
Testbed 1: In this case we took 5 descriptors (damage tolerance, fabrication, machine,
military history and tactical analysis).
Training size: 50 positive documents and 50 negative documents.
Test documents size: 20 documents.
Number of pages: first five and last five.
D: Damage Tolerance
F: Fabrication
M: Machine
H: Military History
T: Tactical Analysis
      D    F    M    H    T    Recall  Precision
D    18    2    0    0    0    0.90    0.90
F     2   14    1    0    3    0.70    0.77
M     0    1   18    0    1    0.90    0.81
H     0    1    1   16    2    0.80    0.66
T     0    0    2    8   10    0.50    0.62
Correct Rate: 0.76
Testbed 2: In this case we took 5 narrow descriptors (military budgets, military
capabilities, military commanders, military government and military operations).
Training size: 50 positive documents and 50 negative documents.
Test documents size: 40 documents.
Number of pages: first five and last five.
MB: Military Budgets
MC: Military Capabilities
MCO: Military Commanders
MG: Military Government
MO: Military Operations
       MB   MC   MCO  MG   MO   Recall  Precision
MB     22    6    4    3    5   0.55    0.79
MC      1   28    4    4    3   0.70    0.61
MCO     1    2   24   10    3   0.60    0.52
MG      2    6    5   25    2   0.63    0.53
MO      2    4    9    5   20   0.50    0.61
Correct Rate: 0.59
We multiplied the weights of the terms occurring in the first two pages of the document
by a constant (we used the same Testbed 2).
       MB   MCO  MC   MG   MO   Recall  Precision
MB     15    0    4    0    1   0.75    0.65
MCO     2   10    1    3    3   0.53    0.50
MC      4    3   10    2    1   0.50    0.59
MG      1    1    0    8    0   0.80    0.50
MO      1    6    2    3    8   0.40    0.62
Correct Rate: 0.57
We added a constant to the weights of the terms occurring in the first two pages of the
document (we used the same Testbed 2).
       MB   MCO  MC   MG   MO   Recall  Precision
MB     16    0    2    1    1   0.80    0.73
MCO     1   13    1    3    2   0.65    0.57
MC      3    2   14    1    0   0.70    0.67
MG      1    1    0    8    0   0.80    0.53
MO      1    7    4    2    6   0.30    0.67
Correct Rate: 0.63
Conclusion
In this paper, we proposed a framework to evaluate the effectiveness of SVMs for both
the subject categorization and descriptor selection problems for a DTIC collection. We have
improved the existing DTIC process, which uses a dictionary to assign thesaurus
descriptors to an acquired document based on its title and abstract. Our preliminary
results are encouraging. We still need to do more testing to determine the right training
sizes for various SVMs.
References
[1] Sebastiani, F. (2002). "Machine learning in automated text categorization." ACM
Computing Surveys, Vol. 34(1), pp. 1-47.
[2] Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M. (1998). "Inductive learning
algorithms and representations for text categorization." In Proceedings of CIKM-98, 7th
ACM International Conference on Information and Knowledge Management
(Washington, US, 1998), pp. 148-155.
[3] Joachims, T. (1998). "Text categorization with support vector machines: learning with
many relevant features." In Proceedings of ECML-98, 10th European Conference on
Machine Learning (Chemnitz, DE, 1998), pp. 137-142.
[4] Reuters-21578 collection.
URL: http://www.research.att.com/~lewis/reuters21578.htm
[5] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, Berlin, 1995.
[6] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2): 121-167, 1998.
[7] T. Joachims. Text Categorization with Support Vector Machines: Learning with Many
Relevant Features. Proceedings of the European Conference on Machine Learning,
Springer, 1998.
[8] T. Joachims. Learning to Classify Text Using Support Vector Machines. Dissertation,
Kluwer, 2002.
[9] J.T. Kwok. Automated text categorization using support vector machine. In
Proceedings of the International Conference on Neural Information Processing,
Kitakyushu, Japan, Oct. 1998, pp. 347-351.
[10] R. Kohavi and F. Provost. Glossary of Terms. Editorial for the Special Issue on
Applications of Machine Learning and the Knowledge Discovery Process, Machine
Learning, Vol. 30, No. 2/3, February/March 1998.