Using Natural Language Processing For
Automated Text Classification
Abhishek Oswal
March 15, 2016
Contents

1 Introduction
2 Background
3 Types of Learning Techniques
3.1 Supervised Learning
3.1.1 Regression
3.1.2 Classification
3.2 Unsupervised Learning
3.2.1 Clustering
3.3 Comparison between supervised and unsupervised learning
3.4 Examples of different learning techniques
4 Process of Classification
4.1 Data Preprocessing
4.2 Training Set
4.3 Test Set
4.4 Creation of Model
4.5 Algorithm
4.6 Classify
5 Text Categorization
5.1 Mathematical Definition of the Text Classification Task
5.2 Text Representation Format
5.2.1 Bags of Word Representation
5.2.2 Document–Term Matrix
5.3 Methods to Classify
6 General Approach
6.1 Precision and Recall
7 Bayesian Categorization
7.1 Bayes Theorem
7.2 Naive Bayes Equation
8 Support Vector Machines
8.1 SVM Equation
9 k-Nearest Neighbor Categorization
9.1 k-NN Equation
9.2 kNN Algorithm Example
10 Properties
10.1 Properties of Naïve Bayes Categorization
10.2 Limitations of Naïve Bayes Categorization
10.3 Properties of k Nearest Neighbor Categorization
10.4 Limitations of k Nearest Neighbor Categorization
11 Conclusion
Abstract
With the growth of technology and the Internet, managing text as online information has become a natural need. Today we search for books and news on the Internet, and many companies and individuals maintain their own web pages; whenever we need some information, we look for it online. As a result, a vast amount of information has become openly available. A small collection of documents can still be classified by hand, but the sheer volume of information available today makes manual classification impractical. Hence we need some automatic and fast approach to classify text information into various fields. Text classification is gaining importance due to the accessibility of large numbers of electronic documents from a variety of sources. The classification problem has been studied in Natural Language Processing, Data Mining, and Machine Learning, with a variety of applications in diverse domains, such as news group filtering, document organization, and target marketing. This report mainly focuses on the analysis of the Naive Bayes categorization algorithm for automated text classification.

Keywords: Text Classification, Text Categorization, Naive Bayes, Support Vector Machine, Spam Filtering
List of Figures

3.1 Comparison between supervised and unsupervised learning
3.2 Examples of different learning techniques
5.1 Example representing categorization
5.2 Bags of word representation
5.3 Document-term matrix representation
6.1 General approach of classification
6.2 Formula for calculating precision
6.3 Formula for calculating recall
7.1 Bayes Theorem
7.2 General Naive Bayes Theorem
7.3 Formula for calculating P(c)
7.4 Formula for calculating P(x|c)
7.5 Maximizing estimation
7.6 Calculating prior probability
7.7 Formula for predicting category
8.1 Support vector machine
9.1 Euclidean Distance
9.2 KNN algorithm equation
Chapter 1
Introduction
Categorization is the process in which objects are recognized and differentiated on the basis of various properties known as features. A category indicates a relationship between subjects based on attributes and objects of knowledge. Hence categorization implies that objects are grouped into various categories for some specific purpose.

Categorization is used for prediction, decision making, and so on. The objects we classify may be audio, image, video, text, etc. Text Categorization is also known as text classification. Text Classification is the process of classifying documents with respect to a group of one or more existing categories. Categories are formed according to the concepts, themes, or relations present in their contents. Current research in text classification mainly aims to improve the quality of text representation, increase efficiency, and develop high-quality classifiers.

The text classification process consists of collecting data documents (gathering), data preprocessing (converting raw data to refined data), indexing, term weighting methods, and classification algorithms (developing classifiers) based on various features.
The basic goal of text categorization is the classification of documents into a number of predetermined categories. Each document can belong to exactly one category, to multiple categories, or to none at all. Machine learning approaches have been actively explored for classification purposes; among these are the Naive Bayes classifier, k-nearest neighbor classifiers, support vector machines, and neural networks.
Services like mail filters, web filters, and online help desks are based on text classification. Mail filters sort business e-mails from spam e-mails by classifying each e-mail as "ordinary mail" or "spam mail." Web filters prevent children from accessing undesirable website content by classifying web sites into categories. Hence, text classification technology is important for these services to run.
Most research work in the area of Text Categorization uses supervised learning methods, which depend on a huge amount of labeled training data to obtain better and faster classification. Because resources of labeled training data are scarce, data must be labeled manually before it can be used for classification, and that is a very long and expensive task. On the other hand, there are wide resources of unlabeled training data that can be utilized for text classification.
Recently, various research efforts have tried to establish methods based on unlabeled training data, rather than using labeled data or manually labeling data from the same group. One method that really worked is Keyword-based Text Categorization, which is mainly based on keyword representations of categories and documents.
Chapter 2
Background
Today's world is weighed down with data and information from various sources, and the IT field has made collecting data easier than ever before. Data Mining is a technique for extracting interesting patterns, previously unknown features, and knowledge from very large amounts of data. It mainly helps large business organizations.

Recently, data mining has also attracted the whole IT industry. It helps real-world applications convert large amounts of data into meaningful information. Data Mining is used in various fields: business, the banking sector, scientific research, intelligence agencies, the social media sector, robotics, and many more. Categorization is one of the data mining tasks.
Chapter 3
Types of Learning Techniques
Machine Learning is the ability to learn from observation, previous experience, and other means, resulting in a system that can continually improve itself for increased efficiency and better effectiveness.
There are different types of learning techniques.
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
Text Categorization using the k-Nearest Neighbor and Naive Bayes algorithms belongs to the supervised learning techniques.
3.1 Supervised Learning
Supervised Learning is a technique in which conclusions are drawn from a training set. A training set contains pairs of input data and the category labels to which they belong; this data is initially categorized by experts to construct the categorization model.

Once the categorization model is trained, it must be able to categorize test data into the appropriate category. Test data is a set of data used for validating the categorization model developed on the basis of the training data set.

Supervised learning problems are categorized into "regression" and "classification" problems.
3.1.1 Regression
In a regression problem, we are trying to predict results within a continuous
output, meaning that we are trying to map input variables to some continuous
function.
3.1.2 Classification
In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
3.2 Unsupervised Learning
Unsupervised Learning is a technique of detecting a function that describes hidden patterns in unlabeled data. As the data given to the learner is unlabeled, there is no error or reward signal with which to evaluate a potential solution.
3.2.1 Clustering
We can derive hidden structure by clustering the data based on relationships among the variables in the data. With unsupervised learning there is no feedback based on the prediction results, i.e., there is no teacher to correct you. Unsupervised learning is not just about clustering; associative memory, for example, is also unsupervised learning.
3.3 Comparison between supervised and unsupervised learning
Hence, in supervised learning, output datasets are provided and used to train the machine to produce the desired outputs, whereas in unsupervised learning no labeled datasets are provided; instead, the data is clustered into different classes, and no desired output is specified in advance.
Figure 3.1: Comparison between supervised and unsupervised learning
3.4 Examples of different learning techniques
Figure 3.2: Examples of different learning techniques
Chapter 4
Process of Classification
A categorization process is a systematic approach to building a categorization model from an input data set. The method requires a learning algorithm to identify a model that captures the relationship between the attribute set and the class label of the input data. This learning algorithm should both fit the input data well and predict the class labels of previously unseen records.
There are various steps involved in this process.
4.1 Data Preprocessing
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Raw data is often inconsistent, incomplete, lacking in certain behaviors, and likely to contain many errors. Data preprocessing prepares raw data for further processing. Data goes through a number of steps during preprocessing, as sketched in the example after this list:
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
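The short Python sketch below illustrates a minimal cleaning step of this kind. The tokenizer and the tiny stop-word list are simplified assumptions made for illustration, not the behavior of any particular toolkit.

```python
import re

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"like", "and", "or", "the", "a", "is", "of", "to"}

def preprocess(text):
    """Clean one raw document: lowercase, tokenize, drop stop words."""
    text = text.lower()                    # normalize case
    tokens = re.findall(r"[a-z]+", text)   # keep alphabetic tokens only
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The spam mail, like this one, is annoying!"))
# ['spam', 'mail', 'this', 'one', 'annoying']
```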
4.2 Training Set
A training set is a set of data used to find potentially predictive relationships. The Training Data Set is the collection of data records whose class labels are already known; it is used to generate the categorization model, which is then applied to the test data set.
4.3 Test Set
A test set is a set of data used to discover the utility and strength of a predictive relationship. The Test Data Set is the collection of records whose class labels are known; when given as input to the built classification model, it should return the accurate class labels of these records. The accuracy of the model is determined from the count of correct and incorrect predictions on the test records.
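As a concrete illustration, the following sketch splits a small labeled collection into training and test sets. It assumes scikit-learn is available; the documents and labels are invented for the example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labeled collection: each document pairs with a class label.
docs = ["cheap pills online", "meeting at noon", "win money now",
        "project status report", "free lottery prize", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

# Hold out a third of the records to measure the model on later.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    docs, labels, test_size=1/3, random_state=0)

print(len(train_docs), "training records,", len(test_docs), "test records")
```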
4.4 Creation of Model
In this step a model is constructed from the training data. A model is used for understanding some part of the world, and it is built with the expectation that it will need future clarification, development, and revision.
4.5 Algorithm
An algorithm is a self-contained, step-by-step set of operations to be performed. Algorithms exist that perform data processing, automated reasoning, and calculation. It is a procedure, or a sequence of steps, used to solve a problem.
4.6 Classify
This step classifies the test data set using the model developed from the training data set with a particular algorithm. It produces the output we need.
Chapter 5
Text Categorization
Categorization is the classification of data for its most effective and most efficient use. The Text Classification task is defined as the automatic classification of a document into two or more predefined classes.
5.1 Mathematical Definition of the Text Classification Task
Let (d_j, c_i) ∈ D × C, where D is the collection of documents and C = {c_1, c_2, ..., c_|C|} is a set of predefined categories. The main task of Text Categorization is then to decide, for every pair (d_j, c_i), whether document d_j belongs to category c_i.
Consider Figure 5.1, in which D is the domain of documents and C_1, C_2, and C_3 are different categories. D contains three different kinds of documents. After categorization, each document is assigned to its respective category.
Figure 5.1: Example representing categorization

Hence, in simple words, the problem of classification can be defined as follows. We have a set of training records D = {X_1, ..., X_N}, such that each record is labeled with a class value drawn from a set of c different discrete values indexed by {1, ..., c}. The training data is then used for the construction of a classification model, which relates the features in the underlying records to the class labels.
Here it must be noted that the frequency of words also plays a major role
in the classification process.
5.2 Text Representation Format
The first step in text categorization is to transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm and the classification process.
5.2.1 Bags of Word Representation
Information Retrieval research suggests that word frequency and the word itself work well as representation units, and that word order in a document is of little importance for many tasks such as classification. This leads to the conclusion that an attribute–value representation of text is very appropriate for the text classification process.
Figure 5.2: Bags of word representation
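A minimal sketch of this representation in Python: each document is reduced to an unordered multiset of word counts, so the ordering information is deliberately discarded. The tokenizer is a simplifying assumption.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a document to unordered word counts: word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

print(bag_of_words("Free money! Win free money now."))
# Counter({'free': 2, 'money': 2, 'win': 1, 'now': 1})
```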
5.2.2 Document–Term Matrix
A document–term matrix (or term–document matrix) is a matrix that describes the frequency of words occurring in a collection of documents. In a document–term matrix, rows correspond to documents in the collection and columns correspond to terms.
Each distinct word w_i corresponds to a feature, with the number of times word w_i occurs in the document as its value. To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least 3 times and if they are not stop-words like "and", "or", etc.
Figure 5.3: Document-term matrix representation
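The following sketch builds such a matrix with scikit-learn's CountVectorizer, assuming scikit-learn (1.0 or later for get_feature_names_out) is available. Note that min_df=3 keeps a word only if it appears in at least 3 documents, a close but not identical criterion to "occurs at least 3 times in the training data"; the corpus is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training corpus; rows of the resulting matrix are documents,
# columns are terms, and each cell holds a raw term frequency.
corpus = ["spam spam free offer", "free offer today", "spam free money",
          "meeting today", "spam money offer", "free spam today"]

# Keep a word only if it occurs in at least 3 documents and is not an
# English stop word, mirroring the feature-selection rule described above.
vectorizer = CountVectorizer(min_df=3, stop_words="english")
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # surviving terms (columns)
print(dtm.toarray())                       # the document-term matrix
```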
5.3 Methods to Classify
There are many categorization techniques in use. They are:
• Bayesian Categorization.
• K Nearest Neighbor Categorization.
• Decision Tree Categorization.
• Rule Based Categorization.
• Support Vector Machines.
• Neural Networks.
Chapter 6
General Approach
Three major categorization techniques are
• Bayesian
• Support Vector Machine
• kNN
Figure 6.1: General approach of classification
6.1 Precision and Recall
Precision and Recall values evaluate the performance of the categorization model. Precision measures exactness, whereas Recall measures completeness.

Let TP be the number of true positives, i.e. the number of documents correctly labeled as belonging to the category, as agreed by both the experts and the model. Let FP be the number of false positives, i.e. the number of documents that the model wrongly categorizes as belonging to that category. Let FN be the number of false negatives, i.e. the number of documents which are not labeled as belonging to the category but should have been.
Hence, Precision is defined as

Precision = TP / (TP + FP)

Figure 6.2: Formula for calculating precision

Recall is defined as

Recall = TP / (TP + FN)

Figure 6.3: Formula for calculating recall
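A small sketch computing both measures directly from the counts defined above; the counts themselves are invented for illustration.

```python
def precision(tp, fp):
    """Fraction of the documents the model assigned to the category that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of the category's true documents that the model recovered."""
    return tp / (tp + fn)

# Invented counts: 40 correct hits, 10 false alarms, 20 misses.
tp, fp, fn = 40, 10, 20
print(f"precision = {precision(tp, fp):.2f}")  # 0.80
print(f"recall    = {recall(tp, fn):.2f}")     # 0.67
```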
Chapter 7
Bayesian Categorization
Bayesian categorization is a well-known classification technique. It is used to predict class membership probabilities, i.e. the probability that a given record belongs to a specific category, and it is based on Bayes Theorem.

Bayes theorem is a simple mathematical formula used for calculating conditional probabilities.
7.1 Bayes Theorem
Let X be a sample data record whose category is unknown, and let H be the hypothesis that the sample X belongs to a specified category C. We want to determine P(H|X), i.e. the probability that the hypothesis H holds given the data sample X.
Bayes Theorem is

P(H|X) = P(X|H) P(H) / P(X)

Figure 7.1: Bayes Theorem

Here P(H|X) is the posterior probability of H conditioned on X. The posterior probability is based on additional information such as background knowledge, unlike the prior probability P(H), which is independent of the data sample X. P(X|H) is, in turn, the probability of X conditioned on H. If the given data is huge, it becomes difficult to calculate these probabilities directly; conditional independence was introduced to overcome this limitation.
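A tiny worked example of the theorem with invented numbers: given the prior P(spam), the conditional P("free" | spam), and the overall P("free"), the posterior P(spam | "free") follows directly.

```python
# Invented probabilities for illustration only.
p_spam = 0.30              # prior P(H): fraction of all mail that is spam
p_free_given_spam = 0.40   # P(X|H): "free" appears in 40% of spam
p_free = 0.15              # P(X): "free" appears in 15% of all mail

# Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.2f}")  # 0.80
```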
7.2 Naive Bayes Equation
Naive Bayes categorization is one of the simplest probabilistic Bayesian categorization methods. It is based on the assumption that the effect of an attribute value on a given category is independent of the values of the other attributes, which is called conditional independence. This assumption is used to simplify otherwise complex computations.

The Naive Bayes classifier is a probabilistic classifier which is based on this Naive Bayes assumption.
From Bayes rule, the posterior probability can be given as

P(c|x) = P(x|c) P(c) / P(x)

Figure 7.2: General Naive Bayes Theorem

where x = (x_1, ..., x_n) is a feature vector and c is a category. Assume that the category c_max yields the maximum value of P(c|x).
The parameter P(c) is estimated as

P(c) = N(c) / N

Figure 7.3: Formula for calculating P(c)

where N(c) is the number of training examples in category c and N is the total number of training examples. The classification results are not affected by the denominator P(x), because it is independent of the categories.
Assuming that the components of the feature vector are statistically independent of each other, P(x|c) can be calculated as

P(x|c) = ∏_i P(x_i|c)

Figure 7.4: Formula for calculating P(x|c)
If the maximum likelihood estimate is used, then

P(x_i|c) = N(x_i, c) / N(c)

Figure 7.5: Maximizing estimation

where N(x_i, c) is the joint frequency of x_i and c, and N(c) is the total frequency of category c. If some feature x_i never occurs in the training data of a category, the probability of any instance containing x_i becomes zero for that category, regardless of the other features in the vector. Therefore, to avoid zero probabilities, P(x_i|c) is estimated using Laplacian prior probabilities as follows:

P(x_i|c) = (N(x_i, c) + 1) / (N(c) + |V|)

Figure 7.6: Calculating prior probability

where |V| is the number of distinct features (the vocabulary size).
The Naive Bayes classifier predicts the category c_max with the largest posterior probability:

c_max = argmax_c P(c) ∏_i P(x_i|c)

Figure 7.7: Formula for predicting category
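The following is a minimal from-scratch sketch of such a classifier in Python. The toy training documents are invented, the Laplace smoothing follows the form given above, and log-probabilities are used to avoid numerical underflow (a standard implementation trick, not something the equations require).

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (tokenized document, category).
train = [(["free", "money", "offer"], "spam"),
         (["free", "prize", "win"], "spam"),
         (["meeting", "project", "report"], "ham"),
         (["lunch", "meeting", "tomorrow"], "ham")]

# Count documents per category, word counts N(x_i, c), and the vocabulary.
doc_count = Counter(c for _, c in train)
word_count = defaultdict(Counter)
vocab = set()
for words, c in train:
    word_count[c].update(words)
    vocab.update(words)

def predict(words):
    """Return argmax_c P(c) * prod_i P(x_i|c), computed in log space."""
    best_c, best_logp = None, -math.inf
    total = sum(doc_count.values())
    for c in doc_count:
        logp = math.log(doc_count[c] / total)   # log P(c)
        n_c = sum(word_count[c].values())       # total word count in c
        for w in words:
            # Laplace-smoothed P(x_i|c) = (N(x_i,c) + 1) / (N(c) + |V|)
            logp += math.log((word_count[c][w] + 1) / (n_c + len(vocab)))
        if logp > best_logp:
            best_c, best_logp = c, logp
    return best_c

print(predict(["free", "offer"]))       # spam
print(predict(["project", "meeting"]))  # ham
```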
Chapter 8
Support Vector Machines
A support vector machine (SVM) is a machine learning method that divides the feature space into a side containing the positive training examples and a side containing the negative ones. It constructs the hyperplane with the maximum margin between the positive and negative examples; this hyperplane serves as the optimum solution based on the concept of structural risk minimization.
8.1 SVM Equation
SVM calculates the optimal hyperplane that supplies the maximum margin, where w·x + b = 0 is the final border hyperplane for classification. The training examples lying on w·x + b = 1 and w·x + b = -1 are called support vectors.
Figure 8.1: Support vector machine
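A hedged end-to-end sketch of text classification with a linear SVM, using scikit-learn's CountVectorizer and LinearSVC; it assumes scikit-learn is installed, and the corpus is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented training corpus with positive (spam) and negative (ham) examples.
train_docs = ["free money offer", "win a free prize", "cheap pills online",
              "project status report", "meeting at noon", "lunch tomorrow"]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)   # document-term matrix

clf = LinearSVC()                          # learns the w and b of w.x + b = 0
clf.fit(X, train_labels)

X_new = vectorizer.transform(["free offer tomorrow"])
print(clf.predict(X_new))                  # e.g. ['spam']
```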
Chapter 9
k-Nearest Neighbor Categorization
Nearest Neighbor search is an optimization problem of finding the closest points in a space. It is also called similarity search or closest-point search. For a given set of points S in a space M and a query point q, the problem is to find the point in S closest to q. Usually the distance is measured by the Euclidean distance.
9.1 k-NN Equation
The k-Nearest Neighbor (k-NN) categorization is the simplest of all the supervised machine learning techniques, yet it is a widely used method for classification and retrieval. It classifies objects based on the closest training examples in the feature space. It is an instance-based learner, also known as a lazy learning algorithm. Here a query instance is classified according to the majority category among its k nearest neighbors. The k nearest neighbors of a query in a database are found by calculating the Euclidean distance. The neighbors of a query instance are taken from a data set of objects whose categories are already known.
Euclidean distance is calculated as

d(x, y) = sqrt( Σ_i (x_i - y_i)² )

Figure 9.1: Euclidean Distance
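A minimal from-scratch kNN sketch matching this description. The points and labels are invented, and ties in the vote are broken arbitrarily by Counter ordering rather than by any principled rule.

```python
import math
from collections import Counter

# Invented labeled points in a 2-D feature space.
data = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((3.0, 4.0), "B"),
        ((5.0, 7.0), "B"), ((3.5, 4.5), "B"), ((2.0, 1.5), "A")]

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(query, k=3):
    """Label the query by majority vote among its k nearest neighbors."""
    neighbors = sorted(data, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.2, 1.8), k=3))  # 'A'
```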
9.2 kNN Algorithm Example
The figure below shows the feature space for different values of k.
Figure 9.2: KNN algorithm equation
Chapter 10
Properties
10.1 Properties of Naïve Bayes Categorization
• Naïve Bayes categorization is a probabilistic categorization method based on conditional independence between features.

• Naïve Bayes classifies an unknown instance by computing the category which maximizes the posterior probability.

• Naïve Bayes categorization is flexible and robust to errors.

• The prior and the likelihood can be updated dynamically with each training example.

• It outputs a probabilistic hypothesis: not only a classification, but a probability distribution over all categories.

• Naïve Bayes is very efficient; its training time grows linearly with the size of the training data.

• It is easy to implement when compared with other algorithms.
• Naïve Bayes has low variance and high bias compared to other algo-
rithms.
10.2 Limitations of Naïve Bayes Categorization
• Sometimes the assumption of Conditional Independence is violated by
the real world data.
• It gives poor performance when the features are highly correlated.
• It does not consider the frequency of the word occurrences.
• Another problem is that the features are assumed to be independent even when the words are actually dependent, since it considers each word's contribution individually.
• It cannot be used for solving more complex classification problems.
10.3 Properties of k Nearest Neighbor Categorization
• Unlike Naïve Bayes, kNN does not rely on prior probabilities.
• kNN computes the similarity between a testing instance and all the training examples in a collection to find the nearest ones.
• It does not explicitly compute a generalization or category prototypes.
• It is also called a Case-based, Instance-based, Memory-based, or Lazy learning algorithm.
• k Nearest Neighbor is a robust method: it finds the k most similar examples and returns the majority class of these k instances.
• It can work with relatively little information.
• Nearest Neighbor method depends on the similarity or distance metric.
• The k Nearest Neighbor algorithm has a potential advantage for problems with a large number of classes.
10.4 Limitations of k Nearest Neighbor Categorization
• Classification time is too long.
• It is difficult to find the optimal value of k.
• If the training data is large and complex, evaluating the target function may slow down query processing, and irrelevant attributes may mislead the neighbor search.
Chapter 11
Conclusion
We discussed the background of categorization, presented the different methodologies, and explained them theoretically: the Naive Bayes algorithm, support vector machines, and the k-nearest neighbor algorithm. We then discussed the time efficiencies, advantages, and disadvantages of two of these engines. From our study, we observe that the standard precision and recall values of the k Nearest Neighbor categorization engine are better than those of the Naïve Bayes engine; of the two, kNN gives the better performance.

Text Categorization is an active area of research in the field of information retrieval and machine learning. In future, this study can be extended by implementing the categorization engines on larger datasets.
Contenu connexe

Tendances

Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)KU Leuven
 
Chapter 09 class advanced
Chapter 09 class advancedChapter 09 class advanced
Chapter 09 class advancedHouw Liong The
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)KU Leuven
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationPaul Houle
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster AnalysisSSA KPI
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clusteringKrish_ver2
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basicHouw Liong The
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithmhadifar
 
Chapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningChapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningHouw Liong The
 
K means clustering
K means clusteringK means clustering
K means clusteringkeshav goyal
 
IRJET- Multi Label Document Classification Approach using Machine Learning Te...
IRJET- Multi Label Document Classification Approach using Machine Learning Te...IRJET- Multi Label Document Classification Approach using Machine Learning Te...
IRJET- Multi Label Document Classification Approach using Machine Learning Te...IRJET Journal
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Sean Golliher
 

Tendances (17)

Probabilistic models (part 1)
Probabilistic models (part 1)Probabilistic models (part 1)
Probabilistic models (part 1)
 
Chapter 09 class advanced
Chapter 09 class advancedChapter 09 class advanced
Chapter 09 class advanced
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
 
Data clustering
Data clustering Data clustering
Data clustering
 
Cluster analysis
Cluster analysisCluster analysis
Cluster analysis
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly Information
 
Clustering
ClusteringClustering
Clustering
 
Cluster Analysis
Cluster AnalysisCluster Analysis
Cluster Analysis
 
3.1 clustering
3.1 clustering3.1 clustering
3.1 clustering
 
Capter10 cluster basic
Capter10 cluster basicCapter10 cluster basic
Capter10 cluster basic
 
Introduction to Clustering algorithm
Introduction to Clustering algorithmIntroduction to Clustering algorithm
Introduction to Clustering algorithm
 
Chapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text miningChapter 11 cluster advanced : web and text mining
Chapter 11 cluster advanced : web and text mining
 
Dbm630 lecture09
Dbm630 lecture09Dbm630 lecture09
Dbm630 lecture09
 
K means clustering
K means clusteringK means clustering
K means clustering
 
IRJET- Multi Label Document Classification Approach using Machine Learning Te...
IRJET- Multi Label Document Classification Approach using Machine Learning Te...IRJET- Multi Label Document Classification Approach using Machine Learning Te...
IRJET- Multi Label Document Classification Approach using Machine Learning Te...
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 

Similaire à Text Classification/Categorization

[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...Evaldas Taroza
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownAdrian Cuyugan
 
Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Gora Buzz
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataKathleneNgo
 
A Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKA Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKSara Parker
 
An Introduction to Statistical Learning R Fourth Printing.pdf
An Introduction to Statistical Learning R Fourth Printing.pdfAn Introduction to Statistical Learning R Fourth Printing.pdf
An Introduction to Statistical Learning R Fourth Printing.pdfDanielMondragon15
 
Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Pieter Van Zyl
 
Annotating Digital Documents For Asynchronous Collaboration
Annotating Digital Documents For Asynchronous CollaborationAnnotating Digital Documents For Asynchronous Collaboration
Annotating Digital Documents For Asynchronous CollaborationClaire Webber
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Pedro Ernesto Alonso
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network HamdaAnees
 
Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisIllia Ovchynnikov
 
Active Learning Literature Survey
Active Learning Literature SurveyActive Learning Literature Survey
Active Learning Literature Surveybutest
 
Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...Roman Atachiants
 

Similaire à Text Classification/Categorization (20)

Machine_Learning_Co__
Machine_Learning_Co__Machine_Learning_Co__
Machine_Learning_Co__
 
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
[Evaldas Taroza - Master thesis] Schema Matching and Automatic Web Data Extra...
 
Investigation in deep web
Investigation in deep webInvestigation in deep web
Investigation in deep web
 
Arules_TM_Rpart_Markdown
Arules_TM_Rpart_MarkdownArules_TM_Rpart_Markdown
Arules_TM_Rpart_Markdown
 
main
mainmain
main
 
Thesis
ThesisThesis
Thesis
 
Thesis
ThesisThesis
Thesis
 
Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020
 
Mat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports DataMat189: Cluster Analysis with NBA Sports Data
Mat189: Cluster Analysis with NBA Sports Data
 
A Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORKA Seminar Report On NEURAL NETWORK
A Seminar Report On NEURAL NETWORK
 
An Introduction to Statistical Learning R Fourth Printing.pdf
An Introduction to Statistical Learning R Fourth Printing.pdfAn Introduction to Statistical Learning R Fourth Printing.pdf
An Introduction to Statistical Learning R Fourth Printing.pdf
 
Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010Dissertation_of_Pieter_van_Zyl_2_March_2010
Dissertation_of_Pieter_van_Zyl_2_March_2010
 
Annotating Digital Documents For Asynchronous Collaboration
Annotating Digital Documents For Asynchronous CollaborationAnnotating Digital Documents For Asynchronous Collaboration
Annotating Digital Documents For Asynchronous Collaboration
 
Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.Thesis. A comparison between some generative and discriminative classifiers.
Thesis. A comparison between some generative and discriminative classifiers.
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
Parallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets AnalysisParallel and Distributed Algorithms for Large Text Datasets Analysis
Parallel and Distributed Algorithms for Large Text Datasets Analysis
 
Active Learning Literature Survey
Active Learning Literature SurveyActive Learning Literature Survey
Active Learning Literature Survey
 
Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...Research: Developing an Interactive Web Information Retrieval and Visualizati...
Research: Developing an Interactive Web Information Retrieval and Visualizati...
 
Lakhotia09
Lakhotia09Lakhotia09
Lakhotia09
 
Access tutorial
Access tutorialAccess tutorial
Access tutorial
 

Dernier

Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Arindam Chakraborty, Ph.D., P.E. (CA, TX)
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01KreezheaRecto
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . pptDineshKumar4165
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxJuliansyahHarahap1
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfrs7054576148
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Standamitlee9823
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringmulugeta48
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 

Dernier (20)

Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
NFPA 5000 2024 standard .
NFPA 5000 2024 standard                                  .NFPA 5000 2024 standard                                  .
NFPA 5000 2024 standard .
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Netaji Nagar, Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Intro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdfIntro To Electric Vehicles PDF Notes.pdf
Intro To Electric Vehicles PDF Notes.pdf
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 

Text Classification/Categorization

  • 1. Using Natural Language Processing For Automated Text Classification Abhishek Oswal March 15, 2016
  • 2. Contents 1 Introduction 3 2 Background 5 3 Types of Learning Techniques 6 3.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . 6 3.1.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.1.2 Classification . . . . . . . . . . . . . . . . . . . . . . . 7 3.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . 7 3.2.1 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 8 3.3 Comparision between supervised and unsupervised learning . . 8 3.4 Example for diffrerent learning technique . . . . . . . . . . . . 9 4 Process of Classification 10 4.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 10 4.2 Training Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4.3 Test Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4.4 Creation of Model . . . . . . . . . . . . . . . . . . . . . . . . . 11 4.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4.6 Classify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1
  • 3. CONTENTS 5 Text Categorization 13 5.1 Mathematical Definition of the Text Classification Task . . . 13 5.2 Text Representation Format . . . . . . . . . . . . . . . . . . . 14 5.2.1 Bags of Word Representation . . . . . . . . . . . . . . 15 5.2.2 Document–Term Matrix . . . . . . . . . . . . . . . . . 15 5.3 Methods to Classify . . . . . . . . . . . . . . . . . . . . . . . . 16 6 General Approach 17 6.1 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . 18 7 Bayesian Categorization 19 7.1 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 19 7.2 Naive Bayes Equation . . . . . . . . . . . . . . . . . . . . . . 20 8 Support Vector Machines 23 8.1 SVM Equation . . . . . . . . . . . . . . . . . . . . . . . . . . 23 9 k-Nearest Neighbor Categorization 25 9.1 k-NN Equation . . . . . . . . . . . . . . . . . . . . . . . . . . 25 9.2 kNN Algorithm Example . . . . . . . . . . . . . . . . . . . . . 26 10 Properties 27 10.1 Properties of Naïve Bayes Categorization . . . . . . . . . . . 27 10.2 Limitations of Naïve Bayes Categorization . . . . . . . . . . . 28 10.3 Properties of k Nearest Neighbor Categorization . . . . . . . . 28 10.4 Limitations of k Nearest Neighbor Categorization . . . . . . . 29 11 Conclusion 30 2
  • 4. Abstract With the growth of technology and Internet, it has become natural that we need to manage a text as online information. Today, we search books and news using Internet. Many companies and individuals have their web pages. When there is some information to find, we search for the information on the Internet. Hence a lot of information has become open .If you have huge information, you would need to classify it. It is possible to classify it if information is less. However, today as there is lot of information, it is be- coming difficult to classify them by hand. Hence, we need some automatic and fast apporach to classify text information into various fields. Text clas- sification is gaining more importance due to the accessibility of large number of electronic documents from a variety of resources. Problem of classification has been studied in the Natural Language Processing ,Dataminig ,Machine Learning with variety of applications in a various diverse domains, such as , news group filtering ,document organization and target marketing. This report mainly focuses on analysis of naive Bayes Categorization algorithm for automated text classification. Key-words:Text Classification,Text Categorization,Naive Bayes,Support Vec- tor Machine,Spam Filtering
  • 5. List of Figures 3.1 Comparision between supervised and unsupervised learning . . 8 3.2 Examples of different learning techniques . . . . . . . . . . . . 9 5.1 Example representing categorization . . . . . . . . . . . . . . . 14 5.2 Bags of word representation . . . . . . . . . . . . . . . . . . . 15 5.3 Document-term matrix representation . . . . . . . . . . . . . . 16 6.1 General apporach of classification . . . . . . . . . . . . . . . . 17 6.2 Formula for calculating precision . . . . . . . . . . . . . . . . 18 6.3 Formula for calculating recall . . . . . . . . . . . . . . . . . . 18 7.1 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 20 7.2 General NaiveBayes Theorem . . . . . . . . . . . . . . . . . . 20 7.3 Formula for calculating P(c) . . . . . . . . . . . . . . . . . . . 21 7.4 Formula for calculating P(x|c) . . . . . . . . . . . . . . . . . . 21 7.5 Maximizing estimation . . . . . . . . . . . . . . . . . . . . . . 21 7.6 Calculating prior probability . . . . . . . . . . . . . . . . . . . 22 7.7 Formula for predicting category . . . . . . . . . . . . . . . . . 22 8.1 Support vector machine . . . . . . . . . . . . . . . . . . . . . 24 9.1 Eucladian Distance . . . . . . . . . . . . . . . . . . . . . . . . 26 1
  • 6. LIST OF FIGURES 9.2 KNN algorithm equation . . . . . . . . . . . . . . . . . . . . . 26 2
  • 7. Chapter 1 Introduction Categorization is the process in which objects are recognized and differenti- ated on basis of various properties known as features.A category indicates a relationship between the sujects based on attributes and objects of knowl- edge.Hence categorization implies that objects are grouped into various cat- egories for some specific purpose. Categorization is used for prediction, decision making etc.Objects which we classify may be audio,image,video,text etc. Text Categorization is also known as text classification. Text Classification is a process of classifying doc- uments with respectt to a group of one or more existent categories.Categories are formed according to concepts or themes or relation present in their con- tents. Current research topic of text classification mainly aims to improve the quality of text representation increase efficiency and develop high quality classifiers. Text classification process consists of collection of data documents ( gath- ering ), data preprocessing ( converting raw data to refined data ),Index- ing,term weighing methods,classification algorithms ( developing classifiers ) based on various features. 3
  • 8. CHAPTER 1. INTRODUCTION The basic goal of text categorization is the classification of documents into a number of predecided categories. Each document can be in exactly one, multiple or no category at all.Machine learning apporaches have been actively explored for classification purpose. Among these are Naive bayes classifier , K-nearest neighbor classifiers , support vector machine , neural networks. Services like mail filters, web filters, and online help desks are based on text classification. Mail filters sorts business e-mails or spam e- mails , by classifying e-mail into “ordinary mail” or “spam mail.” Web filters prevent children from accessing undesirable website content , by classifying web sites categories. Hence , Text Classification technology is important for these services to run. Mainly research works in the area of Text Categorization use supervised learning methods, which mainly dependson on huge amount of labeled train- ing data to get better and fast classification. Due to lack of available resources of labeled training data it requires manual labelling of data so that it can be used for classification method and that is really very long and to expensive task. On the other hand, there are wide resources of unlabeled training data that can be utilized for Text Classification purpose. Recently there were various research efforts done which tried to estab- lish their methods on basis of unlabeled training data.Rather than using labelled data or manually labelling of data of same group and one of those method that really worked is Keyword-based Text Categorization.Keyword- based TextClassification is mainly based on keyword representation of cate- gories and documents. 4
  • 9. Chapter 2 Background Today’s world is weighed down with lots of data and information from various sources.IT field made the collection of data more easier than ever before. Data Mining is a technique of extracting interesting patterns , known features and knowledge from a very huge amount of data. It mainly helps large business orgatization. Recently data mining also attracted the whole IT industry. It majorly helps the real world applications, to convert large amount of data to meaning- full information. Data Mining is used in various field of businesses, banking sector, scientific research , intelligence agencies ,social media sector , robotics and many more. And Categorization is one of data mining tasks. 5
  • 10. Chapter 3 Types of Learning Techniques Machine Learning has ability to learn from observation , previous experi- ences, and other means, that results in a system that can be infintely self- improved to give increased efficiency and better effectiveness. There are different types of learning techniques. 1. Supervised learning 2. Unsupervised learning 3. Semi-supervised learning Text Categorization using K- Nearest Neighbor algorithms and Naive Bayes belongs to supervised learning techniques. 3.1 Supervised Learning Supervised Learning is a technique in which conclusions are drawn from a training set. Training set is a set which contains pairs of input data and 6
  • 11. CHAPTER 3. TYPES OF LEARNING TECHNIQUES category labels to which they belong. Trained data is initially categorized to construct categorization model by experts When the categorization model is trained, it must have ability to cate- gorize the test data to its appropriate category.Test data is a set of data for is use for validating our categorization model developed on basis of training data set. Supervised learning problems are categorized into "regression" and "clas- sification" problems. 3.1.1 Regression In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. 3.1.2 Classification In a classification problem, we are instead trying to predict results in a dis- crete output. In other words, we are trying to map input variables into discrete categories. 3.2 Unsupervised Learning Unsupervised Learning is a technique of detecting a function to describe hidden pattern from unlabeled data. As the data set given to the learner are unlabeled there is no error or reward mark or signal to get a potential solution 7
  • 12. CHAPTER 3. TYPES OF LEARNING TECHNIQUES 3.2.1 Clustering We can derive this structure by clustering the data based on relationships among the variables in the data.With unsupervised learning there is no feed- back based on the prediction results, i.e., there is no teacher to correct you. It’s not just about clustering. For example, associative memory is unsuper- vised learning. 3.3 Comparision between supervised and un- supervised learning Hence in supervised learning, the output datasets are provided which are used to train the machine and get the desired outputs whereas in unsupervised learning no datasets are provided, instead the data is clustered into different classes whereas in unsupervised learning there is no desired output that is provided . Figure 3.1: Comparision between supervised and unsupervised learning 8
  • 13. CHAPTER 3. TYPES OF LEARNING TECHNIQUES 3.4 Example for diffrerent learning technique Figure 3.2: Examples of different learning techniques 9
  • 14. Chapter 4 Process of Classification A Categorization process is a proper approach to build the categorization model from an input set of data. This method requires a learning algorithm to identify a model that understands the relationship between the attribute set and class label of the input data. This learning algorithm should fit the input data well and also predict the class labels of previously unknown records. There are various steps involved in this process. 4.1 Data Preprocessing Data preprocessingis a data mining technique that involves transforming raw data into an understandable format.Data is often inconsistent, incomplete and lacking in certain behaviors and is likely to contain many errors.Data preprocessing prepares raw data for further processing. Data goes through a different of steps during preprocessing • Data Cleaning • Data Integration 10
• Data Transformation
• Data Discretization

4.2 Training Set

A training set is a set of data used to discover potentially predictive relationships. It refers to the collection of data records whose class labels are already known and which is used to generate the categorization model. The model is then applied to the test data set.

4.3 Test Set

A test set is a set of data used to assess the utility and strength of a predictive relationship. It is the collection of records whose class labels are known but which, when given as input to the built classification model, should be returned with the correct class labels. It determines the accuracy of the model based on the count of correct and incorrect predictions on the test records.

4.4 Creation of Model

A model is a simplified representation used for understanding some part of the world. It is a first draft of some ideas and principles, built with the expectation of future clarification, development, and revision.
4.5 Algorithm

An algorithm is a self-contained, step-by-step set of operations to be performed. Algorithms exist for data processing, automated reasoning, and calculation; in short, an algorithm is the procedure used to solve the problem.

4.6 Classify

This step is the classification of the test data set using the model developed from the training data set by a particular algorithm. Its result is the output we need. A minimal end-to-end sketch of the whole process follows.
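The sketch below walks through the steps of this chapter on a toy corpus: preprocessing, splitting into training and test sets, building a model with a learning algorithm (here Naive Bayes, discussed in Chapter 7), and classifying. The library (scikit-learn) and the tiny data set are illustrative assumptions, not taken from the report.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # Toy labeled corpus (class labels already known).
    docs = ["stock prices fell", "market rally continues",
            "team wins final", "player scores goal",
            "shares drop sharply", "coach praises squad"]
    labels = ["finance", "finance", "sports", "sports", "finance", "sports"]

    # Preprocessing: lowercase, tokenize, and vectorize the raw text.
    vec = CountVectorizer(lowercase=True)
    X = vec.fit_transform(docs)

    # Split the data into a training set and a test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.33,
                                              random_state=0)

    # Create the model with a learning algorithm, then classify.
    model = MultinomialNB().fit(X_tr, y_tr)
    print(model.predict(X_te))   # predicted class labels for the test records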
Chapter 5 Text Categorization

Categorization is the classification of data in its most effective manner for its most efficient use. The text classification task is defined as the automatic classification of a document into two or more already fixed classes.

5.1 Mathematical Definition of the Text Classification Task

Let D be the collection of documents and let C = {c_1, c_2, ..., c_|C|} be a set of predefined categories. The main task of text categorization is then to decide, for each pair (d_j, c_i) ∈ D × C, whether document d_j belongs to category c_i.

Consider Figure 5.1, in which D is the domain of documents and C_1, C_2, and C_3 are different categories. D contains three different kinds of documents; after categorization, each document is assigned to its respective category. Hence, in simple words, the problem of classification can be defined as follows. We have a set of training records D = {X_1, ..., X_N} such that each record is labeled with a class value drawn from a set of c different discrete values indexed by {1, ..., c}.
Figure 5.1: Example representing categorization

The training data is used for the construction of a classification model, which relates the features in the underlying records to the class labels. It must be noted that the frequency of words also plays a major role in the classification process.

5.2 Text Representation Format

The first step in text categorization is to transform the documents, which typically are strings of characters, into a representation suitable for the learning algorithm and the classification process.
5.2.1 Bag of Words Representation

Information retrieval research suggests that word frequency and the word itself work well as representation units, and that word ordering within a document is of little importance for many tasks such as classification. This leads to the conclusion that an [attribute - value] representation of text is very appropriate for the text classification process.

Figure 5.2: Bag of words representation

5.2.2 Document–Term Matrix

A document-term matrix (or term-document matrix) is a mathematical matrix that describes the frequency of words occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. Each distinct word w_i corresponds to a feature, with the number of times word w_i occurs in the document as its value. To avoid unnecessarily large feature vectors, words are considered as features only if they occur in the training data at least 3 times and if they are not stop-words such as "and", "or", etc. A small sketch of building such a matrix follows.
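As an illustration, the snippet below builds a document-term matrix under the same conventions described above (stop-word removal, minimum frequency). The use of scikit-learn's CountVectorizer and the toy corpus are our own choices for the sketch.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "the cat chased the dog"]

    # Rows = documents, columns = terms; entries = word counts.
    # stop_words='english' drops words like "the", "and", "or";
    # the min_df parameter would keep only words seen often enough
    # (min_df=3 would mirror the "at least 3 times" rule above).
    vec = CountVectorizer(stop_words='english')
    dtm = vec.fit_transform(docs)

    print(vec.get_feature_names_out())  # the terms (columns)
    print(dtm.toarray())                # the document-term matrix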
Figure 5.3: Document-term matrix representation

5.3 Methods to Classify

There are many categorization techniques in use, among them:

• Bayesian Categorization
• k Nearest Neighbor Categorization
• Decision Tree Categorization
• Rule Based Categorization
• Support Vector Machines
• Neural Networks
Chapter 6 General Approach

The major categorization techniques considered here are:

• Bayesian
• Support Vector Machine
• kNN

Figure 6.1: General approach of classification
6.1 Precision and Recall

Precision and recall values evaluate the performance of the categorization model. Precision measures exactness, whereas recall measures completeness. Let TP be the number of true positives, i.e. the number of documents correctly labeled, as agreed by both the experts and the model. Let FP be the number of false positives, i.e. the number of documents wrongly categorized by the model as belonging to the category. Let FN be the number of false negatives, i.e. the number of documents not labeled as belonging to the category but which should have been. Precision is then defined as

Figure 6.2: Formula for calculating precision

and recall as

Figure 6.3: Formula for calculating recall
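Since the formulas survive only as figure captions, here they are written out; these are the standard definitions implied by the text above:

    Precision = TP / (TP + FP)
    Recall    = TP / (TP + FN)

For instance, a model that returns 8 documents for a category, 6 of them correct (TP = 6, FP = 2), while missing 4 relevant documents (FN = 4), has precision 6/8 = 0.75 and recall 6/10 = 0.6.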
Chapter 7 Bayesian Categorization

Bayesian categorization is a well-known classification technique. It is used to predict class membership probabilities, i.e. the probability that a given record belongs to a specific category, and it is based on Bayes' theorem. Bayes' theorem is a simple mathematical formula for calculating conditional probabilities.

7.1 Bayes Theorem

Let X be a sample data record whose category is not known, and let H be the hypothesis that sample X belongs to a specified category C. We wish to determine P(H|X), i.e. the probability that the hypothesis H holds given the data sample X. Bayes' theorem is shown in Figure 7.1. Here P(H|X) is the posterior probability of H conditioned on X; the posterior probability is based on information such as background knowledge, unlike the prior probability, which is independent of the data sample X. Similarly, P(X|H) is the posterior probability of X conditioned on H.

Figure 7.1: Bayes Theorem
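Written out (the figure itself is not reproduced here; this is the standard identity the text describes):

    P(H|X) = P(X|H) * P(H) / P(X)

where P(H) is the prior probability of the hypothesis and P(X) is the prior probability of the data sample.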
But if the given data set is huge, it would be difficult to calculate the above probabilities directly. Conditional independence was introduced to overcome this limitation.

7.2 Naive Bayes Equation

Naive Bayes categorization is one of the simplest probabilistic Bayesian categorization methods. It is based on the assumption that the effect of an attribute value on a given category is independent of the values of the other attributes, which is called conditional independence. This assumption is used to simplify complex computations. The Naive Bayes classifier is a probabilistic classifier based on this assumption. From Bayes' rule, the posterior probability is given as

Figure 7.2: General Naive Bayes Theorem

where x = (x_1, ..., x_n) is a feature vector and c is a category.
Assume that the category c_max yields the maximum value of P(c|x). The parameter P(c) is estimated as

Figure 7.3: Formula for calculating P(c)

The classification results are not affected by the parameter p(x), because it is independent of the categories. Assuming that the components of the feature vector are statistically independent of each other, p(x|c) can be calculated as

Figure 7.4: Formula for calculating P(x|c)

If maximum likelihood estimation is used, each factor is estimated as in Figure 7.5, where N(x, c) is the joint frequency of x and c.

Figure 7.5: Maximum likelihood estimation

If some value x_i never appears in the training data, the probability of any instance containing x_i becomes zero, regardless of the other features in the vector. Therefore, to avoid zero probabilities, Laplacian prior probabilities are used and p(x_i|c) is estimated as follows.
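Because the equations in Figures 7.2 through 7.6 survive only as figure captions, here they are written out in their standard form, using the notation of the text. This is a reconstruction, not a copy of the original figures; in particular, the vocabulary size |V| in the smoothing denominator is our assumption for the normalizer.

    P(c|x)   = P(x|c) P(c) / P(x)                        (Figure 7.2)
    P(c)     = N(c) / N                                  (Figure 7.3)
    P(x|c)   = P(x_1|c) P(x_2|c) ... P(x_n|c)            (Figure 7.4)
    P(x_i|c) = N(x_i, c) / N(c)                          (Figure 7.5, maximum likelihood)
    P(x_i|c) = (N(x_i, c) + 1) / (N(c) + |V|)            (Figure 7.6, Laplace smoothing)

Here N(c) is the number of training examples in category c, N is the total number of training examples, N(x_i, c) is the joint frequency of feature x_i with category c, and |V| is the number of distinct features.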
Figure 7.6: Calculating prior probability

The Naive Bayes classifier predicts the category c_max with the largest posterior probability.

Figure 7.7: Formula for predicting category
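As a concrete sketch of the above, the snippet below trains a Naive Bayes text classifier with Laplace smoothing and predicts the category with the largest posterior probability. scikit-learn's MultinomialNB and the toy corpus are our illustrative choices; the report itself does not name an implementation.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_docs = ["cheap pills buy now", "limited offer buy cheap",
                  "meeting agenda attached", "see notes from the meeting"]
    train_labels = ["spam", "spam", "ham", "ham"]

    vec = CountVectorizer()
    X_train = vec.fit_transform(train_docs)

    # alpha=1.0 is the Laplacian prior discussed above:
    # it adds one to every joint frequency N(x_i, c).
    nb = MultinomialNB(alpha=1.0).fit(X_train, train_labels)

    X_new = vec.transform(["buy cheap pills"])
    print(nb.predict(X_new))        # the category c_max
    print(nb.predict_proba(X_new))  # posterior probability per category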
Chapter 8 Support Vector Machines

A support vector machine (SVM) is a machine learning method that divides the feature space into a positive-examples side and a negative-examples side of the training data. It constructs hyperplanes that form the margin between the positive and negative examples. These hyperplanes serve as the optimum solution based on the concept of structural risk minimization.

8.1 SVM Equation

The SVM calculates the optimal hyperplane that supplies the maximum margin, where w·x + b = 0 is the final boundary hyperplane for classification. The training examples lying on w·x + b = 1 and w·x + b = -1 are called support vectors.
Figure 8.1: Support vector machine
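A minimal sketch of using a linear SVM for text categorization, under the same illustrative assumptions as the earlier snippets (scikit-learn, toy corpus):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    docs = ["stock prices fell", "market rally continues",
            "team wins final", "player scores goal"]
    labels = ["finance", "finance", "sports", "sports"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    # LinearSVC fits the maximum-margin hyperplane w.x + b = 0;
    # after fitting, coef_ holds w and intercept_ holds b.
    svm = LinearSVC().fit(X, labels)
    print(svm.predict(vec.transform(["goal in the final"])))  # -> ['sports']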
Chapter 9 k-Nearest Neighbor Categorization

Nearest neighbor search is an optimization problem used for finding the closest points in a space; it is also called similarity search or closest point search. For a given set of points S in a space M and a query point q, the problem is to find the point in S closest to q. Usually the distance is measured by the Euclidean distance.

9.1 k-NN Equation

k-Nearest Neighbor (k-NN) categorization is the simplest of all the supervised machine learning techniques, yet it is a widely used method for classification and retrieval. It classifies objects based on the closest training examples in the feature space. It is an instance-based learning method, also known as a lazy learning algorithm. A query instance is classified by the majority category among its k nearest neighbors, all of which are found by calculating the Euclidean distance.
The neighbors of a query instance are taken from a data set of objects that have already been categorized, i.e. whose classes are previously known. The Euclidean distance is calculated as

Figure 9.1: Euclidean Distance

9.2 kNN Algorithm Example

The figure below shows the feature space for different values of k.

Figure 9.2: kNN algorithm example
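Written out, the Euclidean distance between a query q and a stored example p (the standard formula behind Figure 9.1) is:

    d(p, q) = sqrt( (p_1 - q_1)^2 + (p_2 - q_2)^2 + ... + (p_n - q_n)^2 )

And a minimal k-NN sketch, again under our illustrative assumptions (scikit-learn, toy data):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Already-categorized objects (the stored instances).
    X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [3.9, 4.2]])
    y = ["A", "A", "B", "B"]

    # k = 3 neighbors, Euclidean distance, majority vote.
    knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(X, y)
    print(knn.predict([[1.1, 0.9]]))  # -> ['A']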
Chapter 10 Properties

10.1 Properties of Naïve Bayes Categorization

• Naïve Bayes categorization is a probabilistic categorization based on conditional independence between features.
• Naïve Bayes classifies an unknown instance by computing the category that maximizes the posterior probability.
• Naïve Bayes categorization is flexible and robust to errors.
• The prior and the likelihood can be updated incrementally with each training example.
• It outputs a probabilistic hypothesis: not only a classification, but a probability distribution over all categories.
• Naïve Bayes is very efficient; its running time is linear in the size of the training data.
• It is easy to implement compared with other algorithms.
• Naïve Bayes has low variance and high bias compared to other algorithms.

10.2 Limitations of Naïve Bayes Categorization

• The assumption of conditional independence is sometimes violated by real-world data.
• It gives poor performance when the features are highly correlated.
• It does not consider the frequency of word occurrences.
• Features are assumed to be independent even when the words are in fact dependent, since each word's contribution is considered individually.
• It cannot be used for solving more complex classification problems.

10.3 Properties of k Nearest Neighbor Categorization

• Unlike Naïve Bayes, kNN does not rely on prior probabilities.
• kNN computes the similarity between a test instance and all the nearest training examples in a collection.
• It does not explicitly compute a generalization or category prototypes.
• It is also called a case-based, instance-based, memory-based, or lazy learning algorithm.
• k Nearest Neighbor is a robust way to find the k most similar examples and return the majority category of these k instances.
• It can work with relatively little information.
• The nearest neighbor method depends on the similarity or distance metric.
• The k Nearest Neighbor algorithm has a potential advantage for problems with a large number of classes.

10.4 Limitations of k Nearest Neighbor Categorization

• Classification time is long.
• It is difficult to find the optimal value of k (a common remedy is sketched after this list).
• If the training data is large and complex, evaluating the target function can slow down query processing, and irrelevant attributes may mislead the neighbor search.
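One standard way to address the choice of k is to search over candidate values with cross-validation. This remedy is not discussed in the report itself; the sketch below is an illustration using scikit-learn.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Toy data; in practice this would be the document-term matrix.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [4.0, 4.0], [3.9, 4.2], [4.1, 3.8]])
    y = ["A", "A", "A", "B", "B", "B"]

    # Try several values of k and keep the one with the best
    # cross-validated accuracy.
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": [1, 3]}, cv=3)
    search.fit(X, y)
    print(search.best_params_)  # e.g. {'n_neighbors': 3}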
Chapter 11 Conclusion

We discussed the background of categorization, presented the different methodologies, and explained them theoretically. We covered the Naive Bayes algorithm, support vector machines, and the k-nearest neighbor algorithm, and then discussed the time efficiency, advantages, and disadvantages of the two engines. From our study, we observe that the standard precision and recall values of the k Nearest Neighbor categorization engine are better than those of the Naïve Bayes engine; in this respect, kNN performs better. Text categorization remains an active area of research in the fields of information retrieval and machine learning. In future, this study can be extended by implementing the categorization engines on larger datasets.