Text Categorization
(part 1)
Categorization
 Text categorization (TC) - given a set of categories (subjects, topics) and a
collection of text documents, the process of finding the correct topic (or
topics) for each document.
Given:
 A description of an instance, x ∈ X, where X is the instance language or
instance space.
 A fixed set of categories: C={c1, c2,…cn}
Determine:
 The category of x: c(x) ∈ C, where c(x) is a categorization function whose
domain is X and whose range is C.
Text Categorization
 Pre-given categories and labeled document examples (Categories may
form hierarchy)
 Classify new documents
 A standard classification (supervised learning ) problem
[Figure: documents are fed into a Categorization System, which assigns each one to categories such as Sports, Business, Education, Science, …]
Applications of Text Categorization
Text Indexing
Indexing of Texts Using Controlled Vocabulary
 Documents are retrieved according to user queries, which are based on key
terms. The key terms all belong to a finite set called a controlled
vocabulary.
 The task of assigning keywords from a controlled vocabulary to text
documents is called text indexing.
Document Sorting and Text Filtering
Document Sorting
 Sorting the given collection of documents into several “bins.”
 Examples:
 In a newspaper, the classified ads may need to be categorized into
“Personal,” “Car Sale,” “Real Estate,” and so on.
 E-mail coming into an organization, which may need to be sorted
into categories such as “Complaints,” “Deals,” “Job applications,”
and others.
Text Filtering
 Text filtering activity can be seen as document sorting with only two
bins – the “relevant” and “irrelevant” documents.
 Examples:
 A sports related online magazine should filter out all non-sport
stories it receives from the news feed.
 An e-mail client should filter away spam.
Web page categorization
Hierarchical Web Page Categorization
 A common use of TC is the automatic classification of Web pages
under the hierarchical catalogues posted by popular Internet portals
such as Yahoo.
 The hierarchical structure constrains the number of documents belonging to a
particular category, preventing the categories from becoming excessively large.
 The hypertextual nature of the documents is another distinctive feature of the
problem.
Definition of Text Categorization
 The task of approximating an unknown category assignment function
F:D×C→{0,1},
Where, D - set of all possible documents and
C - set of predefined categories.
 The value of F(d, c) is 1 if the document d belongs to the category c
and 0 otherwise.
 The approximating function M:D×C→{0,1} is called a classifier, and
the task is to build a classifier that produces results as “close” as
possible to the true category assignment function F.
 Classification Problems
 Single-Label versus Multi-label Categorization
 Document-Pivoted versus Category-Pivoted Categorization
 Hard versus Soft Categorization
Classification Problems
Single-Label versus Multi-label Categorization
 In single-label categorization, each document belongs to exactly one
category.
 In multi-label categorization the categories overlap, and a document may
belong to any number of categories.
Document-Pivoted versus Category-Pivoted Categorization
 For a given document, the classifier finds all categories to which the
document belongs is called a document-pivoted categorization.
 Finding all documents that should be filed under a given category is
called a category-pivoted categorization.
Hard versus Soft Categorization
 In hard categorization, a particular label is explicitly assigned to the
instance.
 In soft (ranking) categorization, a probability or ranking value is assigned
to the test instance.
Categorization Methods
Manual: Typically rule-based
 Does not scale up (labor-intensive, rule inconsistency)
 May be appropriate for special data on a particular domain
Automatic: Typically exploiting machine learning techniques
 Vector space model based
Prototype-based (Rocchio)
K-nearest neighbor (KNN)
Decision-tree (learn rules)
Neural Networks (learn non-linear classifier)
Support Vector Machines (SVM)
 Probabilistic or generative model based
Naïve Bayes classifier
Document Representation
 The classifiers and learning algorithms cannot directly process the text
documents in their original form.
 During a preprocessing step, the documents are converted into a more
manageable representation.
 Typically, the documents are represented by feature vectors
 A feature is simply an entity without internal structure - a
dimension in the feature space.
 A document is represented as a vector in this space - a sequence of
features and their weights.
Common models for Doc
Representation
 Bag-of-words model uses all words in a document as the features, and thus
the dimension of the feature space is equal to the number of different words
in all of the documents.
 Binary (simplest), in which the feature weight is either one - if the
corresponding word is present in the document - or zero otherwise.
 The most common TF-IDF scheme gives the word w in the document d the
weight,
TF-IDF Weight (w, d) = TermFreq(w, d) · log (N / DocFreq(w)),
Where,
TermFreq(w, d) - frequency of the word in the document,
N - number of all documents, and
DocFreq(w) - number of documents containing the word w.
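As an illustration of this weighting scheme, here is a minimal Python sketch; the toy corpus and the pre-tokenized documents are assumptions made purely for the example.

```python
import math
from collections import Counter

# Toy corpus (assumed); in practice these would be preprocessed documents.
docs = [["wheat", "farm", "export"],
        ["wheat", "commodity", "price"],
        ["spam", "offer", "email"]]
N = len(docs)

# DocFreq(w): number of documents containing the word w.
doc_freq = Counter(w for d in docs for w in set(d))

def tfidf_weight(w, d):
    """TF-IDF Weight(w, d) = TermFreq(w, d) * log(N / DocFreq(w))."""
    return d.count(w) * math.log(N / doc_freq[w]) if doc_freq[w] else 0.0

print(tfidf_weight("wheat", docs[0]))   # appears in 2 of 3 docs -> modest weight
print(tfidf_weight("farm", docs[0]))    # appears in 1 of 3 docs -> higher weight
```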
Feature Selection
Feature selection:
 Removes non-informative terms (irrelevant words) from documents.
 Improves classification effectiveness.
 Reduces computational complexity.
Measures of feature relevance:
 Relations between features and the categories.
 Feature Selection Methods
o Document Frequency Threshold (DF)
o Information Gain (IG)
o 2statistic (CHI)
o Mutual Information (MI)
 Document Frequency Threshold (DF)
 Document frequency DocFreq(w) is used as a measure of the relevance of each
feature; features whose document frequency falls below a given threshold are removed.
 Information Gain (IG)
 Measures the number of bits of information obtained for the prediction of
categories by the presence or absence in a document of the feature f.
 The probabilities are ratios of frequencies in the training data.

        IG(w) = Σ_{c ∈ C ∪ C̄}  Σ_{f ∈ {w, w̄}}  P(f, c) · log( P(c | f) / P(c) )
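To make this measure concrete, here is a minimal Python sketch that computes IG(w) from raw counts; the toy labeled corpus and the set-of-words document representation are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Toy labeled corpus: (set of words, category). Assumed data for illustration.
train = [({"wheat", "farm"}, "grain"), ({"wheat", "export"}, "grain"),
         ({"spam", "offer"}, "junk"), ({"email", "offer"}, "junk")]

def information_gain(word):
    """IG(w) = sum over c in C, f in {w, not-w} of P(f, c) * log(P(c | f) / P(c))."""
    n = len(train)
    cat_counts = Counter(c for _, c in train)
    ig = 0.0
    for c, n_c in cat_counts.items():
        p_c = n_c / n
        for present in (True, False):
            p_fc = sum(1 for words, cat in train
                       if cat == c and (word in words) == present) / n
            p_f = sum(1 for words, _ in train if (word in words) == present) / n
            if p_fc > 0:
                # P(c | f) / P(c) = P(f, c) / (P(f) * P(c))
                ig += p_fc * math.log(p_fc / (p_f * p_c))
    return ig

print(information_gain("wheat"))   # perfectly predicts "grain" in this toy set
print(information_gain("offer"))   # perfectly predicts "junk"
```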
 Chi-square (χ²)
 Measures the maximal strength of dependence between the feature and the
categories:

        χ²_max(f) = max_{c ∈ C}  |Tr| · ( P(f, c)·P(f̄, c̄) − P(f, c̄)·P(f̄, c) )² / ( P(f)·P(f̄)·P(c)·P(c̄) )

 Mutual Information (MI)

        MI(f) = log( 1 / P(c) ) − log( 1 / P(c | f) ) = log( P(f, c) / ( P(f)·P(c) ) )
Dimensionality Reduction by
Feature Extraction
 Feature reduction refers to the mapping of the original high-
dimensional data onto a lower-dimensional space
 Criterion for feature reduction can be different based on different
problem settings.
 Unsupervised setting: minimize the information loss
 Supervised setting: maximize the class discrimination
Feature Reduction Algorithms
 Unsupervised
 Latent Semantic Indexing (LSI): truncated SVD
 Independent Component Analysis (ICA)
 Principal Component Analysis (PCA)
 Manifold learning algorithms
 Supervised
 Linear Discriminant Analysis (LDA)
 Canonical Correlation Analysis (CCA)
 Partial Least Squares (PLS)
 Semi-supervised
Feature Reduction Algorithms
 Linear
 Latent Semantic Indexing (LSI): truncated SVD
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Canonical Correlation Analysis (CCA)
 Partial Least Squares (PLS)
 Nonlinear
 Nonlinear feature reduction using kernels
 Manifold learning
Knowledge Engineering Approach
To Text Categorization
 Focus on manual development of classification rules.
 A set of sufficient conditions for a document to be labeled with a given
category is defined.
 Example: CONSTRUE system
if DNF (disjunction of conjunctive clauses) formula then category else
¬category
 Such a rule may look like the following:
If ((wheat & farm) or
(wheat & commodity) or
(bushels & export) or
(wheat & tonnes) or
(wheat & winter & ¬soft))
then Wheat
else ¬Wheat
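A hand-written rule of this kind translates directly into code. The sketch below is only an illustrative rendering of the rule above (not the actual CONSTRUE implementation), using a set of words as the document representation.

```python
def wheat_rule(words: set) -> bool:
    """DNF rule: the document is labeled Wheat if any conjunctive clause holds."""
    return (("wheat" in words and "farm" in words)
            or ("wheat" in words and "commodity" in words)
            or ("bushels" in words and "export" in words)
            or ("wheat" in words and "tonnes" in words)
            or ("wheat" in words and "winter" in words and "soft" not in words))

print(wheat_rule({"wheat", "winter", "harvest"}))   # True  -> Wheat
print(wheat_rule({"soft", "wheat", "winter"}))      # False -> not Wheat
```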
Machine Learning Approach to
Text Categorization
 In the ML approach, the classifier is built automatically by learning the
properties of categories from a set of pre-classified training documents.
 The learning process is an instance of
 Supervised Learning - the process of applying the known true category
assignment function on the training set.
 The unsupervised version of the classification task is called clustering.
 Issues in using machine learning techniques to develop an application
based on text categorization:
 Choose how to decide on the categories that will be used to classify the
instances.
 Choose how to provide a training set for each of the categories.
 Choose how to represent each of the instances in terms of the features.
 Choose a learning algorithm to be used for the categorization.
Approaches to Classifier Learning
 Probabilistic Classifiers
 Bayesian Logistic Regression
 Decision Tree Classifiers
 Decision Rule Classifiers
 Regression Methods
 The Rocchio Methods
 Neural Networks
 Example-Based Classifiers
 Support Vector Machines
 Classifier Committees: Bagging and Boosting
Probabilistic Classifiers
 Probabilistic classifiers view the categorization status value CSV(d, c) as
the probability P(c | d) that the document d belongs to the category c, and
compute this probability by application of Bayes' theorem:

        P(c | d) = P(d | c) · P(c) / P(d)
 The marginal probability P(d) is constant for all categories.
 To calculate P(d | c), however, we need to make some assumptions about the
structure of the document d.
 With the document representation as a feature vector d = (w1, w2, . . .), the
most common assumption is that all coordinates are independent, and thus
        P(d | c) = Π_i P(w_i | c)

 The classifiers resulting from this assumption are called Naive Bayes (NB)
classifiers.
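The formulas above lead to a very short implementation. The following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing; the toy training documents are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training documents (token lists) with their categories -- assumed data.
train = [(["wheat", "farm", "export"], "grain"),
         (["wheat", "tonnes", "winter"], "grain"),
         (["cheap", "offer", "click"], "junk"),
         (["offer", "click", "now"], "junk")]

cat_counts = Counter(c for _, c in train)
word_counts = defaultdict(Counter)           # per-category word frequencies
for tokens, c in train:
    word_counts[c].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def predict(tokens):
    """argmax_c  log P(c) + sum_i log P(w_i | c), with add-one smoothing."""
    best_cat, best_score = None, float("-inf")
    for c, n_c in cat_counts.items():
        score = math.log(n_c / len(train))
        total = sum(word_counts[c].values())
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_cat, best_score = c, score
    return best_cat

print(predict(["wheat", "export"]))   # expected: "grain"
```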
Analysis of Naïve Bayes Algorithm
Advantages:
 Works well on numeric and textual data
 Easy to implement and computationally cheap compared with other algorithms
Disadvantages:
 The conditional independence assumption is violated by real-world data; the
classifier can perform very poorly when features are highly correlated
Bayesian Logistic Regression
 Bayesian logistic regression (BLR) is a statistical approach applied to the TC
problem with apparently very high performance.
 Let categorization be binary; then the logistic regression model has the form given below,
where
c = ±1 - category membership value,
d = (d1, d2, . . .) - document representation in the feature space,
β = (β1, β2, . . .) - model parameters vector, and
ϕ - logistic link function.
        P(c | d) = φ( β · d ) = φ( Σ_i β_i d_i ) ,        φ(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(−x))
 The Bayesian approach is to use a prior distribution for the parameter
vector β. Different priors are possible,
 Gaussian and
 Laplace priors.
 Gaussian prior with zero mean and variance τ
 The maximum a posteriori (MAP) estimate of β is equivalent to ridge
regression for the logistic model.
 Disadvantage of the Gaussian prior in the TC problem,
 The MAP estimates of the parameters will rarely be exactly zero; thus,
the model will not be sparse.
 Laplace prior does achieve sparseness

        p(β_i | τ) = N(0, τ) = ( 1 / √(2πτ) ) · exp( −β_i² / (2τ) )        (Gaussian prior)

        p(β_i | λ) = ( λ / 2 ) · exp( −λ |β_i| )                            (Laplace prior)
 The log-posterior distribution of β is

        l(β) = ln( p(D | β) · p(β) ) = − Σ_{(d,c) ∈ D} ln( 1 + exp( −c β·d ) ) + ln p(β)

 where D = {(d1, c1), (d2, c2), . . .} is the set of training documents di and
their true category membership values ci = ±1, and p(β) is the chosen prior:

        ln p(β) = − Σ_i ( β_i² / (2τ) + ln √(2πτ) )        for the Gaussian prior

        ln p(β) = Σ_i ( ln(λ/2) − λ |β_i| )                for the Laplace prior

 The MAP estimate of β is arg max_β l(β), which can be computed by any
convex optimization algorithm
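In practice, MAP estimation under the Gaussian and Laplace priors corresponds to L2- and L1-regularized logistic regression. A minimal sketch of this correspondence using scikit-learn (assumed installed); the feature vectors and labels are toy data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature vectors and binary labels c = +/-1 (assumed data for illustration).
X = np.array([[1.0, 0.0, 2.0], [0.5, 0.1, 1.5], [0.0, 2.0, 0.1], [0.1, 1.8, 0.0]])
y = np.array([1, 1, -1, -1])

# penalty="l2" corresponds to MAP estimation under a Gaussian prior (ridge-like);
# penalty="l1" corresponds to a Laplace prior, which drives many coefficients
# exactly to zero and so yields a sparse model.
gaussian_map = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
laplace_map = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

print(gaussian_map.coef_)   # dense parameter vector beta
print(laplace_map.coef_)    # typically sparser parameter vector
```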
Decision Tree Classifiers
 A decision tree (DT) classifier is a tree in which the internal nodes are labeled
by the features, the edges leaving a node are labeled by tests on the feature’s
weight, and the leaves are labeled by categories.
 Decision tree associated with document:
 Root node contains all documents
 Each internal node is a subset of documents separated according to one
attribute
 Each arc is labeled with a predicate that can be applied to the attribute at
the parent
 Each leaf node is labeled with a class
Fig: A Decision Tree Classifier
The tree that corresponds to the CONSTRUE rule
 Performance of a DT classifier is mixed but is inferior to the top-ranking
classifiers.
 Recursive partition procedure from root node
 Set of documents separated into subsets according to an attribute
 Use the most discriminative attribute first (highest IG)
 Pruning to deal with over-fitting
 Hierarchical classifier compares the data sequentially with carefully
selected features.
 Features are determined from the spectral distributions or separability
of the classes.
 There is no general procedure. Each decision tree or set of rules is
custom-designed.
 A decision tree that provides only two outcomes at each stage is called
a “binary decision tree” (BDT) classifier
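A small sketch of a decision tree learned from binary word-presence features, using scikit-learn (assumed installed) with entropy-based, information-gain-style splitting; the toy data and feature names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy binary word-presence features: columns = ["wheat", "farm", "export"] (assumed).
X = np.array([[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]])
y = ["Wheat", "NotWheat", "NotWheat", "NotWheat"]

# Entropy criterion: the most discriminative attribute (highest IG) is used first.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["wheat", "farm", "export"]))
print(tree.predict([[1, 1, 1]]))   # follows the learned branches to a leaf label
```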
Analysis of Decision Tree Algorithm
Advantages:
 Easy to understand
 Easy to generate rules
 Reduce problem complexity
Disadvantages:
 Training time is relatively expensive
 A document is only connected with one branch
 Once a mistake is made at a higher level, any sub-tree is wrong
 Does not handle continuous variable well
 May suffer from over-fitting
Decision Rule Classifiers
 Decision rule (DR) classifiers are also symbolic like decision trees and
are built from the training collection using Inductive Rule Learning.
 DNF(Disjunctive Normal Form) rules are built in a bottom-up fashion.
 The initial classifier is built from the training set by viewing each
training document as a clause
d1 ∧ d2 ∧ . . . ∧ dn → c
Where, di - features of the document and c its category.
 The learner applies a series of generalizations for maximizing the
compactness of the rules by keeping the covering property. (e.g., by
removing terms from the clauses and by merging rules).
 At the end of the process, a pruning step similar to the DT pruning is
applied that trades covering for more generality.
 Rule learners vary widely in their specific methods, heuristics, and
optimality criteria.
 Example:
 RIPPER (Repeated Incremental Pruning to Produce Error Reduction)
(Cohen 1995a; Cohen 1995b; Cohen and Singer 1996).
 Ripper builds a rule set by
 Adding new rules until all positive category instances are covered and
 Adding conditions to the rules until no negative instance is covered.
 A notable feature of Ripper is its ability to bias the performance toward higher
precision or higher recall.
 This is determined by the loss ratio parameter, which measures the relative cost
of “false negative” and “false positive” errors.
Regression Methods
 Regression is a technique for approximating a real-valued function
using the knowledge of its values on a set of points.
 It can be applied to TC, which is the problem of approximating the
category assignment function.
 Regression techniques can be applied to generate the (real-valued)
classifier for continuous real-valued functions.
 Regression Method - Linear Least-square Fit (LLSF),
 The category assignment function is represented as a |C| × |F| matrix that
describes a linear transformation from the feature space to the space of all
possible category assignments.
 To build a classifier, a matrix is created that best accounts for the
training data.
 The LLSF model computes the matrix by minimizing the error on the
training collection according to the formula
Where,
 D is the |F| × |TrainingCollection| matrix of the training document
representations,
 O is the |C| × |TrainingCollection| matrix of the true category
assignments, and
 The || · ||F is the Frobenius norm
 The matrix M can be computed by performing singular value
decomposition on the training data.
 The matrix element mij represents the degree of association between the ith
feature and the jth category.
        M = arg min_M || M D − O ||_F ,        where || A ||_F = √( Σ_{ij} a_ij² )
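A minimal NumPy sketch of the LLSF computation. Solving the least-squares problem with the Moore–Penrose pseudoinverse is equivalent to the SVD-based solution mentioned above; the toy matrices are assumptions for illustration.

```python
import numpy as np

# Toy data (assumed): |F| = 4 features, |C| = 2 categories, 3 training documents.
D = np.array([[1.0, 0.0, 2.0],      # |F| x |TrainingCollection| document representations
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
O = np.array([[1.0, 0.0, 1.0],      # |C| x |TrainingCollection| true category assignments
              [0.0, 1.0, 0.0]])

# M = argmin_M || M D - O ||_F, solved here via the pseudoinverse of D.
M = O @ np.linalg.pinv(D)

print(M.shape)        # (|C|, |F|): each entry measures a feature-category association
print(M @ D[:, 0])    # predicted category scores for the first training document
```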
The Rocchio Methods
 The Rocchio classifier categorizes a document by computing its distance
to the prototypical examples of the categories.
 A prototypical example for the category c is a vector (w1, w2, . . .) in the
feature space computed by
where ,
 POS(c) - sets of all training documents that belong to the category c
 NEG(c) - sets of all training documents that do not belong to the
category c
 wdi is the weight of ith feature in the document d.
 Usually, the positive examples are much more important than the negative
ones, and so α >>β.
 If β = 0, then the prototypical example for a category is simply the
centroid of all documents belonging to the category.

        w_i = ( α / |POS(c)| ) · Σ_{d ∈ POS(c)} w_di  −  ( β / |NEG(c)| ) · Σ_{d ∈ NEG(c)} w_di
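A minimal sketch of the prototype computation and a cosine-similarity score against it; the toy document vectors and the particular α and β values are assumptions for illustration.

```python
import numpy as np

# Toy TF-IDF-style document vectors with in/out-of-category labels (assumed data).
docs = np.array([[2.0, 0.0, 1.0],
                 [1.5, 0.2, 0.8],
                 [0.1, 2.0, 0.0],
                 [0.0, 1.8, 0.3]])
in_category = np.array([True, True, False, False])
alpha, beta = 16.0, 4.0              # alpha >> beta (values are assumptions)

pos, neg = docs[in_category], docs[~in_category]
prototype = alpha * pos.mean(axis=0) - beta * neg.mean(axis=0)

def score(d):
    """Cosine similarity between a document vector and the category prototype."""
    return d @ prototype / (np.linalg.norm(d) * np.linalg.norm(prototype))

print(score(np.array([1.8, 0.1, 0.9])))   # high score -> close to the prototype
```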
Analysis of Rocchio’s Algorithm
Advantages:
 Easy to implement
 Cheap computationally.
 Very fast learner
 Relevance feedback mechanism
Disadvantages:
 Low classification accuracy
 Linear combination too simple for classification
 Constants α and β are empirical
Neural Networks
 Neural network (NN) can be built to perform text categorization.
 The input nodes of the network receive the feature values,
 The output nodes produce the categorization status values, and
 The link weights represent dependence relations.
 For classifying a document,
 Its feature weights are loaded into the input nodes;
 The activation of the nodes is propagated forward through the
network, and
 The final values on output nodes determine the categorization
decisions.
 The neural networks are trained by back propagation, whereby the training
documents are loaded into the input nodes.
 If a misclassification occurs, the error is propagated back through the
network, modifying the link weights in order to minimize the error.
 The simplest kind of a neural network is a perceptron
 It has only two layers - the input and the output nodes.
 Such network is equivalent to a linear classifier.
 More complex networks contain one or more hidden layers between the
input and output layers.
 The experiments (Schutze, Hull, and Pedersen 1995; Wiener 1995) show
very small or no improvement of nonlinear networks over their linear
counterparts in the text categorization task.
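A minimal sketch of the two-layer (perceptron) case. It uses the classic perceptron update rule rather than full back propagation, and the toy feature vectors are assumptions for illustration.

```python
import numpy as np

# Toy feature vectors and binary labels +/-1 (assumed data).
X = np.array([[1.0, 0.0], [0.9, 0.2], [0.1, 1.0], [0.0, 0.8]])
y = np.array([1, 1, -1, -1])

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(10):                      # a few passes over the training documents
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:       # misclassified: adjust the link weights
            w += yi * xi
            b += yi

print(np.sign(X @ w + b))                # linear decisions on the training set
```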
Analysis of NN Algorithm
Advantages:
 Produce good results in complex domains
 Suitable for both discrete and continuous data (especially better for
the continuous domain)
 Testing is very fast
Disadvantages:
 Training is relatively slow
 Learned results are more difficult for users to interpret than learned rules
(compared with DT)
 Empirical Risk Minimization (ERM) makes NN try to minimize the
training error, which may lead to over-fitting
Example-Based Classifiers
 Rely on computing the similarity between the document to be classified
and the training documents.
 These methods are called lazy learners because they defer the decision
on how to generalize beyond the training data until each new query
instance is encountered.
 “Training” for such classifiers consists of simply storing the
representations of the training documents together with their category
labels.
 The most prominent example of an example-based classifier is kNN (k-
nearest neighbor).
 To decide whether a document d belongs to the category c, kNN
checks whether the k training documents most similar to d belong to c
 If the answer is positive for a sufficiently large proportion of them, a
positive decision is made; otherwise, the decision is negative.
K-Nearest-Neighbor Algorithm
 Principle: points (documents) that are close in the space belong to the
same class
 Calculate similarity between test
document and each neighbor
 Select k nearest neighbors of a test
document among training examples
 Assign test document to the class
which contains most of the neighbors
        class(D) = arg max_{c_i}  Σ_{j=1}^{k}  sim(D, D_j) · δ( C(D_j), c_i )

where D_1, …, D_k are the k training documents most similar to the test document D,
and δ( C(D_j), c_i ) is 1 if D_j belongs to class c_i and 0 otherwise.
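A minimal sketch of this decision rule with cosine similarity; the toy training vectors and labels are assumptions for illustration.

```python
import numpy as np
from collections import Counter

# Toy training document vectors and their labels (assumed data).
train_X = np.array([[2.0, 0.1], [1.8, 0.0], [0.1, 1.9], [0.0, 2.1]])
train_y = ["sports", "sports", "business", "business"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def knn_predict(x, k=3):
    """Assign the class held by most of the k most similar training documents."""
    sims = [cosine(x, t) for t in train_X]
    nearest = np.argsort(sims)[-k:]              # indices of the k nearest neighbors
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

print(knn_predict(np.array([1.9, 0.2])))   # expected: "sports"
```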
Analysis of k-NN Algorithm
Advantages:
 Effective
 one of the best-performing text classifiers
 Non-parametric
 Robust in terms of not requiring the categories to be linearly
separable
 More local characteristics of the document are considered
compared with Rocchio
Disadvantages:
 High computational cost
 Classification time is long
 Difficult to find optimal value of k
Support Vector Machines
• Main idea of SVMs
Find the linear separating
hyperplane which maximizes the
margin, i.e., the optimal separating
hyperplane (OSH)
• Nonlinearly separable case
Kernel function and Hilbert space
[Figure: a nonlinear map f from the input space X to a feature space F, x ↦ f(x), under which the two classes become linearly separable]
The support vector machine (SVM) algorithm is very fast and effective for
text classification problems.
SVM classification
Maximizing the margin is equivalent to:

        minimize_{w, b, ζ}   (1/2) wᵀw + C Σ_{i=1}^{N} ζ_i

        subject to   y_i ( wᵀ x_i + b ) ≥ 1 − ζ_i ,   ζ_i ≥ 0 ,   i = 1, …, N
Introducing Lagrange multipliers α_i ≥ 0 and μ_i ≥ 0 for the constraints, the Lagrangian is:

        L(w, b, ζ; α, μ) = (1/2) wᵀw + C Σ_{i=1}^{N} ζ_i − Σ_{i=1}^{N} α_i [ y_i ( wᵀ x_i + b ) − 1 + ζ_i ] − Σ_{i=1}^{N} μ_i ζ_i
Dual problem:

        maximize_α   L_D(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_iᵀ x_j

        subject to   0 ≤ α_i ≤ C ,   Σ_i α_i y_i = 0
The solution is given by:

        w = Σ_{i=1}^{N} α_i y_i x_i

The problem of classifying a new data point x is now simply solved by looking at the sign of

        wᵀ x + b ,        where b = y_j − wᵀ x_j for any support vector x_j.
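A minimal sketch of training a soft-margin linear SVM and classifying by the sign of w·x + b, using scikit-learn (assumed installed); the toy data are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (assumed) with labels +/-1.
X = np.array([[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # soft-margin SVM, linear kernel

# For a linear kernel, w and b are available after training (w = sum_i alpha_i y_i x_i).
w = clf.coef_[0]
b = clf.intercept_[0]
x_new = np.array([1.5, 0.5])
print(np.sign(w @ x_new + b))                 # classify by the sign of w.x + b
```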
Analysis of SVM Algorithm
Advantages:
 Compared with NN, SVM captures the
inherent characteristics of the data better
 Embedding the Structural Risk Minimization
(SRM) principle, which minimizes the upper
bound on the generalization error
(better than the Empirical Risk Minimization
principle)
 Ability to learn can be independent of the
dimensionality of the feature space
 Global minima vs. Local minima
Disadvantage:
 Parameter Tuning
 Kernel Selection
Classifier Committees:
Bagging and Boosting
 The idea of classifier committees stems from the intuition that a team of
experts, by combining their knowledge, may produce better results than
a single expert alone.
 Principle: using multiple evidence (multiple poor classifiers => single
good classifier)
 Generate some base classifiers
 Combine them to make the final decision
 Bagging Method (classifiers are trained in a parallel manner)
 Boosting Method (classifiers are trained sequentially)
Bagging Algorithm
 Use multiple versions of a training set D of size N, each
created by resampling N examples from D with replacement (bootstrap)
 Each of these data sets is used to train a base classifier; the final
classification decision is made by majority voting of
these classifiers.
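A minimal sketch using scikit-learn's BaggingClassifier (assumed installed), whose default base estimator is a decision tree; the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

# Toy feature vectors and labels (assumed data).
X = np.array([[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2], [1.5, 0.5], [0.3, 1.6]])
y = np.array([1, 1, 0, 0, 1, 0])

# Each base classifier is trained on a bootstrap resample of the N training
# examples; the final decision is the majority vote of the ensemble.
bag = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0).fit(X, y)
print(bag.predict([[1.7, 0.4]]))   # expected: class 1
```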
Boosting Algorithm: AdaBoost
Main idea:
 Maintain a distribution or set
of weights over the training
set.
 Initially, all weights are set
equally, but in each iteration
the weights of incorrectly
classified examples are
increased so that the base
classifier is forced to focus on
the ‘hard’ examples in the
training set.
 For those correctly classified
examples, their weights are
decreased so that they are less
important in next iteration.
Why ensembles can improve
performance:
 Uncorrelated errors made by the
individual classifiers can be removed by
voting.
 Hypothesis space H may not contain the
true function f. Instead, H may include
several equally good approximations to f.
By taking weighted combinations of
these approximations, we may be able to
represent classifiers that lie outside of H.
AdaBoost Algorithm
Given: m examples (x1, y1), …, (xm, ym) where xi ∈ X, yi ∈ Y = {−1, +1}
Initialize D1(i) = 1/m for all i = 1 … m
For t = 1, …, T:
 Train a base classifier using distribution Dt.
 Get a hypothesis ht : X → {−1, +1} with error

        εt = Pr_{i ~ Dt} [ ht(xi) ≠ yi ] = Σ_{i : ht(xi) ≠ yi} Dt(i)

 Choose αt = (1/2) · ln( (1 − εt) / εt ).
 Update:

        D_{t+1}(i) = Dt(i) / Zt × { exp(−αt) if ht(xi) = yi ;  exp(αt) if ht(xi) ≠ yi }
                   = Dt(i) · exp( −αt · yi · ht(xi) ) / Zt

 where Zt is a normalization factor (chosen so that D_{t+1} will be a distribution).
Output the final hypothesis:

        H(x) = sign( Σ_{t=1}^{T} αt · ht(x) )
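A compact sketch of this procedure with one-feature threshold "stumps" as the base classifiers; the stump learner and the toy data are assumptions made only to keep the example self-contained.

```python
import numpy as np

def train_stump(X, y, D):
    """Pick the single-feature threshold stump with the smallest D-weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    h = lambda x, j=j, thr=thr, sign=sign: np.where(x[:, j] >= thr, sign, -sign)
    return err, h

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    ensemble = []
    for _ in range(T):
        eps, h = train_stump(X, y, D)             # base classifier for distribution D_t
        eps = max(eps, 1e-10)                     # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)     # alpha_t
        D = D * np.exp(-alpha * y * h(X))         # up-weight the misclassified examples
        D = D / D.sum()                           # normalize (the Z_t factor)
        ensemble.append((alpha, h))
    return lambda x: np.sign(sum(a * h(x) for a, h in ensemble))   # H(x)

# Toy data with labels in {-1, +1} (assumed for illustration).
X = np.array([[2.0, 0.1], [1.8, 0.3], [0.2, 1.9], [0.1, 2.2], [1.5, 0.5], [0.3, 1.6]])
y = np.array([1, 1, -1, -1, 1, -1])
H = adaboost(X, y, T=10)
print(H(np.array([[1.7, 0.4]])))   # expected: [1.]
```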
Analysis of Voting Algorithms
Advantages:
 Surprisingly effective
 Robust to noise
 Decrease the over-fitting effect
Disadvantages:
 Require more calculation and memory
Using Unlabeled Data to Improve
Classification
Two ways of incorporating knowledge from unlabeled documents:
Expectation Maximization (EM)
 EM works with probabilistic generative classifiers. The idea is to find the
most probable model given both labeled and unlabeled documents.
 The EM algorithm performs the optimization:
 First, the model is trained over the labeled documents.
 Then the following steps are iterated until convergence in a local
maximum occurs:
E-step: the unlabeled documents are classified by the current
model.
M-step: the model is trained over the combined corpus.
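A minimal sketch of this loop using a multinomial Naive Bayes model from scikit-learn (assumed installed). For simplicity it uses hard E-step assignments (self-training style) rather than the full soft posteriors; the count vectors are toy data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count vectors: a few labeled documents plus unlabeled ones (assumed data).
X_lab = np.array([[3, 0, 1], [2, 0, 2], [0, 3, 0], [0, 2, 1]])
y_lab = np.array([0, 0, 1, 1])
X_unlab = np.array([[2, 1, 1], [0, 2, 2], [3, 0, 0]])

# First, train the model over the labeled documents only.
model = MultinomialNB().fit(X_lab, y_lab)
for _ in range(5):                                 # iterate until (local) convergence
    y_unlab = model.predict(X_unlab)               # E-step: classify unlabeled docs
    model = MultinomialNB().fit(np.vstack([X_lab, X_unlab]),
                                np.concatenate([y_lab, y_unlab]))   # M-step: retrain
print(model.predict(X_unlab))
```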
Cotraining
 Cotraining is a bootstrapping strategy in which each document is described by
two independent "views" (feature sets); the unlabeled documents classified by
means of one of the views are then used for training the classifier on the
other view, and vice versa.
Performance Measure
Performance of algorithm:
 Training time
 Testing time
 Classification accuracy
 Precision, Recall
 Micro-average / Macro-average
 Breakeven: precision = recall
Goal: High classification quality and computation efficiency
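A minimal sketch of computing micro- and macro-averaged precision and recall with scikit-learn (assumed installed); the predictions are toy data for illustration.

```python
from sklearn.metrics import precision_score, recall_score

# Toy multi-class predictions (assumed data).
y_true = ["sports", "sports", "business", "education", "business", "sports"]
y_pred = ["sports", "business", "business", "education", "sports", "sports"]

# Micro-averaging pools all decisions; macro-averaging averages per-category scores.
for avg in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    print(avg, round(p, 3), round(r, 3))
```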
Comparison among Classifiers
 SVM, AdaBoost, k-NN, and Regression have shown good
performance
 NB and Rocchio showed relatively poor performance
among the ML classifiers, but both are often used as
baseline classifiers.
 There are mixed results regarding the Neural Networks and
Decision Tree classifiers.
Contenu connexe

Tendances

Text classification
Text classificationText classification
Text classificationJames Wong
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/CategorizationOswal Abhishek
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)KU Leuven
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.pptImXaib
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean modelVaibhav Khanna
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learningHaris Jamil
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social mediaJeremiah Fadugba
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining Sulman Ahmed
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
 

Tendances (20)

Text MIning
Text MIningText MIning
Text MIning
 
Text classification
Text classificationText classification
Text classification
 
Naive bayes
Naive bayesNaive bayes
Naive bayes
 
Text mining
Text miningText mining
Text mining
 
Text Classification/Categorization
Text Classification/CategorizationText Classification/Categorization
Text Classification/Categorization
 
Document similarity
Document similarityDocument similarity
Document similarity
 
Topic Modeling
Topic ModelingTopic Modeling
Topic Modeling
 
Tdm probabilistic models (part 2)
Tdm probabilistic  models (part  2)Tdm probabilistic  models (part  2)
Tdm probabilistic models (part 2)
 
lecture12-clustering.ppt
lecture12-clustering.pptlecture12-clustering.ppt
lecture12-clustering.ppt
 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean model
 
Word embedding
Word embedding Word embedding
Word embedding
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
DMQL(Data Mining Query Language).pptx
DMQL(Data Mining Query Language).pptxDMQL(Data Mining Query Language).pptx
DMQL(Data Mining Query Language).pptx
 
Ensemble learning
Ensemble learningEnsemble learning
Ensemble learning
 
web mining
web miningweb mining
web mining
 
Textmining Introduction
Textmining IntroductionTextmining Introduction
Textmining Introduction
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social media
 
Text mining
Text miningText mining
Text mining
 
Classification in data mining
Classification in data mining Classification in data mining
Classification in data mining
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 

Similaire à Text categorization

20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorizationmidi
 
Text Categorizationof Multi-Label Documents For Text Mining
Text Categorizationof Multi-Label Documents For Text MiningText Categorizationof Multi-Label Documents For Text Mining
Text Categorizationof Multi-Label Documents For Text MiningIIRindia
 
lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.pptIndra Hermawan
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationPaul Houle
 
LEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfLEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfIJDKP
 
Multi label classification of
Multi label classification ofMulti label classification of
Multi label classification ofijaia
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...irjes
 
文本分类综述
文本分类综述文本分类综述
文本分类综述Marissa
 
Text Classification, Sentiment Analysis, and Opinion Mining
Text Classification, Sentiment Analysis, and Opinion MiningText Classification, Sentiment Analysis, and Opinion Mining
Text Classification, Sentiment Analysis, and Opinion MiningFabrizio Sebastiani
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencevini89
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.pptbutest
 
Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Infrrd
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval ModelsNisha Arankandath
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive ModelsDatamining Tools
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Modelsguest0edcaf
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniquesijnlc
 

Similaire à Text categorization (20)

20070702 Text Categorization
20070702 Text Categorization20070702 Text Categorization
20070702 Text Categorization
 
Text Categorizationof Multi-Label Documents For Text Mining
Text Categorizationof Multi-Label Documents For Text MiningText Categorizationof Multi-Label Documents For Text Mining
Text Categorizationof Multi-Label Documents For Text Mining
 
lecture15-supervised.ppt
lecture15-supervised.pptlecture15-supervised.ppt
lecture15-supervised.ppt
 
Mapping Subsets of Scholarly Information
Mapping Subsets of Scholarly InformationMapping Subsets of Scholarly Information
Mapping Subsets of Scholarly Information
 
LEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdfLEARNING CONTEXT FOR TEXT.pdf
LEARNING CONTEXT FOR TEXT.pdf
 
Multi label classification of
Multi label classification ofMulti label classification of
Multi label classification of
 
Discovering Novel Information with sentence Level clustering From Multi-docu...
Discovering Novel Information with sentence Level clustering  From Multi-docu...Discovering Novel Information with sentence Level clustering  From Multi-docu...
Discovering Novel Information with sentence Level clustering From Multi-docu...
 
文本分类综述
文本分类综述文本分类综述
文本分类综述
 
Text Classification, Sentiment Analysis, and Opinion Mining
Text Classification, Sentiment Analysis, and Opinion MiningText Classification, Sentiment Analysis, and Opinion Mining
Text Classification, Sentiment Analysis, and Opinion Mining
 
Cluster
ClusterCluster
Cluster
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
 
Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...Learning from similarity and information extraction from structured documents...
Learning from similarity and information extraction from structured documents...
 
TEXT CLUSTERING.doc
TEXT CLUSTERING.docTEXT CLUSTERING.doc
TEXT CLUSTERING.doc
 
Information Retrieval Models
Information Retrieval ModelsInformation Retrieval Models
Information Retrieval Models
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Models
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Models
 
Textmining Predictive Models
Textmining Predictive ModelsTextmining Predictive Models
Textmining Predictive Models
 
A systematic study of text mining techniques
A systematic study of text mining techniquesA systematic study of text mining techniques
A systematic study of text mining techniques
 
A-Study_TopicModeling
A-Study_TopicModelingA-Study_TopicModeling
A-Study_TopicModeling
 

Dernier

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 

Dernier (20)

Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 

Text categorization

  • 2. Categorization 2  Text categorization (TC) - given a set of categories (subjects, topics) and a collection of text documents, the process of finding the correct topic (or topics) for each document. Given:  A description of an instance, xX, where X is the instance language or instance space.  A fixed set of categories: C={c1, c2,…cn} Determine:  The category of x: c(x)C, where c(x) is a categorization function whose domain is X and whose range is C.
  • 3. Text Categorization  Pre-given categories and labeled document examples (Categories may form hierarchy)  Classify new documents  A standard classification (supervised learning ) problem Categorization System … Sports Business Education Science …Sports Business Education
  • 4. Applications of Text Categorization Text Indexing Indexing of Texts Using Controlled Vocabulary  The documents according to the user queries, which are based on the key terms. The key terms all belong to a finite set called controlled vocabulary.  The task of assigning keywords from a controlled vocabulary to text documents is called text indexing. Document Sorting and Text Filtering Document Sorting  Sorting the given collection of documents into several “bins.”  Examples:  In a newspaper, the classified ads may need to be categorized into “Personal,” “Car Sale,” “Real Estate,” and so on.  E-mail coming into an organization, which may need to be sorted into categories such as “Complaints,” “Deals,” “Job applications,” and others.
  • 5. Text Filtering  Text filtering activity can be seen as document sorting with only two bins – the “relevant” and “irrelevant” documents.  Examples:  A sports related online magazine should filter out all non-sport stories it receives from the news feed.  An e-mail client should filter away spam. Web page categorization Hierarchical Web Page Categorization  A common use of TC is the automatic classification of Web pages under the hierarchical catalogues posted by popular Internet portals such as Yahoo.  constrains the number of documents belonging to a particular category to prevent the categories from becoming excessively large.  Hyper textual nature of the documents is also another feature of the problem.
  • 6. Definition of Text Categorization  The task of approximating an unknown category assignment function F:D×C→{0,1}, Where, D - set of all possible documents and C - set of predefined categories.  The value of F(d, c) is 1 if the document d belongs to the category c and 0 otherwise.  The approximating function M:D×C→{0,1} is called a classifier, and the task is to build a classifier that produces results as “close” as possible to the true category assignment function F.  Classification Problems  Single-Label versus Multi-label Categorization  Document-Pivoted versus Category-Pivoted Categorization  Hard versus Soft Categorization
  • 7. Classification Problems Single-Label versus Multi-label Categorization  In single-label categorization, each document belongs to exactly one category.  In multi-label categorization the categories overlap, and a document may belong to any number of categories. Document-Pivoted versus Category-Pivoted Categorization  For a given document, the classifier finds all categories to which the document belongs is called a document-pivoted categorization.  Finding all documents that should be filed under a given category is called a category-pivoted categorization. Hard versus Soft Categorization  A particular label is explicitly assigned to the instance is called Soft/Ranking Categorization.  A probability value is assigned to the test instance is known as Hard Categorization.
  • 8. Categorization Methods Manual: Typically rule-based  Does not scale up (labor-intensive, rule inconsistency)  May be appropriate for special data on a particular domain Automatic: Typically exploiting machine learning techniques  Vector space model based Prototype-based (Rocchio) K-nearest neighbor (KNN) Decision-tree (learn rules) Neural Networks (learn non-linear classifier) Support Vector Machines (SVM)  Probabilistic or generative model based Naïve Bayes classifier
  • 9. Document Representation  The classifiers and learning algorithms cannot directly process the text documents in their original form.  During a preprocessing step, the documents are converted into a more manageable representation.  Typically, the documents are represented by feature vectors  A feature is simply an entity without internal structure - a dimension in the feature space.  A document is represented as a vector in this space - a sequence of features and their weights.
  • 10. Common models for Doc Representation  Bag-of-words model uses all words in a document as the features, and thus the dimension of the feature space is equal to the number of different words in all of the documents.  Binary (simplest), in which the feature weight is either one - if the corresponding word is present in the document - or zero otherwise.  The most common TF-IDF scheme gives the word w in the document d the weight, TF-IDF Weight (w, d) = TermFreq(w, d) · log (N / DocFreq(w)), Where, TermFreq(w, d) - frequency of the word in the document, N - number of all documents, and DocFreq(w) - number of documents containing the word w.
  • 11. Feature Selection Feature selection:  Removes non-informative terms (irrelevant words) from documents.  Improves classification effectiveness.  Reduces computational complexity. Measures of feature relevance:  Relations between features and the categories.  Feature Selection Methods o Document Frequency Threshold (DF) o Information Gain (IG) o 2statistic (CHI) o Mutual Information (MI)
  • 12.  Document Frequency Threshold (DF)  Document frequency DocFreq(w) is a measure of the relevance of each feature in the document  Information Gain (IG)  Measures the number of bits of information obtained for the prediction of categories by the presence or absence in a document of the feature f.  The probabilities are ratios of frequencies in the training data.      CCc wwf cP fcP cfPwIG },{ )( )|( log).,()(
  • 13.  Chi-square  Measures the maximal strength of dependence between the feature and the categories.  Mutual Information ) )()( ),( log() )|( 1 log() )( 1 log()( cPfP cfP fcPcP fMI  )().().().( )),().,(),().,(.(|| max)( 2 2 max cPcPfPfP cfPcfPcfPcfPT f r Cc    
  • 14. Dimensionality Reduction by Feature Extraction  Feature reduction refers to the mapping of the original high- dimensional data onto a lower-dimensional space  Criterion for feature reduction can be different based on different problem settings.  Unsupervised setting: minimize the information loss  Supervised setting: maximize the class discrimination
  • 15. Feature Reduction Algorithms  Unsupervised  Latent Semantic Indexing (LSI): truncated SVD  Independent Component Analysis (ICA)  Principal Component Analysis (PCA)  Manifold learning algorithms  Supervised  Linear Discriminant Analysis (LDA)  Canonical Correlation Analysis (CCA)  Partial Least Squares (PLS)  Semi-supervised
  • 16. Feature Reduction Algorithms  Linear  Latent Semantic Indexing (LSI): truncated SVD  Principal Component Analysis (PCA)  Linear Discriminant Analysis (LDA)  Canonical Correlation Analysis (CCA)  Partial Least Squares (PLS)  Nonlinear  Nonlinear feature reduction using kernels  Manifold learning
  • 17. Knowledge Engineering Approach To Text Categorization  Focus on manual development of classification rules.  A set of sufficient conditions for a document to be labeled with a given category is defined.  Example: CONSTRUE system if DNF (disjunction of conjunctive clauses) formula then category else ¬category  Such rule may look like the following: If ((wheat & farm) or (wheat & commodity) or (bushels & export) or (wheat & tonnes) or (wheat & winter & ¬soft)) then Wheat else ¬Wheat
  • 18. Machine Learning Approach to Text Categorization  In the ML approach, the classifier is built automatically by learning the properties of categories from a set of pre-classified training documents.  Learning process is an instance of  Supervised Learning - process of applying the known true category assignment function on the training set.  The unsupervised version of the classification task, called clustering.  Issues in using machine learning techniques to develop an application based on text categorization:  Choose how to decide on the categories that will be used to classify the instances.  Choose how to provide a training set for each of the categories.  Choose how to represent each of the instances on the features.  Choose a learning algorithm to be used for the categorization.
  • 19. Approaches to Classifier Learning  Probabilistic Classifiers  Bayesian Logistic Regression  Decision Tree Classifiers  Decision Rule Classifiers  Regression Methods  The Rocchio Methods  Neural Networks  Example-Based Classifiers  Support Vector Machines  Classifier Committees: Bagging and Boosting
  • 20. Probabilistic Classifiers  Probabilistic classifiers view the categorization status value CSV(d,c),  The probability P(c | d) that the document d belongs to the category c and compute this probability by application of Bayes’ theorem  The marginal probability P(d) is constant for all categories.  To calculate P(d | c), however, we need to make some assumptions about the structure of the document d.  With the document representation as a feature vector d = (w1, w2, . . .), the most common assumption is that all coordinates are independent, and thus  The classifiers resulting from this assumption are called Naive Bayes (NB) classifiers.  i i cwPcdP )|()|( )( )()|( )|( dP cPcdP dcP 
  • 21. Analysis of Naïve Bayes Algorithm Advantages:  Work well on numeric and textual data  Easy to implement and computation comparing with other algorithms Disadvantages:  Conditional independence assumption is violated by real-world data, perform very poorly when features are highly correlated
  • 22. Bayesian Logistic Regression  Bayesian logistic regression (BLR) is a statistical approach applied to the TC problem with apparently very high performance.  Let categorization be binary, then the logistic regression model has the form, where , c = ±1 - category membership value, d = (d1, d2, . . .) - document representation in the feature space, β = (β1, β2, . . .) - model parameters vector, and ϕ - logistic link function. )().()|(  i iidddcP  )exp(1 1 )exp(1 )exp( )( xx x x    
  • 23.  The Bayesian approach is to use a prior distribution for the parameter vector β. Different priors are possible,  Gaussian and  Laplace priors.  Gaussian prior with zero mean and variance τ  The maximum a posteriori (MAP) estimate of β is equivalent to ridge regression for the logistic model.  Disadvantage of the Gaussian prior in the TC problem,  The MAP estimates of the parameters will rarely be exactly zero; thus, the model will not be sparse.  Laplace prior does achieve sparseness             2 exp 2 1 ),0()|( 2 i i Np |)|exp( 2 )|( iip    
  • 24.  The log-posterior distribution of β is  where, D= {(d1, c1), (d2, c2) . . .} is the set of training documents di and their true category membership values ci = ±1, and p(β) is the chosen prior:  In for Gaussian prior  In for Laplace prior  The MAP estimate of β is arg maxβl(β), which can be computed by any convex optimization algorithm         )||ln2(ln)( i ip                 i i p    2 2 2ln ln)( )(ln1).ln(exp()|()( ),(  pdcDpl Dcd          
  • 25. Decision Tree Classifiers  A decision tree (DT) classifier is a tree in which the internal nodes are labeled by the features, the edges leaving a node are labeled by tests on the feature’s weight, and the leaves are labeled by categories.  Decision tree associated with document:  Root node contains all documents  Each internal node is subset of documents separated according to one attribute  Each arc is labeled with predicate which can be applied to attribute at parent  Each leaf node is labeled with a class
  • 26. Fig: A Decision Tree Classifier The tree that corresponds to the CONSTRUE rule  Performance of a DT classifier is mixed but is inferior to the top-ranking classifiers.  Recursive partition procedure from root node  Set of documents separated into subsets according to an attribute  Use the most discriminative attribute first (highest IG)  Pruning to deal with over-fitting
  • 27.  Hierarchical classifier compares the data sequentially with carefully selected features.  Features are determined from the spectral distributions or separability of the classes.  There is no general procedure. Each decision tree or set of rules is custom-designed.  A decision tree that provides only two outcomes at each stage is called a “binary decision tree” (BDT) classifier
  • 28. Analysis of Decision Tree Algorithm Advantages:  Easy to understand  Easy to generate rules  Reduce problem complexity Disadvantages:  Training time is relatively expensive  A document is only connected with one branch  Once a mistake is made at a higher level, any sub-tree is wrong  Does not handle continuous variable well  May suffer from over-fitting
  • 29. Decision Rule Classifiers  Decision rule (DR) classifiers are also symbolic like decision trees and are built from the training collection using Inductive Rule Learning.  DNF(Disjunctive Normal Form) rules are built in a bottom-up fashion.  The initial classifier is built from the training set by viewing each training document as a clause d1 ∧ d2 ∧ . . . ∧ dn → c Where, di - features of the document and c its category.  The learner applies a series of generalizations for maximizing the compactness of the rules by keeping the covering property. (e.g., by removing terms from the clauses and by merging rules).  At the end of the process, a pruning step similar to the DT pruning is applied that trades covering for more generality.
  • 30.  Rule learners vary widely in their specific methods, heuristics, and optimality criteria.  Example:  RIPPER (Repeated Incremental Pruning to Produce Error Reduction) (Cohen 1995a; Cohen 1995b; Cohen and Singer 1996).  Ripper builds a rule set by  Adding new rules until all positive category instances are covered and  Adding conditions to the rules until no negative instance is covered.  Features of Ripper is its ability to bias the performance toward higher precision or higher recall.  Determined by the loss ratio parameter, which measures the relative cost of “false negative” and “false positive” errors.
  • 31. Regression Methods  Regression is a technique for approximating a real-valued function using the knowledge of its values on a set of points.  It can be applied to TC, which is the problem of approximating the category assignment function.  Regression techniques can be applied to generate the (real-valued) classifier for continuous real-valued functions.  Regression Method - Linear Least-square Fit (LLSF),  The category assignment function - |C| × |F| matrix, describes some linear transformation from the feature space to the space of all possible category assignments.  To build a classifier, a matrix is created that best accounts for the training data.
• 32. ▪ The LLSF model computes the matrix by minimizing the error on the training collection according to the formula $M = \arg\min_{M} \|MD - O\|_F$, where ▪ D is the |F| × |TrainingCollection| matrix of the training document representations, ▪ O is the |C| × |TrainingCollection| matrix of the true category assignments, and ▪ $\|A\|_F = \sqrt{\sum_{ij} a_{ij}^2}$ is the Frobenius norm. ▪ The matrix M can be computed by performing a singular value decomposition of the training data. ▪ The matrix element $m_{ij}$ represents the degree of association between the ith category and the jth feature.
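As a rough illustration (not from the slides), the LLSF matrix can also be obtained with an ordinary least-squares solver instead of an explicit SVD; the toy dimensions and random data below are assumptions standing in for real document representations.

```python
# Minimal LLSF sketch with NumPy; toy random data stands in for real
# TF-IDF document vectors (D) and 0/1 category assignments (O).
import numpy as np

n_features, n_categories, n_docs = 5, 2, 8
rng = np.random.default_rng(0)
D = rng.random((n_features, n_docs))                          # |F| x |TrainingCollection|
O = (rng.random((n_categories, n_docs)) > 0.5).astype(float)  # |C| x |TrainingCollection|

# Solve min_M ||M D - O||_F; lstsq handles the transposed system D^T M^T = O^T.
M_T, *_ = np.linalg.lstsq(D.T, O.T, rcond=None)
M = M_T.T                                                     # |C| x |F| association matrix

new_doc = rng.random(n_features)
print(M @ new_doc)                                            # one score per category
```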
• 33. The Rocchio Method ▪ The Rocchio classifier categorizes a document by computing its distance to prototypical examples of the categories. ▪ A prototypical example for the category c is a vector (w1, w2, . . .) in the feature space computed by $w_i = \frac{\alpha}{|POS(c)|} \sum_{d \in POS(c)} w_{di} - \frac{\beta}{|NEG(c)|} \sum_{d \in NEG(c)} w_{di}$, where ▪ POS(c) is the set of all training documents that belong to the category c, ▪ NEG(c) is the set of all training documents that do not belong to the category c, and ▪ $w_{di}$ is the weight of the ith feature in document d. ▪ Usually the positive examples are much more important than the negative ones, so α ≫ β. ▪ If β = 0, the prototypical example for a category is simply the centroid of all documents belonging to that category.
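The formula above maps directly onto a few lines of NumPy; the following sketch assumes dense TF-IDF row vectors in X and illustrative values of α and β.

```python
# Minimal sketch of the Rocchio prototype computation; alpha and beta are
# illustrative, and X is assumed to hold one dense TF-IDF vector per row.
import numpy as np

def rocchio_prototype(X, in_category, alpha=16.0, beta=4.0):
    """Return the prototype vector for one category.

    X: (n_docs, n_features) array; in_category: boolean mask of length n_docs.
    """
    pos, neg = X[in_category], X[~in_category]
    proto = alpha * pos.mean(axis=0)         # alpha / |POS(c)| * sum over positives
    if beta > 0 and len(neg) > 0:
        proto -= beta * neg.mean(axis=0)     # beta / |NEG(c)| * sum over negatives
    return proto

# A new document is then assigned to the category whose prototype is closest,
# e.g. by cosine similarity.
```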
• 34. Analysis of Rocchio’s Algorithm Advantages: ▪ Easy to implement ▪ Computationally cheap ▪ Very fast learner ▪ Provides a relevance feedback mechanism Disadvantages: ▪ Low classification accuracy ▪ A linear combination is too simple for many classification problems ▪ The constants α and β must be chosen empirically
• 35. Neural Networks ▪ A neural network (NN) can be built to perform text categorization. ▪ The input nodes of the network receive the feature values, ▪ the output nodes produce the categorization status values, and ▪ the link weights represent dependence relations. ▪ To classify a document, ▪ its feature weights are loaded into the input nodes, ▪ the activation of the nodes is propagated forward through the network, and ▪ the final values on the output nodes determine the categorization decisions.
• 36. ▪ Neural networks are trained by backpropagation: the training documents are loaded into the input nodes, and ▪ if a misclassification occurs, the error is propagated back through the network, modifying the link weights so as to minimize the error. ▪ The simplest kind of neural network is a perceptron: ▪ it has only two layers, the input and the output nodes; ▪ such a network is equivalent to a linear classifier. ▪ More complex networks contain one or more hidden layers between the input and output layers. ▪ Experiments (Schutze, Hull, and Pederson 1995; Wiener 1995) show very small or no improvement of nonlinear networks over their linear counterparts on the text categorization task.
• 37. Analysis of the NN Algorithm Advantages: ▪ Produces good results in complex domains ▪ Suitable for both discrete and continuous data (especially good for the continuous domain) ▪ Testing is very fast Disadvantages: ▪ Training is relatively slow ▪ Learned results are more difficult for users to interpret than learned rules (compared with DT) ▪ Empirical Risk Minimization (ERM) makes the NN minimize training error, which may lead to over-fitting
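A hedged sketch of such a network for text categorization, using scikit-learn's MLPClassifier (one hidden layer, trained by backpropagation); the documents, labels, and hidden-layer size are placeholder assumptions.

```python
# Hypothetical sketch: a feed-forward network with one hidden layer on
# TF-IDF features. Data and hyperparameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

docs = ["goal scored in the final minute", "quarterly earnings beat forecasts",
        "team wins the championship", "shares drop after the report"]
labels = ["sports", "business", "sports", "business"]

# One hidden layer makes the network nonlinear; a two-layer (input/output)
# perceptron would correspond to a linear classifier, as noted above.
nn = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0),
)
nn.fit(docs, labels)
print(nn.predict(["match ends in a draw"]))
```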
• 38. Example-Based Classifiers ▪ These classifiers rely on computing the similarity between the document to be classified and the training documents. ▪ They are called lazy learners because they defer the decision on how to generalize beyond the training data until each new query instance is encountered. ▪ “Training” for such classifiers consists of simply storing the representations of the training documents together with their category labels. ▪ The most prominent example-based classifier is kNN (k-nearest neighbor). ▪ To decide whether a document d belongs to the category c, kNN checks whether the k training documents most similar to d belong to c. ▪ If the answer is positive for a sufficiently large proportion of them, a positive decision is made; otherwise, the decision is negative.
• 39. K-Nearest-Neighbor Algorithm ▪ Principle: points (documents) that are close in the feature space are likely to belong to the same class. ▪ Calculate the similarity between the test document and each training document. ▪ Select the k nearest neighbors of the test document among the training examples. ▪ Assign the test document to the class that contains most of these neighbors: $class(D) = \arg\max_{C_i} \sum_{j=1}^{k} sim(D, D_j) \cdot P(C_i \mid D_j)$, where $P(C_i \mid D_j)$ is 1 if neighbor $D_j$ belongs to $C_i$ and 0 otherwise.
• 40. Analysis of the k-NN Algorithm Advantages: ▪ Effective: one of the best-performing text classifiers ▪ Non-parametric ▪ Robust in the sense of not requiring the categories to be linearly separable ▪ Takes more local characteristics of a document into account compared with Rocchio Disadvantages: ▪ High computational cost ▪ Classification time is long ▪ Difficult to find the optimal value of k
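A small, assumed example of kNN text categorization with scikit-learn, using cosine similarity over TF-IDF vectors; the documents and the value of k are placeholders.

```python
# Hypothetical kNN sketch: cosine distance over TF-IDF vectors, with
# similarity-weighted voting among the k nearest training documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["stocks rally on strong earnings", "team clinches the title",
        "market volatility rises", "injury sidelines the star player"]
labels = ["business", "sports", "business", "sports"]

knn = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance"),
)
knn.fit(docs, labels)
print(knn.predict(["coach praises the player after the win"]))
```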
• 41. Support Vector Machines • Main idea of SVMs: find the linear separating hyperplane that maximizes the margin, i.e., the optimal separating hyperplane (OSH). • Nonlinearly separable case: use a kernel function to map the input space into a higher-dimensional (Hilbert) feature space. [Figure: points of two classes in the input space X mapped via x ↦ f(x) into a feature space f(X) where they become linearly separable] The support vector machine (SVM) algorithm is very fast and effective for text classification problems.
• 42. SVM classification ▪ Maximizing the margin is equivalent to the soft-margin optimization problem:
minimize over $w, b, \zeta$: $\frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i$ subject to $y_i (w^T x_i + b) \ge 1 - \zeta_i$, $\zeta_i \ge 0$, $i = 1, \dots, N$.
▪ Introducing Lagrange multipliers $\alpha$ and $\eta$, the Lagrangian is
$L(w, b, \zeta; \alpha, \eta) = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \big[ y_i (w^T x_i + b) - 1 + \zeta_i \big] - \sum_{i=1}^{N} \eta_i \zeta_i$.
▪ Dual problem: maximize $L_D(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^T x_j$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
▪ The solution is given by $w = \sum_{i=1}^{N} \alpha_i y_i x_i$, with $b = y_j - w^T x_j$ for any support vector $x_j$.
▪ A new data point $x$ is classified simply by looking at the sign of $w^T x + b$.
• 43. Analysis of the SVM Algorithm Advantages: ▪ Compared with NN, SVM captures the inherent characteristics of the data better ▪ Embeds the Structural Risk Minimization (SRM) principle, which minimizes an upper bound on the generalization error (better than the Empirical Risk Minimization principle) ▪ The ability to learn can be independent of the dimensionality of the feature space ▪ Training finds a global minimum rather than a local one Disadvantages: ▪ Parameter tuning ▪ Kernel selection
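As a concrete (assumed) illustration, a linear SVM text classifier can be put together with scikit-learn's LinearSVC; the data and the value of C are placeholders.

```python
# Hypothetical sketch: linear SVM on TF-IDF features. C is the soft-margin
# penalty from the formulation above: larger C punishes slack more heavily.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["win money now, claim your prize", "meeting rescheduled to friday",
        "free offer, act today", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

svm = make_pipeline(TfidfVectorizer(), LinearSVC(C=1.0))
svm.fit(docs, labels)
print(svm.predict(["claim your free reward"]))
```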
• 44. Classifier Committees: Bagging and Boosting ▪ The idea of classifier committees stems from the intuition that a team of experts, by combining their knowledge, may produce better results than a single expert alone. ▪ Principle: combine multiple pieces of evidence (several weak classifiers => one good classifier) ▪ Generate several base classifiers ▪ Combine them to make the final decision ▪ Bagging method (classifiers are trained in parallel) ▪ Boosting method (classifiers are trained sequentially)
• 45. Bagging Algorithm ▪ Create multiple versions of the training set D of size N, each obtained by bootstrap resampling, i.e., drawing N examples from D with replacement. ▪ Each of these data sets is used to train a base classifier; the final classification decision is made by majority voting among these classifiers.
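A hedged scikit-learn sketch of the bagging procedure just described (bootstrap replicates plus majority voting); the corpus and the number of estimators are assumptions.

```python
# Hypothetical bagging sketch: n_estimators bootstrap replicates of the
# training set, one base classifier per replicate, majority vote at prediction
# time. BaggingClassifier's default base estimator is a decision tree.
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["rates cut by the central bank", "striker signs a new contract",
        "inflation data released today", "fans celebrate the league win"]
labels = ["business", "sports", "business", "sports"]

bag = make_pipeline(
    TfidfVectorizer(),
    BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0),
)
bag.fit(docs, labels)
print(bag.predict(["bank announces new interest rates"]))
```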
• 46. Boosting Algorithm: AdaBoost Main idea: ▪ Maintain a distribution (a set of weights) over the training set. ▪ Initially all weights are set equally, but in each iteration the weights of incorrectly classified examples are increased so that the base classifier is forced to focus on the ‘hard’ examples in the training set. ▪ The weights of correctly classified examples are decreased so that they become less important in the next iteration. Why ensembles can improve performance: ▪ Uncorrelated errors made by the individual classifiers can be removed by voting. ▪ The hypothesis space H may not contain the true function f; instead, H may include several equally good approximations to f. By taking weighted combinations of these approximations, we may be able to represent classifiers that lie outside of H.
• 47. AdaBoost Algorithm Given: m examples $(x_1, y_1), \dots, (x_m, y_m)$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
Initialize $D_1(i) = 1/m$ for all $i = 1, \dots, m$.
For $t = 1, \dots, T$:
▪ Train the base classifier using distribution $D_t$.
▪ Get a hypothesis $h_t : X \to \{-1, +1\}$ with error $\varepsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \ne y_i] = \sum_{i : h_t(x_i) \ne y_i} D_t(i)$.
▪ Choose $\alpha_t = \frac{1}{2} \ln\!\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$.
▪ Update: $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } h_t(x_i) = y_i \\ e^{\alpha_t} & \text{if } h_t(x_i) \ne y_i \end{cases} = \frac{D_t(i)\, \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ is a distribution).
Output the final hypothesis: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
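The update rule above translates almost line-for-line into NumPy; the sketch below assumes a hypothetical `train_base(X, y, D)` helper that fits a weak classifier on distribution D and returns a function mapping inputs to {-1, +1}.

```python
# Minimal AdaBoost sketch following the slide's pseudocode. `train_base`
# is a hypothetical helper (e.g. a depth-1 decision tree fit with sample
# weights D); labels y are assumed to be a NumPy array with values in {-1, +1}.
import numpy as np

def adaboost(X, y, train_base, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for _ in range(T):
        h = train_base(X, y, D)              # train base classifier on D_t
        pred = h(X)
        eps = D[pred != y].sum()             # weighted error epsilon_t
        if eps <= 0 or eps >= 0.5:           # stop if perfect or no better than chance
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)    # up-weight mistakes, down-weight hits
        D /= D.sum()                         # normalization (the Z_t factor)
        hypotheses.append(h)
        alphas.append(alpha)

    def H(X_new):                            # final hypothesis: weighted vote
        votes = sum(a * h(X_new) for a, h in zip(alphas, hypotheses))
        return np.sign(votes)

    return H
```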
• 48. Analysis of Voting Algorithms Advantages: ▪ Surprisingly effective ▪ Robust to noise ▪ Decreases the over-fitting effect Disadvantages: ▪ Requires more computation and memory
• 49. Using Unlabeled Data to Improve Classification Two ways of incorporating knowledge from unlabeled documents: Expectation Maximization (EM) ▪ EM works with probabilistic generative classifiers. The idea is to find the most probable model given both the labeled and the unlabeled documents. ▪ The EM algorithm performs the optimization as follows: ▪ first, the model is trained over the labeled documents; ▪ then the following steps are iterated until convergence to a local maximum occurs: E-step: the unlabeled documents are classified by the current model. M-step: the model is retrained over the combined corpus. Cotraining ▪ Cotraining is a bootstrapping strategy that assumes the document features can be split into two separate “views” (for Web pages, e.g., the page text and the anchor text of incoming links); the unlabeled documents classified by means of one of the views are used for training the classifier on the other view, and vice versa.
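A rough sketch of the EM loop described above, using a Naive Bayes generative model with hard E-steps (the most probable label is used); all data, the vectorizer choice, and the fixed iteration count are assumptions.

```python
# Hypothetical EM-style sketch: train NB on labeled docs, then repeatedly
# (E-step) label the unlabeled docs and (M-step) retrain on the union.
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_docs = ["team wins the match", "stocks fall sharply"]
labels = ["sports", "business"]
unlabeled_docs = ["player scores twice", "market rallies on the news"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labeled_docs)
X_unlab = vec.transform(unlabeled_docs)

model = MultinomialNB().fit(X_lab, labels)     # initial model: labeled data only
for _ in range(5):                             # a convergence test would go here
    guessed = model.predict(X_unlab)           # E-step: classify unlabeled docs
    X_all = vstack([X_lab, X_unlab])           # M-step: retrain on combined corpus
    model = MultinomialNB().fit(X_all, list(labels) + list(guessed))
```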
• 50. Performance Measures Performance of an algorithm is assessed by: ▪ Training time ▪ Testing time ▪ Classification accuracy ▪ Precision and recall ▪ Micro-average / macro-average ▪ Breakeven point: the value at which precision = recall Goal: high classification quality and computational efficiency
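A short, assumed example of computing micro- and macro-averaged precision and recall with scikit-learn; the label lists are placeholders.

```python
# Hypothetical sketch: micro- vs. macro-averaged precision and recall.
# Micro-averaging pools decisions over all categories; macro-averaging
# computes the measure per category and then takes the unweighted mean.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["sports", "business", "sports", "science", "business"]
y_pred = ["sports", "sports", "sports", "science", "business"]

micro = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)
macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
print("micro (P, R, F1):", micro[:3])
print("macro (P, R, F1):", macro[:3])
```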
• 51. Comparison among Classifiers ▪ SVM, AdaBoost, k-NN, and regression methods have shown good performance. ▪ NB and Rocchio have shown relatively poor performance among the ML classifiers, but both are often used as baseline classifiers. ▪ Results for the Neural Network and Decision Tree classifiers are mixed.