This document discusses bag-of-words methods for text mining applications such as document retrieval, opinion mining, and association mining. It describes how a basic bag-of-words approach can be used for document retrieval by creating topic vectors from word frequencies and calculating cosine similarity between documents. Opinion mining can be done by training a naive Bayes classifier on labeled documents to learn positive and negative words. Association mining seeks relationships between terms, but a simple bag-of-words approach has limitations and better methods use term classification and learning from contexts.
2. Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
– document retrieval
– opinion mining
– association mining
• see how far we can get with document-level bag-of-words models
– and introduce some of our mathematical approaches
1/16/14 NYU 2
4. Information Retrieval
• Task: given query = list of keywords, identify
and rank relevant documents from collection
• Basic idea:
find documents whose set of words most
closely matches words in query
5. Topic Vector
• Suppose the document collection has n distinct words,
w1, …, wn
• Each document is characterized by an n-dimensional
vector whose ith component is the frequency of word
wi in the document
Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [The, chased, dog, cat, mouse] (n = 5)
• V1 = [ 2 , 1 , 0 , 1 , 1 ]
• V2 = [ 2 , 1 , 1 , 1 , 0 ]
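The example above can be reproduced in a few lines of Python; this is a minimal sketch whose tokenizer just lowercases and strips the trailing period, which is enough for this toy collection:

```python
from collections import Counter

def topic_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().rstrip(".").split())
    return [counts[w] for w in vocab]

vocab = ["the", "chased", "dog", "cat", "mouse"]
v1 = topic_vector("The cat chased the mouse.", vocab)   # [2, 1, 0, 1, 1]
v2 = topic_vector("The dog chased the cat.", vocab)     # [2, 1, 1, 1, 0]
```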
6. Weighting the components
• Unusual words like “elephant” determine the topic much more
than common words such as “the” or “have”
• can ignore words on a stop list or
• weight each term frequency tfi by its inverse document
frequency idfi
• where N = size of collection and ni = number of documents
containing term i
wi = tfi × idfi,   where  idfi = log( N / ni )
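A minimal sketch of the weighting, assuming natural log (the slide does not fix the base):

```python
import math

def idf(N, n_i):
    """Inverse document frequency: idf_i = log(N / n_i)."""
    return math.log(N / n_i)

def tfidf(tf_i, N, n_i):
    """Weighted component: w_i = tf_i * idf_i."""
    return tf_i * idf(N, n_i)

# A term appearing in every document gets weight 0, like a stop word;
# a term in only 10 of 1000 documents is strongly boosted.
tfidf(3, 1000, 1000)   # 0.0
```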
7. Cosine similarity metric
Define a similarity metric between topic vectors
A common choice is cosine similarity (the normalized dot
product):
sim(A, B) = Σi ai bi / ( √(Σi ai²) · √(Σi bi²) )
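A sketch in Python, applied to the topic vectors V1 and V2 from the earlier example:

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [2, 1, 0, 1, 1]   # "The cat chased the mouse."
v2 = [2, 1, 1, 1, 0]   # "The dog chased the cat."
cosine_sim(v1, v2)     # 6/7 ≈ 0.857
```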
8. Cosine similarity metric
• the cosine similarity metric is the cosine of the
angle between the term vectors:
[Figure: two term vectors w1 and w2, with the similarity as the cosine of the angle between them]
9. Verdict: a success
• For heterogeneous text collections, the vector
space model, tf-idf weighting, and cosine
similarity have been the basis for successful
document retrieval for over 50 years
11. Opinion Mining
– Task: judge whether a document expresses a positive
or negative opinion (or no opinion) about an object or
topic
• classification task
• valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words
• make lists of positive and negative words
• see which predominate in a given document
(and mark as ‘no opinion’ if there are few words of either type)
• problem: hard to make such lists
– lists will differ for different topics
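The word-list strategy can be sketched as follows; the POS/NEG lists here are made up for illustration, and building good lists is exactly the hard part the slide points out:

```python
def classify_opinion(doc, pos_words, neg_words):
    """Tally positive vs. negative words; 'no opinion' on a tie or no hits."""
    words = doc.lower().split()
    pos = sum(w in pos_words for w in words)
    neg = sum(w in neg_words for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "no opinion"

# Hypothetical word lists -- and they would differ for other topics
POS = {"great", "excellent", "reliable"}
NEG = {"terrible", "awful", "flimsy"}

classify_opinion("a great and reliable camera", POS, NEG)   # "positive"
```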
12. Training a Model
– Instead, label the documents in a corpus and then
train a classifier from the corpus
• labeling is easier than thinking up words
• in effect, learn positive and negative words from the
corpus
– probabilistic classifier
• identify most likely class
s = argmaxt P( t | W ),   t ∈ {pos, neg}
13. Using Bayes’ Rule
• The last step is based on the naïve assumption
of independence of the word probabilities
argmaxt P( t | W )
= argmaxt P( W | t ) P( t ) / P( W )       (Bayes’ rule)
= argmaxt P( W | t ) P( t )                (P(W) does not depend on t)
= argmaxt P( w1, …, wn | t ) P( t )
= argmaxt [ Πi P( wi | t ) ] P( t )        (naïve independence)
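Under that independence assumption, classification becomes a product of per-word probabilities; a sketch, computed in log space to avoid underflow (the priors and cond tables here are hypothetical hand-set estimates):

```python
import math

def classify(words, priors, cond):
    """Return argmax_t P(t) * prod_i P(w_i | t), computed with log probabilities."""
    best_t, best_score = None, float("-inf")
    for t, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond[t][w]) for w in words)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical estimates, just for illustration
priors = {"pos": 0.5, "neg": 0.5}
cond = {"pos": {"great": 0.6, "bad": 0.1},
        "neg": {"great": 0.1, "bad": 0.6}}
classify(["great", "great", "bad"], priors, cond)   # "pos"
```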
14. Training
• We now estimate these probabilities from the
training corpus (of N documents) using maximum
likelihood estimators
• P( t ) = count( docs labeled t ) / N
• P( wi | t ) = count( docs labeled t containing wi ) / count( docs labeled t )
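A sketch of these maximum-likelihood estimates, assuming each training document is represented as a set of words plus a label:

```python
def train_mle(labeled_docs):
    """labeled_docs: list of (set_of_words, label) pairs.
    Returns the MLE estimates P(t) and P(w | t) from the slide."""
    N = len(labeled_docs)
    doc_count = {}    # count(docs labeled t)
    word_count = {}   # count(docs labeled t containing w)
    for words, t in labeled_docs:
        doc_count[t] = doc_count.get(t, 0) + 1
        wc = word_count.setdefault(t, {})
        for w in words:
            wc[w] = wc.get(w, 0) + 1
    priors = {t: c / N for t, c in doc_count.items()}
    cond = {t: {w: c / doc_count[t] for w, c in wc.items()}
            for t, wc in word_count.items()}
    return priors, cond

priors, cond = train_mle([({"great", "fun"}, "pos"),
                          ({"bad"}, "neg"),
                          ({"great"}, "pos")])
# priors["pos"] == 2/3; cond["pos"]["great"] == 1.0
```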
15. A Problem
• Suppose a glowing review GR (with lots of
positive words) includes one word,
“mathematical”, previously seen only in
negative reviews
• P ( positive | GR ) = ?
16. P ( positive | GR ) = 0
because P ( “mathematical” | positive ) = 0
The maximum likelihood estimate is poor when
there is very little data
We need to ‘smooth’ the probabilities to avoid
this problem
17. Laplace Smoothing
• A simple remedy is to add 1 to each count
– to keep them as probabilities, we increase the denominator N by
the number of outcomes (2 values of t: ‘positive’ and ‘negative’)
– for the conditional probabilities P( w | t ) there are similarly two
outcomes (w is present or absent)
P( t ) = ( count( docs labeled t ) + 1 ) / ( N + 2 )
P( wi | t ) = ( count( docs labeled t containing wi ) + 1 ) / ( count( docs labeled t ) + 2 )
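A sketch of add-one smoothing under the scheme described above (two classes, and two outcomes per word); with it, a word never seen in positive documents gets a small nonzero probability instead of zeroing out the whole product:

```python
def smoothed_estimates(count_t, N, count_w_and_t, num_classes=2):
    """Laplace (add-one) smoothed versions of the MLE estimates:
    P(t)     = (count(docs labeled t) + 1) / (N + num_classes)
    P(w | t) = (count(docs labeled t containing w) + 1) / (count(docs labeled t) + 2)
    """
    p_t = (count_t + 1) / (N + num_classes)
    p_w_given_t = (count_w_and_t + 1) / (count_t + 2)
    return p_t, p_w_given_t

# A word unseen in the 10 positive docs of a 20-doc corpus now gets
# probability 1/12 rather than 0:
smoothed_estimates(10, 20, 0)   # (0.5, 1/12)
```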
18. An NLTK Demo
Sentiment Analysis with Python NLTK Text
Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier
19. • Is “low” a positive or a negative term?
21. Verdict: mixed
• A simple bag-of-words strategy with an NB
model works quite well for simple reviews
referring to a single item, but fails
– for ambiguous terms
– for comparative reviews
– to reveal aspects of an opinion
• the car looked great and handled well,
but the wheels kept falling off
23. Association Mining
• Goal: find interesting relationships among attributes
of an object in a large collection …
objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
– widely used in scientific and medical literature
24. Bag-of-words
• Simplest approach
• look for words A and B for which
frequency (A and B in same document) >>
frequency of A x frequency of B
• doesn’t work well
• want to find names (of companies, products, genes),
not individual words
• interested in specific types of terms
• want to learn from a few examples
– need contexts to avoid noise
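The frequency test on this slide amounts to comparing observed co-occurrence against what independence would predict, i.e. a PMI-style ratio; a sketch:

```python
def association_score(n_ab, n_a, n_b, N):
    """Ratio of observed co-occurrence probability to the probability
    expected if A and B were independent. A ratio >> 1 suggests an
    association (this is exp(PMI))."""
    p_ab = n_ab / N
    p_a, p_b = n_a / N, n_b / N
    return p_ab / (p_a * p_b)

# A and B each in 100 of 1000 docs, together in 50:
# co-occurring 5x more often than chance predicts.
association_score(50, 100, 100, 1000)   # 5.0
```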
25. Needed Ingredients
• Effective text association mining needs
– name recognition
– term classification
– preferably: ability to learn patterns
– preferably: a good GUI
– demo: www.coremine.com
26. Conclusion
• Some tasks can be handled effectively (and
very simply) by bag-of-words models,
• but most benefit from an analysis of language
structure