This document discusses bag-of-words methods for text mining applications such as document retrieval, opinion mining, and association mining. It describes how a basic bag-of-words approach can be used for document retrieval by creating topic vectors from word frequencies and calculating cosine similarity between documents. Opinion mining can be done by training a naive Bayes classifier on labeled documents to learn positive and negative words. Association mining seeks relationships between terms, but a simple bag-of-words approach has limitations and better methods use term classification and learning from contexts.
2. Bag of Words Models
• do we really need elaborate linguistic analysis?
• look at text mining applications
– document retrieval
– opinion mining
– association mining
• see how far we can get with document-level bag-of-words models
– and introduce some of our mathematical approaches
1/16/14 NYU 2
4. Information Retrieval
• Task: given query = list of keywords, identify
and rank relevant documents from collection
• Basic idea:
find documents whose set of words most
closely matches words in query
5. Topic Vector
• Suppose the document collection has n distinct words,
w1, …, wn
• Each document is characterized by an n-dimensional
vector whose ith component is the frequency of word
wi in the document
Example
• D1 = [The cat chased the mouse.]
• D2 = [The dog chased the cat.]
• W = [The, chased, dog, cat, mouse] (n = 5)
• V1 = [ 2 , 1 , 0 , 1 , 1 ]
• V2 = [ 2 , 1 , 1 , 1 , 0 ]
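The example above can be reproduced in a few lines of Python; this is a minimal sketch whose tokenizer just lowercases and strips the trailing period, which is enough for this toy collection:

```python
from collections import Counter

def topic_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.lower().rstrip(".").split())
    return [counts[w] for w in vocab]

vocab = ["the", "chased", "dog", "cat", "mouse"]
v1 = topic_vector("The cat chased the mouse.", vocab)   # [2, 1, 0, 1, 1]
v2 = topic_vector("The dog chased the cat.", vocab)     # [2, 1, 1, 1, 0]
```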
6. Weighting the components
• Unusual words like “elephant” determine the topic much more
than common words such as “the” or “have”
• can ignore words on a stop list or
• weight each term frequency tfi by its inverse document
frequency idfi
• where N = size of collection and ni = number of documents
containing term i
wi = tfi × idfi,   where  idfi = log( N / ni )
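A minimal sketch of the weighting, assuming natural log (the slide does not fix the base):

```python
import math

def idf(N, n_i):
    """Inverse document frequency: idf_i = log(N / n_i)."""
    return math.log(N / n_i)

def tfidf(tf_i, N, n_i):
    """Weighted component: w_i = tf_i * idf_i."""
    return tf_i * idf(N, n_i)

# A term appearing in every document gets weight 0, like a stop word;
# a term in only 10 of 1000 documents is strongly boosted.
tfidf(3, 1000, 1000)   # 0.0
```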
7. Cosine similarity metric
Define a similarity metric between topic vectors
A common choice is cosine similarity (the normalized dot
product):
sim(A, B) = Σi ai bi / ( √(Σi ai²) · √(Σi bi²) )
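A sketch in Python, applied to the topic vectors V1 and V2 from the earlier example:

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [2, 1, 0, 1, 1]   # "The cat chased the mouse."
v2 = [2, 1, 1, 1, 0]   # "The dog chased the cat."
cosine_sim(v1, v2)     # 6/7 ≈ 0.857
```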
8. Cosine similarity metric
• the cosine similarity metric is the cosine of the
angle between the term vectors:
[Figure: two term vectors w1 and w2, with the similarity as the cosine of the angle between them]
9. Verdict: a success
• For heterogeneous text collections, the vector
space model, tf-idf weighting, and cosine
similarity have been the basis for successful
document retrieval for over 50 years
11. Opinion Mining
– Task: judge whether a document expresses a positive
or negative opinion (or no opinion) about an object or
topic
• classification task
• valuable for producers/marketers of all sorts of products
– Simple strategy: bag-of-words
• make lists of positive and negative words
• see which predominate in a given document
(and mark as ‘no opinion’ if there are few words of either type)
• problem: hard to make such lists
– lists will differ for different topics
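The word-list strategy can be sketched as follows; the POS/NEG lists here are made up for illustration, and building good lists is exactly the hard part the slide points out:

```python
def classify_opinion(doc, pos_words, neg_words):
    """Tally positive vs. negative words; 'no opinion' on a tie or no hits."""
    words = doc.lower().split()
    pos = sum(w in pos_words for w in words)
    neg = sum(w in neg_words for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "no opinion"

# Hypothetical word lists -- and they would differ for other topics
POS = {"great", "excellent", "reliable"}
NEG = {"terrible", "awful", "flimsy"}

classify_opinion("a great and reliable camera", POS, NEG)   # "positive"
```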
12. Training a Model
– Instead, label the documents in a corpus and then
train a classifier from the corpus
• labeling is easier than thinking up words
• in effect, learn positive and negative words from the
corpus
– probabilistic classifier
• identify most likely class
s = argmaxt P( t | W ),   t ∈ {pos, neg}
13. Using Bayes’ Rule
• The last step is based on the naïve assumption
of independence of the word probabilities
argmaxt P( t | W )
= argmaxt P( W | t ) P( t ) / P( W )       (Bayes’ rule)
= argmaxt P( W | t ) P( t )                (P(W) does not depend on t)
= argmaxt P( w1, …, wn | t ) P( t )
= argmaxt [ Πi P( wi | t ) ] P( t )        (naïve independence)
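Under that independence assumption, classification becomes a product of per-word probabilities; a sketch, computed in log space to avoid underflow (the priors and cond tables here are hypothetical hand-set estimates):

```python
import math

def classify(words, priors, cond):
    """Return argmax_t P(t) * prod_i P(w_i | t), computed with log probabilities."""
    best_t, best_score = None, float("-inf")
    for t, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond[t][w]) for w in words)
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical estimates, just for illustration
priors = {"pos": 0.5, "neg": 0.5}
cond = {"pos": {"great": 0.6, "bad": 0.1},
        "neg": {"great": 0.1, "bad": 0.6}}
classify(["great", "great", "bad"], priors, cond)   # "pos"
```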
14. Training
• We now estimate these probabilities from the
training corpus (of N documents) using maximum
likelihood estimators
• P( t ) = count( docs labeled t ) / N
• P( wi | t ) = count( docs labeled t containing wi ) / count( docs labeled t )
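A sketch of these maximum-likelihood estimates, assuming each training document is represented as a set of words plus a label:

```python
def train_mle(labeled_docs):
    """labeled_docs: list of (set_of_words, label) pairs.
    Returns the MLE estimates P(t) and P(w | t) from the slide."""
    N = len(labeled_docs)
    doc_count = {}    # count(docs labeled t)
    word_count = {}   # count(docs labeled t containing w)
    for words, t in labeled_docs:
        doc_count[t] = doc_count.get(t, 0) + 1
        wc = word_count.setdefault(t, {})
        for w in words:
            wc[w] = wc.get(w, 0) + 1
    priors = {t: c / N for t, c in doc_count.items()}
    cond = {t: {w: c / doc_count[t] for w, c in wc.items()}
            for t, wc in word_count.items()}
    return priors, cond

priors, cond = train_mle([({"great", "fun"}, "pos"),
                          ({"bad"}, "neg"),
                          ({"great"}, "pos")])
# priors["pos"] == 2/3; cond["pos"]["great"] == 1.0
```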
15. A Problem
• Suppose a glowing review GR (with lots of
positive words) includes one word,
“mathematical”, previously seen only in
negative reviews
• P ( positive | GR ) = ?
16. P ( positive | GR ) = 0
because P ( “mathematical” | positive ) = 0
The maximum likelihood estimate is poor when
there is very little data
We need to ‘smooth’ the probabilities to avoid
this problem
17. Laplace Smoothing
• A simple remedy is to add 1 to each count
– to keep them as probabilities, we increase the denominator N by
the number of outcomes (2 values of t: ‘positive’ and ‘negative’)
– for the conditional probabilities P( w | t ) there are similarly two
outcomes (w is present or absent)
P( t ) = ( count( docs labeled t ) + 1 ) / ( N + 2 )
P( wi | t ) = ( count( docs labeled t containing wi ) + 1 ) / ( count( docs labeled t ) + 2 )
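A sketch of add-one smoothing under the scheme described above (two classes, and two outcomes per word); with it, a word never seen in positive documents gets a small nonzero probability instead of zeroing out the whole product:

```python
def smoothed_estimates(count_t, N, count_w_and_t, num_classes=2):
    """Laplace (add-one) smoothed versions of the MLE estimates:
    P(t)     = (count(docs labeled t) + 1) / (N + num_classes)
    P(w | t) = (count(docs labeled t containing w) + 1) / (count(docs labeled t) + 2)
    """
    p_t = (count_t + 1) / (N + num_classes)
    p_w_given_t = (count_w_and_t + 1) / (count_t + 2)
    return p_t, p_w_given_t

# A word unseen in the 10 positive docs of a 20-doc corpus now gets
# probability 1/12 rather than 0:
smoothed_estimates(10, 20, 0)   # (0.5, 1/12)
```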
18. An NLTK Demo
Sentiment Analysis with Python NLTK Text
Classification
• http://text-processing.com/demo/sentiment/
NLTK Code (simplified classifier)
• http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier
19. • Is “low” a positive or a negative term?
21. Verdict: mixed
• A simple bag-of-words strategy with an NB
model works quite well for simple reviews
referring to a single item, but fails
– for ambiguous terms
– for comparative reviews
– to reveal aspects of an opinion
• the car looked great and handled well,
but the wheels kept falling off
23. Association Mining
• Goal: find interesting relationships among attributes
of an object in a large collection …
objects with attribute A also have attribute B
– e.g., “people who bought A also bought B”
• For text: documents with term A also have term B
– widely used in scientific and medical literature
24. Bag-of-words
• Simplest approach
• look for words A and B for which
frequency (A and B in same document) >>
frequency of A x frequency of B
• doesn’t work well
• want to find names (of companies, products, genes),
not individual words
• interested in specific types of terms
• want to learn from a few examples
– need contexts to avoid noise
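The frequency test on this slide amounts to comparing observed co-occurrence against what independence would predict, i.e. a PMI-style ratio; a sketch:

```python
def association_score(n_ab, n_a, n_b, N):
    """Ratio of observed co-occurrence probability to the probability
    expected if A and B were independent. A ratio >> 1 suggests an
    association (this is exp(PMI))."""
    p_ab = n_ab / N
    p_a, p_b = n_a / N, n_b / N
    return p_ab / (p_a * p_b)

# A and B each in 100 of 1000 docs, together in 50:
# co-occurring 5x more often than chance predicts.
association_score(50, 100, 100, 1000)   # 5.0
```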
25. Needed Ingredients
• Effective text association mining needs
– name recognition
– term classification
– preferably: ability to learn patterns
– preferably: a good GUI
– demo: www.coremine.com
26. Conclusion
• Some tasks can be handled effectively (and
very simply) by bag-of-words models,
• but most benefit from an analysis of language
structure