2. Topic Index
• Why Vectorization?
• Vector Space Model
• Bag of Words
• TF-IDF
• N-Grams
• Kernel Hashing
3. “How is it possible for a slow, tiny brain, whether biological or
electronic, to perceive, understand, predict, and manipulate a
world far larger and more complicated than itself?”
--- Stuart Russell & Peter Norvig, “Artificial Intelligence: A Modern Approach”
WHY VECTORIZATION?
5. What Needs to Happen?
• Each tweet needs to be represented as a structure that can be
fed to a learning algorithm
– to capture the knowledge of a “negative” vs. a “positive”
tweet
• How does that happen?
– We need to take the raw text and convert it into what
is called a “vector”
• “Vector” comes from the fundamentals of linear
algebra
– solving sets of linear equations
6. Wait. What’s a Vector Again?
• An array of floating point numbers
• Represents data
– Text
– Audio
– Image
• Example:
– [ 1.0, 0.0, 1.0, 0.5 ]
7. “I am putting myself to the fullest possible use, which is
all I think that any conscious entity can ever hope to do.”
--- HAL 9000, “2001: A Space Odyssey”
VECTOR SPACE MODEL
8. Vector Space Model
• A common way of vectorizing text
– every possible word is mapped to a specific integer index
• If the array is large enough, then every word
fits into a unique slot in the array
– the value at that index is the number of times the
word occurs
• Most often our array size is smaller than our corpus
vocabulary
– so we need a “vectorization strategy” to
account for this (a minimal sketch follows below)
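A minimal sketch of the model in Python (the documents and vocabulary here are made up for illustration):

    # Build a word -> index mapping, then count term occurrences
    # into a fixed-size array of floats.
    docs = ["the movie was great great fun", "the movie was awful"]
    vocabulary = sorted({w for d in docs for w in d.split()})
    index = {word: i for i, word in enumerate(vocabulary)}

    def vectorize(text):
        vec = [0.0] * len(vocabulary)
        for word in text.split():
            if word in index:              # unknown words are simply dropped
                vec[index[word]] += 1.0    # value = number of times the word occurs
        return vec

    print(vocabulary)
    print(vectorize(docs[0]))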
9. Text Processing Can Include Several Stages
• Sentence Segmentation
– can skip straight to tokenization depending on use case
• Tokenization
– find individual words
• Lemmatization
– finding the base or stem form of words
• Stop Word Removal
– “the”, “and”, etc.
• Vectorization
– the output of the preceding stages is turned into an array of
floating point values (see the toy pipeline below)
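A toy pipeline illustrating these stages (the stop word list and the crude suffix-stripping lemmatizer are simplified stand-ins for real NLP tooling):

    import re

    STOP_WORDS = {"the", "and", "a", "is", "was"}   # tiny illustrative list

    def tokenize(text):
        # find individual words, lowercased
        return re.findall(r"[a-z']+", text.lower())

    def lemmatize(word):
        # crude stand-in: strip a common plural suffix
        return word[:-1] if word.endswith("s") and len(word) > 3 else word

    def preprocess(text):
        tokens = [lemmatize(t) for t in tokenize(text)]
        return [t for t in tokens if t not in STOP_WORDS]

    print(preprocess("The cats and the dog"))   # ['cat', 'dog']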
10. “A man who carries a cat by the tail learns something he can
learn in no other way.”
--- Mark Twain
TEXT VECTORIZATION STRATEGIES
11. Bag of Words
• A group of words or a document is represented as a bag
– or “multi-set” of its words
• A bag of words is a list of words and their counts
– the simplest vector model
– but it can end up using a lot of columns due to the number of
words involved
• Grammar and word ordering are ignored
– but we still track how many times each word occurs in the
document (see the sketch below)
• Used most frequently in the document
classification and information retrieval domains
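A bag of words is essentially a multiset of word counts, which Python's Counter captures directly:

    from collections import Counter

    def bag_of_words(text):
        # grammar and word ordering are ignored; only counts are kept
        return Counter(text.lower().split())

    print(bag_of_words("the quick brown fox jumps over the lazy dog"))
    # Counter({'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, ...})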
12. Term Frequency - Inverse Document
Frequency (TF-IDF)
• Fixes some issues with “bag of words”
• Allows us to leverage information about
how often a word occurs in a document (TF)
– while considering the frequency of the word in the
corpus to control for the fact that some words
will be more common than others (IDF)
• More accurate than the basic bag of words
model
– but computationally more expensive
13. TF-IDF Formula
• w_i = TF_i × IDF_i
• TF_i(t)
– = (number of times term t appears in a document) /
(total number of terms in the document)
• IDF_i = log(N / DF_i)
– N is the total number of documents in the corpus
– DF_i is the number of documents containing the term t
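A direct translation of the formula into Python (the corpus here is a toy example; it assumes the term appears in at least one document):

    import math

    corpus = [
        "the cat sat on the mat".split(),
        "the dog ate my homework".split(),
        "the cat ate the fish".split(),
    ]

    def tf(term, document):
        # times the term appears / total terms in the document
        return document.count(term) / len(document)

    def idf(term, corpus):
        df = sum(1 for doc in corpus if term in doc)  # documents containing the term
        return math.log(len(corpus) / df)

    def tf_idf(term, document, corpus):
        return tf(term, document) * idf(term, corpus)

    print(tf_idf("cat", corpus[0], corpus))  # ~0.068: distinctive term
    print(tf_idf("the", corpus[0], corpus))  # 0.0: appears in every document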
14. N-grams
• A group of words in a sequence is called an n-gram
• A single word can be called a unigram
• Two words like “Coca Cola” can be considered
a single unit and called a bigram
• Three or more terms are called trigrams,
4-grams, 5-grams, and so on (see the sliding-window sketch below)
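Generating n-grams from a token sequence is a simple sliding window:

    def ngrams(tokens, n):
        # slide a window of size n across the token sequence
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "i drank a coca cola today".split()
    print(ngrams(tokens, 1))  # unigrams
    print(ngrams(tokens, 2))  # bigrams, including ('coca', 'cola')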
15. N-Grams Usage
• If we combine the unigrams and bigrams from a document and
generate weights using TF-IDF
– we will end up with large vectors containing many meaningless
bigrams
– which carry large weights on account of their large IDF
• We can pass each n-gram through a log-likelihood test (sketched
below)
– which can determine whether two words occurred together merely by
chance, or because they form a significant unit
– it selects the most significant n-grams and prunes away the least
significant ones
• Using the remaining n-grams, the TF-IDF weighting scheme is applied
and vectors are produced
– in this way, significant bigrams like “Coca Cola” can be properly
accounted for in the TF-IDF weighting
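One standard choice for this test is Dunning's log-likelihood ratio for bigram collocations; a sketch, with made-up counts for illustration:

    import math

    def log_l(k, n, p):
        # log-likelihood of k successes in n trials under probability p
        p = min(max(p, 1e-12), 1 - 1e-12)   # guard against log(0)
        return k * math.log(p) + (n - k) * math.log(1 - p)

    def llr(c12, c1, c2, n):
        # c12: bigram count, c1/c2: individual word counts, n: total bigrams
        p, p1, p2 = c2 / n, c12 / c1, (c2 - c12) / (n - c1)
        return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
                    - log_l(c12, c1, p) - log_l(c2 - c12, n - c1, p))

    # "coca cola": co-occurs far more often than chance would predict
    print(llr(30, 40, 45, 100_000))       # large score -> significant unit, keep
    print(llr(200, 4000, 5000, 100_000))  # ~0: matches chance -> prune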
16. Kernel Hashing
• Vectorizes the data in a single
pass
– making it a “just in time” vectorizer
• Can be used when we want to vectorize text right
before we feed it to our learning algorithm.
• We come up with a fixed-size vector that is
typically smaller than the total number of possible words
that we could index or vectorize
– then we use a hash function to create an index into
the vector (see the sketch below)
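A minimal sketch of the technique (Python's hashlib gives a stable hash; real implementations such as scikit-learn's HashingVectorizer differ in detail):

    import hashlib

    NUM_BUCKETS = 16   # fixed vector size, smaller than the vocabulary

    def bucket(word):
        # stable hash of the word, mapped into the fixed-size vector
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    def hash_vectorize(text):
        vec = [0.0] * NUM_BUCKETS
        for word in text.lower().split():
            vec[bucket(word)] += 1.0   # colliding words share a slot
        return vec

    # No vocabulary-building pass needed: vectorize in a single pass.
    print(hash_vectorize("the quick brown fox jumps over the lazy dog"))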
17. More Kernel Hashing
• The advantage of kernel hashing is that we don’t
need the precursor pass that we do with TF-IDF
– but we run the risk of hash collisions between words
• The reality is that these collisions occur very
infrequently
– and don’t have a noticeable impact on learning
performance
• For more reading:
– http://jeremydhoon.github.com/2013/03/19/abusing-hash-kernels-for-wildly-unprincipled-machine-learning/