1
Natural Language Processing
Toine Bogers
Aalborg University Copenhagen
Christina Lioma
University of Copenhagen
QUARTZ WINTER SCHOOL / FEBRUARY 12, 2018 / PADUA, ITALY
Who?
• Toine Bogers (toine@hum.aau.dk)
– Associate professor @ Aalborg University Copenhagen
– Interests
§ Recommender systems
§ Information retrieval (search engines)
§ Information behavior
• Christina Lioma (c.lioma@di.ku.dk)
– Full professor @ University of Copenhagen
– Interests
§ Information retrieval (search engines)
§ Natural language processing, computational linguistics
2
Outline
• Introduction to NLP
• Vector semantics
• Text classification
3
Useful references
• Slides in this lecture are based on the Jurafsky & Martin book
– Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. https://web.stanford.edu/~jurafsky/slp3/
• Other good textbooks on NLP
– Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins.
– Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
4
5
Part 1
Introduction to NLP
What is Natural Language Processing?
• Multidisciplinary branch of CS drawing largely from linguistics
• Definition by Liddy (1998)
– “Natural language processing is a range of computational techniques for
analyzing and representing naturally occurring texts at one or more levels
of linguistic analysis for the purpose of achieving human-like language
processing for a range of particular tasks or applications.”
• Goal: applied mechanization of human language
6
Levels of linguistic analysis
• Phonological — Interpretation of speech sounds within and across words
• Morphological — Componential analysis of words, including prefixes, suffixes
and roots
• Lexical — Word-level analysis including lexical meaning and part-of-speech
analysis
• Syntactic — Analysis of words in a sentence in order to uncover the grammatical
structure of the sentence
• Semantic — Determining the possible meanings of a sentence, including
disambiguation of words in context
• Discourse — Interpreting structure and meaning conveyed by texts larger than a
sentence
• Pragmatic — Understanding the purposeful use of language in situations,
particularly those aspects of language which require world knowledge
7Liddy (1998)
NLP is hard!
• Multidisciplinary knowledge gap
– Some computer scientists might not “get” linguistics
– Some linguists might not “get” computer science
– Goal: bridge this gap
• String similarity alone is not enough
– Very similar strings can mean different things (1 & 2)
– Very different strings can mean similar things (2 & 3)
– Examples:
1. How fast is the TZ?
2. How fast will my TZ arrive?
3. Please tell me when I can expect the TZ I ordered
8
NLP is hard!
• Ambiguity
– Identical strings can have different meanings
– Example: “I made her duck” has at least five possible meanings
§ I cooked waterfowl for her
§ I cooked waterfowl belonging to her
§ I created the (plaster?) duck she owns
§ I caused her to quickly lower her head or body
§ I waved my magic wand and turned her into undifferentiated waterfowl
9
Overcoming NLP difficulty
• Natural language ambiguity is very common, but also largely local
– Immediate context resolves ambiguity
– Immediate context ➝ common sense
– Example: “My connection is too slow today.”
• Humans use common sense to resolve ambiguity, sometimes
without being aware there was ambiguity
10
Overcoming NLP difficulty
• Machines do not have common sense
• Initial suggestion
– Hand-code common sense in machines
– Impossibly hard to do for more than very limited domains
• Present suggestion
– Applications that work with very limited domains
– Approximate common sense by relatively simple techniques
11
Applications of NLP
• Language identification
• Spelling & grammar checking
• Speech recognition & synthesis
• Sentiment analysis
• Automatic summarization
• Machine translation
• Information retrieval
• Information extraction
• …
12
Applications of NLP
• Some applications are widespread (e.g., spell check), while others
are not ready for industry or are too expensive for popular use
• NLP tools rarely hit a 100% success rate
– Accuracy is assessed in statistical terms
– Tools become mature and usable when they operate above a certain
precision and below an acceptable cost
• All NLP tools improve continuously (and often rapidly)
13
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & part-of-speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
14
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
15
today’s
focus
16
Part 2a
Vector semantics
Introduction to distributional semantics
Word similarity
• Understanding word similarity is essential for NLP
– Example
§ “fast” is similar to “rapid”
§ “tall” is similar to “height”
– Question answering
§ Question: “How tall is Mt. Everest?”
§ Candidate answer: “The official height of Mount Everest is 29029 feet.”
• Can we compute the similarity between words automatically?
– Distributional semantics is an approach to doing this
17
Application: Plagiarism detection
18
Distributional semantics
• Distributional semantics is the study of semantic similarities between
words using their distributional properties in text corpora
– Distributional models of meaning = vector-space models of meaning =
vector semantics
• Intuition behind this
– Linguistic items with similar distributions have similar meanings
– Zellig Harris (1954):
§ “oculist and eye-doctor … occur in almost the same environments”
§ “If A and B have almost identical environments we say that they are synonyms.”
– Firth (1957)
§ “You shall know a word by the company it keeps!”
19
Distributional semantics
• Example
– From context words, humans can guess tesgüino means an alcoholic
beverage like beer
• Intuition for algorithm
– Two words are similar if they have similar word contexts
20
A bottle of tesgüino is on the table.
Everybody likes tesgüino.
Tesgüino makes you drunk.
We make tesgüino out of corn.
Four kinds of models of vector semantics
• Sparse vector representations
– Mutual-information weighted word co-occurrence matrices
• Dense vector representations
– Singular value decomposition (and Latent Semantic Analysis)
– Neural-network-inspired models (skip-grams, CBOW)
– Brown clusters
21
today’s
focus
Shared intuition
• Model the meaning of a word by “embedding” in a vector space
– The meaning of a word is a vector of numbers
– Vector models are also called embeddings
• Contrast
– Word meaning is represented in many computational linguistic
applications by a vocabulary index (“word number 545”)
22
Representing vector semantics
• Words are related if they occur in the same context
• How big is this context?
– Entire document? Paragraph? Window of ±n words?
– Smaller contexts (e.g., context windows) are best for capturing similarity
– If a word w occurs a lot in the context windows of another word v (i.e., if
they frequently co-occur), then they are probably related
23
[Textbook excerpt, Jurafsky & Martin: the word-word matrix is of dimensionality |V| × |V|; each cell records how often the row (target) word and the column (context) word co-occur in some context in a training corpus. The context could be the whole document, but it is more common to use a window around the word, e.g. ±4 words, in which case the cell counts how often the column word occurs in such a window around the row word.]
Representing vector semantics
• We can now define a word w by a vector of counts of context
words
– Counts represent how often those words have co-occurred in the same
context window with word w
– Each vector is of length |V|, where V is the vocabulary
– Vector semantics is captured in a word-word matrix of size |V| x |V|
24
Example: Contexts ±7 words
25
aardvark computer data pinch result sugar …
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0
…
Example ±7-word windows from the Brown corpus (one per target word):
sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,
their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
well suited to programming on the digital computer. In finding the optimal R-stage policy from
for the purpose of gathering data and information necessary for the study authorized in the
Word-word matrix
• We showed only 4 x 6, but the real matrix is 50,000 x 50,000
– Most values are 0 so it is very sparse
– That’s OK, since there are lots of efficient algorithms for sparse matrices
• The size of windows depends on your goals
– The shorter the windows, the more syntactic the representation
§ ± 1-3 very syntactic
– The longer the windows, the more semantic the representation
§ ± 4-10 more semantic
26
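A minimal sketch of how such a word-word co-occurrence matrix could be built with a context window (the tokenization is deliberately naive, and the toy corpus is the tesgüino example from above):

from collections import defaultdict

def cooccurrence_counts(sentences, window=4):
    """Count how often each context word appears within ±window positions of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()          # naive tokenization
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [
    "a bottle of tesgüino is on the table",
    "everybody likes tesgüino",
    "tesgüino makes you drunk",
    "we make tesgüino out of corn",
]
counts = cooccurrence_counts(corpus, window=4)
print(dict(counts["tesgüino"]))   # context-word counts for one target word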
Two types of co-occurrence
• First-order co-occurrence (syntagmatic association)
– The two words typically occur near each other
– wrote is a first-order associate of book or poem
• Second-order co-occurrence (paradigmatic association)
– They have similarneighbors
– wrote is a second-order associate of words like said or remarked
27
28
Part 2b
Vector semantics
Positive Pointwise Mutual Information (PPMI)
Problem with raw co-occurrence counts
• Raw word frequency is not a great measure of association between
words
– It’s very skewed
– “the” and “of” are very frequent, but maybe not the most discriminative
• We would prefer a measure that asks whether a context word is
particularly informative about the target word
– Positive Pointwise Mutual Information (PPMI)
29
Pointwise Mutual Information
• Do events x and y co-occur more than if they were independent?
• PMI between two words (Church & Hanks, 1989)
– Do words x and y co-occur more than if they were independent?
30
$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$

$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}$
Positive Pointwise Mutual Information
• PMI ranges from –∞ to +∞, but negative values are problematic
– Things are co-occurring less than we expect by chance
– Unreliable without enormous corpora
§ Imagine w1 and w2, each with probability 10⁻⁶
§ Hard to be sure p(w1, w2) is significantly different from 10⁻¹²
– Plus it is not clear people are good at “unrelatedness”
• So we just replace negative PMI values by 0
– Positive PMI (PPMI) between word1 and word2:
31
$\mathrm{PPMI}(word_1, word_2) = \max\left(\log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\ 0\right)$
• Matrix F with W rows (words) and C columns (contexts)
• fij is # of times wi occurs in context cj
32
$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$

$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}} \qquad ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$
Computing PPMI on a term-context matrix
p(w = information, c = data) = 6/19 = .32
p(w = information) = 11/19 = .58
p(c = data) = 7/19 = .37
33
p(w, context) and p(w):
            computer  data  pinch  result  sugar  | p(w)
apricot       0.00    0.00  0.05   0.00    0.05   | 0.11
pineapple     0.00    0.00  0.05   0.00    0.05   | 0.11
digital       0.11    0.05  0.00   0.05    0.00   | 0.21
information   0.05    0.32  0.00   0.21    0.00   | 0.58
p(context)    0.16    0.37  0.11   0.26    0.11
$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$
34
$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$
p(w, context) and p(w): same table as on the previous slide.
PPMI(w, context):
            computer  data  pinch  result  sugar
apricot       -        -    2.25    -      2.25
pineapple     -        -    2.25    -      2.25
digital      1.66     0.00   -     0.00     -
information  0.00     0.57   -     0.47     -
$\mathrm{PMI}(\text{information}, \text{data}) = \log_2\left(\frac{.32}{.37 \times .58}\right) = .57$
(A code sketch of this computation follows below.)
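A minimal sketch of how the PPMI values above could be computed with NumPy, using the count matrix from the ±7-word-window example:

import numpy as np

# Count matrix from the earlier example (rows = target words, columns = context words)
words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

N = F.sum()                                # total number of counts (19)
p_ij = F / N                               # joint probabilities p(w, c)
p_w = F.sum(axis=1, keepdims=True) / N     # row marginals p(w)
p_c = F.sum(axis=0, keepdims=True) / N     # column marginals p(c)

with np.errstate(divide="ignore"):         # zero counts give log2(0) = -inf
    pmi = np.log2(p_ij / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                  # clip negative (and -inf) values to 0

print(round(ppmi[words.index("information"), contexts.index("data")], 2))   # 0.57
print(round(ppmi[words.index("digital"), contexts.index("computer")], 2))   # 1.66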
One more problem…
• We are unlikely to encounter rare
words unless we have large
corpora
– PMI values cannot be calculated if
co-occurrence count is 0
• Solution: give rare words slightly
higher probabilities
– Steal probability mass to
generalize better
– Laplace smoothing (aka add-one
smoothing)
§ Pretend we saw each word one
more time than we did
35
[Figure: redistributing probability mass from frequently seen context words (allegations, reports, claims, request) to rare or unseen ones (attack, man, outcome) so that every word receives a non-zero probability]
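One simple way to realize this on the co-occurrence counts is to add a constant k to every cell before computing PPMI. A minimal sketch on the toy count matrix from before (add-2 is just one possible choice of k):

import numpy as np

def add_k_smoothing(F, k=2):
    """Pretend every word-context pair was seen k more times than it actually was."""
    return F + k

F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

F_smoothed = add_k_smoothing(F, k=2)
print(F_smoothed.min())   # 2.0: no zero counts remain, so PMI is defined everywhere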
36
Part 2c
Vector semantics
Measuring word similarity: the cosine
Measuring similarity
• We need a way to measure the similarity between two target words v
and w
– Most vector similarity measures are based on the dot product or inner
product from linear algebra
– High when two vectors have large values in same dimensions
– Low (in fact 0) for orthogonal vectors with zeros in complementary
distribution
37
$\text{dot-product}(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \ldots + v_N w_N$

[Textbook excerpt, Jurafsky & Martin: the cosine, like most vector similarity measures used in NLP, is based on the dot product (inner product); it tends to be high when two vectors have large values in the same dimensions, and it is 0 for orthogonal vectors.]
Measuring similarity
• Problem: dot product is not normalized for vector length
– Vectors are longer if they have higher values in each dimension
– That means more frequent words will have higher dot products
– Our similarity metric should not be sensitive to word frequency
• Solution: divide it by the length of the two vectors
– Is equal to the cosine of the angle between the two vectors!
– This is the cosine similarity
38
[Textbook excerpt, Jurafsky & Martin: the dot product is higher for longer vectors, and more frequent words have longer vectors since they co-occur with more words; we would like a similarity metric that is independent of word frequency. The simplest fix is to divide the dot product by the lengths of the two vectors, which equals the cosine of the angle between them:]

$\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta \qquad \Rightarrow \qquad \cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$
Calculating the cosine similarity
• vi is the PPMI value for word v in context i
• wi is the PPMI value for word w in context i
• cos(v, w) is the cosine similarity between v and w
– Raw frequency or PPMI are non-negative, so cosine range is [0, 1]
39
$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$
(numerator: dot product; middle form: dot product of unit vectors)
large data computer
apricot 2 0 0
digital 0 1 2
information 1 6 1
40
Which pair of words is more similar?
$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$
Example
$\cos(\text{apricot}, \text{digital}) = \frac{2 \cdot 0 + 0 \cdot 1 + 0 \cdot 2}{\sqrt{4+0+0}\,\sqrt{0+1+4}} = 0$
$\cos(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} \approx .58$
$\cos(\text{apricot}, \text{information}) = \frac{2 + 0 + 0}{\sqrt{4+0+0}\,\sqrt{1+36+1}} = \frac{2}{2\sqrt{38}} \approx .16$
(A code sketch reproducing these values follows below.)
[Figure: apricot, digital, and information plotted as vectors in the plane spanned by dimension 1 ('large') and dimension 2 ('data'); the angle between two vectors reflects their similarity]
41
large data
apricot 2 0
digital 0 1
information 1 6
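A minimal sketch of the cosine computation, using the three count vectors from the table above (dimensions: large, data, computer):

import math

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

apricot     = [2, 0, 0]
digital     = [0, 1, 2]
information = [1, 6, 1]

print(round(cosine(apricot, digital), 2))      # 0.0
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, information), 2))  # 0.16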
Visualizing vectors and angles
[Figure: hierarchical clustering of word vectors from Rohde et al. (2006); body parts (WRIST, ANKLE, HAND, ...), animals (DOG, CAT, LION, ...), and cities/countries (CHICAGO, TOKYO, CHINA, ...) fall into separate clusters]
42 Rohde et al. (2006)
Clustering vectors to visualize similarity in co-occurrence matrices
Other possible similarity measures
43
44
Part 2d
Vector semantics
Evaluation
Evaluating word similarity
• Intrinsic evaluation
– Correlation between algorithm and human word similarity ratings
§ Wordsim353 ➞ 353 noun pairs rated on a 0-10 scale
– Taking TOEFL multiple-choice vocabulary tests
• Extrinsic (= task-based, end-to-end) evaluation
– Question answering
– Spell-checking
– Essay grading
45
sim(plane, car) = 5.77
Levied is closest in meaning to:
imposed / believed / requested / correlated
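A minimal sketch of intrinsic evaluation: compute the rank correlation (Spearman's ρ) between human similarity ratings and model similarities. The ratings and cosines below are made-up placeholders, not actual WordSim-353 data:

from scipy.stats import spearmanr

# Hypothetical human ratings (0-10) and model cosine similarities for five word pairs
human_ratings = [7.8, 6.3, 1.2, 8.1, 0.5]
model_cosines = [0.71, 0.55, 0.10, 0.69, 0.22]

rho, p_value = spearmanr(human_ratings, model_cosines)
print(f"Spearman correlation: {rho:.2f}")   # higher is better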
46
Part 2e
Vector semantics
Dense vectors
Sparse vs. dense vectors
• PPMI vectors are
– Long (length |V|= 20,000 to 50,000)
– Sparse (most elements are zero)
• Alternative: learn vectors which are
– Short (length 200-1000)
– Dense (most elements are non-zero)
47
Sparse vs. dense vectors
• Why dense vectors?
– Short vectors may be easier to use as features in machine learning (fewer weights to tune)
– De-noising ➞ low-order dimensions may represent unimportant information
– Dense vectors may generalize better than storing explicit counts
– Dense models may do better at capturing higher-order co-occurrence
§ In a sparse model, car and automobile are synonyms but are represented as distinct dimensions
§ This fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor (= paradigmatic association)
48
Methods for creating dense vector embeddings
• Singular Value Decomposition (SVD)
– A special case of this is called Latent Semantic Analysis (LSA)
• “Neural Language Model”-inspired predictive models (skip-grams
and CBOW)
• Brown clustering
49
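A minimal sketch of the SVD route to dense vectors, applied purely for illustration to the tiny PPMI matrix from earlier; real applications factorize much larger matrices and keep a few hundred dimensions:

import numpy as np

# PPMI matrix from the earlier example (rows = words, columns = contexts; zeros for clipped cells)
ppmi = np.array([[0.00, 0.00, 2.25, 0.00, 2.25],
                 [0.00, 0.00, 2.25, 0.00, 2.25],
                 [1.66, 0.00, 0.00, 0.00, 0.00],
                 [0.00, 0.57, 0.00, 0.47, 0.00]])

U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
k = 2                                  # keep only the top-k singular dimensions
dense_vectors = U[:, :k] * S[:k]       # short, dense word embeddings
print(dense_vectors.shape)             # (4, 2): one 2-dimensional vector per word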
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
50
51
Part 3a
Text classification
Introduction
Is this spam?
52
What is text classification?
• Goal: Take a document and assign it a label representing its content
• Classic example: decide if a newspaper article is about politics, sports, or
business
• Many uses for the same technology
– Is this e-mail spam or not?
– Is this page a laser printer product page?
– Does this company accept overseas orders?
– Is this tweet positive or negative?
– Does this part of the CV describe part of a person’s work experience?
– Is this text written in Danish or Norwegian?
– Is this the “computer” or “harbor” sense of port?
53
Definition
• Input:
– A document d
– A fixed set of classes C = {c1, c2,…, cj}
• Output
– A predicted class c ∈ C
54
Top-down vs. bottom-up text classification
• Top-down approach
– We tell the computer exactly how it should solve a task (e.g., expert
systems)
– Example: black-list-address OR ("dollars" AND "have been
selected")
– Potential for high accuracy, but requires expensive & extensive
refinement by experts
• Bottom-up approach
– The computer finds out for itself with a 'little' help from us (e.g., machine learning)
– Example: The word “viagra” is very often found in spam e-mails
55
Definition (updated)
• Input:
– A document d
– A fixed set of classes C = {c1, c2,…, cj}
– A training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output
– A learned classifier γ: d ➞ c
56
Bottom-up text classification
• Any kind of classifier
– Naïve Bayes
– Logistic regression
– Support Vector Machines
– k-Nearest Neighbor
– Decision tree learning
– ...
57
today’s
focus
58
Part 3b
Text classification
Naïve Bayes
Naïve Bayes
• Simple (“naïve”) classification method based on Bayes rule
– Relies on very simple representation of document: bag of words
59
Bag-of-words representation
60
[Figure: a movie review reduced to a bag of words]
"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"
Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …
Bag-of-words representation
61
γ( {seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, …} ) = c
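A minimal sketch of the bag-of-words step: reduce the review to word counts, discarding word order (the tokenization is deliberately naive):

from collections import Counter
import re

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun... "
          "It manages to be whimsical and romantic while laughing at the "
          "conventions of the fairy tale genre. I would recommend it to "
          "just about anyone. I've seen it several times, and I'm always "
          "happy to see it again whenever I have a friend who hasn't seen it yet!")

tokens = re.findall(r"[a-z]+", review.lower())   # crude tokenization
bag_of_words = Counter(tokens)
print(bag_of_words.most_common(5))               # the most frequent words and their counts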
Naïve Bayes
• Simple (“naïve”) classification method based on Bayes rule
– Relies on very simple representation of document: bag of words
• Bayes’ rule applied to documents and classes
– For a document d and a class c
62
$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$
Naïve Bayes (cont’d)
63
$c_{MAP} = \operatorname*{argmax}_{c \in C} P(c \mid d)$   (MAP is "maximum a posteriori" = most likely class)
$= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$   (Bayes' rule)
$= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)$   (dropping the denominator)
$= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$   (document d represented as features x1, …, xn)
Naïve Bayes (cont’d)
• P(c): how often does this class occur?
– We can just count the relative class frequencies in a corpus
• P(x1, x2, …, xn | c) has O(|X|ⁿ · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available!
64
$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$
Multinomial Naïve Bayes
65
• Bag-of-words assumption
– Assume position does not matter
• Conditional independence assumption
– Assume the feature probabilities P(xi | cj) are independent given the
class c
$P(x_1, x_2, \ldots, x_n \mid c)$

$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdot \ldots \cdot P(x_n \mid c)$

$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$
Multinomial Naïve Bayes (cont’d)
66
$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$
Multinomial Naïve Bayes (cont’d)
67
$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$   (positions = all word positions in the test document)
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
– Simply use the frequencies in the data
68
$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$   (fraction of times word wi appears among all words in documents of topic cj)

$\hat{P}(c_j) = \frac{doc\_count(C = c_j)}{N_{doc}}$   (fraction of documents with topic cj)
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic
and classified in the topic positive (= thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other
evidence!
69
ˆP("fantastic"|positive) =
count("fantastic", positive)
P
w 2V
count(w, positive)
= 0
cMAP = argmax
c
ˆP(c)
Y
i
ˆP(xi |c)
Solution: Laplace smoothing
• Also known as add-one (or add-α) smoothing
– Pretend we saw each word one more time (or α more times) than we did
– For α = 1:
$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j) + 1}{\sum_{w \in V}\bigl(count(w, c_j) + 1\bigr)} = \frac{count(w_i, c_j) + 1}{\Bigl(\sum_{w \in V} count(w, c_j)\Bigr) + |V|}$
70
Learning a multinomial Naïve Bayes model
• From the training corpus, extract vocabulary V
• Calculate the P(cj) terms
– For each cj in C do
§ docsj ⟵ all docs with class = cj
§ P(cj) ⟵ |docsj| / |total # of documents|
• Calculate the P(wk | cj) terms
– Textj ⟵ single doc containing all docsj (aggregate representation of all documents of class j)
– For each word wk in V
§ nk ⟵ # of occurrences of wk in Textj
§ P(wk | cj) ⟵ (nk + 1) / (n + |V|), where n is the total number of word tokens in Textj
71
Doc Words Class
TRAINING
1 Chinese Beijing Chinese China
2 Chinese Chinese Shanghai China
3 Chinese Macao China
4 Tokyo Japan Chinese Japan
TEST 5 Chinese Chinese Chinese Tokyo Japan ?
Worked example: Priors
72
$\hat{P}(c) = \frac{N_c}{N}$
Priors: P(China) = 3/4, P(Japan) = 1/4
$P(\text{"Chinese"} \mid \text{China}) = \frac{count(\text{"Chinese"}, \text{China}) + 1}{count(\text{China}) + |V|}$
Worked example: Conditional probabilities
73
$\hat{P}(w \mid c) = \frac{count(w, c) + 1}{count(c) + |V|}$
P("Chinese" | China) = (5 + 1) / (8 + 6) = 6/14 = 3/7
Worked example: Conditional probabilities
74
P("Japan" | China) =
0 + 1
8 + 6
=
1
14
P("Tokyo" | China) =
0 + 1
8 + 6
=
1
14
P("Chinese" | China) =
5 + 1
8 + 6
=
3
7
P("Japan" | Japan) =
1 + 1
3 + 6
=
2
9
P("Tokyo" | Japan) =
1 + 1
3 + 6
=
2
9
P("Chinese" | Japan) =
1 + 1
3 + 6
=
2
9
Worked example: Choosing a class
75
P("Japan" | China) =
0 + 1
8 + 6
=
1
14
P("Tokyo" | China) =
0 + 1
8 + 6
=
1
14
P("Chinese" | China) =
5 + 1
8 + 6
=
3
7
P("Japan" | Japan) =
1 + 1
3 + 6
=
2
9
P("Tokyo" | Japan) =
1 + 1
3 + 6
=
2
9
P("Chinese" | Japan) =
1 + 1
3 + 6
=
2
9
P(Japan) =
1
4
P(China) =
3
4
P(China | doc5) /
3
4
⇥
3
7
3
⇥
1
14
⇥
1
14
⇡ 0.0003
P(Japan | doc5) /
1
4
⇥
2
9
3
⇥
2
9
⇥
2
9
⇡ 0.0001
cNB = argmax
c 2C
P(c)
Y
x 2X
P(x|c)
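A minimal sketch of multinomial Naïve Bayes with add-one smoothing that reproduces the worked example above, with the training and test documents taken from the table:

from collections import Counter, defaultdict
import math

training = [("Chinese Beijing Chinese", "China"),
            ("Chinese Chinese Shanghai", "China"),
            ("Chinese Macao", "China"),
            ("Tokyo Japan Chinese", "Japan")]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

# Collect vocabulary, per-class document counts, and per-class word counts
vocab = set()
docs_per_class = defaultdict(int)
word_counts = defaultdict(Counter)
for text, c in training:
    tokens = text.split()
    vocab.update(tokens)
    docs_per_class[c] += 1
    word_counts[c].update(tokens)

n_docs = len(training)
for c in docs_per_class:
    prior = docs_per_class[c] / n_docs
    total = sum(word_counts[c].values())
    # Work in log space to avoid underflow; add-one smoothing handles unseen words
    log_score = math.log(prior)
    for w in test_doc.split():
        log_score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    print(c, round(math.exp(log_score), 4))   # China 0.0003, Japan 0.0001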
Summary
• Naive Bayes is not so naïve
– Very fast
– Low storage requirements
– Robust to irrelevant features
§ Irrelevant features cancel each other out without affecting results
– Very good in domains with many equally important features
§ Decision Trees suffer from fragmentation in such cases – especially if little data
– A good, dependable baseline for text classification
76
Naïve Bayes in spam filtering
• SpamAssassin features (http://spamassassin.apache.org/tests_3_3_x.html)
– Properties
§ From: starts with many numbers
§ Subject is all capitals
§ HTML has a low ratio of text to image area
§ Claims you can be removed from the list
– Phrases
§ “viagra”
§ “impress ... girl”
§ “One hundred percent guaranteed”
§ “Prestigious Non-Accredited Universities”
77
78
Part 3c
Text classification
Evaluation
Evaluation
• Text classification can be seen as an application of machine learning
– Evaluation is similar to best-practice in machine learning
• Experimental setup
– Splitting in training, development, and test set
– Cross-validation for parameter optimization
• Evaluation
– Precision
§ For a particular class c, how many times were we correct in predicting c?
– Recall
§ For a particular class c, how many of the actual instances of c did we manage to
find?
– Other metrics: F-score, AUC
79
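A minimal sketch of per-class precision and recall; the gold labels and predictions below are made-up toy data:

def precision_recall(y_true, y_pred, target):
    """Precision and recall for one class `target`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == target and t == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == target and t != target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != target and t == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["spam", "ham", "spam", "spam", "ham", "ham"]
pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
print(precision_recall(gold, pred, "spam"))   # precision and recall are both 2/3 here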
Precision vs. recall
• High precision
– When all returned answers must be correct
– Good when missing results are not problematic
– More common from hand-built systems
• High recall
– You get all the right answers, but garbage too
– Good when incorrect results are not problematic
– More common from automatic systems
• Trade-off
– In general, one can trade one for the other
– But it is harder to score well on both
80
[Figure: precision-recall trade-off curve]
Non-binary classification
• What if we have more than two classes?
– Solution: Train sets of binary classifiers
• Scenario 1: Any-of classification (aka multivalue classification)
– A document can belong to 0, 1, or >1 classes
– For each class c∈C
§ Build a classifier γc to distinguish c from all other classes c’∈C
– Given test document d
§ Evaluate it for membership in each class using each γc
§ d belongs to any class for which γc returns true
81
Non-binary classification
• Scenario 2: One-of classification (aka multinomial classification)
– A document can belong to exactly 1 class
– For each class c∈C
§ Build a classifier γc to distinguish c from all other classes c’∈C
– Given test document d
§ Evaluate it for membership in each class using each γc
§ d belongs to the one class with maximum score
82
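A schematic sketch of the two scenarios; the per-class scorers below are toy functions standing in for whatever binary classifiers γc were actually trained:

def any_of(classifiers, doc, threshold=0.5):
    """Any-of: a document may belong to 0, 1, or more classes."""
    return [c for c, clf in classifiers.items() if clf(doc) >= threshold]

def one_of(classifiers, doc):
    """One-of: pick the single class with the maximum score."""
    return max(classifiers, key=lambda c: classifiers[c](doc))

# Toy per-class scorers standing in for trained binary classifiers
classifiers = {
    "sports":   lambda d: 0.9 if "match" in d else 0.1,
    "politics": lambda d: 0.8 if "election" in d else 0.2,
    "business": lambda d: 0.7 if "market" in d else 0.3,
}
doc = "the election result moved the market"
print(any_of(classifiers, doc))   # ['politics', 'business']
print(one_of(classifiers, doc))   # 'politics'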
83
Part 3d
Text classification
Practical issues
Text classification
• Usually, simple machine learning algorithms are used
– Examples: Naïve Bayes, decision trees
– Very robust, very re-usable, very fast
• Recently, slightly better performance from better algorithms
– Examples: SVMs, Winnow, boosting, k-NN
• Accuracy is more dependent on
– Naturalness of classes
– Quality of features extracted
– Amount of training data available
• Accuracy typically ranges from 65% to 97% depending on the situation
– Note particularly performance on rare classes
84
The real world
• Gee, I’m building a text classifier for real, now! What should I do?
• No training data?
– Manually written rules
§ Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN
CATEGORIZE AS 'grain'
– Need careful crafting
§ Human tuning on development data
§ Time-consuming: 2 days per class
85
The real world
• Very little data?
– Use Naïve Bayes
§ Naïve Bayes is a “high-bias” algorithm (Ng & Jordan, 2002)
– Get more labeled data
§ Find clever ways to get humans to label data for you
– Try semi-supervised training methods
§ Bootstrapping, EM over unlabeled documents, …
• A reasonable amount of data?
– Perfect for all the clever classifiers
§ SVM, Regularized Logistic Regression, ...
– You can even use user-interpretable decision trees
§ Users like to hack & management likes quick fixes
86
Banko & Brill (2001)
The real world
• A huge amount of data?
– Can achieve high accuracy!
– Comes at a cost
§ SVMs (train time) or kNN (test time)
can be too slow
§ Regularized logistic regression can
be somewhat better
§ So Naïve Bayes can come back into
its own again!
– With enough data, classifier may
not matter…
87
Tweaking performance
• Domain-specific features and weights are very important in real
performance
• Sometimes need to collapse terms:
– Part numbers, chemical formulas, …
– But stemming generally does not help
• Upweighting (= counting a word as if it occurred twice)
– Title words (Cohen & Singer, 1996)
– First sentence of each paragraph (Murata, 1999)
– In sentences that contain title words (Ko et al., 2002)
88
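A minimal sketch of the upweighting idea: count a title word as if it occurred twice by simply repeating the title tokens before counting (the field names and weight here are illustrative):

from collections import Counter

def bag_of_words_with_title_boost(title, body, title_weight=2):
    """Count body words once and title words `title_weight` times."""
    tokens = body.lower().split() + title.lower().split() * title_weight
    return Counter(tokens)

features = bag_of_words_with_title_boost(
    title="Laser printer product page",
    body="This page describes our fastest laser printer yet.")
print(features["laser"])   # 3: once in the body, twice from the boosted title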
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
89
90
Questions?
References
• Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins.
• Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education.
• Liddy, E. D. (1998). Enhanced Text Retrieval Using Natural Language Processing. Bulletin of the American Society for Information Science, 24(4), 14-16.
• Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
91

Contenu connexe

Tendances

Tendances (20)

Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Language models
Language modelsLanguage models
Language models
 
NLP
NLPNLP
NLP
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Nlp
NlpNlp
Nlp
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.net
 
Introduction to natural language processing, history and origin
Introduction to natural language processing, history and originIntroduction to natural language processing, history and origin
Introduction to natural language processing, history and origin
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 

Similaire à Natural Language Processing

Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEO
Koray Tugberk GUBUR
 
lexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdflexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdf
Gagu6
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
tanishamahajan11
 

Similaire à Natural Language Processing (20)

Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITY
 
Corpus study design
Corpus study designCorpus study design
Corpus study design
 
Natural Language Processing Course in AI
Natural Language Processing Course in AINatural Language Processing Course in AI
Natural Language Processing Course in AI
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
IJNLC 2013 - Ambiguity-Aware Document Similarity
IJNLC  2013 - Ambiguity-Aware Document SimilarityIJNLC  2013 - Ambiguity-Aware Document Similarity
IJNLC 2013 - Ambiguity-Aware Document Similarity
 
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
 
L1 nlp intro
L1 nlp introL1 nlp intro
L1 nlp intro
 
NLP_KASHK: Introduction
NLP_KASHK: Introduction NLP_KASHK: Introduction
NLP_KASHK: Introduction
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEO
 
lexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdflexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdf
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 

Plus de Toine Bogers

Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while DrivingHands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Toine Bogers
 
A Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index SizeA Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index Size
Toine Bogers
 

Plus de Toine Bogers (16)

"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C..."If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
 
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while DrivingHands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
 
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
 
A Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in DenmarkA Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in Denmark
 
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
 
"I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq..."I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq...
 
Defining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven RecommendationDefining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven Recommendation
 
An In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book SearchAn In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book Search
 
Personalized search
Personalized searchPersonalized search
Personalized search
 
A Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index SizeA Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index Size
 
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
 
Measuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsMeasuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage Systems
 
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
 
Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?
 
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on TwitterMicro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
 
Benchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesBenchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program Committees
 

Dernier

Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 

Dernier (20)

FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 

Natural Language Processing

  • 1. 1 Natural Language Processing Toine Bogers Aalborg University Copenhagen Christina Lioma University of Copenhagen QUARTZ WINTER SCHOOL / FEBRUARY 12, 2018 / PADUA, ITALY
  • 2. Who? • Toine Bogers (toine@hum.aau.dk) – Associate professor @ Aalborg University Copenhagen – Interests § Recommender systems § Information retrieval (search engines) § Information behavior • Christina Lioma (c.lioma@di.ku.dk) – Full professor @ University of Copenhagen – Interests § Information retrieval (search engines) § Natural language processing, computational linguistics 2
  • 3. Outline • Introduction to NLP • Vector semantics • Text classification 3
  • 4. Useful references • Slides in this lecture are based on Jurafsky & Martin book – Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. https://web.stanford.edu/~jurafsky/slp3/ • Other good textbooks on NLP – Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications:Text Retrieval, Extractionand Categorization. Amsterdam: Benjamins. – Manning, C. D., & Schütze, H. (1999). Foundations of StatisticalNatural Language Processing. Cambridge, MA: MIT Press. 4
  • 6. What is Natural Language Processing? • Multidisciplinary branch of CS drawing largely from linguistics • Definition by Liddy (1998) – “Natural language processing is a range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of particular tasks or applications.” • Goal: applied mechanization of human language 6
  • 7. Levels of linguistic analysis • Phonological — Interpretation of speech sounds within and across words • Morphological — Componential analysis of words, including prefixes, suffixes and roots • Lexical — Word-level analysis including lexical meaning and part-of-speech analysis • Syntactic — Analysis of words in a sentence in order to uncover the grammatical structure of the sentence • Semantic — Determining the possible meanings of a sentence, including disambiguation of words in context • Discourse — Interpreting structure and meaning conveyed by texts larger than a sentence • Pragmatic — Understanding the purposeful use of language in situations, particularly those aspects of language which require world knowledge 7Liddy (1998)
  • 8. NLP is hard! • Multidisciplinary knowledgegap – Some computer scientists might not “get” linguistics – Some linguists might not “get” computer science – Goal: bridge this gap • String similarity aloneis not enough – Very similar strings can mean different things (1 & 2) – Very different strings can mean similar things (2 & 3) – Examples: 1. How fast is the TZ? 2. How fast will my TZ arrive? 3. Please tell me when I can expect the TZ I ordered 8
  • 9. NLP is hard! • Ambiguity – Identical strings can have different meanings – Example: “I made her duck” has at least five possible meanings § I cooked waterfowl for her § I cooked waterfowl belonging to her § I created the (plaster?) duck she owns § I caused her to quickly lower her head or body § I waved my magic wand and turned her into undifferentiated waterfowl 9
  • 10. Overcoming NLP difficulty • Natural languageambiguity is very common, but also largely local – Immediate context resolves ambiguity – Immediate context ➝ common sense – Example: “My connection is too slow today.” • Humans use common sense to resolve ambiguity, sometimes without being aware there was ambiguity 10
  • 11. Overcoming NLP difficulty • Machines do not have common sense • Initial suggestion – Hand-code common sense in machines – Impossibly hard to do for more than very limiteddomains • Present suggestion – Applications that work with very limited domains – Approximate common sense by relatively simple techniques 11
  • 12. Applications of NLP • Language identification • Spelling & grammar checking • Speech recognition & synthesis • Sentiment analysis • Automatic summarization • Machine translation • Information retrieval • Information extraction • … 12
  • 13. Applications of NLP • Some applications are widespread (e.g., spell check), while others are not ready for industry or are too expensive for popular use • NLP tools rarely hit a 100% success rate – Accuracy is assessed in statistical terms – Tools become mature and usable when they operate above a certain precision and below an acceptable cost • All NLP tools improve continuously (and often rapidly) 13
  • 14. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & part-of-speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 14
  • 15. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 15 today’s focus
  • 16. 16 Part 2a Vector semantics Introduction to distributional semantics
  • 17. Word similarity • Understanding word similarity is essential for NLP – Example § “fast” is similar to “rapid” § “tall” is similar to “height” – Question answering § Question: “How tall is Mt. Everest?” § Candidate answer: “The official height of Mount Everest is 29029 feet.” • Can we compute the similarity between words automatically? – Distributional semantics is an approach to doing this 17
  • 19. Distributional semantics • Distributional semantics is the study of semantic similarities between words using their distributional properties in text corpora – Distributional models of meaning = vector-space models of meaning = vector semantics • Intuition behind this – Linguistic items with similar distributions have similar meanings – Zellig Harris (1954): § “oculist and eye-doctor … occur in almost the same environments” § “If A and B have almost identical environments we say that they are synonyms.” – Firth (1957) § “You shall know a word by the company it keeps!” 19
  • 20. Distributional semantics • Example – From context words, humans can guess tesgüino means an alcoholic beverage like beer • Intuition for algorithm – Two words are similar if they have similar word contexts 20 A bottle of tesgüino is on the table. Everybody likes tesgüino. Tesgüino makes you drunk. We make tesgüino out of corn.
  • 21. Four kinds of models of vector semantics • Sparse vector representations – Mutual-information weighted word co-occurrence matrices • Dense vector representations – Singular value decomposition (and Latent Semantic Analysis) – Neural-network-inspired models (skip-grams, CBOW) – Brown clusters 21 today’s focus
  • 22. Shared intuition • Model the meaning of a word by “embedding” in a vector space – The meaning of a word is a vector of numbers – Vector models are also called embeddings • Contrast – Word meaning is represented in many computational linguistic applications by a vocabulary index (“word number 545”) 22
  • 23. Representing vector semantics • Words are related if they occur in the same context • How big is this context? – Entire document? Paragraph? Window of ±n words? – Smaller contexts (e.g., context windows) are best for capturing similarity – If a word w occurs a lot in the context windows of another word v (i.e., if they frequently co-occur), then they are probably related 23 [Excerpt from Jurafsky & Martin: the word-word matrix has dimensionality |V| × |V|; each cell records how often the row (target) word and the column (context) word co-occur, e.g., within a ±4-word window. Example ±7-word windows from the Brown corpus are shown for the words apricot, pineapple, digital, and information.]
  • 24. Representing vector semantics • We can now define a word w by a vector of counts of context words – Counts represent how often those words have co-occurred in the same context window with word w – Each vector is of length |V|, where V is the vocabulary – Vector semantics is captured in the word-word matrix, which is |V| × |V| 24
  • 25. Example: Contexts ±7 words (selection from the Brown corpus word-word co-occurrence matrix for four sample words) 25

                aardvark  …  computer  data  pinch  result  sugar  …
    apricot        0      …     0       0      1      0       1
    pineapple      0      …     0       0      1      0       1
    digital        0      …     2       1      0      1       0
    information    0      …     1       6      0      4       0
  • 26. Word-word matrix • We showed only 4 x 6, but the real matrix is 50,000 x 50,000 – Most values are 0 so it is very sparse – That’s OK, since there are lots of efficient algorithms for sparse matrices • The size of windows depends on your goals – The shorter the windows, the more syntactic the representation § ± 1-3 very syntactic – The longer the windows, the more semantic the representation § ± 4-10 more semantic 26
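  To make the word-word matrix concrete, here is a minimal sketch (plain Python; the toy corpus, the ±4 window size, and all variable names are my own illustration, echoing the tesgüino example above) that counts context words within a fixed window around each target word:

```python
from collections import Counter, defaultdict

def cooccurrence_matrix(sentences, window=4):
    """Count how often each context word occurs within ±window tokens of each target word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:                       # skip the target position itself
                    counts[target][tokens[j]] += 1
    return counts

corpus = ["a bottle of tesguino is on the table",
          "everybody likes tesguino",
          "tesguino makes you drunk",
          "we make tesguino out of corn"]
matrix = cooccurrence_matrix(corpus, window=4)
print(matrix["tesguino"].most_common(5))   # context-word counts for one target word
```

  The nested dictionary of Counters plays the role of one (sparse) row per target word in the |V| × |V| matrix.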
  • 27. Two types of co-occurrence • First-order co-occurrence (syntagmatic association) – The two words are typically nearby each other – wrote is a first-order associate of book or poem • Second-order co-occurrence (paradigmatic association) – The two words have similar neighbors – wrote is a second-order associate of words like said or remarked 27
  • 28. 28 Part 2b Vector semantics Positive Pointwise Mutual Information (PPMI)
  • 29. Problem with raw co-occurrence counts • Raw word frequency is not a great measure of association between words – It’s very skewed – “the” and “of” are very frequent, but maybe not the most discriminative • We would prefer a measure that asks whether a context word is particularly informative about the target word – Positive Pointwise Mutual Information (PPMI) 29
  • 30. Pointwise Mutual Information • Do events x and y co-occur more than if they were independent? PMI(X, Y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} • PMI between two words (Church & Hanks, 1989) – Do words x and y co-occur more than if they were independent? PMI(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)} 30
  • 31. Positive Pointwise Mutual Information • PMI ranges from –∞ to +∞, but negative values are problematic – Things are co-occurring less than we expect by chance – Unreliable without enormous corpora § Imagine w1 and w2 whose probability is each 10⁻⁶ § Hard to be sure P(w1, w2) is significantly different than 10⁻¹² – Plus it is not clear people are good at “unrelatedness” • So we just replace negative PMI values by 0 – Positive PMI (PPMI) between word1 and word2: PPMI(word_1, word_2) = \max\left(\log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\; 0\right) 31
  • 32. Computing PPMI on a term-context matrix • Matrix F with W rows (words) and C columns (contexts) • f_ij is the number of times w_i occurs in context c_j 32

    p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
    p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
    p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
    pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}
    ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}
  • 33. Probabilities computed from the count matrix (N = 19 co-occurrence events): 33

    p_{ij} = \frac{f_{ij}}{N}, \quad p(w_i) = \frac{\sum_{j} f_{ij}}{N}, \quad p(c_j) = \frac{\sum_{i} f_{ij}}{N}

    p(w = information, c = data) = 6/19 = .32
    p(w = information) = 11/19 = .58
    p(c = data) = 7/19 = .37

    p(w, context)                                       p(w)
                 computer  data  pinch  result  sugar
    apricot        0.00    0.00  0.05   0.00    0.05    0.11
    pineapple      0.00    0.00  0.05   0.00    0.05    0.11
    digital        0.11    0.05  0.00   0.05    0.00    0.21
    information    0.05    0.32  0.00   0.21    0.00    0.58
    p(context)     0.16    0.37  0.11   0.26    0.11
  • 34. PPMI values computed from the probability table above, using pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}: 34

    PPMI(w, context)
                 computer  data  pinch  result  sugar
    apricot         -       -    2.25    -      2.25
    pineapple       -       -    2.25    -      2.25
    digital        1.66    0.00   -     0.00     -
    information    0.00    0.57   -     0.47     -

    Example: PMI(information, data) = \log_2 \frac{.32}{.37 \times .58} = .57
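  To make the PPMI arithmetic above reproducible, here is a minimal sketch (assuming numpy; the count matrix is the toy example from the earlier slide, restricted to the five context words shown) that recovers PPMI(information, data) ≈ 0.57:

```python
import numpy as np

words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],      # raw co-occurrence counts f_ij
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

N = F.sum()                                   # 19 co-occurrence events in total
p_ij = F / N                                  # joint probabilities p(w, c)
p_w  = p_ij.sum(axis=1, keepdims=True)        # marginals p(w)
p_c  = p_ij.sum(axis=0, keepdims=True)        # marginals p(c)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_ij / (p_w * p_c))         # log2(0) -> -inf for unseen pairs
ppmi = np.maximum(pmi, 0)                     # clip negative (and -inf) PMI to 0

i, j = words.index("information"), contexts.index("data")
print(round(float(ppmi[i, j]), 2))            # -> 0.57, matching the worked example
```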
  • 35. One more problem… • We are unlikely to encounter rare words unless we have large corpora – PMI values cannot be calculated if the co-occurrence count is 0 • Solution: give rare words slightly higher probabilities – Steal probability mass to generalize better – Laplace smoothing (aka add-one smoothing) § Pretend we saw each word one more time than we did 35 [Figure: overlapping groups of context words such as allegations, reports, claims, attack, request, man, outcome]
  • 36. 36 Part 2c Vector semantics Measuring word similarity: the cosine
  • 37. Measuring similarity • We need a way to measure the similarity between two target words v and w – Most vector similarity measures are based on the dot product or inner product from linear algebra: dot-product(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N – High when two vectors have large values in the same dimensions – Low (in fact 0) for orthogonal vectors with zeros in complementary distribution 37
  • 38. Measuring similarity • Problem: the dot product is not normalized for vector length – Vectors are longer if they have higher values in each dimension – That means more frequent words will have higher dot products – Our similarity metric should not be sensitive to word frequency • Solution: divide by the lengths of the two vectors – This equals the cosine of the angle between the two vectors, since \vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}| \cos\theta, so \cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|} – This is the cosine similarity 38
  • 39. Calculating the cosine similarity 39

    \cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\, \sqrt{\sum_{i=1}^{N} w_i^2}}   (dot product of unit vectors)

  • v_i is the PPMI value for word v in context i • w_i is the PPMI value for word w in context i • cos(v, w) is the cosine similarity between v and w – Raw frequency or PPMI are non-negative, so the cosine range is [0, 1]
  • 40. Example: Which pair of words is more similar? 40

                 large  data  computer
    apricot        2      0      0
    digital        0      1      2
    information    1      6      1

    cosine(apricot, digital) = \frac{2 \cdot 0 + 0 \cdot 1 + 0 \cdot 2}{\sqrt{4 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0
    cosine(digital, information) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = \frac{8}{\sqrt{5}\sqrt{38}} = .58
    cosine(apricot, information) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = \frac{2}{2\sqrt{38}} = .16
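  A matching sketch for the cosine numbers above (assuming numpy; the vectors are copied from the small table on this slide):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

apricot     = np.array([2, 0, 0])   # contexts: large, data, computer
digital     = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, digital), 2))       # 0.0
print(round(cosine(digital, information), 2))   # 0.58
print(round(cosine(apricot, information), 2))   # 0.16
```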
  • 41. Visualizing vectors and angles 41 [Figure: the vectors for digital, apricot, and information plotted in two dimensions, dimension 1 = ‘large’, dimension 2 = ‘data’]

                 large  data
    apricot        2     0
    digital        0     1
    information    1     6
  • 45. Evaluating word similarity • Intrinsic evaluation – Correlation between algorithm and human word similarity ratings § Wordsim353 ➞ 353 noun pairs rated on a 0-10 scale – Taking TOEFL multiple-choice vocabulary tests • Extrinsic (= task-based, end-to-end) evaluation – Question answering – Spell-checking – Essay grading 45 sim(plane, car) = 5.77 Levied is closest in meaning to: imposed / believed / requested / correlated
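  Intrinsic evaluation of this kind usually reports a rank correlation between model scores and human ratings. A minimal sketch (assuming scipy is installed; the five word pairs, human ratings, and model scores below are invented purely for illustration):

```python
from scipy.stats import spearmanr

# Hypothetical human similarity ratings (0-10, WordSim-353 style) and model cosine scores
human_ratings = [7.5, 8.0, 1.2, 5.8, 3.1]
model_scores  = [0.62, 0.58, 0.05, 0.44, 0.33]

rho, p_value = spearmanr(human_ratings, model_scores)
print(round(rho, 2))   # Spearman rank correlation, 0.9 for these made-up numbers
```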
  • 47. Sparse vs. dense vectors • PPMI vectors are – Long (length |V|= 20,000 to 50,000) – Sparse (most elements are zero) • Alternative: learn vectors which are – Short (length 200-1000) – Dense (most elements are non-zero) 47
  • 48. Sparse vs. dense vectors • Why dense vectors? – Short vectors may be easier to use as features in machine learning (fewer weights to tune) – De-noising ➞ low-order dimensions may represent unimportant information – Dense vectors may generalize better than storing explicit counts – Dense models may do better at capturing higher-order co-occurrence § car and automobile are synonyms, but are represented as distinct dimensions in a sparse count matrix § The sparse representation thus fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor (= paradigmatic association) 48
  • 49. Methods for creating dense vector embeddings • Singular Value Decomposition (SVD) – A special case of this is called Latent Semantic Analysis (LSA) • “Neural Language Model”-inspired predictive models (skip-grams and CBOW) • Brown clustering 49
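  As a rough illustration of the SVD/LSA route (a sketch only, assuming numpy; the matrix is the toy PPMI table from the earlier slide, with the “-” cells treated as 0):

```python
import numpy as np

ppmi = np.array([[0.00, 0.00, 2.25, 0.00, 2.25],   # apricot
                 [0.00, 0.00, 2.25, 0.00, 2.25],   # pineapple
                 [1.66, 0.00, 0.00, 0.00, 0.00],   # digital
                 [0.00, 0.57, 0.00, 0.47, 0.00]])  # information

# Truncated SVD: keep only the top-k singular values to obtain short, dense vectors
U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
k = 2
dense_word_vectors = U[:, :k] * S[:k]   # one k-dimensional embedding per word
print(dense_word_vectors.shape)         # (4, 2)
```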
  • 50. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 50
  • 53. What is text classification? • Goal: Take a document and assign it a label representing its content • Classic example: decide if a newspaper article is about politics, sports, or business • Many uses for the same technology – Is this e-mail spam or not? – Is this page a laser printer product page? – Does this company accept overseas orders? – Is this tweet positive or negative? – Does this part of the CV describe part of a person’s work experience? – Is this text written in Danish or Norwegian? – Is this the “computer” or “harbor” sense of port? 53
  • 54. Definition • Input: – A document d – A fixed set of classes C = {c1, c2,…, cj} • Output – A predicted class c ∈ C 54
  • 55. Top-down vs. bottom-up text classification • Top-down approach – We tell the computer exactly how it should solve a task (e.g., expert systems) – Example: black-list-address OR ("dollars" AND "have been selected") – Potential for high accuracy, but requires expensive & extensive refinement by experts • Bottom-up approach – The computer finds out for itself with a ‘little’ help from us (e.g., machine learning) – Example: The word “viagra” is very often found in spam e-mails 55
  • 56. Definition (updated) • Input: – A document d – A fixed set of classes C = {c1, c2,…, cj} – A training set of m hand-labeled documents (d1, c1), …, (dm, cm) • Output – A learned classifier γ: d ➞ c 56
  • 57. Bottom-up text classification • Any kind of classifier – Naïve Bayes – Logistic regression – Support Vector Machines – k-Nearest Neighbor – Decision tree learning – ... 57 today’s focus
  • 59. Naïve Bayes • Simple (“naïve”) classification method based on Bayes rule – Relies on very simple representation of document: bag of words 59
  • 60. Bag-of-words representation 60

    “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!”

    it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …
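  A minimal sketch of the bag-of-words step (plain Python; the crude letters-only tokenizer is an assumption, chosen so that clitics such as “It's” and “I've” split and the counts match the slide):

```python
import re
from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue is great "
          "and the adventure scenes are fun... It manages to be whimsical and romantic while "
          "laughing at the conventions of the fairy tale genre. I would recommend it to just "
          "about anyone. I've seen it several times, and I'm always happy to see it again "
          "whenever I have a friend who hasn't seen it yet!")

tokens = re.findall(r"[a-z]+", review.lower())   # lowercase, letters only
bag = Counter(tokens)
print(bag.most_common(6))   # top counts: it 6, i 5, the 4, and 3, to 3, seen 2
```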
  • 61. Bag-of-words representation 61 γ(⟨seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, …⟩) = c
  • 62. Naïve Bayes • Simple (“naïve”) classification method based on Bayes rule – Relies on very simple representation of document: bag of words • Bayes’ rule applied to documents and classes – For a document d and a class c: P(c|d) = \frac{P(d|c)\,P(c)}{P(d)} 62
  • 63. Naïve Bayes (cont’d) 63

    c_{MAP} = \arg\max_{c \in C} P(c|d)                           (MAP = “maximum a posteriori” = most likely class)
            = \arg\max_{c \in C} \frac{P(d|c)\,P(c)}{P(d)}        (Bayes’ rule)
            = \arg\max_{c \in C} P(d|c)\,P(c)                     (dropping the denominator)
            = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n|c)\,P(c)  (document d represented as features x_1, …, x_n)
  • 64. Naïve Bayes (cont’d) c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n|c)\,P(c) • The prior P(c): how often does this class occur? – We can just count the relative frequencies in a corpus • The likelihood P(x_1, …, x_n|c) has O(|X|^n \cdot |C|) parameters – Could only be estimated if a very, very large number of training examples was available! 64
  • 65. Multinomial Naïve Bayes 65 • Bag-of-words assumption – Assume position does not matter • Conditional independence assumption – Assume the feature probabilities P(x_i|c_j) are independent given the class c: P(x_1, x_2, \dots, x_n|c) = P(x_1|c) \cdot P(x_2|c) \cdot P(x_3|c) \cdots P(x_n|c)
  • 66. Multinomial Naïve Bayes (cont’d) c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n|c)\,P(c) simplifies to c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x|c) 66
  • 67. Multinomial Naïve Bayes (cont’d) c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i|c_j), where positions = all word positions in the test document 67
  • 68. Learning the Multinomial Naïve Bayes Model • First attempt: maximum likelihood estimates – Simply use the frequencies in the data 68

    \hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}   (fraction of documents with topic c_j)
    \hat{P}(w_i|c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}   (fraction of times word w_i appears among all words in documents of topic c_j)
  • 69. Problem with Maximum Likelihood • What if we have seen no training documents with the word fantastic and classified in the topic positive (= thumbs-up)? \hat{P}(\text{“fantastic”}|\text{positive}) = \frac{count(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0 • Zero probabilities cannot be conditioned away, no matter the other evidence! c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i|c) 69
  • 70. Solution: Laplace smoothing • Also known as add-one (or, more generally, add-α) smoothing – Pretend we saw each word one (or α) more time(s) than we did – For α = 1: 70

    \hat{P}(w_i|c_j) = \frac{count(w_i, c_j) + 1}{\sum_{w \in V} \left(count(w, c_j) + 1\right)} = \frac{count(w_i, c_j) + 1}{\left(\sum_{w \in V} count(w, c_j)\right) + |V|}
  • 71. Learning a multinomial Naïve Bayes model • From the training corpus, extract vocabulary V • Calculate the P(c_j) terms – For each c_j in C do § docs_j ⟵ all docs with class = c_j § P(c_j) ⟵ |docs_j| / |total # documents| • Calculate the P(w_k|c_j) terms – Text_j ⟵ single doc containing all docs_j (aggregate representation of all documents of class j) – For each word w_k in V § n_k ⟵ # of occurrences of w_k in Text_j § P(w_k|c_j) ⟵ (n_k + α) / (n + α|V|), where n is the total number of word tokens in Text_j 71
  • 72. Worked example: Priors 72

    Doc   Words                                  Class
    ----- TRAINING -------------------------------------
    1     Chinese Beijing Chinese                China
    2     Chinese Chinese Shanghai               China
    3     Chinese Macao                          China
    4     Tokyo Japan Chinese                    Japan
    ----- TEST -----------------------------------------
    5     Chinese Chinese Chinese Tokyo Japan    ?

    \hat{P}(c) = \frac{N_c}{N}: P(China) = 3/4, P(Japan) = 1/4   (priors)
  • 73. Worked example: Conditional probabilities (training/test documents as above) 73

    \hat{P}(w|c) = \frac{count(w, c) + 1}{count(c) + |V|}

    P(“Chinese”|China) = \frac{count(\text{“Chinese”}, China) + 1}{count(China) + |V|} = \frac{5 + 1}{8 + 6} = \frac{6}{14} = \frac{3}{7}
  • 74. Worked example: Conditional probabilities (cont’d) 74

    P(“Chinese”|China) = (5 + 1) / (8 + 6) = 3/7
    P(“Tokyo”|China)   = (0 + 1) / (8 + 6) = 1/14
    P(“Japan”|China)   = (0 + 1) / (8 + 6) = 1/14
    P(“Chinese”|Japan) = (1 + 1) / (3 + 6) = 2/9
    P(“Tokyo”|Japan)   = (1 + 1) / (3 + 6) = 2/9
    P(“Japan”|Japan)   = (1 + 1) / (3 + 6) = 2/9
  • 75. Worked example: Choosing a class with c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x|c) for test doc 5 = “Chinese Chinese Chinese Tokyo Japan” 75

    P(China|doc5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
    P(Japan|doc5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001

    ➝ predict class China
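  The worked example can be reproduced with a few lines of plain Python (a sketch under the slide’s assumptions: add-one smoothing and |V| = 6; all variable names are hypothetical):

```python
import math
from collections import Counter, defaultdict

train = [("Chinese Beijing Chinese",  "China"),
         ("Chinese Chinese Shanghai", "China"),
         ("Chinese Macao",            "China"),
         ("Tokyo Japan Chinese",      "Japan")]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for doc, _ in train for w in doc.split()}
docs_per_class = Counter(c for _, c in train)          # priors come from these counts
word_counts = defaultdict(Counter)
for doc, c in train:
    word_counts[c].update(doc.split())                 # per-class word counts

def log_posterior(doc, c):
    """log P(c) + sum_i log P(w_i | c), with add-one (Laplace) smoothing."""
    logp = math.log(docs_per_class[c] / len(train))
    total_words = sum(word_counts[c].values())
    for w in doc.split():
        logp += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
    return logp

for c in docs_per_class:
    print(c, round(math.exp(log_posterior(test_doc, c)), 4))
# China 0.0003, Japan 0.0001  ->  predict China, as on the slide
```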
  • 76. Summary • Naive Bayes is not so naïve – Very fast – Low storage requirements – Robust to irrelevant features § Irrelevant features cancel each other out without affecting results – Very good in domains with many equally important features § Decision Trees suffer from fragmentation in such cases – especially if little data – A good, dependable baseline for text classification 76
  • 77. Naïve Bayes in spam filtering • SpamAssassin features (http://spamassassin.apache.org/tests_3_3_x.html) – Properties § From: starts with many numbers § Subject is all capitals § HTML has a low ratio of text to image area § Claims you can be removed from the list – Phrases § “viagra” § “impress ... girl” § “One hundred percent guaranteed” § “Prestigious Non-Accredited Universities” 77
  • 79. Evaluation • Text classification can be seen as an application of machine learning – Evaluation is similar to best practice in machine learning • Experimental setup – Splitting into training, development, and test sets – Cross-validation for parameter optimization • Evaluation – Precision § For a particular class c, how many times were we correct in predicting c? – Recall § For a particular class c, how many of the actual instances of c did we manage to find? – Other metrics: F-score, AUC 79
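  Per-class precision and recall can be computed directly from the gold and predicted labels; a minimal sketch (plain Python, with made-up labels for illustration):

```python
def precision_recall(gold, predicted, cls):
    """Precision and recall for one class, from parallel lists of labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, predicted) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, predicted) if g == cls and p != cls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

gold      = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
print(precision_recall(gold, predicted, "spam"))   # (0.666..., 0.666...)
```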
  • 80. Precision vs. recall • High precision – When all returned answers must be correct – Good when missing results are not problematic – More common from hand-built systems • High recall – You get all the right answers, but garbage too – Good when incorrect results are not problematic – More common from automatic systems • Trade-off – In general, one can trade one for the other – But it is harder to score well on both 80 [Figure: precision vs. recall trade-off]
  • 81. Non-binary classification • What if we have more than two classes? – Solution: Train sets of binary classifiers • Scenario 1: Any-of classification (aka multivalue classification) – A document can belong to 0, 1, or >1 classes – For each class c∈C § Build a classifier γc to distinguish c from all other classes c’∈C – Given test document d § Evaluate it for membership in each class using each γc § d belongs to any class for which γc returns true 81
  • 82. Non-binary classification • Scenario 2: One-of classification (aka multinomial classification) – A document can belong to exactly 1 class – For each class c∈C § Build a classifier γc to distinguish c from all other classes c’∈C – Given test document d § Evaluate it for membership in each class using each γc § d belongs to the one class with maximum score 82
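  Both set-ups reduce to combining per-class binary classifiers; a minimal sketch (plain Python; the two toy scoring functions are invented stand-ins for trained binary classifiers γ_c):

```python
def any_of(doc, classifiers, threshold=0.0):
    """Any-of: return every class whose binary classifier fires on the document."""
    return [c for c, score in classifiers.items() if score(doc) > threshold]

def one_of(doc, classifiers):
    """One-of: return the single class with the maximum classifier score."""
    return max(classifiers, key=lambda c: classifiers[c](doc))

classifiers = {
    "sports":   lambda d: 1.2 if "match" in d else -0.5,
    "politics": lambda d: 0.8 if "election" in d else -0.7,
}
print(any_of("the match report", classifiers))   # ['sports']
print(one_of("the match report", classifiers))   # 'sports'
```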
  • 84. Text classification • Usually, simple machine learning algorithms are used – Examples: Naïve Bayes, decision trees – Very robust, very re-usable, very fast • Recently, slightly better performance from better algorithms – Examples: SVMs, Winnow, boosting, k-NN • Accuracy is more dependent on – Naturalness of classes – Quality of features extracted – Amount of training data available • Accuracy typically ranges from 65% to 97% depending on the situation – Note particularly performance on rare classes 84
  • 85. The real world • Gee, I’m building a text classifier for real, now! What should I do? • No training data? – Manually written rules § Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN CATEGORIZE AS 'grain' – Need careful crafting § Human tuning on development data § Time-consuming: 2 days per class 85
  • 86. The real world • Very little data? – Use Naïve Bayes § Naïve Bayes is a “high-bias” algorithm (Ng & Jordan, 2002) – Get more labeled data § Find clever ways to get humans to label data for you – Try semi-supervised training methods § Bootstrapping, EM over unlabeled documents, … • A reasonable amount of data? – Perfect for all the clever classifiers § SVM, Regularized Logistic Regression, ... – You can even use user-interpretable decision trees § Users like to hack & management likes quick fixes 86
  • 87. Banko & Brill (2001) The real world • A huge amount of data? – Can achieve high accuracy! – Comes at a cost § SVMs (train time) or kNN (test time) can be too slow § Regularized logistic regression can be somewhat better § So Naïve Bayes can come back into its own again! – With enough data, classifier may not matter… 87
  • 88. Tweaking performance • Domain-specific features and weights are very important in real performance • Sometimes need to collapse terms: – Part numbers, chemical formulas, … – But stemming generally does not help • Upweighting (= counting a word as if it occurred twice) – Title words (Cohen & Singer, 1996) – First sentence of each paragraph (Murata, 1999) – In sentences that contain title words (Ko et al., 2002) 88
  • 89. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 89
  • 91. References • Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins. • Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. • Liddy, E. D. (1998). Enhanced Text Retrieval Using Natural Language Processing. Bulletin of the American Society for Information Science and Technology, 24(4), 14-16. • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. 91