1
Natural Language Processing
Toine Bogers
Aalborg University Copenhagen
Christina Lioma
University of Copenhagen
QUARTZ WINTER SCHOOL / FEBRUARY 12, 2018 / PADUA, ITALY
Who?
• Toine Bogers (toine@hum.aau.dk)
– Associate professor @ Aalborg University Copenhagen
– Interests
§ Recommender systems
§ Information retrieval (search engines)
§ Information behavior
• Christina Lioma (c.lioma@di.ku.dk)
– Full professor @ University of Copenhagen
– Interests
§ Information retrieval (search engines)
§ Natural language processing, computational linguistics
2
Outline
• Introduction to NLP
• Vector semantics
• Text classification
3
Useful references
• Slides in this lecture are based on the Jurafsky & Martin book
– Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. https://web.stanford.edu/~jurafsky/slp3/
• Other good textbooks on NLP
– Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins.
– Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
4
5
Part 1
Introduction to NLP
What is Natural Language Processing?
• Multidisciplinary branch of CS drawing largely from linguistics
• Definition by Liddy (1998)
– “Natural language processing is a range of computational techniques for
analyzing and representing naturally occurring texts at one or more levels
of linguistic analysis for the purpose of achieving human-like language
processing for a range of particular tasks or applications.”
• Goal: applied mechanization of human language
6
Levels of linguistic analysis
• Phonological — Interpretation of speech sounds within and across words
• Morphological — Componential analysis of words, including prefixes, suffixes
and roots
• Lexical — Word-level analysis including lexical meaning and part-of-speech
analysis
• Syntactic — Analysis of words in a sentence in order to uncover the grammatical
structure of the sentence
• Semantic — Determining the possible meanings of a sentence, including
disambiguation of words in context
• Discourse — Interpreting structure and meaning conveyed by texts larger than a
sentence
• Pragmatic — Understanding the purposeful use of language in situations,
particularly those aspects of language which require world knowledge
7Liddy (1998)
NLP is hard!
• Multidisciplinary knowledge gap
– Some computer scientists might not “get” linguistics
– Some linguists might not “get” computer science
– Goal: bridge this gap
• String similarity alone is not enough
– Very similar strings can mean different things (1 & 2)
– Very different strings can mean similar things (2 & 3)
– Examples:
1. How fast is the TZ?
2. How fast will my TZ arrive?
3. Please tell me when I can expect the TZ I ordered
8
NLP is hard!
• Ambiguity
– Identical strings can have different meanings
– Example: “I made her duck” has at least five possible meanings
§ I cooked waterfowl for her
§ I cooked waterfowl belonging to her
§ I created the (plaster?) duck she owns
§ I caused her to quickly lower her head or body
§ I waved my magic wand and turned her into undifferentiated waterfowl
9
Overcoming NLP difficulty
• Natural language ambiguity is very common, but also largely local
– Immediate context resolves ambiguity
– Immediate context ➝ common sense
– Example: “My connection is too slow today.”
• Humans use common sense to resolve ambiguity, sometimes
without being aware there was ambiguity
10
Overcoming NLP difficulty
• Machines do not have common sense
• Initial suggestion
– Hand-code common sense in machines
– Impossibly hard to do for more than very limited domains
• Present suggestion
– Applications that work with very limited domains
– Approximate common sense by relatively simple techniques
11
Applications of NLP
• Language identification
• Spelling & grammar checking
• Speech recognition & synthesis
• Sentiment analysis
• Automatic summarization
• Machine translation
• Information retrieval
• Information extraction
• …
12
Applications of NLP
• Some applications are widespread (e.g., spell check), while others
are not ready for industry or are too expensive for popular use
• NLP tools rarely hit a 100% success rate
– Accuracy is assessed in statistical terms
– Tools become mature and usable when they operate above a certain
precision and below an acceptable cost
• All NLP tools improve continuously (and often rapidly)
13
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & part-of-speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
14
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
15
today’s
focus
16
Part 2a
Vector semantics
Introduction to distributional semantics
Word similarity
• Understanding word similarity is essential for NLP
– Example
§ “fast” is similar to “rapid”
§ “tall” is similar to “height”
– Question answering
§ Question: “How tall is Mt. Everest?”
§ Candidate answer: “The official height of Mount Everest is 29029 feet.”
• Can we compute the similarity between words automatically?
– Distributional semantics is an approach to doing this
17
Application: Plagiarism detection
18
Distributional semantics
• Distributional semantics is the study of semantic similarities between
words using their distributional properties in text corpora
– Distributional models of meaning = vector-space models of meaning =
vector semantics
• Intuition behind this
– Linguistic items with similar distributions have similar meanings
– Zellig Harris (1954):
§ “oculist and eye-doctor … occur in almost the same environments”
§ “If A and B have almost identical environments we say that they are synonyms.”
– Firth (1957)
§ “You shall know a word by the company it keeps!”
19
Distributional semantics
• Example
– From context words, humans can guess tesgüino means an alcoholic
beverage like beer
• Intuition for algorithm
– Two words are similar if they have similar word contexts
20
A bottle of tesgüino is on the table.
Everybody likes tesgüino.
Tesgüino makes you drunk.
We make tesgüino out of corn.
Four kinds of models of vector semantics
• Sparse vector representations
– Mutual-information weighted word co-occurrence matrices
• Dense vector representations
– Singular value decomposition (and Latent Semantic Analysis)
– Neural-network-inspired models (skip-grams, CBOW)
– Brown clusters
21
today’s
focus
Shared intuition
• Model the meaning of a word by “embedding” in a vector space
– The meaning of a word is a vector of numbers
– Vector models are also called embeddings
• Contrast
– Word meaning is represented in many computational linguistic
applications by a vocabulary index (“word number 545”)
22
Representing vector semantics
• Words are related if they occur in the same context
• How big is this context?
– Entire document? Paragraph? Window of ±n words?
– Smaller contexts (e.g., context windows) are best for capturing similarity
– If a word w occurs a lot in the context windows of another word v (i.e., if
they frequently co-occur), then they are probably related
23
[Textbook excerpt, Jurafsky & Martin: the word-word matrix is of dimensionality |V| × |V|; each cell records how often the row (target) word and the column (context) word co-occur in some context in a training corpus. The context could be the whole document, but it is more common to use a window around the word, e.g. ±4 words, in which case the cell counts how often the column word occurs in such a window around the row word.]
Representing vector semantics
• We can now define a word w by a vector of counts of context
words
– Counts represent how often those words have co-occurred in the same
context window with word w
– Each vector is of length |V|, where V is the vocabulary
– Vector semantics is captured in a word-word matrix of size |V| x |V|
24
Example: Contexts ±7 words
25
aardvark computer data pinch result sugar …
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0
…
Example ±7-word windows from the Brown corpus (one per target word):
sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,
their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
well suited to programming on the digital computer. In finding the optimal R-stage policy from
for the purpose of gathering data and information necessary for the study authorized in the
Word-word matrix
• We showed only 4 x 6, but the real matrix is 50,000 x 50,000
– Most values are 0 so it is very sparse
– That’s OK, since there are lots of efficient algorithms for sparse matrices
• The size of windows depends on your goals
– The shorter the windows, the more syntactic the representation
§ ± 1-3 very syntactic
– The longer the windows, the more semantic the representation
§ ± 4-10 more semantic
26
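A minimal sketch of how such a word-word co-occurrence matrix could be built with a context window (the tokenization is deliberately naive, and the toy corpus is the tesgüino example from above):

from collections import defaultdict

def cooccurrence_counts(sentences, window=4):
    """Count how often each context word appears within ±window positions of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()          # naive tokenization
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [
    "a bottle of tesgüino is on the table",
    "everybody likes tesgüino",
    "tesgüino makes you drunk",
    "we make tesgüino out of corn",
]
counts = cooccurrence_counts(corpus, window=4)
print(dict(counts["tesgüino"]))   # context-word counts for one target word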
Two types of co-occurrence
• First-order co-occurrence (syntagmatic association)
– The two words typically occur near each other
– wrote is a first-order associate of book or poem
• Second-order co-occurrence (paradigmatic association)
– They have similarneighbors
– wrote is a second-order associate of words like said or remarked
27
28
Part 2b
Vector semantics
Positive Pointwise Mutual Information (PPMI)
Problem with raw co-occurrence counts
• Raw word frequency is not a great measure of association between
words
– It’s very skewed
– “the” and “of” are very frequent, but maybe not the most discriminative
• We would prefer a measure that asks whether a context word is
particularly informative about the target word
– Positive Pointwise Mutual Information (PPMI)
29
Pointwise Mutual Information
• Do events x and y co-occur more than if they were independent?
• PMI between two words (Church & Hanks, 1989)
– Do words x and y co-occur more than if they were independent?
30
$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$

$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}$
Positive Pointwise Mutual Information
• PMI ranges from –∞ to +∞, but negative values are problematic
– Things are co-occurring less than we expect by chance
– Unreliable without enormous corpora
§ Imagine w1 and w2, each with probability 10⁻⁶
§ Hard to be sure p(w1, w2) is significantly different from 10⁻¹²
– Plus it is not clear people are good at “unrelatedness”
• So we just replace negative PMI values by 0
– Positive PMI (PPMI) between word1 and word2:
31
$\mathrm{PPMI}(word_1, word_2) = \max\left(\log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\ 0\right)$
• Matrix F with W rows (words) and C columns (contexts)
• fij is # of times wi occurs in context cj
32
$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$

$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}} \qquad ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$
Computing PPMI on a term-context matrix
p(w = information, c = data) = 6/19 = .32
p(w = information) = 11/19 = .58
p(c = data) = 7/19 = .37
33
p(w, context) and p(w):
            computer  data  pinch  result  sugar  | p(w)
apricot       0.00    0.00  0.05   0.00    0.05   | 0.11
pineapple     0.00    0.00  0.05   0.00    0.05   | 0.11
digital       0.11    0.05  0.00   0.05    0.00   | 0.21
information   0.05    0.32  0.00   0.21    0.00   | 0.58
p(context)    0.16    0.37  0.11   0.26    0.11
$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$
34
$pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$
p(w, context) and p(w): same table as on the previous slide.
PPMI(w, context):
            computer  data  pinch  result  sugar
apricot       -        -    2.25    -      2.25
pineapple     -        -    2.25    -      2.25
digital      1.66     0.00   -     0.00     -
information  0.00     0.57   -     0.47     -
$\mathrm{PMI}(\text{information}, \text{data}) = \log_2\left(\frac{.32}{.37 \times .58}\right) = .57$
(A code sketch of this computation follows below.)
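A minimal sketch of how the PPMI values above could be computed with NumPy, using the count matrix from the ±7-word-window example:

import numpy as np

# Count matrix from the earlier example (rows = target words, columns = context words)
words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

N = F.sum()                                # total number of counts (19)
p_ij = F / N                               # joint probabilities p(w, c)
p_w = F.sum(axis=1, keepdims=True) / N     # row marginals p(w)
p_c = F.sum(axis=0, keepdims=True) / N     # column marginals p(c)

with np.errstate(divide="ignore"):         # zero counts give log2(0) = -inf
    pmi = np.log2(p_ij / (p_w * p_c))
ppmi = np.maximum(pmi, 0)                  # clip negative (and -inf) values to 0

print(round(ppmi[words.index("information"), contexts.index("data")], 2))   # 0.57
print(round(ppmi[words.index("digital"), contexts.index("computer")], 2))   # 1.66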
One more problem…
• We are unlikely to encounter rare
words unless we have large
corpora
– PMI values cannot be calculated if
co-occurrence count is 0
• Solution: give rare words slightly
higher probabilities
– Steal probability mass to
generalize better
– Laplace smoothing (aka add-one
smoothing)
§ Pretend we saw each word one
more time than we did
35
[Figure: redistributing probability mass from frequently seen context words (allegations, reports, claims, request) to rare or unseen ones (attack, man, outcome) so that every word receives a non-zero probability]
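One simple way to realize this on the co-occurrence counts is to add a constant k to every cell before computing PPMI. A minimal sketch on the toy count matrix from before (add-2 is just one possible choice of k):

import numpy as np

def add_k_smoothing(F, k=2):
    """Pretend every word-context pair was seen k more times than it actually was."""
    return F + k

F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

F_smoothed = add_k_smoothing(F, k=2)
print(F_smoothed.min())   # 2.0: no zero counts remain, so PMI is defined everywhere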
36
Part 2c
Vector semantics
Measuring word similarity: the cosine
Measuring similarity
• We need a way to measure the similarity between two target words v
and w
– Most vector similarity measures are based on the dot product or inner
product from linear algebra
– High when two vectors have large values in same dimensions
– Low (in fact 0) for orthogonal vectors with zeros in complementary
distribution
37
$\text{dot-product}(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \ldots + v_N w_N$

[Textbook excerpt, Jurafsky & Martin: the cosine, like most vector similarity measures used in NLP, is based on the dot product (inner product); it tends to be high when two vectors have large values in the same dimensions, and it is 0 for orthogonal vectors.]
Measuring similarity
• Problem: dot product is not normalized for vector length
– Vectors are longer if they have higher values in each dimension
– That means more frequent words will have higher dot products
– Our similarity metric should not be sensitive to word frequency
• Solution: divide it by the length of the two vectors
– Is equal to the cosine of the angle between the two vectors!
– This is the cosine similarity
38
[Textbook excerpt, Jurafsky & Martin: the dot product is higher for longer vectors, and more frequent words have longer vectors since they co-occur with more words; we would like a similarity metric that is independent of word frequency. The simplest fix is to divide the dot product by the lengths of the two vectors, which equals the cosine of the angle between them:]

$\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}|\cos\theta \qquad \Rightarrow \qquad \cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$
Calculating the cosine similarity
• vi is the PPMI value for word v in context i
• wi is the PPMI value for word w in context i
• cos(v, w) is the cosine similarity between v and w
– Raw frequency or PPMI are non-negative, so cosine range is [0, 1]
39
$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$
(numerator: dot product; middle form: dot product of unit vectors)
large data computer
apricot 2 0 0
digital 0 1 2
information 1 6 1
40
Which pair of words is more similar?
$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$
Example
$\cos(\text{apricot}, \text{digital}) = \frac{2 \cdot 0 + 0 \cdot 1 + 0 \cdot 2}{\sqrt{4+0+0}\,\sqrt{0+1+4}} = 0$
$\cos(\text{digital}, \text{information}) = \frac{0 + 6 + 2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} \approx .58$
$\cos(\text{apricot}, \text{information}) = \frac{2 + 0 + 0}{\sqrt{4+0+0}\,\sqrt{1+36+1}} = \frac{2}{2\sqrt{38}} \approx .16$
(A code sketch reproducing these values follows below.)
[Figure: apricot, digital, and information plotted as vectors in the plane spanned by dimension 1 ('large') and dimension 2 ('data'); the angle between two vectors reflects their similarity]
41
large data
apricot 2 0
digital 0 1
information 1 6
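A minimal sketch of the cosine computation, using the three count vectors from the table above (dimensions: large, data, computer):

import math

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

apricot     = [2, 0, 0]
digital     = [0, 1, 2]
information = [1, 6, 1]

print(round(cosine(apricot, digital), 2))      # 0.0
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, information), 2))  # 0.16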
Visualizing vectors and angles
[Figure: hierarchical clustering of word vectors from Rohde et al. (2006); body parts (WRIST, ANKLE, HAND, ...), animals (DOG, CAT, LION, ...), and cities/countries (CHICAGO, TOKYO, CHINA, ...) fall into separate clusters]
42 Rohde et al. (2006)
Clustering vectors to visualize similarity in co-occurrence matrices
Other possible similarity measures
43
44
Part 2d
Vector semantics
Evaluation
Evaluating word similarity
• Intrinsic evaluation
– Correlation between algorithm and human word similarity ratings
§ Wordsim353 ➞ 353 noun pairs rated on a 0-10 scale
– Taking TOEFL multiple-choice vocabulary tests
• Extrinsic (= task-based, end-to-end) evaluation
– Question answering
– Spell-checking
– Essay grading
45
sim(plane, car) = 5.77
Levied is closest in meaning to:
imposed / believed / requested / correlated
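A minimal sketch of intrinsic evaluation: compute the rank correlation (Spearman's ρ) between human similarity ratings and model similarities. The ratings and cosines below are made-up placeholders, not actual WordSim-353 data:

from scipy.stats import spearmanr

# Hypothetical human ratings (0-10) and model cosine similarities for five word pairs
human_ratings = [7.8, 6.3, 1.2, 8.1, 0.5]
model_cosines = [0.71, 0.55, 0.10, 0.69, 0.22]

rho, p_value = spearmanr(human_ratings, model_cosines)
print(f"Spearman correlation: {rho:.2f}")   # higher is better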
46
Part 2e
Vector semantics
Dense vectors
Sparse vs. dense vectors
• PPMI vectors are
– Long (length |V|= 20,000 to 50,000)
– Sparse (most elements are zero)
• Alternative: learn vectors which are
– Short (length 200-1000)
– Dense (most elements are non-zero)
47
Sparse vs. dense vectors
• Why dense vectors?
– Short vectors may be easier to use as features in machine learning (fewer weights to tune)
– De-noising ➞ low-order dimensions may represent unimportant information
– Dense vectors may generalize better than storing explicit counts
– Dense models may do better at capturing higher-order co-occurrence
§ In a sparse model, car and automobile are synonyms but are represented as distinct dimensions
§ This fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor (= paradigmatic association)
48
Methods for creating dense vector embeddings
• Singular Value Decomposition (SVD)
– A special case of this is called Latent Semantic Analysis (LSA)
• “Neural Language Model”-inspired predictive models (skip-grams
and CBOW)
• Brown clustering
49
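A minimal sketch of the SVD route to dense vectors, applied purely for illustration to the tiny PPMI matrix from earlier; real applications factorize much larger matrices and keep a few hundred dimensions:

import numpy as np

# PPMI matrix from the earlier example (rows = words, columns = contexts; zeros for clipped cells)
ppmi = np.array([[0.00, 0.00, 2.25, 0.00, 2.25],
                 [0.00, 0.00, 2.25, 0.00, 2.25],
                 [1.66, 0.00, 0.00, 0.00, 0.00],
                 [0.00, 0.57, 0.00, 0.47, 0.00]])

U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
k = 2                                  # keep only the top-k singular dimensions
dense_vectors = U[:, :k] * S[:k]       # short, dense word embeddings
print(dense_vectors.shape)             # (4, 2): one 2-dimensional vector per word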
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
50
51
Part 3a
Text classification
Introduction
Is this spam?
52
What is text classification?
• Goal: Take a document and assign it a label representing its content
• Classic example: decide if a newspaper article is about politics, sports, or
business
• Many uses for the same technology
– Is this e-mail spam or not?
– Is this page a laser printer product page?
– Does this company accept overseas orders?
– Is this tweet positive or negative?
– Does this part of the CV describe part of a person’s work experience?
– Is this text written in Danish or Norwegian?
– Is this the “computer” or “harbor” sense of port?
53
Definition
• Input:
– A document d
– A fixed set of classes C = {c1, c2,…, cj}
• Output
– A predicted class c ∈ C
54
Top-down vs. bottom-up text classification
• Top-down approach
– We tell the computer exactly how it should solve a task (e.g., expert
systems)
– Example: black-list-address OR ("dollars" AND "have been
selected")
– Potential for high accuracy, but requires expensive & extensive
refinement by experts
• Bottom-up approach
– The computer finds out for itself with a 'little' help from us (e.g., machine learning)
– Example: The word “viagra” is very often found in spam e-mails
55
Definition (updated)
• Input:
– A document d
– A fixed set of classes C = {c1, c2,…, cj}
– A training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output
– A learned classifier γ: d ➞ c
56
Bottom-up text classification
• Any kind of classifier
– Naïve Bayes
– Logistic regression
– Support Vector Machines
– k-Nearest Neighbor
– Decision tree learning
– ...
57
today’s
focus
58
Part 3b
Text classification
Naïve Bayes
Naïve Bayes
• Simple (“naïve”) classification method based on Bayes rule
– Relies on very simple representation of document: bag of words
59
Bag-of-words representation
60
[Figure: a movie review reduced to a bag of words]
"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"
Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …
Bag-of-words representation
61
γ( {seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, …} ) = c
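A minimal sketch of the bag-of-words step: reduce the review to word counts, discarding word order (the tokenization is deliberately naive):

from collections import Counter
import re

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun... "
          "It manages to be whimsical and romantic while laughing at the "
          "conventions of the fairy tale genre. I would recommend it to "
          "just about anyone. I've seen it several times, and I'm always "
          "happy to see it again whenever I have a friend who hasn't seen it yet!")

tokens = re.findall(r"[a-z]+", review.lower())   # crude tokenization
bag_of_words = Counter(tokens)
print(bag_of_words.most_common(5))               # the most frequent words and their counts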
Naïve Bayes
• Simple (“naïve”) classification method based on Bayes rule
– Relies on very simple representation of document: bag of words
• Bayes’ rule applied to documents and classes
– For a document d and a class c
62
$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$
Naïve Bayes (cont’d)
63
$c_{MAP} = \operatorname*{argmax}_{c \in C} P(c \mid d)$   (MAP is "maximum a posteriori" = most likely class)
$= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}$   (Bayes' rule)
$= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)$   (dropping the denominator)
$= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$   (document d represented as features x1, …, xn)
Naïve Bayes (cont’d)
• P(c): how often does this class occur?
– We can just count the relative class frequencies in a corpus
• P(x1, x2, …, xn | c) has O(|X|ⁿ · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available!
64
$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$
Multinomial Naïve Bayes
65
• Bag-of-words assumption
– Assume position does not matter
• Conditional independence assumption
– Assume the feature probabilities P(xi | cj) are independent given the
class c
$P(x_1, x_2, \ldots, x_n \mid c)$

$P(x_1, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdot \ldots \cdot P(x_n \mid c)$

$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)$
Multinomial Naïve Bayes (cont’d)
66
$c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$
Multinomial Naïve Bayes (cont’d)
67
$c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$   (positions = all word positions in the test document)
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
– Simply use the frequencies in the data
68
$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}$   (fraction of times word wi appears among all words in documents of topic cj)

$\hat{P}(c_j) = \frac{doc\_count(C = c_j)}{N_{doc}}$   (fraction of documents with topic cj)
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic
and classified in the topic positive (= thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other
evidence!
69
ˆP("fantastic"|positive) =
count("fantastic", positive)
P
w 2V
count(w, positive)
= 0
cMAP = argmax
c
ˆP(c)
Y
i
ˆP(xi |c)
Solution: Laplace smoothing
• Also known as add-one (or add-α) smoothing
– Pretend we saw each word one more time (or α more times) than we did
– For α = 1:
$\hat{P}(w_i \mid c_j) = \frac{count(w_i, c_j) + 1}{\sum_{w \in V}\bigl(count(w, c_j) + 1\bigr)} = \frac{count(w_i, c_j) + 1}{\Bigl(\sum_{w \in V} count(w, c_j)\Bigr) + |V|}$
70
Learning a multinomial Naïve Bayes model
• From the training corpus, extract vocabulary V
• Calculate the P(cj) terms
– For each cj in C do
§ docsj ⟵ all docs with class = cj
§ P(cj) ⟵ |docsj| / |total # of documents|
• Calculate the P(wk | cj) terms
– Textj ⟵ single doc containing all docsj (aggregate representation of all documents of class j)
– For each word wk in V
§ nk ⟵ # of occurrences of wk in Textj
§ P(wk | cj) ⟵ (nk + 1) / (n + |V|), where n is the total number of word tokens in Textj
71
Doc Words Class
TRAINING
1 Chinese Beijing Chinese China
2 Chinese Chinese Shanghai China
3 Chinese Macao China
4 Tokyo Japan Chinese Japan
TEST 5 Chinese Chinese Chinese Tokyo Japan ?
Worked example: Priors
72
$\hat{P}(c) = \frac{N_c}{N}$
Priors: P(China) = 3/4, P(Japan) = 1/4
$P(\text{"Chinese"} \mid \text{China}) = \frac{count(\text{"Chinese"}, \text{China}) + 1}{count(\text{China}) + |V|}$
Worked example: Conditional probabilities
73
$\hat{P}(w \mid c) = \frac{count(w, c) + 1}{count(c) + |V|}$
P("Chinese" | China) = (5 + 1) / (8 + 6) = 6/14 = 3/7
Worked example: Conditional probabilities
74
P("Japan" | China) =
0 + 1
8 + 6
=
1
14
P("Tokyo" | China) =
0 + 1
8 + 6
=
1
14
P("Chinese" | China) =
5 + 1
8 + 6
=
3
7
P("Japan" | Japan) =
1 + 1
3 + 6
=
2
9
P("Tokyo" | Japan) =
1 + 1
3 + 6
=
2
9
P("Chinese" | Japan) =
1 + 1
3 + 6
=
2
9
Worked example: Choosing a class
75
P("Japan" | China) =
0 + 1
8 + 6
=
1
14
P("Tokyo" | China) =
0 + 1
8 + 6
=
1
14
P("Chinese" | China) =
5 + 1
8 + 6
=
3
7
P("Japan" | Japan) =
1 + 1
3 + 6
=
2
9
P("Tokyo" | Japan) =
1 + 1
3 + 6
=
2
9
P("Chinese" | Japan) =
1 + 1
3 + 6
=
2
9
P(Japan) =
1
4
P(China) =
3
4
P(China | doc5) /
3
4
⇥
3
7
3
⇥
1
14
⇥
1
14
⇡ 0.0003
P(Japan | doc5) /
1
4
⇥
2
9
3
⇥
2
9
⇥
2
9
⇡ 0.0001
cNB = argmax
c 2C
P(c)
Y
x 2X
P(x|c)
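A minimal sketch of multinomial Naïve Bayes with add-one smoothing that reproduces the worked example above, with the training and test documents taken from the table:

from collections import Counter, defaultdict
import math

training = [("Chinese Beijing Chinese", "China"),
            ("Chinese Chinese Shanghai", "China"),
            ("Chinese Macao", "China"),
            ("Tokyo Japan Chinese", "Japan")]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

# Collect vocabulary, per-class document counts, and per-class word counts
vocab = set()
docs_per_class = defaultdict(int)
word_counts = defaultdict(Counter)
for text, c in training:
    tokens = text.split()
    vocab.update(tokens)
    docs_per_class[c] += 1
    word_counts[c].update(tokens)

n_docs = len(training)
for c in docs_per_class:
    prior = docs_per_class[c] / n_docs
    total = sum(word_counts[c].values())
    # Work in log space to avoid underflow; add-one smoothing handles unseen words
    log_score = math.log(prior)
    for w in test_doc.split():
        log_score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    print(c, round(math.exp(log_score), 4))   # China 0.0003, Japan 0.0001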
Summary
• Naive Bayes is not so naïve
– Very fast
– Low storage requirements
– Robust to irrelevant features
§ Irrelevant features cancel each other out without affecting results
– Very good in domains with many equally important features
§ Decision Trees suffer from fragmentation in such cases – especially if little data
– A good, dependable baseline for text classification
76
Naïve Bayes in spam filtering
• SpamAssassin features (http://spamassassin.apache.org/tests_3_3_x.html)
– Properties
§ From: starts with many numbers
§ Subject is all capitals
§ HTML has a low ratio of text to image area
§ Claims you can be removed from the list
– Phrases
§ “viagra”
§ “impress ... girl”
§ “One hundred percent guaranteed”
§ “Prestigious Non-Accredited Universities”
77
78
Part 3c
Text classification
Evaluation
Evaluation
• Text classification can be seen as an application of machine learning
– Evaluation is similar to best-practice in machine learning
• Experimental setup
– Splitting in training, development, and test set
– Cross-validation for parameter optimization
• Evaluation
– Precision
§ For a particular class c, how many times were we correct in predicting c?
– Recall
§ For a particular class c, how many of the actual instances of c did we manage to
find?
– Other metrics: F-score, AUC
79
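A minimal sketch of per-class precision and recall; the gold labels and predictions below are made-up toy data:

def precision_recall(y_true, y_pred, target):
    """Precision and recall for one class `target`."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == target and t == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == target and t != target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != target and t == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["spam", "ham", "spam", "spam", "ham", "ham"]
pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
print(precision_recall(gold, pred, "spam"))   # precision and recall are both 2/3 here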
Precision vs. recall
• High precision
– When all returned answers must be correct
– Good when missing results are not problematic
– More common from hand-built systems
• High recall
– You get all the right answers, but garbage too
– Good when incorrect results are not problematic
– More common from automatic systems
• Trade-off
– In general, one can trade one for the other
– But it is harder to score well on both
80
[Figure: precision-recall trade-off curve]
Non-binary classification
• What if we have more than two classes?
– Solution: Train sets of binary classifiers
• Scenario 1: Any-of classification (aka multivalue classification)
– A document can belong to 0, 1, or >1 classes
– For each class c∈C
§ Build a classifier γc to distinguish c from all other classes c’∈C
– Given test document d
§ Evaluate it for membership in each class using each γc
§ d belongs to any class for which γc returns true
81
Non-binary classification
• Scenario 2: One-of classification (aka multinomial classification)
– A document can belong to exactly 1 class
– For each class c∈C
§ Build a classifier γc to distinguish c from all other classes c’∈C
– Given test document d
§ Evaluate it for membership in each class using each γc
§ d belongs to the one class with maximum score
82
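A schematic sketch of the two scenarios; the per-class scorers below are toy functions standing in for whatever binary classifiers γc were actually trained:

def any_of(classifiers, doc, threshold=0.5):
    """Any-of: a document may belong to 0, 1, or more classes."""
    return [c for c, clf in classifiers.items() if clf(doc) >= threshold]

def one_of(classifiers, doc):
    """One-of: pick the single class with the maximum score."""
    return max(classifiers, key=lambda c: classifiers[c](doc))

# Toy per-class scorers standing in for trained binary classifiers
classifiers = {
    "sports":   lambda d: 0.9 if "match" in d else 0.1,
    "politics": lambda d: 0.8 if "election" in d else 0.2,
    "business": lambda d: 0.7 if "market" in d else 0.3,
}
doc = "the election result moved the market"
print(any_of(classifiers, doc))   # ['politics', 'business']
print(one_of(classifiers, doc))   # 'politics'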
83
Part 3d
Text classification
Practical issues
Text classification
• Usually, simple machine learning algorithms are used
– Examples: Naïve Bayes, decision trees
– Very robust, very re-usable, very fast
• Recently, slightly better performance from better algorithms
– Examples: SVMs, Winnow, boosting, k-NN
• Accuracy is more dependent on
– Naturalness of classes
– Quality of features extracted
– Amount of training data available
• Accuracy typically ranges from 65% to 97% depending on the situation
– Note particularly performance on rare classes
84
The real world
• Gee, I’m building a text classifier for real, now! What should I do?
• No training data?
– Manually written rules
§ Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN
CATEGORIZE AS 'grain'
– Need careful crafting
§ Human tuning on development data
§ Time-consuming: 2 days per class
85
The real world
• Very little data?
– Use Naïve Bayes
§ Naïve Bayes is a “high-bias” algorithm (Ng & Jordan, 2002)
– Get more labeled data
§ Find clever ways to get humans to label data for you
– Try semi-supervised training methods
§ Bootstrapping, EM over unlabeled documents, …
• A reasonable amount of data?
– Perfect for all the clever classifiers
§ SVM, Regularized Logistic Regression, ...
– You can even use user-interpretable decision trees
§ Users like to hack & management likes quick fixes
86
Banko & Brill (2001)
The real world
• A huge amount of data?
– Can achieve high accuracy!
– Comes at a cost
§ SVMs (train time) or kNN (test time)
can be too slow
§ Regularized logistic regression can
be somewhat better
§ So Naïve Bayes can come back into
its own again!
– With enough data, classifier may
not matter…
87
Tweaking performance
• Domain-specific features and weights are very important in real
performance
• Sometimes need to collapse terms:
– Part numbers, chemical formulas, …
– But stemming generally does not help
• Upweighting (= counting a word as if it occurred twice)
– Title words (Cohen & Singer, 1996)
– First sentence of each paragraph (Murata, 1999)
– In sentences that contain title words (Ko et al., 2002)
88
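A minimal sketch of the upweighting idea: count a title word as if it occurred twice by simply repeating the title tokens before counting (the field names and weight here are illustrative):

from collections import Counter

def bag_of_words_with_title_boost(title, body, title_weight=2):
    """Count body words once and title words `title_weight` times."""
    tokens = body.lower().split() + title.lower().split() * title_weight
    return Counter(tokens)

features = bag_of_words_with_title_boost(
    title="Laser printer product page",
    body="This page describes our fastest laser printer yet.")
print(features["laser"])   # 3: once in the body, twice from the boosted title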
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
89
90
Questions?
References
• Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins.
• Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education.
• Liddy, E. D. (1998). Enhanced Text Retrieval Using Natural Language Processing. Bulletin of the American Society for Information Science, 24(4), 14-16.
• Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
91

Contenu connexe

Tendances

Tendances (20)

Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
Language models
Language modelsLanguage models
Language models
 
NLP
NLPNLP
NLP
 
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
Stemming And Lemmatization Tutorial | Natural Language Processing (NLP) With ...
 
Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)Introduction to natural language processing (NLP)
Introduction to natural language processing (NLP)
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Representation Learning of Text for NLP
Representation Learning of Text for NLPRepresentation Learning of Text for NLP
Representation Learning of Text for NLP
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Nlp
NlpNlp
Nlp
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Text classification & sentiment analysis
Text classification & sentiment analysisText classification & sentiment analysis
Text classification & sentiment analysis
 
natural language processing help at myassignmenthelp.net
natural language processing  help at myassignmenthelp.netnatural language processing  help at myassignmenthelp.net
natural language processing help at myassignmenthelp.net
 
Introduction to natural language processing, history and origin
Introduction to natural language processing, history and originIntroduction to natural language processing, history and origin
Introduction to natural language processing, history and origin
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 

Similaire à Natural Language Processing

Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
Mustafa Jarrar
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
Abdullah al Mamun
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEO
Koray Tugberk GUBUR
 
lexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdflexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdf
Gagu6
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
tanishamahajan11
 

Similaire à Natural Language Processing (20)

Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Deep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word EmbeddingsDeep Learning for Natural Language Processing: Word Embeddings
Deep Learning for Natural Language Processing: Word Embeddings
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITY
 
Corpus study design
Corpus study designCorpus study design
Corpus study design
 
Natural Language Processing Course in AI
Natural Language Processing Course in AINatural Language Processing Course in AI
Natural Language Processing Course in AI
 
Visual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on LanguageVisual-Semantic Embeddings: some thoughts on Language
Visual-Semantic Embeddings: some thoughts on Language
 
IJNLC 2013 - Ambiguity-Aware Document Similarity
IJNLC  2013 - Ambiguity-Aware Document SimilarityIJNLC  2013 - Ambiguity-Aware Document Similarity
IJNLC 2013 - Ambiguity-Aware Document Similarity
 
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptxENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
ENeL_WG3_Survey-AKA4Lexicography-TiberiusHeylenKrek (1).pptx
 
L1 nlp intro
L1 nlp introL1 nlp intro
L1 nlp intro
 
NLP_KASHK: Introduction
NLP_KASHK: Introduction NLP_KASHK: Introduction
NLP_KASHK: Introduction
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)Introduction to Natural Language Processing (NLP)
Introduction to Natural Language Processing (NLP)
 
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffnL6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
L6.pptxsdv dfbdfjftj hgjythgfvfhjyggunghb fghtffn
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Lexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEOLexical Semantics, Semantic Similarity and Relevance for SEO
Lexical Semantics, Semantic Similarity and Relevance for SEO
 
lexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdflexical-semantics-221118101910-ccd46ac3.pdf
lexical-semantics-221118101910-ccd46ac3.pdf
 
1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 

Plus de Toine Bogers

Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while DrivingHands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Toine Bogers
 
A Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index SizeA Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index Size
Toine Bogers
 

Plus de Toine Bogers (16)

"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C..."If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
"If I like BLANK, what else will I like?": Analyzing a Human Recommendation C...
 
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while DrivingHands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
Hands-free but not Eyes-free: A Usability Evaluation of Siri while Driving
 
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
“Looking for an Amazing Game I Can Relax and Sink Hours into...”: A Study of ...
 
A Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in DenmarkA Study of Usage and Usability of Intelligent Personal Assistants in Denmark
A Study of Usage and Usability of Intelligent Personal Assistants in Denmark
 
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
“What was this movie about this chick?”: A Comparative Study of Relevance Asp...
 
"I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq..."I just scroll through my stuff until I find it or give up": A Contextual Inq...
"I just scroll through my stuff until I find it or give up": A Contextual Inq...
 
Defining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven RecommendationDefining and Supporting Narrative-driven Recommendation
Defining and Supporting Narrative-driven Recommendation
 
An In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book SearchAn In-depth Analysis of Tags and Controlled Metadata for Book Search
An In-depth Analysis of Tags and Controlled Metadata for Book Search
 
Personalized search
Personalized searchPersonalized search
Personalized search
 
A Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index SizeA Longitudinal Analysis of Search Engine Index Size
A Longitudinal Analysis of Search Engine Index Size
 
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
Tagging vs. Controlled Vocabulary: Which is More Helpful for Book Search?
 
Measuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage SystemsMeasuring System Performance in Cultural Heritage Systems
Measuring System Performance in Cultural Heritage Systems
 
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
How 'Social' are Social News Sites? Exploring the Motivations for Using Reddi...
 
Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?
 
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on TwitterMicro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
Micro-Serendipity: Meaningful Coincidences in Everyday Life Shared on Twitter
 
Benchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program CommitteesBenchmarking Domain-specific Expert Search using Workshop Program Committees
Benchmarking Domain-specific Expert Search using Workshop Program Committees
 

Dernier

Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
PirithiRaju
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
RizalinePalanog2
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 

Dernier (20)

FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptxSCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
SCIENCE-4-QUARTER4-WEEK-4-PPT-1 (1).pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 

Natural Language Processing

  • 1. 1 Natural Language Processing Toine Bogers Aalborg University Copenhagen Christina Lioma University of Copenhagen QUARTZ WINTER SCHOOL / FEBRUARY 12, 2018 / PADUA, ITALY
  • 2. Who? • Toine Bogers (toine@hum.aau.dk) – Associate professor @ Aalborg University Copenhagen – Interests § Recommender systems § Information retrieval (search engines) § Information behavior • Christina Lioma (c.lioma@di.ku.dk) – Full professor @ University of Copenhagen – Interests § Information retrieval (search engines) § Natural language processing, computational linguistics 2
  • 3. Outline • Introduction to NLP • Vector semantics • Text classification 3
  • 4. Useful references • Slides in this lecture are based on Jurafsky & Martin book – Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. https://web.stanford.edu/~jurafsky/slp3/ • Other good textbooks on NLP – Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications:Text Retrieval, Extractionand Categorization. Amsterdam: Benjamins. – Manning, C. D., & Schütze, H. (1999). Foundations of StatisticalNatural Language Processing. Cambridge, MA: MIT Press. 4
  • 6. What is Natural Language Processing? • Multidisciplinary branch of CS drawing largely from linguistics • Definition by Liddy (1998) – “Natural language processing is a range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of particular tasks or applications.” • Goal: applied mechanization of human language 6
  • 7. Levels of linguistic analysis • Phonological — Interpretation of speech sounds within and across words • Morphological — Componential analysis of words, including prefixes, suffixes and roots • Lexical — Word-level analysis including lexical meaning and part-of-speech analysis • Syntactic — Analysis of words in a sentence in order to uncover the grammatical structure of the sentence • Semantic — Determining the possible meanings of a sentence, including disambiguation of words in context • Discourse — Interpreting structure and meaning conveyed by texts larger than a sentence • Pragmatic — Understanding the purposeful use of language in situations, particularly those aspects of language which require world knowledge 7Liddy (1998)
  • 8. NLP is hard! • Multidisciplinary knowledgegap – Some computer scientists might not “get” linguistics – Some linguists might not “get” computer science – Goal: bridge this gap • String similarity aloneis not enough – Very similar strings can mean different things (1 & 2) – Very different strings can mean similar things (2 & 3) – Examples: 1. How fast is the TZ? 2. How fast will my TZ arrive? 3. Please tell me when I can expect the TZ I ordered 8
  • 9. NLP is hard! • Ambiguity – Identical strings can have different meanings – Example: “I made her duck” has at least five possible meanings § I cooked waterfowl for her § I cooked waterfowl belonging to her § I created the (plaster?) duck she owns § I caused her to quickly lower her head or body § I waved my magic wand and turned her into undifferentiated waterfowl 9
  • 10. Overcoming NLP difficulty • Natural languageambiguity is very common, but also largely local – Immediate context resolves ambiguity – Immediate context ➝ common sense – Example: “My connection is too slow today.” • Humans use common sense to resolve ambiguity, sometimes without being aware there was ambiguity 10
  • 11. Overcoming NLP difficulty • Machines do not have common sense • Initial suggestion – Hand-code common sense in machines – Impossibly hard to do for more than very limiteddomains • Present suggestion – Applications that work with very limited domains – Approximate common sense by relatively simple techniques 11
  • 12. Applications of NLP • Language identification • Spelling & grammar checking • Speech recognition & synthesis • Sentiment analysis • Automatic summarization • Machine translation • Information retrieval • Information extraction • … 12
  • 13. Applications of NLP • Some applications are widespread (e.g., spell check), while others are not ready for industry or are too expensive for popular use • NLP tools rarely hit a 100% success rate – Accuracy is assessed in statistical terms – Tools become mature and usable when they operate above a certain precision and below an acceptable cost • All NLP tools improve continuously (and often rapidly) 13
  • 14. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & part-of-speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 14
  • 15. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 15 today’s focus
  • 16. 16 Part 2a Vector semantics Introduction to distributional semantics
  • 17. Word similarity • Understanding word similarity is essential for NLP – Example § “fast” is similar to “rapid” § “tall” is similar to “height” – Question answering § Question: “How tall is Mt. Everest?” § Candidate answer: “The official height of Mount Everest is 29029 feet.” • Can we compute the similarity between words automatically? – Distributional semantics is an approach to doing this 17
  • 19. Distributional semantics • Distributional semantics is the study of semantic similarities between words using their distributional properties in text corpora – Distributional models of meaning = vector-space models of meaning = vector semantics • Intuition behind this – Linguistic items with similar distributions have similar meanings – Zellig Harris (1954): § “oculist and eye-doctor … occur in almost the same environments” § “If A and B have almost identical environments we say that they are synonyms.” – Firth (1957) § “You shall know a word by the company it keeps!” 19
  • 20. Distributional semantics • Example – From context words, humans can guess tesgüino means an alcoholic beverage like beer • Intuition for algorithm – Two words are similar if they have similar word contexts 20 A bottle of tesgüino is on the table. Everybody likes tesgüino. Tesgüino makes you drunk. We make tesgüino out of corn.
  • 21. Four kinds of models of vector semantics • Sparse vector representations – Mutual-information weighted word co-occurrence matrices • Dense vector representations – Singular value decomposition (and Latent Semantic Analysis) – Neural-network-inspired models (skip-grams, CBOW) – Brown clusters 21 today’s focus
  • 22. Shared intuition • Model the meaning of a word by “embedding” in a vector space – The meaning of a word is a vector of numbers – Vector models are also called embeddings • Contrast – Word meaning is represented in many computational linguistic applications by a vocabulary index (“word number 545”) 22
  • 23. Representing vector semantics • Words are related if they occur in the same context • How big is this context? – Entire document? Paragraph? Window of ±n words? – Smaller contexts (e.g., context windows) are best for capturing similarity – If a word w occurs a lot in the context windows of another word v (i.e., if they frequently co-occur), then they are probably related 23 [Excerpt from Jurafsky & Martin: the word-word matrix has dimensionality |V| × |V|; each cell records how often the row (target) word and the column (context) word co-occur, e.g., within a ±4-word window. Example ±7-word windows from the Brown corpus are shown for the words apricot, pineapple, digital, and information.]
  • 24. Representing vector semantics • We can now define a word w by a vector of counts of context words – Counts represent how often those words have co-occurred in the same context window with word w – Each vector is of length |V|, where V is the vocabulary – Vector semantics is captured in the word-word matrix, which is |V| × |V| 24
  • 25. Example: Contexts ±7 words (selection from the Brown corpus word-word co-occurrence matrix for four sample words) 25

                aardvark  …  computer  data  pinch  result  sugar  …
    apricot        0      …     0       0      1      0       1
    pineapple      0      …     0       0      1      0       1
    digital        0      …     2       1      0      1       0
    information    0      …     1       6      0      4       0
  • 26. Word-word matrix • We showed only 4 x 6, but the real matrix is 50,000 x 50,000 – Most values are 0 so it is very sparse – That’s OK, since there are lots of efficient algorithms for sparse matrices • The size of windows depends on your goals – The shorter the windows, the more syntactic the representation § ± 1-3 very syntactic – The longer the windows, the more semantic the representation § ± 4-10 more semantic 26
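  To make the word-word matrix concrete, here is a minimal sketch (plain Python; the toy corpus, the ±4 window size, and all variable names are my own illustration, echoing the tesgüino example above) that counts context words within a fixed window around each target word:

```python
from collections import Counter, defaultdict

def cooccurrence_matrix(sentences, window=4):
    """Count how often each context word occurs within ±window tokens of each target word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:                       # skip the target position itself
                    counts[target][tokens[j]] += 1
    return counts

corpus = ["a bottle of tesguino is on the table",
          "everybody likes tesguino",
          "tesguino makes you drunk",
          "we make tesguino out of corn"]
matrix = cooccurrence_matrix(corpus, window=4)
print(matrix["tesguino"].most_common(5))   # context-word counts for one target word
```

  The nested dictionary of Counters plays the role of one (sparse) row per target word in the |V| × |V| matrix.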
  • 27. Two types of co-occurrence • First-order co-occurrence (syntagmatic association) – The two words are typically nearby each other – wrote is a first-order associate of book or poem • Second-order co-occurrence (paradigmatic association) – The two words have similar neighbors – wrote is a second-order associate of words like said or remarked 27
  • 28. 28 Part 2b Vector semantics Positive Pointwise Mutual Information (PPMI)
  • 29. Problem with raw co-occurrence counts • Raw word frequency is not a great measure of association between words – It’s very skewed – “the” and “of” are very frequent, but maybe not the most discriminative • We would prefer a measure that asks whether a context word is particularly informative about the target word – Positive Pointwise Mutual Information (PPMI) 29
  • 30. Pointwise Mutual Information • Do events x and y co-occur more than if they were independent? PMI(X, Y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} • PMI between two words (Church & Hanks, 1989) – Do words x and y co-occur more than if they were independent? PMI(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)} 30
  • 31. Positive Pointwise Mutual Information • PMI ranges from –∞ to +∞, but negative values are problematic – Things are co-occurring less than we expect by chance – Unreliable without enormous corpora § Imagine w1 and w2 whose probability is each 10⁻⁶ § Hard to be sure P(w1, w2) is significantly different than 10⁻¹² – Plus it is not clear people are good at “unrelatedness” • So we just replace negative PMI values by 0 – Positive PMI (PPMI) between word1 and word2: PPMI(word_1, word_2) = \max\left(\log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\; 0\right) 31
  • 32. Computing PPMI on a term-context matrix • Matrix F with W rows (words) and C columns (contexts) • f_ij is the number of times w_i occurs in context c_j 32

    p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
    p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
    p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}
    pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}
    ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}
  • 33. Probabilities computed from the count matrix (N = 19 co-occurrence events): 33

    p_{ij} = \frac{f_{ij}}{N}, \quad p(w_i) = \frac{\sum_{j} f_{ij}}{N}, \quad p(c_j) = \frac{\sum_{i} f_{ij}}{N}

    p(w = information, c = data) = 6/19 = .32
    p(w = information) = 11/19 = .58
    p(c = data) = 7/19 = .37

    p(w, context)                                       p(w)
                 computer  data  pinch  result  sugar
    apricot        0.00    0.00  0.05   0.00    0.05    0.11
    pineapple      0.00    0.00  0.05   0.00    0.05    0.11
    digital        0.11    0.05  0.00   0.05    0.00    0.21
    information    0.05    0.32  0.00   0.21    0.00    0.58
    p(context)     0.16    0.37  0.11   0.26    0.11
  • 34. PPMI values computed from the probability table above, using pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}: 34

    PPMI(w, context)
                 computer  data  pinch  result  sugar
    apricot         -       -    2.25    -      2.25
    pineapple       -       -    2.25    -      2.25
    digital        1.66    0.00   -     0.00     -
    information    0.00    0.57   -     0.47     -

    Example: PMI(information, data) = \log_2 \frac{.32}{.37 \times .58} = .57
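  To make the PPMI arithmetic above reproducible, here is a minimal sketch (assuming numpy; the count matrix is the toy example from the earlier slide, restricted to the five context words shown) that recovers PPMI(information, data) ≈ 0.57:

```python
import numpy as np

words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],      # raw co-occurrence counts f_ij
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

N = F.sum()                                   # 19 co-occurrence events in total
p_ij = F / N                                  # joint probabilities p(w, c)
p_w  = p_ij.sum(axis=1, keepdims=True)        # marginals p(w)
p_c  = p_ij.sum(axis=0, keepdims=True)        # marginals p(c)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_ij / (p_w * p_c))         # log2(0) -> -inf for unseen pairs
ppmi = np.maximum(pmi, 0)                     # clip negative (and -inf) PMI to 0

i, j = words.index("information"), contexts.index("data")
print(round(float(ppmi[i, j]), 2))            # -> 0.57, matching the worked example
```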
  • 35. One more problem… • We are unlikely to encounter rare words unless we have large corpora – PMI values cannot be calculated if the co-occurrence count is 0 • Solution: give rare words slightly higher probabilities – Steal probability mass to generalize better – Laplace smoothing (aka add-one smoothing) § Pretend we saw each word one more time than we did 35 [Figure: overlapping groups of context words such as allegations, reports, claims, attack, request, man, outcome]
  • 36. 36 Part 2c Vector semantics Measuring word similarity: the cosine
  • 37. Measuring similarity • We need a way to measure the similarity between two target words v and w – Most vector similarity measures are based on the dot product or inner product from linear algebra: dot-product(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N – High when two vectors have large values in the same dimensions – Low (in fact 0) for orthogonal vectors with zeros in complementary distribution 37
  • 38. Measuring similarity • Problem: the dot product is not normalized for vector length – Vectors are longer if they have higher values in each dimension – That means more frequent words will have higher dot products – Our similarity metric should not be sensitive to word frequency • Solution: divide by the lengths of the two vectors – This equals the cosine of the angle between the two vectors, since \vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}| \cos\theta, so \cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|} – This is the cosine similarity 38
  • 39. Calculating the cosine similarity 39

    \cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\, \sqrt{\sum_{i=1}^{N} w_i^2}}   (dot product of unit vectors)

  • v_i is the PPMI value for word v in context i • w_i is the PPMI value for word w in context i • cos(v, w) is the cosine similarity between v and w – Raw frequency or PPMI are non-negative, so the cosine range is [0, 1]
  • 40. Example: Which pair of words is more similar? 40

                 large  data  computer
    apricot        2      0      0
    digital        0      1      2
    information    1      6      1

    cosine(apricot, digital) = \frac{2 \cdot 0 + 0 \cdot 1 + 0 \cdot 2}{\sqrt{4 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0
    cosine(digital, information) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = \frac{8}{\sqrt{5}\sqrt{38}} = .58
    cosine(apricot, information) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = \frac{2}{2\sqrt{38}} = .16
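  A matching sketch for the cosine numbers above (assuming numpy; the vectors are copied from the small table on this slide):

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

apricot     = np.array([2, 0, 0])   # contexts: large, data, computer
digital     = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, digital), 2))       # 0.0
print(round(cosine(digital, information), 2))   # 0.58
print(round(cosine(apricot, information), 2))   # 0.16
```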
  • 41. Visualizing vectors and angles 41 [Figure: the vectors for digital, apricot, and information plotted in two dimensions, dimension 1 = ‘large’, dimension 2 = ‘data’]

                 large  data
    apricot        2     0
    digital        0     1
    information    1     6
  • 45. Evaluating word similarity • Intrinsic evaluation – Correlation between algorithm and human word similarity ratings § Wordsim353 ➞ 353 noun pairs rated on a 0-10 scale – Taking TOEFL multiple-choice vocabulary tests • Extrinsic (= task-based, end-to-end) evaluation – Question answering – Spell-checking – Essay grading 45 sim(plane, car) = 5.77 Levied is closest in meaning to: imposed / believed / requested / correlated
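  Intrinsic evaluation of this kind usually reports a rank correlation between model scores and human ratings. A minimal sketch (assuming scipy is installed; the five word pairs, human ratings, and model scores below are invented purely for illustration):

```python
from scipy.stats import spearmanr

# Hypothetical human similarity ratings (0-10, WordSim-353 style) and model cosine scores
human_ratings = [7.5, 8.0, 1.2, 5.8, 3.1]
model_scores  = [0.62, 0.58, 0.05, 0.44, 0.33]

rho, p_value = spearmanr(human_ratings, model_scores)
print(round(rho, 2))   # Spearman rank correlation, 0.9 for these made-up numbers
```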
  • 47. Sparse vs. dense vectors • PPMI vectors are – Long (length |V|= 20,000 to 50,000) – Sparse (most elements are zero) • Alternative: learn vectors which are – Short (length 200-1000) – Dense (most elements are non-zero) 47
  • 48. Sparse vs. dense vectors • Why dense vectors? – Short vectors may be easier to use as features in machine learning (fewer weights to tune) – De-noising ➞ low-order dimensions may represent unimportant information – Dense vectors may generalize better than storing explicit counts – Dense models may do better at capturing higher-order co-occurrence § car and automobile are synonyms, but are represented as distinct dimensions in a sparse count matrix § The sparse representation thus fails to capture the similarity between a word with car as a neighbor and a word with automobile as a neighbor (= paradigmatic association) 48
  • 49. Methods for creating dense vector embeddings • Singular Value Decomposition (SVD) – A special case of this is called Latent Semantic Analysis (LSA) • “Neural Language Model”-inspired predictive models (skip-grams and CBOW) • Brown clustering 49
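  As a rough illustration of the SVD/LSA route (a sketch only, assuming numpy; the matrix is the toy PPMI table from the earlier slide, with the “-” cells treated as 0):

```python
import numpy as np

ppmi = np.array([[0.00, 0.00, 2.25, 0.00, 2.25],   # apricot
                 [0.00, 0.00, 2.25, 0.00, 2.25],   # pineapple
                 [1.66, 0.00, 0.00, 0.00, 0.00],   # digital
                 [0.00, 0.57, 0.00, 0.47, 0.00]])  # information

# Truncated SVD: keep only the top-k singular values to obtain short, dense vectors
U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
k = 2
dense_word_vectors = U[:, :k] * S[:k]   # one k-dimensional embedding per word
print(dense_word_vectors.shape)         # (4, 2)
```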
  • 50. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 50
  • 53. What is text classification? • Goal: Take a document and assign it a label representing its content • Classic example: decide if a newspaper article is about politics, sports, or business • Many uses for the same technology – Is this e-mail spam or not? – Is this page a laser printer product page? – Does this company accept overseas orders? – Is this tweet positive or negative? – Does this part of the CV describe part of a person’s work experience? – Is this text written in Danish or Norwegian? – Is this the “computer” or “harbor” sense of port? 53
  • 54. Definition • Input: – A document d – A fixed set of classes C = {c1, c2,…, cj} • Output – A predicted class c ∈ C 54
  • 55. Top-down vs. bottom-up text classification • Top-down approach – We tell the computer exactly how it should solve a task (e.g., expert systems) – Example: black-list-address OR ("dollars" AND "have been selected") – Potential for high accuracy, but requires expensive & extensive refinement by experts • Bottom-up approach – The computer finds out for itself with a ‘little’ help from us (e.g., machine learning) – Example: The word “viagra” is very often found in spam e-mails 55
  • 56. Definition (updated) • Input: – A document d – A fixed set of classes C = {c1, c2,…, cj} – A training set of m hand-labeled documents (d1, c1), …, (dm, cm) • Output – A learned classifier γ: d ➞ c 56
  • 57. Bottom-up text classification • Any kind of classifier – Naïve Bayes – Logistic regression – Support Vector Machines – k-Nearest Neighbor – Decision tree learning – ... 57 today’s focus
  • 59. Naïve Bayes • Simple (“naïve”) classification method based on Bayes rule – Relies on very simple representation of document: bag of words 59
  • 60. Bag-of-words representation 60

    “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!”

    it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …
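  A minimal sketch of the bag-of-words step (plain Python; the crude letters-only tokenizer is an assumption, chosen so that clitics such as “It's” and “I've” split and the counts match the slide):

```python
import re
from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue is great "
          "and the adventure scenes are fun... It manages to be whimsical and romantic while "
          "laughing at the conventions of the fairy tale genre. I would recommend it to just "
          "about anyone. I've seen it several times, and I'm always happy to see it again "
          "whenever I have a friend who hasn't seen it yet!")

tokens = re.findall(r"[a-z]+", review.lower())   # lowercase, letters only
bag = Counter(tokens)
print(bag.most_common(6))   # top counts: it 6, i 5, the 4, and 3, to 3, seen 2
```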
  • 61. Bag-of-words representation 61 γ(⟨seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, …⟩) = c
  • 62. Naïve Bayes • Simple (“naïve”) classification method based on Bayes rule – Relies on very simple representation of document: bag of words • Bayes’ rule applied to documents and classes – For a document d and a class c: P(c|d) = \frac{P(d|c)\,P(c)}{P(d)} 62
  • 63. Naïve Bayes (cont’d) 63

    c_{MAP} = \arg\max_{c \in C} P(c|d)                           (MAP = “maximum a posteriori” = most likely class)
            = \arg\max_{c \in C} \frac{P(d|c)\,P(c)}{P(d)}        (Bayes’ rule)
            = \arg\max_{c \in C} P(d|c)\,P(c)                     (dropping the denominator)
            = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n|c)\,P(c)  (document d represented as features x_1, …, x_n)
  • 64. Naïve Bayes (cont’d) c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n|c)\,P(c) • The prior P(c): how often does this class occur? – We can just count the relative frequencies in a corpus • The likelihood P(x_1, …, x_n|c) has O(|X|^n \cdot |C|) parameters – Could only be estimated if a very, very large number of training examples was available! 64
  • 65. Multinomial Naïve Bayes 65 • Bag-of-words assumption – Assume position does not matter • Conditional independence assumption – Assume the feature probabilities P(x_i|c_j) are independent given the class c: P(x_1, x_2, \dots, x_n|c) = P(x_1|c) \cdot P(x_2|c) \cdot P(x_3|c) \cdots P(x_n|c)
  • 66. Multinomial Naïve Bayes (cont’d) c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n|c)\,P(c) simplifies to c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x|c) 66
  • 67. Multinomial Naïve Bayes (cont’d) c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i|c_j), where positions = all word positions in the test document 67
  • 68. Learning the Multinomial Naïve Bayes Model • First attempt: maximum likelihood estimates – Simply use the frequencies in the data 68

    \hat{P}(c_j) = \frac{doccount(C = c_j)}{N_{doc}}   (fraction of documents with topic c_j)
    \hat{P}(w_i|c_j) = \frac{count(w_i, c_j)}{\sum_{w \in V} count(w, c_j)}   (fraction of times word w_i appears among all words in documents of topic c_j)
  • 69. Problem with Maximum Likelihood • What if we have seen no training documents with the word fantastic and classified in the topic positive (= thumbs-up)? \hat{P}(\text{“fantastic”}|\text{positive}) = \frac{count(\text{“fantastic”}, \text{positive})}{\sum_{w \in V} count(w, \text{positive})} = 0 • Zero probabilities cannot be conditioned away, no matter the other evidence! c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i|c) 69
  • 70. Solution: Laplace smoothing • Also known as add-one (or, more generally, add-α) smoothing – Pretend we saw each word one (or α) more time(s) than we did – For α = 1: 70

    \hat{P}(w_i|c_j) = \frac{count(w_i, c_j) + 1}{\sum_{w \in V} \left(count(w, c_j) + 1\right)} = \frac{count(w_i, c_j) + 1}{\left(\sum_{w \in V} count(w, c_j)\right) + |V|}
  • 71. Learning a multinomial Naïve Bayes model • From the training corpus, extract vocabulary V • Calculate the P(c_j) terms – For each c_j in C do § docs_j ⟵ all docs with class = c_j § P(c_j) ⟵ |docs_j| / |total # documents| • Calculate the P(w_k|c_j) terms – Text_j ⟵ single doc containing all docs_j (aggregate representation of all documents of class j) – For each word w_k in V § n_k ⟵ # of occurrences of w_k in Text_j § P(w_k|c_j) ⟵ (n_k + α) / (n + α|V|), where n is the total number of word tokens in Text_j 71
  • 72. Worked example: Priors 72

    Doc   Words                                  Class
    ----- TRAINING -------------------------------------
    1     Chinese Beijing Chinese                China
    2     Chinese Chinese Shanghai               China
    3     Chinese Macao                          China
    4     Tokyo Japan Chinese                    Japan
    ----- TEST -----------------------------------------
    5     Chinese Chinese Chinese Tokyo Japan    ?

    \hat{P}(c) = \frac{N_c}{N}: P(China) = 3/4, P(Japan) = 1/4   (priors)
  • 73. Worked example: Conditional probabilities (training/test documents as above) 73

    \hat{P}(w|c) = \frac{count(w, c) + 1}{count(c) + |V|}

    P(“Chinese”|China) = \frac{count(\text{“Chinese”}, China) + 1}{count(China) + |V|} = \frac{5 + 1}{8 + 6} = \frac{6}{14} = \frac{3}{7}
  • 74. Worked example: Conditional probabilities (cont’d) 74

    P(“Chinese”|China) = (5 + 1) / (8 + 6) = 3/7
    P(“Tokyo”|China)   = (0 + 1) / (8 + 6) = 1/14
    P(“Japan”|China)   = (0 + 1) / (8 + 6) = 1/14
    P(“Chinese”|Japan) = (1 + 1) / (3 + 6) = 2/9
    P(“Tokyo”|Japan)   = (1 + 1) / (3 + 6) = 2/9
    P(“Japan”|Japan)   = (1 + 1) / (3 + 6) = 2/9
  • 75. Worked example: Choosing a class with c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x|c) for test doc 5 = “Chinese Chinese Chinese Tokyo Japan” 75

    P(China|doc5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
    P(Japan|doc5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001

    ➝ predict class China
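  The worked example can be reproduced with a few lines of plain Python (a sketch under the slide’s assumptions: add-one smoothing and |V| = 6; all variable names are hypothetical):

```python
import math
from collections import Counter, defaultdict

train = [("Chinese Beijing Chinese",  "China"),
         ("Chinese Chinese Shanghai", "China"),
         ("Chinese Macao",            "China"),
         ("Tokyo Japan Chinese",      "Japan")]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for doc, _ in train for w in doc.split()}
docs_per_class = Counter(c for _, c in train)          # priors come from these counts
word_counts = defaultdict(Counter)
for doc, c in train:
    word_counts[c].update(doc.split())                 # per-class word counts

def log_posterior(doc, c):
    """log P(c) + sum_i log P(w_i | c), with add-one (Laplace) smoothing."""
    logp = math.log(docs_per_class[c] / len(train))
    total_words = sum(word_counts[c].values())
    for w in doc.split():
        logp += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
    return logp

for c in docs_per_class:
    print(c, round(math.exp(log_posterior(test_doc, c)), 4))
# China 0.0003, Japan 0.0001  ->  predict China, as on the slide
```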
  • 76. Summary • Naive Bayes is not so naïve – Very fast – Low storage requirements – Robust to irrelevant features § Irrelevant features cancel each other out without affecting results – Very good in domains with many equally important features § Decision Trees suffer from fragmentation in such cases – especially if little data – A good, dependable baseline for text classification 76
  • 77. Naïve Bayes in spam filtering • SpamAssassin features (http://spamassassin.apache.org/tests_3_3_x.html) – Properties § From: starts with many numbers § Subject is all capitals § HTML has a low ratio of text to image area § Claims you can be removed from the list – Phrases § “viagra” § “impress ... girl” § “One hundred percent guaranteed” § “Prestigious Non-Accredited Universities” 77
  • 79. Evaluation • Text classification can be seen as an application of machine learning – Evaluation is similar to best practice in machine learning • Experimental setup – Splitting into training, development, and test sets – Cross-validation for parameter optimization • Evaluation – Precision § For a particular class c, how many times were we correct in predicting c? – Recall § For a particular class c, how many of the actual instances of c did we manage to find? – Other metrics: F-score, AUC 79
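  Per-class precision and recall can be computed directly from the gold and predicted labels; a minimal sketch (plain Python, with made-up labels for illustration):

```python
def precision_recall(gold, predicted, cls):
    """Precision and recall for one class, from parallel lists of labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, predicted) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, predicted) if g == cls and p != cls)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

gold      = ["spam", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham",  "ham", "spam", "spam"]
print(precision_recall(gold, predicted, "spam"))   # (0.666..., 0.666...)
```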
  • 80. Precision vs. recall • High precision – When all returned answers must be correct – Good when missing results are not problematic – More common from hand-built systems • High recall – You get all the right answers, but garbage too – Good when incorrect results are not problematic – More common from automatic systems • Trade-off – In general, one can trade one for the other – But it is harder to score well on both 80 [Figure: precision vs. recall trade-off]
  • 81. Non-binary classification • What if we have more than two classes? – Solution: Train sets of binary classifiers • Scenario 1: Any-of classification (aka multivalue classification) – A document can belong to 0, 1, or >1 classes – For each class c∈C § Build a classifier γc to distinguish c from all other classes c’∈C – Given test document d § Evaluate it for membership in each class using each γc § d belongs to any class for which γc returns true 81
  • 82. Non-binary classification • Scenario 2: One-of classification (aka multinomial classification) – A document can belong to exactly 1 class – For each class c∈C § Build a classifier γc to distinguish c from all other classes c’∈C – Given test document d § Evaluate it for membership in each class using each γc § d belongs to the one class with maximum score 82
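  Both set-ups reduce to combining per-class binary classifiers; a minimal sketch (plain Python; the two toy scoring functions are invented stand-ins for trained binary classifiers γ_c):

```python
def any_of(doc, classifiers, threshold=0.0):
    """Any-of: return every class whose binary classifier fires on the document."""
    return [c for c, score in classifiers.items() if score(doc) > threshold]

def one_of(doc, classifiers):
    """One-of: return the single class with the maximum classifier score."""
    return max(classifiers, key=lambda c: classifiers[c](doc))

classifiers = {
    "sports":   lambda d: 1.2 if "match" in d else -0.5,
    "politics": lambda d: 0.8 if "election" in d else -0.7,
}
print(any_of("the match report", classifiers))   # ['sports']
print(one_of("the match report", classifiers))   # 'sports'
```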
  • 84. Text classification • Usually, simple machine learning algorithms are used – Examples: Naïve Bayes, decision trees – Very robust, very re-usable, very fast • Recently, slightly better performance from better algorithms – Examples: SVMs, Winnow, boosting, k-NN • Accuracy is more dependent on – Naturalness of classes – Quality of features extracted – Amount of training data available • Accuracy typically ranges from 65% to 97% depending on the situation – Note particularly performance on rare classes 84
  • 85. The real world • Gee, I’m building a text classifier for real, now! What should I do? • No training data? – Manually written rules § Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN CATEGORIZE AS 'grain' – Need careful crafting § Human tuning on development data § Time-consuming: 2 days per class 85
  • 86. The real world • Very little data? – Use Naïve Bayes § Naïve Bayes is a “high-bias” algorithm (Ng & Jordan, 2002) – Get more labeled data § Find clever ways to get humans to label data for you – Try semi-supervised training methods § Bootstrapping, EM over unlabeled documents, … • A reasonable amount of data? – Perfect for all the clever classifiers § SVM, Regularized Logistic Regression, ... – You can even use user-interpretable decision trees § Users like to hack & management likes quick fixes 86
  • 87. Banko & Brill (2001) The real world • A huge amount of data? – Can achieve high accuracy! – Comes at a cost § SVMs (train time) or kNN (test time) can be too slow § Regularized logistic regression can be somewhat better § So Naïve Bayes can come back into its own again! – With enough data, classifier may not matter… 87
  • 88. Tweaking performance • Domain-specific features and weights are very important in real performance • Sometimes need to collapse terms: – Part numbers, chemical formulas, … – But stemming generally does not help • Upweighting (= counting a word as if it occurred twice) – Title words (Cohen & Singer, 1996) – First sentence of each paragraph (Murata, 1999) – In sentences that contain title words (Ko et al., 2002) 88
  • 89. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 89
  • 91. References • Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins. • Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. • Liddy, E. D. (1998). Enhanced Text Retrieval Using Natural Language Processing. Bulletin of the American Society for Information Science and Technology, 24(4), 14-16. • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. 91