Parts of Speech Tagging

by

Mohd. Yaseen Ansari
From TE CSE

under the guidance of

Prof. Mrs. A. R. Kulkarni
Introduction
Principle
Parts of Speech Classes
What is POS Tagging good for?
Tag Set
Tag Set Example
Why is POS Tagging Hard?
Methods for POS Tagging
Stochastic POS Tagging
Definition of Hidden Markov Model
HMM for Tagging
Viterbi Tagging
Viterbi Algorithm
An Example
Definition
            Parts of Speech Tagging is defined as the task
of labeling each word in a sentence with its appropriate
part of speech.

Example
          The mother kissed the baby on the cheek.

       The[AT] mother[NN] kissed[VBD] the[AT]
baby[NN] on[IN] the[AT] cheek[NN].
The     – Article
mother  – Noun
kissed  – Verb
the     – Article
baby    – Noun
on      – Preposition
the     – Article
cheek   – Noun
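
For a concrete illustration, the same sentence can be run through an off-the-shelf tagger. Below is a minimal sketch using NLTK, which is an assumption for illustration (the slides do not name a library); note that NLTK's tagger uses the Penn Treebank tag set, so it prints DT where the example above uses Brown's AT.

```python
# Minimal sketch using NLTK's off-the-shelf tagger (NLTK is assumed
# for illustration; the slides do not prescribe a library).
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = nltk.word_tokenize("The mother kissed the baby on the cheek.")
print(nltk.pos_tag(tokens))
# Expected output (Penn Treebank tags):
# [('The', 'DT'), ('mother', 'NN'), ('kissed', 'VBD'), ('the', 'DT'),
#  ('baby', 'NN'), ('on', 'IN'), ('the', 'DT'), ('cheek', 'NN'), ('.', '.')]
```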
Parts of speech tagging is harder than just having a list of
words and their parts of speech, because some words can
represent more than one part of speech at different
times, and because some parts of speech are complex or
unspoken. A large percentage of word-forms are
ambiguous. For example,

The sailor dogs the barmaid.
Even "dogs", which is usually thought of as just a plural
noun, can also be a verb.
There are two classes of parts of speech:

1) Open classes: nouns, verbs, adjectives, adverbs, etc.

2) Closed classes:

a) Conjunctions: and, or, but, etc.
b) Pronouns: I, she, him, etc.
c) Prepositions: with, on, under, etc.
d) Determiners: the, a, an, etc.
e) Auxiliary verbs: can, could, may, etc.

and there are many others.
1) Useful in:
       a) Information Retrieval
       b) Text to Speech
       c) Word Sense Disambiguation

2) Useful as a preprocessing step for parsing:
a unique tag for each word reduces the number of parses.
POS tagging requires a tag set, so that there is no difficulty in
assigning one tag to each part of speech. Four tag sets are widely
used:

1) Brown Corpus – 87 tags
2) Penn Treebank – 45 tags
3) British National Corpus – 61 tags
4) C7 – 164 tags

There are also tag sets that include tags for phrases.
Tag set example (Penn Treebank, excerpt): PRP – personal pronoun, PRP$ – possessive pronoun.
POS tagging is ambiguous most of the time, which is why the right tag
for each word cannot be found easily. For example, suppose we want to
tag the ambiguous sentence:

Time flies like an arrow.

Possibilities:
1) Time/NN flies/NN like/VB an/AT arrow/NN.

2) Time/VB flies/NN like/IN an/AT arrow/NN.

3) Time/NN flies/VBZ like/IN an/AT arrow/NN.

Here 3) is correct, but there are many possibilities and it is not
obvious which one to choose. Only someone with a good command of
grammar and vocabulary can tell the difference.
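
The blow-up is easy to see by enumerating every tag sequence that a small lexicon permits. A short sketch, where the per-word ambiguity classes are illustrative assumptions:

```python
# Enumerate all candidate tag sequences for the sentence.  The
# per-word tag sets below are illustrative assumptions, not a lexicon.
from itertools import product

possible_tags = {
    "Time":  ["NN", "VB"],
    "flies": ["NN", "VBZ"],
    "like":  ["VB", "IN"],
    "an":    ["AT"],
    "arrow": ["NN"],
}

sentence = ["Time", "flies", "like", "an", "arrow"]
for seq in product(*(possible_tags[w] for w in sentence)):
    print(" ".join(f"{w}/{t}" for w, t in zip(sentence, seq)))
# 2 * 2 * 2 * 1 * 1 = 8 candidate sequences for one short sentence;
# the number grows exponentially with sentence length.
```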
1) Rule-Based POS tagging
* e.g., ENGTWOL Tagger
* large collection (> 1000) of constraints on what
sequences of tags are allowable

2) Stochastic (Probabilistic) tagging
* e.g., HMM Tagger
* I’ll discuss this in a bit more detail

3) Transformation-based tagging
* e.g., Brill’s tagger
* Combination of Rule-Based and Stochastic
methodologies.
Input: a string of words and a tagset (e.g. "Book that flight", Penn
Treebank tagset)

Output: a single best tag for each word (e.g. Book/VB
that/DT flight/NN ./.)

Problem: resolve ambiguity → disambiguation
Example: book (Hand me that book. / Book that flight.)
Set of states – all possible tags
Output alphabet – all words in the language
State/tag transition probabilities
Initial state probabilities: the probability of beginning a
 sentence with a tag t, P(t0 → t)
Output probabilities – producing word w at state t
Output sequence – observed word sequence
State sequence – underlying tag sequence
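
Concretely, these ingredients reduce to three probability tables. A minimal sketch of the parameter layout, with toy numbers invented purely for illustration:

```python
# The HMM parameters as plain probability tables.  All numbers are toy
# values invented for illustration only.
initial = {                  # P(t0 -> t): begin a sentence with tag t
    "AT": 0.6, "NN": 0.3, "VB": 0.1,
}
transition = {               # P(tj -> tk): tag-to-tag transitions
    ("AT", "NN"): 0.9, ("AT", "VB"): 0.1,
    ("NN", "NN"): 0.5, ("NN", "VB"): 0.5,
    ("VB", "AT"): 0.7, ("VB", "NN"): 0.3,
}
emission = {                 # P(w | t): produce word w at state/tag t
    ("AT", "the"): 0.7, ("AT", "a"): 0.3,
    ("NN", "mother"): 0.1, ("NN", "baby"): 0.1,
    ("VB", "kissed"): 0.05,
}
```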
First-order (bigram) Markov assumptions:

  1) Limited Horizon: a tag depends only on the previous tag
       P(ti+1 = tk | t1 = tj1, …, ti = tji) = P(ti+1 = tk | ti = tj)

  2) Time invariance: no change over time
       P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj → tk)

Output probabilities:

  1) Probability of getting word wk for tag tj: P(wk | tj)

  2) Assumption: a word depends only on its own tag,
       P(wi | w1, …, wi-1, t1, …, tn) = P(wi | ti)
     i.e. not dependent on other tags or words!
Probability of a tag sequence:

P(t1 t2 … tn) = P(t1) P(t1 → t2) P(t2 → t3) … P(tn-1 → tn)

Assume t0 is a starting tag:
                = P(t0 → t1) P(t1 → t2) P(t2 → t3) … P(tn-1 → tn)

Probability of a word sequence and tag sequence together:

P(W, T) = ∏i P(ti-1 → ti) P(wi | ti)
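
This product is straightforward to evaluate in log space. A minimal sketch, assuming the same dictionary-based tables as the previous sketch:

```python
import math

def joint_log_prob(words, tags, transition, emission, start="START"):
    """log P(W, T) = sum over i of log P(t_(i-1) -> t_i) + log P(w_i | t_i)."""
    logp, prev = 0.0, start
    for w, t in zip(words, tags):
        logp += math.log(transition[(prev, t)]) + math.log(emission[(t, w)])
        prev = t
    return logp

# Toy tables, invented for illustration.
transition = {("START", "AT"): 0.6, ("AT", "NN"): 0.9}
emission = {("AT", "the"): 0.7, ("NN", "baby"): 0.1}
print(joint_log_prob(["the", "baby"], ["AT", "NN"], transition, emission))
# log(0.6 * 0.7 * 0.9 * 0.1) ≈ -3.28
```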
Labeled training data = each word has a POS tag.

Thus:
         PMLE(tj) = C(tj) / N
         PMLE(tj → tk) = C(tj, tk) / C(tj)
         PMLE(wk | tj) = C(tj : wk) / C(tj)
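
Collecting these counts from a tagged corpus takes only a few lines. A minimal sketch, assuming the corpus is given as lists of (word, tag) pairs:

```python
from collections import Counter

def mle_estimates(tagged_sentences):
    """Count-based MLE for tag, transition, and emission probabilities.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    tag_counts, pair_counts, emit_counts = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "START"
        tag_counts[prev] += 1
        for word, tag in sent:
            tag_counts[tag] += 1
            pair_counts[(prev, tag)] += 1      # C(tj, tk)
            emit_counts[(tag, word)] += 1      # C(tj : wk)
            prev = tag
    n = sum(tag_counts.values())
    p_tag = {t: c / n for t, c in tag_counts.items()}
    p_trans = {(a, b): c / tag_counts[a] for (a, b), c in pair_counts.items()}
    p_emit = {(t, w): c / tag_counts[t] for (t, w), c in emit_counts.items()}
    return p_tag, p_trans, p_emit

corpus = [[("the", "AT"), ("baby", "NN")], [("book", "VB"), ("that", "DT")]]
print(mle_estimates(corpus)[1])   # e.g. PMLE(START -> AT) = 1/2 = 0.5
```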
1) D(0, START) = 0
2) for each tag t ≠ START do: D(0, t) = -∞
3) for i ← 1 to N do:
       for each tag tj do:
           D(i, tj) ← maxk [ D(i-1, tk) + lm(tk → tj) + lm(wi | tj) ]
           record best(i, j) = k which yielded the max

Then:
1) log P(W, T) = maxj D(N, tj)
2) Reconstruct the path backwards from the maximizing j

Where: lm(.) = log m(.), and D(i, tj) is the maximum joint log-probability
of the state and word sequences up to position i, ending at tag tj.
Complexity: O(Nt² · N), for Nt tags and a sentence of N words
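
The recurrence translates directly into code. A minimal log-space sketch, assuming the same dictionary-based tables as the earlier sketches:

```python
import math

def viterbi(words, tags, transition, emission, start="START"):
    """Most likely tag sequence under a bigram HMM, computed in log space."""
    lm = lambda p: math.log(p) if p > 0 else float("-inf")
    D = [{start: 0.0}]      # D[i][t] = best log-prob of w_1..w_i ending in t
    best = [{}]             # best[i][t] = previous tag on the best path
    for i, w in enumerate(words, start=1):
        D.append({})
        best.append({})
        for tj in tags:
            k, score = max(
                ((tk, D[i-1][tk] + lm(transition.get((tk, tj), 0.0))
                      + lm(emission.get((tj, w), 0.0)))
                 for tk in D[i-1]),
                key=lambda kv: kv[1])
            D[i][tj], best[i][tj] = score, k
    # Reconstruct the path backwards from the best final tag.
    t = max(D[-1], key=D[-1].get)
    path = [t]
    for i in range(len(words), 1, -1):
        t = best[i][t]
        path.append(t)
    return list(reversed(path))

tags = ["AT", "NN"]
transition = {("START", "AT"): 0.6, ("AT", "NN"): 0.9, ("NN", "NN"): 0.5}
emission = {("AT", "the"): 0.7, ("NN", "baby"): 0.1}
print(viterbi(["the", "baby"], tags, transition, emission))  # ['AT', 'NN']
```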
Most probable tag sequence given the text:

      T*     = arg maxT Pm(T | W)
             = arg maxT Pm(W | T) Pm(T) / Pm(W)
                    (Bayes' theorem)
             = arg maxT Pm(W | T) Pm(T)
                    (Pm(W) is constant for all T)
             = arg maxT ∏i [ m(ti-1 → ti) m(wi | ti) ]
             = arg maxT Σi log[ m(ti-1 → ti) m(wi | ti) ]

There is an exponential number of possible tag sequences, so dynamic
programming is used for efficient computation.
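
For contrast, the brute-force alternative scores every one of the exponentially many tag sequences explicitly, which is exactly the cost the dynamic program avoids. A sketch under the same toy-table assumptions:

```python
from itertools import product
import math

def brute_force_best(words, tags, transition, emission, start="START"):
    """arg max over all |tags|**N tag sequences; exponential, shown only
    to contrast with the Viterbi dynamic program."""
    def score(seq):
        logp, prev = 0.0, start
        for w, t in zip(words, seq):
            p = transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
            if p == 0.0:
                return float("-inf")
            logp += math.log(p)
            prev = t
        return logp
    return max(product(tags, repeat=len(words)), key=score)

transition = {("START", "AT"): 0.6, ("AT", "NN"): 0.9}
emission = {("AT", "the"): 0.7, ("NN", "baby"): 0.1}
print(brute_force_best(["the", "baby"], ["AT", "NN"], transition, emission))
# ('AT', 'NN'), after scoring 2**2 = 4 sequences; Viterbi needs only
# O(Nt^2 * N) work.
```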
Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
tomorrow/NN

People/NNS continue/VBP to/TO inquire/VB the/DT
reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

to/TO race/???

the/DT race/???

ti = arg maxj P(tj | ti-1) P(wi | tj)

max[ P(VB|TO) P(race|VB) , P(NN|TO) P(race|NN) ]

Brown corpus estimates:
P(NN|TO) = .021  ×  P(race|NN) = .00041  →  product = .000007
P(VB|TO) = .34   ×  P(race|VB) = .00003  →  product = .00001

The VB product is larger, so race is tagged VB after to/TO.
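
The comparison itself is one line of arithmetic; a quick check of the two products as reported above:

```python
# Check the two products (figures as reported in the slide; the first
# product, 0.021 * 0.00041, is rounded to .000007 there).
p_nn = 0.021 * 0.00041   # ≈ 0.0000086
p_vb = 0.34 * 0.00003    # ≈ 0.0000102
print("race/VB" if p_vb > p_nn else "race/NN")   # race/VB
```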