LiCord: Language Independent Content Word Finder

Md-Mizanur Rahoman, Tetsuya Nasukawa, Hiroshi Kanayama &
Ryutaro Ichise
April 18, 2016

Background
hundreds of languages are in use today, but only a few of them can be
mined automatically, because most have few or no NLP resources
creating NLP resources for every language is not feasible
a Content Word finding system can be considered a basic NLP resource
for a language

Content Word
definition: Content Words [ref: American Heritage Dictionary]
are nouns, most verbs, adjectives, and adverbs that refer to some
object, action, or characteristic
carry independent meaning
are usually an open class, i.e., new words can be added
example: “NO8DO” is the official motto of Seville.
usage: Content Words can be used for
(new) topic identification
document summarization
question answering, etc.

Problem & Possible Solution
problem
finding Content Words requires language-dependent NLP resources
language parser
parallel corpora, etc.
developing NLP resources for all languages is costly and “not feasible”
possible solution
morphological features of a text segment can indicate whether the
segment is a Content Word
a machine learning model can classify text segments as Content Words
a big text corpus can generate balanced morphological features for such
text segments

System Framework
the model generation has four processes:
NGram Constructor − performs text segmentation
Function Word Decider − decides which frequent n-grams are Function Words
Feature Value Calculator − devises feature values for the segments
Classifier Learner − generates a classification model to classify the
segments as Content Words

System Framework

1. NGram Constructor
segment text and construct variable-length n-grams (length counted in tokens)
calculate n-gram frequencies
Table: Variable-length n-grams and their frequencies for an exemplary
corpus T = “Japan is an Asian country. Japan is a peaceful country.”
n-gram size          n-grams and frequencies over T
size 1 (uni-gram)    {[Japan−2], [is−2], [an−1], ..., [country−2], [a−1], ...}
size 2 (bi-gram)     {[Japan is−2], [is an−1], ..., [Asian country−1], ...}
size 3 (tri-gram)    {[Japan is an−1], [is an Asian−1], [an Asian country−1], ...}
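
The segmentation and counting step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `ngram_frequencies`, whitespace tokenization, and the naive sentence split on `.`, `!`, `?` are all assumptions made for the example. N-grams are not allowed to cross sentence boundaries.

```python
from collections import Counter
import re

def ngram_frequencies(text, max_n=3):
    """Count all 1..max_n token n-grams, never crossing a sentence boundary."""
    counts = Counter()
    # Assumption: a naive split on ., !, ? is enough for this toy corpus.
    for sentence in re.split(r"[.!?]+", text):
        tokens = sentence.split()  # assumption: whitespace tokenization
        for n in range(1, max_n + 1):
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    return counts

corpus = "Japan is an Asian country. Japan is a peaceful country."
freq = ngram_frequencies(corpus)
print(freq["Japan"], freq["Japan is"], freq["an Asian country"])  # 2 2 1
```

On the exemplary corpus T this reproduces the frequencies listed in the table, for instance 2 for "Japan" and 1 for "an Asian country".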

System Framework

2. Function Word Decider
Function Words
express grammatical relationships with other words
have little lexical meaning or have ambiguous meaning
are frequent n-grams over a text document
example: “the”, “in”, “in spite of” etc.
decided by
picking a threshold number of the most frequent n-grams
matching the frequent n-grams against available translations of known
Function Words
using the threshold only, if no translation service is available
n-gram          # of tokens   frequency   frequency %
the             1             3124631     67.60
in              1             1774988     38.40
...             ...           ...         ...
united states   2             43698       0.94
...             ...           ...         ...
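
One possible reading of the decision procedure above, sketched in Python. The function name `decide_function_words`, its parameters, and the interpretation of "mapping" as a set intersection are assumptions for illustration. The frequencies for "the", "in", and "united states" come from the table; the entry for "seville" is a made-up toy value.

```python
def decide_function_words(freq, top_k, known_function_words=None):
    """Return the n-grams treated as Function Words.

    freq: mapping n-gram -> corpus frequency.
    top_k: threshold number of most frequent n-grams to consider.
    known_function_words: optional set of translated Function Words;
        when given, only frequent n-grams that appear in it are kept.
    """
    # Sort by descending frequency, breaking ties alphabetically.
    frequent = sorted(freq, key=lambda g: (-freq[g], g))[:top_k]
    if known_function_words is None:
        return set(frequent)  # threshold only
    return {g for g in frequent if g in known_function_words}

# "seville": 12 is a hypothetical toy value, not taken from the slides.
toy_freq = {"the": 3124631, "in": 1774988, "united states": 43698, "seville": 12}
threshold_only = decide_function_words(toy_freq, top_k=3)
with_translation = decide_function_words(
    toy_freq, top_k=3, known_function_words={"the", "in", "of"}
)
print(sorted(threshold_only))    # ['in', 'the', 'united states']
print(sorted(with_translation))  # ['in', 'the']
```

The toy run shows why the translation list helps: with the threshold alone, a frequent Content Word such as "united states" would be treated as a Function Word.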

System Framework

3. Feature Value Calculator
select fifteen different morphological features of text and calculate
their values for n-grams over a big corpus
where the n-grams appear, i.e., in the beginning/middle/end part of sentences
how frequently the n-grams appear in the corpus
how the n-grams combine with Function Words, punctuation, etc.
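
The slides do not list the fifteen features, so the sketch below computes only a few plausible ones of the kinds named above: the share of occurrences at the beginning, middle, and end of a sentence, and the share of occurrences directly preceded or followed by a Function Word. The function `ngram_features` and the feature names are hypothetical, and only single-token Function Words are checked for adjacency.

```python
def ngram_features(ngram, sentences, function_words):
    """Compute position and Function Word adjacency ratios for one n-gram.

    sentences: list of token lists.
    function_words: set of single-token Function Words (an assumption).
    """
    target = ngram.split()
    n = len(target)
    counts = {"begin": 0, "mid": 0, "end": 0, "prev_fw": 0, "next_fw": 0}
    total = 0
    for tokens in sentences:
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] != target:
                continue
            total += 1
            stop = start + n
            # An n-gram covering a whole sentence is counted as "begin".
            if start == 0:
                counts["begin"] += 1
            elif stop == len(tokens):
                counts["end"] += 1
            else:
                counts["mid"] += 1
            if start > 0 and tokens[start - 1] in function_words:
                counts["prev_fw"] += 1
            if stop < len(tokens) and tokens[stop] in function_words:
                counts["next_fw"] += 1
    # Normalize to ratios; an unseen n-gram gets all zeros.
    return {name: (value / total if total else 0.0) for name, value in counts.items()}

sentences = [
    ["Japan", "is", "an", "Asian", "country"],
    ["Japan", "is", "a", "peaceful", "country"],
]
function_words = {"is", "an", "a"}
print(ngram_features("Asian", sentences, function_words))
```

Each n-gram then becomes a fixed-length numeric vector, which is the form a classifier such as an SVM or a decision tree expects.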

System Framework

4. Classifier Learner (1/2)
construct frequency-range-wise classification models
Reason
training consumes a large amount of time if all n-grams are used as
training examples
the training set does not represent the entire dataset if it is randomly picked
assumption: n-grams with the same frequency share the same kind of
morphological features (over the corpus)

4. Classifier Learner (2/2)
construct frequency-range-wise classification models
Method
collect range-based n-grams
X(i,j) = {x | x ∈ N ∧ i ≤ frq(x) ≤ j}
N = all n-grams in corpus, x = n-gram, frq(x) = frequency of x in the corpus
select a threshold number of n-grams as training n-grams for each range
calculate features for the selected n-grams of each range
learn a classification model from the training n-grams of each range
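
The range-based collection X(i,j) and the per-range selection can be sketched as below. The helper names and the use of seeded random sampling within each range are assumptions; the slides only say that a threshold number of n-grams is selected per range. The ranges are the ones reported in the result tables, and the toy frequencies are those of the exemplary corpus T. Feature calculation and model fitting are left out of this sketch.

```python
import random

def ngrams_in_range(freq, low, high):
    """X(low, high): n-grams whose corpus frequency lies in [low, high]."""
    return [g for g, f in freq.items() if low <= f <= high]

def select_training_ngrams(freq, ranges, per_range, seed=0):
    """Pick up to `per_range` training n-grams for every frequency range."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    selected = {}
    for low, high in ranges:
        candidates = sorted(ngrams_in_range(freq, low, high))
        rng.shuffle(candidates)  # assumption: random choice within a range
        selected[(low, high)] = candidates[:per_range]
    return selected

# Frequency ranges as they appear in the result tables.
RANGES = [(1, 1), (2, 2), (3, 4), (5, 9), (10, 14)]

# Uni-gram frequencies from the exemplary corpus T.
toy_freq = {"Japan": 2, "is": 2, "an": 1, "country": 2, "a": 1}
training = select_training_ngrams(toy_freq, RANGES, per_range=2)
print(sorted(ngrams_in_range(toy_freq, 1, 1)))  # ['a', 'an']
print(len(training[(2, 2)]))                    # 2
```

One classifier would then be trained per key of `training`, using the feature vectors of the selected n-grams.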

Experiment
check whether LiCord can identify Content Words language-independently
analyzed languages − English, Vietnamese, and Indonesian
training resources − Wikipedia pages & Wikipedia titles
+ve: when the n-gram (text segment) exists as a Wikipedia title.
E.g., Seville, official motto, etc.
-ve: otherwise.
E.g., “NO8DO” is, is the, etc.
classification algorithms − Support Vector Machine and C4.5
(tree-based algorithm)
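
The labeling rule above (+ve if the n-gram exists as a Wikipedia title, -ve otherwise) can be sketched as a set lookup. The function name `label_ngrams` and the case-insensitive matching are assumptions, and the small title set is a stand-in for a real Wikipedia title dump.

```python
def label_ngrams(ngrams, wikipedia_titles):
    """Label each n-gram True (+ve) if it matches a title, else False (-ve)."""
    # Assumption: titles are matched case-insensitively.
    titles = {title.casefold() for title in wikipedia_titles}
    return {ngram: ngram.casefold() in titles for ngram in ngrams}

# Stand-in title set; a real run would load the full list of Wikipedia titles.
toy_titles = {"Seville", "Official motto"}
candidates = ["Seville", "official motto", "“NO8DO” is", "is the"]
labels = label_ngrams(candidates, toy_titles)
print(labels)
```

With this toy title set, "Seville" and "official motto" are labeled positive and the other two segments negative, matching the examples on the slide.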

Language Independent Content Word Finding (1/2)
testing method − check whether the test n-grams are Content Words
Table: CW finding accuracy %
Frequency range   English   Indonesian   Vietnamese
(1,1)             76.68     90.56        90.30
(2,2)             83.00     93.20        94.15
(3,4)             84.37     94.23        94.76
(5,9)             83.87     95.89        93.97
(10,14)           87.09     96.15        94.95
Average           83.25     93.80        93.54

Language Independent Content Word Finding (2/2)
Table: Newly discovered Content Words finding accuracy %
Frequency range   English   Indonesian   Vietnamese
(1,1)             27.90     11.34        10.63
(2,2)             45.00     18.54        25.00
(3,4)             52.11     24.45        27.56
(5,9)             50.34     25.56        30.88
(10,14)           61.90     29.89        35.13
Average           47.45     21.95        22.50
finding − checking a large number of sentences for their specific
morphological features over a big corpus can produce a machine
learning model that finds Content Words

Conclusion
finding Content Words in a language-independent way is a requirement
in present-day text mining
we propose a supervised machine learning technique to classify
text segments as Content Words
experimental results show that the proposed method can serve as a
Content Word finder

Question & Suggestion
Md-Mizanur Rahoman, mizan@nii.ac.jp

Experiment 1 (1/2)
purpose − check whether LiCord can identify NEs (Named Entities) and
act like a sentence parser
identifying NEs − executed for some test sentences, compared with
Wikifier and Spotlight
Table: Comparison of LiCord with Wikifier
System      Recall
Wikifier    33.33%
LiCord      90.47%
Table: Comparison of LiCord with Spotlight
System      Recall
Spotlight   83.33%
LiCord      91.66%

Experiment 1 (2/2)
acting as parser − executed for some test sentences, compared with
the Stanford parser for Content Words
Table: Comparison of LiCord with Parser
Language    Recall
English     92.30%
finding − checking a large number of sentences for their specific
morphological features over a big corpus can support word
segmentation
