This presentation describes a data-mining approach to word sense detection and disambiguation in biblical texts. The goal is to identify the senses of each word in the Bible and to determine which sense each instance carries. The approach uses multiple Bible translations linked to the original texts and groups instances by the similarity of the words used to translate them, through a progressive merging technique. Senses can thus be identified automatically, efficiently, and objectively from translation data, in order to build sense dictionaries and support more refined Bible search and translation tools.
2. Outline
Motivations for word sense identification
Problems of existing word sense data
The data-mining approach
Demo and Discussion
Asia Bible Society 2
3. Motivations
Addressing the issue of polysemy
Bible translation
– Better understanding of every word
– Unification on the basis of senses rather than
words
Bible search
– More refined search results on the basis of
senses
4. Goals
Word sense detection
For each content word in the Bible, find out how
many senses it has.
Word sense disambiguation
For each instance of the word, find out which of
the senses it has.
5. Word sense detection
Identify the senses of each word:
רֵאשִׁית
Sense 1
– Sense 1-1: beginning
– Sense 1-2: first
Sense 2
– Sense 2-1: firstfruits
– Sense 2-2: firstborn
Sense 3
– Sense 3-1: best
– Sense 3-2: choicest
Sense 4
– Sense 4-1: foremost
6. Word sense disambiguation
Identify the sense of each instance:
בְּרֵאשִׁית בָּרָא אֱלֹהִים אֵת הַשָּׁמַיִם וְאֵת הָאָרֶץ׃ (Gen 1:1)
– Sense 1: beginning
קָרְבַּן רֵאשִׁית תַּקְרִיבוּ אֹתָם לַיהוָה וְאֶל־הַמִּזְבֵּחַ לֹא־יַעֲלוּ לְרֵיחַ נִיחֹחַ׃ (Lev 2:12)
– Sense 2: firstfruits
וַיִּקַּח הָעָם מֵהַשָּׁלָל צֹאן וּבָקָר רֵאשִׁית הַחֵרֶם לִזְבֹּחַ לַיהוָה אֱלֹהֶיךָ בַּגִּלְגָּל׃ (1Sm 15:21)
– Sense 3: best
הוּא רֵאשִׁית דַּרְכֵי־אֵל הָעֹשׂוֹ יַגֵּשׁ חַרְבּוֹ׃ (Job 40:19)
– Sense 4: foremost
7. Problems of Existing Data
No consensus on the number of senses
each word has
No complete data of instance-based sense
identification
Manual identification can be subjective,
inconsistent, and time-consuming
8. The Data-Mining Approach
Theoretical assumption
Data for mining
Machine learning procedures
Advantages and limitations of the
approach
Tool for sense exploration
9. Theoretical Assumption
Translators presumably use different target-language
words to translate different senses of a word. (In effect,
translators have already done the job of disambiguation
subconsciously and defined each sense with
target-language words.)
11. Basic Task
Take all instances of a word and group the
instances into different senses
12. A Simple and Naive Approach
Look at the words used in a given translation and treat
instances with the same translation words as having the
same sense.
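The naive grouping can be sketched in a few lines of Python. This is only an illustration: the `renderings` table and the function name `group_by_rendering` are made up for the example and are not taken from any actual translation data.

```python
from collections import defaultdict

# Hypothetical data: the word used to render each instance in ONE translation.
# These renderings are invented for illustration only.
renderings = {
    "Gen 1:1": "beginning",
    "Lev 2:12": "firstfruits",
    "1Sm 15:21": "best",
    "Job 40:19": "first",
    "Jer 26:1": "beginning",
}

def group_by_rendering(renderings):
    """Treat instances rendered with the same word as having the same sense."""
    groups = defaultdict(list)
    for ref, word in renderings.items():
        groups[word].append(ref)
    return dict(groups)

groups = group_by_rendering(renderings)
for word, refs in groups.items():
    print(word, refs)
```

Every distinct rendering becomes its own sense here, so the result depends entirely on the word choices of the one translation consulted.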
13. A Simple and Naive Approach
Problems:
The translations may not be consistent:
Translators may use different words to translate the
same sense or the same word to translate different
senses
It can be subjective
It only reflects the opinions of a particular translation
The senses are too fragmentary
14. The Voting Approach
Use multiple translations
Two instances of a word are considered to
have the same sense if most of the
translations use the same word to
translate them.
Check and balance
How to define “most”?
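One way to picture the vote is as a count of translations that agree on a pair of instances. The sketch below assumes each instance is represented as a tuple of renderings, one per translation, in a fixed order; the data, the function names, and the choice of a strict majority as one possible reading of "most" are all assumptions made for illustration.

```python
def agreements(a, b):
    """Count the translations that render instances a and b with the same word."""
    return sum(1 for x, y in zip(a, b) if x == y)

def same_sense(a, b, min_votes):
    """Two instances share a sense if at least min_votes translations agree."""
    return agreements(a, b) >= min_votes

# Hypothetical renderings of two instances in four translations.
inst_a = ("beginning", "beginning", "beginning", "start")
inst_b = ("beginning", "first", "beginning", "beginning")

votes = agreements(inst_a, inst_b)       # translations 1 and 3 agree
majority = len(inst_a) // 2 + 1          # one possible definition of "most"
print(votes, same_sense(inst_a, inst_b, majority))
```

With four translations a strict majority requires three votes, so this pair is not grouped; lowering the threshold to two would group it. The threshold is exactly the open question the slide raises.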
15. The Voting Approach
How many votes to get?
Maximal agreement:
– Internal consistency within groups
– Too many senses
– Too many unassigned instances
Minimal agreement:
– Better grouping of senses
– Instances of different senses may be mixed together
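The trade-off can be seen on a toy data set. The slides do not say how groups are formed from pairwise votes, so the sketch below assumes a simple single-link grouping (any pair that reaches the threshold is joined, implemented with a small union-find); the instance names and renderings are invented.

```python
from itertools import combinations

def agreements(a, b):
    """Count the translations that render a and b with the same word."""
    return sum(1 for x, y in zip(a, b) if x == y)

def group_instances(instances, min_votes):
    """Join any two instances that agree in at least min_votes translations."""
    parent = {ref: ref for ref in instances}

    def find(ref):
        while parent[ref] != ref:
            parent[ref] = parent[parent[ref]]  # path compression
            ref = parent[ref]
        return ref

    for a, b in combinations(instances, 2):
        if agreements(instances[a], instances[b]) >= min_votes:
            parent[find(a)] = find(b)

    groups = {}
    for ref in instances:
        groups.setdefault(find(ref), []).append(ref)
    return sorted(groups.values())

# Hypothetical renderings in three translations.
instances = {
    "i1": ("beginning", "beginning", "beginning"),
    "i2": ("beginning", "beginning", "beginning"),
    "i3": ("beginning", "first", "start"),
    "i4": ("best", "choicest", "finest"),
    "i5": ("best", "first", "choice"),
}

maximal = group_instances(instances, min_votes=3)
minimal = group_instances(instances, min_votes=1)
print(maximal)
print(minimal)
```

At maximal agreement the only group is internally consistent, but three instances are left on their own. At minimal agreement a chain of single shared renderings pulls all five instances into one group, mixing "beginning" with "best".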
16. Progressive Merging
Trying to get the benefits of both maximal and minimal
agreement and avoid their disadvantages
Start with maximal agreement to get initial sense groups
that are internally consistent
Gradually merge the initial groups with a decreasing
number of agreements N (0 < N < maximal) and with
a variable association rate R (0 < R < 1)
Group B is merged into Group A if A contains B
A contains B if each instance in B is linked to at least a fraction R
of the instances in A by at least N agreements.
Pair-wise merge until no further merge can be done
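The merging rule can be sketched as follows. This is a minimal reading of the rule, not the presenters' implementation: it assumes the initial groups are already given, that "N agreements" means agreement in at least N translations, and that pairs are tried in list order. All names and data are hypothetical.

```python
def agreements(a, b):
    """Count the translations that render a and b with the same word."""
    return sum(1 for x, y in zip(a, b) if x == y)

def contains(group_a, group_b, data, n, r):
    """A contains B if every instance in B agrees, in at least n translations,
    with at least a fraction r of the instances in A."""
    for b in group_b:
        linked = sum(1 for a in group_a if agreements(data[a], data[b]) >= n)
        if linked < r * len(group_a):
            return False
    return True

def merge_pass(groups, data, n, r):
    """Pair-wise merge at a fixed n until no further merge can be done."""
    groups = [list(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(len(groups)):
                if i != j and contains(groups[i], groups[j], data, n, r):
                    groups[i].extend(groups[j])  # B merges into A
                    del groups[j]
                    merged = True
                    break
            if merged:
                break  # group list changed; restart the scan
    return groups

def progressive_merge(initial_groups, data, max_n, r):
    """Repeat the merge pass for n = max_n - 1 down to 1."""
    groups = initial_groups
    for n in range(max_n - 1, 0, -1):
        groups = merge_pass(groups, data, n, r)
    return groups

# Hypothetical renderings in four translations.
data = {
    "a1": ("beginning", "beginning", "beginning", "beginning"),
    "a2": ("beginning", "beginning", "beginning", "beginning"),
    "b1": ("beginning", "beginning", "beginning", "start"),
    "c1": ("best", "best", "choicest", "finest"),
    "d1": ("best", "best", "choice", "prime"),
}
initial = [["a1", "a2"], ["b1"], ["c1"], ["d1"]]  # from maximal agreement
result = progressive_merge(initial, data, max_n=4, r=0.7)
print(result)
```

In this toy run `b1` joins the first group at N = 3, `d1` joins `c1` at N = 2, and the two resulting groups stay apart at N = 1 because they share no rendering at all.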
18. Progressive Merging
Example: Maximal N = 4, R = 70%
Merge 1: N-1 = 3
B merges into A if each instance in B is linked to at least 70% of the
instances in A by sharing the same translation in at least 3 versions
Merge 2: N-2 = 2
B merges into A if each instance in B is linked to at least 70% of the
instances in A by sharing the same translation in at least 2 versions
Merge 3: N-3 = 1
B merges into A if each instance in B is linked to at least 70% of the
instances in A by sharing the same translation in at least 1 version
19. Tuning the Variables
We can get different results by tuning the
following variables:
The translations to use
The number of translations to use
The number of merges to perform
The association rate
20. The “Accents” of Senses
Senses based on English translations
Senses based on Chinese translations
Senses based on both English and Chinese
The triangulating effect of using different
translations
21. Factors Affecting the Results
The versions of translations that are used
The quality of each translation
The degree of consensus between different
translations
Quality of lemmatization in English
Surface forms vs. lemmatized forms
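The effect of comparing surface forms versus lemmatized forms can be illustrated with a toy example. The lemma table below is a hand-written stand-in for a real lemmatizer, and the renderings are invented; both are assumptions for illustration only.

```python
# Toy lemma table standing in for a real English lemmatizer (hypothetical).
LEMMAS = {"beginnings": "beginning", "firstfruits": "firstfruit"}

def lemmatize(word):
    """Lowercase the word and map it to its lemma if the toy table knows it."""
    word = word.lower()
    return LEMMAS.get(word, word)

def agreements(a, b, normalize=lambda w: w):
    """Count translations that agree on a and b after normalizing each word."""
    return sum(1 for x, y in zip(a, b) if normalize(x) == normalize(y))

# Hypothetical renderings of two instances in three translations.
inst_a = ("beginning", "Beginnings", "first")
inst_b = ("beginnings", "beginning", "first")

surface_votes = agreements(inst_a, inst_b)
lemma_votes = agreements(inst_a, inst_b, normalize=lemmatize)
print(surface_votes, lemma_votes)
```

On surface forms only one translation agrees; after lemmatization all three do. Lemmatization errors would distort the vote in the same way, which is why its quality matters.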
22. Other Features Considered
Syntactic contexts
– Instances that occur in similar syntactic contexts tend
to have the same sense
– Not used because of sparse data problem
Morphological information
– Verbs with different stems in Hebrew tend to have
different senses
– Not used because the stem distinctions do not always
correspond well with sense distinctions
23. Editing Options
The data shown here has not been manually
edited, but it can be edited using the tool:
Merge sense groups
Split a sense group
Move an instance from one sense group to
another
Use of manual information in automatic learning
26. Advantages of the Current Approach
Efficiency: a sense dictionary that lists not only
the senses but also the specific instances of each
sense can be built in a matter of days.
Objectivity: the results are based on actual data
and no pre-conceived subjective categorization
is required.
Flexibility: the granularity of sense divisions can
be adjusted by the values of similarity metrics in
the clustering process.
27. Conclusion
A great tool for exploring and studying word
senses in biblical texts