Overview of the design of the CLUES database, developed as an aid to the comparative method in historical linguistics, including the strategies used to detect correlate forms (potential cognates) and the metrics used to rate similarity of form and meaning.
The CLUES database: automated search for linguistic cognates
1. The CLUES database: automated search for cognate forms
Australian Linguistics Society Conference, Canberra
4 December 2011
Mark Planigale (Mark Planigale Research & Consultancy)
Tonya Stebbins (RCLT, La Trobe University)
2. Introduction
Overview of the design of the CLUES database, being developed as a tool to aid the search for correlates across multiple datasets
Linguistic model underlying the database
Explore key issues in developing the methodology
Show examples of output from the database
Because the design of CLUES is relatively generic, it is potentially applicable to a wide range of languages, and to tasks other than correlate detection.
4. What is CLUES?
“Correlate Linking and User-defined Evaluation System”
A database designed to handle lexical data from multiple languages simultaneously; it uses add-on modules for comparative functions
Primary purpose: identify correlates across two or more languages
Correlate: a pair of lexemes which are similar in phonetic form and/or meaning
The linguist assesses which of the identified correlates are cognates, and which are similar for some other reason (borrowing, universal tendencies, accidental similarity)
Allows the user to adjust the criteria used to evaluate the degree of correlation between lexemes
Can store, filter and organise the results of comparisons
6. A few examples
Lowe & Mazaudon 1994 – ‘Reconstruction Engine’ (models the operation of proposed sound change rules as a means of checking hypotheses)
Nakhleh et al. 2005 – Indo-European, phylogenetic
Holman et al. 2008 – Automated Similarity Judgment Program – 4350 languages; 40 lexical items (edit distance); 85 most stable grammatical (typological) features from the WALS database
Austronesian Basic Vocabulary Database – 874 mostly Austronesian languages, each represented by around 210 words (http://language.psy.auckland.ac.nz/austronesian/; the project had a phylogenetic focus and did some manual comparative work in preparing the data)
Greenhill & Gray 2009 – Austronesian, phylogenetic
Dunn, Burenhult et al. 2011 – Aslian
Proto-Tai'o'Matic – “merges, searches, and extends several wordlists and proposed reconstructions of proto-Tai and Southwestern Tai” (http://crcl.th.net/index.html?main=http%3A//crcl.th.net/crcl/assoc.htm)
7. Broad vs. deep approaches to automated lexical comparison

| Parameter | ‘Broad and shallow’ | ‘Narrow and deep’ |
|---|---|---|
| Language sample | Relatively large | Relatively small |
| Vocabulary sample | Constrained, based on a standardised wordlist (e.g. Swadesh 200, 100 or 40) | All available lexical data for selected languages |
| Purpose | Establish (hypothesised) genetic relationships | Linguistic and/or cultural reconstruction; model language contact and semantic shift |
| Method | Lexicostatistics; phylogenetics | Comparative method with fuzzy matching |
| Typical metrics | Phonetic (e.g. edit distance); typological (shared grammatical features); maximum likelihood | Phonetic (e.g. edit distance); semantic; grammatical |

CLUES comparisons can be constrained to core vocabulary (using the wordlist feature); however, it is intended to be used within a ‘narrow and deep’ approach.
9. CLUES: Desiderata
Accuracy
• Results agree with human expert judgment
• Minimisation of false positives and negatives
Validity
• Computed similarity level does measure degree of correlation
• Computed similarity level varies directly with cognacy
Reliability
• Like results for like comparison pairs
• Like results for a single comparison pair on repetition
Generalisability
• System performs accurately on new (‘unseen’) data as well as on the data that the similarity metrics were ‘trained’ on
Efficiency
• Comparisons are performed fast enough to be useful
10. Lexical model (partial)
[Entity-relationship diagram. A Language has many Lexemes; each Lexeme carries an orthography, part of speech and temporal information, and is linked to a Source. A Lexeme has many Senses and many Written forms; Senses are linked (many-to-many) to Wordlist items, Glosses and Semantic domains; Written forms are linked to Phones.]
11. Three dimensions of lexical similarity

| Dimension of comparison | Data fields currently available |
|---|---|
| Phonetic/phonological (phonetic form of lexeme) | Written form (mapped to phonetic content) |
| Semantic (meaning of lexeme) | Semantic domain; Gloss |
| Grammatical (grammatical features of lexeme) | Word class |

In the context of correlate detection, grammatical features may be of interest as a ‘dis-similarising’ feature for lexemes that are highly correlated on form and meaning.
12. What affects the results?
Selection and evaluation of metrics
• Choice of appropriate formal (quantifiable) criteria for similarity
• Impact: validity of results; generalisability of system
Inconsistent representations
• Systematic differences in the representations used for different data sets within the corpus
• Impact: validity of results
Noise
• Random fluctuations within the data that obscure the true value of individual data items, but do not change the underlying nature of the distribution
• Impact: reliability of data, reliability of results
(These factors are listed from more to less controllable.)
13. CLUES: Managing representational issues
Automated generation of phonetic form(s) from written form(s)
Where required, manual standardisation to common lexicographic conventions
Manual assignment to a common ontology (semantic domain set)
Automated mapping onto a shared common set of grammatical features, values and terms
15. Similarity scores
[Diagram: base similarities feed weighted subtotals, which feed a weighted overall score. Written form similarity (weight w1) gives the Form subtotal; Gloss similarity (w2) and Semantic domain similarity (w3) combine into the Meaning subtotal; Wordclass similarity (w4) gives the Grammar subtotal. The Form, Meaning and Grammar subtotals are then combined with weights w5, w6 and w7 to give the overall total score.]
16. Ura ɣunǝga vs. Mali kunēngga ‘sun’

| 4a. | Lexeme 1: ɣunǝga | Lexeme 2: kunēngga | Base similarity | Weight | Subtotal | Weight |
|---|---|---|---|---|---|---|
| Written form(s) | [ɣunǝga] | [ɣunǝŋga] | 0.896 | 1.0 | 0.896 | 0.45 |
| Gloss(es) | sun | sun | 1.0 | 0.5 | } 1.0 | } 0.45 |
| Semantic domain(s) | A3 | A3 | 1.0 | 0.5 | | |
| Wordclass | N | N | 1.0 | 1.0 | 1.0 | 0.1 |

Overall score: 0.953
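The arithmetic in table 4a can be checked in a few lines. This is a sketch, assuming each subtotal is the weighted mean of its base similarities and the overall score is the weighted mean of the subtotals (weights as in the table); the function name is illustrative, not the CLUES implementation.

```python
# Worked check of the Ura ɣunǝga vs. Mali kunēngga 'sun' example.

def weighted_average(scores_and_weights):
    """Weighted mean of (score, weight) pairs; weights need not sum to 1."""
    total_w = sum(w for _, w in scores_and_weights)
    if total_w == 0:
        return 0.0
    return sum(s * w for s, w in scores_and_weights) / total_w

# Base similarities and within-dimension weights from the table:
form_subtotal = weighted_average([(0.896, 1.0)])      # written form
meaning_subtotal = weighted_average([(1.0, 0.5),      # gloss
                                     (1.0, 0.5)])     # semantic domain
grammar_subtotal = weighted_average([(1.0, 1.0)])     # wordclass

# Dimension weights (0.45 / 0.45 / 0.1) combine the subtotals:
overall = weighted_average([(form_subtotal, 0.45),
                            (meaning_subtotal, 0.45),
                            (grammar_subtotal, 0.1)])
print(round(overall, 3))  # 0.953
```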
17. Sulka kolkha ‘sun’ vs. Mali dulka ‘stone’

| 4b. | Lexeme 1: kolkha | Lexeme 2: dulka | Base similarity | Weight | Subtotal | Weight |
|---|---|---|---|---|---|---|
| Written form(s) | [kolkha] | [dulka] | 0.828 | 1.0 | 0.828 | 0.45 |
| Gloss(es) | sun | stone | 0.0 | 0.5 | } 0.167 | } 0.45 |
| Semantic domain(s) | A3 | A5 | 0.333 | 0.5 | | |
| Wordclass | N | N | 1.0 | 1.0 | 1.0 | 0.1 |

Overall score: 0.548

With the weights adjusted (form 0.7, meaning 0.2, grammar 0.1; semantic domain weight set to 0.0):

| 4c. | Lexeme 1: kolkha | Lexeme 2: dulka | Base similarity | Weight | Subtotal | Weight |
|---|---|---|---|---|---|---|
| Written form(s) | [kolkha] | [dulka] | 0.828 | 1.0 | 0.828 | 0.7 |
| Gloss(es) | sun | stone | 0.0 | 0.5 | } 0.0 | } 0.2 |
| Semantic domain(s) | A3 | A5 | 0.333 | 0.0 | | |
| Wordclass | N | N | 1.0 | 1.0 | 1.0 | 0.1 |

Overall score: 0.68
18. Sample results: across domains
Small set of lexical data from 7 languages; the matrix is symmetrical; overall scores (table 5a). Columns follow the same order as the rows.

| 5a. | kabarak | ɣunǝga | kre | ka ptaik | kunēngga | slǝp | ltigi | lēt | slēpki |
|---|---|---|---|---|---|---|---|---|---|
| (tau) N J1 kabarak ‘blood’ | 1 | 0.309 | 0.2905 | 0.657 | 0.2995 | 0.5435 | 0.278 | 0.2435 | 0.541 |
| (ura) N A3 ɣunǝga ‘sun’ | 0.309 | 1 | 0.34725 | 0.2665 | 0.948 | 0.312 | 0.3515 | 0.2445 | 0.325 |
| (sul) N A5 kre ‘stone’ | 0.2905 | 0.34725 | 1 | 0.2615 | 0.33875 | 0.3395 | 0.2825 | 0.294 | 0.2745 |
| (sul) N J1 ka ptaik ‘skin’ | 0.657 | 0.2665 | 0.2615 | 1 | 0.2895 | 0.5275 | 0.2835 | 0.226 | 0.587 |
| (mal) N A3 kunēngga ‘sun’ | 0.2995 | 0.948 | 0.33875 | 0.2895 | 1 | 0.289 | 0.3025 | 0.22 | 0.3495 |
| (ura) N J1 slǝp ‘bone’ | 0.5435 | 0.312 | 0.3395 | 0.5275 | 0.289 | 1 | 0.326 | 0.3815 | 0.8905 |
| (qaq) N T1 ltigi ‘fire’ | 0.278 | 0.3515 | 0.2825 | 0.2835 | 0.3025 | 0.326 | 1 | 0.6945 | 0.371 |
| (mal) V T1 lēt ‘light a fire’ | 0.2435 | 0.2445 | 0.294 | 0.226 | 0.22 | 0.3815 | 0.6945 | 1 | 0.307 |
| (mal) N J1 slēpki ‘bone’ | 0.541 | 0.325 | 0.2745 | 0.587 | 0.3495 | 0.8905 | 0.371 | 0.307 | 1 |
19. Sample results: within a domain
Overall similarity scores (table 5b). Columns follow the same order as the rows.

| 5b. | dul | dul | dulka | aaletpala | kre | vat | fat | dududul |
|---|---|---|---|---|---|---|---|---|
| (qaq) N A5 dul ‘stone’ | 1 | 1 | 0.875 | 0.6945 | 0.7355 | 0.759 | 0.739 | 0.425 |
| (ura) N A5 dul ‘stone’ | 1 | 1 | 0.875 | 0.6945 | 0.7355 | 0.759 | 0.739 | 0.425 |
| (mal) N A5 dulka ‘stone’ | 0.875 | 0.875 | 1 | 0.776 | 0.79 | 0.7205 | 0.7355 | 0.426 |
| (tau) N A5 aaletpala ‘stone’ | 0.6945 | 0.6945 | 0.776 | 1 | 0.7375 | 0.727 | 0.73 | 0.3815 |
| (sul) N A5 kre ‘stone’ | 0.7355 | 0.7355 | 0.79 | 0.7375 | 1 | 0.7785 | 0.798 | 0.3075 |
| (kua) N A5 vat ‘stone’ | 0.759 | 0.759 | 0.7205 | 0.727 | 0.7785 | 1 | 0.9805 | 0.3095 |
| (sia) N A5 fat ‘stone’ | 0.739 | 0.739 | 0.7355 | 0.73 | 0.798 | 0.9805 | 1 | 0.298 |
| (kua) N M1 dududul ‘fighting stone’ | 0.425 | 0.425 | 0.426 | 0.3815 | 0.3075 | 0.3095 | 0.298 | 1 |
20. Sample results: within a domain
Form similarity only (table 5c). Columns follow the same order as the rows.

| 5c. | dul | dul | dulka | aaletpala | kre | vat | fat | dududul |
|---|---|---|---|---|---|---|---|---|
| (qaq) N A5 dul ‘stone’ | 1 | 1 | 0.75 | 0.389 | 0.471 | 0.518 | 0.478 | 0.6 |
| (ura) N A5 dul ‘stone’ | 1 | 1 | 0.75 | 0.389 | 0.471 | 0.518 | 0.478 | 0.6 |
| (mal) N A5 dulka ‘stone’ | 0.75 | 0.75 | 1 | 0.552 | 0.58 | 0.441 | 0.471 | 0.602 |
| (tau) N A5 aaletpala ‘stone’ | 0.389 | 0.389 | 0.552 | 1 | 0.475 | 0.454 | 0.46 | 0.513 |
| (sul) N A5 kre ‘stone’ | 0.471 | 0.471 | 0.58 | 0.475 | 1 | 0.557 | 0.596 | 0.365 |
| (kua) N A5 vat ‘stone’ | 0.518 | 0.518 | 0.441 | 0.454 | 0.557 | 1 | 0.961 | 0.369 |
| (sia) N A5 fat ‘stone’ | 0.478 | 0.478 | 0.471 | 0.46 | 0.596 | 0.961 | 1 | 0.346 |
| (kua) N M1 dududul ‘fighting stone’ | 0.6 | 0.6 | 0.602 | 0.513 | 0.365 | 0.369 | 0.346 | 1 |
21. Metrics
A wide variety of metrics can be implemented and ‘plugged into’ the comparison strategy
Metrics return a real value in the range [0.0, 1.0] representing the level of similarity of the items being compared
The user can control which set of metrics is used
Multiple comparison strategies can be applied to the same data set, and their results stored and compared
Metrics discussed here are those used to produce the sample results
General principle: “best match” – prefer false positives to false negatives
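The plug-in arrangement described above might be sketched as follows. All names here (`ComparisonStrategy`, `register`, `compare`) are invented for illustration and are not the actual CLUES API; the sketch simply treats metrics as interchangeable functions returning values in [0.0, 1.0], combined under user-controlled weights.

```python
from typing import Callable, Dict, Tuple

# A metric maps a pair of items to a similarity in [0.0, 1.0].
Metric = Callable[[object, object], float]

class ComparisonStrategy:
    """Holds a user-selected set of metrics with per-metric weights."""

    def __init__(self) -> None:
        self.metrics: Dict[str, Tuple[Metric, float]] = {}

    def register(self, name: str, metric: Metric, weight: float) -> None:
        """Plug a metric into this strategy under a given weight."""
        self.metrics[name] = (metric, weight)

    def compare(self, item1, item2) -> float:
        """Weighted mean of all registered metrics, clamped to [0, 1]."""
        total_w = sum(w for _, w in self.metrics.values())
        if total_w == 0:
            return 0.0
        score = sum(m(item1, item2) * w for m, w in self.metrics.values())
        return min(1.0, max(0.0, score / total_w))

# Usage: a trivial exact-match metric plugged into a strategy.
strategy = ComparisonStrategy()
strategy.register("exact", lambda a, b: 1.0 if a == b else 0.0, 1.0)
print(strategy.compare("sun", "sun"))    # 1.0
print(strategy.compare("sun", "stone"))  # 0.0
```

Keeping the metric signature uniform is what lets the same strategy machinery run over phonetic, semantic and grammatical comparisons alike.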
22. Phonetic form similarity metric
“Edit distance with phone substitution probability matrix”
f1, f2 := phonetic forms being compared (lists of phones, generated automatically from written forms or transcribed manually)
Apply the edit distance algorithm to f1 and f2 with the following costs:
• Deletion cost = 1.0 (constant)
• Insertion cost = 1.0 (constant)
• Substitution cost = 2 × (1 − sp), where sp is phone similarity; substitution cost falls in the range [0.0, 2.0]
dmin := minimum edit distance for f1 and f2
dmax := maximum possible edit distance for f1 and f2 (sum of the lengths of f1 and f2)
Similarity = 1 − (dmin / dmax)
This finds the maximal unbounded alignment of two forms. It can also be understood as detecting the contribution of each form to a putative combined form.
Examples:
mbias vs. biaska: dmin = 3, dmax = 11, Similarity = 1 − (3/11) = 0.727 (combined form mbiaska)
vat vs. fat: dmin = 0.236, dmax = 6, Similarity = 1 − (0.236/6) = 0.96 (combined form {v,f}at)
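A minimal sketch of this metric, assuming a standard dynamic-programming edit distance with the costs given above. The `phone_similarity` argument stands in for the phone similarity matrix of the next slide; passing an identity-only similarity reproduces the mbias/biaska example.

```python
def form_similarity(f1, f2, phone_similarity):
    """1 - (minimum edit distance / maximum possible edit distance).

    Deletion and insertion cost 1.0 each; substituting p1 for p2 costs
    2 * (1 - sp), so identical phones substitute for free.
    """
    n, m = len(f1), len(f2)
    # d[i][j] = minimum cost of editing f1[:i] into f2[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 2.0 * (1.0 - phone_similarity(f1[i - 1], f2[j - 1]))
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    d_min, d_max = d[n][m], float(n + m)
    return 1.0 - d_min / d_max if d_max else 1.0

# Identity-only phone similarity reproduces the first example:
exact = lambda p1, p2: 1.0 if p1 == p2 else 0.0
print(round(form_similarity("mbias", "biaska", exact), 3))  # 0.727
```

With a real phone similarity matrix, v/f would substitute at a small fractional cost (0.236 in the slide's example) rather than the full 2.0, which is what lifts vat vs. fat to 0.96.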
23. Phone similarity metric
Phone similarity sp for a pair of phones is a real number in the range [0, 1], drawn from a phone similarity matrix
The matrix is calculated automatically on the basis of a weighted sum of similarities between the phonetic features of the two phones
Examples of phonetic features include nasality (universal), frontness (vowels) and place of articulation (consonants)
Each phonetic feature has a set of possible values and a similarity matrix for these values; the similarity matrix is user-editable
The feature similarity matrix should reflect the probability of various paths of diachronic change
It is possible to under-specify feature values for phones
The similarity of a phone with itself is always 1.0
‘Default’ similarities can be overridden for particular phones (universal) and/or phonemes (language pair-specific)
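A toy sketch of deriving phone similarity from per-feature value-similarity tables. The features, values and weights below are invented for illustration (the real matrices are user-editable and far richer); underspecified features are simply skipped.

```python
# Per-feature value-similarity tables (user-editable in CLUES; these
# numbers are made up for the example):
PLACE_SIM = {("labial", "labial"): 1.0, ("labial", "dental"): 0.6,
             ("dental", "labial"): 0.6, ("dental", "dental"): 1.0}
VOICE_SIM = {("voiced", "voiced"): 1.0, ("voiced", "voiceless"): 0.5,
             ("voiceless", "voiced"): 0.5, ("voiceless", "voiceless"): 1.0}

# (feature name, value-similarity table, feature weight)
FEATURES = [("place", PLACE_SIM, 0.6), ("voice", VOICE_SIM, 0.4)]

def phone_similarity(phone1, phone2):
    """Weighted sum of feature similarities; phones are feature dicts."""
    total = 0.0
    weight_sum = 0.0
    for name, table, weight in FEATURES:
        v1, v2 = phone1.get(name), phone2.get(name)
        if v1 is None or v2 is None:
            continue  # underspecified feature: leave it out of the sum
        total += table[(v1, v2)] * weight
        weight_sum += weight
    return total / weight_sum if weight_sum else 1.0

v = {"place": "labial", "voice": "voiced"}     # e.g. /v/
f = {"place": "labial", "voice": "voiceless"}  # e.g. /f/
print(round(phone_similarity(v, f), 3))  # 0.8
```

A phone compared with itself scores 1.0 on every feature, so the overall similarity is always 1.0, as the slide requires.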
24. Semantic domain similarity metric
“Depth of deepest subsumer as a proportion of maximum local depth of the semantic domain tree”
n1, n2 := the semantic domains being compared (nodes in the semantic domain tree)
S := ‘subsumer’: the deepest node in the semantic domain tree that subsumes both n1 and n2
ds := depth of S in the tree (path length from the root node to S)
dm := maximum local depth of the tree (length of the longest path from the root node to an ancestor of n1 or n2)
Similarity = ds / dm

Example tree (A is the root; D and E are daughters of B; F is a daughter of D):

    A
    ├── B
    │   ├── D
    │   │   └── F
    │   └── E
    └── C ...

Examples: F vs. F = 1.0; D vs. E = 0.333; B vs. C = 0.0
See also Li et al. (2003)
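A sketch of the metric on the example tree. One wrinkle: the prose defines dm via ancestors of n1 or n2, but the worked examples (D vs. E = 0.333) only come out right if dm reaches down to the deepest node below n1 or n2, so this sketch follows the examples. All function names are illustrative.

```python
# Toy semantic domain tree matching the slide: child -> parent.
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "D"}
NODES = PARENT.keys() | {"A"}

def path_to_root(node):
    """Path from the node up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    return len(path_to_root(node)) - 1

def deepest_descendant_depth(node):
    """Depth of the deepest node in the subtree rooted at `node`."""
    return max(depth(n) for n in NODES if node in path_to_root(n))

def domain_similarity(n1, n2):
    a1, a2 = path_to_root(n1), path_to_root(n2)
    subsumer = next(n for n in a1 if n in a2)  # deepest common subsumer
    ds = depth(subsumer)
    dm = max(deepest_descendant_depth(n1), deepest_descendant_depth(n2))
    return ds / dm if dm else 1.0

print(domain_similarity("F", "F"))            # 1.0
print(round(domain_similarity("D", "E"), 3))  # 0.333
print(domain_similarity("B", "C"))            # 0.0
```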
25. Gloss similarity metric
Crude sentence comparison metric: “proportion of tokens in common”
g1, g2 := the glosses being compared
r1, r2 := reduced glosses (after removal of stop words, e.g. a, the, of)
len1, len2 := length of r1, r2 (number of tokens)
L := max(len1, len2)
If L = 0, Similarity = 1.0; otherwise:
C := count of common tokens (tokens that appear in both r1 and r2)
Similarity = C / L
Examples:
‘house’ vs. ‘house’ = 1.0
‘house’ vs. ‘a house’ = 1.0
‘house’ vs. ‘raised sleeping house’ = 0.333
‘house’ vs. ‘hut’ = 0.0
This metric needs refinement.
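The gloss metric is simple enough to state directly in code. A minimal sketch, assuming whitespace tokenisation and a small illustrative stop-word list:

```python
# Small illustrative stop-word list; the real list would be larger.
STOP_WORDS = {"a", "an", "the", "of", "to"}

def gloss_similarity(g1, g2):
    """Proportion of content tokens in common: C / max(len1, len2)."""
    r1 = [t for t in g1.lower().split() if t not in STOP_WORDS]
    r2 = [t for t in g2.lower().split() if t not in STOP_WORDS]
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0  # both glosses reduce to nothing after stop words
    common = len(set(r1) & set(r2))
    return common / longest

print(gloss_similarity("house", "a house"))                          # 1.0
print(round(gloss_similarity("house", "raised sleeping house"), 3))  # 0.333
print(gloss_similarity("house", "hut"))                              # 0.0
```

The 'house' vs. 'hut' result shows why the slide calls this crude: synonyms with no shared tokens score zero, which is where the semantic domain metric compensates.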
27. Possible extensions; unresolved questions
Extensions: find borrowings; detect duplicate lexicographic entries; orthographic conversion; ...
Analytical questions: How should tone be represented and incorporated within phonetic comparison? Should the phonetic feature system be multi-valued or binary? At what level should segmentation operate (comparison at phone, phone sequence or phoneme level)? The edit distance metric may be improved by privileging uninterrupted identical sequences.
Elaborate semantic matching: more sophisticated approaches using taxonomies (e.g. WordNet, with some way to map lexemes onto concepts) or compositional semantics (primitives).
Performance: Since comparison is parameterised, it may be possible to use genetic algorithms to optimise performance. A quantitative way to evaluate the performance of the system is needed.
Relation to theory: How much theory is embedded in the instrument? What effect does this have on results?
Inter-operability between databases is a key issue in the ultimate usability of the tool.
28. Acknowledgements
Thanks to Christina Eira, Claire Bowern, Beth Evans, Sander Adelaar, Friedel Frowein, Sheena Van Der Mark and Nicolas Tournadre for their comments and suggestions on this project.
29. References
Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil
H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer,
Anthony Grant, and Eric W. Holman. 2009. Adding typology to
lexicostatistics: a combined approach to language classification.
Linguistic Typology 13: 167-179.
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka
Velupillai, André Müller, and Dik Bakker. 2008. Explorations in
automated language classification. Folia Linguistica 42.2: 331-354.
Atkinson et al. 2005.
Li, Yuhua, Zuhair A. Bandar, and David McLean. 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15.4: 871-882.
Nakhleh, Luay, Don Ringe, and Tandy Warnow. 2005. Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420. (Cited from Bakker et al. 2009.)
Lowe, John Brandon and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics (special issue on computational phonology) 20.3: 381-417.