The CLUES database: automated search for cognate forms
Australian Linguistics Society Conference, Canberra
4 December 2011
Mark Planigale (Mark Planigale Research & Consultancy)
Tonya Stebbins (RCLT, La Trobe University)
Introduction
•  Overview of the design of the CLUES database, being developed as a tool to aid the search for correlates across multiple datasets
•  Linguistic model underlying the database
•  Explore key issues in developing the methodology
•  Show examples of output from the database

•  Because the design of CLUES is relatively generic, it is potentially applicable to a wide range of languages, and to tasks other than correlate detection.
Context
What is CLUES?
•  “Correlate Linking and User-defined Evaluation System”.
•  Database designed to simultaneously handle lexical data from multiple languages. It uses add-on modules for comparative functions.
•  Primary purpose: identify correlates across two or more languages.
   ◦  Correlate: pair of lexemes which are similar in phonetic form and/or meaning
   ◦  The linguist assesses which of the identified correlates are cognates, and which are similar due to some other reason (borrowing, universal tendencies, accidental similarity)
•  Allows the user to adjust the criteria used to evaluate the degree of correlation between lexemes.
•  It can store, filter and organise results of comparisons.
Computational methods in historical linguistics
•  Lexicostatistics
•  Typological comparison
•  Phylogenetics
•  Phoneme inventory comparison
•  Modelling effects of sound change rules

•  Correlate search > CLUES
A few examples
•  Lowe & Mazaudon 1994 – ‘Reconstruction Engine’ (models the operation of proposed sound change rules as a means of checking hypotheses)
•  Nakhleh et al. 2005 – Indo-European, phylogenetic
•  Holman et al. 2008 – Automated Similarity Judgment Program – 4350 languages; 40 lexical items (edit distance); 85 most stable grammatical (typological) features from the WALS database.
•  Austronesian Basic Vocabulary Database: 874 mostly Austronesian languages, each language represented by around 210 words. http://language.psy.auckland.ac.nz/austronesian/ (project had a phylogenetic focus; some manual comparative work was done in preparing the data)
•  Greenhill & Gray 2009 – Austronesian, phylogenetic
•  Dunn, Burenhult et al. 2011 – Aslian
•  Proto-Tai'o'Matic (“merges, searches, and extends several wordlists and proposed reconstructions of proto-Tai and Southwestern Tai” http://crcl.th.net/index.html?main=http%3A//crcl.th.net/crcl/assoc.htm)
Broad vs. deep approaches to automated lexical comparison

Parameter         | ‘Broad and shallow’                         | ‘Narrow and deep’
Language sample   | Relatively large                            | Relatively small
Vocabulary sample | Constrained, based on a standardised        | All available lexical data for
                  | wordlist (e.g. Swadesh 200, 100 or 40)      | selected languages
Purpose           | Establish (hypothesised) genetic            | Linguistic and/or cultural reconstruction;
                  | relationships                               | model language contact and semantic shift
Method            | Lexicostatistics; phylogenetics             | Comparative method with fuzzy matching
Typical metrics   | Phonetic (e.g. edit distance); typological  | Phonetic (e.g. edit distance); semantic;
                  | (shared grammatical features);              | grammatical
                  | maximum likelihood                          |

CLUES comparisons can be constrained to core vocabulary (using the wordlist feature); however, it is intended to be used within a ‘narrow and deep’ approach.
Design of CLUES
CLUES: Desiderata
•  Accuracy: results agree with human expert judgment; minimisation of false positives and negatives
•  Validity: computed similarity level does measure degree of correlation; computed similarity level varies directly with cognacy
•  Reliability: like results for like comparison pairs; like results for a single comparison pair on repetition
•  Generalisability: system performs accurately on new (‘unseen’) data as well as the data that the similarity metrics were ‘trained’ on
•  Efficiency: comparisons are performed fast enough to be useful
Lexical model (partial)
[Entity-relationship diagram] Each Language has many Lexemes. A Lexeme (carrying part of speech and temporal information) is linked to one Source, to one or more Written forms (each belonging to an Orthography and consisting of a sequence of Phones), and to one or more Senses. Each Sense is linked to Glosses, to a Semantic domain, and (many-to-many) to Wordlist items.
Three dimensions of lexical similarity

Dimension of comparison                           | Data fields currently available
Phonetic / phonological (phonetic form of lexeme) | Written form (mapped to phonetic content)
Semantic (meaning of lexeme)                      | Semantic domain; Gloss
Grammatical (grammatical features of lexeme)      | Word class

•  In the context of correlate detection, grammatical features may be of interest as a ‘dis-similarising’ feature for lexemes that are highly correlated on form and meaning.
What affects the results?
Selection and evaluation of metrics
• Choice of appropriate formal (quantifiable) criteria for similarity
• Impact: validity of results; generalisability of system

Inconsistent representations
• Systematic differences in the representations used for different data sets within the corpus
• Impact: validity of results

Noise
• Random fluctuations within the data that obscure the true value of individual data items, but do not change the underlying nature of the distribution
• Impact: reliability of data, reliability of results

(The three factors are ordered from more to less controllable.)
CLUES: Managing representational issues
•  automated generation of phonetic form(s) from written form(s) – see the sketch after this list
•  where required, manual standardisation to common lexicographic conventions
•  manual assignment to a common ontology
   ◦  semantic domain set
•  automated mapping onto a shared common set of grammatical features, values and terms
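As a rough illustration of the first point, a per-orthography grapheme-to-phone table can be applied to a written form, longest match first. The table below is invented solely to reproduce the kunēngga → [ɣunǝŋga] mapping shown in example 4a later in these slides; it is not the actual Mali orthography specification (it would, for instance, mishandle dulka → [dulka], where k stays k):

def to_phones(written, table):
    """Convert a written form to a list of phones, longest grapheme match first."""
    keys = sorted(table, key=len, reverse=True)  # try trigraphs/digraphs before single letters
    phones, i = [], 0
    while i < len(written):
        for k in keys:
            if written.startswith(k, i):
                phones.extend(table[k])
                i += len(k)
                break
        else:
            i += 1  # no mapping for this character: skip it
    return phones

# Illustrative (NOT real) Mali grapheme-to-phone mappings:
GRAPHEME_TABLE = {
    "ngg": ["ŋ", "g"],
    "k": ["ɣ"], "ē": ["ǝ"],
    "u": ["u"], "n": ["n"], "a": ["a"],
}

print(to_phones("kunēngga", GRAPHEME_TABLE))  # ['ɣ', 'u', 'n', 'ǝ', 'ŋ', 'g', 'a']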
Calculating similarity
Similarity scores
[Weighted-sum tree] Four base similarity scores – written form similarity, semantic domain similarity, gloss similarity and wordclass similarity – are combined via weights w1–w4 into three subtotals (form subtotal, meaning subtotal, grammar subtotal); the subtotals are combined via weights w5–w7 into the overall score.
Ura ɣunǝga vs. Mali kunēngga ‘sun’

4a.                | Lexeme 1        | Lexeme 2           | Base similarity | Weight | Subtotal      | Subtotal weight
Written form(s)    | ɣunǝga [ɣunǝga] | kunēngga [ɣunǝŋga] | 0.896           | 1.0    | 0.896         | 0.45
Gloss(es)          | sun             | sun                | 1.0             | 0.5    | 1.0 (meaning) | 0.45
Semantic domain(s) | A3              | A3                 | 1.0             | 0.5    |               |
Wordclass          | N               | N                  | 1.0             | 1.0    | 1.0           | 0.1
Overall score: 0.953
Sulka kolkha ‘sun’ vs. Mali dulka ‘stone’

4b.                | Lexeme 1        | Lexeme 2      | Base similarity | Weight | Subtotal        | Subtotal weight
Written form(s)    | kolkha [kolkha] | dulka [dulka] | 0.828           | 1.0    | 0.828           | 0.45
Gloss(es)          | sun             | stone         | 0.0             | 0.5    | 0.167 (meaning) | 0.45
Semantic domain(s) | A3              | A5            | 0.333           | 0.5    |                 |
Wordclass          | N               | N             | 1.0             | 1.0    | 1.0             | 0.1
Overall score: 0.548

4c. The same pair, with weights shifted towards form

                   | Lexeme 1        | Lexeme 2      | Base similarity | Weight | Subtotal        | Subtotal weight
Written form(s)    | kolkha [kolkha] | dulka [dulka] | 0.828           | 1.0    | 0.828           | 0.7
Gloss(es)          | sun             | stone         | 0.0             | 0.5    | 0.0 (meaning)   | 0.2
Semantic domain(s) | A3              | A5            | 0.333           | 0.0    |                 |
Wordclass          | N               | N             | 1.0             | 1.0    | 1.0             | 0.1
Overall score: 0.68
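A minimal sketch of the weighted-sum scoring illustrated above, assuming each subtotal and the overall score are weight-normalised averages (the function and parameter names are ours, not CLUES internals):

def weighted_average(pairs):
    """Combine (score, weight) pairs into a weight-normalised average."""
    total = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total if total else 0.0

def overall_score(form_sim, domain_sim, gloss_sim, wordclass_sim,
                  w1=1.0, w2=0.5, w3=0.5, w4=1.0,   # base-score weights
                  w5=0.45, w6=0.45, w7=0.1):         # subtotal weights
    form = weighted_average([(form_sim, w1)])
    meaning = weighted_average([(domain_sim, w2), (gloss_sim, w3)])
    grammar = weighted_average([(wordclass_sim, w4)])
    return weighted_average([(form, w5), (meaning, w6), (grammar, w7)])

# Reproduces the worked examples above:
print(overall_score(0.896, 1.0, 1.0, 1.0))                       # ≈ 0.953 (4a)
print(overall_score(0.828, 0.333, 0.0, 1.0))                     # ≈ 0.548 (4b)
print(overall_score(0.828, 0.333, 0.0, 1.0,
                    w2=0.0, w5=0.7, w6=0.2))                     # ≈ 0.680 (4c)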
Sample results: across domains
•  Small set of lexical data from 7 languages; symmetric matrix of overall scores

5a.                    kabarak  ɣunǝga   kre      ka ptaik  kunēngga  slǝp     ltigi    lēt       slēpki
                       'blood'  'sun'    'stone'  'skin'    'sun'     'bone'   'fire'   'light    'bone'
                                                                                        a fire'
(tau) N J1 kabarak     1        0.309    0.2905   0.657     0.2995    0.5435   0.278    0.2435    0.541
(ura) N A3 ɣunǝga      0.309    1        0.34725  0.2665    0.948     0.312    0.3515   0.2445    0.325
(sul) N A5 kre         0.2905   0.34725  1        0.2615    0.33875   0.3395   0.2825   0.294     0.2745
(sul) N J1 ka ptaik    0.657    0.2665   0.2615   1         0.2895    0.5275   0.2835   0.226     0.587
(mal) N A3 kunēngga    0.2995   0.948    0.33875  0.2895    1         0.289    0.3025   0.22      0.3495
(ura) N J1 slǝp        0.5435   0.312    0.3395   0.5275    0.289     1        0.326    0.3815    0.8905
(qaq) N T1 ltigi       0.278    0.3515   0.2825   0.2835    0.3025    0.326    1        0.6945    0.371
(mal) V T1 lēt         0.2435   0.2445   0.294    0.226     0.22      0.3815   0.6945   1         0.307
(mal) N J1 slēpki      0.541    0.325    0.2745   0.587     0.3495    0.8905   0.371    0.307     1
Sample results: within a domain

5b. Overall similarity  dul      dul      dulka    aaletpala  kre      vat      fat      dududul
    score               (qaq)    (ura)    (mal)    (tau)      (sul)    (kua)    (sia)    (kua) 'fighting
                        'stone'  'stone'  'stone'  'stone'    'stone'  'stone'  'stone'  stone'
(qaq) N A5 dul          1        1        0.875    0.6945     0.7355   0.759    0.739    0.425
(ura) N A5 dul          1        1        0.875    0.6945     0.7355   0.759    0.739    0.425
(mal) N A5 dulka        0.875    0.875    1        0.776      0.79     0.7205   0.7355   0.426
(tau) N A5 aaletpala    0.6945   0.6945   0.776    1          0.7375   0.727    0.73     0.3815
(sul) N A5 kre          0.7355   0.7355   0.79     0.7375     1        0.7785   0.798    0.3075
(kua) N A5 vat          0.759    0.759    0.7205   0.727      0.7785   1        0.9805   0.3095
(sia) N A5 fat          0.739    0.739    0.7355   0.73       0.798    0.9805   1        0.298
(kua) N M1 dududul      0.425    0.425    0.426    0.3815     0.3075   0.3095   0.298    1
Sample results: within a domain

5c. Form similarity     dul      dul      dulka    aaletpala  kre      vat      fat      dududul
    only                (qaq)    (ura)    (mal)    (tau)      (sul)    (kua)    (sia)    (kua) 'fighting
                        'stone'  'stone'  'stone'  'stone'    'stone'  'stone'  'stone'  stone'
(qaq) N A5 dul          1        1        0.75     0.389      0.471    0.518    0.478    0.6
(ura) N A5 dul          1        1        0.75     0.389      0.471    0.518    0.478    0.6
(mal) N A5 dulka        0.75     0.75     1        0.552      0.58     0.441    0.471    0.602
(tau) N A5 aaletpala    0.389    0.389    0.552    1          0.475    0.454    0.46     0.513
(sul) N A5 kre          0.471    0.471    0.58     0.475      1        0.557    0.596    0.365
(kua) N A5 vat          0.518    0.518    0.441    0.454      0.557    1        0.961    0.369
(sia) N A5 fat          0.478    0.478    0.471    0.46       0.596    0.961    1        0.346
(kua) N M1 dududul      0.6      0.6      0.602    0.513      0.365    0.369    0.346    1
Metrics
•  A wide variety of metrics can be implemented and ‘plugged into’ the comparison strategy (see the sketch after this list)
•  Metrics return a real value in the range [0.0, 1.0] representing the level of similarity of the items being compared
•  User can control which set of metrics is used
•  Can use multiple comparison strategies on the same data set and store and compare results
•  Metrics discussed here are those used to produce the sample results

•  General principle: “best match” – prefer false positives to false negatives
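One way to realise this plug-in design, as a rough sketch (the interface and names below are our assumptions, not the actual CLUES API):

from typing import Callable

# A metric is any function mapping a pair of field values to [0.0, 1.0].
Metric = Callable[[object, object], float]

def compare(lexeme1: dict, lexeme2: dict,
            metrics: dict[str, Metric]) -> dict[str, float]:
    """Apply a user-selected set of metrics, one per data field."""
    scores = {}
    for field, metric in metrics.items():
        score = metric(lexeme1[field], lexeme2[field])
        assert 0.0 <= score <= 1.0, f"{field} metric out of range"
        scores[field] = score
    return scores

A comparison strategy is then just a choice of metrics plus the weights shown earlier.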
Phonetic form similarity metric
•  “Edit distance with phone substitution probability matrix”
•  f1, f2 := phonetic forms being compared (lists of phones – generated automatically from written forms, or transcribed manually)
•  Apply the edit distance algorithm to f1 and f2 with the following costs:
   ◦  Deletion cost = 1.0 (constant)
   ◦  Insertion cost = 1.0 (constant)
   ◦  Substitution cost = 2 × (1 − sp), where sp is phone similarity. Substitution cost falls in the range [0.0, 2.0]
•  dmin := minimum edit distance for f1 and f2
•  dmax := maximum possible edit distance for f1 and f2 (sum of lengths of f1 and f2)
•  Similarity = 1 − (dmin / dmax)

•  Finds the maximal unbounded alignment of two forms. Can also be understood as detecting the contribution of each form to a putative combined form.

Examples:
mbias vs. biaska: dmin = 3, dmax = 11, Similarity = 1 − (3/11) = 0.727 (combined form mbiaska)
vat vs. fat: dmin = 0.236, dmax = 6, Similarity = 1 − (0.236/6) = 0.96 (combined form {v,f}at)
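A minimal sketch of this metric, assuming a standard dynamic-programming edit distance with the costs above (phone_sim stands in for the phone similarity lookup described on the next slide):

def form_similarity(f1, f2, phone_sim):
    """Similarity in [0.0, 1.0] between two phone lists f1 and f2."""
    n, m = len(f1), len(f2)
    # d[i][j] = minimum edit distance between f1[:i] and f2[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)                        # deletions at cost 1.0
    for j in range(1, m + 1):
        d[0][j] = float(j)                        # insertions at cost 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 2.0 * (1.0 - phone_sim(f1[i - 1], f2[j - 1]))
            d[i][j] = min(d[i - 1][j] + 1.0,      # delete
                          d[i][j - 1] + 1.0,      # insert
                          d[i - 1][j - 1] + sub)  # substitute
    d_min, d_max = d[n][m], float(n + m)
    return 1.0 - d_min / d_max if d_max else 1.0

# With an identity phone_sim this reproduces the first example above:
identity = lambda a, b: 1.0 if a == b else 0.0
print(form_similarity(list("mbias"), list("biaska"), identity))  # ≈ 0.727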
Phone similarity metric
•  Phone similarity sp for a pair of phones is a real number in the range [0, 1] drawn from a phone similarity matrix
•  Matrix calculated automatically on the basis of a weighted sum of similarities between the phonetic features of the two phones (see the sketch after this list)
   ◦  Examples of phonetic features include nasality (universal), frontness (vowels), place of articulation (consonants)
   ◦  Each phonetic feature has a set of possible values and a similarity matrix for these values. The similarity matrix is user-editable
   ◦  Feature similarity matrix should reflect the probability of various paths of diachronic change
   ◦  Possible to under-specify feature values for phones
•  Similarity of a phone with itself will always be 1.0
•  ‘Default’ similarities can be overridden for particular phones (universal) and/or phonemes (language pair-specific)
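A rough sketch of the weighted-feature computation; the feature set, values and weights below are invented for illustration (CLUES makes all of these user-editable):

FEATURE_WEIGHTS = {"nasality": 0.2, "place": 0.5, "voicing": 0.3}

# Per-feature similarity over feature values (symmetric; 1.0 for identical values).
FEATURE_SIM = {
    "nasality": {("nasal", "oral"): 0.3},
    "place":    {("labial", "dental"): 0.6, ("dental", "velar"): 0.4,
                 ("labial", "velar"): 0.3},
    "voicing":  {("voiced", "voiceless"): 0.5},
}

def feature_sim(feature, v1, v2):
    if v1 == v2:
        return 1.0
    table = FEATURE_SIM[feature]
    return table.get((v1, v2), table.get((v2, v1), 0.0))

def phone_similarity(p1, p2):
    """p1, p2: dicts mapping feature name -> value. Returns sp in [0, 1]."""
    total = sum(FEATURE_WEIGHTS.values())
    return sum(w * feature_sim(f, p1[f], p2[f])
               for f, w in FEATURE_WEIGHTS.items()) / total

# Example: /v/ vs /f/ differ only in voicing.
v = {"nasality": "oral", "place": "labial", "voicing": "voiced"}
f = {"nasality": "oral", "place": "labial", "voicing": "voiceless"}
print(phone_similarity(v, f))  # 0.85 with these illustrative weights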
Semantic domain similarity metric
•  “Depth of deepest subsumer as a proportion of maximum local depth of the semantic domain tree”
•  n1, n2 := the semantic domains being compared (nodes in the semantic domain tree)
•  S := ‘subsumer’: deepest node in the semantic domain tree that subsumes both n1 and n2
•  ds := depth of S in the tree (path length from root node to S)
•  dm := maximum local depth of the tree (length of longest path from the root node to an ancestor of n1 or n2)
•  Similarity = ds / dm
•  See also Li et al. (2003)

Example tree: root A has children B, C, ...; B has children D and E; E has child F.
Examples: F vs. F = 1.0; D vs. E = 0.333; B vs. C = 0.0
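A small sketch over the example tree. Note one assumption: we approximate ‘maximum local depth’ as the maximum depth of the tree (3 here), which reproduces the published example values:

# The example tree, encoded as child -> parent; the root has parent None.
PARENT = {"A": None, "B": "A", "C": "A", "D": "B", "E": "B", "F": "E"}

def ancestors(node):
    """Path from node up to the root, inclusive (node counts as its own ancestor)."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    return len(ancestors(node)) - 1

def domain_similarity(n1, n2):
    a1, a2 = ancestors(n1), ancestors(n2)
    subsumer = next(n for n in a1 if n in a2)  # deepest node dominating both
    d_s = depth(subsumer)
    d_m = max(depth(n) for n in PARENT)        # assumption: global max depth
    return d_s / d_m if d_m else 1.0

print(domain_similarity("F", "F"))  # 1.0
print(domain_similarity("D", "E"))  # 0.333...
print(domain_similarity("B", "C"))  # 0.0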
Gloss similarity metric
•  Crude sentence comparison metric: “proportion of tokens in common”
•  g1, g2 := the glosses being compared
•  r1, r2 := reduced glosses (after removal of stop words, e.g. a, the, of)
•  len1, len2 := length of r1, r2 (number of tokens)
•  L := max(len1, len2)
•  If L = 0, Similarity = 1.0; else:
•  C := count of common tokens (tokens that appear in both r1 and r2)
•  Similarity = C / L

Examples:
‘house’ vs. ‘house’ = 1.0
‘house’ vs. ‘a house’ = 1.0
‘house’ vs. ‘raised sleeping house’ = 0.333
‘house’ vs. ‘hut’ = 0.0

•  This metric needs refinement
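A minimal sketch of this metric; the stop-word list is illustrative only:

STOP_WORDS = {"a", "an", "the", "of", "to"}  # assumed list, not the CLUES one

def gloss_similarity(g1, g2):
    """Proportion of tokens in common, after stop-word removal."""
    r1 = [t for t in g1.lower().split() if t not in STOP_WORDS]
    r2 = [t for t in g2.lower().split() if t not in STOP_WORDS]
    longest = max(len(r1), len(r2))
    if longest == 0:
        return 1.0
    common = len(set(r1) & set(r2))
    return common / longest

print(gloss_similarity("house", "a house"))                # 1.0
print(gloss_similarity("house", "raised sleeping house"))  # 0.333...
print(gloss_similarity("house", "hut"))                    # 0.0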
Conclusion
Possible extensions; unresolved questions
•  Extensions: find borrowings; detect duplicate lexicographic entries; orthographic conversion; ...
•  Analytical questions: How to represent tone and incorporate it within phonetic comparison? Phonetic feature system – multi-valued or binary? Segmentation (comparison at phone, phone sequence or phoneme level)? The edit distance metric may be improved by privileging uninterrupted identical sequences.
•  Elaborate semantic matching: more sophisticated approaches using taxonomies (e.g. WordNet, with some way to map lexemes onto concepts) or compositional semantics (primitives).
•  Performance: Since comparison is parameterised, it may be possible to use genetic algorithms to optimise performance. Need a quantitative way to evaluate the performance of the system.
•  Relation to theory: How much theory is embedded in the instrument? What effect does this have on results?
•  Inter-operability between databases is a key issue in the ultimate usability of the tool.
Acknowledgements
•  Thanks to Christina Eira, Claire Bowern, Beth Evans, Sander Adelaar, Friedel Frowein, Sheena Van Der Mark and Nicolas Tournadre for their comments and suggestions on this project.
References
•  Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant, and Eric W. Holman. 2009. Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13: 167-179.
•  Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331-354.
•  Li, Yuhua, Zuhair A. Bandar, and David McLean. 2003. An approach for measuring semantic similarity using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15.4: 871-882.
•  Lowe, John Brandon and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics (Special Issue on Computational Phonology) 20.3: 381-417.
•  Nakhleh, Luay, Don Ringe, and Tandy Warnow. 2005. Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420. (Cited from Bakker et al. 2009.)