Experiments on the construction and enrichment of a Portuguese wordnet
1. Alberto Simões <ambs@ilch.uminho.pt>
José João Almeida <jj@di.uminho.pt>
Xavier Gómez Guinovart <xgg@uvigo.es>
2. The Princeton WordNet
WordNet® is a large lexical database of English.
Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets), each
expressing a distinct concept. Synsets are
interlinked by means of conceptual-semantic and
lexical relations.
3. WordNet
There are many similar initiatives for other languages.
Some languages have more than one WordNet.
Each WordNet has its own peculiarities.
Some are free, some are not.
Some are bigger than others.
Some were created automatically, others manually.
4. Some of the Portuguese WordNets
WordNet.PT - Portuguese WordNet, manually constructed, only
queryable through a web browser;
OWN-PT - Bootstrapped automatically, manually corrected and
enriched, free;
Onto.PT - automatically constructed resource, aims at a
general lexical ontology (not aligned with WordNet);
PULO - Portuguese Unified Lexical Ontology, aims at
constructing a resource aligned with WordNet.Pr, made
available through the Multilingual Central Repository, and
currently built automatically.
8. Bootstrapping and Enlarging PULO
1. Bootstrap through Probabilistic Translation Dictionaries
using Galician, Spanish and English WordNets
2. Enlarge using Apertium MT translation dictionaries of
Galician, Catalan and Spanish WordNets
3. Enlarge using a 100-year-old definitions dictionary
4. Enlarge using a 10-year-old definitions dictionary
13. Algorithm:
1. Translate EN, ES and GL synsets into Portuguese
2. Score each translation, rewarding those that:
a. are symmetrical;
b. were obtained from multiple variants/wordnets;
c. are direct Galician translations;
d. are in the Portuguese vocabulary.
Experiment I: Bootstrap through EN, ES and GL translation
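The scoring step above can be sketched as follows. This is only an illustration: the slides name the four criteria but not their weights, so the `WEIGHTS` values, the `Candidate` fields and the `select` helper are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical weights: the slides list the criteria but not their values.
WEIGHTS = {
    "symmetrical": 1.0,
    "multiple_sources": 0.5,
    "galician": 0.5,
    "in_vocabulary": 0.5,
}


@dataclass
class Candidate:
    word: str             # candidate Portuguese variant
    symmetrical: bool     # translating it back yields the source variant
    n_sources: int        # distinct variants/wordnets that produced it
    from_galician: bool   # produced by a direct Galician translation
    in_vocabulary: bool   # found in a Portuguese word list


def score(c: Candidate) -> float:
    """Add one weight per satisfied criterion."""
    total = 0.0
    if c.symmetrical:
        total += WEIGHTS["symmetrical"]
    if c.n_sources > 1:
        total += WEIGHTS["multiple_sources"]
    if c.from_galician:
        total += WEIGHTS["galician"]
    if c.in_vocabulary:
        total += WEIGHTS["in_vocabulary"]
    return total


def select(candidates, threshold=2.0):
    """Keep only candidates whose score reaches the threshold."""
    return [c for c in candidates if score(c) >= threshold]
```

With these assumed weights, a candidate needs the symmetry bonus plus at least two other criteria to reach a threshold of 2.0.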
14. Evaluation:
Manual evaluation of samples for different scores:
score ≥ 2.0 yields 70% correct variants
score ≥ 1.5 yields 60% correct variants
Select the highest-scoring variants (score ≥ 2.0)
Created 18K variants, distributed across 17.8K synsets.
Experiment I: Bootstrap through EN, ES and GL translation
17. Algorithm:
1. Translate GL, ES and CA into Portuguese;
2. Compute multisets with all possible translations;
3. Select translations that:
a. were obtained from two or more languages;
b. were obtained from two or more variants
(possibly from the same language)
Experiment II: Enlarge translating GL, ES and CA
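A minimal sketch of the multiset selection, assuming each translation arrives as a `(language, source_variant, portuguese_word)` triple and that meeting either criterion is enough. The function name and the data layout are hypothetical, not taken from the slides.

```python
from collections import Counter, defaultdict


def select_translations(translations):
    """Return Portuguese words supported by 2+ languages or 2+ source variants.

    `translations` is an iterable of (language, source_variant, word) triples
    for a single synset.
    """
    support = Counter()            # multiset: word -> number of supporting variants
    languages = defaultdict(set)   # word -> languages that produced it
    seen = set()
    for lang, variant, word in translations:
        if (lang, variant, word) in seen:
            continue               # count each source variant once per word
        seen.add((lang, variant, word))
        support[word] += 1
        languages[word].add(lang)
    return {
        word
        for word, count in support.items()
        if len(languages[word]) >= 2 or count >= 2
    }
```

Counting `(language, variant)` pairs means a word backed by two languages always has a count of at least two, so the second test subsumes the first; both are kept here to mirror the slide.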
24. Experiment IV: Enlarging using a 10 y.o. Dictionary
Was lucky enough to get my hands on the Dicionário da Língua
Portuguesa Contemporânea (DLPC) from the Academia das Ciências de
Lisboa;
In PDF format 😡
Black Magic (with Perl), patience, more patience, and some
more Black Magic
Dictionary in XML format 😸
30. Experiment IV: Intersect Translated DLPC Synsets
For each translated DLPC synset ($lang = EN, ES and GL)
For each WordNet.$lang synset
Check which one has the highest-scoring intersection
Save mappings for each language
(DLPC synset ID -- WordNet ILI)
Intersect language mappings
(including the previous PT intersections)
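The alignment loop above might be sketched like this. Two assumptions are made that the slides do not confirm: the "score" of an intersection is taken to be the number of shared words, and "intersect language mappings" is read as keeping only the DLPC-to-ILI pairs on which every language agrees. All function names are hypothetical.

```python
def best_match(translated_words, wordnet):
    """Return the ILI whose variants overlap most with the translated words.

    `wordnet` maps ILI -> set of variants. Returns None when nothing overlaps.
    Ties are resolved in favour of the first synset seen.
    """
    best_ili, best_size = None, 0
    for ili, variants in wordnet.items():
        size = len(translated_words & variants)
        if size > best_size:
            best_ili, best_size = ili, size
    return best_ili


def map_language(dlpc_translated, wordnet):
    """Build the DLPC synset ID -> ILI mapping for one language."""
    mapping = {}
    for dlpc_id, words in dlpc_translated.items():
        ili = best_match(words, wordnet)
        if ili is not None:
            mapping[dlpc_id] = ili
    return mapping


def intersect_mappings(mappings):
    """Keep only the (DLPC ID, ILI) pairs present in every language mapping."""
    if not mappings:
        return {}
    common = set(mappings[0].items())
    for mapping in mappings[1:]:
        common &= set(mapping.items())
    return dict(common)
```

A real run would also need the confidence values reported on the results slide; this sketch leaves them out because their definition is not given.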
31. Experiment IV: Results
From this alignment of DLPC synsets with PULO synsets:
424 alignments resulted in no additions
(the variants were already part of PULO)
28,427 alignments suggest enrichment
(new variants to existing synsets)
10,012 alignments suggest enlargement
(new variants to empty synsets)
Confidences ranging from 1 to 678
Average at 25...
32. Experiment IV: New Variants Evaluation
Cut by a score threshold of 100 (whatever that means)
- 2,114 extensions (some for same synset)
- manual evaluation of 100 expansions:
- 67% OK (166 variants)
- 20% NOK (35 variants)
- 13% Ambiguous (29 variants)
- 72% of added variants are correct
33. Experiment IV: New Synsets Evaluation
Cut by a score threshold of 50 (whatever that means)
- 366 new synsets (some for same synset)
- manual evaluation of 100 new synsets:
- 83% OK (267 variants)
- 11% AMB (29 variants)
- 6% NOK (19 variants)
- 85% of added variants are correct
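The two evaluation slides mix two kinds of percentages: the OK/NOK/ambiguous shares are over the 100 sampled expansions or synsets, while the final line is over the variants those samples contain. The short check below recomputes the variant-level figures from the counts on the slides; the helper name is just for illustration.

```python
def share_correct(ok, nok, ambiguous):
    """Fraction of evaluated variants judged correct."""
    return ok / (ok + nok + ambiguous)


# Counts copied from the slides (variants inside the 100 sampled items).
new_variants = share_correct(ok=166, nok=35, ambiguous=29)   # 166 / 230
new_synsets = share_correct(ok=267, nok=19, ambiguous=29)    # 267 / 315

print(f"new variants: {new_variants:.0%}")
print(f"new synsets:  {new_synsets:.0%}")
```

Both results round to the figures quoted on the slides (72% and 85%).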
34. Experiment IV: Future
Previous results suggest different synonyms for same Synset:
1. create multisets of addition suggestions;
2. use multiset cardinality as confidence;
3. use morphological analyzer to match POS;
Some of which might have been done in the last week if not
for an acute conjunctivitis.
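The three planned steps could look roughly like the sketch below. It is speculative, since the slides describe future work: suggestions are assumed to be `(ILI, word)` pairs repeated once per alignment that proposes them, and the morphological analyzer is replaced by a plain `word_pos` dictionary, which is a stand-in and not a real analyzer API.

```python
from collections import Counter


def confidence(suggestions):
    """Multiset of (ili, word) suggestions; the count is the confidence."""
    return Counter(suggestions)


def filter_by_pos(conf, synset_pos, word_pos):
    """Drop suggestions whose word cannot have the synset's part of speech.

    `synset_pos` maps ILI -> POS tag; `word_pos` maps word -> set of POS tags
    (a hypothetical stand-in for the output of a morphological analyzer).
    """
    return {
        (ili, word): count
        for (ili, word), count in conf.items()
        if synset_pos.get(ili) in word_pos.get(word, set())
    }
```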
36. Managing a WordNet from the Command Line
Using SQL to manage a wordnet directly is
- tiresome,
- error prone,
Using a textual syntax is
- error prone,
Using a graphical tool is
- tiresome,
Using a command line tool is
- geeky and fast.