Dictionary Alignment
by Rewrite-based Entry Translation
Alberto Simões (1), Xavier Gómez Guinovart (2)
(1) Centro de Estudos Humanísticos, Universidade do Minho
Campus de Gualtar, Braga, Portugal
ambs@ilch.uminho.pt
(2) Galician Language Technologies and Applications (TALG Group)
Universidade de Vigo, Galiza, Spain
xgg@uvigo.es
SLATE 2013
Alberto Simões, Xavier Gómez Guinovart: Dictionary Alignment by Rewrite-based Entry Translation
Motivation
We have a running project, Dicionário-Aberto (DA), that lets
users consult a Portuguese dictionary;
Dicionário-Aberto is also available in TEI and DB formats;
Within the GALNET project, a Galician Synonyms Dictionary (GSD) was
converted from a WYSIWYG format to a rich TEI format;
Would it be possible to integrate the GSD into DA?
Problem
Dicionário-Aberto has more than a hundred thousand entries!
The Galician Synonyms Dictionary is smaller, with some
tens of thousands of entries.
Problem: how to align entries from both dictionaries?
The two languages are very close;
That helps with concept alignment!
There are still too many words that differ;
There is a considerable set of false friends;
There is no free translation dictionary that is big enough.
Inspiration (part 1)
In the first year, “s” will be used instead of the soft “c.” Sertainly,
sivil servants will resieve this news with joy. Also, the hard “c” will
be replaced with “k”. Not only will this klear up konfusion, but
typewriters kan have one less letter.
There will be growing publik emthusiasm in the sekond year, when
the troublesome “ph” will be replaced by “f”. This will make words
like “fotograf” 20 persent shorter.
In the third year, publik akseptanse of the new spelling kan be
expekted to reach the stage where more komplikated changes are
possible. Governments will enkorage the removal of double letters,
which have always ben a deterent to akurate speling. Also, al wil
agre that the horible mes of silent “e”s in the languag is disgrasful,
and they would go.
Inspiration (part 2)
By the fourth year, peopl wil be reseptiv to steps such as replasing
“th” by “z” and “w” by “v”.
During ze fifz year, ze unesesary “o” kan be dropd from vords
kontaining “ou”, and similar changes vud of kors be aplid to ozer
kombinations of leters.
After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no
mor trubls or difikultis and evrivun vil find it ezi tu understand ech
ozer. Ze drem vil finali kum tru!!
Approach
Define a translation function based on a set or sequence of text
transformations (mainly substitutions) that convert (translate)
Portuguese words into Galician words.
The translation function is defined as
$T(L_{gl}, w_{pt}) = w_{gl}$
where:
$L_{gl}$ is the target Galician lexicon, obtained from the words
present in the Galician Synonyms Dictionary;
$w_{pt}$ is the Portuguese word being translated;
$w_{gl}$ is the Galician translation.
Translation Function
Substitutions can be simple, such as:
ss > s (passo > paso)
j > x (sujeito > suxeito, injectar > inxectar)
z ([eiéíêî]) > c (bronze > bronce)
Substitutions can over-generate:
-ção > -ción, -zón (adivinhação > adiviñación, coração > corazón)
-velmente > -belmente, -blemente (previsivelmente > previsibelmente, previsiblemente)
rv > rv, rb (preservação > preservación, estorvar > estorbar)
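The rule notation above can be read as regular expressions. The following is a minimal sketch under an assumed reading that the slides do not spell out: a leading hyphen marks a suffix, a trailing hyphen marks a prefix, and a bracketed class in parentheses is a right-hand context. The helper names `rule_to_regex` and `apply_rule` are hypothetical, and character classes inside a suffix (as in the T rule) are not handled.

```python
import re


def rule_to_regex(lhs):
    """Compile the left-hand side of a rule, written as on the slide."""
    contextual = re.fullmatch(r"(\S+) \((\[[^\]]+\])\)", lhs)
    if contextual:
        # "z ([eiéíêî])": match z only when one of these vowels follows.
        return re.compile(re.escape(contextual.group(1)) + "(?=" + contextual.group(2) + ")")
    if lhs.startswith("-"):
        # "-ção": assumed to mean "only at the end of the word".
        return re.compile(re.escape(lhs[1:]) + "$")
    if lhs.endswith("-"):
        # "im-": assumed to mean "only at the start of the word".
        return re.compile("^" + re.escape(lhs[:-1]))
    return re.compile(re.escape(lhs))


def apply_rule(word, lhs, alternatives):
    """Return one rewritten form per alternative, or the word itself if no match."""
    regex = rule_to_regex(lhs)
    if not regex.search(word):
        return [word]
    return [regex.sub(alt.strip("-"), word) for alt in alternatives]


print(apply_rule("passo", "ss", ["s"]))
print(apply_rule("bronze", "z ([eiéíêî])", ["c"]))
print(apply_rule("coração", "-ção", ["-ción", "-zón"]))
print(apply_rule("estorvar", "rv", ["rv", "rb"]))
```

An over-generating rule returns several forms, only some of which are real Galician words, so a later filtering step is needed.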
Translation Function
A word without substitutions can be a valid translation;
Substitutions can be interdependent
(for example, -ção > -ción should be applied before ç > z);
Substitutions are applied from more generic to more specific
(unless there is interdependence);
Substitutions can generate more than one possible translation;
The function returns the first candidate translation
that exists in the target lexicon.
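The properties above can be sketched as an ordered cascade of rewrites followed by a lexicon lookup. This is an illustrative sketch, not the authors' implementation: the regex spellings, the choice to try the unchanged word first, and the names `RULES`, `candidates` and `translate` are assumptions, and only a handful of rules is included. Each rule rewrites all of its matches in a word with the same alternative, which is a simplification.

```python
import re
import unicodedata

# A few rules as (regex, alternatives), in application order.
RULES = [
    (r"ss", ["s"]),               # A:  ss > s
    (r"j", ["x"]),                # B:  j > x
    (r"ção$", ["ción", "zón"]),   # C:  -ção > -ción, -zón (must run before D)
    (r"ç", ["z"]),                # D:  ç > z
    (r"nh", ["ñ"]),               # E:  nh > ñ
    (r"z(?=[eiéíêî])", ["c"]),    # G:  z ([eiéíêî]) > c
    (r"rv", ["rv", "rb"]),        # AM: rv > rv, rb
]


def candidates(word, rules=RULES):
    """Rewrite the word rule by rule, branching when a rule has alternatives."""
    forms = [word]
    for pattern, alternatives in rules:
        regex = re.compile(pattern)
        rewritten = []
        for form in forms:
            if regex.search(form):
                options = [regex.sub(alt, form) for alt in alternatives]
            else:
                options = [form]
            for option in options:
                if option not in rewritten:
                    rewritten.append(option)
        forms = rewritten
    return forms


def translate(lexicon, word, rules=RULES):
    """Sketch of T(L_gl, w_pt): first candidate found in the lexicon, else None."""
    word = unicodedata.normalize("NFC", word)
    for candidate in [word] + candidates(word, rules):
        if candidate in lexicon:
            return candidate
    return None


# Toy lexicon for illustration only; the real one comes from the GSD.
galician_lexicon = {"paso", "suxeito", "corazón", "adiviñación", "estorbar"}
for portuguese in ["passo", "sujeito", "coração", "adivinhação", "estorvar", "cadeira"]:
    print(portuguese, "->", translate(galician_lexicon, portuguese))
```

Filtering against the lexicon is what keeps over-generated forms such as "coración" from being returned; a word with no candidate in the lexicon stays untranslated.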
Translation Function
Id.  Substitution
ID   (identity: word left unchanged)
A    ss > s
B    j > x
C    -ção > -ción, -zón
D    ç > z
E    nh > ñ
F    -dizer > -dicir
G    z ([eiéíêî]) > c
H    lh > ll
I    vr > br
J    -agem > -axe
K    g ([eiéíêî]) > x
L    -ável > -ábel, -able
M    -ível > -íbel, -ible
N    -velmente > -belmente, -blemente
O    -eio > -eo
P    -ância > -ancia
Q    -ência > -encia
R    -aria > -ería, -aría
S    -ário > -ario
T    -óri[oa] > -ori[oa]
U    -são > -sión, -són
V    -rão > -rón, -rán
W    -mão > -món, -mán
X    -ião > -ión, -ián
Y    -ício > -icio
Z    -óide > -oide
AA   -ídio > -idio
AB   -ânico > -ánico
AC   -édia > -edia
AD   -cimento > -cemento
AE   -m > -n
AF   -crever > -cribir
AG   -u > -u, -o
AH   -var > -bar
AI   im- > im-, inm-
AJ   qua- > cua-, ca-
AK   qua > cua
AL   -xão > -xón, -xión
AM   rv > rv, rb
AN   -iver > -ivir
Evaluation 1
Given a small (about 9K pairs) hand-curated translation dictionary...
Classify each result in the style of a Type I/II error analysis:

$T(L_{gl}, w_{pt}) = w_{gl}$       Correct   Incorrect
$w_{gl}$ is a Galician word       TP        FP
$w_{gl}$ is not a Galician word   TN        FN

TP  True Positive: correct translation;
FP  False Positive: wrong translation, but the obtained word is
present in the Galician lexicon;
TN  True Negative: correct translation, but the translation is not in
the Galician lexicon (always 0);
FN  False Negative: wrong translation, and the obtained word is
not in the Galician lexicon.
Evaluation 1 — Measures
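$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{recall} = \frac{TP}{TP + FN} \tag{3}$$
$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$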
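As a worked illustration of measures (1) to (4), the sketch below computes them from the four counts. The function names and the counts passed to `scores` are hypothetical and chosen only to make the arithmetic easy to follow; they are not the evaluation data.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as in measure (4)."""
    return 2 * precision * recall / (precision + recall)


def scores(tp, fp, tn, fn):
    """Compute measures (1) to (4) from the four outcome counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts; TN is 0, matching the definition of TN above.
example = scores(tp=80, fp=5, tn=0, fn=15)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because TN is always 0 under these definitions, accuracy reduces to TP divided by the total number of evaluated pairs.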
Evaluation 1 — Results
Id. Precision Recall F1 Accuracy Correct ∆
ID 0.9954 0.5859 0.7376 0.5843 5390 5390
A 0.9952 0.6038 0.7516 0.6020 5553 163
B 0.9951 0.6158 0.7608 0.6139 5663 110
C 0.9952 0.6567 0.7912 0.6546 6038 375
D 0.9951 0.6687 0.7999 0.6665 6148 110
E 0.9952 0.6782 0.8066 0.6760 6235 87
F 0.9952 0.6786 0.8070 0.6764 6239 4
G 0.9953 0.6838 0.8107 0.6816 6287 48
H 0.9953 0.6927 0.8169 0.6905 6369 82
I 0.9953 0.6934 0.8174 0.6911 6375 6
J 0.9953 0.6964 0.8195 0.6942 6403 28
K 0.9955 0.7210 0.8363 0.7187 6629 226
L 0.9955 0.7256 0.8394 0.7232 6671 42
M 0.9955 0.7284 0.8413 0.7260 6697 26
N 0.9957 0.7482 0.8544 0.7458 6879 182
O 0.9957 0.7496 0.8553 0.7472 6892 13
P 0.9957 0.7515 0.8565 0.7490 6909 17
Q 0.9957 0.7588 0.8612 0.7563 6976 67
R 0.9957 0.7602 0.8621 0.7577 6989 13
S 0.9958 0.7680 0.8672 0.7655 7061 72
T 0.9958 0.7703 0.8686 0.7678 7082 21
U 0.9958 0.7772 0.8731 0.7747 7146 64
V 0.9958 0.7780 0.8735 0.7755 7153 7
W 0.9958 0.7783 0.8737 0.7758 7156 3
X 0.9958 0.7796 0.8746 0.7771 7168 12
Y 0.9958 0.7806 0.8752 0.7781 7177 9
Z 0.9958 0.7807 0.8753 0.7782 7178 1
AA 0.9958 0.7813 0.8756 0.7787 7183 5
AB 0.9958 0.7818 0.8759 0.7793 7188 5
AC 0.9958 0.7822 0.8762 0.7797 7192 4
AD 0.9959 0.7836 0.8770 0.7810 7204 12
AE 0.9959 0.7855 0.8783 0.7830 7222 18
AF 0.9959 0.7863 0.8787 0.7837 7229 7
AG 0.9957 0.7876 0.8795 0.7849 7240 11
AH 0.9957 0.7882 0.8799 0.7856 7246 6
AI 0.9958 0.7903 0.8812 0.7876 7265 19
AJ 0.9956 0.7928 0.8827 0.7900 7287 22
AK 0.9956 0.7940 0.8834 0.7912 7298 11
AL 0.9956 0.7947 0.8839 0.7920 7305 7
AM 0.9956 0.7951 0.8842 0.7924 7309 4
AN 0.9956 0.7955 0.8844 0.7927 7312 3
Evaluation 2
Triangulating a bigger dictionary for evaluation purposes:
PT–SP (12 340) ◦ SP–GL (7 581) ⇒ PT–GL (5 045 pairs)
from Apertium translation software
PT–SP ◦ SP–EN (24 912) ◦ EN–GL (17 626) ⇒ PT–GL (6 644 pairs)
PT–SP and SP–EN from Apertium, EN–GL from CLUVI
PT–EN (14 600) ◦ EN–GL ⇒ PT–GL (8 589 pairs)
PT–EN from a merchandising app, EN–GL from CLUVI
Adding the dictionaries together resulted in 14 492 pairs.
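Triangulation composes two bilingual dictionaries through the language they share. A minimal sketch, assuming each dictionary is a set of word pairs: the function name `compose` and the toy entries are illustrative and are not taken from the Apertium or CLUVI data.

```python
from collections import defaultdict


def compose(first, second):
    """Compose (source, pivot) pairs with (pivot, target) pairs into (source, target) pairs."""
    targets_by_pivot = defaultdict(set)
    for pivot, target in second:
        targets_by_pivot[pivot].add(target)
    return {
        (source, target)
        for source, pivot in first
        for target in targets_by_pivot.get(pivot, ())
    }


# Toy entries for illustration only.
pt_sp = {("coração", "corazón"), ("passo", "paso"), ("janela", "ventana")}
sp_gl = {("corazón", "corazón"), ("paso", "paso")}

pt_gl = compose(pt_sp, sp_gl)
print(sorted(pt_gl))

# Dictionaries obtained through different pivots can then be merged as a set union.
merged = pt_gl | {("sujeito", "suxeito")}
print(len(merged))
```

A pair survives only when its pivot word appears in both dictionaries, which is why each composed dictionary is smaller than its inputs. Ambiguous pivot words can also introduce wrong pairs, which this sketch does not filter.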
Evaluation 2 – Results
Id. Precision Recall F1 Accuracy Correct ∆
ID 0.9668 0.5022 0.6611 0.4937 7155 7155
A 0.9664 0.5176 0.6741 0.5084 7368 213
B 0.9663 0.5275 0.6824 0.5179 7506 138
C 0.9668 0.5646 0.7129 0.5538 8026 520
D 0.9661 0.5746 0.7206 0.5633 8163 137
E 0.9658 0.5831 0.7272 0.5713 8279 116
...
AH 0.9660 0.6819 0.7994 0.6659 9650 7
AI 0.9661 0.6841 0.8010 0.6681 9682 32
AJ 0.9660 0.6863 0.8025 0.6701 9711 29
AK 0.9660 0.6873 0.8032 0.6711 9726 15
AL 0.9661 0.6881 0.8037 0.6718 9736 10
AM 0.9660 0.6884 0.8039 0.6721 9740 4
AN 0.9660 0.6887 0.8041 0.6724 9744 4
Dictionary Alignment — Results
Portuguese Words Galician Words
Substitution Count Percentage Count Percentage
ID 12711 15.3502% 12711 33.7475%
A 13082 15.7982% 13065 34.6874%
B 13447 16.2390% 13421 35.6326%
C 14348 17.3270% 14321 38.0220%
D 14764 17.8294% 14728 39.1026%
E 15174 18.3245% 15138 40.1912%
...
AI 17712 21.3895% 17627 46.7994%
AJ 17740 21.4233% 17648 46.8552%
AK 17765 21.4535% 17673 46.9215%
AL 17784 21.4764% 17693 46.9746%
AM 17813 21.5115% 17718 47.0410%
AN 17817 21.5163% 17722 47.0516%
DIC 20084 24.2540% 19989 53.0705%
Final Remarks
An approach to translate Portuguese dictionary words
into Galician words using a set of string substitutions;
The approach is unable to translate all words;
A considerable number of words in Dicionário-Aberto use
pre-1930 orthography, which was not handled;
We deliberately ignored a relevant problem, false friends, of two kinds:
two words that share only a subset of their meanings. For instance,
talho (PT) and tallo (GL) share the majority of their senses,
but some senses are specific to Portuguese
(for example, the place where meat is sold);
two words that have completely different meanings. An example
is the word presunto (written the same way in the
two languages), which means ham in Portuguese (a noun) but
alleged in Galician (an adjective).
Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
Final Remarks
An approach to translate Portuguese words in a dictionary
into Galician words using a set of string substitutions;
Approach is unable to translate all words;
Reasonable amount of words in Dicion´ario-Aberto have
pre-1930 orthography, that wasn’t dealt with;
We deliberately ignored a relevant problem: false friends.
two words that share a subset of the meanings. For instance,
talho (PT) and tallo (GL) share the majority of their senses,
but there are some of them that are specific to Portuguese
(for example, the place where meat is sold);
two words that have complete different meanings. An example
would be the word presunto (written in the same way in the
two languages) that means ham in Portuguese (a noun), but
means alleged in Galician (an adjective);
Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
Dictionary Alignment
by Rewrite-based Entry Translation
Alberto Sim˜oes1 Xavier G´omez Guinovart2
1Centro de Estudos Human´ısticos, Universidade do Minho
Campus de Gualtar, Braga, Portugal
ambs@ilch.uminho.pt
2Galician Language Technologies and Applications (TALG Group)
Universidade de Vigo, Galiza, Spain
xgg@uvigo.es
SLATE 2013
Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation

More Related Content

More from Alberto Simões

Making the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionaryMaking the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionaryAlberto Simões
 
EMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesEMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesAlberto Simões
 
Aula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAlberto Simões
 
Aula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAlberto Simões
 
Aula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAlberto Simões
 
Building C and C++ libraries with Perl
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with PerlAlberto Simões
 
Processing XML: a rewriting system approach
Processing XML: a rewriting system approachProcessing XML: a rewriting system approach
Processing XML: a rewriting system approachAlberto Simões
 
Arquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaAlberto Simões
 
Extracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaAlberto Simões
 

More from Alberto Simões (20)

Google Maps JS API
Google Maps JS APIGoogle Maps JS API
Google Maps JS API
 
Making the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionaryMaking the most of a 100-year-old dictionary
Making the most of a 100-year-old dictionary
 
EMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized DictionariesEMLex-A5: Specialized Dictionaries
EMLex-A5: Specialized Dictionaries
 
Modelação de Dados
Modelação de DadosModelação de Dados
Modelação de Dados
 
Aula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de SequênciaAula 04 - Introdução aos Diagramas de Sequência
Aula 04 - Introdução aos Diagramas de Sequência
 
Aula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de RequisitosAula 02 - Engenharia de Requisitos
Aula 02 - Engenharia de Requisitos
 
Aula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de InformaçãoAula 01 - Planeamento de Sistemas de Informação
Aula 01 - Planeamento de Sistemas de Informação
 
Building C and C++ libraries with Perl
Building C and C++ libraries with PerlBuilding C and C++ libraries with Perl
Building C and C++ libraries with Perl
 
PLN em Perl
PLN em PerlPLN em Perl
PLN em Perl
 
Classification Systems
Classification SystemsClassification Systems
Classification Systems
 
Redes de Pert
Redes de PertRedes de Pert
Redes de Pert
 
Dancing Tutorial
Dancing TutorialDancing Tutorial
Dancing Tutorial
 
Processing XML: a rewriting system approach
Processing XML: a rewriting system approachProcessing XML: a rewriting system approach
Processing XML: a rewriting system approach
 
Sistemas de Numeração
Sistemas de NumeraçãoSistemas de Numeração
Sistemas de Numeração
 
Álgebra de Boole
Álgebra de BooleÁlgebra de Boole
Álgebra de Boole
 
Arquitecturas de Tradução Automática
Arquitecturas de Tradução AutomáticaArquitecturas de Tradução Automática
Arquitecturas de Tradução Automática
 
Extracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução AutomáticaExtracção de Recursos para Tradução Automática
Extracção de Recursos para Tradução Automática
 
Dicionário Aberto
Dicionário AbertoDicionário Aberto
Dicionário Aberto
 
Keynote Globs
Keynote GlobsKeynote Globs
Keynote Globs
 
Workshop GLOBS
Workshop GLOBSWorkshop GLOBS
Workshop GLOBS
 

Recently uploaded

Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Mark Simos
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFMichael Gough
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 

Recently uploaded (20)

Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
Tampa BSides - The No BS SOC (slides from April 6, 2024 talk)
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
All These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDFAll These Sophisticated Attacks, Can We Really Detect Them - PDF
All These Sophisticated Attacks, Can We Really Detect Them - PDF
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 

Dictionary Alignment by Rewrite-based Entry Translation

  • 1. Dictionary Alignment by Rewrite-based Entry Translation Alberto Sim˜oes1 Xavier G´omez Guinovart2 1Centro de Estudos Human´ısticos, Universidade do Minho Campus de Gualtar, Braga, Portugal ambs@ilch.uminho.pt 2Galician Language Technologies and Applications (TALG Group) Universidade de Vigo, Galiza, Spain xgg@uvigo.es SLATE 2013 Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
  • 2. Motivation We have a running project, Dicion´ario-Aberto, that allows the user to consult a Portuguese dictionary; Dicion´ario-Aberto is also available in TEI and DB formats; Within GALNET project, a Galician Synonyms Dictionary was converted from a WYSIWYG format to a rich TEI format; Would it be possible to integrate the GSD into DA? Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
  • 3. Problem Dicion´ario-Aberto has more than a hundred thousand entries! Galician Synonyms Dictionary is not that big, and has some dozens of thousand entries. Problem: how to align entries from both dictionaries? The two languages are very close; That help with concepts alignment! There are too many different words; There is a reasonable set of false friend words; There isn’t a a free and big enough translation dictionary. Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
  • 4. Problem Dicion´ario-Aberto has more than a hundred thousand entries! Galician Synonyms Dictionary is not that big, and has some dozens of thousand entries. Problem: how to align entries from both dictionaries? The two languages are very close; That help with concepts alignment! There are too many different words; There is a reasonable set of false friend words; There isn’t a a free and big enough translation dictionary. Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
  • 5. Inspiration (part 1) In the first year,“s”will be used instead of the soft“c.” Sertainly, sivil servants will resieve this news with joy. Also, the hard“c”will be replaced with“k”. Not only will this klear up konfusion, but typewriters kan have one less letter. There will be growing publik emthusiasm in the sekond year, when the troublesome“ph”will be replaced by“f”. This will make words like“fotograf”20 persent shorter. In the third year, publik akseptanse of the new spelling kan be expekted to reach the stage where more komplikated changes are possible. Governments will enkorage the removal of double letters, which have always ben a deterent to akurate speling. Also, al wil agre that the horible mes of silent“e”s in the languag is disgrasful, and they would go. Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
  • 6. Inspiration (part 2) By the fourth year, peopl wil be reseptiv to steps such as replasing “th”by“z”and“w”by“v”. During ze fifz year, ze unesesary“o”kan be dropd from vords kontaining“ou”, and similar changes vud of kors be aplid to ozer kombinations of leters. After zis fifz yer, ve vil hav a reli sensibl riten styl. Zer vil be no mor trubls or difikultis and evrivun vil find it ezi tu understand ech ozer. Ze drem vil finali kum tru!! Alberto Sim˜oes, Xavier G´omez Guinovart Dictionary Alignment by Rewrite-based Entry Translation
  • 7. Approach Define a translation function based on a set or sequence of text transformations (mainly substitutions) that convert (translate) Portuguese words into Galician words. The translation function is defined as $T(L_{gl}, w_{pt}) = w_{gl}$, where $L_{gl}$ is the target Galician lexicon, obtained from the words present in the Galician Synonyms Dictionary; $w_{pt}$ is the Portuguese word being translated; and $w_{gl}$ is the Galician translation.
  • 9. Translation Function
    Substitutions can be simple, as:
      ss > s                (passo > paso)
      j > x                 (sujeito > suxeito, injectar > inxectar)
      z ([eiéíêî]) > c      (bronze > bronce)
    Substitutions can over-generate:
      -ção > -ción,-zón                  (adivinhação > adiviñación, coração > corazón)
      -velmente > -belmente,-blemente    (previsivelmente > previsibelmente, previsiblemente)
      rv > rv,rb                         (preservação > preservación, estorvar > estorbar)
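The simple substitutions above can be read as ordered regular-expression rewrites. The following is only a minimal sketch under that assumption: the `SIMPLE_RULES` list and the `apply_simple_rules` helper are hypothetical names, not taken from the slides, and the lookahead is one possible way to encode the vowel context of the `z` rule.

```python
import re

# Hypothetical encoding of the three simple rules as (pattern, replacement)
# pairs. The lookahead (?=...) checks the following vowel without consuming it.
SIMPLE_RULES = [
    (r"ss", "s"),               # passo > paso
    (r"j", "x"),                # sujeito > suxeito
    (r"z(?=[eiéíêî])", "c"),    # bronze > bronce
]


def apply_simple_rules(word: str) -> str:
    """Apply every simple rule, in order, to a Portuguese word."""
    for pattern, replacement in SIMPLE_RULES:
        word = re.sub(pattern, replacement, word)
    return word


for w in ["passo", "sujeito", "injectar", "bronze"]:
    print(w, ">", apply_simple_rules(w))
```

Over-generating rules need more than one replacement per pattern, so they do not fit this single-output helper.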
  • 11. Translation Function A word without substitutions can be a valid translation. Substitutions can be inter-dependent (for example, -ção > -ción should be applied before ç > z). Substitutions are applied from more generic to more specific, unless there is interdependence. Substitutions can generate more than one possible translation. The function returns the first of the possible translations that exists in the target lexicon.
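The properties listed on this slide can be sketched as a small candidate generator followed by a lexicon lookup. This is an illustration under stated assumptions, not the authors' implementation: the names `RULES`, `candidates` and `translate` are hypothetical, only a handful of the rules are included, and the sketch assumes that a matching rule replaces a candidate with all of its alternatives.

```python
import re
from typing import List, Optional, Set

# Hypothetical subset of the rules as (pattern, alternatives) pairs.
# "-ção" is listed before the generic "ç" rule because of their dependency.
RULES = [
    (r"ss", ["s"]),
    (r"j", ["x"]),
    (r"ção$", ["ción", "zón"]),
    (r"ç", ["z"]),
    (r"nh", ["ñ"]),
    (r"rv", ["rv", "rb"]),
]


def candidates(word: str) -> List[str]:
    """Expand a Portuguese word into an ordered list of candidate forms."""
    forms = [word]
    for pattern, alternatives in RULES:
        regex = re.compile(pattern)
        expanded = []
        for form in forms:
            if regex.search(form):
                # An over-generating rule yields one form per alternative.
                expanded.extend(regex.sub(alt, form) for alt in alternatives)
            else:
                expanded.append(form)
        forms = expanded
    return forms


def translate(lexicon: Set[str], word: str) -> Optional[str]:
    """Return the first candidate found in the Galician lexicon, if any."""
    for form in candidates(word):
        if form in lexicon:
            return form
    return None


# Toy lexicon for illustration only.
lexicon = {"paso", "adiviñación", "corazón", "estorbar", "preservación", "casa"}
for w in ["adivinhação", "coração", "estorvar", "casa"]:
    print(w, ">", translate(lexicon, w))
```

Returning `None` when no candidate is in the lexicon is a choice made for this sketch; the slides do not say what the function returns in that case.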
  • 12. Translation Function
      Id.  Substitution                       Id.  Substitution
      ID   (none)                             U    -são > -sión,-són
      A    ss > s                             V    -rão > -rón,-rán
      B    j > x                              W    -mão > -món,-mán
      C    -ção > -ción,-zón                  X    -ião > -ión,-ián
      D    ç > z                              Y    -ício > -icio
      E    nh > ñ                             Z    -óide > -oide
      F    -dizer > -dicir                    AA   -ídio > -idio
      G    z ([eiéíêî]) > c                   AB   -ânico > -ánico
      H    lh > ll                            AC   -édia > -edia
      I    vr > br                            AD   -cimento > -cemento
      J    -agem > -axe                       AE   -m > -n
      K    g ([eiéíêî]) > x                   AF   -crever > -cribir
      L    -ável > -ábel,-able                AG   -u > -u,-o
      M    -ível > -íbel,-ible                AH   -var > -bar
      N    -velmente > -belmente,-blemente    AI   im- > im-,inm-
      O    -eio > -eo                         AJ   qua- > cua-,ca-
      P    -ância > -ancia                    AK   qua > cua
      Q    -ência > -encia                    AL   -xão > -xón,-xión
      R    -aria > -ería,-aría                AM   rv > rv,rb
      S    -ário > -ario                      AN   -iver > -ivir
      T    -óri[oa] > -ori[oa]
  • 13. Evaluation 1
    Given a small (about 9K pairs) hand-curated translation dictionary, compute Type I/II errors for the hypothesis $T(L_{gl}, w_{pt}) = w_{gl}$:
                                           Correct   Incorrect
      $w_{gl}$ is a Galician word          TP        FP
      $w_{gl}$ is not a Galician word      TN        FN
    TP (True Positive): correct translation.
    FP (False Positive): wrong translation, but the obtained word is present in the Galician lexicon.
    TN (True Negative): correct translation, but the translation is not in the Galician lexicon (always 0).
    FN (False Negative): wrong translation, and the obtained word is not in the Galician lexicon.
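The four outcomes defined above can be tallied with a short helper. This is a sketch only: `classify`, the toy lexicon and the toy word pairs are hypothetical and are not the evaluation data from the slides.

```python
from collections import Counter
from typing import Optional, Set


def classify(predicted: Optional[str], reference: str, lexicon: Set[str]) -> str:
    """Label one prediction following the table on this slide."""
    correct = predicted == reference
    in_lexicon = predicted in lexicon
    if in_lexicon:
        return "TP" if correct else "FP"
    return "TN" if correct else "FN"


# Toy data for illustration: (predicted Galician word, reference translation).
lexicon = {"paso", "presunto"}
results = [("paso", "paso"), ("presunto", "xamón"), (None, "corazón")]

counts = Counter(classify(p, r, lexicon) for p, r in results)
print(counts)
```

A missing prediction (`None` here, an assumption of the sketch) counts as FN because it is neither correct nor in the lexicon.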
  • 14. Evaluation 1: Measures
    $$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
    $$\mathrm{precision} = \frac{TP}{TP + FP} \tag{2}$$
    $$\mathrm{recall} = \frac{TP}{TP + FN} \tag{3}$$
    $$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4}$$
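Equations (1) to (4) translate directly into code. The sketch below uses made-up counts purely for illustration, and the function names `f1_score` and `measures` are hypothetical; it also assumes all denominators are non-zero.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, equation (4)."""
    return 2 * precision * recall / (precision + recall)


def measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute equations (1) to (4) from the four outcome counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts, not taken from the evaluation in the slides.
print(measures(tp=80, fp=5, tn=0, fn=15))
```

Because TN is always 0 in this setting, accuracy reduces to TP divided by the total number of pairs.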
  • 15. Evaluation 1: Results
      Id.  Precision  Recall  F1      Accuracy  Correct  ∆
      ID   0.9954     0.5859  0.7376  0.5843    5390     5390
      A    0.9952     0.6038  0.7516  0.6020    5553     163
      B    0.9951     0.6158  0.7608  0.6139    5663     110
      C    0.9952     0.6567  0.7912  0.6546    6038     375
      D    0.9951     0.6687  0.7999  0.6665    6148     110
      E    0.9952     0.6782  0.8066  0.6760    6235     87
      F    0.9952     0.6786  0.8070  0.6764    6239     4
      G    0.9953     0.6838  0.8107  0.6816    6287     48
      H    0.9953     0.6927  0.8169  0.6905    6369     82
      I    0.9953     0.6934  0.8174  0.6911    6375     6
      J    0.9953     0.6964  0.8195  0.6942    6403     28
      K    0.9955     0.7210  0.8363  0.7187    6629     226
      L    0.9955     0.7256  0.8394  0.7232    6671     42
      M    0.9955     0.7284  0.8413  0.7260    6697     26
      N    0.9957     0.7482  0.8544  0.7458    6879     182
  • 16. Evaluation 1: Results
      Id.  Precision  Recall  F1      Accuracy  Correct  ∆
      O    0.9957     0.7496  0.8553  0.7472    6892     13
      P    0.9957     0.7515  0.8565  0.7490    6909     17
      Q    0.9957     0.7588  0.8612  0.7563    6976     67
      R    0.9957     0.7602  0.8621  0.7577    6989     13
      S    0.9958     0.7680  0.8672  0.7655    7061     72
      T    0.9958     0.7703  0.8686  0.7678    7082     21
      U    0.9958     0.7772  0.8731  0.7747    7146     64
      V    0.9958     0.7780  0.8735  0.7755    7153     7
      W    0.9958     0.7783  0.8737  0.7758    7156     3
      X    0.9958     0.7796  0.8746  0.7771    7168     12
      Y    0.9958     0.7806  0.8752  0.7781    7177     9
      Z    0.9958     0.7807  0.8753  0.7782    7178     1
      AA   0.9958     0.7813  0.8756  0.7787    7183     5
  • 17. Evaluation 1: Results
      Id.  Precision  Recall  F1      Accuracy  Correct  ∆
      AB   0.9958     0.7818  0.8759  0.7793    7188     5
      AC   0.9958     0.7822  0.8762  0.7797    7192     4
      AD   0.9959     0.7836  0.8770  0.7810    7204     12
      AE   0.9959     0.7855  0.8783  0.7830    7222     18
      AF   0.9959     0.7863  0.8787  0.7837    7229     7
      AG   0.9957     0.7876  0.8795  0.7849    7240     11
      AH   0.9957     0.7882  0.8799  0.7856    7246     6
      AI   0.9958     0.7903  0.8812  0.7876    7265     19
      AJ   0.9956     0.7928  0.8827  0.7900    7287     22
      AK   0.9956     0.7940  0.8834  0.7912    7298     11
      AL   0.9956     0.7947  0.8839  0.7920    7305     7
      AM   0.9956     0.7951  0.8842  0.7924    7309     4
      AN   0.9956     0.7955  0.8844  0.7927    7312     3
  • 18. Evaluation 2
    Triangulating a bigger dictionary for evaluation purposes:
      PT–SP (12 340) ◦ SP–GL (7 581) ⇒ PT–GL (5 045 pairs); from the Apertium translation software.
      PT–SP ◦ SP–EN (24 912) ◦ EN–GL (17 626) ⇒ PT–GL (6 644 pairs); PT–SP and SP–EN from Apertium, EN–GL from CLUVI.
      PT–EN (14 600) ◦ EN–GL ⇒ PT–GL (8 589 pairs); PT–EN from a merchandising app, EN–GL from CLUVI.
    Adding the dictionaries together resulted in 14 492 pairs.
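Triangulation here means composing two bilingual dictionaries through the language they share. A minimal sketch follows, assuming each dictionary is a mapping from a word to a set of translations; the `compose` function and the toy entries are hypothetical and do not come from Apertium or CLUVI.

```python
from typing import Dict, Set, Tuple

BilingualDict = Dict[str, Set[str]]


def compose(first: BilingualDict, second: BilingualDict) -> Set[Tuple[str, str]]:
    """Pair each source word with every target reachable through a pivot word."""
    pairs = set()
    for source, pivots in first.items():
        for pivot in pivots:
            for target in second.get(pivot, ()):
                pairs.add((source, target))
    return pairs


# Toy PT-SP and SP-GL dictionaries for illustration only.
pt_sp = {"coração": {"corazón"}, "passo": {"paso"}, "janela": {"ventana"}}
sp_gl = {"corazón": {"corazón"}, "paso": {"paso"}}

pt_gl = compose(pt_sp, sp_gl)
print(sorted(pt_gl))
```

Entries whose pivot word is missing from the second dictionary are dropped, which is one reason a composed dictionary can be smaller than either input. Results from several pivot chains can be merged with a set union, which removes duplicate pairs.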
  • 19. Evaluation 2: Results
      Id.  Precision  Recall  F1      Accuracy  Correct  ∆
      ID   0.9668     0.5022  0.6611  0.4937    7155     7155
      A    0.9664     0.5176  0.6741  0.5084    7368     213
      B    0.9663     0.5275  0.6824  0.5179    7506     138
      C    0.9668     0.5646  0.7129  0.5538    8026     520
      D    0.9661     0.5746  0.7206  0.5633    8163     137
      E    0.9658     0.5831  0.7272  0.5713    8279     116
      ...  ...        ...     ...     ...       ...      ...
      AH   0.9660     0.6819  0.7994  0.6659    9650     7
      AI   0.9661     0.6841  0.8010  0.6681    9682     32
      AJ   0.9660     0.6863  0.8025  0.6701    9711     29
      AK   0.9660     0.6873  0.8032  0.6711    9726     15
      AL   0.9661     0.6881  0.8037  0.6718    9736     10
      AM   0.9660     0.6884  0.8039  0.6721    9740     4
      AN   0.9660     0.6887  0.8041  0.6724    9744     4
  • 20. Dictionary Alignment: Results
                     Portuguese Words         Galician Words
      Substitution   Count    Percentage      Count    Percentage
      ID             12711    15.3502%        12711    33.7475%
      A              13082    15.7982%        13065    34.6874%
      B              13447    16.2390%        13421    35.6326%
      C              14348    17.3270%        14321    38.0220%
      D              14764    17.8294%        14728    39.1026%
      E              15174    18.3245%        15138    40.1912%
      ...            ...      ...             ...      ...
      AI             17712    21.3895%        17627    46.7994%
      AJ             17740    21.4233%        17648    46.8552%
      AK             17765    21.4535%        17673    46.9215%
      AL             17784    21.4764%        17693    46.9746%
      AM             17813    21.5115%        17718    47.0410%
      AN             17817    21.5163%        17722    47.0516%
      DIC            20084    24.2540%        19989    53.0705%
  • 21. Final Remarks
    We presented an approach to translate Portuguese dictionary words into Galician words using a set of string substitutions.
    The approach is unable to translate all words.
    A reasonable number of words in Dicionário-Aberto use pre-1930 orthography, which was not dealt with.
    We deliberately ignored a relevant problem: false friends. These come in two kinds:
      Two words that share a subset of their meanings. For instance, talho (PT) and tallo (GL) share the majority of their senses, but some senses are specific to Portuguese (for example, the place where meat is sold).
      Two words that have completely different meanings. An example is presunto (written the same way in both languages), which means ham in Portuguese (a noun) but alleged in Galician (an adjective).
  • 23. Dictionary Alignment by Rewrite-based Entry Translation. Alberto Simões (1), Xavier Gómez Guinovart (2). (1) Centro de Estudos Humanísticos, Universidade do Minho, Campus de Gualtar, Braga, Portugal, ambs@ilch.uminho.pt. (2) Galician Language Technologies and Applications (TALG Group), Universidade de Vigo, Galiza, Spain, xgg@uvigo.es. SLATE 2013