This document describes a study comparing machine translation (MT) quality when using a general corpus versus specialised corpora. It outlines an 11-step context-based MT method that retrieves translation correspondences from massive corpora without requiring parallel text. Tests with specialised corpora (housing ads and news text) were faster and produced higher-quality output than tests with a large general corpus. While specialised corpora excel at domain-specific language, a general corpus needs more dictionary entries and common phrases to improve open-content translation and agreement. The study finds that the suitability of a corpus type depends on the translation needs.
1. Variation-influenced quality for MT: General vs. specialised corpora
Alejandro Curado
Martín Garay
University of Extremadura, Spain
Theoretical background / method: Context-based MT
>Data retrieved from massive corpus: A lot of data to compare n-grams (the more context, the better the correspondences)
>No need to use a parallel corpus (e.g., SMT = aligned / parallel corpora translation)
>Optimal scores on the BLEU (Bilingual Evaluation Understudy) scale for MT: Carbonell, 2006 = “Meaningful Machines” = 0.66 in a “blind test”, out of 0.7 (human), with 53 GB of target text
>General translation (vs. Specialised translation??)
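For readers unfamiliar with the scale quoted above, the standard BLEU definition is reproduced here for reference. The slides do not say which BLEU configuration was used, so the usual defaults (N = 4, uniform weights w_n = 1/N) are only an assumption.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n is the modified n-gram precision of the candidate translation, c is the candidate length, and r is the effective reference length.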
Resources:
English dictionary table with 200,000 entries (single and compound words / idioms).
Spanish dictionary table with more than 5,000,000 entries
Large general numerical corpus that may reach up to 100 GB (end of July 2010): Indexed by text, sentence, word
Improve / increase resources:
Dictionary:
Web pages have been developed to:
1. Add all missing word units (with equivalents)
2. Add word meanings that are not in the dictionaries (WordReference)
Improve / increase resources:
The large corpus.
Indexing:
1. Books on the web.
2. Wikipedia.
3. Sketch Engine-retrieved texts (seed keywords).
Types of Spanish corpora used for the translation tests:
The large corpus (late May 2010)
Nearly 73 million words, 11,490 texts, 3,900,000 sentences
(+ 1,256 news texts indexed in June = + 4 million words)
Experiment corpus (1) with apartment / housing ads (March 2010)
70 texts, 5,455 sentences, 87,353 words
Experiment corpus (2) with international news (June 2010)
286 texts, 2,791 sentences, 125,936 words
Translation procedure.
1st step. Inserting the sentence or text.
The nice big house is located near the sea.
2nd step. Dividing the text into phrases / sentences.
The segmentation is carried out by using the following punctuation symbols:
. ; : ¿? ¡!
In our case:
The nice big house is located near the sea.
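The segmentation in this step can be sketched in a few lines. This is a minimal illustration only: the function name `split_segments` and the choice of a regular expression are assumptions, not the authors' implementation, and the delimiter set is simply the symbols listed on the slide.

```python
import re

# Closing symbols taken from the slide: . ; : ? !
# (the opening marks of the pairs, ¿ and ¡, stay attached to their segment).
_SEGMENT = re.compile(r"[^.;:?!]+[.;:?!]?")


def split_segments(text):
    """Split text into segments, keeping each closing punctuation mark."""
    return [m.strip() for m in _SEGMENT.findall(text) if m.strip()]


print(split_segments("The nice big house is located near the sea."))
print(split_segments("It is old; it is ugly. ¿Dónde está?"))
```

A naive split like this also breaks on decimal points and abbreviations, so a real system would need extra rules for those cases.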
3rd step. Obtaining the numbers that correspond to those words / word units in the English dictionary.
The    nice   big   house  is    located  near   the    sea
44634  30497  6962  22817  3456  27139    30255  44634  39064
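The lookup in this step amounts to replacing each word with its dictionary number. The sketch below uses a toy table holding only the numbers shown on the slide; pairing each number with a word by reading them in sentence order is an assumption, and `EN_DICT` and `to_ids` are hypothetical names.

```python
import re

# Toy English dictionary table. The numbers come from the slide; the
# word-to-number pairing is assumed from the order in which they appear.
EN_DICT = {
    "the": 44634, "nice": 30497, "big": 6962, "house": 22817, "is": 3456,
    "located": 27139, "near": 30255, "sea": 39064,
}


def to_ids(sentence):
    """Return the dictionary number of each word, or None if it is missing."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [EN_DICT.get(w) for w in words]


print(to_ids("The nice big house is located near the sea."))
```

Words that return None are the ones the dictionary web pages mentioned earlier would need to add.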
4th step. We remove the function / nexus words (the words that statistically repeat the most in that language) from the sentence and store them in a separate table.
Sentence: The nice big house is located near the sea.
Function words removed: The, is, near, the
Final phrase: nice big house located sea.
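A small sketch of this filtering step follows. The slides derive the function words statistically from the language; here a fixed set containing just the words removed in the example stands in for that list, which is an assumption, as are the names `FUNCTION_WORDS` and `split_function_words`. Positions are stored so the words can be put back later.

```python
# Stand-in for the statistically derived function-word list (assumed).
FUNCTION_WORDS = {"the", "is", "near"}


def split_function_words(tokens):
    """Separate content words from function words, remembering positions."""
    content, removed = [], []
    for position, token in enumerate(tokens):
        if token.lower() in FUNCTION_WORDS:
            removed.append((position, token))  # kept on a "separate table"
        else:
            content.append(token)
    return content, removed


tokens = "The nice big house is located near the sea".split()
content, removed = split_function_words(tokens)
print(content)
print(removed)
```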
5th step. The remaining words (content words) are sent to the dictionary to retrieve the different translation equivalents they may have.
Note: In the tests, each word was restricted to only two equivalents in Spanish.
6th step. Each n-gram is divided into subn-grams (different combinations of the correspondences), which are then sent to the corpus.
Nice big house located sea.
Subn-gram  Words                       Dictionary numbers
1          bonito gran casa situado    1043795 284672 839170 1098037
2          bonito gran casa situada    1043795 284672 839170 1098063
...
n          bonita gran casa situada    1043794 284672 839170 1098063
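Generating the subn-grams can be pictured as taking one equivalent per content word in every possible combination. The sketch below does this with a Cartesian product; the table `EQUIVALENTS` lists only the Spanish forms that appear on the slide, and both it and `subngrams` are hypothetical names rather than the authors' code.

```python
from itertools import product

# Equivalents per content word, limited to the forms shown on the slide.
EQUIVALENTS = {
    "nice": ["bonito", "bonita"],
    "big": ["gran"],
    "house": ["casa"],
    "located": ["situado", "situada"],
}


def subngrams(ngram):
    """Return every combination of one equivalent per word, in word order."""
    choices = [EQUIVALENTS[word] for word in ngram]
    return [" ".join(combo) for combo in product(*choices)]


for candidate in subngrams(["nice", "big", "house", "located"]):
    print(candidate)
```

With two equivalents for every word, a four-word n-gram would yield 16 candidates, which helps explain why the tests capped the number of equivalents.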
7th step. A score is given to each result obtained; thus, each subn-gram receives a final score and is ranked according to that score.
Subn-gram  Words                       Score
1          bonito gran casa situado    2.5
2          bonito gran casa situada    3.1
n          bonita gran casa situada    7
- Parameters that decide the score given to each subn-gram:
Number of needed words found in the sentence.
Distance found between the words.
Number of needed words found together inside the sentence.
8th step. Scoring the n-gram in relation to the best scores obtained by the subn-grams.
N-gram: Nice big house located    Score: 30
Subn-gram 1: bonito gran casa situado    Score: 2.5
Subn-gram 2: bonito gran casa situada    Score: 3.1
Subn-gram n: bonita gran casa situada    Score: 7
N-gram: big house located sea.    Score: 50
9th step. Combining the n-grams integrated in the sentence / text.
Nice big house located sea.
Parameters for the combination / overlapping:
• Scoring the texts that repeat for the n-grams
• Scoring the sentences that repeat for the n-grams
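The overlapping part of this step can be illustrated by joining two translated n-grams on the words they share. This sketch covers only that joining; it leaves out the scoring of repeated texts and sentences, and the function name `merge_overlapping` and the sample word lists are assumptions.

```python
def merge_overlapping(left, right):
    """Join two word lists on the longest suffix of left that starts right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left + right[size:]
    # No shared words: fall back to plain concatenation.
    return left + right


first = ["bonita", "gran", "casa", "situada"]
second = ["gran", "casa", "situada", "mar"]
print(" ".join(merge_overlapping(first, second)))
```

When the best candidates for neighbouring n-grams disagree on the shared words, no overlap is found, which is one reason a system would prefer candidates drawn from the same text or sentence.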
10th step. We add the function words previously removed from the sentence. We search for these words in the best subn-grams used.
The nice big house is located near the sea.
11th step. Obtaining the translated sentence
The nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar
In the housing ads (first specialised corpus):
The nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar.
The white house has an old gate that is broken and ugly.
La casa blanca tiene una vieja verja averiada y fea.
Time used by the system: 0.98 seconds / 1.2 seconds
In the large corpus (end of May):
The nice big house is located near the sea.
La casa grande está cerca del mar.
The white house has an old gate that is broken and ugly.
La casa blanca tiene una valla vieja que se rompe y es fea
Time used by the system: 3 minutes and 33 seconds / 3 minutes and 39 seconds
Other problems in the large corpus (June: + news):
The director checked the mail and said he had no new mail
Nuevos directores comprobaban correo y la dijo no hay correo
The salesperson decided to stop doing business with them
El vendedor decidió parar a hacer negocios con ellos
Time used by the system: 2 minutes and 12 seconds / 1 minute and 6 seconds
Some linguistic / technical conclusions:
>Data retrieved from massive corpus:
Important to obtain more common phrases / familiar expressions / overlapping connectors
>Data retrieved from the specialised corpus:
Important for fixed phrases / collocations in the field / genre, BUT may need more linguistic information for connections
<Problems: Verb agreement in indirect clauses? / Lower probabilities for open content combinations (e.g., new + mail)
<Ever-important need to improve dictionary entries for the general corpus
<Scores according to context: Texts repeat more in specialised translation (a problem for the large corpus, e.g., nuevos directores)