Variation-influenced quality for MT: General vs. specialised corpora

Alejandro Curado
Martín Garay
University of Extremadura, Spain

• Theoretical background / method: Context-based MT
>Data retrieved from a massive corpus: a large amount of data for comparing n-grams (the more context, the better the correspondences)
>No need for a parallel corpus (unlike SMT, which translates from aligned / parallel corpora)
>Best BLEU (Bilingual Evaluation Understudy) scores for MT: Carbonell, 2006 (“Meaningful Machines”) = 0.66 in a “blind test”, against 0.7 for human translation, with 53 GB of target text
>General translation (vs. specialised translation??)

• Resources:
- English dictionary table with 200,000 entries (single and compound words / idioms).
- Spanish dictionary table with more than 5,000,000 entries.
- Large general numerical corpus that may reach up to 100 GB (by the end of July 2010), indexed by text, sentence and word.

• Improving / extending the resources:
- Dictionary: web pages have been developed to:
1. Add all missing word units (with their equivalents)
2. Add word meanings that are not yet in the dictionaries (WordReference)

• Improving / extending the resources:
- The large corpus. Sources indexed:
1. Books on the web.
2. Wikipedia.
3. Texts retrieved with Sketch Engine (seed keywords).

• Types of Spanish corpora used for the translation tests:
- The large corpus (late May 2010): nearly 73 million words, 11,490 texts, 3,900,000 sentences (+ 1,256 news texts indexed in June = + 4 million words)
- Experiment corpus (1), apartment / housing ads (March 2010): 70 texts, 5,455 sentences, 87,353 words
- Experiment corpus (2), international news (June 2010): 286 texts, 2,791 sentences, 125,936 words



Translation procedure.

1st step. Inserting the sentence or text.

The nice big house is located near the sea.



2nd step. Dividing the text into phrases / sentences.
• The segmentation is carried out using the following punctuation symbols: . ; : ¿? ¡!
• In our case: The nice big house is located near the sea.
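A minimal sketch of this kind of segmentation, assuming the text is split after each closing mark in the list above. The function name `segment` and the rule that opening Spanish marks (¿ ¡) simply stay attached to the segment they open are assumptions, not details given on the slides.

```python
import re

# Closing delimiters from the slide: . ; : ? !
# Assumption: a segment ends at one of these marks followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.;:?!])\s+")


def segment(text: str) -> list[str]:
    """Split text into segments after each closing punctuation mark."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


if __name__ == "__main__":
    for piece in segment("The nice big house is located near the sea. Is it big? Yes: very."):
        print(piece)
```

A one-sentence input such as the example on the slide comes back as a single segment.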



3rd step. Obtaining the numbers that correspond to those words / word units in the English dictionary.

Word:    The    nice   big   house  is    located  near   the    sea
Number:  44634  30497  6962  22817  3456  27139    30255  44634  39064
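The lookup can be pictured as a plain table from word to number. The sketch below holds only the numbers shown on the slide; the name `WORD_IDS`, the lower-casing and the stripping of the final full stop are assumptions made for illustration.

```python
# Toy excerpt of the English dictionary table: only the numbers shown on the slide.
WORD_IDS = {
    "the": 44634,
    "nice": 30497,
    "big": 6962,
    "house": 22817,
    "is": 3456,
    "located": 27139,
    "near": 30255,
    "sea": 39064,
}


def to_ids(sentence: str) -> list[int]:
    """Map each word of a sentence to its dictionary number."""
    tokens = sentence.rstrip(".").lower().split()
    return [WORD_IDS[token] for token in tokens]


if __name__ == "__main__":
    print(to_ids("The nice big house is located near the sea."))
```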



4th step. We remove the function / nexus words (the statistically most frequent words in the language) from the sentence and store them in a separate table.

Sentence: The nice big house is located near the sea.
Function words removed: The, is, near, the
Final phrase: nice big house located sea.
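A small sketch of this filtering step. The set `FUNCTION_WORDS` is an assumed stand-in for the system's frequency-based list and holds just enough entries for the example; keeping each removed word with its position is also an assumption about what the separate table stores.

```python
# Assumed function-word list, just large enough for the example sentence.
FUNCTION_WORDS = {"the", "is", "near"}


def split_function_words(tokens):
    """Return (content words, removed function words with their positions)."""
    content, removed = [], []
    for position, token in enumerate(tokens):
        if token.lower() in FUNCTION_WORDS:
            removed.append((position, token))
        else:
            content.append(token)
    return content, removed


if __name__ == "__main__":
    words = "The nice big house is located near the sea".split()
    content, removed = split_function_words(words)
    print(content)
    print(removed)
```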



5th step. The remaining words (content words) are sent to the dictionary to retrieve the different translation equivalents they may have.
Note: in the tests, retrieval was restricted to only two Spanish equivalents per word.
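As an illustration, the retrieval with the two-equivalent cap might look as follows. The table `EQUIVALENTS` is a toy built mostly from Spanish words that appear on the slides; the third entry for "nice" is invented here only to show the cap taking effect.

```python
# Toy bilingual table; "agradable" is an illustrative extra entry, not from the slides.
EQUIVALENTS = {
    "nice": ["bonito", "bonita", "agradable"],
    "big": ["gran", "grande"],
    "house": ["casa"],
    "located": ["situado", "situada"],
    "sea": ["mar"],
}
MAX_EQUIVALENTS = 2  # the restriction used in the tests


def lookup(words, limit=MAX_EQUIVALENTS):
    """Return at most `limit` Spanish equivalents for each content word."""
    return {word: EQUIVALENTS.get(word, [])[:limit] for word in words}


if __name__ == "__main__":
    print(lookup(["nice", "big", "house", "located", "sea"]))
```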

6th step. Each n-gram is divided into sub-n-grams (the different combinations of the correspondences), which are then sent to the corpus.

Nice big house located sea.

1º  bonito gran casa situado   1043795 284672 839170 1098037
2º  bonito gran casa situada   1043795 284672 839170 1098063
…
nº  bonita gran casa situada   1043794 284672 839170 1098063
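Enumerating "the different combinations of the correspondences" amounts to a Cartesian product over the candidate lists. A minimal sketch with the equivalents shown on the slide (the function name `sub_ngrams` is hypothetical):

```python
from itertools import product


def sub_ngrams(candidates):
    """Yield every combination that picks one equivalent per source word."""
    return list(product(*candidates))


if __name__ == "__main__":
    # Candidate equivalents for: nice, big, house, located
    options = [["bonito", "bonita"], ["gran"], ["casa"], ["situado", "situada"]]
    for number, combo in enumerate(sub_ngrams(options), start=1):
        print(number, " ".join(combo))
```

With two equivalents per word the number of sub-n-grams grows as 2 to the power of the n-gram length, which may be one reason the tests cap the equivalents.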



7th step. A score is given to each result obtained; each sub-n-gram thus receives a final score and is ranked by that score.

Sub-n-gram 1: bonito gran casa situado   SCORE 2.5
Sub-n-gram 2: bonito gran casa situada   SCORE 3.1
Sub-n-gram n: bonita gran casa situada   SCORE 7

- Parameters that decide the score given to each sub-n-gram:
• Number of needed words found in the sentence.
• Distance found between the words.
• Number of needed words found together inside the sentence.
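The slides name the three parameters but not the formula, so the following is only a toy scoring function under assumed weights: it rewards needed words that are found and found side by side, and penalises the gap between them. The name `score_subngram` and the weights `w_found`, `w_together` and `w_gap` are hypothetical, and the values it produces are unrelated to the scores shown above.

```python
def score_subngram(needed, sentence, w_found=1.0, w_together=1.0, w_gap=0.5):
    """Toy score of one corpus sentence against one sub-n-gram (lists of tokens)."""
    needed_set = set(needed)
    positions = [i for i, token in enumerate(sentence) if token in needed_set]
    if not positions:
        return 0.0
    # Parameter 1: how many distinct needed words occur in the sentence.
    found = len({sentence[i] for i in positions})
    # Parameter 2: other tokens lying between the first and last needed word.
    gaps = (positions[-1] - positions[0] + 1) - len(positions)
    # Parameter 3: longest run of needed words standing side by side.
    longest = run = 0
    for token in sentence:
        run = run + 1 if token in needed_set else 0
        longest = max(longest, run)
    return w_found * found + w_together * longest - w_gap * gaps


if __name__ == "__main__":
    corpus_sentence = "la bonita gran casa situada cerca del mar".split()
    for candidate in (["bonito", "gran", "casa", "situado"],
                      ["bonita", "gran", "casa", "situada"]):
        print(" ".join(candidate), score_subngram(candidate, corpus_sentence))
```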

8th step. Scoring each n-gram in relation to the best scores obtained by its sub-n-grams.

N-gram: Nice big house located   SCORE 30
Sub-n-gram 1: bonito gran casa situado   SCORE 2.5
Sub-n-gram 2: bonito gran casa situada   SCORE 3.1
Sub-n-gram n: bonita gran casa situada   SCORE 7

N-gram: big house located sea.   SCORE 50
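The slide does not state how an n-gram score such as 30 or 50 follows from the sub-n-gram scores, so that part is left out here. What can be sketched is how the two n-grams shown could arise: they look like overlapping windows over the content words. The window length of four and the function name `windows` are assumptions.

```python
def windows(words, size=4):
    """Return every run of `size` consecutive words, in order."""
    if len(words) <= size:
        return [tuple(words)]
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


if __name__ == "__main__":
    for ngram in windows(["nice", "big", "house", "located", "sea"]):
        print(" ".join(ngram))
```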



9th step. Combining the n-grams integrated in the sentence / text.

Nice big house located sea.

Parameters for the combination / overlapping:
• Scoring the texts that repeat for the n-grams
• Scoring the sentences that repeat for the n-grams
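One way to read these two parameters: candidates for neighbouring n-grams get a bonus when they were found in the same corpus text, and a larger one when found in the same sentence. The sketch below assumes each candidate carries a set of (text id, sentence id) hits; the function name `overlap_bonus` and both weights are hypothetical.

```python
def overlap_bonus(hits_a, hits_b, text_weight=1.0, sentence_weight=2.0):
    """Bonus for two n-gram candidates that share corpus texts or sentences.

    Each hit is a (text_id, sentence_id) pair.
    """
    shared_texts = {text for text, _ in hits_a} & {text for text, _ in hits_b}
    shared_sentences = set(hits_a) & set(hits_b)
    return text_weight * len(shared_texts) + sentence_weight * len(shared_sentences)


if __name__ == "__main__":
    first = {(1, 10), (2, 5)}            # hits for the first n-gram candidate
    second = {(1, 10), (2, 7), (3, 1)}   # hits for the overlapping n-gram candidate
    print(overlap_bonus(first, second))
```

Under this reading, a small specialised corpus yields such repeats more often than a large general one, which matches the remark in the conclusions that texts repeat more in specialised translation.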



10th step. We add back the function words previously removed from the sentence, searching for these words in the best sub-n-grams used.

The nice big house is located near the sea.
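A rough sketch of the idea of recovering function words from the corpus match itself: take the corpus sentence in which the best sub-n-gram was found and keep everything between its first and last content word. The function name `span_with_function_words` and the example corpus sentence are assumptions; how the system actually selects and places the function words is not described on the slides.

```python
def span_with_function_words(corpus_sentence, content_words):
    """Return the stretch of a corpus sentence from its first to its last content word."""
    tokens = corpus_sentence.split()
    positions = [i for i, token in enumerate(tokens) if token in content_words]
    if not positions:
        return []
    return tokens[min(positions):max(positions) + 1]


if __name__ == "__main__":
    matched = "la casa está situada cerca del mar"  # hypothetical corpus sentence
    print(" ".join(span_with_function_words(matched, {"casa", "situada", "mar"})))
```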



11th step. Obtaining the translated sentence.
The nice big house is located near the sea.

La gran y bonita casa está situada cerca del mar.
In the housing ads (first specialised corpus):
The nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar.
The white house has an old gate that is broken and ugly.
La casa blanca tiene una vieja verja averiada y fea.

• Time used by the system: 0.98 seconds / 1.2 seconds
In the large corpus (end of May):
The nice big house is located near the sea.
La casa grande está cerca del mar.
The white house has an old gate that is broken and ugly.
La casa blanca tiene una valla vieja que se rompe y es fea.
• Time used by the system: 3 minutes and 33 seconds / 3 minutes and 39 seconds
Other problems in the large corpus (June: + news):
The director checked the mail and said he had no new mail.
Nuevos directores comprobaban correo y la dijo no hay correo
The salesperson decided to stop doing business with them.
El vendedor decidió parar a hacer negocios con ellos
• Time used by the system: 2 minutes and 12 seconds / 1 minute and 6 seconds
• Some linguistic / technical conclusions:
>Data retrieved from the massive corpus: important for obtaining more common phrases / familiar expressions / overlapping connectors
>Data retrieved from the specialised corpus: important for fixed phrases / collocations in the field / genre, BUT may need more linguistic information for connections
<Problems: verb agreement in indirect clauses? / Lower probabilities for open content combinations (e.g., new + mail)
<Ever-important need to improve dictionary entries for the general corpus
<Scores according to context: texts repeat more in specialised translation (a problem for the large corpus, e.g., nuevos directores)
