Variation-influenced quality for MT: General vs. specialised corpora

Alejandro Curado
Martín Garay
University of Extremadura, Spain

• Theoretical background / method: Context-based MT
> Data retrieved from a massive corpus: a lot of data for comparing n-grams (the more context, the better the correspondences)
> No need for a parallel corpus (unlike SMT, which translates from aligned / parallel corpora)
> Optimal scores on the BLEU (Bilingual Evaluation Understudy) scale for MT: Carbonell, 2006 (“Meaningful Machines”) = 0.66 in a “blind test”, against 0.7 for human translation, with 53 GB of target text
> General translation (vs. specialised translation?)

• Resources:
- English dictionary table with 200,000 entries (single and compound words / idioms).
- Spanish dictionary table with more than 5,000,000 entries.
- Large general numerical corpus that may reach up to 100 GB (end of July 2010), indexed by text, sentence, and word.

• Improve / increase resources:
- Dictionary: web pages have been developed to:
1. Add all missing word units (with their equivalents)
2. Add word meanings that are not yet in the dictionaries (wordreference)

• Improve / increase resources:
- The large corpus.
- Indexing:
1. Books on the web.
2. Wikipedia.
3. Sketch Engine-retrieved texts (seed keywords).

• Types of Spanish corpora used for the translation tests:
- The large corpus (late May 2010): nearly 73 million words, 11,490 texts, 3,900,000 sentences (+ 1,256 news texts indexed in June = + 4 million words)
- Experiment corpus (1) with apartment / housing ads (March 2010): 70 texts, 5,455 sentences, 87,353 words
- Experiment corpus (2) with international news (June 2010): 286 texts, 2,791 sentences, 125,936 words

Translation procedure.

1st step. Inserting the sentence or text.

The nice big house is located near the sea.

2nd step. Dividing the text into phrases / sentences.
• The segmentation is carried out using the following punctuation symbols: . ; : ¿? ¡!
• In our case:

The nice big house is located near the sea.
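
The segmentation step can be sketched as follows. This is a minimal Python illustration and not the authors' implementation: the function name `split_segments` and the decision to keep the closing punctuation attached to each segment are assumptions. A real system would also need to protect decimal points and abbreviations, which this sketch does not do.

```python
import re

# A segment is a run of non-delimiter characters followed by its closing
# punctuation (. ; : ? !). The opening Spanish marks (¿ ¡) are not
# delimiters, so they stay attached to the segment they open.
_SEGMENT = re.compile(r"[^.;:?!]+[.;:?!]*")


def split_segments(text: str) -> list[str]:
    """Split text into segments, keeping the closing punctuation on each."""
    segments = []
    for match in _SEGMENT.finditer(text):
        segment = match.group().strip()
        if segment:  # skip whitespace-only matches
            segments.append(segment)
    return segments


if __name__ == "__main__":
    print(split_segments("The nice big house is located near the sea."))
    print(split_segments("¿Qué tal? ¡Bien! It has two rooms; both are big."))
```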

3rd step. Obtaining the numbers that correspond to those words / word units in the English dictionary.

The    nice   big   house  is    located  near   the    sea
44634  30497  6962  22817  3456  27139    30255  44634  39064
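
A minimal sketch of this lookup, assuming the dictionary table behaves like a plain word-to-number mapping. The numbers are those on the slide, paired with the words in the order in which both appear; the names `ENGLISH_IDS` and `to_ids` are hypothetical.

```python
from typing import Optional

# Toy excerpt of the English dictionary table. The pairing of words and
# numbers is an assumption based on their order on the slide.
ENGLISH_IDS = {
    "the": 44634, "nice": 30497, "big": 6962, "house": 22817,
    "is": 3456, "located": 27139, "near": 30255, "sea": 39064,
}


def to_ids(sentence: str) -> list[Optional[int]]:
    """Map each word of a segment to its dictionary number (None if missing)."""
    words = sentence.lower().rstrip(".;:?!").split()
    return [ENGLISH_IDS.get(word) for word in words]


if __name__ == "__main__":
    print(to_ids("The nice big house is located near the sea."))
```

Words that are missing from the table come back as `None`, which is where the dictionary-extension web pages mentioned earlier would come in.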

4th step. We remove the function / nexus words (the words that are statistically most frequent in the language) from the sentence and store them in a separate table.

The nice big house is located near the sea.

Final phrase: nice big house located sea.
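
The removal can be sketched like this. The function-word set is hard-coded here for illustration, whereas the slide derives it from frequency statistics; storing each removed word with its position is also an assumption, made so that the words can be put back later.

```python
# Hypothetical function-word list for the example sentence only.
FUNCTION_WORDS = {"the", "is", "near"}


def split_function_words(words):
    """Return (content_words, removed); removed holds (position, word) pairs."""
    content, removed = [], []
    for position, word in enumerate(words):
        if word.lower() in FUNCTION_WORDS:
            removed.append((position, word))  # kept aside in a separate table
        else:
            content.append(word)
    return content, removed


if __name__ == "__main__":
    tokens = "The nice big house is located near the sea".split()
    content, removed = split_function_words(tokens)
    print(content)
    print(removed)
```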

5th step. The remaining words (content words) are sent to the dictionary to retrieve the different translation equivalents they may have.
1. Restriction in the tests: only two Spanish equivalents per word.
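
A small sketch of the lookup with the two-equivalent restriction. The entries below are illustrative: most are taken from the Spanish words that appear elsewhere in these slides, "agradable" is added only to show the cut-off, and the ordering of equivalents is an assumption.

```python
# Toy bilingual entries, listed in an assumed order of preference.
EQUIVALENTS = {
    "nice": ["bonito", "bonita", "agradable"],  # third entry is hypothetical
    "big": ["gran", "grande"],
    "house": ["casa"],
    "located": ["situado", "situada"],
    "sea": ["mar"],
}


def lookup_equivalents(content_words, limit=2):
    """Return at most `limit` Spanish equivalents for each content word."""
    return {
        word: EQUIVALENTS.get(word.lower(), [])[:limit]
        for word in content_words
    }


if __name__ == "__main__":
    print(lookup_equivalents(["nice", "big", "house", "located", "sea"]))
```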

6th step. Each n-gram is divided into sub-n-grams (different combinations of the correspondences), which are then sent to the corpus.

Nice big house located sea.

No.  Sub-n-gram                 Word numbers
1º   bonito gran casa situado   1043795 284672 839170 1098037
2º   bonito gran casa situada   1043795 284672 839170 1098063
...
nº   bonita gran casa situada   1043794 284672 839170 1098063
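
One way to enumerate the sub-n-grams is a Cartesian product over the equivalents of each word. The sketch below uses `itertools.product` from the Python standard library; treating every combination as a candidate is an assumption, since the slide does not say whether any combinations are pruned before they are sent to the corpus.

```python
from itertools import product


def sub_ngrams(candidates):
    """Build every combination that takes one equivalent per source word."""
    return [list(combination) for combination in product(*candidates)]


if __name__ == "__main__":
    # Equivalents for "nice big house located", as shown on the slide.
    candidates = [["bonito", "bonita"], ["gran"], ["casa"], ["situado", "situada"]]
    for number, combination in enumerate(sub_ngrams(candidates), start=1):
        print(number, " ".join(combination))
```

With two equivalents per word the number of combinations doubles with each added word, which may be one reason for the two-equivalent restriction used in the tests.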

7th step. A score is given to each result obtained; each sub-n-gram thus receives a final score and is ranked according to it.

                                          SCORE
Sub-n-gram 1.  bonito gran casa situado   2.5
Sub-n-gram 2.  bonito gran casa situada   3.1
Sub-n-gram n.  bonita gran casa situada   7

- Parameters that decide the score given to each sub-n-gram:
• Number of needed words found in the sentence.
• Distance found between the words.
• Number of needed words found together inside the sentence.
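
The three parameters above could be combined along the following lines. The slides do not give the actual formula or weights, so the combination used here (words found, plus the longest adjacent run, minus a penalty per intervening word) and the name `score_sub_ngram` are purely illustrative assumptions, and the resulting numbers are not meant to match the scores on the slide.

```python
def score_sub_ngram(sub_ngram, sentence_tokens, gap_penalty=0.5):
    """Toy score for one sub-n-gram against one corpus sentence."""
    # Position of the first occurrence of each needed word in the sentence.
    positions = sorted(
        {sentence_tokens.index(word) for word in sub_ngram if word in sentence_tokens}
    )
    if not positions:
        return 0.0
    found = len(positions)                              # needed words found
    gaps = (positions[-1] - positions[0] + 1) - found   # words in between
    longest = run = 1                                   # words found together
    for previous, current in zip(positions, positions[1:]):
        run = run + 1 if current == previous + 1 else 1
        longest = max(longest, run)
    return found + longest - gap_penalty * gaps


if __name__ == "__main__":
    sentence = "la bonita gran casa situada cerca del mar".split()
    candidates = [
        ["bonito", "gran", "casa", "situado"],
        ["bonita", "gran", "casa", "situada"],
    ]
    # Rank the sub-n-grams from best to worst score.
    for candidate in sorted(candidates, key=lambda c: score_sub_ngram(c, sentence), reverse=True):
        print(score_sub_ngram(candidate, sentence), " ".join(candidate))
```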

8th step. Scoring the n-gram in relation to the best scores obtained by the sub-n-grams.

N-gram: Nice big house located             SCORE 30
Sub-n-gram 1.  bonito gran casa situado    2.5
Sub-n-gram 2.  bonito gran casa situada    3.1
Sub-n-gram n.  bonita gran casa situada    7

N-gram: big house located sea.             SCORE 50

9th step. Combining the n-grams integrated in the sentence / text.

Nice big house located sea.

Parameters for the combination / overlapping:
• Scoring the texts that repeat for the n-grams
• Scoring the sentences that repeat for the n-grams
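
A rough sketch of how repetition across n-grams might be rewarded. It assumes each n-gram's best corpus match is identified by a (text, sentence) pair, which fits the corpus being indexed by text and sentence; the weights and the function name `repetition_bonus` are hypothetical and not taken from the slides.

```python
from collections import Counter


def repetition_bonus(hits, text_weight=1.0, sentence_weight=2.0):
    """Give each n-gram a bonus when other n-grams matched the same text or sentence.

    `hits` holds one (text_id, sentence_id) pair per n-gram.
    """
    text_counts = Counter(text_id for text_id, _ in hits)
    sentence_counts = Counter(hits)
    return [
        text_weight * (text_counts[text_id] - 1)
        + sentence_weight * (sentence_counts[(text_id, sentence_id)] - 1)
        for text_id, sentence_id in hits
    ]


if __name__ == "__main__":
    # Two n-grams matched sentence 3 of text 7; a third matched another text.
    print(repetition_bonus([(7, 3), (7, 3), (9, 1)]))
```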

10th step. We add the function words previously removed from the sentence. We search for these words in the best sub-n-grams used.

The nice big house is located near the sea.

11th step. Obtaining the translated sentence

The nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar.
In the housing ads (first specialised corpus):
the nice big house is located near the sea.
La gran y bonita casa está situada cerca del mar.
the white house has an old gate that is broken and ugly.
La casa blanca tiene una vieja verja averiada y fea.

• Time used by the system: 0.98 seconds / 1.2 seconds
In the large corpus (end of May):
the nice big house is located near the sea.
La casa grande está cerca del mar.
the white house has an old gate that is broken and ugly.
La casa blanca tiene una valla vieja que se rompe y es fea

• Time used by the system: 3 minutes and 33 seconds / 3 minutes and 39 seconds
Other problems in the large corpus (June: + news):
The director checked the mail and said he had no new mail
Nuevos directores comprobaban correo y la dijo no hay correo
The salesperson decided to stop doing business with them
El vendedor decidió parar a hacer negocios con ellos

• Time used by the system: 2 minutes and 12 seconds / 1 minute and 6 seconds
• Some linguistic / technical conclusions:
> Data retrieved from the massive corpus: important for obtaining more common phrases / familiar expressions / overlapping connectors
> Data retrieved from the specialised corpus: important for fixed phrases / collocations in the field / genre, BUT may need more linguistic information for connections
< Problems: verb agreement in indirect clauses? / Lower probabilities for open content combinations (e.g., new + mail)
< Ever-important need to improve dictionary entries for the general corpus
< Scores according to context: texts repeat more in specialised translation (a problem for the large corpus, e.g., nuevos directores)
