SlideShare une entreprise Scribd logo
1  sur  30
3-step Parallel Corpus Cleaning using
Monolingual Crowd Workers
Toshiaki Nakazawa, Sadao Kurohashi
(Kyoto University)
Hayato Kobayashi, Hiroki Ishikawa
Manabu Sassano
(Yahoo Japan Corporation)
20/05/2015@PACLING2015
Parallel Corpora
• Essential resources for almost all MT systems
• The quality and quantity greatly affect the
translation quality
• Can be automatically constructed from
existing resources
– Europarl, patent families, Wikipedia…
• Need to manually construct it for domains
which do not have enough existing resources
2
Quality of Parallel Corpus
• Translation flaws are inevitable even thought
the professionals translate
– Homer nods (弘法にも筆の誤り in Japanese)
• The number of flaws might be reduced by
reviewing the whole corpus, but impossible
– The size of the parallel corpus is usually very big
– Very costly if we ask professionals to modify
3
This Work
• Detect and edit the translation flaws in the
existing manually-translated parallel corpus in
effective and cheap way
• Use crowdsourcing in 3-steps
1. Fluency Judgement
2. Edit of Unnatural Sentences
3. Verification of Edits
• The workers can be monolingual
4
Outline
• Motivation
• Brief introduction of collaborative research
between Yahoo Japan and Kyoto University
• 3-step parallel corpus cleaning
• Experiments
– Parallel corpus cleaning
– Translation
• Conclusion and Future Work
5
BRIEF INTRODUCTION OF
COLLABORATIVE RESEARCH
6
Collaborative Research Between Yahoo
Japan and Kyoto University
• Goal: Improve the Chinese-to-Japanese
translation for E-commerce site
• Task:
– Develop a corpus-based MT system
– Construct a parallel corpus of EC-site, especially
fashion domain
7
Fashion-domain EC-site Parallel Corpus
• 1.2M sentences (Zh: 6.3M, Ja: 8.7M words)
• Manually translated from fashion item pages
of Chinese EC-site (taobao) into Japanese
• Most of the sentences were translated by
Chinese native speakers (through the
translation company)
– Found many translation flaws in the Japanese
translations
8
[FDEC Corpus]
Mother-tongue Principle
9
http://portal.unesco.org/en/ev.php-URL_ID=13089&URL_DO=DO_TOPIC&URL_SECTION=201.html
“A translator should, as far as possible, translate
into his own mother tongue or into a language
of which he or she has a mastery equal to that of
his or her mother tongue.”
Recommendation on the Legal Protection of Translators and Translations and the
Practical Means to improve the Status of Translators, UNESCO, 22 Nov. 1976
Source Natives vs. Target Natives
• Pros and cons for source and target native
speakers
• Target natives for translation modification
[Albrecht+, 2009]
10
Source Natives Target Natives
Background knowledge
about the input sentence
High Medium/Low
Fluency and grammatical
correctness of the
output sentence
Medium/Low High
Examples of Translation Flaws
• Insertion
Ja: 随意にに1種類だけ注文
(order one type at at your own will)
• Remaining Chinese character
Ja: 元气あふれるという効果があります
Ref: 元気あふれるという効果があります
• Unnatural
Ja: お手入れの時、電源を切れ、 プラグを抜いてください。
(when cleaning, turn the power off, please pull out the plug)
11
气 気
Hanzi Kanji
Other Translation Flaws
• Omission (not translated)
Zh: 看看有没有其他合适的商品
Ja: 看看有没有その他合適的商品
• Mistranslation
Zh:加湿器功能 (functions of humidifier)
Ja: 除湿器の機能 (functions of dehumidifier)
Zh: 木耳
12
our framework cannot fix these kinds of flaws
frillwood ear mushroomskirt with wood ear mushroom?
3-STEP
PARALLEL CORPUS CLEANING
13
3-steps of Cleaning
1. Fluency Judgement
– detects the translation flaws
2. Edit of Unnatural Sentences
– edit the translated sentences
3. Verification of Edits
– check if the edited translation is better than the
original one
14
Step 1: Fluency Judgement
• Task: judge if the sentences are natural and
grammatically correct
• Only showing the translated (target, Japanese)
sentences
15
e.g. 随意にに1種類だけ注文
Is this sentence natural and
grammatically correct?
No!!
No!!
Step 2: Edit of Unnatural Sentences
• Task: edit the unnatural translated sentences
• only showing the translated sentences, or
show the source sentence as well for the
reference
16
e.g. 随意にに1種類だけ注文(随便拍下一种)
Please modify this
sentence to be natural and
grammatically correct
随意に1種類だけ注文
Nothing to modify!
Step 3: Verification of Edits
• Task: judge if the edited translation is better
than the original one
• This step is important to further improve the
quality of the outcome because the edits are
not necessarily correct
17
e.g. 随意にに1種類だけ注文 vs. 随意に1種類だけ注文
Is the right one more
natural and grammatically
correct than the left one?
Yes!!
Yes!!
CORPUS CLEANING EXPERIMENTS
18
Crowdsourcing Service in the World
19
http://www.crowdinfo.jp/2014/02/02/world-crowdsourcing-service/
• Several styles of crowdsourcing tasks such as
Yes/No questions and free writings
• The service is run in Japan; therefore most of
the workers are Japanese
• Not able to select the workers by their abilities
• The workers in our experiments do not
necessarily understand Chinese
– perhaps almost all of them does not
20
Crowdsourcing
http://crowdsourcing.yahoo.co.jp
Step 1: Fluency Judgement
• 358,085 sentences from the FDEC corpus with
length between 10 and 130 characters
• Only Japanese sentences are shown
• Asked 5 different workers for each question
21
# unnatural 5 4 3 2 1 0
# sents.
ratio
13,056
(3.6%)
35,048
(9.8%)
60,200
(16.8%)
83,150
(23.2%)
93,187
(26.0%)
73,444
(20.5%)
30%!
Step 2: Edit of Unnatural Sentences
• 47,420 sentences which were judged as
unnatural by 4 or more workers in Step 1
• Original Chinese sentence is also shown
• Asked 3 different workers for each question
22
# edits 3 2 1 0
# sents.
ratio
3,755
(7.9%)
12,498
(26.4%)
18,289
(38.6%)
12,878
(27.2%)
Step 3: Verification of Edits
• 54,550 edits which were generated in Step 2
• Original Chinese sentence is also shown
• Asked 5 different workers for each question
23
# better 5 4 3 2 1 0
# sents.
ratio
25,053
(45.9%)
16,478
(30.2%)
7,706
(14.1%)
3,338
(6.1%)
1,462
(2.7%)
513
(0.9%)
Translation Experiment
• Dataset: whole FDEC Corpus
– Cleaned: verified to be better by the majority
• Decoder: KyotoEBMT [Richardson+, 2014]
• Evaluation: BLEU
24
# sentences Original Cleaned
Train 1,220,597 1,256,908
Dev 11,186 11,489
Test 11,200 11,495
Experimental Results
• Corpus cleaning contributes to improve the
translation quality!
• Cleaning the Dev and Test sets has bad effect
on translation quality…
25
Train Original Cleaned Cleaned Cleaned
Dev Original Original Cleaned Cleaned
Test Original Original Original Cleaned
BLEU 21.39 21.69 21.34 21.12
Natural, but Incorrect/Unequal
• Reviewed 100 edits which are judged to be
more natural than the original sentence by 5
workers
• Found 3 types of inequalities
1. deletion of symbols (8 cases)
2. omission (13 cases)
3. mistranslation (5 cases)
see the proceedings for detailed examples
26
Experimental Results
• The inequalities have bad effect on the
automatic evaluation scores because they
suppose the content of the input and output
are strictly equal
27
Train Original Cleaned Cleaned Cleaned
Dev Original Original Cleaned Cleaned
Test Original Original Original Cleaned
BLEU 21.39 21.69 21.34 21.12
Crowdsourcing Cost
• Cost for cleaning 6.8M words used in the
experiments
28
Professional* Our Work
Fee 40 million JPY 2.6 million JPY
Time 1700 days 186 hours
* These values are estimated from
http://www.editage.com
Conclusion and Future Work
• Proposed a framework of cleaning existing
parallel corpora efficiently and cheaply
– 3-step monolingual crowdsourcing
– Improved the fluency of the sentences
• Future work
– How to reduce the inequalities of the edits?
– How to improve the correctness of the translation
by monolingual workers?
29
30

Contenu connexe

En vedette

Crowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsCrowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsMatthew Lease
 
Promoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationPromoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationToshiaki Nakazawa
 
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Toshiaki Nakazawa
 
Data Science with Humans in the Loop
Data Science with Humans in the LoopData Science with Humans in the Loop
Data Science with Humans in the LoopLora Aroyo
 
G社のNMT論文を読んでみた
G社のNMT論文を読んでみたG社のNMT論文を読んでみた
G社のNMT論文を読んでみたToshiaki Nakazawa
 
第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析Toshiaki Nakazawa
 
Attention-based NMT description
Attention-based NMT descriptionAttention-based NMT description
Attention-based NMT descriptionToshiaki Nakazawa
 
自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning自然言語処理のためのDeep Learning
自然言語処理のためのDeep LearningYuta Kikuchi
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情Yuta Kikuchi
 
ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017Toshiaki Nakazawa
 

En vedette (12)

Crowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsCrowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to Ethics
 
Promoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationPromoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine Translation
 
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
 
Data Science with Humans in the Loop
Data Science with Humans in the LoopData Science with Humans in the Loop
Data Science with Humans in the Loop
 
G社のNMT論文を読んでみた
G社のNMT論文を読んでみたG社のNMT論文を読んでみた
G社のNMT論文を読んでみた
 
第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析
 
Attention-based NMT description
Attention-based NMT descriptionAttention-based NMT description
Attention-based NMT description
 
NLP2017 NMT Tutorial
NLP2017 NMT TutorialNLP2017 NMT Tutorial
NLP2017 NMT Tutorial
 
自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向
 
ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017
 

Similaire à 3-step parallel corpus cleaning using monolingual crowd workers

Translation effect.ppt
Translation effect.pptTranslation effect.ppt
Translation effect.pptALFAFAAMIN
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translationMarcis Pinnis
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
 
Intro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationIntro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationFederal Communicators Network
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Ron Martinez
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
 
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Antonio Toral
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Ron Martinez
 
How to Self Assess Your Skillset
How to Self Assess Your SkillsetHow to Self Assess Your Skillset
How to Self Assess Your SkillsetEliana Lobo
 
Multi lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationMulti lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationAashna Phanda
 
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...Lifeng (Aaron) Han
 
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdf
TRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdfTRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdf
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdfCarlosGPNCCUTIMB
 
Non-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorNon-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorMark Matsuno
 

Similaire à 3-step parallel corpus cleaning using monolingual crowd workers (20)

1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 
Translation effect.ppt
Translation effect.pptTranslation effect.ppt
Translation effect.ppt
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Intro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationIntro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 Presentation
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019
 
How to Self Assess Your Skillset
How to Self Assess Your SkillsetHow to Self Assess Your Skillset
How to Self Assess Your Skillset
 
Multi lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationMulti lingual corpus for machine aided translation
Multi lingual corpus for machine aided translation
 
Modality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine TranslationModality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine Translation
 
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
 
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdf
TRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdfTRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdf
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdf
 
Machine translator Introduction
Machine translator IntroductionMachine translator Introduction
Machine translator Introduction
 
Non-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorNon-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editor
 

Dernier

trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squaresusmanzain586
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxEran Akiva Sinbar
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naJASISJULIANOELYNV
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024innovationoecd
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPirithiRaju
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPirithiRaju
 
Organic farming with special reference to vermiculture
Organic farming with special reference to vermicultureOrganic farming with special reference to vermiculture
Organic farming with special reference to vermicultureTakeleZike1
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPirithiRaju
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...D. B. S. College Kanpur
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayupadhyaymani499
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxpriyankatabhane
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubaikojalkojal131
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)Columbia Weather Systems
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫qfactory1
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxuniversity
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptJoemSTuliba
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensorsonawaneprad
 

Dernier (20)

trihybrid cross , test cross chi squares
trihybrid cross , test cross chi squarestrihybrid cross , test cross chi squares
trihybrid cross , test cross chi squares
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?Let’s Say Someone Did Drop the Bomb. Then What?
Let’s Say Someone Did Drop the Bomb. Then What?
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
Organic farming with special reference to vermiculture
Organic farming with special reference to vermicultureOrganic farming with special reference to vermiculture
Organic farming with special reference to vermiculture
 
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdfPests of jatropha_Bionomics_identification_Dr.UPR.pdf
Pests of jatropha_Bionomics_identification_Dr.UPR.pdf
 
User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
Fertilization: Sperm and the egg—collectively called the gametes—fuse togethe...
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 
Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫Manassas R - Parkside Middle School 🌎🏫
Manassas R - Parkside Middle School 🌎🏫
 
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptxThermodynamics ,types of system,formulae ,gibbs free energy .pptx
Thermodynamics ,types of system,formulae ,gibbs free energy .pptx
 
Four Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.pptFour Spheres of the Earth Presentation.ppt
Four Spheres of the Earth Presentation.ppt
 
Environmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial BiosensorEnvironmental Biotechnology Topic:- Microbial Biosensor
Environmental Biotechnology Topic:- Microbial Biosensor
 

3-step parallel corpus cleaning using monolingual crowd workers

  • 1. 3-step Parallel Corpus Cleaning using Monolingual Crowd Workers Toshiaki Nakazawa, Sadao Kurohashi (Kyoto University) Hayato Kobayashi, Hiroki Ishikawa Manabu Sassano (Yahoo Japan Corporation) 20/05/2015@PACLING2015
  • 2. Parallel Corpora • Essential resources for almost all MT systems • The quality and quantity greatly affect the translation quality • Can be automatically constructed from existing resources – Europarl, patent families, Wikipedia… • Need to manually construct it for domains which do not have enough existing resources 2
  • 3. Quality of Parallel Corpus • Translation flaws are inevitable even thought the professionals translate – Homer nods (弘法にも筆の誤り in Japanese) • The number of flaws might be reduced by reviewing the whole corpus, but impossible – The size of the parallel corpus is usually very big – Very costly if we ask professionals to modify 3
  • 4. This Work • Detect and edit the translation flaws in the existing manually-translated parallel corpus in effective and cheap way • Use crowdsourcing in 3-steps 1. Fluency Judgement 2. Edit of Unnatural Sentences 3. Verification of Edits • The workers can be monolingual 4
  • 5. Outline • Motivation • Brief introduction of collaborative research between Yahoo Japan and Kyoto University • 3-step parallel corpus cleaning • Experiments – Parallel corpus cleaning – Translation • Conclusion and Future Work 5
  • 7. Collaborative Research Between Yahoo Japan and Kyoto University • Goal: Improve the Chinese-to-Japanese translation for E-commerce site • Task: – Develop a corpus-based MT system – Construct a parallel corpus of EC-site, especially fashion domain 7
  • 8. Fashion-domain EC-site Parallel Corpus • 1.2M sentences (Zh: 6.3M, Ja: 8.7M words) • Manually translated from fashion item pages of Chinese EC-site (taobao) into Japanese • Most of the sentences were translated by Chinese native speakers (through the translation company) – Found many translation flaws in the Japanese translations 8 [FDEC Corpus]
  • 9. Mother-tongue Principle 9 http://portal.unesco.org/en/ev.php-URL_ID=13089&URL_DO=DO_TOPIC&URL_SECTION=201.html “A translator should, as far as possible, translate into his own mother tongue or into a language of which he or she has a mastery equal to that of his or her mother tongue.” Recommendation on the Legal Protection of Translators and Translations and the Practical Means to improve the Status of Translators, UNESCO, 22 Nov. 1976
  • 10. Source Natives vs. Target Natives • Pros and cons for source and target native speakers • Target natives for translation modification [Albrecht+, 2009] 10 Source Natives Target Natives Background knowledge about the input sentence High Medium/Low Fluency and grammatical correctness of the output sentence Medium/Low High
  • 11. Examples of Translation Flaws • Insertion Ja: 随意にに1種類だけ注文 (order one type at at your own will) • Remaining Chinese character Ja: 元气あふれるという効果があります Ref: 元気あふれるという効果があります • Unnatural Ja: お手入れの時、電源を切れ、 プラグを抜いてください。 (when cleaning, turn the power off, please pull out the plug) 11 气 気 Hanzi Kanji
  • 12. Other Translation Flaws • Omission (not translated) Zh: 看看有没有其他合适的商品 Ja: 看看有没有その他合適的商品 • Mistranslation Zh:加湿器功能 (functions of humidifier) Ja: 除湿器の機能 (functions of dehumidifier) Zh: 木耳 12 our framework cannot fix these kinds of flaws frillwood ear mushroomskirt with wood ear mushroom?
  • 14. 3-steps of Cleaning 1. Fluency Judgement – detects the translation flaws 2. Edit of Unnatural Sentences – edit the translated sentences 3. Verification of Edits – check if the edited translation is better than the original one 14
  • 15. Step 1: Fluency Judgement • Task: judge if the sentences are natural and grammatically correct • Only showing the translated (target, Japanese) sentences 15 e.g. 随意にに1種類だけ注文 Is this sentence natural and grammatically correct? No!! No!!
  • 16. Step 2: Edit of Unnatural Sentences • Task: edit the unnatural translated sentences • only showing the translated sentences, or show the source sentence as well for the reference 16 e.g. 随意にに1種類だけ注文(随便拍下一种) Please modify this sentence to be natural and grammatically correct 随意に1種類だけ注文 Nothing to modify!
  • 17. Step 3: Verification of Edits • Task: judge if the edited translation is better than the original one • This step is important to further improve the quality of the outcome because the edits are not necessarily correct 17 e.g. 随意にに1種類だけ注文 vs. 随意に1種類だけ注文 Is the right one more natural and grammatically correct than the left one? Yes!! Yes!!
  • 19. Crowdsourcing Service in the World 19 http://www.crowdinfo.jp/2014/02/02/world-crowdsourcing-service/
  • 20. • Several styles of crowdsourcing tasks such as Yes/No questions and free writings • The service is run in Japan; therefore most of the workers are Japanese • Not able to select the workers by their abilities • The workers in our experiments do not necessarily understand Chinese – perhaps almost all of them does not 20 Crowdsourcing http://crowdsourcing.yahoo.co.jp
  • 21. Step 1: Fluency Judgement • 358,085 sentences from the FDEC corpus with length between 10 and 130 characters • Only Japanese sentences are shown • Asked 5 different workers for each question 21 # unnatural 5 4 3 2 1 0 # sents. ratio 13,056 (3.6%) 35,048 (9.8%) 60,200 (16.8%) 83,150 (23.2%) 93,187 (26.0%) 73,444 (20.5%) 30%!
  • 22. Step 2: Edit of Unnatural Sentences • 47,420 sentences which were judged as unnatural by 4 or more workers in Step 1 • Original Chinese sentence is also shown • Asked 3 different workers for each question 22 # edits 3 2 1 0 # sents. ratio 3,755 (7.9%) 12,498 (26.4%) 18,289 (38.6%) 12,878 (27.2%)
  • 23. Step 3: Verification of Edits • 54,550 edits which were generated in Step 2 • Original Chinese sentence is also shown • Asked 5 different workers for each question 23 # better 5 4 3 2 1 0 # sents. ratio 25,053 (45.9%) 16,478 (30.2%) 7,706 (14.1%) 3,338 (6.1%) 1,462 (2.7%) 513 (0.9%)
  • 24. Translation Experiment • Dataset: whole FDEC Corpus – Cleaned: verified to be better by the majority • Decoder: KyotoEBMT [Richardson+, 2014] • Evaluation: BLEU 24 # sentences Original Cleaned Train 1,220,597 1,256,908 Dev 11,186 11,489 Test 11,200 11,495
  • 25. Experimental Results • Corpus cleaning contributes to improve the translation quality! • Cleaning the Dev and Test sets has bad effect on translation quality… 25 Train Original Cleaned Cleaned Cleaned Dev Original Original Cleaned Cleaned Test Original Original Original Cleaned BLEU 21.39 21.69 21.34 21.12
  • 26. Natural, but Incorrect/Unequal • Reviewed 100 edits which are judged to be more natural than the original sentence by 5 workers • Found 3 types of inequalities 1. deletion of symbols (8 cases) 2. omission (13 cases) 3. mistranslation (5 cases) see the proceedings for detailed examples 26
  • 27. Experimental Results • The inequalities have bad effect on the automatic evaluation scores because they suppose the content of the input and output are strictly equal 27 Train Original Cleaned Cleaned Cleaned Dev Original Original Cleaned Cleaned Test Original Original Original Cleaned BLEU 21.39 21.69 21.34 21.12
  • 28. Crowdsourcing Cost • Cost for cleaning 6.8M words used in the experiments 28 Professional* Our Work Fee 40 million JPY 2.6 million JPY Time 1700 days 186 hours * These values are estimated from http://www.editage.com
  • 29. Conclusion and Future Work • Proposed a framework of cleaning existing parallel corpora efficiently and cheaply – 3-step monolingual crowdsourcing – Improved the fluency of the sentences • Future work – How to reduce the inequalities of the edits? – How to improve the correctness of the translation by monolingual workers? 29
  • 30. 30

Notes de l'éditeur

  1. proverb
  2. imperative
  3. mù’ěr
  4. The bilingual workers would edit the translations more precisely with the reference source sentence, and monolingual workers just ignore them