3-step Parallel Corpus Cleaning using
Monolingual Crowd Workers
Toshiaki Nakazawa, Sadao Kurohashi
(Kyoto University)
Hayato Kobayashi, Hiroki Ishikawa,
Manabu Sassano
(Yahoo Japan Corporation)
20/05/2015@PACLING2015
Parallel Corpora
• Essential resources for almost all MT systems
• Their quality and quantity greatly affect the
translation quality
• Can be automatically constructed from
existing resources
– Europarl, patent families, Wikipedia…
• Must be constructed manually for domains
which do not have enough existing resources
Quality of Parallel Corpus
• Translation flaws are inevitable even when
professionals translate
– Even Homer nods (弘法にも筆の誤り in Japanese)
• The number of flaws might be reduced by
reviewing the whole corpus, but this is impractical
– The size of the parallel corpus is usually very big
– Very costly if we ask professionals to make the corrections
This Work
• Detect and edit the translation flaws in an
existing manually translated parallel corpus in
an effective and cheap way
• Use crowdsourcing in 3 steps
1. Fluency Judgement
2. Edit of Unnatural Sentences
3. Verification of Edits
• The workers can be monolingual
Outline
• Motivation
• Brief introduction of collaborative research
between Yahoo Japan and Kyoto University
• 3-step parallel corpus cleaning
• Experiments
– Parallel corpus cleaning
– Translation
• Conclusion and Future Work
BRIEF INTRODUCTION OF
COLLABORATIVE RESEARCH
Collaborative Research Between Yahoo
Japan and Kyoto University
• Goal: Improve Chinese-to-Japanese
translation for an e-commerce (EC) site
• Task:
– Develop a corpus-based MT system
– Construct a parallel corpus for the EC site, especially
in the fashion domain
Fashion-domain EC-site Parallel Corpus
• 1.2M sentences (Zh: 6.3M, Ja: 8.7M words)
• Manually translated from fashion item pages
of a Chinese EC site (Taobao) into Japanese
• Most of the sentences were translated by
Chinese native speakers (through a
translation company)
– Found many translation flaws in the Japanese
translations
[FDEC Corpus]
Mother-tongue Principle
“A translator should, as far as possible, translate
into his own mother tongue or into a language
of which he or she has a mastery equal to that of
his or her mother tongue.”
Recommendation on the Legal Protection of Translators and Translations and the
Practical Means to improve the Status of Translators, UNESCO, 22 Nov. 1976
http://portal.unesco.org/en/ev.php-URL_ID=13089&URL_DO=DO_TOPIC&URL_SECTION=201.html
Source Natives vs. Target Natives
• Pros and cons for source and target native
speakers
• Target natives for translation modification
[Albrecht+, 2009]
                                                              Source Natives   Target Natives
Background knowledge about the input sentence                 High             Medium/Low
Fluency and grammatical correctness of the output sentence    Medium/Low       High
Examples of Translation Flaws
• Insertion
Ja: 随意にに1種類だけ注文
(order one type at at your own will)
• Remaining Chinese character
Ja: 元气あふれるという効果があります
Ref: 元気あふれるという効果があります
(Hanzi 气 vs. Kanji 気)
• Unnatural
Ja: お手入れの時、電源を切れ、 プラグを抜いてください。
(when cleaning, turn the power off, please pull out the plug)
Other Translation Flaws
• Omission (not translated)
Zh: 看看有没有其他合适的商品
Ja: 看看有没有その他合適的商品
• Mistranslation
Zh: 加湿器功能 (functions of humidifier)
Ja: 除湿器の機能 (functions of dehumidifier)
Zh: 木耳 ("frill" vs. "wood ear mushroom": skirt with wood ear mushroom?)
Our framework cannot fix these kinds of flaws
3-STEP
PARALLEL CORPUS CLEANING
3 Steps of Cleaning
1. Fluency Judgement
– detect the translation flaws
2. Edit of Unnatural Sentences
– edit the translated sentences
3. Verification of Edits
– check if the edited translation is better than the
original one
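The three steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the names `CleaningResult`, `clean_sentence`, and the three callback parameters are hypothetical, the callbacks stand in for crowd workers, and treating "more than half of the verifiers" as acceptance is an assumption rather than the authors' documented rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CleaningResult:
    original: str                                   # target-side sentence from the corpus
    accepted_edits: List[str] = field(default_factory=list)


def clean_sentence(
    sentence: str,
    judge_fluency: Callable[[str], List[bool]],     # Step 1: one vote per worker, True = natural
    collect_edits: Callable[[str], List[str]],      # Step 2: one proposed sentence per worker
    verify_edit: Callable[[str, str], List[bool]],  # Step 3: one vote per worker, True = edit is better
    min_unnatural_votes: int,                       # hypothetical threshold parameter
) -> CleaningResult:
    result = CleaningResult(original=sentence)

    # Step 1: fluency judgement on the target sentence only.
    votes = judge_fluency(sentence)
    unnatural = sum(1 for is_natural in votes if not is_natural)
    if unnatural < min_unnatural_votes:
        return result  # considered fluent enough; nothing to edit

    # Step 2: collect edits, skipping "nothing to modify" answers and duplicates.
    seen = set()
    for edit in collect_edits(sentence):
        if edit == sentence or edit in seen:
            continue
        seen.add(edit)

        # Step 3: keep the edit only if a majority says it is better (assumption).
        better = verify_edit(sentence, edit)
        if 2 * sum(better) > len(better):
            result.accepted_edits.append(edit)
    return result


# Stub "workers" reproducing the running example from the slides.
demo = clean_sentence(
    "随意にに1種類だけ注文",
    judge_fluency=lambda s: [False] * 5,
    collect_edits=lambda s: ["随意に1種類だけ注文", s, "随意に1種類だけ注文"],
    verify_edit=lambda orig, edit: [True, True, True, True, False],
    min_unnatural_votes=4,
)
print(demo.accepted_edits)
```

None of the callbacks needs access to the source language, which is the sense in which the workers can be monolingual.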
Step 1: Fluency Judgement
• Task: judge if the sentences are natural and
grammatically correct
• Show only the translated (target, Japanese)
sentences
e.g. 随意にに1種類だけ注文
Q: Is this sentence natural and grammatically correct?
Workers' answers: No!! / No!!
Step 2: Edit of Unnatural Sentences
• Task: edit the unnatural translated sentences
• Show only the translated sentences, or
show the source sentence as well for
reference
e.g. 随意にに1種類だけ注文 (随便拍下一种)
Task: Please modify this sentence to be natural and grammatically correct
Workers' answers: 随意に1種類だけ注文 / Nothing to modify!
Step 3: Verification of Edits
• Task: judge if the edited translation is better
than the original one
• This step is important to further improve the
quality of the outcome because the edits are
not necessarily correct
e.g. 随意にに1種類だけ注文 vs. 随意に1種類だけ注文
Q: Is the right one more natural and grammatically correct than the left one?
Workers' answers: Yes!! / Yes!!
CORPUS CLEANING EXPERIMENTS
Crowdsourcing Services in the World
http://www.crowdinfo.jp/2014/02/02/world-crowdsourcing-service/
Crowdsourcing
http://crowdsourcing.yahoo.co.jp
• Several styles of crowdsourcing tasks, such as
Yes/No questions and free writing
• The service is run in Japan; therefore most of
the workers are Japanese
• Workers cannot be selected by their abilities
• The workers in our experiments do not
necessarily understand Chinese
– perhaps almost none of them do
Step 1: Fluency Judgement
• 358,085 sentences from the FDEC corpus with
length between 10 and 130 characters
• Only Japanese sentences are shown
• Asked 5 different workers for each question
# unnatural    5        4        3        2        1        0
# sents.       13,056   35,048   60,200   83,150   93,187   73,444
ratio          3.6%     9.8%     16.8%    23.2%    26.0%    20.5%
About 30% of the sentences were judged unnatural by 3 or more workers!
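The ratios for Step 1 can be recomputed directly from the counts on the slide. The short check below uses only those counts; reading the "30%" callout as "judged unnatural by 3 or more of the 5 workers" is an interpretation, though the arithmetic is consistent with it.

```python
# Number of sentences, keyed by how many of the 5 workers judged them unnatural.
step1_counts = {5: 13_056, 4: 35_048, 3: 60_200, 2: 83_150, 1: 93_187, 0: 73_444}
step1_total = sum(step1_counts.values())

# Percentage of sentences in each bucket, rounded as on the slide.
step1_ratios = {k: round(100 * v / step1_total, 1) for k, v in step1_counts.items()}

# Sentences that a majority (3 or more of 5) judged unnatural.
majority_unnatural = sum(v for k, v in step1_counts.items() if k >= 3)
majority_share = round(100 * majority_unnatural / step1_total, 1)

print(step1_total)         # 358085
print(step1_ratios)        # {5: 3.6, 4: 9.8, 3: 16.8, 2: 23.2, 1: 26.0, 0: 20.5}
print(majority_unnatural)  # 108304
print(majority_share)      # 30.2
```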
Step 2: Edit of Unnatural Sentences
• 47,420 sentences which were judged as
unnatural by 4 or more workers in Step 1
• Original Chinese sentence is also shown
• Asked 3 different workers for each question
# edits     3       2        1        0
# sents.    3,755   12,498   18,289   12,878
ratio       7.9%    26.4%    38.6%    27.2%
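As a consistency check, the number of edits passed on to Step 3 follows from this distribution: each sentence contributes as many edits as the number of workers who changed it. The sketch below uses only the slide's figures; the variable names are illustrative.

```python
# Number of sentences, keyed by how many of the 3 workers actually edited them.
step2_counts = {3: 3_755, 2: 12_498, 1: 18_289, 0: 12_878}

step2_sentences = sum(step2_counts.values())
total_edits = sum(k * v for k, v in step2_counts.items())
edited_at_least_once = step2_sentences - step2_counts[0]

print(step2_sentences)       # 47420
print(total_edits)           # 54550
print(edited_at_least_once)  # 34542
```

The total of 54,550 matches the number of edits verified in Step 3.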
Step 3: Verification of Edits
• 54,550 edits which were generated in Step 2
• Original Chinese sentence is also shown
• Asked 5 different workers for each question
# better    5        4        3       2       1       0
# edits     25,053   16,478   7,706   3,338   1,462   513
ratio       45.9%    30.2%    14.1%   6.1%    2.7%    0.9%
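From the Step 3 distribution one can work out how many edits a majority of verifiers preferred. The sketch assumes "majority" means 3 or more of the 5 workers; the counts are taken from the slide.

```python
# Number of edits, keyed by how many of the 5 workers judged the edit better.
step3_counts = {5: 25_053, 4: 16_478, 3: 7_706, 2: 3_338, 1: 1_462, 0: 513}
step3_total = sum(step3_counts.values())

# Edits preferred by a majority (assumed here to mean at least 3 of 5 votes).
majority_better = sum(v for k, v in step3_counts.items() if k >= 3)
majority_better_share = round(100 * majority_better / step3_total, 1)

print(step3_total)            # 54550
print(majority_better)        # 49237
print(majority_better_share)  # 90.3
```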
Translation Experiment
• Dataset: whole FDEC Corpus
– Cleaned: verified to be better by the majority
• Decoder: KyotoEBMT [Richardson+, 2014]
• Evaluation: BLEU
# sentences Original Cleaned
Train 1,220,597 1,256,908
Dev 11,186 11,489
Test 11,200 11,495
Experimental Results
• Corpus cleaning contributes to improving the
translation quality!
• Cleaning the Dev and Test sets has a bad effect
on translation quality…
Train Original Cleaned Cleaned Cleaned
Dev Original Original Cleaned Cleaned
Test Original Original Original Cleaned
BLEU 21.39 21.69 21.34 21.12
Natural, but Incorrect/Unequal
• Reviewed 100 edits which were judged to be
more natural than the original sentence by all 5
workers
• Found 3 types of inequalities (the edit no longer
conveys the same content as the original)
1. deletion of symbols (8 cases)
2. omission (13 cases)
3. mistranslation (5 cases)
See the proceedings for detailed examples
Experimental Results
• The inequalities have a bad effect on the
automatic evaluation scores because the metrics
assume the content of the input and output
is strictly equal
Train Original Cleaned Cleaned Cleaned
Dev Original Original Cleaned Cleaned
Test Original Original Original Cleaned
BLEU 21.39 21.69 21.34 21.12
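A toy illustration of why an edited reference that drops content can lower an n-gram-based score such as BLEU. The sketch computes only the clipped n-gram precision that BLEU is built on, not full BLEU, and the English token sequences are invented for illustration; they are not taken from the FDEC corpus.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: List[str], reference: List[str], n: int) -> float:
    """Fraction of hypothesis n-grams found in the reference, with counts clipped."""
    hyp = ngram_counts(hypothesis, n)
    ref = ngram_counts(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Hypothetical MT output that translates the whole source sentence.
mt_output = "free shipping ( excluding remote areas )".split()
# Hypothetical faithful reference, and an edited reference with an omission.
faithful_ref = "free shipping ( excluding remote areas )".split()
edited_ref = "free shipping".split()

print(clipped_precision(mt_output, faithful_ref, 1))  # 1.0
print(clipped_precision(mt_output, edited_ref, 1))    # 2 of 7 unigrams match
print(clipped_precision(mt_output, edited_ref, 2))    # 1 of 6 bigrams match
```

The output is penalized for content that is in the source but was removed from the reference, even though the translation itself is faithful.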
Crowdsourcing Cost
• Cost for cleaning 6.8M words used in the
experiments
        Professional*     Our Work
Fee     40 million JPY    2.6 million JPY
Time    1700 days         186 hours
* These values are estimated from http://www.editage.com
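The cost figures can be put on a per-word basis with simple arithmetic. The fee calculation uses only the slide's numbers; the time comparison depends on how long a "day" is, which the slide does not specify, so both a 24-hour and an 8-hour day are shown as assumptions.

```python
words = 6_800_000
fee_jpy = {"professional": 40_000_000, "crowd": 2_600_000}

# Fee per word in JPY, and how many times more expensive the professional estimate is.
fee_per_word = {k: round(v / words, 2) for k, v in fee_jpy.items()}
fee_ratio = round(fee_jpy["professional"] / fee_jpy["crowd"], 1)

# Time ratio under two assumed definitions of a "day" (not stated on the slide).
professional_days = 1700
crowd_hours = 186
time_ratio_24h = round(professional_days * 24 / crowd_hours, 1)
time_ratio_8h = round(professional_days * 8 / crowd_hours, 1)

print(fee_per_word)    # {'professional': 5.88, 'crowd': 0.38}
print(fee_ratio)       # 15.4
print(time_ratio_24h)  # 219.4
print(time_ratio_8h)   # 73.1
```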
Conclusion and Future Work
• Proposed a framework for cleaning existing
parallel corpora efficiently and cheaply
– 3-step monolingual crowdsourcing
– Improved the fluency of the sentences
• Future work
– How to reduce the inequalities of the edits?
– How to improve the correctness of the translation
by monolingual workers?
29
30

Contenu connexe

En vedette

Crowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsCrowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsMatthew Lease
 
Promoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationPromoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationToshiaki Nakazawa
 
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Toshiaki Nakazawa
 
Data Science with Humans in the Loop
Data Science with Humans in the LoopData Science with Humans in the Loop
Data Science with Humans in the LoopLora Aroyo
 
G社のNMT論文を読んでみた
G社のNMT論文を読んでみたG社のNMT論文を読んでみた
G社のNMT論文を読んでみたToshiaki Nakazawa
 
第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析Toshiaki Nakazawa
 
Attention-based NMT description
Attention-based NMT descriptionAttention-based NMT description
Attention-based NMT descriptionToshiaki Nakazawa
 
自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning自然言語処理のためのDeep Learning
自然言語処理のためのDeep LearningYuta Kikuchi
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情Yuta Kikuchi
 
ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017Toshiaki Nakazawa
 

En vedette (12)

Crowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to EthicsCrowdsourcing for Information Retrieval: From Statistics to Ethics
Crowdsourcing for Information Retrieval: From Statistics to Ethics
 
Promoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine TranslationPromoting Science and Technology Exchange using Machine Translation
Promoting Science and Technology Exchange using Machine Translation
 
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
Insertion Position Selection Model for Flexible Non-Terminals in Dependency T...
 
Data Science with Humans in the Loop
Data Science with Humans in the LoopData Science with Humans in the Loop
Data Science with Humans in the Loop
 
G社のNMT論文を読んでみた
G社のNMT論文を読んでみたG社のNMT論文を読んでみた
G社のNMT論文を読んでみた
 
第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析第3回アジア翻訳ワークショップの人手評価結果の分析
第3回アジア翻訳ワークショップの人手評価結果の分析
 
Attention-based NMT description
Attention-based NMT descriptionAttention-based NMT description
Attention-based NMT description
 
NLP2017 NMT Tutorial
NLP2017 NMT TutorialNLP2017 NMT Tutorial
NLP2017 NMT Tutorial
 
自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning自然言語処理のためのDeep Learning
自然言語処理のためのDeep Learning
 
最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情最近のDeep Learning (NLP) 界隈におけるAttention事情
最近のDeep Learning (NLP) 界隈におけるAttention事情
 
深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向深層学習による自然言語処理の研究動向
深層学習による自然言語処理の研究動向
 
ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017ニューラル機械翻訳の動向@IBIS2017
ニューラル機械翻訳の動向@IBIS2017
 

Similaire à 3-step parallel corpus cleaning using monolingual crowd workers

Translation effect.ppt
Translation effect.pptTranslation effect.ppt
Translation effect.pptALFAFAAMIN
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translationMarcis Pinnis
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
 
Intro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationIntro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationFederal Communicators Network
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Ron Martinez
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
 
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Antonio Toral
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Ron Martinez
 
How to Self Assess Your Skillset
How to Self Assess Your SkillsetHow to Self Assess Your Skillset
How to Self Assess Your SkillsetEliana Lobo
 
Multi lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationMulti lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationAashna Phanda
 
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...Lifeng (Aaron) Han
 
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdf
TRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdfTRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdf
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdfCarlosGPNCCUTIMB
 
Non-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorNon-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorMark Matsuno
 

Similaire à 3-step parallel corpus cleaning using monolingual crowd workers (20)

1 Introduction.ppt
1 Introduction.ppt1 Introduction.ppt
1 Introduction.ppt
 
Translation effect.ppt
Translation effect.pptTranslation effect.ppt
Translation effect.ppt
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Intro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 PresentationIntro to Plain Language-for FCN Apr2012 Presentation
Intro to Plain Language-for FCN Apr2012 Presentation
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8Scientific and technical translation in English - Week 8
Scientific and technical translation in English - Week 8
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Mach...
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019Scientific and technical translation in English - week 3 2019
Scientific and technical translation in English - week 3 2019
 
How to Self Assess Your Skillset
How to Self Assess Your SkillsetHow to Self Assess Your Skillset
How to Self Assess Your Skillset
 
Multi lingual corpus for machine aided translation
Multi lingual corpus for machine aided translationMulti lingual corpus for machine aided translation
Multi lingual corpus for machine aided translation
 
Modality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine TranslationModality-Preserving Phrase-based Statistical Machine Translation
Modality-Preserving Phrase-based Statistical Machine Translation
 
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
TSD2013 PPT.AUTOMATIC MACHINE TRANSLATION EVALUATION WITH PART-OF-SPEECH INFO...
 
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdf
TRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdfTRANSLATETECHNIQUES2  esta en ingles pero puede ser algo interesante  cristo.pdf
TRANSLATETECHNIQUES2 esta en ingles pero puede ser algo interesante cristo.pdf
 
Machine translator Introduction
Machine translator IntroductionMachine translator Introduction
Machine translator Introduction
 
Non-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editorNon-native users of English--Common writing mistakes and the role of the editor
Non-native users of English--Common writing mistakes and the role of the editor
 

Dernier

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptMAESTRELLAMesa2
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxAleenaTreesaSaji
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfmuntazimhurra
 

Dernier (20)

Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
G9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.pptG9 Science Q4- Week 1-2 Projectile Motion.ppt
G9 Science Q4- Week 1-2 Projectile Motion.ppt
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Biological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdfBiological Classification BioHack (3).pdf
Biological Classification BioHack (3).pdf
 

3-step parallel corpus cleaning using monolingual crowd workers

  • 1. 3-step Parallel Corpus Cleaning using Monolingual Crowd Workers Toshiaki Nakazawa, Sadao Kurohashi (Kyoto University) Hayato Kobayashi, Hiroki Ishikawa Manabu Sassano (Yahoo Japan Corporation) 20/05/2015@PACLING2015
  • 2. Parallel Corpora • Essential resources for almost all MT systems • The quality and quantity greatly affect the translation quality • Can be automatically constructed from existing resources – Europarl, patent families, Wikipedia… • Need to manually construct it for domains which do not have enough existing resources 2
  • 3. Quality of Parallel Corpus • Translation flaws are inevitable even thought the professionals translate – Homer nods (弘法にも筆の誤り in Japanese) • The number of flaws might be reduced by reviewing the whole corpus, but impossible – The size of the parallel corpus is usually very big – Very costly if we ask professionals to modify 3
  • 4. This Work • Detect and edit the translation flaws in the existing manually-translated parallel corpus in effective and cheap way • Use crowdsourcing in 3-steps 1. Fluency Judgement 2. Edit of Unnatural Sentences 3. Verification of Edits • The workers can be monolingual 4
  • 5. Outline • Motivation • Brief introduction of collaborative research between Yahoo Japan and Kyoto University • 3-step parallel corpus cleaning • Experiments – Parallel corpus cleaning – Translation • Conclusion and Future Work 5
  • 7. Collaborative Research Between Yahoo Japan and Kyoto University • Goal: Improve the Chinese-to-Japanese translation for E-commerce site • Task: – Develop a corpus-based MT system – Construct a parallel corpus of EC-site, especially fashion domain 7
  • 8. Fashion-domain EC-site Parallel Corpus • 1.2M sentences (Zh: 6.3M, Ja: 8.7M words) • Manually translated from fashion item pages of Chinese EC-site (taobao) into Japanese • Most of the sentences were translated by Chinese native speakers (through the translation company) – Found many translation flaws in the Japanese translations 8 [FDEC Corpus]
  • 9. Mother-tongue Principle 9 http://portal.unesco.org/en/ev.php-URL_ID=13089&URL_DO=DO_TOPIC&URL_SECTION=201.html “A translator should, as far as possible, translate into his own mother tongue or into a language of which he or she has a mastery equal to that of his or her mother tongue.” Recommendation on the Legal Protection of Translators and Translations and the Practical Means to improve the Status of Translators, UNESCO, 22 Nov. 1976
  • 10. Source Natives vs. Target Natives • Pros and cons for source and target native speakers • Target natives for translation modification [Albrecht+, 2009] 10 Source Natives Target Natives Background knowledge about the input sentence High Medium/Low Fluency and grammatical correctness of the output sentence Medium/Low High
  • 11. Examples of Translation Flaws • Insertion Ja: 随意にに1種類だけ注文 (order one type at at your own will) • Remaining Chinese character Ja: 元气あふれるという効果があります Ref: 元気あふれるという効果があります (气 is the Chinese hanzi, 気 is the Japanese kanji) • Unnatural Ja: お手入れの時、電源を切れ、プラグを抜いてください。 (when cleaning, turn the power off, please pull out the plug)
  • 12. Other Translation Flaws • Omission (not translated) Zh: 看看有没有其他合适的商品 Ja: 看看有没有その他合適的商品 • Mistranslation Zh: 加湿器功能 (functions of humidifier) Ja: 除湿器の機能 (functions of dehumidifier) Zh: 木耳 (frill / wood ear mushroom): skirt with wood ear mushroom? • Our framework cannot fix these kinds of flaws
  • 14. 3 Steps of Cleaning 1. Fluency Judgement – detect the translation flaws 2. Edit of Unnatural Sentences – edit the translated sentences 3. Verification of Edits – check whether the edited translation is better than the original one
  • 15. Step 1: Fluency Judgement • Task: judge whether the sentences are natural and grammatically correct • Only the translated (target, Japanese) sentences are shown • e.g. 随意にに1種類だけ注文: Is this sentence natural and grammatically correct? Worker responses: No!! No!!
  • 16. Step 2: Edit of Unnatural Sentences • Task: edit the unnatural translated sentences • Show only the translated sentences, or also show the source sentence for reference • e.g. 随意にに1種類だけ注文 (随便拍下一种): Please modify this sentence to be natural and grammatically correct. Worker responses: 随意に1種類だけ注文 / Nothing to modify!
  • 17. Step 3: Verification of Edits • Task: judge whether the edited translation is better than the original one • This step is important for further improving the quality of the outcome, because the edits are not necessarily correct • e.g. 随意にに1種類だけ注文 vs. 随意に1種類だけ注文: Is the right one more natural and grammatically correct than the left one? Worker responses: Yes!! Yes!!
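
The three steps can be sketched as a small decision function. This is only an illustrative sketch, not the authors' implementation: the class and function names are hypothetical, the thresholds (flagged by 4 or more of 5 workers in Step 1, judged better by a majority of 5 workers in Step 3) are taken from the experiment slides, and how accepted edits are merged back into the corpus is not stated on the slides, so returning every accepted edit is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Edit:
    text: str
    better_votes: int  # Step 3: workers (out of 5) who judged the edit better


@dataclass
class Sentence:
    original: str
    unnatural_votes: int  # Step 1: workers (out of 5) who judged it unnatural
    edits: list = field(default_factory=list)  # Step 2: edits proposed by workers


def clean(sentence, unnatural_threshold=4, better_threshold=3):
    """Return the translation(s) to keep for one sentence (hypothetical helper)."""
    # Step 1: too few workers flagged the sentence, so keep it unchanged.
    if sentence.unnatural_votes < unnatural_threshold:
        return [sentence.original]
    # Step 3: accept only the Step 2 edits that enough workers verified as better.
    accepted = [e.text for e in sentence.edits if e.better_votes >= better_threshold]
    # Assumption: fall back to the original when no edit passes verification.
    return accepted or [sentence.original]


example = Sentence("随意にに1種類だけ注文", 5, [Edit("随意に1種類だけ注文", 5)])
print(clean(example))
```

Keeping the thresholds as parameters makes it easy to trade the number of sentences sent to editing against the crowdsourcing cost.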
  • 19. Crowdsourcing Service in the World http://www.crowdinfo.jp/2014/02/02/world-crowdsourcing-service/
  • 20. Crowdsourcing (http://crowdsourcing.yahoo.co.jp) • Several styles of crowdsourcing tasks, such as Yes/No questions and free writing • The service is run in Japan; therefore most of the workers are Japanese • Workers cannot be selected by their abilities • The workers in our experiments do not necessarily understand Chinese – perhaps almost none of them do
  • 21. Step 1: Fluency Judgement • 358,085 sentences from the FDEC corpus with length between 10 and 130 characters • Only Japanese sentences are shown • Asked 5 different workers for each question • # unnatural (workers out of 5): # sents. (ratio) – 5: 13,056 (3.6%) – 4: 35,048 (9.8%) – 3: 60,200 (16.8%) – 2: 83,150 (23.2%) – 1: 93,187 (26.0%) – 0: 73,444 (20.5%) • 30% were judged unnatural by 3 or more workers!
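
The vote distribution on this slide can be checked with a few lines of Python. This is a minimal sketch: the counts are copied from the slide, while the helper name `share_at_least` is hypothetical and introduced only for illustration.

```python
# Step 1 counts copied from the slide: "unnatural" votes (out of 5) -> sentences.
step1_counts = {5: 13_056, 4: 35_048, 3: 60_200, 2: 83_150, 1: 93_187, 0: 73_444}

total = sum(step1_counts.values())


def share_at_least(min_votes):
    """Fraction of sentences judged unnatural by at least min_votes workers."""
    flagged = sum(n for votes, n in step1_counts.items() if votes >= min_votes)
    return flagged / total


majority_share = share_at_least(3)
print(f"total sentences: {total}")
print(f"judged unnatural by 3 or more of 5 workers: {majority_share:.1%}")
```

The total reproduces the 358,085 sentences stated on the slide, and the majority share comes to roughly 30%.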
  • 22. Step 2: Edit of Unnatural Sentences • 47,420 sentences which were judged as unnatural by 4 or more workers in Step 1 • Original Chinese sentence is also shown • Asked 3 different workers for each question • # edits (workers out of 3): # sents. (ratio) – 3: 3,755 (7.9%) – 2: 12,498 (26.4%) – 1: 18,289 (38.6%) – 0: 12,878 (27.2%)
  • 23. Step 3: Verification of Edits • 54,550 edits which were generated in Step 2 • Original Chinese sentence is also shown • Asked 5 different workers for each question • # better (workers out of 5): # sents. (ratio) – 5: 25,053 (45.9%) – 4: 16,478 (30.2%) – 3: 7,706 (14.1%) – 2: 3,338 (6.1%) – 1: 1,462 (2.7%) – 0: 513 (0.9%)
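
The Step 2 and Step 3 figures are consistent with each other, which a short calculation makes explicit. This is a sketch using only the counts on the two slides; the variable names are illustrative, and treating "3 or more of 5" as the majority is an assumption based on the wording of the translation experiment slide.

```python
# Step 2: number of workers (out of 3) who edited a sentence -> sentences.
step2_counts = {3: 3_755, 2: 12_498, 1: 18_289, 0: 12_878}
# Step 3: number of workers (out of 5) who judged an edit better -> edits.
step3_counts = {5: 25_053, 4: 16_478, 3: 7_706, 2: 3_338, 1: 1_462, 0: 513}

sentences_sent_to_editing = sum(step2_counts.values())
# Each sentence contributes one edit per worker who changed it.
total_edits = sum(workers * n for workers, n in step2_counts.items())
# Assumption: "majority" means at least 3 of the 5 verifying workers.
majority_better = sum(n for votes, n in step3_counts.items() if votes >= 3)

print(f"sentences sent to editing: {sentences_sent_to_editing}")
print(f"edits generated in Step 2: {total_edits}")
print(f"edits judged better by a majority: {majority_better / total_edits:.1%}")
```

The edit total derived from Step 2 matches the number of edits verified in Step 3.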
  • 24. Translation Experiment • Dataset: whole FDEC Corpus – Cleaned: edits verified to be better by the majority • Decoder: KyotoEBMT [Richardson+, 2014] • Evaluation: BLEU • # sentences (Original / Cleaned) – Train: 1,220,597 / 1,256,908 – Dev: 11,186 / 11,489 – Test: 11,200 / 11,495
  • 25. Experimental Results • Corpus cleaning contributes to improving the translation quality! • Cleaning the Dev and Test sets has a negative effect on translation quality… • BLEU by setting (Train / Dev / Test) – Original / Original / Original: 21.39 – Cleaned / Original / Original: 21.69 – Cleaned / Cleaned / Original: 21.34 – Cleaned / Cleaned / Cleaned: 21.12
  • 26. Natural, but Incorrect/Unequal • Reviewed 100 edits which were judged to be more natural than the original sentence by all 5 workers • Found 3 types of inequalities 1. deletion of symbols (8 cases) 2. omission (13 cases) 3. mistranslation (5 cases) • See the proceedings for detailed examples
  • 27. Experimental Results • The inequalities have a negative effect on the automatic evaluation scores, because the scores assume that the content of the input and the output is strictly equal • BLEU by setting (Train / Dev / Test) – Original / Original / Original: 21.39 – Cleaned / Original / Original: 21.69 – Cleaned / Cleaned / Original: 21.34 – Cleaned / Cleaned / Cleaned: 21.12
  • 28. Crowdsourcing Cost • Cost for cleaning 6.8M words used in the experiments – Fee: Professional* 40 million JPY, Our Work 2.6 million JPY – Time: Professional* 1700 days, Our Work 186 hours • * These values are estimated from http://www.editage.com
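
The cost figures can be turned into ratios and per-word prices. This is a rough sketch based only on the numbers on the slide. The slide does not say how the 1700 days are counted, so converting them to 24-hour calendar days is an assumption, and the time ratio should be read as an order-of-magnitude figure.

```python
words = 6_800_000                 # words cleaned in the experiments
professional_fee = 40_000_000     # JPY, estimate quoted on the slide
crowd_fee = 2_600_000             # JPY
professional_days = 1_700
crowd_hours = 186

fee_ratio = professional_fee / crowd_fee
professional_per_word = professional_fee / words
crowd_per_word = crowd_fee / words

# Assumption: a "day" is a 24-hour calendar day (not stated on the slide).
time_ratio = professional_days * 24 / crowd_hours

print(f"fee ratio (professional / crowd): {fee_ratio:.1f}x")
print(f"JPY per word: professional {professional_per_word:.2f}, crowd {crowd_per_word:.2f}")
print(f"time ratio under the calendar-day assumption: {time_ratio:.0f}x")
```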
  • 29. Conclusion and Future Work • Proposed a framework for cleaning existing parallel corpora efficiently and cheaply – 3-step monolingual crowdsourcing – Improved the fluency of the sentences • Future work – How to reduce the inequalities of the edits? – How to improve the correctness of the translation by monolingual workers?

Editor's notes

  1. proverb
  2. imperative
  3. mù’ěr
  4. Bilingual workers would edit the translations more precisely with the source sentence as a reference, while monolingual workers would simply ignore it