SlideShare une entreprise Scribd logo
1  sur  26
Improving the quality of a customized SMT system using shared training data Chris.Wendt@microsoft.com Will.Lewis@microsoft.com August 28, 2009 1
Overview  Engine and Customization Basics Experiment Objective Experiment Setup Experiment Results Validation Conclusions 2
Microsoft’s Statistical MT Engine 3 Linguistically informed SMT
Microsoft Translator Runtime 4
Training 5
Microsoft’s Statistical MT Engine 6
Adding Domain Specificity  7
Experiment Objective Objective Determine the effect of pooling parallel data among multiple data providers within a domain, measured by the translation quality of an SMT system trained with that data. 8
Experiment Setup Data pool: TAUS Data Association’s repository of parallel translation data. Domain: computer-related technical documents. 	No difference is made between software,  hardware, documentation and marketing material. Criteria for test case selection: ,[object Object]
Less than 2M segments of parallel training data (at that point it would be valid to train a System using only the provider’s own data)Chosen case: Sybase Experiment Series: Observe BLEU scores using a reserved subset of the submitted data against systems trained with 1	General data, as used for www.microsofttranslator.com 2a	Only Microsoft’s internal parallel data, from localization of its own products 2b	Microsoft data + Sybase data 3a	General + Microsoft + TAUS 3b	General + Microsoft data + TAUS, with Sybase custom lambdas Measure BLEU on 3 sets of test documents, with 1 reference, reserved from the submission, not used in training: ,[object Object]
Microsoft
General9
System Details 10
Training data composition German Chinese (Simplified) Sybase does not have enough data to build a system exclusively with Sybase data 11
Experiment Results, measured in BLEU Chinese German 12
Experiment Results, measured in BLEU Chinese German 13
Experiment Results, measured in BLEU Chinese German 14 More than 8 point gain compared to system built without the shared data
Experiment Results, measured in BLEU Chinese Best results are achieved using the maximum available data within the domain, using custom lambda training German 15
Experiment Results, measured in BLEU Chinese Weight training (lambda training) without diversity in the training data has very little effect German 16 The diversity aspect was somewhat a surprise for us. Microsoft’s large data pool by itself did not give Sybase the hoped-for boost.
Experiment Results, measured in BLEU Chinese Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else German 17
Experiment Results, measured in BLEU Chinese A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available German 18
Experiment Results, measured in BLEU Chinese Small data providers benefit more from sharing than large data providers, but all benefit German 19
Experiment Results, measured in BLEU Chinese This is the best German Sybase system we could have built without TAUS German 20
Validation: Adobe Polish Training Data (sentences): General	1.5M Microsoft	1.7M Adobe	129K TAUS other	70K 21 Even for a language without a lot of training data we can see nice gains by pooling.
Validation: Dell Japanese Training data (sentences) ,[object Object]
Microsoft 	3.2M

Contenu connexe

Similaire à Improving the quality of a customized SMT system using shared training data

Office automation system report
Office automation system reportOffice automation system report
Office automation system reportAmit Kulkarni
 
Office automation system report
Office automation system reportOffice automation system report
Office automation system reportAmit Kulkarni
 
Provable multicopy dynamic data possession
Provable multicopy dynamic data possessionProvable multicopy dynamic data possession
Provable multicopy dynamic data possessionnexgentech15
 
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMSPROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMSNexgen Technology
 
Provable multicopy dynamic data possession
Provable multicopy dynamic data possessionProvable multicopy dynamic data possession
Provable multicopy dynamic data possessionnexgentechnology
 
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMSPROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMSNexgen Technology
 
Provable multi copy dynamic data possession in cloud computing systems
Provable multi copy dynamic data possession in cloud computing systemsProvable multi copy dynamic data possession in cloud computing systems
Provable multi copy dynamic data possession in cloud computing systemsNagamalleswararao Tadikonda
 
Data Virtualization Deployments: How to Manage Very Large Deployments
Data Virtualization Deployments: How to Manage Very Large DeploymentsData Virtualization Deployments: How to Manage Very Large Deployments
Data Virtualization Deployments: How to Manage Very Large DeploymentsDenodo
 
Database Systems
Database SystemsDatabase Systems
Database SystemsUsman Tariq
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Cognos Data Module Architectures & Use Cases
Cognos Data Module Architectures & Use CasesCognos Data Module Architectures & Use Cases
Cognos Data Module Architectures & Use CasesSenturus
 
A database management system
A database management systemA database management system
A database management systemghulam120
 
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and BeyondMongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and BeyondMongoDB
 
Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660syaidatulamirah
 
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems Jiaheng Lu
 

Similaire à Improving the quality of a customized SMT system using shared training data (20)

Office automation system report
Office automation system reportOffice automation system report
Office automation system report
 
Office automation system report
Office automation system reportOffice automation system report
Office automation system report
 
Ranjitbanshpal1
Ranjitbanshpal1Ranjitbanshpal1
Ranjitbanshpal1
 
Provable multicopy dynamic data possession
Provable multicopy dynamic data possessionProvable multicopy dynamic data possession
Provable multicopy dynamic data possession
 
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMSPROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
 
Provable multicopy dynamic data possession
Provable multicopy dynamic data possessionProvable multicopy dynamic data possession
Provable multicopy dynamic data possession
 
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMSPROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
PROVABLE MULTICOPY DYNAMIC DATA POSSESSION IN CLOUD COMPUTING SYSTEMS
 
Provable multi copy dynamic data possession in cloud computing systems
Provable multi copy dynamic data possession in cloud computing systemsProvable multi copy dynamic data possession in cloud computing systems
Provable multi copy dynamic data possession in cloud computing systems
 
Data Virtualization Deployments: How to Manage Very Large Deployments
Data Virtualization Deployments: How to Manage Very Large DeploymentsData Virtualization Deployments: How to Manage Very Large Deployments
Data Virtualization Deployments: How to Manage Very Large Deployments
 
Database Systems
Database SystemsDatabase Systems
Database Systems
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Cognos Data Module Architectures & Use Cases
Cognos Data Module Architectures & Use CasesCognos Data Module Architectures & Use Cases
Cognos Data Module Architectures & Use Cases
 
A database management system
A database management systemA database management system
A database management system
 
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and BeyondMongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
MongoDB Evenings Chicago - Find Your Way in MongoDB 3.2: Compass and Beyond
 
EViews 81 Supplement
EViews 81 SupplementEViews 81 Supplement
EViews 81 Supplement
 
Best peer++
Best peer++Best peer++
Best peer++
 
Best peer++
Best peer++Best peer++
Best peer++
 
BigData Analysis
BigData AnalysisBigData Analysis
BigData Analysis
 
Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660Spam detection using machine learning based binary classifier_043660
Spam detection using machine learning based binary classifier_043660
 
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
 

Plus de TAUS - The Language Data Network

TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...
TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...
TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...TAUS - The Language Data Network
 
TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...
TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...
TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...TAUS - The Language Data Network
 
TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...
TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...
TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...TAUS - The Language Data Network
 
TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...
TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...
TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...TAUS - The Language Data Network
 
TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...
TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...
TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...TAUS - The Language Data Network
 
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...TAUS - The Language Data Network
 
Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)
Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)
Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)TAUS - The Language Data Network
 
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
 Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann... Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...TAUS - The Language Data Network
 
A translation memory P2P trading platform - to make global translation memory...
A translation memory P2P trading platform - to make global translation memory...A translation memory P2P trading platform - to make global translation memory...
A translation memory P2P trading platform - to make global translation memory...TAUS - The Language Data Network
 
Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...
Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...
Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...TAUS - The Language Data Network
 
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...TAUS - The Language Data Network
 
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...TAUS - The Language Data Network
 
The Theory and Practice of Computer Aided Translation Training System, Liu Q...
 The Theory and Practice of Computer Aided Translation Training System, Liu Q... The Theory and Practice of Computer Aided Translation Training System, Liu Q...
The Theory and Practice of Computer Aided Translation Training System, Liu Q...TAUS - The Language Data Network
 
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)TAUS - The Language Data Network
 
A use-case for getting MT into your company, Kerstin Berns (berns language c...
 A use-case for getting MT into your company, Kerstin Berns (berns language c... A use-case for getting MT into your company, Kerstin Berns (berns language c...
A use-case for getting MT into your company, Kerstin Berns (berns language c...TAUS - The Language Data Network
 

Plus de TAUS - The Language Data Network (20)

TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...
TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...
TAUS Global Content Summit Amsterdam 2019 / Beyond MT. A few premature reflec...
 
TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...
TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...
TAUS Global Content Summit Amsterdam 2019 / Measure with DQF, Dace Dzeguze (T...
 
TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...
TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...
TAUS Global Content Summit Amsterdam 2019 / Automatic for the People by Domin...
 
TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...
TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...
TAUS Global Content Summit Amsterdam 2019 / The Quantum Leap: Human Parity, C...
 
TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...
TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...
TAUS Global Content Summit Amsterdam 2019 / Growing Business by Connecting Co...
 
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...
Achieving Translation Efficiency and Accuracy for Video Content, Xiao Yuan (P...
 
Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)
Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)
Introduction Innovation Contest Shenzhen by Henri Broekmate (Lionbridge)
 
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
 Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann... Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
Game Changer for Linguistic Review: Shifting the Paradigm, Klaus Fleischmann...
 
A translation memory P2P trading platform - to make global translation memory...
A translation memory P2P trading platform - to make global translation memory...A translation memory P2P trading platform - to make global translation memory...
A translation memory P2P trading platform - to make global translation memory...
 
Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...
Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...
Shiyibao — The Most Efficient Translation Feedback System Ever, Guanqing Hao ...
 
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...
Stepes – Instant Human Translation Services for the Digital World, Carl Yao (...
 
Farmer Lv (TrueTran)
Farmer Lv (TrueTran)Farmer Lv (TrueTran)
Farmer Lv (TrueTran)
 
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...
Smart Translation Resource Management: Semantic Matching, Kirk Zhang (Wiitran...
 
The Theory and Practice of Computer Aided Translation Training System, Liu Q...
 The Theory and Practice of Computer Aided Translation Training System, Liu Q... The Theory and Practice of Computer Aided Translation Training System, Liu Q...
The Theory and Practice of Computer Aided Translation Training System, Liu Q...
 
Translation Technology Showcase in Shenzhen
Translation Technology Showcase in ShenzhenTranslation Technology Showcase in Shenzhen
Translation Technology Showcase in Shenzhen
 
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
How to efficiently use large-scale TMs in translation, Jing Zhang (Tmxmall)
 
SDL Trados Studio 2017, Jocelyn He (SDL)
SDL Trados Studio 2017, Jocelyn He (SDL)SDL Trados Studio 2017, Jocelyn He (SDL)
SDL Trados Studio 2017, Jocelyn He (SDL)
 
How we train post-editors - Yongpeng Wei (Lingosail)
How we train post-editors - Yongpeng Wei (Lingosail)How we train post-editors - Yongpeng Wei (Lingosail)
How we train post-editors - Yongpeng Wei (Lingosail)
 
A use-case for getting MT into your company, Kerstin Berns (berns language c...
 A use-case for getting MT into your company, Kerstin Berns (berns language c... A use-case for getting MT into your company, Kerstin Berns (berns language c...
A use-case for getting MT into your company, Kerstin Berns (berns language c...
 
QE integrated in XTM, by Bob Willans (XTM)
QE integrated in XTM, by Bob Willans (XTM)QE integrated in XTM, by Bob Willans (XTM)
QE integrated in XTM, by Bob Willans (XTM)
 

Dernier

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 

Dernier (20)

Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Improving the quality of a customized SMT system using shared training data

  • 1. Improving the quality of a customized SMT system using shared training data Chris.Wendt@microsoft.com Will.Lewis@microsoft.com August 28, 2009 1
  • 2. Overview Engine and Customization Basics Experiment Objective Experiment Setup Experiment Results Validation Conclusions 2
  • 3. Microsoft’s Statistical MT Engine 3 Linguistically informed SMT
  • 8. Experiment Objective Objective Determine the effect of pooling parallel data among multiple data providers within a domain, measured by the translation quality of an SMT system trained with that data. 8
  • 9.
  • 10.
  • 14. Training data composition German Chinese (Simplified) Sybase does not have enough data to build a system exclusively with Sybase data 11
  • 15. Experiment Results, measured in BLEU Chinese German 12
  • 16. Experiment Results, measured in BLEU Chinese German 13
  • 17. Experiment Results, measured in BLEU Chinese German 14 More than 8 point gain compared to system built without the shared data
  • 18. Experiment Results, measured in BLEU Chinese Best results are achieved using the maximum available data within the domain, using custom lambda training German 15
  • 19. Experiment Results, measured in BLEU Chinese Weight training (lambda training) without diversity in the training data has very little effect German 16 The diversity aspect was somewhat a surprise for us. Microsoft’s large data pool by itself did not give Sybase the hoped-for boost.
  • 20. Experiment Results, measured in BLEU Chinese Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else German 17
  • 21. Experiment Results, measured in BLEU Chinese A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available German 18
  • 22. Experiment Results, measured in BLEU Chinese Small data providers benefit more from sharing than large data providers, but all benefit German 19
  • 23. Experiment Results, measured in BLEU Chinese This is the best German Sybase system we could have built without TAUS German 20
  • 24. Validation: Adobe Polish Training Data (sentences): General 1.5M Microsoft 1.7M Adobe 129K TAUS other 70K 21 Even for a language without a lot of training data we can see nice gains by pooling.
  • 25.
  • 28. Dell 172K22 Confirms the Sybase results
  • 29. Example SRC The Monitor collects metrics and performance data from the databases and MobiLink servers running on other computers, while a separate computer accesses the Monitor via a web browser. 1 Der Monitor sammelt Metriken und Leistungsdaten von Datenbanken und MobiLink-Servern, die auf anderen Computern ausführen, während auf ein separater Computer greift auf den Monitor über einen Web-Browser. 2a Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift. 2b Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift. 3a Der Monitor sammelt Metriken und Performance-Daten von der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer den Monitor über einen Webbrowser zugreift. 3b Der Monitor sammelt Kriterien und Performance-Daten aus der Datenbanken und MobiLink-Server auf anderen Computern ausgeführt werden, während ein separater Computer des Monitors über einen Webbrowser zugreift. REF Der Monitor sammelt Kriterien und Performance-Daten aus den Datenbanken und MobiLink-Servern die auf anderen Computern ausgeführt werden, während ein separater Computer auf den Monitor über einen Webbrowser zugreift. Google Der Monitor sammelt Metriken und Performance-Daten aus den Datenbanken und MobiLink-Server auf anderen Computern ausgeführt, während eine separate Computer auf dem Monitor über einen Web-Browser. 23
  • 30.
  • 31. Weight training (Lambda training) without diversity in the training data has almost no effect
  • 32. Lambda training with in-domain diversity has a significant positive effect for the lambda target, and a significant negative effect for everyone else
  • 33. A system can be customized with small amounts of target language material, as long as there is a diverse set of in-domain parallel data available
  • 34. Best results are achieved using the maximum available data within the domain, using custom lambda training
  • 35. Small data providers benefit more from sharing than large data providers, but all benefit24
  • 36.
  • 37. An MT system trained with the combined data can deliver significantly improved translation quality, compared to a system trained only with the provider’s own data plus baseline training.
  • 38. Customization via a separate target language model and lambda training works25
  • 39. References Chris Quirk, Arul Menezes, and Colin Cherry, Dependency Treelet Translation: Syntactically Informed Phrasal SMT, in Proceedings of ACL, Association for Computational Linguistics, June 2005 Microsoft Translator: www.microsofttranslator.com TAUS Data Association: www.tausdata.org 26

Notes de l'éditeur

  1. 2 things:Show that pooling works, especially for data owners with less bitext than MicrosoftCustomization gives the data a boost
  2. Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  3. Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  4. Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  5. Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  6. Not a surprise, we strongly prefer the lambda target’s style and terminology in this case
  7. System 1:- General test set BLEU = 0.1590- Tech* test set BLEU = 0.2890- TAUS Adobe test set BLEU = 0.1940 System 3b:- General test set BLEU = 0.1353- Tech* test set BLEU = 0.3388- TAUS Adobe test set BLEU = 0.3374 Training Data (lines/words):- Gendom: 1520177 lines / 22632777 enu words 19988095 plk words- Tech: 1786035 lines / 22717903 enu words 21205994 plk words- TAUS: 199210 lines / 2439361 enu words 2306301 plk words- TAUS Adobe: 129084 lines / 1664918 enu words 1512067 plk words
  8. System 1:- General test set BLEU = 0.1799- Tech* test set BLEU = 0.3788- TAUS Dell test set BLEU = 0.2672 System 2a:- General test set BLEU = 0.1476- Tech* test set BLEU = 0.3087- TAUS Dell test set BLEU = 0.3949 System 2b:- General test set BLEU = 0.1728- Tech* test set BLEU = 0.4132- TAUS Dell test set BLEU = 0.3264 System 3a:- General test set BLEU = 0.1733- Tech* test set BLEU = 0.4230- TAUS Dell test set BLEU = 0.3989 System 3b:- General test set BLEU = 0.1485- Tech* test set BLEU = 0.3221- TAUS Dell test set BLEU = 0.4243 Training Data:- Gendom: 4348176 lines- Tech: 3299908 lines- TAUS: 1612637 lines- TAUS Dell: 172017 lines * = new set of Tech test sentences deduped against entire training set