SlideShare une entreprise Scribd logo
1  sur  20
Human Evaluation: Why do we
need it?
The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
Dr. Sheila Castilho
www.adaptcentre.ieWhy do we need evaluation?
- Evaluation provide data on whether a system works and
why, which parts of it are effective and which need
improvement.
- Evaluation needs to be honest and replicable, and its
methods should be as rigorous as possible.
www.adaptcentre.ieA bit of history…
- ALPAC Report (1964)
- Generated a long and drastic cut in funding (especially in MT)
- Evaluation was a forbidden topic in the NLP community (Paroubek et
al 2007)
www.adaptcentre.ie
Automatic Metrics
 Interdisciplinary
- WER (speech recognition – MT)
- ROUGE (text summarization – MT)
- F-Measure (IR – many other areas)
www.adaptcentre.ie
Who’s afraid of Human Evaluation?
• Time consuming
• Expensive
• Humans don’t agree with each other
• Automatic metrics should be enough! It’s grand!
www.adaptcentre.ieHuman Translation Quality Assessment
- Why evaluate machine translation with humans?
- More detailed evaluation
- Assess complex linguistic phenomena
- Feedback to the MT system
- Diagnosis
www.adaptcentre.ieHuman Translation Quality Assessment
- most commonly carried out under the adequacy-fluency
paradigm and post-editing.
- secondary measures are: readability, comprehensibility,
usability, acceptability of source and target texts.
- carried out by professional and amateur evaluators.
- performance-based measures and user-centred
approaches are more recent additions.
www.adaptcentre.ieAdequacy
 also known as “accuracy” or “fidelity”
 Focus on the source text
 “the extent to which the translation transfers the meaning of the
source text translation unit into the target”
 Likert scale:
1. None of it
2. Little of it
3. Most of it
4. All of it
 Why is Adequacy useful for MT evaluation?
 It tells us how much of the source message has been
transferred to the translation
www.adaptcentre.ieFluency
 also known as intelligibility
 focuses on the target text
 “the flow and naturalness of the target text unit in the context
of the target audience and its linguistic and sociocultural
norms in the given context”
 Likert scale:
1.No fluency
2.Little fluency
3.Near native
4.Native
 Why is Fluency useful for MT evaluation?
 It tells if the message is fluent/intelligible (i.e. sounds natural to a native
speaker) or if it is “broken language”.
www.adaptcentre.iePE
 The “term used for the correction of machine translation
output by human linguists/editors” (Veale and Way 1997)
 “checking, proof-reading and revising translations
carried out by any kind of translating automaton”.
(Gouadec 2007)
 Common use of MT in production – over 80% of
Language Service Providers now offer post-edited MT
(Common Sense Advisory 2016)
www.adaptcentre.iePE
- Why use post-editing for Machine Translation evaluation?
- Assess usefulness of MT system in production
- Identify common errors
- Create new training or test data
- However, measurements of post-editing effort tend to differ
between novice (students) and professionals
- Temporal effort: time on PE, WPS
- Technical effort: edits perfromed – HTER
- Cognitive effort: several ways – eye tracking
www.adaptcentre.ieTranslation Quality Assessment
- Why use error taxonomies for translation evaluation?
- Identify types of errors in MT or human translation
- Detailed error report is useful for adjusting MT systems,
reporting back to clients
- LSPs use taxonomies and severity ratings to monitor
translators’ work
- However, error annotation is expensive
www.adaptcentre.ieDQF / Multidimensional Quality Metrics
www.adaptcentre.ieDQF / MQM Example
ST: Quando você faz avaliação humana dos sistemas, é mais provável
que os seus resultados tenham mais peso.
MT: When you make human systems evaluation, it is more likely that
the your results will have much more weight.
HT: When you do human evaluation of the systems, it is more likely that
your results will have more credibility.
Errors:
• Word order
• Extraneous function word
• Mistranslation
www.adaptcentre.ieCrowdsourcing
 Cheap
 Fast
 Various tasks (Fluency/adequacy, PE, error mark-up, ranking…)
 Quality?
 contributors’ level
 Country/region
 Constant monitoring
www.adaptcentre.ieUsability
 Concept borrowed for human-computer interaction
 Real world problems
 Understand how end users engage with machine-translated
texts or how usable such texts are.
 Applied for different areas (video/text summarisation, UI, information
retrieval, etc.).
 Why is Usability useful for MT evaluation?
 identify what impact the translation might have on the final readers of
the translation, including their satisfaction with the translation and
products.
 The users of the translation should be the ones who tell us if the final
translation is acceptable
www.adaptcentre.ie
www.adaptcentre.ieSo... Human Evaluation is a great thing!
 Human evaluation avoids awkward situations…
 And backs up good results!
www.adaptcentre.ie
Thank you!
www.adaptcentre.ie
• Patrick Paroubek, Stephane Chaudiron, Lynette Hirschman.
Principles of Evaluation in Natural Language Processing. Traitement
Automatique des Langues, ATALA, 2007, 48 (1), pp.7-31.

Contenu connexe

Tendances

LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
Lifeng (Aaron) Han
 
Seq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese LanguageSeq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese Language
Jinho Choi
 
Language Grid
Language GridLanguage Grid
Language Grid
lindh
 

Tendances (20)

NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project Presentation
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
Blenderbot
BlenderbotBlenderbot
Blenderbot
 
HLT
HLTHLT
HLT
 
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
[PACLING2019] Improving Context-aware Neural Machine Translation with Target-...
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Meta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methodsMeta-evaluation of machine translation evaluation methods
Meta-evaluation of machine translation evaluation methods
 
Plug play language_models
Plug play language_modelsPlug play language_models
Plug play language_models
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
Natural Language Processing in Artificial Intelligence - Codeup #5 - PayU
Natural Language Processing in Artificial Intelligence  - Codeup #5 - PayU Natural Language Processing in Artificial Intelligence  - Codeup #5 - PayU
Natural Language Processing in Artificial Intelligence - Codeup #5 - PayU
 
NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk NLP & Machine Learning - An Introductory Talk
NLP & Machine Learning - An Introductory Talk
 
Natural Language Processing seminar review
Natural Language Processing seminar review Natural Language Processing seminar review
Natural Language Processing seminar review
 
Conversational Agents in Portuguese: A Study Using Deep Learning
Conversational Agents in Portuguese: A Study Using Deep LearningConversational Agents in Portuguese: A Study Using Deep Learning
Conversational Agents in Portuguese: A Study Using Deep Learning
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLP
 
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Natural Language Processing (NLP) for Requirements Engineering (RE): an OverviewNatural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overview
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2 Engineering Intelligent NLP Applications Using Deep Learning – Part 2
Engineering Intelligent NLP Applications Using Deep Learning – Part 2
 
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
cushLEPOR uses LABSE distilled knowledge to improve correlation with human tr...
 
PubhD talk: MT serving the society
PubhD talk: MT serving the societyPubhD talk: MT serving the society
PubhD talk: MT serving the society
 
Seq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese LanguageSeq2seq Model to Tokenize the Chinese Language
Seq2seq Model to Tokenize the Chinese Language
 
Language Grid
Language GridLanguage Grid
Language Grid
 

Similaire à Human Evaluation: Why do we need it? - Dr. Sheila Castilho

HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
Lifeng (Aaron) Han
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
Parisa Niksefat
 
MT(1).pdf
MT(1).pdfMT(1).pdf
MT(1).pdf
s n
 

Similaire à Human Evaluation: Why do we need it? - Dr. Sheila Castilho (20)

Tech capabilities with_sa
Tech capabilities with_saTech capabilities with_sa
Tech capabilities with_sa
 
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
ADAPT Centre and My NLP journey: MT, MTE, QE, MWE, NER, Treebanks, Parsing.
 
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio... HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
HOPE: A Task-Oriented and Human-Centric Evaluation Framework Using Professio...
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
MT(1).pdf
MT(1).pdfMT(1).pdf
MT(1).pdf
 
What machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happyWhat machine translation developers are doing to make post-editors happy
What machine translation developers are doing to make post-editors happy
 
Learn the different approaches to machine translation and how to improve the ...
Learn the different approaches to machine translation and how to improve the ...Learn the different approaches to machine translation and how to improve the ...
Learn the different approaches to machine translation and how to improve the ...
 
Teaminology - A New Crowdsourcing Application for Term & Translation Governan...
Teaminology - A New Crowdsourcing Application for Term & Translation Governan...Teaminology - A New Crowdsourcing Application for Term & Translation Governan...
Teaminology - A New Crowdsourcing Application for Term & Translation Governan...
 
Analytics and Data as a Keystone Technology for Translation Companies, Doron ...
Analytics and Data as a Keystone Technology for Translation Companies, Doron ...Analytics and Data as a Keystone Technology for Translation Companies, Doron ...
Analytics and Data as a Keystone Technology for Translation Companies, Doron ...
 
Man vs. Machine: A Guide to Understanding Translation Technology in Modern Bu...
Man vs. Machine: A Guide to Understanding Translation Technology in Modern Bu...Man vs. Machine: A Guide to Understanding Translation Technology in Modern Bu...
Man vs. Machine: A Guide to Understanding Translation Technology in Modern Bu...
 
IRJET- Speech Translation System for Language Barrier Reduction
IRJET-  	  Speech Translation System for Language Barrier ReductionIRJET-  	  Speech Translation System for Language Barrier Reduction
IRJET- Speech Translation System for Language Barrier Reduction
 
sample PPT.pptx
sample PPT.pptxsample PPT.pptx
sample PPT.pptx
 
IRJET - Text Optimization/Summarizer using Natural Language Processing
IRJET - Text Optimization/Summarizer using Natural Language Processing IRJET - Text Optimization/Summarizer using Natural Language Processing
IRJET - Text Optimization/Summarizer using Natural Language Processing
 
Machine Translation Master Class at the EUATC Conference by Diego Bartolome
Machine Translation Master Class at the EUATC Conference by Diego BartolomeMachine Translation Master Class at the EUATC Conference by Diego Bartolome
Machine Translation Master Class at the EUATC Conference by Diego Bartolome
 
Good Applications of Bad Machine Translation
Good Applications of Bad Machine TranslationGood Applications of Bad Machine Translation
Good Applications of Bad Machine Translation
 
AI for voice recognition.pptx
AI for voice recognition.pptxAI for voice recognition.pptx
AI for voice recognition.pptx
 
2013 CHAT tcworld tekom Welocalize Teaminology
2013 CHAT tcworld tekom Welocalize Teaminology 2013 CHAT tcworld tekom Welocalize Teaminology
2013 CHAT tcworld tekom Welocalize Teaminology
 
TAUS New Year's Reception 2014
TAUS New Year's Reception 2014TAUS New Year's Reception 2014
TAUS New Year's Reception 2014
 
Improving Translator Productivity with MT: A Patent Translation Case Study
Improving Translator Productivity with MT: A Patent Translation Case StudyImproving Translator Productivity with MT: A Patent Translation Case Study
Improving Translator Productivity with MT: A Patent Translation Case Study
 
IRJET - Twitter Sentiment Analysis using Machine Learning
IRJET -  	  Twitter Sentiment Analysis using Machine LearningIRJET -  	  Twitter Sentiment Analysis using Machine Learning
IRJET - Twitter Sentiment Analysis using Machine Learning
 

Plus de Sebastian Ruder

Plus de Sebastian Ruder (20)

Strong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain ShiftStrong Baselines for Neural Semi-supervised Learning under Domain Shift
Strong Baselines for Neural Semi-supervised Learning under Domain Shift
 
On the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary InductionOn the Limitations of Unsupervised Bilingual Dictionary Induction
On the Limitations of Unsupervised Bilingual Dictionary Induction
 
Neural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain ShiftNeural Semi-supervised Learning under Domain Shift
Neural Semi-supervised Learning under Domain Shift
 
Successes and Frontiers of Deep Learning
Successes and Frontiers of Deep LearningSuccesses and Frontiers of Deep Learning
Successes and Frontiers of Deep Learning
 
Optimization for Deep Learning
Optimization for Deep LearningOptimization for Deep Learning
Optimization for Deep Learning
 
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian MihaiMachine intelligence in HR technology: resume analysis at scale - Adrian Mihai
Machine intelligence in HR technology: resume analysis at scale - Adrian Mihai
 
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana IfrimHashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
Hashtagger+: Real-time Social Tagging of Streaming News - Dr. Georgiana Ifrim
 
Transfer Learning for Natural Language Processing
Transfer Learning for Natural Language ProcessingTransfer Learning for Natural Language Processing
Transfer Learning for Natural Language Processing
 
Transfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningTransfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine Learning
 
Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...Making sense of word senses: An introduction to word-sense disambiguation and...
Making sense of word senses: An introduction to word-sense disambiguation and...
 
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer GilmartinSpoken Dialogue Systems and Social Talk - Emer Gilmartin
Spoken Dialogue Systems and Social Talk - Emer Gilmartin
 
NIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian RuderNIPS 2016 Highlights - Sebastian Ruder
NIPS 2016 Highlights - Sebastian Ruder
 
Modeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John GloverModeling documents with Generative Adversarial Networks - John Glover
Modeling documents with Generative Adversarial Networks - John Glover
 
Funded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIENFunded PhD/MSc. Opportunities at AYLIEN
Funded PhD/MSc. Opportunities at AYLIEN
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
 
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
Dynamic Topic Modeling via Non-negative Matrix Factorization (Dr. Derek Greene)
 
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
Idiom Token Classification using Sentential Distributed Semantics (Giancarlo ...
 
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisA Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
 
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
Topic Listener - Observing Key Topics from Multi-Channel Speech Audio Streams...
 

Dernier

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 

Dernier (20)

Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai YoungDubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
Dubai Call Girls Beauty Face Teen O525547819 Call Girls Dubai Young
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Introduction to Viruses
Introduction to VirusesIntroduction to Viruses
Introduction to Viruses
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
chemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdfchemical bonding Essentials of Physical Chemistry2.pdf
chemical bonding Essentials of Physical Chemistry2.pdf
 
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.Molecular markers- RFLP, RAPD, AFLP, SNP etc.
Molecular markers- RFLP, RAPD, AFLP, SNP etc.
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit flypumpkin fruit fly, water melon fruit fly, cucumber fruit fly
pumpkin fruit fly, water melon fruit fly, cucumber fruit fly
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 

Human Evaluation: Why do we need it? - Dr. Sheila Castilho

  • 1. Human Evaluation: Why do we need it? The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. Dr. Sheila Castilho
  • 2. www.adaptcentre.ieWhy do we need evaluation? - Evaluation provide data on whether a system works and why, which parts of it are effective and which need improvement. - Evaluation needs to be honest and replicable, and its methods should be as rigorous as possible.
  • 3. www.adaptcentre.ieA bit of history… - ALPAC Report (1964) - Generated a long and drastic cut in funding (especially in MT) - Evaluation was a forbidden topic in the NLP community (Paroubek et al 2007)
  • 4. www.adaptcentre.ie Automatic Metrics  Interdisciplinary - WER (speech recognition – MT) - ROUGE (text summarization – MT) - F-Measure (IR – many other areas)
  • 5. www.adaptcentre.ie Who’s afraid of Human Evaluation? • Time consuming • Expensive • Humans don’t agree with each other • Automatic metrics should be enough! It’s grand!
  • 6. www.adaptcentre.ieHuman Translation Quality Assessment - Why evaluate machine translation with humans? - More detailed evaluation - Assess complex linguistic phenomena - Feedback to the MT system - Diagnosis
  • 7. www.adaptcentre.ieHuman Translation Quality Assessment - most commonly carried out under the adequacy-fluency paradigm and post-editing. - secondary measures are: readability, comprehensibility, usability, acceptability of source and target texts. - carried out by professional and amateur evaluators. - performance-based measures and user-centred approaches are more recent additions.
  • 8. www.adaptcentre.ieAdequacy  also known as “accuracy” or “fidelity”  Focus on the source text  “the extent to which the translation transfers the meaning of the source text translation unit into the target”  Likert scale: 1. None of it 2. Little of it 3. Most of it 4. All of it  Why is Adequacy useful for MT evaluation?  It tells us how much of the source message has been transferred to the translation
  • 9. www.adaptcentre.ieFluency  also known as intelligibility  focuses on the target text  “the flow and naturalness of the target text unit in the context of the target audience and its linguistic and sociocultural norms in the given context”  Likert scale: 1.No fluency 2.Little fluency 3.Near native 4.Native  Why is Fluency useful for MT evaluation?  It tells if the message is fluent/intelligible (i.e. sounds natural to a native speaker) or if it is “broken language”.
  • 10. www.adaptcentre.iePE  The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)  “checking, proof-reading and revising translations carried out by any kind of translating automaton”. (Gouadec 2007)  Common use of MT in production – over 80% of Language Service Providers now offer post-edited MT (Common Sense Advisory 2016)
  • 11. www.adaptcentre.iePE - Why use post-editing for Machine Translation evaluation? - Assess usefulness of MT system in production - Identify common errors - Create new training or test data - However, measurements of post-editing effort tend to differ between novice (students) and professionals - Temporal effort: time on PE, WPS - Technical effort: edits perfromed – HTER - Cognitive effort: several ways – eye tracking
  • 12. www.adaptcentre.ieTranslation Quality Assessment - Why use error taxonomies for translation evaluation? - Identify types of errors in MT or human translation - Detailed error report is useful for adjusting MT systems, reporting back to clients - LSPs use taxonomies and severity ratings to monitor translators’ work - However, error annotation is expensive
  • 14. www.adaptcentre.ieDQF / MQM Example ST: Quando você faz avaliação humana dos sistemas, é mais provável que os seus resultados tenham mais peso. MT: When you make human systems evaluation, it is more likely that the your results will have much more weight. HT: When you do human evaluation of the systems, it is more likely that your results will have more credibility. Errors: • Word order • Extraneous function word • Mistranslation
  • 15. www.adaptcentre.ieCrowdsourcing  Cheap  Fast  Various tasks (Fluency/adequacy, PE, error mark-up, ranking…)  Quality?  contributors’ level  Country/region  Constant monitoring
  • 16. www.adaptcentre.ieUsability  Concept borrowed for human-computer interaction  Real world problems  Understand how end users engage with machine-translated texts or how usable such texts are.  Applied for different areas (video/text summarisation, UI, information retrieval, etc.).  Why is Usability useful for MT evaluation?  identify what impact the translation might have on the final readers of the translation, including their satisfaction with the translation and products.  The users of the translation should be the ones who tell us if the final translation is acceptable
  • 18. www.adaptcentre.ieSo... Human Evaluation is a great thing!  Human evaluation avoids awkward situations…  And backs up good results!
  • 20. www.adaptcentre.ie • Patrick Paroubek, Stephane Chaudiron, Lynette Hirschman. Principles of Evaluation in Natural Language Processing. Traitement Automatique des Langues, ATALA, 2007, 48 (1), pp.7-31.

Notes de l'éditeur

  1. Spread But what about HEval?