Unifying Human and Statistical
Evaluation for Natural Language
Generation
Tatsunori Hashimoto*, Hugh Zhang*, Percy Liang
What are the goals
of natural language
generation?
2
Storytelling
3
A high
quality
story?
The scientist named the population, after
their distinctive horn, Ovid’s Unicorn.
These four-horned, silver-white unicorns
were previously unknown to science.
Now, after almost two centuries, the
mystery of what sparked this odd
phenomenon is finally solved.
[Radford+ 2019]
Storytelling
6
Atticus said to Jem one day, “I’d
rather you shot at tin cans in the
back yard, but I know you’ll go after
birds. Shoot all the bluejays you
want, if you can hit ‘em, but
remember it’s a sin to kill a
mockingbird.”
Good
story, but
not a good
model
From Harper Lee’s
“To Kill A Mockingbird”
Storytelling
9
Peter said to James one afternoon,
“I’d rather you fired at aluminum cans
in the garage, but I know you will go
after birds. Hit all the ravens you can,
if you want, but remember it is a sin
to murder a hummingbird.”
Diversity is
important
and hard to
quantify!
Goal
10
Evaluation should measure
both quality and diversity
11
12
Claim: Human evaluation
has difficulty catching
diversity defects.
Try It Yourself!
13
Task: Headline short
news articles
15
Context: Political leaders in Israel united in
prayers for Ariel Sharon as the prime minister
underwent surgery after suffering a stroke.
___________________________________________
Output: Sharon has stroke for stroke.
Machine generated
(obvious quality failure)
18
Context: The Buffalo Bills sacked Tom Donahoe
as president and general manager on Wednesday,
fulfilling expectations of a shake-up.
___________________________________________
Output: Bills sack Donahoe as president and gm.
Machine generated
(hard to detect diversity issue)
19
Context: The Buffalo Bills sacked Tom Donahoe
as president and general manager on Wednesday,
fulfilling expectations of a shake-up.
___________________________________________
Output: Bills sack Donahoe as president and gm.
___________________________________________
Reference: NFL’s Bills shake up front office.
Existing Evaluations
20
Human Evaluation
Pros: Gold standard for quality
Cons: Can be gamed by under-diverse models
Statistical (e.g., perplexity)
Reference Based (e.g., BLEU, ROUGE)
Learned Metrics (e.g., ADEM)
Existing Evaluations
21
Human Evaluation
Statistical (e.g., perplexity) [Theis+ 2015]
Pros: Measures diversity
Cons: Does not measure sample quality
Reference Based (e.g., BLEU, ROUGE)
Learned Metrics (e.g., ADEM)
Existing Evaluations
22
Human Evaluation
Statistical (e.g., perplexity)
Reference Based (e.g., BLEU, ROUGE) [Papineni+ 2002], [Lin+ 2004], [Banerjee+ 2005]
Pros: Quick and easy to calculate
Cons: Inaccurate measures of both quality and diversity
Learned Metrics (e.g., ADEM)
Existing Evaluations
23
Human Evaluation
Statistical (e.g., perplexity)
Reference Based (e.g., BLEU, ROUGE)
Learned Metrics (e.g., ADEM) [Lowe+ 2017], [Olsson+ 2018]
Pros: Quick and easy to calculate
Cons: Still unreliable. Often still can’t catch diversity.
Existing Evaluations
24
[Table: Quality and Diversity (rows) by Human, Statistical, Learned, Reference, Dist (columns)]
Our work: unifying human
and statistical evaluation
to measure both quality
and diversity.
Solution: Optimal
Classification
30
Optimal Classifier ...
Captures Quality: Low quality samples are easily distinguished
Captures Diversity: Under-diversity and plagiarism are also recognizable
Is Intuitive: Optimal classification error makes intuitive sense to humans
Can we reliably
estimate the optimal
classification error?
34
Learned
Classifier
[Chaganty+ 2017], [Novikova+ 2017]
Good model or
weak classifier?
Key Insight:
Only need access to two
features to optimally
classify sentences
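As a rough illustration of why two numbers per sentence can be enough, the sketch below assumes the two features are the densities p_ref(x) and p_model(x) (the slide does not name them at this point; p_ref appears later in the deck) and that reference and model samples are mixed 50/50. Under those assumptions the best classifier picks whichever density is larger, and its error is half the overlap between the two distributions. The toy distributions and the name `optimal_error` are invented for illustration.

```python
def optimal_error(p_ref, p_model):
    """Error of the best 'reference vs. model' classifier on a 50/50 mix.

    Both arguments map a sentence to its probability. The best rule guesses
    the source with the larger probability, so it is wrong with probability
    0.5 * sum_x min(p_ref(x), p_model(x)).
    """
    support = set(p_ref) | set(p_model)
    return 0.5 * sum(min(p_ref.get(x, 0.0), p_model.get(x, 0.0)) for x in support)


# Hypothetical reference distribution over four equally likely sentences.
p_ref = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

perfect = dict(p_ref)         # matches the reference exactly
under_diverse = {"a": 1.0}    # always emits one good sentence
low_quality = {"z": 1.0}      # emits a sentence humans never write

err_perfect = optimal_error(p_ref, perfect)
err_under = optimal_error(p_ref, under_diverse)
err_low = optimal_error(p_ref, low_quality)
print(err_perfect, err_under, err_low)
```

In this toy setting the perfect model is indistinguishable from the reference (error 0.5), while both the under-diverse and the low-quality model are caught far more often, which is the sense in which one error number reflects quality and diversity together.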
Use Humans!
Crowdsource estimates
of “typicality” as a
substitute for p_ref
Human Judgement Score
44
1. 20 crowdworkers rate a sentence
from 1 (rare) to 5 (common)
2. Define HJ as the average of their
“typicality” judgements
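The two steps above can be sketched in a few lines. The ratings list is invented, and the range check and the name `human_judgement` are assumptions added for illustration rather than part of the authors' pipeline.

```python
from statistics import mean


def human_judgement(ratings):
    """Average the crowdworker typicality ratings for one sentence."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 (rare) and 5 (common)")
    return mean(ratings)


# Hypothetical ratings from 20 crowdworkers for a single sentence.
ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 4, 5, 4, 4, 3, 4, 5, 4, 3, 4, 4]
hj = human_judgement(ratings)
print(hj)
```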
Human Unified with Statistical Evaluation
HUSE
51
Learning a classifier in high
dimensions is hard. In two
dimensions, it is easy.
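A minimal sketch of what a two-dimensional classifier could look like. It assumes each sentence is described by the pair (HJ, log p_model), that the classifier is k-nearest-neighbour scored by leave-one-out error, and that the reported score is twice that error; none of these details are spelled out on the slides, and the data points are invented. Features are left unscaled here, which a real setup would likely want to fix.

```python
import math


def loo_knn_error(points, labels, k=3):
    """Leave-one-out error of a k-nearest-neighbour classifier.

    points: list of (hj, log_p_model) pairs; labels: 1 = reference, 0 = model.
    """
    mistakes = 0
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distance from the held-out point to every other point.
        others = [(math.dist(p, q), labels[j])
                  for j, q in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        votes = sum(label for _, label in others[:k])
        prediction = 1 if 2 * votes > k else 0
        mistakes += prediction != y
    return mistakes / len(points)


def huse_score(points, labels, k=3):
    """Assumed definition: twice the leave-one-out classification error."""
    return 2 * loo_knn_error(points, labels, k)


# Hypothetical data: reference sentences look typical to humans but are
# unlikely under the model; model samples are the reverse.
reference = [(4.0, -20.0), (4.2, -21.0), (3.9, -19.0), (4.1, -22.0)]
model = [(2.0, -5.0), (2.1, -6.0), (1.9, -4.0), (2.2, -5.5)]
points = reference + model
labels = [1] * len(reference) + [0] * len(model)

score = huse_score(points, labels)
print(score)
```

With clusters this far apart every held-out point is classified correctly, so the score is 0; a model whose samples overlap the reference cluster would push the error, and hence the score, upward.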
52
HUSE Guarantees
55
Optimal ≤ HUSE ≤ Humans
gap from model
underdiversity
Lower is better
HUSE Guarantees
56
Optimal ≤ HUSE ≤ Humans
Always Outperforms Humans
Zero False Negatives
Lower is better
HUSE Guarantees
57
Optimal ≈ HUSE ≤ Humans
Case Study:
Summarization
Quality-Diversity
Tradeoffs
(1 - HUSE_Q) + (1 - HUSE_D) = (1 - HUSE)
quality + diversity = total error
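A small worked example of the identity above, rearranged to recover HUSE_D from the other two scores. The numeric values are hypothetical and the helper name `diversity_from` is an assumption for illustration.

```python
def diversity_from(huse, huse_q):
    """Solve (1 - HUSE_Q) + (1 - HUSE_D) = (1 - HUSE) for HUSE_D."""
    return 1.0 + huse - huse_q


# Hypothetical scores for one model.
huse, huse_q = 0.5, 0.75
huse_d = diversity_from(huse, huse_q)

quality_gap = 1 - huse_q      # shortfall attributed to quality
diversity_gap = 1 - huse_d    # shortfall attributed to diversity
print(huse_d, quality_gap + diversity_gap, 1 - huse)
```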
Use HUSE
80
● Captures quality and diversity
● Statistically principled
Use our system!
81
https://github.com/hughbzhang/HUSE
Questions?
82
Appendix
Turk Prompt
86
Mutual Information Theorem
87
Hölder’s Bound
