Automatic Text Summarization
Trends, Challenges and Opportunities
Siddhartha Banerjee
Research Scientist, Content Platform
Yahoo! (now Oath, a Verizon Company)
September 22, 2017
Talk @ Saama Technologies, Siddhartha Banerjee
My background
❑ Undergraduate degree
• Industrial Engineering - 2009 (IIT Kharagpur)
❑ Professional Experience: 2009 – 2012
• Sabre Airline Solutions and Oracle Retail
❑ Ph.D. @ Penn State Information Sciences (2012 - Dec 2016)
• Advised by Prof. Prasenjit Mitra
• Natural Language Processing
❑ Back to Industry: 2017
• Yahoo! (March 2017 - present)
• Question Answering
• Relationship extraction using distant supervision
• Deep Learning
Outline
● What is Text Summarization?
● Overview of existing work
● Challenges
● Current Trends
● My experiences
● The Future of Summarization
● Q&A
What is Text Summarization?
Single-document summarization
Multi-document summarization
An “ideal” summary
Informativeness Coherence Grammaticality
Types of Summarization
● Extractive
○ “Extract” certain sentences
○ Easier
○ No issues with grammaticality
● Abstractive
○ Produce “abstracts”
○ Content understanding
○ Generation
Extractive Summarization
1958
We have come a long way since then!
Sentences that mention words that occur frequently in the document are more important.
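The word-frequency idea above can be sketched in a few lines. This is a simplified illustration only: it scores each sentence by the average document frequency of its non-stopword words, which is an assumption made here for brevity and not the exact 1958 scoring scheme. The stopword list, the function names and the example text are all hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "it", "that", "on", "for"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOPWORDS]

def frequency_summary(text, k=1):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        ws = content_words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

example_text = "Summarization shortens text. Good summarization keeps important text. Cats sleep."
print(frequency_summary(example_text, k=1))
```

Averaging rather than summing keeps long sentences from winning purely on length; other normalizations are equally plausible.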
Extractive Techniques
● Word-statistics based techniques
○ Centroid [Radev et al., 2004]
○ TextRank [Mihalcea and Tarau, 2004]
● Supervised techniques
○ Train on documents whose sentences have been ranked by importance
○ Learning to Rank
● Topic-model based techniques
○ Model sentences as topic vectors [Blei et al., 2003]
○ Select sentences that are more “central” to the document vector.
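As one concrete instance of the techniques listed above, here is a minimal sketch of TextRank-style sentence ranking. It assumes a word-overlap similarity normalized by the log of the sentence lengths and a plain power-iteration PageRank; the tokenizer, the damping value, the iteration count and the example sentences are illustrative choices, and sentence length is measured over unique tokens as a simplification.

```python
import math
import re

def token_set(sentence):
    """Unique lowercased word tokens of a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def similarity(a, b):
    """Word overlap normalized by log sentence lengths (0.0 for very short sentences)."""
    ta, tb = token_set(a), token_set(b)
    if len(ta) < 2 or len(tb) < 2:
        return 0.0  # avoids a zero denominator when both logs are 0
    return len(ta & tb) / (math.log(len(ta)) + math.log(len(tb)))

def textrank(sentences, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence-similarity graph; returns one score per sentence."""
    n = len(sentences)
    w = [[similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out_weight = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] * scores[j] / out_weight[j] for j in range(n) if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores

example_sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "Stock prices fell sharply today.",
]
example_scores = textrank(example_sentences)
print(example_scores)
```

The two overlapping sentences reinforce each other and receive equal scores, while the unrelated third sentence keeps only the baseline score; an extractive summary would take the top-scoring sentences.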
Why “abstractive”?
❑ Consider opinions on the iPhone:
• The iPhone’s battery lasts long…have to charge it once every few days.
• iPhone’s battery is bulky but it is cheap..
• iPhone’s battery is bulky but it lasts long!
❑ Extractive: The iPhone’s battery lasts long…have to charge it once every few days.
• Limit on summary length
❑ Ideal: The iPhone’s battery lasts long and is cheap but is bulky.
• HARD!!
• Preferred by users (Murray et al., 2010 – user study)
Abstractive Summarization techniques
❏ Text-to-text generation at sentence level – independent of other sentences
❏ Sentence compression (Cohn and Lapata, 2009)
I: But a month ago, she returned to Britain, taking the children with her.
O: She returned to Britain, taking the children
❏ Extractive to abstractive: not possible using just compression
❏ Sentence fusion (Barzilay and McKeown, 2005; Filippova and Strube, 2008)
❏ Template-based (Genest and Lapalme, 2011)
❏ Domain-specific templates - a lot of manual effort
Current Trends
● Deep Learning!!
● Neural Attention Model for Sentence Summarization (FAIR, 2015)
○ Headline generation
○ Feed-forward neural network
○ Attention model
● RNN-based summarization (FAIR, 2016)
Sequence to Sequence models
❏ Originally modelled for machine translation
RNNs with attention
http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
● Rare-word problem: factual details are reproduced inaccurately
● Pointer-Generator Networks to the rescue! Copy words directly from the source text into the summary.
● Get To The Point: Summarization with Pointer-Generator Networks (Stanford NLP Group, 2017)
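To make the copy mechanism concrete, the sketch below mixes a vocabulary distribution with the attention distribution over source tokens, using a generation probability p_gen as in the pointer-generator paper: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (sum of attention on source positions holding w). The function name and all the numbers are made up for illustration; a real model would produce them from the decoder state.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Combine generating from the vocabulary with copying from the source.

    p_gen         -- probability of generating from the fixed vocabulary
    vocab_dist    -- dict word -> probability over the fixed vocabulary
    attention     -- attention weights, one per source token (sums to 1)
    source_tokens -- source words; out-of-vocabulary words are allowed
    """
    final = {word: p_gen * p for word, p in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy probability accumulates over every position where the token occurs.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

# Hypothetical decoder step: "arsenal" is not in the vocabulary but can still be copied.
example_vocab = {"the": 0.5, "team": 0.3, "won": 0.2}
example_source = ["arsenal", "won"]
example_attention = [0.9, 0.1]
example_final = final_distribution(0.4, example_vocab, example_attention, example_source)
print(max(example_final, key=example_final.get))
```

Because the rare word receives probability mass only through attention, the model can reproduce it exactly instead of substituting a more frequent vocabulary word.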
Evaluation
Automatic Evaluation
• ROUGE – Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004)
Manual Evaluation
• Ask human judges to rate summaries on quality
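A minimal sketch of ROUGE-N recall against a single reference follows, assuming whitespace tokenization and lowercasing only. The official ROUGE toolkit offers more (stemming, stopword removal, multiple references, precision and F-scores), none of which is modeled here; the function names and example strings are illustrative.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate (clipped counts)."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

example_reference = "the cat sat on the mat"
example_candidate = "the cat lay on the mat"
print(rouge_n_recall(example_candidate, example_reference, n=1))
print(rouge_n_recall(example_candidate, example_reference, n=2))
```

In this example five of the six reference unigrams and three of the five reference bigrams are matched, which shows how ROUGE-2 penalizes a single substituted word more heavily than ROUGE-1.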
Datasets
• News articles
• CNN/Daily Mail dataset
• Document Understanding Conference datasets [DUC, now TAC]
• Several topics: Each topic with 8-10 documents
• Meeting conversations
• Single meeting transcript
• AMI Dataset [http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml]
• 139 meeting transcripts: 119 training + 20 test
My Summarization Experience
Automatically authoring content for Wikipedia
• Improving existing articles
• Constructing new articles
Pipeline: Web information → Assign to Wiki sections → Summarization
Summary sentence generation
❑ Multi-sentence compression (Filippova, 2010)
• Graph construction: directed graph
• Nodes are words (with POS)
• Edges are adjacencies
❑ Graph traversal: overgenerate and select
Input sentences
S1 The outbreak is the largest ever reported in North America.
S2 Enterovirus D68 caused outbreak of respiratory disease.
S3 Clusters of the outbreak in the United States were reported in August.
Output
1: Enterovirus D68 caused outbreak is the largest ever reported in North America.
2: Enterovirus D68 caused outbreak in the United States were reported in August.
3: The outbreak is the largest ever reported in August.
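The word-graph construction and the overgenerate step on this slide can be sketched as follows. This is a simplification, not the method from the cited work: nodes are keyed by the lowercased word plus its occurrence index within a sentence, standing in for the POS-based node mapping, and every cycle-free path from start to end is enumerated without any scoring or selection. All names are hypothetical.

```python
from collections import defaultdict

START, END = "<s>", "</s>"

def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (w.strip(".,!?;:").lower() for w in sentence.split())
    return [w for w in stripped if w]

def build_word_graph(sentences):
    """Directed graph: each node maps to the set of nodes that directly follow it."""
    graph = defaultdict(set)
    for sentence in sentences:
        seen = defaultdict(int)
        prev = START
        for word in tokenize(sentence):
            node = (word, seen[word])  # k-th occurrence of this word in the sentence
            seen[word] += 1
            graph[prev].add(node)
            prev = node
        graph[prev].add(END)
    return graph

def enumerate_paths(graph, max_words=15):
    """Overgenerate: collect every cycle-free start-to-end path as a candidate sentence."""
    results = set()

    def walk(node, path, visited):
        for nxt in graph.get(node, ()):
            if nxt == END:
                if path:
                    results.add(" ".join(word for word, _ in path))
            elif nxt not in visited and len(path) < max_words:
                walk(nxt, path + [nxt], visited | {nxt})

    walk(START, [], frozenset())
    return results

example_inputs = [
    "The outbreak is the largest ever reported in North America.",
    "Enterovirus D68 caused outbreak of respiratory disease.",
    "Clusters of the outbreak in the United States were reported in August.",
]
example_candidates = enumerate_paths(build_word_graph(example_inputs))
print(len(example_candidates))
```

The three fused sentences shown on the slide appear among the candidates (in lowercase), alongside the original sentences and other paths; a selection step is then needed to pick informative and grammatical ones.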
A comprehensive model (Banerjee and Mitra, 2016)
[Figure: a word graph yields candidate paths p1, p2, p3, …, pk (generated sentences); a few sentences are selected (✔) and the rest discarded (❌). Labels in the figure: Informativeness, Linguistic Quality, Coherence, Information coverage, Grammaticality, Ordering of sentences (Bollegala et al. 2012).]
Mathematical formulation
[Equation: "Maximize" objective and "Constraints"; the formula itself was not preserved in the slide text.]
❑ Three factors:
• I – Information coverage [TextRank (2004)]
• LQ – Language model [Heafield et al. 2013]
• Coh – Regression based scoring
Experimental Results: News dataset
• ROUGE evaluation on Document Understanding Conference (DUC) datasets
Experimental Results (contd.)
❑ Manual Evaluation: 10 evaluators
• Informative coverage: ~5% improvement over the ‘best’ extractive system
• Readability: ~4% reduction compared to the extractive system
❑ Error Cases
• The U.N. imposed sanctions since 1992 for its refusal to hand over the two Libyans wanted in the 1988 bombing that killed 270 people killed.
• The deal that will make Hun Sen prime minister and Ranariddh agreed to a government formed.
Disaster-event Tweet Summarization (Rudra et al., 2016)
Content-word based summary quality optimization
Content words: numerals, nouns, locations, main verbs
• Constraint 5: a selected content word must appear in at least one selected sentence
• Constraint 6: a selected sentence determines the content words to be selected
Experimental Results
• Readability evaluation (COWABS is our proposed technique)
Meeting summarization using fusion (Banerjee and Mitra, 2015)
• “Um well this is the kick-off meeting for our project.”
• “so we’re designing a new remote control and um.”
• “Um, as you can see it is supposed to be original, trendy and user friendly.”
Results: Meeting data
❑ AMI Dataset (http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml)
• 139 meeting transcripts: 119 training + 20 test (for extractive)
❑ ROUGE Evaluation
• ~17% improvement in ROUGE-2 (R-2) score over the other abstractive system (Filippova, 2010)
❑ Readability Analysis
• Our model: Slightly curved around the sides like up to the main display as well. It was voice activated.
• Human: The remote will be single-curved with a cherry design on top. A sample sensor was included to add speech recognition.
Resources
• https://github.com/miso-belica/sumy
• Lots of simple extractive summarization techniques
• https://github.com/facebookarchive/NAMAS
• Abstractive summarization: headline generation task
• http://kavita-ganesan.com/opinosis-summarizer-library
• Summarizing redundant opinions/ reviews
• http://pavel.surmenok.com/2016/10/15/how-to-run-text-summarization-with-tensorflow/
• Tutorial using seq2seq model on tensorflow
• https://github.com/g-deoliveira/TextSummarization
• Topic model-based summarization
• https://github.com/StevenLOL/AbTextSumm
• My abstractive summarization technique.
Future of Summarization
❏ The importance of summarization is undeniable
❏ Growth of data
❏ Automatic authoring in journalism
❏ Medical report summarization
❏ Deep Learning (RNNs)
❏ Still a long way to go!
❏ Sequence to sequence models are hard to control
❏ Better metrics. ROUGE is not good enough.
❏ Making sense of an entire summary -- mimicking human capabilities.
Publications
• Siddhartha Banerjee and Prasenjit Mitra. WikiWrite: Generating Wikipedia Articles Automatically. 25th International Joint Conference on Artificial Intelligence (IJCAI 2016).
• Koustav Rudra, Siddhartha Banerjee, Muhammad Imran, Niloy Ganguly, Pawan Goyal and Prasenjit Mitra. Summarizing Situational Tweets in Crisis Scenario. ACM Hypertext, 2016.
• Siddhartha Banerjee and Prasenjit Mitra. Filling the Gaps: Improving Wikipedia Stubs. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Generating Abstractive Summaries from Meeting Transcripts. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
• Siddhartha Banerjee and Prasenjit Mitra. WikiKreator: Improving Wikipedia Stubs Automatically. Association for Computational Linguistics (ACL 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Multi-Document Abstractive Summarization using ILP-based Multi-Sentence Compression. International Joint Conference on Artificial Intelligence (IJCAI 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Abstractive Meeting Summarization using Dependency Graph Fusion. ACM International Conference on World Wide Web (WWW 2015, poster), Florence, Italy.
• Siddhartha Banerjee, Cornelia Caragea and Prasenjit Mitra. Playscript Classification and Automatic Wikipedia Play Articles Generation. International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden.
Email id: sidd.iitkgp@gmail.com

Text Summarization Talk @ Saama Technologies

  • 1. Automatic Text Summarization Trends, Challenges and Opportunities Siddhartha Banerjee Research Scientist, Content Platform Yahoo! (now Oath, a Verizon Company) September 22, 2017
  • 2. 2Talk @ Saama Technologies Siddhartha Banerjee ❑ Undergraduate degree • Industrial Engineering - 2009 (IIT Kharagpur) ❑ Professional Experience: 2009 – 2012 • Sabre Airline Solutions and Oracle Retail ❑ Ph.D. @Penn State Information Sciences (2012 - Dec’ 2016) • Advised by Prof. Prasenjit Mitra • Natural Language Processing ❑ Back to Industry: 2017 • Yahoo! (March 2017 - present) • Question Answering • Relationship extraction using distant supervision • Deep Learning My background
  • 3. 3Talk @ Saama Technologies Siddhartha Banerjee Outline ● What is Text Summarization? ● Overview of existing work ● Challenges ● Current Trends ● My experiences ● The Future of Summarization ● Q&A
  • 4. 4Talk @ Saama Technologies Siddhartha Banerjee What is Text Summarization? Single-document summarization Multi-document summarization
  • 5. 5Talk @ Saama Technologies Siddhartha Banerjee An “ideal” summary Informativeness Coherence Grammaticality
  • 6. 6Talk @ Saama Technologies Siddhartha Banerjee Types of Summarization ● Extractive ○ “Extract” certain sentences ○ Easier ○ No issues with grammaticality ● Abstractive ○ Produce “abstracts” ○ Content understanding ○ Generation
  • 7. 7Talk @ Saama Technologies Siddhartha Banerjee Extractive Summarization 1958 We have come a long way since then! Sentences that mention words that occur frequently in the document are more important.
  • 8. 8Talk @ Saama Technologies Siddhartha Banerjee Extractive Techniques • Word-statistics based techniques • Centroid [Radev et. al, 2004] • TextRank [Mihalcea and Tarau, 2004] • Supervised techniques • Provide ranked sentences to train from documents • Learning to Rank • Topic-model based techniques • Model sentences as topic vectors [Blei et. al, 2003] • Select sentences that are more “central” to the document vector.
  • 9. 9Talk @ Saama Technologies Siddhartha Banerjee Why “abstractive”? ❑ Consider opinions on iphone: • The iPhone’s battery lasts long…have to charge it once every few days. • iPhone’s battery is bulky but it is cheap.. • iPhone’s battery is bulky but it lasts long! ❑ Extractive: The iPhone’s battery lasts long…have to charge it once every few days. • Limit on summary length ❑ Ideal: The iPhone’s battery lasts long and is cheap but is bulky. • HARD!! • Preferred (Murray et. al, 2010 – user study)
  • 10. 10Talk @ Saama Technologies Siddhartha Banerjee Abstractive Summarization techniques ❏ Text-to-text generation at sentence level – Independent of other sentences ❏ Sentence compression (Cohn and Lapata’ 2009) ❏ Extractive to abstractive: Not possible using just compression ❏ Sentence fusion (Barzilay and McKeown’ 2005, Filippova and Strube, 2008) Template-based (Genest and Lapalme’, 2011) ❏ Domain-specific templates - Lot of manual effort I: But a month ago, she returned to Britain, taking the children with her. O: She returned to Britain, taking the children
  • 11. 11Talk @ Saama Technologies Siddhartha Banerjee Current Trends ● Deep Learning!! ● Neural Attention Model for Sentence Summarization (FAIR, 2015) ○ Headline generation ○ Feed-forward neural network ○ Attention model ● RNN-based summarization (FAIR, 2016)
  • 12. 12Talk @ Saama Technologies Siddhartha Banerjee Sequence to Sequence models ❏ Originally modelled for machine translation ❏ ❏
  • 13. 13Talk @ Saama Technologies Siddhartha Banerjee RNN’s with attention http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html ● Rare-word problem: Reproducing factual details inaccurately ● Pointer-Generator Networks to the rescue! Copy words from source to text. ● Get To The Point: Summarization with Pointer-Generator Networks (Stanford NLP Group, 2017)
  • 14. 14Talk @ Saama Technologies Siddhartha Banerjee Evaluation Automatic Evaluation • ROUGE – Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004) Manual Evaluation •Ask human judges and rate summaries on quality
  • 15. 15Talk @ Saama Technologies Siddhartha Banerjee Datasets • News articles • CNN/Daily News dataset • Document Understanding Conference datasets [DUC, now TAC] • Several topics: Each topic with 8-10 documents • Meeting conversations • Single meeting transcript • AMI Dataset [http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml] • 139 meeting transcripts: 119 training + 20 test
  • 16. My Summarization Experience
    Automatically authoring content for Wikipedia
      • Improving existing articles
      • Constructing new articles
    [Figure: Web information → Assign to Wiki Sections → Summarization]
  • 17. Summary sentence generation
    Input sentences
      S1: The outbreak is the largest ever reported in North America.
      S2: Enterovirus D68 caused outbreak of respiratory disease.
      S3: Clusters of the outbreak in the United States were reported in August.
    Graph construction
    ❑ Multi-sentence compression (Filippova, 2010)
      • Directed graph
      • Nodes are words (with POS)
      • Edges are adjacencies
    ❑ Graph traversal
    Overgenerate and select
    Output
      1: Enterovirus D68 caused outbreak is the largest ever reported in North America.
      2: Enterovirus D68 caused outbreak in the United States were reported in August.
      3: The outbreak is the largest ever reported in August.
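The word-graph idea on this slide can be sketched in a few lines. This is a simplified illustration, not the implementation from the talk: it merges identical content words across sentences by surface form only (no POS tags), keeps every stopword occurrence as its own node using a small hand-picked stopword list, and enumerates all start-to-end paths without any scoring. The selection half of "overgenerate and select" is left out, and all names are hypothetical.

```python
from collections import defaultdict

# Assumption: a tiny stopword list sufficient for the slide's three sentences.
STOPWORDS = {"the", "is", "of", "in", "were"}
START, END = ("<s>", None), ("</s>", None)


def to_nodes(sentence, sent_id):
    """Map a sentence to graph nodes; content words share one node per word."""
    words = sentence.lower().rstrip(".").split()
    nodes = []
    for pos, word in enumerate(words):
        # Stopwords get a unique tag so they are never merged across positions.
        tag = (sent_id, pos) if word in STOPWORDS else None
        nodes.append((word, tag))
    return [START] + nodes + [END]


def build_word_graph(sentences):
    """Directed graph: an edge links each word to the word that follows it."""
    graph = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        nodes = to_nodes(sentence, sent_id)
        for left, right in zip(nodes, nodes[1:]):
            graph[left].add(right)
    return graph


def generate_paths(graph, max_len=20):
    """Overgenerate: list every START-to-END path as a candidate sentence."""
    paths, stack = [], [[START]]
    while stack:
        path = stack.pop()
        for nxt in graph[path[-1]]:
            if nxt == END:
                paths.append(" ".join(word for word, _ in path[1:]))
            elif nxt not in path and len(path) <= max_len:
                stack.append(path + [nxt])
    return paths


sentences = [
    "The outbreak is the largest ever reported in North America.",
    "Enterovirus D68 caused outbreak of respiratory disease.",
    "Clusters of the outbreak in the United States were reported in August.",
]
candidates = generate_paths(build_word_graph(sentences))
```

The candidates include the three original sentences, the three fused outputs listed on the slide (lowercased), and several other combinations; some are ungrammatical or factually wrong, which is why a selection step has to follow.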
  • 18. A comprehensive model (Banerjee and Mitra, 2016)
    [Figure: a word graph generates candidate sentences p1, p2, p3, …, pk; a few sentences are selected based on informativeness (information coverage), linguistic quality (grammaticality) and coherence (ordering of sentences; Bollegala et al., 2012).]
  • 19. Mathematical formulation
    [Equation: the objective to maximize and its constraints appear only as images on the slide.]
    ❑ Three factors:
      • I: Information coverage [TextRank (2004)]
      • LQ: Language model [Heafield et al., 2013]
      • Coh: Regression-based scoring
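For readers who want a concrete picture, one plausible shape for such an objective is sketched below. This is an assumed, illustrative form built only from the three factors named on the slide; it is not the formula from the slide or the paper. The binary variables x_i (candidate sentence p_i is selected) and y_ij (p_i and p_j are both selected), the length function len and the length budget L are all introduced here as assumptions.

```latex
\begin{aligned}
\text{maximize}\quad & \sum_{i=1}^{K} I(p_i)\,\mathrm{LQ}(p_i)\,x_i
  \;+\; \sum_{i=1}^{K}\sum_{\substack{j=1 \\ j \neq i}}^{K} \mathrm{Coh}(p_i, p_j)\,y_{ij} \\
\text{subject to}\quad & y_{ij} \le x_i, \qquad y_{ij} \le x_j
  && \text{(a pair counts only if both sentences are selected)} \\
& \sum_{i=1}^{K} \mathrm{len}(p_i)\,x_i \le L
  && \text{(summary length limit)} \\
& x_i,\; y_{ij} \in \{0, 1\}
\end{aligned}
```

In a form like this, the first sum rewards informative and fluent candidates, while the second rewards selecting sentences that read coherently together.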
  • 20. Experimental Results: News dataset
    • ROUGE evaluation on Document Understanding Conference (DUC) datasets
  • 21. Experimental Results (contd.)
    ❑ Manual Evaluation: 10 evaluators
      • Informative coverage: ~5% improvement over the “best” extractive system
      • Readability: ~4% reduction compared to the extractive system
    ❑ Error Cases
      • The U.N. imposed sanctions since 1992 for its refusal to hand over the two Libyans wanted in the 1988 bombing that killed 270 people killed.
      • The deal that will make Hun Sen prime minister and Ranariddh agreed to a government formed.
  • 22. Disaster-event Tweet Summarization (Rudra et al., 2016)
    Content-word based summary quality optimization
    Content words: numerals, nouns, locations, main verbs
      • Constraint 5: if a content word is selected, at least one sentence containing it is selected
      • Constraint 6: a selected sentence determines the content words to be selected
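The coupling between sentences and content words can be illustrated with a tiny exhaustive search. This is a stand-in for the ILP, not the COWABS method itself: it scores every content word as 1 instead of using learned or computed weights, it takes a hand-written set of content words rather than extracting numerals, nouns, locations and verbs, and it tries all tweet subsets, which only works for a handful of tweets. The example tweets and all names are hypothetical.

```python
from itertools import combinations


def content_words(tweet, vocabulary):
    """Content words of a tweet, given a predefined content-word vocabulary."""
    return {word for word in tweet.lower().split() if word in vocabulary}


def best_summary(tweets, vocabulary, word_budget):
    """Pick the tweet subset within the budget that covers most content words.

    Covered words are derived from the chosen tweets, so a word counts only
    if some selected tweet contains it, and selecting a tweet brings in all
    of its content words (the coupling expressed by constraints 5 and 6).
    """
    best, best_cover = (), set()
    for size in range(1, len(tweets) + 1):
        for subset in combinations(range(len(tweets)), size):
            length = sum(len(tweets[i].split()) for i in subset)
            if length > word_budget:
                continue
            covered = set().union(*(content_words(tweets[i], vocabulary) for i in subset))
            if len(covered) > len(best_cover):
                best, best_cover = subset, covered
    return [tweets[i] for i in best], best_cover


# Hypothetical tweets and content-word vocabulary, for illustration only.
tweets = [
    "earthquake damages 50 houses in kathmandu",
    "50 houses damaged in kathmandu earthquake",
    "airport closed in kathmandu",
]
vocabulary = {"earthquake", "damages", "50", "houses", "kathmandu", "airport", "closed"}
summary, covered = best_summary(tweets, vocabulary, word_budget=10)
```

With a 10-word budget the search skips the near-duplicate second tweet, because pairing the first and third tweets covers more distinct content words.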
  • 23. Experimental Results
    • Readability evaluation (COWABS is our proposed technique)
  • 24. Meeting summarization using fusion (Banerjee and Mitra, 2015)
    • “Um well this is the kick-off meeting for our project.”
    • “so we’re designing a new remote control and um.”
    • “Um, as you can see it is supposed to be original, trendy and user friendly.”
  • 25. Results: Meeting data
    ❑ AMI Dataset (http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml)
      • 139 meeting transcripts: 119 training + 20 test (for extractive)
    ❑ ROUGE Evaluation
      • ~17% improvement in ROUGE-2 (R-2) score over another abstractive system (Filippova, 2010)
    ❑ Readability Analysis
      • Our model: Slightly curved around the sides like up to the main display as well. It was voice activated.
      • Human: The remote will be single-curved with a cherry design on top. A sample sensor was included to add speech recognition.
  • 26. Resources
    • https://github.com/miso-belica/sumy
      • Lots of simple extractive summarization techniques
    • https://github.com/facebookarchive/NAMAS
      • Abstractive summarization: headline generation task
    • http://kavita-ganesan.com/opinosis-summarizer-library
      • Summarizing redundant opinions/reviews
    • http://pavel.surmenok.com/2016/10/15/how-to-run-text-summarization-with-tensorflow/
      • Tutorial using a seq2seq model on TensorFlow
    • https://github.com/g-deoliveira/TextSummarization
      • Topic model-based summarization
    • https://github.com/StevenLOL/AbTextSumm
      • My abstractive summarization technique
  • 27. Future of Summarization
    ❏ The importance of summarization is undeniable
      ❏ Growth of data
      ❏ Automatic authoring in journalism
      ❏ Medical report summarization
    ❏ Deep Learning (RNNs)
    ❏ Still a long way to go!
      ❏ Sequence to sequence models are hard to control
      ❏ Better metrics are needed; ROUGE is not good enough
      ❏ Making sense of an entire summary, mimicking human capabilities
  • 28. Publications
    • Siddhartha Banerjee and Prasenjit Mitra. WikiWrite: Generating Wikipedia Articles Automatically. 25th International Joint Conference on Artificial Intelligence (IJCAI-16).
    • Koustav Rudra, Siddhartha Banerjee, Muhammad Imran, Niloy Ganguly, Pawan Goyal and Prasenjit Mitra. Summarizing Situational Tweets in Crisis Scenario. ACM HyperText, 2016.
    • Siddhartha Banerjee and Prasenjit Mitra. Filling the Gaps: Improving Wikipedia Stubs. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
    • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Generating Abstractive Summaries from Meeting Transcripts. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
    • Siddhartha Banerjee and Prasenjit Mitra. WikiKreator: Improving Wikipedia Stubs Automatically. Association for Computational Linguistics (ACL 2015).
    • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Multi-Document Abstractive Summarization using ILP-based Multi-Sentence Compression. International Joint Conference on Artificial Intelligence (IJCAI 2015).
    • Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Abstractive Meeting Summarization using Dependency Graph Fusion. ACM International Conference on World Wide Web (WWW 2015, poster), Florence, Italy.
    • Siddhartha Banerjee, Cornelia Caragea and Prasenjit Mitra. Playscript Classification and Automatic Wikipedia Play Articles Generation. International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden.
  • 29. Email id: sidd.iitkgp@gmail.com