1. Automatic Text Summarization
Trends, Challenges and Opportunities
Siddhartha Banerjee
Research Scientist, Content Platform
Yahoo! (now Oath, a Verizon Company)
September 22, 2017
2. Talk @ Saama Technologies, Siddhartha Banerjee
My background
❑ Undergraduate degree
• Industrial Engineering - 2009 (IIT Kharagpur)
❑ Professional Experience: 2009 – 2012
• Sabre Airline Solutions and Oracle Retail
❑ Ph.D. @ Penn State Information Sciences (2012 - Dec 2016)
• Advised by Prof. Prasenjit Mitra
• Natural Language Processing
❑ Back to Industry: 2017
• Yahoo! (March 2017 - present)
• Question Answering
• Relationship extraction using distant supervision
• Deep Learning
3.
Outline
● What is Text Summarization?
● Overview of existing work
● Challenges
● Current Trends
● My experiences
● The Future of Summarization
● Q&A
4.
What is Text Summarization?
Single-document summarization
Multi-document summarization
5.
An “ideal” summary
● Informativeness
● Coherence
● Grammaticality
6.
Types of Summarization
● Extractive
○ “Extract” certain sentences
○ Easier
○ No issues with grammaticality
● Abstractive
○ Produce “abstracts”
○ Content understanding
○ Generation
7.
Extractive Summarization
1958 (Luhn): Sentences that mention words that occur frequently in the document are more important.
We have come a long way since then!
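A minimal sketch of the frequency idea above, assuming a tiny hand-picked stopword list and a mean-frequency score per sentence. The names `tokenize` and `frequency_summary` are illustrative, and this is a simplification rather than the 1958 method itself:

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "to", "and", "it", "on"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def frequency_summary(sentences, k=1):
    """Return the k sentences whose words are most frequent in the document."""
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Mean document frequency of the sentence's words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Keep the selected sentences in their original document order.
    return [sentences[i] for i in sorted(ranked[:k])]
```

Averaging over the sentence length keeps long sentences from winning purely because they contain more words.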
8.
Extractive Techniques
• Word-statistics based techniques
• Centroid [Radev et al., 2004]
• TextRank [Mihalcea and Tarau, 2004]
• Supervised techniques
• Provide ranked sentences to train from documents
• Learning to Rank
• Topic-model based techniques
• Model sentences as topic vectors [Blei et al., 2003]
• Select sentences that are more “central” to the document vector.
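As a rough illustration of the graph-based ranking behind TextRank, the sketch below scores sentences with a weighted PageRank-style iteration over a word-overlap similarity graph. It assumes whitespace-level tokenization, a fixed number of iterations instead of a convergence check, and the function names are illustrative:

```python
import math
import re


def words(sentence):
    """Set of lowercase alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by the log lengths of both sentences."""
    wa, wb = words(a), words(b)
    denom = math.log(len(wa)) + math.log(len(wb)) if wa and wb else 0.0
    return len(wa & wb) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence; higher means more central."""
    n = len(sentences)
    weights = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, in proportion
        # to the edge weight relative to the neighbour's total edge weight.
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores
```

An extractive summary would then take the top-scoring sentences, usually presented in document order.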
9.
Why “abstractive”?
❑ Consider opinions on the iPhone:
• The iPhone’s battery lasts long…have to charge it once every few days.
• iPhone’s battery is bulky but it is cheap.
• iPhone’s battery is bulky but it lasts long!
❑ Extractive: The iPhone’s battery lasts long…have to charge it once every few days.
• Limit on summary length
❑ Ideal: The iPhone’s battery lasts long and is cheap but is bulky.
• HARD!!
• Preferred (Murray et al., 2010 – user study)
10.
Abstractive Summarization techniques
❏ Text-to-text generation at sentence level – Independent of other sentences
❏ Sentence compression (Cohn and Lapata, 2009)
I: But a month ago, she returned to Britain, taking the children with her.
O: She returned to Britain, taking the children
❏ Extractive to abstractive: Not possible using just compression
❏ Sentence fusion (Barzilay and McKeown, 2005; Filippova and Strube, 2008)
❏ Template-based (Genest and Lapalme, 2011)
❏ Domain-specific templates - Lot of manual effort
11.
Current Trends
● Deep Learning!!
● Neural Attention Model for Sentence Summarization (FAIR, 2015)
○ Headline generation
○ Feed-forward neural network
○ Attention model
● RNN-based summarization (FAIR, 2016)
12.
Sequence to Sequence models
❏ Originally modelled for machine translation
13.
RNNs with attention
http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
● Rare-word problem: Reproducing factual details inaccurately
● Pointer-Generator Networks to the rescue! Copy words from the source text into the summary.
● Get To The Point: Summarization with Pointer-Generator Networks (Stanford NLP Group, 2017)
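The copy mechanism can be sketched as a mixture of two distributions: with probability p_gen the model generates from its fixed vocabulary, otherwise it copies a source word in proportion to the attention it receives. The sketch below only shows that final mixing step. The inputs (`p_gen`, `vocab_dist`, `attention`) are assumed to come from a trained network, and the function name is illustrative:

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix the generation and copy distributions into one word distribution.

    p_gen:         probability of generating from the vocabulary (0..1)
    vocab_dist:    dict word -> probability over the fixed vocabulary
    attention:     attention weight per source position (sums to 1)
    source_tokens: source words aligned with the attention weights
    """
    # Generation part: scale the vocabulary distribution by p_gen.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: add (1 - p_gen) * attention mass to each source word,
    # which lets out-of-vocabulary source words receive probability.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

A rare name that is missing from the vocabulary can still be the most likely output word when the attention is concentrated on it.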
14.
Evaluation
Automatic Evaluation
• ROUGE – Recall-Oriented Understudy for Gisting Evaluation (Lin, 2004)
Manual Evaluation
• Ask human judges to rate summaries on quality
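For intuition, ROUGE-N recall is the fraction of reference n-grams that also appear in the candidate summary. The sketch below is a simplified version, assuming a single reference, whitespace tokenization, and no stemming or stopword handling; the official toolkit does more:

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count by how often it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total
```

With n=2 this corresponds to the R-2 score reported later in the talk, under the same simplifying assumptions.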
15.
Datasets
• News articles
• CNN/Daily Mail dataset
• Document Understanding Conference datasets [DUC, now TAC]
• Several topics: Each topic with 8-10 documents
• Meeting conversations
• Single meeting transcript
• AMI Dataset [http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml]
• 139 meeting transcripts: 119 training + 20 test
16.
My Summarization Experience
Automatically authoring content for Wikipedia
• Improving existing articles
• Constructing new articles
[Pipeline diagram: Web information → Assign to Wiki sections → Summarization]
17.
Summary sentence generation
❑ Multi-sentence compression (Filippova, 2010)
• Graph construction: directed graph
• Nodes are words (with POS)
• Edges are adjacencies
❑ Graph traversal: overgenerate and select
Input
S1 The outbreak is the largest ever reported in North America.
S2 Enterovirus D68 caused outbreak of respiratory disease.
S3 Clusters of the outbreak in the United States were reported in August.
Output
1: Enterovirus D68 caused outbreak is the largest ever reported in North America.
2: Enterovirus D68 caused outbreak in the United States were reported in August.
3: The outbreak is the largest ever reported in August.
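The overgeneration step can be sketched as building a word graph and enumerating its paths. The version below is a simplification made for illustration: nodes are lowercase words without POS tags, the k-th occurrence of a word in a sentence maps to the same node across sentences, edges are unweighted, and every simple path of at least `min_words` words is kept. The function names are illustrative:

```python
import re
from collections import defaultdict

START, END = ("<s>", 0), ("</s>", 0)


def sentence_nodes(sentence):
    """Map each word to a (word, occurrence-index) node for this sentence."""
    seen = defaultdict(int)
    nodes = []
    for word in re.findall(r"[a-z0-9]+", sentence.lower()):
        seen[word] += 1
        nodes.append((word, seen[word]))
    return nodes


def build_word_graph(sentences):
    """Directed graph: nodes are words, edges connect adjacent words."""
    graph = defaultdict(set)
    for sentence in sentences:
        path = [START] + sentence_nodes(sentence) + [END]
        for left, right in zip(path, path[1:]):
            graph[left].add(right)
    return graph


def generate_sentences(graph, min_words=8):
    """Overgenerate: collect every simple START-to-END path as a sentence."""
    results = set()

    def walk(node, visited, words):
        for nxt in graph[node]:
            if nxt == END:
                if len(words) >= min_words:
                    results.add(" ".join(words))
            elif nxt not in visited:
                walk(nxt, visited | {nxt}, words + [nxt[0]])

    walk(START, {START}, [])
    return results
```

Applied to S1 to S3, the generated set includes lowercase versions of the three outputs shown on the slide, along with many other candidates that the selection step then has to filter.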
18.
A comprehensive model (Banerjee and Mitra, 2016)
[Diagram: Word-graph → paths p1, p2, p3, …, pk → generated sentences (✔ kept, ❌ rejected) → select few sentences]
Selection criteria: Informativeness (information coverage), Linguistic Quality (grammaticality), Coherence (ordering of sentences; Bollegala et al. 2012)
19.
Mathematical formulation
[Formulas: an objective to maximize, subject to constraints]
❑ Three factors:
• I – Information coverage [TextRank (2004)]
• LQ – Language model [Heafield et al. 2013]
• Coh – Regression-based scoring
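To make the selection step concrete, here is a toy sketch that picks the subset of generated sentences with the best combined score under a word budget. Everything specific in it is an assumption for illustration: the additive objective (I + LQ per sentence, plus a pairwise Coh bonus), the word-budget constraint, the hand-set scores, and the brute-force search standing in for an ILP solver. It is not the formulation from the slide:

```python
from itertools import combinations


def select_sentences(candidates, coherence, max_words):
    """Pick the subset of candidates with the highest total score.

    candidates: list of dicts with "text", "info" (I) and "lq" (LQ) scores
    coherence:  dict mapping index pairs (i, j), i < j, to a Coh score
    max_words:  word budget for the whole summary
    """
    best_subset, best_score = (), float("-inf")
    indices = range(len(candidates))
    for size in range(1, len(candidates) + 1):
        for subset in combinations(indices, size):
            length = sum(len(candidates[i]["text"].split()) for i in subset)
            if length > max_words:
                continue  # violates the length constraint
            score = sum(candidates[i]["info"] + candidates[i]["lq"] for i in subset)
            score += sum(coherence.get(pair, 0.0) for pair in combinations(subset, 2))
            if score > best_score:
                best_subset, best_score = subset, score
    return [candidates[i]["text"] for i in best_subset]
```

Brute force is only workable for a handful of candidates; with hundreds of generated sentences the same objective would go to an ILP solver.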
20.
Experimental Results: News dataset
• ROUGE evaluation on Document Understanding Conference (DUC) datasets
21.
Experimental Results (contd.)
❑ Manual Evaluation: 10 evaluators
• Informative coverage: ~5% improvement over the “best” extractive system
• Readability: ~4% reduction compared to extractive system
❑ Error Cases
• The U.N. imposed sanctions since 1992 for its refusal to hand over the two Libyans wanted in the 1988 bombing that killed 270 people killed.
• The deal that will make Hun Sen prime minister and Ranariddh agreed to a government formed.
22.
Disaster-event Tweet Summarization (Rudra et al., 2016)
Content-word based summary quality optimization
Content words: Numerals, nouns, locations, main verbs
• Constraint 5: each selected content word must appear in at least one selected sentence
• Constraint 6: a selected sentence determines which content words are selected
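The two constraints tie content words to the tweets that contain them. A toy sketch of that coupling is below: it selects tweets under a word budget and counts each covered content word once, so redundant tweets add little. The objective (number of tweets plus the weight of covered content words), the pre-extracted `content_words` sets, and the brute-force search in place of an ILP solver are all assumptions for illustration:

```python
from itertools import combinations


def summarize_tweets(tweets, word_weight, max_words):
    """Pick tweets that cover the most (weighted) content words within budget.

    tweets:      list of dicts with "text" and a set of "content_words"
    word_weight: dict content word -> importance weight
    max_words:   word budget for the summary
    """
    best_subset, best_score = (), float("-inf")
    for size in range(1, len(tweets) + 1):
        for subset in combinations(range(len(tweets)), size):
            if sum(len(tweets[i]["text"].split()) for i in subset) > max_words:
                continue
            # A content word counts as selected exactly when at least one
            # selected tweet contains it, mirroring constraints 5 and 6.
            covered = set().union(*(tweets[i]["content_words"] for i in subset))
            score = len(subset) + sum(word_weight.get(w, 0.0) for w in covered)
            if score > best_score:
                best_subset, best_score = subset, score
    return [tweets[i]["text"] for i in best_subset]
```

Because each content word is counted once, a second tweet repeating the same facts adds almost nothing to the score.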
23.
Experimental Results
• Readability evaluation (COWABS is our proposed technique)
24.
Meeting summarization using fusion (Banerjee and Mitra, 2015)
• “Um well this is the kick-off meeting for our project.”
• “so we’re designing a new remote control and um.”
• “Um, as you can see it is supposed to be original, trendy and user friendly.”
25.
Results: Meeting data
❑ AMI Dataset (http://groups.inf.ed.ac.uk/ami/corpus/overview.shtml)
• 139 meeting transcripts: 119 training + 20 test (for extractive)
❑ ROUGE Evaluation
• ~17% improvement in R-2 score over another abstractive system (Filippova, 2010)
❑ Readability Analysis
• Our model: Slightly curved around the sides like up to the main display as well. It was voice activated.
• Human: The remote will be single-curved with a cherry design on top. A sample sensor was included to add speech recognition.
26.
Resources
• https://github.com/miso-belica/sumy
• Lots of simple extractive summarization techniques
• https://github.com/facebookarchive/NAMAS
• Abstractive summarization: headline generation task
• http://kavita-ganesan.com/opinosis-summarizer-library
• Summarizing redundant opinions/ reviews
• http://pavel.surmenok.com/2016/10/15/how-to-run-text-summarization-with-tensorflow/
• Tutorial using a seq2seq model on TensorFlow
• https://github.com/g-deoliveira/TextSummarization
• Topic model-based summarization
• https://github.com/StevenLOL/AbTextSumm
• My abstractive summarization technique.
27.
Future of Summarization
❏ The importance of summarization is undeniable
❏ Growth of data
❏ Automatic authoring in journalism
❏ Medical report summarization
❏ Deep Learning (RNNs)
❏ Still a long way to go!
❏ Sequence to sequence models are hard to control
❏ Better metrics. ROUGE is not good enough.
❏ Making sense of an entire summary -- mimicking human capabilities.
28.
Publications
• Siddhartha Banerjee and Prasenjit Mitra. WikiWrite: Generating Wikipedia Articles Automatically. 25th International Joint Conference on Artificial Intelligence (IJCAI-16).
• Koustav Rudra, Siddhartha Banerjee, Muhammad Imran, Niloy Ganguly, Pawan Goyal and Prasenjit Mitra. Summarizing Situational Tweets in Crisis Scenario. ACM HyperText, 2016.
• Siddhartha Banerjee and Prasenjit Mitra. Filling the Gaps: Improving Wikipedia Stubs. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Generating Abstractive Summaries from Meeting Transcripts. 15th ACM SIGWEB International Symposium on Document Engineering (DocEng 2015).
• Siddhartha Banerjee and Prasenjit Mitra. WikiKreator: Improving Wikipedia Stubs Automatically. Association for Computational Linguistics (ACL 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Multi-Document Abstractive Summarization using ILP-based Multi-Sentence Compression. International Joint Conference on Artificial Intelligence (IJCAI 2015).
• Siddhartha Banerjee, Prasenjit Mitra and Kazunari Sugiyama. Abstractive Meeting Summarization using Dependency Graph Fusion. ACM International Conference on World Wide Web (WWW 2015, poster), Florence, Italy.
• Siddhartha Banerjee, Cornelia Caragea and Prasenjit Mitra. Playscript Classification and Automatic Wikipedia Play Articles Generation. International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden.
29.
Email id: sidd.iitkgp@gmail.com