This document discusses text summarization using machine learning. It begins by defining text summarization as reducing a text to create a summary that retains the most important points. There are two main types: single document summarization and multiple document summarization. Extractive summarization creates summaries by extracting phrases or sentences from the source text, while abstractive summarization expresses ideas using different words. Supervised machine learning approaches use labeled training data to train classifiers to select content, while unsupervised approaches select content based on metrics like term frequency-inverse document frequency. ROUGE is commonly used to automatically evaluate summaries by comparing them to human references. Query-focused multi-document summarization aims to answer a user's information need by summarizing relevant documents
1. Text Summarization - Machine Learning
TEXT SUMMARIZATION
1 Kareem El-Sayed Hashem
Mohamed Mohsen Brary
2. TEXT SUMMARIZATION
Goal: reducing a text with a computer program in
order to create a summary that retains the most
important points of the original text.
Text Summarization - Machine Learning
Summarization Applications
summaries of email threads
action items from a meeting
simplifying text by compressing sentences
2
3. WHAT TO SUMMARIZE?
SINGLE VS. MULTIPLE DOCUMENTS
Single Document Summarization
Given a single document produce
Text Summarization - Machine Learning
Abstract
Outline
Headline
Multiple Document Summarization
Given a group of document produce a gist of the
document
A series of news stories of the same event
A set of webpages about some topic or question
3
4. QUERY-FOCUSED SUMMARIZATION
& GENERIC SUMMARIZATION
Generic Summarization
Summarize the content of a document
Text Summarization - Machine Learning
Query-focused Summarization
Summarize a document with respect to an
information need expressed in a user query
A kind of complex question answering
Answer a question by summarizing a document that has
the information to construct the answer
4
5. SUMMARIZATION FOR QUESTION
ANSWERING:
Snippets
Create snippets summarizing a web page for a query
Text Summarization - Machine Learning
Multiple Documents
Create answer to complex questions summarizing
multiple documents.
Instead of giving a snippet for each document
Create a cohesive answer that combines information from
each document
5
6. EXTRACTIVE SUMMARIZATION
& ABSTRACTIVE SUMMARIZATION
Extractive Summarization:
Create the summary from phrases or sentences in the
source document(s)
Text Summarization - Machine Learning
Abstractive Summarization
Express the ideas in the source document using
different words
6
7. SUMMARIZATION: THREE STAGES
Content Selection: choose sentences to extract
from the document
Text Summarization - Machine Learning
Information Ordering: choose an order to place
them in the summary
Sentence Realization: clean up the sentence
7
8. UNSUPERVISED CONTENT SELECTION
Intuition Dating Back to Luhn (1958):
Choose sentences that have distinguished or
informative words
Text Summarization - Machine Learning
Two Approaches to Define distinguished words
tf-idf: weigh each word wi in document j by tf-idf
Topic signature: choose smaller set of distinguished
words
Log-likelihood ratio (LLR)
8
9. TOPIC SIGNATURE-BASED CONTENT
SELECTION WITH QUERIES
Choose words that are informative either
By log-likelihood ratio (LLR)
Text Summarization - Machine Learning
Or by appearing in the query
Weigh a sentence by weight of its words:
9
10. SUPERVISED CONTENT SELECTION
Given
A labeled training set of good summaries for each
document
Text Summarization - Machine Learning
Align
The sentences in the document with sentences in the
summary
Extract Features
Position
Length of sentence
Word informativeness
Cohesion
10
11. SUPERVISED CONTENT SELECTION
Train
A binary classifier (put sentence in summary? Yes or
no)
Text Summarization - Machine Learning
Problems
Hard to get labeled training data
Alignment is difficult
Performance not better that unsupervised algorithm
11
12. EVALUATING SUMMARIES: ROUGE
ROUGE “ Recall Oriented Understudy for
Gisting Evaluation ”
Text Summarization - Machine Learning
Internal metric for automatically evaluating
summaries
Based on BLEU (a metric used for machine
translation)
Not as good as human evaluation.
But much more convenient
12
13. EVALUATING SUMMARIES: ROUGE
Given a document D, and an automatic
summary X:
Text Summarization - Machine Learning
Have N humans produce a set of reference
summaries of D
Run System, giving automatic summary X
What percentage of the bigrams from the reference
summaries appear in X?
13
14. EXAMPLE
Human 1: water spinach is a green leafy
vegetable grown in the tropics.
Text Summarization - Machine Learning
Human 2: water spinach is a semi-aquatic
tropical plant grown as a vegetable.
Human 3: water spinach is a commonly eaten
leaf vegetable of Asia.
System: water spinach is a leaf vegetable
commonly eaten in tropical areas of Asia.
ROUGE -2= = 12/28 = 0.43 14
15. ANSWERING HARDER QUESTION:
QUERY-FOCUSED MULTI-DOCUMENT
SUMMARIZATION
The (bottom-up) snippet method
Find a set of relevant documents
Text Summarization - Machine Learning
Extract informative sentences form the documents
Order and modify the sentences into an answer
The(top-down) information extraction method
Build specific answers for different questions types:
Definition questions
Biography questions
Certain medical questions
15
17. MAXIMAL MARGINAL RELEVANCE (MMR)
An iterative method for content selection from
multiple documents
Text Summarization - Machine Learning
Iteratively (greedily) choose the best sentence to
insert in the summary/answer so far:
Relevant: maximally relevant to the user query
High cosine similarity to the query
Novel: minimally redundant with the summary so
far:
Low cosine similarity to the summary
17
Stop when desired length
18. LLR + MMR CHOOSING INFORMATIVE YET
NON-REDUNDANT SENTENCES
One of many ways to combine the intuitions of
LLR and MMR:
Text Summarization - Machine Learning
Score each sentence based on LLR (including query
words)
Include the sentence with highest score in the
summary
Iteratively add into the summary high-scoring
sentences that are not redundant with the summary
so far.
18
19. INFORMATION ORDERING
Chronological ordering:
Order sentences by the date of the document “ for
summarizing news”
Text Summarization - Machine Learning
Coherence:
Choose ordering that make neighboring sentences
similar(by cosine similarity)
Choose ordering in which neighboring sentences
discuss the same entity
Topical ordering
19
Learn the ordering of topics in the source document
20. DOMAIN-SPECIFIC ANSWERING:
THE INFORMATION EXTRACTION METHOD
A good biography of a person contains:
A person’s birth/death, fame factor, education …etc
Text Summarization - Machine Learning
A good definition contains
Type or category “ The Hajj is a type of ritual ”
A medical answer about a drug’s use contains:
The problem : medical condition
The intervention : drug or procedure
The outcome : the result of the study
20
21. INFORMATION THAT SHOULD BE IN THE
ANSWER FOR 3 KINDS OF QUESTIONS
Text Summarization - Machine Learning
21