This document discusses different approaches to text summarization, including extractive and abstractive summarization. It presents several naive extractive algorithms using word frequency, sentence intersection scores, and graph theory. It also discusses using neural networks with encoder-decoder models and attention mechanisms for abstractive summarization. The document provides resources for practicing summarization techniques and accessing text datasets.
1. LAZY MAN’S LEARNING
How to Build Your Own Text Summarizer
Sho Fola Soboyejo, Digital Architect, Kroger Co.
April 19th, 2018
@shoreason
2. I’VE GOT A FEVER AND THE ONLY
PRESCRIPTION IS … MORE BOOKS
3. NATURAL LANGUAGE
PROCESSING (NLP) DOMAINS
• Mostly Solved: SPAM detection, part-of-speech
tagging, named entity recognition
• Making Progress: Sentiment analysis, coreference
resolution, word sense disambiguation, parsing,
machine translation, information extraction
• Still Really Hard: Question answering, paraphrase,
summarization, and dialogue
4. PROBLEMS IN NLP
• Ambiguity: Red Tape Holds Up New Bridges
• Idioms: Get Cold Feet, Dark Horse
• Neologisms: Bromance, Unfriend, Retweet
• Tricky named entities: Where is Black Panther playing?
• Non-Standard English: #challengeday, @mlmeetup
Stanford NLP: Dan Jurafsky
9. EXTRACTIVE
• Figure out the most
important sentences in the
document, then simply
extract and order those.
• Uses the same words and sentences
as the document. No abstraction.
• Ranks phrases by relevance
10. ABSTRACTIVE
• Boil down the gist of a
document into an abstract,
likely using new words in the
summary.
• Very much what you and I
would do.
• Much harder
12. SPEED READING TIPS
• 1st and last sentence
(Order in text)
• Title and other paragraphs
(Connection to other
sentences)
• Index (Word Frequency)
• Focus on Keywords
13. BASIC CLEAN UP EXPECTED
• Remove Stop Words
• Stemming
• Lower case
• Remove Punctuation
• Remove Numbers
17. NAIVE ALGORITHM
• Determine most frequent content words in original document
(Word frequency table)
• The N most common words are stored and sorted (e.g. N = 100)
• Score each sentence based on how many high frequency words it
contains
• Build summary by compiling sentences above certain score threshold
• Select N top sentences and sort based on order in original text
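The steps above can be sketched in a few lines of Python. The sentence split on periods and the function name are my own simplifications, not the talk's actual code:

```python
import re
from collections import Counter

def naive_summarize(text, n_sentences=2, n_keywords=100):
    # Naive sentence split on periods; a real implementation would use a
    # proper sentence tokenizer plus the clean-up steps shown earlier.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # 1. Word frequency table; keep the N most common words.
    words = re.findall(r"[a-z']+", text.lower())
    top_words = {w for w, _ in Counter(words).most_common(n_keywords)}
    # 2. Score each sentence by how many high-frequency words it contains.
    scored = [(sum(w in top_words for w in re.findall(r"[a-z']+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    # 3. Keep the top N sentences, then restore original document order.
    best = sorted(sorted(scored, reverse=True)[:n_sentences],
                  key=lambda t: t[1])
    return ". ".join(s for _, _, s in best) + "."

text = "Dogs bark. Cats meow. Dogs and cats play together often. Birds fly."
print(naive_summarize(text))
# Dogs and cats play together often. Birds fly.
```

A score threshold (as the slide mentions) would work just as well as taking the top N; both are shown here as one `sorted` call for brevity.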
19. NAIVE EXTRACTIVE
ALGORITHM 2.0
• Compare each sentence in the document against every other sentence and determine
their intersection
• [0][2] = intersection score of comparing sentence 1 to sentence 3
• Treating each sentence as a node, the connection between two nodes is their intersection
score: the weight of the edge
• Calculate the score of each sentence/node as a key-value pair {sentence: nodeScore}
• NodeScore = sum of all intersections with other sentences, excluding itself; that is, the
sum of all edges connected to the node
• Split the text into paragraphs and pick the best sentence in each paragraph. Essentially,
treat each paragraph as a subgraph and pick the best node in each subgraph
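Under those assumptions, a sketch of the 2.0 algorithm (the period-based sentence split and the function names are my simplifications):

```python
def intersection_score(s1, s2):
    # Sentences as word sets; normalized overlap = edge weight.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / ((len(w1) + len(w2)) / 2)

def node_scores(sentences):
    # Each sentence is a node; nodeScore = sum of the weights of all
    # edges connecting it to the other sentences (itself excluded).
    scores = {}
    for i, s in enumerate(sentences):
        scores[s] = sum(intersection_score(s, other)
                        for j, other in enumerate(sentences) if i != j)
    return scores

def summarize_paragraphs(paragraphs):
    # Treat each paragraph as a subgraph and pick its best node.
    summary = []
    for para in paragraphs:
        sentences = [s.strip() for s in para.split(".") if s.strip()]
        scores = node_scores(sentences)
        summary.append(max(scores, key=scores.get))
    return ". ".join(summary) + "."

print(summarize_paragraphs(["The cat sat. The cat ran fast. The dog ran."]))
# The cat ran fast.
```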
20. SENTENCE INTERSECTIONS
• s1 = "my friend's car is nicer than
mine but my wife is way more
beautiful"
• s2 = "my wife is more beautiful and
has brown eyes"
• s1.intersection(s2) = {'is', 'wife',
'beautiful', 'my', 'more'}
• Intersection score =
len(s1.intersection(s2)) / ((len(s1) +
len(s2)) / 2) = .4762
• Lower score means less similarity;
higher score means more similarity
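The arithmetic on this slide can be reproduced directly. Note that .4762 only comes out if each sentence is treated as a *set* of unique words (12 and 9 of them), not as a raw token list (14 and 9):

```python
s1 = "my friend's car is nicer than mine but my wife is way more beautiful"
s2 = "my wife is more beautiful and has brown eyes"

# Sets of unique words: len(w1) = 12, len(w2) = 9.
w1, w2 = set(s1.split()), set(s2.split())
common = w1 & w2

score = len(common) / ((len(w1) + len(w2)) / 2)
print(sorted(common))    # ['beautiful', 'is', 'more', 'my', 'wife']
print(round(score, 4))   # 0.4762
```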
23. WHY THIS MIGHT WORK
• Again, a paragraph can be treated as a subatomic
piece of a text
• Sentences with a strong intersection likely hold the
same or very similar information
• A sentence that intersects with many other
sentences is likely very key to the text
25. GOING MUCH FURTHER
• Bi-Grams
• TF-IDF (frequent in a
document but not across
documents)
• Including the title
• Apply stemming
• RNN (Recurrent Neural
Network)
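Of these, TF-IDF is the most mechanical upgrade: weight words that are frequent in a document but rare across documents. A minimal version (raw term frequency times log inverse document frequency; real libraries such as scikit-learn's TfidfVectorizer add smoothing and normalization):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: a list of token lists, one per document.
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                        for w in tf})
    return weights

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "far"]]
w = tf_idf(docs)
# "sat" (unique to doc 0) outweighs "cat" (shared with doc 1)
```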
26. GOAL
Train an encoder-decoder recurrent neural network
with LSTM units and attention for generating
summaries using the texts of news articles from the
Gigaword dataset
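This sketch is not the Gigaword model itself, but the attention step it relies on can be illustrated with plain-Python dot-product attention over toy vectors (no learned parameters; in a real model the query and states come from LSTM layers):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    # Dot-product attention: score each encoder hidden state against the
    # decoder's query, normalize with softmax, and return the weighted
    # average ("context vector") the decoder conditions on when emitting
    # the next summary word.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy numbers: three encoder states of dimension 2.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], states)
# States 0 and 2 align with the query, so they get the larger weights.
```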
27. WHAT IS A NEURAL
NETWORK?
• Modeled after the human brain
(neurons) and nervous system
• Organized into input, hidden,
and output layers
• The network initializes with
guesses and then adjusts (learns)
as more data passes through it
• Deep learning is using a neural
network with more hidden
layers
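The "guess, then adjust" idea can be shown with the smallest possible network, a single artificial neuron (a perceptron) learning a toy rule. This is my own illustration, not part of the talk's code; real networks start from random weights and use many neurons:

```python
# One neuron learning to classify the numbers 0-10 as "big" (> 5) or not.
data = [(x, 1 if x > 5 else 0) for x in range(11)]

w, b = 0.0, 0.0        # the initial "guess" (real nets initialize randomly)
lr = 1.0               # learning rate

for _ in range(25):    # repeated passes over the data
    for x, y in data:
        pred = 1 if w * x + b > 0 else 0
        err = y - pred              # 0 when right, +/-1 when wrong
        w += lr * err * x           # nudge the weight toward the target
        b += lr * err               # nudge the bias too

correct = sum((1 if w * x + b > 0 else 0) == y for x, y in data)
print(f"{correct}/11 correct")
```

Stacking many such units into layers, and replacing the hard threshold with differentiable activations trained by backpropagation, gives the deep networks the next slides build on.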
32. GETTING STARTED
• Try out Algorithmia and
Gensim
• Fork my GitHub code and try
your hand at Naive 3.0
• Explore some NLP and
machine learning intro
courses
• Check out the white papers
I referenced in this talk