This document discusses different approaches to text summarization, including extractive and abstractive summarization. It presents several naive extractive algorithms using word frequency, sentence intersection scores, and graph theory. It also discusses using neural networks with encoder-decoder models and attention mechanisms for abstractive summarization. The document provides resources for practicing summarization techniques and accessing text datasets.
1. LAZY MAN’S LEARNING
How to Build Your Own Text Summarizer
Sho Fola Soboyejo, Digital Architect, Kroger Co.
April 19th, 2018
@shoreason
2. I’VE GOT A FEVER AND THE ONLY
PRESCRIPTION IS … MORE BOOKS
3. NATURAL LANGUAGE
PROCESSING (NLP) DOMAINS
• Mostly Solved: SPAM detection, part-of-speech
tagging, named entity recognition
• Making Progress: Sentiment analysis, coreference
resolution, word sense disambiguation, parsing,
machine translation, information extraction
• Still Really Hard: Question answering, paraphrase,
summarization, and dialogue
4. PROBLEMS IN NLP
• Ambiguity: Red Tape Holds Up New Bridges
• Idioms: Get Cold Feet, Dark Horse
• Neologisms: Bromance, Unfriend, Retweet
• Tricky named entities: Where is Black Panther playing?
• Non-Standard English: #challengeday, @mlmeetup
Stanford NLP: Dan Jurafsky
9. EXTRACTIVE
• Figure out the most
important sentences in the
document, then simply
extract and order those.
• Uses the same words and sentences
as the document. No abstraction.
• Ranks phrases by relevance
10. ABSTRACTIVE
• Boil down the gist of a
document into an abstract,
likely using new words in the
summary.
• Very much what you and I
would do.
• Much harder
12. SPEED READING TIPS
• 1st and last sentence
(Order in text)
• Title and other paragraphs
(Connection to other
sentences)
• Index (Word Frequency)
• Focus on Keywords
13. BASIC CLEAN UP EXPECTED
• Remove Stop Words
• Stemming
• Lower case
• Remove Punctuation
• Remove Numbers
17. NAIVE ALGORITHM
• Determine most frequent content words in original document
(Word frequency table)
• The N most common words are stored and sorted (e.g. N = 100)
• Score each sentence based on how many high frequency words it
contains
• Build summary by compiling sentences above certain score threshold
• Select N top sentences and sort based on order in original text
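The steps above can be sketched in a few lines of Python. The sentence split on periods and the function name are my own simplifications, not the talk's actual code:

```python
import re
from collections import Counter

def naive_summarize(text, n_sentences=2, n_keywords=100):
    # Naive sentence split on periods; a real implementation would use a
    # proper sentence tokenizer plus the clean-up steps shown earlier.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # 1. Word frequency table; keep the N most common words.
    words = re.findall(r"[a-z']+", text.lower())
    top_words = {w for w, _ in Counter(words).most_common(n_keywords)}
    # 2. Score each sentence by how many high-frequency words it contains.
    scored = [(sum(w in top_words for w in re.findall(r"[a-z']+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    # 3. Keep the top N sentences, then restore original document order.
    best = sorted(sorted(scored, reverse=True)[:n_sentences],
                  key=lambda t: t[1])
    return ". ".join(s for _, _, s in best) + "."

text = "Dogs bark. Cats meow. Dogs and cats play together often. Birds fly."
print(naive_summarize(text))
# Dogs and cats play together often. Birds fly.
```

A score threshold (as the slide mentions) would work just as well as taking the top N; both are shown here as one `sorted` call for brevity.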
19. NAIVE EXTRACTIVE
ALGORITHM 2.0
• Compare each sentence in the document against every other sentence and determine
their intersection
• [0][2] = intersection score of comparing sentence 1 to sentence 3
• Treating each sentence as a node, the connection between two nodes is their intersection
score: the weight of the edge
• Calculate the score of each sentence/node as a key-value pair {sentence: nodeScore}
• NodeScore = sum of all intersections with other sentences, excluding itself; that is, the
sum of all edges connected to the node
• Split the text into paragraphs and pick the best sentence in each paragraph. Essentially,
treat each paragraph as a subgraph and pick the best node in each subgraph
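Under those assumptions, a sketch of the 2.0 algorithm (the period-based sentence split and the function names are my simplifications):

```python
def intersection_score(s1, s2):
    # Sentences as word sets; normalized overlap = edge weight.
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / ((len(w1) + len(w2)) / 2)

def node_scores(sentences):
    # Each sentence is a node; nodeScore = sum of the weights of all
    # edges connecting it to the other sentences (itself excluded).
    scores = {}
    for i, s in enumerate(sentences):
        scores[s] = sum(intersection_score(s, other)
                        for j, other in enumerate(sentences) if i != j)
    return scores

def summarize_paragraphs(paragraphs):
    # Treat each paragraph as a subgraph and pick its best node.
    summary = []
    for para in paragraphs:
        sentences = [s.strip() for s in para.split(".") if s.strip()]
        scores = node_scores(sentences)
        summary.append(max(scores, key=scores.get))
    return ". ".join(summary) + "."

print(summarize_paragraphs(["The cat sat. The cat ran fast. The dog ran."]))
# The cat ran fast.
```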
20. SENTENCE INTERSECTIONS
• s1 = "my friend's car is nicer than
mine but my wife is way more
beautiful"
• s2 = "my wife is more beautiful and
has brown eyes"
• s1.intersection(s2) = {'is', 'wife',
'beautiful', 'my', 'more'}
• Intersection score =
len(s1.intersection(s2)) / ((len(s1) +
len(s2)) / 2) = .4762
• Lower score means less similarity;
higher score means more similarity
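The arithmetic on this slide can be reproduced directly. Note that .4762 only comes out if each sentence is treated as a *set* of unique words (12 and 9 of them), not as a raw token list (14 and 9):

```python
s1 = "my friend's car is nicer than mine but my wife is way more beautiful"
s2 = "my wife is more beautiful and has brown eyes"

# Sets of unique words: len(w1) = 12, len(w2) = 9.
w1, w2 = set(s1.split()), set(s2.split())
common = w1 & w2

score = len(common) / ((len(w1) + len(w2)) / 2)
print(sorted(common))    # ['beautiful', 'is', 'more', 'my', 'wife']
print(round(score, 4))   # 0.4762
```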
23. WHY THIS MIGHT WORK
• Again, a paragraph can be treated as a subatomic
piece of a text
• Sentences with a strong intersection likely hold the
same or very similar information
• A sentence that intersects with many other
sentences is likely very key to the text
25. GOING MUCH FURTHER
• Bi-Grams
• TF-IDF (frequent in a
document but not across
documents)
• Including the title
• Apply stemming
• RNN (Recurrent Neural
Network)
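Of these, TF-IDF is the most mechanical upgrade: weight words that are frequent in a document but rare across documents. A minimal version (raw term frequency times log inverse document frequency; real libraries such as scikit-learn's TfidfVectorizer add smoothing and normalization):

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: a list of token lists, one per document.
    n = len(docs)
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: (tf[w] / len(doc)) * math.log(n / df[w])
                        for w in tf})
    return weights

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran", "far"]]
w = tf_idf(docs)
# "sat" (unique to doc 0) outweighs "cat" (shared with doc 1)
```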
26. GOAL
Train an encoder-decoder recurrent neural network
with LSTM units and attention for generating
summaries using the texts of news articles from the
Gigaword dataset
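This sketch is not the Gigaword model itself, but the attention step it relies on can be illustrated with plain-Python dot-product attention over toy vectors (no learned parameters; in a real model the query and states come from LSTM layers):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    # Dot-product attention: score each encoder hidden state against the
    # decoder's query, normalize with softmax, and return the weighted
    # average ("context vector") the decoder conditions on when emitting
    # the next summary word.
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy numbers: three encoder states of dimension 2.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], states)
# States 0 and 2 align with the query, so they get the larger weights.
```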
27. WHAT IS A NEURAL
NETWORK?
• Modeled after the human brain
(neurons) and nervous system
• Organized into input, hidden,
and output layers
• The network initializes with
guesses and then adjusts (learns)
as more data passes through it
• Deep learning is using a neural
network with more hidden
layers
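The "guess, then adjust" idea can be shown with the smallest possible network, a single artificial neuron (a perceptron) learning a toy rule. This is my own illustration, not part of the talk's code; real networks start from random weights and use many neurons:

```python
# One neuron learning to classify the numbers 0-10 as "big" (> 5) or not.
data = [(x, 1 if x > 5 else 0) for x in range(11)]

w, b = 0.0, 0.0        # the initial "guess" (real nets initialize randomly)
lr = 1.0               # learning rate

for _ in range(25):    # repeated passes over the data
    for x, y in data:
        pred = 1 if w * x + b > 0 else 0
        err = y - pred              # 0 when right, +/-1 when wrong
        w += lr * err * x           # nudge the weight toward the target
        b += lr * err               # nudge the bias too

correct = sum((1 if w * x + b > 0 else 0) == y for x, y in data)
print(f"{correct}/11 correct")
```

Stacking many such units into layers, and replacing the hard threshold with differentiable activations trained by backpropagation, gives the deep networks the next slides build on.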
32. GETTING STARTED
• Try out Algorithmia and
Gensim
• Fork my GitHub code and try
your hand at Naive 3.0
• Explore some NLP and
machine learning intro
courses
• Check out the white papers
I referenced in this talk