LAZY MAN’S LEARNING
How to Build Your Own Text Summarizer
Sho Fola Soboyejo, Digital Architect, Kroger Co.
April 19th, 2018
@shoreason
I’VE GOT A FEVER AND THE ONLY
PRESCRIPTION IS … MORE BOOKS
NATURAL LANGUAGE
PROCESSING (NLP) DOMAINS
• Mostly Solved: Spam detection, part-of-speech
tagging, named entity recognition
• Making Progress: Sentiment analysis, coreference
resolution, word sense disambiguation, parsing,
machine translation, information extraction
• Still Really Hard: Question answering, paraphrase,
summarization and dialogue
PROBLEMS IN NLP
• Ambiguity: Red Tape Holds Up New Bridges
• Idioms: Get Cold Feet, Dark Horse
• Neologisms: Bromance, Unfriend, Retweet
• Tricky named entities: Where is Black Panther playing?
• Non-Standard English: #challengeday, @mlmeetup
Stanford NLP: Dan Jurafsky
“HOW CAN YOU SAY THE MOST IMPORTANT THINGS
IN THE SHORTEST AMOUNT OF TIME?”
- Siraj Raval
PRACTICAL APPLICATIONS
FOR SUMMARIZATION
• Headlines (from around the world)
• Outlines (notes for students)
• Minutes (of a meeting)
• Previews (of movies)
• Synopses (soap opera listings)
• Reviews (of a book, CD, movie, etc.)
• Bulletins (weather forecasts/stock market
reports)
• Sound bites (politicians on a current issue)
Source: Page 1, Advances in Automatic Text Summarization, 1999.
FORMS OF SUMMARIZATION
Single Document vs Multi Document
APPROACHES
Extractive vs Abstractive
EXTRACTIVE
• Figure out the most important sentences in the
document, then simply extract and order those.
• Uses the same words and sentences as the
document. No abstraction.
• Ranking phrase relevance
ABSTRACTIVE
• Boil down the gist of a document into an abstract,
likely using new words in the summary.
• Very much what you and I
would do.
• Much harder
“IT’S FAR EASIER TO RECOGNIZE WORDS THAN IT IS
TO UNDERSTAND THE MEANING”
- Laura Klein (Design for Voice Interfaces)
SPEED READING TIPS
• 1st and last sentence
(Order in text)
• Title and other paragraphs
(Connection to other
sentences)
• Index (Word Frequency)
• Focus on Keywords
BASIC CLEAN UP EXPECTED
• Remove Stop Words
• Stemming
• Lower case
• Remove Punctuation
• Remove Numbers
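The clean-up steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as the Porter stemmer), not what the talk's code necessarily uses.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "was"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lower-case, drop numbers and punctuation, remove stop words, then stem."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)  # remove numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(clean("The 3 dogs are barking in the park!"))
```

The order matters a little: lower-casing first lets the stop-word lookup work on a single case, and stemming last avoids stemming words that are about to be thrown away.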
STAGES
CONTENT SELECTION
▸ Sentence Segmentation
▸ Sentence Extraction
▸ Sentence weight
INFORMATION ORDERING
▸ Document order
SENTENCE REALIZATION
▸ Keep original sentences
▸ Sentence simplification
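Sentence segmentation is the first step of the stages listed here. A minimal sketch, assuming sentences end with ".", "!" or "?" followed by whitespace; the function name `split_sentences` is made up for illustration, and abbreviations such as "Dr." will be split incorrectly.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


print(split_sentences("Summaries save time. Do they lose detail? Sometimes!"))
```

Because the list keeps the sentences in their original positions, document order can be restored later simply by sorting selected sentences by their index.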
SUMMARY OPTIONS
Algorithmia
Gensim (summarization)
OFF TO THE RACES
Algorithmia &
Gensim in Action
NAIVE ALGORITHM
• Determine most frequent content words in original document
(Word frequency table)
• N most common words are stored and sorted (100)
• Score each sentence based on how many high frequency words it
contains
• Build summary by compiling sentences above certain score threshold
• Select N top sentences and sort based on order in original text
https://koko-summarizer.herokuapp.com/content
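The naive algorithm can be sketched roughly as follows. This is not the code behind the demo above: the stop-word list, function names and default parameters are assumptions, and the score-threshold step is replaced here by simply taking the top-scoring sentences.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "and", "are", "at", "in", "is", "of", "the", "to"}


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def content_words(sentence):
    """Lower-cased words of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, top_words=100, top_sentences=2):
    sentences = split_sentences(text)
    # Word frequency table over the whole document.
    freq = Counter(w for s in sentences for w in content_words(s))
    common = {w for w, _ in freq.most_common(top_words)}
    # Score = number of high-frequency words the sentence contains.
    scores = [sum(1 for w in content_words(s) if w in common)
              for s in sentences]
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    # Keep the best sentences, then restore original document order.
    keep = sorted(ranked[:top_sentences])
    return [sentences[i] for i in keep]


text = ("Cats sleep all day. Cats and dogs chase cats. "
        "Birds sing. Dogs bark at cats and birds.")
print(summarize(text, top_words=3, top_sentences=2))
```

One known weakness of this scoring is that long sentences tend to win because they contain more words overall; dividing the score by sentence length is a common adjustment.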
NAIVE 1.0 ALGORITHM IN ACTION
NAIVE EXTRACTIVE
ALGORITHM 2.0
• Compare each sentence in the document against every other sentence and determine their
intersection
• [0][2] = intersection score of comparing sentence 1 to sentence 3
• Treating each sentence as a node, the connection between two nodes is the intersection
score (the weight of the edge)
• Calculate the score of each sentence/node as a key-value pair {sentence: nodeScore}
• NodeScore = sum of all intersections with other sentences, excluding itself (the sum of all
edges connected to the node)
• Split the text into paragraphs and pick the best sentence in each paragraph. Essentially, treat
each paragraph as a subset of the graph and pick the best node in each subset
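A compact sketch of these steps, assuming paragraphs are separated by blank lines and sentences are compared as sets of lower-cased words. The function names are illustrative and this is not the original code the talk builds on; duplicate sentences would collide as dictionary keys, which a fuller version would need to handle.

```python
import re


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_set(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))


def intersection_score(a, b):
    """Shared words divided by the average size of the two word sets."""
    sa, sb = word_set(a), word_set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / ((len(sa) + len(sb)) / 2)


def node_scores(sentences):
    """{sentence: sum of edge weights to every other sentence}."""
    n = len(sentences)
    matrix = [[intersection_score(sentences[i], sentences[j])
               for j in range(n)] for i in range(n)]
    return {sentences[i]: sum(matrix[i][j] for j in range(n) if j != i)
            for i in range(n)}


def summarize(text):
    paragraphs = [split_sentences(p) for p in text.split("\n\n")]
    scores = node_scores([s for p in paragraphs for s in p])
    # Pick the highest-scoring node within each paragraph.
    return [max(p, key=scores.get) for p in paragraphs if p]


text = ("The cat sat on the mat. The cat ate the fish. Dogs bark.\n\n"
        "Rain fell on the mat. The sky was grey.")
print(summarize(text))
```

Here `matrix[0][2]` corresponds to the "[0][2]" entry mentioned above: the score of sentence 1 compared with sentence 3.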
SENTENCE INTERSECTIONS
• s1 = "my friend's car is nicer than mine but my wife is way more beautiful"
• s2 = "my wife is more beautiful and has brown eyes"
• s1.intersection(s2) = {'is', 'wife', 'beautiful', 'my', 'more'}
• Intersection score = len(s1.intersection(s2)) / ((len(s1) + len(s2)) / 2) = .4762,
with each sentence treated as a set of unique words (12 in s1, 9 in s2)
• Lower score means less similarity, higher score means more similarity
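The worked example can be reproduced directly, assuming each sentence is split on whitespace and turned into a set of unique words (which is what makes the lengths 12 and 9 and gives the .4762 figure):

```python
s1 = set("my friend's car is nicer than mine but my wife is way more beautiful".split())
s2 = set("my wife is more beautiful and has brown eyes".split())

common = s1.intersection(s2)
# Shared words divided by the average number of unique words per sentence.
score = len(common) / ((len(s1) + len(s2)) / 2)

print(sorted(common))    # ['beautiful', 'is', 'more', 'my', 'wife']
print(round(score, 4))   # 0.4762
```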
[Figure: Graph Theory Implications]
WHY THIS MIGHT WORK
• Again, a paragraph can be treated as a small
self-contained piece of a text
• Sentences with a strong intersection likely hold the
same or very similar information
• A sentence that intersects with many other
sentences is likely key to the text
NAIVE 2.0 ALGORITHM IN ACTION
built on code by Shlomi Babluki
https://koko-summarizer.herokuapp.com/content
GOING MUCH FURTHER
• Bi-Grams
• TF-IDF (frequent in a
document but not across
documents)
• Including Title
• Apply stemming
• RNN (Recurrent Neural
Network)
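Two of these ideas are easy to try out with the standard library. The sketch below shows bi-grams and one common TF-IDF variant (raw term frequency times the log of inverse document frequency, with no smoothing); other variants exist, and the function names are illustrative only.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Adjacent word pairs, e.g. ('new', 'york')."""
    return list(zip(tokens, tokens[1:]))


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict each."""
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["the", "cat", "purred"]]
weights = tf_idf(docs)
print(weights[0])
print(bigrams(["new", "york", "city"]))
```

A word such as "the" that appears in every document gets weight zero, while a word unique to one document scores highest, which matches the "frequent in a document but not across documents" idea.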
GOAL
Train an encoder-decoder recurrent neural network
with LSTM units and attention for generating
summaries using the texts of news articles from the
Gigaword dataset
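Training such a model needs a deep learning framework and a large corpus, but the attention step on its own is small enough to sketch in plain Python. The talk does not say which attention variant is meant, so this assumes simple dot-product attention over toy vectors: no LSTM cells and no training are involved, and all names are illustrative.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Weight each encoder state by its similarity to the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    size = len(encoder_states[0])
    # Context vector: weighted average of the encoder states.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(size)]
    return weights, context


# Toy values: three encoder states (one per input word) and one decoder state.
weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
print(weights, context)
```

In an encoder-decoder summarizer, the decoder would recompute these weights at every output word, letting it focus on different parts of the source article as the summary is generated.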
WHAT IS A NEURAL
NETWORK?
• Modeled after the human brain
(neurons) and nervous system
• Like a neuron, it has input,
hidden and output layers
• The network initializes with a
guess and then learns, adjusting
as more data passes through it
• Deep learning uses a neural
network with more hidden
layers
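The "start with a guess, then adjust" idea can be shown with the smallest possible case: a single linear neuron with no hidden layer, trained by gradient descent on made-up data. This is a toy under those stated assumptions, not a real neural network library, and the learning rate and epoch count are arbitrary choices.

```python
def train(samples, epochs=500, lr=0.05):
    """Fit y = w * x + b by nudging w and b after every example."""
    w, b = 0.0, 0.0  # the initial guess
    for _ in range(epochs):
        for x, target in samples:
            error = (w * x + b) - target  # how wrong the current guess is
            w -= lr * error * x           # adjust in the direction that
            b -= lr * error               # reduces the squared error
    return w, b


# Made-up training data following y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train(samples)
print(round(w, 3), round(b, 3))
```

Real networks stack many such units into hidden layers and use non-linear activations, but the learning loop has the same shape: predict, measure the error, adjust the weights.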
NEURAL NETWORKS (WHITE
PAPERS)
SEQ TO SEQ LEARNING
Courtesy: Quoc V. Le & Mike Schuster, Research Scientists,
Google Brain Team
SALESFORCE PAPER
https://www.salesforce.com/products/einstein/ai-research/tl-dr-reinforced-model-abstractive-summarization/
BRINGING IT TOGETHER
Abstractive: Neural Networks
Extractive: Algorithmia, Gensim, Naive 1.0 and 2.0
GETTING STARTED
• Try out Algorithmia and
Gensim
• Fork my GitHub code and try
your hand at Naive 3.0
• Explore some NLP and
Machine Learning intro
courses
• Check out the White Papers
I referenced in this talk
ACCESS TO RICH DATASETS
• CNN/Daily Mail Stories (Kyunghyun Cho)
• https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ
• BBC Stories
• http://mlg.ucd.ie/
• Annotated English Gigaword
• https://catalog.ldc.upenn.edu/LDC2012T21
Look out for deck on Slideshare
@shoreason
www.shoreason.com
github.com/shoreason
