SlideShare une entreprise Scribd logo
1  sur  63
Télécharger pour lire hors ligne
Natural Language Processing
with Python
April 9, 2016
Intro to Natural Language
Processing
What is Language?
Or, why should we analyze it?
What does this mean?
Crab
Is the symbol representative of its meaning?
Crab/kræb/
Or is it just a mapping in a mental lexicon?
Crab/kræb/
symbolsound sight
These symbols are arbitrary
κάβουρας/kávouras/
symbolsound sight
Linguistic symbols:
- Not acoustic “things”
- Not mental processes
Critical implications for:
- Literature
- Linguistics
- Computer Science
- Artificial Intelligence
What is Language?
The science that has been developed around the facts of language
passed through three stages before finding its true and unique
object. First something called "grammar" was studied. This study,
initiated by the Greeks and continued mainly by the French, was
based on logic. It lacked a scientific approach and was detached
from language itself. Its only aim was to give rules for
distinguishing between correct and incorrect forms; it was a
normative discipline, far removed from actual observation, and its
scope was limited.
-- Ferdinand de Saussure
What is Natural Language Processing?
The science that has been developed around the facts of language
passed through three stages before finding its true and unique
object. First something called "grammar" was studied. This study,
initiated by the Greeks and continued mainly by the French, was
based on logic. It lacked a scientific approach and was detached
from language itself. Its only aim was to give rules for
distinguishing between correct and incorrect forms; it was a
normative discipline, far removed from actual observation, and its
scope was limited.
-- Ferdinand de Saussure
What is Natural Language Processing?
Formal vs. Natural Languages
Formal Languages
- Strict, unchanging rules defined
by grammars and parsed by
regular expressions
- Generally application specific
(chemistry, math)
- Literal: exactly what is said is
meant.
- No ambiguity
- Parsable by regular expressions
- Inflexible: no new terms or
meaning.
Natural Languages
- Flexible, evolving language that
occurs naturally in human
communication
- Unspecific and used in many
domains and applications
- Redundant and verbose in order
to make up for ambiguity
- Expressive
- Difficult to parse
- Very flexible even in narrow
contexts
Computer science has
traditionally focused on formal
languages.
tokens
Who has written a compiler or interpreter?
Lexical
Analysis
Syntactic
Analysis
“Lexing” “Parsing”
Regular Grammar
(Word Rules)
Context Free Grammar
(Sentence Rules)
string tree
Syntax Errors are your friend! http://bbc.in/23pSPRq
However, ambiguity is required
for understanding when
communicating between people
with diverse experience.
Most of the time. http://bit.ly/1XlOtHv
Natural Language Processing
requires flexibility, which
generally comes from machine
learning.
Language models are used for flexibility
“There was a ton of traffic on the
beltway so I was ________.”
“At beach we watched the ________.”
“Watch out for that ________!”
Intuition: Language is Predictable (if flexible)
These models predict relationships between tokens.
crab
beach
steamed
boat
ocean
sand
seafood
But they can’t understand meaning (yet)
So please keep in mind:
Tokens != Words
- Substrings
- Only structural
- Data
“bearing”
“shouldn’t”
- Objects
- Contains a “sense”
- Meaning
to bear.verb-1
should.auxverb-3
not.adverb-1
The State of the Art
- Academic design for use alongside intelligent
agents (AI discipline)
- Relies on formal models or representations of
knowledge & language
- Models are adapted and augmented through
probabilistic methods and machine learning.
- A small number of algorithms comprise the
standard framework.
Traditional NLP Applications
- Summarization
- Reference Resolution
- Machine Translation
- Language Generation
- Language Understanding
- Document Classification
- Author Identification
- Part of Speech Tagging
- Question Answering
- Information Extraction
- Information Retrieval
- Speech Recognition
- Sense Disambiguation
- Topic Recognition
- Relationship Detection
- Named Entity Recognition
What is Required?
Domain Knowledge
A Corpus in the Domain
The NLP Pipeline
The study of the forms of things, words in particular.
Consider pluralization for English:
● Orthographic Rules: puppy → puppies
● Morphological Rules: goose → geese or fish
Major parsing tasks:
stemming, lemmatization and tokenization.
Morphology
The study of the rules for the formation of sentences.
Major tasks:
chunking, parsing, feature parsing, grammars
Syntax
Semantics
The study of meaning.
- I see what I eat.
- I eat what I see.
- He poached salmon.
Major Tasks
Frame extraction, creation of TMRs
Thematic Meaning Representation
“The man hit the building with the baseball bat”
{
"subject": {"text": "the man", "sense": human-agent},
"predicate": {"text": "hit", "sense": strike-physical-force},
"object": {"text": "the building", "sense": habitable-structure},
"instrument": {"text": "with the baseball bat" "sense": sports-equipment}
}
Recent NLP Applications
- Yelp Insights
- Winning Jeopardy! IBM Watson
- Computer assisted medical coding (3M Health Information Systems)
- Geoparsing -- CALVIN (built by Charlie Greenbacker)
- Author Identification (classification/clustering)
- Sentiment Analysis (RTNNs, classification)
- Language Detection
- Event Detection
- Google Knowledge Graph
- Named Entity Recognition and Classification
- Machine Translation
Applications are BIG data
- Examples are easier to create than rules.
- Rules and logic miss frequency and language
dynamics
- More data is better for machine learning,
relevance is in the long tail
- Knowledge engineering is not scalable
- Computational linguistics methodologies are
stochastic
The Natural Language
Toolkit (NLTK)
What is NLTK?
- Python interface to over 50 corpora and lexical
resources
- Focus on Machine Learning with specific domain
knowledge
- Free and Open Source
- Numpy and Scipy under the hood
- Fast and Formal
What is NLTK?
Suite of libraries for a variety of academic text
processing tasks:
- tokenization, stemming, tagging,
- chunking, parsing, classification,
- language modeling, logical semantics
Pedagogical resources for teaching NLP theory in
Python ...
Who Wrote NLTK?
Steven Bird
Associate Professor
University of Melbourne
Senior Research Associate, LDC
Ewan Klein
Professor of Language Technology
University of Edinburgh.
Batteries Included
NLTK = Corpora + Algorithms
Ready for Research!
What is NLTK not?
- Production ready out of the box*
- Lightweight
- Generally applicable
- Magic
*There are actually a few things that are production
ready right out of the box.
The Good
- Preprocessing
- segmentation, tokenization, PoS tagging
- Word level processing
- WordNet, Lemmatization, Stemming, NGram
- Utilities
- Tree, FreqDist, ConditionalFreqDist
- Streaming CorpusReader objects
- Classification
- Maximum Entropy, Naive Bayes, Decision Tree
- Chunking, Named Entity Recognition
- Parsers Galore!
- Languages Galore!
The Bad
- Syntactic Parsing
- No included grammar (not a black box)
- Feature/Dependency Parsing
- No included feature grammar
- The sem package
- Toy only (lambda-calculus & first order logic)
- Lots of extra stuff
- papers, chat programs, alignments, etc.
Other Python NLP Libraries
- TextBlob
- SpaCy
- Scikit-Learn
- Pattern
- gensim
- MITIE
- guess_language
- Python wrapper for Stanford CoreNLP
- Python wrapper for Berkeley Parser
- readability-lxml
- BeautifulSoup
NLTK Demo
- Working with Included Corpora
- Segmentation
- Tokenization
- Tagging
- A Parsing Exercise
- Named Entity RecognitionWorking with Text
Machine Learning on Text
Machine learning uses instances
(examples) of data to fit a
parameterized model which is
used to make predictions
concerning new instances.
In text analysis, what are the
instances?
Instances = Documents
(no matter their size)
A corpus is a collection of
documents to learn about.
(labeled or unlabeled)
Features describe instances in a
way that machines can learn on by
putting them into feature space.
Document Features
Document level features
- Metadata: title, author
- Paragraphs
- Sentence construction
Word level features
- Vocabulary
- Form (capitalization)
- Frequency
Document
Paragraphs
Sentences
Tokens
Vector Encoding
- Basic representation of documents: a vector whose length is
equal to the vocabulary of the entire corpus.
- Word positions in the vector are based on lexicographic order.
The elephant sneezed
at the sight of
potatoes.
Bats can see via
echolocation. See the
bat sight sneeze!
Wondering, she
opened the door to
the studio.
at
bat
can
door
echolocationelephant
of
open
potato
see
she
sightsneeze
studio
the
to
viaw
onder
- One of the simplest models: compute the frequency of words in
the document and use those numbers as the vector encoding.
0
at
2
bat
1
can
0
door
1
echolocation
0
elephant
0
of
0
open
0
potato
2
see
0
she
1sight
1
sneeze
0
studio
1
the
0
to
1
via
0
w
onder
Bag of Words: Token Frequency
The elephant sneezed
at the sight of
potatoes.
Bats can see via
echolocation. See the
bat sight sneeze!
Wondering, she
opened the door to
the studio.
One Hot Encoding
- The feature vector encodes the vocabulary of the document.
- All words are equally distant, so must reduce word forms.
- Usually used for artificial neural network models.
The elephant sneezed
at the sight of
potatoes.
Bats can see via
echolocation. See the
bat sight sneeze!
Wondering, she
opened the door to
the studio.
1
at
0
bat
0
can
0
door
0
echolocation
1
elephant
0
of
0
open
1
potato
0
see
0
she
1sight
1
sneeze
0
studio
1
the
0
to
0
via
0
w
onder
TF-IDF Encoding
- Highlight terms that are very relevant to a document relative to
the rest of the corpus by computing the term frequency times
the inverse document frequency of the term.
The elephant sneezed
at the sight of
potatoes.
Bats can see via
echolocation. See the
bat sight sneeze!
Wondering, she
opened the door to
the studio.
0
at
0
bat
0
can
.3
door
0
echolocation
0
elephant
0
of
.3
open
0
potato
0
see
.3
she
0sight
0
sneeze
.4
studio
0
the
0
to
0
via
.3
w
onder
Pros and Cons of Vector Encoding
Pros
- Machine learning requires
a vector anyway.
- Can embed complex
representations like TF-IDF
into the vector form.
- Drives towards token-
concept mapping without
rules.
Cons
- The vectors have lots of
columns (high dimension)
- Word order, grammar, and
other structural features
are natively lost.
- Difficult to add knowledge
to learning process.
In the end, much of the work for
language aware applications
comes from domain specific
feature analysis; not just simple
vectorization.
Classification and Clustering
Classification
- Supervised ML
- Requires pre-labeled
corpus of documents
- Sentiment Analysis
- Models:
- Naive Bayes
- Maximum Entropy
Clustering
- Unsupervised ML
- Groups similar
documents together.
- Topic Modeling
- Models:
- LDA
- NNMF
Classification Pipeline
Training Text,
Documents
Labels
Feature
Vectors
Classifier
Algorithm
New Document Feature
Vector
Predictive
Model
Classification
Topic Modeling Pipeline
Training Text,
Documents
Feature
Vectors
Clustering
Algorithm
New Document Feature
Vector
Topic A Topic B Topic C
Similarity
These two fundamental techniques
are the basis of most NLP.
It’s all about designing the problem
correctly.
Consider an automatic question
answering system: how might
you architect the application?
Production Grade NLP
NLTK
TextBlob
- Access to LDA model
- Good TF-IDF Modeling
- Use word2vec
- Text processing
- Lexical resources
- Access to WordNet
- More model families
- Faster implementations
- Pipelines for NLP
Our Task
Build a system that ingests raw language data
and transforms it into a suitable representation
for creating revolutionary applications.
Workshop
Building language aware data
products
- Reading corpora for analysis
- Feature Extraction
- Document Classification
- Topic Modeling (Clustering)

Contenu connexe

Tendances

Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Natural language processing
Natural language processing Natural language processing
Natural language processing Md.Sumon Sarder
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question AnsweringMarina Santini
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processingSanzid Kawsar
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)Kuppusamy P
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingRishikese MR
 
Artificial intelligence and knowledge representation
Artificial intelligence and knowledge representationArtificial intelligence and knowledge representation
Artificial intelligence and knowledge representationSajan Sahu
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingVeenaSKumar2
 
Natural language processing
Natural language processingNatural language processing
Natural language processingHansi Thenuwara
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyMarina Santini
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processingrohitnayak
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text ClassificationSai Srinivas Kotni
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing Adarsh Saxena
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentationMarijn van Zelst
 

Tendances (20)

Text Classification
Text ClassificationText Classification
Text Classification
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Natural language processing
Natural language processing Natural language processing
Natural language processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Lecture: Question Answering
Lecture: Question AnsweringLecture: Question Answering
Lecture: Question Answering
 
Natural Language processing
Natural Language processingNatural Language processing
Natural Language processing
 
Natural language processing (nlp)
Natural language processing (nlp)Natural language processing (nlp)
Natural language processing (nlp)
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
NLP
NLPNLP
NLP
 
Artificial intelligence and knowledge representation
Artificial intelligence and knowledge representationArtificial intelligence and knowledge representation
Artificial intelligence and knowledge representation
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Divide and conquer
Divide and conquerDivide and conquer
Divide and conquer
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
Introduction to Natural Language Processing
Introduction to Natural Language ProcessingIntroduction to Natural Language Processing
Introduction to Natural Language Processing
 
Presentation on Text Classification
Presentation on Text ClassificationPresentation on Text Classification
Presentation on Text Classification
 
Natural Language Processing
Natural Language Processing Natural Language Processing
Natural Language Processing
 
Text classification presentation
Text classification presentationText classification presentation
Text classification presentation
 
Knowledge representation
Knowledge representationKnowledge representation
Knowledge representation
 

Similaire à Natural Language Processing with Python

NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptOlusolaTop
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Mustafa Jarrar
 
Natural language processing
Natural language processingNatural language processing
Natural language processingBasha Chand
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language ProcessingMichel Bruley
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4DigiGurukul
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language ProcessingToine Bogers
 
Intro 2 document
Intro 2 documentIntro 2 document
Intro 2 documentUma Kant
 
Natural_Language_Processing_1.ppt
Natural_Language_Processing_1.pptNatural_Language_Processing_1.ppt
Natural_Language_Processing_1.ppttestbest6
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxAlyaaMachi
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA DATASCIENCE
 
1. level of language study.pptx
1. level of language study.pptx1. level of language study.pptx
1. level of language study.pptxAlkadumiHamletto
 

Similaire à Natural Language Processing with Python (20)

NLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.pptNLP introduced and in 47 slides Lecture 1.ppt
NLP introduced and in 47 slides Lecture 1.ppt
 
Nlp
NlpNlp
Nlp
 
Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing Adnan: Introduction to Natural Language Processing
Adnan: Introduction to Natural Language Processing
 
intro.ppt
intro.pptintro.ppt
intro.ppt
 
Natural language processing
Natural language processingNatural language processing
Natural language processing
 
Intro
IntroIntro
Intro
 
Intro
IntroIntro
Intro
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4Artificial Intelligence Notes Unit 4
Artificial Intelligence Notes Unit 4
 
Nlp (1)
Nlp (1)Nlp (1)
Nlp (1)
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Intro 2 document
Intro 2 documentIntro 2 document
Intro 2 document
 
NLP
NLPNLP
NLP
 
Natural_Language_Processing_1.ppt
Natural_Language_Processing_1.pptNatural_Language_Processing_1.ppt
Natural_Language_Processing_1.ppt
 
Nltk
NltkNltk
Nltk
 
NLP todo
NLP todoNLP todo
NLP todo
 
Natural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptxNatural Language Processing_in semantic web.pptx
Natural Language Processing_in semantic web.pptx
 
L1 nlp intro
L1 nlp introL1 nlp intro
L1 nlp intro
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
1. level of language study.pptx
1. level of language study.pptx1. level of language study.pptx
1. level of language study.pptx
 

Plus de Benjamin Bengfort

Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningBenjamin Bengfort
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Benjamin Bengfort
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Benjamin Bengfort
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection ProcessBenjamin Bengfort
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity ResolutionBenjamin Bengfort
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportBenjamin Bengfort
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataBenjamin Bengfort
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Benjamin Bengfort
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseBenjamin Bengfort
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXBenjamin Bengfort
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Benjamin Bengfort
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with PythonBenjamin Bengfort
 

Plus de Benjamin Bengfort (19)

Getting Started with TRISA
Getting Started with TRISAGetting Started with TRISA
Getting Started with TRISA
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learning
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection Process
 
Data Product Architectures
Data Product ArchitecturesData Product Architectures
Data Product Architectures
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation Report
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkX
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)
 
Annotation with Redfox
Annotation with RedfoxAnnotation with Redfox
Annotation with Redfox
 
Rasta processing of speech
Rasta processing of speechRasta processing of speech
Rasta processing of speech
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with Python
 

Dernier

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Dernier (20)

DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Natural Language Processing with Python

  • 1. Natural Language Processing with Python April 9, 2016
  • 2. Intro to Natural Language Processing
  • 3. What is Language? Or, why should we analyze it?
  • 4. What does this mean? Crab
  • 5. Is the symbol representative of its meaning? Crab/kræb/
  • 6. Or is it just a mapping in a mental lexicon? Crab/kræb/ symbolsound sight
  • 7. These symbols are arbitrary κάβουρας/kávouras/ symbolsound sight
  • 8. Linguistic symbols: - Not acoustic “things” - Not mental processes Critical implications for: - Literature - Linguistics - Computer Science - Artificial Intelligence What is Language?
  • 9. The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited. -- Ferdinand de Saussure What is Natural Language Processing?
  • 10. The science that has been developed around the facts of language passed through three stages before finding its true and unique object. First something called "grammar" was studied. This study, initiated by the Greeks and continued mainly by the French, was based on logic. It lacked a scientific approach and was detached from language itself. Its only aim was to give rules for distinguishing between correct and incorrect forms; it was a normative discipline, far removed from actual observation, and its scope was limited. -- Ferdinand de Saussure What is Natural Language Processing?
  • 11. Formal vs. Natural Languages Formal Languages - Strict, unchanging rules defined by grammars and parsed by regular expressions - Generally application specific (chemistry, math) - Literal: exactly what is said is meant. - No ambiguity - Parsable by regular expressions - Inflexible: no new terms or meaning. Natural Languages - Flexible, evolving language that occurs naturally in human communication - Unspecific and used in many domains and applications - Redundant and verbose in order to make up for ambiguity - Expressive - Difficult to parse - Very flexible even in narrow contexts
  • 12. Computer science has traditionally focused on formal languages.
  • 13. tokens Who has written a compiler or interpreter? Lexical Analysis Syntactic Analysis “Lexing” “Parsing” Regular Grammar (Word Rules) Context Free Grammar (Sentence Rules) string tree
  • 14. Syntax Errors are your friend! http://bbc.in/23pSPRq
  • 15. However, ambiguity is required for understanding when communicating between people with diverse experience.
  • 16. Most of the time. http://bit.ly/1XlOtHv
  • 17. Natural Language Processing requires flexibility, which generally comes from machine learning.
  • 18. Language models are used for flexibility
  • 19. “There was a ton of traffic on the beltway so I was ________.” “At beach we watched the ________.” “Watch out for that ________!” Intuition: Language is Predictable (if flexible)
  • 20. These models predict relationships between tokens. crab beach steamed boat ocean sand seafood
  • 21. But they can’t understand meaning (yet)
  • 22. So please keep in mind: Tokens != Words - Substrings - Only structural - Data “bearing” “shouldn’t” - Objects - Contains a “sense” - Meaning to bear.verb-1 should.auxverb-3 not.adverb-1
  • 23. The State of the Art - Academic design for use alongside intelligent agents (AI discipline) - Relies on formal models or representations of knowledge & language - Models are adapted and augmented through probabilistic methods and machine learning. - A small number of algorithms comprise the standard framework.
  • 24. Traditional NLP Applications - Summarization - Reference Resolution - Machine Translation - Language Generation - Language Understanding - Document Classification - Author Identification - Part of Speech Tagging - Question Answering - Information Extraction - Information Retrieval - Speech Recognition - Sense Disambiguation - Topic Recognition - Relationship Detection - Named Entity Recognition
  • 25. What is Required? Domain Knowledge A Corpus in the Domain
  • 27. The study of the forms of things, words in particular. Consider pluralization for English: ● Orthographic Rules: puppy → puppies ● Morphological Rules: goose → geese or fish Major parsing tasks: stemming, lemmatization and tokenization. Morphology
  • 28. The study of the rules for the formation of sentences. Major tasks: chunking, parsing, feature parsing, grammars Syntax
  • 29. Semantics The study of meaning. - I see what I eat. - I eat what I see. - He poached salmon. Major Tasks Frame extraction, creation of TMRs
  • 30. Thematic Meaning Representation “The man hit the building with the baseball bat” { "subject": {"text": "the man", "sense": human-agent}, "predicate": {"text": "hit", "sense": strike-physical-force}, "object": {"text": "the building", "sense": habitable-structure}, "instrument": {"text": "with the baseball bat" "sense": sports-equipment} }
  • 31. Recent NLP Applications - Yelp Insights - Winning Jeopardy! IBM Watson - Computer assisted medical coding (3M Health Information Systems) - Geoparsing -- CALVIN (built by Charlie Greenbacker) - Author Identification (classification/clustering) - Sentiment Analysis (RTNNs, classification) - Language Detection - Event Detection - Google Knowledge Graph - Named Entity Recognition and Classification - Machine Translation
  • 32. Applications are BIG data - Examples are easier to create than rules. - Rules and logic miss frequency and language dynamics - More data is better for machine learning, relevance is in the long tail - Knowledge engineering is not scalable - Computational linguistics methodologies are stochastic
  • 34. What is NLTK? - Python interface to over 50 corpora and lexical resources - Focus on Machine Learning with specific domain knowledge - Free and Open Source - Numpy and Scipy under the hood - Fast and Formal
  • 35. What is NLTK? Suite of libraries for a variety of academic text processing tasks: - tokenization, stemming, tagging, - chunking, parsing, classification, - language modeling, logical semantics Pedagogical resources for teaching NLP theory in Python ...
  • 36. Who Wrote NLTK? Steven Bird Associate Professor University of Melbourne Senior Research Associate, LDC Ewan Klein Professor of Language Technology University of Edinburgh.
  • 37. Batteries Included NLTK = Corpora + Algorithms Ready for Research!
  • 38. What is NLTK not? - Production ready out of the box* - Lightweight - Generally applicable - Magic *There are actually a few things that are production ready right out of the box.
  • 39. The Good - Preprocessing - segmentation, tokenization, PoS tagging - Word level processing - WordNet, Lemmatization, Stemming, NGram - Utilities - Tree, FreqDist, ConditionalFreqDist - Streaming CorpusReader objects - Classification - Maximum Entropy, Naive Bayes, Decision Tree - Chunking, Named Entity Recognition - Parsers Galore! - Languages Galore!
  • 40. The Bad - Syntactic Parsing - No included grammar (not a black box) - Feature/Dependency Parsing - No included feature grammar - The sem package - Toy only (lambda-calculus & first order logic) - Lots of extra stuff - papers, chat programs, alignments, etc.
  • 41. Other Python NLP Libraries - TextBlob - SpaCy - Scikit-Learn - Pattern - gensim - MITIE - guess_language - Python wrapper for Stanford CoreNLP - Python wrapper for Berkeley Parser - readability-lxml - BeautifulSoup
  • 42. NLTK Demo - Working with Included Corpora - Segmentation - Tokenization - Tagging - A Parsing Exercise - Named Entity RecognitionWorking with Text
  • 44. Machine learning uses instances (examples) of data to fit a parameterized model which is used to make predictions concerning new instances.
  • 45. In text analysis, what are the instances?
  • 46. Instances = Documents (no matter their size)
  • 47. A corpus is a collection of documents to learn about. (labeled or unlabeled)
  • 48. Features describe instances in a way that machines can learn on by putting them into feature space.
  • 49. Document Features Document level features - Metadata: title, author - Paragraphs - Sentence construction Word level features - Vocabulary - Form (capitalization) - Frequency Document Paragraphs Sentences Tokens
  • 50. Vector Encoding - Basic representation of documents: a vector whose length is equal to the vocabulary of the entire corpus. - Word positions in the vector are based on lexicographic order. The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio. at bat can door echolocationelephant of open potato see she sightsneeze studio the to viaw onder
  • 51. - One of the simplest models: compute the frequency of words in the document and use those numbers as the vector encoding. 0 at 2 bat 1 can 0 door 1 echolocation 0 elephant 0 of 0 open 0 potato 2 see 0 she 1sight 1 sneeze 0 studio 1 the 0 to 1 via 0 w onder Bag of Words: Token Frequency The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio.
  • 52. One Hot Encoding - The feature vector encodes the vocabulary of the document. - All words are equally distant, so must reduce word forms. - Usually used for artificial neural network models. The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio. 1 at 0 bat 0 can 0 door 0 echolocation 1 elephant 0 of 0 open 1 potato 0 see 0 she 1sight 1 sneeze 0 studio 1 the 0 to 0 via 0 w onder
  • 53. TF-IDF Encoding - Highlight terms that are very relevant to a document relative to the rest of the corpus by computing the term frequency times the inverse document frequency of the term. The elephant sneezed at the sight of potatoes. Bats can see via echolocation. See the bat sight sneeze! Wondering, she opened the door to the studio. 0 at 0 bat 0 can .3 door 0 echolocation 0 elephant 0 of .3 open 0 potato 0 see .3 she 0sight 0 sneeze .4 studio 0 the 0 to 0 via .3 w onder
  • 54. Pros and Cons of Vector Encoding Pros - Machine learning requires a vector anyway. - Can embed complex representations like TF-IDF into the vector form. - Drives towards token- concept mapping without rules. Cons - The vectors have lots of columns (high dimension) - Word order, grammar, and other structural features are natively lost. - Difficult to add knowledge to learning process.
  • 55. In the end, much of the work for language aware applications comes from domain specific feature analysis; not just simple vectorization.
  • 56. Classification and Clustering Classification - Supervised ML - Requires pre-labeled corpus of documents - Sentiment Analysis - Models: - Naive Bayes - Maximum Entropy Clustering - Unsupervised ML - Groups similar documents together. - Topic Modeling - Models: - LDA - NNMF
  • 58. Topic Modeling Pipeline Training Text, Documents Feature Vectors Clustering Algorithm New Document Feature Vector Topic A Topic B Topic C Similarity
  • 59. These two fundamental techniques are the basis of most NLP. It’s all about designing the problem correctly.
  • 60. Consider an automatic question answering system: how might you architect the application?
  • 61. Production Grade NLP NLTK TextBlob - Access to LDA model - Good TF-IDF Modeling - Use word2vec - Text processing - Lexical resources - Access to WordNet - More model families - Faster implementations - Pipelines for NLP
  • 62. Our Task Build a system that ingests raw language data and transforms it into a suitable representation for creating revolutionary applications.
  • 63. Workshop Building language aware data products - Reading corpora for analysis - Feature Extraction - Document Classification - Topic Modeling (Clustering)