SlideShare une entreprise Scribd logo
1  sur  47
Télécharger pour lire hors ligne
Crash Course in
Natural Language Processing
Vsevolod Dyomkin
08/2015
A Bit about Me
* Lisp programmer
* Research lead at Grammarly
(4 years of practical NLP work)
* Teacher at KPI: Operating Systems course
https://vseloved.github.io
A Bit about Grammarly
The best English language writing app
Spellcheck - Grammar check - Style
improvement - Synonyms and word choice
Plagiarism check
Plan
* Overview of NLP
* Where to get Data
* Common NLP problems
and approaches
* How to develop an NLP
system
What Is NLP?
Transforming free-form text
into structured data and back
What Is NLP?
Transforming free-form text
into structured data and back
Intersection of:
* Computational Linguistics
* CompSci & AI
* Stats & Information Theory
Linguistic Basis
* Syntax (form)
* Semantics (meaning)
* Pragmatics (intent/logic)
Natural Language
* ambiguous
* noisy
* evolving
Time flies like an arrow.
Fruit flies like a banana.
I read a story about evolution in ten minutes.
I read a story about evolution in the last million years.
NLP & Data
Types of text data:
* structured
* semi-structured
* unstructured
“Data is ten times more
powerful than algorithms.”
-- Peter Norvig
The Unreasonable Effectiveness of Data.
http://youtu.be/yvDCzhbjYWs
Kinds of Data
* Dictionaries
* Databases/Ontologies
* Corpora
* User Data
Where to Get Data?
* Linguistic Data Consortium
http://www.ldc.upenn.edu/
* Common Crawl
* Wikimedia
* Wordnet
* APIs: Twitter, Wordnik, ...
* University sites &
the academic community:
Stanford, Oxford, CMU, ...
Create Your Own!
* Linguists
* Crowdsourcing
* By-product
-- Johnatahn Zittrain
http://goo.gl/hs4qB
Classic NLP Problems
* Linguistically-motivated:
segmentation, tagging, parsing
* Analytical:
classification, sentiment analysis
* Transformation:
translation, correction, generation
* Conversation:
question answering, dialog
Tokenization
Example:
This is a test that isn't so simple: 1.23.
"This" "is" "a" "test" "that" "is" "n't"
"so" "simple" ":" "1.23" "."
Issues:
* Finland’s capital -
Finland Finlands Finland’s
* what’re, I’m, isn’t -
what ’re, I ’m, is n’t
* Hewlett-Packard or Hewlett Packard
* San Francisco - one token or two?
* m.p.h., PhD.
Regular Expressions
Simplest regex: [^s]+
More advanced regex:
w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|]⟨⟩ ‒–—
«»“”‘’-]―
Even more advanced regex:
[+-]?[0-9](?:[0-9,.]*[0-9])?
|[w@](?:[w'’`@-][w']|[w'][w@'’`-])*[w']?
|["#$%&*+,/:;<=>@^`~…() {}[|] «»“”‘’']⟨⟩ ‒–—―
|[.!?]+
|-+
Post-processing
* concatenate abbreviations and decimals
* split contractions with regexes
2-character:
i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they)
['‘’`]d$
3-character:
(?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
Rule-based Approach
* easy to understand and
reason about
* can be arbitrarily precise
* iterative, can be used to
gather more data
Limitations:
* recall problems
* poor adaptability
Rule-based NLP tools
* SpamAssasin
* LanguageTool
* ELIZA
* GATE
Statistical Approach
“Probability theory
is nothing but
common sense
reduced to calculation.”
-- Pierre-Simon Laplace
Language Models
Question: what is the probability of a
sequence of words/sentence?
Language Models
Question: what is the probability of a
sequence of words/sentence?
Answer: Apply the chain rule
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w0 w1 w2) * …
where S = w0 w1 w2 …
Ngrams
Apply Markov assumption: each word depends
only on N previous words (in practice
N=1..4 which results in bigrams-fivegrams,
because we include the current word also).
If n=2:
P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1)
* P(w3|w1 w2) * …
According to the chain rule:
P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
Spelling Correction
Problem: given an out-of-dictionary
word return a list of most probable
in-dictionary corrections.
http://norvig.com/spell-correct.html
Edit Distance
Minimum-edit (Levenstein) distance the–
minimum number of
insertions/deletions/substitutions needed
to transform string A into B.
Other distance metrics:
* the Damerau-Levenstein distance adds
another operation: transposition
* the longest common subsequence (LCS)
metric allows only insertion and deletion,
not substitution
* the Hamming distance allows only
substitution, hence, it only applies to
strings of the same length
Dynamic Programming
Initialization:
D(i,0) = i
D(0,j) = j
Recurrence relation:
For each i = 1..M
For each j = 1..N
D(i,j) = D(i-1,j-1), if X(i) = Y(j)
otherwise:
min D(i-1,j) + w_del(Y(j))
D(i,j-1) + w_ins(X(i))
D(i-1,j-1) + w_subst(X(i),Y(j))
Noisy Channel Model
Given an alphabet A, let A* be the set of all finite
strings over A. Let the dictionary D of valid words be
some subset of A*.
The noisy channel is the matrix G = P(s|w) where w in D is
the intended word and s in A* is the scrambled word that
was actually received.
P(s|w) = sum(P(x(i)|y(i)))
for x(i) in s* (s aligned with w)
for y(i) in w* (w aligned with s)
Machine Learning
Approach
Spam Filtering
A 2-class classification problem with a
bias towards minimizing FPs.
Default approach: rule-based (SpamAssassin)
Problems:
* scales poorly
* hard to reach arbitrary precision
* hard to rank the importance of
complex features?
Bag-of-words Models
* each word is a feature
* each word is independent of others
* position of the word in a sentence is irrelevant
Pros:
* simple
* fast
* scalable
Limitations:
* independence assumption doesn't hold
Initial results: recall: 92%, precision: 98.84%
Improved results: recall: 99.5%, precision: 99.97%
http://www.paulgraham.com/spam.html
Naive Bayes
Classifier
P(Y|X) = P(Y) * P(X|Y) / P(X)
select Y = argmax P(Y|x)
Naive step:
P(Y|x) = P(Y) * prod(P(x|Y))
for all x in X
(P(x) is marginalized out because it's the
same for all Y)
Dependency Parsing
nsubj(ate-2, They-1)
root(ROOT-0, ate-2)
det(pizza-4, the-3)
dobj(ate-2, pizza-4)
prep(ate-2, with-5)
pobj(with-5, anchovies-6)
https://honnibal.wordpress.com/2013/12/18/a-simple-fas
t-algorithm-for-natural-language-dependency-parsing/
Shift-reduce Parsing
Shift-reduce Parsing
ML-based Parsing
The parser starts with an empty stack, and a buffer index at 0, with no
dependencies recorded. It chooses one of the valid actions, and applies it to
the state. It continues choosing actions and applying them until the stack is
empty and the buffer index is at the end of the input.
SHIFT = 0; RIGHT = 1; LEFT = 2
MOVES = [SHIFT, RIGHT, LEFT]
def parse(words, tags):
n = len(words)
deps = init_deps(n)
idx = 1
stack = [0]
while stack or idx < n:
features = extract_features(words, tags, idx, n, stack, deps)
scores = score(features)
valid_moves = get_valid_moves(i, n, len(stack))
next_move = max(valid_moves, key=lambda move: scores[move])
idx = transition(next_move, idx, stack, parse)
return tags, parse
Averaged Perceptron
def train(model, number_iter, examples):
for i in range(number_iter):
for features, true_tag in examples:
guess = model.predict(features)
if guess != true_tag:
for f in features:
model.weights[f][true_tag] += 1
model.weights[f][guess] -= 1
random.shuffle(examples)
Features
* Word and tag unigrams, bigrams, trigrams
* The first three words of the buffer
* The top three words of the stack
* The two leftmost children of the top of
the stack
* The two rightmost children of the top of
the stack
* The two leftmost children of the first
word in the buffer
* Distance between top of buffer and stack
Discriminative ML
Models
Linear:
* (Averaged) Perceptron
* Maximum Entropy / LogLinear / Logistic
Regression; Conditional Random Field
* SVM
Non-linear:
* Decision Trees, Random Forests
* Other ensemble classifiers
* Neural networks
Semantics
Question: how to model relationships
between words?
Semantics
Question: how to model relationships
between words?
Answer: build a graph
Wordnet
Freebase
DBPedia
Word Similarity
Next question: now, how do we measure those
relations?
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
Word Similarity
Next question: now, how do we measure those
relations?
* different Wordnet similarity measures
* PMI(x,y) = log(p(x,y) / p(x) * p(y))
Distributional
Semantics
Distributional hypothesis:
"You shall know a word by
the company it keeps"
--John Rupert Firth
Word representations:
* Explicit representation
Number of nonzero dimensions:
max:474234, min:3, mean:1595, median:415
* Dense representation (word2vec, GloVe)
* Hierarchical representation
(Brown clustering)
Steps to Develop
an NLP System
* Translate real-world requirements
into a measurable goal
* Find a suitable level and
representation
* Find initial data for experiments
* Find and utilize existing tools and
Frameworks where possible
* Don't trust research results
* Setup and perform a proper
experiment (series of experiments)
Going into Prod
* NLP tasks are usually CPU-intensive
but stateless
* General-purpose NLP frameworks are
(mostly) not production-ready
* Value pre- and post- processing
* Gather user feedback
Final Words
We have discussed:
* linguistic basis of NLP
- although some people manage to do NLP
without it:
http://arxiv.org/pdf/1103.0398.pdf
* rule-based & statistical/ML approaches
* different concrete tasks
We haven't covered:
* all the different tasks, such as MT,
question answering, etc.
(but they use the same technics)
* deep learning for NLP
* natural language understanding
(which remains an unsolved problem)

Contenu connexe

Tendances

Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationssChandan Deb
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extractionGabriel Hamilton
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionFlorian Leitner
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLPBill Liu
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsBenjamin Habegger
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - IJaganadh Gopinadhan
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...KozoChikai
 
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedAINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedLidia Pivovarova
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
AINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, NikolenkoAINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, NikolenkoLidia Pivovarova
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingFlorian Leitner
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and ChallengesJens Lehmann
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: SummarizationMarina Santini
 
Overview of text mining and NLP (+software)
Overview of text mining and NLP (+software)Overview of text mining and NLP (+software)
Overview of text mining and NLP (+software)Florian Leitner
 

Tendances (20)

Language models
Language modelsLanguage models
Language models
 
AINL 2016: Kravchenko
AINL 2016: KravchenkoAINL 2016: Kravchenko
AINL 2016: Kravchenko
 
Open nlp presentationss
Open nlp presentationssOpen nlp presentationss
Open nlp presentationss
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Natural language processing: feature extraction
Natural language processing: feature extractionNatural language processing: feature extraction
Natural language processing: feature extraction
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 
Feature Engineering for NLP
Feature Engineering for NLPFeature Engineering for NLP
Feature Engineering for NLP
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
Elements of Text Mining Part - I
Elements of Text Mining Part - IElements of Text Mining Part - I
Elements of Text Mining Part - I
 
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
Analysis of Similarity Measures between Short Text for the NTCIR-12 Short Tex...
 
AINL 2016: Maraev
AINL 2016: MaraevAINL 2016: Maraev
AINL 2016: Maraev
 
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedAINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
AINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, NikolenkoAINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, Nikolenko
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String Processing
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and Challenges
 
Lecture: Summarization
Lecture: SummarizationLecture: Summarization
Lecture: Summarization
 
Overview of text mining and NLP (+software)
Overview of text mining and NLP (+software)Overview of text mining and NLP (+software)
Overview of text mining and NLP (+software)
 

En vedette

NLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationNLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationVsevolod Dyomkin
 
Lisp for Python Programmers
Lisp for Python ProgrammersLisp for Python Programmers
Lisp for Python ProgrammersVsevolod Dyomkin
 
Current state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyCurrent state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyiaemedu
 
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная оберткаVsevolod Dyomkin
 
Tedxkyiv communication guidelines
Tedxkyiv communication guidelinesTedxkyiv communication guidelines
Tedxkyiv communication guidelinesVsevolod Dyomkin
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхVsevolod Dyomkin
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Vsevolod Dyomkin
 
악평생성기 (Bad Comment Generator using RNN) _ 송치성
악평생성기 (Bad Comment Generator using RNN) _ 송치성악평생성기 (Bad Comment Generator using RNN) _ 송치성
악평생성기 (Bad Comment Generator using RNN) _ 송치성Chisung Song
 
시나브로 배우는 자연어처리 바벨피쉬 송치성
시나브로 배우는 자연어처리 바벨피쉬 송치성시나브로 배우는 자연어처리 바벨피쉬 송치성
시나브로 배우는 자연어처리 바벨피쉬 송치성Chisung Song
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common LispVsevolod Dyomkin
 
Sugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturySugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturyVsevolod Dyomkin
 

En vedette (20)

NLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language IdentificationNLP in the WILD or Building a System for Text Language Identification
NLP in the WILD or Building a System for Text Language Identification
 
Lisp for Python Programmers
Lisp for Python ProgrammersLisp for Python Programmers
Lisp for Python Programmers
 
Intro to NLP. Lecture 2
Intro to NLP.  Lecture 2Intro to NLP.  Lecture 2
Intro to NLP. Lecture 2
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
L3 v2
L3 v2L3 v2
L3 v2
 
Current state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a studyCurrent state of the art pos tagging for indian languages – a study
Current state of the art pos tagging for indian languages – a study
 
Ignat vita artur
Ignat vita arturIgnat vita artur
Ignat vita artur
 
NLP
NLPNLP
NLP
 
Lisp как универсальная обертка
Lisp как универсальная оберткаLisp как универсальная обертка
Lisp как универсальная обертка
 
Tedxkyiv communication guidelines
Tedxkyiv communication guidelinesTedxkyiv communication guidelines
Tedxkyiv communication guidelines
 
Новые нереляционные системы хранения данных
Новые нереляционные системы хранения данныхНовые нереляционные системы хранения данных
Новые нереляционные системы хранения данных
 
Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?Чему мы можем научиться у Lisp'а?
Чему мы можем научиться у Lisp'а?
 
Lisp Machine Prunciples
Lisp Machine PrunciplesLisp Machine Prunciples
Lisp Machine Prunciples
 
악평생성기 (Bad Comment Generator using RNN) _ 송치성
악평생성기 (Bad Comment Generator using RNN) _ 송치성악평생성기 (Bad Comment Generator using RNN) _ 송치성
악평생성기 (Bad Comment Generator using RNN) _ 송치성
 
CL-NLP
CL-NLPCL-NLP
CL-NLP
 
시나브로 배우는 자연어처리 바벨피쉬 송치성
시나브로 배우는 자연어처리 바벨피쉬 송치성시나브로 배우는 자연어처리 바벨피쉬 송치성
시나브로 배우는 자연어처리 바벨피쉬 송치성
 
Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
Экосистема Common Lisp
Экосистема Common LispЭкосистема Common Lisp
Экосистема Common Lisp
 
Ngrams smoothing
Ngrams smoothingNgrams smoothing
Ngrams smoothing
 
Sugaring Lisp for the 21st Century
Sugaring Lisp for the 21st CenturySugaring Lisp for the 21st Century
Sugaring Lisp for the 21st Century
 

Similaire à Crash-course in Natural Language Processing

Dsm as theory building
Dsm as theory buildingDsm as theory building
Dsm as theory buildingClarkTony
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learningtelss09
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.pptbutest
 
Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)Eran Yahav
 
Using Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPUUsing Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPUSkills Matter
 
Fosdem 2013 petra selmer flexible querying of graph data
Fosdem 2013 petra selmer   flexible querying of graph dataFosdem 2013 petra selmer   flexible querying of graph data
Fosdem 2013 petra selmer flexible querying of graph dataPetra Selmer
 
Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Avelin Huo
 
Declare Your Language: Syntax Definition
Declare Your Language: Syntax DefinitionDeclare Your Language: Syntax Definition
Declare Your Language: Syntax DefinitionEelco Visser
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineeringjtdudley
 
INTRODUCTION AND HISTORY OF R PROGRAMMING.pdf
INTRODUCTION AND HISTORY OF R PROGRAMMING.pdfINTRODUCTION AND HISTORY OF R PROGRAMMING.pdf
INTRODUCTION AND HISTORY OF R PROGRAMMING.pdfranapoonam1
 
Introduction to Python , Overview
Introduction to Python , OverviewIntroduction to Python , Overview
Introduction to Python , OverviewNB Veeresh
 
Five Languages in a Moment
Five Languages in a MomentFive Languages in a Moment
Five Languages in a MomentSergio Gil
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMassimo Schenone
 
Word embeddings
Word embeddingsWord embeddings
Word embeddingsShruti kar
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)Eelco Visser
 
Towards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositoriesTowards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositoriesValentina Paunovic
 

Similaire à Crash-course in Natural Language Processing (20)

Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
Dsm as theory building
Dsm as theory buildingDsm as theory building
Dsm as theory building
 
Language Technology Enhanced Learning
Language Technology Enhanced LearningLanguage Technology Enhanced Learning
Language Technology Enhanced Learning
 
Python slide
Python slidePython slide
Python slide
 
Moore_slides.ppt
Moore_slides.pptMoore_slides.ppt
Moore_slides.ppt
 
Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)Programming with Millions of Examples (HRL)
Programming with Millions of Examples (HRL)
 
Using Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPUUsing Language Oriented Programming to Execute Computations on the GPU
Using Language Oriented Programming to Execute Computations on the GPU
 
Fosdem 2013 petra selmer flexible querying of graph data
Fosdem 2013 petra selmer   flexible querying of graph dataFosdem 2013 petra selmer   flexible querying of graph data
Fosdem 2013 petra selmer flexible querying of graph data
 
Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03
 
Declare Your Language: Syntax Definition
Declare Your Language: Syntax DefinitionDeclare Your Language: Syntax Definition
Declare Your Language: Syntax Definition
 
Tips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software EngineeringTips And Tricks For Bioinformatics Software Engineering
Tips And Tricks For Bioinformatics Software Engineering
 
INTRODUCTION AND HISTORY OF R PROGRAMMING.pdf
INTRODUCTION AND HISTORY OF R PROGRAMMING.pdfINTRODUCTION AND HISTORY OF R PROGRAMMING.pdf
INTRODUCTION AND HISTORY OF R PROGRAMMING.pdf
 
Introduction to Python , Overview
Introduction to Python , OverviewIntroduction to Python , Overview
Introduction to Python , Overview
 
Five Languages in a Moment
Five Languages in a MomentFive Languages in a Moment
Five Languages in a Moment
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
Getting started with R
Getting started with RGetting started with R
Getting started with R
 
MACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSISMACHINE-DRIVEN TEXT ANALYSIS
MACHINE-DRIVEN TEXT ANALYSIS
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
Declare Your Language (at DLS)
Declare Your Language (at DLS)Declare Your Language (at DLS)
Declare Your Language (at DLS)
 
Towards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositoriesTowards advanced data retrieval from learning objects repositories
Towards advanced data retrieval from learning objects repositories
 

Dernier

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 

Crash-course in Natural Language Processing

  • 1. Crash Course in Natural Language Processing Vsevolod Dyomkin 08/2015
  • 2. A Bit about Me * Lisp programmer * Research lead at Grammarly (4 years of practical NLP work) * Teacher at KPI: Operating Systems course https://vseloved.github.io
  • 3. A Bit about Grammarly The best English language writing app Spellcheck - Grammar check - Style improvement - Synonyms and word choice Plagiarism check
  • 4. Plan * Overview of NLP * Where to get Data * Common NLP problems and approaches * How to develop an NLP system
  • 5. What Is NLP? Transforming free-form text into structured data and back
  • 6. What Is NLP? Transforming free-form text into structured data and back Intersection of: * Computational Linguistics * CompSci & AI * Stats & Information Theory
  • 7. Linguistic Basis * Syntax (form) * Semantics (meaning) * Pragmatics (intent/logic)
  • 9. Time flies like an arrow. Fruit flies like a banana. I read a story about evolution in ten minutes. I read a story about evolution in the last million years.
  • 10. NLP & Data Types of text data: * structured * semi-structured * unstructured “Data is ten times more powerful than algorithms.” -- Peter Norvig The Unreasonable Effectiveness of Data. http://youtu.be/yvDCzhbjYWs
  • 11. Kinds of Data * Dictionaries * Databases/Ontologies * Corpora * User Data
  • 12. Where to Get Data? * Linguistic Data Consortium http://www.ldc.upenn.edu/ * Common Crawl * Wikimedia * Wordnet * APIs: Twitter, Wordnik, ... * University sites & the academic community: Stanford, Oxford, CMU, ...
  • 13. Create Your Own! * Linguists * Crowdsourcing * By-product -- Johnatahn Zittrain http://goo.gl/hs4qB
  • 14. Classic NLP Problems * Linguistically-motivated: segmentation, tagging, parsing * Analytical: classification, sentiment analysis * Transformation: translation, correction, generation * Conversation: question answering, dialog
  • 15. Tokenization Example: This is a test that isn't so simple: 1.23. "This" "is" "a" "test" "that" "is" "n't" "so" "simple" ":" "1.23" "." Issues: * Finland’s capital - Finland Finlands Finland’s * what’re, I’m, isn’t - what ’re, I ’m, is n’t * Hewlett-Packard or Hewlett Packard * San Francisco - one token or two? * m.p.h., PhD.
  • 16. Regular Expressions Simplest regex: [^s]+ More advanced regex: w+|[!"#$%&'*+,./:;<=>?@^`~…() {}[|]⟨⟩ ‒–— «»“”‘’-]― Even more advanced regex: [+-]?[0-9](?:[0-9,.]*[0-9])? |[w@](?:[w'’`@-][w']|[w'][w@'’`-])*[w']? |["#$%&*+,/:;<=>@^`~…() {}[|] «»“”‘’']⟨⟩ ‒–—― |[.!?]+ |-+
  • 17. Post-processing * concatenate abbreviations and decimals * split contractions with regexes 2-character: i['‘’`]m|(?:s?he|it)['‘’`]s|(?:i|you|s?he|we|they) ['‘’`]d$ 3-character: (?:i|you|s?he|we|they)['‘’`](?:ll|[vr]e)|n['‘’`]t$
  • 18. Rule-based Approach * easy to understand and reason about * can be arbitrarily precise * iterative, can be used to gather more data Limitations: * recall problems * poor adaptability
  • 19. Rule-based NLP tools * SpamAssasin * LanguageTool * ELIZA * GATE
  • 20. Statistical Approach “Probability theory is nothing but common sense reduced to calculation.” -- Pierre-Simon Laplace
  • 21. Language Models Question: what is the probability of a sequence of words/sentence?
  • 22. Language Models Question: what is the probability of a sequence of words/sentence? Answer: Apply the chain rule P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w0 w1 w2) * … where S = w0 w1 w2 …
  • 23. Ngrams Apply Markov assumption: each word depends only on N previous words (in practice N=1..4 which results in bigrams-fivegrams, because we include the current word also). If n=2: P(S) = P(w0) * P(w1|w0) * P(w2|w0 w1) * P(w3|w1 w2) * … According to the chain rule: P(w2|w0 w1) = P(w0 w1 w2) / P(w0 w1)
  • 24. Spelling Correction Problem: given an out-of-dictionary word return a list of most probable in-dictionary corrections. http://norvig.com/spell-correct.html
  • 25. Edit Distance Minimum-edit (Levenstein) distance the– minimum number of insertions/deletions/substitutions needed to transform string A into B. Other distance metrics: * the Damerau-Levenstein distance adds another operation: transposition * the longest common subsequence (LCS) metric allows only insertion and deletion, not substitution * the Hamming distance allows only substitution, hence, it only applies to strings of the same length
  • 26. Dynamic Programming Initialization: D(i,0) = i D(0,j) = j Recurrence relation: For each i = 1..M For each j = 1..N D(i,j) = D(i-1,j-1), if X(i) = Y(j) otherwise: min D(i-1,j) + w_del(Y(j)) D(i,j-1) + w_ins(X(i)) D(i-1,j-1) + w_subst(X(i),Y(j))
  • 27. Noisy Channel Model Given an alphabet A, let A* be the set of all finite strings over A. Let the dictionary D of valid words be some subset of A*. The noisy channel is the matrix G = P(s|w) where w in D is the intended word and s in A* is the scrambled word that was actually received. P(s|w) = sum(P(x(i)|y(i))) for x(i) in s* (s aligned with w) for y(i) in w* (w aligned with s)
  • 29. Spam Filtering A 2-class classification problem with a bias towards minimizing FPs. Default approach: rule-based (SpamAssassin) Problems: * scales poorly * hard to reach arbitrary precision * hard to rank the importance of complex features?
  • 30. Bag-of-words Models * each word is a feature * each word is independent of others * position of the word in a sentence is irrelevant Pros: * simple * fast * scalable Limitations: * independence assumption doesn't hold Initial results: recall: 92%, precision: 98.84% Improved results: recall: 99.5%, precision: 99.97% http://www.paulgraham.com/spam.html
  • 31. Naive Bayes Classifier P(Y|X) = P(Y) * P(X|Y) / P(X) select Y = argmax P(Y|x) Naive step: P(Y|x) = P(Y) * prod(P(x|Y)) for all x in X (P(x) is marginalized out because it's the same for all Y)
  • 32. Dependency Parsing nsubj(ate-2, They-1) root(ROOT-0, ate-2) det(pizza-4, the-3) dobj(ate-2, pizza-4) prep(ate-2, with-5) pobj(with-5, anchovies-6) https://honnibal.wordpress.com/2013/12/18/a-simple-fas t-algorithm-for-natural-language-dependency-parsing/
  • 35. ML-based Parsing The parser starts with an empty stack, and a buffer index at 0, with no dependencies recorded. It chooses one of the valid actions, and applies it to the state. It continues choosing actions and applying them until the stack is empty and the buffer index is at the end of the input. SHIFT = 0; RIGHT = 1; LEFT = 2 MOVES = [SHIFT, RIGHT, LEFT] def parse(words, tags): n = len(words) deps = init_deps(n) idx = 1 stack = [0] while stack or idx < n: features = extract_features(words, tags, idx, n, stack, deps) scores = score(features) valid_moves = get_valid_moves(i, n, len(stack)) next_move = max(valid_moves, key=lambda move: scores[move]) idx = transition(next_move, idx, stack, parse) return tags, parse
  • 36. Averaged Perceptron def train(model, number_iter, examples): for i in range(number_iter): for features, true_tag in examples: guess = model.predict(features) if guess != true_tag: for f in features: model.weights[f][true_tag] += 1 model.weights[f][guess] -= 1 random.shuffle(examples)
  • 37. Features * Word and tag unigrams, bigrams, trigrams * The first three words of the buffer * The top three words of the stack * The two leftmost children of the top of the stack * The two rightmost children of the top of the stack * The two leftmost children of the first word in the buffer * Distance between top of buffer and stack
  • 38. Discriminative ML Models Linear: * (Averaged) Perceptron * Maximum Entropy / LogLinear / Logistic Regression; Conditional Random Field * SVM Non-linear: * Decision Trees, Random Forests * Other ensemble classifiers * Neural networks
  • 39. Semantics Question: how to model relationships between words?
  • 40. Semantics Question: how to model relationships between words? Answer: build a graph Wordnet Freebase DBPedia
  • 41. Word Similarity Next question: now, how do we measure those relations?
  • 42. Word Similarity Next question: now, how do we measure those relations? * different Wordnet similarity measures
  • 43. Word Similarity Next question: now, how do we measure those relations? * different Wordnet similarity measures * PMI(x,y) = log(p(x,y) / p(x) * p(y))
  • 44. Distributional Semantics Distributional hypothesis: "You shall know a word by the company it keeps" --John Rupert Firth Word representations: * Explicit representation Number of nonzero dimensions: max:474234, min:3, mean:1595, median:415 * Dense representation (word2vec, GloVe) * Hierarchical representation (Brown clustering)
  • 45. Steps to Develop an NLP System * Translate real-world requirements into a measurable goal * Find a suitable level and representation * Find initial data for experiments * Find and utilize existing tools and Frameworks where possible * Don't trust research results * Setup and perform a proper experiment (series of experiments)
  • 46. Going into Prod * NLP tasks are usually CPU-intensive but stateless * General-purpose NLP frameworks are (mostly) not production-ready * Value pre- and post- processing * Gather user feedback
  • 47. Final Words We have discussed: * linguistic basis of NLP - although some people manage to do NLP without it: http://arxiv.org/pdf/1103.0398.pdf * rule-based & statistical/ML approaches * different concrete tasks We haven't covered: * all the different tasks, such as MT, question answering, etc. (but they use the same technics) * deep learning for NLP * natural language understanding (which remains an unsolved problem)