Instructor: Mat Leonard
Outline
1. Text Processing
Using Python + NLTK
Cleaning
Normalization
Tokenization
Part-of-speech Tagging
Stemming and Lemmatization
2. Feature Extraction
Bag of Words
TF-IDF
Word Embeddings
Word2Vec
GloVe
3. Topic Modeling
Latent Variables
Beta and Dirichlet Distributions
Latent Dirichlet Allocation
4. NLP with Deep Learning
Neural Networks
Recurrent Neural Networks (RNNs)
Word Embeddings
Sentiment Analysis with RNNs
19. Challenges: Context
The old Welshman came home toward daylight, spattered
with candle-grease, smeared with clay, and almost worn
out. He found Huck still in the bed that had been provided
for him, and delirious with fever. The physicians were all at
the cave, so the Widow Douglas came and took charge of
the patient.
—The Adventures of Tom Sawyer, Mark Twain
20. Process
“Mary went back home. …”  →  {<“mary”, “go”, “home”>, …}  →  {<0.4, 0.8, 0.3, 0.1, 0.7>, …}
Pipeline stages: Transform → Analyze → Predict → Present
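The transform step can be sketched in a few lines of NLTK. The following is a minimal, assumed example (not the course's exact code): it lowercases, tokenizes, drops punctuation and stopwords, then stems with the Porter stemmer, so the output uses stemmed forms like those on the later bag-of-words slides (the slide's “go” for “went” suggests lemmatization, which is omitted here).

```python
# A minimal sketch of the Transform step with NLTK (assumed, not the course's code).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def transform(text):
    tokens = word_tokenize(text.lower())                         # lowercase + tokenize
    tokens = [t for t in tokens if t.isalpha()]                  # drop punctuation
    tokens = [t for t in tokens if t not in stopwords.words("english")]
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]                     # normalize by stemming

print(transform("Mary went back home."))  # ['mari', 'went', 'back', 'home']
```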
34. Bag of Words
“Little House on the Prairie” {“littl”, “hous”, “prairi”}
“Mary had a Little Lamb” {“mari”, “littl”, “lamb”}
“The Silence of the Lambs” {“silenc”, “lamb”}
“Twinkle Twinkle Little Star” {“twinkl”, “littl”, “star”} ?
35. Bag of Words
corpus (D): “Little House on the Prairie”, “Mary had a Little Lamb”, “The Silence of the Lambs”, “Twinkle Twinkle Little Star”
vocabulary (V): littl, hous, prairi, mari, lamb, silenc, twinkl, star
36. Bag of Words
One row per document (“Little House on the Prairie”, “Mary had a Little Lamb”, “The Silence of the Lambs”, “Twinkle Twinkle Little Star”), one column per vocabulary term (littl, hous, prairi, mari, lamb, silenc, twinkl, star).
37. Bag of Words
Document-Term Matrix (term frequency):

                                  littl  hous  prairi  mari  lamb  silenc  twinkl  star
“Little House on the Prairie”       1     1      1      0     0      0       0      0
“Mary had a Little Lamb”            1     0      0      1     1      0       0      0
“The Silence of the Lambs”          0     0      0      0     1      1       0      0
“Twinkle Twinkle Little Star”       1     0      0      0     0      0       2      1
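As a sketch of how the matrix above could be built in plain Python (assumed, not the course's code), count each vocabulary term in each stemmed document:

```python
# A minimal bag-of-words sketch; the titles are assumed to be stemmed as on the slides.
docs = {
    "Little House on the Prairie": ["littl", "hous", "prairi"],
    "Mary had a Little Lamb":      ["mari", "littl", "lamb"],
    "The Silence of the Lambs":    ["silenc", "lamb"],
    "Twinkle Twinkle Little Star": ["twinkl", "twinkl", "littl", "star"],
}

vocab = ["littl", "hous", "prairi", "mari", "lamb", "silenc", "twinkl", "star"]

# One row per document, one column per vocabulary term (raw term frequency).
dtm = [[tokens.count(term) for term in vocab] for tokens in docs.values()]

for title, row in zip(docs, dtm):
    print(f"{title:32s} {row}")
# "Twinkle Twinkle Little Star" -> [1, 0, 0, 0, 0, 0, 2, 1]
```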
38. Document Similarity
                                    littl  hous  prairi  mari  lamb  silenc  twinkl  star
a  “Little House on the Prairie”      1     1      1      0     0      0       0      0
b  “Mary had a Little Lamb”           1     0      0      1     1      0       0      0

dot product: a·b = a0b0 + a1b1 + … + anbn = 1 + 0 + 0 + 0 + 0 + 0 + 0 + 0 = 1
39. Document Similarity
a = (1, 1, 1, 0, 0, 0, 0, 0)   “Little House on the Prairie”
b = (1, 0, 0, 1, 1, 0, 0, 0)   “Mary had a Little Lamb”

dot product: a·b = a0b0 + a1b1 + … + anbn = 1
cosine similarity: cos(θ) = a·b / (‖a‖·‖b‖) = 1 / (√3 × √3) = 1/3
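A small NumPy sketch (assumed, not from the slides) reproduces both numbers:

```python
# Dot product and cosine similarity between the two document vectors above.
import numpy as np

a = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # "Little House on the Prairie"
b = np.array([1, 0, 0, 1, 1, 0, 0, 0])  # "Mary had a Little Lamb"

dot = a @ b                                              # 1
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # 1 / (sqrt(3) * sqrt(3)) = 1/3

print(dot, cosine)  # 1 0.333...
```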
40. Term Specificity
                                  littl  hous  prairi  mari  lamb  silenc  twinkl  star
“Little House on the Prairie”       1     1      1      0     0      0       0      0
“Mary had a Little Lamb”            1     0      0      1     1      0       0      0
“The Silence of the Lambs”          0     0      0      0     1      1       0      0
“Twinkle Twinkle Little Star”       1     0      0      0     0      0       2      1
document frequency                  3     1      1      1     2      1       1      1

Each term count is then divided by that term's document frequency (littl by 3, lamb by 2, every other term by 1).
41. Term Specificity
                                  littl  hous  prairi  mari  lamb  silenc  twinkl  star
“Little House on the Prairie”      1/3    1      1      0     0      0       0      0
“Mary had a Little Lamb”           1/3    0      0      1    1/2     0       0      0
“The Silence of the Lambs”          0     0      0      0    1/2     1       0      0
“Twinkle Twinkle Little Star”      1/3    0      0      0     0      0       2      1
42. TF-IDF
tfidf(t, d, D) = tf(t, d) · idf(t, D)
term frequency: tf(t, d) = count(t, d) / |d|
inverse document frequency: idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|)
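A direct implementation of this formula (assumed, not the course's code) on the stemmed titles from the earlier slides:

```python
# tf(t, d) = count(t, d) / |d|,  idf(t, D) = log(|D| / |{d in D : t in d}|)
import math

def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)               # term frequency
    df = sum(1 for d in corpus if term in d)      # document frequency
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

corpus = [
    ["littl", "hous", "prairi"],
    ["mari", "littl", "lamb"],
    ["silenc", "lamb"],
    ["twinkl", "twinkl", "littl", "star"],
]

print(tfidf("twinkl", corpus[3], corpus))  # (2/4) * log(4/1) ~ 0.69
print(tfidf("littl",  corpus[3], corpus))  # (1/4) * log(4/3) ~ 0.07
```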
62. LDA: Parameter Estimation
• Choose |Z| = k (the number of topics).
• Initialize distribution parameters.
• Expectation: sample <document, words>.
• Maximization: update distribution parameters.
• Repeat the expectation and maximization steps until the parameters converge.
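In practice the estimation is usually handled by a library. Below is a minimal sketch (assumed, not the course's code) using scikit-learn's LatentDirichletAllocation, which fits the model by variational EM and also gives the topic mixture of a new document:

```python
# Fit LDA on a toy corpus and infer topic mixtures.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are friendly pets",
    "the stock market fell sharply today",
    "investors worry about the bond market",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)                 # document-term matrix

lda = LatentDirichletAllocation(n_components=2,  # choose |Z| = k = 2 topics
                                random_state=0)
doc_topics = lda.fit_transform(counts)           # per-document topic mixtures

print(doc_topics.round(2))

# Topic mixture of a new, unseen document: P(z | w)
new_doc = vec.transform(["cats and dogs on the mat"])
print(lda.transform(new_doc).round(2))
```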
63. LDA: Use Cases
• Topic modeling, document categorization.
• Mixture of topics in a new document: P(z | w, α, β)
• Generate collections of words with desired mixture.
64. LDA: Further Reading
David Blei, Andrew Ng, Michael Jordan, 2003. Latent Dirichlet Allocation,
In Journal of Machine Learning Research, vol. 3, pp. 993-1022.
Thomas Boggs, 2014. Visualizing Dirichlet Distributions with matplotlib.
115. One-hot encoding
What a great movie!
One column per vocabulary word (aardvark, …, a, …, great, …, movie, …, rhinoceros, cookie, figurine, …, what, …, zygote); the sentence becomes a binary vector with 1s in the columns for “a”, “great”, “movie”, and “what”:
1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
117. 2. Examining the data
What a great movie!
Each vocabulary word (aardvark, …, a, …, great, …, movie, …, rhinoceros, cookie, figurine, …, what, …, zygote) is assigned an integer index, and the sentence is replaced by the indices of its words (“a”, “great”, “movie”, “what”):
[1, 7, 15, 23]
118. 3. One-hot encoding the input
What a great movie!
The index list [1, 7, 15, 23] is expanded into a binary vector over the vocabulary, with a 1 at each listed index and 0 everywhere else:
1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0
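Both steps can be sketched in a few lines (assumed, not the course's code); the toy vocabulary and the resulting indices below are illustrative, not the slide's 25-word vocabulary:

```python
# Map each word to its vocabulary index, then expand indices into a one-hot row.
import numpy as np

vocab = ["aardvark", "a", "great", "movie", "rhinoceros",
         "cookie", "figurine", "what", "zygote"]          # toy vocabulary (assumed)
word_to_index = {word: i for i, word in enumerate(vocab)}

tokens = ["what", "a", "great", "movie"]                  # "What a great movie!" after cleaning
indices = [word_to_index[t] for t in tokens]              # step 2: words -> integer indices

one_hot = np.zeros(len(vocab), dtype=int)                 # step 3: indices -> binary vector
one_hot[indices] = 1

print(indices)   # [7, 1, 2, 3]
print(one_hot)   # [0 1 1 1 0 0 0 1 0]
```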
127. One-hot encoding
“I expected this movie to be much better”
“This movie is much better than I expected”
The two sentences use almost the same words (I, expected, this, movie, to, be, much, better, is, than), so their binary vocabulary vectors are nearly identical even though the sentences say different things; word order is not captured:
1 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0
1 0 1 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0
143. Word2Vec: Recap
• Robust, distributed representation.
• Vector size independent of vocabulary.
• Train once, store in lookup table.
• Deep learning ready!
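A minimal gensim sketch (assumed, not the course's code) that trains Word2Vec on a toy corpus and reads vectors back out of the lookup table:

```python
# Train Word2Vec and use the learned vectors as a lookup table.
from gensim.models import Word2Vec

sentences = [
    ["mary", "had", "a", "little", "lamb"],
    ["twinkle", "twinkle", "little", "star"],
    ["little", "house", "on", "the", "prairie"],
]

# vector_size is independent of the vocabulary size
# (gensim >= 4; older versions call this parameter `size`).
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

vec = model.wv["little"]                 # lookup-table access to a word vector
print(vec.shape)                         # (50,)
print(model.wv.most_similar("little"))   # nearest neighbours in vector space
```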
144. Word2Vec: Further Reading
Tomas Mikolov, et al., 2013. Distributed Representations of Words and
Phrases and their Compositionality, In Advances in Neural Information
Processing Systems (NIPS), pp. 3111-3119.
Adrian Colyer, 2016. The amazing power of word vectors.