This paper aims to develop an effective sentence model using a dynamic convolutional neural network (DCNN) architecture. The DCNN applies 1D convolutions and dynamic k-max pooling to capture syntactic and semantic information from sentences with varying lengths. This allows the model to relate phrases far apart in the input sentence and draw together important features. Experiments show the DCNN approach achieves strong performance on tasks like sentiment analysis of movie reviews and question type classification.
Convolutional Neural Network for Modelling Sentences
1. A Convolutional Neural Network for
Modelling Sentences
Presented By :
Anish Bhanushali
Anjani Jha
Mansi Goel
Authors :
Nal Kalchbrenner (University of Oxford)
Edward Grefenstette (University of Oxford)
Phil Blunsom (University of Oxford)
2. Objective :
This paper aims to develop an effective sentence model that analyses
and represents the semantic content of a sentence, using a dynamic
CNN architecture.
3. Word representation
The vast majority of rule-based and statistical NLP work regards words as atomic
symbols: hotel, conference, walk
One-hot Representation:
In vector-space terms, this is a vector with a single 1 and a lot of zeroes, e.g.
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
Problem with this representation:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] ANDed with the one-hot vector for hotel gives 0; any two distinct one-hot vectors are orthogonal, so this representation encodes no similarity between related words.
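The problem becomes concrete with a small sketch (hypothetical four-word vocabulary, not from the slides): the dot product of any two distinct one-hot vectors is zero, so the representation treats motel and hotel as being just as unrelated as motel and walk.

```python
# Minimal sketch: one-hot vectors carry no similarity information.
import numpy as np

vocab = ["hotel", "conference", "walk", "motel"]          # hypothetical tiny vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["motel"] @ one_hot["hotel"])   # 0.0 -> looks unrelated to "hotel"
print(one_hot["motel"] @ one_hot["walk"])    # 0.0 -> equally unrelated to "walk"
print(one_hot["motel"] @ one_hot["motel"])   # 1.0 -> only identical words match
```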
4. Distributional similarity based representations
You can get a lot of value by representing a word by means of its
neighbors
“You shall know a word by the company it keeps”
(J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
Example contexts of the word banking:
…government debt problems turning into banking crises as has happened in…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
These context words will represent banking
5. How to make neighbors represent words?
Answer: With a cooccurrence matrix X
Window based co-occurrence matrix
• Window around each word captures both syntactic (POS) and semantic
information
• Window length 1 (more common: 5 - 10)
• Symmetric (irrelevant whether left or right context)
6. Window based cooccurence matrix
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts      I   like  enjoy  deep  learning  NLP  flying  .
I           0   2     1      0     0         0    0       0
like        2   0     0      1     0         1    0       0
enjoy       1   0     0      0     0         0    1       0
deep        0   1     0      0     1         0    0       0
learning    0   0     0      1     0         0    0       1
NLP         0   1     0      0     0         0    0       1
flying      0   0     1      0     0         0    0       1
.           0   0     0      0     1         1    1       0
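As a sketch of how such a matrix is built (assuming window length 1 and symmetric context, as above; this is not the original lecture code), one can simply count neighbouring word pairs:

```python
# Minimal sketch: window-based co-occurrence counts for the slide's corpus.
from collections import defaultdict

corpus = [
    ["I", "like", "deep", "learning", "."],
    ["I", "like", "NLP", "."],
    ["I", "enjoy", "flying", "."],
]

counts = defaultdict(int)
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(w, sent[j])] += 1

print(counts[("I", "like")])    # 2, as in the matrix above
print(counts[("like", "NLP")])  # 1
```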
7. Problems with simple cooccurrence vectors
• Increase in size with vocabulary
• Very high dimensional: requires a lot of storage
• Subsequent classification models have sparsity issues
• Models are less robust
8. Solution: Low dimensional vectors
• Idea: store “most” of the important information in a fixed, small
number of dimensions: a dense vector
• Usually around 25 – 1000 dimensions
• How to reduce the dimensionality?
9. Method 1: Dimensionality Reduction on X
Singular Value Decomposition of the co-occurrence matrix X:

X = U S V^T

where X is n × m, U is n × r, S is an r × r diagonal matrix of singular values S1 ≥ S2 ≥ … ≥ Sr, and V^T is r × m.

Keeping only the k largest singular values (the first k columns of U and V and the top-left k × k block of S) gives

X_k = U_k S_k V_k^T

which is the best rank-k approximation to X in terms of least squares.
10. Simple SVD word vectors in Python
Corpus:
I like deep learning. I like NLP. I enjoy flying.
11. Simple SVD word vectors in Python
Corpus: I like deep learning. I like NLP. I enjoy flying.
Printing first two columns of U corresponding to the 2 biggest singular values
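The code shown on the original slide is not reproduced in this transcript; the following is a minimal sketch of the same idea, assuming the 8 × 8 co-occurrence matrix X from slide 6 and plain numpy:

```python
# Sketch: SVD of the co-occurrence matrix, printing the first two columns of U.
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# numpy returns singular values in descending order, so the first two columns
# of U correspond to the two biggest singular values.
for word, vec in zip(words, U[:, :2]):
    print(f"{word:10s} {vec}")
```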
12. Interesting syntactic patterns emerge in the vectors
[Figure: 2D projection of word vectors; inflectional variants of verbs (TAKE/TOOK/TAKEN/TAKING, SHOW/SHOWED/SHOWN/SHOWING, GROW/GREW/GROWN/GROWING, SPEAK/SPOKE/SPOKEN/SPEAKING, EAT/ATE/EATEN/EATING, CHOOSE/CHOSE/CHOSEN/CHOOSING, THROW/THREW/THROWN/THROWING, STEAL/STOLE/STOLEN/STEALING) cluster along consistent directions.]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
13. Interesting semantic patterns emerge in the vectors
[Figure: 2D projection of word vectors; semantically related pairs (DRIVE/DRIVER, TEACH/TEACHER, SWIM/SWIMMER, MARRY/BRIDE, PRAY/PRIEST, TREAT/DOCTOR, CLEAN/JANITOR, LEARN/STUDENT) are linked by similar offsets.]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
14. Problems with SVD
• Computational cost scales quadratically for an n × m matrix: O(mn²) flops (when n < m)
• Bad for millions of words or documents
• Hard to incorporate new words or documents
• Instead of capturing co-occurrence counts directly,
• Predict surrounding words of every word
• Faster and can easily incorporate a new sentence/ document or add a word to
the vocabulary
Idea: Directly learn low-dimensional word vectors
15. Main Idea of word2vec
• Instead of capturing cooccurrence counts directly,
• Predict surrounding words of every word
• Faster and can easily incorporate a new sentence/ document or add
a word to the vocabulary
16. Details of Word2Vec
• Predict surrounding words in a window of length m of every word.
• Objective function: maximize the log probability of any context word
given the current center word:
J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)
where θ represents all the variables we optimize.
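A toy sketch of one term of this objective (hypothetical dimensions and random vectors, not the original implementation): the probability p(o | c) is a softmax over the dot products of the center vector with every outside vector.

```python
# Sketch: log p(context | center) for the skip-gram objective.
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
v = rng.normal(size=(V, d))       # "center" word vectors
u = rng.normal(size=(V, d))       # "outside"/context word vectors

def log_p(o, c):
    scores = u @ v[c]             # u_w . v_c for every word w in the vocabulary
    return scores[o] - np.log(np.exp(scores).sum())

center, context = 3, 7
print(log_p(context, center))     # one term of the objective J(theta)
```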
17. Linear Relationships in word2vec
These representations are very good at encoding dimensions of similarity!
• Analogies testing dimensions of similarity can be solved quite well just by doing vector subtraction in the embedding space
Syntactically:
x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families
• Similarly for verb and adjective morphological forms
Semantically (SemEval 2012 Task 2):
• x_shirt − x_clothing ≈ x_chair − x_furniture
• x_king − x_man ≈ x_queen − x_woman
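A minimal sketch of how such analogies can be tested (the embedding matrix `emb`, index map `idx`, and word list `words` are hypothetical placeholders): subtract and add vectors, then take the nearest word by cosine similarity.

```python
# Sketch: solve a:b :: c:? by vector arithmetic over an embedding matrix.
import numpy as np

def analogy(a, b, c, emb, idx, words):
    target = emb[idx[b]] - emb[idx[a]] + emb[idx[c]]
    sims = emb @ target / (np.linalg.norm(emb, axis=1) * np.linalg.norm(target) + 1e-8)
    sims[[idx[a], idx[b], idx[c]]] = -np.inf   # exclude the query words themselves
    return words[int(sims.argmax())]

# e.g. analogy("man", "king", "woman", emb, idx, words) should ideally return "queen"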
18. The continuous bag of words model
• Main idea of continuous bag of words (CBOW): predict the center word from the sum of
the surrounding word vectors, instead of predicting the surrounding words one at a time
from the center word as in the skip-gram model
• Disregards grammar and word order
• The weights are shared across all context-word positions
19. The skip-gram model and negative sampling
• From paper: “Distributed Representations of Words and Phrases and their
Compositionality”(Mikolov et al. 2013)
• Main Idea : train binary logistic regressions for a true pair (center word and word in its
context window) and a couple of random pairs (the center word with a random word)
• Overall objective function:
J(θ) = log σ(u_oᵀ v_c) + Σ_{i=1}^{k} log σ(−u_{j_i}ᵀ v_c), with the negative words j_i drawn from a noise distribution,
where k is the number of negative samples we use
• So we maximize the probability of the two words co-occurring in the first log term and minimize the
probability that random words appear around the center word in the second log term
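A small sketch of this objective (random toy vectors, not the authors' code): the first term rewards the true (center, context) pair, the second term pushes down k random pairs.

```python
# Sketch: negative-sampling objective for one (center, context) pair.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_o, u_neg):
    """v_c: center vector, u_o: true context vector, u_neg: (k, d) negative samples."""
    positive = np.log(sigmoid(u_o @ v_c))             # first log: true pair
    negative = np.log(sigmoid(-(u_neg @ v_c))).sum()  # second log: k random pairs
    return positive + negative

rng = np.random.default_rng(0)
d, k = 4, 5
print(neg_sampling_objective(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```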
20. Convolution (one-dimensional)
• Here we introduce two types of one-dimensional convolution:
• 1) Narrow
• 2) Wide
Each output value is the dot product of the filter m with a window of the sentence s:
c_j = mᵀ s_{j−m+1:j}
Narrow convolution: requires s ≥ m and yields c ∈ R^{s−m+1}, with m ≤ j ≤ s.
Wide convolution: yields c ∈ R^{s+m−1}, with 1 ≤ j ≤ s + m − 1; out-of-range input values s_i where i < 1 or i > s are taken to be zero.
(In the example illustrated on the slide, the filter width is m = 5.)
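In numpy terms (a sketch, not the paper's code), the two types correspond to the "valid" and "full" modes of np.correlate, which matches the dot-product form above because it does not flip the filter:

```python
# Sketch: narrow vs wide one-dimensional convolution.
import numpy as np

s = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])    # sentence of length s = 7
m = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])         # filter of width m = 5

narrow = np.correlate(s, m, mode="valid")   # length s - m + 1 = 3
wide   = np.correlate(s, m, mode="full")    # length s + m - 1 = 11, zero-padded ends
print(narrow.shape, wide.shape)             # (3,) (11,)
```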
21. Time-Delay Neural Networks
• The sequence S is viewed as having a time dimension and the
convolution is applied over the time dimension
• Each s_j is a column of the sentence matrix S of size d × s, and M is a matrix of
weights of size d × m
• Each row of M is convolved with the corresponding row of S, and the
convolution is usually of the narrow type
• To address the problem of varying sentence lengths, the Max-TDNN
takes the maximum of each row in the resulting matrix C, yielding a
vector of d values
22. Narrow one-dimensional convolution through the time axis
Example: sentence matrix S with d = 4, s = 5; weight matrix M with d = 4, m = 2.
The narrow convolution yields a result matrix C with d = 4 rows and s − m + 1 = 4 columns.
Taking the max of every row of C gives a d-dimensional vector, which is used as the input
to a fully connected layer for classification.
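A compact sketch of this Max-TDNN step (hypothetical random values, dimensions d = 4, s = 5, m = 2 as in the example):

```python
# Sketch: row-wise narrow convolution followed by max-over-time pooling.
import numpy as np

d, s_len, m_len = 4, 5, 2
rng = np.random.default_rng(0)
S = rng.normal(size=(d, s_len))     # sentence matrix, one d-dim column per word
M = rng.normal(size=(d, m_len))     # filter weights

C = np.stack([np.correlate(S[i], M[i], mode="valid") for i in range(d)])
print(C.shape)            # (4, 4): s - m + 1 = 4 columns
pooled = C.max(axis=1)    # max over time for each row
print(pooled.shape)       # (4,): fed to the fully connected classifier
```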
23. But DCNN is slightly different
• 1) We use a wide, row-wise 1D convolution
• 2) Then a dynamic k-max pooling operation is applied
• 3) We apply a non-linearity to the pooled output
• 4) Steps 1-3 can be repeated several times
• 5) Folding (usually placed around the last layer)
• 6) k-max pooling
• 7) Fully connected layer
26. Dynamic k-Max Pooling
• The k-max pooling operation makes it possible to pool the k most active
features in p that may be a number of positions apart;
• it preserves the order of the features, but is insensitive to their specific
positions.
• It can also discern more finely the number of times the feature is highly
activated in p
• The k-max pooling operator is applied in the network after the topmost
convolutional layer.
• At intermediate convolutional layers the pooling parameter k is not fixed,
but is dynamically selected in order to allow for a smooth extraction of
higher-order and longer-range features.
27. Dynamic k-Max Pooling
k_l = max( k_top , ⌈ ((L − l) / L) · s ⌉ )
Here,
l is the number of the current convolutional layer to which the pooling is applied,
L is the total number of convolutional layers in the network,
k_top is the fixed pooling parameter for the topmost convolutional layer, and
s is the length of the input sentence.
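A small sketch (not the authors' implementation) of both pieces: the dynamic choice of k for layer l using the formula above, and k-max pooling of a single row, keeping the k largest values in their original order. The values s = 18, L = 3, k_top = 4 are hypothetical.

```python
# Sketch: dynamic k selection and k-max pooling.
import math
import numpy as np

def dynamic_k(l, L, s, k_top):
    return max(k_top, math.ceil((L - l) / L * s))

def k_max_pool(row, k):
    idx = np.sort(np.argsort(row)[-k:])   # positions of the k largest values, kept in order
    return row[idx]

s, L, k_top = 18, 3, 4
print([dynamic_k(l, L, s, k_top) for l in (1, 2, 3)])   # [12, 6, 4]

row = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
print(k_max_pool(row, 3))   # [0.9 0.8 0.7], original order preserved
```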
28. Non-Linear Features
• After (dynamic) k-max pooling is applied to the result of a
convolution, a bias b and a non-linear function g are applied
component-wise to the pooled matrix. There is a single bias value for
each of the d rows of the pooled matrix.
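As a tiny sketch (hypothetical values): with a pooled matrix of d rows, the bias is a length-d vector broadcast across the columns before the non-linearity g, e.g. tanh.

```python
# Sketch: per-row bias plus component-wise non-linearity on the pooled matrix.
import numpy as np

pooled = np.random.default_rng(0).normal(size=(4, 3))   # d = 4 rows
b = np.zeros(4)                                          # one bias value per row
activated = np.tanh(pooled + b[:, None])                 # g applied component-wise
print(activated.shape)   # (4, 3)
```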
29. So why does this model work ?
• The model is sensitive to the order of words in the input
• It can discriminate whether a specific n-gram occurs in the input
• To some extent, it can also tell the relative position of the most relevant n-grams
30. The left diagram emphasizes the pooled nodes. The
width of the convolutional filters is 3 and 2
respectively. With dynamic pooling, a filter with small
width at the higher layers can relate phrases far apart
in the input sentence.
What makes the feature graph of a DCNN peculiar
is the global range of the pooling operations.
The (dynamic) k-max pooling operator can draw
together features that correspond to words that are
many positions apart in the sentence
32. Experiments (Sentiment Prediction in Movie Reviews)
Training :
The top layer of the network has a fully connected layer followed by a softmax non-linearity
that predicts the probability distribution over classes given the input sentence. The network
is trained to minimise the cross-entropy of the predicted and true distributions; the
objective includes an L2 regularisation term over the parameters.
In the binary case,
we use the given splits of 6920 training, 872 dev
and 1821 test sentences.
In the fine-grained case, we use the
standard 8544/1101/2210 splits.
The size of the vocabulary is 15448.
33. Question Type Classification
A question may be classified as
belonging to one of many question
types. The TREC questions dataset
involves six different question
types, e.g. whether the question
is about a location, about a person or
about some numeric information (Li
and Roth, 2002). The training
dataset consists of 5452 labelled
questions whereas the test dataset
consists of 500 questions.
34. Twitter Sentiment Prediction with
Distant Supervision
Train the models on a large dataset of
tweets, where a tweet is automatically
labelled as positive or negative depending
on the emoticon that occurs in it.
The training set consists of 1.6 million
tweets with emoticon-based labels and
the test set of about 400 hand-annotated
tweets.
This results in a vocabulary of 76643 word
types. The architecture of the DCNN