This paper aims to develop an effective sentence model using a dynamic convolutional neural network (DCNN) architecture. The DCNN applies 1D convolutions and dynamic k-max pooling to capture syntactic and semantic information from sentences with varying lengths. This allows the model to relate phrases far apart in the input sentence and draw together important features. Experiments show the DCNN approach achieves strong performance on tasks like sentiment analysis of movie reviews and question type classification.
Convolutional Neural Network for Modelling Sentences
1. A Convolutional Neural Network for
Modelling Sentences
Presented By :
Anish Bhanushali
Anjani Jha
Mansi Goel
Authors :
Nal Kalchbrenner (University of Oxford)
Edward Grefenstette (University of Oxford)
Phil Blunsom (University of Oxford)
2. Objective :
This paper aims to develop an effective sentence model that analyses
and represents the semantic content of a sentence, using a dynamic
CNN architecture.
3. Word representation
The vast majority of rule-based and statistical NLP work regards words as atomic
symbols: hotel, conference, walk
One-hot Representation:
In vector-space terms, this is a vector with a single 1 and a lot of zeroes, e.g.
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
Problem with this representation:
motel = [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] ANDed with the one-hot vector for hotel gives 0; any two distinct one-hot vectors are orthogonal, so this representation encodes no similarity between related words.
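The problem becomes concrete with a small sketch (hypothetical four-word vocabulary, not from the slides): the dot product of any two distinct one-hot vectors is zero, so the representation treats motel and hotel as being just as unrelated as motel and walk.

```python
# Minimal sketch: one-hot vectors carry no similarity information.
import numpy as np

vocab = ["hotel", "conference", "walk", "motel"]          # hypothetical tiny vocabulary
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot["motel"] @ one_hot["hotel"])   # 0.0 -> looks unrelated to "hotel"
print(one_hot["motel"] @ one_hot["walk"])    # 0.0 -> equally unrelated to "walk"
print(one_hot["motel"] @ one_hot["motel"])   # 1.0 -> only identical words match
```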
4. Distributional similarity based representations
You can get a lot of value by representing a word by means of its
neighbors
“You shall know a word by the company it keeps”
(J. R. Firth 1957: 11)
One of the most successful ideas of modern statistical NLP
Example contexts of the word banking:
…government debt problems turning into banking crises as has happened in…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
These context words will represent banking
5. How to make neighbors represent words?
Answer: With a cooccurrence matrix X
Window based co-occurrence matrix
• Window around each word captures both syntactic (POS) and semantic
information
• Window length 1 (more common: 5 - 10)
• Symmetric (irrelevant whether left or right context)
6. Window based cooccurence matrix
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts      I   like  enjoy  deep  learning  NLP  flying  .
I           0   2     1      0     0         0    0       0
like        2   0     0      1     0         1    0       0
enjoy       1   0     0      0     0         0    1       0
deep        0   1     0      0     1         0    0       0
learning    0   0     0      1     0         0    0       1
NLP         0   1     0      0     0         0    0       1
flying      0   0     1      0     0         0    0       1
.           0   0     0      0     1         1    1       0
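As a sketch of how such a matrix is built (assuming window length 1 and symmetric context, as above; this is not the original lecture code), one can simply count neighbouring word pairs:

```python
# Minimal sketch: window-based co-occurrence counts for the slide's corpus.
from collections import defaultdict

corpus = [
    ["I", "like", "deep", "learning", "."],
    ["I", "like", "NLP", "."],
    ["I", "enjoy", "flying", "."],
]

counts = defaultdict(int)
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                counts[(w, sent[j])] += 1

print(counts[("I", "like")])    # 2, as in the matrix above
print(counts[("like", "NLP")])  # 1
```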
7. Problems with simple cooccurrence vectors
• Increase in size with vocabulary
• Very high dimensional: requires a lot of storage
• Subsequent classification models have sparsity issues
• Models are less robust
8. Solution: Low dimensional vectors
• Idea: store “most” of the important information in a fixed, small
number of dimensions: a dense vector
• Usually around 25 – 1000 dimensions
• How to reduce the dimensionality?
9. Method 1: Dimensionality Reduction on X
Singular Value Decomposition of the co-occurrence matrix X:

X = U S V^T

where X is n × m, U is n × r, S is an r × r diagonal matrix of singular values S1 ≥ S2 ≥ … ≥ Sr, and V^T is r × m.

Keeping only the k largest singular values (the first k columns of U and V and the top-left k × k block of S) gives

X_k = U_k S_k V_k^T

which is the best rank-k approximation to X in terms of least squares.
10. Simple SVD word vectors in Python
Corpus:
I like deep learning. I like NLP. I enjoy flying.
11. Simple SVD word vectors in Python
Corpus: I like deep learning. I like NLP. I enjoy flying.
Printing first two columns of U corresponding to the 2 biggest singular values
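The code shown on the original slide is not reproduced in this transcript; the following is a minimal sketch of the same idea, assuming the 8 × 8 co-occurrence matrix X from slide 6 and plain numpy:

```python
# Sketch: SVD of the co-occurrence matrix, printing the first two columns of U.
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# numpy returns singular values in descending order, so the first two columns
# of U correspond to the two biggest singular values.
for word, vec in zip(words, U[:, :2]):
    print(f"{word:10s} {vec}")
```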
12. Interesting syntactic patterns emerge in the vectors
[Figure: 2D projection of word vectors; inflectional variants of verbs (TAKE/TOOK/TAKEN/TAKING, SHOW/SHOWED/SHOWN/SHOWING, GROW/GREW/GROWN/GROWING, SPEAK/SPOKE/SPOKEN/SPEAKING, EAT/ATE/EATEN/EATING, CHOOSE/CHOSE/CHOSEN/CHOOSING, THROW/THREW/THROWN/THROWING, STEAL/STOLE/STOLEN/STEALING) cluster along consistent directions.]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
13. Interesting semantic patterns emerge in the vectors
[Figure: 2D projection of word vectors; semantically related pairs (DRIVE/DRIVER, TEACH/TEACHER, SWIM/SWIMMER, MARRY/BRIDE, PRAY/PRIEST, TREAT/DOCTOR, CLEAN/JANITOR, LEARN/STUDENT) are linked by similar offsets.]
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence
Rohde et al. 2005
14. Problems with SVD
• Computational cost scales quadratically for an n × m matrix: O(mn²) flops (when n < m)
• Bad for millions of words or documents
• Hard to incorporate new words or documents
• Instead of capturing co-occurrence counts directly,
• Predict surrounding words of every word
• Faster and can easily incorporate a new sentence/ document or add a word to
the vocabulary
Idea: Directly learn low-dimensional word vectors
15. Main Idea of word2vec
• Instead of capturing cooccurrence counts directly,
• Predict surrounding words of every word
• Faster and can easily incorporate a new sentence/ document or add
a word to the vocabulary
16. Details of Word2Vec
• Predict surrounding words in a window of length m of every word.
• Objective function: maximize the log probability of any context word
given the current center word:
J(θ) = (1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t)
where θ represents all the variables we optimize.
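A toy sketch of one term of this objective (hypothetical dimensions and random vectors, not the original implementation): the probability p(o | c) is a softmax over the dot products of the center vector with every outside vector.

```python
# Sketch: log p(context | center) for the skip-gram objective.
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
v = rng.normal(size=(V, d))       # "center" word vectors
u = rng.normal(size=(V, d))       # "outside"/context word vectors

def log_p(o, c):
    scores = u @ v[c]             # u_w . v_c for every word w in the vocabulary
    return scores[o] - np.log(np.exp(scores).sum())

center, context = 3, 7
print(log_p(context, center))     # one term of the objective J(theta)
```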
17. Linear Relationships in word2vec
These representations are very good at encoding dimensions of similarity!
• Analogies testing dimensions of similarity can be solved quite well just by doing vector subtraction in the embedding space
Syntactically:
x_apple − x_apples ≈ x_car − x_cars ≈ x_family − x_families
• Similarly for verb and adjective morphological forms
Semantically (SemEval 2012 Task 2):
• x_shirt − x_clothing ≈ x_chair − x_furniture
• x_king − x_man ≈ x_queen − x_woman
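A minimal sketch of how such analogies can be tested (the embedding matrix `emb`, index map `idx`, and word list `words` are hypothetical placeholders): subtract and add vectors, then take the nearest word by cosine similarity.

```python
# Sketch: solve a:b :: c:? by vector arithmetic over an embedding matrix.
import numpy as np

def analogy(a, b, c, emb, idx, words):
    target = emb[idx[b]] - emb[idx[a]] + emb[idx[c]]
    sims = emb @ target / (np.linalg.norm(emb, axis=1) * np.linalg.norm(target) + 1e-8)
    sims[[idx[a], idx[b], idx[c]]] = -np.inf   # exclude the query words themselves
    return words[int(sims.argmax())]

# e.g. analogy("man", "king", "woman", emb, idx, words) should ideally return "queen"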
18. The continuous bag of words model
• Main idea of continuous bag of words (CBOW): predict the center word from the sum of
the surrounding word vectors, instead of predicting the surrounding words one at a time
from the center word as in the skip-gram model
• Disregards grammar and word order
• The weights are shared across all context-word positions
19. The skip-gram model and negative sampling
• From paper: “Distributed Representations of Words and Phrases and their
Compositionality”(Mikolov et al. 2013)
• Main Idea : train binary logistic regressions for a true pair (center word and word in its
context window) and a couple of random pairs (the center word with a random word)
• Overall objective function:
J(θ) = log σ(u_oᵀ v_c) + Σ_{i=1}^{k} log σ(−u_{j_i}ᵀ v_c), with the negative words j_i drawn from a noise distribution,
where k is the number of negative samples we use
• So we maximize the probability of the two words co-occurring in the first log term and minimize the
probability that random words appear around the center word in the second log term
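A small sketch of this objective (random toy vectors, not the authors' code): the first term rewards the true (center, context) pair, the second term pushes down k random pairs.

```python
# Sketch: negative-sampling objective for one (center, context) pair.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_objective(v_c, u_o, u_neg):
    """v_c: center vector, u_o: true context vector, u_neg: (k, d) negative samples."""
    positive = np.log(sigmoid(u_o @ v_c))             # first log: true pair
    negative = np.log(sigmoid(-(u_neg @ v_c))).sum()  # second log: k random pairs
    return positive + negative

rng = np.random.default_rng(0)
d, k = 4, 5
print(neg_sampling_objective(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```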
20. Convolution (one-dimensional)
• Here we introduce two types of one-dimensional convolution:
• 1) Narrow
• 2) Wide
Each output value is the dot product of the filter m with a window of the sentence s:
c_j = mᵀ s_{j−m+1:j}
Narrow convolution: requires s ≥ m and yields c ∈ R^{s−m+1}, with m ≤ j ≤ s.
Wide convolution: yields c ∈ R^{s+m−1}, with 1 ≤ j ≤ s + m − 1; out-of-range input values s_i where i < 1 or i > s are taken to be zero.
(In the example illustrated on the slide, the filter width is m = 5.)
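In numpy terms (a sketch, not the paper's code), the two types correspond to the "valid" and "full" modes of np.correlate, which matches the dot-product form above because it does not flip the filter:

```python
# Sketch: narrow vs wide one-dimensional convolution.
import numpy as np

s = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])    # sentence of length s = 7
m = np.array([1.0, 0.5, 0.25, 0.125, 0.0625])         # filter of width m = 5

narrow = np.correlate(s, m, mode="valid")   # length s - m + 1 = 3
wide   = np.correlate(s, m, mode="full")    # length s + m - 1 = 11, zero-padded ends
print(narrow.shape, wide.shape)             # (3,) (11,)
```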
21. Time-Delay Neural Networks
• The sequence S is viewed as having a time dimension and the
convolution is applied over the time dimension
• Each s_j is a column of the sentence matrix S of size d × s, and M is a matrix of
weights of size d × m
• Each row of M is convolved with the corresponding row of S, and the
convolution is usually of the narrow type
• To address the problem of varying sentence lengths, the Max-TDNN
takes the maximum of each row in the resulting matrix C, yielding a
vector of d values
22. Narrow one-dimensional convolution through the time axis
Example: sentence matrix S with d = 4, s = 5; weight matrix M with d = 4, m = 2.
The narrow convolution yields a result matrix C with d = 4 rows and s − m + 1 = 4 columns.
Taking the max of every row of C gives a d-dimensional vector, which is used as the input
to a fully connected layer for classification.
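A compact sketch of this Max-TDNN step (hypothetical random values, dimensions d = 4, s = 5, m = 2 as in the example):

```python
# Sketch: row-wise narrow convolution followed by max-over-time pooling.
import numpy as np

d, s_len, m_len = 4, 5, 2
rng = np.random.default_rng(0)
S = rng.normal(size=(d, s_len))     # sentence matrix, one d-dim column per word
M = rng.normal(size=(d, m_len))     # filter weights

C = np.stack([np.correlate(S[i], M[i], mode="valid") for i in range(d)])
print(C.shape)            # (4, 4): s - m + 1 = 4 columns
pooled = C.max(axis=1)    # max over time for each row
print(pooled.shape)       # (4,): fed to the fully connected classifier
```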
23. But DCNN is slightly different
• 1) We use a wide, row-wise 1D convolution
• 2) Then a dynamic k-max pooling operation is applied
• 3) We apply a non-linearity to the pooled output
• 4) Steps 1-3 can be repeated several times
• 5) Folding (usually placed around the last layer)
• 6) k-max pooling
• 7) Fully connected layer
26. Dynamic k-Max Pooling
• The k-max pooling operation makes it possible to pool the k most active
features in p that may be a number of positions apart;
• it preserves the order of the features, but is insensitive to their specific
positions.
• It can also discern more finely the number of times the feature is highly
activated in p
• The k-max pooling operator is applied in the network after the topmost
convolutional layer.
• At intermediate convolutional layers the pooling parameter k is not fixed,
but is dynamically selected in order to allow for a smooth extraction of
higher-order and longer-range features.
27. Dynamic k-Max Pooling
k_l = max( k_top , ⌈ ((L − l) / L) · s ⌉ )
Here,
l is the number of the current convolutional layer to which the pooling is applied,
L is the total number of convolutional layers in the network,
k_top is the fixed pooling parameter for the topmost convolutional layer, and
s is the length of the input sentence.
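A small sketch (not the authors' implementation) of both pieces: the dynamic choice of k for layer l using the formula above, and k-max pooling of a single row, keeping the k largest values in their original order. The values s = 18, L = 3, k_top = 4 are hypothetical.

```python
# Sketch: dynamic k selection and k-max pooling.
import math
import numpy as np

def dynamic_k(l, L, s, k_top):
    return max(k_top, math.ceil((L - l) / L * s))

def k_max_pool(row, k):
    idx = np.sort(np.argsort(row)[-k:])   # positions of the k largest values, kept in order
    return row[idx]

s, L, k_top = 18, 3, 4
print([dynamic_k(l, L, s, k_top) for l in (1, 2, 3)])   # [12, 6, 4]

row = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
print(k_max_pool(row, 3))   # [0.9 0.8 0.7], original order preserved
```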
28. Non-Linear Features
• After (dynamic) k-max pooling is applied to the result of a
convolution, a bias b and a non-linear function g are applied
component-wise to the pooled matrix. There is a single bias value for
each of the d rows of the pooled matrix.
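As a tiny sketch (hypothetical values): with a pooled matrix of d rows, the bias is a length-d vector broadcast across the columns before the non-linearity g, e.g. tanh.

```python
# Sketch: per-row bias plus component-wise non-linearity on the pooled matrix.
import numpy as np

pooled = np.random.default_rng(0).normal(size=(4, 3))   # d = 4 rows
b = np.zeros(4)                                          # one bias value per row
activated = np.tanh(pooled + b[:, None])                 # g applied component-wise
print(activated.shape)   # (4, 3)
```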
29. So why does this model work ?
• The model is sensitive to the order of words in the input
• It can discriminate whether a specific n-gram occurs in the input
• To some extent, it can also tell the relative position of the most relevant n-grams
30. The left diagram emphasizes the pooled nodes. The
width of the convolutional filters is 3 and 2
respectively. With dynamic pooling, a filter with small
width at the higher layers can relate phrases far apart
in the input sentence.
What makes the feature graph of a DCNN peculiar
is the global range of the pooling operations.
The (dynamic) k-max pooling operator can draw
together features that correspond to words that are
many positions apart in the sentence
32. Experiments (Sentiment Prediction in Movie Reviews)
Training :
The top layer of the network has a fully connected layer followed by a softmax non-linearity
that predicts the probability distribution over classes given the input sentence. The network
is trained to minimise the cross-entropy of the predicted and true distributions; the
objective includes an L2 regularisation term over the parameters.
In the binary case,
we use the given splits of 6920 training, 872 dev
and 1821 test sentences.
In the fine-grained case, we use the
standard 8544/1101/2210 splits.
The size of the vocabulary is 15448.
33. Question Type Classification
A question may be classified as
belonging to one of many question
types. The TREC questions dataset
involves six different question
types, e.g. whether the question
is about a location, about a person or
about some numeric information (Li
and Roth, 2002). The training
dataset consists of 5452 labelled
questions whereas the test dataset
consists of 500 questions.
34. Twitter Sentiment Prediction with
Distant Supervision
Train the models on a large dataset of
tweets, where a tweet is automatically
labelled as positive or negative depending
on the emoticon that occurs in it.
The training set consists of 1.6 million
tweets with emoticon-based labels and
the test set of about 400 hand-annotated
tweets.
This results in a vocabulary of 76643 word
types. The architecture of the DCNN