2. Some of the challenges in Language Understanding
• Language is ambiguous:
– Every sentence has many possible interpretations.
• Language is productive:
– We will always encounter new words or new constructions.
• Language is culturally specific.
Example: possible part-of-speech taggings of “fruit flies like a banana”:
fruit  flies  like  a   banana
NN     NN     VB    DT  NN
NN     VB     P     DT  NN
NN     NN     P     DT  NN
NN     VB     VB    DT  NN
3. ML: Traditional Approach
• For each new problem/question
– Gather as much LABELED data as you can get
– Throw some algorithms at it (often just an SVM, and leave it at that)
– If you have actually tried more algorithms: pick the best
– Spend hours hand-engineering features / feature selection / dimensionality reduction (PCA, SVD, etc.)
– Repeat…
5. Deep Learning: Why for NLP?
• Beat the state of the art in:
– Language Modeling (Mikolov et al. 2011) [WSJ AR task]
– Speech Recognition (Dahl et al. 2012, Seide et al. 2011; following Mohamed et al. 2011)
– Sentiment Classification (Socher et al. 2011)
– MNIST hand-written digit recognition (Ciresan et al. 2010)
– Image Recognition (Krizhevsky et al. 2012) [ImageNet]
6. Language semantics
• What is the meaning of a word? (Lexical semantics)
• What is the meaning of a sentence? ([Compositional] semantics)
• What is the meaning of a longer piece of text? (Discourse semantics)
7. One-hot encoding
• Form a vocabulary that maps each lemmatized word to a unique ID (the position of the word in the vocabulary)
• Typical vocabulary sizes vary between 10,000 and 250,000
8. One-hot encoding
• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID
– for vocabulary size D=10, the one-hot vector of word ID w=4 is e(w) = [ 0 0 0 1 0 0 0 0 0 0 ]
• A one-hot encoding makes no assumption about word similarity
• All words are equally different from each other (see the sketch below)
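A minimal Python sketch of the vocabulary-plus-one-hot scheme above; the tiny vocabulary and the one_hot helper are made up for illustration:

```python
import numpy as np

# Toy vocabulary mapping each word to a unique ID (illustrative only;
# real vocabularies have 10,000-250,000 entries).
vocab = {"cat": 0, "sat": 1, "on": 2, "the": 3, "mat": 4}
D = len(vocab)

def one_hot(word_id, size):
    """Vector of 0s with a single 1 at the position of the word ID."""
    e = np.zeros(size, dtype=int)
    e[word_id] = 1
    return e

print(one_hot(vocab["the"], D))  # [0 0 0 1 0]
# Every pair of distinct one-hot vectors is at the same distance:
# the encoding carries no notion of word similarity.
```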
9. Word representation
• Standard
– Bag of Words
– A one-hot encoding
– 20k to 50k dimensions
– Can be improved by factoring in document frequency
• Word embedding
– Neural word embeddings
– Uses a vector space that attempts to predict a word given a context window
– 200-400 dimensions
Word embeddings make semantic similarity and synonyms possible.
10. Distributional representations
• “You shall know a word by the company it keeps” (J. R. Firth 1957)
• One of the most successful ideas of modern statistical NLP!
11. • Word Embeddings (Bengio et al. 2001; Bengio et al. 2003), based on the idea of distributed representations for symbols (Hinton 1986)
• Neural word embeddings (Mnih and Hinton 2007, Collobert & Weston 2008, Turian et al. 2010; Collobert et al. 2011, Mikolov et al. 2011)
12. Neural distributional representations
• Neural word embeddings
• Combine vector-space semantics with the predictions of probabilistic models
• Words are represented as dense vectors
[Figure: the word “Human” mapped to its dense embedding vector]
18. Word Embeddings
• One of the most exciting areas of research in deep learning
• Introduced by Bengio et al. (2003)
• W: words → R^n is a parameterized function mapping words in some language to high-dimensional vectors (200 to 500 dimensions)
– W(“cat”) = (0.2, -0.4, 0.7, ...)
– W(“mat”) = (0.0, 0.6, -0.1, ...)
• Typically, the function is a lookup table, parameterized by a matrix θ with a row for each word: Wθ(wn) = θn
• W is initialized with random vectors for each word
• The embeddings learn to become meaningful vectors by being trained to perform some task (see the sketch below)
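A minimal sketch of the lookup-table view of W, with a made-up three-word vocabulary and n=4 dimensions so the matrix stays readable (real embeddings use 200-500 dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "mat": 1, "sat": 2}  # toy word -> ID mapping
n = 4                                   # embedding dimension (toy size)

# theta has one row per word; W starts as random vectors.
theta = rng.normal(scale=0.1, size=(len(vocab), n))

def W(word):
    """W_theta(w_n) = theta_n: the lookup is just row indexing."""
    return theta[vocab[word]]

print(W("cat"))  # random at initialization; becomes meaningful
                 # only after training on some task
```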
19. Learning word vectors (Collobert et al. JMLR 2011)
• Idea: a word in its context is a positive training example; a random word substituted into the same context gives a negative training example
20. Example
• Train a network to predict whether a 5-gram (a sequence of five words) is “valid”
• Source
– any text corpus (e.g., Wikipedia)
• Corrupt half of the 5-grams to get negative training examples (see the sketch below)
– swap in a random word to make the 5-gram nonsensical
– “cat sat song the mat”
21. Neural network to determine if a 5-gram is “valid” (Bottou 2011)
• Look up each word in the 5-gram through W
• Feed those five vectors into a second network R
• R tries to predict if the 5-gram is “valid” or “invalid” (see the sketch below)
– R(W(“cat”), W(“sat”), W(“on”), W(“the”), W(“mat”)) = 1
– R(W(“cat”), W(“sat”), W(“song”), W(“the”), W(“mat”)) = 0
• The network needs to learn good parameters for both W and R
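A minimal sketch of R as a tiny two-layer scorer over the concatenated word vectors; the layer sizes and initialization are illustrative assumptions, not the exact architecture of Collobert et al.:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                        # embedding dim (toy size)
W1 = rng.normal(scale=0.1, size=(5 * n, 8))  # hidden-layer weights
W2 = rng.normal(scale=0.1, size=(8,))        # output weights

def R(word_vectors):
    """Score a 5-gram: near 1 for valid, near 0 for invalid (after training)."""
    x = np.concatenate(word_vectors)         # concatenate the 5 embeddings
    h = np.tanh(x @ W1)                      # hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2)))   # sigmoid score in (0, 1)

# Untrained example call with random stand-ins for W("cat"), W("sat"), ...
print(R([rng.normal(size=n) for _ in range(5)]))
```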
23. Idea
• “a few people sing well” → “a couple people sing well”
• The validity of the sentence doesn’t change
• If W maps synonyms (like “few” and “couple”) close together, little changes from R’s perspective
24. Bingo
• The number of possible 5-grams is massive, but there is only a small number of data points to learn from
• Embeddings generalize across similar classes of words
– “the wall is blue” → “the wall is red”
• ...and across multiple words at once
– “the wall is blue” → “the ceiling is red”
• Shifting “red” closer to “blue” makes the network R perform better
25. Word embedding property
• Analogies between words are encoded in the difference vectors between words (see the sketch below)
– W(“woman”) − W(“man”) ≃ W(“aunt”) − W(“uncle”)
– W(“woman”) − W(“man”) ≃ W(“queen”) − W(“king”)
27. Word embedding property: Shared representations
• “The use of word representations… has become a key ‘secret sauce’ for the success of many NLP systems in recent years, across tasks including named entity recognition, part-of-speech tagging, parsing, and semantic role labeling.” (Luong et al. 2013)
28. • W and F learn to perform task A. Later, G can learn to perform task B based on the same W.
34. Simple RNN training
• Input vector: 1-of-N encoding (one-hot)
• Each epoch:
– S(0): vector of small values (e.g., 0.1)
– Hidden layer: 30–500 units
– All training data from the corpus are presented sequentially
– Initial learning rate: 0.1
– Error function
– Standard backpropagation with stochastic gradient descent
• Convergence achieved after 10–20 epochs
35. Word2vec (Mikolov et al., 2013)
• Log-linear model
• Previous models used a non-linear hidden layer -> higher complexity
• Continuous word vectors are learned using a simple model
36. Continuous BoW (CBOW) Model
• Similar to the feed-forward NNLM, but
– the non-linear hidden layer is removed
• Called CBOW (continuous BoW) because the order of the words is lost (see the sketch below)
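One way to train CBOW in practice is gensim's Word2Vec (sg=0 selects CBOW); the two-sentence corpus here is a made-up toy, far too small to yield meaningful vectors:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=0 selects CBOW: predict the center word from its context window.
model = Word2Vec(sentences, vector_size=50, window=2,
                 min_count=1, sg=0, epochs=20)

print(model.wv["cat"][:5])           # first dimensions of the learned vector
print(model.wv.most_similar("cat"))  # neighbors (noisy on a toy corpus)
```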
38. Continuous Skip-gram Model
• Similar to CBOW, but
– tries to maximize classification of a word based on another word in the same sentence
• Predicts words within a certain window
• Observations
– Larger window size => better quality of the resulting word vectors, but higher training time
– More distant words are usually less related to the current word than those close to it
– Give less weight to distant words by sampling them less often in the training examples (see the sketch below)
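A sketch of skip-gram training-pair generation with the distance weighting described above, implemented by drawing a reduced window size at random for each position (mirroring the trick used in the original word2vec code); the tokenized sentence is a toy example:

```python
import random

def skipgram_pairs(tokens, window=5):
    """Yield (center, context) pairs; distant words are sampled less often."""
    pairs = []
    for i, center in enumerate(tokens):
        r = random.randint(1, window)  # shrink the effective window at random
        for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split(), window=2))
```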
62. Language Modeling
• A language model is a probabilistic model that assigns a probability to any sequence of words: p(w1, ..., wT) (see the factorization below)
• Language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences
• Plays a crucial role in speech recognition and machine translation systems
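The joint probability factorizes by the chain rule, and an n-gram model (next slide) truncates each history to the previous n−1 words; this standard identity is implicit in the slides:

```latex
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1})
                   \approx \prod_{t=1}^{T} p(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```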
63. N-gram models
• An n-gram is a sequence of n words
– unigrams (n=1): “is”, “a”, “sequence”, etc.
– bigrams (n=2): [“is”, “a”], [“a”, “sequence”], etc.
– trigrams (n=3): [“is”, “a”, “sequence”], [“a”, “sequence”, “of”], etc.
• n-gram models estimate the conditional probability of a word given its history from n-gram counts (see the sketch below)
• The counts are obtained from a training corpus (a dataset of text)
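A minimal bigram (n=2) sketch: maximum-likelihood estimates of p(w_t | w_{t-1}) from counts over a made-up toy corpus:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat sat".split()  # toy corpus
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("cat", "the"))  # 2/3: "the" is followed by "cat" in 2 of 3 cases
```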