General background and conceptual explanation of word embeddings (word2vec in particular). Mostly aimed at linguists, but also understandable for non-linguists.
Leiden University, 23 March 2018
4. ABOUT ME
Master (2002): Natural Language Processing (NLP)
PhD (2010): NLP and Information Retrieval (IR)
Now: Assistant Professor at the Leiden Institute for Advanced
Computer Science (LIACS), affiliated with the Data Science research
programme.
Research: projects on text mining and information retrieval in
various domains
Teaching: Data Science & Text Mining
Suzan Verberne
6. WHERE TO START
Linguistics: Distributional hypothesis
Data science: Vector Space Model (VSM)
7. DISTRIBUTIONAL HYPOTHESIS
Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162.
“Words that occur in similar contexts tend to be similar”
8. VECTOR SPACE MODEL
Traditional Vector Space Model (Information Retrieval):
documents and queries represented in a vector space
where the dimensions are the words
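For illustration, a minimal sketch of such a word-dimension vector space (a toy three-word vocabulary and Python counts; all names below are made up for this example):
# toy vocabulary: the dimensions of the vector space
vocabulary = ['cat', 'dog', 'house']

# a document (or query) becomes a vector of word counts over the vocabulary
def to_vector(tokens):
    return [tokens.count(term) for term in vocabulary]

document = ['the', 'cat', 'chased', 'the', 'dog']
query = ['cat']
print(to_vector(document))  # [1, 1, 0]
print(to_vector(query))     # [1, 0, 0]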
12. VECTOR SPACE MODEL
Vector Space Models represent (embed) words or documents in a
continuous vector space
The vector space is relatively low-dimensional (100 – 320
dimensions instead of 10,000s)
Semantically and syntactically similar words are mapped to nearby
points (Distributional Hypothesis)
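Nearness is typically measured with cosine similarity between the vectors. A minimal sketch with made-up 3-dimensional vectors (real embeddings have 100+ dimensions; the numbers below are purely illustrative):
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up low-dimensional 'embeddings', for illustration only
cat = np.array([0.9, 0.1, 0.3])
dog = np.array([0.8, 0.2, 0.4])
car = np.array([0.1, 0.9, 0.7])

print(cosine(cat, dog))  # high: similar words end up close together
print(cosine(cat, car))  # lower: dissimilar words end up further apart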
15. WHAT IS WORD2VEC?
Word2vec is a particularly computationally-efficient predictive
model for learning word embeddings from raw text
So that similar words are mapped to nearby points
17. WHERE DOES IT COME FROM?
Neural network language model (NNLM) (Bengio et al., 2003 – not
discussed today)
Mikolov had previously proposed (in his MSc thesis, PhD thesis, and other
papers) to first learn word vectors “using neural network with a single
hidden layer” and then train the NNLM independently
Word2vec directly extends this work => “word vectors learned
using a simple model”
Many architectures and models have been proposed for computing
word vectors (e.g. see Socher’s Stanford group work which resulted
in GloVe - http://nlp.stanford.edu/projects/glove/)
18. WHAT IS WORD2VEC (NOT)?
It is:
Neural network-based
Word embeddings
VSM-based
Co-occurrence-based
It is not:
Deep learning
Word embedding is the collective name for a
set of language modeling and feature learning
techniques in natural language processing
(NLP) where words or phrases from the
vocabulary are mapped to vectors of real
numbers in a low-dimensional space relative to
the vocabulary size
20. HOW IS THE MODEL TRAINED?
Move through the training corpus with a sliding window. Each word
is a prediction problem:
the objective is to predict the current word with the help of its
contexts (or vice versa)
The prediction error determines how we adjust the current word
vectors. Gradually, the vectors converge to (hopefully) optimal values
Importantly, prediction is not an aim in itself here: it is just a
proxy to learn vector representations that are useful for other tasks
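A sketch of the prediction problems that the sliding window produces (skip-gram style: each word is paired with the words in its context window; toy sentence, window size 2):
sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
window = 2

# (current word, context word) training pairs, skip-gram style
pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

print(pairs[:4])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]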
21. OTHER VECTOR SPACE SEMANTICS TECHNIQUES
Some “classic” NLP techniques for estimating continuous representations
of words:
– LSA (Latent Semantic Analysis)
– LDA (Latent Dirichlet Allocation)
Distributed representations of words learned by neural networks
outperform LSA on various tasks
LDA is computationally expensive and cannot be trained on very
large datasets
22. ADVANTAGES OF WORD2VEC
It scales
Train on billion word corpora
In limited time
Possibility of parallel training
Pre-trained word embeddings trained by one can be used by others
For entirely different tasks
Incremental training
Train on one piece of data, save the results, continue training later on (see the sketch below)
There is a Python module for it:
Gensim word2vec
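A sketch of what incremental training looks like in gensim (assuming a Word2Vec model trained as on the next slides; more_sentences stands for a new batch of tokenized sentences and the file name is a placeholder):
model.save('my_word2vec.model')                            # save the trained model to disk
model = gensim.models.Word2Vec.load('my_word2vec.model')  # reload it later
model.build_vocab(more_sentences, update=True)             # add the vocabulary of the new data
model.train(more_sentences, total_examples=model.corpus_count, epochs=model.epochs)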
24. DO IT YOURSELF
Implementation in Python package gensim
import gensim
# sentences: an iterable of tokenized sentences (lists of word strings)
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=5, workers=4)
# note: in gensim 4+ the 'size' parameter is called 'vector_size'
size: the dimensionality of the feature vectors (common: 200 or 320)
window: the maximum distance between the current and predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus to be
included in the model
workers: for parallelization on multicore machines
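Once trained, the model can be queried for nearest neighbours and word-word similarities (a minimal sketch; ‘king’ and ‘queen’ are just example words that must occur in the training corpus):
print(model.wv.most_similar('king', topn=5))  # the 5 nearest neighbours of 'king'
print(model.wv.similarity('king', 'queen'))   # cosine similarity between two words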
26. WHAT CAN YOU DO WITH IT?
A is to B as C is to ?
This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
Actually, what the original paper says is: if you subtract the vector
for ‘man’ from the one for ‘king’ and add the vector for ‘woman’,
the vector closest to the one you end up with turns out to be the
one for ‘queen’.
More interesting:
France is to Paris as Germany is to …
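In gensim, such analogy queries can be expressed with most_similar, adding and subtracting vectors via the positive and negative arguments (a sketch; the example words must be in the model’s vocabulary):
# king - man + woman ≈ ?
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# France is to Paris as Germany is to ?
print(model.wv.most_similar(positive=['Paris', 'Germany'], negative=['France'], topn=1))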
27. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances in
Neural Information Processing Systems, pages 3111–3119, 2013.
28. WHAT CAN YOU DO WITH IT?
A is to B as C is to ?
It also works for syntactic relations:
vector(biggest) - vector(big) + vector(small) = vector(smallest)
29. WHAT CAN YOU DO WITH IT?
Mining knowledge about natural language
Improve NLP applications
Improve Text Mining applications
30. WHAT CAN YOU DO WITH IT?
Mining knowledge about natural language
Selecting out-of-the-list words
Example: which word does not belong in [monkey, lion, dog, truck]? (see the sketch below)
Selectional preferences
Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
Improve NLP applications:
Sentence completion/text prediction
Bilingual Word Embeddings for Phrase-Based Machine Translation
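The out-of-the-list example above maps directly onto gensim’s doesnt_match function (a sketch, assuming a model trained on English text):
# which word does not belong in the list? doesnt_match picks the outlier
print(model.wv.doesnt_match(['monkey', 'lion', 'dog', 'truck']))  # expected: 'truck'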
31. WHAT CAN YOU DO WITH IT?
Improve Text Mining applications:
(Near-)Synonym detection (e.g. for query expansion)
Concept representation of texts
Example: Twitter sentiment classification
Document similarity
Example: cluster news articles per news event
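One simple way to turn word vectors into a document similarity is to compare the averaged vectors of two texts; gensim’s n_similarity does exactly that (a sketch with made-up example sentences; all words must be in the model’s vocabulary):
doc1 = ['election', 'results', 'announced', 'tonight']
doc2 = ['votes', 'counted', 'winner', 'declared']
print(model.wv.n_similarity(doc1, doc2))  # cosine similarity of the averaged word vectors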
32. DEAL WITH PHRASES
Entities as single words:
during pre-processing: find collocations (e.g. Wikipedia anchor texts),
and feed them as single ‘words’ to the neural network.
Bigrams (built-in function)
# sentence_stream: an iterable of tokenized sentences (lists of word strings)
bigram = gensim.models.Phrases(sentence_stream)  # learn frequent word pairs (collocations)
term_list = list(bigram[sentence_stream])        # merge them into single tokens, e.g. 'new_york'
model = gensim.models.Word2Vec(term_list)        # train word2vec on the phrase-merged corpus
33. HOW TO EVALUATE A MODEL
There is a comprehensive test set that contains five types of semantic
questions and nine types of syntactic questions:
– 8869 semantic questions
– 10675 syntactic questions
https://rare-technologies.com/word2vec-tutorial/
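In gensim the analogy test set can be run directly against a trained model (a sketch; 'questions-words.txt' is a local copy of the analogy questions distributed with the original word2vec code):
# gensim >= 3.4; older versions use model.wv.accuracy() instead
analogy_score, sections = model.wv.evaluate_word_analogies('questions-words.txt')
print(analogy_score)  # overall fraction of correctly answered analogy questions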
34. FURTHER READING
T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems, pages 3111–3119,
2013.
http://radimrehurek.com/2014/02/word2vec-tutorial/
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
https://rare-technologies.com/word2vec-tutorial/
36. PRE-TRAINED WORD2VEC MODELS
We will compare two pre-trained word2vec models:
A model trained on a Dutch web corpus (COW)
A model trained on Dutch web forum data
Goal: inspect the differences between the models
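A sketch of how such a comparison could be set up in gensim (the file names are hypothetical placeholders for the two pre-trained models; ‘ziekenhuis’ is just an example Dutch word):
from gensim.models import KeyedVectors

# hypothetical file names for the two pre-trained models
cow_model = KeyedVectors.load_word2vec_format('cow_web_corpus.bin', binary=True)
forum_model = KeyedVectors.load_word2vec_format('forum_data.bin', binary=True)

# compare the nearest neighbours of the same word in both models
word = 'ziekenhuis'  # 'hospital'; any word that occurs in both vocabularies
print(cow_model.most_similar(word, topn=5))
print(forum_model.most_similar(word, topn=5))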