This document introduces GloVe (Global Vectors), a method for creating word embeddings that combines global matrix factorization and local context window models. It discusses how global matrix factorization uses singular value decomposition to reduce a term-frequency matrix to learn word vectors from global corpus statistics. It also explains how local context window models like skip-gram and CBOW learn word embeddings by predicting words from a fixed-size window of surrounding context words during training. GloVe aims to learn from both global co-occurrence patterns and local context to generate word vectors.
[EMNLP] What is GloVe? Part I
An introduction to unsupervised learning of word embeddings from
co-occurrence matrices.
Brendan Whitaker
May 24, 2018 · 4 min read
In this article, we’ll discuss one of the newer methods of creating vector space models of word semantics, more commonly known as word embeddings. The original paper by J. Pennington, R. Socher, and C. Manning is available here: http://www.aclweb.org/anthology/D14-1162. This method combines elements from the two main families of word embedding models that existed when GloVe, short for “Global Vectors [for word representation]”, was proposed: global matrix factorization and local context window methods. In Part I, we explore these previous models and the mechanics behind them.
Global matrix factorization.
In natural language processing, global matrix factorization is the process of using
matrix factorization methods from linear algebra to perform rank reduction on a large
term-frequency matrix. These matrices usually represent either term-document
frequencies, in which the rows are words and the columns are documents (or
sometimes paragraphs), or term-term frequencies, which have words on both axes and
measure co-occurrence. Global matrix factorization applied to term-document
frequency matrices is more commonly known as latent semantic analysis (LSA). In
latent semantic analysis, the high-dimensional matrix is reduced via singular value
decomposition (SVD).
We won’t fully treat the math behind singular value decomposition in this article, but it’s essentially a factorization of a general m-by-n matrix M into a product U Σ V*, where U is m-by-m and unitary, Σ is an m-by-n rectangular diagonal matrix (whose nonzero entries are known as the singular values of M), and V is n-by-n and unitary.
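As a quick sanity check of these definitions, here is a minimal NumPy sketch; the 3-by-2 matrix is an arbitrary example, not data from the paper, and since it is real, the conjugate transpose discussed below is simply the transpose:

```python
import numpy as np

# An arbitrary m-by-n matrix (m=3, n=2) standing in for a term-frequency matrix.
M = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 4.0]])

U, s, Vt = np.linalg.svd(M)          # s holds the singular values of M

# U is m-by-m and V is n-by-n, both unitary (here real, so orthogonal).
assert np.allclose(U @ U.T, np.eye(3))
assert np.allclose(Vt.T @ Vt, np.eye(2))

# Rebuild the m-by-n rectangular diagonal Sigma and verify M = U Sigma V*.
Sigma = np.zeros_like(M)
Sigma[:2, :2] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, M)
```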
Recall that the conjugate transpose A* of a matrix A is the matrix given by taking the complex conjugate of every entry in the transpose (reflection over the diagonal) of A. A unitary matrix is any square matrix whose conjugate transpose is its inverse, i.e. a matrix A such that AA* = A*A = I. This factorization is then used to find a low-rank approximation to M: we first choose r, the desired rank of our approximation matrix M′, and then compute Σ′, which is just Σ with all but the r largest singular values set to zero. Our approximation is then given by the formula M′ = U Σ′ V*.
These low-rank approximations to the term frequency matrices then give us reasonably
sized vector space embeddings of the global corpus statistics.
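As a rough illustration of how a truncated SVD turns global co-occurrence counts into word vectors, here is a small sketch. The four-word vocabulary and the counts in X are invented for the example, and scaling the left singular vectors by the square root of the singular values is just one common way to read embeddings off the factorization:

```python
import numpy as np

# Hypothetical 4-word vocabulary with made-up symmetric co-occurrence counts;
# rows and columns both follow the order of `vocab` (a term-term matrix).
vocab = ["ice", "steam", "water", "solid"]
X = np.array([[0., 1., 6., 5.],
              [1., 0., 7., 1.],
              [6., 7., 0., 2.],
              [5., 1., 2., 0.]])

U, s, Vt = np.linalg.svd(X)

r = 2                                  # desired rank of the approximation
Sigma_r = np.diag(s[:r])               # keep only the r largest singular values

# Rank-r approximation M' = U_r Sigma_r V_r* of the original matrix.
X_approx = U[:, :r] @ Sigma_r @ Vt[:r, :]

# Read off r-dimensional word vectors by scaling the left singular vectors.
word_vectors = U[:, :r] @ np.sqrt(Sigma_r)
for word, vec in zip(vocab, word_vectors):
    print(word, np.round(vec, 2))
```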
Local context window methods.
The other family of word embedding models learns semantics by passing a window over the corpus line by line and learning either to predict the surroundings of a given word (skip-gram model) or to predict a word given its surroundings (continuous bag-of-words model, often shortened to “CBOW”).
An example that works for both skip-gram and CBOW. Our context window is shaded blue and includes ±2 words around the relevant term.
In the continuous bag-of-words problem, we are given the words in the context window. In the top position, these would be “what”, “if”, “was”, and “short”. We would then train a neural network to predict the word “mike”, highlighted in red. The illustration shows the context window as we move through the corpus, and each shift of the context window serves as a training example for our model.
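To make these training examples concrete, here is a small sketch that slides a fixed ±2 window over a toy sentence (the sentence simply extends the words shown in the figure) and collects CBOW (context, center) pairs:

```python
# Toy sentence extending the words from the figure; window of +/-2 words.
sentence = "what if mike was short and green".split()
window = 2

examples = []                          # (context words, center word) pairs
for i, center in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    examples.append((context, center))

print(examples[2])   # (['what', 'if', 'was', 'short'], 'mike')
```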
In the skip-gram problem, the roles are reversed. As noted above, we’re now predicting context from the relevant term, so in the top example we would want to predict the words in blue from the word in red. We note here that more distant words are down-weighted via random sampling: instead of fixing the width of the context window, we specify a maximum width C and randomly choose a width between 1 and C for each training example. A word at distance k from the relevant term is then observed (i.e. contributes to training) with probability (C - k + 1)/C, so a term directly adjacent to the center word is always observed.
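A sketch of skip-gram pair generation with this dynamic window; the sentence and the maximum width C are illustrative assumptions:

```python
import random

# Toy sentence again; C is the maximum context window width.
sentence = "what if mike was short and green".split()
C = 2

pairs = []                             # (center word, context word) examples
for i, center in enumerate(sentence):
    width = random.randint(1, C)       # sampled window width for this example
    for j in range(max(0, i - width), min(len(sentence), i + width + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# A context word at distance k <= C is kept with probability (C - k + 1) / C,
# so words directly adjacent to the center word are always kept.
print(pairs[:6])
```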
In either case, it’s just a simple supervised learning problem that we’re training our network on: the words we’re given are the features, and the word(s) we’re predicting are the labels. Both of these problems are the core of the word2vec embedding algorithm, which precedes GloVe and from which the GloVe authors draw several insights.
According to Mikolov et al., the authors of the word2vec paper, the two approaches differ slightly in performance:
Skip-gram: works well with small amount of the training data, represents well even rare
words or phrases.
CBOW: several times faster to train than the skip-gram, slightly better accuracy for the
frequent words.
The authors of the GloVe paper note, however, that these context window-based methods suffer from the disadvantage of not learning from the global corpus statistics. As a result, repetition and large-scale patterns may not be learned as well with these models as they are with global matrix factorization. In Part II, we’ll discuss the heart of the GloVe model and its performance compared to the existing word embedding generation algorithms we discussed above.
[EMNLP] What is GloVe? Part II: An introduction to unsupervised learning of word embeddings from co-occurrence matrices (towardsdatascience.com)