word2vec and friends
• Computers are really good at crunching numbers but not so much when it
comes to words.
• Perhaps we can represent words numerically?

Teaching machines to read!
a 1
about 2
above 3
after 4
again 5
against 6
all 7
am 8
an 9
and 10
any 11
are 12
aren't 13
as 14
… …
• Can we do it in a way that preserves semantic information?

• Words that have similar meanings are used in similar contexts, and the context
in which a word is used helps us understand its meaning.
The red house is beautiful.

The blue house is old.

The red car is beautiful.

The blue car is old.
“You shall know a word by the company it keeps”

(J. R. Firth)
$v_{\text{after}} = (0, 0, 0, 1, 0, 0, \cdots)$
$v_{\text{above}} = (0, 0, 1, 0, 0, 0, \cdots)$
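The one-hot vectors above can be built directly from the word index. The following is a minimal sketch, assuming a hypothetical six-word vocabulary taken from the start of the word list; the names `vocabulary`, `word_to_index` and `one_hot` are illustrative only.

```python
# Hypothetical toy vocabulary; indices start at 1 to match the word list above.
vocabulary = ["a", "about", "above", "after", "again", "against"]
word_to_index = {word: i for i, word in enumerate(vocabulary, start=1)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's position."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word] - 1] = 1
    return vector

v_after = one_hot("after")   # the 4th element is "hot"
v_above = one_hot("above")   # the 3rd element is "hot"

# Two different one-hot vectors are always orthogonal, so this encoding
# carries no information about how similar two words are.
dot = sum(a * b for a, b in zip(v_after, v_above))
```

This is why a plain index is not enough on its own: every pair of distinct words looks equally unrelated.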
➡Words with similar meanings should have similar representations.
➡From a word we can get some idea about the context where it might appear 

➡And from the context we have some idea about possible words
max p(w|C): predict the word from its context
The red _____ is beautiful.
The blue _____ is old.

max p(C|w): predict the context from the word
___ ___ house __ ____.
___ ___ car __ _______.

Skipgram: max p(C|w) (Word → Context)
Continuous Bag of Words: max p(w|C) (Context → Word)
[Figure: network architecture showing the one-hot input vector, word embeddings, activation function, and context embeddings (Mikolov 2013)]
• Let us take a closer look at a simplified case with a single context word
• Words are one-hot encoded vectors of length $V$: $w_j = (0, 0, 1, 0, 0, 0, \cdots)^T$
• $\Theta_1$ is an $(M \times V)$ matrix, so that when we take the product $\Theta_1 \cdot w_j$
• we are effectively selecting the j’th column of $\Theta_1$: $v_j = \Theta_1 \cdot w_j$
• The linear activation function simply passes this value along, which is then multiplied by $\Theta_2$, a $(V \times M)$ matrix.
• Each element $k$ of the output layer is then given by $u_k^T \cdot v_j$, where $u_k^T$ is the k’th row of $\Theta_2$
• We convert these values to a normalized probability distribution by using the softmax
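The column-selection step can be checked by hand. Here is a small sketch in plain Python, assuming hypothetical sizes M = 2 and V = 4 and made-up matrix entries; `matvec` is an illustrative helper, not part of any library.

```python
M, V = 2, 4  # hypothetical embedding size and vocabulary size

# Theta_1 is M x V: column j holds the embedding of word j.
theta_1 = [[0.1, 0.2, 0.3, 0.4],
           [0.5, 0.6, 0.7, 0.8]]

def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

w_j = [0, 0, 1, 0]           # one-hot vector with the hot element at index 2
v_j = matvec(theta_1, w_j)   # the product Theta_1 . w_j

# Reading the same column directly gives the same numbers.
column_j = [row[2] for row in theta_1]
```

In practice this means the product never has to be computed: an implementation can simply look up the column.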
• A standard way of converting a set of numbers to a normalized probability distribution: $\mathrm{softmax}(x)_j = \frac{\exp(x_j)}{\sum_l \exp(x_l)}$
• With this final ingredient we obtain: $p(w_k|w_j) \equiv \mathrm{softmax}(u_k^T \cdot v_j) = \frac{\exp(u_k^T \cdot v_j)}{\sum_l \exp(u_l^T \cdot v_j)}$
• Our goal is then to learn $\Theta_1$ and $\Theta_2$
• so that we can predict what the next word is likely to be using: $p(w_{j+1}|w_j)$
• But how can we quantify how far we are from the correct answer? Our error measure
shouldn’t be just binary (right or wrong)…
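As a rough illustration of the softmax step, the sketch below turns a list of scores into probabilities. The score values are invented for the example, and subtracting the maximum is a common numerical-stability trick that is not mentioned on the slide.

```python
import math

def softmax(scores):
    """Convert arbitrary real scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output-layer values u_k^T . v_j for a vocabulary of 4 words.
scores = [2.0, 1.0, 0.1, -1.0]
probabilities = softmax(scores)
```

The largest score always receives the largest probability, and every word keeps a non-zero probability.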
• First we have to recall that we are, in effect, comparing two probability distributions: $p(w_k|w_j)$
• and the one-hot encoding of the context: $w_{j+1} = (0, 0, 0, 1, 0, 0, \cdots)$
• The Cross Entropy measures the distance, in number of bits, between two probability
distributions p and q: $H(p, q) = -\sum_k p_k \log q_k$
• In our case, this becomes: $H[w_{j+1}, p(w_k|w_j)] = -\sum_k w_{j+1}^k \log p(w_k|w_j)$
• So it’s clear that the only non-zero term is the one that corresponds to the “hot” element of $w_{j+1}$: $H = -\log p(w_{j+1}|w_j)$
• This is our Error function. But how can we use this to update the values of $\Theta_1$ and $\Theta_2$?
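A short sketch of the cross entropy against a one-hot target, assuming an invented four-word prediction. Note that it uses the natural logarithm, so the value is in nats; using `math.log2` instead would give bits.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k log q_k; terms with p_k = 0 contribute nothing."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

target = [0, 0, 0, 1]              # one-hot encoding of the observed context word
predicted = [0.1, 0.2, 0.3, 0.4]   # hypothetical softmax output p(w_k | w_j)

# Only the "hot" element survives, so the loss reduces to -log(0.4).
loss = cross_entropy(target, predicted)
```

The loss shrinks towards zero as the probability assigned to the observed word approaches 1.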
Gradient Descent
• Find the gradient for each training batch
• Take a step downhill, against the direction of the gradient: $\theta_{mn} \leftarrow \theta_{mn} - \alpha \frac{\partial H}{\partial \theta_{mn}}$
• where $\alpha$ is the step size.
• Repeat until “convergence”.
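The update rule can be sketched on a toy problem. The example below assumes a one-parameter error function $E(\theta) = (\theta - 3)^2$ chosen purely for illustration; it is not the word2vec loss, and the stopping threshold is an arbitrary choice.

```python
def gradient(theta):
    """Gradient of the toy error function E(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

alpha = 0.1   # step size
theta = 0.0   # initial guess

for step in range(1000):
    g = gradient(theta)
    if abs(g) < 1e-8:          # a simple "convergence" criterion
        break
    theta = theta - alpha * g  # move against the gradient, i.e. downhill
```

With a step size that is too large the iterates would overshoot and diverge, which is why the step size usually has to be tuned.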
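• How can we calculate $\frac{\partial H}{\partial \theta_{mn}}$, for $\theta_{mn} \in \{\theta^{(1)}_{mn}, \theta^{(2)}_{mn}\}$?
• we rewrite: $H = -\log p(w_{j+1}|w_j) = -\log \frac{\exp(u_k^T \cdot v_j)}{\sum_l \exp(u_l^T \cdot v_j)}$, where $k$ is the index of the observed word $w_{j+1}$
• and expand: $H = -u_k^T \cdot v_j + \log \sum_l \exp(u_l^T \cdot v_j)$
• Then we can rewrite: $u_k^T \cdot v_j = \sum_q \theta^{(2)}_{kq} \theta^{(1)}_{qj}$
• and apply the chain rule: $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \frac{\partial g(x)}{\partial x}$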
Training procedures
• online learning - update weights after each case
  - might be useful to update model as new data is obtained
  - subject to fluctuations
• mini-batch - update weights after a “small” number of cases
  - batches should be balanced
  - if the dataset is redundant, the gradient estimated using only a fraction of the data is a good approximation to the full gradient.
• momentum - let the gradient change the velocity of the weight change instead of the value directly
• rmsprop - divide the learning rate for each weight by a running average of the magnitudes of “recent” gradients
• learning rate - vary over the course of the training procedure and use different learning rates for each weight
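The momentum and rmsprop rules can be compared on the same toy problem. This is a minimal sketch, assuming a one-parameter error function $E(w) = (w - 3)^2$ and commonly used but arbitrary hyperparameter values (`lr`, `mu`, `decay`, `eps`); real implementations apply the same updates element-wise to every weight.

```python
import math

def gradient(w):
    return 2.0 * (w - 3.0)   # toy error function E(w) = (w - 3)^2

def momentum_step(w, velocity, lr=0.1, mu=0.9):
    """The gradient changes the velocity; the velocity changes the weight."""
    velocity = mu * velocity - lr * gradient(w)
    return w + velocity, velocity

def rmsprop_step(w, mean_square, lr=0.01, decay=0.9, eps=1e-8):
    """Divide the step by the root of a running average of squared gradients."""
    g = gradient(w)
    mean_square = decay * mean_square + (1.0 - decay) * g * g
    return w - lr * g / (math.sqrt(mean_square) + eps), mean_square

w_m, velocity = 0.0, 0.0
w_r, mean_square = 0.0, 0.0
for _ in range(1000):
    w_m, velocity = momentum_step(w_m, velocity)
    w_r, mean_square = rmsprop_step(w_r, mean_square)
```

Because rmsprop normalizes the step by the recent gradient magnitude, its steps stay close to the learning rate in size, so with a fixed learning rate it ends up hovering near the minimum rather than settling exactly on it.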
SkipGram with Larger Contexts
• Use the same $\Theta_1$ for all context words.
• Use the average of cross entropy: $H = -\log p(w_{j+1}|w_j) \;\rightarrow\; H = -\frac{1}{T} \sum_t \log p(w_{j+t}|w_j)$, where $T$ is the number of context words
• Word order is not important (the average does not change)
• Can essentially be trained one context word at a time.
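Training one context word at a time amounts to turning the text into (word, context word) pairs. A possible sketch, assuming a symmetric window and using one of the example sentences from earlier in the deck; `skipgram_pairs` is an illustrative name.

```python
def skipgram_pairs(tokens, window):
    """Return all (center word, context word) pairs within a symmetric window."""
    pairs = []
    for j, center in enumerate(tokens):
        start = max(0, j - window)
        stop = min(len(tokens), j + window + 1)
        for t in range(start, stop):
            if t != j:   # a word is not its own context
                pairs.append((center, tokens[t]))
    return pairs

sentence = "the red house is beautiful".split()
pairs = skipgram_pairs(sentence, window=1)
```

Each pair is then an independent training example, which is why the order of the context words does not matter.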
Continuous Bag of Words
• The process is essentially the same
• Hierarchical Softmax:
  • Approximate the softmax using a binary tree
  • Reduce the number of calculations per training example from $V$ to $\log_2 V$ and
increase performance by orders of magnitude.
• Subsampling of frequent words:
  • Undersample the most frequent words by removing them from the text before
generating the contexts
  • Similar idea to removing stop-words: very frequent words are less informative.
  • Effectively makes the window larger, increasing the amount of information available for
• word2vec, even in its original formulation is actually a family
of algorithms using various combinations of:
• Skip-gram, CBOW
• Hierarchical Softmax, Negative Sampling
• The output of this neural network is deterministic:
  • If two words appear in the same context (“blue” vs “red”,
for example), they will have similar internal representations in $\Theta_1$ and $\Theta_2$
  • $\Theta_1$ and $\Theta_2$ are vector embeddings of the input words
and the context words respectively
• Words that are too rare are also removed.
• The original implementation had a dynamic window size:
  • for each word in the corpus a window size k’ is
sampled uniformly between 1 and k
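The dynamic window can be sketched as follows. This is an illustration of the idea only, not the original C code; the function name, the seed and the example sentence are assumptions made for the example.

```python
import random

def dynamic_window_pairs(tokens, k, rng):
    """For each position, sample a window size k' uniformly from 1..k."""
    pairs = []
    for j, center in enumerate(tokens):
        k_prime = rng.randint(1, k)   # randint is inclusive on both ends
        start = max(0, j - k_prime)
        stop = min(len(tokens), j + k_prime + 1)
        for t in range(start, stop):
            if t != j:
                pairs.append((center, tokens[t], abs(t - j)))
    return pairs

rng = random.Random(0)   # fixed seed so the sketch is reproducible
tokens = "the blue car is old and the red car is beautiful".split()
pairs = dynamic_window_pairs(tokens, k=3, rng=rng)
```

Immediate neighbors are always included, while words at distance k are included only when the largest window is drawn, so nearby words are effectively weighted more heavily.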
Online resources
• C - (the original one)
• Python/tensorflow -
• Both a minimalist and an efficient version are available in the tutorial
• Python/gensim -
• Pretrained embeddings:
• 30+ languages,
• 100+ languages trained using wikipedia:
• The embedding of each word is a function of the context it appears in: $v(\text{red}) = f(\text{context}(\text{red}))$
• Words that appear in similar contexts will have similar embeddings: $\text{context}(\text{red}) \approx \text{context}(\text{blue}) \implies v(\text{red}) \approx v(\text{blue})$
• “Distributional hypothesis” in linguistics
Geometrical relations
between contexts imply
semantic relations
between words!
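[Figure: embedding space showing Washington DC, with a capital context and a country context]
$v(\text{France}) - v(\text{Paris}) + v(\text{Rome}) = v(\text{Italy})$
$\vec{b} - \vec{a} + \vec{c} = \vec{d}$
What is the word d that is most similar to b and c and most dissimilar to a?
$\vec{d} = \arg\max_{\vec{x}} \frac{(\vec{b} - \vec{a} + \vec{c})^T}{\|\vec{b} - \vec{a} + \vec{c}\|} \vec{x}$
$\vec{d} \sim \arg\max_{\vec{x}} \left( \vec{b}^T \vec{x} - \vec{a}^T \vec{x} + \vec{c}^T \vec{x} \right)$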
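The analogy search can be sketched with a handful of vectors. The two-dimensional embeddings below are hand-picked for illustration and do not come from a trained model; the sketch also excludes the three query words from the candidates, which is a common convention rather than something stated on the slide.

```python
import math

# Hypothetical 2-d embeddings, hand-picked for illustration only.
embeddings = {
    "Paris":  [1.0, 3.0],
    "France": [1.0, 1.0],
    "Rome":   [3.0, 3.0],
    "Italy":  [3.0, 1.0],
    "Berlin": [5.0, 3.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b and c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

answer = analogy("Paris", "France", "Rome")   # France - Paris + Rome
```

With these toy vectors the target is exactly the vector stored for "Italy", so that word wins over the distractor "Berlin".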
A diversion…
• Let’s imagine I want to perform these calculations: $y = f(x)$, $z = g(y)$
• for some given $x$.
• To calculate $z$ we must follow a certain sequence of operations.
• Which can be shortened if we are interested in just the value of $y$
• In TensorFlow, this is called a Computational Graph and it’s the
most fundamental concept to understand
• Data, in the form of tensors, flows through the graph from inputs
to outputs
• TensorFlow is, essentially, a way of defining arbitrary computational
graphs in a way that can be automatically distributed and
• If we use base functions, tensorflow knows how to automatically calculate the respective
• Automatic BackProp
• Graphs can have multiple outputs
• Predictions
• Cost functions
• etc…
Computational Graphs
• After we have defined the computational graph, we can start using it
to make calculations
• All computations must take place within a “session” that defines the
values of all required input values
• Which values are required for a specific computation depends on
what part of the graph is actually being executed.
• When you request the value of a specific output, TensorFlow
determines which subgraph must be executed and which input
values are required.
• For optimization purposes, it can also execute independent parts of
the graph in different devices (CPUs, GPUs, TPUs, etc) at the same
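A basic TensorFlow program
# Written for the TensorFlow 1.x graph API. Under TensorFlow 2.x, use
# "import tensorflow.compat.v1 as tf" and call tf.disable_eager_execution() first.
import tensorflow as tf

# Inputs whose values are supplied when the graph is run
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
c = tf.constant(3.)

# Build the graph for z = c * (x + y)
m = tf.add(x, y)
z = tf.multiply(m, c)

with tf.Session() as sess:
    # Evaluate z, feeding x = 1 and y = 2
    output = sess.run(z, feed_dict={x: 1., y: 2.})
    print("Output value is:", output)

z = c * (x + y)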
Jupyter Notebook
Linguistic Change
• Train word embeddings for different years using
Google Books
• Independently trained embeddings differ by an
arbitrary rotation
• Align the different embeddings for different years
• Track the way in which the meaning of words shifted
over time!
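Removing an arbitrary rotation between two embeddings can be sketched in two dimensions, where the best rotation has a closed form (the 2-d case of the orthogonal Procrustes problem). This illustrates the alignment idea only; it is not necessarily the procedure used in the paper below, and the points and the 40 degree angle are invented.

```python
import math

def best_rotation_angle(source, target):
    """Angle of the 2-d rotation that best maps source onto target
    in the least-squares sense."""
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(source, target))
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(source, target))
    return math.atan2(cross, dot)

def rotate(points, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Hypothetical embeddings of the same three words trained on two different years.
year_a = [(1.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]
year_b = rotate(year_a, math.radians(40))   # same geometry, arbitrary rotation

angle = best_rotation_angle(year_b, year_a)
aligned = rotate(year_b, angle)             # year_b expressed in year_a's frame
```

Once the embeddings share a coordinate system, the distance a word's vector moves between years becomes meaningful. In higher dimensions the same problem is usually solved with a singular value decomposition.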
Statistically Significant Detection of Linguistic Change
Vivek Kulkarni
Stony Brook University, USA
Rami Al-Rfou
Stony Brook University, USA
Bryan Perozzi
Stony Brook University, USA
Steven Skiena
Stony Brook University, USA
We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts.
We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time.
We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval
Web Mining;Computational Linguistics
Natural languages are inherently dynamic, evolving over
time to accommodate the needs of their speakers. This
effect is especially prevalent on the Internet, where the rapid
exchange of ideas can change a word’s meaning overnight.
Figure 1: A 2-dimensional projection of the latent semantic space captured by our algorithm. Notice the semantic
trajectory of the word gay transitioning meaning in the space.
In this paper, we study the problem of detecting such
linguistic shifts on a variety of media including micro-blog
posts, product reviews, and books. Specifically, we seek to
detect the broadening and narrowing of semantic senses of
words, as they continually change throughout the lifetime of
a medium.
We propose the first computational approach for tracking
and detecting statistically significant linguistic shifts of
words. To model the temporal evolution of natural language,
we construct a time series per word. We investigate three
methods to build our word time series. First, we extract
Frequency based statistics to capture sudden changes in word
usage. Second, we construct Syntactic time series by analyzing
each word’s part of speech (POS) tag distribution.
Finally, we infer contextual cues from word co-occurrence
statistics to construct Distributional time series. In order to
detect and establish statistical significance of word changes
over time, we present a change point detection algorithm,
which is compatible with all methods.
Figure 1 illustrates a 2-dimensional projection of the latent
semantic space captured by our Distributional method. We
clearly observe the sequence of semantic shifts that the word
gay has undergone over the last century (1900-2005). Initially,
gay was an adjective that meant cheerful or dapper.
WWW’15, 625 (2015)
• You can generate a graph out of a sequence of words by assigning a node to each word and
connecting each word to its neighbors through edges.
• With this representation, a piece of text is a walk through the network. Then perhaps we can
invert the process? Use walks through a network to generate a sequence of nodes that can
be used to train node embeddings?
• Node embeddings should capture features of the network structure and allow for detection of
similarities between nodes.
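Inverting the process can be sketched with uniform random walks over a small graph. The graph, walk length and number of walks below are invented for illustration, and the walk is unbiased, unlike the biased walk that node2vec introduces.

```python
import random

# Hypothetical undirected graph stored as an adjacency list.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}

def random_walk(graph, start, length, rng):
    """Uniform (unbiased) random walk that visits `length` nodes."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(42)   # fixed seed so the sketch is reproducible
# A few walks per node play the role of "sentences" for a skip-gram model.
walks = [random_walk(graph, node, 5, rng) for node in graph for _ in range(3)]
```

Each walk can then be fed to a word2vec implementation exactly as if it were a sentence, with node names standing in for words.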
node2vec: Scalable Feature Learning for Networks
Aditya Grover
Stanford University
Jure Leskovec
Stanford University
Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks.
Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node’s network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations.
We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications - Data mining; I.2.6 [Artificial Intelligence]: Learning
General Terms: Algorithms; Experimentation.
Keywords: Information networks, Feature learning, Node embeddings, Graph representations.
Many important tasks in network analysis involve predictions over nodes and edges. In a typical node classification task, we are interested in predicting the most probable labels of nodes in a network [33]. For example, in a social network, we might be interested in predicting interests of users, or in a protein-protein interaction network we might be interested in predicting functional labels of proteins [25, 37]. Similarly, in link prediction, we wish to predict whether a pair of nodes in a network should have an edge connecting them [18]. Link prediction is useful in a wide variety of domains; for instance, in genomics, it helps us discover novel interactions between genes, and in social networks, it can identify real-world friends [2, 34].
Any supervised machine learning algorithm requires a set of informative, discriminating, and independent features. In prediction problems on networks this means that one has to construct a feature vector representation for the nodes and edges. A typical solution involves hand-engineering domain-specific features based on expert knowledge. Even if one discounts the tedious effort required for feature engineering, such features are usually designed for specific tasks and do not generalize across different prediction tasks.
An alternative approach is to learn feature representations by solving an optimization problem [4]. The challenge in feature learning is defining an objective function, which involves a trade-off in balancing computational efficiency and predictive accuracy. On one side of the spectrum, one could directly aim to find a feature representation that optimizes performance of a downstream prediction task. While this supervised procedure results in good accuracy, it comes at the cost of high training time complexity due to a blowup in the number of parameters that need to be estimated. At the other extreme, the objective function can be defined to be independent of the downstream prediction task and the representations can be learned in a purely unsupervised way. This makes the optimization computationally efficient and with a carefully designed objective, it results in task-independent features that closely match task-specific approaches in predictive accuracy [21, 23].
However, current techniques fail to satisfactorily define and optimize a reasonable objective required for scalable unsupervised feature learning in networks. Classic approaches based on linear and non-linear dimensionality reduction techniques such as Principal Component Analysis, Multi-Dimensional Scaling and their extensions [3, 27, 30, 35] optimize an objective that transforms a representative data matrix of the network such that it maximizes the variance of the data representation. Consequently, these approaches invariably involve eigendecomposition of the appropriate data matrix which is expensive for large real-world networks. Moreover, the resulting latent representations give poor performance on various prediction tasks over networks.
Alternatively, we can design an objective that seeks to preserve
local neighborhoods of nodes. The objective can be efficiently op-
KDD’16, 855 (2016)
• The features depend strongly on the way in which the network is
• Generate the contexts for each node using Breadth First Search and
Depth First Search
• Perform a biased Random Walk
Figure 1: BFS and DFS search strategies from node u (k = 3).
principles in network science, providing flexibility in discovering representations conforming to different equivalences.
3. We extend node2vec and other feature learning methods based on neighborhood preserving objectives, from nodes to pairs of nodes for edge-based prediction tasks.
4. We empirically evaluate node2vec for multi-label classification and link prediction on several real-world datasets.
The rest of the paper is structured as follows. In Section 2, we briefly survey related work in feature learning for networks. We present the technical details for feature learning using node2vec in Section 3. In Section 4, we empirically evaluate node2vec on prediction tasks over nodes and edges on various real-world networks and assess the parameter sensitivity, perturbation analysis, and scalability aspects of our algorithm. We conclude with a discussion of the node2vec framework and highlight some promising directions for future work in Section 5. Datasets and a refer-
Figure 2: Illustration of the random walk procedure in node2vec.
The walk just transitioned from t to v and is now evaluating its next
• BFS - Explores only limited neighborhoods. Suitable for structural
• DFS - Freely explores neighborhoods and covers homophilous communities
• By modifying the parameters of the model it can interpolate between the BFS and DFS
• Separate the genome into long non-overlapping DNA fragments.
• Convert long DNA fragments into overlapping variable length k-mers
• Train embeddings of each k-mer using the Gensim implementation of SkipGram.
• Summing embeddings is related to concatenating k-mers
• Cosine similarity of k-mer embeddings reproduces a biologically motivated
similarity score (Needleman-Wunsch) that is used to align nucleoti
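The k-mer step can be sketched as below. This is only one plausible way to produce overlapping variable-length k-mers and may differ from the dna2vec procedure; the fragment, the length range and the function name are assumptions made for the example.

```python
import random

def variable_length_kmers(fragment, k_low, k_high, rng):
    """Slide along the fragment one base at a time, sampling each k-mer's
    length uniformly from k_low..k_high, so consecutive k-mers overlap."""
    kmers = []
    for start in range(len(fragment) - k_high + 1):
        k = rng.randint(k_low, k_high)   # inclusive on both ends
        kmers.append(fragment[start:start + k])
    return kmers

rng = random.Random(1)           # fixed seed so the sketch is reproducible
fragment = "ATGGCCACGATTGCA"     # hypothetical DNA fragment
kmers = variable_length_kmers(fragment, k_low=3, k_high=5, rng=rng)
```

The resulting list of k-mers plays the role of a sentence, so it can be passed to a skip-gram implementation in place of a list of words.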
dna2vec: Consistent vector representations of
variable-length k-mers
Patrick Ng
One of the ubiquitous representation of long DNA sequence is dividing it into shorter k-mer components.
Unfortunately, the straightforward vector encoding of k-mer as a one-hot vector is vulnerable to the
curse of dimensionality. Worse yet, the distance between any pair of one-hot vectors is equidistant. This
is particularly problematic when applying the latest machine learning algorithms to solve problems in
biological sequence analysis. In this paper, we propose a novel method to train distributed representations
of variable-length k-mers. Our method is based on the popular word embedding model word2vec, which
is trained on a shallow two-layer neural network. Our experiments provide evidence that the summing
of dna2vec vectors is akin to nucleotides concatenation. We also demonstrate that there is correlation
between Needleman-Wunsch similarity score and cosine similarity of dna2vec vectors.
1 Introduction
The usage of k-mer representation has been a popular approach in analyzing long sequence of DNA fragments.
The k-mer representation is simple to understand and compute. Unfortunately, its straightforward vector
encoding as a one-hot vector (i.e. bit vector that consists of all zeros except for a single dimension) is
vulnerable to curse of dimensionality. Specifically, its one-hot vector has dimension exponential to the length
of k. For example, an 8-mer needs a bit vector of dimension $4^8 = 65536$. This is problematic when applying
the latest machine learning algorithms to solve problems in biological sequence analysis, due to the fact that
most of these tools prefer lower-dimensional continuous vectors as input (Suykens and Vandewalle, 1999;
Angermueller et al., 2016; Turian et al., 2010). Worse yet, the distance between any arbitrary pair of one-hot
vectors is equidistant, even though ATGGC should be closer to ATGGG than CACGA.
1.1 Word embeddings
The Natural Language Processing (NLP) research community has a long tradition of using bag-of-words with
one-hot vector, where its dimension is equal to the vocabulary size. Recently, there has been an explosion of
using word embeddings as inputs to machine learning algorithms, especially in the deep learning community
(Mikolov et al., 2013b; LeCun et al., 2015; Bengio et al., 2013). Word embeddings are vectors of real numbers
that are distributed representations of words.
A popular training technique for word embeddings, word2vec (Mikolov et al., 2013a), consists of using a
2-layer neural network that is trained on the current word and its surrounding context words (see Section
2.3). This reconstruction of context of words is loosely inspired by the linguistic concept of distributional
hypothesis, which states that words that appear in the same context have similar meaning (Harris, 1954).
Deep learning algorithms applied with word embeddings have had dramatic improvements in the areas of
machine translation (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014), summarization (Chopra
arXiv: 1701.06279 (2017)
• Apply word2vec to 40 years of stock market data
• Identify significant semantic similarities between companies working
in the same area
Thank you!
You can hear me speak more about
word2vec in this week’s podcast!

Contenu connexe


Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLPSatyam Saxena
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlpankit_ppt
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPMachine Learning Prague
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...Shuyo Nakatani
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
word2vec - From theory to practice
word2vec - From theory to practiceword2vec - From theory to practice
word2vec - From theory to practicehen_drik
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherMLReview
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distancesGanesh Borle
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra

Tendances (19)

DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
Intro to nlp
Intro to nlpIntro to nlp
Intro to nlp
Tomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLPTomáš Mikolov - Distributed Representations for NLP
Tomáš Mikolov - Distributed Representations for NLP
Science in text mining
Science in text miningScience in text mining
Science in text mining
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
[Yang, Downey and Boyd-Graber 2015] Efficient Methods for Incorporating Knowl...
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
word2vec - From theory to practice
word2vec - From theory to practiceword2vec - From theory to practice
word2vec - From theory to practice
Language models
Language modelsLanguage models
Language models
Yoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and WhitherYoav Goldberg: Word Embeddings What, How and Whither
Yoav Goldberg: Word Embeddings What, How and Whither
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow Thai Word Embedding with Tensorflow
Thai Word Embedding with Tensorflow
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval

Word2vec and Friends

  • 2. Teaching machines to read!
 • Computers are really good at crunching numbers, but not so much when it comes to words.
 • Perhaps we can represent words numerically?
 • Word indices: a = 1, about = 2, above = 3, after = 4, again = 5, against = 6, all = 7, am = 8, an = 9, and = 10, any = 11, are = 12, aren't = 13, as = 14, …
  • 3. Teaching machines to read!
 • Computers are really good at crunching numbers, but not so much when it comes to words.
 • Perhaps we can represent words numerically?
 • Can we do it in a way that preserves semantic information?
 • Words that have similar meanings are used in similar contexts, and the context in which a word is used helps us understand its meaning:
 The red house is beautiful.
 The blue house is old.
 The red car is beautiful.
 The blue car is old.
 “You shall know a word by the company it keeps” (J. R. Firth)
 • One-hot encoding:
 $v_{\text{after}} = (0, 0, 0, 1, 0, 0, \cdots)^T$
 $v_{\text{above}} = (0, 0, 1, 0, 0, 0, \cdots)^T$
  • 4. Teaching machines to read!
 ➡ Words with similar meanings should have similar representations.
 ➡ From a word we can get some idea about the context where it might appear: max p(C|w)
 ___ ___ house __ ____.
 ___ ___ car __ _______.
 ➡ And from the context we have some idea about possible words: max p(w|C)
 The red _____ is beautiful.
 The blue _____ is old.
 “You shall know a word by the company it keeps” (J. R. Firth)

  • 5. word2vec (Mikolov 2013)
 • Skipgram: max p(C|w), from Word to Context.
 • Continuous Bag of Words: max p(w|C), from Context to Word.
 • Diagram labels: one-hot vectors ($w_{j-1}$, $w_j$, $w_{j+1}$), word embeddings ($\Theta_1$), activation function, context embeddings ($\Theta_2$).
  • 6. Skipgram
 • Let us take a closer look at a simplified case with a single context word.
 • Words are one-hot encoded vectors of length $V$: $w_j = (0, 0, 1, 0, 0, 0, \cdots)^T$
 • $\Theta_1$ is an $(M \times V)$ matrix, so that when we take the product $\Theta_1 \cdot w_j$
 • we are effectively selecting the $j$'th column of $\Theta_1$: $v_j = \Theta_1 \cdot w_j$
 • The linear activation function simply passes this value along, which is then multiplied by $\Theta_2$, a $(V \times M)$ matrix.
 • Each element $k$ of the output layer is then given by: $u_k^T \cdot v_j$, where $u_k^T$ is the $k$'th row of $\Theta_2$.
 • We convert these values to a normalized probability distribution by using the softmax.
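The column-selection step on this slide can be checked with a few lines of plain Python. This is only a sketch: the sizes V = 5 and M = 3, the matrix entries, and the helper names `one_hot` and `matvec` are made up for illustration.

```python
# Toy sizes (assumptions for illustration): vocabulary V = 5, embedding size M = 3.
V, M = 5, 3

# Theta1 is an M x V matrix stored as a list of M rows; entry [m][j] = 10*m + j.
theta1 = [[10 * m + j for j in range(V)] for m in range(M)]


def one_hot(j, size):
    """Return a one-hot vector of the given size with a 1 at position j."""
    return [1 if i == j else 0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


j = 2
w_j = one_hot(j, V)          # (0, 0, 1, 0, 0)^T
v_j = matvec(theta1, w_j)    # Theta1 . w_j

# The product is exactly column j of Theta1, so an embedding lookup
# can be done as a column read instead of a full multiplication.
column_j = [row[j] for row in theta1]
print(v_j, column_j)
```

Because the product only ever reads one column, implementations usually store the embeddings as a lookup table rather than multiplying by a one-hot vector.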
  • 7. Softmax
 • A standard way of converting a set of numbers to a normalized probability distribution:
 $$\mathrm{softmax}(x)_j = \frac{\exp(x_j)}{\sum_l \exp(x_l)}$$
 • With this final ingredient we obtain:
 $$p(w_k | w_j) \equiv \mathrm{softmax}\left(u_k^T \cdot v_j\right) = \frac{\exp\left(u_k^T \cdot v_j\right)}{\sum_l \exp\left(u_l^T \cdot v_j\right)}$$
 • Our goal is then to learn $\Theta_1$ and $\Theta_2$
 • so that we can predict what the next word is likely to be using $p(w_{j+1} | w_j)$.
 • But how can we quantify how far we are from the correct answer? Our error measure shouldn’t be just binary (right or wrong)…
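The softmax defined on this slide can be sketched in a few lines. The scores below are made-up stand-ins for the products $u_k^T \cdot v_j$, and subtracting the maximum score is a common numerical safeguard that is not part of the slide.

```python
import math


def softmax(scores):
    """Convert a list of real-valued scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores u_k^T . v_j for a vocabulary of four words.
scores = [2.0, 1.0, 0.1, 1.0]
probs = softmax(scores)
print(probs)
```

The shift leaves the result unchanged, because the common factor cancels between numerator and denominator.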
  • 8. Cross-Entropy
 • First we have to recall that we are, in effect, comparing two probability distributions: $p(w_k | w_j)$
 • and the one-hot encoding of the context: $w_{j+1} = (0, 0, 0, 1, 0, 0, \cdots)^T$
 • The Cross Entropy measures the distance, in number of bits, between two probability distributions $p$ and $q$:
 $$H(p, q) = -\sum_k p_k \log q_k$$
 • In our case, this becomes:
 $$H\left[w_{j+1}, p(w_k | w_j)\right] = -\sum_k w_{j+1}^k \log p(w_k | w_j)$$
 • So it’s clear that the only non-zero term is the one that corresponds to the “hot” element of $w_{j+1}$:
 $$H = -\log p(w_{j+1} | w_j)$$
 • This is our Error function. But how can we use this to update the values of $\Theta_1$ and $\Theta_2$?
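A short numerical check of the reduction above, using made-up probabilities. Note that `math.log` is the natural logarithm, so this sketch measures the cross entropy in nats rather than bits; `math.log2` would give bits.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k), skipping terms where p_k is zero."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)


# Made-up model output p(w_k | w_j) over a vocabulary of four words.
predicted = [0.1, 0.2, 0.6, 0.1]
# One-hot encoding of the observed context word w_{j+1} (here the third word).
target = [0, 0, 1, 0]

loss = cross_entropy(target, predicted)
print(loss, -math.log(predicted[2]))   # the two values coincide
```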
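  • 9. Gradient Descent
 • Find the gradient $\frac{\partial H}{\partial \theta_{mn}}$ for each training batch.
 • Take a step downhill, against the direction of the gradient:
 $$\theta_{mn} \leftarrow \theta_{mn} - \alpha \frac{\partial H}{\partial \theta_{mn}}$$
 • where $\alpha$ is the step size.
 • Repeat until “convergence”.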
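The update rule can be illustrated on a one-parameter toy problem. The error function $H(\theta) = (\theta - 3)^2$, the step size and the stopping tolerance are all assumptions chosen for this sketch; they are not taken from the slides.

```python
def gradient_descent(grad, theta, alpha=0.1, tol=1e-8, max_steps=10_000):
    """Repeat theta <- theta - alpha * grad(theta) until the step is tiny."""
    for _ in range(max_steps):
        step = alpha * grad(theta)
        theta = theta - step
        if abs(step) < tol:        # a simple stand-in for "convergence"
            break
    return theta


# Toy error function H(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_min = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
print(theta_min)
```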
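  • 10. Chain-rule
 • How can we calculate
 $$\frac{\partial H}{\partial \theta_{mn}} = -\frac{\partial}{\partial \theta_{mn}} \log p(w_{j+1} | w_j), \qquad \theta_{mn} = \left\{ \theta^{(1)}_{mn}, \theta^{(2)}_{mn} \right\}$$
 • we rewrite:
 $$\frac{\partial H}{\partial \theta_{mn}} = -\frac{\partial}{\partial \theta_{mn}} \log \frac{\exp\left(u_k^T \cdot v_j\right)}{\sum_l \exp\left(u_l^T \cdot v_j\right)}$$
 • and expand:
 $$\frac{\partial H}{\partial \theta_{mn}} = -\frac{\partial}{\partial \theta_{mn}} \left[ u_k^T \cdot v_j - \log \sum_l \exp\left(u_l^T \cdot v_j\right) \right]$$
 • Then we can rewrite:
 $$u_k^T \cdot v_j = \sum_q \theta^{(2)}_{kq} \theta^{(1)}_{qj}$$
 • and apply the chain rule:
 $$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \frac{\partial g(x)}{\partial x}$$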
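Carrying the chain rule through for the entries of $\Theta_2$ gives $\partial H / \partial \theta^{(2)}_{mn} = \left(p(w_m | w_j) - \delta_{mk}\right) (v_j)_n$, where $k$ is the index of the observed context word. The sketch below compares that expression with a finite-difference estimate; the matrix, the vector and the target index are made-up values, and the function names are hypothetical.

```python
import math


def loss(theta2, v, k):
    """H = -log softmax(theta2 . v)[k] = -(u_k . v) + log sum_l exp(u_l . v)."""
    scores = [sum(a * b for a, b in zip(row, v)) for row in theta2]
    log_norm = math.log(sum(math.exp(s) for s in scores))
    return -scores[k] + log_norm


def grad_theta2(theta2, v, k):
    """Analytic gradient: dH/dtheta2[m][n] = (p_m - [m == k]) * v[n]."""
    scores = [sum(a * b for a, b in zip(row, v)) for row in theta2]
    total = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / total for s in scores]
    return [[(probs[m] - (1.0 if m == k else 0.0)) * v[n]
             for n in range(len(v))] for m in range(len(theta2))]


# Made-up values: V = 3 output words, M = 2 embedding dimensions, target word k = 1.
theta2 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]
v_j = [1.0, 2.0]
k = 1

analytic = grad_theta2(theta2, v_j, k)

# Central finite differences as an independent check of the chain-rule result.
eps = 1e-6
max_error = 0.0
for m in range(3):
    for n in range(2):
        plus = [row[:] for row in theta2]
        minus = [row[:] for row in theta2]
        plus[m][n] += eps
        minus[m][n] -= eps
        numeric = (loss(plus, v_j, k) - loss(minus, v_j, k)) / (2 * eps)
        max_error = max(max_error, abs(numeric - analytic[m][n]))
print(max_error)
```

The same comparison can be repeated for the entries of $\Theta_1$, whose gradient picks up one more factor from the chain rule.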
  • 11. Training procedures
 • online learning - update weights after each case
   - might be useful to update the model as new data is obtained
   - subject to fluctuations
 • mini-batch - update weights after a “small” number of cases
   - batches should be balanced
   - if the dataset is redundant, the gradient estimated using only a fraction of the data is a good approximation to the full gradient.
 • momentum - let the gradient change the velocity of the weight change instead of the value directly
 • rmsprop - divide the learning rate for each weight by a running average of the magnitudes of “recent” gradients
 • learning rate - vary over the course of the training procedure and use different learning rates for each weight
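The momentum and rmsprop bullets can be written as update rules for a single weight. This is a sketch under assumptions: the hyperparameter values, the small `eps` added for numerical safety, and the toy error function $H(\theta) = (\theta - 3)^2$ are all chosen for illustration only.

```python
import math


def momentum_step(theta, velocity, grad, alpha=0.1, mu=0.9):
    """The gradient changes the velocity; the velocity changes the weight."""
    velocity = mu * velocity - alpha * grad
    return theta + velocity, velocity


def rmsprop_step(theta, mean_square, grad, alpha=0.01, decay=0.9, eps=1e-8):
    """Scale the step by a running average of recent squared gradients."""
    mean_square = decay * mean_square + (1.0 - decay) * grad * grad
    theta = theta - alpha * grad / (math.sqrt(mean_square) + eps)
    return theta, mean_square


def grad(theta):
    """Gradient of the toy error function H(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


theta_m, velocity = 0.0, 0.0
theta_r, mean_square = 0.0, 0.0
for _ in range(1000):
    theta_m, velocity = momentum_step(theta_m, velocity, grad(theta_m))
    theta_r, mean_square = rmsprop_step(theta_r, mean_square, grad(theta_r))
print(theta_m, theta_r)
```

With a fixed learning rate, the rmsprop weight ends up hovering close to the minimum rather than settling exactly on it, which is one reason to vary the learning rate during training.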
  • 12. SkipGram with Larger Contexts
 • Use the same $\Theta_2$ for all context words.
 • Use the average of the cross entropy over the $T$ context words:
 $$H = -\log p(w_{j+1} | w_j) \quad \longrightarrow \quad H = -\frac{1}{T} \sum_t \log p(w_{j+t} | w_j)$$
 • Word order is not important (the average does not change).
 • Can essentially be trained one context word at a time.
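Training one context word at a time amounts to turning the text into (word, context word) pairs. A minimal sketch, assuming a symmetric window and whitespace tokenization; the function name `skipgram_pairs` is hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre word, context word) pairs within a symmetric window."""
    pairs = []
    for j, word in enumerate(tokens):
        lo = max(0, j - window)
        hi = min(len(tokens), j + window + 1)
        for t in range(lo, hi):
            if t != j:                       # a word is not its own context
                pairs.append((word, tokens[t]))
    return pairs


sentence = "the red house is beautiful".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each pair contributes one term $-\log p(w_{j+t} | w_j)$ to the averaged cross entropy.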
  • 13. Continuous Bag of Words
 • The process is essentially the same.
 [Diagram: CBOW network with $w_{j-1}$, $w_{j+1}$, $w_j$, $\Theta_1$ and $\Theta_2$.]
  • 14. Variations
 • Hierarchical Softmax:
   - Approximate the softmax using a binary tree.
   - Reduce the number of calculations per training example from $V$ to $\log_2 V$ and speed up training by orders of magnitude.
 • Negative Sampling:
   - Instead of normalizing over the full vocabulary, train the model to tell the observed context word apart from a few randomly sampled “negative” words.
 • Subsampling of frequent words:
   - Undersample the most frequent words by removing them from the text before generating the contexts.
   - Similar idea to removing stop-words: very frequent words are less informative.
   - Effectively makes the window larger, increasing the amount of information available for context.
  • 15. Comments
 • word2vec, even in its original formulation, is actually a family of algorithms using various combinations of:
   - Skip-gram, CBOW
   - Hierarchical Softmax, Negative Sampling
 • The output of this neural network is deterministic:
   - If two words appear in the same context (“blue” vs “red”, for example), they will have similar internal representations in $\Theta_1$ and $\Theta_2$.
   - $\Theta_1$ and $\Theta_2$ are vector embeddings of the input words and the context words respectively.
 • Words that are too rare are also removed.
 • The original implementation had a dynamic window size:
   - for each word in the corpus a window size $k'$ is sampled uniformly between 1 and $k$.
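The dynamic window can be sketched as follows. The function name, the fixed seed and the example sentence are assumptions for illustration; this is not the original C implementation.

```python
import random


def dynamic_window_pairs(tokens, max_window=5, seed=0):
    """Return (word, context word, distance) triples using a window size k'
    drawn uniformly from 1..max_window for each centre word."""
    rng = random.Random(seed)                   # seeded for reproducibility
    pairs = []
    for j, word in enumerate(tokens):
        k_prime = rng.randint(1, max_window)    # uniform on {1, ..., max_window}
        lo = max(0, j - k_prime)
        hi = min(len(tokens), j + k_prime + 1)
        for t in range(lo, hi):
            if t != j:
                pairs.append((word, tokens[t], abs(t - j)))
    return pairs


tokens = "the red house is beautiful and the blue house is old".split()
pairs = dynamic_window_pairs(tokens, max_window=3)
print(len(pairs))
```

Since a context word at distance d is only included when k' is at least d, nearby words end up in more training pairs than distant ones.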
  • 16. Online resources
 • C (the original one)
 • Python/tensorflow
   - Both a minimalist and an efficient version are available in the tutorial
 • Python/gensim
 • Pretrained embeddings:
   - 30+ languages
   - 100+ languages trained using Wikipedia
  • 23. Analogies
 • The embedding of each word is a function of the context it appears in (writing $v(\cdot)$ for the embedding of a word):
 $$v(\text{red}) = f(\mathrm{context}(\text{red}))$$
 • words that appear in similar contexts will have similar embeddings:
 $$\mathrm{context}(\text{red}) \approx \mathrm{context}(\text{blue}) \implies v(\text{red}) \approx v(\text{blue})$$
 • “Distributional hypothesis” in linguistics: “You shall know a word by the company it keeps” (J. R. Firth)
 • Geometrical relations between contexts imply semantic relations between words!
 [Diagram: Paris, Rome, Washington DC and Lisbon in a “Capital context”; France, Italy, USA and Portugal in a “Country context”.]
 $$v(\text{France}) - v(\text{Paris}) + v(\text{Rome}) = v(\text{Italy})$$
 $$\vec{b} - \vec{a} + \vec{c} = \vec{d}$$
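  • 24. Analogies
 • What is the word $d$ that is most similar to $b$ and $c$ and most dissimilar to $a$?
 $$\vec{b} - \vec{a} + \vec{c} = \vec{d}$$
 $$d^\dagger = \arg\max_x \frac{\left(\vec{b} - \vec{a} + \vec{c}\right)^T}{\left\lVert \vec{b} - \vec{a} + \vec{c} \right\rVert} \vec{x}$$
 $$d^\dagger \sim \arg\max_x \left( \vec{b}^T \vec{x} - \vec{a}^T \vec{x} + \vec{c}^T \vec{x} \right)$$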
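The argmax above can be tried on a handful of hand-made vectors. The embeddings below are invented so that the geometry is easy to follow; they are not trained word2vec vectors. Excluding the three query words from the candidates is a common convention and is an assumption here.

```python
import math

# Hand-made toy embeddings (an assumption for illustration, not trained vectors).
# Dimensions: [France-ness, Italy-ness, Portugal-ness, is-a-country].
embeddings = {
    "Paris":    [1.0, 0.0, 0.0, 0.0],
    "France":   [1.0, 0.0, 0.0, 1.0],
    "Rome":     [0.0, 1.0, 0.0, 0.0],
    "Italy":    [0.0, 1.0, 0.0, 1.0],
    "Lisbon":   [0.0, 0.0, 1.0, 0.0],
    "Portugal": [0.0, 0.0, 1.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c):
    """Return the word d maximising cos(b - a + c, d), excluding a, b and c."""
    query = [vb - va + vc for va, vb, vc in
             zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, embeddings[w]))


print(analogy("Paris", "France", "Rome"))   # a = Paris, b = France, c = Rome
```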
  • 27. A diversion…
 • Let’s imagine I want to perform these calculations:
 $$y = f(x)$$
 $$z = g(y)$$
 • for some given $x$.
 • To calculate $z$ we must follow a certain sequence of operations: apply $f$, assign $y$, apply $g$, assign $z$.
  • 28. A diversion…
 • Let’s imagine I want to perform these calculations:
 $$y = f(x)$$
 $$z = g(y)$$
 • for some given $x$.
 • To calculate $z$ we must follow a certain sequence of operations: apply $f$, assign $y$, apply $g$, assign $z$.
 • Which can be shortened if we are interested in just the value of $y$.
 • In Tensorflow, this is called a Computational Graph and it’s the most fundamental concept to understand.
 • Data, in the form of tensors, flows through the graph from inputs to outputs.
 • Tensorflow is, essentially, a way of defining arbitrary computational graphs in a way that can be automatically distributed and optimized.
  • 29. Computational Graphs
 • If we use base functions, tensorflow knows how to automatically calculate the respective gradients
   - Automatic BackProp
 • Graphs can have multiple outputs
   - Predictions
   - Cost functions
   - etc…
  • 30. Sessions
 • After we have defined the computational graph, we can start using it to make calculations.
 • All computations must take place within a “session” that defines the values of all required inputs.
 • Which values are required for a specific computation depends on what part of the graph is actually being executed.
 • When you request the value of a specific output, tensorflow determines which specific subgraph must be executed and which input values are required.
 • For optimization purposes, it can also execute independent parts of the graph on different devices (CPUs, GPUs, TPUs, etc.) at the same time.
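The idea of executing only the subgraph that a requested output depends on can be mimicked in plain Python. This is a toy sketch, not how Tensorflow is implemented; the `Node` class, the `run` function and the functions chosen for $f$ and $g$ are all hypothetical.

```python
class Node:
    """A node in a toy computational graph: a placeholder or an operation."""

    def __init__(self, name, op=None, inputs=()):
        self.name = name
        self.op = op              # None marks a placeholder that must be fed
        self.inputs = inputs


def run(node, feed, evaluated=None):
    """Evaluate `node`, visiting only the subgraph it depends on."""
    if evaluated is None:
        evaluated = []
    if node.op is None:
        return feed[node.name]    # KeyError if a required input is missing
    args = [run(parent, feed, evaluated) for parent in node.inputs]
    evaluated.append(node.name)   # record which operations were executed
    return node.op(*args)


# y = f(x), z = g(y), with made-up f and g.
x = Node("x")
y = Node("y", op=lambda v: v + 1, inputs=(x,))
z = Node("z", op=lambda v: v * 10, inputs=(y,))

trace_y = []
value_y = run(y, {"x": 2}, trace_y)   # only y is computed
trace_z = []
value_z = run(z, {"x": 2}, trace_z)   # y is computed first, then z
print(value_y, trace_y, value_z, trace_z)
```

Requesting `y` never touches `z`, which mirrors the point that the required inputs and operations depend on which output is requested.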
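  • 32. A basic Tensorflow program (@bgoncalves)
 • The program computes $z = c \cdot (x + y)$: x and y are placeholders, c is a constant, and the graph applies add and then multiply.

    # Tensorflow 1.x API; under Tensorflow 2.x the same calls are available
    # through tf.compat.v1 with eager execution disabled.
    import tensorflow as tf

    # Inputs whose values are supplied when the graph is run
    x = tf.placeholder(tf.float32)
    y = tf.placeholder(tf.float32)
    c = tf.constant(3.)

    # Graph definition: m = x + y, z = m * c
    m = tf.add(x, y)
    z = tf.multiply(m, c)

    # Run the graph inside a session, feeding the placeholder values
    with tf.Session() as sess:
        output = sess.run(z, feed_dict={x: 1., y: 2.})
        print("Output value is:", output)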
  • 33. A basic Tensorflow program (@bgoncalves)
 • The same program, annotated with the values flowing through the graph: the inputs x = 1 and y = 2 are fed to the placeholders, add gives 3, and multiply by the constant c = 3 gives $z = c \cdot (x + y) = 9$.
  • 35. Linguistic Change • Train word embeddings for different years using Google Books • Independently trained embeddings differ by an arbitrary rotation • Align the different embeddings for different years • Track the way in which the meaning of words shifted over time! Statistically Significant Detection of Linguistic Change Vivek Kulkarni Stony Brook University, USA Rami Al-Rfou Stony Brook University, USA Bryan Perozzi Stony Brook University, USA Steven Skiena Stony Brook University, USA ABSTRACT We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the mean- ing and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algo- rithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector represen- tations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by track- ing linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Book Ngrams. Our analysis reveals interesting patterns of language usage change commensurate with each medium. 
Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval Keywords Web Mining;Computational Linguistics 1. INTRODUCTION Natural languages are inherently dynamic, evolving over time to accommodate the needs of their speakers. This e↵ect is especially prevalent on the Internet, where the rapid exchange of ideas can change a word’s meaning overnight. Figure 1: A 2-dimensional projection of the latent seman- tic space captured by our algorithm. Notice the semantic trajectory of the word gay transitioning meaning in the space. In this paper, we study the problem of detecting such linguistic shifts on a variety of media including micro-blog posts, product reviews, and books. Specifically, we seek to detect the broadening and narrowing of semantic senses of words, as they continually change throughout the lifetime of a medium. We propose the first computational approach for track- ing and detecting statistically significant linguistic shifts of words. To model the temporal evolution of natural language, we construct a time series per word. We investigate three methods to build our word time series. First, we extract Frequency based statistics to capture sudden changes in word usage. Second, we construct Syntactic time series by ana- lyzing each word’s part of speech (POS) tag distribution. Finally, we infer contextual cues from word co-occurrence statistics to construct Distributional time series. In order to detect and establish statistical significance of word changes over time, we present a change point detection algorithm, which is compatible with all methods. Figure 1 illustrates a 2-dimensional projection of the latent semantic space captured by our Distributional method. We clearly observe the sequence of semantic shifts that the word gay has undergone over the last century (1900-2005). Ini- tially, gay was an adjective that meant cheerful or dapper. 
WWW’15, 625 (2015)Statistically Significant Detection of Linguistic Change Vivek Kulkarni Stony Brook University, USA Rami Al-Rfou Stony Brook University, USA Bryan Perozzi Stony Brook University, USA Steven Skiena Stony Brook University, USA ABSTRACT We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the mean- ing and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word’s meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algo- rithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector represen- tations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time. We demonstrate that our approach is scalable by track- ing linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using talkative profligate courageous apparitional dapper sublimely unembarrassed courteous sorcerers metonymy religious adolescents philanthropist illiterate transgendered artisans healthy gays homosexual transgender lesbian statesman hispanic uneducated gay1900 gay1950 gay1975 gay1990 gay2005 cheerful Figure 1: A 2-dimensional projection of the latent seman- tic space captured by our algorithm. Notice the semantic trajectory of the word gay transitioning meaning in the space. 
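The "distance-based distributional time series" mentioned in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the per-period vectors are taken to be already aligned to one coordinate system, cosine distance is an assumed choice of metric, and the function names and the 2-d toy vectors are hypothetical, not the authors' code or data.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def displacement_series(vectors_by_period):
    """Distance of a word's vector in each period from its first-period vector.

    vectors_by_period: list of (period_label, vector) pairs. The vectors are
    ASSUMED to be already aligned to a common coordinate system; no alignment
    is performed here.
    """
    _, reference = vectors_by_period[0]
    return [(label, cosine_distance(reference, vec))
            for label, vec in vectors_by_period]


# Hypothetical hand-made vectors for a single word across five periods.
toy = [
    (1900, [1.0, 0.0]),
    (1950, [0.9, 0.1]),
    (1975, [0.5, 0.5]),
    (1990, [0.1, 0.9]),
    (2005, [0.0, 1.0]),
]
series = displacement_series(toy)
for year, distance in series:
    print(year, round(distance, 3))
```

A word whose usage is stable would give a flat series near zero; a change point detection step would then look for the period where the series shifts.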
  • 36. node2vec
• You can generate a graph out of a sequence of words by assigning a node to each word and connecting each word to its neighbors in the text with edges.
• With this representation, a piece of text is a walk through the network. Can we invert the process, and use walks through a network to generate sequences of nodes that can be used to train node embeddings?
• Node embeddings should capture features of the network structure and allow for detection of similarities between nodes.
node2vec: Scalable Feature Learning for Networks
Aditya Grover (Stanford University), Jure Leskovec (Stanford University)
ABSTRACT
Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks.
Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node’s network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations.
We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains.
Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.
Categories and Subject Descriptors: H.2.8 [Database Management]: Database applications - Data mining; I.2.6 [Artificial Intelligence]: Learning
General Terms: Algorithms; Experimentation.
Keywords: Information networks, Feature learning, Node embeddings, Graph representations.
1. INTRODUCTION
Many important tasks in network analysis involve predictions over nodes and edges. In a typical node classification task, we are interested in predicting the most probable labels of nodes in a network [33]. For example, in a social network, we might be interested in predicting interests of users, or in a protein-protein interaction network we might be interested in predicting functional labels of proteins [25, 37]. Similarly, in link prediction, we wish to predict whether a pair of nodes in a network should have an edge connecting them [18]. Link prediction is useful in a wide variety of domains; for instance, in genomics, it helps us discover novel interactions between genes, and in social networks, it can identify real-world friends [2, 34].
Any supervised machine learning algorithm requires a set of informative, discriminating, and independent features. In prediction problems on networks this means that one has to construct a feature vector representation for the nodes and edges. A typical solution involves hand-engineering domain-specific features based on expert knowledge. Even if one discounts the tedious effort required for feature engineering, such features are usually designed for specific tasks and do not generalize across different prediction tasks.
An alternative approach is to learn feature representations by solving an optimization problem [4]. The challenge in feature learning is defining an objective function, which involves a trade-off in balancing computational efficiency and predictive accuracy. On one side of the spectrum, one could directly aim to find a feature representation that optimizes performance of a downstream prediction task. While this supervised procedure results in good accuracy, it comes at the cost of high training time complexity due to a blowup in the number of parameters that need to be estimated. At the other extreme, the objective function can be defined to be independent of the downstream prediction task and the representations can be learned in a purely unsupervised way. This makes the optimization computationally efficient and with a carefully designed objective, it results in task-independent features that closely match task-specific approaches in predictive accuracy [21, 23].
However, current techniques fail to satisfactorily define and optimize a reasonable objective required for scalable unsupervised feature learning in networks. Classic approaches based on linear and non-linear dimensionality reduction techniques such as Principal Component Analysis, Multi-Dimensional Scaling and their extensions [3, 27, 30, 35] optimize an objective that transforms a representative data matrix of the network such that it maximizes the variance of the data representation. Consequently, these approaches invariably involve eigendecomposition of the appropriate data matrix which is expensive for large real-world networks. Moreover, the resulting latent representations give poor performance on various prediction tasks over networks.
Alternatively, we can design an objective that seeks to preserve local neighborhoods of nodes. The objective can be efficiently op-
arXiv:1607.00653v1 [cs.SI] 3 Jul 2016
KDD’16, 855 (2016)
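The idea on this slide, text as a walk on a word graph and walks as "sentences" of nodes, can be sketched as follows. The helper names, the window size, and the use of plain uniform random walks are assumptions made for illustration; the biased walk that node2vec proposes is not implemented in this block.

```python
import random
from collections import defaultdict


def text_to_graph(sentences, window=1):
    """Build an undirected word graph.

    One node per distinct word; an edge joins two different words that occur
    within `window` positions of each other in the same sentence.
    """
    graph = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                if tokens[j] != word:
                    graph[word].add(tokens[j])
                    graph[tokens[j]].add(word)
    return dict(graph)


def uniform_random_walk(graph, start, length, rng):
    """Walk of up to `length` nodes, choosing each next node uniformly."""
    walk = [start]
    while len(walk) < length:
        neighbors = sorted(graph[walk[-1]])
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk


sentences = [
    "the red house is beautiful".split(),
    "the blue house is old".split(),
]
graph = text_to_graph(sentences)
rng = random.Random(0)
# One walk per node; each walk plays the role of a sentence for skip-gram.
walks = [uniform_random_walk(graph, node, 5, rng) for node in sorted(graph)]
for walk in walks:
    print(" ".join(walk))
```

The resulting walks could be passed to any skip-gram implementation in place of tokenized sentences.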
  • 37. node2vec
• The features depend strongly on the way in which the network is traversed.
• Generate the contexts for each node using Breadth-First Search (BFS) and Depth-First Search (DFS).
• Perform a biased random walk.
KDD’16, 855 (2016)
Figure 1: BFS and DFS search strategies from node u (k = 3).
[...] principles in network science, providing flexibility in discovering representations conforming to different equivalences.
3. We extend node2vec and other feature learning methods based on neighborhood preserving objectives, from nodes to pairs of nodes for edge-based prediction tasks.
4. We empirically evaluate node2vec for multi-label classification and link prediction on several real-world datasets.
The rest of the paper is structured as follows. In Section 2, we briefly survey related work in feature learning for networks. We present the technical details for feature learning using node2vec in Section 3. In Section 4, we empirically evaluate node2vec on prediction tasks over nodes and edges on various real-world networks and assess the parameter sensitivity, perturbation analysis, and scalability aspects of our algorithm. We conclude with a discussion of the node2vec framework and highlight some promising directions for future work in Section 5.
Figure 2: Illustration of the random walk procedure in node2vec. The walk just transitioned from t to v and is now evaluating its next [...] (Edge labels in the figure: α = 1/p for the step back to t, α = 1 toward x1, α = 1/q toward x2 and x3.)
• BFS: explores only limited neighborhoods. Suitable for structural equivalences.
• DFS: freely explores neighborhoods and covers homophilous communities.
• By modifying the parameters of the model (p and q in Figure 2), it can interpolate between the BFS and DFS extremes.
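The biased walk of Figure 2 can be sketched as below. The sketch assumes an unweighted, undirected graph stored as a dict of neighbor sets; the function names are hypothetical, and the toy graph simply mirrors the nodes t, v, x1, x2, x3 from the figure. The weights follow the labels in the figure: 1/p for stepping back to the previous node, 1 for a node that is also a neighbor of the previous node, and 1/q for anything further away.

```python
import random


def transition_weights(graph, prev, curr, p, q):
    """Unnormalized weights for the next step, given the walk went prev -> curr."""
    weights = {}
    for x in graph[curr]:
        if x == prev:
            weights[x] = 1.0 / p      # return to the previous node
        elif x in graph[prev]:
            weights[x] = 1.0          # stay close: x is also adjacent to prev
        else:
            weights[x] = 1.0 / q      # move outward, away from prev
    return weights


def biased_walk(graph, start, length, p, q, rng):
    """Second-order random walk of up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        curr = walk[-1]
        neighbors = sorted(graph[curr])
        if not neighbors:
            break
        if len(walk) == 1:
            # No previous node yet, so the first step is uniform.
            walk.append(rng.choice(neighbors))
            continue
        weights = transition_weights(graph, walk[-2], curr, p, q)
        walk.append(rng.choices(neighbors,
                                weights=[weights[x] for x in neighbors],
                                k=1)[0])
    return walk


# Toy graph mirroring Figure 2: x1 is adjacent to both t and v.
graph = {
    "t": {"v", "x1"},
    "v": {"t", "x1", "x2", "x3"},
    "x1": {"t", "v"},
    "x2": {"v"},
    "x3": {"v"},
}
weights = transition_weights(graph, "t", "v", p=2.0, q=4.0)
walk = biased_walk(graph, "t", 6, p=2.0, q=4.0, rng=random.Random(0))
print(weights)
print(walk)
```

With a large q the walk tends to stay near its previous node (BFS-like behavior), while a small q pushes it outward (DFS-like behavior); p controls how readily it backtracks.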
  • 38. dna2vec
• Separate the genome into long non-overlapping DNA fragments.
• Convert the long DNA fragments into overlapping variable-length k-mers.
• Train embeddings of each k-mer using the Gensim implementation of SkipGram.
• Summing embeddings is related to concatenating k-mers.
• Cosine similarity of k-mer embeddings reproduces a biologically motivated similarity score (Needleman-Wunsch) that is used to align nucleotide sequences.
dna2vec: Consistent vector representations of variable-length k-mers
Patrick Ng
Abstract
One of the ubiquitous representation of long DNA sequence is dividing it into shorter k-mer components. Unfortunately, the straightforward vector encoding of k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, the distance between any pair of one-hot vectors is equidistant. This is particularly problematic when applying the latest machine learning algorithms to solve problems in biological sequence analysis. In this paper, we propose a novel method to train distributed representations of variable-length k-mers. Our method is based on the popular word embedding model word2vec, which is trained on a shallow two-layer neural network. Our experiments provide evidence that the summing of dna2vec vectors is akin to nucleotides concatenation. We also demonstrate that there is correlation between Needleman-Wunsch similarity score and cosine similarity of dna2vec vectors.
1 Introduction
The usage of k-mer representation has been a popular approach in analyzing long sequence of DNA fragments. The k-mer representation is simple to understand and compute. Unfortunately, its straightforward vector encoding as a one-hot vector (i.e. bit vector that consists of all zeros except for a single dimension) is vulnerable to curse of dimensionality. Specifically, its one-hot vector has dimension exponential to the length of k. For example, an 8-mer needs a bit vector of dimension 4^8 = 65536.
This is problematic when applying the latest machine learning algorithms to solve problems in biological sequence analysis, due to the fact that most of these tools prefer lower-dimensional continuous vectors as input (Suykens and Vandewalle, 1999; Angermueller et al., 2016; Turian et al., 2010). Worse yet, the distance between any arbitrary pair of one-hot vectors is equidistant, even though ATGGC should be closer to ATGGG than CACGA.
1.1 Word embeddings
The Natural Language Processing (NLP) research community has a long tradition of using bag-of-words with one-hot vector, where its dimension is equal to the vocabulary size. Recently, there has been an explosion of using word embeddings as inputs to machine learning algorithms, especially in the deep learning community (Mikolov et al., 2013b; LeCun et al., 2015; Bengio et al., 2013). Word embeddings are vectors of real numbers that are distributed representations of words.
A popular training technique for word embeddings, word2vec (Mikolov et al., 2013a), consists of using a 2-layer neural network that is trained on the current word and its surrounding context words (see Section 2.3). This reconstruction of context of words is loosely inspired by the linguistic concept of distributional hypothesis, which states that words that appear in the same context have similar meaning (Harris, 1954). Deep learning algorithms applied with word embeddings have had dramatic improvements in the areas of machine translation (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014), summarization (Chopra
arXiv:1701.06279v1 [q-bio.QM] 23 Jan 2017
arXiv: 1701.06279 (2017)
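A rough sketch of the k-mer step from the slide, together with the one-hot dimension quoted in the excerpt. Drawing each k uniformly from a range, the function names, and the example fragment are assumptions made for illustration; the training step itself (for example with Gensim) is left out.

```python
import random

ALPHABET = "ACGT"


def one_hot_dimension(k):
    """Number of distinct k-mers over the 4-letter DNA alphabet."""
    return len(ALPHABET) ** k


def variable_length_kmers(fragment, k_low, k_high, rng):
    """Overlapping k-mers: slide one base at a time, drawing each length at random.

    ASSUMPTION: each k is drawn uniformly from [k_low, k_high]; a draw that
    would run past the end of the fragment is skipped.
    """
    kmers = []
    for start in range(len(fragment)):
        k = rng.randint(k_low, k_high)
        if start + k <= len(fragment):
            kmers.append(fragment[start:start + k])
    return kmers


print(one_hot_dimension(8))  # 4**8 = 65536, the figure quoted above

fragment = "ATGGCATGGGCACGA"  # hypothetical fragment, for illustration only
kmers = variable_length_kmers(fragment, 3, 8, random.Random(0))
print(kmers)
```

The list of k-mers then plays the role of a tokenized sentence when training skip-gram embeddings.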
  • 39. stock2vec
• Apply word2vec to 40 years of stock market data.
• Identify significant semantic similarities between companies working in the same area.
  • 41. Thank you! You can hear me speak more about word2vec in this week's podcast!