Research on word representation models, or word embeddings, has gained a lot of attention in recent years thanks to Word2Vec by Mikolov et al. The main purpose of this work is to validate previously proposed experiments for the English language and then to find out whether it is possible to reproduce the same accuracy and performance with the Italian language.
Word2Vec on Italian language
1. Sapienza University of Rome • 12 April 2016 • Course: Neural Networks
Cucari Francesco
De Cillis Daniele
Molinari Dario
I. Introduction
Research on word representation models, or word embeddings, has gained a lot of attention in
recent years [1], partly thanks to renewed progress in neural network technologies such as deep
learning. This progress has made it possible to train more complex models on much larger data
sets. One of the most popular works in this line is Mikolov's word2vec [2][3], which is built on
distributed representations of words.
Distributed word representations, also known as word embeddings, represent a word as a vector
in $\mathbb{R}^n$. The closer two vectors are, the more the corresponding words are deemed to
share some degree of syntactic or semantic similarity [4]. For example, the result of the vector
calculation vec('Paris') - vec('France') + vec('Italy') is closer to vec('Rome') than to any other
word vector.¹
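The vector arithmetic above can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration (real embeddings are learned and have far more dimensions), and the helper names `cosine_similarity` and `analogy` are hypothetical, not part of word2vec.

```python
import math

# Toy embeddings, made up for this example only.
# Assumed axes: [is-capital, France-ness, Italy-ness]
embeddings = {
    "paris":  [1.0, 1.0, 0.0],
    "france": [0.0, 1.0, 0.0],
    "rome":   [1.0, 0.0, 1.0],
    "italy":  [0.0, 0.0, 1.0],
    "milan":  [0.2, 0.0, 1.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three query words are excluded from the candidates (an assumption
    of this sketch).
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

result = analogy("paris", "france", "italy", embeddings)
print(result)
```

With these hand-made vectors the query lands exactly on the vector for "rome"; with learned embeddings the match is only approximate.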
The main purpose of this work is to validate previously proposed experiments for the English
language and then to find out whether it is possible to reproduce the same accuracy and
performance with the Italian language, using our implementation of word2vec.
II. Corpus
A corpus is a collection of texts made up of many sentences on various topics, so that the
neural network can learn the meaning of each word from the contexts in which it appears.
Two of the best-known Italian corpora are Paisà² and itWaC³. Both include texts from the
web, such as chat conversations, book extracts and articles. However, neither is formatted
properly for our purpose, so we preferred to build a corpus manually. Our corpus combines two
datasets in Italian: half of the dump of the Italian Wikipedia⁴ (dated 06/03/2016) and a
collection of about 120,000 Italian articles written in 2010, taken from the archive of “La Repubblica”.
We trained the network on corpora of different sizes. Initially, we trained the model on an
English corpus called text8⁵, about 100 MB in size, corresponding to about 17 million
words. Finally, we trained the model on our Italian corpus, which is 920 MB in size,
equivalent to about 122 million words.
¹ https://code.google.com/archive/p/word2vec/ Google Code archive: word2vec - Tool for computing continuous distributed representations of words.
² http://www.corpusitaliano.it/
³ http://wacky.sslmit.unibo.it/doku.php?id=corpora
⁴ http://dumps.wikimedia.org/itwiki/latest/
⁵ http://mattmahoney.net/dc/textdata
Before training the network, the corpus needs some preprocessing. First, punctuation must be
removed: we need plain text, because full stops, colons, commas and other symbols
can be interpreted as distinct words by the algorithm. Then, words are converted to lowercase
and numerals are converted to their word forms (e.g. 1992 becomes one nine nine two). Finally,
we noticed that eliminating the so-called stopwords (articles, conjunctions, prepositions, etc.)
from the Italian corpus gave better performance.
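The preprocessing steps described above could be sketched as follows. This is not the authors' script: the function name `preprocess`, the tiny stopword set and the digit-by-digit English spelling (following the "one nine nine two" example) are assumptions made for illustration.

```python
import re

# English digit names, following the "1992 -> one nine nine two" example.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"il", "la", "l", "di", "e", "a", "in", "un", "nel"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, spell out digits, strip punctuation and drop stopwords."""
    text = text.lower()
    # Replace each digit with its word form, padded with spaces.
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    # Replace punctuation and underscores with spaces (accented letters are kept).
    text = re.sub(r"[^\w\s]|_", " ", text)
    return [token for token in text.split() if token not in stopwords]

tokens = preprocess("Nel 1992, la Francia e l'Italia!")
print(tokens)
```

The order of the lowercasing and punctuation steps does not affect the result here; what matters is that the output is a plain list of lowercase tokens.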
After the preprocessing phase, it is possible to train the network. Our model takes as input
a subset of the words in the corpus, called the vocabulary, whose size is essential: with a
vocabulary that is too small we do not have enough words to compute similarities correctly,
while with one that is too large we take into account many words that are useless for our
purpose. So, we decided to keep only the words that appear more than 10 times in the corpus.
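A minimal sketch of this frequency cut-off, assuming the corpus is already a list of tokens. The function name `build_vocabulary` and the toy token list are illustrative only.

```python
from collections import Counter

def build_vocabulary(tokens, min_count=10):
    """Map each word that occurs MORE than min_count times to an integer id.

    Ids are assigned in order of decreasing frequency.
    """
    counts = Counter(tokens)
    kept = [word for word, count in counts.most_common() if count > min_count]
    word_to_id = {word: index for index, word in enumerate(kept)}
    return word_to_id, counts

# Toy corpus: "paraurti" is too rare and is left out of the vocabulary.
tokens = ["roma"] * 12 + ["milano"] * 11 + ["paraurti"] * 3
word_to_id, counts = build_vocabulary(tokens, min_count=10)
print(word_to_id)
```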
III. Network architecture
Word2vec is a neural network composed of three layers: an input, a hidden and an output layer.
Its two main architectures, Continuous Bag Of Words (CBOW) and Skip-gram, are structurally
similar.
I. CBOW
CBOW is the original version of word2vec. In our case, we implement a CBOW model (Fig. 1) with a
context of two words, namely the word to the left of the target word and the one to its right.
From this context the model predicts the target word as output.
Figure 1: CBOW architecture
As shown in Fig. 1, V is the vocabulary size and N is the embedding size. $x_{1k}$ and $x_{2k}$ are
the units of two one-hot encoded vectors that represent the input words, i.e. vectors of size V
with one unit equal to one and all the others equal to zero.
The weights between the input layer and the hidden layer can be represented by a V×N matrix
W. Each row of W is the N-dimensional vector representation $v_w$ of the associated word of the
input layer. Therefore, the output of the hidden layer is:
$$h = \frac{1}{C} W^{T} (x_1 + x_2 + \dots + x_C) = \frac{1}{C} (v_{w_1} + v_{w_2} + \dots + v_{w_C}) \tag{1}$$
where C is the number of words in the context, $w_1, \dots, w_C$ are the words in the context, and
$v_w$ is the input vector of a word w. Note that the activation function of the hidden layer units
is simply linear, i.e. it passes the weighted sum of inputs directly to the next layer.
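As a hedged illustration of Eq. (1), the sketch below checks numerically that projecting the averaged one-hot inputs through W gives the same result as averaging the corresponding rows of W. The sizes, the random weights and the chosen context indices are arbitrary assumptions.

```python
import numpy as np

V, N = 5, 3                       # toy vocabulary and embedding sizes (assumed)
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # input->hidden weights, one row per word

def one_hot(index, size):
    """Vector of zeros with a single one at position `index`."""
    x = np.zeros(size)
    x[index] = 1.0
    return x

context_ids = [1, 3]              # indices of the C = 2 context words
C = len(context_ids)

# Left-hand side of Eq. (1): sum the one-hot inputs, project, divide by C.
h_matrix = W.T @ sum(one_hot(i, V) for i in context_ids) / C

# Right-hand side of Eq. (1): average the corresponding rows of W.
h_lookup = W[context_ids].mean(axis=0)

print(np.allclose(h_matrix, h_lookup))
```

In practice only the row lookup is used, since multiplying by a one-hot vector merely selects a row.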
Between the hidden and the output layer there is a different matrix, called W', which has size
N×V (it is not the transpose of W). Using this matrix, we can compute a score $u_j$ for each word
in the vocabulary:
$$u_j = {v'_{w_j}}^{T} h \tag{2}$$
where $v'_{w_j}$ is the j-th column of the matrix W'. Then, with softmax, a log-linear classifier,
we can find the posterior distribution of words, which is a multinomial distribution:
$$p(w_j \mid w_{I,1}, \dots, w_{I,C}) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \tag{3}$$
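The scoring and softmax steps of Eq. (2) and Eq. (3) can be sketched with NumPy as follows. The weights and the hidden vector are random placeholders, and subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not mentioned in the text.

```python
import numpy as np

def softmax(u):
    """Softmax of a score vector, shifted by its maximum for numerical stability."""
    shifted = u - np.max(u)
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum()

V, N = 5, 3                          # toy sizes (assumed)
rng = np.random.default_rng(1)
W_out = rng.normal(size=(N, V))      # hidden->output matrix W', one column per word
h = rng.normal(size=N)               # hidden layer output, e.g. from Eq. (1)

u = W_out.T @ h                      # Eq. (2): u_j = v'_{w_j}^T h for every j
y = softmax(u)                       # Eq. (3): posterior over the vocabulary
predicted = int(np.argmax(y))        # index of the most probable word

print(y, predicted)
```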
After these calculations, the weights of the two matrices are updated by backpropagation.
The training objective is to maximize the conditional probability of observing
the actual output word $w_O$ (denote its index in the output layer as $j^*$) given the input context
words $w_{I,1}, \dots, w_{I,C}$ with regard to the weights:
$$\max p(w_O \mid w_{I,1}, \dots, w_{I,C}) = \max y_{j^*} = \max \log y_{j^*} = u_{j^*} - \log \sum_{j'=1}^{V} \exp(u_{j'}) := -E \tag{4}$$
where $E = -\log p(w_O \mid w_{I,1}, \dots, w_{I,C})$ is our loss function (which we want to minimize).
The update equation for the hidden-output weights is
$$ {v'}_{w_j}^{(new)} = {v'}_{w_j}^{(old)} - \eta \, e_j \, h \tag{5}$$
for j = 1, 2, ..., V, where $\eta > 0$ is the learning rate, $e_j := \partial E / \partial u_j = y_j - t_j$
is the prediction error of the output layer ($t_j$ is 1 when the j-th unit is the actual output word
and 0 otherwise), h is the output of the hidden layer and $v'_{w_j}$ is the output vector of $w_j$.
Having obtained the update equations for W', we can now move on to W. In a similar manner,
it is possible to obtain the update equation for the input-hidden weights:
$$v_{w_{I,c}}^{(new)} = v_{w_{I,c}}^{(old)} - \frac{1}{C} \, \eta \, EH \tag{6}$$
for c = 1, ..., C, where $v_{w_{I,c}}$ is the row of W corresponding to the c-th context word and EH is
the N-dimensional vector with components $EH_i = \partial E / \partial h_i = \sum_{j=1}^{V} e_j w'_{ij}$,
i.e. the sum of the output vectors of all the words in the vocabulary, weighted by their
prediction error. The only rows of W whose derivative is non-zero are the ones corresponding to
the C context words. All the other rows of W remain unchanged after this iteration, because
their derivatives are zero.
Intuitively, since the vector EH is the sum of the output vectors of all words in the vocabulary
weighted by their prediction error $e_j = y_j - t_j$, we can understand this equation as adding a
portion of every output vector in the vocabulary to the input vectors of the context words. If, in
the output layer, the probability of a word $w_j$ being the output word is overestimated
($y_j > t_j$), then the input vector of the context word $w_{I,c}$ will tend to move farther away
from the output vector of $w_j$; on the contrary, if the probability of $w_j$ being the output word
is underestimated ($y_j < t_j$), then the input vector of $w_{I,c}$ will tend to move closer to the
output vector of $w_j$; if the probability of $w_j$ is fairly accurately predicted, then it will
have little effect on the movement of the input vector of $w_{I,c}$. The movement of the input
vector of $w_{I,c}$ is determined by the prediction error of all vectors in the vocabulary; the
larger the prediction error, the more significant the effect a word will exert on the movement of
the input vector of the context word.
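The CBOW forward pass and the two update equations can be put together in one training step. The following is a sketch with a full softmax, not the authors' implementation; the function name `cbow_step`, the toy sizes, the initial weight scale and the learning rate are all assumptions.

```python
import numpy as np

def cbow_step(W, W_out, context_ids, target_id, lr=0.1):
    """One SGD step of CBOW with a full softmax.

    W (V x N) and W_out (N x V) are updated in place; the loss E is returned.
    """
    C = len(context_ids)
    h = W[context_ids].mean(axis=0)          # Eq. (1): average of context rows
    u = W_out.T @ h                          # Eq. (2): one score per word
    y = np.exp(u - u.max())
    y /= y.sum()                             # Eq. (3): softmax
    loss = -np.log(y[target_id])             # E = -log y_{j*}
    e = y.copy()
    e[target_id] -= 1.0                      # e_j = y_j - t_j
    EH = W_out @ e                           # EH_i = sum_j e_j w'_ij (old W')
    W_out -= lr * np.outer(h, e)             # Eq. (5), all columns at once
    # Eq. (6): only the rows of the C context words are modified.
    np.subtract.at(W, context_ids, lr * EH / C)
    return loss

rng = np.random.default_rng(0)
V, N = 6, 4
W = rng.normal(scale=0.1, size=(V, N))
W_out = rng.normal(scale=0.1, size=(N, V))
W_before = W.copy()

# Repeating the step on one (context, target) pair should drive the loss down.
losses = [cbow_step(W, W_out, context_ids=[1, 3], target_id=2) for _ in range(200)]
print(losses[0], losses[-1])
```

Note that EH is computed before W' is updated, so both updates use the weights from the same iteration.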
II. Skip-Gram
The skip-gram model (Fig.2) is the opposite of the CBOW model: the target word is at the input
layer and the context words are at the output layer.
Figure 2: Skip-Gram architecture
The weights between the input and the hidden layer can be represented by the V×N matrix W.
Given the input word, we have:
$$h = x^{T} W = W_{(k,\cdot)} := v_{w_I} \tag{7}$$
which essentially copies the k-th row of W to h, where k is the index of the input word.
$v_{w_I}$ is the vector representation of the input word $w_I$. This implies that the activation
function of the hidden layer units is simply linear.
On the output layer, instead of computing one multinomial distribution, we compute C
multinomial distributions. Each output is computed using the same hidden-output matrix:
$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \tag{8}$$
where $w_{c,j}$ is the j-th word on the c-th panel of the output layer; $w_{O,c}$ is the actual c-th
word in the output context words; $w_I$ is the only input word; $y_{c,j}$ is the output of the j-th
unit on the c-th panel of the output layer; $u_{c,j}$ is the net input of the j-th unit on the c-th
panel of the output layer. Since all panels share the same weights, $u_{c,j} = u_j$ for every c.
Now, the loss function is:
$$E = -\log p(w_{O,1}, w_{O,2}, \dots, w_{O,C} \mid w_I) = -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j_c^*})}{\sum_{j'=1}^{V} \exp(u_{j'})} = -\sum_{c=1}^{C} u_{j_c^*} + C \log \sum_{j'=1}^{V} \exp(u_{j'}) \tag{9}$$
where $j_c^*$ is the index of the actual c-th output context word in the vocabulary.
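The update equation for the hidden-output matrix is:
$$ {v'}_{w_j}^{(new)} = {v'}_{w_j}^{(old)} - \eta \, EI_j \, h \tag{10}$$
for j = 1, 2, ..., V, where $EI_j = \sum_{c=1}^{C} e_{c,j}$ is the sum of the prediction errors
$e_{c,j} = y_{c,j} - t_{c,j}$ over all context words.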
The derivation of the update equation for the input-hidden matrix is identical to that of CBOW,
except that the prediction error $e_j$ is replaced with $EI_j$, so that
$EH_i = \sum_{j=1}^{V} EI_j w'_{ij}$:
$$v_{w_I}^{(new)} = v_{w_I}^{(old)} - \eta \, EH \tag{11}$$
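A matching sketch for one skip-gram step, again with a full softmax and under the same assumptions as before (hypothetical function name `skipgram_step`, arbitrary sizes, weights and learning rate). The input-hidden update uses the vector EH obtained by weighting the output vectors with the summed errors $EI_j$.

```python
import numpy as np

def skipgram_step(W, W_out, input_id, context_ids, lr=0.1):
    """One SGD step of skip-gram with a full softmax.

    W (V x N) and W_out (N x V) are updated in place; the loss E is returned.
    """
    C = len(context_ids)
    h = W[input_id].copy()                   # Eq. (7): copy the row of the input word
    u = W_out.T @ h                          # shared scores for all C panels
    y = np.exp(u - u.max())
    y /= y.sum()                             # Eq. (8)
    loss = -np.sum(np.log(y[context_ids]))   # Eq. (9)
    EI = C * y                               # EI_j = sum_c (y_j - t_{c,j})
    np.subtract.at(EI, context_ids, 1.0)     # subtract 1 for each actual context word
    EH = W_out @ EI                          # EH_i = sum_j EI_j w'_ij (old W')
    W_out -= lr * np.outer(h, EI)            # Eq. (10)
    W[input_id] -= lr * EH                   # Eq. (11): only the input word's row moves
    return loss

rng = np.random.default_rng(0)
V, N = 6, 4
W = rng.normal(scale=0.1, size=(V, N))
W_out = rng.normal(scale=0.1, size=(N, V))
W_before = W.copy()

losses = [skipgram_step(W, W_out, input_id=2, context_ids=[1, 3]) for _ in range(200)]
print(losses[0], losses[-1])
```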
III. Optimizer
We need to minimize the loss in order to obtain the best performance. A well-known method for
minimizing an objective function is Stochastic Gradient Descent (SGD). Usually, the objective
function comes in the form of a sum of differentiable functions, where w is the parameter that
needs to be estimated:
$$G(w) = \sum_{i=1}^{n} G_i(w)$$
When the training set is enormous and no simple formulas exist, evaluating the sum of gradients
becomes very expensive, because evaluating the gradient requires evaluating the gradients of all
the summand functions. To economize on the computational cost at every iteration, stochastic
gradient descent samples a subset of summand functions at every step. This is very effective in
the case of large-scale machine learning problems like ours. In stochastic gradient descent, the
true gradient of G(w) is approximated by the gradient at a single example:
$$w := w - \eta \nabla G_i(w)$$
As the algorithm sweeps through the training set, it performs the above update for each training
example. Several passes can be made over the training set until the algorithm converges. If this is
done, the data can be shuffled for each pass to prevent cycles. A compromise between computing
the true gradient and the gradient at a single example is to compute the gradient against more
than one training example (called a mini-batch) at each step. This can perform significantly better
than true stochastic gradient descent because the code can make use of vectorization libraries
rather than computing each step separately. It may also result in smoother convergence, as the
gradient computed at each step uses more training examples.
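Mini-batch SGD with reshuffling on every pass can be sketched on a toy objective. The least-squares summands $G_i(w) = \frac{1}{2}(x_i \cdot w - y_i)^2$, the function name `minibatch_sgd` and all hyperparameters below are assumptions chosen for illustration; they are unrelated to the word2vec loss.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Minimise G(w) = sum_i G_i(w) with G_i(w) = 0.5 * (x_i . w - y_i)^2."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))              # reshuffle on every pass
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            # Average of the per-example gradients in the mini-batch.
            grad = X[batch].T @ residual / len(batch)
            w -= lr * grad
    return w

data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(32, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                                       # noiseless targets
w_hat = minibatch_sgd(X, y)
print(w_hat)
```

Setting `batch_size=1` gives plain stochastic gradient descent, while `batch_size=len(X)` gives the true gradient at every step.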
In particular, the optimizer that we used is an enhanced gradient descent called AdaGrad
(Adaptive Gradient algorithm). It uses a different learning rate for each weight, adapted during
training: each weight's step is scaled down according to the gradients it has accumulated so
far. As a result, weights associated with frequent words, which are updated often, end up with a
smaller learning rate, while weights associated with less frequent words keep a larger one.
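AdaGrad, as it is usually formulated, keeps a running sum of squared gradients for every weight and divides each step by the square root of that sum. The class below is a minimal sketch of that rule, not the optimizer code used in this work; the class name, the base learning rate and the epsilon term are assumptions.

```python
import numpy as np

class AdaGrad:
    """Per-weight adaptive learning rate based on accumulated squared gradients."""

    def __init__(self, shape, lr=0.1, eps=1e-8):
        self.lr = lr
        self.eps = eps                   # avoids division by zero
        self.accum = np.zeros(shape)     # running sum of squared gradients

    def update(self, params, grad):
        """Apply one AdaGrad step to `params` in place and return it."""
        self.accum += grad ** 2
        params -= self.lr * grad / (np.sqrt(self.accum) + self.eps)
        return params

params = np.zeros(2)
optimizer = AdaGrad(params.shape)

# Weight 0 receives a gradient at every step (a "frequent" word),
# weight 1 only at the last step (a "rare" word).
for step in range(10):
    grad = np.array([1.0, 1.0 if step == 9 else 0.0])
    before = params.copy()
    optimizer.update(params, grad)
    last_step_sizes = np.abs(params - before)

print(last_step_sizes)
```

In this toy run the two weights receive the same gradient at the last step, but the rarely updated weight moves further because its accumulated sum is smaller.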
IV. Experiments and results
In this section we present the main experiments performed in parallel on the English and on the
Italian corpus.
I. Analogies test
In order to evaluate the developed models, we manually translated the Google word analogy
test for English. The original test is composed of questions divided into semantic questions (e.g.
father : mother = grandpa : grandma) and syntactic questions (e.g. going : went = predicting
: predicted). We started by extracting a subset of questions, translating them from English
to Italian and making some changes. For example, the original authors made a list of large
American cities and the states they belong to. Since the frequency of these American cities can be
very low in an Italian corpus, we replaced the relationship “American cities in state” with the
relationship “Italian cities in region”. We then removed the relationship “Comparative”, given that
in Italian comparatives are usually built as multi-word expressions (smart : smarter = intelligente
: più intelligente), and the relationship “Plural verbs”. The English category “Present-Participle”
was mapped to the Italian gerund, as it is in more common use in Italian. Finally, we added the
relationship “Family”.
The overall accuracy is then evaluated over all question types. A question is considered
correctly answered only if the word closest to the vector computed with the vector arithmetic
described in the Introduction is exactly the correct word in the question; synonyms are thus
counted as mistakes. This also means that reaching 100% is likely to be impossible, as the current
models do not have any input information about word morphology.
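The exact-match scoring rule can be sketched as follows. The embeddings are invented three-dimensional vectors, the function names are hypothetical, and excluding the three question words from the candidate answers is an assumption of this sketch rather than something stated above.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def evaluate_analogies(questions, vectors):
    """Count exact matches for questions (a, b, c, expected), read as a : b = c : expected."""
    correct = 0
    for a, b, c, expected in questions:
        target = [vb - va + vc
                  for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
        candidates = [w for w in vectors if w not in (a, b, c)]
        best = max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))
        correct += int(best == expected)
    return correct, len(questions)

# Toy embeddings, assumed axes: [male, female, elder].
vectors = {
    "padre": [1.0, 0.0, 0.0],
    "madre": [0.0, 1.0, 0.0],
    "nonno": [1.0, 0.0, 1.0],
    "nonna": [0.0, 1.0, 1.0],
    "mamma": [0.0, 0.9, 0.1],   # near-synonym of "madre"
}
questions = [
    ("padre", "madre", "nonno", "nonna"),
    ("nonno", "nonna", "padre", "madre"),
    ("nonno", "nonna", "padre", "mamma"),   # synonym expected: counted as a mistake
]
correct, total = evaluate_analogies(questions, vectors)
print(f"{correct}/{total}")
```

The third question shows the effect described above: the model answers "madre", which is reasonable, but the expected word is "mamma", so the answer is scored as wrong.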
Some examples from each category are shown in Table 1.
Table 1: Examples of questions in the Semantic-Syntactic Word Relationship test set for the English
language and some examples used for the Italian language test

| Type of relationship | Word Pair 1 (English) | Word Pair 2 (Italian) |
|---|---|---|
| Common capital city | Athens - Greece | Berlino - Germania |
| Currency | Italy - Euro | Canada - Dollaro |
| City-in-state (City-in-region) | Chicago - Illinois | Roma - Lazio |
| Man-Woman | brother - sister | padre - madre |
| Adjective to adverb | rapid - rapidly | preciso - precisamente |
| Opposite | possibly - impossibly | onesto - disonesto |
| Comparative | big - bigger | / |
| Superlative | easy - easiest | male - malissimo |
| Present-participle | think - thinking | volare - volando |
| Nationality adjective | Switzerland - Swiss | Francia - francese |
| Past tense | walking - walked | danzare - danzava |
| Plural nouns | mouse - mice | gatto - gatti |
| Plural verbs | work - works | / |
| Family | / | sorella - sorellastra |
II. Analogies results on English language
Table 2: Skip-gram accuracy on reduced English text8 (100MB)

| Category | Accuracy |
|---|---|
| Common capital city | 26.67% (4/15) |
| Currency | 0.00% (0/15) |
| City-in-state | 20.00% (3/15) |
| Man-Woman | 26.67% (4/15) |
| Adjective to adverb | 0.00% (0/15) |
| Opposite | 0.00% (0/15) |
| Comparative | 6.67% (1/15) |
| Superlative | 0.00% (0/15) |
| Present-participle | 20.00% (3/15) |
| Nationality adjective | 53.33% (8/15) |
| Past tense | 20.00% (3/15) |
| Plural nouns | 13.33% (2/15) |
| Plural verbs | 0.00% (0/15) |
| Total average | 14.36% (28/195) |
III. Analogies results on Italian language
Table 3: Skip-gram accuracy on our reduced Italian corpus (100MB)

| Category | Accuracy |
|---|---|
| Common capital city | 33.33% (5/15) |
| Currency | 0.00% (0/15) |
| City-in-region | 20.00% (3/15) |
| Man-Woman | 40.00% (6/15) |
| Adjective to adverb | 0.00% (0/15) |
| Opposite | 0.00% (0/15) |
| Superlative | 0.00% (0/15) |
| Present-gerund | 0.00% (0/15) |
| Nationality adjective | 46.67% (7/15) |
| Past tense | 0.00% (0/15) |
| Plural nouns | 0.00% (0/15) |
| Family | 26.67% (4/15) |
| Total average | 13.89% (25/180) |
Table 4: CBOW accuracy on our Italian corpus (920MB)

| Category | Accuracy |
|---|---|
| Common capital city | 33.33% (5/15) |
| Currency | 6.67% (1/15) |
| City-in-region | 26.67% (4/15) |
| Man-Woman | 60.00% (9/15) |
| Adjective to adverb | 6.67% (1/15) |
| Opposite | 0.00% (0/15) |
| Superlative | 6.67% (1/15) |
| Present-gerund | 20.00% (3/15) |
| Nationality adjective | 20.00% (3/15) |
| Past tense | 0.00% (0/15) |
| Plural nouns | 6.67% (1/15) |
| Family | 46.67% (7/15) |
| Total average | 19.44% (35/180) |
Table 5: Skip-gram accuracy on our Italian corpus (920MB)

| Category | Accuracy |
|---|---|
| Common capital city | 46.67% (7/15) |
| Currency | 0.00% (0/15) |
| City-in-region | 53.33% (8/15) |
| Man-Woman | 46.67% (7/15) |
| Adjective to adverb | 6.67% (1/15) |
| Opposite | 0.00% (0/15) |
| Superlative | 0.00% (0/15) |
| Present-gerund | 6.67% (1/15) |
| Nationality adjective | 40.00% (6/15) |
| Past tense | 0.00% (0/15) |
| Plural nouns | 20.00% (3/15) |
| Family | 60.00% (9/15) |
| Total average | 23.33% (42/180) |
IV. "Nearest to" test
A simple way to investigate the learned representations is to find the closest words for a
user-specified word. We use the cosine distance between two vectors (in our case a vector is the
representation of a word in the embedding space), a measure based on the cosine of the angle
between them: the smaller the distance, the more similar the two words. These are some examples:
| Nearest to "francia" | Cosine distance |
|---|---|
| spagna | 0.365999 |
| parigi | 0.497820 |
| germania | 0.545318 |
| italia | 0.554473 |

| Nearest to "roma" | Cosine distance |
|---|---|
| bologna | 0.382990 |
| torino | 0.386104 |
| milano | 0.401712 |
| new | 0.703315 |
| Nearest to "padre" | Cosine distance |
|---|---|
| nonno | 0.354935 |
| figlio | 0.381211 |
| quando | 0.687557 |
| mentre | 0.834535 |

| Nearest to "paraurti" | Cosine distance |
|---|---|
| calandra | 0.367163 |
| cofano | 0.475904 |
| abitacolo | 0.495059 |
| carrozzeria | 0.496319 |
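A nearest-neighbour query of this kind could be sketched as below. Cosine distance is taken here as one minus the cosine similarity, which is an assumption about the exact definition used, and the two-dimensional vectors are invented, so the distances do not correspond to the values in the tables.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_to(word, vectors, k=2):
    """Return the k words with the smallest cosine distance to `word`."""
    others = [(w, cosine_distance(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1])[:k]

# Toy embeddings, made up for this example only.
vectors = {
    "francia":  [1.0, 0.2],
    "spagna":   [0.9, 0.3],
    "parigi":   [0.7, 0.7],
    "paraurti": [0.0, 1.0],
}
neighbours = nearest_to("francia", vectors, k=2)
for neighbour, distance in neighbours:
    print(f"{neighbour}\t{distance:.6f}")
```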
V. "Doesn't match" test
The test consists of a set of questions, each composed of 4 words. The goal of the test is to find
the word that does not match the others. For each word we compute the cosine distance to each of
the other words and take the mean of these distances; the word with the highest mean
is the intrusive word.
Table 6: Some outputs of the test

| Word 1 | Word 2 | Word 3 | Word 4 | Predicted intrusive word |
|---|---|---|---|---|
| uomo | donna | cucina | bambino | cucina |
| uomo | mamma | bambino | ragazzo | mamma |
| roma | lombardia | milano | napoli | lombardia |
| argentina | francia | spagna | berlino | berlino |
| sono | ero | essendo | avere | avere |
| trovai | trovato | trovare | essere | trovai |
| essendo | avendo | amando | leggere | amando |
| giallo | uomo | rosso | arancione | uomo |
| giallo | bianco | rosso | arancione | arancione |
As a result, the trained model is generally able to group similar words into clusters and to
pick out the word farthest from the others.
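The mean-distance rule of this test can be sketched in a few lines. As before, cosine distance is assumed to be one minus the cosine similarity, the function name `doesnt_match` is hypothetical and the vectors are invented for illustration.

```python
import math

def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def doesnt_match(words, vectors):
    """Return the word whose mean cosine distance to the other words is highest."""
    def mean_distance(word):
        others = [w for w in words if w != word]
        total = sum(cosine_distance(vectors[word], vectors[o]) for o in others)
        return total / len(others)
    return max(words, key=mean_distance)

# Toy embeddings, made up for this example only.
vectors = {
    "uomo":    [1.0, 0.1],
    "donna":   [0.9, 0.2],
    "bambino": [0.8, 0.3],
    "cucina":  [0.1, 1.0],
}
intruder = doesnt_match(["uomo", "donna", "cucina", "bambino"], vectors)
print(intruder)
```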
VI. Visualization of learned embeddings
After the training has finished we can visualize the learned embeddings using t-SNE (t-distributed
Stochastic Neighbor Embedding).
t-SNE[6] is a tool to visualize high-dimensional data. It converts similarities between data
points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the
joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a
cost function that is not convex, i.e. with different initializations we can get different results.
The static image of the plot looks very confusing, since it contains 150,000 words. Navigating
the plot interactively with the Python plotting tools, however, shows that words with a similar
meaning are grouped close together, as can be seen in Fig. 3 and Fig. 4.
Figure 3: A closer look at the plot
Figure 4: A closer look at the plot/2
V. Conclusions
In this work we tried to understand word2vec, a well-known tool for learning word embeddings,
and to reproduce to some extent the results obtained in the original work for the English language.
From the analysis shown in the previous section, we can deduce that skip-gram performs better
than the CBOW model. In our CBOW model the context is made up of only the two words nearest to
the target word, while with skip-gram, depending on the size of the skip-window, the whole
sentence can be considered as context, which leads to a better understanding of the semantics of
the target word.
Regarding the differences between Italian and English, the former, like other European
languages, is morphologically richer than the latter. For example, English has no accents, does
not inflect adjectives for gender, and has simpler verb forms.
We noticed that eliminating the so-called stopwords (articles, conjunctions, prepositions, etc.)
from the Italian corpus gave better performance, because only the main words of the context are
taken into account. To achieve this, it was necessary to reduce the size of the window
(skip-window) to avoid mixing contexts from different sentences.
We adopted a simple word analogy test to evaluate the generated word embeddings. We
provided a literal, and probably rough, translation of the analogies file used for testing. To
improve performance it would be necessary to build a test file with more accurate
translations and a greater number of elements. Furthermore, to train the neural network
efficiently, it would be necessary to add to the corpus a greater number of words from different
sources, such as books, conversations and articles from various newspapers, in order to learn more
information from multiple contexts.
In conclusion, the tests were conducted on corpora of various sizes in order to show
that, as the size of the corpus and the number of words in the vocabulary increase, the neural
network provides better results.
References
[1] “Natural language processing (almost) from scratch.” - Collobert, Ronan, et al. 2011.
[2] “Distributed representations of words and phrases and their compositionality.” - Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S. and Dean, Jeff. 2013.
[3] “Efficient estimation of word representations in vector space.” - Mikolov, Tomas, et al. 2013.
[4] “Word Embeddings Go to Italy: a Comparison of Models and Training Datasets.” - Berardi, Giacomo, Esuli, Andrea and Marcheggiani, Diego. 2015.
[5] “A closer look at skip-gram modelling.” - Guthrie, David, et al. 2006.
[6] “Visualizing data using t-SNE.” - Van der Maaten, Laurens, and Hinton, Geoffrey. 2008.