Sapienza University of Rome • 12 April 2016 • Course: Neural Networks
Word2Vec on Italian language
Cucari Francesco
De Cillis Daniele
Molinari Dario
I. Introduction
Research on word representation models, or word embeddings, has gained a lot of attention in recent years [1], also thanks to renewed progress in neural network technologies such as deep learning, which has made it possible to train more complex models on much larger data sets. One of the most popular works in this line is without doubt Mikolov's word2vec [2][3], which introduced the idea of learning distributed representations of words.
Distributed word representations, also known as word embeddings, represent a word as a vector in $\mathbb{R}^n$. The closer two vectors are, the more the corresponding words are deemed to share some degree of syntactic or semantic similarity [4]. For example, the result of the vector calculation vec('Paris') - vec('France') + vec('Italy') is closer to vec('Rome') than to any other word vector (see the word2vec tool page, https://code.google.com/archive/p/word2vec/).
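As a minimal illustration of this kind of query (the toy embedding values below are invented for the example; only the mechanics of the vector arithmetic and cosine ranking are meant to be representative):

```python
import numpy as np

# Toy 3-dimensional embeddings, invented purely for illustration.
emb = {
    "paris":  np.array([0.9, 0.1, 0.8]),
    "france": np.array([0.8, 0.0, 0.9]),
    "italy":  np.array([0.7, 0.1, 0.2]),
    "rome":   np.array([0.8, 0.2, 0.1]),
    "berlin": np.array([0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# vec('paris') - vec('france') + vec('italy') should land closest to vec('rome')
query = emb["paris"] - emb["france"] + emb["italy"]
best = max((w for w in emb if w not in ("paris", "france", "italy")),
           key=lambda w: cosine(query, emb[w]))
print(best)  # -> rome
```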
The main purpose of this work is to validate previously proposed experiments for the English language and then to figure out whether it is possible to reproduce the same accuracy and performance for the Italian language, using our own implementation of word2vec.
II. Corpus
A corpus is a collection of texts composed of many sentences on various topics, from which the neural network can learn the meaning of each word it finds thanks to its context.
Some of the most famous Italian corpora are Paisà (http://www.corpusitaliano.it/) and ItWaC (http://wacky.sslmit.unibo.it/doku.php?id=corpora). Both include texts from the web, such as chat conversations, book extracts and articles. However, neither is formatted properly for our purpose, so we preferred to build a corpus manually. Our corpus consists of two datasets in the Italian language: half of the dump of the Italian Wikipedia (http://dumps.wikimedia.org/itwiki/latest/, dated 06/03/2016) and a collection of about 120,000 Italian articles written in 2010, taken from the archive of "La Repubblica".
We trained the network with corpora of different sizes. Initially, we trained the model on an English corpus, called text8 (http://mattmahoney.net/dc/textdata), which has a size of 100 MB, corresponding to about 17 million words. Finally, we trained the model on our Italian corpus, which has a size of 920 MB, equivalent to about 122 million words.
Before training the network, the corpus needs some preprocessing. First, punctuation must be deleted: we need plain text, because full stops, colons, commas and other symbols could be interpreted as distinct words by the algorithm. Then, words are converted to lowercase and numerals are converted to their word forms (e.g. 1992 becomes one nine nine two). Finally, we noticed that by eliminating the so-called stopwords in the Italian corpus (articles, conjunctions, prepositions, etc.) we obtained better performance.
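A minimal sketch of this preprocessing pipeline is given below; the digit-to-word mapping and the stopword list are illustrative stubs, not the exact resources we used.

```python
import string

# Illustrative (and deliberately incomplete) Italian stopword list.
STOPWORDS = {"il", "lo", "la", "i", "gli", "le", "di", "a", "da", "in", "e", "che", "un", "una"}
DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def preprocess(text):
    # 1. delete punctuation, 2. lowercase everything
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = []
    for tok in text.split():
        if tok.isdigit():
            # 3. spell numerals out digit by digit: 1992 -> one nine nine two
            tokens.extend(DIGITS[d] for d in tok)
        elif tok not in STOPWORDS:
            # 4. drop stopwords
            tokens.append(tok)
    return tokens

print(preprocess("Nel 1992 la squadra vinse il campionato."))
```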
After the preprocessing phase, it is possible to train the network. Our model takes as input a subset of the corpus, called the vocabulary, whose size is essential: with a small vocabulary we do not have enough words to compute similarities correctly, while with a large one we take into account many words that are useless for our purpose. So, we decided to keep only the words that appear more than 10 times in the corpus.
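A sketch of this frequency cut-off (the threshold of 10 is the one used in this work; the surrounding code is an assumed implementation, not our exact one):

```python
from collections import Counter

def build_vocabulary(tokens, min_count=10):
    """Map each sufficiently frequent word to an integer index."""
    counts = Counter(tokens)
    # most_common() is sorted by frequency, so the kept indices stay contiguous
    return {w: i for i, (w, c) in enumerate(counts.most_common()) if c > min_count}
```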
III. Network architecture
Word2vec is a neural network composed of three layers: an input, a hidden and an output layer. The two main architectures are structurally similar and are called Continuous Bag Of Words (CBOW) and Skip-gram.
I. CBOW
CBOW is the original version of word2vec. In our case, we built a CBOW model (Fig. 1) with a context of two words, namely the word on the left of the target word and the one on its right. In this way the model predicts the target word as output.
Figure 1: CBOW architecture
As we can see from Fig. 1, $V$ is the vocabulary size and $N$ is the embedding size. $x_{1k}$ and $x_{2k}$ are two one-hot encoded vectors that represent the input words, i.e. they are vectors of size $V$ with one unit equal to one and all the others equal to zero.
The weights between the input layer and the hidden layer can be represented by a $V \times N$ matrix $W$. Each row of $W$ is the $N$-dimensional vector representation $v_w$ of the associated word of the input layer. Therefore, the output of the hidden layer is:

$$h = \frac{1}{C} W^T (x_1 + x_2 + \dots + x_C) = \frac{1}{C} \left( v_{w_1} + v_{w_2} + \dots + v_{w_C} \right) \qquad (1)$$

where $C$ is the number of words in the context, $w_1, \dots, w_C$ are the words in the context, and $v_w$ is the input vector of a word $w$. Note that the activation function of the hidden layer units is simply linear, i.e. it directly passes the weighted sum of its inputs to the next layer.
Between the hidden and the output layer there is a different matrix, called $W'$, which has size $N \times V$ (it is not the transpose of $W$). Using this matrix, we can compute a score $u_j$ for each word in the vocabulary:

$$u_j = {v'_{w_j}}^T h \qquad (2)$$

where $v'_{w_j}$ is the $j$-th column of the matrix $W'$. Then, with softmax, a log-linear classifier, we can obtain the posterior distribution of words, which is a multinomial distribution:

$$p(w_j \mid w_{I,1}, \dots, w_{I,C}) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (3)$$
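A small numpy sketch of this forward pass for a single training example; the vocabulary size, embedding size and context indices are toy values chosen only so the code runs:

```python
import numpy as np

V, N, C = 10, 4, 2                        # toy vocabulary size, embedding size, context size
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))   # input -> hidden weights
Wp = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights (W')

context = [3, 7]                          # indices of the two context words (one-hot positions)

h = W[context].mean(axis=0)               # Eq. (1): average of the context input vectors
u = Wp.T @ h                              # Eq. (2): one score per vocabulary word, shape (V,)
y = np.exp(u - u.max())                   # Eq. (3): softmax (shifted for numerical stability)
y /= y.sum()
```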
After these calculations, the weights of the two matrices must be updated via backpropagation. The training objective is to maximize the conditional probability of observing the actual output word $w_O$ (denote its index in the output layer as $j^*$) given the input context words $w_{I,1}, \dots, w_{I,C}$, with regard to the weights:

$$\max \, p(w_O \mid w_{I,1}, \dots, w_{I,C}) = \max \, y_{j^*} = \max \, \log y_{j^*} = u_{j^*} - \log \sum_{j'=1}^{V} \exp(u_{j'}) =: -E \qquad (4)$$

where $E$ is our loss function (which we want to minimize) and $j^*$ is the index of the actual output word in the output layer.
The update equation for the hidden-output weights is

$$v'^{(new)}_{w_j} = v'^{(old)}_{w_j} - \eta \, e_j \, h \qquad (5)$$

for $j = 1, 2, \dots, V$, where $\eta > 0$ is the learning rate, $e_j := \frac{\partial E}{\partial u_j} = y_j - t_j$ is the prediction error of the output layer ($t_j$ is 1 when the $j$-th unit is the actual output word and 0 otherwise), $h$ is the output of the hidden layer and $v'_{w_j}$ is the output vector of $w_j$.
Having obtained the update equation for $W'$, we can now move on to $W$. In a similar manner, it is possible to obtain the update equation for the input-hidden weights:

$$v^{(new)}_{w_{I,c}} = v^{(old)}_{w_{I,c}} - \frac{1}{C} \, \eta \, EH \qquad (6)$$

for $c = 1, \dots, C$, where $v_{w_{I,c}}$ is the row of $W$ corresponding to the $c$-th context word and $EH$ is the vector with components $EH_i = \frac{\partial E}{\partial h_i} = \sum_{j=1}^{V} e_j \, w'_{ij}$, i.e. the sum of the output vectors of all the words in the vocabulary, weighted by their prediction error. The only rows of $W$ whose derivative is non-zero are the ones corresponding to the $C$ context words; all the other rows of $W$ remain unchanged after this iteration, because their derivatives are zero.
Intuitively, since the vector $EH$ is the sum of the output vectors of all words in the vocabulary weighted by their prediction error $e_j = y_j - t_j$, we can understand this equation as adding a portion of every output vector in the vocabulary to the input vectors of the context words. If, in the output layer, the probability of a word $w_j$ being the output word is overestimated ($y_j > t_j$), then the input vector of the context word $w_{I,c}$ will tend to move farther away from the output vector of $w_j$; on the contrary, if the probability of $w_j$ being the output word is underestimated ($y_j < t_j$), then the input vector of $w_{I,c}$ will tend to move closer to the output vector of $w_j$; if the probability of $w_j$ is fairly accurately predicted, then it will have little effect on the movement of the input vector of $w_{I,c}$. The movement of the input vector of $w_{I,c}$ is determined by the prediction errors of all vectors in the vocabulary: the larger the prediction error, the more significant the effect a word exerts on the movement of the input vector of the context word.
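Continuing the toy forward-pass sketch above, the two updates of equations (5) and (6) can be written as follows (the target index and learning rate are again made-up values):

```python
target, eta = 5, 0.05                     # toy target word index and learning rate

t = np.zeros(V)
t[target] = 1.0
e = y - t                                 # prediction error e_j = y_j - t_j

Wp -= eta * np.outer(h, e)                # Eq. (5): column j of W' moves by -eta * e_j * h

EH = Wp @ e                               # EH_i = sum_j e_j * W'_ij, shape (N,)
for c in context:
    W[c] -= (eta / C) * EH                # Eq. (6): only the context rows of W change
```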
II. Skip-Gram
The skip-gram model (Fig.2) is the opposite of the CBOW model: the target word is at the input
layer and the context words are at the output layer.
Figure 2: Skip-Gram architecture
The weights between the input and the hidden layer can be represented by the $V \times N$ matrix $W$. Given the input word, whose one-hot vector has its unit at position $k$, we have:

$$h = x^T W = W_{(k,\cdot)} = v_{w_I} \qquad (7)$$

which essentially copies the $k$-th row of $W$ to $h$; $v_{w_I}$ is the vector representation of the input word $w_I$. This again implies that the activation function of the hidden layer units is simply linear.
On the output layer, instead of computing one multinomial distribution, we compute $C$ multinomial distributions. Each output is computed using the same hidden-output matrix, so the net inputs of corresponding units are shared across panels ($u_{c,j} = u_j$ for every $c$):

$$p(w_{c,j} = w_{O,c} \mid w_I) = y_{c,j} = \frac{\exp(u_{c,j})}{\sum_{j'=1}^{V} \exp(u_{j'})} \qquad (8)$$

where $w_{c,j}$ is the $j$-th word on the $c$-th panel of the output layer; $w_{O,c}$ is the actual $c$-th word among the output context words; $w_I$ is the only input word; $y_{c,j}$ is the output of the $j$-th unit on the $c$-th panel of the output layer; $u_{c,j}$ is the net input of the $j$-th unit on the $c$-th panel of the output layer.
Now, the loss function is:

$$E = -\log p(w_{O,1}, w_{O,2}, \dots, w_{O,C} \mid w_I) = -\log \prod_{c=1}^{C} \frac{\exp(u_{c,j^*_c})}{\sum_{j'=1}^{V} \exp(u_{j'})} = -\sum_{c=1}^{C} u_{j^*_c} + C \log \sum_{j'=1}^{V} \exp(u_{j'}) \qquad (9)$$

where $j^*_c$ is the index of the actual $c$-th output context word in the vocabulary.
The update equation for the hidden-output matrix is:

$$v'^{(new)}_{w_j} = v'^{(old)}_{w_j} - \eta \, EI_j \, h \qquad (10)$$

for $j = 1, 2, \dots, V$, where $EI_j = \sum_{c=1}^{C} e_{c,j}$ is the sum of the prediction errors over all context words. The derivation of the update equation for the input-hidden matrix is identical to that of CBOW, except that the prediction error $e_j$ is replaced by $EI_j$, giving:
$$v^{(new)}_{w_I} = v^{(old)}_{w_I} - \eta \, EH \qquad (11)$$

where $EH$ is now the vector with components $EH_i = \sum_{j=1}^{V} EI_j \, w'_{ij}$.
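For concreteness, the sketch below shows how (target, context) training pairs are generated from a tokenized sentence with a given skip-window; the window size of 1 in the example is arbitrary:

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs within the skip-window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(["il", "gatto", "dorme", "sul", "divano"], window=1))
# [('il', 'gatto'), ('gatto', 'il'), ('gatto', 'dorme'), ('dorme', 'gatto'), ...]
```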
III. Optimizer
We need to minimize the loss in order to obtain the best performance. A well-known method for minimizing an objective function is Stochastic Gradient Descent (SGD). Usually, the objective function comes in the form of a sum of differentiable functions of a parameter $w$ that needs to be estimated:

$$G(w) = \sum_{i=1}^{n} G_i(w)$$
When the training set is enormous and no simple closed formulas exist, evaluating the sum of gradients becomes very expensive, because it requires evaluating the gradients of all the summand functions. To reduce the computational cost at every iteration, stochastic gradient descent samples a subset of the summand functions at every step. This is very effective in the case of large-scale machine learning problems like ours. In stochastic gradient descent, the true gradient of $G(w)$ is approximated by the gradient at a single example:

$$w := w - \eta \, \nabla G_i(w)$$
As the algorithm sweeps through the training set, it performs the above update for each training example. Several passes can be made over the training set until the algorithm converges; if this is done, the data can be shuffled before each pass to prevent cycles. A compromise between computing the true gradient and the gradient at a single example is to compute the gradient against more than one training example (a mini-batch) at each step. This can perform significantly better than true stochastic gradient descent, because the code can make use of vectorization libraries rather than computing each step separately, and it may also result in smoother convergence, as the gradient computed at each step uses more training examples.
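A generic sketch of this mini-batch variant is shown below; the quadratic objective is only a placeholder so that the loop runs, and this is not our actual optimizer (we used AdaGrad, described next):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))          # toy training set: one summand G_i per row
w = np.zeros(3)

def batch_gradient(w, batch):
    # Placeholder: G_i(w) = 0.5 * ||w - x_i||^2, so the gradient is (w - x_i)
    return (w - batch).mean(axis=0)

eta, batch_size = 0.1, 32
for epoch in range(5):
    rng.shuffle(data)                      # shuffle before each pass to prevent cycles
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        w -= eta * batch_gradient(w, batch)  # w := w - eta * gradient of the sampled summands
```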
In particular, the optimizer that we used is an enhanced gradient descent called AdaGrad (Adaptive Gradient algorithm). It assigns a different, adaptive learning rate to each weight: parameters that are updated frequently (such as those of very common words) receive smaller effective learning rates, while parameters of rare words keep larger ones. In this way each word effectively gets its own adaptive learning rate.
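A sketch of the per-parameter AdaGrad rule we are referring to (this is the textbook formulation, not a verbatim extract of our code):

```python
import numpy as np

def adagrad_update(w, grad, cache, eta=0.5, eps=1e-8):
    """One AdaGrad step: frequently updated parameters get a smaller effective rate."""
    cache += grad ** 2                         # running sum of squared gradients, per parameter
    w -= eta * grad / (np.sqrt(cache) + eps)   # effective rate shrinks as the cache grows
    return w, cache

# Toy usage: parameter 0 receives gradients at every step, parameter 1 only rarely,
# so parameter 1 keeps a larger effective learning rate per received gradient.
w, cache = np.zeros(2), np.zeros(2)
for step in range(100):
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    w, cache = adagrad_update(w, grad, cache)
```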
IV. Experiments and results
In this section we present the main experiments performed in parallel on the English and on the
Italian corpus.
I. Analogies test
In order to evaluate the developed models we have manually translated the Google word analogy
test for English. The original test is composed by questions divided in semantics questions (e.g.:
father : mother = grandpa : grandma) and syntactic questions (e.g.: going : went = predicting
: predicted). We have started by extracting a subset of questions, by translating the English test
to Italian and by making some changes. For example, the authors made a list of large American
cities and the states they belong to. Since the frequency of these American cities can be very low
in an Italian corpus, we replace the relationship “American cities in state” with the relationship
“Italian cities in region”. Then, we remove the relationships “Comparative”, given that in Italian
comparatives are usually built as multi-word expressions (smart : smarter = intelligente : più
intelligente) and the relationship “Plural verbs”. The category “Present-Participle” of English has
been mapped to the Italian gerund, as it is of more common use in Italian. Finally, we add the
relationship “Family”.
The overall accuracy is then evaluated for all question types. A question is considered correctly answered only if the closest word to the vector obtained by analogy arithmetic (vec(b) - vec(a) + vec(c)) is exactly the same as the correct word in the question; synonyms are thus counted as mistakes. This also means that reaching 100% is likely to be impossible, as the current models do not have any input information about word morphology.
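The evaluation just described can be sketched as follows; the word-to-vector dictionary and the question tuples are placeholders, and normalization details may differ from our actual code:

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """Return the word whose vector is closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    query = emb[b] - emb[a] + emb[c]
    query /= np.linalg.norm(query)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):                  # the question words themselves are excluded
            continue
        sim = float(np.dot(query, v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

def analogy_accuracy(emb, questions):
    """questions: (a, b, c, expected) tuples; only exact matches count, synonyms are mistakes."""
    correct = sum(solve_analogy(emb, a, b, c) == expected for a, b, c, expected in questions)
    return correct / len(questions)
```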
Some examples from each category are shown in Table 1.
Table 1: Examples of questions in the Semantic-Syntactic Word Relationship test set for the English
language and some examples used for the Italian language test
Type of relationship Word Pair 1 Word Pair 2
Common capital city Athens - Greece Berlino - Germania
Currency Italy - Euro Canada - Dollaro
City-in-state (City-in-region) Chicago - Illinois Roma - Lazio
Man-Woman brother - sister padre - madre
Adjective to adverb rapid - rapidly preciso - precisamente
Opposite possibly - impossibly onesto - disonesto
Comparative big - bigger /
Superlative easy - easiest male - malissimo
Present-participle think - thinking volare - volando
Nationality adjective Switzerland - Swiss Francia - francese
Past tense walking - walked danzare - danzava
Plural nouns mouse - mice gatto - gatti
Plural verbs work- works /
Family / sorella - sorellastra
II. Analogies results on English language
Table 2: Skip-gram accuracy on reduced English text8 (100MB)
Category Accuracy
Common capital city 26,67% (4/15)
Currency 0,00% (0/15)
City-in-state 20,00% (3/15)
Man-Woman 26,67% (4/15)
Adjective to adverb 0,00% (0/15)
Opposite 0,00% (0/15)
Comparative 6,67% (1/15)
Superlative 0,00% (0/15)
Present-participle 20,00% (3/15)
Nationality adjective 53,33% (8/15)
Past tense 20,00% (3/15)
Plural nouns 13,33% (2/15)
Plural verbs 0,00% (0/15)
Total average: 14,36% (28/195)
III. Analogies results on Italian language
Table 3: Skip-gram accuracy on our reduced Italian corpus (100MB)
Category Accuracy
Common capital city 33,33% (5/15)
Currency 0,00% (0/15)
City-in-region 20,00% (3/15)
Man-Woman 40,00% (6/15)
Adjective to adverb 0,00% (0/15)
Opposite 0,00% (0/15)
Superlative 0,00% (0/15)
Present-gerund 0,00% (0/15)
Nationality adjective 46,67% (7/15)
Past tense 0,00% (0/15)
Plural nouns 0,00% (0/15)
Family 26,67% (4/15)
Total average: 13,88% (25/180)
Table 4: CBOW accuracy on our Italian corpus (920MB)
Category Accuracy
Common capital city 33,33% (5/15)
Currency 6,67% (1/15)
City-in-region 26,67% (4/15)
Man-Woman 60,00% (9/15)
Adjective to adverb 6,67% (1/15)
Opposite 0,00% (0/15)
Superlative 6,67% (1/15)
Present-gerund 20,00% (3/15)
Nationality adjective 20,00% (3/15)
Past tense 0,00% (0/15)
Plural nouns 6,67% (1/15)
Family 46,67% (7/15)
Total average: 19,44% (35/180)
Table 5: Skip-gram accuracy on our Italian corpus (920MB)
Category Accuracy
Common capital city 46,67% (7/15)
Currency 0,00% (0/15)
City-in-region 53,33% (8/15)
Man-Woman 46,67% (7/15)
Adjective to adverb 6,67% (1/15)
Opposite 0,00% (0/15)
Superlative 0,00% (0/15)
Present-gerund 6,67% (1/15)
Nationality adjective 40,00% (6/15)
Past tense 0,00% (0/15)
Plural nouns 20,00% (3/15)
Family 60,00% (9/15)
Total average: 23,33% (42/180)
IV. ”Nearest to” test
A simple way to investigate the learned representations is to find the closest words to a user-specified word. We use the cosine distance between two vectors (in our case a vector is the representation of a word in the embedding space), a measure derived from the cosine of the angle between them: the smaller the distance, the closer the two words. Some examples follow, together with a sketch of the query after them:
francia
Nearest to Cosine distance
spagna 0.365999
parigi 0.497820
germania 0.545318
italia 0.554473
roma
Nearest to Cosine distance
bologna 0.382990
torino 0.386104
milano 0.401712
new 0.703315
padre
Nearest to Cosine distance
nonno 0.354935
figlio 0.381211
quando 0.687557
mentre 0.834535
paraurti
Nearest to Cosine distance
calandra 0.367163
cofano 0.475904
abitacolo 0.495059
carrozzeria 0.496319
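A minimal sketch of the query behind these tables, assuming the embeddings are kept in a word-to-vector dictionary (the function name and data layout are ours; the numbers above come from the trained model):

```python
import numpy as np

def nearest_to(emb, word, k=4):
    """Return the k words with the smallest cosine distance from `word`."""
    target = emb[word]
    dist = {}
    for w, v in emb.items():
        if w == word:
            continue
        cos_sim = np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v))
        dist[w] = 1.0 - cos_sim            # cosine distance: smaller means closer
    return sorted(dist.items(), key=lambda kv: kv[1])[:k]

# e.g. nearest_to(emb, "francia") would return [("spagna", 0.36...), ("parigi", 0.49...), ...]
```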
V. ”Doesn’t match” test
The test consists of some questions, each composed of 4 words. The goal of the test is to find the word that does not match the others. We compute the cosine distance of each word from the others, then the mean of these distances; finally, the word with the highest mean is taken as the intrusive word.
Table 6: Some outputs of the test
Word 1 Word 2 Word 3 Word 4 Predicted intrusive word
uomo donna cucina bambino cucina
uomo mamma bambino ragazzo mamma
roma lombardia milano napoli lombardia
argentina francia spagna berlino berlino
sono ero essendo avere avere
trovai trovato trovare essere trovai
essendo avendo amando leggere amando
giallo uomo rosso arancione uomo
giallo bianco rosso arancione arancione
As a result, the trained model is generally able to group similar words into clusters and to single out the word farthest from the others.
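The procedure can be sketched as follows, again assuming a word-to-vector dictionary; the helper name is ours:

```python
import numpy as np

def doesnt_match(emb, words):
    """Return the word with the highest mean cosine distance from the other words."""
    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    means = {w: np.mean([cos_dist(emb[w], emb[o]) for o in words if o != w]) for w in words}
    return max(means, key=means.get)       # the intrusive word

# e.g. doesnt_match(emb, ["uomo", "donna", "cucina", "bambino"]) -> "cucina"
```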
VI. Visualization of learned embeddings
After the training has finished we can visualize the learned embeddings using t-SNE (t-distributed
Stochastic Neighbor Embedding).
t-SNE[6] is a tool to visualize high-dimensional data. It converts similarities between data
points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the
joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a
cost function that is not convex, i.e. with different initializations we can get different results.
The static image of the plot looks very confusing, since it contains 150,000 words. However, by navigating the plot dynamically with the Python tools, one can notice that words with a similar semantic meaning are grouped nearby, as shown in Fig. 3 and Fig. 4.
Figure 3: A closer look at the plot
Figure 4: A closer look at the plot (2)
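The plots above can be reproduced with a sketch along the following lines; we assume scikit-learn's TSNE and matplotlib, and we plot only the most frequent words to keep the static figure readable:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(vectors, labels, max_words=500, out_file="tsne.png"):
    """vectors: (num_words, N) embedding matrix; labels: the matching words."""
    low_dim = TSNE(n_components=2, init="pca", random_state=0).fit_transform(vectors[:max_words])
    plt.figure(figsize=(12, 12))
    for (x, y), word in zip(low_dim, labels[:max_words]):
        plt.scatter(x, y, s=4)
        plt.annotate(word, xy=(x, y), fontsize=6)
    plt.savefig(out_file)
```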
V. Conclusions
In this work we tried to understand word2vec, a well-known tool for learning word embeddings, and to reproduce to some extent the results obtained in the original work for the English language. From the analysis shown in the previous section, we can see that skip-gram performs better than the CBOW model: our CBOW context is made of only the two words nearest to the target word, while with skip-gram a wider context, up to the whole sentence depending on the size of the skip-window, can be considered, which leads to a better understanding of the semantics of the target word.
Regarding the differences between the Italian and English languages, the former, like other European languages, is morphologically richer than the latter. For example, English doesn't have accents, doesn't distinguish masculine and feminine adjectives, and its verbal forms are simpler.
We noticed that by eliminating the so-called stopwords in the Italian corpus (articles, conjunctions, prepositions, etc.) we obtained better performance, because only the main words of each context are taken into account. To achieve this, it was necessary to reduce the size of the window (skip-window) to avoid mixing the contexts of different sentences.
We adopted a simple word analogy test to evaluate the generated word embeddings, based on a literal, and probably rough, translation of the analogy file used for testing. To improve performance, it would be necessary to build a test file with more accurate translations and a greater number of items. Furthermore, to train the neural network more effectively, it would be necessary to add to the corpus a greater number of words from different sources, such as books, conversations and articles from various newspapers, in order to learn more information from multiple contexts.
In conclusion, the test was conducted on corpora of various sizes in order to show that, by increasing the size of the corpus and the number of words in the vocabulary, the neural network provides better results.
References
[1] "Natural language processing (almost) from scratch." - Collobert, Ronan, et al. 2011
[2] “Distributed representations of words and phrases and their compositionality” - Mikolov,
Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff 2013
[3] “Efficient estimation of word representations in vector space.” - Mikolov, Tomas, et al. 2013.
[4] “Word Embeddings Go to Italy: a Comparison of Models and Training Datasets” - Berardi,
Giacomo and Esuli, Andrea and Marcheggiani, Diego 2015
[5] “A closer look at skip-gram modelling.” - Guthrie, David, et al. 2006.
[6] "Visualizing data using t-SNE." - Van der Maaten, Laurens, and Geoffrey Hinton 2008
11

Contenu connexe

Tendances

The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationGennadi Lembersky
 
Some alternative ways to find m ambiguous binary words corresponding to a par...
Some alternative ways to find m ambiguous binary words corresponding to a par...Some alternative ways to find m ambiguous binary words corresponding to a par...
Some alternative ways to find m ambiguous binary words corresponding to a par...ijcsa
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 
Doc format.
Doc format.Doc format.
Doc format.butest
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modelingHiroyuki Kuromiya
 
Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Daniele Di Mitri
 
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...CSCJournals
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Association for Computational Linguistics
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkViet-Trung TRAN
 
Computational model language and grammar bnf
Computational model language and grammar bnfComputational model language and grammar bnf
Computational model language and grammar bnfTaha Shakeel
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisNYC Predictive Analytics
 
A simple library implementation of binary sessions
A simple library implementation of binary sessionsA simple library implementation of binary sessions
A simple library implementation of binary sessionsJeff Cacer
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for SearchBhaskar Mitra
 
A critical reassessment of
A critical reassessment ofA critical reassessment of
A critical reassessment ofijcisjournal
 

Tendances (20)

New word analogy corpus
New word analogy corpusNew word analogy corpus
New word analogy corpus
 
The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine Translation
 
Skip gram and cbow
Skip gram and cbowSkip gram and cbow
Skip gram and cbow
 
070
070070
070
 
Some alternative ways to find m ambiguous binary words corresponding to a par...
Some alternative ways to find m ambiguous binary words corresponding to a par...Some alternative ways to find m ambiguous binary words corresponding to a par...
Some alternative ways to find m ambiguous binary words corresponding to a par...
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 
Doc format.
Doc format.Doc format.
Doc format.
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
Basic review on topic modeling
Basic review on  topic modelingBasic review on  topic modeling
Basic review on topic modeling
 
Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation Lifelong Topic Modelling presentation
Lifelong Topic Modelling presentation
 
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...
Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistak...
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
A Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural NetworkA Vietnamese Language Model Based on Recurrent Neural Network
A Vietnamese Language Model Based on Recurrent Neural Network
 
Computational model language and grammar bnf
Computational model language and grammar bnfComputational model language and grammar bnf
Computational model language and grammar bnf
 
Introduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic AnalysisIntroduction to Probabilistic Latent Semantic Analysis
Introduction to Probabilistic Latent Semantic Analysis
 
Probabilistic content models,
Probabilistic content models,Probabilistic content models,
Probabilistic content models,
 
A simple library implementation of binary sessions
A simple library implementation of binary sessionsA simple library implementation of binary sessions
A simple library implementation of binary sessions
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
A critical reassessment of
A critical reassessment ofA critical reassessment of
A critical reassessment of
 
L3 v2
L3 v2L3 v2
L3 v2
 

En vedette

Word Embedding e word2vec: Introduzione ed Esperimenti Preliminari
Word Embedding e word2vec: Introduzione ed Esperimenti PreliminariWord Embedding e word2vec: Introduzione ed Esperimenti Preliminari
Word Embedding e word2vec: Introduzione ed Esperimenti PreliminariNet7
 
CNN for Sentiment Analysis on Italian Tweets
CNN for Sentiment Analysis on Italian TweetsCNN for Sentiment Analysis on Italian Tweets
CNN for Sentiment Analysis on Italian TweetsGiuseppe Attardi
 
Mood classification of songs based on lyrics
Mood classification of songs based on lyricsMood classification of songs based on lyrics
Mood classification of songs based on lyricsFrancesco Cucari
 
Agile analytics : An exploratory study of technical complexity management
Agile analytics : An exploratory study of technical complexity managementAgile analytics : An exploratory study of technical complexity management
Agile analytics : An exploratory study of technical complexity managementAgnirudra Sikdar
 
Italian language
Italian languageItalian language
Italian languageB.Samu
 
Italy Powerpoint
Italy PowerpointItaly Powerpoint
Italy Powerpointdanacasucci
 
Italy-presentation
Italy-presentationItaly-presentation
Italy-presentationJelena Pahic
 

En vedette (8)

Word Embedding e word2vec: Introduzione ed Esperimenti Preliminari
Word Embedding e word2vec: Introduzione ed Esperimenti PreliminariWord Embedding e word2vec: Introduzione ed Esperimenti Preliminari
Word Embedding e word2vec: Introduzione ed Esperimenti Preliminari
 
CNN for Sentiment Analysis on Italian Tweets
CNN for Sentiment Analysis on Italian TweetsCNN for Sentiment Analysis on Italian Tweets
CNN for Sentiment Analysis on Italian Tweets
 
Mood classification of songs based on lyrics
Mood classification of songs based on lyricsMood classification of songs based on lyrics
Mood classification of songs based on lyrics
 
Word2vec 4 all
Word2vec 4 allWord2vec 4 all
Word2vec 4 all
 
Agile analytics : An exploratory study of technical complexity management
Agile analytics : An exploratory study of technical complexity managementAgile analytics : An exploratory study of technical complexity management
Agile analytics : An exploratory study of technical complexity management
 
Italian language
Italian languageItalian language
Italian language
 
Italy Powerpoint
Italy PowerpointItaly Powerpoint
Italy Powerpoint
 
Italy-presentation
Italy-presentationItaly-presentation
Italy-presentation
 

Similaire à Word2Vec on Italian language

word2vec_summary_revised
word2vec_summary_revisedword2vec_summary_revised
word2vec_summary_revisedBennett Bullock
 
Skip-gram Model Broken Down
Skip-gram Model Broken DownSkip-gram Model Broken Down
Skip-gram Model Broken DownChin Huan Tan
 
Math for Intelligent Systems - 01 Linear Algebra 01 Vector Spaces
Math for Intelligent Systems - 01 Linear Algebra 01  Vector SpacesMath for Intelligent Systems - 01 Linear Algebra 01  Vector Spaces
Math for Intelligent Systems - 01 Linear Algebra 01 Vector SpacesAndres Mendez-Vazquez
 
Chapter 4: Vector Spaces - Part 1/Slides By Pearson
Chapter 4: Vector Spaces - Part 1/Slides By PearsonChapter 4: Vector Spaces - Part 1/Slides By Pearson
Chapter 4: Vector Spaces - Part 1/Slides By PearsonChaimae Baroudi
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data scienceNikhil Jaiswal
 
Vector space interpretation_of_random_variables
Vector space interpretation_of_random_variablesVector space interpretation_of_random_variables
Vector space interpretation_of_random_variablesGopi Saiteja
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Satyam Saxena
 
Course Assignment : Skip gram
Course Assignment : Skip gramCourse Assignment : Skip gram
Course Assignment : Skip gramKhalilBergaoui
 
Supporting Vector Machine
Supporting Vector MachineSupporting Vector Machine
Supporting Vector MachineSumit Singh
 
Multimodal Residual Networks for Visual QA
Multimodal Residual Networks for Visual QAMultimodal Residual Networks for Visual QA
Multimodal Residual Networks for Visual QAJin-Hwa Kim
 
EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...
EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...
EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...IJCNCJournal
 
presentation2-180202073525.pptx
presentation2-180202073525.pptxpresentation2-180202073525.pptx
presentation2-180202073525.pptxKtonNguyn2
 
Website designing compay in noida
Website designing compay in noidaWebsite designing compay in noida
Website designing compay in noidaCss Founder
 

Similaire à Word2Vec on Italian language (20)

word2vec_summary_revised
word2vec_summary_revisedword2vec_summary_revised
word2vec_summary_revised
 
Skip-gram Model Broken Down
Skip-gram Model Broken DownSkip-gram Model Broken Down
Skip-gram Model Broken Down
 
Math for Intelligent Systems - 01 Linear Algebra 01 Vector Spaces
Math for Intelligent Systems - 01 Linear Algebra 01  Vector SpacesMath for Intelligent Systems - 01 Linear Algebra 01  Vector Spaces
Math for Intelligent Systems - 01 Linear Algebra 01 Vector Spaces
 
Chapter 4: Vector Spaces - Part 1/Slides By Pearson
Chapter 4: Vector Spaces - Part 1/Slides By PearsonChapter 4: Vector Spaces - Part 1/Slides By Pearson
Chapter 4: Vector Spaces - Part 1/Slides By Pearson
 
[Emnlp] what is glo ve part i - towards data science
[Emnlp] what is glo ve  part i - towards data science[Emnlp] what is glo ve  part i - towards data science
[Emnlp] what is glo ve part i - towards data science
 
Theory of computing
Theory of computingTheory of computing
Theory of computing
 
Vector space interpretation_of_random_variables
Vector space interpretation_of_random_variablesVector space interpretation_of_random_variables
Vector space interpretation_of_random_variables
 
Sparse autoencoder
Sparse autoencoderSparse autoencoder
Sparse autoencoder
 
Deep Learning Bangalore meet up
Deep Learning Bangalore meet up Deep Learning Bangalore meet up
Deep Learning Bangalore meet up
 
DLBLR talk
DLBLR talkDLBLR talk
DLBLR talk
 
Lecture1.pptx
Lecture1.pptxLecture1.pptx
Lecture1.pptx
 
Course Assignment : Skip gram
Course Assignment : Skip gramCourse Assignment : Skip gram
Course Assignment : Skip gram
 
Iclr2016 vaeまとめ
Iclr2016 vaeまとめIclr2016 vaeまとめ
Iclr2016 vaeまとめ
 
Supporting Vector Machine
Supporting Vector MachineSupporting Vector Machine
Supporting Vector Machine
 
Multimodal Residual Networks for Visual QA
Multimodal Residual Networks for Visual QAMultimodal Residual Networks for Visual QA
Multimodal Residual Networks for Visual QA
 
EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...
EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...
EXTENDED LINEAR MULTI-COMMODITY MULTICOST NETWORK AND MAXIMAL FLOW LIMITED CO...
 
Word embeddings
Word embeddingsWord embeddings
Word embeddings
 
presentation2-180202073525.pptx
presentation2-180202073525.pptxpresentation2-180202073525.pptx
presentation2-180202073525.pptx
 
Draft6
Draft6Draft6
Draft6
 
Website designing compay in noida
Website designing compay in noidaWebsite designing compay in noida
Website designing compay in noida
 

Dernier

Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 

Dernier (20)

Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 

Word2Vec on Italian language

  • 1. Sapienza University of Rome • 12 April 2016 • Course: Neural Networks Word2Vec on Italian language Cucari Francesco De Cillis Daniele Molinari Dario I. Introduction Research on word representation models, word embeddings, has gained a lot of attention in the recent years[1]. This happened also thanks to a renewed boost in neural network technologies, such as deep learning. With these progresses it has become possible to train more complex models on much larger data sets. Probably, one of the most popular of this series of work is without doubt Mikolow’s word2vec [2][3], that introduces the concept of using distributed representations of words. Distributed word representations, also known as word embeddings, represent a word as a vec- tor in n. The more the vectors are closer the more the two corresponding words are deemed to share some degree of syntactical or semantical similarity[4]. For example, the result of a vector calculation vec(‘Paris‘) - vec(‘France‘) + vec(‘Italy‘) is closer to vec(‘Rome‘) than to any other word vector.1 The main purpose of this work is to validate previously proposed experiments for the English language and then trying to figure out if it is possibile to reproduce the same accuracy and performance with the Italian language, using our implementation of word2vec. II. Corpus A corpus is a collection of texts composed by many sentences of various topics, in such a way that the neural network can learn the meaning of each word it finds thanks to the context. Some of the most famous Italian corpus are Paisà2 and ItWac3. Both include texts from the web, such as chat conversations, books extracts and articles. However, both are not formatted prop- erly for our purpose. So, we preferred to manually create a corpus. Our corpus has two datasets in the Italian language: half of the dump of the Italian Wikipedia4 (dated 06/03/2016) and a collection of about 120.000 Italian articles written in 2010 taken from the archive of “La Repubblica”. We trained the network with corpus of different sizes. Initially, we train the model with an English corpus, called text85, that had a dimension of 100 MB, corresponding to about 17 millions of words. Finally, we train the model with our Italian corpus that has a dimension of 920 MB, 1https://code.google.com/archive/p/word2vec/ Google Code archive: word2vec - Tool for computing continuous distributed representations of words. 2http://www.corpusitaliano.it/ 3http://wacky.sslmit.unibo.it/doku.php?id=corpora 4http://dumps.wikimedia.org/itwiki/latest/ 5http://mattmahoney.net/dc/textdata 1
  • 2. Sapienza University of Rome • 12 April 2016 • Course: Neural Networks equivalent to about 122 milions of words. Before the training of the network, the corpus needs some preprocessing. Initially, punctua- tion must be deleted: we need a plain text, because points, colons, commas and others symbols can be interpreted as different words by the algorithm. Then, words are converted to lowercase and numerals are converted to their word forms (e.g. 1992 becomes one nine nine two). Finally, We noticed that by eliminating the so-called stopwords in the Italian corpus (articles, conjunctions, prepositions, etc.) we obtained better performances. After the preprocessing phase, it is possible to train the network. Our model takes in input a subset of the corpus, called vocabulary, whose dimension is essential: with a small vocabulary we do not have enough words to compute similarities in a correct way, otherwise we take into account many words that are useless for our purpose. So, we decided to take into account only the words that appeared more than 10 times into the corpus. III. Network architecture Word2vec is a neural network composed by three layers: an input, an hidden and an output layer. The two main architectures are structurally similar and their names are Continuous Bag Of Words and Skip-gram. I. CBOW CBOW is the original version of word2vec. In our case, we realize a CBOW model (Fig.1) with a contexts of two words, namely the word on the left of the target word and the one on its right. In this way we can predict the target word as output. Figure 1: CBOW architecture As we can see from the Fig.1, V is the vocabulary size and N is the embedding size. x1k, x2k are two one-hot encoded vectors that represent the input words, i.e. they are vector of size V that have a unit equal to one and all the others equal to zero. 2
  • 3. Sapienza University of Rome • 12 April 2016 • Course: Neural Networks The weights between the input layer and the output layer can be represented by a VxN ma- trix W. Each row of W is the N-dimension vector representation vw of the associated word of the input layer. Therefore, the output of the hidden layer is: h = 1 C W(x1 + x2 + ... + xC) = 1 C (vw1 + vw2 + ... + vwC ) (1) where C is the number of words in the context, w1, ..., wC are the words in the context, and vw is the input vector of a word w. We notice that the link activation function of the hidden layer units is simply linear, i.e. it directly passes the weighted sum of inputs to the next layer. Between hidden and output layer, there is a different matrix, called W’ , which has size NxV (it is not the transpose of W). Using this matrix, we can compute a score uj for each word in the vocabulary: uj = v T wj h (2) where v T wj is the j-th column of the matrix W’. Then, with softmax, a log-linear classifier, we can find the posterior distribution of words, which is a multinomial distribution: p(wj|wI,1, ..., wI,C) = yj = exp(wj) ∑V j =1 exp(uj ) (3) After these calculations, it is essential to update the weights of the two matrices to assure backpropagation. The training objective is to maximize the conditional probability of observing the actual output word wO (denote its index in the output layer as j*) given the input context words wI,1, ..., wI,C with regard to the weights: maxp(wO) = maxyj∗ = maxlog(yj∗) = uj∗ − log V ∑ j =1 exp(uj ) := E (4) where E is our loss function (which we want to minimize), and j* is the index of the actual output word in the output layer. The update equation for the hidden-output weights is v (new) wj = v (old) wj − ηejh (5) for j = 1, 2, ..., V where η > 0 is the learning rate, ej := δE δwj = yj − tj (with tj is 1 when the j-th unit is the actual output word, otherwise it is 0) is the prediction error of the output layer, h is the output of the hidden layer and vwj is the output vector of wj. Having obtained the update equations for W’, we can now move on to W. In a similar man- ner, it is possible to obtain the update equation for the input-hidden weights: v (new) wI,c = v (old) wI,c − 1 C ηEH (6) for c = 1, ..., C where vwI,c is a row of W correspondent to the c-th context word and EH = δE δhi = ∑V j=1 ejwij is the sum of the output vectors of all the words in the vocabulary, weighted by their 3
  • 4. Sapienza University of Rome • 12 April 2016 • Course: Neural Networks prediction error. The only rows of W whose derivative is non-zero are the one correspondent to the C context words. All the other rows of W will remain unchanged after this iteration, because their derivatives are zero. Intuitively, since vector EH is the sum of output vectors of all words in vocabulary weighted by their prediction error ej = yj − tj , we can understand this equation as adding a portion of every output vector in vocabulary to the input vectors of the context words. If, in the output layer, the probability of a word wj being the output word is overestimated (yj > tj), then the input vector of the context word wI,c will tend to move farther away from the output vector of wj; on the contrary, if the probability of wj being the output word is underestimated (yj < tj), then the input vector wI,c will tend to move closer to the output vector of wj; if the probability of wj is fairly accurately predicted, then it will have little effect on the movement of the input vector of wI,c. The movement of the input vector of wI,c is determined by the prediction error of all vectors in the vocabulary; the larger the prediction error, the more significant effects a word will exert on the movement on the input vector of the context word. II. Skip-Gram The skip-gram model (Fig.2) is the opposite of the CBOW model: the target word is at the input layer and the context words are at the output layer. Figure 2: Skip-Gram architecture The weights between the input and the hidden layer can be represented by the VxN-dimensional matrix W. Given the input word, we have: h = xT W = Wk, = vwI (7) 4
  • 5. Sapienza University of Rome • 12 April 2016 • Course: Neural Networks which is essentially copying the k-th row of W to h. vwI is the vector representation of the input word wI. This implies that the activation function of the hidden layer units is simply linear. On the output layer, instead of computing one multinomial distribution, we compute C multino- mial distributions. Each output is computed using the same hidden-output matrix: p(wc,j = wO,c|wI) = yc,j = exp(uc,j) ∑V j =1 exp(uj ) (8) where wc,j is the j-th word on the c-th panel of the output layer; wO,c is the actual c-th word in the output context words; wI is the only input word; yc,j is the output of the j-th unit on the c- th panel of the output layer; uc,j is the net input of the j-th unit on the c-th panel of the output layer. Now, the loss function is: E = −logp(wO,1, wO,2, .., wO,C|wI) = −log C ∏ c=1 exp(uc,j∗ c ) ∑V j =1 exp(uj ) = − C ∑ c=1 uj∗ c + Clog V ∑ j =1 exp(uj ) (9) where j∗ c is the index of the actual c-th output context word in the vocabulary. The update equation for hidden-output matrix is: v (new) wj = v (old) wj − ηEIjh (10) for j=1,2,..,V and where EIj is the sum of prediction errors over all context words. The derivation of the update equation for the input-hidden matrix is identical to that of CBOW, except taking into account that the prediction error ej is replaced with EIj. v (new) wI = v (old) wI − ηEIj (11) III. Optimizer We need to minimize the loss in order to obtain the best performances. A well known method for minimizing an objective function is the Stochastic Gradient Descent or SGD. Usually, the objective function comes in form of sum of differentiable functions where there is a certain parameter that needs to be estimated: G(w) = n ∑ i=1 Gi(w) When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions’ gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems like ours. In stochastic gradient descent, the true gradient of G(w) is approximated by a gradient at a single example: w := w − η Gi(w) 5
IV. Experiments and results

In this section we present the main experiments, performed in parallel on the English and on the Italian corpus.

I. Analogies test

In order to evaluate the developed models we manually translated the Google word analogy test for English. The original test is composed of questions divided into semantic questions (e.g. father : mother = grandpa : grandma) and syntactic questions (e.g. going : went = predicting : predicted). We started by extracting a subset of questions, translating the English test into Italian and making some changes. For example, the authors compiled a list of large American cities and the states they belong to; since the frequency of these American cities can be very low in an Italian corpus, we replaced the relationship "American city in state" with "Italian city in region". We then removed the relationship "Comparative", given that Italian comparatives are usually built as multi-word expressions (smart : smarter = intelligente : più intelligente), as well as the relationship "Plural verbs". The English category "Present-Participle" was mapped to the Italian gerund, which is of more common use in Italian. Finally, we added the relationship "Family".

The overall accuracy is then evaluated over all question types. A question is considered correctly answered only if the closest word to the vector computed with the method above is exactly the expected word; synonyms are therefore counted as mistakes. This also means that reaching 100% is likely to be impossible, as the current models do not have any input information about word morphology.
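The scoring rule just described can be sketched as follows, assuming the learned embeddings are available as a dictionary mapping each word to its vector; this is a simplification of our actual evaluation code, and the names are illustrative.

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def answer_analogy(emb, a, b, c):
    """Return the vocabulary word closest to vec(b) - vec(a) + vec(c),
    excluding the three question words themselves."""
    target = emb[b] - emb[a] + emb[c]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

def analogy_accuracy(emb, questions):
    """questions: iterable of (a, b, c, expected) tuples.  Only an exact match
    counts as correct, so synonyms of the expected word count as mistakes."""
    questions = list(questions)
    hits = sum(answer_analogy(emb, a, b, c) == d for a, b, c, d in questions)
    return hits / len(questions)
```

This mirrors the exact-match criterion used for the accuracies reported in Tables 2-5.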
Some examples from each category are shown in Table 1.

Table 1: Examples of questions in the Semantic-Syntactic Word Relationship test set for the English language, together with some of the examples used for the Italian test

Type of relationship              Word Pair 1 (English)     Word Pair 2 (Italian)
Common capital city               Athens - Greece           Berlino - Germania
Currency                          Italy - Euro              Canada - Dollaro
City-in-state (City-in-region)    Chicago - Illinois        Roma - Lazio
Man-Woman                         brother - sister          padre - madre
Adjective to adverb               rapid - rapidly           preciso - precisamente
Opposite                          possibly - impossibly     onesto - disonesto
Comparative                       big - bigger              /
Superlative                       easy - easiest            male - malissimo
Present-participle                think - thinking          volare - volando
Nationality adjective             Switzerland - Swiss       Francia - francese
Past tense                        walking - walked          danzare - danzava
Plural nouns                      mouse - mice              gatto - gatti
Plural verbs                      work - works              /
Family                            /                         sorella - sorellastra

II. Analogies results on English language

Table 2: Skip-gram accuracy on the reduced English corpus text8 (100 MB)

Category                 Accuracy
Common capital city      26.67% (4/15)
Currency                 0.00% (0/15)
City-in-state            20.00% (3/15)
Man-Woman                26.67% (4/15)
Adjective to adverb      0.00% (0/15)
Opposite                 0.00% (0/15)
Comparative              6.67% (1/15)
Superlative              0.00% (0/15)
Present-participle       20.00% (3/15)
Nationality adjective    53.33% (8/15)
Past tense               20.00% (3/15)
Plural nouns             13.33% (2/15)
Plural verbs             0.00% (0/15)
Total average            14.36% (28/195)

III. Analogies results on Italian language

Table 3: Skip-gram accuracy on our reduced Italian corpus (100 MB)
Category                 Accuracy
Common capital city      33.33% (5/15)
Currency                 0.00% (0/15)
City-in-region           20.00% (3/15)
Man-Woman                40.00% (6/15)
Adjective to adverb      0.00% (0/15)
Opposite                 0.00% (0/15)
Superlative              0.00% (0/15)
Present-gerund           0.00% (0/15)
Nationality adjective    46.67% (7/15)
Past tense               0.00% (0/15)
Plural nouns             0.00% (0/15)
Family                   26.67% (4/15)
Total average            13.88% (25/180)

Table 4: CBOW accuracy on our Italian corpus (920 MB)

Category                 Accuracy
Common capital city      33.33% (5/15)
Currency                 6.67% (1/15)
City-in-region           26.67% (4/15)
Man-Woman                60.00% (9/15)
Adjective to adverb      6.67% (1/15)
Opposite                 0.00% (0/15)
Superlative              6.67% (1/15)
Present-gerund           20.00% (3/15)
Nationality adjective    20.00% (3/15)
Past tense               0.00% (0/15)
Plural nouns             6.67% (1/15)
Family                   46.67% (7/15)
Total average            19.44% (35/180)

Table 5: Skip-gram accuracy on our Italian corpus (920 MB)

Category                 Accuracy
Common capital city      46.67% (7/15)
Currency                 0.00% (0/15)
City-in-region           53.33% (8/15)
Man-Woman                46.67% (7/15)
Adjective to adverb      6.67% (1/15)
Opposite                 0.00% (0/15)
Superlative              0.00% (0/15)
Present-gerund           6.67% (1/15)
Nationality adjective    40.00% (6/15)
Past tense               0.00% (0/15)
Plural nouns             20.00% (3/15)
Family                   60.00% (9/15)
Total average            23.33% (42/180)

IV. "Nearest to" test

A simple way to investigate the learned representations is to find the closest words to a user-specified word. We use the cosine distance between two vectors (in our case, a vector is the representation of a word in the embedding space), a measure based on the cosine of the angle between them. These are some examples:

Nearest to "francia"     Cosine distance
spagna                   0.365999
parigi                   0.497820
germania                 0.545318
italia                   0.554473

Nearest to "roma"        Cosine distance
bologna                  0.382990
torino                   0.386104
milano                   0.401712
new                      0.703315

Nearest to "padre"       Cosine distance
nonno                    0.354935
figlio                   0.381211
quando                   0.687557
mentre                   0.834535

Nearest to "paraurti"    Cosine distance
calandra                 0.367163
cofano                   0.475904
abitacolo                0.495059
carrozzeria              0.496319
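The lookup behind these tables can be sketched as follows, again assuming a dictionary that maps each word to its learned vector; the names and the choice of returning four neighbours are illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two vectors (0 = same direction, 2 = opposite)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest_to(emb, query, k=4):
    """Return the k (word, distance) pairs closest to `query`, excluding itself."""
    dists = {w: cosine_distance(v, emb[query]) for w, v in emb.items() if w != query}
    return sorted(dists.items(), key=lambda item: item[1])[:k]

# Example call (emb must contain the query word):
#   nearest_to(emb, "francia")  ->  [("spagna", 0.36...), ("parigi", 0.49...), ...]
```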
V. "Doesn't match" test

The test consists of a set of questions, each composed of four words. The goal is to find the word that does not match the others. For each word we compute the cosine distance to each of the other words and take the mean of these distances; the word with the highest mean distance is chosen as the intruder.

Table 6: Some outputs of the test

Word 1       Word 2      Word 3     Word 4       Predicted intrusive word
uomo         donna       cucina     bambino      cucina
uomo         mamma       bambino    ragazzo      mamma
roma         lombardia   milano     napoli       lombardia
argentina    francia     spagna     berlino      berlino
sono         ero         essendo    avere        avere
trovai       trovato     trovare    essere       trovai
essendo      avendo      amando     leggere      amando
giallo       uomo        rosso      arancione    uomo
giallo       bianco      rosso      arancione    arancione

In general, the trained model is able to group similar words into clusters and to single out the word farthest from the others.
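The procedure described above can be sketched as follows, a minimal version under the same dictionary-of-vectors assumption as the previous sketches.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def doesnt_match(emb, words):
    """Return the word whose mean cosine distance from the other words is largest."""
    def mean_distance(w):
        others = [o for o in words if o != w]
        return np.mean([cosine_distance(emb[w], emb[o]) for o in others])
    return max(words, key=mean_distance)

# Example call:
#   doesnt_match(emb, ["uomo", "donna", "cucina", "bambino"])  ->  "cucina"
```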
VI. Visualization of learned embeddings

After training has finished, we can visualize the learned embeddings using t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE [6] is a tool for visualizing high-dimensional data: it converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. Its cost function is not convex, so different initializations can give different results. The static image of the plot looks very confusing, since it contains 150,000 words; when the plot is explored interactively with the Python tools, however, it can be seen that words with a similar semantic meaning are grouped nearby, as shown in Fig. 3 and Fig. 4.

Figure 3: A closer look at the plot

Figure 4: A closer look at the plot (2)
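A two-dimensional plot of this kind can be produced along the following lines, using scikit-learn's TSNE and matplotlib. The embedding matrix below is a random stand-in for the trained vectors, and all parameter values are illustrative rather than the ones actually used for Fig. 3 and Fig. 4.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for the trained embedding matrix and its vocabulary.
embeddings = np.random.randn(500, 128)
words = ["word%d" % i for i in range(500)]

# Project the vectors to 2-D; perplexity and init are illustrative choices.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
points = tsne.fit_transform(embeddings)

plt.figure(figsize=(12, 12))
plt.scatter(points[:, 0], points[:, 1], s=4)
for (x, y), label in zip(points[:100], words[:100]):   # label only a subset to keep it readable
    plt.annotate(label, (x, y), fontsize=6)
plt.savefig("tsne_embeddings.png")
```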
V. Conclusions

In this work we tried to understand word2vec, a well-known tool for learning word embeddings, and to reproduce to some extent the results obtained in the original work for the English language. From the analysis shown in the previous section, we can deduce that skip-gram performs better than the CBOW model: the CBOW context is made up of only the two words nearest to the target word, while with skip-gram the whole sentence can be considered as context, which, depending on the size of the skip-window, leads to a better understanding of the semantics of the target word.

Regarding the differences between the Italian and English languages, the former, like other European languages, is morphologically richer than the latter: English has no accents, does not distinguish masculine and feminine adjectives, and has simpler verbal forms. We noticed that by eliminating the so-called stopwords in the Italian corpus (articles, conjunctions, prepositions, etc.) we obtained better performance, because only the main words of the context are taken into account. To achieve this, it was necessary to reduce the size of the window (skip-window), to avoid mixing contexts from different sentences.

We adopted a simple word-analogy test to evaluate the generated word embeddings, providing a literal, and probably rough, translation of the analogy file used for testing. To improve the evaluation it would be necessary to build a test file with more accurate translations and a greater number of items. Furthermore, to train the neural network effectively, it would be necessary to add to the corpus a greater number of words from different sources, such as books, conversations and articles from various newspapers, in order to learn more information from multiple contexts. In conclusion, the tests were conducted on corpora of various sizes to show that, as the size of the corpus and the number of words in the vocabulary increase, the neural network provides better results.

References

[1] "Natural language processing (almost) from scratch" - Collobert, Ronan, et al., 2011.
[2] "Distributed representations of words and phrases and their compositionality" - Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff, 2013.
[3] "Efficient estimation of word representations in vector space" - Mikolov, Tomas, et al., 2013.
[4] "Word Embeddings Go to Italy: a Comparison of Models and Training Datasets" - Berardi, Giacomo; Esuli, Andrea; Marcheggiani, Diego, 2015.
[5] "A closer look at skip-gram modelling" - Guthrie, David, et al., 2006.
[6] "Visualizing data using t-SNE" - Van der Maaten, Laurens; Hinton, Geoffrey, 2008.