23 MARCH 2018, LEIDEN UNIVERSITY
WORD2VEC TUTORIAL
SUZAN VERBERNE
CONTENTS
 Introduction
 Background
 What is word2vec?
 What can you do with it?
 Practical session
INTRODUCTORY ROUND
ABOUT ME
 Master (2002): Natural Language Processing (NLP)
 PhD (2010): NLP and Information Retrieval (IR)
 Now: Assistant Professor at the Leiden Institute for Advanced
Computer Science (LIACS), affiliated with the Data Science research
programme.
 Research: projects on text mining and information retrieval in
various domains
 Teaching: Data Science & Text Mining
BACKGROUND
WHERE TO START
 Linguistics: Distributional hypothesis
 Data science: Vector Space Model (VSM)
DISTRIBUTIONAL HYPOTHESIS
 Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162.
 “Words that occur in similar contexts tend to be similar”
VECTOR SPACE MODEL
 Traditional Vector Space Model (Information Retrieval):
 documents and queries represented in a vector space
 where the dimensions are the words
VECTOR SPACE MODEL
 Vector Space Models represent (embed) words or documents in a
continuous vector space
 The vector space is relatively low-dimensional (100–320 dimensions instead of tens of thousands)
 Semantically and syntactically similar words are mapped to nearby
points (Distributional Hypothesis)
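To make "nearby points" concrete, here is a toy sketch with made-up 3-dimensional vectors (real embeddings have 100+ dimensions); cosine similarity is the usual measure of closeness in these vector spaces:

```python
from math import sqrt

# Toy 3-dimensional embeddings, made up for illustration.
vectors = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "truck": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically similar words end up closer than dissimilar ones:
print(cosine(vectors["cat"], vectors["dog"]))    # high (close to 1)
print(cosine(vectors["cat"], vectors["truck"]))  # low
```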
PCA projection of 320-dimensional vector space
WORD2VEC
WHAT IS WORD2VEC?
 Word2vec is a particularly computationally-efficient predictive
model for learning word embeddings from raw text
 So that similar words are mapped to nearby points
EXAMPLE: PATIENT FORUM MINER
WHERE DOES IT COME FROM?
 Neural network language model (NNLM) (Bengio et al., 2003 – not
discussed today)
 Mikolov previously proposed (in his MSc thesis, PhD thesis, and other papers)
to first learn word vectors “using a neural network with a single
hidden layer” and then train the NNLM independently
 Word2vec directly extends this work => “word vectors learned
using a simple model”
 Many architectures and models have been proposed for computing
word vectors (e.g. see Socher’s Stanford group work which resulted
in GloVe - http://nlp.stanford.edu/projects/glove/)
WHAT IS WORD2VEC (NOT)?
 It is:
 Neural network-based
 Word embeddings
 VSM-based
 Co-occurrence-based
 It is not:
 Deep learning
Word embedding is the collective name for a
set of language modeling and feature learning
techniques in natural language processing
(NLP) where words or phrases from the
vocabulary are mapped to vectors of real
numbers in a low-dimensional space relative to
the vocabulary size
WORDS IN CONTEXT
HOW IS THE MODEL TRAINED?
 Move through the training corpus with a sliding window. Each word
is a prediction problem:
 the objective is to predict the current word with the help of its
contexts (or vice versa)
 The prediction error determines how we adjust the current word
vector. Gradually, the vectors converge to (hopefully) optimal values
Note that prediction is not an aim in itself here: it is just a
proxy for learning vector representations that are useful for other tasks
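The sliding-window step can be sketched as follows (skip-gram style, toy window size; the real implementation additionally subsamples frequent words and uses tricks such as negative sampling):

```python
# Each word in the corpus becomes a prediction problem, paired with
# the context words inside the sliding window around it.
def training_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(training_pairs(sentence, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ...]
```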
OTHER VECTOR SPACE SEMANTICS TECHNIQUES
 Some “classic” NLP techniques for estimating continuous representations of
words:
 LSA (Latent Semantic Analysis)
 LDA (Latent Dirichlet Allocation)
 Distributed representations of words learned by neural networks
outperform LSA on various tasks
 LDA is computationally expensive and cannot be trained on very
large datasets
ADVANTAGES OF WORD2VEC
 It scales
 Train on billion word corpora
 In limited time
 Possibility of parallel training
 Word embeddings pre-trained by one party can be reused by others
 For entirely different tasks
 Incremental training
 Train on one piece of data, save results, continue training later on
 There is a Python module for it:
 Gensim word2vec
WHAT CAN YOU DO WITH IT?
DO IT YOURSELF
 Implementation in Python package gensim
import gensim
model = gensim.models.Word2Vec(sentences, size=100, window=5,
                               min_count=5, workers=4)
size: the dimensionality of the feature vectors (common: 200 or 320)
window: the maximum distance between the current and predicted word
within a sentence
min_count: minimum number of occurrences of a word in the corpus to be
included in the model
workers: for parallelization on multicore machines
DO IT YOURSELF
model.most_similar('apple')
 [('banana', 0.8571481704711914), ...]
model.doesnt_match("breakfast cereal dinner lunch".split())
 'cereal'
model.similarity('woman', 'man')
 0.73723527
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 This is the famous example:
vector(king) – vector(man) + vector(woman) = vector(queen)
 Actually, what the original paper says is: if you subtract the vector
for ‘man’ from the one for ‘king’ and add the vector for ‘woman’,
the vector closest to the one you end up with turns out to be the
one for ‘queen’.
 More interesting:
France is to Paris as Germany is to …
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In Advances in
Neural Information Processing Systems, pages 3111–3119, 2013.
WHAT CAN YOU DO WITH IT?
 A is to B as C is to ?
 It also works for syntactic relations:
 vector(biggest) - vector(big) + vector(small) = vector(smallest)
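The analogy arithmetic can be sketched with hand-made 2-dimensional vectors in which the directions are consistent (real models learn such regularities from data); in gensim the equivalent call is model.most_similar(positive=['king', 'woman'], negative=['man']):

```python
import math

# Hand-made 2-d vectors, illustration only: the second dimension encodes
# "royalty" and the first dimension encodes "gender".
vecs = {
    "king":  [1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [-1.0, 0.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = (w for w in vecs if w not in {a, b, c})
    return max(candidates, key=lambda w: cos(vecs[w], target))

print(analogy("king", "man", "woman"))  # → queen
```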
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Improve NLP applications
 Improve Text Mining applications
WHAT CAN YOU DO WITH IT?
 Mining knowledge about natural language
 Selecting out-of-the-list words
 Example: which word does not belong in [monkey, lion, dog, truck]
 Selectional preferences
 Example: predict typical verb-noun pairs: people as subject of eating is more
likely than people as object of eating
 Improve NLP applications:
 Sentence completion/text prediction
 Bilingual Word Embeddings for Phrase-Based Machine Translation
WHAT CAN YOU DO WITH IT?
 Improve Text Mining applications:
 (Near-)synonym detection (e.g. for query expansion)
 Concept representation of texts
 Example: Twitter sentiment classification
 Document similarity
 Example: cluster news articles per news event
DEAL WITH PHRASES
 Entities as single words:
 during pre-processing: find collocations (e.g. Wikipedia anchor texts),
and feed them as single ‘words’ to the neural network.
 Bigrams (built-in function)
bigram = gensim.models.Phrases(sentence_stream)
term_list = list(bigram[sentence_stream])
model = gensim.models.Word2Vec(term_list)
HOW TO EVALUATE A MODEL
 There is a comprehensive test set that contains five types of
semantic questions and nine types of syntactic questions:
 8869 semantic questions
 10675 syntactic questions
 https://rare-technologies.com/word2vec-tutorial/
FURTHER READING
 T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In
Advances in Neural Information Processing Systems, pages 3111–3119,
2013.
 http://radimrehurek.com/2014/02/word2vec-tutorial/
 http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
 https://rare-technologies.com/word2vec-tutorial/
PRACTICAL SESSION
PRE-TRAINED WORD2VEC MODELS
 We will compare two pre-trained word2vec models:
 A model trained on a Dutch web corpus (COW)
 A model trained on Dutch web forum data
 Goal: inspect the differences between the models
THANKS
 Traian Rebedea
 Tom Kenter
 Mohammad Mahdavi
 Satoshi Sekine
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 

Tutorial on word2vec

  • 1. 23 MARCH 2018, LEIDEN UNIVERSITY WORD2VEC TUTORIAL SUZAN VERBERNE
  • 2. CONTENTS  Introduction  Background  What is word2vec?  What can you do with it?  Practical session Suzan Verberne
  • 4. ABOUT ME  Master (2002): Natural Language Processing (NLP)  PhD (2010): NLP and Information Retrieval (IR)  Now: Assistant Professor at the Leiden Institute for Advanced Computer Science (LIACS), affiliated with the Data Science research programme.  Research: projects on text mining and information retrieval in various domains  Teaching: Data Science & Text Mining Suzan Verberne
  • 6. WHERE TO START  Linguistics: Distributional hypothesis  Data science: Vector Space Model (VSM) Suzan Verberne
  • 7. DISTRIBUTIONAL HYPOTHESIS  Harris, Z. (1954). “Distributional structure”. Word. 10 (23): 146–162.  “Words that occur in similar contexts tend to be similar” Suzan Verberne
  • 8. VECTOR SPACE MODEL  Traditional Vector Space Model (Information Retrieval):  documents and queries represented in a vector space  where the dimensions are the words Suzan Verberne
  • 12. VECTOR SPACE MODEL  Vector Space Models represent (embed) words or documents in a continuous vector space  The vector space is relatively low-dimensional (100 – 320 dimensions instead of 10,000s)  Semantically and syntactically similar words are mapped to nearby points (Distributional Hypothesis) Suzan Verberne
  • 13. Suzan Verberne PCA projection of 320- dimensional vector space
  • 15. WHAT IS WORD2VEC?  Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text  So that similar words are mapped to nearby points Suzan Verberne
  • 16. EXAMPLE: PATIENT FORUM MINER Suzan Verberne
  • 17. WHERE DOES IT COME FROM  Neural network language model (NNLM) (Bengio et al., 2003 – not discussed today)  Mikolov previously proposed (MSc thesis, PhD thesis, other papers) to first learn word vectors “using neural network with a single hidden layer” and then train the NNLM independently  Word2vec directly extends this work => “word vectors learned using a simple model”  Many architectures and models have been proposed for computing word vectors (e.g. see Socher’s Stanford group work which resulted in GloVe - http://nlp.stanford.edu/projects/glove/) Suzan Verberne
  • 18. WHAT IS WORD2VEC (NOT)?  It is:  Neural network-based  Word embeddings  VSM-based  Co-occurrence-based  It is not:  Deep learning Suzan Verberne Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers in a low-dimensional space relative to the vocabulary size
  • 20. HOW IS THE MODEL TRAINED?  Move through the training corpus with a sliding window. Each word is a prediction problem:  the objective is to predict the current word with the help of its contexts (or vice versa)  The outcome of the prediction determines whether we adjust the current word vector. Gradually, vectors converge to (hopefully) optimal values  Importantly, prediction is not an aim in itself here: it is just a proxy to learn vector representations that are good for other tasks Suzan Verberne
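The sliding-window idea can be sketched in a few lines of plain Python. This is an illustration of how (context, target) training pairs are generated, not gensim's internal implementation; the toy sentence and window size are made up for the example:

```python
# Illustrative sketch (not gensim internals): a sliding window turns each
# word into one prediction problem, pairing the current (target) word with
# the words around it (its context), as in CBOW-style training.
def context_target_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window),
                                  min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, target))
    return pairs

sentence = "the cat sat on the mat".split()
for context, target in context_target_pairs(sentence, window=2):
    print(target, "<-", context)
```

Skip-gram simply inverts the direction of each pair: the target word is used to predict each of its context words.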
  • 21. OTHER VECTOR SPACE SEMANTICS TECHNIQUES  Some “classic” NLP for estimating continuous representations of words  – LSA (Latent Semantic Analysis)  – LDA (Latent Dirichlet Allocation)  Distributed representations of words learned by neural networks outperform LSA on various tasks  LDA is computationally expensive and cannot be trained on very large datasets Suzan Verberne
  • 22. ADVANTAGES OF WORD2VEC  It scales  Train on billion word corpora  In limited time  Possibility of parallel training  Pre-trained word embeddings trained by one can be used by others  For entirely different tasks  Incremental training  Train on one piece of data, save results, continue training later on  There is a Python module for it:  Gensim word2vec Suzan Verberne
  • 23. Suzan Verberne WHAT CAN YOU DO WITH IT?
  • 24. DO IT YOURSELF  Implementation in Python package gensim import gensim model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4) size: the dimensionality of the feature vectors (common: 200 or 320; renamed vector_size in gensim ≥ 4.0) window: the maximum distance between the current and predicted word within a sentence min_count: minimum number of occurrences of a word in the corpus to be included in the model workers: for parallelization on multicore machines Suzan Verberne
  • 25. DO IT YOURSELF model.most_similar('apple')  [('banana', 0.8571481704711914), ...] model.doesnt_match("breakfast cereal dinner lunch".split())  'cereal' model.similarity('woman', 'man')  0.73723527 Suzan Verberne
  • 26. WHAT CAN YOU DO WITH IT?  A is to B as C is to ?  This is the famous example: vector(king) – vector(man) + vector(woman) = vector(queen)  Actually, what the original paper says is: if you subtract the vector for ‘man’ from the one for ‘king’ and add the vector for ‘woman’, the vector closest to the one you end up with turns out to be the one for ‘queen’.  More interesting: France is to Paris as Germany is to … Suzan Verberne
  • 27. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
  • 28. WHAT CAN YOU DO WITH IT?  A is to B as C is to ?  It also works for syntactic relations:  vector(biggest) - vector(big) + vector(small) = Suzan Verberne vector(smallest)
  • 29. WHAT CAN YOU DO WITH IT?  Mining knowledge about natural language  Improve NLP applications  Improve Text Mining applications Suzan Verberne
  • 30. WHAT CAN YOU DO WITH IT?  Mining knowledge about natural language  Selecting out-of-the-list words  Example: which word does not belong in [monkey, lion, dog, truck]  Selectional preferences  Example: predict typical verb-noun pairs: people as subject of eating is more likely than people as object of eating  Improve NLP applications:  Sentence completion/text prediction  Bilingual Word Embeddings for Phrase-Based Machine Translation Suzan Verberne
  • 31. WHAT CAN YOU DO WITH IT?  Improve Text Mining applications:  (Near-)Synonym detection ( query expansion)  Concept representation of texts  Example: Twitter sentiment classification  Document similarity  Example: cluster news articles per news event Suzan Verberne
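A common baseline for the document-similarity use case is to represent a document as the average of its word vectors and compare documents by cosine similarity. The sketch below uses invented 3-d vectors in place of a trained model, purely to show the mechanics:

```python
import math

# Toy word vectors (hypothetical values); a real model would supply these.
word_vecs = {
    "football": [0.9, 0.1, 0.0],
    "match":    [0.8, 0.2, 0.1],
    "election": [0.0, 0.9, 0.1],
    "vote":     [0.1, 0.8, 0.2],
}

def doc_vector(tokens):
    # average the vectors of the in-vocabulary words of a document
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u)) *
            math.sqrt(sum(b * b for b in v)))
    return dot / norm

sport = doc_vector(["football", "match"])
politics = doc_vector(["election", "vote"])
print(cosine(sport, politics))  # low: the documents are about different events
```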
  • 32. DEAL WITH PHRASES  Entities as single words:  during pre-processing: find collocations (e.g. Wikipedia anchor texts), and feed them as single ‘words’ to the neural network.  Bigrams (built-in function) bigram = gensim.models.Phrases(sentence_stream) term_list = list(bigram[sentence_stream]) model = gensim.models.Word2Vec(term_list) Suzan Verberne
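Under the hood, gensim's Phrases scores candidate bigrams with the count-based formula from Mikolov et al. (2013): score(a, b) = (count(a b) − δ) / (count(a) × count(b)), joining pairs whose score exceeds a threshold. A simplified, stdlib-only sketch (the delta and threshold values here are illustrative, not gensim's defaults):

```python
from collections import Counter

# Simplified sketch of collocation detection via the Mikolov et al. (2013)
# phrase score: (count(a b) - delta) / (count(a) * count(b)).
# Bigrams scoring above the threshold are treated as single 'words'.
def find_bigrams(sentences, delta=1, threshold=0.1):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return {pair for pair, n in bigrams.items()
            if (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]]) > threshold}

sents = [["new", "york", "is", "big"],
         ["i", "love", "new", "york"],
         ["new", "york", "city"]]
print(find_bigrams(sents))  # -> {('new', 'york')}
```

The delta term discounts rare bigrams so that pairs seen only once are not promoted to phrases.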
  • 33. HOW TO EVALUATE A MODEL  There is a comprehensive test set that contains five types of semantic questions (8,869 questions) and nine types of syntactic questions (10,675 questions)  https://rare-technologies.com/word2vec-tutorial/ Suzan Verberne
  • 34. FURTHER READING  T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.  T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.  http://radimrehurek.com/2014/02/word2vec-tutorial/  http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/  https://rare-technologies.com/word2vec-tutorial/ Suzan Verberne
  • 36. PRE-TRAINED WORD2VEC MODELS  We will compare two pre-trained word2vec models:  A model trained on a Dutch web corpus (COW)  A model trained on Dutch web forum data  Goal: inspect the differences between the models Suzan Verberne
  • 37. THANKS  Traian Rebedea  Tom Kenter  Mohammad Mahdavi  Satoshi Sekine Suzan Verberne
