Word Space Models
and
Random Indexing
By
Dileepa Jayakody
Overview
● Text Similarity
● Word Space Model
– Distributional hypothesis
– Distance and Similarity measures
– Pros & Cons
– Dimension Reduction
● Random Indexing
– Example
– Random Indexing Parameters
– Data pre-processing in Random Indexing
– Random Indexing Benefits and Concerns
Text Similarity
● Human readers judge the similarity between texts by comparing
their abstract meaning: whether they discuss a similar topic
● How can meaning be modeled programmatically?
● In the simplest approach, two texts are assumed to have a similar
meaning if they contain the same words
Meaning of a Word
● The meaning of a word can be determined by the context
formed by the surrounding words
● E.g. the meaning of the word "foobar" is determined by the
words that co-occur with it, such as "drink", "beverage" or "sodas":
– He drank the foobar at the game.
– Foobar is the number three beverage.
– A case of foobar is cheap compared to other sodas.
– Foobar tastes better when cold.
● A co-occurrence matrix represents the context vectors of
words/documents, as sketched below
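As a concrete illustration, here is a minimal Python sketch (not part of the original slides; all names are illustrative) that builds a word-by-word co-occurrence matrix from the example sentences above, using a symmetric window of two words:

```python
# A minimal sketch (not from the original slides) that builds a
# word-by-word co-occurrence matrix from the foobar sentences, using a
# symmetric window of two words; all names are illustrative.
from collections import defaultdict

sentences = [
    "he drank the foobar at the game",
    "foobar is the number three beverage",
    "a case of foobar is cheap compared to other sodas",
    "foobar tastes better when cold",
]

window = 2
cooc = defaultdict(lambda: defaultdict(int))  # word -> neighbour -> count
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][tokens[j]] += 1

print(dict(cooc["foobar"]))  # the sparse context vector of "foobar"
```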
Word Space Model
● The word-space model is a computational model of meaning that
represents similarity between words/texts
● It derives the meaning of words by plotting them in an
n-dimensional geometric space
Word Space Model
● The number of dimensions n in the word space can be arbitrarily
large (word * word | word * document)
● The coordinates used to plot each word depend on the frequency
with which that word co-occurs with each contextual feature in the
text
● E.g. words that do not co-occur with the word to be plotted
within a given context are assigned a coordinate value of zero
● The set of zero and non-zero values corresponding to the
coordinates of a word in a word space is recorded in a context
vector
Distributional Hypothesis in Word Space
● To deduce a certain level of meaning, the coordinates of a word
need to be measured relative to the coordinates of other words
● The linguistic concept known as the distributional hypothesis
states that "words that occur in the same contexts tend to have
similar meanings"
● The level of closeness of words in the word-space is called the
spatial proximity of words
● Spatial proximity represents the semantic similarity of words in
word space models
Distance and Similarity Measures
● Cosine Similarity
(a common approach that determines spatial proximity by
measuring the cosine of the angle between the context vectors of
two texts; see the sketch after this list)
● Other measures
– Euclidean
– Lin
– Jaccard
– Dice
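As an illustration, a minimal cosine-similarity sketch in Python, assuming numpy; the example vectors are made up, not taken from the slides:

```python
# A minimal cosine-similarity sketch, assuming numpy; the example
# vectors are illustrative, not taken from the slides.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = a.b / (|a||b|); 1 = same direction, 0 = orthogonal."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm else 0.0

v1 = np.array([2.0, 1.0, 0.0, 3.0])   # context vector of word 1
v2 = np.array([1.0, 1.0, 0.0, 2.0])   # context vector of word 2
print(cosine_similarity(v1, v2))      # close to 1.0 -> similar contexts
```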
Word Space Models
● Latent Semantic Analysis (document based co-occurrence :
word * document)
● Hyperspace Analogue to Language (word based co-occurrence :
word * word)
● Latent Dirichlet Allocation
● Random Indexing
Word Space Model Pros & Cons
● Pros
– A mathematically well-defined model lets us express
semantic similarity in mathematical terms
– Constitutes a purely descriptive approach to semantic
modeling; it does not require any previous linguistic or
semantic knowledge
● Cons
– Efficiency and scalability problems with the high
dimensionality of the context vectors
– The majority of cells in the matrix will be zero, due to the
data-sparseness problem
Dimension Reduction
● Singular Value Decomposition
– a matrix factorization technique that decomposes and
approximates a matrix so that the result has far fewer columns
while preserving the similarity structure of the original (a
sketch follows below)
● Non-negative matrix factorization
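A minimal sketch of truncated SVD as used for this kind of dimension reduction, assuming numpy; the matrix sizes and random data are illustrative only:

```python
# A minimal sketch of truncated SVD for dimension reduction, assuming
# numpy; the matrix sizes and random data are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(0.1, size=(1000, 200)).astype(float)  # toy term-document counts

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 50                          # keep only the top-k singular dimensions
X_k = U[:, :k] * s[:k]          # 1000 x 50 reduced word representations

print(X.shape, "->", X_k.shape)  # (1000, 200) -> (1000, 50)
```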
Cons of Dimension Reduction
● Computationally very costly
● A one-time operation: constructing the co-occurrence matrix and
then transforming it has to be redone from scratch every time
new data is encountered
● Does not avoid the initial huge co-occurrence matrix; it requires
an initial pass over the entire data set, which is computationally
cumbersome
● No intermediate results: only after the co-occurrence matrix has
been constructed and transformed can any processing begin
Random Indexing
Magnus Sahlgren,
Swedish Institute of Computer Science, 2005
● A word space model that is inherently incremental and does not
require a separate dimension reduction phase
● Each word is represented by two vectors
– Index vector : contains a randomly assigned label. The
random label is a vector filled mostly with zeros, except for a
handful of +1s and -1s located at random indexes. Index
vectors are expected to be nearly orthogonal (a generation
sketch follows this slide)
e.g. school = [0,0,0,......,0,1,0,...........,-1,0,..............]
– Context vector : produced by scanning through the text; each
time a word occurs in a context (e.g. in a document, or within a
sliding context window), that context's d-dimensional index
vector is added to the context vector of the word in question
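A minimal sketch of generating such sparse ternary index vectors, assuming numpy; the dimensionality and number of non-zero entries are illustrative choices, not values from the slides:

```python
# A minimal sketch of sparse ternary index vectors for Random Indexing.
# D and NONZERO are illustrative parameter choices, not from the slides.
import numpy as np

D = 2000        # dimensionality of index and context vectors
NONZERO = 10    # number of randomly placed +1/-1 entries
rng = np.random.default_rng(42)

def make_index_vector() -> np.ndarray:
    """Mostly zeros, with a handful of +1s and -1s at random positions."""
    v = np.zeros(D)
    positions = rng.choice(D, size=NONZERO, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=NONZERO)
    return v

index_vectors = {}    # word -> fixed random index vector
context_vectors = {}  # word -> accumulated context vector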
Random Indexing Example
● Sentence : "the quick brown fox jumps over the lazy dog."
● With a window-size of 2, the context vector for "fox" is
calculated by adding the index vectors as below;
● N-2(quick) + N-1(brown) + N1(jumps) + N2(over), where Nk
denotes the k-th permutation of the specified index vector
● Two words will have similar context vectors if the words
appear in similar contexts in the text
● Finally, a document is represented by the sum of the context
vectors of all words occurring in the document (see the sketch
below)
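Continuing the sketch above, here is a hedged illustration of accumulating context vectors over a window of 2, realizing the permutation Nk as a vector rotation via np.roll (one common implementation choice, assumed here rather than stated in the slides):

```python
# Continuation of the sketch above: accumulate context vectors over a
# sliding window of 2, realizing the permutation N^k as np.roll (a
# rotation); one common implementation choice, assumed here.
import numpy as np

def index_vec(word: str) -> np.ndarray:
    if word not in index_vectors:
        index_vectors[word] = make_index_vector()
    return index_vectors[word]

def train(tokens: list, window: int = 2) -> None:
    for i, word in enumerate(tokens):
        ctx = context_vectors.setdefault(word, np.zeros(D))
        for k in range(-window, window + 1):
            if k == 0 or not (0 <= i + k < len(tokens)):
                continue
            # neighbour at offset k contributes its rotated index vector
            ctx += np.roll(index_vec(tokens[i + k]), k)

train("the quick brown fox jumps over the lazy dog".split())
# a document is then the sum of the context vectors of its words
doc_vec = sum(context_vectors[w] for w in context_vectors)
```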
Random Indexing Parameters
● The length of the vector
– determines the dimensionality and storage requirements
● The number of nonzero (+1,-1) entries in the index vector
– has an impact on how the random distortion will be
distributed over the index/context vector.
● Context window size (left and right context boundaries of a
word)
● Weighting Schemes for words within context window
– Constant weighting
– A weighting factor that depends on the distance to the focus
word in the middle of the context window (sketched below)
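One possible distance-based weighting, sketched as a drop-in for the accumulation loop above; the 2**(1 - |k|) falloff is an illustrative assumption, not a scheme prescribed by the slides:

```python
# One possible distance-based weighting; the 2**(1 - |k|) falloff is an
# illustrative assumption, not prescribed by the slides.
def weight(k: int) -> float:
    return 2.0 ** (1 - abs(k))   # offsets +/-1 -> 1.0, +/-2 -> 0.5, ...

# inside train(): ctx += weight(k) * np.roll(index_vec(tokens[i + k]), k)
```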
Data Preprocessing prior to Random
Indexing
● Filtering stop words : frequent words like and, the, thus, hence
contribute very little context unless phrases are being analyzed
● Stemming words : reducing inflected words to their stem, base
or root form, e.g. fishing, fisher, fished > fish
● Lemmatizing words : closely related to stemming, but reduces
words to a single base or root form based on the word's
context, e.g. better, good > good
● Preprocessing numbers, smileys, money : replace them with
<number>, <smiley>, <money> to mark that the sentence had a
number/smiley at that position (see the sketch below)
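A minimal preprocessing sketch using NLTK; this assumes the 'stopwords' and 'wordnet' corpora have been fetched via nltk.download():

```python
# A minimal preprocessing sketch using NLTK; assumes the 'stopwords'
# and 'wordnet' corpora have been fetched via nltk.download().
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = [t for t in "the fisher fished while fishing".split()
          if t not in stop]                       # drop stop words
print([stemmer.stem(t) for t in tokens])          # ['fisher', 'fish', 'fish']
print(lemmatizer.lemmatize("better", pos="a"))    # 'good' (adjective lemma)
```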
Random Indexing Vs LSA
● In contrast to other WSMs like LSA, which first construct the
co-occurrence matrix and then extract context vectors, the
Random Indexing process runs backwards
● Context vectors are accumulated first; a co-occurrence matrix
can then be constructed by collecting the context vectors as rows
of the matrix
● Compresses sparse raw data to a smaller representation without
a separate dimensionality reduction phase as in LSA
Random Indexing Benefits
● The dimensionality of the final context vector of a document
will not depend on the number of documents or words that have
been indexed
● Method is incremental
● No need to sample all texts before results can be produced,
hence intermediate results can be obtained
● Simple computation for context vector generation
● Doesn't require intensive processing power or memory
Random Indexing Design Concerns
● Random distortion
– Index and context vectors may not be perfectly orthogonal
– All words will show some similarity, depending on the vector
dimensionality relative to the corpus loaded into the index (a
small dimensionality representing a big corpus can result in
random distortion)
– One has to decide what level of random distortion is acceptable
in a context vector that represents a document built from the
context vectors of individual words
Random Indexing Design Concerns
● Negative similarity scores
● Words with no similarity would normally be expected to get a
cosine similarity score of zero, but with Random Indexing they
sometimes get a negative score, due to opposite signs at the
same index in the words' context vectors
● This effect is proportional to the size of the corpus and the
dimensionality of the Random Index
Conclusion
● Random Indexing is an efficient and scalable word space model
● Can be used for text analysis applications requiring an
incremental approach to analysis,
e.g. email clustering and categorizing, online forum analysis
● The optimal parameter values need to be determined in advance
to gain high accuracy: dimensionality, number of non-zero
entries and context window size
Thank you
