Team Members
Ganesh Borle
Sonam Gupta
Satyam Verma
Vivek Vishnoi
From Word Embeddings To Document
Distances
Mentor
Narendra Babu Unnam
Content
● The question
● Existing ways/models
● Drawbacks
● Proposed solutions
● Need
● Word Mover's Distance
● Word Embeddings to Document Distances
The Question
How can we check whether two statements/documents
convey the same meaning?
“Obama speaks to the media in Illinois”
and
“The President greets the press in Chicago.”
Same content. Different words.
Existing Ways
Documents are commonly represented as:
● Bag of Words (BOW)
● Term frequency-inverse document frequency (TF-IDF) of the
words
● Both are based on the count of common words:
#Dimensions = #Vocabulary (thousands)
● Both get stuck if the documents share no words:
“Obama” != “President”
Existing Ways
BOW
● A model in which a text (such as a sentence or a document) is
represented as the bag (multiset) of its words, disregarding grammar
and even word order but keeping multiplicity.
● Commonly used in methods of document classification where the
(frequency of) occurrence of each word is used as a feature for training
a classifier.
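The BOW idea above can be sketched in a few lines of Python; this is a minimal illustration (whitespace tokenization, no stop-word handling), not a production vectorizer:

```python
from collections import Counter

def bow(text):
    """Toy bag-of-words: lowercase, split on whitespace, keep multiplicity."""
    return Counter(text.lower().split())

d1 = bow("Obama speaks to the media in Illinois")
print(d1["obama"])  # 1

# Word order is disregarded, multiplicity is kept:
print(bow("a b b") == bow("b a b"))  # True
```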
Existing Ways
Tf-idf Model
A statistical measure used to evaluate how important a word is
to a document in a collection or corpus. The importance increases
proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus.
TF (term frequency) score: how often the term occurs in the document.
IDF (inverse document frequency) score: the logarithm of the total number of
documents divided by the number of documents containing the term.
Existing Ways
Tf-idf Model
The tf-idf weight of a term is the product of its tf weight and
its idf weight.
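The TF, IDF, and product-of-weights definitions above can be sketched as follows; this is a minimal illustration using raw counts and an unsmoothed logarithm (real systems typically use smoothed or normalized variants):

```python
import math

def tf(term, doc):
    # Term frequency: raw count of the term in the document (a token list).
    return doc.count(term)

def idf(term, corpus):
    # Inverse document frequency: log(N / df), df = number of docs containing the term.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    # The tf-idf weight is the product of the tf weight and the idf weight.
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "obama speaks to the media".split(),
    "the president greets the press".split(),
    "the match ended in a draw".split(),
]
# A term appearing in every document ("the") gets weight 0;
# a rarer term ("obama") gets a positive weight.
print(tfidf("the", corpus[0], corpus))
print(tfidf("obama", corpus[0], corpus))
```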
Existing Ways
BM25 Okapi
● A bag-of-words retrieval function that ranks a set of documents based
on the query terms appearing in each document, regardless of the
inter-relationship between the query terms within a document.
● A ranking function that extends TF-IDF for each word w in a
document D
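The BM25 extension of TF-IDF can be sketched as below. This uses the standard Okapi BM25 scoring formula with one common smoothed-IDF variant; the parameter defaults k1=1.5 and b=0.75 are conventional choices, not values from the slides:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` (a token list) for a query, against `corpus`."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for w in query_terms:
        df = sum(1 for d in corpus if w in d)  # document frequency of w
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        f = doc.count(w)  # term frequency of w in doc
        # Saturating TF component, normalized by document length.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "obama speaks to the media".split(),
    "the president greets the press".split(),
    "cricket match ends in a draw".split(),
]
# Documents containing the query term score higher; absent terms contribute 0.
print(bm25_score(["media"], corpus[0], corpus))
print(bm25_score(["media"], corpus[1], corpus))
```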
Existing Ways
Drawbacks of previous models
● BOW and TF-IDF representations are often not suitable for
document distances, because the vectors are frequently nearly
orthogonal.
● Another significant drawback of these representations is that
they do not capture the distance between individual words.
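The near-orthogonality problem is easy to demonstrate on the deck's own example pair: after removing stop words, the two sentences share no words, so the cosine similarity of their BOW vectors is exactly zero. A minimal sketch (stop words stripped by hand for illustration):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two sparse count vectors (dicts).
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

d1 = Counter("obama speaks media illinois".split())
d2 = Counter("president greets press chicago".split())

# Same content, different words: the BOW vectors are orthogonal.
print(cosine(d1, d2))  # 0.0
print(cosine(d1, d1))  # 1.0
```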
Proposed Solutions
Low-dimensional latent feature models
● Latent Semantic Indexing (LSI)
○ LSI (also known as LSA) assumes that words that are close in meaning will
occur in similar pieces of text.
○ Uses singular value decomposition (SVD) on the BOW representation to
arrive at a semantic feature space.
● Latent Dirichlet Allocation (LDA)
○ A generative model for text documents that learns representations
for documents as distributions over word topics.
Proposed Solutions
More Models...
● mSDA (Marginalized Stacked Denoising Autoencoder)
○ A representation learned from stacked denoising autoencoders (SDAs),
marginalized for fast training.
○ SDAs have been shown to achieve state-of-the-art performance
on document sentiment analysis tasks.
Need for improved Solution
● While these low-dimensional approaches produce a more coherent
document representation than BOW, they often do not improve the
empirical performance of BOW on distance-based tasks.
Hence we need something better...
Improved Solution
Word Mover’s Distance (WMD)
● A novel distance function between text documents.
● WMD measures the dissimilarity between two text documents as the
minimum cumulative distance that the embedded words of one
document need to “travel” to reach the embedded words of the other
document.
● Based on recent results in word embeddings that learn semantically
meaningful representations for words from local co-occurrences in
sentences.
● Based on the well-known concept of the Earth Mover’s Distance.
Word Mover’s Distance (WMD)
● Built on top of Google’s word2vec and its embeddings.
● Scales to very large datasets, because word2vec embeddings can be
trained efficiently.
● Semantic relationships are often preserved in vector operations on word
vectors, e.g., vec(Berlin) - vec(Germany) + vec(France) is close to
vec(Paris).
● Represents text documents as weighted point clouds of embedded
words.
Let's get familiar with the terms used in WMD….
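The vec(Berlin) - vec(Germany) + vec(France) ≈ vec(Paris) analogy can be checked mechanically. The 2-d vectors below are hand-made toys chosen to mimic the capital-of relationship, NOT real word2vec embeddings:

```python
import math

# Toy vectors (assumed for illustration only, not trained embeddings):
# one axis roughly encodes "which country", the other "city vs. country".
vec = {
    "Berlin":  [1.0, 3.0],
    "Germany": [1.0, 1.0],
    "France":  [2.0, 1.0],
    "Paris":   [2.0, 3.0],
    "Tokyo":   [5.0, 3.0],
}

def nearest(v, exclude):
    # Nearest word by Euclidean distance, skipping the query words.
    return min((w for w in vec if w not in exclude),
               key=lambda w: math.dist(v, vec[w]))

# vec(Berlin) - vec(Germany) + vec(France)
v = [b - g + f for b, g, f in zip(vec["Berlin"], vec["Germany"], vec["France"])]
print(nearest(v, {"Berlin", "Germany", "France"}))  # Paris
```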
Word Mover’s Distance (WMD)
Word Embedding
● The collective name for a set of language modeling and feature
learning techniques in natural language processing (NLP) in which words
or phrases from the vocabulary are mapped to vectors of real numbers.
● Stores each word as a point in space, represented by a vector with a
fixed number of dimensions (commonly 300).
● For example, “Hello” might be represented as: [0.4, -0.11, 0.55, 0.3 . . .
0.1, 0.02]
● Dimensions are basically projections along different axes; more of a
mathematical concept.
Word Mover’s Distance (WMD)
nBOW representation
● A document is represented as a normalized bag-of-words (nBOW) vector d,
a point on the (n−1)-dimensional simplex of word distributions.
● Two documents with different unique words will lie in different regions of
this simplex; however, they may still be semantically close.
● If word i appears ci times in the document, we set di = ci / ∑j=1..n cj.
● The vector is very sparse, since most vocabulary words do not appear in
any given document.
Word travel cost
● A natural measure of word dissimilarity is the Euclidean distance between
the words in the word2vec embedding space.
● The cost associated with “traveling” from one word to another:
c(i, j) = ||xi − xj||2
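The nBOW weights and travel costs defined above can be computed directly. The 2-d embeddings here are assumed toys for illustration; in the real method they come from word2vec:

```python
import math
from collections import Counter

def nbow(tokens, vocab):
    # Normalized BOW: d_i = c_i / sum_j c_j, over a fixed vocabulary order.
    counts = Counter(tokens)
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def travel_cost(w1, w2, emb):
    # c(i, j) = Euclidean distance between the two words' embeddings.
    return math.dist(emb[w1], emb[w2])

# Toy 2-d "embeddings" (assumption for illustration, not trained vectors).
emb = {"obama": [0.0, 1.0], "president": [0.1, 1.1], "media": [1.0, 0.0]}

vocab = sorted(emb)  # ['media', 'obama', 'president']
d = nbow(["obama", "media"], vocab)
print(d)  # [0.5, 0.5, 0.0] -- sparse: 'president' gets weight 0
print(travel_cost("obama", "president", emb))  # small: the words are close
```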
Word Mover’s Distance (WMD)
Document distance
● Uses the “travel cost” between two words to create a distance between two
documents.
● Each word i in d may be transformed into any word j in d’ in total or in parts:
Tij ≥ 0 denotes how much of word i in d travels to word j in d’ (T ∈ Rn×n).
● To transform d entirely into d’, we ensure that the entire outgoing flow
from word i equals di, i.e. ∑j Tij = di; the same applies to the incoming
flow of word j in d’: ∑i Tij = d’j.
● The distance between the two documents is then the minimum
(weighted) cumulative cost required to move all words from d to d’.
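In the special case where both documents have the same number of unique words, each with uniform weight 1/n, the optimal flow T is a one-to-one matching (a vertex of the transport polytope), so tiny examples can be solved exactly by brute force over permutations. The embeddings below are hand-made toys (related words placed close together), not word2vec vectors:

```python
import itertools
import math

# Toy embeddings (assumed for illustration): related words are nearby.
emb = {
    "obama": [0.0, 1.0], "president": [0.1, 1.1],
    "speaks": [1.0, 0.0], "greets": [1.1, 0.1],
    "media": [0.0, 0.0], "press": [0.1, 0.1],
    "illinois": [1.0, 1.0], "chicago": [1.1, 0.9],
    "band": [5.0, 5.0],
}

def wmd_uniform(doc1, doc2):
    """Exact WMD for equal-length documents of unique words with uniform
    weights 1/n: the optimal transport plan is a min-cost perfect matching,
    found here by brute force over permutations (fine for tiny documents)."""
    n = len(doc1)
    assert n == len(doc2)
    best = min(
        sum(math.dist(emb[doc1[i]], emb[perm[i]]) for i in range(n))
        for perm in itertools.permutations(doc2)
    )
    return best / n  # each word carries weight 1/n

d0 = ["obama", "speaks", "media", "illinois"]
d1 = ["president", "greets", "press", "chicago"]
d2 = ["president", "greets", "press", "band"]
# d1 paraphrases d0 word for word, so its words travel much less than d2's.
print(wmd_uniform(d0, d1) < wmd_uniform(d0, d2))  # True
```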
Word Mover’s Distance (WMD)
Transportation problem
● The minimum cumulative cost of moving d to d’ given the constraints is
provided by the solution to the following linear program.
● This optimization is a special case of the earth mover’s distance metric, a
well-studied transportation problem for which specialized solvers have
been developed; we can use those solvers to compute the WMD.
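The linear program in question (the standard WMD formulation, with T the flow matrix, c(i, j) the travel cost, and d, d’ the nBOW vectors defined earlier) is:

```latex
\min_{T \ge 0} \sum_{i,j=1}^{n} T_{ij} \, c(i, j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i \in \{1, \dots, n\},
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j \in \{1, \dots, n\}.
```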
Word Mover’s Distance (WMD)
● The components of the WMD metric between a query D0 and two
sentences D1 and D2 (with equal BOW distance). The arrows represent
flow between two words and are labeled with their distance contribution.
● The flow between two sentences D3 and D0 with different numbers of
words. This mismatch causes the WMD to move words to multiple
similar words.
Word Mover’s Distance (WMD)
Complexity
● Best average time complexity: O(p³ log p), where p is the number of
unique words in the documents.
● For datasets with many unique words (i.e., high-dimensional) and/or a
large number of documents, solving the WMD optimal transport problem
can become prohibitive.
Word Mover’s Distance (WMD)
Datasets Used
● Twitter
○ A collection of tweets labeled with sentiment polarity.
● BBC Sport
○ Documents in 5 categories: athletics, cricket, football, rugby, tennis.
● Classic
○ Documents in 5 categories, such as cacm, med, etc.
● 20 Newsgroups
○ Documents divided into 20 categories, such as space, medical, etc.
Word Mover’s Distance (WMD)
% Error on BBC Sports data
Word Mover’s Distance (WMD)
% Error on Twitter data
Word Mover’s Distance (WMD)
GitHub link:
● https://github.com/buggynap/Word-Movers-Distance
Reference:
● Matthew J. Kusner et al., “From Word Embeddings to Document
Distances” (ICML 2015)
SlideShare Link:
