Latent Semantic Indexing
Prepared by:

HATOUM Saria
DONGO Irvin

Presented to: Prof. CHBEIR Richard

Bayonne/2013
Overview
• Introduction
– Information Retrieval
• Vector Space Model

• Problems
• Latent Semantic Indexing
– Algorithm
– Example
– Advantages
– Disadvantages

Introduction
• Many documents are available.

• Information needs to be extracted from them.
• That information should be sorted and classified.
• Users query the information.

Information Retrieval
• Before LSI:
– Literal matching against a text corpus with many documents.
• Given a query, find the relevant documents.

– Some terms in a user's query will literally match terms in irrelevant documents.

Some Methods for IR
• Set-Theoretic
– Fuzzy Set

• Algebraic
– Vector Space
• Generalised Vector Space
• Latent Semantic Indexing

• Probabilistic
– Binary Independence

Vector Space Model
• An algebraic model for representing text documents.

• Documents and queries are both vectors of t term weights:
d_j = (w_{1,j}, w_{2,j}, …, w_{t,j})
q = (w_{1,q}, w_{2,q}, …, w_{t,q})

Vector Space Method
– Term (rows) by document (columns) matrix, based on term occurrence
– One vector is associated with each document
– The cosine measures the similarity between vectors (documents)
• small angle = large cosine = similar
• large angle = small cosine = dissimilar
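
A minimal Python sketch of the points above, assuming raw term counts as weights and whitespace tokenization. The three toy documents and the helper names (`term_document_matrix`, `column`, `cosine`) are illustrative assumptions, not part of the original slides.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Return (sorted vocabulary, matrix) with terms as rows and documents as columns."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

def column(matrix, j):
    """Extract the vector of document j (one column of the matrix)."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus: two documents about cars, one about fruit.
docs = ["car engine speed", "car engine repair", "fruit apple pear"]
vocab, matrix = term_document_matrix(docs)
d0, d1, d2 = (column(matrix, j) for j in range(len(docs)))

sim_01 = cosine(d0, d1)  # two of three terms shared: small angle
sim_02 = cosine(d0, d2)  # no terms shared: orthogonal vectors
```

The two car documents share two of their three terms and get a cosine of 2/3, while the car and fruit documents share none and get 0.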

Cosine Similarity Measure

• Sim(di, dj) = 1, if di = dj.
• Sim(di, dj) = 0, if di and dj share no terms (orthogonal vectors).
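
For reference, the standard definition of the cosine measure, written with the weights w of the Vector Space Model slide. The index k over the t terms is an assumed notation.

```latex
\mathrm{Sim}(d_i, d_j) = \cos\theta
  = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert}
  = \frac{\sum_{k=1}^{t} w_{k,i}\, w_{k,j}}
         {\sqrt{\sum_{k=1}^{t} w_{k,i}^{2}} \; \sqrt{\sum_{k=1}^{t} w_{k,j}^{2}}}
```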

Document vector space
[Figure: document vectors and the query vector plotted in a two-dimensional space with axes Word 1 and Word 2]

Problem Introduction
• The traditional term-matching method doesn't work well in information retrieval.
• We want to capture concepts instead of words. Concepts are reflected in the words. However,
– One term may have multiple meanings.
– Different terms may have the same meaning.

The Problems
• Two problems arise when using the vector space model:
– Synonymy: there are many ways to express a given concept, e.g. "automobile" when querying on "car"
• leads to poor recall (recall: the percentage of all relevant documents that are retrieved)
– Polysemy: words have multiple meanings, e.g. "surfing"
• leads to poor precision (precision: the percentage of the retrieved documents that are relevant)
• The context of the documents is not taken into account.
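
The two measures defined above can be sketched in a few lines of Python. The document identifiers and the query scenario are hypothetical, chosen only to show synonymy lowering recall and an unrelated literal match lowering precision.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical literal-match query "car":
# documents that only say "automobile" are missed (synonymy),
# and an unrelated document containing "car" is returned.
relevant = {"doc_car_1", "doc_car_2", "doc_automobile_1", "doc_automobile_2"}
retrieved = {"doc_car_1", "doc_car_2", "doc_cable_car"}

precision, recall = precision_recall(retrieved, relevant)
```

Here two of the three retrieved documents are relevant (precision 2/3), but only two of the four relevant documents are found (recall 1/2).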

Polysemy and Context
• Document similarity on single word level: polysemy and context
[Figure: the ambiguous word "saturn" occurs in two contexts. Meaning 1 (the planet): ring, jupiter, space, planet. Meaning 2 (the car brand): car, company, dodge, ford. A shared word contributes to similarity if used in the 1st meaning, but not if in the 2nd.]
Problem Statement
• Goal: allow users to retrieve information on the basis of the conceptual topic or meaning of a document.

Latent Semantic Indexing
• LSI overcomes these problems of lexical matching:
– It is a statistical information retrieval method that retrieves text based on the concepts it contains, not just by matching specific keywords.

Characteristics of LSI
• Documents are represented as "bags of words": the order of the words in a document is not important, only how many times each word appears.
• LSI projects queries and documents into a space with "latent" semantic dimensions.
• It converts a high-dimensional space into a lower-dimensional one.

Characteristics of LSI
• Concepts are represented as patterns of words that usually appear together in documents.
– For example, "jaguar", "car", and "speed" might usually appear in documents about sports cars, whereas "jaguar", "animal", and "hunting" might refer to the concept of the jaguar as an animal.

• LSI is based on the principle that words that are used in the same contexts tend to have similar meanings.
• LSI uses Singular Value Decomposition for the mapping of terms to concepts.

Generate matrix
• The number of words is huge.

• Throw out noise words: 'and', 'is', 'at', 'the', etc.
• Select and use a smaller set of words that are of interest.
• Stemming: remove word endings, e.g. learning, learned → learn
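
A rough Python sketch of these preprocessing steps. The stop-word list and the suffix rules are illustrative assumptions and far cruder than a real stemmer such as Porter's, which the slides do not name.

```python
STOP_WORDS = {"a", "and", "at", "in", "is", "of", "the"}  # illustrative, not exhaustive
SUFFIXES = ("ing", "ed", "s")  # naive rule set, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, strip surrounding punctuation, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]

tokens = preprocess("Learning is hard, and the student learned at home.")
```

With these rules, "Learning" and "learned" both reduce to "learn", so they fall on the same row of the term-document matrix.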

“Semantic” Space
[Figure: words plotted in the "semantic" space form two groups: House, Home, Domicile; and Kumquat, Orange, Pear, Apple]
Information Retrieval
• Represent each document as a word vector

• Represent the corpus as a term-document matrix (T-D matrix) and analyse it with a linear algebra method called SVD
• A classical method:
– Create a new vector from the query terms

– Find the documents with the highest cosine similarity

Singular Value Decomposition (SVD)
• We decompose the term-document matrix into three matrices.
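
In the usual notation (the symbol names are assumptions, since the formula on this slide did not survive extraction), the decomposition of the t × d term-document matrix A, its rank-k truncation, and the customary way of projecting a query q into the reduced space can be written as:

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T}, \qquad
q_k = q^{T} U_k \Sigma_k^{-1}
```

Here U holds the term vectors, Σ is the diagonal matrix of singular values in decreasing order, and V holds the document vectors. The subscript k means that only the k largest singular values and the corresponding columns are kept.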

Example
• d1: Shipment of gold damaged in a fire.

• d2: Delivery of silver arrived in a silver truck.
• d3: Shipment of gold arrived in a truck.
• q: Gold silver truck

Example
New document vectors (coordinates in the reduced two-dimensional space)

• d1 = [-0.4945, 0.6492]
• d2 = [-0.6458, -0.7194]
• d3 = [-0.5817, 0.2469]

Example
sim(q, d_i) = cos θ
• sim(q,d1) = -0.0541
• sim(q,d2) = 0.9910
• sim(q,d3) = 0.4478
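
The example can be reproduced with a short NumPy sketch. It assumes raw term counts, lowercase whitespace tokenization, k = 2, and the common query projection q^T U_k Σ_k^{-1}. The slides do not show these steps explicitly, so this is one plausible reading rather than the authors' exact procedure.

```python
import numpy as np

docs = [
    "shipment of gold damaged in a fire",
    "delivery of silver arrived in a silver truck",
    "shipment of gold arrived in a truck",
]
query = "gold silver truck"

vocab = sorted({w for d in docs for w in d.split()})
# Term-document matrix A: rows are terms, columns are documents (raw counts).
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)
q = np.array([query.split().count(t) for t in vocab], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent dimensions kept
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

doc_vectors = V_k      # one row per document in the reduced space
q_k = (q @ U_k) / s_k  # query projected as q^T U_k Sigma_k^{-1}

def cosine(u, v):
    """Cosine similarity between two NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = [cosine(q_k, d) for d in doc_vectors]
# Document indices ordered from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)
```

The signs of individual coordinates depend on the SVD routine and may be flipped relative to the vectors listed in the slides. The cosine similarities are unaffected, and the resulting order is d2, d3, d1.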

Advantages
• LSI overcomes two of the most problematic constraints of queries:
– Synonymy
– Polysemy

• True (latent) dimensions: the new dimensions are a better representation of documents and queries.
• Term dependence: the traditional vector space model assumes term independence, whereas LSI captures the strong associations between terms that exist in natural language.

Disadvantages
• Storage
– Many documents have more than 150 unique terms, so the sparse …

• Efficiency
– With LSI, the query must be compared to every document in the collection.

• Static Matrix
– When new documents are added, a new SVD of the main matrix must be computed.

Thank you for your Attention!!!

32
33

Contenu connexe

Tendances

RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeNational Institute of Informatics
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)Vladimir Alexiev, PhD, PMP
 
Hierarchical Dirichlet Process
Hierarchical Dirichlet ProcessHierarchical Dirichlet Process
Hierarchical Dirichlet ProcessSangwoo Mo
 
Using NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureUsing NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureDatabricks
 
Open Data Mashups: linking fragments into mosaics
Open Data Mashups: linking fragments into mosaicsOpen Data Mashups: linking fragments into mosaics
Open Data Mashups: linking fragments into mosaicsphduchesne
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text miningKrish_ver2
 
Managing RDF data with graph databases
Managing RDF data with graph databasesManaging RDF data with graph databases
Managing RDF data with graph databasesGraph-TA
 
Relational Database Management System
Relational Database Management SystemRelational Database Management System
Relational Database Management Systemsweetysweety8
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing extKrish_ver2
 
Text data mining1
Text data mining1Text data mining1
Text data mining1KU Leuven
 
ChronoSAGE: Diversifying Topic Modeling Chronologically
ChronoSAGE: Diversifying Topic Modeling ChronologicallyChronoSAGE: Diversifying Topic Modeling Chronologically
ChronoSAGE: Diversifying Topic Modeling ChronologicallyTomonari Masada
 
Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataGraph-TA
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkGezim Sejdiu
 
Cluster labeling fcl_weeklymeeting30102013
Cluster labeling fcl_weeklymeeting30102013Cluster labeling fcl_weeklymeeting30102013
Cluster labeling fcl_weeklymeeting30102013Vahid Moosavi
 
WG5: A data wrangling experiment
WG5: A data wrangling experimentWG5: A data wrangling experiment
WG5: A data wrangling experimentWARCnet
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - IndexingSean Golliher
 

Tendances (20)

RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)RDF Data and Image Annotations in ResearchSpace (slides)
RDF Data and Image Annotations in ResearchSpace (slides)
 
Hierarchical Dirichlet Process
Hierarchical Dirichlet ProcessHierarchical Dirichlet Process
Hierarchical Dirichlet Process
 
Using NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 LiteratureUsing NLP to Explore Entity Relationships in COVID-19 Literature
Using NLP to Explore Entity Relationships in COVID-19 Literature
 
Open Data Mashups: linking fragments into mosaics
Open Data Mashups: linking fragments into mosaicsOpen Data Mashups: linking fragments into mosaics
Open Data Mashups: linking fragments into mosaics
 
SWT Lecture Session 8 - Rules
SWT Lecture Session 8 - RulesSWT Lecture Session 8 - Rules
SWT Lecture Session 8 - Rules
 
Dsa unit 1
Dsa unit 1Dsa unit 1
Dsa unit 1
 
4.4 text mining
4.4 text mining4.4 text mining
4.4 text mining
 
Managing RDF data with graph databases
Managing RDF data with graph databasesManaging RDF data with graph databases
Managing RDF data with graph databases
 
Relational Database Management System
Relational Database Management SystemRelational Database Management System
Relational Database Management System
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
ChronoSAGE: Diversifying Topic Modeling Chronologically
ChronoSAGE: Diversifying Topic Modeling ChronologicallyChronoSAGE: Diversifying Topic Modeling Chronologically
ChronoSAGE: Diversifying Topic Modeling Chronologically
 
Deriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF DataDeriving an Emergent Relational Schema from RDF Data
Deriving an Emergent Relational Schema from RDF Data
 
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talkDistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
DistLODStats: Distributed Computation of RDF Dataset Statistics - ISWC 2018 talk
 
Cluster labeling fcl_weeklymeeting30102013
Cluster labeling fcl_weeklymeeting30102013Cluster labeling fcl_weeklymeeting30102013
Cluster labeling fcl_weeklymeeting30102013
 
IR
IRIR
IR
 
Presentation 1st
Presentation 1stPresentation 1st
Presentation 1st
 
WG5: A data wrangling experiment
WG5: A data wrangling experimentWG5: A data wrangling experiment
WG5: A data wrangling experiment
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
 

En vedette

Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationChristoph Trattner
 
Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Kyunghoon Kim
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Krishna Bollojula
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...Damiano Spina
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_wordszukun
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...Christos Katsanos
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)muzzy4friends
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Ra'Fat Al-Msie'deen
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011aneeshabakharia
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionCory Andrew Henson
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureRakuten Group, Inc.
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML, Inc
 
An approach to source code plagiarism
An approach to source code plagiarismAn approach to source code plagiarism
An approach to source code plagiarismvarsha_bhat
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesJinYeong Bak
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiHow to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiSocial Media Camp
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationElaheh Barati
 

En vedette (20)

Recommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human CategorizationRecommending Tags with a Model of Human Categorization
Recommending Tags with a Model of Human Categorization
 
Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1Mathematical approach for Text Mining 1
Mathematical approach for Text Mining 1
 
Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3Analysis of Reviews on Sony Z3
Analysis of Reviews on Sony Z3
 
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
SpeakerLDA: Discovering Topics in Transcribed Multi-Speaker Audio Contents @ ...
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
 
20 cv mil_models_for_words
20 cv mil_models_for_words20 cv mil_models_for_words
20 cv mil_models_for_words
 
AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...AutoCardSorter - Designing the Information Architecture of a web site using L...
AutoCardSorter - Designing the Information Architecture of a web site using L...
 
Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)Latent Semantic Indexing and Search Engines Optimimization (SEO)
Latent Semantic Indexing and Search Engines Optimimization (SEO)
 
Practical Machine Learning
Practical Machine Learning Practical Machine Learning
Practical Machine Learning
 
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...Mining Features from the Object-Oriented Source Code of a Collection of Softw...
Mining Features from the Object-Oriented Source Code of a Collection of Softw...
 
SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011SNAPP - Learning Analytics and Knowledge Conference 2011
SNAPP - Learning Analytics and Knowledge Conference 2011
 
A Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine PerceptionA Semantics-based Approach to Machine Perception
A Semantics-based Approach to Machine Perception
 
Latent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet MixtureLatent Semantic Transliteration using Dirichlet Mixture
Latent Semantic Transliteration using Dirichlet Mixture
 
BigML Summer 2016 Release
BigML Summer 2016 ReleaseBigML Summer 2016 Release
BigML Summer 2016 Release
 
An approach to source code plagiarism
An approach to source code plagiarismAn approach to source code plagiarism
An approach to source code plagiarism
 
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet ProcessesBayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
Bayesian Nonparametric Topic Modeling Hierarchical Dirichlet Processes
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco AmalfiHow to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
How to use Latent Semantic Analysis to Glean Real Insight - Franco Amalfi
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text Summarization
 
Naive Bayes | Statistics
Naive Bayes | StatisticsNaive Bayes | Statistics
Naive Bayes | Statistics
 

Similaire à LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco)

TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxKalpit Desai
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the WebRinke Hoekstra
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Andrea Scharnhorst
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain OntologyKeerti Bhogaraju
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingNa'im Tyson
 
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...Sergey Sosnovsky
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspectiveankurpandeyinfo
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dlmadhuvardhan
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dlmadhuvardhan
 
Innovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLPInnovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLPariadnenetwork
 
IASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with TriplesIASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with TriplesDr.-Ing. Thomas Hartmann
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchNoemi Derzsy
 
CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...
CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...
CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...Keith.May
 

Similaire à LSI latent (par HATOUM Saria et DONGO ESCALANTE Irvin Franco) (20)

TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
 
Probabilistic Topic models
Probabilistic Topic modelsProbabilistic Topic models
Probabilistic Topic models
 
Semantic web
Semantic webSemantic web
Semantic web
 
Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...Between  information  retrieval  services  and bibliometrics  research. New  ...
Between  information  retrieval  services  and bibliometrics  research. New  ...
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Topic Extraction on Domain Ontology
Topic Extraction on Domain OntologyTopic Extraction on Domain Ontology
Topic Extraction on Domain Ontology
 
Big Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic ProcessingBig Data Palooza Talk: Aspects of Semantic Processing
Big Data Palooza Talk: Aspects of Semantic Processing
 
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic ...
 
The science behind predictive analytics a text mining perspective
The science behind predictive analytics  a text mining perspectiveThe science behind predictive analytics  a text mining perspective
The science behind predictive analytics a text mining perspective
 
Text mining
Text miningText mining
Text mining
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dl
 
Class 5-introto dl
Class 5-introto dlClass 5-introto dl
Class 5-introto dl
 
Innovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLPInnovative methods for data integration: Linked Data and NLP
Innovative methods for data integration: Linked Data and NLP
 
IASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with TriplesIASSIST 2012 - DDI-RDF - Trouble with Triples
IASSIST 2012 - DDI-RDF - Trouble with Triples
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
PyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from ScratchPyGotham NY 2017: Natural Language Processing from Scratch
PyGotham NY 2017: Natural Language Processing from Scratch
 
CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...
CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...
CAA 2014 - To Boldly or Bravely Go? Experiences of using Semantic Technologie...
 

Plus de rchbeir

Web ontologie language (par RAFEH Aya et VAILLEUX Arnaud)
Web ontologie language (par RAFEH Aya et VAILLEUX Arnaud)Web ontologie language (par RAFEH Aya et VAILLEUX Arnaud)
Web ontologie language (par RAFEH Aya et VAILLEUX Arnaud)rchbeir
 
SS tree (par SYLLA Demba et TALBI Rachid)
SS tree (par SYLLA Demba et TALBI Rachid)SS tree (par SYLLA Demba et TALBI Rachid)
SS tree (par SYLLA Demba et TALBI Rachid)rchbeir
 
Ranking (par IBRAHIM Sirine et TANIOS Dany)
Ranking (par IBRAHIM Sirine et TANIOS	 Dany)Ranking (par IBRAHIM Sirine et TANIOS	 Dany)
Ranking (par IBRAHIM Sirine et TANIOS Dany)rchbeir
 
Crawlers (par DE COURCHELLE Inès et JACOB Sophie)
Crawlers (par DE COURCHELLE Inès et JACOB Sophie)Crawlers (par DE COURCHELLE Inès et JACOB Sophie)
Crawlers (par DE COURCHELLE Inès et JACOB Sophie)rchbeir
 
Quad-Tree et Kd-Tree (par MARQUES Patricia et OLIVIER Aymeric)
Quad-Tree et Kd-Tree (par MARQUES Patricia et OLIVIER Aymeric)Quad-Tree et Kd-Tree (par MARQUES Patricia et OLIVIER Aymeric)
Quad-Tree et Kd-Tree (par MARQUES Patricia et OLIVIER Aymeric)rchbeir
 
NoSQL (par HEGUY Xabier)
NoSQL (par HEGUY Xabier)NoSQL (par HEGUY Xabier)
NoSQL (par HEGUY Xabier)rchbeir
 
Mpeg7 et comm ontology (par MOHIBE Amine et BENSLIMANE Mohamed-Amine)
Mpeg7 et comm ontology (par MOHIBE Amine et BENSLIMANE Mohamed-Amine)Mpeg7 et comm ontology (par MOHIBE Amine et BENSLIMANE Mohamed-Amine)
Mpeg7 et comm ontology (par MOHIBE Amine et BENSLIMANE Mohamed-Amine)rchbeir
 
Arbre b (par EL HACHEM Marwan et RICHA Elias)
Arbre b (par EL HACHEM Marwan et RICHA Elias)Arbre b (par EL HACHEM Marwan et RICHA Elias)
Arbre b (par EL HACHEM Marwan et RICHA Elias)rchbeir
 
Adaptative hypermedia (par MALKI Sara et MAKSIMOVICH Aleksandra)
Adaptative hypermedia (par MALKI Sara et MAKSIMOVICH Aleksandra)Adaptative hypermedia (par MALKI Sara et MAKSIMOVICH Aleksandra)
Adaptative hypermedia (par MALKI Sara et MAKSIMOVICH Aleksandra)rchbeir
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrievalrchbeir
 

LSI latent (by HATOUM Saria and DONGO ESCALANTE Irvin Franco)

  • 1. Latent Semantic Indexing. Prepared by: HATOUM Saria, DONGO Irvin. Presented to: Prof. CHBEIR Richard. Bayonne/2013
  • 2. Overview • Introduction – Information Retrieval • Vector Space Model • Problems • Latent Semantic Indexing – Algorithm – Example – Advantages – Disadvantages
  • 3. Introduction • Many documents are available. • Information needs to be extracted from them. • The information must be sorted and classified. • Users query the information.
  • 4. Information Retrieval • Before LSI: – Literal matching of query terms against a text corpus with many documents. • Given a query, find relevant documents. – Some terms in a user's query will literally match terms in irrelevant documents.
  • 5. Some Methods for IR • Set-Theoretic – Fuzzy Set • Algebraic – Vector Space • Generalised Vector Space • Latent Semantic Indexing • Probabilistic – Binary Independence
  • 6. Vector Space Model • An algebraic model for representing text documents. • Documents and queries are both vectors: dj = (w1,j, w2,j, …, wt,j) and q = (w1,q, w2,q, …, wt,q)
  • 7. Vector Space Method – Term (rows) by document (columns) matrix, based on term occurrence – One vector is associated with each document – The cosine measures the similarity between vectors (documents) • small angle = large cosine = similar • large angle = small cosine = dissimilar
  • 8. Cosine Similarity Measure • Sim(di, dj) = cos θ = (di · dj) / (|di| |dj|). • Sim(di, dj) = 1 if di and dj point in the same direction (for example di = dj). • Sim(di, dj) = 0 if di and dj share no terms (orthogonal vectors).
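
The cosine measure from slide 8 can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function name `cosine_similarity` and the three toy count vectors are assumptions made for the example, and returning 0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Return cos(theta) between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)

# Hypothetical term-count vectors over a three-term vocabulary.
d1 = [1, 2, 0]
d2 = [1, 2, 0]   # identical to d1
d3 = [0, 0, 3]   # shares no terms with d1

same = cosine_similarity(d1, d2)       # identical documents
disjoint = cosine_similarity(d1, d3)   # no terms in common
```

Identical vectors give a similarity of 1 (up to floating-point rounding), and vectors with no terms in common give 0, which matches the two boundary cases listed on the slide.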
  • 9. Document vector space [Figure: a query vector plotted in a two-dimensional space whose axes are Word 1 and Word 2]
  • 10. Problem Introduction • The traditional term-matching method doesn't work well in information retrieval. • We want to capture the concepts instead of the words. Concepts are reflected in the words. However, – One term may have multiple meanings. – Different terms may have the same meaning.
  • 11. The Problems • Two problems arose when using the vector space model: – synonymy: there are many ways to express a given concept, e.g. “automobile” when querying on “car” • leads to poor recall (the percentage of all relevant documents that are retrieved) – polysemy: words have multiple meanings, e.g. “surfing” • leads to poor precision (the percentage of the retrieved documents that are relevant) • The context of the documents.
  • 12. Polysemy and Context • Document similarity on single word level: polysemy and context. [Figure: the word "saturn" has meaning 1 (planet; context words: ring, jupiter, space, planet) and meaning 2 (car company; context words: dodge, ford). It contributes to similarity if used in the 1st meaning, but not if in the 2nd.]
  • 13. Problem Statement • Allow users to retrieve information on the basis of the conceptual topic or meaning of a document.
  • 14. Latent Semantic Indexing • Overcomes these problems of lexical matching: – It uses a statistical information retrieval method that retrieves text based on the concepts it contains, not just by matching specific keywords.
  • 15. Characteristics of LSI • Documents are represented as "bags of words": the order of the words in a document is not important, only how many times each word appears in it. • LSI is a technique that projects queries and documents into a space with “latent” semantic dimensions. • It converts a high-dimensional space to a lower-dimensional space.
  • 16. Characteristics of LSI • Concepts are represented as patterns of words that usually appear together in documents. – For example “jaguar”, “car”, and “speed” might usually appear in documents about sports cars, whereas “jaguar”, “animal”, and “hunting” might refer to the concept of jaguar the animal. • LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. • LSI uses Singular Value Decomposition to map terms to concepts.
  • 17. Generate matrix • The number of words is huge. • Throw out noise words such as 'and', 'is', 'at', 'the', etc. • Select and use a smaller set of words that are of interest. • Apply stemming, which means removing word endings, e.g. learning, learned → learn.
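
The preprocessing steps on slide 17 can be sketched as follows. Everything here is illustrative: the stop-word set, the toy suffix-stripping function `crude_stem` (a stand-in for a real stemmer such as Porter's), and the three sample sentences are assumptions for the example, not material from the slides.

```python
# Illustrative stop-word list; a real system would use a much longer one.
STOP_WORDS = {"a", "and", "is", "at", "the", "of", "in"}

def crude_stem(word):
    """Strip a few common endings; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, drop punctuation and stop words, then stem."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [crude_stem(w) for w in words if w and w not in STOP_WORDS]

def term_document_matrix(docs):
    """Return (sorted terms, matrix) with one row per term, one column per document."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    matrix = [[doc.count(t) for doc in tokenized] for t in terms]
    return terms, matrix

# Hypothetical mini-corpus.
docs = ["Learning is fun.", "The student learned at home.", "Students learn."]
terms, matrix = term_document_matrix(docs)
```

With this toy corpus, "learning", "learned", and "learn" collapse into the single row `learn`, so the matrix has four rows instead of seven.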
  • 18. “Semantic” Space [Figure: words placed in a semantic space: House, Home, Domicile; Kumquat, Orange, Pear, Apple]
  • 19. Information Retrieval • Represent each document as a word vector. • Represent the corpus as a term-document matrix (T-D matrix) and analyse it with a linear algebra method called SVD. • A classical method: – Create a new vector from the query terms. – Find the documents with the highest cosine similarity.
  • 20. Singular Value Decomposition (SVD) • We decompose the term-document matrix into three matrices.
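
In the usual textbook notation (the symbols below are standard ones, assumed here rather than taken from the slides), the decomposition of a t × d term-document matrix A, and its rank-k truncation used by LSI, can be written as:

```latex
A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}
```

Here U holds the term vectors, V holds the document vectors, and Σ is the diagonal matrix of singular values in decreasing order. The truncated form keeps only the k largest singular values and the matching columns of U and V.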
  • 21. Example • d1: Shipment of gold damaged in a fire. • d2: Delivery of silver arrived in a silver truck. • d3: Shipment of gold arrived in a truck. • q: Gold silver truck
  • 25. Example: new vectors • d1 = [-0.4945, 0.6492] • d2 = [-0.6458, -0.7194] • d3 = [-0.5817, 0.2469]
  • 27. Example: sim(q, di) = cos θ • sim(q,d1) = -0.0541 • sim(q,d2) = 0.9910 • sim(q,d3) = 0.4478
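
The example from slides 21 to 27 can be retraced with a short, dependency-free sketch. Several things here are assumptions rather than material from the slides: the matrix uses raw counts of every word (no stop-word removal) with terms in alphabetical order, k = 2 dimensions are kept, and the two leading singular vectors are found by power iteration on AᵀA instead of a library SVD routine (in practice one would call something like `numpy.linalg.svd`). The query is folded in as q_k = qᵀ U_k Σ_k⁻¹, which equals (qᵀA) V_k Σ_k⁻².

```python
import math

# Raw term counts for d1, d2, d3 (rows = terms in alphabetical order).
TERMS = ["a", "arrived", "damaged", "delivery", "fire", "gold",
         "in", "of", "shipment", "silver", "truck"]
A = [
    [1, 1, 1],  # a
    [0, 1, 1],  # arrived
    [1, 0, 0],  # damaged
    [0, 1, 0],  # delivery
    [1, 0, 0],  # fire
    [1, 0, 1],  # gold
    [1, 1, 1],  # in
    [1, 1, 1],  # of
    [1, 0, 1],  # shipment
    [0, 2, 0],  # silver
    [0, 1, 1],  # truck
]
q = [1 if t in ("gold", "silver", "truck") else 0 for t in TERMS]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_eigenpairs(G, k, iters=500):
    """Power iteration with deflation on a symmetric positive semi-definite matrix."""
    n = len(G)
    G = [row[:] for row in G]
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = [dot(row, v) for row in G]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in G])
        pairs.append((lam, v))
        # Deflate: remove the component that was just found.
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

n_docs = len(A[0])
cols = [[row[j] for row in A] for j in range(n_docs)]
gram = [[dot(ci, cj) for cj in cols] for ci in cols]  # A^T A
pairs = top_eigenpairs(gram, k=2)  # (singular value squared, column of V)

# Document j in the reduced space is row j of V_k.
doc_vecs = [[v[j] for _, v in pairs] for j in range(n_docs)]
# Fold the query in: q_k = (q^T A) V_k Sigma_k^{-2}.
qA = [dot(q, c) for c in cols]
q_vec = [dot(qA, v) / lam for lam, v in pairs]

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

sims = [cosine(q_vec, d) for d in doc_vecs]
```

Under these assumptions the similarities in `sims` come out close to the values on slide 27, ranking d2 first, then d3, then d1. The signs of the reduced vectors may be flipped relative to slide 25, because singular vectors are only defined up to sign; the cosines are unaffected.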
  • 28. Advantages • LSI overcomes two of the most problematic constraints of queries: – Synonymy – Polysemy • True (latent) dimensions: the new dimensions are a better representation of documents and queries. • Term dependence: the traditional vector space model assumes term independence, whereas LSI captures strong associations between terms, as natural language does.
  • 29. Disadvantages • Storage – Many documents have more than 150 unique terms so the sparse … • Efficiency – With LSI, the query must be compared to every document in the collection. • Static matrix – If new documents are added, a new SVD of the main matrix has to be computed.
  • 30. References • [Furnas et al., 1988] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., and Lochbaum, K. E. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '88, pages 465-480, New York, NY, USA. ACM. • [Hull, 1994] Hull, D. (1994). Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '94, pages 282-291, New York, NY, USA. Springer-Verlag New York, Inc.
  • 31. References • [Atreya and Elkan, 2011] Atreya, A. and Elkan, C. (2011). Latent semantic indexing (LSI) fails for TREC collections. SIGKDD Explor. Newsl., 12(2):5-10. • [Deerwester et al., 1990] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407. • [Littman et al., 1998] Littman, M., Dumais, S. T., and Landauer, T. K. (1998). Automatic cross-language information retrieval using latent semantic indexing. In Cross-Language Information Retrieval, chapter 5, pages 51-62. Kluwer Academic Publishers.
  • 32. Thank you for your attention!

Editor's notes

  1. Since there are usually many ways to express a given concept. Given a collection of documents: retrieve the documents that are relevant to a given query. Match terms in documents to terms in the query. Vector space method.
  2. Fuzzy set: returns documents by applying intersection and union between the query and all the documents. Vector space: represents each document as a vector, in different ways, for example by number of occurrences. Latent semantic indexing. Probabilistic: at the beginning the query is rated 0; after comparison the value is set to 1 if there is a relation, and all the documents that have the relation are returned.
  3. Precision: what percentage of the retrieved documents are relevant. Recall: what percentage of all relevant documents are retrieved.
  4. LSI tries to overcome the problems of lexical matching by representing documents (and queries) by their underlying latent concepts, which means using statistically derived conceptual indices instead of individual words for retrieval.
  5. With LSI, the words are grouped together in a region.
  6. The SVD projection is computed by decomposing the document-by-term matrix A into