2. Overview
• Introduction
– Information Retrieval
• Vector Space Model
• Problems
• Latent Semantic Indexing
– Algorithm
– Example
– Advantages
– Disadvantages
3. Introduction
• Many documents are available.
• We need to extract information from them.
• The information must be sorted and classified.
• Users then query that information.
4. Information Retrieval
• Before LSI:
– Queries were matched literally against the terms of each document in the corpus.
• Given a query, find the relevant documents.
– Some terms in a user's query will literally match terms in irrelevant documents.
5. Some Methods for IR
• Set-Theoretic
– Fuzzy Set
• Algebraic
– Vector Space
• Generalised Vector Space
• Latent Semantic Indexing
• Probabilistic
– Binary Independence
6. Vector Space Model
• An algebraic model for representing text documents.
• Documents and queries are both vectors of term weights:
dj = (w1,j , w2,j , …, wt,j)
q = (w1,q , w2,q , …, wt,q)
7. Vector Space Method
– Term (rows) by document (columns) matrix, based on term occurrence
– One vector is associated with each document
– The cosine of the angle between two vectors measures the similarity of the corresponding documents
• small angle = large cosine = similar
• large angle = small cosine = dissimilar
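The cosine measure can be sketched in a few lines. A minimal illustration with hypothetical term-count vectors (the three-word vocabulary and the counts are invented for the example):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| |b|); larger values mean more similar
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-count vectors over the vocabulary [gold, silver, truck]
d1 = np.array([1.0, 0.0, 0.0])
d2 = np.array([0.0, 2.0, 1.0])
q = np.array([1.0, 1.0, 1.0])

print(cosine_similarity(q, d1))
print(cosine_similarity(q, d2))
```

Vectors pointing in the same direction give a cosine of 1; orthogonal vectors (no shared terms) give 0.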
10. Problem Introduction
• The traditional term-matching method doesn't work well in information
retrieval.
• We want to capture concepts instead of words. Concepts are
reflected in the words. However,
– one term may have multiple meanings, and
– different terms may have the same meaning.
11. The Problems
• Two problems arise when using the vector space model:
– Synonymy: there are many ways to express a given concept, e.g. documents
that say "automobile" when querying on "car".
• Leads to poor recall (the percentage of all relevant documents that are
retrieved).
– Polysemy: words have multiple meanings, e.g. "surfing".
• Leads to poor precision (the percentage of the retrieved documents that
are relevant).
• Both problems depend on the context in which words appear in documents.
12. Polysemy and Context
• Document similarity on single word level: polysemy and context
[Figure: the word "saturn" contributes to document similarity when used in
meaning 1, the planet (co-occurring with "ring", "jupiter", "space",
"planet"), but not when used in meaning 2, the car company (co-occurring
with "car", "company", "dodge", "ford").]
13. The Goal
• Allow users to retrieve information on the basis of the conceptual topic or
meaning of a document, not just its exact words.
14. Latent Semantic Indexing
• LSI overcomes these problems of lexical matching:
– It uses a statistical information retrieval method that is capable of retrieving text
based on the concepts it contains, not just by matching specific keywords.
15. Characteristics of LSI
• Documents are represented as "bags of words": the order of the
words in a document is not important, only how many times each word
appears in a document.
• LSI is a technique that projects queries and documents into a space with
"latent" semantic dimensions.
• It maps the high-dimensional term space to a lower-dimensional concept space.
16. Characteristics of LSI
• Concepts are represented as patterns of words that usually appear
together in documents.
– For example “jaguar", “car", and “speed" might usually appear in documents
about sports cars, whereas “jaguar”, “animal”, “hunting” might refer to the
concept of jaguar the animal.
• LSI is based on the principle that words that are used in the same
contexts tend to have similar meanings.
• LSI uses Singular Value Decomposition for the mapping of terms to
concepts.
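The SVD step can be sketched with numpy. The small count matrix below is invented for illustration (rows loosely follow the "jaguar" example above); the point is the rank-k truncation that defines the concept space:

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows = terms, cols = docs)
A = np.array([
    [2, 0, 1, 0],  # "jaguar"
    [1, 0, 2, 0],  # "car"
    [1, 0, 1, 0],  # "speed"
    [0, 2, 0, 1],  # "animal"
    [0, 1, 0, 2],  # "hunting"
], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, with singular values s sorted descending
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: the rank-k "concept" space
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Rows of U[:, :k] give each term's coordinates in the concept space;
# columns of Vt[:k, :] give each document's coordinates.
print(A_k.round(2))
```

Terms that co-occur in similar documents end up close together in the reduced space, which is what lets LSI match on concepts rather than exact keywords.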
17. Generate the Matrix
• The number of words is huge.
• Throw out noise words ("and", "is", "at", "the", etc.).
• Select and use a smaller set of words that are of interest.
• Apply stemming, i.e. remove word endings: e.g. "learning", "learned" → "learn".
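A toy sketch of these preprocessing steps. The stop-word list and the suffix rules here are illustrative only; a real system would use a fuller stop list and a proper stemmer such as Porter's:

```python
# Tiny illustrative stop-word list (a real one would be much longer)
STOPWORDS = {"and", "is", "at", "the", "of", "in", "a"}

def naive_stem(word):
    # Very crude suffix stripping, not a real stemmer like Porter's
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, drop stop words, then strip suffixes
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Learning and learned the learn"))  # ['learn', 'learn', 'learn']
```

After this step, "learning", "learned", and "learn" all increment the same row of the term-document matrix.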
18. "Semantic" Space
[Figure: words plotted in the semantic space. "House", "Home", and
"Domicile" cluster in one region; "Kumquat", "Orange", "Pear", and
"Apple" cluster in another.]
19. Information Retrieval
• Represent each document as a word vector.
• Represent the corpus as a term-document (T-D) matrix and decompose it with a
linear-algebra method, the Singular Value Decomposition (SVD).
• A classical retrieval method:
– Create a new vector from the query terms.
– Find the documents with the highest cosine similarity to it.
21. Example
• d1: Shipment of gold damaged in a fire.
• d2: Delivery of silver arrived in a silver truck.
• d3: Shipment of gold arrived in a truck.
• q: Gold silver truck
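A minimal sketch of working this example through with a rank-2 SVD. The stop-word removal ("of", "in", "a") and the query-folding step q_k = qᵀ U_k S_k⁻¹ follow the standard LSI recipe, so the exact similarity values depend on these preprocessing choices:

```python
import numpy as np

# Term-document count matrix for d1-d3 above, stop words removed.
# Rows = terms, columns = d1, d2, d3.
terms = ["shipment", "gold", "damaged", "fire",
         "delivery", "silver", "arrived", "truck"]
A = np.array([
    [1, 0, 1],  # shipment
    [1, 0, 1],  # gold
    [1, 0, 0],  # damaged
    [1, 0, 0],  # fire
    [0, 1, 0],  # delivery
    [0, 2, 0],  # silver
    [0, 1, 1],  # arrived
    [0, 1, 1],  # truck
], dtype=float)
q = np.array([0, 1, 0, 0, 0, 1, 0, 1], dtype=float)  # "gold silver truck"

# Rank-2 SVD of A
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Fold the query into the concept space: q_k = q^T U_k S_k^{-1}
q_k = q @ U_k / s_k

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Document j's concept-space coordinates are column j of Vt_k
sims = [cos(q_k, Vt_k[:, j]) for j in range(3)]
for name, sim in zip(["d1", "d2", "d3"], sims):
    print(name, round(sim, 3))
```

d1, which shares only "gold" with the query, scores clearly lowest; the two truck documents score much higher even where their term overlap with the query is only partial.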
28. Advantages
• LSI overcomes two of the most problematic constraints of queries:
– Synonymy
– Polysemy
• True (latent) dimensions: the new dimensions are a better representation
of documents and queries.
• Term dependence: the traditional vector space model assumes that terms are
independent, but LSI captures associations between terms, as natural
language does.
29. Disadvantages
• Storage
– Many documents have more than 150 unique terms; the original term-document
matrix is sparse, but the reduced LSI vectors are dense and costly to store.
• Efficiency
– With LSI, the query must be compared to every document in the collection.
• Static matrix
– If new documents arrive, the SVD of the main matrix must be recomputed.
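The recomputation cost in the last point can be avoided approximately by "folding in" new documents, i.e. projecting them into the existing concept space. A hypothetical sketch (the small matrix is invented for illustration; note that folded-in documents do not update the space itself, so quality degrades as many are added):

```python
import numpy as np

# Existing rank-k factors from a previous SVD of the term-document matrix
# (hypothetical small example: 4 terms, 3 documents, k = 2)
A = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 2, 1],
    [0, 1, 1],
], dtype=float)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

# Folding in: project a new document's term vector into the existing
# concept space (d_k = d^T U_k S_k^{-1}) instead of recomputing the SVD
d_new = np.array([1, 0, 2, 1], dtype=float)
d_k = d_new @ U_k / s_k
print(d_k.shape)  # the new document now lives in the k-dimensional space
```

The folded-in vector can be compared against existing document coordinates with the usual cosine measure.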
30. References
• [Furnas et al., 1988] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K.,
Harshman, R. A., Streeter, L. A., and Lochbaum, K. E. (1988). Information
retrieval using a singular value decomposition model of latent semantic
structure. In Proceedings of the 11th annual international ACM SIGIR
conference on Research and development in information retrieval, SIGIR '88,
pages 465-480, New York, NY, USA. ACM.
• [Hull, 1994] Hull, D. (1994). Improving text retrieval for the routing problem using
latent semantic indexing. In Proceedings of the 17th annual international
ACM SIGIR conference on Research and development in information
retrieval, SIGIR '94, pages 282-291, New York, NY, USA. Springer-Verlag New
York, Inc.
31. References
• [Atreya and Elkan, 2011] Atreya, A. and Elkan, C. (2011). Latent semantic
indexing (LSI) fails for TREC collections. SIGKDD Explor. Newsl., 12(2):5-10.
• [Deerwester et al., 1990] Deerwester, S., Dumais, S. T., Furnas, G. W.,
Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic
analysis. Journal of the American Society for Information Science, 41(6):391-407.
• [Littman et al., 1998] Littman, M., Dumais, S. T., and Landauer, T. K. (1998).
Automatic cross-language information retrieval using latent semantic
indexing. In Cross-Language Information Retrieval, chapter 5, pages 51-62.
Kluwer Academic Publishers.
Notes
• Since there are usually many ways to express a given concept, the task is:
given a collection of documents, retrieve those relevant to a given query by
matching terms in the documents to terms in the query (the vector space method).
• Fuzzy set: returns documents by taking intersections and unions of the query
with all the documents. Vector space: represents each document as a vector,
for example using term occurrence counts. Probabilistic: a document initially
scores 0; after comparison with the query, the score is set to 1 if a relation
exists, and all related documents are returned.
• Precision: the percentage of the retrieved documents that are relevant.
Recall: the percentage of all relevant documents that are retrieved.
• LSI tries to overcome the problems of lexical matching by representing
documents (and queries) by their underlying latent concepts, i.e. by using
statistically derived conceptual indices instead of individual words for
retrieval.
• With LSI, related words are mapped to a common region of the space.
• The SVD projection is computed by decomposing the document-by-term matrix A
into the product of three matrices, A = U Σ Vᵀ.