1
Vector Space Retrieval Models
Lecture 4,5
2
Retrieval Models
• A retrieval model specifies the details
of:
– Document representation
– Query representation
– Retrieval function
• Determines a notion of relevance.
• Notion of relevance can be binary or
continuous (i.e. ranked retrieval).
3
Classes of Retrieval Models
• Boolean models (set
theoretic)
– Extended Boolean
• Vector space models
• Probabilistic models
4
Common Preprocessing Steps
• Strip unwanted characters/markup (e.g.
HTML tags, punctuation, numbers, etc.).
• Break into tokens (keywords) on
whitespace.
• Stem tokens to “root” words
– computational → compute
• Remove common stop words (e.g. a, the, it,
etc.).
• Build inverted index (keyword → list of
docs containing it); a code sketch of these
steps follows below.
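A minimal Python sketch of these steps, under simplifying assumptions: the stop-word list is an illustrative subset, and the suffix-stripping is a crude stand-in for a real stemmer such as Porter's.

import re
from collections import defaultdict

STOP_WORDS = {"a", "the", "it", "of", "and", "to", "in"}  # illustrative subset

def preprocess(text):
    # strip markup, punctuation, and numbers; lowercase; tokenize on whitespace
    text = re.sub(r"<[^>]+>|[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # crude stemming stand-in: drop a few common suffixes
    return [re.sub(r"(ational|ation|ing|s)$", "", t) for t in tokens]

def build_inverted_index(docs):          # docs: {doc_id: raw text}
    index = defaultdict(set)             # keyword -> ids of docs containing it
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index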
5
Statistical Models
• A document is typically represented by a bag of
words (unordered words with frequencies).
• Bag = a multiset: a set that allows multiple
occurrences of the same element.
• User specifies a set of desired terms with optional
weights:
– Weighted query terms:
Q = < database 0.5; text 0.8; information 0.2 >
– Unweighted query terms:
Q = < database; text; information >
– No Boolean conditions specified in the query.
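As a concrete sketch, a bag of words maps each term to its frequency, and a query is just a term → weight mapping (the dict form here is an illustrative assumption, not a fixed API):

from collections import Counter

bag = Counter("text database text information database text".split())
# Counter({'text': 3, 'database': 2, 'information': 1})

weighted_query   = {"database": 0.5, "text": 0.8, "information": 0.2}
unweighted_query = {"database": 1, "text": 1, "information": 1}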
6
Statistical Retrieval
• Retrieval based on similarity between query
and documents.
• Output documents are ranked according to
similarity to query.
• Similarity based on occurrence frequencies of
keywords in query and document.
• Automatic relevance feedback can be
supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.
7
Issues for Vector Space Model
• How to determine important words in a
document?
• How to determine the degree of
importance of a term within a
document and within the entire
collection?
• How to determine the degree of
similarity between a document and the
query?
8
The Vector-Space Model
• Assume t distinct terms remain after
preprocessing; call them index terms or the
vocabulary.
• These terms form a vector space.
Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a
real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
9
Document Collection
• A collection of n documents can be represented in the
vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term has no
significance in the document or it simply doesn’t exist in
the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
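A toy sketch of such a matrix built from raw term counts (the documents and vocabulary are invented; real entries would be the weights defined on the next slides):

from collections import Counter

docs = {"D1": "database text database",
        "D2": "information retrieval text"}
vocab = sorted({t for text in docs.values() for t in text.split()})
matrix = {d: [Counter(text.split())[t] for t in vocab]   # 0 if term absent
          for d, text in docs.items()}
# vocab  == ['database', 'information', 'retrieval', 'text']
# matrix == {'D1': [2, 0, 0, 1], 'D2': [0, 1, 1, 1]}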
10
Term Weights: Term Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by
dividing by the frequency of the most common
term in the document:
tfij = fij / maxi{fij}
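The normalization as a one-function sketch, over a term → raw-count dict for a single document:

def tf(counts):                      # counts: term -> raw frequency f_ij
    max_f = max(counts.values())     # frequency of the most common term
    return {term: f / max_f for term, f in counts.items()}

tf({"A": 3, "B": 2, "C": 1})         # {'A': 1.0, 'B': 0.67, 'C': 0.33} (rounded)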
11
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents are
less indicative of overall topic.
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i
= log2(N/dfi)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
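In code, a direct transcription of the formula:

from math import log2

def idf(df, N):                  # df: docs containing the term; N: total docs
    return log2(N / df)

idf(50, 10000)                   # 7.64 -- rare term, high discrimination power
idf(1300, 10000)                 # 2.94 -- common term, low discrimination power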
12
TF-IDF Weighting
• A typical combined term importance
indicator is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N/dfi)
• A term occurring frequently in the
document but rarely in the rest of the
collection is given high weight.
• Many other ways of determining term
weights have been proposed.
• Experimentally, tf-idf has been found to
work well.
13
Computing TF-IDF -- An Example
Given a document containing terms with given
frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2(10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2(10000/250) = 5.3; tf-idf = 1.8
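This example can be checked in a few lines of Python, reproducing the rounded numbers above:

from math import log2

N = 10000
freqs = {"A": 3, "B": 2, "C": 1}          # term frequencies in the document
dfs   = {"A": 50, "B": 1300, "C": 250}    # document frequencies in the collection

max_f = max(freqs.values())
for term in sorted(freqs):
    tf_ = freqs[term] / max_f
    idf_ = log2(N / dfs[term])
    print(term, round(tf_, 2), round(idf_, 1), round(tf_ * idf_, 1))
# A 1.0 7.6 7.6
# B 0.67 2.9 2.0
# C 0.33 5.3 1.8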
14
Query Vector
• Query vector is typically treated as
a document and also tf-idf
weighted.
• Alternative is for the user to supply
weights for the given query terms.
15
Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.
• Using a similarity measure between the query and
each document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that the
size of the retrieved set can be controlled.
16
Similarity Measure - Inner Product
• Similarity between vectors for the document dj
and query q can be computed as the vector
inner product (dot product):
sim(dj, q) = dj • q = Σi=1..t (wij · wiq)
where wij is the weight of term i in
document j and wiq is the weight of term i in
the query.
• For weighted term vectors, it is the sum of the
products of the weights of the matched terms
(see the sketch below).
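Over sparse term → weight dicts, only the matched terms contribute, which makes the sum-of-products form explicit (a minimal sketch):

def inner_product(d, q):         # d, q: sparse term -> weight dicts
    return sum(w * q[t] for t, w in d.items() if t in q)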
17
Inner Product -- Examples
Binary:
– D = 1, 1, 1, 0, 1, 1, 0
– Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3
Vocabulary (size of vector = size of vocabulary = 7):
retrieval, database, architecture, computer,
text, management, information
A 0 means the corresponding term is not
found in the document or query.
Weighted:
D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1 , Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2 , Q) = 3*0 + 7*0 + 1*2 = 2
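The weighted example, checked with dense vectors of weights on (T1, T2, T3):

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

dot(D1, Q)                      # 10
dot(D2, Q)                      # 2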
18
Cosine Similarity Measure
• Cosine similarity measures the cosine of
the angle between two vectors.
• Inner product normalized by the vector
lengths.
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)·(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)·(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q as vectors in the 3-D term space (t1, t2, t3);
θ1 is the angle between D1 and Q, θ2 the angle between D2 and Q.]
D1 is about 6 times more similar to Q than D2 using cosine similarity.
CosSim(dj, q) = (dj • q) / (|dj| · |q|)
= Σi=1..t (wij · wiq) / (√(Σi=1..t wij²) · √(Σi=1..t wiq²))
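A direct sketch of the formula, verifying the two values above:

from math import sqrt

def cos_sim(x, y):
    num = sum(a * b for a, b in zip(x, y))          # inner product
    return num / (sqrt(sum(a * a for a in x)) *     # normalize by
                  sqrt(sum(b * b for b in y)))      # vector lengths

cos_sim([2, 3, 5], [0, 0, 2])   # 0.811  (D1 vs. Q)
cos_sim([3, 7, 1], [0, 0, 2])   # 0.130  (D2 vs. Q)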
19
Naïve Implementation
Convert all documents in collection D to tf-idf
weighted vectors, dj, over the keyword
vocabulary V.
Convert the query to a tf-idf weighted vector q.
For each dj in D:
compute score sj = cosSim(dj, q)
Sort documents by decreasing score.
Present top-ranked documents to the user.
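A hedged Python rendering of this loop; tfidf_vector is an assumed helper (not defined here) applying the weighting from the earlier slides, and cos_sim is the function sketched on the previous slide, assumed to accept whatever vector form tfidf_vector produces.

def retrieve(docs, query, k=10):
    # tfidf_vector: assumed helper implementing tf-idf weighting over V
    vectors = {d: tfidf_vector(text) for d, text in docs.items()}
    q = tfidf_vector(query)
    scores = sorted(((d, cos_sim(v, q)) for d, v in vectors.items()),
                    key=lambda s: s[1], reverse=True)  # decreasing score
    return scores[:k]                                  # top-ranked documents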
20
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word
occurrence frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite
obvious weaknesses.
• Allows efficient implementation for large
document collections.
21
Problems with Vector Space Model
• Missing semantic information (e.g.
word sense).
• Missing syntactic information (e.g.
phrase structure, word order,
proximity information).
• Assumption of term independence.