SlideShare une entreprise Scribd logo
1  sur  12
The Vector space model
Submitted By –
Deeksha Agarwal
Semester 5th
University of Allahabad
Boolean Model Disadvantages
• Similarity function is boolean
⁻ Exact-match only, no partial matches
⁻ Retrieved documents not ranked
• All terms are equally important
– Boolean operator usage has much more
influence than a critical word
• Query language is expressive but complicated
Statistical Models
• A document is typically represented by a bag
of words (unordered words with frequencies).
• Bag = set that allows multiple occurrences of
the same element.
4
Statistical Retrieval
• Retrieval based on similarity between query and
documents.
• Output documents are ranked according to
similarity to query.
• Similarity based on occurrence frequencies of
keywords in query and document.
• Automatic relevance feedback can be supported:
– Relevant documents “added” to query.
– Irrelevant documents “subtracted” from query.
5
The Vector-Space Model
• Documents and queries are both vectors
• Each term, i, in a document or query, j, is given a
real-valued weight, wij.
• Both documents and queries are expressed as t-
dimensional vectors:
dj = (w1j, w2j, …, wtj)
6
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
T3
T1
T2
D1 = 2T1+ 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
7
32
5
7
Document Collection
• A collection of n documents can be represented in the vector
space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in
the document; zero means the term has no significance in the
document or it simply doesn’t exist in the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
8
Term Weights: Term Frequency
• More frequent terms in a document are more
important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by
dividing by the frequency of the most
common term in the document:
tfij = fij / maxi{fij}
9
Term Weights: Inverse Document Frequency
• Terms that appear in many different
documents are less indicative of overall topic.
df i = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
= log2 (N/ df i)
(N: total number of documents)
10
TF-IDF Weighting
• A typical combined term importance indicator
is tf-idf weighting:
wij = tfij idfi = tfij log2 (N/ dfi)
• A term occurring frequently in the document
but rarely in the rest of the collection is given
high weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work
well.
11
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and document
frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
THANKYOU

Contenu connexe

Tendances

Text clustering
Text clusteringText clustering
Text clustering
KU Leuven
 

Tendances (20)

Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean model
 
Text clustering
Text clusteringText clustering
Text clustering
 
Term weighting
Term weightingTerm weighting
Term weighting
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Signature files
Signature filesSignature files
Signature files
 
Automatic indexing
Automatic indexingAutomatic indexing
Automatic indexing
 
Vector space classification
Vector space classificationVector space classification
Vector space classification
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Text similarity measures
Text similarity measuresText similarity measures
Text similarity measures
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
Text categorization
Text categorizationText categorization
Text categorization
 
3. mining frequent patterns
3. mining frequent patterns3. mining frequent patterns
3. mining frequent patterns
 
Information retrieval 10 vector and probabilistic models
Information retrieval 10 vector and probabilistic modelsInformation retrieval 10 vector and probabilistic models
Information retrieval 10 vector and probabilistic models
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
Information retrieval introduction
Information retrieval introductionInformation retrieval introduction
Information retrieval introduction
 
Web search vs ir
Web search vs irWeb search vs ir
Web search vs ir
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challenge
 

En vedette

Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space model
dalal404
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 

En vedette (20)

Document similarity with vector space model
Document similarity with vector space modelDocument similarity with vector space model
Document similarity with vector space model
 
Ir 08
Ir   08Ir   08
Ir 08
 
similarity measure
similarity measure similarity measure
similarity measure
 
Indexing, vector spaces, search engines
Indexing, vector spaces, search enginesIndexing, vector spaces, search engines
Indexing, vector spaces, search engines
 
Introduction to Information Retrieval & Models
Introduction to Information Retrieval & ModelsIntroduction to Information Retrieval & Models
Introduction to Information Retrieval & Models
 
Probabilistic Retrieval TFIDF
Probabilistic Retrieval TFIDFProbabilistic Retrieval TFIDF
Probabilistic Retrieval TFIDF
 
Ch7
Ch7Ch7
Ch7
 
Vector Spaces
Vector SpacesVector Spaces
Vector Spaces
 
Text Similarity
Text SimilarityText Similarity
Text Similarity
 
Beyond tf idf why, what & how
Beyond tf idf why, what & howBeyond tf idf why, what & how
Beyond tf idf why, what & how
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
 
Search: Probabilistic Information Retrieval
Search: Probabilistic Information RetrievalSearch: Probabilistic Information Retrieval
Search: Probabilistic Information Retrieval
 
Lec 4,5
Lec 4,5Lec 4,5
Lec 4,5
 
Probabilistic Retrieval
Probabilistic RetrievalProbabilistic Retrieval
Probabilistic Retrieval
 
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494
 
Natural Language Processing: L02 words
Natural Language Processing: L02 wordsNatural Language Processing: L02 words
Natural Language Processing: L02 words
 
Probabilistic Information Retrieval
Probabilistic Information RetrievalProbabilistic Information Retrieval
Probabilistic Information Retrieval
 
Vector Spaces,subspaces,Span,Basis
Vector Spaces,subspaces,Span,BasisVector Spaces,subspaces,Span,Basis
Vector Spaces,subspaces,Span,Basis
 
OUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text ClassificationOUTDATED Text Mining 4/5: Text Classification
OUTDATED Text Mining 4/5: Text Classification
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 

Similaire à The vector space model

TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
Kalpit Desai
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.ppt
pepe3059
 

Similaire à The vector space model (20)

Ir models
Ir modelsIr models
Ir models
 
Information retrieval 8 term weighting
Information retrieval 8 term weightingInformation retrieval 8 term weighting
Information retrieval 8 term weighting
 
IRT Unit_ 2.pptx
IRT Unit_ 2.pptxIRT Unit_ 2.pptx
IRT Unit_ 2.pptx
 
Chapter 4 IR Models.pdf
Chapter 4 IR Models.pdfChapter 4 IR Models.pdf
Chapter 4 IR Models.pdf
 
UNIT 3 IRT.docx
UNIT 3 IRT.docxUNIT 3 IRT.docx
UNIT 3 IRT.docx
 
Chapter 6 Query Language .pdf
Chapter 6 Query Language .pdfChapter 6 Query Language .pdf
Chapter 6 Query Language .pdf
 
Information Retrieval
Information RetrievalInformation Retrieval
Information Retrieval
 
Document similarity
Document similarityDocument similarity
Document similarity
 
Some Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBASome Information Retrieval Models and Our Experiments for TREC KBA
Some Information Retrieval Models and Our Experiments for TREC KBA
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
4-IR Models_new.ppt
4-IR Models_new.ppt4-IR Models_new.ppt
4-IR Models_new.ppt
 
Vector space model12345678910111213.pptx
Vector space model12345678910111213.pptxVector space model12345678910111213.pptx
Vector space model12345678910111213.pptx
 
Information retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of wordsInformation retrieval 10 tf idf and bag of words
Information retrieval 10 tf idf and bag of words
 
IR.pptx
IR.pptxIR.pptx
IR.pptx
 
TopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptxTopicModels_BleiPaper_Summary.pptx
TopicModels_BleiPaper_Summary.pptx
 
Introduction to Text Mining and Topic Modelling
Introduction to Text Mining and Topic ModellingIntroduction to Text Mining and Topic Modelling
Introduction to Text Mining and Topic Modelling
 
Lec1
Lec1Lec1
Lec1
 
3_Indexing.ppt
3_Indexing.ppt3_Indexing.ppt
3_Indexing.ppt
 
vectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.pptvectorSpaceModelPeterBurden.ppt
vectorSpaceModelPeterBurden.ppt
 
Ir 03
Ir   03Ir   03
Ir 03
 

Dernier

The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
heathfieldcps1
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Dernier (20)

Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptxCOMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
COMMUNICATING NEGATIVE NEWS - APPROACHES .pptx
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...Kodo Millet  PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
Kodo Millet PPT made by Ghanshyam bairwa college of Agriculture kumher bhara...
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 

The vector space model

  • 1. The Vector space model Submitted By – Deeksha Agarwal Semester 5th University of Allahabad
  • 2. Boolean Model Disadvantages • Similarity function is boolean ⁻ Exact-match only, no partial matches ⁻ Retrieved documents not ranked • All terms are equally important – Boolean operator usage has much more influence than a critical word • Query language is expressive but complicated
  • 3. Statistical Models • A document is typically represented by a bag of words (unordered words with frequencies). • Bag = set that allows multiple occurrences of the same element.
  • 4. 4 Statistical Retrieval • Retrieval based on similarity between query and documents. • Output documents are ranked according to similarity to query. • Similarity based on occurrence frequencies of keywords in query and document. • Automatic relevance feedback can be supported: – Relevant documents “added” to query. – Irrelevant documents “subtracted” from query.
  • 5. 5 The Vector-Space Model • Documents and queries are both vectors • Each term, i, in a document or query, j, is given a real-valued weight, wij. • Both documents and queries are expressed as t- dimensional vectors: dj = (w1j, w2j, …, wtj)
  • 6. 6 Graphic Representation Example: D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + T3 Q = 0T1 + 0T2 + 2T3 T3 T1 T2 D1 = 2T1+ 3T2 + 5T3 D2 = 3T1 + 7T2 + T3 Q = 0T1 + 0T2 + 2T3 7 32 5
  • 7. 7 Document Collection • A collection of n documents can be represented in the vector space model by a term-document matrix. • An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document. T1 T2 …. Tt D1 w11 w21 … wt1 D2 w12 w22 … wt2 : : : : : : : : Dn w1n w2n … wtn
  • 8. 8 Term Weights: Term Frequency • More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j • May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document: tfij = fij / maxi{fij}
  • 9. 9 Term Weights: Inverse Document Frequency • Terms that appear in many different documents are less indicative of overall topic. df i = document frequency of term i = number of documents containing term i idfi = inverse document frequency of term i, = log2 (N/ df i) (N: total number of documents)
  • 10. 10 TF-IDF Weighting • A typical combined term importance indicator is tf-idf weighting: wij = tfij idfi = tfij log2 (N/ dfi) • A term occurring frequently in the document but rarely in the rest of the collection is given high weight. • Many other ways of determining term weights have been proposed. • Experimentally, tf-idf has been found to work well.
  • 11. 11 Computing TF-IDF -- An Example Given a document containing terms with given frequencies: A(3), B(2), C(1) Assume collection contains 10,000 documents and document frequencies of these terms are: A(50), B(1300), C(250) Then: A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6 B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0 C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8

Notes de l'éditeur

  1. 1.Very rigid: AND means all; OR means any. 2.Difficult to express complex user requests. 3.Difficult to control the number of documents retrieved-All matched documents will be returned.5.Difficult to rank output-All matched documents logically satisfy the query. 7.Difficult to perform relevance feedback-a document is identified by the user as relevant or irrelevant, how should the query how should the query be modified?
  2. if a term t appears often in a document, then a query containing t should retrieve that document. Zipf’s law: term frequency » 1/rank importance is inversely proportional to frequency of occurrence.
  3. tfij = fij / maxi{fij}