Probabilistic Retrieval Models
Lecture 8
Sean A. Golliher
- We need to quickly cover some older material to understand the new methods
- Relevance is a complex concept that has been studied for some time
  ○ Many factors to consider
  ○ People often disagree when making relevance judgments
- Retrieval models make various assumptions about relevance to simplify the problem
  ○ e.g., topical vs. user relevance
  ○ e.g., binary vs. multi-valued relevance
- Older models
  ○ Boolean retrieval
  ○ Vector space model
- Probabilistic models
  ○ BM25
  ○ Language models
- Combining evidence
  ○ Inference networks
  ○ Learning to rank
- Two possible outcomes for query processing
  ○ TRUE and FALSE
  ○ “exact-match” retrieval
  ○ the simplest form of ranking
- Query usually specified using Boolean operators (see the sketch below)
  ○ AND, OR, NOT
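Since exact-match retrieval treats each term’s posting list as a set of documents, the Boolean operators reduce to set algebra. A minimal sketch, with hypothetical posting lists:

```python
# Hypothetical posting lists: term -> set of doc ids containing the term
postings = {
    "lincoln": {1, 3, 5},
    "president": {2, 3, 5, 7},
}

print(postings["lincoln"] & postings["president"])  # AND: {3, 5}
print(postings["lincoln"] | postings["president"])  # OR:  {1, 2, 3, 5, 7}
print(postings["lincoln"] - postings["president"])  # AND NOT: {1}
```

Every document either matches or it doesn't (TRUE/FALSE), which is why the result set is predictable but unranked.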
- Advantages
  ○ Results are predictable and relatively easy to explain
  ○ Many different features can be incorporated
  ○ Efficient processing, since many documents can be eliminated from the search
- Disadvantages
  ○ Effectiveness depends entirely on the user
  ○ Simple queries usually don’t work well
  ○ Complex queries are difficult to write
- Documents and queries are represented by vectors of term weights
- The collection is represented by a matrix of term weights
- 3-D pictures are useful, but can be misleading for high-dimensional spaces
The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
- Thought experiment: take a document d and append it to itself. Call this document d′.
- “Semantically”, d and d′ have the same content
- The Euclidean distance between the two documents can be quite large
- The angle between the two documents is 0, corresponding to maximal similarity (cos(0) = 1)
- Key idea: rank documents according to their angle with the query
- In Euclidean space, define the dot product of vectors a and b as

    a · b = ||a|| ||b|| cos θ

  where ||a|| is the length of a and θ is the angle between a and b
- Using the Law of Cosines, we can compute the coordinate-dependent definition in 3-space:

    a · b = ax bx + ay by + az bz

- Rearranging gives

    cos θ = (a · b) / (||a|| ||b||)

  ○ cos(0) = 1
  ○ cos(90°) = 0
- Documents are ranked by the distance between the points representing the query and the documents
  ○ A similarity measure is more common than a distance or dissimilarity measure
  ○ e.g., cosine correlation
- Consider two documents D1, D2 and a query Q
  ○ D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
- Writing the dot product over unit vectors:

    cos(q, d) = (q · d) / (||q|| ||d||) = (q / ||q||) · (d / ||d||)
              = Σ(i=1..V) qi di / ( √(Σ(i=1..V) qi²) √(Σ(i=1..V) di²) )

- qi is the tf-idf weight of term i in the query
- di is the tf-idf weight of term i in the document
- cos(q, d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d
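To make this concrete, here is a minimal sketch (not from the slides) that applies the cosine formula above to the example vectors D1, D2, and Q:

```python
import math

def cosine(q, d):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d)

d1 = (0.5, 0.8, 0.3)
d2 = (0.9, 0.4, 0.2)
q = (1.5, 1.0, 0)

print(cosine(q, d1))  # ~0.87
print(cosine(q, d2))  # ~0.97, so d2 ranks above d1
```

Ranking by angle rather than Euclidean distance is exactly what the append-a-document-to-itself thought experiment motivates.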
- tf.idf weight (older retrieval model)
  ○ tf: term frequency, the number of times the term occurs in a document
  ○ idf: inverse document frequency, e.g.:
    - log(N/n)
      - N is the total number of documents
      - n is the number of documents that contain the term
      - a measure of the “importance” of a term: the more documents a term appears in, the less discriminating the term is
      - the log dampens the effect
- The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences
- The document frequency df is the number of documents that contain the term t

    Word        Collection frequency    Document frequency
    insurance                  10440                  3997
    try                        10422                  8760

- Which of these is more useful?
- The tf-idf weight of a term is the product of its tf weight and its idf weight, with idf = log(N/n) as defined above:

    w(t, d) = tf(t, d) × log(N / n)

- Best-known weighting scheme in information retrieval
  ○ Note: the “-” in tf-idf is a hyphen, not a minus sign!
  ○ Alternative names: tf.idf, tf x idf
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection
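As a quick illustration of the weighting (a sketch, not from the slides; the toy collection is hypothetical):

```python
import math
from collections import Counter

def tf_idf(doc, collection):
    """tf-idf weights for one document: tf is the raw count in the
    document, idf = log(N/n) with N docs total and n containing the term."""
    N = len(collection)
    weights = {}
    for term, tf in Counter(doc).items():
        n = sum(1 for d in collection if term in d)  # document frequency
        weights[term] = tf * math.log(N / n)
    return weights

docs = [["auto", "insurance", "auto"], ["auto", "repair"], ["try", "insurance"]]
print(tf_idf(docs[0], docs))
# "auto": tf=2 but appears in 2 of 3 docs; "insurance": tf=1, also 2 of 3 docs
```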
- Rocchio algorithm (paper topic)
- Optimal query
  ○ Maximizes the difference between the average vector representing the relevant documents and the average vector representing the non-relevant documents
- Modifies the query according to (standard form; see the sketch below):

    q′ = α·q + β·(1/|Rel|) Σ(d ∈ Rel) d − γ·(1/|Nonrel|) Σ(d ∈ Nonrel) d

  ○ α, β, and γ are parameters
    - Typical values: 8, 16, 4
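A minimal sketch of the modification step, assuming the standard Rocchio form above (vectors are plain lists of term weights):

```python
def rocchio(query, relevant, nonrelevant, alpha=8.0, beta=16.0, gamma=4.0):
    """Move the query toward the centroid of the relevant documents and
    away from the centroid of the non-relevant ones."""
    dims = range(len(query))
    rel = [sum(d[i] for d in relevant) / len(relevant) for i in dims]
    nonrel = [sum(d[i] for d in nonrelevant) / len(nonrelevant) for i in dims]
    # Weights that go negative are commonly clipped to zero in practice
    return [max(0.0, alpha * query[i] + beta * rel[i] - gamma * nonrel[i])
            for i in dims]
```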
- Probabilistic models are the most dominant paradigm used today
- Probability theory is a strong foundation for representing the uncertainty that is inherent in the IR process
- Robertson (1977):

  “If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.”
- Probability Ranking Principle (Robertson, 1970s; Maron & Kuhns, 1959)
- Information retrieval as probabilistic inference (van Rijsbergen et al., since the 1970s)
- Probabilistic indexing (Fuhr et al., late 1980s to 1990s)
- Bayesian nets in IR (Turtle & Croft, 1990s)
- Probabilistic logic programming in IR (Fuhr et al., 1990s)
- P(a | b) => conditional probability: the probability of a, given that b occurred
- Basic definitions
  ○ (a ∪ b) => a OR b
  ○ (a ∩ b) => a AND b
Let a, b be two events. Then:

    p(a | b) p(b) = p(a ∩ b) = p(b | a) p(a)

    p(a | b) = p(b | a) p(a) / p(b)

    p(a | b) p(b) = p(b | a) p(a)
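A quick numeric check of the identity (the joint and marginal probabilities here are made up for illustration):

```python
p_ab = 0.12          # p(a ∩ b), hypothetical
p_a, p_b = 0.40, 0.30

p_a_given_b = p_ab / p_b  # 0.4
p_b_given_a = p_ab / p_a  # 0.3

# Both sides recover the same joint probability p(a ∩ b)
assert abs(p_a_given_b * p_b - p_b_given_a * p_a) < 1e-12
```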
- Let D be a document in the collection
- Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance
- How do we find P(R|D), the probability that a retrieved document is relevant? It is an abstract concept
- P(R) is the probability that a retrieved document is relevant
  ○ Not clear how to calculate this
- Can we calculate P(D|R), the probability of a document occurring in a set, given that a relevant set has been returned?
- If we KNOW we have a relevant set of documents (maybe from humans?), we could calculate how often specific words occur in that set
Let D be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance.

We need to find p(R|D), the probability that a retrieved document D is relevant. By Bayes’ rule:

    p(R | D) = p(D | R) p(R) / p(D)
    p(NR | D) = p(D | NR) p(NR) / p(D)

Here p(R) and p(NR) are the prior probabilities of retrieving a relevant (non-relevant) document, and p(D|R) and p(D|NR) are the probabilities that if a relevant (non-relevant) document is retrieved, it is D.

Ranking principle (Bayes’ decision rule):

    If p(R|D) > p(NR|D), then D is relevant; otherwise D is not relevant.
- Bayes decision rule
  ○ A document D is relevant if P(R|D) > P(NR|D)
- Estimating the probabilities
  ○ Use Bayes’ rule, as above
  ○ Classify a document as relevant if

      p(D | R) / p(D | NR) > p(NR) / p(R)

    - The left side is the likelihood ratio (see the sketch below)
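A tiny sketch of the rearranged rule (the probabilities are illustrative; the 0.0006 echoes the example that follows):

```python
def is_relevant(p_d_given_r, p_d_given_nr, p_r, p_nr):
    """Bayes decision rule as a likelihood-ratio test:
    relevant iff p(D|R)/p(D|NR) > p(NR)/p(R)."""
    return p_d_given_r / p_d_given_nr > p_nr / p_r

# A document 10x more likely under the relevant model, with a 10% prior:
print(is_relevant(0.0006, 0.00006, p_r=0.1, p_nr=0.9))  # True, since 10 > 9
```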
- Can we calculate P(D|R), the probability that if a relevant document is returned, it is D?
- If we KNOW we have a relevant set of documents (maybe from humans?), we could calculate how often specific words occur in that set
- Ex: we have information on how often specific words occur in the relevant set, so we can calculate how likely it is to see those words appear in a document
- Ex: suppose the probability of “president” in the relevant set is 0.02 and the probability of “lincoln” in the relevant set is 0.03. If a new document contains both “president” and “lincoln”, then (assuming independence) the probability is 0.02 × 0.03 = 0.0006
- Suppose we have a vector representing the presence and absence of terms: (1, 0, 0, 1, 1). Terms 1, 4, and 5 are present
- What is the probability of this document occurring in the relevant set?
- pi is the probability that term i occurs in a relevant set; (1 − pi) is the probability that the term does not occur in the relevant set
- This gives us: p1 × (1 − p2) × (1 − p3) × p4 × p5 (see the sketch below)
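A sketch of that computation for an arbitrary presence/absence vector (the pi values below are hypothetical):

```python
def doc_likelihood(presence, p):
    """P(D | R) under term independence: multiply pi for present terms
    and (1 - pi) for absent ones."""
    prob = 1.0
    for present, p_i in zip(presence, p):
        prob *= p_i if present else (1.0 - p_i)
    return prob

# The slide's vector (1, 0, 0, 1, 1) -> p1 * (1-p2) * (1-p3) * p4 * p5
print(doc_likelihood([1, 0, 0, 1, 1], [0.8, 0.1, 0.2, 0.5, 0.6]))  # 0.1728
```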
- Assume independence:

    P(D | R) = Π(i : di = 1) pi × Π(i : di = 0) (1 − pi)

- Binary independence model
  ○ Take the product over the terms that have value one, times the product over the terms that have value zero
  ○ pi is the probability that term i occurs (i.e., has value 1) in a relevant document; si is the probability of occurrence in a non-relevant document
- The scoring function is

    score(D; Q) = Σ(i : di = qi = 1) log [ pi (1 − si) / ( si (1 − pi) ) ]

  (the last term in the derivation was the same for all documents, so it can be ignored; see the sketch below)
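A sketch of the scoring function, assuming the standard binary independence model form above (the pi and si estimates are hypothetical):

```python
import math

def bim_score(doc_terms, query_terms, p, s):
    """Sum of log odds over the terms present in both document and query."""
    score = 0.0
    for t in doc_terms & query_terms:
        score += math.log((p[t] * (1 - s[t])) / (s[t] * (1 - p[t])))
    return score

p = {"president": 0.02, "lincoln": 0.03}    # P(term=1 | relevant), hypothetical
s = {"president": 0.001, "lincoln": 0.001}  # P(term=1 | non-relevant), hypothetical
print(bim_score({"president", "lincoln", "war"}, {"president", "lincoln"}, p, s))
# Positive score: the matching terms are far more likely in relevant documents
```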
- Jumping ahead to machine learning and web search: lots of training data is available from web search queries. Learning-to-rank models.
- http://www.bradblock.com/A_General_Language_Model_for_Information_Retrieval.pdf

Editor’s Notes

  1. The angle captures the relative proportion of terms.
  2. http://nlp.stanford.edu/IR-book/html/htmledition/inverse-document-frequency-1.html … For example, “auto industry”: all documents contain the word “auto”, so we want to decrease the value of that term as it occurs more often, because it is non-discriminating in a search query. df is more useful; look at the range.
  3. tf is the number of times the word occurs in document d.
  4. D is a collection of documents. R is relevance.
  5. Use log since we get lots of small numbers. pi is the probability that term i occurs in the relevant set.