INCORPORATING PROBABILISTIC RETRIEVAL KNOWLEDGE INTO TFIDF-BASED SEARCH ENGINE
Alex Lin
Senior Architect
Intelligent Mining
alin at IntelligentMinining.com
Overview of Retrieval Models
  Boolean Retrieval
  Vector Space Model

  Probabilistic Model

  Language Model
Boolean Retrieval
  lincoln AND NOT (car AND automobile)
  The earliest model and still in use today

  The result is very easy to explain to users

  Highly efficient computationally

  The major drawback: it lacks a sophisticated ranking algorithm.
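
A minimal sketch of how the example query could be evaluated with set operations over an inverted index. The index contents and the `postings` helper are hypothetical, made up for illustration. Every matching document is simply in or out of the result set, which is why no ranking falls out of this model.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
index = {
    "lincoln": {1, 2, 3, 4},
    "car": {2, 3, 5},
    "automobile": {3, 5, 6},
}


def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


# lincoln AND NOT (car AND automobile)
excluded = postings("car") & postings("automobile")  # AND -> intersection
result = postings("lincoln") - excluded              # AND NOT -> difference

print(sorted(result))  # [1, 2, 4]
```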
Vector Space Model
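  [Figure: Doc1, Doc2, and Query drawn as vectors in a space whose axes are terms (Term2, Term3)]

  $$\cos(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij}\, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^{2} \cdot \sum_{j=1}^{t} q_j^{2}}}$$

  Major flaw: it lacks guidance on how weighting and ranking algorithms are related to relevance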
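
The cosine measure can be sketched in a few lines. The function name `cosine` and the example weight vectors are assumptions for illustration; how the term weights themselves are chosen (for example TF-IDF) is exactly what the model leaves open.

```python
import math


def cosine(doc_weights, query_weights):
    """Cosine of the angle between a document vector and a query vector.

    Both arguments are equal-length lists of term weights (hypothetical
    inputs; the model does not say how the weights are derived).
    """
    dot = sum(d * q for d, q in zip(doc_weights, query_weights))
    norm = math.sqrt(sum(d * d for d in doc_weights) *
                     sum(q * q for q in query_weights))
    return dot / norm if norm else 0.0


# A document pointing in the same direction as the query scores 1,
# a document sharing no terms with the query scores 0.
print(cosine([3.0, 4.0], [3.0, 4.0]))
print(cosine([1.0, 0.0], [0.0, 1.0]))
```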
Probabilistic Retrieval Model

  [Figure: a Document is classified as Relevant with probability P(R|D) or Non-Relevant with probability P(NR|D)]

  Bayes' Rule:

  $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)}$$
Probabilistic Retrieval Model
  $$P(R \mid D) = \frac{P(D \mid R)\, P(R)}{P(D)} \qquad P(NR \mid D) = \frac{P(D \mid NR)\, P(NR)}{P(D)}$$

  If $P(D \mid R)\, P(R) > P(D \mid NR)\, P(NR)$, then classify D as relevant
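
The decision rule can be written directly as a comparison, since P(D) appears in both posteriors and cancels. This is only a sketch: the function name `is_relevant` and the probabilities in the usage lines are hypothetical, and P(NR) is taken to be 1 - P(R).

```python
def is_relevant(p_d_given_r, p_d_given_nr, p_r):
    """Classify D as relevant when P(D|R)P(R) > P(D|NR)P(NR).

    Assumes the two classes are exhaustive, so P(NR) = 1 - P(R).
    """
    p_nr = 1.0 - p_r
    return p_d_given_r * p_r > p_d_given_nr * p_nr


# Hypothetical numbers: relevance is rare (P(R) = 0.1), so the document
# must be much more likely under R than under NR to be classified relevant.
print(is_relevant(0.02, 0.001, 0.1))
print(is_relevant(0.02, 0.01, 0.1))
```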
Estimate P(D|R) and P(D|NR)
  Define $D = (d_1, d_2, \ldots, d_t)$, then

  $$P(D \mid R) = \prod_{i=1}^{t} P(d_i \mid R)$$

  $$P(D \mid NR) = \prod_{i=1}^{t} P(d_i \mid NR)$$

  Binary Independence Model: term independence + binary features in documents
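
Under the two Binary Independence assumptions, the document likelihood is a product of one factor per term: the term's occurrence probability if the term is present, and its complement if absent. A minimal sketch, with a hypothetical function name and made-up probabilities:

```python
def bim_likelihood(doc_vector, term_probs):
    """P(D | class) = product over terms of P(d_i | class).

    doc_vector: list of 0/1 flags, d_i = 1 if term i occurs in the document.
    term_probs: P(term i occurs | class) for the same terms
                (hypothetical values; in practice they must be estimated).
    """
    likelihood = 1.0
    for d_i, p_i in zip(doc_vector, term_probs):
        likelihood *= p_i if d_i == 1 else (1.0 - p_i)
    return likelihood


# Terms 1 and 3 occur, term 2 does not: 0.8 * (1 - 0.3) * 0.5
print(bim_likelihood([1, 0, 1], [0.8, 0.3, 0.5]))
```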
Likelihood Ratio
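  Likelihood ratio: classify D as relevant if

  $$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$

  p_i: the probability of term i occurring in the relevant set
  s_i: the probability of term i occurring in the non-relevant set

  $$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:d_i=1} \frac{p_i}{s_i} \cdot \prod_{i:d_i=0} \frac{1-p_i}{1-s_i}$$

  Taking the log and dropping the factor that is the same for every document gives a rank-equivalent score:

  $$\sum_{i:d_i=1} \log \frac{p_i (1-s_i)}{s_i (1-p_i)}$$

  Estimating p_i and s_i from counts (with 0.5 smoothing) and keeping only the query terms gives:

  $$\sum_{i:d_i=q_i=1} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)}$$

  N: total number of documents in the collection
  n_i: number of documents that contain term i
  r_i: number of relevant documents that contain term i
  R: total number of relevant documents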
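
The smoothed term weight can be computed directly from the four counts. This sketch assumes N is the size of the whole collection and n_i counts all documents containing the term, which is what the expression (N - n_i - R + r_i) in the formula requires; the function name and example counts are hypothetical.

```python
import math


def term_weight(r_i, R, n_i, N):
    """Smoothed log-odds weight of one term.

    r_i: relevant documents containing the term
    R:   relevant documents in total
    n_i: documents containing the term (assumed: whole collection)
    N:   documents in total (assumed: whole collection)
    """
    relevant_odds = (r_i + 0.5) / (R - r_i + 0.5)
    non_relevant_odds = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


# With no relevance information (r_i = R = 0) the weight reduces to an
# IDF-like quantity: log((N - n_i + 0.5) / (n_i + 0.5)).
print(term_weight(0, 0, 10, 1000))

# A term concentrated in the relevant documents gets a larger weight.
print(term_weight(8, 10, 20, 1000))
```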
Combine with BM25 Ranking Algorithm
  BM25 extends the scoring function for the binary independence model to include document and query term weights.
  It performs very well in TREC experiments

  $$R(q,D) = \sum_{i \in Q} \log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)} \cdot \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}$$

  $$K = k_1 \left( (1-b) + b \cdot \frac{dl}{avgdl} \right)$$

  k_1, k_2, b: tuning parameters
  f_i: frequency of term i in the document
  qf_i: frequency of term i in the query
  dl: document length
  avgdl: average document length in the data set
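
A sketch of the BM25 score as a single function. The function signature, the dictionary-based inputs, and the default parameter values (k1 = 1.2, k2 = 100, b = 0.75, which are commonly quoted starting points) are assumptions, not values given on the slides; N and n_i are taken over the whole collection.

```python
import math


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, doc_freq, num_docs,
               rel_doc_freq=None, num_rel=0, k1=1.2, k2=100.0, b=0.75):
    """Score one document against a query.

    query_tf:     term -> frequency in the query (qf_i)
    doc_tf:       term -> frequency in the document (f_i)
    doc_freq:     term -> number of documents containing the term (n_i)
    rel_doc_freq: term -> number of relevant documents containing it (r_i);
                  empty when no relevance information is available
    """
    rel_doc_freq = rel_doc_freq or {}
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, qf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # the TF factor is zero, so the term adds nothing
        n = doc_freq.get(term, 0)
        r = rel_doc_freq.get(term, 0)
        idf_part = math.log(
            ((r + 0.5) / (num_rel - r + 0.5)) /
            ((n - r + 0.5) / (num_docs - n - num_rel + r + 0.5)))
        tf_part = (k1 + 1) * f / (K + f)
        qf_part = (k2 + 1) * qf / (k2 + qf)
        score += idf_part * tf_part * qf_part
    return score


# Hypothetical collection: 1000 documents, "lincoln" occurs in 10 of them.
query = {"lincoln": 1}
df = {"lincoln": 10}
print(bm25_score(query, {"lincoln": 3}, 100, 100.0, df, 1000))
print(bm25_score(query, {"lincoln": 1}, 100, 100.0, df, 1000))
```

The term-frequency factor saturates: going from one to three occurrences raises the score, but by less than a factor of three.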
Weighted Fields Boolean Search
 doc-id       field0     field1                     …   text
   1
   2
   3
   …
   n


  $$R(q,D) = \sum_{i \in q} \sum_{f \in \text{fields}} w_f\, m_i$$
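
A sketch of the weighted-fields score. The slides do not define m_i; here it is assumed to be 1 when query term i matches in field f and 0 otherwise. The field weights and the example document are hypothetical.

```python
def weighted_field_score(query_terms, doc_fields, field_weights):
    """Sum w_f * m_i over all query terms i and all fields f.

    m_i is assumed to be a 0/1 match indicator for term i in field f.
    """
    score = 0.0
    for term in query_terms:
        for field, weight in field_weights.items():
            tokens = doc_fields.get(field, "").lower().split()
            match = 1 if term.lower() in tokens else 0
            score += weight * match
    return score


# Hypothetical weights: a match in field0 counts more than one in text.
weights = {"field0": 3.0, "field1": 2.0, "text": 1.0}
doc = {"field0": "Lightyear", "field1": "Buzz", "text": "Buzz Lightyear toy"}

# "buzz" matches field1 and text (2 + 1), "lightyear" matches field0 and text (3 + 1).
print(weighted_field_score(["buzz", "lightyear"], doc, weights))  # 7.0
```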
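Apply Probabilistic Knowledge into Fields
  Fields are ordered along a gradient, from higher (field0) to lower (Text)

  doc-id   field0      field1   …   Text
  1
  2        Lightyear   Buzz
  3
  …
  n

  [Figure: a Document is classified as Relevant with probability P(R|D) or Non-Relevant with probability P(NR|D)]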
Use the Knowledge during Ranking
  doc-id   field0      field1   …   Text
  1
  2        Lightyear   Buzz
  3
  …
  n

  The goal is:

  $$\log P(D \mid R) = \log \prod_{i=1}^{t} P(d_i \mid R) = \sum_{i=1}^{t} \log P(d_i \mid R) \approx \sum_{i \in q} \sum_{f \in F} w_f\, m_i$$

  The weights w_f are learnable
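
The slides do not say how the field weights are learned. One possible sketch, offered purely as an assumption, reuses the smoothed log-odds estimate from the likelihood-ratio slide per field: count how often a match in each field co-occurs with a relevant judgment in the search log. The log format (a set of matched fields plus a relevance flag) and all data are hypothetical.

```python
import math


def learn_field_weights(examples, fields):
    """Estimate one log-odds weight per field from judged examples.

    examples: list of (matched_fields, relevant) pairs, where matched_fields
              is the set of fields in which the query matched and relevant
              is a bool taken from user feedback (hypothetical log format).
    """
    N = len(examples)
    R = sum(1 for _, relevant in examples if relevant)
    weights = {}
    for f in fields:
        n = sum(1 for matched, _ in examples if f in matched)
        r = sum(1 for matched, relevant in examples if relevant and f in matched)
        weights[f] = math.log(
            ((r + 0.5) / (R - r + 0.5)) /
            ((n - r + 0.5) / (N - n - R + r + 0.5)))
    return weights


# Hypothetical log: matches in field0 tend to be relevant, matches in
# text alone tend not to be.
log = [
    ({"field0"}, True),
    ({"field0", "text"}, True),
    ({"text"}, False),
    ({"text"}, False),
    ({"field0"}, True),
    ({"text"}, False),
]
print(learn_field_weights(log, ["field0", "text"]))
```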
Comparison of Approaches
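  $$R_{TF\text{-}IDF} = tf_{ik} \cdot idf_k = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}$$

  $$R_{bm25}(q,D) = \frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i} \qquad K = k_1 \left( (1-b) + b \cdot \frac{dl}{avgdl} \right)$$

  $$R(q,D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i+0.5)/(R-r_i+0.5)}{(n_i-r_i+0.5)/(N-n_i-R+r_i+0.5)}}_{\text{IDF}} \cdot \underbrace{\frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}}_{\text{TF}}$$

  $$R(q,D) = \sum_{i \in q} \sum_{f \in F} \underbrace{w_f\, m_i}_{\text{IDF}} \cdot \underbrace{\frac{(k_1+1)\, f_i}{K+f_i} \cdot \frac{(k_2+1)\, qf_i}{k_2+qf_i}}_{\text{TF}}$$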
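
The last formula swaps the IDF-like factor for the field-weight sum and keeps the BM25 term-frequency factors. A sketch under several assumptions that the slides leave open: m_i is a 0/1 match indicator per field, f_i and the document length are counted over all fields concatenated, and the parameter defaults are common starting values. All names and data are hypothetical.

```python
def field_bm25_score(query_tf, doc_fields, field_weights, avg_doc_len,
                     k1=1.2, k2=100.0, b=0.75):
    """Field-weight sum (in place of IDF) times the BM25 TF factors.

    query_tf: lowercase term -> frequency in the query (qf_i)
    """
    tokens_by_field = {f: text.lower().split() for f, text in doc_fields.items()}
    all_tokens = [t for tokens in tokens_by_field.values() for t in tokens]
    K = k1 * ((1 - b) + b * len(all_tokens) / avg_doc_len)
    score = 0.0
    for term, qf in query_tf.items():
        f_i = all_tokens.count(term)
        # Sum of w_f over the fields in which the term matches (m_i = 1).
        field_part = sum(w for f, w in field_weights.items()
                         if term in tokens_by_field.get(f, []))
        tf_part = (k1 + 1) * f_i / (K + f_i)
        qf_part = (k2 + 1) * qf / (k2 + qf)
        score += field_part * tf_part * qf_part
    return score


weights = {"field0": 3.0, "text": 1.0}
doc_a = {"field0": "lightyear", "text": "toy story"}   # match in field0
doc_b = {"field0": "woody", "text": "lightyear story"}  # match in text only
print(field_bm25_score({"lightyear": 1}, doc_a, weights, 3.0))
print(field_bm25_score({"lightyear": 1}, doc_b, weights, 3.0))
```

Both documents have the same length and term frequency, so the difference in score comes only from the field in which the match occurs.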
Other Considerations
  This is not a formal model
  Requires user relevance feedback (search log)
  Harder to handle real-time search queries
  How to prevent Love/Hate attacks?
Thank you
