2. Overview of Retrieval Models
Boolean Retrieval
Vector Space Model
Probabilistic Model
Language Model
3. Boolean Retrieval
lincoln AND NOT (car AND automobile)
The earliest model and still in use today
The result is very easy to explain to users
Highly efficient computationally
The major drawback: the lack of a sophisticated ranking algorithm
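As a minimal sketch (with hypothetical doc-ids, not from the slides), a Boolean query like the one above can be evaluated with set operations over an inverted index:

```python
# Toy inverted index: each term maps to the set of doc-ids containing it.
index = {
    "lincoln":    {1, 2, 5},
    "car":        {2, 3},
    "automobile": {2, 4},
}

# Evaluate: lincoln AND NOT (car AND automobile)
# AND -> set intersection, NOT -> set difference.
result = index["lincoln"] - (index["car"] & index["automobile"])
print(sorted(result))  # doc 2 is excluded: it contains both car and automobile
```

Each result is either in or out of every set, which is why the model is easy to explain and efficient, but offers no ranking among the matches.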
4. Vector Space Model
[Figure: document vectors Doc1 and Doc2 and the Query vector plotted in term space, with axes Term2 and Term3]

\mathrm{Cos}(D_i, Q) = \frac{\sum_{j=1}^{t} d_{ij} \, q_j}{\sqrt{\sum_{j=1}^{t} d_{ij}^2} \cdot \sqrt{\sum_{j=1}^{t} q_j^2}}
Major flaw: the model gives no guidance on how the weighting and ranking choices relate to relevance
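A small sketch of the cosine-similarity formula above, with hypothetical term weights over a three-term vocabulary:

```python
import math

def cosine(d, q):
    """Cosine similarity between a document vector and a query vector."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = (math.sqrt(sum(di * di for di in d)) *
            math.sqrt(sum(qi * qi for qi in q)))
    return dot / norm if norm else 0.0

doc1  = [1.0, 0.0, 2.0]   # illustrative weights, not from the slides
query = [1.0, 1.0, 0.0]
print(round(cosine(doc1, query), 3))
```

Because the score is normalized by both vector lengths, it measures the angle between document and query rather than their magnitudes.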
6. Probabilistic Retrieval Model
P(R | D) = \frac{P(D | R) \, P(R)}{P(D)} \qquad P(NR | D) = \frac{P(D | NR) \, P(NR)}{P(D)}

If P(D | R) \, P(R) > P(D | NR) \, P(NR), then classify D as relevant.
7. Estimate P(D|R) and P(D|NR)
Define D = (d_1, d_2, \ldots, d_t); then, assuming term independence,

P(D | R) = \prod_{i=1}^{t} P(d_i | R)

P(D | NR) = \prod_{i=1}^{t} P(d_i | NR)

Binary Independence Model:
term independence + binary features in documents
8. Likelihood Ratio
Likelihood ratio:

\frac{P(D | R)}{P(D | NR)} > \frac{P(NR)}{P(R)}

p_i: the probability of term i occurring in the relevant set
s_i: the probability of term i occurring in the non-relevant set

\frac{P(D | R)}{P(D | NR)} = \prod_{i: d_i = 1} \frac{p_i}{s_i} \cdot \prod_{i: d_i = 0} \frac{1 - p_i}{1 - s_i}

Taking logs, dropping the document-independent term, and assuming p_i = s_i for non-query terms, this is rank-equivalent to

\sum_{i: d_i = 1} \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)} = \sum_{i: d_i = q_i = 1} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}

N: total number of documents in the collection
n_i: number of documents that contain term i
r_i: number of relevant documents that contain term i
R: total number of relevant documents
(the 0.5 terms smooth the estimates)
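The smoothed term-weight estimate above can be computed directly. A minimal sketch, with hypothetical counts (1000 documents, 100 containing the term, 10 judged relevant of which 8 contain the term):

```python
import math

def rsj_weight(N, n_i, R, r_i):
    """Smoothed relevance term weight, following the estimate on this slide."""
    rel = (r_i + 0.5) / (R - r_i + 0.5)            # odds of term in relevant docs
    nonrel = (n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)  # odds in non-relevant docs
    return math.log(rel / nonrel)

print(round(rsj_weight(N=1000, n_i=100, R=10, r_i=8), 3))
```

The 0.5 smoothing keeps the weight finite even when a term appears in all or none of the judged documents.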
9. Combine with the BM25 Ranking Algorithm
BM25 extends the scoring function of the binary independence model to include document and query term weights.
It performs very well in TREC experiments.

R(q,D) = \sum_{i \in Q} \log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)} \cdot \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1) \, qf_i}{k_2 + qf_i}

K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)

k_1, k_2, b: tuning parameters
f_i: frequency of term i in the document
qf_i: frequency of term i in the query
dl: document length
avgdl: average document length in the data set
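A sketch of one term's BM25 contribution under the formula above; the parameter defaults and counts are illustrative choices, not values from the slides:

```python
import math

def bm25_term(N, n_i, R, r_i, f_i, qf_i, dl, avgdl,
              k1=1.2, k2=100.0, b=0.75):
    """One query term's BM25 score contribution (common default parameters)."""
    # Relevance-weighted idf part, with 0.5 smoothing.
    idf = math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                   ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))
    # Document-length normalization.
    K = k1 * ((1 - b) + b * dl / avgdl)
    tf_doc = (k1 + 1) * f_i / (K + f_i)       # saturating document tf weight
    tf_query = (k2 + 1) * qf_i / (k2 + qf_i)  # query tf weight
    return idf * tf_doc * tf_query

# With no relevance information (R = r_i = 0) the idf part reduces to the
# usual BM25 idf. Hypothetical counts:
print(round(bm25_term(N=1000, n_i=100, R=0, r_i=0,
                      f_i=3, qf_i=1, dl=120, avgdl=100), 3))
```

Summing this contribution over the query terms gives R(q, D) for a document.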
10. Weighted Fields Boolean Search
[Table: per-document field index — columns doc-id, field0, field1, …, text; rows for documents 1..n]

R(q,D) = \sum_{i \in q} \sum_{f \in fields} w_f \, m_i

w_f: weight assigned to field f
m_i: match indicator for query term i in field f
11. Apply Probabilistic Knowledge to Fields
[Figure: the field index again (e.g. doc 2 with "Lightyear" in field0 and "Buzz" in field1), with field weights on a gradient from higher to lower; relevance judgments split documents into a relevant set, modeled by P(R | D), and a non-relevant set, modeled by P(NR | D)]
12. Use the Knowledge during Ranking
[Table: the field index as above — e.g. doc 2 contains "Lightyear" in field0 and "Buzz" in field1]

The goal is:

\log P(D | R) = \log \prod_{i=1}^{t} P(d_i | R) = \sum_{i=1}^{t} \log P(d_i | R) \approx \sum_{i \in q} \sum_{f \in F} w_f \, m_i

where the field weights w_f are learnable.
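A toy sketch of the field-weighted match score \sum_{i \in q} \sum_{f \in F} w_f m_i; the field names and weight values are hypothetical stand-ins for weights learned from relevance judgments:

```python
# "Learned" per-field weights w_f (hypothetical values for illustration).
field_weights = {"title": 3.0, "anchor": 2.0, "body": 1.0}

def field_score(query_terms, doc_fields):
    """Sum w_f over every (term, field) pair where the term matches (m_i = 1)."""
    score = 0.0
    for term in query_terms:
        for field, w in field_weights.items():
            if term in doc_fields.get(field, ""):
                score += w
    return score

doc = {"title": "buzz lightyear", "body": "a toy space ranger"}
print(field_score(["buzz", "toy"], doc))
```

A match in a heavily weighted field (here, title) contributes more than the same match in the body, which is the effect the learned weights are meant to capture.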
13. Comparison of Approaches
R_{TF\text{-}IDF} = tf_{ik} \cdot idf_i = \frac{f_{ik}}{\sum_{j=1}^{t} f_{ij}} \cdot \log \frac{N}{n_k}

R_{bm25}(q,D) = \frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1) \, qf_i}{k_2 + qf_i}, \quad K = k_1 \left( (1 - b) + b \cdot \frac{dl}{avgdl} \right)

R(q,D) = \sum_{i \in Q} \underbrace{\log \frac{(r_i + 0.5)/(R - r_i + 0.5)}{(n_i - r_i + 0.5)/(N - n_i - R + r_i + 0.5)}}_{IDF} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1) \, qf_i}{k_2 + qf_i}}_{TF}

R(q,D) = \sum_{i \in q} \sum_{f \in F} \underbrace{w_f \, m_i}_{IDF} \cdot \underbrace{\frac{(k_1 + 1) f_i}{K + f_i} \cdot \frac{(k_2 + 1) \, qf_i}{k_2 + qf_i}}_{TF}
14. Other Considerations
This is not a formal model
Requires user relevance feedback (search logs)
Harder to handle real-time search queries
How to prevent Love/Hate attacks