BM25 Scoring for Lucene: From Academia to Industry

Yuval Feinstein
Answers Corporation

Apache Lucene EuroCon 2010 Meetup
Prague, May 2010

Overview

 Answers.com
 A Relevance problem
 BM25F - a possible solution
 Joaquin’s Implementation
 Productization
 Future directions


Answers.com

 Mission - Provide best answers about anything.
 A popular web site (according to comScore,
March 2010):
 #33 worldwide, with 75.8 million unique users
 #18 in US, with 51.2 million unique users
 WikiAnswers – community Q&A site (UGC)
 ReferenceAnswers – editorial content
 Atlas – internal search engine
 Implicit search example: find similar questions

Enter BM25F

 Query $Q = (t_1, t_2, \ldots, t_m)$
 Document $D$
 Term frequency $tf_i$

$$ \mathrm{similarity}(Q, D) = \sum_{t_i \in Q \cap D} w_i \, tf_i $$

 How much should $tf_i$ influence similarity?
 Determine similarity by choosing weights
 BM25F: saturation, soft length normalization, idf
weights and field weights.

Saturation

[Figure: Frequency Saturation. Weight tf/(2+tf) (y-axis, 0 to 1) against term frequency tf (x-axis, 0 to 30); the curve flattens toward 1 in the region marked "Saturated".]

Replace tf by tf/(k1+tf)
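
The saturation step can be tried out numerically. The following is a minimal sketch, not code from the library discussed later; the function name `saturate` is hypothetical, and the default `k1 = 2` is taken only from the plot label tf/(2+tf).

```python
def saturate(tf: float, k1: float = 2.0) -> float:
    """Saturated term-frequency weight tf / (k1 + tf).

    k1 = 2.0 mirrors the plot label tf/(2+tf); it is an assumption,
    not a recommended production value.
    """
    if tf < 0:
        raise ValueError("tf must be non-negative")
    if k1 <= 0:
        raise ValueError("k1 must be positive")
    return tf / (k1 + tf)


if __name__ == "__main__":
    # The weight grows quickly for small tf and flattens below 1.
    for tf in (0, 1, 2, 10, 30):
        print(tf, saturate(tf))
```

With this form, going from 1 to 2 occurrences changes the weight far more than going from 10 to 30, which is the behavior the plot illustrates.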

Soft Length Normalization

[Figure: Length normalization. Normalized frequency (y-axis, 0 to 2) against document length (x-axis, 0 to 30).]

Replace tf by

$$ tf' = \frac{tf}{(1 - b) + b \, \frac{dl}{avdl}} $$
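
A small sketch of the normalization above, assuming dl is the document length and avdl the average document length in the same units (for example tokens). The function name `normalize_tf` and the default `b = 0.75` are illustrative assumptions, not values from the talk.

```python
def normalize_tf(tf: float, dl: float, avdl: float, b: float = 0.75) -> float:
    """Soft length normalization: tf / ((1 - b) + b * dl / avdl).

    b = 0 disables normalization; b = 1 divides tf by the full
    length ratio dl / avdl. The default 0.75 is an assumption.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must lie in [0, 1]")
    if avdl <= 0:
        raise ValueError("avdl must be positive")
    return tf / ((1.0 - b) + b * dl / avdl)


if __name__ == "__main__":
    # Same raw tf, three document lengths around an average of 10.
    for dl in (5, 10, 20):
        print(dl, normalize_tf(3, dl, avdl=10))
```

A document of exactly average length keeps its raw tf; shorter documents are boosted and longer ones are damped, with b controlling how strongly.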

Inverse Document Frequency (IDF)

[Figure: IDF weighting. IDF weight w_i (y-axis, 0 to 2.5) against the number of documents containing the term, n_i (x-axis, 0 to 120).]

$$ w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} $$
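
The IDF weight can be sketched as follows, assuming N is the total number of documents and n_i the number containing term i. The function name `idf_weight` is hypothetical, and the natural logarithm is an assumption (the slide does not state the base; a different base only rescales the weights).

```python
import math


def idf_weight(n_docs: int, n_i: int) -> float:
    """IDF weight log((N - n_i + 0.5) / (n_i + 0.5)).

    Natural log is assumed. Note that the weight becomes negative
    once a term occurs in more than half of the documents.
    """
    if not 0 <= n_i <= n_docs:
        raise ValueError("need 0 <= n_i <= n_docs")
    return math.log((n_docs - n_i + 0.5) / (n_i + 0.5))


if __name__ == "__main__":
    for n_i in (1, 10, 50, 90):
        print(n_i, idf_weight(100, n_i))
```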

Field Weights

Every field has a different b (length verbosity parameter) and a different v
(field value parameter)

The BM25F Formula

Field weighting:

$$ \tilde{tf}_i = \sum_{s=1}^{S} v_s \, \frac{tf_{si}}{B_s} $$

Field length normalization:

$$ B_s = (1 - b_s) + b_s \, \frac{sl_s}{avsl_s} $$

Saturation and IDF:

$$ w_i^{BM25F} = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \, w_i^{IDF} $$
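
Putting the three formulas together, the weight of one query term in one document might be computed as in the sketch below. This is an illustration under stated assumptions, not the library's code: the names `FieldParams` and `bm25f_term_weight`, the dictionary-based inputs, the natural logarithm, and the default `k1 = 2` are all hypothetical choices.

```python
import math
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass(frozen=True)
class FieldParams:
    v: float     # field weight v_s
    b: float     # length normalization strength b_s
    avsl: float  # average length of this field over the collection


def bm25f_term_weight(
    tf_by_field: Mapping[str, float],   # tf_si: term frequency per field
    len_by_field: Mapping[str, float],  # sl_s: field lengths of this document
    params: Mapping[str, FieldParams],
    n_docs: int,                        # N
    n_i: int,                           # documents containing the term
    k1: float = 2.0,                    # assumed value, for illustration only
) -> float:
    # Field weighting: sum of v_s * tf_si / B_s over the fields.
    tf_tilde = 0.0
    for name, p in params.items():
        tf = tf_by_field.get(name, 0.0)
        if tf == 0:
            continue
        b_s = (1.0 - p.b) + p.b * len_by_field[name] / p.avsl
        tf_tilde += p.v * tf / b_s
    # Saturation combined with the IDF weight.
    idf = math.log((n_docs - n_i + 0.5) / (n_i + 0.5))
    return tf_tilde / (k1 + tf_tilde) * idf


if __name__ == "__main__":
    params: Dict[str, FieldParams] = {
        "title": FieldParams(v=2.0, b=0.5, avsl=10.0),
        "body": FieldParams(v=1.0, b=0.75, avsl=100.0),
    }
    print(bm25f_term_weight({"title": 1, "body": 2},
                            {"title": 10, "body": 100},
                            params, n_docs=100, n_i=10))
```

Note that saturation is applied once, after the per-field frequencies have been combined, so a term repeated across several fields does not get several independent boosts.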

Joaquin’s Implementation

 Joaquín Pérez Iglesias of UNED, Madrid, Spain
implemented a BM25F library for Lucene,
with the class BM25BooleanQuery
 Algorithm:
 Collect documents with query terms
 Score individual terms using BM25F
 Combine scores using addition to get Boolean query
score
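
The three algorithm steps can be sketched over a toy in-memory index. This is not the BM25BooleanQuery class or any Lucene API; `score_boolean_query`, the postings layout, and the pluggable `term_score` callback are hypothetical names chosen for illustration, and the scorer in the usage example is a plain saturation function standing in for a full BM25F weight.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Mapping, Sequence, Tuple

# Toy postings: term -> {doc_id: term frequency}. Assumed layout.
Postings = Mapping[str, Mapping[int, int]]


def score_boolean_query(
    query_terms: Sequence[str],
    postings: Postings,
    term_score: Callable[[str, int, int], float],
) -> List[Tuple[int, float]]:
    scores: Dict[int, float] = defaultdict(float)
    # Repeated query terms are counted once (an assumption of this sketch).
    for term in dict.fromkeys(query_terms):
        # Step 1: collect the documents containing this query term.
        for doc_id, tf in postings.get(term, {}).items():
            # Step 2: score the individual term.
            # Step 3: combine per-term scores by addition.
            scores[doc_id] += term_score(term, doc_id, tf)
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    postings = {"lucene": {1: 2, 2: 1}, "bm25": {2: 3}}
    ranked = score_boolean_query(
        ["lucene", "bm25"], postings, lambda term, doc, tf: tf / (2 + tf)
    )
    print(ranked)
```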


BM25F Usefulness for Our Case

 Short texts
 Term repetitions hurt relevance for short texts
 Want to combine different fields (in the future,
different information sources)

 Initial experiments showed nice relevance, but…

Feeling Safe to make Changes

 How can we be sure not to break anything?

 Added Unit Tests
 (This is almost a Lucene standard, but not in
Academia…)


Production Challenges – Performance

Can this library handle 10M queries daily?
Initial Runtimes:

                          Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   161                    119
BM25F                     273                    209
Difference                68%                    75%


Improving Performance

Addressed using:
 Benchmarking

 Profiling

 Refactoring, to give

                          Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   93                     65
BM25F                     92                     70
Difference                -1%                    8%
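
For the benchmarking step, a comparison of this kind could be summarized as sketched below. The slides do not say how the Difference row was computed; the relative difference used here is one plausible definition, and the function names `summarize` and `relative_difference` are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def summarize(runtimes_ms: Sequence[float]) -> Tuple[float, float]:
    """Return (average, median) of per-query runtimes in milliseconds."""
    return statistics.mean(runtimes_ms), statistics.median(runtimes_ms)


def relative_difference(candidate: float, baseline: float) -> float:
    """Percentage by which candidate exceeds baseline (negative if faster)."""
    return 100.0 * (candidate - baseline) / baseline


if __name__ == "__main__":
    # Figures from the post-refactoring table above.
    print(round(relative_difference(92, 93)))  # average: prints -1
    print(round(relative_difference(70, 65)))  # median: prints 8
```

Reporting the median alongside the average is useful here because a few very slow queries can dominate the average.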

Robustness

 Lots of users → strange inputs, e.g.
////////////////////////////////////////
;-)
fdsfdsdfsdffssssssfsfsfs

 Addressed using more careful tokenization
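
The slides do not describe what the more careful tokenization did. The sketch below is only one possible reading: keep alphanumeric runs, drop punctuation-only input, and cap token length and count. The name `careful_tokenize` and both limits are assumptions.

```python
import re
from typing import List

# Runs of letters and digits (Unicode-aware); punctuation never forms a token.
_TOKEN = re.compile(r"[^\W_]+")


def careful_tokenize(text: str, max_token_len: int = 30, max_tokens: int = 50) -> List[str]:
    """Lowercase alphanumeric tokens, with assumed caps on length and count."""
    tokens: List[str] = []
    for match in _TOKEN.finditer(text.lower()):
        token = match.group()
        if len(token) > max_token_len:
            continue  # skip implausibly long tokens
        tokens.append(token)
        if len(tokens) >= max_tokens:
            break  # bound the work done for very long inputs
    return tokens


if __name__ == "__main__":
    for raw in ("/" * 40, ";-)", "fdsfdsdfsdffssssssfsfsfs", "What is BM25F?"):
        print(repr(raw), careful_tokenize(raw))
```

Punctuation-only inputs yield an empty token list, so the caller can return no results without running a query; a gibberish word still becomes a token but simply matches nothing.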

Integration and Interoperability

 Needs data not currently in Lucene index:
 Average Field Lengths
 Document-level IDF
 We calculated the first externally and
approximated the second using longest field IDF

 Library does not play nicely with others – not
recursive
 BM25 Library supports BooleanQuery, not
phrases, prefix, etc.
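
Computing the missing statistics outside the index might look like the sketch below. Everything here is an assumption for illustration: documents are modeled as dictionaries of token lists, averages are taken over the documents that contain each field, and "longest field" is read as the field with the largest average length. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def average_field_lengths(docs: Iterable[Mapping[str, List[str]]]) -> Dict[str, float]:
    """Average token count per field, over documents that have the field."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for doc in docs:
        for field, tokens in doc.items():
            totals[field] += len(tokens)
            counts[field] += 1
    return {field: totals[field] / counts[field] for field in totals}


def approx_doc_freq(
    term: str,
    field_doc_freqs: Mapping[str, Mapping[str, int]],  # field -> term -> doc freq
    avg_lengths: Mapping[str, float],
) -> int:
    """Approximate document-level frequency by the longest field's frequency."""
    longest = max(avg_lengths, key=avg_lengths.get)
    return field_doc_freqs.get(longest, {}).get(term, 0)


if __name__ == "__main__":
    docs = [
        {"title": ["a", "b"], "body": ["a", "b", "c", "d"]},
        {"title": ["c", "d", "e", "f"], "body": ["x"] * 6},
    ]
    avg = average_field_lengths(docs)
    print(avg)
    print(approx_doc_freq("a", {"title": {"a": 1}, "body": {"a": 2}}, avg))
```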

Remember case 31136?

Well, she’s mostly pleased…

 BM25 runs in our production environment
 Supporting tens of millions of queries daily

Future Work

 LUCENE-2091 – Our suggested contrib patch
 LUCENE-2392 – Current work on making Lucene
scoring more flexible, to incorporate BM25 as well
as other models
 We want to incorporate BM25 scoring into Solr
 Could this be faster as well?


References

 Integrating the Probabilistic Model BM25/BM25F
into Lucene – Joaquin Perez Iglesias
 The Probabilistic Relevance Framework: BM25
and Beyond – Stephen Robertson and Hugo
Zaragoza
 Working Effectively with Legacy Code – Michael
Feathers
