BM25 Scoring for Lucene: From Academia to Industry
1. BM25 Scoring for Lucene: From Academia to Industry
Yuval Feinstein
Answers Corporation
Apache Lucene EuroCon 2010 Meetup
Prague, May 2010
2. Overview
Answers.com
A Relevance problem
BM25F - a possible solution
Joaquin’s Implementation
Productization
Future directions
3. Answers.com
Mission - Provide best answers about anything.
A popular web site (according to comScore, March 2010):
#33 worldwide, with 75.8 million unique users
#18 in US, with 51.2 million unique users
WikiAnswers – community Q&A site (UGC)
ReferenceAnswers – editorial content
Atlas – internal search engine
Implicit search example: find similar questions
6. Enter BM25F
Query Q = (t1, t2, …, tm)
Document D
Term frequency tfi
$$ \mathrm{similarity}(Q, D) = \sum_{t_i \in Q \cap D} w_i \cdot tf_i $$
How much should tfi influence similarity?
Determine similarity by choosing weights
BM25F: saturation, soft length normalization, idf
weights and field weights.
7. Saturation
[Figure: Frequency Saturation. Weight tf/(2+tf), from 0 to 1, plotted against term frequency tf, from 0 to 30, with a region of the curve labeled "Saturated".]
Replace tf by tf/(k1+tf)
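The substitution above can be tried out numerically. The following is a minimal sketch, assuming k1 = 2 as in the plotted curve tf/(2+tf); the function name `saturate` is made up for illustration and is not part of any Lucene API.

```python
def saturate(tf: float, k1: float = 2.0) -> float:
    """Map a raw term frequency to the bounded weight tf / (k1 + tf)."""
    if tf < 0:
        raise ValueError("term frequency must be non-negative")
    return tf / (k1 + tf)


# The weight grows quickly for the first few occurrences, then flattens out.
for tf in (0, 1, 2, 5, 10, 30):
    print(f"tf={tf:2d}  weight={saturate(tf):.4f}")
```

With k1 = 2, two occurrences already earn half of the maximum weight, and going from 10 to 30 occurrences adds only about 0.1.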
9. Inverse Document Frequency (IDF)
[Figure: IDF weighting. IDF weight (wi), from 0 to 2.5, plotted against the number of docs with the term (ni), from 0 to 120.]
$$ w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} $$
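As a rough illustration of the IDF weight, here is a small sketch of the usual BM25 form log((N - ni + 0.5) / (ni + 0.5)). The slide does not state the logarithm base, so the natural logarithm used here is an assumption, and the name `idf_weight` is hypothetical.

```python
import math


def idf_weight(num_docs: int, docs_with_term: int) -> float:
    """BM25-style IDF: log((N - n_i + 0.5) / (n_i + 0.5)).

    Natural log is an assumption; the slide only says "log".
    """
    if not 0 <= docs_with_term <= num_docs:
        raise ValueError("docs_with_term must be between 0 and num_docs")
    return math.log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Rare terms get a large weight; terms found in half of the collection get zero.
for n_i in (1, 10, 50, 90):
    print(f"n_i={n_i:3d}  idf={idf_weight(100, n_i):+.3f}")
```

Note that in this form the weight becomes negative once a term appears in more than half of the documents, which an implementation may want to clamp or otherwise handle.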
10. Field Weights
Every field has a different b (length verbosity parameter) and a different v (field value parameter)
11. The BM25F Formula
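Field weighting:
$$ \widetilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{si}}{B_s} $$
Field length normalization:
$$ B_s = (1 - b_s) + b_s \cdot \frac{sl_s}{avsl_s} $$
Saturation and IDF:
$$ w_i^{BM25F} = w_i^{IDF} \cdot \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} $$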
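The three steps of the formula (field length normalization, field weighting, then saturation with IDF) can be sketched as follows. This is an illustrative sketch, not the library's code: the `FieldParams` class, the function names, and all parameter values in the example are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FieldParams:
    v: float        # field weight v_s
    b: float        # length normalization strength b_s, between 0 and 1
    avg_len: float  # average length of this field over the collection


def field_length_norm(b: float, field_len: float, avg_len: float) -> float:
    """B_s = (1 - b_s) + b_s * (sl_s / average field length)."""
    return (1.0 - b) + b * field_len / avg_len


def weighted_tf(term_freqs: dict, field_lens: dict, params: dict) -> float:
    """Combine per-field term frequencies: sum over fields of v_s * tf_si / B_s."""
    total = 0.0
    for name, p in params.items():
        tf = term_freqs.get(name, 0)
        if tf:
            total += p.v * tf / field_length_norm(p.b, field_lens[name], p.avg_len)
    return total


def bm25f_term_weight(idf: float, tf_tilde: float, k1: float) -> float:
    """Saturate the combined frequency and scale it by the term's IDF weight."""
    return idf * tf_tilde / (k1 + tf_tilde)


# Hypothetical two-field setup; both fields have exactly average length, so B_s = 1.
params = {"title": FieldParams(v=2.0, b=0.5, avg_len=5.0),
          "body": FieldParams(v=1.0, b=0.75, avg_len=100.0)}
tf_tilde = weighted_tf({"title": 1, "body": 2}, {"title": 5, "body": 100}, params)
print(tf_tilde, bm25f_term_weight(idf=1.0, tf_tilde=tf_tilde, k1=2.0))
```

Combining the fields before saturating means a term repeated across several fields still hits a single ceiling rather than one ceiling per field.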
12. Joaquin’s Implementation
Joaquín Pérez Iglesias of UNED, Madrid, Spain
implemented a BM25F library for Lucene,
with the class BM25BooleanQuery
Algorithm:
Collect documents with query terms
Score individual terms using BM25F
Combine scores using addition to get Boolean query
score
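The three algorithm steps above can be sketched on a toy in-memory index. This is not the BM25BooleanQuery code; it assumes that "documents with query terms" means documents matching at least one term, and it uses a simplified per-term score (saturation only, no IDF) as a stand-in. All names are hypothetical.

```python
from collections import defaultdict


def score_boolean_query(query_terms, postings, term_score):
    """Collect matching documents, score each term, and add the scores up.

    postings maps term -> {doc_id: term frequency}; term_score(term, tf)
    returns that term's contribution for one document.
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in postings.get(term, {}).items():
            scores[doc_id] += term_score(term, tf)
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def saturated_tf(term, tf, k1=2.0):
    """Stand-in per-term score: saturation only, IDF left out for brevity."""
    return tf / (k1 + tf)


postings = {"lucene": {1: 2, 2: 1}, "bm25": {1: 2}}
ranking = score_boolean_query(["lucene", "bm25"], postings, saturated_tf)
print(ranking)  # document 1 matches both terms and ranks first
```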
13. BM25F Usefulness for Our Case
Short texts
Term repetitions hurt relevance for short texts
Want to combine different fields (in the future,
different information sources)
Initial experiments showed nice relevance, but…
14. Feeling Safe to make Changes
How can we be sure not to break anything?
Added Unit Tests
(This is almost a Lucene standard, but not in
Academia…)
15. Production Challenges – Performance
Can this library handle 10M queries daily?
Initial Runtimes:
                          Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   161                    119
BM25F                     273                    209
Difference                68%                    75%
16. Improving Performance
Addressed using:
Benchmarking
Profiling
Refactoring, to give
                          Average runtime (ms)   Median runtime (ms)
Standard Lucene scoring   93                     65
BM25F                     92                     70
Difference                -1%                    8%
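A benchmarking step of this kind can be sketched as a small timing harness that reports average and median runtime per query, plus the relative difference between two scorers. The harness below is an assumption about how such numbers might be gathered, not the tooling used for the slides; `benchmark` and `relative_difference_pct` are hypothetical names.

```python
import statistics
import time


def benchmark(search, queries):
    """Run search(query) for each query; return average and median runtime in ms."""
    runtimes_ms = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        runtimes_ms.append((time.perf_counter() - start) * 1000.0)
    return {"average_ms": statistics.mean(runtimes_ms),
            "median_ms": statistics.median(runtimes_ms)}


def relative_difference_pct(baseline: float, candidate: float) -> float:
    """Percentage by which candidate is slower (+) or faster (-) than baseline."""
    return (candidate - baseline) / baseline * 100.0


# Stand-in search function; a real run would call the search engine instead.
stats = benchmark(lambda q: sorted(q), ["bm25 scoring", "lucene", "find similar questions"])
print(stats)

# Differences for the post-refactoring figures (93 vs 92 ms average, 65 vs 70 ms median).
print(round(relative_difference_pct(93, 92)), round(relative_difference_pct(65, 70)))
```

Reporting the median next to the average helps show whether a few slow queries are dominating the mean.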
17. Production Challenges – Robustness
Lots of strange user inputs, e.g.:
////////////////////////////////////////
;-)
fdsfdsdfsdffssssssfsfsfs
Addressed using more careful tokenization
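The slides do not say what the more careful tokenization looked like. One possible defensive approach, shown purely as an assumption, is to keep only letter and digit runs, cap token length and count, and treat an empty result as "nothing to search for". The function name and both limits are hypothetical.

```python
import re

# Runs of letters and digits; punctuation, slashes and underscores are dropped.
_TOKEN = re.compile(r"[^\W_]+")


def safe_tokenize(text: str, max_tokens: int = 32, max_token_len: int = 64) -> list:
    """Return lowercase alphanumeric tokens, bounded in number and length."""
    tokens = []
    for match in _TOKEN.finditer(text.lower()):
        token = match.group()
        if len(token) > max_token_len:
            continue  # skip absurdly long runs rather than searching for them
        tokens.append(token)
        if len(tokens) == max_tokens:
            break
    return tokens


for raw in ("////////////////////////////////////////", ";-)",
            "fdsfdsdfsdffssssssfsfsfs", "What is BM25?"):
    print(repr(raw), "->", safe_tokenize(raw))
```

Punctuation-only inputs yield no tokens, so the caller can skip scoring entirely; keyboard-mash input still produces one token and simply matches nothing in the index.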
18. Production Challenges – Integration and Interoperability
Needs data not currently in Lucene index:
Average Field Lengths
Document-level IDF
We calculated the first externally and
approximated the second using longest field IDF
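Both workarounds can be sketched on a toy collection. This sketch assumes that "longest field" means the field with the largest average length, and that its document frequency stands in for the document-level one; the slides do not spell out either detail. Documents are modeled as plain dicts of token lists, and all names are hypothetical.

```python
from collections import defaultdict


def average_field_lengths(docs):
    """Average token count per field over all documents (missing field counts as 0)."""
    if not docs:
        return {}
    totals = defaultdict(int)
    for doc in docs:
        for field, tokens in doc.items():
            totals[field] += len(tokens)
    return {field: total / len(docs) for field, total in totals.items()}


def longest_field(avg_lengths):
    """Pick the field with the largest average length."""
    return max(avg_lengths, key=avg_lengths.get)


def doc_freq_in_field(docs, field, term):
    """Number of documents whose given field contains the term."""
    return sum(1 for doc in docs if term in doc.get(field, ()))


docs = [
    {"title": ["bm25", "scoring"], "body": ["bm25", "for", "lucene", "search"]},
    {"title": ["lucene"], "body": ["lucene", "is", "a", "search", "library", "in", "java"]},
]
avg = average_field_lengths(docs)
field = longest_field(avg)
print(avg, field, doc_freq_in_field(docs, field, "bm25"))
```

The resulting document frequency can then be fed into the IDF weight in place of a true document-level count.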
Library does not play nicely with others – not
recursive
BM25 Library supports BooleanQuery, not
phrases, prefix, etc.
19. Remember case 31136?
Well, she’s mostly pleased…
BM25 runs in our production environment
Supporting 10s of millions of queries daily
20. Future Work
LUCENE-2091 – Our suggested contrib patch
LUCENE-2392 – Current work on making Lucene
scoring more flexible, to incorporate BM25 as well
as other models
We want to incorporate BM25 scoring into Solr
Could this be faster as well?
21. References
Integrating the Probabilistic Model BM25/BM25F into Lucene – Joaquín Pérez Iglesias
The Probabilistic Relevance Framework: BM25 and Beyond – Stephen Robertson and Hugo Zaragoza
Working Effectively with Legacy Code – Michael Feathers