BM25 Scoring for Lucene:
From Academia to Industry

             Yuval Feinstein
             Answers Corporation




              Apache Lucene EuroCon 2010 Meetup
              Prague, May 2010
Overview

       Answers.com
       A Relevance problem
       BM25F - a possible solution
       Joaquin’s Implementation
       Productization
       Future directions




Answers.com

       Mission - Provide best answers about anything.
       A popular web site (according to comScore,
        March 2010):
          #33 worldwide, with 75.8 million unique users
          #18 in US, with 51.2 million unique users
       WikiAnswers – community Q&A site (UGC)
       ReferenceAnswers – editorial content
       Atlas – internal search engine
       Implicit search example: find similar questions
Similar Questions




Case 31136




Enter BM25F

   Query Q = (t1, t2, …, tm)
   Document D
   Term frequency tfi
    $similarity(Q, D) = \sum_{t_i \in Q \cap D} w_i \cdot tf_i$

   How much should tfi influence similarity?
   Determine similarity by choosing weights
   BM25F: saturation, soft length normalization, idf
    weights and field weights.
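
A minimal sketch of the plain weighted sum above, before any BM25F refinements. The function name, the example weights and the toy documents are illustrative assumptions, not taken from the talk. It shows why raw tf is a problem: repeating one word keeps raising the score.

```python
from collections import Counter

def linear_similarity(query_terms, doc_terms, weights):
    """Sum w_i * tf_i over the terms shared by query and document."""
    tf = Counter(doc_terms)
    shared = set(query_terms) & set(tf)
    # Terms without an explicit weight default to 1.0 (an assumption).
    return sum(weights.get(t, 1.0) * tf[t] for t in shared)

# Hypothetical weights and documents, for illustration only.
weights = {"lucene": 2.0, "scoring": 1.0}
query = ["lucene", "scoring"]
short_doc = ["lucene", "scoring"]
spammy_doc = ["lucene"] * 10

score_short = linear_similarity(query, short_doc, weights)
score_spammy = linear_similarity(query, spammy_doc, weights)
```

Here the document that only repeats "lucene" ten times outscores the one that matches both query terms, which is the behaviour saturation is meant to dampen.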
Saturation

 [Figure: Frequency Saturation. x-axis: term frequency tf (0 to 30); y-axis: saturated weight tf/(2+tf) (0 to 1)]

 Replace tf by tf/(k1+tf)
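
A small sketch of the saturation step. The default k1 = 2 matches the curve tf/(2+tf) plotted on this slide; the function name is an assumption.

```python
def saturate(tf, k1=2.0):
    """Map a raw term frequency to a bounded weight in [0, 1)."""
    return tf / (k1 + tf)

# Each extra occurrence adds less than the previous one.
gains = [saturate(tf + 1) - saturate(tf) for tf in range(5)]
```

The weight reaches 0.5 at tf = k1 and approaches, but never reaches, 1.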
Soft Length Normalization

 [Figure: Length normalization. x-axis: document length (0 to 30); y-axis: normalized frequency (0 to 2)]

 Replace tf by
     $tf' = \frac{tf}{(1 - b) + b \cdot \frac{dl}{avdl}}$
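
A sketch of soft length normalization as written above. The default b = 0.75 is a commonly used value and an assumption here; the slide does not state which b was used.

```python
def normalize_tf(tf, dl, avdl, b=0.75):
    """Scale tf down for long documents and up for short ones.

    b = 0 switches normalization off; b = 1 normalizes fully by dl/avdl.
    """
    return tf / ((1.0 - b) + b * dl / avdl)

same_length = normalize_tf(3, dl=10, avdl=10)        # document of average length
twice_as_long = normalize_tf(7, dl=20, avdl=10)      # document twice the average
no_normalization = normalize_tf(3, dl=20, avdl=10, b=0.0)
```

A document of exactly average length keeps its tf unchanged, which is why the normalization is called "soft".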
Inverse Document Frequency (IDF)

 [Figure: IDF weighting. x-axis: number of docs with term n_i (0 to 120); y-axis: IDF weight w_i (0 to 2.5)]

     $w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5}$
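
A sketch of the IDF weight above. The natural logarithm is an assumption; the slide does not give the base of the log.

```python
import math

def idf_weight(num_docs, docs_with_term):
    """IDF weight: log((N - n_i + 0.5) / (n_i + 0.5))."""
    return math.log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

rare = idf_weight(100, 1)
common = idf_weight(100, 10)
half = idf_weight(100, 50)
very_common = idf_weight(100, 90)
```

Note that this form of IDF becomes negative once a term occurs in more than half of the documents, so very common terms can lower a score unless the value is clamped.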
Field Weights




     Every field has a different b (length verbosity parameter) and a different v
     (field value parameter)
The BM25F Formula

 Field weighting:
     $\tilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{si}}{B_s}$

 Field length normalization:
     $B_s = (1 - b_s) + b_s \cdot \frac{sl_s}{avsl}$

 Saturation and IDF:
     $w_i^{BM25F} = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \cdot w_i^{IDF}$
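
The three steps of the formula can be sketched together as one function. All parameter names and the example values are assumptions for illustration; in particular the sketch uses one average length per field for avsl, and a natural log for the IDF.

```python
import math

def bm25f_term_weight(tf_by_field, len_by_field, avg_len_by_field,
                      v, b, k1, num_docs, docs_with_term):
    """BM25F weight of one term in one document."""
    tf_tilde = 0.0
    for field, tf in tf_by_field.items():
        # Field length normalization: B_s = (1 - b_s) + b_s * sl_s / avsl
        B_s = (1.0 - b[field]) + b[field] * len_by_field[field] / avg_len_by_field[field]
        # Field weighting: accumulate v_s * tf_si / B_s over all fields
        tf_tilde += v[field] * tf / B_s
    idf = math.log((num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Saturation and IDF
    return tf_tilde / (k1 + tf_tilde) * idf

# Hypothetical two-field document whose fields have exactly average length.
w = bm25f_term_weight(
    tf_by_field={"title": 1, "body": 2},
    len_by_field={"title": 5, "body": 40},
    avg_len_by_field={"title": 5.0, "body": 40.0},
    v={"title": 2.0, "body": 1.0},
    b={"title": 0.5, "body": 0.75},
    k1=2.0, num_docs=100, docs_with_term=10,
)
```

The field frequencies are combined before saturation, so a term repeated across several fields saturates once rather than once per field.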
Joaquin’s Implementation

        Joaquín Pérez Iglesias of UNED, Madrid, Spain
         implemented a BM25F library for Lucene,
         with the class BM25BooleanQuery
        Algorithm:
          Collect documents with query terms
          Score individual terms using BM25F
          Combine scores using addition to get Boolean query
           score
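
The three algorithm steps can be sketched in plain Python. This is not the code of the BM25F library or of BM25BooleanQuery; it is a single-field simplification (plain BM25) over an in-memory toy index, and all names, documents and parameter values are assumptions.

```python
from collections import Counter, defaultdict
import math

def build_index(docs):
    """Map each term to {doc_id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings

def score_query(query_terms, docs, postings, k1=2.0, b=0.75):
    num_docs = len(docs)
    avdl = sum(len(tokens) for tokens in docs.values()) / num_docs
    scores = defaultdict(float)
    for term in set(query_terms):
        matches = postings.get(term, {})          # 1. collect documents with the term
        n_i = len(matches)
        idf = math.log((num_docs - n_i + 0.5) / (n_i + 0.5))
        for doc_id, tf in matches.items():
            norm_tf = tf / ((1.0 - b) + b * len(docs[doc_id]) / avdl)
            term_score = norm_tf / (k1 + norm_tf) * idf   # 2. score the term
            scores[doc_id] += term_score                  # 3. combine by addition
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical question titles.
docs = {
    "q1": ["how", "to", "score", "documents", "in", "lucene"],
    "q2": ["what", "is", "lucene"],
    "q3": ["how", "to", "bake", "bread"],
    "q4": ["best", "bread", "recipe"],
    "q5": ["who", "wrote", "hamlet"],
}
ranked = score_query(["lucene", "score"], docs, build_index(docs))
```

Only documents containing at least one query term receive a score, so the result list here contains q1 and q2 and nothing else.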




BM25F Usefulness for Our Case

        Short texts
        Term repetitions hurt relevance for short texts
        Want to combine different fields (in the future,
         different information sources)

        Initial experiments showed nice relevance, but…




Feeling Safe to make Changes

        How can we be sure not to break anything?



        Added Unit Tests
        (This is almost a Lucene standard, but not in
         Academia…)




Production Challenges – Performance

     Can this library handle 10M queries daily?
     Initial Runtimes:

                                  Average Runtime (ms)   Median Runtime (ms)
        Standard Lucene Scoring   161                    119
        BM25F                     273                    209
        Difference                68%                    75%

Improving Performance

     Addressed using:
      Benchmarking
      Profiling
      Refactoring, to give

                                  Average Runtime (ms)   Median Runtime (ms)
        Standard Lucene Scoring   93                     65
        BM25F                     92                     70
        Difference                -1%                    8%
Production Challenges – Robustness

   Lots of users lead to strange inputs, e.g.:
////////////////////////////////////////
;-)
fdsfdsdfsdffssssssfsfsfs

   Addressed using more careful tokenization
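
The talk does not say what "more careful tokenization" involved. As a purely hypothetical sketch, one defensive approach is to keep only alphanumeric runs, so that punctuation-only input produces no tokens and the caller can skip the search. The regular expression (ASCII only) and the function name are assumptions.

```python
import re

# Assumption: tokens are runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def careful_tokenize(text):
    """Lowercase the input and return only alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())

strange_inputs = [
    "////////////////////////////////////////",
    ";-)",
    "fdsfdsdfsdffssssssfsfsfs",
]
tokenized = [careful_tokenize(s) for s in strange_inputs]
```

The first two inputs yield an empty token list; the third stays a single token that simply matches nothing in the index.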
Production Challenges – Integration and Interoperability

   Needs data not currently in Lucene index:
     Average Field Lengths
     Document-level IDF
   We calculated the first externally and
    approximated the second using longest field IDF

   Library does not play nicely with others – not
    recursive
   BM25 Library supports BooleanQuery, not
    phrases, prefix, etc.
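
The slide says average field lengths were calculated externally but not how. A minimal sketch of one way to do it, as a pass over already tokenized documents; the function name and data layout are assumptions, and a document that lacks a field counts as length zero for that field.

```python
from collections import defaultdict

def average_field_lengths(docs):
    """Return {field name: average token count} over all documents."""
    totals = defaultdict(int)
    for fields in docs:
        for name, tokens in fields.items():
            totals[name] += len(tokens)
    return {name: total / len(docs) for name, total in totals.items()}

# Hypothetical two-document collection.
docs = [
    {"title": ["bm25", "scoring"], "body": ["a", "b", "c", "d"]},
    {"title": ["lucene", "in", "production", "use"], "body": ["a"] * 8},
]
avg_lengths = average_field_lengths(docs)
```

The resulting dictionary can be passed to the scorer at query time, since the values change slowly as the index grows.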
Remember case 31136?



Well, she’s mostly pleased…

   BM25 runs in our production environment
   Supporting tens of millions of queries daily
Future Work

        LUCENE-2091 – Our suggested contrib patch
        LUCENE-2392 – Current work on making Lucene
         scoring more flexible, to incorporate BM25 as well
         as other models
        We want to incorporate BM25 scoring into Solr
        Could this be faster as well?




References

   Integrating the Probabilistic Model BM25/BM25F
    into Lucene – Joaquin Perez Iglesias
   The Probabilistic Relevance Framework: BM25
    and Beyond – Stephen Robertson and Hugo
    Zaragoza
   Working Effectively with Legacy Code – Michael
    Feathers
