SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Advanced Document Similarity
With Apache Lucene
Alessandro Benedetti, Software Engineer, Sease Ltd.
Alessandro Benedetti
● Search Consultant
● R&D Software Engineer
● Master in Computer Science
● Apache Lucene/Solr Enthusiast
● Semantic, NLP, Machine Learning Technologies passionate
● Beach Volleyball Player & Snowboarder
Who I am
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends : Learning To Rank, Document Similarity,
Measuring Search Quality, Relevancy Tuning
Sease Ltd
● Document Similarity
● Apache Lucene More Like This
● Term Scorer
● BM25
● Interesting Terms Retrieval
● Query Building
● DEMO
● Future Work
● JIRA References
Agenda
Real World Use Cases - Streaming Services
Real World Use Cases - Hotels
Document Similarity
Problem : find similar documents to a seed one
Solution(s) :
● Collaborative approach
(users interactions)
● Content Based
● Hybrid
Similar ?
● Documents accessed in
association to the input one by
users close to you
● Terms distributions
● All of above
Apache Lucene
Apache LuceneTM
is a high-performance, full-featured text search engine library
written entirely in Java.
It is a technology suitable for nearly any application that requires full-text
search, especially cross-platform.
Apache Lucene is an open source project available for free download.
● Search Library (java)
● Structured Documents
● Inverted Index
● Similarity Metrics ( TF-IDF, BM25)
● Fast Search
● Support for advanced queries
● Relevancy tuning
Apache Lucene
Inverted Index
Indexing
Pros
● Apache Lucene Module
● Advanced Params
● Input :
- structured document
- just text
● Build an advanced query
● Leverage the Inverted Index
( and additional data structures)
More Like This
Cons
● Massive single class
● Low cohesion
● Low readability
● Minimum test coverage
● Difficult to extend
( and improve)
Input
Document More Like This
Params
Interesting
Terms
Retriever
Term Scorer
Query Builder QUERY
More Like This - Break Up
Responsibility : define a set of parameters (and defaults) that affect the
various components of the More Like This module
● Regulate MLT behavior
● Groups parameters specific to each component
● Javadoc documentation
● Default values
● Useful container for various parameters to be passed
More Like This Params
● Field Name
● Field Stats ( Document Count)
● Term Stats ( Document Frequency)
● Term Frequency
● TF-IDF -> tf * (log ( numDocs / docFreq + 1) + 1)
● BM25
Term Scorer
Responsibility : assign a score to a term that measure how distinctive is the term
for the document in input
● Origin from Probabilistic Information Retrieval
● Default Similarity from Lucene 6.0 [1]
● 25th iteration in improving TF-IDF
● TF
● IDF
● Document Length
[1] LUCENE-6789
BM25 Term Scorer
BM25 Term Scorer - Inverse Document Frequency
IDF Score
has very similar
behavior
BM25 Term Scorer - Term Frequency
TF Score
approaches
asymptotically (k+1)
k=1.2 in this
example
BM25 Term Scorer - Document Length
Document Length /
Avg Document
Length
affects how fast we
saturate TF score
Responsibility : retrieve from the document a queue of weighted interesting
terms Params Used
● Analyzer
● Max Num Token Parsed
● Min Term Frequency
● Min/Max Document Frequency
● Max Query Terms
● Query Time Field Boost
Interesting Term Retriever
● Analyze content / Term Vector
● Skip Tokens
● Score Tokens
● Build Queue of Top Scored terms
Params Used
● Term Boost Enabled
More Like This Query Builder
Field1 :
Term1
Field2 :
Term2
Field1 :
Term3
Field1 :
Term4
Field3 :
Term5
3.0 4.0 4.5 4.8 7.5
Q = Field1:Term1^3.0 Field2:Term2^4.0
Field1:Term3^4.5 Field1:Term4^4.8
Field3:Term5^7.5
Term Boost
● on/off
● Affect each term weight in the
MLT query
● It is the term score
( it depends of the Term Scorer
implementation chosen)
More Like This Boost
Field Boost
● field1^5.0 field2^2.0 field3^1.5
● Affect Term Scorer
● Affect the interesting terms
retrieved
N.B. a highly boosted field can
dominate the interesting terms
retrieval
More Like This Usage - Lucene Classification
● Given a document D to classify
● K Nearest Neighbours Classifier
● Find Top K similar documents to D ( MLT)
● Classes are extracted
● Class Frequency + Class ranking -> Class probability
More Like This Usage - Apache Solr
● More Like This query parser
( can be concatenated with other queries)
● More Like This search component
( can be assigned to a Request Handler)
● More Like This handler
( handler with specific request parameters)
More Like This Demo - Movie Data Set
This data consists of the following fields:
● id - unique identifier for the movie
● name - Name of the movie
● directed_by - The person(s) who directed the making of the film
● initial_release_date - The earliest official initial film screening date in
any country
● genre - The genre(s) that the movie belongs to
More Like This Demo - Tuned
● Enable/Disable Term Boost
● Min Term Frequency
● Min Document Frequency
● Field Boost
● Ad Hoc fields ( ngram analysis)
Future Work
● Query Builder just use Terms and Term Score
● Term Positions ?
● Phrase Queries Boost
(for terms close in position)
● Sentence boundaries
● Field centric vs Document centric
( should high boosted fields kick out
relevant terms from low boosted fields)
Future Work - More Like These
● Multiple documents in input
● Interesting terms across
documents
● Useful for Content Based
recommender engines
● LUCENE-7498 - Introducing BM25 Term Scorer
● LUCENE-7802 - Architectural Refactor
JIRA References
Questions ?
Arigato !
ありがとう !

Contenu connexe

Tendances

A Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking ProblemA Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking ProblemSease
 
Search Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source ApproachSearch Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source ApproachAlessandro Benedetti
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachAlessandro Benedetti
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Sease
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic rankingFELIX75
 
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationAlessandro Benedetti
 
How to Build your Training Set for a Learning To Rank Project
How to Build your Training Set for a Learning To Rank ProjectHow to Build your Training Set for a Learning To Rank Project
How to Build your Training Set for a Learning To Rank ProjectSease
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document ClassificationAlessandro Benedetti
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 
How to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackHow to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackSease
 
Entity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsEntity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsSease
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...
An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...
An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...Lucidworks
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsSease
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Lucidworks
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 

Tendances (16)

A Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking ProblemA Learning to Rank Project on a Daily Song Ranking Problem
A Learning to Rank Project on a Daily Song Ranking Problem
 
Search Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source ApproachSearch Quality Evaluation to Help Reproducibility : an Open Source Approach
Search Quality Evaluation to Help Reproducibility : an Open Source Approach
 
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source ApproachSearch Quality Evaluation to Help Reproducibility: An Open-source Approach
Search Quality Evaluation to Help Reproducibility: An Open-source Approach
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
probabilistic ranking
probabilistic rankingprobabilistic ranking
probabilistic ranking
 
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
 
How to Build your Training Set for a Learning To Rank Project
How to Build your Training Set for a Learning To Rank ProjectHow to Build your Training Set for a Learning To Rank Project
How to Build your Training Set for a Learning To Rank Project
 
Lucene And Solr Document Classification
Lucene And Solr Document ClassificationLucene And Solr Document Classification
Lucene And Solr Document Classification
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 
How to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackHow to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - Haystack
 
Entity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph EmbeddingsEntity Search on Virtual Documents Created with Graph Embeddings
Entity Search on Virtual Documents Created with Graph Embeddings
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...
An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...
An Introduction to NLP4L - Natural Language Processing Tool for Apache Lucene...
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text Collections
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 

Similaire à Advanced Document Similarity With Apache Lucene

The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesTrey Grainger
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From SolrRamzi Alqrainy
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010ivan provalov
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation EnginesTrey Grainger
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Lucidworks
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Lucidworks
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systemsTrey Grainger
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industryTommaso Teofili
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemTrey Grainger
 
Instant search - A hands-on tutorial
Instant search  - A hands-on tutorialInstant search  - A hands-on tutorial
Instant search - A hands-on tutorialGanesh Venkataraman
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia VoulibasiISSEL
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State Universitydhabalia
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxElasticsearch
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6DEEPAK KHETAWAT
 
Anghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better RecommendationsAnghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better RecommendationsRamzi Karam
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 

Similaire à Advanced Document Similarity With Apache Lucene (20)

The Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation EnginesThe Intent Algorithms of Search & Recommendation Engines
The Intent Algorithms of Search & Recommendation Engines
 
Retrieving Information From Solr
Retrieving Information From SolrRetrieving Information From Solr
Retrieving Information From Solr
 
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
Michigan Information Retrieval Enthusiasts Group Meetup - August 19, 2010
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolbox
 
Building Search & Recommendation Engines
Building Search & Recommendation EnginesBuilding Search & Recommendation Engines
Building Search & Recommendation Engines
 
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
Learning to Rank in Solr: Presented by Michael Nilsson & Diego Ceccarelli, Bl...
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
Never Stop Exploring - Pushing the Limits of Solr: Presented by Anirudha Jadh...
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Reflected intelligence evolving self-learning data systems
Reflected intelligence  evolving self-learning data systemsReflected intelligence  evolving self-learning data systems
Reflected intelligence evolving self-learning data systems
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Search engines in the industry
Search engines in the industrySearch engines in the industry
Search engines in the industry
 
The Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data EcosystemThe Apache Solr Smart Data Ecosystem
The Apache Solr Smart Data Ecosystem
 
Instant search - A hands-on tutorial
Instant search  - A hands-on tutorialInstant search  - A hands-on tutorial
Instant search - A hands-on tutorial
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
 
An introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolboxAn introduction to Elasticsearch's advanced relevance ranking toolbox
An introduction to Elasticsearch's advanced relevance ranking toolbox
 
Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6Basics of Solr and Solr Integration with AEM6
Basics of Solr and Solr Integration with AEM6
 
Anghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better RecommendationsAnghami: From Billions Of Streams To Better Recommendations
Anghami: From Billions Of Streams To Better Recommendations
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 

Dernier

Crystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxCrystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxachiever3003
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfAsst.prof M.Gokilavani
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfDrew Moseley
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxRomil Mishra
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptbibisarnayak0
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionMebane Rash
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptMadan Karki
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMchpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMNanaAgyeman13
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solidnamansinghjarodiya
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfisabel213075
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Sumanth A
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectErbil Polytechnic University
 

Dernier (20)

Crystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptxCrystal Structure analysis and detailed information pptx
Crystal Structure analysis and detailed information pptx
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Immutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdfImmutable Image-Based Operating Systems - EW2024.pdf
Immutable Image-Based Operating Systems - EW2024.pdf
 
Mine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptxMine Environment II Lab_MI10448MI__________.pptx
Mine Environment II Lab_MI10448MI__________.pptx
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Autonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.pptAutonomous emergency braking system (aeb) ppt.ppt
Autonomous emergency braking system (aeb) ppt.ppt
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Indian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.pptIndian Dairy Industry Present Status and.ppt
Indian Dairy Industry Present Status and.ppt
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMMchpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
chpater16.pptxMMMMMMMMMMMMMMMMMMMMMMMMMMM
 
Engineering Drawing section of solid
Engineering Drawing     section of solidEngineering Drawing     section of solid
Engineering Drawing section of solid
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
List of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdfList of Accredited Concrete Batching Plant.pdf
List of Accredited Concrete Batching Plant.pdf
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
Robotics-Asimov's Laws, Mechanical Subsystems, Robot Kinematics, Robot Dynami...
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
Risk Management in Engineering Construction Project
Risk Management in Engineering Construction ProjectRisk Management in Engineering Construction Project
Risk Management in Engineering Construction Project
 

Advanced Document Similarity With Apache Lucene

  • 1. Advanced Document Similarity With Apache Lucene Alessandro Benedetti, Software Engineer, Sease Ltd.
  • 2. Alessandro Benedetti ● Search Consultant ● R&D Software Engineer ● Master in Computer Science ● Apache Lucene/Solr Enthusiast ● Semantic, NLP, Machine Learning Technologies passionate ● Beach Volleyball Player & Snowboarder Who I am
  • 3. Search Services ● Open Source Enthusiasts ● Apache Lucene/Solr experts ● Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank, Document Similarity, Measuring Search Quality, Relevancy Tuning Sease Ltd
  • 4. ● Document Similarity ● Apache Lucene More Like This ● Term Scorer ● BM25 ● Interesting Terms Retrieval ● Query Building ● DEMO ● Future Work ● JIRA References Agenda
  • 5. Real World Use Cases - Streaming Services
  • 6. Real World Use Cases - Hotels
  • 7. Document Similarity Problem : find similar documents to a seed one Solution(s) : ● Collaborative approach (users interactions) ● Content Based ● Hybrid Similar ? ● Documents accessed in association to the input one by users close to you ● Terms distributions ● All of above
  • 8. Apache Lucene Apache LuceneTM is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Apache Lucene is an open source project available for free download.
  • 9. ● Search Library (java) ● Structured Documents ● Inverted Index ● Similarity Metrics ( TF-IDF, BM25) ● Fast Search ● Support for advanced queries ● Relevancy tuning Apache Lucene
  • 11. Pros ● Apache Lucene Module ● Advanced Params ● Input : - structured document - just text ● Build an advanced query ● Leverage the Inverted Index ( and additional data structures) More Like This Cons ● Massive single class ● Low cohesion ● Low readability ● Minimum test coverage ● Difficult to extend ( and improve)
  • 12. Input Document More Like This Params Interesting Terms Retriever Term Scorer Query Builder QUERY More Like This - Break Up
  • 13. Responsibility : define a set of parameters (and defaults) that affect the various components of the More Like This module ● Regulate MLT behavior ● Groups parameters specific to each component ● Javadoc documentation ● Default values ● Useful container for various parameters to be passed More Like This Params
  • 14. ● Field Name ● Field Stats ( Document Count) ● Term Stats ( Document Frequency) ● Term Frequency ● TF-IDF -> tf * (log ( numDocs / docFreq + 1) + 1) ● BM25 Term Scorer Responsibility : assign a score to a term that measure how distinctive is the term for the document in input
  • 15. ● Origin from Probabilistic Information Retrieval ● Default Similarity from Lucene 6.0 [1] ● 25th iteration in improving TF-IDF ● TF ● IDF ● Document Length [1] LUCENE-6789 BM25 Term Scorer
  • 16. BM25 Term Scorer - Inverse Document Frequency IDF Score has very similar behavior
  • 17. BM25 Term Scorer - Term Frequency TF Score approaches asymptotically (k+1) k=1.2 in this example
  • 18. BM25 Term Scorer - Document Length Document Length / Avg Document Length affects how fast we saturate TF score
  • 19. Responsibility : retrieve from the document a queue of weighted interesting terms Params Used ● Analyzer ● Max Num Token Parsed ● Min Term Frequency ● Min/Max Document Frequency ● Max Query Terms ● Query Time Field Boost Interesting Term Retriever ● Analyze content / Term Vector ● Skip Tokens ● Score Tokens ● Build Queue of Top Scored terms
  • 20. Params Used ● Term Boost Enabled More Like This Query Builder Field1 : Term1 Field2 : Term2 Field1 : Term3 Field1 : Term4 Field3 : Term5 3.0 4.0 4.5 4.8 7.5 Q = Field1:Term1^3.0 Field2:Term2^4.0 Field1:Term3^4.5 Field1:Term4^4.8 Field3:Term5^7.5
  • 21. Term Boost ● on/off ● Affect each term weight in the MLT query ● It is the term score ( it depends of the Term Scorer implementation chosen) More Like This Boost Field Boost ● field1^5.0 field2^2.0 field3^1.5 ● Affect Term Scorer ● Affect the interesting terms retrieved N.B. a highly boosted field can dominate the interesting terms retrieval
  • 22. More Like This Usage - Lucene Classification ● Given a document D to classify ● K Nearest Neighbours Classifier ● Find Top K similar documents to D ( MLT) ● Classes are extracted ● Class Frequency + Class ranking -> Class probability
  • 23. More Like This Usage - Apache Solr ● More Like This query parser ( can be concatenated with other queries) ● More Like This search component ( can be assigned to a Request Handler) ● More Like This handler ( handler with specific request parameters)
  • 24. More Like This Demo - Movie Data Set This data consists of the following fields: ● id - unique identifier for the movie ● name - Name of the movie ● directed_by - The person(s) who directed the making of the film ● initial_release_date - The earliest official initial film screening date in any country ● genre - The genre(s) that the movie belongs to
  • 25. More Like This Demo - Tuned ● Enable/Disable Term Boost ● Min Term Frequency ● Min Document Frequency ● Field Boost ● Ad Hoc fields ( ngram analysis)
  • 26. Future Work ● Query Builder just use Terms and Term Score ● Term Positions ? ● Phrase Queries Boost (for terms close in position) ● Sentence boundaries ● Field centric vs Document centric ( should high boosted fields kick out relevant terms from low boosted fields)
  • 27. Future Work - More Like These ● Multiple documents in input ● Interesting terms across documents ● Useful for Content Based recommender engines
  • 28. ● LUCENE-7498 - Introducing BM25 Term Scorer ● LUCENE-7802 - Architectural Refactor JIRA References