Automatically Build Solr Synonym List
Using Machine Learning
Chao Han
VP, Head of Data Science, Lucidworks
Goal
• Automatically generate a Solr synonym list that includes synonyms, common
misspellings, and misplaced blank spaces. Choose the right Solr synonym format
(e.g., one-directional or bi-directional).
• Examples:
• Synonym: bag, case; four, iv; mac, apple mac, mac book, macbook
• Acronym: playstation, ps
• Misspelling: accesory, accesoire, accessoire, accessorei => accessory
• Misplaced blank spaces: book end, bookend; whirl pool => whirlpool
Agenda
• Introduction
• Existing methods and challenges
• Walk through of our approach
• Evaluation and comparison
• Misspelling and phrase extraction
• Demo of synonym detection job in Fusion
• Future works
Existing Methods and Challenges
• Knowledge-base methods, such as WordNet, do not cover a customer's own
ontology well.
• Example results from WordNet on an ecommerce dataset:
• Lack of usefulness:
• mankind, humanity; luck, chance; interference, noise
• Missing context-specific synonyms:
• galaxy, Samsung galaxy; noise, quiet; vac, vacuum
• Not updated frequently.
Existing Methods and Challenges
• Find synonyms from word2vec
• Example results from word2vec on an ecommerce dataset:
• Provides related words instead of interchangeable words:
• king, queen; red, blue; broom, floor
• Provides surrounding words:
• battery, rechargeable; unlocked, phone; power, supply
• Sensitive to hyperparameters; local optimization
Proposed method : Step 1 – Find similar queries
• Use customer behavior data to focus on queries that lead to similar sets of clicked
documents, then extract token-level and phrase-level synonyms from them.
Query              Doc Set  Num of Clicks
apple mac charger  1        500
apple mac charger  2        300
apple mac charger  3        100
apple mac charger  4        30
Mac power          1        200
Mac power          2        100
Mac power          3        50
Use the Jaccard index to measure query similarity:
\[ J(\mathit{query}_1, \mathit{query}_2) = \frac{|\mathit{DocSet}_1 \cap \mathit{DocSet}_2|}{|\mathit{DocSet}_1 \cup \mathit{DocSet}_2|} \]
Doc Set is weighted by number of clicks to de-noise.
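A minimal sketch of the similarity step, using the click table above. The slide does not say how the click weighting is done; the weighted variant here (sum of per-document minimum clicks over sum of maximum clicks) is one common choice and is an assumption, as are the function names.

```python
def jaccard(docs_a, docs_b):
    """Plain Jaccard index on the sets of clicked document IDs."""
    a, b = set(docs_a), set(docs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_jaccard(clicks_a, clicks_b):
    """Click-weighted variant (assumed): sum of minima over sum of maxima."""
    docs = set(clicks_a) | set(clicks_b)
    num = sum(min(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in docs)
    den = sum(max(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in docs)
    return num / den if den else 0.0


# Doc ID -> number of clicks, taken from the table above.
apple_mac_charger = {1: 500, 2: 300, 3: 100, 4: 30}
mac_power = {1: 200, 2: 100, 3: 50}

print(jaccard(apple_mac_charger, mac_power))                     # 0.75
print(round(weighted_jaccard(apple_mac_charger, mac_power), 3))  # 0.376
```

With the weighting, a document that received only a handful of clicks contributes little to the score, which is the de-noising effect the slide describes.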
Proposed method : Step 2 – Query pre-processing
• Stemming and stop-word removal.
• Find misspellings separately and correct them in the queries:
• If misspellings are left in, the output is: mattress, matress, mattrass, mattresss
whereas it should be: matress, mattrass, mattresss => mattress
• Identify phrases in queries to find multi-word synonyms: mac, mac_book
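The pre-processing can be sketched roughly as follows. The stop-word list, the correction map, and the phrase list are all hypothetical stand-ins for what the misspelling and phrase jobs would produce, and stemming is left out to keep the sketch dependency-free.

```python
# All three lookup tables are hypothetical examples, not output of the real jobs.
STOP_WORDS = {"for", "the", "a", "of"}
CORRECTIONS = {"matress": "mattress", "mattrass": "mattress", "mattresss": "mattress"}
PHRASES = {("mac", "book"): "mac_book"}


def preprocess(query):
    """Lowercase, correct misspellings, drop stop words, join known phrases."""
    tokens = [CORRECTIONS.get(t, t) for t in query.lower().split()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            out.append(PHRASES[pair])  # treat the phrase as a single token
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(preprocess("Matress cover"))         # ['mattress', 'cover']
print(preprocess("charger for mac book"))  # ['charger', 'mac_book']
```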
Proposed method : Step 3 – Extract synonyms
• Extract synonyms (tokens/phrases) from similar queries by finding tokens/phrases that
appear before/after the same word:
• E.g. Similar queries: laptop charger, laptop power
Synonym: charger, power
Similar queries: playstation console, ps console
Synonym: playstation, ps
• Measure synonym similarity by the number of occurrences in similar queries, adjusted by the counts
of the synonym in the corpus.
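One way to read the extraction rule is: strip the words two similar queries share at the start and at the end, and treat what remains as a candidate pair. The sketch below implements only that reading; the function name is hypothetical, and the similarity scoring mentioned in the last bullet is not covered.

```python
def candidate_pair(query_a, query_b):
    """Return the differing parts of two queries that share context words, or None."""
    a, b = query_a.split(), query_b.split()
    shortest = min(len(a), len(b))

    # Length of the common prefix.
    p = 0
    while p < shortest and a[p] == b[p]:
        p += 1

    # Length of the common suffix, not overlapping the prefix.
    s = 0
    while s < shortest - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1

    if p == 0 and s == 0:
        return None  # no shared context word

    mid_a = " ".join(a[p:len(a) - s])
    mid_b = " ".join(b[p:len(b) - s])
    if not mid_a or not mid_b:
        return None  # one query only adds words to the other
    return (mid_a, mid_b)


print(candidate_pair("laptop charger", "laptop power"))        # ('charger', 'power')
print(candidate_pair("playstation console", "ps console"))     # ('playstation', 'ps')
print(candidate_pair("apple mac charger", "mac charger"))      # None
```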
Proposed method : Step 4 – De-noise
• Drop synonym pairs that occur in the same query.
• Use a graph model to find relationships among synonyms, to put multiple synonyms
into the same set, and to drop non-synonyms.
Synonym group: mac, apple mac, mac book
[Figure: graph with nodes "LCD tv", "tv", "LED tv" and nodes "mac book", "mac", "apple mac"]
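The slides do not spell out the graph model. As an illustration only, the sketch below builds a graph from candidate pairs, finds connected components, and keeps a component as a synonym group only when every pair in it is directly linked. That all-pairs rule is an assumption, and the edges in the example are made up.

```python
from collections import defaultdict
from itertools import combinations


def synonym_groups(pairs):
    """Group candidate pairs into synonym sets; drop loosely connected components."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search to collect the connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        # Assumed rule: keep the component only if all members are pairwise linked.
        if all(v in graph[u] for u, v in combinations(component, 2)):
            groups.append(sorted(component))
    return groups


# Hypothetical candidate pairs.
pairs = [
    ("mac", "apple mac"), ("mac", "mac book"), ("apple mac", "mac book"),
    ("lcd tv", "tv"), ("led tv", "tv"),
]
print(synonym_groups(pairs))  # [['apple mac', 'mac', 'mac book']]
```

Under this rule, "lcd tv" and "led tv" are both linked to "tv" but not to each other, so the whole component is dropped rather than merged into one set.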
Proposed method : Step 5 – Categorize output
• A tree-based model is built on features generated in the steps above
to help distinguish synonyms from context:
• Example features: synonym similarity, number of contexts in which the synonym appears,
token overlap, synonym counts, etc.
Evaluation and comparison with word2vec
• Run word2vec on the catalog and trim rare words that do not appear in queries (with
the same misspelling and phrase extraction steps).
• Manually evaluated synonym pairs generated from the ecommerce dataset.
Method                       Precision  Recall  F1
LW synonym job               83%        81%     82%
word2vec                     31%        28%     29%
word2vec with de-noise step  45%        25%     32%
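As a quick check, the F1 column follows from the precision and recall columns through the usual harmonic mean, F1 = 2PR / (P + R). The snippet below recomputes it from the reported numbers; the dictionary layout is just for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the table above.
reported = {
    "LW synonym job": (0.83, 0.81),
    "word2vec": (0.31, 0.28),
    "word2vec with de-noise step": (0.45, 0.25),
}

for method, (p, r) in reported.items():
    print(f"{method}: F1 = {f1(p, r):.0%}")
# LW synonym job: F1 = 82%
# word2vec: F1 = 29%
# word2vec with de-noise step: F1 = 32%
```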
Spell Correction in Fusion 4.0:
• An offline job that finds misspellings and provides corrections based on the number of
occurrences of words/phrases. Compared to the Solr spell checker, this job has the following
advantages:
• If query clicks are captured after the Solr spell checker was turned on, then the misspellings
found from click data mainly identify erroneous corrections or missing corrections from Solr.
• It allows offline human review to make sure the changes are all correct. If the user has a dictionary
(e.g. a product catalog) to check against the list, the job goes through the result list to make
sure that misspellings do not exist in the dictionary and corrections do exist in the dictionary.
Spell Correction in Fusion 4.0:
• High accuracy rate (96%). In addition to the basic Solr spell checker settings:
• When there are multiple possible corrections, we rank them based on multiple criteria in
addition to edit distance.
• Rather than using a fixed maximum edit distance filter, we use an edit distance threshold relative to
the query length, which gives long queries more room for errors.
• Because the job runs offline, it eases concerns about expensive spell check tasks in Solr
spell check. E.g., it does not limit the maximum number of possible matches to review
(maxInspections parameter in Solr).
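The length-relative threshold can be illustrated with a small sketch. The ratio of 0.25 and the floor of 1 are hypothetical values chosen for the example, not the job's actual settings, and the function names are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def within_threshold(misspelling, correction, ratio=0.25, floor=1):
    """Accept a correction if its edit distance is small relative to the query length."""
    limit = max(floor, int(len(misspelling) * ratio))
    return edit_distance(misspelling, correction) <= limit


print(within_threshold("accessorei", "accessory"))  # True: distance 2, limit 2
print(within_threshold("bag", "bed"))               # False: distance 2, limit 1
```

A fixed maximum of 2 would accept both pairs; the relative threshold rejects the short one, where two edits change most of the word.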
Spell Correction in Fusion 4.0:
• Several fields are provided to facilitate the reviewing process:
• By default, results are sorted by "mis_string_len" (descending) and "edit_dist" (ascending) to position more
probable corrections at the top.
• Soundex or last character match indicator.
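The default review ordering can be mimicked outside Fusion as follows. Only the field names "mis_string_len" and "edit_dist" come from the slide; the other keys, the rows, and the helper function are hypothetical.

```python
# Hypothetical review rows; only mis_string_len and edit_dist are real field names.
rows = [
    {"misspelling": "matress", "correction": "mattress", "mis_string_len": 7, "edit_dist": 1},
    {"misspelling": "accessorei", "correction": "accessory", "mis_string_len": 10, "edit_dist": 2},
    {"misspelling": "accesory", "correction": "accessory", "mis_string_len": 8, "edit_dist": 1},
]


def review_order(rows):
    """Sort by misspelling length (descending), then edit distance (ascending)."""
    ordered = sorted(rows, key=lambda r: (-r["mis_string_len"], r["edit_dist"]))
    for r in ordered:
        # Simple last-character match indicator (hypothetical key name).
        r["last_char_match"] = r["misspelling"][-1] == r["correction"][-1]
    return ordered


for r in review_order(rows):
    print(r["misspelling"], "=>", r["correction"], r["last_char_match"])
# accessorei => accessory False
# accesory => accessory True
# matress => mattress True
```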
Spell Correction in Fusion 4.0:
• Several additional fields disclose the relationships among
token corrections and phrase corrections to help further reduce the list:
• The suggested_corrections field helps choose automatically between phrase-level and token-level
correction. If confidence in the correction is low, a "review" label is attached.
Spell Correction in Fusion 4.0:
• The resulting corrections can be used in various ways, for example:
• Put into the Solr synonym list to perform auto-correction.
• Help evaluate and guide Solr spellcheck configuration.
• Put into typeahead or autosuggest list.
• Perform document cleansing (e.g. clean product catalog or medical records) by
mapping misspellings to corrections.
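For the first use, corrections and synonym groups have to be written in the Solr synonyms file format: comma-separated terms for an equivalent (bi-directional) set, and `=>` for an explicit (one-directional) mapping. A minimal sketch of producing such lines, with hypothetical helper names and the examples from the Goal slide:

```python
def explicit_mapping(misspellings, correction):
    """One-directional rule: every misspelling is replaced by the correction."""
    return f"{', '.join(misspellings)} => {correction}"


def equivalent_set(terms):
    """Bi-directional rule: all terms are treated as interchangeable."""
    return ", ".join(terms)


lines = [
    explicit_mapping(["accesory", "accesoire", "accessoire", "accessorei"], "accessory"),
    explicit_mapping(["whirl pool"], "whirlpool"),
    equivalent_set(["mac", "apple mac", "mac book", "macbook"]),
    equivalent_set(["playstation", "ps"]),
]
print("\n".join(lines))
# accesory, accesoire, accessoire, accessorei => accessory
# whirl pool => whirlpool
# mac, apple mac, mac book, macbook
# playstation, ps
```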
Phrase Extraction in Fusion:
• Income tax -> tax Income tax -> income
• A Spark job detects commonly co-occurring terms as phrases.
• Usage:
A. In the query pipeline, boost any phrase that appears,
e.g. rewrite the query red ipad case to red "ipad case"~10^2
B. Treat phrases as a single token (ipad_case) and feed them into downstream
jobs such as clustering/classification/synonym detection.
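Usage A can be sketched as a small rewrite function. The phrase set stands in for the output of the phrase extraction job, and the function name and the default slop and boost values (taken from the example above) are assumptions.

```python
def boost_phrases(query, phrases, slop=10, boost=2):
    """Wrap known phrases in quotes with a slop and boost; leave other tokens as-is."""
    tokens = query.split()
    longest = max((len(p.split()) for p in phrases), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase starting at position i first.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in phrases:
                out.append(f'"{candidate}"~{slop}^{boost}')
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(boost_phrases("red ipad case", {"ipad case"}))  # red "ipad case"~10^2
```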
Synonym review process in Fusion 4.2
Automatic tail query rewriting
Tail reason investigation
Tail rewriting at query time
User searched for "red case for macbook.pro"
After query rewriting: "macbook pro case"~10^2 color: red
Future works
• Utilize query rewrites in session logs.
• Explore deep learning embeddings and attention weights.
(source: Rush et al. (2014): https://arxiv.org/pdf/1409.0473.pdf)
• Evaluate results on more types of data.
Thank you!
Chao Han
VP, Head of Data Science, Lucidworks

Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Lucidworks

  • 1. Automatically Build Solr Synonym List Using Machine Learning Chao Han VP, Head of Data Science, Lucidworks
  • 2. Goal • Automatically generate Solr synonym list that includes synonyms, common misspellings and misplaced blank spaces. Choose the right Solr synonym format (e.g., one or bi-directional). • Examples: • Synonym: bag, case; four, iv; mac, apple mac, mac book, macbook • Acronym: playstation, ps • Misspelling: accesory, accesoire, accessoire, accessorei => accessory • Misplaced blank spaces: book end, bookend; whirl pool => whirlpool
  • 3. Agenda • Introduction • Existing methods and challenges • Walk through of our approach • Evaluation and comparison • Misspelling and phrase extraction • Demo of synonym detection job in Fusion • Future works
  • 4. Existing Methods and Challenges • Knowledge-base methods, such as utilizing WordNet, do not have good coverage of customer’s own ontology. • Example result from WordNet on an ecommerce data: •Lack of usefulness: • mankind, humanity; luck, chance; interference, noise •Missing context specific synonyms: • galaxy, Samsung galaxy; noise, quiet; vac, vacuum; •Do not update frequently.
  • 5. Existing Methods and Challenges • Find synonyms from word2vec • Example result from word2vec on an ecommerce data: • Provide related words instead of inter-changeable words: • king, queen; red, blue; broom, floor; • Provide surrounding words: • battery, rechargeable; unlocked, phone; power, supply; • Sensitive to hyper-parameters; local optimization;
  • 6. Agenda • Introduction • Existing methods and challenges • Walk through of our approach • Evaluation and comparison • Misspelling and phrase extraction • Demo of synonym detection job in Fusion • Future works
  • 7. Proposed method : Step 1 – Find similar queries • Utilize customer behavior data to focus on queries that lead to similar set of clicked documents, then further extract token/phrase wise synonyms. Query Doc Set Num of Clicks apple mac charger 1 500 apple mac charger 2 300 apple mac charger 3 100 apple mac charger 4 30 Mac power 1 200 Mac power 2 100 Mac power 3 50 Use Jaccard Index to measure query similarities: 𝐽 𝑞𝑢𝑒𝑟𝑦1, 𝑞𝑢𝑒𝑟𝑦2 = |𝐷𝑜𝑐𝑆𝑒𝑡1 ∩ 𝐷𝑜𝑐𝑆𝑒𝑡2| |𝐷𝑜𝑐𝑆𝑒𝑡2 ∪ 𝐷𝑜𝑐𝑆𝑒𝑡2| Doc Set is weighted by number of clicks to de-noise.
  • 8. Proposed method : Step 2 – Query pre-processing • Stemming, stop words removal • Find misspellings separately and correct misspellings in queries: • If leave misspellings in: mattress, matress, mattrass, mattresss which should be: matress, mattrass, mattresss => mattress • Identify phrases in queries to find multi-word synonyms: mac, mac_book
  • 9. Proposed method : Step 3 – Extract synonyms • Extract synonym (token/phrases) from queries by finding token/phrases which before/after the same word: • E.g. Similar query: laptop charger, laptop power Synonym: charger, power Similar query: playstation console, ps console Synonym: playstation, ps • Measure synonym similarity by occurrence in similar query adjusted by the counts of synonym in the corpus.
  • 10. Proposed method : Step 4 – De-noise • Drop the synonym pair that exist in the same query. • Use graph model to find relationships among synonyms to put multiple synonyms into the same set and to drop non-synonyms. Synonym group: mac, apple mac, mac book LCD tv tv LED tv mac book mac apple mac
  • 11. Proposed method : Step 5 – Categorize output • A tree based model is built based on features generated from the above steps to help choose from synonym vs context: • Example features: synonym similarity, number of context the synonym shown up, token overlapping, synonym counts etc.
  • 12. Agenda • Introduction • Existing methods and challenges • Walk through of our approach • Evaluation and comparison • Misspelling and phrase extraction • Demo of synonym detection job in Fusion • Future works
  • 13. Evaluation and comparison with word2vec • Run word2vec on catalog and trim the rare words that are not in queries. (with the same misspelling and phrase extraction steps)
  • 14. Evaluation and comparison with word2vec • Manually evaluated synonym pairs generated from the ecommerce dataset (method: precision / recall / F1): LW synonym job: 83% / 81% / 82%; word2vec: 31% / 28% / 29%; word2vec with de-noise step: 45% / 25% / 32%.
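
As a quick consistency check, the F1 column follows from the precision and recall columns via the usual harmonic mean, F1 = 2PR / (P + R). The snippet below only recomputes the figures reported on the slide; the dictionary keys are labels chosen for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall) as reported on the slide.
results = {
    "LW synonym job": (0.83, 0.81),
    "word2vec": (0.31, 0.28),
    "word2vec with de-noise step": (0.45, 0.25),
}

# F1 as a whole-number percentage, to compare against the slide.
f1_percent = {name: round(100 * f1(p, r)) for name, (p, r) in results.items()}
```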
  • 15. Agenda • Introduction • Existing methods and challenges • Walk through of our approach • Evaluation and comparison • Misspelling and phrase extraction • Demo of synonym detection job in Fusion • Future works
  • 16. Spell Correction in Fusion 4.0: • An offline job that finds misspellings and provides corrections based on the number of occurrences of words/phrases. Compared to the Solr spell checker, the advantages of this job are: • If query clicks were captured after the Solr spell checker was turned on, the misspellings found in the click data mainly identify erroneous corrections or missing corrections from Solr. • It allows offline human review to make sure the changes are all correct. If the user has a dictionary (e.g. a product catalog) to check the list against, the job goes through the result list to make sure that misspellings do not exist in the dictionary and that corrections do.
  • 17. Spell Correction in Fusion 4.0: • High accuracy rate (96%). In addition to the basic Solr spell checker settings: • When there are multiple possible corrections, we rank them on multiple criteria in addition to edit distance. • Rather than a fixed max edit distance filter, we use an edit distance threshold relative to the query length, which gives more wiggle room for long queries. • Since the job runs offline, it eases concerns about expensive spell check tasks in Solr spell check. E.g., it does not limit the maximum number of possible matches to review (the maxInspections parameter in Solr).
  • 18. Spell Correction in Fusion 4.0: • Several fields are provided to facilitate the review process: • By default, results are sorted by "mis_string_len" (descending) and "edit_dist" (ascending) to position more probable corrections at the top. • Soundex or last-character match indicator.
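
The default review ordering can be reproduced on exported results with a two-key sort. The field names `mis_string_len` and `edit_dist` come from the slide; the `misspelling` and `correction` keys and the rows themselves are made-up sample data.

```python
# Hypothetical rows as they might look after exporting the job output.
rows = [
    {"misspelling": "matress", "correction": "mattress", "mis_string_len": 7, "edit_dist": 1},
    {"misspelling": "accesoire", "correction": "accessory", "mis_string_len": 9, "edit_dist": 3},
    {"misspelling": "mattresss", "correction": "mattress", "mis_string_len": 9, "edit_dist": 1},
    {"misspelling": "accesory", "correction": "accessory", "mis_string_len": 8, "edit_dist": 1},
]

# Longest misspellings first; within the same length, smallest edit distance first.
review_order = sorted(rows, key=lambda r: (-r["mis_string_len"], r["edit_dist"]))
ordered_misspellings = [r["misspelling"] for r in review_order]
```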
  • 19. Spell Correction in Fusion 4.0: • Several additional fields are provided to disclose the relationships among token corrections and phrase corrections, to help further reduce the list: • The suggested_corrections field helps automatically choose between phrase-level and token-level correction. If confidence in the correction is low, a “review” label is attached.
  • 20. Spell Correction in Fusion 4.0: • The resulting corrections can be used in various ways, for example: • Put them into the synonym list in Solr to perform auto-correction. • Help evaluate and guide the Solr spellcheck configuration. • Put them into a typeahead or autosuggest list. • Perform document cleansing (e.g. clean a product catalog or medical records) by mapping misspellings to corrections.
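
For the first use case, reviewed corrections can be turned into one-directional Solr synonym rules of the form `misspelling, misspelling => correction`, the same format used in the talk's opening examples. The helper below is a sketch; `to_solr_synonym_lines` and the input dictionary layout are assumptions, not part of Fusion.

```python
from collections import defaultdict
from typing import Dict, List


def to_solr_synonym_lines(corrections: Dict[str, str]) -> List[str]:
    """Group misspellings by correction and emit one explicit-mapping line each."""
    grouped = defaultdict(list)
    for misspelling, correction in corrections.items():
        grouped[correction].append(misspelling)
    return [
        f"{', '.join(sorted(misspellings))} => {correction}"
        for correction, misspellings in sorted(grouped.items())
    ]


reviewed = {
    "matress": "mattress",
    "mattrass": "mattress",
    "mattresss": "mattress",
    "accesory": "accessory",
}
synonym_lines = to_solr_synonym_lines(reviewed)
```

The resulting lines can be appended to the synonyms file referenced by the Solr synonym filter after human review.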
  • 21. Phrase Extraction in Fusion: • A Spark job detects commonly co-occurring terms as phrases. (Figure: income tax -> tax; income tax -> income.) • Usage: A. In the query pipeline, boost any phrase that appears, e.g. rewrite the query red ipad case to red "ipad case"~10^2. B. Treat phrases as a single token (ipad_case) and feed them into downstream jobs such as clustering, classification, or synonym detection.
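
Both usages can be sketched with a small rewrite helper. It assumes a set of known two-word phrases from the phrase extraction job and whitespace tokenization; the function names are hypothetical, and the slop and boost defaults simply mirror the `~10^2` in the slide's example.

```python
from typing import List, Set


def _rewrite(query: str, phrases: Set[str], render) -> str:
    """Scan left to right and replace each known two-word phrase via `render`."""
    tokens = query.split()
    out: List[str] = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in phrases:
            out.append(render(pair))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


def boost_phrases(query: str, phrases: Set[str], slop: int = 10, boost: int = 2) -> str:
    """Usage A: quote the phrase and add proximity slop and a boost."""
    return _rewrite(query, phrases, lambda p: f'"{p}"~{slop}^{boost}')


def join_phrases(query: str, phrases: Set[str]) -> str:
    """Usage B: turn the phrase into a single underscore-joined token."""
    return _rewrite(query, phrases, lambda p: p.replace(" ", "_"))


known_phrases = {"ipad case"}
boosted = boost_phrases("red ipad case", known_phrases)
joined = join_phrases("red ipad case", known_phrases)
```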
  • 22. Agenda • Introduction • Existing methods and challenges • Walk through of our approach • Evaluation and comparison • Misspelling and phrase extraction • Demo of synonym detection job in Fusion • Future works
  • 23. Synonym review process in Fusion 4.2
  • 24. Automatic tail query rewriting
  • 26. Tail rewriting at query time • User searched for "red case for macbook.pro". • After query rewriting: "macbook pro case"~10^2 color: red
  • 27. Future works • Utilize query rewrites in session logs. • Explore deep learning embeddings and attention weights (source: Rush et al (2014): https://arxiv.org/pdf/1409.0473.pdf). • Evaluate results on more types of data.
  • 28. Thank you! Chao Han VP, Head of Data Science, Lucidworks

Editor's notes

  1. Synonym lists play an important part in search. However, it usually takes the search or ontology group in a company a long time to detect and maintain synonyms. This talk is set in the context of an ecommerce search use case.
  2. There are already experiments around automatically generating synonyms, and I will talk about two of the most popular methods here.
  3. Word2vec is a shallow neural network that tries to predict target words from nearby words, or vice versa. We then take the dense vectors out, which basically transfers from word space to vector space, and find nearest neighbors through cosine similarity. Because the vectors live in a vast high-dimensional space, two vectors can be similar in any sense. E.g. red and blue are similar because they are both colors; broom and floor share a functional relationship. They are related but they are not interchangeable. In a search application we usually require synonyms to be bi-directional and interchangeable, so related-but-not-interchangeable pairs can lead to relevancy problems. E.g. if I want a king bed sheet, I may not want a queen bed sheet. Red paint is not blue paint. Because of the way the word2vec model is constructed (it tries to predict context from target words), it tends to find surrounding words. And since word2vec is a neural network model trained with SGD, it can converge to a local optimum. Overall, the failed examples from the word2vec results shown here are due to a lack of constraints. The problem with WordNet is a mismatch in semantic context between the customer data and the general dictionary.
  4. To tackle the above problems, we propose a five-step synonym detection algorithm. Nowadays websites can easily track and store user events such as queries, result clicks and purchases. We can use this collective behavior to create clickstream or LTR models, and we can also use this data to help find synonyms. The first step is to find similar queries, from which we can then extract synonyms. This way we are putting constraints on the problem through the input data.
  5. We don’t want to put all the stemmed and non-stemmed pairs into the synonym list, so we just leave the stemming work to Solr.
  6. This looks like a naïve method without fancy modeling involved, but it turns out to work pretty well. I think that is because it is a straightforward way to replicate how people construct language. Also, here we are not projecting the words into a different vector space as in word2vec, so we are getting the first-order similarity between words.
  7. I have to say that all methods lead to noise due to the nature of click data. Synonyms should be transitive. We use a graph algorithm to find a community that has enough edges in the graph (the Bron-Kerbosch clique algorithm). An example clique is: frozenset({'ear', 'ear bud', 'earbud', 'earphone', 'headset'}). If we only required a connected component, the result would be messy: audio, headphone, ear bud, ipod, headset, earbud, head, beat, heartbeat, ie, ibeat, tour, ear headphone, earphone, ear. In order to keep good recall, I’m also considering loose cliques, i.e., if two triangles have 2 edges between each other, then we can say they are 1 clique, which is looser than the strict clique definition.
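
To make the clique idea concrete, here is a small pure-Python version of the basic Bron-Kerbosch recursion (without pivoting) applied to a toy synonym graph. The edges are invented for illustration, and the minimum clique size of 3 is an assumption; the looser clique variant mentioned in the note is not covered.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def _bron_kerbosch(r: Set[str], p: Set[str], x: Set[str],
                   adj: Dict[str, Set[str]], out: List[FrozenSet[str]]) -> None:
    """Basic Bron-Kerbosch: r = current clique, p = candidates, x = excluded."""
    if not p and not x:
        out.append(frozenset(r))  # r cannot be extended: it is a maximal clique
        return
    for v in list(p):
        _bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


def maximal_cliques(edges: Iterable[Tuple[str, str]]) -> List[FrozenSet[str]]:
    """Build an undirected adjacency map and list all maximal cliques."""
    adj: Dict[str, Set[str]] = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    out: List[FrozenSet[str]] = []
    _bron_kerbosch(set(), set(adj), set(), adj, out)
    return out


# Toy graph: a tight triangle of true synonyms plus one noisy edge to "ipod".
edges = [
    ("earbud", "ear bud"),
    ("earbud", "earphone"),
    ("ear bud", "earphone"),
    ("earphone", "ipod"),
]
synonym_groups = [c for c in maximal_cliques(edges) if len(c) >= 3]
```

A connected-component grouping would pull "ipod" into the set through its single edge, whereas the clique requirement keeps only the fully connected triangle.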
  8. A problem we face is that some of the synonyms we extract are too abstract and do not work outside a certain context. In this algorithm’s output, we find the most frequently occurring words before/after the synonym pair. We call this the context pair. In this case, the tree model predicts that we should include the word console in the synonym pair to make it clearer.
  9. Many misspelled queries may be due to the same tokens or phrases. So in Fusion 4 we have a new job, called the token and phrase spell checker, which can help you find misspellings and suggest corrections. By contrast, the Solr spell checker is index-based and executes at query time.
  10. The basic settings include min prefix match, max edit distance, min length of misspelling, count thresholds of misspellings and corrections, and collation check. Specifically, we apply a filter such that only pairs with edit_distance <= query_length/length_scale are kept. E.g., if we choose length_scale=4, then for queries with lengths between 4 and 7 the edit distance has to be 1 for the pair to be chosen, while for queries with lengths between 8 and 11 the edit distance can be 2. The job is also able to find comprehensive lists of spelling errors resulting from misplaced whitespace (breakWords in Solr).
  11. Results can also be sorted by the ratio of correction traffic to misspelling traffic, to keep only the corrections that bring a large traffic boost.
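
The length-relative filter from this note can be sketched with a standard Levenshtein distance. The rule `edit_distance <= query_length / length_scale` is taken from the note; treating `query_length` as the character length of the misspelled string, and the function names, are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def keep_pair(misspelling: str, correction: str, length_scale: int = 4) -> bool:
    """Keep the pair only if the edit distance is small relative to the length."""
    return edit_distance(misspelling, correction) <= len(misspelling) / length_scale


# Largest allowed edit distance for a few lengths with length_scale = 4.
max_allowed = {n: int(n / 4) for n in (4, 7, 8, 11)}
```

With `length_scale=4`, strings of length 4 to 7 allow one edit and strings of length 8 to 11 allow two, matching the note.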