Automatically Build Solr Synonyms List using Machine Learning - Chao Han, Lucidworks
1. Automatically Build Solr Synonym List
Using Machine Learning
Chao Han
VP, Head of Data Science, Lucidworks
2. Goal
• Automatically generate a Solr synonym list that includes synonyms, common
misspellings, and misplaced blank spaces, and choose the right Solr synonym
format (e.g., uni- or bi-directional).
• Examples:
• Synonym: bag, case; four, iv; mac, apple mac, mac book, macbook
• Acronym: playstation, ps
• Misspelling: accesory, accesoire, accessoire, accessorei => accessory
• Misplaced blank spaces: book end, bookend; whirl pool => whirlpool
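For reference, the two output formats in a Solr synonyms.txt look like this (a minimal sketch; the groups are taken from the examples above):

```text
# bi-directional: all terms expand to each other
bag, case
playstation, ps

# uni-directional: left-hand terms map only to the correction
accesory, accesoire, accessoire, accessorei => accessory
whirl pool => whirlpool
```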
3. Agenda
• Introduction
• Existing methods and challenges
• Walk through of our approach
• Evaluation and comparison
• Misspelling and phrase extraction
• Demo of synonym detection job in Fusion
• Future work
4. Existing Methods and Challenges
• Knowledge-base methods, such as utilizing WordNet, do not have
good coverage of a customer’s own ontology.
• Example results from WordNet on ecommerce data:
• Lack of usefulness:
• mankind, humanity; luck, chance; interference, noise
• Missing context-specific synonyms:
• galaxy, Samsung galaxy; noise, quiet; vac, vacuum
• Does not update frequently.
5. Existing Methods and Challenges
• Find synonyms from word2vec.
• Example results from word2vec on ecommerce data:
• Provides related words instead of interchangeable words:
• king, queen; red, blue; broom, floor
• Provides surrounding words:
• battery, rechargeable; unlocked, phone; power, supply
• Sensitive to hyperparameters; converges to local optima.
7. Proposed method : Step 1 – Find similar queries
• Utilize customer behavior data to focus on queries that lead to similar sets of clicked
documents, then further extract token/phrase-wise synonyms.
Query             | Doc Set | Num of Clicks
apple mac charger | 1       | 500
apple mac charger | 2       | 300
apple mac charger | 3       | 100
apple mac charger | 4       | 30
Mac power         | 1       | 200
Mac power         | 2       | 100
Mac power         | 3       | 50
Use the Jaccard index to measure query similarity:
𝐽(𝑞𝑢𝑒𝑟𝑦1, 𝑞𝑢𝑒𝑟𝑦2) = |𝐷𝑜𝑐𝑆𝑒𝑡1 ∩ 𝐷𝑜𝑐𝑆𝑒𝑡2| / |𝐷𝑜𝑐𝑆𝑒𝑡1 ∪ 𝐷𝑜𝑐𝑆𝑒𝑡2|
The doc sets are weighted by number of clicks to de-noise.
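The click-weighted similarity can be sketched in a few lines of Python. The exact weighting scheme is not specified in the slides; a weighted (min/max) Jaccard over click counts is one plausible reading, shown here as an assumption:

```python
def weighted_jaccard(clicks_a, clicks_b):
    """Jaccard similarity between two queries' clicked-document sets,
    weighted by click counts to down-weight rare accidental clicks.
    clicks_a / clicks_b: dict mapping doc id -> number of clicks."""
    docs = set(clicks_a) | set(clicks_b)
    inter = sum(min(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in docs)
    union = sum(max(clicks_a.get(d, 0), clicks_b.get(d, 0)) for d in docs)
    return inter / union if union else 0.0

# Click data from the table above (doc id -> clicks)
sim = weighted_jaccard({1: 500, 2: 300, 3: 100, 4: 30},  # apple mac charger
                       {1: 200, 2: 100, 3: 50})          # Mac power
```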
8. Proposed method : Step 2 – Query pre-processing
• Stemming, stop-word removal.
• Find misspellings separately and correct them in queries:
• If misspellings are left in, we get: mattress, matress, mattrass, mattresss
when it should instead be: matress, mattrass, mattresss => mattress
• Identify phrases in queries to find multi-word synonyms: mac, mac_book
9. Proposed method : Step 3 – Extract synonyms
• Extract synonyms (tokens/phrases) from similar queries by finding tokens/phrases
that appear before/after the same word:
• E.g. Similar queries: laptop charger, laptop power
Synonym: charger, power
Similar queries: playstation console, ps console
Synonym: playstation, ps
• Measure synonym similarity by occurrence in similar queries, adjusted by the counts
of the synonym in the corpus.
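The extraction step above can be sketched as a token alignment over a pair of similar queries. This is a minimal single-word-context illustration, not the production implementation:

```python
def extract_synonym_candidates(query_a, query_b):
    """Given two similar queries, return token pairs that appear
    before/after the same word (a minimal sketch of step 3)."""
    a, b = query_a.split(), query_b.split()
    candidates = set()
    for i in range(len(a) - 1):
        for j in range(len(b) - 1):
            # same preceding word: "laptop charger" / "laptop power"
            if a[i] == b[j] and a[i + 1] != b[j + 1]:
                candidates.add(tuple(sorted((a[i + 1], b[j + 1]))))
            # same following word: "playstation console" / "ps console"
            if a[i + 1] == b[j + 1] and a[i] != b[j]:
                candidates.add(tuple(sorted((a[i], b[j]))))
    return candidates

pairs = extract_synonym_candidates("laptop charger", "laptop power")
```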
10. Proposed method : Step 4 – De-noise
• Drop synonym pairs whose members appear in the same query.
• Use a graph model to find relationships among synonyms, putting multiple synonyms
into the same set and dropping non-synonyms.
• Example synonym group: mac, apple mac, mac book
[Graph figure: nodes such as "LCD tv", "LED tv", "tv" and "mac", "apple mac",
"mac book", "book" clustered into cliques]
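The speaker notes mention the Bron-Kerbosch clique algorithm for this de-noising step; a minimal sketch (without pivoting or the "loose clique" refinement also mentioned in the notes):

```python
def bron_kerbosch(r, p, x, adj, cliques):
    """Bron-Kerbosch: enumerate all maximal cliques of an undirected graph."""
    if not p and not x:
        cliques.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p.remove(v)
        x.add(v)

def synonym_groups(pairs):
    """Group pairwise synonym candidates into sets via maximal cliques;
    weakly connected words fall out of the well-supported groups."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return cliques

pairs = [("mac", "apple mac"), ("mac", "mac book"),
         ("apple mac", "mac book"), ("mac", "book")]
groups = synonym_groups(pairs)
# "book" is only connected to "mac", so it cannot join the
# {mac, apple mac, mac book} clique.
```

Filtering by clique size or edge density is then a tuning choice for balancing precision against recall.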
11. Proposed method : Step 5 – Categorize output
• A tree-based model is built on features generated from the above steps
to help choose between synonym vs. context:
• Example features: synonym similarity, number of contexts the synonym appears
in, token overlap, synonym counts, etc.
13. Evaluation and comparison with word2vec
• Run word2vec on the catalog and trim rare words that do not appear in queries
(with the same misspelling and phrase extraction steps applied).
14. Evaluation and comparison with word2vec
• Manually evaluated synonym pairs generated from the ecommerce dataset.
Method                      | Precision | Recall | F1
LW synonym job              | 83%       | 81%    | 82%
word2vec                    | 31%       | 28%    | 29%
word2vec with de-noise step | 45%       | 25%    | 32%
16. Spell Correction in Fusion 4.0:
• An offline job that finds misspellings and provides corrections based on the number of
occurrences of words/phrases. Compared to the Solr spell checker, the advantages of this job
are:
• If query clicks are captured after the Solr spell checker was turned on, then the misspellings
found from click data mainly identify erroneous corrections or missing corrections from Solr.
• It allows offline human review to make sure the changes are all correct. If the user has a dictionary
(e.g. a product catalog) to check the list against, the job will go through the result list to make
sure misspellings do not exist in the dictionary and corrections do exist in it.
17. Spell Correction in Fusion 4.0:
• High accuracy rate (96%). In addition to basic Solr spell checker settings:
• When there are multiple possible corrections, we rank corrections based on multiple criteria in
addition to edit distance.
• Rather than using a fixed maximum edit distance filter, we use an edit distance threshold relative to
the query length to provide more wiggle room for long queries.
• Since the job runs offline, it eases concerns about expensive spell-check tasks in Solr
spell check. E.g., it does not limit the maximum number of possible matches to review
(the maxInspections parameter in Solr).
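The length-relative edit-distance threshold (edit_distance <= query_length / length_scale, per the speaker notes) can be sketched as follows; the function names are illustrative, not Fusion's actual API:

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def keep_correction(misspelling, correction, length_scale=4):
    """Keep a pair only if the edit distance fits the query length:
    with length_scale=4, lengths 4-7 allow distance 1, 8-11 allow 2."""
    return edit_distance(misspelling, correction) <= len(misspelling) // length_scale
```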
18. Spell Correction in Fusion 4.0:
• Several fields are provided to facilitate the reviewing process:
• By default, results are sorted by "mis_string_len" (descending) and "edit_dist" (ascending) to position more
probable corrections at the top.
• Soundex or last-character match indicators.
19. Spell Correction in Fusion 4.0:
• Several additional fields are provided to disclose relationships among
token corrections and phrase corrections to help further reduce the list:
• The suggested_corrections field helps automatically choose between phrase-level and token-level
correction. If confidence in a correction is low, a "review" label is attached.
20. Spell Correction in Fusion 4.0:
• The resulting corrections can be used in various ways, for example:
• Put into the synonym list in Solr to perform auto-correction.
• Help evaluate and guide Solr spellcheck configuration.
• Put into a typeahead or autosuggest list.
• Perform document cleansing (e.g. clean a product catalog or medical records) by
mapping misspellings to corrections.
21. Phrase Extraction in Fusion:
• Without phrase handling, "income tax" is split into the separate tokens
"income" and "tax".
• A Spark job detects commonly co-occurring terms as phrases.
• Usage:
A. In the query pipeline, boost any phrase that appears,
e.g. for the query red ipad case, rewrite it to red "ipad case"~10^2
B. Treat phrases as a single token (ipad_case) and feed them into downstream
jobs such as clustering/classification/synonym detection.
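Usage A (phrase boosting in the query pipeline) can be sketched as a simple rewrite. The slop and boost values mirror the example above, and the naive substring match is for illustration only:

```python
def boost_phrases(query, known_phrases, slop=10, boost=2):
    """Quote and boost any known phrase found in the query,
    e.g. 'red ipad case' -> 'red "ipad case"~10^2'."""
    # Try longer phrases first so they are not shadowed by sub-phrases.
    for phrase in sorted(known_phrases, key=len, reverse=True):
        if phrase in query:
            query = query.replace(phrase, f'"{phrase}"~{slop}^{boost}')
    return query
```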
26. Tail rewriting at query time
User searched for "red case for macbook.pro"
After query rewriting: "macbook pro case"~10^2 color:red
27. Future work
• Utilize query rewrites in session logs.
• Explore deep learning embeddings and attention weights.
(source: Bahdanau et al. (2014): https://arxiv.org/pdf/1409.0473.pdf)
• Evaluate results on more types of data.
A synonym list plays an important part in search. However, it usually takes a long time for the search or ontology group in a company to detect and maintain synonyms. This talk is set within the context of an ecommerce search use case.
There are already experiments around automatically generating synonyms, and I will talk about two of the most popular methods here.
Word2vec is a shallow neural network that tries to predict target words from nearby words, or vice versa. We then take the dense vectors out, transferring from word space to vector space, and find nearest neighbors through cosine similarity.
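The nearest-neighbor lookup described here reduces to cosine similarity over dense vectors. The toy vectors below are made-up numbers purely to illustrate the mechanics:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, vectors):
    """Nearest neighbor of `word` among the other words, by cosine."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

# Made-up vectors: "king" and "queen" land close together, which is
# exactly the related-but-not-interchangeable failure mode discussed next.
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "broom": [0.10, 0.20, 0.90],
}
```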
Because the vectors live in a vast high-dimensional space, two vectors can be similar in any sense. E.g. red and blue are similar because they are both colors; broom and floor share a functional relationship. They are related, but they are not interchangeable. In a search application we usually require synonyms to be bi-directional and interchangeable, so this can lead to relevancy problems. E.g., if I want a king bed sheet, I may not want a queen bed sheet; red paint is not blue paint.
Due to the way the word2vec model is constructed (it tries to predict context from target words), it tends to find surrounding words.
Since word2vec is a neural network model that uses SGD, it can converge to a local optimum.
Overall, the failed examples you see in the word2vec results come from a lack of constraint, while the problem with WordNet is a mismatched semantic context between customer data and the general dictionary.
To tackle the above problems, we propose a five-step synonym detection algorithm.
Nowadays, websites can easily track and store user events such as queries, result clicks, and purchases. We can use this collective behavior to create clickstream or LTR models, and we can also use this data to help find synonyms.
The first step is to find similar queries, from which we can further extract synonyms. This way we are putting constraints on the problem through the input data.
Since we don't want to put all the stemmed and non-stemmed pairs into the synonym list, we just leave the stemming work to Solr.
This method looks naïve, without fancy modeling involved, but it turns out to work pretty well. I think this is because it is a straightforward way to replicate how people construct language. Also, we are not projecting the words into a different vector space as in word2vec, so we are getting the first-order similarity between words.
That said, all methods lead to noise due to the nature of click data.
Synonymy should be transitive.
We use a graph algorithm to find communities that have enough edges in the graph (the Bron-Kerbosch clique algorithm).
An example clique is frozenset({'ear', 'ear bud', 'earbud', 'earphone', 'headset'}); if we only required connected components, the result would be messy: audio, headphone, ear bud, ipod, headset, earbud, head, beat, heartbeat, ie, ibeat, tour, ear headphone, earphone, ear.
In order to keep good recall, I am also considering loose cliques, i.e., if two triangles share two edges, we can say they form one clique, which is looser than the strict clique definition.
A problem we face is that some of the synonyms we extract are too abstract and do not work outside a certain context. In this algorithm's output, we find the most frequently occurring words before/after the synonym pair; we call this the context pair.
In this case, the tree model predicts that we should include the word console in the synonym pair to make it clearer.
Many query misspellings may be due to the same tokens or phrases. So in Fusion 4 we have a new job, the token- and phrase-wise spell checker, which can help you find misspellings and suggest corrections.
The Solr spell checker is index-based and executes at query time.
Basic settings include min prefix match, max edit distance, min length of misspelling, count thresholds of misspellings and corrections, and collation check.
Specifically, we apply a filter such that only pairs with edit_distance <= query_length / length_scale are kept. E.g., if we choose length_scale=4, for queries with lengths between 4 and 7 the edit distance has to be 1 to be chosen, while for queries with lengths between 8 and 11 the edit distance can be 2.
The job is also able to find comprehensive lists of spelling errors resulting from misplaced whitespace (breakWords in Solr).
You can also sort by the ratio of correction traffic over misspelling traffic to keep only high-traffic boosting corrections.