Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon Hughes, Dice.com

OCTOBER 11-14, 2016 • BOSTON, MA

Evolving The Optimal Relevancy Scoring Model at Dice.com
Simon Hughes
Chief Data Scientist, Dice.com

•  Chief Data Scientist at Dice.com and DHI, under Yuri Bykov
•  Dice.com – leading US job board for IT professionals
•  Twitter handle: https://twitter.com/hughes_meister
Who Am I?
•  Dice Skills pages - http://www.dice.com/skills
•  New Dice Careers Mobile App
Key Projects
•  PhD candidate at DePaul University, studying NLP and machine learning
•  Thesis topic – Detecting causality in scientific explanatory essays
PhD

•  Look under https://github.com/DiceTechJobs
•  Set of Solr plugins https://github.com/DiceTechJobs/SolrPlugins
•  Tutorial for this talk: https://github.com/DiceTechJobs/RelevancyTuning
Open Source GitHub Repositories

1.  Approaches to Relevancy Tuning
2.  Automated Relevancy Tuning – using Reinforcement Learning
3.  Feedback Loops - Dangers of Closed Loop Learning Systems
Overview

•  Last year I talked about conceptual search and how that could be used to improve recall
•  This year I want to focus on techniques to improve precision
•  Novelty
Motivations for Talk

Finding the Optimal Search Engine Configuration
•  Most companies initially approach this with a very ad hoc, manual process:
•  Follow ‘best practices’ and make some initial educated guesses as to the best settings
•  Manually tune the parameters on a number of key user queries
•  The search engine parameters should be tuned to reflect how your users search
•  Relevancy is hard to define, but it is whatever your users consider an optimal search experience, so it should be
informed by their search behavior
Relevancy Tuning

What Solr Configuration Options Influence Relevancy?
Solr and Lucene provide many configuration options that impact search relevancy, including:
•  Which query parser – dismax, edismax, LuceneParser, etc
•  Field boosts – qf parameter
•  Phrase boosts – pf, pf2, pf3 parameters
•  Minimum should match - mm parameter
•  Similarity class – default similarity, BM25, TF-IDF, custom or one of many others
•  Boost queries – boost, bf, bq, etc
•  Edismax tie parameter – recommended value ≈ 0.1
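
As a rough illustration of how several of these options combine in a single request, the sketch below builds an edismax query string in Python. The field names (`title`, `skills`, `description`), the boost values and the `mm` setting are assumptions invented for the example; they are not Dice.com's configuration.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query):
    """Return Solr request parameters for an edismax query (illustrative values only)."""
    return {
        "q": user_query,
        "defType": "edismax",                  # which query parser to use
        "qf": "title^3 skills^2 description",  # field boosts (hypothetical fields)
        "pf": "title^5",                       # boost when the whole query matches as a phrase
        "pf2": "title^3 description^1.5",      # boost matches on adjacent word pairs
        "mm": "2<75%",                         # minimum should match
        "tie": "0.1",                          # tie breaker across fields
    }


params = build_edismax_params("java developer")
query_string = urlencode(params)  # ready to append to a /select request
print(query_string)
```

Keeping the parameters in one dictionary makes it straightforward to vary a single setting at a time and compare the results.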

Remove Noise Chars
•  Use the analysis chain to strip punctuation and normalize plurals in each field
•  e.g. ‘q=developer’ should match ‘developer,’, ‘developer.’, ‘developer’s’ and ‘developers’
When Using Stemming and Synonyms – Use Copy Fields + Edismax
•  Use copy fields to apply stemming and synonyms to existing fields
•  Allows different boosts to be applied to stemmed and synonym matches
•  Set lower field boosts on the stemmed and synonym copy fields
Some General Tips on Relevancy Tuning
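
A minimal sketch of the copy-field tip, assuming each base field has a stemmed copy and a synonym copy. The `_stemmed` and `_syn` suffixes and the scaling factors are hypothetical naming and tuning choices, not part of Solr or of the talk.

```python
def build_qf(base_boosts, stem_factor=0.5, syn_factor=0.3):
    """Build a qf string where stemmed and synonym copy fields get lower boosts.

    base_boosts: {field_name: boost} for the exact-match fields.
    The "_stemmed" and "_syn" suffixes are assumed copy-field names.
    """
    parts = []
    for field, boost in base_boosts.items():
        parts.append(f"{field}^{boost:g}")                        # exact match: full boost
        parts.append(f"{field}_stemmed^{boost * stem_factor:g}")  # stemmed match: reduced
        parts.append(f"{field}_syn^{boost * syn_factor:g}")       # synonym match: reduced further
    return " ".join(parts)


qf = build_qf({"title": 4.0, "description": 1.0})
print(qf)
```

With this layout an exact match on `title` always outscores a stemmed or synonym match on the same field, and the two scaling factors become parameters that can be tuned like any other.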

Use Boost Queries for Specific Query Use Cases
•  Edismax bq parameter – allows boosting of matches to nested queries
•  See chapter 7 of Relevant Search - good coverage of this strategy
Make Good Use of Phrase Query Boosts
•  Use pf, pf2 and pf3 parameters in edismax to give preference for multi-term matches
•  pf2 and pf3, which boost matches on word pairs and triples, often give better results than pf, which only boosts documents that match all of the query terms as a phrase
Caveat Emptor: Monitor impact of these changes on query performance (QTime) and index size
Some General Tips on Relevancy Tuning

•  To tune your search parameters, you can gather a dataset of relevancy judgements
•  For a set of important queries, the dataset contains the top results returned for each query, each annotated
with a relevancy judgement
•  This dataset can be collected using domain experts and a user interface designed for this task
•  Commercial Examples:
•  Quepid – developed by OpenSource Connections
•  Fusion UI Relevancy Workbench – part of the Fusion offering from Lucidworks
The ‘Golden’ Test Collection

•  An alternative to manually collecting relevancy judgements is to collect them directly from your users
•  For each user search on the site, capture:
•  User’s query, and timestamp
•  Any filters applied
•  Result impressions and clicks
•  You can then turn this into a test collection by assuming that the results that people click on are more relevant
than those they don’t
•  The time spent on a clicked result’s page is also a great indication of how relevant that result was to the original search
Search Log Capture
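
One simple way to turn captured impressions and clicks into a test collection is to use the click-through rate of each (query, result) pair as a graded judgement. The sketch below assumes a flat list of log events with `query`, `doc_id` and `clicked` keys; that event shape and the `min_impressions` threshold are illustrative assumptions, and a real log would also carry timestamps, filters and dwell time.

```python
from collections import defaultdict


def judgements_from_logs(log_events, min_impressions=2):
    """Aggregate impressions and clicks into a click-through rate per (query, doc)."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for event in log_events:
        key = (event["query"], event["doc_id"])
        impressions[key] += 1
        if event["clicked"]:
            clicks[key] += 1

    judgements = defaultdict(dict)
    for (query, doc_id), shown in impressions.items():
        if shown >= min_impressions:  # skip pairs with too little evidence
            judgements[query][doc_id] = clicks[(query, doc_id)] / shown
    return dict(judgements)


log_events = [
    {"query": "java developer", "doc_id": "job1", "clicked": True},
    {"query": "java developer", "doc_id": "job1", "clicked": False},
    {"query": "java developer", "doc_id": "job2", "clicked": False},
    {"query": "java developer", "doc_id": "job2", "clicked": False},
    {"query": "java developer", "doc_id": "job3", "clicked": True},
]
judgements = judgements_from_logs(log_events)
print(judgements)
```

Here `job3` is dropped because a single impression is too little evidence either way. Raw click-through rates also favor results shown in higher positions, so treat them as a noisy signal.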

•  Now you have a test collection, you can use that to tune your search engine configuration
•  Using the test collection, you can measure the relevancy of a set of searches on that collection using some IR metrics, such as:
•  MAP (Mean Average Precision)
•  Precision at K (precision computed over the top k documents retrieved)
•  NDCG (Normalized Discounted Cumulative Gain)
•  Regression testing – this allows you to build a set of regression tests to ensure configuration changes both improve relevancy
and don’t break certain queries
•  Manually tuning search configurations is still a time consuming and inefficient process
•  Is there a better way?
Relevancy Tuning with a Test Collection
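
For reference, the three metrics above can be computed in a few lines of Python. This is a sketch of the common textbook definitions (average precision divided by the number of relevant documents, NDCG with a log2 discount); other variants exist, and the function names are chosen for this example.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(results_by_query, relevant_by_query):
    """MAP: average precision averaged over all queries."""
    scores = [average_precision(results_by_query[q], relevant_by_query[q])
              for q in results_by_query]
    return sum(scores) / len(scores)


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG: discounted gain of this ranking divided by that of the ideal ranking."""
    dcg = sum(gains.get(doc_id, 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
p2 = precision_at_k(ranked, relevant, 2)
ap = average_precision(ranked, relevant)
ndcg = ndcg_at_k(ranked, {"a": 1, "c": 1}, 4)
print(p2, ap, ndcg)
```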

1.  Supervised Machine Learning?
•  No - cannot optimize your search configuration without a computable gradient
2.  Grid Search?
•  Perform a brute force search over the range of possible configuration parameters
•  Very slow and inefficient – is not able to learn which ranges of settings work best
3.  Black Box Optimization Algorithms?
•  Optimization algorithms exist that attempt to find the optimum value of an unknown function in as few iterations as
possible
•  Perform a much smarter search of the parameter space than grid search
Automated Relevancy Tuning Approaches

•  Use an optimization algorithm to optimize a ‘black box’ function
•  Black box function – provide the optimization algorithm with a function that takes a set of parameters as inputs
and computes a score
•  The black box algorithm will then try to choose parameter settings that optimize the score
•  This can be thought of as a form of reinforcement learning
•  These algorithms will intelligently search the space of possible search configurations to arrive at a solution
•  Example algorithms include Bayesian Optimization, Simulated Annealing, and Genetic Algorithms (hence talk
title)
Black Box Optimization Algorithms
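
To make the idea concrete, here is a deliberately tiny evolutionary search written in plain Python. It is a sketch of the general technique, not the algorithm used in the talk and not a substitute for a mature library; `toy_score` is a made-up stand-in for a relevancy metric, and the parameter names and bounds are assumptions.

```python
import random


def evolve(score_fn, bounds, population_size=20, generations=40,
           mutation_scale=0.1, seed=0):
    """Maximize score_fn over real-valued parameters with a simple evolutionary loop."""
    rng = random.Random(seed)
    names = list(bounds)

    def random_candidate():
        return {name: rng.uniform(*bounds[name]) for name in names}

    def mutate(candidate):
        child = {}
        for name in names:
            low, high = bounds[name]
            value = candidate[name] + rng.gauss(0, mutation_scale * (high - low))
            child[name] = min(high, max(low, value))  # clamp to the allowed range
        return child

    population = [random_candidate() for _ in range(population_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score_fn, reverse=True)
        survivors = ranked[: population_size // 4]  # keep the best quarter unchanged
        children = [mutate(rng.choice(survivors))
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    return max(population, key=score_fn)


def toy_score(params):
    # Stand-in for a relevancy metric; it peaks at title_boost=3.0 and tie=0.1.
    return -((params["title_boost"] - 3.0) ** 2) - ((params["tie"] - 0.1) ** 2)


bounds = {"title_boost": (0.0, 10.0), "tie": (0.0, 1.0)}
best = evolve(toy_score, bounds)
print(best, toy_score(best))
```

The optimizer only ever calls `score_fn`, so it needs no gradient and no knowledge of how the score is produced, which is what makes the black box framing fit search configuration.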

Example Black Box Function for Search Relevancy
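
The slide's original example is not preserved in this text, so the following is a hypothetical reconstruction of what such a function could look like: it takes a set of search parameters, runs every query in the test collection, and returns MAP@k. The `search_fn` callable stands in for a real Solr request and is faked here so that the sketch runs on its own.

```python
def make_black_box(search_fn, judgements, k=5):
    """Return a function that maps search parameters to a MAP@k score.

    search_fn(query, params) -> ranked list of doc ids (hypothetical; would call Solr)
    judgements: {query: set of relevant doc ids}
    """
    def black_box(params):
        total = 0.0
        for query, relevant in judgements.items():
            ranked = search_fn(query, params)[:k]
            hits, precision_sum = 0, 0.0
            for rank, doc_id in enumerate(ranked, start=1):
                if doc_id in relevant:
                    hits += 1
                    precision_sum += hits / rank
            total += precision_sum / min(len(relevant), k) if relevant else 0.0
        return total / len(judgements)

    return black_box


def fake_search(query, params):
    # Toy stand-in for Solr: a high enough title boost ranks the relevant job first.
    if params["title_boost"] >= 2.0:
        return ["job1", "job2", "job3"]
    return ["job3", "job2", "job1"]


judgements = {"java developer": {"job1"}}
black_box = make_black_box(fake_search, judgements)
good = black_box({"title_boost": 3.0})
bad = black_box({"title_boost": 1.0})
print(good, bad)
```

An optimization algorithm would call `black_box` repeatedly with different parameter values and keep the settings that score highest.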

•  There are some excellent, mature libraries for this sort of thing, e.g.
•  DEAP – Distributed Evolutionary Algorithms in Python (hence the talk title)
•  Scikit-Optimize – general optimization library built by a team at CERN headed by Tim Head
•  These libraries are very easy to use; however, getting them to optimize your search configuration is a little trickier
•  They tend to work better when optimizing a small set of parameters at a time – 1 to 4 works well
•  We achieved a 5% improvement in MAP@5 for our MLT (More Like This) configuration. A/B testing of the search changes is
planned before the end of the year
Making it Work

•  To optimize a large set of search parameters – start with the most important ones and optimize those while
keeping the rest fixed
•  If you are using search logs to optimize the search configuration, use a large number of searches (at least a few
thousand) to ensure you are performing a robust enough test
•  For most search collections of a reasonable size, running these optimizations over your search collection will
take time – set it up on a server, parallelize where possible and leave running overnight
•  Typically you will want to allow the algorithm to try a few hundred variations of each parameter set at least to
find a good range of settings
•  Ideally – first optimize your search configuration against a set of relevancy judgements acquired from domain
experts, then deploy to production and use the search logs to further tune against your users’ search behavior
Making it Work

•  As with any machine learning problem, it is essential to use one dataset to learn from, and a second separate dataset to
validate your results – prevents ‘overfitting’
•  Overfitting in this context means that the search parameters are so over-tuned to your initial dataset that the search engine
performs worse on new data than it did with the current configuration
•  Once you have a set of configuration parameters that you are happy with, evaluate it on a second set
of relevancy judgements to ensure the same performance gains are seen there too
•  This applies to both manual and automatic tuning of the search engine configuration. Humans can overfit a dataset just as
easily as an algorithm can
Use a Separate Testing Dataset to Validate Improvements
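
A minimal sketch of this workflow, assuming the test collection is keyed by query: split the queries once into a tuning set and a validation set, tune only on the first, and accept the new configuration only if it also wins on the second. The function names, the 30% validation share and the `score_fn(params, queries)` signature are assumptions for illustration.

```python
import random


def split_queries(queries, validation_fraction=0.3, seed=42):
    """Shuffle the queries reproducibly and split them into tuning and validation sets."""
    shuffled = sorted(queries)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def gain_holds(score_fn, current_params, tuned_params, validation_queries):
    """True if the tuned parameters beat the current ones on held-out queries.

    score_fn(params, queries) -> relevancy metric, e.g. MAP (hypothetical signature).
    """
    return (score_fn(tuned_params, validation_queries)
            > score_fn(current_params, validation_queries))


all_queries = [f"query {i}" for i in range(10)]
tuning_queries, validation_queries = split_queries(all_queries)
print(len(tuning_queries), len(validation_queries))
```

Fixing the seed keeps the split stable between runs, so the validation queries never leak into tuning by accident.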

•  Auto-tune other Solr parameters – phrase slop, mm settings, similarity class used
•  You can evolve a more optimal ranking function:
•  Either tweak the settings of the existing ranking functions (see
SweetSpotSimilarityFactory class)
•  Or use Genetic Programming to evolve a better ranking function for your dataset
•  Genetic Programming is an evolutionary algorithm that can evolve programs and equations
•  Some relevant papers, good introductory paper (but not very recent)
Some Other Things to Try

•  Building a Machine Learned Ranking (MLR) system is a premature optimization if you haven’t first optimized
your search configuration
•  Relevancy tuning and MLR both primarily optimize for precision over recall, due to the nature of the training
data
•  For techniques to improve recall, see conceptual semantic search:
•  Simon Hughes - “Conceptual Search” (Revolution 2015)
•  Trey Grainger - “Enhancing Relevancy Through Personalization and Semantic Search” (Revolution 2013)
•  Doug Turnbull and John Berryman - Chapter 11 of Relevant Search
Things to Consider

Feedback Loops – Dangers of Closed Loop Learning Systems

Building a Machine Learning System
[Diagram: users interact with the system and produce data; machine learning turns the data into a model]
1.  Users interact with the system to produce data
2.  Machine learning algorithms turn that data into a model
What happens if the model’s predictions influence the user’s behavior?

Positive Feedback Loop
[Diagram: users interact with the system and produce data; machine learning turns the data into a model; the model changes user behavior, closing the loop]
1.  Users interact with the system to produce data
2.  Machine learning algorithms turn that data into a model
3.  Model changes user behavior, modifying its own future training data

1.  Isolate a subset of data from being influenced by the model, and use this data to train the system
•  E.g. leave a small proportion of user searches un-ranked by the MLR model
•  E.g. generate a subset of recommendations at random, or by using an unsupervised model
2.  Use a reinforcement learning model instead (such as a multi-armed bandit) - the system will
dynamically adapt to the users’ behavior, balancing exploring different hypotheses with
exploiting what it’s learned to produce accurate predictions
Preventing Positive Feedback Loops
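
As an illustration of the second option, the sketch below implements an epsilon-greedy multi-armed bandit, one of the simplest bandit strategies. The arm names and the simulated click rates are invented for the example; in practice each arm would be a ranking configuration and the reward would come from real user clicks.

```python
import random


class EpsilonGreedyBandit:
    """Mostly exploit the best-looking arm, but explore a random arm with probability epsilon."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in self.arms}
        self.rewards = {arm: 0.0 for arm in self.arms}

    def mean_reward(self, arm):
        return self.rewards[arm] / self.pulls[arm] if self.pulls[arm] else 0.0

    def choose(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)         # explore
        return max(self.arms, key=self.mean_reward)   # exploit

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.rewards[arm] += reward


# Simulated users: config_b is truly better, but the bandit has to discover that.
true_click_rates = {"config_a": 0.05, "config_b": 0.30}
user_rng = random.Random(1)
bandit = EpsilonGreedyBandit(true_click_rates)
for _ in range(5000):
    arm = bandit.choose()
    clicked = user_rng.random() < true_click_rates[arm]
    bandit.update(arm, 1.0 if clicked else 0.0)
print(bandit.pulls)
```

Because a fraction of traffic is always served at random, the system keeps collecting data that its current best guess did not shape, which is what breaks the feedback loop.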

THE END
•  Thank you for listening
•  Any questions?
