In this talk we’ll cover the basics of search relevancy in Elasticsearch, from how relevancy is calculated and modeled to modifying query structure, setting up analyzer chains, and measuring incremental improvements. The talk highlights several real-world relevancy scenarios encountered in consulting work at KMW Technology, a leading provider of search professional services to major organizations.
2. Outline
• Intro to Relevance
• Crash Course: Scoring
• Relevance Tuning Case Study
• Testing Relevance
• Discussion
3. What is Relevance?
• A subjective measure of how useful a document is to the user who searched for something
• Does it satisfy the user’s information need?
• If I search for “cats”…
• Probably relevant: the movie “Cats,” the stage musical
“Cats,” cat pictures, cat blogs, cat food, Felis catus
• Vaguely relevant: dogs
• Not relevant: CAT scanners, catsup, cement mixers
5. What Is Relevance Tuning?
Adjusting the content of search results so that the most
relevant documents are included
Adjusting the order of search results so that the most
relevant results appear on top
7. Why Tune Relevance?
• FANTASY: “Once I get the data into my search engine, it
does all the work of finding the best matches for my
queries.”
• TRUTH: “We have to configure the search engine to rank
results in a way that is meaningful to the user.”
8. Search Engine Doesn't
Know…
• Which fields are important
• How users will search those fields
• Which query terms are the most significant
• Whether term order is significant
• Which terms mean the same thing
• What priorities the user has based on location, season, task, etc.
• What priorities the provider has re: sales, promotions, sponsorships, etc.
• Whether freshness, popularity, ratings are important
9. Relevance Problems
• Search for “Rocky” returns “Rocky Road To Dublin”
before the movie “Rocky”
• Search in MA for “coffee” returns “Coffee Day” (a chain in
India) before “George Howell Coffee”
• Search for a product by SKU returns permutations
• Search for “bikes” fails to find “bicycle”
• Search for “The The” (band) returns no results
10. Precision and Recall
• High Precision: "Everything I see is useful to me"
• High Recall: “Everything I might want is included”
• Relevance tuning is a tradeoff between precision and
recall
11. Precision And Recall
• Precision = Relevant Results / All Results
• “Only 5 out of 10 results returned were useful to me.
There was a lot of noise.”
• Recall = Relevant Results / All Relevant Documents
• “Only 5 out of 10 useful documents in the index were
returned. There were lots of things missing.”
12. Precision and Recall
When relevance you want to tune,
All iffy results you should prune
To achieve good precision,
Unless your decision’s
That recall is more of a boon.
13. How do we tune relevance?
• Enrich documents with metadata that's useful to search
• Search the right fields
• Configure field analyzers to match the way users search
• Set field weights (see the sketch after this list)
• Match phrases
• Handle typos
• Apply synonyms and stemming
• Reward exact/complete matches
• Reward freshness, popularity, ratings, etc.
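A minimal sketch of a few of these levers combined in one Elasticsearch query (the index name, field names, weights, and sample query are hypothetical, not from the case study): a multi_match over weighted fields with typo tolerance, plus a should clause that rewards the complete phrase.
GET /products/_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "mountain bike",
          "fields": ["title^3", "description"],
          "fuzziness": "AUTO"
        }
      },
      "should": {
        "match_phrase": {
          "title": { "query": "mountain bike", "boost": 2 }
        }
      }
    }
  }
}
The phrase clause is optional for matching but adds to the score when the words appear together and in order.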
14. Scoring
• A search engine has to find relevant documents without
knowing what they mean
• A search engine assigns a numerical score to each match
using a "blind" but effective statistical heuristic
• Results are displayed in order by score
• To tune relevance we need to understand the search
engine’s built-in method of scoring
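To see exactly how the engine arrived at a score, you can ask for an explanation of each hit (shown here against the same toy index used in the later examples):
GET /test/_search
{
  "explain": true,
  "query": { "match": { "title": "dog" } }
}
The explanation breaks each score into its term-frequency, inverse-document-frequency, and field-length components.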
25. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog”
Doc 2: “cat cat cat cat cat cat cat”
Doc 3: “dog cat”
26. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog” → 0.8
Doc 2: “cat cat cat cat cat cat cat” → 0.8
Doc 3: “dog cat” → 1.2
Matching more query terms is good
Term Saturation
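Elasticsearch’s default similarity, BM25, saturates term frequency: each additional occurrence of a term contributes less to the score than the previous one. Roughly, the term-frequency part of the score is
tf × (k1 + 1) / (tf + k1 × (1 − b + b × fieldLength / avgFieldLength))
with defaults k1 = 1.2 and b = 0.75, so the contribution approaches a ceiling as tf grows instead of increasing linearly.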
32. Ties
“when two documents have the same score, they will be sorted by their
internal Lucene doc id (which is unrelated to the _id) by default”
“The internal doc_id can differ for the same document inside each
replica of the same shard so it's recommended to use another
tiebreaker for sort in order to get consistent results. For instance you
could do: sort: ["_score", "datetime"] to force top_hits to
rank documents based on score first and use datetime as a
tiebreaker.”
"sort": [
{ "_score": { "order": "desc" }},
{ "date": { "order": "desc" }}
]
33. Comparing Field Scores
• Raw scores across fields are not directly comparable
• Term frequencies, document frequencies, and average field length all
differ across fields
• Field analyzers can generate additional tokens that affect scoring
• A "good" match in one field might score in the range 0.1 to 0.2 while a
good match in another field might score in the range 1 to 2. There’s
no universal relevance scale.
• A multiplicative boost of 10 doesn't mean “field1 is 10 times more
important than field2”
• Boosts can compensate for scale discrepancies
34. TF x IDF
A search engine handles the chore
Of ranking each match, good or poor:
If a document’s TF
Divided by DF
Is huge, it will get the top score.
35. How does TFxIDF affect
query scoring?
• High score: A document with many occurrences of a rare
term
• Low score: A document with few occurrences of a common
term
• TFxIDF depends on the corpus
• A term stops being rare once more documents are added
that contain it
• Documents that don't match a query can still affect the
order of results
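For reference, the IDF part of Elasticsearch’s default BM25 scoring is
idf = ln(1 + (N − n + 0.5) / (n + 0.5))
where n is the number of documents containing the term and N is the total number of documents with the field: the same n and N that appear in the explain output later in this deck.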
46. Query 7: “dog”
Doc 1: “dog” → 0.13
Doc 2: “dog” → 0.13
Doc 3: “dog” → 0.13
We can do a Distributed Frequency Search
GET /test/_search?search_type=dfs_query_then_fetch
{ "query": { "match": { "title": "dog" } } }
47. Replicas And Scoring
• Replicas of the same shard may have different statistics
• Documents marked for deletion but not yet physically
removed (when their segments are merged) still
contribute to statistics
• Replicas may be out of sync re: physical deletion
• Specifying a user or session ID in the shard copy
preference parameter helps route requests to the
same replicas
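For example (the user ID value here is just a placeholder):
GET /test/_search?preference=user_12345
{ "query": { "match": { "title": "dog" } } }
All requests carrying the same preference string are routed to the same shard copies, so the same statistics are used each time.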
48. Updates and Scoring
• Updates to an existing document behave like adding a
completely new document as far as DF statistics are
concerned, until segments are merged:
• “n, number of documents containing term” increases
• “N, total number of documents with field” increases
49. Updates and Scoring
PUT test/_doc/1
{ "title": "dog cat" }
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }
PUT test/_doc/1?refresh
{ "title": "dog zebra" }
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }
POST test/_forcemerge
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
“explain": true }
_score: 0.2876821
"n, number of documents containing term": 1
"N, total number of documents with field": 1
_score: 0.2876821
"n, number of documents containing term": 1
"N, total number of documents with field": 1
_score: 0.18232156
"n, number of documents containing term": 2
"N, total number of documents with field": 2
50. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog” → 0.8
Doc 2: “cat cat cat cat cat cat cat” → 0.8
Doc 3: “dog cat” → 1.2
Matching more query terms is good.
But what also benefits Doc 3 here?
51. Query 4 redux: “dog cat”
Doc 1: “dog dog” → 0.6
Doc 2: “cat cat” → 0.6
Doc 3: “dog cat” → 0.9
Matching more query terms is good
53. Query 4 redux: “dog cat”
Doc 1: {“pet1”: “dog”, “pet2”: “dog”} → 0.87
Doc 2: {“pet1”: ”dog”, “pet2”: “cat”} → 0.87
Matching more query terms within the
same field is good. But there's no
advantage when the matches happen
across fields.
54. Query 4 redux: “dog cat”
Doc 1: {“pet1”:“dog”, “pet2”: “dog”} → 0.18
Doc 2: {“pet1”:”dog”, “pet2”: “cat”} → 0.87
We can simulate a single field using
cross_fields.
GET test/_search
{
"query": {
"multi_match": {
"query": "dog cat",
"fields": [“pet1”, "pet2"],
"type": "cross_fields"
}
}
}
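cross_fields blends the per-field term statistics at query time, scoring each term as if the listed fields were one combined field; fields analyzed the same way are grouped together for this purpose. Compare with most_fields, used in the next example, which scores each field separately and adds the results.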
55. Query 8: “orange dog”
Doc 1: {“type”: “dog”, “description”: “A sweet and loving pet that is always
eager to play. Brown coat. Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque nisl. Ut vitae
pellentesque magna, ut tristique nisi. Maecenas ut urna a elit posuere
scelerisque. Suspendisse vel urna turpis. Mauris viverra fermentum
ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus vulputate,
maximus ultricies augue scelerisque.”}
Doc 2: {“type”: “cat”, “description”: “Puzzlingly grumpy. Occasionally turns
orange.”}
GET test/_search
{
"query": {
"multi_match": {
"query": "orange dog",
"fields": ["type", “description"],
"type": "most_fields"
}
}
}
56. Query 8: “orange dog”
Doc 1: {“type”: “dog”, “description”: “A sweet and loving pet that is
always eager to play. Brown coat. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque
nisl. Ut vitae pellentesque magna, ut tristique nisi. Maecenas ut urna a elit
posuere scelerisque. Suspendisse vel urna turpis. Mauris viverra
fermentum ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus
vulputate, maximus ultricies augue scelerisque.”} → 1.06
Doc 2: {“type”: “cat”, “description”: “Puzzlingly grumpy. Occasionally
turns orange.”} → 0.69
“Shortness” is relative to the field's average length
59. Case Study: SKUs
I searched for a product by SKU—
I was looking to purchase a shoe—
But the website I used
Seemed very confused
And offered me nothing to view.
123AB-543D-234C
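One common way to make a hyphenated SKU like 123AB-543D-234C findable is a custom analyzer built around a word_delimiter_graph filter that indexes both the individual segments and the whole joined value. This is a sketch only; the index, field, and analyzer names are hypothetical, and it is not necessarily the fix used in this case study.
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "sku_parts": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "generate_number_parts": true,
          "catenate_all": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "sku_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["sku_parts", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sku": { "type": "text", "analyzer": "sku_analyzer" }
    }
  }
}
With this analysis, a query for “543D”, for the segments separated by spaces, or for the full SKU can all match the same document.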
79. Fuzziness and Scoring
PUT /test/_doc/1
{ "title": "dog" }
PUT /test/_doc/2
{ "title": "elephant" }
GET /test/_validate/query?rewrite=true
{
"query": {
"match" : {
"title": {
"query": "dog",
"fuzziness": 2
}}}}
GET /test/_search
{
"query": {
"fuzzy" : {
"title": {
"value": "dog",
"fuzziness": 2,
"rewrite": "constant_score"
}
}
}
}
Query      | Lucene Query                | Edits
dog        | title:dog                   |
dg         | (title:dog)^0.5             | D
do         | (title:dog)^0.5             | D
dgo        | (title:dog)^0.666666        | T
dox        | (title:dog)^0.666666        | S
dogg       | (title:dog)^0.666666        | I
doggg      | (title:dog)^0.333333        | I, I
elepha     | (title:elephant)^0.6666666  | D, D
elephan    | (title:elephant)^0.85714287 | D
elephantt  | (title:elephant)^0.875      | I
elephanttt | (title:elephant)^0.75       | I, I
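The boosts in the table follow a simple pattern: each fuzzy variant is rewritten to the real indexed term, boosted by roughly 1 − (edits ÷ length of the shorter of the two terms), so closer variants score higher. For example, “doggg” is two edits away from “dog”, giving 1 − 2/3 ≈ 0.333.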
86. Case Study: SKUs
I searched for a product by SKU—
I was looking to purchase a shoe.
Results were returned
And the price I soon learned.
“I’ll take one,” I said, “Make it two!”
123AB-543D-234C