In this talk we’ll cover the basics of search relevancy in Elasticsearch, from how relevancy is calculated and modeled to modifying query structure, setting up analyzer chains, and measuring incremental improvements. The talk highlights several real-world relevancy scenarios encountered in consulting work at KMW Technology, a leading provider of search professional services to major organizations.
2. Outline
• Intro to Relevance
• Crash Course: Scoring
• Relevance Tuning Case Study
• Testing Relevance
• Discussion
3. What is Relevance?
• A subjective measure of how useful a document is to the user who searched for something
• Does it satisfy the user’s information need?
• If I search for “cats”…
• Probably relevant: the movie “Cats,” the stage musical
“Cats,” cat pictures, cat blogs, cat food, Felis catus
• Vaguely relevant: dogs
• Not relevant: CAT scanners, catsup, cement mixers
5. What Is Relevance Tuning?
Adjusting the content of search results so that the most
relevant documents are included
Adjusting the order of search results so that the most
relevant results appear on top
7. Why Tune Relevance?
• FANTASY: “Once I get the data into my search engine, it
does all the work of finding the best matches for my
queries.”
• TRUTH: “We have to configure the search engine to rank
results in a way that is meaningful to the user.”
8. Search Engine Doesn't
Know…
• Which fields are important
• How users will search those fields
• Which query terms are the most significant
• Whether term order is significant
• Which terms mean the same thing
• What priorities the user has based on location, season, task, etc.
• What priorities the provider has re: sales, promotions, sponsorships, etc.
• Whether freshness, popularity, ratings are important
9. Relevance Problems
• Search for “Rocky” returns “Rocky Road To Dublin”
before the movie “Rocky”
• Search in MA for “coffee” returns “Coffee Day” (a chain in
India) before “George Howell Coffee”
• Search for a product by SKU returns permutations
• Search for “bikes” fails to find “bicycle”
• Search for “The The” (band) returns no results
10. Precision and Recall
• High Precision: "Everything I see is useful to me"
• High Recall: “Everything I might want is included”
• Relevance tuning is a tradeoff between precision and
recall
11. Precision And Recall
• Precision = Relevant Results / All Results
• “Only 5 out of 10 results returned were useful to me.
There was a lot of noise.”
• Recall = Relevant Results / All Relevant Documents
• “Only 5 out of 10 useful documents in the index were
returned. There were lots of things missing.”
12. Precision and Recall
When relevance you want to tune,
All iffy results you should prune
To achieve good precision,
Unless your decision’s
That recall is more of a boon.
13. How do we tune relevance?
• Enrich documents with metadata that's useful to search
• Search the right fields
• Configure field analyzers to match the way users search
• Set field weights (see the sketch after this list)
• Match phrases
• Handle typos
• Apply synonyms and stemming
• Reward exact/complete matches
• Reward freshness, popularity, ratings, etc.
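A minimal sketch of a few of these levers combined in one Elasticsearch query (the index name, field names, weights, and sample query are hypothetical, not from the case study): a multi_match over weighted fields with typo tolerance, plus a should clause that rewards the complete phrase.
GET /products/_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "query": "mountain bike",
          "fields": ["title^3", "description"],
          "fuzziness": "AUTO"
        }
      },
      "should": {
        "match_phrase": {
          "title": { "query": "mountain bike", "boost": 2 }
        }
      }
    }
  }
}
The phrase clause is optional for matching but adds to the score when the words appear together and in order.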
14. Scoring
• A search engine has to find relevant documents without
knowing what they mean
• A search engine assigns a numerical score to each match
using a "blind" but effective statistical heuristic
• Results are displayed in order by score
• To tune relevance we need to understand the search
engine’s built-in method of scoring
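To see exactly how the engine arrived at a score, you can ask for an explanation of each hit (shown here against the same toy index used in the later examples):
GET /test/_search
{
  "explain": true,
  "query": { "match": { "title": "dog" } }
}
The explanation breaks each score into its term-frequency, inverse-document-frequency, and field-length components.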
25. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog”
Doc 2: “cat cat cat cat cat cat cat”
Doc 3: “dog cat”
26. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog” → 0.8
Doc 2: “cat cat cat cat cat cat cat” → 0.8
Doc 3: “dog cat” → 1.2
Matching more query terms is good
Term Saturation
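Elasticsearch’s default similarity, BM25, saturates term frequency: each additional occurrence of a term contributes less to the score than the previous one. Roughly, the term-frequency part of the score is
tf × (k1 + 1) / (tf + k1 × (1 − b + b × fieldLength / avgFieldLength))
with defaults k1 = 1.2 and b = 0.75, so the contribution approaches a ceiling as tf grows instead of increasing linearly.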
32. Ties
“when two documents have the same score, they will be sorted by their
internal Lucene doc id (which is unrelated to the _id) by default”
“The internal doc_id can differ for the same document inside each
replica of the same shard so it's recommended to use another
tiebreaker for sort in order to get consistent results. For instance you
could do: sort: ["_score", "datetime"] to force top_hits to
rank documents based on score first and use datetime as a
tiebreaker.”
"sort": [
{ "_score": { "order": "desc" }},
{ "date": { "order": "desc" }}
]
33. Comparing Field Scores
• Raw scores across fields are not directly comparable
• Term frequencies, document frequencies, and average field length all
differ across fields
• Field analyzers can generate additional tokens that affect scoring
• A "good" match in one field might score in the range 0.1 to 0.2 while a
good match in another field might score in the range 1 to 2. There’s
no universal relevance scale.
• A multiplicative boost of 10 doesn't mean “field1 is 10 times more
important than field2”
• Boosts can compensate for scale discrepancies
34. TF x IDF
A search engine handles the chore
Of ranking each match, good or poor:
If a document’s TF
Divided by DF
Is huge, it will get the top score.
35. How does TFxIDF affect
query scoring?
• High score: A document with many occurrences of a rare
term
• Low score: A document with few occurrences of a common
term
• TFxIDF depends on the corpus
• A term stops being rare once more documents are added
that contain it
• Documents that don't match a query can still affect the
order of results
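For reference, the IDF part of Elasticsearch’s default BM25 scoring is
idf = ln(1 + (N − n + 0.5) / (n + 0.5))
where n is the number of documents containing the term and N is the total number of documents with the field: the same n and N that appear in the explain output later in this deck.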
46. Query 7: “dog”
Doc 1: “dog” → 0.13
Doc 2: “dog” → 0.13
Doc 3: “dog” → 0.13
We can do a Distributed Frequency Search
GET /test/_search?search_type=dfs_query_then_fetch
{ "query": { "match": { "title": "dog" } } }
47. Replicas And Scoring
• Replicas of the same shard may have different statistics
• Documents marked for deletion but not yet physically
removed (when their segments are merged) still
contribute to statistics
• Replicas may be out of sync re: physical deletion
• Specifying a user or session ID in the shard copy
preference parameter helps route requests to the
same replicas
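For example (the user ID value here is just a placeholder):
GET /test/_search?preference=user_12345
{ "query": { "match": { "title": "dog" } } }
All requests carrying the same preference string are routed to the same shard copies, so the same statistics are used each time.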
48. Updates and Scoring
• Updates to an existing document behave like adding a
completely new document as far as DF statistics are
concerned, until segments are merged:
• “n, number of documents containing term” increases
• “N, total number of documents with field” increases
49. Updates and Scoring
PUT test/_doc/1
{ "title": "dog cat" }
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }
PUT test/_doc/1?refresh
{ "title": "dog zebra" }
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
"explain": true }
POST test/_forcemerge
GET test/_search?format=yaml
{ "query" : { "match" : { "title": "dog" } },
“explain": true }
_score: 0.2876821
"n, number of documents containing term": 1
"N, total number of documents with field": 1
_score: 0.2876821
"n, number of documents containing term": 1
"N, total number of documents with field": 1
_score: 0.18232156
"n, number of documents containing term": 2
"N, total number of documents with field": 2
50. Query 4: “dog cat”
Doc 1: “dog dog dog dog dog dog dog” → 0.8
Doc 2: “cat cat cat cat cat cat cat” → 0.8
Doc 3: “dog cat” → 1.2
Matching more query terms is good.
But what also benefits Doc 3 here?
51. Query 4 redux: “dog cat”
Doc 1: “dog dog” → 0.6
Doc 2: “cat cat” → 0.6
Doc 3: “dog cat” → 0.9
Matching more query terms is good
53. Query 4 redux: “dog cat”
Doc 1: {“pet1”: “dog”, “pet2”: “dog”} → 0.87
Doc 2: {“pet1”: ”dog”, “pet2”: “cat”} → 0.87
Matching more query terms within the
same field is good. But there's no
advantage when the matches happen
across fields.
54. Query 4 redux: “dog cat”
Doc 1: {“pet1”:“dog”, “pet2”: “dog”} → 0.18
Doc 2: {“pet1”:”dog”, “pet2”: “cat”} → 0.87
We can simulate a single field using
cross_fields.
GET test/_search
{
"query": {
"multi_match": {
"query": "dog cat",
"fields": [“pet1”, "pet2"],
"type": "cross_fields"
}
}
}
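cross_fields blends the per-field term statistics at query time, scoring each term as if the listed fields were one combined field; fields analyzed the same way are grouped together for this purpose. Compare with most_fields, used in the next example, which scores each field separately and adds the results.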
55. Query 8: “orange dog”
Doc 1: {“type”: “dog”, “description”: “A sweet and loving pet that is always
eager to play. Brown coat. Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque nisl. Ut vitae
pellentesque magna, ut tristique nisi. Maecenas ut urna a elit posuere
scelerisque. Suspendisse vel urna turpis. Mauris viverra fermentum
ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus vulputate,
maximus ultricies augue scelerisque.”}
Doc 2: {“type”: “cat”, “description”: “Puzzlingly grumpy. Occasionally turns
orange.”}
GET test/_search
{
"query": {
"multi_match": {
"query": "orange dog",
"fields": ["type", “description"],
"type": "most_fields"
}
}
}
56. Query 8: “orange dog”
Doc 1: {“type”: “dog”, “description”: “A sweet and loving pet that is
always eager to play. Brown coat. Lorem ipsum dolor sit amet,
consectetur adipiscing elit. Duis non nibh sagittis, mollis ex a, scelerisque
nisl. Ut vitae pellentesque magna, ut tristique nisi. Maecenas ut urna a elit
posuere scelerisque. Suspendisse vel urna turpis. Mauris viverra
fermentum ullamcorper. Duis ac lacus nibh. Nulla auctor lacus in purus
vulputate, maximus ultricies augue scelerisque.”} → 1.06
Doc 2: {“type”: “cat”, “description”: “Puzzlingly grumpy. Occasionally
turns orange.”} → 0.69
“Shortness” is relative to the field's average length
59. Case Study: SKUs
I searched for a product by SKU—
I was looking to purchase a shoe—
But the website I used
Seemed very confused
And offered me nothing to view.
123AB-543D-234C
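One common way to make a hyphenated SKU like 123AB-543D-234C findable is a custom analyzer built around a word_delimiter_graph filter that indexes both the individual segments and the whole joined value. This is a sketch only; the index, field, and analyzer names are hypothetical, and it is not necessarily the fix used in this case study.
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "sku_parts": {
          "type": "word_delimiter_graph",
          "generate_word_parts": true,
          "generate_number_parts": true,
          "catenate_all": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "sku_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["sku_parts", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "sku": { "type": "text", "analyzer": "sku_analyzer" }
    }
  }
}
With this analysis, a query for “543D”, for the segments separated by spaces, or for the full SKU can all match the same document.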
79. Fuzziness and Scoring
PUT /test/_doc/1
{ "title": "dog" }
PUT /test/_doc/2
{ "title": "elephant" }
GET /test/_validate/query?rewrite=true
{
"query": {
"match" : {
"title": {
"query": "dog",
"fuzziness": 2
}}}}
GET /test/_search
{
"query": {
"fuzzy" : {
"title": {
"value": "dog",
"fuzziness": 2,
"rewrite": "constant_score"
}
}
}
}
Query      | Lucene Query                | Edits
dog        | title:dog                   |
dg         | (title:dog)^0.5             | D
do         | (title:dog)^0.5             | D
dgo        | (title:dog)^0.666666        | T
dox        | (title:dog)^0.666666        | S
dogg       | (title:dog)^0.666666        | I
doggg      | (title:dog)^0.333333        | I, I
elepha     | (title:elephant)^0.6666666  | D, D
elephan    | (title:elephant)^0.85714287 | D
elephantt  | (title:elephant)^0.875      | I
elephanttt | (title:elephant)^0.75       | I, I
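The boosts in the table follow a simple pattern: each fuzzy variant is rewritten to the real indexed term, boosted by roughly 1 − (edits ÷ length of the shorter of the two terms), so closer variants score higher. For example, “doggg” is two edits away from “dog”, giving 1 − 2/3 ≈ 0.333.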
86. Case Study: SKUs
I searched for a product by SKU—
I was looking to purchase a shoe.
Results were returned
And the price I soon learned.
“I’ll take one,” I said, “Make it two!”
123AB-543D-234C