Balancing diversity against relevance is a common challenge when building search applications. Most of our efforts focus on improving relevance, and even techniques like Learning to Rank can easily decrease the diversity of search results in favor of relevance. In the context of e-commerce, diversifying the result set is a very successful approach for dealing with broad user queries, managing the liquidity of the item inventory, and covering business rules.
This talk covers techniques I have worked with to get started on diversifying search results in the e-commerce discovery flow. Based on real use cases of search-based recommendations and autocomplete features, I will discuss how web-search diversity techniques can be applied in e-commerce and how information theory can be used to measure search diversity.
4. All steps are interconnected!
● Users have different intents
● What can break the dialogue with the user?
○ Broad queries (Autocomplete and Search)
○ Ambiguity (Query understanding)
○ Bad Interactions (Recommendations)
5. Diversifying search results
Strengthen the dialogue with the user
● Dealing with broad queries
○ Autocomplete
○ Search
● Item showcase for new or exploring users
● Gathering more interactions to improve recommendations
6. What will be covered and how?
● Broad queries problem in autocomplete
● Techniques to promote diversification
● Our use case:
○ Autocomplete at OLX Europe
7. What is Autocomplete?
A tool to talk directly to the user
● Guide users to good queries
● Help with query understanding
● Fast response/reaction
● Help tackle search relevance as early as possible
8. Autocomplete at OLX Europe
● Suggest popular searches with category filters
● Covers 7 different countries
● > 50 million requests per day
● Responsible for 40% of total searches
● Ranks suggestions by popularity and narrowness
○ but ...
9. Broad query problem ...
What is my intent?
● What if I don't know any Vespa model?
● What if I have a Vespa and want some accessory?
[Screenshot: suggestions for "vespa" ranked by popularity alone.]
10. Broad query effect ...
[Diagram: autocomplete suggestions for "Gucci" span many different topics across the category tree. Level 1 (L1): Fashion, Health and Beauty; Level 2 (L2): Bags and accessories, Footwear, Clothing, Watches and Jewelry, Notions, Perfumes, Medical care; Level 3 (L3): Other bags and accessories, Sunglasses, Watches, Jewelry, Wallets, Handbags, Woman, Man.]
11. Breaks in the dialogue with the user
● We jumped to premature conclusions
○ Show very specific popular suggestions (Vespa models)
● We could have asked more
○ Show more possibilities (like accessories)
● Maybe we will never have the chance to ask more
○ Popularity feedback loop ("rich get richer")
12. Diversifying autocomplete suggestions
Improve user experience on broad queries
● Minimize overspecialization of suggestions
● Give an overview of different available item categories
● Break popularity feedback loop
● Refine the query (user intents)
13. The goal
Diversifying autocomplete category suggestions for broad queries
Broad queries =
popular queries
AND contain categories with many search results
AND those categories are not yet suggested!
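This definition can be sketched as a simple predicate. The thresholds, field names, and counts below are hypothetical placeholders for illustration, not the values used in production:

```python
def is_broad_query(query_stats, suggested_categories,
                   min_popularity=1000, min_category_results=500):
    """Hypothetical check following the slide's definition of a broad query:
    popular AND has categories with many results that are not yet suggested."""
    if query_stats["searches"] < min_popularity:
        return False  # not popular enough
    uncovered = [
        cat for cat, n_results in query_stats["category_counts"].items()
        if n_results >= min_category_results and cat not in suggested_categories
    ]
    return len(uncovered) > 0

# Example: the query is popular, and "Clothing" has many results
# but is not covered by the current suggestions -> broad query
stats = {"searches": 50_000,
         "category_counts": {"Handbags": 900, "Clothing": 800, "Watches": 40}}
print(is_broad_query(stats, suggested_categories={"Handbags"}))  # True
```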
14. How to apply diversification?
Inspiration from Web Search and Information Retrieval
Explicit diversification
○ From query (information needs)
○ Increase Coverage
○ Broad queries
Based on Search result diversification: http://www.dcs.gla.ac.uk/~craigm/publications/santos2015ftir.pdf
15. How can we measure coverage?
Step 1: Clustering documents into topics
○ Facets, categories, colors, word embeddings, ...
[Histogram: per-topic document counts (e.g. 891, 36, 37, 903) normalized into a topic probability distribution.]
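Step 1 can be sketched in a few lines: take per-topic document counts (e.g. from a category facet) and normalize them into a probability distribution. The topic labels and counts here are illustrative:

```python
def topic_distribution(counts):
    """Normalize per-topic document counts into a probability distribution."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

# Illustrative document counts per topic, e.g. from a category facet
probs = topic_distribution({"t1": 891, "t2": 36, "t3": 37, "t4": 903})
print(round(probs["t1"], 3))  # 0.477
```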
16. How can we measure coverage?
Step 2: Measure the dispersion of the topic distribution
GINI Coefficient: https://opensourceconnections.com/blog/2019/09/05/diversity-vs-relevance
[Charts: two dispersion measures over the topic probability distribution: the Gini coefficient and Shannon entropy.]
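Both dispersion measures are easy to compute over the topic probabilities. A minimal sketch (the example distributions are assumptions for illustration): an even distribution has maximum entropy and zero Gini, a skewed one the opposite.

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum(p * log2(p)); higher means topics are more evenly spread."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini coefficient: 0 = perfectly even, approaching 1 = concentrated."""
    sorted_p = sorted(probs)
    n = len(sorted_p)
    # Standard rank-weighted formula over the sorted values
    cum = sum((i + 1) * p for i, p in enumerate(sorted_p))
    return (2 * cum) / (n * sum(sorted_p)) - (n + 1) / n

even = [0.25, 0.25, 0.25, 0.25]
skewed = [0.85, 0.05, 0.05, 0.05]
print(shannon_entropy(even), gini(even))      # 2.0 0.0
print(shannon_entropy(skewed), gini(skewed))  # lower entropy, higher Gini
```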
17. Shannon Entropy
Measures the level of information in a probability distribution
● A: High Knowledge, Low Surprise (entropy = 0)
● B: Medium Knowledge, Medium Surprise (entropy = 0.81)
● C: Low Knowledge, High Surprise (entropy = 1.5)
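The three entropy values above can be reproduced with plausible distributions (these exact splits are my assumption, chosen to match the slide's numbers): a single certain outcome, a 75/25 split, and a 50/25/25 split.

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum(p * log2(p)) over the distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed distributions matching the slide's values
print(round(shannon_entropy([1.0]), 2))              # 0.0  (A: no surprise)
print(round(shannon_entropy([0.75, 0.25]), 2))       # 0.81 (B)
print(round(shannon_entropy([0.5, 0.25, 0.25]), 2))  # 1.5  (C)
```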
18. Shannon Entropy for e-commerce
1. Cluster documents into categories (or any other criteria)
2. Compute each category's probability
[Examples: a diverse category distribution with entropy 2.38 vs a concentrated one with entropy 0.52.]
19. Entropy from another perspective
Extracted from: https://medium.com/udacity/shannon-entropy-information-gain-and-picking-balls-from-buckets-5810d35d54b4
On average, how many questions do we need to ask to find out what letter it is?
Entropy = 0
Bucket 1
Entropy = 1.75
Bucket 2
Entropy = 2.0
Bucket 3
Akinator: https://en.wikipedia.org/wiki/Akinator
20. Entropy from another perspective
Bucket 3 (2 questions on overage)
Bucket 2 (1.75 questions on average)
21. Coming back to the autocomplete
On average, how many questions can we ask to make sure we cover all user intents?
Each suggestion we give = a different question we ask
○ 0 questions for very specific queries (low entropy)
○ n questions for broad queries
■ How many is n ?
■ How can we define these questions ?
22. How many questions can we ask?
[Diagram: with 10 suggestion slots, each category's entropy marks a possible question; the total entropy corresponds to the number of different questions we can ask.]
23. Maximum diversity is 10 different suggestions!
● Each category has p(x) = 0.1 and contributes e(x) = 0.33 bits
How do we pick each suggestion?
[Diagram: narrow queries with too few results are filtered out of the candidate set.]
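The slide's numbers follow directly from the entropy formula: with 10 suggestion slots filled by 10 distinct categories, each category has p(x) = 0.1 and contributes about 0.33 bits, and the maximum achievable entropy is log2(10) ≈ 3.32.

```python
import math

n_slots = 10
p = 1 / n_slots                    # each suggested category equally likely
per_category = -p * math.log2(p)   # entropy contribution of one category
max_entropy = n_slots * per_category

print(round(per_category, 2))  # 0.33
print(round(max_entropy, 2))   # 3.32  (= log2(10))
```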
28. Experiment Scope
● 2 countries (C1 and C2)
● Expansions for less than 5% of suggested queries but covered:
○ 26% of total searches for C1
○ 17% of total searches for C2
● Compared the performance of both groups
○ broad queries: expanded vs not expanded
29. Experiment Results
Primary metrics:
● suggest_search_rate (Autocomplete usage: # suggested searches / # total searches): C1 +10.41%, C2 +0.72%
● pos_filter_rate (Search filters applied after picking expanded suggestions): C1 -3.14%, C2 -5.14%
● Diversification impacted user behaviour in autocomplete
● C1 users interacted more with autocomplete suggestions
● Did C2 users pick fewer suggestions, but better ones?
30. Experiment Results
Query metrics*:
● suggest_ctr (Uplift in ad clicks from the expanded query): C1 +3.64%, C2 -3.86%
● suggest_reply_rate (Uplift in ad replies from the expanded query): C1 +1.81%, C2 +0.26%
Suggestion metrics*:
● suggest_cat_ctr (Uplift in ad clicks from expanded suggestions, by category): C1 +2.24%, C2 +9.48%
● suggest_cat_reply_rate (Uplift in ad replies from expanded suggestions, by category): C1 +6.13%, C2 +13.01%
● Promising for C1 users in general
● In C2, we might have replaced relevant suggestions
● In both countries, new suggested categories look relevant
31. Considerations and Future
● Early stage: first and simple iteration
● Extend experiment
○ Affect more queries and add more countries
● Impact short vs long term
○ Consider rank (top n results)
○ Explore more clustering dimensions
○ Define entropy and popularity thresholds (prior and observed)