1. Implementing Click-through
Relevance Ranking
in Solr and LucidWorks Enterprise
Andrzej Białecki
ab@lucidimagination.com
2. About the speaker
§ Started using Lucene in 2003 (1.2-dev…)
§ Created Luke – the Lucene Index Toolbox
§ Apache Nutch, Hadoop, Solr committer, Lucene
PMC member
§ Apache Nutch PMC Chair
§ LucidWorks Enterprise developer
5. Improving relevance of top-N hits
§ N < 10, first page counts the most
• N = 3, first three results count the most
§ Many techniques available in Solr / Lucene
• Indexing-time
§ text analysis, morphological analysis, synonyms, ...
• Query-time
§ boosting, rewriting, synonyms, DisMax, function queries …
• Editorial ranking (QueryElevationComponent)
§ No direct feedback from users on relevance ☹
§ What user actions do we know about?
• Search, navigation, click-through, other actions…
6. Query log and click-through events
Click-through: a user selects an item at a given position among the results returned for a query
§ Why this information may be useful
• “Indicates” user's interest in a selected result
• “Implies” that the result is relevant to the query
• “Significant” when low-ranking results selected
• “May be” considered as user's implicit feedback
§ Why this information may be useless
• Many strong assumptions about user’s intent
• “Average user’s behavior” could be a fiction
§ “Careful with that axe, Eugene”
7. Click-through in context
§ Query log, click positions, click intervals provide a
context
§ Source of spell-checking data
• Query reformulation until a click event occurs
§ Click events per user – total or during a session
• Building a user profile (e.g. topics of interest)
§ Negative click events
• User did NOT click the top 3 results → demote?
§ Clicks of all users for an item (or a query, or both)
• Item popularity or relevance to queries
§ Goal: analysis and modification of result ranking
8. Click to add title…
§ Clicking through == adding labels!
§ Collaborative filtering, recommendation system
§ Topic discovery & opinion mining
§ Tracking the topic / opinion drift over time
§ Click-stream is sparse and noisy – caveat emptor
• Changing intent – “hey, this reminds me of smth…”
• Hidden intent – remember the “miserable failure”?
• No intent at all – “just messing around”
9. What’s in the click-through data?
§ Query log, with unique id=f(user,query,time)!
• User id (or group)
• Query (+ facets, filters, origin, etc)
• Number of returned results
• Context (suggestions, autocomplete, “more like
this” terms …)
§ Click-through log
• Query id, document id, click position & click timestamp
§ What data would we like to get?
• Map of docId =>
§ Aggregated queries, aggregated users
§ Weight factor f(clickCount, positions, intervals)
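A minimal sketch of what one entry of that map could hold (class and field names are assumptions, and weight() is only a placeholder for f(clickCount, positions, intervals)):

import java.util.List;

// Per-document aggregate produced by the click-through analysis (illustrative).
class ClickData {
    List<String> topQueries;      // aggregated successful queries
    List<String> userIds;         // aggregated users (or user groups)
    int clickCount;
    double avgClickPosition;      // average rank position of the clicks
    double avgIntervalSecs;       // average interval between clicks ("reading time")

    // Placeholder for f(clickCount, positions, intervals); sub-linear in the count.
    double weight() {
        // clicks on lower-ranked results count more, plateauing after page 2
        double positionFactor = 1.0 + Math.min(avgClickPosition, 20.0) / 20.0;
        return Math.log1p(clickCount) * positionFactor;
    }
}
// The analysis stage would then hand over a Map<String, ClickData> keyed by docId.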
10. Other aggregations / reports
§ User profiles
• Document types / categories viewed most often
• Population profile for a document
• User’s sophistication, education level, locations,
interests, vices … (scary!)
§ Query re-formulations
• Spell-checking or “did you mean”
§ Corpus of the most useful queries
• Indicator for caching of results and documents
§ Zeitgeist – general user interest over time
11. Documents with click-through data
original document:
- documentWeight
- field1 : weight1
- field2 : weight2
- field3 : weight3

document with click-through data:
- documentWeight
- field1 : weight1
- field2 : weight2
- field3 : weight3
- labels : weight4
- users : weight5
§ Modified document and field weights
§ Added / modified fields
• Top-N labels aggregated from successful queries
• User “profile” aggregated from click-throughs
§ Changes over time – new clicks arrive
13. Undesired effects
§ Unbounded positive feedback
• Top-10 dominated by popular but irrelevant
results, self-reinforcing due to user expectations
about the Top-10 results
§ Everlasting effects of past click-storms
• Top-10 dominated by old documents once
extremely popular for no longer valid reasons
§ Off-topic (noisy) labels
§ Conclusions:
• f(click data) should be sub-linear
• f(click data, time) should discount older clicks
• f(click data) should be sanitized and bounded
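For illustration only (the constants and the exact decay form are assumptions; the half-life model itself is described on a later slide), a boost function obeying all three conclusions could look like:

class ClickBoost {
    static final double MAX_BOOST = 2.0;          // bound on how much clicks can help
    static final double HALF_LIFE_DAYS = 30.0;    // how fast old clicks fade

    // f(click data, time): sub-linear in the count, discounted by age, and capped.
    static double boost(long clickCount, double ageDays) {
        double dampened = Math.log1p(clickCount);
        double decayed = dampened * Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
        return Math.min(decayed, MAX_BOOST);
    }
}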
15. Click-through scoring in Solr
§ Not out of the box – you need:
• A component to log queries
• A component to record click-throughs
• A tool to correlate and aggregate the logs
• A tool to manage click-through history
§ …let’s (conveniently) assume the above is
handled by a user-facing app… and we got that
map of docId => click data
§ How to integrate this map into a Solr index?
16. Via ExternalFileField
§ Pros:
• Simple to implement
• Easy to update – no need to do full re-indexing
(just core reload)
§ Cons:
• Only docId => field : boost
• No user-generated labels attached to docs ☹
§ Still useful if a simple “popularity” metric is
sufficient
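As a rough sketch (reusing the slide's "popularity" example; the names and values are made up): the schema declares

  <fieldType name="file" class="solr.ExternalFileField" keyField="id" defVal="0" valType="float"/>
  <field name="popularity" type="file"/>

the boosts live in a plain-text file named external_popularity in the core's data directory, one docId=value line per document, e.g.

  doc1=3.5
  doc2=0.7

and queries pick the value up through a function query (e.g. _val_:"popularity" or a bf parameter); after replacing the file, a core reload / searcher reopen is enough to see the new values, no re-indexing needed.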
17. Via full re-index
§ If the corpus is small, or click data updates
infrequent… just re-index everything
§ Pros:
• Relatively easy to implement – join source docs
and click data by docId + reindex
• Allows adding all click data, including labels as
searchable text
§ Cons:
• Infeasible for larger corpora or frequent updates,
time-wise and cost-wise
18. Via incremental field updates
§ Oops! Under construction, come back later…
§ … much later …
• Some discussions on the mailing lists
• No implementation yet, design in flux
19. Via ParallelReader
[diagram: a separate "click data" index maintained in parallel with the main index; both contain the same documents in the same internal order, so each main-index document Dn with its fields (f1, f2, …) lines up with its click data (c1, c2, …)]
§ Pros:
• All click data (e.g. searchable labels) can be added
§ Cons:
• Complicated and fragile (rebuild on every update)
§ Though only the click index needs a rebuild
• No tools to manage this parallel index in Solr
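A bare-bones sketch of the trick with the Lucene 3.x-era API (directory paths are placeholders; wiring the combined reader into Solr, e.g. through a custom IndexReaderFactory, is the complicated and fragile part):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

class ParallelClickIndex {
    // Both indexes must list the same documents in the same internal order,
    // which is why the click index has to be rebuilt when the main index changes.
    static IndexSearcher open() throws Exception {
        ParallelReader reader = new ParallelReader();
        reader.add(IndexReader.open(FSDirectory.open(new File("main-index"))));
        reader.add(IndexReader.open(FSDirectory.open(new File("click-index"))));
        return new IndexSearcher(reader);
    }
}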
21. Click Scoring Framework
§ LucidWorks Enterprise feature
§ Click-through log collection & analysis
• Query logs and click-through logs (when using
Lucid's search UI)
• Analysis of click-through events
• Maintenance of historical click data
• Creation of a query phrase dictionary (-> autosuggest)
§ Modification of ranking based on click events:
• Modifies query rewriting & field boosts
• Adds top query phrases associated with a document
Example: http://getopt.org/ 0.13 luke:0.5,stempel:0.3,murmur:0.2
22. Aggregation of click events
§ Relative importance of clicks:
• Clicks on lower ranking documents more important
§ Plateau after the second page
• The more clicks the more important a document
§ Sub-linear to counter click-storms
• “Reading time” weighting factor
§ Intervals between clicks on the same result list
§ Association of query terms with target document
• Top-N successful queries considered
• Top-N frequent phrases (shingles) extracted from
queries, sanitized
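One way to write the click-weighting part down (purely illustrative; g, h and the logarithm are assumptions, not the actual LucidWorks formula):

  w(d) = \sum_{i \in clicks(d)} g(pos_i) \cdot h(\Delta t_i), \qquad boost(d) \propto \log(1 + w(d))

where g(pos) grows with the rank of the clicked result and plateaus after the second page, h(\Delta t) rewards a longer interval before the next click ("reading time"), and the logarithm keeps the total sub-linear to counter click-storms.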
23. Aggregation of click-through history
§ Needs to reflect document popularity over time
• Should react quickly to bursts (topics of the day)
• Has to avoid documents being “stuck” at the top
due to the past popularity
§ Solution: half-life decay model
• Adjustable period & rate
• Adjustable length of history (affects smoothing)
[graph: click weight decaying over time under the half-life model]
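In formula form (a sketch; T and the length of the retained history are the adjustable parameters mentioned above):

  W(d, t_{now}) = \sum_i c_i \cdot 2^{-(t_{now} - t_i)/T}

so each batch of c_i clicks observed at time t_i loses half of its contribution every T units of time; a longer history makes W smoother, a shorter one makes it react faster to bursts.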
24. Click scoring in practice
§ Query log and click log generated by the LucidWorks search UI
§ Logs and intermediate data files in plain text, well-documented formats and locations
§ Scheduled click-through analysis activity
§ Final click data – open formats
• Boost factor plus top phrases per document (plain text)
§ Click data is integrated with the main index
• No need to re-index the main corpus (ParallelReader trick)
• Where are the incremental field updates when you need them?!
§ Works also with Solr replication (rsync or Java)
25. Click Scoring – added fields
§ Fields added to the main index:
• click – a field with a constant value of 1, but with a boost relative to the aggregated click history
§ Indexed, with norms
• click_val – a “string” (not analyzed) field containing the numerical value of the boost
§ Stored, indexed, not analyzed
• click_terms – top-N terms and phrases from queries that caused click events on this document
§ Stored, indexed and analyzed
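Under one plausible reading of the example entry shown earlier (http://getopt.org/ 0.13 luke:0.5,stempel:0.3,murmur:0.2), that document would end up with roughly:

  click       = 1        (index-time boost ~0.13, carried in the norms)
  click_val   = "0.13"   (the same boost as a stored string, usable in function queries)
  click_terms = "luke stempel murmur"   (searchable top query phrases)

The exact representation is internal to LucidWorks; this only illustrates how the three fields relate.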
26. Click scoring – query modifications
§ Using click in queries (or DisMax’s bq)
• Constant term “1” with boost value
• Example: term1 OR click:1
§ Using click_val in function queries
• Floating point boost value as a string
• Example: term1 OR _val_:click_val
§ Using click_terms in queries (e.g. DisMax)
• Add click_terms to the list of query fields (qf)
in DisMax handler (default in /lucid)
• Matches on click_terms will be scored like matches on any other field
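Putting the pieces together in one illustrative DisMax request (field names and boost values are arbitrary; parameter values shown unescaped):

  q=lucene index toolbox
  defType=dismax
  qf=title^2 body click_terms^0.7
  bq=click:1^0.5

Here click_terms competes with the regular text fields via qf, while bq adds the popularity boost carried by the click field.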
27. Click Scoring – impact
§ Configuration options of the click analysis tools (sketched in code below):
• max normalization
§ The highest value of click boost will be 1, all other values are proportionally lower
§ Controlled max impact on any given result list
• total normalization
§ Total value of all boosts will be constant
§ Limits the total impact of click scoring on all lists of results
• raw – whatever value is in the click data
§ Controlled impact is the key to improving the top-N results
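A sketch of the three modes as they might be applied to the per-document boost map (the mode names follow the slide; the budget constant and everything else is illustrative):

import java.util.HashMap;
import java.util.Map;

class ClickNormalizer {
    static final double TOTAL_BUDGET = 100.0;   // arbitrary constant for "total" mode

    static Map<String, Double> normalize(Map<String, Double> boosts, String mode) {
        if (mode.equals("raw")) return boosts;              // leave values untouched
        double max = 0, sum = 0;
        for (double v : boosts.values()) { max = Math.max(max, v); sum += v; }
        double scale = mode.equals("max")
                ? (max > 0 ? 1.0 / max : 1.0)               // highest boost becomes 1
                : (sum > 0 ? TOTAL_BUDGET / sum : 1.0);     // boosts sum to a constant
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : boosts.entrySet())
            out.put(e.getKey(), e.getValue() * scale);
        return out;
    }
}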
29. Unsupervised feedback
§ LucidWorks Enterprise feature
§ Unsupervised – no need to train the system
§ Enhances quality of top-N results
§ Well-researched topic
• Several strategies for keyword extraction and for combining the extracted terms with the original query
§ Automatic feedback loop (sketched below):
• Submit the original query and take the top 5 docs
• Extract some keywords (“important” terms)
• Combine the original query with the extracted keywords
• Submit the modified query & return the results
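A schematic version of that loop in code (search() and extractKeywords() are hypothetical placeholders for the real Solr round-trips and term-extraction strategy; the AND/OR combination matches the two options on the next slide):

import java.util.List;

abstract class UnsupervisedFeedback {
    abstract List<String> search(String query, int rows);           // run a query against Solr
    abstract List<String> extractKeywords(List<String> topDocs);    // pick "important" terms

    List<String> searchWithFeedback(String originalQuery, boolean enhancePrecision) {
        List<String> top5 = search(originalQuery, 5);               // 1. original query, top 5 docs
        List<String> keywords = extractKeywords(top5);              // 2. extract keywords
        String joined = String.join(" OR ", keywords);
        String expanded = enhancePrecision                          // 3. combine with the original query
                ? originalQuery + " AND (" + joined + ")"
                : originalQuery + " OR " + joined;
        return search(expanded, 10);                                // 4. submit the modified query
    }
}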
30. Unsupervised feedback options
§ “Enhance precision” option (tighter fit)
• Extracted terms are AND-ed with the original query, e.g. dog AND (cat OR mouse)
• Filters out documents less similar to the original top-5
§ “Enhance recall” option (more documents)
• Extracted terms are OR-ed with the original query, e.g. dog OR cat OR mouse
• Adds more documents loosely similar to the original top-5
[diagram: the precision / recall trade-off between the two options]