1. Implementing Click-through
Relevance Ranking
in Solr and LucidWorks Enterprise
Andrzej Białecki
ab@lucidimagination.com
2. About the speaker
§ Started using Lucene in 2003 (1.2-dev…)
§ Created Luke – the Lucene Index Toolbox
§ Apache Nutch, Hadoop, Solr committer, Lucene
PMC member
§ Apache Nutch PMC Chair
§ LucidWorks Enterprise developer
5. Improving relevance of top-N hits
§ N < 10, first page counts the most
• N = 3, first three results count the most
§ Many techniques available in Solr / Lucene
• Indexing-time
§ text analysis, morphological analysis, synonyms, ...
• Query-time
§ boosting, rewriting, synonyms, DisMax, function queries …
• Editorial ranking (QueryElevationComponent)
§ No direct feedback from users on relevance ☹
§ What user actions do we know about?
• Search, navigation, click-through, other actions…
6. Query log and click-through events
Click-through: a user selects an item at a given position among the results returned for a query
§ Why this information may be useful
• “Indicates” user's interest in a selected result
• “Implies” that the result is relevant to the query
• “Significant” when low-ranking results selected
• “May be” considered as user's implicit feedback
§ Why this information may be useless
• Many strong assumptions about user’s intent
• “Average user’s behavior” could be a fiction
§ “Careful with that axe, Eugene”
7. Click-through in context
§ Query log, click positions, click intervals provide a
context
§ Source of spell-checking data
• Query reformulation until a click event occurs
§ Click events per user – total or during a session
• Building a user profile (e.g. topics of interest)
§ Negative click events
• User did NOT click the top 3 results → demote?
§ Clicks of all users for an item (or a query, or both)
• Item popularity or relevance to queries
§ Goal: analysis and modification of result ranking
8. Click to add title…
§ Clicking through == adding labels!
§ Collaborative filtering, recommendation system
§ Topic discovery & opinion mining
§ Tracking the topic / opinion drift over time
§ Click-stream is sparse and noisy – caveat emptor
• Changing intent – “hey, this reminds me of smth…”
• Hidden intent – remember the “miserable failure”?
• No intent at all – “just messing around”
9. What’s in the click-through data?
§ Query log, with unique id=f(user,query,time)!
• User id (or group)
• Query (+ facets, filters, origin, etc)
• Number of returned results
• Context (suggestions, autocomplete, “more like
this” terms …)
§ Click-through log
• Query id, document id, click position & click timestamp
§ What data would we like to get?
• Map of docId =>
§ Aggregated queries, aggregated users
§ Weight factor f(clickCount, positions, intervals)
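A minimal sketch of what one entry of that map could hold (class and field names are assumptions, and weight() is only a placeholder for f(clickCount, positions, intervals)):

import java.util.List;

// Per-document aggregate produced by the click-through analysis (illustrative).
class ClickData {
    List<String> topQueries;      // aggregated successful queries
    List<String> userIds;         // aggregated users (or user groups)
    int clickCount;
    double avgClickPosition;      // average rank position of the clicks
    double avgIntervalSecs;       // average interval between clicks ("reading time")

    // Placeholder for f(clickCount, positions, intervals); sub-linear in the count.
    double weight() {
        // clicks on lower-ranked results count more, plateauing after page 2
        double positionFactor = 1.0 + Math.min(avgClickPosition, 20.0) / 20.0;
        return Math.log1p(clickCount) * positionFactor;
    }
}
// The analysis stage would then hand over a Map<String, ClickData> keyed by docId.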
10. Other aggregations / reports
§ User profiles
• Document types / categories viewed most often
• Population profile for a document
• User’s sophistication, education level, locations,
interests, vices … (scary!)
§ Query re-formulations
• Spell-checking or “did you mean”
§ Corpus of the most useful queries
• Indicator for caching of results and documents
§ Zeitgeist – general user interest over time
11. Documents with click-through data
original document:
- documentWeight
- field1 : weight1
- field2 : weight2
- field3 : weight3

document with click-through data:
- documentWeight
- field1 : weight1
- field2 : weight2
- field3 : weight3
- labels : weight4
- users : weight5
§ Modified document and field weights
§ Added / modified fields
• Top-N labels aggregated from successful queries
• User “profile” aggregated from click-throughs
§ Changes over time – new clicks arrive
13. Undesired effects
§ Unbounded positive feedback
• Top-10 dominated by popular but irrelevant
results, self-reinforcing due to user expectations
about the Top-10 results
§ Everlasting effects of past click-storms
• Top-10 dominated by old documents once
extremely popular for no longer valid reasons
§ Off-topic (noisy) labels
§ Conclusions:
• f(click data) should be sub-linear
• f(click data, time) should discount older clicks
• f(click data) should be sanitized and bounded
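For illustration only (the constants and the exact decay form are assumptions; the half-life model itself is described on a later slide), a boost function obeying all three conclusions could look like:

class ClickBoost {
    static final double MAX_BOOST = 2.0;          // bound on how much clicks can help
    static final double HALF_LIFE_DAYS = 30.0;    // how fast old clicks fade

    // f(click data, time): sub-linear in the count, discounted by age, and capped.
    static double boost(long clickCount, double ageDays) {
        double dampened = Math.log1p(clickCount);
        double decayed = dampened * Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
        return Math.min(decayed, MAX_BOOST);
    }
}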
15. Click-through scoring in Solr
§ Not out of the box – you need:
• A component to log queries
• A component to record click-throughs
• A tool to correlate and aggregate the logs
• A tool to manage click-through history
§ …let’s (conveniently) assume the above is
handled by a user-facing app… and we got that
map of docId => click data
§ How to integrate this map into a Solr index?
16. Via ExternalFileField
§ Pros:
• Simple to implement
• Easy to update – no need to do full re-indexing
(just core reload)
§ Cons:
• Only docId => field : boost
• No user-generated labels attached to docs ☹
§ Still useful if a simple “popularity” metric is
sufficient
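As a rough sketch (reusing the slide's "popularity" example; the names and values are made up): the schema declares

  <fieldType name="file" class="solr.ExternalFileField" keyField="id" defVal="0" valType="float"/>
  <field name="popularity" type="file"/>

the boosts live in a plain-text file named external_popularity in the core's data directory, one docId=value line per document, e.g.

  doc1=3.5
  doc2=0.7

and queries pick the value up through a function query (e.g. _val_:"popularity" or a bf parameter); after replacing the file, a core reload / searcher reopen is enough to see the new values, no re-indexing needed.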
17. Via full re-index
§ If the corpus is small, or click data updates
infrequent… just re-index everything
§ Pros:
• Relatively easy to implement – join source docs
and click data by docId + reindex
• Allows adding all click data, including labels as
searchable text
§ Cons:
• Infeasible for larger corpora or frequent updates,
time-wise and cost-wise
18. Via incremental field updates
§ Oops! Under construction, come back later…
§ … much later …
• Some discussions on the mailing lists
• No implementation yet, design in flux
19. Via ParallelReader
[diagram: a separate "click data" index maintained in parallel with the main index; both contain the same documents in the same internal order, so each main-index document Dn with its fields (f1, f2, …) lines up with its click data (c1, c2, …)]
§ Pros:
• All click data (e.g. searchable labels) can be added
§ Cons:
• Complicated and fragile (rebuild on every update)
§ Though only the click index needs a rebuild
• No tools to manage this parallel index in Solr
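A bare-bones sketch of the trick with the Lucene 3.x-era API (directory paths are placeholders; wiring the combined reader into Solr, e.g. through a custom IndexReaderFactory, is the complicated and fragile part):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.ParallelReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

class ParallelClickIndex {
    // Both indexes must list the same documents in the same internal order,
    // which is why the click index has to be rebuilt when the main index changes.
    static IndexSearcher open() throws Exception {
        ParallelReader reader = new ParallelReader();
        reader.add(IndexReader.open(FSDirectory.open(new File("main-index"))));
        reader.add(IndexReader.open(FSDirectory.open(new File("click-index"))));
        return new IndexSearcher(reader);
    }
}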
21. Click Scoring Framework
§ LucidWorks Enterprise feature
§ Click-through log collection & analysis
• Query logs and click-through logs (when using
Lucid's search UI)
• Analysis of click-through events
• Maintenance of historical click data
• Creation of a query phrase dictionary (-> autosuggest)
§ Modification of ranking based on click events:
• Modifies query rewriting & field boosts
• Adds top query phrases associated with a document
Example: http://getopt.org/ 0.13 luke:0.5,stempel:0.3,murmur:0.2
22. Aggregation of click events
§ Relative importance of clicks:
• Clicks on lower ranking documents more important
§ Plateau after the second page
• The more clicks the more important a document
§ Sub-linear to counter click-storms
• “Reading time” weighting factor
§ Intervals between clicks on the same result list
§ Association of query terms with target document
• Top-N successful queries considered
• Top-N frequent phrases (shingles) extracted from
queries, sanitized
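One way to write the click-weighting part down (purely illustrative; g, h and the logarithm are assumptions, not the actual LucidWorks formula):

  w(d) = \sum_{i \in clicks(d)} g(pos_i) \cdot h(\Delta t_i), \qquad boost(d) \propto \log(1 + w(d))

where g(pos) grows with the rank of the clicked result and plateaus after the second page, h(\Delta t) rewards a longer interval before the next click ("reading time"), and the logarithm keeps the total sub-linear to counter click-storms.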
23. Aggregation of click-through history
§ Needs to reflect document popularity over time
• Should react quickly to bursts (topics of the day)
• Has to avoid documents being “stuck” at the top
due to the past popularity
§ Solution: half-life decay model
• Adjustable period & rate
• Adjustable length of history (affects smoothing)
[graph: click weight decaying over time under the half-life model]
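In formula form (a sketch; T and the length of the retained history are the adjustable parameters mentioned above):

  W(d, t_{now}) = \sum_i c_i \cdot 2^{-(t_{now} - t_i)/T}

so each batch of c_i clicks observed at time t_i loses half of its contribution every T units of time; a longer history makes W smoother, a shorter one makes it react faster to bursts.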
24. Click scoring in practice
§ Query log and click log generated by the LucidWorks search UI
§ Logs and intermediate data files in plain text, well-documented formats and locations
§ Scheduled click-through analysis activity
§ Final click data – open formats
• Boost factor plus top phrases per document (plain text)
§ Click data is integrated with the main index
• No need to re-index the main corpus (ParallelReader trick)
• Where are the incremental field updates when you need them?!
§ Works also with Solr replication (rsync or Java)
25. Click Scoring – added fields
§ Fields added to the main index:
• click – a field with a constant value of 1, but with a boost relative to the aggregated click history
§ Indexed, with norms
• click_val – a “string” (not analyzed) field containing the numerical value of the boost
§ Stored, indexed, not analyzed
• click_terms – top-N terms and phrases from queries that caused click events on this document
§ Stored, indexed and analyzed
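Under one plausible reading of the example entry shown earlier (http://getopt.org/ 0.13 luke:0.5,stempel:0.3,murmur:0.2), that document would end up with roughly:

  click       = 1        (index-time boost ~0.13, carried in the norms)
  click_val   = "0.13"   (the same boost as a stored string, usable in function queries)
  click_terms = "luke stempel murmur"   (searchable top query phrases)

The exact representation is internal to LucidWorks; this only illustrates how the three fields relate.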
26. Click scoring – query modifications
§ Using click in queries (or DisMax’s bq)
• Constant term “1” with boost value
• Example: term1 OR click:1
§ Using click_val in function queries
• Floating point boost value as a string
• Example: term1 OR _val_:click_val
§ Using click_terms in queries (e.g. DisMax)
• Add click_terms to the list of query fields (qf)
in DisMax handler (default in /lucid)
• Matches on click_terms will be scored like matches on any other field
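Putting the pieces together in one illustrative DisMax request (field names and boost values are arbitrary; parameter values shown unescaped):

  q=lucene index toolbox
  defType=dismax
  qf=title^2 body click_terms^0.7
  bq=click:1^0.5

Here click_terms competes with the regular text fields via qf, while bq adds the popularity boost carried by the click field.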
27. Click Scoring – impact
§ Configuration options of the click analysis tools (sketched in code below):
• max normalization
§ The highest value of click boost will be 1, all other values are proportionally lower
§ Controlled max impact on any given result list
• total normalization
§ Total value of all boosts will be constant
§ Limits the total impact of click scoring on all lists of results
• raw – whatever value is in the click data
§ Controlled impact is the key to improving the top-N results
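A sketch of the three modes as they might be applied to the per-document boost map (the mode names follow the slide; the budget constant and everything else is illustrative):

import java.util.HashMap;
import java.util.Map;

class ClickNormalizer {
    static final double TOTAL_BUDGET = 100.0;   // arbitrary constant for "total" mode

    static Map<String, Double> normalize(Map<String, Double> boosts, String mode) {
        if (mode.equals("raw")) return boosts;              // leave values untouched
        double max = 0, sum = 0;
        for (double v : boosts.values()) { max = Math.max(max, v); sum += v; }
        double scale = mode.equals("max")
                ? (max > 0 ? 1.0 / max : 1.0)               // highest boost becomes 1
                : (sum > 0 ? TOTAL_BUDGET / sum : 1.0);     // boosts sum to a constant
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> e : boosts.entrySet())
            out.put(e.getKey(), e.getValue() * scale);
        return out;
    }
}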
29. Unsupervised feedback
§ LucidWorks Enterprise feature
§ Unsupervised – no need to train the system
§ Enhances quality of top-N results
§ Well-researched topic
• Several strategies for keyword extraction and for combining the extracted terms with the original query
§ Automatic feedback loop (sketched below):
• Submit the original query and take the top 5 docs
• Extract some keywords (“important” terms)
• Combine the original query with the extracted keywords
• Submit the modified query & return the results
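A schematic version of that loop in code (search() and extractKeywords() are hypothetical placeholders for the real Solr round-trips and term-extraction strategy; the AND/OR combination matches the two options on the next slide):

import java.util.List;

abstract class UnsupervisedFeedback {
    abstract List<String> search(String query, int rows);           // run a query against Solr
    abstract List<String> extractKeywords(List<String> topDocs);    // pick "important" terms

    List<String> searchWithFeedback(String originalQuery, boolean enhancePrecision) {
        List<String> top5 = search(originalQuery, 5);               // 1. original query, top 5 docs
        List<String> keywords = extractKeywords(top5);              // 2. extract keywords
        String joined = String.join(" OR ", keywords);
        String expanded = enhancePrecision                          // 3. combine with the original query
                ? originalQuery + " AND (" + joined + ")"
                : originalQuery + " OR " + joined;
        return search(expanded, 10);                                // 4. submit the modified query
    }
}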
30. Unsupervised feedback options
§ “Enhance precision” option (tighter fit)
• Extracted terms are AND-ed with the original query, e.g. dog AND (cat OR mouse)
• Filters out documents less similar to the original top-5
§ “Enhance recall” option (more documents)
• Extracted terms are OR-ed with the original query, e.g. dog OR cat OR mouse
• Adds more documents loosely similar to the original top-5
[diagram: the precision / recall trade-off between the two options]