SlideShare une entreprise Scribd logo
1  sur  31
Télécharger pour lire hors ligne
Implementing Click-through
    Relevance Ranking
    in Solr and LucidWorks Enterprise

            Andrzej Białecki
         ab@lucidimagination.com
About the speaker
§  Started using Lucene in 2003 (1.2-dev…)
§  Created Luke – the Lucene Index Toolbox
§  Apache Nutch, Hadoop, Solr committer, Lucene
    PMC member
§  Apache Nutch PMC Chair
§  LucidWorks Enterprise developer




                                                   3
Agenda
§  Click-through concepts
§  Apache Solr click-through scoring
  •  Model
  •  Integration options
§  LucidWorks Enterprise
  •  Click Scoring Framework
  •  Unsupervised feedback




                                        4
Click-through concepts



                         5
Improving relevance of top-N hits
§  N < 10, first page counts the most
   •  N = 3, first three results count the most
§  Many techniques available in Solr / Lucene
   •  Indexing-time
      §  text analysis, morphological analysis, synonyms, ...
   •  Query-time
      §  boosting, rewriting, synonyms, DisMax, function queries …
   •  Editorial ranking (QueryElevationComponent)
§  No direct feedback from users on relevance L
§  What user actions do we know about?
   •  Search, navigation, click-through, other actions…

                                                                      6
Query log and click-through events
Click-through: user selects an item at a
     among a                       for a query
§  Why this information may be useful
   •    “Indicates” user's interest in a selected result
   •    “Implies” that the result is relevant to the query
   •    “Significant” when low-ranking results selected
   •    “May be” considered as user's implicit feedback
§  Why this information may be useless
   •  Many strong assumptions about user’s intent
   •  “Average user’s behavior” could be a fiction
§  “Careful with that axe, Eugene”
                                                             7
Click-through in context
§  Query log, click positions, click intervals provide a
    context
§  Source of spell-checking data
   •  Query reformulation until a click event occurs
§  Click events per user – total or during a session
   •  Building a user profile (e.g. topics of interest)
§  Negative click events
   •  User did NOT click the top 3 results è demote?
§  Clicks of all users for an item (or a query, or both)
   •  Item popularity or relevance to queries
§  Goal: analysis and modification of result ranking
                                                            8
Click to add title…
§    Clicking through == adding labels!
§    Collaborative filtering, recommendation system
§    Topic discovery & opinion mining
§    Tracking the topic / opinion drift over time
§    Click-stream is sparse and noisy – caveat emptor
      •  Changing intent – “hey, this reminds me of smth…”
      •  Hidden intent – remember the “miserable failure”?
      •  No intent at all – “just messing around”




                                                         9
What’s in the click-through data?
§  Query log, with unique id=f(user,query,time)!
   •    User id (or group)
   •    Query (+ facets, filters, origin, etc)
   •    Number of returned results
   •    Context (suggestions, autocomplete, “more like
        this” terms …)
§  Click-through log
   •  Query id , document id, click position & click
      timestamp
§  What data we would like to get?
   •  Map of docId =>
       §  Aggregated queries, aggregated users
       §  Weight factor f(clickCount, positions, intervals)
                                                          10
Other aggregations / reports
§  User profiles
   •  Document types / categories viewed most often
   •  Population profile for a document
   •  User’s sophistication, education level, locations,
      interests, vices … (scary!)
§  Query re-formulations
   •  Spell-checking or “did you mean”
§  Corpus of the most useful queries
   •  Indicator for caching of results and documents
§  Zeitgeist – general user interest over time

                                                           11
Documents with click-through data
   original document     document with click-through data
   -  documentWeight             -  documentWeight

   -  field1 : weight1           -    field1 : weight1
   -  field2 : weight2           -    field2 : weight2
   -  field3 : weight3           -    field3 : weight3
                                 -    labels : weight4
                                 -    users : weight5


§  Modified document and field weights
§  Added / modified fields
   •  Top-N labels aggregated from successful queries
   •  User “profile” aggregated from click-throughs
§  Changing in time – new clicks arrive
                                                            12
Desired effects
§  Improvement in relevance of top-N results
  •  Non-query specific:
     f(clickCount)         (or “popularity”)
  •  Query-specific:
     f([query] Ÿ [labels])
  •  User-specific (personalized ranking):
     f([userProfile] Ÿ [docProfile])
§  Observed phenomena
   •  Top-10 better matches user expectations
   •  Inversion of ranking (oft-clicked > TF-IDF)
   •  Positive feedback
      clicked -> highly ranked -> clicked -> even higher ranked …
                                                                    13
Undesired effects
§  Unbounded positive feedback
   •  Top-10 dominated by popular but irrelevant
      results, self-reinforcing due to user expectations
      about the Top-10 results
§  Everlasting effects of past click-storms
   •  Top-10 dominated by old documents once
      extremely popular for no longer valid reasons
§  Off-topic (noisy) labels
§  Conclusions:
   •  f(click data) should be sub-linear
   •  f(click data, time) should discount older clicks
   •  f(click data) should be sanitized and bounded

                                                           14
Implementation



                 15
Click-through scoring in Solr
§  Not out of the box – you need:
   •    A component to log queries
   •    A component to record click-throughs
   •    A tool to correlate and aggregate the logs
   •    A tool to manage click-through history


§  …let’s (conveniently) assume the above is
    handled by a user-facing app… and we got that
    map of docId => click data

§  How to integrate this map into a Solr index?
                                                     16
Via ExternalFileField
§  Pros:
   •  Simple to implement
   •  Easy to update – no need to do full re-indexing
      (just core reload)
§  Cons:
   •  Only docId => field : boost
   •  No user-generated labels attached to docs L L
§  Still useful if a simple “popularity” metric is
    sufficient



                                                        17
Via full re-index
§  If the corpus is small, or click data updates
    infrequent… just re-index everything
§  Pros:
   •  Relatively easy to implement – join source docs
      and click data by docId + reindex
   •  Allows adding all click data, including labels as
      searchable text
§  Cons:
   •  Infeasible for larger corpora or frequent updates,
      time-wise and cost-wise


                                                           18
Via incremental field updates




§  Oops! Under construction, come back later…

§  … much later …
  •  Some discussions on the mailing lists
  •  No implementation yet, design in flux
                                                 19
Via ParallelReader
   click data                     main index

c1, c2, ...     D1           D4    1 f1, f2, ...         D4     c1, c2, ...   D4   1 f1, f2, ...
c1, c2, ...     D2           D2    2 f1, f2, ...         D2     c1, c2, ...   D2   2 f1, f2, ...
c1, c2, ...     D3           D6    3 f1, f2, ...         D6     c1, c2, ...   D6   3 f1, f2, ...
c1, c2, ...     D4           D1    4 f1, f2, ...         D1     c1, c2, ...   D1   4 f1, f2, ...
c1, c2, ...     D5           D3    5 f1, f2, ...         D3     c1, c2, ...   D3   5 f1, f2, ...
c1, c2,…        D6           D5    6 f1, f2, …           D5     c1, c2,…      D5   6 f1, f2, …


§  Pros:
      •  All click data (e.g. searchable labels) can be added
§  Cons:
      •  Complicated and fragile (rebuild on every update)
              §  Though only the click index needs a rebuild
      •  No tools to manage this parallel index in Solr
                                                                                              20
LucidWorks Enterprise
implementation



                        21
Click Scoring Framework
§  LucidWorks Enterprise feature
§  Click-through log collection & analysis
   •  Query logs and click-through logs (when using
      Lucid's search UI)
   •  Analysis of click-through events
   •  Maintenance of historical click data
   •  Creating of query phrase dictionary (-> autosuggest)
§  Modification of ranking based on click events:
   •  Modifies query rewriting & field boosts
   •  Adds top query phrases associated with a document
   http://getopt.org/   0.13   luke:0.5,stempel:0.3,murmur:0.2


                                                                 22
Aggregation of click events
§  Relative importance of clicks:
   •  Clicks on lower ranking documents more important
      §  Plateau after the second page
   •  The more clicks the more important a document
      §  Sub-linear to counter click-storms
   •  “Reading time” weighting factor
      §  Intervals between clicks on the same result list
§  Association of query terms with target document
   •  Top-N successful queries considered
   •  Top-N frequent phrases (shingles) extracted from
      queries, sanitized


                                                             23
Aggregation of click-through history
§  Needs to reflect document popularity over time
   •  Should react quickly to bursts (topics of the day)
   •  Has to avoid documents being “stuck” at the top
      due to the past popularity
§  Solution: half-life decay model
   •  Adjustable period & rate
   •  Adjustable length of history (affects smoothing)




                                                   time



                                                           24
Click scoring in practice
l    Query log and click log generated by the
      LucidWorks search UI
      l    Logs and intermediate data files in plain text,
            well-documented formats and locations
l    Scheduled click-through analysis activity
l    Final click data – open formats
      l    Boost factor plus top phrases per document
            (plain text)
l    Click data is integrated with the main index
      l    No need to re-index the main corpus
            (ParallelReader trick)
               l    Where are the incremental field updates when you need them ?!!!
      l    Works also with Solr replication (rsync or Java)
                                                                                       25
Click Scoring – added fields
l    Fields added to the main index
      l    click – a field with a constant value of 1, but
            with boost relative to aggregated click history
            l    Indexed, with norms
      l    click_val - “string” (not analyzed) field
            containing numerical value of boost
            l    Stored, indexed, not analyzed
      l    click_terms – top-N terms and phrases from
            queries that caused click events on this
            document
            l    Stored, indexed and analyzed



                                                              26
Click scoring – query modifications
§  Using click in queries (or DisMax’s bq)
  •  Constant term “1” with boost value
  •  Example: term1 OR click:1
§  Using click_val in function queries
  •  Floating point boost value as a string
  •  Example: term1 OR _val_:click_val
§  Using click_terms in queries (e.g. DisMax)
  •  Add click_terms to the list of query fields (qf)
     in DisMax handler (default in /lucid)
  •  Matches on click_terms will be scored as other
     matches on other fields

                                                        27
Click Scoring – impact
l    Configuration options of the click analysis tools
      l  max normalization

            l    The highest value of click boost will be 1, all
                  other values are proportionally lower
            l    Controlled max impact on any given result list
      l    total normalization
            l    Total value of all boosts will be constant
            l    Limits the total impact of click scoring on all lists
                  of results
      l raw – whatever value is in the click data
l    Controlled impact is the key for improving the
      top–N results
                                                                          28
LucidWorks Enterprise –
Unsupervised Feedback



                          29
Unsupervised feedback
l    LucidWorks Enterprise feature
l    Unsupervised – no need to train the system
l    Enhances quality of top-N results
      l    Well-researched topic
      l    Several strategies for keyword extraction and
            combining with the original query
l    Automatic feedback loop:
      l    Submit original query and take the top 5 docs
      l    Extracts some keywords (“important” terms)
      l    Combine original query with extracted keywords
      l    Submit the modified query & return results
                                                             30
Unsupervised feedback options
l    “Enhance precision” option (tighter fit)      precision

      l    Extracted terms are AND-ed with the
            original query
             dog AND (cat OR mouse)

      l    Filters out documents less similar to               recall

            the original top-5
l    “Enhance recall” option (more
      documents)                                    precision

      l    Extracted terms are OR-ed with the
            original query
             dog OR cat OR mouse
                                                                recall
      l    Adds more documents loosely similar
            to the original top-5
                                                                 31
Summary & QA
§  Click-through concepts
§  Apache Solr click-through scoring
  •  Model
  •  Integration options
§  LucidWorks Enterprise
  •  Click Scoring Framework
  •  Unsupervised feedback


§  More questions? ab@lucidimagination.com


                                              32

Contenu connexe

Tendances

Boosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User PreferencesBoosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User PreferencesLucidworks (Archived)
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrTrey Grainger
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...lucenerevolution
 
Twitter Search Architecture
Twitter Search Architecture Twitter Search Architecture
Twitter Search Architecture Ramez Al-Fayez
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solrguest432cd6
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksLucidworks
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engineth0masr
 
Personalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.comPersonalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.comLucidworks
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseRelevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseLucidworks
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopDmitry Kan
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Lucidworks
 
R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil
R to Forecast Solr Activity - Patrick Beaucamp, Bpm-ConseilR to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil
R to Forecast Solr Activity - Patrick Beaucamp, Bpm-ConseilLucidworks
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search PerformanceLucidworks (Archived)
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchCloudera, Inc.
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsLucidworks
 

Tendances (19)

Boosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User PreferencesBoosting Documents in Solr by Recency, Popularity, and User Preferences
Boosting Documents in Solr by Recency, Popularity, and User Preferences
 
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
Boosting Documents in Solr by Recency, Popularity and Personal Preferences - ...
 
Whitepaper- Real World Search
Whitepaper-  Real World SearchWhitepaper-  Real World Search
Whitepaper- Real World Search
 
Twitter Search Architecture
Twitter Search Architecture Twitter Search Architecture
Twitter Search Architecture
 
State-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache SolrState-of-the-Art Drupal Search with Apache Solr
State-of-the-Art Drupal Search with Apache Solr
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
 
Integrating the Solr search engine
Integrating the Solr search engineIntegrating the Solr search engine
Integrating the Solr search engine
 
Personalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.comPersonalized Search and Job Recommendations - Simon Hughes, Dice.com
Personalized Search and Job Recommendations - Simon Hughes, Dice.com
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, FindwiseRelevance in the Wild - Daniel Gomez Vilanueva, Findwise
Relevance in the Wild - Daniel Gomez Vilanueva, Findwise
 
NoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache HadoopNoSQL, Apache SOLR and Apache Hadoop
NoSQL, Apache SOLR and Apache Hadoop
 
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
Evolving The Optimal Relevancy Scoring Model at Dice.com: Presented by Simon ...
 
R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil
R to Forecast Solr Activity - Patrick Beaucamp, Bpm-ConseilR to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil
R to Forecast Solr Activity - Patrick Beaucamp, Bpm-Conseil
 
Understanding Lucene Search Performance
Understanding Lucene Search PerformanceUnderstanding Lucene Search Performance
Understanding Lucene Search Performance
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Solr+Hadoop = Big Data Search
Solr+Hadoop = Big Data SearchSolr+Hadoop = Big Data Search
Solr+Hadoop = Big Data Search
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabsSolr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
Solr Distributed Indexing in WalmartLabs: Presented by Shengua Wan, WalmartLabs
 

Similaire à Click-through relevance ranking in solr &  lucid works enterprise - By Andrzej Bialecki

Making Session Stores More Intelligent
Making Session Stores More IntelligentMaking Session Stores More Intelligent
Making Session Stores More IntelligentKyle Davis
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
It4 Coursework Help
It4 Coursework HelpIt4 Coursework Help
It4 Coursework HelpJTHSICT
 
Stc preso2012 b
Stc preso2012 bStc preso2012 b
Stc preso2012 bprboswell
 
Adaptable Information Workshop slides
Adaptable Information Workshop slidesAdaptable Information Workshop slides
Adaptable Information Workshop slidesLouis Rosenfeld
 
Alla ricerca della User Story perduta
Alla ricerca della User Story perdutaAlla ricerca della User Story perduta
Alla ricerca della User Story perdutaEdoardo Schepis
 
Alla ricerca della user story perduta
Alla ricerca della user story perdutaAlla ricerca della user story perduta
Alla ricerca della user story perdutaBetter Software
 
Revolutionizing the hypatia metadata experience
Revolutionizing the hypatia metadata experienceRevolutionizing the hypatia metadata experience
Revolutionizing the hypatia metadata experienceKat Chuang
 
Usability and Salesforce - Dallas Salesforce.com User Group September 2011
Usability and Salesforce - Dallas Salesforce.com User Group September 2011Usability and Salesforce - Dallas Salesforce.com User Group September 2011
Usability and Salesforce - Dallas Salesforce.com User Group September 2011Shell Black
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...
Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...
Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...museums and the web
 
Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...
Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...
Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...Lucidworks
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comSimon Hughes
 
Exploring Data Preparation and Visualization Tools for Urban Forestry
Exploring Data Preparation and Visualization Tools for Urban ForestryExploring Data Preparation and Visualization Tools for Urban Forestry
Exploring Data Preparation and Visualization Tools for Urban ForestryAzavea
 
Implimenting and Mitigating Change with all of this Newfangled Technology
Implimenting and Mitigating Change with all of this Newfangled TechnologyImplimenting and Mitigating Change with all of this Newfangled Technology
Implimenting and Mitigating Change with all of this Newfangled TechnologyIndiana Online Users Group
 

Similaire à Click-through relevance ranking in solr &  lucid works enterprise - By Andrzej Bialecki (20)

Making Session Stores More Intelligent
Making Session Stores More IntelligentMaking Session Stores More Intelligent
Making Session Stores More Intelligent
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
It4 Coursework Help
It4 Coursework HelpIt4 Coursework Help
It4 Coursework Help
 
Stc preso2012 b
Stc preso2012 bStc preso2012 b
Stc preso2012 b
 
Adaptable Information Workshop slides
Adaptable Information Workshop slidesAdaptable Information Workshop slides
Adaptable Information Workshop slides
 
Alla ricerca della User Story perduta
Alla ricerca della User Story perdutaAlla ricerca della User Story perduta
Alla ricerca della User Story perduta
 
Alla ricerca della user story perduta
Alla ricerca della user story perdutaAlla ricerca della user story perduta
Alla ricerca della user story perduta
 
Revolutionizing the hypatia metadata experience
Revolutionizing the hypatia metadata experienceRevolutionizing the hypatia metadata experience
Revolutionizing the hypatia metadata experience
 
Usability and Salesforce - Dallas Salesforce.com User Group September 2011
Usability and Salesforce - Dallas Salesforce.com User Group September 2011Usability and Salesforce - Dallas Salesforce.com User Group September 2011
Usability and Salesforce - Dallas Salesforce.com User Group September 2011
 
F8 tech talk_pinterest_v4
F8 tech talk_pinterest_v4F8 tech talk_pinterest_v4
F8 tech talk_pinterest_v4
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...
Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...
Renee Anderson, Techniques for prioritizing, road-mapping, and staffing your ...
 
One day Course On Agile
One day Course On AgileOne day Course On Agile
One day Course On Agile
 
Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...
Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...
Empowering Customers to Self Solve - A Findability Journey - Manikandan Sivan...
 
Gauge March 2012
Gauge March 2012 Gauge March 2012
Gauge March 2012
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 
Exploring Data Preparation and Visualization Tools for Urban Forestry
Exploring Data Preparation and Visualization Tools for Urban ForestryExploring Data Preparation and Visualization Tools for Urban Forestry
Exploring Data Preparation and Visualization Tools for Urban Forestry
 
Implimenting and Mitigating Change with all of this Newfangled Technology
Implimenting and Mitigating Change with all of this Newfangled TechnologyImplimenting and Mitigating Change with all of this Newfangled Technology
Implimenting and Mitigating Change with all of this Newfangled Technology
 
Ch 3
Ch   3Ch   3
Ch 3
 

Plus de lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

Plus de lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Dernier

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...itnewsafrica
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observabilityitnewsafrica
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 

Dernier (20)

Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Click-through relevance ranking in solr &  lucid works enterprise - By Andrzej Bialecki

  • 1. Implementing Click-through Relevance Ranking in Solr and LucidWorks Enterprise Andrzej Białecki ab@lucidimagination.com
  • 2. About the speaker §  Started using Lucene in 2003 (1.2-dev…) §  Created Luke – the Lucene Index Toolbox §  Apache Nutch, Hadoop, Solr committer, Lucene PMC member §  Apache Nutch PMC Chair §  LucidWorks Enterprise developer 3
  • 3. Agenda §  Click-through concepts §  Apache Solr click-through scoring •  Model •  Integration options §  LucidWorks Enterprise •  Click Scoring Framework •  Unsupervised feedback 4
  • 5. Improving relevance of top-N hits §  N < 10, first page counts the most •  N = 3, first three results count the most §  Many techniques available in Solr / Lucene •  Indexing-time §  text analysis, morphological analysis, synonyms, ... •  Query-time §  boosting, rewriting, synonyms, DisMax, function queries … •  Editorial ranking (QueryElevationComponent) §  No direct feedback from users on relevance L §  What user actions do we know about? •  Search, navigation, click-through, other actions… 6
  • 6. Query log and click-through events Click-through: user selects an item at a among a for a query §  Why this information may be useful •  “Indicates” user's interest in a selected result •  “Implies” that the result is relevant to the query •  “Significant” when low-ranking results selected •  “May be” considered as user's implicit feedback §  Why this information may be useless •  Many strong assumptions about user’s intent •  “Average user’s behavior” could be a fiction §  “Careful with that axe, Eugene” 7
  • 7. Click-through in context §  Query log, click positions, click intervals provide a context §  Source of spell-checking data •  Query reformulation until a click event occurs §  Click events per user – total or during a session •  Building a user profile (e.g. topics of interest) §  Negative click events •  User did NOT click the top 3 results è demote? §  Clicks of all users for an item (or a query, or both) •  Item popularity or relevance to queries §  Goal: analysis and modification of result ranking 8
  • 8. Click to add title… §  Clicking through == adding labels! §  Collaborative filtering, recommendation system §  Topic discovery & opinion mining §  Tracking the topic / opinion drift over time §  Click-stream is sparse and noisy – caveat emptor •  Changing intent – “hey, this reminds me of smth…” •  Hidden intent – remember the “miserable failure”? •  No intent at all – “just messing around” 9
  • 9. What’s in the click-through data? §  Query log, with unique id=f(user,query,time)! •  User id (or group) •  Query (+ facets, filters, origin, etc) •  Number of returned results •  Context (suggestions, autocomplete, “more like this” terms …) §  Click-through log •  Query id , document id, click position & click timestamp §  What data we would like to get? •  Map of docId => §  Aggregated queries, aggregated users §  Weight factor f(clickCount, positions, intervals) 10
  • 10. Other aggregations / reports §  User profiles •  Document types / categories viewed most often •  Population profile for a document •  User’s sophistication, education level, locations, interests, vices … (scary!) §  Query re-formulations •  Spell-checking or “did you mean” §  Corpus of the most useful queries •  Indicator for caching of results and documents §  Zeitgeist – general user interest over time 11
  • 11. Documents with click-through data original document document with click-through data -  documentWeight -  documentWeight -  field1 : weight1 -  field1 : weight1 -  field2 : weight2 -  field2 : weight2 -  field3 : weight3 -  field3 : weight3 -  labels : weight4 -  users : weight5 §  Modified document and field weights §  Added / modified fields •  Top-N labels aggregated from successful queries •  User “profile” aggregated from click-throughs §  Changing in time – new clicks arrive 12
  • 12. Desired effects §  Improvement in relevance of top-N results •  Non-query specific: f(clickCount) (or “popularity”) •  Query-specific: f([query] Ÿ [labels]) •  User-specific (personalized ranking): f([userProfile] Ÿ [docProfile]) §  Observed phenomena •  Top-10 better matches user expectations •  Inversion of ranking (oft-clicked > TF-IDF) •  Positive feedback clicked -> highly ranked -> clicked -> even higher ranked … 13
  • 13. Undesired effects §  Unbounded positive feedback •  Top-10 dominated by popular but irrelevant results, self-reinforcing due to user expectations about the Top-10 results §  Everlasting effects of past click-storms •  Top-10 dominated by old documents once extremely popular for no longer valid reasons §  Off-topic (noisy) labels §  Conclusions: •  f(click data) should be sub-linear •  f(click data, time) should discount older clicks •  f(click data) should be sanitized and bounded 14
  • 15. Click-through scoring in Solr §  Not out of the box – you need: •  A component to log queries •  A component to record click-throughs •  A tool to correlate and aggregate the logs •  A tool to manage click-through history §  …let’s (conveniently) assume the above is handled by a user-facing app… and we got that map of docId => click data §  How to integrate this map into a Solr index? 16
  • 16. Via ExternalFileField §  Pros: •  Simple to implement •  Easy to update – no need to do full re-indexing (just core reload) §  Cons: •  Only docId => field : boost •  No user-generated labels attached to docs L L §  Still useful if a simple “popularity” metric is sufficient 17
  • 17. Via full re-index §  If the corpus is small, or click data updates infrequent… just re-index everything §  Pros: •  Relatively easy to implement – join source docs and click data by docId + reindex •  Allows adding all click data, including labels as searchable text §  Cons: •  Infeasible for larger corpora or frequent updates, time-wise and cost-wise 18
  • 18. Via incremental field updates §  Oops! Under construction, come back later… §  … much later … •  Some discussions on the mailing lists •  No implementation yet, design in flux 19
  • 19. Via ParallelReader click data main index c1, c2, ... D1 D4 1 f1, f2, ... D4 c1, c2, ... D4 1 f1, f2, ... c1, c2, ... D2 D2 2 f1, f2, ... D2 c1, c2, ... D2 2 f1, f2, ... c1, c2, ... D3 D6 3 f1, f2, ... D6 c1, c2, ... D6 3 f1, f2, ... c1, c2, ... D4 D1 4 f1, f2, ... D1 c1, c2, ... D1 4 f1, f2, ... c1, c2, ... D5 D3 5 f1, f2, ... D3 c1, c2, ... D3 5 f1, f2, ... c1, c2,… D6 D5 6 f1, f2, … D5 c1, c2,… D5 6 f1, f2, … §  Pros: •  All click data (e.g. searchable labels) can be added §  Cons: •  Complicated and fragile (rebuild on every update) §  Though only the click index needs a rebuild •  No tools to manage this parallel index in Solr 20
  • 21. Click Scoring Framework §  LucidWorks Enterprise feature §  Click-through log collection & analysis •  Query logs and click-through logs (when using Lucid's search UI) •  Analysis of click-through events •  Maintenance of historical click data •  Creating of query phrase dictionary (-> autosuggest) §  Modification of ranking based on click events: •  Modifies query rewriting & field boosts •  Adds top query phrases associated with a document http://getopt.org/ 0.13 luke:0.5,stempel:0.3,murmur:0.2 22
  • 22. Aggregation of click events §  Relative importance of clicks: •  Clicks on lower ranking documents more important §  Plateau after the second page •  The more clicks the more important a document §  Sub-linear to counter click-storms •  “Reading time” weighting factor §  Intervals between clicks on the same result list §  Association of query terms with target document •  Top-N successful queries considered •  Top-N frequent phrases (shingles) extracted from queries, sanitized 23
  • 23. Aggregation of click-through history §  Needs to reflect document popularity over time •  Should react quickly to bursts (topics of the day) •  Has to avoid documents being “stuck” at the top due to the past popularity §  Solution: half-life decay model •  Adjustable period & rate •  Adjustable length of history (affects smoothing) time 24
  • 24. Click scoring in practice l  Query log and click log generated by the LucidWorks search UI l  Logs and intermediate data files in plain text, well-documented formats and locations l  Scheduled click-through analysis activity l  Final click data – open formats l  Boost factor plus top phrases per document (plain text) l  Click data is integrated with the main index l  No need to re-index the main corpus (ParallelReader trick) l  Where are the incremental field updates when you need them ?!!! l  Works also with Solr replication (rsync or Java) 25
  • 25. Click Scoring – added fields l  Fields added to the main index l  click – a field with a constant value of 1, but with boost relative to aggregated click history l  Indexed, with norms l  click_val - “string” (not analyzed) field containing numerical value of boost l  Stored, indexed, not analyzed l  click_terms – top-N terms and phrases from queries that caused click events on this document l  Stored, indexed and analyzed 26
  • 26. Click scoring – query modifications §  Using click in queries (or DisMax’s bq) •  Constant term “1” with boost value •  Example: term1 OR click:1 §  Using click_val in function queries •  Floating point boost value as a string •  Example: term1 OR _val_:click_val §  Using click_terms in queries (e.g. DisMax) •  Add click_terms to the list of query fields (qf) in DisMax handler (default in /lucid) •  Matches on click_terms will be scored as other matches on other fields 27
  • 27. Click Scoring – impact l  Configuration options of the click analysis tools l  max normalization l  The highest value of click boost will be 1, all other values are proportionally lower l  Controlled max impact on any given result list l  total normalization l  Total value of all boosts will be constant l  Limits the total impact of click scoring on all lists of results l raw – whatever value is in the click data l  Controlled impact is the key for improving the top–N results 28
  • 29. Unsupervised feedback l  LucidWorks Enterprise feature l  Unsupervised – no need to train the system l  Enhances quality of top-N results l  Well-researched topic l  Several strategies for keyword extraction and combining with the original query l  Automatic feedback loop: l  Submit original query and take the top 5 docs l  Extracts some keywords (“important” terms) l  Combine original query with extracted keywords l  Submit the modified query & return results 30
  • 30. Unsupervised feedback options l  “Enhance precision” option (tighter fit) precision l  Extracted terms are AND-ed with the original query dog AND (cat OR mouse) l  Filters out documents less similar to recall the original top-5 l  “Enhance recall” option (more documents) precision l  Extracted terms are OR-ed with the original query dog OR cat OR mouse recall l  Adds more documents loosely similar to the original top-5 31
  • 31. Summary & QA §  Click-through concepts §  Apache Solr click-through scoring •  Model •  Integration options §  LucidWorks Enterprise •  Click Scoring Framework •  Unsupervised feedback §  More questions? ab@lucidimagination.com 32