DevNexus 2024: Managed Search
Presented by Jacob Graves, Getty Images
2. Managed Search
Jacob Graves,
Principal Engineer at Getty Images
jacob.graves@gettyimages.com
3. Introduction
Getty Images is the global leader in visual communications with over 170 million
assets available through its premium content site www.gettyimages.com and its
leading stock content site www.istock.com. With its advanced search and image
recognition technology, Getty Images serves business customers in more than 100
countries and is the first place creative and media professionals turn to discover,
purchase and manage images and other digital content. Its award-winning
photographers and content creators help customers produce inspiring work which
appears every day in the world’s most influential newspapers, magazines, advertising
campaigns, films, television programs, books and online media.
4. Getty Search
Obviously, in order to buy images you have to be able to find them.
Search Process:
• Receive a search containing words.
• Tokenize the words and map them onto our controlled-vocabulary keywords.
• Find all the images associated with those keywords.
• Score all the images and then sort them by score.
The scoring determines which images the users see.
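As a rough illustration, the four steps above could be sketched in Java like this. The vocabulary, inverted index and count-based scoring here are hypothetical stand-ins for illustration, not our production logic:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the four-step search pipeline.
public class SearchPipeline {
    // Controlled vocabulary: raw token -> canonical keyword id.
    static final Map<String, Integer> VOCAB = Map.of("dog", 101, "puppy", 101, "beach", 202);
    // Inverted index: keyword id -> image ids.
    static final Map<Integer, List<Integer>> INDEX = Map.of(
        101, List.of(1, 2, 3),
        202, List.of(2, 4));

    static List<Integer> search(String query) {
        // 1. Tokenize the incoming search phrase.
        String[] tokens = query.toLowerCase().split("\\s+");
        // 2. Map tokens onto controlled-vocabulary keywords.
        Set<Integer> keywords = Arrays.stream(tokens)
            .map(VOCAB::get).filter(Objects::nonNull)
            .collect(Collectors.toSet());
        // 3. Find all images associated with those keywords.
        Map<Integer, Double> scores = new HashMap<>();
        for (int kw : keywords)
            for (int img : INDEX.getOrDefault(kw, List.of()))
                // 4. Score: here, simply count matched keywords per image.
                scores.merge(img, 1.0, Double::sum);
        // Sort the images by descending score.
        return scores.entrySet().stream()
            .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

In this toy version an image matching both "puppy" and "beach" outscores images matching only one keyword; the real scoring is far richer, as the following slides describe.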
5. Managed Search
The details of how the scoring takes place are a technical concern, but the end result is
a business concern.
Goal – make business users self-sufficient.
So the problem is to create a framework for business users that will:
• Hide technical complexity.
• Allow control over scoring components and result ordering.
• Allow balancing of these scoring components against each other.
• Provide feedback.
• Allow visualization of the results of their changes.
We call this Managed Search.
6. Managed Search – Our Implementation
1. We created a SOLR search ecosystem containing all our images, keywords and
associated metadata, and added plugins using Java.
2. We used a C# middle tier to wrap around our SOLR ecosystem.
3. We built a web application called SAW – Search Administration Workbench – using
the Java Play framework and lots of JavaScript.
7. Managed Search Architecture Diagram
[Architecture diagram: the Business User works in the SAW Site, which loads and saves
algorithm settings in an Algorithm DB. SAW sends a search plus algorithm settings to
SOLR via the select URL and gets back search results with debug scores. SOLR hosts the
custom functions (ValueSources), the tier shuffle (RankQuery) and the index settings.
The Customer searches through the Middle Tier, which loads the saved algorithm settings
and returns search results for site searches.]
8. SAW
SAW has 5 main areas:
• Algorithm – control sort scoring.
• Preview – see search results.
• Single Page Charts – single search score component charts.
• Scale report charts – all searches score component charts.
• Live tests – expose test algorithms to live users to gather and view KPI data.
9. Scoring Breakdown
To help the business control the scoring we break it down into 3 different scoring
components:
• Relevancy – image attributes that are relative to the search (i.e. keywords).
• Recency – the age of the image; newer images score higher.
• Image Source – image attributes that are not related to the specific search.
Then we provide two types of parameter the user can control:
• Internal parameters – to control how each component is calculated.
• External boosts – to control how the components are weighted against each other.
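To illustrate how external boosts weight the components against each other, here is a minimal sketch. The weighted sum and the boost values are assumptions for illustration; the real combination happens inside SOLR:

```java
public class CombinedScore {
    // Combine the three component scores using external boosts.
    // A simple weighted sum is shown; the actual combination may differ.
    static double combine(double relevancy, double recency, double sourceScore,
                          double relBoost, double recBoost, double srcBoost) {
        return relevancy * relBoost + recency * recBoost + sourceScore * srcBoost;
    }
}
```

Raising `relBoost` relative to the other boosts makes relevancy dominate the final ordering, which is exactly the kind of trade-off the business users tune.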
11. Scoring Architecture
• In order to allow immediate feedback we have to implement scoring using query-time
boosting.
• Use boost functions, as they are cleaner.
• Favor query time over index time, to prioritize control over small performance
gains.
• Define minimum performance metrics and ensure that we stay within them.
Initially we had concentrated on performance above all else, and had ended up with
inflexible scoring in return for fairly minor performance gains.
We used the ValueSource plugin to create our own boost functions.
12. Relevancy
• The most important component: how confident are we that this image is correct?
• We measure relevancy at the image/keyword level by tracking user interactions.
• After experimenting we settled on a form of the standard tf-idf VSM (Vector Space
Model) and expose a normalization parameter.
• We also expose a boost so they can control the strength of relevancy relative to
other factors.
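A rough sketch of what a tf-idf score with an exposed normalization parameter might look like. The exact formula and the parameter name `norm` are illustrative assumptions, not our production scoring:

```java
public class Relevancy {
    // tf-idf style relevancy with an exposed normalization exponent "norm".
    // norm = 1.0 gives the raw tf-idf value; values below 1.0 flatten the
    // differences between images, so the business can soften relevancy.
    static double score(double termFreq, long docsWithTerm, long totalDocs, double norm) {
        double tf = Math.sqrt(termFreq);                                // dampened term frequency
        double idf = Math.log((double) totalDocs / (1 + docsWithTerm)); // inverse document frequency
        return Math.pow(tf * idf, norm);
    }
}
```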
13. Recency
• Recency is the age of the images.
• Newer images get a higher score to prevent staleness.
• Aging curve – the way an image's recency score changes with age.
• We expose 3 different aging curves (reciprocal, linear and reversed reciprocal) and
appropriate parameters to control the shape of the curve.
• We also expose a boost so they can control the strength of recency relative to
other factors.
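The three curve shapes could be sketched like this. The formulas and parameter names are illustrative assumptions; the production curves and their control parameters may differ:

```java
public class AgingCurves {
    // Reciprocal: drops fast at first, long tail.
    // "half" = age (in days) at which the score has decayed to 0.5.
    static double reciprocal(double ageDays, double half) {
        return half / (half + ageDays);
    }

    // Linear: constant decay until maxAge, then zero.
    static double linear(double ageDays, double maxAge) {
        return Math.max(0.0, 1.0 - ageDays / maxAge);
    }

    // Reversed reciprocal: stays near 1.0 for a long time,
    // then drops off quickly as the image approaches maxAge.
    static double reversedReciprocal(double ageDays, double maxAge, double half) {
        return 1.0 - reciprocal(maxAge - Math.min(ageDays, maxAge), half);
    }
}
```

Each curve trades off how aggressively old images are pushed down, which is why the choice of curve is exposed to the business rather than hard-coded.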
15. Image Source
• We have a variety of image level attribute data that should affect the sort order,
mostly to do with how likely we think the image is to be of high quality.
• We separate our images into groups based on these attributes, called the source.
• We expose a boost that allows the users to increase the score of images with a
given source.
• Unlike relevancy, this is an image level, not image/keyword level property, so it
doesn’t vary from one search to the next.
• Because it isn’t context specific it is dangerous to make this boost too large.
16. Custom Shuffle
As well as influencing the scoring, the business wants control over the order in which
the images are displayed, so that instead of the images simply appearing in score order,
certain slots on the page can be allocated to particular classes of image. This ensures
that we always show a diverse range of images.
To accommodate this we need to be able to apply a custom shuffle – similar to a sort
but with more control. To accomplish this we take advantage of a plugin type that is
new in SOLR 4.9: the RankQuery plugin.
17. Image Tier Shuffle
We classify our images into separate groups or image tiers based on various image
level attributes, e.g.
• Licensing Structure
• Image partner
• Exclusivity
• Etc.
We distill these factors into a single image property that we assign at index time.
We generate a mapping of result slots to image tiers, e.g.
• slot 1 => image tier 2
• slot 2 => image tier 4
• etc.
We pass in the mapping at query time and use our RankQuery implementation to shuffle
the query results.
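Outside of SOLR, the slot-mapping shuffle can be illustrated in plain Java. This is a simplified analogue of what our RankQuery implementation does; the fallback rule for exhausted tiers is an assumption for illustration:

```java
import java.util.*;

public class TierShuffle {
    record Doc(int id, int tier, double score) {}

    // Shuffle scored docs into result slots using a slot -> image tier mapping,
    // falling back to the best remaining doc when a tier's queue is empty.
    static List<Doc> shuffle(List<Doc> docs, int[] slotToTier) {
        // One score-ordered priority queue per image tier.
        Map<Integer, PriorityQueue<Doc>> queues = new HashMap<>();
        for (Doc d : docs)
            queues.computeIfAbsent(d.tier(), t -> new PriorityQueue<>(
                Comparator.comparingDouble(Doc::score).reversed())).add(d);

        List<Doc> out = new ArrayList<>();
        for (int tier : slotToTier) {
            PriorityQueue<Doc> q = queues.get(tier);
            if (q == null || q.isEmpty()) {
                // Fallback: take the highest-scoring doc from any non-empty tier.
                q = queues.values().stream().filter(pq -> !pq.isEmpty())
                          .max(Comparator.comparingDouble(pq -> pq.peek().score()))
                          .orElse(null);
            }
            if (q != null && !q.isEmpty()) out.add(q.poll());
        }
        return out;
    }
}
```

With the mapping `slot 1 => tier 2, slot 2 => tier 1, ...`, the top-scoring tier-2 image fills slot 1 even if a tier-1 image has a higher raw score.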
18. Preview page
• Search and get results scored using algorithm settings.
• Display in pages of 100 images.
• Show image score breakdown by component.
• Show image tier.
To calculate the score for each component we run the SOLR query in debug mode,
and parse the results with regular expressions to extract the score for each component.
This is the least stable piece of the whole application, as debug syntax can change
quite frequently between SOLR releases. However, it’s also pretty easy to fix.
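A simplified illustration of the regex approach. The debug format shown here is a hypothetical stand-in; the real SOLR debug explain output is much more verbose, and its shape changing between releases is exactly what makes this piece fragile:

```java
import java.util.*;
import java.util.regex.*;

public class DebugScoreParser {
    // Extract "<score> = <componentName> ..." lines from a debug explain blob.
    // The component names and line format below are illustrative assumptions.
    static Map<String, Double> componentScores(String explain) {
        Map<String, Double> out = new LinkedHashMap<>();
        Matcher m = Pattern.compile("([0-9.]+) = (relevancy|recency|source)\\b")
                           .matcher(explain);
        while (m.find())
            out.put(m.group(2), Double.parseDouble(m.group(1)));
        return out;
    }
}
```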
20. Single Page Charts
This lets the users verify, with numbers, what they think they are seeing visually.
• Aggregate the component scoring data across the 100 images on a page.
• Create interactive charts from the data.
• Charts that display the distribution of each score component.
• Chart that displays the comparative score from each component.
• Chart that shows the custom shuffle distribution.
We use the D3.js JavaScript library to generate the graphs.
22. Scale Reports
This allows the users to validate their settings across the full spectrum of searches
that users execute at Getty.
• Execute 1000 different searches (throttled).
• Use the first 100 images from each search by default; the number can be increased up
to 10,000 (slower).
• Aggregate the component scoring data across all the results.
• Create and display charts similar to the ones used in the single page charts view.
To generate the list of 1000 searches we use proportional sampling from search log
data.
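Proportional sampling from log counts can be sketched like this. The cumulative-sum technique is one standard way to do it; our production sampler may differ:

```java
import java.util.*;

public class ProportionalSampler {
    // Draw n searches, each picked with probability proportional to how
    // often it appears in the search logs (sampling with replacement).
    static List<String> sample(Map<String, Integer> logCounts, int n, Random rng) {
        long total = logCounts.values().stream().mapToLong(Integer::longValue).sum();
        List<String> picks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Pick a point in [0, total) and walk the cumulative counts to it.
            long target = (long) (rng.nextDouble() * total);
            long running = 0;
            for (Map.Entry<String, Integer> e : logCounts.entrySet()) {
                running += e.getValue();
                if (running > target) { picks.add(e.getKey()); break; }
            }
        }
        return picks;
    }
}
```

This way a search run 90% of the time in production appears roughly 90% of the time in the scale report, so the charts reflect what real users actually see.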
24. Live tests
Once the users are happy with an algorithm the next stage is to test it for real.
To do this we have a page that controls:
• The algorithm settings for the various live and test sorts.
• Saving these settings to a database where they are used to generate production
SOLR queries.
• The percentage of users for a given live sort that will be allocated to a test sort.
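Percentage-based allocation is commonly implemented by hashing the user id into a stable bucket; here is a sketch of that technique (an assumption for illustration – the slides do not specify how our allocation works):

```java
public class TestAllocator {
    // Deterministically allocate a percentage of users to a test sort.
    // Hashing the user id means the same user always lands in the same
    // bucket, so their search experience stays consistent during the test.
    static boolean inTestSort(String userId, double testPercent) {
        int bucket = Math.floorMod(userId.hashCode(), 100); // bucket in [0, 99]
        return bucket < testPercent;
    }
}
```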
25. KPI monitoring
We also have a page that displays the user interaction data.
• Displays actions against our KPIs (Key Performance Indicators).
• Primarily we use click-through (i.e. the user clicks on an image in the search results).
• Broken out by time and by sort so we can compare the test algorithms against the
live ones.
• We get this data in a feed from our existing analytics framework.
26. Conclusion
The self-sufficient business user's path to changing the sort order:
1. Change algorithm settings.
2. Execute searches and evaluate sort order visually.
3. Use single page charts to confirm visual impressions.
4. Use scale report to confirm behavior across proportional set of searches.
5. Set a test algorithm to have the settings you want.
6. Set a percentage of users to experience the test.
7. Monitor KPIs over time to see if the settings work as intended.
8. Set the live algorithm to have the settings you want.
27. ValueSource Plugins
This is a well-established SOLR plugin for adding custom query functions.
http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser
There are 3 parts:
• Implement ValueSource. This is where the actual logic is implemented. It can take in
either simple datatypes (like Strings or floats) or other ValueSource objects (e.g. an asset
field value or another query function).
• Implement ValueSourceParser. This creates the ValueSource object with appropriate
inputs.
• solrconfig.xml – add a line to enable the new ValueSource plugin.
You can look at any of the existing Query function implementations to see how they should
work.
e.g. – for the “Map” query function see:
• org.apache.solr.search.ValueSourceParser
• org.apache.lucene.queries.function.valuesource.RangeMapFloatFunction
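Outside of SOLR's actual ValueSource class hierarchy, the idea behind a range-map style function can be illustrated with a plain per-document function. This is a simplified analogue for illustration, not the real Lucene API:

```java
import java.util.function.IntToDoubleFunction;

public class ValueSourceSketch {
    // A ValueSource is essentially a per-document value function
    // (docId -> value) that can wrap other value functions.
    // rangeMap mimics the behavior behind the "map" query function:
    // if the source's value falls within [min, max], return target,
    // otherwise return the default.
    static IntToDoubleFunction rangeMap(IntToDoubleFunction source,
                                        double min, double max,
                                        double target, double def) {
        return doc -> {
            double v = source.applyAsDouble(doc);
            return (v >= min && v <= max) ? target : def;
        };
    }
}
```

Composability is the key property: because a value function can wrap another value function (a field value, or a whole query function), complex boosts are built from small pieces.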
You can also change the debug output so that the results of each custom function are
visible in debug mode; this is what allows us to display the individual score
components to the users.
28. RankQuery Implementation
This is a new plugin in SOLR 4.9, created by Joel Bernstein.
https://issues.apache.org/jira/browse/SOLR-5973
There is a test in the SOLR 4.9 tests that shows a good example implementation:
org.apache.solr.search.TestRankQueryPlugin
Very briefly, you have to implement:
• QParserPlugin, it creates and returns the QParser implementation.
• QParser, it creates and returns the RankQuery implementation.
• RankQuery, it creates and returns the TopDocsCollector and MergeStrategy implementations.
• TopDocsCollector, this returns the top documents from each shard that you wish to include in your final
results. In our case we separate the documents into separate priority queues by image tier, and order by
score within each image tier. Then we go through a pre-determined list of which image tier should occupy
each slot, and pull the next item from the appropriate image tier priority queue to generate the top
documents List.
• MergeStrategy, this combines the top documents generated by the TopDocsCollectors on each shard. In
our case we followed the same logic as we had for each individual shard, assigning documents to priority
queues by image tier in score order, and then assigning queues to pre-determined slots.
Lastly you reference the new QParserPlugin in your solrconfig.xml.
The pre-determined list of image tier slots could be a user-configurable parameter that is passed in, it
could be included in the solrconfig.xml, or it could even be hard coded.
29. Q & A
Please contact me if you have any questions or thoughts.
I will be attending until the end of the conference.
Email – jacob.graves@gettyimages.com