Building a high-quality and robust search engine for visual content is a complex problem that can be addressed in many different ways.
In this talk, I will show how we leveraged anonymized user engagement data, as well as various metadata sources of a media platform, to build a multimodal vector space of search queries, tags, and gifs. We consider this space a compact representation of the environment at hand, which allows us to model user behavior and preferences. We will discuss the approaches we utilized at different stages of the project and the application of these embeddings in numerous services, including search and a recommender system.
2. Media content search. Information retrieval problem
To perform search on media content (gifs, images, videos), one can’t simply use
the original files (sets of pixels, frames, etc.), since they cannot be indexed efficiently.
3. Media content search. Information retrieval problem
Usually, media documents are converted into more compressed representations
(textual or vectorized) to which various known search strategies can be applied.
Search = Content + Candidate Generation + Ranking
4. Media search engines. Textual representations
Media content can be converted into textual data via the following approaches:
1) OCR (Optical Character Recognition)
2) ASR (Automatic Speech Recognition)
3) Tag annotation (either manual or automatic via an ML model)
4) Video summarization models
5. Media search engines. Textual representations
Search can be organized in one of the following ways:
1) Use full-text search solutions to rank the generated text documents for a given
search query
2) Train an LTR (Learning to Rank) model that predicts relevance for each
(text document, search query) pair. A training dataset is needed!
6. Media search engines. Textual representations
Issues with the “textual” approach:
1) A visual (or audio) signal cannot be converted into text without information loss
(discretization problem)
2) To represent the content better, multiple models/signals should be combined =>
a more complicated system
7. Media search engines. Vector representations
Images and videos can be converted into meaningful and efficiently compressed
vector representations via CV models.
We can build a similarity index over all documents, perform clustering to group
documents into categories that can be used for search, etc.
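As a rough sketch of such an index, here is how document vectors (random placeholders standing in for real CV-model outputs) can be indexed for approximate kNN search with nmslib:

```python
import numpy as np
import nmslib

# Placeholder for vectors produced by a CV model: 10k documents, 256 dims.
doc_vectors = np.random.rand(10_000, 256).astype(np.float32)

# Build an HNSW index with cosine similarity.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(doc_vectors)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=False)

# 10 nearest neighbours of document 0 (the first hit is the document itself).
ids, dists = index.knnQuery(doc_vectors[0], k=10)
```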
8. Media search engines. Vector representations
NLP models can be used to represent the search query.
To match a search query against documents:
1) LTR - predict relevance for a given pair of vectors
2) Mapper model - fuse the search query and document vectors into a single
vector space
9. Dataset
Pairs (document, search query) with relevance scores can be obtained in two ways:
1) Manual annotation (e.g. via a crowdsourcing job)
a) Takes time to collect
b) Can be expensive, because it should cover a large part of the search space
2) Online, based on engagement data (logged events)
a) Approximates relevance with some noise
b) Given substantial traffic, a large and diverse dataset can be rebuilt on a periodic
basis (capturing trends and seasonality)
11. Dataset. Engagement data (training/validation)
Billions of anonymized events per day are logged to capture:
1) views
2) clicks
3) shares
4) favourites
for each gif and search query.
Events can be grouped into “sessions” by utilizing client-specific details.
12. Dataset. Engagement data (training/validation)
“Sessions” can be unfolded into sequences of gifs clicked by each user:
session 1: gif_1, gif_2, gif_3
...
Or we can incorporate both search queries and gifs:
session 1: hello, gif_1, gif_2, good_morning, gif_3
...
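A minimal sketch of this unfolding, assuming events arrive as records with a session id, a timestamp, and a token (the field names here are illustrative, not the actual log schema):

```python
from collections import defaultdict

# Hypothetical logged events; the field names are illustrative only.
events = [
    {"session_id": "s1", "ts": 1, "type": "query", "token": "hello"},
    {"session_id": "s1", "ts": 2, "type": "click", "token": "gif_1"},
    {"session_id": "s1", "ts": 3, "type": "click", "token": "gif_2"},
    {"session_id": "s1", "ts": 4, "type": "query", "token": "good_morning"},
    {"session_id": "s1", "ts": 5, "type": "click", "token": "gif_3"},
]

# Group events by session and order them by time to get token sequences.
sessions = defaultdict(list)
for event in sorted(events, key=lambda e: (e["session_id"], e["ts"])):
    sessions[event["session_id"]].append(event["token"])

# sessions["s1"] -> ["hello", "gif_1", "gif_2", "good_morning", "gif_3"]
```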
13. Dataset. Engagement data (training/validation)
To address positional bias for different grids:
1) shuffle search results for a small percentage of traffic
2) apply probabilistic modeling based on hierarchical pooling to estimate the
effect of positional bias on CTR
For content safety, both the search query and gif datasets are filtered via
maintained blacklists and NSFW models.
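One simple way to turn the shuffled-traffic slice from point 1 into per-position bias estimates is to compare each position’s CTR with a reference position; the hierarchical pooling from point 2 is more involved, so treat this as a rough sketch with made-up counts:

```python
import numpy as np

# Made-up (clicks, views) per grid position, collected only from the
# shuffled traffic, where content is decorrelated from position.
clicks = np.array([900, 700, 520, 400, 310])
views = np.array([10_000] * 5)

ctr = clicks / views
# Bias of each position relative to position 0; dividing observed CTRs
# by these factors debiases engagement logged on the regular traffic.
position_bias = ctr / ctr[0]
print(position_bias)  # [1.0, 0.778, 0.578, 0.444, 0.344]
```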
14. Dataset. Manually labeled (benchmark)
Human judgements are obtained via crowdsourcing tasks that estimate:
1) query-gif relevance
2) gif-gif relevance
● Complex relevance criteria defined by the business
● Rarely updated and relatively compact
15. Dataset. Manually labeled (benchmark)
Triplets dataset (anchor, positive, negative).
Metric - % of triplets for which relevance(anchor, positive) > relevance(anchor, negative).
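Once anchors and candidates can be embedded, the metric is a few lines of NumPy; a minimal sketch, assuming cosine similarity as the relevance proxy:

```python
import numpy as np

def triplet_accuracy(anchors, positives, negatives):
    """% of triplets where sim(anchor, positive) > sim(anchor, negative).

    Each argument is an (n, d) array of embeddings, one row per triplet.
    """
    def cos(a, b):
        return np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        )

    return float(np.mean(cos(anchors, positives) > cos(anchors, negatives)))
```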
16. Baseline. Word2Vec model
MVP. Gif embeddings for the Recommender System.
Train a Gensim Skip-Gram model only on gifs:
session 1: gif_1, gif_2, gif_3
, where gif_* is an identifier of a gif that was clicked during a session.
For inference: kNN search in the embedding space (nmslib).
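A minimal sketch of this baseline with Gensim (the sessions are placeholders and the hyperparameters are illustrative, not the production values):

```python
from gensim.models import Word2Vec

# Each "sentence" is one session: the sequence of gifs clicked in it.
sessions = [
    ["gif_1", "gif_2", "gif_3"],
    ["gif_2", "gif_4"],
    ["gif_1", "gif_3", "gif_5"],
]

# sg=1 selects the Skip-Gram architecture.
model = Word2Vec(sessions, vector_size=64, window=5, min_count=1, sg=1)

# Recommendations for a gif = its nearest neighbours in the embedding
# space (Gensim's exact kNN here; nmslib plays this role at scale).
print(model.wv.most_similar("gif_1", topn=3))
```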
17. Baseline. Word2Vec model
V1. Joint embeddings for search queries and gifs:
session 1: query_1, gif_1, gif_2, query_2, query_3
...
, where query_* is an identifier of a search query issued by a user,
and gif_* is an identifier of a gif that was clicked during the session.
19. Baseline. Word2Vec model
Pros: Search queries and gifs live in a single space. Also, gifs’ tags can be
incorporated. Applications:
1) Search (query -> relevant gifs)
2) Recommender System (gif -> relevant gifs)
3) Tag Suggestion (query -> relevant tags)
Cons: Identifiers (not gif/query content) are used => cold start problem.
The less frequent an identifier is, the less accurately it is positioned in the
embedding space.
22. Baseline. Word2Vec model
Search. Implicit usage: features for ElasticSearch
1) Query Expansion
love you to the moon and back => love, adore you, couple, happy
2) Tag Suggestion for gifs
gif_1 => love, happy, couple
Results: +10% relative change in CTR
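Under the hood, query expansion is a kNN lookup in the joint space that keeps only non-gif neighbours; a sketch on top of a trained Word2Vec model, assuming (purely for illustration) that gif identifiers carry a "gif_" prefix:

```python
def expand_query(model, query_id, topn=20, limit=4):
    """Expand a query with nearby query/tag tokens from the joint space."""
    expansions = []
    for token, _score in model.wv.most_similar(query_id, topn=topn):
        if not token.startswith("gif_"):  # assumed naming convention
            expansions.append(token)
        if len(expansions) == limit:
            break
    return expansions

# The returned terms can be passed to ElasticSearch as extra match clauses.
```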
25. Cold start. Part 1. StarSpace
Extend the search query with identifiers of its word n-grams:
how_are_you_id, gif_1, doing_good_id, gif_2
becomes:
how_are_you_id, how_id, are_id, you_id, gif_1, doing_good_id, doing_id, good_id, gif_2
● The model additionally learns to compare word n-grams with document identifiers
● Unseen search query vector = average of the available tokens’ vectors
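The unseen-query fallback is plain vector averaging over whatever tokens exist in the vocabulary; a minimal sketch, with token ids following the "<word>_id" convention from the example above and a Gensim-style `model.wv` lookup used for brevity:

```python
import numpy as np

def embed_unseen_query(model, query):
    """Embed a query by averaging the vectors of its known tokens."""
    tokens = [f"{word}_id" for word in query.split()]
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return None  # no known tokens: the query stays cold
    return np.mean(vectors, axis=0)
```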
26. Cold start. Part 2. Word2Vec + BERT
Take a pre-trained BERT model and fine-tune it jointly with Word2Vec.
BERT learns a mapping from search query tokens to the Word2Vec gif space.
The cold start problem is solved for queries, but it is still an issue for gifs ;(
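A hedged sketch of such a mapper: a BERT encoder with a projection head, trained to pull the query vector towards the Word2Vec vectors of gifs clicked for that query (the model name, target dimension, and cosine loss are assumptions, not the exact production setup):

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
project = nn.Linear(encoder.config.hidden_size, 64)  # 64 = gif space dim

def embed_query(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
    return project(cls)

# One training step: align the query with the Word2Vec vector of a clicked
# gif; in practice gif_vec comes from the trained Word2Vec model.
query_vec = embed_query(["good morning"])
gif_vec = torch.randn(1, 64)  # placeholder for the real Word2Vec vector
loss = 1 - nn.functional.cosine_similarity(query_vec, gif_vec).mean()
loss.backward()
```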
27. Mixture of Embedding Experts
The key point is that we haven’t really utilized gif data (e.g. visual representation,
tags, etc.) yet.
What if we extend the BERT+Word2Vec approach to all available signals?
31. Mixture of Embedding Experts
We still have the same unified embedding space, but without the cold start
problem.
Leverage all available gif metadata:
1) Visual representation
2) Tags representation
3) OCR representation
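Following the Mixture of Embedding Experts paper linked below, each modality gets its own expert encoder and the gif vector is a learned weighted combination of the experts’ outputs; a simplified sketch (the paper weights experts per query, while this version learns static gates, and all dimensions are assumptions):

```python
import torch
from torch import nn

class GifMoEE(nn.Module):
    """Fuse per-modality gif representations into a single vector."""

    def __init__(self, dims, out_dim=64):
        super().__init__()
        self.names = list(dims)  # fixed modality order
        self.experts = nn.ModuleDict(
            {name: nn.Linear(dim, out_dim) for name, dim in dims.items()}
        )
        self.gates = nn.Parameter(torch.zeros(len(dims)))  # learned weights

    def forward(self, inputs):
        # inputs: dict of modality name -> (batch, dim) tensor.
        outs = torch.stack(
            [self.experts[n](inputs[n]) for n in self.names], dim=0
        )
        weights = torch.softmax(self.gates, dim=0).view(-1, 1, 1)
        return (weights * outs).sum(dim=0)

moee = GifMoEE({"visual": 2048, "tags": 300, "ocr": 300})
gif_vecs = moee({
    "visual": torch.randn(8, 2048),  # CV-model output
    "tags": torch.randn(8, 300),     # tag text representation
    "ocr": torch.randn(8, 300),      # OCR text representation
})  # -> (8, 64), comparable with query vectors in the same space
```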
35. Summary
1) Embeddings are great for various IR tasks
2) The ideal application is the candidate generation step
3) Start with a simple baseline with recall as high as possible
4) Wise collection of implicit user feedback is a vital part of good embeddings
5) Use human-verified datasets for benchmarks
6) The more data sources you have, the better the quality of the representations
36. Links
1) Word2Vec illustration: http://jalammar.github.io/illustrated-word2vec
2) nmslib, efficient aNN search: https://github.com/nmslib/nmslib
3) StarSpace for space fusion: https://github.com/facebookresearch/StarSpace
4) DSSM: https://www.microsoft.com/en-us/research/project/dssm
5) Pinterest multimodal learning: https://labs.pinterest.com/user/themes/pin_labs/assets/paper/training-and-evaluating.pdf
6) Mixture of Embedding Experts: https://arxiv.org/pdf/1804.02516.pdf