Using Query Embeddings based on user sessions for query expansion.
Discover how we expand queries at OLX Group by embedding our queries using Neural Networks.
2. Me, OLX, The team
My name is Mariano Semelman!
I come from Argentina.
@msemelman
mariano.semelman@olx.com
I’m a Data Scientist with 6 years of experience working in:
● Behavioural targeting
● Natural language processing
● Recommendation systems
● Search engines
3. Me, OLX, The team
● OLX: Online classifieds platform
● Berlin Shared Service: Support and Center of expertise to the rest of the platform.
● PnR Services Team: Search, Recommender systems, Big Data.
4. Me, OLX, The team
● Vladan Radosavljevic: Head of Data Science
● Mariano Semelman: Senior Data Scientist
● Manish Saraswat: Data Scientist
● Vaibhav Sharma: Data Scientist
9. Embedding
Definition (very easy!):
F: X↪Y
X: Your domain (example: words, categories, etc.)
Y: A domain with interesting properties for your problem.
F: An injective function that translates from X to Y.
The tricky part: creating F.
13. Interesting property
If word A and word B always appear in similar contexts, then cosine_similarity(F(A), F(B)) tends to 1.
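To make the definition and the property concrete, here is a minimal sketch in plain Python. The lookup table `F` and its three vectors are invented for illustration; a real F would be learned from data rather than written by hand.

```python
import math

# Hypothetical toy embedding F: a lookup table from words (X) to vectors (Y).
# The numbers are made up for illustration; they are not learned vectors.
F = {
    "bike": (0.9, 0.1, 0.3),
    "bicycle": (0.8, 0.2, 0.3),
    "tent": (-0.2, 0.9, -0.4),
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# F is injective here: no two words share the same vector.
assert len(set(F.values())) == len(F)

similar = cosine_similarity(F["bike"], F["bicycle"])
different = cosine_similarity(F["bike"], F["tent"])
print(f"bike vs bicycle: {similar:.3f}")
print(f"bike vs tent:    {different:.3f}")
```

Words that share contexts end up with nearly parallel vectors (similarity close to 1), while unrelated words do not.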
14. Gensim code
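# import modules & set up logging
import logging
import gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences (min_count=1 keeps every word)
model = gensim.models.Word2Vec(sentences, min_count=1)
# vocabulary in index order (gensim >= 4.0; older versions call this index2word)
indexes = model.wv.index_to_key
# embedding matrix: row i is the vector for indexes[i]
embedding = model.wv.vectors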
15. Search2Vec
or: what does all this have to do with searches...
Based on the paper “Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising”.
16. Remember the queries...
Search Sessions from OLX Data
S1 13inch_rims rims 205_60_13 205 205_13inch
S2 mountain_bicycle fiets bike bicycle
S3 honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
S4 fencing devils_fork
S5 ferraro ferrari lamboghini porsche ewings
S6 catering_table funeral_tent wedding_tent bar_stool tiffany_chairs
17. Remember the queries...
Search Sessions from OLX Data
S1 honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
Train samples:
(honda_nc_700, suzuki_sv650)
(suzuki_sv650, honda_nc_700)
(suzuki_sv650, honda_cbx_250_twister)
(honda_cbx_250_twister, suzuki_sv650)
(honda_cbx_250_twister, honda_xr_125)
(honda_xr_125, honda_cbx_250_twister)
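The train samples above can be reproduced with a short sketch. It assumes a context window of one query on each side, which is what the listed pairs imply; the window size actually used for training is not stated in the slides, and `context_pairs` is a hypothetical helper name.

```python
def context_pairs(session, window=1):
    """Return (query, context_query) pairs for every query in a session.

    `window` is the number of neighbouring queries taken on each side.
    window=1 is an assumption inferred from the pairs on the slide.
    """
    pairs = []
    for i, query in enumerate(session):
        start = max(0, i - window)
        end = min(len(session), i + window + 1)
        for j in range(start, end):
            if j != i:  # a query is never its own context
                pairs.append((query, session[j]))
    return pairs

session = ["honda_nc_700", "suzuki_sv650", "honda_cbx_250_twister", "honda_xr_125"]
pairs = context_pairs(session, window=1)
for pair in pairs:
    print(pair)
```

A larger window would also pair queries that are further apart in the same session, for example (honda_nc_700, honda_cbx_250_twister) with `window=2`.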
18. Training Data
~110M searches across a year
~12M sessions (aka sentences)
~4M unique searches
Preprocessing (pyspark):
● lowercase; remove trailing spaces, stopwords, punctuation marks, double spaces, etc.
● remove outliers:
long “sentences” (sessions)
long-tail queries (<10 occurrences)
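The preprocessing steps above can be sketched as follows. This is plain Python rather than the PySpark job used in the talk, and several details are assumptions: the stopword list, the maximum session length, joining tokens with underscores (to match the query format shown on the earlier slides), and dropping sessions left with fewer than two queries.

```python
import string
from collections import Counter

STOPWORDS = {"a", "an", "the", "for", "in", "of"}  # hypothetical stopword list
MAX_SESSION_LENGTH = 20  # hypothetical cut-off for "long" sessions
MIN_QUERY_COUNT = 10     # from the slide: drop queries with <10 occurrences

# Map every punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize_query(raw):
    """Lowercase, strip punctuation, extra spaces and stopwords."""
    text = raw.lower().translate(_PUNCT_TO_SPACE)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return "_".join(tokens)  # assumption: one token per query, joined by "_"

def clean_sessions(sessions, min_count=MIN_QUERY_COUNT, max_len=MAX_SESSION_LENGTH):
    """Normalize queries, then drop long sessions and rare queries."""
    normalized = [[q for q in map(normalize_query, s) if q] for s in sessions]
    normalized = [s for s in normalized if len(s) <= max_len]
    counts = Counter(q for s in normalized for q in s)
    kept = [[q for q in s if counts[q] >= min_count] for s in normalized]
    # Assumption: a session needs at least two queries to yield training pairs.
    return [s for s in kept if len(s) >= 2]

raw_sessions = [
    ["  Mountain  Bicycle ", "bike!"],
    ["bike", "Mountain bicycle"],
    ["the bike", "rare query"],
]
cleaned = clean_sessions(raw_sessions, min_count=2)  # low threshold for the toy data
print(cleaned)
```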
20. Offline evaluation
If you are searching for "${search_string}", do you expect similar results for "${related_query}"?
● 1) very similar results
● 2) related results
● 3) very different results
25. Tail query vectors
Step 3: inverted index for fast matching (BM25)
input query | top result | top result’s document
diving equipment | scuba_diving_gear | scuba diving equipment diving gear scuba equipment scuba gear scuba shop
cusinart machine | bread_machines | bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes
off road bakkie | 4x4 | jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road
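A minimal sketch of this matching step: build an inverted index over the documents of the indexed queries, score a tail query with BM25, and take the top result (whose vector the tail query would then borrow). The documents are copied from the slide; the column split for the last row is one reading of the slide, the BM25 parameters are common defaults rather than values from the talk, and a production system would typically use a search engine instead of this hand-written scorer.

```python
import math
from collections import Counter

# Indexed queries and their documents, copied from the slide.
# The key "4x4" for the last row is an interpretation of the flattened table.
DOCUMENTS = {
    "scuba_diving_gear": "scuba diving equipment diving gear scuba equipment scuba gear scuba shop",
    "bread_machines": "bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes",
    "4x4": "jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road",
}

K1, B = 1.2, 0.75  # common BM25 defaults, not taken from the talk

def build_index(documents):
    """Inverted index: term -> {doc_id: term frequency}, plus document lengths."""
    postings, lengths = {}, {}
    for doc_id, text in documents.items():
        tokens = text.split()
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings.setdefault(term, {})[doc_id] = tf
    return postings, lengths

def bm25_search(query, postings, lengths):
    """Return (doc_id, score) pairs sorted by descending BM25 score."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = Counter()
    for term in query.split():
        docs_with_term = postings.get(term, {})
        if not docs_with_term:
            continue  # term not in the index
        n_t = len(docs_with_term)
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        for doc_id, tf in docs_with_term.items():
            norm = tf + K1 * (1 - B + B * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (K1 + 1) / norm
    return scores.most_common()

postings, lengths = build_index(DOCUMENTS)
for tail_query in ["diving equipment", "cusinart machine", "off road bakkie"]:
    results = bm25_search(tail_query, postings, lengths)
    print(tail_query, "->", results[0][0] if results else None)
```

Because only documents that contain a query term are touched, lookups stay fast even when the index holds millions of queries.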
26. Offline Analysis with holdout data
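[Diagram: the query vectors, ordered by frequency, are split 90% / 10%. The 90% part goes into the index; the 10% part is used for testing the matching. For a held-out query, cosine similarity is computed between Vq-context (example: 0.3 1.3 6.2 0.5 3.1) and Vq-index, the vector of the top result (example: 0.2 1.4 7.2 0.6 6.1).]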
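The comparison in this analysis can be sketched with the two example vectors from the slide. The helper `mean_cosine` is hypothetical, and averaging the similarity over all held-out queries is an assumption about how the per-query scores might be summarized; the slides do not say which aggregate was reported.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_cosine(vector_pairs):
    """Average similarity over (Vq-context, Vq-index) pairs of held-out queries."""
    sims = [cosine_similarity(ctx, idx) for ctx, idx in vector_pairs]
    return sum(sims) / len(sims)

# Example vectors taken from the slide.
v_context = [0.3, 1.3, 6.2, 0.5, 3.1]  # learned from the query's own sessions
v_index = [0.2, 1.4, 7.2, 0.6, 6.1]    # vector of the top result from the index

similarity = cosine_similarity(v_context, v_index)
print(f"cosine similarity: {similarity:.3f}")
print(f"mean over holdout: {mean_cosine([(v_context, v_index)]):.3f}")
```

A value close to 1 means the vector obtained through the index is a good stand-in for the vector the query would have received from its own sessions.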
33. Possible extensions
Include more entities in the sessions:
● Listings the user interacted with
● Categories the user browsed
● Locations the user searched for/interacted with
Meta-Prod2Vec:
Add side information while generating pairs.
Reference: “Meta-Prod2Vec - Product Embeddings Using Side-Information for Recommendation”