Query expansion using
semantic query embeddings
Me, OLX, The team
My name is Mariano Semelman!
I come from Argentina.
@msemelman
mariano.semelman@olx.com
I’m a Data Scientist with 6 years of experience
working in:
● Behavioural targeting
● Natural language processing
● Recommendation systems
● Search engines
Me, OLX, The team
● OLX: online classifieds platform
● Berlin Shared Service: support and center of expertise for the rest of the platform
● PnR Services Team: search, recommender systems, big data
Me, OLX, The team
Vladan Radosavljevic, Head of Data Science
Mariano Semelman, Senior Data Scientist
Manish Saraswat, Data Scientist
Vaibhav Sharma, Data Scientist
So frustrating...
Reasons: typos, wrong brand/model combinations, local terms, overly specific queries, etc.
What if we could search not just for what the
user typed, but also for highly similar
queries that mean the same thing?
Sessions
Search Sessions from OLX South Africa
S1 “13inch rims”, “rims”, “205 60 13”, “205”, “205_13inch”
S2 “mountain bicycle”, “fiets”, “bike”, “bicycle”
S3 “honda nc 700”, “suzuki sv650”, “honda cbx 250 twister”, “honda xr 125”
S4 “fencing”, “devils fork”
S5 “ferraro”, “ferrari”, “lamboghini”, “porsche”, “ewings”
S6 “catering table”, “funeral tent”, “wedding tent”, “bar stool”, “tiffany chairs”
Word2Vec
or How I Learned to Stop
Worrying and Love Embeddings
Embedding
Definition (very easy!):
F: X↪Y
X: your domain (for example: words, categories, etc.)
Y: a domain with interesting properties for your problem.
F: an injective function that maps X to Y.
The tricky part: creating F.
Word2Vec (skip-gram flavour)
The fake task!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Under the hood
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Chapeau!
http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
Interesting property
If words A and B consistently appear in similar
contexts, then cosine_similarity(F(A), F(B))
tends towards 1.
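To make the property concrete, here is a minimal sketch of cosine similarity in plain Python. The function name follows the slide; the toy vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-d "embeddings": same direction vs. orthogonal directions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors that point the same way score close to 1 regardless of their length, which is why the measure suits embeddings whose norms vary with word frequency.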
Gensim code
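# import modules and set up logging
import logging
import gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
# vocabulary (row index -> word) and the embedding matrix, row-aligned
# (gensim >= 4.0 name; gensim 3.x called this attribute index2word)
indexes = model.wv.index_to_key
embedding = model.wv.vectors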
Search2Vec
or What does all this have to do
with searches...
Based on the paper “Scalable Semantic Matching of Queries to
Ads in Sponsored Search Advertising”.
Remember the queries...
Search Sessions from OLX Data
S1 13inch_rims rims 205_60_13 205 205_13inch
S2 mountain_bicycle fiets bike bicycle
S3 honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
S4 fencing devils_fork
S5 ferraro ferrari lamboghini porsche ewings
S6 catering_table funeral_tent wedding_tent bar_stool tiffany_chairs
Remember the queries...
Search Sessions from OLX Data
S1 honda_nc_700 suzuki_sv650 honda_cbx_250_twister honda_xr_125
Train samples:
(honda_nc_700, suzuki_sv650)
(suzuki_sv650, honda_nc_700)
(suzuki_sv650, honda_cbx_250_twister)
(honda_cbx_250_twister, suzuki_sv650)
(honda_cbx_250_twister, honda_xr_125)
(honda_xr_125, honda_cbx_250_twister)
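The train samples above are consistent with skip-gram pairs generated with a context window of 1, where each query is paired with its immediate neighbours in the session. A minimal sketch under that assumption follows; the window size is inferred from the listed pairs, not stated on the slide, and `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(session, window=1):
    """Return (target, context) pairs for every query in a session."""
    pairs = []
    for i, target in enumerate(session):
        start = max(0, i - window)
        stop = min(len(session), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a query is never its own context
                pairs.append((target, session[j]))
    return pairs


session = ["honda_nc_700", "suzuki_sv650", "honda_cbx_250_twister", "honda_xr_125"]
for pair in skipgram_pairs(session):
    print(pair)
```

A larger window would also pair queries that are two or more searches apart in the same session.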
Training Data
~110M searches across a year
~12M sessions (aka sentences)
~4M unique searches
Preprocessing (pyspark):
● lowercase; remove trailing spaces, stopwords, punctuation marks, double spaces, etc.
● drop outliers:
long “sentences”
long-tail queries (<10 occurrences)
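The preprocessing steps can be sketched as follows. This is plain Python rather than PySpark, so that it runs on its own. The stopword list, the `max_len` cutoff for long sessions and the rule that a session needs at least two surviving queries are assumptions for illustration. Only the threshold of 10 occurrences comes from the slide.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "for", "in", "of"}  # hypothetical list


def normalize_query(raw):
    """Lowercase, strip punctuation and extra spaces, drop stopwords, join with '_'."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return "_".join(tokens)


def clean_sessions(sessions, min_count=10, max_len=20):
    """Drop over-long sessions and queries seen fewer than min_count times."""
    normalized = [[q for q in map(normalize_query, s) if q] for s in sessions]
    counts = Counter(q for s in normalized for q in s)
    cleaned = []
    for s in normalized:
        if len(s) > max_len:  # outlier: very long "sentence"
            continue
        kept = [q for q in s if counts[q] >= min_count]
        if len(kept) >= 2:  # assumption: a single query gives no context pairs
            cleaned.append(kept)
    return cleaned


print(normalize_query("  Honda  NC 700 "))
```

The underscore join matches the token format shown for the sessions, so each multi-word query becomes a single "word" for Word2Vec.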
We have our model...
Offline evaluation
If you are searching for "${search_string}", do you expect similar results for "${related_query}"?
● 1) very similar results
● 2) related results
● 3) very different results
Tail queries
Limitations
Head queries: 162k embeddings =)
Tail queries: 3.8M =(
[Figure: query frequency plot; axes: Query, Frequency; marked value: 10]
Tail queries
Step 1: find top K queries for each head query from the vocabulary

query              expansions              score
scuba diving gear  scuba diving equipment  0.792
                   diving gear             0.766
                   scuba equipment         0.765
                   scuba gear              0.764
                   scuba shop              0.763

query              expansions              score
bread machines     bread maker             0.728
                   bread machines          0.722
                   cusinart bread maker    0.644
                   bread machine reviews   0.621
                   bread machine recipes   0.605

query              expansions              score
4x4                jeep                    0.824
                   4x4 jeep                0.819
                   isuzu 4x4               0.805
                   toyota 4x4              0.793
                   hilux 4x4               0.790
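Step 1 amounts to a nearest-neighbour lookup in the embedding space. Below is a minimal brute-force sketch, assuming cosine similarity as the score. The 2-d vectors are invented for illustration and `top_k_expansions` is a hypothetical helper, not the team's code. With a vocabulary of 162k head queries, an approximate nearest-neighbour index would likely replace the full scan.

```python
import heapq
import math


def unit(v):
    """Scale a vector to length 1 so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k_expansions(head_query, vectors, k=5):
    """Return the k most similar other queries as (query, score) pairs."""
    q = unit(vectors[head_query])
    scored = (
        (sum(a * b for a, b in zip(q, unit(vec))), other)
        for other, vec in vectors.items()
        if other != head_query
    )
    return [(other, round(score, 3)) for score, other in heapq.nlargest(k, scored)]


# Hypothetical toy embeddings.
vectors = {
    "4x4": [1.0, 0.1],
    "jeep": [0.9, 0.2],
    "hilux 4x4": [0.8, 0.3],
    "bread maker": [0.1, 1.0],
}
print(top_k_expansions("4x4", vectors, k=2))
```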
Tail query
Step 2: form query documents (i.e., flatten)

id                 document
scuba_diving_gear  scuba diving equipment diving gear scuba equipment scuba gear scuba shop
bread_machines     bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes
4x4                jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road
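Flattening can be sketched as joining each head query's expansions into one text document, keyed by the underscore form of the head query. The input layout (a dict of ranked `(expansion, score)` lists) and the function name are assumptions for illustration.

```python
def build_query_documents(expansions):
    """Map each head query id to a flat document made of its expansions."""
    return {
        head.replace(" ", "_"): " ".join(query for query, _score in ranked)
        for head, ranked in expansions.items()
    }


expansions = {
    "scuba diving gear": [
        ("scuba diving equipment", 0.792),
        ("diving gear", 0.766),
        ("scuba equipment", 0.765),
        ("scuba gear", 0.764),
        ("scuba shop", 0.763),
    ],
}
print(build_query_documents(expansions))
```

The similarity scores are dropped at this point. Only the words of the expansions matter for the text matching that follows.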
Tail query vectors
Step 3: build an inverted index for fast matching (BM25)

input query       top result         top result’s document
diving equipment  scuba_diving_gear  scuba diving equipment diving gear scuba equipment scuba gear scuba shop
cusinart machine  bread_machines     bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes
off road bakkie   4x4                jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road
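The matching step can be illustrated with a small in-memory inverted index scored with BM25. This is a sketch only: the slides do not say which search engine was used, and the `k1` and `b` values and the smoothed IDF formula are common defaults assumed here. A tail query is matched to the head query whose document scores highest, and could then reuse that head query's embedding.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}, plus document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.split()
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths, k1=1.2, b=0.75):
    """Rank documents for a query; returns (doc_id, score) pairs, best first."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue  # term unseen in any document
        idf = math.log((n_docs - len(postings) + 0.5) / (len(postings) + 0.5) + 1.0)
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    "scuba_diving_gear": "scuba diving equipment diving gear scuba equipment scuba gear scuba shop",
    "bread_machines": "bread maker bread machines cusinart bread maker bread machine reviews bread machine recipes",
    "4x4": "jeep 4x4 jeep isuzu 4x4 toyota 4x4 hilux 4x4 bakkie nissan 4x4 4x4 off road",
}
index, lengths = build_index(docs)
for tail_query in ["diving equipment", "cusinart machine", "off road bakkie"]:
    print(tail_query, "->", bm25_search(tail_query, index, lengths)[0][0])
```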
Offline Analysis with holdout data
[Diagram: queries ordered by frequency, split 90% / 10%; labels: index, query vectors, use for testing the matching]
Vq-context: 0.3 1.3 6.2 0.5 3.1
Vq-index (top result): 0.2 1.4 7.2 0.6 6.1
Compared using cosine similarity.
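As a worked check, assuming the two vectors printed on the slide are meant to be compared component by component (they may be purely illustrative), the cosine similarity works out as:

```latex
\cos\left(V_{q\text{-context}}, V_{q\text{-index}}\right)
  = \frac{0.3 \cdot 0.2 + 1.3 \cdot 1.4 + 6.2 \cdot 7.2 + 0.5 \cdot 0.6 + 3.1 \cdot 6.1}
         {\sqrt{0.3^2 + 1.3^2 + 6.2^2 + 0.5^2 + 3.1^2}\,\sqrt{0.2^2 + 1.4^2 + 7.2^2 + 0.6^2 + 6.1^2}}
  = \frac{65.73}{\sqrt{50.08}\,\sqrt{91.41}}
  \approx 0.97
```

A value this close to 1 would mean the vector borrowed from the top index result points in nearly the same direction as the vector the query learned from its own sessions.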
Tail Queries Solution
Numbers, please!
We launched to production!!!
“playstation 4”, “peugot”, “ktm 900”
Searches with no results: 8.7% -> 0.6%
Increase in new contacts per day: +13.5%**
Thank you very much!
Questions?
Possible extensions
Include more entities in the sessions:
● Listings the user interacted with
● Categories the user browsed
● Locations the user searched for/interacted with
Meta-Prod2Vec:
Add side information while generating pairs.
Meta-Prod2Vec - Product Embeddings Using Side-Information for Recommendation

Search2Vec at OLX Group - Pydata Meetup Berlin
