Role of Data Science in eCommerce
Manojkumar Rangasamy Kannadasan
eBay Inc
June 2019
1
Agenda
● Background
● Fast Facts about eBay
● Data Science in eCommerce
● Data Science @ eBay Search
● Case Studies
2
Background
3
What is Data Science?
Data science is a multidisciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data - Wikipedia
4
5
Reference: https://datascience.berkeley.edu/about/what-is-data-science/
Why Data Science?
● Empowering management to make better decisions
● Directing actions based on trends—which in turn help to define goals
● Challenging the staff to adopt best practices and focus on issues that matter
● Identifying opportunities
● Decision making with quantifiable, data-driven evidence
● Testing these decisions
● Identification and refining of target audiences
● Recruiting the right talent for the organization
6
Reference: https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
Fast Facts about eBay
7
Frequency of Product Purchases
11
Data Science in eCommerce
Objective
12
● Help users find and discover products to purchase
● Maximize revenue / profit per user session
Data Science in Different Departments
● Search
● SEO
● Trust / Fraud / Abuse
● Selling
● Shipping
● Pricing
● Merchandising
13
● Ads / Marketing
● Structured Data
● Inventory Management
● Machine Translation
● Coupons & Rewards
● Customer Service
● Infrastructure
14
Data Science @ eBay Search
Data Science @ eBay Search
15
● Text Search
● Faceted Search
● Image Search
● Voice Search
● Conversational Search
● Recommendations
17
Query Autocompletion
18
Query Understanding
19
Ranking
20
Faceted Search
21
Spell Correction
22
Recommendations
23
Recommendations
24
Recommendations
25
Recommendations
26
Recommendations
Questions?
27
28
Case Studies
29
Query Categorization
Team Members: M. Liu, X. Liu & E. Luo
What is Query Categorization?
30
● Predict relevant product categories given a query
● Use high-confidence predictions to filter product listings
● Use confidence scores of the predictions to influence ranking
Why?
● 1.2 Billion Listings
● ~20K Categories & ~35 Verticals
31
Deep Semantic Similarity Model
32
Huang, He, Gao, Deng, Acero, Heck, “Learning deep structured semantic models for web search using clickthrough data”, CIKM, 2013
33
eBay Query Categorization
● Based on Convolutional Latent Semantic Model (CLSM)
○ Shen, He, Gao, Deng, Mesnil, “A latent semantic model with convolutional-pooling structure for IR,” CIKM 2014
● Maximize the posterior probability of a category given a query
Training - Data Collection
● Source data: Query - Product Category, Clicks, Transactions
● Confident Set: Queries w/ >= 90% products in a single category
● Ambiguous Set
● Subsamples by popularity: Train/Validation Data
● Test Data: Confident set from a future period
34
Query Categorization in Action
35
● Use historical data directly when there is a sufficient amount of it
● Use an experimentally determined confidence score threshold to pick top predictions
● Fall back to the parent category or the entire inventory when there are no high-confidence predictions
● Baseline = ngrams + BM25 + attribute filtration
● Absolute scale obfuscated
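The threshold-and-fallback logic in the bullets above can be sketched as follows. This is a minimal illustration, not eBay's implementation: the function name `select_categories`, the `parent_of` mapping, the `top_k` cap, and the choice to fall back to the parent of the best-scoring category are all assumptions made for the example.

```python
def select_categories(scores, threshold, parent_of, top_k=3):
    """Pick categories for a query from model confidence scores.

    scores:    dict mapping category name -> confidence score (hypothetical)
    threshold: experimentally determined cutoff (value is an assumption)
    parent_of: dict mapping category name -> parent category (hypothetical)
    Returns a list of categories; an empty list means "search the entire inventory".
    """
    confident = [(score, cat) for cat, score in scores.items() if score >= threshold]
    if confident:
        # Highest score first; ties broken alphabetically for determinism.
        confident.sort(key=lambda sc: (-sc[0], sc[1]))
        return [cat for _, cat in confident[:top_k]]
    if not scores:
        return []
    # No confident prediction: fall back to the parent of the best guess.
    best = max(scores, key=scores.get)
    parent = parent_of.get(best)
    return [parent] if parent is not None else []
```

With a threshold of 0.8, a query scored `{"Digital Cameras": 0.92}` is filtered to that category, while a query whose best score is 0.45 is widened to the parent category or, if no parent is known, to the whole inventory.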
FastCat - Faster Training & Inference
36
● Based on Joulin et al., “Bag of tricks for efficient text classification”, arXiv, 2016
○ Shallow network but deep learning - no feature engineering
○ Bag of ngrams as input
○ Hierarchical softmax in the output layer: $\log_2 V$ outputs to evaluate
● Data collected as before
● Training time: 20X faster
● Inference time: < 1 ms
● Commodity hardware
● Comparable accuracy
[Diagram: query words W1, W2, ..., Wn-1, Wn feed a HIDDEN layer that outputs the CATEGORY]
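To make the "bag of ngrams" input and the log2 V output cost concrete, here is a small sketch. The helper names `word_ngrams` and `balanced_tree_depth` are hypothetical, and a balanced binary tree is assumed for simplicity; the model in Joulin et al. builds its hierarchical softmax on a Huffman tree, so real path lengths vary by class frequency.

```python
import math


def word_ngrams(query, max_n=2):
    """Return the word 1-grams up to max_n-grams of a query, in order."""
    tokens = query.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def balanced_tree_depth(num_classes):
    """Binary decisions needed to reach one of num_classes leaves
    in a balanced binary tree (an assumption, see lead-in)."""
    return math.ceil(math.log2(num_classes))
```

For roughly 20K categories, `balanced_tree_depth(20000)` gives 15 binary decisions per prediction instead of about 20,000 output units for a flat softmax.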
Questions?
37
38
Personalized Query Autocompletion
Team Members: Manojkumar R Kannadasan, Grigor Aslanyan
39
Query Autocompletion
Why?
40
● Saves time for users
● Guides users to reach their products faster
● Avoids spelling errors
● Can help promote top products
Why is it Challenging & Fun?
● Millions of Users
● Humongous Amount of Queries per sec
● Show Relevant Suggestions to users
● Detect spelling errors and provide corrected suggestions
41
Most Popular Completions - Overview
42
[Diagram components: User Prefix, Query Data, Most Popular Completions (MPC), Get Top N Queries]
Most Popular Completions - Naive Approach
● Show Queries matching prefix based on popularity
● Popularity can be measured by query frequency or by sales
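The naive approach above can be sketched in a few lines. This is an illustrative version only: `most_popular_completions` and the `query_counts` dictionary are hypothetical names, and a linear scan is used for clarity where a production system would use an index such as a trie.

```python
def most_popular_completions(prefix, query_counts, n=5):
    """Return the n most popular logged queries that start with prefix.

    query_counts: dict mapping query string -> popularity
                  (frequency or sales; hypothetical data).
    """
    matches = [(q, c) for q, c in query_counts.items() if q.startswith(prefix)]
    # Most popular first; ties broken alphabetically for determinism.
    matches.sort(key=lambda qc: (-qc[1], qc[0]))
    return [q for q, _ in matches[:n]]
```

Every user typing the same prefix receives the same list, which is the limitation the personalized re-ranker addresses.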
43
Personalized Query Autocompletion
● A user's queries within a session revolve around one or more intents
● Global query completions may be sub-optimal
44
Example session queries: "Dslr camera", "Canon dslr camera", "Canon 5D Mark IV", "Canon lenses"
Personalized Re-Ranker Overview
45
[Diagram components: User Prefix, Query Data, Most Popular Completions (MPC), Get Top N Queries, Query Features, User Features, Re-Ranker]
Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, “Personalized Query Auto-Completion Through a Lightweight Representation of the User Context” [Under Review]
Data Collection
● Billions of User Sessions
● Capture user behavioral activity
○ Prefix
○ Query Clicked from Autocomplete
○ Previous Queries issued by user
○ Queries viewed and not clicked
○ Global performance of the query
46
Understanding User Context
47
Previous session queries: "Dslr camera", "Canon dslr camera", "Canon 5D Mark IV", "Canon lenses"
User starts typing "C"
Understanding User Context
● Features computed based on previous queries issued by the user
○ Textual features like ngrams, # of terms, frequency, session-based etc
○ Similarity features based on text
○ Similarity features based on Vector representations
● Query Vectors can be learned by
○ Supervised - query transitions, queries from product co-clicks
○ Semi-Supervised - Word2Vec, fastText, GloVe
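As a rough illustration of the similarity features listed above, the sketch below computes one text-based feature (token Jaccard overlap) and one vector-based feature (cosine similarity). The function names and the choice of taking the maximum over previous queries are assumptions for the example; the actual feature set is described in the authors' paper.

```python
import math


def jaccard(query_a, query_b):
    """Token-set overlap between two queries, in [0, 1]."""
    a, b = set(query_a.lower().split()), set(query_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def max_text_similarity(candidate, previous_queries):
    """Highest Jaccard overlap between a candidate completion and the session history."""
    return max((jaccard(candidate, q) for q in previous_queries), default=0.0)
```

A candidate such as "canon lenses" scores higher against a session containing "canon dslr camera" than an unrelated candidate with the same prefix, which is the signal a re-ranker can use.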
48
Model Training
● Positive Samples
○ Queries clicked in Autocomplete
● Negative Samples
○ Queries viewed and not clicked in Autocomplete
● Train a Machine Learned Ranking Model
○ Ref: https://en.wikipedia.org/wiki/Learning_to_rank
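The labeling scheme above can be sketched as a small data-preparation step. The log record layout (`prefix`, `shown`, `clicked` keys) is a hypothetical format chosen for the example, not the format of eBay's logs.

```python
def build_training_samples(impressions):
    """Turn autocomplete impressions into (prefix, suggestion, label) rows.

    Each impression is a dict with hypothetical keys:
      "prefix":  what the user had typed
      "shown":   list of suggestions displayed
      "clicked": the suggestion the user clicked, or None
    Label 1 = clicked (positive), 0 = viewed but not clicked (negative).
    """
    samples = []
    for imp in impressions:
        if imp.get("clicked") is None:
            continue  # no click, so no positive to contrast against
        for suggestion in imp["shown"]:
            label = 1 if suggestion == imp["clicked"] else 0
            samples.append((imp["prefix"], suggestion, label))
    return samples
```

Rows sharing a prefix and impression form one ranking group, which is the unit a learning-to-rank model is typically trained on.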
49
Evaluation
● MRR, Success Rate, MAP & nDCG
○ 20% - 30%** lift over MPC
○ 5% - 10%** lift over Non-Personalized Re-Ranker
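For reference, Mean Reciprocal Rank (MRR), the first metric listed, can be computed as in the sketch below. The function name and input layout are assumptions for the example; suggestions missing from a ranking contribute zero.

```python
def mean_reciprocal_rank(rankings, clicked):
    """Average of 1/rank of the clicked query across impressions.

    rankings: list of ranked suggestion lists (one per impression)
    clicked:  list of the query the user actually chose (same length)
    """
    if not rankings:
        return 0.0
    total = 0.0
    for ranking, target in zip(rankings, clicked):
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)  # ranks are 1-based
    return total / len(rankings)
```

A lift over MPC is then the relative change in this value when the same impressions are re-ranked by the personalized model.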
50
** Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, “Personalized Query Auto-Completion Through a Lightweight
Representation of the User Context” [Under Review]
Questions?
51
52
Spell Correction
Team Members: Utkarsh Porwal & Roberto Konow
Why?
53
● Product names can be difficult to spell
● Users will appreciate the help
● Sellers will appreciate the help
● It is challenging and fun!
54
Spell Correction
Why is it Challenging and Fun?
● Special - Query Spell Correction for user generated item information
● Big - Millions of Users, Billions of Items
● Efficiency - Need to process humongous amount of queries per sec
● Precision - Suggest the right spelling correction for the right query
55
Overview
56
[Diagram: Query → Candidate Generation → Language Model and Error Model → Ranking → Corrected Query. Stage annotations: Efficiency, Big & Special, Big & Special, Precision]
Mathematical Formulation
57
Reference: http://norvig.com/spell-correct.html
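The formula on this slide did not survive extraction. The noisy-channel formulation from the cited Norvig article, written in this deck's notation (q for the typed query, c for a candidate correction), is reproduced here as a reconstruction rather than the slide's exact rendering:

```latex
\hat{c} \;=\; \arg\max_{c} \; p(c \mid q)
        \;=\; \arg\max_{c} \; \frac{p(q \mid c)\, p(c)}{p(q)}
        \;=\; \arg\max_{c} \; p(q \mid c)\, p(c)
```

Here $p(c)$ is the language model, $p(q \mid c)$ is the error model, and $p(q)$ is dropped because it is the same for every candidate.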
Efficiency
58
Query: top
Candidates at edit distance 1 for a query of length n:
● n deletions
● n-1 transpositions
● 26n alterations
● 26(n+1) insertions
● Total: 54n+25
Example candidates: qop, op, sop, …, thp, tap, tkp, …, tpn, tops
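The 54n+25 count can be checked with a candidate generator modeled on the one in the Norvig article cited earlier. This sketch returns a list rather than a set so that duplicates are counted, matching the slide's arithmetic; the `known` helper and its dictionary are hypothetical additions that show the "generate only the ones we know" filter.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings one edit away from word, duplicates included."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return deletes + transposes + replaces + inserts


def known(candidates, dictionary):
    """Keep only candidates that appear in a dictionary of known words."""
    return sorted(set(candidates) & set(dictionary))
```

For "top" (n = 3) this produces 54 * 3 + 25 = 187 candidates, of which only a handful are real words, which motivates the dictionary-driven structures on the next slides.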
Efficiency
59
Query: top
Generate only the ones we know
Example candidates: qop, op, sop, …, thp, tap, tkp, …, tpn, tops
Efficiency
60
Generate only the ones we know?
Source: Wikipedia
Known words: tap, taps, top, tops
Efficiency
61
Generate only the ones we know?
Source: Wikipedia
Known words: tap, taps, top, tops
Efficiency
62
Generate only the ones we know?
Source: http://ajainarayanan.github.io/ctrlf/
Known words: tap, taps, top, tops
Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn:
Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees. EMNLP 2015
Efficiency - Which one?
● Naïve: Slow, no memory footprint, unnecessary candidates (?)
● Trie: Faster, Huge memory footprint
● DAWG: Even Faster, Not-that-huge memory footprint
● Suffix Trees (not compressed): Humongous memory footprint
● Suffix Trees (compressed): Slowest, very small memory footprint
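To illustrate the trie option, the sketch below stores known words in a trie and walks it while maintaining one row of the edit-distance table per node, so only dictionary words within the distance bound are ever produced. This is a generic technique, not eBay's implementation; it uses plain Levenshtein distance (no transpositions), and the class and method names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def within_distance(self, query, max_dist=1):
        """Known words whose Levenshtein distance to query is <= max_dist."""
        results = []
        first_row = list(range(len(query) + 1))
        for ch, child in self.root.children.items():
            self._search(child, ch, ch, query, first_row, max_dist, results)
        return sorted(results)

    def _search(self, node, ch, prefix, query, prev_row, max_dist, results):
        # Build the edit-distance row for prefix from the row of its parent.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            insert_cost = row[i - 1] + 1
            delete_cost = prev_row[i] + 1
            replace_cost = prev_row[i - 1] + (query[i - 1] != ch)
            row.append(min(insert_cost, delete_cost, replace_cost))
        if node.is_word and row[-1] <= max_dist:
            results.append(prefix)
        # Prune: if every entry exceeds the bound, no extension can recover.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                self._search(child, next_ch, prefix + next_ch,
                             query, row, max_dist, results)
```

The pruning step is what avoids the unnecessary candidates of the naive generator, at the cost of holding the whole dictionary in memory, which is the trade-off the bullets above describe.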
63
Language Model
● How likely is the candidate - $p(c)$?
● $p(c_1 c_2 c_3 \ldots c_n)$? = p(levis blue jeans 32 in)?
● Naive Algorithm - look for number of occurrences of given query
○ What if we have never seen the query
○ Long queries will have poor count leading to poor probability estimates
● Markov Assumption - second order
○ $p(c_1 c_2 c_3 \ldots c_n) = p(c_1)\, p(c_2 \mid c_1)\, p(c_3 \mid c_1 c_2) \cdots p(c_n \mid c_{n-2} c_{n-1})$
64
Language Model
● p(levis blue jeans 32 in) = p(levis)p(blue|levis)p(jeans|levis blue)p(32|blue jeans)p(in|jeans 32)
● p(blue|levis) = count(levis,blue) / count(levis)
● Now we have to only deal with unigrams, bigrams and trigrams
● There are still issues
○ Words that we have never seen - we still need to assign some probability
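The count-based estimate above can be sketched as an unsmoothed trigram model. The function names and the toy query log are assumptions for the example. Because there is no smoothing, any unseen n-gram drives the probability to zero, which is exactly the issue the last bullet raises.

```python
from collections import Counter


def train_ngram_counts(queries):
    """Count all word unigrams, bigrams and trigrams in a query log."""
    counts = Counter()
    for query in queries:
        tokens = query.split()
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def query_probability(query, counts):
    """Second-order Markov estimate: each word conditioned on up to two
    preceding words, using raw count ratios (no smoothing)."""
    total_unigrams = sum(c for gram, c in counts.items() if len(gram) == 1)
    tokens = query.split()
    prob = 1.0
    for i, token in enumerate(tokens):
        context = tuple(tokens[max(0, i - 2):i])
        numerator = counts[context + (token,)]
        denominator = counts[context] if context else total_unigrams
        if numerator == 0 or denominator == 0:
            return 0.0  # unseen n-gram: the unsmoothed estimate collapses
        prob *= numerator / denominator
    return prob
```

With the toy log `["levis blue jeans", "levis blue jeans 32 in", "levis black jeans"]`, p(blue|levis) = count(levis, blue) / count(levis) = 2/3, matching the formula on the slide.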
65
Error Model
● p(query|correction)?
● How likely is it that the user wanted to type the correction but typed the query?
● Multiple ways to estimate this
○ Keyboard distance
○ Phonetic distance
○ Mine your logs
66
Error Model
Industry approach
● To train an error model we need triples of (intended word, observed word, count)
● We would expect
○ p(the|the) to be very high
○ p(teh|the) to be relatively high
○ p(hippopotamus|the) to be extremely low
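Given such triples, a simple maximum-likelihood error model normalizes each count by the total for its intended word. The function name is hypothetical and the counts in the usage note are invented for illustration; they are not figures from eBay's logs.

```python
from collections import defaultdict


def train_error_model(triples):
    """Estimate p(observed | intended) from (intended, observed, count) triples.

    Returns a dict keyed by (observed, intended).
    """
    totals = defaultdict(int)
    pair_counts = defaultdict(int)
    for intended, observed, count in triples:
        totals[intended] += count
        pair_counts[(observed, intended)] += count
    return {pair: count / totals[pair[1]] for pair, count in pair_counts.items()}
```

With invented counts of 9500 for ("the", "the"), 480 for ("the", "teh") and 20 for ("the", "hte"), this gives p(the|the) = 0.95 and p(teh|the) = 0.048, while a pair that was never observed, such as p(hippopotamus|the), has no entry and can be treated as near zero.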
67
Error Model
● Get 10 million most frequent unigrams
● Get all the candidates at certain edit distance (depending on word length)
● This will give a huge tuple list <apple, applo>
● Assumption is that top 10 million are generally correct
● Prune this list based on frequency - apple should be at least 10x more frequent than applo
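The mining procedure above can be sketched as follows. Several details are assumptions made for the example: a fixed edit distance of 1 (the slide varies it by word length), the requirement that the misspelling itself appears in the logs, and the names `mine_error_pairs` and `min_ratio`.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Set of strings one edit (delete, transpose, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def mine_error_pairs(freq, min_ratio=10):
    """Return (intended, misspelling) pairs where the intended word is at
    least min_ratio times more frequent than the observed misspelling.

    freq: dict mapping unigram -> frequency in the query logs (hypothetical).
    """
    pairs = []
    for word, word_freq in freq.items():
        for candidate in edits1(word):
            cand_freq = freq.get(candidate, 0)
            if candidate != word and cand_freq > 0 and word_freq >= min_ratio * cand_freq:
                pairs.append((word, candidate))
    return sorted(pairs)
```

The frequency ratio keeps pairs like (apple, applo) while rejecting pairs of two legitimate words with similar frequencies, such as apple and apply.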
68
Questions?
69
Hiring @ eBay
70
Students & Recent Graduates
https://careers.ebayinc.com/join-our-team/students-recent-graduates/
71
Start your Career @ eBay
https://careers.ebayinc.com/join-our-team/start-your-search/
Q & A
mkannadasan@ebay.com
72
73
Language Model
● p(levis blue jeans 32 in) = p(levis)p(blue|levis)p(jeans|levis blue)p(32|blue jeans)p(in|jeans 32)
● p(blue|levis) = count(levis,blue) / count(levis)
● Now we have to only deal with unigrams, bigrams and trigrams
● There are still issues
○ Words that we have never seen - we still need to assign some probability
○ Adjustment of probabilities to demote high freq words - the, a etc
○ Backoff scores - KenLM (https://kheafield.com/code/kenlm/)
74

Contenu connexe

Tendances

Recommendation System
Recommendation SystemRecommendation System
Recommendation SystemAnamta Sayyed
 
Handling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseHandling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseArangoDB Database
 
Machine learning life cycle
Machine learning life cycleMachine learning life cycle
Machine learning life cycleRamjee Ganti
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Simplilearn
 
Getting Started with Geospatial Data in MongoDB
Getting Started with Geospatial Data in MongoDBGetting Started with Geospatial Data in MongoDB
Getting Started with Geospatial Data in MongoDBMongoDB
 
Introduction to Recommendation System
Introduction to Recommendation SystemIntroduction to Recommendation System
Introduction to Recommendation SystemMinha Hwang
 
Machine learning algorithms
Machine learning algorithmsMachine learning algorithms
Machine learning algorithmsShalitha Suranga
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerceAlexander Konduforov
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsLior Rokach
 
Use of data science in recommendation system
Use of data science in  recommendation systemUse of data science in  recommendation system
Use of data science in recommendation systemAkashPatil334
 
Data Mining and Recommendation Systems
Data Mining and Recommendation SystemsData Mining and Recommendation Systems
Data Mining and Recommendation SystemsSalil Navgire
 
The Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsThe Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsNeo4j
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringViet-Trung TRAN
 
How to Use Spatial Data Science in your Site Planning Process? [CARTOframes]
How to Use Spatial Data Science in your Site Planning Process? [CARTOframes] How to Use Spatial Data Science in your Site Planning Process? [CARTOframes]
How to Use Spatial Data Science in your Site Planning Process? [CARTOframes] CARTO
 
Social Network Analysis power point presentation
Social Network Analysis power point presentation Social Network Analysis power point presentation
Social Network Analysis power point presentation Ratnesh Shah
 
Learning to rank search results
Learning to rank search resultsLearning to rank search results
Learning to rank search resultsJettro Coenradie
 
Choosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectChoosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectOntotext
 
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine LearnGraphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine LearnNeo4j
 

Tendances (20)

Recommendation System
Recommendation SystemRecommendation System
Recommendation System
 
Handling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph DatabaseHandling Billions of Edges in a Graph Database
Handling Billions of Edges in a Graph Database
 
Machine learning life cycle
Machine learning life cycleMachine learning life cycle
Machine learning life cycle
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 
Getting Started with Geospatial Data in MongoDB
Getting Started with Geospatial Data in MongoDBGetting Started with Geospatial Data in MongoDB
Getting Started with Geospatial Data in MongoDB
 
Introduction to Recommendation System
Introduction to Recommendation SystemIntroduction to Recommendation System
Introduction to Recommendation System
 
Machine learning algorithms
Machine learning algorithmsMachine learning algorithms
Machine learning algorithms
 
Recommender systems for E-commerce
Recommender systems for E-commerceRecommender systems for E-commerce
Recommender systems for E-commerce
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Vector database
Vector databaseVector database
Vector database
 
Use of data science in recommendation system
Use of data science in  recommendation systemUse of data science in  recommendation system
Use of data science in recommendation system
 
Recommender system
Recommender systemRecommender system
Recommender system
 
Data Mining and Recommendation Systems
Data Mining and Recommendation SystemsData Mining and Recommendation Systems
Data Mining and Recommendation Systems
 
The Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsThe Case for Graphs in Supply Chains
The Case for Graphs in Supply Chains
 
Recommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filteringRecommender systems: Content-based and collaborative filtering
Recommender systems: Content-based and collaborative filtering
 
How to Use Spatial Data Science in your Site Planning Process? [CARTOframes]
How to Use Spatial Data Science in your Site Planning Process? [CARTOframes] How to Use Spatial Data Science in your Site Planning Process? [CARTOframes]
How to Use Spatial Data Science in your Site Planning Process? [CARTOframes]
 
Social Network Analysis power point presentation
Social Network Analysis power point presentation Social Network Analysis power point presentation
Social Network Analysis power point presentation
 
Learning to rank search results
Learning to rank search resultsLearning to rank search results
Learning to rank search results
 
Choosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectChoosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your Project
 
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine LearnGraphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
Graphs in Retail: Know Your Customers and Make Your Recommendations Engine Learn
 

Similaire à Role of Data Science in Personalizing eCommerce Search

Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Xavier Amatriain
 
Udacity webinar on Recommendation Systems
Udacity webinar on Recommendation SystemsUdacity webinar on Recommendation Systems
Udacity webinar on Recommendation SystemsAxel de Romblay
 
[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation Systems[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation SystemsAxel de Romblay
 
Multimodal Learning to Rank in Production Scale E-commerce Search
Multimodal Learning to Rank in Production Scale E-commerce SearchMultimodal Learning to Rank in Production Scale E-commerce Search
Multimodal Learning to Rank in Production Scale E-commerce SearchLucidworks
 
Adopting Data Science and Machine Learning in the financial enterprise
Adopting Data Science and Machine Learning in the financial enterpriseAdopting Data Science and Machine Learning in the financial enterprise
Adopting Data Science and Machine Learning in the financial enterpriseQuantUniversity
 
Active Learning on Question Answering with Dialogues
 Active Learning on Question Answering with Dialogues Active Learning on Question Answering with Dialogues
Active Learning on Question Answering with DialoguesJinho Choi
 
Data science in 10 steps
Data science in 10 stepsData science in 10 steps
Data science in 10 stepsQuantUniversity
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or realityAwantik Das
 
Learning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestionsLearning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestionsClaudia Hauff
 
Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019Sonya Liberman
 
Context Mining and Integration in Web Predictive Analytics
Context Mining and Integration in Web Predictive AnalyticsContext Mining and Integration in Web Predictive Analytics
Context Mining and Integration in Web Predictive AnalyticsJulia Kiseleva
 
Iterative Methodology for Personalization Models Optimization
 Iterative Methodology for Personalization Models Optimization Iterative Methodology for Personalization Models Optimization
Iterative Methodology for Personalization Models OptimizationSonya Liberman
 
ECIR Recommendation Challenges
ECIR Recommendation ChallengesECIR Recommendation Challenges
ECIR Recommendation ChallengesDaniel Kohlsdorf
 
Dynamic Search and Beyond
Dynamic Search and BeyondDynamic Search and Beyond
Dynamic Search and BeyondGrace Hui Yang
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptopRising Media, Inc.
 
Delivering Machine Learning Solutions by fmr Sears Dir of PM
Delivering Machine Learning Solutions by fmr Sears Dir of PMDelivering Machine Learning Solutions by fmr Sears Dir of PM
Delivering Machine Learning Solutions by fmr Sears Dir of PMProduct School
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makerszekeLabs Technologies
 
Scaling Quality on Quora Using Machine Learning
Scaling Quality on Quora Using Machine LearningScaling Quality on Quora Using Machine Learning
Scaling Quality on Quora Using Machine LearningVo Viet Anh
 
Cikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business ValueCikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business ValueXavier Amatriain
 

Similaire à Role of Data Science in Personalizing eCommerce Search (20)

Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
Recsys 2016 tutorial: Lessons learned from building real-life recommender sys...
 
Udacity webinar on Recommendation Systems
Udacity webinar on Recommendation SystemsUdacity webinar on Recommendation Systems
Udacity webinar on Recommendation Systems
 
[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation Systems[UPDATE] Udacity webinar on Recommendation Systems
[UPDATE] Udacity webinar on Recommendation Systems
 
Multimodal Learning to Rank in Production Scale E-commerce Search
Multimodal Learning to Rank in Production Scale E-commerce SearchMultimodal Learning to Rank in Production Scale E-commerce Search
Multimodal Learning to Rank in Production Scale E-commerce Search
 
Adopting Data Science and Machine Learning in the financial enterprise
Adopting Data Science and Machine Learning in the financial enterpriseAdopting Data Science and Machine Learning in the financial enterprise
Adopting Data Science and Machine Learning in the financial enterprise
 
Active Learning on Question Answering with Dialogues
 Active Learning on Question Answering with Dialogues Active Learning on Question Answering with Dialogues
Active Learning on Question Answering with Dialogues
 
Data science in 10 steps
Data science in 10 stepsData science in 10 steps
Data science in 10 steps
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
AI hype or reality
AI  hype or realityAI  hype or reality
AI hype or reality
 
Learning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestionsLearning by example: training users through high-quality query suggestions
Learning by example: training users through high-quality query suggestions
 
Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019Recommender Systems @ Scale - PyData 2019
Recommender Systems @ Scale - PyData 2019
 
Context Mining and Integration in Web Predictive Analytics
Context Mining and Integration in Web Predictive AnalyticsContext Mining and Integration in Web Predictive Analytics
Context Mining and Integration in Web Predictive Analytics
 
Iterative Methodology for Personalization Models Optimization
 Iterative Methodology for Personalization Models Optimization Iterative Methodology for Personalization Models Optimization
Iterative Methodology for Personalization Models Optimization
 
ECIR Recommendation Challenges
ECIR Recommendation ChallengesECIR Recommendation Challenges
ECIR Recommendation Challenges
 
Dynamic Search and Beyond
Dynamic Search and BeyondDynamic Search and Beyond
Dynamic Search and Beyond
 
1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop1440 track 2 boire_using our laptop
1440 track 2 boire_using our laptop
 
Delivering Machine Learning Solutions by fmr Sears Dir of PM
Delivering Machine Learning Solutions by fmr Sears Dir of PMDelivering Machine Learning Solutions by fmr Sears Dir of PM
Delivering Machine Learning Solutions by fmr Sears Dir of PM
 
Moving from BI to AI : For decision makers
Moving from BI to AI : For decision makersMoving from BI to AI : For decision makers
Moving from BI to AI : For decision makers
 
Scaling Quality on Quora Using Machine Learning
Scaling Quality on Quora Using Machine LearningScaling Quality on Quora Using Machine Learning
Scaling Quality on Quora Using Machine Learning
 
Cikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business ValueCikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business Value
 

Dernier

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Role of Consumer Insights in business transformation
Role of Consumer Insights in business transformationRole of Consumer Insights in business transformation
Role of Consumer Insights in business transformationAnnie Melnic
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Presentation of project of business person who are success
Presentation of project of business person who are successPresentation of project of business person who are success
Presentation of project of business person who are successPratikSingh115843
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 

Dernier (17)

6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Role of Data Science in Personalizing eCommerce Search

  • 1. Role of Data Science in eCommerce. Manojkumar Rangasamy Kannadasan, eBay Inc, June 2019
  • 2. Agenda ● Background ● Fast Facts about eBay ● Data Science in eCommerce ● Data Science @ eBay Search ● Case Studies
  • 4. What is Data Science? Data science is a multidisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data - Wikipedia
  • 6. Why Data Science? ● Empowering management to make better decisions ● Directing actions based on trends, which in turn help to define goals ● Challenging the staff to adopt best practices and focus on issues that matter ● Identifying opportunities ● Decision making with quantifiable, data-driven evidence ● Testing these decisions ● Identification and refining of target audiences ● Recruiting the right talent for the organization ● Reference: https://www.simplilearn.com/why-and-how-data-science-matters-to-business-article
  • 11. Data Science in eCommerce
  • 12. Objective ● Help users find and discover products to purchase ● Maximize revenue / profit per user session
  • 13. Data Science in Different Departments ● Search ● SEO ● Trust / Fraud / Abuse ● Selling ● Shipping ● Pricing ● Merchandising ● Ads / Marketing ● Structured Data ● Inventory Management ● Machine Translation ● Coupons & Rewards ● Customer Service ● Infrastructure
  • 14. Data Science @ eBay Search
  • 15. Data Science @ eBay Search ● Text Search ● Faceted Search ● Image Search ● Voice Search ● Conversational Search ● Recommendations
  • 16. Data Science @ eBay Search ● Text Search ● Faceted Search ● Image Search ● Voice Search ● Conversational Search ● Recommendations
  • 29. Query Categorization. Team Members: M. Liu, X. Liu & E. Luo
  • 30. What is Query Categorization? ● Predict relevant product categories given a query ● Use high-confidence predictions to filter product listings ● Use confidence scores of the predictions to influence ranking
  • 31. Why? ● 1.2 Billion Listings ● ~20K Categories & ~35 Verticals
  • 32. Deep Semantic Similarity Model ● Reference: Huang, He, Gao, Deng, Acero, Heck, “Learning deep structured semantic models for web search using clickthrough data”, CIKM, 2013
  • 33. eBay Query Categorization ● Based on Convolutional Latent Semantic Model (CLSM) ○ Shen, He, Gao, Deng, Mesnil, “A latent semantic model with convolutional-pooling structure for IR,” CIKM 2014 ● Maximize the posterior probability of a category given a query
  • 34. Training - Data Collection ● Input: Query - Product Category, Clicks, Transactions ● Confident Set: queries with >= 90% of products in a single category ● Ambiguous Set ● Subsamples by popularity form the Train/Validation Data ● Test Data: Confident set from a future period
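The confident/ambiguous split can be sketched as follows. This is a minimal illustration, assuming each query comes with per-category product counts taken from behavioral logs. The function name, the input shape and the sample counts are hypothetical; only the 90% threshold comes from the slide.

```python
from typing import Dict, Optional, Tuple

CONFIDENT_SHARE = 0.90  # threshold stated on the slide


def label_query(category_counts: Dict[str, int]) -> Tuple[str, Optional[str]]:
    """Return ("confident", category) or ("ambiguous", None) for one query."""
    total = sum(category_counts.values())
    if total == 0:
        return "ambiguous", None
    # Category that received the largest share of this query's products.
    top_category, top_count = max(category_counts.items(), key=lambda kv: kv[1])
    if top_count / total >= CONFIDENT_SHARE:
        return "confident", top_category
    return "ambiguous", None


# Hypothetical per-category counts for two queries.
print(label_query({"Cameras": 95, "Lenses": 5}))   # ('confident', 'Cameras')
print(label_query({"Cameras": 60, "Lenses": 40}))  # ('ambiguous', None)
```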
  • 35. Query Categorization in Action ● Directly use historical data when there is a sufficient amount ● Use an experimentally determined confidence score threshold to pick top predictions ● Fall back to the parent category or the entire inventory when there are no high-confidence predictions ● Baseline = ngrams + BM25 + attribute filtration ● Absolute scale obfuscated
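A rough sketch of the threshold-and-fallback logic described on this slide. The slide does not say how the parent category is chosen, so using the parent of the top-scoring category is an assumption, as are the function name, the category names and the 0.8 threshold.

```python
from typing import Dict, List, Optional


def pick_categories(
    scores: Dict[str, float],
    parent_of: Dict[str, str],
    threshold: float,
) -> Optional[List[str]]:
    """Return categories to filter on, or None to search the entire inventory."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    confident = [category for category, score in ranked if score >= threshold]
    if confident:
        return confident
    if not scores:
        return None
    # Assumed fallback rule: parent of the best-scoring category, if known.
    best = max(scores, key=scores.get)
    parent = parent_of.get(best)
    return [parent] if parent is not None else None


# Hypothetical category tree and model scores.
parents = {"DSLR Cameras": "Cameras & Photo", "Lenses": "Cameras & Photo"}
print(pick_categories({"DSLR Cameras": 0.93, "Lenses": 0.04}, parents, 0.8))
print(pick_categories({"DSLR Cameras": 0.50, "Lenses": 0.40}, parents, 0.8))
```

Returning `None` for "entire inventory" keeps the caller's filtering step explicit: no category filter is applied at all.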
  • 36. FastCat - Faster Training & Inference ● Based on (Joulin et al., “Bag of tricks for efficient text classification”, arXiv, 2016) ○ Shallow network but deep learning - no feature engineering ○ Bag of ngrams as input ○ Hierarchical softmax in the output layer: log2 V outputs to evaluate ● Data collected as before ● Results: training time 20X faster, inference time < 1 ms, commodity hardware, comparable accuracy ● [Diagram: query words W1, W2, …, Wn-1, Wn → hidden layer → category]
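To put "log2 V outputs" in perspective, here is a back-of-the-envelope calculation using the ~20K categories quoted on the earlier "Why?" slide. It assumes a balanced binary tree over the categories; a real hierarchical softmax tree may be unbalanced, so the per-query path length can differ.

```python
import math

num_categories = 20_000  # "~20K Categories" from the earlier slide

# A flat softmax scores every category for each query.
flat_softmax_outputs = num_categories

# With a balanced binary tree (an assumption), one root-to-leaf path
# needs about log2(V) binary decisions.
hierarchical_outputs = math.ceil(math.log2(num_categories))

print(flat_softmax_outputs, hierarchical_outputs)  # 20000 15
```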
  • 38. Personalized Query Autocompletion. Team Members: Manojkumar R Kannadasan, Grigor Aslanyan
  • 40. Why? ● Saves time for users ● Guides users to reach their products faster ● Avoids spelling errors ● Can help promote top products
  • 41. Why is it Challenging & Fun? ● Millions of Users ● Huge number of queries per second ● Show relevant suggestions to users ● Detect spelling errors and provide corrected suggestions
  • 42. Most Popular Completions - Overview ● [Diagram: User Prefix and Query Data feed Most Popular Completions (MPC), which returns the Top N Queries]
  • 43. Most Popular Completions - Naive Approach ● Show queries matching the prefix, ordered by popularity ● Popularity can be frequency or sales
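The naive MPC approach can be sketched in a few lines. The popularity table below is made up, and a linear scan over all queries is used only for clarity; a production system would need an index that supports prefix lookup.

```python
from typing import Dict, List


def most_popular_completions(prefix: str, popularity: Dict[str, int], n: int = 5) -> List[str]:
    """Return the n most popular queries that start with the prefix."""
    prefix = prefix.lower()
    matches = [query for query in popularity if query.startswith(prefix)]
    # Highest popularity first; ties broken alphabetically for determinism.
    return sorted(matches, key=lambda q: (-popularity[q], q))[:n]


# Hypothetical popularity counts (could be frequency or sales).
query_popularity = {
    "canon dslr camera": 900,
    "car seat covers": 1500,
    "canon lenses": 400,
    "cat tree": 1200,
    "dslr camera": 700,
}
print(most_popular_completions("c", query_popularity, n=3))
# ['car seat covers', 'cat tree', 'canon dslr camera']
```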
  • 44. Personalized Query Autocompletion ● A user's queries in a session revolve around one or more intents ● Global query completions may be sub-optimal ● Example session: Dslr camera, Canon dslr camera, Canon 5D Mark IV, Canon lenses
  • 45. Personalized Re-Ranker Overview ● [Diagram: User Prefix and Query Data feed Most Popular Completions (MPC); the Top N Queries go to a Re-Ranker that uses Query Features and User Features] ● Reference: Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, “Personalized Query Auto-Completion Through a Lightweight Representation of the User Context” [Under Review]
  • 46. Data Collection ● Billions of User Sessions ● Capture user behavioral activity ○ Prefix ○ Query Clicked from Autocomplete ○ Previous Queries issued by user ○ Queries viewed and not clicked ○ Global performance of the query
  • 47. Understanding User Context ● Session queries: Dslr camera, Canon dslr camera, Canon 5D Mark IV, Canon lenses ● User Starts Typing C
  • 48. Understanding User Context ● Features computed based on previous queries issued by the user ○ Textual features like ngrams, # of terms, frequency, session-based etc. ○ Similarity features based on text ○ Similarity features based on vector representations ● Query vectors can be learned by ○ Supervised - query transitions, queries from product co-clicks ○ Semi-Supervised - Word2Vec, fastText, GloVe
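As an illustration of a text-based similarity feature, the sketch below scores each candidate by its word overlap (Jaccard similarity) with the user's previous queries. In the talk such features feed a learned ranking model; using one feature directly as the sort key, and the choice of Jaccard overlap itself, are simplifying assumptions made here.

```python
from typing import List, Set


def tokens(query: str) -> Set[str]:
    return set(query.lower().split())


def context_similarity(candidate: str, previous_queries: List[str]) -> float:
    """Highest Jaccard word overlap between the candidate and any previous query."""
    candidate_tokens = tokens(candidate)
    best = 0.0
    for previous in previous_queries:
        previous_tokens = tokens(previous)
        union = candidate_tokens | previous_tokens
        if union:
            best = max(best, len(candidate_tokens & previous_tokens) / len(union))
    return best


def rerank(candidates: List[str], previous_queries: List[str]) -> List[str]:
    # sorted() is stable, so ties keep their original (popularity) order.
    return sorted(candidates, key=lambda c: -context_similarity(c, previous_queries))


session = ["dslr camera", "canon dslr camera", "canon 5d mark iv"]
global_suggestions = ["car seat covers", "cat tree", "canon lenses"]
print(rerank(global_suggestions, session))
# ['canon lenses', 'car seat covers', 'cat tree']
```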
  • 49. Model Training ● Positive Samples ○ Queries clicked in Autocomplete ● Negative Samples ○ Queries viewed and not clicked in Autocomplete ● Train a Machine Learned Ranking Model ○ Ref: https://en.wikipedia.org/wiki/Learning_to_rank
  • 50. Evaluation ● MRR, Success Rate, MAP & nDCG ○ 20% - 30%** Lift over MPC ○ 5% - 10%** Lift over Non-Personalized Re-Ranker ● ** Manojkumar Rangasamy Kannadasan, Grigor Aslanyan, “Personalized Query Auto-Completion Through a Lightweight Representation of the User Context” [Under Review]
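Of the metrics listed, MRR (mean reciprocal rank) is the simplest to show. The sketch below assumes each impression is a ranked suggestion list plus the single suggestion the user clicked; the data is made up.

```python
from typing import List, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[str], clicked: str) -> float:
    """1 / position of the clicked suggestion, or 0 if it was not shown."""
    for position, suggestion in enumerate(ranked, start=1):
        if suggestion == clicked:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(impressions: List[Tuple[Sequence[str], str]]) -> float:
    if not impressions:
        return 0.0
    return sum(reciprocal_rank(ranked, clicked) for ranked, clicked in impressions) / len(impressions)


# Hypothetical impressions: (suggestions shown, suggestion clicked).
impressions = [
    (["a", "b", "c"], "a"),  # reciprocal rank 1
    (["a", "b", "c"], "b"),  # reciprocal rank 1/2
    (["a", "b", "c"], "z"),  # clicked query not shown: 0
]
print(mean_reciprocal_rank(impressions))  # 0.5
```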
  • 52. Spell Correction. Team Members: Utkarsh Porwal & Roberto Konow
  • 53. Why? ● Product names can be difficult to spell ● Users will appreciate the help ● Sellers will appreciate the help ● It is challenging and fun!
  • 55. Why is it Challenging and Fun? ● Special - Query spell correction for user-generated item information ● Big - Millions of Users, Billions of Items ● Efficiency - Need to process a huge number of queries per second ● Precision - Suggest the right correction for the right query
  • 56. Overview ● Input: Query ● Stages: Candidate Generation (Efficiency), Language Model (Big & Special), Error Model (Big & Special), Ranking (Precision) ● Output: Corrected Query
  • 58. Efficiency ● Query: top ● Candidates at edit distance 1: n deletions + (n-1) transpositions + 26n alterations + 26(n+1) insertions = 54n+25 ● Examples: qop, op, sop, …, thp, tap, tkp, …, tpn, tops
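The 54n+25 count can be checked with a small generator, written in the style of the widely circulated edit-distance-1 enumeration and assuming a lowercase a-z alphabet. The list deliberately keeps duplicates and the unchanged word (an "alteration" to the same letter), which is what makes the total match the formula exactly.

```python
import string
from typing import List


def edits1(word: str) -> List[str]:
    """All strings one edit away from word, duplicates included."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    alterations = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return deletes + transposes + alterations + inserts


candidates = edits1("top")
n = len("top")
print(len(candidates), 54 * n + 25)  # 187 187
```

After de-duplication the set is smaller, but it still grows linearly with query length and most of its members are not real words.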
  • 59. Efficiency ● Query: top ● Generate only the ones we know ● Examples: qop, op, sop, …, thp, tap, tkp, …, tpn, tops
  • 60. Efficiency ● Generate only the ones we know? ● [Figure, source: Wikipedia; example words: tap, taps, top, tops]
  • 61. Efficiency ● Generate only the ones we know? ● [Figure, source: Wikipedia; example words: tap, taps, top, tops]
  • 62. Efficiency ● Generate only the ones we know? ● [Figure, source: http://ajainarayanan.github.io/ctrlf/; example words: tap, taps, top, tops] ● Reference: Ehsan Shareghi, Matthias Petri, Gholamreza Haffari, Trevor Cohn: Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees. EMNLP 2015
  • 63. Efficiency - Which one? ● Naïve: Slow, no memory footprint, unnecessary candidates (?) ● Trie: Faster, Huge memory footprint ● DAWG: Even Faster, Not-that-huge memory footprint ● Suffix Trees (not compressed): Humongous memory footprint ● Suffix Trees (compressed): Slowest, very small memory footprint
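A minimal trie over the four example words (tap, taps, top, tops), used here only to filter a candidate list down to known words. This is a sketch: an efficient implementation would walk the trie while generating candidates rather than filter afterwards, and the class and method names are illustrative.

```python
from typing import Dict


class Trie:
    def __init__(self) -> None:
        self.children: Dict[str, "Trie"] = {}
        self.is_word = False

    def insert(self, word: str) -> None:
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word: str) -> bool:
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    def node_count(self) -> int:
        """Number of nodes including the root; a rough proxy for memory use."""
        return 1 + sum(child.node_count() for child in self.children.values())


trie = Trie()
for known_word in ["tap", "taps", "top", "tops"]:
    trie.insert(known_word)

raw_candidates = ["qop", "op", "sop", "thp", "tap", "tkp", "tpn", "tops"]
print([c for c in raw_candidates if trie.contains(c)])  # ['tap', 'tops']
print(trie.node_count())  # 8
```

The two branches "ap(s)" and "op(s)" end in identical suffixes that a trie stores twice; merging such shared suffixes is what a DAWG does.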
  • 64. Language Model ● How likely is the candidate - p(c)? ● p(c_1 c_2 c_3 … c_n)? e.g. p(levis blue jeans 32 in)? ● Naive Algorithm - look for the number of occurrences of the given query ○ What if we have never seen the query? ○ Long queries will have low counts, leading to poor probability estimates ● Markov Assumption - second order ○ p(c_1 c_2 c_3 … c_n) = p(c_1) p(c_2|c_1) p(c_3|c_1 c_2) … p(c_n|c_{n-2} c_{n-1})
  • 65. Language Model ● p(levis blue jeans 32 in) = p(levis) p(blue|levis) p(jeans|levis blue) p(32|blue jeans) p(in|jeans 32) ● p(blue|levis) = count(levis,blue) / count(levis) ● Now we only have to deal with unigrams, bigrams and trigrams ● There are still issues ○ Words that we have never seen - we still need to assign some probability
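The count-based estimate can be sketched as below, using a tiny made-up query log. It applies the second-order factorization with plain count ratios and no smoothing, so any unseen n-gram drives the whole probability to zero, which is exactly the remaining issue the slide points out.

```python
from collections import Counter
from typing import List


def train_counts(corpus: List[str]) -> Counter:
    """Count all unigrams, bigrams and trigrams in the query log."""
    counts: Counter = Counter()
    for query in corpus:
        words = query.split()
        for order in (1, 2, 3):
            for i in range(len(words) - order + 1):
                counts[tuple(words[i:i + order])] += 1
    return counts


def query_probability(query: str, counts: Counter, total_words: int) -> float:
    """p(c_1..c_n) with each word conditioned on at most the two previous words."""
    words = query.split()
    probability = 1.0
    for i, word in enumerate(words):
        context = tuple(words[max(0, i - 2):i])
        denominator = counts[context] if context else total_words
        if denominator == 0:
            return 0.0  # unseen context, no smoothing in this sketch
        probability *= counts[context + (word,)] / denominator
    return probability


# Hypothetical query log.
log = ["levis blue jeans 32 in", "levis blue jeans", "levis black jeans", "blue jeans"]
counts = train_counts(log)
total_words = sum(len(q.split()) for q in log)  # 13

# p(levis) * p(blue|levis) * p(jeans|levis blue) = 3/13 * 2/3 * 2/2
print(query_probability("levis blue jeans", counts, total_words))
print(query_probability("levis red jeans", counts, total_words))  # 0.0
```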
  • 66. Error Model ● p(query|correction)? ● How likely is it that the user wanted to type the correction but typed the query? ● Multiple ways to estimate this ○ Keyboard distance ○ Phonetic distance ○ Mine your logs
  • 67. Error Model - Industry approach ● To train an error model we need triples of (intended word, observed word, count) ● We would expect ○ p(the|the) to be very high ○ p(teh|the) to be relatively high ○ p(hippopotamus|the) to be extremely low
  • 68. Error Model ● Get the 10 million most frequent unigrams ● Get all the candidates at a certain edit distance (depending on word length) ● This will give a huge list of tuples, e.g. <apple, applo> ● The assumption is that the top 10 million are generally correct ● Prune this list based on frequency - apple should be at least 10x more frequent
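The pruning step might look like the sketch below. The frequencies are made up, and taking the observed word's frequency as the "count" in each (intended, observed, count) triple is an assumption; only the 10x ratio comes from the slide.

```python
from typing import Dict, List, Tuple


def prune_pairs(
    pairs: List[Tuple[str, str]],
    frequency: Dict[str, int],
    min_ratio: float = 10.0,
) -> List[Tuple[str, str, int]]:
    """Keep (intended, observed) pairs where intended is >= min_ratio times more frequent."""
    kept = []
    for intended, observed in pairs:
        intended_freq = frequency.get(intended, 0)
        observed_freq = frequency.get(observed, 0)
        if observed_freq > 0 and intended_freq >= min_ratio * observed_freq:
            kept.append((intended, observed, observed_freq))
    return kept


# Hypothetical unigram frequencies from query logs.
unigram_frequency = {"apple": 50_000, "applo": 120, "apply": 30_000}
candidate_pairs = [("apple", "applo"), ("apple", "apply"), ("apply", "apple")]
print(prune_pairs(candidate_pairs, unigram_frequency))
# [('apple', 'applo', 120)]
```

The ratio test removes pairs of two legitimate words such as apple/apply, which are within edit distance 1 but have comparable frequencies.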
  • 71. Students & Recent Graduates: https://careers.ebayinc.com/join-our-team/students-recent-graduates/ ● Start your Career @ eBay: https://careers.ebayinc.com/join-our-team/start-your-search/
  • 74. Language Model ● p(levis blue jeans 32 in) = p(levis) p(blue|levis) p(jeans|levis blue) p(32|blue jeans) p(in|jeans 32) ● p(blue|levis) = count(levis,blue) / count(levis) ● Now we only have to deal with unigrams, bigrams and trigrams ● There are still issues ○ Words that we have never seen - we still need to assign some probability ○ Adjustment of probabilities to demote high-frequency words - the, a etc. ○ Backoff scores - KenLM (https://kheafield.com/code/kenlm/)
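To show why unseen words need some probability mass, here is add-one (Laplace) smoothing on a bigram estimate. This is a deliberately simpler technique than the backoff-based estimation a toolkit such as KenLM provides, and it is swapped in only for illustration. The counts and vocabulary size are made up.

```python
from collections import Counter


def add_one_bigram_probability(
    previous: str,
    word: str,
    unigram_counts: Counter,
    bigram_counts: Counter,
    vocab_size: int,
) -> float:
    """p(word | previous) with add-one smoothing, so unseen pairs stay non-zero."""
    return (bigram_counts[(previous, word)] + 1) / (unigram_counts[previous] + vocab_size)


# Hypothetical counts; vocab_size = 6 seen words + 1 slot for unknown words.
unigram_counts = Counter({"levis": 3, "blue": 3, "jeans": 4, "black": 1, "32": 1, "in": 1})
bigram_counts = Counter({("levis", "blue"): 2, ("levis", "black"): 1, ("blue", "jeans"): 3})
vocab_size = 7

print(add_one_bigram_probability("levis", "blue", unigram_counts, bigram_counts, vocab_size))  # 0.3
print(add_one_bigram_probability("levis", "red", unigram_counts, bigram_counts, vocab_size))   # 0.1
```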