Universal Job Embedding
In Recommendation
Marsan Ma
2019.8.28
Outline
● Architecture
○ Problems => Algorithms
○ Algorithms => Pipelines
● Algorithms
○ Item2vec Family
○ SNN (Siamese Neural Network) Family
● Evaluations: online/offline
● Summary
(btw, allow me to shorten “recommendation(s)” as “reco(s)” in later slides)
Architecture:
From Problems to Algorithms
Problems & Solutions
● 3 main challenges in reco-system
1. Scaling
2. User cold start
3. Job cold start
● Problem
○ How to predict [Millions users] x [Millions jobs]?
● Solutions
○ Narrow down the search space by location & job_title
■ tradeoff: fewer candidates to predict = worse recall
○ Reduce the prediction burden
■ Rather than predicting user × job pairs, predict job embeddings.
■ Find the Approximate-KNN of the job embeddings as recos (see the sketch below).
Scaling
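To make the scaling idea concrete, here is a minimal sketch (not the production code): the model only produces job embeddings, and recos are answered by nearest-neighbor search over them. Exact brute-force KNN is used here for brevity; the real system would use an approximate-KNN index to handle millions of jobs.

```python
# Minimal sketch: answer recos with nearest-neighbor search over
# precomputed job embeddings instead of scoring every user x job pair.
# Exact brute-force KNN here; production would swap in an approximate index.
import numpy as np

def knn(query_vec, job_matrix, k=100):
    """Indices of the k jobs closest (L2) to query_vec."""
    dists = np.linalg.norm(job_matrix - query_vec, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
job_matrix = rng.normal(size=(10_000, 64)).astype("float32")  # placeholder embeddings

j2j_for_job_42 = knn(job_matrix[42], job_matrix, k=100)  # top-100 J2J recos for one job
```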
● Problem
■ 66% of users visiting Indeed are new users.
■ 40% of users never come back within 21 days.
■ Serving new users is now or never!
■ But retraining the model for new recos cannot happen in real time.
● Solution
■ Rather than P10d (personalized reco), do J2J (job-to-job reco) instead.
■ New user clicks/applies to a job => assemble recos from J2J recos in real time.
(No need to retrain the model!)
User Cold Start
● P10d (Personalized reco): each user gets top-100 recos
○ No way to update recos without retraining the model.
● J2J (Job-to-job reco): each job gets top-100 recos
○ Predict (reco, score) for jobs
○ Synthesize P10d from J2J recos, e.g. a user applied to (Job1, Job2)
=> sort(Job1_reco + Job2_reco) by their scores
=> final reco: [(job6, 0.8), (job4, 0.6), (job7, 0.5), (job5, 0.4)]
○ Whenever the user clicks/applies to new jobs, recos are updated on the fly (see the sketch below)!
Job1_reco : (job4, 0.6), (job5, 0.4)
Job2_reco : (job6, 0.8), (job7, 0.5)
Job3_reco : (job8, 0.9), (job9, 0.3)
UserA : job1, job2
UserB : job3, job4
UserC : job5, job6
P10d (personalized) v.s. J2J (job-to-job)
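A minimal sketch of the synthesis step above, reusing the same toy numbers; the names and the max-merge of duplicate recos are illustrative assumptions, not the production logic.

```python
# Sketch: assemble a personalized (P10d) list on the fly from precomputed
# J2J recos, as in the example above. No model retraining needed.
J2J = {
    "job1": [("job4", 0.6), ("job5", 0.4)],
    "job2": [("job6", 0.8), ("job7", 0.5)],
    "job3": [("job8", 0.9), ("job9", 0.3)],
}

def p10d_from_j2j(user_jobs, j2j, top_n=100):
    merged = {}
    for job in user_jobs:
        for reco, score in j2j.get(job, []):
            # if a reco appears under several applied jobs, keep its best score
            merged[reco] = max(score, merged.get(reco, 0.0))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(p10d_from_j2j(["job1", "job2"], J2J))
# -> [('job6', 0.8), ('job4', 0.6), ('job7', 0.5), ('job5', 0.4)]
```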
● New jobs are popular; SERP boosts them with more impressions and better display position (CTR).
● Unfortunately, it’s hard to do the same in RECO models, since it takes days of user-interest data to learn that a
new job is worth recommending (weak at recommending new jobs!)
Job Cold Start
● Existing Solution
○ Rule-based: recommend jobs with the same location/normTitle.
○ Tfidf+KNN as an unsupervised content-based algorithm, but the performance is not good (sketched below).
● Can we do better?
○ Sure, let’s build a supervised algorithm that uses both content and user behavior.
○ As long as a job has a job description, it can be recommended.
○ The same job reposted under a different jobId = the same result.
Job Cold Start
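For reference, the existing Tfidf+KNN baseline mentioned above can be approximated in a few lines of scikit-learn; this is an illustrative sketch with toy data, not the production implementation.

```python
# Sketch of the unsupervised Tfidf + KNN content-based baseline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

job_descriptions = [
    "senior java backend engineer",          # toy job descriptions
    "java software engineer, spring boot",
    "registered nurse, night shift",
]

vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(job_descriptions)      # sparse; dimension = dictionary size

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
_, neighbors = nn.kneighbors(X[0])           # nearest jobs for job 0
print(neighbors)                             # job 0 itself plus the other java job
```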
Architecture:
From Algorithms to Pipelines
● Data source:
○ user behavior from log-entry
○ job meta & description from searchable-jobs daily snapshot
○ Preprocessing: Spark
● Model:
○ Train/predict embeddings on AWS EC2 GPU instances
● Bridge ORC & AWS:
○ Use AWS-S3 as data buffer
○ Use Jenkins on EC2 to spawn EC2 instances
● Serve Reco:
○ An ORC job builds recos into an artifact, later loaded by Lorax
Pipeline
(Pipeline diagram) Start: ORC triggers aggregation of user behavior (clicks/applyStart) and job descriptions into training data, then spawns model training on EC2 w/ GPU; the model predicts job embeddings; Approximate NN forms recommendations from them; ORC downloads the recos and builds the artifact, which Lorax (Reco API) loads; the served recommendations (the goal) are then evaluated.
● What’s the model output?
○ By design, we don’t want the model to predict all user-job combinations.
○ The model only predicts job embeddings.
● From job-embedding to P10d (personalized) recommendation
○ Method-1:
■ User-embedding = AVERAGE(job-embedding | user applyStart jobs)
■ P10d recommendations = A-KNN of user-embedding
○ Method-2:
■ Build J2J (job-to-job) recommendations by finding the A-KNN of each job embedding
■ P10d recommendations = Sort(J2J Recommendations | user applied/clicked jobs)
From job-embedding to recommendations
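A minimal sketch of Method-1 above: the user embedding is the average of the embeddings of the jobs the user applyStart-ed, and the personalized recos are its nearest neighbors. Exact KNN and random embeddings are placeholders; the real system uses Approximate-KNN over the model’s embeddings.

```python
# Sketch of Method-1: user embedding = mean of applied-job embeddings,
# P10d recommendations = (approximate) KNN of that vector.
import numpy as np

def knn(query_vec, job_matrix, k=100):
    dists = np.linalg.norm(job_matrix - query_vec, axis=1)   # exact, for illustration
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
job_matrix = rng.normal(size=(10_000, 64)).astype("float32")  # placeholder embeddings

applied = [12, 345, 678]                       # jobs this user applyStart-ed
user_vec = job_matrix[applied].mean(axis=0)    # Method-1 user embedding
p10d = knn(user_vec, job_matrix, k=100)        # personalized recos
```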
Challenge … cluster on fire
(Same pipeline diagram as above: ORC trigger => user behavior (clicks/applyStart) + job descriptions => aggregated training data => model training on EC2 w/ GPU => job embeddings => Approximate NN forms recommendations => download, build artifact => Lorax (Reco API) loads => evaluate.)
Over my dead body ...
● Avoid Spark in production; it’s cool, but it doesn’t belong to you.
○ It’s a shared resource; you don’t want your product to fail because of someone else’s experiment.
● Reduce upstream data dependencies
○ We rely on the IQL:organic index, and it kept dying for a long time.
● AWS is your good friend
○ The first design depended on AWS only minimally (a GPU instance for model training).
○ We changed the architecture 3 times; the latest version goes to AWS for everything,
from data preprocessing to evaluation. (Then the team soon got disbanded.)
○ It’s not free, but neither is Indeed infra.
Algorithms:
Item2vec Family
● Let’s review word2vec first, since item2vec borrows its idea from word2vec
1. Take training pairs from words within a sliding window over a sentence
2. “Randomly reduce the window” so that words closer to the center have a higher probability of being chosen
Item2vec: Skip-gram model
(Figure: an example with window_size = 2)
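A small sketch of steps 1-2 above (sliding window plus randomly reduced window); purely illustrative, works the same whether the tokens are words or job ids.

```python
# Sketch: generate skip-gram training pairs from one sequence with a
# randomly reduced window, so tokens nearer the center are picked more often.
import random

def skipgram_pairs(tokens, window_size=2):
    pairs = []
    for i, center in enumerate(tokens):
        w = random.randint(1, window_size)                 # "random reduce window"
        lo, hi = max(0, i - w), min(len(tokens), i + w + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"]))
```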
● Borrowed the idea from word2vec
○ Train with a skip-gram or CBOW model (we chose skip-gram; better performance in our case)
○ Train on this ground truth:
when the center word is w(t), which other words will be within the window (closer to w(t))
Item2vec: Skip-gram model
● In Indeed’s case
○ The user behavior sequence is the “sentence”; clicked / applyStart jobs are the “words”
○ It trains directly on “if a user applied to this job, will they also apply to that job soon?”
○ It’s supervised training (the response variable aligns with your business goal), even though the original word2vec isn’t.
Item2vec: Skip-gram model
(Diagram: a user session is the time-ordered sequence Job1 … Job8; sliding a window over it yields training samples such as (J1, J2), (J1, J3), (J2, J1), (J2, J3), (J2, J4), (J3, J1), (J3, J2), (J3, J4), (J3, J5), (J4, J2), (J4, J3), (J4, J5), (J4, J6), …)
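Because the “sentences” are just sequences of job ids, a vanilla item2vec can be trained with an off-the-shelf word2vec implementation. A sketch using gensim follows (assuming gensim ≥ 4; sessions and hyperparameters are illustrative):

```python
# Sketch: vanilla item2vec = word2vec (skip-gram) over user sessions of job ids.
from gensim.models import Word2Vec

sessions = [                                   # toy user sessions, ordered by time
    ["Job1", "Job2", "Job3", "Job4", "Job5"],
    ["Job2", "Job3", "Job6"],
    ["Job4", "Job5", "Job6", "Job7"],
]

model = Word2Vec(
    sentences=sessions,
    vector_size=64,      # embedding dimension
    window=2,            # sliding-window size
    sg=1,                # skip-gram (not CBOW)
    min_count=1,
    epochs=20,
)

job_vec = model.wv["Job3"]                     # the job embedding
print(model.wv.most_similar("Job3", topn=3))   # J2J recos via embedding similarity
```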
Item2vec: Global Context
● How to weight click / applyStart?
○ Click is a noisy signal; applyStart is relatively strong but a rare event.
○ Training only on applyStart leads to:
■ lower job coverage, since many jobs never get an applyStart.
■ lower precision, since the data volume is smaller.
○ Naively merge click + applyStart as the response variable?
■ Mixing weak and strong signals together doesn’t make sense.
● Problem: there is no way to weight click/apply differently!
Item2vec: Global Context
● Solution: item2vec with global context (the KDD 2018 best paper by Airbnb).
○ Clicks form the main sequence; get pairs with a sliding window (and reduced window).
○ Applies act as “global context”, always connected to every single clicked job (see the sketch below).
(Diagram: user-clicked jobs Job1 … Job8 form the time-ordered sequence, and user applyStart jobs Job9 and Job10 act as global context. The sliding window yields the usual pairs (J1, J2), (J1, J3), (J2, J1), (J2, J3), (J2, J4), …, while the global context adds (J1, J9), (J1, J10), (J2, J9), (J2, J10), (J3, J9), (J3, J10), … for every clicked job.)
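A small sketch of the global-context trick above: on top of the sliding-window pairs from the click sequence, every clicked job is also paired with every applyStart-ed job in the session. Illustrative only.

```python
# Sketch: applies act as global context, connected to every clicked job.
def global_context_pairs(clicked_jobs, applied_jobs):
    return [(c, a) for c in clicked_jobs for a in applied_jobs]

clicked = ["Job1", "Job2", "Job3", "Job4"]
applied = ["Job9", "Job10"]

extra_pairs = global_context_pairs(clicked, applied)
# -> ('Job1','Job9'), ('Job1','Job10'), ('Job2','Job9'), ('Job2','Job10'), ...
# These are added to the ordinary sliding-window pairs from the clicked sequence,
# giving applyStart its own, stronger weight in training.
```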
Algorithms:
Siamese-NN Family
Siamese-NN family: problem & design
● Problem statement
○ From item2vec, we now have the idea of how to
■ turn user behavior sequences (clicks / applyStarts) into job-pair training data;
■ learn these pairs as a classification problem (which machine learning is good at).
○ Next, how do we include the Job Description?
■ Since users decide whether to apply to a job based on it.
■ At the same time, we still want to learn from user behavior!
○ Job descriptions are HUGE data; scaling is challenging!
■ Still, the model shall only predict embeddings.
■ Build recos ONLY from Approximate-KNN on the embeddings (without running the model).
● Design: Siamese Neural Network
○ Scaling: we want a job embedding, and we define L2 distance as job similarity
○ Train the model the same way you use it (supervised learning):
■ Extract JD features with an “encoding layer”
● candidates: RNN/CNN/Transformer/BERT
■ Response = L2 distance between the job embeddings
● User applied to both jobs: similarity = 1
● User applied to one, ignored the other: similarity = 0
● Dissimilar job = another job viewed but not applied
○ Secret sauce (can’t work without this):
■ BatchNorm + Dense learner after JobVec.
Siamese-NN family: problem & design
Siamese-NN family: problem & design
(Architecture diagram, annotated)
● Input: the job descriptions of the two jobs in a training pair, e.g. (J1, J2), (J1, J3), (J1, J9), (J1, J10).
● The word embeddings of each doc (JD) are concatenated into a document embedding array.
● Any feature-extraction layer (CNN/RNN/Transformer/BERT) can convert the document embedding array into a 1D array.
● Once both jobs have become 1D vectors, we can calculate a distance (L1 or L2) between them.
● Some tricks here:
1. Use batch_norm to constrain the vector size, making job embeddings comparable in the L1/L2 domain.
2. Use a simple (1- or 2-layer) MLP to learn interpolated features.
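A minimal PyTorch sketch of this design, with assumed sizes and a mean-pooling placeholder where the CNN/RNN/Transformer/BERT encoder would go; the contrastive-style loss is one common way to train on the similar = 1 / dissimilar = 0 pairs (the deck does not pin down the exact loss).

```python
# Sketch of the Siamese design (assumed shapes, not the real model): a shared
# encoder turns a JD into a vector, a BatchNorm + Dense head follows ("secret
# sauce"), and the label says whether the two jobs should be close in L2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JobEncoder(nn.Module):
    def __init__(self, vocab_size=50_000, emb_dim=128, out_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # placeholder encoder: mean-pool word embeddings; a CNN / RNN /
        # Transformer / BERT encoder would go here instead
        self.norm = nn.BatchNorm1d(emb_dim)       # keeps vectors comparable in L2
        self.dense = nn.Linear(emb_dim, out_dim)  # small MLP on top of JobVec

    def forward(self, token_ids):                 # (batch, seq_len) token ids
        pooled = self.emb(token_ids).mean(dim=1)  # (batch, emb_dim)
        return self.dense(self.norm(pooled))      # (batch, out_dim) job embedding

def contrastive_loss(v1, v2, similar, margin=1.0):
    # similar = 1 if the user applied to both jobs, 0 otherwise
    d = F.pairwise_distance(v1, v2)
    return (similar * d.pow(2) + (1 - similar) * F.relu(margin - d).pow(2)).mean()

encoder = JobEncoder()
jd_a = torch.randint(1, 50_000, (8, 200))   # toy token ids for 8 job pairs
jd_b = torch.randint(1, 50_000, (8, 200))
label = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(encoder(jd_a), encoder(jd_b), label)
loss.backward()
```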
Encoder candidates (just FYI; each is a long story)

Tfidf
● N-gram tfidf is the most intuitive and widely used technique; it’s NLP 101.
● We already have this in production as the existing content-based model.
● Dimension = dictionary size, much larger (usually 100x) than all the other embeddings; too heavyweight to use.

CNN
● Pulls n-gram feature extraction into the model-training process; choosing features and iterating manually is of course less precise and much slower than letting the model learn them.
● Lifts n-grams into an embedding space, so n-grams with the same meaning but different words can now be treated as the same.
● Tried in RECJOBS-753.

RNN
● Has a long-term-memory mechanism, so in theory it can capture patterns that lie far apart. Compared to Tfidf and CNN, it is more promising for long text (like our job descriptions, many of which run up to 800-1000 words).
● Extremely slow, since it cannot be parallelized. Also, if the text is really long, signals that are too far apart still fade.
● We don’t plan to try this, since RNN is too slow and not particularly promising.

Capsule Network
● A novel idea proposed by Hinton; rather than a replacement for CNN/RNN, it is a layer placed after CNN/RNN that builds a hierarchical understanding of the features extracted by the prior layer.
● It is meant to imitate how humans recognize objects and ideas.
● Experiment in progress in RECJOBS-898.

Transformer
● Comes from the “attention mechanism” that used to be applied to RNNs, but was later shown to work pretty well on its own, even better than CNN/RNN.
● Currently the best architecture for capturing long-distance features in text: unlike RNN’s long-term memory, which gradually fades, it is always aware of the full article, and it can be parallelized at the same time (basically the RNN killer).
● We plan to try it in RECJOBS-857.

BERT
● Pre-trained with a bidirectional transformer on 2 self-supervision tasks: a) cloze test, b) sentence relationship.
● The pre-trained model can be used in a couple of ways; for us the best bet is to use it to encode jobs, then cascade the result to another learner that takes other metadata as features and trains on our business task together with the BERT embedding.
● Experiment in progress in RECJOBS-886.
1. For convenience of scaling, we want embedding + A-KNN solutions
○ yeah, we have item2vec
2. Hey, how about we weight clicks & applies differently?
○ Airbnb published item2vec + global context; let’s use it!
3. Hey, can we include the job description as a feature? Apparently users read it before applying.
○ Let’s augment item2vec with the JD => use a Siamese-NN to train on item2vec pairs
4. Hey, this Siamese-NN is actually generic over any kind of encoder!
○ Let’s try all the state-of-the-art NLP deep-learning encoders on it!
Summary on Algorithms
Evaluations
● Unfortunately, we never solved the online fresh-data problem
○ The Indeed cluster was choked for a long stretch in early 2019.
○ Spark jobs & index builders (IQL:organic) failed constantly at that time.
○ Our data-generation & evaluation ORC tasks also failed often at that time.
● The only thing that went online correctly: the vanilla version of item2vec, since it requires minimal data
○ It only needs applyStart data, and applyStart is rare, ~1% of all user behavior data.
○ Neither job metadata (location/country) nor job descriptions are needed.
● Special thanks to @Jialin!!!
○ She built a dedicated pipeline extracting data from logEntry for item2vec.
○ She found the frontend problem that caused all new models on Lorax to be evaluated incorrectly.
○ We finally have correct online results for item2vec
(Unfortunately, the team disbanded 3 weeks later)
Evaluations: online
● Vanilla version of item2vec alone
○ ApplyStart 24% better than the “ensemble of all old RECO models” over one month ( IQL )
(Chart: item2vec vs. the ensemble of all old models)
Evaluations: online
Evaluations: online
● Coverage of P10d alone is very bad, but J2J+P10d boosts it from ~50% => ~80%. ( IQL )
○ P10d can only provide ~50% coverage
○ P10d + J2J boosts coverage to ~80%
○ The ensemble of all old models covers ~90%, since it is an ensemble of a bunch of models.
(Chart: coverage of P10d only vs. J2J+P10d vs. the ensemble of old models)
● The main trade-off: precision vs. recommendation diversity
○ The item2vec family wins on precision (a decent way of learning job similarity)
○ The Siamese-NN family wins on diversity (able to find similar new jobs)
○ Details of all metrics and model variants are in RECJOBS-900.
● Another promising algorithm: “random-walk” from Pinterest; it seems better than item2vec in offline eval,
but I didn’t get time to finish integrating it for this test (sorry, @Kewei)
Evaluations: offline
● What the online test tells us
○ Precision improved 24% with the simplest item2vec.
○ Good user coverage thanks to J2J recos.
● What the offline tests tell us
○ The only thing that went online (item2vec) is just the simplest model. Its improved version
(item2vec with global context) is better in ALL ASPECTS (precision/coverage/diversity).
○ Content-based models are also promising, since they:
■ use the same idea as item2vec for extracting training pairs
■ include the JD as features, learned with deep-learning NLP encoders.
○ Unfortunately, content-based models cannot be evaluated fairly offline, since they are meant to
recommend new, unseen jobs. So we don’t know how good they could be.
Evaluations: summary
Summary
● About embedding
○ Train a supervised model on your business goal; for us it’s applyStart.
■ Unsupervised bag-of-words won’t be useful.
■ Supervised training on “not your business goal” is just as useless for you.
● I’m lying about “universal”: you have to train on your own business target
○ For Indeed, applyStart might be “the most universal”.
○ Pos-Outcome is a good vision, but we don’t have data that is
reliable/unbiased/well-covered yet.
Summary - supervised embedding
Summary - algorithms
● Item2vec
○ Novel ideas on:
■ Converting recommendation into a classification task (which machine learning is good at).
■ Defining a “distance” aligned with the business goal (click/apply).
○ Minimal data volume, simple model, training is lightning fast.
○ Cold-start problem: it has no idea about new jobs.
● Siamese Neural Network
○ Augments the idea of item2vec, adding the job description to the features.
○ Huge task, big network, needs big data to train, and prediction takes a long time.
○ Immune to cold start: can predict new jobs and recognize re-posted jobs.
