4. Problems & Solutions
● 3 main challenges in the reco system
1. Scaling
2. User cold start
3. Job cold start
5. ● Problem
○ How to predict [millions of users] x [millions of jobs]?
● Solutions
○ Narrow down the search space by location & job_title
■ trade-off: fewer candidates to predict = worse recall
○ Reduce the prediction burden
■ Rather than predicting user x job pairs, predict job embeddings.
■ Find the Approximate-KNN of the job embeddings as recos (see the sketch after this slide).
Scaling
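A minimal sketch of the embedding + A-KNN idea. scikit-learn's exact NearestNeighbors stands in for a real approximate index (e.g. Annoy or Faiss; the deck doesn't say which library was used), and the job IDs and embeddings are made-up toy data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical output of the embedding model: one 64-dim vector per job.
job_ids = np.array(["job1", "job2", "job3", "job4", "job5"])
job_emb = np.random.rand(len(job_ids), 64).astype(np.float32)

# Index the job embeddings once; exact KNN here stands in for A-KNN
# (Annoy / Faiss / ...), which is what makes this scale to millions of jobs.
index = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(job_emb)

# J2J recos for every job = its nearest neighbors in embedding space,
# instead of scoring [millions of users] x [millions of jobs] pairwise.
dist, nbrs = index.kneighbors(job_emb)
j2j = {job_ids[i]: list(job_ids[nbrs[i][1:]]) for i in range(len(job_ids))}
print(j2j)
```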
6. ● Problem
■ 66% of the users visiting Indeed are new users.
■ 40% of users never come back within 21 days.
■ Serving new users is now or never!
■ But retraining the model for new recos cannot happen in real time.
○ Solution
■ Rather than P10d (personalized reco), do J2J (job-to-job reco) instead.
■ A new user clicks/applies to a new job => assemble recos from the J2J recos in real time.
(No need to retrain the model!)
User Cold Start
7. ● P10d (Personalized reco): each user gets top-100 recos
○ No way to update recos without retraining the model.
● J2J (Job-to-job reco): each job gets top-100 recos
○ Predict (reco, score) pairs for each job
○ Synthesize P10d from J2J recos, e.g. a user applied to (Job1, Job2)
=> sort(Job1_reco + Job2_reco) by their scores
=> final reco: [(job6, 0.8), (job4, 0.6), (job7, 0.5), (job5, 0.4)]
○ Whenever the user clicks/applies to new jobs, the recos are updated on the fly! (See the sketch after this slide.)
Job1_reco : (job4, 0.6), (job5, 0.4)
Job2_reco : (job6, 0.8), (job7, 0.5)
Job3_reco : (job8, 0.9), (job9, 0.3)
UserA : job1, job2
UserB : job3, job4
UserC : job5, job6
P10d (personalized) v.s. J2J (job-to-job)
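A minimal sketch of synthesizing P10d from J2J on the fly, using the toy scores from this slide. `j2j` stands for the J2J recos precomputed offline; only the merge/sort runs at request time:

```python
from itertools import chain

# Precomputed offline: top-N (job, score) recos per job (toy data from the slide).
j2j = {
    "job1": [("job4", 0.6), ("job5", 0.4)],
    "job2": [("job6", 0.8), ("job7", 0.5)],
    "job3": [("job8", 0.9), ("job9", 0.3)],
}

def synthesize_p10d(user_jobs, top_k=100):
    """Assemble personalized recos from the J2J recos of the jobs a user clicked/applied to."""
    candidates = chain.from_iterable(j2j.get(j, []) for j in user_jobs)
    # Drop jobs the user already interacted with, keep the best score per job.
    best = {}
    for job, score in candidates:
        if job not in user_jobs:
            best[job] = max(score, best.get(job, 0.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# UserA applied to job1 and job2 => [('job6', 0.8), ('job4', 0.6), ('job7', 0.5), ('job5', 0.4)]
print(synthesize_p10d({"job1", "job2"}))
```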
8. ● New jobs are popular; SERP boosts them for better impressions and display position (CTR).
● Unfortunately, it's hard to do the same in RECO models, since it takes days to learn enough user interest to know a
new job is worth recommending (weak at recommending new jobs!).
Job Cold Start
9. ● Existing Solution
○ Rule-based: recommend jobs with the same place/normTitle.
○ Tfidf+KNN as an unsupervised content-based algorithm, but the performance is not good.
● Can we do better?
○ Sure, let's build a supervised algorithm that uses both content and user behavior.
○ As long as a job has a job description, it can be recommended.
○ The same job reposted under a different jobId gives the same result.
Job Cold Start
11. ● Data sources:
○ user behavior from logEntry
○ job metadata & descriptions from the searchable-jobs daily snapshot
○ Preprocessing: Spark
● Model:
○ Train/predict embeddings on AWS EC2 GPU instances
● Bridge ORC & AWS:
○ Use AWS S3 as a data buffer
○ Use Jenkins on EC2 to spawn EC2 instances
● Serve recos:
○ An ORC job builds the recos into an artifact, later loaded by Lorax
Pipeline
13. ● What's the model output?
○ By design, we don't want the model to predict all user-job combinations.
○ The model only predicts job embeddings.
● From job embeddings to P10d (personalized) recommendations
○ Method-1 (see the sketch after this slide):
■ User embedding = AVERAGE(job embeddings | jobs the user applyStarted)
■ P10d recommendations = A-KNN of the user embedding
○ Method-2:
■ Build J2J (job-to-job) recommendations by finding the A-KNN of each job embedding
■ P10d recommendations = Sort(J2J recommendations | jobs the user applied to/clicked)
From job-embedding to recommendations
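A minimal sketch of Method-1 under the same assumptions as the earlier A-KNN snippet (exact NearestNeighbors as a stand-in for A-KNN; job IDs, dimensions, and values are toy data):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# job_emb: (num_jobs, dim) matrix predicted by the model; job_ids aligns with its rows.
job_ids = ["job1", "job2", "job3", "job4", "job5", "job6"]
job_emb = np.random.rand(len(job_ids), 64).astype(np.float32)
row = {j: i for i, j in enumerate(job_ids)}

index = NearestNeighbors(n_neighbors=4).fit(job_emb)  # stand-in for an A-KNN index

def p10d_method1(applied_jobs, top_k=3):
    """User embedding = average of the embeddings of jobs the user applyStarted."""
    user_emb = job_emb[[row[j] for j in applied_jobs]].mean(axis=0, keepdims=True)
    _, nbrs = index.kneighbors(user_emb)
    return [job_ids[i] for i in nbrs[0] if job_ids[i] not in applied_jobs][:top_k]

print(p10d_method1({"job1", "job3"}))
```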
14. (Pipeline diagram)
● ORC triggers the data jobs: user behavior (clicks/applyStart) and job descriptions are aggregated into training data.
● ORC spawns EC2 GPU instances; model training runs there and predicts job embeddings.
● Approximate NN over the job embeddings forms the recommendations.
● The recos are built into an artifact, which is downloaded and loaded by Lorax (the Reco API).
● Evaluation runs as ORC tasks. Challenge: the cluster was on fire.
15. Over my dead body ...
● Avoid Spark in production; it's cool, but it doesn't belong to you.
○ It's a shared thing; you don't want your product to fail because of someone else's experiment.
● Reduce upstream data dependencies
○ We relied on the IQL:organic index, and it kept dying for a long time.
● AWS is your good friend
○ The first design depended on AWS only minimally (GPU instances for model training).
○ We changed the architecture 3 times; the latest version goes to AWS for everything,
from data preprocessing to evaluation. (Then the team soon got disbanded.)
○ It's not free, but neither is Indeed's infra.
17. ● Let's review word2vec, since item2vec borrows its idea from word2vec
1. Take training pairs from words within a sliding window over a sentence
2. "Randomly reduce the window" so that words closer to the center have a higher probability of being chosen (see the sketch after this slide)
Item2vec: Skip-gram model
An example of window_size=2
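A small sketch of steps 1-2. The random window reduction follows the word2vec convention of drawing an effective window size uniformly from 1..window_size for each center; function and variable names are illustrative:

```python
import random

def skipgram_pairs(sentence, window_size=2, seed=0):
    """Yield (center, context) training pairs with word2vec-style random window reduction."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(sentence):
        reduced = rng.randint(1, window_size)  # nearer neighbors get picked more often
        for j in range(max(0, i - reduced), min(len(sentence), i + reduced + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

# A user session works the same way, with jobs instead of words (see the next slides).
print(skipgram_pairs(["J1", "J2", "J3", "J4", "J5"], window_size=2))
```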
18. ● Borrowed the idea from word2vec
○ Train with the skip-gram or CBOW model (we chose skip-gram; better performance in our case)
○ Train on this ground truth:
when the center word is w(t), which other words will be within the window (close to w(t))?
Item2vec: Skip-gram model
19. ● In Indeed's case
○ The user behavior sequence is the "sentence"; clicked / applyStart jobs are the "words".
○ It trains directly on "if a user applied to this job, will they also apply to that job soon?"
○ It's supervised training (the response variable aligns with your business goal), even though the original word2vec isn't.
Item2vec: Skip-gram model
(Diagram) A user session, ordered by time (Job1 … Job8), is treated as the sentence. Sliding a window over it yields training samples such as:
(J1, J2), (J1, J3), (J2, J1), (J2, J3), (J2, J4), (J3, J1), (J3, J2), (J3, J4), (J3, J5), (J4, J2), (J4, J3), (J4, J5), (J4, J6), …
20. Item2vec: Global Context
● How do we weigh click vs. applyStart?
○ Click is a noisy signal; applyStart is relatively strong but a rare event.
○ Training only on applyStart leads to:
■ lower job coverage, since many jobs never get an applyStart.
■ lower precision, since the data volume is smaller.
○ Naively merge click + applyStart as the response variable?
■ Mixing weak and strong data together doesn't make sense.
● Problem: there is no way to weight click/apply differently!
21. Item2vec: Global Context
● Solution: item2vec with global context (the KDD 2018 best paper, by Airbnb).
○ Clicks are the main sequence; get pairs with a sliding window (and reduced window).
○ Applies are the "global context", always paired with every single clicked job (see the sketch after the diagram below).
(Diagram) User clicked jobs (Job1 … Job8, ordered by time) form the main sequence for the sliding window; user applyStart jobs (Job9, Job10) are the global context, paired with every center job. Training samples:
(J1, J2), (J1, J3), (J1, J9), (J1, J10), (J2, J1), (J2, J3), (J2, J4), (J2, J9), (J2, J10), (J3, J1), (J3, J2), (J3, J4), (J3, J5), (J3, J9), (J3, J10), (J4, J2), (J4, J3), (J4, J5), (J4, J6), …
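A sketch of how the global-context variant extends pair generation: clicks go through the sliding window (with random reduction) as before, while every applyStart job is paired with every clicked center. Names and the toy session are illustrative:

```python
import random

def global_context_pairs(clicked, applied, window_size=2, seed=0):
    """Sliding-window pairs from clicks, plus (click, applied_job) pairs as global context."""
    rng = random.Random(seed)
    pairs = []
    for i, center in enumerate(clicked):
        reduced = rng.randint(1, window_size)          # word2vec-style reduced window
        for j in range(max(0, i - reduced), min(len(clicked), i + reduced + 1)):
            if j != i:
                pairs.append((center, clicked[j]))
        for apply_job in applied:                      # applyStarts are "always in the window"
            pairs.append((center, apply_job))
    return pairs

# Clicked jobs J1..J4 as the main sequence, applyStart jobs J9/J10 as global context.
print(global_context_pairs(["J1", "J2", "J3", "J4"], ["J9", "J10"]))
```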
23. Siamese-NN family: problem & design
● Problem statement
○ From item2vec, we have the idea of how to
■ Turn user behavior sequences (clicks / applyStarts) into job-pair training data.
■ Learn these pairs as a classification problem (which machine learning is good at).
○ Next, how do we include the Job Description?
■ Since users decide whether to apply to a job based on it.
■ At the same time, we still want to learn from user behavior!
○ The Job Description is HUGE data, so scaling is challenging!
■ Still, the model shall only predict embeddings.
■ Build recos ONLY from Approximate-KNN on the embeddings (without the model).
24. ● Design: Siamese Neural Network
○ Scaling: we want the job embedding, and we define L2 distance as job similarity
○ Train the model the same way you will use it (supervised learning):
■ Extract JD features with an "encoding layer"
● candidates: RNN/CNN/Transformer/BERT
■ Response = the L2 distance between the job embeddings
● User applied to both jobs: similarity = 1
● User applied to one, ignored the other: similarity = 0
● A dissimilar job = another job that was viewed but not applied to
○ Secret sauce (it can't work without this):
■ BatchNorm + Dense learner after the JobVec.
Siamese-NN family: problem & design
25. (Diagram: inside one training pair)
● Input: the job descriptions of the 2 jobs in a training pair, e.g. (J1, J2), (J1, J3), (J1, J9), (J1, J10).
● The word embeddings of each doc (JD) are concatenated into an array.
● Any feature-extraction layer (CNN/RNN/Transformer/BERT) can convert the doc-embedding array into a 1D array.
● Once both jobs have become 1D vectors, we can calculate their distance (L1 or L2).
● Some tricks here:
1. Use batch_norm to constrain the vector size, making job embeddings comparable in the L1/L2 domain.
2. Use a simple (1- or 2-layer) MLP to learn interpolated features.
(See the sketch after this slide.)
Siamese-NN family: problem & design
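A minimal PyTorch sketch of the design on these two slides. The CNN encoder is just a placeholder for whichever encoder is chosen (CNN/RNN/Transformer/BERT), the contrastive loss is an assumption (the deck fixes only the L2 distance and the 0/1 similarity labels, not the exact loss), and all sizes are toy values:

```python
import torch
import torch.nn as nn

class JobEncoder(nn.Module):
    """Turns a padded sequence of word ids (one JD) into one job vector.
    The CNN here is only a placeholder for the real encoder (CNN/RNN/Transformer/BERT)."""
    def __init__(self, vocab_size=50_000, word_dim=100, job_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.conv = nn.Conv1d(word_dim, 128, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        # "Secret sauce": BatchNorm + Dense after the raw job vector,
        # keeping job embeddings comparable in the L2 domain.
        self.head = nn.Sequential(nn.Linear(128, job_dim), nn.BatchNorm1d(job_dim))

    def forward(self, token_ids):                              # (batch, seq_len)
        x = self.emb(token_ids).transpose(1, 2)                # (batch, word_dim, seq_len)
        x = self.pool(torch.relu(self.conv(x))).squeeze(-1)    # (batch, 128)
        return self.head(x)                                    # (batch, job_dim)

def contrastive_loss(vec_a, vec_b, label, margin=1.0):
    """label = 1 if the user applied to both jobs, 0 otherwise; response is the L2 distance."""
    dist = torch.norm(vec_a - vec_b, dim=1)
    return (label * dist.pow(2) +
            (1 - label) * torch.clamp(margin - dist, min=0).pow(2)).mean()

# Shared (Siamese) weights: the same encoder is applied to both JDs in a pair.
encoder = JobEncoder()
jd_a = torch.randint(1, 50_000, (8, 200))   # 8 pairs, 200 tokens per JD (toy shapes)
jd_b = torch.randint(1, 50_000, (8, 200))
labels = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(encoder(jd_a), encoder(jd_b), labels)
loss.backward()
```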
26. Encoder candidates (just FYI, each one is a long story)
Tfidf
● n-gram tfidf is the most intuitive and widely used technique; it's NLP 101.
● We already have this in production as the existing content-based model.
● Dimension = dictionary size, much larger (usually 100x) than all the other embeddings; too heavyweight to use.
CNN
● Pulls n-gram feature extraction into the model-training process; choosing features and iterating manually is of course less precise and much slower than letting the model do it.
● Maps n-grams into an embedding space, so n-grams with the same meaning but different words can now be treated as the same.
● Tried in RECJOBS-753.
RNN
● Has a long-term memory mechanism, so theoretically it can capture patterns that sit far apart from each other. Compared to Tfidf and CNN, it's more promising for long text (like our job descriptions, many of which run up to 800-1000 words).
● Extremely slow, since it cannot be accelerated by parallelizing; also, if the text is really long, signals far apart still fade.
● We're not planning to try this, since RNN is too slow and not particularly promising.
Capsule Network
● A novel idea proposed by Hinton; rather than a replacement for CNN/RNN, it's a layer placed after CNN/RNN that builds a hierarchical understanding of the features extracted by the prior layer.
● It is meant to imitate how humans recognize objects and ideas.
● Experimenting in RECJOBS-898.
Transformer
● Comes from the "attention mechanism" that used to be applied to RNNs, but it was later shown to work pretty well on its own, even better than CNN/RNN.
● Currently the best architecture for capturing long-distance features in text: unlike RNN's long-term memory, which gradually fades, it is always aware of the full article, and at the same time it can be parallelized (basically the killer of RNN).
● We plan to try it in RECJOBS-857.
BERT
● The training phase uses a bidirectional transformer on 2 self-supervision tasks: a) a cloze test, b) sentence relationships.
● The pre-trained model can be used in a couple of different ways; for us the best option is to use it to encode jobs, then cascade that result to another learner that takes other meta as features and trains together with the BERT embedding on our business task.
● Experimenting in RECJOBS-886.
27. 1. For the convenience of scaling, we want embedding + A-KNN solutions
○ yeah, we have item2vec
2. Hey, how about we weight clicks & applies differently?
○ Airbnb published item2vec + global context; let's use it!
3. Hey, can we include the job description as a feature? Since apparently users read it before applying.
○ Let's augment item2vec with the JD => use a Siamese-NN trained on item2vec pairs
4. Hey, this Siamese-NN is actually generic over any kind of encoder!
○ Let's try all the state-of-the-art NLP deep-learning encoders on it!
Summary on Algorithms
29. ● Unfortunately, we never solved the online fresh-data problem
○ The Indeed cluster choked for a long stretch in early 2019.
○ Spark jobs & index builders (IQL:organic) failed constantly at that time.
○ Our data-generating & evaluation ORC tasks also failed often at that time.
● The only thing that went online correctly: the vanilla version of item2vec, since it requires minimal data
○ It only needs applyStart data, and applyStart is rare, ~1% of all user behavior data.
○ Neither job metadata (location/country) nor the job description is needed.
● Special thanks to @Jialin!!!
○ She built a dedicated pipeline to extract data from logEntry for item2vec.
○ She found the frontend problem that caused all new models on Lorax to be evaluated falsely.
○ We finally have correct online results for item2vec
(Unfortunately, the team was disbanded 3 weeks later)
Evaluations: online
30. ● The vanilla version of item2vec alone
○ ApplyStart 24% better than the "ensemble of all the old RECO models" over one month (IQL)
(Chart: monthly applyStart, item2vec vs. all the old models ensembled)
Evaluations: online
31. Evaluations: online
● Coverage of P10d alone is very bad, but J2J+P10d boosts it from 50% => 80%. (IQL)
○ P10d alone could only provide ~50% coverage
○ P10d + J2J boosts coverage to ~80%
○ The ensemble of all the old models covers 90%, since it's an ensemble of a bunch of models.
(Chart: user coverage of "P10d only" vs. "J2J+P10d" vs. "ensemble of old models")
32. ● The main trade-off: precision vs. recommendation diversity
○ The item2vec family wins in precision (a decent way of learning job similarity)
○ The Siamese-NN family wins in diversity (able to find similar new jobs)
○ Details of all metrics and model variants are in RECJOBS-900.
● Another promising algorithm: "random walk" from Pinterest, which seems better than item2vec in offline eval,
but I didn't get time to finish integrating it for this test (sorry, @Kewei)
Evaluations: offline
33. ● What the online test tells us
○ Precision improved 24% with the simplest item2vec.
○ Good user coverage thanks to the J2J recos.
● What the offline tests tell us
○ The only thing that went online (item2vec) is just the simplest model. Its improved version
(item2vec with global context) is better in ALL ASPECTS (precision/coverage/diversity).
○ The content-based ones are also promising, since they:
■ use the same idea as item2vec for extracting training pairs
■ include the JD as features, learned with deep-learning NLP encoders.
○ Unfortunately, the content-based ones cannot be evaluated fairly offline, since they are meant to
recommend new, unseen jobs, so we don't know how good they could be.
Evaluations: summary
35. ● About embeddings
○ Train a supervised model on your business goal; for us it's applyStart.
■ Unsupervised bag-of-words won't be useful.
■ Supervised on "not your business goal" is useless for you as well.
● I'm lying about "universal": you have to train on your own business target
○ For Indeed, applyStart might be "the most universal".
○ Pos-Outcome is a good vision, but we don't yet have data that is
reliable/unbiased/with good coverage.
Summary - supervised embedding
36. Summary - algorithms
● Item2vec
○ Novel ideas:
■ Converts recommendation into a classification task (which machine learning is good at).
■ Defines a "distance" aligned with the business goal (click/apply).
○ Minimal data volume, simple model, lightning-fast training.
○ Cold-start problem: it has no idea about new jobs.
● Siamese Neural Network
○ Augments the idea of item2vec, adding the job description to the features.
○ Huge task, big network, needs big data to train, and prediction takes a long time.
○ Immune to cold start: it can predict new jobs and recognize re-posted jobs.