The Communities AI team at LinkedIn generates follow recommendations from a large set (tens of millions) of entities for each of our 690+ million members.
Scoring at Scale: Generating Follow Recommendations for Over 690 Million LinkedIn Members
2. Scoring At Scale: Generating Follow Recommendations for Over 690 Million LinkedIn Members
Abdulla Al Qawasmeh
Engineering Manager, AI
Emilie de Longueau
Sr Software Engineer, AI
6. Communities AI
Mission: Empower members to form communities around common interests and have active conversations
▪ Discover: follow entities with shared interests
▪ Engage: join conversations happening in communities with shared interests
▪ Contribute: engage with the right communities when creating content
https://engineering.linkedin.com/blog/2019/06/building-communities-around-interests
https://recsys.acm.org/recsys19/industry-session-1/
7. Discover: Follow Recommendations at Scale
Large-scale system that recommends entities to follow for every LinkedIn member
▪ Viewers (v): 100s of millions of members
▪ Entities (e): millions (Members, Pages, Groups, Newsletters, Hashtags, Events)
▪ Key challenge: 100s of millions of viewers x millions of entities = 100s of trillions of possible (v, e) pairs!
8. Recommendation Objective
Recommend entities that the member finds interesting and engaging.
▪ Interesting (form edges): pfollow(v follows e | e recommended to v)
▪ Engaging: utility(v engages with e | v follows e)
▪ Follow edges (links between v and e) contribute a substantial amount of content and engagement on the Feed
9. Problem Formulation: The Ranking Objective Function
PFOLLOW model:
● Binary response
● Predicts the probability of following the entity given an impression
UTILITY model:
● Continuous response
● Measures the engagement between viewer and entity after the follow edge is formed
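The objective function itself did not survive extraction from the slides. As a hedged reconstruction (the notation and the trade-off exponent $\alpha$ are my assumptions, not the deck's exact formula), a common way to combine the two models for ranking is:

$$\mathrm{score}(v, e) \;=\; p_{\mathrm{follow}}(v, e) \cdot \mathrm{utility}(v, e)^{\alpha}$$

where $\alpha$ tunes how much expected post-follow engagement is weighted against the immediate probability of forming the follow edge.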
11. Active vs. Inactive Members
Recommending entities to follow for every LinkedIn member (690+ million members)
▪ Active members: users who have performed recent actions on LinkedIn
▪ Inactive members: new users to LinkedIn, or registered users who have not performed recent actions on LinkedIn
12. Active vs. Inactive Members
Recommending entities to follow for every LinkedIn member
▪ Active members (high % of client calls)
  ▪ Users who have performed recent actions on LinkedIn
  ▪ Personalized recommendations precomputed offline per member (+ real-time contextual recommendations based on recent activity)
  ▪ Heavy Spark offline pipeline
▪ Inactive members (low % of client calls)
  ▪ New users to LinkedIn, or registered users who have not performed recent actions on LinkedIn
  ▪ Segment-based recommendations precomputed offline per segment (e.g. industry, skills, country) and fetched online
  ▪ Lightweight Spark offline pipeline
13. Scoring Architecture
Simplified end-to-end pipeline for active members
OFFLINE (Spark):
▪ Active-member precomputed recommendations and scores, keyed as (viewer, (entity, score)), pushed periodically to a key-value store
▪ Context-based precomputed recommendations and scores, keyed as (context, (entity, score)), e.g. (followed, (entity_X, score)) or (interacted, (entity_Y, score)), pushed periodically to a key-value store
ONLINE (Java):
1. Client request: "Get follow recommendations for viewer X"
2. Query the active-members store for X
  ▪ Found (active member): fetch X's recent contexts from the Recent Member Activity realtime service, then query the context store for contextual recommendations
  ▪ Not found (inactive member): fall back to the segment-based recommendations
3. Blending, filtering, and final scoring before returning results to the client
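To make the online read path concrete, here is a minimal sketch in Scala (the production service is Java; the store interfaces, names, and the blending placeholder are all illustrative assumptions, not LinkedIn's actual API):

```scala
case class ScoredEntity(entityId: Long, score: Double)

// Hypothetical key-value store interface; the real stores are populated by the Spark pushes.
trait KeyValueStore[K] { def get(key: K): Option[Seq[ScoredEntity]] }

object FollowRecommendationService {
  // Placeholder for the real Blending -> Filtering -> Final Scoring stages.
  def blendFilterScore(candidates: Seq[ScoredEntity]): Seq[ScoredEntity] =
    candidates.distinctBy(_.entityId).sortBy(-_.score)

  // Placeholder for the segment-based path used for inactive members.
  def segmentBasedRecommendations(viewerId: Long): Seq[ScoredEntity] = Seq.empty

  def recommendationsFor(
      viewerId: Long,
      memberStore: KeyValueStore[Long],     // (viewer, (entity, score))
      contextStore: KeyValueStore[String],  // (context, (entity, score))
      recentContexts: Long => Seq[String]   // Recent Member Activity realtime service
  ): Seq[ScoredEntity] =
    memberStore.get(viewerId) match {
      case Some(precomputed) =>             // Found: active member
        val contextual =
          recentContexts(viewerId).flatMap(ctx => contextStore.get(ctx).getOrElse(Nil))
        blendFilterScore(precomputed ++ contextual)
      case None =>                          // Not found: inactive member
        segmentBasedRecommendations(viewerId)
    }
}
```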
15. Feature Categories
Viewer features (small number):
▪ Follow-through-rate (FTR)
▪ Feed click-through-rate (CTR)
▪ Impression counts
▪ Interaction counts
▪ Segments: industry, country, skills, company...
▪ Language(s)
...
Pair/interaction features (large number):
▪ Viewer-entity engagement
▪ Segment-entity engagement and follows
▪ Graph-based features
▪ Browsemap scores of entities already followed by the viewer (blog link)
▪ Embedding features
... many more
Entity features (medium number):
▪ Follow-through-rate (FTR)
▪ Unfollow-through-rate (UTR)
▪ Feed click-through-rate (CTR)
▪ Impression counts
▪ Interaction counts
▪ Number of posts
▪ Language(s)
...
Illustrative record shapes for these three tables are sketched below.
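A minimal sketch of how the three feature tables might be represented (field names are assumptions drawn from the categories above, not LinkedIn's actual schema):

```scala
// Viewer features: small number of features, keyed by viewer.
case class ViewerFeatures(
    viewerId: Long,
    ftr: Double,            // follow-through-rate
    feedCtr: Double,        // feed click-through-rate
    impressions: Long,
    interactions: Long,
    industry: String,
    country: String,
    languages: Seq[String]
)

// Entity features: medium number of features, keyed by entity.
case class EntityFeatures(
    entityId: Long,
    ftr: Double,
    utr: Double,            // unfollow-through-rate
    feedCtr: Double,
    impressions: Long,
    interactions: Long,
    numPosts: Long,
    languages: Seq[String]
)

// Pair/interaction features: large number of features, keyed by (viewer, entity).
case class PairFeatures(
    viewerId: Long,
    entityId: Long,
    viewerEntityEngagement: Double,
    segmentEntityEngagement: Double,
    browsemapScore: Double,      // from entities already followed by the viewer
    embeddingSimilarity: Double  // example embedding-based feature
)
```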
16. Joining Features
▪ Viewer features: millions of distinct active members
▪ Entity features: millions of recommendable entities (members, companies, hashtags, newsletters)
▪ Pair/interaction features: trillions of possible (viewer, entity) pairs, cut down to 100s of billions of (viewer, entity) pairs by candidate selection
Two questions:
▪ How do we manage the explosive growth of members and entities? Candidate selection.
▪ How can we join all the features together while meeting acceptable performance?
17. 1st Option: 3-Way Spark Join
Each table is partitioned: viewer features (GBs), pair/interaction features (TBs), entity features (GBs).
▪ 1st hash join (.join()) on the viewerId key: 100s of TB of shuffle
▪ 2nd hash join (.join()) on the entityId key: 100s of TB of shuffle, very skewed
Drawbacks:
● 2 gigantic shuffles
● Poor runtime performance
● Problematic skewness
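A minimal Spark sketch of this first option (paths and column names are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession

object NaiveThreeWayJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("naive-3-way-join").getOrCreate()

    // Illustrative inputs: viewer (~GBs), entity (~GBs), pair (~TBs) feature tables.
    val viewerFeatures = spark.read.parquet("/features/viewer") // (viewerId, ...)
    val entityFeatures = spark.read.parquet("/features/entity") // (entityId, ...)
    val pairFeatures   = spark.read.parquet("/features/pair")   // (viewerId, entityId, ...)

    val joined = pairFeatures
      .join(viewerFeatures, Seq("viewerId")) // 1st hash join: shuffles the TB-scale pair table by viewerId
      .join(entityFeatures, Seq("entityId")) // 2nd hash join: re-shuffles it by entityId (highly skewed)

    joined.write.parquet("/features/joined")
  }
}
```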
19. 2nd Option: Partial Scoring with a Linear Model
Score each feature table separately (partial scoring), then perform the 3-way join on the much smaller partial-score outputs (GBs instead of TBs).
Advantage:
● Manageable 3-way join performed on smaller outputs
Disadvantages:
● Scoring overhead and intermediary outputs
● Constrained to using a linear model
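Why the linear-model constraint? With a linear model, the score decomposes into a sum of per-table terms, so each table can be reduced to a single scalar before the join (the notation below is mine, not the deck's):

$$s(v, e) \;=\; w_V^\top x_v \;+\; w_P^\top x_{v,e} \;+\; w_E^\top x_e$$

where $x_v$, $x_{v,e}$, $x_e$ are the viewer, pair, and entity feature vectors and the $w$'s are the corresponding weight blocks. A non-linear model such as XGBoost interacts features across tables, so its score does not decompose this way; this is exactly the constraint the 2D hash-partitioned join removes.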
21. Goal: Avoid Huge Shuffles
Bottleneck: a large, wide table of pair features plus a skewed entity distribution.
Can we join the features together without shuffling the pair features table?
22. 2D Hash-Partitioned Join
Partitioning of the 3 feature tables (blog link):
▪ Hash-partition the viewer features table into V partitions
▪ Hash-partition the entity features table into E partitions
▪ Partition the pair features table into V * E partitions, using a 2-dimensional custom partition function that allows joining on two keys (viewer, entity)
▪ Choose V and E so that every viewer and entity partition can be loaded into memory (depends on data size and executor memory)
[Diagram: viewer features in partitions V1..V3, entity features in partitions E1..E3, and pair features in the V x E grid of partitions (V1 E1) through (V3 E3)]
23. 2D Hash-Partitioned Join
Smart partitioning of the pair features table. For a pair (viewer v, entity e), with h a custom positive hash function:
▪ Viewer table partition number: h(v) % V
▪ Entity table partition number: h(e) % E
▪ Pair table partition number: P = (h(v) % V) * E + (h(e) % E)
For each pair partition P, we always have a single corresponding:
▪ Viewer partition number: P / E (integer division)
▪ Entity partition number: P % E
Example: V = 50, E = 100, h(x) = abs(x). The pair (viewer 1001, entity 220) lands in pair partition P = (1001 % 50) * 100 + (220 % 100) = 120, whose corresponding viewer table partition is 120 / 100 = 1 (V1) and entity table partition is 120 % 100 = 20 (E20).
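A minimal Scala sketch of this partitioning scheme (class and method names are illustrative, not LinkedIn's implementation; note that the single-key tables must use the same hash h as the pair table to stay aligned):

```scala
import org.apache.spark.Partitioner

// Custom positive hash; shared by all three partitioners so partitions align.
object H { def apply(x: Long): Long = math.abs(x) }

// Single-key partitioner for the viewer (n = V) and entity (n = E) tables.
class OneDimPartitioner(n: Int) extends Partitioner {
  override def numPartitions: Int = n
  override def getPartition(key: Any): Int = key match {
    case id: Long => (H(id) % n).toInt
  }
}

// Two-dimensional partitioner for the pair features table: V * E partitions.
class TwoDimPartitioner(v: Int, e: Int) extends Partitioner {
  override def numPartitions: Int = v * e
  override def getPartition(key: Any): Int = key match {
    case (viewerId: Long, entityId: Long) =>
      ((H(viewerId) % v) * e + H(entityId) % e).toInt
  }
}

// For pair partition p, the aligned partitions are p / e (viewer) and p % e (entity).
```

With v = 50 and e = 100, the key (1001L, 220L) hashes to (1001 % 50) * 100 + 220 % 100 = 120, matching the worked example above.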
24. 2D Hash-Partitioned Join
Join algorithm, implemented with .mapPartitions():
1. Launch a mapper for each pair partition
2.1. Load the corresponding entity partition as an in-memory hashmap
2.2. Load the corresponding viewer partition (presorted by viewer id) into a stream reader
3. For each pair features record, look up the entity features record by entity id, and the viewer features record from the stream reader
4. Merge the three feature sets into a joined record
5. Features can be scored right away before storing to HDFS!
[Diagram: each pair partition (Vi Ej) is joined against viewer partition Vi and entity partition Ej]
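A simplified Scala sketch of one mapper's work, assuming pair records are also sorted by viewer id within each pair partition (all types and names here are illustrative):

```scala
case class ViewerFeatures(viewerId: Long, features: Map[String, Double])
case class EntityFeatures(entityId: Long, features: Map[String, Double])
case class PairFeatures(viewerId: Long, entityId: Long, features: Map[String, Double])
case class JoinedRecord(viewerId: Long, entityId: Long, features: Map[String, Double])

// Runs inside .mapPartitions() on one pair partition.
def joinOnePartition(
    pairRecords: Iterator[PairFeatures],     // this pair partition, sorted by viewerId
    viewerStream: Iterator[ViewerFeatures],  // aligned viewer partition, presorted by viewerId
    entityMap: Map[Long, EntityFeatures]     // aligned entity partition, loaded in memory
): Iterator[JoinedRecord] = {
  val viewers = viewerStream.buffered
  pairRecords.flatMap { pair =>
    // Merge-join style: advance the viewer stream to this pair's viewer.
    while (viewers.hasNext && viewers.head.viewerId < pair.viewerId) viewers.next()
    if (viewers.hasNext && viewers.head.viewerId == pair.viewerId)
      entityMap.get(pair.entityId).map { entity => // hashmap lookup by entity id
        JoinedRecord(pair.viewerId, pair.entityId,
          viewers.head.features ++ entity.features ++ pair.features)
      }
    else None                                      // no viewer features: drop the pair
  }
}
```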
25. New Offline Scoring Pipeline
BEFORE: partial scoring with a linear model, producing GB- and TB-scale intermediate outputs.
AFTER: a single 2D hash-partitioned join job, with three wins:
▪ No shuffle of the pair features table during the join
▪ No intermediate data stored in HDFS (single Spark job)
▪ Ability to score using a non-linear model that interacts features (XGBoost)
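Because the join and scoring happen in the same .mapPartitions() pass, joined feature vectors never need to hit HDFS. A hedged sketch, with the model abstracted as a plain function (in production it would be a loaded XGBoost booster; the feature vectorization shown is an assumption):

```scala
case class JoinedRecord(viewerId: Long, entityId: Long, features: Map[String, Double])

// Score joined records in the same pass as the join, emitting the
// (viewer, (entity, score)) pairs that get pushed to the key-value store.
def scoreOnePartition(
    joined: Iterator[JoinedRecord],
    model: Array[Float] => Float,      // e.g. a loaded XGBoost booster's predict
    featureOrder: IndexedSeq[String]   // fixed vectorization order for the model
): Iterator[(Long, (Long, Float))] =
  joined.map { r =>
    val x = featureOrder.map(name => r.features.getOrElse(name, 0.0).toFloat).toArray
    (r.viewerId, (r.entityId, model(x)))
  }
```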
29. Gains After Adopting the 2D Hash-Partitioned Join
Performance (offline scoring runtime):
▪ Cost-to-serve of the offline scoring pipeline reduced by 5x (in GB·h)
▪ HDFS storage: intermediate outputs reduced by 8x
Relevance:
▪ Enabled the transition from linear models (logistic and linear regression) to a non-linear model (XGBoost)
▪ Total follows up by 17%
▪ Engagement up by 11%