Scoring At Scale: Generating
Follow Recommendations for Over
690 Million LinkedIn Members
Abdulla Al Qawasmeh
Engineering Manager, AI
Emilie de Longueau
Sr Software Engineer, AI
Agenda
▪ Introduction to Follows Relevance at LinkedIn
▪ Offline Scoring Architecture
▪ Scalability Improvements with 2D Hash-Partitioned Join
Introduction to Follows Relevance
Product Placements
Communities AI
Mission: Empower members to form communities around common interests and have active conversations
▪ Discover: follow entities with shared interest
▪ Engage: join conversations happening in communities with shared interest
▪ Contribute: engage with the right communities when creating content
https://engineering.linkedin.com/blog/2019/06/building-communities-around-interests
https://recsys.acm.org/recsys19/industry-session-1/
Discover: Follow Recommendations at Scale
Large-scale system that recommends entities to follow for every LinkedIn member
▪ Viewers (v): 100s of millions of members
▪ Entities (e): millions (members, pages, newsletters, groups, hashtags, events)
Key Challenge: 100s of trillions of possible (viewer, entity) pairs!
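The order of magnitude of the key challenge can be checked with a quick calculation. This is only a back-of-the-envelope sketch: the viewer count is taken from the talk title, and the entity count of 1 million is a hypothetical lower bound, since the slides only say "millions".

```python
# Back-of-the-envelope count of candidate (viewer, entity) pairs.
# Both inputs are assumptions: the talk says "690+ million" members
# and only "millions" of entities.
viewers = 690_000_000
entities = 1_000_000  # hypothetical lower bound

possible_pairs = viewers * entities
trillions = possible_pairs / 1e12

print(f"{possible_pairs:.2e} pairs, about {trillions:.0f} trillion")
```

Even with the smallest plausible entity count, the pair space already sits in the hundreds of trillions, which is why exhaustive scoring of every pair is not an option.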
Recommendation Objective
▪ Interesting (form edges): pfollow(v follows e | e recommended to v)
▪ Engaging: utility(v engages e | v follows e)
▪ Follow edges (link between v and e) contribute a substantial amount of
content and engagement on the Feed
Recommend entities that the member finds interesting and engaging
Problem Formulation
PFOLLOW Model:
● Binary response
● Predicts the probability of following the entity given an impression
UTILITY Model:
● Continuous response
● Looks at engagement between viewer and entity after the follow edge is formed
The ranking objective function
Offline Scoring Architecture
Active vs Inactive members
Recommending entities to follow for every LinkedIn member (690+ million members)
▪ Active Members (high % of client calls)
  ▪ Users who have performed recent actions on LinkedIn
  ▪ Personalized recommendations precomputed offline per member (+ real-time contextual recommendations based on recent activity)
  ▪ Heavy Spark offline pipeline
▪ Inactive Members (low % of client calls)
  ▪ New users to LinkedIn
  ▪ Registered users who have not performed recent actions on LinkedIn
  ▪ Segment-based recommendations precomputed offline per segment (e.g. industry, skills, country) and fetched online
  ▪ Lightweight Spark offline pipeline
Scoring Architecture
Simplified end-to-end pipeline for active members
OFFLINE (Spark)
▪ Active member precomputed recommendations and scores, (viewer, (entity, score)): pushed periodically to a Key-Value Store
▪ Context-based precomputed recommendations and scores, (context, (entity, score)): pushed to a Key-Value Store
ONLINE (Java)
▪ CLIENT: "Get follow recommendations for viewer X"
▪ Query active members store for X
  ▪ Not Found: Inactive member
  ▪ Found: Active member
▪ Fetch contexts for X from Recent Member Activity (Realtime Service)
▪ Query store for contextual recommendations, e.g. (followed, (entity_X, score)), (interacted, (entity_Y, score))
▪ Final Scoring, Filtering, Blending
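The online flow above can be sketched in a few lines. This is a minimal illustration and not LinkedIn's service: plain dicts stand in for the key-value stores, the function and parameter names are hypothetical, blending is assumed to mean "keep the best score per entity", filtering is omitted, and the inactive-member path is reduced to a single fallback list.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[str, float]]  # (entity, score) pairs


def get_follow_recommendations(
    viewer: str,
    active_store: Dict[str, Scored],
    context_store: Dict[str, Scored],
    recent_contexts: Dict[str, List[str]],
    segment_fallback: Scored,
    top_k: int = 3,
) -> Scored:
    """Hypothetical online lookup; dicts stand in for the key-value stores."""
    precomputed = active_store.get(viewer)
    if precomputed is None:
        # Not found: treat as an inactive member, serve segment-based results.
        return segment_fallback[:top_k]

    # Found: active member. Add contextual recommendations for recent activity.
    candidates = list(precomputed)
    for context in recent_contexts.get(viewer, []):
        candidates.extend(context_store.get(context, []))

    # Blending (assumed: keep the best score per entity), then ranking.
    best: Dict[str, float] = {}
    for entity, score in candidates:
        best[entity] = max(score, best.get(entity, float("-inf")))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy data, entirely made up for the example.
active = {"X": [("e1", 0.9), ("e2", 0.4)]}
contextual = {"followed:e1": [("e3", 0.7), ("e2", 0.6)]}
recent = {"X": ["followed:e1"]}
fallback = [("popular1", 0.5)]

recs_active = get_follow_recommendations("X", active, contextual, recent, fallback)
recs_inactive = get_follow_recommendations("Y", active, contextual, recent, fallback)
print(recs_active)
print(recs_inactive)
```

The point of the split is that the expensive personalized scoring happens offline, so the online path is limited to key lookups plus a light merge.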
Feature Categories
Viewer Features (small number)
▪ Follow-through-rate (FTR)
▪ Feed click-through-rate (CTR)
▪ Impression counts
▪ Interaction counts
▪ Segments: industry, country,
skills, company...
▪ Language(s)
...
Pair/Interaction Features (large number)
▪ Viewer-entity engagement
▪ Segment-entity engagement and follow
▪ Graph-based features
▪ Browsemap scores of entities already followed by the viewer
(blog link)
▪ Embedding features
… many more
Entity Features (medium number)
▪ Follow-through-rate (FTR)
▪ Unfollow-through-rate (UTR)
▪ Feed click-through-rate (CTR)
▪ Impression counts
▪ Interaction counts
▪ Number of posts
▪ Language(s)
...
Joining Features
How to manage the explosive growth of members / entities
▪ Viewer Features: millions of distinct active members
▪ Pair/Interaction Features: trillions of possible (viewer-entity) pairs, reduced by candidate selection to 100s of billions of (viewer-entity) pairs
▪ Entity Features: millions of recommendable entities (member, company, hashtag, newsletters)
How can we join all features together and meet an acceptable performance?
1st Option: 3-way Spark Join
Inputs, each partitioned: Viewer Features (GBs), Pair/Interaction Features (TBs), Entity Features (GBs)
▪ 1st HASH JOIN (.join()) on viewerId key (100s TB of shuffle)
▪ 2nd HASH JOIN (.join()) on entityId key (100s TB of shuffle, very skewed)
● 2 gigantic shuffles
● Poor runtime performance
● Problematic skewness
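Why the second join is "very skewed" can be seen on synthetic data. The sketch below is an assumption-laden toy: `key % num_partitions` stands in for Spark's hash partitioner, and the data is constructed so that one entity (id 7) is paired with every viewer, mimicking a very popular entity.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records a hash partitioner sends to each partition."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# Synthetic pair records (viewer_id, entity_id): every viewer is paired with
# one very popular entity (id 7) plus one entity of its own.
pairs = [(v, 7) for v in range(1000)] + [(v, 100 + v) for v in range(1000)]

by_viewer = partition_sizes([v for v, _ in pairs], num_partitions=4)
by_entity = partition_sizes([e for _, e in pairs], num_partitions=4)

print("partitioned on viewerId:", by_viewer)
print("partitioned on entityId:", by_entity)
```

Partitioning on the viewer key spreads records evenly, while partitioning on the entity key sends every record of the popular entity to one partition, so a single task ends up with most of the work.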
2nd Option: Partial Scoring with Linear model
Partial scoring applied to each feature table (GBs, TBs, GBs) before the join
● Manageable 3-way join performed on smaller outputs
Disadvantages:
● Scoring overhead and intermediary outputs
● Constrained to a linear model
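Partial scoring works because a linear score is a sum of per-table contributions, so each table can be reduced to one number per row before the join. The snippet below illustrates that identity with made-up weights and features; none of the values come from the talk.

```python
def dot(weights, features):
    """Plain dot product of two equal-length lists."""
    return sum(w * x for w, x in zip(weights, features))


# Hypothetical weights and features, split by feature table.
w_viewer, w_pair, w_entity = [0.5, -1.0], [2.0, 0.25, 1.0], [0.5]
x_viewer, x_pair, x_entity = [1.0, 2.0], [0.5, 4.0, 1.0], [3.0]

# Full scoring needs the wide, fully joined feature vector.
full_score = dot(w_viewer + w_pair + w_entity, x_viewer + x_pair + x_entity)

# Partial scoring reduces each table to one number per row before the join,
# so the join only has to carry three scalars per pair.
partial_scores = [
    dot(w_viewer, x_viewer),
    dot(w_pair, x_pair),
    dot(w_entity, x_entity),
]
joined_score = sum(partial_scores)

print(full_score, joined_score)
```

A model whose score depends on interactions between viewer, pair, and entity features cannot be split into independent per-table terms like this, which is the linear-model constraint listed among the disadvantages.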
Scalability Improvements with 2D
Hash-Partitioned Join
Goal: Avoid huge shuffles
Bottleneck:
Large / wide table of pair features + skewed entity distribution
Can we join the features together without shuffling the pair features?
2D Hash-Partitioned Join
Partitioning of the 3 feature tables
▪ Hash-partition the viewer features table into V partitions
▪ Hash-partition the entity features table into E partitions
▪ Partition the pair features table into V * E partitions, using a 2-dimensional custom partition function to allow joining on two keys (member, entity)
▪ Choose E and V so that every member and entity partition can be loaded into memory (depends on data size + executor memory)
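One simple way to pick V and E from data size and executor memory is to divide each table's size by a per-partition memory budget. This is only a sketch under strong assumptions: the table sizes and the budget are invented, partitions are assumed to be evenly sized, and serialization or hashmap overhead is ignored.

```python
import math


def min_partitions(table_bytes: int, memory_budget_bytes: int) -> int:
    """Smallest partition count such that an evenly split partition fits the budget."""
    return max(1, math.ceil(table_bytes / memory_budget_bytes))


GB = 1024 ** 3
# All sizes below are hypothetical; the slides only say "GBs".
viewer_table, entity_table = 200 * GB, 40 * GB
budget_per_side = 2 * GB  # assumed share of executor memory for each side

V = min_partitions(viewer_table, budget_per_side)
E = min_partitions(entity_table, budget_per_side)
print(V, E, V * E)  # viewer partitions, entity partitions, pair partitions
```

The product V * E is also the number of pair partitions, so shrinking the memory budget raises the task count quickly.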
[Figure: the Viewer Features table split into partitions V1 to V3, the Entity Features table split into partitions E1 to E3, and the Pair Features table split into the 3 x 3 grid of partitions V1E1 to V3E3]
2D Hash-Partitioned Join
Smart partitioning of pair features table
h: Custom positive hash function
For a pair (viewer v, entity e):
▪ Viewer table partition number: h(v) % V
▪ Entity table partition number: h(e) % E
▪ Pair table partition number: (h(v) % V) * E + (h(e) % E)
For each pair partition P, there is always a single corresponding:
▪ Viewer partition number, equal to P / E (integer division)
▪ Entity partition number, equal to P % E
Example: V = 50, E = 100, h(x) = abs(x)
▪ The pair (viewer 1001, entity 220) falls in pair partition P = 120
▪ Entity table partition: 120 % 100 = 20 (partition E20)
▪ Viewer table partition: 120 / 100 = 1 (partition V1)
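The partition arithmetic can be written out directly. The sketch below uses the slide's example values (V = 50, E = 100, h(x) = abs(x)); the function names are illustrative and not taken from the LinkedIn implementation.

```python
def h(x: int) -> int:
    """Custom positive hash; the slide's example uses abs(x)."""
    return abs(x)


def pair_partition(viewer_id: int, entity_id: int, V: int, E: int) -> int:
    """2D partition number for a (viewer, entity) pair, in the range [0, V * E)."""
    return (h(viewer_id) % V) * E + (h(entity_id) % E)


def viewer_partition(p: int, E: int) -> int:
    """Viewer table partition that matches pair partition p."""
    return p // E


def entity_partition(p: int, E: int) -> int:
    """Entity table partition that matches pair partition p."""
    return p % E


V, E = 50, 100
P = pair_partition(1001, 220, V, E)
print(P, viewer_partition(P, E), entity_partition(P, E))  # 120 1 20
```

Because the pair partition number encodes both coordinates, each pair partition needs exactly one viewer partition and one entity partition, and the pair table itself never has to move.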
2D Hash-Partitioned Join
Join Algorithm
[Figure: Partitioned Viewer Features (V1 to V3), Partitioned Entity Features (E1 to E3), and Partitioned Pair Features (V1E1 to V3E3)]
ALGORITHM (.mapPartitions()):
1 - Launch a mapper for each pair partition
2.1 - Load the corresponding entity partition as an in-memory hashmap
2.2 - Load the corresponding viewer partition (presorted by viewer id) into a stream reader
3 - For each pair features record, look up the entity features record by entity id, and the viewer features record from the stream reader
4 - Merge the three feature sets into a joined record
5 - Features can be scored right away before storing to HDFS!
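The per-partition work in steps 2 to 5 can be sketched in plain Python. This is not the LinkedIn implementation: the function and its arguments are hypothetical, the pair records are assumed to be sorted by viewer id within the partition (so that a forward-only stream reader is enough), and pairs with missing viewer or entity features are simply skipped. In Spark, a function of this shape would be applied once per pair partition through `mapPartitions`.

```python
def join_pair_partition(pair_rows, viewer_rows, entity_rows, score):
    """Join one pair partition with its viewer and entity partitions.

    pair_rows:   (viewer_id, entity_id, pair_features), sorted by viewer_id (assumption)
    viewer_rows: (viewer_id, viewer_features), sorted by viewer_id
    entity_rows: (entity_id, entity_features)
    """
    entity_map = dict(entity_rows)      # step 2.1: in-memory hashmap
    viewer_stream = iter(viewer_rows)   # step 2.2: forward-only stream reader
    current = next(viewer_stream, None)

    for viewer_id, entity_id, pair_features in pair_rows:  # step 3
        # Advance the stream until it reaches this pair's viewer.
        while current is not None and current[0] < viewer_id:
            current = next(viewer_stream, None)
        if current is None or current[0] != viewer_id or entity_id not in entity_map:
            continue  # assumed inner-join semantics: skip incomplete pairs
        joined = (current[1], pair_features, entity_map[entity_id])  # step 4
        yield viewer_id, entity_id, score(*joined)                   # step 5


def toy_score(v, p, e):
    # Stand-in for a real model; multiplies a viewer and an entity feature,
    # which a sum of per-table partial scores could not express.
    return v[0] * e[0] + p[0]


# Toy partition contents, made up for the example.
viewers = [(1, [0.1]), (2, [0.2]), (4, [0.4])]
entities = [(10, [1.0]), (20, [2.0])]
pairs = [(1, 10, [5.0]), (2, 20, [6.0]), (3, 10, [7.0]), (4, 20, [8.0])]

results = list(join_pair_partition(pairs, viewers, entities, toy_score))
print(results)
```

Only the small entity partition is held in memory and the viewer partition is read once in order, so the large pair partition is consumed as a stream and scored in the same pass.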
New Offline Scoring Pipeline
[Figure: BEFORE (partial scoring of the GBs / TBs / GBs feature tables) vs AFTER pipeline diagrams]
▪ No shuffle of the pair features table during the join
▪ No intermediate data stored in HDFS (single Spark job)
▪ Ability to score using a non-linear model that interacts features (XGBoost)
Gains After adopting 2D Hash-Partition Join
Offline Scoring Runtime
Performance
▪ Cost to Serve of Offline Scoring
pipeline reduced by 5X (in Gb.h)
▪ HDFS Storage: intermediate
outputs reduced by 8X
Relevance
▪ Enabled transition from linear models (logistic regression, LoR, and linear regression, LiR) to a non-linear model (XGBoost)
▪ Total follows up by 17%
▪ Engagement up by 11%
Thank you!
Contacts:
● https://www.linkedin.com/in/emilie-de-longueau/
● https://www.linkedin.com/in/aalqawasmeh/
Credits:
● LinkedIn Hadoop team, in particular Fangshi Li for
implementing the algorithm and helping with its adoption
in Follows Relevance
Check these Blogs:
● LinkedIn Engineering - Communities AI: Building
Communities Around Interests
● LinkedIn Engineering - Managing
Exploding Big Data

Scoring at Scale: Generating Follow Recommendations for Over 690 Million LinkedIn Members

  • 1.
  • 2. Scoring At Scale: Generating Follow Recommendations for Over 690 Million LinkedIn Members Abdulla Al Qawasmeh Engineering Manager, AI Emilie de Longueau Sr Software Engineer, AI
  • 3. Agenda Introduction to Follows Relevance at LinkedIn Offline Scoring Architecture Scalability Improvements with 2D Hash-Partitioned Join
  • 6. Communities AI ▪ Discover ▪ Follow entities with shared interest ▪ Engage ▪ Join conversations happening in communities with shared interest ▪ Contribute ▪ Engage with the right communities when creating content https://engineering.linkedin.com/blog/2019/06/building-communities-around-interests https://recsys.acm.org/recsys19/industry-session-1/ Mission: Empower members to form communities around common interests and have active conversations
  • 7. Discover: Follow Recommendations at Scale Large-scale system that recommends entities to follow for every LinkedIn member Members: 100s of millions Entities (e): millionsX Members Pages NewslettersGroups Hashtags Key Challenge: 100s of trillions of possible pairs! Viewer (v) Events
  • 8. Recommendation Objective: Recommend entities that the member finds interesting and engaging
      ▪ Interesting (form edges): pfollow(v follows e | e recommended to v)
      ▪ Engaging: utility(v engages e | v follows e)
      ▪ Follow edges (links between v and e) contribute a substantial amount of content and engagement on the Feed
  • 9. Problem Formulation: The ranking objective function
      PFOLLOW Model:
      ● Binary response
      ● Predicts the probability of following the entity given an impression
      UTILITY Model:
      ● Continuous response
      ● Looks at engagement between viewer and entity after the follow edge is formed
  • 11. Active vs Inactive members: Recommending entities to follow for every LinkedIn member (690+ million members)
      ▪ Active Members: users who have performed recent actions on LinkedIn
      ▪ Inactive Members: new users to LinkedIn, and registered users who have not performed recent actions on LinkedIn
  • 12. Active vs Inactive members: Recommending entities to follow for every LinkedIn member
      ▪ Active Members (high % of client calls): personalized recommendations precomputed offline per member (+ real-time contextual recommendations based on recent activity); heavy Spark offline pipeline
      ▪ Inactive Members (low % of client calls): segment-based recommendations precomputed offline per segment (e.g. industry, skills, country) and fetched online; lightweight Spark offline pipeline
  • 14. Scoring Architecture: Simplified end-to-end pipeline for active members
      OFFLINE (Spark):
      ▪ Active member precomputed recommendations and scores, (viewer, (entity, score)), pushed periodically to a key-value store
      ▪ Context-based precomputed recommendations and scores, (context, (entity, score)), pushed to a key-value store
      ▪ Example records: (followed, (entity_X, score)), (interacted, (entity_Y, score))
      ONLINE (Java):
      ▪ Client request: “Get follow recommendations for viewer X”
      ▪ Query the active members store for X
      ▪ Not found: inactive member
      ▪ Found: active member. Fetch contexts for X from Recent Member Activity (realtime service), then query the store for contextual recommendations
      ▪ Final scoring, filtering, blending
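The online flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two key-value stores are modeled as plain dicts, the context keys (such as "followed:a") are hypothetical, and the blending rule (keep the best score per entity, drop entities already followed) is a placeholder, since the slides do not say how final scoring, filtering and blending are actually done.

```python
from typing import Dict, List, Optional, Set, Tuple

Recs = List[Tuple[str, float]]  # list of (entity, score)


def get_follow_recommendations(
    viewer: str,
    active_store: Dict[str, Recs],        # (viewer, (entity, score)) store
    context_store: Dict[str, Recs],       # (context, (entity, score)) store
    recent_contexts: Dict[str, List[str]],  # stand-in for the realtime activity service
    already_followed: Set[str],
    limit: int = 10,
) -> Optional[Recs]:
    """Return blended recommendations, or None when the viewer is inactive."""
    personalized = active_store.get(viewer)
    if personalized is None:
        # Not found: inactive member, served by the segment-based path instead.
        return None

    # Blending (assumed rule): keep the highest score seen for each entity.
    best: Dict[str, float] = {}
    for entity, score in personalized:
        best[entity] = max(best.get(entity, score), score)
    for context in recent_contexts.get(viewer, []):
        for entity, score in context_store.get(context, []):
            best[entity] = max(best.get(entity, score), score)

    # Filtering (assumed rule): drop entities the viewer already follows.
    kept = [(e, s) for e, s in best.items() if e not in already_followed]
    kept.sort(key=lambda rec: (-rec[1], rec[0]))
    return kept[:limit]


active_store = {"X": [("a", 0.9), ("b", 0.4)]}
context_store = {"followed:a": [("c", 0.7), ("b", 0.6)]}
recent_contexts = {"X": ["followed:a"]}

active_result = get_follow_recommendations(
    "X", active_store, context_store, recent_contexts, already_followed={"a"}
)
inactive_result = get_follow_recommendations(
    "Y", active_store, context_store, recent_contexts, already_followed=set()
)
```

Returning None for a missing viewer mirrors the "Not found: inactive member" branch; a real service would fall through to the segment-based recommendations at that point.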
  • 15. Feature Categories
      Viewer Features (small number):
      ▪ Follow-through-rate (FTR)
      ▪ Feed click-through-rate (CTR)
      ▪ Impression counts
      ▪ Interaction counts
      ▪ Segments: industry, country, skills, company...
      ▪ Language(s) ...
      Pair/Interaction Features (large number):
      ▪ Viewer-entity engagement
      ▪ Segment-entity engagement and follow
      ▪ Graph-based features
      ▪ Browsemap scores of entities already followed by the viewer (blog link)
      ▪ Embedding features
      ▪ … many more
      Entity Features (medium number):
      ▪ Follow-through-rate (FTR)
      ▪ Unfollow-through-rate (UTR)
      ▪ Feed click-through-rate (CTR)
      ▪ Impression counts
      ▪ Interaction counts
      ▪ Number of posts
      ▪ Language(s) ...
  • 16. Joining Features
      ▪ Viewer Features: millions of distinct active members
      ▪ Pair/Interaction Features: trillions of possible (viewer-entity) pairs, cut down to 100s of billions of (viewer-entity) pairs by candidate selection
      ▪ Entity Features: millions of recommendable entities (member, company, hashtag, newsletters)
      How do we manage the explosive growth of members / entities? How can we join all features together and still meet an acceptable performance?
  • 17. 1st Option: 3-way Spark Join
      Inputs: Viewer Features (GBs), Pair/Interaction Features (TBs), Entity Features (GBs), each partitioned
      ▪ 1st HASH JOIN (.join()) on viewerId key (100s of TB of shuffle)
  • 18. 1st Option: 3-way Spark Join
      Inputs: Viewer Features (GBs), Pair/Interaction Features (TBs), Entity Features (GBs), each partitioned
      ▪ 1st HASH JOIN (.join()) on viewerId key (100s of TB of shuffle)
      ▪ 2nd HASH JOIN (.join()) on entityId key (100s of TB of shuffle, very skewed)
      Result:
      ● 2 gigantic shuffles
      ● Poor runtime performance
      ● Problematic skewness
  • 19. 2nd Option: Partial Scoring with Linear Model
      The feature tables (GBs, TBs, GBs) are partially scored before being joined.
      ● Manageable 3-way join performed on smaller outputs
      Disadvantages:
      ● Scoring overhead and intermediary outputs
      ● Constrained to use a linear model
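Why partial scoring forces a linear model can be seen from a small sketch. A linear score is a weighted sum, so it splits into a viewer part, an entity part and a pair part that can each be computed on its own table; only one scalar per key then needs to be joined. The weights, feature names and values below are hypothetical, and the link function (e.g. a sigmoid for logistic regression) is left out.

```python
from typing import Dict, Tuple

Features = Dict[str, float]


def partial_score(features: Features, weights: Features) -> float:
    """Weighted sum restricted to the features present in one table."""
    return sum(weights[name] * value for name, value in features.items())


def combine(
    viewer_partial: Dict[int, float],
    entity_partial: Dict[int, float],
    pair_partial: Dict[Tuple[int, int], float],
) -> Dict[Tuple[int, int], float]:
    """Join the scalar partial scores and add them up per (viewer, entity)."""
    return {
        (v, e): viewer_partial[v] + entity_partial[e] + p
        for (v, e), p in pair_partial.items()
        if v in viewer_partial and e in entity_partial
    }


# Hypothetical model weights and feature tables.
WEIGHTS = {"viewer_ftr": 2.0, "entity_ftr": 3.0, "pair_engagement": 0.5}
viewer_features = {1: {"viewer_ftr": 0.25}}
entity_features = {10: {"entity_ftr": 0.5}}
pair_features = {(1, 10): {"pair_engagement": 4.0}}

# Step 1: score each table separately (the "partial scoring" stage).
viewer_partial = {k: partial_score(f, WEIGHTS) for k, f in viewer_features.items()}
entity_partial = {k: partial_score(f, WEIGHTS) for k, f in entity_features.items()}
pair_partial = {k: partial_score(f, WEIGHTS) for k, f in pair_features.items()}

# Step 2: join the much smaller partial-score outputs.
scores = combine(viewer_partial, entity_partial, pair_partial)

# Reference: score computed on the fully joined feature record.
full_record = {**viewer_features[1], **entity_features[10], **pair_features[(1, 10)]}
full_score = partial_score(full_record, WEIGHTS)
```

A model that multiplies or otherwise combines a viewer feature with an entity feature cannot be decomposed this way, which is the constraint the slide lists as a disadvantage.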
  • 20. Scalability Improvements with 2D Hash-Partitioned Join
  • 21. Goal: Avoid huge shuffles. Bottleneck: large / wide table of pair features + skewed entity distribution. Can we manage to join features together without shuffling the pair features?
  • 22. 2D Hash-Partitioned Join (blog link): Partitioning of the 3 feature tables
      ▪ Hash-partition the viewer features table into V partitions
      ▪ Hash-partition the entity features table into E partitions
      ▪ Partition the pair features table into V * E partitions, using a 2-dimensional custom partition function to allow joining on two keys (member, entity)
      ▪ Choose E and V so that every member and entity partition can be loaded into memory (depends on data size + executor memory)
      [Diagram: Viewer Features split into partitions V1, V2, V3; Entity Features split into partitions E1, E2, E3; Pair Features split into the 3 × 3 grid of partitions V1E1 through V3E3]
  • 23. 2D Hash-Partitioned Join: Smart partitioning of the pair features table
      For a (viewer v, entity e), with h a custom positive hash function:
      ▪ Viewer table partition number: h(v) % V
      ▪ Entity table partition number: h(e) % E
      ▪ Pair table partition number: (h(v) % V) * E + (h(e) % E)
      For each pair partition P, there is always a single corresponding:
      ▪ Viewer partition number, equal to P / E (integer division)
      ▪ Entity partition number, equal to P % E
      Example: V = 50, E = 100, h(x) = abs(x). The pair (viewer 1001, entity 220) lands in pair partition P = 120.
      ▪ Entity table partition? 120 % 100 = 20 (partition E20)
      ▪ Viewer table partition? 120 / 100 = 1 (partition V1)
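The partition arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative (they do not come from the slides), and h(x) = abs(x) is the example hash from the slide, not necessarily the production one.

```python
from typing import Tuple


def h(x: int) -> int:
    """Example positive hash from the slide: h(x) = abs(x)."""
    return abs(x)


def viewer_partition(v: int, num_viewer_parts: int) -> int:
    return h(v) % num_viewer_parts


def entity_partition(e: int, num_entity_parts: int) -> int:
    return h(e) % num_entity_parts


def pair_partition(v: int, e: int, num_viewer_parts: int, num_entity_parts: int) -> int:
    """2D partition number: row = viewer partition, column = entity partition."""
    row = viewer_partition(v, num_viewer_parts)
    col = entity_partition(e, num_entity_parts)
    return row * num_entity_parts + col


def split_pair_partition(p: int, num_entity_parts: int) -> Tuple[int, int]:
    """Recover (viewer partition, entity partition) from a pair partition number."""
    return p // num_entity_parts, p % num_entity_parts


V, E = 50, 100
p = pair_partition(1001, 220, V, E)          # slide example: viewer 1001, entity 220
viewer_part, entity_part = split_pair_partition(p, E)
```

Because the mapping is invertible, every record in pair partition P needs exactly one viewer partition and one entity partition, which is what makes the join possible without shuffling the pair table.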
  • 24. 2D Hash-Partitioned Join: Join Algorithm (.mapPartitions())
      Inputs: partitioned viewer features (V1, V2, V3), partitioned entity features (E1, E2, E3), partitioned pair features (V1E1 through V3E3)
      ALGORITHM:
      1 - Launch a mapper for each pair partition
      2.1 - Load the corresponding entity partition as an in-memory hashmap
      2.2 - Load the corresponding viewer partition (presorted by viewer id) into a stream reader
      3 - For each pair features record, look up the entity features record by entity id, and the viewer features record from the stream reader
      4 - Merge the three feature sets into a joined record
      5 - Features can be scored right away before storing to HDFS!
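The per-partition work in steps 2 to 5 can be sketched in plain Python. This is an illustrative stand-in, not LinkedIn's implementation: in Spark the function body would run inside the function passed to mapPartitions, the record layouts are hypothetical, and the sketch assumes the pair partition is sorted by viewer id (the slide only states this for the viewer partition) so the viewer stream can be read once, front to back.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Features = Dict[str, float]
ViewerRecord = Tuple[int, Features]      # (viewer_id, features)
EntityRecord = Tuple[int, Features]      # (entity_id, features)
PairRecord = Tuple[int, int, Features]   # (viewer_id, entity_id, features)


def join_and_score_partition(
    pairs: Iterable[PairRecord],
    viewers_sorted: Iterable[ViewerRecord],
    entities: Iterable[EntityRecord],
    score: Callable[[Features], float],
) -> Iterator[Tuple[int, int, float]]:
    """Join one pair partition with its viewer and entity partitions, then score."""
    entity_map = dict(entities)            # step 2.1: in-memory hashmap
    viewer_stream = iter(viewers_sorted)   # step 2.2: stream reader
    current = next(viewer_stream, None)

    for viewer_id, entity_id, pair_feats in pairs:
        # Step 3: advance the stream until it reaches this pair's viewer.
        while current is not None and current[0] < viewer_id:
            current = next(viewer_stream, None)
        if current is None:
            break  # viewer stream exhausted, no later pair can match
        entity_feats = entity_map.get(entity_id)
        if current[0] != viewer_id or entity_feats is None:
            continue  # no matching viewer or entity features
        # Step 4: merge the three feature sets (names assumed not to collide).
        joined = {**current[1], **entity_feats, **pair_feats}
        # Step 5: score right away instead of writing the joined record.
        yield viewer_id, entity_id, score(joined)


viewers = [(1, {"v": 1}), (2, {"v": 2})]
entities = [(10, {"e": 10}), (11, {"e": 20})]
pairs = [(1, 10, {"p": 100}), (1, 11, {"p": 200}), (2, 10, {"p": 300}), (3, 10, {"p": 400})]

results = list(
    join_and_score_partition(pairs, viewers, entities, lambda f: sum(f.values()))
)
```

The toy score just sums the feature values; since the full joined record is available at scoring time, any model could be plugged in here, including one that interacts viewer and entity features.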
  • 25. New Offline Scoring Pipeline [Diagram: BEFORE pipeline with partial scoring vs. AFTER pipeline; inputs are GBs, TBs and GBs of features]
  • 26. New Offline Scoring Pipeline: No shuffle of the pair features table during the join
  • 27. New Offline Scoring Pipeline: No intermediate data stored in HDFS (single Spark job)
  • 28. New Offline Scoring Pipeline: Ability to score using a non-linear model that interacts features (XGBoost)
  • 29. Gains after adopting 2D Hash-Partitioned Join
      Offline Scoring Runtime Performance:
      ▪ Cost to Serve of Offline Scoring pipeline reduced by 5X (in GB.h)
      ▪ HDFS Storage: intermediate outputs reduced by 8X
      Relevance:
      ▪ Enabled transition from linear model (LoR, LiR) to non-linear model (XGBoost)
      ▪ Total follows up by 17%
      ▪ Engagement up by 11%
  • 31. Contacts:
      ● https://www.linkedin.com/in/emilie-de-longueau/
      ● https://www.linkedin.com/in/aalqawasmeh/
      Credits:
      ● LinkedIn Hadoop team, in particular Fangshi Li for implementing the algorithm and helping with its adoption in Follows Relevance
      Check these blogs:
      ● LinkedIn Engineering - Communities AI: Building Communities Around Interests
      ● LinkedIn Engineering - Managing Exploding Big Data