Maja Kabiljo & Aleksandar Ilic, Facebook
Large scale item recommendations with Apache Giraph
01 Motivation and challenges
02 Apache Giraph
03 Distributing Collaborative Filtering
04 Social signals and Applications
05 Conclusion
Collaborative Filtering
Predict a user’s interests based on the interests of many other users
[Figure: example interests: Disney, Roller coasters, Disneyland, Six Flags]
Matrix factorization
[Figure: sparse user-item rating matrix with users U1 to U5 as rows and items I1 to I5 as columns; known entries hold ratings between 1 and 5, unknown entries are marked "?"]
Basic form
Objective function
Two iterative approaches:
•Stochastic Gradient Descent
•Alternating Least Squares
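The formulas on this slide did not survive extraction. As a reference point, here is the standard textbook formulation of regularized matrix factorization. It is an assumption that the talk uses this variant. All symbols are introduced here for illustration: r_{ui} is a known rating, p_u and q_i are the user and item feature vectors, K is the set of known (user, item) pairs, K_u is the set of items rated by user u, \lambda is the regularization weight and \gamma is the learning rate (constant factors are absorbed into \gamma).

```latex
% Basic form: a rating is approximated by the dot product of two feature vectors
\hat{r}_{ui} = p_u^{\top} q_i

% Objective function: regularized squared error over the known ratings K
\min_{p,\,q} \sum_{(u,i) \in K} \left[ (r_{ui} - p_u^{\top} q_i)^2
    + \lambda \left( \|p_u\|^2 + \|q_i\|^2 \right) \right]

% SGD: one step per known rating, with e_{ui} = r_{ui} - p_u^{\top} q_i
% (both updates use the values of p_u and q_i from before the step)
p_u \leftarrow p_u + \gamma \, (e_{ui} \, q_i - \lambda \, p_u)
q_i \leftarrow q_i + \gamma \, (e_{ui} \, p_u - \lambda \, q_i)

% ALS: with all q_i fixed, each p_u has a closed-form least squares solution
% (the item update is symmetric, with the roles of p and q swapped)
p_u = \Big( \sum_{i \in K_u} q_i q_i^{\top} + \lambda \, |K_u| \, I \Big)^{-1}
      \sum_{i \in K_u} r_{ui} \, q_i
```

In this formulation the ALS step accumulates a #features x #features matrix per vector, whereas SGD only touches vectors of length #features.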
Challenges
Scale
•100s of billions of (user, item) pairs
•Over a billion users
•Tens of millions of items
Performance
•Train models and iterate quickly
•Use more features
Training / testing metrics example
[Figure: RMSE (0 to 0.8) versus iterations (0 to 44); series: Train f=8, Test f=8, Train f=128, Test f=128]
Apache Giraph
What is Apache Giraph?
Iterative and graph processing on massive datasets
Billion vertices, trillion edges
Data mapped to a graph
•Vertex ids and values
•Edges and edge values
[Figure: example graph with the values 10, 5, 1, 3; labels: Neural networks, Logistic regression, Boosted decision]
What is Apache Giraph?
Runs on top of Hadoop
Map-only jobs
Keeps data in memory
Mappers communicate over the network
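To make the vertex-centric model concrete, here is a minimal single-process sketch in plain Python. It is not Giraph code and uses no Giraph API: the function name `run_supersteps` and the max-value task are illustrative assumptions. It only mimics the pattern of vertices holding values in memory and exchanging messages along edges in synchronized supersteps.

```python
from collections import defaultdict


def run_supersteps(values, edges, max_supersteps=30):
    """Spread the largest vertex value through the graph, superstep by superstep.

    values: dict vertex id -> vertex value
    edges:  dict vertex id -> list of target vertex ids
    """
    values = dict(values)  # work on a copy, everything stays in memory

    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox = defaultdict(list)
    for vertex, value in values.items():
        for target in edges.get(vertex, []):
            inbox[target].append(value)
    supersteps = 1

    # Later supersteps: a vertex that learns a larger value adopts it and
    # forwards it; the job ends when no messages are in flight.
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for target in edges.get(vertex, []):
                    outbox[target].append(best)
        inbox = outbox
        supersteps += 1
    return values, supersteps


if __name__ == "__main__":
    ring = {1: [2], 2: [3], 3: [4], 4: [1]}
    final_values, steps = run_supersteps({1: 10, 2: 5, 3: 1, 4: 3}, ring)
    print(final_values, steps)
```

In a real deployment the vertices would be spread over workers and the messages would travel over the network between them.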
Giraph workflow
[Figure: workflow across Worker 1, Worker 2 and Worker 3]
Distributing Collaborative Filtering
Common approach
A bipartite graph:
•Users and items are vertices
•Known ratings are edges
•Feature vectors sent through edges
Problems:
•Data sent per iteration: #knownRatings * #features
•Memory
•Large degree items
•SGD updates differ from those in the sequential solution
[Figure: users and items I1 to I4 spread across Worker 1, Worker 2 and Worker 3]
Our solution - rotational approach
Users are vertices, items are worker data
[Figure: Worker 1, Worker 2 and Worker 3, each holding one of item set 1, item set 2 and item set 3]
•Network traffic?
•Memory?
•Skewed item degrees?
•SGD calculation?
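The rotational idea can be sketched in a few lines of single-process Python. This is an illustrative simulation under stated assumptions, not the authors' implementation: users and items are assigned to workers and item sets round-robin by id, the item sets move one worker further at each step, and the names `item_set_for`, `rotational_sgd` and `rmse` are hypothetical.

```python
import math
import random


def item_set_for(worker, step, num_workers):
    """Index of the item set that `worker` holds at `step` of an iteration."""
    return (worker + step) % num_workers


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotational_sgd(ratings, num_users, num_items, num_workers,
                   num_features=4, iterations=100, lr=0.05, reg=0.02, seed=0):
    """Train user and item vectors with SGD while item sets rotate over workers.

    ratings: list of (user id, item id, rating) triples.
    """
    rng = random.Random(seed)
    user_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(num_features)]
                 for _ in range(num_users)]
    item_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(num_features)]
                 for _ in range(num_items)]

    # Assumed partitioning: user u lives on worker u % num_workers,
    # item i belongs to item set i % num_workers.
    buckets = {}
    for u, i, r in ratings:
        buckets.setdefault((u % num_workers, i % num_workers), []).append((u, i, r))

    for _ in range(iterations):
        for step in range(num_workers):
            # Within one step the workers touch disjoint users and disjoint
            # item sets, so their updates cannot conflict.
            for worker in range(num_workers):
                held = item_set_for(worker, step, num_workers)
                for u, i, r in buckets.get((worker, held), []):
                    p, q = user_vecs[u], item_vecs[i]
                    err = r - dot(p, q)
                    for k in range(num_features):
                        pk, qk = p[k], q[k]
                        p[k] += lr * (err * qk - reg * pk)
                        q[k] += lr * (err * pk - reg * qk)
    return user_vecs, item_vecs


def rmse(ratings, user_vecs, item_vecs):
    total = sum((r - dot(user_vecs[u], item_vecs[i])) ** 2 for u, i, r in ratings)
    return math.sqrt(total / len(ratings))
```

After `num_workers` steps every worker has seen every item set once, so each known rating is used exactly once per iteration. Only item vectors move between workers; the ratings stay with the users.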
Performance
Comparison with Spark MLlib
Spark MLlib ALS CF
•On scaled copies of Amazon reviews dataset
We can handle over 100 billion ratings
[Figure: CPU minutes (0 to 600) versus millions of examples (0 to 1200); series: Common approach (in Spark), Rotational (in Giraph)]
Hybrid - common + rotational
Choose how to update item based on its degree
Network traffic per item:
•Common: #features * itemDegree
•Rotational SGD: #features * #workers
•Rotational ALS: #features * #features * #workers
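The per-item traffic formulas above translate directly into a small helper. The slide only says that the update method is chosen by item degree. Picking whichever option sends less data is an assumption made for this sketch, as are the function names.

```python
def traffic_per_item(item_degree, num_features, num_workers):
    """Values sent per item and iteration, following the three formulas above."""
    return {
        "common": num_features * item_degree,
        "rotational_sgd": num_features * num_workers,
        "rotational_als": num_features * num_features * num_workers,
    }


def choose_update(item_degree, num_features, num_workers, solver="sgd"):
    """Pick the cheaper way to update one item (assumed rule: least traffic)."""
    costs = traffic_per_item(item_degree, num_features, num_workers)
    rotational = costs["rotational_" + solver]
    return "common" if costs["common"] < rotational else "rotational"


if __name__ == "__main__":
    # Hypothetical setting: 128 features, 200 workers.
    for degree in (10, 20_000, 1_000_000):
        print(degree,
              choose_update(degree, 128, 200, solver="sgd"),
              choose_update(degree, 128, 200, solver="als"))
```

Under this rule the break-even degree is #workers for SGD and #features * #workers for ALS, so low-degree items stay with the common approach and high-degree items rotate.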
Extensions
Rotating items
Slower connections can be a bottleneck
Solution: in every step send items between all workers
(#workers - 1) item sets on each worker
Decomposing complete graph into edge-disjoint Hamilton cycles
Construction using Latin squares
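One simple way to obtain such a decomposition is sketched below. It is not necessarily the construction used in the talk: it takes the rows of the cyclic Latin square L[k][w] = (w + k) mod n and only yields Hamiltonian cycles for every k when the number of workers n is prime, which is an assumption of this sketch. The function names are hypothetical.

```python
def rotation_cycles(num_workers):
    """One directed cycle per step size k = 1 .. num_workers - 1.

    Cycle k sends an item set from worker w to worker (w + k) % num_workers.
    """
    return [[(w, (w + k) % num_workers) for w in range(num_workers)]
            for k in range(1, num_workers)]


def is_hamiltonian_cycle(cycle, num_workers):
    """True if following the edges from worker 0 visits every worker once."""
    successor = dict(cycle)
    seen, worker = set(), 0
    for _ in range(num_workers):
        if worker in seen:
            return False
        seen.add(worker)
        worker = successor[worker]
    return worker == 0 and len(seen) == num_workers


if __name__ == "__main__":
    cycles = rotation_cycles(5)
    print(all(is_hamiltonian_cycle(c, 5) for c in cycles))
    # Every ordered pair of distinct workers is used by exactly one cycle.
    edges = [edge for cycle in cycles for edge in cycle]
    print(len(edges), len(set(edges)))
```

Read against the slide, each worker would hold one item set per cycle, (#workers - 1) in total, and in every step each set moves along its own cycle, so all worker-to-worker links carry traffic at once instead of a single ring.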
Social signals
Incorporate social network information - social regularization
A user’s latent features should be similar to those of their friends
Easy to add in Giraph model
Additional complexity: #friendships * #features
Solves the cold start problem
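A minimal sketch of what social regularization can look like follows. It assumes the common formulation that adds a penalty beta * ||p_u - p_v||^2 for every friendship (u, v) to the objective; the talk does not spell out its exact term. The function name and the parameters `lr` and `beta` are hypothetical, and the constant factor 2 of the gradient is absorbed into `lr`.

```python
def social_regularization_step(user_vecs, friendships, lr=0.1, beta=0.5):
    """Move each user's feature vector a little towards the vectors of friends.

    user_vecs:   dict user id -> list of floats
    friendships: list of (user id, user id) pairs, each friendship listed once
    """
    # Accumulate the pulls from a snapshot so the update order does not matter.
    pulls = {u: [0.0] * len(vec) for u, vec in user_vecs.items()}
    for u, v in friendships:                   # one pass over the friendships,
        for k in range(len(user_vecs[u])):     # one pass over the features
            diff = user_vecs[u][k] - user_vecs[v][k]
            pulls[u][k] += diff
            pulls[v][k] -= diff
    return {u: [x - lr * beta * g for x, g in zip(vec, pulls[u])]
            for u, vec in user_vecs.items()}


if __name__ == "__main__":
    vecs = {"ana": [1.0, 1.0], "new_user": [0.0, 0.0], "loner": [0.3, -0.2]}
    print(social_regularization_step(vecs, [("ana", "new_user")]))
```

The work is proportional to #friendships * #features. A user with no ratings still receives a non-trivial vector from friends, which is how this term relates to cold start.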
Additional features
Tracking RMSE, average rank and AUC
Combining SGD & ALS
Different objective functions
•Implicit feedback
•Degree-based regularization
Incremental training
Fast top-K recommendations
Item similarities
Calculate item similarities based on:
•Common users
•Global item properties
Adjustable formulas for easy experimentation
[Figure: two items, I1 and I2, linked through the users (u) they have in common; their similarity is marked "?"]
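As an illustration of similarities built from common users with a swappable formula, here is a small in-memory sketch. The two formulas shown (Jaccard and cosine over user sets) are assumptions chosen for the example; the talk does not say which formulas are used, and all names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(common, degree_a, degree_b):
    """Shared users divided by users who interacted with either item."""
    return common / (degree_a + degree_b - common)


def cosine(common, degree_a, degree_b):
    """Cosine similarity of the two items' binary user vectors."""
    return common / math.sqrt(degree_a * degree_b)


def item_similarities(user_item_pairs, formula=jaccard):
    """Score every pair of items that share at least one user.

    formula(common_users, degree_a, degree_b) is pluggable, so a different
    scoring rule can be tried without touching the counting logic.
    """
    items_by_user = defaultdict(set)
    users_by_item = defaultdict(set)
    for user, item in user_item_pairs:
        items_by_user[user].add(item)
        users_by_item[item].add(user)

    # Count common users per item pair by walking each user's item list.
    common = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            common[(a, b)] += 1

    return {(a, b): formula(count, len(users_by_item[a]), len(users_by_item[b]))
            for (a, b), count in common.items()}


if __name__ == "__main__":
    pairs = [(1, "I1"), (2, "I1"), (3, "I1"), (2, "I2"), (3, "I2"), (4, "I2")]
    print(item_similarities(pairs))
    print(item_similarities(pairs, formula=cosine))
```

The item degrees passed to the formula are one example of a global item property; other properties could be supplied the same way.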
Sample datasets
                    Dataset 1    Dataset 2    Dataset 3
Users               150M         1.3B         2.4B
Items               15M          35M          8M
Ratings             4B           15B          220B
Hive CPU hours      10           227          963
Giraph CPU hours    3            16           87
Applications
Use user and item embeddings in ranking models
Get user-to-item scores in real time
Direct user recommendations
Context-based recommendations
Conclusion
Scalable implementation of Collaborative Filtering
On top of Apache Giraph
Highly performant (100s of billions of ratings)
Utilizing social signals and item similarities
Many use cases at Facebook
Thank you!
tinyurl.com/fb-mf-cf
tinyurl.com/giraph-vldb-2015
Questions?


Maja Kabiljo, Software Engineer, Facebook Inc. at MLconf ATL - 9/18/15
