Large scale Collaborative Filtering using Apache Giraph
Maja Kabiljo & Aleksandar Ilic, Facebook
01 What is Apache Giraph?
02 Collaborative Filtering problem
03 Neighborhood-based models
04 Matrix factorization
05 Conclusion
What is Apache Giraph?
Iterative and graph processing on massive datasets
Billion vertices, trillion edges
Data mapped to a graph
•Vertex ids and values
•Edges and edge values
“Think like a vertex”
[Figure: example graph with vertex values 10, 5, 1, 3]
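A minimal sketch of the "think like a vertex" model, written in plain Python rather than against Giraph's Java API. Every vertex runs the same step once per superstep: read incoming messages, update its own value, send messages along its edges. The task (propagating the maximum value), the function name `propagate_max`, and the graph layout are illustrative assumptions, not taken from the slides.

```python
def propagate_max(values, edges):
    """Pregel-style max propagation.

    values: {vertex: value}; edges: {vertex: [neighbor, ...]}.
    Returns the final values and the number of supersteps run.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, neighbors in edges.items():
        for n in neighbors:
            inbox[n].append(values[v])
    supersteps = 1
    # Later supersteps: a vertex only acts (and sends) if a message improved it.
    while any(inbox.values()):
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in edges.get(v, []):
                    outbox[n].append(values[v])
        inbox = outbox
        supersteps += 1
    return values, supersteps


# Hypothetical 4-vertex chain using the vertex values shown on the slide.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, _ = propagate_max({"a": 10, "b": 5, "c": 1, "d": 3}, chain)
```

The computation stops when no vertex sends a message, which mirrors how a vertex-centric job halts once all vertices are inactive.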
Runs on top of Hadoop
Map only jobs
Keeps data in memory
Mappers communicate through network
Giraph workflow
[Figure: workflow across Worker 1, Worker 2, Worker 3]
Collaborative Filtering Problem
Predict user’s interests based on many other users
[Figure: example interests: Disney, Roller coasters, Disneyland, Six Flags]
Main challenge: Facebook data
•Billion users, 100 billion ratings
•Skewed item degrees
•No explicit ratings
Common approaches:
•Neighborhood based models
•Matrix factorization
Neighborhood-Based Models
Neighborhood based CF
Start from user item ratings
Calculate item similarities
For each item pair:
•Users who rated first item
•Users who rated second item
•Users who rated both items
[Figure: users (u) connected to items I1 and I2; the similarity between I1 and I2 is unknown (?)]
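The three per-pair quantities listed above (users who rated the first item, the second item, and both) can be sketched as follows. This is a single-machine illustration, not the distributed implementation; the function name `item_pair_counts` and the `(user, item)` input layout are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def item_pair_counts(ratings):
    """ratings: iterable of (user, item) pairs.

    Returns (degree, intersection):
      degree[item]        = number of users who rated the item
      intersection[(a,b)] = number of users who rated both a and b (a < b)
    """
    items_by_user = defaultdict(set)
    for user, item in ratings:
        items_by_user[user].add(item)

    degree = defaultdict(int)
    intersection = defaultdict(int)
    for items in items_by_user.values():
        for item in items:
            degree[item] += 1
        # Each user contributes one count to every pair of items they rated.
        for a, b in combinations(sorted(items), 2):
            intersection[(a, b)] += 1
    return dict(degree), dict(intersection)


example = [("u1", "I1"), ("u1", "I2"), ("u2", "I1"),
           ("u2", "I2"), ("u3", "I1"), ("u4", "I2")]
degree, intersection = item_pair_counts(example)
```

Note that the pair loop is quadratic in the number of items a user rated, which is one reason skewed degrees matter at scale.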
Calculate user recommendations
For every user:
•Items rated by user
•Most similar items to these items
[Figure: user u has rated I1, I2, I3; scores for candidate items I4 to I7 are unknown (?)]
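The recommendation step above can be sketched like this. Summing similarities over the user's rated items is an assumed scoring rule chosen for illustration (the deck says the scoring formula is configurable), and `recommend` and its arguments are hypothetical names.

```python
from collections import defaultdict


def recommend(user_items, similar_items, top_n=3):
    """user_items: set of items the user has rated.
    similar_items: {item: [(other_item, similarity), ...]}.
    Returns up to top_n (item, score) pairs, best first.
    """
    scores = defaultdict(float)
    for item in user_items:
        for other, similarity in similar_items.get(item, []):
            if other not in user_items:      # never recommend what is already rated
                scores[other] += similarity  # assumed rule: sum of similarities
    # Sort by score descending, breaking ties by item id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


similar = {
    "I1": [("I4", 0.5), ("I5", 0.25)],
    "I2": [("I4", 0.25), ("I1", 0.75)],
    "I3": [("I6", 0.5)],
}
top = recommend({"I1", "I2", "I3"}, similar)
```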
Configurable formulas
Accommodating different use cases
Each calculation step is configurable
•User’s contribution to item similarities
•Item similarities based on all users’ contributions
•User to item recommendation score
Passing a piece of Java code through configuration
intersection / Math.sqrt(degree1 * degree2)
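As a rough Python analogue of passing a formula through configuration, the sketch below compiles an expression string into a function of `intersection`, `degree1` and `degree2`. The deck passes Java code; the use of `eval`, the helper name `compile_formula`, and the Jaccard variant are assumptions made for illustration only.

```python
import math


def compile_formula(expression):
    """Turn a formula string into a function of (intersection, degree1, degree2)."""
    code = compile(expression, "<similarity_formula>", "eval")

    def formula(intersection, degree1, degree2):
        scope = {
            "intersection": intersection,
            "degree1": degree1,
            "degree2": degree2,
            "sqrt": math.sqrt,
        }
        # Emptying __builtins__ keeps the expression small in scope; it is a
        # convenience for this sketch, not a security boundary.
        return eval(code, {"__builtins__": {}}, scope)

    return formula


# Same shape as the formula on the slide (cosine-style normalization).
cosine = compile_formula("intersection / sqrt(degree1 * degree2)")
# A second, assumed formula to show that only configuration changes.
jaccard = compile_formula("intersection / (degree1 + degree2 - intersection)")
```

Swapping the similarity measure then needs no change to the job itself, only to the configured string.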
Our solution
Users to items edges
Preprocessing:
•Filter out low degree ones
•Calculate global item stats
Users send item lists to items
•Items need other items’ global stats to calculate similarities
[Figure: users (u) and items (i) distributed across Worker 1, Worker 2, Worker 3]
Optimizations
Make item info globally available
•Using reduce/broadcast api
Striping technique
•Split computation across multiple supersteps
•In each stripe process one subset of items
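A minimal sketch of the striping idea, assuming items are split round-robin into stripes and each stripe is handled in its own superstep. Only items in the active stripe receive user item lists, so the data in flight per superstep shrinks roughly by the number of stripes while the final counts stay the same. Function names and the round-robin split are assumptions.

```python
def make_stripes(items, num_stripes):
    """Split items into num_stripes disjoint subsets (round-robin)."""
    ordered = sorted(items)
    return [ordered[s::num_stripes] for s in range(num_stripes)]


def cooccurrence_in_stripes(items_by_user, num_stripes):
    """items_by_user: {user: set of items}. Returns {(a, b): co-rating count}."""
    all_items = {item for items in items_by_user.values() for item in items}
    counts = {}
    for stripe in make_stripes(all_items, num_stripes):  # one stripe per superstep
        active = set(stripe)
        for items in items_by_user.values():
            # The user's item list is only sent to items in the active stripe.
            for a in items & active:
                for b in items:
                    if a != b:
                        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


data = {"u1": {"I1", "I2"}, "u2": {"I1", "I2", "I3"}, "u3": {"I3"}}
one_pass = cooccurrence_in_stripes(data, num_stripes=1)
striped = cooccurrence_in_stripes(data, num_stripes=3)
```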
Applications
Direct user recommendations
Context aware recommendations
User explore
Comparison with Hive
Item similarities implemented using Hive join
•Remapping all items to 1..N first

Dataset                               Hive CPU hours (after int remapping)   Giraph CPU hours
150M users, 15M items, 4B ratings     10                                     3
1.3B users, 35M items, 15B ratings    227                                    16
2.4B users, 8M items, 220B ratings    963                                    87
Ease of use
ratings = i2iRatings(table = 'user_item_ratings')
similarities = i2iSimilarities(table = 'item_similarities')
recommendations = i2iRecommendations(table = 'user_recommendations')
i2iCalculateSimilarities(ratings,
                         similarities,
                         similarity_formula = '...',
                         num_workers = 10)
i2iCalculateRecommendations(ratings,
                            similarities,
                            recommendations,
                            scoring_formula = '...',
                            num_workers = 50)
Matrix Factorization
Matrix factorization CF
[Figure: sparse user-item ratings matrix with users U1 to U5 as rows and items I1 to I5 as columns; known ratings take values from 1 to 5 and missing entries (?) are to be predicted]
Basic form
Objective function (with regularization)
Two iterative approaches:
•Stochastic Gradient Descent
•Alternating Least Squares
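The formulas on this slide were images and did not survive extraction. What follows is the commonly used regularized matrix factorization form, given as an assumption about what the slide showed. The symbols are also assumptions: $p_u$ and $q_i$ are the user and item feature vectors, $K$ is the set of known ratings, $\lambda$ is the regularization weight, and $\gamma$ is the SGD learning rate.

```latex
% Basic form: a rating is predicted by an inner product of feature vectors.
\hat{r}_{ui} = p_u^{\top} q_i

% Objective function: squared error on known ratings plus regularization.
\min_{P,\,Q} \sum_{(u,i) \in K}
  \left[ \left( r_{ui} - p_u^{\top} q_i \right)^2
       + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \right]

% SGD: for each known rating, with error e_{ui} = r_{ui} - p_u^{\top} q_i,
p_u \leftarrow p_u + \gamma \left( e_{ui}\, q_i - \lambda\, p_u \right), \qquad
q_i \leftarrow q_i + \gamma \left( e_{ui}\, p_u - \lambda\, q_i \right)

% ALS: fix Q and solve a regularized least squares problem per user
% (Q_u stacks the vectors of items rated by u, r_u their ratings), then swap roles.
p_u = \left( Q_u^{\top} Q_u + \lambda I \right)^{-1} Q_u^{\top} r_u
```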
Standard approach
A bipartite graph:
•Users and items are vertices
•Known ratings are edges
•Feature vectors sent through edges
Problems:
•Data sent per iteration: #knownRatings * #features
•Memory
•Large degree items
•SGD updates differ from those in the sequential solution
[Figure: item vertices I1 to I4 spread across Worker 1, Worker 2, Worker 3]
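To make the "data sent per iteration" problem concrete, here is a back-of-the-envelope calculation. The 100 billion ratings and 128 features come from elsewhere in the deck; the 4 bytes per feature value (single-precision floats) and counting one direction only are assumptions.

```python
def bytes_per_iteration(known_ratings, features, bytes_per_value=4):
    """Feature-vector bytes sent along edges in one direction per iteration.

    bytes_per_value=4 assumes single-precision floats (an assumption).
    """
    return known_ratings * features * bytes_per_value


sent = bytes_per_iteration(known_ratings=100_000_000_000, features=128)
terabytes = sent / 10**12   # 51.2 TB per iteration under these assumptions
```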
Our solution
Extending Giraph
•Worker data
•Worker to worker messages
Users are vertices, items are worker data
Our solution - rotational approach
[Figure: item set 1, item set 2, item set 3 rotating among Worker 1, Worker 2, Worker 3]
•Network traffic?
•Memory?
•Skewed item degrees?
•SGD calculation?
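A single-process sketch of how a rotational scheme could work, under stated assumptions: users are pinned to workers by `user_id % num_workers`, items are split into as many sets as there are workers, and at each sub-step worker `w` holds item set `(w + step) % num_workers` before passing it on. Because every worker holds a different item set at any moment, no two workers update the same vector concurrently. All names, the partitioning rule and the hyperparameters are illustrative, not the authors' implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rmse(ratings, user_vecs, item_vecs):
    squared = sum((r - dot(user_vecs[u], item_vecs[i])) ** 2 for u, i, r in ratings)
    return (squared / len(ratings)) ** 0.5


def rotational_sgd_epoch(ratings, user_vecs, item_vecs, num_workers,
                         lr=0.05, reg=0.01):
    """One pass over all ratings; ratings are (user_id, item_id, value) with int ids.

    Returns the number of ratings processed.
    """
    # Bucket ratings by (worker owning the user, item set owning the item).
    buckets = {}
    for u, i, r in ratings:
        buckets.setdefault((u % num_workers, i % num_workers), []).append((u, i, r))

    processed = 0
    for step in range(num_workers):            # one rotation per sub-step
        for worker in range(num_workers):      # these would run in parallel
            item_set = (worker + step) % num_workers
            for u, i, r in buckets.get((worker, item_set), []):
                p, q = user_vecs[u], item_vecs[i]
                err = r - dot(p, q)
                # Plain sequential SGD update for this rating.
                user_vecs[u] = [a + lr * (err * b - reg * a) for a, b in zip(p, q)]
                item_vecs[i] = [b + lr * (err * a - reg * b) for a, b in zip(p, q)]
                processed += 1
    return processed
```

After `num_workers` sub-steps each worker has seen every item set once, so every known rating is processed exactly once per epoch while only item vectors, not per-rating messages, cross the network.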
Recommendations
Finding top inner products
Scoring each (user, item) pair is infeasible
Creating Ball Tree from item vectors
•Greedy tree traversal
•Pruning subtrees
•100-1000x faster
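A compact sketch of a ball tree for finding the item with the largest inner product. Pruning relies on the bound q·x ≤ q·c + ‖q‖·r for any item x inside a ball with center c and radius r. The splitting rule (median along the widest dimension), the leaf size, and returning only the single best item are simplifying assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spread(items, d):
    values = [vec[d] for _, vec in items]
    return max(values) - min(values)


class Ball:
    def __init__(self, items, leaf_size=2):
        """items: non-empty list of (item_id, vector)."""
        dim = len(items[0][1])
        self.items = items
        self.center = [sum(vec[d] for _, vec in items) / len(items) for d in range(dim)]
        self.radius = max(math.dist(vec, self.center) for _, vec in items)
        self.children = []
        if len(items) > leaf_size:
            # Assumed split rule: median cut along the widest dimension.
            d = max(range(dim), key=lambda k: spread(items, k))
            ordered = sorted(items, key=lambda it: it[1][d])
            mid = len(ordered) // 2
            self.children = [Ball(ordered[:mid], leaf_size),
                             Ball(ordered[mid:], leaf_size)]

    def bound(self, query):
        # No item inside this ball can score above q.c + |q| * radius.
        return dot(query, self.center) + math.sqrt(dot(query, query)) * self.radius


def search(node, query, best=(-math.inf, None)):
    """Returns (score, item_id) of the best item found in this subtree."""
    if node.bound(query) <= best[0]:
        return best                          # prune the whole subtree
    if not node.children:
        for item_id, vec in node.items:
            score = dot(query, vec)
            if score > best[0]:
                best = (score, item_id)
        return best
    # Greedy traversal: visit the more promising child first.
    for child in sorted(node.children, key=lambda c: c.bound(query), reverse=True):
        best = search(child, query, best)
    return best
```

Usage: build the tree once from all item vectors with `Ball(items)`, then call `search(tree, user_vector)` per user. Visiting the more promising child first raises the best score early, which lets more subtrees be pruned.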
Additional features
Tracking RMSE, average rank and precision/recall
Combining SGD & ALS
Using other objective functions
•CF for implicit feedback
•Biases
•Degree based regularization
•Optimizing ranks
Applications
Add user and item feature vectors in ranking
Get user to item score in realtime
Direct user recommendations
Training / testing metrics example
[Figure: RMSE (0 to 0.8) versus iterations (0 to 44); series: Train f=8, Test f=8, Train f=128, Test f=128]
Comparison with Spark MLlib
Performance of Spark MLlib ALS CF published in July 2014
On scaled copies of Amazon reviews dataset
[Figure: CPU minutes (0 to 600) versus millions of examples (0 to 1200); series: Standard (in Spark), Rotational (in Giraph)]
Ease of use
ratings = CFRatings(table = 'cf_ratings')
feature_vectors = CFFeatureVectors(table = 'cf_feature_vectors')
CFTrain(ratings,
        feature_vectors,
        CFSettings(features_size = 10, iterations = 20),
        num_workers = 5)
CFRecommend(ratings,
            feature_vectors,
            CFRecommendations(top_items_table = 'cf_top_items'),
            num_workers = 50)
Conclusion
Scalable implementation of Collaborative Filtering
On top of Apache Giraph
Highly performant (>100 billion ratings)
Neighborhood-based models
Matrix factorization
Group and Page recommendations at Facebook
Thank you!
tinyurl.com/fb-mf-cf
Questions?
