26 Trillion App Recommendations
Using 100 Lines of Spark Code
Ayman Farahat
● Motivation
● Spark Implementation
○ Collaborative Filtering
○ Data Frames
○ BLAS-3
● Results and lessons learnt.
Overview
● App discovery is a challenging problem because of the exponential
growth in the number of apps
● Over 1.5 million apps are available through both marketplaces
(i.e., iTunes and Google Play store)
● Develop an App recommendation engine using various user
behavior signals
○ Explicit Signal (App rating)
○ Implicit Signal (frequency/duration of app usage)
Motivation
● Data available through Flurry SDK is rich in both coverage
and depth
● Collected session length for Apps used on the iOS platform in
the period September 1-15, 2015.
● Restricted analysis to Apps used by 100 or more users
○ ~496 million Users
○ ~53,793 Apps
Flurry Data and Summary
● User Count : 496,508,312
● App Count : 153,773
● App 100+ : 53,793
● Train time : 52 minutes
● Predict time : 8 minutes
Data Summary
● Use collaborative-filtering-based App recommendation
● Run collaborative filtering that works at scale to generate:
○ Low-dimensional user features
○ Low-dimensional App features
○ A predicted user x App rating for all possible
combinations (26.7 trillion)
● Used the Spark framework to train and recommend efficiently.
Our Approach
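The 26.7 trillion figure follows from the counts on the data summary slide: every user is scored against every App that passed the 100-user filter. A quick arithmetic check, using only the numbers reported in the slides:

```python
# Counts reported on the data summary slide.
n_users = 496_508_312
n_apps_100_plus = 53_793  # Apps used by 100 or more users

# Every user is scored against every retained App.
n_pairs = n_users * n_apps_100_plus

print(f"{n_pairs:,} user x App pairs (~{n_pairs / 1e12:.1f} trillion)")
```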
● Projects the users and the items (Apps, in our case) into a
lower-dimensional space
Collaborative Filtering Model
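As a minimal sketch of what the projection means: each user and each App is represented by a short vector of latent factors, and the predicted rating is the dot product of the two vectors. The identifiers, the vectors, and the choice of 3 factors below are made up for illustration and are not taken from the Flurry data.

```python
# Toy latent-factor vectors; all values are illustrative only.
user_factors = {"user_a": [0.9, 0.1, 0.3], "user_b": [0.2, 0.8, 0.5]}
app_factors = {"app_x": [1.0, 0.0, 0.5], "app_y": [0.1, 1.2, 0.4]}

def predict(user_id, app_id):
    """Predicted rating = dot product of the user and App factor vectors."""
    u = user_factors[user_id]
    v = app_factors[app_id]
    return sum(ui * vi for ui, vi in zip(u, v))

for user_id in user_factors:
    scores = {app_id: predict(user_id, app_id) for app_id in app_factors}
    best = max(scores, key=scores.get)
    print(user_id, scores, "-> recommend", best)
```

Training the model amounts to choosing the factor vectors so that these dot products match the observed ratings as closely as possible.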
● Used out-of-sample prediction accuracy on users with 20+ Apps
● The MSE was lowest with the number of factors fixed at 60
Model Fitting and Parameter Optimization
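The selection procedure can be sketched as follows: compute the mean squared error on held-out ratings for each candidate number of factors and keep the one with the smallest error. The candidate ranks and the (actual, predicted) pairs below are invented purely to exercise the code; they are not results from the talk, and `mse` and `best_rank` are hypothetical helper names.

```python
def mse(pairs):
    """Mean squared error over (actual, predicted) pairs."""
    pairs = list(pairs)
    return sum((a - p) ** 2 for a, p in pairs) / len(pairs)

def best_rank(mse_by_rank):
    """Return the number of factors whose held-out MSE is smallest."""
    return min(mse_by_rank, key=mse_by_rank.get)

# Invented held-out (actual, predicted) pairs for three candidate ranks.
held_out = {
    2: [(3.0, 2.0), (1.0, 2.0)],
    4: [(3.0, 2.8), (1.0, 1.1)],
    8: [(3.0, 2.5), (1.0, 1.6)],
}
mse_by_rank = {rank: mse(pairs) for rank, pairs in held_out.items()}
print(mse_by_rank, "-> best rank:", best_rank(mse_by_rank))
```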
● Join operations can benefit greatly from caching.
● Filter out Apps that have fewer than 100 users
cleandata = allapps.join(cleanapps)
● Do a replicated join in Spark
MAXAPPS = 100
# only keep the Apps that had 100 or more users
cleanapps = myapps.filter(lambda x: x[1] >= MAXAPPS).map(lambda x: int(x[0]))
# broadcast the set of retained App ids to every executor
apps = sc.broadcast(set(cleanapps.collect()))
# filter against the broadcast set: this simulates a replicated join
cleandata = allapps.filter(lambda x: x[1] in apps.value)
Data Frames
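The idea behind the simulated replicated join can be seen without a cluster. When one side of a join is small, it can be shipped to every worker as a set and the large side filtered locally, so the large side never has to be shuffled. The sketch below uses plain Python lists and made-up records; the (userId, appId, value) layout is an assumption for illustration.

```python
# Large side: (userId, appId, value) records; toy data for illustration.
all_records = [
    ("u1", 10, 5.0), ("u2", 10, 3.0), ("u3", 11, 1.0),
    ("u1", 12, 2.0), ("u2", 12, 4.0), ("u3", 12, 6.0),
]
# Small side: ids of the Apps that pass the user-count threshold.
clean_app_ids = {10, 12}

# Generic join: match every record against every key on the small side.
joined = [rec for rec in all_records
          for app_id in clean_app_ids if rec[1] == app_id]

# Replicated join: each worker holds the small set and filters locally.
filtered = [rec for rec in all_records if rec[1] in clean_app_ids]

print(filtered == joined)
```

Both approaches keep the same records; the difference on a cluster is that the second one only moves the small set across the network.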
● In Spark you can use a DataFrame directly
from pyspark.sql import Row
Record = Row("userId", "iuserId", "appId", "value")
MAXAPPS = 100
# transform allapps to a DataFrame
allappsdf = allapps.map(lambda x: Record(*x)).toDF()
# register the DataFrame and issue SQL queries
sqlContext.registerDataFrameAsTable(allappsdf, "table1")
# group by the App id; alias the aggregates so they can be referenced by name
df2 = sqlContext.sql("SELECT appId AS appId2, avg(value) AS avgValue, count(*) AS cnt FROM table1 GROUP BY appId")
topappsdf = df2.filter(df2.cnt >= MAXAPPS)
# DataFrame join
cleandata = allappsdf.join(topappsdf, allappsdf.appId == topappsdf.appId2)
Data Frames
● The number of possible user x App combinations is very large
Default prediction: predictAll
○ predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
○ Prediction is simply the dot product of the factor vectors of user "i" and App "j"
● Never completes; most of the time is spent on reshuffling.
● The users are not partitioned, so they can be on any node.
● The Apps are not partitioned, so they can be on any node.
● Reshuffling is extremely slow.
BLAS 3
● The key is that the Number of Apps << Number of users
● Exploit the low number of Apps to optimize the prediction time
BLAS 3
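A back-of-the-envelope estimate shows why the imbalance matters. The sketch below uses the counts and the 60 factors reported in the slides, and assumes each factor is held as a 64-bit float; it counts raw array bytes only and ignores any Python or JVM object overhead.

```python
BYTES_PER_FLOAT64 = 8  # assumption: factors held as 64-bit floats
rank = 60              # number of factors reported in the slides
n_apps = 53_793
n_users = 496_508_312

app_bytes = n_apps * rank * BYTES_PER_FLOAT64
user_bytes = n_users * rank * BYTES_PER_FLOAT64

print(f"App factors:  {app_bytes / 1e6:.1f} MB")
print(f"User factors: {user_bytes / 1e9:.1f} GB")
```

Under these assumptions the App factors come to tens of megabytes, small enough to copy to every executor, while the user factors run to hundreds of gigabytes and have to stay distributed.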
● The App features, being smaller in size, can be stored in
primary memory (BLAS 3)
● We broadcast the Apps to all executors, which reduces the
overall reshuffling of data
● Use the BLAS-3 matrix multiplication available within numpy,
which is highly optimized
BLAS 3
Basic Linear Algebra Subprograms (BLAS); Level 3 covers matrix-matrix operations of the form
C = alpha * A * B + beta * C
Highly optimized for matrix multiplication.
BLAS 3
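To make the matrix-matrix form concrete, the sketch below computes the same score table twice with numpy: once as a single matrix product (which numpy typically hands to an underlying BLAS routine when it is built against one) and once as one dot product per (user, App) pair. The sizes and random values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
rank, n_users, n_apps = 4, 5, 3           # toy sizes
U = rng.standard_normal((n_users, rank))  # one row of factors per user
V = rng.standard_normal((rank, n_apps))   # one column of factors per App

# Matrix-matrix product: all user x App scores in one call.
scores_matrix = U @ V

# Vector-vector products: one dot product per (user, App) pair.
scores_pairwise = np.array(
    [[np.dot(U[i], V[:, j]) for j in range(n_apps)] for i in range(n_users)]
)

print(np.allclose(scores_matrix, scores_pairwise))
```

The results agree; the single product simply gives the library the whole problem at once, which is where blocked, cache-friendly implementations pay off.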
import numpy as np
from pyspark.mllib.recommendation import MatrixFactorizationModel
myModel = MatrixFactorizationModel.load(sc, "BingBong")
# collect the (small) App feature vectors on the driver
m1 = myModel.productFeatures()
m2 = m1.map(lambda kv: kv[1]).collect()
# rank x nApps matrix of App features
m3 = np.array(m2).T
# ship the App features to every executor
pf = sc.broadcast(m3)
uf = myModel.userFeatures().coalesce(100)
# get predictions for all users: (rank,) vector times (rank x nApps) matrix
f1 = uf.map(lambda kv: (kv[0], np.dot(np.array(kv[1]), pf.value)))
BLAS 3
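One possible variation, not shown in the slides, is to score users in blocks so that each call is a matrix-matrix product rather than a vector-matrix product. The sketch below is a hypothetical helper (`score_partition`, with an assumed `batch_size` parameter) that could be passed to `mapPartitions`; batching keeps the intermediate score block small, since a whole partition of users times all Apps would not fit in memory. It is checked here on toy data without Spark.

```python
from itertools import islice

import numpy as np

def score_partition(rows, app_features, batch_size=1024):
    """Score (userID, features) pairs in blocks with one matrix product per block.

    app_features is the rank x nApps matrix (the value of the broadcast variable).
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        block = np.array([features for _, features in batch])  # users x rank
        scores = block @ app_features                           # users x nApps
        for (user_id, _), row in zip(batch, scores):
            yield user_id, row

# Local check with toy data. On a cluster this might be wired in roughly as
#   uf.mapPartitions(lambda it: score_partition(it, pf.value))
toy_apps = np.array([[1.0, 0.0], [0.0, 2.0]])   # rank 2, two Apps
toy_users = [(7, [1.0, 1.0]), (9, [0.5, 3.0])]
result = dict(score_partition(iter(toy_users), toy_apps, batch_size=1))
print(result)
```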
Evaluation: Predicted Score
[Figure panels: "Predicted Score: Positive" and "Predicted Score: Negative"]
Evaluation of Recommendation
● Identify users with high (low) scores
● Design of experiment:
● High score x Recommendation
● High score x Placebo
● Low score x Recommendation
● Low score x Placebo
Future Work
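The experimental design (high or low predicted score, crossed with recommendation or placebo) could be set up along these lines. This is only a sketch: the score threshold, the 50/50 random assignment, and the helper name `assign_cells` are assumptions, and the scores are made up.

```python
import random

def assign_cells(scores, threshold, seed=0):
    """Map each user to (score group, treatment arm) with a random arm."""
    rng = random.Random(seed)
    cells = {}
    for user_id, score in scores.items():
        group = "high" if score >= threshold else "low"
        arm = rng.choice(["recommendation", "placebo"])
        cells[user_id] = (group, arm)
    return cells

# Made-up predicted scores 0.0, 0.1, ..., 1.9 for twenty toy users.
toy_scores = {f"user_{i}": i / 10 for i in range(20)}
cells = assign_cells(toy_scores, threshold=1.0)

for cell in sorted(set(cells.values())):
    print(cell, sum(1 for c in cells.values() if c == cell))
```

Comparing outcomes between the recommendation and placebo arms within each score group is what would separate the effect of the recommendation from the effect of simply being shown something.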
● Spark econometrics library (standard errors, robust standard errors, ...)
● Online experiments to measure the value of recommendations.
● Experiments with various implicit ratings:
● number of sessions
● days used
● log of days used
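The alternative implicit ratings listed above could all be derived from the same session log. The sketch below assumes a per-user, per-App list of (day, duration in seconds) records, which is a hypothetical layout and not the Flurry schema; `implicit_ratings` is an illustrative helper name.

```python
import math

def implicit_ratings(sessions):
    """sessions: list of (day, duration_seconds) for one user and one App."""
    days_used = len({day for day, _ in sessions})
    return {
        "n_sessions": len(sessions),
        "days_used": days_used,
        # log(1) is 0, so single-day users get a rating of zero here
        "log_days_used": math.log(days_used) if days_used else None,
        "total_duration": sum(duration for _, duration in sessions),
    }

toy_sessions = [("2015-09-01", 120), ("2015-09-01", 30), ("2015-09-03", 60)]
ratings = implicit_ratings(toy_sessions)
print(ratings)
```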
 
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men  🔝Bangalore🔝   Esc...
➥🔝 7737669865 🔝▻ Bangalore Call-girls in Women Seeking Men 🔝Bangalore🔝 Esc...
 
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 

26 Trillion App Recommendations using 100 Lines of Spark Code - Ayman Farahat

  • 1. 26 Trillion App Recommendations using 100 Lines of Spark Code Ayman Farahat
  • 2. ● Motivation ● Spark Implementation ○ Collaborative Filtering ○ Data Frames ○ BLAS-3 ● Results and lessons learnt. Overview
  • 3. ● App discovery is a challenging problem due to the exponential growth in the number of apps ● Over 1.5 million apps are available through both marketplaces (i.e. iTunes and Google Play store) ● Develop an app recommendation engine using various user behavior signals ○ Explicit Signal (App rating) ○ Implicit Signal (frequency/duration of app usage) Motivation
  • 4. ● Data available through the Flurry SDK is rich in both coverage and depth ● Collected session length for Apps used on the iOS platform in the period Sept 1-15, 2015. ● Restricted analysis to Apps used by 100 or more users ○ ~496 million Users ○ ~53,793 Apps Flurry Data and Summary
  • 5. ● User Count : 496,508,312 ● App Count : 153,773 ● App 100+ : 53,793 ● Train time : 52 minutes ● Predict time : 8 minutes Data Summary
  • 6. ● Use collaborative-filtering-based App recommendation ● Run collaborative filtering that works at scale to generate: ○ Low-dimension user features ○ Low-dimension App features ○ Compute the user x App rating for all possible combinations (26.7 trillion) ● Used the Spark framework to train and recommend efficiently. Our Approach
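
The 26.7 trillion figure on slide 6 follows from the counts on slide 5: every user is scored against every app that has 100 or more users. A quick check, using only numbers stated in the deck (the variable names are chosen here for illustration):

```python
# Counts taken from the "Data Summary" slide.
user_count = 496_508_312
apps_100_plus = 53_793

# One predicted rating per (user, app) pair.
total_pairs = user_count * apps_100_plus
print(f"{total_pairs:,} pairs, about {total_pairs / 1e12:.1f} trillion")
```
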
  • 7. ● Projects the users and Apps (in our case) into a lower dimensional space Collaborative Filtering Model
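
Once users and apps share a low-dimensional space, a predicted rating in a matrix-factorization model is the dot product of the two feature vectors. A minimal sketch; the function name and the 3-factor vectors are hypothetical, not taken from the deck:

```python
def predict_rating(user_features, app_features):
    """Predicted rating = dot product of the two latent feature vectors."""
    if len(user_features) != len(app_features):
        raise ValueError("feature vectors must have the same length")
    return sum(u * a for u, a in zip(user_features, app_features))


# Hypothetical 3-factor vectors, kept tiny for readability.
user = [0.5, 1.0, -2.0]
app = [2.0, 0.5, 0.25]
score = predict_rating(user, app)
```
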
  • 8. ● Used out-of-sample prediction accuracy on users with 20 or more Apps ● The MSE was lowest with the number of factors set to 60 Model Fitting and Parameter Optimization
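
Choosing the number of factors by out-of-sample MSE, as slide 8 describes, can be sketched as follows. The helper names, candidate ranks, and numbers are illustrative assumptions; they are not results from the deck:

```python
def mse(predicted, actual):
    """Mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def best_rank(heldout_mse_by_rank):
    """Return the number of factors with the lowest held-out MSE."""
    return min(heldout_mse_by_rank, key=heldout_mse_by_rank.get)


# Toy held-out ratings and toy predictions for three candidate ranks.
heldout = [2.0, 5.0]
example_errors = {
    2: mse([1.0, 3.0], heldout),
    4: mse([2.0, 4.0], heldout),
    8: mse([2.0, 3.0], heldout),
}
chosen = best_rank(example_errors)
```

In practice each entry would come from training a model with that rank and scoring ratings that were held out of training.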
  • 9. ● A join operation can benefit greatly from caching.
       ● Filter out Apps that have fewer than 100 users:
           cleandata = allapps.join(cleanapps)
       ● Do a replicated join in Spark:
           # only keep the apps that had 100 or more users
           cleanapps = myapps.filter(lambda x: x[1] >= MAXAPPS).map(lambda x: int(x[0]))
           # broadcast the set of app ids to every executor
           apps = sc.broadcast(set(cleanapps.collect()))
           # filter against the broadcast set: this simulates a replicated join
           cleandata = allapps.filter(lambda x: x[1] in apps.value)
       Data Frames
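
The idea behind the replicated join on slide 9 can be tried without a cluster. The sketch below uses plain Python collections as stand-ins: the set plays the role of the broadcast variable, and the list comprehension plays the role of the map-side filter. The records, the `MIN_USERS` name, and the lowered threshold are assumptions made to keep the toy data small:

```python
from collections import defaultdict

MIN_USERS = 2  # the deck uses 100; lowered here for the toy data

# Hypothetical (userId, appId, value) records standing in for `allapps`.
allapps = [
    ("u1", 10, 3.0), ("u2", 10, 1.5), ("u3", 10, 2.0),
    ("u1", 20, 4.0),
    ("u2", 30, 0.5), ("u3", 30, 1.0),
]

# Count distinct users per app (the small side of the join).
users_per_app = defaultdict(set)
for user_id, app_id, _ in allapps:
    users_per_app[app_id].add(user_id)

# Stand-in for the broadcast variable: every worker would hold a full copy,
# so the large side never has to be shuffled.
popular_apps = {app for app, users in users_per_app.items()
                if len(users) >= MIN_USERS}

# Map-side filter, equivalent to an inner join on appId.
cleandata = [row for row in allapps if row[1] in popular_apps]
```

This only works because the set of qualifying apps is small enough to copy to every worker, which is the property the later BLAS 3 slides rely on as well.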
  • 10. ● In Spark you can use a DataFrame directly:
           from pyspark.sql import Row
           Record = Row("userId", "iuserId", "appId", "value")
           MAXAPPS = 100
           # transform allapps to a DataFrame
           allappsdf = allapps.map(lambda x: Record(*x)).toDF()
           # register the DataFrame and issue SQL queries
           sqlContext.registerDataFrameAsTable(allappsdf, "table1")
           # group by the app ID; aliases added so the count can be referenced by name
           df2 = sqlContext.sql("SELECT appId AS appId2, avg(value) AS avgvalue, count(*) AS nusers FROM table1 GROUP BY appId")
           topappsdf = df2.filter(df2.nusers >= MAXAPPS)
           # DataFrame join
           cleandata = allappsdf.join(topappsdf, allappsdf.appId == topappsdf.appId2)
        Data Frames
  • 11. ● The number of possible user x App combinations is very large ● Default prediction: predictAll ○ predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2])) ○ Each prediction is simply the product of the feature vectors of user “i” and App “j” ● Never completes; most of the time is spent on reshuffling. ● The users are not partitioned, so they can be on all nodes. ● The Apps are not partitioned, so they can be on all nodes. ● Reshuffling is extremely slow. BLAS 3
  • 12. ● The key is that the Number of Apps << Number of users ● Exploit the low number of Apps to optimize the prediction time BLAS 3
  • 13. ● The App features, being smaller in size, can be held in main memory (BLAS 3) ● We broadcast the App features to all executors, which reduces the overall reshuffling of data ● Use the BLAS-3 matrix multiplication available within numpy, which is highly optimized BLAS 3
  • 14. Basic Linear Algebra Subprograms (BLAS), level 3: routines for matrix-matrix problems of the form D = a A * b B + c C. Highly optimized for matrix multiplication. BLAS 3
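
For reference, the level-3 BLAS general matrix multiply (GEMM) is usually written as the update below; reading the slide's formula as this operation is an assumption. The symbols U (users x factors), P (apps x factors), and S (the score matrix) are names chosen here, not the deck's. Scoring all users against all apps is the special case with α = 1 and β = 0:

```latex
C \leftarrow \alpha\, A B + \beta\, C,
\qquad
S = U P^{\top} \quad (\alpha = 1,\ \beta = 0)
```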
  • 15. import numpy as np
           from pyspark.mllib.recommendation import MatrixFactorizationModel
           myModel = MatrixFactorizationModel.load(sc, "BingBong")
           # collect the App (product) feature vectors on the driver
           m1 = myModel.productFeatures()
           m2 = m1.map(lambda kv: kv[1]).collect()
           # factors x Apps matrix, broadcast to every executor
           m3 = np.array(m2).T
           pf = sc.broadcast(m3)
           uf = myModel.userFeatures().coalesce(100)
           # get predictions for all users: (1 x factors) . (factors x Apps)
           f1 = uf.map(lambda kv: (kv[0], np.dot(np.array(kv[1]), pf.value)))
        BLAS 3
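
The broadcast-and-multiply step on slide 15 can be checked locally with numpy alone. This is a sketch under stated assumptions: random toy matrices stand in for the trained model's features, and `app_matrix` stands in for the broadcast value. It confirms that one matrix product gives the same scores as computing each (user, app) dot product separately:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_apps, n_factors = 5, 4, 3  # toy sizes chosen for illustration

user_features = rng.normal(size=(n_users, n_factors))
app_features = rng.normal(size=(n_apps, n_factors))

# Stand-in for the broadcast variable: factors x apps, shared by all workers.
app_matrix = app_features.T

# One matrix-matrix product scores every user against every app.
scores = user_features @ app_matrix  # shape: (n_users, n_apps)

# Same numbers as scoring each (user, app) pair one at a time.
pairwise = np.array([[np.dot(u, a) for a in app_features]
                     for u in user_features])
```

numpy typically hands the `@` product to an optimized BLAS routine, which is why batching the dot products into one multiplication pays off.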
  • 17. Predicted Score : Positive
  • 18. Predicted Score : Negative
  • 19. Evaluation of Recommendation ● Identify users with high (low) scores ● Design of experiment: ● High score x Recommendation ● High score x Placebo ● Low score x Recommendation ● Low score x Placebo
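
One way to set up the experiment cells on slide 19 is to cross each user's score group with a randomly drawn treatment. The sketch below assumes a full 2 x 2 crossing, a single score threshold, and uniform random assignment; none of these details are given in the deck, and the function and user ids are hypothetical:

```python
import random


def assign_cells(scores, threshold, seed=0):
    """Cross score group (high/low) with a randomly drawn treatment."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    cells = {}
    for user_id, score in scores.items():
        group = "high" if score >= threshold else "low"
        treatment = rng.choice(["recommendation", "placebo"])
        cells[user_id] = (group, treatment)
    return cells


# Hypothetical predicted scores for four users.
toy_scores = {"u1": 0.9, "u2": -0.4, "u3": 0.7, "u4": -0.8}
example = assign_cells(toy_scores, threshold=0.0)
```
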
  • 20. Future Work ● Spark econometrics library (std. errors, robust std. errors, ...) ● Online experiments to measure the value of recommendations. ● Experiments with various implicit ratings: ● Number of sessions ● Days used ● Log of days used
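
The three implicit ratings listed on slide 20 can all be derived from a session log. A minimal sketch, assuming one `(user_id, app_id, day)` record per session; the record layout and function name are hypothetical:

```python
import math
from collections import defaultdict


def implicit_ratings(sessions):
    """sessions: iterable of (user_id, app_id, day) tuples, one per session."""
    counts = defaultdict(int)
    days = defaultdict(set)
    for user_id, app_id, day in sessions:
        counts[(user_id, app_id)] += 1
        days[(user_id, app_id)].add(day)
    return {
        key: {
            "sessions": counts[key],
            "days_used": len(days[key]),
            "log_days_used": math.log(len(days[key])),
        }
        for key in counts
    }


# Hypothetical log: user u1 opened app 10 three times across two days.
ratings = implicit_ratings([
    ("u1", 10, "2015-09-01"),
    ("u1", 10, "2015-09-01"),
    ("u1", 10, "2015-09-02"),
])
```

Note that a plain log maps a single day of use to 0, so a variant such as log(1 + days) may be worth comparing; that choice is not discussed in the deck.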

Editor's notes

  1. The challenge is that native ads are supposed to offer an experience similar to content, but at the same time should not mislead users. “In some instances 16-35% of ads are confused for creative.” You mean confused for content?