
Computing recommendations at extreme scale with Apache Flink @Buzzwords 2015


How to scale recommendations to extremely large data sets using Apache Flink. We use matrix factorization to compute a latent factor model that can be used for collaborative filtering. The implemented alternating least squares algorithm can handle data sizes on the scale of Netflix.

Published in: Technology


  1. Till Rohrmann (Flink committer / data Artisans, trohrmann@apache.org, @stsffap): Computing Recommendations at Extreme Scale with Apache Flink
  2. Recommendations: Collaborative Filtering
  3. Recommendations
      - Omnipresent nowadays
      - Important for user experience and sales
  4. Recommending Websites
  5. Collaborative Filtering
      - Recommend items based on users with similar preferences
      - Latent factor models capture underlying characteristics of items and preferences of users
      - Predicted preference: $\hat{r}_{u,i} = x_u^T y_i$
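  6. Rating Matrix
      - Explicit or implicit ratings
      - Prediction goal: rating for unseen items
      - Rating matrix (rows: users, columns: items; "?" marks an unknown rating): $\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix}$
      - [Figure labels: Princess, Facebook]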
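The prediction goal above can be sketched in plain Scala: once latent factors are available, the unseen items of a user are ranked by the predicted preference $x_u^T y_i$. This is only an illustration; the object name `Recommend`, the function `topUnseen`, and the in-memory `Map` layout are assumptions and not part of the talk.

```scala
object Recommend {
  // Dot product of two latent factor vectors of equal length.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(k => a(k) * b(k)).sum

  // Ranks the items a user has not rated yet by predicted preference x_u^T y_i
  // and returns the n best (item id, predicted score) pairs.
  // The Map-based layout is an assumption made for this sketch.
  def topUnseen(userFactors: Array[Double],
                itemFactors: Map[Int, Array[Double]],
                alreadyRated: Set[Int],
                n: Int): Seq[(Int, Double)] =
    itemFactors.toSeq
      .filterNot { case (item, _) => alreadyRated.contains(item) }
      .map { case (item, y) => item -> dot(userFactors, y) }
      .sortBy { case (_, score) => -score }
      .take(n)
}
```

For example, `Recommend.topUnseen(x, itemFactors, Set(0), 5)` would skip item 0 and return up to five of the remaining items, best first.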
  7. Matrix Factorization
      - Calculate low rank approximation to obtain latent factors
      - Objective: $\min_{X,Y} \sum_{r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)$
      - Rating matrix (rows: users, columns: items): $\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$
      - [Figure labels: Princess, Facebook]
      - $r_{u,i}$ = rating of user u for item i
      - $x_u$ = latent factors of user u
      - $y_i$ = latent factors of item i
      - $\lambda$ = regularization constant
      - $n_u$ = number of rated items for user u
      - $n_i$ = number of ratings for item i
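As a rough illustration of the objective above, the following plain Scala sketch evaluates the regularized squared error over the observed ratings. The names `AlsObjective`, `Rating`, and `loss`, as well as the in-memory `Map` representation of X and Y, are assumptions made for this example and do not come from the talk.

```scala
object AlsObjective {
  // A single observed rating r_{u,i} (hypothetical helper type).
  final case class Rating(user: Int, item: Int, value: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(k => a(k) * b(k)).sum

  // Regularized squared error over the observed ratings only.
  // x(u) and y(i) are the latent factor vectors; lambda is the regularization constant.
  def loss(ratings: Seq[Rating],
           x: Map[Int, Array[Double]],
           y: Map[Int, Array[Double]],
           lambda: Double): Double = {
    val squaredError = ratings.map { r =>
      val e = r.value - dot(x(r.user), y(r.item))
      e * e
    }.sum
    // n_u and n_i: number of ratings per user and per item.
    val nu = ratings.groupBy(_.user).map { case (u, rs) => u -> rs.size }
    val ni = ratings.groupBy(_.item).map { case (i, rs) => i -> rs.size }
    val userPenalty = nu.map { case (u, n) => n * dot(x(u), x(u)) }.sum
    val itemPenalty = ni.map { case (i, n) => n * dot(y(i), y(i)) }.sum
    squaredError + lambda * (userPenalty + itemPenalty)
  }
}
```

Weighting each penalty by the number of ratings, as in the formula, regularizes frequently rated users and items more strongly than rare ones.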
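  8. Alternating Least Squares
      - Hard to optimize because there are two unknown variables, X and Y
      - Fixing one variable gives a quadratic problem
      - Update for a user vector with Y fixed: $x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$
      - $S^u_{ii} = \begin{cases} 1 & \text{if } r_{u,i} \neq 0 \\ 0 & \text{else} \end{cases}$
      - We only need the vectors of the items rated by user u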
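The closed-form update can be derived in a few lines by fixing Y and setting the gradient with respect to a single $x_u$ to zero. The derivation below is a sketch under two assumptions that the slides do not state explicitly: Y holds the item vectors $y_i$ as columns, and $r_u$ is the row of ratings of user u with zeros for unrated items.

```latex
\begin{aligned}
f(x_u) &= \sum_{i:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right)^2 + \lambda n_u \lVert x_u \rVert^2 \\
\nabla_{x_u} f &= -2 \sum_{i:\, r_{u,i} \neq 0} \left( r_{u,i} - x_u^T y_i \right) y_i + 2 \lambda n_u x_u = 0 \\
\Rightarrow\quad \Big( \sum_{i:\, r_{u,i} \neq 0} y_i y_i^T + \lambda n_u I \Big)\, x_u &= \sum_{i:\, r_{u,i} \neq 0} r_{u,i}\, y_i \\
\Rightarrow\quad x_u &= \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T
\end{aligned}
```

The last step uses $Y S^u Y^T = \sum_{i:\, r_{u,i} \neq 0} y_i y_i^T$, since $S^u$ selects exactly the rated items. The update for an item vector $y_i$ with X fixed follows by symmetry.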
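  9. ALS Algorithm
      - Update user matrix
      - Rating matrix (rows: users, columns: items): $\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$
      - Diagram annotations: "Keep fixed" and "Calculate update" (Y is kept fixed while X is updated)
  10. ALS Algorithm contd.
      - Update item matrix
      - Rating matrix (rows: users, columns: items): $\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$
      - Diagram annotations: "Keep fixed" and "Calculate update" (X is kept fixed while Y is updated)
  11. ALS Algorithm contd.
      - Repeat update step until convergence
      - Rating matrix (rows: users, columns: items): $\begin{pmatrix} 10 & 5 & ? \\ ? & 2 & 10 \\ 7 & ? & 5 \end{pmatrix} \approx X \cdot Y$
      - Diagram annotations: "Keep fixed" and "Calculate update" (the two roles alternate between X and Y)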
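The alternating update steps can be sketched as a small single-machine program in plain Scala. This is not the Flink implementation from the talk; it only illustrates the algorithm. All names (`SimpleAls`, `solve`, `update`, `fit`), the random initialization, and the fixed iteration count in place of a convergence check are assumptions of this sketch.

```scala
object SimpleAls {
  final case class Rating(user: Int, item: Int, value: Double)
  type Vec = Array[Double]

  // Solves A x = b by Gaussian elimination with partial pivoting.
  def solve(a: Array[Array[Double]], b: Vec): Vec = {
    val n = b.length
    // Augmented matrix [A | b], copied so the inputs stay untouched.
    val m = Array.tabulate(n, n + 1)((r, c) => if (c < n) a(r)(c) else b(r))
    for (col <- 0 until n) {
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (row <- col + 1 until n) {
        val f = m(row)(col) / m(col)(col)
        for (c <- col to n) m(row)(c) -= f * m(col)(c)
      }
    }
    // Back substitution on the upper triangular system.
    val x = new Array[Double](n)
    for (row <- n - 1 to 0 by -1) {
      val s = (row + 1 until n).map(c => m(row)(c) * x(c)).sum
      x(row) = (m(row)(n) - s) / m(row)(row)
    }
    x
  }

  // One half-step: recompute every vector on one side while the other side stays fixed.
  // `rated` maps an id to its (id on the fixed side, rating) pairs.
  def update(rated: Map[Int, Seq[(Int, Double)]], fixed: Map[Int, Vec],
             k: Int, lambda: Double): Map[Int, Vec] =
    rated.map { case (id, rs) =>
      // Left-hand side: sum of outer products plus lambda * n * I.
      val a = Array.tabulate(k, k) { (p, q) =>
        rs.map { case (j, _) => fixed(j)(p) * fixed(j)(q) }.sum +
          (if (p == q) lambda * rs.size else 0.0)
      }
      // Right-hand side: rating-weighted sum of the fixed vectors.
      val b = Array.tabulate(k)(p => rs.map { case (j, r) => r * fixed(j)(p) }.sum)
      id -> solve(a, b)
    }

  // Runs a fixed number of alternating updates and returns (user factors, item factors).
  def fit(ratings: Seq[Rating], k: Int, lambda: Double, iterations: Int,
          seed: Long = 42L): (Map[Int, Vec], Map[Int, Vec]) = {
    val rnd = new scala.util.Random(seed)
    val byUser = ratings.groupBy(_.user).map { case (u, rs) => u -> rs.map(r => (r.item, r.value)) }
    val byItem = ratings.groupBy(_.item).map { case (i, rs) => i -> rs.map(r => (r.user, r.value)) }
    var y: Map[Int, Vec] = byItem.keys.map(i => i -> Array.fill(k)(rnd.nextDouble())).toMap
    var x: Map[Int, Vec] = Map.empty
    for (_ <- 1 to iterations) {
      x = update(byUser, y, k, lambda) // keep Y fixed, update X
      y = update(byItem, x, k, lambda) // keep X fixed, update Y
    }
    (x, y)
  }
}
```

A call such as `SimpleAls.fit(ratings, k = 2, lambda = 0.1, iterations = 10)` returns the two factor maps. With a positive `lambda` the linear systems are positive definite, so the elimination does not hit a zero pivot.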
  12. Apache Flink
  13. What is Apache Flink?
      - Apache Flink deep-dive by Stephan Ewen: tomorrow, 12:20 – 13:00 on Stage 3
      - [Stack diagram labels: Gelly, Table, FlinkML, SAMOA; DataSet (Java/Scala/Python), DataStream (Java/Scala); Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP); Streaming dataflow runtime; Local, Remote, Yarn, Tez, Embedded]
  14. Why Using Flink for ALS?
      - Expressive API
      - Pipelined stream processor
      - Closed loop iterations
      - Operations on managed memory
  15. Expressive APIs
      - DataSet: Abstraction for distributed data
      - Computation specified as a sequence of lazily evaluated transformations

          case class Word(word: String, frequency: Int)

          val lines: DataSet[String] = env.readTextFile(…)

          // Split each line into words, then count the occurrences per word.
          lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
            .groupBy("word").sum("frequency")
            .print()
  16. Program Execution

          case class Path(from: Long, to: Long)

          val tc = edges.iterate(10) {
            paths: DataSet[Path] =>
              val next = paths
                .join(edges)
                .where("to")
                .equalTo("from") {
                  (path, edge) => Path(path.from, edge.to)
                }
                .union(paths)
                .distinct()
              next
          }

      - [Diagram labels: Program, Dataflow Graph; Pre-flight (Client): Type extraction stack, Optimizer; Master: Task scheduling, Dataflow metadata, deploy operators, track intermediate results; Workers; example plan: Data Source orders.tbl, Filter, Map, Data Source lineitem.tbl, Join (Hybrid Hash: build HT, probe), hash-part [0], hash-part [0], GroupRed (sort), forward]
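To see what the iteration above computes, here is a minimal sketch of the same transitive closure logic on plain Scala sets, without Flink. The object name `TransitiveClosure` and the functions `step` and `closure` are made up for this illustration. Unlike the Flink program, nothing here is lazy or distributed.

```scala
object TransitiveClosure {
  final case class Path(from: Long, to: Long)

  // One step: extend every known path by one edge (the join on to == from),
  // then keep both the old and the new paths (union + distinct via Set).
  def step(paths: Set[Path], edges: Set[Path]): Set[Path] = {
    val edgesByFrom = edges.groupBy(_.from)
    val extended = for {
      path <- paths
      edge <- edgesByFrom.getOrElse(path.to, Set.empty[Path])
    } yield Path(path.from, edge.to)
    extended union paths
  }

  // Applies the step a fixed number of times, starting from the edges themselves.
  def closure(edges: Set[Path], maxIterations: Int): Set[Path] =
    (1 to maxIterations).foldLeft(edges)((paths, _) => step(paths, edges))
}
```

For a chain 1 → 2 → 3 → 4, `TransitiveClosure.closure(edges, 10)` adds the paths 1 → 3, 2 → 4, and 1 → 4 to the three original edges.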
  17. Pipelined Stream Processor
      - Avoiding materialization of intermediate results
  18. Iterate by looping
      - for/while loop in client submits one job per iteration step
      - Data reuse by caching in memory and/or disk
      - [Diagram labels: Client, Step, Step, Step, Step, Step]
  19. Iterate in the Dataflow
  20. Memory Management
  21. Memory Management
  22. ALS implementations with Apache Flink
  23. Naïve Implementation
      1. Join item vectors with ratings
      2. Group on user ID
      3. Compute new user vectors: $x_u = \left( Y S^u Y^T + \lambda n_u I \right)^{-1} Y r_u^T$
  24. Pros and Cons of Naïve ALS
      - Pros
          - Easy to implement
      - Cons
          - Item vectors are sent redundantly to network nodes
          - Two shuffle steps make execution expensive
  25. Blocked ALS Implementation
      1. Create user and item rating blocks
      2. Cache them on worker nodes
      3. Send all item vectors needed by user rating block bundled
      4. Compute block of item vectors
      - Based on Spark's MLlib implementation
  26. Pros and Cons of Blocked ALS
      - Pros
          - Reduces network load by avoiding data duplication
          - Caching ratings: Only one shuffle step needed
      - Cons
          - Duplicates the rating matrix (user block/item block partitioning)
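The difference in network load between the two variants can be estimated with a back-of-the-envelope count of how many item vector copies are shipped during one user update. The sketch below is a simplification, not a measurement of the Flink implementations. The names `ShuffleCost`, `naiveCopies`, and `blockedCopies` are hypothetical, and the assignment of users to blocks by `user % numBlocks` is an assumption.

```scala
object ShuffleCost {
  final case class Rating(user: Int, item: Int, value: Double)

  // Naive plan: joining item vectors with ratings emits one vector copy per rating.
  def naiveCopies(ratings: Seq[Rating]): Int = ratings.size

  // Blocked plan: an item vector is sent at most once to every user block
  // that contains at least one rating for that item.
  // Assumption: users are assigned to blocks by user id modulo numBlocks.
  def blockedCopies(ratings: Seq[Rating], numBlocks: Int): Int =
    ratings.map(r => (r.item, r.user % numBlocks)).distinct.size
}
```

The blocked count can never exceed the naive one, and the gap grows when many users in the same block rate the same popular items.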
  27. Performance Comparison
      - 40-node GCE cluster, highmem-8
      - 10 ALS iterations with 50 latent factors
      - Rating matrix has 28 billion non-zero entries: scale of Netflix or Spotify
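  28. Machine Learning with FlinkML
      - FlinkML contains blocked ALS
      - Support for many other tasks
          - Clustering
          - Regression
          - Classification
      - scikit-learn like pipeline support

          import org.apache.flink.api.scala._
          import org.apache.flink.ml.common.ParameterMap
          import org.apache.flink.ml.recommendation.ALS

          // ratingData and testingData are CSV file paths defined elsewhere.
          val env = ExecutionEnvironment.getExecutionEnvironment
          val als = ALS()
          val ratingDS = env.readCsvFile[(Int, Int, Double)](ratingData)
          val parameters = ParameterMap()
            .add(ALS.Iterations, 10)
            .add(ALS.NumFactors, 50)
            .add(ALS.Lambda, 1.5)
          als.fit(ratingDS, parameters)
          // Predict ratings for (user, item) pairs.
          val testingDS = env.readCsvFile[(Int, Int)](testingData)
          val predictions = als.predict(testingDS)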
  29. Closing
  30. What Have You Seen?
      - How to use collaborative filtering to make recommendations
      - Apache Flink, a powerful parallel stream processing engine
      - How to use Apache Flink and alternating least squares to factorize really large matrices
  31. (no text)
  32. flink.apache.org, @ApacheFlink
