The ever-growing amount of data available on the Internet calls for personalization. Yet the most effective personalization schemes, such as those based on collaborative filtering (CF), are notoriously resource-hungry. We argue that scalable infrastructures should rely on a P2P design to cope with the increasing number of users, volume of data and dynamics.
In this talk, I will present a scalable algorithm for collaborative filtering whose P2P flavor provides scalability by design. This scheme can be instantiated in various settings: fully P2P systems; hybrid infrastructures that offload CPU-intensive recommendation tasks to front-end client browsers while retaining storage and orchestration tasks within back-end servers; and centralized infrastructures. As personalization relies on users giving away information, I will also present the privacy issues that may arise and a range of solutions to preserve users' privacy in such systems.
2. Inria, France: 8 research centers, 150 research teams
The eight Inria research centers
• Inria RENNES Bretagne Atlantique
• Inria BORDEAUX Sud-Ouest
• Inria PARIS - Rocquencourt
• Inria LILLE Nord Europe
• Inria NANCY Grand Est
• Inria SACLAY Île-de-France
• Inria GRENOBLE Rhône-Alpes
• Inria SOPHIA ANTIPOLIS Méditerranée
June 2014 A.-M. Kermarrec (Inria)
4. A cry for personalization
5. Why is personalization so difficult?
• Huge volume of data: small portion of interest
• Dynamic interests
• Interesting stuff does not always come from friends
• Classical notification systems filter either too little or too much
Scalable personalization infrastructures
6. KNN computation over large data
Basic building block for many applications
• Similarity search
• Machine learning
• Data mining
• Image processing
• Collaborative filtering
7. KNN-based user-centric collaborative filtering
Provide each user with her k closest neighbors
(Each user owns a profile; the system has its favorite similarity metric)
Use this topology for
• personalized notifications
• recommendation
[Figure: KNN graph linking the users Alice, Bob, Carl, Dave and Ellie]
8. Dealing with truly big data
Want to scale? Think P2P
9. Do not look exhaustively
10. The key to scalability in KNN graph construction
Consider a partial set of candidates
Sampling-based approach
11. P2P KNN graph construction
Which nodes are close? Similarity metric
How to discover them? Sampling
12. Which nodes are close?
Model
U(sers) × I(tems)
Profile(u) = vector of liked/shared/viewed items
Similarity metrics: cosine similarity, Jaccard
Minimal information: no tags, no user input, generic
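The two metrics named on this slide can be sketched as follows. This is a minimal illustration, assuming that a profile is reduced to the set of items a user liked, shared or viewed (a binary vector); the function names and the sample profiles are hypothetical and do not come from the talk.

```python
import math

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(a: set, b: set) -> float:
    """Cosine similarity of two binary item vectors represented as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical profiles: the sets of item identifiers each user liked.
alice = {"i1", "i2", "i3"}
bob = {"i2", "i3", "i4"}

print(jaccard(alice, bob))  # 2 shared items out of 4 distinct items
print(cosine(alice, bob))   # 2 / sqrt(3 * 3)
```

Both metrics need nothing beyond the item sets themselves, which matches the "minimal information" point: no tags and no explicit user input.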
13. How to discover them: Gossip-based computing
Each node maintains a set of neighbors (c entries)
Peer exchange: nodes P and Q shuffle their entries
Result: random graph
• Highly resilient against churn, partition
• Small diameter
[JGKVV, ACM TOCS 2007]
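A gossip-based peer exchange of this kind can be sketched as below. This is a simplified illustration and not the protocol of the cited paper: it assumes that each node sends half of its view plus its own identifier, and that the merged view is truncated at random back to c entries. The function names, the ring bootstrap and the parameter values are all assumptions made for the example.

```python
import random

def merge(view, received, self_id, c, rng):
    """Union of old and received entries, minus self, cut at random to c entries."""
    pool = sorted((set(view) | set(received)) - {self_id})
    return set(rng.sample(pool, min(c, len(pool))))

def shuffle(views, p, q, c, rng):
    """One peer exchange: P and Q send each other half of their view plus their own id."""
    half = max(1, c // 2)
    from_p = rng.sample(sorted(views[p]), min(half, len(views[p]))) + [p]
    from_q = rng.sample(sorted(views[q]), min(half, len(views[q]))) + [q]
    views[p] = merge(views[p], from_q, p, c, rng)
    views[q] = merge(views[q], from_p, q, c, rng)

n, c = 50, 5                       # hypothetical system size and view size
rng = random.Random(42)
# Bootstrap from a ring: node i only knows its c successors.
views = {i: {(i + j) % n for j in range(1, c + 1)} for i in range(n)}
for _ in range(20):                # 20 gossip cycles
    for p in range(n):
        q = rng.choice(sorted(views[p]))   # gossip partner taken from P's own view
        shuffle(views, p, q, c, rng)
```

Each node only ever stores c entries, so the per-node cost stays constant as the system grows while the views keep being refreshed with entries from other peers.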
15. Decentralized KNN selection
[FGKL Middleware 2010]
RPS layer: provides random sampling
KPS clustering layer: gossip-based topology clustering
[Figure: two-layer overlay over the users Alice, Bob, Carl, Dave and Ellie. Legend: interest-based link, random link]
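The interplay of the two layers can be sketched as one refinement round per user. This is a simplified sketch, not the algorithm of the cited paper: it assumes that the candidate set is made of the current neighbors, their neighbors, and a few random users standing in for the random sampling layer, and that Jaccard is the similarity metric. The profiles, `knn_round` and `n_random` are hypothetical.

```python
import random

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_round(user, profiles, knn, k, rng, n_random=2):
    """Refine the k neighbors of `user` from a partial candidate set."""
    others = sorted(u for u in profiles if u != user)
    candidates = set(knn[user])                 # current interest-based links
    for v in knn[user]:
        candidates |= knn[v]                    # neighbors of neighbors
    # Random links: a few peers drawn uniformly, as a random sampling layer would provide.
    candidates |= set(rng.sample(others, min(n_random, len(others))))
    candidates.discard(user)
    # Most similar first; ties are broken by name to keep the result deterministic.
    ranked = sorted(candidates, key=lambda v: (-jaccard(profiles[user], profiles[v]), v))
    return set(ranked[:k])

profiles = {
    "Alice": {"a", "b", "c"},
    "Bob": {"a", "b", "d"},
    "Carl": {"a", "b", "c", "d"},
    "Dave": {"x", "y"},
    "Ellie": {"x", "y", "z"},
}
k = 2
rng = random.Random(1)
# Start from random neighbors, then let every user refine its view for a few rounds.
knn = {u: set(rng.sample(sorted(v for v in profiles if v != u), k)) for u in profiles}
for _ in range(10):
    for u in profiles:
        knn[u] = knn_round(u, profiles, knn, k, rng)
```

Because the current neighbors are always part of the candidate set, a round can only keep or improve the total similarity of a user's neighborhood, without ever comparing the user against the whole population.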
19. What's wrong with news feeds
Interests are dynamic
Classical notification systems filter at the wrong granularity
Only a small portion of the available information is of interest
Interesting stuff does not always come from friends
20. WhatsUp in a nutshell
KNN selection
Dissemination
21. Dissemination: orientation and amplification
Orientation: to whom?
• Exploit: forward to friends
• Explore: forward to random users
Amplification: to how many?
• Increase fanout (log(n))
• Decrease fanout (1)
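The two decisions on this slide can be combined into a single forwarding rule, sketched below. The pairing is an assumption made for illustration: a liked item is exploited (sent to friends with a fanout of about log(n)), a disliked item is explored (sent to one random user). The natural logarithm, the function name and the peer identifiers are likewise hypothetical.

```python
import math
import random

def forward_targets(liked, friends, random_peers, n, rng):
    """Choose the next recipients of a news item after the local user gave her opinion."""
    if liked:
        fanout = max(1, math.ceil(math.log(n)))  # amplification: increase the fanout
        pool = sorted(friends)                   # orientation: exploit, forward to friends
    else:
        fanout = 1                               # amplification: decrease the fanout
        pool = sorted(random_peers)              # orientation: explore, forward at random
    return rng.sample(pool, min(fanout, len(pool)))

rng = random.Random(7)
friends = {f"f{i}" for i in range(10)}           # hypothetical similar users
random_peers = {f"r{i}" for i in range(10)}      # hypothetical random sample
on_like = forward_targets(True, friends, random_peers, 480, rng)
on_dislike = forward_targets(False, friends, random_peers, 480, rng)
print(len(on_like), len(on_dislike))
```

With n = 480 users, the size of the survey used later in the talk, ceil(ln 480) gives a fanout of 7 for a liked item, while a disliked item still reaches one more user and so keeps a chance of finding an interested audience.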
22. WhatsUp in action on the survey (480 users)
Approach        Precision  Recall  F1-Score  Messages
Gossip (f=4)    0.34       0.99    0.51      2.3M
Cosine-CF       0.50       0.65    0.57      5.9k
WhatsUp (f=10)  0.471      0.83    0.60      2.4k
[Figure 7: Cold start and dynamics in WHATSUP. Panels: (b) Similarity in WUP view (WHATSUP-Cos); (c) Reception of liked news items (WHATSUP). x-axis: Cycle.]
[...] nodes (machines and users) deployed on a 25-node cluster equipped with the ModelNet network emulator. For practical reasons we consider a shorter trace and very fast gossip and news-generation cycles of 30 sec, with 5 news items per cycle. These gossip frequencies are higher than those we use in our prototype, but they were chosen to be able to run a large number of experiments in reasonable time. We also use a profile window of 4 min, compatible with the duration of our experiments (1 to 2 hours each).
[Figure (a) Survey: F1-Score. x-axis: Fanout (Flike); y-axis: F1-Score; series: Simulation, PlanetLab, ModelNet.]
23. Orientation (survey)
News items received through a dislike forward
Number of dislikes      0    1    2    3   4
Fraction of liked news  54%  31%  10%  3%  2%
[Figure 6: Survey (f LIKE = 5): Impact of amplification of BEEP. x-axis: NB Hops; y-axis: NB Nodes; series: Forward by like, Infection by like, Forward by dislike, Infection by dislike.]
24. WhatsUp versus Pub/Sub
Approach  Precision  Recall  F1-Score
Pub/Sub   0.40       1.0     0.58
WhatsUp   0.47       0.83    0.60
25. WhatsUp versus cascading
Approach   Precision  Recall  F1-Score
Cascading  0.57       0.09    0.16
WhatsUp    0.56       0.57    0.57
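The F1-Score column can be read with the standard definition in mind: the harmonic mean of precision and recall. The sketch below assumes that this standard definition is the one used in the talk, and recomputes the two rows of this table. Since the table shows rounded precision and recall, a recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rows taken from the table: (precision, recall).
rows = {"Cascading": (0.57, 0.09), "WhatsUp": (0.56, 0.57)}
for name, (p, r) in rows.items():
    print(name, f1_score(p, r))
```

The cascading row shows why F1 is the figure of merit here: a precision of 0.57 cannot compensate for a recall of 0.09, because the harmonic mean is dominated by the smaller of the two values.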
27. Privacy issues
During user clustering
• Exchange of profiles in clear
During item dissemination
• Predictive nature of the protocol
Profile obfuscation
Randomized dissemination
28. Privacy
Obfuscation
• Does not reveal the exact profile
• Reveals only the least sensitive information
Randomized dissemination
• Flips the opinion with a given probability (pf)
30. Structure profiles
Private profile: in clear, full information about the interests
Compact profile: aggregate signatures of liked items
Filter profile: interests of users that like similar items
Obfuscated profile: least sensitive information about interests
Item profile: aggregate interests of users that liked it
33. Obfuscation mechanism
[Figure: obfuscation mechanism. Elements: news item (received), private profile, compact profile, filter profile, obfuscated profile, news item (forwarded). Labels: signature, item profile, mask of popularity, system parameter. Legend: profiles kept locally; profiles exchanged with others.]
34. Randomized dissemination
Flips the opinion with a given probability (pf)
• Attacker could still learn from the profile
• Private profile contains a field with the result of the randomized decision
Generate a randomized compact profile
• Users still use their non-randomized profile locally for clustering
• Differentially private protocol
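The flipping step can be sketched as a randomized-response style decision. This is a minimal illustration and not the full protocol from the talk: the function name and the value chosen for pf are hypothetical, and the sketch covers only the coin flip, not the construction of the randomized compact profile.

```python
import random

def randomized_opinion(liked: bool, pf: float, rng: random.Random) -> bool:
    """Return the opinion used for dissemination: flipped with probability pf."""
    return (not liked) if rng.random() < pf else liked

rng = random.Random(2014)
pf = 0.2                                         # hypothetical flipping probability
true_opinions = [True] * 1000
published = [randomized_opinion(o, pf, rng) for o in true_opinions]
flipped = sum(1 for o, p in zip(true_opinions, published) if o != p)
print(flipped / len(true_opinions))              # empirical flip rate, close to pf
```

An observer who sees a forward can no longer tell for sure whether the user liked the item, since any single decision may be the result of a flip. The true opinion stays in the private profile and remains available for local clustering.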
35. Experimental setup
Simulations and PlanetLab
Alternatives
• Cleartext profile (CT)
• 2DP (DP dissemination and randomized profile for clustering)
Metrics
• Recommendation: recall/precision
• Privacy: distance between obfuscated profile and real profile
Dataset: real survey, 120 users on 200 news items (4 instances)
38. http://131.254.213.98:8080/wup/
Operational prototype
Tested on 500 users @ TrentoRise last year
TRY IT
Take away message
Personalization is needed
Decentralization is healthy
Gossip-based computing is one (the) way to go
39. For those who are afraid of P2P
42. HyRec: Taking the best of both worlds
Online KNN selection
Restricted candidate set (k)
No data stored at the client
HyRec client: JavaScript (widget) running in the browser
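The client-side work can be sketched as follows. This is an illustration under stated assumptions, not HyRec's actual code or API: the server is assumed to send the user's profile together with a restricted candidate set, and the client returns the k most similar candidates plus a few recommended items. The real client is a JavaScript widget; Python is used here only for the sketch, and all names and data are hypothetical.

```python
import math
from collections import Counter

def cosine(a: set, b: set) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def client_knn(user_profile, candidates, k):
    """KNN selection over the candidate set received from the server."""
    ranked = sorted(candidates, key=lambda uid: (-cosine(user_profile, candidates[uid]), uid))
    return ranked[:k]

def recommend(user_profile, candidates, neighbors, n_items):
    """Most popular items among the neighbors that the user has not seen yet."""
    counts = Counter(item for uid in neighbors
                     for item in candidates[uid] if item not in user_profile)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n_items]]

# Hypothetical payload sent by the server for one request.
user_profile = {"a", "b", "c"}
candidates = {"u1": {"a", "b", "d"}, "u2": {"a", "b", "c", "e"},
              "u3": {"x", "y"}, "u4": {"c", "e"}}

neighbors = client_knn(user_profile, candidates, k=2)
items = recommend(user_profile, candidates, neighbors, n_items=2)
print(neighbors, items)  # both results go back to the server; nothing is kept client-side
```

Both functions are pure computations over the request payload, which is consistent with the "no data stored at the client" point: storage and orchestration stay on the server, and only the similarity computation moves to the browser.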
46. HyRec versus the client load
Impact of HyRec; impact of the client load
Negligible disruption of HyRec 50% load
< 60 ms on smartphone
< 10 ms on laptop
47. HyRec versus a centralized recommender
Impact of the request stress; impact of the profile size
48. To take away
Personalization is crucial
P2P is a design mindset
Randomization and obfuscation provide a good tradeoff between privacy and quality