Toward Personalized Peer-to-Peer Top-K Processing

            Xiao Bai              Marin Bertier                      Rachid Guerraoui                        Anne-Marie Kermarrec
                INSA de Rennes, France                                 EPFL, Switzerland                        INRIA Rennes, France




ABSTRACT                                                                                   Very elegant centralized approaches have recently been
We present the first personalized peer-to-peer top-k search                              proposed to capture the personalization through social scor-
protocol for a collaborative tagging system. Each peer main-                            ing models that compute the user-centric correlation among
tains relevant personalized information about its tagging be-                           different tags in order to improve information retrieval qual-
havior as well as that of its social neighbors, and uses those                          ity [1, 2]. In Amer-Yahia et al. [3], the score of an item
to locally process its queries. Extensive experiments based                             only depends on how users sharing similar preferences have
on a real-world dataset crawled from del.icio.us show that                              tagged it. The reference retrieval solution consists in main-
very little storage at each peer suffices to get almost the same                         taining one inverted list per (tag, user), but this is extremely
results as a hypothetical centralized solution with infinite                             space-consuming. Alternative solutions that save storage space
storage.                                                                                do exist, but they increase processing time.
                                                                                           We argue that scalability calls for fully decentralized so-
                                                                                        lutions to process top-k queries. In fact, decentralized solu-
1.     INTRODUCTION                                                                     tions have indeed been proposed. In Michel et al. [4], pre-
   A central task in information retrieval consists in process-                         computed inverted lists are distributed across peers and par-
ing queries in order to obtain the top-k items with the high-                           tial information is transmitted in the network progressively
est scores under a monotonic function. This is particularly                             to approximate the top-k results. In Cuenca-Acuna et al. [5],
appealing in collaborative tagging systems, also called folk-                           a gossip scheme is used to implement distributed content
sonomies, such as Flickr, del.icio.us and CiteULike, which                              search and ranking. None of these decentralized solutions
have become highly popular for publishing and searching                                 is however ‘personalized’.
contents. Yet this can turn into a nightmare because of the                                This paper presents, to our knowledge, the first decen-
unstructured nature of tagging: there is usually no fixed on-                            tralized and personalized top-k processing scheme. In our
tology and users typically choose their tags in a free and pos-                         scheme, each user maintains its own inverted list by period-
sibly ambiguous manner.                                                                 ically discovering its network.1 Our approach alleviates the
   One way to introduce some structure in such a scheme                                 storage space problem while enabling highly efficient local
is to capture affinities between users with common tagging                               query processing. We use a gossip-based network manage-
behaviors and then leverage these in the top-k processing.                              ment protocol to identify users’ personal networks in a peer-
More relevant results for a given user could be achieved                                to-peer way; once these personal networks are established,
should the search be directed in a restricted network of users                          users locally process their queries using a classical top-k al-
exhibiting similar tagging behaviors. For example, a com-                               gorithm, namely NRA (No Random Access). Interestingly,
puter scientist and a Keanu Reeves fan are probably not                                 only a subset of (well selected) users suffices to provide the
interested in the same results when Googling ‘matrix’. User                             most relevant items while preserving their relative order for
affinities can disambiguate these situations and alleviate the                           a given query. We explore various strategies to prevent over-
need for reformulating the same query through several steps                             loading individual users by limiting the size of their personal
with more refined keywords.                                                              networks. This results in little degradation in the quality of
                                                                                        the top-k results.
                                                                                           We have implemented our decentralized and personalized
                                                                                        top-k processing scheme in PeerSim [6] with a dataset crawl-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are               ed from del.icio.us. Interestingly, if all qualified users are
not made or distributed for profit or commercial advantage and that copies               kept as neighbors, after 50 gossip cycles, we retrieve
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific       1
permission and/or a fee.                                                                 Note that this is different from a social network in the traditional
SNS’09, March 31, 2009, Nuremberg, Germany                                              sense, as neighbors in our network are those with similar tagging
Copyright 2009 ACM ISBN 978-1-60558-463-8 ...$5.00.                                     behavior and might be and remain unknown.


at least 8 relevant items out of 10 with a negligible storage               social network model.)
overhead with respect to the idealized centralized solution
of [3] (called Exact). Thanks to our optimization strategy                  Top-k Processing.
which simply keeps the closest neighbors, more than 35%                         Given a query, a top-k processing algorithm (e.g., [7])
of the storage space is additionally saved for each user.                   aims at retrieving the k most relevant items. The goal is
   The rest of this paper is organized as follows: Section 2                typically to minimize the time it takes to come up with these
establishes the social model of our system and provides a                   items as well as the amount of storage needed to perform the
brief review of relevant top-k processing and network-aware                 actual computation.
search techniques. Section 3 describes our peer-to-peer per-                    Information is organized in per-tag inverted lists. Each
sonalized top-k processing scheme and some optimization                     entry of the inverted list contains the identifier of an item
techniques. Section 4 presents our experimental setup and                   and its score for that tag. Inverted lists are sorted in a de-
results. Section 5 concludes the paper.                                     scending order of scores. Below we review the well-known
                                                                            NRA algorithm, as this is considered particularly effective:
2.    SOCIAL NETWORK MODEL AND BACKGROUND                                   it underlies our work as well as that of [3].
                                                                                For a query of n tags, NRA scans the n inverted lists in
                                                                            parallel. In order to do so, NRA maintains a heap of candi-
Social Network Model.                                                       date items. Each candidate has a score lower-bound and a
   In a social network model, collaborative tagging sites are               score upper-bound, which are the overall scores it can attain
typically represented as an information space U × I × T, where              based on the information available at the moment. The score
U denotes the set of users. Each user has a profile that ex-                 lower-bound takes the most pessimistic assumption that if
presses his endorsement of visited items by tagging them.                   an item has not been seen in some lists, then it does not exist
The profile is described in the form of i{t1, ..., tn}, meaning              in them. Conversely, the score upper-bound takes the most
that an item i is tagged by tags t1, ..., tn. I is the set of items         optimistic assumption that its scores in those lists are equal to
that appear in the system and T is the set of all related tags.             the scores of the last seen items in them. Once an item is seen,
Tagged(u, i, t) captures the fact that user u tags the item i               it is either added to or updated in the heap. The score upper-
with the tag t.                                                             bounds of items already in the heap are also updated. Candi-
   We model the social network as a directed graph where                    date items are sorted based on their score lower-bounds. For
a node corresponds to a user and an edge represents the rela-               those with equal lower-bounds, the one with a larger upper-
tionship between two users. We use Link(u, v) to indicate                   bound is ranked ahead. The processing stops when none of
the existence of a directed edge from user u to user v. For                 the items outside the top-k has an upper-bound larger
a user u ∈ U, Network(u) is the set of u’s neighbors, i.e.,                 than the lower-bound of the kth item.
Network(u) = {v | Link(u, v)}.
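
The NRA procedure reviewed in this section (parallel sorted access, score lower- and upper-bounds, and the stopping rule on the kth lower-bound) can be illustrated with a minimal Python sketch. It is not the implementation used in the paper: the function name `nra_top_k`, the in-memory list layout and the example data are hypothetical, the overall score is assumed to be the sum of per-list scores, k is assumed to be at least 1, and the extra check on never-seen items follows the usual formulation of NRA rather than the wording above.

```python
def nra_top_k(inverted_lists, k):
    """Return the top-k (item, lower_bound) pairs using sorted access only.

    inverted_lists: one list per query tag, each holding (item, score)
    entries sorted by descending score (hypothetical in-memory layout).
    """
    n = len(inverted_lists)
    seen = {}          # item -> {list index: score seen in that list}
    last = [0.0] * n   # score of the last entry read from each list

    def ranked():
        # Lower bound: lists where the item is unseen contribute 0.
        # Upper bound: those lists contribute the last score read from them.
        lower = {i: sum(s.values()) for i, s in seen.items()}
        upper = {i: lower[i] + sum(last[j] for j in range(n) if j not in s)
                 for i, s in seen.items()}
        order = sorted(seen, key=lambda i: (-lower[i], -upper[i]))
        return order, lower, upper

    depth = 0
    longest = max((len(lst) for lst in inverted_lists), default=0)
    while depth < longest:
        for j, lst in enumerate(inverted_lists):   # one sorted access per list
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[j] = score
                last[j] = score
            else:
                last[j] = 0.0                      # exhausted list
        depth += 1
        order, lower, upper = ranked()
        if len(order) >= k:
            kth_lower = lower[order[k - 1]]
            others_ok = all(upper[i] <= kth_lower for i in order[k:])
            unseen_ok = sum(last) <= kth_lower     # assumption: bound for never-seen items
            if others_ok and unseen_ok:
                return [(i, lower[i]) for i in order[:k]]
    order, lower, _ = ranked()
    return [(i, lower[i]) for i in order[:k]]


# Two per-tag lists; the true totals are a=8, b=9, c=1, d=2.
tag_lists = [
    [("a", 5), ("b", 4), ("c", 1)],
    [("b", 5), ("a", 3), ("d", 2)],
]
top2 = nra_top_k(tag_lists, 2)
```

In this example the scan stops after two entries per list, so the tail entries c and d are never read. In general the returned values are lower-bounds and may be smaller than the exact scores.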
   There are many possibilities for establishing Link(u, v)                 Network-Aware Search.
according to user preferences. We do not assume any par-                       Amer-Yahia et al. [3] proposed several strategies to achieve
ticular semantics; as such we follow the criterion in [3] that              efficient network-aware top-k processing. Again, for a netw-
Link(u, v) exists if and only if v tags a sufficient number of               ork-aware search, the score of an item only depends on the
items with the same tag as u, i.e.,                                         querier’s network. The most straightforward strategy, called
                                                                            Exact, is to build one inverted list for each (tag, user) pair.
 |{i | ∃t, Tagged(u, i, t) ∧ Tagged(v, i, t)}| > threshold,                  Query Q generated by a user u is processed on the (tj, u)
                                                                            lists, where tj ∈ Q, using a traditional top-k processing al-
where threshold is a predefined number. The number of                        gorithm such as NRA. However, storing these lists is pro-
commonly tagged items is also used to measure the strength                  hibitive space-wise for a single server. Therefore, they ex-
of Link(u, v), denoted by Strength_{Link(u,v)}.                             plore another strategy, Global Upper-Bound, that maintains
                                                                            user-independent inverted lists whose entries only contain
Query And Scoring Model.                                                    the max scores over all users. The exact score of an item
   We consider a query Q = {u, t1 , ..., tn }, issued by a user             is computed at query time. With Global Upper-Bound, con-
u with a set of tags t1 , ..., tn . Returned items for a query              siderable storage space is saved but much more time is re-
should be ranked according to their overall scores. Our score               quired for query processing. To strike a balance between the
is user-specific and network-aware. The score of an item                     two extremes, users are then clustered and only score upper-
i for user u and tag tj is defined as the number of users                    bounds over all clusters are maintained in one inverted list
in u’s network who tag i with tj, i.e.,                                     per (tag, cluster) pair.
Score_{t_j}(u, i) = |Network(u) ∩ {v | Tagged(v, i, t_j)}|. The overall
score of an item i for user u is the sum of all the scores re-
lated to the query, i.e., $Score(u, i) = \sum_{t_j \in Q} Score_{t_j}(u, i)$.   3. PEER-TO-PEER SOLUTION
We use the same scoring functions as in [3] for ease of com-                  In our decentralized scheme, each user maintains a set of
parison. (Note that alternative functions can be used in our                neighbors that form its personal network. A query is pro-

cessed with the information available in the querier’s net-            query already exist in its cache and whether they should be
work to get personalized top-k results. If its network fails to        updated to reflect new neighbors. Once all the related lists
provide any satisfactory results, a default search mechanism           are up-to-date, the user processes its query locally with NRA
will be activated. Here we concentrate on the personalized             to get the top-k results.
processing. In our setting, a key problem is how to build                 The inverted lists are constructed lazily so that an inverted
a personal network for each user in a peer-to-peer way as              list is computed only when it is necessary for the query pro-
quickly as possible so as to guarantee the quality of top-k            cessing. There is no need to pre-compute the inverted lists
results.                                                               for the tags that may never be queried in the Exact central-
                                                                       ized case. Note that the computation of inverted lists is not
3.1 Personal Network Construction                                      more expensive than in Exact when changes occur in the net-
   In our scheme, the personal network is discovered and               work. In contrast, gossip-based protocols make it possible to
maintained through a two-layer gossip protocol. The bottom-            capture such dynamics.
layer gossip protocol, typically known as a random peer sam-
pling protocol (RPS) [8], is in charge of keeping the overlay          3.3 Personal Network Optimization
connected. Basically it provides each peer with c random                  We adapt the criteria in [3] for choosing neighbors in our
peers. On top of the peer sampling protocol, a top-layer pro-          setting. (Note that any metric that captures the affinity of
tocol is in charge of tracking the similarity between users’           users will work as well with our gossip-based personal net-
profiles and discovering new neighbors. Once the top proto-             work construction protocol). Since the common-item-based
col has stabilized, it is still possible to discover new related       neighbor choosing criteria may cause a scalability problem
peers through the peer sampling protocol.                              when the network grows and users add more content to their
   We assume that each peer executes the same protocol in              profiles, we propose several optimization strategies that re-
the same manner every T time units, referred to as a cycle.            duce the size of a user’s network without degrading the top-k
A gossip protocol, bottom and top, consists of two threads:            quality.
an active thread initiating communication with other peers,               Instead of maintaining all the users that meet the criteria
and a passive thread waiting for incoming messages. A gos-             to be a neighbor, only n of them are kept in a user’s personal
sip protocol is fully characterized by three functions: (i)            network. The effect of n will be evaluated in the next sec-
the peer selection (choice of gossip target); (ii) the data ex-        tion. These strategies can also be used independently of the
change (which data are exchanged in a gossip interaction)              criteria in [3] to form personal networks.
and, (iii) the data processing (which data are kept after the             Random n random neighbors are chosen from the candi-
interaction). In this paper, the data exchanged are peers (IP          date users. The intuition is that a random sample is usually
addresses and profiles of peers).                                       representative of the population from which it is drawn.
   At the beginning of each cycle, the peer selection selects             Biased Random Like Random, only n users are ran-
a neighbor with the oldest TimeStamp as a gossip destina-              domly selected as neighbors. However the probability that
tion. Once a peer is picked, its TimeStamp is set to zero and          a user is kept as neighbor is proportional to the strength of
other neighbors’ TimeStamps increase by one. The variable              the link between the two users. So users with stronger links
TimeStamp ensures that all the neighbors have a compara-               have better chances of being chosen as neighbor.
ble chance of participating in gossiping. The data exchange               Nearest: As users are more likely to enjoy what is pre-
function selects gossip-size neighbors from the views of both          ferred by the users having similar preferences, and the simi-
layers and sends the profiles of selected peers to the gos-             larity of user preferences is measured by the strength of the
sip destination. When the gossip destination receives this             link between them, this strategy favors the users pos-
information, its passive thread is activated. It also selects          sessing the strongest links with the gossip initiator and chooses
gossip-size neighbors and sends their profiles to the peer              them as neighbors.
from whom the message came. The data processing function                  Nearest With Enhanced Link Strength (Nearest-ELS):
selects the network-size closest peers according to some pre-          Similar to Nearest, users having similar preferences are also
defined metric. Two users are added to each other’s view                chosen here. In collaborative tagging sites, similarity of user
if they tag a minimum number of items in common with at                preferences depends on the tagging behaviors exhibited in
least one identical tag. Note that the top layer provides each         their profiles. Normally, the overlap of used tags in users’
peer with its personal network.                                        profiles implies their common interests in topics, while an
                                                                       overlap of tagged items reveals specific objects they prefer.
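
The three functions that characterize the gossip protocol of Section 3.1 (peer selection by oldest TimeStamp, data exchange, and data processing that keeps the closest peers) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the PeerSim implementation used in the paper: the two layers are collapsed into a single view, so no separate random peer sampling layer is modeled; each peer also advertises its own profile; the bootstrap is a simple ring; and all names (`Peer`, `run_cycle`, `similarity`) are hypothetical.

```python
import random

def similarity(p, q):
    """Number of items that both profiles tag with at least one identical tag."""
    return sum(1 for item, tags in p.items() if tags & q.get(item, set()))

class Peer:
    def __init__(self, pid, profile, network_size, gossip_size, threshold=0, rng=None):
        self.pid, self.profile = pid, profile      # profile: item -> set of tags
        self.network_size, self.gossip_size = network_size, gossip_size
        self.threshold = threshold
        self.rng = rng or random.Random(0)
        self.profiles = {}   # neighbor id -> neighbor profile
        self.age = {}        # neighbor id -> TimeStamp

    def bootstrap(self, other):
        self.profiles[other.pid] = other.profile
        self.age[other.pid] = 0

    def select_target(self):
        """Peer selection: pick the neighbor with the oldest TimeStamp."""
        target = max(self.age, key=self.age.get)
        for pid in self.age:
            self.age[pid] += 1
        self.age[target] = 0
        return target

    def outgoing(self):
        """Data exchange: up to gossip_size known profiles, plus our own (assumption)."""
        count = min(self.gossip_size, len(self.profiles))
        ids = self.rng.sample(sorted(self.profiles), count)
        message = {pid: self.profiles[pid] for pid in ids}
        message[self.pid] = self.profile
        return message

    def merge(self, message):
        """Data processing: keep the network_size closest qualifying peers."""
        for pid, profile in message.items():
            if pid != self.pid:
                self.profiles[pid] = profile
        score = {pid: similarity(self.profile, p) for pid, p in self.profiles.items()}
        keep = sorted((pid for pid in score if score[pid] > self.threshold),
                      key=lambda pid: (-score[pid], pid))[:self.network_size]
        self.age = {pid: self.age.get(pid, 0) for pid in keep}
        self.profiles = {pid: self.profiles[pid] for pid in keep}

def run_cycle(peers):
    for peer in peers.values():
        if not peer.age:     # no neighbor left; a real RPS layer would supply one
            continue
        target = peers[peer.select_target()]
        sent, reply = peer.outgoing(), target.outgoing()
        target.merge(sent)   # passive thread of the gossip destination
        peer.merge(reply)    # active thread of the initiator

# Peers 0-2 share a tagged item; peer 3 has nothing in common with them.
shared = {"x": {"t"}}
peers = {i: Peer(i, dict(shared) if i < 3 else {"y": {"z"}},
                 network_size=2, gossip_size=10) for i in range(4)}
for i in range(4):
    peers[i].bootstrap(peers[(i + 1) % 4])
for _ in range(2):
    run_cycle(peers)
```

After two cycles the three similar peers have found each other, while peer 3 ends up with an empty view. That outcome shows why the scheme in the paper keeps a separate random peer sampling layer underneath, to preserve connectivity.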
3.2 Inverted Lists and Query Processing                                As a tag can be used for several items and the same item can
   In the process of gossiping, the profiles of a user’s neigh-         receive different tags from different users, it is more accurate
bors are stored by the user. To enable efficient top-k process-         to compare tagging behaviors by the number of (item, tag)
ing, inverted lists for each (tag, user) pair are also computed        pairs in common rather than the number of items tagged by
and stored by the user itself. When a user generates a query,          both users with the same tags. The intuition is that the more
it first checks whether the inverted lists for the tags in the          common tags are used for an item, the more similar the

way in which the users understand and judge the world is.              with some bootstrap mechanism. At any later time,
For this reason, we re-define the strength of Link(u, v) as             each user has a number of neighbors thanks to gossiping.
|{(i, tj) | Tagged(u, i, tj) ∧ Tagged(v, i, tj)}|.                     The ratio of this number to the user’s network-size in a cen-
For each candidate neighbor v of user u, we re-compute                 tralized setting indicates how close its current personal net-
Strength_{Link(u,v)} and keep the n users having the strongest         work is to the target. We measure the speed of convergence
links with u as neighbors.                                             with the average ratio over all users, i.e.,
                                                                       $Speed = \frac{1}{|U|} \sum_{u \in U} \frac{|\text{Current Network}(u)|}{|\text{Network}(u)\ \text{in centralized setting}|}$.
4.   EVALUATION

4.1 Experimental Setup                                                 Speed reaches 1 when all users have found the personal networks
                                                                       they would have in the centralized implementation.
Data Set and Query Generation.                                            Our goal is to show that an efficient network-aware top-
   We have implemented our decentralized and personalized              k processing can be achieved in a peer-to-peer way, so we
top-k processing scheme in PeerSim, an open source simu-               take the top-k results in [3] as a reference. Note that all al-
lator for peer-to-peer protocols. All experiments presented            gorithms proposed in [3] provide the same top-k items and
here are executed on a cluster of servers including 10 Dell            only differ from each other in processing time and storage
PowerEdge 1855 machines equipped with Bi-pro Intel Xeon                space. We then try to obtain as similar top-k items as possi-
processors with 3.40GHz CPU and 4GB memory.                            ble for the same query with less time and space consumption
   The dataset used in our evaluation was crawled from del.ic-         in our setting.
io.us. It contains 13,521 distinct users who participate in               We use recall, a standard information retrieval metric,
31,833,700 tagging actions in the form of Tagged(u, i, t).             to evaluate the quality of top-k results. The recall Rk is the
4,741,631 distinct items and 620,340 distinct tags are con-            proportion of the total number of relevant items that are re-
cerned by these actions. We randomly picked 10,000 users               trieved in the top k:
from the dataset and built their profiles with the items and           $R_k = \frac{\text{Number of Retrieved Relevant Items}}{\text{Total Number of Relevant Items}}$
tags used by at least 10 distinct users. Note that this does not
affect the top-k results and processing time because only the
items ranked at the tail of the inverted lists are dropped. Af-        Recall quantifies the coverage of the result set and varies
ter removing uncommon tags, the interference of the noisy              between 0 and 1.
and often meaningless tags is also eliminated. The remain-                The space overhead for each user is estimated by the num-
ing dataset contained a total of 101,144 items, 31,899 tags            ber of entries in the inverted lists and the number of entries
and 9,536,635 tagging actions.                                         in the profiles of its neighbors. The query execution time for
   In our experiments, each user processed exactly one query.          a query is quantified by the number of sequential accesses of
We randomly picked an item from a user’s profile and com-               related inverted lists. Both metrics are highly dependent on
posed the query with the tags used by the user to annotate             the query and the user, so we are more interested in the rel-
this item. This is motivated by the reasonable observation             ative improvement compared to the centralized implementa-
that the tags in the query reflect the user’s understandings            tion of Exact and Global Upper-Bound for a given user and
of this item and their combination within the same query is            query.
meaningful.
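
The two quality metrics defined in this section, recall and convergence speed, amount to simple set arithmetic. The sketch below is illustrative only: the function names and the dictionary-of-sets representation are hypothetical, and the handling of empty reference sets (returning 0, or skipping users whose centralized network is empty) is an assumption that the text does not specify.

```python
def recall_at_k(retrieved, relevant):
    """Fraction of the relevant items that appear among the retrieved items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumption: recall is taken as 0 when nothing is relevant
    return len(set(retrieved) & relevant) / len(relevant)

def convergence_speed(current, centralized):
    """Average over users of |current network| / |network in centralized setting|.

    current, centralized: dict mapping each user to its set of neighbors.
    Users whose centralized network is empty are skipped (assumption).
    """
    ratios = [len(current.get(u, set())) / len(target)
              for u, target in centralized.items() if target]
    return sum(ratios) / len(ratios) if ratios else 0.0

# Hypothetical example: two of the four relevant items were retrieved.
r = recall_at_k(["a", "b", "c"], ["a", "b", "d", "e"])

# User "a" has found one of its two neighbors, user "b" both of them.
speed = convergence_speed(
    {"a": {"b"}, "b": {"a", "c"}},
    {"a": {"b", "c"}, "b": {"a", "c"}},
)
```

Here `r` evaluates to 0.5 and `speed` to 0.75, the mean of the per-user ratios 0.5 and 1.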
                                                                       4.2 Convergence of Personal Networks
Evaluation Metrics.                                                       We start with some observations on how the network con-
   We were mainly interested in the implicit semantic rela-            verges and how fast our algorithm enables users to find their
tions exhibited by users’ tagging behaviors. As in [3], there          own networks, which are considered as the basis of our per-
is a link between two users if they tag two items in common            sonalized top-k processing. The gossip-size is set to 20 and
with at least one same tag. The personal relationships are             50 respectively for both layers in two different experimental
formed as the network converges. If all users have their com-          settings. We can see from Figure 1 that exchanging more in-
plete networks, the same top-k results would be obtained for           formation provides higher convergence speed. After a small
a given query. We assume that there is neither arrival nor de-         number of cycles, each user obtains almost the whole personal
parture of users for ease of comparison with Exact and Global          network it would have in a centralized setting.
Upper-Bound in [3], which are considered ideal in terms of                We run top-10 processing in a centralized implementation
processing time and storage space respectively. Important              of Exact and take the 10 returned items for each query as
questions related to our decentralized setting are then how            relevant items and compare our top-10 results with them.
fast the personal network of each user can be established and          The following results are for the setting with gossip-size set
what the influence on the top-k results is.                             to 50. Figure 2 plots the evolution of R10 as the network
   A user begins building its personal network by first dis-            converges.
covering the IP address of any user currently in the system               At cycle 50, more than 77% of queries get exactly the

[Figure 1: Convergence speed. Speed (0 to 1) versus cycles (0 to 500), for gossip size 20 and gossip size 50.]
[Figure 2: Recall evolution (gossip-size=50). Percentage of queries (0 to 1) versus cycles (0 to 250), for recall = 0.8, recall = 0.9 and recall = 1.]
[Figure 3: R10 for the four strategies with varying network-sizes. % of top-10s having recall at least 0.7 (0 to 1) versus network-size (0 to 1800), for Random, Biased Random, Nearest, and Nearest With Enhanced Link Strength.]

                                                                                                    Figure 3 compares the performance of the four strategies
                                                                                                 we proposed in Section 3.3. The horizontal axis corresponds
                                                                                                 to the maximum number of neighbors a user can have. We
                                                                                                 consider a recall of no less than 0.7 as a satisfactory top-
                                                                                                 10 result. The vertical axis shows how many queries re-
                                                                                                 ceive such results. In this figure, only the queries sent by
                                                                                                 users having the corresponding number of neighbors are taken
                                                                                                 into account, because others will get the same results as in
                                                                                                 the centralized implementation using the users’ whole per-
                                                                                                 sonal networks. It is seen that the strategy Nearest With
                                                                                                 Enhanced Link Strength outperforms the others in terms of
                                                                                                 top-10 quality. When the network-size is relatively small,
                                                                                                 the difference among these strategies is more pronounced,
same results as in the centralized implementation and the                                        while the difference decreases when fewer limits are set on
rest of the queries retrieve at least 8 relevant items. R10 con-                                 network-size. This difference also conveys the fact that users
tinues improving as time passes and about 98.5% of queries                                       with similar tagging behaviors are more representative and
obtain all their relevant items at cycle 250.                                                    contribute more to the final top-k results.

4.3 Comparison of Optimization Strategies                                                        4.4 Space Overhead And Response Time
   Constructing personal networks in a peer-to-peer way can                                         The maximum space overhead for each user is attained
provide almost the same personalized top-k results as the                                        when its complete personal network is established. Figure 4
centralized implementation can do. However, when the num-                                        compares individual user’s space overhead with that of Ex-
ber of users and the items tagged by each user increase, it is                                   act and Global Upper-Bounds. As mentioned earlier, the in-
possible that users’ local disks become fully occupied, be-                                      dividual user’s space overhead depends on the total length of
cause too many profiles of neighbors and inverted lists need                                      its inverted lists Figure 4(a) and entries in all its neighbors’
to be stored. With our own dataset, average network-size                                         profiles Figure 4(b). Users are ranked in ascending order of
and maximum network-size over all users are listed in Ta-                                        their space overhead. As expected, no user needs to store as
ble 1. On average, each user should maintain about 18% of                                        much information as a centralized database even for Global
other users as neighbors to construct inverted lists for top-k                                   Upper-Bounds. Space is no longer a severe problem with
processing, which is infeasible in large-scale networks. It is                                   our decentralized storage.
therefore important to choose neighbors in a more intelligent                                       Figure 4(c) illustrates the response time at cycle 50. Short-
way.                                                                                             er inverted lists do not necessarily mean less execution time
                                                                                                 because the lack of information may decrease the scores of
                                                                                                 certain items and as a result, more entries in the lists may
                              Table 1: Network-size in different systems                         need to be checked to get the final top-10. However, the
                                                                                                 penalty in time consumption is not too expensive. On aver-
 Number of users Average network-size Max network-size
                                                                                                 age, only 3% more time is required to process these queries.
     1000                180                750                                                     With the optimization strategies to limit network-size, fewer
     5000                897               4165                                                  profiles and shorter inverted lists are stored by each user. If
    10000               1801               8350                                                  we use the Nearest-ELS strategy to choose neighbors and

                                                                                             5
1e+11                                                                                                                                                                                                                                      1e+06
                            1e+10                                                                                         1e+07




                                                                                                                                                                                                                                      Number of sequential accesses
                                                                                                                                                                                                                                                                      100000
                            1e+09                                                                                         1e+06




                                                                                              Length of stored profiles
 Length of inverted lists




                            1e+08
                                                                                                                          100000                                                                                                                                       10000
                            1e+07
                            1e+06                                                                                         10000                                                                                                                                         1000
                            100000                                                                                          1000
                                                                                                                                                                                                                                                                            100
                             10000
                                                                                                                            100
                             1000
                                                                            Exact                                                                                                                                                                                            10
                                                              Global Upper-Bound                                              10       Tagging actions stored in centralized database                                                                                                                                      Exact
                              100
                                                                   Individual User                                                        Entries of profiles stored by individual user                                                                                                                                50th cycle
                               10                                                                                             1                                                                                                                                               1
                                     0   1000 2000 3000 4000 5000 6000 7000 8000 9000 10000                                        0   1000 2000 3000 4000 5000 6000 7000 8000 9000 10000                                                                                         0     1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
                                                            Users                                                                                         Users                                                                                                                                            Users



                                     (a) Length of inverted lists                                                            (b) Length of stored user profiles                                                                    (c) Number of sequential accesses per query

                                                                    Figure 4: Space overhead and processing time at cycle 50 (log scale)



fix the network-size to 500, there is no difference for about                                                                                                                                                             1




                                                                                                                                                                              % of top-10s having recall at least 0.7
28% of the users whose network-size is 500 or smaller. For
                                                                                                                                                                                                                        0.8
the others, on average storage space for 34.5% of the entries
in inverted lists and 54.2% of the entries in neighbors’ pro-
                                                                                                                                                                                                                        0.6
files are saved. As the inverted lists are changed, the number
of sequential accesses to get the top-10 items is no longer                                                                                                                                                             0.4
the same as before. About 28% of queries need the same
time to retrieve the results while 35.9% consume less time                                                                                                                                                              0.2
                                                                                                                                                                                                                                                                                                  1000 Random Users
and 36.1% require more time. On average, there is no great                                                                                                                                                                                                                                        5000 Random Users
                                                                                                                                                                                                                                                                                                 10000 Random Users
difference as before in terms of processing time.                                                                                                                                                                        0
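The observation that shorter inverted lists can require more sequential accesses can be illustrated with a small sketch. It assumes a generic no-random-access (NRA) style scan from the top-k literature, sum aggregation across lists, and integer scores standing for the number of neighbors who used a tag on an item. The function name `nra_top_k` and the sample lists are hypothetical, and the sketch is not claimed to match the implementation evaluated in Figure 4(c).

```python
from typing import Dict, List, Tuple

# (item, score) pairs; the score is a hypothetical count of neighbors
# who tagged the item with the tag that owns the list.
Posting = Tuple[str, int]


def nra_top_k(lists: List[List[Posting]], k: int) -> Tuple[List[str], int]:
    """Return (top-k items, number of sequential accesses), for k >= 1.

    Each list must be sorted by descending score; item scores are summed
    across lists, and an item missing from a list contributes 0 there.
    """
    pos = [0] * len(lists)                # next unread position per list
    seen: Dict[str, Dict[int, int]] = {}  # item -> {list index: score read}
    accesses = 0
    while True:
        # One round of sequential access: read the next entry of every list.
        for i, lst in enumerate(lists):
            if pos[i] < len(lst):
                item, score = lst[pos[i]]
                seen.setdefault(item, {})[i] = score
                pos[i] += 1
                accesses += 1
        # Highest score an item not yet seen in list i can still have there:
        # the last score read, or 0 once the list is exhausted.
        frontier = [lst[pos[i] - 1][1] if 0 < pos[i] < len(lst) else 0
                    for i, lst in enumerate(lists)]
        lower = {it: sum(s.values()) for it, s in seen.items()}
        upper = {it: lower[it] + sum(f for i, f in enumerate(frontier) if i not in s)
                 for it, s in seen.items()}
        ranked = sorted(lower, key=lambda it: (-lower[it], it))
        top, rest = ranked[:k], ranked[k:]
        # Best score any item outside the current top-k could still reach,
        # including items that have not appeared in any list so far.
        challenger = max([upper[it] for it in rest] + [sum(frontier)])
        exhausted = all(pos[i] >= len(lst) for i, lst in enumerate(lists))
        if exhausted or (len(top) == k and lower[top[-1]] >= challenger):
            return top, accesses


# Hypothetical lists built from a complete personal network ...
full = [[("a", 5), ("b", 2), ("c", 1)], [("a", 4), ("c", 2), ("b", 1)]]
# ... and from a reduced network: fewer entries, but lower, closer scores.
reduced = [[("a", 2), ("b", 2)], [("c", 2), ("a", 1)]]

full_result = nra_top_k(full, k=1)
reduced_result = nra_top_k(reduced, k=1)
print(full_result, reduced_result)
```

In this toy input the reduced lists hold 4 entries instead of 6, yet the scan stops after 4 sequential accesses instead of 2, because the lower scores no longer separate the best item from its challengers early.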
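The quality criterion behind Figures 3 and 5 (the share of top-10 results whose recall is at least 0.7) can be made concrete with a short sketch. It assumes that recall is measured against the top-10 returned by the centralized implementation. The helper names `recall_at_k` and `satisfied_fraction` and the sample item identifiers are hypothetical.

```python
from typing import List, Sequence, Tuple


def recall_at_k(local_top: Sequence[str], reference_top: Sequence[str]) -> float:
    """Fraction of the reference (centralized) top-k found in the local top-k."""
    reference = set(reference_top)
    if not reference:
        return 1.0  # nothing to find, so nothing is missed
    return len(reference & set(local_top)) / len(reference)


def satisfied_fraction(results: List[Tuple[Sequence[str], Sequence[str]]],
                       threshold: float = 0.7) -> float:
    """Share of queries whose recall reaches the threshold."""
    if not results:
        return 0.0
    hits = sum(1 for local, ref in results if recall_at_k(local, ref) >= threshold)
    return hits / len(results)


reference = list("abcdefghij")        # hypothetical centralized top-10
queries = [
    (list("abcdefghij"), reference),  # all 10 items found
    (list("abcdefgxyz"), reference),  # 7 of 10 found
    (list("abcdefwxyz"), reference),  # 6 of 10 found
]
recalls = [recall_at_k(local, ref) for local, ref in queries]
print(recalls, satisfied_fraction(queries))
```

With the 0.7 threshold, two of the three sample queries count as satisfactory; raising the threshold to 0.8 would leave only the first.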
4.5 Scalability
   As shown in Table 1, the personal network-size grows linearly with the number of users when neighbors are chosen using the criteria of the centralized implementation. With our optimization, we find that the number of neighbors necessary to obtain top-10 results of the same quality remains stable even when the number of users continues increasing (see Figure 5). Clearly, the larger the network, the smaller the ratio of necessary network-size over number of users. Our optimization scales well, and similar phenomena can also be found in other optimization strategies. This suggests that there is no need to maintain a large network to obtain good personalized top-k results. Well-selected neighbors preserve the relative order of the items relevant to a query even though their overall scores decrease.

Figure 5: Recall of Nearest-ELS in systems of different size (x-axis: network-size; y-axis: % of top-10s having recall at least 0.7; series: 1000, 5000, and 10000 random users)

5. CONCLUSION
   We presented in this paper the first peer-to-peer approach to personalized top-k processing. We described the design and implementation of our approach and investigated its performance. The experimental results are encouraging and we believe that decentralization is the right way to provide personalized top-k processing in a scalable manner.
   We are exploring several complementary research directions, including system dynamics, in terms of churn as well as frequent changes in tagging behaviors. We are also considering various techniques to improve privacy as well as to reduce redundant information within a cluster.

6. REFERENCES
[1] R. Schenkel, T. Crecelius, M. Kacimi, T. Neumann, S. Michel, J.X. Parreira, and G. Weikum. Efficient top-k querying over social-tagging networks. In SIGIR ’08.
[2] S. Xu, S. Bao, B. Fei, Z. Su, and Y. Yu. Exploring folksonomy for personalized search. In SIGIR ’08.
[3] S. Amer-Yahia, M. Benedikt, Laks V. S. Lakshmanan, and J. Stoyanovich. Efficient network aware search in collaborative tagging sites. In VLDB ’08.
[4] S. Michel, P. Triantafillou, and G. Weikum. KLEE: a framework for distributed top-k query algorithms. In VLDB ’05.
[5] F.M. Cuenca-Acuna, C. Peery, R.P. Martin, and T.D. Nguyen. PlanetP: Using gossiping to build content addressable peer-to-peer information sharing communities. In HPDC ’03.
[6] M. Jelasity, A. Montresor, G.P. Jesi, and S. Voulgaris. The Peersim simulator. http://peersim.sf.net.
[7] R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31:2002, 2002.
[8] M. Jelasity, S. Voulgaris, R. Guerraoui, A.M. Kermarrec, and M. van Steen. Gossip-based peer sampling. ACM Trans. Comput. Syst., 25(3):8, 2007.

psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 

Toward Personalized Peer-to-Peer Top-k Processing

  • 1. Toward Personalized Peer-to-Peer Top-K Processing

Xiao Bai, Marin Bertier (INSA de Rennes, France); Rachid Guerraoui (EPFL, Switzerland); Anne-Marie Kermarrec (INRIA Rennes, France)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SNS'09, March 31, 2009, Nuremberg, Germany. Copyright 2009 ACM ISBN 978-1-60558-463-8 ...$5.00.

ABSTRACT

We present the first personalized peer-to-peer top-k search protocol for a collaborative tagging system. Each peer maintains relevant personalized information about its tagging behavior as well as that of its social neighbors, and uses this information to process its queries locally. Extensive experiments based on a real-world dataset crawled from del.icio.us show that very little storage at each peer suffices to get almost the same results as a hypothetical centralized solution with infinite storage.

1. INTRODUCTION

A central task in information retrieval consists in processing queries in order to obtain the top-k items with the highest scores under a monotonic function. This is particularly appealing in collaborative tagging systems, also called folksonomies, such as Flickr, del.icio.us and CiteULike, which have become highly popular for publishing and searching content. Yet this can turn into a nightmare because of the unstructured nature of tagging: there is usually no fixed ontology and users typically choose their tags in a free and possibly ambiguous manner.

One way to introduce some structure in such a scheme is to capture affinities between users with common tagging behaviors and then leverage these in the top-k processing. More relevant results for a given user could be achieved if the search were directed to a restricted network of users exhibiting similar tagging behaviors. For example, a computer scientist and a Keanu Reeves fan are probably not interested in the same results when Googling 'matrix'. User affinities can disambiguate these situations and alleviate the need for reformulating the same query through several steps with more refined keywords.

Very elegant centralized approaches have recently been proposed to capture personalization through social scoring models that compute the user-centric correlation among different tags in order to improve information retrieval quality [1, 2]. In Amer-Yahia et al. [3], the score of an item only depends on how users sharing similar preferences have tagged it. The reference retrieval solution consists in maintaining one inverted list per (tag, user) pair, but this is extremely space consuming. Alternative solutions that save storage space do exist, but they increase processing time.

We argue that scalability calls for fully decentralized solutions to process top-k queries. Decentralized solutions have indeed been proposed. In Michel et al. [4], precomputed inverted lists are distributed across peers and partial information is transmitted progressively in the network to approximate the top-k results. In Cuenca-acuna et al. [5], a gossip scheme is used to implement distributed content search and ranking. However, none of these decentralized solutions is 'personalized'.

This paper presents, to our knowledge, the first decentralized and personalized top-k processing scheme. In our scheme, each user maintains its own inverted lists by periodically discovering its network (see Footnote 1). Our approach alleviates the storage space problem while enabling highly efficient local query processing. We use a gossip-based network management protocol to identify users' personal networks in a peer-to-peer way; once these personal networks are established, users locally process their queries using a classical top-k algorithm, namely NRA (No Random Access). Interestingly, only a subset of (well selected) users suffices to provide the most relevant items while preserving their relative order for a given query. We explore various strategies to prevent overloading individual users by limiting the size of their personal networks. This results in little degradation in the quality of the top-k results.

Footnote 1: Note that this is different from a social network in the traditional sense, as neighbors in our network are those with similar tagging behavior and might be and remain unknown.

We have implemented our decentralized and personalized top-k processing scheme in PeerSim [6] with a dataset crawled from del.icio.us. Interestingly, if all qualified users are kept as neighbors, after 50 gossip cycles, we retrieve
  • 2. at least 8 relevant items out of 10 with a negligible storage overhead with respect to the idealized centralized solution of [3] (called Exact). Thanks to our optimization strategy, which simply keeps the closest neighbors, more than 35% of the storage space is further saved for each user.

The rest of this paper is organized as follows: Section 2 establishes the social model of our system and provides a brief review of relevant top-k processing and network-aware search techniques. Section 3 describes our peer-to-peer personalized top-k processing scheme and some optimization techniques. Section 4 presents our experimental setup and results. Section 5 concludes the paper.

2. SOCIAL NETWORK MODEL AND BACKGROUND

Social Network Model.

In a social network model, collaborative tagging sites are typically modeled as an information space $U \times I \times T$, where $U$ denotes the set of users. Each user has a profile that expresses his endorsement of visited items by tagging them. The profile is described in the form $i\{t_1, ..., t_n\}$, meaning that an item $i$ is tagged with tags $t_1, ..., t_n$. $I$ is the set of items that appear in the system and $T$ is the set of all related tags. $\mathit{Tagged}(u, i, t)$ captures the fact that user $u$ tags the item $i$ with the tag $t$.

We model the social network as a directed graph where a node corresponds to a user and an edge represents the relationship between two users. We use $\mathit{Link}(u, v)$ to indicate the existence of a directed edge from user $u$ to user $v$. For a user $u \in U$, $\mathit{Network}(u)$ is the set of $u$'s neighbors, i.e., $\mathit{Network}(u) = \{v \mid \mathit{Link}(u, v)\}$.

There are many possibilities for establishing $\mathit{Link}(u, v)$ according to user preferences. We do not assume any particular semantics; as such we follow the criterion in [3] that $\mathit{Link}(u, v)$ exists if and only if $v$ tags a sufficient number of items with the same tag as $u$, i.e.,

$$|\{\, i \mid \exists t,\ \mathit{Tagged}(u, i, t) \wedge \mathit{Tagged}(v, i, t) \,\}| > \mathit{threshold},$$

where $\mathit{threshold}$ is a predefined number. The number of commonly tagged items is also used to measure the strength of $\mathit{Link}(u, v)$, denoted by $\mathit{Strength}_{\mathit{Link}(u,v)}$.

Query And Scoring Model.

We consider a query $Q = \{u, t_1, ..., t_n\}$, issued by a user $u$ with a set of tags $t_1, ..., t_n$. Items returned for a query should be ranked according to their overall scores. Our score is user-specific and network-aware. The score of an item $i$ for user $u$ and tag $t_j$ is defined as the number of users in $u$'s network who tag $i$ with $t_j$, i.e., $\mathit{Score}_{t_j}(u, i) = |\mathit{Network}(u) \cap \{ v \mid \exists t_j,\ \mathit{Tagged}(v, i, t_j) \}|$. The overall score of an item $i$ for user $u$ is the sum of all the scores related to the query, i.e., $\mathit{Score}(u, i) = \sum_{t_j \in Q} \mathit{Score}_{t_j}(u, i)$. We use the same scoring functions as in [3] for ease of comparison. (Note that alternative functions can be used in our social network model.)

Top-k Processing.

Given a query, a top-k processing algorithm (e.g., [7]) aims at retrieving the k most relevant items. The goal is typically to minimize the time it takes to come up with these items as well as the amount of storage needed to perform the actual computation.

Information is organized in per-tag inverted lists. Each entry of an inverted list contains the identifier of an item and its score for that tag. Inverted lists are sorted in descending order of scores. Below we review the well-known NRA algorithm, as it is considered particularly effective: it underlies our work as well as that of [3].

For a query of n tags, NRA scans the n inverted lists in parallel. In order to do so, NRA maintains a heap of candidate items. Each candidate has a score lower-bound and a score upper-bound, which bound the overall score it can attain based on the information available at the moment. The score lower-bound takes the most pessimistic assumption: if an item has not been seen in some lists, then it does not exist in them. The score upper-bound takes the most optimistic assumption: the item's scores in those lists are equal to the scores of the last seen items in them. Once an item is seen, it is either added to or updated in the heap. The score upper-bounds of items already in the heap are also updated. Candidate items are sorted based on their score lower-bounds. Among those with equal lower-bounds, the one with the larger upper-bound is ranked ahead. The processing stops when none of the items outside the top-k has an upper-bound larger than the lower-bound of the kth item.

Network-Aware Search.

Amer-Yahia et al. [3] proposed several strategies to achieve efficient network-aware top-k processing. Again, for a network-aware search, the score of an item only depends on the querier's network. The most straightforward strategy, called Exact, is to build one inverted list for each (tag, user) pair. A query $Q$ generated by a user $u$ is processed on the $(t_j, u)$ lists, where $t_j \in Q$, using a traditional top-k processing algorithm such as NRA. However, storing these lists is prohibitive space-wise for a single server. Therefore, they explore another strategy, Global Upper-Bound, that maintains user-independent inverted lists whose entries only contain the maximum scores over all users. The exact score of an item is computed at query time. With Global Upper-Bound, considerable storage space is saved but much more time is required for query processing. To strike a balance between the two extremes, users are then clustered and only score upper-bounds over all clusters are maintained, in one inverted list per (tag, cluster) pair.

3. PEER-TO-PEER SOLUTION

In our decentralized scheme, each user maintains a set of neighbors that form its personal network. A query is

  • 3. processed with the information available in the querier's network to get personalized top-k results. If its network fails to provide any satisfactory results, a default search mechanism will be activated. Here we concentrate on the personalized processing. In our setting, a key problem is how to build a personal network for each user in a peer-to-peer way as quickly as possible so as to guarantee the quality of top-k results.

3.1 Personal Network Construction

In our scheme, the personal network is discovered and maintained through a two-layer gossip protocol. The bottom-layer gossip protocol, typically known as a random peer sampling protocol (RPS) [8], is in charge of keeping the overlay connected. Basically it provides each peer with c random peers. On top of the peer sampling protocol, a top-layer protocol is in charge of tracking the similarity between users' profiles and discovering new neighbors. Once the top protocol has stabilized, it is still possible to discover new related peers through the peer sampling protocol.

We assume that each peer executes the same protocol in the same manner every T time units, referred to as a cycle. A gossip protocol, bottom and top, consists of two threads: an active thread initiating communication with other peers, and a passive thread waiting for incoming messages. A gossip protocol is fully characterized by three functions: (i) the peer selection (choice of gossip target); (ii) the data exchange (which data are exchanged in a gossip interaction); and (iii) the data processing (which data are kept after the interaction). In this paper, the data exchanged are peers (IP addresses and profiles of peers).

At the beginning of each cycle, the peer selection function selects the neighbor with the oldest TimeStamp as the gossip destination. Once a peer is picked, its TimeStamp is set to zero and the other neighbors' TimeStamps increase by one. The variable TimeStamp ensures that all the neighbors have a comparable chance of participating in gossiping. The data exchange function selects gossip-size neighbors from the views of both layers and sends the profiles of the selected peers to the gossip destination. When the gossip destination receives this information, its passive thread is activated. It also selects gossip-size neighbors and sends their profiles to the peer from which the message came. The data processing function selects the network-size closest peers according to some predefined metric. Two users are added to each other's view if they tag a minimum number of items in common with at least one same tag. Note that the top layer provides each peer with its personal network.

3.2 Inverted Lists and Query Processing

In the process of gossiping, the profiles of a user's neighbors are stored by the user. To enable efficient top-k processing, inverted lists for each (tag, user) pair are also computed and stored by the user itself. When a user generates a query, it first checks whether the inverted lists for the tags in the query already exist in its cache and whether they should be updated to reflect new neighbors. Once all the related lists are up to date, the user processes its query locally with NRA to get the top-k results.

The inverted lists are constructed lazily, so that an inverted list is computed only when it is necessary for query processing. Unlike in the centralized Exact case, there is no need to pre-compute inverted lists for tags that may never be queried. Note that the computation of inverted lists is not more expensive than in Exact when changes occur in the network. In contrast, gossip-based protocols make it possible to capture such dynamics.

3.3 Personal Network Optimization

We adapt the criteria in [3] for choosing neighbors in our setting. (Note that any metric that captures the affinity of users will work as well with our gossip-based personal network construction protocol.) Since the common-item-based criteria for choosing neighbors may cause a scalability problem when the network grows and users add more content to their profiles, we propose several optimization strategies that reduce the size of a user's network without degrading the top-k quality.

Instead of maintaining all the users that meet the criteria to be a neighbor, only n of them are kept in a user's personal network. The effect of n will be evaluated in the next section. These strategies can also be used independently of the criteria in [3] to form personal networks.

Random. n random neighbors are chosen from the candidate users. The intuition is that a random sample is usually representative of the population from which it is drawn.

Biased Random. Like Random, only n users are randomly selected as neighbors. However, the probability that a user is kept as a neighbor is proportional to the strength of the link between the two users. So users with stronger links have better chances of being chosen as neighbors.

Nearest. Users are more likely to enjoy what is preferred by users with similar preferences, and the similarity of user preferences is measured by the strength of the link between them. This strategy therefore trusts the users possessing the strongest links with the gossip initiator and chooses them as neighbors.

Nearest With Enhanced Link Strength (Nearest-ELS). Similar to Nearest, users having similar preferences are also chosen here. In collaborative tagging sites, similarity of user preferences depends on the tagging behaviors exhibited in their profiles. Normally, the overlap of used tags in users' profiles implies their common interests in topics, while an overlap of tagged items reveals specific objects they prefer. As a tag can be used for several items and the same item can receive different tags from different users, it is more accurate to compare tagging behaviors by the number of (item, tag) pairs in common rather than the number of items tagged by both users with same tags. The intuition is that the more common tags are used for an item, the more similar the
  • 4. way in which the users understand and judge the world is. For this reason, we re-define the strength of $\mathit{Link}(u, v)$ as $|\{ (i, t_j) \mid \mathit{Tagged}(u, i, t_j) \wedge \mathit{Tagged}(v, i, t_j) \}|$.

For each candidate neighbor $v$ of user $u$, we re-compute $\mathit{Strength}_{\mathit{Link}(u,v)}$ and keep the n users having the strongest links with $u$ as neighbors.

4. EVALUATION

4.1 Experimental Setup

Data Set and Query Generation.

We have implemented our decentralized and personalized top-k processing scheme in PeerSim, an open source simulator for peer-to-peer protocols. All experiments presented here are executed on a cluster of servers including 10 Dell PowerEdge 1855 machines equipped with dual Intel Xeon 3.40 GHz processors and 4 GB of memory.

The dataset used in our evaluation was crawled from del.icio.us. It contains 13,521 distinct users who participate in 31,833,700 tagging actions in the form of $\mathit{Tagged}(u, i, t)$. These actions involve 4,741,631 distinct items and 620,340 distinct tags. We randomly picked 10,000 users from the dataset and built their profiles with the items and tags used by at least 10 distinct users. Note that this does not affect the top-k results and processing time because only the items ranked at the tail of the inverted lists are dropped. Removing uncommon tags also eliminates the interference of noisy and often meaningless tags. The remaining dataset contained a total of 101,144 items, 31,899 tags and 9,536,635 tagging actions.

In our experiments, each user processed exactly one query. We randomly picked an item from a user's profile and composed the query with the tags used by the user to annotate this item. This is motivated by the reasonable observation that the tags in the query reflect the user's understanding of this item and their combination within the same query is meaningful.

Evaluation Metrics.

We were mainly interested in the implicit semantic relations exhibited by users' tagging behaviors. As in [3], there is a link between two users if they tag two items in common with at least one same tag. The personal relationships are formed as the network converges. If all users have their complete networks, the same top-k results would be obtained for a given query. We assume that there is neither arrival nor departure of users, for ease of comparison with Exact and Global Upper-Bound in [3], which are considered ideal in terms of processing time and storage space respectively. Important questions related to our decentralized setting are then how fast the personal network of each user can be established and how this influences the top-k results.

A user begins building its personal network by first discovering the IP address of any user currently in the system with some bootstrap mechanism. At any following time, each user has a number of neighbors thanks to gossiping. The ratio of this number to the user's network-size in a centralized setting indicates how close its current personal network is to the target. We measure the speed of convergence with the average ratio over all users, i.e.,

$$\mathit{Speed} = \frac{1}{|U|} \sum_{u \in U} \frac{|\text{Current } \mathit{Network}(u)|}{|\mathit{Network}(u) \text{ in centralized setting}|}.$$

Speed attains 1 when all users have found the personal networks they would have in the centralized implementation.

Our goal is to show that efficient network-aware top-k processing can be achieved in a peer-to-peer way, so we take the top-k results in [3] as a reference. Note that all algorithms proposed in [3] provide the same top-k items and only differ from each other in processing time and storage space. We then try to obtain top-k items as similar as possible for the same query with less time and space consumption in our setting.

We use recall, the accepted metric in information retrieval, to evaluate the quality of top-k results. The recall $R_k$ is the proportion of the total number of relevant items that are retrieved in the top k:

$$R_k = \frac{\text{Number of Retrieved Relevant Items}}{\text{Total Number of Relevant Items}}$$

Recall quantifies the coverage of the result set and varies between 0 and 1.

The space overhead for each user is estimated by the number of entries in the inverted lists and the number of entries in the profiles of its neighbors. The execution time of a query is quantified by the number of sequential accesses to the related inverted lists. Both metrics are highly dependent on the query and the user, so we are more interested in the relative improvement compared to the centralized implementation of Exact and Global Upper-Bound for a given user and query.

4.2 Convergence of Personal Networks

We start with some observations on how the network converges and how fast our algorithm enables users to find their own networks, which are the basis of our personalized top-k processing. The gossip-size is set to 20 and 50 respectively for both layers in two different experimental settings. We can see from Figure 1 that exchanging more information provides higher convergence speed. After a small number of cycles, users obtain almost the whole personal network they would have in a centralized setting.

We run top-10 processing in a centralized implementation of Exact, take the 10 returned items for each query as the relevant items, and compare our top-10 results with them. The following results are from the setting with gossip-size set to 50. Figure 2 plots the evolution of $R_{10}$ as the network converges.

At cycle 50, more than 77% of queries get exactly the
  • 5. same results as in the centralized implementation and the rest of the queries retrieve at least 8 relevant items. $R_{10}$ continues improving as time passes and about 98.5% of queries obtain all their relevant items at cycle 250.

[Figure 1: Convergence speed. x-axis: Cycles; y-axis: Speed; series: Gossip size 20, Gossip size 50.]

[Figure 2: Recall evolution (gossip-size=50). x-axis: Cycles; y-axis: Percentage of queries; series: recall = 0.8, recall = 0.9, recall = 1.]

4.3 Comparison of Optimization Strategies

Constructing personal networks in a peer-to-peer way can provide almost the same personalized top-k results as the centralized implementation. However, when the number of users and the number of items tagged by each user increase, it is possible that users' local disks become fully occupied, because too many neighbor profiles and inverted lists need to be stored. With our own dataset, the average network-size and maximum network-size over all users are listed in Table 1. On average, each user should maintain about 18% of the other users as neighbors to construct inverted lists for top-k processing, which is infeasible in large-scale networks. It is therefore important to choose neighbors in a more intelligent way.

Table 1: Network-size in different systems

Number of users    Average network-size    Max network-size
1000               180                     750
5000               897                     4165
10000              1801                    8350

[Figure 3: R10 for the four strategies with varying network-sizes. x-axis: network-size; y-axis: % of top-10s having recall at least 0.7; series: Random, Biased Random, Nearest, Nearest With Enhanced Link Strength.]

Figure 3 compares the performance of the four strategies we proposed in Section 3.3. The horizontal axis corresponds to the maximum number of neighbors a user can have. We consider a recall of no less than 0.7 as a satisfactory top-10 result. The vertical axis shows how many queries receive such results. In this figure, only the queries sent by users having the corresponding number of neighbors are taken into account, because the others will get the same results as in the centralized implementation using the users' whole personal networks. The strategy Nearest With Enhanced Link Strength outperforms the others in terms of top-10 quality. When the network-size is relatively small, the difference among these strategies is more pronounced, while the difference decreases when fewer limits are set on network-size. This difference also conveys the fact that users with similar tagging behaviors are more representative and contribute more to the final top-k results.

4.4 Space Overhead And Response Time

The maximum space overhead for each user is attained when its complete personal network is established. Figure 4 compares an individual user's space overhead with that of Exact and Global Upper-Bound. As mentioned earlier, the individual user's space overhead depends on the total length of its inverted lists (Figure 4(a)) and the entries in all its neighbors' profiles (Figure 4(b)). Users are ranked in ascending order of their space overhead. As expected, no user needs to store as much information as a centralized database, even for Global Upper-Bound. Space is no longer a severe problem with our decentralized storage.

Figure 4(c) illustrates the response time at cycle 50. Shorter inverted lists do not necessarily mean less execution time, because the lack of information may decrease the scores of certain items and, as a result, more entries in the lists may need to be checked to get the final top-10. However, the penalty in time consumption is not too expensive. On average, only 3% more time is required to process these queries.

With the optimization strategies to limit network-size, fewer profiles and shorter inverted lists are stored by each user. If we use the Nearest-ELS strategy to choose neighbors and
  • 6. fix the network-size to 500, there is no difference for about 28% of the users whose network-size is 500 or smaller. For the others, on average, storage space for 34.5% of the entries in inverted lists and 54.2% of the entries in neighbors' profiles is saved. As the inverted lists are changed, the number of sequential accesses to get the top-10 items is no longer the same as before. About 28% of queries need the same time to retrieve the results while 35.9% consume less time and 36.1% require more time. On average, processing time does not differ greatly from before.

[Figure 4: Space overhead and processing time at cycle 50 (log scale). Panels: (a) Length of inverted lists (series: Exact, Global Upper-Bound, Individual User); (b) Length of stored user profiles (series: Tagging actions stored in centralized database, Entries of profiles stored by individual user); (c) Number of sequential accesses per query (series: Exact, 50th cycle). x-axis: Users.]

4.5 Scalability

As shown in Table 1, the personal network-size grows linearly as the number of users increases when the criteria of the centralized implementation are used to choose neighbors. With our optimization, we find that the number of neighbors necessary to obtain top-10 results of the same quality remains stable even when the number of users continues increasing (see Figure 5). Clearly, the larger the network, the smaller the ratio of necessary network-size to number of users. Our optimization scales well, and similar phenomena can also be found in the other optimization strategies. This suggests that there is no need to maintain a large network to obtain good personalized top-k results. Well selected neighbors conserve the relative order of relevant items to a query even though their overall scores decrease.

[Figure 5: Recall of Nearest-ELS in systems of different size. x-axis: network-size; y-axis: % of top-10s having recall at least 0.7; series: 1000 Random Users, 5000 Random Users, 10000 Random Users.]

5. CONCLUSION

We presented in this paper the first peer-to-peer approach to personalized top-k processing. We described the design and implementation of our approach and investigated its performance. The experimental results are encouraging and we believe that decentralization is the right way to provide personalized top-k processing in a scalable manner.

We are exploring several complementary research directions, including system dynamics, in terms of churn as well as frequent changes in tagging behaviors. We are also considering various techniques to improve privacy as well as to reduce redundant information within a cluster.

6. REFERENCES

[1] R. Schenkel, T. Crecelius, M. Kacimi, T. Neumann, S. Michel, J.X. Parreira, and G. Weikum. Efficient top-k querying over social-tagging networks. In SIGIR '08.
[2] S. Xu, S. Bao, B. Fei, Z. Su, and Y. Yu. Exploring folksonomy for personalized search. In SIGIR '08.
[3] S. Amer-Yahia, M. Benedikt, Laks V. S. Lakshmanan, and J. Stoyanovich. Efficient network aware search in collaborative tagging sites. In VLDB '08.
[4] S. Michel, P. Triantafillou, and G. Weikum. Klee: a framework for distributed top-k query algorithms. In VLDB '05.
[5] F.M. Cuenca-acuna, C. Peery, R.P. Martin, and T.D. Nguyen. Planetp: Using gossiping to build content addressable peer-to-peer information sharing communities. In HPDC '03.
[6] M. Jelasity, A. Montresor, G.P. Jesi, and S. Voulgaris. The Peersim simulator. http://peersim.sf.net.
[7] R. Fagin. Combining fuzzy information: an overview. SIGMOD Record, 31:2002, 2002.
[8] M. Jelasity, S. Voulgaris, R. Guerraoui, A.M. Kermarrec, and M. van Steen. Gossip-based peer sampling. ACM Trans. Comput. Syst., 25(3):8, 2007.
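
The link criterion and scoring functions of Section 2, together with the Nearest-ELS link strength of Section 3.3, can be illustrated with a minimal Python sketch. It is not the authors' PeerSim implementation: the profile layout (a dictionary mapping each user to a set of (item, tag) pairs) and all function names are assumptions made for illustration.

```python
# Hypothetical layout: profiles[user] is the set of (item, tag) pairs,
# i.e. the Tagged(u, i, t) facts for that user.

def link_strength(profiles, u, v):
    """Section 2 strength: number of items u and v tagged with a same tag."""
    return len({item for (item, _tag) in profiles[u] & profiles[v]})

def els_strength(profiles, u, v):
    """Nearest-ELS strength: number of (item, tag) pairs u and v share."""
    return len(profiles[u] & profiles[v])

def network(profiles, u, threshold):
    """Network(u): users whose link strength with u exceeds the threshold."""
    return {v for v in profiles
            if v != u and link_strength(profiles, u, v) > threshold}

def score(profiles, neighbors, query_tags, item):
    """Score(u, i): for each query tag, count neighbors who tagged item with it."""
    return sum(
        sum(1 for v in neighbors if (item, tag) in profiles[v])
        for tag in query_tags
    )

def nearest_els(profiles, u, candidates, n):
    """Keep the n candidates with the strongest ELS links (ties broken by name)."""
    ranked = sorted(candidates, key=lambda v: (-els_strength(profiles, u, v), v))
    return ranked[:n]

profiles = {
    "alice": {("a", "x"), ("a", "y"), ("b", "x")},
    "bob":   {("a", "x"), ("a", "y"), ("c", "z")},
    "carol": {("b", "x")},
}
neighbors = network(profiles, "alice", threshold=0)
print(sorted(neighbors))
print(score(profiles, neighbors, ["x", "y"], "a"))
print(nearest_els(profiles, "alice", neighbors, 1))
```

In this toy data, bob and carol each share exactly one item with alice, so the item-based strength cannot separate them, while the (item, tag) count used by Nearest-ELS ranks bob first.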
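
The NRA procedure reviewed in Section 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: each inverted list is a Python list of (item, score) pairs sorted by descending score, candidates are kept in a dictionary and re-ranked after every round instead of in a heap, and the stopping test also covers items not yet seen in any list, as in the standard formulation of NRA. The function name `nra` and the sample lists are hypothetical.

```python
def nra(lists, k):
    """Return k items found by sorted access only (No Random Access).

    lists: one inverted list per query tag, each a list of (item, score)
    pairs sorted by descending score.
    """
    n = len(lists)
    seen = {}              # item -> {list index: score seen in that list}
    last = [0.0] * n       # score of the last entry read in each list
    max_depth = max((len(lst) for lst in lists), default=0)

    def bounds():
        # Lower bound: unseen lists contribute 0 (pessimistic).
        lower = {i: sum(sc.values()) for i, sc in seen.items()}
        # Upper bound: unseen lists contribute their last seen score (optimistic).
        upper = {i: lower[i] + sum(last[j] for j in range(n) if j not in sc)
                 for i, sc in seen.items()}
        # Sort by lower bound, then by upper bound, both descending.
        ranked = sorted(seen, key=lambda i: (-lower[i], -upper[i]))
        return lower, upper, ranked

    for depth in range(max_depth):
        # One round of parallel sorted access: read one entry from every list.
        for j, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                seen.setdefault(item, {})[j] = s
                last[j] = s
            else:
                last[j] = 0.0  # exhausted list: nothing more can be added
        lower, upper, ranked = bounds()
        if len(ranked) >= k:
            kth_lower = lower[ranked[k - 1]]
            unseen_upper = sum(last)  # best case for an item never seen so far
            if unseen_upper <= kth_lower and all(
                upper[i] <= kth_lower for i in ranked[k:]
            ):
                return ranked[:k]
    return bounds()[2][:k]

# Hypothetical per-tag lists for a two-tag query.
list_x = [("a", 3), ("b", 2), ("c", 1)]
list_y = [("b", 3), ("a", 1), ("d", 1)]
print(nra([list_x, list_y], 2))
```

On these two lists the scan stops after two rounds, once no other item can still overtake the current top two, so the third entry of each list is never read.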