SlideShare une entreprise Scribd logo
1  sur  15
Télécharger pour lire hors ligne
SybilInfer: Detecting Sybil Nodes using Social Networks

                 George Danezis                                           Prateek Mittal
                Microsoft Research,                       University of Illinois at Urbana-Champaign,
                 Cambridge, UK.                                           Illinois, USA.
             gdane@microsoft.com                                    mittal2@uiuc.edu


                        Abstract                                contributors. On-line forums, starting with USENET [19],
                                                                to contemporary blogs or virtual worlds like Second
    SybilInfer is an algorithm for labelling nodes in a so-     Life [22] always have to deal with the issue of disruption
cial network as honest users or Sybils controlled by an         in the discussion threads, with persistent abusers coming
adversary. At the heart of SybilInfer lies a probabilis-        back under different names. All these are forms of Sybil
tic model of honest social networks, and an inference           attacks at the high-level application layers.
engine that returns potential regions of dishonest nodes.           There are two schools of Sybil defence mechanisms,
The Bayesian inference approach to Sybil detection comes        the centralised and decentralised ones. Centralised de-
with the advantage label has an assigned probability, indi-     fences assume the existence of an authority that is capable
cating its degree of certainty. We prove through analytical     of doing admission control for the network [3]. Its role is
results as well as experiments on simulated and real-world      to rate limit the introduction of ‘fake’ identities, to ensure
network topologies that, given standard constraints on the      that the fraction of corrupt nodes remains under a certain
adversary, SybilInfer is secure, in that it successfully dis-   threshold. The practicalities of running such an authority
tinguishes between honest and dishonest nodes and is not        are very system-specific and in general it would have to
susceptible to manipulation by the adversary. Further-          act as a Public Key Certification Authority as well as a
more, our results show that SybilInfer outperforms state        guardian of the moral standing of the nodes introduced –
of the art algorithms, both in being more widely applica-       a very difficult problem in practice. Such centralised so-
ble, as well as providing vastly more accurate results.         lutions are also at odds with the decentralisation guiding
                                                                principle of peer-to-peer systems.
                                                                    Decentralised approaches recognise the difficulty in
1   Introduction                                                having a single authority vouching for nodes, and dis-
                                                                tribute this task across all nodes of the system. The first
   The Peer-to-peer paradigm allows cooperating users to        such proposal is Advogato [14], which aimed to reduce
enjoy a service with little or no need for any centralised      abuse in on-line services, followed by a proposal to use
infrastructure. While communication [24] and storage [4]        introduction graphs of Distributed Hash Tables [6] to limit
systems employing this design philosophy have been pro-         the potential for routing disruption in those systems. The
posed, the lack of any centralised control over identities      state of the art SybilGuard [27] and SybiLimit [26] pro-
opens such systems to Sybil attacks [8]: a few malicious        pose the use of social networks to mitigate Sybil attacks.
nodes can simulate the presence of a very large number          As we will see, SybilGuard suffers from high false neg-
of nodes, to take over or disrupt key functions of the          atives, while SybilLimit makes unrealistic assumptions
distributed or peer-to-peer system. Any attempt to build        about the knowledge of number of honest nodes in the net-
fault-tolerance mechanisms is doomed since adversaries          work. In both cases the systems Sybil detection strategies
can control arbitrary fractions of the system nodes. This       are based on heuristics that are not optimal.
Sybil attack is further made practical through the use of           Our key contribution is to propose SybilInfer, a method
the existing large number of compromised networked ma-          for detecting Sybil nodes in a social network, that makes
chines (often called zombies) being part of bot-nets.           use of all information available to the defenders. The for-
   Similar problems plague Web 2.0 applications that            mal model underlying our approach casts the problem of
rely on collaborative tagging, filtering and editing, like       detecting Sybil nodes in the context of Bayesian Infer-
Wikipedia [31], or del.icio.us [28]. A single user              ence: given a set of stated relationships between nodes,
can register under different pseudonyms, and bypass any         the task is to label nodes as honest or dishonest. Based on
elections of velocity check mechanism that attempts to          some simple and generic assumptions, like the fact that
guarantee the quality of the data through the plurality of      social networks are fast mixing [18], we sample cuts in
the social graph according to the probability they divide        Limit, the current state of the art Sybil defence mecha-
it into honest and dishonest regions. These samples not          nisms. We also propose extensions that enable our solu-
only allow us to label nodes as honest or Sybil attackers,       tion to be implemented in decentralised settings.
but also to associate with each label output by our algo-           This paper is organised in the following fashion: in sec-
rithm a degree of certainty.                                     tion 2 we present an overview of our approach that can
    The proposed techniques can be applied in a wide vari-       be used as a road-map to the technical sections. In sec-
ety of settings where high reliability peer-to-peer systems,     tion 3 we present our security assumptions, threat model,
or Sybil-resistant collaborative mechanisms, are required        the probabilistic model and sampler underpinning Sybil-
even under attack:                                               Infer; a security evaluation follows in section 4, providing
  • Secure routing in Distributed Hash Tables motivated          analytical as well as experimental arguments supporting
    early research into this field, and our proposal can          the security of the method proposed along with a compar-
    be used instead of a centralised authority to limit the      ison with SybilInfer. Section 5 discusses the settings in
    fraction of dishonest nodes, that could disrupt rout-        which SybilInfer can be fruitfully used, followed by some
    ing [3].                                                     conclusions in section 6.

  • In public anonymous communication networks, such
                                                                 2 Overview
    as Tor [7], our techniques can be used to eliminate the
    potential for a single entity introducing a large num-
    ber of nodes, and de-anonymize users’ circuits. This            The SybilInfer algorithm takes as an input a social
    was so far a key open problem for securely scaling           graph G and a single known good node that is part of this
    such systems.                                                graph. The following conceptual steps are then applied to
                                                                 return the probability each node is honest or controlled by
  • Leader Election [2] and Byzantine agreement [12]             a Sybil attacker:
    mechanisms that were rendered useless by the Sybil             • A set of traces T are generated and stored by per-
    attack can again be of use, after Sybil nodes have               forming special random walks over the social graph
    been detected and eliminated from the social graph.              G. These are the only information retained about the
  • Finally, detecting Sybil accounts is a key step in pre-          graph for the rest of the SybilInfer algorithm, and
    venting false email accounts used for spam, or pre-              their generation is detailed in section 3.1.
    venting trolling and abuse of on-line communities              • A probabilistic model is then defined that describes
    and web-forums. Our techniques can be applied in                 the likelihood a trace T was generated by a specific
    all those settings, to fight spam and abuse [14].                 honest set of nodes within G, called X. This model
SybilInfer applies to settings where a peer-to-peer or dis-          is based on our assumptions that social networks are
tributed system is somehow based on or aware of social               fast mixing, while the transitions to dishonest regions
connections between users. Properties of natural social              are slow. Given the probabilistic model, the traces T
graphs are used to classify nodes are honest or Sybils.              and the set of honest nodes we are able to calculate
While this approach might not be applicable to very tradi-           Pr[T |X is honest]. The calculation of this quantity
tional peer-to-peer systems [24], it is more an more com-            is the subject of section 3.1 and section 3.2.
mon for designers to make distributed systems aware of             • Once the probabilistic model is defined, we use
the social environment of their users. Third party social            Bayes theorem to calculate for any set of nodes X
network services [29, 30], can also be used to extract so-           and the generated trace T , the probability that X con-
cial information to protect systems against sybil attacks            sists of honest nodes. Mathematically this quality is
using SybilInfer. Section 5 details deployment strategies            defined as Pr[X is honest|T ]. The use of Bayes the-
for SybilInfer and how it is applicable to current systems.          orem is described in section 3.1.
    We show analytically that SybilInfer is, from a theo-
retical perspective, very powerful: under ideal circum-            • Since it is not possible to simply enumerate all sub-
stances an adversary gains no advantage by introducing               sets of nodes X of the graph G, we instead sample
into the social network any additional Sybil nodes that are          from the distribution of honest node sets X, to only
not ‘naturally’ connected to the rest of the social structure.       get a few X0 , . . . , XN ∼ Pr[X is honest|T ]. Using
Even linking all dishonest nodes with each other (with-              those representative sample sets of honest nodes, we
out adding any Sybils) changes the characteristics of their          can calculate the probability any node in the system
social sub-graph, and can under some circumstances be                is honest or dishonest. Sampling and the approxi-
detected. We demonstrate the practical efficacy of our ap-            mation of the sought marginal probabilities are the
proach using both synthetic scale-free topologies as well            subject of section 3.3.
as real-world LiveJournal data. We show very significant            The key conceptual difficulty of our approach is the
security improvements over both SybilGuard and Sybil-            definition of the probabilistic model over the traces T , and
graph. Several authors have shown that real-life so-
                                                                     cial networks are indeed fast mixing [16, 18].

                                                                 3. A node knows the complete social network topology
                                                                    (G) : social network topologies are relatively static,
                                                                    and it is feasible to obtain a global snapshot of the
                                                                    network. Friendship relationships are already public
                         Attack
                         Edges
                                                                    data for popular social networks like Facebook [29]
                                                                    and Orkut [30]. This assumption can be relaxed to
                                                                    using sub-graphs, making SybilInfer applicable to
          Honest Nodes              Sybil Nodes                     decentralised settings.
                                                                    Assumptions (1) and (2) are identical to those made by
     Figure 1. Illustration of Honest nodes, Sybil              the SybilGuard and SybilInfer systems. Previously, the
     nodes and attack edges between them.                       authors of SybilGuard [27] observed that when the adver-
                                                                sary creates too many Sybil nodes, then the graph G has a
                                                                small cut: a set of edges that together have small station-
                                                                ary probability and whose removal disconnects the graph
its inversion using Bayes theorem to define a probability
                                                                into two large sub-graphs.
distribution over all possible honest sets of nodes X. This
distribution describes the likelihood that a specific set of         This intuition can be pushed much further to build su-
nodes is honest. The key technical challenge is making          perior Sybil defences. It has been shown [20] that the
use of this distribution to extract the sought probability      presence of a small cut in a graph results in slow mix-
each node is honest or dishonest, that we achieve via sam-      ing which means that fast mixing implies the absence of
pling. Section 3, describes in some detail how these issues     small cuts. Applied to social graphs this observation un-
are tackled by SybilInfer.                                      derpins the key intuition behind our Sybil defence mech-
                                                                anism: the mixing between honest nodes in the social net-
                                                                works is fast, while the mixing between honest nodes and
3     Model & Algorithm                                         dishonest nodes is slow. Thus, computing the set of honest
                                                                nodes in the graph is related to computing the bottleneck
    Let us denote the social network topology as a graph G      cut of the graph.
comprising vertices V , representing people and edges E,            One way of formalising the notion of a bottleneck cut,
representing trust relationships between people. We con-        is in terms of graph conductance (Φ) [11], defined as:
sider the friendship relationship to be an undirected edge
in the graph G. Such an edge indicates that two nodes trust                      Φ=         min       ΦX ,
                                                                                      X⊂V :π(X)<1/2
each other to not be part of a Sybil attack. Furthermore,
we denote the friendship relationship between an attacker
node and an honest node as an attack edge and the hon-          where ΦX is defined as:
est node connected to an attacker node as a naive node or
                                                                                      Σx∈X Σy∈X π(x)Pxy
                                                                                             /
misguided node. Different types of nodes are illustrated in                   ΦX =                      ,
figure 1. These relationships must be understood by users                                    π(X)
as having security implications, to restrict the promiscu-
ous behaviour often observed in current social networks,        and π(·) is the stationary distribution of the graph. Intu-
where users often flag strangers as their friends [23].          itively for any subset of vertices X ⊂ V its conductance
    We build our Sybil defence around the following as-         ΦX represents the probability of going from X to the rest
sumptions:                                                      of the graph, normalised by the probability weight of be-
                                                                ing on X. When the value is minimal the bottleneck cut
    1. At least one honest node in the network is known. In     in the graph is detected.
       practise, each node trying to detect Sybil nodes can
                                                                    Note that performing a brute force search for this bot-
       use itself as the apriori honest node. This assump-
                                                                tleneck cut is computationally infeasible (it is actually NP-
       tion is necessary to break symmetry: otherwise an
                                                                Hard, given its relationship to the sparse-cut problem).
       attacker could simply mirror the honest social struc-
                                                                Furthermore, finding the exact smallest cut is not as im-
       ture, and any detector would not be able to distin-
                                                                portant as being able to judge how likely any cut is, to be
       guish which of the two regions is the honest one.
                                                                dividing nodes into an honest and dishonest region. This
    2. Social networks are fast mixing: this means that a       probability is related to the deviation of the size of any
       random walk on the social graph converges quickly        cut from what we would expect in a natural, fast mixing,
       to a node following the stationary distribution of the   social network.
3.1    Inferring honest sets                                                                                 ProbXX

                                                                                           ProbXX                        ProbXX
   In this paper, we propose a framework based on
Bayesian inference to detect approximate cuts between                                                    X        X
honest and Sybil node regions in a social graph and use
those to infer the labels of each node. A key strength of
our approach is that it, not only associates labels to each                                                  ProbXX
node, but also finds the correct probability of error that                             (a) A schematic representation of tran-
could be used by peer-to-peer or distributed applications                             sition probabilities between honest X
                                                                                                      ¯
                                                                                      and dishonest X regions of the social
to select nodes.
                                                                                      network.
   The first step of SybilInfer is the generation of a set of
random walks on the social graph G. These walks are gen-
erated by performing a number s of random walks, start-                         ProbXX = 1 / |V| + Exx




                                                                  Probability
ing from each node in the graph (i.e. a total of s · |V |                                                                            1 / |V|

walks.) A special probability transition matrix is used,                                                        ProbXX = 1 / |V| - Exx
defined as follows:
                                                                                   Honest Nodes                       Sybil nodes
                   1 1
              min( di , dj )   if i → j is an edge in G
      Pij =                                               ,      (b) The model of the probability a short random walk of length
              0                otherwise                         O(log |V |) starting at an honest node ends on a particular honest or
                                                                 dishonest (Sybil) node. If no Sybils are present the network is fast
where di denotes the degree of vertex i in G.                    mixing and the probability converges to 1/|V |, otherwise it is biased
    This choice of transition probabilities ensures that the     towards landing on an honest node.
stationary distribution of the random walk is uniform over
all vertices |V |. The length of the random walks is l =                Figure 2. Illustrations of the SybilInfer mod-
O(log |V |), which is rather short, while the number of ran-            els
dom walks per node (denoted by s) is a tunable parameter
of the model. Only the starting vertex and the ending ver-
tex of each random walk are used by the algorithm, and
we denote this set of vertex-pairs, also called the traces,
by T .                                                             Note that since the set X is honest, we assume (by as-
    Now consider any cut X ⊂ V of nodes in the graph,          sumption (2)) fast mixing amongst its elements, meaning
such that the a-prior honest node is an element of X. We       that a short random walk reaches any element of the sub-
are interested in the probability that the vertices in set X   set X uniformly at random. On the other hand a random
are all honest nodes, given our set of traces T , i.e. P (X =  walk starting in X is less likely to end up in the dishonest
Honest|T ). Through the application of Bayes theorem we                 ¯
                                                               region X, since there should be an abnormally small cut
have an expression of this probability:                        between them. (This intuition is illustrated in figure 2(a).)
                                                               Therefore we approximate the probability that a short ran-
                         P (T |X = Honest) · P (X = Honest)
P (X = Honest|T ) =                                          , dom walk of length l = O(log |V |) starts in X and ends at
                                            Z
                                                               a particular node in X is given by ProbXX = Π + EXX ,
where Z is the normalization constant given by: Z =            where Π is the stationary distribution given by 1/|V |, for
ΣX⊂V P (T |X = Honest) · P (X = Honest). Note that             some EXX > 0. Similarly, we approximate the proba-
Z is difficult to compute because it involves the summa-        bility that a random walk starts in X and does not end
tion of an exponential number of terms in the size of |V |.    in X is given by ProbX X = Π − EX X . Notice that
                                                                                           ¯                ¯
Only being able to compute this probability up to a mul-       ProbXX > ProbX X , which captures the property that
                                                                                     ¯
tiplicative constant Z is not an impediment. The a-prior       there is fast mixing amongst honest nodes and slow mix-
distribution P (X = Honest) can be used to encode any          ing between honest and dishonest nodes. The approxi-
further knowledge about the honest nodes, or can simply        mate probabilities ProbXX and ProbX X and their likely
                                                                                                         ¯
be set to be uniform over all possible cuts.                   gap from the ideal 1/|V | are illustrated in figure 2(b).
    Bayes theorem has allowed us to reduce the initial
problem of inferring the set of good nodes X from the set          Let NXX be the number of traces in T starting in the
of traces T , to simply being able to assign a probability     honest set X and ending in same honest set X. Let NX X     ¯
to each set of traces T given a set of honest nodes X, i.e.    be the number of random walks that start at the honest set
calculating P (T |X = Honest). Our only remaining theo-                                            ¯
                                                               X and end in the dishonest set X. NX X and NXX are
                                                                                                         ¯ ¯        ¯
retical task is deriving this probability, given our model’s   defined similarly. Given the approximate probabilities of
assumptions.                                                   transitions from one set to the other and the counts of such
transitions we can ascribe a probability to the trace:            Approximating ProbXX through the traces T provides
                                                               us with a simple expression for the sought probability,
P (T |X = Honest) = (ProbXX )NXX · (ProbX X )NX X ·
                                          ¯
                                                ¯
                                                               based simply on the number of walks starting in one re-
                        (ProbX X )NX X · (ProbXX )NXX ,
                             ¯ ¯
                                   ¯ ¯
                                              ¯
                                                   ¯           gion and ending in another:

where ProbX X and ProbXX are the probabilities a walk                                           NXX            1 NXX
             ¯ ¯            ¯                                   P (T |X = Honest)     =    (                ·     )     ·
starting in the dishonest region ends in the dishonest or                                    NXX + NX X ¯     |X|
honest regions respectively.                                                                    NX X¯          1
                                                                                           (                · ¯ )NX X ·
                                                                                                                      ¯

   The model described by P (T |X = Honest) is an ap-                                        NX X + NXX
                                                                                                ¯             |X|
proximation to reality that is suitable enough to perform                                       NX X
                                                                                                  ¯ ¯          1
Sybil detection. It is of course unlikely that a random walk                               (                · ¯ )NX X ·
                                                                                                                    ¯ ¯

                                                                                             NX X + NXX
                                                                                              ¯ ¯     ¯       |X|
starting at an honest node will have a uniform probabil-                                        NXX
                                                                                                  ¯            1 NXX
ity to land on all honest or dishonest nodes respectively.                                 (                ·     ) ¯ ,
                                                                                             NXX + NX X
                                                                                              ¯       ¯ ¯     |X|
Yet this simple probabilistic model relating the starting
and ending nodes of traces is rich enough to capture the       This expression concludes the definition of our probabilis-
“probability gap” between landing on an honest or dis-         tic model, and contains only quantities that can be ex-
honest node, as illustrated in figure 2(b), and suitable for    tracted from either the known set of nodes X, or the set
Sybil detection.                                               of traces T that is assigned a probability. Note that we do
                                                               not assume any prior knowledge of the size of the honest
3.2   Approximating EXX                                                                                   ¯
                                                               set, and it is simply a variable |X| or |X| of the model.
                                                               Next, we shall describe how to sample from the distri-
    We have reduced the problem of calculating P (T |X =       bution P (X = Honest|T ) using the Metropolis-Hastings
Honest) to finding a suitable EXX , representing the ‘gap’      algorithm.
between the case when the full graph is fast mixing (for
EXX = 0) and when there is a distinctive Sybil attack (in      3.3 Sampling honest configurations
which case EXX >> 0.)
    One approach could be to try inferring EXX through a           At the heart of our Sybil detection techniques lies a
trivial modification of our analysis to co-estimate P (X =      model that assigns a probability to each sub-set of nodes
Honest, EXX |T ). Another possibility is to approximate        of being honest. This probability P (X = Honest|T ) can
EXX or ProbXX directly, by choosing the most likely            be calculated up to a constant multiplicative factor Z, that
candidate value for each configuration of honest nodes X        is not easily computable. Hence, instead of directly calcu-
considered. This can be done through the conductance or        lating this probability for any configuration of nodes X,
through sampling random walks on the social graph.             we will attempt instead to sample configurations Xi fol-
    Given the full graph G, ProbXX can be approximated         lowing this distribution. Those samples are used to esti-
               Σx∈X Σy∈X Π(x)P lxy           l                 mate the marginal probability that any specific node, or
as ProbXX =            Π(X)        , where Pxy is the prob-
                                                               collections of nodes, are honest or Sybil attackers.
ability that a random walk of length l starting at x ends in
                                                                   Our sampler for P (X = Honest|T ) is based on
y. This approximation is very closely related to the con-
                            ¯                                  the established Metropolis-Hastings algorithm [10] (MH),
ductance of the set X and X. Yet computing this measure
                                                               which is an instance of a Markov Chain Monte Carlo sam-
would require some effort.
                                                               pler. In a nutshell, the MH algorithm holds at any point
   Notice that ProbXX , as calculated above, can also
                                                               a sample X0 . Based on the X0 sample a new candidate
be approximated by performing many random walks of
                                                               sample X is proposed according to a probability distribu-
length l starting at X and computing the fraction of those
                                                               tion Q, with probability Q(X |X0 ). The new sample X
walks that end in X. Interestingly our traces already con-
                                                               is ‘accepted’ to replace X0 with probability α:
tain random walks over the graph of exactly the appropri-
ate length, and therefore we can reuse them to estimate a                            P (X |T ) · Q(X0 |X )
good ProbXX and related probabilities. Given the counts                   α = min(                          , 1)
                                                                                     P (X0 |T ) · Q(X |X0 )
NXX , NX X , NXX and NX X :
             ¯    ¯         ¯ ¯
                                                               otherwise the original sample X0 is retained. It can be
                          NXX        1                         shown that after multiple iterations this yields samples X
            ProbXX    =            ·
                        NXX + NX X |X|
                                 ¯                             according to the distribution P (X|T ) irrespective of the
                                                               way new candidate sets X are proposed or the initial state
and                                                            of the algorithm, i.e. a more likely state X will pop-out
                            NX X
                              ¯ ¯     1
            ProbX X =
                ¯ ¯                 · ¯ ,                      more frequently from the sampler, than less likely states.
                         NX X + NXX |X|
                          ¯ ¯     ¯
                                                                  A relatively naive strategy can be used to propose can-
and, ProbX X = 1−ProbXX and ProbXX = 1−ProbX X .
           ¯                    ¯          ¯ ¯                 didate states X given X0 for our problem. It relies on
simply considering sets of nodes X that are only different      4.1 Theoretical results
by a single member from X0 . Thus, with some probability
                           ¯
padd a random node x ∈ X0 is added to the set to form the          The security of our Sybil detection scheme hinges on
candidate X = X0 ∪ x. Alternatively, with probability           two important results. First, we show that we can detect
premove , a member of X0 is removed from the set of nodes,      whether a network is under Sybil attack, based on the so-
defining X = X0 ∩ x for x ∈ X0 . It is trivial to calculate      cial graph. Second, we show that we are able to detect
the probabilities Q(X |X0 ) and Q(X |X0 ) based on padd ,       Sybil attackers connected to the honest social graph, and
premove and using a uniformly at random choice over nodes       this for any attacker topology.
          ¯           ¯
in X0 , X0 , X and X when necessary.                               Our first result states that:
   A key issue when utilizing the MH algorithm is decid-             T HEOREM A. In the absence of any Sybil attack,
ing how many iterations are necessary to get independent             the distribution of P (X = Honest|T ), for a given
samples. Our rule of thumb is that |V | · log |V | steps are         size |X|, is close to uniform, and all cuts are equally
likely to guarantee convergence to the target distribution           likely (EXX 0).
P . After that number of steps the coupon collector’s the-      This result is based on our assumption that a random walk
orem states that each node in the graph would have been         over a social network is fast mixing meaning that, after
considered at least once by the sampler, and assigned to        log(|V |) steps, it visits nodes drawn from the stationary
the honest or dishonest set. In practice, given very large      distribution of the graph. In our case the random walk is
traces T , the number of nodes that are difficult to cate-       performed over a slightly modified version of the social
gorise is very small, and a non-naive sampler requires few      graph, where the transition probability attached to each
steps to produce good samples (after a certain burn in-         link ij is:
period that allows it to detect the most likely honest re-
                                                                                  1 1
gion.)                                                                       min( di , dj )   if i → j is an edge in G
                                                                    Pij =                                                ,
   Finally, given a set of N samples Xi ∼ P (X|T ) out-                      0                otherwise
put by the MH algorithm it is possible to calculate the
marginal probabilities any node is honest. This is key out-     which guarantees that the stationary distribution is uni-
                                                                                                      1
put of the SybilInfer algorithm: given a node i it is pos-      form over all nodes (i.e. Π = |V | ). Therefore we ex-
sible to associate a probability it is honest by calculating:   pect that in the absence of an adversary the short walks
                              I(i∈Xj )                          in T to end at a network node drawn at random amongst
Pr[i is honest] = j∈[0,N −1)N         , where I(i ∈ Xj ) is
an indicator variable taking value 1 if node i is in the hon-   all nodes |V |. In turn this means that the number of end
est sample Xj , and value zero otherwise. Enough samples        nodes in the set of traces T , that end in the honest set X is
can be extracted from the sampler to estimate this proba-       NXX = lim|TX |→∞ |X| · |TX |, where TX is the number
                                                                                       |V |
bility with an arbitrary degree of precision.                   of traces in T starting within the set |X|. Substituting this
    More sophisticated samplers would make use of a bet-        in the equations presented in section 2.1 and 2.2 we get:
ter strategy to propose candidate states X for each iter-                            NXX             1
ation. The choice of X can, for example, be biased to-                  ProbXX =                ·        ⇒                   (1)
                                                                                  NXX + NX X |X|
                                                                                             ¯
wards adding or removing nodes according to how often                                  NXX             1
random walks starting at the single honest node land on                 Π + EXX =                  ·       ⇒                 (2)
                                                                                   NXX + NX X |X|
                                                                                                ¯
them. We expect nodes that are reached often by random
walks starting in the honest region to be honest, and the                1           (|X|/|V |) · |TX | 1
                                                                             + EXX =                     ·     ⇒             (3)
opposite to be true for dishonest nodes. In all cases this              |V |              |TX |            |X|
bias is simply an optimization for the sampling to take                 EXX = 0                                              (4)
fewer iterations, and does not affect the correctness of the
results.                                                        As a result, by sufficiently increasing the number of ran-
                                                                dom walks T performed on the social graph, we can get
                                                                EXX arbitrarily close to zero. In turn this means that our
4   Security evaluation                                         distribution P (T |X = Honest) is uniform for given sizes
                                                                of |X|, given our uniform a-prior P (X = Honest|T ).
                                                                   In a nutshell by estimating EXX for any sample X re-
   In this section we discuss the security of SybilInfer        turned by the MH algorithm, and testing how close it is
when under Sybil attack. We show analytically that we           to zero we detect whether it corresponds to an attack (as
can detect when a social network suffers from a Sybil at-       we will see from theorem B) or a natural cut in the graph.
tack, and correctly label the Sybil nodes. Our assumptions      We can increase the precision of the detector arbitrarily by
and full proposal are then tested experimentally on syn-        increasing the number of walks T .
thetic as well as real-world data sets, indicating that the        Our second results relates to the behaviour of the sys-
theoretical guarantees hold.                                    tem under Sybil attack:
T HEOREM B. Connecting any additional Sybil nodes           4.2 Practical considerations
     to the social network, through a set of corrupt nodes,
     lowers the dishonest sub-graph conductance to the
                                                                     Models and assumptions are always an approximation
     honest region, leading to slow mixing, and hence we
                                                                 of the real world. As a result, careful evaluation is nec-
     expect EXX > 0.
                                                                 essary to ensure that the theorems are robust to deviations
                                     ¯                           from the ideal behaviour assumed so far.
First we define the dishonest set X0 comprising all dis-
honest nodes connected to honest nodes in the graph. The             The first practical issue concerns the fast mixing prop-
    ¯
set XS contains all dishonest nodes in the system, includ-       erties of social networks. There is a lot of evidence that
              ¯
ing nodes in X0 and the Sybil nodes attached to them. It         social networks exhibit this behaviour [18], and previous
                 ¯       ¯
must hold that |X0 | < |XS |, in case there is a Sybil attack.   proposals relating to Sybil defence use and validate the
Second we note that the probability of a transition be-          same assumption [27, 26]. SybilInfer makes an further
tween an honest node i ∈ X and a dishonest node j ∈ X       ¯    assumption, namely that the modified random walk over
cannot increase through Sybil attacks, since it is equal to      the social network, that yields a uniform distribution over
              1 1
Pij = min( di , dj ). At worst the corrupt node will in-         all nodes, is also fast mixing for real social networks. The
                                                                                              1 1
crease its degree by connecting Sybils which has only the        probability Pij = min( di , dj ), depends on the mutual
potential to decrease this probability. Therefore we have        degrees of the nodes i and j, and makes the transition to
that x∈XS y∈XS Pxy ≤ x∈X0 y∈X0 Pxy . Com-
           ¯       ¯                 ¯         ¯
                                                                 nodes of higher degree less likely. This effect has the po-
bining the two inequalities we get that:                         tential to slow down mixing times in the honest case, par-
                                                                 ticularly when there is a high variation in node degrees.
                                                                 This effect can be alleviated by removing random edges
            ¯    Pxy          ¯         ¯    Pxy                 from high degree nodes to guarantee that the ratio of max-
          y∈XS              x∈X0      y∈X0
             ¯S |    <              ¯              ⇔       (5)   imum and minimum node degree in the graph is bounded
            |X                     |X0 |
                                                                 (an approach also used by SybilLimit.)
              1                               1
       y∈XS |V | Pxy
          ¯                   ¯
                            x∈X0     y∈X0 |V | Pxy
                                          ¯                          The second consideration also relates to the fast mix-
         ¯           <              ¯ 1                ⇔   (6)
        |XS | 1
             |V |                  |X0 | |V |                    ing properties of networks. While in theory fast mixing
                                                                 networks should not exhibit any small cuts, or regions of
       ¯
     y∈XS π(x)Pxy         ¯
                       x∈X0   y∈X0 π(x)Pxy
                                 ¯
                   <                       ⇔ (7)                 abnormally low conductance, in practice they do. This
          ¯S )
        Π(X                  Π(X¯0)
                                                                 is especially true for regions with new users that have not
               ¯       ¯
            Φ(XS ) < Φ(X0 ).                 (8)                 had the chance to connect to many others, as well as social
                                                                 networks that only contain users with particular charac-
                                                                 teristics (like interest, locality, or administrative groups.)
This result signifies that independently of the topology of       Those regions yield, even in the honest case, sample cuts
the adversary region the conductance of a sub-graph con-         that have the potential to be mistaken as attacks. This ef-
taining Sybil nodes will be lower compared with the con-         fect forces us to consider a threshold EXX under which
ductance of the sub-graph of nodes that are simply com-          we consider cuts to be simply false positives. In turn this
promised and connected to the social network. Lower              makes the guarantees of schemes weaker in practice than
conductance in turn leads to slower mixing times be-             in theory, since the adversary can introduce Sybils into a
tween honest and dishonest regions [20] which means that         region undetected, as long as the set threshold EXX is not
EXX > 0, even for very few Sybils. This deviation is             exceeded.
subject to the sampling variation introduced by the trace            The threshold EXX is chosen to be α·EXXmax , where
T , but the error can be made arbitrarily small by sampling                        1      1
                                                                 EXXmax = |X| − |V | , and α is a constant between 0
more random walks in T .                                         and 1. Here α can be used to control the tradeoff between
                                                                 false positives and false negatives. A higher value of alpha
    These two results are very strong: they indicate that, in
                                                                 enables the adversary to insert a larger number of sybils
theory, a set of compromised nodes connecting to honest
                                                                 undetected but reduces the false positives. On the other
nodes in a social network, would get no advantage by con-
                                                                 hand, a smaller value of α reduces the number of Sybils
necting any additional Sybil nodes, since that would lead
                                                                 that can be introduced undetected but at the cost of higher
to their detection. Sampling regions of the graph with ab-
                                                                 number of false positives.
normally small conductance, through the use of the ran-
dom walks T , should lead to their discovery, which is the           Given these practical considerations, we can formulate
theoretical foundation of our technique. Furthermore we          a weaker security guarantee for SybilInfer:
established that techniques based on detecting abnormali-             T HEOREM C. Given a certain “natural” threshold
ties in the value of EXX are strategy proof, meaning that             value for EXX in an honest social network, a dis-
there is no attacker strategy (in terms of special adversary          honest region performing a Sybil attack will exceed
topology) to foil detection.                                          it after introducing a certain number of Sybil nodes.
Scale Free Topology: 1000 nodes, 100 malicious                                                                           Scale Free Topology: 1000 nodes, 100 malicious
                              250                                                                                                                      250
                                                                                     False Negatives, alpha=0.65                                                                                               False Negatives,alpha=0.65
                                                                                      False Positives, alpha=0.65                                                                                               False Positives,alpha=0.65
                                                                                      False Negatives, alpha=0.7                                                                                                False Negatives,alpha=0.7
                                                                                       False Positives, alpha=0.7                                                                                                False Positives,alpha=0.7
                                                                                     False Negatives, alpha=0.75                                                                                               False Negatives,alpha=0.75
                              200                                                     False Positives, alpha=0.75                                      200                                                      False Positives,alpha=0.75
       Number of identities




                                                                                                                                Number of identities
                              150                                                                                                                      150




                              100                                                                                                                      100




                              50                                                                                                                       50




                               0                                                                                                                        0
                                    0   100   200     300    400       500        600       700       800      900   1000                                    0   100   200     300    400       500        600       700       800      900   1000
                                                      Number of additional sybil identities (x)                                                                                Number of additional sybil identities (x)



                                        (a) Average degree compromised nodes                                                                                       (b) Low degree compromised nodes


   Figure 3. Synthetic Scale Free Topology: SybilInfer Evaluation as a function of additional Sybil
   identities (ψ) introduced by colluding entities. False negatives denote the total number of dishon-
   est identities accepted by SybilInfer while false positives denote the number of honest nodes that
   are misclassified.



This theorem is the result of Theorem B that demonstrates                                                                   given node is proportional to the degree of that node; i.e.:
that the conductance keeps decreasing as the number of                                                                      Pr[(v, i)] = didj , where di is the degree of node i. In
                                                                                                                                             j
Sybils attached to a dishonest region increases. This in                                                                    our simulations, we use m = 5, giving an average node
turn will slow down the mixing time between the hon-                                                                        degree of 10.
est and dishonest region, leading to an increasingly large                                                                      In such a scale free topology of 1000 nodes, we con-
EXX .                                                                                                                       sider a fraction f = 10% of the nodes to be compro-
   Intuitively, as the attack becomes larger, the cut be-                                                                   mised by a single adversary. The compromised nodes are
tween honest and dishonest nodes becomes increasingly                                                                       distributed uniformly at random in the topology. Com-
distinct, which makes Sybil detection easier. It is impor-                                                                  promised nodes introduce ψ additional Sybil nodes and
tant to note that as more Sybils are introduced into the                                                                    establish a scale free topology amongst themselves. We
dishonest region, the probability of the whole region be-                                                                   configure SybilInfer to use 20 samples for computing the
ing detected as an attack increases, not only the new Sybil                                                                 marginal probabilities, and label as honest the set of nodes
nodes. This provides strong disincentives to the adver-                                                                     whose marginal probability of being honest is greater than
sary from performing larger Sybil attacks, since even pre-                                                                  0.5. The experiment is repeated 100 times with different
viously undetected malicious nodes might be flagged as                                                                       scale free topologies.
Sybils.
                                                                                                                                Figure 3(a) illustrates the false positives and false neg-
                                                                                                                            atives classifications returned by SybilInfer, for varying
4.3   Experimental evaluation using synthetic                                                                               value of ψ, the number of additional Sybil nodes intro-
      data                                                                                                                  duced. We observe that when ψ < 100, α = 0.7 , then all
                                                                                                                            the malicious identities are classified as honest by Sybil-
   We first experimentally demonstrate the validity of                                                                       Infer. However, there is a threshold at ψ = 100, be-
Theorem C using synthetic topologies. Our experiments                                                                       yond which all of the Sybil identities, including the ini-
consist of building synthetic social network topologies, in-                                                                tially compromised entities are flagged as attackers. This
jecting a variable number of Sybil nodes, and applying                                                                      is because beyond this point, the EXX for the Sybil region
SybilInfer to establish how many of them are detected. A                                                                    exceeds the natural threshold leading to full detection, val-
key issue we explore is the number of introduced Sybil                                                                      idating Theorem C. The value ψ = 100 is clearly the op-
nodes under which Sybil attacks are not detected.                                                                           timal attack strategy, in which the attacker can introduce
   Social networks exhibit a scale-free (or power law)                                                                      the maximal number of Sybils without being detected. We
node degree topology [21]. Our network synthesis al-                                                                        also note that even in the worst case, the false positives are
gorithm replicates this structure through preferential at-                                                                  less than 5%. The false positive nodes have been misclas-
tachment, following the methodology of Nagaraja [18].                                                                       sified because these nodes are closer to the Sybil region;
We create m0 initial nodes connected in a clique, and                                                                       SybilInfer is thus incentive compatible in the sense that
then for each new node v, we create m new edges to                                                                          nodes which have mostly honest friends are likely not to
existing nodes, such that the probability of choosing any                                                                   be misclassified.
Figure 4 presents a plot of the maximum Sybil iden-
                                                                                  Scale Free Topology: 1000 nodes
                                              0.4
                                                                                                                                                   tities as a function of the compromised fraction of nodes
                                                              False Negatives

                                             0.35
                                                         Theoretical Prediction
                                                                           y=x                                                                     f . Note that our theoretical prediction (which is strategy-
                                                                                                                                                   independent) matches closely with the attacker strategy
                                              0.3
      Fraction of total malicious entities




                                                                                                                                                   of connecting Sybil nodes in a scale free topology. The
                                             0.25
                                                                                                                                                   adversary is able to introduce roughly about 1 additional
                                              0.2
                                                                                                                                                   Sybil identity per real entity. For instance, at f = 0.2,
                                             0.15                                                                                                  the total number of Sybil identities is 0.37. As we observe
                                              0.1                                                                                                  from the figure the ability of the adversary to just include
                                             0.05                                                                                                  about one additional Sybil identity per compromised node
                                               0
                                                                                                                                                   embedded in the social network remains constant, no mat-
                                                    0   0.02     0.04      0.06        0.08       0.1       0.12        0.14   0.16   0.18   0.2
                                                                                   Fraction of malicious entities (f)                              ter the fraction f of compromised nodes in the network.

   Figure 4. Scale Free Topology: fraction of                                                                                                      4.4   Experimental evaluation using real-world
   total malicious and Sybil identities as a func-                                                                                                       data
   tion of real malicious entities.

                                                                                                                                                      Next we validate the security guarantees provided by
    We can also see the effect of varying the threshold                                                                                            SybilInfer using a sampled LiveJournal topology. A vari-
EXX . As α is increased from 0.65 to 0.7, the ψ for the                                                                                            ant of snowball [9] sampling was used to collect the full
optimal attacker strategy increases from 70 to 100. This is                                                                                        data set data, comprising over 100,000 nodes.
because an increase in the threshold EXX allows the ad-                                                                                               To perform our experiments we chose a random node
versary to insert more Sybils undetected. The advantage                                                                                            and collect all nodes in its three hop neighbourhood. The
of increasing α lies in reducing the worst case false pos-                                                                                         resulting social network has about 50,000 nodes. We then
itives. We can see that by increasing α from 0.7 to 0.75,                                                                                          perform some pre-processing step on the sub-graph:
the worst case false positives can be reduced from 5% to
                                                                                                                                                     • Nodes with degree less than 3 are removed, to filter
2%.
                                                                                                                                                       out nodes that are too new to the social network, or
    Note that for the remainder of the paper, we shall use                                                                                             inactive.
α = 0.7.
    We also wish to show that the security of our scheme                                                                                             • If there is an edge between A → B, but no edge
depends primarily on the number of colluding malicious                                                                                                 between B → A, then A → B is removed (to only
nodes and not on the number of attack edges. To this end,                                                                                              keep the symmetric friendship relationships.)
we chose the compromised nodes to have the lowest num-
ber of attack edges (instead of choosing them uniformly                                                                                            We note that despite this pre-processing nodes all degrees
at random), and repeat the experiment. Figure 3(b) illus-                                                                                          can be found in the final dataset, since nodes with ini-
trates that the false positives and false negatives classifica-                                                                                     tial degree over 3 will have some edges removed reducing
tions returned by SybilInfer, where the average number of                                                                                          their degree to less than 3.
attack edges are 500. Note that these results are very simi-                                                                                           After pre-processing, the social sub-graph consists of
lar to the previous case illustrated in Figure 3(a), where the                                                                                     about 33,000 nodes. First, we ran SybilInfer on this topol-
number of attack edges is around 800. This analysis in-                                                                                            ogy without introducing any artificial attack. We found a
dicates that the security provided by SybilInfer primarily                                                                                         bottleneck cut diving off about 2, 000 Sybil nodes. It is
depends on the number of colluding entities. The implica-                                                                                          impossible to establish whether these nodes are false pos-
tion is that the compromise of high degree nodes does not                                                                                          itives (a rate of 6%) or a real-world Sybil attack present
yield any significant advantage to the adversary. As we                                                                                             in the LiveJournal network. Since there is no way to es-
shall see, this is in contrast to SybilGuard and SybilLimit,                                                                                       tablish ground truth, we do not label these nodes as either
which are extremely vulnerable when high degree nodes                                                                                              honest/dishonest.
are compromised.                                                                                                                                      Next, we consider a fraction f of the nodes to be com-
    Our next experiment establishes the number of Sybil                                                                                            promised and compute the optimal attacker strategy, as in
nodes that can be inserted into a network given different                                                                                          our experiments with synthetic data. Figure 5 shows the
fractions of compromised nodes. We vary the fraction of                                                                                            fraction of malicious identities accepted by SybilInfer as
compromised colluding nodes f , and for each value of f ,                                                                                          a function of fraction of malicious entitites in the system.
we compute the optimal number of additional Sybil iden-                                                                                            The trend is similar to our observations on synthetic scale
tities that the attackers can insert, as in the previous exper-                                                                                    free topologies. At f = 0.2, the fraction of Sybil identities
iment.                                                                                                                                             accepted by SybilInfer is approximately 0.32.
LiveJournal Topology: 31603 nodes                                                                                                   Scale Free Topology: 1000 nodes
                                               0.45                                                                                                                          0.6
                                                             Fraction of malicious identities                                                                                             SybilLimit, uniform
                                                                                         y=x                                                                                           SybilLimit, low degree
                                                0.4                                                                                                                                                  SybilInfer
                                                                                                                                                                                                           y=x
                                                                                                                                                                             0.5
                                               0.35




                                                                                                                                      Fraction of total malicious entities
      Fraction of total malicious identities




                                                0.3                                                                                                                          0.4


                                               0.25
                                                                                                                                                                             0.3
                                                0.2


                                               0.15                                                                                                                          0.2


                                                0.1
                                                                                                                                                                             0.1
                                               0.05


                                                 0                                                                                                                            0
                                                      0.05          0.1              0.15              0.2          0.25   0.3                                                     0.006        0.008             0.01        0.012          0.014       0.016   0.018   0.02
                                                                                   Fraction of colluding entities                                                                                                   Fraction of malicious entities (f)




   Figure 5. LiveJournal Topology: fraction of                                                                                        Figure 6. Comparison with related work
   total malicious identities as a function of
   real malicious entities


                                                                                                                                 f < 0.01. In this case SybilInfer allows for 5% total ma-
4.5                                 Comparison with SybilLimit and Sybil-                                                        licious entities, while SybilLimit allows for over 50%.
                                    Guard
                                                                                                                                     This is an illustration that SybilInfer performs an or-
   SybilGuard [27] and SybilLimit [26] are state of the art                                                                      der of magnitude better than the state of the art both in
decentralized protocols that defend against Sybil attacks.                                                                       terms of range of applicability and performance within
Similar to SybilInfer, both protocols exploit the fact that a                                                                    that range (SybilGuard’s performance is strictly worse
Sybil attack disrupts the fast mixing property of the social                                                                     than SybilLimit’s performance, and is not illustrated.) An
network topology, albeit in a heuristic fashion. A brief                                                                         obvious question is: “why does SybilInfer perform so
overview of the two systems can be found in the appendix,                                                                        much better than SybilGuard and SybilLimit?” It is partic-
and their full descriptions is given in [27, 26].                                                                                ularly pertinent since all three systems are making use of
   Figure 6 compares the performance of SybilInfer with                                                                          the same assumptions, and a similar intuition, that there
the performance of SybilLimit. First it is worth noting                                                                          should be a “gap” between the honest and Sybil regions
that the fraction of compromised nodes that SybilLimit                                                                           of a social network. The reason SybilLimit and Sybil-
tolerates is only a small fraction of the range within which                                                                     Guard provide weaker guarantees is that they interpret
SybilInfer provide its guarantees. SybilLimit tolerates up                                                                       these assumptions in a very sensitive way: they assume
to f = 0.02 compromised nodes when the degree of at-                                                                             that an overwhelming majority of random walks staring in
tackers is low (about degree 5 – green line), while we have                                                                      the honest region will stay in the honest region, and then
already shown the performance of SybilLimit for compro-                                                                          bound the number of walks originating from the Sybil re-
mised fractions up to f = 0.35 in figure 4. Within the                                                                            gion via the number of corrupt edges. As a result they
interval SybilLimit is applicable, our system systemati-                                                                         are very sensitive to the length of those walks, and can
cally outperforms: when very few compromised nodes are                                                                           only provide strong guarantees for a very small number of
present in the system (f = 0.01) our system only allows                                                                          corrupt edges. Furthermore the validation procedure re-
them to control less than 5% of the entities in the system,                                                                      lies on collisions between honest nodes, via the birthday
versus SybilLimit that allows them to control over 30%                                                                           paradox, which adds a further layer of inefficiency to esti-
of entities (rendering insecure byzantine fault tolerance                                                                        mating good from bad regions.
mechanisms that requires at least 2/3 honest nodes) At the
limit of SybilLimit’s applicability range when f = 0.02,                                                                             SybilInfer, on the other hand, interprets the disruption
our approach caps the number of dishonest entities in the                                                                        in fast-mixing between the honest and dishonest region
system to fewer than 8%, while SybilLimit allows about                                                                           simply as a faint bias in the last node of a short random
50% dishonest entities. (This large fraction renders leader                                                                      walk (as illustrated in figures 2(a) and 2(b).) In our ex-
election or other voting systems ineffective.)                                                                                   periments, as well as in theory, we observe a very large
   An important difference between SybilInfer and Sybil-                                                                         fraction of the T walks crossing between the honest and
Limit is that the former is not sensitive to the degree of                                                                       dishonest regions. Yet the faint difference in the probabil-
the attacker nodes. SybilLimit provides very weak guar-                                                                          ity of landing on nodes in the honest and dishonest regions
antees when high degree (e.g. degree 10 – red line) nodes                                                                        is present, and the sampler makes use of it to get good cuts
are compromised, and can protect the system only for                                                                             between the honest and dishonest nodes.
4.6   Computational and time complexity                         O(|V |) per iteration, making the overall complexity of the
                                                                scheme O(|V |2 log |V |).
    Two implementations of the SybilInfer sampler were
build in Python and C++, of about 1KLOC each. The               5 Deployment Strategies
Python implementation can handle 10K node networks,
while the C++ implementation has handled up to 30K                  So far we presented an overview of the SybilInfer algo-
node networks, returning results in seconds.                    rithm, as well as a theoretical and empirical evaluation of
    The implementation strategy for both samplers has fa-       its performance when it comes to detecting Sybil nodes.
voured a low time complexity over storage costs. The            The core of the algorithm outperforms SybilGuard and
critical loop performs O(|V | · log |V |) Metropolis Hast-      SybilLimit, and is applicable in settings beyond which
ings iterations per sample returned, each only requiring        the two systems provide no security guarantees whatso-
about O(log |V |) operations. Two copies of the full state      ever. Yet a key difference between the previous systems
are stored, as well as associated data that allows for fast     and SybilInfer is the latter’s reliance on the full friendship
updating of the state, which requires O(|V |) storage. The      graph to perform the random walks that drive the infer-
transcript of evidence traces T is also stored, as well as an   ence engine. In this section we discuss how this constraint
index over it, which dominates the storage required and         still allows SybilInfer to be used for important classes of
makes it order O(|V | · log |V |).                              applications, as well as how it can be relaxed to accom-
    There is a serious time complexity penalty associated       modate peer-to-peer systems with limited resources per
with implementing non-naive sampling strategies. Our            client.
Python implementation biases the candidate moves to-
wards nodes that are more or less likely to be part of the      5.1 Full social graph knowledge
honest set. Yet exactly sampling nodes from this known
probability distribution, naively may raise the cost of each        The most straightforward way of applying SybilInfer
iteration to be O(|V |). Depending on the differential be-      is using the full graph of a social network to infer which
tween the highest and lowest probabilities, faster sampling     nodes are honest and which nodes are Sybils, given a
techniques like rejection sampling [15] can be used to          known honest seed node. This is applicable to centralised
bring the cost down. The Python implementation uses a           on-line services, like free email hosting services, blogging
variant of Metropolis-Hastings to implement selection of        sites, and discussion forums that want to deter spammers.
candidate nodes for the next move, at a computation cost        Today those systems use a mixture of CAPTCHA [25]
of O(log |V |). The C++ implementation uses naive sam-          and network based intrusion detection to eliminate mass
pling from the honest or dishonest sets, and has a very low     attacks. SybilInfer could be used to either complement
cost per iteration of order O(1).                               those mechanisms and provide additional information as
    The Markov chain sampling techniques used consider          to which identities are suspicious, or replace those sy-
sequences of states that are very close to each other, dif-     stems when they are expensive and error prone. One of
fering at most by a single node. This enables a key             the first social network based Sybil defence systems, Ad-
optimization, where the counts NXX , NX X , NXX and
                                                ¯   ¯           vogato [14], worked in such a centralized fashion.
NX X are stored for each state and updated when the state
   ¯ ¯                                                              The need to know the social graph does not preclude
changes. This simple variant of self-adjusting computa-         the use of SybilInfer in distributed or even peer-to-peer sy-
tion [1], allows for very fast computations of the proba-       stems. Social networks, once mature, are generally stable
bilities associated with each state. Updating the counts,       and do not change much over time. Their rate of change
and associated information is an order O(log |V |) opera-       is by no means comparable to the churn of nodes in the
tion. The alternative, of recounting these quantities from      network, and as a result the structure of the social network
T would cost O(|V | log |V |) for every iteration, leading      could be stored and used to perform inference on multiple
to a total computational complexity for our algorithm of        nodes in a network, along with a mechanisms to share oc-
O((|V | log |V |)2 ). Hence implementing it is vital to get-    casional updates. The storage overhead for storing large
ting results fast.                                              social networks is surprisingly low: a large social network
    Finally our implementations use a zero-copy strategy        with 10 billion nodes (roughly the population of planet
for the state. Two states and all associated information are    earth) with each node having about 1000 friends, can be
maintained at any time, the current state and the candi-        stored in about 187Gb of disk space uncompressed. In
date state. Operations on the candidate state can be done       such settings it is likely that SybilInfer computation will
and undone in O(log |V |) per operation. Accepted moves         be the bottleneck, rather than storage of the graph, for the
can be committed to the current state at the same cost.         foreseeable future.
These operations can be used to maintain the two states             A key application of Sybil defences is to ensure that
synchronised for use by the Metropolis-Hastings sampler.        volunteer relays in anonymous communication networks
The naive strategy of re-writing the full state would cost      belong to independent entities, and are not controlled
by a single adversary. Practical systems like Mixmas-           tain community of interest (a social club, a committee, a
ter [17], Mixminion [5] and Tor [7] operate such a volun-       town, etc.) can be extracted to form a sub-graph. SybilIn-
teer based anonymity infrastructure, that are very suscep-      fer can then be applied on this partial view of the graph,
tible to Sybil attacks. Extending such an infrastructure to     to detect nodes that are less well integrated than others in
use SybilInfer is an easy task: each relay in the system        the group. There is an important distinction between us-
would have to indicate to the central directory services        ing SybilInfer in this mode or using it with the full graph:
which other nodes it considers honest and non-colluding.        while the results using the full graph output an “absolute”
The graph of nodes and mutual trust relations can be used       probability for each node being a Sybil, applying SybilIn-
to run SybilInfer centrally by the directory service, or by     fer to a partial view of the network only provides a “rel-
each individual node that wishes to use the anonymizing         ative” probability the node is honest in that context. It is
service. Currently, the largest of those services, the Tor      likely that nodes are tagged as Sybils, because they do not
network has about 2000 nodes, which is well within the          have many contacts within the select group, which given
computation capabilities of our implementations.                the full graph would be classified as honest. Before ap-
                                                                plying SybilInfer in this mode it is important to assess, at
5.2   Partial social graph knowledge                            least, whether the subgroup is fast-mixing or not.

    SybilInfer can be used to detect and prevent Sybil at-      5.3   Using SybilInfer output optimally
tacks, using only a partial view of the social graph. In the
context of a distributed or peer-to-peer system each user          Unlike previous systems the output of the SybilInfer
discovers only a fixed diameter sub-graph around them.           algorithm is a probabilistic statement, or even more gen-
For example a user may choose to retrieve and store all         erally, a set of samples that allows probabilistic statements
other users two or three hops away in the social network        to be estimated. So far in the work we discussed how to
graph, or discover a certain threshold of nodes in a breadth    make inferences about the marginal probability specific
first manner. SybilInfer is then applied on the extracted        nodes are honest of dishonest by using the returned sam-
sub-graph to detect potential Sybil regions. This allows        ples to compute Pr[i is honest] for all nodes i. In our ex-
the user to prune its social neighbourhood from any Sybil       periments we applied a 0.5 threshold to the probability to
attacks, and is sufficient for selecting a set of honest nodes   classify nodes as honest or dishonest. This is a rather lim-
when sampling from the full network is not required. Dis-       ited use of the richer output that SybilInfer provides.
tributed backup and storage, and all friend and friend-            Distributed system applications can, instead of using
of-friend based sharing protocols can benefit from such          marginal probabilities of individual nodes, estimate the
protection. The storage and communication cost of this          probability that the particular security guarantees they re-
scheme is constant and relative to the number of nodes in       quire hold. High latency anonymous communication sy-
the chosen neighbourhood.                                       stems, for example, require a set of different nodes such
    In cases where nodes can afford to know a larger frac-      that with high probability at least one of them is honest.
tion of the social graph, they could choose to discover         Path selection is also subject to other constraints (like la-
O(c· |V |) nodes in their neighbourhood, for some small         tency.) In this case the samples returned by SybilInfer
integer c. This increases the chances two arbitrary nodes       can be used to calculate exactly the sought probability, i.e.
have to know a common node, that can perform the Sybil-         the probability a single node in the chosen path is hon-
Infer protocol and act as an introduction point for the         est. Onion routing based system, on the other hand are
nodes. In this protocol Alice and Bob want to ensure the        secure as long as the first and last hop of the relayed com-
other party is not a Sybil. They find a node C that is in the    munication is honest. As before, the samples returned by
c · |V neighbourhood of both of them, and each make             SybilInfer can be used to choose a path that has a high
sure that with high probability it is honest. They then con-    probability to exhibit this characteristic.
tact node C that attests to both of them, given its local          Other distributed applications, like peer-to-peer storage
run of the SybilInfer engine, that they are not Sybil nodes     and retrieval have similar needs, but also tunable param-
(with C as the honest seed.) This protocol introduces a         eters that depend on the probability of a node being dis-
single layer of transitive trust, and therefore it is neces-    honest. Storage systems like OceanStore, use Rabin’s in-
sary for Alice and Bob to be quite certain that C is indeed     formation dispersion algorithm to divide files into chunks
honest. Its storage and communication cost is O( |V |),         stored and retrieved to reconstruct a file. The degree of
which is the same order of magnitude as SybilLimit and          redundancy required crucially depends on the probability
SybilGuard. Modifying this simple minded protocol into          nodes are compromised. Such algorithms can use SybilIn-
a fully fledged one-hop distributed hash table [13] is an        fer to foil Sybil attacks, and calculate the probability the
interesting challenge for future work.                          set of nodes to be used to store particular files contains
    SybilInfer can also be applied to specific on-line com-      certain fractions of honest nodes. This probability can in
munities. In such cases a set of nodes belonging to a cer-      turn inform the choice of parameters to maximise the sur-
vivability of the files.                                        approaches that treat user statements beyond just black
   Finally a note of warning should accompany any Sybil        and white and make explicit use of probabilistic reasoning
prevention scheme: it is not the goal of SybilInfer (or any    and statements as their outputs will become increasingly
other such scheme) to ensure that all adversary nodes are      important in building safe systems.
filtered out of the network. The job of SybilInfer is to
ensure that a certain fraction of existing adversary nodes     Acknowledgements. This work was performed while
cannot significantly increase its control of the system by      Prateek Mittal was an intern at Microsoft Research, Cam-
introducing ‘fake’ Sybil identities. As it is illustrated by   bridge, UK. The authors would like to thank Carmela
the examples on anonymous communications and stor-             Troncoso, Emilia K¨ sper, Chris Lesniewski-Laas, Nikita
                                                                                   a
age, system specific mechanisms are still crucial to ensure     Borisov and Steven Murdoch for their helpful comments
that a minority of adversary entities cannot compromise        on the research and draft manuscript. Our shepherd, Ta-
any security properties. SybilInfer can only ensure that       dayoshi Kohno, and ISOC NDSS 2009 reviewers pro-
this minority remains a minority and cannot artificially in-    vided valuable feedback to improve this work. Barry Law-
crease its share of the network.                               son was very helpful with the technical preparation of the
   Sybil defence schemes are also bound to contain false-      final manuscript.
positives, namely honest nodes labeled as Sybils. For this
reason other mechanisms need to be in place to ensure that
                                                               References
those users can seek a remedy to the automatic classifica-
tion they suffered from the system, potentially by making
                                                                [1] U. A. Acar. Self-adjusting computation. PhD thesis, Pitts-
some additional effort. Proofs-of-work, social introduc-
                                                                    burgh, PA, USA, 2005. Co-Chair-Guy Blelloch and Co-
tion services, or even payment targeting those users could          Chair-Robert Harper.
be a way of ensuring SybilInfer is not turned into an auto-     [2] B. Awerbuch. Optimal distributed algorithms for mini-
mated social exclusion mechanism.                                   mum weight spanning tree, counting, leader election and
                                                                    related problems (detailed summary). In STOC, pages
6   Conclusion                                                      230–240. ACM, 1987.
                                                                [3] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S.
                                                                    Wallach. Secure routing for structured peer-to-peer over-
    We presented SybilInfer, an algorithm aimed at detect-          lay networks. In OSDI ’02: Proceedings of the 5th sym-
ing Sybil attacks against peer-to-peer networks or open             posium on Operating systems design and implementation,
services, and label which nodes are honest and which are            pages 299–314, New York, NY, USA, 2002. ACM.
dishonest. Its applicability and performance in this task is    [4] F. Dabek. A cooperative file system. Master’s thesis, MIT,
an order of magnitude better than previous systems mak-             September 2001.
                                                                [5] G. Danezis, R. Dingledine, and N. Mathewson. Mixmin-
ing similar assumptions, like SybilGuard and SybilLimit,
                                                                    ion: Design of a Type III Anonymous Remailer Protocol.
even though it requires nodes to know a substantial part of
                                                                    In Proceedings of the 2003 IEEE Symposium on Security
the social structure within which honest nodes are embed-           and Privacy, pages 2–15, May 2003.
ded. SybilInfer illustrates how robust Sybil defences can       [6] G. Danezis, C. Lesniewski-Laas, M. F. Kaashoek, and
be bootstrapped from distributed trust judgements, instead          R. Anderson. Sybil-resistant dht routing. In ESORICS
of a centralised identity scheme. This is a key enabler for         2005: Proceedings of the European Symp. Research in
secure peer-to-peer architectures as well as collaborative          Computer Security, pages 305–318, 2005.
web 2.0 applications.                                           [7] R. Dingledine, N. Mathewson, and P. Syverson. Tor: the
    SybilInfer is also significant due to the use of machine         second-generation onion router. In SSYM’04: Proceed-
learning techniques and their careful application to a secu-        ings of the 13th conference on USENIX Security Sympo-
                                                                    sium, pages 21–21, Berkeley, CA, USA, 2004. USENIX
rity problem. Cross disciplinary designs are a challenge,
                                                                    Association.
and applying probabilistic techniques to system defence         [8] J. R. Douceur. The sybil attack. In IPTPS ’01: Re-
should not be at the expense of strength of protection, and         vised Papers from the First International Workshop on
strategy-proof designs. Our ability to demonstrate that the         Peer-to-Peer Systems, pages 251–260, London, UK, 2002.
underlying mechanisms behind SybilInfer is not suscepti-            Springer-Verlag.
ble to gaming by an adversary arranging its Sybil nodes in      [9] L. Goodman. Snowball sampling. Annals of Mathematical
a particular topology is, in this aspect, a very import part        Statistics, 32:148–170.
of the SybilInfer security design.                             [10] W. K. Hastings. Monte carlo sampling methods us-
    Yet machine learning techniques that take explicitly            ing markov chains and their applications. Biometrika,
                                                                    57(1):97–109, April 1970.
into account noise and incomplete information, as the one      [11] R. Kannan, S. Vempala, and A. Vetta. On clusterings:
contained in the social graphs, are key to building secu-           Good, bad and spectral. J. ACM, 51(3):497–515, 2004.
rity systems that degrade well when theoretical guarantees     [12] L. Lamport, R. Shostak, and M. Pease. The byzantine
are not exactly matching a messy reality. As security in-           generals problem. ACM Trans. Program. Lang. Syst.,
creasingly becomes a “people” problem, it is likely that            4(3):382–401, 1982.
06
06

Contenu connexe

Tendances

Formulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applyingFormulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applyingIAEME Publication
 
A wireless intrusion detection system and a new attack model (synopsis)
A wireless intrusion detection system and a new attack model (synopsis)A wireless intrusion detection system and a new attack model (synopsis)
A wireless intrusion detection system and a new attack model (synopsis)Mumbai Academisc
 
NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...
NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...
NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...Migrant Systems
 
DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...
DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...
DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...IJNSA Journal
 
DESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNS
DESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNSDESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNS
DESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNSIJNSA Journal
 
NOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORK
NOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORKNOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORK
NOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORKIJNSA Journal
 

Tendances (8)

H1803025360
H1803025360H1803025360
H1803025360
 
Formulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applyingFormulation of modularity factor for community detection applying
Formulation of modularity factor for community detection applying
 
A wireless intrusion detection system and a new attack model (synopsis)
A wireless intrusion detection system and a new attack model (synopsis)A wireless intrusion detection system and a new attack model (synopsis)
A wireless intrusion detection system and a new attack model (synopsis)
 
NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...
NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...
NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Net...
 
DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...
DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...
DYNAMIC NEURAL NETWORKS IN THE DETECTION OF DISTRIBUTED ATTACKS IN MOBILE AD-...
 
DESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNS
DESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNSDESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNS
DESIGN AND IMPLEMENTATION OF A TRUST-AWARE ROUTING PROTOCOL FOR LARGE WSNS
 
NOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORK
NOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORKNOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORK
NOVEL HYBRID INTRUSION DETECTION SYSTEM FOR CLUSTERED WIRELESS SENSOR NETWORK
 
B018140813
B018140813B018140813
B018140813
 

Similaire à 06

Sybil limit a near optimal social network defense against sybil attacks-2014
Sybil limit a near optimal social network defense against sybil attacks-2014Sybil limit a near optimal social network defense against sybil attacks-2014
Sybil limit a near optimal social network defense against sybil attacks-2014Vignesh Biggboss
 
An improvement to trust based cross layer security protocol against sybil att...
An improvement to trust based cross layer security protocol against sybil att...An improvement to trust based cross layer security protocol against sybil att...
An improvement to trust based cross layer security protocol against sybil att...Alexander Decker
 
S.A.kalaiselvan blocking misbehaving users in anonymizing
S.A.kalaiselvan blocking misbehaving users in anonymizingS.A.kalaiselvan blocking misbehaving users in anonymizing
S.A.kalaiselvan blocking misbehaving users in anonymizingkalaiselvanresearch
 
Sybil Attack Analysis and Detection Techniques in MANET
Sybil Attack Analysis and Detection Techniques in MANETSybil Attack Analysis and Detection Techniques in MANET
Sybil Attack Analysis and Detection Techniques in MANETrahulmonikasharma
 
NYMBLE: Servers Overcrowding Disobedient Users in Anonymizing Networks
NYMBLE: Servers Overcrowding Disobedient Users in Anonymizing NetworksNYMBLE: Servers Overcrowding Disobedient Users in Anonymizing Networks
NYMBLE: Servers Overcrowding Disobedient Users in Anonymizing Networksrahulmonikasharma
 
Nymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksNymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksMuthu Samy
 
Nymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksNymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksMuthu Samy
 
Nymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksNymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksMuthu Samy
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
An Agent Future For Network Control
An Agent Future For Network ControlAn Agent Future For Network Control
An Agent Future For Network ControlSara Alvarez
 
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial InterpolationSpace-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolationnexgentechnology
 
Space efficient verifiable secret sharing
Space efficient verifiable secret sharingSpace efficient verifiable secret sharing
Space efficient verifiable secret sharingnexgentech15
 
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial InterpolationSpace-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolationnexgentechnology
 
SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATION
 SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATION SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATION
SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATIONNexgen Technology
 
IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...
IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...
IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...IRJET Journal
 
A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...
A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...
A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...Editor IJCATR
 
COPYRIGHTThis thesis is copyright materials protected under the .docx
COPYRIGHTThis thesis is copyright materials protected under the .docxCOPYRIGHTThis thesis is copyright materials protected under the .docx
COPYRIGHTThis thesis is copyright materials protected under the .docxvoversbyobersby
 

Similaire à 06 (20)

Sybil limit a near optimal social network defense against sybil attacks-2014
Sybil limit a near optimal social network defense against sybil attacks-2014Sybil limit a near optimal social network defense against sybil attacks-2014
Sybil limit a near optimal social network defense against sybil attacks-2014
 
An improvement to trust based cross layer security protocol against sybil att...
An improvement to trust based cross layer security protocol against sybil att...An improvement to trust based cross layer security protocol against sybil att...
An improvement to trust based cross layer security protocol against sybil att...
 
Clickstream ppt copy
Clickstream ppt   copyClickstream ppt   copy
Clickstream ppt copy
 
S.A.kalaiselvan blocking misbehaving users in anonymizing
S.A.kalaiselvan blocking misbehaving users in anonymizingS.A.kalaiselvan blocking misbehaving users in anonymizing
S.A.kalaiselvan blocking misbehaving users in anonymizing
 
C0951520
C0951520C0951520
C0951520
 
Sybil Attack Analysis and Detection Techniques in MANET
Sybil Attack Analysis and Detection Techniques in MANETSybil Attack Analysis and Detection Techniques in MANET
Sybil Attack Analysis and Detection Techniques in MANET
 
NYMBLE: Servers Overcrowding Disobedient Users in Anonymizing Networks
NYMBLE: Servers Overcrowding Disobedient Users in Anonymizing NetworksNYMBLE: Servers Overcrowding Disobedient Users in Anonymizing Networks
NYMBLE: Servers Overcrowding Disobedient Users in Anonymizing Networks
 
Nymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksNymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networks
 
Nymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksNymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networks
 
Nymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networksNymble blocking misbehaviouring users in anonymizing networks
Nymble blocking misbehaviouring users in anonymizing networks
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
An Agent Future For Network Control
An Agent Future For Network ControlAn Agent Future For Network Control
An Agent Future For Network Control
 
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial InterpolationSpace-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolation
 
Space efficient verifiable secret sharing
Space efficient verifiable secret sharingSpace efficient verifiable secret sharing
Space efficient verifiable secret sharing
 
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial InterpolationSpace-efficient Verifiable Secret Sharing Using Polynomial Interpolation
Space-efficient Verifiable Secret Sharing Using Polynomial Interpolation
 
SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATION
 SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATION SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATION
SPACE-EFFICIENT VERIFIABLE SECRET SHARING USING POLYNOMIAL INTERPOLATION
 
IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...
IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...
IRJET- A Cloud based Honeynet System for Attack Detection using Machine Learn...
 
pBFT.pptx
pBFT.pptxpBFT.pptx
pBFT.pptx
 
A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...
A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...
A Lightweight Algorithm for Detecting Sybil Attack in Mobile Wireless Sensor ...
 
COPYRIGHTThis thesis is copyright materials protected under the .docx
COPYRIGHTThis thesis is copyright materials protected under the .docxCOPYRIGHTThis thesis is copyright materials protected under the .docx
COPYRIGHTThis thesis is copyright materials protected under the .docx
 

06

  • 1. SybilInfer: Detecting Sybil Nodes using Social Networks George Danezis Prateek Mittal Microsoft Research, University of Illinois at Urbana-Champaign, Cambridge, UK. Illinois, USA. gdane@microsoft.com mittal2@uiuc.edu Abstract contributors. On-line forums, starting with USENET [19], to contemporary blogs or virtual worlds like Second SybilInfer is an algorithm for labelling nodes in a so- Life [22] always have to deal with the issue of disruption cial network as honest users or Sybils controlled by an in the discussion threads, with persistent abusers coming adversary. At the heart of SybilInfer lies a probabilis- back under different names. All these are forms of Sybil tic model of honest social networks, and an inference attacks at the high-level application layers. engine that returns potential regions of dishonest nodes. There are two schools of Sybil defence mechanisms, The Bayesian inference approach to Sybil detection comes the centralised and decentralised ones. Centralised de- with the advantage label has an assigned probability, indi- fences assume the existence of an authority that is capable cating its degree of certainty. We prove through analytical of doing admission control for the network [3]. Its role is results as well as experiments on simulated and real-world to rate limit the introduction of ‘fake’ identities, to ensure network topologies that, given standard constraints on the that the fraction of corrupt nodes remains under a certain adversary, SybilInfer is secure, in that it successfully dis- threshold. The practicalities of running such an authority tinguishes between honest and dishonest nodes and is not are very system-specific and in general it would have to susceptible to manipulation by the adversary. Further- act as a Public Key Certification Authority as well as a more, our results show that SybilInfer outperforms state guardian of the moral standing of the nodes introduced – of the art algorithms, both in being more widely applica- a very difficult problem in practice. Such centralised so- ble, as well as providing vastly more accurate results. lutions are also at odds with the decentralisation guiding principle of peer-to-peer systems. Decentralised approaches recognise the difficulty in 1 Introduction having a single authority vouching for nodes, and dis- tribute this task across all nodes of the system. The first The Peer-to-peer paradigm allows cooperating users to such proposal is Advogato [14], which aimed to reduce enjoy a service with little or no need for any centralised abuse in on-line services, followed by a proposal to use infrastructure. While communication [24] and storage [4] introduction graphs of Distributed Hash Tables [6] to limit systems employing this design philosophy have been pro- the potential for routing disruption in those systems. The posed, the lack of any centralised control over identities state of the art SybilGuard [27] and SybiLimit [26] pro- opens such systems to Sybil attacks [8]: a few malicious pose the use of social networks to mitigate Sybil attacks. nodes can simulate the presence of a very large number As we will see, SybilGuard suffers from high false neg- of nodes, to take over or disrupt key functions of the atives, while SybilLimit makes unrealistic assumptions distributed or peer-to-peer system. Any attempt to build about the knowledge of number of honest nodes in the net- fault-tolerance mechanisms is doomed since adversaries work. In both cases the systems Sybil detection strategies can control arbitrary fractions of the system nodes. This are based on heuristics that are not optimal. Sybil attack is further made practical through the use of Our key contribution is to propose SybilInfer, a method the existing large number of compromised networked ma- for detecting Sybil nodes in a social network, that makes chines (often called zombies) being part of bot-nets. use of all information available to the defenders. The for- Similar problems plague Web 2.0 applications that mal model underlying our approach casts the problem of rely on collaborative tagging, filtering and editing, like detecting Sybil nodes in the context of Bayesian Infer- Wikipedia [31], or del.icio.us [28]. A single user ence: given a set of stated relationships between nodes, can register under different pseudonyms, and bypass any the task is to label nodes as honest or dishonest. Based on elections of velocity check mechanism that attempts to some simple and generic assumptions, like the fact that guarantee the quality of the data through the plurality of social networks are fast mixing [18], we sample cuts in
  • 2. the social graph according to the probability they divide Limit, the current state of the art Sybil defence mecha- it into honest and dishonest regions. These samples not nisms. We also propose extensions that enable our solu- only allow us to label nodes as honest or Sybil attackers, tion to be implemented in decentralised settings. but also to associate with each label output by our algo- This paper is organised in the following fashion: in sec- rithm a degree of certainty. tion 2 we present an overview of our approach that can The proposed techniques can be applied in a wide vari- be used as a road-map to the technical sections. In sec- ety of settings where high reliability peer-to-peer systems, tion 3 we present our security assumptions, threat model, or Sybil-resistant collaborative mechanisms, are required the probabilistic model and sampler underpinning Sybil- even under attack: Infer; a security evaluation follows in section 4, providing • Secure routing in Distributed Hash Tables motivated analytical as well as experimental arguments supporting early research into this field, and our proposal can the security of the method proposed along with a compar- be used instead of a centralised authority to limit the ison with SybilInfer. Section 5 discusses the settings in fraction of dishonest nodes, that could disrupt rout- which SybilInfer can be fruitfully used, followed by some ing [3]. conclusions in section 6. • In public anonymous communication networks, such 2 Overview as Tor [7], our techniques can be used to eliminate the potential for a single entity introducing a large num- ber of nodes, and de-anonymize users’ circuits. This The SybilInfer algorithm takes as an input a social was so far a key open problem for securely scaling graph G and a single known good node that is part of this such systems. graph. The following conceptual steps are then applied to return the probability each node is honest or controlled by • Leader Election [2] and Byzantine agreement [12] a Sybil attacker: mechanisms that were rendered useless by the Sybil • A set of traces T are generated and stored by per- attack can again be of use, after Sybil nodes have forming special random walks over the social graph been detected and eliminated from the social graph. G. These are the only information retained about the • Finally, detecting Sybil accounts is a key step in pre- graph for the rest of the SybilInfer algorithm, and venting false email accounts used for spam, or pre- their generation is detailed in section 3.1. venting trolling and abuse of on-line communities • A probabilistic model is then defined that describes and web-forums. Our techniques can be applied in the likelihood a trace T was generated by a specific all those settings, to fight spam and abuse [14]. honest set of nodes within G, called X. This model SybilInfer applies to settings where a peer-to-peer or dis- is based on our assumptions that social networks are tributed system is somehow based on or aware of social fast mixing, while the transitions to dishonest regions connections between users. Properties of natural social are slow. Given the probabilistic model, the traces T graphs are used to classify nodes are honest or Sybils. and the set of honest nodes we are able to calculate While this approach might not be applicable to very tradi- Pr[T |X is honest]. The calculation of this quantity tional peer-to-peer systems [24], it is more an more com- is the subject of section 3.1 and section 3.2. mon for designers to make distributed systems aware of • Once the probabilistic model is defined, we use the social environment of their users. Third party social Bayes theorem to calculate for any set of nodes X network services [29, 30], can also be used to extract so- and the generated trace T , the probability that X con- cial information to protect systems against sybil attacks sists of honest nodes. Mathematically this quality is using SybilInfer. Section 5 details deployment strategies defined as Pr[X is honest|T ]. The use of Bayes the- for SybilInfer and how it is applicable to current systems. orem is described in section 3.1. We show analytically that SybilInfer is, from a theo- retical perspective, very powerful: under ideal circum- • Since it is not possible to simply enumerate all sub- stances an adversary gains no advantage by introducing sets of nodes X of the graph G, we instead sample into the social network any additional Sybil nodes that are from the distribution of honest node sets X, to only not ‘naturally’ connected to the rest of the social structure. get a few X0 , . . . , XN ∼ Pr[X is honest|T ]. Using Even linking all dishonest nodes with each other (with- those representative sample sets of honest nodes, we out adding any Sybils) changes the characteristics of their can calculate the probability any node in the system social sub-graph, and can under some circumstances be is honest or dishonest. Sampling and the approxi- detected. We demonstrate the practical efficacy of our ap- mation of the sought marginal probabilities are the proach using both synthetic scale-free topologies as well subject of section 3.3. as real-world LiveJournal data. We show very significant The key conceptual difficulty of our approach is the security improvements over both SybilGuard and Sybil- definition of the probabilistic model over the traces T , and
  • 3. graph. Several authors have shown that real-life so- cial networks are indeed fast mixing [16, 18]. 3. A node knows the complete social network topology (G) : social network topologies are relatively static, and it is feasible to obtain a global snapshot of the network. Friendship relationships are already public Attack Edges data for popular social networks like Facebook [29] and Orkut [30]. This assumption can be relaxed to using sub-graphs, making SybilInfer applicable to Honest Nodes Sybil Nodes decentralised settings. Assumptions (1) and (2) are identical to those made by Figure 1. Illustration of Honest nodes, Sybil the SybilGuard and SybilInfer systems. Previously, the nodes and attack edges between them. authors of SybilGuard [27] observed that when the adver- sary creates too many Sybil nodes, then the graph G has a small cut: a set of edges that together have small station- ary probability and whose removal disconnects the graph its inversion using Bayes theorem to define a probability into two large sub-graphs. distribution over all possible honest sets of nodes X. This distribution describes the likelihood that a specific set of This intuition can be pushed much further to build su- nodes is honest. The key technical challenge is making perior Sybil defences. It has been shown [20] that the use of this distribution to extract the sought probability presence of a small cut in a graph results in slow mix- each node is honest or dishonest, that we achieve via sam- ing which means that fast mixing implies the absence of pling. Section 3, describes in some detail how these issues small cuts. Applied to social graphs this observation un- are tackled by SybilInfer. derpins the key intuition behind our Sybil defence mech- anism: the mixing between honest nodes in the social net- works is fast, while the mixing between honest nodes and 3 Model & Algorithm dishonest nodes is slow. Thus, computing the set of honest nodes in the graph is related to computing the bottleneck Let us denote the social network topology as a graph G cut of the graph. comprising vertices V , representing people and edges E, One way of formalising the notion of a bottleneck cut, representing trust relationships between people. We con- is in terms of graph conductance (Φ) [11], defined as: sider the friendship relationship to be an undirected edge in the graph G. Such an edge indicates that two nodes trust Φ= min ΦX , X⊂V :π(X)<1/2 each other to not be part of a Sybil attack. Furthermore, we denote the friendship relationship between an attacker node and an honest node as an attack edge and the hon- where ΦX is defined as: est node connected to an attacker node as a naive node or Σx∈X Σy∈X π(x)Pxy / misguided node. Different types of nodes are illustrated in ΦX = , figure 1. These relationships must be understood by users π(X) as having security implications, to restrict the promiscu- ous behaviour often observed in current social networks, and π(·) is the stationary distribution of the graph. Intu- where users often flag strangers as their friends [23]. itively for any subset of vertices X ⊂ V its conductance We build our Sybil defence around the following as- ΦX represents the probability of going from X to the rest sumptions: of the graph, normalised by the probability weight of be- ing on X. When the value is minimal the bottleneck cut 1. At least one honest node in the network is known. In in the graph is detected. practise, each node trying to detect Sybil nodes can Note that performing a brute force search for this bot- use itself as the apriori honest node. This assump- tleneck cut is computationally infeasible (it is actually NP- tion is necessary to break symmetry: otherwise an Hard, given its relationship to the sparse-cut problem). attacker could simply mirror the honest social struc- Furthermore, finding the exact smallest cut is not as im- ture, and any detector would not be able to distin- portant as being able to judge how likely any cut is, to be guish which of the two regions is the honest one. dividing nodes into an honest and dishonest region. This 2. Social networks are fast mixing: this means that a probability is related to the deviation of the size of any random walk on the social graph converges quickly cut from what we would expect in a natural, fast mixing, to a node following the stationary distribution of the social network.
  • 4. 3.1 Inferring honest sets ProbXX ProbXX ProbXX In this paper, we propose a framework based on Bayesian inference to detect approximate cuts between X X honest and Sybil node regions in a social graph and use those to infer the labels of each node. A key strength of our approach is that it, not only associates labels to each ProbXX node, but also finds the correct probability of error that (a) A schematic representation of tran- could be used by peer-to-peer or distributed applications sition probabilities between honest X ¯ and dishonest X regions of the social to select nodes. network. The first step of SybilInfer is the generation of a set of random walks on the social graph G. These walks are gen- erated by performing a number s of random walks, start- ProbXX = 1 / |V| + Exx Probability ing from each node in the graph (i.e. a total of s · |V | 1 / |V| walks.) A special probability transition matrix is used, ProbXX = 1 / |V| - Exx defined as follows: Honest Nodes Sybil nodes 1 1 min( di , dj ) if i → j is an edge in G Pij = , (b) The model of the probability a short random walk of length 0 otherwise O(log |V |) starting at an honest node ends on a particular honest or dishonest (Sybil) node. If no Sybils are present the network is fast where di denotes the degree of vertex i in G. mixing and the probability converges to 1/|V |, otherwise it is biased This choice of transition probabilities ensures that the towards landing on an honest node. stationary distribution of the random walk is uniform over all vertices |V |. The length of the random walks is l = Figure 2. Illustrations of the SybilInfer mod- O(log |V |), which is rather short, while the number of ran- els dom walks per node (denoted by s) is a tunable parameter of the model. Only the starting vertex and the ending ver- tex of each random walk are used by the algorithm, and we denote this set of vertex-pairs, also called the traces, by T . Note that since the set X is honest, we assume (by as- Now consider any cut X ⊂ V of nodes in the graph, sumption (2)) fast mixing amongst its elements, meaning such that the a-prior honest node is an element of X. We that a short random walk reaches any element of the sub- are interested in the probability that the vertices in set X set X uniformly at random. On the other hand a random are all honest nodes, given our set of traces T , i.e. P (X = walk starting in X is less likely to end up in the dishonest Honest|T ). Through the application of Bayes theorem we ¯ region X, since there should be an abnormally small cut have an expression of this probability: between them. (This intuition is illustrated in figure 2(a).) Therefore we approximate the probability that a short ran- P (T |X = Honest) · P (X = Honest) P (X = Honest|T ) = , dom walk of length l = O(log |V |) starts in X and ends at Z a particular node in X is given by ProbXX = Π + EXX , where Z is the normalization constant given by: Z = where Π is the stationary distribution given by 1/|V |, for ΣX⊂V P (T |X = Honest) · P (X = Honest). Note that some EXX > 0. Similarly, we approximate the proba- Z is difficult to compute because it involves the summa- bility that a random walk starts in X and does not end tion of an exponential number of terms in the size of |V |. in X is given by ProbX X = Π − EX X . Notice that ¯ ¯ Only being able to compute this probability up to a mul- ProbXX > ProbX X , which captures the property that ¯ tiplicative constant Z is not an impediment. The a-prior there is fast mixing amongst honest nodes and slow mix- distribution P (X = Honest) can be used to encode any ing between honest and dishonest nodes. The approxi- further knowledge about the honest nodes, or can simply mate probabilities ProbXX and ProbX X and their likely ¯ be set to be uniform over all possible cuts. gap from the ideal 1/|V | are illustrated in figure 2(b). Bayes theorem has allowed us to reduce the initial problem of inferring the set of good nodes X from the set Let NXX be the number of traces in T starting in the of traces T , to simply being able to assign a probability honest set X and ending in same honest set X. Let NX X ¯ to each set of traces T given a set of honest nodes X, i.e. be the number of random walks that start at the honest set calculating P (T |X = Honest). Our only remaining theo- ¯ X and end in the dishonest set X. NX X and NXX are ¯ ¯ ¯ retical task is deriving this probability, given our model’s defined similarly. Given the approximate probabilities of assumptions. transitions from one set to the other and the counts of such
  • 5. transitions we can ascribe a probability to the trace: Approximating ProbXX through the traces T provides us with a simple expression for the sought probability, P (T |X = Honest) = (ProbXX )NXX · (ProbX X )NX X · ¯ ¯ based simply on the number of walks starting in one re- (ProbX X )NX X · (ProbXX )NXX , ¯ ¯ ¯ ¯ ¯ ¯ gion and ending in another: where ProbX X and ProbXX are the probabilities a walk NXX 1 NXX ¯ ¯ ¯ P (T |X = Honest) = ( · ) · starting in the dishonest region ends in the dishonest or NXX + NX X ¯ |X| honest regions respectively. NX X¯ 1 ( · ¯ )NX X · ¯ The model described by P (T |X = Honest) is an ap- NX X + NXX ¯ |X| proximation to reality that is suitable enough to perform NX X ¯ ¯ 1 Sybil detection. It is of course unlikely that a random walk ( · ¯ )NX X · ¯ ¯ NX X + NXX ¯ ¯ ¯ |X| starting at an honest node will have a uniform probabil- NXX ¯ 1 NXX ity to land on all honest or dishonest nodes respectively. ( · ) ¯ , NXX + NX X ¯ ¯ ¯ |X| Yet this simple probabilistic model relating the starting and ending nodes of traces is rich enough to capture the This expression concludes the definition of our probabilis- “probability gap” between landing on an honest or dis- tic model, and contains only quantities that can be ex- honest node, as illustrated in figure 2(b), and suitable for tracted from either the known set of nodes X, or the set Sybil detection. of traces T that is assigned a probability. Note that we do not assume any prior knowledge of the size of the honest 3.2 Approximating EXX ¯ set, and it is simply a variable |X| or |X| of the model. Next, we shall describe how to sample from the distri- We have reduced the problem of calculating P (T |X = bution P (X = Honest|T ) using the Metropolis-Hastings Honest) to finding a suitable EXX , representing the ‘gap’ algorithm. between the case when the full graph is fast mixing (for EXX = 0) and when there is a distinctive Sybil attack (in 3.3 Sampling honest configurations which case EXX >> 0.) One approach could be to try inferring EXX through a At the heart of our Sybil detection techniques lies a trivial modification of our analysis to co-estimate P (X = model that assigns a probability to each sub-set of nodes Honest, EXX |T ). Another possibility is to approximate of being honest. This probability P (X = Honest|T ) can EXX or ProbXX directly, by choosing the most likely be calculated up to a constant multiplicative factor Z, that candidate value for each configuration of honest nodes X is not easily computable. Hence, instead of directly calcu- considered. This can be done through the conductance or lating this probability for any configuration of nodes X, through sampling random walks on the social graph. we will attempt instead to sample configurations Xi fol- Given the full graph G, ProbXX can be approximated lowing this distribution. Those samples are used to esti- Σx∈X Σy∈X Π(x)P lxy l mate the marginal probability that any specific node, or as ProbXX = Π(X) , where Pxy is the prob- collections of nodes, are honest or Sybil attackers. ability that a random walk of length l starting at x ends in Our sampler for P (X = Honest|T ) is based on y. This approximation is very closely related to the con- ¯ the established Metropolis-Hastings algorithm [10] (MH), ductance of the set X and X. Yet computing this measure which is an instance of a Markov Chain Monte Carlo sam- would require some effort. pler. In a nutshell, the MH algorithm holds at any point Notice that ProbXX , as calculated above, can also a sample X0 . Based on the X0 sample a new candidate be approximated by performing many random walks of sample X is proposed according to a probability distribu- length l starting at X and computing the fraction of those tion Q, with probability Q(X |X0 ). The new sample X walks that end in X. Interestingly our traces already con- is ‘accepted’ to replace X0 with probability α: tain random walks over the graph of exactly the appropri- ate length, and therefore we can reuse them to estimate a P (X |T ) · Q(X0 |X ) good ProbXX and related probabilities. Given the counts α = min( , 1) P (X0 |T ) · Q(X |X0 ) NXX , NX X , NXX and NX X : ¯ ¯ ¯ ¯ otherwise the original sample X0 is retained. It can be NXX 1 shown that after multiple iterations this yields samples X ProbXX = · NXX + NX X |X| ¯ according to the distribution P (X|T ) irrespective of the way new candidate sets X are proposed or the initial state and of the algorithm, i.e. a more likely state X will pop-out NX X ¯ ¯ 1 ProbX X = ¯ ¯ · ¯ , more frequently from the sampler, than less likely states. NX X + NXX |X| ¯ ¯ ¯ A relatively naive strategy can be used to propose can- and, ProbX X = 1−ProbXX and ProbXX = 1−ProbX X . ¯ ¯ ¯ ¯ didate states X given X0 for our problem. It relies on
  • 6. simply considering sets of nodes X that are only different 4.1 Theoretical results by a single member from X0 . Thus, with some probability ¯ padd a random node x ∈ X0 is added to the set to form the The security of our Sybil detection scheme hinges on candidate X = X0 ∪ x. Alternatively, with probability two important results. First, we show that we can detect premove , a member of X0 is removed from the set of nodes, whether a network is under Sybil attack, based on the so- defining X = X0 ∩ x for x ∈ X0 . It is trivial to calculate cial graph. Second, we show that we are able to detect the probabilities Q(X |X0 ) and Q(X |X0 ) based on padd , Sybil attackers connected to the honest social graph, and premove and using a uniformly at random choice over nodes this for any attacker topology. ¯ ¯ in X0 , X0 , X and X when necessary. Our first result states that: A key issue when utilizing the MH algorithm is decid- T HEOREM A. In the absence of any Sybil attack, ing how many iterations are necessary to get independent the distribution of P (X = Honest|T ), for a given samples. Our rule of thumb is that |V | · log |V | steps are size |X|, is close to uniform, and all cuts are equally likely to guarantee convergence to the target distribution likely (EXX 0). P . After that number of steps the coupon collector’s the- This result is based on our assumption that a random walk orem states that each node in the graph would have been over a social network is fast mixing meaning that, after considered at least once by the sampler, and assigned to log(|V |) steps, it visits nodes drawn from the stationary the honest or dishonest set. In practice, given very large distribution of the graph. In our case the random walk is traces T , the number of nodes that are difficult to cate- performed over a slightly modified version of the social gorise is very small, and a non-naive sampler requires few graph, where the transition probability attached to each steps to produce good samples (after a certain burn in- link ij is: period that allows it to detect the most likely honest re- 1 1 gion.) min( di , dj ) if i → j is an edge in G Pij = , Finally, given a set of N samples Xi ∼ P (X|T ) out- 0 otherwise put by the MH algorithm it is possible to calculate the marginal probabilities any node is honest. This is key out- which guarantees that the stationary distribution is uni- 1 put of the SybilInfer algorithm: given a node i it is pos- form over all nodes (i.e. Π = |V | ). Therefore we ex- sible to associate a probability it is honest by calculating: pect that in the absence of an adversary the short walks I(i∈Xj ) in T to end at a network node drawn at random amongst Pr[i is honest] = j∈[0,N −1)N , where I(i ∈ Xj ) is an indicator variable taking value 1 if node i is in the hon- all nodes |V |. In turn this means that the number of end est sample Xj , and value zero otherwise. Enough samples nodes in the set of traces T , that end in the honest set X is can be extracted from the sampler to estimate this proba- NXX = lim|TX |→∞ |X| · |TX |, where TX is the number |V | bility with an arbitrary degree of precision. of traces in T starting within the set |X|. Substituting this More sophisticated samplers would make use of a bet- in the equations presented in section 2.1 and 2.2 we get: ter strategy to propose candidate states X for each iter- NXX 1 ation. The choice of X can, for example, be biased to- ProbXX = · ⇒ (1) NXX + NX X |X| ¯ wards adding or removing nodes according to how often NXX 1 random walks starting at the single honest node land on Π + EXX = · ⇒ (2) NXX + NX X |X| ¯ them. We expect nodes that are reached often by random walks starting in the honest region to be honest, and the 1 (|X|/|V |) · |TX | 1 + EXX = · ⇒ (3) opposite to be true for dishonest nodes. In all cases this |V | |TX | |X| bias is simply an optimization for the sampling to take EXX = 0 (4) fewer iterations, and does not affect the correctness of the results. As a result, by sufficiently increasing the number of ran- dom walks T performed on the social graph, we can get EXX arbitrarily close to zero. In turn this means that our 4 Security evaluation distribution P (T |X = Honest) is uniform for given sizes of |X|, given our uniform a-prior P (X = Honest|T ). In a nutshell by estimating EXX for any sample X re- In this section we discuss the security of SybilInfer turned by the MH algorithm, and testing how close it is when under Sybil attack. We show analytically that we to zero we detect whether it corresponds to an attack (as can detect when a social network suffers from a Sybil at- we will see from theorem B) or a natural cut in the graph. tack, and correctly label the Sybil nodes. Our assumptions We can increase the precision of the detector arbitrarily by and full proposal are then tested experimentally on syn- increasing the number of walks T . thetic as well as real-world data sets, indicating that the Our second results relates to the behaviour of the sys- theoretical guarantees hold. tem under Sybil attack:
  • 7. T HEOREM B. Connecting any additional Sybil nodes 4.2 Practical considerations to the social network, through a set of corrupt nodes, lowers the dishonest sub-graph conductance to the Models and assumptions are always an approximation honest region, leading to slow mixing, and hence we of the real world. As a result, careful evaluation is nec- expect EXX > 0. essary to ensure that the theorems are robust to deviations ¯ from the ideal behaviour assumed so far. First we define the dishonest set X0 comprising all dis- honest nodes connected to honest nodes in the graph. The The first practical issue concerns the fast mixing prop- ¯ set XS contains all dishonest nodes in the system, includ- erties of social networks. There is a lot of evidence that ¯ ing nodes in X0 and the Sybil nodes attached to them. It social networks exhibit this behaviour [18], and previous ¯ ¯ must hold that |X0 | < |XS |, in case there is a Sybil attack. proposals relating to Sybil defence use and validate the Second we note that the probability of a transition be- same assumption [27, 26]. SybilInfer makes an further tween an honest node i ∈ X and a dishonest node j ∈ X ¯ assumption, namely that the modified random walk over cannot increase through Sybil attacks, since it is equal to the social network, that yields a uniform distribution over 1 1 Pij = min( di , dj ). At worst the corrupt node will in- all nodes, is also fast mixing for real social networks. The 1 1 crease its degree by connecting Sybils which has only the probability Pij = min( di , dj ), depends on the mutual potential to decrease this probability. Therefore we have degrees of the nodes i and j, and makes the transition to that x∈XS y∈XS Pxy ≤ x∈X0 y∈X0 Pxy . Com- ¯ ¯ ¯ ¯ nodes of higher degree less likely. This effect has the po- bining the two inequalities we get that: tential to slow down mixing times in the honest case, par- ticularly when there is a high variation in node degrees. This effect can be alleviated by removing random edges ¯ Pxy ¯ ¯ Pxy from high degree nodes to guarantee that the ratio of max- y∈XS x∈X0 y∈X0 ¯S | < ¯ ⇔ (5) imum and minimum node degree in the graph is bounded |X |X0 | (an approach also used by SybilLimit.) 1 1 y∈XS |V | Pxy ¯ ¯ x∈X0 y∈X0 |V | Pxy ¯ The second consideration also relates to the fast mix- ¯ < ¯ 1 ⇔ (6) |XS | 1 |V | |X0 | |V | ing properties of networks. While in theory fast mixing networks should not exhibit any small cuts, or regions of ¯ y∈XS π(x)Pxy ¯ x∈X0 y∈X0 π(x)Pxy ¯ < ⇔ (7) abnormally low conductance, in practice they do. This ¯S ) Π(X Π(X¯0) is especially true for regions with new users that have not ¯ ¯ Φ(XS ) < Φ(X0 ). (8) had the chance to connect to many others, as well as social networks that only contain users with particular charac- teristics (like interest, locality, or administrative groups.) This result signifies that independently of the topology of Those regions yield, even in the honest case, sample cuts the adversary region the conductance of a sub-graph con- that have the potential to be mistaken as attacks. This ef- taining Sybil nodes will be lower compared with the con- fect forces us to consider a threshold EXX under which ductance of the sub-graph of nodes that are simply com- we consider cuts to be simply false positives. In turn this promised and connected to the social network. Lower makes the guarantees of schemes weaker in practice than conductance in turn leads to slower mixing times be- in theory, since the adversary can introduce Sybils into a tween honest and dishonest regions [20] which means that region undetected, as long as the set threshold EXX is not EXX > 0, even for very few Sybils. This deviation is exceeded. subject to the sampling variation introduced by the trace The threshold EXX is chosen to be α·EXXmax , where T , but the error can be made arbitrarily small by sampling 1 1 EXXmax = |X| − |V | , and α is a constant between 0 more random walks in T . and 1. Here α can be used to control the tradeoff between false positives and false negatives. A higher value of alpha These two results are very strong: they indicate that, in enables the adversary to insert a larger number of sybils theory, a set of compromised nodes connecting to honest undetected but reduces the false positives. On the other nodes in a social network, would get no advantage by con- hand, a smaller value of α reduces the number of Sybils necting any additional Sybil nodes, since that would lead that can be introduced undetected but at the cost of higher to their detection. Sampling regions of the graph with ab- number of false positives. normally small conductance, through the use of the ran- dom walks T , should lead to their discovery, which is the Given these practical considerations, we can formulate theoretical foundation of our technique. Furthermore we a weaker security guarantee for SybilInfer: established that techniques based on detecting abnormali- T HEOREM C. Given a certain “natural” threshold ties in the value of EXX are strategy proof, meaning that value for EXX in an honest social network, a dis- there is no attacker strategy (in terms of special adversary honest region performing a Sybil attack will exceed topology) to foil detection. it after introducing a certain number of Sybil nodes.
  • 8. Scale Free Topology: 1000 nodes, 100 malicious Scale Free Topology: 1000 nodes, 100 malicious 250 250 False Negatives, alpha=0.65 False Negatives,alpha=0.65 False Positives, alpha=0.65 False Positives,alpha=0.65 False Negatives, alpha=0.7 False Negatives,alpha=0.7 False Positives, alpha=0.7 False Positives,alpha=0.7 False Negatives, alpha=0.75 False Negatives,alpha=0.75 200 False Positives, alpha=0.75 200 False Positives,alpha=0.75 Number of identities Number of identities 150 150 100 100 50 50 0 0 0 100 200 300 400 500 600 700 800 900 1000 0 100 200 300 400 500 600 700 800 900 1000 Number of additional sybil identities (x) Number of additional sybil identities (x) (a) Average degree compromised nodes (b) Low degree compromised nodes Figure 3. Synthetic Scale Free Topology: SybilInfer Evaluation as a function of additional Sybil identities (ψ) introduced by colluding entities. False negatives denote the total number of dishon- est identities accepted by SybilInfer while false positives denote the number of honest nodes that are misclassified. This theorem is the result of Theorem B that demonstrates given node is proportional to the degree of that node; i.e.: that the conductance keeps decreasing as the number of Pr[(v, i)] = didj , where di is the degree of node i. In j Sybils attached to a dishonest region increases. This in our simulations, we use m = 5, giving an average node turn will slow down the mixing time between the hon- degree of 10. est and dishonest region, leading to an increasingly large In such a scale free topology of 1000 nodes, we con- EXX . sider a fraction f = 10% of the nodes to be compro- Intuitively, as the attack becomes larger, the cut be- mised by a single adversary. The compromised nodes are tween honest and dishonest nodes becomes increasingly distributed uniformly at random in the topology. Com- distinct, which makes Sybil detection easier. It is impor- promised nodes introduce ψ additional Sybil nodes and tant to note that as more Sybils are introduced into the establish a scale free topology amongst themselves. We dishonest region, the probability of the whole region be- configure SybilInfer to use 20 samples for computing the ing detected as an attack increases, not only the new Sybil marginal probabilities, and label as honest the set of nodes nodes. This provides strong disincentives to the adver- whose marginal probability of being honest is greater than sary from performing larger Sybil attacks, since even pre- 0.5. The experiment is repeated 100 times with different viously undetected malicious nodes might be flagged as scale free topologies. Sybils. Figure 3(a) illustrates the false positives and false neg- atives classifications returned by SybilInfer, for varying 4.3 Experimental evaluation using synthetic value of ψ, the number of additional Sybil nodes intro- data duced. We observe that when ψ < 100, α = 0.7 , then all the malicious identities are classified as honest by Sybil- We first experimentally demonstrate the validity of Infer. However, there is a threshold at ψ = 100, be- Theorem C using synthetic topologies. Our experiments yond which all of the Sybil identities, including the ini- consist of building synthetic social network topologies, in- tially compromised entities are flagged as attackers. This jecting a variable number of Sybil nodes, and applying is because beyond this point, the EXX for the Sybil region SybilInfer to establish how many of them are detected. A exceeds the natural threshold leading to full detection, val- key issue we explore is the number of introduced Sybil idating Theorem C. The value ψ = 100 is clearly the op- nodes under which Sybil attacks are not detected. timal attack strategy, in which the attacker can introduce Social networks exhibit a scale-free (or power law) the maximal number of Sybils without being detected. We node degree topology [21]. Our network synthesis al- also note that even in the worst case, the false positives are gorithm replicates this structure through preferential at- less than 5%. The false positive nodes have been misclas- tachment, following the methodology of Nagaraja [18]. sified because these nodes are closer to the Sybil region; We create m0 initial nodes connected in a clique, and SybilInfer is thus incentive compatible in the sense that then for each new node v, we create m new edges to nodes which have mostly honest friends are likely not to existing nodes, such that the probability of choosing any be misclassified.
  • 9. Figure 4 presents a plot of the maximum Sybil iden- Scale Free Topology: 1000 nodes 0.4 tities as a function of the compromised fraction of nodes False Negatives 0.35 Theoretical Prediction y=x f . Note that our theoretical prediction (which is strategy- independent) matches closely with the attacker strategy 0.3 Fraction of total malicious entities of connecting Sybil nodes in a scale free topology. The 0.25 adversary is able to introduce roughly about 1 additional 0.2 Sybil identity per real entity. For instance, at f = 0.2, 0.15 the total number of Sybil identities is 0.37. As we observe 0.1 from the figure the ability of the adversary to just include 0.05 about one additional Sybil identity per compromised node 0 embedded in the social network remains constant, no mat- 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 Fraction of malicious entities (f) ter the fraction f of compromised nodes in the network. Figure 4. Scale Free Topology: fraction of 4.4 Experimental evaluation using real-world total malicious and Sybil identities as a func- data tion of real malicious entities. Next we validate the security guarantees provided by We can also see the effect of varying the threshold SybilInfer using a sampled LiveJournal topology. A vari- EXX . As α is increased from 0.65 to 0.7, the ψ for the ant of snowball [9] sampling was used to collect the full optimal attacker strategy increases from 70 to 100. This is data set data, comprising over 100,000 nodes. because an increase in the threshold EXX allows the ad- To perform our experiments we chose a random node versary to insert more Sybils undetected. The advantage and collect all nodes in its three hop neighbourhood. The of increasing α lies in reducing the worst case false pos- resulting social network has about 50,000 nodes. We then itives. We can see that by increasing α from 0.7 to 0.75, perform some pre-processing step on the sub-graph: the worst case false positives can be reduced from 5% to • Nodes with degree less than 3 are removed, to filter 2%. out nodes that are too new to the social network, or Note that for the remainder of the paper, we shall use inactive. α = 0.7. We also wish to show that the security of our scheme • If there is an edge between A → B, but no edge depends primarily on the number of colluding malicious between B → A, then A → B is removed (to only nodes and not on the number of attack edges. To this end, keep the symmetric friendship relationships.) we chose the compromised nodes to have the lowest num- ber of attack edges (instead of choosing them uniformly We note that despite this pre-processing nodes all degrees at random), and repeat the experiment. Figure 3(b) illus- can be found in the final dataset, since nodes with ini- trates that the false positives and false negatives classifica- tial degree over 3 will have some edges removed reducing tions returned by SybilInfer, where the average number of their degree to less than 3. attack edges are 500. Note that these results are very simi- After pre-processing, the social sub-graph consists of lar to the previous case illustrated in Figure 3(a), where the about 33,000 nodes. First, we ran SybilInfer on this topol- number of attack edges is around 800. This analysis in- ogy without introducing any artificial attack. We found a dicates that the security provided by SybilInfer primarily bottleneck cut diving off about 2, 000 Sybil nodes. It is depends on the number of colluding entities. The implica- impossible to establish whether these nodes are false pos- tion is that the compromise of high degree nodes does not itives (a rate of 6%) or a real-world Sybil attack present yield any significant advantage to the adversary. As we in the LiveJournal network. Since there is no way to es- shall see, this is in contrast to SybilGuard and SybilLimit, tablish ground truth, we do not label these nodes as either which are extremely vulnerable when high degree nodes honest/dishonest. are compromised. Next, we consider a fraction f of the nodes to be com- Our next experiment establishes the number of Sybil promised and compute the optimal attacker strategy, as in nodes that can be inserted into a network given different our experiments with synthetic data. Figure 5 shows the fractions of compromised nodes. We vary the fraction of fraction of malicious identities accepted by SybilInfer as compromised colluding nodes f , and for each value of f , a function of fraction of malicious entitites in the system. we compute the optimal number of additional Sybil iden- The trend is similar to our observations on synthetic scale tities that the attackers can insert, as in the previous exper- free topologies. At f = 0.2, the fraction of Sybil identities iment. accepted by SybilInfer is approximately 0.32.
  • 10. LiveJournal Topology: 31603 nodes Scale Free Topology: 1000 nodes 0.45 0.6 Fraction of malicious identities SybilLimit, uniform y=x SybilLimit, low degree 0.4 SybilInfer y=x 0.5 0.35 Fraction of total malicious entities Fraction of total malicious identities 0.3 0.4 0.25 0.3 0.2 0.15 0.2 0.1 0.1 0.05 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02 Fraction of colluding entities Fraction of malicious entities (f) Figure 5. LiveJournal Topology: fraction of Figure 6. Comparison with related work total malicious identities as a function of real malicious entities f < 0.01. In this case SybilInfer allows for 5% total ma- 4.5 Comparison with SybilLimit and Sybil- licious entities, while SybilLimit allows for over 50%. Guard This is an illustration that SybilInfer performs an or- SybilGuard [27] and SybilLimit [26] are state of the art der of magnitude better than the state of the art both in decentralized protocols that defend against Sybil attacks. terms of range of applicability and performance within Similar to SybilInfer, both protocols exploit the fact that a that range (SybilGuard’s performance is strictly worse Sybil attack disrupts the fast mixing property of the social than SybilLimit’s performance, and is not illustrated.) An network topology, albeit in a heuristic fashion. A brief obvious question is: “why does SybilInfer perform so overview of the two systems can be found in the appendix, much better than SybilGuard and SybilLimit?” It is partic- and their full descriptions is given in [27, 26]. ularly pertinent since all three systems are making use of Figure 6 compares the performance of SybilInfer with the same assumptions, and a similar intuition, that there the performance of SybilLimit. First it is worth noting should be a “gap” between the honest and Sybil regions that the fraction of compromised nodes that SybilLimit of a social network. The reason SybilLimit and Sybil- tolerates is only a small fraction of the range within which Guard provide weaker guarantees is that they interpret SybilInfer provide its guarantees. SybilLimit tolerates up these assumptions in a very sensitive way: they assume to f = 0.02 compromised nodes when the degree of at- that an overwhelming majority of random walks staring in tackers is low (about degree 5 – green line), while we have the honest region will stay in the honest region, and then already shown the performance of SybilLimit for compro- bound the number of walks originating from the Sybil re- mised fractions up to f = 0.35 in figure 4. Within the gion via the number of corrupt edges. As a result they interval SybilLimit is applicable, our system systemati- are very sensitive to the length of those walks, and can cally outperforms: when very few compromised nodes are only provide strong guarantees for a very small number of present in the system (f = 0.01) our system only allows corrupt edges. Furthermore the validation procedure re- them to control less than 5% of the entities in the system, lies on collisions between honest nodes, via the birthday versus SybilLimit that allows them to control over 30% paradox, which adds a further layer of inefficiency to esti- of entities (rendering insecure byzantine fault tolerance mating good from bad regions. mechanisms that requires at least 2/3 honest nodes) At the limit of SybilLimit’s applicability range when f = 0.02, SybilInfer, on the other hand, interprets the disruption our approach caps the number of dishonest entities in the in fast-mixing between the honest and dishonest region system to fewer than 8%, while SybilLimit allows about simply as a faint bias in the last node of a short random 50% dishonest entities. (This large fraction renders leader walk (as illustrated in figures 2(a) and 2(b).) In our ex- election or other voting systems ineffective.) periments, as well as in theory, we observe a very large An important difference between SybilInfer and Sybil- fraction of the T walks crossing between the honest and Limit is that the former is not sensitive to the degree of dishonest regions. Yet the faint difference in the probabil- the attacker nodes. SybilLimit provides very weak guar- ity of landing on nodes in the honest and dishonest regions antees when high degree (e.g. degree 10 – red line) nodes is present, and the sampler makes use of it to get good cuts are compromised, and can protect the system only for between the honest and dishonest nodes.
  • 11. 4.6 Computational and time complexity O(|V |) per iteration, making the overall complexity of the scheme O(|V |2 log |V |). Two implementations of the SybilInfer sampler were build in Python and C++, of about 1KLOC each. The 5 Deployment Strategies Python implementation can handle 10K node networks, while the C++ implementation has handled up to 30K So far we presented an overview of the SybilInfer algo- node networks, returning results in seconds. rithm, as well as a theoretical and empirical evaluation of The implementation strategy for both samplers has fa- its performance when it comes to detecting Sybil nodes. voured a low time complexity over storage costs. The The core of the algorithm outperforms SybilGuard and critical loop performs O(|V | · log |V |) Metropolis Hast- SybilLimit, and is applicable in settings beyond which ings iterations per sample returned, each only requiring the two systems provide no security guarantees whatso- about O(log |V |) operations. Two copies of the full state ever. Yet a key difference between the previous systems are stored, as well as associated data that allows for fast and SybilInfer is the latter’s reliance on the full friendship updating of the state, which requires O(|V |) storage. The graph to perform the random walks that drive the infer- transcript of evidence traces T is also stored, as well as an ence engine. In this section we discuss how this constraint index over it, which dominates the storage required and still allows SybilInfer to be used for important classes of makes it order O(|V | · log |V |). applications, as well as how it can be relaxed to accom- There is a serious time complexity penalty associated modate peer-to-peer systems with limited resources per with implementing non-naive sampling strategies. Our client. Python implementation biases the candidate moves to- wards nodes that are more or less likely to be part of the 5.1 Full social graph knowledge honest set. Yet exactly sampling nodes from this known probability distribution, naively may raise the cost of each The most straightforward way of applying SybilInfer iteration to be O(|V |). Depending on the differential be- is using the full graph of a social network to infer which tween the highest and lowest probabilities, faster sampling nodes are honest and which nodes are Sybils, given a techniques like rejection sampling [15] can be used to known honest seed node. This is applicable to centralised bring the cost down. The Python implementation uses a on-line services, like free email hosting services, blogging variant of Metropolis-Hastings to implement selection of sites, and discussion forums that want to deter spammers. candidate nodes for the next move, at a computation cost Today those systems use a mixture of CAPTCHA [25] of O(log |V |). The C++ implementation uses naive sam- and network based intrusion detection to eliminate mass pling from the honest or dishonest sets, and has a very low attacks. SybilInfer could be used to either complement cost per iteration of order O(1). those mechanisms and provide additional information as The Markov chain sampling techniques used consider to which identities are suspicious, or replace those sy- sequences of states that are very close to each other, dif- stems when they are expensive and error prone. One of fering at most by a single node. This enables a key the first social network based Sybil defence systems, Ad- optimization, where the counts NXX , NX X , NXX and ¯ ¯ vogato [14], worked in such a centralized fashion. NX X are stored for each state and updated when the state ¯ ¯ The need to know the social graph does not preclude changes. This simple variant of self-adjusting computa- the use of SybilInfer in distributed or even peer-to-peer sy- tion [1], allows for very fast computations of the proba- stems. Social networks, once mature, are generally stable bilities associated with each state. Updating the counts, and do not change much over time. Their rate of change and associated information is an order O(log |V |) opera- is by no means comparable to the churn of nodes in the tion. The alternative, of recounting these quantities from network, and as a result the structure of the social network T would cost O(|V | log |V |) for every iteration, leading could be stored and used to perform inference on multiple to a total computational complexity for our algorithm of nodes in a network, along with a mechanisms to share oc- O((|V | log |V |)2 ). Hence implementing it is vital to get- casional updates. The storage overhead for storing large ting results fast. social networks is surprisingly low: a large social network Finally our implementations use a zero-copy strategy with 10 billion nodes (roughly the population of planet for the state. Two states and all associated information are earth) with each node having about 1000 friends, can be maintained at any time, the current state and the candi- stored in about 187Gb of disk space uncompressed. In date state. Operations on the candidate state can be done such settings it is likely that SybilInfer computation will and undone in O(log |V |) per operation. Accepted moves be the bottleneck, rather than storage of the graph, for the can be committed to the current state at the same cost. foreseeable future. These operations can be used to maintain the two states A key application of Sybil defences is to ensure that synchronised for use by the Metropolis-Hastings sampler. volunteer relays in anonymous communication networks The naive strategy of re-writing the full state would cost belong to independent entities, and are not controlled
  • 12. by a single adversary. Practical systems like Mixmas- tain community of interest (a social club, a committee, a ter [17], Mixminion [5] and Tor [7] operate such a volun- town, etc.) can be extracted to form a sub-graph. SybilIn- teer based anonymity infrastructure, that are very suscep- fer can then be applied on this partial view of the graph, tible to Sybil attacks. Extending such an infrastructure to to detect nodes that are less well integrated than others in use SybilInfer is an easy task: each relay in the system the group. There is an important distinction between us- would have to indicate to the central directory services ing SybilInfer in this mode or using it with the full graph: which other nodes it considers honest and non-colluding. while the results using the full graph output an “absolute” The graph of nodes and mutual trust relations can be used probability for each node being a Sybil, applying SybilIn- to run SybilInfer centrally by the directory service, or by fer to a partial view of the network only provides a “rel- each individual node that wishes to use the anonymizing ative” probability the node is honest in that context. It is service. Currently, the largest of those services, the Tor likely that nodes are tagged as Sybils, because they do not network has about 2000 nodes, which is well within the have many contacts within the select group, which given computation capabilities of our implementations. the full graph would be classified as honest. Before ap- plying SybilInfer in this mode it is important to assess, at 5.2 Partial social graph knowledge least, whether the subgroup is fast-mixing or not. SybilInfer can be used to detect and prevent Sybil at- 5.3 Using SybilInfer output optimally tacks, using only a partial view of the social graph. In the context of a distributed or peer-to-peer system each user Unlike previous systems the output of the SybilInfer discovers only a fixed diameter sub-graph around them. algorithm is a probabilistic statement, or even more gen- For example a user may choose to retrieve and store all erally, a set of samples that allows probabilistic statements other users two or three hops away in the social network to be estimated. So far in the work we discussed how to graph, or discover a certain threshold of nodes in a breadth make inferences about the marginal probability specific first manner. SybilInfer is then applied on the extracted nodes are honest of dishonest by using the returned sam- sub-graph to detect potential Sybil regions. This allows ples to compute Pr[i is honest] for all nodes i. In our ex- the user to prune its social neighbourhood from any Sybil periments we applied a 0.5 threshold to the probability to attacks, and is sufficient for selecting a set of honest nodes classify nodes as honest or dishonest. This is a rather lim- when sampling from the full network is not required. Dis- ited use of the richer output that SybilInfer provides. tributed backup and storage, and all friend and friend- Distributed system applications can, instead of using of-friend based sharing protocols can benefit from such marginal probabilities of individual nodes, estimate the protection. The storage and communication cost of this probability that the particular security guarantees they re- scheme is constant and relative to the number of nodes in quire hold. High latency anonymous communication sy- the chosen neighbourhood. stems, for example, require a set of different nodes such In cases where nodes can afford to know a larger frac- that with high probability at least one of them is honest. tion of the social graph, they could choose to discover Path selection is also subject to other constraints (like la- O(c· |V |) nodes in their neighbourhood, for some small tency.) In this case the samples returned by SybilInfer integer c. This increases the chances two arbitrary nodes can be used to calculate exactly the sought probability, i.e. have to know a common node, that can perform the Sybil- the probability a single node in the chosen path is hon- Infer protocol and act as an introduction point for the est. Onion routing based system, on the other hand are nodes. In this protocol Alice and Bob want to ensure the secure as long as the first and last hop of the relayed com- other party is not a Sybil. They find a node C that is in the munication is honest. As before, the samples returned by c · |V neighbourhood of both of them, and each make SybilInfer can be used to choose a path that has a high sure that with high probability it is honest. They then con- probability to exhibit this characteristic. tact node C that attests to both of them, given its local Other distributed applications, like peer-to-peer storage run of the SybilInfer engine, that they are not Sybil nodes and retrieval have similar needs, but also tunable param- (with C as the honest seed.) This protocol introduces a eters that depend on the probability of a node being dis- single layer of transitive trust, and therefore it is neces- honest. Storage systems like OceanStore, use Rabin’s in- sary for Alice and Bob to be quite certain that C is indeed formation dispersion algorithm to divide files into chunks honest. Its storage and communication cost is O( |V |), stored and retrieved to reconstruct a file. The degree of which is the same order of magnitude as SybilLimit and redundancy required crucially depends on the probability SybilGuard. Modifying this simple minded protocol into nodes are compromised. Such algorithms can use SybilIn- a fully fledged one-hop distributed hash table [13] is an fer to foil Sybil attacks, and calculate the probability the interesting challenge for future work. set of nodes to be used to store particular files contains SybilInfer can also be applied to specific on-line com- certain fractions of honest nodes. This probability can in munities. In such cases a set of nodes belonging to a cer- turn inform the choice of parameters to maximise the sur-
  • 13. vivability of the files. approaches that treat user statements beyond just black Finally a note of warning should accompany any Sybil and white and make explicit use of probabilistic reasoning prevention scheme: it is not the goal of SybilInfer (or any and statements as their outputs will become increasingly other such scheme) to ensure that all adversary nodes are important in building safe systems. filtered out of the network. The job of SybilInfer is to ensure that a certain fraction of existing adversary nodes Acknowledgements. This work was performed while cannot significantly increase its control of the system by Prateek Mittal was an intern at Microsoft Research, Cam- introducing ‘fake’ Sybil identities. As it is illustrated by bridge, UK. The authors would like to thank Carmela the examples on anonymous communications and stor- Troncoso, Emilia K¨ sper, Chris Lesniewski-Laas, Nikita a age, system specific mechanisms are still crucial to ensure Borisov and Steven Murdoch for their helpful comments that a minority of adversary entities cannot compromise on the research and draft manuscript. Our shepherd, Ta- any security properties. SybilInfer can only ensure that dayoshi Kohno, and ISOC NDSS 2009 reviewers pro- this minority remains a minority and cannot artificially in- vided valuable feedback to improve this work. Barry Law- crease its share of the network. son was very helpful with the technical preparation of the Sybil defence schemes are also bound to contain false- final manuscript. positives, namely honest nodes labeled as Sybils. For this reason other mechanisms need to be in place to ensure that References those users can seek a remedy to the automatic classifica- tion they suffered from the system, potentially by making [1] U. A. Acar. Self-adjusting computation. PhD thesis, Pitts- some additional effort. Proofs-of-work, social introduc- burgh, PA, USA, 2005. Co-Chair-Guy Blelloch and Co- tion services, or even payment targeting those users could Chair-Robert Harper. be a way of ensuring SybilInfer is not turned into an auto- [2] B. Awerbuch. Optimal distributed algorithms for mini- mated social exclusion mechanism. mum weight spanning tree, counting, leader election and related problems (detailed summary). In STOC, pages 6 Conclusion 230–240. ACM, 1987. [3] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S. Wallach. Secure routing for structured peer-to-peer over- We presented SybilInfer, an algorithm aimed at detect- lay networks. In OSDI ’02: Proceedings of the 5th sym- ing Sybil attacks against peer-to-peer networks or open posium on Operating systems design and implementation, services, and label which nodes are honest and which are pages 299–314, New York, NY, USA, 2002. ACM. dishonest. Its applicability and performance in this task is [4] F. Dabek. A cooperative file system. Master’s thesis, MIT, an order of magnitude better than previous systems mak- September 2001. [5] G. Danezis, R. Dingledine, and N. Mathewson. Mixmin- ing similar assumptions, like SybilGuard and SybilLimit, ion: Design of a Type III Anonymous Remailer Protocol. even though it requires nodes to know a substantial part of In Proceedings of the 2003 IEEE Symposium on Security the social structure within which honest nodes are embed- and Privacy, pages 2–15, May 2003. ded. SybilInfer illustrates how robust Sybil defences can [6] G. Danezis, C. Lesniewski-Laas, M. F. Kaashoek, and be bootstrapped from distributed trust judgements, instead R. Anderson. Sybil-resistant dht routing. In ESORICS of a centralised identity scheme. This is a key enabler for 2005: Proceedings of the European Symp. Research in secure peer-to-peer architectures as well as collaborative Computer Security, pages 305–318, 2005. web 2.0 applications. [7] R. Dingledine, N. Mathewson, and P. Syverson. Tor: the SybilInfer is also significant due to the use of machine second-generation onion router. In SSYM’04: Proceed- learning techniques and their careful application to a secu- ings of the 13th conference on USENIX Security Sympo- sium, pages 21–21, Berkeley, CA, USA, 2004. USENIX rity problem. Cross disciplinary designs are a challenge, Association. and applying probabilistic techniques to system defence [8] J. R. Douceur. The sybil attack. In IPTPS ’01: Re- should not be at the expense of strength of protection, and vised Papers from the First International Workshop on strategy-proof designs. Our ability to demonstrate that the Peer-to-Peer Systems, pages 251–260, London, UK, 2002. underlying mechanisms behind SybilInfer is not suscepti- Springer-Verlag. ble to gaming by an adversary arranging its Sybil nodes in [9] L. Goodman. Snowball sampling. Annals of Mathematical a particular topology is, in this aspect, a very import part Statistics, 32:148–170. of the SybilInfer security design. [10] W. K. Hastings. Monte carlo sampling methods us- Yet machine learning techniques that take explicitly ing markov chains and their applications. Biometrika, 57(1):97–109, April 1970. into account noise and incomplete information, as the one [11] R. Kannan, S. Vempala, and A. Vetta. On clusterings: contained in the social graphs, are key to building secu- Good, bad and spectral. J. ACM, 51(3):497–515, 2004. rity systems that degrade well when theoretical guarantees [12] L. Lamport, R. Shostak, and M. Pease. The byzantine are not exactly matching a messy reality. As security in- generals problem. ACM Trans. Program. Lang. Syst., creasingly becomes a “people” problem, it is likely that 4(3):382–401, 1982.