06

SybilInfer: Detecting Sybil Nodes using Social Networks

George Danezis Prateek Mittal
Microsoft Research, University of Illinois at Urbana-Champaign,
Cambridge, UK. Illinois, USA.
gdane@microsoft.com mittal2@uiuc.edu

Abstract contributors. On-line forums, starting with USENET [19],
to contemporary blogs or virtual worlds like Second
SybilInfer is an algorithm for labelling nodes in a so- Life [22] always have to deal with the issue of disruption
cial network as honest users or Sybils controlled by an in the discussion threads, with persistent abusers coming
adversary. At the heart of SybilInfer lies a probabilis- back under different names. All these are forms of Sybil
tic model of honest social networks, and an inference attacks at the high-level application layers.
engine that returns potential regions of dishonest nodes. There are two schools of Sybil defence mechanisms,
The Bayesian inference approach to Sybil detection comes the centralised and decentralised ones. Centralised de-
with the advantage label has an assigned probability, indi- fences assume the existence of an authority that is capable
cating its degree of certainty. We prove through analytical of doing admission control for the network [3]. Its role is
results as well as experiments on simulated and real-world to rate limit the introduction of ‘fake’ identities, to ensure
network topologies that, given standard constraints on the that the fraction of corrupt nodes remains under a certain
adversary, SybilInfer is secure, in that it successfully dis- threshold. The practicalities of running such an authority
tinguishes between honest and dishonest nodes and is not are very system-specific and in general it would have to
susceptible to manipulation by the adversary. Further- act as a Public Key Certification Authority as well as a
more, our results show that SybilInfer outperforms state guardian of the moral standing of the nodes introduced –
of the art algorithms, both in being more widely applica- a very difficult problem in practice. Such centralised so-
ble, as well as providing vastly more accurate results. lutions are also at odds with the decentralisation guiding
principle of peer-to-peer systems.
Decentralised approaches recognise the difficulty in
1 Introduction having a single authority vouching for nodes, and dis-
tribute this task across all nodes of the system. The first
The Peer-to-peer paradigm allows cooperating users to such proposal is Advogato [14], which aimed to reduce
enjoy a service with little or no need for any centralised abuse in on-line services, followed by a proposal to use
infrastructure. While communication [24] and storage [4] introduction graphs of Distributed Hash Tables [6] to limit
systems employing this design philosophy have been pro- the potential for routing disruption in those systems. The
posed, the lack of any centralised control over identities state of the art SybilGuard [27] and SybiLimit [26] pro-
opens such systems to Sybil attacks [8]: a few malicious pose the use of social networks to mitigate Sybil attacks.
nodes can simulate the presence of a very large number As we will see, SybilGuard suffers from high false neg-
of nodes, to take over or disrupt key functions of the atives, while SybilLimit makes unrealistic assumptions
distributed or peer-to-peer system. Any attempt to build about the knowledge of number of honest nodes in the net-
fault-tolerance mechanisms is doomed since adversaries work. In both cases the systems Sybil detection strategies
can control arbitrary fractions of the system nodes. This are based on heuristics that are not optimal.
Sybil attack is further made practical through the use of Our key contribution is to propose SybilInfer, a method
the existing large number of compromised networked ma- for detecting Sybil nodes in a social network, that makes
chines (often called zombies) being part of bot-nets. use of all information available to the defenders. The for-
Similar problems plague Web 2.0 applications that mal model underlying our approach casts the problem of
rely on collaborative tagging, filtering and editing, like detecting Sybil nodes in the context of Bayesian Infer-
Wikipedia [31], or del.icio.us [28]. A single user ence: given a set of stated relationships between nodes,
can register under different pseudonyms, and bypass any the task is to label nodes as honest or dishonest. Based on
elections of velocity check mechanism that attempts to some simple and generic assumptions, like the fact that
guarantee the quality of the data through the plurality of social networks are fast mixing [18], we sample cuts in

the social graph according to the probability they divide Limit, the current state of the art Sybil defence mecha-
it into honest and dishonest regions. These samples not nisms. We also propose extensions that enable our solu-
only allow us to label nodes as honest or Sybil attackers, tion to be implemented in decentralised settings.
but also to associate with each label output by our algo- This paper is organised in the following fashion: in sec-
rithm a degree of certainty. tion 2 we present an overview of our approach that can
The proposed techniques can be applied in a wide vari- be used as a road-map to the technical sections. In sec-
ety of settings where high reliability peer-to-peer systems, tion 3 we present our security assumptions, threat model,
or Sybil-resistant collaborative mechanisms, are required the probabilistic model and sampler underpinning Sybil-
even under attack: Infer; a security evaluation follows in section 4, providing
• Secure routing in Distributed Hash Tables motivated analytical as well as experimental arguments supporting
early research into this field, and our proposal can the security of the method proposed along with a compar-
be used instead of a centralised authority to limit the ison with SybilInfer. Section 5 discusses the settings in
fraction of dishonest nodes, that could disrupt rout- which SybilInfer can be fruitfully used, followed by some
ing [3]. conclusions in section 6.

• In public anonymous communication networks, such
2 Overview
as Tor [7], our techniques can be used to eliminate the
potential for a single entity introducing a large num-
ber of nodes, and de-anonymize users’ circuits. This The SybilInfer algorithm takes as an input a social
was so far a key open problem for securely scaling graph G and a single known good node that is part of this
such systems. graph. The following conceptual steps are then applied to
return the probability each node is honest or controlled by
• Leader Election [2] and Byzantine agreement [12] a Sybil attacker:
mechanisms that were rendered useless by the Sybil • A set of traces T are generated and stored by per-
attack can again be of use, after Sybil nodes have forming special random walks over the social graph
been detected and eliminated from the social graph. G. These are the only information retained about the
• Finally, detecting Sybil accounts is a key step in pre- graph for the rest of the SybilInfer algorithm, and
venting false email accounts used for spam, or pre- their generation is detailed in section 3.1.
venting trolling and abuse of on-line communities • A probabilistic model is then defined that describes
and web-forums. Our techniques can be applied in the likelihood a trace T was generated by a specific
all those settings, to fight spam and abuse [14]. honest set of nodes within G, called X. This model
SybilInfer applies to settings where a peer-to-peer or dis- is based on our assumptions that social networks are
tributed system is somehow based on or aware of social fast mixing, while the transitions to dishonest regions
connections between users. Properties of natural social are slow. Given the probabilistic model, the traces T
graphs are used to classify nodes are honest or Sybils. and the set of honest nodes we are able to calculate
While this approach might not be applicable to very tradi- Pr[T |X is honest]. The calculation of this quantity
tional peer-to-peer systems [24], it is more an more com- is the subject of section 3.1 and section 3.2.
mon for designers to make distributed systems aware of • Once the probabilistic model is defined, we use
the social environment of their users. Third party social Bayes theorem to calculate for any set of nodes X
network services [29, 30], can also be used to extract so- and the generated trace T , the probability that X con-
cial information to protect systems against sybil attacks sists of honest nodes. Mathematically this quality is
using SybilInfer. Section 5 details deployment strategies defined as Pr[X is honest|T ]. The use of Bayes the-
for SybilInfer and how it is applicable to current systems. orem is described in section 3.1.
We show analytically that SybilInfer is, from a theo-
retical perspective, very powerful: under ideal circum- • Since it is not possible to simply enumerate all sub-
stances an adversary gains no advantage by introducing sets of nodes X of the graph G, we instead sample
into the social network any additional Sybil nodes that are from the distribution of honest node sets X, to only
not ‘naturally’ connected to the rest of the social structure. get a few X0 , . . . , XN ∼ Pr[X is honest|T ]. Using
Even linking all dishonest nodes with each other (with- those representative sample sets of honest nodes, we
out adding any Sybils) changes the characteristics of their can calculate the probability any node in the system
social sub-graph, and can under some circumstances be is honest or dishonest. Sampling and the approxi-
detected. We demonstrate the practical efficacy of our ap- mation of the sought marginal probabilities are the
proach using both synthetic scale-free topologies as well subject of section 3.3.
as real-world LiveJournal data. We show very significant The key conceptual difficulty of our approach is the
security improvements over both SybilGuard and Sybil- definition of the probabilistic model over the traces T , and

graph. Several authors have shown that real-life so-
cial networks are indeed fast mixing [16, 18].

3. A node knows the complete social network topology
(G) : social network topologies are relatively static,
and it is feasible to obtain a global snapshot of the
network. Friendship relationships are already public
Attack
Edges
data for popular social networks like Facebook [29]
and Orkut [30]. This assumption can be relaxed to
using sub-graphs, making SybilInfer applicable to
Honest Nodes Sybil Nodes decentralised settings.
Assumptions (1) and (2) are identical to those made by
Figure 1. Illustration of Honest nodes, Sybil the SybilGuard and SybilInfer systems. Previously, the
nodes and attack edges between them. authors of SybilGuard [27] observed that when the adver-
sary creates too many Sybil nodes, then the graph G has a
small cut: a set of edges that together have small station-
ary probability and whose removal disconnects the graph
its inversion using Bayes theorem to define a probability
into two large sub-graphs.
distribution over all possible honest sets of nodes X. This
distribution describes the likelihood that a specific set of This intuition can be pushed much further to build su-
nodes is honest. The key technical challenge is making perior Sybil defences. It has been shown [20] that the
use of this distribution to extract the sought probability presence of a small cut in a graph results in slow mix-
each node is honest or dishonest, that we achieve via sam- ing which means that fast mixing implies the absence of
pling. Section 3, describes in some detail how these issues small cuts. Applied to social graphs this observation un-
are tackled by SybilInfer. derpins the key intuition behind our Sybil defence mech-
anism: the mixing between honest nodes in the social net-
works is fast, while the mixing between honest nodes and
3 Model & Algorithm dishonest nodes is slow. Thus, computing the set of honest
nodes in the graph is related to computing the bottleneck
Let us denote the social network topology as a graph G cut of the graph.
comprising vertices V , representing people and edges E, One way of formalising the notion of a bottleneck cut,
representing trust relationships between people. We con- is in terms of graph conductance (Φ) [11], defined as:
sider the friendship relationship to be an undirected edge
in the graph G. Such an edge indicates that two nodes trust Φ= min ΦX ,
X⊂V :π(X)<1/2
each other to not be part of a Sybil attack. Furthermore,
we denote the friendship relationship between an attacker
node and an honest node as an attack edge and the hon- where ΦX is defined as:
est node connected to an attacker node as a naive node or
Σx∈X Σy∈X π(x)Pxy
/
misguided node. Different types of nodes are illustrated in ΦX = ,
figure 1. These relationships must be understood by users π(X)
as having security implications, to restrict the promiscu-
ous behaviour often observed in current social networks, and π(·) is the stationary distribution of the graph. Intu-
where users often flag strangers as their friends [23]. itively for any subset of vertices X ⊂ V its conductance
We build our Sybil defence around the following as- ΦX represents the probability of going from X to the rest
sumptions: of the graph, normalised by the probability weight of be-
ing on X. When the value is minimal the bottleneck cut
1. At least one honest node in the network is known. In in the graph is detected.
practise, each node trying to detect Sybil nodes can
Note that performing a brute force search for this bot-
use itself as the apriori honest node. This assump-
tleneck cut is computationally infeasible (it is actually NP-
tion is necessary to break symmetry: otherwise an
Hard, given its relationship to the sparse-cut problem).
attacker could simply mirror the honest social struc-
Furthermore, finding the exact smallest cut is not as im-
ture, and any detector would not be able to distin-
portant as being able to judge how likely any cut is, to be
guish which of the two regions is the honest one.
dividing nodes into an honest and dishonest region. This
2. Social networks are fast mixing: this means that a probability is related to the deviation of the size of any
random walk on the social graph converges quickly cut from what we would expect in a natural, fast mixing,
to a node following the stationary distribution of the social network.

3.1 Inferring honest sets ProbXX

ProbXX ProbXX
In this paper, we propose a framework based on
Bayesian inference to detect approximate cuts between X X
honest and Sybil node regions in a social graph and use
those to infer the labels of each node. A key strength of
our approach is that it, not only associates labels to each ProbXX
node, but also finds the correct probability of error that (a) A schematic representation of tran-
could be used by peer-to-peer or distributed applications sition probabilities between honest X
¯
and dishonest X regions of the social
to select nodes.
network.
The first step of SybilInfer is the generation of a set of
random walks on the social graph G. These walks are gen-
erated by performing a number s of random walks, start- ProbXX = 1 / |V| + Exx

Probability
ing from each node in the graph (i.e. a total of s · |V | 1 / |V|

walks.) A special probability transition matrix is used, ProbXX = 1 / |V| - Exx
defined as follows:
Honest Nodes Sybil nodes
1 1
min( di , dj ) if i → j is an edge in G
Pij = , (b) The model of the probability a short random walk of length
0 otherwise O(log |V |) starting at an honest node ends on a particular honest or
dishonest (Sybil) node. If no Sybils are present the network is fast
where di denotes the degree of vertex i in G. mixing and the probability converges to 1/|V |, otherwise it is biased
This choice of transition probabilities ensures that the towards landing on an honest node.
stationary distribution of the random walk is uniform over
all vertices |V |. The length of the random walks is l = Figure 2. Illustrations of the SybilInfer mod-
O(log |V |), which is rather short, while the number of ran- els
dom walks per node (denoted by s) is a tunable parameter
of the model. Only the starting vertex and the ending ver-
tex of each random walk are used by the algorithm, and
we denote this set of vertex-pairs, also called the traces,
by T . Note that since the set X is honest, we assume (by as-
Now consider any cut X ⊂ V of nodes in the graph, sumption (2)) fast mixing amongst its elements, meaning
such that the a-prior honest node is an element of X. We that a short random walk reaches any element of the sub-
are interested in the probability that the vertices in set X set X uniformly at random. On the other hand a random
are all honest nodes, given our set of traces T , i.e. P (X = walk starting in X is less likely to end up in the dishonest
Honest|T ). Through the application of Bayes theorem we ¯
region X, since there should be an abnormally small cut
have an expression of this probability: between them. (This intuition is illustrated in figure 2(a).)
Therefore we approximate the probability that a short ran-
P (T |X = Honest) · P (X = Honest)
P (X = Honest|T ) = , dom walk of length l = O(log |V |) starts in X and ends at
Z
a particular node in X is given by ProbXX = Π + EXX ,
where Z is the normalization constant given by: Z = where Π is the stationary distribution given by 1/|V |, for
ΣX⊂V P (T |X = Honest) · P (X = Honest). Note that some EXX > 0. Similarly, we approximate the proba-
Z is difficult to compute because it involves the summa- bility that a random walk starts in X and does not end
tion of an exponential number of terms in the size of |V |. in X is given by ProbX X = Π − EX X . Notice that
¯ ¯
Only being able to compute this probability up to a mul- ProbXX > ProbX X , which captures the property that
¯
tiplicative constant Z is not an impediment. The a-prior there is fast mixing amongst honest nodes and slow mix-
distribution P (X = Honest) can be used to encode any ing between honest and dishonest nodes. The approxi-
further knowledge about the honest nodes, or can simply mate probabilities ProbXX and ProbX X and their likely
¯
be set to be uniform over all possible cuts. gap from the ideal 1/|V | are illustrated in figure 2(b).
Bayes theorem has allowed us to reduce the initial
problem of inferring the set of good nodes X from the set Let NXX be the number of traces in T starting in the
of traces T , to simply being able to assign a probability honest set X and ending in same honest set X. Let NX X ¯
to each set of traces T given a set of honest nodes X, i.e. be the number of random walks that start at the honest set
calculating P (T |X = Honest). Our only remaining theo- ¯
X and end in the dishonest set X. NX X and NXX are
¯ ¯ ¯
retical task is deriving this probability, given our model’s defined similarly. Given the approximate probabilities of
assumptions. transitions from one set to the other and the counts of such

transitions we can ascribe a probability to the trace: Approximating ProbXX through the traces T provides
us with a simple expression for the sought probability,
P (T |X = Honest) = (ProbXX )NXX · (ProbX X )NX X ·
¯
¯
based simply on the number of walks starting in one re-
(ProbX X )NX X · (ProbXX )NXX ,
¯ ¯
¯ ¯
¯
¯ gion and ending in another:

where ProbX X and ProbXX are the probabilities a walk NXX 1 NXX
¯ ¯ ¯ P (T |X = Honest) = ( · ) ·
starting in the dishonest region ends in the dishonest or NXX + NX X ¯ |X|
honest regions respectively. NX X¯ 1
( · ¯ )NX X ·
¯

The model described by P (T |X = Honest) is an ap- NX X + NXX
¯ |X|
proximation to reality that is suitable enough to perform NX X
¯ ¯ 1
Sybil detection. It is of course unlikely that a random walk ( · ¯ )NX X ·
¯ ¯

NX X + NXX
¯ ¯ ¯ |X|
starting at an honest node will have a uniform probabil- NXX
¯ 1 NXX
ity to land on all honest or dishonest nodes respectively. ( · ) ¯ ,
NXX + NX X
¯ ¯ ¯ |X|
Yet this simple probabilistic model relating the starting
and ending nodes of traces is rich enough to capture the This expression concludes the definition of our probabilis-
“probability gap” between landing on an honest or dis- tic model, and contains only quantities that can be ex-
honest node, as illustrated in figure 2(b), and suitable for tracted from either the known set of nodes X, or the set
Sybil detection. of traces T that is assigned a probability. Note that we do
not assume any prior knowledge of the size of the honest
3.2 Approximating EXX ¯
set, and it is simply a variable |X| or |X| of the model.
Next, we shall describe how to sample from the distri-
We have reduced the problem of calculating P (T |X = bution P (X = Honest|T ) using the Metropolis-Hastings
Honest) to finding a suitable EXX , representing the ‘gap’ algorithm.
between the case when the full graph is fast mixing (for
EXX = 0) and when there is a distinctive Sybil attack (in 3.3 Sampling honest configurations
which case EXX >> 0.)
One approach could be to try inferring EXX through a At the heart of our Sybil detection techniques lies a
trivial modification of our analysis to co-estimate P (X = model that assigns a probability to each sub-set of nodes
Honest, EXX |T ). Another possibility is to approximate of being honest. This probability P (X = Honest|T ) can
EXX or ProbXX directly, by choosing the most likely be calculated up to a constant multiplicative factor Z, that
candidate value for each configuration of honest nodes X is not easily computable. Hence, instead of directly calcu-
considered. This can be done through the conductance or lating this probability for any configuration of nodes X,
through sampling random walks on the social graph. we will attempt instead to sample configurations Xi fol-
Given the full graph G, ProbXX can be approximated lowing this distribution. Those samples are used to esti-
Σx∈X Σy∈X Π(x)P lxy l mate the marginal probability that any specific node, or
as ProbXX = Π(X) , where Pxy is the prob-
collections of nodes, are honest or Sybil attackers.
ability that a random walk of length l starting at x ends in
Our sampler for P (X = Honest|T ) is based on
y. This approximation is very closely related to the con-
¯ the established Metropolis-Hastings algorithm [10] (MH),
ductance of the set X and X. Yet computing this measure
which is an instance of a Markov Chain Monte Carlo sam-
would require some effort.
pler. In a nutshell, the MH algorithm holds at any point
Notice that ProbXX , as calculated above, can also
a sample X0 . Based on the X0 sample a new candidate
be approximated by performing many random walks of
sample X is proposed according to a probability distribu-
length l starting at X and computing the fraction of those
tion Q, with probability Q(X |X0 ). The new sample X
walks that end in X. Interestingly our traces already con-
is ‘accepted’ to replace X0 with probability α:
tain random walks over the graph of exactly the appropri-
ate length, and therefore we can reuse them to estimate a P (X |T ) · Q(X0 |X )
good ProbXX and related probabilities. Given the counts α = min( , 1)
P (X0 |T ) · Q(X |X0 )
NXX , NX X , NXX and NX X :
¯ ¯ ¯ ¯
otherwise the original sample X0 is retained. It can be
NXX 1 shown that after multiple iterations this yields samples X
ProbXX = ·
NXX + NX X |X|
¯ according to the distribution P (X|T ) irrespective of the
way new candidate sets X are proposed or the initial state
and of the algorithm, i.e. a more likely state X will pop-out
NX X
¯ ¯ 1
ProbX X =
¯ ¯ · ¯ , more frequently from the sampler, than less likely states.
NX X + NXX |X|
¯ ¯ ¯
A relatively naive strategy can be used to propose can-
and, ProbX X = 1−ProbXX and ProbXX = 1−ProbX X .
¯ ¯ ¯ ¯ didate states X given X0 for our problem. It relies on

simply considering sets of nodes X that are only different 4.1 Theoretical results
by a single member from X0 . Thus, with some probability
¯
padd a random node x ∈ X0 is added to the set to form the The security of our Sybil detection scheme hinges on
candidate X = X0 ∪ x. Alternatively, with probability two important results. First, we show that we can detect
premove , a member of X0 is removed from the set of nodes, whether a network is under Sybil attack, based on the so-
defining X = X0 ∩ x for x ∈ X0 . It is trivial to calculate cial graph. Second, we show that we are able to detect
the probabilities Q(X |X0 ) and Q(X |X0 ) based on padd , Sybil attackers connected to the honest social graph, and
premove and using a uniformly at random choice over nodes this for any attacker topology.
¯ ¯
in X0 , X0 , X and X when necessary. Our first result states that:
A key issue when utilizing the MH algorithm is decid- T HEOREM A. In the absence of any Sybil attack,
ing how many iterations are necessary to get independent the distribution of P (X = Honest|T ), for a given
samples. Our rule of thumb is that |V | · log |V | steps are size |X|, is close to uniform, and all cuts are equally
likely to guarantee convergence to the target distribution likely (EXX 0).
P . After that number of steps the coupon collector’s the- This result is based on our assumption that a random walk
orem states that each node in the graph would have been over a social network is fast mixing meaning that, after
considered at least once by the sampler, and assigned to log(|V |) steps, it visits nodes drawn from the stationary
the honest or dishonest set. In practice, given very large distribution of the graph. In our case the random walk is
traces T , the number of nodes that are difficult to cate- performed over a slightly modified version of the social
gorise is very small, and a non-naive sampler requires few graph, where the transition probability attached to each
steps to produce good samples (after a certain burn in- link ij is:
period that allows it to detect the most likely honest re-
1 1
gion.) min( di , dj ) if i → j is an edge in G
Pij = ,
Finally, given a set of N samples Xi ∼ P (X|T ) out- 0 otherwise
put by the MH algorithm it is possible to calculate the
marginal probabilities any node is honest. This is key out- which guarantees that the stationary distribution is uni-
1
put of the SybilInfer algorithm: given a node i it is pos- form over all nodes (i.e. Π = |V | ). Therefore we ex-
sible to associate a probability it is honest by calculating: pect that in the absence of an adversary the short walks
I(i∈Xj ) in T to end at a network node drawn at random amongst
Pr[i is honest] = j∈[0,N −1)N , where I(i ∈ Xj ) is
an indicator variable taking value 1 if node i is in the hon- all nodes |V |. In turn this means that the number of end
est sample Xj , and value zero otherwise. Enough samples nodes in the set of traces T , that end in the honest set X is
can be extracted from the sampler to estimate this proba- NXX = lim|TX |→∞ |X| · |TX |, where TX is the number
|V |
bility with an arbitrary degree of precision. of traces in T starting within the set |X|. Substituting this
More sophisticated samplers would make use of a bet- in the equations presented in section 2.1 and 2.2 we get:
ter strategy to propose candidate states X for each iter- NXX 1
ation. The choice of X can, for example, be biased to- ProbXX = · ⇒ (1)
NXX + NX X |X|
¯
wards adding or removing nodes according to how often NXX 1
random walks starting at the single honest node land on Π + EXX = · ⇒ (2)
NXX + NX X |X|
¯
them. We expect nodes that are reached often by random
walks starting in the honest region to be honest, and the 1 (|X|/|V |) · |TX | 1
+ EXX = · ⇒ (3)
opposite to be true for dishonest nodes. In all cases this |V | |TX | |X|
bias is simply an optimization for the sampling to take EXX = 0 (4)
fewer iterations, and does not affect the correctness of the
results. As a result, by sufficiently increasing the number of ran-
dom walks T performed on the social graph, we can get
EXX arbitrarily close to zero. In turn this means that our
4 Security evaluation distribution P (T |X = Honest) is uniform for given sizes
of |X|, given our uniform a-prior P (X = Honest|T ).
In a nutshell by estimating EXX for any sample X re-
In this section we discuss the security of SybilInfer turned by the MH algorithm, and testing how close it is
when under Sybil attack. We show analytically that we to zero we detect whether it corresponds to an attack (as
can detect when a social network suffers from a Sybil at- we will see from theorem B) or a natural cut in the graph.
tack, and correctly label the Sybil nodes. Our assumptions We can increase the precision of the detector arbitrarily by
and full proposal are then tested experimentally on syn- increasing the number of walks T .
thetic as well as real-world data sets, indicating that the Our second results relates to the behaviour of the sys-
theoretical guarantees hold. tem under Sybil attack:

T HEOREM B. Connecting any additional Sybil nodes 4.2 Practical considerations
to the social network, through a set of corrupt nodes,
lowers the dishonest sub-graph conductance to the
Models and assumptions are always an approximation
honest region, leading to slow mixing, and hence we
of the real world. As a result, careful evaluation is nec-
expect EXX > 0.
essary to ensure that the theorems are robust to deviations
¯ from the ideal behaviour assumed so far.
First we define the dishonest set X0 comprising all dis-
honest nodes connected to honest nodes in the graph. The The first practical issue concerns the fast mixing prop-
¯
set XS contains all dishonest nodes in the system, includ- erties of social networks. There is a lot of evidence that
¯
ing nodes in X0 and the Sybil nodes attached to them. It social networks exhibit this behaviour [18], and previous
¯ ¯
must hold that |X0 | < |XS |, in case there is a Sybil attack. proposals relating to Sybil defence use and validate the
Second we note that the probability of a transition be- same assumption [27, 26]. SybilInfer makes an further
tween an honest node i ∈ X and a dishonest node j ∈ X ¯ assumption, namely that the modified random walk over
cannot increase through Sybil attacks, since it is equal to the social network, that yields a uniform distribution over
1 1
Pij = min( di , dj ). At worst the corrupt node will in- all nodes, is also fast mixing for real social networks. The
1 1
crease its degree by connecting Sybils which has only the probability Pij = min( di , dj ), depends on the mutual
potential to decrease this probability. Therefore we have degrees of the nodes i and j, and makes the transition to
that x∈XS y∈XS Pxy ≤ x∈X0 y∈X0 Pxy . Com-
¯ ¯ ¯ ¯
nodes of higher degree less likely. This effect has the po-
bining the two inequalities we get that: tential to slow down mixing times in the honest case, par-
ticularly when there is a high variation in node degrees.
This effect can be alleviated by removing random edges
¯ Pxy ¯ ¯ Pxy from high degree nodes to guarantee that the ratio of max-
y∈XS x∈X0 y∈X0
¯S | < ¯ ⇔ (5) imum and minimum node degree in the graph is bounded
|X |X0 |
(an approach also used by SybilLimit.)
1 1
y∈XS |V | Pxy
¯ ¯
x∈X0 y∈X0 |V | Pxy
¯ The second consideration also relates to the fast mix-
¯ < ¯ 1 ⇔ (6)
|XS | 1
|V | |X0 | |V | ing properties of networks. While in theory fast mixing
networks should not exhibit any small cuts, or regions of
¯
y∈XS π(x)Pxy ¯
x∈X0 y∈X0 π(x)Pxy
¯
< ⇔ (7) abnormally low conductance, in practice they do. This
¯S )
Π(X Π(X¯0)
is especially true for regions with new users that have not
¯ ¯
Φ(XS ) < Φ(X0 ). (8) had the chance to connect to many others, as well as social
networks that only contain users with particular charac-
teristics (like interest, locality, or administrative groups.)
This result signifies that independently of the topology of Those regions yield, even in the honest case, sample cuts
the adversary region the conductance of a sub-graph con- that have the potential to be mistaken as attacks. This ef-
taining Sybil nodes will be lower compared with the con- fect forces us to consider a threshold EXX under which
ductance of the sub-graph of nodes that are simply com- we consider cuts to be simply false positives. In turn this
promised and connected to the social network. Lower makes the guarantees of schemes weaker in practice than
conductance in turn leads to slower mixing times be- in theory, since the adversary can introduce Sybils into a
tween honest and dishonest regions [20] which means that region undetected, as long as the set threshold EXX is not
EXX > 0, even for very few Sybils. This deviation is exceeded.
subject to the sampling variation introduced by the trace The threshold EXX is chosen to be α·EXXmax , where
T , but the error can be made arbitrarily small by sampling 1 1
EXXmax = |X| − |V | , and α is a constant between 0
more random walks in T . and 1. Here α can be used to control the tradeoff between
false positives and false negatives. A higher value of alpha
These two results are very strong: they indicate that, in
enables the adversary to insert a larger number of sybils
theory, a set of compromised nodes connecting to honest
undetected but reduces the false positives. On the other
nodes in a social network, would get no advantage by con-
hand, a smaller value of α reduces the number of Sybils
necting any additional Sybil nodes, since that would lead
that can be introduced undetected but at the cost of higher
to their detection. Sampling regions of the graph with ab-
number of false positives.
normally small conductance, through the use of the ran-
dom walks T , should lead to their discovery, which is the Given these practical considerations, we can formulate
theoretical foundation of our technique. Furthermore we a weaker security guarantee for SybilInfer:
established that techniques based on detecting abnormali- T HEOREM C. Given a certain “natural” threshold
ties in the value of EXX are strategy proof, meaning that value for EXX in an honest social network, a dis-
there is no attacker strategy (in terms of special adversary honest region performing a Sybil attack will exceed
topology) to foil detection. it after introducing a certain number of Sybil nodes.

Scale Free Topology: 1000 nodes, 100 malicious Scale Free Topology: 1000 nodes, 100 malicious
250 250
False Negatives, alpha=0.65 False Negatives,alpha=0.65
False Positives, alpha=0.65 False Positives,alpha=0.65
False Positives, alpha=0.7 False Positives,alpha=0.7
200 False Positives, alpha=0.75 200 False Positives,alpha=0.75
Number of identities

Number of identities
150 150

100 100

50 50

0 0
0 100 200 300 400 500 600 700 800 900 1000 0 100 200 300 400 500 600 700 800 900 1000
Number of additional sybil identities (x) Number of additional sybil identities (x)

(a) Average degree compromised nodes (b) Low degree compromised nodes

Figure 3. Synthetic Scale Free Topology: SybilInfer Evaluation as a function of additional Sybil
identities (ψ) introduced by colluding entities. False negatives denote the total number of dishon-
est identities accepted by SybilInfer while false positives denote the number of honest nodes that
are misclassified.

This theorem is the result of Theorem B that demonstrates given node is proportional to the degree of that node; i.e.:
that the conductance keeps decreasing as the number of Pr[(v, i)] = didj , where di is the degree of node i. In
j
Sybils attached to a dishonest region increases. This in our simulations, we use m = 5, giving an average node
turn will slow down the mixing time between the hon- degree of 10.
est and dishonest region, leading to an increasingly large In such a scale free topology of 1000 nodes, we con-
EXX . sider a fraction f = 10% of the nodes to be compro-
Intuitively, as the attack becomes larger, the cut be- mised by a single adversary. The compromised nodes are
tween honest and dishonest nodes becomes increasingly distributed uniformly at random in the topology. Com-
distinct, which makes Sybil detection easier. It is impor- promised nodes introduce ψ additional Sybil nodes and
tant to note that as more Sybils are introduced into the establish a scale free topology amongst themselves. We
dishonest region, the probability of the whole region be- configure SybilInfer to use 20 samples for computing the
ing detected as an attack increases, not only the new Sybil marginal probabilities, and label as honest the set of nodes
nodes. This provides strong disincentives to the adver- whose marginal probability of being honest is greater than
sary from performing larger Sybil attacks, since even pre- 0.5. The experiment is repeated 100 times with different
viously undetected malicious nodes might be flagged as scale free topologies.
Sybils.
Figure 3(a) illustrates the false positives and false neg-
atives classifications returned by SybilInfer, for varying
4.3 Experimental evaluation using synthetic value of ψ, the number of additional Sybil nodes intro-
data duced. We observe that when ψ < 100, α = 0.7 , then all
the malicious identities are classified as honest by Sybil-
We first experimentally demonstrate the validity of Infer. However, there is a threshold at ψ = 100, be-
Theorem C using synthetic topologies. Our experiments yond which all of the Sybil identities, including the ini-
consist of building synthetic social network topologies, in- tially compromised entities are flagged as attackers. This
jecting a variable number of Sybil nodes, and applying is because beyond this point, the EXX for the Sybil region
SybilInfer to establish how many of them are detected. A exceeds the natural threshold leading to full detection, val-
key issue we explore is the number of introduced Sybil idating Theorem C. The value ψ = 100 is clearly the op-
nodes under which Sybil attacks are not detected. timal attack strategy, in which the attacker can introduce
Social networks exhibit a scale-free (or power law) the maximal number of Sybils without being detected. We
node degree topology [21]. Our network synthesis al- also note that even in the worst case, the false positives are
gorithm replicates this structure through preferential at- less than 5%. The false positive nodes have been misclas-
tachment, following the methodology of Nagaraja [18]. sified because these nodes are closer to the Sybil region;
We create m0 initial nodes connected in a clique, and SybilInfer is thus incentive compatible in the sense that
then for each new node v, we create m new edges to nodes which have mostly honest friends are likely not to
existing nodes, such that the probability of choosing any be misclassified.

Figure 4 presents a plot of the maximum Sybil iden-
Scale Free Topology: 1000 nodes
0.4
tities as a function of the compromised fraction of nodes
False Negatives

0.35
Theoretical Prediction
y=x f . Note that our theoretical prediction (which is strategy-
independent) matches closely with the attacker strategy
0.3
Fraction of total malicious entities

of connecting Sybil nodes in a scale free topology. The
0.25
adversary is able to introduce roughly about 1 additional
0.2
Sybil identity per real entity. For instance, at f = 0.2,
0.15 the total number of Sybil identities is 0.37. As we observe
0.1 from the figure the ability of the adversary to just include
0.05 about one additional Sybil identity per compromised node
0
embedded in the social network remains constant, no mat-
0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2
Fraction of malicious entities (f) ter the fraction f of compromised nodes in the network.

Figure 4. Scale Free Topology: fraction of 4.4 Experimental evaluation using real-world
total malicious and Sybil identities as a func- data
tion of real malicious entities.

Next we validate the security guarantees provided by
We can also see the effect of varying the threshold SybilInfer using a sampled LiveJournal topology. A vari-
EXX . As α is increased from 0.65 to 0.7, the ψ for the ant of snowball [9] sampling was used to collect the full
optimal attacker strategy increases from 70 to 100. This is data set data, comprising over 100,000 nodes.
because an increase in the threshold EXX allows the ad- To perform our experiments we chose a random node
versary to insert more Sybils undetected. The advantage and collect all nodes in its three hop neighbourhood. The
of increasing α lies in reducing the worst case false pos- resulting social network has about 50,000 nodes. We then
itives. We can see that by increasing α from 0.7 to 0.75, perform some pre-processing step on the sub-graph:
the worst case false positives can be reduced from 5% to
• Nodes with degree less than 3 are removed, to filter
2%.
out nodes that are too new to the social network, or
Note that for the remainder of the paper, we shall use inactive.
α = 0.7.
We also wish to show that the security of our scheme • If there is an edge between A → B, but no edge
depends primarily on the number of colluding malicious between B → A, then A → B is removed (to only
nodes and not on the number of attack edges. To this end, keep the symmetric friendship relationships.)
we chose the compromised nodes to have the lowest num-
ber of attack edges (instead of choosing them uniformly We note that despite this pre-processing nodes all degrees
at random), and repeat the experiment. Figure 3(b) illus- can be found in the final dataset, since nodes with ini-
trates that the false positives and false negatives classifica- tial degree over 3 will have some edges removed reducing
tions returned by SybilInfer, where the average number of their degree to less than 3.
attack edges are 500. Note that these results are very simi- After pre-processing, the social sub-graph consists of
lar to the previous case illustrated in Figure 3(a), where the about 33,000 nodes. First, we ran SybilInfer on this topol-
number of attack edges is around 800. This analysis in- ogy without introducing any artificial attack. We found a
dicates that the security provided by SybilInfer primarily bottleneck cut diving off about 2, 000 Sybil nodes. It is
depends on the number of colluding entities. The implica- impossible to establish whether these nodes are false pos-
tion is that the compromise of high degree nodes does not itives (a rate of 6%) or a real-world Sybil attack present
yield any significant advantage to the adversary. As we in the LiveJournal network. Since there is no way to es-
shall see, this is in contrast to SybilGuard and SybilLimit, tablish ground truth, we do not label these nodes as either
which are extremely vulnerable when high degree nodes honest/dishonest.
are compromised. Next, we consider a fraction f of the nodes to be com-
Our next experiment establishes the number of Sybil promised and compute the optimal attacker strategy, as in
nodes that can be inserted into a network given different our experiments with synthetic data. Figure 5 shows the
fractions of compromised nodes. We vary the fraction of fraction of malicious identities accepted by SybilInfer as
compromised colluding nodes f , and for each value of f , a function of fraction of malicious entitites in the system.
we compute the optimal number of additional Sybil iden- The trend is similar to our observations on synthetic scale
tities that the attackers can insert, as in the previous exper- free topologies. At f = 0.2, the fraction of Sybil identities
iment. accepted by SybilInfer is approximately 0.32.

LiveJournal Topology: 31603 nodes Scale Free Topology: 1000 nodes
0.45 0.6
Fraction of malicious identities SybilLimit, uniform
y=x SybilLimit, low degree
0.4 SybilInfer
y=x
0.5
0.35

Fraction of total malicious entities
Fraction of total malicious identities

0.3 0.4

0.25
0.3
0.2

0.15 0.2

0.1
0.1
0.05

0 0
0.05 0.1 0.15 0.2 0.25 0.3 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02
Fraction of colluding entities Fraction of malicious entities (f)

Figure 5. LiveJournal Topology: fraction of Figure 6. Comparison with related work
total malicious identities as a function of
real malicious entities

f < 0.01. In this case SybilInfer allows for 5% total ma-
4.5 Comparison with SybilLimit and Sybil- licious entities, while SybilLimit allows for over 50%.
Guard
This is an illustration that SybilInfer performs an or-
SybilGuard [27] and SybilLimit [26] are state of the art der of magnitude better than the state of the art both in
decentralized protocols that defend against Sybil attacks. terms of range of applicability and performance within
Similar to SybilInfer, both protocols exploit the fact that a that range (SybilGuard’s performance is strictly worse
Sybil attack disrupts the fast mixing property of the social than SybilLimit’s performance, and is not illustrated.) An
network topology, albeit in a heuristic fashion. A brief obvious question is: “why does SybilInfer perform so
overview of the two systems can be found in the appendix, much better than SybilGuard and SybilLimit?” It is partic-
and their full descriptions is given in [27, 26]. ularly pertinent since all three systems are making use of
Figure 6 compares the performance of SybilInfer with the same assumptions, and a similar intuition, that there
the performance of SybilLimit. First it is worth noting should be a “gap” between the honest and Sybil regions
that the fraction of compromised nodes that SybilLimit of a social network. The reason SybilLimit and Sybil-
tolerates is only a small fraction of the range within which Guard provide weaker guarantees is that they interpret
SybilInfer provide its guarantees. SybilLimit tolerates up these assumptions in a very sensitive way: they assume
to f = 0.02 compromised nodes when the degree of at- that an overwhelming majority of random walks staring in
tackers is low (about degree 5 – green line), while we have the honest region will stay in the honest region, and then
already shown the performance of SybilLimit for compro- bound the number of walks originating from the Sybil re-
mised fractions up to f = 0.35 in figure 4. Within the gion via the number of corrupt edges. As a result they
interval SybilLimit is applicable, our system systemati- are very sensitive to the length of those walks, and can
cally outperforms: when very few compromised nodes are only provide strong guarantees for a very small number of
present in the system (f = 0.01) our system only allows corrupt edges. Furthermore the validation procedure re-
them to control less than 5% of the entities in the system, lies on collisions between honest nodes, via the birthday
versus SybilLimit that allows them to control over 30% paradox, which adds a further layer of inefficiency to esti-
of entities (rendering insecure byzantine fault tolerance mating good from bad regions.
mechanisms that requires at least 2/3 honest nodes) At the
limit of SybilLimit’s applicability range when f = 0.02, SybilInfer, on the other hand, interprets the disruption
our approach caps the number of dishonest entities in the in fast-mixing between the honest and dishonest region
system to fewer than 8%, while SybilLimit allows about simply as a faint bias in the last node of a short random
50% dishonest entities. (This large fraction renders leader walk (as illustrated in figures 2(a) and 2(b).) In our ex-
election or other voting systems ineffective.) periments, as well as in theory, we observe a very large
An important difference between SybilInfer and Sybil- fraction of the T walks crossing between the honest and
Limit is that the former is not sensitive to the degree of dishonest regions. Yet the faint difference in the probabil-
the attacker nodes. SybilLimit provides very weak guar- ity of landing on nodes in the honest and dishonest regions
antees when high degree (e.g. degree 10 – red line) nodes is present, and the sampler makes use of it to get good cuts
are compromised, and can protect the system only for between the honest and dishonest nodes.

4.6 Computational and time complexity O(|V |) per iteration, making the overall complexity of the
scheme O(|V |2 log |V |).
Two implementations of the SybilInfer sampler were
build in Python and C++, of about 1KLOC each. The 5 Deployment Strategies
Python implementation can handle 10K node networks,
while the C++ implementation has handled up to 30K So far we presented an overview of the SybilInfer algo-
node networks, returning results in seconds. rithm, as well as a theoretical and empirical evaluation of
The implementation strategy for both samplers has fa- its performance when it comes to detecting Sybil nodes.
voured a low time complexity over storage costs. The The core of the algorithm outperforms SybilGuard and
critical loop performs O(|V | · log |V |) Metropolis Hast- SybilLimit, and is applicable in settings beyond which
ings iterations per sample returned, each only requiring the two systems provide no security guarantees whatso-
about O(log |V |) operations. Two copies of the full state ever. Yet a key difference between the previous systems
are stored, as well as associated data that allows for fast and SybilInfer is the latter’s reliance on the full friendship
updating of the state, which requires O(|V |) storage. The graph to perform the random walks that drive the infer-
transcript of evidence traces T is also stored, as well as an ence engine. In this section we discuss how this constraint
index over it, which dominates the storage required and still allows SybilInfer to be used for important classes of
makes it order O(|V | · log |V |). applications, as well as how it can be relaxed to accom-
There is a serious time complexity penalty associated modate peer-to-peer systems with limited resources per
with implementing non-naive sampling strategies. Our client.
Python implementation biases the candidate moves to-
wards nodes that are more or less likely to be part of the 5.1 Full social graph knowledge
honest set. Yet exactly sampling nodes from this known
probability distribution, naively may raise the cost of each The most straightforward way of applying SybilInfer
iteration to be O(|V |). Depending on the differential be- is using the full graph of a social network to infer which
tween the highest and lowest probabilities, faster sampling nodes are honest and which nodes are Sybils, given a
techniques like rejection sampling [15] can be used to known honest seed node. This is applicable to centralised
bring the cost down. The Python implementation uses a on-line services, like free email hosting services, blogging
variant of Metropolis-Hastings to implement selection of sites, and discussion forums that want to deter spammers.
candidate nodes for the next move, at a computation cost Today those systems use a mixture of CAPTCHA [25]
of O(log |V |). The C++ implementation uses naive sam- and network based intrusion detection to eliminate mass
pling from the honest or dishonest sets, and has a very low attacks. SybilInfer could be used to either complement
cost per iteration of order O(1). those mechanisms and provide additional information as
The Markov chain sampling techniques used consider to which identities are suspicious, or replace those sy-
sequences of states that are very close to each other, dif- stems when they are expensive and error prone. One of
fering at most by a single node. This enables a key the ﬁrst social network based Sybil defence systems, Ad-
optimization, where the counts NXX , NX X , NXX and
¯ ¯ vogato [14], worked in such a centralized fashion.
NX X are stored for each state and updated when the state
¯ ¯ The need to know the social graph does not preclude
changes. This simple variant of self-adjusting computa- the use of SybilInfer in distributed or even peer-to-peer sy-
tion [1], allows for very fast computations of the proba- stems. Social networks, once mature, are generally stable
bilities associated with each state. Updating the counts, and do not change much over time. Their rate of change
and associated information is an order O(log |V |) opera- is by no means comparable to the churn of nodes in the
tion. The alternative, of recounting these quantities from network, and as a result the structure of the social network
T would cost O(|V | log |V |) for every iteration, leading could be stored and used to perform inference on multiple
to a total computational complexity for our algorithm of nodes in a network, along with a mechanisms to share oc-
O((|V | log |V |)2 ). Hence implementing it is vital to get- casional updates. The storage overhead for storing large
ting results fast. social networks is surprisingly low: a large social network
Finally our implementations use a zero-copy strategy with 10 billion nodes (roughly the population of planet
for the state. Two states and all associated information are earth) with each node having about 1000 friends, can be
maintained at any time, the current state and the candi- stored in about 187Gb of disk space uncompressed. In
date state. Operations on the candidate state can be done such settings it is likely that SybilInfer computation will
and undone in O(log |V |) per operation. Accepted moves be the bottleneck, rather than storage of the graph, for the
can be committed to the current state at the same cost. foreseeable future.
These operations can be used to maintain the two states A key application of Sybil defences is to ensure that
synchronised for use by the Metropolis-Hastings sampler. volunteer relays in anonymous communication networks
The naive strategy of re-writing the full state would cost belong to independent entities, and are not controlled

by a single adversary. Practical systems like Mixmas- tain community of interest (a social club, a committee, a
ter [17], Mixminion [5] and Tor [7] operate such a volun- town, etc.) can be extracted to form a sub-graph. SybilIn-
teer based anonymity infrastructure, that are very suscep- fer can then be applied on this partial view of the graph,
tible to Sybil attacks. Extending such an infrastructure to to detect nodes that are less well integrated than others in
use SybilInfer is an easy task: each relay in the system the group. There is an important distinction between us-
would have to indicate to the central directory services ing SybilInfer in this mode or using it with the full graph:
which other nodes it considers honest and non-colluding. while the results using the full graph output an “absolute”
The graph of nodes and mutual trust relations can be used probability for each node being a Sybil, applying SybilIn-
to run SybilInfer centrally by the directory service, or by fer to a partial view of the network only provides a “rel-
each individual node that wishes to use the anonymizing ative” probability the node is honest in that context. It is
service. Currently, the largest of those services, the Tor likely that nodes are tagged as Sybils, because they do not
network has about 2000 nodes, which is well within the have many contacts within the select group, which given
computation capabilities of our implementations. the full graph would be classified as honest. Before ap-
plying SybilInfer in this mode it is important to assess, at
5.2 Partial social graph knowledge least, whether the subgroup is fast-mixing or not.

SybilInfer can be used to detect and prevent Sybil at- 5.3 Using SybilInfer output optimally
tacks, using only a partial view of the social graph. In the
context of a distributed or peer-to-peer system each user Unlike previous systems the output of the SybilInfer
discovers only a fixed diameter sub-graph around them. algorithm is a probabilistic statement, or even more gen-
For example a user may choose to retrieve and store all erally, a set of samples that allows probabilistic statements
other users two or three hops away in the social network to be estimated. So far in the work we discussed how to
graph, or discover a certain threshold of nodes in a breadth make inferences about the marginal probability specific
first manner. SybilInfer is then applied on the extracted nodes are honest of dishonest by using the returned sam-
sub-graph to detect potential Sybil regions. This allows ples to compute Pr[i is honest] for all nodes i. In our ex-
the user to prune its social neighbourhood from any Sybil periments we applied a 0.5 threshold to the probability to
attacks, and is sufficient for selecting a set of honest nodes classify nodes as honest or dishonest. This is a rather lim-
when sampling from the full network is not required. Dis- ited use of the richer output that SybilInfer provides.
tributed backup and storage, and all friend and friend- Distributed system applications can, instead of using
of-friend based sharing protocols can benefit from such marginal probabilities of individual nodes, estimate the
protection. The storage and communication cost of this probability that the particular security guarantees they re-
scheme is constant and relative to the number of nodes in quire hold. High latency anonymous communication sy-
the chosen neighbourhood. stems, for example, require a set of different nodes such
In cases where nodes can afford to know a larger frac- that with high probability at least one of them is honest.
tion of the social graph, they could choose to discover Path selection is also subject to other constraints (like la-
O(c· |V |) nodes in their neighbourhood, for some small tency.) In this case the samples returned by SybilInfer
integer c. This increases the chances two arbitrary nodes can be used to calculate exactly the sought probability, i.e.
have to know a common node, that can perform the Sybil- the probability a single node in the chosen path is hon-
Infer protocol and act as an introduction point for the est. Onion routing based system, on the other hand are
nodes. In this protocol Alice and Bob want to ensure the secure as long as the first and last hop of the relayed com-
other party is not a Sybil. They find a node C that is in the munication is honest. As before, the samples returned by
c · |V neighbourhood of both of them, and each make SybilInfer can be used to choose a path that has a high
sure that with high probability it is honest. They then con- probability to exhibit this characteristic.
tact node C that attests to both of them, given its local Other distributed applications, like peer-to-peer storage
run of the SybilInfer engine, that they are not Sybil nodes and retrieval have similar needs, but also tunable param-
(with C as the honest seed.) This protocol introduces a eters that depend on the probability of a node being dis-
single layer of transitive trust, and therefore it is neces- honest. Storage systems like OceanStore, use Rabin’s in-
sary for Alice and Bob to be quite certain that C is indeed formation dispersion algorithm to divide files into chunks
honest. Its storage and communication cost is O( |V |), stored and retrieved to reconstruct a file. The degree of
which is the same order of magnitude as SybilLimit and redundancy required crucially depends on the probability
SybilGuard. Modifying this simple minded protocol into nodes are compromised. Such algorithms can use SybilIn-
a fully fledged one-hop distributed hash table [13] is an fer to foil Sybil attacks, and calculate the probability the
interesting challenge for future work. set of nodes to be used to store particular files contains
SybilInfer can also be applied to specific on-line com- certain fractions of honest nodes. This probability can in
munities. In such cases a set of nodes belonging to a cer- turn inform the choice of parameters to maximise the sur-

vivability of the files. approaches that treat user statements beyond just black
Finally a note of warning should accompany any Sybil and white and make explicit use of probabilistic reasoning
prevention scheme: it is not the goal of SybilInfer (or any and statements as their outputs will become increasingly
other such scheme) to ensure that all adversary nodes are important in building safe systems.
filtered out of the network. The job of SybilInfer is to
ensure that a certain fraction of existing adversary nodes Acknowledgements. This work was performed while
cannot significantly increase its control of the system by Prateek Mittal was an intern at Microsoft Research, Cam-
introducing ‘fake’ Sybil identities. As it is illustrated by bridge, UK. The authors would like to thank Carmela
the examples on anonymous communications and stor- Troncoso, Emilia K¨ sper, Chris Lesniewski-Laas, Nikita
a
age, system specific mechanisms are still crucial to ensure Borisov and Steven Murdoch for their helpful comments
that a minority of adversary entities cannot compromise on the research and draft manuscript. Our shepherd, Ta-
any security properties. SybilInfer can only ensure that dayoshi Kohno, and ISOC NDSS 2009 reviewers pro-
this minority remains a minority and cannot artificially in- vided valuable feedback to improve this work. Barry Law-
crease its share of the network. son was very helpful with the technical preparation of the
Sybil defence schemes are also bound to contain false- final manuscript.
positives, namely honest nodes labeled as Sybils. For this
reason other mechanisms need to be in place to ensure that
References
those users can seek a remedy to the automatic classifica-
tion they suffered from the system, potentially by making
[1] U. A. Acar. Self-adjusting computation. PhD thesis, Pitts-
some additional effort. Proofs-of-work, social introduc-
burgh, PA, USA, 2005. Co-Chair-Guy Blelloch and Co-
tion services, or even payment targeting those users could Chair-Robert Harper.
be a way of ensuring SybilInfer is not turned into an auto- [2] B. Awerbuch. Optimal distributed algorithms for mini-
mated social exclusion mechanism. mum weight spanning tree, counting, leader election and
related problems (detailed summary). In STOC, pages
6 Conclusion 230–240. ACM, 1987.
[3] M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S.
Wallach. Secure routing for structured peer-to-peer over-
We presented SybilInfer, an algorithm aimed at detect- lay networks. In OSDI ’02: Proceedings of the 5th sym-
ing Sybil attacks against peer-to-peer networks or open posium on Operating systems design and implementation,
services, and label which nodes are honest and which are pages 299–314, New York, NY, USA, 2002. ACM.
dishonest. Its applicability and performance in this task is [4] F. Dabek. A cooperative file system. Master’s thesis, MIT,
an order of magnitude better than previous systems mak- September 2001.
[5] G. Danezis, R. Dingledine, and N. Mathewson. Mixmin-
ing similar assumptions, like SybilGuard and SybilLimit,
ion: Design of a Type III Anonymous Remailer Protocol.
even though it requires nodes to know a substantial part of
In Proceedings of the 2003 IEEE Symposium on Security
the social structure within which honest nodes are embed- and Privacy, pages 2–15, May 2003.
ded. SybilInfer illustrates how robust Sybil defences can [6] G. Danezis, C. Lesniewski-Laas, M. F. Kaashoek, and
be bootstrapped from distributed trust judgements, instead R. Anderson. Sybil-resistant dht routing. In ESORICS
of a centralised identity scheme. This is a key enabler for 2005: Proceedings of the European Symp. Research in
secure peer-to-peer architectures as well as collaborative Computer Security, pages 305–318, 2005.
web 2.0 applications. [7] R. Dingledine, N. Mathewson, and P. Syverson. Tor: the
SybilInfer is also significant due to the use of machine second-generation onion router. In SSYM’04: Proceed-
learning techniques and their careful application to a secu- ings of the 13th conference on USENIX Security Sympo-
sium, pages 21–21, Berkeley, CA, USA, 2004. USENIX
rity problem. Cross disciplinary designs are a challenge,
Association.
and applying probabilistic techniques to system defence [8] J. R. Douceur. The sybil attack. In IPTPS ’01: Re-
should not be at the expense of strength of protection, and vised Papers from the First International Workshop on
strategy-proof designs. Our ability to demonstrate that the Peer-to-Peer Systems, pages 251–260, London, UK, 2002.
underlying mechanisms behind SybilInfer is not suscepti- Springer-Verlag.
ble to gaming by an adversary arranging its Sybil nodes in [9] L. Goodman. Snowball sampling. Annals of Mathematical
a particular topology is, in this aspect, a very import part Statistics, 32:148–170.
of the SybilInfer security design. [10] W. K. Hastings. Monte carlo sampling methods us-
Yet machine learning techniques that take explicitly ing markov chains and their applications. Biometrika,
57(1):97–109, April 1970.
into account noise and incomplete information, as the one [11] R. Kannan, S. Vempala, and A. Vetta. On clusterings:
contained in the social graphs, are key to building secu- Good, bad and spectral. J. ACM, 51(3):497–515, 2004.
rity systems that degrade well when theoretical guarantees [12] L. Lamport, R. Shostak, and M. Pease. The byzantine
are not exactly matching a messy reality. As security in- generals problem. ACM Trans. Program. Lang. Syst.,
creasingly becomes a “people” problem, it is likely that 4(3):382–401, 1982.

06

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (8)

Similaire à 06

Similaire à 06 (20)

06