A Study of Usability-aware Network Trace
Anonymization
Kato Mivule
Los Alamos National Laboratory
Los Alamos, New Mexico, USA
kmivue@gmail.com
Blake Anderson
Los Alamos National Laboratory
Los Alamos, New Mexico, USA
banderson@lanl.gov
Abstract— The publication and sharing of network trace
data is critical to the advancement of collaborative
research among various entities in government, the
private sector, and academia. However, due to the
sensitive and confidential nature of the data involved,
entities have to employ various anonymization techniques
to meet legal requirements in compliance with
confidentiality policies. Nevertheless, the very composition
of network trace data makes applying anonymization
techniques a challenge. At the same time, the basic
application of microdata anonymization techniques to
network traces is problematic and does not deliver the
necessary data usability. Therefore, as a contribution, we
point out some of the ongoing challenges in network
trace anonymization. We then suggest usability-aware
anonymization heuristics that employ microdata privacy
techniques while giving consideration to the usability of the
anonymized data. Our preliminary results show that, with
trade-offs, it might be possible to generate anonymized
network traces with enhanced usability on a case-by-case
basis using microdata anonymization techniques.
Keywords—Network Trace Anonymization; Usability;
Differential Privacy; K-anonymity; Generalization
I. INTRODUCTION
While a number of network trace anonymization techniques
have been presented in literature, data utility remains
problematic due to the unique usability requirements by the
different consumers of the privatized network traces. Yet still,
a number of microdata privacy techniques from the statistical
and computational sciences are difficult to implement when
anonymizing network traces due to the low usability of the results.
Moreover, finding the right proportionality between
anonymization and data utility of network trace data is
intractable and requires trade-offs on a case-by-case basis,
after a careful consideration of privacy needs stipulated by
policy makers, and likewise the usability requirements of the
researchers, who in this case, are the consumers of the
anonymized data. Furthermore, a generalized approach fails to
deliver unique solutions, as each entity will have unique data
privacy requirements. In this study, we take a look at the
structure of the network trace data. We vertically partition the
network trace data into different attributes and apply micro-
data privatization techniques separately for each attribute. We
then suggest usability-aware anonymization heuristics for the
anonymization process. While a number of anonymization
attacks have been presented in the literature, the main goal of this
study is the generation of anonymized network traces with
better data usability. Therefore, the focus of the suggested
heuristics and preliminary results is the generation of
anonymized, usability-aware network trace data, using privacy
techniques covered in the statistical disclosure control domain;
that include the following: Generalization, Noise addition and
Multiplicative noise perturbation, Differential Privacy, and
Data swapping [38]. A measure of usability by quantifying
descriptive and inference statistics of the anonymized data in
comparison with that of the original data is also presented.
Furthermore, we apply frequency distribution analysis and
unsupervised learning techniques in the measure of usability
for the unlabeled data. The rest of the paper is organized as
follows: In Section II, we present a review of related work,
and definition of important terms pertaining to this paper. In
Section III, we present methodologies and usability-aware
anonymization heuristics. In Section IV, the experiment and
results are given. Finally, in Section V, the conclusion,
recommendations, and future work are presented.
II. RELATED WORK
One of the challenges of anonymizing network traces is
how to keep the structure and flow of the data intact so as to
provide usability to the consumer of the anonymized data. In
such efforts, Maltz et al. (2004) demonstrated that network
trace data could be anonymized while preserving the structure
of the original data [1]. Additionally, Maltz et al. (2004)
observed and noted that some of the challenges in
anonymizing network traces included figuring out attributes in
the network trace that could leak sensitive information, and
how to anonymize the data such that the original
configurations are preserved [1]. Observations by Maltz et al.
are still relevant today, especially when considering the
intractability between privacy and usability. On the other
hand, Slagell, Wang, and Yurcik (2004) proposed Crypto-Pan,
a network trace anonymization tool that employs
cryptographic techniques in the privatization of network trace
data [2]. While anonymization using cryptographic means
might be effective in concealing sensitive data, usability of the
anonymized data is always a challenge. Bishop, Crawford,
Bhumiratana, Clark, and Levitt (2006), observed that one of
the problems in the anonymization of network traces, is that
when handling IP addresses, the set of available addresses is
finite, thus setting a limit to any anonymization prospects [3].
Each octet in the IP address ranges from 0 to 255.
For instance, it would not make much sense to have an
anonymized IP address with an octet value of 345. This
limitation makes the data vulnerable to de-anonymization
attacks. On the issue of de-anonymization attacks, Coull,
Wright, Monrose, Collins, and Reiter (2007) presented
inference techniques for de-anonymizing and detecting
network topologies in anonymized network trace data [4].
Coull et al. showed that topological data could be deduced as
an artifact of functional network packet traces, and that data on
the activity of hosts could be leveraged by an adversary to
defeat the obfuscation of the network traces [4]. Moreover,
Coull et al., pointed out that obfuscating network trace data is
not a trivial task as publishers of the data need to be aware of
the tension between balancing privacy and data utility needs
for anonymized network traces [4]. Additionally, Ribeiro,
Chen, Miklau, and Towsley (2008), showed that systematic
attacks on prefix-preserving anonymized network traces,
could be done by adversaries using a modest amount of publicly
available information about a network and employing attack
techniques such as finger printing [5]. However, Ribeiro et al.
anticipated that their proposed attack methodologies would be
employed in evaluating worst-case vulnerabilities and finding
trade-offs between privacy and utility in prefix-preserving
privatization of network traces [5]. Therefore, while
researchers might have an interest in anonymized data sets
that maintain the structure and flow of the original data,
curators of that data have to contend with the fact that such
prefix-preserving anonymization is subject to de-
anonymization attacks.
A comprehensive reference model was presented by Gattani
and Daniels (2008), in which they outlined how entities should
formulate the problem of anonymizing network traces [6].
Gattani and Daniels (2008) noted that the anonymization
procedure always aims at the following three goals [6]: (i)
defending the confidentiality of users, (ii) obfuscating the
inner structure of a network, and (iii) generating anonymized
network traces with acceptable levels of usability [6].
However, Gattani and Daniels (2008) observed that attaining
those three anonymization goals is problematic, as removing
too much sensitive information from a network data trace only
reduces the usability of the anonymized network traces [6].
Additionally, Gattani and Daniels (2008) categorized attacks
on anonymized data as (i) active data injection
attacks, (ii) known mapping attacks, (iii) network topology
inference attacks, and (iv) cryptographic attacks [6]. On the
categorization of attacks, King, Lakkaraju, and Slagell (2009)
presented a taxonomy of attacks on anonymization techniques
with the aim of helping curators of the privatization process
negotiate trade-offs between data utility and anonymization
[7]. King et al., classified attacks on anonymization methods
as (i) fingerprinting, (ii) structure recognition, (iii) known
mapping, (iv) data injection, and (v) cryptographic attacks [7].
A combined categorization of attacks on anonymization
techniques, from Gattani and Daniels, and King et al., would
then be listed as follows [7] [6]: (i) Fingerprinting attacks: in
this category of attacks, attributes of anonymized data are
compared with traits of known network structures to uncover a
relationship between the anonymized and non-anonymized
data. (ii) Data injection attacks: in this type of exploit, an
attacker injects pseudo-traffic data into a network trace before
the anonymization process and uses the pseudo-traffic traces to
de-anonymize the network traces and network structure. (iii)
Structure recognition attacks: in this type of exploit, an
attacker seeks to determine the structure between objects in
the anonymized data to discover multiple relations between
anonymized and non-anonymized data. (iv) Network topology
inference: similar to known mapping attacks, this category of
exploits seeks to retrieve the network topology map by de-
anonymizing the nodes that make up the vertices of the
network, the edges between the nodes that represent the
connectivity and the routers. (v) Known mapping attacks: in
this category of exploit, the attacker relies on external data
(auxiliary data) to find a mapping between the anonymized
network trace data and the original network trace data in order
to retrieve the original IP addresses. (vi) Cryptographic
attacks: in this category of attacks, exploits are carried out to
break cryptographic algorithms used to encrypt the network
traces.
A comparative analysis was done by Coull, Monrose, Reiter,
and Bailey (2009) in which they pointed out the similarities
and differences between network data anonymization and
microdata privatization techniques, and how microdata
obfuscation methods could be applied to anonymize network
traces [8]. Coull et al. observed that uncertainties exist
about the effectiveness of network data anonymization from
both methodological and policy viewpoints, with the research
community in need of more study to understand the
implications of publishing anonymized network data and the
utility of such data to researchers [8]. Furthermore, Coull et
al. suggested that the extensive work that exists in the
statistical disclosure control discipline could be employed by
the network research community towards the privatization of
network flow data [8]. On network trace packet
anonymization, Foukarakis, Antoniades, and Polychronakis
(2009), proposed the anonymization of network traces at the
packet level – in the payload of a packet, due to inadequacies
found in various network trace anonymization techniques [9].
Foukarakis et al., suggested identifying revealing information
contained in the shell-code of code injection attacks, and
anonymizing such packets to grant confidentiality in published
network attack traces [9]. On the subject of IP-flow
intrusion detection methods, Sperotto et al. (2010) presented
an overview of the IP-flow intrusion detection approach,
highlighted the classification of attacks and defense methods,
and showed how flow-based methods can be used to discover
scans, worms, botnets, and denial of service (DoS) attacks [10].
Furthermore, Sperotto et al. highlighted two types of sampling:
packet sampling, whereby a packet is deterministically chosen
based on a time interval for analysis, and flow sampling, in
which a sample flow is chosen for analysis [10]. At the same
time, Burkhart et al. (2010), in their review of anonymization
techniques, showed that current anonymization techniques are
vulnerable to a series of injection attacks, by inserting attacker
packets into the network flow prior to anonymization, then
later retrieving the packets, thus revealing vulnerabilities and
patterns in the anonymized data [11]. As a mitigation to
injection attacks, Burkhart et al. suggested that anonymization
of network flow data should be done as part of a
comprehensive approach including both legal and technical
perspectives on data confidentiality [11].
Meanwhile, McSherry and Mahajan (2011) showed that
differential privacy could be employed to anonymize network
trace data; yet despite the privacy guarantees provided by
differential privacy, the usability of the privatized data
remains a challenge due to excessive noise from the
anonymization [12]. In their study of applying differential
privacy to network trace data, McSherry and Mahajan
acknowledged the challenges of balancing usability and
privacy, despite the confidentiality assurances
accorded by differential privacy [13]. On real-time interactive
anonymization, Paul, Valgenti, and Kim (2011) proposed the
Real-time Netshuffle anonymization technique whereby
distortion is done to a complete graph to prevent inference
attacks in network traffic [14]. Netshuffle works by employing
the k-anonymity methodology on network traces, ensuring that
each trace record appears at least k times, with k > 1; shuffling
is then applied to the k-anonymized records, making them
difficult for an attacker to decipher due to the
distortion [14]. A network trace
obfuscation methodology, (k, j)-obfuscation, was proposed by
Riboni, Villani, Vitali, Bettini, and Mancini (2012), in which a
network flow is considered obfuscated if it cannot be linked,
with high assurance, to its source and destination IP
addresses [15]. Riboni et al. observed from their
implementation of (k, j)-obfuscation, that the large set of
network flows maintained the utility of the original network
trace [15]. However, the context of data utility remains
challenging as each consumer of privatized data will have
unique usability requirements, different levels of needed
assurance, and therefore, utility becomes constrained to a
case-by-case basis, depending on an entity's privacy and
usability needs. On the issue of preserving IP consistency in
anonymized data, Qardaji and Li (2012) observed that full
prefix-preserving IP anonymization suffers from a number of
attacks yet from a usability perspective, some level of
consistency is required in anonymized IP addresses [16]. To
mitigate this problem, Qardaji and Li (2012) proposed
maintaining pseudonym consistency by dividing flow data
into buckets based on temporal closeness and separately
privatizing flows in each bucket, thus maintaining consistency
only in each bucket but not globally across all buckets [16].
Mendonca, Seetharaman, and Obraczka (2012) proposed
AnonyFlow, an interactive anonymization technique that
provides end point privacy by preventing the tracking of
source behavior and location in network data [17]. However,
Mendonca et al. acknowledged that AnonyFlow does not
address issues of complete anonymity, data security,
steganography, and network trace anonymization in non-
interactive settings [17].
On generating synthetic network traces, Jeon, Yun, and Kim
(2013), proposed an anomaly-based intrusion detection system
(A-IDS) to generate pseudo-network traffic for the
obfuscation of real sensitive network traffic in supervisory
control and data acquisition (SCADA) systems [18]. An
overview of network data anonymization was presented by
Nassar, al Bouna, and Malluhi (2013), which highlighted the
need to find appropriate anonymization algorithms that grant
privacy with an optimal risk-utility
trade-off [19]. On using entropy and
similarity distance measures, Xiaoyun, Yujie, Xiaosheng,
Xiaohong, and Yan (2013) employed similarity distance and
entropy techniques in the quantification of anonymized
network trace data [20]. Xiaoyun et al. proposed two types of
similarity measures: (i) external similarity, in which the
distance measurements are done to compute the probability
that an adversary will obtain a one-to-one mapping relation
between the anonymized and the original data, based on
auxiliary knowledge; (ii) Internal similarity, in which distance
measurements are done on the privatized and the original data
to indicate how distinguishable or indistinguishable the data
sets are [20]. On the extraction, classification, and
anonymization of packet traces, Lin, Lin, Wang, Chen, and
Lai (2014) observed that capturing and sharing real network
traffic faces two challenges: first, various protocols are
associated with the packet traces, and second, such packet
traces tend not to be well classified before deep packet
anonymization [21]. Therefore, Lin et al. proposed the PCAPLib
methodology for the extraction, classification, and deep packet
anonymization of packet traces [21]. In their work on Session
Initiation Protocol (SIP) used in multimedia communication
sessions, Stanek, Kencl, and Kuthan (2014) pointed out that
current network trace anonymization techniques are
insufficient for SIP traces due to the data format of the SIP
trace, which includes the IP address, the SIP URI, and the e-
mail address [22]. To mitigate this problem, Stanek et al.
proposed SiAnTo, an anonymization methodology that
replaces SIP information with non-descriptive but matching
labels [22]. More recently, Riboni, Villani, Vitali, Bettini, and
Mancini (2014), cautioned that current network trace
anonymization techniques are vulnerable to various attacks
while at the same time it is problematic to apply microdata
privatization methods in obfuscating network traces [23].
Moreover, Riboni et al. noted that current obfuscation
methods depend on assumptions about an adversary's
intentions, which are challenging to model, and do not
guarantee privacy against background knowledge attacks [23].
Table I summarizes some of the network trace
anonymization challenges outlined in the literature over the
past ten years.
A. Network trace anonymization techniques
In the following section, a review of some of the common
network trace anonymization techniques is presented [24] [25]
[26] [27] [28] [16]: (i) Black marker technique: in this
method, sensitive values are erased or substituted with fixed
values.
TABLE I. SUMMARY OF NETWORK TRACE ANONYMIZATION CHALLENGES

Author(s) – Network Trace Anonymization Challenges
Maltz et al. (2004) – Challenge of identifying attributes to anonymize while conserving usability.
Slagell et al. (2004) – Crypto-Pan: cryptography to anonymize IP addresses; usability a challenge.
Bishop et al. (2006) – Anonymization of IP addresses problematic; the set of IP addresses is finite.
Coull et al. (2007) – Obfuscation not a trivial task due to the tensions between privacy and usability.
Ribeiro et al. (2008) – Prefix-preserving anonymized data subject to fingerprinting attacks.
King et al. (2009) – Taxonomy of attacks on anonymization techniques; anonymization challenges.
Coull et al. (2009) – Comparison between network and microdata anonymization; significant differences.
Foukarakis et al. (2009) – Network trace anonymization at the packet level; a challenge.
Burkhart et al. (2010) – Injection attacks on anonymized network trace data.
McSherry and Mahajan (2011) – Differential privacy anonymization of network trace data.
Paul, Valgenti, and Kim (2011) – Real-time anonymization with k-anonymity.
Riboni et al. (2012) – (k, j)-obfuscation: a network flow is obfuscated if it cannot be linked to the original data with high assurance.
Qardaji and Li (2012) – Global prefix consistency is subject to attacks.
Mendonca et al. (2012) – Interactive network trace anonymization.
Jeon, Yun, and Kim (2013) – Synthetic (anonymized) network trace data generation.
Nassar et al. (2013) – Balance between utility and privacy needed; still a problem.
Farah and Trajkovic (2013) – Network trace anonymization techniques; an overview.
Stanek et al. (2014) – Proposed Session Initiation Protocol (SIP) anonymization and challenges.
Riboni et al. (2014) – Caution with current network anonymization techniques; vulnerable to attacks.
(ii) Enumeration technique: in this scheme, sensitive values in
a sequence are replaced with an ordered sequence of synthetic
values. (iii) Hash technique: unique values are substituted
with a fixed size bit string in the hash technique. (iv)
Partitioning technique: with the partitioning method,
revealing values are partitioned into a subset of values and
each of the values in the subset is replaced with a generalized
value. For example, an IP address 141.121.10.12, could be
partitioned into four octets and the last two octets replaced
with zero values, 141.121.0.0. (v) Precision degradation
technique: highly specific values of a time-stamp attribute are
removed when employing the precision degradation method.
(vi) Permutation technique: A random permutation is done to
link non-anonymized IP and MAC addresses to a set of
available addresses. (vii) Prefix-preserving anonymization
technique: in this technique, values of an IP address are
replaced with synthetic values in such a way that the original
structure of the IP address is kept – the prefix values of an IP
address structure is preserved. Prefix-preservation could be
applied fully or partially on the IP address. The fully prefix-
preserving anonymization will map the full structure of the
original IP address in the anonymized data, while the partially
prefix-preserving anonymization will preserve a select
structure of the original IP address, for example the first two
octets. (viii) Random time shift technique: this methodology
works by applying a random value as an offset to each value
in the field. (ix) Truncation technique: with this technique,
part of the IP or MAC address is suppressed or deleted and the
remaining IP address remains intact. (x) Time unit
annihilation: In this partitioning anonymization methodology,
part of the time-stamp is deleted and replaced with zeros. A
summary of these ongoing challenges in anonymizing network
traces, from the literature, is given in Table I. Although a
number of network trace anonymization solutions have been
proposed in the literature, the usability of the anonymized data
remains problematic. While a number of challenges exist, this
study focuses on the challenge of usability-aware
anonymization of network traces.
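To make a few of the listed techniques concrete, the sketch below illustrates the black marker, partitioning, and hash techniques in Python; the function names are our own illustrative choices, not part of any cited tool.

```python
import hashlib

def black_marker(value, mask="0"):
    """Black marker: replace a sensitive value with a fixed value."""
    return mask

def partition_ip(ip, keep_octets=2):
    """Partitioning: keep the first octets and zero the rest,
    e.g. 141.121.10.12 -> 141.121.0.0 (the text's own example)."""
    octets = ip.split(".")
    kept = octets[:keep_octets] + ["0"] * (4 - keep_octets)
    return ".".join(kept)

def hash_value(value):
    """Hash technique: substitute a unique value with a
    fixed-size bit string (here, a truncated SHA-256 digest)."""
    return hashlib.sha256(value.encode()).hexdigest()[:8]
```

Note that the hash technique maps equal inputs to equal outputs, which preserves linkability within the trace while concealing the raw value.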
B. Statistical disclosure control techniques
The following are some of the main microdata privatization
methods used: Suppression: in this technique, revealing and
sensitive data values are deleted from a data set at the cell
level [29]. Generalization: to achieve confidentiality for
revealing values in an attribute, a single value is allocated to a
group of sensitive values in the attribute [30]. K-anonymity: in
this method, data privacy is enforced by requiring that all
values in the quasi-attributes be repeated k times, such that k
> 1, thus providing confidentiality and making it harder to
uniquely distinguish individual values. K-anonymity employs
both generalization and suppression methods to achieve k > 1
[31]. Data swapping: Data swapping is a data privacy
technique that involves exchanging sensitive cell values with
other cell values in the same attribute while keeping intact the
frequencies and statistical traits of the original data, and as
such, making it difficult for an attacker to map the privatized
values to the original record [32]. Noise addition: noise
addition is a data privacy method that adds random values
(noise) to revealing and sensitive numerical values in the
original data to ensure confidentiality. The random values are
usually generated based on the mean and standard deviation
of the original values [33]:
X_i + ε_i = Z_i (1)
Multiplicative noise: similar to noise addition, random values
generated based on the mean and variance of the original data
values are multiplied with the original data, generating a
privatized data set [34]:
X_i · ε_i = Z_i (2)
Where X = original data, Z = privatized data, and ε = the
random values. Differential Privacy: Similar to noise addition,
differential privacy imposes privacy by adding Laplace noise
to query results from the database such that it cannot be
distinguished if a particular value has been adjusted in that
database or not; making it more difficult for an attacker to
decode items in the database [35]. ε-differential privacy is
satisfied if the results of a query run on databases D1 and D2
are probabilistically similar and meet the following
condition [35]:

P[qn(D1) ∈ R] / P[qn(D2) ∈ R] ≤ e^ε (3)
Where D1 and D2 are the two databases; P is the probability of
the perturbed query results D1 and D2; qn() is the privacy
granting procedure (perturbation); qn(D1) is the privacy
granting procedure on query results from database D1; qn(D2)
is the privacy granting procedure on query results from
database D2; R is the perturbed query results from the
databases D1 and D2 respectively; e^ε is the exponential of the
epsilon value. Differential privacy can be implemented as
follows [36]:
(i) Run the query on the database, where f(x) = the query function.
(ii) Calculate the most influential observation (the sensitivity):

Δf = Max | f(D1) − f(D2) | (4)

(iii) Calculate the Laplace noise distribution scale:

b = Δf / ε (5)

(iv) Add Laplace noise to the query results:

DP = f(x) + Laplace(0, b) (6)
(v) Publish perturbed query results in interactive (query
responses) or non-interactive (macro, micro data) mode.
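As an illustration, the perturbation methods above (Eqs. 1 and 2) and the Laplace mechanism of steps (i)–(iv) can be sketched in Python as follows; the function names, and the choice of a Gaussian for the additive noise, are our own assumptions for the sketch.

```python
import math
import random

def additive_noise(x, mu, sigma, rng=random):
    # Eq. (1): Z_i = X_i + eps_i, with eps_i drawn here from a Gaussian
    # parameterized by the original data's mean and standard deviation
    return [xi + rng.gauss(mu, sigma) for xi in x]

def multiplicative_noise(x, eps):
    # Eq. (2): Z_i = X_i * eps_i for a list of random factors eps
    return [xi * ei for xi, ei in zip(x, eps)]

def laplace_noise(b, rng=random):
    # inverse-CDF sampling from Laplace(0, b)
    u = rng.random() - 0.5          # u in [-0.5, 0.5)
    u = max(u, -0.5 + 1e-12)        # guard against log(0)
    return -b * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_query(f_x, sensitivity, epsilon, rng=random):
    # Steps (ii)-(iv): b = Delta_f / epsilon, then DP = f(x) + Laplace(0, b)
    b = sensitivity / epsilon
    return f_x + laplace_noise(b, rng)
```

For a counting query, for instance, the sensitivity Δf is 1, so `dp_query(count, 1.0, epsilon)` would publish a perturbed count in the interactive mode of step (v).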
C. Metrics used to quantify usability in this study
The Shannon entropy: entropy is used essentially to measure
the amount of randomness and uncertainty in a data set; if all
values in a set of information fall into one category, then
entropy in such cases is at zero. Probability is used to quantify
randomness of elements in an information set; normalized
entropy values range from 0 to 1, getting to the upper bound
level when all probabilities are equal [37] [36]. Entropy is
formally described using the following formula [37]:
Entropy = H(p1, p2, ..., pn) = Σ_{i=1}^{n} pi · log(1/pi) (7)

Where pi is the probability of element i, and H(p1, p2, ..., pn) is the entropy.
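A minimal sketch of the entropy computation (Eq. 7), using the empirical probabilities of the values in an attribute; log base 2 is assumed here, and the function name is ours.

```python
import math
from collections import Counter

def shannon_entropy(values):
    # Eq. (7): H = sum_i p_i * log2(1 / p_i), with p_i the empirical
    # probability of each distinct value; H = 0 when all values are equal
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

Comparing the entropy of an original attribute with that of its anonymized counterpart gauges how much of the data's randomness the anonymization preserved.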
Correlation Metric (between Original data X and Privatized
data Z): Correlation rxz computes the inclination and tendency
of an additive linear relation between two variables; the
correlation is dimensionless, independent of the units in
which the data points x and z are measured, and is expressed as
follows [38]:

Correlation r_xz = Cov(x, z) / (σ_x · σ_z) (8)

Where Cov(x, z) is the covariance of X and Z, and σ_x, σ_z
are the standard deviations of X and Z. If r_xz = −1, then a
negative linear relation exists between X and Z; if r_xz = 0, no
linear relation exists between X and Z; when r_xz = 1, a
strong positive linear relation between X and Z exists. Descriptive
Statistics Metric: Descriptive statistics (DS) such as the mean,
standard deviation, variance, etc., are used in quantifying how
much distortion there is between the anonymized and original
data. The larger the difference, the more privacy but also an
indication of less usability; the closer the difference, the more
usability but perhaps less privacy. The format used in the
quantification is always in the form [36]:
Usability = DS(Z) − DS(X) (9)
Where Z is the anonymized data, X is the original data, and
DS, the descriptive statistics. Distance Measures Metric
(Euclidean Distance): For distance measures, we employed
clustering with k-means to evaluate how well the clustering of
the original data compares with that of the anonymized data.
In this case, the Euclidean Distance is used for k-means
clustering and is expressed as follows [39]:
distance(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² ) (10)
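The distance measure of Eq. (10), and the k-means assignment step it drives, can be sketched as follows; this shows only the assignment step, and the helper names are ours.

```python
import math

def euclidean(x, y):
    # Eq. (10): sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def assign_clusters(points, centroids):
    # k-means assignment step: each point joins its nearest centroid;
    # comparing the assignments obtained on the original data with those
    # on the anonymized data gauges how well cluster structure survived
    return [min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points]
```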
The Davies-Bouldin index was also used in the evaluation of
how well the clustering performed. The Davies-Bouldin Index
(DBI) is expressed as follows [21]:

DBI = (1/n) Σ_{i=1}^{n} D_i (11)

Where D_i ≡ max_{j ≠ i} R_{i,j} (12)

And R_{i,j} ≡ (S_i + S_j) / M_{i,j} (13)

And R_{i,j} is a quantification of how good the clustering is, S_i
and S_j are the distances within clusters i and j, and M_{i,j} is the
distance between clusters i and j. Classification Error Metric: With the
classification error test, both the original and anonymized data
are passed through machine learning and the classification
error (or accuracy) is returned. The classification error (CE) of
the anonymized data is subtracted from that of the original.
The larger the difference, the more privacy (due to distortion);
this might be an indication of low usability. However, a
smaller difference might indicate better usability but then low
privacy, as anonymized results might be closer to the original
data in similarity. Depending on the machine-learning
algorithm used, the classification error metric will be in this
form [36]:
Usability Gauge = CE(Z) − CE(X) (14)
Where Z is the anonymized data, X the original data, and
CE is the classification error.
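A toy sketch of the classification error gauge (Eq. 14), using a simple nearest-centroid rule as the stand-in learner; the data, labels, and centroid values below are invented purely for illustration.

```python
def nearest_centroid_predict(point, centroids):
    # centroids: mapping from class label to a 1-D centroid value
    return min(centroids, key=lambda label: abs(point - centroids[label]))

def classification_error(data, labels, centroids):
    # fraction of points the nearest-centroid rule labels incorrectly
    wrong = sum(1 for p, y in zip(data, labels)
                if nearest_centroid_predict(p, centroids) != y)
    return wrong / len(data)

# Eq. (14): Usability Gauge = CE(Z) - CE(X)
X = [1.0, 1.2, 4.8, 5.1]        # original data (toy)
Z = [1.4, 1.6, 4.4, 4.7]        # anonymized data (toy, perturbed)
labels = ["a", "a", "b", "b"]
centroids = {"a": 1.0, "b": 5.0}
gauge = classification_error(Z, labels, centroids) - \
        classification_error(X, labels, centroids)
```

A gauge near zero suggests the anonymized data classifies much like the original, indicating better usability but possibly less privacy, as the text notes.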
III. METHODOLOGY
In this section, we describe the implemented methodology;
in this case, heuristics used in the anonymization of network
trace data, within the context of usability while at the same
time meeting privacy requirements. The goal of the heuristics
is to provide an anonymized data set, with statistical traits
close to those of the original data, that could be used by
researchers. The trade-off in this case is that we tilt towards
more utility while making it harder for an attacker to recover
the original data, assuming that the attacker has no prior
knowledge. Because of
the unique data structure of network traces, a single
generalized approach is not applicable in anonymizing all the
network trace attributes. In our approach, we apply a hybrid of
anonymization heuristics for each group of related attributes.
Combinations of microdata anonymization techniques were
used in this study, as illustrated in Figure 1. The following
attributes were anonymized in the network trace data: (i) Start
and End Time (Time-stamp), (ii) Source IP and Destination
IP, (iii) Protocol, (iv) Source Port and Destination Port, (v)
Source Packet Size and Destination Packet Size, (vi) Source
Bytes and Destination Bytes, (vii) TOS Flags. However, due
to space constraints, we only present results for the Timestamp
and IP Address attributes.
Figure 1: An illustration of the proposed anonymization heuristics for the network trace data.
A. Enumeration with multiplicative perturbation
To preserve the flow structure of the timestamp, we
employed enumeration with multiplicative perturbation, a
heuristic that combines multiplicative noise addition technique
from the microdata privatization techniques and enumeration
from network trace anonymization. The Enumeration with
Multiplicative Perturbation Heuristic is implemented as
follows: Step (i): A small epsilon constant value is chosen
between 0 and 1. Data curators could conceal this arbitrarily
chosen random value as an additional layer of confidentiality.
Step (ii): The epsilon constant value is then multiplied with the
original data (both the Start and End Time timestamp attributes),
generating an enumerated set. Step (iii): The generated
enumerated data is then added to the original data, producing
an anonymized data set. Step (iv): A
test for usability is done, using descriptive statistical analysis,
entropy, correlation, and unsupervised learning using
clustering (k-means). Step (v): If the desired threshold is met,
the anonymized data is published. The goal with this heuristic
is to keep the time flow structure intact and similar to the
original data while at the same time anonymizing the time
series values. In this case, the anonymized time series data
should generate similar usability results to the original.
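As a minimal sketch, the steps above can be expressed in Python; the epsilon value and timestamp values below are illustrative assumptions, not parameters or data from this study:

```python
import random

def enumeration_multiplicative_perturbation(timestamps, epsilon=None):
    """Steps (i)-(iii): choose epsilon in (0, 1), multiply it by the
    original values to form the enumerated set, then add the enumerated
    set back to the originals to produce the anonymized set."""
    if epsilon is None:
        # Step (i): the curator may conceal this arbitrary value.
        epsilon = random.uniform(0.0, 1.0)
    enumerated = [epsilon * t for t in timestamps]                # Step (ii)
    anonymized = [t + e for t, e in zip(timestamps, enumerated)]  # Step (iii)
    return anonymized, epsilon

# Illustrative epoch-second start times (not data from the study).
start_times = [1123355142, 1123355150, 1123355161]
anonymized, eps = enumeration_multiplicative_perturbation(start_times, epsilon=0.73)
```

Since every value is scaled by the same factor (1 + epsilon), the ordering and relative spacing of the time series, and hence its flow structure, are preserved, which is what the usability tests in Steps (iv) and (v) then check.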
B. Generalization and differential privacy
The IP address is one of the most challenging attributes to
anonymize since each octet of the IP address is limited to a
finite set of numbers, from 0 to 255. This makes the IP address
attribute vulnerable to attackers in attempts to de-anonymize
the privatized network trace [3]. With such restrictions, the
curator of the data is left with the choice of completely
anonymizing the IP address through full perturbation
techniques, which distort the flow structure and prefix of the
IP address and thus yield poor data usability. One solution to
this problem is to employ heuristics that provide anonymization
while keeping the prefix of the IP address intact. However, full
prefix-preserving IP address anonymization has been shown to be
prone to de-anonymization attacks, presenting yet another challenge
[5]. Therefore, to deal with this problem, we suggest a partial
prefix-preserving heuristic in which differential privacy and
generalization are used and implemented as follows. Octet 1
anonymization: The IP address is split into four octets.
Generalization is applied to the first octet to partially preserve
the prefix of the anonymized IP address. The goal is to give
the users of the anonymized data some level of usability by
being able to derive a synthetic flow of the IP address structure
in the network trace. Step (i): A small epsilon constant value is
chosen and applied to the first octet using additive or
multiplicative noise.
The goal is to preserve the flow structure in the first octet.
Step (ii): A frequency count analysis is done at this stage to
check that none of the first octet values from the original data
reappear in the anonymized data. Step (iii): If any first octet
values do reappear in the anonymized data, they are generalized
by replacing them with the most frequent value in the anonymized
first octet. Step (iv): Finally, generalization and k-anonymity
are applied to ensure that no unique values appear and that every
value in the first octet appears at least k > 1 times. Step (v):
A test for usability is done by comparing the original and
anonymized first octet values. Octet 2, 3, and 4 anonymization:
To make it difficult to de-anonymize
the full IP address, randomization using differential privacy is
applied to the remaining three octets. However, since each
octet is limited to the finite set of numbers 0 to 255, the
differential privacy perturbation process will generate some
values that exceed 255; for instance, an octet value of 350
would not be meaningful. To mitigate this, a control statement
is introduced at the end of the differential privacy process to
exclude all values outside the valid octet range. In this case,
any values greater than 255 are excluded from the end results of
the perturbation process. Differential privacy is applied to each of
the three octets vertically and separately. Step (i): A vertical
split of octets 2, 3, and 4 into separate attributes is done. Step
(ii): Anonymization using differential privacy is done on each
attribute (octet) separately at this stage. Step (iii): A test is
done to ensure that the anonymized values in each octet are in
range, from 0 to 255. Step (iv): If an anonymized value in an
octet exceeds the 0 to 255 range, a generalized value is returned
using the most frequent value in that range. Step (v): Test for
usability. Step (vi): Combine all octets into a full
anonymized IP address.
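A condensed sketch of the full heuristic follows; the noise scales, the specific multiplicative noise on octet 1, and the helper names are our own illustrative assumptions (the k-anonymity check of Step (iv) for octet 1 is omitted for brevity):

```python
import math
import random
from collections import Counter

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def anonymize_octet1(octets):
    """Octet 1: multiplicative noise plus generalization. Reappearing
    original values are replaced with the most frequent anonymized
    value, per Steps (ii)-(iii)."""
    originals = set(octets)
    noisy = [min(255, max(0, round(o * 1.9))) for o in octets]  # assumed noise
    most_frequent = Counter(noisy).most_common(1)[0][0]
    return [most_frequent if v in originals else v for v in noisy]

def anonymize_inner_octet(octets, scale=30.0):
    """Octets 2-4: Laplace (differential-privacy style) noise, with the
    control statement that generalizes out-of-range results to the most
    frequent in-range value (Steps (iii)-(iv))."""
    noisy = [round(o + laplace_noise(scale)) for o in octets]
    in_range = [v for v in noisy if 0 <= v <= 255]
    fallback = Counter(in_range).most_common(1)[0][0] if in_range else 128
    return [v if 0 <= v <= 255 else fallback for v in noisy]

def anonymize_ip(addresses):
    """Split addresses into octet columns, anonymize each column
    separately, and recombine into full IP addresses (Step (vi))."""
    cols = list(zip(*(map(int, a.split(".")) for a in addresses)))
    out = [anonymize_octet1(list(cols[0]))] + [
        anonymize_inner_octet(list(c)) for c in cols[1:]
    ]
    return [".".join(str(o) for o in octs) for octs in zip(*out)]

# Illustrative addresses (not data from the study).
anon_ips = anonymize_ip(["40.12.5.7", "41.200.13.9", "42.8.77.130"])
```

The first octet remains a deterministic function of the original (preserving flow), while the inner octets are randomized but clamped into the valid 0 to 255 range.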
IV. RESULTS
Preliminary results are presented in this section. However,
due to space limitation in this publication, only results for the
timestamp and IP address attributes are presented. Real 2014
network trace (NetFlow) data provided by Los Alamos
National Laboratory were used in this experiment. A total of
500,000 network flow records were anonymized in this study.
Microdata obfuscation techniques were applied for the
anonymization process. Each attribute of the NetFlow trace
was anonymized separately.
A. Timestamp anonymization and usability results
Descriptive statistical analysis was done on both the original
and anonymized data sets, as shown in Table II. The aim was
to study the statistical traits of both the original and
anonymized data sets and show any similarities. In this case,
the statistical traits of the anonymized data show an
augmentation of the original data, in effect generating a
synthetic data set. For instance, the original means of the start
time and end time were 1123355142 and 1123355214
respectively, while those of the anonymized data set were
1944808589 and 1944808714, a difference of 821453447 and
821453500 respectively. A larger difference might indicate
more privacy but less usability, while a smaller difference
might indicate better usability but less privacy. The results
presented in Table II indicate a middle ground, with both privacy
and usability needs met after trade-offs (the difference).
TABLE II. STATISTICAL TRAITS OF ORIGINAL AND ANONYMIZED
TIMESTAMP DATA
However, to meet the requirements of different users for the
anonymized data, a fine-tuning of the parameters in the
anonymization heuristics would need to be done. Additionally,
the normalized Shannon's entropy results, as shown in Table
II, were similar for both original and anonymized data at
approximately 0.77 and 0.76 for the start and end times
respectively. The entropy results indicate that the distortions
and uncertainty in both data sets might be similar. While the
entropy results might be good for usability, it could likewise
be argued that privacy levels might be inadequate since the
two data sets are similar in that regard. However, the
correlation values between the anonymized and original data
were 0.532 and 0.534 for the start and end time attributes
respectively. These results could indicate that while correlations
exist between the two data sets, the relationship is moderate,
since the values do not approach 1.
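The correlation values reported above can be reproduced conceptually with the Pearson coefficient; the short series below are illustrative, not drawn from the trace:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A purely scaled series correlates perfectly (r = 1); a heuristic that
# also injects randomness would pull r below 1.
original = [10.0, 12.0, 15.0, 11.0, 14.0]
scaled = [v * 1.73 for v in original]
r = pearson(original, scaled)
```

The reported values of roughly 0.53 would then reflect a linear relationship that is only partially preserved after perturbation.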
Figure 2: K-means clustering results for the original start and end time data.
The results might indicate that privacy is maintained in the
anonymized data, with an acceptable level of usability. In
Figure 2, results from clustering the original network trace
data (timestamp attribute) are presented. The x-axis in Figure 2
represents the start time, while the y-axis represents the end
time of the activity in the network trace. The value of k for the
k-means was set to 5 in this experiment. From an anecdotal
point of view, we can see that the clustering results in Figure 2
have their own skeletal structure. However, this is not the case
in Figure 3. In Figure 3, data privacy using noise addition was
applied idealistically, without much consideration given to the
issue of usability.
Figure 3: Idealistic Privacy application and clustering results
An anecdotal view of results in Figure 3 might point to better
privacy, since the skeletal cluster structure of the original data
was dismantled and replaced with a new skeletal cluster
structure.
Figure 4: K-means clustering for the anonymized start and end-time data.
However, usability remains a challenge, as the anonymized
clustering results are far from the original clustering. In the
case of this study, the aim was to obtain clustering results with
better usability. Therefore, a re-tuning of the parameters in the
data privacy procedure is done to achieve better usability. On
the other hand, the goal of using cluster analysis with k-means
was to analyze how the unlabeled original network trace data
would perform in comparison to the anonymized data.
Furthermore, the Davies-Bouldin criterion shows a value of 0.522,
as depicted in Table II, indicating how well the clustering
algorithm (k-means) performed with the original timestamp
(start and end times) data. In Figure 4, clustering results (with
k=5 for the k-means)
for the anonymized data are presented, with the x-axis
showing the start time and the y-axis presenting the end time.
Figure 5: K-means Cluster performance showing the average distance within
centroid and items in each cluster
The Davies-Bouldin criterion for the clustering performance on
the anonymized data was 0.393, as shown in Table II, a value
lower than that of the original data and an indication of better
clustering. However, while an anecdotal view of the plots
shows that the cluster results look similar, the number of items
in each cluster in the anonymized data differs from that of the
original, as shown in Figure 5. For instance, in Figure 5, the
number of items in cluster 0 for the original data is 310,678,
while that of the anonymized data is 291,002. The trade-off
would be the difference of 19,676 items. The challenge remains
how to effectively balance anonymity and usability
requirements, with trade-offs. In this case, if the usability
threshold is not met, then the curator can fine-tune the
anonymization parameters. The average-within-centroid
distance returned a lower value for the anonymized data, at
77,865, than for the original data, at 157,093, with the lower
value indicating better clustering, as shown in Figure 5.
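For reference, the Davies-Bouldin criterion used above can be computed directly; this is a minimal pure-Python sketch with toy two-dimensional points, not the study's implementation:

```python
import math

def _dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def davies_bouldin(points, labels, centroids):
    """Davies-Bouldin index: the average, over clusters, of the worst
    ratio (s_i + s_j) / d(c_i, c_j), where s_i is the mean distance of
    cluster i's members to its centroid. Lower values indicate tighter,
    better-separated clusters."""
    k = len(centroids)
    scatter = []
    for i in range(k):
        members = [p for p, lbl in zip(points, labels) if lbl == i]
        scatter.append(sum(_dist(p, centroids[i]) for p in members) / len(members))
    return sum(
        max((scatter[i] + scatter[j]) / _dist(centroids[i], centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

# Two tight, well-separated toy clusters yield a low (good) index.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
lbls = [0, 0, 0, 1, 1, 1]
cents = [(1 / 3, 1 / 3), (31 / 3, 31 / 3)]
db = davies_bouldin(pts, lbls, cents)
```

This makes concrete why the anonymized data's 0.393 reads as better clustering than the original's 0.522: the index shrinks as clusters tighten relative to their separation.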
B. Source and destination IP address anonymity results
The IP address remains a challenging attribute to anonymize
due to the finite nature of IP addresses. Each octet is limited to
the range 0 to 255, and obfuscation becomes constrained to that
range. As we noted earlier, octet values between 270 and 450,
for instance, would not be meaningful. In this section we present
preliminary
results on the anonymization and usability of the source and
destination IP attribute values using the heuristics in section 3.
Correlation: The correlation between the original and
anonymized data, as shown in Table III, for the first octet of
the source and destination IP show values of 0.9 and 1
respectively. These strong correlation values are indicative of
a strong linear relationship between the original and
anonymized octet 1 data. The first octet of the IP address was
anonymized using noise addition and generalization to keep
the flow structure similar to the original. Since a partial prefix
preserving anonymization was used, it is noteworthy that there
are strong correlation values between the original and
anonymized data for the first octet IP values.
TABLE III. STATISTICAL TRAITS OF ORIGINAL AND ANONYMIZED SOURCE AND DESTINATION IP ADDRESSES
Our view is that a researcher could still derive general
network information from the flow structure presented by the
first octet in the IP address without compromising the
specifics of the three inner octets. Yet the correlation
between the anonymized and original data for octets 2, 3, and 4
shows values of 0 for the destination IP addresses and minimal
values of -0.081, 0.093, and 0.213 for the source IP addresses,
indicating a very weak relationship between the anonymized
and original data for those octets. However, the very low
correlation values might be a good indicator of stronger privacy,
since we employed differential privacy in the anonymization of
octets 2, 3, and 4. Therefore, the correlation between the
anonymized and original data would be nonexistent, or at least
very minimal, due to the differential privacy randomization.
Hence the partial prefix-preserving heuristic works in this case:
the user of the anonymized data is only able to derive
information from the first octet, while all other internal IP
address information is kept confidential.
Entropy: The Shannon Entropy test was done on both the
original and anonymized data IP addresses to study the
uncertainty and randomness in the data sets. The normalized
Shannon's entropy values range between 0 and 1, with 0
indicating certainty and 1 indicating uncertainty. As shown in
both Table III and Figure 6, the entropy value for octet 1 in
both the original and anonymized data is approximately 0.1,
indicative of certainty of values and thus maintenance of flow
in the first octet. However, there is much less certainty in octets
3 and 4 of the original data and in octets 2, 3, and 4 of the
anonymized data, though the latter entropy values are lower
than those of the original. Nevertheless, octet 2 in the original
data provides more certainty than octet 2 in the anonymized
data. While the entropy levels in octets 3 and 4 of the original
data seem higher than those of the anonymized data, overall,
octets 2, 3, and 4 in the anonymized data provide more
distributed uncertainty, better randomness, and thus better
anonymity. Still, we constrained the random values generated
for octets 2, 3, and 4 during the differential privacy procedure
not to exceed 255. An octet value of 355 or 400 would affect
the usability of the anonymized IP address data. However, it
could be argued that the certainty levels are maintained in octet
1 for both the original and anonymized data, with distortion in
octets 2, 3, and 4 of the anonymized data, indicating that the
flow structure is kept and thus partial prefix-preserving
anonymity might be achieved.
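The normalized entropy values discussed above can be sketched as follows; we assume normalization by the log of the number of distinct observed values, since the paper does not spell out its exact normalization, and the octet columns below are illustrative:

```python
import math
from collections import Counter

def normalized_entropy(values):
    """Normalized Shannon entropy in [0, 1]: 0 indicates certainty
    (a single repeated value), 1 indicates maximal uncertainty
    (a uniform spread of values)."""
    n = len(values)
    counts = Counter(values)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    h_max = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / h_max

# Illustrative octet columns (not data from the study): a concentrated
# first octet versus an evenly spread inner octet.
octet1 = [200] * 9 + [41]
octet3 = list(range(0, 250, 25))
```

Under this sketch, a heavily concentrated first octet scores near 0 (certainty, preserved flow), while an evenly spread inner octet scores near 1, mirroring the pattern in Figure 6.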
Figure 6: Normalized Shannon's Entropy values for the original and
anonymized IP addresses.
Frequency Distribution histogram analysis: Furthermore, we
did a frequency analysis to compare the distribution of values
in each octet in the IP address, for both the original and
anonymized IP addresses. For the original data, the number of
items in octet 1 between 40 and 45, that is, source IP addresses
that start with octet values 40 to 45, came to approximately
400,000 out of 500,000 records, as shown in Figure 7. Similar
results were observed for the destination IP address, with about
300,000 octet 1 items with values 40 to 45, as illustrated in
Figure 8. With the exception of octet 2, the values in octets 3
and 4 are distributed across the range 0 to 85 in the original IP
address data; this correlates with the results shown in Figure 6,
with higher entropy values for octets 3 and 4 in the original
data, indicating more uncertainty. The x-axis in each graph
represents the IP octet values, and the y-axis shows the
frequency of each of those octet values. However, a look at
the anonymized IP address data shows that octet 1 had about
390,000 values beginning with 200, as shown in Figures 9 and
10 for the source and destination IP address data respectively.
The results show the effect of the generalization used in the
obfuscation of the original octet 1 data. The values in octets 2,
3, and 4 were distributed across the 0 to 255 range, with the
highest concentration around the octet value 190, due to the
constraints placed on the differential privacy results to prevent
the return of values greater than 255. As mentioned earlier, it
would not be meaningful for differential privacy results to
exceed 255. For octets 2, 3, and 4, the Laplace shape of the
distribution is kept, owing to the noise distribution used in the
differential privacy process.
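The frequency analysis described here amounts to binning each octet column of the addresses; a minimal sketch, in which the bin width and sample addresses are illustrative assumptions:

```python
from collections import Counter

def octet_histogram(addresses, octet_index, bin_width=5):
    """Count the values of one octet (0-based index) in fixed-width
    bins, keyed by the lower edge of each bin."""
    values = [int(a.split(".")[octet_index]) for a in addresses]
    return Counter((v // bin_width) * bin_width for v in values)

# Illustrative source IPs concentrated in the 40-45 first-octet band.
ips = ["40.1.2.3", "41.9.8.7", "42.5.5.5", "44.0.1.2", "200.3.4.5"]
hist = octet_histogram(ips, 0)
```

Comparing such histograms for the original and anonymized columns is what reveals, for example, the 40 to 45 concentration in the original octet 1 versus the concentration around 200 after generalization.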
Figure 7: Frequency distribution for the original source IP octet values.
Figure 8: Frequency distribution for the original destination IP octet values
Our recommendation as a result of this study is that a privacy
engineering approach be strongly considered by curators during
the anonymization process.
V. CONCLUSION
Anonymizing network traces while maintaining an acceptable
level of usability remains a challenge, especially when
employing privatization techniques used for microdata
obfuscation. Moreover, obfuscating network traces remains
problematic due to the IP addresses and octet values being
finite. Furthermore, generalized anonymization approaches
fail to deliver specific solutions, as each entity will have
unique data privacy and usability requirements, and the data in
most cases have varying characteristics to be considered
during the obfuscation process. In this study, we have
provided a review of literature, pointing out some of the
ongoing challenges in the network trace anonymization over
the last 10-year period. We have suggested usability-aware
anonymization heuristics by employing microdata privacy
techniques, while taking into consideration the usability of the
anonymized network trace data. Our preliminary results show
that, with trade-offs, it might be possible to generate
anonymized network traces with enhanced usability on a
case-by-case basis, using microdata anonymization techniques
such as differential privacy, k-anonymity, generalization, and
multiplicative noise addition.
Figure 9: Frequency distribution for anonymized source IP octet values
In the initial stage of the privacy engineering process, the
curators could gather privacy and usability requirements from
the stakeholders involved; these would include both the policy
makers and the anticipated users (researchers) of the anonymized
network trace data. The curators could then model the most
applicable approach given trade-offs, on a case-by-case basis.
The generated anonymization model could then be
implemented across the enterprise for uniformity and
prevention of information leakage attacks. On the limitations
of this study, focus was placed on usability-aware
anonymization of network trace data and not on the types of
attacks on anonymized network traces. While some
consideration and mention of anonymization attacks was given,
a focus on de-anonymization attacks was beyond the scope of
this study and is a subject left for future work.
Figure 10: Frequency distribution for anonymized destination IP octet values
ACKNOWLEDGMENT
We would like to express our appreciation to the Los
Alamos National Laboratory, and more specifically, the
Advanced Computing Solutions Group, for making this work
possible.
REFERENCES
[1] D. A. Maltz, J. Zhan, G. Xie, H. Zhang, G. Hjálmtýsson, A. Greenberg,
and J. Rexford, “Structure preserving anonymization of router
configuration data”, In Proceedings of the 4th ACM SIGCOMM
conference on Internet measurement (IMC '04), 2004, Pages 239-244.
[2] A. Slagell, J. Wang, and W. Yurcik, "Network log anonymization:
Application of crypto-pan to cisco netflows." In Proceedings of the
Workshop on Secure Knowledge Management, 2004.
[3] M. Bishop, R. Crawford, B. Bhumiratana, L. Clark, and K. Levitt,
"Some problems in sanitizing network data.", 15th IEEE International
Workshops on Enabling Technologies: Infrastructure for Collaborative
Enterprises, 2006., pp. 307-312.
[4] S.E. Coull, C.V. Wright, F. Monrose, M.P. Collins, and M.K. Reiter,
"Playing Devil's Advocate: Inferring Sensitive Information from
Anonymized Network Traces." In NDSS, 2007, vol. 7, pp. 35-47.
[5] B.F. Ribeiro, W. Chen, G. Miklau, and D.F. Towsley, "Analyzing
Privacy in Enterprise Packet Trace Anonymization." In NDSS, 2008.
[6] S. Gattani and T.E. Daniels, “Reference models for network data
anonymization”, In Proceedings of the 1st ACM workshop on Network
data anonymization (NDA '08), 2008, pp. 41-48.
[7] J. King, K. Lakkaraju, and A. Slagell. "A taxonomy and adversarial
model for attacks against network log anonymization." In Proceedings
of the 2009 ACM symposium on Applied Computing, 2009, pp. 1286-
1293.
[8] S.E. Coull, F. Monrose, M.K. Reiter, M. Bailey, "The Challenges of
Effectively Anonymizing Network Data," Conference For Homeland
Security, CATCH 2009, pp.230-236.
[9] M. Foukarakis, D. Antoniades, and M. Polychronakis, “Deep packet
anonymization”, In Proceedings of the Second European Workshop on
System Security (EUROSEC '09). ACM, 2009, pp. 16-21.
[10] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller,
"An overview of IP flow-based intrusion detection." Communications
Surveys & Tutorials, IEEE 12, no. 3, 2010, pp. 343-356.
[11] M. Burkhart, D. Schatzmann, B. Trammell, E. Boschi, and B. Plattner.
"The role of network trace anonymization under attack.", ACM
SIGCOMM Computer Communication Review 40, no. 1, 2010, pp. 5-
11.
[12] F. McSherry, and R. Mahajan, "Differentially-private network trace
analysis.", ACM SIGCOMM Computer Communication Review 41.4,
2011, pp. 123-134.
[13] F. McSherry, and R. Mahajan., "Differentially-private network trace
analysis.", ACM SIGCOMM Computer Communication Review 41, no.
4, 2011, pp. 123-134.
[14] R.R. Paul, V.C. Valgenti, M. Kim, "Real-time Netshuffle: Graph
distortion for on-line anonymization," Network Protocols (ICNP), 19th
IEEE International Conference on, 2011, pp. 133-134.
[15] D. Riboni, A. Villani, D. Vitali, C. Bettini, L.V. Mancini, "Obfuscation
of sensitive data in network flows," INFOCOM, 2012 Proceedings,
IEEE, 2012, pp.2372-2380.
[16] W. Qardaji and L. Ninghui, "Anonymizing Network Traces with
Temporal Pseudonym Consistency." IEEE 32nd International
Conference on Distributed Computing Systems Workshops (ICDCSW),
2012, pp. 622-633.
[17] M. Mendonca, S. Seetharaman, and K. Obraczka, "A flexible in-network
ip anonymization service.", In Communications (ICC), 2012 IEEE
International Conference, pp. 6651-6656.
[18] S. Jeon, J-H. Yun, and W-N. Kim, “Obfuscation of Critical
Infrastructure Network Traffic using Fake Communication”, Annual
Computer Security Applications Conference (ACSAC) 2013, Poster.
[19] M. Nassar, B. al Bouna, and Q. Malluhi, "Secure Outsourcing of
Network Flow Data Analysis.", In Big Data (BigData Congress), 2013
IEEE International Congress, 2013, pp. 431-432.
[20] C. Xiaoyun, S. Yujie, T. Xiaosheng, H. Xiaohong, and M. Yan, "On
measuring the privacy of anonymized data in multiparty network data
sharing.", Communications, China 10, no. 5, 2013, pp. 120-127.
[21] Y-D. Lin, P-C. Lin, S-H. Wang, I-W. Chen, and Y-C. Lai,
"Pcaplib: A system of extracting, classifying, and anonymizing real
packet traces.", IEEE Systems Journal, Issue 99, pp.1-12.
[22] J. Stanek, L. Kencl, and J. Kuthan, "Analyzing anomalies in anonymized
SIP traffic.", In IEEE 2014 IFIP Networking Conference, 2014, 2014,
pp. 1-9.
[23] D. Riboni, A. Villani, D. Vitali, C. Bettini, and L.V. Mancini,
"Obfuscation of Sensitive Data for Incremental Release of Network
Flows," IEEE Transactions on Networking, Issue 99, 2014, pp.1.
[24] T. Farah, and L. Trajkovic, "Anonym: A tool for anonymization of the
Internet traffic." In IEEE 2013 International Conference on Cybernetics
(CYBCONF), 2013, pp. 261-266.
[25] A.J. Slagell, K. Lakkaraju, and K. Luo, "FLAIM: A Multi-level
Anonymization Framework for Computer and Network Logs." In LISA,
vol. 6, 2006, pp. 3-8.
[26] J. Xu, J. Fan, M.H. Ammar, and Sue B. Moon, "Prefix-preserving ip
address anonymization: Measurement-based security evaluation and a
new cryptography-based scheme.", In 10th IEEE International
Conference on Network Protocols, 2002, pp. 280-289.
[27] M. Burkhart, D. Brauckhoff, M. May, and E. Boschi, "The risk-utility
tradeoff for IP address truncation." In Proceedings of the 1st ACM
workshop on Network data anonymization, 2008, pp. 23-30.
[28] W. Yurcik, C. Woolam, G. Hellings, L. Khan, B. Thuraisingham,
"Measuring anonymization privacy/analysis tradeoffs inherent to sharing
network data", IEEE Network Operations and Management Symposium,
2008, pp.991-994.
[29] V. Ciriani, S.D.C. Vimercati, S. Foresti, and P. Samarati, “Theory of
privacy and Anonymity”, In M. J. Atallah & M. Blanton (Eds.), In
Algorithms and theory of computation handbook, CRC Press, 2009, pp.
18-33.
[30] P. Samarati and L. Sweeney, “Protecting privacy when disclosing
information: k-anonymity and its enforcement through generalization
and suppression”, Technical Report SRI-CSL-98-04, SRI Computer
Science Laboratory, 1998
[31] L. Sweeney, “Achieving k-anonymity privacy protection using
generalization and suppression”, International Journal of Uncertainty
Fuzziness and Knowledge-Based Systems, 10(5), 2002, pp.571–588.
[32] T. Dalenius and S.P. Reiss, “Data-swapping: A technique for disclosure
control”, Journal of Statistical Planning and Inference, 6(1), 1982, pp.
73–85.
[33] J. Kim, “A Method For Limiting Disclosure in Microdata Based
Random Noise and Transformation”, In Proceedings of the Survey
Research Methods, American Statistical Association, Vol. A, 1986, pp.
370–374.
[34] J. Kim and W.E. Winkler, “Multiplicative Noise for Masking
Continuous Data”, Research Report Series, Statistics #2003-01,
Statistical Research Division. 2003, Washington, D.C. Retrieved from
http://www.census.gov/srd/papers/pdf/rrs2003-01.pdf
[35] C. Dwork, “Differential Privacy”, In M. Bugliesi, B. Preneel, V.
Sassone, & I. Wegener (Eds.), Automata languages and programming,
Vol. 4052, 2006, pp. 1–12. Springer.
[36] K. Mivule, “An Investigation Of Data Privacy and utility using machine
learning as a gauge”, Dissertation, Computer Science Department,
Bowie State University, 2014, ProQuest No: 3619387.
[37] M.H. Dunham, “Data Mining Introductory and Advanced Topics”,
2003, pp. 58–60, 97–99. Upper Saddle River, New Jersey: Prentice Hall.
[38] K. Mivule, (2012). “Utilizing noise addition for data privacy, an
Overview”, In Proceedings of the International Conference on
Information and Knowledge Engineering (IKE), 2012, pp. 65–71.
[39] S.E. Coull, C.V. Wright, A.D. Keromytis, F. Monrose, and M.K. Reiter,
“Taming the devil: Techniques for evaluating anonymized network
data”, In Network and Distributed System Security Symposium, 2008,
pp. 125-135.