Camouflage: Automated Anonymization of Field Data (ICSE 2011)

CAMOUFLAGE: AUTOMATED ANONYMIZATION OF FIELD DATA
James Clause, University of Delaware
Alessandro Orso, Georgia Institute of Technology

THE BIG PICTURE
• Apple crash reporter
• Windows error reporting
• Ubuntu Apport
• Gnome BugBuddy
• Mozilla / Google Breakpad
• many others
• Chilimbi and colleagues ’09
• Elbaum and Diep ’05
• Hilbert and Redmiles ’00
• Liblit and colleagues ’05
• Pavlopoulou and Young ’99
• many others

Handling concerns in practice
PRIVACY CONCERNS
• Ignore them
• Privacy policies
• Collect limited amounts of information
• less likely to be sensitive
• can rely on user checking

PRIVACY CONCERNS
Unfortunately:
[Figure: register values, stack dumps, branch profiles, path profiles, and test cases, placed along two axes labeled "Usefulness" and "Privacy concerns"]
GOAL: Enable the collection of detailed information
while reducing or eliminating privacy concerns.

OUTLINE
• Intuition
• Castro and colleagues’ technique
• Our improvements
• Path condition relaxation
• Breakable input conditions
• Evaluation
• Related work
• Conclusions and future work

INTUITION
[Figure: the input domain, with regions labeled "Inputs that cause F" and "Inputs that satisfy F’s path condition"; the marked points are the sensitive input (I) that causes F and an anonymized input (I’) that also causes F]

CASTRO AND COLLEAGUES’ TECHNIQUE
(PATH CONDITION GENERATION)
Path condition: set of constraints on a program’s inputs that encode the conditions necessary for a specific path to be executed.

boolean foo(int x, int y, int z) {
    if(x <= 5) {
        int a = x * 2;
        if(y + a > 10) {
            if(z == 0) {
                return true;
            }
        }
    }
    return false;
}

Input (sensitive): x = 5, y = 3, z = 0
Symbolic State:
x→i1
y→i2
z→i3
a→i1*2
Path Condition:
i1 <= 5
∧ i2+i1*2 > 10
∧ i3 == 0
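
The walkthrough above can be imitated by hand. The following is a minimal sketch, assuming a hand-instrumented copy of foo that records each branch outcome as text over the symbols i1, i2, i3. The class and method names are hypothetical, and this is not how a symbolic execution engine actually works internally; it only illustrates what the recorded path condition contains.

```java
import java.util.ArrayList;
import java.util.List;

class PathConditionSketch {
    // Hand-instrumented copy of foo (hypothetical helper): runs on concrete
    // inputs and records, as text over i1, i2, i3, each branch outcome taken.
    static List<String> recordPathCondition(int x, int y, int z) {
        List<String> pc = new ArrayList<>();
        if (x <= 5) {
            pc.add("i1 <= 5");
            int a = x * 2;                  // symbolic state: a -> i1*2
            if (y + a > 10) {
                pc.add("i2+i1*2 > 10");
                if (z == 0) {
                    pc.add("i3 == 0");
                } else {
                    pc.add("i3 != 0");
                }
            } else {
                pc.add("i2+i1*2 <= 10");
            }
        } else {
            pc.add("i1 > 5");
        }
        return pc;
    }

    public static void main(String[] args) {
        // The sensitive input from the slides: x = 5, y = 3, z = 0.
        System.out.println(String.join(" AND ", recordPathCondition(5, 3, 0)));
    }
}
```

For the input 5 3 0 the recorded clauses are the three conjuncts shown on the slide.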

CASTRO AND COLLEAGUES’ TECHNIQUE
(CHOOSING ANONYMIZED INPUTS)
Path condition (input to the constraint solver):
i1 <= 5
∧ i2+i1*2 > 10
∧ i3 == 0
Solution returned by the constraint solver:
i1 == 5
i2 == 3
i3 == 0
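
To see how much freedom the solver has, the path condition can be checked by brute force over a small range. This is only a sketch under assumptions: the bounded range and the class and method names are hypothetical, and a real constraint solver does not enumerate inputs this way.

```java
class SolutionCountSketch {
    // The path condition from the running example.
    static boolean satisfiesPathCondition(int i1, int i2, int i3) {
        return i1 <= 5 && i2 + i1 * 2 > 10 && i3 == 0;
    }

    // Counts the inputs with every component in [lo, hi] that satisfy it.
    static int countSolutions(int lo, int hi) {
        int count = 0;
        for (int i1 = lo; i1 <= hi; i1++) {
            for (int i2 = lo; i2 <= hi; i2++) {
                for (int i3 = lo; i3 <= hi; i3++) {
                    if (satisfiesPathCondition(i1, i2, i3)) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSolutions(0, 10));           // candidates in 0..10
        System.out.println(satisfiesPathCondition(5, 3, 0)); // the original input
    }
}
```

The original input 5 3 0 is itself one of the satisfying assignments, so nothing in the path condition alone stops a solver from returning it.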

OUR IMPROVEMENTS
Increase the number of possible choices for I’
Choose I’ such that it is as different as possible from I

PATH CONDITION RELAXATION
[Figure: the input domain, containing the sensitive input (I) that causes F]

1. Array inequalities
2. Switch statements
3. Multi-clause conditionals
4. Array reads

x.equals(y);
Inputs: x = abc, y = abd
Traditional:
x0 == y0
∧ x1 == y1
∧ x2 != y2
Relaxed:
x0 != y0
∨ x1 != y1
∨ x2 != y2
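
The difference between the two constraints can be checked on concrete candidates. This is a minimal sketch, assuming equal-length strings and an original pair that is not equal; the class and method names are hypothetical.

```java
class RelaxedEqualsSketch {
    // Index of the first position where two equal-length strings differ, or -1.
    static int firstDifference(String x, String y) {
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != y.charAt(i)) {
                return i;
            }
        }
        return -1;
    }

    // Traditional constraint: match the original comparison step by step,
    // equal before the first difference and unequal exactly there.
    static boolean satisfiesTraditional(String origX, String origY, String x, String y) {
        int k = firstDifference(origX, origY);
        if (k < 0) {
            throw new IllegalArgumentException("original strings must differ");
        }
        for (int i = 0; i < k; i++) {
            if (x.charAt(i) != y.charAt(i)) {
                return false;
            }
        }
        return x.charAt(k) != y.charAt(k);
    }

    // Relaxed constraint: the candidates differ in at least one position.
    static boolean satisfiesRelaxed(String x, String y) {
        return firstDifference(x, y) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesTraditional("abc", "abd", "qbc", "zbd"));
        System.out.println(satisfiesRelaxed("qbc", "zbd"));
    }
}
```

The pair qbc / zbd is rejected by the traditional constraint but accepted by the relaxed one, even though equals returns false for it just as it does for abc / abd.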

switch(x) {
    case 1:
        ...
        break;
    case 3:
    case 5:
        ...
        break;
    default:
        ...
}
Input: x = 5
Traditional:
x == 5
Relaxed:
x == 5
∨ x == 3

Same switch statement, input: x = 10
Traditional:
x == 10
Relaxed:
x != 1
∧ x != 3
∧ x != 5
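
Both switch examples follow one rule: a candidate is acceptable if it reaches the same block as the observed value. This is a sketch under that reading of the slides; the block numbering and the class and method names are hypothetical.

```java
class SwitchRelaxationSketch {
    // Which block of the slide's switch statement a value reaches:
    // 0 for "case 1", 1 for "case 3 / case 5", 2 for "default".
    static int block(int x) {
        switch (x) {
            case 1:
                return 0;
            case 3:
            case 5:
                return 1;
            default:
                return 2;
        }
    }

    // Traditional constraint: the candidate must equal the observed value.
    static boolean satisfiesTraditional(int observed, int candidate) {
        return candidate == observed;
    }

    // Relaxed constraint: the candidate must reach the same block.
    static boolean satisfiesRelaxed(int observed, int candidate) {
        return block(candidate) == block(observed);
    }

    public static void main(String[] args) {
        System.out.println(satisfiesRelaxed(5, 3));   // same block as 5
        System.out.println(satisfiesRelaxed(10, 42)); // both reach default
        System.out.println(satisfiesRelaxed(10, 3));  // 3 does not reach default
    }
}
```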

BREAKABLE INPUT CONDITIONS
Original input: x = 5, y = 3, z = 0
Path Condition:
i1 <= 5
∧ i2+i1*2 > 10
∧ i3 == 0
Without further constraints, the constraint solver may return the original input:
i1 == 5
i2 == 3
i3 == 0
Breakable Input Condition:
i1 != 5
∧ i2 != 3
∧ i3 != 0
Solution returned by the constraint solver with both conditions:
i1 == 4
i2 == 10
i3 == 0
Here i3 != 0 is broken, because the path condition requires i3 == 0.
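
The effect of breakable input conditions can be approximated with a small search. This is a sketch under assumptions: a bounded brute-force search stands in for the constraint solver, "as different as possible" is approximated by counting how many breakable conditions hold, and all names are hypothetical.

```java
import java.util.Arrays;

class BreakableInputSketch {
    // The path condition from the running example; it must always hold.
    static boolean satisfiesPathCondition(int i1, int i2, int i3) {
        return i1 <= 5 && i2 + i1 * 2 > 10 && i3 == 0;
    }

    // Number of breakable input conditions (component != original value) that hold.
    static int conditionsKept(int[] original, int i1, int i2, int i3) {
        int kept = 0;
        if (i1 != original[0]) kept++;
        if (i2 != original[1]) kept++;
        if (i3 != original[2]) kept++;
        return kept;
    }

    // Searches [lo, hi]^3 for an input that satisfies the path condition and
    // keeps as many breakable conditions as possible; returns null if none exists.
    static int[] chooseAnonymizedInput(int[] original, int lo, int hi) {
        int[] best = null;
        int bestKept = -1;
        for (int i1 = lo; i1 <= hi; i1++) {
            for (int i2 = lo; i2 <= hi; i2++) {
                for (int i3 = lo; i3 <= hi; i3++) {
                    if (!satisfiesPathCondition(i1, i2, i3)) {
                        continue;
                    }
                    int kept = conditionsKept(original, i1, i2, i3);
                    if (kept > bestKept) {
                        bestKept = kept;
                        best = new int[] {i1, i2, i3};
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] original = {5, 3, 0};
        System.out.println(Arrays.toString(chooseAnonymizedInput(original, 0, 10)));
    }
}
```

The search need not return the slide's 4 10 0. Any input that satisfies the path condition while differing from the original in i1 and i2 is an equally valid choice under this scoring.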

ASSUMPTIONS
1. The failure f is observable and can be detected with an assertion.
‣ common to all debugging techniques; holds in most, if not all, cases.
2. Any input that satisfies the path condition results in f.
• Non-determinism
‣ common to all debugging techniques; requires a deterministic replay mechanism
• Implicit checks (e.g., division by zero)
‣ likely that they do not involve relevant inputs
‣ make implicit checks explicit (e.g., 100/x → assert x != 0)
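
The last point can be illustrated with the slide's own example, 100/x. This is a minimal sketch with hypothetical names. It uses an explicit if/throw rather than Java's assert statement, because assert is disabled at run time unless assertions are enabled.

```java
class ExplicitCheckSketch {
    // Implicit check: the division itself fails with ArithmeticException
    // when x is 0, and no branch on x appears in the code.
    static int divideImplicit(int x) {
        return 100 / x;
    }

    // Explicit check: the condition x != 0 is now a branch in the code,
    // so it can show up as a clause of the path condition.
    static int divideExplicit(int x) {
        if (x == 0) {
            throw new AssertionError("x != 0");
        }
        return 100 / x;
    }

    public static void main(String[] args) {
        System.out.println(divideExplicit(4));
    }
}
```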

EVALUATION
1 Feasibility
Can the approach generate, in a reasonable amount of time, anonymized inputs that reproduce a failure?
2 Strength
How much information about the original inputs is revealed?
3 Effectiveness
Are the anonymized inputs safe to send to developers?
4 Improvement
Does the use of path condition relaxation and breakable input conditions provide any benefits over the basic approach?

PROTOTYPE IMPLEMENTATION
[Figure: the path condition, breakable input condition, and constraint solver pipeline from the running example, annotated with the tools Java Pathfinder, Yices, and Ruby scripts, and ending in executable inputs]

SUBJECTS
• Columba: 1 fault
• htmlparser: 1 fault
• Printtokens: 2 faults
• NanoXML: 16 faults
(20 faults, total)

SUBJECTS
Select sensitive failure-inducing inputs
• 170 total inputs
• manually generated or included with subject
• 100 bytes to 5MB in size
(Assume all of each input is potentially sensitive)

RQ1: FEASIBILITY
[Figure: average execution time (s) and average solver time (s) for each of the 20 faults (columba, htmlparser, printtokens1 and printtokens2, nanoxml1 through nanoxml16); axis scales 0 to 600 and 0 to 20]
Inputs can be anonymized in a reasonable amount of time (easily done overnight)

RQ2: STRENGTH
Average % Bits Revealed
Measures how many inputs satisfy the path condition
• many satisfying inputs: little information revealed
• few satisfying inputs: lots of information revealed
Average % Residue
Measures how much of the anonymized input is identical to the original input
[Example: I consists of the lines AAAAAA, secret, AAAAAA, ..., AAAAAA; I’ consists of the lines BBBBBB, secret, BBBBBB, ..., BBBBBB. The line "secret" is unchanged in I’.]
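
The slides do not say how residue is computed. As a rough illustration only, the sketch below assumes a position-by-position character comparison and reports the matching share of the original input as a whole-number percentage. The class and method names, and the comparison rule itself, are assumptions.

```java
class ResidueSketch {
    // Percentage (rounded down) of the original's characters that appear
    // unchanged at the same position in the anonymized input.
    // Assumption: position-wise comparison; the slides do not specify this.
    static int residuePercent(String original, String anonymized) {
        if (original.isEmpty()) {
            return 0;
        }
        int limit = Math.min(original.length(), anonymized.length());
        int same = 0;
        for (int i = 0; i < limit; i++) {
            if (original.charAt(i) == anonymized.charAt(i)) {
                same++;
            }
        }
        return 100 * same / original.length();
    }

    public static void main(String[] args) {
        String original = "AAAAAA\nsecret\nAAAAAA";
        String anonymized = "BBBBBB\nsecret\nBBBBBB";
        System.out.println(residuePercent(original, anonymized));
    }
}
```

In this shortened version of the slide's example, the unchanged characters are the line "secret" and the two line breaks around it.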

RQ2: STRENGTH
[Figure: average % bits revealed and average % residue, each on a 0 to 100 scale, for each of the 20 faults (columba, htmlparser, printtokens1 and printtokens2, nanoxml1 through nanoxml16)]
Anonymized inputs reveal, on average, between 60% (worst case) and 2% (best case) of the information in the original inputs

RQ3: EFFECTIVENESS
NANOXML
<!DOCTYPE Foo [
   <!ELEMENT Foo (ns:Bar)>
   <!ATTLIST Foo
       xmlns CDATA #FIXED 'http://nanoxml.n3.net/bar'
       a     CDATA #REQUIRED>
   <!ELEMENT ns:Bar (Blah)>
   <!ATTLIST ns:Bar
       xmlns:ns CDATA #FIXED 'http://nanoxml.n3.net/bar'>
   <!ELEMENT Blah EMPTY>
   <!ATTLIST Blah
       x    CDATA #REQUIRED
       ns:x CDATA #REQUIRED>
]>

<Foo a='very' b='secret' c='stuff'>vaz
   <ns:Bar>
       <Blah x="1" ns:x="2"/>
   </ns:Bar>
</Foo>

RQ3: EFFECTIVENESS
NANOXML
<!DOCTYPE [
   <! >
   <!ATTLIST
        #FIXED ' '
        >
   <!E >
   <!ATTLIST
        #FIXED ' '>
   <!E >
   <!ATTLIST
        #
        : # >
]>

< =' ' =' ' =' '>
   < : >
       < =" " : =" "/>
   </ :

RQ3: EFFECTIVENESS
COLUMBA
Wayne,Bartley,Bartley,Wayne,wbartly@acp.com,,
Ronald,Kahle,Kahle,Ron,ron.kahle@kahle.com,,
Wilma,Lavelle,Lavelle,Wilma,,lavelle678@aol.com,
Jesse,Hammonds,Hammonds,Jesse,,hamj34@comcast.com,
Amy,Uhl,Uhl,Amy,uhla@corp1.com,uhla@gmail.com,
Hazel,Miracle,Miracle,Hazel,hazel.miracle@corp2.com,,
Roxanne,Nealy,Nealy,Roxie,,roxie.nearly@gmail.com,
Heather,Kane,Kane,Heather,kaneh@corp2.com,,
Rosa,Stovall,Stovall,Rosa,,sstoval@aol.com,
Peter,Hyden,Hyden,Pete,,peteh1989@velocity.net,
Jeffrey,Wesson,Wesson,Jeff,jwesson@corp4.com,,
Virginia,Mendoza,Mendoza,Ginny,gmendoza@corp4.com,,
Richard,Robledo,Robledo,Ralph,ralphrobledo@corp1.com,,
Edward,Blanding,Blanding,Ed,,eblanding@gmail.com,
Sean,Pulliam,Pulliam,Sean,spulliam@corp2.com,,
Steven,Kocher,Kocher,Steve,kocher@kocher.com,,
Tony,Whitlock,Whitlock,Tony,,tw14567@aol.com,
Frank,Earl,Earl,Frankie,,,
Shelly,Riojas,Riojas,Shelly,srojas@corp6.com,,

RQ3: EFFECTIVENESS
COLUMBA
, , , , ,,
, , , , ,,
, , , ,, ,
, , , ,, ,
, , , , , ,
, , , , ,,
, , , ,, ,
, , , , ,,
, , , ,, ,
, , , ,, ,
, , , , ,,
, , , , ,,
, , , , ,,
, , , ,, ,
, , , , ,,
, , , , ,,
, , , ,, ,
, , , ,,,

RQ3: EFFECTIVENESS
HTMLPARSER
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>james clause @ gatech | home</title>
<style type="text/css" media="screen" title="">
<![CDATA[
</style>
</head>
<body>
...
</body>
The portions of the inputs that remain after anonymization tend to be structural in nature and therefore are safe to send to developers

RQ4: IMPROVEMENT
[Figure: % improvement in bits revealed and % improvement in residue, each on a 0 to 100 scale, for each of the 20 faults (columba, htmlparser, printtokens1 and printtokens2, nanoxml1 through nanoxml16)]
Inputs anonymized using our improvements reveal an average of 30% fewer bits of information and 40% less residue.
(With only a marginal increase in time.)

RELATED WORK
• Castro and colleagues ’08
• Broadwell and colleagues ’03
‣ Uses dynamic taint analysis to identify and remove sensitive information in crash dumps.
• Wang and colleagues ’08
‣ Uses a combination of dynamic taint analysis, symbolic execution and Q/A with a client machine to construct anonymized inputs
• Data set anonymization techniques (e.g., k-anonymization)
• Budi and colleagues ’11
• Grechanik and colleagues ’11
• Dynamic symbolic execution techniques

FUTURE WORK
• Additional quality metrics that:
• consider additional aspects of privacy loss
• consider the relative sensitivity of different inputs
• are intuitive and easy to use

FUTURE WORK
• Conduct additional (human) studies
• additional (larger) subjects
• Investigate the combination of anonymization and minimization

SUMMARY
1. An approach for automatically anonymizing failure-inducing inputs
• extends Castro and colleagues’ technique through the novel concepts of path condition relaxation and breakable input conditions
2. An empirical evaluation that demonstrates that, for the subjects considered, our approach is:
• feasible: generates anonymized inputs in < 10 minutes
• effective: anonymized inputs did not contain sensitive information
• an improvement over the state-of-the-art
