Preserving Privacy and Utility in Text Data Analysis
Tom Diethe, Oluwaseyi Feyisetan, Thomas Drake, Borja Balle
{sey,tdiethe,draket}@amazon.com
borja.balle@gmail.com
PrivateNLP Workshop, WSDM
February 7 2020
Outline
1 Alexa AI
2 Algorithmic Privacy
3 Privacy for Text
4 Differential Privacy in Euclidean Spaces
5 Differential Privacy in Hyperbolic Spaces
6 Optimizing the Privacy Utility Trade-off
7 Summary
Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 1 / 41
Alexa AI
What is Alexa?
A cloud-based voice service that can help
you with tasks, entertainment, general
information, shopping, and more
The more you talk to Alexa, the more
Alexa adapts to your speech patterns,
vocabulary, and personal preferences
How do we ...
create robust and efficient AI systems?
maintain the privacy of customer data?
Failure Modes
Unintentional failures: ML system produces a formally correct but completely unsafe
outcome
Outliers/anomalies
Dataset shift
Limited memory
Intentional failures: failure is caused by an active adversary attempting to subvert the
system to attain her goals, such as to:
misclassify the result
infer private training data
steal the underlying algorithm
A first attempt: Can’t I just anonymize my data?
k-anonymity: the information for each person cannot be distinguished from that of at least k − 1
other individuals whose information also appears in the release
Suppose a company is audited for salary discrimination
The auditor can see salaries by gender, age and nationality for each department and office
If the auditor has a friend, an ex or a date working for the company, she will learn the salary
of that person
Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case)
Example record, before and after reducing granularity:
Office | Dept. | Salary | D.O.B. | Nationality | Gender
London | IT | £##### | May 1985 | Portuguese | Female
UK | IT | £##### | 1980-1985 | - | Female
This still presents a risk of re-identification! If there are 10 females born between 1980 and 1985 in the
whole of the UK’s IT department, 9 of them could conspire to learn the salary of the 10th
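The k-anonymity definition above can be checked mechanically. The following is a minimal sketch, assuming records are plain dictionaries and that the analyst has already chosen which columns count as quasi-identifiers; the `k_anonymity` helper, the column names and the salary values are all hypothetical and only illustrate the definition.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous, i.e. the
    size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical release after reducing granularity (illustrative values only).
release = [
    {"office": "UK", "dept": "IT", "born": "1980-1985", "gender": "Female", "salary": 41000},
    {"office": "UK", "dept": "IT", "born": "1980-1985", "gender": "Female", "salary": 45000},
    {"office": "UK", "dept": "IT", "born": "1980-1985", "gender": "Male", "salary": 43000},
]
quasi_ids = ["office", "dept", "born", "gender"]
k = k_anonymity(release, quasi_ids)  # the single male record forms a group of size 1
```

Even a large k does not rule out the collusion attack described on the slide: members of a group can pool what they know about themselves to isolate the remaining record.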
Anonymized Data Isn’t
Example 1: In the mid-1990s, the Massachusetts “Group Insurance Commission” (GIC) released
“anonymized” data on state employees that showed every hospital visit
The goal was to help researchers. All obvious identifiers such as name, address, and
social security number were removed
Latanya Sweeney, then an MIT PhD student, decided to attempt to reverse the anonymization and
requested a copy of the data
Reidentification
William Weld, then Governor of Massachusetts, assured the public that GIC had protected
patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital
records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts,
population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the
city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every
voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: Only 6
people shared his birth date, only 3 of them men, and of them, only he lived in his ZIP code.
Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office.
Anonymized Data Isn’t
Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated
movies over a six-year period
Netflix “anonymized” the data before releasing it by removing usernames, but assigned
unique identification numbers to users in order to allow for continuous tracking of user
ratings and trends
Reidentification
Researchers used this information to uniquely identify individual Netflix users by crossing the
data with the public IMDb database. According to the study, a person who knows
when and how a user rated six movies can identify 99% of people in the Netflix
database.
Differential Privacy
A randomised mechanism $M : X \to Y$ is $\varepsilon$-differentially private if for all neighbouring inputs
$x, x'$ (i.e. $\|x - x'\|_1 = 1$) and for all sets of outputs $E \subseteq Y$ we have
$$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E]$$
[Figure: output distributions of M(D) and M(D'); their ratio is bounded by $e^{\varepsilon}$]
Mechanisms:
Randomised response → plausible deniability
Laplace mechanism: e.g. $\tilde{\mu} = \mu + \xi$, $\xi \sim \mathrm{Lap}\left(\frac{1}{\varepsilon n}\right)$
Output perturbation
...
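The Laplace mechanism for a mean can be sketched in a few lines. This is an illustration under stated assumptions, not the speakers' implementation: values are assumed to lie in [0, 1], the dataset size n is assumed public, and neighbouring datasets are assumed to differ in one value, so the mean has sensitivity 1/n and the noise scale is 1/(εn). The helper names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, epsilon, rng):
    """Noisy mean of values assumed to lie in [0, 1] (Laplace mechanism)."""
    n = len(values)
    clipped = [min(1.0, max(0.0, v)) for v in values]  # enforce the [0, 1] assumption
    mu = sum(clipped) / n
    return mu + laplace_noise(1.0 / (epsilon * n), rng)

rng = random.Random(0)
data = [0.2, 0.9, 0.4, 0.7, 0.5] * 200  # n = 1000 illustrative values
estimate = private_mean(data, epsilon=1.0, rng=rng)
```

Because the noise scale shrinks as 1/n, the estimate stays close to the true mean for large datasets even at moderate ε.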
Randomized Response [Warner ’65]
Say you want to release a bit x ∈ {Yes, No}. Do the following:
1 flip a coin
2 if tails, respond truthfully with x
3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails
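Claim: the above algorithm satisfies $(\log 3)$-differential privacy
$$\frac{\Pr[\text{Response} = \text{Yes} \mid x = \text{Yes}]}{\Pr[\text{Response} = \text{Yes} \mid x = \text{No}]} = \frac{1/2 \times 1 + 1/2 \times 1/2}{1/2 \times 0 + 1/2 \times 1/2} = \frac{3/4}{1/4} = 3 \implies e^{\varepsilon} = 3$$
The same bound holds for $\frac{\Pr[\text{Response} = \text{No} \mid x = \text{Yes}]}{\Pr[\text{Response} = \text{No} \mid x = \text{No}]}$, which equals $1/3$ (its reciprocal is 3).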
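The coin-flipping procedure above can be simulated to see both the privacy parameter and the utility that survives. A minimal sketch: the population (a true "Yes" rate of 0.3) and the helper names are hypothetical, and the debiasing step relies on P[Response = Yes] = 1/4 + p/2, where p is the true "Yes" rate.

```python
import math
import random

def randomized_response(truth, rng):
    """Randomised response for a boolean answer, following the steps above."""
    if rng.random() < 0.5:       # first coin: tails, answer truthfully
        return truth
    return rng.random() < 0.5    # first coin: heads, a second coin decides

def estimate_yes_rate(responses):
    """Debias the observed rate: P[Yes] = 1/4 + p/2, so p = 2 * (rate - 1/4)."""
    rate = sum(responses) / len(responses)
    return 2.0 * (rate - 0.25)

# Privacy parameter from the two conditional probabilities on the slide.
p_yes_given_yes = 0.5 * 1 + 0.5 * 0.5
p_yes_given_no = 0.5 * 0 + 0.5 * 0.5
epsilon = math.log(p_yes_given_yes / p_yes_given_no)

rng = random.Random(0)
truths = [i < 3000 for i in range(10000)]  # hypothetical population, 30% "Yes"
responses = [randomized_response(t, rng) for t in truths]
estimated_rate = estimate_yes_rate(responses)
```

Each individual response is deniable, yet the aggregate rate can still be recovered with an error that shrinks as the number of respondents grows.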
Important Properties
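Robustness to post-processing: if $M$ is $(\varepsilon, \delta)$-DP, then $f(M)$ is $(\varepsilon, \delta)$-DP
Composition: if $M_1, \dots, M_n$ are $(\varepsilon_i, \delta_i)$-DP respectively, then $g(M_1, \dots, M_n)$ is
$\left(\sum_{i=1}^{n} \varepsilon_i, \sum_{i=1}^{n} \delta_i\right)$-DP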
Protects against arbitrary side knowledge
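The composition property suggests simple bookkeeping: each release on the same data spends part of a privacy budget, and the totals add up. A minimal sketch of that accounting, using only the basic composition rule stated above; the `compose` helper and the three example budgets are hypothetical.

```python
def compose(budgets):
    """Basic composition: sum the epsilons and the deltas of each release."""
    total_eps = sum(eps for eps, _ in budgets)
    total_delta = sum(delta for _, delta in budgets)
    return total_eps, total_delta

# Three hypothetical releases computed on the same dataset.
total_eps, total_delta = compose([(0.5, 1e-6), (0.25, 1e-6), (0.25, 0.0)])
```

Post-processing does not appear in the ledger at all: any function applied to an already-released output costs no additional budget.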
User-AI system interaction via natural language
User’s goal: meet some specific need with respect to an
issued query x
Agent’s goal: satisfy the user’s request
Privacy violation: occurs when x is used to make personal
inferences, e.g. when unrestricted PII is present
Mechanism: Modify the query to protect privacy whilst
preserving semantics
Our approach: Metric Differential Privacy
Desired Functionality
Intent | Query $x$ | Modified query $x'$
GetWeather | Will it be colder in Cleveland | Will it be colder in Ohio
PlayMusic | Play Cantopop on lastfm | Play C-pop on lastfm
BookRestaurant | Book a restaurant in Milladore | Book a restaurant in Wood County
SearchCreativeWork | I want to watch Manthan film | I want to watch Hindi film
Word Embeddings
Mapping from words into vectors of real numbers (many ways to do this!)
e.g. neural-network-based models (Word2Vec, GloVe, fastText)
Defines a mapping $\phi : W \to \mathbb{R}^n$
Nearest neighbours are often synonyms
Metric Differential Privacy
Recall the definition of DP ...
$$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E] \quad \text{for } x, x' \in X \text{ s.t. } \|x - x'\|_1 = 1$$
This can be rewritten into a single equation as:
$$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \|x - x'\|_1}$$
Metric differential privacy generalises this to use any valid metric $d(x, x')$:
$$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon\, d(x, x')}$$
(easy to see that standard DP is metric DP with $d(x, x') = \|x - x'\|_1$)
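As a worked instance of the metric DP inequality, consider the one-dimensional case with $d(x, x') = |x - x'|$ and the mechanism $M(x) = x + \xi$ with $\xi \sim \mathrm{Lap}(1/\varepsilon)$. This example is an illustration added here, not taken from the slides. The output density is $p(y \mid x) = \frac{\varepsilon}{2} e^{-\varepsilon |y - x|}$, so for any output $y$:

```latex
\[
\frac{p(y \mid x)}{p(y \mid x')}
  = \frac{\tfrac{\varepsilon}{2}\, e^{-\varepsilon |y - x|}}
         {\tfrac{\varepsilon}{2}\, e^{-\varepsilon |y - x'|}}
  = e^{\varepsilon \left( |y - x'| - |y - x| \right)}
  \le e^{\varepsilon |x - x'|}
\]
```

The last step is the triangle inequality $|y - x'| \le |y - x| + |x - x'|$. Integrating the density bound over any set $E$ gives the set-level statement, which is why the distance function must satisfy the metric axioms.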
Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020]
Given:
$w \in W$: word to be “privatised” from word space $W$ (dictionary)
$\phi : W \to Z$: embedding function from word space to embedding space $Z$ (e.g. $\mathbb{R}^n$)
$v = \phi(w)$: corresponding word vector
$d : Z \times Z \to \mathbb{R}$: distance function in embedding space
$\Omega(\varepsilon)$: the DP noise sampling distribution (e.g. $\Omega_i(\varepsilon) = \mathrm{Lap}\left(\frac{1}{\varepsilon n}\right)$, $i = 1, \dots, n$ for $\mathbb{R}^n$)
Metric DP Mechanism for word embeddings
1 Perturb the word vector: $v' = v + \xi$ where $\xi \sim \Omega(\varepsilon)$
2 The new vector $v'$ will not be a word (a.s.)
3 Project back to $W$: $w' = \arg\min_{u \in W} d(v', \phi(u))$, return $w'$
What do we need?
$d$ satisfies the axioms of a metric (non-negativity, identity of indiscernibles, symmetry, triangle inequality)
A way to sample using $\Omega$ in the metric space that respects $d$ and gives us $\varepsilon$-metric DP
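The three steps above (perturb, observe that the result is not a word, project back) can be sketched for the Euclidean case. The noise sampler here is an assumption rather than the one named on the slide: it draws noise with density proportional to exp(−ε‖z‖₂) by picking a uniform direction and a Gamma(n, 1/ε) radius, which is one known way to respect the Euclidean metric. The toy vocabulary, its vectors and the function names are hypothetical, and exact nearest-neighbour search stands in for an approximate index.

```python
import math
import random

def sample_noise(n, epsilon, rng):
    """Sample z in R^n with density proportional to exp(-epsilon * ||z||_2):
    uniform direction on the sphere, radius ~ Gamma(shape=n, scale=1/epsilon)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in direction))
    radius = rng.gammavariate(n, 1.0 / epsilon)
    return [radius * x / norm for x in direction]

def privatise(word, embeddings, epsilon, rng):
    """Perturb phi(word), then project back to the nearest vocabulary word."""
    v = embeddings[word]
    noise = sample_noise(len(v), epsilon, rng)
    v_noisy = [a + b for a, b in zip(v, noise)]
    return min(embeddings, key=lambda u: math.dist(v_noisy, embeddings[u]))

# Toy 2-D vocabulary (hypothetical vectors, for illustration only).
toy = {"cleveland": [1.0, 1.0], "ohio": [1.2, 0.9], "paris": [-3.0, 2.0]}
rng = random.Random(0)
same = privatise("cleveland", toy, epsilon=1000.0, rng=rng)  # tiny noise
noisy = privatise("cleveland", toy, epsilon=0.5, rng=rng)    # substantial noise
```

With a very large ε the noise radius is far smaller than the gap between words, so the original word comes back; with a small ε the output is frequently a different word from the vocabulary.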
Differential Privacy in the Space of Euclidean Word Embeddings
Adding noise to a location always produces
a valid location: a point somewhere on
the earth’s surface
Adding noise to a word embedding
produces a new point in the embedding
space, but it is almost surely not the location of a
valid word embedding
We perform an approximate nearest-neighbour search
to find the nearest valid embedding
The nearest valid embedding could be that of the
original word itself: in that case, the
original word is returned
Practical Considerations
To help choose ε, we define:
Uncertainty statistics for the adversary over the outputs
Indistinguishability statistics: plausible deniability
Find a radius of high protection: guarantee on the likelihood of changing any word in the
embedding vocabulary
Euclidean Experiments: Setup
Dataset | IMDb | Enron | InsuranceQA
Task type | Sentiment analysis | Author identification | Question answering
Evaluation metric | accuracy | accuracy | MAP, MRR
Training set size | 25,000 | 8,517 | 12,887
Test set size | 25,000 | 850 | 1,800
Total word count | 5,958,157 | 307,639 | 92,095
Vocabulary size | 79,428 | 15,570 | 2,745
Sentence length | µ = 42.27, σ = 34.38 | µ = 30.68, σ = 31.54 | µ = 7.15, σ = 2.06
Scenario 1 (train-time protection): little access to public data (10%), but abundant
access to private training data (90%); model training is done on the combined dataset
(i.e. public subset + perturbed private subset)
Scenario 2 (test-time protection): models are trained on the complete training set and evaluated
on privatized versions of the test sets
We used 300-D GloVe word embeddings with biLSTM models
Results
IMDb reviews – Accuracy vs baseline for different values of ε
[Figure: two panels, "Accuracy (at training time)" and "Accuracy (at test time)"; x-axis: epsilon (200 to 1000); y-axis: accuracy (0 to 1); curves: Accuracy, Baseline]
Enron emails – Accuracy vs baseline for different values of ε
[Figure: two panels, "Accuracy (at training time)" and "Accuracy (at test time)"; x-axis: epsilon (200 to 1000); y-axis: accuracy (0 to 1); curves: Accuracy, Baseline]
InsuranceQA – MAP/MRR scores for different values of ε on the dev set
[Figure: two panels, "Scores for dev at training time" and "Scores for dev at test time"; x-axis: epsilon (200 to 1000); y-axis: score (0 to 1); curves: MAP on dev, MRR on dev, MAP baseline, MRR baseline]
Privacy Evaluation
In the previous experiments, we didn’t explicitly evaluate privacy
Problem: ε is an arbitrary number that is hard to interpret
This is especially true in metric DP, since ε is on a different scale
As we have seen, there are empirical ways to calibrate ε according to statistics of the word
embeddings
But how do we convince stakeholders that the privacy guarantees are holding, and there
are no bugs?
Solution: machine auditors – machine learning algorithms designed to carry out different types of
privacy attacks on the data
Machine Auditors
Probabilistic record linkage auditing attack
Objective: link a user in a public dataset, to a user in a (leaked) private dataset.
Attack simulation: simulate public and “leaked” datasets by randomly splitting
an initial dataset. The attack takes advantage of rare words and queries issued
by users. A vector of word counts can be extracted from user queries and used to
perform the linkage.
Assumptions: attacker is able to narrow the attack set (using side knowledge)
Evaluation: how many accurate links can the attacker reconstruct?
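The linkage attack described above can be sketched with word-count vectors and a similarity score. This is a simplified illustration, not the auditor used by the authors: it assumes cosine similarity over raw counts as the linkage rule, and the users, pseudonyms and queries are all hypothetical.

```python
import math
from collections import Counter

def word_counts(queries):
    """Bag-of-words count vector for one user's queries."""
    return Counter(word for q in queries for word in q.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link(public, leaked):
    """For each public user, guess the leaked pseudonym with the most similar counts."""
    leaked_vecs = {pid: word_counts(qs) for pid, qs in leaked.items()}
    guesses = {}
    for user, queries in public.items():
        vec = word_counts(queries)
        guesses[user] = max(leaked_vecs, key=lambda pid: cosine(vec, leaked_vecs[pid]))
    return guesses

# Hypothetical split of one query log into a public and a "leaked" part.
public = {
    "alice": ["play cantopop on lastfm", "weather in cleveland"],
    "bob": ["book a restaurant in milladore"],
}
leaked = {
    "u1": ["book a table in milladore", "restaurant near milladore"],
    "u2": ["play cantopop radio", "cleveland weather tomorrow"],
}
guesses = link(public, leaked)
```

Rare words such as place names dominate the similarity, which is why perturbing them reduces the number of links the attacker can reconstruct.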
Machine Auditors
Membership auditing attack [Shokri et al ’17, Song & Shmatikov ’18]
Objective: identify whether an individual’s data (queries) were used in the
training set of an ML model.
Attack simulation: train ML model on queries from m users. Train “shadow”
models using data from a different set of n users. The attack model is a classifier
built using the output of the shadow models
Assumptions: attacker is able to narrow the attack set (using side knowledge)
Evaluation: can the attacker correctly distinguish the m users inside the
model’s training set from users outside it?
Hyperbolic Spaces
[Figure: (a) projection of a point in the Lorentz model $H^n$ to the Poincaré model; (b) WebIsADb is-a relationships in the GloVe vocabulary on the $B^2$ Poincaré disk]
Continuous analogue of a tree
structure
Natural language captures
hypernymy and hyponymy
→ embeddings require fewer
dimensions
Use models of hyperbolic space:
projections into Euclidean space
Hyperbolic Differential Privacy
Distances in the $n$-dimensional Poincaré ball are given by:
$$d_{B^n}(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right)$$
Claim: $d_{B^n}(u, v)$ is a valid metric. Proof (via the Lorentzian model) in the paper
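The Poincaré distance formula translates directly into code. A minimal sketch, assuming points are given as lists of floats strictly inside the unit ball; the function names and the sample points are illustrative.

```python
import math

def squared_norm(x):
    """Squared Euclidean norm of a vector given as a list of floats."""
    return sum(a * a for a in x)

def poincare_distance(u, v):
    """Distance between two points strictly inside the unit Poincare ball."""
    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * squared_norm(diff) / ((1.0 - squared_norm(u)) * (1.0 - squared_norm(v)))
    return math.acosh(arg)

origin = [0.0, 0.0]
p = [0.5, 0.0]
q = [0.0, 0.9]
d_origin_p = poincare_distance(origin, p)  # from the origin this equals 2 * artanh(||p||)
d_pq = poincare_distance(p, q)
```

Distances grow without bound as a point approaches the boundary of the ball, which is what gives the space room to embed tree-like hierarchies in few dimensions.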
Hyperbolic Noise
Recall that for Euclidean metric DP we use Laplacian
noise to achieve $\varepsilon$-mDP, i.e.:
$$\xi \sim \mathrm{Lap}\left(\frac{1}{\varepsilon n}\right)$$
We derive the Hyperbolic Laplace distribution:
$$p(x \mid \mu = 0, \varepsilon) = \frac{1 + \varepsilon}{2\, {}_2F_1(1, \varepsilon; 2 + \varepsilon; -1)} \left(-\frac{2}{x - 1} - 1\right)^{-\varepsilon}$$
where ${}_2F_1(a, b; c; z)$ is the hypergeometric function
For sampling, we developed a Lorentzian Metropolis-Hastings
sampler (see paper)
[Figure: plot with both axes ranging from −0.4 to 0.4]
Hyperbolic Privacy Experiments 1
Task: obfuscation vs. Koppel’s authorship attribution algorithm
Datasets: PAN@CLEF tasks
ε | Pan-11 small | Pan-11 large | Pan-12 set-A | Pan-12 set-C | Pan-12 set-D | Pan-12 set-I
0.5 | 36 | 72 | 4 | 3 | 2 | 5
1 | 35 | 73 | 3 | 3 | 2 | 5
2 | 40 | 78 | 4 | 3 | 2 | 5
8 | 65 | 116 | 4 | 5 | 4 | 5
∞ | 147 | 259 | 6 | 6 | 6 | 12
Correct author predictions (lower is better)
Hyperbolic Privacy Experiments 2
Task: expected privacy vs Euclidean baseline
Datasets: 100/200/300d GloVe embeddings
ε | worst-case Nw | expected Nw: hyp-100 | expected Nw: euc-100 | expected Nw: euc-200 | expected Nw: euc-300
0.125 | 134 | 1.25 | 38.54 | 39.66 | 39.88
0.5 | 148 | 1.62 | 42.48 | 43.62 | 43.44
1 | 172 | 2.07 | 48.80 | 50.26 | 53.82
2 | 297 | 3.92 | 92.42 | 93.75 | 90.90
8 | 960 | 140.67 | 602.21 | 613.11 | 587.68
Privacy comparisons (lower Nw is better)
Hyperbolic Utility Experiments
5 classification tasks: sentiment x2, product reviews, opinion polarity, question-type
3 natural language tasks: NL inference, paraphrase detection, semantic textual similarity
baselines: utility results baselined using SentEval against random replacement
dataset | random | hyp-100d ε = 0.125 | hyp-100d ε = 1 | hyp-100d ε = 8 | original InferSent | original SkipThought | original fastText
MR | 58.19 | 58.38 | 63.56 | 74.52 | 81.10 | 79.40 | 78.20
CR | 77.48 | 83.21** | 83.92** | 85.19** | 86.30 | 83.1 | 80.20
MPQA | 84.27 | 88.53* | 88.62* | 88.98* | 90.20 | 89.30 | 88.00
SST-5 | 30.81 | 41.76 | 42.40 | 42.53 | 46.30 | − | 45.10
TREC-6 | 75.20 | 82.40 | 82.40 | 84.20* | 88.20 | 88.40 | 83.40
SICK-E | 79.20 | 81.00** | 82.38** | 82.34** | 86.10 | 79.5 | 78.9
MRPC | 69.86 | 74.78* | 75.07* | 75.01* | 76.20 | − | 74.40
STS14 | 0.17/0.16 | 0.44/0.45 | 0.45/0.46* | 0.52/0.53* | 0.68/0.65 | 0.44/0.45 | 0.65/0.63
Accuracy scores on classification tasks. * indicates results better than 1 baseline, ** better than 2 baselines
[Figure: utility vs. privacy]
Example: Differentially Private SGD
Algorithm 1: Differentially Private SGD
Input: dataset z = (z_1, ..., z_n)
Hyperparameters: learning rate η, mini-batch size m, number of epochs T, noise variance σ², clipping norm L
Initialize w ← 0
for t ∈ [T] do
    for k ∈ [n/m] do
        Sample S ⊂ [n] with |S| = m uniformly at random
        Let $g \leftarrow \frac{1}{m} \sum_{j \in S} \mathrm{clip}_L(\nabla \ell(z_j, w)) + \frac{2L}{m} \mathcal{N}(0, \sigma^2 I)$
        Update w ← w − ηg
return w
5+ hyper-parameters affecting both privacy and utility
For deep learning applications we only have empirical utility (not analytic)
How do we find the hyperparameters that give us an optimal trade-off?
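A minimal sketch of Algorithm 1 in plain Python, assuming a logistic-regression loss for ℓ (the slide leaves the loss unspecified). The function names `clip`, `logistic_grad` and `dp_sgd` and the toy loss are illustrative assumptions, and the accounting that turns σ, m and T into a privacy guarantee is not shown.

```python
import math
import random


def clip(vec, max_norm):
    """Rescale vec so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_grad(example, w):
    """Gradient of the logistic loss for one example (x, y) with y in {0, 1}."""
    x, y = example
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def dp_sgd(data, dim, lr, batch_size, epochs, sigma, clip_norm, seed=0):
    """Noisy clipped-gradient SGD following the structure of Algorithm 1."""
    rng = random.Random(seed)
    n = len(data)
    w = [0.0] * dim
    for _ in range(epochs):                      # for t in [T]
        for _ in range(n // batch_size):         # for k in [n/m]
            batch = rng.sample(range(n), batch_size)  # S with |S| = m
            g = [0.0] * dim
            for j in batch:
                clipped = clip(logistic_grad(data[j], w), clip_norm)
                for i, gi in enumerate(clipped):
                    g[i] += gi / batch_size
            # add (2L / m) * N(0, sigma^2 I), one draw per coordinate
            noise_scale = 2.0 * clip_norm / batch_size
            g = [gi + noise_scale * rng.gauss(0.0, sigma) for gi in g]
            w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w
```

Every argument of `dp_sgd` other than `data`, `dim` and `seed` corresponds to one of the hyper-parameters listed on the slide, which is why each of them moves both the privacy guarantee and the final error.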
The Privacy-Utility Pareto Front
[Figure: hyper-parameter space and the corresponding privacy loss vs. error plot, with the Pareto-optimal points marked]
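To make "Pareto-optimal" concrete: a hyper-parameter setting is Pareto-optimal when no other setting is at least as good on both privacy loss and error and strictly better on one of them. The sketch below filters a list of (privacy loss, error) pairs down to that set. The function name `pareto_front` and the example numbers are illustrative, not results from the talk.

```python
def pareto_front(points):
    """Return the non-dominated (privacy_loss, error) pairs, sorted.

    Both objectives are minimized. A point p is dominated if some other
    point q is no worse in both coordinates and differs from p.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical evaluations: (privacy loss epsilon, classification error)
evaluations = [(0.1, 0.9), (1.0, 0.5), (1.0, 0.6), (5.0, 0.2), (6.0, 0.3)]
front = pareto_front(evaluations)
```

Here `(1.0, 0.6)` is dropped because `(1.0, 0.5)` has the same privacy loss and lower error, and `(6.0, 0.3)` is dropped because `(5.0, 0.2)` is better on both objectives.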
Bayesian Optimization
Gradient-free optimization for black-box functions
Widely used in applications (HPO in ML, scheduling & planning, experimental design ...)
In multi-objective problems, BO aims to learn the Pareto front with a minimal number of
evaluations.
DPareto
Repeat:
  1 For each objective (privacy, utility):
    1.1 Fit a surrogate model (Gaussian process (GP)) using the available dataset
    1.2 Calculate the predictive distribution using the GP mean and variance functions
  2 Use the posterior of the surrogate models to form an acquisition function
  3 Collect the next point at the estimated global max. of the acquisition function
until budget exhausted
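The loop above can be sketched in pure Python for a single hyper-parameter in [0, 1] and two objectives to minimize. This is a simplified stand-in under stated assumptions, not the DPareto implementation. The slide does not specify the acquisition function, so the sketch assumes a randomly weighted sum of optimistic (mean minus two standard deviations) predictions. The names `rbf`, `solve`, `gp_posterior` and `bayes_opt`, the kernel length-scale and the grid search are all hypothetical choices.

```python
import math
import random


def rbf(a, b, length=0.3):
    """Squared-exponential kernel on scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """GP predictive mean and variance at x_new (constant-mean prior)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean_y = sum(ys) / len(ys)
    alpha = solve(K, [y - mean_y for y in ys])
    mean = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = max(rbf(x_new, x_new) - sum(k * vi for k, vi in zip(k_star, v)), 0.0)
    return mean, var


def bayes_opt(objectives, budget, n_init=3, seed=0):
    """Evaluate two objectives at points chosen by a GP-based acquisition."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]
    ys = [[f(x) for f in objectives] for x in xs]
    grid = [i / 100 for i in range(101)]
    while len(xs) < budget:
        weight = rng.random()  # random trade-off between the two objectives

        def acquisition(x):
            score = 0.0
            for k, wk in enumerate((weight, 1.0 - weight)):
                # Step 1: surrogate and predictive distribution per objective
                mean, var = gp_posterior(xs, [y[k] for y in ys], x)
                # Step 2: optimistic score, larger is more promising
                score += wk * (-mean + 2.0 * math.sqrt(var))
            return score

        # Step 3: next point at the acquisition maximum over the grid
        x_next = max(grid, key=acquisition)
        xs.append(x_next)
        ys.append([f(x_next) for f in objectives])
    return xs, ys
```

A call such as `bayes_opt((privacy_loss, error), budget=8)` with two user-supplied callables returns the evaluated points and their objective values, from which the empirical Pareto front can then be extracted.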
DPareto vs Random Sampling
[Figure, three panels. Hypervolume Evolution: Pareto-front (PF) hypervolume vs. number of sampled points for MLP1 and MLP2, each with random sampling (RS) and Bayesian optimization (BO). MLP2 Pareto Fronts: classification error vs. ε for the initial points, +256 RS and +256 BO. LogReg+SGD Samples: classification error vs. ε for 1500 RS and 256 BO samples.]
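The hypervolume used to compare fronts is the area of objective space that a front dominates, measured up to a fixed reference point, so a larger value means a better front. A minimal two-objective sketch follows, assuming both privacy loss and error are minimized. The function name `hypervolume_2d` and the reference point are illustrative choices, not values from the talk.

```python
def hypervolume_2d(points, reference):
    """Area dominated by points and bounded by reference (both minimized)."""
    ref_x, ref_y = reference
    # Only points that beat the reference in both objectives contribute.
    inside = sorted(p for p in points if p[0] < ref_x and p[1] < ref_y)
    area, best_y = 0.0, ref_y
    for x, y in inside:
        if y >= best_y:
            continue  # dominated by a point with smaller x
        # New horizontal strip between y and best_y, spanning x to ref_x
        area += (ref_x - x) * (best_y - y)
        best_y = y
    return area


# Hypothetical front of (privacy loss, error) pairs and reference point
example_area = hypervolume_2d([(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)], (4.0, 4.0))
```

Dominated points leave the value unchanged, so the function can be applied directly to all sampled points without first extracting the Pareto front.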
Summary: Privacy Enhancing Technologies
Privacy
Privacy risks can be counter-intuitive and tricky to formalize
High-dimensional data and side knowledge make privacy hard
Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization)
Differential privacy is a mature privacy enhancing technology
Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes
Empirical privacy-utility trade-off evaluation enables application-specific decisions
Bayesian optimization provides a computationally efficient method to recover the Pareto front (esp. with a large number of hyper-parameters)
Questions?
tdiethe@amazon.com

Preserving Privacy and Utility in Text Data Analysis

  • 1. Preserving Privacy and Utility in Text Data Analysis Tom Diethe, Oluwaseyi Feyisetan, Thomas Drake, Borja Balle {sey,tdiethe,draket}@amazon.com borja.balle@gmail.com PrivateNLP Workshop, WSDM February 7 2020
  • 2. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 1 / 41
  • 3. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 2 / 41
  • 4. Alexa AI What is Alexa? A cloud-based voice service that can help you with tasks, entertainment, general information, shopping, and more The more you talk to Alexa, the more Alexa adapts to your speech patterns, vocabulary, and personal preferences Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 3 / 41
  • 5. Alexa AI What is Alexa? A cloud-based voice service that can help you with tasks, entertainment, general information, shopping, and more The more you talk to Alexa, the more Alexa adapts to your speech patterns, vocabulary, and personal preferences How do we ... create robust and efficient AI systems? maintain the privacy of customer data? Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 3 / 41
  • 6. Failure Modes Unintentional failures: ML system produces a formally correct but completely unsafe outcome Outliers/anomalies Dataset shift Limited memory Intentional failures: failure is caused by an active adversary attempting to subvert the system to attain her goals, such as to: misclassify the result infer private training data steal the underlying algorithm Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 4 / 41
  • 7. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 5 / 41
  • 8. A first attempt: Can’t I just anonymize my data? k-anonymity: information for each person cannot be distinguished from at least k − 1 individuals whose information also appear in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex, a date, working for the company she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office Dept. Salary D.O.B. Nationality Gender London IT £##### May 1985 Portuguese Female Still presents risk of re-identification!. If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 9. A first attempt: Can’t I just anonymize my data? k-anonymity: information for each person cannot be distinguished from at least k − 1 individuals whose information also appear in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex, a date, working for the company she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office Dept. Salary D.O.B. Nationality Gender London IT £##### May 1985 Portuguese Female Still presents risk of re-identification!. If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 10. A first attempt: Can’t I just anonymize my data? k-anonymity: information for each person cannot be distinguished from at least k − 1 individuals whose information also appear in the release Suppose a company is audited for salary discrimination The auditor can see salaries by gender, age and nationality for each department and office If the auditor has a friend, an ex, a date, working for the company she will learn the salary of that person Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case) Office Dept. Salary D.O.B. Nationality Gender UK IT £##### 1980-1985 - Female Still presents risk of re-identification!. If there are 10 females born between 80-85 in the whole of UK’s IT department, 9 of them could conspire to learn the salary of the 10th one Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 6 / 41
  • 11. Anonymized Data Isn’t Example 1: Mid 1990’s: Massachusetts “Group Insurance Commission” released “anonymized” data on state employees that showed every hospital visit Goal was to help researchers. Removed all obvious identifiers such as name, address, and social security number MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization, requested a copy of the data Reidentification William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: Only 6 people shared his birth date, only 3 of them men, and of them, only he lived in his ZIP code. Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 7 / 41
  • 12. Anonymized Data Isn’t Example 1: Mid 1990’s: Massachusetts “Group Insurance Commission” released “anonymized” data on state employees that showed every hospital visit Goal was to help researchers. Removed all obvious identifiers such as name, address, and social security number MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization, requested a copy of the data Reidentification William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: Only 6 people shared his birth date, only 3 of them men, and of them, only he lived in his ZIP code. Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 7 / 41
  • 13. Anonymized Data Isn’t Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period Netflix “anonymized” the data before releasing it by removing usernames, but assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends Reidentification Researchers used this information to uniquely identify individual Netflix users by crossing the data with the public IMDB database. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 8 / 41
  • 14. Anonymized Data Isn’t Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated movies over a six-year period Netflix “anonymized” the data before releasing it by removing usernames, but assigned unique identification numbers to users in order to allow for continuous tracking of user ratings and trends Reidentification Researchers used this information to uniquely identify individual Netflix users by crossing the data with the public IMDB database. According to the study, if a person has information about when and how a user rated six movies, that person can identify 99% of people in the Netflix database. Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 8 / 41
  • 15. Differential Privacy A randomised mechanism M : X → Y is -differentially private if for all neighbouring inputs x x (i.e. x − x 1 = 1) and for all sets of outputs E ⊆ Y we have P[M(x) ∈ E] ≤ e P M x ∈ E 0 5 10 15 20 25 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 Ratio bounded by e M(D) M(D') Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 9 / 41
  • 16. Differential Privacy A randomised mechanism M : X → Y is -differentially private if for all neighbouring inputs x x (i.e. x − x 1 = 1) and for all sets of outputs E ⊆ Y we have P[M(x) ∈ E] ≤ e P M x ∈ E 0 5 10 15 20 25 0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 Ratio bounded by e M(D) M(D') Mechanisms: Randomised response −→ plausible deniability Laplace mechanism: e.g. ˜µ = µ + ξ, ξ ∼ Lap 1 n Output perturbation ... Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 9 / 41
  • 17. Randomized Response [Warner ’65] Say you want to release a bit x ∈ {Yes, No}. Do the following: 1 flip a coin 2 if tails, respond truthfully with x 3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 10 / 41
  • 18. Randomized Response [Warner ’65] Say you want to release a bit x ∈ {Yes, No}. Do the following: 1 flip a coin 2 if tails, respond truthfully with x 3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails Claim: Above algorithm satisfies (log 3)-differential privacy Pr[Response = Yes|x = Yes] Pr[Response = Yes|x = No] = 1/2 × 1 + 1/2 × 1/2 1/2 × 0 + 1/2 × 1/2 = 3/4 1/4 = 3 =⇒ e = 3 Same for Pr[Response=No|x=Yes] Pr[Response=No|x=No] . Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 10 / 41
  • 19. Important Properties Robustness to post-processing: M is ( , δ)-DP, then f (M) is ( , δ)-DP Composition: if M1, . . . , Mn are ( , δ)-DP, then g (M1, . . . , Mn) is ( n i=1 i , n i=1 δi )-DP Protects against arbitrary side knowledge Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 11 / 41
  • 20. Outline 1 Alexa AI 2 Algorithmic Privacy 3 Privacy for Text 4 Differential Privacy in Euclidean Spaces 5 Differential Privacy in Hyperbolic Spaces 6 Optimizing the Privacy Utility Trade-off 7 Summary Diethe, Feyisetan, Drake, Balle (Amazon) Privacy and Utility in Text Data Analysis February 7 2020 12 / 41
  • 26. User-AI system interaction via natural language
    User’s goal: meet some specific need with respect to an issued query x
    Agent’s goal: satisfy the user’s request
    Privacy violation: occurs when x is used to make personal inferences, e.g. when unrestricted PII is present
    Mechanism: modify the query to protect privacy whilst preserving semantics
    Our approach: metric differential privacy
  • 27. Desired Functionality
    Intent | Query x | Modified query x′
    GetWeather | Will it be colder in Cleveland | Will it be colder in Ohio
    PlayMusic | Play Cantopop on lastfm | Play C-pop on lastfm
    BookRestaurant | Book a restaurant in Milladore | Book a restaurant in Wood County
    SearchCreativeWork | I want to watch Manthan film | I want to watch Hindi film
  • 28. Word Embeddings
    Mapping from words into vectors of real numbers (many ways to do this!), e.g. neural-network-based models (Word2Vec, GloVe, fastText)
    Defines a mapping $\phi : W \to \mathbb{R}^n$
    Nearest neighbours are often synonyms
  • 31. Metric Differential Privacy
    Recall the definition of DP:
    $$P[M(x) \in E] \le e^{\varepsilon} \, P[M(x') \in E] \quad \text{for } x, x' \in X \text{ s.t. } \|x - x'\|_1 = 1$$
    This can be rewritten as a single inequality:
    $$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \|x - x'\|_1}$$
    Metric differential privacy generalises this to use any valid metric $d(x, x')$:
    $$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon d(x, x')}$$
    (it is easy to see that standard DP is metric DP with $d(x, x') = \|x - x'\|_1$)
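As a worked illustration (not on the slides), consider the scalar mechanism $M(x) = x + \xi$ with $\xi \sim \mathrm{Lap}(1/\varepsilon)$ and the metric $d(x, x') = |x - x'|$. The choice of mechanism and metric is an assumption made here to keep the derivation short; the symbol $p(z \mid x)$ for the output density is introduced for this example only.

```latex
% Output density of M(x) = x + Lap(1/epsilon):
p(z \mid x) = \frac{\varepsilon}{2} \, e^{-\varepsilon |z - x|}

% Ratio of densities for two inputs x and x':
\frac{p(z \mid x)}{p(z \mid x')}
  = e^{\varepsilon \left( |z - x'| - |z - x| \right)}
  \le e^{\varepsilon |x - x'|}
% using the triangle inequality |z - x'| \le |z - x| + |x - x'|.

% Integrating over any measurable event E gives the metric DP bound:
\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \, d(x, x')}
```

The only property of $d$ used is the triangle inequality, which is why the definition asks for a valid metric.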
  • 34. Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020]
    Given:
      $w \in W$: word to be “privatised” from word space W (dictionary)
      $\phi : W \to Z$: embedding function from word space to embedding space Z (e.g. $\mathbb{R}^n$)
      $v = \phi(w)$: corresponding word vector
      $d : Z \times Z \to \mathbb{R}$: distance function in embedding space
      $\Omega(\varepsilon)$: the DP noise sampling distribution (e.g. $\Omega_i(\varepsilon) = \mathrm{Lap}(1/\varepsilon)$, $i = 1, \dots, n$ for $\mathbb{R}^n$)
    Metric DP mechanism for word embeddings:
      1. Perturb the word vector: $v' = v + \xi$ where $\xi \sim \Omega(\varepsilon)$
      2. The new vector $v'$ will not be a word (a.s.)
      3. Project back to W: $w' = \arg\min_{u \in W} d(v', \phi(u))$; return $w'$
    What do we need?
      d satisfies the axioms of a metric (non-negativity, identity of indiscernibles, symmetry, triangle inequality)
      A way to sample using Ω in the metric space that respects d and gives us ε-metric DP
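The three steps of the mechanism can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the vocabulary and embeddings are random toy data, d is taken to be the Euclidean distance, and the noise is drawn with density proportional to exp(-ε‖z‖₂) by sampling a uniform direction and a Gamma(n, 1/ε) radius. The names `sample_noise` and `privatise` are hypothetical.

```python
import numpy as np


def sample_noise(n, epsilon, rng):
    """Draw z in R^n with density proportional to exp(-epsilon * ||z||_2)."""
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)            # uniform direction on the sphere
    radius = rng.gamma(shape=n, scale=1.0 / epsilon)  # radial part of the density
    return radius * direction


def privatise(word, vocab, embeddings, epsilon, rng):
    """Perturb the word vector, then project back to the nearest vocabulary word."""
    v = embeddings[vocab.index(word)]                          # v = phi(w)
    v_noisy = v + sample_noise(v.shape[0], epsilon, rng)       # step 1: perturb
    distances = np.linalg.norm(embeddings - v_noisy, axis=1)   # step 3: project
    return vocab[int(np.argmin(distances))]


rng = np.random.default_rng(0)
vocab = ["cleveland", "ohio", "cantopop", "c-pop", "restaurant"]
embeddings = rng.normal(size=(len(vocab), 8))   # toy stand-in for real embeddings
private_word = privatise("cleveland", vocab, embeddings, epsilon=5.0, rng=rng)
```

A linear scan over the vocabulary is used here for clarity; with a realistic vocabulary an approximate nearest-neighbour index would replace it.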
  • 36. Differential Privacy in the Space of Euclidean Word Embeddings
    Adding noise to a location always produces a valid location: a point somewhere on the earth’s surface
    Adding noise to a word embedding produces a new point in the embedding space, but it is almost surely not the location of a valid word embedding
    We perform an approximate nearest-neighbour search to find the nearest valid embedding
    The nearest valid embedding could be that of the original word itself: in that case, the original word is returned
  • 37. Practical Considerations
    To help choose ε, we define:
      Uncertainty statistics for the adversary over the outputs
      Indistinguishability statistics: plausible deniability
    Find a radius of high protection: a guarantee on the likelihood of changing any word in the embedding vocabulary
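One way to make such statistics concrete is to run the mechanism many times on the same word and count the outcomes. The sketch below is an assumption-laden illustration: it reuses a Euclidean noise model with density proportional to exp(-ε‖z‖₂), works on random toy embeddings, and reports two hypothetical quantities, the number of runs that return the word unchanged and the number of distinct words produced. The exact definitions used by the authors may differ.

```python
import numpy as np


def privatise_index(i, embeddings, epsilon, rng):
    """Perturb row i of the embedding matrix and return the nearest row index."""
    n = embeddings.shape[1]
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)
    noisy = embeddings[i] + rng.gamma(shape=n, scale=1.0 / epsilon) * direction
    return int(np.argmin(np.linalg.norm(embeddings - noisy, axis=1)))


def calibration_stats(i, embeddings, epsilon, trials, rng):
    """Return (runs where the word is unchanged, number of distinct outputs)."""
    outputs = [privatise_index(i, embeddings, epsilon, rng) for _ in range(trials)]
    unchanged = sum(1 for o in outputs if o == i)
    return unchanged, len(set(outputs))


rng = np.random.default_rng(1)
embeddings = rng.normal(size=(50, 8))   # toy vocabulary of 50 random vectors
for eps in (0.5, 5.0, 50.0):
    unchanged, distinct = calibration_stats(0, embeddings, eps, trials=200, rng=rng)
    print(f"epsilon={eps}: unchanged={unchanged}/200, distinct outputs={distinct}")
```

Small ε should leave the word unchanged rarely and spread the outputs over many words; large ε should do the opposite, which is the behaviour one would use to pick ε for a given embedding.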
  • 38. Euclidean Experiments: Setup
    Dataset | IMDb | Enron | InsuranceQA
    Task type | Sentiment analysis | Author identification | Question answering
    Evaluation metric | accuracy | accuracy | MAP, MRR
    Training set size | 25,000 | 8,517 | 12,887
    Test set size | 25,000 | 850 | 1,800
    Total word count | 5,958,157 | 307,639 | 92,095
    Vocabulary size | 79,428 | 15,570 | 2,745
    Sentence length | µ = 42.27, σ = 34.38 | µ = 30.68, σ = 31.54 | µ = 7.15, σ = 2.06
    Scenario 1 (train-time protection): little access to public data (10%), but abundant access to private training data (90%); model training is done on the combined dataset (i.e. public subset + perturbed private subset)
    Scenario 2 (test-time protection): models trained on the complete training set; evaluation on privatised versions of the test sets
    We used 300-d GloVe word embeddings with biLSTM models
  • 39. Results: IMDb reviews, accuracy vs. baseline for different values of ε [Figure: two panels, “Accuracy (at training time)” and “Accuracy (at test time)”; x-axis: epsilon (200 to 1000); y-axis: accuracy (0.0 to 1.0); series: Accuracy, Baseline]
  • 40. Results: Enron emails, accuracy vs. baseline for different values of ε [Figure: two panels, “Accuracy (at training time)” and “Accuracy (at test time)”; x-axis: epsilon (200 to 1000); y-axis: accuracy (0.0 to 1.0); series: Accuracy, Baseline]
  • 41. Results: InsuranceQA, MAP/MRR scores for different values of ε on the dev set [Figure: two panels, “Scores for dev at training time” and “Scores for dev at test time”; x-axis: epsilon (200 to 1000); y-axis: score (0.0 to 1.0); series: MAP on dev, MRR on dev, MAP baseline, MRR baseline]
  • 47. Privacy Evaluation
    In the previous experiments, we didn’t explicitly evaluate privacy
    Problem: ε is an arbitrary number that is hard to interpret
    This is especially true in metric DP, since ε is on a different scale
    As we have seen, there are empirical ways to calibrate ε according to statistics of the word embeddings
    But how do we convince stakeholders that the privacy guarantees are holding, and that there are no bugs?
    Solution: machine auditors, i.e. machine learning algorithms designed to carry out different types of privacy attacks on the data
  • 48. Machine Auditors: probabilistic record linkage auditing attack
    Objective: link a user in a public dataset to a user in a (leaked) private dataset.
    Attack simulation: simulate public and “leaked” datasets by randomly splitting an initial dataset. The attack takes advantage of rare words and queries issued by users. A vector of word counts can be extracted from user queries and used to perform the linkage.
    Assumptions: the attacker is able to narrow the attack set (using side knowledge)
    Evaluation: how many accurate links can the attacker reconstruct?
  • 49. Machine Auditors: membership auditing attack [Shokri et al ’17, Song & Shmatikov ’18]
    Objective: identify whether an individual’s data (queries) were used in the training set of an ML model.
    Attack simulation: train an ML model on queries from m users. Train “shadow” models using data from a different set of n users. The attack model is a classifier built using the output of the shadow models.
    Assumptions: the attacker is able to narrow the attack set (using side knowledge)
    Evaluation: can the attacker correctly detect m users inside and outside the model’s dataset?
  • 51. Hyperbolic Spaces
    [Figure: (a) projection of a point in the Lorentz model $\mathbb{H}^n$ to the Poincaré model; (b) WebIsADb is-a relationships in the GloVe vocabulary on the $\mathbb{B}^2$ Poincaré disk]
    Continuous analogue of a tree structure
    Natural language captures hypernymy and hyponymy → embeddings require fewer dimensions
    Use models of hyperbolic space: projections into Euclidean space
  • 52. Hyperbolic Differential Privacy
    Distances in the n-dimensional Poincaré ball are given by:
    $$d_{\mathbb{B}^n}(u, v) = \operatorname{arcosh}\left(1 + 2 \frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right)$$
    Claim: $d_{\mathbb{B}^n}(u, v)$ is a valid metric. Proof (via the Lorentzian model) in the paper
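The distance formula translates directly into code. The following sketch is illustrative: the function name `poincare_distance` is hypothetical, and the random points are used only to spot-check symmetry and the triangle inequality numerically, which is not a substitute for the proof in the paper.

```python
import numpy as np


def poincare_distance(u, v):
    """Distance between two points strictly inside the unit Poincaré ball."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_u_sq = float(np.dot(u, u))
    norm_v_sq = float(np.dot(v, v))
    if norm_u_sq >= 1.0 or norm_v_sq >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff_sq = float(np.dot(u - v, u - v))
    argument = 1.0 + 2.0 * diff_sq / ((1.0 - norm_u_sq) * (1.0 - norm_v_sq))
    return float(np.arccosh(max(argument, 1.0)))   # guard against rounding below 1


# Spot-check the metric axioms on random points of the 2-D disk.
rng = np.random.default_rng(0)
points = rng.uniform(-0.5, 0.5, size=(20, 2))      # all have norm below 1
a, b, c = points[0], points[1], points[2]
symmetric = abs(poincare_distance(a, b) - poincare_distance(b, a)) < 1e-12
triangle = poincare_distance(a, c) <= poincare_distance(a, b) + poincare_distance(b, c) + 1e-12
```

Distances grow rapidly near the boundary of the ball, which is what lets tree-like hierarchies fit in few dimensions.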
  • 55. Hyperbolic Noise
    Recall that for Euclidean metric DP, we use Laplacian noise to achieve ε-mDP, i.e. $\xi_i \sim \mathrm{Lap}(1/\varepsilon)$, $i = 1, \dots, n$
    We derive the hyperbolic Laplace distribution:
    $$p(x \mid \mu = 0, \varepsilon) = \frac{1 + \varepsilon}{2 \, {}_2F_1(1, \varepsilon; 2 + \varepsilon; -1)} \left(-\frac{2}{x - 1} - 1\right)^{-\varepsilon}$$
    where ${}_2F_1(a, b; c; z)$ is the hypergeometric function
    For sampling, we developed a Lorentzian Metropolis-Hastings sampler (see paper)
    [Figure: 2-D plot with both axes ranging from −0.4 to 0.4]
  • 56. Hyperbolic Privacy Experiments 1
    Task: obfuscation vs. Koppel’s authorship attribution algorithm
    Datasets: PAN@CLEF tasks
    ε | Pan-11 small | Pan-11 large | Pan-12 set-A | Pan-12 set-C | Pan-12 set-D | Pan-12 set-I
    0.5 | 36 | 72 | 4 | 3 | 2 | 5
    1 | 35 | 73 | 3 | 3 | 2 | 5
    2 | 40 | 78 | 4 | 3 | 2 | 5
    8 | 65 | 116 | 4 | 5 | 4 | 5
    ∞ | 147 | 259 | 6 | 6 | 6 | 12
    Correct author predictions (lower is better)
  • 57. Hyperbolic Privacy Experiments 2
    Task: expected privacy vs. Euclidean baseline
    Datasets: 100/200/300-d GloVe embeddings
    ε | worst-case $N_w$ | expected $N_w$: hyp-100 | expected $N_w$: euc-100 | expected $N_w$: euc-200 | expected $N_w$: euc-300
    0.125 | 134 | 1.25 | 38.54 | 39.66 | 39.88
    0.5 | 148 | 1.62 | 42.48 | 43.62 | 43.44
    1 | 172 | 2.07 | 48.80 | 50.26 | 53.82
    2 | 297 | 3.92 | 92.42 | 93.75 | 90.90
    8 | 960 | 140.67 | 602.21 | 613.11 | 587.68
    Privacy comparisons (lower $N_w$ is better)
  • 58. Hyperbolic Utility Experiments
    5 classification tasks: sentiment (×2), product reviews, opinion polarity, question type
    3 natural language tasks: NL inference, paraphrase detection, semantic textual similarity
    Baselines: utility results baselined using SentEval against random replacement
    Column groups: hyp-100d (random, ε = 0.125, ε = 1, ε = 8); original dataset (InferSent, SkipThought, fastText)
    Task | random | ε = 0.125 | ε = 1 | ε = 8 | InferSent | SkipThought | fastText
    MR | 58.19 | 58.38 | 63.56 | 74.52 | 81.10 | 79.40 | 78.20
    CR | 77.48 | 83.21** | 83.92** | 85.19** | 86.30 | 83.1 | 80.20
    MPQA | 84.27 | 88.53* | 88.62* | 88.98* | 90.20 | 89.30 | 88.00
    SST-5 | 30.81 | 41.76 | 42.40 | 42.53 | 46.30 | − | 45.10
    TREC-6 | 75.20 | 82.40 | 82.40 | 84.20* | 88.20 | 88.40 | 83.40
    SICK-E | 79.20 | 81.00** | 82.38** | 82.34** | 86.10 | 79.5 | 78.9
    MRPC | 69.86 | 74.78* | 75.07* | 75.01* | 76.20 | − | 74.40
    STS14 | 0.17/0.16 | 0.44/0.45 | 0.45/0.46* | 0.52/0.53* | 0.68/0.65 | 0.44/0.45 | 0.65/0.63
    Accuracy scores on classification tasks. * indicates results better than 1 baseline, ** better than 2 baselines
  • 60. [Figure: “Utility” and “Privacy” labels]
  • 61. Example: Differentially Private SGD
    Algorithm 1: Differentially Private SGD
      Input: dataset $z = (z_1, \dots, z_n)$
      Hyperparameters: learning rate η, mini-batch size m, number of epochs T, noise variance $\sigma^2$, clipping norm L
      Initialize $w \leftarrow 0$
      for $t \in [T]$ do
        for $k \in [n/m]$ do
          Sample $S \subset [n]$ with $|S| = m$ uniformly at random
          Let $g \leftarrow \frac{1}{m} \sum_{j \in S} \mathrm{clip}_L(\nabla \ell(z_j, w)) + \frac{2L}{m} \mathcal{N}(0, \sigma^2 I)$
          Update $w \leftarrow w - \eta g$
      return w
    5+ hyper-parameters affecting both privacy and utility
    For deep learning applications we only have empirical utility (not analytic)
    How do we find the hyperparameters that give us an optimal trade-off?
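Algorithm 1 can be sketched in a few lines of NumPy. The version below is a hedged illustration rather than a production implementation: it assumes a logistic-regression loss (the slide leaves the loss ℓ unspecified), uses synthetic data, and does not compute the resulting privacy budget. The function names `clip`, `logistic_gradient` and `dp_sgd` are hypothetical.

```python
import numpy as np


def clip(gradient, L):
    """Rescale the gradient so that its Euclidean norm is at most L."""
    norm = np.linalg.norm(gradient)
    return gradient if norm <= L else gradient * (L / norm)


def logistic_gradient(x, y, w):
    """Per-example gradient of the logistic loss (an assumed choice of loss)."""
    p = 1.0 / (1.0 + np.exp(-np.dot(x, w)))
    return (p - y) * x


def dp_sgd(X, y, eta, m, T, sigma, L, rng):
    """Follow Algorithm 1: clipped per-example gradients plus Gaussian noise."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):                       # epochs
        for _ in range(n // m):              # mini-batches per epoch
            S = rng.choice(n, size=m, replace=False)
            clipped = [clip(logistic_gradient(X[j], y[j], w), L) for j in S]
            noise = (2.0 * L / m) * rng.normal(0.0, sigma, size=d)
            g = np.mean(clipped, axis=0) + noise
            w = w - eta * g
    return w


# Synthetic, linearly separable data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
w_private = dp_sgd(X, y, eta=0.5, m=20, T=20, sigma=1.0, L=1.0, rng=rng)
accuracy = float(np.mean((X @ w_private > 0) == (y == 1)))
```

Every argument of `dp_sgd` except the data influences both the privacy guarantee and the final accuracy, which is the tuning problem the slide raises.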
  • 70. The Privacy-Utility Pareto Front [Figure: points in hyper-parameter space mapped to the privacy loss vs. error plane, with the Pareto-optimal points highlighted]
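Given a set of evaluated hyper-parameter settings, the Pareto-optimal points are those for which no other point is at least as good on both objectives and strictly better on one. A minimal sketch, assuming each evaluation is summarised as a (privacy loss, error) pair with both quantities to be minimised; the function name `pareto_front` and the sample values are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (privacy loss, error), both minimised


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing privacy loss."""
    front: List[Point] = []
    # After sorting by privacy loss (then error), a point is non-dominated
    # exactly when its error is strictly below every error kept so far.
    for point in sorted(set(points)):
        if not front or point[1] < front[-1][1]:
            front.append(point)
    return front


# Illustrative evaluations (made-up numbers).
evaluations = [(0.1, 0.5), (0.5, 0.3), (1.0, 0.3), (1.0, 0.2), (2.0, 0.25), (0.2, 0.6)]
front = pareto_front(evaluations)
```

On these values the front keeps three points: lower privacy loss is only retained when it costs error, and vice versa.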
  • 80. Bayesian Optimization
    Gradient-free optimization for black-box functions
    Widely used in applications (HPO in ML, scheduling & planning, experimental design ...)
    In multi-objective problems, BO aims to learn the Pareto front with a minimal number of evaluations.
  • 81. DPareto
    Repeat until the budget is exhausted:
      1. For each objective (privacy, utility):
         a. Fit a surrogate model (Gaussian process, GP) using the available dataset
         b. Calculate the predictive distribution using the GP mean and variance functions
      2. Use the posterior of the surrogate models to form an acquisition function
      3. Collect the next point at the estimated global maximum of the acquisition function
  • 82. DPareto vs Random Sampling [Figure: three panels. “Hypervolume Evolution”: PF hypervolume vs. number of sampled points for MLP1 and MLP2 under random sampling (RS) and Bayesian optimization (BO). “MLP2 Pareto Fronts”: classification error vs. ε (10^-1 to 10^1) for Initial, +256 RS and +256 BO. “LogReg+SGD Samples”: classification error vs. ε (10^-1 to 10^1) for 1500 RS and 256 BO]
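The Pareto-front hypervolume plotted in the first panel is a scalar summary of front quality: the area dominated by the front, measured up to a reference point. The sketch below shows the two-objective case under stated assumptions: both objectives are minimised, the reference point is chosen by the user and is worse than every point on both objectives, and the function name `hypervolume_2d` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (privacy loss, error), both minimised


def hypervolume_2d(points: List[Point], reference: Point) -> float:
    """Area dominated by the points and bounded by the reference point."""
    area = 0.0
    best_error = reference[1]
    for privacy_loss, error in sorted(points):
        if error >= best_error:
            continue                      # dominated point adds no new area
        # New strip: from this privacy loss to the reference, between the
        # previous best error and this point's error.
        area += (reference[0] - privacy_loss) * (best_error - error)
        best_error = error
    return area


# Two non-dominated points plus one dominated point (made-up numbers).
area = hypervolume_2d([(1.0, 3.0), (2.0, 1.0), (3.0, 2.0)], reference=(4.0, 4.0))
```

A larger hypervolume means the front is closer to the ideal corner, which is why the figure tracks it as more points are sampled.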
  • 84. Summary: Privacy Enhancing Technologies
    Privacy risks can be counter-intuitive and tricky to formalize
    High-dimensional data and side knowledge make privacy hard
    Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization)
    Differential privacy is a mature privacy enhancing technology
    Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes
    Empirical privacy-utility trade-off evaluation enables application-specific decisions
    Bayesian optimization provides a computationally efficient method to recover the Pareto front (esp. with a large number of hyper-parameters)
  • 85. Questions? tdiethe@amazon.com