Presented at the University of Bristol Interactive AI CDT Winter School.
Amazon prides itself on being the most customer-centric company on earth. That means creating systems that are robust to changes in the environment, preserve privacy, and treat different subgroups fairly. Here I present approaches to tackling these problems from both research and engineering perspectives, including continual learning, differential privacy, and algorithmic fairness.
Practical Considerations for Interactive AI: Robustness, Privacy, Fairness, Transparency
1. Practical Considerations for Interactive AI: Robustness, Privacy,
Fairness, Transparency
Tom Diethe
tdiethe@amazon.com
Interactive AI CDT Winter School
January 29 2020
2. Outline
1 Interactive AI at Amazon
2 Robustness & Transparency via Continual Learning
Bayesian Continual Learning
Continual Learning in Practice
3 Algorithmic Privacy
Differential Privacy
Privacy for Text
Experiments on Text Data
Optimizing the Privacy Utility Trade-off
DPareto experiments
4 Algorithmic Fairness
5 Summary
Tom Diethe (Amazon) Practical Considerations for Interactive AI January 29 2020 1 / 44
4. Interactive AI at Amazon
6. Alexa AI
What is Alexa?
A cloud-based voice service that can help
you with tasks, entertainment, general
information, shopping, and more
The more you talk to Alexa, the more
Alexa adapts to your speech patterns,
vocabulary, and personal preferences
How do we ensure that ...
we create robust and efficient AI systems?
the privacy of customer data is safeguarded?
customers are treated fairly by ML algorithms?
7. Failure Modes
Unintentional failures: ML system produces a formally correct but completely unsafe
outcome
Outliers/anomalies
Dataset shift
Limited memory
Intentional failures: failure is caused by an active adversary attempting to subvert the
system to attain her goals, such as to:
misclassify the result
infer private training data
steal the underlying algorithm
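9. Strict stationarity: the joint distribution function $F_X$ of the process is invariant to time shifts,
$F_X(x_{t_1}, \ldots, x_{t_n}) = F_X(x_{t_1+\tau}, \ldots, x_{t_n+\tau})$
for all $\tau, t_1, \ldots, t_n$ and for all $n \in \mathbb{N}$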
11. Robustness & Transparency via Continual Learning
Data arrive continually
(Possibly) non-IID
Tasks may change over time (e.g. trends/fashions in
shopping)
New tasks may emerge (e.g. new product
categories, new marketplaces)
Robustness: How can we adapt to new data whilst retaining existing knowledge?
Transparency: How can we build systems that can signal when they’re going wrong?
Standard approaches:
Train individual models on each task, then train a combination
Maintain a single model and use regularization to fix influential parameters
19. Bayesian Continual Learning [Nguyen 2018]
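Given e.g. data in task $t$ as $D_t = \{x_t^{(n_t)}, y_t^{(n_t)}\}_{n_t=1}^{N_t}$, parameters $\theta$ (e.g. BLR, BNN, GP ...)
$p(\theta \mid D_{1:T}) \propto p(\theta)\, p(D_{1:T} \mid \theta)$
$= p(\theta) \prod_{t=1}^{T} \prod_{n_t=1}^{N_t} p(y_t^{(n_t)} \mid \theta, x_t^{(n_t)})$
$\propto p(\theta \mid D_{1:T-1})\, p(D_T \mid \theta).$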
Natural recursive algorithm!
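The recursion can be checked numerically in a conjugate model. The sketch below assumes Bayesian linear regression with a Gaussian prior and known noise variance; the function name `posterior_update` and the synthetic tasks are illustrative and not taken from the talk. Processing the tasks one at a time, with each posterior serving as the next prior, should give the same posterior as conditioning on all the data at once.

```python
import numpy as np

def posterior_update(mean, precision, X, y, noise_var=1.0):
    """One Bayesian linear-regression update: Gaussian prior in, Gaussian posterior out."""
    new_precision = precision + X.T @ X / noise_var
    new_mean = np.linalg.solve(new_precision, precision @ mean + X.T @ y / noise_var)
    return new_mean, new_precision

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
tasks = []
for _ in range(3):  # three synthetic tasks D_1, D_2, D_3
    X = rng.normal(size=(20, 3))
    y = X @ true_theta + rng.normal(scale=1.0, size=20)
    tasks.append((X, y))

prior_mean, prior_precision = np.zeros(3), np.eye(3)

# Sequential: the posterior after task t-1 becomes the prior for task t
seq_mean, seq_precision = prior_mean, prior_precision
for X, y in tasks:
    seq_mean, seq_precision = posterior_update(seq_mean, seq_precision, X, y)

# Batch: condition on all tasks at once
X_all = np.vstack([X for X, _ in tasks])
y_all = np.concatenate([y for _, y in tasks])
batch_mean, batch_precision = posterior_update(prior_mean, prior_precision, X_all, y_all)

print(np.allclose(seq_mean, batch_mean), np.allclose(seq_precision, batch_precision))
```

For models without a conjugate prior (e.g. BNNs) the posterior at each step has to be approximated, so the two routes no longer agree exactly.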
21. Generative models in continual learning
Generative models in continual learning. Task i consists of items of class i and generated samples from the previous task;
the goal is to generate samples from all previously seen classes
22. Why is this Useful?
Fashion-MNIST examples generated
by a Wasserstein GAN in Bayesian
continual learning
Generative models play an important role in
mitigating the forgetting of earlier tasks, as they can be used to generate
samples of previous tasks [Wu 2018], a method
known as generative replay
For deep learning models this is a form of
transparency: a window onto what the model has
learnt
23. Engineering a Continual Learning System
Automating Data Retention Policies:
Sketcher/Compressor: when the data rate is too high
Joiner: when labels arrive late
Shared infrastructure: optimal use of space, like an OS cache
Automating Monitoring and Quality Control:
Data monitoring: dataset shift detection, anomaly detection
Prediction monitoring: monitor performance of models
Automating the ML Life-Cycle:
Trainer and HPO: store provenance, warm start training
Model policy engine: ensure re-training is performed at the right cadence
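One simple way to bound storage when the data rate is too high is reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length. This is a sketch of the general technique (Algorithm R), not necessarily what the Sketcher/Compressor component listed above does; the class name `Reservoir` is hypothetical.

```python
import random

class Reservoir:
    """Fixed-size uniform sample of a stream (Algorithm R)."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / seen
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

reservoir = Reservoir(capacity=100, seed=0)
for record in range(10_000):
    reservoir.add(record)
print(len(reservoir.items), reservoir.seen)
```

Memory stays constant regardless of stream length, at the cost of discarding most records.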
24. “Zero-Touch” Machine Learning
[Figure: architecture diagram. Components: Streams, Sketcher/Sampler, Joiner, Data Monitoring (anomaly detection, distribution shift measurement), Trainer and HPO, Model Policy Engine (retrain, rollback), Predictor, Prediction Monitoring (accuracy, shift), Business Logic (business metrics, costs, desired accuracy), System State DB, Diagnostic Logs, and Shared Infrastructure (Model DB, Training Data Reservoir, Validation Data Reservoir)]
25. Summary: Continual Learning
Bayesian methods are a natural fit for continual learning
However, it’s tricky to make them work well with deep learning methods
An engineering viewpoint is also required
27. A first attempt: Can’t I just anonymize my data?
k-anonymity: the information for each person in the release cannot be distinguished from that of at least k − 1 other individuals whose information also appears in the release
Suppose a company is audited for salary discrimination
The auditor can see salaries by gender, age and nationality for each department and office
If the auditor has a friend, an ex, or a date working for the company, she will learn the salary of that person
Reducing data granularity reduces the risk, but also reduces accuracy (fidelity in this case)
Office   Dept.   Salary   D.O.B.     Nationality   Gender
London   IT      £#####   May 1985   Portuguese    Female
After reducing granularity:
Office   Dept.   Salary   D.O.B.      Nationality   Gender
UK       IT      £#####   1980-1985   -             Female
This still presents a risk of re-identification! If there are 10 females born between 1980 and 1985 in the whole of the UK IT department, 9 of them could conspire to learn the salary of the 10th one
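The k in k-anonymity can be measured directly as the size of the smallest group of records that share the same quasi-identifier values. The sketch below uses made-up records and a hypothetical helper `k_anonymity`; the column names simply mirror the table above.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical release: salary is the sensitive attribute
records = [
    {"office": "UK", "dept": "IT", "dob": "1980-1985", "gender": "Female", "salary": 52000},
    {"office": "UK", "dept": "IT", "dob": "1980-1985", "gender": "Female", "salary": 61000},
    {"office": "UK", "dept": "IT", "dob": "1980-1985", "gender": "Male", "salary": 58000},
]
k = k_anonymity(records, ["office", "dept", "dob", "gender"])
print(k)  # the single male record forms a group of size 1
```

Dropping `gender` from the quasi-identifiers raises k to 3 here, which illustrates the granularity versus fidelity trade-off described above.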
30. Anonymized Data Isn’t
Example 1: In the mid-1990s, the Massachusetts “Group Insurance Commission” (GIC) released
“anonymized” data on state employees that showed every hospital visit
The goal was to help researchers. All obvious identifiers such as name, address, and
social security number were removed
MIT PhD student Latanya Sweeney decided to attempt to reverse the anonymization and
requested a copy of the data
Reidentification
William Weld, then Governor of Massachusetts, assured the public that GIC had protected
patient privacy by deleting identifiers. Sweeney started hunting for the Governor’s hospital
records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts,
population 54,000 and 7 ZIP codes. For $20, she purchased the complete voter rolls from the
city of Cambridge, containing the name, address, ZIP code, birth date, and gender of every
voter. Crossing this with the GIC records, Sweeney found Governor Weld with ease: Only 6
people shared his birth date, only 3 of them men, and of them, only he lived in his ZIP code.
Sweeney sent the Governor’s health records (including diagnoses and prescriptions) to his office.
32. Anonymized Data Isn’t
Example 2: In 2006, Netflix released data pertaining to how 500,000 of its users rated
movies over a six-year period
Netflix “anonymized” the data before releasing it by removing usernames, but assigned
unique identification numbers to users in order to allow for continuous tracking of user
ratings and trends
Reidentification
Researchers used this information to uniquely identify individual Netflix users by crossing the
data with the public IMDB database. According to the study, if a person has information about
when and how a user rated six movies, that person can identify 99% of people in the Netflix
database.
35. Differential Privacy
A randomised mechanism $M : X \to Y$ is $\varepsilon$-differentially private if for all neighbouring inputs
$x \simeq x'$ (i.e. $\|x - x'\|_1 = 1$) and for all sets of outputs $E \subseteq Y$ we have
$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E]$
[Figure: output densities of M(D) and M(D'); their ratio is bounded by $e^{\varepsilon}$]
Mechanisms:
Randomised response → plausible deniability
Laplace mechanism: e.g. $\tilde{\mu} = \mu + \xi$, $\xi \sim \mathrm{Lap}\left(\frac{1}{n\varepsilon}\right)$
Output perturbation
...
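A minimal sketch of the Laplace mechanism for a mean, assuming each of the n values is clipped to [0, 1] so that changing one record moves the mean by at most 1/n. The function name `private_mean` and the clipping bounds are assumptions made for illustration.

```python
import numpy as np

def private_mean(values, epsilon, rng):
    """Release the mean of values clipped to [0, 1], adding Laplace noise of scale 1/(n*epsilon)."""
    values = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    n = len(values)
    sensitivity = 1.0 / n  # one record changes the mean by at most 1/n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return values.mean() + noise

rng = np.random.default_rng(0)
data = rng.uniform(size=10_000)
release = private_mean(data, epsilon=0.5, rng=rng)
print(abs(release - data.mean()))
```

The noise scale shrinks as n grows, so for large datasets the released mean stays close to the true one even for small epsilon.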
37. Randomized Response [Warner ’65]
Say you want to release a bit x ∈ {Yes, No}. Do the following:
1 flip a coin
2 if tails, respond truthfully with x
3 if heads, flip a second coin and respond “Yes” if heads; respond “No” if tails
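Claim: the above algorithm satisfies (log 3)-differential privacy
$\frac{\Pr[\text{Response} = \text{Yes} \mid x = \text{Yes}]}{\Pr[\text{Response} = \text{Yes} \mid x = \text{No}]} = \frac{1/2 \times 1 + 1/2 \times 1/2}{1/2 \times 0 + 1/2 \times 1/2} = \frac{3/4}{1/4} = 3 \implies e^{\varepsilon} = 3$
Same for $\frac{\Pr[\text{Response} = \text{No} \mid x = \text{Yes}]}{\Pr[\text{Response} = \text{No} \mid x = \text{No}]}$.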
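The calculation can be reproduced with exact fractions, and a short simulation shows why the noisy answers are still useful in aggregate. This is an illustrative sketch: the function name `randomized_response`, the 30% true rate and the sample size are assumptions. Since a "Yes" is reported with probability 1/4 + p/2 when the true rate is p, the rate can be estimated as 2 × (reported rate) − 1/2.

```python
from fractions import Fraction
import math
import random

def randomized_response(truth, rng):
    """Warner's randomized response for a single Yes/No bit (True means Yes)."""
    if rng.random() < 0.5:       # first coin is tails: answer truthfully
        return truth
    return rng.random() < 0.5    # first coin is heads: second coin decides

# Exact response probabilities, mirroring the calculation above
half = Fraction(1, 2)
p_yes_given_yes = half * 1 + half * half
p_yes_given_no = half * 0 + half * half
ratio = p_yes_given_yes / p_yes_given_no
epsilon = math.log(float(ratio))
print(ratio, epsilon)

# Aggregate estimation from noisy reports
rng = random.Random(0)
truths = [rng.random() < 0.3 for _ in range(100_000)]
reports = [randomized_response(t, rng) for t in truths]
estimate = 2 * (sum(reports) / len(reports)) - 0.5
print(estimate)
```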
38. Important Properties
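Robustness to post-processing: if $M$ is $(\varepsilon, \delta)$-DP, then $f(M)$ is $(\varepsilon, \delta)$-DP
Composition: if $M_1, \ldots, M_n$ are $(\varepsilon_i, \delta_i)$-DP, then $g(M_1, \ldots, M_n)$ is $\left(\sum_{i=1}^{n} \varepsilon_i, \sum_{i=1}^{n} \delta_i\right)$-DP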
Protects against arbitrary side knowledge
39. User-AI system interaction via natural language
User’s goal: meet some specific need with respect to an
issued query x
Agent’s goal: satisfy the user’s request
Privacy violation: occurs when x is used to make personal
inferences, e.g. when unrestricted PII is present
Mechanism: Modify the query to protect privacy whilst
preserving semantics
Our approach: Metric Differential Privacy
45. Desired Functionality
Intent               Query x                          Modified query x'
GetWeather           Will it be colder in Cleveland   Will it be colder in Ohio
PlayMusic            Play Cantopop on lastfm          Play C-pop on lastfm
BookRestaurant       Book a restaurant in Milladore   Book a restaurant in Wood County
SearchCreativeWork   I want to watch Manthan film     I want to watch Hindi film
46. Word Embeddings
Mapping from words into vectors of real numbers (many ways to do this!)
e.g. Neural network based models (e.g. Word2Vec, GloVe, fastText)
Defines a mapping $\phi : W \to \mathbb{R}^n$
Nearest neighbours are often synonyms
47. Metric Differential Privacy
Recall the definition of DP ...
$P[M(x) \in E] \le e^{\varepsilon}\, P[M(x') \in E]$ for $x, x' \in X$ s.t. $\|x - x'\|_1 = 1$
This can be rewritten into a single equation as:
$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon \|x - x'\|_1}$
Metric differential privacy generalises this to use any valid metric $d(x, x')$:
$\frac{P[M(x) \in E]}{P[M(x') \in E]} \le e^{\varepsilon\, d(x, x')}$
(easy to see that standard DP is metric DP with $d(x, x') = \|x - x'\|_1$)
50. Privacy in the Space of Word Embeddings [Feyisetan 2019, Feyisetan 2020]
Given:
$w \in W$: word to be “privatised” from word space $W$ (dictionary)
$\phi : W \to Z$: embedding function from word space to embedding space $Z$ (e.g. $\mathbb{R}^n$)
$v = \phi(w)$: corresponding word vector
$d : Z \times Z \to \mathbb{R}$: distance function in embedding space
$\Omega(\varepsilon)$: the D.P. noise sampling distribution (e.g. $\Omega_i(\varepsilon) = \mathrm{Lap}\left(\frac{1}{n\varepsilon}\right)$, $i = 1, \ldots, n$ for $\mathbb{R}^n$)
Metric DP Mechanism for word embeddings
1 Perturb the word vector: $v' = v + \xi$ where $\xi \sim \Omega(\varepsilon)$
2 The new vector $v'$ will not be a word (a.s.)
3 Project back to $W$: $w' = \arg\min_{u \in W} d(v', \phi(u))$, return $w'$
What do we need?
$d$ satisfies the axioms of a metric (nonnegativity, identity of indiscernibles, symmetry, triangle inequality)
A way to sample using $\Omega$ in the metric space that respects $d$ and gives us $\varepsilon$-metric DP
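The perturb-then-project steps can be sketched on a toy vocabulary. Everything concrete here is an assumption for illustration: the 2-d embedding table, the Euclidean distance, and per-coordinate Laplace noise with a free `scale` parameter. The sketch does not show how to calibrate the noise to obtain a formal metric-DP guarantee.

```python
import numpy as np

# Toy 2-d embedding table (made up); a real system would use trained word vectors
vocab = ["cleveland", "ohio", "columbus", "paris", "france"]
embeddings = np.array([
    [1.0, 0.9],
    [1.1, 1.0],
    [0.9, 1.1],
    [-1.0, 0.8],
    [-1.1, 1.0],
])

def privatise_word(word, scale, rng):
    """Perturb the word vector, then project back to the nearest vocabulary word."""
    v = embeddings[vocab.index(word)]
    noisy = v + rng.laplace(scale=scale, size=v.shape)   # step 1: perturb the vector
    dists = np.linalg.norm(embeddings - noisy, axis=1)   # distance to every word vector
    return vocab[int(np.argmin(dists))]                  # step 3: nearest word

rng = np.random.default_rng(0)
print([privatise_word("cleveland", scale=0.3, rng=rng) for _ in range(5)])
```

Larger noise scales make distant replacements more likely, which strengthens privacy at the cost of semantic fidelity.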
54. Example: Differentially Private SGD
Algorithm 1: Differentially Private SGD
Input: dataset $z = (z_1, \ldots, z_n)$
Hyperparameters: learning rate $\eta$, mini-batch size $m$, number of epochs $T$, noise variance $\sigma^2$, clipping norm $L$
Initialize $w \leftarrow 0$
for $t \in [T]$ do
    for $k \in [n/m]$ do
        Sample $S \subset [n]$ with $|S| = m$ uniformly at random
        Let $g \leftarrow \frac{1}{m} \sum_{j \in S} \mathrm{clip}_L(\nabla \ell(z_j, w)) + \frac{2L}{m}\, \mathcal{N}(0, \sigma^2 I)$
        Update $w \leftarrow w - \eta g$
return $w$
5+ hyper-parameters affecting both privacy and utility
For deep learning applications we only have empirical utility (not analytic)
How do we find the hyperparameters that give us an optimal trade-off?
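Algorithm 1 can be sketched in NumPy for a concrete loss. The example below assumes a least-squares loss on synthetic data; the function names `clip` and `dp_sgd` and all hyperparameter values are illustrative, and no privacy accounting (turning σ, m and T into an ε) is performed.

```python
import numpy as np

def clip(grad, L):
    """Scale grad so that its Euclidean norm is at most L."""
    norm = np.linalg.norm(grad)
    return grad * min(1.0, L / norm) if norm > 0 else grad

def dp_sgd(X, y, lr, m, epochs, sigma, L, rng):
    """DP-SGD for least-squares regression, following the structure of Algorithm 1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for _ in range(n // m):
            S = rng.choice(n, size=m, replace=False)
            # Per-example gradients of 0.5 * (x.w - y)^2, clipped individually
            grads = [clip((X[j] @ w - y[j]) * X[j], L) for j in S]
            noise = (2 * L / m) * rng.normal(scale=sigma, size=d)
            g = np.mean(grads, axis=0) + noise
            w = w - lr * g
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 3))
y = X @ np.array([1.0, -1.0, 2.0])
w = dp_sgd(X, y, lr=0.1, m=64, epochs=20, sigma=0.5, L=1.0, rng=rng)
print(w)
```

Every argument of `dp_sgd` other than the data changes either the noise injected or the number of noisy steps, which is why privacy and utility have to be tuned jointly.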
55. The Privacy-Utility Pareto Front
[Figure: points from the hyper-parameter space plotted by privacy loss and error, with the Pareto-optimal points marked]
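Given a set of evaluated hyper-parameter settings, the Pareto-optimal points are those for which no other setting is at least as good on both privacy loss and error. A brute-force sketch, with made-up (privacy loss, error) pairs and a hypothetical helper `pareto_front`:

```python
def pareto_front(points):
    """Return the points not dominated by any other; both coordinates are minimised."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

# Hypothetical (privacy loss, error) pairs, one per hyper-parameter setting
results = [(0.5, 0.40), (1.0, 0.25), (1.5, 0.30), (2.0, 0.15), (3.0, 0.15), (0.8, 0.45)]
print(pareto_front(results))
```

This quadratic scan is fine for a handful of evaluations; the harder problem is choosing which settings to evaluate in the first place.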
64. Bayesian Optimization
Gradient-free optimization for black-box functions
Widely used in applications (HPO in ML, scheduling & planning, experimental design ...)
In multi-objective problems, BO aims to learn the Pareto front with a minimal number of
evaluations.
74. DPareto
Repeat:
  1 For each objective (privacy, utility):
    1 Fit a surrogate model (Gaussian process (GP)) using the available dataset
    2 Calculate the predictive distribution using the GP mean and variance functions
  2 Use the posterior of the surrogate models to form an acquisition function
  3 Collect the next point at the estimated global maximum of the acquisition function
until budget exhausted
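The loop above can be sketched in a few dozen lines of standard-library Python. Everything specific here is an assumption made for illustration: a single hyper-parameter on a grid, toy `privacy_loss` and `error` functions standing in for real training runs, a fixed RBF kernel, and an acquisition function (hypervolume gained by an optimistic lower-confidence-bound prediction) chosen for simplicity. The slide does not say which acquisition function DPareto uses, and a real implementation would rely on a GP library rather than the hand-written solver below.

```python
import math

BETA = 2.0          # optimism of the lower confidence bound (assumed value)
LENGTH_SCALE = 0.2  # fixed RBF length-scale (assumed, not tuned)
NOISE = 1e-4        # jitter on the kernel diagonal for numerical stability


def privacy_loss(x):  # toy stand-in: more noise gives a smaller privacy loss
    return 1.0 - x


def error(x):  # toy stand-in: error is smallest at a moderate noise level
    return (x - 0.3) ** 2


def rbf(a, b):
    return math.exp(-0.5 * ((a - b) / LENGTH_SCALE) ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def gp_predict(xs, ys, x_new):
    """Posterior mean and variance of a GP surrogate at x_new."""
    mu = sum(ys) / len(ys)  # constant prior mean
    K = [[rbf(a, b) + (NOISE if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = mu + sum(k * w for k, w in zip(k_star, solve(K, [y - mu for y in ys])))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, max(var, 0.0)


def hypervolume(points, ref):
    """Area dominated by `points` and bounded by `ref` (2-D, minimization)."""
    hv, best_err = 0.0, ref[1]
    for eps, err in sorted(points):
        if eps < ref[0] and err < best_err:
            hv += (ref[0] - eps) * (best_err - err)
            best_err = err
    return hv


def acquisition(x, xs, obs, ref):
    """Hypervolume gained if the optimistic prediction at x were observed."""
    optimistic = []
    for j in range(2):  # one surrogate per objective
        mean, var = gp_predict(xs, [o[j] for o in obs], x)
        optimistic.append(mean - BETA * math.sqrt(var))
    return hypervolume(obs + [tuple(optimistic)], ref) - hypervolume(obs, ref)


candidates = [i / 50 for i in range(51)]  # 1-D hyper-parameter grid
xs = [0.0, 0.5, 1.0]                      # initial design
obs = [(privacy_loss(x), error(x)) for x in xs]
ref = (1.1, 0.6)                          # worse than any achievable outcome
hv_start = hypervolume(obs, ref)

for _ in range(8):  # evaluation budget
    x_next = max((c for c in candidates if c not in xs),
                 key=lambda c: acquisition(c, xs, obs, ref))
    xs.append(x_next)
    obs.append((privacy_loss(x_next), error(x_next)))

print(f"hypervolume: {hv_start:.3f} -> {hypervolume(obs, ref):.3f}")
```

Maximizing over a fixed grid replaces the "estimated global maximum" step; with more than one or two hyper-parameters that step would need a proper optimizer.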
76. Summary: Privacy Enhancing Technologies
Privacy
Privacy risks can be counter-intuitive and tricky to formalize
High-dimensional data and side knowledge make privacy hard
Semantic guarantees (e.g. DP) behave better than syntactic ones (e.g. k-anonymization)
Differential privacy is a mature privacy enhancing technology
Metric DP provides local plausible deniability; accuracy can be good even in cases with an infinite number of outcomes
Empirical privacy-utility trade-off evaluation enables application-specific decisions
Bayesian optimization provides a computationally efficient method to recover the Pareto front (esp. with a large number of hyper-parameters)
78. The Need for Algorithmic Fairness
Risks:
1 ML predictors might discriminate against groups of individuals protected by law or by ethics
2 Choosing a model that minimizes the expected loss may be good for the majority population but overlook minority populations
Examples: image classification [Buolamwini & Gebru, 2018] and natural language tasks [Bolukbasi et al., 2016]
Causes:
1 Training data may contain biases
2 The analysis of the training data may inadvertently introduce biases
3 Unlike privacy, there’s no single agreed-upon definition!
79. Statistical Bias
Definition: The difference between an estimator’s expected value and the true value
Is statistical bias an adequate fairness criterion?
“The model summarises the data correctly, if the data is biased it’s not the algorithm’s
fault”
Says nothing about the distribution of errors (variance of estimator)
Biases are inevitable! Take ownership ...
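A small numeric illustration of why zero statistical bias is not enough: the overall mean error can vanish while one group is systematically under-predicted. The group names and error values below are hypothetical and chosen only so that the errors cancel.

```python
from statistics import mean

# Hypothetical signed errors (prediction minus true value) for two groups.
errors = {
    "majority": [0.02] * 90,   # slightly over-predicted
    "minority": [-0.18] * 10,  # strongly under-predicted
}

all_errors = [e for group in errors.values() for e in group]
overall_bias = mean(all_errors)
group_bias = {name: mean(group) for name, group in errors.items()}

print(f"overall bias: {abs(overall_bias):.3f}")
for name, bias in group_bias.items():
    print(f"{name} bias: {bias:+.3f}")
```

The estimator is unbiased over the whole population, yet every member of the smaller group receives a prediction that is 0.18 too low.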
84. Calibration
Calibrated Classifier [Dawid 1982]
“a forecaster is well calibrated if, for example, of those events to which he assigns a probability 30 percent, the long-run proportion that actually occurs turns out to be 30 percent”
85. Calibration
α-Accuracy: If we do not want a predictor $f$ to downplay $S \subseteq X$, we require it to be (approx.) unbiased over $S$ for some small $\alpha \in [0, 1]$:
$$\left| \mathbb{E}_{i \sim S}\,(f_i - p^*_i) \right| \le \alpha$$
α-Calibration: for any $v \in [0, 1]$, let $S_v = \{i \in S : f_i = v\}$, then:
$$\left| \mathbb{E}_{i \sim S_v}\,(f_i - p^*_i) \right| = \left| v - \mathbb{E}_{i \sim S_v}\,(p^*_i) \right| \le \alpha$$
i.e. we are calibrated for all but a small number of items α.
Weakness: Guarantees too coarse. E.g. assign every member in $S$ the value $\mathbb{E}_{i \sim S}\,(p^*_i)$. This is perfectly calibrated, but “qualified” members of $S$ with large $p^*_i$ will be hurt.
Typically this is applied over large disjoint sets, e.g. race or gender.
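The two definitions and the weakness can be checked numerically. This is a sketch under a strong assumption: the true probabilities p* are treated as known, which they are not in practice. The function names and the four-person example are illustrative, and the calibration check implements the formula as written on the slide (every value v must be within α).

```python
from collections import defaultdict
from statistics import mean


def accuracy_gap(f, p_star):
    """|E_{i~S}(f_i - p*_i)| for the set S described by the two lists."""
    return abs(mean([fi - pi for fi, pi in zip(f, p_star)]))


def calibration_gap(f, p_star):
    """max over predicted values v of |v - E_{i~S_v}(p*_i)|."""
    by_value = defaultdict(list)  # v -> true probabilities of S_v
    for fi, pi in zip(f, p_star):
        by_value[fi].append(pi)
    return max(abs(v - mean(ps)) for v, ps in by_value.items())


# Hypothetical set S: two members with low and two with high true probability.
p_star = [0.25, 0.25, 0.75, 0.75]

# The coarse predictor from the slide: everyone gets the average of p*.
constant = [mean(p_star)] * len(p_star)

print(accuracy_gap(constant, p_star))     # 0.0
print(calibration_gap(constant, p_star))  # 0.0

# Largest under-estimate suffered by any individual member of S.
shortfall = max(pi - fi for fi, pi in zip(constant, p_star))
print(shortfall)                          # 0.25
```

Both gaps are zero, so the constant predictor satisfies α-accuracy and α-calibration for any α, while the two "qualified" members are each under-estimated by 0.25.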
86. Multicalibration [Hébert-Johnson et al., 2018]
Stronger notion: ensure calibration on every subpopulation (including qualified members
from before). But ... requires perfect predictions!
Need an intermediary definition that balances protecting subgroups vs information
bottleneck of small samples
Multicalibration Definition
“A predictor f is multicalibrated w.r.t. a family of subpopulations C if it is
calibrated w.r.t. every S ∈ C”, where C are computationally-identifiable subsets
Let $C \subseteq 2^X$ be a collection of subsets of $X$ and $\alpha \in [0, 1]$. A predictor $f$ is $(C, \alpha)$-multicalibrated if for all $S \in C$, $f$ is $\alpha$-calibrated w.r.t. $S$.
Think of C as a collection of subpopulations where set membership can be determined
efficiently, e.g. through boolean operations or by small decision trees
C can be quite rich, with many overlapping subgroups of a protected group S
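A minimal sketch of checking the definition, assuming (unrealistically) that the true probabilities p* are known and that C is given as a handful of boolean membership tests. The attribute `experienced`, the data, and the function names are hypothetical. This only verifies multicalibration for a given predictor; it is not the learning algorithm from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical members of a protected group S: an efficiently checkable
# attribute plus the (normally unknown) true probability p*.
people = [
    {"experienced": False, "p_star": 0.25},
    {"experienced": False, "p_star": 0.25},
    {"experienced": True, "p_star": 0.75},
    {"experienced": True, "p_star": 0.75},
]

# The collection C, written as boolean membership tests.
C = {
    "all of S": lambda person: True,
    "experienced": lambda person: person["experienced"],
    "not experienced": lambda person: not person["experienced"],
}


def calibration_gap(f, members):
    """max over values v of |v - E_{i~S_v}(p*_i)| within the given members."""
    by_value = defaultdict(list)
    for i in members:
        by_value[f[i]].append(people[i]["p_star"])
    return max(abs(v - mean(ps)) for v, ps in by_value.items())


def is_multicalibrated(f, alpha):
    """True if f is alpha-calibrated w.r.t. every subset in C."""
    for in_subset in C.values():
        members = [i for i, person in enumerate(people) if in_subset(person)]
        if members and calibration_gap(f, members) > alpha:
            return False
    return True


constant = [0.5, 0.5, 0.5, 0.5]     # calibrated on S as a whole only
refined = [0.25, 0.25, 0.75, 0.75]  # matches p* on every subset in C

print(is_multicalibrated(constant, alpha=0.1))  # False
print(is_multicalibrated(refined, alpha=0.1))   # True
```

The constant predictor passes on "all of S" but fails on the "experienced" subset, which is exactly the case that plain calibration over S could not detect.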
87. Summary: Algorithmic Fairness
Multicalibration
One particular notion of algorithmic fairness
Attractive since it can be applied post hoc
But ... currently limited to small datasets
How does this interact with privacy?
89. Summary
www.mbmlbook.com
Interactive AI requires more than just smart algorithms!
Requires us to think also about robustness and ethical implications
Future work (potential CDT projects!):
Multi-calibration using random forests
Optimize the fairness–utility, privacy–utility, privacy–fairness–utility trade-offs
Build privacy and fairness directly into continual learning systems
Leverage crowdsourcing and active learning to test privacy and fairness hypotheses