1. Transparency and fairness of predictive models, and
the provenance of the data used to build them:
thoughts and challenges
Paolo Missier
School of Computing, Newcastle University
Supporting Algorithm Accountability using Provenance
(a ProvenanceWeek 2018 workshop), London, July 12th, 2018
2. One of my favourite books
How much of Big Data is My Data?
Is Data the problem? Or the algorithms? Or how much we trust them?
Is there a problem at all?
3. What matters?
Decisions made based on algorithmically-generated knowledge:
• automatically filtering job applicants
• approving loans or other credit
• approving access to benefits schemes
• predicting insurance risk levels
• profiling users for policing purposes and predicting the risk of criminal recidivism
• identifying health risk factors
• …
4. GDPR and algorithmic decision making
Article 22 ("Automated individual decision-making, including profiling"), paragraph 1 (see figure 1), prohibits any "decision based solely on automated processing, including profiling" which "significantly affects" a data subject.
It stands to reason that an algorithm can only be explained if the trained model can be articulated and understood by a human.
It is reasonable to suppose that any adequate explanation would provide an account of how input features relate to predictions:
- Is the model more or less likely to recommend a loan if the applicant is a minority?
- Which features play the largest role in prediction?
B. Goodman and S. Flaxman, "European Union regulations on algorithmic decision-making and a 'right to explanation'," Proc. 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), Jun. 2016.
7. Interpretability (of machine learning models)
Interpretability: the ability to provide a qualitative understanding of the relationship between the input variables and the response.
- Transparency
  - Are features understandable?
  - Which features are more important?
- Post hoc interpretability
  - Natural language explanations
  - Visualisations of models
  - Explanations by example: "this tumor is classified as malignant because to the model it looks a lot like these other tumors"
Z. C. Lipton, "The Mythos of Model Interpretability," Proc. 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), Jun. 2016.
W. Samek, T. Wiegand, and K.-R. Müller, "Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models," Aug. 2017.
8. Black-box approaches
Model agnostic: an explainer should be able to explain any model, and thus be model-agnostic (i.e. treat the original model as a black box).
Local fidelity: for an explanation to be meaningful it must at least be locally faithful, i.e. it must correspond to how the model behaves in the vicinity of the instance being predicted.
9. Occlusion testing
M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?': Explaining the Predictions of Any Classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 2016, pp. 1135–1144.
10. Expected accuracy is not enough for trust
An SVM classifier with 94% accuracy… but a questionable one!
11. LIME
Model agnostic.
Locally faithful: the explanation must correspond to how the model behaves in the vicinity of the instance being predicted.
M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?': Explaining the Predictions of Any Classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 2016, pp. 1135–1144.
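The local-perturbation idea behind LIME and occlusion testing can be sketched in a few lines. This is an illustrative toy, not the LIME algorithm itself: the "black box" is a made-up rule over three binary features, and the explainer simply measures how much flipping (occluding) each feature changes the prediction in random samples near the instance.

```python
import random

def black_box(x):
    # Stand-in "opaque" model: a nonlinear rule over 3 binary features.
    return 1.0 if (x[0] and not x[2]) else 0.0

def local_explanation(predict, instance, n_samples=500, seed=0):
    """Estimate each feature's local influence by flipping it inside
    random perturbations drawn around `instance` (LIME-like idea)."""
    rng = random.Random(seed)
    k = len(instance)
    scores = [0.0] * k
    for _ in range(n_samples):
        # Perturb: flip each feature with small probability (stay local).
        z = [v if rng.random() > 0.2 else 1 - v for v in instance]
        base = predict(z)
        for i in range(k):
            z_i = list(z)
            z_i[i] = 1 - z_i[i]          # occlude / flip feature i
            scores[i] += abs(predict(z_i) - base)
    return [s / n_samples for s in scores]

weights = local_explanation(black_box, [1, 0, 0])
# Features 0 and 2 drive the prediction locally; feature 1 is inert.
```

The explanation treats the model purely as a function, which is the model-agnostic property the previous slide asks for; the real LIME additionally fits a weighted sparse linear model over the perturbations.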
12. Other model explanation approaches
1. Black Box Explanations through Transparent Approximations (BETA) [1]
• Decision Set approximation of black box models
• Fidelity + interpretability of the explanation
• Global (unlike LIME)
2. Intelligible additive models [2]
• Generalized Additive Models (GAMs)
• Generalized Additive Models with pairwise interactions (GA2M)
[1] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec, "Interpretable & Explorable Approximations of Black Box Models," arXiv preprint arXiv:1707.01154, 2017.
[2] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1721–1730.
13. Data → Model → Predictions
The pipeline: data collection → raw datasets → population data pre-processing → instances / features → model → predicted you (a ranking, a score, a class).
Key decisions are made during data collection:
- Where does the data come from?
- What's in the dataset?
This complements current ML approaches to model interpretability.
14. Possible roles for provenance
1) Data acquisition: Provenance → Transparency → Trust
15. Data → Model → Predictions
The pipeline again: data collection → raw datasets → population data pre-processing → instances / features → model → predicted you (a ranking, a score, a class).
Key decisions are made during:
- Data collection: where does the data come from? What's in the dataset?
- Data preparation: how was it pre-processed?
1. Can we explain these decisions?
2. Are these explanations useful?
16. Explaining data preparation
Paolo Missier (Computing), Dennis Prangle (Stats)
Data acquisition and wrangling:
- How were datasets acquired? How recently? For what purpose?
- Are they being reused / repurposed?
- What is their quality?
Pre-processing steps: integration, cleaning, outlier removal, normalisation, feature selection, class rebalancing, sampling, stratification, …
Implemented as:
- Scripts: Python / TensorFlow, Pandas, Spark
- Workflows: Knime, …
Provenance → Transparency
17. Provenance for transparency
1. Collection
- Program-level
- System-level
2. Representation
- W3C PROV (for interoperability)
- Multiple proprietary formats (for efficient encoding)
3. Querying / analysis
• RDBMS
• GDBMS
• RDF / SPARQL
What provenance captures:
• the configuration of each pre-processing step
• the data dependency graph
Questions it should answer:
- Which kind of normalisation did you apply?
- Was the data (down/up)-sampled? How?
- How did you define / remove outliers?
- How did you window your time series?
- Was the data repurposed (acquired from a repository)?
- How was the original protocol defined?
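The "configuration + data dependency graph" combination above can be made concrete with a minimal sketch. The pipeline steps and the `lineage` query below are invented for illustration; a real system would encode the same record in W3C PROV (used / wasGeneratedBy / wasDerivedFrom relations) rather than plain dicts.

```python
# A minimal provenance record for a pre-processing pipeline, in the
# spirit of W3C PROV: each step is an activity with the entities it
# used and generated, plus its configuration. Names are illustrative.
provenance = []

def record(activity, used, generated, config=None):
    provenance.append({"activity": activity, "used": used,
                       "generated": generated, "config": config or {}})

record("impute_age", ["raw"],     ["imputed"], {"method": "mean_by_class"})
record("normalise",  ["imputed"], ["normed"],  {"method": "z-score"})
record("downsample", ["normed"],  ["train"],   {"ratio": 0.5})

def lineage(entity):
    """Walk derivations backwards: every step that (transitively)
    contributed to `entity`, with its configuration."""
    steps, frontier = [], {entity}
    for step in reversed(provenance):
        if frontier & set(step["generated"]):
            steps.append((step["activity"], step["config"]))
            frontier |= set(step["used"])
    return steps

# Answers questions like "which normalisation was applied upstream
# of the training set?"
print(lineage("train"))
```

The query pattern directly supports the questions on the slide: the answer to "which kind of normalisation did you apply?" is read off the configuration attached to the relevant activity in the lineage.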
18. Example
• The classic "Titanic" dataset
• Can you predict survival probabilities?
• A simple logistic regression analysis
Survived – Survival (0 = No; 1 = Yes)
Pclass – Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name – Name
Sex – Sex
Age – Age
SibSp – Number of Siblings/Spouses Aboard
Parch – Number of Parents/Children Aboard
Ticket – Ticket Number
Fare – Passenger Fare (British pounds)
Cabin – Cabin
Embarked – Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
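A minimal version of the "simple logistic regression analysis" can be sketched without any ML library. The rows below are invented stand-ins shaped like Titanic records (Pclass, Sex, Survived), with Sex encoded as female = 1; the fitting routine is plain stochastic gradient descent on log loss.

```python
import math

# Toy rows shaped like the Titanic dataset: (Pclass, Sex, Survived).
# The values are illustrative, not the real data.
rows = [(1, 1, 1), (1, 0, 0), (2, 1, 1), (2, 0, 0),
        (3, 1, 1), (3, 0, 0), (1, 1, 1), (3, 0, 0)] * 20

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data, lr=0.1, epochs=200):
    """Stochastic gradient descent on log loss for p = sigmoid(w.x + b)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for pclass, sex, y in data:
            p = sigmoid(w[0] * pclass + w[1] * sex + b)
            err = p - y                 # gradient of log loss w.r.t. z
            w[0] -= lr * err * pclass
            w[1] -= lr * err * sex
            b    -= lr * err
    return w, b

w, b = fit_logistic(rows)
p_female_3rd = sigmoid(w[0] * 3 + w[1] * 1 + b)  # survival prob., female, 3rd class
p_male_3rd   = sigmoid(w[0] * 3 + w[1] * 0 + b)
```

Even this toy fit makes the provenance question concrete: how strongly Sex or Pclass drives the prediction depends directly on which features and imputations survived data preparation.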
19. Enable analysis of data pre-processing
• The data preparation workflow includes a number of decisions:
- Dropping irrelevant attributes: 'PassengerId', 'Name', 'Ticket', 'Cabin'
- Managing missing values: Age is present in only 714/891 records; "Pclass is a good predictor for Age", so impute missing Age values using the average age for each Pclass
- Is the target class balanced?
- Dropping correlated features (?): drop "Fare" and "Pclass"
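The imputation decision on this slide — fill missing Age with the average age of the passenger's class — can be sketched directly. The records below are made up for illustration.

```python
# Toy records with a missing Age in two passenger classes.
records = [
    {"Pclass": 1, "Age": 40}, {"Pclass": 1, "Age": None},
    {"Pclass": 3, "Age": 20}, {"Pclass": 3, "Age": 30},
    {"Pclass": 3, "Age": None},
]

def impute_age_by_class(rows):
    # Mean age per class, computed over the non-missing values only.
    means = {}
    for r in rows:
        if r["Age"] is not None:
            means.setdefault(r["Pclass"], []).append(r["Age"])
    means = {c: sum(v) / len(v) for c, v in means.items()}
    return [dict(r, Age=means[r["Pclass"]] if r["Age"] is None else r["Age"])
            for r in rows]

filled = impute_age_by_class(records)
# → class-1 gap filled with 40.0, class-3 gap with 25.0
```

Recording the choice of `mean_by_class` (rather than, say, a global mean or a constant) is exactly the kind of pre-processing configuration the provenance record should retain.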
21. Exploring the effect of alternative pre-processing
Two pipelines over the same raw data D:
- P1: D → D1 → Learn → M1 → Predict(x) → y1
- P2: D → D2 → Learn → M2 → Predict(x) → y2
When y1 ≠ y2, how can knowledge of P1 and P2 help understand why?
Examples: alternative imputation methods for missing values; boosting the minority class / downsampling the majority class.
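A tiny sketch makes the y1 ≠ y2 scenario concrete: the same raw data pushed through two imputation strategies (mean vs. constant), then fed to the same trivial learner, yields different predictions for the same instance x. Data and "model" are both invented for illustration.

```python
# (feature, label) pairs; one feature value is missing.
raw = [(None, 0), (10.0, 1), (2.0, 0)]

def impute(data, strategy):
    observed = [f for f, _ in data if f is not None]
    fill = sum(observed) / len(observed) if strategy == "mean" else 0.0
    return [(f if f is not None else fill, y) for f, y in data]

def predict_1nn(train, x):
    # Trivial 1-nearest-neighbour "model".
    return min(train, key=lambda fy: abs(fy[0] - x))[1]

x = 7.0
y1 = predict_1nn(impute(raw, "mean"), x)   # P1: missing value -> 6.0
y2 = predict_1nn(impute(raw, "zero"), x)   # P2: missing value -> 0.0
# y1 != y2: the provenance of the imputation step explains the disagreement.
```

Here comparing the recorded configurations of P1 and P2 is sufficient to explain the disagreement, which is the role the slide proposes for provenance.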
22. Also: scripts allude to human decisions
How do we capture these decisions?
To what extent can they be inferred from code?
23. Correlation analysis
• Is Pclass really a good predictor for Age?
• Why drop both Pclass and Fare?
Alternative pre-processing:
1. Drop Age only: nearly identical performance (F1 = 0.77 vs. 0.76)
2. Use Sex and Pclass only
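The question "is Pclass really a good predictor for Age?" is answerable with a plain correlation check. The (Pclass, Age) pairs below are invented to illustrate the pattern (1st class skewing older) that would justify the imputation choice.

```python
# Pearson correlation from first principles (no libraries needed).
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

pclass = [1, 1, 2, 2, 3, 3]
ages   = [45, 50, 35, 30, 22, 25]   # illustrative: 1st class skews older
r = pearson(pclass, ages)
# A strongly negative r (higher class number <-> younger passengers)
# is the pattern that supports imputing Age from Pclass.
```

Capturing this diagnostic alongside the pipeline would let a later auditor see not just *what* imputation was applied but *why* it was considered justified.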
24. Possible roles for provenance
1) Data acquisition: Provenance → Transparency → Trust
2) Data transformation: Provenance → explanations
- Is data preparation correct?
- Is training data fit to learn from?
- What is the effect of alternative pre-processing?
- Can we infer data prep decisions from pre-processing code?
25. Bias (in ML)
Bias: "Any basis for choosing one generalization [hypothesis] over another, other than strict consistency with the observed training instances." (*)
Absolute bias:
• certain hypotheses are entirely eliminated from the hypothesis space
• e.g. the a priori choice of model (decision trees, SVM, NN, …)
Relative bias:
• certain hypotheses are preferred over others
• e.g. "prefer shallow, simple decision trees to deep ones"
(*) Mitchell, T. M. (1980). The need for biases in learning generalizations. Tech. rep. CBM-TR-117, Rutgers University, New Brunswick, NJ.
26. Fairness and bias: the (notorious) COMPAS case
• Increasingly popular within the criminal justice system
• Used or considered for use in pre-trial decision-making (USA)
1: The initial claim
"Black defendants who did not recidivate over a two-year period were nearly twice as likely to be misclassified as higher risk compared to their white counterparts (45 percent vs. 23 percent)."
"White defendants who re-offended within the next two years were mistakenly labeled low risk almost twice as often as black re-offenders (48 percent vs. 28 percent)."
Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner, "Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks," ProPublica, 2016. https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
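The rates in the ProPublica claim are ordinary confusion-matrix quantities, so the comparison is easy to recompute once per-group counts are available. The counts below are invented for illustration (chosen to roughly echo the reported gap); only the rate definitions follow the article's analysis.

```python
# Per-group false positive / false negative rates from confusion counts.
def error_rates(tp, fp, tn, fn):
    fpr = fp / (fp + tn)   # non-recidivists wrongly labelled high risk
    fnr = fn / (fn + tp)   # recidivists wrongly labelled low risk
    return fpr, fnr

# (tp, fp, tn, fn) per group "A" and "B"; counts are made up.
groups = {"A": (300, 180, 220, 100), "B": (250, 90, 310, 150)}
rates = {g: error_rates(*c) for g, c in groups.items()}
# Unequal FPR across groups is the disparity the article reported.
```

Note that everything here depends on how the cohort was assembled and labelled, which is exactly where data-preparation provenance enters the later slides.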
27. Model fairness and data bias
"In this paper we show that the differences in false positive and false negative rates cited as evidence of racial bias in the ProPublica article are a direct consequence of applying an instrument that is free from predictive bias to a population in which recidivism prevalence differs across groups."
COMPAS complies with the test fairness condition: the observed P(Y | S = s) is largely independent of the race R.
A. Chouldechova, "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments," Big Data, vol. 5, no. 2, pp. 153–163, Jun. 2017.
28. COMPAS scores are skewed
Analysis of 6,172 defendants who had not been arrested for a new offense or who had recidivated within two years:
- scores for white defendants were skewed toward lower-risk categories, while black defendants were evenly distributed across scores
- large discrepancies in FPR and FNR between black and white defendants
- … but this does not mean that the score itself is unfair
29. FPR / FNR
With a binary high/low-risk score Sc, recidivism outcome Y, and race R:
- positive predictive value of Sc: PPV = P(Y = 1 | Sc = HR, R = r)
- the test fairness condition (2.1) can be expressed as the constraint that PPV does not depend on R
- recidivism prevalence within groups: p_r = P(Y = 1 | R = r)
- false positive rate: FPR = P(Sc = HR | Y = 0, R = r)
- false negative rate: FNR = P(Sc = LR | Y = 1, R = r)
When the recidivism prevalence differs between two groups, a test-fair score cannot have equal FPR and FNR across those groups: the rates are linked by FPR = (p / (1 − p)) · ((1 − PPV) / PPV) · (1 − FNR).
30. The actual "provenance" of the analysis
https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm
Data acquisition + transformation → model bias and fairness:
- Can knowledge of data prep explain model bias?
- Does data prep introduce / remove bias?
31. Fairness: many possible definitions
(*) M. J. Kusner, J. Loftus, C. Russell, and R. Silva, "Counterfactual Fairness," in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4066–4076.
32. Causality and counterfactual fairness
Causal graph: the driver's race A (protected) and a latent aggressive-driving trait U both influence the observable red-car preference X; only U influences the predicted accident rate Y.
• Individuals belonging to a race A are more likely to drive red cars (A → X)
• Aggressive drivers tend to prefer red cars (U → X)
• However, race is not a good predictor for either U or Y
Using X to predict Y leads to a counterfactually unfair model:
• it may charge individuals of a certain race more than others, even though no race is more likely to have an accident
Is knowledge of data prep useful at all to determine this kind of fairness?
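The red-car story can be checked with a tiny simulation of the causal graph. All probabilities below are illustrative: the latent trait U causes both the red-car preference X and accidents Y, while race A raises X but has no effect on Y.

```python
import random

rng = random.Random(42)

def individual(a):
    # Latent aggression U: same rate for every race.
    u = rng.random() < 0.3
    # Observable red-car preference X: pushed up by both U and A.
    x = rng.random() < 0.2 + 0.4 * u + 0.3 * a
    # Accident Y: depends on U only, never on A.
    y = rng.random() < 0.1 + 0.6 * u
    return a, x, y

pop = [individual(a) for a in (0, 1) for _ in range(20000)]

red_by_race = [sum(x for a, x, _ in pop if a == g) / 20000 for g in (0, 1)]
acc_by_race = [sum(y for a, _, y in pop if a == g) / 20000 for g in (0, 1)]
# A model pricing risk from the observable X alone would charge group
# a=1 more, even though the two groups' true accident rates match.
```

The simulation shows why the slide's question matters: no amount of pre-processing provenance over X alone reveals the unfairness, because the problem lies in the causal structure that generated the data, not in how it was wrangled.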
33. Possible roles for provenance
1) Data acquisition: Provenance → Transparency → Trust
2) Data transformation: Provenance → explanations
- Is data preparation correct?
- Is training data fit to learn from?
- What is the effect of alternative pre-processing?
3) Data acquisition + transformation → model bias and fairness
- Is provenance useful to diagnose an unfair / biased model?
- Does data prep introduce / remove bias?
34. Opportunities and challenges: summary
1) Data acquisition: Provenance → Transparency → Trust
2) Data transformation: Provenance → explanations
- Is data preparation correct?
- Is training data fit to learn from?
- What is the effect of alternative pre-processing?
3) Data acquisition + transformation → model bias and fairness
- Is provenance useful to diagnose an unfair / biased model?
- Does data prep introduce / remove bias?
35. A few initial references
[1] C. O'Neil, Weapons of Math Destruction. Crown Books, 2016.
[2] B. Goodman and S. Flaxman, "European Union regulations on algorithmic decision-making and a 'right to explanation'," Proc. 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016), Jun. 2016.
[3] M. T. Ribeiro, S. Singh, and C. Guestrin, "'Why Should I Trust You?': Explaining the Predictions of Any Classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16), 2016, pp. 1135–1144.
[4] H. Lakkaraju, S. H. Bach, and J. Leskovec, "Interpretable Decision Sets: A Joint Framework for Description and Prediction," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1675–1684.
[5] K. Yang and J. Stoyanovich, "Measuring Fairness in Ranked Outputs," in Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM '17), 2017, pp. 1–6.
[6] T. Gebru et al., "Datasheets for Datasets," 2018.
[7] Z. Abedjan, L. Golab, and F. Naumann, "Profiling relational data: a survey," VLDB J., vol. 24, no. 557, 2015.
[8] A. Weller, "Challenges for Transparency," in Proceedings of the 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016).
[9] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1721–1730.
[Speaker notes]
Individuals as well as businesses, which we will initially refer to as subjects (and later upgrade to active participants), increasingly find themselves at the receiving end of impactful decisions made by organisations on their behalf, based on processes that use algorithmically-generated knowledge.
This brings about the issue of trust in the models: should I use the prediction?
"Determining trust in individual predictions is an important problem when the model is used for decision making. When using machine learning for medical diagnosis [6] or terrorism detection, for example, predictions cannot be acted upon on blind faith, as the consequences may be catastrophic."
How about the data used to train / build the model? It is relatively easy to keep track of data pre-processing provenance.