The impact of AI on society keeps growing, and it is not all good. We as Data Scientists have to put in real work so that we don't end up in Machine Learning hell. Every Data Scientist should account for fairness... but how? In this talk, I'll show some recent examples of how AI led to unfair outcomes at scale and argue that fairness should be part of the standard toolbox of Data Scientists. Building on cutting-edge research, I'll show how an adversarial classifier can force a model to be fair. The talk ends with some pointers on how to embed fairness in your organisation.
2. Data Scientists have to put in the work
so that society does not end up in ML hell
“The gates of hell are open night and day;
Smooth the descent, and easy is the way:
But to return, and view the cheerful skies,
In this the task and mighty labor lies.”
The Works of Virgil (John Dryden)
3. About me:
one of those Data Scientists who should put in the work
GoDataDriven: Driving Your Success With Data and AI
Henk Griffioen
Lead Data Scientist
@
4. The impact of AI on society is not all good. AI can encode and
amplify human biases, leading to unfair outcomes at scale
5. Fairness is a hot topic and gaining traction!
https://fairmlclass.github.io/
10. p%-rule: a measure of demographic parity
The ratio of
• the probability of a positive outcome given that the sensitive attribute is true;
• the probability of a positive outcome given that the sensitive attribute is false;
should be no less than p:100.
Examples: 40% : 50% = 80%; 10% : 50% = 20%
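To make the rule concrete, here is a minimal numpy sketch (the function name and argument layout are mine, not from the slides); it takes the smaller of the two ratios so the measure is symmetric:

```python
import numpy as np

def p_rule(y_pred, sensitive):
    """p%-rule: ratio of positive-outcome rates between the two groups.

    y_pred    : 0/1 array of predicted outcomes
    sensitive : boolean array, True where the sensitive attribute holds
    """
    rate_true = y_pred[sensitive].mean()    # P(positive | attribute true)
    rate_false = y_pred[~sensitive].mean()  # P(positive | attribute false)
    ratio = rate_true / rate_false
    return min(ratio, 1 / ratio) * 100      # symmetric, expressed as p%

# The slide's examples:
print(min(0.40 / 0.50, 0.50 / 0.40) * 100)  # 80.0
print(min(0.10 / 0.50, 0.50 / 0.10) * 100)  # 20.0
```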
11. Our model is unfair: low probability of high income for
black people and women
13. Many reasons why bias is creeping into our systems (Barocas & Selbst, 2016)
• Skewed sample: initial bias that compounds over time
• Tainted examples: bias in the data caused by humans
• Sample size disparity: minority groups not as well represented
• Limited features: less informative data collected on minority groups
• Proxies: data implicitly encoding sensitive attributes
• …
15. Ethnic profiling by the Dutch tax authorities
Profiling people for fraud
“A daycare center in Almere sounded the alarm when only non-Dutch parents were confronted with discontinuation of childcare allowances…
…The Tax and Customs Administration says that it uses the data on Dutch nationality or non-Dutch nationality in the so-called automatic risk selection for fraud.”
https://www.nrc.nl/nieuws/2019/05/20/autoriteit-persoonsgegevens-onderzoekt-mogelijke-discriminatie-door-belastingdienst-a3960840
16. Is it enough to leave out data on (second) nationality?
…In a response, the Tax and Customs Administration states that the information about the (second) nationality of parents or intermediary is not used in this investigation…
“Since 2014, a second nationality with Dutch nationality is no longer included in the basic registration. This has been introduced to prevent discrimination for people with dual nationality.”
Is this enough to ensure that non-Dutch parents in Almere will not suffer another tax injustice?
Towards a fair future?
17. The model is still unfair without sensitive data. Biases are still encoded by proxies in the dataset!
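A quick way to check this claim is to see whether a classifier can recover the sensitive attribute from the remaining features. Below is a minimal sketch on synthetic data; the column names and numbers are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: "postcode" correlates with the sensitive attribute,
# so dropping the attribute itself still leaves a proxy behind.
rng = np.random.default_rng(0)
n = 5_000
sensitive = rng.integers(0, 2, n)             # 0/1 sensitive attribute
postcode = sensitive + rng.integers(0, 2, n)  # noisy, correlated proxy
income = rng.normal(size=n)                   # unrelated feature
X = pd.DataFrame({"postcode": postcode, "income": income})

# How well can we recover the sensitive attribute without using it directly?
auc = cross_val_score(
    GradientBoostingClassifier(), X, sensitive, scoring="roc_auc"
).mean()
print(f"AUC for recovering the sensitive attribute: {auc:.2f}")
# An AUC well above 0.5 means the dataset still encodes the attribute.
```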
26. There is no single mathematical formulation of fairness: there are many (conflicting) measures
http://www.ece.ubc.ca/~mjulia/publications/Fairness_Definitions_Explained_2018.pdf
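A toy illustration of the conflict (all numbers invented): a perfect predictor gives both groups equal true positive rates, satisfying equal opportunity, yet unequal positive rates, violating demographic parity, simply because the base rates differ:

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0])
y_pred = y_true.copy()                # a perfect predictor
group = np.array([0] * 6 + [1] * 4)  # group 0: 6 people, group 1: 4 people

for g in (0, 1):
    mask = group == g
    pos_rate = y_pred[mask].mean()             # P(pred=1 | group)
    tpr = y_pred[mask & (y_true == 1)].mean()  # P(pred=1 | y=1, group)
    print(f"group {g}: positive rate {pos_rate:.2f}, TPR {tpr:.2f}")
# group 0: positive rate 0.50, TPR 1.00
# group 1: positive rate 0.25, TPR 1.00
# Equal opportunity holds; demographic parity (a 50% p%-rule) does not.
```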
27. Many ML fairness approaches
https://dzone.com/articles/machine-learning-models-bias-mitigation-strategies
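One family of approaches, adversarial debiasing, is what this talk builds on; see blog.godatadriven.com/fairness-in-pytorch for the full version. Below is a minimal PyTorch sketch: the layer sizes, the trade-off weight lambda_, and the training-step structure are illustrative choices, not the talk's exact setup.

```python
import torch
import torch.nn as nn

# x: (batch, 20) features; y, s: (batch, 1) floats in {0, 1}.
# The classifier predicts the label y; the adversary tries to recover the
# sensitive attribute s from the classifier's output. Training the classifier
# to fool the adversary pushes its predictions towards demographic parity.
clf = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 1))
adv = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
clf_opt = torch.optim.Adam(clf.parameters())
adv_opt = torch.optim.Adam(adv.parameters())
bce = nn.BCEWithLogitsLoss()
lambda_ = 1.0  # how strongly fairness is traded off against accuracy

def train_step(x, y, s):
    # 1) update the adversary: predict s from the (detached) classifier output
    adv_opt.zero_grad()
    adv_loss = bce(adv(clf(x).detach()), s)
    adv_loss.backward()
    adv_opt.step()
    # 2) update the classifier: predict y well AND fool the adversary
    clf_opt.zero_grad()
    z = clf(x)
    clf_loss = bce(z, y) - lambda_ * bce(adv(z), s)
    clf_loss.backward()
    clf_opt.step()
```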
28. Fairness should be a key part of your product process
• State problem: Is using an algorithm ethical? Can it be misused? Who might be harmed?
• Construct dataset: Are there biased features? Are groups over-/underrepresented? Should we get more data?
• Select algorithm: Is our objective function in line with ethics? Do we need separate models for minority populations?
• Train model: Do we need to enforce fairness? What metrics should we track?
• Test model: What fairness metrics? Do the metrics capture consumer needs?
• Deploy solution: Are we deploying on a population not captured in the dataset?
• Gather feedback: Does the solution enforce unfair feedback loops? Is intervention needed?
https://www.slideshare.net/KrishnaramKenthapadi/fairnessaware-machine-learning-practical-challenges-and-lessons-learned-www-2019-tutorial
30. Data Scientists have to put in the work
so that society does not end up in ML hell
“The gates of hell are open night and day;
Smooth the descent, and easy is the way:
But to return, and view the cheerful skies,
In this the task and mighty labor lies.”
The Works of Virgil (John Dryden)
31. Where to go from here?
An ethics checklist for data scientists
• http://deon.drivendata.org/
Tutorial on fairness for products
• sites.google.com/view/wsdm19-fairness-tutorial
Community concerned with fairness in ML
• www.fatml.org
Our blogs
• blog.godatadriven.com/fairness-in-ml
• blog.godatadriven.com/fairness-in-pytorch