An algorithm is a set of steps that performs calculations, processes data, or automates tasks. Algorithms are everywhere we look (and even in places we don’t look), controlling what we see, what we do, and where we go. They’re great for solving our problems and helping us make better and quicker decisions, or for taking the decision-making out of our hands. Their guidance seems perfect in its objective, unbiased calculation. Except it is not, actually. Like everything else, algorithms are created by people, and people have biases that get encoded into the algorithms they create. Algorithms also learn from data, which is likewise created by people, so they pick up biases from that data as well. This becomes a problem when algorithms build these biases into their calculations and go on to perpetuate them.
In this talk you will hear why we should care about algorithmic accountability, along with a case study showing how computational journalism can be used to investigate algorithms and advocate for transparency and accountability.
2. COMPUTATIONAL
JOURNALISM
▸ Develop tools for Newsrooms
▸ Data gathering
▸ Story tracking
▸ Personalized news
▸ Comment moderation
▸ Using computational methods to
investigate a story
▸ Algorithmic accountability and
transparency
Applying computer
science to journalism
7. GOOGLE AUTOCOMPLETE FAQ
▸ “…we exclude a narrow class of search queries related to
pornography, violence, hate speech, and copyright
infringement.”
▸ Criteria: Boundaries of censorship; Differences among
search engines; Mistakes?
13. SEARCH ENGINES ARE COMPLICATED!
▸ Are we using search terms
that people in real life use?
▸ Personalization (IP, profile,
history)
▸ Randomization, A/B tests
▸ …not to mention Google
doesn't want people
scraping their results (ack!)
15. ▸ Discriminatory/unfair
▸ Mistake that denies a service
▸ Censorship
▸ Breaks law or social norm
▸ False prediction
▸ Violation of privacy
16. PREVIOUS
WORK
▸ Surge pricing triggered by
car requests outnumbering
available cars (demand >
supply)
▸ Goal of surge pricing:
▸ Encourage more drivers
on the road
▸ Redistribute current
drivers to areas of high
demand
18. CURRENT
▸ Hypothesis: service quality
may not be the same
across D.C.
▸ Expected Wait Time as a proxy
for service quality: combines car
availability, current and
historical surge pricing,
and other hidden factors.
▸ If true, can this be
predicted by census data?
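One hedged way to read "predicted by census data" is a least-squares fit of mean expected wait time against a single z-scored census variable. The sketch below is a minimal single-predictor version written with the standard library only. The function name `fit_line` and the sample values are hypothetical and are not results from the study, which may well have used a multi-variable model in `statsmodels`.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares with one predictor.

    Returns (slope, intercept, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0:
        raise ValueError("predictor has zero variance")
    slope = sxy / sxx
    intercept = my - slope * mx
    # With a single predictor, R^2 is the squared Pearson correlation.
    r_squared = (sxy ** 2) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


# Made-up illustrative values, NOT data from the study.
pct_poc_z = [-1.0, -0.5, 0.0, 0.5, 1.0]      # z-scored census variable per tract
mewt_minutes = [3.0, 3.5, 4.0, 4.5, 5.0]     # mean expected wait time per tract

slope, intercept, r_squared = fit_line(pct_poc_z, mewt_minutes)
```

Because the predictor is a z-score, the slope reads as the change in wait time per one standard deviation of the census variable.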
19. APPROACHES, TOOLS
▸ Data sources
▸ Uber API, `uber.py`, census.gov resources (tons, free)
▸ Spatial sampling across the District
▸ Python GIS-related libraries (`geopy`, `address`, `cenpy`)
▸ The http://data.fcc.gov/ API returns an address when given a latitude and longitude
▸ Sample grid-style, averaged to census tracts
▸ Data wrangling and statistics
▸ `pandas`, `numpy`, `statsmodels`
▸ Visualization
▸ CARTO for mapping (3 maps for free) + Adobe Illustrator
▸ `matplotlib` or `seaborn` for graphs
▸ with a touch of Adobe Illustrator
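The grid-style spatial sampling listed above can be sketched as evenly spaced latitude/longitude points over a bounding box. This is a minimal illustration, not the project's code: `make_grid`, the grid size, and the bounding box (a rough approximation of the District) are all assumptions. Points that fall outside the District would still need to be filtered out or assigned to census tracts, which is not shown.

```python
def make_grid(south, west, north, east, n_rows, n_cols):
    """Return (lat, lon) points evenly spaced over a bounding box, edges included."""
    if n_rows < 2 or n_cols < 2:
        raise ValueError("need at least 2 rows and 2 columns")
    lat_step = (north - south) / (n_rows - 1)
    lon_step = (east - west) / (n_cols - 1)
    return [
        (south + r * lat_step, west + c * lon_step)
        for r in range(n_rows)
        for c in range(n_cols)
    ]


# Approximate bounding box around Washington, D.C. (illustrative values only).
points = make_grid(38.79, -77.12, 39.00, -76.91, n_rows=4, n_cols=5)
```

Each point could then be passed to whatever service is being sampled and later averaged within its census tract.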
22. UBER DATA
▸ Expected Wait Time from
Uber API for each location
every 3 minutes over 4 weeks
▸ Calculated as mean
expected wait time per
tract (MEWT)
▸ Surge proportion calculated as
the percentage of time each tract
spent with a surge price
multiplier > 1
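The two per-tract measures described on this slide can be sketched as follows. This is a standard-library illustration under assumptions: the tuple layout, the tract labels, the wait-time units (seconds), and the function name `summarize_by_tract` are hypothetical, and the sample values are made up. With equally spaced samples, the share of samples with a multiplier above 1 stands in for the share of time spent surging.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_tract(samples):
    """Compute MEWT and surge percentage per tract.

    samples: iterable of (tract_id, wait_seconds, surge_multiplier) tuples,
    one per polling interval and location.
    """
    waits = defaultdict(list)
    surges = defaultdict(list)
    for tract, wait, surge in samples:
        waits[tract].append(wait)
        surges[tract].append(surge)

    summary = {}
    for tract in waits:
        surged = sum(1 for s in surges[tract] if s > 1)
        summary[tract] = {
            "mewt": mean(waits[tract]),                       # mean expected wait time
            "pct_surge": 100 * surged / len(surges[tract]),   # % of samples with surge > 1
        }
    return summary


# Made-up samples for two hypothetical tracts "A" and "B".
samples = [
    ("A", 180, 1.0), ("A", 240, 1.5), ("A", 300, 1.0), ("A", 240, 2.0),
    ("B", 600, 1.0), ("B", 480, 1.0),
]
summary = summarize_by_tract(samples)
```

The same aggregation is a one-line `groupby` in `pandas`; the plain-Python version is used here only to keep the sketch dependency-free.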
23. AMERICAN
COMMUNITY
SURVEY 2014
▸ % People of Color (POC)
▸ % Poverty
▸ Population Density
▸ Median Household
Income
▸ Z-score normalized
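Z-score normalization puts variables with very different units (percentages, dollars, people per area) on a common scale by subtracting the mean and dividing by the standard deviation. A minimal sketch, assuming the population standard deviation; the slides do not say whether the population or sample form was used, and the income values below are made up.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (value - mean) / standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot z-score a constant column")
    return [(v - mu) / sigma for v in values]


# Made-up median household incomes for three hypothetical tracts.
median_income = [40000, 60000, 80000]
income_z = z_scores(median_income)
```

After normalization the values have mean 0 and standard deviation 1, so coefficients from different census variables can be compared directly.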
24. APPROACH - STILL BASICALLY ALL PYTHON
DATA PROCESSING
▸ Collapse data across time (4 weeks in February 2016)
▸ Average data within census tracts
▸ Select only uberX “product_types”
▸ One “ETA” and one “Surge Price Multiplier” value per tract
▸ Census / American Community Survey data:
▸ Poverty -> Calculate % in each tract
▸ Income -> Median income per tract
▸ Race/Ethnicity -> Dichotomized %
▸ Population density (population / tract land area)
▸ Normalized to z-scores
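The census-side variables above can be derived per tract roughly as in the sketch below. Every definition here is an assumption made for illustration: poverty is taken as a share of total population, the dichotomized race/ethnicity measure is taken as everyone other than non-Hispanic white residents, and density is taken as population divided by land area. The function name, parameter names, and input values are hypothetical and are not taken from the American Community Survey tables.

```python
def tract_features(population, below_poverty, white_non_hispanic, land_area_sq_km):
    """Derive per-tract census features from raw counts (assumed definitions)."""
    if population <= 0 or land_area_sq_km <= 0:
        raise ValueError("population and land area must be positive")
    return {
        # Share of residents below the poverty line.
        "pct_poverty": 100 * below_poverty / population,
        # Dichotomized race/ethnicity: everyone not counted as non-Hispanic white.
        "pct_poc": 100 * (population - white_non_hispanic) / population,
        # People per square kilometre of land.
        "pop_density": population / land_area_sq_km,
    }


# Made-up counts for one hypothetical tract.
features = tract_features(
    population=4000,
    below_poverty=600,
    white_non_hispanic=1000,
    land_area_sq_km=2.0,
)
```

Each resulting column would then be z-scored across tracts before modelling.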
25. ESTIMATED WAIT TIMES FOR UBERX
Map showing
average ETA for
an uberX.
Northwest DC
has a mostly
white racial
demographic,
whereas
southeast is
mostly people of
color.
Tract 92.03.
75% POC, Short wait times
Universities, restaurants, bars…
27. WHAT NEXT - MORE DATA
▸ Does it reflect differences in
Supply/Demand? -> Taxi FOIA
▸ Crime stats -> perception vs facts
▸ Banked / unbanked stats (~14%
in DC)
▸ Smart phone ownership
▸ Would the results differ in a
different month or city?
28. DESIGNING FOR TRANSPARENCY AND ACCESSIBILITY
WHAT NEXT - DESIGN?
▸ What if:
▸ Taxi demand is high in census tracts underserved
by Uber in DC?
▸ Difference in price? Accessibility? Marketing?
▸ People with no bank accounts or smart
phones could hail via voice? Pay with cash?
▸ Crime perception is different from real life?
▸ Could we indicate crime stats in-app?
▸ Should we?
▸ TRANSPARENCY! https://github.com/comp-journalism/2016-03-wapo-uber
▸ datalensdc.com, Houston, Georgetown, UBER,
AARP…
29. ALGORITHMIC
ACCOUNTABILITY
IN JOURNALISM
▸ Opportunity for UBER to
check our work
▸ Opportunity for
audience to check
▸ Spurs us to write better,
documented code,
check our conclusions
and assumptions
▸ Others can use code /
data for other stories
https://github.com/comp-journalism
30. ALGORITHMIC
ACCOUNTABILITY
IN JOURNALISM
▸ Code: GitHub
▸ IPython Notebook
▸ Documentation:
README.md
▸ Data: Google Drive
▸ Save wrangled data at
intervals in .csv files
▸ Programmatic solutions
where possible
Free
Open Source