Exposing algorithms pydatadc2016

EXPOSING ALGORITHMS
COMPUTATIONAL
JOURNALISM LAB,
UNIVERSITY OF MARYLAND

COMPUTATIONAL
JOURNALISM
▸ Develop tools for Newsrooms
▸ Data gathering
▸ Story tracking
▸ Personalized news
▸ Comment moderation
▸ Using computational methods to
investigate a story
▸ Algorithmic accountability and
transparency
Applying computer
science to journalism

GOOGLE AUTOCOMPLETE FAQ
▸ “…we exclude a narrow class of search queries related to
pornography, violence, hate speech, and copyright
infringement.”
▸ Criteria: Boundaries of censorship; Differences among
search engines; Mistakes?

INPUT - OUTPUT STUDY
Input -> Output

Warning!
This presentation contains explicit language.

N. Diakopoulos. Sex, Violence, and Autocomplete Algorithms. Slate. 2013.

SEARCH ENGINES ARE COMPLICATED!
▸ Are we using search terms
that people in real life use?
▸ Personalization (IP, profile,
history)
▸ Randomization, A/B tests
▸ …not to mention Google
doesn't want people
scraping their results (ack!)

▸ Discriminatory/unfair
▸ Mistake that denies a service
▸ Censorship
▸ Breaks law or social norm
▸ False prediction
▸ Violation of privacy

PREVIOUS
WORK
▸ Surge pricing triggered by
car requests outnumbering
available cars (demand >
supply)
▸ Goal of surge pricing:
▸ Encourage more drivers
on the road
▸ Redistribute current
drivers to areas of high
demand

CURRENT
▸ Hypothesis: service quality
may not be the same
across D.C.
▸ Expected Wait Time as a proxy
for service quality: combines car
availability, current and
historical surge pricing, and
other hidden factors.
▸ If true, can this be
predicted by census data?

APPROACHES, TOOLS
▸ Data sources
▸ Uber API, `uber.py`, census.gov resources (tons, free)
▸ Spatial sampling across the District
▸ Python GIS-related libraries (`geopy`, `address`, `cenpy`)
▸ The http://data.fcc.gov/ API returns an address when given a latitude and longitude
▸ Sample grid-style, averaged to census tracts
▸ Data wrangling and statistics
▸ `pandas`, `numpy`, `statsmodels`
▸ Visualization
▸ CARTO for mapping (3 maps for free) + Adobe Illustrator
▸ `matplotlib` or `seaborn` for graphs
▸ with a touch of Adobe Illustrator

APPROACH - BASICALLY ALL PYTHON
COLLECTION
▸ Determine our sampling locations:
▸ Spatial sampling DC -> grid (how dense?)
▸ Temporal sampling -> 3 min (why?)
▸ Uber API rate limits
▸ Number of API keys we had access to
▸ Address validation
▸ https://github.com/comp-journalism/2016-03-wapo-uber/blob/master/Mapping_points_across_DC.ipynb
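The grid-density and sampling-interval questions above trade off against the API rate limit. Here is a minimal sketch of that trade-off, not the notebook's actual code. The bounding box is a rough approximation of the District, and the 15 x 15 grid and the helper names `make_grid` and `requests_per_hour` are assumptions made for illustration.

```python
def make_grid(south, west, north, east, n_rows, n_cols):
    """Return (lat, lon) points evenly spaced over a bounding box."""
    lat_step = (north - south) / (n_rows - 1)
    lon_step = (east - west) / (n_cols - 1)
    return [
        (round(south + i * lat_step, 6), round(west + j * lon_step, 6))
        for i in range(n_rows)
        for j in range(n_cols)
    ]


def requests_per_hour(n_points, interval_minutes):
    """Requests needed per hour if every point is polled once per interval."""
    return n_points * 60 / interval_minutes


# Assumed, approximate bounding box around the District (not from the talk).
DC_BOX = dict(south=38.79, west=-77.12, north=39.00, east=-76.91)

grid = make_grid(n_rows=15, n_cols=15, **DC_BOX)
hourly = requests_per_hour(len(grid), interval_minutes=3)
print(len(grid), "points need", hourly, "requests per hour")
```

A denser grid or a shorter interval multiplies the request count, so the rate limit and the number of API keys bound how fine the sampling can be. Because the box is rectangular and the District is not, points outside the border would still need to be filtered out, for example by the address validation step.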

LOCATIONS PASSED TO UBER API

UBER DATA
▸ Expected Wait Time from
Uber API for each location
every 3 minutes over 4 weeks
▸ Calculated as mean
expected wait time per
tract (MEWT)
▸ Surge proportion calculated as
percentage of time each tract
spent with a surge price
multiplier > 1
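The two per-tract measures above can be sketched with `pandas` as follows. The column names (`tract`, `eta_seconds`, `surge_multiplier`) and the sample values are hypothetical and only illustrate the calculation. They are not the project's actual schema or data.

```python
import pandas as pd

# Hypothetical samples: one row per location poll, already labelled by tract.
samples = pd.DataFrame({
    "tract": ["A", "A", "A", "B", "B", "B"],
    "eta_seconds": [180, 240, 300, 420, 480, 600],
    "surge_multiplier": [1.0, 1.5, 1.0, 1.0, 1.0, 2.0],
})

per_tract = samples.groupby("tract").agg(
    # Mean expected wait time per tract (MEWT).
    mewt_seconds=("eta_seconds", "mean"),
    # Percentage of polls in which the surge multiplier exceeded 1.
    surge_pct=("surge_multiplier", lambda s: 100 * (s > 1).mean()),
)
print(per_tract)
```

Taking the mean of the boolean `s > 1` gives the share of polls with surge pricing. Because the polls are evenly spaced in time, that share stands in for the share of time spent surging.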

AMERICAN
COMMUNITY
SURVEY 2014
▸ % People of Color (POC)
▸ % Poverty
▸ Population Density
▸ Median Household
Income
▸ Z-score normalized

APPROACH - STILL BASICALLY ALL PYTHON
DATA PROCESSING
▸ Collapse data across time (4 weeks in February 2016)
▸ Average data within census tracts
▸ Select only uberX “product_types”
▸ One “ETA” and one “Surge Price Multiplier” value per tract
▸ Census / American Community Survey data:
▸ Poverty -> Calculate % in each tract
▸ Income -> Median income per tract
▸ Race/Ethnicity -> Dichotomized %
▸ Population density (population / tract land area)
▸ Normalized to z-scores
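As a rough illustration of the last two steps, the sketch below derives a density column and z-scores it along with a poverty percentage. The column names and numbers are made up. Density is taken here as population divided by land area, and the population standard deviation (`ddof=0`) is an assumption, since the slides do not say which variant was used.

```python
import pandas as pd

# Hypothetical tract-level table; values are illustrative only.
tracts = pd.DataFrame(
    {
        "population": [3000, 4500, 1200, 5200],
        "land_area_sq_km": [1.5, 0.9, 2.4, 1.3],
        "pct_poverty": [12.0, 30.0, 5.0, 22.0],
    },
    index=["t1", "t2", "t3", "t4"],
)

# Density = people per unit of land area.
tracts["pop_density"] = tracts["population"] / tracts["land_area_sq_km"]


def zscore(col):
    """Center a column on 0 and scale it to unit standard deviation."""
    return (col - col.mean()) / col.std(ddof=0)


z = tracts[["pop_density", "pct_poverty"]].apply(zscore)
print(z)
```

Putting every explanatory variable on the same scale makes the regression coefficients comparable across variables measured in very different units.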

ESTIMATED WAIT TIMES FOR UBERX
Map showing
average ETA for
an uberX.
Northwest DC
has a mostly
white racial
demographic,
whereas
southeast is
mostly people of
color.
Tract 92.03.
75% POC, Short wait times
Universities, restaurants, bars…

APPROACH - PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON PYTHON
REGRESSION (GLM, STATSMODELS)
Explanatory Variables:
% POC***
Population Density***
Median Income
% Poverty
% POC : % Poverty**
% POC : Income
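A model with these main effects and interactions could be expressed through the `statsmodels` formula interface roughly as below. This is a sketch on synthetic data, not the analysis from the talk. The response variable (`mewt`), the column names, the Gaussian family (the `glm` default) and the simulated coefficients are all assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200

# Synthetic, already z-scored explanatory variables (illustrative only).
df = pd.DataFrame({
    "poc": rng.normal(size=n),
    "density": rng.normal(size=n),
    "income": rng.normal(size=n),
    "poverty": rng.normal(size=n),
})

# Synthetic response: wait time driven by poc and density plus noise.
df["mewt"] = (
    300 + 40 * df["poc"] - 30 * df["density"] + rng.normal(scale=20, size=n)
)

# "a:b" adds only the interaction term; main effects are listed explicitly.
model = smf.glm(
    "mewt ~ poc + density + income + poverty + poc:poverty + poc:income",
    data=df,
).fit()
print(model.summary())
```

The summary table lists a coefficient and p-value for each term, which is where significance markers like the asterisks above would come from.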

WHAT NEXT - MORE DATA
▸ Does it reflect differences in
Supply/Demand? -> Taxi FOIA
▸ Crime stats -> perception vs facts
▸ Banked / unbanked stats (~14%
in DC)
▸ Smartphone ownership
▸ Would the results differ in a
different month or city?

DESIGNING FOR TRANSPARENCY AND ACCESSIBILITY
WHAT NEXT - DESIGN?
▸ What if:
▸ Taxi demand is high in census tracts underserved
by Uber in DC?
▸ Difference in price? Accessibility? Marketing?
▸ People without bank accounts or smartphones
could hail via voice? Pay with cash?
▸ Crime perception is different from real life?
▸ Could we indicate crime stats in-app?
▸ Should we?
▸ TRANSPARENCY! https://github.com/comp-journalism/2016-03-wapo-uber
▸ datalensdc.com, Houston, Georgetown, UBER,
AARP…

ALGORITHMIC
ACCOUNTABILITY
IN JOURNALISM
▸ Opportunity for UBER to
check our work
▸ Opportunity for
audience to check
▸ Spurs us to write better,
documented code,
check our conclusions
and assumptions
▸ Others can use code /
data for other stories
https://github.com/comp-journalism

ALGORITHMIC
ACCOUNTABILITY
IN JOURNALISM
Free
Open Source
▸ Code: GitHub
▸ IPython Notebook
▸ Documentation:
README.md
▸ Data: Google Drive
▸ Save wrangled data at
intervals in .csv files
▸ Programmatic solutions
where possible

QUESTIONS?
COLLABORATIONS?
Jennifer A. Stark
@_JAStark
starkja@umd.edu
