UVA Data Science Institute Master of Science in Data Science researchers Lucas Beane and Elena Gillis undertook a capstone project to investigate possible reasons for the stagnation of the Charlottesville Open Data Portal.
Connecting citizens with public data to drive policy change
Developing a data pipeline to improve accessibility
and utilization of Charlottesville’s Open Data Portal
Lucas Beane, Elena Gillis, Rafael Alvarado, Caitlin Wylie
University of Virginia, lhb7tz, emg3sc, rca2t, cdw9y@virginia.edu
Abstract – To improve democratic engagement between
the people and the government, the city of Charlottesville
put forward a proposition to construct an online portal
that would contain data from the city departments that is
considered public by nature. This move was intended to
promote the ease of access to data pertinent to ongoing
policy debates in the city and incentivize the public to
contribute to the policy-making process with informed
participation. Such efforts, while successful at their start,
have gradually stagnated, and the end objective of the
portal has not been reached. In this paper we identify
possible reasons for this stagnation – inconsistent
formatting of the datasets, variables that are not meant
for human legibility, and limited data with
disproportional representation from the city departments.
We then propose a data pipeline that serves as a tool to
extract utility from the data. It does so by converting the
datasets into a consistent format, merging the datasets, and
allowing for creation of simple visualizations. The pipeline
acts as a link between the raw data published by the
government units and the city by increasing its
interpretability and legibility and outputting results that
are easily relatable to the policy issues at hand. We
demonstrate this by analyzing datasets for crime and real
estate and relating our findings to the affordable housing
debate.
Index Terms – open data, Charlottesville, informed policy-
making, data visualization, data pipeline.
INTRODUCTION
Civic data portals have become a common feature of local
governance across the globe, and in response to requests from
the community, the City of Charlottesville made a collective
decision to disclose city department data to the public.
Launching the Charlottesville Open Data Portal was a socially
and strategically important action meant to democratize the
policy-making processes in the city by enabling informed
public participation. Despite this collective effort, the portal
remains largely unutilized by the public. We primarily
attribute this trend to the fragmented nature of the datasets,
inconsistency in variable conventions, and limits of available
information. In our project, we aim to increase usability and
utilization of disclosed data by comprehensively combining
existing datasets to draw additional insights, make them easily
accessible to the public through visualizations on an
interactive dashboard, and demonstrate how such information
can be used to inform policy decisions. To emphasize these
points, we construct a data pipeline that combines data from
the portal into an incorporated dataset that takes all these goals
into account. In addition, by looking at the portal through a
spectrum of user perspectives, ranging from data experts to
non-experts, we develop a series of actionable suggestions
with the goal of improving the overall structure and use of the
portal.
We created a pipeline for dealing with similar data types
and merging individual datasets. We developed a prototypical
model for combining different types of geospatial data, which
ultimately outputs a result in the form of a census block (the
most granular measure attainable with our data) for each entry
with a geospatial location. We aim to expand this sort of
model to other data formats for easier comparison between
datasets. Additionally, we kept track of any hurdles we
encountered that might discourage a non-data scientist from
working with data on the portal. Although open-access tools
exist to interact with the portal’s datasets, they aren’t always
as reliable or powerful as users would like. In addition,
directly downloading the data (to use with third party
programs) is not always an option due to inconsistent
formatting in the data and very large datasets. By
noting these hurdles, we will forward our suggested solutions
to the city staff who maintain the portal, with the goal of
actionable policy revision.
Our pipeline ensures that the data is cleaned and in a
consistent format, allowing us to visualize each dataset
individually and overlay the results to observe any patterns in
the data. We then relate the findings to ongoing policy debates
pertaining to the city’s development. The ultimate goal of this
project is to show the city of Charlottesville how the Open
Data Portal can create a positive impact and incentivize
greater citizen participation in the policy-making process.
BACKGROUND
I. Charlottesville’s Portal
In late 2016, after months of solicitation by the city’s tech
sector, Charlottesville City Council passed a resolution to
begin and support an open data initiative for the city. The
main goals of the project were to let the public have an eye on
what was happening data-wise in the city and, in doing so, to
enable citizens to have more of a say in their local government
through collaborative efforts.
To set up the portal, seven city employees were assigned
to the newly formed Open Data Committee as architects and
developers of the portal with five local citizens acting as a
review committee called the Open Data Advisory Group.
The ODC collected datasets from various public institutions
around Charlottesville while performing pre-processing (e.g.
anonymizing data, removing violent crime data and active
investigations, and correcting obvious errors) so that having a
public eye on the data would be unproblematic. Though the
city maintains that the purpose of the portal is to get citizens
more engaged in local government, many of their efforts to
make the portal more accessible have been relatively
unsuccessful. When asked if the ODC planned to teach the
public how to use the portal, one member stated that “it’s not
for novices” and pointed technological hopefuls in the
direction of D.C.’s open data portal for training [1].
Unfortunately, once the data had been published to the portal,
the city did not see it as its responsibility to teach the
public how to use the data to its potential, which led to the
current predicament.
In August of 2017, the Charlottesville Open Data Portal
had its soft launch, which was eclipsed by the tragedy of a
violent white nationalist rally a few days prior. We partially
explain the lack of public awareness of the portal through the
domination of this event in local attention and the news cycle,
but we also acknowledge the difficulty in promoting a digital
platform to a public audience.
II. Other Cities’ Portals
Other cities in Virginia have implemented open data portals,
and, through using them, some have had tremendous
successes with implementing positive policy change. For
example, Lynchburg utilized direct citizen input to help with
their “Poverty to Progress” plan, a goal of which was to map
food deserts in order to inform non-profits which areas need
the most help [2]. In an endeavor to improve transparency,
Virginia Beach created web apps to get citizens’ direct input
on budget proposals as part of their Information Technology
department’s 10 strategic goals [2]. These two cities garnered
national attention when they were among the first-place
winners of the Center for Digital Government’s 2017 Digital
Cities Survey for their efforts [2]. Cities across the U.S. were
recognized in this competition, and they all had one thing in
common: they had specific problems they wanted to address
with their data portal. Though some agencies in
Charlottesville use the data portal in such a way, we find no
evidence that the city of Charlottesville invites the public to
work with them on these specific uses, as the cities of
Lynchburg and Virginia Beach do. We believe this to be part
of the explanation behind the portal’s under-utilization.
On a different scale but with similar problems, the
national open data portal also began slowly, launching in May 2009
“with just 47 datasets. It was not an instant hit” [3]. That being
said, the national data portal now has over 230,000 datasets,
an indicator that its utility is widely recognized. This narrative
is inspiring for Charlottesville’s portal, as it provides hope for
an achievable goal while still allowing for a period of
stagnation.
PROBLEM DESCRIPTION
Democracy and transparency in policy-making were
identified by the creators as the main objectives of the portal;
however, as of now, the portal is not being utilized to its full
potential. In this paper we aim to outline the stages of building
a portal and identify the main reasons for this under-utilization.
We then attempt to propose possible solutions that would
tackle these hurdles.
We identified three main stages of portal development:
collection, distribution, and presentation of data. The
collection of data is performed by the Open Data Committee
(ODC) that consists of representatives from the city staff. The
ODC works with the heads of city departments to obtain the
data. However, this is complicated because many department heads view
sharing their data as a personal liability as the data is
frequently sensitive and there are no apparent immediate
benefits to its disclosure. Staff in the city’s tech departments
clean and publish the collected data using Esri’s ArcGIS
platform. Since the data is collected from different sources
that utilize various collection methods and use the data for
different purposes, the published datasets are incompatibly
structured and often cannot be combined to derive additional
information.
Presentation of the data is primarily done by so-called
“super users” of the portal – people with strong technical
backgrounds who use the published data to create
visualizations and draw additional insights. This is where our
project comes in—we aim to make the published data more
comprehensible and accessible as well as create a blueprint
for working with new data as it is published on the portal.
Considering the immense scope of this problem, we focus on
two primary objectives: identifying the main reasons behind
why the portal is not being widely utilized and tackling these
issues by discovering the hurdles that may hinder users. To
increase usability and interpretability of the data, we develop
a data pipeline to demonstrate utilization of the published data
to produce actionable information to inform policy decisions.
Our work can demonstrate to the city (staff, users and
municipal departments) how disclosing public data can create
positive impact, thereby incentivizing more government units
to share data.
DATA
Currently there are 81 datasets published on the portal that are
divided into ten categories: Property and Land, Economy,
City Operations, Public Safety, Demographics, Getting
Around, Recreation, Infrastructure, Environment, and GIS Base
Layers. There are over three million observations across all
the datasets with over three hundred variables, with about
75% of this data containing geospatial information. Although
the datasets are largely geospatial, this information is
presented in different formats, e.g., coordinates, parcel
numbers, census tracts and block numbers, and physical
addresses. This generates a layer of complication when
merging the datasets, as the geographical measures do not
always overlap. In addition, the same variables in different
datasets have different interpretations, depending on the
source and purpose of the data, and in the absence of a data
dictionary they are hard to distinguish and define. The
opposite is also true – some variables containing similar and
integrable information have inconsistent names and, as a result,
are difficult to identify and overlay.
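One way to reconcile coordinate-based records with block-based records is a point-in-polygon test that assigns each coordinate pair to the block whose boundary contains it. The following is a minimal sketch of that idea in plain Python, not the implementation used in this project. The block identifiers and rectangular boundaries are hypothetical placeholders, and a production version would more likely rely on a GIS library and official census block boundary files.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    `polygon` is a list of (lon, lat) vertices in order around the boundary.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_block(lon, lat, blocks):
    """Return the id of the first block containing the point, or None."""
    for block_id, polygon in blocks.items():
        if point_in_polygon(lon, lat, polygon):
            return block_id
    return None


# Hypothetical block boundaries (simple rectangles, for illustration only).
blocks = {
    "block_A": [(-78.50, 38.02), (-78.48, 38.02), (-78.48, 38.04), (-78.50, 38.04)],
    "block_B": [(-78.48, 38.02), (-78.46, 38.02), (-78.46, 38.04), (-78.48, 38.04)],
}

print(assign_block(-78.49, 38.03, blocks))
print(assign_block(-78.47, 38.03, blocks))
```

Points that fall outside every known boundary return None, which makes records with unusable locations easy to flag rather than silently misassign.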
APPROACH
I. Data Discovery
At this stage data collection and maintenance of the portal are
no one’s official responsibility and are performed purely on a
volunteer basis, as was revealed through a meeting with a
representative of the Open Data Advisory Group. Given the
lack of any imminent positive impact from the portal’s initial
opening, the enthusiasm for open data in Charlottesville has
substantially diminished and at times the efforts to improve
the portal are put completely on hold, which hinders any
further attempts at expanding what the portal can accomplish.
Most of the portal’s geospatial data is designed for easy
integration with GIS, which was pointed out by UVA’s
Scholars Lab, but that characteristic is not apparent and
requires advanced technical skills that are not common among
the general public. Nonetheless, viewing the data as
integrable with GIS helps define some of the variables and
expand the amount of utilizable information in the dataset.
The design of the datasets and lack of interpretability of
the variables, as well as disproportional representation of data
across departments, are explained by the fact that most of the
published data was already collected and available in the city
departments and, as such, formatted to the departments’ initial
needs. No new data is recorded and formatted specifically for
publication on the portal and some departments that had
initially committed to sharing their data started to worry about
the risks of sharing it without knowing who will use the
disclosed information and to what end. In addition,
publication of the data is seen by the city as the final objective,
leaving it to the public to derive their own utility.
II. Data Flow
Figure 1 below shows a flowchart of the three stages of the
portal development. The first two stages - collection and
distribution of the data - are performed by the Open Data
Committee and the city staff and thus are outside the scope of
this project. Instead, we offer additional steps that process the
given data and output user-friendly visualizations that could
increase the usability of the portal and contribute to its initial
goal of a democratized and informed policy-making process.
PIPELINE
Our main goal was to develop an architecture for a data
pipeline, to transform data currently on the portal to
accessible information. We aim to create a model that could
be implemented in any programming language by anyone with some
technical knowledge, with the ultimate output of a tool
that could be utilized by anyone. Along the way, we also take
note of any hurdles we encounter that might discourage other
users from working toward a similar goal.
To begin, we utilize a simple web scraper using the
Python package Selenium that downloads the datasets from the
portal and stores them locally as a series of files in CSV
format.
We then clean the data using a variety of methods,
primarily using typical packages in Python. The most relevant
approach we take is to assign a confidence level from 1 to 4
to each feature from every dataframe to represent how
confident we are about the entries’ meanings. We then only
keep features with confidence levels of 3 or 4 (the levels
representing the most confidence), discarding the rest. This
approach is a direct result of the problem of lack of
documentation on the portal; a more comprehensive data
dictionary would eliminate the need for this entirely. At this
point, we also make sure to rename our features according to
their perceived meaning, both consolidating similar features
(e.g., changing geospatial data named “X” and “Y” to
“Longitude” and “Latitude”) and separating features with
deceptively similar names (e.g., “BlockNumber” to “FIPS
Block Code” and “Street Number”). We choose to further
amalgamate our geospatial data into a single format—the
census block number—to make the visualization step simpler,
acknowledging that this sacrifices precision in favor of
aggregation.
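The confidence-based filtering and renaming described above can be sketched with pandas as follows. This is an illustrative sketch rather than the project's actual code. The sample rows, the `FLD_7` column, the confidence ratings, and the `clean_features` helper are all hypothetical. Only the “X”/“Y” to “Longitude”/“Latitude” and “BlockNumber” to “FIPS Block Code” renames come from the text.

```python
import pandas as pd

# Hypothetical raw extract; values and the "FLD_7" column are placeholders.
raw = pd.DataFrame({
    "X": [-78.4767, -78.5080],
    "Y": [38.0293, 38.0336],
    "BlockNumber": ["1000", "2004"],
    "FLD_7": ["a", "b"],
})

# Manually assigned confidence per feature (1 = meaning unclear, 4 = certain).
confidence = {"X": 4, "Y": 4, "BlockNumber": 3, "FLD_7": 1}

# Human-readable names chosen after inspecting each feature.
rename_map = {
    "X": "Longitude",
    "Y": "Latitude",
    "BlockNumber": "FIPS Block Code",
}


def clean_features(df, confidence, rename_map, min_confidence=3):
    """Keep features rated at or above min_confidence, then rename them."""
    keep = [col for col in df.columns if confidence.get(col, 0) >= min_confidence]
    return df[keep].rename(columns=rename_map)


cleaned = clean_features(raw, confidence, rename_map)
print(list(cleaned.columns))
```

Keeping the ratings and the rename map in plain dictionaries means they can double as a rudimentary data dictionary for each dataset.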
With the data cleaned, we use the pandas package in
Python to merge similar dataframes on their shared features
(e.g. “Longitude” and “Latitude”), only keeping desired
features in the resulting merge. We primarily work with the
geospatial data at this step, but this could easily be applied to
any type of feature.
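As a minimal sketch of this merge step, the example below joins two small dataframes on “Longitude” and “Latitude” with `DataFrame.merge` and keeps only the desired features. The rows and the `Offense`, `ReportId`, `AssessedValue`, and `Zoning` columns are hypothetical. Note that an exact match on floating-point coordinates only succeeds when both sources store identical values, which is one reason a coarser shared key such as the census block can be easier to merge on.

```python
import pandas as pd

# Hypothetical cleaned dataframes that share the geospatial features.
crime = pd.DataFrame({
    "Longitude": [-78.4767, -78.5080, -78.4900],
    "Latitude": [38.0293, 38.0336, 38.0400],
    "Offense": ["Larceny", "Vandalism", "Simple Assault"],
    "ReportId": [1, 2, 3],
})
real_estate = pd.DataFrame({
    "Longitude": [-78.4767, -78.5080],
    "Latitude": [38.0293, 38.0336],
    "AssessedValue": [250000, 410000],
    "Zoning": ["R-1", "B-2"],
})

keys = ["Longitude", "Latitude"]
wanted = keys + ["Offense", "AssessedValue"]

# An inner merge keeps only locations present in both dataframes;
# the column selection then drops features we do not need downstream.
merged = crime.merge(real_estate, on=keys, how="inner")[wanted]
print(merged)
```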
At this point, any desired analysis could be performed,
and due to the nature of our project, we do so in the form of
visualizations. We choose this medium because our work up
to this point is with geospatial data, which would be intuitive
to analyze visually. We also want to use the visuals as an
intuitive way of working toward our ultimate goal of
increasing accessibility and use of the portal.
FIGURE 1
THREE STAGES OF PORTAL DEVELOPMENT
RESULTS
I. Hurdles to a Portal User
While working on the data pipeline, we took note of multiple
obstacles in utilizing the data. Firstly, there is an uneven
representation of data from different city departments. We
attribute this lack of proportionality to three causes: city
staff’s persisting reluctance to share their data, obscurity in
the existing data that diminishes its utility, and disparate
practices of data collection across city departments. This
results in inconsistency of data formatting, since, as of now,
there are no guidelines for generating the data submitted to
the portal. Data submission is instead performed on a
voluntary basis and any data is seen as good data as long as it
is disclosed to the public. This practice makes it difficult for
both humans and machines to read the data, especially with
the lack of comprehensive data dictionaries for the published
datasets.
There also exist multiple technical issues with the portal.
At times some datasets are unavailable for download. While
some became open after a period of time, a few still remain
inaccessible. Among the datasets that we were able to scrape,
some are too large to be easily read by home computers and
standard software. In cases when these datasets are readable,
it is still difficult to identify significant observations and
variables. We believe that the ability to do so is important
given the main objective of the portal.
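One possible workaround for datasets that do not fit comfortably in memory is to read them in chunks. The sketch below uses the `chunksize` parameter of `pandas.read_csv` to tally one column without loading the whole file. The in-memory CSV, its column names, and the `count_by_column` helper are hypothetical stand-ins for a large file downloaded from the portal.

```python
import io
from collections import Counter

import pandas as pd

# Stand-in for a large CSV file; a real call would pass a file path instead.
rows = [f"{'Larceny' if i % 3 else 'Vandalism'},{i % 24}" for i in range(1000)]
csv_text = "Offense,HourReported\n" + "\n".join(rows)


def count_by_column(source, column, chunksize=100):
    """Tally the values of one column while holding only one chunk in memory."""
    totals = Counter()
    # Reading only the needed column further reduces memory use.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        totals.update(chunk[column].value_counts().to_dict())
    return dict(totals)


counts = count_by_column(io.StringIO(csv_text), "Offense")
print(counts)
```

The same loop structure works for filtering: each chunk can be reduced to the rows of interest and the results concatenated at the end.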
II. Examples and Visualizations
Our data pipeline utilizes the raw data published on the portal
and creates a baseline for informed policy debates. It
standardizes geospatial variables for ease of dataset
integration and mergeability, then outputs visualizations that
provide insightful summaries for public users of the portal. To
demonstrate this capability, we use real estate and crime rate
data to produce relevant visualizations and show how these
results can be used in the ongoing policy debate for affordable
housing in Charlottesville.
In a visualization of total real estate value in
Charlottesville since 1997, one can immediately glean a few
insights. For example, as the recession hit the city in 2009, the
total value of properties in Charlottesville fell. Only
recently did the total regain its momentum, although it
now appears to be back on the trend seen
before 2009. In addition, some of the highest-valued
properties in Charlottesville are labeled along the right, and
one can easily see that besides the hospitals, park, and
shopping center, the top-valued properties in Charlottesville
are apartment complexes that have sprung up since 2011.
Together, these conclusions could potentially lead a user
toward a hypothesis that companies took advantage of lower
cost property values during the recession to buy, develop, and
market upper end housing, further fueling the gentrification
process in Charlottesville.
Visualizing crime data alongside the real estate data helps
point out trends in distribution of crime and the geospatial
pockets in which certain types of crime are prevalent. Although
the crime dataset is one of the most comprehensible and well-
formatted datasets in the portal, we still see inherent
misrepresentation of values. For instance, geospatially we can
observe that many data points are concentrated around the
police department in downtown Charlottesville. We believe
that such concentration can have one of two causes: the
police station is recorded as the location of the violation for
the crimes that are charged at the station, or the station address
is used as a default in the absence of another address. As a
result, in our geospatial analysis of crime data we exclude the
violations that have the police station as their location of
record to give more weight to the observations that occurred
at other locations.
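The exclusion described above amounts to a simple row filter. The following sketch assumes the crime data has an address column; the column name `Address`, the placeholder value `100 EXAMPLE ST`, and the `drop_station_records` helper are hypothetical and would need to be replaced with the actual field and the station's recorded address.

```python
import pandas as pd

# Placeholder only; substitute the address the dataset uses for the station.
STATION_ADDRESS = "100 EXAMPLE ST"

# Hypothetical records with inconsistent casing and stray whitespace.
crime = pd.DataFrame({
    "Offense": ["Larceny", "Vandalism", "Simple Assault", "Towed Vehicle"],
    "Address": ["100 Example St ", "200 MAIN ST", "350 PARK AVE", "100 EXAMPLE ST"],
})


def drop_station_records(df, address_col="Address", station=STATION_ADDRESS):
    """Remove rows whose normalized address matches the station address."""
    normalized = df[address_col].str.strip().str.upper()
    return df[normalized != station].reset_index(drop=True)


filtered = drop_station_records(crime)
print(filtered)
```

Normalizing case and whitespace before comparing guards against the inconsistent formatting noted elsewhere in the portal's data.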
The data points overlaid on the physical map of
Charlottesville illustrate density of crimes across the city as
well as show pockets of concentration for specific types of
violations. Such data, plotted over time, can be useful
in analyzing the effects of new policies that
aim to reduce crime. In the area diagram plotted across the
time of day, we can see that Towed Vehicles is the most
prevalent type of crime recorded in the afternoons, while
Simple Assault is more common before 10 a.m. In addition,
we see a sudden drop in recorded crimes during
lunch hours. We believe that either fewer offenses were
penalized during that time or they were not logged until later
in the afternoon. In a heatmap of the 20 most common crimes
logged in Charlottesville over the span of the day, we can see
similar patterns (Appendix 2). Crime pattern analysis over the
months of the year identifies the prevalence of crimes such as
Towed Vehicles and Simple Assault in the months of
September and October, which could be explained by UVA
students returning to the campus and the beginning of the
football game season. By observing monthly patterns, the city
can develop strategies to respond to the violations
accordingly.
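The hourly and monthly breakdowns behind such plots can be produced with a cross-tabulation of offense type against the hour or month of the report timestamp. The sketch below uses `pandas.crosstab` on a handful of invented records. The timestamps, the column names `Offense` and `Reported`, and the resulting counts are hypothetical and do not reproduce our findings.

```python
import pandas as pd

# Invented records for illustration only.
crime = pd.DataFrame({
    "Offense": [
        "Towed Vehicle", "Towed Vehicle", "Simple Assault",
        "Simple Assault", "Towed Vehicle",
    ],
    "Reported": pd.to_datetime([
        "2018-09-14 15:10", "2018-09-21 16:45", "2018-10-06 08:30",
        "2018-03-02 09:05", "2018-10-13 14:20",
    ]),
})

# Rows are hours of the day (or months); columns are offense types.
by_hour = pd.crosstab(crime["Reported"].dt.hour, crime["Offense"])
by_month = pd.crosstab(crime["Reported"].dt.month, crime["Offense"])

print(by_hour)
print(by_month)
```

Either table can be passed directly to a plotting library to draw an area chart or heatmap.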
CONCLUSION
We successfully built a data pipeline that takes the data from
the portal as input and outputs actionable information in an
accessible way. We created a few sample visualizations to
showcase what this sort of process can achieve. We’ve also
outlined our process such that anyone who wants to either
replicate or make any adjustments to our process can do so in
any desired language.
We encountered myriad problems that could potentially
explain the dearth of users of the portal. Chief among them is
the lack of interpretability of the data; though the datasets
have descriptions, they lack extensive data dictionaries, so
many of the features therein are unusable. This is likely due
to the data collection policy, which has few guidelines to
allow for consistency across datasets. Additionally, we found
a disproportionate number of representative datasets for some
departments, perhaps because they were willing and able to
offer up more data. If a user isn’t interested in the over-
represented departments, this trend could disenfranchise them
simply because the available data do not form a reliable
picture of Charlottesville. There also still exist simple
problems of accessibility; at times, some datasets on the portal
are not available. On top of that, several datasets are simply
too large for the average home computer to utilize.
Addressing these problems would be an important first step
toward achieving a wider user base for Charlottesville’s open
data portal.
REFERENCES
[1] Wylie, C., Neeley, K., Ferguson, S. 2019. “Beyond Technological
Literacy: Open Data as Active Democratic Engagement?”, accepted for
publication in Digital Culture and Society, January 2019.
[2] Erepublicnews.com. 2019. Digital Cities Winners Leverage
Technology to Enhance Inclusion, Solve Social Challenges.
http://erepublicnews.com/digital-cities-winners-leverage-technology-to-enhance-inclusion-solve-social-challenges.
[3] Forbes.com. 2019. States Offer Information Resources: 50+ Open Data
Portals.
https://www.forbes.com/sites/metabrown/2018/04/30/us-states-offer-information-resources-50-open-data-portals/#18f5850c5225.
AUTHOR INFORMATION
Lucas Beane, MSDS Student, Data Science Institute,
University of Virginia.
Elena Gillis, MSDS Student, Data Science Institute,
University of Virginia.
Rafael Alvarado, Assistant Director, Data Science Institute,
University of Virginia.
Caitlin Wylie, Assistant Professor, School of Engineering
and Applied Science, University of Virginia.
APPENDICES
Appendix 1
Appendix 2