This document discusses the problems caused by bad data in government. It begins by explaining that bad data degrades the quality of decisions, undermines data-driven policymaking, and raises civil liberty and liability risks, with examples of mismanaged programs and security vulnerabilities. It also notes that bad data increases costs and inefficiencies and undermines trust in government. The rest of the document explores the factors that determine bad data and how it can arise at each stage of the data value chain: collection, processing, sharing, analysis, and use.
4. BAD DATA: WHY DO WE CARE?
IMPACTS THE QUALITY OF DECISIONS AND UNDERMINES THE
RELIABILITY OF DATA-DRIVEN POLICY MAKING
A Kansas City audit showed that employees were labeling cases brought in through
the 311 system as closed, even though the problems had never been fixed.
Denver’s Road Home planned to end homelessness, but the city didn’t keep any data.
5. BAD DATA: WHY DO WE CARE?
CIVIL LIBERTY CONCERNS AND LIABILITY RISKS
Multnomah County, Oregon’s audit of the Mental Health and Addiction Services
Division found errors, inconsistencies and haphazard coding.
A Dallas audit of the Security of Weapons Inventories and Storage found that police
department employees could “add, delete and modify sensitive data.”
6. BAD DATA: WHY DO WE CARE?
INCREASES COSTS, WASTE AND INEFFICIENCIES
60%: the estimated fraction of time that data scientists spend cleaning and organizing data, according to CrowdFlower.
7. BAD DATA: WHY DO WE CARE?
UNDERMINES TRUST IN GOVERNMENT, WHICH IS ALREADY AT AN ALL-TIME LOW
8. BAD DATA?
Usefulness? Jack Olson: “data has quality if it satisfies the requirements of its intended use”
Data attributes? Wang and Strong propose fifteen data dimensions that determine bad/quality data, assembled into four categories:
ACCURACY
RELEVANCY
REPRESENTATION
ACCESSIBILITY
9. BAD DATA AND THE DATA VALUE CHAIN
COLLECTION
• Poor/dirty data entry
• Duplication
• Inconsistencies
• Non-representation/bias
PROCESSING
• Insufficient security provisions
• Aggregation and correlation challenges
SHARING
• Lack of interoperable institutional norms and practices
• Improper or unauthorized access
• Conflicting legal jurisdiction
• Different levels of security
ANALYZING
• Inaccurate data modeling
• Biased algorithms
• Poor problem definition/design
USING
• Faulty reporting
• Lack of understanding
• Misinterpretation
10. BAD DATA: COLLECTION STAGE
• POOR/DIRTY DATA ENTRY
• DUPLICATION
• INCONSISTENCIES
• NON-REPRESENTATION/BIAS
Gartner predicts that 25% of Fortune 1000 companies will have information that is inaccurate, incomplete or duplicated.
[Graphic: IN = OUT]
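The collection-stage problems named on this slide (dirty entry, duplication, inconsistencies, non-representation) can each be screened for with very simple checks. The sketch below is only an illustration under stated assumptions: the 311-style records, the field names (`address`, `status`, `district`) and the allowed status codes are all hypothetical and do not come from any of the audits cited in this deck.

```python
from collections import Counter

# Hypothetical 311-style service requests; every field and value is made up.
RECORDS = [
    {"id": "1", "address": "12 Main St", "status": "closed", "district": "north"},
    {"id": "2", "address": "12 main st ", "status": "Closed", "district": "north"},
    {"id": "3", "address": "", "status": "open", "district": "north"},
    {"id": "4", "address": "9 Oak Ave", "status": "done?", "district": "south"},
]

# Assumed canonical code list for the status field.
ALLOWED_STATUSES = {"open", "closed"}


def normalize(value):
    """Collapse whitespace and lowercase so trivially different entries compare equal."""
    return " ".join(value.split()).lower()


def audit(records):
    """Return simple findings for four collection-stage problems."""
    # Poor/dirty data entry: a required field left blank.
    missing = [r["id"] for r in records if not normalize(r["address"])]

    # Duplication: the same normalized address entered more than once.
    counts = Counter(normalize(r["address"]) for r in records if normalize(r["address"]))
    duplicates = sorted(addr for addr, n in counts.items() if n > 1)

    # Inconsistencies: status values that are not in the canonical code list.
    inconsistent = [r["id"] for r in records if r["status"] not in ALLOWED_STATUSES]

    # Non-representation/bias: share of records coming from each district.
    by_district = Counter(r["district"] for r in records)
    shares = {d: n / len(records) for d, n in by_district.items()}

    return {
        "missing_address": missing,
        "duplicate_addresses": duplicates,
        "inconsistent_status": inconsistent,
        "district_shares": shares,
    }


report = audit(RECORDS)
print(report)
```

Checks like these only flag candidates for review. Whether two similar addresses really are one case, or whether one district is under-reported, still needs a human judgment and a comparison against an outside baseline.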
11. BAD DATA: PROCESSING STAGE
• INSUFFICIENT SECURITY PROVISIONS
• AGGREGATION AND CORRELATION CHALLENGES
CONSIDER
9,040,595,509 data records have been lost or stolen since 2013.
The number of US data breaches tracked in 2016 increased over 40% from the previous year.
12. BAD DATA: ANALYSIS STAGE
• INACCURATE DATA MODELING
• BIASED ALGORITHMS
• POOR PROBLEM DEFINITION/DESIGN
FOR INSTANCE
Algorithms in Florida’s criminal courts produced biased risk predictions; African American defendants were 77% more likely to be considered “higher risk” of committing crimes than their Caucasian counterparts.
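A statement of the form "X% more likely" compares two rates. As a rough illustration of how an analyst might screen a risk model's output for that kind of gap, the sketch below computes the share of each group labeled high risk and the relative increase between them. The records, group labels and field names are hypothetical, and this simple rate comparison is not claimed to be how the 77% figure above was derived; the slide does not say.

```python
def high_risk_rate(records, group):
    """Share of a group's records that the model labeled high risk."""
    flags = [r["high_risk"] for r in records if r["group"] == group]
    if not flags:
        raise ValueError(f"no records for group {group!r}")
    return sum(flags) / len(flags)


def relative_increase(rate_a, rate_b):
    """How much more likely group A is to be flagged than group B (0.5 means 50% more)."""
    if rate_b == 0:
        raise ValueError("comparison rate is zero")
    return rate_a / rate_b - 1


# Made-up model outputs, not real court data.
SCORES = [
    {"group": "A", "high_risk": True},
    {"group": "A", "high_risk": True},
    {"group": "A", "high_risk": True},
    {"group": "A", "high_risk": False},
    {"group": "B", "high_risk": True},
    {"group": "B", "high_risk": True},
    {"group": "B", "high_risk": False},
    {"group": "B", "high_risk": False},
]

gap = relative_increase(high_risk_rate(SCORES, "A"), high_risk_rate(SCORES, "B"))
print(f"Group A is {gap:.0%} more likely to be labeled high risk than group B")
```

A raw gap like this does not by itself show that an algorithm is biased, since the groups may differ in other recorded factors, but it is a cheap first signal that the model's outputs deserve a closer audit.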
14. BAD DATA DETERMINING FACTORS
• TECHNOLOGICAL CHALLENGES AND MISCONFIGURATIONS
• INDIVIDUAL OR INSTITUTIONAL NORMS AND STANDARDS OF QUALITY
• LEGAL CONFUSION OR GAPS
• MISALIGNED INCENTIVES OR INTERESTS