This document discusses data validation in the digital age and offers tips for evaluating data quality. It opens by noting that every audience member knows more about some aspect of ensuring data quality than the presenter does. It then argues that data sets are living things with a pedigree and a genealogy. It outlines key steps for evaluating data quality, including understanding the data collection and validation processes, checking that definitions and value ranges are consistent, and assessing completeness. It also describes common verification methods such as counting records and checking the proportion of completed fields. Data quality is presented as depending in part on the objectives and reputation of the data creator.
1. Árbol de vida
de los datos
(Data validation in the Digital Age)
Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com
@jtjohnson
2. Data validation in the
Digital Age
Presentation by Tom Johnson at
Cátedra Walter Lippmann de Periodismo y Opinión Pública
Claustro de la Universidad
Universidad del Rosario, Bogota, Colombia
Date/Time: 22 November 2012
This PowerPoint deck and Tipsheets posted at:
http://sdrv.ms/wNtiM7
3. Impt. Point 1-You know more than I do
Important point
Each of you knows more about some aspect of ensuring data quality than I do.
5. DataSet--CollectionProcess
[Figure: a field of binary digits labeled “The Collection Process”, “Data Set” and “STORY!”]
6. DataSet-ValidationProcess
[Figure: a field of binary digits labeled “The Collection Process”, “Data Set”, “STORY!” and “Validation Process”]
7. Paying the price of bad data
Illinois and Missouri sex-offender DB
• St. Louis Post-Dispatch, 2 May 1999, p. A11:
“ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST”
By Reese Dunklin; Data Analysis By David Heath and Julie Luca
• The Dallas Morning News, Sun, 3 Oct 2004, p. 1A:
“Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say”
By Diane Jennings and Darlean Spangenberger
8. How bad data can do you wrong
2011 - New Mexico Sec. of State’s “questionable
voters” data set – “The Big Bundle”
• ~1.1m voters
• Previous Sec. of State didn’t clean rolls
• Matched name, address, DoB and SS#
• SSA data base; NM driver’s licenses
• 2 variables “mismatch” = Questionable?
• Asked State Police (not AG’s office) to investigate
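The "two mismatched variables" rule can be sketched in code to show how easily it over-flags. This is a minimal sketch only: the field names (`name`, `address`, `dob`, `ssn`), the threshold and both records are invented for illustration, and the deck does not describe the Secretary of State's actual matching procedure.

```python
# Hypothetical illustration: flag a voter as "questionable" when two or more
# fields differ between the voter roll and a comparison database.
FIELDS = ("name", "address", "dob", "ssn")

def mismatched_fields(voter, reference):
    """Return the fields whose values differ under exact string comparison."""
    return [f for f in FIELDS if voter[f] != reference[f]]

def is_questionable(voter, reference, threshold=2):
    """Apply an 'N mismatched variables' rule (the threshold is an assumption)."""
    return len(mismatched_fields(voter, reference)) >= threshold

# Invented records: the same person, with a spelling variant and an abbreviation.
voter = {"name": "Maria Gonzales", "address": "12 Canyon Road",
         "dob": "1970-03-05", "ssn": "000-00-0000"}
reference = {"name": "Maria Gonzalez", "address": "12 Canyon Rd",
             "dob": "1970-03-05", "ssn": "000-00-0000"}

print(mismatched_fields(voter, reference))
print(is_questionable(voter, reference))
```

Under exact comparison, a one-letter spelling variant plus an abbreviated street type is enough to trip the rule, which is why the definition of "error" matters before anyone is referred for investigation.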
9. Problems with Sec. of State methodology
• What is the error rate of original DB?
• Definition of “error”? (Gonzales or Gonzalez)
• Sample(s) by county and state total?
• Error rates of comparative DBs?
• Aggregation of error problem
• 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State
https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
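The "aggregation of error" point can be made concrete with a little arithmetic. The sketch below assumes that errors in each database are independent and uses invented error rates (the deck gives none); under those assumptions, the chance that a record is clean in every source is the product of the per-source clean rates.

```python
import math

def combined_clean_rate(error_rates):
    """Probability that a record is error-free in every database,
    assuming errors are independent across databases."""
    return math.prod(1.0 - e for e in error_rates)

# Invented error rates for three matched databases (assumptions, not real figures).
rates = [0.02, 0.03, 0.05]
clean = combined_clean_rate(rates)
print(f"Clean in all three: {clean:.4f}")
print(f"At least one error: {1 - clean:.4f}")

# Scaled to the roughly 1.1 million voters mentioned earlier in the deck.
voters = 1_100_000
print(f"Expected records with some error: {round(voters * (1 - clean)):,}")
```

Even modest per-database error rates compound, so the error rate of each comparison database needs to be known before mismatches are read as evidence about voters.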
10. DataSetCollectionProcess
[Figure: a field of binary digits labeled “The Collection Process”, “Data Set” and “STORY!”]
11. Data sets are living things; they have pedigree and genealogy
Important point
•Most [all?] data sets are living
things.
•And they have a pedigree, a
genealogy, an “árbol de vida”.
•Data sets live in a dynamic
environment.
•Understand the DB ecology
12. Data sets are living things; they have pedigree and genealogy
Important point
• NEVER work with your
original data set; always a
copy of the file(s)
• More combined data sets =
greater chance of error
• Larger data sets = greater
chance of error
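The "never work with your original" rule can be turned into a habit with a few lines of code. The sketch below is one possible approach, not the presenter's: it copies a file into a working folder and compares checksums to confirm that the copy is byte-identical. The function names, file names and folder layout are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_working_copy(original, workdir):
    """Copy the original into workdir and confirm the copy is byte-identical."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    copy = workdir / Path(original).name
    shutil.copy2(original, copy)  # copy2 also tries to preserve timestamps
    if sha256_of(original) != sha256_of(copy):
        raise IOError("working copy does not match the original")
    return copy

# Demonstration with a throwaway file (invented contents).
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.csv"
    original.write_text("id,age\n1,34\n")
    copy = make_working_copy(original, Path(tmp) / "work")
    copy_matches = sha256_of(copy) == sha256_of(original)

print(copy_matches)
```

Recording the original's checksum in the logbook also makes it possible to show later that the source file was never altered.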
13. Types of Data
[Figure: a field of binary digits labeled “Data Set”]
14. •DataQuality=FunctionOf…
Data Quality = function of…
• Objectives, reputation of data-base
creator
• Validity and precision of the
collection/creation process – and
resulting data
• Statistical Data?
• Primary Data (collected, managed by
agency or individual)
• Secondary (Agency or individual is using
someone else’s “primary” data)
15. Pyramid of significance
• How to judge whether some data – and its
potential stories -- are more trustworthy
than others?
• Go back to librarians’ hierarchy of trusted
sources when searching?
(Has anyone tested the “quality” of data sets from
those strata of sources? If not, a good research
project.)
16. Learn from Librarians
• Evaluating Web Pages: Techniques to
Apply & Questions to Ask
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html
• What can the URL tell you?
• Gov’t agency? Scholarly? Interest
Group? Individual?
• Has a reputation for accuracy been
created over time?
17. Learn from Librarians
• Does it all add up?
• Why was the page put on the web?
• Inform, give facts, give data?
• Explain, persuade?
• Sell, entice?
• Share?
• Disclose?
• Is the information current?
When was it last updated and by whom?
• If the data is available on other sites,
who/what was the original creator and
editor of the data?
18. Hierarchy of Trust
• “.org” is organization. Sites that end in .org are usually non-profit organizations.
• Can be very good sources or very poor sources; take care to research their possible agendas or political biases.
• “.net” means network.
• “.info” is the Internet’s first unrestricted top-level domain since .COM. There are no restrictions on who may register .INFO names. .INFO was created for general use around the world.
• For .gov, .edu, or .mil, probably the information has been vetted before it was posted.
• Websites with .gov, .edu and .mil have to be applied for, and their use is controlled.
• It doesn’t mean they are fool-proof, though.
Source: http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299
19. Hierarchy of Trust
• Credible websites should list
contact information and
resources.
• If only cell phones and
PO boxes = suspicion
• If the author is named, find
his/her web page to…
• Verify educational credentials
• Discover if the writer is published in a scholarly journal
• Verify that the writer is
employed by a research
institution or university
20. Hierarchy of Trust
• Internet pages that have
been published more
recently are usually more
credible.
• Find this information at the bottom of a website; in the “about us” page; or via “view page source”
21. Hierarchy of Trust
• Selling something?
• Asking you to sign up
for something?
• May not be
presenting you with
neutral, unbiased
information.
22. Hierarchy of Trust
[Diagram labels, in the order extracted:]
• Probably reliable sites, but not necessarily reliable data
• What is the site's purpose?
• Check the publishing date.
• Credible Websites/Authors
• Check the domain of the URL
23. CollectionProcessDataSet
[Figure: a field of binary digits labeled “The Collection Process”, “Data Set” and “STORY!”]
24. Process of Data Evaluation
1. Pre-planning
• 2nd Monitor
• “Logbook” (bitácora) apps
• Checklist of intended steps
2. Lit. review/interview peers
• Nothing is new; everything has a precedent
• How have others attacked this problem?
3. Do data fit theoretical models?
- Depends on subject: traffic flow vs. Crime or educational level vs. Income
- Sometimes good to use non-trad. models: Crime and disease
25. Process of Data Evaluation
4. Do a “critical biography” of the data
- Why was data collected? Who ordered its creation (law? Agency? Individual?)
- When first collected?
- News stories about the data?
5. Does biography raise critical warnings?
- Have laws related to data remained the same?
- Have definitions remained the same?
6. Have others run analysis of this data?
- Not only journalists, but other agencies/people
26. Process of Data Evaluation
7. Acquire latest data and related documentation
- Get data schema & code sheet
- Get instructions to data collectors and data entry clerks
27. Process of DB evaluation
[Diagram: ask for a copy of the data-entry form and explanation sheet, the computer data-entry sheet, the codes sheet and the data base schema]
28. Process of Data Evaluation
7. Acquire latest data and related documentation
- Get data schema & code sheet
- Get instructions to data collectors and data entry clerks
8. Compare record layout to tables
This may tell you:
- What data you did not receive
- Possibly, what data is feeding into other variables or calculations
9. Do documents specify expected ranges & frequencies?
- Suggests variables to be found. If expected range is 1-7 and you find 8…
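Step 8 can be sketched as a simple comparison between the documented record layout and the header of the file actually delivered. The field names and sample data below are invented; a real layout would come from the agency's schema or code sheet.

```python
import csv
import io

# Hypothetical record layout taken from the agency's documentation.
documented_fields = ["case_id", "date", "offense_code", "severity", "officer_id"]

# Invented sample of the file actually delivered.
delivered = io.StringIO(
    "case_id,date,offense_code,severity,risk_score\n"
    "1,2012-01-04,A12,3,0.42\n"
)
received_fields = next(csv.reader(delivered))  # first row is the header

missing = [f for f in documented_fields if f not in received_fields]
undocumented = [f for f in received_fields if f not in documented_fields]

print("Documented but not received:", missing)
print("Received but not documented:", undocumented)
```

A field that is documented but absent is data you did not receive; a field that arrives without documentation may be a calculated value fed by other variables, and is worth a question to the data custodian.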
29. Process of Data Evaluation
10. Are data values missing or out of range?
- Use Excel (or R) formula to test “expected” ranges
- =MIN(A1:A100) or =MAX(A1:A100)
- Use Excel's conditional formatting feature
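The same MIN/MAX test can be run outside a spreadsheet. The sketch below uses Python rather than the Excel or R the slide mentions, with an invented column of coded values and an assumed documented range of 1 to 7; it reports the minimum and maximum, then lists which rows are blank and which fall outside the range.

```python
# Invented column of coded responses; None and "" stand for blank cells.
values = [3, 7, 1, None, 8, 5, "", 0, 4]
LOW, HIGH = 1, 7  # assumed documented range

# Keep the row index with each value so problems can be traced back to a record.
missing_rows = [i for i, v in enumerate(values) if v in (None, "")]
present = [(i, v) for i, v in enumerate(values) if v not in (None, "")]
out_of_range = [(i, v) for i, v in present if not LOW <= v <= HIGH]

smallest = min(v for _, v in present)
largest = max(v for _, v in present)

print("min:", smallest, "max:", largest)
print("missing at rows:", missing_rows)
print("out of range (row, value):", out_of_range)
```

MIN and MAX alone only say that something is out of range; listing the offending rows shows how many records are affected and where to look.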
30. Process of DB evaluation
Major questions - Revise your list of major checkpoints
10. Review major checklist
• Are there changes in definitions
• Changed by law?
• By the administrators?
• Formal or informal by data entry process?
• Are there changes in the collection methods, data
entry, editing of data, quality checking, and the type
and form of files?
• Were there changes in the users and the use of the
data?
• Now it is time to clean the data
31. Is perfection necessary?
• How “clean” must the data be?
• Depends on the goals – and scale -- of the
analysis
• How important is the actual age of an
individual? Or…
• How precise should the lat/longitude data be?
• Precision: Are the numbers rounded or?
• Hope for fine-grained, not summaries or
aggregates
• Can be especially important with temporal and
geographic data, i.e. What is the range(s) of
the time scales?
32. Data Quality checkpoints
• Constancy of definitions and coding
categories?
• Completeness:
• How many records have unfilled cells?
• Are the tendencies of “nulls” consistent in all
records, variable types?
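The completeness questions can be answered with a short null profile. This is a minimal sketch with invented data, treating empty strings as unfilled cells; it counts empty cells per field and the number of records with at least one gap.

```python
import csv
import io
from collections import Counter

# Invented sample; empty strings stand for unfilled cells.
sample = io.StringIO(
    "id,name,age,county\n"
    "1,Ana,34,Bernalillo\n"
    "2,,41,\n"
    "3,Luis,,Santa Fe\n"
    "4,Eva,29,\n"
)
rows = list(csv.DictReader(sample))

# Count unfilled cells per field.
nulls_per_field = Counter()
for row in rows:
    for field, value in row.items():
        if value is None or value.strip() == "":
            nulls_per_field[field] += 1

# Count records that have at least one unfilled cell.
records_with_nulls = sum(
    1 for row in rows if any((v or "").strip() == "" for v in row.values())
)

for field in rows[0]:
    share = nulls_per_field[field] / len(rows)
    print(f"{field}: {nulls_per_field[field]} empty ({share:.0%})")
print("records with at least one unfilled cell:", records_with_nulls)
```

If the empty cells cluster in one field, or in records from one source or period, that pattern is itself a question to take back to the data's creator.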
33. COMMON VERIFICATION METHODS
• Counting
Do you have the number of records indicated/promised?
• If >1,000 records, sample to test
• To confirm your methodology
• Proportion of completed fields
• If a record has X fields, what % of records are
complete?
• Are there trends of null (empty) fields?
• Draw on many Excel functions:
• COUNTIFs or SUMIF
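The counting, sampling and completeness checks above can be sketched in Python as an alternative to COUNTIFS or SUMIF. Everything here is invented for illustration: the records, the promised count of 2,500 and the sample size of 100 are assumptions, not figures from the deck.

```python
import random

# Invented records; None marks an empty field.
records = [
    {"id": i, "name": f"Person {i}", "age": None if i % 10 == 0 else 30 + i % 40}
    for i in range(1, 2501)
]
promised = 2500  # record count stated by the data provider (assumed)

# 1. Counting: do you have the number of records promised?
count_ok = len(records) == promised

# 2. Sampling: with more than 1,000 records, check a random sample by hand.
rng = random.Random(42)  # fixed seed so the same sample can be re-drawn
to_check = rng.sample(records, 100) if len(records) > 1000 else records

# 3. Proportion of completed fields: what % of records have every field filled?
def is_complete(record):
    return all(v is not None and v != "" for v in record.values())

pct_complete = 100 * sum(is_complete(r) for r in records) / len(records)

print("count matches:", count_ok)
print("records to check by hand:", len(to_check))
print(f"complete records: {pct_complete:.1f}%")
```

Fixing the random seed is a design choice: it lets a colleague or editor draw the identical sample and repeat the manual check.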
36. What is a scatterplot?
• Scatterplot is often 1st step in
analysis
• Examine relationship between
the variables; determine if
there are any problems/issues
with the data
• Scatterplot indicates anything
unique or interesting about the
data, such as:
• How is the data dispersed?
• Are there outliers? A
scatterplot is useful for
"eyeballing" the presence of
outliers.
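Where a plot is not at hand, a numerical check can complement the eyeball test for outliers. The sketch below uses the common 1.5 × IQR rule of thumb, a technique swapped in here rather than one taken from the deck, and it looks at one variable at a time, so it does not replace a scatterplot of two variables. The values are invented.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Invented ages, including one implausible entry.
ages = [23, 25, 27, 29, 31, 33, 35, 37, 39, 141]
print(iqr_outliers(ages))
```

A flagged value is a prompt to check the source record, not proof of an error; some outliers are real and may be the story.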
37. Convergence of Data Quality with Data Veracity
What is the difference?
• Data quality is the responsibility of whoever, or whatever agency, is collecting or creating the data set
This suggests questions journalists should
ask about DQ
Do methodologies differ?
38. Resources
• Free
• Power Pivot – Excel 2010 add-on for working with large data sets
• R – free software environment for statistical computing and graphics
• Shiny – Lets R users turn analyses into interactive web applications
• Google Refine - tool for working with messy data, cleaning it up,
transforming it from one format into another, extending it with web services, and
linking it to database
• Google Fusion Tables - an experimental data visualization web application
to gather, visualize, and share larger data tables.
• Tableau Public - Interact with the data, download it, or create visualizations
of it
• Junar - cloud-based platform for opening data
39. Resources
• Open Source
• Flat File Checker - a simple, intuitive tool for validation of
structured data in flat files (*.txt, *.csv, etc.).
• Shiny – Lets R users turn analyses into interactive web
applications
• Excel add-ons
• Commercial Companies & Products
• Techspeed Data Cleansing
• SAS® Data Quality Advanced
40. Resources
Professional disciplines and organizations
• International Association for Information and Data
Quality
• DAMA International
• Forensic Accounting/ Performance Measurement
• National Association of Forensic Accountants (NAFA)
• Certified Fraud Examiner (CFE)
• International Forensic Accounting Association
• Forensic Accountants Society of North America
• International City/County Management Association
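42. Resources
Professional disciplines, organizations and others
• La Contabilidad o Auditoria Forense: un conocimiento básico en Colombia (Forensic Accounting or Auditing: basic knowledge in Colombia)
• Contabilidad Forense: ¿El lado sexy de la Contaduría? (Forensic Accounting: the sexy side of accountancy?)
• La Contabilidad Forense (Forensic Accounting)
• Contabilidad Forense, una herramienta que busca la verdad (Forensic Accounting, a tool that seeks the truth)
• Aplicación del Derecho a la Contabilidad Forense: La práctica indagatoria contra el delito económico (Applying the Law to Forensic Accounting: investigative practice against economic crime)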