“OK, but where did that data come from?”


Data validation in the Digital Age

Tom Johnson
Managing Director
Inst. for Analytic Journalism
Santa Fe, New Mexico USA
tom@jtjohnson.com

Cheryl Phillips
Data Enterprise Editor
Seattle Times
Seattle, Washington USA
cphillips@seattletImes.com
Presentation by Cheryl Phillips and Tom Johnson at
National Institute of Computer-Assisted Reporting Conference
Date/Time: Friday, Feb. 24 at 11 a.m.
Location: Frisco/Burlington Room
St. Louis, Missouri USA

This PowerPoint deck and Tipsheets posted at: http://sdrv.ms/wNtiM7
The methodology / = the value of the data set and your story




                                                                1
                                           Important point

    A data base (or
    report) is only as
    good as the
    methodology used
    to create it.
                                                                    3
2. Data sets are living things; they have pedigree and genealogy

Important points
• Most [all?] data sets are living things.
• And they have a pedigree, a genealogy.
• Data sets live in a dynamic environment.
• Understand the DB ecology.
How bad data can do you wrong
Illinois and Missouri sex-offender DB
• St. Louis Post-Dispatch, 2 May 1999, A11: “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST.” By Reese Dunklin; data analysis by David Heath and Julie Luca
• The Dallas Morning News, Sun, 3 Oct 2004, Page 1A: “Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say.” By Diane Jennings and Darlean Spangenberger
•See stories here
How bad data can do you wrong
2011 - New Mexico Sec. of State’s “questionable
voters” data set – “The Big Bundle”
•~1.1m voters
•Previous SoS didn’t clean rolls
•Matched name, address, DoB and SS#
  – SSA data base; NM driver’s licenses
  – 2 variables “mismatch” =  Questionable?
  – Asked State Police (not AG’s office) to investigate
Problems with Sec. of State methodology

• What’s the error rate of original DB?
  •  Definition of “error”? (Gonzales or Gonzalez)
  •  Sample(s) by county and state total?
  •  Error rates of comparative DBs?
  •  Aggregation of error problem
• 2011 Help America Vote Verification Transaction
  Totals, Year-to-Date, by State
  Source: https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
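
The "aggregation of error" and "Gonzales or Gonzalez" questions above can be made concrete with a little arithmetic. What follows is only a minimal sketch: the error rates are invented for illustration (the slides give none), and it assumes errors in each database occur independently, which may not hold for real files.

```python
from difflib import SequenceMatcher
from math import prod


def combined_error_rate(error_rates):
    """Chance that a record is wrong in at least one source,
    assuming errors in each source are independent (an assumption)."""
    return 1 - prod(1 - rate for rate in error_rates)


def name_similarity(a, b):
    """Similarity ratio between 0 and 1 for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical error rates for the voter roll, the SSA file and the
# driver's-license file; the real rates are exactly what the slide asks for.
rates = [0.02, 0.03, 0.05]
print(f"Records touched by at least one error: {combined_error_rate(rates):.1%}")

# A one-letter spelling difference makes an exact match fail,
# even though the two names are nearly identical.
print("Exact match:", "Gonzales" == "Gonzalez")
print("Similarity:", round(name_similarity("Gonzales", "Gonzalez"), 3))
```

Even small per-file error rates compound when several files are matched against each other, so a "mismatch" flag on its own says little until the error rate of each source is known.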
There be dragons!

Data base rich with potential
A most wonderful story!!!
Building genealogy for target DB

1. Pre-plan
   • 2nd monitor
   • “Logbook” apps
2. Lit. review/interview peers
3. Do data fit theoretical models?
4. Do a “critical biography” of the data
5. Does biography raise critical warnings?
6. Have others run analysis of this data?
7. Acquire latest data and related docs
8. Do tables conform to record layout?
9. Do docs specify expected ranges & frequencies?
10. Are data values missing or out of range?
11. Review major checklist

Source: Palmer, Griff. “Flowchart/decision tree for data base analysis.” pgs. 136-146. Ver 1.0 Proceedings, IAJ Press (Santa Fe, NM), April 2006. http://www.lulu.com/product/paperback/ver-10-workshop-proceedings/546459
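
The checklist questions about record layout and missing or out-of-range values lend themselves to a quick script. This is a minimal sketch, not the method from the Palmer flowchart: the column names, types, ranges and sample rows are all hypothetical stand-ins for whatever the data set's documentation specifies.

```python
import csv
import io

# Hypothetical record layout: column name -> (type, minimum, maximum).
LAYOUT = {
    "county": (str, None, None),
    "year": (int, 1990, 2012),
    "offenders": (int, 0, 100000),
}


def check_rows(reader, layout):
    """Return a list of (line_number, column, problem) tuples."""
    problems = []
    # Do the columns conform to the documented record layout?
    if reader.fieldnames != list(layout):
        problems.append((1, "*", f"header {reader.fieldnames} does not match layout"))
        return problems
    for line_no, row in enumerate(reader, start=2):
        for column, (kind, low, high) in layout.items():
            raw = (row[column] or "").strip()
            if raw == "":
                problems.append((line_no, column, "missing"))
                continue
            if kind is int:
                try:
                    value = int(raw)
                except ValueError:
                    problems.append((line_no, column, f"not a number: {raw!r}"))
                    continue
                if not (low <= value <= high):
                    problems.append((line_no, column, f"out of range: {value}"))
    return problems


# Made-up sample data standing in for a real file.
sample = io.StringIO(
    "county,year,offenders\n"
    "Santa Fe,2011,120\n"
    "Taos,,45\n"
    "Bernalillo,2031,300\n"
)
for problem in check_rows(csv.DictReader(sample), LAYOUT):
    print(problem)
```

Each reported problem is a lead to follow up with the data's creator, not proof of an error.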
Building genealogy for target DB

• Changes in definitions?
  • By administrators?
  • Formal or informal?
  • By statute?
• Changes in collection methods, data entry, vetting, updating, file type/format?
• Changes in users and usage
• Data cleaning
Data Quality checkpoints

• Constancy of definitions and coding categories?
  • All at same time and location?
• Completeness: How many records have unfilled
  cells? Are the patterns of “nulls” consistent
  across all records and variable types?
• Precision: Are the numbers rounded?
  • Hope for fine-grained, not summaries or aggregates
  • Can be especially important with temporal and
    geographic data, e.g. what is the range of the time
    scales?
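
The completeness checkpoint can be turned into a count of unfilled cells per column. A minimal sketch, assuming the records are already loaded as dictionaries; the set of "null" markers and the sample rows are illustrative guesses, since every data set has its own conventions.

```python
from collections import Counter

# Values treated as "unfilled"; adjust to the conventions of the data set.
NULL_MARKERS = {"", "NA", "N/A", "NULL", "-"}


def null_counts(rows):
    """Count unfilled cells per column in a list of dict records."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or str(value).strip().upper() in NULL_MARKERS:
                counts[column] += 1
    return counts


# Made-up records for illustration.
rows = [
    {"name": "A. Smith", "dob": "1970-01-02", "county": "Taos"},
    {"name": "B. Jones", "dob": "", "county": "N/A"},
    {"name": "C. Lee", "dob": None, "county": "Santa Fe"},
]
counts = null_counts(rows)
for column in rows[0]:
    share = counts[column] / len(rows)
    print(f"{column}: {counts[column]} unfilled ({share:.0%})")
```

Running the same count separately by year or by county is one way to see whether the pattern of nulls is consistent across the file.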
Cheryl on Quant methods for
  measuring data quality
Newsroom methods for measuring data quality

• Test frequencies on key fields
  Bicycle accidents in Seattle included a time field. But
  it was almost always noon when accidents occurred.
  Caveat: Don’t over-reach with your conclusions or
  analysis
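
A frequency test like the one that exposed the noon problem takes only a few lines. This is a sketch with invented times, not the Seattle Times analysis; the 50 percent threshold is an arbitrary choice for illustration.

```python
from collections import Counter


def top_share(values):
    """Return the most common value and the share of records holding it."""
    counts = Counter(values)
    value, count = counts.most_common(1)[0]
    return value, count / len(values)


# Made-up accident times; a real file would be read from the source data.
times = ["12:00", "12:00", "08:15", "12:00", "17:40", "12:00", "12:00", "12:00"]
value, share = top_share(times)
if share > 0.5:  # the threshold is an arbitrary choice for illustration
    print(f"{value} appears in {share:.0%} of records: possibly a default, not a real time")
```

A single value dominating a field that should vary is a reason to ask the agency how the field is filled in before drawing any conclusion from it.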
Don’t over-reach with your analysis

– Rates are good, IF you have the data to calculate them.
Outliers are important
    Explore the reasons behind anomalies or unexpected
    trends in the data.
From the state of WA: “After going back and forth with our analyst on this, we decided it would be easiest for her to just pull the data. You would have been able to get most of the way there through that fiscal.wa.gov site, but there was some stimulus money you wouldn’t have captured and we included the changes so far to the current biennium (based on the supplemental the legislature approved in December).”
Other Key Data Checks

– When you update the data, make sure nothing has changed. Check definitions for expansion or reduction and talk to the creator of the data.
– Be ready to nix a story.
Other Key Data Checks

– Do the math: run sums, percent change, other
  calculations. Test that math against the results in
  the database – do they match?
– Look for unexpected nulls
– Run a group by query and sort alphabetically by
  major fields to test for misspellings or other
  categorization errors.
– If your data should include every city, or every
  county in the state, does it? Are you missing data?
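
Three of the checks above (do the math, group and sort to spot misspellings, confirm every county is present) can be sketched in plain Python. The records, the reported total and the list of expected counties are all hypothetical; in a newsroom the same checks would more likely be run as queries in a database manager.

```python
from collections import Counter

# Hypothetical records: (county, city, amount).
records = [
    ("King", "Seattle", 100),
    ("King", "Seatle", 40),      # misspelling splits one city into two groups
    ("Pierce", "Tacoma", 60),
]
reported_total = 210             # total published in the (hypothetical) source report
expected_counties = {"King", "Pierce", "Spokane"}

# 1. Do the math: does our sum match the reported figure?
computed_total = sum(amount for _, _, amount in records)
print("Totals match:", computed_total == reported_total)

# 2. Group by a major field and sort alphabetically so near-duplicates sit together.
for city, count in sorted(Counter(city for _, city, _ in records).items()):
    print(city, count)

# 3. Coverage: which expected counties never appear?
missing = expected_counties - {county for county, _, _ in records}
print("Missing counties:", sorted(missing))
```

Sorting the grouped values alphabetically puts "Seatle" next to "Seattle", which is what makes the misspelling easy to see.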
Other Key Data Checks

– Check with experts and have them test your
  analysis. Research the methodology used with the
  kind of data you are working with.
– There is version control for Web frameworks – use
  some kind of version control for your database,
  even if it’s in an Excel spreadsheet. Any time you
  change it, log what you did and when and why.
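
Where no formal version-control tool is in use, even a plain log file covers the "what, when and why" of each change. A minimal sketch, assuming a CSV logbook kept alongside the data; the file name and the example entry are made up, and the demonstration writes to a temporary folder so it leaves nothing behind in the working directory.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone


def log_change(log_path, what, why):
    """Append one timestamped entry (when, what, why) to a CSV logbook."""
    with open(log_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow(
            [datetime.now(timezone.utc).isoformat(timespec="seconds"), what, why]
        )


# Demonstration in a temporary folder; in practice the log sits next to the data.
log_path = os.path.join(tempfile.mkdtemp(), "changes_log.csv")
log_change(
    log_path,
    "Merged 'Seatle' into 'Seattle' in city column",
    "Misspelling found in group-by check",
)
with open(log_path, newline="", encoding="utf-8") as handle:
    entries = list(csv.reader(handle))
print(entries)
```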
Other Key Data Checks
– Test the data against source documents.
Other Key Data Checks
   • How we did it
Building genealogy for target DB

NOW you are ready to write a story based on a data base!
Summing Up

• Databases are constantly dynamic, “living” things.
  Look for and measure their energy and change.
• Beware of rounding error
   – Always try to get the most fine-grained data possible in its
     ORIGINAL data form or application, i.e. avoid PDFs with
     SUMMARY data
• Beware of changing definitions
• Beware of changing data collectors, data entry
  personnel, changing norms of editing and usage.
         Many Thanks

Data validation methods for journalists

  • 1. “OK, but where did that data come from?” Data validation in the Digital Age. Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA, tom@jtjohnson.com. Cheryl Phillips, Data Enterprise Editor, Seattle Times, Seattle, Washington USA, cphillips@seattletImes.com
  • 2. Data validation in the Digital Age. Presentation by Cheryl Phillips and Tom Johnson at National Institute of Computer-Assisted Reporting Conference. Date/Time: Friday, Feb. 24 at 11 a.m. Location: Frisco/Burlington Room, St. Louis, Missouri USA. This PowerPoint deck and Tipsheets posted at: http://sdrv.ms/wNtiM7
  • 3. The methodology determines the value of the data set and your story. Important point 1: A data base (or report) is only as good as the methodology used to create it.
  • 4. Data sets are living things; they have pedigree and genealogy. Important points 2:
       • Most [all?] data sets are living things.
       • And they have a pedigree, a genealogy.
       • Data sets live in a dynamic environment.
       • Understand the DB ecology.
  • 5. How bad data can do you wrong: Illinois and Missouri sex-offender DB
       • St. Louis Post-Dispatch, 2 May 1999, A11: “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST.” By Reese Dunklin; data analysis by David Heath and Julie Luca
       • The Dallas Morning News, Sun., 3 Oct 2004, Page 1A: “Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say.” By Diane Jennings and Darlean Spangenberger
       • See stories here
  • 6. How bad data can do you wrong: 2011, New Mexico Sec. of State’s “questionable voters” data set, “The Big Bundle”
       • ~1.1m voters
       • Previous SoS didn’t clean rolls
       • Matched name, address, DoB and SS# against the SSA data base and NM driver’s licenses
       • 2 variables “mismatch” = Questionable?
       • Asked State Police (not AG’s office) to investigate
  • 7. Problems with Sec. of State methodology
       • What’s the error rate of original DB?
       • Definition of “error”? (Gonzales or Gonzalez)
       • Sample(s) by county and state total?
       • Error rates of comparative DBs?
       • Aggregation of error problem
       • 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State https://www.socialsecurity.gov/open/havv/havv-year-
  • 9. There be dragons! A most wonderful story!!! Data base rich with potential
  • 10. Building genealogy for target DB
       1. Pre-plan (2nd monitor; “Logbook” apps)
       2. Lit. review / interview peers
       3. Do data fit theoretical models?
       4. Do a “critical biography” of the data
       5. Does biography raise critical warnings?
       6. Have others run analysis of this data?
       7. Acquire latest data and related docs
       8. Do tables conform to record layout?
       9. Do docs specify expected ranges & frequencies?
       10. Are data values missing or out of range?
       11. Review major checklist
       Source: Palmer, Griff. “Flowchart/decision tree for data base analysis.” pgs. 136-146. Ver 1.0 Proceedings, IAJ Press (Santa Fe, NM), April 2006. http://www.lulu.com/product/paperback/ver-10-workshop-proceedings/546459
  • 11. Building genealogy for target DB (the checklist from slide 10, with an overlay of questions):
       • Changes in definitions? By administrators? Formal or informal? By statute?
       • Changes in collection methods, data entry, vetting, updating, file type/format?
       • Changes in users and usage
       • Data cleaning
  • 12. Data Quality checkpoints • Constancy of definitions and coding categories? • All at same time and location? • Completeness: How many records have unfilled cells? Are the tendencies of “nulls” consistent in all records, variable types? • Precision: Are the numbers rounded or? • Hope for fine-grained, not summaries or aggregates • Can be especially important with temporal and geographic data, i.e. What is the range(s) of the time scales?
  • 13. Cheryl on Quant methods for measuring data quality
  • 14. Data Quality checkpoints • Constancy of definitions and coding categories? • All at same time and location? • Completeness: How many records have unfilled cells? Are the tendencies of “nulls” consistent in all records, variable types? • Precision: Are the numbers rounded or? • Hope for fine-grained, not summaries or aggregates • Can be especially important with temporal and geographic data, i.e. What is the range(s) of the time scales?
  • 15. Newsroom methods for measuring data quality • Test frequencies on key fields Bicycle accidents in Seattle included a time field. But it was almost always noon when accidents occurred. Caveat: Don’t over-reach with your conclusions or analysis
  • 16. Don’t over-reach with your analysis – Rates are good – IF you have the data to calculate them.
  • 17. Outliers are important Explore the reasons behind anomalies or unexpected trends in the data. From the state of WA: After going back and forth with our analyst on this, we decided it would be easiest for her to just pull the data. You would have been able to get most of the way there through that fiscal.wa.gov site, but there was some stimulus money you wouldn’t have captured and we included the changes so far to the current biennium (based on the supplemental the legislature approved in December).
  • 18. Other Key Data Checks – When you update the data, make sure nothing has changed. Check definitions for expansion or reduction and talk to the creator of the data. – Be ready to nix a story.
  • 19. Other Key Data Checks – Do the math: run sums, percent change, other calculations. Test that math against the results in the database – do they match? – Look for unexpected nulls – Run a group by query and sort alphabetically by major fields to test for misspellings or other categorization errors. – If your data should include every city, or every county in the state, does it? Are you missing data?
  • 20. Other Key Data Checks – Check with experts and have them test your analysis. Research the methodology used with the kind of data you are working with. – There is version control for Web frameworks – use some kind of version control for your database, even if it’s in an Excel spreadsheet. Any time you change it, log what you did and when and why.
  • 21. Other Key Data Checks – Test the data against source documents.
  • 22. Other Key Data Checks • How we did it
  • 23. Building genealogy for target DB (the checklist from slide 10, with an overlay): NOW you are ready to write a story based on a data base! Analysis
  • 24. Summing Up • Databases are constantly dynamic, “living” things. Look for and measure their energy and change. • Beware of rounding error – Always try to get the most fine-grained data possible in its ORIGINAL data form or application, i.e. avoid PDFs with SUMMARY data • Beware of changing definitions • Beware of changing data collectors, data entry personnel, changing norms of editing and usage.
  • 25. Many Thanks. “OK, but where did that data come from?” Data validation in the Digital Age. This PowerPoint deck and Tipsheets posted at: http://sdrv.ms/wNtiM7 Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA, tom@jtjohnson.com. Cheryl Phillips, Data Enterprise Editor, Seattle Times, Seattle, Washington USA, cphillips@seattletImes.com
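
The newsroom checks listed in the slides above (test frequencies on key fields, look for unexpected nulls, group and sort alphabetically to catch misspellings, confirm that every expected city is present, and test your math against reported totals) could be sketched in Python roughly as follows. This is only an illustrative sketch, not the presenters' method: the sample records, the field names (`city`, `time`, `injuries`), the 75% threshold and the function names are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical accident records; field names and values are invented for illustration.
ROWS = [
    {"city": "Seattle", "time": "12:00", "injuries": 1},
    {"city": "Seattle", "time": "12:00", "injuries": 0},
    {"city": "Seatle", "time": "12:00", "injuries": 2},   # misspelled on purpose
    {"city": "Tacoma", "time": "08:15", "injuries": None},  # missing value
    {"city": "Seattle", "time": "12:00", "injuries": 1},
]


def frequencies(rows, field):
    """Count how often each value appears in one field."""
    return Counter(row[field] for row in rows)


def dominant_value(rows, field, threshold=0.75):
    """Return (value, share) if one value fills at least `threshold` of the rows.

    A time field that is "almost always noon" would be flagged here.
    The 0.75 threshold is an arbitrary assumption.
    """
    value, count = frequencies(rows, field).most_common(1)[0]
    share = count / len(rows)
    return (value, share) if share >= threshold else None


def null_counts(rows):
    """Count missing (None or empty-string) values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                counts[field] += 1
    return dict(counts)


def grouped_alphabetically(rows, field):
    """Group-by counts sorted alphabetically, so near-duplicate spellings sit together."""
    return sorted(frequencies(rows, field).items())


def missing_expected(rows, field, expected):
    """Return the expected values (e.g. every city) that never appear in the data."""
    seen = {row[field] for row in rows}
    return sorted(set(expected) - seen)


def total_matches(rows, field, reported_total):
    """Recompute a sum yourself and compare it with the total the agency reports."""
    computed = sum(row[field] for row in rows if row[field] is not None)
    return computed == reported_total


if __name__ == "__main__":
    print(dominant_value(ROWS, "time"))
    print(null_counts(ROWS))
    print(grouped_alphabetically(ROWS, "city"))
    print(missing_expected(ROWS, "city", ["Seattle", "Spokane", "Tacoma"]))
    print(total_matches(ROWS, "injuries", 4))
```

None of these checks proves the data is right; each one only points to records worth a question to the agency that created the data.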

Editor's notes

  1. “The devil is in the data.” How pure/faulty/legit are the “genes” in your data?
     Opener: They don’t believe us (perhaps with good reason). Get some stats on the public’s trust of journalism and journalists. A way to save and perhaps improve our reputation is to make sure of the truthfulness, the validity, of what we are reporting. As we do more and more analysis of data as part of our stories, making sure we are analyzing correct, valid, high-quality data becomes crucial. (We should also be sharing our methods and data with the public, but that’s a topic for another session.)
  2. Finding the headwaters of your data: tracing the process of DB creation
     • Type of agency? Gov’t, NGO, non-profit, profit
     • Who’s responsible for the DB conception? Mandated by legislation, federal or state regulations, executive order? Some administrator?
     • For what purpose?
     • Who’s responsible for designing and defining variables and collection methods?
     • Quantitative or qualitative data?
     • Degree of precision in classification, geography, dates, time-factor
     • Self-reported?
     • Census or sampling?
     • Training for data collectors?
     • Training and verification of classification assignment?
  3. The methodology determines the value of the data set and your story. I’m suspicious of, and reluctant to use, sweeping generalities and adjectives, but in this case: Appropriateness of method ALWAYS determines the validity of the analysis, though the method(s) (i.e. analytic tools) may vary depending on your objectives. Methods used to create a data set ALWAYS determine the validity and functionality of the data set. Ergo, before we start crunching data and data mining, we need to recognize and know that the methods used to create the data set determine: the reliability of the data set, and the functionality (for multiple audiences) of the data set (e.g. who called for the creation of this data set, when and why? Who is to use it for what ends? What is its “measured” value for original users and for our readers?). Knowing and understanding those “methods of creation” determines the value of your analysis and, hence, your story.
  4. Most [all?] data sets are living things. A data base may look to be just a static matrix of text or numbers, but there are living, breathing dynamic forces at work in and around any data set that can provide an interesting context of understanding for journalists. And they have a pedigree, a genealogy. If we don’t understand that genealogy, we can’t evaluate, or properly use, that DB. Data sets live in a dynamic environment. All data sets “live” in a context, in an environment in the datasphere that is constantly changing in terms of the validity of the data, who is collecting/updating/editing the data, and who is using the data for what purposes and how often. How is Data Set A (or parts of it) related to DS B and C and G? And how do the administrators/analysts of the secondary data measure the quality of the data they are getting from DS A, if they do it at all? Understand the DB ecology: see how the data set relates to other sets of data, agencies and users.
  5. Tom will add hyperlinks to these stories, though we might include them in handouts. Get bibliography on SSA publications.
  6. Get bibliography on SSA publications.
     “The biggest problem with E-Verify is that it’s based on SSA’s inaccurate records. SSA estimates that 17.8 million (or 4.1 percent) of its records contain discrepancies related to name, date of birth, or citizenship status, with 12.7 million of those records pertaining to U.S. citizens. That means E-Verify will erroneously tell you that 1 in 26 of your legal workforce is not actually legal.” http://www.laborcounselors.com/index.php?option=com_content&view=article&id=715:social-security-mismatch-and-immigration-2011-where-do-we-go-from-here&catid=44&Itemid=300008
     “The error rate for US citizens in the SSA data base is estimated to be 11 percent, meaning that 12.7 million of the 17.8 million "bad" SSNs in 2006 are believed to belong to US citizens, according to SSA's inspector general.” http://migration.ucdavis.edu/mn/more.php?id=3315_0_2_0
     2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State: https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
     Reply from Steve Doig: “Tom: I think the answer depends on how many records are in each db. If db1 is very large in comparison to db2, then the error rate should be close to 4.5%. And vice versa. There's probably a formula for this, but I sure don't know it. I'd do the match and then check a sample of the results to estimate the combined error rate.”
     Reply from Steve Ross: “Let's say each db holds similar data and is the same size, 1000 records. Let's also assume that there are no records duplicated in the two databases, either internally or from one data set to the other. Then you have 45 bad records in one set, and 137 in the other. Combining, you have (45+137) = 182 bad records, in 2000 total records, or an error rate of 9.1%. Same process can be used to calculate error rate combining data from any number of sets, of any size as long as no records are duplicated. Error LIMITS/confidence intervals would be quite a different matter.”
  7. Always a VERY complex problem for analysis because of “definitions,” changes over time, and then statistical evaluation methods.
     Assume you can determine, from sampling, that Data Base “A” has 8.5% of records with errors. Assume DB “B” has 11.3% of records with errors (how to define “error”?). If you compare one to the other, simply adding the rates gives 8.5 + 11.3, or 19.8%. Ah, but what if one DB has an error rate of 73% and the other has an error rate of 82%? How could you have an error rate >100%? Simple addition therefore overstates the combined rate; it is at best an upper bound. Ergo, the question becomes: What is the lowest “acceptable” error rate for meaningful analysis? (Whatever “meaningful” means.)
     Help America Vote transactions? Note that New Mexico has not sought any clarifications.
     Press release, “Social Security Makes Help America Vote Act Data Available,” http://www.socialsecurity.gov/pressoffice/pr/HAVA-pr.html : Michael J. Astrue, Commissioner of Social Security, today announced the agency is publishing data on its Open Government website www.socialsecurity.gov/open about verifications the agency conducts for States under the Help America Vote Act (HAVA) of 2002. Under HAVA, most States are required to verify the last four digits of the Social Security number of people newly registering to vote who do not possess a valid State driver's license. “I strongly support President Obama’s commitment to creating an open and transparent government,” Commissioner Astrue said. “As we approach another federal election year, it remains absolutely critical that Americans are able to register to vote without undue obstacles. Making this data publicly available will allow the media and the public on a timely basis to raise questions about unexpected patterns with the appropriate State officials.” The data available at www.socialsecurity.gov/open/havv represents the summary results for each State of the four-digit match performed by Social Security under HAVA.
  8. DYNAMIC DATA & DATA BASE OR SET https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
     What do these terms mean? The following list describes the types of data in the HAVV dataset.
     • Total Transactions: The total number of verification requests made during the time period.
     • Unprocessed Transactions: The total number of verification requests that could not be processed because the data sent to us was invalid (e.g., missing, not formatted correctly).
     • Total Matches: The total number of verification requests where there is at least one match in our records on the name, last four digits of the SSN and date of birth.
     • Total Non Matches: The total number of verification requests where there is no match in our records on the name, last four digits of the SSN or date of birth.
     • Multiple Matches Found – At least one alive and at least one deceased: The total number of verification requests where there are multiple matches on name, date of birth, and the last four digits of the SSN, and at least one of the number holders is alive and at least one of the number holders is deceased.
     • Single Match Found – Alive: The total number of verification requests where there is only one match in our records on name, last four digits of the SSN and date of birth, and the number holder is alive.
     • Single Match Found – Deceased: The total number of verification requests where there is only one match in our records on name, date of birth, and last four digits of the SSN, and the number holder is deceased.
     • Multiple Matches Found – All Alive: The total number of verification requests where there are multiple matches on name, date of birth, and last four digits of the SSN, and each match indicates the number holder is alive.
     • Multiple Match Found – All Deceased: The total number of verification requests where there are multiple matches on name, date of birth, and the last four digits of the SSN, and each match indicates the number holder is deceased.
  9. Source: Palmer, Griff. “Flowchart/decision tree for data base analysis.” pgs. 136-146 Ver 1.0 Proceedings, IAJ Press (Santa Fe, NM), April 2006. http://www.lulu.com/product/paperback/ver-10-workshop-proceedings/546459 1. Pre-plan 1a. 2 nd monitor 2a. “logbook” applications 2. Lit. review/ interview peers 3. Do data fit theoretical models? 4. Do a “critical biography” of the data 5. Does biography raise critical warnings? 6. Have others run analysis of this data? 7. Acquire latest data and related docs 8. Do tables conform to record layout? 9. Do docs specify expected ranges & frequencies? 10. Are data values missing or out of range? 11. Review major checklist
  10. Source: http://nsu.aphis.usda.gov/outlook/issue5/data_quality_part2.pdf Constancy of definitions and coding categories ? All at same time and location? Completeness: How many records have unfilled cells? Are the tendencies of “nulls” consistent in all records, variable types? Precision: Are the numbers rounded or? Hope for fine-grained, not summaries or aggregates Can be especially important with temporal and geographic data, i.e. What is the range(s) of the time scales? Can be a lot of difference in traffic counts, for example, if the data is hourly vs. 15-minute intervals. Or in range of ages.
  12. Important to note not to jump to conclusions, or try to do more analysis than makes sense. For example, rates would have been misleading because we don’t have good bicycle counts by street or intersection, much less car-traffic counts. But we could use this anecdotally in the story: In the city's annual mid-September count, there were 3,251 cyclists commuting into downtown in 2010, up from 2,273 in 2007. So, accidents are holding steady while the number of commuters is increasing.
  15. Last year, editors at The Seattle Times noticed more food trucks around. There must be a story about the safety record of these trucks, they thought. So, of course, we checked it out. What we found? Food trucks were just as clean, met inspection rules, just as much as all other types of restaurants. In part, this was because their food came from prep sites most of the time and was not cooked in a mobile unit. And, just to be sure, we checked the prep sites. They got good grades too.
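
The error-rate arithmetic discussed in the notes above can be worked through in a few lines of Python. This is a hedged sketch, not a method given by the presenters. `pooled_error_rate` follows the pooling described in the note (bad records divided by total records, assuming no record is duplicated across sets). `match_error_rate` is an added assumption: it treats errors in the two data bases as independent, in which case the chance that a matched pair is wrong on at least one side is 1 − (1 − a)(1 − b), which can never exceed 100%. Real data bases may well violate that independence assumption, so checking a sample of the matched results remains the safer estimate. Both function names are hypothetical.

```python
def pooled_error_rate(sets):
    """Error rate after pooling several data sets.

    `sets` is a list of (bad_records, total_records) pairs.
    Assumes no record is duplicated within or across the sets.
    """
    bad = sum(b for b, _ in sets)
    total = sum(t for _, t in sets)
    return bad / total


def match_error_rate(rate_a, rate_b):
    """Chance that a matched pair has an error on at least one side.

    ASSUMPTION (not from the source): errors in the two data bases are
    independent of each other. Rates are fractions between 0 and 1.
    """
    return 1 - (1 - rate_a) * (1 - rate_b)


if __name__ == "__main__":
    # Pooling example from the notes: 45 and 137 bad records in two sets of 1000.
    print(f"pooled: {pooled_error_rate([(45, 1000), (137, 1000)]):.1%}")

    # Matching one data base against another, under the independence assumption.
    print(f"match, 8.5% vs 11.3%: {match_error_rate(0.085, 0.113):.1%}")
    print(f"match, 73% vs 82%:    {match_error_rate(0.73, 0.82):.1%}")
```

Under these assumptions the 8.5% and 11.3% case comes out slightly below the simple sum of 19.8%, and the 73% and 82% case stays below 100%.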