Árbol de vida
 de los datos
(Data validation in the Digital Age)




  Tom Johnson
  Managing Director
  Inst. for Analytic Journalism
  Santa Fe, New Mexico USA
  tom@jtjohnson.com
  @jtjohnson
                                       1
Data validation in the
Digital Age
   Presentation by Tom Johnson at
   Cátedra Walter Lippmann de Periodismo y Opinión Pública
   Claustro de la Universidad
   Universidad del Rosario, Bogota, Colombia
   Date/Time: 22 November 2012


   This PowerPoint deck and Tipsheets posted at:


   http://sdrv.ms/wNtiM7


                                                             2
Impt. Point 1-You know more than I do




                                        Important point

            Each of you knows
            more about some
            aspect of ensuring
            data quality than I
            do.
                                                          3
DataSet--Story

                 The
                 STORY!




                      4
DataSet--CollectionProcess
[Figure: the labels "Collection Process", "Data Set", and "The STORY!" set against columns of binary digits]
                                     5
DataSet-ValidationProcess
[Figure: the labels "Collection Process", "Data Set", "Validation Process", and "The STORY!" set against columns of binary digits]
                                               [6]
Paying the price of bad data
Illinois and Missouri sex-offender DB
• St. Louis Post-Dispatch, 2 May 1999, A11:
  “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO
  LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS
  REGISTRY; MANY SEX OFFENDERS NEVER MAKE
  THE LIST” By Reese Dunklin; Data Analysis By David Heath and Julie Luca
• The Dallas Morning News, Sun, 3 Oct 2004, Page 1A:
  “Criminal checks deficient; State's database of
  convictions is hurt by lack of reporting, putting
  public safety at risk, law officials say”
  By Diane Jennings and Darlean Spangenberger
How bad data can do you wrong
2011 - New Mexico Sec. of State’s “questionable
voters” data set – “The Big Bundle”
• ~1.1m voters
• Previous Sec. of State didn’t clean rolls
• Matched name, address, DoB and SS#
  • SSA data base; NM driver’s licenses
  • 2 variables “mismatch” =  Questionable?
  • Asked State Police (not AG’s office) to investigate
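To see why a "2 mismatches = questionable" rule can mislead, here is a minimal Python sketch. It is not the Secretary of State's actual procedure; the field names, the sample records, and the strict string comparison are all assumptions made for illustration.

```python
# Hypothetical field names; the slide lists name, address, DoB and SS#.
FIELDS = ("name", "address", "dob", "ssn")


def mismatched_fields(voter, reference):
    """Return the fields whose values differ after trimming and lower-casing."""
    return [
        field
        for field in FIELDS
        if voter[field].strip().lower() != reference[field].strip().lower()
    ]


def is_questionable(voter, reference, threshold=2):
    """Flag a record when at least `threshold` fields fail to match exactly."""
    return len(mismatched_fields(voter, reference)) >= threshold


# Invented records: one spelling variant and one abbreviation, same person.
voter = {"name": "Maria Gonzales", "address": "12 Canyon Rd",
         "dob": "1970-01-02", "ssn": "000-00-0001"}
reference = {"name": "Maria Gonzalez", "address": "12 Canyon Road",
             "dob": "1970-01-02", "ssn": "000-00-0001"}

print(mismatched_fields(voter, reference))
print(is_questionable(voter, reference))
```

With exact matching, a spelling variant plus an abbreviated street name is enough to flag this voter, although the date of birth and SS# agree.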




                                                          8
Problems with Sec. of State methodology

• What is the error rate of original DB?
   • Definition of “error”? (Gonzales or Gonzalez)
   • Sample(s) by county and state total?
   • Error rates of comparative DBs?
   • Aggregation of error problem
• 2011 Help America Vote Verification Transaction
  Totals, Year-to-Date, by State
  https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
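The "aggregation of error" problem can be sketched with a little arithmetic in Python. The error rates below are invented, and the calculation assumes errors in each data base are independent of one another, which real data bases may not satisfy.

```python
def share_with_any_error(error_rates):
    """Chance that a matched record is touched by an error in at least one
    source, assuming the sources err independently (an assumption)."""
    clean = 1.0
    for rate in error_rates:
        clean *= 1.0 - rate
    return 1.0 - clean


# Hypothetical error rates for three data bases being compared.
rates = [0.02, 0.03, 0.05]
combined = share_with_any_error(rates)
print(f"{combined:.2%} of matched records may carry at least one error")
```

Under these assumptions the combined rate is larger than any single source's rate, which is why the error rate of every comparison data base matters.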
DataSetCollectionProcess
[Figure: the labels "Collection Process", "Data Set", and "The STORY!" set against columns of binary digits]
                                     10
Data sets are living things; they have pedigree and genealogy




                                    Important point
    •Most [all?] data sets are living
    things.
    •And they have a pedigree, a
    genealogy, an “árbol de vida”.
    •Data sets live in a dynamic
    environment.
    •Understand the DB ecology

                                                                11
Data sets are living things; they have pedigree and genealogy




                                    Important point
          • NEVER work with your
            original data set; always a
            copy of the file(s)
          • More combined data sets =
            greater chance of error
          • Larger data sets = greater
            chance of error
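One way to follow the "never work with your original" rule, sketched in Python. The file name, the "working_" prefix, and the use of a SHA-256 fingerprint are assumptions for illustration, not something the slides prescribe.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Fingerprint a file so later changes can be detected."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def make_working_copy(original, workdir):
    """Copy the original into workdir; return the copy's path and the original's fingerprint."""
    original = Path(original)
    copy_path = Path(workdir) / f"working_{original.name}"
    shutil.copy2(original, copy_path)
    return copy_path, sha256_of(original)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical original file, created here only so the sketch can run.
    original = Path(tmp) / "voters.csv"
    original.write_text("id,name\n1,Ana\n")

    working, fingerprint = make_working_copy(original, tmp)
    working.write_text("id,name\n1,ANA\n")  # all cleaning happens on the copy

    original_untouched = sha256_of(original) == fingerprint
    copy_changed = sha256_of(working) != fingerprint
    print(original_untouched, copy_changed)
```

Recording the fingerprint in the logbook lets you show later that the file you received was never altered.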
                                                                12
Types of Data
[Figure: the label "Data Set" set against columns of binary digits]
                                   13
  Data Quality = function of…
  • Objectives, reputation of data-base
    creator
  • Validity and precision of the
    collection/creation process – and
    resulting data
• Statistical Data?
  • Primary Data (collected, managed by
    agency or individual)
  • Secondary (Agency or individual is using
    someone else’s “primary” data)

                                           [14]
Pyramid of significance

• How to judge whether some data – and its
  potential stories -- are more trustworthy
  than others?
  • Go back to librarians’ hierarchy of trusted
    sources when searching?
   (Has anyone tested the “quality” of data sets from
   those strata of sources? If not, a good research
   project.)




                                                        [15]
Learn from Librarians

• Evaluating Web Pages: Techniques to
  Apply & Questions to Ask
 http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html
  • What can the URL tell you?
     • Gov’t agency? Scholarly? Interest
       Group? Individual?
     • Has a reputation for accuracy been
       created over time?


                                                      [16]
Learn from Librarians

• Does it all add up?
  • Why was the page put on the web?
     •   Inform, give facts, give data?
     •   Explain, persuade?
     •   Sell, entice?
     •   Share?
     •   Disclose?
• Is the information current?
  When was it last updated and by whom?
• If the data is available on other sites,
  who/what was the original creator and
  editor of the data?

                                             [17]
Hierarchy of Trust
                     • ".org" is organization. Sites that end in .org are usually
                       non-profit organizations.
                     • Can be very good sources or very poor sources; take care to
                       research their possible agendas or political biases.
                     • “.net” means network.
                     • “.info” is the Internet’s first unrestricted top-level domain
                       since .COM. There are no restrictions on who may register
                       .INFO names. .INFO was created for general use around the world.
                     • For .gov, .edu, or .mil, probably the information has been
                       vetted before it was posted.
                     • Websites with .gov, .edu and .mil have to be applied for,
                       and their use is controlled.
                     • It doesn’t mean they are fool-proof though.

   Source: http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299
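A first pass at "check the domain of the URL" can be automated. The sketch below is in Python; the grouping into "controlled" and "open" domains follows the slide text, while the wording of the notes and the function names are assumptions. A domain ending says nothing about the quality of the data itself.

```python
from urllib.parse import urlparse

# Groupings taken from the slide; anything else is left unclassified.
CONTROLLED = {"gov", "edu", "mil"}
OPEN = {"org", "net", "info", "com"}


def top_level_domain(url):
    """Return the last label of the host name, e.g. 'gov'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() if "." in host else ""


def domain_note(url):
    """Give a short reminder of what the domain ending does and does not tell you."""
    tld = top_level_domain(url)
    if tld in CONTROLLED:
        return f".{tld}: registration is controlled; still check the data"
    if tld in OPEN:
        return f".{tld}: open registration; research possible agendas"
    return f".{tld}: not covered by this sketch"


for url in (
    "https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html",
    "http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html",
    "http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299",
):
    print(domain_note(url))
```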
Hierarchy of Trust
                     • Credible websites should list
                       contact information and
                       resources.
                     • If only cell phones and
                       PO boxes = suspicion
                     • If the author is named, find
                       his/her web page to…
                          • Verify educational credentials
                          • Discover whether the writer has been
                            published in a scholarly journal
                          • Verify that the writer is
                            employed by a research
                            institution or university
Hierarchy of Trust

                     • Internet pages that have
                       been published more
                       recently are usually more
                       credible.
                     • Find this information at
                       the bottom of a website;
                       in the “About us” section; or
                       via “view page source”
Hierarchy of Trust

                     • Selling something?
                     • Asking you to sign up
                       for something?
                     • May not be
                       presenting you with
                       neutral, unbiased
                       information.
Hierarchy of Trust
    Probably reliable sites,
      but not necessarily
         reliable data
           What is the
             site's
           purpose?

          Check the
        publishing date.

          Credible
       Websites/Authors

    Check the domain of the
              URL
CollectionProcessDataSet
[Figure: the labels "Collection Process", "Data Set", and "The STORY!" set against columns of binary digits]
                                     23
Process of Data Evaluation

1. Pre-planning
  • 2nd Monitor
  • “Logbook” (bitácora) apps
  • Checklist of intended steps
2. Lit. review/interview peers
  • Nothing is new; everything has a precedent
  • How have others attacked this problem?
3. Do data fit theoretical models?
  - Depends on subject: traffic flow vs. crime, or educational level vs. income
  - Sometimes good to use non-trad. models: crime and disease




                                                                    24
Process of Data Evaluation
4. Do a “critical biography” of the data
  - Why was data collected? Who ordered its creation (law? Agency? Individual?)
  - When first collected?
  - News stories about the data?
5. Does biography raise critical warnings?
  - Have laws related to data remained the same?
  - Have definitions remained the same?
6. Have others run analysis of this data?
  - Not only journalists, but other agencies/people




                                                                    25
Process of Data Evaluation
7. Acquire latest
data and related
documentation

- Get data
  schema & code
  sheet
- Get instructions
  to data
  collectors and
  data entry
  clerks




                             26
Process of DB evaluation




[Figure: Ask for a copy of: data-entry form; computer data-entry sheet; explanation of data codes; data base schema sheet]

                                       27
Process of Data Evaluation
8. Compare record layout to tables
  This may tell you:
  - What data you did not receive
  - Possibly, what data is feeding into other variables or calculations
9. Do documents specify expected ranges & frequencies?
  - Suggests variables to be found. If expected range is 1-7 and you find 8…




                                                                        28
Process of Data Evaluation
10. Are data values
    missing or out of
    range?

- Use Excel (or R) formula
  to test “expected” ranges
  - =MIN(A1:A100) or
   =MAX(A1:A100)
- Use Excel's conditional
formatting feature
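The same MIN/MAX test can be sketched outside Excel. The Python below is only an illustration: the 1-7 coding comes from the earlier slide, while the sample answers and the function name are invented.

```python
def check_range(values, low, high):
    """Report observed min/max, missing cells, and values outside low..high."""
    missing = [i for i, v in enumerate(values) if v is None or v == ""]
    present = [(i, v) for i, v in enumerate(values) if not (v is None or v == "")]
    out_of_range = [(i, v) for i, v in present if not (low <= v <= high)]
    observed = [v for _, v in present]
    return {
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
        "missing_rows": missing,
        "out_of_range": out_of_range,
    }


# Hypothetical answers coded 1-7; None marks an empty cell.
answers = [3, 7, None, 1, 8, 5, 0]
report = check_range(answers, 1, 7)
print(report)
```

The row positions in the report make it easy to go back to the working copy and inspect each suspect record.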




                              29
Process of DB evaluation


     Major questions - Revise your list of major checkpoints
11. Review major checklist
     • Are there changes in definitions?
        • Changed by law?
        • By the administrators?
        • Formal or informal by data entry process?
     • Are there changes in the collection methods, data
       entry, editing of data, quality checking, and the type
       and form of files?
     • Were there changes in the users and the use of the
       data?
     • Now it is time to clean the data

                                                                30
Is perfection necessary?
• How “clean” must the data be?
• Depends on the goals – and scale -- of the
  analysis
  • How important is the actual age of an
    individual? Or…
  • How precise should the lat/longitude data be?
• Precision: Are the numbers rounded or?
  • Hope for fine-grained, not summaries or
    aggregates
  • Can be especially important with temporal and
    geographic data, i.e. What is the range(s) of
    the time scales?
                                                    31
Data Quality checkpoints

• Constancy of definitions and coding
  categories?
• Completeness:
  • How many records have unfilled cells?
  • Are the tendencies of “nulls” consistent in all
    records, variable types?
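For the first checkpoint, constancy of coding categories, a quick comparison of the codes used in two periods can help. This is a Python sketch with invented codes and years; it only shows that the set of codes changed, not why.

```python
def compare_codes(earlier, later):
    """List codes that disappeared and codes that appeared between two periods."""
    earlier_codes, later_codes = set(earlier), set(later)
    return {
        "dropped": sorted(earlier_codes - later_codes),
        "added": sorted(later_codes - earlier_codes),
    }


# Hypothetical category codes taken from one column in two yearly files.
codes_2010 = ["A", "B", "C", "B", "A"]
codes_2011 = ["A", "B", "D", "D", "E"]

changes = compare_codes(codes_2010, codes_2011)
print(changes)
```

Any dropped or added code is a question for the agency: was the definition changed by law, by administrators, or informally during data entry?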
COMMON VERIFICATION METHODS
• Counting
 Do you have the number of records indicated/promised?
• If >1,000 records, sample to test
  • To confirm your methodology
• Proportion of completed fields
 • If a record has X fields, what % of records are
   complete?
 • Are there trends of null (empty) fields?
• Draw on many Excel functions:
 • COUNTIFs or SUMIF
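The counting, completeness, and sampling checks above can also be sketched in Python. The records, the promised count of 4, and the sample size are invented; a real check would read a copy of the file you received.

```python
import csv
import io
import random

# Hypothetical extract standing in for a copy of the received file.
RAW = """id,name,dob,county
1,Ana Gonzales,1970-01-02,Bernalillo
2,Luis Gonzalez,,Santa Fe
3,,1985-07-19,
4,Mary Smith,1990-11-30,Taos
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Counting: do we have the number of records promised?
promised = 4
count_ok = len(rows) == promised


def is_complete(row):
    """A record is complete when every field holds a non-blank value."""
    return all((value or "").strip() for value in row.values())


# Proportion of completed records.
complete_share = sum(is_complete(r) for r in rows) / len(rows)

# Trends of nulls: how many empty cells per field?
empty_by_field = {
    field: sum(1 for r in rows if not (r[field] or "").strip())
    for field in rows[0]
}

# Sampling: pull a few records to check by hand against the source.
random.seed(0)  # fixed seed so the same sample can be drawn again
sample = random.sample(rows, k=2)

print(count_ok, complete_share, empty_by_field)
```

A field whose empty cells cluster in one county or one year deserves a closer look than nulls spread evenly.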

                                                         33
Data Quality Examples
ScatterPlots+BoxPlots




    Box Plots




                        35
What is a scatterplot?
• Scatterplot is often 1st step in
  analysis
• Examine relationship between
  the variables; determine if
  there are any problems/issues
  with the data
• Scatterplot indicates anything
  unique or interesting about the
  data, such as:
  • How is the data dispersed?
  • Are there outliers? A
    scatterplot is useful for
    "eyeballing" the presence of
    outliers.
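"Eyeballing" can be backed up with the rule many box plots use to draw their whiskers: flag values more than 1.5 times the interquartile range beyond the quartiles. The Python sketch below uses invented values; the 1.5 multiplier is a convention, not a rule from the slides, and quartile methods differ slightly between tools.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one suspicious entry.
ages = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
outliers = iqr_outliers(ages)
print(outliers)
```

An outlier is a prompt to check the record, not proof of an error; the 95 here could be a typing mistake or a real case.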



                                     36
Convergence of Data Quality with Data Veracity
  What is the difference?
  • Data quality is the responsibility of whoever,
    or whatever agency, is collecting or creating the
    data set
    This suggests questions journalists should
    ask about DQ
  Do methodologies differ?
Resources
• Free
  • Power Pivot – Excel 2010 add-on for working with large data sets
  • R – free software environment for statistical computing and graphics
       • Shiny – Lets R users turn analyses into interactive web applications
  • Google Refine - tool for working with messy data, cleaning it up,
     transforming it from one format into another, extending it with web services, and
     linking it to databases
  • Google Fusion Tables - an experimental data visualization web application
     to gather, visualize, and share larger data tables.
  • Tableau Public - Interact with the data, download it, or create visualizations
     of it
  • Junar - cloud-based platform for opening data
Resources
• Open Source
  • Flat File Checker - a simple, intuitive tool for validation of
    structured data in flat files (*.txt, *.csv, etc.).
  • Shiny – Lets R users turn analyses into interactive web
    applications


• Excel add-ons
• Commercial Companies & Products
   • Techspeed Data Cleansing
   • SAS® Data Quality Advanced
Resources
Professional disciplines and organizations
  • International Association for Information and Data
    Quality
  • DAMA International

  •   Forensic Accounting/ Performance Measurement
  •   National Association of Forensic Accountants (NAFA)
  •   Certified Fraud Examiner (CFE)
  •   International Forensic Accounting Association
  •   Forensic Accountants Society of North America
  •   International City/County Management Association
Forensic Accounting




                       41
Resources
Professional disciplines, organizations and others
• La Contabilidad o Auditoria Forense: un conocimiento
  básico en Colombia
• Contabilidad Forense: ¿El lado sexy de la Contaduría?
• La Contabilidad Forense
• Contabilidad Forense, una herramienta que busca la
  verdad
• Aplicación del Derecho a la Contabilidad Forense: La
  práctica indagatoria contra el delito económico

 
Esp #001-no son los documentos; son los datos-traducido
 Esp #001-no son los documentos; son los datos-traducido Esp #001-no son los documentos; son los datos-traducido
Esp #001-no son los documentos; son los datos-traducido
 
Esp #002-validación de datos en la era digital-traducido
 Esp #002-validación de datos en la era digital-traducido Esp #002-validación de datos en la era digital-traducido
Esp #002-validación de datos en la era digital-traducido
 
Esp #003-open-datamovement-traducido
 Esp #003-open-datamovement-traducido Esp #003-open-datamovement-traducido
Esp #003-open-datamovement-traducido
 
Esp #004-proceso de periodismo en el nuevo datosfera-traducido
 Esp #004-proceso de periodismo en el nuevo datosfera-traducido Esp #004-proceso de periodismo en el nuevo datosfera-traducido
Esp #004-proceso de periodismo en el nuevo datosfera-traducido
 
The Global Open Data Movement
The Global Open Data MovementThe Global Open Data Movement
The Global Open Data Movement
 
It's the people's data
It's the people's dataIt's the people's data
It's the people's data
 
The s+a3 project: leveraging analytic resources
The s+a3 project: leveraging analytic resourcesThe s+a3 project: leveraging analytic resources
The s+a3 project: leveraging analytic resources
 
It's not the documents; it's the DATA
It's not the documents; it's the DATAIt's not the documents; it's the DATA
It's not the documents; it's the DATA
 
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
IRE "Better Watchdog" workshop presentation "Data: Now I've got it, what do I...
 

Dernier

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfciinovamais
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 

Dernier (20)

Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website App
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
 

Tom johnson datavalidity-eng-nov21-arbol

  • 1. Árbol de vida de los datos [Tree of life of the data] (Data validation in the Digital Age). Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA. tom@jtjohnson.com, @jtjohnson
  • 2. Data validation in the Digital Age. Presentation by Tom Johnson at Cátedra Walter Lippmann de Periodismo y Opinión Pública, Claustro de la Universidad, Universidad del Rosario, Bogota, Colombia. Date/Time: 22 November 2012. This PowerPoint deck and Tipsheets posted at: http://sdrv.ms/wNtiM7
  • 3. Important point 1: You know more than I do. Each of you knows more about some aspect of ensuring data quality than I do.
  • 4. [Diagram: Data Set and Story. Label: The STORY!]
  • 5. [Diagram: Data Set and Collection Process. Labels: Collection Process, Data Set, The STORY!]
  • 6. [Diagram: Data Set and Validation Process. Labels: Collection Process, Data Set, Validation Process, The STORY!]
  • 7. Paying the price of bad data: Illinois and Missouri sex-offender DB • St. Louis Post-Dispatch, 2 May 1999, A11: “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST” By Reese Dunklin; Data Analysis By David Heath and Julie Luca • The Dallas Morning News, Sun, 3 Oct 2004, Page 1A: “Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say” By Diane Jennings and Darlean Spangenberger
  • 8. How bad data can do you wrong: 2011 New Mexico Sec. of State’s “questionable voters” data set, “The Big Bundle” • ~1.1m voters • Previous Sec. of State didn’t clean rolls • Matched name, address, DoB and SS# • SSA database; NM driver’s licenses • 2 variables “mismatch” = Questionable? • Asked State Police (not AG’s office) to investigate
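The "2 variables mismatch" rule on this slide can be sketched in a few lines. This is only an illustration, not the Secretary of State's actual procedure: the field names, the sample records, the threshold default and the helpers `count_mismatches`, `is_questionable` and `similarity` are all hypothetical. It shows how an exact comparison treats a one-letter spelling variant as a full mismatch.

```python
from difflib import SequenceMatcher

# Hypothetical field names; the slide lists name, address, date of birth and SS#.
FIELDS = ("name", "address", "dob", "ssn")

def count_mismatches(voter: dict, reference: dict) -> int:
    """Count fields whose values differ under exact, case-insensitive comparison."""
    return sum(
        voter[f].strip().lower() != reference[f].strip().lower()
        for f in FIELDS
    )

def is_questionable(voter: dict, reference: dict, threshold: int = 2) -> bool:
    """Flag a record when at least `threshold` fields mismatch (assumed rule)."""
    return count_mismatches(voter, reference) >= threshold

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1], useful for spotting near-identical spellings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Invented example records, not real data.
voter = {"name": "Maria Gonzales", "address": "12 Calle Sol",
         "dob": "1970-01-02", "ssn": "000-00-0000"}
dmv = {"name": "Maria Gonzalez", "address": "12 Calle Sol Apt 3",
       "dob": "1970-01-02", "ssn": "000-00-0000"}

print(count_mismatches(voter, dmv))  # name and address both differ
print(is_questionable(voter, dmv))
print(round(similarity(voter["name"], dmv["name"]), 2))
```

A record like this one would be flagged even though the two sources very likely describe the same person, which is why the definition of "error" matters before any count of "questionable" voters is reported.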
  • 9. Problems with Sec. of State methodology • What is the error rate of original DB? • Definition of “error”? (Gonzales or Gonzalez) • Sample(s) by county and state total? • Error rates of comparative DBs? • Aggregation of error problem • 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
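The "aggregation of error" point can be made concrete with a little arithmetic. The sketch below assumes that each database errs independently of the others, which is a simplifying assumption, and the error rates are invented for illustration; the real rates are exactly the open questions listed on the slide. `combined_error_rate` is a hypothetical helper name.

```python
from math import prod

def combined_error_rate(error_rates):
    """Probability that at least one source is wrong for a given record,
    assuming the sources err independently (an assumption, not a finding)."""
    return 1 - prod(1 - r for r in error_rates)

# Illustrative rates only: voter roll, SSA database, driver's license file.
rates = [0.02, 0.03, 0.01]
rate = combined_error_rate(rates)

records = 1_100_000  # roughly the ~1.1m voters in the "questionable voters" example
print(f"combined error rate: {rate:.4f}")
print(f"records expected to be touched by some error: {records * rate:,.0f}")
```

Even small per-source error rates compound when sources are matched against each other, so a mismatch count should be read against the combined rate rather than against zero.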
  • 10. DataSetCollectionProcess 0100111010100101 Collection 0010001010101001 The 0010100101001010 Process 1010010010100010 1010100100111010 Data 1001010010001010 1010010010100101 STORY! 0010101010010010 1000101010100100 Set 1110101001010010 0010101010010010 1001010010101010 0100101000101110 1010010010101010 1001001010011101 0100101001000101 0101001001010010 1001010101001001 0100010101010110 1101010010100101 10
  • 11. Data sets are living things; they have pedigree and genealogy. Important point: • Most [all?] data sets are living things. • And they have a pedigree, a genealogy, an “árbol de vida”. • Data sets live in a dynamic environment. • Understand the DB ecology.
  • 12. Data sets are living things; they have pedigree and genealogy. Important point: • NEVER work with your original data set; always work with a copy of the file(s) • More combined data sets = greater chance of error • Larger data sets = greater chance of error
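One way to follow the "always a copy" rule is to make the working copy with a script and confirm it is byte-identical to the original before any cleaning starts. This is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `make_working_copy` and any file paths passed to them are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def make_working_copy(original: Path, workdir: Path) -> Path:
    """Copy `original` into `workdir` and verify the copy matches byte for byte."""
    workdir.mkdir(parents=True, exist_ok=True)
    copy = workdir / original.name
    shutil.copy2(original, copy)  # copy2 also preserves file timestamps
    if sha256_of(original) != sha256_of(copy):
        raise IOError(f"copy of {original} does not match the original")
    return copy
```

Recording the original file's digest in the logbook also makes it possible to show later that the source file was never altered.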
  • 13. [Diagram: Types of Data. Label: Data Set]
  • 14. Data Quality = function of… • Objectives, reputation of database creator • Validity and precision of the collection/creation process, and of the resulting data • Statistical Data? • Primary Data (collected, managed by agency or individual) • Secondary (Agency or individual is using someone else’s “primary” data)
  • 15. Pyramid of significance • How to judge whether some data, and its potential stories, are more trustworthy than others? • Go back to librarians’ hierarchy of trusted sources when searching? (Has anyone tested the “quality” of data sets from those strata of sources? If not, a good research project.)
  • 16. Learn from Librarians • Evaluating Web Pages: Techniques to Apply & Questions to Ask http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html • What can the URL tell you? • Gov’t agency? Scholarly? Interest Group? Individual? • Has a reputation for accuracy been created over time?
  • 17. Learn from Librarians • Does it all add up? • Why was the page put on the web? • Inform, give facts, give data? • Explain, persuade? • Sell, entice? • Share? • Disclose? • Is the information current? When was it last updated and by whom? • If the data is available on other sites, who/what was the original creator and editor of the data?
  • 18. Hierarchy of Trust • For .gov, .edu, or .mil, the information has probably been vetted before it was posted. Websites with .gov, .edu and .mil have to be applied for, and their use is controlled. It doesn’t mean they are fool-proof, though. • ".org" is organization. Sites that end in .org are usually non-profit organizations. They can be very good sources or very poor sources; take care to research their possible agendas or political biases. • “.net” means network. • “.info” is the Internet’s first unrestricted top-level domain since .COM. There are no restrictions on who may register .INFO names. .INFO was created for general use around the world. Source: http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299
  • 19. Hierarchy of Trust • Credible websites should list contact information and resources. • If only cell phones and PO boxes = suspicion • If the author is named, find his/her web page to… • Verify educational credentials • Discover whether the writer has been published in a scholarly journal • Verify that the writer is employed by a research institution or university
  • 20. Hierarchy of Trust • Internet pages that have been published more recently are usually more credible. • Find this information at the bottom of a website, in the “about us” page, or via “view page source”
  • 21. Hierarchy of Trust • Selling something? • Asking you to sign up for something? • May not be presenting you with neutral, unbiased information.
  • 22. Hierarchy of Trust: probably reliable sites, but not necessarily reliable data. [Pyramid diagram labels: What is the site's purpose? Check the publishing date. Credible Websites/Authors. Check the domain of the URL.]
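The "check the domain of the URL" step can be automated as a first pass when many sources have to be screened. The sketch below uses Python's standard `urllib.parse`; the `TLD_NOTES` table paraphrases the domain notes from the slides and is a starting point for questions, not a verdict on any site. The helper names `top_level_domain` and `domain_note` are hypothetical.

```python
from urllib.parse import urlparse

# Rough notes paraphrased from the "Hierarchy of Trust" slides.
TLD_NOTES = {
    "gov": "registration is controlled; information probably vetted",
    "edu": "registration is controlled; information probably vetted",
    "mil": "registration is controlled; information probably vetted",
    "org": "usually non-profit; research possible agendas or biases",
    "net": "network",
    "info": "unrestricted registration",
}

def top_level_domain(url: str) -> str:
    """Return the last label of the host name, e.g. 'edu', or '' if there is none."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""

def domain_note(url: str) -> str:
    """Look up the note for a URL's top-level domain."""
    return TLD_NOTES.get(top_level_domain(url), "no note; investigate further")

print(domain_note("http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html"))
print(domain_note("https://www.socialsecurity.gov/open/havv/"))
```

As the slide says, a reliable site does not guarantee reliable data, so the note only tells you which questions to ask next.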
  • 23. CollectionProcessDataSet 0100111010100101 Collection 0010001010101001 The 0010100101001010 Process 1010010010100010 1010100100111010 Data 1001010010001010 1010010010100101 STORY! 0010101010010010 1000101010100100 Set 1110101001010010 0010101010010010 1001010010101010 0100101000101110 1010010010101010 1001001010011101 0100101001000101 0101001001010010 1001010101001001 0100010101010110 1101010010100101 23
  • 24. Process of Data Evaluation. 1. Pre-planning: • 2nd monitor • “Logbook” (bitácora) apps • Checklist of intended steps. 2. Lit. review/interview peers: • Nothing is new; everything has a precedent • How have others attacked this problem? 3. Do data fit theoretical models? - Depends on subject: traffic flow vs. crime, or educational level vs. income - Sometimes good to use non-trad. models: crime and disease
  • 25. Process of Data Evaluation. 4. Do a “critical biography” of the data: - Why was data collected? Who ordered its creation (law? Agency? Individual?) - When first collected? - News stories about the data? 5. Does biography raise critical warnings? - Have laws related to data remained the same? - Have definitions remained the same? 6. Have others run analysis of this data? - Not only journalists, but other agencies/people
  • 26. Process of Data Evaluation. 7. Acquire latest data and related documentation - Get data schema & code sheet - Get instructions to data collectors and data entry clerks
  • 27. Process of DB evaluation. [Diagram labels: Ask for copy of data-entry form & explanation sheet; computer data entry; code sheet; database schema]
  • 28. Process of Data Evaluation. 7. Acquire latest data and related documentation - Get data schema & code sheet - Get instructions to data collectors and data entry clerks. 8. Compare record layout to tables. This may tell you: - What data you did not receive - Possibly, what data is feeding into other variables or calculations. 9. Do documents specify expected ranges & frequencies? - Suggests variables to be found. If expected range is 1-7 and you find 8…
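Step 8 amounts to a set comparison between the documented record layout and the fields actually delivered. The sketch below is a minimal illustration with invented field names; `compare_layout` is a hypothetical helper, and in practice the two lists would come from the schema or code sheet and from the header row of the file you received.

```python
def compare_layout(documented_fields, received_fields):
    """Return (missing, undocumented) field names as sorted lists."""
    documented, received = set(documented_fields), set(received_fields)
    missing = sorted(documented - received)        # in the schema, not in the file
    undocumented = sorted(received - documented)   # in the file, not in the schema
    return missing, undocumented

# Hypothetical field names for illustration.
documented = ["case_id", "date", "offense_code", "severity", "officer_id"]
received = ["case_id", "date", "offense_code", "severity", "total_score"]

missing, undocumented = compare_layout(documented, received)
print("documented but not received:", missing)
print("received but not documented:", undocumented)
```

A field that appears in the file but not in the documentation, such as the invented `total_score` here, may be a calculated value, which is a prompt to ask what feeds into it.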
  • 29. Process of Data Evaluation. 10. Are data values missing or out of range? - Use an Excel (or R) formula to test “expected” ranges - =MIN(A1:A100) or =MAX(A1:A100) - Use Excel's conditional formatting feature
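The same test that =MIN() and =MAX() perform in Excel can be scripted so that it reports which rows are missing or out of range, not just the extremes. This is a sketch under simple assumptions: the column is numeric, empty strings and `None` count as missing, and the expected range of 1 to 7 is borrowed from the example on the previous slide. `check_range` is a hypothetical helper.

```python
def check_range(values, low, high):
    """Return (missing_count, out_of_range), where out_of_range lists
    (row_number, value) pairs for values outside [low, high]."""
    missing = 0
    out_of_range = []
    for row_number, raw in enumerate(values, start=1):
        if raw is None or str(raw).strip() == "":
            missing += 1
            continue
        number = float(raw)  # raises ValueError on non-numeric text, itself worth knowing
        if not low <= number <= high:
            out_of_range.append((row_number, number))
    return missing, out_of_range

# Invented column of coded responses with an expected range of 1 to 7.
column = ["3", "7", "", "8", "1", None, "0"]
missing, bad = check_range(column, low=1, high=7)
print("missing cells:", missing)
print("out of range (row, value):", bad)
```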
  • 30. Process of DB evaluation: major questions. Revise your list of major checkpoints. 10. Review major checklist • Are there changes in definitions? • Changed by law? • By the administrators? • Formally or informally by the data entry process? • Are there changes in the collection methods, data entry, editing of data, quality checking, and the type and form of files? • Were there changes in the users and the use of the data? • Now it is time to clean the data
  • 31. Is perfection necessary? • How “clean” must the data be? • Depends on the goals, and scale, of the analysis • How important is the actual age of an individual? Or… • How precise should the lat/longitude data be? • Precision: Are the numbers rounded or not? • Hope for fine-grained data, not summaries or aggregates • Can be especially important with temporal and geographic data, e.g. what is the range(s) of the time scales?
  • 32. Data Quality checkpoints • Constancy of definitions and coding categories? • Completeness: • How many records have unfilled cells? • Are the patterns of “nulls” consistent across all records and variable types?
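The completeness checkpoint can be sketched by counting empty cells field by field, which shows whether nulls cluster in particular variables. The example uses Python's standard `csv` module on a small invented table; the field names, the sample rows and the helper `empty_cells_by_field` are hypothetical.

```python
import csv
import io
from collections import Counter

def empty_cells_by_field(rows):
    """Return (record_count, Counter of empty cells per field).
    A cell counts as empty when it is None, blank or whitespace only."""
    empties = Counter()
    total = 0
    for row in rows:
        total += 1
        for field, value in row.items():
            if value is None or value.strip() == "":
                empties[field] += 1
    return total, empties

# Invented sample data standing in for a real CSV file.
sample = io.StringIO(
    "id,name,dob,county\n"
    "1,Ana,1980-02-01,\n"
    "2,,1975-11-30,Taos\n"
    "3,Luis,,\n"
)
total, empties = empty_cells_by_field(csv.DictReader(sample))
for field, count in empties.most_common():
    print(f"{field}: {count}/{total} empty")
```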
  • 33. COMMON VERIFICATION METHODS • Counting: Do you have the number of records indicated/promised? • If >1,000 records, sample to test • To confirm your methodology • Proportion of completed fields • If a record has X fields, what % of records are complete? • Are there trends of null (empty) fields? • Draw on many Excel functions: • COUNTIFS or SUMIF
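The counting and sampling checks above can be combined in one small routine: compare the record count with what was promised, draw a random sample when the set is large, and measure the share of fully completed records. This is a sketch only; `completeness_report` is a hypothetical helper, and the 1,000-record sample size simply echoes the threshold on the slide.

```python
import random

def completeness_report(records, promised_count, sample_size=1000, seed=0):
    """Return (count_matches, sample_length, share_of_complete_records)."""
    count_matches = len(records) == promised_count
    if len(records) > sample_size:
        rng = random.Random(seed)  # fixed seed so the sample can be reproduced
        sample = rng.sample(records, sample_size)
    else:
        sample = list(records)
    complete = sum(
        all(v is not None and str(v).strip() != "" for v in rec.values())
        for rec in sample
    )
    share = complete / len(sample) if sample else 0.0
    return count_matches, len(sample), share

# Invented records: two of the four have an empty field.
records = [
    {"id": "1", "county": "Taos"},
    {"id": "", "county": "Santa Fe"},
    {"id": "3", "county": "Luna"},
    {"id": "4", "county": None},
]
print(completeness_report(records, promised_count=4))
```

Fixing the random seed is a design choice for reproducibility: a colleague re-running the check draws the same sample and can confirm the same figures.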
  • 35. Scatterplots and box plots. [Figure: Box Plots]
  • 36. What is a scatterplot? • A scatterplot is often the 1st step in analysis • Examine the relationship between the variables; determine if there are any problems/issues with the data • A scatterplot indicates anything unique or interesting about the data, such as: • How is the data dispersed? • Are there outliers? A scatterplot is useful for "eyeballing" the presence of outliers.
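Eyeballing a plot can be backed up with the numeric rule that box plots conventionally use: flag values more than 1.5 times the interquartile range beyond the first or third quartile. The sketch below applies that rule with Python's standard `statistics.quantiles`; the multiplier of 1.5 is the common convention rather than anything stated in the slides, the values are invented, and `iqr_outliers` is a hypothetical helper.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot fences."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

ages = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # invented values
print(iqr_outliers(ages))
```

A flagged value is a question, not an answer: it may be a data-entry error or the most interesting record in the set.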
  • 37. Convergence of Data Quality with Data Veracity. What is the difference? • Data quality is the responsibility of whoever (person or agency) is collecting or creating the data set. This suggests questions journalists should ask about DQ. Do methodologies differ?
  • 38. Resources • Free • Power Pivot: Excel 2010 add-on for working with large data sets • R: free software environment for statistical computing and graphics • Shiny: lets R users turn analyses into interactive web applications • Google Refine: tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases • Google Fusion Tables: an experimental data visualization web application to gather, visualize, and share larger data tables • Tableau Public: interact with the data, download it, or create visualizations of it • Junar: cloud-based platform for opening data
  • 39. Resources • Open Source • Flat File Checker - a simple, intuitive tool for validation of structured data in flat files (*.txt, *.csv, etc.). • Shiny – Lets R users turn analyses into interactive web applications • Excel add-ons • Commercial Companies & Products • Techspeed Data Cleansing • SAS® Data Quality Advanced
  • 40. Resources Professional disciplines and organizations • International Association for Information and Data Quality • DAMA International • Forensic Accounting/ Performance Measurement • National Association of Forensic Accountants (NAFA) • Certified Fraud Examiner (CFE) • International Forensic Accounting Association • Forensic Accountants Society of North America • International City/County Management Association
  • 42. Resources (in Spanish): professional disciplines, organizations and others • La Contabilidad o Auditoria Forense: un conocimiento básico en Colombia • Contabilidad Forense: ¿El lado sexy de la Contaduría? • La Contabilidad Forense • Contabilidad Forense, una herramienta que busca la verdad • Aplicación del Derecho a la Contabilidad Forense: La práctica indagatoria contra el delito económico
  • 43. Árbol de vida de los datos (Data validation in the Digital Age). Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA. tom@jtjohnson.com, @jtjohnson