Diagnosing
Dirty Data
Jaimi Dowdell, IRE/NICAR
Jennifer LaFleur, ProPublica
Get your data's history
• Know the source of the data
• Know how it's used
• Know what all the fields mean
• Know what other stories have
been done with it
What is dirty data?
• Missing records
• Incorrect information
• Duplicate information
• No standardization
Take your data's
temperature
• How many records should you have?
• Double-check totals or counts. Check for
studies/summary reports.
• Check for duplicates. Make sure they are
real duplicates. Is it possible that there are
hidden duplicates?
• Consistency-check all fields. Are all
city/county names spelled the same? Are
all codes found within documentation?
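A minimal sketch of these temperature checks in Python, using only the standard library. The field names (`name`, `city`), the sample rows, and the expected count are all made up for illustration. Normalizing case and whitespace is just one assumed way to surface hidden duplicates, so review every match by hand before treating it as real.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; the field names are assumptions for illustration.
rows = [
    {"name": "Acme Corp", "city": "St. Louis"},
    {"name": "ACME CORP ", "city": "Saint Louis"},
    {"name": "Baker LLC", "city": "Columbia"},
    {"name": "Baker LLC", "city": "Columbia"},
]

def count_matches(rows, expected):
    """Compare the record count to the number you were told to expect."""
    return len(rows) == expected

def normalize(value):
    """Collapse case and extra whitespace so near-matches line up."""
    return " ".join(value.split()).lower()

def duplicate_groups(rows, fields):
    """Group row indexes that share the same normalized values in `fields`."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(normalize(row[f]) for f in fields)
        groups[key].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

def value_counts(rows, field):
    """Tally every distinct spelling in one field so variants stand out."""
    return Counter(row[field] for row in rows)

print(count_matches(rows, expected=4))
print(duplicate_groups(rows, ["name"]))
print(duplicate_groups(rows, ["name", "city"]))
print(value_counts(rows, "city"))
```

Matching on `name` alone pairs up the two Acme rows, but matching on `name` and `city` does not, because the city is spelled two ways. That is the kind of hidden duplicate worth a closer look.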
Internal consistency
checks
• Is there more money going to sub-contractors than went to
the prime contractor?
• Are there more teachers than students?
• How about other important fields?
• Check the range of fields. (For example, check for DOBs
that would make people too old or too young.)
• Check for missing data or blank fields. Are they real values,
or did something happen with an import or append query?
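The checks above could be sketched in Python roughly as follows. The record layouts, the field names, the reference date, and the 0 to 110 age limits are all assumptions chosen for illustration, not rules from any real dataset.

```python
from datetime import date

# Hypothetical records; all field names are assumptions for illustration.
contracts = [
    {"id": "C1", "prime_amount": 100000.0, "sub_amount": 40000.0},
    {"id": "C2", "prime_amount": 50000.0, "sub_amount": 75000.0},
]
people = [
    {"id": "P1", "dob": date(1980, 5, 17)},
    {"id": "P2", "dob": date(1885, 1, 1)},
    {"id": "P3", "dob": None},
]

def subs_exceed_prime(contracts):
    """Return ids where subcontractors got more than the prime contractor."""
    return [c["id"] for c in contracts if c["sub_amount"] > c["prime_amount"]]

def age_on(dob, as_of):
    """Whole years between dob and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (dob.month, dob.day)
    return as_of.year - dob.year - (0 if had_birthday else 1)

def dob_out_of_range(people, as_of, min_age=0, max_age=110):
    """Return ids whose age falls outside an assumed plausible range."""
    flagged = []
    for p in people:
        if p["dob"] is None:
            continue  # blanks are reported separately by blank_fields()
        age = age_on(p["dob"], as_of)
        if age < min_age or age > max_age:
            flagged.append(p["id"])
    return flagged

def blank_fields(records, field):
    """Return ids where the field is missing, None, or only whitespace."""
    return [r["id"] for r in records
            if r.get(field) is None or str(r.get(field)).strip() == ""]

as_of = date(2013, 6, 20)  # assumed reference date for computing ages
print(subs_exceed_prime(contracts))
print(dob_out_of_range(people, as_of))
print(blank_fields(people, "dob"))
```

A flagged record is a question for the agency, not proof of an error. Blank values in particular may be real or may be the result of a bad import.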
External Checks
• Compare to reports
• Data reported to other agencies
• On-the-ground reporting
• Verification from sources
Steps for cleaning data
• Assess the problem
• Identify your goal
• Find the right tool for the job
• Set aside time (double what you think)
• Make a backup copy
• Make a backup copy
• Never alter the original data. Make new
columns so you can compare and show
your work.
• Create an audit trail.
• Spot check as you go.
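One possible way to follow the last three steps in Python is sketched below: the cleaned value goes into a new column, the received value stays as it was, and every change is logged. The `city` field, the `city_clean` column name, and the `CITY_FIXES` lookup table are hypothetical.

```python
import copy

# Hypothetical raw rows; "city" and the lookup table are assumptions.
raw_rows = [
    {"id": 1, "city": " st. louis"},
    {"id": 2, "city": "Saint Louis "},
    {"id": 3, "city": "Columbia"},
]

# Hand-built standardization table; extend it as you find new variants.
CITY_FIXES = {"st. louis": "St. Louis", "saint louis": "St. Louis"}

def clean_city(rows, fixes):
    """Return (cleaned_rows, audit_log); the input rows are left untouched."""
    cleaned = copy.deepcopy(rows)  # work on a copy, never the original
    audit_log = []
    for row in cleaned:
        original = row["city"]
        trimmed = " ".join(original.split())
        new_value = fixes.get(trimmed.lower(), trimmed)
        row["city_clean"] = new_value  # new column; "city" stays as received
        if new_value != original:
            audit_log.append({"id": row["id"], "field": "city",
                              "old": original, "new": new_value})
    return cleaned, audit_log

cleaned, audit_log = clean_city(raw_rows, CITY_FIXES)
for entry in audit_log:
    print(entry)
```

Keeping `city` and `city_clean` side by side makes spot checks easy, and the audit log doubles as notes for the data notebook.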
Tips for success
• Keep a data notebook
• Duplicate your work
• Duplicate your work
• Bounce your results off folks who really know
the data
• Set up some standards for your
work/newsroom
Choose the right
tool
• You don't need to be fancy, just get the job done
• Work with what you're comfortable with
• Don't forget the power of Excel
• Text editors can be lifesavers
• Many tools exist: OpenRefine, programming, etc.
• Get training as needed
Focus is important
So get plenty
of food and rest
Get a data
buddy
Common ailments
Dates that aren't dates
Names, names, names...
Location matters
Leading and trailing spaces
"Pretty" reports
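Two of these ailments, dates that aren't dates and leading or trailing spaces, lend themselves to a quick check in code. The Python sketch below is an illustration only. The list of date formats is an assumption (including the U.S. month-first order), and the sample values are invented.

```python
from datetime import datetime

# Formats to try, in order. This list is an assumption; build yours from the data.
DATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d", "%m/%d/%y", "%B %d, %Y"]

def parse_date(text):
    """Return a date if the text matches a known format, else None."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next format
    return None

def has_stray_spaces(text):
    """True if the value has leading or trailing whitespace."""
    return text != text.strip()

samples = ["06/20/2013", " 2013-06-20 ", "June 20, 2013", "20130620", "N/A"]
for s in samples:
    print(repr(s), parse_date(s), has_stray_spaces(s))
```

Values that come back as `None` are the ones to inspect by hand. They may be a format missing from the list or text that was never a date at all.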
Inoperable data: Pain management
• Explain caveats
• Choose your wording carefully
• Know when to leave out records
• Be transparent
• Know what questions can and can't be
answered with this dataset
• Know when to get more information
Continue learning about dirty data: Sat. 3:40 p.m.
Conference Room 11
BYOD (Bring your own data): Sat. 4:50 p.m.,
Conference Room 11
Get your hands dirty
Jennifer.lafleur@propublica.org (@j_la28)
jaimi@ire.org (@jaimidowdell)
Questions?

Diagnosing dirty data_ire2013
