Diagnosing
Dirty Data
Jaimi Dowdell, IRE/NICAR
Jennifer LaFleur, ProPublica
Get your data's history
• Know the source of the data
• Know how it's used
• Know what all the fields mean
• Know what other stories have
been done with it
What is dirty data?
• Missing records
• Incorrect information
• Duplicate information
• No standardization
Take your data's
temperature
• How many records should you have?
• Double-check totals or counts. Check for
studies/summary reports.
• Check for duplicates. Make sure they are
real duplicates. Is it possible that there are
hidden duplicates?
• Consistency-check all fields. Are all
city/county names spelled the same? Are
all codes found within documentation?
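
The checks above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample records, the field names (`name`, `city`) and the helper functions are all hypothetical, and the "hidden duplicate" test here only catches differences in case and whitespace.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions for illustration.
records = [
    {"name": "Ann Lee", "city": "St. Louis"},
    {"name": "Ann Lee", "city": "St. Louis"},      # exact duplicate
    {"name": "ANN LEE ", "city": "Saint Louis"},   # possible hidden duplicate
    {"name": "Bob Ray", "city": "Columbia"},
]

def normalize(value):
    """Lowercase and collapse whitespace so near-identical values compare equal."""
    return " ".join(value.lower().split())

def exact_duplicates(rows):
    """Return rows (as tuples of items) that appear more than once, with counts."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

def hidden_duplicates(rows, key):
    """Group raw values that differ as typed but match once normalized."""
    groups = {}
    for r in rows:
        groups.setdefault(normalize(r[key]), set()).add(r[key])
    return {k: v for k, v in groups.items() if len(v) > 1}

record_count = len(records)                 # compare with the count you expected
dupes = exact_duplicates(records)           # rows repeated verbatim
hidden = hidden_duplicates(records, "name") # same person, typed differently?
city_spellings = Counter(r["city"] for r in records)  # eyeball for variants
```

A tally like `city_spellings` will not decide whether "St. Louis" and "Saint Louis" are the same place; it only puts the variants side by side so a person can make that call.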
Internal consistency
checks
• Is there more money going to sub-contractors than went to
the prime contractor?
• Are there more teachers than students?
• How about other important fields?
• Check the range of fields. (For example, check for DOBs
that would make people too old or too young.)
• Check for missing data or blank fields. Are they real values,
or did something happen with an import or append query?
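
As a rough sketch, internal checks like these can be written as one function that returns a list of flags per row. Everything specific here is an assumption made for illustration: the field names, the 18 to 100 age range, and the fixed `as_of` date used to compute age.

```python
from datetime import date

# Hypothetical rows; every field name here is an assumption.
schools = [
    {"school": "A", "teachers": 20, "students": 300, "principal_dob": "1970-05-01"},
    {"school": "B", "teachers": 45, "students": 12, "principal_dob": "2015-01-01"},
    {"school": "C", "teachers": 10, "students": 150, "principal_dob": ""},
]

def flag_row(row, as_of=date(2013, 6, 1), min_age=18, max_age=100):
    """Return a list of human-readable problems found in one row."""
    flags = []
    if row["teachers"] > row["students"]:
        flags.append("more teachers than students")
    dob = row["principal_dob"].strip()
    if not dob:
        # A blank may be a real "unknown" or the result of a bad import.
        flags.append("blank DOB")
    else:
        born = date.fromisoformat(dob)
        # Subtract one if the birthday has not yet occurred in the as_of year.
        age = as_of.year - born.year - (
            (as_of.month, as_of.day) < (born.month, born.day)
        )
        if not min_age <= age <= max_age:
            flags.append("DOB out of range")
    return flags

flagged = {r["school"]: flag_row(r) for r in schools}
```

A flag is a prompt to go look, not proof of an error: a tiny specialized school could plausibly have more teachers than students.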
External Checks
• Compare to reports
• Data reported to other agencies
• On-the-ground reporting
• Verification from sources
Steps for cleaning data
• Assess the problem
• Identify your goal
• Find the right tool for the job
• Set aside time (double what you think)
• Make a backup copy
• Make a backup copy
• Never alter the original data. Make new
columns so you can compare and show
your work.
• Create an audit trail.
• Spot check as you go.
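
One way to follow the "never alter the original" rule in code is to write cleaned values into a new column and log every change. The sketch below assumes a hypothetical `city` field and a hand-built lookup table (`CITY_FIXES`); the names and the structure of the audit log are illustrative choices, not a standard.

```python
import copy

# Hypothetical raw rows and lookup table; all names are assumptions.
raw = [
    {"id": 1, "city": " st. louis"},
    {"id": 2, "city": "Saint Louis"},
    {"id": 3, "city": "Columbia"},
]
CITY_FIXES = {"st. louis": "St. Louis", "saint louis": "St. Louis"}

def clean_city(rows, fixes):
    """Return (backup, cleaned, audit) without modifying the input rows."""
    backup = copy.deepcopy(rows)    # untouched copy of the original
    cleaned = copy.deepcopy(rows)   # working copy that gets the new column
    audit = []
    for row in cleaned:
        key = " ".join(row["city"].lower().split())
        new_value = fixes.get(key, row["city"].strip())
        row["city_clean"] = new_value           # new column; "city" stays as received
        if new_value != row["city"]:
            audit.append((row["id"], row["city"], new_value))
    return backup, cleaned, audit

backup, cleaned, audit = clean_city(raw, CITY_FIXES)
```

Keeping `city` and `city_clean` side by side makes spot checks easy, and the `audit` list records which rows changed and how, so the work can be shown later.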
Tips for success
• Keep a data notebook
• Duplicate your work
• Duplicate your work
• Bounce your results off folks who really know
the data
• Set up some standards for your
work/newsroom
Choose the right
tool
• You don't need to be fancy, just get the job done
• Work with what you're comfortable with
• Don't forget the power of Excel
• Text editors can be lifesavers
• Many tools exist - OpenRefine, programming, etc.
• Get training as needed
Focus is important
So get plenty
of food and rest
Get a data
buddy
Common ailments
Dates that aren't dates
Names, names, names...
Location matters
Leading and trailing spaces
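
Two of these ailments, dates stored as text and stray spaces, can be illustrated with a short Python sketch. The list of accepted formats is an assumption and should be adjusted to the data at hand; values that match none of them are returned as `None` rather than guessed.

```python
from datetime import datetime

# Formats to try, in order. This list is an assumption; adjust it to your data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def parse_date(text):
    """Return a date if the trimmed text matches a known format, else None."""
    trimmed = text.strip()   # leading and trailing spaces break exact matches
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(trimmed, fmt).date()
        except ValueError:
            continue
    return None

samples = [" 2013-06-20", "06/20/2013 ", "20130620", "N/A"]
parsed = [parse_date(s) for s in samples]
unparsed = [s for s, d in zip(samples, parsed) if d is None]
```

Anything left in `unparsed` needs a human decision. Note also that a value such as 06/07/2013 parses without complaint even though it could mean June 7 or July 6, so check the documentation for the source's date convention.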
"Pretty" reports
Inoperable data: Pain management
• Explain caveats
• Choose your wording carefully
• Know when to leave out records
• Be transparent
• Know what questions can and can't be
answered with this dataset
• Know when to get more information
Continue learning about dirty data: Sat. 3:40 p.m.
Conference Room 11
BYOD (Bring your own data): Sat. 4:50 p.m.,
Conference Room 11
Get your hands dirty
Jennifer.lafleur@propublica.org (@j_la28)
jaimi@ire.org (@jaimidowdell)
Questions?

Diagnosing dirty data_ire2013
