The document discusses proper data management practices for research. It notes that researchers often do quick manual cleanups of data when they first receive it but do not implement robust data management practices. This can lead to problems when working with additional data sets later on. The document advocates establishing good data management practices like tracking changes made to the data over time so analyses can be replicated and problems in the data or analyses identified. It also discusses the benefits of keeping data organized and in a "tidy" format to facilitate analysis.
1. DATA MANAGEMENT
SCI 2777 • Storytelling with Data • Spring 2014
Sister Edith Bogue • The College of St Scholastica
2. DISPOSABLE DATA MANAGEMENT
• Researchers know they need clean, reliable data
• The analysis is what really interests them
• When data arrive, they do a quick manual clean-up of any problems they see.
• Often cut-and-paste in spreadsheets
• Look for and fix anomalies
• If no errors crop up in the analysis, they make a clean archive copy and forget about the data.
The Perils of Disposable Data Management from Prometheus Research blog at
https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
3. DISPOSABLE DATA MANAGEMENT
• PROBLEM #1: More data arrive, and they have to do the same cut-and-paste / sorting / combining operations all over again.
• PROBLEM #2: An anomaly appears in a later data set. The researcher has to check all the earlier data to find out whether it is there too. It turns out it was a cut-and-paste error.
• PROBLEM #3: The results look peculiar, or are opposite to the prediction. Was it the data handling, or is it real?
The Perils of Disposable Data Management from Prometheus Research blog at
https://www.prometheusresearch.com/the-perils-of-disposable-data-management/
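One way around Problem #1 is to write the clean-up down as a small script, so the same steps run on every batch that arrives. The sketch below is only an illustration in Python: the column names (participant_id, consented), the clean-up rules, and the two tiny batches are all made up for the example and do not come from the blog post.

```python
import csv
import io


def load_batch(text):
    """Read one batch of CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Apply the same clean-up rules to every batch of data."""
    cleaned = []
    for row in rows:
        # Strip stray whitespace from every column name and value.
        row = {key.strip(): value.strip() for key, value in row.items()}
        # Drop rows with no participant id (hypothetical required field).
        if not row["participant_id"]:
            continue
        # Normalize a hypothetical yes/no column to lowercase.
        row["consented"] = row["consented"].lower()
        cleaned.append(row)
    return cleaned


# Two made-up batches standing in for data that arrive at different times.
batch_1 = "participant_id,consented\n 101 ,Yes\n,no\n102,NO\n"
batch_2 = "participant_id,consented\n103, yes\n"

# The same function handles both batches; no cut-and-paste is repeated.
all_rows = clean_rows(load_batch(batch_1)) + clean_rows(load_batch(batch_2))

for row in all_rows:
    print(row)
```

Because the rules live in one function, a later anomaly (Problem #2) can be checked by re-running the script on the earlier raw files rather than inspecting them by hand.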
4. GOOD DATA PRACTICES
• “It’s common to spend many tedious and frustrating hours cleaning and wrangling your data into a usable format, followed by careful exploration to provide context and reveal potential problems with the analyses you want to run.”
• “Data cleaning and data transformation are two major bottlenecks in data analysis.”
Good Data Management Practices for Data Analysis from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
5. DATA CLEANING
It should be no surprise that it takes longer to clean messier data. Unfortunately, there are many ways that data can be messy. Powerful tools and practices can help you turn messy data into clean data.
Good Data Management Practices for Data Analysis from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
6. DATA TRANSFORMATION
“This is more subtle. It’s often important to visualize and model the data in various ways when conducting an analysis. I’m not talking about going on fishing expeditions, but rather about familiarizing yourself with the data… The point is that frequent data transformations are required to mediate changes between these representations, introducing an underappreciated amount of friction in analysis.”
Good Data Management Practices for Data Analysis from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-introduction-part-1/
7. TIDY DATA
• Each variable forms a column
• Each observation forms a row
• Each data set contains information on only one observational unit of analysis (e.g., families, participants, participant visits)
Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
8. MESSY DATA
• Column names represent data values instead
of variable names
• A single column contains data on multiple
variables instead of a single variable
• Variables are contained in both rows and
columns instead of just columns
• A single table contains more than one
observational unit
• Data about an observational unit is spread
across multiple data sets
Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
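The first kind of mess (column names that are really data values) can be fixed by turning each of those columns into rows. The sketch below shows the idea in plain Python. The helper name melt, the country labels, and the numbers are invented for illustration; they are not World Bank figures.

```python
def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row (wide format to long format)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                variable_name: column,  # the old column name becomes a data value
                value_name: value,
            })
    return tidy


# Messy: the column names "2010" and "2011" are values of a variable (year).
messy = [
    {"country": "A", "2010": 5, "2011": 7},
    {"country": "B", "2010": 3, "2011": 4},
]

# Tidy: one row per country-year observation, one column per variable.
tidy = melt(messy, "country", "year", "value")

for record in tidy:
    print(record)
```

After the reshaping, each variable (country, year, value) has its own column and each observation has its own row, which matches the tidy data rules.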
9. TIDY TOOLS
• Tidy tools are those that
accept, manipulate, and return tidy data.
• Tidy tools are like Lego blocks—individually
simple but flexible & powerful in combination.
• What tools are tidy?
• Most functions in R
• Most transformations in SPSS or SAS
• Relational databases (an entire skill of its own)
• Spreadsheets are not tidy tools
Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
10. SCI 2777
• We will learn about cleaning data first with
untidy tools: spreadsheets and the like.
• They are more familiar and easy to use right away
• We will learn how to track the provenance even
with our untidy tools.
• Soon, we will use R for some tasks, and get some
basic skills for using a tidy tool for cleaning data.
Good Data Management Practices for Data Analysis: Tidy Data (Part 2) from Prometheus Research blog at
https://www.prometheusresearch.com/good-data-management-practices-for-data-analysis-tidy-data-part-2/
12. THOMAS HERNDON
• Third-year economics grad student at UMass-Amherst (age 28)
• Class assignment: replicate the findings of a published study.
• Growth in a Time of Debt by Reinhart & Rogoff in American Economic Review
• Finding: Growth drops off sharply if debt is high
• Basis for austerity economics
• Could not replicate
• Found 3-4 errors.
Photo: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm
Herndon et al. (2013) Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff. PERI Working Papers Number 322.
http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf
13. “There were actually four errors all together. Any one error by itself would not have been enough to cause the negative average. It was the combined effect of all four of them: They interacted with each other and amplified each other—almost like a perfect storm of errors.”
Quote from: The 28-Year-Old Who Caught the Excel Error Heard Round the World. In These Times http://bit.ly/Lz2eDm
Researchers Finally Replicated Reinhart-Rogoff, and There Are Serious Problems
from Next New Deal at http://bit.ly/1f1XUHG
14. DATA PROVENANCE
• Main goals
• Keep a record
• Be able to replicate your steps
• Facilitate collaboration (most data work uses a team)
• Versioning
• Some software automatically keeps old versions of files
• Google Docs (online files) does this
• Dropbox also syncs files across all your devices and keeps a local copy on each computer (i.e., one you can use when there is no internet)
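Keeping a record can be as simple as a log that notes what was done, when, and a fingerprint of the data before and after each step. The sketch below is one possible approach in Python, not a standard tool: the function names, the log fields, and the tiny data set are assumptions made for the example.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hash that changes whenever the data change."""
    return hashlib.sha256(data).hexdigest()


def log_step(log, description, before: bytes, after: bytes) -> bytes:
    """Append one cleaning step to the provenance log and return the new data."""
    log.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "step": description,
        "sha256_before": fingerprint(before),
        "sha256_after": fingerprint(after),
    })
    return after


provenance = []

# Made-up raw data with a stray space after one comma.
raw = b"country,gdp\nA, 5\nB,3\n"

# Each clean-up step is recorded together with what it did to the data.
step1 = log_step(
    provenance,
    "remove spaces after commas",
    raw,
    raw.replace(b", ", b","),
)

# The log can be saved next to the data so a teammate can retrace the steps.
print(json.dumps(provenance, indent=2))
```

Comparing fingerprints lets a collaborator confirm they are starting from the same file and ending with the same result after repeating the steps.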
15. TODAY
• Look at the World Bank Data visually: what do we
notice?
• World Bank Data: computing variables in a spreadsheet using the School of Data instructions.
• Getting your first look at graphs using the School of Data instructions.
• Seeing versions of files in Google Drive
16. GOALS BY JANUARY 29
• Clean data from the World Bank
• First graphs of variables
• Practice in dreaming up analyses
• Beginning to find our own data
• Basic Descriptive Statistics in ALEKS
• Basic Graphics in ALEKS
• FUN with Design
• First thoughts about your projects