18 cleaning

Garrett Grolemund
PhD Student / Rice University
Department of Statistics
Data cleaning

1. Intro to data cleaning
2. What you can’t fix
3. What you can fix
4. Intro to reshape

Your turn
Do you think men or women leave a larger
tip when dining out? What data would
you collect to test this belief? What would
prompt you to change your belief?

Data Analysis
Data
Residuals
Model
Compare
Visualize
Transform

Data Cleaning
Data
Residuals
Model
Compare
Visualize
Transform

“Happy families are all alike;
every unhappy family is
unhappy in its own way.”
—Leo Tolstoy

“Clean datasets are all alike;
every messy dataset is
messy in its own way.”
—Hadley Wickham

Clean data is:
Complete
Correct
(factual and internally consistent)
Concise
Compatible
(required variables: observations in rows, one column per
variable)

Correct
Can’t restore incorrect values without
original data but can remove clearly
incorrect values
Options:
Remove entire row
Mark incorrect value as missing (NA)
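
The two options can be sketched in base R. This is a minimal illustration on a hypothetical toy data frame (`bills` and its values are invented for the example), and the rule "a tip larger than the bill is clearly incorrect" is an assumption chosen only to have something to flag.

```r
# Hypothetical toy data: the 250.00 tip on a 20.00 bill is assumed
# to be a recording error
bills <- data.frame(
  total_bill = c(18.50, 20.00, 32.10),
  tip        = c(3.00, 250.00, 5.00)
)

# Assumed rule for this sketch: a tip larger than the bill is incorrect
bad <- bills$tip > bills$total_bill

# Option 1: remove the entire row
dropped <- bills[!bad, ]

# Option 2: keep the row but mark the incorrect value as missing (NA)
marked <- bills
marked$tip[bad] <- NA
```

Option 2 keeps the rest of the row (here, the bill) available for analyses that do not need the tip.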

When two rows present the same
information with different values, at least
one row is wrong.
Whenever there is inconsistency, you are
going to have to make some tradeoff to
ensure concision.
Detecting inconsistency is not always
easy.
Inconsistency = incorrect

General strategy
To find incorrect values you need to be
creative, combining graphics and data
processing.

Tipping data
One waiter recorded information
about each tip he received over a
period of a few months
244 records
Do men or women tip more?

Your turn
Subset the tipping data to include only
rows without NA’s. Judge whether you
think all of the data points are correct.
How will you make your decision?

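library(ggplot2)  # provides qplot()

tips <- read.csv("tipping.csv", stringsAsFactors = FALSE)
summary(tips)

# keep only rows where both smoking indicators are recorded
tips <- subset(tips, !is.na(smoker) & !is.na(non_smoker))

# look for implausible values
qplot(tip, data = tips, binwidth = .5)
qplot(total_bill, data = tips, binwidth = 2)
qplot(total_bill, tip, data = tips)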

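# the male and female counts should add up to the number of rows
nrow(tips)
sum(tips$male)
sum(tips$female)
# rows where exactly one of the two indicators is set
subset(tips, male != female)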
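
The same kind of check can be turned around to list the inconsistent rows directly. A minimal sketch on a hypothetical toy data frame, assuming `male` and `female` are 0/1 indicator columns as in the tipping data:

```r
# Hypothetical toy data with 0/1 indicator columns
toy <- data.frame(
  customer_ID = 1:4,
  male        = c(1, 0, 1, 0),
  female      = c(0, 1, 1, 0)
)

# A consistent row has exactly one of the two indicators set
consistent <- toy$male + toy$female == 1

# Rows to inspect: both indicators set, or neither set
toy[!consistent, ]
sum(!consistent)
```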

Concise
(each fact represented once)
Repeating facts:
1. wastes memory
2. creates opportunities for inconsistency
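
One simple way to look for repeated facts is base R's `duplicated()`. A minimal sketch on a hypothetical toy data frame, covering only the case where a whole row was entered twice:

```r
# Hypothetical toy data: customer 2 was entered twice
toy <- data.frame(
  customer_ID = c(1, 2, 2, 3),
  tip         = c(3.0, 2.5, 2.5, 4.0)
)

# TRUE for every row that repeats an earlier row exactly
repeats <- duplicated(toy)

# Keep each fact once
concise <- toy[!repeats, ]
```

If the two rows for customer 2 had different tips, `duplicated(toy)` would not flag them; that case is an inconsistency rather than a simple repeat.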

Compatible
(Data is compatible with your analysis
in both form and fact)
1. Do you have the relevant variables for
your analysis?

This often requires some type of calculation.
For example,
proportion = successes / attempts
Avg score per game per team = ?
join(), transform(), summarise(), ddply(), plyr
address this need
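
Both calculations can be sketched on a hypothetical toy data frame (`games` and all of its values are invented for the example). To stay dependency-free the sketch uses base R's `transform()` and `aggregate()`; plyr's `ddply()` with `summarise()` would be an alternative for the group-wise step.

```r
# Hypothetical toy data: one row per team per game
games <- data.frame(
  team      = c("A", "A", "B", "B"),
  game      = c(1, 2, 1, 2),
  successes = c(8, 6, 5, 9),
  attempts  = c(10, 12, 10, 10),
  score     = c(21, 14, 17, 28)
)

# Row-wise derived variable
games <- transform(games, proportion = successes / attempts)

# Group-wise summary: average score per game for each team
avg_score <- aggregate(score ~ team, data = games, FUN = mean)
```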

Compatible
(Data is compatible with your analysis
in both form and fact)
2. Is the data in the right form for your
analysis and visualization tools? (reshape)

Variables
in columns
(1 column per variable)

Your turn
What are the variables in tipping.csv?
How are they arranged in rows and
columns? Can you form the variables into
two groups?

install.packages("reshape")
library(reshape)
library(stringr)
head(tips)

Molten data
We can use melt to put each
variable into its own column.
“Protect” the good columns.
“Melt” the offending columns.
Then subset.

Two types of variables
1. ID variables - identify the object that
measurements will take place on (we
know these before the experiment)
2. Measured variables - the features of
the object that will be measured (we have
to do an experiment to observe these)

Object: Batman
ID variables: Bruce Wayne, SSN: 555-89-3000
Measured variables: Height (6'1"), IQ (180), Age (71)

ID variables: Gotham City + male + Top 1% tax bracket

Identifier variable         Measured variable
Index of random variable    Random variable
Dimension                   Measure
Experimental design         Measurement
predictors (Xi)             response (Y)

Molten data
Molten data collapses all the
measured variables into two
columns: 1) the variable being
measured and 2) the value.
Sometimes called “long” form.
To protect a column from being
melted, label it as an id variable.
reshape::melt(data, id)
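
To make the shape of molten data concrete, here is a minimal base-R sketch of the idea on a hypothetical toy data frame. It is an illustration of what melting produces, not the reshape package's implementation.

```r
# Hypothetical wide data: one id column, two measured columns
wide <- data.frame(
  person = c("a", "b", "c"),
  height = c(180, 165, 172),
  weight = c(80, 60, 70)
)

id_var   <- "person"
measured <- setdiff(names(wide), id_var)

# Stack the measured columns: one row per (id, variable) pair
molten <- data.frame(
  person   = rep(wide[[id_var]], times = length(measured)),
  variable = rep(measured, each = nrow(wide)),
  value    = unlist(wide[measured], use.names = FALSE)
)
molten
```

The three wide rows become six long rows: the id column is repeated once per measured variable, and the two measured columns collapse into `variable` and `value`.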

# protect the good columns; melt the remaining ones (male, female)
tips1 <- melt(tips, id = c("customer_ID", "total_bill", "tip",
                           "smoker", "non_smoker"))
# assign an appropriate variable name
names(tips1)[6] <- "sex"
# subset out unwanted rows
tips1 <- subset(tips1, value == 1)
# reorder the columns and drop value
tips1 <- tips1[ , c(1,2,6,4,5,3)]

Your turn
Use melt to fix the smoking variable. One
column should be enough to record
whether a person smokes or not.
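
One possible approach, sketched on a hypothetical toy data frame rather than the real tipping data. It assumes the reshape package is installed and that `smoker` and `non_smoker` are 0/1 indicators; the column name `smoking` is made up for the example.

```r
library(reshape)  # provides melt(); assumes the package is installed

# Hypothetical toy data in the same shape as the smoking columns
toy <- data.frame(
  customer_ID = 1:4,
  tip         = c(1.0, 3.5, 2.0, 5.0),
  smoker      = c(1, 0, 0, 1),
  non_smoker  = c(0, 1, 1, 0)
)

# Protect the good columns; smoker and non_smoker get melted
toy1 <- melt(toy, id.vars = c("customer_ID", "tip"))

# Give the melted column a meaningful name
names(toy1)[names(toy1) == "variable"] <- "smoking"

# Keep only the rows that state what is true, then drop value
toy1 <- subset(toy1, value == 1)
toy1$value <- NULL
toy1 <- toy1[order(toy1$customer_ID), ]
toy1
```

Each customer now appears once, with a single `smoking` column in place of the two indicator columns.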

Rectangular data are
much easier to work with!
qplot(total_bill, tip, data = tips1,
color = sex)
# vs.
qplot(total_bill, tip, data = tips,
colour = ?)

qplot(total_bill, tip, data = tips1, color = sex) +
geom_smooth(method = lm)

Resource
Wickham, H. (2007) Reshaping data with
the reshape package. Journal of
Statistical Software, 21(12).
http://www.jstatsoft.org/v21/i12

Clean data is:
Rectangular
(observations in rows, one column per variable)
Consistent
Concise
Complete
Correct

Data
Residuals
Model
Compare
Visualize
Transform
ggplot2
plyr
reshape

Data
Residuals
Model
Compare
Visualize
Transform
most statistics
classes

This work is licensed under the Creative Commons
Attribution-Noncommercial 3.0 United States License. To view a copy of
this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/
or send a letter to Creative Commons, 171 Second Street, Suite 300,
San Francisco, California, 94105, USA.
