Garrett Grolemund
PhD Student / Rice University
Department of Statistics
Data cleaning
1. Intro to data cleaning
2. What you can’t fix
3. What you can fix
4. Intro to reshape
Your turn
Do you think men or women leave a larger
tip when dining out? What data would
you collect to test this belief? What would
prompt you to change your belief?
Data Analysis
Data
Residuals
Model
Compare
Visualize
Transform
10 - 20%
of an analysis
Data Cleaning
Data
Residuals
Model
Compare
Visualize
Transform
Data
cleaning
“Happy families are all alike;
every unhappy family is
unhappy in its own way.”
—Leo Tolstoy
“Clean datasets are all alike;
every messy dataset is
messy in its own way.”
—Hadley Wickham
Clean data is:
Complete
Correct
(factual and internally consistent)
Concise
Compatible
(required variables: observations in rows, one column per
variable)
What you
can’t fix:
Complete
Correct
Correct
Can’t restore incorrect values without
original data but can remove clearly
incorrect values
Options:
Remove entire row
Mark incorrect value as missing (NA)
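The two options above can be sketched in R as follows. The data frame and the rule for what counts as clearly incorrect (a negative tip) are made-up assumptions for illustration and are not part of the tipping data.

```r
# Toy data (hypothetical): the third tip is clearly incorrect.
bills <- data.frame(
  total_bill = c(16.99, 10.34, 21.01, 23.68),
  tip        = c(1.01, 1.66, -3.50, 3.31)
)

# Assumed rule for this sketch: a tip cannot be negative.
bad <- bills$tip < 0

# Option 1: remove the entire row.
dropped <- bills[!bad, ]

# Option 2: keep the row but mark the incorrect value as missing (NA).
marked <- bills
marked$tip[bad] <- NA

nrow(dropped)            # rows left after removal
sum(is.na(marked$tip))   # values marked as missing
```

Option 2 keeps the rest of the row available for analyses that do not need the tip, at the cost of handling NA later (for example with na.rm = TRUE).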
Inconsistency = incorrect
When two rows present the same
information with different values, at least
one row is wrong.
Whenever there is inconsistency, you are
going to have to make some tradeoff to
ensure concision.
Detecting inconsistency is not always
easy.
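When the same fact is stored in more than one place, a consistency check can be sketched like this. The toy data frame is hypothetical; it only mimics a layout with one 0/1 indicator column per sex.

```r
# Toy data (hypothetical): one 0/1 indicator column per sex.
toy <- data.frame(
  customer_ID = 1:5,
  male        = c(1, 0, 1, 0, 1),
  female      = c(0, 1, 1, 0, 0)
)

# A consistent row has exactly one indicator set.
inconsistent <- subset(toy, male + female != 1)
inconsistent$customer_ID

# Cross-tabulating the two columns shows the same problem at a glance:
# counts in the (0, 0) or (1, 1) cells are inconsistencies.
table(male = toy$male, female = toy$female)
```

The check only finds rows that contradict themselves. It cannot say which of the two values is the wrong one.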
General strategy
To find incorrect values you need to be
creative, combining graphics and data
processing.
Tipping data
One waiter recorded information
about each tip he received over a
period of a few months
244 records
Do men or women tip more?
Your turn
Subset the tipping data to include only
rows without NA’s. Judge whether you
think all of the data points are correct.
How will you make your decision?
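library(ggplot2)  # provides qplot()
# read the tipping data, keeping text columns as character vectors
tips <- read.csv("tipping.csv",
stringsAsFactors = FALSE)
summary(tips)
# keep only rows where both smoking indicators are recorded
tips <- subset(tips, !is.na(smoker) &
!is.na(non_smoker))
# look for implausible values in the distributions
qplot(tip, data = tips, binwidth = .5)
qplot(total_bill, data = tips, binwidth = 2)
qplot(total_bill, tip, data = tips)
# if every row is consistent, the male and female counts
# add up to the number of rows
nrow(tips)
sum(tips$male)
sum(tips$female)
# rows where the male and female indicators differ
subset(tips, male != female)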
What you
can fix:
Concise
(each fact represented once)
Repeating facts:
1. wastes memory
2. creates opportunities for inconsistency
Compatible
(Data is compatible with your analysis
in both form and fact)
1. Do you have the relevant variables for
your analysis?
This often requires some type of calculation.
For example,
proportion = successes / attempts
Avg score per game per team = ?
Functions such as join(), transform(), summarise(),
and ddply() (mostly from plyr) address this need
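As a minimal sketch of such calculations, the example below derives both quantities from a made-up data frame. The column names and values are assumptions, and base R's aggregate() stands in for a ddply() plus summarise() call so that the block runs without plyr.

```r
# Toy data (hypothetical): one row per team per game.
games <- data.frame(
  team      = c("A", "A", "B", "B"),
  attempts  = c(10, 20, 8, 12),
  successes = c(4, 10, 2, 9),
  score     = c(50, 70, 40, 60)
)

# Add a calculated column: proportion = successes / attempts.
games <- transform(games, proportion = successes / attempts)

# Average score per game for each team.
avg_score <- aggregate(score ~ team, data = games, FUN = mean)
avg_score
```

transform() adds a column to every row, while aggregate() collapses the rows to one per team. Which of the two is needed depends on whether the analysis works per game or per team.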
Compatible
(Data is compatible with your analysis
in both form and fact)
2. Is the data in the right form for your
analysis and visualization tools? (reshape)
Rectangular
Observations
in rows
Variables
in columns
(1 column per variable)
Your turn
What are the variables in tipping.csv?
How are they arranged in rows and
columns? Can you form the variables into
two groups?
Reshape
install.packages("reshape")
library(reshape)
library(stringr)
head(tips)
Molten data
We can use melt to put each
variable into its own column.
“Protect” the good columns.
“Melt” the offending columns.
Then subset.
Two types of variables
1. ID variables - identify the object that
measurements will take place on (we
know these before the experiment)
2. Measured variables - the features of
the object that will be measured (we have
to do an experiment to observe these)
Example object:
ID variables: Bruce Wayne, Batman, SSN: 555-89-3000
ID variables: Gotham City + male + Top 1% tax bracket
Measured variables: Height (6’1’’), IQ (180), Age (71)
Identifier variable         Measured variable
Index of random variable    Random variable
Dimension                   Measure
Experimental design         Measurement
predictors (Xi)             response (Y)
Molten data
Molten data collapses all the
measured variables into two
columns: 1) the variable being
measured and 2) the value.
Sometimes called “long” form.
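To make the long form concrete, here is a base R sketch of what melting does. melt_sketch() and the toy data are hypothetical and written only for illustration; in practice reshape::melt() does this work, and its output may differ in details such as column types.

```r
# Toy wide data (hypothetical): one id column, two measured columns.
wide <- data.frame(
  id     = c(1, 2, 3),
  height = c(180, 165, 172),
  weight = c(80, 60, 70)
)

# melt_sketch() is a hypothetical helper, not part of reshape.
# It stacks every non-id column into a (variable, value) pair.
melt_sketch <- function(data, id) {
  measured <- setdiff(names(data), id)
  pieces <- lapply(measured, function(v) {
    data.frame(data[id], variable = v, value = data[[v]],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

molten <- melt_sketch(wide, id = "id")
molten
```

Each original row now appears once per measured variable, so three rows with two measured columns become six rows with the columns id, variable, and value.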
To protect a column from being
melted, label it as an id variable.
reshape::melt(data, id)
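# melt the sex indicator columns; the listed id columns are protected
tips1 <- melt(tips, id =
c("customer_ID", "total_bill", "tip",
"smoker", "non_smoker"))
# assign an appropriate variable name
names(tips1)[6] <- "sex"
# subset out unwanted rows
tips1 <- subset(tips1, value == 1)
# reorder the columns and drop the now-redundant value column
tips1 <- tips1[ , c(1,2,6,4,5,3)]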
Your turn
Use melt to fix the smoking variable. One
column should be enough to record
whether a person smokes or not.
Rectangular data are
much easier to work with!
qplot(total_bill, tip, data = tips1,
color = sex)
# vs.
qplot(total_bill, tip, data = tips,
colour = ?)
qplot(total_bill, tip, data = tips1, color = sex) +
geom_smooth(method = lm)
Clean data is:
Complete
Correct
(factual and internally consistent)
Concise
Compatible
(required variables: observations in rows, one column per
variable)
Resource
Wickham, H. (2007) Reshaping data with
the reshape package. Journal of
Statistical Software. 21 (12)
http://www.jstatsoft.org/v21/i12
Summary
Clean data is:
Rectangular
(observations in rows, one column per variable)
Consistent
Concise
Complete
Correct
(Diagram: the data analysis cycle with the stages Data, Residuals, Model, Compare, Visualize, and Transform, annotated over successive slides with the packages ggplot2, plyr, and reshape, and with the label "most statistics classes")
This work is licensed under the Creative
Commons Attribution-Noncommercial 3.0 United
States License. To view a copy of this license,
visit http://creativecommons.org/licenses/by-nc/3.0/us/
or send a letter to Creative Commons,
171 Second Street, Suite 300, San Francisco,
California, 94105, USA.
