SlideShare une entreprise Scribd logo
1  sur  45
Télécharger pour lire hors ligne
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
A systematic approach to data cleaning with R
Mark van der Loo
markvanderloo.eu | @markvdloo
Budapest | September 3 2016
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Demos and other materials
https://github.com/markvanderloo/satRday
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Contents
The statistical value chain
From raw data to technically correct data
Strings and encoding
Regexp and approximate matching
Type coercion
From technically correct data to consistent data
Data validation
Error localization
Correction, imputation, adjustment
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
The statistical value chain
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Statistical value chain
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Concepts
Technically correct data
Well-defined format (data structure)
Well-defined types (numbers, date/time,string, categorical... )
Statistical units can be identified (persons, transactions, phone
calls...)
Variables can be identified as properties of statistical units.
Note: tidy data ⊂ technically correct data
Consistent data
Data satisfies demands from domain knowledge
(more on this when we talk about validation)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
From raw to technically correct data
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Dirty tabular data
Demo
Coercing while reading: /table
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Tabular data: long story short
read.table: R’s swiss army knife
fairly strict (no sniffing)
Very flexible
Interface could be cleaner (see this talk)
readr::read_csv
Easy to switch between strict/lenient parsing
Compact control over column types
Fast
Clear reports of parsing failure
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Really dirty data
Demo
Output file parsing: /parsing
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
A few lessons from the demo
(base) R has great text processing tools.
Need to work with regular expressions1
Write many small functions extracting single data elements.
Don’t overgeneralize: adapt functions as you meet new input.
Smart use of existing tools (read.table(text=))
1
Mastering Regular Expressions (2006) by Jeffrey Friedl is a great resource
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Packages for standard format parsing
jsonlite: parse JSON files
yaml: parse yaml files
xml2: parse XML files
rvest: scrape and parse HTML files
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Some tips on regular expressions with R
stringr has many useful shorthands for common tasks.
Generate regular expressions with rex
library(rex)
# recognize a number in scientific notation
rex(one_or_more(digit)
, maybe(".",one_or_more(digit))
, "E" %or% "e"
, one_or_more(digit))
## (?:[[:digit:]])+(?:.(?:[[:digit:]])+)?(?:E|e)(?:[[:digi
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Regular expressions
Express a pattern of text, e.g.
"(a|b)c*" = {"a", "ac", "acc", . . . , "b", "bc", "bcc", . . .}
Task stringr function:
string detection str_detect(string, pattern)
string extraction str_extract(string, pattern)
string replacement str_extract(string, pattern, replacement)
string splitting str_split(string, pattern)
Base R: grep grepl | regexpr regmatches | sub gsub | strsplit
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
String normalization
Bring a text string in a standard format, e.g.
Standardize upper/lower case (casefolding)
stringr: str_to_lower, str_to_upper, str_to_title
base R: tolower, toupper
Remove accents (transliteration)
stringi: stri_trans_general
base R: iconv
Re-encoding
stringi: stri_encode
base R: iconv
Uniformize encoding (unicode normalization)
stringi: stri_trans_nfkc (and more)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding in R
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding in R
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Encoding in R
Demo
Normalization, re-encoding, transliteration: /strings
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
A few tips
Detect encoding stringi::stri_enc_detect
Conversion options iconvlist() stringi::stri_enc_list()
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Approximate text matching
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Approximate text matching
Demo
Approximate matching and normalization: /matching
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Approximate text matching: edit-based distances
Allowed operation
Distance substitution deletion insertion transposition
Hamming    
LCS    
Levenshtein    
OSA    ∗
Damerau-
Levenshtein
   
∗Substrings may be edited only once.
leela → leea → leia
stringdist::stringdist(leela,leia,method=dl)
## [1] 2
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Some pointers for approximate matching
Normalisation and approximate matching are complementary
See my useR2014 talk or paper on stringdist for more distances
The fuzzyjoin package allows fuzzy joining of datasets
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Other good stuff
lubridate: extract dates from strings
lubridate::dmy(17 December 2015)
## [1] 2015-12-17
tidyr: many data cleaning operations to make your life easier
readr: Parse numbers from text strings
readr::parse_number(c(2%,6%,0.3%))
## [1] 2.0 6.0 0.3
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
From technically correct to consistent data
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
The mantra of data cleaning
Detection (data conflicts with domain knowledge)
Selection (find the value(s) that cause the violation)
Correction (replace them with better values)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Detection, AKA data validation
Informally:
Data Validation is checking data against (multivariate) expectations
about a data set.
Validation rules
Often these expectations can be expressed as a set of simple
validation rules.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Data validation
Demo
The validate package /validate
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
The validate package, in summary
Make data validation rules explicit
Treat them as objects of computation
store to / read from file
manipulate
annotate
Confront data with rules
Analyze/visualize the results
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Tracking changes when altering data
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Tracking changes in rule violations
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Use rules to correct data
Main idea
Rules restrict the data. Sometimes this is enough to derive a correct
value uniquely.
Examples
Correct typos in values under linear restrictions
123 + 45 = 177, but 123 + 54 = 177.
Derive imputations from values under linear restrictions
123 + NA = 177, compute 177 − 123 = 54.
Both can be generalized to systems Ax ≤ b.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Deductive correction and imputation
Demo
The deductive package: /deductive.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Selection, or: error localization
Fellegi and Holt (1976)
Find the least (weighted) number of fields that can be imputed such
that all rules can be satisfied.
Note
Solutions need not be unique.
Random one chosen in case of degeneracy.
Lowest weight need not guarantee smallest number of altered
variables.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Error localization
Demo
The errorlocate package: /errorlocate
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Notes on errorlocate
For in-record rules
Support for
linear (in)equality rules
Conditionals on categorical variables (if male then not pregnant)
Mixed conditionals (has job then age = 15)
Conditionals w/linear predicates (staff  0 then staff cost  0)
Optimization is mapped to MIP problem.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Missing values
Mechanisms (Rubin):
MCAR: missing completely at random
MAR: P(Y = NA) depends on value of X
MNAR: P(Y = NA) depends on value of Y
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Imputation
Purpose of imputation vs prediction
Prediction: estimate a single value (often for a single use)
Imputation: estimate values such that the completed data set
allows for valid inferencea
a
This is very difficult!
Imputation methods
Deductive imputation
Imputation based on predictive models
Donor imputation (knn, pmm, sequential/random hot deck )
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Predictive model-based imputation
ˆy = ˆf (x) +
e.g.Linear regression
ˆy = α + xT ˆβ +
Residual:
= 0 Impute expected value
drawn from observed residuals e
∼ N(0, σ) parametric residual, ˆσ2
= var(e)
Multiple imputation (Bayesian bootstrap)
Draw β from parametric distribution, impute multiple times.
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Donor imputation (hot deck)
Method variants:
Random hot deck: copy value from random record.
Sequential hot deck: copy value from previous record.
k-nearest neighbours: draw donor from k neares neigbours
Predictive mean matching: copy value closest to prediction
Donor pool variants:
per variable
per missing data pattern
per record
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Note on multivariate donor imputation
Many multivariate methods seem relatively ad hoc, and more
theoretical and empirical comparisons with alternative approaches
would be of interest.
Andridge and Little (2010) A Review of Hot Deck Imputation for Survey
Non-response. Int. Stat. Rev. 78(1) 40–64
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Demo time
Demo
Imputation /imputation
VIM: visualisation, GUI, extensive methodology
simputation: simple, scriptable interface to common methods
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Methods supported by simputation
Model based (optionally add [non-]parametric random residual)
linear regression
robust linear regression
CART models
Random forest
Donor imputation (including various donor pool specifications)
k-nearest neigbour (based on gower’s distance)
sequential hotdeck (LOCF, NOCB)
random hotdeck
Predictive mean matching
Other
(groupwise) median imputation (optional random residual)
Proxy imputation (copy from other variable)
Mark van der Loo A systematic approach to data cleaning with R
The statistical value chain
From raw to technically correct data
From technically correct to consistent data
Credits
deductive Mark van der Loo, Edwin de Jonge
errorlocate Edwin de Jonge, Mark van der Loo
gower Mark van der Loo
jsonlite Jeroen Ooms, Duncan Temple Lang, Lloyd Hilaiel
magrittr Stefan Milton Bache, Hadley Wickham
rex Kevin Ushey Jim Hester, Robert Krzyzanowski
simputation Mark van der Loo
stringdist Mark van der Loo, Jan van der Laan, R Core, Nick Logan
stringi Marek Gagolewski, Bartek Tartanus
stringr Hadley Wickham, RStudio
tidyr Hadley Wickham, RStudio
validate Mark van der Loo, Edwin de Jonge
VIM Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner
xml2 Hadley Wickham, Jim Hester, Jeroen Ooms, RStudio, R foundation
Mark van der Loo A systematic approach to data cleaning with R

Contenu connexe

En vedette

Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Neha Ahuja
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศTepasoon Songnaa
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)Kittiya Youngjarean
 
Luyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcLuyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcltgiang87
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)Kittiya Youngjarean
 
Meaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. SinielMeaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. Sinielairazygy
 
Bridging the gap between theory and practise
Bridging the gap between theory and practiseBridging the gap between theory and practise
Bridging the gap between theory and practiseAnshul Punetha
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการKittiya Youngjarean
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลTepasoon Songnaa
 
Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)Abertay University
 
Graphene light bulb set for shops
Graphene light bulb set for shopsGraphene light bulb set for shops
Graphene light bulb set for shopsskinnyengineer898
 

En vedette (16)

Storyworld Jam '14
Storyworld Jam '14Storyworld Jam '14
Storyworld Jam '14
 
Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)Infant feeding-and-climate-change-india (2)
Infant feeding-and-climate-change-india (2)
 
มารยาทในการลีลาศ
มารยาทในการลีลาศมารยาทในการลีลาศ
มารยาทในการลีลาศ
 
Tugas 1
Tugas 1Tugas 1
Tugas 1
 
การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)การลงทุนในหุ้นสามัญ (ต่อ)
การลงทุนในหุ้นสามัญ (ต่อ)
 
Luyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin họcLuyện thi Chứng chỉ A Tin học
Luyện thi Chứng chỉ A Tin học
 
การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)การรวมกิจการ (ต่อ)
การรวมกิจการ (ต่อ)
 
Meaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. SinielMeaning and Importance of Genetic Engineering by Aira J. Siniel
Meaning and Importance of Genetic Engineering by Aira J. Siniel
 
˹· 1 ǡѻ
˹· 1  ǡѻ˹· 1  ǡѻ
˹· 1 ǡѻ
 
Bridging the gap between theory and practise
Bridging the gap between theory and practiseBridging the gap between theory and practise
Bridging the gap between theory and practise
 
การรวมกิจการ
การรวมกิจการการรวมกิจการ
การรวมกิจการ
 
ประวัติวอลเลย์บอล
ประวัติวอลเลย์บอลประวัติวอลเลย์บอล
ประวัติวอลเลย์บอล
 
IoT Protocols
IoT ProtocolsIoT Protocols
IoT Protocols
 
TV3.0 New TV frontiers.
TV3.0 New TV frontiers.TV3.0 New TV frontiers.
TV3.0 New TV frontiers.
 
Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)Policy Engagement through Digital Participation (Serious Games)
Policy Engagement through Digital Participation (Serious Games)
 
Graphene light bulb set for shops
Graphene light bulb set for shopsGraphene light bulb set for shops
Graphene light bulb set for shops
 

Similaire à Sat rday

Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC SystemsData Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC SystemsHPCC Systems
 
HospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformHospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformAngela Razzell
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL ServerStéphane Fréchette
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdfRohanBorgalli
 
Lec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic DesignLec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic DesignVajira Thambawita
 
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...AI Frontiers
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbaiUnmesh Baile
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners Jen Stirrup
 
Understanding R for Epidemiologists
Understanding R for EpidemiologistsUnderstanding R for Epidemiologists
Understanding R for EpidemiologistsTomas J. Aragon
 
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaData Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaSubhasish Guha
 
Data Processing-Presentation
Data Processing-PresentationData Processing-Presentation
Data Processing-Presentationnibraspk
 
A view of graph data usage by Cerved
A view of graph data usage by CervedA view of graph data usage by Cerved
A view of graph data usage by CervedData Science Milan
 
computer architecture
computer architecture computer architecture
computer architecture Dr.Umadevi V
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesEdwin de Jonge
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsJen Stirrup
 
Software Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and MetricsSoftware Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and MetricsProgrameter
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxmyworld93
 

Similaire à Sat rday (20)

Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC SystemsData Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
Data Patterns - A Native Open Source Data Profiling Tool for HPCC Systems
 
HospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics PlatformHospETL - Delivering a Healthcare Analytics Platform
HospETL - Delivering a Healthcare Analytics Platform
 
Unit 3-2.ppt
Unit 3-2.pptUnit 3-2.ppt
Unit 3-2.ppt
 
Data Analytics with R and SQL Server
Data Analytics with R and SQL ServerData Analytics with R and SQL Server
Data Analytics with R and SQL Server
 
R Programming - part 1.pdf
R Programming - part 1.pdfR Programming - part 1.pdf
R Programming - part 1.pdf
 
Lec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic DesignLec 01 - Microcomputer Architecture and Logic Design
Lec 01 - Microcomputer Architecture and Logic Design
 
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
Saket Saurabh at AI Frontiers: Data Operations or: How I Learned to Stop Data...
 
R-programming-training-in-mumbai
R-programming-training-in-mumbaiR-programming-training-in-mumbai
R-programming-training-in-mumbai
 
CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners CuRious about R in Power BI? End to end R in Power BI for beginners
CuRious about R in Power BI? End to end R in Power BI for beginners
 
Understanding R for Epidemiologists
Understanding R for EpidemiologistsUnderstanding R for Epidemiologists
Understanding R for Epidemiologists
 
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and ScalaData Quality, Correctness and Dynamic Transformations using Spark and Scala
Data Quality, Correctness and Dynamic Transformations using Spark and Scala
 
Data Processing-Presentation
Data Processing-PresentationData Processing-Presentation
Data Processing-Presentation
 
A view of graph data usage by Cerved
A view of graph data usage by CervedA view of graph data usage by Cerved
A view of graph data usage by Cerved
 
computer architecture
computer architecture computer architecture
computer architecture
 
Validatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rulesValidatetools, resolve and simplify contradictive or data validation rules
Validatetools, resolve and simplify contradictive or data validation rules
 
Cerved Datascience Milan
Cerved Datascience MilanCerved Datascience Milan
Cerved Datascience Milan
 
SQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and StatisticsSQLBits Module 2 RStats Introduction to R and Statistics
SQLBits Module 2 RStats Introduction to R and Statistics
 
Software Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and MetricsSoftware Measurement: Lecture 1. Measures and Metrics
Software Measurement: Lecture 1. Measures and Metrics
 
DATA MINING USING R (1).pptx
DATA MINING USING R (1).pptxDATA MINING USING R (1).pptx
DATA MINING USING R (1).pptx
 
R programming by ganesh kavhar
R programming by ganesh kavharR programming by ganesh kavhar
R programming by ganesh kavhar
 

Dernier

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...HyderabadDolls
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...gajnagarg
 

Dernier (20)

Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 

Sat rday

  • 1. The statistical value chain From raw to technically correct data From technically correct to consistent data A systematic approach to data cleaning with R Mark van der Loo markvanderloo.eu | @markvdloo Budapest | September 3 2016 Mark van der Loo A systematic approach to data cleaning with R
  • 2. The statistical value chain From raw to technically correct data From technically correct to consistent data Demos and other materials https://github.com/markvanderloo/satRday Mark van der Loo A systematic approach to data cleaning with R
  • 3. The statistical value chain From raw to technically correct data From technically correct to consistent data Contents The statistical value chain From raw data to technically correct data Strings and encoding Regexp and approximate matching Type coercion From technically correct data to consistent data Data validation Error localization Correction, imputation, adjustment Mark van der Loo A systematic approach to data cleaning with R
  • 4. The statistical value chain From raw to technically correct data From technically correct to consistent data The statistical value chain Mark van der Loo A systematic approach to data cleaning with R
  • 5. The statistical value chain From raw to technically correct data From technically correct to consistent data Statistical value chain Mark van der Loo A systematic approach to data cleaning with R
  • 6. The statistical value chain From raw to technically correct data From technically correct to consistent data Concepts Technically correct data Well-defined format (data structure) Well-defined types (numbers, date/time,string, categorical... ) Statistical units can be identified (persons, transactions, phone calls...) Variables can be identified as properties of statistical units. Note: tidy data ⊂ technically correct data Consistent data Data satisfies demands from domain knowledge (more on this when we talk about validation) Mark van der Loo A systematic approach to data cleaning with R
  • 7. The statistical value chain From raw to technically correct data From technically correct to consistent data From raw to technically correct data Mark van der Loo A systematic approach to data cleaning with R
  • 8. The statistical value chain From raw to technically correct data From technically correct to consistent data Dirty tabular data Demo Coercing while reading: /table Mark van der Loo A systematic approach to data cleaning with R
  • 9. The statistical value chain From raw to technically correct data From technically correct to consistent data Tabular data: long story short read.table: R’s swiss army knife fairly strict (no sniffing) Very flexible Interface could be cleaner (see this talk) readr::read_csv Easy to switch between strict/lenient parsing Compact control over column types Fast Clear reports of parsing failure Mark van der Loo A systematic approach to data cleaning with R
  • 10. The statistical value chain From raw to technically correct data From technically correct to consistent data Really dirty data Demo Output file parsing: /parsing Mark van der Loo A systematic approach to data cleaning with R
  • 11. The statistical value chain From raw to technically correct data From technically correct to consistent data A few lessons from the demo (base) R has great text processing tools. Need to work with regular expressions1 Write many small functions extracting single data elements. Don’t overgeneralize: adapt functions as you meet new input. Smart use of existing tools (read.table(text=)) 1 Mastering Regular Expressions (2006) by Jeffrey Friedl is a great resource Mark van der Loo A systematic approach to data cleaning with R
  • 12. The statistical value chain From raw to technically correct data From technically correct to consistent data Packages for standard format parsing jsonlite: parse JSON files yaml: parse yaml files xml2: parse XML files rvest: scrape and parse HTML files Mark van der Loo A systematic approach to data cleaning with R
  • 13. The statistical value chain From raw to technically correct data From technically correct to consistent data Some tips on regular expressions with R stringr has many useful shorthands for common tasks. Generate regular expressions with rex library(rex) # recognize a number in scientific notation rex(one_or_more(digit) , maybe(".",one_or_more(digit)) , "E" %or% "e" , one_or_more(digit)) ## (?:[[:digit:]])+(?:.(?:[[:digit:]])+)?(?:E|e)(?:[[:digi Mark van der Loo A systematic approach to data cleaning with R
  • 14. The statistical value chain From raw to technically correct data From technically correct to consistent data Regular expressions Express a pattern of text, e.g. "(a|b)c*" = {"a", "ac", "acc", . . . , "b", "bc", "bcc", . . .} Task stringr function: string detection str_detect(string, pattern) string extraction str_extract(string, pattern) string replacement str_extract(string, pattern, replacement) string splitting str_split(string, pattern) Base R: grep grepl | regexpr regmatches | sub gsub | strsplit Mark van der Loo A systematic approach to data cleaning with R
  • 15. The statistical value chain From raw to technically correct data From technically correct to consistent data String normalization Bring a text string in a standard format, e.g. Standardize upper/lower case (casefolding) stringr: str_to_lower, str_to_upper, str_to_title base R: tolower, toupper Remove accents (transliteration) stringi: stri_trans_general base R: iconv Re-encoding stringi: stri_encode base R: iconv Uniformize encoding (unicode normalization) stringi: stri_trans_nfkc (and more) Mark van der Loo A systematic approach to data cleaning with R
  • 16. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding Mark van der Loo A systematic approach to data cleaning with R
  • 17. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Mark van der Loo A systematic approach to data cleaning with R
  • 18. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Mark van der Loo A systematic approach to data cleaning with R
  • 19. The statistical value chain From raw to technically correct data From technically correct to consistent data Encoding in R Demo Normalization, re-encoding, transliteration: /strings Mark van der Loo A systematic approach to data cleaning with R
  • 20. The statistical value chain From raw to technically correct data From technically correct to consistent data A few tips Detect encoding stringi::stri_enc_detect Conversion options iconvlist() stringi::stri_enc_list() Mark van der Loo A systematic approach to data cleaning with R
  • 21. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching Mark van der Loo A systematic approach to data cleaning with R
  • 22. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching Demo Approximate matching and normalization: /matching Mark van der Loo A systematic approach to data cleaning with R
  • 23. The statistical value chain From raw to technically correct data From technically correct to consistent data Approximate text matching: edit-based distances Allowed operation Distance substitution deletion insertion transposition Hamming LCS Levenshtein OSA ∗ Damerau- Levenshtein ∗Substrings may be edited only once. leela → leea → leia stringdist::stringdist(leela,leia,method=dl) ## [1] 2 Mark van der Loo A systematic approach to data cleaning with R
  • 24. The statistical value chain From raw to technically correct data From technically correct to consistent data Some pointers for approximate matching Normalisation and approximate matching are complementary See my useR2014 talk or paper on stringdist for more distances The fuzzyjoin package allows fuzzy joining of datasets Mark van der Loo A systematic approach to data cleaning with R
  • 25. The statistical value chain From raw to technically correct data From technically correct to consistent data Other good stuff lubridate: extract dates from strings lubridate::dmy(17 December 2015) ## [1] 2015-12-17 tidyr: many data cleaning operations to make your life easier readr: Parse numbers from text strings readr::parse_number(c(2%,6%,0.3%)) ## [1] 2.0 6.0 0.3 Mark van der Loo A systematic approach to data cleaning with R
  • 26. The statistical value chain From raw to technically correct data From technically correct to consistent data From technically correct to consistent data Mark van der Loo A systematic approach to data cleaning with R
  • 27. The statistical value chain From raw to technically correct data From technically correct to consistent data The mantra of data cleaning Detection (data conflicts with domain knowledge) Selection (find the value(s) that cause the violation) Correction (replace them with better values) Mark van der Loo A systematic approach to data cleaning with R
  • 28. The statistical value chain From raw to technically correct data From technically correct to consistent data Detection, AKA data validation Informally: Data Validation is checking data against (multivariate) expectations about a data set. Validation rules Often these expectations can be expressed as a set of simple validation rules. Mark van der Loo A systematic approach to data cleaning with R
  • 29. The statistical value chain From raw to technically correct data From technically correct to consistent data Data validation Demo The validate package /validate Mark van der Loo A systematic approach to data cleaning with R
  • 30. The statistical value chain From raw to technically correct data From technically correct to consistent data The validate package, in summary Make data validation rules explicit Treat them as objects of computation store to / read from file manipulate annotate Confront data with rules Analyze/visualize the results Mark van der Loo A systematic approach to data cleaning with R
  • 31. The statistical value chain From raw to technically correct data From technically correct to consistent data Tracking changes when altering data Mark van der Loo A systematic approach to data cleaning with R
  • 32. The statistical value chain From raw to technically correct data From technically correct to consistent data Tracking changes in rule violations Mark van der Loo A systematic approach to data cleaning with R
  • 33. The statistical value chain From raw to technically correct data From technically correct to consistent data Use rules to correct data Main idea Rules restrict the data. Sometimes this is enough to derive a correct value uniquely. Examples Correct typos in values under linear restrictions 123 + 45 = 177, but 123 + 54 = 177. Derive imputations from values under linear restrictions 123 + NA = 177, compute 177 − 123 = 54. Both can be generalized to systems Ax ≤ b. Mark van der Loo A systematic approach to data cleaning with R
  • 34. The statistical value chain From raw to technically correct data From technically correct to consistent data Deductive correction and imputation Demo The deductive package: /deductive. Mark van der Loo A systematic approach to data cleaning with R
  • 35. The statistical value chain From raw to technically correct data From technically correct to consistent data Selection, or: error localization Fellegi and Holt (1976) Find the least (weighted) number of fields that can be imputed such that all rules can be satisfied. Note Solutions need not be unique. Random one chosen in case of degeneracy. Lowest weight need not guarantee smallest number of altered variables. Mark van der Loo A systematic approach to data cleaning with R
  • 36. The statistical value chain From raw to technically correct data From technically correct to consistent data Error localization Demo The errorlocate package: /errorlocate Mark van der Loo A systematic approach to data cleaning with R
  • 37. The statistical value chain From raw to technically correct data From technically correct to consistent data Notes on errorlocate For in-record rules Support for linear (in)equality rules Conditionals on categorical variables (if male then not pregnant) Mixed conditionals (has job then age = 15) Conditionals w/linear predicates (staff 0 then staff cost 0) Optimization is mapped to MIP problem. Mark van der Loo A systematic approach to data cleaning with R
  • 38. The statistical value chain From raw to technically correct data From technically correct to consistent data Missing values Mechanisms (Rubin): MCAR: missing completely at random MAR: P(Y = NA) depends on value of X MNAR: P(Y = NA) depends on value of Y Mark van der Loo A systematic approach to data cleaning with R
  • 39. The statistical value chain From raw to technically correct data From technically correct to consistent data Imputation Purpose of imputation vs prediction Prediction: estimate a single value (often for a single use) Imputation: estimate values such that the completed data set allows for valid inferencea a This is very difficult! Imputation methods Deductive imputation Imputation based on predictive models Donor imputation (knn, pmm, sequential/random hot deck ) Mark van der Loo A systematic approach to data cleaning with R
  • 40. The statistical value chain From raw to technically correct data From technically correct to consistent data Predictive model-based imputation ˆy = ˆf (x) + e.g.Linear regression ˆy = α + xT ˆβ + Residual: = 0 Impute expected value drawn from observed residuals e ∼ N(0, σ) parametric residual, ˆσ2 = var(e) Multiple imputation (Bayesian bootstrap) Draw β from parametric distribution, impute multiple times. Mark van der Loo A systematic approach to data cleaning with R
  • 41. The statistical value chain From raw to technically correct data From technically correct to consistent data Donor imputation (hot deck) Method variants: Random hot deck: copy value from random record. Sequential hot deck: copy value from previous record. k-nearest neighbours: draw donor from k neares neigbours Predictive mean matching: copy value closest to prediction Donor pool variants: per variable per missing data pattern per record Mark van der Loo A systematic approach to data cleaning with R
  • 42. The statistical value chain From raw to technically correct data From technically correct to consistent data Note on multivariate donor imputation Many multivariate methods seem relatively ad hoc, and more theoretical and empirical comparisons with alternative approaches would be of interest. Andridge and Little (2010) A Review of Hot Deck Imputation for Survey Non-response. Int. Stat. Rev. 78(1) 40–64 Mark van der Loo A systematic approach to data cleaning with R
  • 43. The statistical value chain From raw to technically correct data From technically correct to consistent data Demo time Demo Imputation /imputation VIM: visualisation, GUI, extensive methodology simputation: simple, scriptable interface to common methods Mark van der Loo A systematic approach to data cleaning with R
  • 44. The statistical value chain From raw to technically correct data From technically correct to consistent data Methods supported by simputation Model based (optionally add [non-]parametric random residual) linear regression robust linear regression CART models Random forest Donor imputation (including various donor pool specifications) k-nearest neigbour (based on gower’s distance) sequential hotdeck (LOCF, NOCB) random hotdeck Predictive mean matching Other (groupwise) median imputation (optional random residual) Proxy imputation (copy from other variable) Mark van der Loo A systematic approach to data cleaning with R
  • 45. The statistical value chain From raw to technically correct data From technically correct to consistent data Credits deductive Mark van der Loo, Edwin de Jonge errorlocate Edwin de Jonge, Mark van der Loo gower Mark van der Loo jsonlite Jeroen Ooms, Duncan Temple Lang, Lloyd Hilaiel magrittr Stefan Milton Bache, Hadley Wickham rex Kevin Ushey Jim Hester, Robert Krzyzanowski simputation Mark van der Loo stringdist Mark van der Loo, Jan van der Laan, R Core, Nick Logan stringi Marek Gagolewski, Bartek Tartanus stringr Hadley Wickham, RStudio tidyr Hadley Wickham, RStudio validate Mark van der Loo, Edwin de Jonge VIM Matthias Templ, Andreas Alfons, Alexander Kowarik, Bernd Prantner xml2 Hadley Wickham, Jim Hester, Jeroen Ooms, RStudio, R foundation Mark van der Loo A systematic approach to data cleaning with R