testdat: An R package for unit testing of tabular data

Karthik Ram¹, Hilary Parker², Alyssa Frazee³

¹ The rOpenSci project, University of California, Berkeley. Berkeley, CA 94720 USA, karthik.ram@berkeley.edu
² Etsy Inc., Brooklyn, NY. USA, hilary@etsy.com
³ Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD. USA, afrazee@jhsph.edu

Contribute

The testdat package, like rOpenSci, is an open-source, community-supported project!

Motivation

Improve data preprocessing:
Data preprocessing is an important and under-discussed step in data analysis. By providing functions to easily test for and correct common pitfalls, we aim to help researchers overcome these stumbling blocks.

Encourage reproducibility:
By providing a suite of functions that easily test and correct data for common errors, we hope to encourage researchers to perform data preprocessing as part of a reproducible workflow, rather than in tools such as Excel.

Communicate analytical steps:
By providing readable functions for preprocessing, we aim for researchers to include the data preprocessing code in their analyses or papers, to communicate that they took exhaustive steps to remove artifacts from data.

Workflow

Obtain

> dat
         date num name
1  2014-01-01   1 NULL
2  2014-01-01   2  naa
3  2014-01-01   3  foo
4  2014-01-01   4  foo
5  2014-01-01   5  foo
6  2014-01-01   6  foo
7  2014-01-01   7  foo
8  2014-01-01   8  foo
9  2014-01-01 999  foo
10 2014-01-01 n/a  foo
> class(dat$num)
[1] "factor"
> class(dat$name)
[1] "factor"

Test

> test_NA(dat)
Now checking 3 columns...
999 was identified as a possible NA alias -- please verify this is not a data value!
  row column value
1   9      2   999
2  10      2   n/a
3   1      3  NULL

Fix

> clean_dat <- fix_NA(dat, custom_NAs="naa")
Now fixing 3 columns...
> clean_dat
         date num name
1  2014-01-01   1 <NA>
2  2014-01-01   2 <NA>
3  2014-01-01   3  foo
4  2014-01-01   4  foo
5  2014-01-01   5  foo
6  2014-01-01   6  foo
7  2014-01-01   7  foo
8  2014-01-01   8  foo
9  2014-01-01  NA  foo
10 2014-01-01  NA  foo
> class(clean_dat$num)
[1] "numeric"
> class(clean_dat$name)
[1] "character"

Example Functions

test_utf8.R, clean_utf8.R
Test for and correct UTF-8 characters that cannot be read into R.

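As a rough base-R illustration of this kind of check (not the package's own code), `validUTF8()` can flag strings that are not valid UTF-8, and `iconv()` with its `sub` argument can replace the offending bytes. The helper names `find_invalid_utf8` and `replace_invalid_utf8` are hypothetical.

```r
# Hypothetical helpers; a sketch of the idea, not testdat's implementation.
find_invalid_utf8 <- function(x) {
  # Positions of non-missing elements that are not valid UTF-8.
  which(!is.na(x) & !validUTF8(x))
}

replace_invalid_utf8 <- function(x, replacement = "?") {
  # Re-encode as UTF-8, substituting each invalid byte with `replacement`.
  iconv(x, from = "UTF-8", to = "UTF-8", sub = replacement)
}

x <- c("foo", "caf\xe9", "bar")  # "\xe9" is a Latin-1 byte, invalid as UTF-8
bad <- find_invalid_utf8(x)
fixed <- replace_invalid_utf8(x)
```

Substituting a placeholder loses information; when the true source encoding is known, converting from it with `iconv()` is usually the better choice.
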
test_NA.R, fix_NA.R
Test for and correct common missing-value indicators that are not converted to NA in R.

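The idea can be sketched in base R as follows. This is an assumption-laden illustration, not the package's implementation: the alias list and the helper names `find_na_aliases` and `replace_na_aliases` are hypothetical.

```r
# Hypothetical helpers; the alias list below is an assumption.
na_aliases <- c("NULL", "n/a", "N/A", "999", "")

find_na_aliases <- function(df, aliases = na_aliases) {
  hits <- list()
  for (j in seq_along(df)) {
    values <- as.character(df[[j]])
    rows <- which(values %in% aliases)
    if (length(rows) > 0) {
      hits[[length(hits) + 1]] <- data.frame(
        row = rows, column = j, value = values[rows],
        stringsAsFactors = FALSE
      )
    }
  }
  if (length(hits) == 0) {
    return(data.frame(row = integer(0), column = integer(0),
                      value = character(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, hits)
}

replace_na_aliases <- function(df, aliases = na_aliases) {
  for (j in seq_along(df)) {
    values <- as.character(df[[j]])
    values[values %in% aliases] <- NA
    # type.convert() re-guesses the column type from the remaining text.
    df[[j]] <- type.convert(values, as.is = TRUE)
  }
  df
}

dat <- data.frame(
  num  = c("1", "2", "999", "n/a"),
  name = c("NULL", "foo", "foo", "foo"),
  stringsAsFactors = TRUE
)
hits <- find_na_aliases(dat)
clean <- replace_na_aliases(dat)
```

A value such as 999 may be genuine data, so it is worth reviewing the reported hits before replacing anything.
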
test_continuous_date.R, fix_continuous_date.R
Test and correct for unexpected gaps in date ranges.

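A minimal base-R sketch of a gap check, assuming daily data and a column named `date`. The helpers `find_missing_dates` and `fill_missing_dates` are hypothetical and are not the package's functions.

```r
# Hypothetical helpers; assumes one expected observation per calendar day.
find_missing_dates <- function(dates) {
  dates <- as.Date(dates)
  expected <- seq(min(dates), max(dates), by = "day")
  # Dates in the full range that never occur in the data.
  expected[!expected %in% dates]
}

fill_missing_dates <- function(df, date_col = "date") {
  full <- data.frame(seq(min(df[[date_col]]), max(df[[date_col]]), by = "day"))
  names(full) <- date_col
  # Left join onto the full range; missing days get NA in the other columns.
  merge(full, df, by = date_col, all.x = TRUE)
}

observed <- as.Date(c("2014-01-01", "2014-01-02", "2014-01-05"))
dat <- data.frame(date = observed, value = c(10, 12, 9))
missing_days <- find_missing_dates(dat$date)
filled <- fill_missing_dates(dat)
```

Filling with NA rows makes the gap explicit without inventing values; whether to impute them is a separate analytical decision.
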
test_white_spaces.R, fix_white_spaces.R
Test and correct for whitespace in character vectors.

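As an illustration only, and assuming that leading and trailing whitespace is the target, the check and the fix can be sketched with base R's `grepl()` and `trimws()`. The helper names are hypothetical.

```r
# Hypothetical helpers; not testdat's implementation.
has_outer_whitespace <- function(x) {
  # TRUE where a string starts or ends with a whitespace character.
  grepl("^[[:space:]]|[[:space:]]$", x)
}

strip_outer_whitespace <- function(x) {
  trimws(x, which = "both")
}

x <- c("foo", " foo", "foo  ", "f oo")
flagged <- has_outer_whitespace(x)
stripped <- strip_outer_whitespace(x)
```

Interior spaces, as in "f oo", are left alone because they are often part of the value.
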
test_outliers.R
Test for outliers in your numeric data. A correcting function is not supplied, as this has statistical implications.

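The poster does not say which outlier rule is used. The sketch below assumes Tukey's 1.5 × IQR fences purely for illustration; `flag_outliers_iqr` is a hypothetical helper, not the package's function.

```r
# Hypothetical helper; the 1.5 * IQR rule is an assumption, not testdat's method.
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[2] - q[1])
  # TRUE for values beyond the lower or upper fence; NA stays NA.
  unname(x < q[1] - fence | x > q[2] + fence)
}

num <- c(1, 2, 3, 4, 5, 6, 7, 8, 999)
outlier_rows <- which(flag_outliers_iqr(num))
```

The function only flags values; deciding whether to drop, cap, or keep them is left to the analyst.
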
