Detecting Bad Data
CARMA Research Module
Jeff Stanton
May 18-20, 2006 Internet Data Collection Methods (Day 2-2)
Sources of Data Problems in Online Studies
• Technical errors:
  - Programming errors: Not common, but damaging when they occur
  - Server errors: Can halt the collection of data
  - Transmission errors: Uncommon and usually isolated to one record or field
• Response fraud:
  - Inadvertent multiple response and malicious multiple response
  - Missing data
  - Intentionally malicious patterns of response leading to outliers or self-contradictory data
Response Fraud
• Deindividuation: Anonymous respondents, working at a distance from the researcher, have limited accountability to the research process
• Participant incentives introduce mixed motives: necessity of completing the instrument, but not to any particular level of quality
• Minimal frauds: skipping questions, not thinking through the answers
• Maximal frauds: a robot that answers randomly
Duplicate Detection
• Fingerprint each row, e.g., with the sum of the numeric columns multiplied by the SD of the same columns
• Create a new variable that contains this (nearly) unique “checksum” value for each row/case
• Sort the dataset on the checksum
• Create a lag difference variable that subtracts each row's checksum from the checksum of the neighboring row
• Sort on the lag variable and investigate all cases of zero or small differences
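The duplicate-detection steps can be sketched in a few lines of Python. This is a minimal illustration rather than the SPSS code referred to in the slides; the function names, the `tolerance` parameter, and the sample `responses` are all assumptions made for the example.

```python
from statistics import stdev

def checksum(row):
    """Fingerprint a row: sum of its values times their sample SD."""
    return sum(row) * stdev(row)

def find_near_duplicates(rows, tolerance=1e-9):
    """Return pairs of case indices whose checksums differ by <= tolerance."""
    # Pair each checksum with its original case index, then sort on the checksum.
    ranked = sorted((checksum(r), i) for i, r in enumerate(rows))
    suspects = []
    # Lag difference: compare each checksum with the one just before it.
    for (prev_sum, prev_i), (cur_sum, cur_i) in zip(ranked, ranked[1:]):
        if abs(cur_sum - prev_sum) <= tolerance:
            suspects.append(tuple(sorted((prev_i, cur_i))))
    return suspects

# Hypothetical data: four respondents, five items each.
responses = [
    [1, 2, 3, 4, 5],
    [5, 4, 4, 2, 1],
    [1, 2, 3, 4, 5],   # exact copy of case 0
    [3, 3, 4, 3, 2],
]
print(find_near_duplicates(responses))  # [(0, 2)]
```

A matching checksum is only a prompt to inspect the rows. The fingerprint ignores item order, and every constant row gets a checksum of zero, so distinct cases can collide.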
Bogus Response Detection
• Calculate common univariate statistics using the complete row of responses for each subject
  - Create new variables for the univariate summaries (mean, SD, skewness, kurtosis, max, min)
• Sort the cases by the mean value
  - Look for extreme outliers on the high and low ends
• Sort the cases by standard deviation, skewness, kurtosis, maximum, minimum
  - Look for anomalies and trace them back to the original data for that subject
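As a rough sketch of the row-wise summaries described above, the following Python computes the six statistics per respondent and sorts on two of them. The case labels and data are invented for illustration. The population SD and moment-based skewness and kurtosis are one reasonable choice, and may differ slightly from what SPSS or Excel report.

```python
from statistics import mean, pstdev

def row_summary(row):
    """Univariate summaries for one respondent's row of item responses."""
    n = len(row)
    m = mean(row)
    sd = pstdev(row)  # population SD keeps the moment formulas below simple
    if sd == 0:
        # Skewness and kurtosis are undefined for a constant row;
        # reporting 0.0 is a convention of this sketch only.
        skew = kurt = 0.0
    else:
        skew = sum((x - m) ** 3 for x in row) / (n * sd ** 3)
        kurt = sum((x - m) ** 4 for x in row) / (n * sd ** 4) - 3  # excess kurtosis
    return {"mean": m, "sd": sd, "skew": skew, "kurt": kurt,
            "max": max(row), "min": min(row)}

# Hypothetical 5-point-scale responses, eight items per case.
responses = {
    "case_a": [4, 3, 5, 4, 2, 4, 3, 5],
    "case_b": [3, 3, 3, 3, 3, 3, 3, 3],   # same option every time
    "case_c": [1, 5, 1, 5, 1, 5, 1, 5],   # only the scale endpoints
    "case_d": [2, 3, 2, 4, 3, 2, 3, 3],
}
summaries = {case: row_summary(row) for case, row in responses.items()}

# Sort the cases on each summary and look at both ends of the ordering.
for key in ("mean", "sd"):
    ordered = sorted(summaries, key=lambda case: summaries[case][key])
    print(key, "lowest:", ordered[0], "highest:", ordered[-1])
```

In this toy data the SD ordering puts the straight-lined case at the bottom and the endpoint-only case at the top. Both would then be traced back to the raw responses.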
Multivariate Outlier Detection
• Use Mahalanobis distance to detect outliers
  - Regress an arbitrary dependent variable on a set of related items, saving the Mahalanobis distance for each case
  - Sort by Mahalanobis distance: larger distances are suggestive of outliers
• Use autocorrelation to detect unusual data patterns
  - Flip the data: cases become variables and variables become cases
  - Run an autocorrelation function
  - Look at the ACF graphs to find oddly regular patterns of responding (autocorrelations in excess of .5 across one or more lags)
• I have provided example SPSS code in the utilities area of the LMS for each of these tests
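For readers without SPSS, the Mahalanobis step can be approximated directly from the item covariance matrix instead of through the regression route the slide describes. The sketch below is plain Python under that assumption. The function names and the eight-case dataset are hypothetical, and it returns squared distances, which should rank cases the same way.

```python
from statistics import mean

def covariance_matrix(rows):
    """Column means and sample covariance matrix of a cases-by-items table."""
    n, p = len(rows), len(rows[0])
    means = [mean(col) for col in zip(*rows)]
    cov = [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
            for j in range(p)] for i in range(p)]
    return means, cov

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mahalanobis_squared(rows):
    """Squared Mahalanobis distance of each case from the centroid."""
    means, cov = covariance_matrix(rows)
    distances = []
    for r in rows:
        d = [x - mu for x, mu in zip(r, means)]
        # d' * inverse(cov) * d, computed without forming the inverse explicitly
        distances.append(sum(di * si for di, si in zip(d, solve(cov, d))))
    return distances

# Hypothetical data: three positively related items, eight cases.
cases = [
    [1, 1, 2], [2, 2, 1], [2, 3, 3], [3, 3, 4],
    [4, 4, 3], [4, 5, 5], [5, 5, 4],
    [5, 1, 5],   # high on items 1 and 3 but low on item 2
]
d2 = mahalanobis_squared(cases)
# Sort by distance, largest first, as the slide suggests.
for index in sorted(range(len(d2)), key=d2.__getitem__, reverse=True)[:3]:
    print(index, round(d2[index], 2))
```

The last case breaks the pattern shared by the other seven, so it sorts to the top. None of its individual item values is unusual, which is why a univariate screen would miss it.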
Figure slides (images not reproduced):
• Mahalanobis
• Plot, Sort, and Examine
• An ACF Indicating No Pattern
• An ACF with a Suspicious Pattern
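The autocorrelation check from the Multivariate Outlier Detection slide can be sketched as follows. In Python each respondent's row can be treated as a series directly, so the "flip the data" step needed in SPSS is skipped. The function names, the `max_lag` default, and the example series are assumptions; the .5 cutoff is the one given in the slides.

```python
from statistics import mean

def autocorrelation(series, lag):
    """Sample autocorrelation of one respondent's answers at the given lag."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        return None  # constant row: autocorrelation is undefined
    num = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag))
    return num / denom

def suspicious_lags(series, max_lag=6, threshold=0.5):
    """Lags whose autocorrelation exceeds the threshold."""
    flagged = []
    for lag in range(1, max_lag + 1):
        r = autocorrelation(series, lag)
        if r is not None and r > threshold:
            flagged.append(lag)
    return flagged

# Hypothetical respondent who cycles 1-2-3-4-5 through a 20-item survey.
patterned = [1, 2, 3, 4, 5] * 4
print(suspicious_lags(patterned))               # [5]
print(round(autocorrelation(patterned, 5), 2))  # 0.75
```

A straight-lined row has zero variance and is skipped here. The standard-deviation sort under Bogus Response Detection is what catches those cases.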
Common Missing Data Mitigation Techniques
• Item imputation
  - For composite scales expressed as the average of a set of items, ignore any missing values that appear on a small subset of the items
• Mean substitution
  - Suppresses variability
• Time series imputation
  - Mean of neighboring points; suppresses spikes
• Regression imputation
  - Works well for highly intercorrelated variables
• Full information maximum likelihood imputation
  - Available in some SEM programs
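Three of the simpler techniques above can be illustrated with a short Python sketch. Missing values are represented as `None`. The function names and the `max_missing` cutoff are assumptions for the example, not rules from the slides.

```python
from statistics import mean

def scale_score(items, max_missing=1):
    """Average the answered items; give up if too many are missing."""
    present = [x for x in items if x is not None]
    if not present or len(items) - len(present) > max_missing:
        return None
    return mean(present)

def mean_substitute(column):
    """Replace each missing value in a variable with that variable's mean."""
    m = mean(x for x in column if x is not None)
    return [m if x is None else x for x in column]

def neighbor_impute(series):
    """Replace an isolated missing point with the mean of its two neighbors."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        left, right = series[i - 1], series[i + 1]
        if series[i] is None and left is not None and right is not None:
            filled[i] = (left + right) / 2
    return filled

print(scale_score([4, None, 5, 3]))      # 4
print(scale_score([4, None, None, 3]))   # None (too many items missing)
print(mean_substitute([2, None, 4]))     # [2, 3, 4]
print(neighbor_impute([1, None, 5, 4]))  # [1, 3.0, 5, 4]
```

Mean substitution pulls every filled value to the center of the distribution, which is the variability suppression the slide warns about.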
Excel Tips
• Your friend the “fill” function
• The power of “Paste Special”
• Sorting: Click on Data/Sort
Excel Statistical Formulas
• =FIND(<find text>, <within text>, <start>)
  - Looks for the string <find text> within the string <within text> and returns the position of the first occurrence at or after position <start>
  - Example: =FIND("=", "fish=head", 1) returns 5
• =LEN(<string>)
  - Returns the number of characters in a string
  - Example: =LEN("Ouch") returns 4
• =RIGHT(<string>, <length>)
  - Returns the rightmost <length> characters in the string
  - Example: =RIGHT("fishhead", 4) returns "head"
  - =LEFT(<string>, <length>) works similarly
• =AVERAGE(value, value…)
  - Gives the arithmetic mean of a collection of cells and/or numeric values
• =STDEV(value, value…) // STDEVP(value, value…)
  - Gives the sample/population standard deviation of a collection of cells and/or numeric values
• =SUM(value, value…)
  - Gives the sum of a collection of cells and/or numeric values
• =CORREL(vector1, vector2)
  - Gives the Pearson correlation between two vectors
• =IF(<test>, <value if true>, <value if false>)
  - Makes a logical test and returns a different value depending on whether the test is true or false
  - Example: =IF(1=1, "Yes!", "No…") returns "Yes!"
Summary of Bad Data Problems
• Multiple submissions: The same participant clicks on Submit, then Back, then Submit, then Back…
• Unmotivated responding: The participant uses the same option over and over again
• Malicious patterns: The participant enters some unusually regular pattern of responses
• There are at least five errors of these kinds in the exercise dataset (see below)
