Research Data Management
Spring 2014: Session 3
Practical strategies for better results
University Library
Center for Digital Scholarship
QUALITY ASSURANCE & CONTROL
MODULE 3
LEARNING OUTCOMES
• Develop procedures for quality assurance and quality control activities.
Data Integrity
1. Data have integrity if they have been
maintained without unauthorized alteration
or destruction
2. Data integrity is data that has a complete or
whole structure.
(http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Data_integrity.html)
Data Quality
• Fitness for use (depends on context of your questions)
• Data quality is the most important aspect of data
management
• Ensured by
– Sufficient resources and expertise
– Paying close attention to the design of data collection
instruments
– Creating appropriate entry, validation, and reporting processes
– Ongoing QC processes
– Understanding the data collected
Chapman, 2005
Dept of Biostatistics – Data Management, IUSM
Data Quality Standards
• Check data for its logical consistency.
• Check data for reasonableness.
• Ensure adherence to sound estimation methodologies.
• Ensure adherence to monetary submission standards for
stolen and recovered property.
• Ensure that other statistical edit functions are processed
within established parameters.
FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines
Dept of Biostatistics – Data Management, IUSM
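The consistency and reasonableness checks above can be sketched in a few lines of Python. This is a minimal illustration, not a standard: the field names (age, height_cm, weight_kg, bmi), the ranges, and the 0.5 tolerance are all assumptions chosen for the example, and values are assumed to be numeric or missing (None).

```python
def check_record(record):
    """Return a list of problems found in one record (a dict of field values)."""
    problems = []
    age = record.get("age")
    height_cm = record.get("height_cm")
    weight_kg = record.get("weight_kg")
    bmi = record.get("bmi")

    # Reasonableness: values should fall inside plausible (assumed) ranges.
    if age is None or not (0 <= age <= 120):
        problems.append("age out of range")
    if height_cm is None or not (30 <= height_cm <= 250):
        problems.append("height out of range")

    # Logical consistency: a recorded BMI should agree with height and weight.
    if None not in (bmi, height_cm, weight_kg) and height_cm > 0:
        expected = weight_kg / (height_cm / 100) ** 2
        if abs(expected - bmi) > 0.5:  # assumed tolerance
            problems.append("bmi inconsistent with height and weight")
    return problems
```

A record that passes returns an empty list; anything else is a prompt to look at the source document, not an automatic correction.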
Data Entry and Manipulation
• Strategies for preventing errors from entering a dataset
• Activities to ensure quality of data before collection
• Activities that involve monitoring and maintaining the
quality of data during the study
Data Entry and Manipulation
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure assigned person is educated in QA/QC
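One way to enforce format and code standards is to write them down as data and check every row against them. The sketch below assumes a hypothetical codebook (ALLOWED_CODES), hypothetical field names, and an ISO-style date format; none of these come from the slides.

```python
from datetime import datetime

# Hypothetical standards, written once and shared by everyone entering data.
ALLOWED_CODES = {"sex": {"F", "M", "U"}, "site": {"01", "02", "03"}}
DATE_FORMAT = "%Y-%m-%d"  # assumed standard: year-month-day

def violations(row):
    """Return the names of fields in `row` that break the standards."""
    found = []
    for field, allowed in ALLOWED_CODES.items():
        if row.get(field) not in allowed:
            found.append(field)
    try:
        datetime.strptime(str(row.get("visit_date", "")), DATE_FORMAT)
    except ValueError:
        found.append("visit_date")
    return found
```

Keeping the standards in one place means the person responsible for data quality can review and update them without touching the checking logic.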
Quality Assurance v. Control
• QA: set of processes, procedures, and activities that
are initiated prior to data collection to ensure the
expected level of quality will be reached and data
integrity will be maintained.
• QC: a system for verifying and maintaining a desired
level of quality in a product or service.
http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityControl
Quality Assurance in Practice
• CRF (case report form, the data collection instrument) review & validation
• System/process testing & validation
• Training, education, communication of a team
• Standard Operating Procedures, Standard Operating
Guidelines
• Site audits
Dept of Biostatistics – Data Management, IUSM
Quality Control in Practice
• Set of processes, procedures, and activities
associated with monitoring, detection, and action
during and after data collection.
• Examples:
– Errors in individual data fields
– Systematic errors
– Violation of protocol
– Staff performance issues
– Fraud or scientific misconduct
Dept of Biostatistics – Data Management, IUSM
Activity
Define data quality standards for the following
variables:
• Age
• Height
• BMI
• Life satisfaction scale
• Number of close friends
Don’t forget to upload this to Box.
Suggested file name “Data Quality Standards”
References
1. Department of Biostatistics – Data Management Team, Indiana
University School of Medicine (2013). Data Management including
REDCap. (provided via email)
2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for
the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8.
http://www.gbif.org/resources/2829
3. DataONE Education Module: Data Quality Control and Assurance.
DataONE. From
http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx
DATA COLLECTION
MODULE 3
LEARNING OUTCOMES
• Describe key considerations for selecting data collection tools.
Choose your tools wisely
Allie Brosh, 2010
Activity
Draft data collection instrument
See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX”
Don’t forget to upload this to Box.
Suggested file name “Data Collection Tool”
References
1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably.
http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html
DATA CODING & ENTRY
MODULE 3
LEARNING OUTCOMES
• Use best practices for coding.
• Use best practices for data entry.
Goals of Data Entry
• Publishable results!
– Valid data that are organized to support smooth
analysis
• Easy to import into analytical program
• Minimize manipulations and errors
• Has a logical [data] structure
Activity
Draft data coding scheme for data
entry
• Review data entry best practices
document in Box
Don’t forget to upload this to Box.
Suggested file name “Coding Scheme”
References
1. DataONE Education Module: Data Entry and Manipulation. DataONE.
From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx
2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist
presented at the AGU Workshop. From
http://wiki.esipfed.org/index.php/2011AGUworkshop
3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt
CRC Research Skills Workshop Series. From
http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf
DATA SCREENING & CLEANING
MODULE 3
LEARNING OUTCOMES
• Develop a screening and cleaning protocol and/or checklist.
Data Entry and Manipulation
Data Contamination
• Process or phenomenon, other than the one of interest,
that affects the variable value
• Erroneous values
CC image by Michael Coghlan on Flickr
Data Entry and Manipulation
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the
field
CC image by NickJWebb on Flickr
Data Entry and Manipulation
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the
recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
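The computer verification step of double entry amounts to comparing the two keyed datasets field by field. A minimal sketch, assuming each dataset is held as a dictionary mapping a record ID to a dictionary of field values (the layout and names are illustrative):

```python
def compare_entries(first, second):
    """Compare two independently keyed datasets.

    Each argument maps record ID -> {field name: value}.
    Returns a list of (record_id, field, value_in_first, value_in_second)
    for every field where the two entries disagree or one is absent.
    """
    discrepancies = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                discrepancies.append((record_id, field, a.get(field), b.get(field)))
    return discrepancies
```

Each discrepancy is resolved by going back to the source form; the comparison only says where the two entries differ, not which one is right.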
Data Entry and Manipulation
• Design data storage well
◦ Minimize the number of times items must be entered repeatedly
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows undo if necessary
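Documenting changes can be as simple as appending an entry to a log every time a value is corrected. The sketch below is one possible shape for such a log; the function names and the log fields are hypothetical.

```python
def apply_correction(rows, log, record_id, field, new_value, reason):
    """Change one value and append a log entry describing the change."""
    old_value = rows[record_id][field]
    rows[record_id][field] = new_value
    log.append({
        "record_id": record_id,
        "field": field,
        "old": old_value,
        "new": new_value,
        "reason": reason,
    })

def undo_last(rows, log):
    """Reverse the most recent logged change and return its log entry."""
    entry = log.pop()
    rows[entry["record_id"]][entry["field"]] = entry["old"]
    return entry
```

Because every entry keeps the old value and the reason, a later reviewer can see that an error was already checked, and any change can be reversed.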
Data Entry and Manipulation
• Make sure data line up in proper columns
• Check for missing, impossible, or anomalous values
• Perform statistical summaries
CC image by chesapeakeclimate on Flickr
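A quick statistical summary of each variable often exposes missing or impossible values before any analysis starts. A minimal sketch using Python's standard statistics module, assuming missing values are stored as None and at least one value is present:

```python
import statistics

def summarize(values):
    """Summarize one numeric variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }
```

For an age variable, a maximum of 340 or a mean far above the median is a signal to go back to the source records.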
Data Entry and Manipulation
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model
being used
◦ The goal is not to eliminate outliers but to identify potential data
contamination
[Figure: example plot for spotting outliers; vertical axis 0 to 60, horizontal axis 0 to 35]
Data Entry and Manipulation
• Methods to look for outliers
◦ Graphical
• Normal probability plots
• Regression
• Scatter plots
◦ Maps
◦ Subtract values from mean
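The "subtract values from mean" method can be sketched as a standardized deviation: how many standard deviations each value lies from the mean. The threshold of 2.5 is an assumption for illustration, and the function only flags values for review; it does not remove them.

```python
import statistics

def flag_outliers(values, threshold=2.5):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations from the mean. Needs at least two values."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # all values identical, nothing to flag
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / sd > threshold]
```

In small samples a single extreme value inflates the standard deviation, so no value can score very high; a lower threshold or a graphical check may be needed there.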
Data Entry and Manipulation
• Data contamination occurs when a factor not examined by the study
alters data values
• Data error types: commission or omission
• Quality assurance and quality control are strategies for
◦ preventing errors from entering a dataset
◦ ensuring data quality for entered data
◦ monitoring and maintaining data quality throughout the project
• Identify and enforce quality assurance and quality control
measures throughout the Data Life Cycle
Discussion
Using the Data Review Checklist,
evaluate the HBSC codebook
“DataMgmtLab-Spr14_DataReviewChecklist_EX”
What screening & cleaning procedures
were used?
Data Entry and Manipulation
1. D. Edwards, in Ecological Data: Design, Management and Processing,
WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91.
Available at www.ecoinformatics.org/pubs
2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for
preparing ecological data sets to share and archive. Bull. Ecol. Soc.
Amer. 82, 138-141 (2001).
3. A. D. Chapman, “Principles of Data Quality: Report for the Global
Biodiversity Information Facility” (Global Biodiversity Information
Facility, Copenhagen, 2004). Available at
http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/
References
1. Cook, 2013, NACP Best Data Management Practices Workshop. From
http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt
2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data
provenance in e-Science. SIGMOD Record, 34(3), 31-36. From
http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf
3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and
Management. From
http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing
AUTOMATION
MODULE 3
LEARNING OUTCOMES
• Explain why automation provides better provenance than manual processes.
• Identify effective tools for automating data processing and analysis.
Choose your tools wisely
• Documents
• Excel
• Access
• SPSS, Minitab
• Mathematica, MATLAB, Scilab
• SAS, Stata
• R
• MapReduce
• NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc.
http://www.dataone.org/all-software-tools
Data Formats; Version 1.0
Overview
• Spreadsheets are amazingly flexible, and are commonly
used for data collection, analysis and management
• Spreadsheets are seldom self-documenting, and seldom
well-documented
• Subtle (and not so subtle) errors are easily introduced
during entry, manipulation and analysis
• Spreadsheet conventions – often ad hoc and evolutionary –
may change or be applied inconsistently
• Spreadsheet file formats are proprietary and thus generally
unacceptable for long-term archival purposes
Data Entry and Manipulation
Spreadsheets
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain
numbers or text
• Lack record integrity (a column can be sorted independently of all others)
• Easy to use, but harder to maintain as complexity and size of data grow
Databases
• Easy to query to select portions of data
• Data fields are typed; for example, only integers are allowed in
integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
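Typed fields can be illustrated with SQLite, which ships with Python. This is a sketch rather than a recommendation of a particular database: the table and column names are hypothetical, and because SQLite is lenient about types by default, the example adds an explicit CHECK constraint to reject non-integer ages.

```python
import sqlite3

def typed_field_demo():
    """Insert one valid and one invalid row; return (stored rows, rejected flag)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE participant ("
        " id TEXT PRIMARY KEY,"
        " age INTEGER NOT NULL"
        " CHECK (typeof(age) = 'integer' AND age BETWEEN 0 AND 120))"
    )
    conn.execute("INSERT INTO participant VALUES (?, ?)", ("001", 34))
    rejected = False
    try:
        # Text in an integer field violates the CHECK constraint.
        conn.execute("INSERT INTO participant VALUES (?, ?)", ("002", "thirty"))
    except sqlite3.IntegrityError:
        rejected = True
    rows = conn.execute("SELECT id, age FROM participant").fetchall()
    conn.close()
    return rows, rejected
```

A spreadsheet would accept "thirty" in an age column without complaint; here the bad row never enters the table.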
NACP Best Data Management Practices, February 3, 2013
5. Preserve information (cont)
• Use a scripted language to process data
– R Statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts are records of processing
– Scripts can be revised, rerun
• Graphical User Interface-based analyses may
seem easy, but don’t leave a record
Provenance, Audit Trails, etc.
• “…information that helps determine the
derivation history of a data product, starting from
its original sources.” (Simmhan et al, 2005)
– Ancestral data products from which the data evolved
– Process of transformation of these ancestral data
products
• Uses: data quality, audit trail, replication recipe,
attribution, informational
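A script can capture both parts of this definition: a fingerprint of the ancestral data and the ordered list of transformations applied to it. The sketch below is one assumed way to do that with Python's hashlib; the function name and the shape of the record are illustrative, not an established provenance format.

```python
import hashlib

def run_with_provenance(raw_bytes, steps):
    """Apply named processing steps to text data and record what was done.

    `steps` is a list of (name, function) pairs; each function takes and
    returns a string. Returns (processed text, provenance record).
    """
    record = {
        "input_sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "steps": [],
    }
    data = raw_bytes.decode("utf-8")
    for name, func in steps:
        data = func(data)
        record["steps"].append(name)
    record["output_sha256"] = hashlib.sha256(data.encode("utf-8")).hexdigest()
    return data, record
```

Saving the record next to the output gives a simple audit trail: anyone with the same input and the same script can confirm they reproduce the same output hash.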
More Considerations
• Field names & descriptions
• Structured entry
• Validation
• Record integrity
• Missing data
• Data/field types
• File types: common, open documented standard
• Output required for analysis and visualization
Demonstration & Discussion
Run [analysis] in Excel and Stata.
Compare output.
• What features does Stata have that Excel
does not?
• How do these features support
provenance and data integrity?
References
1. DataONE Education Module: Data Entry and Manipulation. DataONE.
From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx

Contenu connexe

En vedette

Are Your Students Ready for Lab?
Are Your Students Ready for Lab?Are Your Students Ready for Lab?
Are Your Students Ready for Lab?Cengage Learning
 
Corporate Data Quality Management Research and Services Overview
Corporate Data Quality Management Research and Services OverviewCorporate Data Quality Management Research and Services Overview
Corporate Data Quality Management Research and Services OverviewBoris Otto
 
( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slidesNicolas Sarramagna
 
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?Harald Erb
 
Highway Engineering Lab Protocol (Cycle-1)
Highway Engineering Lab Protocol (Cycle-1)Highway Engineering Lab Protocol (Cycle-1)
Highway Engineering Lab Protocol (Cycle-1)PENKI RAMU
 
Physics Lab Practical
Physics Lab PracticalPhysics Lab Practical
Physics Lab PracticalAkib Al Islam
 
Construction Materials Engineering and Testing
Construction Materials Engineering and TestingConstruction Materials Engineering and Testing
Construction Materials Engineering and Testingmecocca5
 
Science laboratory equipment
Science laboratory equipmentScience laboratory equipment
Science laboratory equipmentLauriz Aclan
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...Jürgen Ambrosi
 
Material Testing Lab Equipments
Material Testing Lab EquipmentsMaterial Testing Lab Equipments
Material Testing Lab EquipmentsNaveed Hussain
 
Graphical representation of data mohit verma
Graphical representation of data mohit verma Graphical representation of data mohit verma
Graphical representation of data mohit verma MOHIT KUMAR VERMA
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of datadrasifk
 
Graphical Representation of data
Graphical Representation of dataGraphical Representation of data
Graphical Representation of dataJijo K Mathew
 
Data Analysis, Presentation and Interpretation of Data
Data Analysis, Presentation and Interpretation of DataData Analysis, Presentation and Interpretation of Data
Data Analysis, Presentation and Interpretation of DataRoqui Malijan
 

En vedette (20)

Are Your Students Ready for Lab?
Are Your Students Ready for Lab?Are Your Students Ready for Lab?
Are Your Students Ready for Lab?
 
Corporate Data Quality Management Research and Services Overview
Corporate Data Quality Management Research and Services OverviewCorporate Data Quality Management Research and Services Overview
Corporate Data Quality Management Research and Services Overview
 
( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides( Big ) Data Management - Data Quality - Global concepts in 5 slides
( Big ) Data Management - Data Quality - Global concepts in 5 slides
 
Big Data At A Human Scale
Big Data At A Human ScaleBig Data At A Human Scale
Big Data At A Human Scale
 
Data Quality Control
Data Quality ControlData Quality Control
Data Quality Control
 
Biology lab safety
Biology lab safety Biology lab safety
Biology lab safety
 
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only?
 
Highway Engineering Lab Protocol (Cycle-1)
Highway Engineering Lab Protocol (Cycle-1)Highway Engineering Lab Protocol (Cycle-1)
Highway Engineering Lab Protocol (Cycle-1)
 
Physics Lab Practical
Physics Lab PracticalPhysics Lab Practical
Physics Lab Practical
 
Construction Materials Engineering and Testing
Construction Materials Engineering and TestingConstruction Materials Engineering and Testing
Construction Materials Engineering and Testing
 
Science laboratory equipment
Science laboratory equipmentScience laboratory equipment
Science laboratory equipment
 
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...1° Sessione Oracle CRUI: Analytics Data Lab,  the power of Big Data Investiga...
1° Sessione Oracle CRUI: Analytics Data Lab, the power of Big Data Investiga...
 
Lab safety rules and symbols Summary
Lab safety rules and symbols SummaryLab safety rules and symbols Summary
Lab safety rules and symbols Summary
 
Material Testing Lab Equipments
Material Testing Lab EquipmentsMaterial Testing Lab Equipments
Material Testing Lab Equipments
 
Graphical representation of data mohit verma
Graphical representation of data mohit verma Graphical representation of data mohit verma
Graphical representation of data mohit verma
 
Graphical presentation of data
Graphical presentation of dataGraphical presentation of data
Graphical presentation of data
 
Graphical Representation of data
Graphical Representation of dataGraphical Representation of data
Graphical Representation of data
 
Data Analysis, Presentation and Interpretation of Data
Data Analysis, Presentation and Interpretation of DataData Analysis, Presentation and Interpretation of Data
Data Analysis, Presentation and Interpretation of Data
 
Chapter 4 presentation of data
Chapter 4 presentation of dataChapter 4 presentation of data
Chapter 4 presentation of data
 
Presentation of data
Presentation of dataPresentation of data
Presentation of data
 

Similaire à Data Management Lab: Session 3 Slides

Ensuring data quality
Ensuring data qualityEnsuring data quality
Ensuring data qualityIUPUI
 
Data Management Lab: Session 1 Slides
Data Management Lab: Session 1 SlidesData Management Lab: Session 1 Slides
Data Management Lab: Session 1 SlidesIUPUI
 
Machine Learning for Predictive Data Analysis in Clinical Research
Machine Learning for Predictive Data Analysis in Clinical ResearchMachine Learning for Predictive Data Analysis in Clinical Research
Machine Learning for Predictive Data Analysis in Clinical ResearchClinosolIndia
 
Data Cleaning and Validation: Best Practices for Data Integrity
Data Cleaning and Validation: Best Practices for Data IntegrityData Cleaning and Validation: Best Practices for Data Integrity
Data Cleaning and Validation: Best Practices for Data IntegrityClinosolIndia
 
A simplified approach for quality management in data warehouse
A simplified approach for quality management in data warehouseA simplified approach for quality management in data warehouse
A simplified approach for quality management in data warehouseIJDKP
 
Data Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructionsData Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructionsIUPUI
 
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...Health Catalyst
 
DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE
 
Introduction to Data Analytics.pptx
Introduction to Data Analytics.pptxIntroduction to Data Analytics.pptx
Introduction to Data Analytics.pptxDikshantSharma63
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseVaticle
 
How do you assess the quality and reliability of data sources in data analysi...
How do you assess the quality and reliability of data sources in data analysi...How do you assess the quality and reliability of data sources in data analysi...
How do you assess the quality and reliability of data sources in data analysi...Soumodeep Nanee Kundu
 
Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...
Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...
Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...ClinosolIndia
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingKnoldus Inc.
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best PracticesChristopher Eaker
 
Ethical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data RevolutionEthical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data RevolutionMelissa Moody
 
CLINICAL DATA MANAGEMENT.pptx
CLINICAL DATA MANAGEMENT.pptxCLINICAL DATA MANAGEMENT.pptx
CLINICAL DATA MANAGEMENT.pptxAkshata Kawaste
 
Reproducible research: theory
Reproducible research: theoryReproducible research: theory
Reproducible research: theoryC. Tobin Magle
 

Similaire à Data Management Lab: Session 3 Slides (20)

Ensuring data quality
Ensuring data qualityEnsuring data quality
Ensuring data quality
 
Data Management Lab: Session 1 Slides
Data Management Lab: Session 1 SlidesData Management Lab: Session 1 Slides
Data Management Lab: Session 1 Slides
 
(2012) The Role of Test Administrator and Error proposal
(2012) The Role of Test Administrator and Error proposal(2012) The Role of Test Administrator and Error proposal
(2012) The Role of Test Administrator and Error proposal
 
Quality Assurance in Knowledge Data Warehouse
Quality Assurance in Knowledge Data WarehouseQuality Assurance in Knowledge Data Warehouse
Quality Assurance in Knowledge Data Warehouse
 
Machine Learning for Predictive Data Analysis in Clinical Research
Machine Learning for Predictive Data Analysis in Clinical ResearchMachine Learning for Predictive Data Analysis in Clinical Research
Machine Learning for Predictive Data Analysis in Clinical Research
 
Data Cleaning and Validation: Best Practices for Data Integrity
Data Cleaning and Validation: Best Practices for Data IntegrityData Cleaning and Validation: Best Practices for Data Integrity
Data Cleaning and Validation: Best Practices for Data Integrity
 
A simplified approach for quality management in data warehouse
A simplified approach for quality management in data warehouseA simplified approach for quality management in data warehouse
A simplified approach for quality management in data warehouse
 
Data Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructionsData Management Lab: Data mapping exercise instructions
Data Management Lab: Data mapping exercise instructions
 
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
Optimize Your Healthcare Data Quality Investment: Three Ways to Accelerate Ti...
 
DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?DataONE Education Module 01: Why Data Management?
DataONE Education Module 01: Why Data Management?
 
Introduction to Data Analytics.pptx
Introduction to Data Analytics.pptxIntroduction to Data Analytics.pptx
Introduction to Data Analytics.pptx
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
 
How do you assess the quality and reliability of data sources in data analysi...
How do you assess the quality and reliability of data sources in data analysi...How do you assess the quality and reliability of data sources in data analysi...
How do you assess the quality and reliability of data sources in data analysi...
 
Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...
Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...
Enhancing Data Quality in Clinical Trials: Best Practices and Quality Control...
 
Data Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable TestingData Quality in Test Automation Navigating the Path to Reliable Testing
Data Quality in Test Automation Navigating the Path to Reliable Testing
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best Practices
 
Intro to Data Management
Intro to Data ManagementIntro to Data Management
Intro to Data Management
 
Ethical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data RevolutionEthical Priniciples for the All Data Revolution
Ethical Priniciples for the All Data Revolution
 
CLINICAL DATA MANAGEMENT.pptx
CLINICAL DATA MANAGEMENT.pptxCLINICAL DATA MANAGEMENT.pptx
CLINICAL DATA MANAGEMENT.pptx
 
Reproducible research: theory
Reproducible research: theoryReproducible research: theory
Reproducible research: theory
 

Plus de IUPUI

Altmetrics 101 - Altmetrics in Libraries
Altmetrics 101 - Altmetrics in LibrariesAltmetrics 101 - Altmetrics in Libraries
Altmetrics 101 - Altmetrics in LibrariesIUPUI
 
Gather evidence to demonstrate the impact of your research
Gather evidence to demonstrate the impact of your researchGather evidence to demonstrate the impact of your research
Gather evidence to demonstrate the impact of your researchIUPUI
 
Managing data responsibly to enable research interity
Managing data responsibly to enable research interityManaging data responsibly to enable research interity
Managing data responsibly to enable research interityIUPUI
 
Case studies for open science
Case studies for open scienceCase studies for open science
Case studies for open scienceIUPUI
 
Midwest Medical Library Association 2015 Big Data Panel
Midwest Medical Library Association 2015 Big Data PanelMidwest Medical Library Association 2015 Big Data Panel
Midwest Medical Library Association 2015 Big Data PanelIUPUI
 
Gathering Evidence to Demonstrate Impact
Gathering Evidence to Demonstrate ImpactGathering Evidence to Demonstrate Impact
Gathering Evidence to Demonstrate ImpactIUPUI
 
Citation & altmetrics - a comparison
Citation & altmetrics - a comparisonCitation & altmetrics - a comparison
Citation & altmetrics - a comparisonIUPUI
 
Altmetrics for Team Science
Altmetrics for Team ScienceAltmetrics for Team Science
Altmetrics for Team ScienceIUPUI
 
Preventing data loss
Preventing data lossPreventing data loss
Preventing data lossIUPUI
 
Practical Data Management Plans
Practical Data Management PlansPractical Data Management Plans
Practical Data Management PlansIUPUI
 
Teaching data management in a lab environment (IASSIST 2014)
Teaching data management in a lab environment (IASSIST 2014)Teaching data management in a lab environment (IASSIST 2014)
Teaching data management in a lab environment (IASSIST 2014)IUPUI
 
Building the Future of Research Together
Building the Future of Research TogetherBuilding the Future of Research Together
Building the Future of Research TogetherIUPUI
 
NIH Data Sharing Plan Workshop - Handout
NIH Data Sharing Plan Workshop - HandoutNIH Data Sharing Plan Workshop - Handout
NIH Data Sharing Plan Workshop - HandoutIUPUI
 
NIH Data Sharing Plan Workshop - Slides
NIH Data Sharing Plan Workshop - SlidesNIH Data Sharing Plan Workshop - Slides
NIH Data Sharing Plan Workshop - SlidesIUPUI
 
Data Management Lab: Session 4 Slides
Data Management Lab: Session 4 SlidesData Management Lab: Session 4 Slides
Data Management Lab: Session 4 SlidesIUPUI
 
Data Management Lab: Session 4 Review Outline
Data Management Lab: Session 4 Review OutlineData Management Lab: Session 4 Review Outline
Data Management Lab: Session 4 Review OutlineIUPUI
 
Data Management Lab: Session 3 Data Review Checklist
Data Management Lab: Session 3 Data Review ChecklistData Management Lab: Session 3 Data Review Checklist
Data Management Lab: Session 3 Data Review ChecklistIUPUI
 
Data Management Lab: Session 3 Data Entry Best Practices
Data Management Lab: Session 3 Data Entry Best PracticesData Management Lab: Session 3 Data Entry Best Practices
Data Management Lab: Session 3 Data Entry Best PracticesIUPUI
 
Data Management Lab: Session 3 Data Coding Best Practices
Data Management Lab: Session 3 Data Coding Best PracticesData Management Lab: Session 3 Data Coding Best Practices
Data Management Lab: Session 3 Data Coding Best PracticesIUPUI
 
Data Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesData Management Lab: Session 2 slides
Data Management Lab: Session 2 slidesIUPUI
 

Plus de IUPUI (20)

Altmetrics 101 - Altmetrics in Libraries
Altmetrics 101 - Altmetrics in LibrariesAltmetrics 101 - Altmetrics in Libraries
Altmetrics 101 - Altmetrics in Libraries
 
Gather evidence to demonstrate the impact of your research
Gather evidence to demonstrate the impact of your researchGather evidence to demonstrate the impact of your research
Gather evidence to demonstrate the impact of your research
 
Managing data responsibly to enable research interity
Managing data responsibly to enable research interityManaging data responsibly to enable research interity
Managing data responsibly to enable research interity
 


Data Management Lab: Session 3 Slides

  • 1. Research Data Management Spring 2014: Session 3 Practical strategies for better results University Library Center for Digital Scholarship
  • 2. QUALITY ASSURANCE & CONTROL MODULE 3
  • 3. LEARNING OUTCOMES • Develop procedures for quality assurance and quality control activities.
  • 4. Data Integrity 1. Data have integrity if they have been maintained without unauthorized alteration or destruction 2. Data integrity is data that has a complete or whole structure. (http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Data_integrity.html)
  • 5. Data Quality • Fitness for use (depends on context of your questions) • Data quality is the most important aspect of data management • Ensured by – Sufficient resources and expertise – Paying close attention to the design of data collection instruments – Creating appropriate entry, validation, and reporting processes – Ongoing QC processes – Understanding the data collected Chapman, 2005 Dept of Biostatistics – Data Management, IUSM
  • 6. Data Quality Standards • Check data for its logical consistency. • Check data for reasonableness. • Ensure adherence to sound estimation methodologies. • Ensure adherence to monetary submission standards for stolen and recovered property. • Ensure that other statistical edit functions are processed within established parameters. FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines Dept of Biostatistics – Data Management, IUSM
  • 7. Data Entry and Manipulation • Strategies for preventing errors from entering a dataset • Activities to ensure quality of data before collection • Activities that involve monitoring and maintaining the quality of data during the study
  • 8. Data Entry and Manipulation • Define & enforce standards ◦ Formats ◦ Codes ◦ Measurement units ◦ Metadata • Assign responsibility for data quality ◦ Be sure assigned person is educated in QA/QC
  • 9. Quality Assurance v. Control • QA: set of processes, procedures, and activities that are initiated prior to data collection to ensure the expected level of quality will be reached and data integrity will be maintained. • QC: a system for verifying and maintaining a desired level of quality in a product or service. http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityControl
  • 10. Quality Assurance in Practice • CRF (data collection instrument) review & validation • System/process testing & validation • Training, education, communication of a team • Standard Operating Procedures, Standard Operating Guidelines • Site audits Dept of Biostatistics – Data Management, IUSM
  • 11. Quality Control in Practice • Set of processes, procedures, and activities associated with monitoring, detection, and action during and after data collection. • Examples: – Errors in individual data fields – Systematic errors – Violation of protocol – Staff performance issues – Fraud or scientific misconduct Dept of Biostatistics – Data Management, IUSM
  • 12. Activity Define data quality standards for the following variables: • Age • Height • BMI • Life satisfaction scale • Number of close friends Don’t forget to upload this to Box. Suggested file name “Data Quality Standards”
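One way to make standards like these checkable is to write them down as explicit rules and test each record against them. The sketch below is only an illustration: the field names, units, and every numeric limit are assumptions chosen for the example, not values given in the slides, and a real study would set them from its own protocol.

```python
# Hypothetical quality standards: (allowed types, lowest value, highest value).
# Every limit here is an illustrative assumption, not a value from the slides.
STANDARDS = {
    "age": ((int,), 0, 120),
    "height_cm": ((int, float), 50, 250),
    "bmi": ((int, float), 10, 70),
    "life_satisfaction": ((int,), 0, 10),
    "close_friends": ((int,), 0, 50),
}


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, (types, low, high) in STANDARDS.items():
        value = record.get(field)
        if value is None:
            # Missing values are reported rather than silently skipped.
            problems.append(f"{field}: missing")
        elif not isinstance(value, types):
            problems.append(f"{field}: unexpected type {type(value).__name__}")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")
    return problems
```

An empty list means the record met every rule; anything else is a prompt to go back to the source, not an automatic reason to delete the value.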
  • 13. References 1. Department of Biostatistics – Data Management Team, Indiana University School of Medicine (2013). Data Management including REDCap. (provided via email) 2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8. http://www.gbif.org/resources/2829 3. DataONE Education Module: Data Quality Control and Assurance. DataONE. From http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx
  • 15. LEARNING OUTCOMES • Describe key considerations for selecting data collection tools.
  • 17. Choose your tools wisely Allie Brosh, 2010
  • 18. Activity Draft data collection instrument See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX” Don’t forget to upload this to Box. Suggested file name “Data Collection Tool”
  • 19. References 1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably. http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html
  • 20. DATA CODING & ENTRY MODULE 3
  • 21. LEARNING OUTCOMES • Use best practices for coding. • Use best practices for data entry.
  • 22.
  • 23. Goals of Data Entry • Publishable results! – Valid data that are organized to support smooth analysis • Easy to import into analytical program • Minimize manipulations and errors • Has a logical [data] structure
  • 24.
  • 25. Activity Draft data coding scheme for data entry • Review data entry best practices document in Box Don’t forget to upload this to Box. Suggested file name “Coding Scheme”
  • 26. References 1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx 2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist presented at the AGU Workshop. From http://wiki.esipfed.org/index.php/2011AGUworkshop 3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt CRC Research Skills Workshop Series. From http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf
  • 27. DATA SCREENING & CLEANING MODULE 3
  • 28. LEARNING OUTCOMES • Develop a screening and cleaning protocol and/or checklist.
  • 29. Data Entry and Manipulation Data Contamination • Process or phenomenon, other than the one of interest, that affects the variable value • Erroneous values CC image by Michael Coghlan on Flickr
  • 30. Data Entry and Manipulation • Errors of Commission o Incorrect or inaccurate data entered o Examples: malfunctioning instrument, mistyped data • Errors of Omission o Data or metadata not recorded o Examples: inadequate documentation, human error, anomalies in the field CC image by Nick J Webb on Flickr
  • 31. Data Entry and Manipulation • Double entry ◦ Data keyed in by two independent people ◦ Check for agreement with computer verification • Record a reading of the data and transcribe from the recording • Use text-to-speech program to read data back CC image by weskriesel on Flickr
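The "check for agreement with computer verification" step of double entry can be sketched as a field-by-field comparison of the two keyed copies. This is a minimal illustration, assuming each copy has already been loaded into a dictionary keyed by record ID; the function name and data layout are hypothetical, not from the slides.

```python
def compare_entries(first, second):
    """Compare two independently keyed copies of the same dataset.

    Each copy maps a record ID to a dict of field values. Returns a list of
    (record_id, field, value_in_first, value_in_second) tuples, one for each
    disagreement, including records or fields present in only one copy.
    """
    mismatches = []
    for record_id in sorted(set(first) | set(second)):
        row_1 = first.get(record_id, {})
        row_2 = second.get(record_id, {})
        for field in sorted(set(row_1) | set(row_2)):
            value_1 = row_1.get(field)
            value_2 = row_2.get(field)
            if value_1 != value_2:
                mismatches.append((record_id, field, value_1, value_2))
    return mismatches


# Two keyed copies of the same two paper forms; one digit pair was transposed.
entry_a = {"001": {"age": "15", "height_cm": "162"}, "002": {"age": "14"}}
entry_b = {"001": {"age": "15", "height_cm": "126"}, "002": {"age": "14"}}
for mismatch in compare_entries(entry_a, entry_b):
    print(mismatch)
```

Each reported mismatch still has to be resolved against the original form; the comparison only shows where the two typists disagree, not which one is right.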
  • 32. Data Entry and Manipulation • Design data storage well ◦ Minimize the number of times items must be entered repeatedly ◦ Use consistent terminology ◦ Atomize data: one cell per piece of information • Document changes to data ◦ Avoids duplicate error checking ◦ Allows undo if necessary
  • 33. Data Entry and Manipulation • Make sure data line up in proper columns • No missing, impossible, or anomalous values • Perform statistical summaries CC image by chesapeakeclimate on Flickr
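A statistical summary per variable is a quick way to surface missing and impossible values. The sketch below uses only Python's standard `statistics` module; the function name and the choice of `None` and the empty string as missing-value markers are assumptions for illustration.

```python
import statistics


def summarize(values, missing_codes=(None, "")):
    """Summarize one numeric column, counting missing entries separately."""
    present = [v for v in values if v not in missing_codes]
    summary = {"n": len(values), "missing": len(values) - len(present)}
    if present:
        summary.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
        )
    return summary


# An age column with one missing value and one implausible value (150).
print(summarize([15, 14, None, 16, 150]))
```

Here the maximum and the gap between mean and median both point at the 150, which would then be checked against the source record.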
  • 34. Data Entry and Manipulation • Look for outliers ◦ Outliers are extreme values for a variable given the statistical model being used ◦ The goal is not to eliminate outliers but to identify potential data contamination
  • 35. Data Entry and Manipulation • Methods to look for outliers ◦ Graphical • Normal probability plots • Regression • Scatter plots ◦ Maps ◦ Subtract values from mean
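The "subtract values from mean" method can be sketched by expressing each deviation in units of the sample standard deviation and flagging the large ones. The function name and the default threshold are assumptions for illustration; the slides do not prescribe a cutoff.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations from the mean. Needs at least two values."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)
    if spread == 0:
        return []  # all values identical, nothing to flag
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]
```

In a small sample a single extreme value inflates the standard deviation and can hide itself, so the threshold is a judgment call. Flagged values are candidates for review as possible contamination, not values to drop automatically.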
  • 36. Data Entry and Manipulation • Data contamination is data that results from a factor not examined by the study that results in altered data values • Data error types: commission or omission • Quality assurance and quality control are strategies for ◦ preventing errors from entering a dataset ◦ ensuring data quality for entered data ◦ monitoring and maintaining data quality throughout the project • Identify and enforce quality assurance and quality control measures throughout the Data Life Cycle
  • 37. Discussion Using the Data Review Checklist, evaluate the HBSC codebook “DataMgmtLab-Spr14_DataReviewChecklist_EX” What screening & cleaning procedures were used?
  • 38. Data Entry and Manipulation 1. D. Edwards, in Ecological Data: Design, Management and Processing, WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91. Available at www.ecoinformatics.org/pubs 2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for preparing ecological data sets to share and archive. Bull. Ecol. Soc. Amer. 82, 138-141 (2001). 3. A. D. Chapman, “Principles of Data Quality: Report for the Global Biodiversity Information Facility” (Global Biodiversity Information Facility, Copenhagen, 2004). Available at http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/
  • 39. References 1. Cook, 2013, NACP Best Data Management Practices Workshop. From http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt 2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance in e-Science. SIGMOD Record, 34(3), 31-36. From http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf 3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and Management. From http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing
  • 41. LEARNING OUTCOMES • Explain why automation provides better provenance than manual processes. • Identify effective tools for automating data processing and analysis.
  • 42. Choose your tools wisely • Documents • Excel • Access • SPSS, Minitab • Mathematica, MATLAB, Scilab • SAS, Stata • R • MapReduce • NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc. http://www.dataone.org/all-software-tools
  • 43. Data Formats; Version 1.0 Overview • Spreadsheets are amazingly flexible, and are commonly used for data collection, analysis and management • Spreadsheets are seldom self-documenting, and seldom well-documented • Subtle (and not so subtle) errors are easily introduced during entry, manipulation and analysis • Spreadsheet conventions, often ad hoc and evolutionary, may change or be applied inconsistently • Spreadsheet file formats are proprietary and thus generally unacceptable for long-term archival purposes
  • 44. Data Entry and Manipulation Spreadsheets: • Great for charts, graphs, calculations • Flexible about cell content type: cells in the same column can contain numbers or text • Lack record integrity: a column can be sorted independently of all others • Easy to use, but harder to maintain as complexity and size of data grow Databases: • Easy to query to select portions of data • Data fields are typed; for example, only integers are allowed in integer fields • Columns cannot be sorted independently of each other • Steeper learning curve than a spreadsheet
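The point about typed fields can be illustrated with a small database table that refuses values of the wrong type. This sketch uses SQLite through Python's standard `sqlite3` module purely as an assumption for the example (the slides list Access, not SQLite); the table, column names, and age limits are hypothetical. Because SQLite is lenient about column types by default, the type rule is enforced here with an explicit `CHECK` constraint.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE participants (
        participant_id TEXT PRIMARY KEY,
        age INTEGER NOT NULL
            CHECK (typeof(age) = 'integer' AND age BETWEEN 0 AND 120)
    )
    """
)


def try_insert(participant_id, age):
    """Insert one row; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO participants VALUES (?, ?)", (participant_id, age)
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("001", 15))         # accepted
print(try_insert("002", "fifteen"))  # rejected: not an integer
print(try_insert("003", 150))        # rejected: outside the allowed range
```

A spreadsheet would have accepted all three rows; the database moves the check to entry time, before the bad value can reach analysis.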
  • 45. NACP Best Data Management Practices, February 3, 2013 5. Preserve information (cont) • Use a scripted language to process data – R Statistical package (free, powerful) – SAS – MATLAB • Processing scripts are records of processing – Scripts can be revised, rerun • Graphical User Interface-based analyses may seem easy, but don’t leave a record
  • 46. Provenance, Audit Trails, etc. • “…information that helps determine the derivation history of a data product, starting from its original sources.” (Simmhan et al, 2005) – Ancestral data products from which the data evolved – Process of transformation of these ancestral data products • Uses: data quality, audit trail, replication recipe, attribution, informational
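A scripted workflow can record its own derivation history as it runs. The sketch below is one hypothetical way to do that, not a method described in the slides: each processing step is wrapped so that its name, parameters, a checksum of the input (the ancestral data product), a checksum of the output, and a timestamp are appended to a log. All function and field names are assumptions for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Checksum identifying one exact version of a data product."""
    return hashlib.sha256(data).hexdigest()


def run_step(name, func, data: bytes, log: list, **params) -> bytes:
    """Apply one transformation and append a provenance record to `log`."""
    result = func(data, **params)
    log.append(
        {
            "step": name,
            "params": params,
            "input_sha256": sha256_of(data),
            "output_sha256": sha256_of(result),
            "run_at": datetime.now(timezone.utc).isoformat(),
        }
    )
    return result


def drop_blank_lines(data: bytes) -> bytes:
    """Example transformation: remove empty lines from a CSV file's bytes."""
    lines = [line for line in data.splitlines() if line.strip()]
    return b"\n".join(lines) + b"\n"


log = []
raw = b"id,age\n001,15\n\n002,14\n"
clean = run_step("drop_blank_lines", drop_blank_lines, raw, log)
print(json.dumps(log, indent=2))
```

Saved alongside the script and the raw file, the log gives an audit trail and a replication recipe: anyone can rerun the steps and confirm that the checksums match.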
  • 47. More Considerations • Field names & descriptions • Structured entry • Validation • Record integrity • Missing data • Data/field types • File types: common, open documented standard • Output required for analysis and visualization
  • 48. Demonstration & Discussion Run [analysis] in Excel and Stata. Compare output. • What features does Stata have that Excel does not? • How do these features support provenance and data integrity?
  • 49. References 1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx