Data Management Lab: Session 3 slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
4. Data Integrity
1. Data have integrity if they have been
maintained without unauthorized alteration
or destruction
2. Data have integrity if their structure is complete
or whole.
(http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Data_integrity.html)
5. Data Quality
• Fitness for use (depends on context of your questions)
• Data quality is the most important aspect of data
management
• Ensured by
– Sufficient resources and expertise
– Paying close attention to the design of data collection
instruments
– Creating appropriate entry, validation, and reporting processes
– Ongoing QC processes
– Understanding the data collected
Chapman, 2005
Dept of Biostatistics – Data Management, IUSM
6. Data Quality Standards
• Check data for its logical consistency.
• Check data for reasonableness.
• Ensure adherence to sound estimation methodologies.
• Ensure adherence to monetary submission standards for
stolen and recovered property.
• Ensure that other statistical edit functions are processed
within established parameters.
FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines
Dept of Biostatistics – Data Management, IUSM
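The first two checks above (logical consistency and reasonableness) can be sketched in a few lines of Python. This is only an illustration: the field names, the plausible ranges, and the 0.5 tolerance are assumptions made for the example, not standards given in these slides.

```python
def check_record(record):
    """Return a list of data quality problems found in one participant record."""
    problems = []
    age = record.get("age_years")      # hypothetical field names
    height = record.get("height_cm")
    weight = record.get("weight_kg")
    bmi = record.get("bmi")

    # Reasonableness: values must fall inside plausible (assumed) ranges.
    if age is None or not 0 <= age <= 120:
        problems.append("age_years out of range")
    if height is None or not 50 <= height <= 250:
        problems.append("height_cm out of range")

    # Logical consistency: a recorded BMI should match weight / height^2.
    if None not in (height, weight, bmi) and height > 0:
        expected = weight / (height / 100) ** 2
        if abs(expected - bmi) > 0.5:  # assumed tolerance
            problems.append("bmi inconsistent with height and weight")
    return problems


if __name__ == "__main__":
    print(check_record({"age_years": 340, "height_cm": 170, "weight_kg": 65, "bmi": 30.0}))
```

A record that passes returns an empty list, so the same function can be run over every row as part of routine QC.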
7. Data Entry and Manipulation
• Strategies for preventing errors from entering a dataset
• Activities to ensure quality of data before collection
• Activities that involve monitoring and maintaining the
quality of data during the study
8. Data Entry and Manipulation
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure assigned person is educated in QA/QC
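One way to enforce standards for codes and measurement units is to keep them in a single place that the data entry process checks against. The sketch below assumes a hypothetical codebook and assumes centimetres as the project's standard unit for height; neither comes from the slides.

```python
# Hypothetical codebook: allowed codes for each coded variable.
CODEBOOK = {
    "life_satisfaction": {"1", "2", "3", "4", "5", "99"},  # 99 = missing (assumed)
}

# Conversion factors to centimetres, the assumed standard unit.
TO_CM = {"cm": 1.0, "m": 100.0, "in": 2.54}


def is_valid_code(variable, code):
    """Return True if the code is listed in the codebook for this variable."""
    return code in CODEBOOK.get(variable, set())


def standardize_height(value, unit):
    """Convert a height to centimetres, rejecting units that are not defined."""
    if unit not in TO_CM:
        raise ValueError(f"unknown unit: {unit}")
    return round(value * TO_CM[unit], 1)
```

For example, `standardize_height(67, "in")` gives a value in centimetres, while an undefined unit such as `"ft"` raises an error instead of silently entering the dataset.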
9. Quality Assurance v. Control
• QA: set of processes, procedures, and activities that
are initiated prior to data collection to ensure the
expected level of quality will be reached and data
integrity will be maintained.
• QC: a system for verifying and maintaining a desired
level of quality in a product or service.
http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityControl
10. Quality Assurance in Practice
• CRF (data collection instrument) review & validation
• System/process testing & validation
• Training, education, communication of a team
• Standard Operating Procedures, Standard Operating
Guidelines
• Site audits
Dept of Biostatistics – Data Management, IUSM
11. Quality Control in Practice
• Set of processes, procedures, and activities
associated with monitoring, detection, and action
during and after data collection.
• Examples:
– Errors in individual data fields
– Systematic errors
– Violation of protocol
– Staff performance issues
– Fraud or scientific misconduct
Dept of Biostatistics – Data Management, IUSM
12. Activity
Define data quality standards for the following
variables:
• Age
• Height
• BMI
• Life satisfaction scale
• Number of close friends
Don’t forget to upload this to Box.
Suggested file name “Data Quality Standards”
13. References
1. Department of Biostatistics – Data Management Team, Indiana
University School of Medicine (2013). Data Management including
REDCap. (provided via email)
2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for
the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8.
http://www.gbif.org/resources/2829
3. DataONE Education Module: Data Quality Control and Assurance.
DataONE. From
http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx
18. Activity
Draft data collection instrument
See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX”
Don’t forget to upload this to Box.
Suggested file name “Data Collection Tool”
19. References
1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably.
http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html
23. Goals of Data Entry
• Publishable results!
– Valid data that are organized to support smooth
analysis
• Easy to import into analytical program
• Minimize manipulations and errors
• Has a logical [data] structure
25. Activity
Draft data coding scheme for data
entry
• Review data entry best practices
document in Box
Don’t forget to upload this to Box.
Suggested file name “Coding Scheme”
26. References
1. DataONE Education Module: Data Entry and Manipulation. DataONE.
From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx
2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist
presented at the AGU Workshop. From
http://wiki.esipfed.org/index.php/2011AGUworkshop
3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt
CRC Research Skills Workshop Series. From
http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf
29. Data Entry and Manipulation
Data Contamination
• Process or phenomenon, other than the one of interest,
that affects the variable value
• Erroneous values
CC image by Michael Coghlan on Flickr
30. Data Entry and Manipulation
• Errors of Commission
o Incorrect or inaccurate data entered
o Examples: malfunctioning instrument, mistyped data
• Errors of Omission
o Data or metadata not recorded
o Examples: inadequate documentation, human error, anomalies in the
field
CC image by NickJWebb on Flickr
31. Data Entry and Manipulation
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the
recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
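The "check for agreement with computer verification" step of double entry can be sketched as a field-by-field comparison of the two keyed versions. The data layout (a dictionary of records keyed by a record ID) and the function name are assumptions made for this example.

```python
def compare_entries(first_pass, second_pass):
    """Return (record_id, field, value_1, value_2) for every disagreement.

    Each argument maps a record ID to a dict of field values, as keyed in
    by one of the two independent data entry people.
    """
    mismatches = []
    for record_id in sorted(set(first_pass) | set(second_pass)):
        row_1 = first_pass.get(record_id, {})
        row_2 = second_pass.get(record_id, {})
        for field in sorted(set(row_1) | set(row_2)):
            if row_1.get(field) != row_2.get(field):
                mismatches.append(
                    (record_id, field, row_1.get(field), row_2.get(field))
                )
    return mismatches


if __name__ == "__main__":
    entry_a = {"P001": {"age": 34, "height_cm": 170}}
    entry_b = {"P001": {"age": 43, "height_cm": 170}}
    for mismatch in compare_entries(entry_a, entry_b):
        print(mismatch)
```

Each reported mismatch is then resolved by going back to the original paper form or source record, not by guessing which entry is right.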
32. Data Entry and Manipulation
• Design data storage well
◦ Minimize the number of times items must be entered repeatedly
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows undo if necessary
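Documenting changes can be as simple as appending an entry to a change log every time a value is corrected. The sketch below is one possible layout, with hypothetical function and field names; it keeps the old value so a correction can be undone.

```python
import datetime


def apply_change(dataset, change_log, record_id, field, new_value, reason):
    """Change one value and record what was changed, when, and why."""
    old_value = dataset[record_id][field]
    dataset[record_id][field] = new_value
    change_log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "record_id": record_id,
        "field": field,
        "old": old_value,
        "new": new_value,
        "reason": reason,
    })


def undo_last_change(dataset, change_log):
    """Restore the old value from the most recent log entry and return it."""
    entry = change_log.pop()
    dataset[entry["record_id"]][entry["field"]] = entry["old"]
    return entry
```

Because every correction is logged with a reason, a second reviewer can see which values were already checked and avoid repeating the same error checking.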
33. Data Entry and Manipulation
• Make sure data line up in proper columns
• No missing, impossible, or anomalous values
• Perform statistical summaries
CC image by chesapeakeclimate on Flickr
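A quick screening pass for one numeric variable might count missing values, list impossible values, and report a few summary statistics. The sketch below assumes missing values are stored as `None` and that the analyst supplies the allowed minimum and maximum; it also assumes at least one non-missing value.

```python
import statistics


def summarize(values, minimum, maximum):
    """Screen one numeric variable for missing and impossible values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        # Values outside the allowed range are reported, not deleted.
        "impossible": [v for v in present if not minimum <= v <= maximum],
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


if __name__ == "__main__":
    ages = [23, 31, None, 27, 270]
    print(summarize(ages, minimum=0, maximum=120))
```

A maximum far above the allowed range, or a mean far from the median, is a prompt to go back to the source records.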
34. Data Entry and Manipulation
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model
being used
◦ The goal is not to eliminate outliers but to identify potential data
contamination
35. Data Entry and Manipulation
• Methods to look for outliers
◦ Graphical
• Normal probability plots
• Regression
• Scatter plots
◦ Maps
◦ Subtract values from mean
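The "subtract values from mean" method can be sketched by expressing each value's distance from the mean in standard deviations and flagging the large ones. The threshold of 2 standard deviations is an assumption for this example, not a rule from the slides, and flagged values are candidates for review, not for deletion.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` sample
    standard deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # needs at least two values
    if spread == 0:
        return []  # all values identical, nothing to flag
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]


if __name__ == "__main__":
    readings = [12, 14, 13, 15, 14, 13, 58]
    print(flag_outliers(readings))
```

Because a single extreme value also inflates the mean and standard deviation, this check works best alongside the graphical methods listed above.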
36. Data Entry and Manipulation
• Data contamination occurs when a factor not examined by the
study alters data values
• Data error types: commission or omission
• Quality assurance and quality control are strategies for
◦ preventing errors from entering a dataset
◦ ensuring data quality for entered data
◦ monitoring and maintaining data quality throughout the project
• Identify and enforce quality assurance and quality control
measures throughout the Data Life Cycle
37. Discussion
Using the Data Review Checklist,
evaluate the HBSC codebook
“DataMgmtLab-Spr14_DataReviewChecklist_EX”
What screening & cleaning procedures
were used?
38. Data Entry and Manipulation
1. D. Edwards, in Ecological Data: Design, Management and Processing,
WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-
91. Available at www.ecoinformatics.org/pubs
2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for
preparing ecological data sets to share and archive. Bull. Ecol. Soc.
Amer. 82, 138-141 (2001).
3. A. D. Chapman, “Principles of Data Quality: Report for the Global
Biodiversity Information Facility” (Global Biodiversity Information
Facility, Copenhagen, 2004). Available at
http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/
39. References
1. Cook, 2013, NACP Best Data Management Practices Workshop. From
http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt
2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data
provenance in e-Science. SIGMOD Record, 34(3), 31-36. From
http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf
3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and
Management. From
http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing
42. Choose your tools wisely
• Documents
• Excel
• Access
• SPSS, Minitab
• Mathematica, MATLAB, Scilab
• SAS, Stata
• R
• MapReduce
• NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc.
http://www.dataone.org/all-software-tools
43. Data Formats; Version 1.0
Overview
• Spreadsheets are amazingly flexible, and are commonly
used for data collection, analysis and management
• Spreadsheets are seldom self-documenting, and seldom
well-documented
• Subtle (and not so subtle) errors are easily introduced
during entry, manipulation and analysis
• Spreadsheet conventions – often ad hoc and evolutionary –
may change or be applied inconsistently
• Spreadsheet file formats are often proprietary and thus generally
unacceptable for long-term archival purposes
44. Data Entry and Manipulation
Spreadsheets
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column
can contain numbers or text
• Lack record integrity (can sort a column independently
of all others)
• Easy to use, but harder to maintain as complexity and
size of data grow
Databases
• Easy to query to select portions of data
• Data fields are typed. For example, only integers are
allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
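The "data fields are typed" point can be illustrated with a small database table. The sketch below uses SQLite through Python's standard `sqlite3` module; the table and column names are hypothetical. SQLite is more permissive about types than most database systems, so the example adds an explicit CHECK constraint to get the strict behaviour the slide describes.

```python
import sqlite3


def build_table():
    """Create an in-memory table whose age column only accepts integers."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE participants ("
        " participant_id TEXT PRIMARY KEY,"
        " age_years INTEGER CHECK (typeof(age_years) = 'integer')"
        ")"
    )
    return connection


def try_insert(connection, participant_id, age_years):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO participants VALUES (?, ?)",
                (participant_id, age_years),
            )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = build_table()
    print(try_insert(db, "P001", 34))
    print(try_insert(db, "P002", "thirty-four"))
```

A spreadsheet would accept both rows without complaint; here the text value is refused at entry time.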
45. NACP Best Data Management Practices, February 3, 2013
5. Preserve information (cont)
• Use a scripted language to process data
– R Statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts are records of processing
– Scripts can be revised, rerun
• Graphical User Interface-based analyses may
seem easy, but don’t leave a record
46. Provenance, Audit Trails, etc.
• “…information that helps determine the
derivation history of a data product, starting from
its original sources.” (Simmhan et al., 2005)
– Ancestral data products from which the data evolved
– Process of transformation of these ancestral data
products
• Uses: data quality, audit trail, replication recipe,
attribution, informational
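A very small provenance record could capture both parts of this definition: a fingerprint of the ancestral data and the list of transformations applied to it. The sketch below is one possible layout, not a standard format; the function names and record fields are assumptions for illustration.

```python
import hashlib


def sha256_of_bytes(data):
    """Return the SHA-256 hex digest that fingerprints the given bytes."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_file(path):
    """Fingerprint a source data file so later changes can be detected."""
    with open(path, "rb") as handle:
        return sha256_of_bytes(handle.read())


def provenance_record(source_name, source_digest, steps):
    """Describe a derived data product: where it came from and how."""
    return {
        "source": source_name,           # ancestral data product
        "source_sha256": source_digest,  # fingerprint of that source
        "steps": list(steps),            # transformations, in order
    }


if __name__ == "__main__":
    raw = b"id,age\nP001,34\n"  # stand-in for the contents of a raw file
    record = provenance_record(
        "survey_raw.csv",  # hypothetical file name
        sha256_of_bytes(raw),
        ["dropped test records", "converted height to cm"],
    )
    print(record)
```

Saving such a record next to each derived file gives a basic audit trail and a replication recipe; a processing script under version control serves the same purpose in more detail.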
47. More Considerations
• Field names & descriptions
• Structured entry
• Validation
• Record integrity
• Missing data
• Data/field types
• File types: common, open documented standard
• Output required for analysis and visualization
48. Demonstration & Discussion
Run [analysis] in Excel and Stata.
Compare output.
• What features does Stata have that Excel
does not?
• How do these features support
provenance and data integrity?
49. References
1. DataONE Education Module: Data Entry and Manipulation. DataONE.
From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx