This document summarizes a survey of data quality and validation mechanisms used in citizen science projects. The survey found that most projects use expert review, and that projects combining multiple validation methods use 2.5 of them on average. Projects with more staff tend to use more validation methods, while projects with larger budgets tend to use fewer, suggesting that human-supervised methods do not scale with project size. Common validation methods included expert review, photo submissions, paper data sheets submitted alongside online entry, and replication by multiple participants. Future work should focus on validating data analysis methods and on developing tools to help projects plan quality assurance procedures.
Mechanisms for Data Quality and Validation in Citizen Science
1. Mechanisms for Data Quality and Validation in Citizen Science
A. Wiggins, G. Newman, R. Stevenson & K. Crowston
Presented by Nathan Prestopnik
2. Motivation
Data quality and validation are a primary concern
for most citizen science projects
More contributors = more opportunities for error
There has been no review of appropriate data
quality and validation mechanisms
Diverse projects face similar challenges
Contributors’ skills and scale of participation are
important considerations in ensuring quality
3. Methods
Survey
Questionnaire with 70 items, all optional
63 completed questionnaires representing 62 projects
Mostly small-to-medium sized projects in US, Canada,
UK; most focus on monitoring and observation
Inductive development of framework
Based on survey results and authors’ direct experience
with citizen science projects
4. Survey: Resources
FTEs: 0 – 50+
Average: 2.4; Median: 1
Often small fractions of several individuals’ time
Annual budgets: $125 – $1,000,000
Average: $105,000; Median: $35,000; Mode: $20,000
Up to 5 different funding sources, usually grants, in-kind contributions (staff time), & private donations
Age/duration: -1 to 100 years
Average age: 13 years; Median: 9 years; Mode: 2 years
5. Survey: Methods Used
Method | n | %
Expert review | 46 | 77%
Photo submissions | 24 | 40%
Paper data sheets submitted along with online entry | 20 | 33%
Replication/rating by multiple participants | 14 | 23%
QA/QC training program | 13 | 22%
Automatic filtering of unusual reports | 11 | 18%
Uniform equipment | 9 | 15%
Validation planned but not yet implemented | 5 | 8%
Replication/rating by the same participant | 2 | 3%
Rating of established control items | 2 | 3%
None | 2 | 3%
Not sure / don't know | 2 | 3%
6. Survey: Combining Methods
Methods | n | %
Single method | 10 | 17%
Multiple methods, up to 5 (average 2.5) | 45 | 75%
Expert review + Automatic filtering | 11 | 18%
Expert review + Paper data sheets | 10 | 17%
Expert review + Photos | 14 | 23%
Expert review + Photos + Paper data sheets | 6 | 10%
Expert review + Replication by multiple participants | 10 | 17%
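As a hedged sketch of one pairing above, expert review plus automatic filtering: automatic checks flag only unusual reports, and only those go to an expert queue. Record fields, the threshold table, and all names are illustrative assumptions, not from the survey.

```python
# Sketch: automatic filtering routes only flagged reports to expert review.
from dataclasses import dataclass, field

@dataclass
class Observation:
    species: str
    count: int
    flags: list = field(default_factory=list)

def auto_filter(obs, typical_max):
    """Flag reports whose counts exceed a per-species typical maximum."""
    limit = typical_max.get(obs.species)
    if limit is not None and obs.count > limit:
        obs.flags.append(f"count {obs.count} exceeds typical max {limit}")
    return obs

def route_for_review(observations, typical_max):
    """Accept unflagged records; queue flagged ones for expert review."""
    accepted, needs_expert = [], []
    for obs in observations:
        obs = auto_filter(obs, typical_max)
        (needs_expert if obs.flags else accepted).append(obs)
    return accepted, needs_expert

reports = [Observation("mallard", 12), Observation("mallard", 900)]
ok, review = route_for_review(reports, {"mallard": 200})
print(len(ok), "accepted;", len(review), "queued for expert review")
```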
7. Survey: Resources & Methods
Number of validation methods and staffing are positively correlated (r = 0.11)
More staffing = more supervisory capacity
Number of validation methods and budget are negatively correlated (r = -0.15)
If larger budgets mean more contributors, this constrains the scalability of multiple methods
Larger projects may use fewer but more sophisticated mechanisms
Suggests that human-supervised methods don't scale
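For illustration only: a minimal sketch of how Pearson correlations like those on this slide would be computed. The project counts and FTE figures below are invented; the deck does not include the raw survey data.

```python
# Illustrative only: Pearson correlation between methods used and staffing.
from statistics import correlation  # Python 3.10+

n_methods = [1, 2, 3, 2, 5, 1, 4]      # validation methods per project (hypothetical)
fte_staff = [0.5, 1, 2, 1, 6, 0.2, 3]  # FTEs per project (hypothetical)

r = correlation(n_methods, fte_staff)  # Pearson's r; sign shows direction
print(f"r = {r:.2f}")                  # note: r^2 = r * r is never negative
```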
8. Survey: Other Validation Options
“Please describe any additional validation methods
used in your project”
Several projects rely on personal knowledge of
contributing individuals for data quality
Not scientifically robust, but understandably relevant
Most comments referred to details of expert review
Reinforces the perceived value of expertise
The reporting interface and its associated error-checking are often overlooked, but provide important initial data verification
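A minimal sketch of that kind of entry-form error checking; the field names, checklist, and rules are assumptions about a typical observation form, not from the survey.

```python
# Sketch: validate an observation record at entry time, before storage.
from datetime import date

KNOWN_SPECIES = {"monarch", "mallard", "red oak"}  # hypothetical checklist

def check_entry(record):
    """Return entry errors caught before the record is stored."""
    errors = []
    if record.get("species", "").lower() not in KNOWN_SPECIES:
        errors.append("species not on project checklist")
    if not -90 <= record.get("lat", 999) <= 90:
        errors.append("latitude out of range")
    if not -180 <= record.get("lon", 999) <= 180:
        errors.append("longitude out of range")
    if record.get("obs_date", date.max) > date.today():
        errors.append("observation date is in the future")
    return errors

print(check_entry({"species": "monarch", "lat": 43.0, "lon": -76.1,
                   "obs_date": date(2011, 5, 1)}))  # -> []
```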
9. Choosing Mechanisms
Data characteristics to consider when choosing
mechanisms to ensure quality
Accuracy and precision: taxonomic, spatial, temporal,
etc.
Error prevention: malfeasance (gaming the system),
inexperience, data entry errors, etc.
Evaluate assumptions about error and accuracy
Where does error originate? How do mechanisms
address this? At what step in the research process?
How transparent are data review and its outcomes? How much data will be reviewed? In how much detail?
10. Mechanisms: Protocols
Mechanism | Process | Type/Detail
QA project plans | Before | SOP in some areas
Repeated samples/tasks | During | By multiple participants, a single participant, or experts (calibration)
Tasks involving control items | During | Contributions compared to known states
Uniform/calibrated equipment | During | Used for measurements; cost/scale tradeoff; who pays?
Paper data sheets + online entry* | During | Extended details, verifying data entry accuracy
Digital vouchers* | During | Photos, audio, specimens/archives
Data triangulation, normalization, mining* | After | Corroboration from other data sources; statistical & computer science methods
Data documentation* | After | Provide metadata about processes
* Addresses errors arising from both protocols and participants
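A hedged sketch of the "Paper data sheets + online entry" row above: cross-checking transcribed online records against the paper originals. The record structure and the id-based key are assumptions.

```python
# Sketch: find online entries that disagree with their paper originals.
def cross_check(paper, online):
    """Compare paper and online copies keyed by record id; list mismatches."""
    problems = []
    for rid, fields in paper.items():
        if rid not in online:
            problems.append((rid, "missing from online entry"))
        elif fields != online[rid]:
            problems.append((rid, "fields disagree; verify against paper sheet"))
    return problems

paper = {"R1": {"species": "mallard", "count": 4}}
online = {"R1": {"species": "mallard", "count": 14}}  # likely typo at data entry
print(cross_check(paper, online))
```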
11. Mechanisms: Participants
Mechanism | Process | Types/Details
Participant training | Before, During | Initial; ongoing; formal QA/QC
Participant testing | Before, During | Following training; pre-test / test-retest
Rating participant performance | During, After | Unknown to participant; known to participant
Filtering of unusual reports | During, After | Automatically; manually
Contacting participants about unusual reports | After | May alienate/educate contributors
Automatic recognition | After | Techniques for image/text processing
Expert review | After | By professionals, experienced contributors, or multiple parties
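Two of the participant-focused mechanisms above lend themselves to short sketches: replication/rating by multiple participants resolved by majority vote, and rating a participant's performance against control items with known states. All data here are hypothetical.

```python
# Sketch: majority-vote consensus and accuracy on known control items.
from collections import Counter

def consensus(ratings):
    """Majority label across participants; None on a tie or no ratings."""
    if not ratings:
        return None
    (top, n), *rest = Counter(ratings).most_common()
    return top if not rest or rest[0][1] < n else None

def participant_accuracy(answers, controls):
    """Share of control items (known states) a participant got right."""
    scored = [item for item in controls if item in answers]
    if not scored:
        return None
    return sum(answers[i] == controls[i] for i in scored) / len(scored)

print(consensus(["cat", "cat", "dog"]))                  # -> 'cat'
print(participant_accuracy({"c1": "cat", "c2": "dog"},
                           {"c1": "cat", "c2": "cat"}))  # -> 0.5
```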
12. Discussion
Need to pay more attention to the way that data are created: not just protocols, but also qualities of the data such as accuracy and precision
Clear need for quality/validation mechanisms for
analysis, not only for data collection/processing
Data mining techniques
Spatio-temporal modeling
Scalability of validation may be limited
May need to plan different quality management
techniques based on expected/actual project growth
13. Future Work
Most projects worry more about contributor
expertise than appropriate analysis methods
Resources are needed to support suitable analysis
approaches and tools
Comparative evaluation of the efficacy of the data quality and validation mechanisms identified
Develop a QA/QC planning and evaluation tool
Develop examples of appropriate data
documentation for citizen science projects
Necessary for peer review and data re-use
14. Thanks!
Nate Prestopnik
DataONE working group on Public Participation in
Scientific Research
US NSF grants 09-43049 & 11-11107
Editor's notes
Rating = classification or judgment tasks; admittedly not the clearest wording, but no one corrected this in the text responses. Percentage = percentage of responding projects that use each method.
Percentage = percentage of responding projects that use this combination of methods. There were a few other combinations that a handful of projects used; these were the dominant ones. Surprised to see so many with photos, as they are hard to use and store, and by the frequency of using paper data sheets.
Note that we did ask about numbers of contributions, but the units of contribution for each project (and even the way they count volunteers) were so different that they couldn’t be used for analysis
Split the framework of mechanisms in two for ease of viewing; these are methods that address the protocol as the presumed source of error. Starred items address errors arising from both protocols and participants.
These methods all address expected errors from participants, focusing primarily on skill evaluation and on filtering or review of unusual reports.