This presentation summarizes the data quality assessment process used by PEDSnet, a pediatric data research network. In Phase 1, PEDSnet built its infrastructure and data quality checks, reducing the rate of extract-transform-load (ETL) errors from over 50% to under 10%. In Phase 2, more than 30 scientific studies were run, and their users uncovered more than 75 new data quality issues and 8 new check types. This amounted to a 20% increase in the types of checks performed and raised new challenges in check design. The conclusion is that a user-driven perspective provided new insights into assessing and improving the usability of PEDSnet's data.
Bridging Data Quality Checks and Research in a Pediatric Network
1. Understanding the Gaps between Data Quality Checks and Research Capabilities in a Pediatric Data Research Network
Ritu Khare, Hanieh Razzaghi, Levon Utidjian,
Matthew Miller, L. Charles Bailey
The Children’s Hospital of Philadelphia
3. Data Quality Assessment in PEDSnet
• Is the data ready for research use?
• PEDSnet data quality workflow
• Design data quality checks
• https://github.com/PEDSnet/Data-Quality-Analysis
• Identify data quality issues
• Rate of extract-transform-load (ETL) errors reduced from >50% to <10% (Khare et al., JAMIA in press)
Type of Check      Issue Example
Missing data       Gestational age missing for 70% of patients
Invalid value      Race outside the acceptable values in PEDSnet conventions
Implausible event  Encounter start date after the end date
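The three check types in the table can be sketched in Python roughly as follows. This is a minimal illustration and not the PEDSnet implementation: the field names (`gestational_age`, `race`, `start_date`, `end_date`), the allowed race values, and the toy records are all hypothetical.

```python
from datetime import date

# Hypothetical value set; the real one is defined by the PEDSnet conventions.
ALLOWED_RACE = {"asian", "black", "white", "other", "unknown"}


def missing_rate(records, field):
    """Fraction of records in which `field` is absent or None."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field) is None) / len(records)


def invalid_values(records, field, allowed):
    """Records whose non-missing `field` value is outside the allowed set."""
    return [r for r in records if r.get(field) is not None and r[field] not in allowed]


def implausible_encounters(encounters):
    """Encounters whose start date falls after their end date."""
    return [e for e in encounters if e["start_date"] > e["end_date"]]


# Toy data, invented for illustration only.
patients = [
    {"person_id": 1, "gestational_age": 39, "race": "asian"},
    {"person_id": 2, "gestational_age": None, "race": "white"},
    {"person_id": 3, "gestational_age": None, "race": "martian"},
    {"person_id": 4, "race": "unknown"},
]
encounters = [
    {"visit_id": 10, "start_date": date(2015, 3, 1), "end_date": date(2015, 3, 4)},
    {"visit_id": 11, "start_date": date(2016, 1, 9), "end_date": date(2016, 1, 2)},
]

print(missing_rate(patients, "gestational_age"))
print([p["person_id"] for p in invalid_values(patients, "race", ALLOWED_RACE)])
print([e["visit_id"] for e in implausible_encounters(encounters)])
```

Each function returns the offending records or a rate rather than a pass/fail flag, so a reviewer can decide what counts as an issue.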
4. PEDSnet Phase 1: Data Quality Assessment
[Figure: number of data quality checks (y-axis, 0 to 700) by data cycle (x-axis, cycles 1 to 9, Jan 2015 to May 2016)]
• THEORY-DRIVEN: frameworks and methods in the literature (Brown et al. 2013; Weiskopf and Weng 2013; Kahn et al. 2015)
• DEVELOPER-DRIVEN: 50 members in informatics team; data and issue review
5. PEDSnet Phase 2: Conducting Science Queries
• >30 scientific studies: Computable phenotypes,
feasibility queries, association studies, etc.
Site  % children with CT-scan during ED visits in 2013-2016
A     3.32%
B     4.87%
C     3.58%
D     2.98%
E     0.11%
F     3.62%
G     5.11%
H     5.92%
• Incorrect mapping of CT-scan procedure
• Invalid coding of ED visits
• Bug in the query
• True anomaly
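In the table, Site E's rate of 0.11% sits far below the other seven sites. A simple way to surface such a value automatically is to compare each site with the network median. The sketch below does that; the function name and the ratio bounds (`low_ratio`, `high_ratio`) are illustrative assumptions, not PEDSnet thresholds. A flag only says the value is unusual, and deciding among the explanations listed still needs manual review.

```python
from statistics import median

# Site-level rates from the slide: % of children with a CT scan
# during ED visits in 2013-2016.
ct_scan_rate_pct = {
    "A": 3.32, "B": 4.87, "C": 3.58, "D": 2.98,
    "E": 0.11, "F": 3.62, "G": 5.11, "H": 5.92,
}


def flag_outlying_sites(rates, low_ratio=0.25, high_ratio=4.0):
    """Return sites whose rate is far from the network median.

    A site is flagged when its rate is below low_ratio * median or
    above high_ratio * median. Both bounds are assumed for illustration.
    """
    network_median = median(rates.values())
    return sorted(
        site
        for site, rate in rates.items()
        if rate < low_ratio * network_median or rate > high_ratio * network_median
    )


flagged = flag_outlying_sites(ct_scan_rate_pct)
print(flagged)
```

The median is used instead of the mean so that the outlying site does not pull the reference value toward itself.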
6. PEDSnet Phase 2: Data Quality Assessment
• USER-DRIVEN: >75 new issues, and 8 new check types
Check Type                                                            Issue Example
Outliers in derived values                                            Average length of inpatient stays
Inconsistency between similar concepts captured in different tables   Specialty data in provider vs. care_site tables
Incorrect mapping from EHR to PEDSnet                                 Mapping of labs to LOINC
Missing Expected Facts                                                GI Providers, creatinine labs, etc.
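Two of these check types, missing expected facts and cross-table inconsistency, can be sketched as below. The record layout, the fact labels, and the toy data are hypothetical. The second function also assumes that a provider's specialty is expected to agree with that of the linked care site, which may not hold at every site.

```python
def missing_expected_facts(observed_by_site, expected):
    """Map each site to the expected facts for which it has no records."""
    gaps = {}
    for site, observed in observed_by_site.items():
        absent = sorted(set(expected) - set(observed))
        if absent:
            gaps[site] = absent
    return gaps


def specialty_mismatches(providers, care_sites):
    """IDs of providers whose specialty differs from their care site's."""
    site_specialty = {c["care_site_id"]: c["specialty"] for c in care_sites}
    return [
        p["provider_id"]
        for p in providers
        if p["care_site_id"] in site_specialty
        and p["specialty"] != site_specialty[p["care_site_id"]]
    ]


# Toy data, invented for illustration only.
observed_by_site = {
    "A": {"gi_provider", "creatinine_lab"},
    "B": {"gi_provider"},
}
expected = ["gi_provider", "creatinine_lab"]

providers = [
    {"provider_id": 1, "care_site_id": 100, "specialty": "cardiology"},
    {"provider_id": 2, "care_site_id": 100, "specialty": "gastroenterology"},
]
care_sites = [{"care_site_id": 100, "specialty": "cardiology"}]

print(missing_expected_facts(observed_by_site, expected))
print(specialty_mismatches(providers, care_sites))
```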
7. PEDSnet Phase 2: Data Quality Assessment
Check Type                       Issue Example
Unexpected Facts                 Procedures recorded in the condition table
Variability in coding            Different concepts used to represent same lab or vitals
Unexpected most frequent values  “shooting pain” identified as top inpatient visit condition
Face validity issues             Tables with unexpectedly low number of records
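The last two rows lend themselves to short sketches. The first function compares the most frequent values of a field with a reviewer-supplied list of plausible leaders. The second flags tables whose row count dropped sharply against the previous data cycle. All names, the `min_ratio` cutoff, the choice of the previous cycle as baseline, and the toy data are assumptions made for illustration.

```python
from collections import Counter


def unexpected_top_values(values, expected_top, k=3):
    """Top-k most frequent values that are not in `expected_top`."""
    top_k = [value for value, _count in Counter(values).most_common(k)]
    return [value for value in top_k if value not in expected_top]


def low_volume_tables(row_counts, previous_counts, min_ratio=0.5):
    """Tables whose row count fell below min_ratio of the previous cycle."""
    return sorted(
        table
        for table, n in row_counts.items()
        if table in previous_counts and n < min_ratio * previous_counts[table]
    )


# Toy data, invented for illustration only.
inpatient_conditions = (
    ["shooting pain"] * 5 + ["asthma"] * 3 + ["pneumonia"] * 2 + ["fracture"]
)
plausible_leaders = {"asthma", "pneumonia", "bronchiolitis"}

current = {"condition_occurrence": 1200, "measurement": 40}
previous = {"condition_occurrence": 1100, "measurement": 900}

print(unexpected_top_values(inpatient_conditions, plausible_leaders))
print(low_volume_tables(current, previous))
```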
8. PEDSnet Phase 2: Check Design Challenges
• Determine the combination of fields / tables
  • ~100 fields in PEDSnet data model
• Determination of outlier
  • Differentiate between true anomaly and real data quality issue
• Determination of thresholds
  • Experimentation with datasets
• Automatic review of ETL mappings
  • labs, organisms, specialty, route, race, ethnicity, drugs, language, procedure, smoking history
  • 1000s of manually derived mappings
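One possible starting point for reviewing ETL mappings automatically is to look for source terms that were mapped to more than one target code. The sketch below assumes mappings are available as `(site, source_term, target_code)` tuples. That layout, the function name, and the placeholder codes (`CODE-1` and so on, which are not real LOINC codes) are hypothetical.

```python
from collections import defaultdict


def conflicting_mappings(mappings):
    """Return source terms that map to more than one target code.

    `mappings` is an iterable of (site, source_term, target_code) tuples.
    Source terms are compared after trimming whitespace and lowercasing.
    """
    targets = defaultdict(set)
    for _site, source_term, target_code in mappings:
        targets[source_term.strip().lower()].add(target_code)
    return {
        term: sorted(codes) for term, codes in targets.items() if len(codes) > 1
    }


# Toy mappings with placeholder target codes, invented for illustration.
mappings = [
    ("A", "Creatinine", "CODE-1"),
    ("B", "creatinine ", "CODE-2"),
    ("A", "Hemoglobin", "CODE-3"),
    ("C", "hemoglobin", "CODE-3"),
]

conflicts = conflicting_mappings(mappings)
print(conflicts)
```

A conflict found this way is only a candidate for review, since two codes can both be legitimate when the source term is ambiguous.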
9. Conclusions
• A new (user-driven) perspective on data quality
• Usability evaluation of PEDSnet data quality assessment program
• 20% increase in types of checks
• Future work
  • Investigate the Phase 2 check design challenges
  • Reverse engineering of checks from issues identified in science queries
10. Acknowledgments
• PEDSnet Teams
• Leadership and governance
• Informatics
• Pilot studies
• PCORnet Governance Committees and DRN OC
• OHDSI Consortium
• Patients and Families
• This work was supported by PCORI Contract CDRN-1306-01556.
• PEDSnet Data Quality Scripts: https://github.com/PEDSnet/Data-Quality-Analysis
Editor's notes
Talk about scale and range of PEDSnet
More stuff about pediatrics and CDRNs
Implemented during development, in tandem, over two years.
Developing checks and identifying/documenting issues. Check vs. issue graph. Number of issues logged through May 2016, etc.