PEDSnet is a clinical data research network that aggregates electronic health record data from eight large children's hospitals. It aims to support various types of pediatric research. The document discusses data quality challenges in PEDSnet due to issues like heterogeneous data sources and coding. It describes a comprehensive data quality assessment workflow used in PEDSnet involving automated checks and adjudication of issues. Over 600 issues were identified across four data cycles. Common issues included missing data, outliers, and unexpected differences from previous data extracts. The study aims to help define data quality in clinical research networks and establish standardized quality evaluation guidelines.
Selaginella: features, morphology ,anatomy and reproduction.
PEDSnet DQA CHOP Symposium
1. What
is
PEDSnet?
The
prominent
considera0ons
in
building
PEDSnet
include:
• lack
of
EHR
data’s
orienta0on
toward
research
analy0cs:
unstructured
and
locally
coded
data,
complex
clinical
workflows
• seman0c
heterogeneity:
diverse
source
systems
(5
Epic,
2
Cerner,
1
AllScripts)
and
vocabularies
(ICD9,
RxNorm,
GPI,
CPT-‐4,
etc.)
• data
peculiari0es
in
pediatrics:
procedures
conducted
before
a
child
is
born
such
as
prenatal
surgery
or
tes0ng
With
the
ul0mate
goal
of
serving
a
variety
of
research
projects,
a
prerequisite
in
PEDSnet
is
to
ensure
that
the
network’s
data
is
“high
quality.”
Major
gaps
in
this
area
include
the
lack
of
a
clear
defini0on
of
“data
quality”
in
CDRNs
and
the
lack
of
standardized
guidelines
for
conduc0ng
data
quality
assessments
in
large-‐scale
distributed
networks.
We
implement
a
comprehensive
suite
of
data
quality
checks,
and
propose
a
semi-‐
automa0c
workflow
to
conduct
data
quality
assessment
in
PEDSnet,
and
make
the
following
contribu0ons:
Data
Integra2on
and
Valida2on
in
PEDSnet
Fig
1
PEDSnet
Partner
Sites
(Total
8)
1. L a r g e -‐ s c a l e
d a t a
q u a l i t y
assessments
in
pediatrics
ü ~4.7M
children
(nearly
5%
of
US
child
popula2on)
ü Observa2onal
data
associated
with
>97M
clinical
encounters
Our
Contribu2ons
2.
Providing
a
real-‐world
account
on
data
quality
issues
in
a
CDRN
ü Iden2fied
and
tracked
the
causes
of
~600
issues
ü Longitudinal
analysis
showing
trends
of
causes
and
domains
Collabora0ons
across
mul0ple
ins0tu0ons
are
essen0al
to
achieve
adequate
cohort
sizes
in
pediatrics
research.
PEDSnet
is
a
newly
established
clinical
data
research
network
(CDRN)
that
aggregates
electronic
health
record
(EHR)
data
from
eight
of
the
na0on’s
largest
children’s
hospitals.
PEDSnet
is
part
of
a
much
larger
network,
PCORnet,
supported
by
the
Pa0ent-‐Centered
Outcomes
Research
Ins0tute
(PCORI).
The
goal
of
PEDSnet
is
to
support
a
variety
of
research
projects:
• Compara0ve
effec0veness
studies
• Drug
safety
studies
• Descrip0ve
studies
• Computable
phenotypes
Identifying and Understanding Data Quality Issues in a Pediatric Distributed
Research Network
Ritu Khare1, Levon Utidjian1, Evanette Burrows1, Greg Schulte2, Kevin Murphy1, Sara Deakyne2, Richard Hoyt3,
Nandan Patibandla4, Byron Ruth1, Aaron Browne1, Megan Reynolds3, Keith Marsolo5, Michael Kahn2, Charles Bailey1
1Children’s Hospital of Philadelphia, 2Children’s Hospital Colorado,3Nationwide Children’s Hospital,
4Boston Children’s Hospital, 5Cincinnati Children’s Hospital Medical Center
PEDSnet
Data
Quality
Assessment
Workflow
Results
Data
Quality
Assessment
Program
Issue
Ranking
and
Adjudica2on
S1
Fig
2.
A
typical
data
cycle
in
PEDSnet
showing
communica0on
between
the
Data
Coordina0ng
Center
and
the
Partner
Sites
Key
Findings
&
Conclusions
Even
a`er
the
4th
data
cycle,
on
an
average
74
issues
iden0fied
per
site.
Prospec0ve
data
quality
assurance
is
important
but
not
sufficient
for
valida0ng
data
quality
in
CDRNs.
The
percentage
of
ETL
issues
decreased
across
successive
data
cycles.
A
strong
post
hoc
data
quality
assessment
program
is
necessary
to
ensure
data
quality
in
CDRNs.
The
total
number
of
provenance
issues
increased
across
cycles.
This
study
serves
as
a
data
educa0on
pladorm
where
the
experts
learn
and
evolve
the
data
quality
program
with
each
itera0on.
Related
Publica2on
Iden2fying
and
Understanding
Data
Quality
Issues
in
a
Pediatric
Distributed
Research
Network
Ritu
Khare;
Levon
U1djian;
Gregory
Schulte;
Keith
Marsolo;
Charles
Bailey
AMIA
Annual
Symposium
2015,
November
14-‐18
2015,
San
Francisco,
CA
Future
Direc2ons
Implement
advanced
data
quality
checks,
e.g.
comparison
with
external
datasets
and
published
results
Use
the
data
quality
results
to
determine
the
analy0c
fitness
of
the
PEDSnet
dataset
with
respect
to
a
given
target
research
study
PEDSnet
Sites
(at
various
loca1ons)
1. Shared
private
repository
for
extract-‐transform-‐load
(ETL)
scripts
2. ETL
conven0ons
document
3. Weekly
troubleshoo0ng
mee0ngs
Acknowledgments
This
work
was
supported
by
PCORI
Contract
CDRN-‐1306-‐01556.
The
inves0gators
would
like
to
thank
the
PEDSnet
informa0cs
team
for
their
efforts
,
and
the
pa0ents
and
families
contribu0ng
to
the
PEDSnet
dataset.
Most
Frequent
Issues
Missing
Data
31%
En0ty
Outliers
15%
Unexpected
difference
from
previous
ETL
run
9%
Implausible
Dates
9%
Temporal
Outliers
6%
Unexpected
Facts
4%
Unexpected
concept
iden0fiers
3%
Numerical
Outliers
3%
u ETL:
due
to
an
ETL
programming
error
at
the
site
largely
due
to
incompleteness
or
ambiguity
in
the
conven0ons
document
u Provenance:
due
to
the
nature
of
EHR
data,
e.g.
data
entry
error,
clinical
&
administra0ve
workflows,
true
anomaly,
etc.
u I2b2
transform:
due
to
limita0ons
of
i2b2-‐>PEDSnet
transforma0on
scripts
u Non-‐issue:
false
alert
by
the
data
quality
assessment
program
Longitudinal
Analysis
of
Issues
Summary
of
Data
Quality
Issues
Contact
Person
Ritu
Khare,
PhD
(kharer@email.chop.edu)
PEDSnet
Data
Coordina2ng
Center
Prospec2ve
Data
Quality
Assurance
Aggregated
(OMOP
CDM)
S2
S3
S4
S5
S6
S7
S8
Types
of
Data
Quality
Checks
implemented
in
the
Program
Fidelity
the
degree
to
which
PEDSnet
data
accurately
reflect
the
source
systems
from
which
it
is
drawn
e.g.
the
distribu0ons
of
gender
values
do
not
match
between
the
source
system
(EHR)
and
the
derived
PEDSnet
dataset.
Consistency
the
degree
to
which
a
specific
type
of
informa0on
is
recorded
in
the
same
way
in
the
different
data
sources
contribu0ng
to
PEDSnet
Value
set
viola0ons
(e.g.
incorrect
concept
iden0fier
used
to
represent
no
informa0on
in
race
field)
or
mapping
issues
(e.g.
a
site
using
internal
codes
to
capture
drug
informa0on
cannot
completely
map
to
standard
RxNorm
codes)
Accuracy
the
degree
to
which
PEDSnet
data
correctly
reflect
the
clinical
characteris0cs
of
pa0ents
e.g.
a
site
having
significantly
larger
ra0o
for
average
number
of
facts
(observa0ons,
procedures,
or
drug
exposures)
per
pa0ent
Feasibility
the
degree
to
which
a
given
type
of
informa0on
is
actually
collected
and
available
in
PEDSnet
e.g.
the
recorded
gesta0onal
age
is
missing
for
70%
of
pa0ents
Expert
Review
Communica0on
of
issues
to
the
origina0ng
sites
At
the
end
of
the
4th
data
cycle,
the
data
quality
assessment
program
helped
iden0fy
591
data
quality
issues.
Domains
with
Most
Issues
Drug
Exposure
21%
Measurement
13%
Observa0on
12%
Condi0on
Occurrence
11%
Procedure
Occurrence
10%
Person
8%
Visit
Occurrence
7%
0
10
20
30
40
50
60
Jan
'15
(OMOP
v4)
Mar
'15
(OMOP
v4)
May
'15
(OMOP
v5)
Jul
'15
(OMOP
v5)
Percentage
of
Issues
Data
Cycles
Distribu2on
of
Issue
Causes
across
Data
Cycles
PCORnet
CDM
Final
Target