Comparison of Privacy-Protecting Methods for Distributed Data Analysis

Comparison of Privacy-Protecting
Analytic and Data-sharing Methods:
a Simulation Study
Kazuki Yoshida*, Susan Gruber,
Bruce Fireman, Darren Toh
* Departments of Epidemiology and Biostatistics
Harvard T.H. Chan School of Public Health
SCS Meeting on June 21, 2017
1 / 51

Acknowledgment
This study was funded through a Patient-Centered
Outcomes Research Institute (PCORI) Award
(ME-1403-11305; PI: Darren Toh).
All statements in this document, including its ﬁndings
and conclusions, are solely those of the authors and do
not necessarily represent the views of PCORI or PCORI’s
Board of Governors or Methodology Committee.
2 / 51

Background
Distributed data networks such as [1] and
[2] are becoming a platform of choice of rapid
synthesis of evidence.
Here we focus on distributed data networks with
horizontally partitioned patient data [3].
4 / 51

Data partners
Data partners are entities that routinely collect patient
care data as part of their daily operation.
For example, Sentinel has the following data partners. [4]
5 / 51

Structure of distributed data network
Patient data are stored based on the common data model
[5] at each site for the purpose of sharing.
Analyses are conducted at coordinating center by
aggregating data from individual data partners.
6 / 51

Challenges in distributed data network
Sharing of data from data partners should be minimized
to protect patient privacy as well as each data partner’s
proprietary interest.
Several privacy-protecting analytic and data-sharing
methods [6] have been proposed.
These methods have not been systematically compared.
7 / 51

Aims
To provide framework for classifying previously suggested
privacy-protecting methods.
To assess the relative performance of various
privacy-protecting methods in the setting of a simulated
distributed data network.
Specifically, to examine how different levels of data
sharing can affect analysis performance.
8 / 51

Classification of methods
We considered the following "axes" in classification.
Levels of data sharing
Types of confounder summary scores
Confounding adjustment methods
Matching Stratification Weighting
PS Individual data Individual data Individual data
Risk sets Risk sets Risk sets
Summary tables Summary tables -
Effect estimates Effect estimates Effect estimates
DRS Individual data Individual data -
Risk sets Risk sets -
Effect estimates Effect estimates -
10 / 51

Levels of data sharing
Individual-level data [7]
Individual-level exposure, outcome, event time, and summary
score data are shared.
Risk-set data [8]
Aggregated risk sets at event times, that is, the number of
individuals experiencing events at each time points, and the
number of individuals still being followed, are shared.
Summary-table data
Aggregated event counts and total person-time (how many
people were exposed to the drug for how long) are shared.
Site-speciﬁc eﬀect estimate data
Entire analysis is conducted within each site, and analysis
results are shared across sites.
11 / 51

Example of individual-level data
site A time event PS Matched
1 1 251 1 0.5402941 1
1 1 277 1 0.4949680 1
1 0 366 0 0.4921805 1
1 0 261 1 0.5128428 1
1 1 52 0 0.5801256 1
1 0 366 0 0.5334244 1
1 0 223 1 0.5267744 1
1 0 28 1 0.5135982 1
1 1 100 0 0.5506620 1
1 0 311 0 0.5361661 1
1 0 260 0 0.4979951 1
1 1 254 0 0.5665530 1
The individual-level data contain the time-to-event status of each
individual. Depending on the analysis method, the summary score itself,
derived weights, or matched cohort status need to be shared for each
individual.
12 / 51

Example of risk set data
site method eval_time events_A0 events_A1 riskset_A0 riskset_A1
1 PS Match 0 0 0 457 457
1 PS Match 1 2 2 457 457
1 PS Match 2 5 0 454 455
1 PS Match 3 1 0 449 455
1 PS Match 4 0 3 447 455
1 PS Match 5 3 1 446 448
1 PS Match 6 0 2 442 444
1 PS Match 7 1 4 440 439
1 PS Match 8 0 2 439 434
1 PS Match 9 1 1 436 432
1 PS Match 10 0 1 434 429
1 PS Match 11 0 1 431 425
1 PS Match 12 0 2 431 422
1 PS Match 13 0 1 429 419
1 PS Match 14 2 1 429 418
1 PS Match 15 2 1 427 416
1 PS Match 16 1 0 423 412
Risk-set data are created at each time points at which event of
interest occurred constructed separately for the treated A = 1 and
untreated A = 0. The number of events at each time point (the
time scale itself can be converted to an ordinal variable) is shared,
but no individual-level data are required.
13 / 51

Example of summary table data
site method A events person-time
1 PS Match 0 185 75453
1 PS Match 1 196 69686
2 PS Match 0 404 144741
2 PS Match 1 410 137917
3 PS Match 0 645 224931
3 PS Match 1 652 223105
Summary-table data are essentially tables from each site created
separately for the treated A = 1 and untreated A = 0. The
numbers of events in each group are shared, but no individual-level
data are required.
14 / 51

Example of site-specific effect estimate data
site method log HR Var(log HR)
1 PS Match 0.13490472 0.010514751
2 PS Match 0.06062382 0.004917153
3 PS Match 0.01847491 0.003084460
Site-specific effect estimate data only contain the effect estimate
(in this case log hazard ratio) and corresponding variance of the
estimate. Each site contributes only two numbers. There is no
element of individual patient-level data.
15 / 51

Types of confounder summary scores
Two types of confounder summary scores are commonly used
in pharmacoepidemiology.
Propensity score (PS) [9]
Predicted probability of receiving treatment of interest
Disease risk score (DRS) [10]
Binary: Predicted probability of outcome of interest
under no treatment
Survival: Relative log hazard ratio (linear predictor from
Cox regression) under no treatment
Both scores summarize multiple covariates, simplify analyses,
and reduce data being shared.
16 / 51

Confounder summary scores
When patient characteristics determine both treatment assignment
and outcome of interest, correlation arises between treatment and
outcome even without a true eﬀect of treatment on the outcome
(confounding) [11]. Statistical assessment of the true eﬀect of
treatment requires accounting for these confounders.
17 / 51

Types of confounding adjustment
Matching [12]
Create pairs of individuals with similar scores, thereby creating
a cohort of similar individuals except for the exposure status.
Stratiﬁcation [13]
Create sub-groups of individuals with similar scores, for
example, deciles (10 subgroups with 1/10 size of the cohort),
and comparison of treated and untreated is done within each
stratum.
Weighting with PS (IPTW [14], matching weights [15])
Balance distribution of score across treatment by re-weighting
individuals, that is, making some individuals contribute more
or less to the analysis depending on some functions their PS.
18 / 51

Rationale for a simulation study
Essential component of method comparison research. [16]
Allows assessment of performance of different methods in
a controlled environment.
Patient data are artificially generated so that we know the
truth each method should find.
Data generation, analysis, and performance assessment
are repeated many times for accuracy.
20 / 51

Simulation: Base scenario
4 sites with size 100K, 20K, 20K, and 5K patients
7 covariates (1 continuous, 6 binary)
X → A association OR 0.3 - 3.0
X → Y association HR 0.6 - 1.6
Treatment prevalence 50%
No treatment eﬀect
5% one-year observed incidence of survival outcome
21 / 51

Scenario overview
Scenario Explanation Incidence % Treated Effect
1 Base scenario 5% 50% Null
2 10% treated 5% 10% Null
3 1% outcome incidence 1% 50% Null
4 0.1% outcome incidence 0.1% 50% Null
5 0.01% outcome incidence 0.01% 50% Null
6 Varying outcome incidence 0.01%-5% 50% Null
7 Protective treatment effect 5% 50% Protective
8 8-sites 5% 50% Null
9 Varying confounder counts 5% 50% Null
10 Small sites 1% 50% Null
Null treatment effect is a conditional log hazard ratio of 0
(conditional hazard ratio of 1.0). Protective treatment effect
is a conditional log hazard ratio of -0.22 (conditional hazard
ratio of 0.8).
22 / 51

Scenario overview
Scenario # Sites (Sizes) # Confounders
1 4 (100K, 20K × 2, 5K) 7
2 4 (100K, 20K × 2, 5K) 7
3 4 (100K, 20K × 2, 5K) 7
4 4 (100K, 20K × 2, 5K) 7
5 4 (100K, 20K × 2, 5K) 7
6 4 (20K × 4) 7
7 4 (100K, 20K × 2, 5K) 7
8 8 (100K × 2, 20K × 4, 5K × 2) 7
9 4 (20K × 4) 5, 10, 20, 40
10 4 (5K × 4) 7
23 / 51

Simulated data partners
Four sites of diﬀerent data sizes are simulated in the base
scenario.
Each site is generated as a separate dataset to emulate
the distributed data network setting in which data reside
behind the ﬁrewall of each data partner.
24 / 51

Data generation
Covariates X1, ..., X7 were generated ﬁrst. Treatment
assignment was determined by the covariates. Then covariates
and treatment (when non-null eﬀect) determined the outcome.
25 / 51

Data preparation
Each site prepares data to be shared across sites.
Summary score estimation
Propensity score (PS)
Disease risk score (DRS)
Adjustment for confounding
Matching (PS & DRS)
Stratification (PS & DRS)
Weighting (PS only)
Data reduction
Individual-level data
Risk-set data
Summary-table data
Site-specific effect estimate
data
26 / 51

Data sharing
Prepared less identiﬁable data are then shared across sites to the
coordinating center, where they are aggregated for ﬁnal analysis.
27 / 51

Comparison of interest
Within each confounding adjustment method (cell),
different levels of data sharing was compared to the
individual-level data sharing.
Matching Stratification Weighting
PS Individual data Individual data Individual data
Risk sets Risk sets Risk sets
Effect estimates Effect estimates Effect estimates
DRS Individual data Individual data -
Risk sets Risk sets -
Effect estimates Effect estimates -
28 / 51

Assessment metrics
Bias metric
Average of point estimates (should be close to truth)
Precision metrics
Variability of point estimates (should be small)
Standard error estimates (should reﬂect true variability)
Computation metric
Proportion of failure to produce results
29 / 51

Implementation of simulation
The simulation suite was
implemented in an
open-source statistical
language except for
a small part written in
SAS, which was then
called from R.
Package ‘distributed’
June 19, 2017
Type Package
Title Examine Privacy Preserving Data Analysis Methods in Simulated
Distributed Data Network
Version 0.1.0
Date 2017-01-27
Author Kazuki Yoshida
Maintainer Kazuki Yoshida <kazukiyoshida@mail.harvard.edu>
Description Simulate a distributed data network and examine performance of various privacy preserv-
ing data analysis methods. See the package vignette for instructions.
License GPL-2
Imports magrittr, dplyr, tidyr, assertthat, doRNG, foreach, geepack,
tableone, Matching, survival, sandwich, survey, gnm, pryr
Suggests testthat, rmarkdown, MatchIt
URL
VignetteBuilder rmarkdown
RoxygenNote 6.0.1
NeedsCompilation no
R topics documented:
distributed-package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Analyze . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
AnalyzeSiteDataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
AnalyzeSiteDatasetBin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
AnalyzeSiteDatasetSurv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
AnalyzeSiteRegression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
AnalyzeSiteRegressionHelper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
AnalyzeSiteRisksets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
AnalyzeSiteSummary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
AnalyzeSiteSummaryBin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
AnalyzeSiteSummarySurv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
AnalyzeSiteTruth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
AnalyzeSiteTruthBin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
AnalyzeSiteTruthSurv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
AssignCovariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1
30 / 51

Computing environment
Harvard University’s
Odyssey high
performance computing
cluster was used for the
simulation as simulation
had to be repeated many
times.
31 / 51

Scenario 1 (base scenario)
4 sites with size 100k, 20k, 20k, and 5k patients
50% treatment prevalence
5% incidence of binary outcome
33 / 51

●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
● ●
●
●
●
● ●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
PS MW DRS Match. DRS Strat.
PS Match. PS Strat. PS IPTW
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
−0.05
0.00
0.05
0.10
−0.05
0.00
0.05
0.10
logHR
Survival analysis log HR. Scenario 1
34 / 51

●●●
●●
●
●●
●●●
●●
●
●●●
●●●
●●
●
●●●
●●●
●●
●
●●●
●
●
●
●
●
●
●
●●
●●
●
●●
●
●●
●
●●●●● ●●●●●
●
●●●●●
●
●●●●●
●
●●●●●●
●
●
●●●
●
●●●●●●
●
●
●●●
●
●●●
●
●●●●●●●
●
●
●●●
●
●●●
●
●
●
● ● ●
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
0.50
0.75
0.90
1.00
1.10
1.50
2.00
0.50
0.75
0.90
1.00
1.10
1.50
2.00
EstimatedSE(logHR)/simulationSE(logHR)
Survival analysis SE estimate accuracy. Scenario 1
35 / 51

Scenario 2 (infrequent treatment)
5% one-year incidence of survival outcome
36 / 51

●
●
●●
●● ●
●
●●
●● ●
●
●●
●● ●
●
●●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
−0.1
0.0
0.1
0.2
−0.1
0.0
0.1
0.2
logHR
37 / 51

●●●
●
●●●
●
●●●
●
●●●
●
●●●●
●
●●●●
●
●●●●
●
●
●
●
●
●
●
●
●
●●● ●●● ●●● ●●●
●
●
●
●●
●●●
●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●●
●
●●
●
●●
●
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
0.50
0.75
0.90
1.00
1.10
1.50
2.00
0.50
0.75
0.90
1.00
1.10
1.50
2.00
38 / 51

Scenario 5 (infrequent outcome)
0.01% one-year incidence of survival outcome
39 / 51

●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ● ●
●●●
●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
● ●
●
●
●
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
−2
0
2
4
−2
0
2
4
logHR
40 / 51

● ● ●
●
●
●
●●
●
●
●●
●
●
●
●●●
●
●
●
●
●●
●●
●
●
●
●
●●●
●
●●●
●●●●●●●
●●
●●●
● ●
●
●●●
●
●●●
●●●●●●●●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●●
●
●
●●●●●●
●
●●●
●
●●
●
●
●
●●●
●
●
●●●●●●
●
●●●
●
●●
●
●
●
●●●
●
●
●●●●●●
●
●●●
●
●●
●●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●●
●●
●●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●●
●●
●●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●●●●●●●●●
●
●●●●●●●●●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
0.50
0.75
0.90
1.00
1.10
1.50
2.00
0.50
0.75
0.90
1.00
1.10
1.50
2.00
41 / 51

m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
m
eta
sum
m
ary
risksets
dataset
0
25
50
75
100
0
25
50
75
100
data
%successfuliterations
Survival analysis successful iterations (%). Scenario 5
42 / 51

Summary
We examined various privacy-protecting analytic and
data-sharing methods through a simulation study to
assess whether restricting the level of data sharing could
aﬀect the performance of analytic methods compared to
the pooled individual-level data analysis.
Overall, levels of data sharing had little impact on bias
and precision of log HR estimates within each
confounding adjustment method in most simulated
scenarios.
44 / 51

Implications
This implies that in the setting where each data partner
provides similar site-specific results, that is, when it
makes sense to pool information across sites to form an
overall effect estimate, a meta-analysis of site-specific
effect estimates may be the most attractive option.
Pooling of site-specific analysis results has the benefit of
requiring investigator at the coordinating center to
examine the homogeneity or heterogeneity of site-specific
results, thereby, preventing inappropriate pooling when
heterogeneity is prominent.
45 / 51

Limitations
The true underlying treatment eﬀects were kept identical
across sites. This was necessary to ensure valid
comparison of methods.
We generated survival data based on exponential model
(time-constant hazard). Departure from this may make
summary table-based events/person-time analysis and
Cox regression less comparable.
Risk-set data analysis using PS-weighted dataset was
implemented as an experimental attempt. Although the
point estimates were correct, the SE estimates were not
accurate when treatment groups are of diﬀerent sizes.
46 / 51

Conclusion
Privacy-protecting methods, regardless confounding
adjustment methods employed, demonstrated similar
performance to the patient-level data analysis in the
simulation scenarios we examined.
Meta-analysis of site-level analysis results seems to be a
reasonable approach provided that data partners are
similar in patient characteristics and the outcome is not
too rare, which can render some sites non-informative.
47 / 51

Bibliography I
[1] Fleurence RL, Curtis LH, Caliﬀ RM, Platt R, Selby JV, and Brown JS.
Launching PCORnet, a national patient-centered clinical research network.
Journal of the American Medical Informatics Association: JAMIA. 2014;
21(4):578–582.
[2] Platt R, Carnahan RM, Brown JS, Chrischilles E, Curtis LH, Hennessy S, Nelson
JC, Racoosin JA, Robb M, Schneeweiss S, Toh S, and Weiner MG.
The U.S. Food and Drug Administration’s Mini-Sentinel program: status and
direction.
Pharmacoepidemiology and Drug Safety. 2012;21:1–8.
[3] Bohn J, Eddings W, and Schneeweiss S.
Conducting Privacy-Preserving Multivariable Propensity Score Analysis When
Patient Covariate Information Is Stored in Separate Locations.
American Journal of Epidemiology. 2017;185(6):501–510.
[4] Data Partners | Sentinel System.
[5] Distributed Database and Common Data Model | Sentinel System.
[6] Toh S, Shetterly S, Powers JD, and Arterburn D.
Privacy-preserving analytic methods for multisite comparative eﬀectiveness and
patient-centered outcomes research.
Medical Care. 2014;52(7):664–668.
49 / 51

Bibliography II
[7] Rassen JA, Avorn J, and Schneeweiss S.
Multivariate-adjusted pharmacoepidemiologic analyses of confidential information
pooled from multiple health care utilization databases.
Pharmacoepidemiology and Drug Safety. 2010;19(8):848–857.
[8] Fireman B, Lee J, Lewis N, Bembom O, van der Laan M, and Baxter R.
Influenza vaccination and mortality: differentiating vaccine effects from bias.
American Journal of Epidemiology. 2009;170(5):650–656.
[9] Rosenbaum PR and Rubin DB.
The central role of the propensity score in observational studies for causal
effects.
Biometrika. 1983;70(1):41–55.
[10] Hansen BB.
The prognostic analogue of the propensity score.
Biometrika. 2008;95(2):481–488.
[11] Hernan MA and Robins JM.
Causal Inference.
Chapman & Hall/CRC. 2016.
50 / 51

Bibliography III
Constructing a Control Group Using Multivariate Matched Sampling Methods
That Incorporate the Propensity Score.
The American Statistician. 1985;39(1):33–38.
Reducing Bias in Observational Studies Using Subclassiﬁcation on the Propensity
Score.
Journal of the American Statistical Association. 1984;79(387):516.
[14] Robins JM, Hernán MA, and Brumback B.
Marginal structural models and causal inference in epidemiology.
Epidemiology (Cambridge, Mass). 2000;11(5):550–560.
[15] Li L and Greene T.
A weighting analogue to pair matching in propensity score analysis.
The International Journal of Biostatistics. 2013;9(2):215–234.
[16] Burton A, Altman DG, Royston P, and Holder RL.
The design of simulation studies in medical statistics.
Statistics in Medicine. 2006;25(24):4279–4292.
51 / 51

Comparison of Privacy-Protecting Methods for Distributed Data Analysis

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Comparison of Privacy-Protecting Methods for Distributed Data Analysis

Similaire à Comparison of Privacy-Protecting Methods for Distributed Data Analysis (20)

Plus de Kazuki Yoshida

Plus de Kazuki Yoshida (20)

Dernier

Dernier (20)

Comparison of Privacy-Protecting Methods for Distributed Data Analysis