Seminar Monday March 5th 2018 by BigInsight and Statistics Norway: Presentation by Kassaye Yitbarek Yigzaw. Distributed data analysis in the face of privacy concerns.
BigInsight seminar on Practical Privacy-Preserving Distributed Statistical Computation
1. Distributed data analysis in the face of
privacy concerns
Kassaye Yitbarek Yigzaw
Postdoctoral Fellow
Norwegian Centre for E-health Research
2. Outline
• Motivation
• Secure multi-party computation
• Challenges
• Proposed solutions
• Discussion
05.03.2018
Distributed data analysis in the face of privacy
concerns
2
3. Health data
Distributed
health data
EHR data
Registry data
Insurance
claims
Health data collected
by data custodians
4. Opportunities
• Huge potential for a variety of purposes, such as
research and public health
• Increases the rate of new scientific discoveries
• Answers research questions that may not be
possible otherwise
5. Distributed data
• Generalizability and reproducibility of analyses
results
• Often require data from multiple data sources
Large sample size that provides sufficient
statistical power
Heterogeneity
• An individual’s data are partitioned between
multiple data sources
6. Horizontally partitioned data
• Data sources collect the same attributes for
disjoint set of individuals
Data source 1 Data source 2 Data source N…
7. Challenges
• Data reuse raises privacy concerns
• Limit data sharing and secondary use
Mental and physical harm
to patients
Evaluate their performance
Damage doctor–patient
relationship
Reveal confidential business
information
8. Objective
[Figure: privacy alongside enabling health data reuse for research and public health]
9. Common approaches
[Figure: data custodians D1, D2, and D3 send their data to a third party]
• Patient identifying data: bias, time, cost
• De-identified data: re-identification risk, data utility, distributed data
10. Secure multi-party computation (SMC)
[Figure: data custodians D1, D2, and D3; secure multi-party computation emulates the third party]
Computing on distributed data without revealing sensitive information apart from results
11. Challenges
• A generic solution for computing any function exists
• Efficiency and scalability are the main challenges
• Efficiency: communication and computation
overhead
• Scalability: number of data custodians and records
12. Adversarial model
• In the semi-honest adversarial model, participating parties:
Follow the protocol specification
May try to learn private information from the
messages exchanged in the protocol execution
• This model enables the development of protocols that are more
efficient and scalable
13. Dataset creation
[Figure: the user sends a query to the coordinator, which forwards it to data custodians D1, D2, and D3; the local query results together form the virtual dataset]
14. Secure statistical computation
[Figure: the user sends a query to the coordinator; secure protocols run over the virtual dataset held by D1, D2, and D3, and an aggregate result is returned]
16. Secure sum protocol (2)
• Proposed an extension to the secure sum protocol
• The protocol makes collusion difficult:
Forming a ring topology at runtime and
Revealing only partial information about the
ring topology to each party
18. Other statistical problems
• A large number of statistical problems can be
decomposed into sub-functions of summation
forms
• Descriptive statistics (e.g., average, standard
deviation, covariance, Pearson’s r, minimum,
maximum, and median)
• Linear regression
• Clustering (k-means)
19. Secure computation of average
20. Secure computation of average
[Figure: data custodian D1 holds a table with columns id, age, height, and weight]
1: Local computation
2: k-secure sum protocol
21. Secure mth-ranked element protocol
[Figure: data custodians D1, D2, D3, and DN, each holding its own data]
22. Secure mth-ranked element protocol
• Computing minimum ($m = 1$) and maximum ($m = n$)
• Computing $p$th percentile: $m = \frac{p}{100} \times n$
• First quartile, median, third quartile
• Box plot
23. Discussion
• The proposed solution can be used for a wide
variety of applications
• Antibiotics prescription monitoring and
benchmarking
• Infrastructure for research on primary care data
• The frameworks can be applied to domains other
than health
24. Discussion
• Gives physical control to the data custodians
• Efficient and scalable to a very large number of
data custodians and records
25. Discussion
• Develop secure protocols for more statistical
functions
• Vertically partitioned data
• Disclosure control
26. Publications
• Yigzaw KY, Hailemichael MA, Skrøvseth SO, Bellika JG. Secure and Scalable Protocol
for Computing mth-Ranked Element on Distributed Data. In: AMIA Annual
Symposium Proceedings. 2018 (under revision)
• Yigzaw KY. Towards Practical Privacy-Preserving Distributed Statistical Computation of
Health Data. UiT The Arctic University of Norway. PhD Thesis. 2016.
• Hailemichael MA, Yigzaw KY, Bellika JG. Emnet: a tool for privacy-preserving statistical
computing on distributed health data. Proceedings from The 13th Scandinavian
Conference on Health Informatics. 2015
• Andersen A, Yigzaw KY, Karlsen R. Privacy preserving health data processing. In: IEEE
16th International Conference on E-Health Networking, Applications and Services
(Healthcom). IEEE; 2014:225-230.
• Yigzaw KY, Bellika JG, Andersen A, Hartvigsen G, Fernandez-Llatas C. Towards
Privacy-preserving Computing on Distributed Electronic Health Record Data. In:
Proceedings of the 2013 Middleware Doctoral Symposium. MDS ’13. New York, NY,
USA: ACM; 2013:4:1–4:6.
27. Thank you for your attention!
http://www.panoramio.com/photo/10889343
Privacy-preserving collection and analyses of
citizens-generated data
Kassaye Yitbarek Yigzaw
kassaye.yitbarek.yigzaw@ehealthresearch.no
Editor's notes
My name is Kassaye Yitbarek Yigzaw
I’m a postdoctoral fellow at the Norwegian Centre for E-health Research.
The title of the talk is “Distributed data analysis in the face of privacy concerns”.
Part of the work presented here was done when I was a PhD student at UiT The Arctic University of Norway.
I will talk a bit about the motivation for health data reuse and the privacy challenge. Then I will give a brief introduction to the SMC paradigm. After presenting the main challenges of SMC, I will present the solutions we proposed to address them. I will finish the presentation with a discussion.
The increased adoption of electronic health record (EHR) systems, as well as a wide variety of other electronic data sources (e.g., insurance claims and registry data), has led to the collection of large amounts of detailed health information about individuals.
Reuse of health data has huge potential for a variety of purposes, such as research and public health.
Data reuse increases the rate of new scientific discoveries and makes it possible to answer research questions that could not be answered otherwise.
Generalizability and reproducibility of analysis results often require data from multiple data sources.
This is because the data from a single institution may not have a sample size large enough to provide sufficient statistical power, or enough heterogeneity to represent the population of interest.
Or the required data about an individual may be partitioned between multiple data sources.
In this presentation, I will focus on horizontally partitioned data, where each data source collects the same attributes for a disjoint set of individuals. The union of all the data sources’ datasets forms the overall dataset.
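As a rough illustration of horizontal partitioning, the sketch below uses made-up toy records (the values and the helper name `is_horizontally_partitioned` are assumptions, not from the talk). It checks the two defining properties, same attributes and disjoint individuals, and builds the union only to show what the overall dataset conceptually is; in the setting of this talk the records would stay with their data sources.

```python
# Hypothetical toy records; attribute names are illustrative only.
source_1 = [{"id": 1, "age": 34}, {"id": 2, "age": 51}]
source_2 = [{"id": 3, "age": 47}]
source_3 = [{"id": 4, "age": 29}, {"id": 5, "age": 62}]
sources = [source_1, source_2, source_3]


def is_horizontally_partitioned(sources):
    """True if every record has the same attributes and no id appears twice."""
    attribute_sets = {frozenset(record) for src in sources for record in src}
    ids = [record["id"] for src in sources for record in src]
    return len(attribute_sets) == 1 and len(ids) == len(set(ids))


# Conceptual union of all sources (never materialised in the privacy-preserving setting).
overall = [record for src in sources for record in src]
```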
Secondary use of data raises privacy concerns for individuals. Inappropriate disclosure of sensitive information may lead to mental and physical harm to patients. Even when individuals’ privacy concerns are addressed, clinicians and healthcare providers are also concerned that data sharing may damage the doctor-patient relationship, that the data could be used to evaluate their performance, or, in some contexts, that it could reveal confidential business information.
Therefore, privacy concerns limit the willingness to share data.
Both protecting privacy and improving healthcare through data reuse are important social goods that should be maintained.
Therefore, we need data analysis techniques that address the privacy concerns of both patients and data custodians.
Traditionally, distributed data analysis is performed by centrally collecting the data at a trusted third party that analyses the data. The third party can be an institution like SSB (Statistics Norway) or a researcher.
The data collected at the third party can be patient identifying data, which often requires consent. When there are systematic differences between individuals who consent and those who do not, this leads to bias. In addition, collecting consent is expensive and takes a long time.
In rare cases, there are exemptions from consent.
The other alternative is sharing de-identified data, which often does not require consent. The challenge for de-identified data sharing is striking a balance between re-identification risk and data utility. The problem becomes even more challenging in the context of distributed data. A simple approach to de-identifying distributed data is for each institution to de-identify its data locally before sharing. However, the union of the locally de-identified data does not give the same result as de-identifying the centrally collected data.
There is an area of research called secure multi-party computation (SMC). SMC deals with the problem of computing on distributed data without revealing anything apart from the result. In other words, it aims to emulate the third party.
Research in SMC is not limited to computing statistical functions. It is also used for privacy-preserving record linkage. In this talk, I will focus on computing statistical functions.
SMC was introduced in the 80s, and a generic solution for computing any function exists. However, the generic solution is not efficient or scalable enough for practical use.
Efficiency is the ability to compute with good performance. It is usually expressed in terms of communication and computation complexity.
Scalability is the ability to compute efficiently as the number of data custodians and records increases.
The most commonly considered adversarial model is semi-honest (honest-but-curious). In this model, a party that participates in a protocol is assumed to follow the protocol’s steps, but it may try to learn private information from the messages exchanged during the protocol execution.
The popularity of this adversarial model comes from the fact that it allows more efficient and scalable protocols to be developed while providing sufficient security for several use cases. In this presentation, I present protocols that are secure against a semi-honest adversary.
Before going into the secure computation, let me give you a general overview.
We have a set of data custodians. Let us assume a third party called the coordinator. The coordinator is not trusted to collect any private information; it is only expected to be semi-honest.
The coordinator receives the user’s dataset criteria and broadcasts them to the data custodians. Each data custodian executes the query against its database and stores the result locally. The query results across the data custodians collectively make up the overall dataset. We refer to this dataset as a virtual dataset, since it is physically distributed.
There can be data cleaning and other pre-processing tasks at this stage, but let us go ahead and run a statistical query on the virtual dataset.
The coordinator receives a user query and initiates the SMC protocols appropriate for that query. Then the SMC protocols are run on the virtual dataset, and aggregated results are returned to the user.
In the following slides, I will describe some SMC protocols.
Secure summation protocols add the private values of a set of data custodians without revealing the private value of any individual data custodian. It is the most widely studied problem, and different secure sum protocols have been proposed.
We consider the following secure summation protocol for its simplicity and efficiency. The simplified description of the protocol is as follows:
First, the data custodians form a ring topology.
The first data custodian D_1 selects a random value and sends s_1, the sum of the random value and its private value v_1, to D_2. D_2 adds its private value v_2 to s_1 and sends the result s_2 to D_3. The other parties in turn do the same. Finally, D_1 calculates the total sum by subtracting the random value from s_N.
However, if parties D_{i-1} and D_{i+1} collude, the private value of party D_i will be revealed.
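The ring-based secure sum and its collusion weakness can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not the implementation from the talk: the function name `secure_sum`, the toy values, and the use of arithmetic modulo a public constant `MODULUS` (so the masked running sums look uniformly random) are all choices made here for illustration.

```python
import secrets

MODULUS = 2**64  # assumption: public modulus, larger than any possible true sum


def secure_sum(private_values, modulus=MODULUS):
    """Simulate the ring: returns (total, messages), where messages[i] is s_{i+1}."""
    mask = secrets.randbelow(modulus)                 # random value chosen by D_1
    running = (mask + private_values[0]) % modulus    # s_1, sent from D_1 to D_2
    messages = [running]
    for value in private_values[1:]:                  # D_2 ... D_N each add their value
        running = (running + value) % modulus
        messages.append(running)
    total = (messages[-1] - mask) % modulus           # D_1 removes its mask from s_N
    return total, messages


values = [12, 7, 30, 5]            # hypothetical private values of D_1 .. D_4
total, messages = secure_sum(values)

# Collusion: D_2 knows the message it sent (s_2) and D_4 knows the one it received (s_3).
# Their difference is exactly the private value of D_3, the party between them.
leaked = (messages[2] - messages[1]) % MODULUS
```

Each party on its own sees only a masked running sum, but the last two lines show why two neighbours of the same party can recover that party's value by comparing notes.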
We proposed an extension to the protocol that makes collusion between two parties difficult by forming the ring topology at runtime and revealing to each party only partial knowledge about the ring topology.
To be able to scale secure summation to a large number of data custodians, we proposed a further extension to the protocol, denoted the k-secure summation protocol.
This protocol is based on dividing the data custodians into groups of k data custodians. Each group of k data custodians is denoted a privacy peer (PP). Each PP jointly runs a secure summation protocol. Then the results of the privacy peers are centrally aggregated at the coordinator. Because the privacy peers compute in parallel, the protocol can scale to a very large number of data custodians.
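The grouping idea can be sketched as follows. This is a sequential simulation with hypothetical names (`ring_secure_sum`, `k_secure_sum`) and toy values; the groups are processed one after another here, whereas in the protocol described they would run in parallel. How leftover custodians are handled when the number of custodians is not a multiple of k is not stated in the talk, so merging them into the last full group is an assumption of this sketch.

```python
import secrets

MODULUS = 2**64  # assumption: public modulus, larger than any possible true sum


def ring_secure_sum(values, modulus=MODULUS):
    """Masked ring summation within one privacy peer (simulated in one process)."""
    mask = secrets.randbelow(modulus)
    running = mask
    for value in values:
        running = (running + value) % modulus
    return (running - mask) % modulus


def k_secure_sum(private_values, k, modulus=MODULUS):
    """Split custodians into groups of k, sum each group securely, aggregate centrally."""
    groups = [private_values[i:i + k] for i in range(0, len(private_values), k)]
    # Assumption: a short trailing group joins the previous one, so that no
    # group is small enough for its partial sum to expose a single value.
    if len(groups) > 1 and len(groups[-1]) < k:
        groups[-2].extend(groups.pop())
    partial_sums = [ring_secure_sum(group, modulus) for group in groups]
    return sum(partial_sums) % modulus, groups       # aggregation at the coordinator


values = [12, 7, 30, 5, 21, 9, 16]   # hypothetical private values of seven custodians
total, groups = k_secure_sum(values, k=3)
```

Note that in this design the coordinator learns each group's partial sum, which is why the group size k matters for privacy as well as for performance.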
Researchers have exploited the fact that a large number of statistical problems can be decomposed into sub-functions of summation form.
Based on this concept, our group and other researchers have created secure protocols for computing different statistics. Some of the existing secure protocols include protocols for descriptive statistics, linear regression, and clustering.
I will illustrate the decomposition into sub-functions of summation form with a simple example. Average is described as … and the sub-computations are summation and count.
Let us say we want to compute the average age.
For each sub-computation, each data custodian computes locally on its data, and the k-secure sum protocol is used to aggregate the local results. Finally, the coordinator computes the average based on the sub-functions’ results.
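A minimal sketch of the average-age example, assuming made-up ages and a simple masked summation as a stand-in for the k-secure sum protocol (the names `secure_sum` and `ages_by_custodian` are hypothetical). It relies only on the standard fact that a mean is a total sum divided by a total count, so the two sub-computations are a secure sum of local sums and a secure sum of local counts.

```python
import secrets

MODULUS = 2**64  # assumption: public modulus, larger than any possible true sum


def secure_sum(values, modulus=MODULUS):
    """Stand-in for the k-secure sum protocol: masked summation, simulated locally."""
    mask = secrets.randbelow(modulus)
    running = mask
    for value in values:
        running = (running + value) % modulus
    return (running - mask) % modulus


# Hypothetical ages held by three data custodians.
ages_by_custodian = {"D1": [34, 51], "D2": [47], "D3": [29, 62, 41]}

# Step 1: local computation at each custodian.
local_sums = [sum(ages) for ages in ages_by_custodian.values()]
local_counts = [len(ages) for ages in ages_by_custodian.values()]

# Step 2: secure aggregation of each sub-function's local results.
total_age = secure_sum(local_sums)
total_count = secure_sum(local_counts)

# Step 3: the coordinator combines the aggregated sub-results.
average_age = total_age / total_count
```

Only the two aggregates reach the coordinator; the per-custodian sums and counts stay hidden behind the summation protocol.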
The example protocol I want to tell you a bit about is a protocol for computing the mth-ranked element. Let us say each data custodian has the ages of a set of individuals. The protocol finds the mth-ranked age value.
The use cases for the protocol include computing the minimum and maximum ages in the dataset. Another use case is computing pth percentiles; for example, we can compute the 25th, 50th, and 75th percentiles to generate a box plot.
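To make the link between percentiles and ranks concrete, the sketch below maps a percentile p to a rank m and then looks up the mth-ranked value. It is a non-private reference computation that pools all values, so it only shows what result the secure protocol is meant to produce, not how the protocol works. The function names, the toy ages, and the choice to round the rank up and clamp it to 1..n are assumptions made here.

```python
import math


def rank_for_percentile(p, n):
    """Rank m = p/100 * n, rounded up and clamped to 1..n (rounding is an assumption)."""
    return min(n, max(1, math.ceil(p * n / 100)))


def mth_ranked_reference(values_by_custodian, m):
    """Non-private reference: pool every value, sort, and pick the m-th smallest."""
    pooled = sorted(v for values in values_by_custodian for v in values)
    return pooled[m - 1]


# Hypothetical ages held by three data custodians.
ages = [[34, 51, 29], [47, 62], [41, 38, 55]]
n = sum(len(a) for a in ages)

# Box-plot inputs: minimum (m = 1), quartiles, and maximum (m = n).
ranks = [1, rank_for_percentile(25, n), rank_for_percentile(50, n),
         rank_for_percentile(75, n), n]
five_number = [mth_ranked_reference(ages, m) for m in ranks]
```

The five ranks correspond to the minimum, first quartile, median, third quartile, and maximum needed to draw a box plot.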
The proposed solution can be used for a wide variety of applications.
Some of the applications we are currently working on are antibiotics prescription monitoring and a benchmarking solution for GPs.
The framework is going to be used in a national infrastructure for research on primary care data.
More evaluation of the frameworks and the development of secure protocols for more statistical functions are left for further study.
The frameworks can be extended to stronger adversarial models.
The framework for distributed EHR data can be extended to vertically partitioned data.
Extending the privacy-preserving distributed statistical computation framework for questionnaire data to other sources of PGHD is also future work.