Microdata anonymization considerations
1. Timing, data access types and degree of anonymization in microdata dissemination
Reflections on data confidentiality, privacy, and curation
Rajiv Ranjan
NISR/UNDP-Rwanda
Regional Workshop on Microdata Dissemination Policy
Kigali, Rwanda: 27 – 29 August 2014
2. Scheme of the presentation
Confidentiality concerns
Access issues
Legal basis
Assurance
Challenges
Harmony
Governance
Practices
Timing, data access types and degree of anonymization in microdata dissemination
4. Caveat
Microdata dissemination must maintain
confidentiality of individual units: people,
households or enterprises.
Individual data collected by statistical agencies for statistical compilation, whether
they refer to natural or legal persons, are to be strictly confidential and used
exclusively for statistical purposes.
Principle 6
United Nations Fundamental Principles of Official Statistics
http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
5. Legal basis in Rwanda
Source: Law on the organisation of statistical activities in Rwanda. Chapter VI: Statistical Confidentiality, Article
17: Prohibited dissemination of information (N° 45/2013 of 16/06/2013)
Data collected by the institutions of the national
statistical system through surveys or any other
method of collection are protected by statistical
confidentiality. Statistical confidentiality implies
that the dissemination of such data as well as
statistical information which can be calculated from
them, shall be conducted in a way that those who
provided it are not identified whether directly or
indirectly.
7. Access benefits
• Fosters diversity of research
• Increases transparency and accountability
• Mitigates duplication of data collection work
• Increases the quality of data
https://unstats.un.org/unsd/accsub-public/microdata.pdf
8. Access assurance in Rwanda
The anonymous basic databases on
individuals and other institutions
shall be accessible to researchers
who, however, shall be committed to :
1° make a written note, that they shall not communicate to any person
the contents of such databases without the written authorization of
the National Institute of Statistics of Rwanda;
2° give to the National Institute of Statistics of Rwanda, the findings of
their research.
Source: Law on the organisation of statistical activities in Rwanda. Chapter VI: Statistical Confidentiality, Article
19: Accessibility to anonymous basic database not to be published (N° 45/2013 of 16/06/2013)
10. Balancing act
Disclosure risks vs. information loss
• In practice, the more the disclosure risks are reduced, the
lower will be the expected utility of the microdata sets.
• The objective remains to deal with the trade-off between
disclosure risks and information loss.
Source: Chris Skinner: Statistical Disclosure Control for Survey Data: http://personal.lse.ac.uk/skinnecj/SDC%20for%20survey%20data%20S3RI.pdf
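One way to make this trade-off concrete is to count, before and after recoding, how many records are unique on their indirect identifiers (a crude proxy for disclosure risk) and how many distinct values survive (a crude proxy for information retained). The sketch below is a toy illustration on invented records; the variable names, the 20-year age bands and both measures are assumptions chosen for illustration, not a method described in the presentation.

```python
from collections import Counter

# Invented toy records: (age, district, occupation). Not real survey data.
records = [
    (23, "Gasabo", "teacher"),
    (27, "Gasabo", "teacher"),
    (34, "Huye", "farmer"),
    (38, "Huye", "farmer"),
    (61, "Rubavu", "nurse"),
]

def share_unique(rows):
    """Share of rows whose combination of values occurs exactly once."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)

def distinct_values(rows, column):
    """Number of distinct values left in one column."""
    return len({r[column] for r in rows})

def recode_age(rows, width=20):
    """Generalize age into bands of `width` years (e.g. 20-39)."""
    out = []
    for age, district, occupation in rows:
        low = (age // width) * width
        out.append((f"{low}-{low + width - 1}", district, occupation))
    return out

recoded = recode_age(records)

risk_before = share_unique(records)        # every record is unique here
risk_after = share_unique(recoded)         # fewer unique records after recoding
ages_before = distinct_values(records, 0)  # detail available before recoding
ages_after = distinct_values(recoded, 0)   # detail left after recoding

print(f"unique records: {risk_before:.0%} -> {risk_after:.0%}")
print(f"distinct age values: {ages_before} -> {ages_after}")
```

In this toy case the share of unique records falls while the number of distinct age values also falls, which is the trade-off the slide describes: less risk, less detail.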
11. Challenges
[Emerging mash-ups] Datasets are being reused and combined with other datasets in ways never before thought possible, including for uses that go beyond the original intent.
[Growing motives] While there are promising research efforts underway to protect privacy, far more advanced efforts are presently in use to re-identify seemingly “anonymous” data.
[Improved access] Easier access to datasets has improved their discoverability, and the data could be used to re-identify previously de-identified datasets.
http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_5.1.14_final_print.pdf
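The re-identification risk described above typically arises through record linkage: a "de-identified" file is joined to another file that carries names, using the indirect identifiers the two share. A minimal sketch with invented records follows; both datasets, the field names and the helper `link` are hypothetical and only illustrate the idea.

```python
# Invented "anonymized" survey extract: names removed, indirect identifiers kept.
survey = [
    {"birth_year": 1980, "sex": "F", "sector": "Sector 1", "income": 420},
    {"birth_year": 1975, "sex": "M", "sector": "Sector 2", "income": 310},
    {"birth_year": 1980, "sex": "M", "sector": "Sector 2", "income": 150},
]

# Invented external register that carries names and the same identifiers.
register = [
    {"name": "Person A", "birth_year": 1980, "sex": "F", "sector": "Sector 1"},
    {"name": "Person B", "birth_year": 1975, "sex": "M", "sector": "Sector 2"},
    {"name": "Person C", "birth_year": 1975, "sex": "M", "sector": "Sector 2"},
]

KEYS = ("birth_year", "sex", "sector")

def link(anonymized, external, keys=KEYS):
    """Return (name, income) for survey rows matching exactly one named record."""
    index = {}
    for person in external:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])
    matches = []
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the row
            matches.append((names[0], row["income"]))
    return matches

reidentified = link(survey, register)
print(reidentified)
```

Only the first survey row is re-identified here: the second matches two register entries and the third matches none, which is why coarser identifiers (more people sharing each combination) reduce the risk.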
12. Complicating the challenges
Disclosure risks vs. information loss
Machine readability, open standards and free for reuse (1)
Post 2015 (2)
Images: (1.) From the cover of ‘Open Data Now’ - a book by Joel Gurin, exploring how open data within public records will create new jobs, applications and other technology innovations. http://www.opendatanow.com & (2.) A project at PARIS21 on data revolution for post 2015 SDGs http://www.paris21.org/node/1654
14. Coexistence
“There is nothing inherently contradictory about
hiding one piece of information while revealing
another, so long as the information we want to hide is
different from the information we want to disclose.”
- Felix T. Wu in Defining Privacy and Utility in Data Sets.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2031808
Though not easy, it is possible and desirable for openness and privacy to co-exist.
22. Users types served
Govt. (Policy makers and researchers)
International development agencies
Research and academic institutions
Students and professors
Others (scientific researchers)
23. Release timing
Within 6 – 24 months after the 1st release of aggregated data from a survey/census
Examples (chart scale: 24 months):
DHS 2010: 7
EICV(3) 2010-2011: 7
Census 2012: ?
Seasonal Agri Survey 2013: ?
EICV: Integrated Household Living Conditions Survey
25. Types of files/access
Total no. of studies = 24
Chart categories:
Open access (no restriction)
Direct access or Public Use Files (some restrictions on use, but no screening of users)
Research Use Files (or Scientific Use Files, or Licensed Files)
Availability only in an enclave
No access authorized
Data not available
Data available from external repo: 4
Other values shown in the chart: 16, 1, 3
26. Degree of anonymization
Removing or modifying the identifying variables contained in the microdata
The usual practice at NISR is to release microdata as Public Use Files.
For example, in EICV3, the methods applied for anonymizing data were:
• Suppressing/deleting the records of direct identifiers (e.g. name of the head of HH) and a few indirect identifiers (e.g. sub-national admin boundaries)
• Generalizing/replacing (recoding) some indirect identifiers with less specific but semantically consistent groupings of observation values (e.g. place of birth, occupation)
• Perturbing/distorting some indirect identifiers by randomizing the values (e.g. clusters)
Variations in the degree of anonymization (and resulting access files/types) may be considered depending on the sensitivity of the dataset and the use.
Integrated Household Living Conditions Survey (EICV): EICV3 was done in 2010-2011
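The three families of methods listed above (suppressing, generalizing and perturbing) can be sketched on toy records as follows. This is an illustrative sketch only: the field names, the occupation grouping and the choice of shuffling as the randomization step are assumptions, not the actual EICV3 procedure.

```python
import random

# Invented household records; all field names are hypothetical.
households = [
    {"head_name": "Person A", "village": "V1", "occupation": "primary teacher",
     "cluster": 101, "consumption": 250},
    {"head_name": "Person B", "village": "V2", "occupation": "maize farmer",
     "cluster": 102, "consumption": 180},
    {"head_name": "Person C", "village": "V3", "occupation": "nurse",
     "cluster": 103, "consumption": 300},
]

# Direct identifier and fine-grained geography to drop (assumed choice).
SUPPRESS = ("head_name", "village")

# Assumed recoding of occupations into broader groups.
OCCUPATION_GROUPS = {
    "primary teacher": "education",
    "maize farmer": "agriculture",
    "nurse": "health",
}

def anonymize(rows, seed=0):
    """Apply suppression, generalization and a simple perturbation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Perturbing: shuffle cluster codes across records (one simple way
    # to randomize values; other schemes are possible).
    clusters = [r["cluster"] for r in rows]
    rng.shuffle(clusters)
    out = []
    for row, cluster in zip(rows, clusters):
        # Suppressing: drop the listed identifiers entirely.
        new = {k: v for k, v in row.items() if k not in SUPPRESS}
        # Generalizing: replace the occupation with its broader group.
        new["occupation"] = OCCUPATION_GROUPS.get(row["occupation"], "other")
        new["cluster"] = cluster
        out.append(new)
    return out

public_use = anonymize(households)
for row in public_use:
    print(row)
```

The analysis variable (`consumption` in this sketch) is left untouched, so the released file stays useful while the identifying variables are removed or modified.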
28. @rajiv_r_in
…
Thank you!
“87% of the U.S.
population can be
uniquely identified
by date of birth +
gender + zip”
Latanya Sweeney, CMU
latanyasweeney.org
Editor's notes
We often use the terms "confidentiality" and "privacy" interchangeably in our everyday lives. However, they mean distinctly different things. While confidentiality relates to information/data about an individual, privacy relates to a person and is a right rooted in common law. Privacy protects access to the person, whereas confidentiality protects access to the data. In the context of statistics – ‘confidentiality’ is the researcher’s agreement with the participant about how the participant’s identifiable private information will be handled, managed, and disseminated. Hence, confidentiality is an ethical duty.
[Situations vary. In some cases the duty is easy and in some cases it is not.]
How this duty is performed by controlling three factors, (1) timing of data release, (2) data access types and (3) degree of anonymization, is the topic of my presentation.
I’ll keep two parallel tracks during my presentation: a generic track and a Rwanda-specific track. While talking about the generic material, I’ll often switch to Rwanda-specific examples to illustrate my points.
Let’s dig deeper into the subject.
In most statistical practice, the caveat is that microdata dissemination must maintain the confidentiality of individual units: people, households or enterprises.
Driven by Principle 6 of UN Fundamental Principles of Official Statistics.
However, while in some cases the principle facilitates the caveat, in others strict confidentiality is invoked as a reason not to share any microdata.
- In Rwanda, there is a strong legal basis facilitating the caveat.
- The law also provides for penalties in case of breach of statistical confidentiality.
With regard to Principle 6 of the UN Fundamental Principles of Official Statistics, if access becomes the casualty, that is a loss.
Therefore, the broadly accepted rationale is that, although confidentiality should be upheld, access to data should not be jeopardised.
See some benefits:
The rationale for access is broadly accepted.
In Rwanda, statistical law provides for the ‘assurance of access’.
It is obvious that seemingly conflicting ideas may pose some challenges, if applied simultaneously.
It is, therefore, a balancing act.
There is a constant struggle to minimize both.
What has added to the misery?
What recourse do we have? Is it possible to have harmony?
Though not easy, it is possible and desirable for openness and privacy to co-exist.
What are the decision factors?
What helps?
This takes away the pressure for microdata to appeal to ‘all’ / ‘normal’ users.
At NISR there is only one dataset which has Licensed data files: the General Census of Population and Housing 2002. This is because the entire dataset is made available (though anonymized). For the current Census, only 5% of the data will be released (after anonymization), as Public Use Files.
The challenge is quite big here (read in the context of Big Data). We are learning. Although simple means are currently in use, we intend to move towards more complex arrangements where the ‘balancing act’ is better optimized.