An emerging reality for statistical scientists is that the cost of data collection for analysis projects
is often too high to justify those projects. Thus many statistical analysis projects increasingly use
administrative data – data gathered by administrative agencies in the course of regular
operations. In many cases, such administrative data includes content that can be repurposed to
support a variety of statistical analyses. However, such data is often sensitive, including details
about individuals or organizations that can be used to identify them, localize their whereabouts,
and draw conclusions about their behavior, health, and political and social agendas. In the
wrong hands, such data can be used to cause social, economic, or physical harm.
Privacy-preserving computation technologies have emerged in recent years to provide some
protection against such harm while enabling valuable statistical analyses. Some kinds of
privacy-preserving computation technologies allow computing on data while it remains
encrypted or otherwise opaque to those performing the computation, as well as to adversaries
who might seek to steal that information. Because data can remain encrypted during
computation, that data can remain encrypted “end-to-end” in analytic environments, greatly
reducing its exposure to theft or misuse. However, protecting such data is only effective if we also
protect against what may be learned from the output of such analysis. Additional kinds of
emerging privacy-preserving computation technologies address this concern, protecting against
efforts to reverse engineer the input data from the outputs of analysis.
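One widely studied family of techniques for protecting analysis outputs is differential privacy, which adds calibrated random noise to query results so that no single record's presence can be inferred. The sketch below is a minimal, hypothetical illustration of that idea (the Laplace mechanism applied to a counting query); the data, function names, and parameter choices are illustrative assumptions, not part of this handbook's specification.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise by inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query has sensitivity 1 -- adding or removing one record
    changes the true count by at most 1 -- so Laplace(1/epsilon) noise
    is enough to mask any single individual's contribution.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical sensitive attribute: ages drawn from an administrative register
ages = [23, 35, 41, 52, 29, 67, 44, 38, 59, 31]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=0.5, rng=random.Random(0))
print(f"noisy count of records with age >= 40: {noisy:.2f}")
```

Smaller values of `epsilon` give stronger privacy (more noise) at the cost of accuracy; choosing this trade-off is exactly the kind of cost/benefit assessment the handbook discusses.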
Unfortunately, privacy-preserving computation comes at a cost: current versions of these
technologies are computationally costly, rely on specialized computer hardware, are difficult to
program and configure directly, or some combination of the above. Thus National Statistics
Offices (NSOs) and other analytic scientists may need guidance in assessing whether the cost
of such technologies can be appropriately balanced against resulting privacy benefits.
In this handbook, we define specific goals for privacy-preserving computation for public good in
two salient use cases: giving NSOs access to new sources of (sensitive) Big Data; and enabling
Big Data collaborations across multiple NSOs. We describe the limits of current practice in
analyzing data while preserving privacy; explain emerging privacy-preserving computation
techniques; and outline key challenges to bringing these technologies into mainstream use. For
each technology addressed, we provide a technical overview; examples of applied uses; an
explanation of modeling adversaries and security arguments that typically apply; an overview of
the costs of using the technology; an explanation of the availability of the technology; and a
Wardley map that illustrates the technology’s readiness and suggested development focus.
UN Handbook on Privacy-Preserving Computation Techniques 8
2. Protecting Quasi-Identifiers
• Masking explicit identifiers (EI) alone is not sufficient, as an adversary can still
use quasi-identifiers (QI) to re-identify a record owner
• This attack is called record linkage: a record from the released
database is linked with a record in an external data source.
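The linkage described above can be sketched in a few lines. The tables, attribute names, and values below are entirely hypothetical; they only illustrate how shared QI attributes (here ZIP, birth date, and sex) let an adversary join a masked release against a public source such as a voter roll.

```python
# Hypothetical released table: explicit identifiers (names) are masked,
# but the quasi-identifiers ZIP, birth date, and sex are left intact.
released = [
    {"zip": "02139", "birth": "1965-07-22", "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth": "1971-03-05", "sex": "M", "diagnosis": "asthma"},
]

# Hypothetical external source, e.g. a public voter roll, with names attached.
voters = [
    {"name": "Alice Smith", "zip": "02139", "birth": "1965-07-22", "sex": "F"},
    {"name": "Bob Jones",   "zip": "02139", "birth": "1971-03-05", "sex": "M"},
]

QI = ("zip", "birth", "sex")

def record_linkage(released, external):
    """Re-identify released records by joining on the shared QI attributes."""
    index = {tuple(rec[a] for a in QI): rec for rec in external}
    matches = []
    for rec in released:
        key = tuple(rec[a] for a in QI)
        if key in index:
            matches.append((index[key]["name"], rec["diagnosis"]))
    return matches

print(record_linkage(released, voters))
```

Every released record is re-identified, together with its sensitive diagnosis, even though no name appeared in the release.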
4. Two important aspects need to be
considered while anonymizing QI:
• The analytical utility of QI needs to be preserved
• The correlation of QI attributes with sensitive data needs to
be maintained to support the utility of the anonymized data
5. Challenges in Protecting QI
• Protection of QI is key to the success of any anonymization program,
especially with respect to multidimensional data.
6. The main challenges in anonymizing QI attributes are
• High dimensionality
• Background knowledge of the adversary
• Availability of external knowledge
• Correlation with sensitive data (SD) to ensure utility
• Maintaining analytical utility
7. Challenge!
• Principle (6) offers guidance
Principle of data structure complexity: anonymization design is
dependent on the data structure.
8. • Another important aspect to consider while anonymizing QI
attributes is that the correlation between QI and SD attributes must
be maintained.
• For example, in a life insurance application, the age of a policy holder
and the premium she pays for a particular insurance product are
correlated.
9. • Here, AGE is a QI attribute and PREMIUM is an SD attribute.
• Therefore, as part of the anonymization, it is important to maintain
this relationship between QI and SD attributes wherever applicable.
“The higher the age, the higher the premium."
• Another aspect that needs to be looked into is the analytical utility of
anonymized QI attributes.
12. • How many employees with EDUCATION = “Doctorate” are part of this
company?
• A perturbative method is used in the tables above for anonymization.
13. • Perturbative techniques are generally referred to as masking, and
non-perturbative techniques as anonymization.
• QI attributes are generally composed of a record owner’s
demographics, which are available in external data sources, such as a
voters database.
• It is indeed a challenge to anonymize QI attributes especially in the
presence of external data sources, protect outlier records, and
provide high utility.
• Principle (13) comes into play here.
14. • One of the techniques suggested is group-based anonymization
where data are anonymized in a group-specific manner. This
technique is called k-anonymization.
• k-anonymity achieves this through suppression and generalization of
quasi-identifier attributes.
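A minimal sketch of the two k-anonymity operations just named: generalization (coarsening each QI value) and suppression (dropping records whose generalized QI group is still smaller than k). The field names, the ten-year age bands, the ZIP truncation rule, and k = 2 are illustrative assumptions, not a prescription.

```python
from collections import Counter

def generalize(rec):
    """Coarsen QI: age to a ten-year band, ZIP to a three-digit prefix."""
    decade = rec["age"] // 10 * 10
    return {**rec, "age": f"{decade}-{decade + 9}", "zip": rec["zip"][:3] + "**"}

def k_anonymize(records, k=2):
    gen = [generalize(r) for r in records]
    qi_key = lambda r: (r["age"], r["zip"])
    group_sizes = Counter(qi_key(r) for r in gen)
    # Suppress records whose generalized QI group is still smaller than k
    return [r for r in gen if group_sizes[qi_key(r)] >= k]

# Hypothetical table: EI already masked, QI = (age, zip), SD = disease
records = [
    {"age": 31, "zip": "02139", "disease": "flu"},
    {"age": 34, "zip": "02141", "disease": "cold"},
    {"age": 66, "zip": "90210", "disease": "asthma"},
]
for rec in k_anonymize(records):
    print(rec)
```

The first two records end up in the same group (age "30-39", ZIP "021**") and are released; the third remains unique after generalization and is suppressed.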
15. Protecting Sensitive Data (SD)
• Data protection design ensures that EI are completely masked and QI
are anonymized, while SD are left in their original form because they are
required for analysis or as test data.
• As EI are completely masked, the transformed data are meaningless to
an adversary and useless for re-identification, and properly anonymized QI
also prevent re-identification.
• However, if sensitive data remain in their original form, they can still
provide a channel for re-identification.
17. • Even though the data have been randomly perturbed, the perturbation
is designed so that the mean and covariance of the original and
perturbed tables are the same.
• This means that the transformed table is still valid for analysis,
keeping the data useful while at the same time maintaining their
privacy.
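One way such a moment-preserving perturbation can be realized (a sketch of the general idea, not necessarily the method the text refers to) is to add random noise and then "recolor" the noisy table with the original Cholesky factors so its sample mean and covariance exactly match the original's. All names and parameters below are illustrative assumptions.

```python
import numpy as np

def perturb_preserving_moments(X, noise_scale=1.0, seed=0):
    """Randomly perturb a numeric table, then rescale so that the
    perturbed table's sample mean and covariance exactly match the
    original's (a recoloring step using Cholesky factors)."""
    rng = np.random.default_rng(seed)
    mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
    Y = X + rng.normal(scale=noise_scale, size=X.shape)  # break record-level links
    Yc = Y - Y.mean(axis=0)
    L_y = np.linalg.cholesky(np.cov(Y, rowvar=False))
    L_x = np.linalg.cholesky(cov)
    # Whiten the noisy table, then color it with the original covariance
    return Yc @ np.linalg.inv(L_y).T @ L_x.T + mu

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * [2.0, 1.0, 0.5] + [10.0, 5.0, 1.0]
Z = perturb_preserving_moments(X)
print(np.allclose(Z.mean(axis=0), X.mean(axis=0)))                    # True
print(np.allclose(np.cov(Z, rowvar=False), np.cov(X, rowvar=False)))  # True
```

Note that matching the first two moments keeps means, variances, and linear correlations intact for analysis, but by itself it is not a privacy guarantee; it only addresses the utility side of the trade-off discussed above.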