SlideShare une entreprise Scribd logo
1  sur  19
Télécharger pour lire hors ligne
Technische Universität München
Fabian Prasser
Lehrstuhl für Medizinische Informatik
Institut für Medizinische Statistik und Epidemiologie
Klinikum rechts der Isar der TU München
An overview of methods for data anonymization
Technische Universität München
Three types of de-identification / anonymization
• Masking identifiers in unstructured data
• Subject: clinical notes, …
• Methods: machine learning, regular expressions, …
• Implementations: MIST, MITdeid, NLM Scrubber
• Privacy preserving data analysis (interactive scenario)
• Subject: query results, …
• Methods: interactive differential privacy, query-set-size control, …
• Implementations: AirCloak, Airavat, Fuzz, PINQ, HIDE
• Transforming structured data (non-interactive scenario)
• Subject: tabular data, …
• Methods: generalization, suppression, randomization, ...
• Implementations: AnonTool, ARX, sdcMicro, μArgus, PARAT
219.03.2015
An overview of methods for data anonymization
Technische Universität München
Multiple aspects have to be balanced
• Main goal: Achieve a balance between data utility and privacy
• Complex task
• Many different types of methods need to be applied in an
integrated manner
• Methods may need to be parameterized
• Different aspects are interrelated
• Just the most important aspects and relationships
319.03.2015
Use cases
Types of data
Privacy models
Transformation models
Utility measures
Algorithms
Risk models
An overview of methods for data anonymization
Technische Universität München
Important aspects of use cases
• Who or what will process the data in which way?
• Humans, e.g., epidemiologists
• Different types of analyses
• Interactive vs. non-interactive
• Machines, i.e., data mining
• Classification vs. clustering
• How will the data be released?
• Access control
• Open access vs. restricted access
• Continous data publishing
• Multiple views vs. re-release (incremental vs. new attributes)
• Is the data distributed?
• Collaborative environments
• Vertical vs. horizontal vs. hybrid distribution
419.03.2015
[FWF11]
[FWF11]
[FWF11]
An overview of methods for data anonymization
Technische Universität München
Important properties of data
• Relational data
• Tabular data
• One row per individual
• Transactional data
• Data consisting of set-valued attributes
• Example: Follow-up collection of diagnosis codes
• Data with relational and transactional characteristics
• Dimensionality of data
• Mitigating re-identification is practically infeasible for
high-dimensional data
• Data with clusters
• Example: household structures
• Other types of data: Trajectory data, social network data
519.03.2015
[Agg05]
An overview of methods for data anonymization
Technische Universität München
Privacy models: some background
• Definition of (perfect) privacy
• “Anything that can be learned about a respondent from a statistical
database should be learnable without access to the database”
• Syntactic models
• Syntactic conditions on the released datasets
• No (direct) semantic implications regarding the above definition
• Instead: Assumptions about attack vectors and definition of (likely)
background knowledge and goals by classifying attributes
• Direct and indirect identifiers (or quasi-identifiers, or keys)
• Sensitive and insensitive attributes
• Semantic models
• Privacy models that relax a formalization of the above definition
• Much fewer assumptions need to be made about attackers
619.03.2015
formulated by [Dwork08]
[Swe02]
[Dal77]
An overview of methods for data anonymization
Technische Universität München
Risk and threat models
• Disclosure models
• Identity disclosure (re-identification, tuple linkage)
• Attribute disclosure (sensitive information disclosure)
• Membership disclosure (table linkage)
• Models for quantifying re-identification risks
• Super-population models: Population is modeled with probability
distributions parameterized with sample characteristics
• Decision rule by Dankar et al.: Combination of three models,
which has been evaluated for biomedical datasets
• Attacker models: May be used to derive/compile global risks
• Prosecutor scenario: Targets one specific individual
• Marketer scenario: Targets as many individuals as possible
• Journalist scenario: Targets any individual
719.03.2015
[Emam13]
[DEN+12]
[LLZ+12]
An overview of methods for data anonymization
Technische Universität München
Syntactic models against re-identification
• Goal: Prevent linkage attacks on quasi-identifiers
• Some models for relational data
• k-Anonymity: Requires groups (cells or equivalence classes) of
size ≥ k, which defines an upper bound on the re-identification risk
(over-) estimated with sample frequencies
• LKC-Privacy: Relaxed variant of k-anonymity + (ℓ-diversity)
• Risk-based approaches: Enforce thresholds on re-identification
risks, which may be quantified with super-population models
• HIPAA Safe Harbor: Heuristic with many predefined identifiers and
a few quasi-identifiers (regions and all kinds of dates). Contains
wildcards („any other unique identifying number, characteristic, or
code“). Provides sound legal protection for custodians in the US
• Some models for transactional data
• (km
)-Anonymity: k-Anonymity regarding ≤ m values from a set
819.03.2015
[Swe02]
[MFH+09]
[HIP]
[TMK08]
An overview of methods for data anonymization
Technische Universität München
Syntactic models against attribute disclosure
• Observation: Preventing linkage attacks is not enough
• Goal: Prevent knowledge gain from sensitive information associated
with an equivalence class
• Some models for relational data
• ℓ-Diversity: Sensitive values must be „well-represented”. Multiple
variants exist with different privacy/utility trade-offs
• t-Closeness: Distribution of sensitive values must not be „too
different“ from the overall dataset. Multiple variants exist
• p-Sensitive k-anonymity: Focus on identity & attribute disclosure
• LKC Privacy: ℓ-Diversity & relaxed k-anonymity
• Some models for transactional data
• (h, k, p)-coherence: km
-anonymity + protection against inference
• ρ-Uncertainty: protection against inference with fewer
assumptions
919.03.2015
[MFH+09]
[MGK+06]
[LLV07]
[TV06]
[CKR+10]
[XWF+08]
An overview of methods for data anonymization
Technische Universität München
Further syntactic models
• Models against membership disclosure for relational data
• Goal: Bounds on the certainty with which the presence of data
about an individual in a database can be inferred via linkage
• Upside: With strict thresholds, they provide semantic privacy
• Downside: Basically impossible to achieve
• δ-Presence: Relates sample counts to population counts
• c-Confident δ-presence: Relaxation of δ-presence in which
population characteristics are estimated
• Models for data which is relational and transactional
• (k, km
)-Anonymity: Mixture of k-anonymity and km
-anonymity
• Models for continuous publishing of relational data
• Approach by Byun et al.: Only supports insertions
• m-Invariance: Supports insertions, deletions, updates
1019.03.2015
[NAC07]
[NC10]
[TMK08]
[XT07]
An overview of methods for data anonymization
Technische Universität München
A semantic model: Differential Privacy
• Observation
• The formal notion of privacy is impossible to achieve
• Even for individuals that are not part of the statistical database
• Idea
• Do not compare an attacker's information about an individual
before and after accessing a statistical database, but
• Compare the risks for an individual when joining (or leaving) a
statistical database
• (Slightly) more formal
• -Differential Privacy:ϵ A (randomized) function fulfills -DP if theϵ
probability of every possible output value changes by a factor of at
most exp( ) when data about an individual is or is not contained inϵ
a database.
• Relaxations: ( ,ϵ δ)-DP, approximate DP
1119.03.2015
[Dwork 06, Dwork08]
[Dwork06]
[Dwork 06, Dwork08]
[PK08, LM12]
An overview of methods for data anonymization
Technische Universität München
A semantic model: Differential Privacy (cont'd)
• DP in interactive scenarios
• Sequential composition rule
• Privacy budget
• DP in non-interactive scenarios
• Release of contingency tables or marginals
• Relationships to syntactical models exist, e.g.,
• (k, β)-SDGS: Random sampling + k-anonymity fulfills ( , δ)-DPϵ
• t-Closeness (with a specific distance function): Implies
-DP regarding the sensitive attributesϵ
• Has been criticized in the context of biomedical research
• DP is often not a truthful mechanism: Functions are
randomized, often data is pertubated, e.g., by adding noise
• DP is not intuitive: What is a good value for ? What does itϵ
mean?
1219.03.2015
[FDE13]
[DS15]
[LQS11]
[BCD+07][BCD+07]
[FDE13]
An overview of methods for data anonymization
Technische Universität München
Measuring data utility
• Often used interchangeably with “loss of information”
• Exemplary utility measures for syntactic models
• Used for evaluating transformed datasets
• Discernibility: Based on sizes of equivalence classes
• Average equivalence class size: Analogously to discernibility
• (Non-uniform) entropy: Information theoretic measure
• Loss: Measures the coverage of the domain of attributes
• Utility constraints: Use cases are modeled as queries
• Exemplary utility measures for Differential Privacy
• Used for evaluating a method that fulfills DP
• Error: Absolute, relative, variance
• (α, δ)-Usefulness: P[distance ≤ α] ≥ δ
1319.03.2015
[BA05]
[LDD+05]
[GT09]
[LGM10]
[VI02]
[FDE13]
An overview of methods for data anonymization
Technische Universität München
Transformation methods
• Coding models
• Global recoding: Similar transformation for similar values
• Local recoding: Different transformations may be applied
• Truthful transformations
• Generalization: Based on domain generalization hierarchies
• Full-domain generalization: All values of an attribute are
generalized to the same level
• Subtree generalization: Different levels of generalization may
be applied
• Suppression: Removal of values of cells or complete tuples
• Top & bottom coding: Replacing values that exceed given
bounds
1419.03.2015
[FWF11]
[FWF11]
An overview of methods for data anonymization
Technische Universität München
Transformation methods (cont'd)
• Non-truthful transformations (Pertubation)
• Post-randomization: Randomly change categories of a
categorical variable according to predefined probabilities
• Value distortion: Multiplicative or additive noise
• Numerical rank swapping: Randomly swap values with other
values with a rank that does not differ by more than a predefined
threshold
• Microaggregation: Aggregate values in one group
• Replacing values: Distribution sample or distribution itself
• Methods on a structural level
• Random sampling: Randomly select a set of tuples
• Slicing: Partition the data horizontally and vertically and creates
links between partitions
1519.03.2015
[FWF11]
[LLZ+12]
[LQS11]
An overview of methods for data anonymization
Technische Universität München
Algorithms
• Transform data to meet privacy models
• Given transformation methods, data properties etc.
• Randomized algorithms
• Randomized functions for Differential Privacy
• Genetic search
• Search algorithms
• Optimal algorithms: Flash, Incognito, OLA
• Heuristic algorithms: Top-Down-Specialization
• Clustering algorithms: Iteratively merge groups
• Data (focus on tuples): Method by Tassa et al.
• Space (focus on taxonomies): For transactional data
• Partitioning algorithms: Iteratively split groups
• Data: Mondrian
1619.03.2015
[Dwork08a]
[EDI+09, LDR05, KPE+12]
[FWY05]
[GT10]
[Iye02]
[GLS14]
[LG13]
[LDR06]
An overview of methods for data anonymization
Technische Universität München
Thank you for your attention!
1719.03.2015
An overview of methods for data anonymization
Technische Universität München
Further Readings
• Fung CMB, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving
Data Publishing: Concepts and Techniques. Chapman & Hall/CRC,
ISBN: 1420091484, 2011.
• Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from
electronic health records while preserving privacy: A survey of
algorithms. J Biomed Inform, Vol. 50, p. 4-19, 2014
• Dankar FK, El Emam K. Practicing Differential Privacy in Health Care:
A Review. Trans. Data Privacy, Vol. 6:1, 2013.
• Gkoulalas-Divanis A, Loukides G. Anonymization of Electronic Medical
Records to Support Clinical Analysis. Springer. ISBN:
978-1-4614-5668-1, 2013
• El Emam K. Guide to the De-Identification of Personal Health
Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013
1819.03.2015
An overview of methods for data anonymization
Technische Universität München
References[Agg05] Aggarwal CC. On k-anonymity and the curse of dimensionality. PVLDB, p. 901–909, 2005.
[BA05] Bayardo RJ, Agrawal R. Data Privacy through Optimal k-Anonymization. ICDE. pp. 217–228, 2005.
[BCD+07] Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F et al. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. PODS, p. 273–282, 2007.
[CKR+10] Cao J, Karras P, Raïssi C, Tan K. rho-uncertainty: inference-proof transaction anonymization. PVLDB;3(1):1033–44. 2010.
[Dal77] Dalenius T. Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 5, p. 429–444, 1977.
[DEN+12] Dankar F, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Medical Informatics and Decision Making, 12:66, 2012.
[DS15] Domingo-Ferrer J, Soria-Comas J. From t-closeness to differential privacy and vice versa in data anonymization. Knowl.-Based Syst., 74:151–158, 2015.
[Dwork06] Dwork C. Differential Privacy. ICALP; 4052, pp. 1–12. Springer, 2006.
[Dwork08] Dwork C. An Ad Omnia Approach to Defining and Achieving Private Data Analysis. PinKDD, p. 1-13, 2008.
[Dwork08a] Dwork C. Differential privacy: A survey of results. TAMC. pp. 1–19, 2008.
[EDI+09] El Emam K, Dankar F, Issa R, Jonker E et al. A globally optimal k-anonymity method for the de-identification of health data. JAMIA, 16(5), pp. 670–682, 2009.
[Emam13] El Emam K. Guide to the De-Identification of Personal Health Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013.
[FDE13] Fida K, Dankar F, El Emam K. Practicing differential privacy in health care: A review. Trans. Data Privacy, 6(1):35–67, 2013.
[FWF11] Fung CMB, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. Chapman & Hall/CRC, ISBN: 1420091484, 2011.
[FWY05] Fung BCM, Wang K, Yu PS. Top-Down Specialization for Information and Privacy Preservation. ICDE, 2005, pp. 205–216.
[GKK07] Ghinita G, Karras P, Kalnis P, Mamoulis N. Fast data anonymization with low information loss. VLDB, 2007, pp. 758–769.
[GT09] Gionis A, Tassa T. k-Anonymization with Minimal Loss of Information. IEEE Trans Knowl Data Eng, pp. 206–219, 2009.
[GT10] Goldberger J, Tassa T. Efficient anonymizations with enhanced utility. Transactions on data privacy, vol. 3, pp. 149–175, 2010.
[GLS14] Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from electronic health records while preserving privacy: A survey of algorithms. JBI, Vol. 50, p. 4-19, 2014
[HIP] U.S. Department of Health and Human Services Office for Civil Rights. HIPAA administrative simplification regulation text; 2006.
[Iye02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, 2002.
[KPE+12] Kohlmayer F, Prasser F, Eckert C, Kemper A, Kuhn KA. Flash: Efficient, Stable and Optimal K-Anonymity. PASSAT., pp. 708–717, Sep. 2012.
[LDD+05] LeFevre K, DeWitt DJ, Ramakrishnan R. Multidimensional K-Anonymity (TR-1521). Tech. rep. University of Wisconsin, 2005.
[LDR05] LeFevre K, DeWitt D, Ramakrishnan R. Incognito: Efficient full-domain k-anonymity. SIGMOD. 2005, pp. 49–60, 2005.
[LDR06] LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian Multidimensional K-Anonymity. ICDE, 2006.
[LG13] Loukides G, Gkoulalas-Divanis A. Utility-aware anonymization of diagnosis codes. J Biomed Health Inform. 2013;17(1):60–70.
[LGM10] Loukides G, Gkoulalas-Divanis A, Malin B, COAT: COnstraint-based anonymization of transactions. Knowl Inf Syst, 28:2, p. 251–282, 2010.
[LLV07] Li N, Li T, Venkatasubramanian S. t-Closeness: privacy beyond k-anonymity and l-diversity. ICDE; p. 106–15. 2007.
[LLZ+12] Li T, Li N, Zhang J, Molloy I. Slicing: A new approach for privacy preserving data publishing. IEEE Trans Knowl Data Eng, 24(3):561–574, 2012
[LM12] Li C, Miklau G. An adaptive mechanism for accurate query answering under differential privacy. PVLDB, 5(6):514–525, 2012.
[LQS11] Li N, Qardaji WH, Su D. Provably private data anonymization: Or, k-anonymity meets differential privacy. CoRR, abs/1101.2604, 2011.
[MFH+09] Mohammed N, Fung BCM, Hung PCK, and Lee CK. Anonymizing healthcare data: A case study on the blood transfusion service. SIGKDD, p. 1285–1294, 2009.
[MGK+06] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-Diversity: privacy beyond k-anonymity. ICDE; pp. 24, 2006.
[NAC07] Nergiz ME, Atzori M, Clifton C. Hiding the presence of individuals from shared databases. SIGMOD; p. 665–676, 2007.
[NC10] Nergiz ME, Clifton C. d-presence without complete world knowledge. IEEE Trans Knowl Data Eng; 22(6):868–83, 2010.
[PK08] Prasad S, Kasiviswanathan AS. A note on differential privacy: Defining resistance to arbitrary side information. CoRR abs/0803.3946, 2008.
[Swe02] Sweeney L. k-anonymity: a model for protecting privacy. IJUFKS;10:557–70, 2002.
[TMK08] Terrovitis M, Mamoulis N, Kalnis P. Privacy-preserving anonymization of set valued data. PVLDB;1(1):115–25, 2008.
[TMK08] Terrovitis M, Mamoulis N, Kalnis P. Privacy-preserving anonymization of setvalued data. PVLDB;1(1):115–25, 2008.
[TV06] Truta TM, Vinay B. Privacy protection: p-sensitive k-anonymity property. ICDE workshops; p. 94, 2006.
[VI02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, pp. 279–288, 2002.
[XT07] Xiao X, Tao Y. m-invariance: Towards privacy preserving republication of dynamic datasets. SIGMOD, 2007.
[XWF+08] Xu Y, Wang K, Fu AW-C, Yu PS. Anonymizing transaction databases for publication. KDD; p. 767–75, 2008.
1919.03.2015
An overview of methods for data anonymization

Contenu connexe

Tendances

What is Zero Trust
What is Zero TrustWhat is Zero Trust
What is Zero TrustOkta-Inc
 
CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...
CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...
CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...PECB
 
CISSP Prep: Ch 6. Identity and Access Management
CISSP Prep: Ch 6. Identity and Access ManagementCISSP Prep: Ch 6. Identity and Access Management
CISSP Prep: Ch 6. Identity and Access ManagementSam Bowne
 
Identifying Effective Endpoint Detection and Response Platforms (EDRP)
Identifying Effective Endpoint Detection and Response Platforms (EDRP)Identifying Effective Endpoint Detection and Response Platforms (EDRP)
Identifying Effective Endpoint Detection and Response Platforms (EDRP)Enterprise Management Associates
 
What is zero trust model (ztm)
What is zero trust model (ztm)What is zero trust model (ztm)
What is zero trust model (ztm)Ahmed Banafa
 
CISSP - Chapter 1 - Security Concepts
CISSP - Chapter 1 - Security ConceptsCISSP - Chapter 1 - Security Concepts
CISSP - Chapter 1 - Security ConceptsKarthikeyan Dhayalan
 
Concienciacion en ciberseguridad y buenas prácticas
Concienciacion en ciberseguridad y buenas prácticas Concienciacion en ciberseguridad y buenas prácticas
Concienciacion en ciberseguridad y buenas prácticas Javier Tallón
 
Presentation of Social Engineering - The Art of Human Hacking
Presentation of Social Engineering - The Art of Human HackingPresentation of Social Engineering - The Art of Human Hacking
Presentation of Social Engineering - The Art of Human Hackingmsaksida
 
Effective Security Operation Center - present by Reza Adineh
Effective Security Operation Center - present by Reza AdinehEffective Security Operation Center - present by Reza Adineh
Effective Security Operation Center - present by Reza AdinehReZa AdineH
 
INCIDENT RESPONSE CONCEPTS
INCIDENT RESPONSE CONCEPTSINCIDENT RESPONSE CONCEPTS
INCIDENT RESPONSE CONCEPTSSylvain Martinez
 
Information security awareness - 101
Information security awareness - 101Information security awareness - 101
Information security awareness - 101mateenzero
 
Guardicore - Shrink Your Attack Surface with Micro-Segmentation
Guardicore - Shrink Your Attack Surface with Micro-SegmentationGuardicore - Shrink Your Attack Surface with Micro-Segmentation
Guardicore - Shrink Your Attack Surface with Micro-SegmentationCSNP
 
Securityawareness
SecurityawarenessSecurityawareness
SecurityawarenessJayfErika
 
Building An Information Security Awareness Program
Building An Information Security Awareness ProgramBuilding An Information Security Awareness Program
Building An Information Security Awareness ProgramBill Gardner
 
Cyber Security and the CEO
Cyber Security and the CEOCyber Security and the CEO
Cyber Security and the CEOMicheal Axelsen
 
MITRE ATT&CK framework
MITRE ATT&CK frameworkMITRE ATT&CK framework
MITRE ATT&CK frameworkBhushan Gurav
 

Tendances (20)

What is Zero Trust
What is Zero TrustWhat is Zero Trust
What is Zero Trust
 
CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...
CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...
CIA Triad in Data Governance, Information Security, and Privacy: Its Role and...
 
CISSP Prep: Ch 6. Identity and Access Management
CISSP Prep: Ch 6. Identity and Access ManagementCISSP Prep: Ch 6. Identity and Access Management
CISSP Prep: Ch 6. Identity and Access Management
 
Identifying Effective Endpoint Detection and Response Platforms (EDRP)
Identifying Effective Endpoint Detection and Response Platforms (EDRP)Identifying Effective Endpoint Detection and Response Platforms (EDRP)
Identifying Effective Endpoint Detection and Response Platforms (EDRP)
 
What is zero trust model (ztm)
What is zero trust model (ztm)What is zero trust model (ztm)
What is zero trust model (ztm)
 
CISSP - Chapter 1 - Security Concepts
CISSP - Chapter 1 - Security ConceptsCISSP - Chapter 1 - Security Concepts
CISSP - Chapter 1 - Security Concepts
 
Concienciacion en ciberseguridad y buenas prácticas
Concienciacion en ciberseguridad y buenas prácticas Concienciacion en ciberseguridad y buenas prácticas
Concienciacion en ciberseguridad y buenas prácticas
 
Presentation of Social Engineering - The Art of Human Hacking
Presentation of Social Engineering - The Art of Human HackingPresentation of Social Engineering - The Art of Human Hacking
Presentation of Social Engineering - The Art of Human Hacking
 
Effective Security Operation Center - present by Reza Adineh
Effective Security Operation Center - present by Reza AdinehEffective Security Operation Center - present by Reza Adineh
Effective Security Operation Center - present by Reza Adineh
 
Data Protection Presentation
Data Protection PresentationData Protection Presentation
Data Protection Presentation
 
INCIDENT RESPONSE CONCEPTS
INCIDENT RESPONSE CONCEPTSINCIDENT RESPONSE CONCEPTS
INCIDENT RESPONSE CONCEPTS
 
Information security awareness - 101
Information security awareness - 101Information security awareness - 101
Information security awareness - 101
 
Guardicore - Shrink Your Attack Surface with Micro-Segmentation
Guardicore - Shrink Your Attack Surface with Micro-SegmentationGuardicore - Shrink Your Attack Surface with Micro-Segmentation
Guardicore - Shrink Your Attack Surface with Micro-Segmentation
 
Securityawareness
SecurityawarenessSecurityawareness
Securityawareness
 
Building An Information Security Awareness Program
Building An Information Security Awareness ProgramBuilding An Information Security Awareness Program
Building An Information Security Awareness Program
 
Information security
Information securityInformation security
Information security
 
Microsoft Zero Trust
Microsoft Zero TrustMicrosoft Zero Trust
Microsoft Zero Trust
 
Cyber Security and the CEO
Cyber Security and the CEOCyber Security and the CEO
Cyber Security and the CEO
 
Cybersecurity Awareness
Cybersecurity AwarenessCybersecurity Awareness
Cybersecurity Awareness
 
MITRE ATT&CK framework
MITRE ATT&CK frameworkMITRE ATT&CK framework
MITRE ATT&CK framework
 

Similaire à An overview of methods for data anonymization

Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion Alket Cecaj
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CSThanveen
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseVaticle
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!Khalid Salama
 
A Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information PrivacyA Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information PrivacyMicah Altman
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.pptbommaiah
 
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...Kato Mivule
 
Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning Shalin Hai-Jew
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolarx-deidentifier
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningNikolay Karelin
 
Graph Models for Deep Learning
Graph Models for Deep LearningGraph Models for Deep Learning
Graph Models for Deep LearningExperfy
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMsSri Ambati
 
Characterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science ResearchCharacterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science ResearchMicah Altman
 
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...IT Network marcus evans
 

Similaire à An overview of methods for data anonymization (20)

Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion Presentation of PhD thesis on Location Data Fusion
Presentation of PhD thesis on Location Data Fusion
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Automating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge BaseAutomating Data Science over a Human Genomics Knowledge Base
Automating Data Science over a Human Genomics Knowledge Base
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 
A Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information PrivacyA Lifecycle Approach to Information Privacy
A Lifecycle Approach to Information Privacy
 
Introduction
IntroductionIntroduction
Introduction
 
Introduction.ppt
Introduction.pptIntroduction.ppt
Introduction.ppt
 
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
Literature Review: The Role of Signal Processing in Meeting Privacy Challenge...
 
Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning Creating Effective Data Visualizations for Online Learning
Creating Effective Data Visualizations for Online Learning
 
Engineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization toolEngineering data privacy - The ARX data anonymization tool
Engineering data privacy - The ARX data anonymization tool
 
Data mining
Data miningData mining
Data mining
 
Main principles of Data Science and Machine Learning
Main principles of Data Science and Machine LearningMain principles of Data Science and Machine Learning
Main principles of Data Science and Machine Learning
 
Graph Models for Deep Learning
Graph Models for Deep LearningGraph Models for Deep Learning
Graph Models for Deep Learning
 
Project 0th Review
Project 0th ReviewProject 0th Review
Project 0th Review
 
Risk Management for LLMs
Risk Management for LLMsRisk Management for LLMs
Risk Management for LLMs
 
Characterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science ResearchCharacterizing Data and Software for Social Science Research
Characterizing Data and Software for Social Science Research
 
Data Mining 101
Data Mining 101Data Mining 101
Data Mining 101
 
The Genopolis Microarray database
The Genopolis Microarray databaseThe Genopolis Microarray database
The Genopolis Microarray database
 
DATAIA & TransAlgo
DATAIA & TransAlgoDATAIA & TransAlgo
DATAIA & TransAlgo
 
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
Bigger and Better: Employing a Holistic Strategy for Big Data toward a Strong...
 

Plus de arx-deidentifier

An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identificationarx-deidentifier
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...arx-deidentifier
 
Anonymisierung und Risikomanagement mit ARX
Anonymisierung und Risikomanagement mit ARXAnonymisierung und Risikomanagement mit ARX
Anonymisierung und Risikomanagement mit ARXarx-deidentifier
 
An experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsAn experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsarx-deidentifier
 
ARX - A Generic Method for Assessing the Quality of De-Identified Health Data
ARX - A Generic Method for Assessing the Quality of De-Identified Health DataARX - A Generic Method for Assessing the Quality of De-Identified Health Data
ARX - A Generic Method for Assessing the Quality of De-Identified Health Dataarx-deidentifier
 
ARX - a comprehensive tool for anonymizing / de-identifying biomedical data
ARX - a comprehensive tool for anonymizing / de-identifying biomedical dataARX - a comprehensive tool for anonymizing / de-identifying biomedical data
ARX - a comprehensive tool for anonymizing / de-identifying biomedical dataarx-deidentifier
 

Plus de arx-deidentifier (6)

An Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-IdentificationAn Open Source Tool for Game Theoretic Health Data De-Identification
An Open Source Tool for Game Theoretic Health Data De-Identification
 
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
A Tool for Optimizing De-Identified Health Data for Use in Statistical Classi...
 
Anonymisierung und Risikomanagement mit ARX
Anonymisierung und Risikomanagement mit ARXAnonymisierung und Risikomanagement mit ARX
Anonymisierung und Risikomanagement mit ARX
 
An experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithmsAn experimental comparison of globally-optimal data de-identification algorithms
An experimental comparison of globally-optimal data de-identification algorithms
 
ARX - A Generic Method for Assessing the Quality of De-Identified Health Data
ARX - A Generic Method for Assessing the Quality of De-Identified Health DataARX - A Generic Method for Assessing the Quality of De-Identified Health Data
ARX - A Generic Method for Assessing the Quality of De-Identified Health Data
 
ARX - a comprehensive tool for anonymizing / de-identifying biomedical data
ARX - a comprehensive tool for anonymizing / de-identifying biomedical dataARX - a comprehensive tool for anonymizing / de-identifying biomedical data
ARX - a comprehensive tool for anonymizing / de-identifying biomedical data
 

Dernier

COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxFarihaAbdulRasheed
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsSumit Kumar yadav
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsNurulAfiqah307317
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000Sapana Sha
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptxAlMamun560346
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRDelhi Call girls
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 

Dernier (20)

COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCRStunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
Stunning ➥8448380779▻ Call Girls In Panchshil Enclave Delhi NCR
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 

An overview of methods for data anonymization

  • 1. Technische Universität München Fabian Prasser Lehrstuhl für Medizinische Informatik Institut für Medizinische Statistik und Epidemiologie Klinikum rechts der Isar der TU München An overview of methods for data anonymization
  • 2. Technische Universität München Three types of de-identification / anonymization • Masking identifiers in unstructured data • Subject: clinical notes, … • Methods: machine learning, regular expressions, … • Implementations: MIST, MITdeid, NLM Scrubber • Privacy preserving data analysis (interactive scenario) • Subject: query results, … • Methods: interactive differential privacy, query-set-size control, … • Implementations: AirCloak, Airavat, Fuzz, PINQ, HIDE • Transforming structured data (non-interactive scenario) • Subject: tabular data, … • Methods: generalization, suppression, randomization, ... • Implementations: AnonTool, ARX, sdcMicro, μArgus, PARAT 219.03.2015 An overview of methods for data anonymization
  • 3. Technische Universität München Multiple aspects have to be balanced • Main goal: Achieve a balance between data utility and privacy • Complex task • Many different types of methods need to be applied in an integrated manner • Methods may need to be parameterized • Different aspects are interrelated • Just the most important aspects and relationships 319.03.2015 Use cases Types of data Privacy models Transformation models Utility measures Algorithms Risk models An overview of methods for data anonymization
  • 4. Technische Universität München Important aspects of use cases • Who or what will process the data in which way? • Humans, e.g., epidemiologists • Different types of analyses • Interactive vs. non-interactive • Machines, i.e., data mining • Classification vs. clustering • How will the data be released? • Access control • Open access vs. restricted access • Continous data publishing • Multiple views vs. re-release (incremental vs. new attributes) • Is the data distributed? • Collaborative environments • Vertical vs. horizontal vs. hybrid distribution 419.03.2015 [FWF11] [FWF11] [FWF11] An overview of methods for data anonymization
  • 5. Technische Universität München Important properties of data • Relational data • Tabular data • One row per individual • Transactional data • Data consisting of set-valued attributes • Example: Follow-up collection of diagnosis codes • Data with relational and transactional characteristics • Dimensionality of data • Mitigating re-identification is practically infeasible for high-dimensional data • Data with clusters • Example: household structures • Other types of data: Trajectory data, social network data 519.03.2015 [Agg05] An overview of methods for data anonymization
  • 6. Technische Universität München Privacy models: some background • Definition of (perfect) privacy • “Anything that can be learned about a respondent from a statistical database should be learnable without access to the database” • Syntactic models • Syntactic conditions on the released datasets • No (direct) semantic implications regarding the above definition • Instead: Assumptions about attack vectors and definition of (likely) background knowledge and goals by classifying attributes • Direct and indirect identifiers (or quasi-identifiers, or keys) • Sensitive and insensitive attributes • Semantic models • Privacy models that relax a formalization of the above definition • Much fewer assumptions need to be made about attackers 619.03.2015 formulated by [Dwork08] [Swe02] [Dal77] An overview of methods for data anonymization
  • 7. Technische Universität München Risk and threat models • Disclosure models • Identity disclosure (re-identification, tuple linkage) • Attribute disclosure (sensitive information disclosure) • Membership disclosure (table linkage) • Models for quantifying re-identification risks • Super-population models: Population is modeled with probability distributions parameterized with sample characteristics • Decision rule by Dankar et al.: Combination of three models, which has been evaluated for biomedical datasets • Attacker models: May be used to derive/compile global risks • Prosecutor scenario: Targets one specific individual • Marketer scenario: Targets as many individuals as possible • Journalist scenario: Targets any individual 719.03.2015 [Emam13] [DEN+12] [LLZ+12] An overview of methods for data anonymization
  • 8. Technische Universität München Syntactic models against re-identification • Goal: Prevent linkage attacks on quasi-identifiers • Some models for relational data • k-Anonymity: Requires groups (cells or equivalence classes) of size ≥ k, which defines an upper bound on the re-identification risk (over-) estimated with sample frequencies • LKC-Privacy: Relaxed variant of k-anonymity + (ℓ-diversity) • Risk-based approaches: Enforce thresholds on re-identification risks, which may be quantified with super-population models • HIPAA Safe Harbor: Heuristic with many predefined identifiers and a few quasi-identifiers (regions and all kinds of dates). Contains wildcards („any other unique identifying number, characteristic, or code“). Provides sound legal protection for custodians in the US • Some models for transactional data • (km )-Anonymity: k-Anonymity regarding ≤ m values from a set 819.03.2015 [Swe02] [MFH+09] [HIP] [TMK08] An overview of methods for data anonymization
  • 9. Technische Universität München Syntactic models against attribute disclosure • Observation: Preventing linkage attacks is not enough • Goal: Prevent knowledge gain from sensitive information associated with an equivalence class • Some models for relational data • ℓ-Diversity: Sensitive values must be „well-represented”. Multiple variants exist with different privacy/utility trade-offs • t-Closeness: Distribution of sensitive values must not be „too different“ from the overall dataset. Multiple variants exist • p-Sensitive k-anonymity: Focus on identity & attribute disclosure • LKC Privacy: ℓ-Diversity & relaxed k-anonymity • Some models for transactional data • (h, k, p)-coherence: km -anonymity + protection against inference • ρ-Uncertainty: protection against inference with fewer assumptions 919.03.2015 [MFH+09] [MGK+06] [LLV07] [TV06] [CKR+10] [XWF+08] An overview of methods for data anonymization
  • 10. Technische Universität München Further syntactic models • Models against membership disclosure for relational data • Goal: Bounds on the certainty with which the presence of data about an individual in a database can be inferred via linkage • Upside: With strict thresholds, they provide semantic privacy • Downside: Basically impossible to achieve • δ-Presence: Relates sample counts to population counts • c-Confident δ-presence: Relaxation of δ-presence in which population characteristics are estimated • Models for data which is relational and transactional • (k, km )-Anonymity: Mixture of k-anonymity and km -anonymity • Models for continuous publishing of relational data • Approach by Byun et al.: Only supports insertions • m-Invariance: Supports insertions, deletions, updates 1019.03.2015 [NAC07] [NC10] [TMK08] [XT07] An overview of methods for data anonymization
  • 11. Technische Universität München A semantic model: Differential Privacy • Observation • The formal notion of privacy is impossible to achieve • Even for individuals that are not part of the statistical database • Idea • Do not compare an attacker's information about an individual before and after accessing a statistical database, but • Compare the risks for an individual when joining (or leaving) a statistical database • (Slightly) more formal • -Differential Privacy:ϵ A (randomized) function fulfills -DP if theϵ probability of every possible output value changes by a factor of at most exp( ) when data about an individual is or is not contained inϵ a database. • Relaxations: ( ,ϵ δ)-DP, approximate DP 1119.03.2015 [Dwork 06, Dwork08] [Dwork06] [Dwork 06, Dwork08] [PK08, LM12] An overview of methods for data anonymization
  • 12. Technische Universität München A semantic model: Differential Privacy (cont'd) • DP in interactive scenarios • Sequential composition rule • Privacy budget • DP in non-interactive scenarios • Release of contingency tables or marginals • Relationships to syntactical models exist, e.g., • (k, β)-SDGS: Random sampling + k-anonymity fulfills ( , δ)-DPϵ • t-Closeness (with a specific distance function): Implies -DP regarding the sensitive attributesϵ • Has been criticized in the context of biomedical research • DP is often not a truthful mechanism: Functions are randomized, often data is pertubated, e.g., by adding noise • DP is not intuitive: What is a good value for ? What does itϵ mean? 1219.03.2015 [FDE13] [DS15] [LQS11] [BCD+07][BCD+07] [FDE13] An overview of methods for data anonymization
  • 13. Technische Universität München Measuring data utility • Often used interchangeably with “loss of information” • Exemplary utility measures for syntactic models • Used for evaluating transformed datasets • Discernibility: Based on sizes of equivalence classes • Average equivalence class size: Analogously to discernibility • (Non-uniform) entropy: Information theoretic measure • Loss: Measures the coverage of the domain of attributes • Utility constraints: Use cases are modeled as queries • Exemplary utility measures for Differential Privacy • Used for evaluating a method that fulfills DP • Error: Absolute, relative, variance • (α, δ)-Usefulness: P[distance ≤ α] ≥ δ 1319.03.2015 [BA05] [LDD+05] [GT09] [LGM10] [VI02] [FDE13] An overview of methods for data anonymization
  • 14. Technische Universität München Transformation methods • Coding models • Global recoding: Similar transformation for similar values • Local recoding: Different transformations may be applied • Truthful transformations • Generalization: Based on domain generalization hierarchies • Full-domain generalization: All values of an attribute are generalized to the same level • Subtree generalization: Different levels of generalization may be applied • Suppression: Removal of values of cells or complete tuples • Top & bottom coding: Replacing values that exceed given bounds 1419.03.2015 [FWF11] [FWF11] An overview of methods for data anonymization
  • 15. Technische Universität München Transformation methods (cont'd) • Non-truthful transformations (Pertubation) • Post-randomization: Randomly change categories of a categorical variable according to predefined probabilities • Value distortion: Multiplicative or additive noise • Numerical rank swapping: Randomly swap values with other values with a rank that does not differ by more than a predefined threshold • Microaggregation: Aggregate values in one group • Replacing values: Distribution sample or distribution itself • Methods on a structural level • Random sampling: Randomly select a set of tuples • Slicing: Partition the data horizontally and vertically and creates links between partitions 1519.03.2015 [FWF11] [LLZ+12] [LQS11] An overview of methods for data anonymization
  • 16. Technische Universität München Algorithms • Transform data to meet privacy models • Given transformation methods, data properties etc. • Randomized algorithms • Randomized functions for Differential Privacy • Genetic search • Search algorithms • Optimal algorithms: Flash, Incognito, OLA • Heuristic algorithms: Top-Down-Specialization • Clustering algorithms: Iteratively merge groups • Data (focus on tuples): Method by Tassa et al. • Space (focus on taxonomies): For transactional data • Partitioning algorithms: Iteratively split groups • Data: Mondrian 1619.03.2015 [Dwork08a] [EDI+09, LDR05, KPE+12] [FWY05] [GT10] [Iye02] [GLS14] [LG13] [LDR06] An overview of methods for data anonymization
  • 17. Technische Universität München Thank you for your attention! 1719.03.2015 An overview of methods for data anonymization
  • 18. Technische Universität München Further Readings • Fung CMB, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. Chapman & Hall/CRC, ISBN: 1420091484, 2011. • Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from electronic health records while preserving privacy: A survey of algorithms. J Biomed Inform, Vol. 50, p. 4-19, 2014 • Dankar FK, El Emam K. Practicing Differential Privacy in Health Care: A Review. Trans. Data Privacy, Vol. 6:1, 2013. • Gkoulalas-Divanis A, Loukides G. Anonymization of Electronic Medical Records to Support Clinical Analysis. Springer. ISBN: 978-1-4614-5668-1, 2013 • El Emam K. Guide to the De-Identification of Personal Health Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013 1819.03.2015 An overview of methods for data anonymization
  • 19. Technische Universität München References[Agg05] Aggarwal CC. On k-anonymity and the curse of dimensionality. PVLDB, p. 901–909, 2005. [BA05] Bayardo RJ, Agrawal R. Data Privacy through Optimal k-Anonymization. ICDE. pp. 217–228, 2005. [BCD+07] Barak B, Chaudhuri K, Dwork C, Kale S, McSherry F et al. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. PODS, p. 273–282, 2007. [CKR+10] Cao J, Karras P, Raïssi C, Tan K. rho-uncertainty: inference-proof transaction anonymization. PVLDB;3(1):1033–44. 2010. [Dal77] Dalenius T. Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 5, p. 429–444, 1977. [DEN+12] Dankar F, El Emam K, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Medical Informatics and Decision Making, 12:66, 2012. [DS15] Domingo-Ferrer J, Soria-Comas J. From t-closeness to differential privacy and vice versa in data anonymization. Knowl.-Based Syst., 74:151–158, 2015. [Dwork06] Dwork C. Differential Privacy. ICALP; 4052, pp. 1–12. Springer, 2006. [Dwork08] Dwork C. An Ad Omnia Approach to Defining and Achieving Private Data Analysis. PinKDD, p. 1-13, 2008. [Dwork08a] Dwork C. Differential privacy: A survey of results. TAMC. pp. 1–19, 2008. [EDI+09] El Emam K, Dankar F, Issa R, Jonker E et al. A globally optimal k-anonymity method for the de-identification of health data. JAMIA, 16(5), pp. 670–682, 2009. [Emam13] El Emam K. Guide to the De-Identification of Personal Health Information. Auerbach/CRC, ISBN 978-1-4665-7906-4, 2013. [FDE13] Fida K, Dankar F, El Emam K. Practicing differential privacy in health care: A review. Trans. Data Privacy, 6(1):35–67, 2013. [FWF11] Fung CMB, Wang K, Fu A, Yu P. Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques. Chapman & Hall/CRC, ISBN: 1420091484, 2011. [FWY05] Fung BCM, Wang K, Yu PS. Top-Down Specialization for Information and Privacy Preservation. ICDE, 2005, pp. 205–216. [GKK07] Ghinita G, Karras P, Kalnis P, Mamoulis N. Fast data anonymization with low information loss. VLDB, 2007, pp. 758–769. [GT09] Gionis A, Tassa T. k-Anonymization with Minimal Loss of Information. IEEE Trans Knowl Data Eng, pp. 206–219, 2009. [GT10] Goldberger J, Tassa T. Efficient anonymizations with enhanced utility. Transactions on data privacy, vol. 3, pp. 149–175, 2010. [GLS14] Gkoulalas-Divanis A, Loukides G, Sun J. Publishing data from electronic health records while preserving privacy: A survey of algorithms. JBI, Vol. 50, p. 4-19, 2014 [HIP] U.S. Department of Health and Human Services Office for Civil Rights. HIPAA administrative simplification regulation text; 2006. [Iye02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, 2002. [KPE+12] Kohlmayer F, Prasser F, Eckert C, Kemper A, Kuhn KA. Flash: Efficient, Stable and Optimal K-Anonymity. PASSAT., pp. 708–717, Sep. 2012. [LDD+05] LeFevre K, DeWitt DJ, Ramakrishnan R. Multidimensional K-Anonymity (TR-1521). Tech. rep. University of Wisconsin, 2005. [LDR05] LeFevre K, DeWitt D, Ramakrishnan R. Incognito: Efficient full-domain k-anonymity. SIGMOD. 2005, pp. 49–60, 2005. [LDR06] LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian Multidimensional K-Anonymity. ICDE, 2006. [LG13] Loukides G, Gkoulalas-Divanis A. Utility-aware anonymization of diagnosis codes. J Biomed Health Inform. 2013;17(1):60–70. [LGM10] Loukides G, Gkoulalas-Divanis A, Malin B, COAT: COnstraint-based anonymization of transactions. Knowl Inf Syst, 28:2, p. 251–282, 2010. [LLV07] Li N, Li T, Venkatasubramanian S. t-Closeness: privacy beyond k-anonymity and l-diversity. ICDE; p. 106–15. 2007. [LLZ+12] Li T, Li N, Zhang J, Molloy I. Slicing: A new approach for privacy preserving data publishing. IEEE Trans Knowl Data Eng, 24(3):561–574, 2012 [LM12] Li C, Miklau G. An adaptive mechanism for accurate query answering under differential privacy. PVLDB, 5(6):514–525, 2012. [LQS11] Li N, Qardaji WH, Su D. Provably private data anonymization: Or, k-anonymity meets differential privacy. CoRR, abs/1101.2604, 2011. [MFH+09] Mohammed N, Fung BCM, Hung PCK, and Lee CK. Anonymizing healthcare data: A case study on the blood transfusion service. SIGKDD, p. 1285–1294, 2009. [MGK+06] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-Diversity: privacy beyond k-anonymity. ICDE; pp. 24, 2006. [NAC07] Nergiz ME, Atzori M, Clifton C. Hiding the presence of individuals from shared databases. SIGMOD; p. 665–676, 2007. [NC10] Nergiz ME, Clifton C. d-presence without complete world knowledge. IEEE Trans Knowl Data Eng; 22(6):868–83, 2010. [PK08] Prasad S, Kasiviswanathan AS. A note on differential privacy: Defining resistance to arbitrary side information. CoRR abs/0803.3946, 2008. [Swe02] Sweeney L. k-anonymity: a model for protecting privacy. IJUFKS;10:557–70, 2002. [TMK08] Terrovitis M, Mamoulis N, Kalnis P. Privacy-preserving anonymization of set valued data. PVLDB;1(1):115–25, 2008. [TMK08] Terrovitis M, Mamoulis N, Kalnis P. Privacy-preserving anonymization of setvalued data. PVLDB;1(1):115–25, 2008. [TV06] Truta TM, Vinay B. Privacy protection: p-sensitive k-anonymity property. ICDE workshops; p. 94, 2006. [VI02] Iyengar VS. Transforming data to satisfy privacy constraints. SIGKDD, pp. 279–288, 2002. [XT07] Xiao X, Tao Y. m-invariance: Towards privacy preserving republication of dynamic datasets. SIGMOD, 2007. [XWF+08] Xu Y, Wang K, Fu AW-C, Yu PS. Anonymizing transaction databases for publication. KDD; p. 767–75, 2008. 1919.03.2015 An overview of methods for data anonymization