Privacy Protection in the Big Data Era:
Challenges and Opportunities
Dr. Zhenjie Zhang (张振杰)
Advanced Digital Sciences Center, University of Illinois
Outline
• Why privacy?
• Privacy Attack Examples
• Old models and limitations
– K-anonymity
– L-diversity

• Differential Privacy
Publishing sensitive data about
individuals.

• Medical research

– What treatments have the best outcomes?
– How can we recognize the onset of disease earlier?
– Are certain drugs better for certain phenotypes?

• Web search
– What are people really looking for when they search?
– How can we give them the most authoritative answers?

• Public health
– Where are our outbreaks of unpleasant diseases?
– What behavior patterns or patient characteristics are correlated
with these diseases?
Today, access to these data sets is usually strictly controlled.
Only available:
• Inside the company/agency that collected the data
• Or after signing a legal contract
  – Click streams, taxi data
• Or in very coarse-grained summaries
  – Public health
• Or after a very long wait
  – US Census data details
• Or with definite privacy issues
  – US Census reports, the AOL click stream, old NIH dbGaP summary tables, Enron email
• Or with IRB (Institutional Review Board) approval
  – dbGaP summary tables
Society would benefit if we could publish some useful form of the data, without having to worry about privacy.
Why is access so strictly controlled?
No one should learn who had which disease.
(Figure: a table of individual patient records, i.e., “microdata”.)
What if we “de-identify” the records by removing names, and then publish the result?
We can re-identify people, absolutely or probabilistically, by joining the quasi-identifier (QI) attributes of the published table with a voter registration list (the attacker's “background knowledge”).
87% of Americans can be uniquely identified by {zip code, gender, date of birth} (actually 63%, per [Golle 06]).
Latanya Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] used this approach to re-identify the medical record of an ex-governor of Massachusetts.
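To make the linking attack concrete, here is a minimal sketch (hypothetical records and column names, pandas assumed) of joining a “de-identified” medical table to a public voter list on the quasi-identifiers {zip code, gender, date of birth}:

```python
# Sketch of a linkage (re-identification) attack via quasi-identifiers.
# Hypothetical records and column names; assumes pandas is installed.
import pandas as pd

# "De-identified" microdata: names removed, QI attributes kept.
medical = pd.DataFrame({
    "zip":     ["47677", "47602", "47678"],
    "gender":  ["F", "M", "F"],
    "dob":     ["1965-07-22", "1953-02-14", "1942-11-01"],
    "disease": ["heart disease", "flu", "cancer"],
})

# Public voter registration list: the attacker's background knowledge.
voters = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carol"],
    "zip":    ["47677", "47602", "47678"],
    "gender": ["F", "M", "F"],
    "dob":    ["1965-07-22", "1953-02-14", "1942-11-01"],
})

# Join on {zip, gender, dob}: any combination that is unique in both tables
# re-identifies the "anonymous" record, attaching a name to a disease.
linked = medical.merge(voters, on=["zip", "gender", "dob"])
print(linked[["name", "disease"]])
```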
Outline
• Why privacy?
• Privacy Attack Examples
• Old models and limitations
– K-anonymity
– L-diversity

• Differential Privacy
Real query logs can be very useful to CS researchers.
But click history can uniquely identify a person.
<AnonID, Query, QueryTime, ItemRank,
domain name clicked>
What the New York Times did:
– Find all log entries for AOL user 4417749
– Multiple queries for businesses and
services in Lilburn, GA (population 11K)
– Several queries for Jarrett Arnold
• Lilburn has 14 people with the last name
Arnold

– NYT contacts them, finds out AOL User
4417749 is Thelma Arnold
Just because data looks hard to re-identify doesn’t mean it is.
[Narayanan and Shmatikov, Oakland 08]

In 2006, the Netflix movie rental service offered a $1,000,000 prize for improving its movie recommendation service.
Example (one customer's ratings):

Customer      High School Musical 1   High School Musical 2   High School Musical 3   Twilight
Customer #1   4                       5                       5                       ?
Training data: ~100M ratings of 18K movies from ~500K randomly
selected customers, plus dates
Only 10% of their data; slightly perturbed
We can re-identify a Netflix rater if we
know just a little bit about her
• 8 movie ratings (≤ 2 wrong, dates ±2 weeks) → re-identify 99% of raters
• 2 ratings, ±3 days → re-identify 68% of raters
  – Relatively few candidates for the other 32% (especially with movies outside the top 100)
• Even a handful of IMDB comments allows Netflix re-identification, in many cases
  – 50 IMDB users → re-identify 2 with very high probability, one from ratings, one from dates
The Netflix attack works because the data
are sparse and dissimilar, with a long tail.
Considering just movies rated,
for 90% of records there isn’t
a single other record that is
more than 30% similar
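A minimal sketch of the matching idea (hypothetical data; this is not the exact Narayanan-Shmatikov scoring function): score the attacker's few noisy auxiliary ratings against every published record and take the best-scoring one.

```python
# Sketch of similarity-based de-anonymization of sparse rating data.
# Hypothetical records; not the exact Narayanan-Shmatikov scoring function.

# Published pseudonymous records: {movie_id: (rating, day_of_rating)}.
records = {
    "user_0001": {"m17": (4, 120), "m42": (5, 121), "m99": (5, 300)},
    "user_0002": {"m17": (2, 10),  "m55": (3, 11)},
}

# Attacker's auxiliary knowledge about the target (e.g., public IMDB comments):
# a couple of approximate ratings with approximate dates.
aux = {"m17": (4, 118), "m42": (5, 123)}

def score(record, aux, rating_tol=1, date_tol=14):
    """Count how many auxiliary items the record approximately matches."""
    hits = 0
    for movie, (rating, day) in aux.items():
        if movie in record:
            r, d = record[movie]
            if abs(r - rating) <= rating_tol and abs(d - day) <= date_tol:
                hits += 1
    return hits

scores = {uid: score(rec, aux) for uid, rec in records.items()}
best = max(scores, key=scores.get)
# Because the data are sparse and records are dissimilar, the best match
# usually stands far above the runner-up; here user_0001 matches both items.
print(best, scores)
```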

Why should we care about this innocuous
data set?
• All movie ratings → political and religious opinions, sexual orientation, …
• Everything bought in a store → private life details
• Every doctor visit → private life details
“One customer … sued Netflix, saying she thought
her rental history could reveal that she was a
lesbian before she was ready to tell everyone.”
It is becoming routine for medical studies to
include a genetic component.
Genome-wide association studies (GWAS) aim to identify correlations between a disease, e.g., diabetes, and patients' DNA, by comparing people with and without the disease.
GWAS papers usually include detailed correlation statistics.
Our attack: uncover the identities of the patients in a GWAS.
– For studies of up to moderate size, we can determine, for a significant fraction of people, whether a specific person has participated in a particular study within 10 seconds, with high confidence!

Example study: “A genome-wide association study identifies novel risk loci for type 2 diabetes,” Nature 445, 881-885 (22 February 2007).
GWAS papers usually include detailed correlation statistics.
• Publish: linkage disequilibrium (LD) between SNP pairs, e.g., SNPs 2 and 3 are linked, and so are SNPs 4 and 5.
• Publish: p-values of SNP-disease pairs, e.g., SNPs 1, 3, and 4 are associated with diabetes.
(Figure: a stretch of human DNA with SNP1 through SNP5 and their associations with diabetes.)
Privacy attacks can use SNP-disease
association.
Idea [Homer et al., PLoS Genet. ’08; Jacobs et al., Nature ’09]:
– Obtain aggregate SNP info from the published p-values (1)
– Obtain a sample of the target individual's DNA (2)
– Obtain the aggregate SNP info of a reference population (3)
– Compare (1), (2), and (3)

(Figure: per-SNP values for SNP1 through SNP5 in three tables: the aggregate DNA of patients in the study, the DNA of one individual (the background knowledge), and the aggregate DNA of a reference population.)
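A toy sketch of the comparison step (hypothetical frequencies; real attacks use thousands of SNPs and a formal test statistic):

```python
# Sketch of a Homer-style membership test: is the target's genotype closer to
# the study's aggregate allele frequencies than to a reference population's?
# Hypothetical numbers; real attacks use thousands of SNPs and a formal
# statistical test.

study_freq = [0.1, 0.2, 0.3, 0.8, 0.4]   # aggregate DNA of patients in a study
ref_freq   = [0.7, 0.6, 0.9, 0.3, 0.1]   # aggregate DNA of a reference population
target     = [0.0, 0.5, 0.0, 1.0, 0.5]   # target's genotype (0, 0.5, or 1 per SNP)

# Per-SNP distance to the study minus distance to the reference, summed:
# a clearly negative value suggests the target was in the study.
stat = sum(abs(t - s) - abs(t - r)
           for t, s, r in zip(target, study_freq, ref_freq))
print("membership statistic:", stat)
print("likely in the study" if stat < 0 else "likely not in the study")
```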
Privacy attacks can use both SNP-disease and SNP-SNP associations.
Idea [Wang et al., CCS ’09]:
– Model patients' SNPs as a matrix of unknowns
– Obtain column sums from the published p-values
– Obtain pairwise column dot-products from the published LDs
– Solve for the matrix using integer programming
Each SNP value can only be 0 or 1 (with a dominance model).
Patient   SNP1   SNP2   SNP3   SNP4   SNP5
1         x11    x12    x13    x14    x15
2         x21    x22    x23    x24    x25
3         x31    x32    x33    x34    x35

Example constraints: a column sum such as x11 + x21 + x31 = 2, and a pairwise dot-product such as x13·x14 + x23·x24 + x33·x34 = 1.

A successful attack reveals the DNA of all patients!
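A toy sketch of this idea (all published values below are hypothetical; brute-force search stands in for the integer-programming solver used in the real attack):

```python
# Toy sketch of recovering 0/1 SNP values from published aggregate statistics,
# in the spirit of Wang et al.: column sums (from p-values) and pairwise
# column dot-products (from LD values). Brute force over a tiny 3x5 matrix
# stands in for the integer-programming solver used in the real attack.
from itertools import product

N_PATIENTS, N_SNPS = 3, 5
col_sums = [2, 1, 1, 2, 0]        # hypothetical published column sums
dot_products = {(2, 3): 1}        # hypothetical published LD: column 2 . column 3

def consistent(matrix):
    for j in range(N_SNPS):
        if sum(row[j] for row in matrix) != col_sums[j]:
            return False
    for (a, b), v in dot_products.items():
        if sum(row[a] * row[b] for row in matrix) != v:
            return False
    return True

solutions = [m for m in product(product((0, 1), repeat=N_SNPS), repeat=N_PATIENTS)
             if consistent(m)]
# With enough published statistics the consistent assignment is often unique,
# i.e., every patient's SNP vector is revealed.
print(len(solutions), "consistent assignments")
```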
Outline
• Why privacy?
• Privacy Attack Examples
• Old models and limitations
– K-anonymity
– L-diversity

• Differential Privacy
Issues
• Privacy principle: What is adequate privacy protection?
• Distortion approach: How can we achieve the privacy principle while maximizing the utility of the data?
Different applications may have
different privacy protection needs.
Membership disclosure: Attacker cannot tell that a given
person is/was in the data set (e.g., a set of AIDS patient
records or the summary data from a data set like dbGaP).
– δ-presence [Nergiz et al., 2007].
– Differential privacy [Dwork, 2007].

Sensitive attribute disclosure: Attacker cannot tell that a given person has a certain sensitive attribute.
– l-diversity [Machanavajjhala et al., 2006].
– t-closeness [Li et al., 2007].

Identity disclosure: Attacker cannot tell which record
corresponds to a given person.
– k-anonymity [Sweeney, 2002].
Privacy principle 1: k-anonymity
Your quasi-identifiers are indistinguishable from those of at least k-1 other people (i.e., every QI group has size ≥ k).
[Sweeney, Int’l J. on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]

(Figure: a 2-anonymous generalization with 4 QI groups, shown next to a voter registration list; the table contains generalized QI attributes and a sensitive attribute.)
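To make the definition concrete, here is a minimal sketch (hypothetical, already-generalized records) that checks whether a released table is k-anonymous by counting the size of each QI group:

```python
# Sketch: check k-anonymity by counting the size of each quasi-identifier group.
# Hypothetical records with already-generalized QI attributes.
from collections import Counter

released = [
    {"zip": "476**", "age": "2*",   "disease": "heart disease"},
    {"zip": "476**", "age": "2*",   "disease": "heart disease"},
    {"zip": "4790*", "age": ">=40", "disease": "flu"},
    {"zip": "4790*", "age": ">=40", "disease": "cancer"},
]

QI = ("zip", "age")

def is_k_anonymous(records, k):
    """Every combination of QI values must appear in at least k records."""
    group_sizes = Counter(tuple(r[a] for a in QI) for r in records)
    return all(size >= k for size in group_sizes.values())

print(is_k_anonymous(released, 2))   # True: both QI groups have 2 records
print(is_k_anonymous(released, 3))   # False
```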
The biggest advantage of k-anonymity is that
people can understand it.

And often it can be computed fast.
But in general, it is easy to attack.
k-anonymity ... or how not to define privacy. [Shmatikov]
• Does not say anything about the computations to be
done on the data (utility).
• Assumes that the attacker will be able to join only on quasi-identifiers.

Intuitive reasoning:
• k-anonymity prevents attacker from telling which record
corresponds to which person.
• Therefore, attacker cannot tell that a certain person has
a particular value of a sensitive attribute.

This reasoning is fallacious!
k-anonymity does not provide privacy if the sensitive values in an
equivalence class lack diversity, or the attacker has certain background
knowledge.
From a voter registration list:
  Bob:  Zipcode 47678, Age 27
  Carl: Zipcode 47673, Age 36

A 3-anonymous patient table:

Zipcode   Age    Disease
476**     2*     Heart Disease
476**     2*     Heart Disease
476**     2*     Heart Disease
4790*     ≥40    Flu
4790*     ≥40    Heart Disease
4790*     ≥40    Cancer
476**     3*     Heart Disease
476**     3*     Cancer
476**     3*     Cancer

Homogeneity attack: Bob falls in the first QI group, where every record has Heart Disease, so the attacker learns Bob has heart disease.
Background knowledge attack: Carl falls in the last QI group; an attacker who knows Carl is unlikely to have heart disease concludes that he has cancer.
Principle 2: l-diversity

[Machanavajjhala et al., ICDE, 2006]

Each QI group should have at least l “well-represented” sensitive values.
Maybe each QI-group must have l different
sensitive values?
A 2-diverse table:

Age       Sex   Zipcode           Disease
[1, 5]    M     [10001, 15000]    gastric ulcer
[1, 5]    M     [10001, 15000]    dyspepsia
[6, 10]   M     [15001, 20000]    pneumonia
[6, 10]   M     [15001, 20000]    bronchitis
[11, 20]  F     [20001, 25000]    flu
[11, 20]  F     [20001, 25000]    pneumonia
[21, 60]  F     [30001, 60000]    gastritis
[21, 60]  F     [30001, 60000]    gastritis
[21, 60]  F     [30001, 60000]    flu
[21, 60]  F     [30001, 60000]    flu
We can attack this probabilistically.
If we know Joe's QI group, what is the probability he has HIV?
(Figure: a QI group with 100 tuples, 98 of which have HIV, so the answer is 98%.)
The conclusion researchers drew: the most frequent sensitive value in a QI group cannot be too frequent.
Even then, we can still attack using background knowledge.
(Figure: a QI group with 100 tuples: 50 have HIV, 49 have pneumonia, and Joe has HIV.)
Sally knows Joe does not have pneumonia, so Sally can guess that Joe has HIV (about 50 of the remaining 51 tuples).
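A small sketch of the attacker's arithmetic in both examples, using the group compositions assumed above:

```python
# Sketch of the attacker's arithmetic in the two examples above,
# assuming the group compositions shown there.

# Probabilistic attack: 98 of the 100 tuples in Joe's QI group have HIV.
p_hiv = 98 / 100
print(f"P(Joe has HIV | QI group) = {p_hiv:.0%}")

# Background-knowledge attack: 50 HIV, 49 pneumonia, 1 other; Sally knows Joe
# does not have pneumonia, so she conditions on the remaining 51 tuples.
p_hiv_given_not_pneumonia = 50 / (100 - 49)
print(f"P(Joe has HIV | not pneumonia) = {p_hiv_given_not_pneumonia:.0%}")
```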
There are probably 100 other related
proposed privacy principles…
• k-gather, (a, k)-anonymity, personalized anonymity, positive disclosure-recursive (c, l)-diversity, non-positive-disclosure (c1, c2, l)-diversity, m-invariance, (c, t)-isolation, …
And for other data models, e.g., graphs:
• k-degree anonymity, k-neighborhood anonymity,
k-sized grouping, (k, l) grouping, …
And there is an impossibility result that applies to all of them. [Dwork, Naor 2006]

For any reasonable definition of “privacy breach”
and “sanitization”, with high probability some
adversary can breach some sanitized DB.
Example:
– Private fact: my exact height
– Background knowledge: I’m 5 inches taller than the
average American woman
– San(DB) allows computing average height of US
women
– This breaks my privacy … even if my record is not in
the database!
Outline
• Why privacy?
• Privacy Attack Examples
• Old models and limitations
– K-anonymity
– L-diversity
– T-closeness

• Differential Privacy

Formulation of Privacy
• What information can be published?
– Average height of US people
– Height of an individual

• Intuition:
– If something is insensitive to the change of any individual tuple, then it
should not be considered private

• Example:
– Assume that we arbitrarily change the height of an individual in the US
– The average height of US people would remain roughly the same
– i.e., The average height reveals little information about the exact
height of any particular individual
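A small sketch of this intuition with made-up numbers: changing any one person's height barely moves the nationwide average, while that person's exact height is fully revealing.

```python
# Sketch: the average is insensitive to any single individual's value.
# Made-up numbers purely for illustration.
heights = [170.0] * 1_000_000          # a million people, 170 cm each
avg_before = sum(heights) / len(heights)

heights[0] = 200.0                     # arbitrarily change one individual
avg_after = sum(heights) / len(heights)

# The average moves by only 0.00003 cm, so it reveals almost nothing about any
# particular person; that person's exact height would reveal everything.
print(avg_before, avg_after, abs(avg_after - avg_before))
```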
Differential Privacy
• Motivation/Intuition:
– It is OK to publish information that is insensitive to the change of any particular tuple in the dataset; the ratio of output probabilities on neighboring datasets stays bounded
• Definition:
– Neighboring datasets: two datasets D and D' such that D' can be obtained by changing one single tuple in D
– A randomized algorithm M satisfies ε-differential privacy iff, for any two neighboring datasets D and D' and for any output O of M,
  Pr[M(D) = O] ≤ e^ε · Pr[M(D') = O]
– The value of ε decides the degree of privacy protection
Differential Privacy via Laplace Noise
• Dataset: a set of patients
• Objective: release the number of diabetes patients with ε-differential privacy
• Method: release the true count plus Laplace noise
• Rationale:
– D: the original dataset; D': obtained by modifying one patient in D
– The number of diabetes patients in D and D' differs by at most 1
– Laplace noise calibrated to this sensitivity keeps the ratio of output probabilities bounded by e^ε
(Figure: the two noisy output distributions for the count on D and on D', whose densities differ by at most a factor of e^ε.)
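A minimal sketch of such a mechanism (assuming numpy; the count changes by at most 1 between neighboring datasets, so Laplace noise with scale 1/ε is the standard calibration):

```python
# Sketch of the Laplace mechanism for the diabetes-count query. Assumes numpy.
# Changing one patient changes the count by at most 1 (sensitivity = 1), so
# Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
import numpy as np

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Return the count plus Laplace noise calibrated to sensitivity/epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 1234                      # hypothetical number of diabetes patients
for eps in (0.1, 1.0):
    print(eps, round(noisy_count(true_count, eps), 1))
# Smaller epsilon means larger noise: stronger privacy, lower accuracy.
```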
Current Progress
• What does differential privacy support now?
– Frequent Pattern Mining
– Decision Tree
– Classification: SVM
– Clustering: K-Means

• What to do?
– Genetic Analysis
– Graph Analysis

Conclusion
• The risk to privacy is significant in the big data era
• It is important to solve the privacy problem to enable data sharing for social and research purposes
• Old privacy frameworks are mostly flawed
• Differential privacy is the most promising framework