From Integrating Approaches to Privacy across the Research Lifecycle http://privacytools.seas.harvard.edu/fall-2013-workshop
This workshop will consider how emerging tools and perspectives from a variety of disciplines, such as computer science, social science, law, and the health sciences, should be integrated in the management of confidential research data. Multidisciplinary discussion groups will grapple with these issues in the context of exemplar research use cases.
Privacy in Research Data Management - Use Cases
1. Prepared for:
Integrating Approaches to Privacy across the Research Lifecycle
Sept 2013
Introduction to Research Data
Privacy Use Cases
Micah Altman
<Micah_Altman@alumni.brown.edu>
Director of Research, MIT Libraries
Non-Resident Senior Fellow, Brookings Institution
2. DISCLAIMER
These opinions are my own; they are not the opinions of MIT, Brookings,
any of the project funders, nor (with the exception of co-authored
previously published work) my collaborators.
Secondary disclaimer:
“It’s tough to make predictions, especially about the
future!”
-- Attributed to Woody Allen, Yogi Berra, Niels Bohr, Vint Cerf, Winston Churchill,
Confucius, Disreali [sic], Freeman Dyson, Cecil B. Demille, Albert Einstein, Enrico Fermi,
Edgar R. Fiedler, Bob Fourer, Sam Goldwyn, Allan Lamport, Groucho Marx, Dan Quayle,
George Bernard Shaw, Casey Stengel, Will Rogers, M. Taub, Mark Twain, Kerr L. White,
etc.
3. About the “use cases”?
Technical definition:
A summary of a pattern of interactions between external actors within a
system under consideration to accomplish a goal.
Working definition:
Who does what, when; and what do they wish to accomplish?
Complemented by:
• User stories – simple generalized descriptions of specific interactions
• Scenarios – variations on a theme
• Examples/fact patterns – real life examples of the abstract use case
4. Data Input/Output Model
Published Outputs
* Jones * * 1961 021*
* Jones * * 1961 021*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
* Jones * * 1972 9404*
“The correlation between X and Y was large and statistically significant”
Summary statistics
Contingency table
Public use sample microdata
Information Visualization
5. Information Life Cycle Model
Stages shown in the figure:
• Creation/Collection
• Storage/Ingest
• Processing
• Internal Sharing
• Analysis
• External dissemination/publication
• Re-use (Scientometric, Education, Scientific, Policy)
• Long-term access
Surrounding the stages:
• Research methods
• Data Management Systems
• Legal / Policy Frameworks
• Statistical / Computational Frameworks
6. Legal/Policy Frameworks
Contract; Intellectual Property; Access Rights; Confidentiality
• Copyright
• Fair Use
• DMCA
• Database Rights
• Moral Rights
• Intellectual Attribution
• Trade Secret
• Patent
• Trademark
• Common Rule (45 CFR 46)
• HIPAA
• FERPA
• EU Privacy Directive
• Privacy Torts (Invasion, Defamation)
• Rights of Publicity
• Sensitive but Unclassified
• Potentially Harmful (Archeological Sites, Endangered Species, Animal Testing, …)
• Classified
• FOIA
• CIPSEA
• State Privacy Laws
• EAR
• State FOI Laws
• Journal Replication Requirements
• Funder Open Access
• Contract
• License
• Click-Wrap
• TOU
• ITAR
• Export Restrictions
7. Example: Stakeholder Concerns Across Lifecycle
Research sources:
- Research Subjects
- Owners of subject material
- Owners of supplementary data
Research sponsors:
- Home institution
- Funding sources
Project Personnel:
- Investigators
- Research Staff
Research Publishers
- Print publishers
- Research archives
Research Consumers
- Readers
- Secondary researchers
Legal Issues:
Licensing
Copyright
DMCA
Informed Consent
Privacy
Trade secrets
Licensing
Freedom of Information
Copyright
Copyright
Copyright
Licensing
Fair Use
Information Transfer
Privacy
Confidentiality
Intellectual Property
Stakeholder Concerns:
Replicable Research
Policy Relevance
Accessibility of Research
Protect IP
Avoid third party IP/Privacy Issues
Replicable Research
Publish
Promote use of Publications
Track use
Replicable research
Promote use of their publications
Protect publisher IP
Avoid third party IP/Privacy Issues
Replicate and extend
Secondary analysis
Link research
8. Systems Policy Research questions deriving from Information Lifecycle Analysis
• Infrastructure requirements analysis
– Data acquisition, storage, dissemination
– Identification, authorization, authentication
– Metadata, protocols
• System design: potential implementation cost of differential privacy:
– Information security -- hardening
– Information security -- certification & auditing
– Model server development, provisioning, maintenance, reliability, availability
• System design: information security tradeoffs of interactive privacy mechanisms:
– Availability risks: denial of service attack
– Availability/integrity risks: privacy budget exhaustion attacks
– Integrity risks: modification of delivered results (e.g. man-in-the-middle attacks)
– Secrecy/privacy: breach of authentication/authorization layer
• System design: optimizing privacy & utility across lifecycle
– When does limiting disclosive data collection dominate methods at the data analysis stage?
– When do restricted virtual data enclaves + public synthetic data dominate interactive mechanisms?
• System design: information use/reuse
– Support of scientific analysis use cases (model diagnostics, exploratory data analysis, integration of external data) within interactive privacy systems.
– Aligning informational assumptions across stages & incorporating informative priors?
– Requirements for scientific replication/verification of results produced by model servers?
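The "privacy budget exhaustion" risk listed above can be made concrete with a minimal sketch. It assumes an epsilon-differentially-private count query answered with Laplace noise and basic sequential composition (the epsilons of answered queries simply add up). The class and function names (PrivacyBudget, noisy_count) and the toy records are hypothetical illustrations, not part of any system described in these slides.

```python
import random


class BudgetExhausted(Exception):
    """Raised when a query would exceed the remaining privacy budget."""


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition (assumed)."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise BudgetExhausted("query refused: privacy budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, budget, rng):
    """Answer a counting query (sensitivity 1) with Laplace noise of scale 1/epsilon."""
    budget.charge(epsilon)  # refuse before touching the data
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    budget = PrivacyBudget(total_epsilon=1.0)
    records = [{"year": 1961}, {"year": 1972}, {"year": 1972}]  # toy data
    for attempt in range(3):
        try:
            answer = noisy_count(records, lambda r: r["year"] == 1972, 0.4, budget, rng)
            print("query", attempt, "answered:", round(answer, 2))
        except BudgetExhausted as err:
            print("query", attempt, "refused:", err)
```

In this sketch the server refuses every query once the budget is spent. That is the availability risk: anyone able to submit queries, including an adversary, can use up the budget that legitimate analysts were relying on.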
9. Modeling Features
Features - Characteristics
Data - Structure; Source; Unit of observation; Attribute types; Dimensionality; Number of observations; homogeneity; frequency of updates; quality characteristics
Analytic Results - Form of output; analysis methodology; analysis/inferential goal; utility/loss/quality
Disclosure scenario - Source of threat; areas of vulnerability; attacker objectives, background knowledge, capability; Breach criteria/disclosure concept
Stakeholders - Stakeholder types; capacities; trust relationships; budgets
Lifecycle characteristics - Lifecycle stages controlled/in scope; policies used; stakeholders involved at each stage
Current privacy management approach - Regulation/policy; legal controls; statistical/computational disclosure methods; information security controls
10. Exemplar: Social Media Analysis
Attribute Type - Examples
Data: Structure - network
Data: Attribute Types - Continuous/Discrete; Scale: ratio/interval/ordinal/nominal
Data: Performance Characteristics - 10M-1B observations; Sample from stream of continuously updated corpus; Dozens of dimensions/measures
Measurement: Unit of Observation - Individuals; Interactions
Measurement: Measurement type - Observational
Measurement: Performance characteristic - High volume; Complex network structure; Sparsity; Systematic and sparse metadata
Management Constraints - License; Replication
Analysis methods - Bespoke algorithms (clustering); nonlinear optimization; Bayesian methods
Desired Outputs - Summary scalars (model coefficients); Summary table; Static/interactive visualization
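One way to use the modeling features as a checklist is to record a use case as structured data and see which feature groups are still undocumented. The sketch below is illustrative only: the six group names follow the "Modeling Features" slide, but the class name, the field names, and the way the exemplar's attributes are assigned to groups are assumptions made for this example.

```python
from dataclasses import dataclass, field

# The six feature groups from the "Modeling Features" slide (identifiers are assumed).
FEATURE_GROUPS = (
    "data",
    "analytic_results",
    "disclosure_scenario",
    "stakeholders",
    "lifecycle_characteristics",
    "current_privacy_management",
)


@dataclass
class UseCaseDescription:
    """Hypothetical record describing one research data privacy use case."""

    name: str
    features: dict = field(default_factory=dict)

    def undocumented_groups(self):
        """Return the feature groups that have no entries yet."""
        return [g for g in FEATURE_GROUPS if not self.features.get(g)]


# Values taken from the social media exemplar; their grouping is an assumption.
social_media = UseCaseDescription(
    name="Social Media Analysis",
    features={
        "data": {
            "structure": "network",
            "attribute_types": ["continuous", "discrete"],
            "observations": "10M-1B",
            "unit_of_observation": ["individuals", "interactions"],
        },
        "analytic_results": {
            "methods": [
                "bespoke algorithms (clustering)",
                "nonlinear optimization",
                "Bayesian methods",
            ],
            "outputs": [
                "summary scalars (model coefficients)",
                "summary table",
                "static/interactive visualization",
            ],
        },
    },
)

print(social_media.undocumented_groups())
```

Run against the exemplar as recorded here, the check lists the disclosure scenario, stakeholders, lifecycle characteristics, and current privacy management approach as still to be filled in.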
More Information
• Grimmer, Justin, and Gary King. "General purpose computer-
assisted clustering and conceptualization." Proceedings of the
National Academy of Sciences 108.7 (2011): 2643-2650.
• King, Gary, Jennifer Pan, and Molly Roberts. "How censorship in
China allows government criticism but silences collective
expression." APSA 2012 Annual Meeting Paper. 2012.
• Lazer, David, et al. "Life in the network: the coming age of
computational social science." Science (New York, NY) 323.5915
(2009): 721.
11. Mapping the “Space” of Research Data Privacy
• Many different types of potentially relevant features
• Many types of stakeholders
• Many lifecycle stages
so can’t be exhaustive
Heuristic: Choose some points -- combinations of characteristics -- that
are near various corners of the (hyper-) space and that represent
substantively important examples. Document these…
Discuss. Think. Repeat.
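The heuristic can be sketched in code, under strong simplifying assumptions: each characteristic is reduced to two extreme values, every combination of extremes is a "corner", and corners are picked greedily so that each new one differs from those already chosen in as many characteristics as possible. The dimensions and values below are hypothetical placeholders, and the greedy farthest-point rule is one possible reading of "near various corners". It does not capture the "substantively important" criterion, which still needs human judgment.

```python
from itertools import product

# Hypothetical characteristics, each reduced to two extreme values.
DIMENSIONS = {
    "collector_resources": ("well-resourced", "limited"),
    "data_structure": ("tabular", "network"),
    "time_span": ("one-time", "decades"),
    "data_volume": ("modest", "very large"),
}


def hamming(a, b):
    """Number of characteristics on which two corners differ."""
    return sum(x != y for x, y in zip(a, b))


def spread_out_corners(dimensions, k):
    """Greedily pick k corners, each as far as possible from those already chosen."""
    corners = list(product(*dimensions.values()))
    if not 1 <= k <= len(corners):
        raise ValueError("k must be between 1 and the number of corners")
    chosen = [corners[0]]
    while len(chosen) < k:
        remaining = [c for c in corners if c not in chosen]
        # Farthest-point rule: maximize the distance to the nearest chosen corner.
        best = max(remaining, key=lambda c: min(hamming(c, s) for s in chosen))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    for corner in spread_out_corners(DIMENSIONS, 4):
        print(dict(zip(DIMENSIONS, corner)))
```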
12. Example Use Cases
Name/Description - Examples

Comparison case: Official Statistics
Well-resourced data collector summarizes tables/relational data in the form of summary statistics and contingency tables
Examples:
• U.S. Census dissemination
• European statistical agencies

Privacy-Aware Journal Replication Policies
Scholarly journals adopting policies for deposit and disposition of data for verification and replication. How to balance privacy and replicability without intensive review?
Examples:
• Data Sharing Systems for Open Access Journals
• American Political Science Association Data Access and Research Transparency [DART] Policy Initiative

Long-term Longitudinal Data Collection
Data collections tracking individual subjects (and possibly friends and relations) over decades
Examples:
• National Longitudinal Study of Adolescent Health (Add Health)
• Framingham Heart Study
• Panel Study of Income Dynamics

Computational Social Science
“Big” data. New forms and sources of data. Cutting-edge analytical methods and algorithms.
Analyzing …
• Netflix
• Facebook
• Hubway
• GPS
• Blogs
13. Proposed Discussion Questions
(for tomorrow)
• Characterization.
• Current approaches.
• Enhancing approaches.
• Integrating approaches.
• Utility.
• Privacy.
• Methodological Barriers.
• Incentives.
• Future.
• Prior work.
• Are these summaries useful as descriptive models?
• What is missing from the big picture?
• What are the opportunities for research, practice & policy?
(What one wants to know) (What one asks)
14. Selected Bibliography
• L. Willenborg and T. de Waal. Elements of Statistical Disclosure
Control, volume 155 of Lecture Notes in Statistics. Springer Verlag,
New York, NY, 2001.
• Higgins, Sarah. "The DCC curation lifecycle model." International
Journal of Digital Curation 3.1 (2008): 134-140.
www.dcc.ac.uk/resources/curation-lifecycle-model
• ESSNET, Handbook on Statistical Disclosure Control. 2011.
neon.vb.cbs.nl/casc/SDC_Handbook.pdf
• Fung, Benjamin, et al. "Privacy-preserving data publishing: A survey
of recent developments." ACM Computing Surveys (CSUR) 42.4
(2010): 14.
• Altman, M. (2012). “Mitigating Threats To Data Quality Throughout
the Curation Lifecycle.” In G. Marciano, C. Lee, & H. Bowden (Eds.),
Curating For Quality. datacuration.web.unc.edu
16. Appendix: Full Questions
• Characterization.
– Are there key additional characteristics of the use case that should be noted? How do these characteristics change the analysis and
treatment of privacy in these cases?
• Current approaches.
– How is this use case treated now -- what's the state of the art & practice? How is success measured?
• Enhancing approaches.
– Are any of the approaches discussed yesterday used? How could the tools and approaches mentioned earlier or other existing tools be used
at particular stages of the research lifecycle to enhance utility and privacy?
• Integrating approaches.
– Are approaches that have been developed and used in different communities compatible with each other? How should legal,
computational, policy, and statistical tools be integrated so as to be most effective?
• Utility.
– What things would stakeholders like to do with the data that the toolset doesn't restrict or obstruct? Where is social benefit sub-optimal?
How is utility measured/perceived by the stakeholders?
• Privacy.
– What sorts of data/outputs are considered particularly sensitive? What are the most important real and perceived risks -- what harms could
occur if data is released and reidentified, how severe are these harms and how likely?
• Methodological Barriers.
– What are the technical, methodological, computational or infrastructural barriers to improving privacy and utility in the management of this
data? What particular characteristics of the use case contribute to these barriers?
• Incentives.
– If better tools already exist, why aren't they used? What are barriers to adoption of new tools and methods? What are the specific "market
failures" in this area -- such as perverse incentives, lack/asymmetry of information, lack of well-developed market, irrational behavior,
transaction cost, network effects, etc.? What particular characteristics of the use case most influence incentives?
• Future.
– How is this use case likely to evolve over time? What are threats to stability/scalability/robustness/resilience of the proposed/current
solutions?
• Prior work.
– Are there key additional examples of the use case that should be noted? Are there additional key references or writings that should be
noted?
Editor's notes
This work by Micah Altman (http://micahaltman.com) is licensed under the Creative Commons Attribution-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.