In this session, we will discuss a range of new emerging technologies for privacy and confidentiality in machine learning and data analytics. We will discuss how to use open source tools to put these technologies to work for databases and other data sources.
When we think about developing AI responsibly, there are many different activities that we need to consider. In this session, we will discuss technologies that help protect people, preserve privacy, and enable you to do machine learning confidentially.
This session discusses industry standards and emerging privacy-enhancing computation techniques such as secure multiparty computation and trusted execution environments. We will discuss how the Zero Trust philosophy fundamentally changes the way we approach security, since trust is a vulnerability that can be exploited, particularly when working remotely and increasingly when using cloud models. We will also discuss the “why, what, and how” of techniques for privacy-preserving computing.
We will review how different industries are taking advantage of these privacy-preserving techniques. A retail company used secure multi-party computation to respect user privacy and specific regulations while allowing the retailer to gain insights and protecting the organization’s IP. A healthcare organization uses secure data-sharing to protect the privacy of individuals, and also stores and searches encrypted medical data in the cloud.
We will also review the benefits of secure data-sharing for financial institutions, including a large bank that wanted to broaden access to its data lake without compromising data privacy while preserving the data’s analytical quality for machine learning purposes.
Agenda
Machine learning (ML) and AI (Artificial Intelligence)
Secure Data-sharing
• Secure multi-party computation (SMPC) and use cases
• Homomorphic encryption (HE) and use cases
• Zero trust architecture (ZTA) vs. Zero knowledge
• Trusted execution environments (TEE)
Regulations and Standards in Data Privacy
• International privacy standards
• Differential Privacy (DP) and K-Anonymity
Source: https://www.wsj.com/articles/coronavirus-paves-way-for-new-age-of-digital-surveillance-11586963028
American officials are drawing cellphone location data from mobile advertising firms to track the presence of crowds, but not individuals.
• Apple Inc. and Alphabet Inc.’s Google: a voluntary app that health officials can use to reverse-engineer sickened patients’ recent whereabouts, provided the patients agree to provide such information.
Collect personal or anonymized data?
In Western Australia, lawmakers approved a bill to install surveillance gadgets in people’s homes to monitor those placed under quarantine.
Authorities in Hong Kong and India are using geofencing that draws virtual fences around quarantine zones.
• They monitor digital signals from smartphones or wristbands to deter rule breakers and nab offenders, who can be sent to jail.
Use Case: Insilico Medicine
https://insilico.com/
Since 2014: An alternative to animal testing for research and development programs in the pharmaceutical industry.
• By using artificial intelligence and deep-learning techniques, Insilico is able to analyze how a compound will affect cells and what drugs can be used to treat the cells, in addition to possible side effects.
• The company provides machine learning services to different pharmaceutical, biotechnology, and skin care companies.
• The company has multiple collaborations in the applications of next-generation artificial intelligence technologies, such as generative adversarial networks and reinforcement learning, to the generation of novel molecular structures with desired properties.
A comprehensive drug discovery engine, which utilizes millions of samples and multiple data types to discover signatures of disease and identify the most promising targets for billions of molecules that already exist or can be generated de novo with the desired set of parameters.
Source: https://www.marketresearchfuture.com/reports/machine-learning-market-2494, September 2019
Global Machine Learning Market
Machine learning is a part of artificial intelligence (AI) that grants computers the capability to learn without being programmed in detail.
It has multiple uses in today’s technology market concerning safety and security, such as face detection, face recognition, image classification, speech recognition, antivirus and anti-spam filtering, web search, genetics, signal diagnostics, and weather forecasting.
Source: privacyshield.gov
Privacy Shield Program*
• On July 12, 2016, the European Commission deemed the EU-U.S. Privacy Shield Framework adequate to enable data transfers under EU law (see the adequacy determination).
• On July 16, 2020, the Court of Justice of the European Union issued a judgment declaring as “invalid” the European Commission’s Decision (EU) 2016/1250 of 12 July 2016 on the adequacy of the protection provided by the EU-U.S. Privacy Shield.
As a result of that decision, the EU-U.S. Privacy Shield Framework is no longer a valid mechanism to comply with EU data protection requirements when transferring personal data from the European Union to the United States.
This decision does not relieve participants in the EU-U.S. Privacy Shield of their obligations under the EU-U.S. Privacy Shield Framework.
*: The EU-U.S. and Swiss-U.S. Privacy Shield Frameworks were designed by the U.S. Department of Commerce and the European Commission and Swiss Administration, respectively, to provide companies on both sides of the Atlantic with a mechanism to comply with data protection requirements when transferring personal data from the European Union and Switzerland to the United States in support of transatlantic commerce.
Privacy Shield safeguards: Encryption
• The CJEU reaffirmed the validity of SCCs* but stated that companies must verify, on a case-by-case basis, whether the law in the recipient country ensures adequate protection.
• The ruling placed the same requirement on EU data protection authorities: to suspend such transfers on a case-by-case basis where equivalent protection cannot be ensured.
• Privacy professionals may need to consider whether relevant surveillance programs and authorities apply in particular contexts. If they do, they could then assess whether those authorities include proportional limitations in the given context, as well as whether effective judicial remedies exist.
• Alternatively, they might consider ways to limit the context itself through additional safeguards.
• Encryption, for instance, might be a consideration.
https://iapp.org/news/a/the-schrems-ii-decision-eu-us-data-transfers-in-question/
*: Standard Contractual Clauses (SCC). Standard contractual clauses for data transfers between EU and non-EU countries.
Gartner MQ for Data Science and Machine Learning Platforms
https://www.kdnuggets.com/2020/02/gartner-mq-2020-data-science-machine-learning.html
Data and analytics pipeline, including all the following areas:
1. Data ingestion
2. Data preparation
3. Data exploration
4. Feature engineering
5. Model creation and training
6. Model testing
7. Deployment
8. Monitoring
9. Maintenance
10. Collaboration
2020 vs 2019 changes
Source: Digikey, TechBrij
Machine Learning Model Lifecycle - Example
1. Define the model: use the Sequential or Model class and add the layers
2. Compile the model: call the compile method and specify the loss, optimizer, and metrics
3. Train the model: call the fit method with the training data
4. Evaluate the model: call the evaluate method with testing data to evaluate the trained model
5. Get predictions: use the predict method on new data
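The five lifecycle steps above can be sketched with the Keras API. This is a minimal sketch, assuming a small binary-classification task; the layer sizes, epochs, and synthetic data are illustrative, not a recommended configuration.

```python
import numpy as np
from tensorflow import keras

# Synthetic binary-classification data (illustrative only)
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(200, 4)), rng.integers(0, 2, 200)
x_test, y_test = rng.normal(size=(50, 4)), rng.integers(0, 2, 50)

# 1. Define the model using the Sequential class and add the layers
model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# 2. Compile the model: specify the loss, optimizer, and metrics
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

# 3. Train the model: call fit with the training data
model.fit(x_train, y_train, epochs=2, verbose=0)

# 4. Evaluate the model: call evaluate with the testing data
loss, acc = model.evaluate(x_test, y_test, verbose=0)

# 5. Get predictions: call predict on new data
preds = model.predict(x_test[:5], verbose=0)
```

In practice the same five calls carry over unchanged to larger models; only the layer stack and the data pipeline grow.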
Protection throughout the lifecycle of data in Hadoop
[Diagram: Big Data protection with granular field-level protection for Google Cloud. Sensitive data fields from data producers are tokenized or encrypted at a Policy Enforcement Point (PEP); enterprise privacy policies may be managed on-premises or in the cloud platform; separation of duties and encryption key management govern how protected data fields reach big data analytics and data users.]
Legal Compliance and Nation-State Attacks
• Many companies have information that is attractive to governments and intelligence services.
• Others worry that litigation may result in a subpoena for all their data.
Source: Securosis, 2019
Multi-Cloud Data Privacy considerations
Jurisdiction
• Cloud service provider redundancy is great for resilience, but regulatory concerns arise when moving data across regions, which may have different laws and jurisdictions.
Source: Securosis, 2019
Examples of Hybrid Cloud considerations
Consistency
• Most firms are quite familiar with their on-premises encryption and key management systems, so they often prefer to leverage the same tools and skills across multiple clouds.
• Firms often adopt a “best of breed” cloud approach.
Trust
• Some customers simply do not trust their vendors.
Vendor Lock-in and Migration
• A common concern is vendor lock-in and an inability to migrate to another cloud service provider.
[Diagram: a cloud gateway mediating data protection across Google Cloud, AWS (S3), Azure, and SaaS applications such as Salesforce.]
Increased need for data analytics drives requirements.
Internal and Individual Third-Party Data Sharing
[Diagram: internal and external data flow through data pipelines (data lake, ETL, files) on-premises and in the cloud; Policy Enforcement Points (PEP), encryption key management, and secure multi-party computation protect data fields for analytics, data science, AI and ML, and data collaboration.]
Use case - Financial services industry
Confidential financial datasets are vital for gaining significant insights.
• The use of this data requires navigating a minefield of private client information, as well as sharing data between independent financial institutions to create a statistically significant dataset.
• Data privacy regulations such as CCPA, GDPR, and other emerging regulations around the world apply.
• Data residency controls are needed, as well as the ability to share data in a secure and private fashion.
Reduce and remove legal, risk, and compliance processes:
• Collaboration across divisions, other organizations, and across jurisdictions where data cannot be relocated or shared
• Generating privacy-respectful datasets with higher analytical value for Data Science and Analytics applications
Use case – Retail - Data for Secondary Purposes
Large aggregator of credit card transaction data.
Open a new revenue stream
• Using its data with its business partners: retailers, banks and advertising companies.
• They could help their partners achieve better ad conversion rate, improved customer satisfaction, and more timely offerings.
• Needed to respect user privacy and specific regulations. In this specific case, they wanted to work with a retailer.
• Allow the retailer to gain insights while protecting user privacy, and the credit card organization’s IP.
• An analyst at each organization’s office first used the software to link the data without exchanging any of the underlying data.
Data used to train the machine learning and statistical models.
• In this specific use case, a logistic and linear regression model was trained using secure multi-party computation (SMC).
• In its simplest form, SMC splits a dataset into secret shares and enables you to train a model without needing to put the pieces back together.
• The information that is communicated between the peers is encrypted at all times and cannot be reverse engineered.
• The resultant machine learning model coefficients (the output of the training) were only shared with the partner identified as the receiver of such information.
With the augmented dataset, the retailer was able to get a better picture of its customers’ buying habits.
Use case: Bank - Internal Data Usage by Other Units
A large bank wanted to broaden access to its data lake without compromising data privacy, while preserving the data’s analytical value at reasonable infrastructure cost.
• Current approaches to de-identify data did not fulfill the compliance requirements and business needs, which had led to several bank projects being stopped.
• The issue with these techniques, like masking, tokenization, and aggregation, was that they could not sufficiently protect the data without overly degrading its quality.
This approach allows creating privacy-protected datasets that retain their analytical value for Data Science and business applications.
A plug-in to the organization’s analytical pipeline enforced the compliance policies before the data was consumed by data science and business teams from the data lake.
• The analytical quality of the data was preserved for machine learning purposes by using AI and leveraging privacy models like differential privacy and k-anonymity.
Improved data access for teams increased the business’ bottom line without adding excessive infrastructure costs, while reducing the risk of consumer information exposure.
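As an illustration of one of the privacy models named above, here is a minimal sketch of the Laplace mechanism from differential privacy. The epsilon value, the query, and the synthetic balances are illustrative assumptions, not the bank’s actual configuration.

```python
import numpy as np

def dp_count(values, predicate, epsilon=0.5, rng=None):
    """Differentially private count: the true count plus Laplace(1/epsilon)
    noise. A counting query has sensitivity 1, because adding or removing
    one person changes the count by at most 1, so scale = 1/epsilon gives
    epsilon-differential privacy for this query."""
    rng = rng or np.random.default_rng()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: a private count of accounts with balances over 50,000
# (synthetic data; the true count here is 3)
balances = [93791, 48000, 120000, 30500, 75250]
noisy = dp_count(balances, lambda b: b > 50000, epsilon=1.0)
```

Smaller epsilon values add more noise and give stronger privacy; analysts trade this off against the accuracy the downstream model needs.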
https://royalsociety.org
Secure Multi-Party Computation (MPC)
Private multi-party machine learning with MPC
Using MPC, different parties send encrypted messages to each other, and obtain the model F(A,B,C) they wanted to compute without revealing their own private input, and without the need for a trusted central authority.
[Diagram: left, a central trusted authority collects inputs A, B, C and returns F(A,B,C); right, secure multi-party machine learning, where the parties exchange protected messages and each obtains F(A,B,C) without revealing its private input.]
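The additive secret-sharing idea underlying this picture can be sketched in a few lines. This is illustrative only: real MPC protocols add secure channels, multiplication sub-protocols, and malicious-security checks on top of this primitive.

```python
import random

P = 2**61 - 1  # a public prime modulus; all arithmetic is mod P

def share(secret: int, n_parties: int = 3):
    """Split a secret into n additive shares that sum to it mod P.
    Any n-1 shares together look uniformly random and reveal nothing."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Parties A, B, C each split their private input into shares
a_shares, b_shares, c_shares = share(100), share(250), share(42)

# Each party locally adds the shares it holds; no party ever
# sees another party's input in the clear
sum_shares = [(a + b + c) % P
              for a, b, c in zip(a_shares, b_shares, c_shares)]

# Only the combined result F(A,B,C) = A + B + C is revealed
result = reconstruct(sum_shares)  # 392
```

Because addition distributes over the shares, the parties compute the sum without any single point holding all the data, which is exactly the property the retail and bank use cases above rely on.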
Case Study – HE and Securely sharing sensitive information
An example from the healthcare domain.
The recent ability to fully map the human genome has opened endless possibilities for advances in healthcare.
1. Data from DNA analysis can test for genetic abnormalities, empower disease-risk analysis, discover family history, and detect the presence of an Alzheimer’s allele.
• But these studies require very large DNA sample sizes to detect accurate patterns.
2. However, sharing personal DNA data is a particularly problematic domain.
• Many citizens hesitate to share such personal information with third-party providers, uncertain of whether, how, and to whom the information might be shared downstream.
3. Moreover, legal limitations designed to protect privacy restrict providers from sharing this data as well.
4. HE techniques enable citizens to share their genome data while retaining privacy, without the traditional all-or-nothing trust threshold with third-party providers.
https://royalsociety.org
Homomorphic encryption (HE)
HE depicted in a client-server model
• The client sends encrypted data to a server, where a specific analysis is performed on the encrypted data, without decrypting that data.
• The encrypted result is then sent to the client, who can decrypt it to obtain the result of the analysis they wished to outsource.
[Diagram: the client sends an encryption of x to the server; the server performs the analysis on the ciphertext and returns the encrypted F(x).]
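A minimal sketch of this pattern using the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The tiny key sizes here are for readability only and are not secure; real deployments use primes of 1024+ bits and vetted libraries.

```python
import math
import random

# Toy Paillier keypair (illustrative parameters, NOT secure)
p, q = 61, 53
n = p * q                      # public modulus
n2 = n * n
g = n + 1                      # standard choice of generator
lam = math.lcm(p - 1, q - 1)   # private key component lambda

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # private key component mu

def encrypt(m):
    """Enc(m) = g^m * r^n mod n^2, with r random in Z_n*."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Dec(c) = L(c^lambda mod n^2) * mu mod n."""
    return (L(pow(c, lam, n2)) * mu) % n

# The server can add values it cannot read: multiplying the
# ciphertexts adds the underlying plaintexts.
c_sum = (encrypt(120) * encrypt(45)) % n2
total = decrypt(c_sum)  # 165
```

This additive property is what lets a server aggregate encrypted records (for example, summing encrypted test results) and return only an encrypted total for the client to decrypt.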
Trusted execution environments
Trusted Execution Environments (TEEs) provide secure computation capability through a combination of special-purpose hardware in modern processors and software built to use those hardware features.
The special-purpose hardware provides a mechanism by which a process can run on a processor without its memory or execution state being visible to any other process on the processor, not even the operating system or other privileged code.
Computation in a TEE is not performed on data while it remains encrypted.
• Typically, the memory space of each TEE (enclave) application is protected from access,
• and AES-encrypted when and if it is stored off-chip.
Usability is low, and products/services are emerging in MS Azure, IBM’s cloud service, and Amazon AWS (late 2020)*
*: Source: http://publications.officialstatistics.org
Case Study
A major international bank performed a consolidation of all European operational data sources to Italy.
Personally Identifiable Information (PII) had to comply with the EU Cross Border Data Protection Laws, specifically:
• Datenschutzgesetz 2000 (DSG 2000) in Austria, and
• Bundesdatenschutzgesetz in Germany.
This required access to Austrian and German customer data to be restricted to only requesters in each respective country.
• Achieved targeted compliance with EU Cross Border Data Security laws
• Implemented country-specific data access restrictions
Lower Risk and Higher Productivity with More Access to More Data
[Chart: risk vs. user productivity as access to data increases; low-risk tokens allow more access to data at lower risk than high-risk clear data.]
Examples of data de-identification

Pseudonymization:
Field: Gender | Privacy Action (PA): Pseudonymise | Variant Twin Output: AD-lks75HF9aLKSa

Aggregation/Binning:
Field: Age | PA: Integer Range Bin | PA Config: Step 10 + Pseud. | Output: Age_KXYC
Field: Age | PA: Integer Range Bin | PA Config: Custom Steps | Output: 18-25

Rounding (Generalization):
Field: Balance | PA: Nearest Unit Value | PA Config: Thousand | Output: 94000

Source data:
Last name: Folds | Balance: 93791 | Age: 23 | Gender: m

Generalization:
Source data: Patient: 173965429 | Age: 57 | Gender: Female | Region: Hamburg | Disease: Gastric ulcer
Output data: Patient: 173965429 | Age: >50 | Gender: Female | Region: Germany | Disease: Gastric ulcer

Source: INTERNATIONAL STANDARD ISO/IEC 20889, Privitar, Anonos
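The transformations in the examples above can be sketched as simple functions. The pseudonym scheme (salted hash), bin edges, and thresholds are illustrative assumptions following the ISO/IEC 20889 technique names, not a particular vendor’s implementation.

```python
import hashlib

def pseudonymize(value: str, secret_salt: str = "demo-salt") -> str:
    """Replace an identifier with a consistent, hard-to-reverse pseudonym."""
    return hashlib.sha256((secret_salt + value).encode()).hexdigest()[:16]

def bin_age(age: int, step: int = 10) -> str:
    """Aggregation/binning: report only a coarse age range."""
    low = (age // step) * step
    return f"{low}-{low + step - 1}"

def round_to_nearest(value: int, unit: int = 1000) -> int:
    """Rounding: generalize a balance to the nearest unit."""
    return round(value / unit) * unit

def generalize_age(age: int, threshold: int = 50) -> str:
    """Generalization: '>50'-style range, as in the patient example."""
    return f">{threshold}" if age > threshold else f"<={threshold}"

# Applying the actions to the source row from the table above
record = {"last_name": "Folds", "balance": 93791, "age": 23, "gender": "m"}
safe = {
    "last_name": pseudonymize(record["last_name"]),
    "balance": round_to_nearest(record["balance"]),  # 94000
    "age": bin_age(record["age"]),                   # "20-29"
    "gender": pseudonymize(record["gender"]),
}
```

The same salt must be kept secret and reused for a pseudonym to remain consistent across datasets while staying unlinkable to outsiders.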
Data protection techniques: Deployment on-premises and in clouds
[Table: availability of techniques across centralized and distributed data warehouses, on-premises, public cloud, and private cloud. Vault-less tokenization, masking, hashing, server and local privacy models, L-diversity, and T-closeness are marked as available in all environments; format-preserving encryption in most; vault-based tokenization and homomorphic encryption in only a subset.]
Privacy enhancing data de-identification: terminology and classification of techniques
• De-identification techniques: tokenization, cryptographic tools, suppression techniques
• Formal privacy measurement models: Differential Privacy, K-anonymity model
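The k-anonymity model named above has a simple operational meaning that can be checked directly: every record must share its quasi-identifier values with at least k-1 other records. A minimal sketch, assuming an illustrative choice of quasi-identifiers:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k of a dataset: the size of the smallest group of
    records sharing the same quasi-identifier values. A dataset is
    k-anonymous if every record is indistinguishable from at least
    k-1 others on those attributes."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers)
                     for r in records)
    return min(groups.values())

# Generalized records (as in the patient example: coarse age, region)
records = [
    {"age": ">50", "gender": "F", "region": "Germany"},
    {"age": ">50", "gender": "F", "region": "Germany"},
    {"age": "<=50", "gender": "M", "region": "Germany"},
    {"age": "<=50", "gender": "M", "region": "Germany"},
]
k = k_anonymity(records, ["age", "gender", "region"])  # 2
```

Generalization and suppression raise k by merging small groups; differential privacy instead bounds what any single record can change in a query’s output.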
ISO Privacy Standards
11 Published International Privacy Standards (IS: International Standard; TR: Technical Report; TS: Technical Specification)
Guidelines to help comply with ethical standards:
• 20889 IS Privacy enhancing de-identification terminology and classification of techniques
• 27018 IS Code of practice for protection of PII in public clouds acting as PII processors
• 27701 IS Security techniques - Extension to ISO/IEC 27001 and ISO/IEC 27002 for privacy information management - Requirements and guidelines
• 29100 IS Privacy framework
• 29101 IS Privacy architecture framework
• 29134 IS Guidelines for Privacy impact assessment
• 29151 IS Code of Practice for PII Protection
• 29190 IS Privacy capability assessment model
• 29191 IS Requirements for partially anonymous, partially unlinkable authentication
• 19608 TS Guidance for developing security and privacy functional requirements based on 15408
• 27550 TR Privacy engineering for system lifecycle processes
References A:
1. C. Gentry, “A Fully Homomorphic Encryption Scheme,” Stanford University, September 2009, https://crypto.stanford.edu/craig/craig-thesis.pdf
2. Status Report on the Second Round of the NIST Post-Quantum Cryptography Standardization Process, https://csrc.nist.gov/publications/detail/nistir/8309/final
3. ISO/IEC 29101:2013 (Information technology – Security techniques – Privacy architecture framework)
4. ISO/IEC 19592-1:2016 (Information technology – Security techniques – Secret sharing – Part 1: General)
5. ISO/IEC 19592-2:2017 (Information technology – Security techniques – Secret sharing – Part 2: Fundamental mechanisms)
6. Homomorphic Encryption Standardization, Academic Consortium to Advance Secure Computation, https://homomorphicencryption.org/standards-meetings/
7. Homomorphic Encryption Standardization, https://homomorphicencryption.org/
8. NIST Post-Quantum Cryptography (PQC), https://csrc.nist.gov/Projects/Post-Quantum-Cryptography
9. UN Handbook on Privacy-Preserving Computation Techniques, http://publications.officialstatistics.org/handbooks/privacy-preserving-techniques-handbook/UN%20Handbook%20for%20Privacy-Preserving%20Techniques.pdf
10. ISO/IEC 29101:2013 Information technology – Security techniques – Privacy architecture framework, https://www.iso.org/standard/45124.html
11. Homomorphic encryption, https://brilliant.org/wiki/homomorphic-encryption/
12. Survey on Secure Search Over Encrypted Data on the Cloud, https://arxiv.org/abs/1811.09767
References B:
1. California Consumer Privacy Act, Oct 4, 2019, https://www.csoonline.com/article/3182578/california-consumer-privacy-act-what-you-need-to-know-to-be-compliant.html
2. CIS Controls V7.1 Mapping to NIST CSF, https://dataprivacylab.org/projects/identifiability/paper1.pdf
3. GDPR and Tokenizing Data, https://tdwi.org/articles/2018/06/06/biz-all-gdpr-and-tokenizing-data-3.aspx
4. GDPR vs CCPA, https://wirewheel.io/wp-content/uploads/2018/10/GDPR-vs-CCPA-Cheatsheet.pdf
5. General Data Protection Regulation, https://en.wikipedia.org/wiki/General_Data_Protection_Regulation
6. IBM Framework Helps Clients Prepare for the EU’s General Data Protection Regulation, https://ibmsystemsmag.com/IBM-Z/03/2018/ibm-framework-gdpr
7. INTERNATIONAL STANDARD ISO/IEC 20889, https://webstore.ansi.org/Standards/ISO/ISOIEC208892018
8. INTERNATIONAL STANDARD ISO/IEC 27018, https://webstore.ansi.org/Standards/ISO/ISOIEC270182019
9. New Enterprise Application and Data Security Challenges and Solutions, https://www.brighttalk.com/webinar/new-enterprise-application-and-data-security-challenges-and-solutions/
10. Machine Learning and AI in a Brave New Cloud World, https://www.brighttalk.com/webcast/14723/357660/machine-learning-and-ai-in-a-brave-new-cloud-world
11. Emerging Data Privacy and Security for Cloud, https://www.brighttalk.com/webinar/emerging-data-privacy-and-security-for-cloud/
12. New Application and Data Protection Strategies, https://www.brighttalk.com/webinar/new-application-and-data-protection-strategies-2/
13. The Day When 3rd Party Security Providers Disappear into Cloud, https://www.brighttalk.com/webinar/the-day-when-3rd-party-security-providers-disappear-into-cloud/
14. Advanced PII/PI Data Discovery, https://www.brighttalk.com/webinar/advanced-pii-pi-data-discovery/
15. Emerging Application and Data Protection for Cloud, https://www.brighttalk.com/webinar/emerging-application-and-data-protection-for-cloud/
16. Data Security: On Premise or in the Cloud, ISSA Journal, December 2019, ulf@ulfmattsson.com
17. Webinars and slides, www.ulfmattsson.com