SlideShare a Scribd company logo
1 of 36
Download to read offline
Protecting privacy in
practice
2017-06-13
Lars Albertsson
www.mapflat.com
1
Who’s talking?
● KTH-PDC Center for High Performance Computing (MSc thesis)
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (building very large machines)
● Google (Hangouts, productivity)
● Recorded Future (natural language processing startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat (independent data engineering consultant)
2
Privacy protection resources
3
All of this might go
wrong. Large fine.
Pour your data into
our product.
404
Privacy-driven design
● Technical scope
○ Toolbox
○ Not complete solutions
● Assuming that you solve:
○ Legal requirements
○ Security primitives
○ ...
● Not description of any company
4
Archi-
tecture
Statis-
tics
Legal
Organi-
sation
Security
Process Privacy UX
Culture
Requirements, engineer’s perspective
● Right to be forgotten
● Limited collection
● Limited retention
● Limited access
○ From employees
○ In case of security breach
● Right for explanations
● User data enumeration
● User data export
5
Ancient data-centric systems
● The monolith
● All data in one place
● Analytics + online serving from
single database
● Current state, mutable
- Please delete me?
- What data have you got on me?
- Sure, no problem!
6
DB
Presentation
Logic
Storage
Event oriented systems
7
Every event All events, ever,
raw, unprocessed
Refinement
pipeline
Artifact of
value
Event oriented systems
8
Every event All events, ever,
raw, unprocessed
Refinement
pipeline
Artifact of
value
● Motivated by
○ New types of data-driven (AI) features
○ Quicker product iterations
■ Data-driven product feedback (A/B tests)
■ Fewer teams involved in changes
○ Robustness - scales to more complex business logic
Enable disruption
v
Service
Data processing at scale
9
Cluster storage
Ingress Offline processing Egress
Data
lake
DB
Service
DatasetJob
Pipeline
Service
Export
Business
intelligence
DB
DB
Import
Workflow manager
10
● Dataset “build tool”
● Run job instance when
○ input is available
○ output missing
○ resources are available
● Backfill for previous failures
● DSL describes DAG
○ Includes ingress & egress
Recommended: Luigi / Airflow
DB
Factors of success
11
● Event-oriented - append only
● Immutability
● At-least-once semantics
● Reproducibility
○ Through 1000s of copies
● Redundancy
- Please delete me?
- What data have you got on me?
- Hold on a second...
Solution space
12
Technical
feasibility
Easy to do
the right thing
Awareness
culture
Personal information (PII) classification
● Red - sensitive data
○ Messages
○ GPS location
○ Views, preferences
● Yellow - personal data
○ IDs (user, device)
○ Name, email, address
○ IP address
● Green - insensitive data
○ Not related to persons
○ Aggregated numbers
13
● Grey zone
○ Birth date, zip code
○ Recommendation / ads models?
Nota bene: This is only an
example classification
PII arithmetics
Red + green = red red + yellow = red yellow + green = yellow
Aggregate(red/yellow) = green ?
Green + green + green = yellow ?
Yellow + yellow + yellow = red ?
Machine_learning_model(yellow) = yellow ?
Overfitting => persons could be identified
14
Make privacy visible at ground level
● In dataset names
○ hdfs://red/crm/received_messages/year=2017/month=6/day=13
○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13
● In field names
○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …”
● In credential / service / table / ... names
● Spreads awareness
● Catch mistakes in code review
● Enables custom tooling for violation warnings
15
Eye of the needle tool
● Provide data access through gateway tool
○ Thin wrapper around Spark/Hadoop/S3/...
○ Hard-wired configuration
● Governance
○ Access audit, verification
○ Policing/retention for experiment data
16
Eye of the needle tool
● Easy to do the right thing
○ Right resource choice, e.g. “allocate temporary
cluster/storage”
○ Enforce practices, e.g. run jobs from central
repository code
○ No command for data download
● Enabling for data scientists
○ Empowered without operations
○ Directory of resources
17
Towards oblivion
● Left to its own devices,
personal (PII) data
spreads like weed
● PII data needs to be
governed, collared, or
discarded
○ Discard what you can
18
● Discard all PII
○ User id in example
● No link between records or datasets
● Replace with non-PII
○ E.g. age, gender, country
● Still no link
○ Beware: rare combination => not anonymised
Drop user id
Discard: Anonymisation
19
Replace user id with
demographics
Useful for
business
insights
Useful for
metrics
Partial discard: Pseudonymisation
● Hash PII
● Records are linked
○ Across datasets
○ Still PII, GDPR applies
○ Persons can be identified (with additional data)
○ Hash recoverable from PII
● Hash PII + salt
○ Hash not recoverable
● Records are still linked
○ Across datasets if salt is constant
20
Hash user id Useful for
recommendations
Hash user id
+ salt
Useful for product
insights
● Push reruns with
workflow manager
- No versioning support in tools
- Computationally expensive
- Easy to miss datasets
+ No data model changes required
Governance: Recomputation
21
● Fields reference PII table
● Clear single record => oblivion
- PII table injection needed
- Key with UUID or hash
- Extra join
- Multiple or wide PII tables
+ Small PII leak risk
Ejected record pattern
22
● Datasets are immutable
● Version n+1 of raw dataset lacks record
● Short retention of old versions
● Always depend on latest version
Record removal in pipelines
23
User keys
2017-06-12
User keys
2017-06-13
class Purchases(Task):
date = DateParameter()
def requires(self):
return [Users(self.date),
Orders(self.date),
UserKeys.latest()]
● PII fields encrypted
● Per-user decryption key table
● Clear single user key => oblivion
- Extra join + decrypt
- Decryption (user) id needed
+ Multi-field oblivion
+ Small PII leak risk
Lost key pattern
24
● Different fields encrypted
with different keys
● Partial user oblivion
○ E.g. forget my GPS coordinates
Lost key partial oblivion
25
● Encrypt key fields that link datasets
● Ability to join is lost
● No data loss
○ Salt => anonymous data
○ No salt => pseudonymous data
Lost link key
26
Reversible oblivion
● Lost key pattern
● Give ejected record key to
third party
○ User
○ Trusted organisation
● Destroy company copies
27
Data model deadly sins
● Using PII data as key
○ Username, email
● Publishing entity ids containing PII data
○ E.g. user shared resources (favourites, compilations) including username
● Publishing pseudonymised datasets
○ They can be de-pseudonymised with external data
○ E.g. AOL, Netflix, ...
28
Tombstone line
● Produce dataset/stream of
forgotten users
● Egress components, e.g. online
service databases, may need
push for removal.
○ Higher PII leak risk
29
DB Service
The art of deletion
● Example: Cassandra
● Deletions == tombstones
● Data remains
○ Until compaction
○ In disconnected nodes
Component-specific expertise necessary
30
Deletion layers
● Every component adds deletion burden
○ Minimise number of components
○ Ephemeral >> dedicated. Recycle machines.
● Every storage layer adds deletion burden
○ Minimise number of storage layers
○ Cloud storage requires documented erasure semantics + agreements.
● Invent simple strategies
○ Example: Cycle Cassandra machines regularly, erase block devices.
Increasing cost of heterogeneity
31
Retention limitation
● Best solved in workflow manager
○ Connect creation and destruction
● Short default retention, whitelist exceptions
● In conflict with technical ideal of immutable raw data
32
Lake promotion
● Remove expire raw dataset, promote derived datasets to lake
● First derived dataset = washed(raw)?
● Workflow DAG still works
33
Data lake Derived
Cluster storage
Data lake Derived
Cluster storage
Lineage
● Tooling for tracking data flow
● Dataset granularity
○ Workflow manager?
● Field granularity
○ Framework instrumentation?
● Multiple use cases
○ (Discovering data)
○ (Pipeline change management)
○ Detecting dead end data flows
○ Right to export data
○ Explanation of model decisions
34
Solicitation: PII & lineage type systems
● Idea: decorate (scala) types
○ PII classification (red/yellow/green)
○ Lineage (e.g. processing class id + commit id + dataset revision)
● Assistance with PII arithmetics
○ PII[Red, String] + PII[Green, String] => PII[Red, String]
○ PII[Red, Int] + PII[Red, Int] => PII[Green, Int]
● Detect unused PII fields
● Assist with recomputation
○ For PII cleaning
○ Bug fixes
35
Resources Credits
● http://www.slideshare.net/lallea/d
ata-pipelines-from-zero-to-solid
● http://www.mapflat.com/lands/res
ources/reading-list
● https://ico.org.uk/
● EU Article 29 Working Party
36
● Alexander Kjeldaas, independent
● Lena Sundin, Spotify
● Oscar Söderlund, Spotify
● Oskar Löthberg, Spotify
● Sofia Edvardsen,
Sharp Cookie Advisors
● Øyvind Løkling,
Schibsted Media Group

More Related Content

What's hot

Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisisLars Albertsson
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish styleLars Albertsson
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science teamLars Albertsson
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)Ryan Blue
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Taro L. Saito
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and ReuseVasia Kalavri
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
Provenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureProvenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureAndreas Schreiber
 
Big data processing systems research
Big data processing systems researchBig data processing systems research
Big data processing systems researchVasia Kalavri
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
Real Time Big Data
Real Time Big DataReal Time Big Data
Real Time Big DataInfoFarm
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesVasia Kalavri
 

What's hot (20)

Taming the reproducibility crisis
Taming the reproducibility crisisTaming the reproducibility crisis
Taming the reproducibility crisis
 
Towards Data Operations
Towards Data OperationsTowards Data Operations
Towards Data Operations
 
Data democratised
Data democratisedData democratised
Data democratised
 
Data ops in practice - Swedish style
Data ops in practice - Swedish styleData ops in practice - Swedish style
Data ops in practice - Swedish style
 
Don't build a data science team
Don't build a data science teamDon't build a data science team
Don't build a data science team
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Mastro
MastroMastro
Mastro
 
Mastro
MastroMastro
Mastro
 
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)The evolution of Netflix's S3 data warehouse (Strata NY 2018)
The evolution of Netflix's S3 data warehouse (Strata NY 2018)
 
ISNCC 2017
ISNCC 2017ISNCC 2017
ISNCC 2017
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014
 
m2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reusem2r2: A Framework for Results Materialization and Reuse
m2r2: A Framework for Results Materialization and Reuse
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
Provenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructureProvenance as a building block for an open science infrastructure
Provenance as a building block for an open science infrastructure
 
Big data processing systems research
Big data processing systems researchBig data processing systems research
Big data processing systems research
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
Real Time Big Data
Real Time Big DataReal Time Big Data
Real Time Big Data
 
Bicod2017
Bicod2017Bicod2017
Bicod2017
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open Issues
 

Viewers also liked

Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Lars Albertsson
 
Tomáš Hajzler: Přednáška Nová práce @ Teal Club
Tomáš Hajzler: Přednáška Nová práce @ Teal ClubTomáš Hajzler: Přednáška Nová práce @ Teal Club
Tomáš Hajzler: Přednáška Nová práce @ Teal ClubTomáš Hajzler
 
Data Governance: Keystone of Information Management Initiatives
Data Governance: Keystone of Information Management InitiativesData Governance: Keystone of Information Management Initiatives
Data Governance: Keystone of Information Management InitiativesAlan McSweeney
 
GDPR: Requirements for Cloud Providers
GDPR: Requirements for Cloud ProvidersGDPR: Requirements for Cloud Providers
GDPR: Requirements for Cloud ProvidersIT Governance Ltd
 
EU GDPR and you: requirements for marketing
EU GDPR and you: requirements for marketingEU GDPR and you: requirements for marketing
EU GDPR and you: requirements for marketingIT Governance Ltd
 
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)Tomáš Hajzler
 
Revising policies and procedures under the new EU GDPR
Revising policies and procedures under the new EU GDPRRevising policies and procedures under the new EU GDPR
Revising policies and procedures under the new EU GDPRIT Governance Ltd
 

Viewers also liked (7)

Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0Test strategies for data processing pipelines, v2.0
Test strategies for data processing pipelines, v2.0
 
Tomáš Hajzler: Přednáška Nová práce @ Teal Club
Tomáš Hajzler: Přednáška Nová práce @ Teal ClubTomáš Hajzler: Přednáška Nová práce @ Teal Club
Tomáš Hajzler: Přednáška Nová práce @ Teal Club
 
Data Governance: Keystone of Information Management Initiatives
Data Governance: Keystone of Information Management InitiativesData Governance: Keystone of Information Management Initiatives
Data Governance: Keystone of Information Management Initiatives
 
GDPR: Requirements for Cloud Providers
GDPR: Requirements for Cloud ProvidersGDPR: Requirements for Cloud Providers
GDPR: Requirements for Cloud Providers
 
EU GDPR and you: requirements for marketing
EU GDPR and you: requirements for marketingEU GDPR and you: requirements for marketing
EU GDPR and you: requirements for marketing
 
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
Tomáš Ervín Dombrovský: Pohyb ve světě (práce)
 
Revising policies and procedures under the new EU GDPR
Revising policies and procedures under the new EU GDPRRevising policies and procedures under the new EU GDPR
Revising policies and procedures under the new EU GDPR
 

Similar to Protecting privacy in practice

Privacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, MapflatPrivacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, MapflatEvention
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentationgustavosouto
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Jonathan Singer
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdfLars Albertsson
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dan Lynn
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dan Lynn
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangDatabricks
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraShubham Tagra
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Julien Le Dem
 
Big Data overview
Big Data overviewBig Data overview
Big Data overviewalexisroos
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingMartinStrycek
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned Omid Vahdaty
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...LibbySchulze
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!DataWorks Summit
 
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.Leszek Mi?
 

Similar to Protecting privacy in practice (20)

Privacy by design
Privacy by designPrivacy by design
Privacy by design
 
Privacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, MapflatPrivacy by Design - Lars Albertsson, Mapflat
Privacy by Design - Lars Albertsson, Mapflat
 
Big Data & Social Analytics presentation
Big Data & Social Analytics presentationBig Data & Social Analytics presentation
Big Data & Social Analytics presentation
 
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019Splunk, SIEMs, and Big Data - The Undercroft - November 2019
Splunk, SIEMs, and Big Data - The Undercroft - November 2019
 
Data engineering in 10 years.pdf
Data engineering in 10 years.pdfData engineering in 10 years.pdf
Data engineering in 10 years.pdf
 
Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016Dirty data? Clean it up! - Datapalooza Denver 2016
Dirty data? Clean it up! - Datapalooza Denver 2016
 
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
Dirty Data? Clean it up! - Rocky Mountain DataCon 2016
 
Role of ML engineer
Role of ML engineerRole of ML engineer
Role of ML engineer
 
Cloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan WangCloud Cost Management and Apache Spark with Xuan Wang
Cloud Cost Management and Apache Spark with Xuan Wang
 
Presto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@MyntraPresto Bangalore Meetup1 Repertoire@Myntra
Presto Bangalore Meetup1 Repertoire@Myntra
 
Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020Data platform architecture principles - ieee infrastructure 2020
Data platform architecture principles - ieee infrastructure 2020
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Data preprocessing.pdf
Data preprocessing.pdfData preprocessing.pdf
Data preprocessing.pdf
 
Druid
DruidDruid
Druid
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...Intro to open source observability with grafana, prometheus, loki, and tempo(...
Intro to open source observability with grafana, prometheus, loki, and tempo(...
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
May The Data Stay with U! Network Data Exfiltration Techniques - Brucon 2017.
 

More from Lars Albertsson

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divideLars Albertsson
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with ScalametaLars Albertsson
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfLars Albertsson
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfLars Albertsson
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application qualityLars Albertsson
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetLars Albertsson
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesLars Albertsson
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift leftLars Albertsson
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data opsLars Albertsson
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipelineLars Albertsson
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven productsLars Albertsson
 

More from Lars Albertsson (15)

Crossing the data divide
Crossing the data divideCrossing the data divide
Crossing the data divide
 
Schema management with Scalameta
Schema management with ScalametaSchema management with Scalameta
Schema management with Scalameta
 
How to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdfHow to not kill people - Berlin Buzzwords 2023.pdf
How to not kill people - Berlin Buzzwords 2023.pdf
 
The 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdfThe 7 habits of data effective companies.pdf
The 7 habits of data effective companies.pdf
 
Holistic data application quality
Holistic data application qualityHolistic data application quality
Holistic data application quality
 
Secure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budgetSecure software supply chain on a shoestring budget
Secure software supply chain on a shoestring budget
 
DataOps - Lean principles and lean practices
DataOps - Lean principles and lean practicesDataOps - Lean principles and lean practices
DataOps - Lean principles and lean practices
 
Ai legal and ethics
Ai   legal and ethicsAi   legal and ethics
Ai legal and ethics
 
The right side of speed - learning to shift left
The right side of speed - learning to shift leftThe right side of speed - learning to shift left
The right side of speed - learning to shift left
 
The lean principles of data ops
The lean principles of data opsThe lean principles of data ops
The lean principles of data ops
 
Eventually, time will kill your data pipeline
Eventually, time will kill your data pipelineEventually, time will kill your data pipeline
Eventually, time will kill your data pipeline
 
Data ops in practice
Data ops in practiceData ops in practice
Data ops in practice
 
Big data == lean data
Big data == lean dataBig data == lean data
Big data == lean data
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Building real time data-driven products
Building real time data-driven productsBuilding real time data-driven products
Building real time data-driven products
 

Recently uploaded

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 

Recently uploaded (20)

Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 

Protecting privacy in practice

  • 1. Protecting privacy in practice 2017-06-13 Lars Albertsson www.mapflat.com 1
  • 2. Who’s talking? ● KTH-PDC Center for High Performance Computing (MSc thesis) ● Swedish Institute of Computer Science (distributed system test+debug tools) ● Sun Microsystems (building very large machines) ● Google (Hangouts, productivity) ● Recorded Future (natural language processing startup) ● Cinnober Financial Tech. (trading systems) ● Spotify (data processing & modelling) ● Schibsted Media Group (data processing & modelling) ● Mapflat (independent data engineering consultant) 2
  • 3. Privacy protection resources 3 All of this might go wrong. Large fine. Pour your data into our product. 404
  • 4. Privacy-driven design ● Technical scope ○ Toolbox ○ Not complete solutions ● Assuming that you solve: ○ Legal requirements ○ Security primitives ○ ... ● Not description of any company 4 Archi- tecture Statis- tics Legal Organi- sation Security Process Privacy UX Culture
  • 5. Requirements, engineer’s perspective ● Right to be forgotten ● Limited collection ● Limited retention ● Limited access ○ From employees ○ In case of security breach ● Right for explanations ● User data enumeration ● User data export 5
  • 6. Ancient data-centric systems ● The monolith ● All data in one place ● Analytics + online serving from single database ● Current state, mutable - Please delete me? - What data have you got on me? - Sure, no problem! 6 DB Presentation Logic Storage
  • 7. Event oriented systems 7 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value
  • 8. Event oriented systems 8 Every event All events, ever, raw, unprocessed Refinement pipeline Artifact of value ● Motivated by ○ New types of data-driven (AI) features ○ Quicker product iterations ■ Data-driven product feedback (A/B tests) ■ Fewer teams involved in changes ○ Robustness - scales to more complex business logic Enable disruption
  • 9. v Service Data processing at scale 9 Cluster storage Ingress Offline processing Egress Data lake DB Service DatasetJob Pipeline Service Export Business intelligence DB DB Import
  • 10. Workflow manager 10 ● Dataset “build tool” ● Run job instance when ○ input is available ○ output missing ○ resources are available ● Backfill for previous failures ● DSL describes DAG ○ Includes ingress & egress Recommended: Luigi / Airflow DB
  • 11. Factors of success 11 ● Event-oriented - append only ● Immutability ● At-least-once semantics ● Reproducibility ○ Through 1000s of copies ● Redundancy - Please delete me? - What data have you got on me? - Hold on a second...
  • 12. Solution space 12 Technical feasibility Easy to do the right thing Awareness culture
  • 13. Personal information (PII) classification ● Red - sensitive data ○ Messages ○ GPS location ○ Views, preferences ● Yellow - personal data ○ IDs (user, device) ○ Name, email, address ○ IP address ● Green - insensitive data ○ Not related to persons ○ Aggregated numbers 13 ● Grey zone ○ Birth date, zip code ○ Recommendation / ads models? Nota bene: This is only an example classification
  • 14. PII arithmetics Red + green = red red + yellow = red yellow + green = yellow Aggregate(red/yellow) = green ? Green + green + green = yellow ? Yellow + yellow + yellow = red ? Machine_learning_model(yellow) = yellow ? Overfitting => persons could be identified 14
  • 15. Make privacy visible at ground level ● In dataset names ○ hdfs://red/crm/received_messages/year=2017/month=6/day=13 ○ s3://yellow/webshop/pageviews/year=2017/month=6/day=13 ● In field names ○ response.y_text = “Dear ” + user.y_name + “, thanks for contacting us …” ● In credential / service / table / ... names ● Spreads awareness ● Catch mistakes in code review ● Enables custom tooling for violation warnings 15
  • 16. Eye of the needle tool ● Provide data access through gateway tool ○ Thin wrapper around Spark/Hadoop/S3/... ○ Hard-wired configuration ● Governance ○ Access audit, verification ○ Policing/retention for experiment data 16
  • 17. Eye of the needle tool ● Easy to do the right thing ○ Right resource choice, e.g. “allocate temporary cluster/storage” ○ Enforce practices, e.g. run jobs from central repository code ○ No command for data download ● Enabling for data scientists ○ Empowered without operations ○ Directory of resources 17
  • 18. Towards oblivion ● Left to its own devices, personal (PII) data spreads like weed ● PII data needs to be governed, collared, or discarded ○ Discard what you can 18
  • 19. ● Discard all PII ○ User id in example ● No link between records or datasets ● Replace with non-PII ○ E.g. age, gender, country ● Still no link ○ Beware: rare combination => not anonymised Drop user id Discard: Anonymisation 19 Replace user id with demographics Useful for business insights Useful for metrics
  • 20. Partial discard: Pseudonymisation ● Hash PII ● Records are linked ○ Across datasets ○ Still PII, GDPR applies ○ Persons can be identified (with additional data) ○ Hash recoverable from PII ● Hash PII + salt ○ Hash not recoverable ● Records are still linked ○ Across datasets if salt is constant 20 Hash user id Useful for recommendations Hash user id + salt Useful for product insights
  • 21. ● Push reruns with workflow manager - No versioning support in tools - Computationally expensive - Easy to miss datasets + No data model changes required Governance: Recomputation 21
  • 22. ● Fields reference PII table ● Clear single record => oblivion - PII table injection needed - Key with UUID or hash - Extra join - Multiple or wide PII tables + Small PII leak risk Ejected record pattern 22
  • 23. ● Datasets are immutable ● Version n+1 of raw dataset lacks record ● Short retention of old versions ● Always depend on latest version Record removal in pipelines 23 User keys 2017-06-12 User keys 2017-06-13 class Purchases(Task): date = DateParameter() def requires(self): return [Users(self.date), Orders(self.date), UserKeys.latest()]
  • 24. ● PII fields encrypted ● Per-user decryption key table ● Clear single user key => oblivion - Extra join + decrypt - Decryption (user) id needed + Multi-field oblivion + Small PII leak risk Lost key pattern 24
  • 25. ● Different fields encrypted with different keys ● Partial user oblivion ○ E.g. forget my GPS coordinates Lost key partial oblivion 25
  • 26. ● Encrypt key fields that link datasets ● Ability to join is lost ● No data loss ○ Salt => anonymous data ○ No salt => pseudonymous data Lost link key 26
  • 27. Reversible oblivion ● Lost key pattern ● Give ejected record key to third party ○ User ○ Trusted organisation ● Destroy company copies 27
  • 28. Data model deadly sins ● Using PII data as key ○ Username, email ● Publishing entity ids containing PII data ○ E.g. user shared resources (favourites, compilations) including username ● Publishing pseudonymised datasets ○ They can be de-pseudonymised with external data ○ E.g. AOL, Netflix, ... 28
  • 29. Tombstone line ● Produce dataset/stream of forgotten users ● Egress components, e.g. online service databases, may need push for removal. ○ Higher PII leak risk 29 DB Service
  • 30. The art of deletion ● Example: Cassandra ● Deletions == tombstones ● Data remains ○ Until compaction ○ In disconnected nodes Component-specific expertise necessary 30
  • 31. Deletion layers ● Every component adds deletion burden ○ Minimise number of components ○ Ephemeral >> dedicated. Recycle machines. ● Every storage layer adds deletion burden ○ Minimise number of storage layers ○ Cloud storage requires documented erasure semantics + agreements. ● Invent simple strategies ○ Example: Cycle Cassandra machines regularly, erase block devices. Increasing cost of heterogeneity 31
  • 32. Retention limitation ● Best solved in workflow manager ○ Connect creation and destruction ● Short default retention, whitelist exceptions ● In conflict with technical ideal of immutable raw data 32
  • 33. Lake promotion ● Remove expire raw dataset, promote derived datasets to lake ● First derived dataset = washed(raw)? ● Workflow DAG still works 33 Data lake Derived Cluster storage Data lake Derived Cluster storage
  • 34. Lineage ● Tooling for tracking data flow ● Dataset granularity ○ Workflow manager? ● Field granularity ○ Framework instrumentation? ● Multiple use cases ○ (Discovering data) ○ (Pipeline change management) ○ Detecting dead end data flows ○ Right to export data ○ Explanation of model decisions 34
  • 35. Solicitation: PII & lineage type systems ● Idea: decorate (scala) types ○ PII classification (red/yellow/green) ○ Lineage (e.g. processing class id + commit id + dataset revision) ● Assistance with PII arithmetics ○ PII[Red, String] + PII[Green, String] => PII[Red, String] ○ PII[Red, Int] + PII[Red, Int] => PII[Green, Int] ● Detect unused PII fields ● Assist with recomputation ○ For PII cleaning ○ Bug fixes 35
  • 36. Resources Credits ● http://www.slideshare.net/lallea/d ata-pipelines-from-zero-to-solid ● http://www.mapflat.com/lands/res ources/reading-list ● https://ico.org.uk/ ● EU Article 29 Working Party 36 ● Alexander Kjeldaas, independent ● Lena Sundin, Spotify ● Oscar Söderlund, Spotify ● Oskar Löthberg, Spotify ● Sofia Edvardsen, Sharp Cookie Advisors ● Øyvind Løkling, Schibsted Media Group