Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strata NYC 2017]

Taming the ever-evolving Compliance Beast:
Lessons learnt at LinkedIn
Sept 28, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Tushar Shanbhag, Head of Data Products, LinkedIn
@shirshanka, @tusharis

Data Protection in a Digital World
PLAYING CATCH-UP WITH INNOVATION
GDPR

metric scripts
production code
Business facing
decision making
OUR VISION
Create economic opportunity for every
member of the global workforce
LinkedIn’s Vision
29K schools
10M companies
11B endorsements
500M members
10M jobs

The LinkedIn Privacy Paradox
“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data. On the other hand, one only joins the largest professional network on the Internet because they want to be found!”
Kalinda Raina, Head of Global Privacy, LinkedIn
MEMBER PRIVACY <> MEMBER DISCOVERY

Members First is a Core Value for LinkedIn
MEMBER PRIVACY WHILE DELIVERING MEMBER VALUE
Example
Well-connected: get relevance right.
Few connections: give them inventory.
Member value is proportional to knowledge
Member privacy is paramount for LinkedIn
We strive to maintain this fine balance

Data Is the Lifeblood of LinkedIn
MEMBER EXPERIENCES + BUSINESS DECISIONS
Member Data
System of Intelligence
Member Experiences
Business Decisions

We needed data democracy to
deliver member value
LinkedIn Data Science
Data Democracy
ALL THE DATA, ALL THE TIME
I want to analyze as much data as possible so my models are accurate
I want to discover data that’s needed for my analysis as fast as possible
I want to access that data as quickly as possible for my analysis

Data Protection
Need to Ensure Member Privacy
LinkedIn Members
STORE, PROCESS, DELETE, …
I want my personal data to be stored only where needed and not propagated unnecessarily
I want my personal data to be deleted when I close my account or request deletion
I want my personal data to only be processed if essential and only if I consent

The Data Paradox
DATA DEMOCRACY <> DATA PROTECTION
More Data <> Less Data
Discover Data <> Discover Violations
Easy Access <> Restricted Access

Data Hubs at LinkedIn
In Motion
Scale: O(10) clusters, ~2.3 trillion messages, ~450 TB
At Rest
Scale: O(10) clusters, ~10K machines, ~100 PB

Data Integration
In Motion
At Rest
SFTP, JDBC, REST
Azure Blob, Data Lake Storage

Apache Gobblin: Simplifying Data Integration
@LinkedIn
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ https://gobblin.apache.org/
Stream + Batch
Adopted by LinkedIn, Intel, PayPal, Apple, IBM, Swisscom, Prezi, AppLift, NerdWallet and many more…

Less Data
REQUIREMENTS
Legal: Right to Erasure or Right to be Forgotten
“Delete all my personal data without undue delay when it is no longer necessary / when consent has been withdrawn”
Engineering: Need the ability to delete some specific subset or all data associated with a specific LinkedIn member from all our data systems

Data Deletion
IMPLICATIONS FOR HADOOP
A lot of data, different formats
Challenges
Understand HDFS data: organization, formats, …
Cycle asynchronously, within an SLA, deleting records, without affecting running jobs
Quarantine exceptional records for manual triage
Can scale to processing hundreds of PB of data

Gobblin: The Logical Pipeline
[Diagram: a Source produces Work Units; each Work Unit runs as a Task through the stages Extract -> Convert -> Quality -> Write; Data Publish follows the Tasks.]

Gobblin: Extending for Purge
[Diagram: Work Units are read from HDFS; each Task runs Extract -> Convert -> Quality -> Write, and Data Publish writes back to HDFS. Member’s Delete Requests feed the pipeline: if a record needs purge then drop, else continue.]
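The "if needs purge then drop, else continue" step, together with the earlier requirement to quarantine exceptional records, can be illustrated with a minimal Python sketch. This is not Gobblin's actual API: the function name `purge_records`, the `memberId` field, and the in-memory set of delete requests are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, object]


def purge_records(
    records: Iterable[Record],
    delete_requests: Set[int],
    id_field: str = "memberId",  # hypothetical field name
) -> Tuple[List[Record], List[Record]]:
    """Split records into (kept, quarantined), dropping purged members."""
    kept: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        member_id = record.get(id_field)
        if member_id is None:
            # Exceptional record: cannot decide, set aside for manual triage.
            quarantined.append(record)
        elif member_id in delete_requests:
            # "If needs purge then drop."
            continue
        else:
            # "Else continue" to the remaining pipeline stages.
            kept.append(record)
    return kept, quarantined


records = [
    {"memberId": 1, "event": "profile_view"},
    {"memberId": 2, "event": "search"},
    {"event": "malformed"},
    {"memberId": 3, "event": "profile_view"},
]
kept, quarantined = purge_records(records, delete_requests={2})
```

In a real pipeline the delete requests would not fit in one in-memory set and the rewrite would have to avoid disturbing running jobs; the sketch only shows the per-record decision.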

STATUS AND CHALLENGES
Gobblin: Data Lifecycle Management at Scale
Status
Number of datasets: many thousands
Amount of data scanned for purge: XXX TB/day
Challenges
Immutable Storage Formats + Right to Erasure = Unhappy Disks
“Widespread implementation will surely lead to innovation in these formats!”

The Data Paradox
More Data <> Less Data
Discover Data <> Discover Violations
Easy Access <> Restricted Access
DATA LIFECYCLE MANAGEMENT

WhereHows
FIND DATA, NAVIGATE RELATIONSHIPS
Data Discovery
Metadata-based Search Experience for Data Scientists
Where is dataset X?
How did it get created?
Usage: In production since 2014
Users: Data Scientists, Product Engineers
Use Cases: Discovery, Impact Analysis
Open source @ github.com/linkedin/wherehows

Discovering Violations
ANSWERING HARDER QUESTIONS
More than just Discovery
Use Cases
Which datasets at LinkedIn contain PII or highly confidential data?
How many contain member-member messages?
How many of them are accessible by team X?
Have all datasets been purged within SLA?
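Questions like these become simple filters once each dataset carries tags, access lists and purge timestamps as metadata. The sketch below is a hypothetical in-memory catalog, not the WhereHows data model: the `DatasetMetadata` fields, tag names, team names and dates are all invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Set


@dataclass
class DatasetMetadata:
    """Hypothetical catalog entry; field names are assumptions."""
    name: str
    tags: Set[str] = field(default_factory=set)
    readers: Set[str] = field(default_factory=set)
    last_purged: date = date.min


def with_tag(catalog: List[DatasetMetadata], tag: str) -> List[str]:
    """Which datasets carry a given tag (for example 'pii')?"""
    return [d.name for d in catalog if tag in d.tags]


def accessible_by(catalog: List[DatasetMetadata], team: str, tag: str) -> List[str]:
    """Which tagged datasets can a given team read?"""
    return [d.name for d in catalog if tag in d.tags and team in d.readers]


def purge_overdue(catalog: List[DatasetMetadata], today: date, sla: timedelta) -> List[str]:
    """Which datasets were last purged longer ago than the SLA allows?"""
    return [d.name for d in catalog if today - d.last_purged > sla]


catalog = [
    DatasetMetadata("member_profile", {"pii"}, {"team_x"}, date(2017, 9, 20)),
    DatasetMetadata("inbox_messages", {"pii", "member_messages"}, {"team_y"}, date(2017, 7, 1)),
    DatasetMetadata("page_views_agg", set(), {"team_x", "team_y"}, date(2017, 9, 27)),
]
today = date(2017, 9, 28)

pii_datasets = with_tag(catalog, "pii")
pii_for_team_x = accessible_by(catalog, "team_x", "pii")
overdue = purge_overdue(catalog, today, sla=timedelta(days=30))
```

The hard part in practice is populating the tags across many systems; the queries themselves stay this simple only if the metadata is both wide and deep.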

Discovering Violations
REQUIREMENTS
Wide + Deep Metadata
Comprehensive coverage of data systems at LinkedIn
We have > 20 systems!
SQL, NoSQL, Indexes, Blob Stores, …
Deeper understanding of each dataset
Schema is not enough
Need to understand semantics

A METADATA REFINERY APPROACH
WhereHows Architecture @ 10,000 ft
ML-driven refinements

The Data Paradox
More Data <> Less Data
Discover Data <> Discover Violations
Easy Access <> Restricted Access
METADATA

FREEDOM OF EXPRESSION
Many Transformation Engines @ LinkedIn
In Motion
At Rest

HARD TO CHANGE ANYTHING UNDERNEATH!
Challenge for Infrastructure Providers
(Pig scripts)
My Raw Data
Native readers, dependencies on path, format hard-coded
Hard to move to better formats without breaking everyone or copying data twice

HARD TO CHANGE ANYTHING UPSTREAM!
Semantic Challenges
Data is unclean (bad data on certain dates)
Data models are in constant flux (split event into multiple)
Have to change data processing logic everywhere!
My Raw Data

AN API TO MANAGE EVOLUTION
We need “microservices” for Data
My Data API
My Raw Data

A DATA ACCESS LAYER FOR LINKEDIN
We built Dali to solve this
Logical Tables + Views
Logical FileSystem
Abstract away underlying physical details to allow users to focus solely on the logical concerns

Dali: Implementation Details in Context
[Diagram components: Dali FileSystem; Processing Engine (MR, Spark); Dali Datasets (Tables + Views); Dataflow APIs (MR, Spark, Scalding); Query Layers (Pig, Hive, Spark); Dali CLI; Data Catalog; Git + Artifactory (View Def + UDFs); Dataset Owner; Data Source; Data Sink]

Access Restrictions
REQUIREMENTS
Simple to Complex
Different Types
Basic Restrictions: Access to dataset based on business need
Privacy by Default: Analysts shouldn’t get access to raw PII by default
Consent-based Access: Access to certain data elements only available if member has consented for that particular use case

Solving for Compliant Access
STEP 1: DATA + METADATA
Raw Dataset: MemberProfile
Schema = {
  int memberId
  String firstName
  String lastName
  Position[] positions
  educationHistory[] educationHistory
  …
}
MEMBER_ID, NAME, PROFILE DATA
Metadata
MEMBER_ID : is_pii
NAME : is_pii

STEP 2: A MEMBER’S PREFERENCES
Privacy Preferences

Privacy Preferences
A BITMAP DATASET: ONE PER MEMBER
Member Privacy Preferences
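One way to picture a per-member preference bitmap is one bit per use case, set when the member has consented. The sketch below uses Python's `enum.IntFlag`; the use-case names, the member IDs and the `has_consented` helper are hypothetical, and the slides do not describe the actual encoding.

```python
from enum import IntFlag
from typing import Dict


class UseCase(IntFlag):
    """Hypothetical use cases, one bit each."""
    ADS_TARGETING = 1
    JOB_RECOMMENDATIONS = 2
    RESEARCH = 4


# One bitmap (stored as a plain integer) per member ID.
preferences: Dict[int, int] = {
    101: int(UseCase.ADS_TARGETING | UseCase.RESEARCH),
    102: 0,  # consented to nothing
}


def has_consented(member_id: int, use_case: UseCase) -> bool:
    """Check the member's bit for this use case; unknown members default to no consent."""
    bitmap = preferences.get(member_id, 0)
    return bool(bitmap & use_case)


research_ok = has_consented(101, UseCase.RESEARCH)
jobs_ok = has_consented(101, UseCase.JOB_RECOMMENDATIONS)
unknown_ok = has_consented(999, UseCase.RESEARCH)
```

Defaulting unknown members to "no consent" is a design choice made here to match the privacy-by-default requirement; a compact bitmap also keeps the later join against large datasets cheap.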

Solving for Compliant Access With Dali
Dali Reader responsibility:
Given: (Dataset, Metadata, UseCase)
Generate:
Dataset and Column-level transformations (obfuscate, null, …)
Auto-join with Member Privacy Preferences (filter out data elements that are not consented to)
[Diagram: Raw Dataset, Metadata and Member Privacy Preferences feed the Dali Reader Library, which serves the Processing Logic with Use Case = X]
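The reader responsibility described above can be sketched in a few lines of Python. This is not Dali's API: `compliant_read`, the `column_policy` mapping, the `consents` lookup and the sample rows are assumptions for illustration, and filtering whole rows is only one possible reading of "filter out data elements that are not consented to".

```python
import hashlib
from typing import Callable, Dict, Iterator, List, Set

Row = Dict[str, object]


def obfuscate(value: object) -> str:
    """Replace a value with a short one-way hash."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def nullify(value: object) -> None:
    """Drop the value entirely."""
    return None


TRANSFORMS: Dict[str, Callable[[object], object]] = {
    "obfuscate": obfuscate,
    "null": nullify,
}


def compliant_read(
    rows: List[Row],
    column_policy: Dict[str, str],   # column -> transformation name (hypothetical)
    consents: Dict[int, Set[str]],   # memberId -> consented use cases (hypothetical)
    use_case: str,
) -> Iterator[Row]:
    for row in rows:
        # "Auto-join" with preferences: skip members without consent.
        if use_case not in consents.get(row["memberId"], set()):
            continue
        # Column-level transformations driven by the metadata-derived policy.
        yield {
            col: TRANSFORMS[column_policy[col]](val) if col in column_policy else val
            for col, val in row.items()
        }


rows = [
    {"memberId": 1, "firstName": "Ada", "industry": "Software"},
    {"memberId": 2, "firstName": "Bob", "industry": "Finance"},
]
policy = {"memberId": "obfuscate", "firstName": "null"}
consents = {1: {"analytics"}}
out = list(compliant_read(rows, policy, consents, "analytics"))
```

Because the processing logic only ever sees the output of the reader, the policy can change without touching the jobs that consume the data.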

Solving for Compliant Purging With Dali + Gobblin
[Diagram: Raw Dataset, Metadata, Member Privacy Preferences and Member’s Delete Requests feed the Gobblin Purger, which reads through the Dali Reader Library with Use Case = Purge and produces the Purged Dataset]

The Data Paradox
More Data <> Less Data
Discover Data <> Discover Violations
Easy Access <> Restricted Access
METADATA
DATA ACCESS LAYER

The Data Paradox: Solved!
More Data <> Less Data
Discover Data <> Discover Violations
Easy Access <> Restricted Access
METADATA
DATA ACCESS LAYER

The Technology Blueprint
DATA DEMOCRACY + DATA PROTECTION
METADATA: WhereHows*
DATA ACCESS LAYER: Dali
DATA LIFECYCLE MANAGEMENT: Apache Gobblin*
* Open Source: We can collaborate on these together!

Privacy: Technology + Process
SUSTAINABILITY IS CRITICAL
Privacy By Design: Core company value, implemented by Technology & Process
Product: Security & Privacy Review
Data: Data Model Review
Legal: Regulation change -> Tech requirements
Company-wide: “Horizontal” Initiatives

Key Takeaways
THE BEAST IS REAL
Data Protection: Getting stricter and more complex
Stricter regulations in a digital world
Increasingly complex to implement
This is an accelerating global trend

Key Takeaways
THE BEAST CAN BE TAMED!
Learnings at LinkedIn: We’ve established a blueprint to sustainably address privacy
Privacy By Design: baked into technology stack & product development process
Standardization: To solve at scale, certain parts need to be centralized and standardized
Company-wide: Needs coordinated effort across various functions
