SlideShare une entreprise Scribd logo
1  sur  68
Télécharger pour lire hors ligne
Mark Grover | Lyft | @mark_grover
Alyssa Ransbury | Square | @alyssaran
Slides: cutt.ly/a6n
Amundsen: From Discovering to Securing Data
Who we are
• Product Manager at Lyft
• Committer/PMC on various
Apache projects
• Previously developer on
Spark at Cloudera
2
• Security Engineer at Square
• Leads and supports on privacy
engineering efforts across
dozens of product teams
Problem
3
>35% consumer time spent on discovering & validating
trustworthy data
4
Analyst/DS workflow and time spent on each step
Other side effects
1. Interrupt heavy culture
● What data should I use for
X?
● Is that trustworthy?
2. Increased DB load
SELECT * … LIMIT 100
SELECT COUNT(*) over x days
Requirements &
solutions
5
1. What kind of information? (aka ABC of metadata)
6
Application Context
Metadata needed by humans or applications to operate
● Where is the data?
● What are the semantics of the data?
Behavior
How is data created and used over time?
● Who’s using the data?
● Who created the data?
Change
Change in data over time
● How is the data evolving over time?
● Evolution of code that generates the data
Terminology borrowed from Ground paper
Data Lineage TODAY
Short answer: Any data within your organization
Long answer:
2. About what data?
7
Data stores
Schema registry
Events /
Schemas
StreamsPeople
Employees
TODAY
NotebooksDashboard /
Reports
Processes
Develop breadth of applications
8
Metadata
Data
Discovery
& Trust
Compliance
(GDPR /
CCPA /
Financial)
Security /
ACL
Data
Monitoring
/ Ops
Data
Quality
Cost
Mgmt
Data
Mainten-
ance
Search based Lineage based Network based Programmatic
Where is the
table/dashboard for X?
What does it contain?
I am changing a data model,
who are the owner and most
common users?
I want to follow a
power user in my team.
Access metadata
programmatically
Does this analysis
already exist?
This table’s delivery was
delayed today, I want to
notify everyone downstream.
I want to bookmark
tables of interest and
get a feed of data
delay, schema change,
incidents.
Put (pull / push)
metadata
programmatically
Other requirements
● Leverage as much data automatically as possible
● Preferably, open source and healthy community
● Easy to set up
Goal: Reduce time to find trusted data w/ versatile graph
Meet Amundsen
10
First person to discover the South Pole -
Norwegian explorer, Roald Amundsen
Search for data, or browse
Search for datasets
See details of the data set
See detailed descriptions and profile of the column
Preview existing data, if perms allow
Search for existing dashboards (aka reports)
16
Dashboard detail page
17
Search for co-workers!
Search for data owned and used by your peers!
Impact
20
21
“This is God’s
work” - Head of
Analytics, Lyft
“I was on call and
I’m confident 50%
of the questions
could have been
answered by a
simple search in
Amundsen” - Data
Scientist,Lyft
A6n @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages
Roles of Amundsen users at Lyft
22
Penetration rate:
DS (aka analyst): 81%
RS (aka DS): 71%
PM: 22%
SWE: 17%
Cust Serv: 7% (12/390)
Sp. Ops: 67% (10/15)
Sp. Op Leads: 53% (9/17)
Economist: 100% (7/7)
Cust. Quality: 78% (7/9)
Growth Mktg: 25% (6/24)
Roadmap
23
Roadmap (subject to change, not ordered)
• Tighter Lineage integration / visualization
• Better view integration
• ACL integration, allow only specific roles to edit descriptions
• Show search context for what matched
• Index more resources (notebooks, Kafka topics, etc.)
24
Amundsen Open Source
25
700
Community
members
150+
Companies in
the community
20+
Companies using
in production
Amundsen Open Source Community
26
ProminentusersActivecommunity
Develop breadth of applications
27
Metadata
Data
Discovery
& Trust
Compliance
(GDPR /
CCPA /
Financial)
Security /
ACL
Data
Monitoring
/ Ops
Data
Quality
Cost
Mgmt
Data
Mainten-
ance
Square:
Scaling Data Security /
Privacy
28
29
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
30
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
31
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
32
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
33
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
34
The graph
Benefits of a graph
The Problem
36
● Relied mostly on manual work to understand our data
● We have A LOT of data, and growing
● NOT scalable
● Lots of goals related to data privacy on top of requirements from laws like
GDPR / CCPA
37
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Store more
specific
metadata
2. Automate
data
inventorying
3. Support
more granular
access
controls
1. Store more specific
metadata
38
39
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
1. Store more
specific
metadata
New Metadata Types
40
PII
Semantic
Type
Data
Storage
Security
Data
Subject
Type
PII
Semantic
Type
for data in
Column
PII Semantic Type
41
A set value describing the general contents of data in a column, with a focus on
discovering data that fits into one of three buckets
● Sensitive by itself
● Could be sensitive when taken together with other data
● Links to sensitive data, like an internal identification token
Some
Column
Data
Data Storage Security
42
A set value describing the form of data in a column
PII
Semantic
Type
for data in
Column
Some
Column
Data
Data
Storage
Security
for data in
Column
Data Subject Type
43
A set value describing users whose information exists in a column
● Example: square_product_user
PII
Semantic
Type
for data in
Column
Some
Column
Data
Data
Storage
Security
for data in
Column
Data
Subject
Type
for data in
Column
Example
44
PII
Semantic
Type
Column 1
Data
Storage
Security
encrypted
Data
Subject
Type
app_customer
Data
Storage
Security
plaintext
Column 2
UI Changes
45
2. Automate data
inventorying*
46
* where possible
47
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
2. Automate
data
inventorying
Why?
● We have a lot of data!
● Metadata should be kept current at scale
● New data sources get created all the time
● Mistakes happen
48
Flagging Possibly Sensitive Data Locations
49
Some
Column
Data
Data
Storage
Security
encrypted
Flagging Possibly Sensitive Data Locations
50
Some
Column
Data
Data
Storage
Security
encrypted
Flag
Flags
Contain information about:
● The metadata type it relates to (i.e. PII Semantic Type, Data Storage Security)
● The possible value (i.e. encrypted for Data Storage Security)
● Human readable description
51
Some
Column
Data
Flag
Reason
Result
Reviewer
Checking for Data Sensitivity
● Column name
○ Keyword partial or full match
○ Literal PII match
● Column type
● Table name
● Sample values + Google DLP API
52
User Flow
53
Some
Column
Data
Flag
Data Storage
Security
plaintext
Flag
PII Semantic
Type
phone number
User Flow
54
Some
Column
Data
Flag
Data Storage
Security
plaintext
Result: True
Reviewer: person 1
Data
Storage
Security
plaintext
Flag
PII Semantic
Type
phone number
Result: False
Reviewer: person 2
UI Changes
55
UI Changes
56
3. Support more granular
access controls
57
58
Postgres Hive Redshift ... Presto
Github
Source
File
Databuilder Crawler
Neo4j
Elastic
Search
Metadata Service Search Service
Frontend ServiceML
Feature
Service
Security
Service
Other Microservices
Metadata Sources
3. Support
more granular
access
controls
Capability-governed access control
● Purpose - An enumerated, legal-approved reason to access specific PII
Semantic Types
● Capability - A combination of Purpose and Data Subject Type
59
Capability-governed access control
60
Capability
Purpose
Subject
Type
Semantic
Type
Semantic
Type
Capability-governed access control
61
Capability
001
USER SUPPORT
CASH
APP
CUSTOMER
PERSON
NAME
EMAIL
ADDRESS
Capability-governed access control
62
Amundsen
Gating
Service
4. Future Work
63
Future and Work in Progress
● Gremlin Proxy
● UI components
○ show users exactly the data they have access to based on their capabilities
○ show the data that is tied to a specific capability
○ empower data owners to revoke broad access to some data
● Supporting additional data types
● And beyond! (we are hiring…)
64
Summary
65
Develop breadth of applications
66
Metadata
Data
Discovery
& Trust
Compliance
(GDPR /
CCPA /
Financial)
Security /
ACL
Data
Monitoring
/ Ops
Data
Quality
Cost
Mgmt
Data
Mainten-
ance
Get started!
Check out Github:
github.com/lyft/amundsen
Join Slack (~700 users):
Link to join on github
67
Check out other blogs and videos:
Amundsen for Privacy @ Square
Introducing Amundsen
Amundsen's architecture
YouTube channel
Thanks!
68
Mark Grover | Lyft | @mark_grover
Alyssa Ransbury | Square | @alyssaran
Icons under Creative Commons License from https://thenounproject.com/

Contenu connexe

Tendances

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Amazon Web Services
 
Introduction to Azure SQL DB
Introduction to Azure SQL DBIntroduction to Azure SQL DB
Introduction to Azure SQL DBChristopher Foot
 
엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기
엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기
엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기흥래 김
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkDatabricks
 
The DDS Tutorial - Part I
The DDS Tutorial - Part IThe DDS Tutorial - Part I
The DDS Tutorial - Part IAngelo Corsaro
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overviewgjuljo
 
Getting Started with DDS in C++, Java and Scala
Getting Started with DDS in C++, Java and ScalaGetting Started with DDS in C++, Java and Scala
Getting Started with DDS in C++, Java and ScalaAngelo Corsaro
 
Application Load Balancer and the integration with AutoScaling and ECS - Pop-...
Application Load Balancer and the integration with AutoScaling and ECS - Pop-...Application Load Balancer and the integration with AutoScaling and ECS - Pop-...
Application Load Balancer and the integration with AutoScaling and ECS - Pop-...Amazon Web Services
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache CassandraPatrick McFadin
 
Transport layer security.ppt
Transport layer security.pptTransport layer security.ppt
Transport layer security.pptImXaib
 
Operational Intelligence-as-a-Service to fuel SWIFT’s Digital Transformation
Operational Intelligence-as-a-Service to fuel SWIFT’s Digital TransformationOperational Intelligence-as-a-Service to fuel SWIFT’s Digital Transformation
Operational Intelligence-as-a-Service to fuel SWIFT’s Digital TransformationElasticsearch
 
Gateways to Power BI, Connect PowerBI.com to your On-Prem Data
Gateways to Power BI, Connect PowerBI.com to your On-Prem DataGateways to Power BI, Connect PowerBI.com to your On-Prem Data
Gateways to Power BI, Connect PowerBI.com to your On-Prem DataJean-Pierre Riehl
 
DDS Tutorial -- Part I
DDS Tutorial -- Part IDDS Tutorial -- Part I
DDS Tutorial -- Part IAngelo Corsaro
 
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018Amazon Web Services
 
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Edureka!
 

Tendances (20)

Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
Introduction to Azure SQL DB
Introduction to Azure SQL DBIntroduction to Azure SQL DB
Introduction to Azure SQL DB
 
엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기
엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기
엘라스틱서치 클러스터로 수십억 건의 데이터 운영하기
 
Modularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache SparkModularized ETL Writing with Apache Spark
Modularized ETL Writing with Apache Spark
 
The DDS Tutorial - Part I
The DDS Tutorial - Part IThe DDS Tutorial - Part I
The DDS Tutorial - Part I
 
Amazon Cognito Deep Dive
Amazon Cognito Deep DiveAmazon Cognito Deep Dive
Amazon Cognito Deep Dive
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overview
 
Présentation AzureAD ( Identité hybrides et securité)
Présentation AzureAD ( Identité hybrides et securité)Présentation AzureAD ( Identité hybrides et securité)
Présentation AzureAD ( Identité hybrides et securité)
 
Azure: PaaS or IaaS
Azure: PaaS or IaaSAzure: PaaS or IaaS
Azure: PaaS or IaaS
 
Getting Started with DDS in C++, Java and Scala
Getting Started with DDS in C++, Java and ScalaGetting Started with DDS in C++, Java and Scala
Getting Started with DDS in C++, Java and Scala
 
Application Load Balancer and the integration with AutoScaling and ECS - Pop-...
Application Load Balancer and the integration with AutoScaling and ECS - Pop-...Application Load Balancer and the integration with AutoScaling and ECS - Pop-...
Application Load Balancer and the integration with AutoScaling and ECS - Pop-...
 
Storing time series data with Apache Cassandra
Storing time series data with Apache CassandraStoring time series data with Apache Cassandra
Storing time series data with Apache Cassandra
 
Transport layer security.ppt
Transport layer security.pptTransport layer security.ppt
Transport layer security.ppt
 
Operational Intelligence-as-a-Service to fuel SWIFT’s Digital Transformation
Operational Intelligence-as-a-Service to fuel SWIFT’s Digital TransformationOperational Intelligence-as-a-Service to fuel SWIFT’s Digital Transformation
Operational Intelligence-as-a-Service to fuel SWIFT’s Digital Transformation
 
Gateways to Power BI, Connect PowerBI.com to your On-Prem Data
Gateways to Power BI, Connect PowerBI.com to your On-Prem DataGateways to Power BI, Connect PowerBI.com to your On-Prem Data
Gateways to Power BI, Connect PowerBI.com to your On-Prem Data
 
DDS Tutorial -- Part I
DDS Tutorial -- Part IDDS Tutorial -- Part I
DDS Tutorial -- Part I
 
Decentralized Identifiers
Decentralized IdentifiersDecentralized Identifiers
Decentralized Identifiers
 
DDS QoS Unleashed
DDS QoS UnleashedDDS QoS Unleashed
DDS QoS Unleashed
 
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
Cost and Performance Optimisation in Amazon RDS - AWS Summit Sydney 2018
 
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
Azure Data Factory | Moving On-Premise Data to Azure Cloud | Microsoft Azure ...
 

Similaire à Amundsen: From discovering to security data

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentationTao Feng
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationTamikaTannis
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Webinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment PerformanceWebinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment PerformanceLucidworks
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 CareerBuilder.com
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryMark Grover
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationInside Analysis
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationDenodo
 
Balancing Data Democracy with Data Privacy: The LinkedIn Story
Balancing Data Democracy with Data Privacy: The LinkedIn StoryBalancing Data Democracy with Data Privacy: The LinkedIn Story
Balancing Data Democracy with Data Privacy: The LinkedIn StoryAnthony Hsu
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overviewjkvr
 

Similaire à Amundsen: From discovering to security data (20)

How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen PresentationNeo4j GraphTour Santa Monica 2019 - Amundsen Presentation
Neo4j GraphTour Santa Monica 2019 - Amundsen Presentation
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Webinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment PerformanceWebinar: Lucidworks + Thomson Reuters for Improved Investment Performance
Webinar: Lucidworks + Thomson Reuters for Improved Investment Performance
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018 SDSC18 and DSATL Meetup March 2018
SDSC18 and DSATL Meetup March 2018
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
The Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data ImplementationThe Great Lakes: How to Approach a Big Data Implementation
The Great Lakes: How to Approach a Big Data Implementation
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
Advanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data VirtualizationAdvanced Analytics and Machine Learning with Data Virtualization
Advanced Analytics and Machine Learning with Data Virtualization
 
Balancing Data Democracy with Data Privacy: The LinkedIn Story
Balancing Data Democracy with Data Privacy: The LinkedIn StoryBalancing Data Democracy with Data Privacy: The LinkedIn Story
Balancing Data Democracy with Data Privacy: The LinkedIn Story
 
DataSpryng Overview
DataSpryng OverviewDataSpryng Overview
DataSpryng Overview
 

Plus de markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoopmarkgrover
 

Plus de markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 
NYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache HadoopNYC HUG - Application Architectures with Apache Hadoop
NYC HUG - Application Architectures with Apache Hadoop
 

Dernier

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 

Dernier (20)

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 

Amundsen: From discovering to security data

  • 1. Mark Grover | Lyft | @mark_grover Alyssa Ransbury | Square | @alyssaran Slides: cutt.ly/a6n Amundsen: From Discovering to Securing Data
  • 2. Who we are • Product Manager at Lyft • Committer/PMC on various Apache projects • Previously developer on Spark at Cloudera 2 • Security Engineer at Square • Leads and supports on privacy engineering efforts across dozens of product teams
  • 4. >35% consumer time spent on discovering & validating trustworthy data 4 Analyst/DS workflow and time spent on each step Other side effects 1. Interrupt heavy culture ● What data should I use for X? ● Is that trustworthy? 2. Increased DB load SELECT * … LIMIT 100 SELECT COUNT(*) over x days
  • 6. 1. What kind of information? (aka ABC of metadata) 6 Application Context Metadata needed by humans or applications to operate ● Where is the data? ● What are the semantics of the data? Behavior How is data created and used over time? ● Who’s using the data? ● Who created the data? Change Change in data over time ● How is the data evolving over time? ● Evolution of code that generates the data Terminology borrowed from Ground paper Data Lineage TODAY
  • 7. Short answer: Any data within your organization Long answer: 2. About what data? 7 Data stores Schema registry Events / Schemas StreamsPeople Employees TODAY NotebooksDashboard / Reports Processes
  • 8. Develop breadth of applications 8 Metadata Data Discovery & Trust Compliance (GDPR / CCPA / Financial) Security / ACL Data Monitoring / Ops Data Quality Cost Mgmt Data Mainten- ance
  • 9. Search based Lineage based Network based Programmatic Where is the table/dashboard for X? What does it contain? I am changing a data model, who are the owner and most common users? I want to follow a power user in my team. Access metadata programmatically Does this analysis already exist? This table’s delivery was delayed today, I want to notify everyone downstream. I want to bookmark tables of interest and get a feed of data delay, schema change, incidents. Put (pull / push) metadata programmatically Other requirements ● Leverage as much data automatically as possible ● Preferably, open source and healthy community ● Easy to set up Goal: Reduce time to find trusted data w/ versatile graph
  • 10. Meet Amundsen 10 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  • 11. Search for data, or browse
  • 13. See details of the data set
  • 14. See detailed descriptions and profile of the column
  • 15. Preview existing data, if perms allow
  • 16. Search for existing dashboards (aka reports) 16
  • 19. Search for data owned and used by your peers!
  • 21. 21 “This is God’s work” - Head of Analytics, Lyft “I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Data Scientist,Lyft A6n @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages
  • 22. Roles of Amundsen users at Lyft 22 Penetration rate: DS (aka analyst): 81% RS (aka DS): 71% PM: 22% SWE: 17% Cust Serv: 7% (12/390) Sp. Ops: 67% (10/15) Sp. Op Leads: 53% (9/17) Economist: 100% (7/7) Cust. Quality: 78% (7/9) Growth Mktg: 25% (6/24)
  • 24. Roadmap (subject to change, not ordered) • Tighter Lineage integration / visualization • Better view integration • ACL integration, allow only specific roles to edit descriptions • Show search context for what matched • Index more resources (notebooks, Kafka topics, etc.) 24
  • 25. Amundsen Open Source 25 700 Community members 150+ Companies in the community 20+ Companies using in production
  • 26. Amundsen Open Source Community 26 ProminentusersActivecommunity
  • 27. Develop breadth of applications 27 Metadata Data Discovery & Trust Compliance (GDPR / CCPA / Financial) Security / ACL Data Monitoring / Ops Data Quality Cost Mgmt Data Mainten- ance
  • 29. 29 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 30. 30 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 31. 31 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 32. 32 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 33. 33 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources
  • 35. Benefits of a graph
  • 36. The Problem 36 ● Relied mostly on manual work to understand our data ● We have A LOT of data, and growing ● NOT scalable ● Lots of goals related to data privacy on top of requirements from laws like GDPR / CCPA
  • 37. 37 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 1. Store more specific metadata 2. Automate data inventorying 3. Support more granular access controls
  • 38. 1. Store more specific metadata 38
  • 39. 39 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 1. Store more specific metadata
  • 41. PII Semantic Type for data in Column PII Semantic Type 41 A set value describing the general contents of data in a column, with a focus on discovering data that fits into one of three buckets ● Sensitive by itself ● Could be sensitive when taken together with other data ● Links to sensitive data, like an internal identification token Some Column Data
  • 42. Data Storage Security 42 A set value describing the form of data in a column PII Semantic Type for data in Column Some Column Data Data Storage Security for data in Column
  • 43. Data Subject Type 43 A set value describing users whose information exists in a column ● Example: square_product_user PII Semantic Type for data in Column Some Column Data Data Storage Security for data in Column Data Subject Type for data in Column
  • 47. 47 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 2. Automate data inventorying
  • 48. Why? ● We have a lot of data! ● Metadata should be kept current at scale ● New data sources get created all the time ● Mistakes happen 48
  • 49. Flagging Possibly Sensitive Data Locations 49 Some Column Data Data Storage Security encrypted
  • 50. Flagging Possibly Sensitive Data Locations 50 Some Column Data Data Storage Security encrypted Flag
  • 51. Flags Contain information about: ● The metadata type it relates to (i.e. PII Semantic Type, Data Storage Security) ● The possible value (i.e. encrypted for Data Storage Security) ● Human readable description 51 Some Column Data Flag Reason Result Reviewer
  • 52. Checking for Data Sensitivity ● Column name ○ Keyword partial or full match ○ Literal PII match ● Column type ● Table name ● Sample values + Google DLP API 52
  • 54. User Flow 54 Some Column Data Flag Data Storage Security plaintext Result: True Reviewer: person 1 Data Storage Security plaintext Flag PII Semantic Type phone number Result: False Reviewer: person 2
  • 57. 3. Support more granular access controls 57
  • 58. 58 Postgres Hive Redshift ... Presto Github Source File Databuilder Crawler Neo4j Elastic Search Metadata Service Search Service Frontend ServiceML Feature Service Security Service Other Microservices Metadata Sources 3. Support more granular access controls
  • 59. Capability-governed access control ● Purpose - An enumerated, legal-approved reason to access specific PII Semantic Types ● Capability - A combination of Purpose and Data Subject Type 59
  • 61. Capability-governed access control 61 Capability 001 USER SUPPORT CASH APP CUSTOMER PERSON NAME EMAIL ADDRESS
  • 64. Future and Work in Progress ● Gremlin Proxy ● UI components ○ show users exactly the data they have access to based on their capabilities ○ show the data that is tied to a specific capability ○ empower data owners to revoke broad access to some data ● Supporting additional data types ● And beyond! (we are hiring…) 64
  • 66. Develop breadth of applications 66 Metadata Data Discovery & Trust Compliance (GDPR / CCPA / Financial) Security / ACL Data Monitoring / Ops Data Quality Cost Mgmt Data Mainten- ance
  • 67. Get started! Check out Github: github.com/lyft/amundsen Join Slack (~700 users): Link to join on github 67 Check out other blogs and videos: Amundsen for Privacy @ Square Introducing Amundsen Amundsen's architecture YouTube channel
  • 68. Thanks! 68 Mark Grover | Lyft | @mark_grover Alyssa Ransbury | Square | @alyssaran Icons under Creative Commons License from https://thenounproject.com/