Solving Data Discovery Challenges
with Amundsen, an open-source
metadata platform
Tao Feng | tfeng@apache.org
Staff Software Engineer
Who
● Engineer at Lyft Data Platform and
Tools
● Apache Airflow PMC and Committer
● Working on different data products
(Airflow, Amundsen, etc), and led
data org cost attribution effort
● Previously at LinkedIn, Oracle
Agenda
● What is Data Discovery
● Challenges in Data Discovery
● Introducing Amundsen
● Amundsen Architecture
● Deep Dive
● Impact and Future Work
What is Data Discovery
Data-Driven Decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
● Axiom: Good decisions are based on data
● Who needs Data? Anyone who wants to make good decisions
○ HR wants to ensure salaries are competitive with market
○ Politician wants to optimize campaign strategy
Data-Driven Decisions
1. Data is Collected
2. Analyst Finds the Data
3. Analyst Understands the Data
4. Analyst Creates Report
5. Analyst Shares the Results
6. Someone Makes a Decision
Challenges in
Data Discovery
Case Study
● Goal: Predict Meetup Attendance
● Why:
- An unknown number of RSVPs will no-show
- Need to procure pizza, drinks, chairs, etc
● How: Use data from past meetups to build a predictive model
Step 2: Find the Data
● Ask a friend or expert
● Ask in a Slack channel
● Search in the GitHub repos, or other documents
Step 3: Understand the Data
● We find a table called core.meetup_events with columns: attending, not_attending, date, init_date
● Does attending mean they actually showed up or just RSVPed?
● What's the difference between date and init_date?
● Is this data trustworthy and reliable?
Step 3: Understand the Data
● Ask the data owner, but how do we find the owner?
● Look for further documentation on GitHub, Confluence, etc
● Run queries and try to figure it out
SELECT * FROM core.meetup_events LIMIT 100;
Data Discovery is Not Productive
● Data Scientists spend up to 30% of their
time in Data Discovery
● Data Discovery in itself provides little to
no intrinsic value. Impactful work
happens in Analysis.
● The answer to these problems is
Metadata
Introducing Amundsen
What is Amundsen
• In a nutshell, Amundsen is a data discovery and metadata platform that improves the productivity of data analysts, data scientists, and engineers when interacting with data.
• Amundsen is currently hosted by Linux Foundation AI (LFAI) as an incubation project, with open governance and an RFC process (e.g., blog post).
Lyft data discovery before Amundsen existed
• Only a few (20ish) core tables were listed
• Metadata was refreshed through a cron job, with no human curation
• Metadata included: owner, code, ETL SLA (statically defined), table/column description
• The metadata was not easy to extend
Amundsen homepage
Search for datasets
See details of the data set
See detailed descriptions and profile of the column
See dashboards built on this data set
Search for existing dashboards/reports
Dashboard detail page
Search for co-workers!
Search for data owned and used by your peers
Architecture
[Architecture diagram]
● Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Mode Dashboard
● Databuilder Crawler
● Neo4j (pluggable), Elasticsearch (pluggable)
● Metadata Service, Search Service
● Frontend Service
● Other Microservices: ML Feature Service, Security Service
Frontend Service
Metadata Service
• A proxy layer that exposes the graph database through an API
‒ Supports different graph databases: 1) Neo4j (Cypher based), 2) AWS Neptune (Gremlin based)
‒ Supports Apache Atlas as a metadata store engine
• Supports REST APIs for other services to push / pull metadata directly
‒ Service communication is authorized through Envoy RBAC at Lyft
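The proxy-layer idea can be sketched as one interface with a backend behind it. This is a minimal sketch under stated assumptions: `MetadataProxy`, `InMemoryProxy`, `build_proxy`, and the method names are hypothetical and are not Amundsen's actual API, and an in-memory dict stands in for Neo4j, Neptune, or Atlas.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class MetadataProxy(ABC):
    """Hypothetical interface that the metadata service codes against."""

    @abstractmethod
    def get_table_description(self, table_key: str) -> Optional[str]:
        ...

    @abstractmethod
    def put_table_description(self, table_key: str, description: str) -> None:
        ...


class InMemoryProxy(MetadataProxy):
    """Stand-in backend; a real one would issue Cypher or Gremlin queries."""

    def __init__(self) -> None:
        self._descriptions: Dict[str, str] = {}

    def get_table_description(self, table_key: str) -> Optional[str]:
        return self._descriptions.get(table_key)

    def put_table_description(self, table_key: str, description: str) -> None:
        self._descriptions[table_key] = description


def build_proxy(backend: str) -> MetadataProxy:
    """Pick a backend by name; only the stand-in is registered in this sketch."""
    registry = {"in_memory": InMemoryProxy}
    if backend not in registry:
        raise ValueError(f"unsupported backend: {backend}")
    return registry[backend]()


proxy = build_proxy("in_memory")
# Illustrative key and description, not real Lyft metadata.
proxy.put_table_description("hive://core.meetup_events", "One row per meetup event")
```

Callers only see `MetadataProxy`, so swapping the backend would not change the code that pushes or pulls metadata.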
Search Service
• A proxy layer to interact with the search backend
‒ Currently supports Elasticsearch and Apache Atlas as search backends
• Supports different search patterns
‒ Fuzzy search: search based on popularity
‒ Multi-facet search
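Multi-facet search can be illustrated as a term match combined with exact filters on several fields. This is a minimal sketch, assuming a hypothetical `facet_search` helper and made-up documents; a real deployment would delegate this to the search backend rather than filter in Python.

```python
from typing import Dict, List

Document = Dict[str, str]


def facet_search(docs: List[Document], term: str, **facets: str) -> List[Document]:
    """Keep documents whose name contains `term` and whose facets all match."""
    needle = term.lower()
    return [
        doc
        for doc in docs
        if needle in doc.get("name", "").lower()
        and all(doc.get(key) == value for key, value in facets.items())
    ]


# Made-up documents for illustration only.
docs = [
    {"name": "meetup_events", "schema": "core", "database": "hive"},
    {"name": "meetup_events", "schema": "staging", "database": "hive"},
    {"name": "rides", "schema": "core", "database": "redshift"},
]
hits = facet_search(docs, "meetup", schema="core", database="hive")
```

With no facets the term alone decides the match; each extra facet narrows the result set.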
Databuilder
Metadata Sources
Databuilder in action
How is the databuilder orchestrated?
Amundsen uses a workflow engine (e.g., Apache Airflow) to orchestrate Databuilder jobs
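One ingestion run can be pictured as an extract, transform, load chain wrapped in a single function that a workflow engine task would call on a schedule. This is a sketch under stated assumptions: the function names and the record shape are hypothetical and do not reproduce the Databuilder API, and plain lists stand in for the metadata source and the metadata store.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical record shape: one dict per table found in a metadata source.
TableRecord = Dict[str, str]


def extract(source_rows: Iterable[TableRecord]) -> Iterator[TableRecord]:
    """Pull raw table records from a metadata source (here: an in-memory list)."""
    for row in source_rows:
        yield row


def transform(record: TableRecord) -> TableRecord:
    """Normalize a record by building a unique key for the table."""
    key = f"{record['database']}://{record['schema']}.{record['name']}"
    return {**record, "key": key}


def load(records: Iterable[TableRecord], sink: List[TableRecord]) -> int:
    """Write records to the metadata store (here: a list) and return the count."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


def run_job(source_rows: Iterable[TableRecord], sink: List[TableRecord]) -> int:
    """One ingestion run; a workflow engine task would call this on a schedule."""
    return load((transform(r) for r in extract(source_rows)), sink)


store: List[TableRecord] = []
loaded = run_job(
    [{"database": "hive", "schema": "core", "name": "meetup_events"}],
    store,
)
```

Keeping the run behind one callable is what makes it easy to hand to a scheduler: the scheduler only needs to know when to call it, not how it works.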
Current built-in connectors
Deep Dive
Metadata model
1. What kind of information? (aka ABC of metadata)
Application Context
Metadata needed by humans or applications to operate
● Where is the data?
● What are the semantics of the data?
Behavior
How is data created and used over time?
● Who’s using the data?
● Who created the data?
Change
Change in data over time
● How is the data evolving over time?
● Evolution of code that generates the data
2. About what data?
Short answer: Any data within your organization
Long answer (diagram labels; "TODAY" marks what is covered today):
● Data stores
● Schema registry
● Events / Schemas
● Streams
● People (Employees)
● Notebooks
● Dashboards / Reports
● Processes
Dataset
Dataset
• Includes both manually curated and programmatically curated metadata
• Current metadata:
‒ Table description, columns, column descriptions
‒ Last updated timestamp
‒ Partition date range
‒ Tags
‒ Owners, frequent users
‒ Column stats, column usage
‒ Which dashboards use the table
‒ Which Airflow (ETL) task produces the table
‒ GitHub source definition
‒ Unstructured metadata (e.g., data retention), which is easy to extend to cover different companies' metadata requirements
• Challenge: not every dataset defines the same set of metadata or follows the same practice
‒ e.g., Tier, SLA (operational metadata)
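The mix of fixed fields and free-form, extensible metadata can be sketched as a small record type. This is a minimal sketch, assuming hypothetical class and field names (`TableMetadata`, `extra`, `missing_fields`); it is not Amundsen's data model, and the sample values are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ColumnMetadata:
    name: str
    description: Optional[str] = None


@dataclass
class TableMetadata:
    """Hypothetical record combining curated and programmatic metadata."""

    name: str
    description: Optional[str] = None
    columns: List[ColumnMetadata] = field(default_factory=list)
    owners: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    # Free-form key/value pairs for metadata that not every dataset defines.
    extra: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """List the optional fields that nobody has filled in yet."""
        missing = []
        if not self.description:
            missing.append("description")
        if not self.owners:
            missing.append("owners")
        missing.extend(
            f"column:{c.name}" for c in self.columns if not c.description
        )
        return missing


table = TableMetadata(
    name="core.meetup_events",
    columns=[ColumnMetadata("attending"), ColumnMetadata("date", "Event date")],
    extra={"data_retention": "90 days"},  # illustrative value only
)
```

Because every field except the name is optional, datasets that follow different practices still fit the same record, and `missing_fields` shows where curation is still needed.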
User
• Users have the most context / tribal knowledge around data assets.
• Connect users with data entities to surface that tribal knowledge.
Dashboard
• Dashboards represent users' existing research and analysis.
Dashboard
• Current metadata:
‒ Description
‒ Owner
‒ Last updated timestamp, last successful run timestamp, last run status
‒ Tables used in dashboard, queries, charts
‒ Dashboard preview
‒ Tags
• Challenge:
‒ Not every piece of dashboard metadata applies to other dashboard types
Push vs Pull
Pull model vs. Push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● Components: Scheduler, Crawler, Database, Data graph
● Preferred if:
○ Waiting for indexing is ok
○ Easy to bootstrap central metadata
Push Model
● The system (e.g. DB) pushes to a message bus which downstream subscribes to.
● Message format serves as the interface
● Allows for near-real time indexing
● Components: Database, Message queue, Data graph
● Preferred if:
○ Near-real time indexing is important
○ Clean interface exists
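The difference between the two models can be shown with a toy index. This is a minimal sketch, assuming hypothetical function names and an in-memory deque in place of a real message queue; the "data graph" is reduced to a dict.

```python
from collections import deque
from typing import Deque, Dict

# The "data graph" is reduced to a dict of table name -> description.
Index = Dict[str, str]


def pull_once(database: Dict[str, str], index: Index) -> None:
    """Pull model: a scheduled crawler copies the current state into the index."""
    index.clear()
    index.update(database)


def push_event(queue: Deque[Dict[str, str]], table: str, description: str) -> None:
    """Push model: the source system publishes a change message."""
    queue.append({"table": table, "description": description})


def consume(queue: Deque[Dict[str, str]], index: Index) -> int:
    """Push model: the subscriber applies every pending message to the index."""
    applied = 0
    while queue:
        message = queue.popleft()
        index[message["table"]] = message["description"]
        applied += 1
    return applied


database = {"core.meetup_events": "v1"}
pull_index: Index = {}
push_index: Index = {}
queue: Deque[Dict[str, str]] = deque()

pull_once(database, pull_index)          # crawler runs at its scheduled time
database["core.meetup_events"] = "v2"    # the source changes after the crawl
push_event(queue, "core.meetup_events", "v2")
consume(queue, push_index)               # the subscriber applies the change
```

After these steps the pull index still holds the old value until the next crawl, while the push index already reflects the change; the message dict plays the role of the interface between producer and consumer.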
Metadata ingestion
• Pull model ingestion with Neo4j or AWS Neptune as the backend.
‒ We could extend to a hybrid push and pull model if needed
Metadata ingestion
• Push model ingestion with Apache Atlas as the backend (ING blog post)
• Cons: Apache Atlas cannot ingest from an external source (e.g., Redshift) unless that source supports a hook interface (intercepting events, messages or function calls during processing).
Why Graph Database?
Why graph database
• Data entities and their relationships can be represented as a graph
• Performance is better than an RDBMS once the number of nodes and relationships grows large
• Adding new metadata is easy, as it just means adding a new node to the graph
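The point about adding metadata as a new node can be illustrated with a tiny in-memory graph. This is a sketch under stated assumptions: `MetadataGraph`, the node labels, and the relationship names are hypothetical and are not the schema Amundsen uses in Neo4j.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (label, key), e.g. ("Table", "core.meetup_events")


class MetadataGraph:
    """Tiny in-memory graph: labelled nodes plus typed, directed edges."""

    def __init__(self) -> None:
        self.nodes: Set[Node] = set()
        self.edges: Dict[Node, List[Tuple[str, Node]]] = {}

    def add_node(self, label: str, key: str) -> Node:
        node = (label, key)
        self.nodes.add(node)
        return node

    def relate(self, src: Node, rel: str, dst: Node) -> None:
        """Add a directed edge of type `rel` from `src` to `dst`."""
        self.edges.setdefault(src, []).append((rel, dst))

    def neighbors(self, src: Node, rel: str) -> List[Node]:
        """Return every node reachable from `src` over one edge of type `rel`."""
        return [dst for r, dst in self.edges.get(src, []) if r == rel]


graph = MetadataGraph()
table = graph.add_node("Table", "core.meetup_events")
owner = graph.add_node("User", "alice")  # hypothetical user
graph.relate(table, "OWNER", owner)

# A new kind of metadata is one more node label and one more edge type;
# nothing about the existing nodes has to change in this sketch.
tag = graph.add_node("Tag", "meetup")
graph.relate(table, "TAGGED_BY", tag)
```

In a relational layout the same addition would typically mean a new table or column; here it is only new data.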
Search Tradeoff
Search Results
Ranked on Relevance and Popularity
Relevance - search for “apple” on Google
Low relevance High relevance
Popularity - search for “apple” on Google
Low popularity High popularity
Search Results - Striking the balance
Relevance
● Names, Description, Tags, [Owners, Frequent users]
● Different weights for different metadata, e.g., resource name
Popularity
● Querying activity
● Lower weight for automated querying
● Higher weight for ad-hoc querying
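One way to combine the two signals is a weighted field match multiplied by a damped popularity term. This is a minimal sketch, not the ranking Amundsen uses: the weights, the log damping, the multiplication, and the table names are all assumptions chosen for illustration.

```python
import math
from typing import Dict

# Hypothetical weights; the slides say the weights differ, not by how much.
FIELD_WEIGHTS: Dict[str, float] = {"name": 3.0, "tags": 2.0, "description": 1.0}
AD_HOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def relevance(query: str, fields: Dict[str, str]) -> float:
    """Sum the weights of every metadata field that contains the query term."""
    term = query.lower()
    return sum(
        weight
        for name, weight in FIELD_WEIGHTS.items()
        if term in fields.get(name, "").lower()
    )


def popularity(ad_hoc_queries: int, automated_queries: int) -> float:
    """Weighted query count, log-damped so heavy automated traffic counts less."""
    weighted = AD_HOC_WEIGHT * ad_hoc_queries + AUTOMATED_WEIGHT * automated_queries
    return math.log1p(weighted)


def score(
    query: str,
    fields: Dict[str, str],
    ad_hoc_queries: int,
    automated_queries: int,
) -> float:
    """Combine both signals; a resource that does not match scores zero."""
    rel = relevance(query, fields)
    if rel == 0:
        return 0.0
    return rel * (1.0 + popularity(ad_hoc_queries, automated_queries))


# Made-up tables and query counts.
popular = score("rides", {"name": "core.rides"}, 500, 0)
etl_only = score("rides", {"name": "staging.rides_tmp"}, 0, 500)
unrelated = score("rides", {"name": "core.meetup_events"}, 500, 0)
```

Both matching tables have the same relevance, so the one queried ad hoc ranks above the one touched only by automated jobs, and popularity alone never surfaces a table that does not match the query.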
Metadata Source Of Truth
Metadata source of truth
• Centralize all the fragmented metadata
• Treat the Amundsen graph as the metadata source of truth
‒ Unless an upstream source of truth is available (e.g., at Lyft, we define metadata for events in an IDL repo)
Other features
Announcement page
• Plugin client to announce new features or new datasets
Central data quality issue portal
• Central portal for users to
report data issues.
• Users could see all the past
issues as well.
• Users could request further
context / descriptions from
owners through the portal.
Data Preview
• Supports data preview for datasets.
• Plugin client for different BI visualization tools (e.g., Apache Superset).
• Delegates user authorization to Superset to verify whether the given user can access the data.
Data Exploration
• Supports integration between Amundsen and a BI visualization tool for data exploration (Apache Superset by default).
• Allows users to do complex data exploration.
Impact
“This is God’s work” - George X, ex-head of Analytics, Lyft
“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft
Amundsen @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages, 10k+
dashboards
Amundsen Open Source
● 950+ community members
● 150+ companies in the community
● 25+ companies using it in production
Amundsen Open Source Community
Prominent users, active community
Edmunds.com
• Data discovery use case, integrated with an in-house data quality service (e.g., blog post)
• Integrating with Databricks’ Delta analytics platform
ING
• Data Discovery on top of Amundsen with Apache Atlas
• Contributed a lot of security integrations to Amundsen (e.g blog post)
Workday
• Data discovery on their analytics platform, named Goku
• Amundsen is the landing page for Goku
• 1400 users on their platform
Square
• Compliance and regulatory use cases
• Used for security analysis
• Contributed the Gremlin / AWS Neptune integration
• In production (e.g., blog post)
Recent Contributions from the community
• Redash dashboard integration (Asana)
• Tableau dashboard integration (Gusto)
• Looker dashboard integration (in progress, Brex)
• Integrating with Delta analytics platform (in progress, Edmunds)
• ...
Future
Data Lineage
Pattern: Tool Contributed Lineage
● Description: The tool creating the data asset also writes the lineage
● Example: 1) Informatica 2) Hive hook exposes lineage
● Key Benefit: At time of creation
● Key Challenge: No standard way to write lineage
Pattern: Manually linked by User
● Description: Manually added, describing how datasets are linked
● Key Challenge: Does not scale
Pattern: Inferred from DAG
● Description: Extract dependencies based on scheduling
● Example: 1) Airflow lineage 2) Marquez
● Key Benefit: Automatable
● Key Challenge: Doesn’t support field/column level lineage
Pattern: Inferred from SQL
● Description: Programmatically extract lineage by parsing the SQL dialect
● Example: https://github.com/uber/queryparser
● Key Benefit: Accurate, supports all SQL dialects
● Key Challenge: SQL is easier, but long tail of support for others (Spark)
Data Lineage
• Current main Q4 focus
‒ working on UX design for table lineage
• RFC is coming
‒ Provide data model for data lineage
‒ Provide UI for data lineage
‒ Allows different ingestion mechanisms (Push based, SQL parsing, etc)
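A table-lineage data model can be as small as a set of upstream edges plus a traversal. This is a sketch under stated assumptions: `TableLineage` and the table names are hypothetical and do not anticipate the design in the upcoming RFC; edges could be fed by any of the ingestion mechanisms listed above.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, List, Set


class TableLineage:
    """Table-level lineage stored as downstream -> direct upstream tables."""

    def __init__(self) -> None:
        self._upstream: DefaultDict[str, Set[str]] = defaultdict(set)

    def add_edge(self, upstream: str, downstream: str) -> None:
        """Record that `downstream` is built from `upstream`."""
        self._upstream[downstream].add(upstream)

    def all_upstream(self, table: str) -> List[str]:
        """Breadth-first walk over every direct and indirect upstream table."""
        seen: Set[str] = set()
        pending: Deque[str] = deque([table])
        while pending:
            current = pending.popleft()
            for parent in sorted(self._upstream.get(current, ())):
                if parent not in seen:
                    seen.add(parent)
                    pending.append(parent)
        return sorted(seen)


lineage = TableLineage()
# Made-up tables for illustration.
lineage.add_edge("raw.meetup_rsvps", "core.meetup_events")
lineage.add_edge("core.meetup_events", "report.attendance_forecast")
ancestors = lineage.all_upstream("report.attendance_forecast")
```

The `seen` set keeps the walk from looping if the recorded edges ever contain a cycle.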
Machine Learning Feature as entity
• ML Feature as a separate resource entity
‒ Surface feature stats
‒ Surface feature and upstream dataset lineage
‒ Surface various metadata around ML features
Metadata platform
• Support other services' use cases for programmatic metadata access through a GraphQL API
‒ Expose metadata (e.g., which tables are joined with which other tables most frequently) to BI / SQL visualization tools
‒ Integrate with a data quality service to surface health scores and data quality information in Amundsen
• Support hybrid (pull + push) metadata ingestion
‒ Build an SDK to push metadata to Amundsen either through an API or through Kafka
Q & A

Contenu connexe

Tendances

Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...DataWorks Summit
 
Graph Databases – Benefits and Risks
Graph Databases – Benefits and RisksGraph Databases – Benefits and Risks
Graph Databases – Benefits and RisksDATAVERSITY
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?DATAVERSITY
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solidLars Albertsson
 
Big Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media AnalyticsBig Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media Analyticshafeeznazri
 
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022HostedbyConfluent
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKai Wähner
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationDenodo
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDATAVERSITY
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 

Tendances (20)

Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
Streamline Data Governance with Egeria: The Industry's First Open Metadata St...
 
Graph Databases – Benefits and Risks
Graph Databases – Benefits and RisksGraph Databases – Benefits and Risks
Graph Databases – Benefits and Risks
 
Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?Data Warehouse or Data Lake, Which Do I Choose?
Data Warehouse or Data Lake, Which Do I Choose?
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Big Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media AnalyticsBig Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media Analytics
 
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Data Vault Overview
Data Vault OverviewData Vault Overview
Data Vault Overview
 
Azure purview
Azure purviewAzure purview
Azure purview
 
Enabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data VirtualizationEnabling a Data Mesh Architecture with Data Virtualization
Enabling a Data Mesh Architecture with Data Virtualization
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 

Similaire à Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform

Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenDatabricks
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Group
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata Hortonworks
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeDATAVERSITY
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchSheetal Pratik
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Perficient, Inc.
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 

Similaire à Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform (20)

Data Discovery at Databricks with Amundsen
Data Discovery at Databricks with AmundsenData Discovery at Databricks with Amundsen
Data Discovery at Databricks with Amundsen
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Skillwise Big Data part 2
Skillwise Big Data part 2Skillwise Big Data part 2
Skillwise Big Data part 2
 
Skilwise Big data
Skilwise Big dataSkilwise Big data
Skilwise Big data
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata The Value of the Modern Data Architecture with Apache Hadoop and Teradata
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Hadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data ModelHadoop meets Agile! - An Agile Big Data Model
Hadoop meets Agile! - An Agile Big Data Model
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
LinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbenchLinkedInSaxoBankDataWorkbench
LinkedInSaxoBankDataWorkbench
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence...
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 

Plus de Databricks

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionDatabricks
 

Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform

  • 1. Solving Data Discovery Challenges with Amundsen, an open-source metadata platform Tao Feng | tfeng@apache.org Staff Software Engineer
  • 2. Who ● Engineer at Lyft Data Platform and Tools ● Apache Airflow PMC and Committer ● Working on different data products (Airflow, Amundsen, etc), and led data org cost attribution effort ● Previously at Linkedin, Oracle
  • 3. Agenda ● What is Data Discovery ● Challenges in Data Discovery ● Introducing Amundsen ● Amundsen Architecture ● Deep Dive ● Impact and Future Work
  • 4. What is Data Discovery
  • 5. Data-Driven Decisions (Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers) ● Axiom: Good decisions are based on data ● Who needs data? Anyone who wants to make good decisions ○ HR wants to ensure salaries are competitive with the market ○ A politician wants to optimize campaign strategy
  • 6. Data-Driven Decisions 1. Data is Collected 2. Analyst Finds the Data 3. Analyst Understands the Data 4. Analyst Creates Report 5. Analyst Shares the Results 6. Someone Makes a Decision
  • 8. Case Study ● Goal: Predict Meetup Attendance ● Why: ‒ An unknown number of RSVPs will no-show ‒ Need to procure pizza, drinks, chairs, etc. ● How: Use data from past meetups to build a predictive model
  • 9. Step 2: Find the Data ● Ask a friend or expert ● Ask in a Slack channel ● Search in the GitHub repos, or other documents
  • 10. Step 3: Understand the Data ● We find a table called core.meetup_events with columns: attending, not_attending, date, init_date ● Does attending mean they actually showed up or just RSVPed? ● What's the difference between date and init_date? ● Is this data trustworthy and reliable?
  • 11. Step 3: Understand the Data ● Ask the data owner, but how do we find the owner? ● Look for further documentation on GitHub, Confluence, etc. ● Run queries and try to figure it out: SELECT * FROM core.meetup_events LIMIT 100;
  • 12. Data Discovery is Not Productive ● Data Scientists spend up to 30% of their time in Data Discovery ● Data Discovery in itself provides little to no intrinsic value. Impactful work happens in Analysis. ● The answer to these problems is Metadata
  • 14. What is Amundsen • In a nutshell, Amundsen is a data discovery and metadata platform for improving the productivity of data analysts, data scientists, and engineers when interacting with data. • Amundsen is currently hosted at Linux Foundation AI (LFAI) as an incubation project with open governance and an RFC process (e.g. blog post).
  • 15. Lyft data discovery before Amundsen existed • Only a few (20ish) core tables were listed • Metadata was refreshed through a cron job, with no human curation • Metadata included: owner, code, ETL SLA (statically defined), table/column description • The metadata was not easy to extend
  • 18. See details of the data set
  • 19. See detailed descriptions and profile of the column
  • 20. See dashboards built on this data set
  • 21. Search for existing dashboards/reports
  • 24. Search for data owned and used by your peers
  • 26. [Architecture diagram] Metadata sources: Postgres, Hive, Redshift, ..., Presto, Mode Dashboard ● Databuilder crawler ● Storage (pluggable): Neo4j, Elasticsearch ● Services: Metadata Service, Search Service, Frontend Service ● Other microservices: ML Feature Service, Security Service
  • 28. Metadata Service • A proxy layer that interacts with the graph database through an API ‒ Supports different graph DBs: 1) Neo4j (Cypher based), 2) AWS Neptune (Gremlin based) ‒ Supports Apache Atlas as the metadata store engine • Supports REST APIs for other services pushing / pulling metadata directly ‒ Service communication is authorized through Envoy RBAC at Lyft
  • 29. Search Service • A proxy layer that interacts with the search backend ‒ Currently supports Elasticsearch and Apache Atlas as search backends • Supports different search patterns ‒ Fuzzy search: search based on popularity ‒ Multi-facet search
  • 33. How is the Databuilder orchestrated? Amundsen uses a workflow engine (e.g. Apache Airflow) to orchestrate Databuilder jobs
  • 37. 1. What kind of information? (aka ABC of metadata) ● Application Context: metadata needed by humans or applications to operate ‒ Where is the data? ‒ What are the semantics of the data? ● Behavior: how is data created and used over time? ‒ Who’s using the data? ‒ Who created the data? ● Change: change in data over time ‒ How is the data evolving over time? ‒ Evolution of code that generates the data ● (slide label: TODAY)
  • 38. 2. About what data? ● Short answer: Any data within your organization ● Long answer: Data stores, Dashboard / Reports, Events / Schemas (Schema registry), Streams, People (Employees), Notebooks, Processes ● (slide label: TODAY)
  • 40. Dataset • Includes metadata that is both manually and programmatically curated • Current metadata: ‒ Table description, columns, column descriptions ‒ Last updated timestamp ‒ Partition date range ‒ Tags ‒ Owners, frequent users ‒ Column stats, column usage ‒ Which dashboards use it ‒ Which Airflow (ETL) task produces it ‒ GitHub source definition ‒ Unstructured metadata (e.g. data retention), which is easy to extend to cover different companies' metadata requirements • Challenge: not every dataset defines the same set of metadata or follows the same practice ‒ Tier, SLA (operational metadata)
  • 41. User • Users have the most context / tribal knowledge around data assets. • Connect users with data entities to surface that tribal knowledge.
  • 42. Dashboard • Dashboards represent users' existing research and analysis.
  • 43. Dashboard • Current metadata: ‒ Description ‒ Owner ‒ Last updated timestamp, last successful run timestamp, last run status ‒ Tables used in dashboard, queries, charts ‒ Dashboard preview ‒ Tags • Challenge: ‒ Not every piece of dashboard metadata is applicable to other dashboard types
  • 45. Pull model vs. Push model ● Pull Model: periodically update the index by pulling from the system (e.g. database) via crawlers. Diagram components: Scheduler, Crawler, Database, Data graph. Preferred if ‒ Waiting for indexing is OK ‒ Easy to bootstrap central metadata ● Push Model: the system (e.g. DB) pushes to a message bus which downstream subscribes to; the message format serves as the interface; allows for near-real-time indexing. Diagram components: Database, Message queue, Data graph. Preferred if ‒ Near-real-time indexing is important ‒ Clean interface exists
  • 46. Metadata ingestion • Pull model ingestion with Neo4j or AWS Neptune as the backend. ‒ We could extend to a hybrid push and pull model if needed
  • 47. Metadata ingestion • Push model ingestion with Apache Atlas as the backend (ING blog post) • Cons: Apache Atlas cannot ingest from an external source (e.g. Redshift) that does not provide a hook interface (intercepting events, messages or function calls during processing).
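The difference between the two ingestion models on slides 45 to 47 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory dictionary standing in for a source database, the `queue.Queue` standing in for a message bus, and every function name here are hypothetical and are not part of Amundsen or Databuilder.

```python
import queue
from typing import Dict

# Hypothetical stand-in for a metadata source (table name -> description).
SOURCE_DB: Dict[str, str] = {"core.meetup_events": "Past meetup RSVPs"}


def pull_once(source: Dict[str, str], index: Dict[str, str]) -> int:
    """Pull model: a scheduled crawler reads the source and refreshes the index."""
    updated = 0
    for table, description in source.items():
        if index.get(table) != description:
            index[table] = description
            updated += 1
    return updated


def push_event(bus: "queue.Queue[dict]", table: str, description: str) -> None:
    """Push model: the source system emits a message when its metadata changes."""
    bus.put({"table": table, "description": description})


def consume(bus: "queue.Queue[dict]", index: Dict[str, str]) -> int:
    """Downstream subscriber: applies every queued message to the index."""
    applied = 0
    while not bus.empty():
        msg = bus.get()
        index[msg["table"]] = msg["description"]
        applied += 1
    return applied


# Pull: the index is only as fresh as the last scheduled crawl.
pull_index: Dict[str, str] = {}
pulled = pull_once(SOURCE_DB, pull_index)

# Push: the index is updated as soon as the message is consumed.
push_index: Dict[str, str] = {}
bus: "queue.Queue[dict]" = queue.Queue()
push_event(bus, "core.meetup_events", "Past meetup RSVPs")
pushed = consume(bus, push_index)
```

Both paths end with the same index contents. The trade-off is that the pull path needs nothing from the source beyond read access, while the push path needs the source to emit messages in an agreed format.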
  • 49. Why graph database • Data entities and their relationships can be represented as a graph • Performance is better than an RDBMS once the number of nodes and relationships grows large • Adding a new kind of metadata is easy, as it is just adding a new node in the graph
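As a rough illustration of why a graph fits this problem, the sketch below models tables, users and tags as nodes and their relationships as labelled edges. It is a toy in-memory structure, not Neo4j or the Amundsen data model; the class, the relationship names (`OWNER`, `TAGGED_BY`) and the example user are assumptions made for the example.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple


class MetadataGraph:
    """Toy property graph: nodes keyed by string, edges as (relation, target)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, str]] = {}
        self.edges: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_node(self, key: str, label: str, **props: str) -> None:
        self.nodes[key] = {"label": label, **props}

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges[src].append((relation, dst))

    def neighbors(self, src: str, relation: str) -> List[str]:
        # .get avoids creating an empty entry for unknown nodes.
        return [dst for rel, dst in self.edges.get(src, []) if rel == relation]


graph = MetadataGraph()
graph.add_node("core.meetup_events", "Table", description="Past meetup RSVPs")
graph.add_node("alice@example.com", "User")  # hypothetical owner
graph.add_edge("core.meetup_events", "OWNER", "alice@example.com")

# A new kind of metadata is just a new node plus an edge; no schema migration.
graph.add_node("tag:meetup", "Tag")
graph.add_edge("core.meetup_events", "TAGGED_BY", "tag:meetup")

owners = graph.neighbors("core.meetup_events", "OWNER")
tags = graph.neighbors("core.meetup_events", "TAGGED_BY")
```

Adding the tag required no change to the existing table or user nodes, which is the extensibility point the slide makes.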
  • 51. Search Results Ranked on Relevance and Popularity
  • 52. Relevance - search for “apple” on Google Low relevance High relevance
  • 53. Popularity - search for “apple” on Google Low popularity High popularity
  • 54. Search Results - Striking the balance ● Relevance: ‒ Names, Description, Tags, [Owners, Frequent users] ‒ Different weights for different metadata, e.g. resource name ● Popularity: ‒ Querying activity ‒ Lower weight for automated querying ‒ Higher weight for ad-hoc querying
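One way to picture the balance between relevance and popularity is a small scoring function. The sketch below is not Amundsen's ranking (which is delegated to the search backend); the field weights, the query-type weights, the log damping and the way the two scores are combined are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List

# Assumed weights: a match on the resource name counts more than other fields.
FIELD_WEIGHTS = {"name": 3.0, "tags": 2.0, "description": 1.0}
# Assumed weights: ad-hoc queries count more than automated ones.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


@dataclass
class Table:
    name: str
    description: str
    tags: List[str]
    adhoc_queries: int
    automated_queries: int


def relevance(table: Table, term: str) -> float:
    """Sum the weights of every metadata field that contains the search term."""
    term = term.lower()
    score = 0.0
    if term in table.name.lower():
        score += FIELD_WEIGHTS["name"]
    if any(term in tag.lower() for tag in table.tags):
        score += FIELD_WEIGHTS["tags"]
    if term in table.description.lower():
        score += FIELD_WEIGHTS["description"]
    return score


def popularity(table: Table) -> float:
    """Log-damped query count, with automated querying weighted lower."""
    weighted = (ADHOC_WEIGHT * table.adhoc_queries
                + AUTOMATED_WEIGHT * table.automated_queries)
    return math.log1p(weighted)


def rank(tables: List[Table], term: str) -> List[Table]:
    """Keep only matching tables, ordered by relevance boosted by popularity."""
    matches = [t for t in tables if relevance(t, term) > 0]
    return sorted(matches,
                  key=lambda t: relevance(t, term) * (1 + popularity(t)),
                  reverse=True)


tables = [
    Table("staging.meetup_events_tmp", "scratch copy", [], 0, 5000),
    Table("core.rides", "Ride facts", ["rides"], 1000, 1000),
    Table("core.meetup_events", "Meetup RSVPs", ["meetup"], 200, 50),
]
ranked_names = [t.name for t in rank(tables, "meetup")]
```

With these assumed weights, the heavily automated staging table ranks below the curated core table even though it is queried far more often, and the unrelated table is excluded entirely.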
  • 56. Metadata source of truth • Centralize all the fragmented metadata • Treat Amundsen graph as metadata source of truth ‒ Unless upstream source of truth is available (E.g at Lyft, we define metadata for events in IDL repo)
  • 58. Announcement page • Plugin client to support new feature or new datasets
  • 59. Central data quality issue portal • Central portal for users to report data issues. • Users could see all the past issues as well. • Users could request further context / descriptions from owners through the portal.
  • 60. Data Preview • Supports data preview for datasets. • Plugin client with different BI Viz tools (e.g Apache Superset). • Delegate the user authz to Superset to verify whether the given user could access the data.
  • 61. Data Exploration • Supports integration between Amundsen and BI Viz tool for data exploration (e.g Apache Superset by default). • Allows users to do complex data exploration.
  • 63. ● “This is God’s work” - George X, ex-head of Analytics, Lyft ● “I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft ● Amundsen @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages, 10k+ dashboards
  • 64. Amundsen Open Source ● 950+ community members ● 150+ companies in the community ● 25+ companies using it in production
  • 65. Amundsen Open Source Community ● Prominent users ● Active community
  • 66. Edmunds.com • Data Discovery use case and integrated with in-house Data quality service (e.g blog post) • Integrating with Databricks’ Delta analytics platform
  • 67. ING • Data Discovery on top of Amundsen with Apache Atlas • Contributed a lot of security integrations to Amundsen (e.g blog post)
  • 68. Workday • Data Discovery on their analytics platform, named Goku • Amundsen is the landing page for Goku • 1400 users on their platform
  • 69. Square • Compliance and regulatory use cases • Used for security analysis • Contributed the Gremlin / AWS Neptune integration • Production phase (e.g. blog post)
  • 70. Recent Contributions from the community • Redash dashboard integration (Asana) • Tableau dashboard integration (Gusto) • Looker dashboard integration (in progress, Brex) • Integrating with Delta analytics platform (in progress, Edmunds) • ...
  • 72. Data Lineage (table columns: Pattern; Description; Example; Key Benefit; Key Challenge)
    ● Tool Contributed Lineage. Description: the tool creating the data asset also writes the lineage. Example: 1) Informatica 2) Hive hook. Key benefit: exposes lineage at time of creation. Key challenge: no standard way to write lineage.
    ● Manually linked by User. Description: users manually add and describe how datasets are linked. Key challenge: does not scale.
    ● Inferred from DAG. Description: extract dependencies based on scheduling. Example: 1) Airflow lineage 2) Marquez. Key benefit: automatable. Key challenge: doesn’t support field/column level lineage.
    ● Inferred from SQL. Description: programmatically extract lineage by parsing the SQL dialect. Example: https://github.com/uber/queryparser. Key benefit: accurate; supports all SQL dialects. Key challenge: SQL is easier, but there is a long tail of support for others (Spark).
  • 73. Data Lineage • Current main Q4 focus ‒ working on UX design for table lineage • RFC is coming ‒ Provide data model for data lineage ‒ Provide UI for data lineage ‒ Allows different ingestion mechanisms (Push based, SQL parsing, etc)
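To make the "Inferred from SQL" pattern concrete, here is a deliberately naive sketch that pulls a target table and its source tables out of an `INSERT INTO ... SELECT` statement with regular expressions. It is not how queryparser or Amundsen work; a real implementation would use a proper SQL parser, and this version would mis-handle CTEs, subquery aliases, quoted identifiers and most dialect-specific syntax. The table and column names in the example query are hypothetical.

```python
import re
from typing import Dict, Set

# Naive patterns (assumption: plain, unquoted, optionally schema-qualified names).
TARGET_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def infer_lineage(sql: str) -> Dict[str, Set[str]]:
    """Return {target_table: {source_tables}} or {} if no INSERT INTO is found."""
    target = TARGET_RE.search(sql)
    if target is None:
        return {}
    target_name = target.group(1)
    sources = set(SOURCE_RE.findall(sql)) - {target_name}
    return {target_name: sources}


# Hypothetical ETL statement; only core.meetup_events appears in the talk.
etl_sql = """
INSERT INTO core.meetup_summary
SELECT e.date, COUNT(*) AS rsvps
FROM core.meetup_events e
JOIN core.venues v ON e.venue_id = v.id
GROUP BY e.date
"""
lineage = infer_lineage(etl_sql)
```

The result maps the written table to the tables it reads from, which is the edge set a lineage graph needs. The long tail of non-SQL jobs (e.g. Spark) mentioned on the slide is exactly what a text-based approach like this cannot cover.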
  • 74. Machine Learning Feature as entity • ML Feature as a separate resource entity ‒ Surface feature stats ‒ Surface feature and upstream dataset lineage ‒ Surface various metadata around ML features
  • 75. Metadata platform • Support programmatic metadata access for other services through a GraphQL API ‒ Expose metadata (e.g. which tables are joined with which other tables most frequently) to BI SQL visualization tools ‒ Integrate with a data quality service to surface health score and data quality information in Amundsen • Support hybrid (pull + push) metadata ingestion ‒ Build an SDK to push metadata to Amundsen either through the API or through Kafka
  • 76. Q & A
  • 77. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.