Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform

Solving Data Discovery Challenges
with Amundsen, an open-source
metadata platform
Tao Feng | tfeng@apache.org
Staff Software Engineer

Who
● Engineer at Lyft Data Platform and
Tools
● Apache Airflow PMC and Committer
● Working on different data products
(Airflow, Amundsen, etc), and led
data org cost attribution effort
● Previously at LinkedIn, Oracle

Agenda
● What is Data Discovery
● Challenges in Data Discovery
● Introducing Amundsen
● Amundsen Architecture
● Deep Dive
● Impact and Future Work

Data-Driven Decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
● Axiom: Good decisions are based on data
● Who needs Data? Anyone who wants to make good decisions
○ HR wants to ensure salaries are competitive with market
○ Politician wants to optimize campaign strategy

Data-Driven Decisions
1. Data is Collected
2. Analyst Finds the Data
3. Analyst Understands the Data
4. Analyst Creates Report
5. Analyst Shares the Results
6. Someone Makes a Decision

Case Study
● Goal: Predict Meetup Attendance
● Why:
- An unknown number of RSVPs will no-show
- Need to procure pizza, drinks, chairs, etc
● How: Use data from past meetups to build a predictive model

Step 2: Find the Data
● Ask a friend or expert
● Ask in a Slack channel
● Search in the GitHub repos, or other documents

Step 3: Understand the Data
● We find a table called core.meetup_events with columns:
attending, not_attending, date, init_date
● Does attending mean they actually showed up or just RSVPed?
● What's the difference between date and init_date?
● Is this data trustworthy and reliable?

Step 3: Understand the Data
● Ask the data owner, but how do we find the owner?
● Look for further documentation on GitHub, Confluence, etc
● Run queries and try to figure it out
SELECT * FROM core.meetup_events LIMIT 100;

Data Discovery is Not Productive
● Data Scientists spend up to 30% of their
time in Data Discovery
● Data Discovery in itself provides little to
no intrinsic value. Impactful work
happens in Analysis.
● The answer to these problems is
Metadata

What is Amundsen
• In a nutshell, Amundsen is a data discovery and metadata platform for improving the
productivity of data analysts, data scientists, and engineers when interacting with data.
• Amundsen is currently hosted by Linux Foundation AI (LFAI) as an incubation project,
with open governance and an RFC process (e.g. blog post).

Data discovery at Lyft before Amundsen
• Only a few (20ish) core tables were listed
• Metadata was refreshed through a cron job, with no human curation
• Metadata included: owner, code, ETL SLA (statically defined), table/column
description
• The metadata was not easy to extend

See detailed descriptions and profile of the column

See dashboards built on this data set

Search for existing dashboards/reports

Search for data owned and used by your peers

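[Architecture diagram: Metadata sources (Postgres, Hive, Redshift, ..., Presto, Mode Dashboard) feed the Databuilder Crawler, which loads Neo4j (pluggable) and Elasticsearch (pluggable). The Metadata Service and the Search Service sit on top of these stores and serve the Frontend Service and other microservices (ML Feature Service, Security Service).]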

Metadata Service
• A proxy layer that exposes the graph database through an API
‒ Supports different graph DBs: 1) Neo4j (Cypher based), 2) AWS Neptune
(Gremlin based)
‒ Supports Apache Atlas as the metadata store engine
• Supports REST APIs so other services can push / pull metadata directly
‒ Service communication is authorized through Envoy RBAC at Lyft
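The proxy-layer idea can be sketched in a few lines of Python. This is a minimal illustration, not Amundsen's actual code: the names MetadataProxy, InMemoryProxy, and make_proxy, and the table key format, are all hypothetical, and the in-memory backend stands in for a real Neo4j, Neptune, or Atlas client.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class MetadataProxy(ABC):
    """Hypothetical interface the service layer codes against."""

    @abstractmethod
    def get_table_description(self, table_key: str) -> Optional[str]:
        ...

    @abstractmethod
    def put_table_description(self, table_key: str, description: str) -> None:
        ...


class InMemoryProxy(MetadataProxy):
    """Stand-in backend; a real one would issue Cypher or Gremlin queries."""

    def __init__(self) -> None:
        self._descriptions: Dict[str, str] = {}

    def get_table_description(self, table_key: str) -> Optional[str]:
        return self._descriptions.get(table_key)

    def put_table_description(self, table_key: str, description: str) -> None:
        self._descriptions[table_key] = description


def make_proxy(backend: str) -> MetadataProxy:
    # Hypothetical registry: the backend name would come from service config.
    backends = {"in_memory": InMemoryProxy}
    if backend not in backends:
        raise ValueError(f"unsupported backend: {backend}")
    return backends[backend]()


proxy = make_proxy("in_memory")
proxy.put_table_description("hive://core.meetup_events", "One row per meetup event")
description = proxy.get_table_description("hive://core.meetup_events")
```

Because callers only see the abstract interface, swapping the storage engine becomes a configuration change rather than a code change in every caller.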

Search Service
• A proxy layer to interact with the search backend
‒ Currently supports Elasticsearch and Apache Atlas as search backends
• Supports different search patterns
‒ Fuzzy search: search based on popularity
‒ Multi-facet search

How is the databuilder orchestrated?
Amundsen uses a workflow engine (e.g. Apache Airflow) to orchestrate Databuilder jobs

1. What kind of information? (aka ABC of metadata)
Application Context
Metadata needed by humans or applications to operate
● Where is the data?
● What are the semantics of the data?
Behavior
How is data created and used over time?
● Who’s using the data?
● Who created the data?
Change
Change in data over time
● How is the data evolving over time?
● Evolution of code that generates the data
TODAY

2. About what data?
Short answer: Any data within your organization
Long answer:
[Diagram of data types: Data stores, Schema registry, Events / Schemas, Streams, People (Employees), Notebooks, Dashboard / Reports, Processes. A "TODAY" label marks the types covered today.]

Dataset
• Includes both manually curated and programmatically curated metadata
• Current metadata:
‒ Table description, columns, column descriptions
‒ Last updated timestamp
‒ Partition date range
‒ Tags
‒ Owners, Frequent users
‒ Column stats, column usage
‒ Used in which dashboard
‒ Produced by which Airflow (ETL) task
‒ GitHub source definition
‒ Unstructured metadata (e.g. data retention), which is easy to extend to cover different
companies' metadata requirements
• Challenge: not every dataset defines the same set of metadata or
follows the same practice
‒ Tier, SLA (operational metadata)
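One way to cope with datasets that do not all define the same metadata is to make curated fields optional and keep a free-form map for company-specific extras. The sketch below assumes this approach; the TableMetadata class, its field names, and the "90 days" value are illustrative and not Amundsen's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableMetadata:
    # Field names are illustrative, not Amundsen's actual data model.
    name: str
    description: Optional[str] = None
    owners: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)
    tier: Optional[str] = None  # operational metadata, often missing
    sla: Optional[str] = None
    # Free-form key/value pairs for company-specific needs (e.g. data retention).
    extra: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """List curated fields that nobody has filled in yet."""
        candidates = {
            "description": self.description,
            "owners": self.owners,
            "tier": self.tier,
            "sla": self.sla,
        }
        return [key for key, value in candidates.items() if not value]


table = TableMetadata(
    name="core.meetup_events",
    owners=["data-platform"],
    extra={"data_retention": "90 days"},  # made-up example value
)
gaps = table.missing_fields()
```

Here gaps lists the fields that still need curation, which is one way to make uneven metadata coverage visible instead of hiding it.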

User
• Users have the most context / tribal knowledge around data assets.
• Connect users with data entities to surface that tribal knowledge.

Dashboard
• Dashboards represent users' existing research and analysis.

Dashboard
• Current metadata:
‒ Description
‒ Owner
‒ Last updated timestamp, last successful run timestamp, last run status
‒ Tables used in dashboard, queries, charts
‒ Dashboard preview
‒ Tags
• Challenge:
‒ Not all dashboard metadata applies to every dashboard type

Pull model vs. Push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● [Diagram: Database → Crawler (triggered by a Scheduler) → Data graph]
● Preferred if:
○ Waiting for indexing is ok
○ Easy to bootstrap central metadata
Push Model
● The system (e.g. DB) pushes to a message bus which downstream subscribes to.
● Message format serves as the interface
● Allows for near-real time indexing
● [Diagram: Database → Message queue → Data graph]
● Preferred if:
○ Near-real time indexing is important
○ Clean interface exists
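The difference between the two models can be made concrete with a small Python sketch. Everything here is a simplification under stated assumptions: a dict stands in for the data graph, a deque stands in for the message queue, and the function names crawl and drain and the table core.rsvps are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

DataGraph = Dict[str, dict]  # table name -> metadata; stand-in for the data graph


def crawl(list_tables: Callable[[], List[dict]], graph: DataGraph) -> None:
    """Pull model: a scheduler calls this periodically to re-read the source."""
    for record in list_tables():
        graph[record["name"]] = record


def drain(queue: Deque[dict], graph: DataGraph) -> None:
    """Push model: the source emits messages; the subscriber applies them."""
    while queue:
        message = queue.popleft()
        graph[message["name"]] = message


# Pull: the graph is only as fresh as the last crawl.
source_tables = [{"name": "core.meetup_events", "columns": 4}]
pull_graph: DataGraph = {}
crawl(lambda: source_tables, pull_graph)
source_tables.append({"name": "core.rsvps", "columns": 3})  # unseen until next crawl

# Push: each change arrives as a message in an agreed format.
push_graph: DataGraph = {}
queue: Deque[dict] = deque()
queue.append({"name": "core.meetup_events", "columns": 4})
queue.append({"name": "core.rsvps", "columns": 3})
drain(queue, push_graph)
```

In the pull case the new table only shows up after the next scheduled crawl, while in the push case it is indexed as soon as its message is consumed; the price is that the source system must agree on and emit the message format.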

Metadata ingestion
• Pull model ingestion with Neo4j or AWS Neptune as the backend.
‒ We could extend to a push and pull hybrid model if needed

Metadata ingestion
• Push model ingestion with Apache Atlas as the backend (ING blog post)
• Cons: Apache Atlas cannot ingest from an external source (e.g. Redshift)
that does not provide a hook interface (for intercepting events, messages, or function
calls during processing).

Why graph database
• Data entities and their relationships can be represented as a graph
• Performance is better than an RDBMS once the number of nodes and
relationships becomes large
• Adding new metadata is easy, as it is just adding a new node to the
graph
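A toy property graph in plain Python shows why extending the model is cheap. This is only a sketch under assumptions: the helper functions and the OWNED_BY and TAGGED relation names are hypothetical, and a real deployment would use a graph database such as Neo4j rather than dictionaries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

nodes: Dict[str, dict] = {}
edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)


def add_node(key: str, label: str, **props) -> None:
    """Register an entity (table, user, tag, ...) with optional properties."""
    nodes[key] = {"label": label, **props}


def add_edge(src: str, relation: str, dst: str) -> None:
    """Connect two entities with a named relationship."""
    edges[src].append((relation, dst))


def neighbors(key: str, relation: str) -> List[str]:
    """Follow one relationship type outward from a node."""
    return [dst for rel, dst in edges[key] if rel == relation]


add_node("core.meetup_events", "Table")
add_node("alice", "User")
add_edge("core.meetup_events", "OWNED_BY", "alice")

# New kind of metadata: one more node and one more edge, no schema migration.
add_node("tag:core", "Tag")
add_edge("core.meetup_events", "TAGGED", "tag:core")

owners = neighbors("core.meetup_events", "OWNED_BY")
tags = neighbors("core.meetup_events", "TAGGED")
```

Adding the tag required no change to the existing table or user nodes, which is the property the slide points to.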

Search Results
Ranked on Relevance and Popularity

Relevance - search for “apple” on Google
Low relevance High relevance

Popularity - search for “apple” on Google
Low popularity High popularity

Search Results - Striking the balance
Relevance
● Names, Description, Tags, [Owners, Frequent users]
● Different weights for different metadata, e.g., resource name
Popularity
● Querying activity
● Lower weight for automated querying
● Higher weight for ad-hoc querying
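A rough Python sketch of how relevance and popularity might be blended follows. The weights, the substring matching, and the choice to multiply the two signals are all assumptions made for illustration; the slides do not give the actual formula or values.

```python
from typing import Dict

# Illustrative weights only; the slides do not give actual values.
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0, "tags": 1.5}
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


def relevance(query: str, doc: Dict[str, str]) -> float:
    """Sum the weights of every field that contains the query term."""
    term = query.lower()
    return sum(
        weight
        for field_name, weight in FIELD_WEIGHTS.items()
        if term in doc.get(field_name, "").lower()
    )


def popularity(adhoc_queries: int, automated_queries: int) -> float:
    """Weight human (ad-hoc) querying above automated querying."""
    return ADHOC_WEIGHT * adhoc_queries + AUTOMATED_WEIGHT * automated_queries


def score(query: str, doc: Dict[str, str], adhoc: int, automated: int) -> float:
    # Multiplying means an irrelevant table never ranks, however popular it is.
    return relevance(query, doc) * (1.0 + popularity(adhoc, automated))


events = {"name": "meetup_events", "description": "One row per meetup", "tags": "core"}
staging = {"name": "tmp_meetup_staging", "description": "", "tags": ""}

events_score = score("meetup", events, adhoc=20, automated=100)
staging_score = score("meetup", staging, adhoc=0, automated=200)
```

With these made-up numbers the curated table that people query by hand outranks the staging table that is only hit by scheduled jobs, even though the staging table receives more queries overall.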

Metadata source of truth
• Centralize all the fragmented metadata
• Treat Amundsen graph as metadata source of truth
‒ Unless an upstream source of truth is available (e.g. at Lyft, we define metadata for events in the IDL repo)

Announcement page
• Plugin client to support announcements of new features or new datasets

Central data quality issue portal
• Central portal for users to
report data issues.
• Users could see all the past
issues as well.
• Users could request further
context / descriptions from
owners through the portal.

Data Preview
• Supports data preview for
datasets.
• Plugin client for different BI Viz
tools (e.g. Apache Superset).
• Delegates user authorization (authz) to
Superset to verify whether the
given user can access the
data.

Data Exploration
• Supports integration between
Amundsen and a BI Viz tool for
data exploration (Apache
Superset by default).
• Allows users to do complex data
exploration.

“This is God’s work” - George X, ex-head of Analytics, Lyft
“I was on call and I’m confident 50% of the questions could have been answered by a simple search in Amundsen” - Bomee P, DS, Lyft
Amundsen @ Lyft: 750+ WAUs, 150k+ tables, 4k+ employee pages, 10k+ dashboards

Amundsen Open Source
● 950+ community members
● 150+ companies in the community
● 25+ companies using in production

Amundsen Open Source Community
Prominent users, Active community

Edmunds.com
• Data Discovery use case, integrated with an in-house data quality
service (e.g. blog post)
• Integrating with Databricks’ Delta analytics platform

ING
• Data Discovery on top of Amundsen with Apache Atlas
• Contributed many security integrations to Amundsen (e.g. blog post)

Workday
• Data Discovery on their analytics platform, named Goku
• Amundsen is the landing page for Goku
• 1400 users using their platform

Square
• Compliance and regulatory use cases
• Used for security analysis
• Contributed the Gremlin / AWS Neptune integration
• In production (e.g. blog post)

Recent Contributions from the community
• Redash dashboard integration (Asana)
• Tableau dashboard integration (Gusto)
• Looker dashboard integration (in progress, Brex)
• Integrating with Delta analytics platform (in progress, Edmunds)
• ...

Data Lineage
• Tool-contributed lineage
‒ Description: the tool creating the data asset also writes the lineage
‒ Example: 1) Informatica, 2) Hive hook exposes lineage
‒ Key benefit: captured at time of creation
‒ Key challenge: no standard way to write lineage
• Manually linked by user
‒ Description: users manually add and describe how datasets are linked
‒ Key challenge: does not scale
• Inferred from DAG
‒ Description: extract dependencies based on scheduling
‒ Example: 1) Airflow lineage, 2) Marquez
‒ Key benefit: automatable
‒ Key challenge: doesn’t support field/column level lineage
• Inferred from SQL
‒ Description: programmatically extract lineage by parsing the SQL dialect
‒ Example: https://github.com/uber/queryparser
‒ Key benefit: accurate, supports all SQL dialects
‒ Key challenge: SQL is easier, but there is a long tail of support for others (Spark)
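As an illustration of the "inferred from DAG" pattern, the Python sketch below derives table-level lineage from task declarations. It is a simplified assumption of how this could work, not the Airflow or Marquez implementation: the TASKS structure, the task names, and the raw.* tables are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical task definitions: each task declares the tables it reads and writes.
TASKS: List[dict] = [
    {"task": "load_rsvps", "reads": ["raw.rsvps"], "writes": ["core.rsvps"]},
    {
        "task": "build_events",
        "reads": ["core.rsvps", "raw.events"],
        "writes": ["core.meetup_events"],
    },
]


def table_lineage(tasks: List[dict]) -> Dict[str, Set[str]]:
    """Map each written table to the tables its producing task reads."""
    upstream: Dict[str, Set[str]] = {}
    for task in tasks:
        for target in task["writes"]:
            upstream.setdefault(target, set()).update(task["reads"])
    return upstream


def all_upstream(table: str, upstream: Dict[str, Set[str]]) -> Set[str]:
    """Walk the lineage transitively to collect every ancestor table."""
    seen: Set[str] = set()
    stack = list(upstream.get(table, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(upstream.get(current, ()))
    return seen


lineage = table_lineage(TASKS)
ancestors = all_upstream("core.meetup_events", lineage)
```

The result only relates whole tables to whole tables, which matches the challenge noted for this pattern: nothing in the task declarations says which columns flow into which.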

Data Lineage
• Current main Q4 focus
‒ working on UX design for table lineage
• RFC is coming
‒ Provide data model for data lineage
‒ Provide UI for data lineage
‒ Allows different ingestion mechanisms (Push based, SQL parsing, etc)

Machine Learning Feature as entity
• ML Feature as a separate resource entity
‒ Surface feature stats
‒ Surface feature and upstream dataset lineage
‒ Surface various metadata around ML features

Metadata platform
• Support programmatic GraphQL API access to metadata for other services'
use cases
‒ Expose metadata (e.g. which tables are joined with which other tables most frequently) to BI SQL Viz
tools
‒ Integrate with a data quality service to surface health scores and data quality information in
Amundsen
• Support hybrid (pull + push) metadata ingestion
‒ Build an SDK to push metadata to Amundsen either through the API or through Kafka

