6. Data is used to make informed decisions
Roles: Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
Make data the heart of every decision
7. • Goal: What new data-driven policies can we enact to reduce driver
insurance fraud?
• Idea: Let’s take a deeper look into insurance claims from drivers who
have given less than 𝑥 rides.
• Next Step: I’ll first get all drivers who have given less than 𝑥 rides...but
where do I look?
Hi! I’m a new Analyst in the Fraud Department!
8. • Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the Github repos
Step 1: Search & find data
We end up finding tables: driver_rides
& rides_driver_total
9. • What is the difference: driver_rides vs. rides_driver_total
• What do the different fields mean?
‒ Is driver_rides.completed different from
rides_driver_total.lifetime_completed?
‒ What period of time does the data in each table cover?
• Dig deeper: explore using SQL queries
Step 2: Understand the data
SELECT * FROM schema.driver_rides
WHERE ds='2019-05-15'
LIMIT 100;

SELECT * FROM schema.rides_driver_total
WHERE ds='2019-05-15'
LIMIT 100;
10. Data Scientists spend up to a third of their time on Data Discovery
Data Discovery
• Data discovery is a problem because of a lack of understanding of what data exists, where it lives, who owns it, and how to use it.
• It is not what our data scientists should focus on: they should focus on analysis work
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
12. User Personas - (1/2)
Roles: Analysts, Data Scientists, General Managers, Experimenters, Engineers, Product Managers
• Frequent use of data
• Deep to very deep analysis
• Exposure to new datasets
• Creating insights & developing
models
13. User Personas - (2/2)
Power User
- Has been at Lyft for a long
time
- Knows the data environment
well: where to find data, what
it means, how to use it
Pain points:
- Needs to spend a fair amount
of their time sharing their
knowledge with the new user
- Could become “New user” if
they switch teams
New User
- Recently joined Lyft or
switched to a new team
- Needs to ramp up on a lot of
things, wants to start having
impact soon
Pain points:
- Doesn’t know where to start.
Spends their time asking
questions and cmd+F on
github
- Makes mistakes by misusing
some datasets
14. 3 complementary ways to do Data Discovery
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs
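The lineage-based mode above amounts to a graph traversal: given an upstream event, find everything derived from it. A minimal sketch, assuming a toy downstream-edge map (the dataset names are illustrative, not real Lyft tables):

```python
from collections import deque

# Hypothetical lineage graph: edges point from an upstream dataset to the
# datasets derived from it.
DOWNSTREAM = {
    "events.ride_requested": ["raw.rides"],
    "raw.rides": ["agg.driver_rides", "agg.rides_driver_total"],
    "agg.driver_rides": ["dashboards.fraud_overview"],
}

def impacted(dataset):
    """Return every dataset downstream of `dataset` (BFS over the edges)."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

With this map, `impacted("events.ride_requested")` reports that the raw table, both aggregate tables, and the fraud dashboard would all be affected if the event stream goes down.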
15. Data Discovery at Lyft
Product named after Roald Amundsen
● First expedition to reach the South Pole
● First to explore both North & South Poles
35. 2. Metadata Service
• A thin proxy layer to interact with the graph database
‒ Currently Neo4j is the default option for the graph backend engine
‒ Working with the community to support Apache Atlas
• Supports a REST API for other services pushing / pulling metadata directly
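As a rough illustration of the pull path: a caller fetching table metadata over REST has to percent-encode the table URI, since it contains `:` and `/`. The endpoint shape below is an assumption for illustration, not the service's documented route; check your deployment's metadata service for the real one.

```python
from urllib.parse import quote

def table_metadata_url(base, table_uri):
    # Table URIs contain ':' and '/' so they must be percent-encoded
    # before being embedded in the request path.
    return f"{base}/table/{quote(table_uri, safe='')}"

# Hypothetical host/port and table URI for illustration only.
url = table_metadata_url("http://metadata:5002", "hive://gold.schema/driver_rides")
# A client would then issue e.g. requests.get(url) and read the JSON body.
```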
36. Neo4j is the source of truth for
editable metadata
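The principle behind this design can be shown with a minimal sketch (not Amundsen's real code): ingestion runs refresh source-derived metadata, while user-edited fields live only in the graph store, so a table rebuild does not clobber a hand-written description. All names here are illustrative.

```python
class MetadataStore:
    """Toy stand-in for the graph store that owns editable metadata."""

    def __init__(self):
        self.tables = {}

    def ingest(self, name, source_metadata):
        # Rebuilds/refreshes overwrite source-derived fields only.
        record = self.tables.setdefault(name, {"description": None})
        record.update(source_metadata)

    def edit_description(self, name, text):
        # User edits write to the store that is the source of truth.
        self.tables.setdefault(name, {})["description"] = text

store = MetadataStore()
store.ingest("driver_rides", {"columns": ["driver_id", "completed"]})
store.edit_description("driver_rides", "Daily per-driver ride counts")
# A rebuild re-ingests source metadata, but the edit survives:
store.ingest("driver_rides", {"columns": ["driver_id", "completed"]})
```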
44. Metadata - Challenges
• No Standardization: No single data model that fits for all data
resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each dataset's metadata is stored and fetched
differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
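One common way to tame the "every source is fetched differently" problem is a shared extractor interface with one implementation per source, each yielding records in a common shape. The class and method names below are an illustrative sketch, not databuilder's actual API.

```python
from abc import ABC, abstractmethod

class Extractor(ABC):
    @abstractmethod
    def extract(self):
        """Yield metadata records in a common shape."""

class HiveMetastoreExtractor(Extractor):
    def extract(self):
        # Real code would query the Hive metastore's backing database.
        yield {"source": "hive", "name": "schema.driver_rides"}

class ModeDashboardExtractor(Extractor):
    def extract(self):
        # Real code would page through the Mode HTTP API.
        yield {"source": "mode", "name": "fraud_overview"}

records = [r for ex in (HiveMetastoreExtractor(), ModeDashboardExtractor())
           for r in ex.extract()]
```

Downstream stages then only ever see the common record shape, regardless of where the metadata came from.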
49. [Architecture diagram: the Databuilder crawler ingests metadata sources (Postgres, Hive, Redshift, Presto, ..., Github source files) into Neo4j and Elasticsearch, which back the Metadata Service and Search Service; the Frontend Service and other microservices (ML Feature Service, Security Service) consume those services.]
50. 3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then
relevancy
‒ Wildcard Search
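The three search patterns can be sketched over a toy in-memory corpus; real matching and relevancy live in Elasticsearch, so the functions below only illustrate the intent of each pattern (all table names are made up):

```python
from fnmatch import fnmatch

# Toy corpus standing in for the search index.
RECORDS = [
    {"name": "driver_rides", "columns": ["driver_id", "is_line_ride"]},
    {"name": "rides_driver_total", "columns": ["lifetime_completed"]},
    {"name": "event_ride_requested", "columns": ["ts"]},
]

def normal_search(term):
    # Relevancy stand-in: substring match on the table name.
    return [r["name"] for r in RECORDS if term in r["name"]]

def category_search(column):
    # Match on a specific field first (here: column names).
    return [r["name"] for r in RECORDS if column in r["columns"]]

def wildcard_search(pattern):
    # Glob-style patterns like "event_*".
    return [r["name"] for r in RECORDS if fnmatch(r["name"], pattern)]
```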
51. How to make the search results more relevant?
• Experiment with different weights, e.g. boost the ranking of exact table-name matches
• Collect metrics
‒ Instrumentation for search behavior
‒ Measure click-through-rate (CTR) over top 5 results
• Advanced search:
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
‒ Future: Filtering, Autosuggest
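The CTR-over-top-5 metric mentioned above is simple to compute from instrumented search logs: the fraction of searches where the user clicked one of the first five results. The log record shape here is hypothetical.

```python
def ctr_top5(search_logs):
    """Fraction of searches with a click on a result ranked 1-5."""
    clicks_in_top5 = sum(
        1 for log in search_logs
        if log["clicked_rank"] is not None and log["clicked_rank"] <= 5
    )
    return clicks_in_top5 / len(search_logs)

logs = [
    {"query": "cancel rates", "clicked_rank": 1},
    {"query": "driver rides", "clicked_rank": 7},   # clicked, but below top 5
    {"query": "etas", "clicked_rank": None},        # no click at all
    {"query": "fraud", "clicked_rank": 3},
]
```

For these four logged searches, two clicks landed in the top 5, so the metric is 0.5; ranking experiments are then judged by whether this number moves up.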
57. Amundsen’s Impact
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
‒ 90% penetration among Data Scientists
‒ +30% productivity for the Data Science org.
58. Amundsen is Open Source!
• github.com/lyft/amundsen
• Growing and active community
‒ c.150 github stars, 10+ companies contributing back
‒ Slack w/ 30+ companies and c.100 people
‒ Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow by Lyft
employees and community
‒ Featured in blog posts and interviews
• Net positive impact for Lyft through external community contributing
‒ Integration with open source backend
‒ Integration with new data sources (BigQuery, Redshift, Postgres), lifting them from
our roadmap
66. Tamika Tannis | @ttannis | /in/tamika-tannis
Project Code @ github.com/lyft/amundsen
Blog Post @ go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
Editor's notes
Name & Role working on an open-source data discovery tool at Lyft.
It’s called “Amundsen” -- more on that name later.
It leverages Neo4j, glad to share how we’ve been using Neo4j at Lyft to achieve goals of our product Amundsen.
On the agenda for this talk
The data infrastructure at Lyft can be visualized by this diagram
Events are fired into streaming frameworks (Apache Kafka / Amazon Kinesis)
Apache Flink ingests that data into Amazon S3, the first, persistent layer of storage
Data stored in Amazon S3 is further transformed and stored in various other datastores
Initially started with Redshift. Introduced new datastores with different strengths that better serve specific purposes,
Hive for long running queries/ETLs
Presto for quick analysis & as-needed queries
Druid for faster interactive queries
The takeaway from this slide: Lots of data (~10PB), lots of places it can be (thousands of tables), and lots of tools/people trying to use the data on a regular basis.
Now onto challenges with data discovery
Effective data discovery is important because data is at the heart of every decision we make. It is the only way to make informed, objective decisions.
Applies to many roles
Data-driven decision making process
Search & find data
Understand the data
Perform an analysis
Share insights or make a decision
To highlight some data discovery pain points that occur without the proper tools, let’s walk through a hypothetical example
Your experience searching and finding data may involve doing all of the following 3 things.
Your experience understanding the data doesn’t get any easier.
⅓ of time on data discovery
Difficult to find what exists, understand whether or not it’s what you are looking for, or trust that it is the source of truth for that information
We can significantly increase productivity and impact if we can reduce this time...
Let’s start to think of what a helpful tool would look like.
Complication: What audience to serve? Who are they and what do they need?
What audience to serve?
Second level of personas to consider.
Lastly, by what means do they want to perform discovery? There are 3 complementary ways to do Data Discovery
Search based: most common and top priority
Lineage based: callback to the data ecosystem, if there is a hiccup in that system, what does it impact? Datasets must be trustworthy
Network based: helpful on the job to know what others are using for what purpose
We’ve talked about some pain points of data discovery and why it’s important, let’s talk about our solution -- Amundsen.
Disclaimer
Representative data
Amundsen circa March 2019
Our landing page is optimized for search
Most common method of data discovery, presented with search bar & help text for some advanced search features
We also want the landing page to be able to help users that don’t know what to search for.
Created this concept of popular tables
Users presented with ranked search results
Not like page-rank but based on relevance and popularity
This is what we mean when we say relevance
This is what we mean when we say popularity
Striking the balance between the two is an interesting challenge
Relevance is based on metadata
Popularity is measured not by click-through rate but by query access patterns.
Now that I’ve demonstrated what Amundsen is and how it can be used, let’s talk about how it was built.
Microservice architecture, services are divided by domains: ui/frontend, search, metadata
Walk from top to bottom & highlight “pluggability”
I’ll now dive deeper into each of the Amundsen services presented in the previous graph, starting with metadata service, which is backed by Neo4j.
As you may remember from the application walkthrough, Amundsen surfaces resource metadata and that is what we are storing in Neo4j
However graph databases are not common for many web applications, and so one might ask why choose a graph database.
Well if you remember the diagram of the data ecosystem at Lyft from the beginning of the talk, that can be modeled as a graph.
This is a very powerful feature because the alternative to creating these kinds of relationships in an RDBMS is joins
A NoSQL database isn’t set up for this
Let’s take a note of some of the features from the table detail page again and see how this is represented in Neo4j
Walk through features
What’s very beneficial about this is that when we have a new use case and a new piece of metadata to represent, we just have to create the new node and relationship.
It’s worth noting that one key architectural decision made for this service and others is that it is a proxy to interact with Neo4j
Which means it can also interact with anything else that can store this data.
This choice is key for us as an open source project
Another key characteristic of our system is that neo4j is the source of truth for our editable metadata
This was actually not our original intent, we ran into a roadblock when we were first implementing the description editing feature.
We originally had a setup like this
Then we realized we forgot to account for something.
Tables can get rebuilt using the source code that generated the table and descriptions will be overwritten
Then we thought about whether or not we could do just that: update them both!
The answer was no.
...And that’s how Neo4j became the source of truth for editable metadata
Now onto the data builder service
It is the layer that ingests metadata from the sources. Which sources exactly?
Many sources.
Not just tables but dashboards, different kinds of resources
This creates some complexity
This is what databuilder helps address: it is a data ingestion framework similar to Apache Gobblin
It functions as an ETL engine
Each part is modularized, and can be reused (e.g same transformer) or swapped out
A publisher leverages a transaction to make the data ingestion atomic -- it is not the case that there is partially updated data
Here is a more solid example
How is this all orchestrated? With Airflow dags, these are jobs that run to execute each piece of the puzzle
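The modular extract -> transform -> load flow described in these notes can be sketched as below; the function names are illustrative, not databuilder's actual API, and the "transaction" is reduced to staging everything before publishing in one step.

```python
def extract():
    # Stand-in for a source extractor yielding metadata records.
    yield {"table": "Driver_Rides", "owner": "fraud-team"}
    yield {"table": "Rides_Driver_Total", "owner": "fraud-team"}

def transform(record):
    # A reusable, swappable step, e.g. normalizing table names.
    record["table"] = record["table"].lower()
    return record

def load(records, sink):
    # A publisher would wrap this in a real transaction so ingestion is
    # atomic: either every record lands or none do. Here we stage all
    # records first, then publish in a single step.
    staged = [transform(r) for r in records]
    sink.extend(staged)

sink = []
load(extract(), sink)
```

Each piece (extractor, transformer, loader/publisher) can be reused or swapped independently, which is the modularity the notes describe.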
End with elastic search.
Elasticsearch is what sits behind our search service...
... in the same way that Neo4j stands behind the metadata service.
Similar data is loaded into both, though there are some minor differences; for example, data that won’t be searchable, like column stats
Also similar to the metadata service the search service acts as proxy
What I find most interesting about the search service is actually the biggest problem that we struggle with, “how to make the search results more relevant”.