6. Data is used to make informed decisions
Roles: Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
Make data the heart of every decision
7. • Goal: What new data-driven policies can we enact to reduce driver
insurance fraud?
• Idea: Let’s take a deeper look into insurance claims from drivers who
have given less than 𝑥 rides.
• Next Step: I’ll first get all drivers who have given less than 𝑥 rides...but
where do I look?
Hi! I’m a new Analyst in the Fraud Department!
8. • Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the Github repos
Step 1: Search & find data
We end up finding tables: driver_rides
& rides_driver_total
9. • What is the difference: driver_rides vs. rides_driver_total
• What do the different fields mean?
‒ Is driver_rides.completed different from
rides_driver_total.lifetime_completed?
‒ What period of time does the data in each table cover?
• Dig deeper: explore using SQL queries
Step 2: Understand the data
SELECT * FROM schema.driver_rides
WHERE ds='2019-05-15'
LIMIT 100;

SELECT * FROM schema.rides_driver_total
WHERE ds='2019-05-15'
LIMIT 100;
10. Data Scientists spend up to a third of their time on Data Discovery
Data Discovery
• Data discovery is a problem because of a lack of understanding of what data exists, where it lives, who owns it, and how to use it.
• It is not what our data scientists should focus on: they should focus on analysis work
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
12. User Personas - (1/2)
Roles: Analysts, Data Scientists, General Managers, Experimenters, Engineers, Product Managers
• Frequent use of data
• Deep to very deep analysis
• Exposure to new datasets
• Creating insights & developing
models
13. User Personas - (2/2)
Power User
- Has been at Lyft for a long
time
- Knows the data environment
well: where to find data, what
it means, how to use it
Pain points:
- Needs to spend a fair amount
of their time sharing their
knowledge with the new user
- Could become “New user” if
they switch teams
New User
- Recently joined Lyft or
switched to a new team
- Needs to ramp up on a lot of
things, wants to start having
impact soon
Pain points:
- Doesn’t know where to start.
Spends their time asking
questions and cmd+F on
github
- Makes mistakes by misusing
some datasets
14. 3 complementary ways to do Data Discovery
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs
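The lineage-based mode above amounts to a graph traversal: given an upstream event, find everything derived from it. A minimal sketch, assuming a toy downstream-edge map (the dataset names are illustrative, not real Lyft tables):

```python
from collections import deque

# Hypothetical lineage graph: edges point from an upstream dataset to the
# datasets derived from it.
DOWNSTREAM = {
    "events.ride_requested": ["raw.rides"],
    "raw.rides": ["agg.driver_rides", "agg.rides_driver_total"],
    "agg.driver_rides": ["dashboards.fraud_overview"],
}

def impacted(dataset):
    """Return every dataset downstream of `dataset` (BFS over the edges)."""
    seen, queue = set(), deque([dataset])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

With this map, `impacted("events.ride_requested")` reports that the raw table, both aggregate tables, and the fraud dashboard would all be affected if the event stream goes down.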
15. Data Discovery at Lyft
Product named after Roald Amundsen
● First expedition to reach the South Pole
● First to explore both North & South Poles
35. 2. Metadata Service
• A thin proxy layer to interact with the graph database
‒ Currently Neo4j is the default option for the graph backend engine
‒ Working with the community to support Apache Atlas
• Supports a REST API for other services pushing / pulling metadata directly
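As a rough illustration of the pull path: a caller fetching table metadata over REST has to percent-encode the table URI, since it contains `:` and `/`. The endpoint shape below is an assumption for illustration, not the service's documented route; check your deployment's metadata service for the real one.

```python
from urllib.parse import quote

def table_metadata_url(base, table_uri):
    # Table URIs contain ':' and '/' so they must be percent-encoded
    # before being embedded in the request path.
    return f"{base}/table/{quote(table_uri, safe='')}"

# Hypothetical host/port and table URI for illustration only.
url = table_metadata_url("http://metadata:5002", "hive://gold.schema/driver_rides")
# A client would then issue e.g. requests.get(url) and read the JSON body.
```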
36. Neo4j is the source of truth for
editable metadata
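The principle behind this design can be shown with a minimal sketch (not Amundsen's real code): ingestion runs refresh source-derived metadata, while user-edited fields live only in the graph store, so a table rebuild does not clobber a hand-written description. All names here are illustrative.

```python
class MetadataStore:
    """Toy stand-in for the graph store that owns editable metadata."""

    def __init__(self):
        self.tables = {}

    def ingest(self, name, source_metadata):
        # Rebuilds/refreshes overwrite source-derived fields only.
        record = self.tables.setdefault(name, {"description": None})
        record.update(source_metadata)

    def edit_description(self, name, text):
        # User edits write to the store that is the source of truth.
        self.tables.setdefault(name, {})["description"] = text

store = MetadataStore()
store.ingest("driver_rides", {"columns": ["driver_id", "completed"]})
store.edit_description("driver_rides", "Daily per-driver ride counts")
# A rebuild re-ingests source metadata, but the edit survives:
store.ingest("driver_rides", {"columns": ["driver_id", "completed"]})
```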
44. Metadata - Challenges
• No Standardization: No single data model that fits for all data
resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each dataset's metadata is stored and fetched
differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS(postgres etc): Fetched through DBAPI interface
‒ Github source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
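One common way to tame the "every source is fetched differently" problem is a shared extractor interface with one implementation per source, each yielding records in a common shape. The class and method names below are an illustrative sketch, not databuilder's actual API.

```python
from abc import ABC, abstractmethod

class Extractor(ABC):
    @abstractmethod
    def extract(self):
        """Yield metadata records in a common shape."""

class HiveMetastoreExtractor(Extractor):
    def extract(self):
        # Real code would query the Hive metastore's backing database.
        yield {"source": "hive", "name": "schema.driver_rides"}

class ModeDashboardExtractor(Extractor):
    def extract(self):
        # Real code would page through the Mode HTTP API.
        yield {"source": "mode", "name": "fraud_overview"}

records = [r for ex in (HiveMetastoreExtractor(), ModeDashboardExtractor())
           for r in ex.extract()]
```

Downstream stages then only ever see the common record shape, regardless of where the metadata came from.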
49. [Architecture diagram: the Databuilder crawler ingests metadata sources (Postgres, Hive, Redshift, Presto, ..., Github source files) into Neo4j and Elasticsearch, which back the Metadata Service and Search Service; the Frontend Service and other microservices (ML Feature Service, Security Service) consume those services.]
50. 3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend.
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then
relevancy
‒ Wildcard Search
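The three search patterns can be sketched over a toy in-memory corpus; real matching and relevancy live in Elasticsearch, so the functions below only illustrate the intent of each pattern (all table names are made up):

```python
from fnmatch import fnmatch

# Toy corpus standing in for the search index.
RECORDS = [
    {"name": "driver_rides", "columns": ["driver_id", "is_line_ride"]},
    {"name": "rides_driver_total", "columns": ["lifetime_completed"]},
    {"name": "event_ride_requested", "columns": ["ts"]},
]

def normal_search(term):
    # Relevancy stand-in: substring match on the table name.
    return [r["name"] for r in RECORDS if term in r["name"]]

def category_search(column):
    # Match on a specific field first (here: column names).
    return [r["name"] for r in RECORDS if column in r["columns"]]

def wildcard_search(pattern):
    # Glob-style patterns like "event_*".
    return [r["name"] for r in RECORDS if fnmatch(r["name"], pattern)]
```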
51. How to make the search results more relevant?
• Experiment with different weights, e.g. boost the ranking of exact table-name matches
• Collect metrics
‒ Instrumentation for search behavior
‒ Measure click-through-rate (CTR) over top 5 results
• Advanced search:
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
‒ Future: Filtering, Autosuggest
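The CTR-over-top-5 metric mentioned above is simple to compute from instrumented search logs: the fraction of searches where the user clicked one of the first five results. The log record shape here is hypothetical.

```python
def ctr_top5(search_logs):
    """Fraction of searches with a click on a result ranked 1-5."""
    clicks_in_top5 = sum(
        1 for log in search_logs
        if log["clicked_rank"] is not None and log["clicked_rank"] <= 5
    )
    return clicks_in_top5 / len(search_logs)

logs = [
    {"query": "cancel rates", "clicked_rank": 1},
    {"query": "driver rides", "clicked_rank": 7},   # clicked, but below top 5
    {"query": "etas", "clicked_rank": None},        # no click at all
    {"query": "fraud", "clicked_rank": 3},
]
```

For these four logged searches, two clicks landed in the top 5, so the metric is 0.5; ranking experiments are then judged by whether this number moves up.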
57. Amundsen’s Impact
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service!
‒ 90% penetration among Data Scientists
‒ +30% productivity for the Data Science org.
58. Amundsen is Open Source!
• github.com/lyft/amundsen
• Growing and active community
‒ c.150 github stars, 10+ companies contributing back
‒ Slack w/ 30+ companies and c.100 people
‒ Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow by Lyft
employees and community
‒ Featured in blog posts and interviews
• Net positive impact for Lyft through external community contributing
‒ Integration with open source backend
‒ Integration with new data sources (BigQuery, Redshift, Postgres), lifting them from
our roadmap
66. Tamika Tannis | @ttannis | /in/tamika-tannis
Project Code @ github.com/lyft/amundsen
Blog Post @ go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
Editor's notes
Name & Role working on an open-source data discovery tool at Lyft.
It’s called “Amundsen” -- more on that name later.
It leverages Neo4j, glad to share how we’ve been using Neo4j at Lyft to achieve goals of our product Amundsen.
On the agenda for this talk
The data infrastructure at Lyft can be visualized by this diagram
Events are fired into streaming frameworks (Apache Kafka / Amazon Kinesis)
Apache Flink ingests that data into Amazon S3, the first, persistent layer of storage
Data stored in Amazon S3 is further transformed and stored in various other datastores
Initially started with Redshift. Introduced new datastores with different strengths that better serve specific purposes,
Hive for long running queries/ETLs
Presto for quick analysis & as-needed queries
Druid for faster interactive queries
The takeaway from this slide: Lots of data (~10PB), lots of places it can be (thousands of tables), and lots of tools/people trying to use the data on a regular basis.
Now onto challenges with data discovery
Effective data discovery is important because data is at the heart of every decision we make. It is the only way to make informed, objective decisions.
Applies to many roles
Data-driven decision making process
Search & find data
Understand the data
Perform an analysis
Share insights or make a decision
To highlight some data discovery pain points that occur without the proper tools, let’s walk through a hypothetical example
Your experience searching and finding data may involve doing all of the following 3 things.
Your experience understanding the data doesn’t get any easier.
⅓ of time on data discovery
Difficult to find what exists, understand whether or not it’s what you are looking for, or trust that it is the source of truth for that information
We can significantly increase productivity and impact if we can reduce this time...
Let’s start to think of what a helpful tool would look like.
Complication: What audience to serve? Who are they and what do they need?
What audience to serve?
Second level of personas to consider.
Lastly, by what means do they want to perform discovery? There are 3 complementary ways to do Data Discovery
Search based: most common and top priority
Lineage based: callback to the data ecosystem, if there is a hiccup in that system, what does it impact? Datasets must be trustworthy
Network based: helpful on the job to know what others are using for what purpose
We’ve talked about some pain points of data discovery and why it’s important, let’s talk about our solution -- Amundsen.
Disclaimer
Representative data
Amundsen circa March 2019
Our landing page is optimized for search
Most common method of data discovery, presented with search bar & help text for some advanced search features
We also want the landing page to be able to help users that don’t know what to search for.
Created this concept of popular tables
Users presented with ranked search results
Not like page-rank but based on relevance and popularity
This is what we mean when we say relevance
This is what we mean when we say popularity
Striking the balance between the two is an interesting challenge
Relevance is based on metadata
Popularity is measured not by click-through rate but by query access patterns.
Now that I’ve demonstrated what Amundsen is and how it can be used, let’s talk about how it was built.
Microservice architecture, services are divided by domains: ui/frontend, search, metadata
Walk from top to bottom & highlight “pluggability”
I’ll now dive deeper into each of the Amundsen services presented in the previous graph, starting with metadata service, which is backed by Neo4j.
As you may remember from the application walkthrough, Amundsen surfaces resource metadata and that is what we are storing in Neo4j
However graph databases are not common for many web applications, and so one might ask why choose a graph database.
Well if you remember the diagram of the data ecosystem at Lyft from the beginning of the talk, that can be modeled as a graph.
This is a very powerful feature because the alternative to creating these kinds of relationships in an RDBMS is joins
A NoSQL database isn’t set up for this
Let’s take a note of some of the features from the table detail page again and see how this is represented in Neo4j
Walk through features
What’s very beneficial about this is that when we have a new use case and a new piece of metadata to represent, we just have to create the new node and relationship.
It’s worth noting that one key architectural decision made for this service and others is that it is a proxy to interact with Neo4j
Which means it can also interact with anything else that can store this data.
This choice is key for us as an open source project
Another key characteristic of our system is that neo4j is the source of truth for our editable metadata
This was actually not our original intent, we ran into a roadblock when we were first implementing the description editing feature.
We originally had a setup like this
Then we realized we forgot to account for something.
Tables can get rebuilt using the source code that generated the table and descriptions will be overwritten
Then we thought about whether or not we could do just that: update them both!
The answer was no.
...And that’s how Neo4j became the source of truth for editable metadata
Now onto the data builder service
It is the layer that ingests metadata from the sources. Which sources exactly?
Many sources.
Not just tables but dashboards, different kinds of resources
This creates some complexity
This is what databuilder helps address: it is a data ingestion framework similar to Apache Gobblin
It functions as an ETL engine
Each part is modularized, and can be reused (e.g same transformer) or swapped out
A publisher leverages a transaction to make the data ingestion atomic -- it is not the case that there is partially updated data
Here is a more solid example
How is this all orchestrated? With Airflow dags, these are jobs that run to execute each piece of the puzzle
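The modular extract -> transform -> load flow described in these notes can be sketched as below; the function names are illustrative, not databuilder's actual API, and the "transaction" is reduced to staging everything before publishing in one step.

```python
def extract():
    # Stand-in for a source extractor yielding metadata records.
    yield {"table": "Driver_Rides", "owner": "fraud-team"}
    yield {"table": "Rides_Driver_Total", "owner": "fraud-team"}

def transform(record):
    # A reusable, swappable step, e.g. normalizing table names.
    record["table"] = record["table"].lower()
    return record

def load(records, sink):
    # A publisher would wrap this in a real transaction so ingestion is
    # atomic: either every record lands or none do. Here we stage all
    # records first, then publish in a single step.
    staged = [transform(r) for r in records]
    sink.extend(staged)

sink = []
load(extract(), sink)
```

Each piece (extractor, transformer, loader/publisher) can be reused or swapped independently, which is the modularity the notes describe.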
End with elastic search.
Elasticsearch is what sits behind our search service...
... in the same way that Neo4j stands behind the metadata service.
Similar data is loaded into both, though there are some minor differences; for example, data that won’t be searchable, like column stats
Also similar to the metadata service the search service acts as proxy
What I find most interesting about the search service is actually the biggest problem that we struggle with, “how to make the search results more relevant”.