Wednesday, September 18th 2019
Tamika Tannis | Software Engineer, Lyft
go.lyft.com/datadiscoveryslides
Disrupting Data Discovery
Agenda
• Data Ecosystem at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Amundsen’s Architecture
• What’s Next?
Data Ecosystem at Lyft
Core Data Infrastructure (High Level)
[Diagram labels: Custom Applications, Mobile App, Services, Data Streaming Frameworks (Kafka / Kinesis), Flink]
Challenges with Data Discovery
Data is used to make informed decisions
Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers
Data-driven decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
Make data the heart of every decision
Hi! I’m a new Analyst in the Fraud Department!
• Goal: What new data-driven policies can we enact to reduce driver insurance fraud?
• Idea: Let’s take a deeper look into insurance claims from drivers who have given fewer than x rides.
• Next Step: I’ll first get all drivers who have given fewer than x rides... but where do I look?
Step 1: Search & find data
• Ask a friend/manager/coworker
• Ask in a wider Slack channel
• Search in the GitHub repos
Step 2: Understand the data
We end up finding two tables: driver_rides & rides_driver_total
• What is the difference between driver_rides and rides_driver_total?
• What do the different fields mean?
‒ Is driver_rides.completed different from rides_driver_total.lifetime_completed?
‒ What period of time does the data in each table cover?
• Dig deeper: explore using SQL queries
SELECT * FROM schema.driver_rides
WHERE ds='2019-05-15'
LIMIT 100;
SELECT * FROM schema.rides_driver_total
WHERE ds='2019-05-15'
LIMIT 100;
Data scientists spend up to 1/3 of their time on data discovery
Data Discovery
• Data discovery is a problem because of a lack of understanding of what data exists, where it lives, who owns it, & how to use it.
• It is not what our data scientists should focus on: they should focus on analysis work.
Data-based decision making process:
1. Search & find data
2. Understand the data
3. Perform an analysis/visualisation
4. Share insights and/or make a decision
Audience for data discovery
User Personas - (1/2)
Analysts, Data Scientists, General Managers, Experimenters, Engineers, Product Managers
• Frequent use of data
• Deep to very deep analysis
• Exposure to new datasets
• Creating insights & developing
models
User Personas - (2/2)
Power User
- Has been at Lyft for a long
time
- Knows the data environment
well: where to find data, what
it means, how to use it
Pain points:
- Needs to spend a fair amount
of their time sharing their
knowledge with the new user
- Could become “New user” if
they switch teams
New User
- Recently joined Lyft or
switched to a new team
- Needs to ramp up on a lot of
things, wants to start having
impact soon
Pain points:
- Doesn’t know where to start. Spends their time asking questions and using cmd+F on GitHub
- Makes mistakes by misusing some datasets
3 complementary ways to do Data Discovery
Search based
I am looking for a table with data on “cancel rates”
- Where is the table?
- What does it contain?
- Has the analysis I want to perform already been done?
Lineage based
If this event is down, what datasets are going to be impacted?
- Upstream/downstream lineage
- Incidents, SLA misses, Data quality
Network based
I want to check what tables my manager uses
- Ownership information
- Bookmarking
- Usage through query logs
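The lineage-based question above ("if this event is down, what datasets are going to be impacted?") amounts to a downstream graph traversal. A minimal sketch, assuming a hand-written lineage map; the edges and the names `event_ride_completed`, `driver_dashboard`, and `fraud_report` are hypothetical and not taken from Lyft's lineage data:

```python
from collections import deque

# Hypothetical lineage: each key feeds the datasets listed as its value.
DOWNSTREAM = {
    "event_ride_completed": ["driver_rides"],
    "driver_rides": ["rides_driver_total", "driver_dashboard"],
    "rides_driver_total": ["fraud_report"],
    "driver_dashboard": [],
    "fraud_report": [],
}


def impacted_datasets(source, downstream=DOWNSTREAM):
    """Return every dataset reachable downstream of `source`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Calling `impacted_datasets("event_ride_completed")` lists the tables and dashboards that would go stale; upstream lineage is the same walk over the reversed map.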
Data Discovery at Lyft
Product named after Roald Amundsen
● First expedition to reach the South Pole
● First to explore both North & South Poles
Landing Page - Optimized for search
Search Results - Ranked on relevance & popularity
Relevance - search for “apple” on Google
Low relevance High relevance
Popularity - search for “apple” on Google
Low popularity High popularity
Search Results - Striking the balance
Relevance
● Names, Descriptions, Tags, [owners, frequent users]
● Different weights for different metadata, e.g. resource name
Popularity
● Querying activity
● Dashboarding
● Lower weight for automated querying
● Higher weight for ad hoc querying
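One way to picture the balance between relevance and popularity is a score that multiplies a weighted field match by a damped usage count. This is only an illustrative sketch: the weights, the `TableStats` fields, and the scoring formula are assumptions, not the ranking Amundsen actually uses.

```python
import math
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    description: str
    tags: tuple
    adhoc_queries: int      # queries typed by people
    automated_queries: int  # queries issued by scheduled jobs


# Hypothetical weights: a match on the resource name counts most.
FIELD_WEIGHTS = {"name": 3.0, "tags": 2.0, "description": 1.0}
ADHOC_WEIGHT = 1.0       # higher weight for ad hoc querying
AUTOMATED_WEIGHT = 0.1   # lower weight for automated querying


def relevance(term, table):
    """Sum the weights of every metadata field that contains the term."""
    term = term.lower()
    score = 0.0
    if term in table.name.lower():
        score += FIELD_WEIGHTS["name"]
    if any(term in tag.lower() for tag in table.tags):
        score += FIELD_WEIGHTS["tags"]
    if term in table.description.lower():
        score += FIELD_WEIGHTS["description"]
    return score


def popularity(table):
    """Weighted query count; automated traffic is discounted."""
    return ADHOC_WEIGHT * table.adhoc_queries + AUTOMATED_WEIGHT * table.automated_queries


def rank(term, tables):
    """Return matching tables, best first, by relevance times damped popularity."""
    matches = [t for t in tables if relevance(term, t) > 0]
    return sorted(
        matches,
        key=lambda t: relevance(term, t) * (1.0 + math.log1p(popularity(t))),
        reverse=True,
    )
```

The logarithm keeps a heavily queried table from drowning out a much better textual match; a different damping function would work equally well for the illustration.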
View Resource Metadata
Data Preview
View Resource Metadata
Computed Column Metadata Statistics
Disclaimer: these stats are arbitrary.
View Resource Metadata
In-Application User Feedback
Amundsen’s Architecture
[Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, GitHub Source File. Components: Databuilder Crawler, Neo4j, Elasticsearch, Metadata Service, Search Service, Frontend Service. Other Microservices: ML Feature Service, Security Service]
1. Metadata Service
Why choose a graph database?
Why Graph database? (1/2)
Why Graph database? (2/2)
1. Metadata Service
• A thin proxy layer to interact with the graph database
‒ Currently Neo4j is the default option for the graph backend engine
‒ Working with the community to support Apache Atlas
• Supports a REST API for other services pushing / pulling metadata directly
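The "thin proxy layer" idea can be sketched as a small interface that hides which graph store sits behind it. Everything named here (`GraphBackend`, `InMemoryBackend`, `MetadataProxy`, and their methods) is hypothetical and stands in for the real service; no Neo4j or Apache Atlas client calls are shown.

```python
from abc import ABC, abstractmethod


class GraphBackend(ABC):
    """Hypothetical interface the proxy expects from any graph store."""

    @abstractmethod
    def get_node(self, key):
        """Return the properties stored for `key` as a dict."""

    @abstractmethod
    def set_property(self, key, name, value):
        """Set one property on the node identified by `key`."""


class InMemoryBackend(GraphBackend):
    """Stand-in for a real graph database, used only for illustration."""

    def __init__(self):
        self._nodes = {}

    def get_node(self, key):
        # Return a copy so callers cannot mutate the stored node.
        return dict(self._nodes.get(key, {}))

    def set_property(self, key, name, value):
        self._nodes.setdefault(key, {})[name] = value


class MetadataProxy:
    """Thin layer: callers talk to this class, never to the backend directly."""

    def __init__(self, backend):
        self._backend = backend

    def get_table(self, table_key):
        return self._backend.get_node(table_key)

    def put_description(self, table_key, description):
        self._backend.set_property(table_key, "description", description)
```

Swapping the default backend for another one then means writing a second `GraphBackend` subclass, with no change for services that push or pull metadata through the proxy.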
Neo4j is the source of truth for
editable metadata
Why not propagate the edited metadata back to the source?
2. Databuilder
Metadata Sources @ Lyft
Metadata - Challenges
• No Standardization: No single data model fits all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Different Extraction: Each data set’s metadata is stored and fetched differently
‒ Hive Table: Stored in Hive metastore
‒ RDBMS (Postgres, etc.): Fetched through DBAPI interface
‒ GitHub source code: Fetched through git hook
‒ Mode dashboard: Fetched through Mode API
‒ …
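Both challenges above point toward one common shape: a separate extractor per source, each emitting the same record type. The sketch below illustrates that shape only; `ResourceRecord`, `StaticTableExtractor`, `StaticDashboardExtractor`, and `collect` are hypothetical names and not the Databuilder API, and the extractors return canned values instead of calling a metastore or the Mode API.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass(frozen=True)
class ResourceRecord:
    """One normalized metadata record, whatever the source system was."""
    resource_type: str  # e.g. "table" or "dashboard"
    name: str
    source: str


class Extractor(Protocol):
    def extract(self) -> Iterator[ResourceRecord]:
        ...


class StaticTableExtractor:
    """Pretends to read table names from a metastore or a DBAPI connection."""

    def __init__(self, source, table_names):
        self._source = source
        self._table_names = list(table_names)

    def extract(self):
        for name in self._table_names:
            yield ResourceRecord("table", name, self._source)


class StaticDashboardExtractor:
    """Pretends to read dashboard titles from a dashboarding tool's API."""

    def __init__(self, source, titles):
        self._source = source
        self._titles = list(titles)

    def extract(self):
        for title in self._titles:
            yield ResourceRecord("dashboard", title, self._source)


def collect(extractors):
    """Run every extractor and gather the normalized records in order."""
    records = []
    for extractor in extractors:
        records.extend(extractor.extract())
    return records
```

Adding a new source then means adding one extractor class; the code that loads records downstream does not change.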
Databuilder
Databuilder in action
How is the databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
3. Search Service
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the search backend
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
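The three search patterns can be illustrated over a tiny in-memory index. This is a sketch under stated assumptions: the records, the `table` / `column` category names, and the function names are hypothetical, no Elasticsearch calls are made, and results come back in index order, not ranked by relevancy.

```python
from fnmatch import fnmatchcase

# Hypothetical index: one entry per table with its column names.
RECORDS = [
    {"table": "event_ride_requested", "columns": ["ride_id", "is_line_ride"]},
    {"table": "event_ride_completed", "columns": ["ride_id"]},
    {"table": "driver_rides", "columns": ["driver_id", "completed"]},
]


def normal_search(term, records=RECORDS):
    """Match tables whose name contains the term."""
    return [r["table"] for r in records if term in r["table"]]


def category_search(category, term, records=RECORDS):
    """Narrow to one category of metadata first, then match the term."""
    if category == "column":
        return [r["table"] for r in records if term in r["columns"]]
    if category == "table":
        return normal_search(term, records)
    raise ValueError(f"unknown category: {category}")


def wildcard_search(pattern, records=RECORDS):
    """Match table names against a shell-style pattern such as event_*."""
    return [r["table"] for r in records if fnmatchcase(r["table"], pattern)]
```

For example, `wildcard_search("event_*")` returns both event tables, while `category_search("column", "is_line_ride")` returns only the table that has that column.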
How to make the search result more relevant?
• Experiment with different weights, e.g. boost the exact table ranking
• Collect metrics
‒ Instrumentation for search behavior
‒ Measure click-through-rate (CTR) over top 5 results
• Advanced search:
‒ Support wildcard search (e.g. event_*)
‒ Support category search (e.g. column: is_line_ride)
‒ Future: Filtering, Autosuggest
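The CTR metric above can be computed from a log of searches once each search records which result position, if any, was clicked. A minimal sketch; the log format (a 1-based clicked rank or `None` per search) and the name `ctr_at_k` are assumptions made for illustration.

```python
def ctr_at_k(clicked_ranks, k=5):
    """Fraction of searches whose click landed in the top `k` results.

    `clicked_ranks` holds one entry per search: the 1-based position of the
    clicked result, or None when the user clicked nothing.
    """
    if not clicked_ranks:
        return 0.0
    hits = sum(1 for rank in clicked_ranks if rank is not None and rank <= k)
    return hits / len(clicked_ranks)
```

Comparing this number before and after a change in ranking weights gives a simple signal of whether the change helped.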
4. Frontend Service
Web Application
Web Technologies
Develop Build Test
What’s Next?
Amundsen’s Impact
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Customer Service!
‒ 90% penetration among Data Scientists
‒ +30% productivity for the Data Science org
Amundsen is Open Source!
• github.com/lyft/amundsen
• Growing and active community
‒ c. 150 GitHub stars, 10+ companies contributing back
‒ Slack w/ 30+ companies and c. 100 people
‒ Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow by Lyft employees and community
‒ Featured in blog posts and interviews
• Net positive impact for Lyft through external community contributions
‒ Integration with open source backend
‒ Integration with new data sources (BigQuery, Redshift, Postgres), lifting them from our roadmap
Community Overview
Contributors, Active community
Roadmap
[Figure: roadmap in three phases: Phase 1 (Complete), Phase 2 (In Progress), Phase 3 (In Scoping). Resource types shown: Data sets, People, Dashboards, Streams, Schemas, Workflows. Further directions: More Metadata; Deeper integration with other tools (e.g. Mode); Privacy; Governance]
Amundsen People
Tamika Tannis | @ttannis | /in/tamika-tannis
Project Code @ github.com/lyft/amundsen
Blog Post @ go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/
Neo4j: Data Engineering for RAG (retrieval augmented generation)
 
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdfNeo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
Neo4j Graph Summit 2024 Workshop - EMEA - Breda_and_Munchen.pdf
 
Enabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge GraphsEnabling GenAI Breakthroughs with Knowledge Graphs
Enabling GenAI Breakthroughs with Knowledge Graphs
 
Neo4j_Anurag Tandon_Product Vision and Roadmap.Benelux.pptx.pdf
Neo4j_Anurag Tandon_Product Vision and Roadmap.Benelux.pptx.pdfNeo4j_Anurag Tandon_Product Vision and Roadmap.Benelux.pptx.pdf
Neo4j_Anurag Tandon_Product Vision and Roadmap.Benelux.pptx.pdf
 
Neo4j Jesus Barrasa The Art of the Possible with Graph
Neo4j Jesus Barrasa The Art of the Possible with GraphNeo4j Jesus Barrasa The Art of the Possible with Graph
Neo4j Jesus Barrasa The Art of the Possible with Graph
 

How Lyft Drives Data Discovery

  • 1. Wednesday, September 18th 2019 Tamika Tannis | Software Engineer, Lyft go.lyft.com/datadiscoveryslides Disrupting Data Discovery
  • 2. Agenda • Data Ecosystem at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Amundsen’s Architecture • What’s Next? 2
  • 4. 4 Core Data Infrastructure (High Level) Custom Applications Architecture Applications Mobile App Services Services Data Streaming Frameworks (Kafka / Kinesis) Flink
  • 6. Data is used to make informed decisions. Roles: Analysts, Data Scientists, General Managers, Engineers, Experimenters, Product Managers. Data-driven decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or make a decision. Make data the heart of every decision.
  • 7. • Goal: What new data-driven policies can we enact to reduce driver insurance fraud? • Idea: Let’s take a deeper look into insurance claims from drivers who have given less than 𝑥 rides. • Next Step: I’ll first get all drivers who have given less than 𝑥 rides...but where do I look? Hi! I’m a new Analyst in the Fraud Department ! 7
  • 8. Step 1: Search & find data • Ask a friend/manager/coworker • Ask in a wider Slack channel • Search in the GitHub repos. We end up finding tables: driver_rides & rides_driver_total
  • 9. Step 2: Understand the data • What is the difference: driver_rides vs. rides_driver_total? • What do the different fields mean? ‒ Is driver_rides.completed different from rides_driver_total.lifetime_completed? ‒ What period of time does the data in each table cover? • Dig deeper: explore using SQL queries: SELECT * FROM schema.driver_rides WHERE ds='2019-05-15' LIMIT 100; SELECT * FROM schema.rides_driver_total WHERE ds='2019-05-15' LIMIT 100;
  • 10. Data Scientists spend up to 1/3 of their time on data discovery • Data discovery is a problem because of the lack of understanding of what data exists, where it lives, who owns it, & how to use it. • It is not what our data scientists should focus on: they should focus on analysis work. Data-based decision making process: 1. Search & find data 2. Understand the data 3. Perform an analysis/visualisation 4. Share insights and/or make a decision
  • 12. User Personas (1/2): Analysts, Data Scientists, General Managers, Experimenters, Engineers, Product Managers • Frequent use of data • Deep to very deep analysis • Exposure to new datasets • Creating insights & developing models
  • 13. User Personas (2/2). Power User: has been at Lyft for a long time; knows the data environment well (where to find data, what it means, how to use it). Pain points: needs to spend a fair amount of their time sharing their knowledge with new users; could become a “New User” if they switch teams. New User: recently joined Lyft or switched to a new team; needs to ramp up on a lot of things and wants to start having impact soon. Pain points: doesn’t know where to start; spends their time asking questions and using cmd+F on GitHub; makes mistakes by misusing some datasets.
  • 14. 3 complementary ways to do Data Discovery 14 Search based I am looking for a table with data on “cancel rates” - Where is the table? - What does it contain? - Has the analysis I want to perform already been done? Lineage based If this event is down, what datasets are going to be impacted? - Upstream/downstream lineage - Incidents, SLA misses, Data quality Network based I want to check what tables my manager uses - Ownership information - Bookmarking - Usage through query logs
  • 15. Data Discovery at Lyft 15 Product named after Roald Amundsen ● First expedition to reach the South Pole ● First to explore both North & South Poles
  • 16. Landing Page - Optimized for search
  • 17. Search Results - Ranked on relevance & popularity
  • 18. Relevance - search for “apple” on Google 18 Low relevance High relevance
  • 19. Popularity - search for “apple” on Google 19 Low popularity High popularity
  • 20. Search Results - Striking the balance. Relevance: ● Names, Descriptions, Tags, [owners, frequent users] ● Different weights for different metadata, e.g. resource name. Popularity: ● Querying activity ● Dashboarding ● Lower weight for automated querying ● Higher weight for ad hoc querying
  • 24. Computed Column Metadata Statistics Disclaimer: these stats are arbitrary.
  • 28, 30. Architecture diagram. Components shown: Metadata Sources (Postgres, Hive, Redshift, ..., Presto, GitHub Source File); Databuilder Crawler; Neo4j; Elasticsearch; Metadata Service; Search Service; Frontend Service; Other Microservices (ML Feature Service, Security Service)
  • 32. Why choose a graph database? 32
  • 35. 2. Metadata Service • A thin proxy layer to interact with the graph database ‒ Currently Neo4j is the default option for the graph backend engine ‒ Working with the community to support Apache Atlas • Supports a REST API for other services to push / pull metadata directly
  • 36. Neo4j is the source of truth for editable metadata 36
  • 37-40. Why not propagate the editable metadata back to the source?
  • 42. Architecture diagram. Components shown: Metadata Sources (Postgres, Hive, Redshift, ..., Presto, GitHub Source File); Databuilder Crawler; Neo4j; Elasticsearch; Metadata Service; Search Service; Frontend Service; Other Microservices (ML Feature Service, Other Services)
  • 44. Metadata - Challenges • No Standardization: No single data model fits all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Different Extraction: Each data set's metadata is stored and fetched differently ‒ Hive table: stored in the Hive metastore ‒ RDBMS (Postgres, etc.): fetched through the DBAPI interface ‒ GitHub source code: fetched through a git hook ‒ Mode dashboard: fetched through the Mode API ‒ …
  • 47. How is the databuilder orchestrated? 47 Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 50. 3. Search Service • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the search backend. • Support different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search 50
  • 51. How to make the search results more relevant? • Experiment with different weights, e.g. boost the ranking of exact table-name matches • Collect metrics ‒ Instrumentation for search behavior ‒ Measure click-through rate (CTR) over the top 5 results • Advanced search: ‒ Support wildcard search (e.g. event_*) ‒ Support category search (e.g. column: is_line_ride) ‒ Future: Filtering, Autosuggest
  • 57. Amundsen’s Impact • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! ‒ 90% penetration among Data Scientists ‒ +30% productivity for the Data science org. 57
  • 58. Amundsen is Open Source! • github.com/lyft/amundsen • Growing and active community ‒ c.150 github stars, 10+ companies contributing back ‒ Slack w/ 30+ companies and c.100 people ‒ Presented at conferences in San Francisco, Barcelona, Vilnius, Moscow by Lyft employees and community ‒ Featured in blog posts and interviews • Net positive impact for Lyft through external community contributing ‒ Integration with open source backend ‒ Integration with new data sources (BigQuery, Redshift, Postgres), lifting them from our roadmap 58
  • 60, 63. Roadmap (diagram). Phases: Phase 1 (Complete), Phase 2 (In Progress), Phase 3 (In Scoping). Items shown: People, Dashboards, Data sets, Streams, Schemas, Workflows, More Metadata, Deeper integration with other tools (e.g. Mode), Privacy, Governance
  • 66. Tamika Tannis | @ttannis | /in/tamika-tannis Project Code @ github.com/lyft/amundsen Blog Post @ go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 66
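
Slide 20 describes search ranking as a balance between relevance (weighted metadata matches) and popularity (query activity, with ad hoc queries weighted above automated ones). The idea can be sketched in a few lines of Python. This is only an illustration: the field weights, the log damping, the `blend` factor, and all names below are assumptions, not Amundsen's actual values or implementation.

```python
from __future__ import annotations

import math
from dataclasses import dataclass

# Hypothetical weights: a name match counts more than a description or tag match.
FIELD_WEIGHTS = {"name": 3.0, "description": 1.0, "tags": 1.5}
# Hypothetical weights: ad hoc queries signal more human interest than automated ones.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1


@dataclass
class Table:
    name: str
    description: str
    tags: list[str]
    adhoc_queries: int
    automated_queries: int


def relevance(table: Table, term: str) -> float:
    """Sum the weights of the metadata fields that contain the search term."""
    term = term.lower()
    score = 0.0
    if term in table.name.lower():
        score += FIELD_WEIGHTS["name"]
    if term in table.description.lower():
        score += FIELD_WEIGHTS["description"]
    if any(term in tag.lower() for tag in table.tags):
        score += FIELD_WEIGHTS["tags"]
    return score


def popularity(table: Table) -> float:
    """Weighted query count, log-damped so heavy automated traffic does not dominate."""
    weighted = (ADHOC_WEIGHT * table.adhoc_queries
                + AUTOMATED_WEIGHT * table.automated_queries)
    return math.log1p(weighted)


def rank(tables: list[Table], term: str, blend: float = 0.5) -> list[Table]:
    """Return the matching tables, best first, by relevance plus blended popularity."""
    matches = [t for t in tables if relevance(t, term) > 0]
    return sorted(
        matches,
        key=lambda t: relevance(t, term) + blend * popularity(t),
        reverse=True,
    )
```

With the two tables from the fraud-analyst example, a table queried mostly by people can outrank one that is hit mainly by scheduled jobs, even though both match the search term in their names.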

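Slide 44 points out that every kind of resource stores and exposes its metadata differently. One common way to absorb that variety is to put each source behind a shared extractor interface that emits a single record shape. The Python sketch below illustrates that pattern only. The class names, record fields, and in-memory stand-ins for the sources are hypothetical and are not Databuilder's actual API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical common shape for any data resource (table, dashboard, ...)."""
    resource_type: str
    key: str
    description: str


class Extractor(ABC):
    """Every source implements the same method, whatever its storage looks like."""

    @abstractmethod
    def extract(self) -> Iterator[ResourceRecord]:
        ...


class TableExtractor(Extractor):
    """Stands in for a metastore or DB-API source; rows are (schema, table, comment)."""

    def __init__(self, rows: list[tuple[str, str, str | None]]) -> None:
        self._rows = rows

    def extract(self) -> Iterator[ResourceRecord]:
        for schema, table, comment in self._rows:
            yield ResourceRecord("table", f"{schema}.{table}", comment or "")


class DashboardExtractor(Extractor):
    """Stands in for a dashboard API source; items mimic decoded JSON objects."""

    def __init__(self, items: list[dict[str, str]]) -> None:
        self._items = items

    def extract(self) -> Iterator[ResourceRecord]:
        for item in self._items:
            yield ResourceRecord("dashboard", item["id"], item.get("title", ""))


def run_job(extractors: list[Extractor]) -> list[ResourceRecord]:
    """Collect records from every source into one list for a downstream loader."""
    records: list[ResourceRecord] = []
    for extractor in extractors:
        records.extend(extractor.extract())
    return records
```

Because the job only depends on `Extractor`, supporting a new kind of source means adding one class rather than changing the pipeline.
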
Editor's Notes

  1. Name & Role working on an open-source data discovery tool at Lyft. It’s called “Amundsen” -- more on that name later. It leverages Neo4j, glad to share how we’ve been using Neo4j at Lyft to achieve goals of our product Amundsen.
  2. On the agenda for this talk
  3. The data infrastructure at Lyft can be visualized by this diagram. Events are fired into streaming frameworks (Apache Kafka / Amazon Kinesis). Apache Flink ingests that data into Amazon S3, the first, persistent layer of storage. Data stored in Amazon S3 is further transformed and stored in various other datastores. We initially started with Redshift, then introduced new datastores with different strengths that better serve specific purposes: Hive for long-running queries/ETLs, Presto for quick analysis & as-needed queries, Druid for faster interactive queries. The takeaway from this slide: lots of data (~10PB), lots of places it can be (thousands of tables), and lots of tools/people trying to use the data on a regular basis.
  4. Now onto challenges with data discovery
  5. Effective data discovery is important because data is at the heart of every decision we make. It is the only way to make informed, objective decisions. Applies to many roles Data-driven decision making process Search & find data Understand the data Perform an analysis Share insights or make a decision
  6. To highlight some data discovery pain points that occur without the proper tools, let’s walk through a hypothetical example.
  7. Your experience searching and finding data may involve doing all of the following 3 things.
  8. Your experience understanding the data doesn’t get any easier.
  9. ⅓ of time on data discovery Difficult to find what exists, understand whether or not it’s what you are looking for, or trust that it is the source of truth for that information We can significantly increase productivity and impact if we can reduce this time...
  10. Let’s start to think of what a helpful tool would look like. Complication: What audience to serve? Who are they and what do they need?
  11. What audience to serve?
  12. Second level of personas to consider.
  13. Lastly, by what means do they want to perform discovery? There are 3 complementary ways to do Data Discovery. Search based: most common and top priority. Lineage based: callback to the data ecosystem; if there is a hiccup in that system, what does it impact? Datasets must be trustworthy. Network based: helpful on the job to know what others are using for what purpose.
  14. We’ve talked about some pain points of data discovery and why it’s important, let’s talk about our solution -- Amundsen.
  15. Disclaimer Representative data Amundsen circa March 2019 Our landing page is optimized for search Most common method of data discovery, presented with search bar & help text for some advanced search features We also want the landing page to be able to help users that don’t know what to search for. Created this concept of popular tables
  16. Users presented with ranked search results Not like page-rank but based on relevance and popularity
  17. This is what we mean when we say relevance
  18. This is what we mean when we say popularity
  19. Striking the balance between the two is an interesting challenge Relevance is based on metadata Popularity is not click through rate but through query access patterns.
  20. Now that I’ve demonstrated what Amundsen is and how it can be used, let’s talk about how it was built.
  21. Microservice architecture, services are divided by domains: ui/frontend, search, metadata Walk from top to bottom & highlight “pluggability”
  22. I’ll now dive deeper into each of the Amundsen services presented in the previous graph, starting with metadata service, which is backed by Neo4j.
  23. As you may remember from the application walkthrough, Amundsen surfaces resource metadata and that is what we are storing in Neo4j
  24. However graph databases are not common for many web applications, and so one might ask why choose a graph database.
  25. Well, if you remember the diagram of the data ecosystem at Lyft from the beginning of the talk, that can be modeled as a graph. This is a very powerful feature, because the alternative for creating these kinds of relationships, with an RDBMS, is joins. A NoSQL database isn’t set up for this.
  26. Let’s take a note of some of the features from the table detail page again and see how this is represented in Neo4j Walk through features What’s very beneficial about this is that when we have a new use case and a new piece of metadata to represent, we just have to create the new node and relationship.
  27. It’s worth noting that one key architectural decision made for this service and others is that it is a proxy to interact with Neo4j, which means it can also interact with anything else that can store this data. This choice is key for us as an open source project.
  28. Another key characteristic of our system is that Neo4j is the source of truth for our editable metadata.
  29. This was actually not our original intent, we ran into a roadblock when we were first implementing the description editing feature. We originally had a setup like this
  30. Then we realized we forgot to account for something. Tables can get rebuilt using the source code that generated the table and descriptions will be overwritten
  31. Then we thought about whether or not we could do this: update them both!
  32. The answer was no. ...And that’s how Neo4j became the source of truth for editable metadata
  33. Now onto the data builder service
  34. It is the layer that ingests metadata from the sources. Which sources exactly?
  35. Many sources. Not just tables but dashboards, different kinds of resources This creates some complexity
  36. This is what Databuilder helps to address. It is a data ingestion framework similar to Apache Gobblin, and it functions as an ETL engine. Each part is modularized and can be reused (e.g. the same transformer) or swapped out. A publisher leverages a transaction to make the data ingestion atomic, so there is never partially updated data.
  37. Here is a more solid example
  38. How is this all orchestrated? With Airflow DAGs: these are jobs that run to execute each piece of the puzzle. End with Elasticsearch.
  39. Elasticsearch is what sits behind our search service...
  40. ... in the same way that Neo4j stands behind the metadata service. Similar data is loaded into both; there are some minor differences, for example data that won’t be searchable, like col stats.
  41. Also similar to the metadata service the search service acts as proxy
  42. What I find most interesting about the search service is actually the biggest problem that we struggle with, “how to make the search results more relevant”.
  43. Lastly, we have Amundsen’s frontend service.
  44. Obligatory slide
  45. T