March 2019
Mark Grover | @mark_grover | Product Management, Lyft
Tao Feng | @feng-tao | Software Engineer, Lyft
Disrupting Data Discovery
Agenda
• Data at Lyft
• Challenges with Data Discovery
• Data Discovery at Lyft
• Architecture
• Summary
Data platform users
Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
Core Infra high level architecture
Custom apps
Data Discovery
Hi! I am a n00b Data Scientist!
• My first project is to analyze and predict Strata Attendance
• Where is the data?
• What does it mean?
Status quo
• Option 1: Phone a friend!
• Option 2: GitHub search
Understand the context
• What does this field mean?
‒ Does attendance data include employees?
‒ Does it include revenue?
• Let me dig in and understand
Explore
SELECT
*
FROM
default.my_table
WHERE ds = '2018-01-01'
LIMIT 100;
Exploring with SELECT * is EVIL
1. Lack of productivity for data scientists
2. Increased load on the databases
Data Scientists spend up to one-third of their time on Data Discovery...
• Data discovery
‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
Audience for data discovery
Data Discovery - User personas
Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
3 Data Scientist personas
Power user
● All info in their head
● Get interrupted a lot due to questions
Noob user
● Lost
● Ask “power users” a lot of questions
Manager
● Dependencies landing on time
● Communicating with stakeholders
Data Discovery answers 3 kinds of questions
Search based
● Where is the table/dashboard for X? What does it contain?
● Does this analysis already exist?
Lineage based
● I am changing a data model, who are the owners and most common users?
● This table’s delivery was delayed today, I want to notify everyone downstream.
Network based
● I want to follow a power user in my team.
● I want to bookmark tables of interest and get a feed of data delays, schema changes, incidents.
Buy vs. Build vs. Adopt
Compared various existing solutions/open source projects
Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas
Criteria: Search based, Lineage based, Network based, Hive/Presto support, Redshift support, Open source (pref.)
Meet Amundsen
Named after the Norwegian explorer Roald Amundsen, the first person to reach the South Pole
Landing page optimized for search
Search results ranked on relevance and query activity
How does search work?
Relevance - search for “apple” on Google
Low relevance vs. high relevance
Popularity - search for “apple” on Google
Low popularity vs. high popularity
Striking the balance
Relevance
● Names, Descriptions, Tags, [owners, frequent users]
Popularity
● Querying activity
● Dashboarding
● Different weights for automated vs. ad hoc querying
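One way to picture the balance between relevance and popularity is a blended score. The sketch below is illustrative only and is not Amundsen's actual ranking: the `TableStats` fields, the weights, and the log damping are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    text_score: float       # text match on name/description/tags, assumed in [0, 1]
    adhoc_queries: int      # queries issued by people
    automated_queries: int  # queries issued by scheduled jobs


# Hypothetical weights: ad hoc querying counts more than automated querying.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1
RELEVANCE_SHARE = 0.7  # share of the final score that comes from relevance


def popularity(stats: TableStats) -> float:
    """Weighted query count, log-damped so heavy tables do not dominate."""
    weighted = (ADHOC_WEIGHT * stats.adhoc_queries
                + AUTOMATED_WEIGHT * stats.automated_queries)
    return math.log1p(weighted)


def rank(tables: list) -> list:
    """Sort tables by a blend of relevance and normalized popularity."""
    max_pop = max((popularity(t) for t in tables), default=0.0) or 1.0

    def score(t: TableStats) -> float:
        return (RELEVANCE_SHARE * t.text_score
                + (1 - RELEVANCE_SHARE) * popularity(t) / max_pop)

    return sorted(tables, key=score, reverse=True)
```

With these assumed weights, a table that people query by hand outranks an equally relevant table that is only read by scheduled jobs.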
Back to mocks...
Detailed description and metadata about data resources
Data Preview within the tool
Computed stats about column metadata
Disclaimer: these stats are arbitrary.
Built-in user feedback
Amundsen’s architecture
[Architecture diagram]
Metadata Sources: Postgres, Hive, Redshift, ..., Presto, GitHub source file
Ingestion: Databuilder Crawler
Stores: Neo4j, Elasticsearch
Services: Metadata Service, Search Service, Frontend Service
Other Microservices: ML Feature Service, Security Service
1. Frontend Service
Detailed description and metadata about data resources
2. Metadata Service
• A thin proxy layer to interact with the graph database
‒ Currently Neo4j is the default option for the graph backend engine.
‒ Working with the community to support Apache Atlas
• Supports a REST API for other services to push / pull metadata directly
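The "thin proxy" idea can be sketched as an interface that hides the graph backend from the request handlers. This is a minimal illustration, not the Metadata Service's real code: `GraphProxy`, `InMemoryProxy`, and the handler names are hypothetical, and an in-memory dict stands in for Neo4j or Apache Atlas.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class GraphProxy(ABC):
    """Hypothetical backend interface; one implementation per graph engine."""

    @abstractmethod
    def get_table(self, key: str) -> Optional[dict]:
        ...

    @abstractmethod
    def put_table(self, key: str, metadata: dict) -> None:
        ...


class InMemoryProxy(GraphProxy):
    """Stand-in backend for this sketch; a real one would talk to a graph DB."""

    def __init__(self) -> None:
        self._tables: Dict[str, dict] = {}

    def get_table(self, key: str) -> Optional[dict]:
        return self._tables.get(key)

    def put_table(self, key: str, metadata: dict) -> None:
        self._tables[key] = dict(metadata)


class MetadataService:
    """Handlers depend only on GraphProxy, so the backend can be swapped."""

    def __init__(self, proxy: GraphProxy) -> None:
        self._proxy = proxy

    def handle_get(self, key: str) -> Tuple[int, dict]:
        # Pull: another service reads metadata.
        table = self._proxy.get_table(key)
        if table is None:
            return 404, {"message": f"table {key} not found"}
        return 200, table

    def handle_put(self, key: str, metadata: dict) -> Tuple[int, dict]:
        # Push: another service writes metadata directly.
        self._proxy.put_table(key, metadata)
        return 200, {"message": "ok"}
```

Supporting another backend then means adding one more `GraphProxy` implementation and leaving the handlers unchanged.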
Trade-off #1: Why choose a graph database?
Trade-off #2: Why not propagate the metadata back to the source?
3. Search Service
• Supports a REST API for building indexes
• A thin proxy layer to interact with the search backend
‒ Currently it supports Elasticsearch as the backend.
• Supports different search patterns
‒ Normal Search: match records based on relevancy
‒ Category Search: match records first based on data type, then relevancy
‒ Wildcard Search
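The three search patterns can be illustrated with a small matcher over in-memory records. This sketch only filters and does no relevance scoring. It does not use Elasticsearch, and the record layout, `parse_query`, and `search` are assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical index records; a real deployment would query the search backend.
DOCS = [
    {"table": "rides", "column": ["ride_id", "is_line_ride"],
     "description": "One row per ride"},
    {"table": "drivers", "column": ["driver_id", "rating"],
     "description": "Driver profile"},
    {"table": "ride_events", "column": ["ride_id", "event"],
     "description": "Event stream for rides"},
]


def parse_query(query: str):
    """Split 'category: term' queries; category is None for a normal search."""
    if ":" in query:
        category, term = query.split(":", 1)
        return category.strip().lower(), term.strip()
    return None, query.strip()


def matches(value: str, term: str) -> bool:
    """Wildcard terms are used as-is; plain terms match as substrings."""
    pattern = term if ("*" in term or "?" in term) else f"*{term}*"
    return fnmatchcase(value.lower(), pattern.lower())


def search(query: str, docs: list) -> list:
    category, term = parse_query(query)
    # Category search restricts matching to one field; normal search uses all.
    fields = [category] if category else ["table", "column", "description"]
    hits = []
    for doc in docs:
        values = []
        for field in fields:
            value = doc.get(field, [])
            values.extend(value if isinstance(value, list) else [value])
        if any(matches(v, term) for v in values):
            hits.append(doc["table"])
    return hits
```

For example, `search("column: is_line_ride", DOCS)` only looks at column names, while `search("ride*", DOCS)` applies the wildcard across every field.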
Challenge #1: How to make the search results more relevant?
• Define a search quality metric
‒ Click-Through-Rate (CTR) over top 5 results
• Search behaviour instrumentation is key
• A couple of improvements:
‒ Boost exact table-name matches in the ranking
‒ Support wildcard search
‒ Support category search (e.g. “column: is_line_ride”)
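The search quality metric above can be computed from instrumented search events. A minimal sketch, assuming each logged event records the 1-based rank of the clicked result or `None` when nothing was clicked. The `SearchEvent` shape is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SearchEvent:
    query: str
    clicked_rank: Optional[int]  # 1-based rank of the clicked result, None if no click


def ctr_at_k(events: Sequence[SearchEvent], k: int = 5) -> float:
    """Fraction of searches that led to a click within the top k results."""
    if not events:
        return 0.0
    clicks = sum(
        1 for e in events
        if e.clicked_rank is not None and e.clicked_rank <= k
    )
    return clicks / len(events)
```

Tracking this number before and after a ranking change gives a simple way to tell whether the change helped.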
4. Databuilder
Challenge #1: Various forms of metadata
Metadata Sources @ Lyft
Metadata - Challenges
• Standardization: No single data model fits all data resources
‒ A data resource could be a table, an Airflow DAG or a dashboard
• Extraction: Each data set’s metadata is stored and fetched differently
‒ Hive table: Stored in the Hive metastore
‒ RDBMS (Postgres, etc.): Fetched through the DBAPI interface
‒ GitHub source code: Fetched through a git hook
‒ Mode dashboard: Fetched through the Mode API
‒ …
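As an illustration of the DBAPI route, the sketch below extracts column metadata into one common record shape. It uses Python's built-in `sqlite3` module (a DB-API 2.0 driver) as a stand-in for Postgres, so the catalog queries are SQLite-specific. `ColumnMetadata` and `extract_sqlite_columns` are hypothetical names, not Databuilder's API.

```python
import sqlite3
from typing import Iterator, NamedTuple


class ColumnMetadata(NamedTuple):
    """Common record shape that every extractor in this sketch would emit."""
    table: str
    column: str
    col_type: str


def extract_sqlite_columns(conn: sqlite3.Connection) -> Iterator[ColumnMetadata]:
    """Walk the catalog through the DB-API cursor and yield one record per column."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    tables = [row[0] for row in cur.fetchall()]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        cur.execute(f'PRAGMA table_info("{table}")')
        for _cid, name, col_type, *_rest in cur.fetchall():
            yield ColumnMetadata(table, name, col_type)


# Usage: build a throwaway database and extract its metadata.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, ds TEXT)")
records = list(extract_sqlite_columns(conn))
```

An extractor for another source (Hive metastore, Mode API) would emit the same record type, which is how differently stored metadata can land in one model.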
Challenge #2: Pull model vs. push model
Pull Model
● Periodically update the index by pulling from the system (e.g. database) via crawlers.
● Diagram: Scheduler, Crawler, Database, Data graph
Push Model
● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to.
● Diagram: Database, Message queue, Data graph
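The two models can be contrasted in a few lines. This is a toy sketch under stated assumptions: `Database`, `DataGraph`, `crawl`, and `consume` are hypothetical names, and `queue.Queue` stands in for a real message bus.

```python
import queue


class DataGraph:
    """Stand-in for the metadata store that both models write into."""

    def __init__(self):
        self.tables = {}

    def upsert(self, name, metadata):
        self.tables[name] = metadata


class Database:
    """Toy source system; publishes to a bus only when one is configured."""

    def __init__(self, bus=None):
        self.catalog = {}
        self.bus = bus

    def create_table(self, name, columns):
        self.catalog[name] = {"columns": list(columns)}
        if self.bus is not None:
            # Push model: the source emits the change as it happens.
            self.bus.put((name, self.catalog[name]))


def crawl(database, graph):
    """Pull model: a scheduler runs this periodically and re-reads the catalog."""
    for name, metadata in database.catalog.items():
        graph.upsert(name, metadata)


def consume(bus, graph):
    """Push model: a downstream subscriber drains whatever has been published."""
    while True:
        try:
            name, metadata = bus.get_nowait()
        except queue.Empty:
            break
        graph.upsert(name, metadata)
```

In the pull model the graph is only as fresh as the last crawl, while in the push model freshness depends on the source system being able to publish events.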
Databuilder in action
How are we building data? Databuilder
How is Databuilder orchestrated?
Amundsen uses Apache Airflow to orchestrate Databuilder jobs
What’s next?
Amundsen seems to be more useful than we thought
• Tremendous success at Lyft
‒ Used by Data Scientists, Engineers, PMs, Ops, even Customer Service!
• Many organizations have similar problems
‒ Collaborating with ING, WeWork and more
‒ We plan to announce the open-source release soon
Impact - Amundsen at Lyft
Release milestones: Alpha release, Beta release (internal), Generally Available (GA) release
Adding more kinds of data resources
Resources: People, Dashboards, Data sets, Streams, Schemas, Workflows
Phases: Phase 1 (Complete), Phase 2 (In development), Phase 3 (In scoping)
Serving more metadata about existing resources
• Application Context: Existence, description, semantics, etc.
• Behavior: How data is created and used over time
• Change: How data is changing over time
Summary
• Data discovery challenges make data scientists less productive
• 3 types of data discovery - search, lineage and network based
• Amundsen: Data graph for all data
• Blog post with more details: go.lyft.com/datadiscoveryblog
Icons under Creative Commons License from https://thenounproject.com/

Strata sf - Amundsen presentation

  • 1. March 2019 Mark Grover | @mark_grover | Product Management, Lyft Tao Feng | @feng-tao | Software Engineer, Lyft Disrupting Data Discovery
  • 2. Agenda • Data at Lyft • Challenges with Data Discovery • Data Discovery at Lyft • Architecture • Summary 2
  • 3. Data platform users: Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
  • 4. 5 Core Infra high level architecture Custom apps
  • 6. • My first project is to analyze and predict Strata Attendance • Where is the data? • What does it mean? Hi! I am a n00b Data Scientist! 7
  • 7. • Option 1: Phone a friend! • Option 2: Github search Status quo 8
  • 8. • What does this field mean? ‒ Does attendance data include employees? ‒ Does it include revenue? • Let me dig in and understand Understand the context 9
  • 10. Exploring with SELECT * is EVIL 1. Lack of productivity for data scientists 2. Increased load on the databases 11
  • 11. Data Scientists spend up to 1/3rd of their time in Data Discovery... • Data discovery ‒ Lack of understanding of what data exists, where, who owns it, who uses it, and how to request access.
  • 13. Data Discovery - User personas: Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, Product Managers
  • 14. 3 Data Scientist personas • Power user: ● All info in their head ● Get interrupted a lot due to questions • Noob user: ● Lost ● Ask “power users” a lot of questions • Manager: ● Dependencies landing on time ● Communicating with stakeholders
  • 15. Data Discovery answers 3 kinds of questions • Search based: Where is the table/dashboard for X? What does it contain? Does this analysis already exist? • Lineage based: I am changing a data model, who are the owners and most common users? This table’s delivery was delayed today, I want to notify everyone downstream. • Network based: I want to follow a power user in my team. I want to bookmark tables of interest and get a feed of data delays, schema changes, and incidents.
  • 16. Buy vs. Build vs. Adopt 17
  • 17. Compared various existing solutions/open source projects • Products: Alation, WhereHows, Airbnb Data Portal, Cloudera Navigator, Apache Atlas • Criteria: Search based, Lineage based, Network based, Hive/Presto support, Redshift support, Open source (pref.)
  • 18. Meet Amundsen 19 First person to discover the South Pole - Norwegian explorer, Roald Amundsen
  • 20. Search results ranked on relevance and query activity
  • 21. How does search work? 22
  • 22. Relevance - search for “apple” on Google 23 Low relevance High relevance
  • 23. Popularity - search for “apple” on Google 24 Low popularity High popularity
  • 24. Striking the balance • Relevance: ● Names, Descriptions, Tags, [owners, frequent users] • Popularity: ● Querying activity ● Dashboarding ● Different weights for automated vs adhoc querying
  • 26. Search results ranked on relevance and query activity
  • 27. Detailed description and metadata about data resources
  • 29. Computed stats about column metadata Disclaimer: these stats are arbitrary.
  • 32. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File. Components: Databuilder Crawler, Neo4j, Elastic Search, Metadata Service, Search Service, Frontend Service, ML Feature Service, Security Service, Other Microservices
  • 34. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File. Components: Databuilder Crawler, Neo4j, Elastic Search, Metadata Service, Search Service, Frontend Service, ML Feature Service, Security Service, Other Microservices
  • 35. Detailed description and metadata about data resources
  • 37. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File. Components: Databuilder Crawler, Neo4j, Elastic Search, Metadata Service, Search Service, Frontend Service, ML Feature Service, Security Service, Other Microservices
  • 38. 2. Metadata Service • A thin proxy layer to interact with the graph database ‒ Currently Neo4j is the default option for the graph backend engine. ‒ Working with the community to support Apache Atlas • Supports a REST API for other services pushing / pulling metadata directly
  • 39. Trade Off #1 Why choose Graph database 40
  • 42. Trade Off #2 Why not propagate the metadata back to source 43
  • 43. Why not propagate the metadata back to source 44
  • 44. Why not propagate the metadata back to source 45 ? ?
  • 46. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File. Components: Databuilder Crawler, Neo4j, Elastic Search, Metadata Service, Search Service, Frontend Service, ML Feature Service, Security Service, Other Microservices
  • 47. 3. Search Service • Supports a REST API for building indexes • A thin proxy layer to interact with the search backend ‒ Currently it supports Elasticsearch as the backend. • Supports different search patterns ‒ Normal Search: match records based on relevancy ‒ Category Search: match records first based on data type, then relevancy ‒ Wildcard Search
  • 48. Challenge #1 How to make the search result more relevant? 49
  • 49. How to make the search result more relevant? 50 • Define a search quality metric ‒ Click-Through-Rate (CTR) over top 5 results • Search behaviour instrumentation is key • Couple of improvements: ‒ Boost the exact table ranking ‒ Support wildcard search ‒ Support category search (e.g. “column: is_line_ride”)
  • 51. Architecture diagram. Metadata Sources: Postgres, Hive, Redshift, ..., Presto, Github Source File. Components: Databuilder Crawler, Neo4j, Elastic Search, Metadata Service, Search Service, Frontend Service, ML Feature Service, Other Services, Other Microservices
  • 52. Challenge #1 Various forms of metadata 53
  • 54. Metadata - Challenges • Standardization: No single data model fits all data resources ‒ A data resource could be a table, an Airflow DAG or a dashboard • Extraction: Each data set's metadata is stored and fetched differently ‒ Hive Table: Stored in Hive metastore ‒ RDBMS (Postgres etc.): Fetched through DBAPI interface ‒ Github source code: Fetched through git hook ‒ Mode dashboard: Fetched through Mode API ‒ …
  • 55. Challenge #2 Pull model vs Push model 56
  • 56. Pull model vs. Push model • Pull Model: ● Periodically update the index by pulling from the system (e.g. database) via crawlers. Diagram: Crawler, Database, Data graph, Scheduler • Push Model: ● The system (e.g. database) pushes metadata to a message bus which downstream subscribes to. Diagram: Database, Message queue, Data graph
  • 59. How are we building data? Databuilder
  • 60. How is databuilder orchestrated? Amundsen uses Apache Airflow to orchestrate Databuilder jobs
  • 62. Amundsen seems to be more useful than what we thought • Tremendous success at Lyft ‒ Used by Data Scientists, Engineers, PMs, Ops, even Cust. Service! • Many organizations have similar problems ‒ Collaborating with ING, WeWork and more ‒ We plan to announce open source soon 65
  • 63. Impact - Amundsen at Lyft 66 Beta release (internal) Generally Available (GA) release Alpha release
  • 64. Adding more kinds of data resources • People, Dashboards, Data sets • Phase 1 (Complete), Phase 2 (In development), Phase 3 (In Scoping) • Streams, Schemas, Workflows
  • 65. Serving more metadata about existing resources • Application Context: Existence, description, semantics, etc. • Behavior: How data is created and used over time • Change: How data is changing over time
  • 67. Summary • Data Discovery making data scientists unproductive • 3 types of data discovery - search, lineage and network based • Amundsen: Data graph for all data • Blog post with more details: go.lyft.com/datadiscoveryblog 70
  • 68. Mark Grover | @mark_grover Tao Feng | @feng-tao Blog post at go.lyft.com/datadiscoveryblog Icons under Creative Commons License from https://thenounproject.com/ 71
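
The ranking idea from slides 24 and 49 (balance relevance against popularity, weight automated and ad hoc querying differently, boost exact table-name matches) can be sketched roughly as follows. This is a minimal illustration, not Amundsen's actual scoring: the `TableDoc` fields, the three weight constants, and the scoring formula are all assumptions made up for the example, and the deck does not state the real values.

```python
import math
from dataclasses import dataclass


@dataclass
class TableDoc:
    name: str
    description: str
    adhoc_queries: int      # hypothetical counter: queries issued by people
    automated_queries: int  # hypothetical counter: queries issued by scheduled jobs


# Illustrative weights only; the deck does not give the real values.
ADHOC_WEIGHT = 1.0
AUTOMATED_WEIGHT = 0.1
EXACT_NAME_BOOST = 10.0


def relevance(doc: TableDoc, query: str) -> float:
    """Fraction of query terms found in the table name or description."""
    # Underscores are treated as spaces, so "event_ride" matches "event ride".
    terms = query.lower().replace("_", " ").split()
    if not terms:
        return 0.0
    text = f"{doc.name} {doc.description}".lower().replace("_", " ")
    words = set(text.split())
    return sum(term in words for term in terms) / len(terms)


def popularity(doc: TableDoc) -> float:
    """Log-damped, weighted query count so automated load does not dominate."""
    weighted = (ADHOC_WEIGHT * doc.adhoc_queries
                + AUTOMATED_WEIGHT * doc.automated_queries)
    return math.log1p(weighted)


def score(doc: TableDoc, query: str) -> float:
    """Combine relevance and popularity; irrelevant tables always score zero."""
    rel = relevance(doc, query)
    if rel == 0.0:
        return 0.0
    boost = EXACT_NAME_BOOST if doc.name.lower() == query.lower() else 1.0
    return boost * rel * (1.0 + popularity(doc))


def search(docs, query):
    """Return names of matching tables, best score first."""
    hits = [(score(doc, query), doc.name) for doc in docs]
    return [name for s, name in sorted(hits, key=lambda h: (-h[0], h[1])) if s > 0]
```

Calling `search(docs, "event_ride")` on a small list of `TableDoc` objects returns table names ordered by the combined score. Multiplying relevance by popularity, rather than adding them, is one possible design choice: it keeps a very popular but irrelevant table out of the results entirely.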

Editor's Notes

  1. Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  2. Who is our audience: everyone who works at Lyft… Power users: Data Scientists, Research Scientists, Product Managers… Next: Engineers, GMs, Ops, etc.
  3. What do we do: Wide scope of work… de-mystify and democratize data at Lyft… Not mentioned here: Operational / transactional databases...
  4. What does the architecture for our core infra look like? Mobile application primarily… Raw events can come either from the client… or from the back end events triggered in the server… the data comes to our message bus… Kinesis/Kafka and then with light ELTing written to S3 where it persists… today we keep all the data in archival… then we develop data models and transform raw events to tables in Hive. We use Hive for long-running queries and Presto for interactive queries… People build dashboards on top of Hive and visualize for exploratory analysis in Presto...
  5. Mark
  6. Mark
  7. Mark
  8. Mark
  9. Data Discovery: How much of a challenge is it? Significant challenge… Data Scientists spend up to 1/3rd of their time in Data Discovery while doing exploratory analysis… We surveyed users at Lyft and a few other companies: You’d want to spend most of the time on analysis… But we have ~10PBs of data, thousands of tables… so it is hard to find what is there and what is the source of truth… We can significantly increase productivity and impact if we can reduce this time...
  10. Mark
  11. Mark
  12. Popularity is not click through rate but through query access patterns.
  13. Amundsen architecture at Lyft: 3 micro-services (FE, metadata, search) and one generic data ingestion framework. Will discuss each of the components in detail. High level walkthrough… how CCPA compliance works
  14. What options we have (graph, rdbms,
  15. ML features (one sentence on what is feature service) Add a logo of neo4j ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  16. Why is a graph db the best option? Why not RDBMS, NoSQL, etc.? Why choose Neo4j vs. other graph dbs? (Most popular graph db.) Amundsen needs to provide metadata, which includes table, column, column statistics, usage information, etc. Along with that, Amundsen also needs to provide lineage information, where it needs to be able to provide producer and consumer relationships within the life cycle of data. Lineage can be simplified as a graph of entities and edges. E.g. in the graph blahblah. There are other options: NoSQL (no join support), RDBMS (join performance is not good).
  17. The above graph data model shows our use case: table and column metadata, with usage down to the column level. Querying a table's details from this Neo4j graph is like asking Neo4j to search for a table node as the starting node and traverse from it. In other words, only one search operation is needed to find the anchor node, which makes Neo4j performant; no join operation at all. We model the graph with bi-directional relationships.
  18. ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  19. ? mention about github, HMS, neo4j backend for description, trade-off, initial version propagate the metadata back, but we found that full-rebuild table doesn’t work in this case. ? what if user modify github source file
  20. ? challenge How to improve search relevance?
  21. Think of search algorithm… Event_ride_table -> event ride table
  22. Talk about how we measure it: instrument empty results from user, % of click through rate What we did? Boost the rank of exact table name match
  23. Walk through pros and cons
  24. The pull approach extracts data from the source periodically. Amundsen databuilder is responsible for extraction, transformation, and load. This naturally gives us three abstracted constructs, Extractor, Transformer, and Loader, and optionally a Publisher. The design principle follows Apache Gobblin. Extractor: extracts records from the source one record at a time. For example, in Hive we would need a column metadata extractor for a table, where each record represents a column of the table. Transformer: transforms a record. This is the place for any use case that transforms a record (e.g. removing special characters) or decorates it (e.g. making a service call to enrich data). Loader: writes data either into the sink (destination) or into a staging area. Publisher: assumes that the loader loaded into a staging area and publishes it to the destination. Atomicity is a desired behavior, but it is limited by how well the sink itself supports atomicity.
  25. We currently index twice a day, dependency. Why do we index twice a day?
  26. Today’s agenda: Why empowering with data is important… What are we doing in the data team at Lyft (context)... What challenges we are facing and have seen other companies face… How are we solving the problem... At the core of it, we will primarily talk about the Data Discovery solution we are building and how we thought about the use case, solution, and the architecture.
  27. A slide on Amundsen @ Lyft? ? how long it has been in prod How many datasets Users WAU Usage
  28. Kafka topic Schema registry ML workflow and Airflow DAGs
  29. Three long-term types of metadata (A, B, C). We want to index A, B, and C for the data sources we mentioned in the last slide
  30. Mark
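
The Extractor / Transformer / Loader / Publisher split described in note 24 can be sketched as below. This is a toy illustration of the pattern only and does not reproduce the real Databuilder interfaces: every class name, method name, and the in-memory "staging area" and "destination" are hypothetical stand-ins chosen for the example.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, str]


class ColumnExtractor:
    """Hypothetical extractor: hands out one column record at a time."""

    def __init__(self, source_rows: Iterable[Record]) -> None:
        self._rows = iter(source_rows)

    def extract(self) -> Optional[Record]:
        # None signals that the source is exhausted.
        return next(self._rows, None)


class CleanNameTransformer:
    """Hypothetical transformer: strips special characters from column names."""

    def transform(self, record: Record) -> Record:
        cleaned = "".join(
            ch for ch in record["column"] if ch.isalnum() or ch == "_"
        )
        return {**record, "column": cleaned}


class StagingLoader:
    """Hypothetical loader: writes to an in-memory staging area, not the sink."""

    def __init__(self) -> None:
        self.staged: List[Record] = []

    def load(self, record: Record) -> None:
        self.staged.append(record)


class DictPublisher:
    """Hypothetical publisher: moves staged records to the destination at once."""

    def __init__(self, destination: Dict[str, List[str]]) -> None:
        self._destination = destination

    def publish(self, staged: List[Record]) -> None:
        grouped: Dict[str, List[str]] = {}
        for record in staged:
            grouped.setdefault(record["table"], []).append(record["column"])
        # A single update at the end stands in for an atomic publish step.
        self._destination.update(grouped)


def run_job(extractor, transformer, loader, publisher) -> None:
    """Pull every record through extract -> transform -> load, then publish."""
    while (record := extractor.extract()) is not None:
        loader.load(transformer.transform(record))
    publisher.publish(loader.staged)
```

Staging first and publishing in one final step mirrors the note's point that atomicity is desirable but depends on the sink: here the destination dictionary only changes once, after all records have been loaded.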