SlideShare une entreprise Scribd logo
1  sur  60
Turning Big Data
into Knowledge
September 25, 2019
Kaan Onuk, Luyao Li, Atul Gupte
Hi, welcome!
● Engineer turned Product Manager
● Previously: building FarmVille & the mobile advertising
platform @ Zynga
● Currently: Product Manager on the Data Platform team
building data science, data knowledge, and interactive
analytics platforms
About me
Atul Gupte
Product Manager
What we’ll talk about today
Data landscape
at Uber
Our journey
since 2016
Metadata management
through Databook
dtbData relationships
through Lineage
lin
We ignite opportunity by setting the world in motion
15M
Trips/Day
700+
Cities
100M
Monthly Users
Data informs every decision at the company
Daily Uber trips
powered by ML
Millions
Messages
processed by Kafka
2T
Queries across
Hive, Vertica and
Presto
1M
Data ingested
into HDFS
150TB
How Big is our Big Data?
Data Platform Team
Move the world with
global data,
local insights, and
intelligent decisions.
Data Infrastructure
Data Platform
DataTools
Data Lake
Logging
Stream
Data
Modelers
Data
Consumers
...
Trips
Users
Data
Engineers
Overview of Data at Uber
Data
Scientists
Raw
~10,000
Curated
~100
Derived
>100,000
Data LakeSources Usage
WAU 8,000+
Queries 1M/day
Pipelines Thousands
Metrics Thousands
Experiments Thousands
ML models 10s of thousands
Self Serve & Open Platform
Use Cases
Eng ETA, surge, safety
DS incentives, churn, pickup
Ops driver onboarding, eats
cash, partner data sharing
Compliance ops metrics, city
Challenges compounded by the scale of Data
Data produced by
Mobile users 100s of millions
Events
Trillions/day
What are users looking to do?
What data exists? How does it look?
Who’s using it? What happens if I
change it?
How can I adapt when this data
changes?
Discover Understand Trust
3+ hours
week
8%
Time wasted
every week
$$$M
Cost to company
Tasks requiring
human skill
Unproductive
time sinks
We power data fluency to help Uber
make confident, data-driven decisions
Any and all users
can access and use
datasets with ease
Users trust our data
because it meets
their expectations
Users access
appropriate data,
through compliant
means
Discover Understand Trust
Discover
Late 2016
● Indexed small amount of data
○ Offline analytics systems
● Datasets only - no other data entities
● Catalogued basic information about datasets
Late 2016
Novice Neville
Data Scientists
Software Engineers
ML Researchers
New to Uber
Requires help finding data
Relies on George for basic tasks
Manager Michelle
General Managers
Product Managers
City Operations
CXOs & other executives
Interface w/regulators &
customers
Meet critical deadlines
Deliver reports and insights
Genius George
Data Scientists
Software Engineers
ML Researchers
Built underlying systems
Tribal knowledge champion
De-facto knowledge bank
Late 2016
2017
HQ Non-HQ
Rideshare Eats Freight ATG Elevate
Support
NLP models for
support tickets
Safety
Trip classification
Uber Eats
Restaurant
recommendations
Operations
LTV models
2017
Data Scientists
Software Engineers
ML/AI Researchers
Advanced SQL
Advanced Statistics
Scala/Spark, Python/R
Data Modeling
Inventor Ivan
Marketing Managers
Entry-level Analysts
General Managers
Product Managers
Limited SQL
Spreadsheets
Reliant Rebecca
City Operations
Regional Managers
Advanced SQL
Spreadsheets
Dashboarding
Monitoring Matt
Operations Managers
Data Analysts
Product Analysts
Advanced SQL
Spreadsheets
Limited Statistics
Limited Python/R
Analyst Anna
2017
2018Cumulativefunctionality
Time
Low internal
quality
High internal
quality
Delivers more rapidly
+ cheaply later
● Users care about a variety of data assets
○ Datasets, dashboards, metrics, etc.
● Users want a holistic view of everything that
exists about their data
○ Ownership
○ Schemas
2018
2018
2018
● Data quality and health are key concerns
● Table usage information is valuable
● Operational and regulatory environment is
growing more complex
○ GDPR
○ Access control & audits
2019
2019
● Unified interface highlighting relevant metadata
○ Ownership & usage
○ Schema and stats
○ Quality & health signals
○ Lineage
2019
● Advanced metadata management
○ Automated ingestion
○ Automated classification
○ Simplified controls for data owners
60+
Types of
metadata
Curiosity Knowledge Wisdom
● Manages Data Lineage team under Data Platform
● Earlier: Senior Software Engineer II at Uber
About me
Luyao Li
Tech Lead Manager
Data Lineage
What is data lineage?
● Where is the data from
● Where it’s been
● How it’s being transformed
“I’m no longer
responsible for this
table, please ask team
X”
“This is an upstream
problem, we can’t fix it”
“Please ask the
table owner”
“How do I find
the pipeline
owner?”
Multiple days
Why does it matter?
Applications
Data Freshness Data Chargebacks Anomaly Detection Compliance
Features
End-to-end Isolated ingestion
Flexible
consumption
Advanced
filtering
10,000-foot
view
1,000-foot
view
Lessons learned
High quality data is essential for success
Always be customer obsessed
Magical search has a huge impact on usability
Make big, bold bets
What’s next
● Column-level lineage
● Self-diagnostic and reporting
● Self-serve onboarding
● Recommendation
/ Manages the Metadata Platform team within Big Data
/ Previously - Senior Software Engineer / Tech Lead for
Data Discovery & Data Privacy @ Uber
About me
Kaan Onuk
Engineering Manager
/ What is metadata?
/ Why does metadata matter?
What is
metadata?
Uber’s massive data holds deep
hidden insights.
Metadata helps to surface them.
Metadata drives data
productivity by making data easy
to discover, understand, and
govern.
/ Metadata Sources
/ Metadata Registry
/ Metadata Collection
/ Metadata Storage
/ Data Model
Metadata
Sources
uMetadata
Metadata Registry/Definition
Metadata Collection
Pull model Push model
○ Crawler (periodic)
e.g. sample data, stats
○ Event-based (Event Listeners)
e.g. data quality
○ Automated
e.g. data retention policies
○ Crowdsource
e.g. table descriptions
Storage
● Hive for analytical queries
and audit purposes
● Kafka to capture
metadata changes
● MySQL for persistent
storage
● Redis for cache to support
low latency & high
throughput
● Search functionality
powers various internal
platform including
Databook for data
discovery
Metadata Store: Data Model Requirements
1. Discovery
2. Cluster-specific & agnostic metadata
4. Flexibility on onboarding new entities
3. Easy metadata type creation
Metadata
Store
Data Model
Metadata
Management
(2019)
1. Easy onboarding
2. Derived metadata
through relationships
3. Efficient metadata
retrieval
Key Takeaways
1. Centralized Metastore: Datasets + Artifacts
2. Metadata Registry: Taxonomy / Metadata scheme
3. Metadata Collection: Choose the right approach
4. Data Model: Leverage metadata relationships
Next Steps
Metadata Management
Innovate
- Personalization to improve discovery
- Graph traversal optimizations
Automate
- Human-in-the-loop AI
Establish trust & accountability
- More integrations with data infra
- Very high qps & low latency
Accessible but secure foundation
- Fully self-served, ontology-based metadata management
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Thank you!
kaan@uber.com

Contenu connexe

Tendances

Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...Databricks
 
Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...
Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...
Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...DevGAMM Conference
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Tristan Baker
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
New Analytic Uses of Master Data Management in the Enterprise
New Analytic Uses of Master Data Management in the EnterpriseNew Analytic Uses of Master Data Management in the Enterprise
New Analytic Uses of Master Data Management in the EnterpriseDATAVERSITY
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Business Data Lake Best Practices
Business Data Lake Best PracticesBusiness Data Lake Best Practices
Business Data Lake Best PracticesCapgemini
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motionconfluent
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsMárton Kodok
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureLorenzo Nicora
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
The ABCs of Treating Data as Product
The ABCs of Treating Data as ProductThe ABCs of Treating Data as Product
The ABCs of Treating Data as ProductDATAVERSITY
 

Tendances (20)

Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
 
Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...
Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...
Modern Data Stack for Game Analytics / Dmitry Anoshin (Microsoft Gaming, The ...
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
New Analytic Uses of Master Data Management in the Enterprise
New Analytic Uses of Master Data Management in the EnterpriseNew Analytic Uses of Master Data Management in the Enterprise
New Analytic Uses of Master Data Management in the Enterprise
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Business Data Lake Best Practices
Business Data Lake Best PracticesBusiness Data Lake Best Practices
Business Data Lake Best Practices
 
Evolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in MotionEvolution from EDA to Data Mesh: Data in Motion
Evolution from EDA to Data Mesh: Data in Motion
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflows
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Data Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and FutureData Mesh at CMC Markets: Past, Present and Future
Data Mesh at CMC Markets: Past, Present and Future
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
The ABCs of Treating Data as Product
The ABCs of Treating Data as ProductThe ABCs of Treating Data as Product
The ABCs of Treating Data as Product
 

Similaire à [Strata NYC 2019] Turning big data into knowledge: Managing metadata and data relationships at Uber's scale

Data Analytics in Digital Transformation
Data Analytics in Digital TransformationData Analytics in Digital Transformation
Data Analytics in Digital TransformationMukund Babbar
 
Capturing big value in big data
Capturing big value in big data Capturing big value in big data
Capturing big value in big data BSP Media Group
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"MDS ap
 
Accelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and VisualizationAccelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and VisualizationDenodo
 
Top Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwareTop Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwarePanorama Software
 
Accelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and VisualizationAccelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and VisualizationDenodo
 
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...BigDataEverywhere
 
Transforming GE Healthcare with Data Platform Strategy
Transforming GE Healthcare with Data Platform StrategyTransforming GE Healthcare with Data Platform Strategy
Transforming GE Healthcare with Data Platform StrategyDatabricks
 
AWS Initiate Day Dublin 2019 – Big Data Meets AI
AWS Initiate Day Dublin 2019 – Big Data Meets AIAWS Initiate Day Dublin 2019 – Big Data Meets AI
AWS Initiate Day Dublin 2019 – Big Data Meets AIAmazon Web Services
 
UNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data MiningUNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data MiningNandakumar P
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sectorAnil Rana
 
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AIAWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AIAmazon Web Services
 
Agile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and UncertaintyAgile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and UncertaintyTamrMarketing
 
¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?Denodo
 
A Winning Strategy for the Digital Economy
A Winning Strategy for the Digital EconomyA Winning Strategy for the Digital Economy
A Winning Strategy for the Digital EconomyEric Kavanagh
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...DATAVERSITY
 
Maximizing Your Data’s Potential: DOTs & DPWs Edition
Maximizing Your Data’s Potential: DOTs & DPWs EditionMaximizing Your Data’s Potential: DOTs & DPWs Edition
Maximizing Your Data’s Potential: DOTs & DPWs EditionSafe Software
 

Similaire à [Strata NYC 2019] Turning big data into knowledge: Managing metadata and data relationships at Uber's scale (20)

Data Analytics in Digital Transformation
Data Analytics in Digital TransformationData Analytics in Digital Transformation
Data Analytics in Digital Transformation
 
Capturing big value in big data
Capturing big value in big data Capturing big value in big data
Capturing big value in big data
 
Machine Data Analytics
Machine Data AnalyticsMachine Data Analytics
Machine Data Analytics
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
 
Accelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and VisualizationAccelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and Visualization
 
Top Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama SoftwareTop Business Intelligence Trends for 2016 by Panorama Software
Top Business Intelligence Trends for 2016 by Panorama Software
 
Accelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and VisualizationAccelerate Self-Service Analytics with Data Virtualization and Visualization
Accelerate Self-Service Analytics with Data Virtualization and Visualization
 
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
Big Data Everywhere Chicago: Platfora - Practices for Customer Analytics on H...
 
Sgcp14dunlea
Sgcp14dunleaSgcp14dunlea
Sgcp14dunlea
 
Transforming GE Healthcare with Data Platform Strategy
Transforming GE Healthcare with Data Platform StrategyTransforming GE Healthcare with Data Platform Strategy
Transforming GE Healthcare with Data Platform Strategy
 
AWS Initiate Day Dublin 2019 – Big Data Meets AI
AWS Initiate Day Dublin 2019 – Big Data Meets AIAWS Initiate Day Dublin 2019 – Big Data Meets AI
AWS Initiate Day Dublin 2019 – Big Data Meets AI
 
Taming Big Data With Modern Software Architecture
Taming Big Data  With Modern Software ArchitectureTaming Big Data  With Modern Software Architecture
Taming Big Data With Modern Software Architecture
 
UNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data MiningUNIT - 1 : Part 1: Data Warehousing and Data Mining
UNIT - 1 : Part 1: Data Warehousing and Data Mining
 
Big data analytics in banking sector
Big data analytics in banking sectorBig data analytics in banking sector
Big data analytics in banking sector
 
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AIAWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
 
Agile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and UncertaintyAgile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
Agile Leadership: Guiding DataOps Teams Through Rapid Change and Uncertainty
 
¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?¿En qué se parece el Gobierno del Dato a un parque de atracciones?
¿En qué se parece el Gobierno del Dato a un parque de atracciones?
 
A Winning Strategy for the Digital Economy
A Winning Strategy for the Digital EconomyA Winning Strategy for the Digital Economy
A Winning Strategy for the Digital Economy
 
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
ADV Slides: What the Aspiring or New Data Scientist Needs to Know About the E...
 
Maximizing Your Data’s Potential: DOTs & DPWs Edition
Maximizing Your Data’s Potential: DOTs & DPWs EditionMaximizing Your Data’s Potential: DOTs & DPWs Edition
Maximizing Your Data’s Potential: DOTs & DPWs Edition
 

Dernier

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Dernier (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

[Strata NYC 2019] Turning big data into knowledge: Managing metadata and data relationships at Uber's scale

  • 1. Turning Big Data into Knowledge September 25, 2019 Kaan Onuk, Luyao Li, Atul Gupte
  • 3. ● Engineer turned Product Manager ● Previously: building FarmVille & the mobile advertising platform @ Zynga ● Currently: Product Manager on the Data Platform team building data science, data knowledge, and interactive analytics platforms About me Atul Gupte Product Manager
  • 4. What we’ll talk about today Data landscape at Uber Our journey since 2016 Metadata management through Databook dtbData relationships through Lineage lin
  • 5. We ignite opportunity by setting the world in motion 15M Trips/Day 700+ Cities 100M Monthly Users
  • 6. Data informs every decision at the company
  • 7. Daily Uber trips powered by ML Millions Messages processed by Kafka 2T Queries across Hive, Vertica and Presto 1M Data ingested into HDFS 150TB How Big is our Big Data?
  • 8. Data Platform Team Move the world with global data, local insights, and intelligent decisions.
  • 9. Data Infrastructure Data Platform DataTools Data Lake Logging Stream Data Modelers Data Consumers ... Trips Users Data Engineers Overview of Data at Uber Data Scientists
  • 10. Raw ~10,000 Curated ~100 Derived >100,000 Data LakeSources Usage WAU 8,000+ Queries 1M/day Pipelines Thousands Metrics Thousands Experiments Thousands ML models 10s of thousands Self Serve & Open Platform Use Cases Eng ETA, surge, safety DS incentives, churn, pickup Ops driver onboarding, eats cash, partner data sharing Compliance ops metrics, city Challenges compounded by the scale of Data Data produced by Mobile users 100s of millions Events Trillions/day
  • 11. What are users looking to do? What data exists? How does it look? Who’s using it? What happens if I change it? How can I adapt when this data changes?
  • 15. We power data fluency to help Uber make confident, data-driven decisions Any and all users can access and use datasets with ease Users trust our data because it meets their expectations Users access appropriate data, through compliant means
  • 17.
  • 18.
  • 20. Late 2016 ● Indexed small amount of data ○ Offline analytics systems ● Datasets only - no other data entities ● Catalogued basic information about datasets
  • 21. Late 2016 Novice Neville Data Scientists Software Engineers ML Researchers New to Uber Requires help finding data Relies on George for basic tasks Manager Michelle General Managers Product Managers City Operations CXOs & other executives Interface w/regulators & customers Meet critical deadlines Deliver reports and insights Genius George Data Scientists Software Engineers ML Researchers Built underlying systems Tribal knowledge champion De-facto knowledge bank
  • 23. 2017 HQ Non-HQ Rideshare Eats Freight ATG Elevate Support NLP models for support tickets Safety Trip classification Uber Eats Restaurant recommendations Operations LTV models
  • 24. 2017 Data Scientists Software Engineers ML/AI Researchers Advanced SQL Advanced Statistics Scala/Spark, Python/R Data Modeling Inventor Ivan Marketing Managers Entry-level Analysts General Managers Product Managers Limited SQL Spreadsheets Reliant Rebecca City Operations Regional Managers Advanced SQL Spreadsheets Dashboarding Monitoring Matt Operations Managers Data Analysts Product Analysts Advanced SQL Spreadsheets Limited Statistics Limited Python/R Analyst Anna
  • 25. 2017
  • 27. ● Users care about a variety of data assets ○ Datasets, dashboards, metrics, etc. ● Users want a holistic view of everything that exists about their data ○ Ownership ○ Schemas 2018
  • 28. 2018
  • 29. 2018 ● Data quality and health are key concerns ● Table usage information is valuable ● Operational and regulatory environment is growing more complex ○ GDPR ○ Access control & audits
  • 30. 2019
  • 31. 2019 ● Unified interface highlighting relevant metadata ○ Ownership & usage ○ Schema and stats ○ Quality & health signals ○ Lineage
  • 32. 2019 ● Advanced metadata management ○ Automated ingestion ○ Automated classification ○ Simplified controls for data owners 60+ Types of metadata
  • 34. ● Manages Data Lineage team under Data Platform ● Earlier: Senior Software Engineer II at Uber About me Luyao Li Tech Lead Manager
  • 36. What is data lineage? ● Where is the data from ● Where it’s been ● How it’s being transformed
  • 37. “I’m no longer responsible for this table, please ask team X” “This is an upstream problem, we can’t fix it” “Please ask the table owner” “How do I find the pipeline owner?” Multiple days Why does it matter?
  • 38. Applications Data Freshness Data Chargebacks Anomaly Detection Compliance
  • 42. Lessons learned High quality data is essential for success Always be customer obsessed Magical search has a huge impact on usability Make big, bold bets
  • 43. What’s next ● Column-level lineage ● Self-diagnostic and reporting ● Self-serve onboarding ● Recommendation
  • 44. / Manages the Metadata Platform team within Big Data / Previously - Senior Software Engineer / Tech Lead for Data Discovery & Data Privacy @ Uber About me Kaan Onuk Engineering Manager
  • 45. / What is metadata? / Why does metadata matter?
  • 47. Uber’s massive data holds deep hidden insights. Metadata helps to surface them.
  • 48. Metadata drives data productivity by making data easy to discover, understand, and govern.
  • 49. / Metadata Sources / Metadata Registry / Metadata Collection / Metadata Storage / Data Model
  • 52. Metadata Collection Pull model Push model ○ Crawler (periodic) e.g. sample data, stats ○ Event-based (Event Listeners) e.g. data quality ○ Automated e.g. data retention policies ○ Crowdsource e.g. table descriptions
  • 53. Storage ● Hive for analytical queries and audit purposes ● Kafka to capture metadata changes ● MySQL for persistent storage ● Redis for cache to support low latency & high throughput ● Search functionality powers various internal platform including Databook for data discovery
  • 54. Metadata Store: Data Model Requirements 1. Discovery 2. Cluster-specific & agnostic metadata 4. Flexibility on onboarding new entities 3. Easy metadata type creation
  • 56. Metadata Management (2019) 1. Easy onboarding 2. Derived metadata through relationships 3. Efficient metadata retrieval
  • 57. Key Takeaways 1. Centralized Metastore: Datasets + Artifacts 2. Metadata Registry: Taxonomy / Metadata scheme 3. Metadata Collection: Choose the right approach 4. Data Model: Leverage metadata relationships
  • 58. Next Steps Metadata Management Innovate - Personalization to improve discovery - Graph traversal optimizations Automate - Human-in-the-loop AI Establish trust & accountability - More integrations with data infra - Very high qps & low latency Accessible but secure foundation - Fully self-served, ontology-based metadata management
  • 59.
  • 60. Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. Thank you! kaan@uber.com