[Strata NYC 2019] Turning big data into knowledge: Managing metadata and data relationships at Uber's scale

Turning Big Data
into Knowledge
September 25, 2019
Kaan Onuk, Luyao Li, Atul Gupte

● Engineer turned Product Manager
● Previously: building FarmVille & the mobile advertising
platform @ Zynga
● Currently: Product Manager on the Data Platform team
building data science, data knowledge, and interactive
analytics platforms
About me
Atul Gupte
Product Manager

What we’ll talk about today
Data landscape
at Uber
Our journey
since 2016
Metadata management
through Databook
Data relationships
through Lineage

We ignite opportunity by setting the world in motion
15M
Trips/Day
700+
Cities
100M
Monthly Users

Data informs every decision at the company

How Big is our Big Data?
Daily Uber trips powered by ML: Millions
Messages processed by Kafka: 2T
Queries across Hive, Vertica and Presto: 1M
Data ingested into HDFS: 150TB

Data Platform Team
Move the world with
global data,
local insights, and
intelligent decisions.

Overview of Data at Uber
[Diagram labels: Data Infrastructure; Data Platform; Data Tools; Data Lake; Logging; Stream; Trips; Users; ...; Data Engineers; Data Modelers; Data Scientists; Data Consumers]

Challenges compounded by the scale of Data
Sources
Data produced by mobile users: 100s of millions
Events: Trillions/day
Data Lake
Raw: ~10,000
Curated: ~100
Derived: >100,000
Usage
WAU: 8,000+
Queries: 1M/day
Pipelines: Thousands
Metrics: Thousands
Experiments: Thousands
ML models: 10s of thousands
Self Serve & Open Platform
Use Cases
Eng ETA, surge, safety
DS incentives, churn, pickup
Ops driver onboarding, eats
cash, partner data sharing
Compliance ops metrics, city

What are users looking to do?
What data exists? How does it look?
Who’s using it? What happens if I
change it?
How can I adapt when this data
changes?

Discover Understand Trust
3+ hours/week

8%
Time wasted
every week
$$$M
Cost to company

Tasks requiring
human skill
Unproductive
time sinks

We power data fluency to help Uber
make confident, data-driven decisions
Any and all users
can access and use
datasets with ease
Users trust our data
because it meets
their expectations
Users access
appropriate data,
through compliant
means

Late 2016
● Indexed small amount of data
○ Offline analytics systems
● Datasets only - no other data entities
● Catalogued basic information about datasets

Late 2016
Novice Neville
Data Scientists
Software Engineers
ML Researchers
New to Uber
Requires help finding data
Relies on George for basic tasks
Manager Michelle
General Managers
Product Managers
City Operations
CXOs & other executives
Interface w/regulators &
customers
Meet critical deadlines
Deliver reports and insights
Genius George
Data Scientists
Software Engineers
ML Researchers
Built underlying systems
Tribal knowledge champion
De-facto knowledge bank

2017
HQ, Non-HQ
Rideshare, Eats, Freight, ATG, Elevate
Support: NLP models for support tickets
Safety: Trip classification
Uber Eats: Restaurant recommendations
Operations: LTV models

2017
Inventor Ivan
Data Scientists
Software Engineers
ML/AI Researchers
Advanced SQL
Advanced Statistics
Scala/Spark, Python/R
Data Modeling
Reliant Rebecca
Marketing Managers
Entry-level Analysts
General Managers
Product Managers
Limited SQL
Spreadsheets
Monitoring Matt
City Operations
Regional Managers
Advanced SQL
Spreadsheets
Dashboarding
Analyst Anna
Operations Managers
Data Analysts
Product Analysts
Advanced SQL
Spreadsheets
Limited Statistics
Limited Python/R

2018
[Chart: cumulative functionality over time, low internal quality vs. high internal quality]
High internal quality delivers more rapidly + cheaply later

2018
● Users care about a variety of data assets
○ Datasets, dashboards, metrics, etc.
● Users want a holistic view of everything that
exists about their data
○ Ownership
○ Schemas

2018
● Data quality and health are key concerns
● Table usage information is valuable
● Operational and regulatory environment is
growing more complex
○ GDPR
○ Access control & audits

2019
● Unified interface highlighting relevant metadata
○ Ownership & usage
○ Schema and stats
○ Quality & health signals
○ Lineage

2019
● Advanced metadata management
○ Automated ingestion
○ Automated classification
○ Simplified controls for data owners
60+
Types of
metadata

● Manages Data Lineage team under Data Platform
● Earlier: Senior Software Engineer II at Uber
About me
Luyao Li
Tech Lead Manager

What is data lineage?
● Where the data comes from
● Where it’s been
● How it’s being transformed
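The three questions above map naturally onto a directed graph. The following is a minimal sketch, not Uber's implementation: the `LineageGraph` class, the dataset names, and the transform labels are all hypothetical and chosen only for illustration. Edges point from a source dataset to the dataset derived from it and carry the name of the transformation.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph: an edge points from a source dataset to a derived one."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, target, transform):
        # Record the edge in both directions so either question is cheap to answer.
        self._downstream[source].add((target, transform))
        self._upstream[target].add((source, transform))

    def _walk(self, start, edges):
        # Breadth-first traversal that collects every reachable dataset.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbor, _transform in edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def upstream(self, dataset):
        """Where the data comes from (all ancestors)."""
        return self._walk(dataset, self._upstream)

    def downstream(self, dataset):
        """What is affected if this dataset changes (all descendants)."""
        return self._walk(dataset, self._downstream)


# Hypothetical datasets and transforms.
graph = LineageGraph()
graph.add_edge("kafka.trip_events", "raw.trips", "ingestion")
graph.add_edge("raw.trips", "curated.trips", "etl_clean_trips")
graph.add_edge("curated.trips", "derived.city_daily_trips", "etl_aggregate")

print(sorted(graph.upstream("derived.city_daily_trips")))
print(sorted(graph.downstream("raw.trips")))
```

Keeping both edge directions makes "where is this from?" and "what happens if I change it?" the same traversal run over different adjacency maps.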

Why does it matter?
“I’m no longer responsible for this table, please ask team X”
“This is an upstream problem, we can’t fix it”
“Please ask the table owner”
“How do I find the pipeline owner?”
Multiple days

Applications
Data Freshness
Data Chargebacks
Anomaly Detection
Compliance

Features
End-to-end
Isolated ingestion
Flexible consumption
Advanced filtering

Lessons learned
High quality data is essential for success
Always be customer obsessed
Magical search has a huge impact on usability
Make big, bold bets

What’s next
● Column-level lineage
● Self-diagnostic and reporting
● Self-serve onboarding
● Recommendation

/ Manages the Metadata Platform team within Big Data
/ Previously - Senior Software Engineer / Tech Lead for
Data Discovery & Data Privacy @ Uber
About me
Kaan Onuk
Engineering Manager

/ What is metadata?
/ Why does metadata matter?

Uber’s massive data holds deep
hidden insights.
Metadata helps to surface them.

Metadata drives data
productivity by making data easy
to discover, understand, and
govern.

/ Metadata Sources
/ Metadata Registry
/ Metadata Collection
/ Metadata Storage
/ Data Model

Metadata Collection
Pull model Push model
○ Crawler (periodic)
e.g. sample data, stats
○ Event-based (Event Listeners)
e.g. data quality
○ Automated
e.g. data retention policies
○ Crowdsource
e.g. table descriptions
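The difference between the two collection models can be sketched in a few lines. This is an illustrative example under stated assumptions, not the Databook implementation: it assumes the periodic crawler is the pull side and the event listener is the push side, and every class name, field name, and value here is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """In-memory stand-in for a metadata store: entity -> {key: value}."""
    records: dict = field(default_factory=dict)

    def upsert(self, entity, key, value):
        self.records.setdefault(entity, {})[key] = value


class StatsCrawler:
    """Pull model: the collector asks each source for metadata on a schedule."""

    def __init__(self, store, sources):
        self.store = store
        self.sources = sources  # mapping: table name -> callable returning stats

    def run_once(self):
        for table, fetch_stats in self.sources.items():
            for key, value in fetch_stats().items():
                self.store.upsert(table, key, value)


class QualityEventListener:
    """Push model: the source emits an event and the listener records it."""

    def __init__(self, store):
        self.store = store

    def on_event(self, event):
        self.store.upsert(event["table"], "quality_status", event["status"])


store = MetadataStore()

# Pull: a scheduler would call run_once() periodically; here it runs once.
crawler = StatsCrawler(store, {"curated.trips": lambda: {"row_count": 1200}})
crawler.run_once()

# Push: the producer of the event decides when metadata is updated.
listener = QualityEventListener(store)
listener.on_event({"table": "curated.trips", "status": "passed"})

print(store.records)
```

The trade-off the sketch highlights: pulled metadata is only as fresh as the crawl interval, while pushed metadata arrives as soon as the source reports it but requires the source to integrate with the listener.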

Storage
● Hive for analytical queries
and audit purposes
● Kafka to capture
metadata changes
● MySQL for persistent
storage
● Redis for cache to support
low latency & high
throughput
● Search functionality
powers various internal
platforms, including
Databook for data
discovery
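The slides do not describe how these stores are wired together. One common way to combine a cache, a persistent store, and a change log is the cache-aside pattern, sketched below as an assumption rather than as Uber's design. Plain Python dicts and a list stand in for Redis, MySQL, and a Kafka topic, and the `MetadataReader` class and key format are hypothetical.

```python
class MetadataReader:
    """Cache-aside reads: try the cache first, fall back to the persistent store."""

    def __init__(self, cache, database, change_log):
        self.cache = cache            # stand-in for Redis
        self.database = database      # stand-in for MySQL
        self.change_log = change_log  # stand-in for a Kafka topic

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)  # invalidate so the next read is fresh
        self.change_log.append({"key": key, "value": value})


cache, database, change_log = {}, {"curated.trips:owner": "team-a"}, []
reader = MetadataReader(cache, database, change_log)

print(reader.get("curated.trips:owner"))  # first read goes to the database
reader.put("curated.trips:owner", "team-b")
print(reader.get("curated.trips:owner"))  # read after invalidation
print(change_log)
```

In this arrangement the change log could feed an analytical store for audits, which is the role the slide assigns to Hive.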

Metadata Store: Data Model Requirements
1. Discovery
2. Cluster-specific & agnostic metadata
3. Easy metadata type creation
4. Flexibility in onboarding new entities
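A small registry shows how these requirements might fit together. This is a sketch under assumptions, not the actual data model: the `MetadataRegistry` and `MetadataType` names, the `cluster_specific` flag, and the cluster names are all hypothetical. New entity types and metadata types are added by registration rather than by schema changes, and cluster-agnostic values are shared across clusters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataType:
    name: str
    value_type: type
    cluster_specific: bool


class MetadataRegistry:
    def __init__(self):
        self.entity_types = set()
        self.metadata_types = {}
        self.values = {}  # (entity_type, entity_name, cluster, name) -> value

    def register_entity_type(self, entity_type):
        self.entity_types.add(entity_type)

    def register_metadata_type(self, metadata_type):
        self.metadata_types[metadata_type.name] = metadata_type

    def set(self, entity_type, entity_name, name, value, cluster=None):
        if entity_type not in self.entity_types:
            raise KeyError(f"unknown entity type: {entity_type}")
        mtype = self.metadata_types[name]
        if not isinstance(value, mtype.value_type):
            raise TypeError(f"{name} expects {mtype.value_type.__name__}")
        if mtype.cluster_specific and cluster is None:
            raise ValueError(f"{name} is cluster-specific; pass a cluster")
        if not mtype.cluster_specific:
            cluster = None  # cluster-agnostic values are shared across clusters
        self.values[(entity_type, entity_name, cluster, name)] = value

    def get(self, entity_type, entity_name, name, cluster=None):
        mtype = self.metadata_types[name]
        key_cluster = cluster if mtype.cluster_specific else None
        return self.values.get((entity_type, entity_name, key_cluster, name))


registry = MetadataRegistry()
registry.register_entity_type("table")
registry.register_entity_type("dashboard")  # onboarding a new entity type
registry.register_metadata_type(MetadataType("description", str, cluster_specific=False))
registry.register_metadata_type(MetadataType("row_count", int, cluster_specific=True))

registry.set("table", "trips", "description", "One row per trip")
registry.set("table", "trips", "row_count", 1200, cluster="dc1")

print(registry.get("table", "trips", "description", cluster="dc2"))
print(registry.get("table", "trips", "row_count", cluster="dc1"))
print(registry.get("table", "trips", "row_count", cluster="dc2"))
```

The description is visible from any cluster, while the row count is only defined for the cluster where it was recorded.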

Metadata
Management
(2019)
1. Easy onboarding
2. Derived metadata
through relationships
3. Efficient metadata
retrieval
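"Derived metadata through relationships" can be illustrated with a tiny entity graph. The slides do not give a concrete example, so the one below is an assumption: a table is treated as containing personal data if any of its columns is tagged that way. The `EntityGraph` class, the `has_column` relation, and the `pii` key are hypothetical names.

```python
from collections import defaultdict


class EntityGraph:
    """Entities with their own metadata plus typed relationships between them."""

    def __init__(self):
        self.metadata = defaultdict(dict)
        self.relations = defaultdict(list)  # (entity, relation) -> related entities

    def set_metadata(self, entity, key, value):
        self.metadata[entity][key] = value

    def relate(self, source, relation, target):
        self.relations[(source, relation)].append(target)

    def derive_any(self, entity, relation, key):
        """True if any entity one `relation` hop away has a truthy value for `key`."""
        related = self.relations.get((entity, relation), [])
        return any(self.metadata.get(target, {}).get(key) for target in related)


graph = EntityGraph()
graph.relate("table:trips", "has_column", "column:trips.rider_email")
graph.relate("table:trips", "has_column", "column:trips.fare")
graph.relate("table:cities", "has_column", "column:cities.name")
graph.set_metadata("column:trips.rider_email", "pii", True)

# The table-level flag is never stored; it is computed from the columns.
print(graph.derive_any("table:trips", "has_column", "pii"))
print(graph.derive_any("table:cities", "has_column", "pii"))
```

Because the table-level value is computed rather than stored, it cannot drift out of sync when a column's tag changes.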

Key Takeaways
1. Centralized Metastore: Datasets + Artifacts
2. Metadata Registry: Taxonomy / Metadata scheme
3. Metadata Collection: Choose the right approach
4. Data Model: Leverage metadata relationships

Next Steps
Metadata Management
Innovate
- Personalization to improve discovery
- Graph traversal optimizations
Automate
- Human-in-the-loop AI
Establish trust & accountability
- More integrations with data infra
- Very high QPS & low latency
Accessible but secure foundation
- Fully self-served, ontology-based metadata management

Thank you!
kaan@uber.com
