Discover how Uber thinks about building big data knowledge platforms to allow teams to discover, manage, and govern entities. Explore how to build an extensible metadata management platform and infrastructure to democratize data at Uber's scale.
3. ● Engineer turned Product Manager
● Previously: built FarmVille & the mobile advertising
platform @ Zynga
● Currently: Product Manager on the Data Platform team
building data science, data knowledge, and interactive
analytics platforms
About me
Atul Gupte
Product Manager
4. What we’ll talk about today
Data landscape
at Uber
Our journey
since 2016
Metadata management
through Databook
Data relationships
through Lineage
5. We ignite opportunity by setting the world in motion
15M
Trips/Day
700+
Cities
100M
Monthly Users
7. How Big is our Big Data?
Daily Uber trips powered by ML: Millions
Messages processed by Kafka: 2T
Queries across Hive, Vertica and Presto: 1M
Data ingested into HDFS: 150TB
8. Data Platform Team
Move the world with
global data,
local insights, and
intelligent decisions.
10. Raw
~10,000
Curated
~100
Derived
>100,000
Data Lake
Sources
Usage
WAU 8,000+
Queries 1M/day
Pipelines Thousands
Metrics Thousands
Experiments Thousands
ML models 10s of thousands
Self Serve & Open Platform
Use Cases
Eng ETA, surge, safety
DS incentives, churn, pickup
Ops driver onboarding, eats
cash, partner data sharing
Compliance ops metrics, city
Challenges compounded by the scale of Data
Data produced by
Mobile users 100s of millions
Events
Trillions/day
11. What are users looking to do?
What data exists? How does it look?
Who’s using it? What happens if I
change it?
How can I adapt when this data
changes?
15. We power data fluency to help Uber
make confident, data-driven decisions
Any and all users
can access and use
datasets with ease
Users trust our data
because it meets
their expectations
Users access
appropriate data,
through compliant
means
20. Late 2016
● Indexed small amount of data
○ Offline analytics systems
● Datasets only - no other data entities
● Catalogued basic information about datasets
21. Late 2016
Novice Neville
Data Scientists
Software Engineers
ML Researchers
New to Uber
Requires help finding data
Relies on George for basic tasks
Manager Michelle
General Managers
Product Managers
City Operations
CXOs & other executives
Interface w/regulators &
customers
Meet critical deadlines
Deliver reports and insights
Genius George
Data Scientists
Software Engineers
ML Researchers
Built underlying systems
Tribal knowledge champion
De-facto knowledge bank
27. ● Users care about a variety of data assets
○ Datasets, dashboards, metrics, etc.
● Users want a holistic view of everything that
exists about their data
○ Ownership
○ Schemas
2018
29. 2018
● Data quality and health are key concerns
● Table usage information is valuable
● Operational and regulatory environment is
growing more complex
○ GDPR
○ Access control & audits
36. What is data lineage?
● Where is the data from
● Where it’s been
● How it’s being transformed
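One way to picture the three questions above is as a directed graph in which each edge records a transformation from a source dataset to a derived one. The following is a minimal sketch under that assumption, not a description of how Uber's Lineage service is implemented; the `LineageGraph` class and all table and job names are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: an edge source -> target means target is derived from source."""

    def __init__(self):
        self.downstream = defaultdict(set)  # source -> datasets derived from it
        self.upstream = defaultdict(set)    # target -> datasets it is derived from
        self.transforms = {}                # (source, target) -> name of the transforming job

    def add_edge(self, source, target, transform):
        self.downstream[source].add(target)
        self.upstream[target].add(source)
        self.transforms[(source, target)] = transform

    @staticmethod
    def _walk(start, neighbors):
        # Breadth-first traversal that tolerates cycles and excludes the start node.
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen - {start}

    def origins(self, dataset):
        """Where the data is from and where it has been: every upstream dataset."""
        return self._walk(dataset, self.upstream)

    def impacted(self, dataset):
        """Every downstream dataset that could be affected by a change to this one."""
        return self._walk(dataset, self.downstream)


# Hypothetical tables and jobs, for illustration only.
graph = LineageGraph()
graph.add_edge("raw.trips", "curated.trips", "dedupe_trips")
graph.add_edge("curated.trips", "derived.city_daily_trips", "aggregate_by_city")
graph.add_edge("curated.trips", "derived.driver_earnings", "join_payments")

print(sorted(graph.origins("derived.city_daily_trips")))
print(sorted(graph.impacted("raw.trips")))
print(graph.transforms[("raw.trips", "curated.trips")])
```

Keeping both an upstream and a downstream index lets the same traversal answer "where did this come from?" and "what happens if I change it?" without scanning every edge.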
37. “I’m no longer
responsible for this
table, please ask team
X”
“This is an upstream
problem, we can’t fix it”
“Please ask the
table owner”
“How do I find
the pipeline
owner?”
Multiple days
Why does it matter?
42. Lessons learned
High quality data is essential for success
Always be customer obsessed
Magical search has a huge impact on usability
Make big, bold bets
43. What’s next
● Column-level lineage
● Self-diagnostics and reporting
● Self-serve onboarding
● Recommendations
44. / Manages the Metadata Platform team within Big Data
/ Previously - Senior Software Engineer / Tech Lead for
Data Discovery & Data Privacy @ Uber
About me
Kaan Onuk
Engineering Manager
45. / What is metadata?
/ Why does metadata matter?
52. Metadata Collection
Pull model / Push model
○ Crawler (periodic)
e.g. sample data, stats
○ Event-based (Event Listeners)
e.g. data quality
○ Automated
e.g. data retention policies
○ Crowdsource
e.g. table descriptions
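The difference between the pull and push models can be sketched with a periodic crawler and an event listener writing into the same store. This is an illustrative sketch only: `MetadataStore`, `Crawler`, `EventListener`, the event shape, and the table name are all assumptions, not Databook APIs.

```python
class MetadataStore:
    """In-memory stand-in for a metadata store, keyed by (entity, metadata type)."""

    def __init__(self):
        self.records = {}

    def put(self, entity, metadata_type, value):
        self.records[(entity, metadata_type)] = value


class Crawler:
    """Pull model: the platform visits each source on a schedule and fetches stats."""

    def __init__(self, store, sources):
        self.store = store
        self.sources = sources  # entity name -> callable that returns current stats

    def run_once(self):
        # A scheduler would call this periodically; here it is invoked by hand.
        for entity, fetch_stats in self.sources.items():
            self.store.put(entity, "stats", fetch_stats())


class EventListener:
    """Push model: producers emit events and the platform records them on arrival."""

    def __init__(self, store):
        self.store = store

    def on_event(self, event):
        self.store.put(event["entity"], event["type"], event["value"])


store = MetadataStore()

# Pull: the crawler asks the (hypothetical) table for its stats.
crawler = Crawler(store, {"curated.trips": lambda: {"row_count": 1200}})
crawler.run_once()

# Push: a (hypothetical) data quality job reports its result as soon as it finishes.
listener = EventListener(store)
listener.on_event({"entity": "curated.trips", "type": "data_quality", "value": "passed"})

print(store.records)
```

In this sketch, pulled metadata is only as fresh as the last crawl, while pushed metadata arrives as soon as the producer emits it, which is one reason to choose the approach per metadata type.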
53. Storage
● Hive for analytical queries
and audit purposes
● Kafka to capture
metadata changes
● MySQL for persistent
storage
● Redis for cache to support
low latency & high
throughput
● Search functionality
powers various internal
platforms, including
Databook for data
discovery
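The slide lists the storage roles but not how they are wired together. One plausible arrangement, shown here purely as an assumption, is a read-through cache in front of the persistent store, with every write also appended to a change log. Plain Python containers stand in for MySQL, Redis, and Kafka; `MetadataService` and the key format are hypothetical.

```python
class MetadataService:
    """Sketch of the storage roles using in-memory stand-ins (no real clients)."""

    def __init__(self):
        self.persistent = {}  # stand-in for MySQL: source of truth
        self.cache = {}       # stand-in for Redis: low-latency reads
        self.change_log = []  # stand-in for a Kafka topic: metadata change events
        self.cache_hits = 0

    def write(self, key, value):
        old = self.persistent.get(key)
        self.persistent[key] = value
        # Record the change so downstream consumers (e.g. audit tables) can replay it.
        self.change_log.append({"key": key, "old": old, "new": value})
        # Invalidate rather than update, so the next read reloads from the source of truth.
        self.cache.pop(key, None)

    def read(self, key):
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]
        value = self.persistent.get(key)
        if value is not None:
            self.cache[key] = value
        return value


service = MetadataService()
service.write("curated.trips/owner", "team-a")
first = service.read("curated.trips/owner")   # miss: loaded from the persistent store
second = service.read("curated.trips/owner")  # hit: served from the cache
service.write("curated.trips/owner", "team-b")
third = service.read("curated.trips/owner")   # miss again after invalidation

print(first, second, third, service.cache_hits)
print(service.change_log)
```

Invalidating on write keeps the cache from serving stale ownership data, at the cost of one extra read from the persistent store after each change.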
54. Metadata Store: Data Model Requirements
1. Discovery
2. Cluster-specific & agnostic metadata
3. Easy metadata type creation
4. Flexibility on onboarding new entities
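A small registry can illustrate how requirements 2 through 4 might fit together: entity types and metadata types are registered at runtime, and each metadata type declares whether its values are scoped to a cluster. This is a sketch under assumed names (`Registry`, `MetadataType`, `dc1`, `dc2`), not Uber's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataType:
    name: str
    cluster_specific: bool  # True: one value per cluster; False: one value overall


@dataclass
class Registry:
    entity_types: set = field(default_factory=set)
    metadata_types: dict = field(default_factory=dict)
    values: dict = field(default_factory=dict)

    def register_entity_type(self, name):
        # Onboarding a new kind of entity (dataset, dashboard, ...) is a single call.
        self.entity_types.add(name)

    def register_metadata_type(self, name, cluster_specific=False):
        self.metadata_types[name] = MetadataType(name, cluster_specific)

    def _scope(self, metadata, cluster):
        mtype = self.metadata_types[metadata]
        if mtype.cluster_specific and cluster is None:
            raise ValueError(f"{metadata!r} requires a cluster")
        # Cluster-agnostic metadata ignores the cluster argument entirely.
        return cluster if mtype.cluster_specific else None

    def put(self, entity_type, entity, metadata, value, cluster=None):
        if entity_type not in self.entity_types:
            raise KeyError(f"unknown entity type {entity_type!r}")
        key = (entity_type, entity, metadata, self._scope(metadata, cluster))
        self.values[key] = value

    def lookup(self, entity_type, entity, metadata, cluster=None):
        key = (entity_type, entity, metadata, self._scope(metadata, cluster))
        return self.values.get(key)


registry = Registry()
registry.register_entity_type("dataset")
registry.register_entity_type("dashboard")
registry.register_metadata_type("description")                     # cluster-agnostic
registry.register_metadata_type("row_count", cluster_specific=True)

registry.put("dataset", "trips", "description", "One row per completed trip")
registry.put("dataset", "trips", "row_count", 100, cluster="dc1")
registry.put("dataset", "trips", "row_count", 90, cluster="dc2")

print(registry.lookup("dataset", "trips", "description", cluster="dc2"))
print(registry.lookup("dataset", "trips", "row_count", cluster="dc1"))
print(registry.lookup("dataset", "trips", "row_count", cluster="dc2"))
```

Because entity types and metadata types are data rather than code, adding a new one does not require a schema change in this sketch.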
57. Key Takeaways
1. Centralized Metastore: Datasets + Artifacts
2. Metadata Registry: Taxonomy / Metadata scheme
3. Metadata Collection: Choose the right approach
4. Data Model: Leverage metadata relationships
58. Next Steps
Metadata Management
Innovate
- Personalization to improve discovery
- Graph traversal optimizations
Automate
- Human-in-the-loop AI
Establish trust & accountability
- More integrations with data infra
- Very high QPS & low latency
Accessible but secure foundation
- Fully self-served, ontology-based metadata management