4. Overview of Uber’s Data Platform
DATA SOURCES
RAW DATA
MODELED TABLES
MINING BUSINESS INSIGHTS
CONSUMING BUSINESS INSIGHTS
EXPERIMENTATION
DATA SCIENCE
MACHINE LEARNING
CUSTOM DATA SETS
Dashboarding
Alerting
Monitoring
Data Exploration
Knowledge Bases
Storage
Infrastructure
ETL Frameworks
Data Integrity
Query Engines
7. Presto use cases at Uber
Growth Marketing
Data Science
Marketplace
Pricing
Community
Operations
Data Quality
Ad-hoc Querying
8. The people who rely on us
Technical Skills (by persona)
● Inventor Ivan
○ Roles: Data Scientists, Software Engineers, ML/AI Researchers
○ Skills: Advanced SQL, Advanced Statistics, Scala/Spark, Python/R, Data Modeling
● Reliant Rebecca
○ Roles: Marketing Managers, Entry-level Analysts, General Managers, Product Managers
○ Skills: Limited SQL, Spreadsheets
● Monitoring Matt
○ Roles: City Operations, Regional Managers
○ Skills: Intermediate SQL, Spreadsheets, Dashboarding
● Analyst Anna
○ Roles: Operations Managers, Data Analysts, Product Analysts
○ Skills: Advanced SQL, Spreadsheets, Limited Statistics, Limited Python/R
12. What is Presto: Interactive SQL Engine for Big Data
● Interactive query speeds
● Horizontally scalable
● ANSI SQL
● Battle-tested by Facebook, Uber, LinkedIn, Twitter, Netflix, Airbnb, etc.
● Completely open source
● Access to petabytes of data in Hadoop, Elasticsearch, Pinot, etc.
14. Why Presto is Fast
● Data in memory during execution
● Pipelining and streaming
● Columnar storage & execution
● Bytecode generation
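As a rough illustration of pipelining over columnar data, the sketch below streams small column-oriented "pages" through a filter and a count without ever materializing the whole table. The names `scan`, `filter_eq`, `count_rows`, and the page layout are hypothetical and chosen for this example; this is not Presto's actual operator code.

```python
from typing import Dict, Iterator, List

# A "page" here is a small batch of rows stored column by column (assumed layout).
Page = Dict[str, List]


def scan(columns: Page, page_size: int) -> Iterator[Page]:
    """Yield the table one page at a time instead of materializing it."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, page_size):
        yield {name: vals[start:start + page_size] for name, vals in columns.items()}


def filter_eq(pages: Iterator[Page], column: str, value) -> Iterator[Page]:
    """Keep rows where column == value; the predicate reads a single column."""
    for page in pages:
        keep = [i for i, v in enumerate(page[column]) if v == value]
        yield {name: [vals[i] for i in keep] for name, vals in page.items()}


def count_rows(pages: Iterator[Page]) -> int:
    """Consume pages as they arrive and keep only a running total in memory."""
    total = 0
    for page in pages:
        total += len(next(iter(page.values())))
    return total


trips = {
    "city_id": [1, 1, 2, 1, 2, 1],
    "status": ["completed", "canceled", "completed",
               "completed", "completed", "canceled"],
}

# Each stage pulls one page from the previous stage, so the stages overlap.
completed = count_rows(filter_eq(scan(trips, page_size=2), "status", "completed"))
print(completed)
```

Because every stage is a generator, only one page per stage is alive at a time, which is the property the "pipelining and streaming" bullet refers to.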
15. Resource Management
● Presto has its own resource manager
○ Not on YARN
○ Not on Mesos
● CPU Management
○ Priority queues
○ Short-running queries get higher priority
● Memory Management
○ Max memory per query per node
○ If a query exceeds the max memory limit, the query fails
○ No OutOfMemory in Presto process
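The two policies above can be sketched in a few lines. This is a simplified model under stated assumptions, not Presto's implementation: `QueryMemory`, `CpuScheduler`, the fixed time quantum, and the "least accumulated CPU time first" rule are all hypothetical stand-ins for "short-running queries get higher priority" and "max memory per query per node".

```python
import heapq
from dataclasses import dataclass


class ExceededMemoryLimit(Exception):
    """Fails a single query; the worker process itself keeps running."""


@dataclass
class QueryMemory:
    query_id: str
    max_bytes_per_node: int  # hypothetical per-query, per-node cap
    reserved: int = 0

    def reserve(self, nbytes: int) -> None:
        # Reject the reservation up front rather than letting the process run out of memory.
        if self.reserved + nbytes > self.max_bytes_per_node:
            raise ExceededMemoryLimit(
                f"{self.query_id}: {self.reserved + nbytes} > {self.max_bytes_per_node}"
            )
        self.reserved += nbytes


class CpuScheduler:
    """Runs the query with the least accumulated CPU time first (assumed policy)."""

    def __init__(self) -> None:
        self._heap = []  # entries: (cpu_seconds_used, sequence, query_id)
        self._seq = 0

    def submit(self, query_id: str, cpu_seconds_used: float = 0.0) -> None:
        heapq.heappush(self._heap, (cpu_seconds_used, self._seq, query_id))
        self._seq += 1

    def run_next(self, quantum: float = 1.0) -> str:
        used, _, query_id = heapq.heappop(self._heap)
        # Pretend the query ran for one quantum, then requeue it at lower priority.
        self.submit(query_id, used + quantum)
        return query_id


sched = CpuScheduler()
sched.submit("long_report", cpu_seconds_used=30.0)
sched.submit("quick_lookup")
order = [sched.run_next() for _ in range(3)]

mem = QueryMemory("q1", max_bytes_per_node=100)
mem.reserve(60)
try:
    mem.reserve(50)
    failed = False
except ExceededMemoryLimit:
    failed = True
```

In this model the new, short query is scheduled ahead of the long-running one until it has used as much CPU, and an oversized reservation fails only the offending query.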
21. Data Model
● Each Elasticsearch index is a table partition
● Each field of an index is a column
● All Elasticsearch indexes sharing the same prefix form a logical table
○ es-vehicles-sjc1, es-vehicles-dca1, es-vehicles
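A minimal sketch of this mapping, assuming the prefix rule is "the index name equals the table name or starts with the table name plus a hyphen". The slide does not specify the exact matching rule, so `partitions_for` and `columns_for` are hypothetical helpers; the mapping shape (`{"field": {"type": ...}}`) follows the standard Elasticsearch index mapping format.

```python
from typing import Dict, List


def partitions_for(table: str, indices: List[str]) -> List[str]:
    """Indexes that belong to a logical table (assumed rule: exact name or 'table-' prefix)."""
    return sorted(i for i in indices if i == table or i.startswith(table + "-"))


def columns_for(mapping_properties: Dict[str, dict]) -> Dict[str, str]:
    """Each field in the index mapping becomes a column; fields without a type are objects."""
    return {name: spec.get("type", "object") for name, spec in mapping_properties.items()}


indices = ["es-vehicles-sjc1", "es-vehicles-dca1", "es-vehicles", "es-drivers-sjc1"]
vehicle_partitions = partitions_for("es-vehicles", indices)

mapping = {
    "vehicle_id": {"type": "keyword"},
    "year": {"type": "integer"},
    "owner": {"properties": {"name": {"type": "text"}}},
}
vehicle_columns = columns_for(mapping)
```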
24. Optimizations
● Parallel Reads
○ Get all indices and search nodes
○ For each search node, send request for one specific index
● Cap Max Hits
● Predicate Pushdown
● JSON Function Pushdown
● Limit Pushdown
● Nested Fields
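A sketch of how parallel reads and pushdown might fit together, under assumptions: `Split`, `make_splits`, and `build_search_body` are hypothetical names, and treating "Cap Max Hits" as an upper bound on the request `size` is one interpretation of the bullet. The request body uses the standard Elasticsearch query DSL (`bool` / `filter` / `term` and `size`).

```python
from typing import Dict, List, NamedTuple, Optional


class Split(NamedTuple):
    node: str   # search node that will serve the request
    index: str  # the one index this request targets


def make_splits(index_to_nodes: Dict[str, List[str]]) -> List[Split]:
    """One split per (search node, index) pair so workers can read in parallel."""
    return [
        Split(node, index)
        for index, nodes in sorted(index_to_nodes.items())
        for node in nodes
    ]


def build_search_body(equals: Dict[str, object],
                      limit: Optional[int],
                      max_hits: int) -> dict:
    """Push equality predicates and LIMIT down into the search request."""
    # LIMIT pushdown, bounded by the max-hits cap (assumed semantics).
    size = max_hits if limit is None else min(limit, max_hits)
    # Predicate pushdown: each equality becomes a non-scoring term filter.
    filters = [{"term": {name: value}} for name, value in equals.items()]
    return {"query": {"bool": {"filter": filters}}, "size": size}


splits = make_splits({
    "es-vehicles-sjc1": ["node-a", "node-b"],
    "es-vehicles-dca1": ["node-c"],
})
body = build_search_body({"status": "active"}, limit=10, max_hits=10000)
uncapped = build_search_body({}, limit=None, max_hits=500)
```

Filtering and limiting inside Elasticsearch means each split returns only the rows the query needs, rather than shipping whole indexes to the query engine.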
25. How many Uber trip requests did we serve in Chicago yesterday?
26. Fetch daily trip count in seconds
-- Per-city counts of completed and canceled trips for one day
SELECT T.base.city_id AS cid,
       COUNT(CASE WHEN T.base.status = 'completed' THEN 1 END) AS completed_trips,
       COUNT(CASE WHEN T.base.status = 'canceled' THEN 1 END) AS rider_canceled_trips
FROM trips AS T
WHERE T.datestr = '2019-03-11'
GROUP BY 1  -- group by the first select column (cid)
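The query relies on conditional aggregation: `CASE` without an `ELSE` yields NULL for non-matching rows, and `COUNT` ignores NULLs. The sketch below reproduces the pattern on a toy table using Python's built-in `sqlite3` rather than Presto; the flat `city_id` / `status` columns and the sample rows are assumptions standing in for the nested `base` struct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city_id INTEGER, status TEXT, datestr TEXT)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        (1, "completed", "2019-03-11"),
        (1, "completed", "2019-03-11"),
        (1, "canceled", "2019-03-11"),
        (2, "completed", "2019-03-11"),
        (2, "requested", "2019-03-11"),
        (1, "completed", "2019-03-10"),  # different day, filtered out
    ],
)

# Same shape as the slide's query; ORDER BY is added only to make the output stable.
rows = conn.execute(
    """
    SELECT city_id AS cid,
           COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed_trips,
           COUNT(CASE WHEN status = 'canceled' THEN 1 END) AS rider_canceled_trips
    FROM trips
    WHERE datestr = '2019-03-11'
    GROUP BY 1
    ORDER BY 1
    """
).fetchall()
print(rows)
```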