4. What is Presto?
● Open source distributed SQL engine
● ANSI SQL syntax
● Custom built for interactive analytic queries
● Queries data across multiple data stores
● Flexible deployment (on premises or cloud)
● Extensible
7. Presto @ Facebook
● Ad-hoc/interactive queries for Hadoop warehouse
● Batch processing for Hadoop warehouse
● Analytics for user-facing products
● Analytics over various specialized stores
8. Hadoop Warehouse - Stats
● 1000s of internal daily active users
● Millions of queries each month
● Scan PBs of data every day
● Process trillions of rows every day
● 10s of concurrent queries
10. Presto for User-facing Products
● Requirements
○ Hundreds of ms to seconds latency, low variability
○ Availability
○ Update semantics
○ 10- to 15-way joins
● Stats
○ > 99.99% query success rate
○ 100% system availability
○ 25 - 200 concurrent queries
○ 1 - 20 queries per second
○ <100ms - 5s latency
11. Presto with Raptor
● Large data sets (petabytes)
● Milliseconds to seconds latency
● Predictable performance
● 5-15 minute load latency
● Reliable data loads (no duplicates, no missing data)
● High availability
● 10s of concurrent queries
14. Netflix stats
● Interactive, reporting, and app-driven queries
● Data warehouse: 40PB in S3
● ~250 nodes across multiple clusters
● ~650 users with ~6K+ queries/day
15. Twitter stats
● Ad-hoc and low-latency queries
● ~200 nodes dedicated to Presto
● Parquet with nested data structures
19. SQL features
● DDL syntax
○ CREATE / ALTER / DROP TABLE
● DML syntax
○ INSERT / DELETE
● SQL features:
○ Data types: DECIMAL, VARCHAR(n), INT, SMALLINT, TINYINT
○ CUBE, ROLLUP, GROUPING SETS
○ INTERSECT
○ Non-equi joins
○ Uncorrelated subqueries
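As a rough illustration of what the ROLLUP and CUBE shorthands on this slide stand for, the Python sketch below expands each into the equivalent GROUPING SETS list under standard SQL semantics. It is not Presto code: the helper functions and the column names `region` and `device` are hypothetical, chosen only for the example.

```python
from itertools import combinations


def rollup(columns):
    """Grouping sets for ROLLUP(columns): every prefix, longest first, down to ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube(columns):
    """Grouping sets for CUBE(columns): every subset of the columns, largest first."""
    return [
        subset
        for size in range(len(columns), -1, -1)
        for subset in combinations(columns, size)
    ]


def as_grouping_sets(sets):
    """Render a list of grouping sets as a SQL GROUPING SETS clause."""
    body = ", ".join("(" + ", ".join(s) + ")" for s in sets)
    return f"GROUPING SETS ({body})"


if __name__ == "__main__":
    cols = ["region", "device"]  # hypothetical column names
    print("ROLLUP ->", as_grouping_sets(rollup(cols)))
    print("CUBE   ->", as_grouping_sets(cube(cols)))
```

ROLLUP over n columns yields n + 1 grouping sets and CUBE yields 2^n, which is why CUBE over many columns can become expensive.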
20. Other features
● Performance
○ Join and aggregation optimizations
● Connectors
○ Redis
○ MongoDB
● Kerberos
● Presto-Admin
● Ambari and YARN (via Apache Slider)
21. Drivers and BI tools
● Enterprise-grade ODBC & JDBC drivers
● BI tools certifications
○ Information Builders, Looker, MicroStrategy, MS Power BI, Qlik, Tableau, ZoomData
23. Short term
● LDAP
● SQL features
○ Data types: FLOAT, CHAR(n), VARBINARY(n) / BINARY(n)
○ EXISTS, EXCEPT
○ Correlated subqueries
○ Lambda expressions
○ Prepared statements
● Connectors
○ Accumulo (by Bloomberg)
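To make the planned EXISTS, EXCEPT, and correlated-subquery features concrete, here is a minimal sketch of the standard SQL semantics. It uses Python's built-in `sqlite3` module purely as a stand-in engine, not Presto, and the `users` and `orders` tables are hypothetical sample data.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (an assumption;
# the slide is about Presto, which is not involved here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER, name TEXT);
CREATE TABLE orders (user_id INTEGER, total INTEGER);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO orders VALUES (1, 10), (1, 30), (3, 5);
""")

# EXISTS with a correlated subquery: the inner query refers to u.id
# from the outer query, so it is evaluated per outer row.
with_orders = [row[0] for row in conn.execute("""
    SELECT name FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id)
    ORDER BY name
""")]

# EXCEPT: distinct rows of the first query that do not appear in the second.
without_orders = [row[0] for row in conn.execute("""
    SELECT id FROM users
    EXCEPT
    SELECT user_id FROM orders
    ORDER BY id
""")]

# Correlated scalar subquery in the SELECT list: largest order per user,
# NULL (None in Python) for users with no orders.
max_totals = list(conn.execute("""
    SELECT name,
           (SELECT MAX(total) FROM orders o WHERE o.user_id = u.id)
    FROM users u
    ORDER BY name
"""))

if __name__ == "__main__":
    print(with_orders)
    print(without_orders)
    print(max_totals)
```

An uncorrelated subquery can be evaluated once and reused, whereas a correlated one depends on the current outer row. That dependency is what makes correlated subqueries harder for an engine to plan efficiently.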
24. Long term
● Materialized Query Tables
● Workload management
● Spill to disk
● Cost-based Optimizer
See more at https://github.com/prestodb/presto/wiki/Roadmap
25. More about Presto
GitHub: https://github.com/prestodb & https://github.com/Teradata/presto
Website: http://prestodb.io
Group: https://groups.google.com/group/presto-users
Distro: http://www.teradata.com/presto