This is a presentation given at a Big Data Boulder / Denver Meetup event by Ashish Dubey, a Senior Solutions Architect at Qubole.
The following slides cover the background of Presto and its architecture, and how it differs in performance and cost from traditional Hadoop / Hive for ad hoc queries, as well as from SparkSQL, Impala, Tez, and Redshift.
There are also several slides about Qubole's involvement with the open-source Presto project, including its performance-optimizing contributions.
Qubole is a big data analytics platform that addresses many of the headaches of the traditional big data model (Hadoop, Spark, Presto) and of cloud computing on popular IaaS providers: AWS, Google Cloud, Microsoft Azure, and Oracle BMC.
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
1. PRESTO: AT SCALE IN THE CLOUD
Ashish Dubey
Solutions Architect
Qubole
2. COMPANY BACKGROUND
Founded in 2011 by the Lead Developers of Facebook’s data platform &
authors of the Apache Hive Project: Joydeep Sen Sarma & Ashish Thusoo.
Qubole started out serving cloud-based companies such as Pinterest and Shazam,
and has since grown with each phase of cloud adoption to include
companies like Autodesk and Oracle.
Today, Qubole processes 500 petabytes of data in the cloud each month on
behalf of its customers.
World class product and engineering team from:
3. THE OLD WORLD: HADOOP & MODEL ISSUES
➤ Hadoop puts compute and storage together within a
compute node
➤ Forces compute and storage to scale together, which is not
ideal
➤ The cluster must be persistently on or else the data is
inaccessible
➤ Fixed or inflexible pricing model
[Diagram: a cluster of twelve nodes, each labeled C+S (compute and storage on the same node)]
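The cost of scaling compute and storage together can be sketched with a little arithmetic. This is a minimal illustration, not a sizing tool: the function name and the node sizes (10 TB and 16 cores per node) are hypothetical values chosen only for the example.

```python
import math

def coupled_nodes(storage_tb, compute_cores, tb_per_node, cores_per_node):
    """Nodes needed when every node carries both storage and compute.

    The cluster must be sized for whichever requirement is larger,
    so the other resource ends up over-provisioned.
    """
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(compute_cores / cores_per_node)
    return max(nodes_for_storage, nodes_for_compute)

# Hypothetical workload: 400 TB of data, but only 64 cores of query load.
nodes = coupled_nodes(400, 64, tb_per_node=10, cores_per_node=16)
idle_cores = nodes * 16 - 64  # cores paid for only because storage needed the nodes
```

With these assumed numbers, storage alone dictates the node count, and most of the compute that comes with those nodes sits unused.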
4. THE BREAKTHROUGH…
Qubole combined the components of a successful big data platform,
as built at Facebook, with the elasticity of the public cloud.
Big Data + Cloud Infrastructure = The Future of Advanced & Big Data Analytics
Self-service access and ease of managed scale take place in the cloud…
6. QUBOLE VALUE PROPOSITION
Adaptability
➤ Choose the number of nodes and machine type for each workload
➤ Choose the best engine for each workload
Agility
➤ Initial provisioning in minutes
➤ Iteration – make changes on the fly
Cost
➤ Use spot pricing to pay up to 90% less
➤ Automation enables admins to support more users
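As a rough illustration of the spot-pricing point, the sketch below estimates the hourly cost of a cluster that mixes on-demand and spot nodes. The function name, the $1.00 node price, and the 50% spot share are hypothetical; the 90% discount is the upper bound quoted on the slide, not a typical figure.

```python
def hourly_cluster_cost(nodes, on_demand_price, spot_fraction, spot_discount):
    """Estimate hourly cost for a cluster mixing on-demand and spot nodes.

    spot_fraction: share of nodes bought on the spot market (0.0 to 1.0)
    spot_discount: spot saving relative to the on-demand price (0.9 = 90% less)
    """
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1.0 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price

# Hypothetical 10-node cluster at $1.00 per node-hour on demand.
all_on_demand = hourly_cluster_cost(10, 1.00, spot_fraction=0.0, spot_discount=0.9)
half_spot = hourly_cluster_cost(10, 1.00, spot_fraction=0.5, spot_discount=0.9)
```

Under these assumptions, moving half of the nodes to spot brings the hourly cost from about $10.00 down to about $5.50.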
7. PRESTO BACKGROUND
➤ Interactive/distributed SQL engine
➤ Open Source project - from Facebook
➤ Tested and in production at petabyte scale by companies
such as Facebook, Netflix, Airbnb, Dropbox, etc.
➤ Stemmed from demand for fast ad hoc queries on columnar data
11. COMPARATIVE VIEW
➤ Differences among the available distributed SQL engines:
➤ Hive
➤ Tez
➤ SparkSQL
➤ Presto
➤ Impala
➤ Various Use cases
12. HIVE VS PRESTO
➤ Hive is a great tool for a variety of ETL jobs
➤ Batch-processing nature makes it slow
➤ Presto - faster due to architectural difference (in-memory)
➤ Presto replaces Hive? - No…
13. PRESTO VS SPARKSQL
➤ Performance (data formats, type of query)
➤ Concurrency
➤ Configuration/tuning
➤ SparkSQL has access to Hive Optimizer through HiveContext
17. PRESTO FEATURES
➤ 5x-20x faster compared to Hive
➤ Works really well with ORC
➤ Near 100% compliant with ANSI SQL
➤ Parquet-related enhancements are in the works
➤ Good tool for interactive discovery - (e.g. Aggregate, Group
by, Fact-Dim join type of queries)
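The kind of interactive query named above (an aggregate with a GROUP BY over a fact-dimension join) can be sketched as follows. This uses Python's built-in sqlite3 purely as a local stand-in, not Presto; the `sales` and `stores` tables and their columns are hypothetical.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory fact table (sales) and dimension table (stores)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT)")
    conn.execute("CREATE TABLE sales (store_id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO stores VALUES (?, ?)",
                     [(1, "east"), (2, "west"), (3, "east")])
    conn.executemany("INSERT INTO sales VALUES (?, ?)",
                     [(1, 10), (1, 5), (2, 7), (3, 20)])
    return conn

def revenue_by_region(conn):
    """Aggregate the fact table after joining it to the dimension table."""
    query = """
        SELECT d.region, SUM(f.amount) AS total_amount
        FROM sales AS f
        JOIN stores AS d ON f.store_id = d.store_id
        GROUP BY d.region
        ORDER BY d.region
    """
    return conn.execute(query).fetchall()

rows = revenue_by_region(build_demo_db())
# rows == [("east", 35), ("west", 7)]
```

The SQL itself is plain join, aggregate, and GROUP BY syntax, which is the shape of query the slide recommends for interactive discovery.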
18. PRESTO FEATURES
➤ Supports S3 out of the box
➤ Connectors to external data-sources
➤ Qubole built a Kinesis connector to enable a near-real-time
experience
19. QUBOLE FEATURES & OPTIMIZATIONS
➤ Qubole SSD caching - http://docs.qubole.com/en/latest/user-guide/presto/ssd-caching.html
➤ Rubix optimized caching for Hive and Presto - https://www.qubole.com/blog/product/rubix-fast-cache-access-for-big-data-analytics-on-cloud-storage/
➤ GitHub: https://github.com/qubole/rubix
➤ Autoscaling Presto clusters
➤ AWS Kinesis connector - SQL analysis on stream data
➤ Plug and play UDF framework: http://www.qubole.com/blog/product/plugging-in-presto-udfs/
20. BEST PRACTICES
➤ InputFormat - ORC
➤ Use Sorted input
➤ Partitioning
➤ Careful with Join Order
➤ Avoid Large Fact-Fact joins
➤ Use for Large Fact-Dimension joins
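One way to see why join order and table sizes matter is a toy in-memory hash join: one side (the build side) is loaded into a hash table, and the other side (the probe side) is streamed past it. The sketch below is an illustration only, not Presto's implementation; `hash_join` and the sample rows are hypothetical, and it assumes an engine that does not reorder joins by cost, so the query author decides which table is the build side.

```python
from collections import defaultdict

def hash_join(probe_rows, build_rows, key):
    """Join two lists of dict rows on `key`.

    build_rows is held entirely in memory as a hash table;
    probe_rows is streamed one row at a time.
    Returns the joined rows and the number of rows held in memory.
    """
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    joined = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined, len(build_rows)

# Hypothetical data: a larger fact table and a small dimension table.
facts = [{"store_id": 1, "amount": 10},
         {"store_id": 2, "amount": 7},
         {"store_id": 1, "amount": 5}]
dims = [{"store_id": 1, "region": "east"},
        {"store_id": 2, "region": "west"}]

joined, rows_in_memory = hash_join(probe_rows=facts, build_rows=dims, key="store_id")
```

Building on the small dimension table keeps memory use proportional to that table. A fact-fact join has no small side to build on, which is one reason large fact-fact joins are discouraged above.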