Presto & differences between popular SQL engines (Spark, Redshift, and Hive)

This presentation was given at a Big Data Boulder / Denver Meetup event by Ashish Dubey, a Senior Solutions Architect at Qubole. The slides cover Presto's background and architecture, and how it differs in performance and cost from traditional Hadoop / Hive for ad hoc queries, as well as from SparkSQL, Impala, Tez, and Redshift. Several slides also describe Qubole's involvement with the open-source Presto project, including performance-optimizing contributions. Qubole is a big data analytics platform that addresses many of the headaches of running the traditional big data stack (Hadoop, Spark, Presto) on popular IaaS providers: AWS, Google Cloud, Microsoft Azure, and Oracle BMC.

Technology

PRESTO: AT SCALE IN THE CLOUD
Ashish Dubey
Solutions Architect
Qubole

COMPANY BACKGROUND
Founded in 2011 by the lead developers of Facebook’s data platform and
authors of the Apache Hive project: Joydeep Sen Sarma & Ashish Thusoo.
Qubole started out with cloud-based companies such as Pinterest and Shazam,
and has since grown with each phase of cloud adoption, adding
companies like Autodesk and Oracle.
Today, Qubole processes 500 petabytes of data in the cloud each month on
behalf of its customers.
World class product and engineering team from:

THE OLD WORLD: HADOOP & MODEL ISSUES
➤ Hadoop puts compute and storage together within a
compute node
➤ Forces compute and storage to scale together, which is not
ideal
➤ The cluster must be persistently on or else the data is
inaccessible
➤ Fixed or inflexible pricing model
[Diagram: twelve cluster nodes, each labeled C+S (compute + storage)]
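
The scaling problem on this slide can be made concrete with a little arithmetic. The sketch below is illustrative only: the workload size and per-node capacities are hypothetical numbers, not figures from the talk. It compares how many nodes are needed when every node must carry both storage and compute with how many compute nodes are needed when the data sits in an external object store.

```python
import math


def coupled_nodes(storage_tb, cores, tb_per_node, cores_per_node):
    """Nodes needed when every node carries both storage and compute (C+S).

    The cluster must be sized for whichever requirement is larger.
    """
    return max(
        math.ceil(storage_tb / tb_per_node),
        math.ceil(cores / cores_per_node),
    )


def decoupled_compute_nodes(cores, cores_per_node):
    """Compute nodes needed when the data lives in an external object store."""
    return math.ceil(cores / cores_per_node)


# Hypothetical workload: a lot of data, modest query demand.
STORAGE_TB, CORES = 480, 64
# Hypothetical node shape.
TB_PER_NODE, CORES_PER_NODE = 12, 16

print("coupled nodes:  ", coupled_nodes(STORAGE_TB, CORES, TB_PER_NODE, CORES_PER_NODE))
print("decoupled nodes:", decoupled_compute_nodes(CORES, CORES_PER_NODE))
```

With these assumed numbers the coupled cluster is sized by storage rather than by query demand, which is the inefficiency the slide points at.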

THE BREAKTHROUGH…
Qubole combined the components of creating a successful big
data platform from Facebook with the elasticity of the public cloud.
[Diagram: Big Data + Cloud Infrastructure = The Future of Advanced & Big Data Analytics]
Self-service access, and ease of managed scale take place in the cloud…

QUBOLE VALUE PROPOSITION
Adaptability
➤ Choose the number of nodes and machine type for each workload
➤ Choose the best engine for each workload
Agility
➤ Initial provisioning in minutes
➤ Iteration – make changes on the fly
Cost
➤ Use spot pricing to pay up to 90% less
➤ Automation enables admins to support more users
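
As a rough illustration of the cost point, the sketch below estimates the hourly cost of a cluster that mixes on-demand and spot nodes. All prices and cluster sizes are hypothetical; the 90% discount is the slide's "up to" figure, so it should be read as a best case rather than a typical saving.

```python
def hourly_cluster_cost(nodes, on_demand_price, spot_fraction, spot_discount):
    """Estimate hourly cost for a cluster with a mix of on-demand and spot nodes.

    spot_fraction: share of nodes bought on the spot market (0.0 to 1.0)
    spot_discount: spot discount relative to on-demand (0.9 means 90% cheaper)
    """
    spot_nodes = round(nodes * spot_fraction)
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


# Hypothetical cluster: 20 nodes at $1.00 per node-hour on demand.
all_on_demand = hourly_cluster_cost(20, 1.00, spot_fraction=0.0, spot_discount=0.9)
half_spot = hourly_cluster_cost(20, 1.00, spot_fraction=0.5, spot_discount=0.9)

print(f"all on-demand: ${all_on_demand:.2f}/hour")
print(f"50% spot:      ${half_spot:.2f}/hour")
```

The model ignores spot interruptions and price fluctuation, both of which matter when deciding how much of a cluster to run on spot nodes.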

PRESTO BACKGROUND
➤ Interactive/distributed SQL engine
➤ Open Source project - from Facebook
➤ Tested and in production at petabyte scale by companies
such as Facebook, Netflix, Airbnb, Dropbox, etc.
➤ Stemmed from demand for fast ad hoc queries on columnar data

PRESTO ARCHITECTURE
[Diagram components: Presto Client, Presto Coordinator, three workers, S3/HDFS, Hive Metastore]
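
To give a feel for how the components in this diagram cooperate, here is a toy model in Python. It is not Presto code and does not use any Presto API: the dictionaries standing in for the metastore and for S3/HDFS, the function names, and the sample data are all hypothetical. It only shows the general pattern of a coordinator looking up a table's splits, fanning them out to workers, and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the Hive Metastore: table name -> list of splits.
METASTORE = {"events": ["part-0", "part-1", "part-2", "part-3"]}

# Hypothetical stand-in for S3/HDFS: split name -> rows (one event type per row).
STORAGE = {
    "part-0": ["click", "view", "click"],
    "part-1": ["view", "view"],
    "part-2": ["click"],
    "part-3": ["purchase", "view"],
}


def worker_task(split):
    """Worker: scan one split from storage and return a partial count."""
    return Counter(STORAGE[split])


def coordinator(table, num_workers=3):
    """Coordinator: look up splits, fan them out to workers, merge partials."""
    splits = METASTORE[table]
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for partial in pool.map(worker_task, splits):
            total.update(partial)
    return dict(total)


# Similar in spirit to: SELECT event, count(*) FROM events GROUP BY event
print(coordinator("events"))
```

Threads stand in for worker machines here purely so the example runs on one laptop; a real deployment distributes the splits across separate processes on separate nodes.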

COMPARATIVE VIEW
➤ Differences among the available distributed SQL engines:
➤ Hive
➤ Tez
➤ SparkSQL
➤ Presto
➤ Impala
➤ Various use cases

HIVE VS PRESTO
➤ Hive is a great tool for a variety of ETL jobs
➤ Its batch-processing nature makes it slow for interactive queries
➤ Presto is faster due to an architectural difference (in-memory processing)
➤ Does Presto replace Hive? No…

PRESTO VS SPARKSQL
➤ Performance (data formats, type of query)
➤ Concurrency
➤ Configuration/tuning
➤ SparkSQL has access to Hive Optimizer through HiveContext

PRESTO VS REDSHIFT
➤ Cost effectiveness (spot instances)
➤ Storage is coupled with compute (in Redshift)
➤ Efficiency
➤ Data Availability
➤ Autoscaling
➤ BI integration

PRESTO FEATURES
➤ 5x-20x faster than Hive
➤ Works really well with ORC
➤ Nearly 100% compliant with ANSI SQL
➤ Parquet-related enhancements are in the works
➤ Good tool for interactive discovery (e.g. aggregate, GROUP BY,
and fact-dimension join queries)
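
The "interactive discovery" query shape mentioned in the last bullet (a fact-dimension join followed by GROUP BY and aggregates) looks roughly like the sketch below. To keep the example runnable without a cluster, it uses Python's built-in SQLite instead of Presto; the table names, columns, and rows are hypothetical. The SQL itself is plain ANSI-style SQL of the kind the slide describes.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in so the query can run locally.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games'), (3, 'books');
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 25.0), (3, 5.0), (1, 7.5), (2, 2.5);
    """
)

# Fact-dimension join with aggregation: orders and revenue per product category.
QUERY = """
    SELECT d.category, COUNT(*) AS orders, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category
    ORDER BY d.category
"""

rows = conn.execute(QUERY).fetchall()
for category, orders, revenue in rows:
    print(category, orders, revenue)
```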

PRESTO FEATURES
➤ Supports S3 out of the box
➤ Connectors to external data sources
➤ Qubole built a Kinesis connector to enable a near-real-time
experience

QUBOLE FEATURES & OPTIMIZATIONS
➤ Qubole SSD caching - http://docs.qubole.com/en/latest/user-guide/presto/ssd-caching.html
➤ Rubix optimized caching for Hive and Presto - https://www.qubole.com/blog/product/rubix-fast-cache-access-for-big-data-analytics-on-cloud-storage/
➤ GitHub: https://github.com/qubole/rubix
➤ Autoscaling Presto clusters
➤ AWS Kinesis connector - SQL analysis on stream data
➤ Plug and play UDF framework: http://www.qubole.com/blog/product/plugging-in-presto-udfs/

BEST PRACTICES
➤ InputFormat - ORC
➤ Use Sorted input
➤ Partitioning
➤ Be careful with join order
➤ Avoid large fact-fact joins
➤ Use for large fact-dimension joins
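
One way to see why join order and fact-dimension joins matter is a toy in-memory hash join. The sketch below is a simplification written for this illustration, not Presto's implementation, and the tables are hypothetical. It assumes the common hash-join pattern in which one input (the build side) is loaded into a hash table in memory while the other (the probe side) is streamed past it.

```python
def hash_join(probe_rows, build_rows, key):
    """Join two lists of dicts on `key`.

    Only build_rows is loaded into the hash table; probe_rows is streamed.
    Returns the joined rows and the number of rows held in memory.
    """
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)

    joined = []
    for row in probe_rows:  # streamed one row at a time
        for match in table.get(row[key], []):
            joined.append({**row, **match})
    return joined, len(build_rows)


# Hypothetical tables: a larger fact table and a small dimension table.
fact = [{"store_id": i % 2, "amount": i} for i in range(6)]
dim = [{"store_id": 0, "city": "Denver"}, {"store_id": 1, "city": "Boulder"}]

good, mem_good = hash_join(fact, dim, "store_id")  # build on the small table
bad, mem_bad = hash_join(dim, fact, "store_id")    # build on the large table

print("rows joined:", len(good), "| rows in memory:", mem_good)
print("rows joined:", len(bad), "| rows in memory:", mem_bad)
```

Both orders return the same joined rows, but the memory held by the build side differs. With two large fact tables neither side is small enough to build cheaply, which is consistent with the advice to avoid large fact-fact joins.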

LIMITATIONS
➤ Fault tolerance
➤ Larger Joins
➤ Disk spills

QUESTIONS?
help@qubole.com
Try free for 14 days! - api.qubole.com
Education Courses (Presto, Spark, Hive, and more!) - qubole.com/education

9. PRESTO & BIG DATA @QUBOLE

10. AUTOSCALING PRESTO @ QUBOLE

14. PRESTO VS SPARKSQL

16. COST ANALYSIS PER WORKLOAD VS. REDSHIFT

22. ${DEMO}
