PRESTO: AT SCALE IN THE CLOUD
Ashish Dubey
Solutions Architect
Qubole
COMPANY BACKGROUND
Founded in 2011 by the lead developers of Facebook’s data platform and
authors of the Apache Hive project: Joydeep Sen Sarma and Ashish Thusoo.
Qubole started out with cloud-based companies such as Pinterest and Shazam,
and has since grown with each phase of cloud adoption, adding
companies like Autodesk and Oracle.
Today, Qubole processes 500 petabytes of data in the cloud each month on
behalf of its customers.
World class product and engineering team from:
THE OLD WORLD: HADOOP & MODEL ISSUES
➤ Hadoop puts compute and storage together within a
compute node
➤ Forces compute and storage to scale together, which is not
ideal
➤ The cluster must be persistently on or else the data is
inaccessible
➤ Fixed or inflexible pricing model
[Diagram: twelve cluster nodes, each labeled C+S (compute + storage)]
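The coupling described in the bullets above can be made concrete with a little arithmetic. The sketch below is illustrative only: the per-node capacities and the workload figures are made-up assumptions, not numbers from the talk.

```python
import math

# Hypothetical per-node capacities (assumptions for illustration only).
STORAGE_TB_PER_NODE = 10
CORES_PER_NODE = 16


def coupled_nodes(storage_tb: float, cores: int) -> int:
    """Nodes needed when every node carries both compute and storage (C+S)."""
    for_storage = math.ceil(storage_tb / STORAGE_TB_PER_NODE)
    for_compute = math.ceil(cores / CORES_PER_NODE)
    # The larger requirement dictates cluster size; the other resource sits idle.
    return max(for_storage, for_compute)


def decoupled_compute_nodes(cores: int) -> int:
    """Compute nodes needed when the data lives in an object store instead."""
    return math.ceil(cores / CORES_PER_NODE)


# A storage-heavy workload: 500 TB of data but only 64 cores of query load.
coupled = coupled_nodes(storage_tb=500, cores=64)
decoupled = decoupled_compute_nodes(cores=64)
print(f"coupled cluster: {coupled} nodes, decoupled compute: {decoupled} nodes")
```

With these assumed figures the coupled cluster is sized by storage alone, so most of its compute is paid for but unused.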
THE BREAKTHROUGH…
Qubole combined the components of a successful big data platform, as built
at Facebook, with the elasticity of the public cloud.
Big Data + Cloud Infrastructure = The Future of Advanced & Big Data Analytics
Self-service access and easily managed scale take place in the cloud…
QUBOLE VALUE PROPOSITION
Adaptability
➤ Choose the number of nodes and machine type for each workload
➤ Choose the best engine for each workload
Agility
➤ Initial provisioning in minutes
➤ Iteration – make changes on the fly
Cost
➤ Use spot pricing for up to 90% lower cost
➤ Automation enables admins to support more users
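To see how the spot-pricing point affects a cluster bill, here is a minimal sketch of a blended hourly cost. The node count, the on-demand price, and the spot fraction are hypothetical; only the "up to 90%" discount comes from the slide, and real spot discounts vary.

```python
def hourly_cluster_cost(nodes, on_demand_price, spot_fraction, spot_discount):
    """Blended hourly cost when a fraction of the nodes run on spot instances."""
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")
    if not 0.0 <= spot_discount <= 1.0:
        raise ValueError("spot_discount must be between 0 and 1")
    spot_nodes = nodes * spot_fraction
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1.0 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


# Hypothetical 20-node cluster at $1.00 per node-hour on demand.
all_on_demand = hourly_cluster_cost(20, 1.00, spot_fraction=0.0, spot_discount=0.9)
mostly_spot = hourly_cluster_cost(20, 1.00, spot_fraction=0.8, spot_discount=0.9)
print(f"on-demand only: ${all_on_demand:.2f}/h, 80% spot: ${mostly_spot:.2f}/h")
```

The best-case discount is used here; a more cautious estimate would plug in the discount actually observed for the chosen instance type.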
PRESTO BACKGROUND
➤ Interactive/distributed SQL engine
➤ Open source project, originally from Facebook
➤ Tested and in production at petabyte scale by companies
such as Facebook, Netflix, Airbnb, Dropbox, etc.
➤ Stemmed from demand for fast ad hoc queries on columnar data
PRESTO ARCHITECTURE
[Diagram: Presto Client, Presto Coordinator, three workers, S3/HDFS, Hive Metastore]
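The components named in the architecture slide can be pictured with a toy model: a coordinator divides the input into splits, workers scan their splits in parallel, and the partial results are merged. This is only a sketch of the general scatter/gather idea, using Python threads as stand-in workers; real Presto planning and execution are considerably more elaborate, and all function names here are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def worker_partial_count(split):
    """Worker: scan one split and return partial counts per country."""
    return Counter(row["country"] for row in split)


def coordinator_run(rows, n_workers=3):
    """Coordinator: cut the input into splits, fan out, merge partial results."""
    splits = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial_count, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Equivalent in spirit to: SELECT country, COUNT(*) FROM t GROUP BY country
rows = [{"country": c} for c in ["US", "IN", "US", "DE", "IN", "US"]]
result = coordinator_run(rows)
print(result)
```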
PRESTO & BIG DATA @QUBOLE
AUTOSCALING PRESTO @ QUBOLE
COMPARATIVE VIEW
➤ Differences in SQL distributed engines available:
➤ Hive
➤ Tez
➤ SparkSQL
➤ Presto
➤ Impala
➤ Various Use cases
HIVE VS PRESTO
➤ Hive is a great tool for a variety of ETL jobs
➤ Its batch-processing nature makes it slow
➤ Presto is faster due to an architectural difference (in-memory processing)
➤ Does Presto replace Hive? No…
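The architectural difference mentioned above can be illustrated with a small contrast between a batch style, where each stage writes its full output to disk before the next one starts, and a pipelined style, where rows flow between stages in memory. This is a simplified analogy written for this summary, not code from Hive or Presto.

```python
import json
import tempfile


def batch_style(rows):
    """Stage 1 materializes its whole output on disk before stage 2 reads it."""
    with tempfile.TemporaryFile("w+") as stage_out:
        for row in rows:  # stage 1: filter
            if row["amount"] > 10:
                stage_out.write(json.dumps(row) + "\n")
        stage_out.seek(0)
        # stage 2: aggregate, reading the intermediate file back
        return sum(json.loads(line)["amount"] for line in stage_out)


def pipelined_style(rows):
    """Rows stream from the filter to the aggregate without touching disk."""
    filtered = (row for row in rows if row["amount"] > 10)
    return sum(row["amount"] for row in filtered)


rows = [{"amount": a} for a in (5, 20, 30, 8)]
batch_total = batch_style(rows)
pipelined_total = pipelined_style(rows)
print(batch_total, pipelined_total)
```

Both functions return the same answer; the difference is the intermediate I/O, which is one reason a batch engine remains a reasonable choice for long ETL jobs while a pipelined one suits interactive queries.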
PRESTO VS SPARKSQL
➤ Performance (data formats, type of query)
➤ Concurrency
➤ Configuration/tuning
➤ SparkSQL has access to Hive Optimizer through HiveContext
PRESTO VS REDSHIFT
➤ Cost effectiveness (spot instances)
➤ In Redshift, storage is coupled with compute
➤ Efficiency
➤ Data Availability
➤ Autoscaling
➤ BI integration
COST ANALYSIS PER WORKLOAD VS. REDSHIFT
PRESTO FEATURES
➤ 5x-20x faster than Hive
➤ Works really well with ORC
➤ Nearly 100% compliant with ANSI SQL
➤ Parquet-related enhancements are in the works
➤ Good tool for interactive discovery (e.g. aggregate, GROUP BY,
and fact-dimension join queries)
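As an example of the query shape the last bullet has in mind, the sketch below runs an aggregate over a fact-dimension join. It uses Python's built-in sqlite3 module purely as a local stand-in so the example is runnable; it is not Presto, and the table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 5.0), (2, 20.0);
    """
)

# Large fact table joined to a small dimension table, then grouped.
query = """
    SELECT d.category, COUNT(*) AS orders, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category
    ORDER BY d.category
"""
result = conn.execute(query).fetchall()
for category, orders, revenue in result:
    print(category, orders, revenue)
conn.close()
```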
PRESTO FEATURES
➤ Supports S3 out of the box
➤ Connectors to external data sources
➤ Qubole built a Kinesis connector to enable a near-real-time
experience
QUBOLE FEATURES & OPTIMIZATIONS
➤ Qubole SSD caching - http://docs.qubole.com/en/latest/user-guide/presto/ssd-caching.html
➤ Rubix optimized caching for Hive and Presto - https://www.qubole.com/blog/product/rubix-fast-cache-access-for-big-data-analytics-on-cloud-storage/
➤ GitHub: https://github.com/qubole/rubix
➤ Autoscaling Presto clusters
➤ AWS Kinesis connector - SQL analysis on stream data
➤ Plug-and-play UDF framework: http://www.qubole.com/blog/product/plugging-in-presto-udfs/
BEST PRACTICES
➤ InputFormat - ORC
➤ Use sorted input
➤ Partitioning
➤ Be careful with join order
➤ Avoid large fact-fact joins
➤ Use for large fact-dimension joins
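The join advice above follows from how an in-memory hash join behaves: one input is loaded into a hash table (the build side) and the other is streamed past it (the probe side), so memory use tracks the size of the build side. The sketch below is a generic hash join written for illustration, not Presto's implementation; which side an engine builds from depends on the engine and its version.

```python
def hash_join(probe_rows, build_rows, key):
    """Join two row lists; only build_rows is held in the in-memory hash table."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    rows_in_memory = sum(len(bucket) for bucket in table.values())

    joined = []
    for row in probe_rows:  # the probe side is streamed, never stored
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined, rows_in_memory


# Hypothetical data: a six-row fact table and a two-row dimension table.
fact = [{"product_id": i % 2 + 1, "amount": 10 * i} for i in range(6)]
dim = [
    {"product_id": 1, "category": "books"},
    {"product_id": 2, "category": "games"},
]

good_rows, good_mem = hash_join(probe_rows=fact, build_rows=dim, key="product_id")
bad_rows, bad_mem = hash_join(probe_rows=dim, build_rows=fact, key="product_id")
print(len(good_rows), good_mem, len(bad_rows), bad_mem)
```

Both orders produce the same joined rows, but building from the fact table holds the large input in memory. A fact-fact join has no small side to build from, which is why it is the case to avoid.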
LIMITATIONS
➤ Fault tolerance
➤ Larger Joins
➤ Disk spills
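One common way to live with limited fault tolerance is to retry a failed query from the client side. The sketch below assumes a caller-supplied `run_query` function and a `QueryFailed` exception, both hypothetical names introduced here; it is not part of any Presto or Qubole API.

```python
import time


class QueryFailed(Exception):
    """Hypothetical error raised when a query dies, e.g. after losing a worker."""


def run_with_retry(run_query, sql, attempts=3, backoff_seconds=0.0):
    """Re-submit the whole query on failure, waiting longer after each attempt."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query(sql)
        except QueryFailed as err:
            last_error = err
            if attempt < attempts:
                time.sleep(backoff_seconds * attempt)
    raise last_error


# Stand-in for a real client call: fails once, then succeeds.
calls = {"n": 0}


def flaky_run_query(sql):
    calls["n"] += 1
    if calls["n"] == 1:
        raise QueryFailed("worker lost")
    return [("ok",)]


result = run_with_retry(flaky_run_query, "SELECT 1")
print(result, calls["n"])
```

Retrying restarts the query from scratch, so it suits short interactive queries better than long-running jobs.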
${DEMO}
QUESTIONS?
help@qubole.com
Try free for 14 days! - api.qubole.com
Education courses (Presto, Spark, Hive, and more!) - qubole.com/education

Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
