The Presto Cost-Based Optimizer
for interactive SQL on anything
Strata Data Conference - London
May 1, 2019
Wojciech.Biela@starburstdata.com
Piotr.Findeisen@starburstdata.com
Agenda
• Who we are
• Presto - quick recap
• Presto’s CBO launch
• Recent CBO enhancements
• CBO Roadmap
© 2019
Starburst
Starburst Data
Founded by Presto committers:
● Over 4 years of contributions to Presto
● Presto distro for on-prem and cloud env
● Supporting large customers in production
● Enterprise subscription add-ons
Notable features contributed:
● ANSI SQL syntax enhancements
● Execution engine improvements
● Security integrations
● Spill to disk
● Cost-Based Optimizer
https://www.starburstdata.com/presto-enterprise/
Starburst Presto & Cloud
https://www.starburstdata.com/technical-blog/announcing-starburst-enterprise-302e-with-mission-control/
Presto
Project History
• Fall 2012: 4 developers start Presto development
• Fall 2013: Facebook open sources Presto
• Spring 2015: Teradata joins the community, begins investing heavily in the project
• Summer 2017: 180+ releases, 50+ contributors, 5000+ commits
• Winter 2017: Starburst is founded by a team of Presto committers, Teradata veterans
• Winter 2019: Presto Software Foundation established
The Presto fan club
See more at
https://github.com/prestosql/presto/wiki/Presto-Users
Presto in Production
• Facebook: 10,000+ nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users
• Uber: 2,000+ nodes (clusters on prem.) with 160K+ queries daily over HDFS (Parquet/ORC)
• Twitter: 2,000+ nodes (several clusters on premises and GCP), 20K+ queries daily (Parquet)
• LinkedIn: 500+ nodes, 200K+ queries daily over HDFS (ORC), and ~1000 users
• Lyft: 400+ nodes in AWS, 100K+ queries daily, 20+ PBs in S3 (Parquet)
• Netflix: 300+ nodes in AWS, 100+ PB in S3 (Parquet)
• Yahoo! Japan: 200+ nodes for HDFS (ORC), and ObjectStore
• FINRA: 120+ nodes in AWS, 4PB in S3 (ORC), 200+ users
Why Presto?
Community-driven open source project
High performance ANSI SQL engine
• New Cost-Based Query Optimizer
• Proven scalability
• High concurrency
Separation of compute and storage
• Scale storage and compute independently
• No ETL or data integration necessary to get to insights
• SQL-on-anything
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
Built for Performance
Query Execution Engine:
• MPP-style pipelined in-memory execution
• Vectorized data processing
• Runtime query bytecode generation
• Memory efficient data structures
• Multi-threaded multi-core execution
• Optimized readers for columnar formats (ORC and Parquet)
• Predicate and column projection pushdown
• Now also Cost-Based Optimizer
Architecture
[Diagram: Client; Coordinator (Parser/analyzer, Planner, Scheduler); Workers; pluggable Metadata API, Data location API and Data stream API]
Presto CBO
Before we optimize ...
Join in Presto
• Hash Join
• Right table is in memory ("build table")
• Left table is streamed ("probe table")
• Can be broadcast or repartitioned
• A join can be followed by a join, can be followed by a join…
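The build/probe split can be illustrated with a minimal single-process sketch. This is not Presto's implementation; the function name hash_join and the sample rows are made up for illustration.

```python
from collections import defaultdict

def hash_join(probe_rows, build_rows, probe_key, build_key):
    """Join two lists of dicts on equality of probe_key and build_key."""
    # Build phase: load the whole right ("build") side into an in-memory hash table.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: stream the left ("probe") side and look up each row's key.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**row, **match}

# Hypothetical sample data.
orders = [{"orderkey": 1, "custkey": 10}, {"orderkey": 2, "custkey": 11}]
customers = [{"custkey": 10, "name": "Ann"}]
joined = list(hash_join(orders, customers, "custkey", "custkey"))
```

Only the build side has to fit in memory, which is why the choice of which table goes on the right matters.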
CBO in a nutshell
Cost-Based Optimizer v1 includes:
• join reordering based on selectivity estimates and cost
• automatic join type selection (repartitioned vs broadcast)
• automatic left/right side selection for joined tables
• support for statistics stored in Hive Metastore
https://www.starburstdata.com/technical-blog/
Cost and Statistics
Cost calculation includes:
• CPU
• Memory usage
• Network I/O
Hive Metastore statistics:
• number of rows in a table
• number of distinct values in a column
• fraction of NULL values in a column
• minimum/maximum value in a column
• average data size for a column
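As a rough sketch of how such statistics can feed an estimate, the code below models a three-part cost and applies a textbook equi-join cardinality formula. The class and function names are hypothetical, and the formula is a common simplification, not necessarily the one Presto uses.

```python
from dataclasses import dataclass

@dataclass
class ColumnStats:
    distinct_values: float   # number of distinct values (NDV)
    null_fraction: float     # fraction of NULL values

@dataclass
class Cost:
    cpu: float
    memory: float
    network: float

    def __add__(self, other):
        # Costs of plan fragments add up component by component.
        return Cost(self.cpu + other.cpu,
                    self.memory + other.memory,
                    self.network + other.network)

def estimate_join_rows(left_rows, left_col, right_rows, right_col):
    """Textbook equi-join estimate: non-null row pairs divided by the larger NDV."""
    left_non_null = left_rows * (1 - left_col.null_fraction)
    right_non_null = right_rows * (1 - right_col.null_fraction)
    return left_non_null * right_non_null / max(left_col.distinct_values,
                                                right_col.distinct_values)

# Hypothetical numbers: 1,000 rows joined with 10,000 rows, half of them NULL.
rows = estimate_join_rows(1000, ColumnStats(100, 0.0), 10_000, ColumnStats(100, 0.5))
```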
Join left/right side decision
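Because the build side is held in memory, a simple heuristic is to put the smaller input there. A minimal sketch, assuming estimated sizes in bytes are available; the function name and the sizes are illustrative only.

```python
def order_join_sides(left, right):
    """left and right are (name, estimated_bytes) pairs. Returns (probe, build)."""
    # The build side is held in memory, so prefer the smaller input there.
    if left[1] < right[1]:
        return right, left   # flip the sides: left is smaller
    return left, right       # keep the sides: right is already the smaller one

probe, build = order_join_sides(("customer", 50_000_000), ("lineitem", 2_000_000_000))
```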
Join type selection
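One way to picture the choice is to compare estimated network traffic for the two distributions. The sketch below is a deliberately simplified model, not Presto's cost formulas; choose_distribution and max_broadcast_bytes are hypothetical names.

```python
def choose_distribution(probe_bytes, build_bytes, workers, max_broadcast_bytes):
    """Pick 'broadcast' or 'repartitioned' by comparing estimated network traffic."""
    # Broadcast: every worker receives a full copy of the build side.
    broadcast_network = build_bytes * workers
    # Repartitioned: both sides are shuffled once across the network by join key.
    repartitioned_network = probe_bytes + build_bytes
    # A full copy of the build side must also fit in each worker's memory.
    if build_bytes <= max_broadcast_bytes and broadcast_network < repartitioned_network:
        return "broadcast"
    return "repartitioned"

# Hypothetical sizes: a large fact table joined with a small and with a large table.
small_dim = choose_distribution(10**12, 10**7, workers=20, max_broadcast_bytes=10**8)
two_large = choose_distribution(10**12, 10**11, workers=20, max_broadcast_bytes=10**8)
```

In this model a small build table is broadcast, while two large tables are repartitioned.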
Join reordering
"Which customers are spending the most at our shop?"
SELECT c.custkey, sum(l.price)
FROM customer c
JOIN orders o ON c.custkey = o.custkey
JOIN lineitem l ON o.orderkey = l.orderkey
GROUP BY c.custkey
ORDER BY sum(l.price) DESC;
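To see why join order matters for this query, the sketch below enumerates left-deep join orders and scores each by the total number of rows its joins produce. The row counts, join-key NDVs and the cost measure are invented for illustration; a real optimizer uses a richer cost model.

```python
from itertools import permutations

# Hypothetical statistics, for illustration only.
ROWS = {"customer": 1_000, "orders": 10_000, "lineitem": 100_000}
# Distinct values of the join key for each join condition in the query.
JOIN_KEY_NDV = {
    frozenset({"customer", "orders"}): 1_000,    # c.custkey = o.custkey
    frozenset({"orders", "lineitem"}): 10_000,   # o.orderkey = l.orderkey
}

def plan_cost(order):
    """Total rows produced by the joins of a left-deep order; None if it needs a cross join."""
    joined, rows, cost = {order[0]}, ROWS[order[0]], 0
    for table in order[1:]:
        ndvs = [JOIN_KEY_NDV[frozenset({t, table})] for t in joined
                if frozenset({t, table}) in JOIN_KEY_NDV]
        if not ndvs:
            return None                       # no join condition connects this table yet
        rows = rows * ROWS[table] // ndvs[0]  # estimated output rows of this join
        cost += rows
        joined.add(table)
    return cost

costs = {order: plan_cost(order) for order in permutations(ROWS)}
valid = {order: cost for order, cost in costs.items() if cost is not None}
best = min(valid, key=valid.get)
```

With these made-up numbers, orders that join customer and orders first keep the intermediate result small and leave lineitem for last.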
Join reordering with filter
"Which customers are spending the most on coffee?"
SELECT c.custkey, sum(l.price)
FROM customer c
JOIN orders o ON c.custkey = o.custkey
JOIN lineitem l ON o.orderkey = l.orderkey
WHERE l.item = 'coffee'
GROUP BY c.custkey
ORDER BY sum(l.price) DESC;
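A filter changes the picture because it shrinks one input before the joins run. A common textbook estimate for an equality predicate, assuming uniformly distributed values, is sketched below; the NDV of l.item and the row count are hypothetical.

```python
def equality_selectivity(distinct_values, null_fraction=0.0):
    """Estimated fraction of rows matching `col = constant`, assuming uniform values."""
    return (1.0 - null_fraction) / distinct_values

lineitem_rows = 100_000   # hypothetical table size
item_ndv = 50             # hypothetical number of distinct items
filtered_rows = round(lineitem_rows * equality_selectivity(item_ndv))
```

With these numbers the filtered lineitem input is estimated at about 2,000 rows, so an optimizer may prefer to join it earlier than it would without the filter.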
Presto CBO Speedup
Duration of TPC-DS queries (lower is better)
https://www.starburstdata.com/presto-benchmarks/
Cloud cost reduction
● on average 7x improvement vs EMR Presto (up to 18x faster!)
● EMR Presto cannot execute many TPC-DS queries (all pass on Starburst Presto)
● on average 4x improvement on simpler queries benchmark (TPC-H, 3rd party benchmark)
https://www.starburstdata.com/presto-aws/
https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-emr-sql/
CBO Next Chapter
Enhancements to date
• Deciding on semi-join distribution type based on cost
  • "… WHERE x IN (SELECT y FROM ...)" queries
• Capping the size of a broadcast table
• Various minor fixes in cardinality estimation
• Peak memory estimation (more meaningful memory cost metric)
Collecting statistics
• ANALYZE table
  • native in Presto
  • Hive runtime no longer needed to collect stats
• Collecting table statistics on write
  • low overhead, the data is being processed in memory
• Stats for AWS Glue Catalog
  • Glue does not provide statistics support out of the box
  • exclusive from Starburst
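Collecting statistics on write is cheap because every value already passes through memory. A minimal single-pass sketch follows; the class name is hypothetical, and the exact set used for distinct values is an assumption made for brevity (a real engine would more likely use an approximate sketch).

```python
class ColumnStatsCollector:
    """Accumulates column statistics in one pass over values as they are written."""

    def __init__(self):
        self.rows = 0
        self.nulls = 0
        self.minimum = None
        self.maximum = None
        self._distinct = set()   # exact NDV, for illustration only

    def add(self, value):
        self.rows += 1
        if value is None:
            self.nulls += 1
            return
        self._distinct.add(value)
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def summary(self):
        return {
            "rows": self.rows,
            "distinct_values": len(self._distinct),
            "null_fraction": self.nulls / self.rows if self.rows else 0.0,
            "min": self.minimum,
            "max": self.maximum,
        }

collector = ColumnStatsCollector()
for price in [5, None, 3, 5]:
    collector.add(price)
stats = collector.summary()
```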
Statistics in other connectors
• CBO initially supported with Hive connector only
• We added stats for RDBMS connectors (Starburst)
  • PostgreSQL
  • MySQL
  • SQL Server
  • Teradata
  • Oracle
• Enables CBO in federation use-cases
CBO Roadmap
What’s next
• Stats support
  • Improved stats for Hive and Glue, e.g. NDV estimates across partitions
  • Stats for NoSQL connectors
• Core CBO and engine enhancements
  • Involve connectors in optimizations ("*-pushdown")
  • Peak memory-based decisions
  • Adjust cost model weights based on the hardware
  • Cost and stats for more operation types
  • Introduce Traits: consider alternative plans during optimizations
  • Late materialization
  • Adaptive optimizations
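NDV estimates across partitions are hard because per-partition distinct counts cannot simply be added up. The sketch below only computes the bounds that per-partition NDVs imply; the function name and numbers are illustrative, and how to pick a value between the bounds is left open.

```python
def ndv_bounds(partition_ndvs):
    """Bounds on a table-level NDV given only per-partition NDVs."""
    lower = max(partition_ndvs)   # every partition holds a subset of the same values
    upper = sum(partition_ndvs)   # no value appears in more than one partition
    return lower, upper

bounds = ndv_bounds([100, 120, 90])
```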
Further reading
https://prestosql.io/
https://www.starburstdata.com/technical-blog/
https://fivetran.com/blog/warehouse-benchmark
https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-emr-sql/
http://bytes.schibsted.com/bigdata-sql-query-engine-benchmark/
https://virtuslab.com/blog/benchmarking-spark-sql-presto-hive-bi-processing-googles-cloud-dataproc/
Thank You!
Twitter: @starburstdata @prestosql
Blog: www.starburstdata.com/technical-blog/
Newsletter: www.starburstdata.com/newsletter

Contenu connexe

Tendances

Accelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesAccelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesPaul Van Siclen
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseDatabricks
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceDataWorks Summit
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop securitybigdatagurus_meetup
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3Yu Lun Teo
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?DataWorks Summit
 
AWS RDS Benchmark - Instance comparison
AWS RDS Benchmark - Instance comparisonAWS RDS Benchmark - Instance comparison
AWS RDS Benchmark - Instance comparisonRoberto Gaiser
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?confluent
 
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Amazon Web Services
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep DiveAmazon Web Services
 
Azure Database Services for MySQL PostgreSQL and MariaDB
Azure Database Services for MySQL PostgreSQL and MariaDBAzure Database Services for MySQL PostgreSQL and MariaDB
Azure Database Services for MySQL PostgreSQL and MariaDBNicholas Vossburg
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefitsRicky Barron
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentationArvind Kumar
 
Hive Authorization
Hive AuthorizationHive Authorization
Hive AuthorizationMinwoo Kim
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 

Tendances (20)

Accelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesAccelerate and modernize your data pipelines
Accelerate and modernize your data pipelines
 
Considerations for Data Access in the Lakehouse
Considerations for Data Access in the LakehouseConsiderations for Data Access in the Lakehouse
Considerations for Data Access in the Lakehouse
 
snowpro (1).pdf
snowpro (1).pdfsnowpro (1).pdf
snowpro (1).pdf
 
Scaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter ExperienceScaling Big Data Mining Infrastructure Twitter Experience
Scaling Big Data Mining Infrastructure Twitter Experience
 
Apache Sentry for Hadoop security
Apache Sentry for Hadoop securityApache Sentry for Hadoop security
Apache Sentry for Hadoop security
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Intro to Amazon S3
Intro to Amazon S3Intro to Amazon S3
Intro to Amazon S3
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
 
AWS RDS Benchmark - Instance comparison
AWS RDS Benchmark - Instance comparisonAWS RDS Benchmark - Instance comparison
AWS RDS Benchmark - Instance comparison
 
How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
Migrating Your Data Warehouse to Amazon Redshift (DAT337) - AWS re:Invent 2018
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Azure Database Services for MySQL PostgreSQL and MariaDB
Azure Database Services for MySQL PostgreSQL and MariaDBAzure Database Services for MySQL PostgreSQL and MariaDB
Azure Database Services for MySQL PostgreSQL and MariaDB
 
Data lake benefits
Data lake benefitsData lake benefits
Data lake benefits
 
Hadoop hive presentation
Hadoop hive presentationHadoop hive presentation
Hadoop hive presentation
 
Hive Authorization
Hive AuthorizationHive Authorization
Hive Authorization
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Data mesh
Data meshData mesh
Data mesh
 

Similaire à Starburst Presto - CBO talk - Strata London 2019

Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBOkbajda
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Kent Graziano
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Bostonkbajda
 
Presto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectivePresto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectiveAlluxio, Inc.
 
Inawsidom - Data Journey
Inawsidom - Data JourneyInawsidom - Data Journey
Inawsidom - Data JourneyPhilipBasford
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesAlluxio, Inc.
 
CloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloudCloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloudTim Mackey
 
Standard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud JourneyStandard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud JourneyAmazon Web Services
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data AnalyticsAmazon Web Services
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL Torsten Steinbach
 
HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020HostBridge Technology
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j
 
Cruising in data lake from zero to scale
Cruising in data lake from zero to scaleCruising in data lake from zero to scale
Cruising in data lake from zero to scaleJohn Varghese
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon RedshiftAmazon Web Services
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsYong Feng
 

Similaire à Starburst Presto - CBO talk - Strata London 2019 (20)

Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Boston
 
Presto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectivePresto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspective
 
Inawsidom - Data Journey
Inawsidom - Data JourneyInawsidom - Data Journey
Inawsidom - Data Journey
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with Kubernetes
 
CloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloudCloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloud
 
Standard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud JourneyStandard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud Journey
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
 
HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Cruising in data lake from zero to scale
Cruising in data lake from zero to scaleCruising in data lake from zero to scale
Cruising in data lake from zero to scale
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflowsCloud nativecomputingtechnologysupportinghpc cognitiveworkflows
Cloud nativecomputingtechnologysupportinghpc cognitiveworkflows
 

Dernier

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identityteam-WIBU
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfDrew Moseley
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 

Dernier (20)

How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Post Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on IdentityPost Quantum Cryptography – The Impact on Identity
Post Quantum Cryptography – The Impact on Identity
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Comparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdfComparing Linux OS Image Update Models - EOSS 2024.pdf
Comparing Linux OS Image Update Models - EOSS 2024.pdf
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 

Starburst Presto - CBO talk - Strata London 2019

  • 1. The Presto Cost-Based Optimizer for interactive SQL on anything Strata Data Conference - London May 1, 2019 Wojciech.Biela@starburstdata.com Piotr.Findeisen@starburstdata.com
  • 2. Agenda • Who we are? • Presto - quick recap • Presto’s CBO launch • Recent CBO enhancements • CBO Roadmap 2© 2019
  • 4. Starburst Data © 2019 4 Founded by Presto committers: ● Over 4 years of contributions to Presto ● Presto distro for on-prem and cloud env ● Supporting large customers in production ● Enterprise subscription add-ons Notable features contributed: ● ANSI SQL syntax enhancements ● Execution engine improvements ● Security integrations ● Spill to disk ● Cost-Based Optimizer https://www.starburstdata.com/presto-enterprise/
  • 5. Starburst Presto & Cloud © 2019 5 https://www.starburstdata.com/technical-blog/announcing-starburst-enterprise-302e-with-mission-control/
  • 7. 7
  • 8. Project History 8© 2019 FALL 2012 4 developers start Presto development SUMMER 2017 180+ Releases 50+ Contributors 5000+ Commits WINTER 2017 Starburst is founded by a team of Presto committers, Teradata veterans FALL 2013 Facebook open sources Presto SPRING 2015 Teradata joins the community, begins investing heavily in the project WINTER 2019 Presto Software Foundation established
  • 9. The Presto fan club 9© 2019 See more at https://github.com/prestosql/presto/wiki/Presto-Users
  • 10. Presto in Production • Facebook: 10,000+ of nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users • Uber: 2,000+ nodes (clusters on prem.) with 160K+ queries daily over HDFS (Parquet/ORC) • Twitter: 2,000+ nodes (several clusters on premises and GCP), 20K+ queries daily (Parquet) • LinkedIn: 500+ nodes, 200K+ queries daily over HDFS (ORC), and ~1000 users • Lyft: 400+ nodes in AWS, 100K+ queries daily, 20+ PBs in S3 (Parquet) • Netflix: 300+ nodes in AWS, 100+ PB in S3 (Parquet) • Yahoo! Japan: 200+ nodes for HDFS (ORC), and ObjectStore • FINRA: 120+ nodes in AWS, 4PB in S3 (ORC), 200+ users 10© 2019
  • 11. Why Presto? Community-driven open source project. High performance ANSI SQL engine: • New Cost-Based Query Optimizer • Proven scalability • High concurrency. Separation of compute and storage: • Scale storage and compute independently • No ETL or data integration necessary to get to insights • SQL-on-anything. No vendor lock-in: • No Hadoop distro vendor lock-in • No storage engine vendor lock-in • No cloud vendor lock-in
  • 12. Built for Performance Query Execution Engine: • MPP-style pipelined in-memory execution • Vectorized data processing • Runtime query bytecode generation • Memory efficient data structures • Multi-threaded multi-core execution • Optimized readers for columnar formats (ORC and Parquet) • Predicate and column projection pushdown • Now also Cost-Based Optimizer
  • 13. Architecture (diagram) • Client • Coordinator: Parser/analyzer, Planner, Scheduler • Workers • Pluggable: Metadata API, Data location API, Data stream API
  • 15. Before we optimize ... Join in Presto • Hash Join • Right table is in memory ("build table") • Left table is streamed ("probe table") • Can be broadcast or repartitioned
  • 16. Before we optimize ... Join in Presto • Hash Join • Right table is in memory ("build table") • Left table is streamed ("probe table") • Can be broadcast or repartitioned • A join can be followed by another join, which can be followed by yet another join…
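The build/probe split described on this slide can be illustrated with a minimal single-process sketch. It is not Presto's implementation: the function name `hash_join` and the sample rows are made up for illustration, and distribution across workers (broadcast or repartitioned) is left out entirely.

```python
from collections import defaultdict


def hash_join(probe_rows, build_rows, probe_key, build_key):
    """Equi-join two iterables of dicts (illustrative helper, not a Presto API)."""
    # Build phase: the right ("build") side is loaded fully into an in-memory hash table.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: the left ("probe") side is streamed and looked up row by row.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**row, **match}


# Hypothetical sample data: orders is the probe side, customers the build side.
orders = [
    {"orderkey": 1, "custkey": 10},
    {"orderkey": 2, "custkey": 20},
    {"orderkey": 3, "custkey": 10},
]
customers = [
    {"custkey": 10, "name": "alice"},
    {"custkey": 20, "name": "bob"},
]
joined = list(hash_join(orders, customers, "custkey", "custkey"))
```

Only the build side has to fit in memory, which is why it matters which table ends up on the right.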
  • 17. CBO in a nutshell Cost-Based Optimizer v1 includes: • join reordering based on selectivity estimates and cost • automatic join type selection (repartitioned vs broadcast) • automatic left/right side selection for joined tables • support for statistics stored in Hive Metastore https://www.starburstdata.com/technical-blog/
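To make the join type and side selection concrete, here is a rough sketch that scores the four combinations (broadcast or repartitioned, left or right table as the build side) and picks the cheapest. The cost formula is an assumption chosen for illustration, with network bytes and build-side memory weighted equally; it is not Presto's actual cost model, and `plan_join` is a hypothetical name.

```python
def plan_join(left_bytes, right_bytes, workers):
    """Return (cost, distribution, build_side) for the cheapest option.

    Assumed toy cost = bytes sent over the network + bytes held in memory.
    """
    candidates = []
    sides = (("right", right_bytes, left_bytes), ("left", left_bytes, right_bytes))
    for build_side, build, probe in sides:
        # Broadcast: every worker receives and holds a full copy of the build side.
        candidates.append((build * workers + build * workers, "broadcast", build_side))
        # Repartitioned: both sides are shuffled once; the build side is held once overall.
        candidates.append(((probe + build) + build, "repartitioned", build_side))
    return min(candidates)


# Hypothetical sizes: a tiny right table is broadcast, two large tables are repartitioned.
small_build = plan_join(left_bytes=1_000_000, right_bytes=1_000, workers=10)
two_large = plan_join(left_bytes=1_000_000, right_bytes=800_000, workers=10)
```

Under these assumptions a small table is broadcast, while two large tables are repartitioned with the smaller one as the build side.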
  • 18. Cost and Statistics Cost calculation includes: • CPU • Memory usage • Network I/O Hive Metastore statistics: • number of rows in a table • number of distinct values in a column • fraction of NULL values in a column • minimum/maximum value in a column • average data size for a column
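The statistics listed here are the inputs to cardinality estimation. The sketch below applies common textbook estimates (uniform values, join keys of the smaller domain contained in the larger) to row counts, distinct-value counts, and NULL fractions. It is an assumption about how such statistics are typically used, not a description of Presto's exact formulas, and all names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    distinct_values: int        # number of distinct values (NDV) in the column
    null_fraction: float = 0.0  # fraction of NULL values in the column


def estimate_equality_filter_rows(row_count, col):
    """Rows surviving `col = constant`, assuming uniformly distributed values."""
    return row_count * (1.0 - col.null_fraction) / col.distinct_values


def estimate_join_rows(left_rows, left_col, right_rows, right_col):
    """Textbook equi-join estimate: non-NULL rows multiplied, divided by the larger NDV."""
    left_non_null = left_rows * (1.0 - left_col.null_fraction)
    right_non_null = right_rows * (1.0 - right_col.null_fraction)
    return left_non_null * right_non_null / max(
        left_col.distinct_values, right_col.distinct_values
    )


# Hypothetical statistics: 1,000 customers, 10,000 orders, both with 1,000 distinct custkeys.
join_rows = estimate_join_rows(1_000, ColumnStats(1_000), 10_000, ColumnStats(1_000))
filtered_rows = estimate_equality_filter_rows(1_000, ColumnStats(10, null_fraction=0.5))
```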
  • 19. Join left/right side decision
  • 21. Join reordering "Which customers are spending the most at our shop?" SELECT c.custkey, sum(l.price) FROM customer c JOIN orders o ON c.custkey = o.custkey JOIN lineitem l ON o.orderkey = l.orderkey GROUP BY c.custkey ORDER BY sum(l.price) DESC;
  • 23. Join reordering with filter "Which customers are spending the most on coffee?" SELECT c.custkey, sum(l.price) FROM customer c JOIN orders o ON c.custkey = o.custkey JOIN lineitem l ON o.orderkey = l.orderkey WHERE l.item = 'coffee' GROUP BY c.custkey ORDER BY sum(l.price) DESC;
  • 24. Join reordering with filter
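Why a filter can change the preferred join order is easiest to see with numbers. The sketch below enumerates left-deep join orders for customer, orders, and lineitem, skips orders that would need a cross join, and scores each by the total number of intermediate rows. The row counts, distinct-value counts, the 1% selectivity assumed for the coffee filter, and the scoring rule are all assumptions for illustration. Presto's optimizer uses its own enumeration strategy and cost model.

```python
from itertools import permutations


def best_join_order(rows, join_ndv):
    """rows: table -> estimated row count.
    join_ndv: frozenset({a, b}) -> distinct values of the join key between a and b.
    Returns (total_intermediate_rows, order) for the cheapest left-deep order."""
    best = None
    for order in permutations(rows):
        joined, size, cost, valid = {order[0]}, rows[order[0]], 0.0, True
        for table in order[1:]:
            # Join conditions that connect `table` to the tables joined so far.
            ndvs = [n for pair, n in join_ndv.items()
                    if table in pair and pair - {table} <= joined]
            if not ndvs:  # no join condition available: would be a cross join
                valid = False
                break
            size = size * rows[table]
            for ndv in ndvs:
                size = size / ndv
            cost += size
            joined.add(table)
        if valid and (best is None or cost < best[0]):
            best = (cost, order)
    return best


# Hypothetical statistics for the two queries shown on the slides.
join_ndv = {
    frozenset({"customer", "orders"}): 1_000,   # distinct custkey values
    frozenset({"orders", "lineitem"}): 10_000,  # distinct orderkey values
}
unfiltered = best_join_order(
    {"customer": 1_000, "orders": 10_000, "lineitem": 60_000}, join_ndv)
# Assume the filter l.item = 'coffee' keeps 1% of lineitem.
filtered = best_join_order(
    {"customer": 1_000, "orders": 10_000, "lineitem": 600}, join_ndv)
```

With these numbers the unfiltered query joins lineitem last, while the filtered query joins the now small lineitem with orders first and customer last.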
  • 25. Presto CBO Speedup (chart): duration of TPC-DS queries, lower is better. https://www.starburstdata.com/presto-benchmarks/
  • 26. Cloud cost reduction ● on average 7x improvement vs EMR Presto (up to 18x faster!) ● EMR Presto cannot execute many TPC-DS queries (all pass on Starburst Presto) ● on average 4x improvement on simpler queries benchmark (TPC-H, 3rd party benchmark) https://www.starburstdata.com/presto-aws/ https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-emr-sql/
  • 28. Enhancements to date • Deciding on semi-join distribution type based on cost • "… WHERE x IN (SELECT y FROM ...)" queries • Capping the size of a broadcast table • Various minor fixes in cardinality estimation • Peak memory estimation (more meaningful memory cost metric)
  • 29. Collecting statistics • ANALYZE table • native in Presto • Hive runtime no longer needed to collect stats • Collecting table statistics on write • low overhead, the data is being processed in memory • Stats for AWS Glue Catalog • Glue does not provide statistics support out of the box • exclusive from Starburst
  • 30. Statistics in other connectors • CBO initially supported with Hive connector only • We added stats for RDBMS connectors (Starburst) • PostgreSQL • MySQL • SQL Server • Teradata • Oracle • Enables CBO in federation use-cases
  • 32. What’s next • Stats support • Improved stats for Hive and Glue, e.g. NDV estimates across partitions • Stats for NoSQL connectors • Core CBO and engine enhancements • Involve connectors in optimizations ("*-pushdown") • Peak memory-based decisions • Adjust cost model weights based on the hardware • Cost and stats for more operation types • Introduce Traits: consider alternative plans during optimizations • Late materialization • Adaptive optimizations
  • 34. Thank You! Twitter: @starburstdata @prestosql Blog: www.starburstdata.com/technical-blog/ Newsletter: www.starburstdata.com/newsletter
  • 35. Rate today’s session Session page on conference website O’Reilly Events App