SlideShare une entreprise Scribd logo
1  sur  35
Télécharger pour lire hors ligne
Tuning Performance for SQL-on-Anything Analytics
Martin Traverso, Co-creator of Presto
Kamil Bajda-Pawlikowski, CTO Starburst
@prestosql @starburstdata
Strata Data 2019
San Francisco, CA
Presto: SQL-on-Anything
Deploy Anywhere, Query Anything
Why Presto?
Community-driven
open source project
High performance ANSI SQL engine
• New Cost-Based Query Optimizer
• Proven scalability
• High concurrency
Separation of compute
and storage
• Scale storage and compute
independently
• No ETL or data integration
necessary to get to insights
• SQL-on-anything
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
Project History
©2017 Starburst Data, Inc. All Rights Reserved
Community
See more at our Wiki
Presto in Production
Facebook: 10,000+ of nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users
Uber: 2,000+ nodes (several clusters on premises) with 160K+ queries daily over HDFS (Parquet/ORC)
Twitter: 2,000+ nodes (several clusters on premises and GCP), 20K+ queries daily (Parquet)
LinkedIn: 500+ nodes, 200K+ queries daily over HDFS (ORC), and ~1000 users
Lyft: ------ redacted due to the quiet period for the IPO -----------
Netflix: 300+ nodes in AWS, 100+ PB in S3 (Parquet)
Yahoo! Japan: 200+ nodes for HDFS (ORC), and ObjectStore
FINRA: 120+ nodes in AWS, 4PB in S3 (ORC), 200+ users
Starburst Data
© 2019 7
Founded by Presto committers:
● Over 4 years of contributions to Presto
● Presto distro for on-prem and cloud env
● Supporting large customers in production
● Enterprise subscription add-ons (ODBC,
Ranger, Sentry, Oracle, Teradata)
Notable features contributed:
● ANSI SQL syntax enhancements
● Execution engine improvements
● Security integrations
● Spill to disk
● Cost-Based Optimizer
https://www.starburstdata.com/presto-enterprise/
Performance
Built for Performance
Query Execution Engine:
● MPP-style pipelined in-memory execution
● Columnar and vectorized data processing
● Runtime query bytecode compilation
● Memory efficient data structures
● Multi-threaded multi-core execution
● Optimized readers for columnar formats (ORC and Parquet)
● Predicate and column projection pushdown
● Now also Cost-Based Optimizer
CBO in a nutshell
Presto Cost-Based Optimizer includes:
● support for statistics stored in Hive Metastore
● join reordering based on selectivity estimates and cost
● automatic join type selection (repartitioned vs broadcast)
● automatic left/right side selection for joined tables
https://www.starburstdata.com/technical-blog/
Statistics & Cost
Hive Metastore statistics:
● number of rows in a table
● number of distinct values in a column
● fraction of NULL values in a column
● minimum/maximum value in a column
● average data size for a column
Cost calculation includes:
● CPU
● Memory
● Network I/O
Join type selection
Join left/right side decision
Join reordering with filter
Join tree shapes
CBO off
CBO on
https://www.starburstdata.com/presto-benchmarks/
Benchmark results
Benchmark results
● on average 7x improvement vs EMR Presto
● EMR Presto cannot execute many TPC-DS queries
● All TPC-DS queries pass on Starburst Presto
https://www.starburstdata.com/presto-aws/
Recent CBO enhancements
● Deciding on semi-join distribution type based on cost
● Support for outer joins
● Capping a broadcasted table size
● Various minor fixes in cardinality estimation
● ANALYZE table (native in Presto)
● Stats for AWS Glue Catalog (exclusive from Starburst)
Current and Future work
What’s next for Optimizer
● Stats support
○ Improved stats for Hive
○ Stats for DBMS connectors and NoSQL connectors
○ Tolerate missing / incomplete stats
● Core CBO enhancements
○ Cost more operators
○ Adjust cost model weights based on the hardware
○ Adaptive optimizations
○ Introduce Traits
● Involve connectors in optimizations
Involving Connectors in Optimization
History and Current State
● Original motivation: partition pruning for queries over Hive tables
● Simple range predicates and nullability checks passed to connectors.
Modeled as TupleDomain
((col0 BETWEEN ? AND ?) OR (col0 BETWEEN ? and ?) OR …))
AND
((col1 BETWEEN ? AND ?) OR (col1 BETWEEN ? and ?) OR …))
AND
...
History and Current State
● Partial evaluation of non-trivial expressions
○ Bind only known variables
○ Result in "true/false/null" or "can't tell”. E.g.,
f(a, b) := lower(a) LIKE ‘john%’ AND b = 1
f(‘Mary’, ?) → false → can prune
f(‘John S’, ?) → b = 1 → ¯_(ツ)_/¯
Beyond Simple Filter Pushdown...
● Dereference expressions. E.g., x.a > 5
● Array/map subscript. E.g., a[‘key’] = 10
● Complex filters and projections
● Aggregations
● Joins
● Limit: https://github.com/prestosql/presto/pull/421
● Sampling
● Others…
https://github.com/prestosql/presto/issues/18
A
B
C
D E
F G
A
C’
B’
D E
F G
B
C
? ?
C’
B’?
?
Pattern Result
Rule 1
A
C’
E’
D B’’
F G
B
E
?
E’
B’
?
Pattern Result
A
C’
B’
D E
F G
Rule 2
Filter
x.f > 5 AND y LIKE ‘a%b’
Scan(t_0)
Filter
z > 5
Scan(t_1)
FilterIntoScan
Rule
Connector.applyFilter(...)
Table: t_0
Filter: x.f > 5 AND y LIKE ‘a%b’
Table t
x :: row(f bigint, g bigint)
y :: varchar(10)
Derived Table: t_1
Filter: z > 5 [z :: bigint]
SELECT count(*)
FROM t
WHERE x.f > 5 AND y LIKE ‘a%b’
New Connector APIs
applyFilter(ConnectorTableHandle table, Expression filter)
applyLimit(ConnectorTableHandle table, long limit)
applyAggregation(ConnectorTableHandle table, List<Aggregation> aggregates)
applySampling(ConnectorTableHandle table, double samplingRate)
...
Performance Benefits (?)
● Better support for sophisticated backend systems
○ Druid, Pinot, ElasticSearch
○ SQL databases
● Improved performance for columnar data formats (Parquet, ORC)
ORC Performance Improvements
https://github.com/prestosql/presto/pull/555
ORC Performance Improvements - TPC-DS
Project Roadmap
● Coordinator HA
● Kubernetes
● Dynamic filtering
● Connectors
○ Phoenix
○ Iceberg
○ Druid
● TIMESTAMP semantics
● And more… https://github.com/prestosql/presto/labels/roadmap
Getting Involved
● Join us on Slack
○ Invite link: https://prestosql.io/community.html
● Github: https://github.io/prestosql/presto
● Website: https://prestosql.io
Further reading
https://www.starburstdata.com/presto-newsletter/
https://fivetran.com/blog/warehouse-benchmark
https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-emr-sql/
http://bytes.schibsted.com/bigdata-sql-query-engine-benchmark/
https://virtuslab.com/blog/benchmarking-spark-sql-presto-hive-bi-processing-googles-cloud-d
ataproc/
Thank You!
@prestosql @starburstdata
www.starburstdata.comwww.prestosql.io

Contenu connexe

Plus de kbajda

Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedInkbajda
 
Presto Summit 2018 - 01 - Facebook Presto
Presto Summit 2018  - 01 - Facebook PrestoPresto Summit 2018  - 01 - Facebook Presto
Presto Summit 2018 - 01 - Facebook Prestokbajda
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBOkbajda
 
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CAkbajda
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016kbajda
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkkbajda
 

Plus de kbajda (6)

Presto Summit 2018 - 02 - LinkedIn
Presto Summit 2018  - 02 - LinkedInPresto Summit 2018  - 02 - LinkedIn
Presto Summit 2018 - 02 - LinkedIn
 
Presto Summit 2018 - 01 - Facebook Presto
Presto Summit 2018  - 01 - Facebook PrestoPresto Summit 2018  - 01 - Facebook Presto
Presto Summit 2018 - 01 - Facebook Presto
 
Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
 
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CAPresto: Distributed SQL on Anything -  Strata Hadoop 2017 San Jose, CA
Presto: Distributed SQL on Anything - Strata Hadoop 2017 San Jose, CA
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
Presto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talkPresto Strata Hadoop SJ 2016 short talk
Presto Strata Hadoop SJ 2016 short talk
 

Dernier

Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 

Dernier (20)

Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Presto talk @ Strata Data CA 2019

  • 1. Tuning Performance for SQL-on-Anything Analytics Martin Traverso, Co-creator of Presto Kamil Bajda-Pawlikowski, CTO Starburst @prestosql @starburstdata Strata Data 2019 San Francisco, CA
  • 3. Why Presto? Community-driven open source project High performance ANSI SQL engine • New Cost-Based Query Optimizer • Proven scalability • High concurrency Separation of compute and storage • Scale storage and compute independently • No ETL or data integration necessary to get to insights • SQL-on-anything No vendor lock-in • No Hadoop distro vendor lock-in • No storage engine vendor lock-in • No cloud vendor lock-in
  • 4. Project History ©2017 Starburst Data, Inc. All Rights Reserved
  • 6. Presto in Production Facebook: 10,000+ of nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users Uber: 2,000+ nodes (several clusters on premises) with 160K+ queries daily over HDFS (Parquet/ORC) Twitter: 2,000+ nodes (several clusters on premises and GCP), 20K+ queries daily (Parquet) LinkedIn: 500+ nodes, 200K+ queries daily over HDFS (ORC), and ~1000 users Lyft: ------ redacted due to the quiet period for the IPO ----------- Netflix: 300+ nodes in AWS, 100+ PB in S3 (Parquet) Yahoo! Japan: 200+ nodes for HDFS (ORC), and ObjectStore FINRA: 120+ nodes in AWS, 4PB in S3 (ORC), 200+ users
  • 7. Starburst Data © 2019 7 Founded by Presto committers: ● Over 4 years of contributions to Presto ● Presto distro for on-prem and cloud env ● Supporting large customers in production ● Enterprise subscription add-ons (ODBC, Ranger, Sentry, Oracle, Teradata) Notable features contributed: ● ANSI SQL syntax enhancements ● Execution engine improvements ● Security integrations ● Spill to disk ● Cost-Based Optimizer https://www.starburstdata.com/presto-enterprise/
  • 9. Built for Performance Query Execution Engine: ● MPP-style pipelined in-memory execution ● Columnar and vectorized data processing ● Runtime query bytecode compilation ● Memory efficient data structures ● Multi-threaded multi-core execution ● Optimized readers for columnar formats (ORC and Parquet) ● Predicate and column projection pushdown ● Now also Cost-Based Optimizer
  • 10. CBO in a nutshell Presto Cost-Based Optimizer includes: ● support for statistics stored in Hive Metastore ● join reordering based on selectivity estimates and cost ● automatic join type selection (repartitioned vs broadcast) ● automatic left/right side selection for joined tables https://www.starburstdata.com/technical-blog/
  • 11. Statistics & Cost Hive Metastore statistics: ● number of rows in a table ● number of distinct values in a column ● fraction of NULL values in a column ● minimum/maximum value in a column ● average data size for a column Cost calculation includes: ● CPU ● Memory ● Network I/O
  • 17. Benchmark results ● on average 7x improvement vs EMR Presto ● EMR Presto cannot execute many TPC-DS queries ● All TPC-DS queries pass on Starburst Presto https://www.starburstdata.com/presto-aws/
  • 18. Recent CBO enhancements ● Deciding on semi-join distribution type based on cost ● Support for outer joins ● Capping a broadcasted table size ● Various minor fixes in cardinality estimation ● ANALYZE table (native in Presto) ● Stats for AWS Glue Catalog (exclusive from Starburst)
  • 20. What’s next for Optimizer ● Stats support ○ Improved stats for Hive ○ Stats for DBMS connectors and NoSQL connectors ○ Tolerate missing / incomplete stats ● Core CBO enhancements ○ Cost more operators ○ Adjust cost model weights based on the hardware ○ Adaptive optimizations ○ Introduce Traits ● Involve connectors in optimizations
  • 21. Involving Connectors in Optimization
  • 22. History and Current State ● Original motivation: partition pruning for queries over Hive tables ● Simple range predicates and nullability checks passed to connectors. Modeled as TupleDomain ((col0 BETWEEN ? AND ?) OR (col0 BETWEEN ? and ?) OR …)) AND ((col1 BETWEEN ? AND ?) OR (col1 BETWEEN ? and ?) OR …)) AND ...
  • 23. History and Current State ● Partial evaluation of non-trivial expressions ○ Bind only known variables ○ Result in "true/false/null" or "can't tell”. E.g., f(a, b) := lower(a) LIKE ‘john%’ AND b = 1 f(‘Mary’, ?) → false → can prune f(‘John S’, ?) → b = 1 → ¯_(ツ)_/¯
  • 24. Beyond Simple Filter Pushdown... ● Dereference expressions. E.g., x.a > 5 ● Array/map subscript. E.g., a[‘key’] = 10 ● Complex filters and projections ● Aggregations ● Joins ● Limit: https://github.com/prestosql/presto/pull/421 ● Sampling ● Others… https://github.com/prestosql/presto/issues/18
  • 25. A B C D E F G A C’ B’ D E F G B C ? ? C’ B’? ? Pattern Result Rule 1
  • 26. A C’ E’ D B’’ F G B E ? E’ B’ ? Pattern Result A C’ B’ D E F G Rule 2
  • 27. Filter x.f > 5 AND y LIKE ‘a%b’ Scan(t_0) Filter z > 5 Scan(t_1) FilterIntoScan Rule Connector.applyFilter(...) Table: t_0 Filter: x.f > 5 AND y LIKE ‘a%b’ Table t x :: row(f bigint, g bigint) y :: varchar(10) Derived Table: t_1 Filter: z > 5 [z :: bigint] SELECT count(*) FROM t WHERE x.f > 5 AND y LIKE ‘a%b’
  • 28. New Connector APIs applyFilter(ConnectorTableHandle table, Expression filter) applyLimit(ConnectorTableHandle table, long limit) applyAggregation(ConnectorTableHandle table, List<Aggregation> aggregates) applySampling(ConnectorTableHandle table, double samplingRate) ...
  • 29. Performance Benefits (?) ● Better support for sophisticated backend systems ○ Druid, Pinot, ElasticSearch ○ SQL databases ● Improved performance for columnar data formats (Parquet, ORC)
  • 32. Project Roadmap ● Coordinator HA ● Kubernetes ● Dynamic filtering ● Connectors ○ Phoenix ○ Iceberg ○ Druid ● TIMESTAMP semantics ● And more… https://github.com/prestosql/presto/labels/roadmap
  • 33. Getting Involved ● Join us on Slack ○ Invite link: https://prestosql.io/community.html ● Github: https://github.io/prestosql/presto ● Website: https://prestosql.io